WO2024103383A1 - Audio processing method, apparatus, device, storage medium and program product - Google Patents

Audio processing method, apparatus, device, storage medium and program product

Info

Publication number
WO2024103383A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
user
features
audio file
feature
Prior art date
Application number
PCT/CN2022/132820
Other languages
English (en)
French (fr)
Inventor
孙洪文
陈传艺
吴东海
劳振锋
关迪聆
Original Assignee
广州酷狗计算机科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州酷狗计算机科技有限公司 filed Critical 广州酷狗计算机科技有限公司
Priority to PCT/CN2022/132820 priority Critical patent/WO2024103383A1/zh
Priority to CN202280004371.XA priority patent/CN116034423A/zh
Publication of WO2024103383A1 publication Critical patent/WO2024103383A1/zh

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00: Details of electrophonic musical instruments
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers

Definitions

  • the embodiments of the present application relate to the field of audio technology, and in particular to an audio processing method, apparatus, device, storage medium and program product.
  • a user can record, tune, and play back audio that they have produced themselves through an audio production application.
  • the embodiments of the present application provide an audio processing method, apparatus, device, storage medium and program product, which can enhance the richness of audio content.
  • the technical solution is as follows:
  • an audio processing method comprising:
  • a second audio file generated based on the first audio file by the acoustic model of the first user is displayed; wherein the acoustic model of the first user is a model that has learned the acoustic features of the first user, and the second audio file has the timbre of the first user.
  • an audio processing device comprising:
  • an information display module, used to display relevant information of the first audio file;
  • a file display module, used to display, in response to a timbre production instruction for the first audio file, a second audio file generated from the first audio file by the acoustic model of the first user; wherein the acoustic model of the first user is a model that has learned the acoustic features of the first user, and the second audio file has the timbre of the first user.
  • a computer device comprising a processor and a memory, wherein a computer program is stored in the memory, and the computer program is loaded and executed by the processor to implement the above-mentioned audio processing method.
  • a computer-readable storage medium in which a computer program is stored.
  • the computer program is loaded and executed by a processor to implement the above-mentioned audio processing method.
  • a computer program product is provided, and the computer program product is loaded and executed by a processor to implement the above audio processing method.
  • by extracting the audio features of the first audio file, the user's acoustic features are fused with the first audio file to generate a second audio file with the user's timbre, which realizes the function of modifying the timbre of audio and thus improves the richness of the audio content.
  • FIG. 1 is a flow chart of an audio processing method provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a phoneme provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an acoustic model provided by an embodiment of the present application.
  • FIG. 4 is a block diagram of an audio processing device provided by an embodiment of the present application.
  • FIG. 5 is a block diagram of an audio processing device provided by another embodiment of the present application.
  • FIG. 6 is a block diagram of a computer device provided by an embodiment of the present application.
  • the execution subject of each step may be a computer device, which refers to an electronic device with data calculation, processing and storage capabilities.
  • the computer device may be a terminal such as a PC (Personal Computer), a tablet computer, a smart phone, a wearable device, an intelligent robot, etc.; it may also be a server.
  • the server may be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server that provides cloud computing services.
  • FIG. 1 shows a flow chart of an audio processing method provided by an embodiment of the present application.
  • the method is mainly applied to the computer device introduced above as an example.
  • the method may include the following steps (110-130):
  • Step 110: Obtain a first audio file.
  • the first audio file may be a song, dubbing, poetry recitation, audio book, radio drama, or other type of audio.
  • one or more first audio files are obtained, that is, the timbre production can be performed on a single audio file, or the timbre production can be performed on multiple audio files at the same time, thereby improving the efficiency of the timbre production.
  • the first audio file may be an audio file obtained through wired or wireless transmission (such as a network connection).
  • the method is applied to a target application of a terminal device (such as a client of a target application).
  • the target application may be an audio application, such as a music production application, an audio playback application, an audio live broadcast application, a karaoke application, etc., which is not specifically limited in the embodiments of the present application.
  • the target application may also be any application with an audio processing function, such as a social application, a payment application, a video application, a shopping application, a news application, a game application, etc.
  • the first audio file may be an audio file recorded and/or produced by the client of the target application.
  • Step 120: Extract audio features of the first audio file.
  • the first audio file includes voice content uttered by any user, and audio features of the voice content uttered by the user are extracted from the first audio file.
  • the audio features include at least one of the following:
  • Phoneme features used to represent phoneme information of audio content in the first audio file
  • the pitch feature is used to represent the pitch information of the audio content in the first audio file.
  • a phoneme is the smallest unit of speech divided according to the natural properties of speech, and is the smallest linear unit of speech divided from the perspective of sound quality.
  • A phoneme is a concrete physical phenomenon; analyzed according to the articulatory actions within a syllable, one action constitutes one phoneme.
  • phonemes are divided into two categories: vowels and consonants. For example, the Chinese syllable 啊 (ā) has only one phoneme, 爱 (ài) has two phonemes, and 代 (dài) has three phonemes.
  • the phoneme information includes the phonemes contained in the audio content in the first audio file, and the pronunciation duration of each phoneme, and these features together constitute the phoneme features.
  • For example, some people enunciate rather fully, so at a normal speaking rate the phonemes corresponding to vowels have relatively long pronunciation durations; as another example, some people speak quickly and pronounce briefly, so the duration of each phoneme is relatively short; as yet another example, affected by physiological factors or their living environment, some people find it difficult to pronounce certain phonemes (such as "h", "n", etc.).
  • each phoneme can be represented by a phoneme block, and the length of the phoneme block is used to represent the pronunciation duration of the corresponding phoneme; for example, the length a1 of the phoneme block 21 is used to represent the pronunciation duration of the phoneme a.
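  • To make the phoneme-block representation of FIG. 2 concrete, the sketch below models phoneme features as (phoneme, duration) pairs; the class name and frame counts are illustrative assumptions, not part of the embodiment.

```python
# Minimal sketch (assumption): phoneme features as (phoneme, duration) pairs,
# where the duration counts audio frames, mirroring the phoneme blocks of FIG. 2.
from dataclasses import dataclass

@dataclass
class PhonemeBlock:
    phoneme: str      # e.g. "d", "a", "i"
    n_frames: int     # pronunciation duration, in audio frames

# The syllable 代 (dài) contains three phonemes; the frame counts are made up.
phoneme_features = [
    PhonemeBlock("d", 12),
    PhonemeBlock("a", 35),   # vowels tend to be held longer
    PhonemeBlock("i", 20),
]

total_frames = sum(block.n_frames for block in phoneme_features)
print(f"{len(phoneme_features)} phonemes spanning {total_frames} frames")
```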
  • pitch refers to how high or low a sound is, and is determined by the frequency and wavelength of the sound wave: the higher the frequency and the shorter the wavelength, the higher the pitch; conversely, the lower the frequency and the longer the wavelength, the lower the pitch.
  • the audio features may also include energy features, breath features, tension features, etc. of the audio content in the first audio file, which are not limited in this application.
  • the energy feature can be used to indicate the volume/loudness of the audio content in the first audio file; breathy voice refers to a manner of phonation in which the vocal cords do not vibrate or hardly vibrate, and the breath feature can indicate the regularity or rhythm with which the user uses breathy phonation; the tension feature describes how the audio content in the first audio file varies between low and high pitch and between soft and loud sounds.
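  • As one possible (not prescribed) way to obtain frame-level pitch and energy features from the first audio file, the sketch below uses librosa; the file name and frame parameters are assumptions, and phoneme features would in practice come from a forced aligner or a phoneme annotation rather than from this snippet.

```python
# Hedged sketch: extracting pitch (F0) and energy features with librosa.
# The hop length, F0 range and file name are assumptions for illustration.
import librosa
import numpy as np

y, sr = librosa.load("first_audio.wav", sr=22050)   # hypothetical input file

# Pitch feature: frame-wise fundamental frequency via the pYIN algorithm.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
    sr=sr, hop_length=256)
f0 = np.nan_to_num(f0)          # unvoiced frames -> 0 Hz

# Energy feature: frame-wise RMS energy (a proxy for volume/loudness).
rms = librosa.feature.rms(y=y, hop_length=256)[0]

print(f0.shape, rms.shape)      # one value per audio frame
```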
  • Step 130: Process the audio features through the acoustic model of the first user to generate a second audio file; wherein the acoustic model of the first user is a model that has learned the acoustic features of the first user, and the second audio file has the timbre of the first user.
  • the acoustic feature of the first user includes a timbre feature of the first user.
  • Timbre refers to the characteristic quality that distinguishes different sounds, which is physically manifested in the waveform of the sound wave; timbre can therefore also be called a voiceprint feature. Different people's voices have different timbres.
  • the model that has learned the acoustic features of the first user is used to process the audio features of the first audio file to generate a second audio file. That is, the timbre of the first user is fused with the audio features (such as phoneme features, pitch features, etc.) of the first audio file to generate a second audio file that has the timbre of the first user and the phoneme and pitch features of the first audio file.
  • step 130 also includes: processing the audio features through the acoustic model of the first user to generate a mel spectrogram; generating a second audio file based on the mel spectrogram.
  • human perception of sound frequency is not linear, and people are more sensitive to low-frequency signals than to high-frequency signals. For example, people can easily perceive the difference between 500 and 1000 Hz (Hertz), but find it difficult to notice the difference between 7500 and 8000 Hz.
  • the Mel Scale, proposed for this situation, is a nonlinear transformation of sound frequency; for signals (such as sound signals) expressed in units of the Mel scale, it approximates the human ear's linear perception of changes in the sound signal.
  • the Mel spectrum may also be replaced by other feasible spectrums, which is not specifically limited in the embodiments of the present application.
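  • To make this concrete, one commonly used Hz-to-Mel mapping is m = 2595 * log10(1 + f/700); the hedged sketch below uses librosa to convert frequencies, compute a Mel spectrogram, and invert it back to a waveform with Griffin-Lim as a stand-in for the unspecified "Mel spectrum to audio file" step. All file names and parameter values are illustrative assumptions.

```python
# Hedged sketch: Mel-scale conversion and Mel-spectrogram extraction/inversion.
# n_mels, hop_length and the Griffin-Lim inversion are illustrative choices;
# a production system would more likely use a neural vocoder for "Mel -> audio".
import librosa
import soundfile as sf

y, sr = librosa.load("first_audio.wav", sr=22050)          # hypothetical input

# HTK-style formula maps 1000 Hz to roughly 1000 Mel.
print(librosa.hz_to_mel(1000.0, htk=True))

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)

# Rough inverse: Mel spectrogram back to a waveform (Griffin-Lim approximation).
y_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                             hop_length=256)
sf.write("second_audio.wav", y_rec, sr)                    # hypothetical output
```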
  • as shown in FIG. 3, the acoustic model 30 includes an encoder 31 and a decoder 32; processing the audio features through the acoustic model of the first user to generate a Mel spectrum includes the following steps: 1. the phoneme features in the audio features are processed by the encoder 31 to obtain encoded phoneme features, where the phoneme features are used to represent the phoneme information of the audio content in the first audio file; 2. the encoded phoneme features are fused with the pitch features in the audio features to obtain fused features; 3. the fused features are processed by the decoder 32 to obtain a Mel spectrum.
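  • The following is a minimal PyTorch sketch of this encoder/decoder pipeline (encode phoneme features, fuse with pitch, decode to a Mel spectrogram). The patent does not disclose a concrete network architecture, so the embedding/GRU/linear layers and all sizes below are assumptions for illustration only.

```python
# Minimal sketch (assumed architecture): encoder -> fuse with pitch -> decoder -> Mel.
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, n_phonemes=100, d_model=256, n_mels=80):
        super().__init__()
        # Encoder 31: turns phoneme IDs into encoded phoneme features
        # (the "intermediate layer variables").
        self.phoneme_embedding = nn.Embedding(n_phonemes, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        # Decoder 32: turns fused features into a Mel spectrogram.
        self.decoder = nn.GRU(d_model + 1, d_model, batch_first=True)
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, phoneme_ids, pitch):
        # phoneme_ids: (batch, frames) int64, one phoneme ID per audio frame
        # pitch:       (batch, frames) float, F0 per audio frame
        encoded, _ = self.encoder(self.phoneme_embedding(phoneme_ids))
        # Fuse the encoded phoneme features with the pitch feature.
        fused = torch.cat([encoded, pitch.unsqueeze(-1)], dim=-1)
        decoded, _ = self.decoder(fused)
        return self.mel_head(decoded)        # (batch, frames, n_mels)

model = AcousticModel()
mel = model(torch.randint(0, 100, (2, 500)), torch.rand(2, 500))
print(mel.shape)   # torch.Size([2, 500, 80])
```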
  • the encoder 31 obtains the phoneme features in the audio features, encodes the phoneme features, and obtains the encoded phoneme features 33 (also referred to as intermediate layer variables).
  • optionally, since the pronunciation durations of phonemes are not exactly the same, when the phoneme features are encoded, a length adjuster adjusts the encoded lengths of the different phoneme features so that the encoded phoneme features have the same length.
  • For example, if the lengths of the phoneme features obtained after preliminary encoding are not yet uniform, the length of the longest preliminarily encoded phoneme feature is taken as the standard length, and the other preliminarily encoded phoneme features are padded up to the standard length, for example by filling the missing part with "0", so that the lengths of all phoneme features are unified and length-unified encoded phoneme features are obtained.
  • As another example, a standard length is preset, and each phoneme feature is padded for the part by which it falls short of the standard length, so that the lengths of all encoded phoneme features are unified to the standard length.
  • the standard length can be set by relevant technical personnel according to actual conditions, and the embodiments of the present application do not specifically limit this.
  • the standard length is not shorter than the length of the longest phoneme feature after preliminary encoding processing.
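  • As a concrete illustration of the length unification described above, the snippet below zero-pads a batch of preliminarily encoded phoneme features to a shared standard length; taking the longest feature as the standard length is only one of the two options mentioned, and the tensor shapes are assumed.

```python
# Hedged sketch: pad encoded phoneme features with zeros to a shared standard length.
import torch

encoded_phonemes = [torch.randn(12, 256), torch.randn(35, 256), torch.randn(20, 256)]

# Option 1 in the text: use the longest encoded feature as the standard length.
standard_len = max(feat.shape[0] for feat in encoded_phonemes)

padded = torch.stack([
    torch.nn.functional.pad(feat, (0, 0, 0, standard_len - feat.shape[0]))
    for feat in encoded_phonemes
])
print(padded.shape)   # torch.Size([3, 35, 256]) -- all features now share one length
```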
  • in some embodiments, after the encoded phoneme features are fused with the pitch features to obtain the fused features, the method further includes: extracting slice features of a set length from the fused features; wherein the slice features are used as the input of the decoder 32 to obtain the Mel spectrum. That is, the fused features are not fed to the decoder 32 in their entirety; instead, a continuous feature segment of a set length is extracted from them, the segment is sliced into a plurality of slice features, and the plurality of slice features are input into the decoder 32 to obtain the Mel spectrum.
  • the audio is composed of a plurality of audio frames (i.e., a plurality of audio segments).
  • the length (i.e., time length) of each audio frame is equal, and the length of an audio frame can be considered to be 1, then the length of 100 continuous audio frames is 100.
  • the length of each slice feature is the same (i.e., the number of audio frames contained in each slice feature is the same).
  • the length of the fused feature is 3000, and a plurality of continuous slice features are extracted from the fused feature and input into the decoder 32, and the length of each slice feature is 500.
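  • The slicing step can be illustrated as follows; the lengths 3000 and 500 are taken from the example above, while the non-overlapping split and the feature width are assumptions.

```python
# Hedged sketch: cut the fused features into fixed-length slice features for the decoder.
import torch

fused = torch.randn(3000, 257)         # fused features, 3000 audio frames long
slice_len = 500                        # set length of each slice feature

slices = [fused[i:i + slice_len] for i in range(0, fused.shape[0], slice_len)]
print(len(slices), slices[0].shape)    # 6 slices of shape (500, 257)
```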
  • the voiceprint feature of the first user is obtained; the fused feature and the voiceprint feature of the first user are processed by a decoder to obtain a Mel spectrum.
  • the audio features of the audio content of the first audio file are fused with the voiceprint feature of the first user to obtain a second audio file having the voiceprint feature of the first user, the phoneme feature of the first audio file, and the pitch feature.
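  • One plausible way to let the decoder process the fused features together with the first user's voiceprint feature is to broadcast a fixed-size speaker embedding across all frames and concatenate it, as sketched below; this particular conditioning scheme and the embedding size are assumptions, since the embodiment only states that the decoder processes both.

```python
# Hedged sketch: condition the decoder on the first user's voiceprint feature.
import torch

fused = torch.randn(1, 500, 257)            # fused phoneme + pitch features
voiceprint = torch.randn(1, 192)            # speaker/voiceprint embedding (assumed size)

# Repeat the voiceprint for every frame and concatenate along the feature axis.
voiceprint_per_frame = voiceprint.unsqueeze(1).expand(-1, fused.shape[1], -1)
decoder_input = torch.cat([fused, voiceprint_per_frame], dim=-1)
print(decoder_input.shape)                  # torch.Size([1, 500, 449])
```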
  • For a singing scenario, a song (i.e., the second audio file) can be obtained that sounds as if the first user sang it in the singing style of the singer in the first audio file, thereby improving the content richness of the processed audio file.
  • In summary, in the technical solutions provided by the embodiments of the present application, the user's acoustic features are fused with the first audio file by means of the relevant information of the first audio file, the timbre production instruction and the user's acoustic model, so as to generate a second audio file with the user's timbre; this realizes the function of modifying the timbre of audio and thus enhances the richness of the audio content.
  • the method further includes:
  • the first user obtains an audio file of the first user by singing a song, reciting a poem, dubbing, etc. Based on the audio file of the first user, the pre-trained acoustic model is adjusted to obtain the acoustic model of the first user.
  • in some embodiments, adjusting the pre-trained acoustic model using the audio file of the first user to obtain the acoustic model of the first user includes: (1) extracting the audio features, voiceprint features and standard Mel spectrum corresponding to the audio file of the first user; (2) generating a predicted Mel spectrum through the pre-trained acoustic model according to the audio features and voiceprint features corresponding to the audio file of the first user; (3) adjusting the parameters of the pre-trained acoustic model according to the predicted Mel spectrum and the standard Mel spectrum to obtain the acoustic model of the first user.
  • the pre-trained acoustic model is fine-tuned using the audio file of the first user.
  • the audio features and voiceprint features extracted from the audio file of the first user are input into the pre-trained acoustic model, and the pre-trained acoustic model outputs the corresponding predicted Mel spectrum; the loss is calculated based on the predicted Mel spectrum and the standard Mel spectrum, and the parameters of the pre-trained acoustic model are adjusted according to the loss so that the loss function keeps decreasing, until the fine-tuning of the pre-trained acoustic model is complete and the acoustic model of the first user is obtained.
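  • A minimal fine-tuning loop consistent with this description might look as follows. It reuses the AcousticModel sketch from above (which, for brevity, omits the voiceprint input), and the optimizer, learning rate, L1 loss, step count and random stand-in tensors are all assumptions.

```python
# Hedged sketch: fine-tune the pre-trained acoustic model on the first user's recordings.
import torch

model = AcousticModel()                      # pre-trained model from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.L1Loss()                  # assumed loss between predicted and standard Mel

# Stand-in tensors for features extracted from the first user's audio file.
phoneme_ids = torch.randint(0, 100, (4, 500))
pitch = torch.rand(4, 500)
standard_mel = torch.randn(4, 500, 80)       # ground-truth ("standard") Mel spectrogram

for step in range(200):                      # small-sample fine-tuning, step count assumed
    predicted_mel = model(phoneme_ids, pitch)
    loss = loss_fn(predicted_mel, standard_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                         # drives the loss function downward
```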
  • In this way, the audio features of an audio file can be processed so that the voiceprint/timbre of the speech uttered by the person in the audio file (such as a sung song, recited content, dubbed content, etc.) is changed to the voiceprint/timbre of the first user, thereby realizing modification and replacement of the timbre.
  • in some embodiments, the audio features and voiceprint features corresponding to the first user's audio file are preloaded into GPU (Graphics Processing Unit) video memory, so that no extra time needs to be spent fetching them from elsewhere during training, which improves data loading speed and saves model training time.
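  • The preloading described here can be as simple as moving the pre-extracted feature tensors to GPU video memory once, before the fine-tuning loop starts; the sketch below assumes PyTorch and illustrative tensor shapes.

```python
# Hedged sketch: preload the first user's extracted features into GPU memory once,
# so the fine-tuning loop never waits on host-to-device transfers.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

phoneme_ids = torch.randint(0, 100, (4, 500)).to(device)
pitch = torch.rand(4, 500).to(device)
voiceprint = torch.randn(4, 192).to(device)       # assumed voiceprint embedding size
standard_mel = torch.randn(4, 500, 80).to(device)

# Subsequent training steps read these resident tensors directly from video memory.
```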
  • the method further includes: obtaining a sample audio file; training the initial acoustic model using the sample audio file to obtain a pre-trained acoustic model.
  • the audio features, voiceprint features and standard Mel spectrum corresponding to the sample audio file are extracted; the predicted Mel spectrum corresponding to the sample audio file is generated by the initial acoustic model according to the audio features and voiceprint features corresponding to the sample audio file; and the parameters of the initial acoustic model are adjusted according to the predicted Mel spectrum corresponding to the sample audio file and the standard Mel spectrum corresponding to the sample audio file to obtain a pre-trained acoustic model.
  • the process of training the initial acoustic model using the sample audio file to obtain the pre-trained acoustic model can refer to the relevant content of adjusting the parameters of the pre-trained acoustic model and obtaining the acoustic model of the first user in the above embodiment, which will not be repeated here.
  • the sample audio file may be a relatively large-scale audio file.
  • the sample audio file may include songs sung by stars or singers, or songs sung by ordinary people, which is not specifically limited in the present embodiment.
  • the pre-trained acoustic model is adjusted based on the audio file of the first user to obtain the acoustic model of the first user; since the number of audio files of the first user is small, the pre-trained acoustic model can be quickly adjusted using small sample data, so as to quickly obtain a personalized acoustic model exclusive to the first user.
  • FIG 4 shows a block diagram of an audio processing device provided by an embodiment of the present application.
  • the device has the function of implementing the above-mentioned audio processing method example, and the function can be implemented by hardware, or by hardware executing corresponding software.
  • the device can be the computer device introduced above, or it can be set on a computer device.
  • the device 400 may include: a file acquisition module 410, a feature extraction module 420 and a file generation module 430.
  • the file acquisition module 410 is used to acquire a first audio file.
  • the feature extraction module 420 is used to extract audio features of the first audio file.
  • the file generation module 430 is used to process the audio features through the acoustic model of the first user to generate a second audio file; wherein the acoustic model of the first user is a model learned with the acoustic features of the first user, and the second audio file has the timbre of the first user.
  • the audio feature includes at least one of the following:
  • Phoneme features used to represent phoneme information of the audio content in the first audio file
  • the pitch feature is used to represent the pitch information of the audio content in the first audio file.
  • the file generation module 430 includes: a spectrum generation submodule 431 and a file generation submodule 432 .
  • the spectrum generating submodule 431 is used to process the audio features through the acoustic model of the first user to generate a Mel spectrum.
  • the file generation submodule 432 is used to generate the second audio file according to the Mel spectrum.
  • the acoustic model includes an encoder and a decoder; as shown in FIG. 5, the spectrum generation submodule 431 is used to: process the phoneme features in the audio features through the encoder to obtain encoded phoneme features, wherein the phoneme features are used to represent the phoneme information of the audio content in the first audio file; fuse the encoded phoneme features with the pitch features in the audio features to obtain fused features; and process the fused features through the decoder to obtain the Mel spectrum.
  • the device 400 further includes: a feature interception module 440 .
  • the feature interception module 440 is used to extract slice features of a set length from the fused features; wherein the slice features are used as the input of the decoder to obtain the Mel spectrum.
  • the apparatus 400 further includes: a feature acquisition module 450 .
  • the feature acquisition module 450 is used to acquire the voiceprint feature of the first user.
  • the spectrum generating submodule 431 is used to process the fusion feature and the voiceprint feature of the first user through the decoder to obtain the Mel spectrum.
  • the apparatus 400 further includes: a model adjustment module 460 .
  • the file acquisition module 410 is further configured to acquire an audio file of the first user, where the audio file of the first user refers to a file obtained by recording the audio content of the first user.
  • the model adjustment module 460 is used to use the audio file of the first user to adjust the pre-trained acoustic model to obtain the acoustic model of the first user.
  • the model adjustment module 460 is used to:
  • the parameters of the pre-trained acoustic model are adjusted to obtain the acoustic model of the first user.
  • the audio features and voiceprint features corresponding to the audio file of the first user are preloaded into the graphics memory of the graphics processor GPU.
  • the device 400 further includes: a model training module 470 .
  • the file acquisition module 410 is also used to acquire a sample audio file.
  • the model training module 470 is used to train the initial acoustic model using the sample audio file to obtain the pre-trained acoustic model.
  • the acoustic characteristics of the user are fused with the first audio file through the relevant information of the first audio file, the timbre production instructions and the acoustic model of the user to generate a second audio file with the acoustic characteristics of the user, thereby improving the richness of the audio content.
  • It should be noted that, when the device provided in the above embodiments implements its functions, the division into the above functional modules is described only as an example; in practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
  • the device and method embodiments provided in the above embodiment belong to the same concept, and their specific implementation process is detailed in the method embodiment, which will not be repeated here.
  • FIG6 shows a block diagram of a computer device provided in one embodiment of the present application.
  • the computer device is used to implement the audio processing method provided in the above embodiment. Specifically:
  • the computer device 600 includes a CPU (Central Processing Unit) 601, a system memory 604 including a RAM (Random Access Memory) 602 and a ROM (Read-Only Memory) 603, and a system bus 605 connecting the system memory 604 and the central processing unit 601.
  • the computer device 600 also includes a basic I/O (Input/Output) system 606 for facilitating information transmission between various components in the computer, and a large-capacity storage device 607 for storing an operating system 613, application programs 614, and other program modules 615.
  • the basic input/output system 606 includes a display 608 for displaying information and an input device 609 such as a mouse and a keyboard for user inputting information.
  • the display 608 and the input device 609 are connected to the central processing unit 601 through an input/output controller 610 connected to the system bus 605.
  • the basic input/output system 606 may also include an input/output controller 610 for receiving and processing inputs from a plurality of other devices such as a keyboard, a mouse, or an electronic stylus.
  • the input/output controller 610 also provides output to a display screen, a printer, or other types of output devices.
  • the mass storage device 607 is connected to the central processing unit 601 through a mass storage controller (not shown) connected to the system bus 605.
  • the mass storage device 607 and its associated computer readable medium provide non-volatile storage for the computer device 600. That is, the mass storage device 607 may include a computer readable medium (not shown) such as a hard disk or a CD-ROM (Compact Disc Read-Only Memory) drive.
  • the computer-readable medium may include computer storage media and communication media.
  • Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, EPROM (Erasable Programmable Read Only Memory), EEPROM (Electrically Erasable Programmable Read Only Memory), flash memory or other solid-state memory, CD-ROM, DVD (Digital Video Disc) or other optical storage, tape cassettes, magnetic tapes, disk storage or other magnetic storage devices.
  • the computer device 600 can also be connected to a remote computer on the network through a network such as the Internet. That is, the computer device 600 can be connected to the network 612 through the network interface unit 611 connected to the system bus 605, or the network interface unit 611 can be used to connect to other types of networks or remote computer systems (not shown).
  • a computer-readable storage medium is further provided, wherein a computer program is stored in the storage medium, and when the computer program is executed by a processor, the above-mentioned audio processing method is implemented.
  • a computer program product is also provided.
  • the computer program product is loaded and executed by a processor to implement the above audio processing method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application provide an audio processing method, apparatus, device, storage medium and program product, relating to the field of audio technology. The method includes: obtaining a first audio file (110); extracting audio features of the first audio file (120); and processing the audio features through an acoustic model of a first user to generate a second audio file, wherein the acoustic model of the first user is a model that has learned the acoustic features of the first user, and the second audio file has the timbre of the first user (130). The technical solutions provided by the embodiments of the present application can improve the richness of audio content.

Description

音频处理方法、装置、设备、存储介质及程序产品 技术领域
本申请实施例涉及音频技术领域,特别涉及一种音频处理方法、装置、设备、存储介质及程序产品。
背景技术
目前,随着音频技术的发展,音频处理方式越来越多种多样。
在相关技术中,用户可以通过某个音频制作应用程序给自己录音、调音并播放制作的音频。
在上述相关技术中,用户只能采用自己录音得到的音频进行音频制作,制作得到的音频内容较为单一。
发明内容
本申请实施例提供了一种音频处理方法、装置、设备、存储介质及程序产品,能够提升音频内容的丰富性。所述技术方案如下:
根据本申请实施例的一个方面,提供了一种音频处理方法,所述方法包括:
显示第一音频文件的相关信息;
响应于针对所述第一音频文件的音色制作指令,显示通过第一用户的声学模型根据所述第一音频文件生成的第二音频文件;其中,所述第一用户的声学模型是学习有所述第一用户的声学特征的模型,所述第二音频文件具有所述第一用户的音色。
根据本申请实施例的一个方面,提供了一种音频处理装置,所述装置包括:
信息显示模块,用于显示第一音频文件的相关信息;
文件显示模块,用于响应于针对所述第一音频文件的音色制作指令,显示通过第一用户的声学模型根据所述第一音频文件生成的第二音频文件;其中,所述第一用户的声学模型是学习有所述第一用户的声学特征的模型,所述第二音频文件具有所述第一用户的音色。
根据本申请实施例的一个方面,提供了一种计算机设备,所述计算机设备 包括处理器和存储器,所述存储器中存储有计算机程序,所述计算机程序由所述处理器加载并执行以实现上述音频处理方法。
根据本申请实施例的一个方面,提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,所述计算机程序由处理器加载并执行以实现上述音频处理方法。
根据本申请实施例的一个方面,提供了一种计算机程序产品,所述计算机程序产品由处理器加载并执行以实现上述音频处理方法。
本申请实施例提供的技术方案可以包括如下有益效果:
通过提取第一音频文件的音频特征,并基于第一音频文件的音频特征、和用户的声学模型,将该用户的声学特征与第一音频文件融合,生成具有该用户音色的第二音频文件,实现了对音频进行音色修改的功能,从而提升了音频内容的丰富性。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,并不能限制本申请。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请一个实施例提供的音频处理方法的流程图;
图2是本申请一个实施例提供的音素的示意图;
图3是本申请一个实施例提供的声学模型的示意图;
图4是本申请一个实施例提供的音频处理装置的框图;
图5是本申请另一个实施例提供的音频处理装置的框图;
图6是本申请一个实施例提供的计算机设备的框图。
具体实施方式
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描 述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的方法的例子。
本申请实施例提供的方法,各步骤的执行主体可以是计算机设备,该计算机设备是指具备数据计算、处理和存储能力的电子设备。该计算机设备可以是诸如PC(Personal Computer,个人计算机)、平板电脑、智能手机、可穿戴设备、智能机器人等终端;也可以是服务器。其中,服务器可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云计算服务的云服务器。
下面,通过几个实施例对本申请技术方案进行介绍说明。
请参考图1,其示出了本申请一个实施例提供的音频处理方法的流程图。在本实施例中,主要以该方法应用于上文介绍的计算机设备中来举例说明。该方法可以包括如下几个步骤(110~130):
步骤110,获取第一音频文件。
在一些实施例中,第一音频文件可以是歌曲、配音、诗朗诵、有声读物、广播剧等类型的音频。
在一些实施例中,获取一个或多个第一音频文件。也即,可以对单个音频文件进行音色制作;也可以对多个音频文件同时进行音色制作,从而提升音色制作效率。
在一些实施例中,第一音频文件可以是通过有线或无线传输(如网络连接)获取到的音频文件。在一些实施例中,该方法应用于终端设备的目标应用程序中(如目标应用程序的客户端)。该目标应用程序可以是音频类应用程序,如音乐制作应用程序、音频播放应用程序、音频直播应用程序、K歌应用程序等,本申请实施例对此不作具体限定。该目标应用程序还可以是社交应用程序、支付应用程序、视频应用程序、购物应用程序、新闻应用程序、游戏应用程序等任何具有音频处理功能的应用程序。在一些实施例中,第一音频文件可以是通过目标应用程序的客户端录制和/或制作得到的音频文件。
步骤120,提取第一音频文件的音频特征。
在一些实施例中,第一音频文件中包括任意用户发出的语音内容,从第一音频文件中提取出该用户发出的语音内容的音频特征。
在一些实施例中,音频特征包括以下至少之一:
音素特征,用于表征第一音频文件中的音频内容的音素信息;
音高特征,用于表征第一音频文件中的音频内容的音高信息。
其中,音素是指是根据语音的自然属性划分出来的最小语音单位,是从音质的角度划分出来的最小的线性的语音单位。音素是具体存在的物理现象。依据音节里的发音动作来分析,一个动作构成一个音素。在一些实施例中,音素分为元音与辅音两大类。例如,汉语音节啊(ā)只有一个音素,爱(ài)有两个音素,代(dài)有三个音素。在一些实施例中,音素信息包括第一音频文件中的音频内容包含的音素、以及各个音素的发音时长,这些特征共同组成了音素特征。例如,有些人发音较为饱满,则在正常语速下,元音对应的音素发音时间就相对较长;又例如,有些人语速较快,发音较为短促,则每个音素的时长都比较短;又例如,受生理音素或生活环境影响,有些人很难发出某些音素(如“h”、“n”等)。
在一些实施例中,如图2所示,各个音素可以用音素块表示,音素块的长度用于表示对应音素的发音时长;例如,音素块21的长度a 1用于表示音素a的发音时长。
其中,音高是指声音的音调高低,音高由声波的频率和波长决定。频率越高、波长越短,则音高越高;反之,频率越低、波长越长,则音高越低。
在一些实施例中,音频特征还可以包括第一音频文件中的音频内容的能量特征、气声特征、张力特征等,本申请对此不作限定。其中,能量特征可以用于指示第一音频文件中的音频内容的音量/响度大小;气声是指声带不振动或几乎不振动的发音方式,气声特征可以指示用户使用气声发音的规律或节奏;张力特征是指第一音频文件中的音频内容的低音与高音之间、弱音与强音之间的变化特征。
步骤130,通过第一用户的声学模型对音频特征进行处理,生成第二音频文件;其中,第一用户的声学模型是学习有第一用户的声学特征的模型,第二音频文件具有第一用户的音色。
在一些实施例中,第一用户的声学特征包括第一用户的音色特征。音色是指不同声音的声音特点,在物理上表现在声波的波形特点,因而音色也可以称为声纹特征。不同人说话的声音的音色各不相同。
在一些实施例中,采用学习有第一用户的声学特征的模型,对第一音频文件的音频特征进行处理,生成第二音频文件。也即,将第一用户的音色与第一音频文件的音频特征(如音素特征、音高特征等)融合,生成兼具第一用户音色、第一音频文件的音素和音高特征的第二音频文件。
在一些实施例中,该步骤130还包括:通过第一用户的声学模型对音频特征进行处理,生成梅尔频谱(mel spectrogram);根据梅尔频谱,生成第二音频文件。研究表明,人类对声音频率的感知并不是线性的,并且对低频信号的感知要比高频信号敏感。例如,人们可以比较容易地感知到500和1000Hz(Hertz,赫兹)的区别,却很难发现7500和8000Hz的区别。针对这种情况提出的梅尔标度(the Mel Scale),是针对声音频率的非线性变换,对于以梅尔标度为单位的信号(如声音信号),可以模拟人对于声音信号变化的线性感知。
在一些实施例中,梅尔频谱也可以替换为其他可行的频谱,本申请实施例对此不作具体限定。
在一些实施例中,如图3所示,声学模型30包括编码器31和解码器32;通过第一用户的声学模型对音频特征进行处理,生成梅尔频谱,包括如下步骤:
1、通过编码器31对音频特征中的音素特征进行处理,得到编码后的音素特征;其中,音素特征用于表征第一音频文件中的音频内容的音素信息;
2、将编码后的音素特征与音频特征中的音高特征进行融合,得到融合特征;
3、通过解码器32对融合特征进行处理,得到梅尔频谱。
在一些实施例中,编码器31通过获取音频特征中的音素特征,对音素特征进行编码处理,得到编码后的音素特征33(也可以称为中间层变量)。可选地,由于音素的发音时长并不完全一致,在对音素特征进行编码处理时,通过长度调节器对不同音素特征的编码后长度进行调节,从而使得编码后的音素特征长度相同。例如,对音素特征进行初步编码处理后得到的各个音素特征的长度还不统一,则以长度最长的初步编码处理后的音素特征的长度为标准长度,将其他初步编码处理后的音素特征相对于标准长度短缺/不足的部分补全,如将短缺的部分用“0”填充补全,从而将所有音素特征的长度统一,得到长度统一编码 后的音素特征。又例如,预先设定一个标准长度,将各个音素特征相对于标准长度短缺的部分补全,从而将所有编码后的音素特征的长度都统一为标准长度。其中,标准长度可以由相关技术人员根据实际情况进行设定,本申请实施例对此不作具体限定。可选地,标准长度不短于长度最长的初步编码处理后的音素特征的长度。
在一些实施例中,将编码后的音素特征与音频特征中的音高特征进行融合,得到融合特征之后,还包括:从融合特征中截取设定长度的切片特征;其中,切片特征用于作为解码器32的输入,得到梅尔频谱。也即,融合特征不会全部作为解码器32的输入,而是将其截取出设定长度的连续的特征片段,并将该特征片段进行切片处理,得到多个切片特征,并将该多个切片特征输入解码器32,得到梅尔频谱。在一些实施例中,音频是由多个音频帧(即多个音频片段)组成的。可选地,每个音频帧的长度(即时长)相等,一个音频帧的长度可以认为是1,则100个连续音频帧的长度就是100。在一些实施例中,每个切片特征的长度相同(即每个切片特征的中包含的音频帧的数量相同)。例如,融合特征长度为3000,从融合特征中截取多个连续的切片特征输入解码器32,每一个切片特征的长度均为500。
在上述实施例中,仅从融合特征中截取出设定长度的切片特征进行处理,无需对整个融合特征进行处理,根据实验结果,这样处理对模型精度的影响较小,从而在保证声学模型精度的前提下,节省处理资源、并提升模型的处理效率。
在一些实施例中,获取第一用户的声纹特征;通过解码器对融合特征和第一用户的声纹特征进行处理,得到梅尔频谱。从而将第一音频文件的音频内容的音频特征,与第一用户的声纹特征进行融合,得到兼具第一用户的声纹特征、第一音频文件的音素特征和音高特征的第二音频文件。对于唱歌场景,可以得到听上去像是第一用户按照第一音频文件中的演唱者的唱法演唱出的歌曲(即第二音频文件),从而提升处理得到的音频文件的内容丰富性。
综上所述,本申请实施例提供的技术方案中,通过第一音频文件的相关信息、音色制作指令和用户的声学模型,将该用户的声学特征与第一音频文件融合,生成具有该用户音色的第二音频文件,实现了对音频进行音色修改的功能,从而提升了音频内容的丰富性。
在一些可能的实现方式中,方法还包括:
1、获取第一用户的音频文件,第一用户的音频文件是指对第一用户的音频内容进行录制得到的文件;
2、采用第一用户的音频文件,对预训练的声学模型进行调整,得到第一用户的声学模型。
在一些实施例中,第一用户通过演唱歌曲、诗朗诵、配音等方式录制得到第一用户的音频文件。并基于第一用户的音频文件,对预训练的声学模型进行调整,得到第一用户的声学模型。
在一些实施例中,采用第一用户的音频文件,对预训练的声学模型进行调整,得到第一用户的声学模型,包括:
(1)提取第一用户的音频文件对应的音频特征、声纹特征和标准梅尔频谱;
(2)通过预训练的声学模型根据第一用户的音频文件对应的音频特征和声纹特征,生成预测梅尔频谱;
(3)根据预测梅尔频谱和标准梅尔频谱,对预训练的声学模型的参数进行调整,得到第一用户的声学模型。
在上述实施例中,采用第一用户的音频文件对预训练的声学模型进行微调。将从第一用户的音频文件中提取出来的音频特征和声纹特征输入预训练的声学模型中,预训练的声学模型输出对应的预测梅尔频谱;基于预测梅尔频谱和标准梅尔频谱计算损失,并根据损失计算结果调整预训练的声学模型的参数,使其损失函数的呈梯度下降的趋势,直到预训练的声学模型微调完成,则得到第一用户的声学模型。从而可以对音频文件的音频特征进行处理,将该音频文件中人发出的语音(如演唱的歌曲、朗诵内容、配音内容等)的声纹/音色,修改为第一用户的声纹/音色,实现音色的修改和替换。
在一些实施例中,第一用户的音频文件对应的音频特征和声纹特征,预加载进GPU(Graphics Processing Unit,图形处理器)显存中,从而无需从别处花更多时间获取第一用户的音频文件对应的音频特征和声纹特征,从而提升数据加载速度,节省模型的训练时间。
在一些实施例中,该方法还包括:获取样本音频文件;采用样本音频文件对初始的声学模型进行训练,得到预训练的声学模型。在上述实施例中,提取 样本音频文件对应的音频特征、声纹特征和标准梅尔频谱;通过初始的声学模型根据样本音频文件对应的音频特征和声纹特征,生成样本音频文件对应的预测梅尔频谱;再根据根据样本音频文件对应的预测梅尔频谱和样本音频文件对应的标准梅尔频谱,对初始的声学模型的参数进行调整,得到预训练的声学模型。采用样本音频文件对初始的声学模型进行训练、得到预训练的声学模型的过程,可以参考上文实施例中对预训练的声学模型的参数进行调整、得到第一用户的声学模型的相关内容,此处不再赘述。
其中,样本音频文件可以是较大规模的音频文件。在音频文件为歌曲的情况下,样本音频文件可以包括明星、歌手演唱的歌曲,也可以包括普通人演唱的歌曲,本申请实施例对此不作具体限定。
在上述实现方式中,基于第一用户的音频文件,对预训练的声学模型进行调整,得到第一用户的声学模型;由于第一用户的音频文件的数量较少,可以采用小样本数据对预训练的声学模型进行快速调整,从而快速得到专属于第一用户的个性化声学模型。
下述为本申请装置实施例,可以用于执行本申请方法实施例。对于本申请装置实施例中未披露的细节,请参照本申请方法实施例。
请参考图4,其示出了本申请一个实施例提供的音频处理装置的框图。该装置具有实现上述音频处理方法示例的功能,所述功能可以由硬件实现,也可以由硬件执行相应的软件实现。该装置可以是上文介绍的计算机设备,也可以设置在计算机设备上。该装置400可以包括:文件获取模块410、特征提取模块420和文件生成模块430。
所述文件获取模块410,用于获取第一音频文件。
所述特征提取模块420,用于提取所述第一音频文件的音频特征。
所述文件生成模块430,用于通过第一用户的声学模型对所述音频特征进行处理,生成第二音频文件;其中,所述第一用户的声学模型是学习有所述第一用户的声学特征的模型,所述第二音频文件具有所述第一用户的音色。
在一些实施例中,所述音频特征包括以下至少之一:
音素特征,用于表征所述第一音频文件中的音频内容的音素信息;
音高特征,用于表征所述第一音频文件中的音频内容的音高信息。
在一些实施例中,如图5所示,所述文件生成模块430,包括:频谱生成子模块431和文件生成子模块432。
所述频谱生成子模块431,用于通过所述第一用户的声学模型对所述音频特征进行处理,生成梅尔频谱。
所述文件生成子模块432,用于根据所述梅尔频谱,生成所述第二音频文件。
在一些实施例中,所述声学模型包括编码器和解码器;如图5所示,频谱生成子模块431,用于:
通过所述编码器对所述音频特征中的音素特征进行处理,得到编码后的音素特征;其中,所述音素特征用于表征所述第一音频文件中的音频内容的音素信息;
将所述编码后的音素特征与所述音频特征中的音高特征进行融合,得到融合特征;
通过所述解码器对所述融合特征进行处理,得到所述梅尔频谱。
在一些实施例中,如图5所示,所述装置400还包括:特征截取模块440。
所述特征截取模块440,用于从所述融合特征中截取设定长度的切片特征;其中,所述切片特征用于作为所述解码器的输入,得到所述梅尔频谱。
在一些实施例中,如图5所示,所述装置400还包括:特征获取模块450。
所述特征获取模块450,用于获取所述第一用户的声纹特征。
所述频谱生成子模块431,用于通过所述解码器对所述融合特征和第一用户的声纹特征进行处理,得到所述梅尔频谱。
在一些实施例中,如图5所示,所述装置400还包括:模型调整模块460。
所述文件获取模块410,还用于获取所述第一用户的音频文件,所述第一用户的音频文件是指对所述第一用户的音频内容进行录制得到的文件。
所述模型调整模块460,用于采用所述第一用户的音频文件,对预训练的所述声学模型进行调整,得到所述第一用户的声学模型。
在一些实施例中,如图5所示,所述模型调整模块460,用于:
提取所述第一用户的音频文件对应的音频特征、声纹特征和标准梅尔频谱;
通过预训练的所述声学模型根据所述第一用户的音频文件对应的音频特征和声纹特征,生成预测梅尔频谱;
根据所述预测梅尔频谱和所述标准梅尔频谱,对预训练的所述声学模型的 参数进行调整,得到所述第一用户的声学模型。
在一些实施例中,所述第一用户的音频文件对应的音频特征和声纹特征,预加载进图形处理器GPU显存中。
在一些实施例中,如图5所示,所述装置400还包括:模型训练模块470。
所述文件获取模块410,还用于获取样本音频文件。
所述模型训练模块470,用于采用所述样本音频文件对初始的所述声学模型进行训练,得到预训练的所述声学模型。
综上所述,本申请实施例提供的技术方案中,通过第一音频文件的相关信息、音色制作指令和用户的声学模型,将该用户的声学特征与第一音频文件融合,生成具有该用户声学特征的第二音频文件,从而提升了音频内容的丰富性。
需要说明的是,上述实施例提供的装置,在实现其功能时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的装置与方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
请参考图6,其示出了本申请一个实施例提供的计算机设备的结构框图。该计算机设备用于实施上述实施例中提供的音频处理方法。具体来讲:
所述计算机设备600包括CPU(Central Processing Unit,中央处理单元)601、包括RAM(Random Access Memory,随机存取存储器)602和ROM(Read-Only Memory,只读存储器)603的系统存储器604,以及连接系统存储器604和中央处理单元601的系统总线605。所述计算机设备600还包括帮助计算机内的各个器件之间传输信息的基本I/O(Input/Output,输入/输出)系统606,和用于存储操作系统613、应用程序614和其他程序模块615的大容量存储设备607。
所述基本输入/输出系统606包括有用于显示信息的显示器608和用于用户输入信息的诸如鼠标、键盘之类的输入设备609。其中所述显示器608和输入设备609都通过连接到系统总线605的输入输出控制器610连接到中央处理单元601。所述基本输入/输出系统606还可以包括输入输出控制器610以用于接收和处理来自键盘、鼠标、或电子触控笔等多个其他设备的输入。类似地,输入输 出控制器610还提供输出到显示屏、打印机或其他类型的输出设备。
所述大容量存储设备607通过连接到系统总线605的大容量存储控制器(未示出)连接到中央处理单元601。所述大容量存储设备607及其相关联的计算机可读介质为计算机设备600提供非易失性存储。也就是说,所述大容量存储设备607可以包括诸如硬盘或者CD-ROM(Compact Disc Read-Only Memory,只读光盘)驱动器之类的计算机可读介质(未示出)。
不失一般性,所述计算机可读介质可以包括计算机存储介质和通信介质。计算机存储介质包括以用于存储诸如计算机可读指令、数据结构、程序模块或其他数据等信息的任何方法或技术实现的易失性和非易失性、可移动和不可移动介质。计算机存储介质包括RAM、ROM、EPROM(Erasable Programmable Read Only Memory,可擦除可编程只读存储器)、EEPROM(Electrically Erasable Programmable Read Only Memory,可擦除可编程只读存储器)、闪存或其他固态存储器,CD-ROM、DVD(Digital Video Disc,高密度数字视频光盘)或其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。当然,本领域技术人员可知所述计算机存储介质不局限于上述几种。上述的系统存储器604和大容量存储设备607可以统称为存储器。
根据本申请的各种实施例,所述计算机设备600还可以通过诸如因特网等网络连接到网络上的远程计算机运行。也即计算机设备600可以通过连接在所述系统总线605上的网络接口单元611连接到网络612,或者说,也可以使用网络接口单元611来连接到其他类型的网络或远程计算机系统(未示出)。
在示例性实施例中,还提供了一种计算机可读存储介质,所述存储介质中存储有计算机程序,所述计算机程序在被处理器执行时以实现上述音频处理方法。
在示例性实施例中,还提供了一种计算机程序产品,所述计算机程序产品由处理器加载并执行以实现上述音频处理方法。
应当理解的是,在本文中提及的“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。字符“/”一般表 示前后关联对象是一种“或”的关系。
以上所述仅为本申请的示例性实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (14)

  1. An audio processing method, characterized in that the method comprises:
    obtaining a first audio file;
    extracting audio features of the first audio file;
    processing the audio features through an acoustic model of a first user to generate a second audio file; wherein the acoustic model of the first user is a model that has learned acoustic features of the first user, and the second audio file has the timbre of the first user.
  2. The method according to claim 1, characterized in that the audio features comprise at least one of the following:
    phoneme features, used to represent phoneme information of the audio content in the first audio file;
    pitch features, used to represent pitch information of the audio content in the first audio file.
  3. The method according to claim 1, characterized in that the processing the audio features through an acoustic model of a first user to generate a second audio file comprises:
    processing the audio features through the acoustic model of the first user to generate a Mel spectrum;
    generating the second audio file according to the Mel spectrum.
  4. The method according to claim 3, characterized in that the acoustic model comprises an encoder and a decoder;
    the processing the audio features through the acoustic model of the first user to generate a Mel spectrum comprises:
    processing the phoneme features in the audio features through the encoder to obtain encoded phoneme features; wherein the phoneme features are used to represent phoneme information of the audio content in the first audio file;
    fusing the encoded phoneme features with the pitch features in the audio features to obtain fused features;
    processing the fused features through the decoder to obtain the Mel spectrum.
  5. The method according to claim 4, characterized in that after the fusing the encoded phoneme features with the pitch features in the audio features to obtain fused features, the method further comprises:
    extracting slice features of a set length from the fused features;
    wherein the slice features are used as the input of the decoder to obtain the Mel spectrum.
  6. The method according to claim 4, characterized in that the method further comprises:
    obtaining a voiceprint feature of the first user;
    the processing the fused features through the decoder to obtain the Mel spectrum comprises:
    processing the fused features and the voiceprint feature of the first user through the decoder to obtain the Mel spectrum.
  7. The method according to claim 1, characterized in that the method further comprises:
    obtaining an audio file of the first user, the audio file of the first user being a file obtained by recording audio content of the first user;
    adjusting a pre-trained acoustic model using the audio file of the first user to obtain the acoustic model of the first user.
  8. The method according to claim 7, characterized in that the adjusting a pre-trained acoustic model using the audio file of the first user to obtain the acoustic model of the first user comprises:
    extracting audio features, voiceprint features and a standard Mel spectrum corresponding to the audio file of the first user;
    generating a predicted Mel spectrum through the pre-trained acoustic model according to the audio features and voiceprint features corresponding to the audio file of the first user;
    adjusting parameters of the pre-trained acoustic model according to the predicted Mel spectrum and the standard Mel spectrum to obtain the acoustic model of the first user.
  9. The method according to claim 8, characterized in that the audio features and voiceprint features corresponding to the audio file of the first user are preloaded into the video memory of a graphics processing unit (GPU).
  10. The method according to claim 7, characterized in that the method further comprises:
    obtaining a sample audio file;
    training an initial acoustic model using the sample audio file to obtain the pre-trained acoustic model.
  11. An audio processing apparatus, characterized in that the apparatus comprises:
    a file acquisition module, used to obtain a first audio file;
    a feature extraction module, used to extract audio features of the first audio file;
    a file generation module, used to process the audio features through an acoustic model of a first user to generate a second audio file; wherein the acoustic model of the first user is a model that has learned acoustic features of the first user, and the second audio file has the timbre of the first user.
  12. A computer device, characterized in that the computer device comprises a processor and a memory, wherein a computer program is stored in the memory, and the computer program is loaded and executed by the processor to implement the audio processing method according to any one of claims 1 to 10.
  13. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and the computer program is loaded and executed by a processor to implement the audio processing method according to any one of claims 1 to 10.
  14. A computer program product, characterized in that the computer program product is loaded and executed by a processor to implement the audio processing method according to any one of claims 1 to 10.
PCT/CN2022/132820 2022-11-18 2022-11-18 音频处理方法、装置、设备、存储介质及程序产品 WO2024103383A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2022/132820 WO2024103383A1 (zh) 2022-11-18 2022-11-18 音频处理方法、装置、设备、存储介质及程序产品
CN202280004371.XA CN116034423A (zh) 2022-11-18 2022-11-18 音频处理方法、装置、设备、存储介质及程序产品

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/132820 WO2024103383A1 (zh) 2022-11-18 2022-11-18 音频处理方法、装置、设备、存储介质及程序产品

Publications (1)

Publication Number Publication Date
WO2024103383A1 true WO2024103383A1 (zh) 2024-05-23

Family

ID=86072780

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/132820 WO2024103383A1 (zh) 2022-11-18 2022-11-18 音频处理方法、装置、设备、存储介质及程序产品

Country Status (2)

Country Link
CN (1) CN116034423A (zh)
WO (1) WO2024103383A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112927674A (zh) * 2021-01-20 2021-06-08 北京有竹居网络技术有限公司 语音风格的迁移方法、装置、可读介质和电子设备
CN112992107A (zh) * 2021-03-25 2021-06-18 腾讯音乐娱乐科技(深圳)有限公司 训练声学转换模型的方法、终端及存储介质
CN113571039A (zh) * 2021-08-09 2021-10-29 北京百度网讯科技有限公司 语音转换方法、系统、电子设备及可读存储介质
US20220068259A1 (en) * 2020-08-28 2022-03-03 Microsoft Technology Licensing, Llc System and method for cross-speaker style transfer in text-to-speech and training data generation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220068259A1 (en) * 2020-08-28 2022-03-03 Microsoft Technology Licensing, Llc System and method for cross-speaker style transfer in text-to-speech and training data generation
CN112927674A (zh) * 2021-01-20 2021-06-08 北京有竹居网络技术有限公司 语音风格的迁移方法、装置、可读介质和电子设备
CN112992107A (zh) * 2021-03-25 2021-06-18 腾讯音乐娱乐科技(深圳)有限公司 训练声学转换模型的方法、终端及存储介质
CN113571039A (zh) * 2021-08-09 2021-10-29 北京百度网讯科技有限公司 语音转换方法、系统、电子设备及可读存储介质

Also Published As

Publication number Publication date
CN116034423A (zh) 2023-04-28

Similar Documents

Publication Publication Date Title
CN106898340B (zh) 一种歌曲的合成方法及终端
JP6876752B2 (ja) 応答方法及び装置
CN110675886B (zh) 音频信号处理方法、装置、电子设备及存储介质
EP3872806B1 (en) Text-to-speech from media content item snippets
US12027165B2 (en) Computer program, server, terminal, and speech signal processing method
WO2020113733A1 (zh) 动画生成方法、装置、电子设备及计算机可读存储介质
CN111489424A (zh) 虚拟角色表情生成方法、控制方法、装置和终端设备
CN107172449A (zh) 多媒体播放方法、装置及多媒体存储方法
JP2008517315A (ja) メディアコンテンツ項目のカテゴリに関してユーザに通知するためのデータ処理装置及び方法
CN110211556B (zh) 音乐文件的处理方法、装置、终端及存储介质
US10453434B1 (en) System for synthesizing sounds from prototypes
US10108395B2 (en) Audio device with auditory system display and methods for use therewith
WO2022184055A1 (zh) 文章的语音播放方法、装置、设备、存储介质及程序产品
KR101164379B1 (ko) 사용자 맞춤형 컨텐츠 제작이 가능한 학습 장치 및 이를 이용한 학습 방법
US11687314B2 (en) Digital audio workstation with audio processing recommendations
CN113542626B (zh) 视频配乐方法、装置、计算机设备和存储介质
CN114999441A (zh) 虚拟形象生成方法、装置、设备、存储介质以及程序产品
JP2008216486A (ja) 音楽再生システム
WO2024103383A1 (zh) 音频处理方法、装置、设备、存储介质及程序产品
JP2014123085A (ja) カラオケにおいて歌唱に合わせて視聴者が行う身体動作等をより有効に演出し提供する装置、方法、およびプログラム
WO2022041177A1 (zh) 通信消息处理方法、设备及即时通信客户端
CN112071287A (zh) 用于生成歌谱的方法、装置、电子设备和计算机可读介质
JP2020204683A (ja) 電子出版物視聴覚システム、視聴覚用電子出版物作成プログラム、及び利用者端末用プログラム
CN104464717B (zh) 声音合成装置
CN116092508A (zh) 音频处理方法、装置、终端、存储介质及程序产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22965564

Country of ref document: EP

Kind code of ref document: A1