WO2023051155A1 - Speech processing and training methods and electronic device - Google Patents

Speech processing and training methods and electronic device

Info

Publication number
WO2023051155A1
WO2023051155A1 (PCT application PCT/CN2022/116572, CN2022116572W)
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
feature information
timbre
sets
source audio
Prior art date
Application number
PCT/CN2022/116572
Other languages
English (en)
Chinese (zh)
Inventor
黄涛
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2023051155A1 publication Critical patent/WO2023051155A1/fr

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 - Changing voice quality, e.g. pitch or formants
    • G10L21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 - Adapting to target pitch

Definitions

  • the embodiments of the present application relate to the field of data processing, and in particular to a speech processing method, a training method, and an electronic device.
  • the mainstream content of audiobooks is novels, such as romance novels, suspense novels, science fiction novels, and martial arts novels. Since a novel contains multiple characters, when recording an audiobook a single narrator usually performs multi-role interpretation through false voices to record the voices of the multiple roles in the novel.
  • the present application provides a speech processing and training method and electronic equipment.
  • in this way, the unicast audio data can be converted into multicast audio data while preserving the expressiveness of the unicast audiobook, so that a unicast audiobook can be converted into a multicast audiobook.
  • the embodiment of the present application provides a speech processing method, including: acquiring unicast audio data of a unicast audiobook, where the unicast audio data includes N pieces of source audio data; determining, from M sets of reference timbre feature information, N sets of target timbre feature information matching the N pieces of source audio data, where the M sets of reference timbre feature information correspond to M timbres and the timbre discrimination between any two of the M sets is greater than a discrimination threshold; acquiring N sets of expressive feature information corresponding to the N pieces of source audio data; and then, based on the N sets of target timbre feature information and the N sets of expressive feature information, performing timbre conversion on the N pieces of source audio data respectively to generate multicast audio data.
  • in this way, the timbre of each piece of source audio data can be converted, the unicast audiobook can be converted into a multicast audiobook, and the timbre distinction between the characters in the reading can be improved, thereby improving the quality of the audiobook.
  • in addition, the expressiveness of the scene interpretation is improved, making it easy for users to understand the plot.
  • the application can automatically match a timbre to each piece of source audio data, without requiring the user to manually select a matching timbre for each piece; this improves the efficiency of converting unicast audio data into multicast audio data, simplifies user operations, and improves user experience.
  • this application only requires a single voice actor to perform multi-role interpretation through false voices and record a unicast audiobook, which is then converted into a multicast audiobook.
  • this application can reduce the production time and production cost of multicast audiobooks, and improve the production efficiency of multicast audiobooks.
  • the audiobook producer can directly convert already produced unicast audiobooks into multicast audiobooks without re-recording them, with high efficiency and low cost.
  • N is a positive integer
  • M is a positive integer
  • the timbre feature information can be used to characterize information related to timbre, which can include but is not limited to: personality features, gender features, age features, and vocal-range features (such as high-pitched, middle-pitched, and low-pitched), and can be represented by a vector or a sequence, which is not limited in this application.
  • determining N sets of target timbre feature information matching N pieces of source audio data from M sets of reference timbre feature information includes: performing audio sentence division on unicast audio data to obtain N pieces of source audio data, One-to-one correspondence between N pieces of source audio data and audio sentences; obtain N sets of source timbre feature information corresponding to N pieces of source audio data; determine N sets of target timbre feature information from M sets of reference timbre feature information based on N sets of source timbre feature information .
  • N sets of target timbre feature information matching the N pieces of source audio data are determined from the M sets of reference timbre feature information based on the N sets of source timbre feature information as follows: for the i-th piece of source audio data among the N pieces of source audio data, the similarity between each of the M sets of reference timbre feature information and the i-th piece of source audio data is determined respectively, and the reference timbre feature information with the highest similarity to the i-th piece of source audio data is determined as the target timbre feature information matching the i-th piece of source audio data, where i is a positive integer ranging from 1 to N.
  • the timbre with high similarity to the timbre of the character corresponding to the source audio data can be used as the timbre matching the source audio data, so that the converted timbre matches the timbre of the character corresponding to the source audio data.
  • the range from 1 to N includes 1 and N; that is, i may be 1 or N.
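  • as a rough illustration of this matching step (not part of the patent text; the function names, vector shapes, and the use of cosine similarity are assumptions), the following Python sketch picks, for each piece of source audio data, the reference timbre whose feature vector is most similar to the source timbre feature vector:

        import numpy as np

        def match_target_timbres(source_feats, reference_feats):
            """source_feats: array of shape (N, D), one source timbre feature vector per piece
            of source audio data; reference_feats: array of shape (M, D), one vector per
            reference timbre. Returns, for each piece, the index of the best-matching timbre."""
            src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
            ref = reference_feats / np.linalg.norm(reference_feats, axis=1, keepdims=True)
            similarity = src @ ref.T            # cosine similarity matrix, shape (N, M)
            return similarity.argmax(axis=1)    # highest-similarity reference for each piece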
  • acquiring N sets of expressive feature information corresponding to the N pieces of source audio data includes: acquiring N sets of prosodic feature information corresponding to the N pieces of source audio data; acquiring N sets of emotional feature information corresponding to the N pieces of source audio data; and generating the N sets of expressive feature information corresponding to the N pieces of source audio data based on the N sets of prosodic feature information and the N sets of emotional feature information.
  • the prosodic feature information may be used to represent vocal delivery information of the speech, such as stress, pacing, and emphasis, and may be represented by a vector or a sequence.
  • the emotional feature information can be used to represent the type of emotion (such as happy, sad, high-pitched, low-pitched, etc.), and the scale of attitude (such as affirmation, negation, praise, irony, etc.), which can be represented by vector or sequence.
  • performing timbre conversion on the N pieces of source audio data respectively to generate multicast audio data includes: obtaining N sets of content feature information corresponding to the N pieces of source audio data; performing spectral reconstruction based on the N sets of target timbre feature information, the N sets of expressive feature information, and the N sets of content feature information to obtain N sets of audio spectrum feature information; performing frequency-time transformation on the N sets of audio spectrum feature information respectively to obtain N pieces of target audio data; and splicing the N pieces of target audio data to obtain the multicast audio data.
  • in this way, the target audio data is consistent with the source audio data in rhythm, emotion, and content, but differs in timbre.
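  • the conversion step described above can be pictured as the following sketch (illustrative Python only; content_extractor, spectrum_model, and vocoder are hypothetical callables standing in for the content extractor, the spectral reconstruction model, and a frequency-time transform such as a vocoder or inverse STFT):

        import numpy as np

        def convert_to_multicast(source_pieces, target_timbre_feats, expressive_feats,
                                 content_extractor, spectrum_model, vocoder):
            """source_pieces: list of N waveforms; target_timbre_feats / expressive_feats:
            the N matched target timbre vectors and N expressiveness vectors."""
            converted = []
            for audio, timbre, expr in zip(source_pieces, target_timbre_feats, expressive_feats):
                content = content_extractor(audio)                # content feature information
                spectrum = spectrum_model(timbre, expr, content)  # spectral reconstruction
                converted.append(vocoder(spectrum))               # frequency-time transformation
            return np.concatenate(converted)                      # splice into multicast audio data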
  • the unicast audio data is divided into audio sentences to obtain N pieces of source audio data by performing voice activity detection (VAD) on the unicast audio data, dividing it into the N pieces of source audio data.
  • the unicast audio data is divided into audio sentences to obtain N pieces of source audio data as follows: obtaining the reading text of the unicast audiobook and dividing the reading text into N text sentences; aligning the unicast audio data with the reading text to determine the N audio time intervals corresponding to the N text sentences in the unicast audio data; and dividing the unicast audio data into N pieces of source audio data based on the N audio time intervals. In this way, the accuracy of the audio sentence division of the unicast audio data can be increased, thereby improving the accuracy of converting the unicast audiobook into a multicast audiobook.
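  • a minimal sketch of the VAD-based sentence division mentioned above is given below (a simple energy-based VAD; the frame length, energy threshold, and pause length are illustrative values standing in for the "second preset duration", not values taken from the patent):

        import numpy as np

        def split_by_silence(wave, sr, frame_ms=20, energy_thresh=1e-4, min_pause_s=0.3):
            """wave: 1-D float waveform, sr: sample rate. Splits the unicast audio into
            pieces of source audio data wherever the silence between speech exceeds
            min_pause_s."""
            frame = int(sr * frame_ms / 1000)
            n = len(wave) // frame
            energy = np.array([np.mean(wave[i * frame:(i + 1) * frame] ** 2) for i in range(n)])
            voiced = energy > energy_thresh
            min_pause = int(min_pause_s * 1000 / frame_ms)   # pause length in frames
            pieces, start, silence = [], None, 0
            for i, v in enumerate(voiced):
                if v:
                    if start is None:
                        start = i
                    silence = 0
                elif start is not None:
                    silence += 1
                    if silence >= min_pause:                 # long enough pause: end the sentence
                        pieces.append(wave[start * frame:(i - silence + 1) * frame])
                        start, silence = None, 0
            if start is not None:
                pieces.append(wave[start * frame:])
            return pieces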
  • obtaining the N sets of source timbre feature information corresponding to the N pieces of source audio data includes: determining the role corresponding to each piece of source audio data among the N pieces of source audio data; obtaining N sets of initial timbre feature information corresponding to the N pieces of source audio data; and, for the X pieces of source audio data corresponding to a first character among the N pieces of source audio data, performing a weighted calculation based on the X sets of initial timbre feature information corresponding to the X pieces of source audio data and determining the weighted calculation result as the source timbre feature information corresponding to each of the X pieces of source audio data, where X is a positive integer.
  • the source timbre feature information of the source audio data of the same character in the reading can be made the same, ensuring the unity of the timbre of the same character in the reading.
  • determining the role corresponding to each piece of source audio data includes: for the j-th piece of source audio data among the P2 pieces of source audio data, respectively calculating the similarity between the initial timbre feature information corresponding to each of the P1 pieces of source audio data and the initial timbre feature information corresponding to the j-th piece of source audio data; and, among the P1 pieces of source audio data, determining the role corresponding to the piece of source audio data whose initial timbre feature information has the highest similarity to that of the j-th piece as the role of the j-th piece of source audio data, where j is a positive integer ranging from 1 to P2. The range from 1 to P2 includes 1 and P2; that is, j may be 1 or P2.
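  • the role determination and the per-role weighted pooling described in the preceding paragraphs could be combined roughly as follows (a greedy, cosine-similarity-based sketch; the similarity threshold and the equal-weight averaging are assumptions):

        import numpy as np

        def assign_roles_and_pool(initial_feats, role_sim_thresh=0.8):
            """initial_feats: array of shape (N, D), one initial timbre feature vector per
            piece of source audio data. Greedily assigns each piece the role of its most
            similar already-assigned piece when the cosine similarity exceeds the threshold,
            otherwise starts a new role; the mean of each role's vectors is then used as the
            source timbre feature information of every piece belonging to that role."""
            feats = initial_feats / np.linalg.norm(initial_feats, axis=1, keepdims=True)
            roles = [0]
            for j in range(1, len(feats)):
                sims = feats[:j] @ feats[j]                        # similarity to earlier pieces
                best = int(np.argmax(sims))
                roles.append(roles[best] if sims[best] >= role_sim_thresh else max(roles) + 1)
            roles = np.asarray(roles)
            source_feats = np.empty_like(initial_feats)
            for r in np.unique(roles):
                source_feats[roles == r] = initial_feats[roles == r].mean(axis=0)
            return roles, source_feats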
  • the embodiment of the present application provides a training method, the method including: first, collecting training data, where the training data includes training audio data and reference role labels corresponding to the training audio data, and the expressive feature information of the training audio data satisfies an expressiveness condition; the training audio data includes audio data recorded by multiple users using their own timbres, and/or audio data recorded by multiple users using false voices, where the timbre discrimination between different false voices used by the same user is greater than the discrimination threshold.
  • through training, the following can be obtained: a timbre extractor that can extract accurate timbre feature information, a prosody extractor that can extract accurate prosodic feature information, an emotion extractor that can extract accurate emotional feature information, and a spectral reconstruction model that can reconstruct audio spectrum feature information.
  • the method further includes: on the one hand, inputting the timbre feature information into the first classifier for calculation to obtain the first role label, and calculating the second loss function value based on the first role label and the reference role label.
  • the emotional feature information is input to the second classifier for calculation to obtain a second character label, and a third loss function value is calculated based on the second role label and the reference role label.
  • on the other hand, the model parameters of the timbre extractor are adjusted with the goal of minimizing the second loss function value and the mutual information between the timbre feature information and the emotional feature information, and the model parameters of the emotion extractor are adjusted with the goal of maximizing the third loss function value and minimizing the mutual information. In this way, the overlap between the timbre feature information and the emotion feature information can be reduced, so that the timbre feature information and the emotion feature information are decoupled.
  • an embodiment of the present application provides an electronic device, including a memory and a processor, the memory being coupled to the processor; the memory stores program instructions, and when the program instructions are executed by the processor, the electronic device executes the speech processing method in the first aspect or any possible implementation manner of the first aspect.
  • the third aspect and any implementation manner of the third aspect correspond to the first aspect and any implementation manner of the first aspect respectively.
  • technical effects corresponding to the third aspect and any implementation manner of the third aspect reference may be made to the technical effects corresponding to the above-mentioned first aspect and any implementation manner of the first aspect, and details are not repeated here.
  • an embodiment of the present application provides an electronic device, including a memory and a processor, the memory being coupled to the processor; the memory stores program instructions, and when the program instructions are executed by the processor, the electronic device executes the training method in the second aspect or any possible implementation of the second aspect.
  • the fourth aspect and any implementation manner of the fourth aspect correspond to the second aspect and any implementation manner of the second aspect respectively.
  • the technical effects corresponding to the fourth aspect and any one of the implementation manners of the fourth aspect refer to the above-mentioned second aspect and the technical effects corresponding to any one of the implementation manners of the second aspect, and details are not repeated here.
  • the embodiment of the present application provides a chip, including one or more interface circuits and one or more processors; the interface circuit is used to receive signals from the memory of the electronic device and send signals to the processor, the signals including computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device is made to execute the speech processing method in the first aspect or any possible implementation manner of the first aspect.
  • the fifth aspect and any implementation manner of the fifth aspect correspond to the first aspect and any implementation manner of the first aspect respectively.
  • the technical effects corresponding to the fifth aspect and any one of the implementation manners of the fifth aspect refer to the technical effects corresponding to the above-mentioned first aspect and any one of the implementation manners of the first aspect, and details are not repeated here.
  • the embodiment of the present application provides a chip, including one or more interface circuits and one or more processors; the interface circuit is used to receive a signal from the memory of the electronic device and send a signal to the processor, the signal including computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device is made to execute the training method in the second aspect or any possible implementation manner of the second aspect.
  • the sixth aspect and any implementation manner of the sixth aspect correspond to the second aspect and any implementation manner of the second aspect respectively.
  • the technical effects corresponding to the sixth aspect and any one of the implementation manners of the sixth aspect refer to the above-mentioned second aspect and the technical effects corresponding to any one of the implementation manners of the second aspect, and details are not repeated here.
  • the embodiment of the present application provides a computer storage medium; the computer-readable storage medium stores a computer program, and when the computer program runs on the computer or the processor, the computer or the processor executes the speech processing method in the first aspect or any possible implementation of the first aspect.
  • the seventh aspect and any implementation manner of the seventh aspect correspond to the first aspect and any implementation manner of the first aspect respectively.
  • the technical effects corresponding to the seventh aspect and any one of the implementation manners of the seventh aspect refer to the technical effects corresponding to the above-mentioned first aspect and any one of the implementation manners of the first aspect, and details are not repeated here.
  • the embodiment of the present application provides a computer storage medium; the computer-readable storage medium stores a computer program, and when the computer program runs on the computer or the processor, the computer or the processor executes the training method in the second aspect or any possible implementation of the second aspect.
  • the eighth aspect and any implementation manner of the eighth aspect correspond to the second aspect and any implementation manner of the second aspect respectively.
  • the technical effects corresponding to the eighth aspect and any one of the implementation manners of the eighth aspect refer to the above-mentioned second aspect and the technical effects corresponding to any one of the implementation manners of the second aspect, and details are not repeated here.
  • the embodiment of the present application provides a computer program product.
  • the computer program product includes a software program.
  • when the software program is executed by a computer or a processor, the steps of the speech processing method in the first aspect or any possible implementation of the first aspect are executed.
  • the ninth aspect and any implementation manner of the ninth aspect correspond to the first aspect and any implementation manner of the first aspect respectively.
  • the technical effects corresponding to the ninth aspect and any one of the implementation manners of the ninth aspect refer to the technical effects corresponding to the above-mentioned first aspect and any one of the implementation manners of the first aspect, and details are not repeated here.
  • the embodiment of the present application provides a computer program product, the computer program product includes a software program, and when the software program is executed by a computer or a processor, the steps of the training method in the second aspect or any possible implementation of the second aspect are executed.
  • the tenth aspect and any implementation manner of the tenth aspect correspond to the second aspect and any implementation manner of the second aspect respectively.
  • the embodiment of the present application provides a speech processing device, the device comprising:
  • the data acquisition module is used to obtain the unicast audio data of the unicast audiobook, the unicast audio data includes N pieces of source audio data, and N is a positive integer;
  • the role timbre analysis module is used to determine N sets of target timbre feature information matching the N pieces of source audio data from M sets of reference timbre feature information, where the M sets of reference timbre feature information correspond to M timbres, the timbre discrimination between any two of the M sets of reference timbre feature information is greater than the discrimination threshold, and M is a positive integer;
  • the character timbre conversion module is used to obtain N sets of expressive feature information corresponding to the N pieces of source audio data, and, based on the N sets of target timbre feature information and the N sets of expressive feature information, respectively perform timbre conversion on the N pieces of source audio data to generate multicast audio data.
  • the role timbre analysis module includes:
  • the audio sentence division module is used to divide the unicast audio data into audio sentences, so as to obtain N pieces of source audio data, where the N pieces of source audio data correspond to the audio sentences one by one;
  • the timbre feature extraction module is used to extract N groups of source timbre feature information corresponding to N pieces of source audio data by using a timbre extractor;
  • the timbre feature matching module is configured to determine N sets of target timbre feature information from M sets of reference timbre feature information based on N sets of source timbre feature information.
  • the timbre feature matching module is used to, for the i-th piece of source audio data among the N pieces of source audio data: respectively determine the similarity between each of the M sets of reference timbre feature information and the i-th piece of source audio data, and determine the reference timbre feature information with the highest similarity to the i-th piece of source audio data as the target timbre feature information matching the i-th piece of source audio data, where i is a positive integer ranging from 1 to N.
  • the character timbre conversion module includes:
  • the prosody feature extraction module is used to extract N sets of prosody feature information corresponding to N sets of source audio data by using a prosody extractor;
  • the emotional feature extraction module is used to extract N groups of emotional feature information corresponding to N groups of source audio data by using an emotion extractor;
  • An expressive feature generation module configured to generate N sets of expressive feature information corresponding to N pieces of source audio data based on N sets of prosodic feature information and N sets of emotional feature information.
  • the character timbre conversion module includes:
  • a content feature extraction module configured to use a content extractor to extract N sets of content feature information corresponding to N pieces of source audio data
  • the feature spectrum reconstruction module is used for adopting the spectrum reconstruction model to carry out spectrum reconstruction based on N groups of target timbre feature information, N groups of expressive force feature information and N groups of content feature information, to obtain N groups of audio spectrum feature information;
  • a frequency-time transformation module is used to perform frequency-time transformation on N groups of audio spectrum feature information respectively, so as to obtain N groups of target audio data;
  • the splicing module is used for splicing N sets of target audio data to obtain multicast audio data.
  • the audio sentence division module is used to perform voice activity detection (VAD) on the unicast audio data, dividing the unicast audio data into N pieces of source audio data.
  • the audio sentence division module is used to obtain the reading text of the unicast audiobook and divide the reading text into N text sentences; align the unicast audio data with the reading text to determine the N audio time intervals corresponding to the N text sentences in the unicast audio data; and divide the unicast audio data into N pieces of source audio data based on the N audio time intervals.
  • the timbre feature extraction module is used to determine the role corresponding to each piece of source audio data among the N pieces of source audio data; obtain N sets of initial timbre feature information corresponding to the N pieces of source audio data; and, for the X pieces of source audio data corresponding to a first character among the N pieces of source audio data, perform a weighted calculation based on the X sets of initial timbre feature information corresponding to the X pieces of source audio data and determine the weighted calculation result as the source timbre feature information corresponding to each of the X pieces of source audio data, where X is a positive integer.
  • a role determination module, used to, for the j-th piece of source audio data among the P2 pieces of source audio data: respectively calculate the similarity between the initial timbre feature information corresponding to each of the P1 pieces of source audio data and the initial timbre feature information corresponding to the j-th piece of source audio data; and, among the P1 pieces of source audio data, determine the role corresponding to the piece of source audio data whose initial timbre feature information has the highest similarity to that of the j-th piece as the role of the j-th piece of source audio data, where j is a positive integer ranging from 1 to P2.
  • the eleventh aspect and any implementation manner of the eleventh aspect correspond to the first aspect and any implementation manner of the first aspect respectively.
  • the technical effects corresponding to the eleventh aspect and any one of the implementation manners of the eleventh aspect refer to the above-mentioned first aspect and the technical effects corresponding to any one of the implementation manners of the first aspect, and details are not repeated here.
  • the embodiment of the present application provides a training device, which includes:
  • the collection module is used to collect training data.
  • the training data includes training audio data and reference role labels corresponding to the training audio data.
  • the expressiveness feature information of the training audio data meets the expressiveness condition.
  • the training audio data includes audio data recorded by multiple users using their own timbres, and/or audio data recorded by multiple users using false voices, where the timbre discrimination between different false voices used by the same user is greater than the discrimination threshold;
  • the feature information extraction module is used to input the training audio data to the emotion extractor, the content extractor, and the prosody extractor for calculation, so as to obtain the emotional feature information output by the emotion extractor, the content feature information output by the content extractor, and the prosodic feature information output by the prosody extractor; and to input the training audio data and the reference role label to the timbre extractor for calculation, so as to obtain the timbre feature information output by the timbre extractor;
  • the audio data reconstruction module is used to input the emotional feature information, content feature information, prosodic feature information, and timbre feature information into the spectrum reconstruction model to perform spectrum reconstruction to obtain audio spectrum feature information, and to perform frequency-time transformation on the audio spectrum feature information to obtain reconstructed audio data;
  • the backpropagation module is used to calculate the first loss function value based on the reconstructed audio data and the training audio data and, with the goal of minimizing the first loss function value, jointly adjust the model parameters of the emotion extractor, content extractor, prosody extractor, timbre extractor, and spectral reconstruction model.
  • the device further comprises:
  • the loss function value calculation module is used to input the timbre feature information into the first classifier for calculation to obtain the first role label, and calculate the second loss function value based on the first role label and the reference role label; and to input the emotional feature information into the second classifier for calculation to obtain the second role label, and calculate the third loss function value based on the second role label and the reference role label;
  • the timbre extractor training module is used to adjust the model parameters of the timbre extractor with the goal of minimizing the second loss function value and the mutual information of the timbre feature information and the emotional feature information;
  • the emotion extractor training module is used to adjust the model parameters of the emotion extractor with the aim of maximizing the third loss function value and minimizing the mutual information.
  • the twelfth aspect and any implementation manner of the twelfth aspect correspond to the second aspect and any implementation manner of the second aspect respectively.
  • For the technical effects corresponding to the twelfth aspect and any one of the implementation manners of the twelfth aspect refer to the above-mentioned second aspect and the technical effects corresponding to any one of the implementation manners of the second aspect, and details are not repeated here.
  • FIG. 1 is a schematic diagram of an exemplary application scenario
  • FIG. 2 is a schematic diagram of an exemplary application scenario
  • FIG. 3 is a schematic diagram of an exemplary processing process
  • FIG. 4 is a schematic diagram of an exemplary processing process
  • Fig. 5 is the schematic structural view of the model shown by way of example.
  • FIG. 6 is a schematic diagram of an exemplary training process
  • FIG. 7 is a schematic diagram of an exemplary training process
  • FIG. 8 is a schematic diagram of an exemplary information extraction process
  • Fig. 9a is a schematic diagram of an exemplary information extraction process
  • Fig. 9b is a schematic diagram of an information matching process exemplarily shown.
  • Fig. 10 is a schematic diagram of timbre conversion shown by way of example.
  • FIG. 11 is a schematic diagram of an exemplary processing process
  • FIG. 12 is a schematic diagram of an exemplary information extraction process
  • Fig. 13a is a schematic structural diagram of an exemplary voice processing device
  • Fig. 13b is a schematic structural diagram of an exemplary voice processing device
  • Fig. 14 is a schematic structural view of an exemplary training device
  • Fig. 15 is a schematic structural diagram of the device shown exemplarily.
  • first and second in the description and claims of the embodiments of the present application are used to distinguish different objects, rather than to describe a specific order of objects.
  • first target object, the second target object, etc. are used to distinguish different target objects, rather than describing a specific order of the target objects.
  • words such as “exemplary” or “for example” are used as examples, illustrations or illustrations. Any embodiment or design scheme described as “exemplary” or “for example” in the embodiments of the present application shall not be interpreted as being more preferred or more advantageous than other embodiments or design schemes. Rather, the use of words such as “exemplary” or “such as” is intended to present related concepts in a concrete manner.
  • multiple processing units refer to two or more processing units; multiple systems refer to two or more systems.
  • Fig. 1 is a schematic diagram of an exemplary application scenario.
  • a possible scenario is a scenario where a user plays an audiobook.
  • the user can start the audiobook application in the mobile phone and enter the audiobook application main interface 101.
  • the audiobook application main interface 101 can include one or more controls, including but not limited to: search box, search option, Reading list, text reading option, audio reading option 102 and more.
  • the user may input query words in the search box, and click a search option to query desired reading materials.
  • the user may perform a page turning operation or a sliding operation in the reading material list to find the desired reading material.
  • the user may click the text reading option to enter the text reading interface for text reading.
  • after the user finds the desired reading material in the reading material list or through a search, he or she can click the audio reading option 102, and the mobile phone can respond to the user's operation behavior and play the corresponding audio data.
  • audiobook applications include unicast audiobooks and multicast audiobooks, where unicast audiobooks refer to audiobooks recorded by a single voice actor performing multi-role interpretation through false voices; the timbre differentiation between the roles is low, and the expressiveness of the scene interpretation is low.
  • Multicast audiobooks refer to audiobooks recorded by multiple voice actors for different roles. Each role has a high degree of timbre differentiation and a high degree of scene interpretation. Wherein, if the timbre discrimination of the two timbres is greater than or equal to the discrimination threshold, it can be determined that the timbre discrimination of the two timbres is high.
  • the discrimination threshold can be set according to requirements, which is not limited in this application.
  • the mobile phone can respond to the user's operation behavior, convert the unicast audiobook into a multicast audiobook, and then play the converted multicast audiobook, so as to improve the scene interpretation expressiveness of the audiobook and help users quickly and fully understand the plot.
  • Fig. 2 is a schematic diagram of an exemplary application scenario.
  • a possible scenario is a scenario in which an audiobook creator makes an audiobook.
  • an audiobook producer can start the audiobook production platform, enter the main interface 201 of the audiobook production platform, and create an audiobook.
  • the main interface 201 of the audiobook production platform may include one or more controls, including but not limited to: options for produced audiobooks, options for making audiobooks, and the like.
  • the electronic device may respond to the user's action to enter the editing interface of the produced audiobook, and edit the produced audiobook such as changing the name and cover.
  • the user can click the audiobook production option 202, and the electronic device can respond to the user's operation to enter the audiobook production interface, where the user can produce audiobooks (such as unicast audiobooks and multicast audiobooks).
  • the production of multicast audiobooks in the prior art requires multiple voice actors to be assigned to the roles in the reading and to record separately, so the production time is long and the efficiency is low, while a unicast audiobook only requires a single voice actor to perform multiple roles. Therefore, audiobook creators can record unicast audiobooks first and then convert the recorded unicast audiobooks into multicast audiobooks. In this way, the production time and production cost of the multicast audiobook can be reduced, and the production efficiency of the multicast audiobook can be improved.
  • the audiobook producer can perform a conversion operation to convert the produced unicast audiobook into a multicast audiobook. Subsequently, after the user enters the main interface of the audiobook application and clicks the audio reading option 102 of a certain multicast audiobook, the mobile phone can quickly play the audio data of the multicast audiobook in response to the user's operation behavior.
  • reading materials in this application may include reading materials with multiple roles, such as novels, stories, sketches and so on.
  • a unicast audiobook may include unicast audio data and reading text.
  • the unicast audio data may include multiple pieces of source audio data, each piece of source audio data includes multiple frames of audio data, and each piece of source audio data corresponds to one audio sentence.
  • the reading text includes multiple text sentences.
  • Fig. 3 is a schematic diagram of an exemplary processing procedure.
  • the electronic device may include a character timbre analysis module and a character timbre conversion module. It should be understood that the electronic device shown in FIG. 3 is only an example of the electronic device, and the electronic device may have more or fewer modules than those shown in the figure, which is not limited by the present application.
  • the role timbre analysis module is used to analyze target timbre characteristic information matching with each piece of source audio data in the unicast audio data.
  • the character timbre conversion module is used to convert the timbre of each piece of source audio data in the unicast audio data.
  • this application can convert unicast audiobooks into multicast audiobooks through the character timbre analysis module and character timbre conversion module in the electronic device, and the process can be as follows:
  • the unicast audio data of the unicast audiobook corresponding to the audio reading option 102 may be obtained in response to the user's operation behavior, and the obtained unicast audio data is then input to the character timbre analysis module.
  • the unicast audio data of the unicast audiobook corresponding to the conversion operation may be obtained in response to the user's operation behavior, and the obtained unicast audio data is then input to the role timbre analysis module.
  • when an audiobook maker performs a conversion operation, in the audiobook creation interface, on a unicast audiobook recorded in the process of making a multicast audiobook, the electronic device can respond to the user's operation behavior, obtain the unicast audio data of the unicast audiobook corresponding to the conversion operation, and input the obtained unicast audio data to the role timbre analysis module.
  • the character timbre analysis module outputs N pieces of source audio data in the unicast audio data and corresponding matching N sets of target timbre feature information.
  • N is a positive integer.
  • Fig. 4 is a schematic diagram of an exemplary processing procedure.
  • the role timbre analysis module may divide the unicast audio data into audio sentences to obtain N pieces of source audio data. Then perform timbre feature information extraction on the N pieces of source audio data respectively to obtain N sets of source timbre feature information corresponding to the N pieces of source audio data, wherein a set of source timbre feature information corresponds to a piece of source audio data. Matching is then performed based on N sets of source timbre feature information corresponding to the N pieces of source audio data, and N sets of target timbre feature information corresponding to the N pieces of source audio data are determined, wherein a set of target timbre feature information corresponds to a piece of source audio data.
  • the timbre characteristic information can be used to characterize the information related to the timbre, which can include but not limited to: personality characteristics, gender characteristics, age characteristics, and vocal region (such as high-pitched, middle-pitched, and low-pitched) features. This is not limited.
  • the reading material may include multiple chapters, and the unicast audio data may be divided into audio chapters to obtain W (W is a positive integer) pieces of chapter audio data, and each chapter audio data corresponds to a chapter. Then, audio sentence division can be performed on each section of chapter audio data in W sections of chapter audio data, and each section of chapter audio data can be divided into R (R is a positive integer) pieces of source audio data, so as to improve the efficiency of audio sentence division.
  • that is, N = W × R.
  • each frame of audio data in the unicast audio data has a chapter identifier, and the chapter identifier is used to uniquely identify a chapter. Furthermore, in a possible manner, the role timbre analysis module can determine the frames of audio data belonging to the same chapter through the chapter identifier of each frame of audio data in the unicast audio data, so that the unicast audio data can be divided into W sections of chapter audio data.
  • the role timbre analysis module can use VAD (Voice Activity Detection, voice activity detection) detection to divide the unicast audio data into W sections of chapter audio data.
  • the character timbre analysis module may use VAD to detect whether the time interval between two adjacent frames of audio data is greater than or equal to the first preset duration. If so, a chapter division can be made between the two adjacent frames: the previous frame of audio data is taken as the end of the chapter audio data of the previous chapter, and the next frame of audio data is taken as the beginning of the chapter audio data of the next chapter.
  • text recognition may be performed on unicast audio data to obtain a corresponding text recognition result.
  • text analysis may be performed based on the text recognition result, and the unicast audio data may be divided into audio chapters to obtain W sections of chapter audio data.
  • for example, the text recognition result may include chapter-distinguishing text such as "Chapter *". The audio time interval corresponding to the chapter-distinguishing text in the unicast audio data can be determined, and the endpoints of that audio time interval can be used for audio chapter segmentation.
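  • a small sketch of how chapter-distinguishing text could be located in a recognized transcript is given below (the transcript format, field names, and regular expression are assumptions, not part of the patent):

        import re

        # Hypothetical transcript format: each recognized sentence carries its audio time
        # interval, e.g. {"text": "Chapter 3 ...", "start": 1250.0, "end": 1254.2}.
        CHAPTER_PATTERN = re.compile(r"^(chapter\s+\S+|第.+?章)", re.IGNORECASE)

        def chapter_boundaries(transcript):
            """Return the start times of sentences that look like chapter headings; these
            endpoints can then be used to cut the unicast audio into chapter audio data."""
            return [seg["start"] for seg in transcript
                    if CHAPTER_PATTERN.match(seg["text"].strip())]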
  • the unicast audio data may be stored in chapters.
  • a piece of chapter audio data corresponding to a chapter may be directly obtained each time, and then audio sentence division is performed on the chapter audio data to obtain R pieces of source audio data.
  • the pause between adjacent two pieces of source audio data is greater than the second preset duration and shorter than the first preset duration
  • the second preset duration is shorter than the first preset duration
  • the second preset duration can be set according to requirements, which is not limited in this application.
  • for the k-th section of chapter audio data among the W sections (k is a positive integer ranging from 1 to W; the range from 1 to W includes 1 and W, that is, k may be 1 or W), the role timbre analysis module can use VAD detection to divide the k-th section of chapter audio data into R pieces of source audio data.
  • the role timbre analysis module can detect, through VAD, whether the time interval between two adjacent frames of audio data in the k-th section of chapter audio data is greater than or equal to the second preset duration. If so, an audio sentence division can be made between the two adjacent frames: the previous frame of audio data is taken as the end of the source audio data of the previous sentence, and the next frame of audio data is taken as the beginning of the source audio data of the next sentence.
  • unicast audio data may not be divided into audio chapters, but audio sentences may be divided directly, which is not limited in the present application.
  • the unicast audio data is divided into audio sentences to obtain N pieces of source audio data.
  • the timbre extractor may be pre-trained, and then the trained timbre extractor is used to extract timbre feature information to obtain N sets of source timbre feature information corresponding to N pieces of source audio data.
  • Fig. 5 is a schematic structural diagram of the model shown exemplarily.
  • the conversion model may include: a timbre extractor, an expressive force extraction module, a content extractor and a spectral reconstruction model.
  • the expression extraction module includes but not limited to: a prosody extractor and an emotion extractor. It should be understood that the conversion model shown in FIG. 5 is only an example of the conversion model, and the conversion model may have more or fewer modules than shown in the figure, which is not limited by the present application.
  • the timbre extractor, prosody extractor, emotion extractor, content extractor and spectrum reconstruction model in the conversion model can be jointly trained.
  • multiple pieces of highly expressive training audio data recorded by multiple users using their own timbres may be collected; wherein, each user records at least one piece of training audio data.
  • each segment of training audio data may include at least one piece of training audio data, and each piece of training audio data corresponds to one sentence.
  • high expressiveness may mean that the expressiveness feature information satisfies the expressiveness condition.
  • the expressive feature information may include prosodic feature information and emotional feature information, and the expressive force condition includes prosodic condition and emotional condition.
  • High expressiveness may mean that the prosodic feature information satisfies the prosody condition and the emotional feature information satisfies the emotional condition.
  • the expression condition, prosody condition and emotion condition can be set according to requirements, which is not limited in this application.
  • a corresponding reference tone label can be added to the piece of training audio data.
  • the role information includes but not limited to: gender, age, personality, vocal range, etc.
  • the character information may be encoded (such as with one-hot encoding) to obtain the reference timbre label.
  • a piece of training audio data and a reference tone label corresponding to the piece of training audio data can be used as a set of training data, and then multiple sets of training data can be obtained.
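  • one way the reference timbre label could be formed from the role information is sketched below (the category vocabularies and the concatenated one-hot layout are assumptions; the patent only states that the character information may be one-hot encoded):

        import numpy as np

        # Illustrative category vocabularies; the patent does not fix these values.
        GENDERS = ["female", "male"]
        AGE_GROUPS = ["child", "young", "middle_aged", "elderly"]
        VOCAL_RANGES = ["low", "middle", "high"]

        def one_hot(value, vocab):
            vec = np.zeros(len(vocab))
            vec[vocab.index(value)] = 1.0
            return vec

        def reference_timbre_label(gender, age_group, vocal_range):
            """Concatenate one-hot encodings of the role information into one label vector."""
            return np.concatenate([one_hot(gender, GENDERS),
                                   one_hot(age_group, AGE_GROUPS),
                                   one_hot(vocal_range, VOCAL_RANGES)])

        # e.g. reference_timbre_label("female", "young", "high") -> a 9-dimensional label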
  • the transformation model can then be trained using multiple sets of training data.
  • This application uses a set of training data to train the conversion model as an example for illustration.
  • Fig. 6 is a schematic diagram of a training process exemplarily shown.
  • the training audio data in the training data can be input into the prosody extractor, the emotion extractor and the content extractor respectively.
  • the training audio data in the training data and the corresponding reference tone label may be input to the tone extractor.
  • after the timbre extractor receives the training audio data and the corresponding reference timbre label, it can perform forward calculation on the training audio data and the reference timbre label, and output the timbre feature information to the spectral reconstruction model.
  • the timbre characteristic information may be represented by a vector or a sequence.
  • the prosody extractor may perform forward calculation on the training audio data, and output the prosody feature information to the spectral reconstruction model.
  • the prosodic feature information may be used to represent vocal delivery information of the speech, such as stress, pacing, and emphasis, and may be represented by a vector or a sequence.
  • the emotion extractor may perform forward calculation on the training audio data, and output the emotion feature information to the spectral reconstruction model.
  • the emotional feature information can be used to represent the type of emotion (such as happy, sad, high-pitched, low-pitched, etc.), and the scale of attitude (such as affirmation, negation, praise, irony, etc.), which can be represented by vector or sequence.
  • the content extractor may perform forward calculation on the training audio data, and output the content feature information to the spectral reconstruction model.
  • content feature information may be used to characterize voice content.
  • the content feature information may be a phoneme feature, which may be represented by a vector or a sequence.
  • the content feature information may be the posterior probability of the phoneme (that is, the probability distribution of the phoneme), which may be represented by a matrix.
  • the spectrum reconstruction model can perform spectrum reconstruction based on the timbre feature information, prosody feature information, emotion feature information, and content feature information, and output audio spectrum feature information.
  • the audio spectral feature information output by the spectral reconstruction model may be subjected to time-domain conversion to obtain reconstructed audio data of the audio spectral feature information in the time domain. That is, the spectral reconstruction model only performs spectral reconstruction, while the time domain transformation is performed by other modules.
  • the spectral reconstruction model can first perform spectrum reconstruction based on the timbre feature information, prosody feature information, emotional feature information, and content feature information to obtain the audio spectrum feature information, and then perform time-domain conversion on the audio spectrum feature information to obtain and output the reconstructed audio data of the audio spectrum feature information in the time domain. That is, spectral reconstruction and frequency-time transformation are both performed by the spectral reconstruction model.
  • the present application does not limit whether the spectrum reconstruction model only performs spectrum reconstruction, or performs spectrum reconstruction and frequency-time transformation.
  • the reconstructed audio data of the audio spectral feature information in the time domain may be compared with the training audio data in the training data, and the corresponding first loss function value may be calculated. Then with the goal of minimizing the value of the first loss function, the model parameters of the timbre extractor, prosody extractor, emotion extractor, content extractor and spectral reconstruction model in the conversion model are adjusted.
  • each set of training data can be used to train the conversion model until the value of the first loss function meets the first loss condition, or the number of training times of each module in the conversion model meets the corresponding training-times condition, or the performance of each module in the conversion model meets the corresponding performance condition.
  • the first loss condition, the training frequency condition and the performance condition can all be set according to requirements, which is not limited in this application.
  • the training frequency conditions of different modules in the conversion model may be different or the same; the performance conditions of different models may be different, which is not limited in this application.
  • the content extractor can also be trained independently of the timbre extractor, prosody extractor, emotion extractor and spectrum reconstruction model; this application does not limit this.
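  • a single joint training step of the conversion model might look roughly like the following PyTorch-style sketch (the module interfaces, the batch layout, the use of mel spectrograms, and the L1 reconstruction loss are assumptions; the patent only specifies that a first loss function value computed from the reconstructed and training audio data is minimized to jointly adjust the five modules):

        import torch.nn.functional as F

        def joint_training_step(batch, timbre_ext, prosody_ext, emotion_ext, content_ext,
                                spectrum_model, optimizer):
            """One illustrative joint update over one set of training data. `batch` is assumed
            to hold the training waveform, its mel spectrogram and the reference timbre label."""
            audio, mel, ref_label = batch["audio"], batch["mel"], batch["label"]

            timbre = timbre_ext(audio, ref_label)    # timbre feature information
            prosody = prosody_ext(audio)             # prosodic feature information
            emotion = emotion_ext(audio)             # emotional feature information
            content = content_ext(audio)             # content feature information

            recon_mel = spectrum_model(timbre, prosody, emotion, content)
            loss = F.l1_loss(recon_mel, mel)         # stands in for the first loss function value

            optimizer.zero_grad()
            loss.backward()                          # gradients flow into all five modules
            optimizer.step()
            return loss.item()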
  • Fig. 7 is a schematic diagram of an exemplary training process.
  • the timbre extractor performs forward calculation based on the training audio data and the reference timbre label in the training data, and after obtaining the timbre feature information, it can output the timbre feature information to the first classifier and the mutual information module respectively.
  • the emotion extractor performs forward calculation based on the training audio data in the training data, and after obtaining the emotion feature information, can output the emotion feature information to the second classifier and the mutual information module respectively.
  • the first classifier may perform calculation based on the timbre feature information, and output the first timbre label.
  • the second classifier may perform calculation based on the emotional feature information, and output the second timbre label.
  • the mutual information module may perform calculation based on the timbre feature information and the emotion feature information, and calculate the mutual information between the timbre feature information and the emotion feature information.
  • mutual information can be the amount of information contained in one variable about another variable
  • the mutual information between the timbre feature information and the emotional feature information refers to the amount of information about the emotional feature information contained in the timbre feature information, or the amount of information about the timbre feature information contained in the emotional feature information.
  • the second loss function value may be calculated based on the first timbre label and the reference timbre label in the training data
  • the third loss function value may be calculated based on the second timbre label and the reference timbre label in the training data.
  • with the goal of minimizing the second loss function value and the mutual information, the model parameters of the timbre extractor are adjusted.
  • with the goal of maximizing the third loss function value and minimizing the mutual information, the model parameters of the emotion extractor are adjusted.
  • the difference between the emotion feature information extracted by the emotion extractor and the timbre feature information extracted by the timbre extractor can be increased, so that the emotion feature information extracted by the emotion extractor and the timbre feature information extracted by the timbre extractor are decoupled.
  • each set of training data can be used to train the timbre extractor and the emotion extractor: training of the timbre extractor is stopped when the second loss function value satisfies the second loss condition, or the number of training times of the timbre extractor meets the training-times condition of the timbre extractor, or the performance of the timbre extractor meets the performance condition of the timbre extractor; and training of the emotion extractor is stopped when the third loss function value satisfies the third loss condition, or the number of training times of the emotion extractor meets the training-times condition of the emotion extractor, or the performance of the emotion extractor meets the performance condition of the emotion extractor.
  • both the second loss condition and the third loss condition may be set according to requirements, which is not limited in this embodiment of the present application.
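  • the decoupling objectives described above can be written compactly as follows (a sketch only; the cross-entropy classifiers and the generic mutual-information estimator are assumptions, while the optimization goals themselves come from the text):

        import torch.nn.functional as F

        def decoupling_objectives(audio, ref_label, ref_label_id,
                                  timbre_ext, emotion_ext, clf_timbre, clf_emotion,
                                  mi_estimator):
            """Returns the optimisation objectives for the timbre extractor and the emotion
            extractor. clf_timbre / clf_emotion are the first / second classifiers (assumed to
            output logits over the reference labels), and mi_estimator is any estimator of the
            mutual information between the two feature sets."""
            timbre = timbre_ext(audio, ref_label)                              # timbre features
            emotion = emotion_ext(audio)                                       # emotion features
            second_loss = F.cross_entropy(clf_timbre(timbre), ref_label_id)   # second loss value
            third_loss = F.cross_entropy(clf_emotion(emotion), ref_label_id)  # third loss value
            mi = mi_estimator(timbre, emotion)                                 # mutual information
            timbre_objective = second_loss + mi      # minimised w.r.t. the timbre extractor
            emotion_objective = -third_loss + mi     # minimising this maximises the third loss
            return timbre_objective, emotion_objective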
  • the N pieces of source audio data in the unicast audio data can be sequentially input into the trained timbre extractor, and the timbre extractor can extract the N sets of source timbre feature information corresponding to the N pieces of source audio data.
  • Fig. 8 is a schematic diagram of an information extraction process exemplarily shown.
  • the source audio data 1 is input to the trained timbre extractor, and the timbre feature information A can be output.
  • the timbre feature information A is the source timbre feature information corresponding to the source audio data 1 .
  • the source audio data 2 is input to the trained timbre extractor, and the timbre feature information B can be output.
  • the timbre feature information B is the source timbre feature information corresponding to the source audio data 2 .
  • the source audio data 3 is input to the trained timbre extractor, and the timbre feature information A can be output.
  • the timbre feature information A is the source timbre feature information corresponding to the source audio data 3 .
  • the source audio data 4 is input to the trained timbre extractor to output timbre feature information B, which is the source timbre feature information corresponding to the source audio data 4.
  • the source audio data 5 is input to the trained timbre extractor to output timbre feature information C, which is the source timbre feature information corresponding to the source audio data 5.
  • the timbre feature information of the source audio data 1 and the source audio data 3 are the same, that is, the source audio data 1 and the source audio data 3 are audio data of the same character.
  • the timbre characteristic information of the source audio data 2 and the source audio data 4 are the same, that is to say, the source audio data 2 and the source audio data 4 are audio data of the same character.
  • the training audio data includes training audio data recorded by different users, whose timbres are highly differentiated, and training audio data recorded by the same user using multiple highly differentiated false voices; that is, the timbres corresponding to the pieces of training audio data have a high degree of discrimination. Therefore, in order to improve the timbre discrimination of different characters in the unicast audio data, the timbre of each piece of source audio data in the unicast audio data may be converted into the timbre, among the timbres corresponding to the training audio data, that matches the timbre of that piece of source audio data.
  • Fig. 9a is a schematic diagram of an information extraction process exemplarily shown.
  • the training audio data in each group of training data can be input to the trained timbre extractor, and the corresponding timbre feature information is output.
  • the timbre characteristic information is called reference timbre characteristic information.
  • the training audio data in the set of training data may be divided into multiple pieces of training audio data in the manner described above. Then, the trained timbre extractor is used to extract the reference timbre feature information of each piece of training audio data.
  • the training audio data includes audio data recorded using M (M is a positive integer) kinds of timbres (including the user's own timbre and pseudo-sound), and each timbre corresponds to recording multiple pieces of training audio data.
  • for the r-th timbre among the M timbres (r is a positive integer ranging from 1 to M, inclusive, that is, r may be 1 and may also be M), the reference timbre feature information corresponding to the multiple pieces of training audio data recorded using the r-th timbre can be weighted to obtain the reference timbre feature information corresponding to the r-th timbre.
  • the reference tone color feature information corresponding to the rth tone color may be referred to as the rth set of reference tone color feature information.
  • the weighting calculation may be calculating an average value.
  • M sets of reference timbre feature information can be obtained.
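  • As an illustration, the weighted calculation can be as simple as averaging the embeddings extracted from all pieces of training audio recorded with the same timbre. The sketch below assumes the timbre feature information is available as fixed-length NumPy vectors; the function and variable names are hypothetical.

```python
import numpy as np

def build_reference_timbres(timbre_embeddings):
    """timbre_embeddings: dict mapping each of the M timbre ids to a list of
    embeddings extracted from the pieces of training audio recorded with that timbre.
    Returns one reference timbre embedding per timbre (here, a plain average)."""
    reference = {}
    for timbre_id, embs in timbre_embeddings.items():
        reference[timbre_id] = np.mean(np.stack(embs), axis=0)
    return reference
```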
  • reference timbre feature information that matches the source timbre feature information corresponding to each piece of source audio data can then be searched for from the multiple sets of reference timbre feature information, so as to find, among the timbres corresponding to the training audio data, a timbre that matches the timbre of each piece of source audio data in the unicast audio data. For convenience of distinction and description, the reference timbre feature information that matches the source timbre feature information corresponding to a piece of source audio data may be referred to as target timbre feature information.
  • an i-th (i is a positive integer, ranging from 1 to N) source audio data among N pieces of source audio data is taken as an example for illustration.
  • the similarity between each of the M sets of reference timbre feature information and the source timbre feature information of the i-th piece of source audio data can be calculated, and the reference timbre feature information with the highest similarity to the source timbre feature information of the i-th piece of source audio data can be used as the target timbre feature information.
  • N sets of target timbre feature information matching N pieces of source audio data can be searched from M sets of reference timbre feature information.
  • the target timbre characteristic information matched with different source audio data may be the same or different.
  • the distance information between each of the M sets of reference timbre feature information and the source timbre feature information of the i-th piece of source audio data can be calculated, and the distance information can be used as the similarity between the source timbre feature information of the i-th piece of source audio data and the reference timbre feature information.
  • the distance information is inversely proportional to the similarity: the greater the distance, the lower the similarity; the smaller the distance, the higher the similarity.
  • the distance information between the source timbre feature information of the source audio data and each set of reference timbre feature information can be determined by calculating, for example, the Euclidean distance, cosine similarity, or Minkowski distance; the specific distance measure is not limited in this application. A minimal matching sketch is given below.
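  • Under the same assumptions as above (fixed-length embeddings), cosine similarity can for example be used as the similarity measure; the names in this sketch are illustrative only.

```python
import numpy as np

def match_target_timbre(source_emb, reference):
    """Return the id of the reference timbre whose embedding has the highest
    cosine similarity to the source timbre embedding of one piece of source audio."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(reference, key=lambda tid: cosine(source_emb, reference[tid]))
```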
  • Fig. 9b is a schematic diagram of an information matching process exemplarily shown.
  • FIG. 9 b it is exemplarily described by taking the source timbre feature information of the source audio data 3 as an example to search for matching target timbre feature information from M sets of reference timbre feature information.
  • the source timbre feature information of the source audio data 3 is timbre feature information A.
  • the distance information between timbre feature information A and reference timbre feature information 1 can be calculated to obtain distance information 1, the distance information between timbre feature information A and reference timbre feature information 2 can be calculated to obtain distance information 2, and so on for the remaining sets of reference timbre feature information.
  • assuming that reference timbre feature information 2 has the highest similarity to timbre feature information A, the reference timbre feature information 2 matches the timbre feature information A; that is, the reference timbre feature information 2 is the target timbre feature information that matches the source timbre feature information of the source audio data 3.
  • the character tone conversion module outputs the multicast audio data of the multicast audiobook.
  • the character timbre conversion module can use the trained prosody extractor, emotion extractor, content extractor, and spectrum reconstruction model to convert the unicast audiobook into a multicast audiobook based on the target timbre feature information of each piece of source audio data, that is, to convert the unicast audio data into multicast audio data.
  • each piece of source audio data in the N pieces of source audio data of the unicast audio data can be sequentially input to the trained prosody extractor, and the trained prosody extractor can extract N sets of prosodic feature information corresponding to the N pieces of source audio data.
  • each piece of source audio data in the N pieces of source audio data of the unicast audio data can be sequentially input to the trained emotion extractor, and the trained emotion extractor can extract N sets of emotional feature information corresponding to the N pieces of source audio data.
  • each piece of source audio data in the N pieces of source audio data of the unicast audio data can be sequentially input to the trained content extractor, and the trained content extractor can extract N sets of content feature information corresponding to the N pieces of source audio data.
  • the following takes the i-th piece of source audio data among the N pieces of source audio data as an example to illustrate how to convert the character timbre of the source audio data.
  • the emotion feature information extracted by the trained emotion extractor for the i-th piece of source audio data, the prosody feature information extracted by the trained prosody extractor for the i-th piece of source audio data, the content feature information extracted by the trained content extractor for the i-th piece of source audio data, and the matching target timbre feature information corresponding to the i-th piece of source audio data are input to the trained spectral reconstruction model.
  • the trained spectrum reconstruction model performs spectrum reconstruction based on the emotion feature information, prosody feature information, content feature information and target timbre feature information of the i-th source audio data, and can obtain and output the i-th group of audio spectrum feature information. Then, the i-th group of audio spectrum feature information can be time-domain converted to obtain the i-th piece of target audio data after timbre conversion.
  • Fig. 10 is a schematic diagram of timbre conversion exemplarily shown.
  • the source audio data 3 can be respectively input to the prosody extractor, the emotion extractor, and the content extractor, to obtain the prosody feature information 3 output by the prosody extractor, the emotion feature information 3 output by the emotion extractor, and the content feature information 3 output by the content extractor.
  • the prosody feature information 3, emotion feature information 3, content feature information 3, and the target timbre feature information correspondingly matched with the source audio data 3 (that is, reference timbre feature information 2) are input to the spectral reconstruction model.
  • the spectrum reconstruction model can perform spectrum reconstruction based on prosody feature information 3 , emotion feature information 3 , content feature information 3 and reference timbre feature information 2 , and output audio spectrum feature information 3 .
  • frequency-time transformation is performed on the audio spectrum characteristic information 3 to obtain the target audio data 3 .
  • the target audio data 3 is the audio data after timbre conversion corresponding to the source audio data 3 .
  • the trained spectrum reconstruction model performs spectrum reconstruction based on the emotion feature information, prosody feature information, content feature information and target timbre feature information of the i-th source audio data, and the i-th group of audio spectrum feature information can be obtained; and then Time-domain conversion is performed on the i-th group of audio spectrum feature information to obtain and output the i-th target audio data after timbre conversion.
  • N pieces of target audio data can be obtained, and then N pieces of target audio data can be used for splicing to obtain multicast audio data, that is, audio data of multicast audiobooks.
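  • A compact sketch of this conversion loop is given below. It assumes the trained extractors, the spectrum reconstruction model, and a frequency-to-time converter (for example a vocoder or inverse STFT) are available as callables; all names are hypothetical, and error handling is omitted.

```python
import numpy as np

def convert_unicast_to_multicast(source_clips, target_timbres, prosody_ext,
                                 emotion_ext, content_ext, spectrum_model, to_waveform):
    """source_clips: the N pieces of source audio data; target_timbres: their matched
    target timbre feature information. Returns the spliced multicast audio data."""
    converted = []
    for clip, timbre_emb in zip(source_clips, target_timbres):
        prosody = prosody_ext(clip)
        emotion = emotion_ext(clip)
        content = content_ext(clip)
        spec = spectrum_model(prosody, emotion, content, timbre_emb)  # audio spectrum features
        converted.append(to_waveform(spec))                          # frequency-time transformation
    return np.concatenate(converted)  # splice the N pieces of target audio data
```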
  • in this way, since each piece of source audio data in the unicast audio data has high expressiveness (including prosodic expressiveness and emotional expressiveness), spectral reconstruction is performed based on the expressiveness feature information of each piece of source audio data and its matching target timbre feature information to generate the multicast audio data; on the premise of ensuring the expressiveness of the source audio data in the unicast audio data, the character timbre of the source audio data can be converted, realizing the conversion of the unicast audio data into multicast audio data.
  • it is thus possible to improve the timbre distinction of the characters in the reading, thereby improving the scene deduction performance of the audiobook and making it easier for users to understand the plot.
  • in addition, the present application can automatically match a timbre for each piece of source audio data, which improves the efficiency of converting unicast audio data into multicast audio data, simplifies user operations, and improves user experience.
  • in the related art, unicast audiobooks that have already been produced need to be re-recorded by multiple voice actors according to role assignment in order to obtain multicast audiobooks.
  • the method of the present application can directly convert unicast audiobooks into multicast audiobooks, with high efficiency and low cost.
  • in a possible implementation, sentence division may be performed on the unicast audio data in combination with the reading text of the unicast audiobook, so as to improve the accuracy of sentence division and the accuracy of analyzing the role corresponding to each piece of source audio data in the unicast audio data, thereby improving the accuracy of converting the unicast audiobook into a multicast audiobook.
  • Fig. 11 is a schematic diagram of an exemplary processing procedure.
  • the electronic device may include a character timbre analysis module and a character timbre conversion module. It should be understood that the electronic device shown in FIG. 11 is only an example of the electronic device, and the electronic device may have more or fewer modules than shown in the figure, which is not limited by the present application.
  • the functions of the role timbre analysis module and the role timbre conversion module can refer to the description above, and will not be repeated here.
  • the process of converting a unicast audiobook to a multicast audiobook can be as follows:
  • for example, in response to the user's operation behavior, the unicast audio data and reading text of the unicast audiobook corresponding to the audio reading option 102 can be obtained, and then the acquired unicast audio data and reading text are input to the character timbre analysis module.
  • for another example, in response to the user's operation behavior, the unicast audio data and reading text of the unicast audiobook corresponding to the conversion operation can be obtained, and then the acquired unicast audio data and reading text are input to the character timbre analysis module.
  • for another example, when an audiobook maker receives, in the audiobook creation interface, a conversion operation on a unicast audiobook recorded in the process of making a multicast audiobook, the unicast audio data and reading text of the unicast audiobook corresponding to the conversion operation can be obtained in response to the user's operation behavior, and the acquired unicast audio data and reading text are input to the character timbre analysis module.
  • the character timbre analysis module outputs N pieces of source audio data in the unicast audio data and corresponding matching N sets of target timbre feature information.
  • Fig. 12 is a schematic diagram of an information extraction process exemplarily shown.
  • the reading text can be divided into text sentences first to obtain N text sentences; then, in combination with the N text sentences, the unicast audio data can be divided into audio sentences to obtain N pieces of source audio data.
  • the reading text may be divided into text chapters first to obtain W sections of chapter text.
  • text analysis may be performed on the reading text, and the text is distinguished according to chapters in the reading text (such as "Chapter *", "chapter*”, etc.), and the reading text is divided into W sections of chapter text.
  • it should be noted that text chapter division may be performed before the text is divided into text sentences, or text chapter division may be performed after the text is divided into text sentences.
  • character name recognition may be performed to identify the roles contained in the section of chapter text. Then perform role dialogue segmentation on the chapter text to obtain multiple dialogue texts, and analyze the roles corresponding to each dialogue. Then divide each paragraph of dialogue text into text sentences to obtain R text sentences, and use the role of the dialogue text to which each text sentence belongs as the corresponding role of the text sentence.
  • the reading text may be stored in chapters.
  • a piece of chapter text corresponding to a chapter may be directly obtained each time, and then text sentences are divided into the chapter text to obtain R text sentences.
  • the role name of the corresponding role may be used for identification.
  • an unknown role label (such as Unknown) can be used for identification.
  • the chapter text is divided into text sentences, and the obtained text sentences and corresponding role names can be shown in Table 1:
  • in this way, text sentence division is performed on each of the W paragraphs of chapter text obtained by dividing the text chapters, so as to divide the reading text into N text sentences. A rough text-splitting sketch is given below.
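  • This sketch illustrates one possible way to divide a reading text into chapters and then into sentences; the chapter-heading and sentence-boundary patterns are assumptions, not rules taken from this application.

```python
import re

def split_reading_text(reading_text):
    """Split the reading text into W chapter texts by headings such as "Chapter 1" or
    "第*章", then split each chapter text into text sentences."""
    chapters = re.split(r"\n(?=Chapter\s+\d+|第.+?章)", reading_text)
    sentences_per_chapter = []
    for chapter in chapters:
        sentences = [s.strip() for s in re.split(r"(?<=[。！？.!?])\s*", chapter) if s.strip()]
        sentences_per_chapter.append(sentences)
    return sentences_per_chapter
```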
  • the unicast audio data may be divided into audio chapters first to obtain multi-chapter audio data, which may refer to the description above, and will not be repeated here. Then, for each chapter, based on the R pieces of text sentences corresponding to the chapter, audio sentence division may be performed on the chapter audio data corresponding to the chapter.
  • This application takes a chapter as an example to illustrate how to divide chapter audio data into multiple pieces of source audio data.
  • text recognition may be performed on chapter audio data to obtain a text recognition result. Then, based on the text recognition result, the chapter audio data and the chapter text are aligned on the time axis, and then R audio time intervals corresponding to the R text sentences in the chapter audio data of the chapter are determined. Based on the R audio time intervals, the unicast audio data is divided into R pieces of source audio data. And based on the role of the text sentence corresponding to each piece of source audio data, the role of each piece of source audio data is determined. For example, you can refer to Table 2:
  • the source audio data corresponding to the text sentence "Zhang ** asked according to the example” is the audio data during the period of 0 ⁇ 0:02.582s, which can be 0 ⁇ 0:02.582s
  • the audio data within a period of time is divided into a piece of source audio data, and the corresponding role is determined as "narration".
  • the source audio data corresponding to the text sentence "Are you the king**" is the audio data during the period of 0:02.582 ⁇ 0:02.048s, and it can be 0:02.582 ⁇ 0:
  • the audio data during the period of 02.048s is divided into a piece of source audio data, and the corresponding role is determined as "Zhang**".
  • the source audio data corresponding to the text sentence "the dish moved to the side of "yes” with great reluctance, rotated the pointer on the body of the dish, and aligned it with that word" is 0:
  • the audio data during the period from 02.048 to 0:13.969s can be divided into a piece of source audio data, and the corresponding character can be determined as an unknown character.
  • the source audio data corresponding to the text sentence "age” is the audio data during the period from 0:13.969 to 0:14.818s, and the period from 0:13.969 to 0:14.818s can be
  • the audio data within is divided into a piece of source audio data, and the corresponding role is determined as an unknown role.
  • the source audio data corresponding to the text sentence "twenty-three" is the audio data during the period of 0:14.818 ⁇ 0:16.217s, which can be 0:14.818 ⁇ 0:16.217s
  • the audio data within a period of time is divided into a piece of source audio data, and the corresponding role is determined as "King **".
  • audio sentence division is performed on W sections of chapter audio data, so as to realize division of unicast audio data into N pieces of source audio data.
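  • Once the R audio time intervals of a chapter have been obtained by aligning the chapter audio data with the R text sentences, dividing the chapter audio into pieces of source audio data reduces to slicing the waveform, as in the hypothetical sketch below.

```python
def slice_by_intervals(waveform, sample_rate, intervals):
    """waveform: 1-D array of samples for one chapter; intervals: list of
    (start_seconds, end_seconds) pairs, one per text sentence.
    Returns the corresponding pieces of source audio data."""
    pieces = []
    for start, end in intervals:
        pieces.append(waveform[int(start * sample_rate):int(end * sample_rate)])
    return pieces
```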
  • X (X is a positive integer) pieces of source audio data corresponding to the first role can be found from the P1 pieces of source audio data whose roles have been determined (that is, multiple pieces of source audio data corresponding to the same role can be found). Then the trained timbre extractor is used to extract X sets of initial timbre feature information corresponding to the X pieces of source audio data, and a weighted calculation is performed based on the X sets of initial timbre feature information to obtain a weighted calculation result. Then, the weighted calculation result may be determined as the source timbre feature information corresponding to each piece of source audio data in the X pieces of source audio data.
  • a role corresponding to P2 pieces of source audio data whose roles have not been determined may be determined.
  • a trained timbre extractor may be used to extract N sets of initial timbre feature information corresponding to N pieces of source audio data.
  • the similarity between the initial timbre feature information corresponding to each of the P1 pieces of source audio data and the initial timbre feature information corresponding to the j-th piece of source audio data can be calculated respectively.
  • the role of the source audio data with the highest similarity to the initial timbre feature information corresponding to the j-th piece of source audio data is determined as the role of the j-th piece of source audio data.
  • X pieces of source audio data corresponding to the first character may be found from the N pieces of source audio data of the determined roles.
  • use the trained timbre extractor to extract X sets of initial timbre feature information corresponding to X pieces of source audio data, perform weighted calculation based on the X set of initial timbre feature information, and obtain the weighted calculation result.
  • the weighted calculation result is determined as source timbre characteristic information corresponding to each piece of source audio data in the X pieces of source audio data.
  • for example, the timbre feature information of the source audio data of 0:02.048–0:13.969s has the highest similarity with the timbre feature information of the source audio data from 0 to 0:02.582s (the corresponding text sentence is "Zhang** asked according to the example"), so it can be determined that the role of the source audio data of 0:02.048–0:13.969s is narration.
  • similarly, the timbre feature information of the source audio data from 0:13.969 to 0:14.818s (the corresponding text sentence is "age") has the highest similarity with the timbre feature information of the source audio data from 0:02.582 to 0:02.048s (the corresponding text sentence is "Are you Wang**?"), so it can be determined that the role of the source audio data from 0:13.969 to 0:14.818s is Zhang**.
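  • The sketch below illustrates, under the same embedding assumptions as above, how pieces of source audio data with undetermined roles could be assigned the role of the most similar determined piece, and how the initial timbre feature information could then be pooled per role into source timbre feature information; it is an illustrative sketch, not the prescribed implementation.

```python
import numpy as np

def assign_roles_and_pool(embs, roles):
    """embs: initial timbre embeddings of the N pieces of source audio data.
    roles: list with a role name for determined pieces and None for undetermined ones.
    Returns the completed role list and a per-piece source timbre embedding."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    known = [k for k, r in enumerate(roles) if r is not None]       # the P1 determined pieces
    for j, r in enumerate(roles):
        if r is None:                                               # one of the P2 undetermined pieces
            roles[j] = roles[max(known, key=lambda k: cosine(embs[j], embs[k]))]
    pooled = {}
    for role in set(roles):
        idx = [k for k, r in enumerate(roles) if r == role]
        mean_emb = np.mean([embs[k] for k in idx], axis=0)          # weighted calculation (plain average)
        for k in idx:
            pooled[k] = mean_emb
    return roles, pooled
```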
  • in another possible implementation, when VAD detection is performed on the unicast audio data to obtain the N pieces of source audio data, the roles may be determined as follows.
  • voice recognition can be performed on the N pieces of source audio data respectively to obtain corresponding text recognition results.
  • the above method can be used to determine the characters corresponding to the N pieces of source audio data according to the text recognition results.
  • using the above method again, the X pieces of source audio data corresponding to the first role are first determined, and the source timbre feature information of each piece of source audio data in the X pieces of source audio data is determined by performing a weighted calculation on the initial timbre feature information of the X pieces of source audio data; details are not repeated here.
  • after the N sets of source timbre feature information corresponding to the N pieces of source audio data are determined, the matching target timbre feature information corresponding to the N pieces of source audio data can be searched for from the M sets of reference timbre feature information based on the N sets of source timbre feature information; for details, reference may be made to the description above, which is not repeated here.
  • the character tone conversion module outputs the multicast audio data of the multicast audiobook.
  • Fig. 13a is a schematic structural diagram of an exemplary speech processing device.
  • the voice processing device 1300 includes:
  • the data acquisition module 1301 is used to acquire the unicast audio data of the unicast audiobook, the unicast audio data includes N pieces of source audio data, and N is a positive integer;
  • the role timbre analysis module 1302 is used to determine, from M sets of reference timbre feature information, N sets of target timbre feature information matching the N pieces of source audio data, where the M sets of reference timbre feature information correspond to M timbres, the timbre discrimination between any two of the M sets of reference timbre feature information is greater than the discrimination threshold, and M is a positive integer;
  • the role timbre conversion module 1303 is used to obtain N sets of expressiveness feature information corresponding to N sets of source audio data; and based on N sets of target timbre feature information and N sets of expressiveness feature information, respectively perform timbre conversion on N pieces of source audio data, to generate multicast audio data.
  • the data acquisition module 1301 acquires the unicast audio data of the unicast audiobook, and then inputs the unicast audio data to the role timbre analysis module 1302 .
  • the role timbre analysis module 1302 can determine N sets of target timbre feature information matching N pieces of source audio data from the M sets of reference timbre feature information, and then input the N sets of target timbre feature information to the role timbre conversion module 1303 .
  • the character timbre conversion module 1303 can obtain N sets of expressiveness feature information corresponding to N sets of source audio data; then, based on N sets of target timbre feature information and N sets of expressive feature information, respectively perform timbre conversion on N pieces of source audio data to generate Multicast audio data.
  • the M sets of reference timbre feature information correspond to M timbres, and the timbre discrimination between any two groups in the M sets of reference timbre feature information is greater than the discrimination threshold.
  • the timbre of each piece of source audio data can be converted, the conversion of the unicast audiobook into a multicast audiobook can be realized, and the timbre distinction of the characters in the reading can be improved, thereby improving the expressiveness of the scene deduction of the audiobook and making it easy for users to understand the plot.
  • Fig. 13b is a schematic structural diagram of an exemplary voice processing device.
  • the role timbre analysis module 1302 includes:
  • the audio sentence division module 13021 is used to divide the unicast audio data into audio sentences to obtain N pieces of source audio data, where the N pieces of source audio data correspond to N audio sentences;
  • the timbre feature extraction module 13022 is used to extract N sets of source timbre feature information corresponding to N pieces of source audio data by using a timbre extractor;
  • the timbre feature matching module 13023 is configured to determine N sets of target timbre feature information from M sets of reference timbre feature information based on N sets of source timbre feature information.
  • the timbre feature matching module 13023 is used to: for the i-th piece of source audio data among the N pieces of source audio data, determine the similarity between each of the M sets of reference timbre feature information and the source timbre feature information of the i-th piece of source audio data; and determine the reference timbre feature information with the highest similarity to the i-th piece of source audio data as the target timbre feature information matching the i-th piece of source audio data, where i is a positive integer ranging from 1 to N.
  • the role timbre conversion module 1303 includes:
  • the prosody feature extraction module 13031 is used to extract N sets of prosody feature information corresponding to N sets of source audio data by using a prosody extractor;
  • the emotional feature extraction module 13032 is used to extract N sets of emotional feature information corresponding to N sets of source audio data by using an emotional extractor;
  • the expressive feature generating module 13033 is configured to generate N sets of expressive feature information corresponding to N pieces of source audio data based on N sets of prosodic feature information and N sets of emotional feature information.
  • the role timbre conversion module 1303 includes:
  • the content feature extraction module 13034 is used to extract N sets of content feature information corresponding to N pieces of source audio data by using a content extractor;
  • the feature spectrum reconstruction module 13035 is used to use the spectrum reconstruction model to perform spectrum reconstruction based on N sets of target timbre feature information, N sets of expressive force feature information and N sets of content feature information, so as to obtain N sets of audio spectrum feature information;
  • the frequency-time transformation module 13036 is used to perform frequency-time transformation on N groups of audio spectrum feature information respectively, so as to obtain N groups of target audio data;
  • the splicing module 13037 is used for splicing N groups of target audio data to obtain multicast audio data.
  • the audio sentence division module 13021 is configured to divide the unicast audio data into N pieces of source audio data by performing voice activity detection (VAD) on the unicast audio data.
  • the audio sentence division module 13021 is used to obtain the reading text of the unicast audiobook, and divide the reading text into N text sentences; align the unicast audio data with the reading text, so as to determine the N N audio time intervals corresponding to the text sentences; based on the N audio time intervals, divide the unicast audio data into N pieces of source audio data.
  • the timbre feature extraction module 13022 is used to: determine the role corresponding to each piece of source audio data in the N pieces of source audio data; obtain N sets of initial timbre feature information corresponding to the N pieces of source audio data; for X pieces of source audio data corresponding to a first role, extract X sets of initial timbre feature information corresponding to the X pieces of source audio data; perform a weighted calculation based on the X sets of initial timbre feature information corresponding to the X pieces of source audio data, and determine the result of the weighted calculation as the source timbre feature information corresponding to each piece of source audio data in the X pieces of source audio data, where X is a positive integer.
  • the speech processing apparatus 1300 further includes:
  • the role determination module 1304 is used to: calculate the similarity between the initial timbre feature information corresponding to each of the P1 pieces of source audio data and the initial timbre feature information corresponding to the j-th piece of source audio data; and determine the role corresponding to the source audio data with the highest similarity to the initial timbre feature information corresponding to the j-th piece of source audio data as the role of the j-th piece of source audio data, where j is a positive integer ranging from 1 to P2.
  • Fig. 14 is a schematic structural diagram of an exemplary training device.
  • the training device 1400 includes:
  • the collection module 1401 is used to collect training data.
  • the training data includes training audio data and reference role labels corresponding to the training audio data.
  • the expressiveness feature information of the training audio data meets the expressiveness condition.
  • the training audio data includes audio data recorded by multiple users using their own timbres, and/or audio data recorded by multiple users using false voices, and the timbre discrimination of different false voices used by the same user is greater than the discrimination threshold;
  • the feature information extraction module 1402 is used to input the training audio data to the emotion extractor, the content extractor, and the prosody extractor respectively for calculation, so as to obtain the emotion feature information output by the emotion extractor, the content feature information output by the content extractor, and the prosody feature information output by the prosody extractor; and to input the training audio data and the reference role label to the timbre extractor for calculation, so as to obtain the timbre feature information output by the timbre extractor;
  • the audio data reconstruction module 1403 is used to input the emotion feature information, content feature information, prosody feature information, and timbre feature information into the spectrum reconstruction model to perform spectrum reconstruction to obtain audio spectrum feature information, and to perform frequency-time transformation on the audio spectrum feature information to obtain reconstructed audio data;
  • the backpropagation module 1404 is used to calculate the first loss function value based on the reconstructed audio data and the training audio data, and, with the goal of minimizing the first loss function value, to jointly adjust the model parameters of the emotion extractor, the content extractor, the prosody extractor, the timbre extractor, and the spectrum reconstruction model. A simplified sketch of such a joint training step is given below.
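  • The following simplified sketch computes the first loss between spectral features rather than time-domain audio for brevity, and assumes `extractors` holds the four extractors and `optimizer` covers the parameters of all five models; these are assumptions of the sketch, not details fixed by this application.

```python
import torch.nn.functional as F

def joint_reconstruction_step(train_feat, extractors, spectrum_model, optimizer):
    """extractors: (emotion_extractor, content_extractor, prosody_extractor, timbre_extractor).
    train_feat: spectral features of one batch of training audio data."""
    emotion, content, prosody, timbre = (ext(train_feat) for ext in extractors)
    reconstructed = spectrum_model(emotion, content, prosody, timbre)
    loss = F.l1_loss(reconstructed, train_feat)  # first loss function value (one possible choice)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```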
  • the device also includes:
  • the loss function value calculation module 1405 is used to input the timbre feature information into the first classifier for calculation to obtain the first role label, and calculate the second loss function value based on the first role label and the reference role label; and to input the emotion feature information to the second classifier for calculation to obtain a second role label, and calculate the third loss function value based on the second role label and the reference role label;
  • the timbre extractor training module 1406 is used to adjust the model parameters of the timbre extractor with the goal of minimizing the second loss function value and the mutual information of the timbre feature information and the emotional feature information;
  • the emotion extractor training module 1407 is configured to adjust the model parameters of the emotion extractor with the goal of maximizing the value of the third loss function and minimizing the mutual information.
  • FIG. 15 shows a schematic block diagram of an apparatus 1500 according to an embodiment of the present application.
  • the apparatus 1500 may include: a processor 1501 and a transceiver/transceiving pin 1502 , and optionally, a memory 1503 .
  • bus 1504 includes a power bus, a control bus, and a status signal bus in addition to a data bus.
  • the various buses are referred to as bus 1504 in the figure.
  • the memory 1503 may be used to store the instructions in the foregoing method embodiments.
  • the processor 1501 can be used to execute instructions in the memory 1503, and control the receiving pin to receive signals, and control the sending pin to send signals.
  • Apparatus 1500 may be the electronic device or the chip of the electronic device in the foregoing method embodiments.
  • This embodiment also provides a computer storage medium, in which computer instructions are stored, and when the computer instructions are run on the electronic device, the electronic device is made to execute the steps of the above-mentioned related methods to realize the speech processing and training in the above-mentioned embodiment method.
  • This embodiment also provides a computer program product, which, when running on a computer, causes the computer to execute the above-mentioned related steps, so as to realize the voice processing and training method in the above-mentioned embodiment.
  • an embodiment of the present application also provides a device, which may specifically be a chip, a component or a module, and the device may include a connected processor and a memory; wherein the memory is used to store computer-executable instructions, and when the device is running, The processor can execute the computer-executable instructions stored in the memory, so that the chip executes the speech processing and training methods in the above method embodiments.
  • the electronic device, computer storage medium, computer program product, or chip provided in this embodiment is used to execute the corresponding method provided above; therefore, for the beneficial effects that can be achieved, reference may be made to the beneficial effects of the corresponding method provided above, and details are not repeated here.
  • the disclosed devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of modules or units is only a logical function division; in actual implementation, there may be other division manners. For example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may be one physical unit or multiple physical units, which may be located in one place or distributed to multiple different places. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if an integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a readable storage medium.
  • based on such an understanding, the technical solutions of the embodiments of the present application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods in the embodiments of the present application.
  • the aforementioned storage medium includes: various media that can store program codes such as U disk, mobile hard disk, read only memory (ROM), random access memory (random access memory, RAM), magnetic disk or optical disk.
  • the steps of the methods or algorithms described in connection with the disclosure of the embodiments of the present application may be implemented in the form of hardware, or may be implemented in the form of a processor executing software instructions.
  • the software instructions can be composed of corresponding software modules, and the software modules can be stored in random access memory (Random Access Memory, RAM), flash memory, read-only memory (Read Only Memory, ROM), erasable programmable read-only memory ( Erasable Programmable ROM, EPROM), Electrically Erasable Programmable Read-Only Memory (Electrically EPROM, EEPROM), registers, hard disk, removable hard disk, CD-ROM, or any other form of storage medium known in the art.
  • an exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
  • the storage medium may also be a component of the processor.
  • the processor and storage medium can be located in the ASIC.
  • the functions described in the embodiments of the present application may be implemented by hardware, software, firmware or any combination thereof.
  • the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
  • Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another.
  • a storage media may be any available media that can be accessed by a general purpose or special purpose computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Electrophonic Musical Instruments (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application relate to voice processing and training methods and an electronic device. The voice processing method comprises: acquiring unicast audio data of a unicast audiobook, the unicast audio data comprising N pieces of source audio data; determining, from M sets of reference timbre feature information, N sets of target timbre feature information that match the N pieces of source audio data, the timbre discrimination between any two of the M sets of reference timbre feature information being greater than a discrimination threshold; acquiring N sets of expressiveness feature information corresponding to the N sets of source audio data, and performing timbre conversion on the N pieces of source audio data on the basis of the N sets of target timbre feature information and the N sets of expressiveness feature information, so as to generate multicast audio data. In this way, the timbre of the source audio data can be converted while expressiveness is ensured, and a unicast audiobook is converted into a multicast audiobook, which increases the timbre differentiation between characters in the reading, thereby better expressing the deduction of a scene and making it easier for a user to understand the plot.
PCT/CN2022/116572 2021-09-30 2022-09-01 Procédés de traitement et d'entraînement de voix et dispositif électronique WO2023051155A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111158143.1A CN115881145A (zh) 2021-09-30 2021-09-30 语音处理和训练方法以及电子设备
CN202111158143.1 2021-09-30

Publications (1)

Publication Number Publication Date
WO2023051155A1 true WO2023051155A1 (fr) 2023-04-06

Family

ID=85756611

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/116572 WO2023051155A1 (fr) 2021-09-30 2022-09-01 Procédés de traitement et d'entraînement de voix et dispositif électronique

Country Status (2)

Country Link
CN (1) CN115881145A (fr)
WO (1) WO2023051155A1 (fr)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5563358A (en) * 1991-12-06 1996-10-08 Zimmerman; Thomas G. Music training apparatus
US20110157299A1 (en) * 2009-12-24 2011-06-30 Samsung Electronics Co., Ltd Apparatus and method of video conference to distinguish speaker from participants
CN107293286A (zh) * 2017-05-27 2017-10-24 华南理工大学 一种基于网络配音游戏的语音样本收集方法
CN110933330A (zh) * 2019-12-09 2020-03-27 广州酷狗计算机科技有限公司 视频配音方法、装置、计算机设备及计算机可读存储介质
CN112037766A (zh) * 2020-09-09 2020-12-04 广州华多网络科技有限公司 一种语音音色转换方法及相关设备
CN112820287A (zh) * 2020-12-31 2021-05-18 乐鑫信息科技(上海)股份有限公司 分布式语音处理系统及方法
CN112863483A (zh) * 2021-01-05 2021-05-28 杭州一知智能科技有限公司 支持多说话人风格、语言切换且韵律可控的语音合成装置
CN113096634A (zh) * 2021-03-30 2021-07-09 平安科技(深圳)有限公司 语音合成方法、装置、服务器及存储介质
CN113436609A (zh) * 2021-07-06 2021-09-24 南京硅语智能科技有限公司 语音转换模型及其训练方法、语音转换方法及系统

Also Published As

Publication number Publication date
CN115881145A (zh) 2023-03-31

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22874546

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE