CN115881145A - Voice processing and training method and electronic equipment


Info

Publication number: CN115881145A
Application number: CN202111158143.1A
Authority: CN (China)
Prior art keywords: audio data, characteristic information, tone, groups, source
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 黄涛
Current Assignee: Huawei Technologies Co Ltd
Original Assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN202111158143.1A
Priority to PCT/CN2022/116572 (WO2023051155A1)
Publication of CN115881145A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 - Changing voice quality, e.g. pitch or formants
    • G10L 21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L 21/013 - Adapting to target pitch

Abstract

The embodiments of the present application provide a voice processing method, a training method, and an electronic device. The voice processing method includes the following steps: acquiring unicast audio data of a unicast audio book, where the unicast audio data includes N pieces of source audio data; determining, from M groups of reference tone characteristic information, N groups of target tone characteristic information matching the N pieces of source audio data, where the tone discrimination between any two of the M groups of reference tone characteristic information is greater than a discrimination threshold; acquiring N groups of expression characteristic information corresponding to the N pieces of source audio data; and performing tone conversion on the N pieces of source audio data based on the N groups of target tone characteristic information and the N groups of expression characteristic information, so as to generate multicast audio data. In this way, the tone of the source audio data can be converted while its expressiveness is preserved, a unicast (single-narrator) audio book is converted into a multicast (multi-narrator) audio book, and the tone discrimination between the characters in the book is improved, which enhances the expressiveness of the scenario performance and helps the user understand the plot.

Description

Voice processing and training method and electronic equipment
Technical Field
The embodiment of the application relates to the field of data processing, in particular to a voice processing and training method and electronic equipment.
Background
At present, the mainstream content of audio books consists of novels, such as story novels, suspense novels, science fiction novels, and martial-arts novels. Because a novel contains many characters, when an audio book is recorded, a single voice actor usually performs multiple characters by using pseudo-voices (disguised voices), so that the voices of the multiple characters in the novel are recorded by one person.
However, the number of voices a single person can disguise is limited. When a novel has too many characters, even a single voice actor using pseudo-voices cannot perform the voices of all characters distinctly, so the voice discrimination between characters is low. This weakens the expressiveness of the scenario performance and makes it harder for the user to follow the plot.
Disclosure of Invention
In order to solve the above technical problem, the present application provides a voice processing method, a training method, and an electronic device. In the voice processing method, unicast audio data can be converted into multicast audio data while the expressiveness of the unicast audio book is preserved, so that a unicast audio book can be converted into a multicast audio book.
In a first aspect, an embodiment of the present application provides a voice processing method, including: acquiring unicast audio data of a unicast audio book, where the unicast audio data includes N pieces of source audio data; determining, from M groups of reference tone characteristic information, N groups of target tone characteristic information matching the N pieces of source audio data, where the M groups of reference tone characteristic information correspond to M tones and the tone discrimination between any two of the M groups is greater than a discrimination threshold; acquiring N groups of expression characteristic information corresponding to the N pieces of source audio data; and performing tone conversion on the N pieces of source audio data based on the N groups of target tone characteristic information and the N groups of expression characteristic information, so as to generate multicast audio data. In this way, the tone of each piece of source audio data can be converted while the expressiveness of the unicast audio book is preserved, so that the unicast audio book is converted into a multicast audio book and the tone discrimination between the characters in the book is improved, which enhances the scenario performance expressiveness of the audio book and helps the user understand the plot. In addition:
On the one hand, after the user triggers an audio reading operation on a unicast audio book, the present application can automatically match a tone for each piece of source audio data, so the user does not need to manually select a matching tone for each piece of source audio data. This improves the efficiency of converting unicast audio data into multicast audio data, simplifies user operations, and improves the user experience.
On the other hand, when an audio book producer produces a multicast audio book, only a single voice actor is needed to record the unicast audio book by performing multiple characters with pseudo-voices, and the unicast audio book is then converted into a multicast audio book. Producing the multicast audio book therefore no longer requires multiple voice actors to record their assigned characters, which reduces the production time and cost of the multicast audio book and improves production efficiency.
Further, in the present application, for a unicast audio book that has already been recorded, the audio book producer can directly convert it into a multicast audio book without re-producing it, which is efficient and inexpensive.
Illustratively, N is a positive integer and M is a positive integer.
For example, the tone characteristic information may be used to characterize information related to timbre, and may include, but is not limited to: personality features, gender features, age features, vocal range features (e.g., treble, midrange, and bass), and the like. It may be represented by a vector or a sequence, which is not limited in this application.
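To make the first-aspect flow concrete, the following is a minimal Python sketch of the overall conversion pipeline. The function names, the use of NumPy waveforms, and the representation of characteristic information as plain vectors are all assumptions made for illustration; the extractors and the converter are left as pluggable callables rather than a definitive implementation.

```python
# A minimal sketch of the first-aspect pipeline, assuming characteristic
# information is represented as NumPy vectors and the extractors/converter
# are pluggable callables (hypothetical helpers, not part of the patent).
from typing import Callable, List, Sequence
import numpy as np

def convert_unicast_to_multicast(
    source_segments: List[np.ndarray],        # N pieces of source audio data (waveforms)
    reference_timbres: Sequence[np.ndarray],  # M groups of reference tone characteristic info
    extract_timbre: Callable[[np.ndarray], np.ndarray],
    extract_expression: Callable[[np.ndarray], np.ndarray],
    match_timbre: Callable[[np.ndarray, Sequence[np.ndarray]], np.ndarray],
    convert_one: Callable[[np.ndarray, np.ndarray, np.ndarray], np.ndarray],
) -> np.ndarray:
    """Convert unicast audio into multicast audio, piece by piece."""
    converted = []
    for segment in source_segments:
        source_timbre = extract_timbre(segment)                  # source tone features
        target_timbre = match_timbre(source_timbre, reference_timbres)
        expression = extract_expression(segment)                 # prosody + emotion
        converted.append(convert_one(segment, target_timbre, expression))
    return np.concatenate(converted)                             # spliced multicast audio
```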
According to the first aspect, determining, from the M groups of reference tone characteristic information, the N groups of target tone characteristic information matching the N pieces of source audio data includes: performing audio sentence division on the unicast audio data to obtain the N pieces of source audio data, where the N pieces of source audio data are in one-to-one correspondence with audio sentences; acquiring N groups of source tone characteristic information corresponding to the N pieces of source audio data; and determining the N groups of target tone characteristic information from the M groups of reference tone characteristic information based on the N groups of source tone characteristic information.
According to the first aspect, or any one of the above implementation manners of the first aspect, determining the N groups of target tone characteristic information matching the N pieces of source audio data from the M groups of reference tone characteristic information based on the N groups of source tone characteristic information includes: for the ith piece of the N pieces of source audio data, determining the similarity between each of the M groups of reference tone characteristic information and the ith piece of source audio data, and determining the reference tone characteristic information with the highest similarity to the ith piece of source audio data as the target tone characteristic information matching the ith piece of source audio data, where i is a positive integer ranging from 1 to N. In this way, each piece of source audio data is matched with a tone that is highly similar to the tone of its corresponding character, so that the tone obtained after tone conversion matches that character.
Illustratively, the range 1 to N is inclusive; that is, i may be 1 or N.
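A minimal sketch of the matching rule described above is shown below. Cosine similarity is used as the similarity measure, which is an assumption; the description only requires choosing the reference tone with the highest similarity.

```python
import numpy as np

def match_target_timbre(source_timbre: np.ndarray,
                        reference_timbres: list[np.ndarray]) -> np.ndarray:
    """Return the reference tone characteristic vector with the highest
    similarity to the source tone characteristic vector (cosine similarity
    is used here as one possible measure)."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    similarities = [cosine(source_timbre, ref) for ref in reference_timbres]
    return reference_timbres[int(np.argmax(similarities))]

# Usage: pick a target timbre for the i-th piece of source audio data.
references = [np.array([0.9, 0.1, 0.0]), np.array([0.1, 0.8, 0.3])]
target = match_target_timbre(np.array([0.85, 0.2, 0.05]), references)
```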
According to the first aspect, or any implementation manner of the first aspect above, acquiring the N groups of expression characteristic information corresponding to the N pieces of source audio data includes: acquiring N groups of prosody characteristic information corresponding to the N pieces of source audio data; acquiring N groups of emotion characteristic information corresponding to the N pieces of source audio data; and generating the N groups of expression characteristic information corresponding to the N pieces of source audio data based on the N groups of prosody characteristic information and the N groups of emotion characteristic information.
For example, the prosody characteristic information may be used to characterize vocal delivery information such as the emphasis, pacing, and the balance of breathy and full voice in speech, and may be represented by a vector or a sequence.
For example, the emotion characteristic information may be used to characterize the emotion type (e.g., happy, sad, high-spirited, low-spirited) and the attitude (e.g., affirmative, negative, complimentary, ironic), and may be represented by a vector or a sequence.
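The generation step can be as simple as combining the two feature vectors. The sketch below assumes vector-valued prosody and emotion features and plain concatenation, which is only one possible way of forming the expression characteristic information.

```python
import numpy as np

def build_expression_features(prosody: np.ndarray, emotion: np.ndarray) -> np.ndarray:
    """One possible combination: concatenate prosody and emotion features
    into a single expression characteristic vector."""
    return np.concatenate([prosody, emotion])

expression = build_expression_features(np.array([0.2, 0.7]), np.array([0.1, 0.0, 0.9]))
```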
According to the first aspect, or any implementation manner of the first aspect above, performing tone conversion on the N pieces of source audio data based on the N groups of target tone characteristic information and the N groups of expression characteristic information to generate the multicast audio data includes: acquiring N groups of content characteristic information corresponding to the N pieces of source audio data; performing spectrum reconstruction based on the N groups of target tone characteristic information, the N groups of expression characteristic information, and the N groups of content characteristic information to obtain N groups of sound spectrum characteristic information; performing frequency-time transformation on the N groups of sound spectrum characteristic information to obtain N pieces of target audio data; and splicing the N pieces of target audio data to obtain the multicast audio data. In this way, the target audio data can be kept consistent with the source audio data in prosody, emotion, and content while differing in tone.
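The sketch below illustrates only the last two steps (frequency-time transformation and splicing), assuming the spectrum reconstruction model outputs one magnitude spectrogram per sentence and using Griffin-Lim from librosa as a stand-in for the frequency-time transformation; the actual reconstruction model and vocoder are not specified by this description.

```python
import numpy as np
import librosa

def spectrograms_to_multicast(spectrograms: list[np.ndarray],
                              hop_length: int = 256) -> np.ndarray:
    """Frequency-time transform each reconstructed magnitude spectrogram back
    to a waveform (Griffin-Lim as a stand-in vocoder), then splice in order."""
    waveforms = [librosa.griffinlim(S, hop_length=hop_length) for S in spectrograms]
    return np.concatenate(waveforms)
```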
According to the first aspect, or any implementation manner of the first aspect, performing audio sentence division on the unicast audio data to obtain the N pieces of source audio data includes: dividing the unicast audio data into the N pieces of source audio data by performing Voice Activity Detection (VAD) on the unicast audio data.
According to the first aspect, or any implementation manner of the first aspect above, performing audio sentence division on the unicast audio data to obtain the N pieces of source audio data includes: acquiring the reading text of the unicast audio book and dividing the reading text into N text sentences; aligning the unicast audio data with the reading text to determine N audio time intervals corresponding to the N text sentences in the unicast audio data; and dividing the unicast audio data into the N pieces of source audio data based on the N audio time intervals. In this way, the accuracy of audio sentence division of the unicast audio data is increased, which further improves the accuracy of converting the unicast audio book into a multicast audio book.
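Assuming the alignment step has already produced the N audio time intervals (for example, from a forced aligner, which is not specified here), slicing the unicast waveform is straightforward; the (start, end)-in-seconds interval format is an assumption for illustration.

```python
import numpy as np

def split_by_intervals(unicast_audio: np.ndarray, sample_rate: int,
                       intervals: list[tuple[float, float]]) -> list[np.ndarray]:
    """Divide unicast audio into N pieces of source audio data based on the
    N audio time intervals (start, end) in seconds obtained by aligning the
    audio with the reading text."""
    pieces = []
    for start, end in intervals:
        pieces.append(unicast_audio[int(start * sample_rate):int(end * sample_rate)])
    return pieces
```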
According to the first aspect, or any implementation manner of the first aspect above, acquiring the N groups of source tone characteristic information corresponding to the N pieces of source audio data includes: determining the role corresponding to each of the N pieces of source audio data; acquiring N groups of initial tone characteristic information corresponding to the N pieces of source audio data; and, for X pieces of source audio data corresponding to a first role among the N pieces of source audio data: performing a weighted calculation based on the X groups of initial tone characteristic information corresponding to the X pieces of source audio data, and determining the result of the weighted calculation as the source tone characteristic information of each of the X pieces of source audio data, where X is a positive integer. In this way, the source tone characteristic information of all source audio data belonging to the same role in the book is the same, which ensures that the tone of that role is uniform throughout the book.
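A minimal sketch of the per-role weighting step follows. Equal weights (a plain average) are assumed, since the description does not fix a particular weighting scheme.

```python
from collections import defaultdict
import numpy as np

def unify_timbre_per_role(roles: list[str],
                          initial_timbres: list[np.ndarray]) -> list[np.ndarray]:
    """For every role, combine the initial tone characteristic vectors of all
    of its source audio pieces (equal weights here) and assign the combined
    vector back to each of those pieces as its source tone characteristic."""
    grouped = defaultdict(list)
    for idx, role in enumerate(roles):
        grouped[role].append(idx)
    source_timbres: list[np.ndarray] = [None] * len(roles)
    for role, indices in grouped.items():
        combined = np.mean([initial_timbres[i] for i in indices], axis=0)
        for i in indices:
            source_timbres[i] = combined
    return source_timbres
```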
According to the first aspect, or any implementation manner of the first aspect above, the N pieces of source audio data include P1 pieces of source audio data with determined roles and P2 pieces of source audio data with undetermined roles, where N = P1 + P2 and P1 and P2 are positive integers, and the method further includes: for the jth piece of the P2 pieces of source audio data, calculating the similarity between the initial tone characteristic information corresponding to each of the P1 pieces of source audio data and the initial tone characteristic information corresponding to the jth piece of source audio data, and determining the role of the piece, among the P1 pieces, whose initial tone characteristic information has the highest similarity as the role of the jth piece of source audio data, where j is a positive integer ranging from 1 to P2. In this way, a role can be accurately determined for source audio data whose role is undetermined, reducing the error rate of coreference resolution.
Illustratively, the range 1 to P2 is inclusive; that is, j may be 1 or P2.
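A minimal sketch of this role-assignment step is given below, again assuming cosine similarity between initial tone characteristic vectors.

```python
import numpy as np

def assign_undetermined_roles(known_timbres: list[np.ndarray],
                              known_roles: list[str],
                              unknown_timbres: list[np.ndarray]) -> list[str]:
    """For each of the P2 undetermined pieces, copy the role of the P1 piece
    whose initial tone characteristic information is most similar."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    assigned = []
    for u in unknown_timbres:
        sims = [cosine(u, k) for k in known_timbres]
        assigned.append(known_roles[int(np.argmax(sims))])
    return assigned
```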
In a second aspect, an embodiment of the present application provides a training method, including: first, collecting training data, where the training data includes training audio data and a reference role label corresponding to the training audio data, the expressive force characteristic information of the training audio data satisfies an expressive force condition, the training audio data includes audio data recorded by a plurality of users using their own timbres and/or audio data recorded by the plurality of users using pseudo-tones, and the tone discrimination between different pseudo-tones used by the same user is greater than a discrimination threshold. Then, the training audio data is input to an emotion extractor, a content extractor, and a prosody extractor respectively for calculation, to obtain emotion characteristic information output by the emotion extractor, content characteristic information output by the content extractor, and prosody characteristic information output by the prosody extractor; and the training audio data and the reference role label are input to a tone extractor for calculation, to obtain tone characteristic information output by the tone extractor. Next, the emotion characteristic information, the content characteristic information, the prosody characteristic information, and the tone characteristic information are input to a spectrum reconstruction model for spectrum reconstruction to obtain sound spectrum characteristic information, and frequency-time transformation is performed on the sound spectrum characteristic information to obtain reconstructed audio data. Finally, a first loss function value is calculated based on the reconstructed audio data and the training audio data, and the model parameters of the emotion extractor, the content extractor, the prosody extractor, the tone extractor, and the spectrum reconstruction model are jointly adjusted with the goal of minimizing the first loss function value. In this way, a tone extractor capable of extracting accurate tone characteristic information, a prosody extractor capable of extracting accurate prosody characteristic information, an emotion extractor capable of extracting accurate emotion characteristic information, a content extractor capable of extracting accurate content characteristic information, and a spectrum reconstruction model capable of reconstructing sound spectrum characteristic information can be trained.
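A heavily simplified PyTorch sketch of this joint training is given below. The extractor architectures, the mel-spectrogram input representation, and the L1 loss are all assumptions made for illustration; the reference role label input to the tone extractor and the frequency-time transformation back to a waveform are omitted, so the loss here is computed in the spectral domain rather than between waveforms.

```python
import torch
import torch.nn as nn

class Extractor(nn.Module):
    """Toy frame-wise extractor: maps each spectrogram frame to a feature vector."""
    def __init__(self, n_mels: int, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, 64), nn.ReLU(), nn.Linear(64, dim))
    def forward(self, x):          # x: (batch, frames, n_mels)
        return self.net(x)

n_mels, batch, frames = 80, 4, 100
emotion, content, prosody, timbre = (Extractor(n_mels, 16) for _ in range(4))
# Spectrum reconstruction model: maps the concatenated features back to frames.
reconstructor = nn.Sequential(nn.Linear(16 * 4, 128), nn.ReLU(), nn.Linear(128, n_mels))

params = (list(emotion.parameters()) + list(content.parameters()) +
          list(prosody.parameters()) + list(timbre.parameters()) +
          list(reconstructor.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)

mel = torch.randn(batch, frames, n_mels)   # stand-in for training audio data (mel frames)
for step in range(10):
    feats = torch.cat([emotion(mel), content(mel), prosody(mel), timbre(mel)], dim=-1)
    reconstructed = reconstructor(feats)
    first_loss = nn.functional.l1_loss(reconstructed, mel)   # first loss function value
    optimizer.zero_grad()
    first_loss.backward()        # jointly adjust all extractors and the reconstructor
    optimizer.step()
```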
According to the second aspect, the method further includes: on the one hand, inputting the tone characteristic information to a first classifier for calculation to obtain a first role label, and calculating a second loss function value based on the first role label and the reference role label; on the other hand, inputting the emotion characteristic information to a second classifier for calculation to obtain a second role label, and calculating a third loss function value based on the second role label and the reference role label. Then, the model parameters of the tone extractor are adjusted with the goals of minimizing the second loss function value and minimizing the mutual information between the tone characteristic information and the emotion characteristic information, and the model parameters of the emotion extractor are adjusted with the goals of maximizing the third loss function value and minimizing the mutual information. In this way, the overlap between the tone characteristic information and the emotion characteristic information can be reduced, so that the two are decoupled.
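A rough sketch of the decoupling objectives follows. Estimating mutual information normally requires a dedicated estimator (e.g., MINE or CLUB); the squared cross-correlation penalty below is only a crude stand-in for that term, and the classifier handling is reduced to forming the two objectives for brevity.

```python
import torch
import torch.nn as nn

dim, n_roles, batch = 16, 5, 8
timbre_feat = torch.randn(batch, dim, requires_grad=True)   # stand-ins for extractor outputs
emotion_feat = torch.randn(batch, dim, requires_grad=True)
labels = torch.randint(0, n_roles, (batch,))                # reference role labels

clf_timbre = nn.Linear(dim, n_roles)    # first classifier
clf_emotion = nn.Linear(dim, n_roles)   # second classifier
ce = nn.CrossEntropyLoss()

second_loss = ce(clf_timbre(timbre_feat), labels)     # timbre features should predict the role
third_loss = ce(clf_emotion(emotion_feat), labels)    # emotion features should NOT predict it

# Crude stand-in for the mutual-information term: squared cross-correlation
# between the centered feature sets; smaller means less shared information.
t = timbre_feat - timbre_feat.mean(0)
e = emotion_feat - emotion_feat.mean(0)
mi_proxy = (t.T @ e / batch).pow(2).mean()

timbre_objective = second_loss + mi_proxy    # minimize: classify well, share little
emotion_objective = -third_loss + mi_proxy   # maximize third loss, minimize sharing
```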
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor, the memory coupled to the processor; the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the method of speech processing of the first aspect or any possible implementation of the first aspect.
Any one implementation manner of the third aspect and the third aspect corresponds to any one implementation manner of the first aspect and the first aspect, respectively. For technical effects corresponding to any one implementation manner of the third aspect and the third aspect, reference may be made to the technical effects corresponding to any one implementation manner of the first aspect and the first aspect, and details are not repeated here.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor, the memory coupled with the processor; the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the training method of the second aspect or any possible implementation of the second aspect.
Any one implementation manner of the fourth aspect and the fourth aspect corresponds to any one implementation manner of the second aspect and the second aspect, respectively. For technical effects corresponding to any one of the implementation manners of the fourth aspect and the fourth aspect, reference may be made to the technical effects corresponding to any one of the implementation manners of the second aspect and the second aspect, and details are not described here.
In a fifth aspect, an embodiment of the present application provides a chip, including one or more interface circuits and one or more processors; the interface circuit is used for receiving signals from a memory of the electronic equipment and sending the signals to the processor, and the signals comprise computer instructions stored in the memory; the computer instructions, when executed by a processor, cause an electronic device to perform the method of speech processing of the first aspect or any possible implementation of the first aspect.
Any one implementation manner of the fifth aspect and the fifth aspect corresponds to any one implementation manner of the first aspect and the first aspect, respectively. For technical effects corresponding to any one of the implementation manners of the fifth aspect and the fifth aspect, reference may be made to the technical effects corresponding to any one of the implementation manners of the first aspect and the first aspect, and details are not repeated here.
In a sixth aspect, embodiments of the present application provide a chip, including one or more interface circuits and one or more processors; the interface circuit is used for receiving signals from a memory of the electronic equipment and sending the signals to the processor, and the signals comprise computer instructions stored in the memory; the computer instructions, when executed by a processor, cause an electronic device to perform the training method of the second aspect or any possible implementation of the second aspect.
Any one of the implementation manners of the sixth aspect and the sixth aspect corresponds to any one of the implementation manners of the second aspect and the second aspect, respectively. For technical effects corresponding to any one implementation manner of the sixth aspect and the sixth aspect, reference may be made to the technical effects corresponding to any one implementation manner of the second aspect and the second aspect, and details are not described here again.
In a seventh aspect, an embodiment of the present application provides a computer storage medium, where a computer program is stored in the computer storage medium, and when the computer program runs on a computer or a processor, the computer or the processor is caused to execute the speech processing method in the first aspect or any possible implementation manner of the first aspect.
Any one of the implementations of the seventh aspect and the seventh aspect corresponds to any one of the implementations of the first aspect and the first aspect, respectively. For technical effects corresponding to any one of the implementation manners of the seventh aspect and the seventh aspect, reference may be made to the technical effects corresponding to any one of the implementation manners of the first aspect and the first aspect, and details are not repeated here.
In an eighth aspect, embodiments of the present application provide a computer storage medium, where a computer program is stored in the computer readable storage medium, and when the computer program runs on a computer or a processor, the computer or the processor is caused to execute the training method in the second aspect or any possible implementation manner of the second aspect.
Any one implementation manner of the eighth aspect and the eighth aspect corresponds to any one implementation manner of the second aspect and the second aspect, respectively. For technical effects corresponding to any one implementation manner of the eighth aspect and the eighth aspect, reference may be made to the technical effects corresponding to any one implementation manner of the second aspect and the second aspect, and details are not described here.
In a ninth aspect, embodiments of the present application provide a computer program product, which contains a software program that, when executed by a computer or a processor, causes the steps of the speech processing method in the first aspect or any possible implementation manner of the first aspect to be performed.
Any one of the ninth aspect and the ninth aspect corresponds to any one of the first aspect and the first aspect, respectively. For technical effects corresponding to any one implementation manner of the ninth aspect and the ninth aspect, reference may be made to the technical effects corresponding to any one implementation manner of the first aspect and the first aspect, and details are not repeated here.
In a tenth aspect, embodiments of the present application provide a computer program product, which contains a software program that, when executed by a computer or a processor, causes the steps of the training method in the second aspect or any possible implementation manner of the second aspect to be performed.
Any one of the tenth aspect and the tenth aspect corresponds to any one of the second aspect and the second aspect, respectively. For technical effects corresponding to any one implementation manner of the tenth aspect and the tenth aspect, reference may be made to the technical effects corresponding to any one implementation manner of the second aspect and the second aspect, and details are not described here.
In an eleventh aspect, an embodiment of the present application provides a speech processing apparatus, including:
the data acquisition module is used for acquiring unicast audio data of the unicast audio book, wherein the unicast audio data comprise N pieces of source audio data, and N is a positive integer;
the role tone analysis module is used for determining N groups of target tone characteristic information matched with the N pieces of source audio data from M groups of reference tone characteristic information, wherein the M groups of reference tone characteristic information correspond to M tones, the tone discrimination between any two groups of the M groups of reference tone characteristic information is greater than a discrimination threshold, and M is a positive integer;
the role tone conversion module is used for acquiring N groups of expression characteristic information corresponding to the N groups of source audio data; and performing tone conversion on the N pieces of source audio data respectively based on the N groups of target tone characteristic information and the N groups of expression characteristic information to generate multicast audio data.
According to an eleventh aspect, the character timbre analysis module comprises:
the audio sentence division module is configured to perform audio sentence division on the unicast audio data to obtain the N pieces of source audio data, where the N pieces of source audio data are in one-to-one correspondence with audio sentences;
the tone characteristic extraction module is used for extracting N groups of source tone characteristic information corresponding to the N pieces of source audio data by adopting a tone extractor;
and the tone characteristic matching module is used for determining N groups of target tone characteristic information from M groups of reference tone characteristic information based on N groups of source tone characteristic information.
According to an eleventh aspect, or any implementation manner of the above eleventh aspect, the timbre feature matching module is configured to, for an ith source audio data of the N source audio data: respectively determining the similarity between the M groups of reference tone characteristic information and the ith piece of source audio data; and determining the reference tone characteristic information with the highest similarity with the ith piece of source audio data as target tone characteristic information matched with the ith piece of source audio data, wherein i is a positive integer and the value range is between 1 and N.
According to an eleventh aspect, or any implementation manner of the above eleventh aspect, the character timbre conversion module includes:
the prosodic feature extraction module is used for extracting N groups of prosodic feature information corresponding to the N groups of source audio data by adopting a prosodic extractor;
the emotion characteristic extraction module is used for extracting N groups of emotion characteristic information corresponding to the N groups of source audio data by adopting an emotion extractor;
and the expressive force characteristic generation module is used for generating N groups of expressive force characteristic information corresponding to the N pieces of source audio data based on the N groups of prosodic characteristic information and the N groups of emotional characteristic information.
According to an eleventh aspect, or any implementation manner of the above eleventh aspect, the character timbre conversion module includes:
the content characteristic extraction module is used for extracting N groups of content characteristic information corresponding to the N pieces of source audio data by adopting a content extractor;
the characteristic spectrum reconstruction module is used for performing spectrum reconstruction on the basis of N groups of target tone characteristic information, N groups of expression characteristic information and N groups of content characteristic information by adopting a spectrum reconstruction model to obtain N groups of tone spectrum characteristic information;
the frequency-time transformation module is used for respectively carrying out frequency-time transformation on the N groups of audio spectrum characteristic information to obtain N groups of target audio data;
and the splicing module is used for splicing the N groups of target audio data to obtain the multicast audio data.
According to the eleventh aspect, or any implementation manner of the above eleventh aspect, the audio sentence division module is configured to divide the unicast audio data into the N pieces of source audio data by performing Voice Activity Detection (VAD) on the unicast audio data.
According to the eleventh aspect, or any implementation manner of the above eleventh aspect, the audio sentence division module is configured to acquire the reading text of the unicast audio book and divide the reading text into N text sentences; align the unicast audio data with the reading text to determine N audio time intervals corresponding to the N text sentences in the unicast audio data; and divide the unicast audio data into the N pieces of source audio data based on the N audio time intervals.
According to the eleventh aspect, or any implementation manner of the eleventh aspect above, the tone feature extraction module is configured to determine the role corresponding to each of the N pieces of source audio data; acquire N groups of initial tone characteristic information corresponding to the N pieces of source audio data; and, for X pieces of source audio data corresponding to a first role among the N pieces of source audio data: perform a weighted calculation based on the X groups of initial tone characteristic information corresponding to the X pieces of source audio data, and determine the result of the weighted calculation as the source tone characteristic information of each of the X pieces of source audio data, where X is a positive integer.
According to an eleventh aspect, or any implementation manner of the above eleventh aspect, the N pieces of source audio data include P1 pieces of source audio data of determined characters and P2 pieces of source audio data of undetermined characters, N = P1+ P2, and P1 and P2 are positive integers, and the apparatus further includes:
a role determination module, configured to, for a jth source audio data of the P2 pieces of source audio data: respectively calculating the similarity of the initial tone characteristic information corresponding to the P1 pieces of source audio data and the initial tone characteristic information corresponding to the jth piece of source audio data; determining the role corresponding to the source audio data with the highest initial tone characteristic information similarity in the P1 pieces of source audio data as the role of the jth piece of source audio data; wherein j is a positive integer and has a value ranging from 1 to P2.
Any one implementation manner of the eleventh aspect and the eleventh aspect corresponds to any one implementation manner of the first aspect and the first aspect, respectively. For technical effects corresponding to any one implementation manner of the eleventh aspect and the eleventh aspect, reference may be made to the technical effects corresponding to any one implementation manner of the first aspect and the first aspect, and details are not repeated here.
In a twelfth aspect, an embodiment of the present application provides a training apparatus, including:
the training data acquisition module is configured to collect training data, where the training data includes training audio data and a reference role label corresponding to the training audio data, and the expressive force characteristic information of the training audio data satisfies an expressive force condition; the training audio data includes audio data recorded by a plurality of users using their own timbres and/or audio data recorded by the plurality of users using pseudo-tones, and the tone discrimination between different pseudo-tones used by the same user is greater than a discrimination threshold;
the characteristic information extraction module is configured to input the training audio data to the emotion extractor, the content extractor, and the prosody extractor respectively for calculation, to obtain emotion characteristic information output by the emotion extractor, content characteristic information output by the content extractor, and prosody characteristic information output by the prosody extractor; and to input the training audio data and the reference role label to the tone extractor for calculation, to obtain tone characteristic information output by the tone extractor;
the audio data reconstruction module is configured to input the emotion characteristic information, the content characteristic information, the prosody characteristic information, and the tone characteristic information to the spectrum reconstruction model for spectrum reconstruction to obtain sound spectrum characteristic information, and to perform frequency-time transformation on the sound spectrum characteristic information to obtain reconstructed audio data;
and the back propagation module is configured to calculate a first loss function value based on the reconstructed audio data and the training audio data, and to jointly adjust the model parameters of the emotion extractor, the content extractor, the prosody extractor, the tone extractor, and the spectrum reconstruction model with the goal of minimizing the first loss function value.
According to a twelfth aspect, the apparatus further comprises:
the loss function value calculation module is configured to input the tone characteristic information to the first classifier for calculation to obtain a first role label, and calculate a second loss function value based on the first role label and the reference role label; and to input the emotion characteristic information to the second classifier for calculation to obtain a second role label, and calculate a third loss function value based on the second role label and the reference role label;
the tone extractor training module is configured to adjust the model parameters of the tone extractor with the goals of minimizing the second loss function value and minimizing the mutual information between the tone characteristic information and the emotion characteristic information;
and the emotion extractor training module is configured to adjust the model parameters of the emotion extractor with the goals of maximizing the third loss function value and minimizing the mutual information.
Any one implementation manner of the twelfth aspect and the twelfth aspect corresponds to any one implementation manner of the second aspect and the second aspect, respectively. For technical effects corresponding to any one of the implementations of the twelfth aspect and the twelfth aspect, reference may be made to the technical effects corresponding to any one of the implementations of the second aspect and the second aspect, and details are not described here again.
Drawings
Fig. 1 is a schematic diagram of an exemplary application scenario;
FIG. 2 is a schematic diagram of an exemplary application scenario;
FIG. 3 is a schematic diagram of an exemplary process;
FIG. 4 is a schematic diagram of an exemplary process;
FIG. 5 is a schematic structural diagram of an exemplary illustrative model;
FIG. 6 is a schematic diagram of an exemplary training process;
FIG. 7 is a schematic diagram of an exemplary training process;
FIG. 8 is a schematic diagram illustrating an exemplary information extraction process;
FIG. 9a is a schematic diagram of an exemplary information extraction process;
FIG. 9b is a diagram illustrating an exemplary information matching process;
FIG. 10 is a schematic diagram of an exemplary illustrated timbre conversion;
FIG. 11 is a schematic diagram of an exemplary process;
FIG. 12 is a schematic diagram of an exemplary information extraction process;
FIG. 13a is a schematic diagram of an exemplary illustrated speech processing apparatus;
FIG. 13b is a schematic diagram of an exemplary voice processing apparatus;
FIG. 14 is a schematic diagram of an exemplary training apparatus;
fig. 15 is a schematic structural diagram of an exemplary illustrated apparatus.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone.
The terms "first" and "second," and the like in the description and in the claims of the embodiments of the present application, are used for distinguishing between different objects and not for describing a particular order of the objects. For example, the first target object and the second target object, etc. are specific sequences for distinguishing different target objects, rather than describing target objects.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the use of the word "exemplary" or "for example" is intended to present the related concepts in a concrete fashion.
In the description of the embodiments of the present application, the meaning of "a plurality" means two or more unless otherwise specified. For example, a plurality of processing units refers to two or more processing units; the plurality of systems refers to two or more systems.
Fig. 1 is a schematic diagram of an exemplary application scenario.
Referring to fig. 1, one possible scenario is, illustratively, a scenario in which a user plays an audio book.
Referring to fig. 1, for example, a user may launch an audio book application in a mobile phone and enter an audio book application home interface 101, where the audio book application home interface 101 may include one or more controls, including but not limited to: a search box, a search option, a list of reads, a text reading option, an audio reading option 102, and the like. For example, the user may enter a query term in the search box and click on a search option to query for the desired reading. For example, the user may perform a page turning operation or a sliding operation on the reading list to find the reading to be read. For example, after the user finds the reading material to be read in the reading material list or searches the reading material to be read, the user can click the text reading option and enter the text reading interface to read the text. For example, after finding the reading to be read in the reading list or searching the reading to be read, the user may click the audio reading option 102, and the mobile phone may respond to the operation behavior of the user and play the audio data corresponding to the reading to be read.
The audio book application includes unicast audio books and multicast audio books. A unicast audio book is an audio book in which a single voice actor performs and records multiple characters by using pseudo-voices, so the tone discrimination between characters is low and the expressiveness of the scenario performance is low. A multicast audio book is an audio book in which multiple voice actors record the characters assigned to them, so the tone discrimination between characters is high and the expressiveness of the scenario performance is high. If the tone discrimination between two tones is greater than or equal to a discrimination threshold, the two tones can be determined to have high discrimination; if it is smaller than the discrimination threshold, they can be determined to have low discrimination. The discrimination threshold may be set as required, which is not limited in this application. Illustratively, if the user clicks the audio reading option 102 of a unicast audio book, then in the present application, after receiving the click on the audio reading option 102, the mobile phone may respond to the user's operation, convert the unicast audio book into a multicast audio book, and then play the converted multicast audio book, thereby improving the scenario performance expressiveness of the audio book and helping the user quickly and fully understand the plot.
Fig. 2 is a schematic diagram of an exemplary application scenario.
Referring to fig. 2, illustratively, one possible scenario is one in which an audiobook producer makes an audiobook.
Referring to fig. 2, for example, an audio book producer can start the audio book production platform and enter the audio book production platform main interface 201 to produce an audio book. For example, the audio book production platform main interface 201 may include one or more controls, including but not limited to a produced audio book option, an audio book production option 202, and the like. For example, after the user clicks a produced audio book option, the electronic device may, in response to the user's operation, enter an editing interface of the produced audio book and edit it, for example, change its name or cover. Illustratively, the user can click the audio book production option 202, the electronic device can enter the audio book production interface in response to the user's operation, and the user can produce an audio book (e.g., a unicast audio book or a multicast audio book) on the audio book production interface.
For example, in the prior art, multiple voice actors need to record separately according to the characters in the book, which takes a long time and is inefficient, whereas a unicast audio book only requires a single voice actor to perform multiple characters. Therefore, in the present application, an audio book producer can first record a unicast audio book and then convert the recorded unicast audio book into a multicast audio book. This reduces the production time and cost of the multicast audio book and improves production efficiency.
For example, in the present application, for a unicast audio book that has been produced in the audio book production platform, the audio book producer may perform a conversion operation to convert the produced unicast audio book into a multicast audio book. Subsequently, after the user enters the main interface of the audio book application and clicks the audio reading option 102 of a certain multicast book, the mobile phone can respond to the operation behavior of the user and quickly play the audio data of the multicast audio book.
It should be noted that the reading material in the present application may be any reading material having a plurality of characters, such as a novel, a story, or an essay.
How a unicast audio book is converted into a multicast audio book is described below by way of example.
Illustratively, a unicast audio reading may include unicast audio data and reading text. For example, the unicast audio data may include a plurality of pieces of source audio data, each piece of source audio data including a plurality of frames of audio data, each piece of source audio data corresponding to one audio sentence. Correspondingly, the reading text comprises a plurality of text sentences.
Fig. 3 is a schematic diagram of an exemplary process.
Referring to fig. 3, an electronic device may illustratively include a character tone analysis module and a character tone transformation module. It should be understood that the electronic device shown in fig. 3 is only one example of an electronic device, and the electronic device may have more or fewer modules than those shown in the figure, and the application is not limited thereto.
Illustratively, the character tone analysis module is configured to analyze target tone color characteristic information that matches respective pieces of source audio data in the unicast audio data.
Illustratively, the character tone color conversion module is configured to convert the tone color of each piece of source audio data in the unicast audio data.
For example, the application may convert a unicast audio book into a multicast audio book through a character tone analysis module and a character tone conversion module in the electronic device, and the process may be as follows:
s301, inputting the unicast audio data of the unicast audio book to the character tone color analysis module.
In one possible scenario, when a click by the user on the audio reading option 102 in fig. 1 is received, the unicast audio data of the unicast audio book corresponding to the audio reading option 102 may be acquired in response to the user's operation, and the acquired unicast audio data is then input to the character tone analysis module.
In one possible scenario, when a conversion operation performed by the audio book producer on the produced unicast audio book in fig. 2 is received, the unicast audio data of the unicast audio book corresponding to the conversion operation may be acquired in response to the user's operation, and the acquired unicast audio data is then input to the character tone analysis module.
In one possible scenario, when a conversion operation performed by the audio book producer on the audio book production interface, for a unicast audio book recorded during the production of a multicast audio book, is received, the unicast audio data of the unicast audio book corresponding to the conversion operation may be acquired in response to the user's operation, and the acquired unicast audio data is input to the character tone analysis module.
S302, the role tone analysis module outputs N pieces of source audio data in the unicast audio data and N groups of correspondingly matched target tone characteristic information.
Wherein N is a positive integer.
Fig. 4 is a schematic diagram illustrating an exemplary process.
Referring to fig. 4, for example, after receiving the unicast audio data, the role tone analysis module may perform audio statement division on the unicast audio data to obtain N pieces of source audio data. And then extracting tone color characteristic information of the N pieces of source audio data respectively to obtain N groups of source tone color characteristic information corresponding to the N pieces of source audio data, wherein one group of source tone color characteristic information corresponds to one piece of source audio data. And respectively matching based on N groups of source tone characteristic information corresponding to the N pieces of source audio data, and determining N groups of target tone characteristic information corresponding to the N pieces of source audio data, wherein one group of target tone characteristic information corresponds to one piece of source audio data.
For example, the tone color feature information may be used to characterize information related to tone color, and may include, but is not limited to: personality characteristics, gender characteristics, age characteristics, and range (e.g., treble, midrange, and bass) characteristics, etc., to which this application is not limited.
Audio sentence division
For example, the reading material may include a plurality of chapters, and the unicast audio data may be subjected to audio chapter division to obtain W (W is a positive integer) sections of chapter audio data, where each section of chapter audio data corresponds to one chapter. Then, each section of chapter audio data in the W sections of chapter audio data can be divided into R (R is a positive integer) pieces of source audio data, so as to improve the efficiency of audio sentence division. Wherein N = W × R.
Illustratively, each frame of audio data in the unicast audio data has a chapter identification that uniquely identifies the chapter. Furthermore, in a possible manner, the character tone analysis module may determine each frame of audio data belonging to the same chapter through a chapter identifier of each frame of audio data in the unicast audio data, so that the unicast audio data may be divided into W sections of chapter audio data.
Illustratively, there is a pause between two adjacent chapters of the reading material and, correspondingly, a pause between two adjacent sections of chapter audio data in the unicast audio data. The pause between two adjacent sections of chapter audio data may be a first preset duration, and the first preset duration may be set as required, which is not limited in this application. Furthermore, in a possible manner, the character tone analysis module may use Voice Activity Detection (VAD) to divide the unicast audio data into the W sections of chapter audio data.
For example, the character timbre analysis module may detect whether a time interval between two adjacent frames of audio data is greater than or equal to a first preset time duration by using VAD. If the time interval between two adjacent frames of audio data is greater than or equal to the first preset time length, chapter division can be performed between the two adjacent frames of audio data, the previous frame of audio data is divided into the end of the chapter audio data of the previous chapter, and the next frame of audio data is divided into the beginning of the chapter audio data of the next chapter.
For example, text recognition may be performed on unicast audio data to obtain a corresponding text recognition result. And then, text analysis can be carried out based on the text recognition result, and audio chapter division is carried out on the unicast audio data to obtain W-section chapter audio data. For example, when a chapter separator text (e.g., "chapter i", etc.) is detected, an audio time interval corresponding to the chapter separator text in the unicast audio data may be determined, and audio chapter separation may be performed based on an end point of the audio time interval corresponding to the chapter separator text.
Illustratively, the unicast audio data may be stored in chapters, and at this time, a section of chapter audio data corresponding to one chapter may be directly obtained each time, and then the audio data of the chapter is divided into R pieces of source audio data.
It should be noted that the unicast audio data may also be divided into audio chapters in other manners, which is not limited in this application.
Illustratively, there is a pause between two adjacent sentences in the reading material and, correspondingly, a pause between two adjacent pieces of source audio data in each section of chapter audio data. The pause between two adjacent pieces of source audio data is longer than a second preset duration and shorter than the first preset duration, where the second preset duration is shorter than the first preset duration and may be set as required, which is not limited in this application. Furthermore, in a possible manner, for the kth section of chapter audio data among the W sections (k is a positive integer ranging from 1 to W inclusive, that is, k may be 1 or W), the character tone analysis module may use VAD to divide the kth section of chapter audio data into R pieces of source audio data.
For example, for the kth section of chapter audio data, the character tone analysis module may detect, through VAD, whether the time interval between two adjacent frames of audio data in the kth section of chapter audio data is greater than or equal to the second preset duration. If the time interval between two adjacent frames of audio data is greater than or equal to the second preset duration, audio sentence division can be performed between the two adjacent frames: the previous frame of audio data becomes the end of the source audio data of the previous sentence, and the next frame of audio data becomes the beginning of the source audio data of the next sentence.
It should be noted that, alternatively, the audio chapter division may be skipped and audio sentence division may be performed on the unicast audio data directly, which is not limited in this application.
Furthermore, by performing audio sentence division on the unicast audio data in the above manner, the N pieces of source audio data can be obtained.
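As a simplified, energy-based stand-in for the VAD-based splitting described above, the sketch below treats frames whose energy falls below a threshold as silence and splits wherever a silent gap lasts at least the preset duration (the first preset duration at the chapter level, the second at the sentence level). A real VAD implementation would normally be used instead; the threshold values are illustrative.

```python
import numpy as np

def split_on_pauses(audio: np.ndarray, sample_rate: int,
                    min_pause_s: float, frame_s: float = 0.02,
                    energy_threshold: float = 1e-4) -> list[np.ndarray]:
    """Split audio wherever a run of low-energy frames lasts at least
    min_pause_s (the preset duration: chapter level or sentence level)."""
    frame_len = int(frame_s * sample_rate)
    n_frames = len(audio) // frame_len
    energies = np.array([np.mean(audio[i * frame_len:(i + 1) * frame_len] ** 2)
                         for i in range(n_frames)])
    silent = energies < energy_threshold
    pieces, start, gap = [], 0, 0
    min_gap_frames = int(min_pause_s / frame_s)
    for i, is_silent in enumerate(silent):
        gap = gap + 1 if is_silent else 0
        if gap == min_gap_frames:                    # pause long enough: split here
            end = (i - min_gap_frames + 1) * frame_len
            if end > start:
                pieces.append(audio[start:end])
            start = (i + 1) * frame_len
    pieces.append(audio[start:])
    return pieces
```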
Timbre feature information extraction
For example, the tone extractor may be trained in advance, and then the trained tone extractor is used to extract tone characteristic information, so as to obtain N sets of source tone characteristic information corresponding to N pieces of source audio data.
Fig. 5 is a schematic structural diagram of an exemplary model.
Referring to fig. 5, an exemplary conversion model may include: a tone extractor, an expressive force extraction module, a content extractor, and a spectrum reconstruction model. Exemplarily, the expressive force extraction module includes, but is not limited to, a prosody extractor and an emotion extractor. It should be understood that the conversion model shown in fig. 5 is only one example of a conversion model, and a conversion model may have more or fewer modules than shown in the figure, which is not limited in this application.
Illustratively, the tone extractor, the prosody extractor, the emotion extractor, the content extractor and the spectral reconstruction model in the conversion model can be jointly trained.
Illustratively, training audio data with high expressiveness, recorded by a plurality of users using their own timbres, may be collected, where each user records at least one piece of training audio data. A plurality of pieces of training audio data with high expressiveness, recorded by a plurality of users using pseudo-tones with high timbre discrimination, may also be collected, where each user records with at least one pseudo-tone, and each user records at least one piece of training audio data with each pseudo-tone used. For example, each recording may include at least one piece of training audio data, and each piece of training audio data corresponds to one sentence.
For example, high expressiveness may mean that the expressiveness characteristic information satisfies the expressiveness condition. The expressive force characteristic information can comprise prosodic characteristic information and emotional characteristic information, the expressive force conditions comprise prosodic conditions and emotional conditions, and the high expressive force can mean that the prosodic characteristic information meets the prosodic conditions and the emotional characteristic information meets the emotional conditions. The expressive force condition, the prosodic condition and the emotional condition may be set as required, which is not limited in this application.
For example, for each piece of collected training audio data, a corresponding reference tone label may be added to the piece of training audio data based on the role information of the role corresponding to the tone used by the user. The role information includes, but is not limited to: gender, age, personality, vocal range, and the like. Illustratively, the role information may be encoded (e.g., by one-hot encoding) to obtain the reference tone label.
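The following minimal sketch illustrates how role information could be encoded into a reference tone label by one-hot encoding; the field vocabularies (gender, age group, vocal range) are hypothetical examples, not values specified by the present application.

```python
import numpy as np

# Hypothetical vocabularies for the role information fields; the actual fields
# and values are not specified here.
GENDERS = ["female", "male"]
AGE_GROUPS = ["child", "young", "middle-aged", "old"]
VOCAL_RANGES = ["low", "medium", "high"]

def one_hot(value, vocabulary):
    vec = np.zeros(len(vocabulary), dtype=np.float32)
    vec[vocabulary.index(value)] = 1.0
    return vec

def reference_tone_label(gender, age_group, vocal_range):
    """Concatenate the one-hot codes of the role information fields into one label."""
    return np.concatenate([one_hot(gender, GENDERS),
                           one_hot(age_group, AGE_GROUPS),
                           one_hot(vocal_range, VOCAL_RANGES)])

label = reference_tone_label("female", "young", "high")  # a 9-dimensional label
```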
For example, a piece of training audio data and a reference tone label corresponding to the piece of training audio data may be used as a set of training data, and then a plurality of sets of training data may be obtained. The conversion model may then be trained using multiple sets of training data. The present application exemplifies training of a conversion model with a set of training data.
Fig. 6 is a schematic diagram of an exemplary training process.
Illustratively, the training audio data in the training data may be input to the prosody extractor, the emotion extractor, and the content extractor, respectively.
Illustratively, training audio data in the training data and corresponding reference timbre labels may be input to the timbre extractor.
For example, after receiving the training audio data and the corresponding reference tone label, the tone extractor may perform forward calculation on the training audio data and the reference tone label, and output tone characteristic information to the spectral reconstruction model. For example, the tone color feature information may be represented in a vector or a sequence.
For example, after receiving the training audio data, the prosody extractor may perform forward calculation on the training audio data and output prosody feature information to the spectrum reconstruction model. For example, the prosody feature information can be used to characterize vocal performance skills such as stress, pacing and cadence, and can be represented by a vector or a sequence.
For example, after receiving the training audio data, the emotion extractor may perform forward calculation on the training audio data and output emotion feature information to the spectrum reconstruction model. For example, the emotion feature information may be used to characterize the emotion type (e.g., happy, sad, excited, depressed) and the attitude tendency (e.g., positive, negative, commendatory, ironic), and may be represented by a vector or a sequence.
For example, after receiving the training audio data, the content extractor may perform forward calculation on the training audio data, and output content feature information to the spectral reconstruction model. Illustratively, the content characteristic information may be used to characterize the speech content. Illustratively, the content feature information may be a phoneme feature, and may be represented in a vector or a sequence. Illustratively, the content feature information may be a phoneme posterior probability (i.e., a probability distribution of phonemes), and may be represented by a matrix.
In a possible mode, after the spectrum reconstruction model receives the tone color feature information, the prosody feature information, the emotion feature information and the content feature information, spectrum reconstruction can be performed on the basis of the tone color feature information, the prosody feature information, the emotion feature information and the content feature information, and audio spectrum feature information is output. For example, the audio spectral feature information output by the spectral reconstruction model may be subjected to time domain conversion, so as to obtain reconstructed audio data of the audio spectral feature information in a time domain. That is, the spectral reconstruction model performs only spectral reconstruction, while the time domain conversion is performed by the other modules.
In a possible mode, after the spectrum reconstruction model receives the tone color feature information, the prosodic feature information, the emotional feature information and the content feature information, the spectrum reconstruction model may perform spectrum reconstruction based on the tone color feature information, the prosodic feature information, the emotional feature information and the content feature information to obtain audio spectrum feature information, and then perform time domain conversion on the audio spectrum feature information to obtain reconstructed audio data of the audio spectrum feature information in a time domain and output the reconstructed audio data. That is, spectral reconstruction and frequency-time transformation are performed by the spectral reconstruction model.
It should be noted that, the present application does not limit whether the spectral reconstruction model only performs spectral reconstruction, or performs spectral reconstruction and frequency-time transformation.
For example, the reconstructed audio data of the audio spectral feature information in the time domain may be compared with the training audio data in the training data to calculate a corresponding first loss function value. And then, aiming at minimizing the first loss function value, adjusting model parameters of a tone extractor, a rhythm extractor, an emotion extractor, a content extractor and a spectrum reconstruction model in the conversion model.
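A minimal sketch of one joint training step is given below, assuming the five modules are PyTorch modules whose parameters are all registered with a single optimizer; the L1 reconstruction loss and the separate frequency-to-time conversion step (vocoder) are assumptions, since the exact loss form is not specified here.

```python
import torch
import torch.nn.functional as F

def joint_training_step(batch, timbre_ext, prosody_ext, emotion_ext,
                        content_ext, spec_recon, vocoder, optimizer):
    """One joint update of the conversion model with the first loss function.
    All module names are placeholders for torch.nn.Module instances; `vocoder`
    performs the frequency-to-time conversion outside the spectrum model."""
    audio, tone_label = batch["audio"], batch["tone_label"]

    timbre = timbre_ext(audio, tone_label)   # tone feature information
    prosody = prosody_ext(audio)             # prosody feature information
    emotion = emotion_ext(audio)             # emotion feature information
    content = content_ext(audio)             # content feature information

    spec = spec_recon(timbre, prosody, emotion, content)  # audio spectrum features
    recon_audio = vocoder(spec)              # reconstructed audio in the time domain

    # First loss function value: distance between reconstructed and training audio
    # (an L1 loss is assumed; the exact loss is not specified).
    loss = F.l1_loss(recon_audio, audio)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```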
For example, referring to the above manner, the conversion model may be trained with each set of training data until the first loss function value satisfies a first loss condition, or the number of training times of each module in the conversion model satisfies the corresponding training frequency condition, or the performance of each module in the conversion model satisfies the corresponding performance condition. For example, the first loss condition, the training frequency condition, and the performance condition may be set as required, and the present application is not limited thereto. The training frequency conditions of different modules in the conversion model may be different or the same; the performance conditions of different modules may also be different, which is not limited in this application.
It should be noted that the content extractor may also be trained independently of the tone extractor, the prosody extractor, the emotion extractor, and the spectrum reconstruction model; this is not limited by the present application.
For example, most pseudo voices are produced by adjusting shallow features such as speech rate, fundamental frequency and formants, and emotion is often expressed through the same skills, so the tone feature information and the emotion feature information overlap. Therefore, on the basis of the above training of the tone extractor and the emotion extractor, the following method may be adopted to further train the tone extractor and the emotion extractor, so as to reduce the overlap between the tone feature information and the emotion feature information and to decouple the tone feature information output by the tone extractor from the emotion feature information output by the emotion extractor.
Fig. 7 is a schematic diagram of an exemplary training process.
Illustratively, the tone extractor performs forward calculation based on training audio data in the training data and a reference tone label, and after obtaining tone characteristic information, may output the tone characteristic information to the first classifier and the mutual information module, respectively.
Illustratively, the emotion extractor performs forward calculation based on training audio data in the training data, and after obtaining emotion feature information, may output the emotion feature information to the second classifier and the mutual information module, respectively.
For example, the first classifier may perform a calculation based on the tone color feature information, and output a first tone color label.
For example, the second classifier may perform a calculation based on the emotional feature information and output a second timbre label.
Illustratively, the mutual information module may perform calculation based on the tone color feature information and the emotion feature information, and calculate mutual information between the tone color feature information and the emotion feature information. The mutual information may be an information amount related to another variable contained in one variable, and the mutual information between the tone color feature information and the emotion feature information refers to an information amount where the tone color feature information contains emotion feature information, or an information amount where the emotion feature information contains tone color feature information.
For example, a second loss function value may be calculated based on the first timbre label and the reference timbre label in the training data, and a third loss function value may be calculated based on the second timbre label and the reference timbre label in the training data. Illustratively, the model parameters of the tone extractor are adjusted with the goal of minimizing the second loss function value and the mutual information. Illustratively, the model parameters of the emotion extractor are adjusted with the goal of minimizing the mutual information and maximizing the third loss function value. In this way, the difference between the emotion feature information extracted by the emotion extractor and the tone feature information extracted by the tone extractor can be increased, so that the two kinds of feature information are decoupled.
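The following sketch illustrates one way the decoupling objectives could be implemented; `mi_estimator`, `clf1`, `clf2` and the integer class index `tone_id` are assumed components (the mutual information estimator and the classifiers are taken as trained elsewhere and only provide the training signal here).

```python
import torch
import torch.nn.functional as F

def decoupling_step(batch, timbre_ext, emotion_ext, clf1, clf2,
                    mi_estimator, opt_timbre, opt_emotion):
    """One decoupling update for the timbre extractor and the emotion extractor.
    `mi_estimator` is an assumed module returning a differentiable estimate of
    the mutual information between two batches of features."""
    audio, tone_label, tone_id = batch["audio"], batch["tone_label"], batch["tone_id"]

    # Timbre extractor: minimise the second loss (classifier 1 should recover
    # the tone label from timbre features) plus the mutual information.
    timbre = timbre_ext(audio, tone_label)
    emotion = emotion_ext(audio).detach()
    second_loss = F.cross_entropy(clf1(timbre), tone_id)
    timbre_obj = second_loss + mi_estimator(timbre, emotion)
    opt_timbre.zero_grad()
    timbre_obj.backward()
    opt_timbre.step()

    # Emotion extractor: minimise the mutual information while maximising the
    # third loss (classifier 2 should fail to recover the tone label from the
    # emotion features), removing timbre information from the emotion features.
    timbre = timbre_ext(audio, tone_label).detach()
    emotion = emotion_ext(audio)
    third_loss = F.cross_entropy(clf2(emotion), tone_id)
    emotion_obj = mi_estimator(timbre, emotion) - third_loss
    opt_emotion.zero_grad()
    emotion_obj.backward()
    opt_emotion.step()
    return second_loss.item(), third_loss.item()
```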
For example, in the above manner, the tone extractor and the emotion extractor may be trained using each set of training data. The training of the tone extractor may be stopped when the second loss function value satisfies a second loss condition, or the number of training times of the tone extractor satisfies the training frequency condition of the tone extractor, or the performance of the tone extractor satisfies the performance condition of the tone extractor. The training of the emotion extractor may be stopped when the third loss function value satisfies a third loss condition, or the number of training times of the emotion extractor satisfies the training frequency condition of the emotion extractor, or the performance of the emotion extractor satisfies the performance condition of the emotion extractor. For example, the second loss condition and the third loss condition may be set as required, which is not limited in the embodiments of the present application.
After each module in the conversion model finishes training, N pieces of source audio data in the unicast audio data can be sequentially input into the trained tone extractor, and N groups of source tone characteristic information corresponding to the N pieces of source audio data are extracted by the tone extractor.
Fig. 8 is a schematic diagram illustrating an exemplary information extraction process.
Referring to fig. 8, exemplary, 5 pieces of source audio data are shown in fig. 8: source audio data 1, source audio data 2, source audio data 3, source audio data 4, and source audio data 5. The source audio data 1 is input to the trained tone extractor, and tone characteristic information a, that is, source tone characteristic information corresponding to the source audio data 1, can be output. The source audio data 2 is input to the trained timbre extractor, and timbre feature information B, which is source timbre feature information corresponding to the source audio data 2, can be output. The source audio data 3 is input to the trained tone extractor, and tone characteristic information a, that is, source tone characteristic information corresponding to the source audio data 3, can be output. The source audio data 4 is input to the trained tone extractor, and tone characteristic information B, that is, source tone characteristic information corresponding to the source audio data 4, can be output. The source audio data 5 is input to the trained tone extractor, and tone characteristic information C, that is, source tone characteristic information corresponding to the source audio data 5, can be output. Illustratively, the tone characteristic information of the source audio data 1 and the source audio data 3 are the same, that is, the source audio data 1 and the source audio data 3 are audio data of the same character. Illustratively, the tone characteristic information of the source audio data 2 and the source audio data 4 are the same, that is, the source audio data 2 and the source audio data 4 are audio data of the same character.
Illustratively, the training audio data includes training audio data recorded by different users whose timbres have high discrimination from one another, as well as training audio data recorded by the same user using a plurality of pseudo voices with high discrimination; that is, the timbres corresponding to the pieces of training audio data are highly distinguishable. Therefore, in order to improve the timbre discrimination of different roles in the unicast audio data, the timbre of each piece of source audio data in the unicast audio data can be converted into the timbre, among the timbres corresponding to the training audio data, that matches the timbre of that piece of source audio data.
Fig. 9a is a schematic diagram illustrating an exemplary information extraction process.
Referring to fig. 9a, for example, the training audio data in each set of training data may be input to a trained tone extractor, and corresponding tone characteristic information is output, and for convenience of distinction, the tone characteristic information extracted by the tone extractor for the training audio data may be referred to as reference tone characteristic information.
For example, for each set of training data, the training audio data in the set of training data may be divided into a plurality of pieces of training audio data in the manner described above. Then, the trained tone extractor is used to extract the reference tone feature information of each piece of training audio data. Illustratively, the training audio data includes audio data recorded in M (M is a positive integer) timbres (including the users' own timbres and the pseudo voices), and each timbre corresponds to a plurality of pieces of training audio data. Illustratively, for the r-th timbre of the M timbres (r is a positive integer ranging from 1 to M, the range including both 1 and M), the reference tone feature information of the plurality of pieces of training audio data recorded with the r-th timbre may be weighted and combined to obtain the reference tone feature information corresponding to the r-th timbre, which may be referred to as the r-th group of reference tone feature information. For example, the weighted calculation may simply be an average. Furthermore, in the above manner, M groups of reference tone feature information can be obtained.
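A minimal sketch of building the M groups of reference tone feature information by averaging per-timbre embeddings (i.e. equal weights) is shown below; `timbre_extractor` stands for the trained tone extractor and is assumed to return a NumPy feature vector per piece of audio.

```python
import numpy as np

def build_reference_timbres(timbre_extractor, recordings_by_timbre):
    """recordings_by_timbre maps each of the M timbres to its pieces of training
    audio data; the result is one group of reference tone feature information
    per timbre, here a plain mean (equal weights)."""
    reference = {}
    for timbre_id, pieces in recordings_by_timbre.items():
        features = np.stack([timbre_extractor(piece) for piece in pieces])
        reference[timbre_id] = features.mean(axis=0)
    return reference
```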
For example, the reference tone characteristic information matching the source tone characteristic information corresponding to each source audio data may be respectively searched from the sets of reference tone characteristic information, so as to search for a tone matching the tone of each source audio data in the unicast audio data from the tone corresponding to the training audio data. For convenience of description, reference tone color feature information that matches source tone color feature information corresponding to source audio data may be referred to as target tone color feature information.
For example, the ith (i is a positive integer and the value range is 1 to N) piece of source audio data in the N pieces of source audio data is taken as an example for illustration. For example, the similarity between M groups of reference tone color feature information and the source tone color feature information of the ith piece of source audio data may be calculated, and the reference tone color feature information having the highest similarity to the source tone color feature information of the ith piece of source audio data may be used as the target tone color feature information. In this way, N sets of target tone characteristic information matched with the N pieces of source audio data can be searched from the M sets of reference tone characteristic information. For example, the target timbre characteristic information matched with different source audio data may be the same or different.
For example, distance information between M groups of reference tone color feature information and source tone color feature information of the ith piece of source audio data may be calculated, respectively, and the distance information may be used as a similarity between the source tone color feature information of the ith piece of source audio data and the reference tone color feature information. In one possible approach, the distance information is inversely proportional to the similarity, i.e., the greater the distance information, the lower the similarity; the smaller the distance information, the higher the similarity.
For example, the distance information between the source tone feature information of the source audio data and each group of reference tone feature information may be determined by calculating a Euclidean distance, a cosine similarity, a Minkowski distance, or the like between them, which is not limited in this application.
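The matching step can be sketched as follows, here using cosine similarity (higher means more similar) as an assumed choice among the measures listed above; the dictionary of reference features is assumed to come from the previous step.

```python
import numpy as np

def match_target_timbre(source_feature, reference_features):
    """Pick, from the M groups of reference tone feature information, the group
    most similar to the given source tone feature information."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    best_id = max(reference_features,
                  key=lambda tid: cosine(source_feature, reference_features[tid]))
    return best_id, reference_features[best_id]
```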
Fig. 9b is a schematic diagram illustrating an exemplary information matching process.
Referring to fig. 9b, exemplary description is given by taking an example of finding matching target timbre characteristic information from M sets of reference timbre characteristic information for the source timbre characteristic information of the source audio data 3.
Continuing to refer to fig. 9b, illustratively, the source tone feature information of the source audio data 3 is the tone feature information A. Distance information 1 may be obtained by calculating the distance between the tone feature information A and the reference tone feature information 1, distance information 2 may be obtained by calculating the distance between the tone feature information A and the reference tone feature information 2, and so on, until distance information M is obtained from the reference tone feature information M. The M pieces of distance information may then be compared to determine the minimum distance information. Assuming that the distance information 2 is the minimum, it can be determined that the reference tone feature information 2 matches the tone feature information A, that is, the reference tone feature information 2 is the target tone feature information matching the source tone feature information of the source audio data 3.
S303, the character tone conversion module outputs the multicast audio data of the multicast audio book.
For example, the character tone conversion module may convert the unicast audio book into the multicast audio book, that is, convert the unicast audio data into the multicast audio data, based on the target tone characteristic information of each source audio data, using the trained prosody extractor, emotion extractor, content extractor, and spectral reconstruction model.
For example, each source audio data in N pieces of source audio data of the unicast audio data may be sequentially input to the trained prosody extractor, and the trained prosody extractor extracts N sets of prosody feature information corresponding to the N pieces of source audio data.
For example, each piece of source audio data in N pieces of source audio data of the unicast audio data may be sequentially input to the trained emotion extractor, and N sets of emotion feature information corresponding to the N pieces of source audio data are extracted by the trained emotion extractor.
For example, each piece of source audio data in N pieces of source audio data of the unicast audio data may be sequentially input to the trained content extractor, and N sets of content feature information corresponding to the N pieces of source audio data are extracted by the trained content extractor.
In the following, the ith source audio data of the N source audio data is taken as an example to illustrate how to convert the character tone of the source audio data.
For example, the emotion feature information extracted by the trained emotion extractor for the ith piece of source audio data, the prosody feature information extracted by the trained prosody extractor for the ith piece of source audio data, the content feature information extracted by the trained content extractor for the ith piece of source audio data, and the target tone feature information correspondingly matched with the ith piece of source audio data can be input into the trained spectrum reconstruction model.
In a possible manner, the trained spectrum reconstruction model performs spectrum reconstruction based on the emotion feature information, prosody feature information, content feature information and target tone feature information of the ith piece of source audio data, and may obtain and output the ith group of audio spectrum feature information. Then, time domain conversion is performed on the ith group of audio spectrum feature information to obtain the ith piece of target audio data after tone conversion.
Fig. 10 is a schematic diagram illustrating tone color conversion.
Referring to fig. 10, for example, the source audio data 3 may be input to the prosody extractor, the emotion extractor, and the content extractor, respectively, to obtain the prosody feature information 3 output by the prosody extractor, the emotion feature information 3 output by the emotion extractor, and the content feature information 3 output by the content extractor. Illustratively, the prosody feature information 3, the emotion feature information 3, the content feature information 3, and the target tone feature information correspondingly matched with the source audio data 3 (that is, the reference tone feature information 2) are input to the spectrum reconstruction model. The spectrum reconstruction model can perform spectrum reconstruction based on the prosody feature information 3, the emotion feature information 3, the content feature information 3 and the reference tone feature information 2, and output the audio spectrum feature information 3. Then, frequency-time conversion is performed on the audio spectrum feature information 3 to obtain the target audio data 3. The target audio data 3 is the audio data obtained by performing tone conversion on the corresponding source audio data 3.
In another possible manner, the trained spectrum reconstruction model performs spectrum reconstruction based on the emotion feature information, prosody feature information, content feature information and target tone feature information of the ith piece of source audio data to obtain the ith group of audio spectrum feature information, and then performs time domain conversion on the ith group of audio spectrum feature information to obtain and output the ith piece of target audio data after tone conversion.
For example, in the above manner, N pieces of target audio data may be obtained, and the N pieces of target audio data may then be spliced to obtain the multicast audio data, that is, the audio data of the multicast audio book.
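The overall conversion and splicing flow can be sketched as follows; the extractor, spectrum reconstruction and vocoder callables stand in for the trained modules, and simple concatenation stands in for the splicing, whose exact form is not specified here.

```python
import numpy as np

def convert_unicast_to_multicast(source_pieces, target_timbres, prosody_ext,
                                 emotion_ext, content_ext, spec_recon, vocoder):
    """source_pieces: the N pieces of source audio data (numpy arrays);
    target_timbres: the N correspondingly matched target tone feature vectors.
    The callables stand in for the trained modules."""
    target_pieces = []
    for audio, timbre in zip(source_pieces, target_timbres):
        prosody = prosody_ext(audio)
        emotion = emotion_ext(audio)
        content = content_ext(audio)
        spec = spec_recon(timbre, prosody, emotion, content)
        target_pieces.append(vocoder(spec))   # frequency-to-time conversion
    # Splice the N pieces of target audio data into the multicast audio data.
    return np.concatenate(target_pieces)
```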
Because each piece of source audio data in the unicast audio data has high expressive force (including prosodic expressive force and emotional expressive force), spectral reconstruction is performed by extracting the emotion feature information and prosody feature information of each piece of source audio data and combining them with the correspondingly matched target tone feature information to generate the multicast audio data. In this way, the role tone of the source audio data can be converted while the expressive force of the source audio data in the unicast audio data is preserved, so that the unicast audio data can be converted into multicast audio data. Furthermore, the tone discrimination between the roles in the reading can be improved, so that the scene deduction expressive force of the audio book is improved and the user can understand the plot more easily. In addition:
On one hand, from the perspective of the user, compared with the prior art in which the user needs to manually select a matched timbre for each piece of source audio data, the present application can automatically match timbres for the source audio data, which can improve the efficiency of converting unicast audio data into multicast audio data, simplify user operations, and improve user experience.
On the other hand, from the perspective of the audio book producer, compared with the prior art in which a multicast audio book has to be recorded by multiple readers assigned to the roles, the present application only requires a single reader to record the unicast audio book while performing multiple roles through pseudo voices, and then converts the unicast audio book into a multicast audio book. This can reduce the production time and cost of multicast audio books and improve their production efficiency.
On yet another hand, from the perspective of the audio book producer, compared with the prior art in which an existing unicast audio book has to be re-recorded by multiple persons assigned to the roles to obtain a multicast audio book, the present application can directly convert the unicast audio book into a multicast audio book, with higher efficiency and lower cost.
Illustratively, the unicast audio data can be divided into sentences by combining with the reading text of the unicast audio reading, so that the accuracy of sentence division is improved, the accuracy of role analysis corresponding to the source audio data in the unicast audio data is further improved, and the accuracy of converting the unicast audio reading into the multicast audio reading is further improved.
Fig. 11 is a schematic diagram illustrating an exemplary process.
Referring to fig. 11, for example, an electronic device may include a character tone analysis module and a character tone conversion module. It should be understood that the electronic device shown in fig. 11 is only one example of an electronic device, and that the electronic device may have more or fewer modules than those shown in the figures, and the present application is not limited thereto.
For example, the functions of the role tone color analysis module and the role tone color conversion module may refer to the description above, and are not described herein again.
The process of converting a unicast audio book to a multicast audio book may be as follows:
S1101, inputting the unicast audio data and the reading text of the unicast audio book into the character tone color analysis module.
In one possible scenario, when the user's click on the audio reading option 102 in fig. 1 is received, the unicast audio data and the reading text of the unicast audio book corresponding to the audio reading option 102 may be acquired in response to the user operation, and the acquired unicast audio data and reading text are then input to the character tone color analysis module.
In one possible scenario, when a conversion operation performed by an audio book producer on a produced unicast audio book in the produced audio book list in fig. 2 is received, the unicast audio data and the reading text of the unicast audio book corresponding to the conversion operation may be acquired in response to the user operation, and the acquired unicast audio data and reading text are then input to the character tone color analysis module.
In a possible scenario, when a conversion operation performed by an audio book producer on the audio book production interface, for a unicast audio book recorded in the process of producing a multicast audio book, is received, the unicast audio data and the reading text of the unicast audio book corresponding to the conversion operation may be acquired in response to the user operation, and the acquired unicast audio data and reading text are then input to the character tone analysis module.
S1102, the role tone analysis module outputs N pieces of source audio data in the unicast audio data and N groups of correspondingly matched target tone characteristic information.
Fig. 12 is a schematic diagram illustrating an exemplary information extraction process.
Referring to fig. 12, for example, the reading text may be divided into text sentences to obtain N text sentences; then, in combination with the N text sentences, audio sentence division is performed on the unicast audio data to obtain N pieces of source audio data.
For example, in order to improve the efficiency of dividing the reading text into text sentences, the reading text may first be divided into text chapters to obtain W sections of chapter text. For example, text analysis may be performed on the reading text, and the reading text may be divided into W sections of chapter text at the chapter-distinguishing text (such as "Chapter One" and the like) in the reading text. For example, the division point of a chapter may be placed before or after its chapter-distinguishing text.
For example, after obtaining W sections of chapter text, for each section of chapter text, role name identification may be performed to identify roles included in the section of chapter text. And then, carrying out role dialogue segmentation on the section of chapter text to obtain a plurality of sections of dialogue texts, and analyzing roles corresponding to each section of dialogue. And then, text sentence division is carried out on each section of the dialogue text to obtain R text sentences, and the roles of the dialogue text to which each text sentence belongs are used as the corresponding roles of the text sentences.
For example, the reading text may be stored in chapters, and at this time, a section of chapter text corresponding to one chapter may be directly obtained each time, and then the text sentence division is performed on the chapter text, so as to obtain R text sentences.
For example, for a text statement with a determined role, the role name of the corresponding role can be used for identification. For text statements with undetermined roles, unknown role labels (e.g., unknown) may be used for identification.
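As a rough, non-authoritative sketch, the following function splits a chapter text into dialogue and narration sentences and attaches role names with a very simple heuristic; real role dialogue segmentation and role analysis would be considerably more elaborate, so the speech-verb list and attribution rule are illustrative assumptions only.

```python
import re

SPEECH_VERBS = ("said", "asked", "replied")

def split_and_label(chapter_text, known_roles):
    """Split a chapter text into text sentences and attach a role name.
    Quoted spans are dialogue; the rest is narration. A dialogue sentence
    inherits the role named in the preceding narration, if any; narration
    containing a speech verb is labelled as the voice-over."""
    sentences = []
    for part in re.split(r'("[^"]+")', chapter_text):
        part = part.strip()
        if not part:
            continue
        if part.startswith('"'):
            sentences.append({"text": part, "role": "Unknown", "dialogue": True})
        else:
            role = "voice-over" if any(v in part for v in SPEECH_VERBS) else "Unknown"
            sentences.append({"text": part, "role": role, "dialogue": False})
    for i, s in enumerate(sentences):
        if s["dialogue"] and i > 0:
            for role in known_roles:
                if role in sentences[i - 1]["text"]:
                    s["role"] = role
                    break
    return sentences
```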
For example, a chapter text is:
Zhang Xing said as usual: "Are you Wang?"
The disk rubbed and slid, very reluctantly, to the side of "Yes", and the pointer on the disk turned to point at the word.
"Age?"
"Twenty-three."
Text sentence division is performed on this section of chapter text, and each obtained text sentence and its corresponding role name can be as shown in table 1:
TABLE 1
Text sentence | Role name
"Zhang Xing said as usual:" | Voice-over
"Are you Wang?" | Zhang Xing
The disk rubbed and slid, very reluctantly, to the side of "Yes", and the pointer on the disk turned to point at the word. | Unknown
"Age?" | Unknown
"Twenty-three." | Wang
Referring to table 1, for example, the above chapter text is divided into 5 text sentences, and the role name of the corresponding role is identified for each text sentence, where "Unknown" is the unknown role label.
Illustratively, the text sentence "Zhang Xing said as usual:" corresponds to the role "voice-over". The text sentence "Are you Wang?" corresponds to the role "Zhang Xing". The text sentence describing the disk moving reluctantly to the side of "Yes" corresponds to an unknown role. The text sentence "Age?" corresponds to an unknown role. The text sentence "Twenty-three." corresponds to the role "Wang".
Illustratively, according to the above manner, text sentence division is performed on the W sections of chapter text, so that the reading text is divided into N text sentences.
For example, audio chapter division may be performed on the unicast audio data to obtain multiple sections of chapter audio data, as described above, which is not repeated here. Then, for each chapter, audio sentence division can be performed on the chapter audio data corresponding to the chapter based on the text sentences corresponding to the chapter. The present application takes one chapter as an example to illustrate how chapter audio data is divided into a plurality of pieces of source audio data.
Illustratively, text recognition may be performed on the chapter audio data to obtain a text recognition result. Then, based on the text recognition result, the chapter audio data and the chapter text are aligned on the time axis, and the audio time intervals corresponding to the text sentences of the chapter are determined in the chapter audio data. The chapter audio data is then divided into pieces of source audio data based on these audio time intervals, and the role of each piece of source audio data is determined based on the role of the text sentence corresponding to that piece of source audio data. For example, table 2 may be referenced:
TABLE 2
[Table 2: for each of the five text sentences, the audio time interval of the corresponding source audio data in the chapter audio data and the role of the sentence]
Illustratively, table 2 lists, for each of the five text sentences, the audio time interval of the corresponding source audio data in the chapter audio data: the source audio data corresponding to the text sentence "Zhang Xing said as usual:" is the audio data in the first time interval, the source audio data corresponding to the text sentence "Are you Wang?" is the audio data in the second time interval, the source audio data corresponding to the narration sentence about the disk is the audio data in the third time interval, the source audio data corresponding to the text sentence "Age?" is the audio data in the fourth time interval, and the source audio data corresponding to the text sentence "Twenty-three." is the audio data in the fifth time interval.
Illustratively, in the above manner, audio sentence division is performed on the W sections of chapter audio data, so as to divide the unicast audio data into N pieces of source audio data. Illustratively, the N pieces of source audio data include P1 pieces of source audio data whose roles are determined (P1 is a positive integer) and P2 pieces of source audio data whose roles are not determined (P2 is a positive integer), where N = P1 + P2.
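Once the audio time intervals of the text sentences have been determined by alignment, slicing the chapter audio data into pieces of source audio data can be sketched as follows; representing the intervals in seconds is an assumption for the example.

```python
import numpy as np

def slice_by_intervals(chapter_audio, sample_rate, intervals):
    """intervals: [(start_s, end_s), ...], one audio time interval per text
    sentence, obtained by aligning the chapter audio data with the chapter
    text on the time axis."""
    pieces = []
    for start_s, end_s in intervals:
        start, end = int(start_s * sample_rate), int(end_s * sample_rate)
        pieces.append(chapter_audio[start:end])
    return pieces
```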
In one possible approach, from the P1 pieces of source audio data with determined roles, X (X is a positive integer) pieces of source audio data corresponding to the first role can be found (i.e., a plurality of pieces of source audio data corresponding to the same role are found). Then, extracting X groups of initial tone characteristic information corresponding to the X pieces of source audio data by adopting a trained tone extractor, and performing weighted calculation based on the X groups of initial tone characteristic information to obtain a weighted calculation result. The result of the weighting calculation may then be determined as source tone color feature information corresponding to each of the X pieces of source audio data.
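A minimal sketch of the weighted calculation over the X groups of initial tone feature information of one role is shown below; equal weights are used by default, since the weighting scheme is not specified here.

```python
import numpy as np

def role_level_timbre(initial_features, weights=None):
    """initial_features: the X groups of initial tone feature information of one
    role. The weighted result is used as the source tone feature information of
    every one of the X pieces of source audio data (equal weights by default)."""
    feats = np.stack(initial_features)
    if weights is None:
        weights = np.full(len(feats), 1.0 / len(feats))
    return np.average(feats, axis=0, weights=weights)
```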
In one possible approach, the roles corresponding to the P2 pieces of source audio data whose roles are not determined can be determined. For example, a trained tone extractor may be used to extract N sets of initial tone feature information corresponding to N pieces of source audio data. For j (j is a positive integer) source audio data in the P2 pieces of source audio data, the similarity between the initial tone color feature information corresponding to the P1 pieces of source audio data and the initial tone color feature information corresponding to the j piece of source audio data can be calculated respectively. And then determining the role of the source audio data with the highest similarity of the initial tone characteristic information corresponding to the jth source audio data in the P1 source audio data as the role of the jth source audio data. For example, after determining the roles corresponding to the P2 pieces of source audio data, X pieces of source audio data corresponding to the first role may be found from the N pieces of source audio data of the determined roles. Then, extracting X groups of initial tone characteristic information corresponding to the X pieces of source audio data by adopting a trained tone extractor, and performing weighted calculation based on the X groups of initial tone characteristic information to obtain a weighted calculation result. And determining the result of the weighted calculation as source tone color characteristic information corresponding to each piece of source audio data in the X pieces of source audio data.
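The role determination for the P2 undetermined pieces can be sketched as a nearest-neighbour search over the initial tone feature information, as below; cosine similarity is an assumed choice of similarity measure.

```python
import numpy as np

def assign_unknown_roles(known, unknown):
    """known: list of (role, initial_tone_feature) for the P1 determined pieces;
    unknown: initial tone features of the P2 undetermined pieces. Each
    undetermined piece takes the role of its most similar determined piece."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    roles = []
    for feat in unknown:
        role, _ = max(known, key=lambda kv: cosine(feat, kv[1]))
        roles.append(role)
    return roles
```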
For example, referring to table 3, the roles of the source audio data with unknown roles, determined according to the roles of the source audio data with known roles, are as follows:
TABLE 3
[Table 3: the roles determined for the source audio data with unknown roles, based on the tone similarity with the source audio data with known roles]
Illustratively, in table 3, the tone feature information of the source audio data from 0.048 s to 0.969 s (corresponding to the text sentence describing the disk moving reluctantly to the side of "Yes") has the highest similarity with the tone feature information of the source audio data from 0 s to 0.582 s (corresponding to the text sentence "Zhang Xing said as usual:"), so the role of the former source audio data is determined to be the role of the latter, that is, the voice-over.
It should be noted that, after the unicast audio data is divided into N pieces of source audio data by performing VAD detection on the unicast audio data, voice recognition may be performed on the N pieces of source audio data, so as to obtain corresponding text recognition results. Then, the roles corresponding to the N pieces of source audio data can be determined according to the text recognition result in the above manner. And then, by adopting the above mode, determining X pieces of source audio data corresponding to the first role, and determining the source tone characteristic information of each piece of source audio data in the X pieces of source audio data by performing weighted calculation on the initial tone characteristic information of the X pieces of source audio data, which is not described herein again.
For example, after N sets of source tone feature information corresponding to N pieces of source audio data are determined, matching target tone feature information corresponding to N pieces of source audio data may be respectively searched for from M sets of reference tone feature information based on the N sets of source tone feature information, which may refer to the above description and is not described herein again.
It should be noted that, alternatively, the text chapter information of the reading text may not be used; instead, the reading text is directly divided into N text sentences, and then the unicast audio data and the reading text are aligned to determine the N audio time intervals corresponding to the N text sentences in the unicast audio data; the unicast audio data is then divided into N pieces of source audio data based on the N audio time intervals.
And S1103, the role tone conversion module outputs multicast audio data of the multicast audio book.
For example, S1103 may refer to the description of S303 above, and will not be described herein again.
Fig. 13a is a schematic structural diagram of an exemplary voice processing apparatus.
Referring to fig. 13a, the speech processing apparatus 1300 illustratively comprises:
a data obtaining module 1301, configured to obtain unicast audio data of a unicast audio book, where the unicast audio data includes N pieces of source audio data, and N is a positive integer;
a role tone analysis module 1302, configured to determine N sets of target tone feature information matched with the N pieces of source audio data from M sets of reference tone feature information, where M sets of the reference tone feature information correspond to M tones, a tone discrimination between any two sets of the M sets of the reference tone feature information is greater than a discrimination threshold, and M is a positive integer;
the role tone conversion module 1303 is used for acquiring N groups of representation characteristic information corresponding to the N groups of source audio data; and performing tone conversion on the N pieces of source audio data respectively based on the N sets of target tone characteristic information and the N sets of expression characteristic information to generate multicast audio data.
Illustratively, the data obtaining module 1301 obtains unicast audio data of the unicast audio book, and then inputs the unicast audio data to the character tone color analyzing module 1302. The character tone analysis module 1302 may determine N sets of target tone feature information that match the N pieces of source audio data from the M sets of reference tone feature information, and then may input the N sets of target tone feature information to the character tone conversion module 1303. The role tone conversion module 1303 can obtain N groups of representation characteristic information corresponding to the N groups of source audio data; and then performing tone conversion on the N pieces of source audio data respectively based on the N groups of target tone characteristic information and the N groups of expression characteristic information to generate multicast audio data. The M groups of reference tone characteristic information correspond to M tones, and the tone discrimination between any two groups of the M groups of reference tone characteristic information is greater than a discrimination threshold. Therefore, the tone of each piece of source audio data can be converted on the premise of ensuring the expressive force of the unicast audio book, so that the unicast audio book is converted into the multicast audio book, and the tone discrimination of the characters in the book is improved, thereby improving the scene deductive expressive force of the audio book and facilitating the understanding of the plot by the user.
Fig. 13b is a schematic structural diagram of an exemplary voice processing apparatus.
Referring to fig. 13b, exemplary character timbre analysis module 1302 includes:
an audio statement division module 13021 configured to perform audio statement division on the unicast audio data to obtain N pieces of source audio data, where the N pieces of source audio data are the audio statements;
a tone characteristic extraction module 13022, configured to extract N groups of source tone characteristic information corresponding to the N pieces of source audio data by using a tone extractor;
and a timbre feature matching module 13023 configured to determine N sets of target timbre feature information from the M sets of reference timbre feature information based on the N sets of source timbre feature information.
Illustratively, the timbre feature matching module 13023 is configured to, for an ith source audio data of the N source audio data: respectively determining the similarity between the M groups of reference tone characteristic information and the ith piece of source audio data; and determining the reference tone characteristic information with the highest similarity with the ith piece of source audio data as target tone characteristic information matched with the ith piece of source audio data, wherein i is a positive integer and the value range is between 1 and N.
Referring to fig. 13b, the character tone color conversion module 1303 includes:
a prosodic feature extracting module 13031 configured to extract N sets of prosodic feature information corresponding to the N sets of source audio data by using a prosody extractor;
an emotion feature extraction module 13032, configured to extract, by using an emotion extractor, N groups of emotion feature information corresponding to the N groups of source audio data;
the expression force characteristic generating module 13033 is configured to generate N groups of expression force characteristic information corresponding to the N pieces of source audio data based on the N groups of prosody characteristic information and the N groups of emotion characteristic information.
Referring to fig. 13b, the character tone color conversion module 1303 includes:
a content feature extraction module 13034, configured to extract N groups of content feature information corresponding to the N pieces of source audio data by using a content extractor;
a feature spectrum reconstruction module 13035, configured to perform spectrum reconstruction based on the N groups of target tone color feature information, the N groups of performance feature information, and the N groups of content feature information by using a spectrum reconstruction model, to obtain N groups of audio spectrum feature information;
a frequency-time transformation module 13036, configured to perform frequency-time transformation on the N groups of audio spectrum feature information respectively to obtain N groups of target audio data;
a splicing module 13037, configured to splice the N groups of target audio data to obtain multicast audio data.
Illustratively, the audio statement division module 13021 is configured to divide the unicast audio data into N pieces of source audio data by performing voice activity detection, VAD, detection on the unicast audio data.
Illustratively, the audio statement dividing module 13021 is configured to acquire a reading text of a unicast audio reading, and divide the reading text into N text statements; aligning the unicast audio data with the reading text to determine N audio time intervals corresponding to the N text statements in the unicast audio data; dividing the unicast audio data into N pieces of source audio data based on the N audio time intervals.
Illustratively, the tone feature extraction module 13022 is configured to determine a role corresponding to each source audio data in the N pieces of source audio data; acquiring N groups of initial tone characteristic information corresponding to the N pieces of source audio data; aiming at X pieces of source audio data corresponding to a first role in the N pieces of source audio data: extracting X groups of initial tone characteristic information corresponding to X pieces of source audio data; and performing weighting calculation based on X groups of initial tone characteristic information corresponding to X pieces of source audio data, and determining the result of the weighting calculation as the source tone characteristic information corresponding to each piece of source audio data in the X pieces of source audio data, wherein X is a positive integer.
Referring to fig. 13b, the speech processing apparatus 1300 further includes:
a role determination module 1304, configured to calculate the similarity between the initial tone feature information corresponding to the P1 pieces of source audio data whose roles are determined and the initial tone feature information corresponding to the jth piece of source audio data whose role is not determined; and determine the role corresponding to the piece of source audio data, among the P1 pieces of source audio data, whose initial tone feature information has the highest similarity with that of the jth piece of source audio data, as the role of the jth piece of source audio data; where j is a positive integer ranging from 1 to P2.
Fig. 14 is a schematic diagram of an exemplary training apparatus.
Referring to fig. 14, exemplary training device 1400 comprises:
a collecting module 1401, configured to collect training data, where the training data includes training audio data and reference character labels corresponding to the training audio data, where expressive characteristic information of the training audio data satisfies expressive force conditions, the training audio data includes audio data recorded by multiple users using their own timbres, and/or audio data recorded by multiple users using pseudo-tones, and a timbre discrimination of different pseudo-tones used by the same user is greater than a discrimination threshold;
the feature information extraction module 1402 is configured to input the training audio data to the emotion extractor, the content extractor, and the prosody extractor respectively for calculation, so as to obtain emotion feature information output by the emotion extractor, content feature information output by the content extractor, and prosody feature information output by the prosody extractor; and to input the training audio data and the reference role label into the tone extractor for calculation, to obtain tone feature information output by the tone extractor;
an audio data reconstruction module 1403, configured to input the emotion feature information, the content feature information, the prosody feature information, and the tone feature information to a spectrum reconstruction model for spectrum reconstruction to obtain audio spectrum feature information, and perform frequency-time transformation on the audio spectrum feature information to obtain reconstructed audio data;
a back propagation module 1404 configured to calculate a first loss function value based on the reconstructed audio data and the training audio data, and jointly adjust model parameters of the emotion extractor, the content extractor, the prosody extractor, the timbre extractor, and the spectral reconstruction model with a goal of minimizing the first loss function value.
Illustratively, the apparatus further comprises:
a loss function value calculation module 1405, configured to input the tone feature information to a first classifier for calculation to obtain a first role label, and calculate a second loss function value based on the first role label and the reference role label; and to input the emotion feature information into a second classifier for calculation to obtain a second role label, and calculate a third loss function value based on the second role label and the reference role label;
a tone extractor training module 1406 for adjusting model parameters of the tone extractor with the objective of minimizing the second loss function value and mutual information of the tone characteristic information and the emotion characteristic information;
and an emotion extractor training module 1407 for adjusting model parameters of the emotion extractor with the goal of maximizing the third loss function value and minimizing mutual information.
In one example, fig. 15 shows a schematic block diagram of an apparatus 1500 according to an embodiment of the present application. The apparatus 1500 may comprise: a processor 1501, transceiver/transceiver pins 1502, and optionally a memory 1503.
The various components of the apparatus 1500 are coupled together by a bus 1504, where the bus 1504 includes a power bus, a control bus, and a status signal bus in addition to a data bus. However, for clarity, the various buses are referred to as the bus 1504 in the figure.
Optionally, memory 1503 may be used for instructions in the foregoing method embodiments. The processor 1501 may be used to execute instructions in the memory 1503 and control the receive pin to receive signals and the transmit pin to transmit signals.
The apparatus 1500 may be an electronic device or a chip of an electronic device in the above method embodiments.
All relevant contents of each step related to the above method embodiment may be referred to the functional description of the corresponding functional module, and are not described herein again.
The present embodiment also provides a computer storage medium, in which computer instructions are stored, and when the computer instructions are run on an electronic device, the electronic device executes the above related method steps to implement the voice processing and training method in the above embodiments.
The present embodiment also provides a computer program product, which when running on a computer, causes the computer to execute the relevant steps described above, so as to implement the speech processing and training method in the above embodiments.
In addition, embodiments of the present application also provide an apparatus, which may be specifically a chip, a component or a module, and may include a processor and a memory connected to each other; the memory is used for storing computer execution instructions, and when the device runs, the processor can execute the computer execution instructions stored by the memory, so that the chip can execute the voice processing and training method in the above method embodiments.
The electronic device, the computer storage medium, the computer program product, or the chip provided in this embodiment are all configured to execute the corresponding method provided above, so that the beneficial effects achieved by the electronic device, the computer storage medium, the computer program product, or the chip may refer to the beneficial effects in the corresponding method provided above, and are not described herein again.
Through the description of the above embodiments, those skilled in the art will understand that, for convenience and simplicity of description, only the division of the above functional modules is used as an example, and in practical applications, the above function distribution may be completed by different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division into modules or units is merely a logical function division, and there may be other division manners in actual implementation; a plurality of units or components may be combined or integrated into another apparatus, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
Units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed to a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
Any content of the various embodiments of the present application, and any content within the same embodiment, can be freely combined. Any such combination is within the scope of the present application.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, or the part thereof contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions to enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The steps of the methods or algorithms described in connection with the disclosure of the embodiments of this application may be implemented in hardware or in software instructions executed by a processor. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. The storage medium may also be integrated into the processor. The processor and the storage medium may reside in an ASIC.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of this application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on, or transmitted as one or more instructions or code over, a computer-readable medium. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
Although the embodiments have been described with reference to the accompanying drawings, it should be understood that they are illustrative rather than restrictive, and that those skilled in the art may make various changes and modifications without departing from the scope of the appended claims.

Claims (17)

1. A method of speech processing, comprising:
acquiring unicast audio data of a unicast audiobook, wherein the unicast audio data comprises N pieces of source audio data, and N is a positive integer;
determining N groups of target tone characteristic information matched with the N pieces of source audio data from M groups of reference tone characteristic information, wherein the M groups of reference tone characteristic information correspond to M tones, the tone discrimination between any two groups of the M groups of reference tone characteristic information is greater than a discrimination threshold, and M is a positive integer;
acquiring N groups of expression characteristic information corresponding to the N pieces of source audio data;
and performing tone conversion on the N pieces of source audio data respectively based on the N groups of target tone characteristic information and the N groups of expression characteristic information to generate multicast audio data.
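The four steps of claim 1 can be pictured as a small orchestration loop. The sketch below is illustrative only: the helper callables (match_timbre, extract_expression, convert_timbre) are hypothetical stand-ins for the models the claims describe, and the final concatenation assumes the clips are already in playback order.

```python
# Illustrative sketch of the claim-1 pipeline; all helper callables are hypothetical.
from typing import Callable, List, Sequence

import numpy as np


def unicast_to_multicast(
    source_clips: List[np.ndarray],                      # N pieces of source audio data
    reference_timbres: Sequence[np.ndarray],             # M groups of reference tone features
    match_timbre: Callable[[np.ndarray, Sequence[np.ndarray]], np.ndarray],
    extract_expression: Callable[[np.ndarray], np.ndarray],
    convert_timbre: Callable[[np.ndarray, np.ndarray, np.ndarray], np.ndarray],
) -> np.ndarray:
    """Convert single-narrator clips into multicast audio, step by step."""
    converted = []
    for clip in source_clips:
        target_timbre = match_timbre(clip, reference_timbres)   # step 2: pick a matching reference timbre
        expression = extract_expression(clip)                   # step 3: prosody/emotion features
        converted.append(convert_timbre(clip, target_timbre, expression))  # step 4: tone conversion
    return np.concatenate(converted)                            # splice into multicast audio data
```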
2. The method according to claim 1, wherein the determining N groups of target tone characteristic information matched with the N pieces of source audio data from the M groups of reference tone characteristic information comprises:
performing audio sentence segmentation on the unicast audio data to obtain the N pieces of source audio data, wherein the N pieces of source audio data correspond one-to-one to audio sentences;
acquiring N groups of source tone characteristic information corresponding to the N pieces of source audio data;
and determining the N groups of target tone characteristic information from the M groups of reference tone characteristic information based on the N groups of source tone characteristic information.
3. The method according to claim 2, wherein the determining the N groups of target tone characteristic information matched with the N pieces of source audio data from the M groups of reference tone characteristic information based on the N groups of source tone characteristic information comprises:
for the ith piece of source audio data in the N pieces of source audio data:
determining, respectively, the similarity between each of the M groups of reference tone characteristic information and the ith piece of source audio data;
and determining the reference tone characteristic information having the highest similarity to the ith piece of source audio data as the target tone characteristic information matched with the ith piece of source audio data, wherein i is a positive integer ranging from 1 to N.
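If each group of tone characteristic information is represented as a fixed-length embedding, the matching rule of claim 3 reduces to an argmax over similarities. The sketch below assumes cosine similarity, which the claim does not mandate; any similarity measure fits.

```python
# Cosine-similarity matching against M reference timbre embeddings (assumed representation).
import numpy as np


def match_target_timbre(source_embedding: np.ndarray,
                        reference_embeddings: np.ndarray) -> int:
    """Return the index of the reference timbre most similar to the source timbre."""
    src = source_embedding / np.linalg.norm(source_embedding)
    refs = reference_embeddings / np.linalg.norm(reference_embeddings, axis=1, keepdims=True)
    similarities = refs @ src             # cosine similarity against all M references
    return int(np.argmax(similarities))   # the highest-similarity reference is the target


# Example: 4 reference timbres, 16-dimensional embeddings.
rng = np.random.default_rng(0)
refs = rng.normal(size=(4, 16))
print(match_target_timbre(refs[2] + 0.01 * rng.normal(size=16), refs))  # -> 2
```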
4. The method according to any one of claims 1 to 3, wherein the acquiring N groups of expression characteristic information corresponding to the N pieces of source audio data comprises:
acquiring N groups of prosody characteristic information corresponding to the N pieces of source audio data;
acquiring N groups of emotion characteristic information corresponding to the N pieces of source audio data;
and generating the N groups of expression characteristic information corresponding to the N pieces of source audio data based on the N groups of prosody characteristic information and the N groups of emotion characteristic information.
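One way to picture claim 4 is that the expression characteristic information is simply the pairing of a prosody representation with an emotion representation. In the sketch below, per-frame log energy stands in for prosody and the emotion extractor is a hypothetical callable; frame length and hop size are illustrative values.

```python
# Minimal sketch: combine a prosody proxy with emotion features into expression features.
import numpy as np


def frame_energy(audio: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Per-frame log energy as a minimal prosody feature."""
    n_frames = 1 + max(0, len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return np.log(np.sum(frames ** 2, axis=1) + 1e-8)


def expression_features(audio: np.ndarray, emotion_extractor) -> dict:
    return {
        "prosody": frame_energy(audio),       # prosody characteristic information
        "emotion": emotion_extractor(audio),  # emotion characteristic information
    }
```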
5. The method according to any one of claims 1 to 4, wherein the performing tone conversion on the N pieces of source audio data, respectively, based on the N groups of target tone characteristic information and the N groups of expression characteristic information to generate multicast audio data comprises:
acquiring N groups of content characteristic information corresponding to the N pieces of source audio data;
performing spectrum reconstruction based on the N groups of target tone characteristic information, the N groups of expression characteristic information and the N groups of content characteristic information to obtain N groups of sound spectrum characteristic information;
carrying out frequency-to-time conversion on the N groups of sound spectrum characteristic information, respectively, to obtain N pieces of target audio data;
and splicing the N pieces of target audio data to obtain the multicast audio data.
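The frequency-to-time conversion and splicing steps of claim 5 can be illustrated with an off-the-shelf inverse STFT. The sketch below assumes the reconstructed sound spectrum characteristic information takes the form of a complex STFT per sentence; the spectrum reconstruction model itself is outside its scope.

```python
# Inverse STFT as one possible frequency-to-time conversion, followed by splicing.
import numpy as np
from scipy.signal import istft, stft

SR = 16000


def spectra_to_multicast(spectra: list) -> np.ndarray:
    """Convert each reconstructed spectrum back to a waveform and splice them."""
    clips = []
    for spec in spectra:
        _, clip = istft(spec, fs=SR, nperseg=512)   # frequency-to-time conversion
        clips.append(clip)
    return np.concatenate(clips)                     # splicing yields the multicast audio


# Round-trip example with two synthetic "sentences".
t = np.linspace(0, 1, SR, endpoint=False)
sentences = [np.sin(2 * np.pi * 220 * t), np.sin(2 * np.pi * 330 * t)]
spectra = [stft(s, fs=SR, nperseg=512)[2] for s in sentences]
print(spectra_to_multicast(spectra).shape)
```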
6. The method of claim 2, wherein the performing audio sentence segmentation on the unicast audio data to obtain the N pieces of source audio data comprises:
dividing the unicast audio data into the N pieces of source audio data by performing voice activity detection (VAD) on the unicast audio data.
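A full VAD model is not needed to see how claim 6 splits the audio; a plain energy-threshold detector already yields sentence-like segments. The frame length and threshold in the sketch below are illustrative assumptions, not values from the claims.

```python
# Energy-threshold VAD as a stand-in for the VAD step of claim 6.
import numpy as np


def vad_split(audio: np.ndarray, sr: int = 16000,
              frame_ms: int = 30, threshold: float = 1e-4) -> list:
    """Split unicast audio into voiced segments separated by silence."""
    frame = int(sr * frame_ms / 1000)
    n = len(audio) // frame
    energy = np.array([np.mean(audio[i * frame:(i + 1) * frame] ** 2) for i in range(n)])
    voiced = energy > threshold
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                       # a voiced run begins
        elif not v and start is not None:
            segments.append(audio[start * frame:i * frame])
            start = None
    if start is not None:
        segments.append(audio[start * frame:n * frame])
    return segments  # each segment approximates one piece of source audio data
```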
7. The method of claim 2, wherein the performing audio sentence segmentation on the unicast audio data to obtain the N pieces of source audio data comprises:
acquiring a reading text of the unicast audiobook, and dividing the reading text into N text sentences;
aligning the unicast audio data and the reading text to determine N audio time intervals corresponding to the N text sentences in the unicast audio data;
and dividing the unicast audio data into the N pieces of source audio data based on the N audio time intervals.
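The last step of claim 7 is a simple cut along the aligned time intervals. The sketch below assumes the alignment itself (for example, from a forced aligner) has already produced one (start, end) pair per text sentence.

```python
# Cut the unicast audio into N pieces given aligned per-sentence time intervals.
import numpy as np


def split_by_intervals(audio: np.ndarray, sr: int, intervals: list) -> list:
    """intervals: list of (start_seconds, end_seconds), one per text sentence."""
    return [audio[int(s * sr):int(e * sr)] for s, e in intervals]


# Example: three aligned sentences in a 10-second recording.
sr = 16000
audio = np.zeros(10 * sr)
pieces = split_by_intervals(audio, sr, [(0.0, 3.2), (3.2, 6.9), (6.9, 10.0)])
print([len(p) / sr for p in pieces])
```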
8. The method according to any one of claims 2 to 7, wherein the acquiring N groups of source tone characteristic information corresponding to the N pieces of source audio data comprises:
determining a role corresponding to each piece of source audio data in the N pieces of source audio data;
acquiring N groups of initial tone characteristic information corresponding to the N pieces of source audio data;
for X pieces of source audio data corresponding to a first role among the N pieces of source audio data:
and performing a weighted calculation based on the X groups of initial tone characteristic information corresponding to the X pieces of source audio data, and determining the result of the weighted calculation as the source tone characteristic information corresponding to each piece of source audio data in the X pieces of source audio data, wherein X is a positive integer.
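Claim 8 amounts to pooling the per-sentence tone embeddings of one role into a single shared embedding. The sketch below uses equal weights by default, which is only one admissible choice of weighted calculation.

```python
# Weighted pooling of the X initial tone embeddings attributed to one role.
from typing import Optional

import numpy as np


def role_timbre(initial_embeddings: np.ndarray,
                weights: Optional[np.ndarray] = None) -> np.ndarray:
    """Weighted combination of the X initial tone embeddings of one role."""
    X = initial_embeddings.shape[0]
    if weights is None:
        weights = np.full(X, 1.0 / X)       # default: simple average
    weights = weights / weights.sum()
    return weights @ initial_embeddings     # shared source tone characteristic information


rng = np.random.default_rng(1)
emb = rng.normal(size=(5, 16))               # X = 5 sentences of the first role
print(role_timbre(emb).shape)                # -> (16,)
```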
9. The method according to claim 8, wherein the N pieces of source audio data include P1 pieces of source audio data with determined roles and P2 pieces of source audio data with undetermined roles, N = P1 + P2, and P1 and P2 are positive integers, the method further comprising:
for the jth piece of source audio data in the P2 pieces of source audio data:
calculating, respectively, the similarity between the initial tone characteristic information corresponding to each of the P1 pieces of source audio data and the initial tone characteristic information corresponding to the jth piece of source audio data;
and determining the role corresponding to the piece of source audio data, among the P1 pieces of source audio data, whose initial tone characteristic information is most similar to that of the jth piece of source audio data, as the role of the jth piece of source audio data;
wherein j is a positive integer ranging from 1 to P2.
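Claim 9 is a nearest-neighbour assignment: an unlabelled sentence inherits the role of the labelled sentence whose initial tone embedding it most resembles. Cosine similarity in the sketch below is an assumed choice.

```python
# Assign a role to an unlabelled sentence by nearest labelled tone embedding.
import numpy as np


def assign_role(unlabelled_emb: np.ndarray,
                labelled_embs: np.ndarray,
                labelled_roles: list) -> str:
    e = unlabelled_emb / np.linalg.norm(unlabelled_emb)
    L = labelled_embs / np.linalg.norm(labelled_embs, axis=1, keepdims=True)
    best = int(np.argmax(L @ e))    # highest-similarity labelled sentence
    return labelled_roles[best]     # its role becomes the jth sentence's role


rng = np.random.default_rng(2)
labelled = rng.normal(size=(3, 8))
print(assign_role(labelled[1] + 0.05 * rng.normal(size=8),
                  labelled, ["narrator", "hero", "villain"]))  # -> "hero"
```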
10. A method of training, comprising:
collecting training data, wherein the training data comprises training audio data and reference role labels corresponding to the training audio data, expression characteristic information of the training audio data meets an expressive-force condition, the training audio data comprises audio data recorded by a plurality of users using their own timbres and/or audio data recorded by a plurality of users using pseudo-timbres, and the tone discrimination between different pseudo-timbres used by the same user is greater than a discrimination threshold;
respectively inputting the training audio data into an emotion extractor, a content extractor and a prosody extractor for calculation, to obtain emotion characteristic information output by the emotion extractor, content characteristic information output by the content extractor, and prosody characteristic information output by the prosody extractor;
inputting the training audio data and the reference role label into a tone extractor for calculation to obtain tone characteristic information output by the tone extractor;
inputting the emotion characteristic information, the content characteristic information, the prosody characteristic information and the tone characteristic information into a spectrum reconstruction model for spectrum reconstruction to obtain sound spectrum characteristic information, and performing frequency-to-time transformation on the sound spectrum characteristic information to obtain reconstructed audio data;
and calculating a first loss function value based on the reconstructed audio data and the training audio data, and jointly adjusting the model parameters of the emotion extractor, the content extractor, the prosody extractor, the tone extractor and the spectrum reconstruction model with the goal of minimizing the first loss function value.
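A toy version of the joint objective in claim 10 can be written in a few lines of PyTorch. The sketch below shrinks the four extractors and the spectrum reconstruction model to small MLPs over mel-spectrogram frames, omits the role-label conditioning of the tone extractor, and uses an L1 reconstruction loss; every size and hyperparameter is an illustrative assumption.

```python
# Minimal joint-training sketch: four extractors + spectrum reconstructor, one loss.
import torch
import torch.nn as nn

N_MELS, EMB = 80, 64

emotion_ext = nn.Sequential(nn.Linear(N_MELS, EMB), nn.ReLU(), nn.Linear(EMB, EMB))
content_ext = nn.Sequential(nn.Linear(N_MELS, EMB), nn.ReLU(), nn.Linear(EMB, EMB))
prosody_ext = nn.Sequential(nn.Linear(N_MELS, EMB), nn.ReLU(), nn.Linear(EMB, EMB))
tone_ext    = nn.Sequential(nn.Linear(N_MELS, EMB), nn.ReLU(), nn.Linear(EMB, EMB))
reconstructor = nn.Sequential(nn.Linear(4 * EMB, 256), nn.ReLU(), nn.Linear(256, N_MELS))

params = [p for m in (emotion_ext, content_ext, prosody_ext, tone_ext, reconstructor)
          for p in m.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-4)
l1 = nn.L1Loss()

# One illustrative training step on a fake batch of spectrogram frames.
mel = torch.randn(32, N_MELS)                      # stands in for training audio data
features = torch.cat([emotion_ext(mel), content_ext(mel),
                      prosody_ext(mel), tone_ext(mel)], dim=-1)
reconstructed = reconstructor(features)            # spectrum reconstruction
first_loss = l1(reconstructed, mel)                # first loss function value
optimizer.zero_grad()
first_loss.backward()                              # jointly adjusts all five models
optimizer.step()
print(float(first_loss))
```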
11. The method of claim 10, further comprising:
inputting the tone characteristic information into a first classifier for calculation to obtain a first role label, and calculating a second loss function value based on the first role label and the reference role label;
inputting the emotion characteristic information into a second classifier for calculation to obtain a second role label, and calculating a third loss function value based on the second role label and the reference role label;
adjusting the model parameters of the tone extractor with the goal of minimizing the second loss function value and minimizing the mutual information between the tone characteristic information and the emotion characteristic information;
and adjusting the model parameters of the emotion extractor with the goal of maximizing the third loss function value and minimizing the mutual information.
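Claim 11 adds a disentanglement objective on top of the previous sketch. In the toy version below, the two role classifiers are single linear layers, and the mutual-information term is replaced by a squared cross-correlation penalty, which is only a stand-in for whatever MI estimator an implementation would use; all sizes are assumptions.

```python
# Toy disentanglement objective: role classifiers plus a crude dependence penalty.
import torch
import torch.nn as nn

EMB, N_ROLES = 64, 4
tone_feat = torch.randn(32, EMB, requires_grad=True)   # from the tone extractor
emo_feat = torch.randn(32, EMB, requires_grad=True)    # from the emotion extractor
roles = torch.randint(0, N_ROLES, (32,))                # reference role labels

clf_tone = nn.Linear(EMB, N_ROLES)                      # first classifier
clf_emo = nn.Linear(EMB, N_ROLES)                       # second classifier
ce = nn.CrossEntropyLoss()

second_loss = ce(clf_tone(tone_feat), roles)            # tone features should predict the role
third_loss = ce(clf_emo(emo_feat), roles)               # emotion features should NOT


def mi_proxy(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    a = (a - a.mean(0)) / (a.std(0) + 1e-6)
    b = (b - b.mean(0)) / (b.std(0) + 1e-6)
    return ((a.T @ b) / a.shape[0]).pow(2).mean()       # stand-in for mutual information


mi = mi_proxy(tone_feat, emo_feat)
tone_objective = second_loss + mi                       # minimized w.r.t. the tone extractor
emotion_objective = -third_loss + mi                    # maximize third loss, minimize MI
print(float(tone_objective), float(emotion_objective))
```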
12. An electronic device, comprising:
a memory and a processor, the memory coupled with the processor;
the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the speech processing method of any of claims 1 to 9.
13. An electronic device, comprising:
a memory and a processor, the memory coupled with the processor;
the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the training method of any one of claims 10 to 11.
14. A chip comprising one or more interface circuits and one or more processors; the interface circuit is configured to receive signals from a memory of an electronic device and to transmit the signals to the processor, the signals including computer instructions stored in the memory; the computer instructions, when executed by the processor, cause the electronic device to perform the speech processing method of any of claims 1-9.
15. A chip comprising one or more interface circuits and one or more processors; the interface circuit is configured to receive signals from a memory of an electronic device and to transmit the signals to the processor, the signals including computer instructions stored in the memory; the computer instructions, when executed by the processor, cause the electronic device to perform the training method of any one of claims 10 to 11.
16. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when run on a computer or a processor, causes the computer or the processor to perform the method according to any one of claims 1 to 11.
17. A computer program product, characterized in that it contains a software program which, when executed by a computer or a processor, causes the steps of the method of any one of claims 1 to 11 to be performed.
CN202111158143.1A 2021-09-30 2021-09-30 Voice processing and training method and electronic equipment Pending CN115881145A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111158143.1A CN115881145A (en) 2021-09-30 2021-09-30 Voice processing and training method and electronic equipment
PCT/CN2022/116572 WO2023051155A1 (en) 2021-09-30 2022-09-01 Voice processing and training methods and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111158143.1A CN115881145A (en) 2021-09-30 2021-09-30 Voice processing and training method and electronic equipment

Publications (1)

Publication Number Publication Date
CN115881145A true CN115881145A (en) 2023-03-31

Family

ID=85756611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111158143.1A Pending CN115881145A (en) 2021-09-30 2021-09-30 Voice processing and training method and electronic equipment

Country Status (2)

Country Link
CN (1) CN115881145A (en)
WO (1) WO2023051155A1 (en)

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5563358A (en) * 1991-12-06 1996-10-08 Zimmerman; Thomas G. Music training apparatus
KR101636716B1 (en) * 2009-12-24 2016-07-06 삼성전자주식회사 Apparatus of video conference for distinguish speaker from participants and method of the same
CN107293286B (en) * 2017-05-27 2020-11-24 华南理工大学 Voice sample collection method based on network dubbing game
CN110933330A (en) * 2019-12-09 2020-03-27 广州酷狗计算机科技有限公司 Video dubbing method and device, computer equipment and computer-readable storage medium
CN112037766B (en) * 2020-09-09 2022-03-04 广州方硅信息技术有限公司 Voice tone conversion method and related equipment
CN112820287A (en) * 2020-12-31 2021-05-18 乐鑫信息科技(上海)股份有限公司 Distributed speech processing system and method
CN112863483B (en) * 2021-01-05 2022-11-08 杭州一知智能科技有限公司 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN113096634B (en) * 2021-03-30 2024-03-01 平安科技(深圳)有限公司 Speech synthesis method, device, server and storage medium
CN113436609B (en) * 2021-07-06 2023-03-10 南京硅语智能科技有限公司 Voice conversion model, training method thereof, voice conversion method and system

Also Published As

Publication number Publication date
WO2023051155A1 (en) 2023-04-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination