CN117292673A - Method, device, equipment and storage medium for determining timbre-attribute-adjusted audio - Google Patents

Method, device, equipment and storage medium for determining timbre-attribute-adjusted audio

Info

Publication number
CN117292673A
Authority
CN
China
Prior art keywords
vector
tone
target
attribute
hidden
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311330964.8A
Other languages
Chinese (zh)
Inventor
邹雨巷
马泽君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202311330964.8A priority Critical patent/CN117292673A/en
Publication of CN117292673A publication Critical patent/CN117292673A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 - Architecture of speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 - Changing voice quality, e.g. pitch or formants
    • G10L21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used

Abstract

The disclosure relates to the technical field of speech synthesis and discloses a method, device, equipment and storage medium for determining timbre-attribute-adjusted audio. The method comprises the following steps: obtaining a direction vector corresponding to an attribute to be adjusted of a current timbre, wherein the direction vector is obtained based on a first hidden vector and a second hidden vector, the first hidden vector is obtained by transformation mapping of a first sample set containing a first timbre attribute, the second hidden vector is obtained by transformation mapping of a second sample set containing a second timbre attribute, and the number of attributes to be adjusted is at least one; obtaining a transformation-mapped target hidden vector according to a first target characterization vector of the current timbre; and obtaining, according to the target hidden vector and the direction vector, target timbre audio in which the timbre attribute of the current timbre has been adjusted. In the field of audiobooks, attribute adjustment can be performed on any timbre when a timbre attribute is changed, which solves the technical problem in the related art that a speaker's timbre attributes cannot be flexibly adjusted.

Description

Method, device, equipment and storage medium for determining timbre-attribute-adjusted audio
Technical Field
The disclosure relates to the technical field of speech synthesis for audiobooks, and in particular to a method, device, equipment and storage medium for determining timbre-attribute-adjusted audio.
Background
Speech synthesis mainly refers to the process of converting text into speech and is one of the core technologies for realizing human-computer interaction. In the field of audiobooks, recordings of a speaker are generally required, and the speaker's voice is reproduced using speech synthesis techniques.
Existing speech synthesis adjustment functions generally adjust the speaking rate, volume and pitch of a speaker, but cannot adjust the speaker's timbre attributes. Therefore, when the related art faces a scenario in which the speaker's timbre needs to be changed to obtain complete audio speech, it suffers from the technical problem that the speaker's timbre attributes cannot be flexibly adjusted.
Disclosure of Invention
In view of the above, the present disclosure provides a method, apparatus, device and storage medium for determining timbre-attribute-adjusted audio, so as to solve the technical problem that, when the related art faces a scenario in which a speaker's timbre needs to be changed to obtain complete audio speech, the speaker's timbre attributes cannot be flexibly adjusted due to technical limitations.
In a first aspect, the present disclosure provides a method of audio determination for timbre attribute adjustment, the method comprising:
obtaining a direction vector corresponding to an attribute to be adjusted of a current timbre, wherein the direction vector is obtained based on a first hidden vector and a second hidden vector, the first hidden vector is obtained by transformation mapping of a first sample set containing a first timbre attribute, the second hidden vector is obtained by transformation mapping of a second sample set containing a second timbre attribute, and the number of attributes to be adjusted is at least one;
obtaining a transformation-mapped target hidden vector according to a first target characterization vector of the current timbre;
and obtaining, according to the target hidden vector and the direction vector, target timbre audio in which the timbre attribute of the current timbre has been adjusted.
In the embodiment of the disclosure, the direction vector corresponding to the attribute to be adjusted of the current timbre is obtained based on the first hidden vector and the second hidden vector as described above, the transformation-mapped target hidden vector is obtained according to the first target characterization vector of the current timbre, and the target timbre audio with the adjusted timbre attribute is obtained according to the target hidden vector and the direction vector. Because the direction vector can be obtained directly from the attribute to be adjusted, and the target hidden vector of the timbre can then be adjusted directly based on that direction vector, the embodiments of the present disclosure can perform attribute adjustment on any timbre when changing a timbre attribute, which solves the technical problem in the related art that, when a speaker's timbre needs to be changed to obtain complete audio speech, the speaker's timbre attributes cannot be flexibly adjusted due to technical limitations.
In a second aspect, the present disclosure provides an audio determining apparatus for timbre attribute adjustment, the apparatus comprising:
an acquisition module, configured to acquire a direction vector corresponding to an attribute to be adjusted of a current timbre, wherein the direction vector is obtained based on a first hidden vector and a second hidden vector, the first hidden vector is obtained by transformation mapping of a first sample set containing a first timbre attribute, the second hidden vector is obtained by transformation mapping of a second sample set containing a second timbre attribute, and the number of attributes to be adjusted is at least one;
a first obtaining module, configured to obtain a transformation-mapped target hidden vector according to a first target characterization vector of the current timbre;
and a second obtaining module, configured to obtain, according to the target hidden vector and the direction vector, target timbre audio in which the timbre attribute of the current timbre has been adjusted.
In a third aspect, the present disclosure provides a computer device comprising a memory and a processor that are communicatively connected to each other, wherein the memory stores computer instructions and the processor executes the computer instructions to perform the audio determination method for timbre attribute adjustment of the first aspect or any of its corresponding embodiments.
In a fourth aspect, the present disclosure provides a computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the audio determination method for timbre attribute adjustment of the first aspect or any of its corresponding embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the prior art, the drawings required in the detailed description are briefly described below. It is apparent that the drawings in the following description show some embodiments of the present disclosure, and that a person of ordinary skill in the art may derive other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of an audio determination method for timbre attribute adjustment according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of timbre audio generation by a flow model and an acoustic model according to an embodiment of the present disclosure;
FIG. 3 is a structural block diagram of an audio determining apparatus for timbre attribute adjustment according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by a person skilled in the art without inventive effort fall within the scope of protection of the present disclosure.
It will be appreciated that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should, in an appropriate manner and in accordance with the relevant laws and regulations, be informed of the type, scope of use and usage scenarios of the personal information involved in the present disclosure, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly inform the user that the requested operation will require obtaining and using the user's personal information. The user can thus, according to the prompt information, autonomously choose whether to provide personal information to the software or hardware, such as an electronic device, application program, server or storage medium, that performs the operations of the technical solution of the present disclosure.
As an alternative but non-limiting implementation, in response to receiving an active request from a user, the prompt information may be sent to the user, for example, in a popup window, in which the prompt information may be presented as text. In addition, the popup window may carry a selection control allowing the user to choose "consent" or "decline" to providing personal information to the electronic device.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
It will be appreciated that the data (including but not limited to the data itself, the acquisition or use of the data) involved in the present technical solution should comply with the corresponding legal regulations and the requirements of the relevant regulations.
Existing speech synthesis adjustment functions generally adjust the speaking rate, volume and pitch of a speaker, but cannot adjust the timbre corresponding to the speaker's persona attributes, and therefore cannot serve application scenarios in which a speaker's timbre attributes need to be changed. To address this problem, the embodiments of the present disclosure provide a timbre attribute adjustment method, of which an audio determination embodiment is shown in FIG. 1. It should be noted that the steps shown in the flowchart may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, the steps shown or described may, in some cases, be performed in a different order.
In this embodiment, a method for determining timbre-attribute-adjusted audio is provided. FIG. 1 is a schematic flow chart of the method according to an embodiment of the disclosure. As shown in FIG. 1, the method may be applied on a server side, and the method flow includes the following steps:
step S101, a direction vector corresponding to an attribute to be adjusted of a current tone is obtained, wherein the direction vector is obtained based on a first hidden vector and a second hidden vector, the first hidden vector is obtained by mapping a first sample set containing the attribute of the first tone in a transformation way, the second hidden vector is obtained by mapping a second sample set containing the attribute of the second tone in a transformation way, and the number of the attributes to be adjusted is at least one.
Optionally, in the embodiment of the present disclosure, the server first obtains the current timbre of any user, and then obtains the direction vector corresponding to the attribute to be adjusted for implementing timbre adjustment for the current timbre, where the number of attributes to be adjusted may be multiple, for example, the attribute to be adjusted includes from high to low, from men to women, from silhouette to sweet, and so on, and at this time, the corresponding direction vector needs to be determined according to the attributes to be adjusted.
The direction vector is determined from a first hidden vector and a second hidden vector, wherein the first hidden vector is obtained by mapping a first sample set containing a first timbre attribute, and the first timbre attribute may be a high pitch, a male pitch, etc., and the corresponding first sample may be understood as a positive sample in the data sample set. The second hidden vector is obtained by mapping a second sample set containing a second timbre attribute, where the second timbre attribute may be a dip tone, a girl tone, etc., and the corresponding second sample may be understood as a negative sample in the data sample set.
Step S102, a transformation-mapped target hidden vector is obtained according to a first target characterization vector of the current timbre.
Optionally, the user's current timbre is encoded to obtain the first target characterization vector, and the transformation-mapped target hidden vector corresponding to the first target characterization vector is then obtained. The first target characterization vector is the user's speaker characterization vector.
Step S103, target timbre audio in which the timbre attribute of the current timbre has been adjusted is obtained according to the target hidden vector and the direction vector.
Optionally, the target hidden vector of the current timbre is numerically transformed using the direction vector corresponding to the attribute to be adjusted, yielding a numerically transformed target hidden vector, and the target timbre audio with the adjusted timbre attribute is obtained on that basis. It can be understood that the target timbre audio is the final audio obtained by speech-synthesizing the adjusted timbre together with the text information after the timbre attribute of the current timbre has been adjusted.
In the embodiment of the disclosure, the direction vector corresponding to the attribute to be adjusted of the current timbre is obtained based on the first hidden vector and the second hidden vector, the transformation-mapped target hidden vector is obtained according to the first target characterization vector of the current timbre, and the target timbre audio with the adjusted timbre attribute is obtained according to the target hidden vector and the direction vector. Because the direction vector can be obtained directly from the attribute to be adjusted, and the target hidden vector of the timbre can then be adjusted directly based on that direction vector, the embodiments of the present disclosure can perform attribute adjustment on any timbre when changing a timbre attribute, which solves the technical problem in the related art that, when a speaker's timbre needs to be changed to obtain complete audio speech, the speaker's timbre attributes cannot be flexibly adjusted due to technical limitations.
In some optional embodiments, obtaining the direction vector corresponding to the attribute to be adjusted of the current timbre comprises:
obtaining, according to the attribute to be adjusted of the current timbre, a first sample set containing a first timbre attribute and a second sample set containing a second timbre attribute;
obtaining a transformation-mapped first hidden vector according to the first sample set, and obtaining a transformation-mapped second hidden vector according to the second sample set;
and obtaining the direction vector according to the first hidden vector and the second hidden vector.
Optionally, as described in the above embodiment, after the attribute to be adjusted of the current timbre is determined, a positive sample set (i.e., a first sample set corresponding to a first timbre attribute, such as a high-pitch attribute to be adjusted) and a negative sample set (i.e., a second sample set corresponding to a second timbre attribute, such as a low-pitch attribute to be adjusted) may be determined.
Transformation mapping is then performed on the first sample set to obtain the first hidden vector, and on the second sample set to obtain the second hidden vector.
The direction vector is then calculated from the first hidden vector and the second hidden vector.
In the embodiment of the disclosure, the direction vector used when adjusting a timbre attribute is determined from the first hidden vector of the first sample set and the second hidden vector of the second sample set, so that arbitrary adjustment of timbre attributes can be realized according to the direction vector.
In some alternative embodiments, obtaining the transformation-mapped first hidden vector according to the first sample set and the transformation-mapped second hidden vector according to the second sample set comprises:
extracting first characterization vectors corresponding to a plurality of first samples of a first sample object from the first sample set, and extracting second characterization vectors corresponding to a plurality of second samples of a second sample object from the second sample set;
and inputting each first characterization vector into a target model and obtaining a plurality of mapped first hidden vectors through transformation mapping, and inputting each second characterization vector into the target model and obtaining a plurality of mapped second hidden vectors through transformation mapping.
Optionally, in an embodiment of the disclosure, a timbre hidden-variable-space encoder based on a flow model is employed, and the flow model is used to project the speaker characterization into a new hidden space, resulting in a hidden vector.
Further, a pre-trained speaker identification network is used to extract the speaker characterization vector of each audio clip: the first characterization vectors corresponding to the plurality of first samples of the first sample object are extracted from the first sample set, and the second characterization vectors corresponding to the plurality of second samples of the second sample object are extracted from the second sample set.
Each first characterization vector is input into the target model and transformation-mapped to obtain a plurality of mapped first hidden vectors; each second characterization vector is input into the target model and transformation-mapped to obtain a plurality of mapped second hidden vectors. The target model may be a flow model, which is a bijective (reversible) function: a network path can be found that transforms distribution A into distribution B, and the same path can transform B back into A.
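As a concrete illustration of such a bijective mapping, the sketch below implements a single additive coupling layer, the simplest building block of a flow model. This is an assumption for illustration only: the patent does not specify the flow architecture, and the names `coupling_net`, `flow_forward` and `flow_inverse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                   # assumed dimension of the speaker vector
W = rng.normal(size=(D // 2, D // 2))   # parameters of the toy conditioner

def coupling_net(x_a):
    """Toy conditioner t(x_a); a real flow would use a small neural network."""
    return np.tanh(x_a @ W)

def flow_forward(sv):
    """Map a speaker characterization vector sv to a hidden vector z (A -> B)."""
    x_a, x_b = sv[:D // 2], sv[D // 2:]
    z_b = x_b + coupling_net(x_a)       # shift one half by a function of the other
    return np.concatenate([x_a, z_b])

def flow_inverse(z):
    """Exact inverse: recover sv from z (B -> A)."""
    z_a, z_b = z[:D // 2], z[D // 2:]
    x_b = z_b - coupling_net(z_a)       # undo the same shift
    return np.concatenate([z_a, x_b])

sv = rng.normal(size=D)
assert np.allclose(flow_inverse(flow_forward(sv)), sv)  # bijectivity check
```

Stacking many such layers, with permutations between them, yields the kind of invertible network that can map a speaker characterization vector into a standard normal hidden space and back.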
In some alternative embodiments, obtaining the direction vector from the first hidden vector and the second hidden vector includes:
determining an average value of the first hidden vectors to obtain a first average hidden vector of the first sample set;
determining an average value of the plurality of second hidden vectors to obtain a second average hidden vector of the second sample set;
and obtaining the difference between the first average hidden vector and the second average hidden vector to obtain a direction vector.
Optionally, after all first hidden vectors corresponding to the positive sample set are obtained, they are averaged to obtain the first average hidden vector of the first sample set; likewise, after all second hidden vectors corresponding to the negative sample set are obtained, they are averaged to obtain the second average hidden vector of the second sample set.
The second average hidden vector is then subtracted from the first average hidden vector to obtain the direction vector.
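The mean-difference computation described above can be sketched in a few lines; the array shapes and names here are assumptions for illustration and are not taken from the patent.

```python
import numpy as np

def direction_vector(z_pos: np.ndarray, z_neg: np.ndarray) -> np.ndarray:
    """First average hidden vector minus second average hidden vector.

    z_pos / z_neg: flow-mapped hidden vectors of the positive (first) and
    negative (second) sample sets, one row per sample.
    """
    return z_pos.mean(axis=0) - z_neg.mean(axis=0)

# e.g., hidden vectors of high-pitch vs. low-pitch sample sets (synthetic data)
rng = np.random.default_rng(0)
d = direction_vector(rng.normal(1.0, 0.1, size=(100, 8)),
                     rng.normal(-1.0, 0.1, size=(100, 8)))
```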
In the embodiment of the disclosure, the final direction vector is obtained from the first hidden vector of the positive sample set and the second hidden vector of the negative sample set, and the timbre attribute to be adjusted is then adjusted using the direction vector. This enables random variation across many styles and diverse timbre adjustment, without having to recruit speakers, record data in a professional recording studio or purchase speakers' copyrights, which greatly shortens the production cycle for bringing a new timbre online and saves both time and money.
In some alternative embodiments, obtaining the transformation-mapped target hidden vector according to the first target characterization vector of the current timbre comprises:
extracting the first target characterization vector from the current timbre;
and inputting the first target characterization vector into the target model and obtaining the mapped target hidden vector through transformation mapping.
Optionally, the first target characterization vector corresponding to the user's current timbre is extracted, input into the target model, transformed, and projected into a standard normal distribution space to obtain the mapped target hidden vector.
In some optional embodiments, obtaining, according to the target hidden vector and the direction vector, the target timbre audio with the adjusted timbre attribute comprises:
determining an adjustment weight according to the attribute to be adjusted;
determining, according to the attribute to be adjusted, a combined calculation relation among the adjustment weight, the target hidden vector and the direction vector;
and obtaining, according to the combined calculation relation, the target timbre audio in which the timbre attribute of the current timbre has been adjusted.
Optionally, according to the attribute to be adjusted, for example making the input timbre younger, a weight value (such as 1) is selected within the weight range (such as [-1, 1]), and the combined calculation relation among the adjustment weight, the target hidden vector and the direction vector is then determined. In this case, since the input timbre is to be made younger, the combined calculation relation is: the target hidden vector minus the weight multiplied by the direction vector. If the input timbre is to be made older, a weight value (such as 1) is likewise selected within the weight range [-1, 1], and the combined calculation relation is: the target hidden vector plus the weight multiplied by the direction vector.
By varying the weight continuously, a continuous change of the timbre, becoming steadily older or steadily younger, can be realized, and target timbre audio is then synthesized for the timbre whose attribute has been adjusted.
In the embodiment of the disclosure, the target timbre audio is obtained based on smooth, continuous timbre characteristics, so the speech synthesis quality of the target timbre audio is better and more robust.
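The combined calculation relation above reduces to shifting the hidden vector along the direction vector, with the sign chosen by the adjustment direction. The following is a hedged sketch with hypothetical names, assuming simple vector arithmetic in the hidden space.

```python
import numpy as np

def adjust_hidden(z: np.ndarray, d: np.ndarray, w: float,
                  direction: str = "older") -> np.ndarray:
    """Shift the target hidden vector z along direction vector d by weight w.

    'older' adds w*d; 'younger' subtracts it, matching the two combined
    calculation relations described in the text.
    """
    sign = 1.0 if direction == "older" else -1.0
    return z + sign * w * d

# Sweeping w over [-1, 1] moves the timbre continuously along the attribute.
z, d = np.zeros(4), np.ones(4)
trajectory = [adjust_hidden(z, d, w) for w in np.linspace(-1.0, 1.0, 5)]
```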
In some optional embodiments, obtaining, according to the combined calculation relation, the target timbre audio with the adjusted timbre attribute comprises:
processing the target hidden vector according to the combined calculation relation to obtain a third hidden vector;
inputting the third hidden vector into the target model and obtaining a second target characterization vector through inverse transformation;
and performing speech synthesis based on the text feature vector of the text to be synthesized and the second target characterization vector to obtain the synthesized target timbre audio.
Optionally, in embodiments of the present disclosure, a timbre hidden-variable-space encoder based on a flow model and an acoustic model based on multi-speaker ParatacoHubert are employed. The flow model projects a speaker characterization into a new hidden space to obtain hidden vectors, and the acoustic model uses speaker characterization vectors to synthesize speech audio for different speakers.
As shown in FIG. 2, the speaker characterization vector sv is transformed by the flow model and projected into a standard normal distribution space to obtain the speaker hidden-layer representation z. Conversely, the hidden-layer representation z yields the speaker characterization vector sv through the inverse process of the flow model.
Therefore, the embodiments of the present disclosure exploit the reversibility of the flow model (a network path can be found that transforms distribution A into distribution B, and the same path can transform B back into A). The weight multiplied by the direction vector is first subtracted from (or added to) the target hidden vector to obtain the third hidden vector; the third hidden vector is then input into the target model (i.e., the flow model), and the second target characterization vector is decoded through the inverse transformation of the target model. This second target characterization vector is the speaker characterization vector sv output by the inverse process of the flow model in FIG. 2.
The text to be synthesized is then encoded to generate a text feature vector, and speech synthesis is performed on the decoded second target characterization vector and the text feature vector of the text to be synthesized, thereby obtaining the synthesized target timbre audio.
When synthesizing speech from text, the acoustic model in FIG. 2 can be used: the encoder of the acoustic model first encodes the text to be synthesized into a high-level text representation; the decoder of the acoustic model decodes the high-level representation into HuBERT features; the second target characterization vector is input into the acoustic model, and the spectrum of the timbre is obtained from the text through forward inference; finally, the spectrum is restored into the synthesized target timbre audio by a general vocoder.
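The FIG. 2 flow can be sketched as the following ordering of steps. Every function body here is a stand-in stub (the patent does not publish the flow model, the ParatacoHubert acoustic model or the vocoder), so only the sequence of operations is meaningful; all names and shapes are hypothetical.

```python
import numpy as np

D = 8  # assumed dimension of the speaker characterization vector

def flow_forward(sv):            # speaker token -> hidden vector z (stub)
    return sv * 2.0

def flow_inverse(z):             # exact inverse of flow_forward (stub)
    return z / 2.0

def encode_text(text):           # acoustic-model encoder: text -> high-level repr
    return np.full(len(text), 0.1)

def decode_spectrum(text_repr, sv):  # decoder: repr + speaker token -> spectrum
    return np.outer(text_repr, sv)

def vocoder(spectrum):           # vocoder: spectrum -> waveform samples
    return spectrum.ravel()

def synthesize(sv, text, d, w):
    """Timbre-adjusted synthesis: adjust in hidden space, then synthesize."""
    z = flow_forward(sv)                      # project token into hidden space
    sv_adj = flow_inverse(z + w * d)          # shift by w*d, map back via inverse
    spectrum = decode_spectrum(encode_text(text), sv_adj)
    return vocoder(spectrum)                  # restore the waveform

rng = np.random.default_rng(0)
wave = synthesize(rng.normal(size=D), "hello", rng.normal(size=D), 0.5)
assert wave.shape == (5 * D,)
```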
In the embodiment of the disclosure, the flow model and the acoustic model are deployed together. Once trained, the deployed model can change the attributes of any input timbre, and a single deployment suffices to change the attributes of any timbre multiple times and obtain multiple new timbres, without repeated deployment, thereby reducing the cost of model training and deployment.
This embodiment also provides an audio determining apparatus for timbre attribute adjustment, which is used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated here. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
This embodiment provides an audio determining apparatus for timbre attribute adjustment. As shown in FIG. 3, a structural block diagram of the apparatus according to an embodiment of the present disclosure, the apparatus comprises:
the obtaining module 301 is configured to obtain a direction vector corresponding to an attribute to be adjusted of a current tone, where the direction vector is obtained based on a first hidden vector and a second hidden vector, the first hidden vector is obtained by mapping a first sample set containing the attribute of the first tone, the second hidden vector is obtained by mapping a second sample set containing the attribute of the second tone, and the number of the attributes to be adjusted is at least one;
a first obtaining module 302, configured to obtain a target hidden vector after transformation mapping according to a first target representation vector of a current timbre;
and a second obtaining module 303, configured to obtain, according to the target hidden vector and the direction vector, target tone audio after tone attribute adjustment for the current tone.
In the embodiment of the disclosure, a direction vector corresponding to the attribute to be adjusted of the current tone is obtained, where the direction vector is derived from a first hidden vector and a second hidden vector: the first hidden vector is obtained by transformation mapping of a first sample set containing the first tone attribute, the second hidden vector is obtained by transformation mapping of a second sample set containing the second tone attribute, and the number of attributes to be adjusted is at least one. A target hidden vector is then obtained by transformation mapping from the first target characterization vector of the current tone, and target tone audio with the tone attribute adjusted is obtained from the target hidden vector and the direction vector. Because the direction vector can be obtained directly from the attribute to be adjusted, and the target hidden vector of the tone can then be adjusted directly along that direction, the embodiment of the disclosure can adjust the attributes of any tone. This solves the technical problem in the related art that, when the tone of a speaker in complete recorded audio needs to be changed, technical limitations prevent the speaker's tone attributes from being adjusted flexibly.
In some alternative embodiments, the acquisition module 301 includes:
the acquisition unit is used for acquiring a first sample set containing the attribute of the first tone and a second sample set containing the attribute of the second tone according to the attribute to be adjusted of the current tone;
the first obtaining unit is used for obtaining a first hidden vector after transformation mapping according to the first sample set; obtaining a second hidden vector after transformation mapping according to the second sample set;
and the second obtaining unit is used for obtaining the direction vector according to the first hidden vector and the second hidden vector.
In some alternative embodiments, the first deriving unit comprises:
the extraction submodule is used for extracting first characterization vectors corresponding to a plurality of first samples of the first sample object from the first sample set, and for extracting second characterization vectors corresponding to a plurality of second samples of the second sample object from the second sample set;
the first obtaining submodule is used for inputting each first characterization vector into the target model and obtaining a plurality of mapped first hidden vectors through transformation mapping, and for inputting each second characterization vector into the target model and obtaining a plurality of mapped second hidden vectors through transformation mapping.
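The target model acts here as an invertible (flow-style) mapping between characterization vectors and hidden vectors, which is what later allows adjusted hidden vectors to be mapped back. A minimal sketch of that forward/inverse contract, using a single affine map as an assumed stand-in for the trained flow model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed stand-in for the target (flow) model: one invertible affine map.
# A real normalizing flow stacks many learned invertible layers.
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)  # well-conditioned, invertible
b = rng.standard_normal(4)

def to_hidden(repr_vec):
    """Forward transformation mapping: characterization vector -> hidden vector."""
    return A @ repr_vec + b

def to_repr(hidden_vec):
    """Inverse transformation: hidden vector -> characterization vector."""
    return np.linalg.solve(A, hidden_vec - b)

x = rng.standard_normal(4)   # a characterization vector
z = to_hidden(x)             # its hidden vector
x_back = to_repr(z)          # round trip through the inverse
print(np.allclose(x, x_back))
```

The round trip recovers the input exactly, which is the property the method relies on when it edits hidden vectors and maps them back.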
In some alternative embodiments, the second deriving unit comprises:
the first determining submodule is used for determining the average value of a plurality of first hidden vectors to obtain a first average hidden vector of the first sample set;
the second determining submodule is used for determining the average value of a plurality of second hidden vectors to obtain a second average hidden vector of the second sample set;
the first acquisition sub-module is used for acquiring the difference between the first average hidden vector and the second average hidden vector to obtain a direction vector.
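The three sub-modules above reduce each sample set to its mean hidden vector and take the difference of the means. A sketch with synthetic hidden vectors (the ±1.0 offsets are assumed, merely separating the two attribute populations; real values would come from the flow model):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical hidden vectors for two sample sets, e.g. voices with and
# without the attribute to be adjusted.
first_hidden = rng.standard_normal((100, 8)) + 1.0   # set with the first tone attribute
second_hidden = rng.standard_normal((120, 8)) - 1.0  # set with the second tone attribute

first_mean = first_hidden.mean(axis=0)    # first average hidden vector
second_mean = second_hidden.mean(axis=0)  # second average hidden vector
direction = first_mean - second_mean      # direction vector for the attribute
print(direction.shape)  # (8,)
```

Averaging over many samples cancels speaker-specific variation, so the difference isolates the attribute direction in hidden space.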
In some alternative embodiments, the first obtaining module 302 includes:
the extraction unit is used for extracting a first target characterization vector from the current tone;
and the third obtaining unit is used for inputting the first target characterization vector into the target model, and obtaining the mapped target hidden vector through transformation and mapping.
In some alternative embodiments, the second obtaining module 303 includes:
the first determining unit is used for determining an adjusting weight according to the attribute to be adjusted;
the second determining unit is used for determining a combined calculation relation of the adjusting weight, the target hidden vector and the direction vector according to the attribute to be adjusted;
and the fourth obtaining unit is used for obtaining the target tone color audio after tone attribute adjustment of the current tone color according to the combination calculation relation.
In some alternative embodiments, the fourth deriving unit comprises:
the second acquisition sub-module is used for obtaining a third hidden vector after the target hidden vector is processed according to the combined calculation relation;
the second obtaining submodule is used for inputting the third hidden vector into the target model and obtaining a second target characterization vector through inverse transformation;
and the synthesis submodule is used for performing speech synthesis based on the text feature vector of the text to be synthesized and the second target characterization vector to obtain the synthesized target tone audio.
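Putting these sub-modules together: the combined calculation relation shifts the target hidden vector along the direction vector by an adjustment weight, and the flow's inverse maps the result back to a second target characterization vector. A sketch with assumed dimensions and an assumed affine stand-in for the target model:

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 8

target_hidden = rng.standard_normal(dim)  # hidden vector of the current tone
direction = rng.standard_normal(dim)      # attribute direction vector
alpha = 0.5                               # adjustment weight (assumed value)

# Combined calculation relation: move along the attribute direction.
third_hidden = target_hidden + alpha * direction

# Assumed invertible stand-in for the target model; "inverse transformation"
# here means inverting the flow mapping, not gradient backpropagation.
A = np.eye(dim) * 2.0
b = np.ones(dim)
second_target_repr = np.linalg.solve(A, third_hidden - b)
print(second_target_repr.shape)  # (8,)
```

The sign and magnitude of the weight control whether the attribute is strengthened or weakened, and by how much.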
The audio determining apparatus for tone attribute adjustment in this embodiment is presented in the form of functional units, where a unit refers to an ASIC circuit, a processor and memory executing one or more software or firmware programs, and/or other devices that can provide the functionality described above.
Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.
The embodiment of the disclosure also provides a computer device having the audio determining apparatus for tone attribute adjustment shown in fig. 3.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a computer device according to an alternative embodiment of the disclosure. As shown in fig. 4, the computer device includes: one or more processors 10, memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in fig. 4.
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip, among others. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a general-purpose array logic, or any combination thereof.
Wherein the memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform a method for implementing the embodiments described above.
The memory 20 may include a storage program area and a storage data area; the storage program area may store an operating system and at least one application program required for a function, and the storage data area may store data created according to the use of the computer device, and the like. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.
The computer device also includes a communication interface 30 for the computer device to communicate with other devices or communication networks.
The embodiments of the present disclosure also provide a computer-readable storage medium. The methods according to the embodiments described above may be implemented in hardware or firmware, or as computer code recordable on a storage medium, or as computer code originally stored on a remote storage medium or a non-transitory machine-readable storage medium and downloaded over a network to be stored on a local storage medium, such that the methods described herein may be processed by software stored on a storage medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk, or the like; further, the storage medium may also comprise a combination of the above types of memories. It will be appreciated that a computer, processor, microprocessor controller, or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the methods illustrated by the above embodiments.
Although embodiments of the present disclosure have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the disclosure, and such modifications and variations are within the scope defined by the appended claims.

Claims (10)

1. A method of audio determination for tone property adjustment, the method comprising:
the method comprises the steps of obtaining a direction vector corresponding to an attribute to be adjusted of a current tone, wherein the direction vector is obtained based on a first hidden vector and a second hidden vector, the first hidden vector is obtained by mapping a first sample set containing the attribute of the first tone in a transformation way, the second hidden vector is obtained by mapping a second sample set containing the attribute of the second tone in a transformation way, and the number of the attributes to be adjusted is at least one;
obtaining a target hidden vector after transformation mapping according to the first target representation vector of the current tone;
and obtaining target tone color audio after tone attribute adjustment of the current tone color according to the target hidden vector and the direction vector.
2. The method of claim 1, wherein the obtaining a direction vector corresponding to the attribute to be adjusted of the current tone color comprises:
according to the attribute to be adjusted of the current tone, a first sample set containing the first tone attribute and a second sample set containing the second tone attribute are obtained;
obtaining the first hidden vector after transformation mapping according to the first sample set; obtaining the second hidden vector after transformation mapping according to the second sample set;
and obtaining the direction vector according to the first hidden vector and the second hidden vector.
3. The method according to claim 2, wherein the transformed mapped first hidden vector is obtained from the first sample set; obtaining the transformed and mapped second hidden vector according to the second sample set, including:
extracting first characterization vectors corresponding to a plurality of first samples of a first sample object from the first sample set, and extracting second characterization vectors corresponding to a plurality of second samples of a second sample object from the second sample set;
and inputting each first characterization vector into a target model, mapping through transformation to obtain a plurality of mapped first hidden vectors, inputting each second characterization vector into the target model, and mapping through transformation to obtain a plurality of mapped second hidden vectors.
4. The method of claim 2, wherein the deriving the direction vector from the first hidden vector and the second hidden vector comprises:
determining the average value of a plurality of first hidden vectors to obtain a first average hidden vector of the first sample set;
determining an average value of the plurality of second hidden vectors to obtain a second average hidden vector of the second sample set;
and obtaining the difference between the first average hidden vector and the second average hidden vector to obtain the direction vector.
5. The method of claim 1, wherein the obtaining the transformed mapped object hidden vector from the first object representation vector of the current timbre comprises:
extracting the first target characterization vector from the current tone;
and inputting the first target characterization vector into a target model, and obtaining the mapped target hidden vector through transformation and mapping.
6. The method of claim 1, wherein the obtaining, according to the target hidden vector and the direction vector, target timbre audio with timbre attribute adjustment for the current timbre includes:
determining an adjusting weight according to the attribute to be adjusted;
determining a combined calculation relation of the adjusting weight, the target hidden vector and the direction vector according to the attribute to be adjusted;
and obtaining the target tone color audio after tone attribute adjustment of the current tone color according to the combination calculation relation.
7. The method of claim 6, wherein the obtaining the target timbre audio with timbre attribute adjustment for the current timbre according to the combined calculated relationship comprises:
obtaining a third hidden vector after the target hidden vector is processed according to the combination calculation relation;
inputting the third hidden vector into a target model, and obtaining a second target characterization vector through inverse transformation;
and performing speech synthesis based on the text feature vector of the text to be synthesized and the second target characterization vector to obtain the synthesized target tone color audio.
8. An audio determining apparatus for tone property adjustment, the apparatus comprising:
the device comprises an acquisition module, a control module and a control module, wherein the acquisition module is used for acquiring a direction vector corresponding to an attribute to be adjusted of a current tone, the direction vector is obtained based on a first hidden vector and a second hidden vector, the first hidden vector is obtained by mapping a first sample set containing the attribute of the first tone in a transformation way, the second hidden vector is obtained by mapping a second sample set containing the attribute of the second tone in a transformation way, and the number of the attributes to be adjusted is at least one;
the first obtaining module is used for obtaining a target hidden vector after transformation mapping according to the first target representation vector of the current tone;
and the second obtaining module is used for obtaining target tone color audio after tone attribute adjustment of the current tone color according to the target hidden vector and the direction vector.
9. A computer device, comprising:
a memory and a processor in communication with each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the tone property adjusted audio determining method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the tone property adjusted audio determining method of any one of claims 1 to 7.
CN202311330964.8A 2023-10-13 2023-10-13 Tone attribute adjustment audio frequency determining method, device, equipment and storage medium Pending CN117292673A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311330964.8A CN117292673A (en) 2023-10-13 2023-10-13 Tone attribute adjustment audio frequency determining method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117292673A (en) 2023-12-26

Family

ID=89247920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311330964.8A Pending CN117292673A (en) 2023-10-13 2023-10-13 Tone attribute adjustment audio frequency determining method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117292673A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination