WO2021028236A1 - Systems and methods for sound conversion - Google Patents

Systems and methods for sound conversion

Info

Publication number
WO2021028236A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
target
audio
input
training
Prior art date
Application number
PCT/EP2020/071576
Other languages
French (fr)
Inventor
Antoine CAILLON
Alexey Ozerov
Quang Khanh Ngoc DUONG
Gilles PUY
Xu YAO
Original Assignee
Interdigital Ce Patent Holdings, Sas
Priority date
Filing date
Publication date
Application filed by Interdigital Ce Patent Holdings, Sas filed Critical Interdigital Ce Patent Holdings, Sas
Publication of WO2021028236A1 publication Critical patent/WO2021028236A1/en


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/434 Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams, extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
    • H04N21/4341 Demultiplexing of audio and video streams
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G10L2021/0135 Voice conversion or morphing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the technical field of the one or more embodiments of the present disclosure relates to sound conversion, for instance Voice Conversion (VC). Sound conversion can still raise issues that it would be helpful to address.
  • the present principles enable at least one of the above disadvantages to be resolved by proposing a method for performing a sound (or audio) conversion on an audio signal.
  • the method can be performed for instance before and/or during an encoding of an audio signal, or the audio part of an audiovisual signal, at a transmitter side, or after and/or during a decoding of an audio signal, or the audio part of an audiovisual signal, at a receiver side.
  • an apparatus comprising a processor.
  • the processor can be configured to perform a sound conversion on an audio signal, or on the audio part of an audiovisual signal, by executing any of the aforementioned methods.
  • the apparatus can be comprised in and/or be coupled to an encoder and/or a decoder, so as to be adapted to perform a sound conversion on the audio signal, or the audio part of an audiovisual signal, before and/or during an encoding of the audio and/or audiovisual signal at a transmitter side, or after and/or during a decoding of the audio and/or audiovisual signal at a receiver side.
  • a device comprising an apparatus according to any of the decoding embodiments; and at least one of (i) an antenna configured to receive a signal, the signal including a video block, (ii) a band limiter configured to limit the received signal to a band of frequencies that includes the video block, or (iii) a display configured to display an output representative of a video block.
  • a non-transitory computer readable medium containing data content generated according to any of the described embodiments (for instance encoding embodiments) or variants.
  • a signal comprising audio and/or video data generated according to any of the described encoding embodiments or variants.
  • a bitstream is formatted to include data content generated according to any of the described encoding embodiments or variants.
  • a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the described decoding embodiments or variants.
  • Figure 1 shows a typical processor arrangement in which the described embodiments may be implemented.
  • Figure 2 shows a generic, standard encoding scheme.
  • Figure 3 shows a generic, standard decoding scheme.
  • Figure 4 shows an overview of an audio conversion method of the present disclosure according to at least some embodiments.
  • Figure 5 shows high-level block diagrams of an audio conversion according to at least some exemplary embodiments of the present disclosure, adapted to voice age conversion.
  • Figure 6 represents a block diagram of an exemplary architecture of the sound conversion workflow according to at least some embodiments of the present disclosure.
  • Figures 7A, 7B and 7C represent several implementations of a Target Speaker Selection model according to at least some embodiments of the present disclosure.
  • FIGS. 8A and 8B represent exemplary implementations of a CycleGAN Voice Conversion (CGVC) according to at least some embodiments of the present disclosure.
  • Figure 9 represents an exemplary End-to-end waveform Voice Conversion architecture according to at least some embodiments of the present disclosure.
  • Figures 10A to 10C show a comparison between the encoding of four samples x for different encoder types.
  • Figure 11 shows a comparison between stacked causal convolutions (top) and stacked dilated causal convolutions (bottom).
  • Figure 12 shows a comparison between activated and bypassed Gradient Reversal Layer (GRL) in an experimental environment.
  • Audio (or sound) conversion can be of great interest for many technical fields. When applied to voices, audio conversion is often known as "speech conversion" or "voice conversion". Voice Conversion can be seen as a set of techniques focusing on modifying para-linguistic information about a speech without modifying its explicit semantic content. For instance, words present in a sentence before a voice conversion of the sentence will still be present in the converted sentence, but the pronunciation of some words or the rhythm of the sentence can be modified by the voice conversion. Voice conversion can modify for instance at least some parameters of a speech and alter the perception of the speaker(s) of the speech by a third person. Voice conversion can be of interest in many end-user applications and/or in professional applications, like audiovisual production (for instance movie or documentary production). Voice conversion can be used for instance in order to anonymize voices, to transform a speaker's voice into another person's voice (which is known as "identity conversion"), to alter the gender or the accent of a speaker, and/or to give an emotional touch to a speech.
  • At least some embodiments of the present disclosure propose an audio conversion of an input audio signal based on at least one reference (or target) audio signal, the at least one reference (or target) audio signal being chosen at least partially automatically.
  • the at least one target audio signal can be selected automatically according to a classification criterion (for instance input to the system or obtained via communication and/or storage means).
  • several candidate audio signals can be selected automatically according to different classification criteria, the at least one target audio signal being further selected between the selected candidate audio signals.
  • At least some embodiments of the present disclosure can be used for voice conversion, for instance for aging or de-aging a voice. More precisely, at least some embodiments of the present disclosure relate to ways of making a voice sound naturally older or younger than it actually is, while preserving the speaker's identity, so that one can consider that the speaker of the original and converted voice is the same person, but aged (or de-aged).
  • At least some embodiments of the present disclosure can for instance make it possible to decorrelate age from identity in a voice signal.
  • Aging and/or de-aging voices can permit for instance to match a fictional character's age when dubbing an animation movie, or to de-age an actor's voice in order to re-record an old and heavily distorted recording of him made decades before.
  • At least some embodiments of the present disclosure propose a method 400 comprising: obtaining 420 an input audio signal originating from at least one input sound producer; obtaining 440 at least one target sound producer based at least partially on the obtained input audio signal.
  • the method can further comprise, in at least some embodiments, applying 460 at least one audio conversion between the obtained input audio signal and at least one target audio signal associated with the at least one target sound producer, as illustrated by the sketch below.
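  • As a purely illustrative sketch of the workflow of method 400, the following Python skeleton chains the three steps described above; every function name and type alias in it is a hypothetical placeholder introduced for illustration, not an interface defined by the present disclosure.
```python
from typing import Callable, Sequence

# Hypothetical type aliases for this sketch: an audio signal and its acoustic
# descriptors are both represented as sequences of floats.
Audio = Sequence[float]
Descriptors = Sequence[float]

def convert_audio(
    input_signal: Audio,
    extract_descriptors: Callable[[Audio], Descriptors],       # e.g. MFCC extraction (step 442)
    select_target_speaker: Callable[[Descriptors], str],       # the TSS model (step 444)
    target_signals_of: Callable[[str], Sequence[Audio]],       # dataset lookup for the target speaker
    voice_convert: Callable[[Audio, Sequence[Audio]], Audio],  # e.g. CycleGAN-VC or VQ-VAE based (step 460)
) -> Audio:
    """Sketch of method 400: obtain input (420), obtain target (440), convert (460)."""
    descriptors = extract_descriptors(input_signal)            # 442
    target_speaker_id = select_target_speaker(descriptors)     # 444
    target_signals = target_signals_of(target_speaker_id)
    return voice_convert(input_signal, target_signals)         # 460
```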
  • an audio signal is to be understood as any sequence of audio data, like an audio stream, an audio file, or an audio sample.
  • the input, target and output audio signals are human voice signals.
  • other embodiments of the present disclosure can be applied to input, target and output audio signals being audio sounds of at least one other type.
  • the at least one other type of sounds can include sounds originating from at least one living being other than a human being, for instance sounds made by a kind and/or breed of animals (like sounds originating from dogs, or sounds originating from Briard breed dogs), and/or sounds originating from a non-living being, for instance a mechanical or electronic object (like a motor vehicle for instance).
  • the sound producer of the input, target and/or output audio signal(s) will also be referred to hereinafter as the input, target and/or output "speaker", even if, as explicitly explained above, the present disclosure can find applications for many kinds of sound producers, including human or non-human sound producers.
  • the input audio signal can be obtained via an input user interface like a microphone, from a storage medium, or via a communication interface for instance.
  • the output audio signal can be rendered via an output user interface like a speaker, on a storage medium, or transmitted via a communication interface for instance.
  • the input, target and/or output audio signal can be an audio component of an input, target and/or output multimedia content.
  • the obtaining of the at least one target sound producer can be performed differently depending upon embodiments.
  • at least one identifier of the at least one target sound producer or the at least one corresponding target audio signal can be obtained by using a random pick inside a sound producers dataset and/or an audio signals dataset associating audio signals with sound producers of the sound producers dataset.
  • the sound producers dataset can store information regarding at least one sound producer (used as a reference), like an identifier of the sound producer, at least one indication regarding at least one classification (like at least one classifier), and/or at least one indication adapted to determine whether the sound producer belongs to a class.
  • the sound producers dataset and the audio signal dataset can be implemented by separate storage units or can be gathered in a single storage unit.
  • the sound producers dataset and the audio signal dataset will be commonly referred to hereinafter as a dataset gathering both sound producers and audio signal datasets.
  • the sound producers of the sound producers dataset can be of a same type or of heterogeneous types.
  • the present disclosure will often only refer to "sound producer(s)" (or "speaker(s)"), "sound producer(s) id" (or "speaker(s) id"), instead of "sound producer identifier(s)" (or "speaker identifier(s)") for simplicity purposes.
  • At least some embodiments of the present disclosure propose to obtain a target sound producer via the use of machine learning techniques, for instance deep learning models (like the Target Speaker Selection models that are detailed more deeply hereinafter). At least some embodiments of the present disclosure propose to perform voice conversion of the input audio signal based at least partially on machine learning techniques, for instance deep learning models. Indeed, deep learning techniques can be helpful for finding hidden structure inside data, like the characteristics that convey age in voices.
  • Figure 5 illustrates some block diagrams of an exemplary embodiment of the present disclosure that can be used for a voice age conversion workflow 500, using a Target Speaker Selection (TSS) block 510 (based on a Target Speaker Selection model) and a voice conversion (VC) block 520. Both the TSS block and the VC block are illustrated both as "black boxes" and as more detailed exemplary "white boxes". The same numeral reference is associated with the black and white boxes of a same block.
  • the method 400 can comprise obtaining 420 an input audio signal (AC1) and extracting 442 acoustic descriptors from the obtained input audio signal (AC1).
  • the extracted acoustic descriptors can include Mel-Frequency Cepstral Coefficients (MFCC) of a speech envelope of the raw waveform of the input audio signal.
  • the Target Speaker Selection model of at least some embodiments of the present disclosure is intended to be age-agnostic (i.e. its classification abilities should not rely, or should only lightly rely, on aspects of speech which are affected by age).
  • MFCCs can be used instead of other time-frequency representations or the raw waveform.
  • low-order MFCCs make it possible to discard pitch information, thus giving the model no possibility to extract parameters related to shimmer or jitter, for instance.
  • the acoustic descriptors extraction can be performed via a vocoder like the WORLD vocoder for instance (a minimal extraction sketch is given below).
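  • As a minimal sketch of such a descriptor extraction, the following Python snippet computes low-order MFCCs from a waveform; it uses librosa as an assumed third-party library (rather than a WORLD analysis pipeline), and the number of coefficients is an illustrative choice.
```python
import librosa
import numpy as np

def low_order_mfccs(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Return an (n_mfcc, n_frames) matrix of MFCCs for the given audio file."""
    # 16 kHz sampling, consistent with the waveform sampling rate mentioned later
    waveform, sample_rate = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)
```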
  • obtaining 440 at least one target speaker can comprise selecting 444 at least one speaker by using a target speaker selection (TSS) model, adapted to provide at least one target speaker by “matching” the acoustic descriptors extracted from the input audio signal and the acoustic descriptors of at least some audio signals of the at least one dataset.
  • the matched audio signals are called herein “target” audio signals.
  • a matching can be performed based on a proximity criterion between at least some of the input audio signal acoustic descriptors and at least some of the acoustic descriptors of at least some of the dataset audio signals.
  • the one or more datasets comprise audio signals associated with a speaker identifier and, optionally, an association of the audio signals and/or the speaker identifier with at least one identifier of at least one label.
  • the set of labels or classifiers can differ upon the embodiments of the present disclosure.
  • the one or more datasets comprise audio signals associated with at least one age class.
  • Table 1 hereinafter gives some exemplary age classes, adapted to a human voice dataset, where each age class (or domain) is identified by a class name (or label) and can be associated with voice signals of speakers belonging to some exemplary age range.
  • Table 1 (first example of age classes): a table with columns "Class name" and "Age range".
  • the age classes indicated in “Table 1” have only an illustrative purpose and can differ upon embodiments.
  • the number of classes, and/or the class names can be different, and/or the age range can vary.
  • An age class can also comprise additional information, like at least one piece of information relating to a type of individual (kind of animal, human, ...), a gender and/or an accent, a cultural indication, an educational level, an indication regarding a disease known to have an impact on voice, and so on.
  • the age ranges can be disjoint (like when clustering some elements) or at least some of the age ranges can present some intersections.
  • a same set of voice signals can be classified according to several sets of age ranges. For instance, a same set of voice signals can be both classified according to the age ranges of Table 1 and also according to the age ranges of "Table 2" detailed hereinafter (a data-structure sketch for such age classes is given after Table 2).
  • Table 2 (second example of age classes): a table with columns "Class name" and "Age range".
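  • The following sketch shows one possible way to represent an age class (class name, age range, and optional additional indications) in code; the example instances and age ranges are purely hypothetical illustrations and do not reproduce the values of Table 1 or Table 2.
```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class AgeClass:
    name: str                                   # class name / label, e.g. "young" or "old"
    age_range: Tuple[int, int]                  # inclusive age range in years
    extra: dict = field(default_factory=dict)   # e.g. gender, accent, cultural or educational indication

# Hypothetical, illustrative classes (not the actual table values):
YOUNG = AgeClass(name="young", age_range=(18, 30))
OLD = AgeClass(name="old", age_range=(60, 90))

def age_class_of(age: int, classes: list) -> Optional[AgeClass]:
    """Return the first class whose (possibly overlapping) age range contains the age."""
    for cls in classes:
        low, high = cls.age_range
        if low <= age <= high:
            return cls
    return None

print(age_class_of(25, [YOUNG, OLD]).name)      # "young"
```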
  • the obtaining 440 of at least one target audio signal can be performed based on Machine Learning techniques, like Artificial Neural Networks (ANNs) (or Deep Neural Networks (DNNs)), which have shown state-of-the-art performance in a variety of domains such as computer vision, speech recognition, natural language processing, etc.
  • An Artificial Neural Network is a mathematical model trying to reproduce the architecture of the human brain, and is thus composed of interconnected neurons, each of them having multiple inputs and one output. Given a specific combination of inputs, each neuron may activate itself, producing an output that is passed to the next layer of neurons.
  • the activation of a neuron can be written as y = f(Σᵢ wᵢ xᵢ + b), where xᵢ and y are respectively the inputs and output of the neuron, wᵢ is the i-th weight associated with the i-th input, b is a bias and f is an activation function.
  • This formulation allows us to define a neural layer as a biased matrix product followed by an element-wise non-linear transformation, also called Linear Layer.
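  • The following numpy sketch instantiates this formulation: a single linear layer computed as a biased matrix product followed by an element-wise non-linearity (ReLU here, purely as an example of f; all sizes are arbitrary).
```python
import numpy as np

def relu(v: np.ndarray) -> np.ndarray:
    return np.maximum(v, 0.0)

def linear_layer(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """x: (n_in,), W: (n_out, n_in), b: (n_out,) -> y = f(W x + b) of shape (n_out,)."""
    return relu(W @ x + b)

x = np.array([0.5, -1.0, 2.0])          # a 3-dimensional input
W = np.random.randn(4, 3) * 0.1         # weights of a 4-unit layer
b = np.zeros(4)                         # biases
print(linear_layer(x, W, b))            # activations of the 4 neurons
```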
  • let f_θ be the function defined by an artificial neural network whose set of parameters (weights and biases) is denoted θ, mapping an input x ∈ X into an output y ∈ Y.
  • the goal of our artificial neural network is to make it approximate the function f. This can be done by defining a loss function L, and then optimizing θ to obtain an optimal artificial network f_θ.
  • the loss function can thus help assess the model and, as a consequence, can be helpful to improve (or, in simpler words, "optimize") the model.
  • the optimization of the model is often performed during a training (or learning) phase, involving annotated (or labelled) signals that make it possible to check the consistency of a classification.
  • the performance of a model can thus heavily depend on the training (or learning) step of the network. Often, the more accurate and numerous the training data, the more reliable the trained network.
  • the present disclosure thus proposes, in at least some of its embodiments, to use a TSS model that does not require parallel datasets for training.
  • Figure 6 illustrates an exemplary target speaker selection (TSS) model 510, compatible with the embodiments of figure 5, that can be used for obtaining at least one target speaker and/or the corresponding target audio signal.
  • the exemplary Target Speaker Selection architecture of figure 6 can be used according to at least some embodiments of the present disclosure for aging or de-aging a voice.
  • the TSS model of figure 6 yields for instance the identifier of at least one target speaker from the elements of the dataset(s) labelled "young" / "old".
  • the goal of the TSS model is to find the speaker(s) from the “old” dataset whose voices most closely match the voices of the speakers from the “young” dataset, no matter how old those voices are.
  • the exemplary target audio signal selection model of figure 6 comprises at least three blocks: a feature extraction block 620 (or feature extractor) comprising at least one feature extraction model (noted as TSS1 in figure 5), a domain confusion block 640 comprising at least one domain confusion model (noted as TSS3 in figure 5) and a classification block 660 (or identifier classifier) comprising at least one classification model (noted as TSS4 in figure 5).
  • the feature extraction model of the feature extraction block 620 can reduce the dimension of the acoustic descriptors of the input audio signal that are provided to the TSS model 510, while keeping the information of the acoustic descriptors that is needed later in the processing.
  • high-dimensional acoustic descriptors, like Mel-Frequency Cepstral Coefficients, can thus be reduced to a simpler vector (noted as TSS2 in figure 5).
  • the feature extraction model can reduce a 3072-dimensional vector to a 128-dimensional vector.
  • the feature extraction block 620 outputs at least one vector z which is then fed to the classification block 660 in order to predict an identifier of a speaker (a digit number for instance).
  • the method of the present disclosure comprises training the TSS model (not specifically illustrated in figure 4) and then using the trained TSS model on the input audio signal.
  • the at least one feature extraction model is made age-agnostic by using the at least one domain confusion model (TSS3) of the domain confusion block 640. More precisely, during training, audio descriptors of training audio signals are input to the feature extraction block, whose output is provided to both the classifier and the domain confusion model, which predicts the class (the age class for instance) of a training audio signal.
  • acoustic descriptors of the input audio signal are input to the feature extraction block, whose output is provided to the classifier block.
  • the providing of the output to the domain confusion model is only optional and can be omitted in some embodiments.
  • the training dataset can be split into several sub-datasets, comprising at least one source dataset and one target dataset, each sub-dataset comprising audio streams corresponding to at least one value of the set of classifiers and each classifier corresponding to at least one sub-dataset.
  • the TSS model can take fixed-length MFCCs of signals from both source and target datasets as input.
  • if the signal comes from the target dataset, the TSS model is trained to determine the speaker id and the age class. If the signal comes from the source dataset, the model only tries to determine the age class.
  • Domain Confusion or Domain Adaptation
  • the TSS model is trained to classify speakers from the dataset.
  • This update rule is very similar to a gradient descent rule, and can be implemented directly by using a Gradient Reversal Layer (GRL), which is a pseudo-function R defined by R(x) = x in the forward pass and ∂R/∂x = −I in the backward pass, where I is the identity matrix (a minimal implementation sketch is given below).
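  • The following PyTorch sketch (PyTorch being an assumed framework choice) implements such a gradient reversal layer: the forward pass is the identity, and the backward pass multiplies the incoming gradient by −λ (with λ = 1 reproducing the −I behaviour described above).
```python
import torch

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)                      # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # reverse (and optionally scale) the gradient flowing back to the feature extractor;
        # the second returned value is the (non-existent) gradient w.r.t. lambd
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradientReversal.apply(x, lambd)

# usage sketch: features -> grad_reverse -> domain classifier
z = torch.randn(8, 128, requires_grad=True)
grad_reverse(z).sum().backward()
print(z.grad[0, :4])                             # all gradients equal -1: reversed identity gradient
```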
  • the features (TSS2) extracted by the feature extraction model from an input audio signal can be classified using the classification model (TSS4) in order to yield at least one matching (or target) speaker identifier.
  • the Target Speaker Selection model of figure 6 is a Convolutional Neural Network (CNN), that can take fixed-length MFCCs as input.
  • the TSS model can be trained differently upon embodiments. For instance, the training process of the TSS model can depend on the dataset (source or target) from which the speech signals are taken. If it is a signal from the dataset used as the target dataset, the TSS model is trained to determine the speaker id and to determine from which dataset the signal is coming. If the signal comes from the source dataset, the model can only try to determine from which dataset the signal is coming.
  • a Gradient Reversal Layer can also be applied after the feature space (e.g. just after the feature space) in order to achieve Domain Confusion between both datasets and thus decrease the impact of domain-specific information in the extracted features output to the classifier. The intent is to obtain extracted features that are invariant or almost invariant with age.
  • Dotted boxes with a x n in figure 6 are used for indicating that a group of units can be sequentially repeated n times.
  • the F⁻¹ symbol describes a Gradient Reversal Layer and is used, during training, to achieve domain confusion between the source and target datasets used for training.
  • the feature extraction model of the feature extraction block 620 can comprise a group of functions (or units) comprising at least one Convolutional Layer (CONV), at least one “batch norm” unit, and/or at least one Rectified Linear Unit (ReLU).
  • the group of functions can be repeated sequentially several times (like 1 , 3 or 5 times).
  • the convolutional unit (CONV) outputs data for the batch norm unit, which in turn outputs data to the ReLU.
  • the classification model of the classification block 640 and the domain confusion model of the domain confusion block 660 can each comprise a group of functions comprising at least one Linear Layer (LIN) unit that outputs data to at least one Rectified Linear Unit (ReLU) function.
  • the group of functions can be repeated sequentially several times (like 1 , 3, 5 , 6, 7, or 9 times).
  • the output of the last ReLU function can be input to at least one “SoftMax” function.
  • the output of the classification block is a speaker id, while the output of the domain confusion block is a speaker age in the exemplary embodiments of figure 6.
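  • Putting these groups of units together, the following PyTorch sketch shows one possible TSS-like network in the spirit of figure 6: a convolutional feature extractor (Conv + batch norm + ReLU, repeated), a speaker-id head and a domain (age) head reached through a gradient reversal. The layer sizes and repetition counts are illustrative assumptions, not the exact architecture of the disclosure, and the heads return logits on which a SoftMax / cross-entropy is typically applied.
```python
import torch
from torch import nn

def grad_reverse(x):
    # simple gradient reversal trick (self-contained variant of the GRL above):
    # forward value equals x, backward gradient is multiplied by -1
    return 2 * x.detach() - x

class TSSModel(nn.Module):
    def __init__(self, n_mfcc=24, n_speakers=100, n_domains=2, feat_dim=128):
        super().__init__()
        self.extractor = nn.Sequential(                       # Conv + batch norm + ReLU groups
            nn.Conv1d(n_mfcc, 64, kernel_size=3, padding=1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.speaker_head = nn.Sequential(                    # Linear + ReLU groups, speaker id logits
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, n_speakers),
        )
        self.domain_head = nn.Sequential(                     # Linear + ReLU groups, age/domain logits
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, n_domains),
        )

    def forward(self, mfccs):                                 # mfccs: (batch, n_mfcc, frames)
        z = self.extractor(mfccs)                             # TSS2-like feature vector
        speaker_logits = self.speaker_head(z)                 # speaker id prediction
        domain_logits = self.domain_head(grad_reverse(z))     # age/domain prediction through the GRL
        return speaker_logits, domain_logits

model = TSSModel()
speaker_logits, domain_logits = model(torch.randn(4, 24, 100))
print(speaker_logits.shape, domain_logits.shape)              # (4, 100) and (4, 2)
```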
  • Figures 7A, 7B and 7C present exemplary implementations and/or variants of the exemplary Target Speaker Selection architecture of figure 6.
  • the exemplary TSS model of figure 7A is a speaker classifier that makes it possible to classify voice signals of different domains (herein age classes: "young" and "old"). According to figure 7A, the model is trained to classify which speaker from domain "young" is speaking, while the domain confusion model ensures a domain-agnostic feature extraction.
  • a voice signal of the dataset is associated with an identifier of a speaker of a known age (so as to deduce an associated age class) or is labelled with a known age class (like “young” or “old”).
  • the classifier is made age-agnostic during training (as explained hereinbefore). It basically learns to classify voices of speakers belonging to a single domain ("young" as an example) from a dataset of voice signals of both domains.
  • the use of a speaker age identification model alongside a gradient reversal layer can help ensure that the network only uses information shared between young and old speakers, and thus help the identification of the speakers to be more age-agnostic.
  • the obtaining of at least one target audio signal can comprise feeding the network with signals from an "old" speaker, and selecting, as the at least one target audio signal, the audio signal associated with the one or more "young" speaker(s) yielded by the model as being the closest young speaker(s) to the input "old" speaker.
  • in this example, the TSS model is applied between two values (thus in a 2-domain setting).
  • the TSS model can be an n-domain model with n classifiers, and where every dataset associated with a classifier can be a target dataset, as illustrated by figure 7B.
  • the method can comprise obtaining, further to an input audio signal, a desired age class of the output audio signal, or several audio signals can be output by voice-converting several target audio signals corresponding to different domains.
  • the output can further comprise a probability of an output audio signal. The probability of an output audio signal can be based for instance on a probability of a corresponding target speaker in its domain.
  • the training of the TSS model of the exemplary embodiment of figure 7B can be similar to the training of the TSS model of figures 6 and 7A, with however a training of each classifier to classify its corresponding dataset, and a training of the domain confusion model of the TSS model so as to find the dataset domain from which a training input signal is coming.
  • a new target domain can be added, after the training of the TSS model, without training the whole TSS model again (and thus without using the domain confusion model together with the classifiers).
  • a mapping between a source speaker and at least one speaker of the new target domain can be inferred once the TSS model is trained, by simply projecting the source speaker identifier, the target speaker identifiers and the speaker identifiers of the new target domain onto a feature space and performing a nearest neighbor search, as sketched below.
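  • The following sketch illustrates such a nearest-neighbor mapping in the feature space: each speaker is represented by a feature vector (the random placeholders below stand in for embeddings produced by the trained feature extractor), and every source speaker is mapped to the closest speaker of the new target domain; all speaker names are hypothetical.
```python
import numpy as np

rng = np.random.default_rng(0)
source_embeddings = {"src_speaker_a": rng.normal(size=128),
                     "src_speaker_b": rng.normal(size=128)}
new_domain_embeddings = {"tgt_speaker_1": rng.normal(size=128),
                         "tgt_speaker_2": rng.normal(size=128),
                         "tgt_speaker_3": rng.normal(size=128)}

def nearest_target(source_z: np.ndarray, targets: dict) -> str:
    """Return the identifier of the target-domain speaker whose embedding is closest (L2 distance)."""
    return min(targets, key=lambda spk: np.linalg.norm(targets[spk] - source_z))

for speaker, z in source_embeddings.items():
    print(speaker, "->", nearest_target(z, new_domain_embeddings))
```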
  • the method can comprise performing 460 a voice conversion between the input audio signal and at least one target audio signal associated with the obtained at least one target speaker.
  • a raw waveform of the input audio signal can be fed to the voice conversion block (AC3) alongside at least one obtained target audio signal(s) associated with the obtained at least one target speaker.
  • the audio signal(s) output by the voice conversion can be a converted audio signal, similar to the ones produced by the at least one target speaker of the obtained target audio signal(s) (AC4).
  • Different voice conversion techniques can be used depending upon embodiments. Some embodiments can be based on a voice conversion technique adapted to perform voice identity conversion, for instance a Generative Adversarial Network (GAN) based conversion using a vocoder like WORLD, as the CycleGAN Voice Conversion model illustrated by figures 8A and 8B. Some embodiments of the present disclosure can use an end-to-end waveform Auto-Encoder based conversion, as the one illustrated by figure 9.
  • a CycleGAN model can need to be trained for each input audio signal, which thus requires powerful processing means and a long processing time, but can help obtain a very intelligible output voice.
  • a VQVAE only needs to be trained once.
  • embodiments using a VQVAE can thus be more adapted to be implemented in a device with limited processing capabilities and/or in applications requiring a limited processing time.
  • a Generative Adversarial Network (GAN) model is based on two models, a generator and a discriminator.
  • the generator tries to capture the data distribution
  • the discriminator tries to estimate the probability that a signal came from the training data rather than the generator.
  • the voice conversion performed for obtaining 460 the output audio signal(s) from the target audio signal can involve a Generative Adversarial Network (GAN) based conversion that permits achieving domain-to-domain translation by learning a mapping between two domains in the absence of paired examples, like a Cycle Generative Adversarial Network applied to Voice Conversion (CycleGAN-VC) model.
  • the CycleGAN-VC model is adapted to achieve domain translations for voices from different speakers (translating for instance an input audio signal comprising a voice of an input speaker into at least one output audio signal comprising the same content but with the voice of one of the at least one target speakers).
  • the exemplary CycleGAN Voice Conversion of figures 8A and 8B comprises a generator and a discriminator.
  • the generator tries to convert audio from one speaker to audio from another (target) speaker, while the discriminator tries to find from which speaker the converted speech is coming from.
  • the losses involved during the training are the following: a Cycle Consistency Loss, making sure the model keeps contextual information (like phonemes and/or words), and an adversarial loss penalizing the generator if it outputs audio that does not correspond to the target speaker (see the loss sketch below).
  • the CycleGAN Voice Conversion can be based on different audio features depending upon embodiments. In the exemplary embodiment of figure 5, it can be based on audio features extracted with a WORLD analysis-synthesis pipeline, as illustrated by figure 8C.
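  • The sketch below shows how the two losses named above can be combined; it uses toy fully-connected generators and a discriminator on acoustic feature frames (real CycleGAN-VC networks are convolutional), and the loss weighting is an illustrative assumption.
```python
import torch
from torch import nn

feat_dim = 24
G_xy = nn.Linear(feat_dim, feat_dim)                          # source -> target generator (toy)
G_yx = nn.Linear(feat_dim, feat_dim)                          # target -> source generator (toy)
D_y = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())     # target-domain discriminator (toy)

x = torch.randn(16, feat_dim)                                 # acoustic feature frames of the source speaker

fake_y = G_xy(x)                                              # conversion toward the target speaker
cycled_x = G_yx(fake_y)                                       # conversion back to the source speaker

# Cycle consistency loss: the round trip must preserve the contextual content of x
cycle_loss = nn.functional.l1_loss(cycled_x, x)

# Adversarial loss: the generator is penalized when the discriminator does not
# recognize fake_y as coming from the target speaker
adv_loss = nn.functional.binary_cross_entropy(D_y(fake_y), torch.ones(16, 1))

generator_loss = adv_loss + 10.0 * cycle_loss                 # 10.0 is an illustrative weighting
print(float(generator_loss))
```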
  • the method can perform a voice conversion based on an End- to-end waveform Voice Conversion architecture.
  • the End-to-end waveform Voice Conversion architecture can vary upon embodiments.
  • the End-to-end waveform Voice Conversion architecture can be a voice conversion comprising a Vector-Quantized Auto-Encoder (VQVAE) using a simple waveform encoder alongside an autoregressive WaveNet Decoder.
  • An autoencoder can be seen as a neural network aiming at copying its input to its output. We can split this network into two segments, referred to herein as the encoder q and the decoder p.
  • the encoder q maps the input data x to a point in a latent space Z, which is then mapped back to the original input by the decoder p.
  • Both q and p can be neural networks trained by minimizing a reconstruction objective, for instance L(x) = ||x − p(q(x))||², as in the sketch below.
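  • The following PyTorch sketch trains such an encoder/decoder pair by minimizing the reconstruction objective stated above; the network sizes, optimizer and data are arbitrary illustrative choices.
```python
import torch
from torch import nn

input_dim, latent_dim = 256, 16
q = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))    # encoder
p = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, input_dim))    # decoder

optimizer = torch.optim.Adam(list(q.parameters()) + list(p.parameters()), lr=1e-3)

x = torch.randn(32, input_dim)              # a batch of (toy) input samples
for _ in range(10):                         # a few optimization steps
    z = q(x)                                # map x to a point in the latent space Z
    x_hat = p(z)                            # map the latent point back to the input space
    loss = ((x - x_hat) ** 2).mean()        # reconstruction objective ||x - p(q(x))||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(float(loss))                          # reconstruction error after the last step
```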
  • the encoding / decoding process of an autoencoder can be modified according to a probabilistic point of view.
  • a value x⁽ⁱ⁾ is generated by a conditional distribution p_θ(x | z), where p_θ stands for a parametric distribution p with true parameters θ.
  • VAEs can still be composed of neural networks, except both the encoder and decoder yield parameters defining a probability distribution instead of a latent point / a reconstruction.
  • the compactness of such a latent space can then be obtained by applying some constraints on the prior p_θ(z), like a Kullback-Leibler divergence or a Maximum Mean Discrepancy, as shown in figures 10A to 10C.
  • Figures 10A, 10B and 10C show a comparison between the encoding of four samples x for different encoder types.
  • Figure 10A illustrates a classic autoencoder: the latent space is neither contiguous nor compact, allowing very good straight-through reconstruction but not allowing any latent manipulation.
  • Figure 10B illustrates a variational autoencoder: the latent space is contiguous but filled with holes where data has never been encoded / decoded. In this scenario, an interpolation between the upper and the lower area of the latent space is impossible.
  • Figure 10C illustrates a regularized variational autoencoder, where the prior p_θ(z) is forced to match a manually defined distribution. This space is contiguous and compact, allowing smooth interpolation between points.
  • time-frequency representations can allow a compact bi-dimensional representation of audio that can be used as is in image-based deep learning architectures.
  • Latent representation of spectrograms can also be used.
  • the phase information in a spectrogram can be crucial to reconstruct audio and is however really unstructured, yielding learning difficulties when trying to autoencode it. Hence, most of the time the phase is discarded, making however the reconstruction process difficult.
  • Some approaches can be used to invert such spectrograms, like the Griffin-Lim algorithm, which iteratively estimates the missing phase, or a Multi-Head Convolutional Neural Network, which takes an amplitude spectrogram as input and yields an audio waveform. Both of those approaches allow the reconstruction of audio (a Griffin-Lim sketch is given below).
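  • As an illustration of the first approach, the sketch below inverts an amplitude spectrogram with the Griffin-Lim algorithm; librosa is an assumed library choice and the synthetic test tone and STFT parameters are arbitrary.
```python
import librosa
import numpy as np

sr = 16000
y = 0.5 * np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)                  # 1 s sine wave at 440 Hz

S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))                   # amplitude spectrogram (phase discarded)
y_rec = librosa.griffinlim(S, n_iter=32, hop_length=256, n_fft=1024)      # iterative phase estimation

print(y.shape, y_rec.shape)                                               # original and reconstructed waveforms
```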
  • each audio sample x t can be conditioned on all the previous timesteps.
  • since generating a sample given all the previous ones would be impossible from both a memory and a computation-time point of view, we only take into account the T previous samples to generate the next one.
  • dilated convolutions can be used.
  • stacked non-dilated causal convolutions can yield a receptive field of size 5 for a stack of 5 convolutions, whereas stacked dilated causal convolutions yield a receptive field of size 16 for the same number of convolutions.
  • Dilated convolutions can allow the overall network to have a much larger receptive field, hence allowing the capture of much longer time dependencies.
  • all inputs are padded in order to get causal convolutions: every sample only relies on the previous ones, as stated in the equation above (a dilated causal convolution sketch is given below).
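  • The following PyTorch sketch builds a small stack of dilated causal 1-D convolutions; left-padding each layer by its dilation keeps the convolutions causal (output sample t only depends on input samples up to t), and the channel count and dilations are illustrative choices.
```python
import torch
from torch import nn

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=32, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d) for d in dilations
        )
        self.dilations = dilations

    def forward(self, x):                        # x: (batch, channels, time)
        for conv, d in zip(self.layers, self.dilations):
            x = nn.functional.pad(x, (d, 0))     # left (causal) padding equal to the dilation
            x = torch.relu(conv(x))
        return x

net = DilatedCausalStack()
out = net(torch.randn(1, 32, 100))
print(out.shape)                                 # torch.Size([1, 32, 100]): time length preserved
```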
  • some embodiments use a DeepVoice variant, like the exemplary DeepVoice variant whose architecture is shown in figure 11. It comprises an initial convolution layer followed by several stacked residual blocks, and finally two rectified convolution layers and a SoftMax.
  • the dilation of the i-th convolution layer can for instance follow d = 2^(i mod c).
  • T samples are fed to the network alongside a local condition (spectrograms, latent space, ...) and a global condition (speaker embedding).
  • the model outputs the predicted next sample given the T previous ones. The process can be repeated over this prediction in order to generate an audio signal.
  • the audio waveform is first transformed using a μ-law companding, for instance f(x) = sign(x)·ln(1 + μ|x|) / ln(1 + μ) with μ = 255, and then encoded to a 256-dimensional one-hot vector (see the sketch below).
  • the resulting audio is an 8-bit μ-law signal sampled at 16 kHz.
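  • The sketch below applies this μ-law companding and the 8-bit quantization to a synthetic waveform, then expands each quantized sample into a 256-dimensional one-hot vector; the numpy implementation and the test signal are illustrative.
```python
import numpy as np

def mu_law_encode(x: np.ndarray, mu: int = 255) -> np.ndarray:
    """Map a waveform in [-1, 1] to integer codes in [0, 255] (8-bit mu-law)."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1.0) / 2.0 * mu + 0.5).astype(np.int64)

def one_hot(codes: np.ndarray, depth: int = 256) -> np.ndarray:
    return np.eye(depth, dtype=np.float32)[codes]

waveform = 0.8 * np.sin(np.linspace(0, 2 * np.pi, 16))   # toy waveform in [-1, 1]
codes = mu_law_encode(waveform)
print(codes)                                              # integer codes in [0, 255]
print(one_hot(codes).shape)                               # (16, 256): one one-hot vector per sample
```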
  • each audio sample can be modeled using a discrete mixture of logistics alongside a higher sampling rate in order to get a higher audio quality, at the cost, however, of a longer training time.
  • Some embodiments of the present disclosure can use a model other than WaveNet for generating raw audio. For instance, some embodiments of the present disclosure can use a parallel version of WaveNet taking advantage of Inverse Autoregressive Flows in order to provide a generative model with real-time abilities. Some embodiments of the present disclosure can use a recurrent version of WaveNet called WaveRNN. Because its core features are inference speed and reduced computational cost, such a model can be adapted to generate raw waveforms on an embedded device. Some embodiments of the present disclosure can use a flow-based version of WaveNet in order to simplify the training procedure involved in parallel WaveNet.
  • the global condition can be expressed as an n-dimensional vector filled with the integer value "0", except for the i-th dimension of the vector, which is filled with the integer value "1", where i is an integer value that corresponds to an identifier of a speaker.
  • a first embodiment can use a random pick to obtain a target speaker, followed by a voice conversion based on CGVC
  • a second embodiment can use a TSS model as described above, followed by a voice conversion based on CGVC
  • a third embodiment can use random pick followed by a voice conversion based on VQVAE
  • a fourth embodiment can use a TSS model as described above, followed by a voice conversion based on VQVAE.
  • Some exemplary use cases include aging or de-aging speaker voices, or recreating someone's voice at a given age, in the technical field of movie production. Some other exemplary use cases can belong to the technical field of call applications, where it can be useful for an aged person to get a younger voice during a call (like a phone call) to feel more self-confident, or to appear more reliable from the point of view of a distant caller (or a distant called person). Use cases of the present disclosure can also be found in dynamic age adaptation applications.
  • the voice conversion method of the present disclosure can help improve the client experience during a dialogue between an employee of a call center and a client, by permitting the employee's voice to be modified so as to sound as old as the client's voice, thus reducing the age gap between them.
  • the target speaker selection model can be used for other parameters (with other classifiers), such as parameters related to language in applications dealing with translation, to find a speaker speaking another language who matches an actor's voice, or parameters related to accent, with the purpose of getting rid of an accent, to find a matching speaker with a neutral accent.
  • Some embodiments of the present disclosure can also find applications in domains other than voice. For instance, in movie production, some embodiments of the present disclosure can be used for aging the motor sound produced by a vehicle.
  • Objective evaluation can be based on different parameters, like parameters adapted to measure or evaluate a quality of the output speech (in terms of audio quality and/or intelligibility for instance). For instance, objective evaluation can rely at least partially on shimmer, jitter and/or on the harmonic-to-noise ratio (HNR).
  • The HNR, which describes the amount of noise present in a speaker's voice, can be an interesting acoustic descriptor regarding speech age, as the HNR tends to decrease with age and seems more reliable than other acoustic parameters (like jitter) to discriminate elderly / young adult voices.
  • Subjective evaluation can be based on perceptual tests permitting to obtain, via the feedback of one or more third persons, some parameters adapted to measure or evaluate at least one quality criterion, for instance parameters related to identity coherence, as assessed by third person(s) answering questions regarding the (presumed) identity of the speaker of both the input and the output speech (like the question "Does the converted speech sound like it has been said by the same speaker at a different age?"), or parameters related to the age of the input and/or output speech (like the question "How old is the converted voice?").
  • Some tests can also be performed for evaluating only the TSS model.
  • the TSS model can be tested with an exemplary simple dataset comprising training and test elements (like the dataset known as the "Modified National Institute of Standards and Technology database" ("MNIST"), which is one of the most used hand-written digit datasets; every digit is a 14*14 black and white image).
  • Domain Confusion or Domain Adaptation is a set of techniques allowing the training of a model when there is a shift between the training and test distributions (where the training (resp. test) distribution refers to the distribution from which the training (resp. test) dataset is sampled).
  • the test dataset can be spatially altered, with a random sinusoidal distortion for instance, so that each element of the dataset is still recognizable, but the overall distribution of the test set is shifted from that of the training set.
  • the TSS model can be trained with and without the GRL, the classification accuracy and confusion being then compared on both the train and test sets. Results are reported in Table 3 below. Table 3 shows a comparison of the accuracy and confusion of the model for both original and distorted datasets, and for both bypassed and activated Gradient Reversal Layer. The use of the Gradient Reversal Layer shows a significant accuracy increase when classifying the test dataset.
  • Activation of the GRL can permit to increase the overlapping of samples from the original non-distorted and distorted datasets and can thus help obtain a model that is more accurate for both the original non-distorted and distorted domains.
  • Figure 12 shows how the feature space can be organized for the original and distorted datasets of the MNIST dataset.
  • the feature space z is projected onto a bi-dimensional plane with the t-SNE algorithm in order to visualize it (a minimal projection sketch is given below).
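  • The sketch below performs such a 2-D t-SNE projection with scikit-learn (an assumed library choice); the random feature vectors stand in for features extracted by the TSS model from the original and distorted datasets.
```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
z_original = rng.normal(loc=0.0, scale=1.0, size=(200, 128))     # features of original samples (placeholder)
z_distorted = rng.normal(loc=0.5, scale=1.0, size=(200, 128))    # features of distorted samples (placeholder)

features = np.vstack([z_original, z_distorted])
embedded = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
print(embedded.shape)                                            # (400, 2): one 2-D point per sample, ready to plot
```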
  • At least one of the aspects generally relates to conversion of an audio signal, or an audio component of an audiovisual signal, and at least one other aspect generally relates to transmitting a converted bitstream.
  • These and other aspects can be implemented as a method, an apparatus, a computer readable storage medium having stored thereon instructions for converting audio data according to any of the methods described, and/or a computer readable storage medium having stored thereon a bitstream generated according to any of the methods described.
  • numeric values are used in the present application (for example regarding a number of groups of units in a model or sub-model).
  • the specific values are for example purposes and the aspects described are not limited to these specific values.
  • Figure 4 illustrates an exemplary method 400 for audio conversion. Variations of this method 400 are contemplated, but the audio converting method 400 has been described above for purposes of clarity without describing all expected variations.
  • the method for audio conversion can further be part of a method for encoding or a method for decoding as illustrated by figures 2 and 3.
  • At least some embodiments relate to improving compression efficiency compared to existing video compression systems such as HEVC (HEVC refers to High Efficiency Video Coding, also known as H.265 and MPEG-H Part 2 described in "ITU-T H.265 Telecommunication standardization sector of ITU (10/2014), series H: audiovisual and multimedia systems, infrastructure of audiovisual services - coding of moving video, High efficiency video coding, Recommendation ITU-T H.265"), or compared to video compression systems under development such as VVC (Versatile Video Coding, a new standard being developed by JVET, the Joint Video Experts Team).
  • image and video coding schemes usually employ prediction, including spatial and/or motion vector prediction, and transforms to leverage spatial and temporal redundancy in the video content.
  • intra or inter prediction is used to exploit the intra or inter frame correlation, then the differences between the original image and the predicted image, often denoted as prediction errors or prediction residuals, are transformed, quantized, and entropy coded.
  • the compressed data are decoded by inverse processes corresponding to the entropy coding, quantization, transform, and prediction.
  • Mapping and inverse mapping processes can be used in an encoder and decoder to achieve improved coding performance. Indeed, for better coding efficiency, signal mapping may be used. Mapping aims at better exploiting the samples codewords values distribution of the video pictures.
  • Figure 2 illustrates an encoder 100. Variations of this encoder 100 are contemplated, but the encoder 100 is described below for purposes of clarity without describing all expected variations.
  • the video sequence may go through pre-encoding processing (101), for example, applying a color transform to the input color picture (e.g., conversion from RGB 4:4:4 to YCbCr4:2:0), or performing a remapping of the input picture components in order to get a signal distribution more resilient to compression (for instance using a histogram equalization of one of the color components) and/or performing sound conversion of an audio part of the video sequence according to some embodiments of the present disclosure.
  • Metadata can be associated with the pre-processing and attached to the bitstream.
  • a picture is encoded by the encoder elements as described below.
  • the picture to be encoded is partitioned (102) and processed in units of, for example, CUs.
  • Each unit is encoded using, for example, either an intra or inter mode.
  • intra prediction 160
  • inter mode motion estimation (175) and compensation (170) are performed.
  • the encoder decides (105) which one of the intra mode or inter mode to use for encoding the unit, and indicates the intra/inter decision by, for example, a prediction mode flag.
  • Prediction residuals are calculated, for example, by subtracting (110) the predicted block from the original image block.
  • the prediction residuals are then transformed (125) and quantized (130).
  • the quantized transform coefficients, as well as motion vectors and other syntax elements, are entropy coded (145) to output a bitstream.
  • the encoder can skip the transform and apply quantization directly to the non-transformed residual signal.
  • the encoder can bypass both transform and quantization, i.e., the residual is coded directly without the application of the transform or quantization processes.
  • the encoder decodes an encoded block to provide a reference for further predictions.
  • the quantized transform coefficients are de-quantized (140) and inverse transformed (150) to decode prediction residuals.
  • In-loop filters (165) are applied to the reconstructed picture to perform, for example, deblocking/SAO (Sample Adaptive Offset) filtering to reduce encoding artifacts.
  • the filtered image is stored at a reference picture buffer (180).
  • Figure 3 illustrates a block diagram of a video decoder 200.
  • a bitstream is decoded by the decoder elements as described below.
  • Video decoder 200 generally performs a decoding pass reciprocal to the encoding pass as described in Figure 2.
  • the encoder 100 also generally performs video decoding as part of encoding video data.
  • the input of the decoder includes a video bitstream, which can be generated by video encoder 100.
  • the bitstream is first entropy decoded (230) to obtain transform coefficients, motion vectors, and other coded information.
  • the picture partition information indicates how the picture is partitioned.
  • the decoder may therefore divide (235) the picture according to the decoded picture partitioning information.
  • the transform coefficients are de-quantized (240) and inverse transformed (250) to decode the prediction residuals.
  • Combining (255) the decoded prediction residuals and the predicted block an image block is reconstructed.
  • the predicted block can be obtained (270) from intra prediction (260) or motion-compensated prediction (i.e., inter prediction) (275).
  • In-loop filters (265) are applied to the reconstructed image.
  • the filtered image is stored at a reference picture buffer (280).
  • the decoded picture can further go through post-decoding processing (285), for example, an inverse color transform (e.g. conversion from YCbCr 4:2:0 to RGB 4:4:4) or an inverse remapping performing the inverse of the remapping process performed in the pre-encoding processing (101), and/or performing sound conversion of an audio part of the decoded video sequence according to some embodiments of the present disclosure.
  • post-decoding processing can use metadata derived in the pre-encoding processing and signaled in the bitstream.
  • FIG. 1 illustrates a block diagram of an example of a system 1000 in which various aspects and embodiments are implemented.
  • System 1000 can be embodied as a device including the various components described below and is configured to perform one or more of the aspects described in this document. Examples of such devices include, but are not limited to, various electronic devices such as personal computers, laptop computers, smartphones, tablet computers, digital multimedia set top boxes, digital television receivers, personal video recording systems, connected home appliances, and servers.
  • Elements of system 1000, singly or in combination can be embodied in a single integrated circuit (IC), multiple ICs, and/or discrete components.
  • the processing and encoder/decoder elements of system 1000 are distributed across multiple ICs and/or discrete components.
  • system 1000 is communicatively coupled to one or more other systems, or other electronic devices, via, for example, a communications bus or through dedicated input and/or output ports.
  • system 1000 is configured to implement one or more of the aspects described in this document.
  • the system 1000 includes at least one processor 1010 configured to execute instructions loaded therein for implementing, for example, the various aspects described in this document.
  • Processor 1010 can include embedded memory, input output interface, and various other circuitries as known in the art.
  • the system 1000 includes at least one memory 1020 (e.g., a volatile memory device, and/or a non-volatile memory device).
  • System 1000 includes a storage device 1040, which can include non-volatile memory and/or volatile memory, including, but not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Random Access Memory (RAM), Dynamic Random-Access Memory (DRAM), Static Random- Access Memory (SRAM), flash, magnetic disk drive, and/or optical disk drive.
  • the storage device 1040 can include an internal storage device, an attached storage device (including detachable and non-detachable storage devices), and/or a network accessible storage device, as non-limiting examples.
  • System 1000 includes an audio converter module 1030 configured, for example, to process data to provide a converted audio signal, and the audio converter module 1030 can include its own processor and memory.
  • the audio converter module 1030 represents module(s) that can be included in a device to perform the audio converting functions. Audio converter module 1030 can be implemented as a separate element of system 1000 or can be incorporated within processor 1010 as a combination of hardware and software as known to those skilled in the art.
  • Program code to be loaded onto processor 1010 or audio converter module 1030 to perform the various aspects described in this document can be stored in storage device 1040 and subsequently loaded onto memory 1020 for execution by processor 1010.
  • processor 1010, memory 1020, storage device 1040, and audio converter module 1030 can store one or more of various items during the performance of the processes described in this document.
  • Such stored items can include, but are not limited to, the input audio signal or portions of the input audio signal, the target dataset (including target audio signals and optionally associated information relating to a target domain), the output audio signal or portions of the output audio signal, at least some of the coefficients of at least one of the neural network(s) introduced above, at least one classifier set (or an identifier of at least one classifier set) used by the TSS model, value of at least one classifier used by the TSS model, the bitstream, matrices, variables, and intermediate or final results from the processing of equations, formulas, operations, and operational logic.
  • memory inside of the processor 1010 and/or the audio converter module 1030 is used to store instructions and to provide working memory for processing that is needed during audio conversion.
  • a memory external to the processing device (for example, the processing device can be either the processor 1010 or the audio converter module 1030) is used for one or more of these functions.
  • the external memory can be the memory 1020 and/or the storage device 1040, for example, a dynamic volatile memory and/or a non-volatile flash memory.
  • an external non-volatile flash memory is used to store the operating system of, for example, a television.
  • a fast external dynamic volatile memory such as a RAM is used as working memory for audio converting operations.
  • the input to the elements of system 1000 can be provided through various input devices as indicated in block 1130.
  • Such input devices include, but are not limited to, (i) a radio frequency (RF) portion that receives an RF signal transmitted, for example, over the air by a broadcaster, (ii) a Component (COMP) input terminal (or a set of COMP input terminals), (iii) a Universal Serial Bus (USB) input terminal, and/or (iv) a High Definition Multimedia Interface (HDMI) input terminal.
  • the input devices of block 1130 have associated respective input processing elements as known in the art.
  • the RF portion can be associated with elements suitable for (i) selecting a desired frequency (also referred to as selecting a signal, or band-limiting a signal to a band of frequencies), (ii) down converting the selected signal, (iii) band-limiting again to a narrower band of frequencies to select (for example) a signal frequency band which can be referred to as a channel in certain embodiments, (iv) demodulating the down converted and band-limited signal, (v) performing error correction, and (vi) demultiplexing to select the desired signal of data packets.
  • the RF portion of various embodiments includes one or more elements to perform these functions, for example, frequency selectors, signal selectors, band-limiters, channel selectors, filters, downconverters, demodulators, error correctors, and demultiplexers.
  • the RF portion can include a tuner that performs various of these functions, including, for example, down converting the received signal to a lower frequency (for example, an intermediate frequency or a near-baseband frequency) or to baseband.
  • the RF portion and its associated input processing element receives an RF signal transmitted over a wired (for example, cable) medium, and performs frequency selection by filtering, down converting, and filtering again to a desired frequency band.
  • Adding elements can include inserting elements in between existing elements, such as, for example, inserting amplifiers and an analog-to-digital converter.
  • the RF portion includes an antenna.
  • USB and/or HDMI terminals can include respective interface processors for connecting system 1000 to other electronic devices across USB and/or HDMI connections.
  • various aspects of input processing, for example Reed-Solomon error correction, can be implemented within a separate input processing IC or within processor 1010 as necessary.
  • similarly, aspects of USB or HDMI interface processing can be implemented within separate interface ICs or within processor 1010 as necessary.
  • the demodulated, error corrected, and demultiplexed signal is provided to various processing elements, including, for example, processor 1010, and encoder/decoder 1030 operating in combination with the memory and storage elements to process the data stream as necessary for presentation on an output device.
  • the elements of system 1000 can be interconnected by a suitable connection arrangement 1140, for example, an internal bus as known in the art, including the Inter-IC (I2C) bus, wiring, and printed circuit boards.
  • the system 1000 includes communication interface 1050 that enables communication with other devices via communication channel 1060.
  • the communication interface 1050 can include, but is not limited to, a transceiver configured to transmit and to receive data over communication channel 1060.
  • the communication interface 1050 can include, but is not limited to, a modem or network card and the communication channel 1060 can be implemented, for example, within a wired and/or a wireless medium.
  • in several embodiments, data can be streamed to the system 1000 using a wireless network such as a Wi-Fi network, for example IEEE 802.11 (IEEE refers to the Institute of Electrical and Electronics Engineers).
  • the Wi-Fi stream of these embodiments is received over the communications channel 1060 and the communications interface 1050, which are adapted for Wi-Fi communications.
  • the communications channel 1060 of these embodiments is typically connected to an access point or router that provides access to external networks, including the Internet, for allowing streaming applications and other over-the-top communications.
  • Other embodiments provide streamed data to the system 1000 using a set-top box that delivers the data over the HDMI connection of the input block 1130.
  • Still other embodiments provide streamed data to the system 1000 using the RF connection of the input block 1130.
  • various embodiments provide data in a non-streaming manner.
  • various embodiments use wireless networks other than Wi-Fi, for example a cellular network or a Bluetooth network.
  • the system 1000 can provide an output signal to various output devices, including a display 1100, speakers 1110, and other peripheral devices 1120.
  • the display 1100 of various embodiments includes one or more of, for example, a touchscreen display, an organic light-emitting diode (OLED) display, a curved display, and/or a foldable display.
  • the display 1100 can be for a television, a tablet, a laptop, a cell phone (mobile phone), or another device.
  • the display 1100 can also be integrated with other components (for example, as in a smart phone), or separate (for example, an external monitor for a laptop).
  • the other peripheral devices 1120 include, in various examples of embodiments, one or more of a stand-alone digital video disc (or digital versatile disc) (DVD, for both terms), a disk player, a stereo system, and/or a lighting system.
  • Various embodiments use one or more peripheral devices 1120 that provide a function based on the output of the system 1000.
  • a disk player performs the function of playing the output of the system 1000.
  • control signals are communicated between the system 1000 and the display 1100, speakers 1110, or other peripheral devices 1120 using signaling such as AV.Link, Consumer Electronics Control (CEC), or other communications protocols that enable device-to-device control with or without user intervention.
  • the output devices can be communicatively coupled to system 1000 via dedicated connections through respective interfaces 1070, 1080, and 1090. Alternatively, the output devices can be connected to system 1000 using the communications channel 1060 via the communications interface 1050.
  • the display 1100 and speakers 1110 can be integrated in a single unit with the other components of system 1000 in an electronic device such as, for example, a television.
  • the display interface 1070 includes a display driver, such as, for example, a timing controller (T Con) chip.
  • the display 1100 and speaker 1110 can alternatively be separate from one or more of the other components, for example, if the RF portion of input 1130 is part of a separate set-top box.
  • the output signal can be provided via dedicated output connections, including, for example, HDMI ports, USB ports, or COMP outputs.
  • the embodiments can be carried out by computer software implemented by the processor 1010 or by hardware, or by a combination of hardware and software. As a nonlimiting example, the embodiments can be implemented by one or more integrated circuits.
  • the memory 1020 can be of any type appropriate to the technical environment and can be implemented using any appropriate data storage technology, such as optical memory devices, magnetic memory devices, semiconductor-based memory devices, fixed memory, and removable memory, as non-limiting examples.
  • the processor 1010 can be of any type appropriate to the technical environment, and can encompass one or more of microprocessors, general purpose computers, special purpose computers, and processors based on a multicore architecture, as non-limiting examples.
  • Output can encompass all or part of the processes performed, for example, on the converted audio signal and/or on the selected audio signal and the associated information on which the audio target is based, in order to produce a final output suitable to be rendered on a speaker and/or on a display (such as a picture of the speaker (or individual) from which a selected audio signal comes).
  • processes include one or more of the processes typically performed by an output or rendering device.
  • whether the term “audio converting process” is intended to refer specifically to a subset of operations or generally to the broader conversion process will be clear based on the context of the specific descriptions and is believed to be well understood by those skilled in the art. Note that the syntax elements as used herein are descriptive terms. As such, they do not preclude the use of other syntax element names.
  • Various embodiments refer to optimization. There are different approaches to solve an optimization problem. For example, the approaches may be based on an extensive testing of all options, including all considered modes or parameter values. Other approaches only evaluate a subset of the possible options.
  • the implementations and aspects described herein can be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed can also be implemented in other forms (for example, an apparatus or program).
  • An apparatus can be implemented in, for example, appropriate hardware, software, and firmware.
  • the methods can be implemented in, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants (“PDAs”), and other devices that facilitate communication of information between end-users.
  • references to “one embodiment” or “an embodiment” or “one implementation” or “an implementation”, as well as other variations thereof, mean that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment.
  • the appearances of the phrase “in one embodiment” or “in an embodiment” or “in one implementation” or “in an implementation”, as well any other variations, appearing in various places throughout this application are not necessarily all referring to the same embodiment.
  • this application may refer to “determining” various pieces of information. Determining the information can include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.
  • this application may refer to “accessing” various pieces of information. Accessing the information can include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, moving the information, copying the information, calculating the information, determining the information, predicting the information, or estimating the information. Additionally, this application may refer to “receiving” various pieces of information. Receiving is, as with “accessing”, intended to be a broad term. Receiving the information can include one or more of, for example, accessing the information, or retrieving the information (for example, from memory).
  • receiving is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
  • the use of any of the following “/”, “and/or”, and “at least one of”, for example, in the cases of “A/B”, “A and/or B” and “at least one of A and B”, is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B).
  • such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C).
  • This may be extended, as is clear to one of ordinary skill in this and related arts, for as many items as are listed.
  • implementations can produce a variety of signals formatted to carry information that can be, for example, stored or transmitted.
  • the information can include, for example, instructions for performing a method, or data produced by one of the described implementations.
  • a signal can be formatted to carry the bitstream of a described embodiment.
  • Such a signal can be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal.
  • the formatting can include, for example, encoding a data stream and modulating a carrier with the encoded data stream.
  • the information that the signal carries can be, for example, analog or digital information.
  • the signal can be transmitted over a variety of different wired or wireless links, as is known.
  • the signal can be stored on a processor-readable medium.
  • embodiments can be provided alone or in any combination, across various claim categories and types. Further, embodiments can include one or more of the following features, devices, or aspects, alone or in any combination, across various claim categories and types:
  • a process or device to perform an audio conversion.
  • a bitstream or signal that includes one or more of the described syntax elements, or variations thereof.
  • a TV, set-top box, cell phone, tablet, or other electronic device that performs an audio conversion method(s) according to any of the embodiments described.
  • some aspects of the present disclosure relate to a method comprising: obtaining an input audio signal originated from at least one input sound producer; selecting at least one target sound producer from a dataset of candidate sound producers labelled with a plurality of domains, said selecting being based at least partially on a target domain of said plurality of domains and on the obtained input audio signal.
  • the method comprises applying at least one audio conversion between the obtained input audio signal and at least one target audio signal associated with the at least one target sound producer.
  • said selecting is performed by using a first Deep Neural Network classifying audio features output by a second Deep Neural Network from said obtained input audio signal.
  • the method comprises training said first and said second deep neural network using a third neural network.
  • said training comprises training said first neural network to classify audio features output by said second network according to identities of a first part of said candidate sound producers.
  • said training comprises training said third network to associate audio features output by said second network according to their domains.
  • said training comprises training said second network, by using as input a first part of said candidate sound producers, to extract audio features which minimize a first loss function associated to said first neural network while maximizing a loss function associated to said third neural network.
  • one of the plurality of domains is used for training said first and said second deep neural network using a domain confusion network.
  • said plurality of domains is a plurality of age ranges.
  • said input audio signal and said target audio signal are non-parallel audio signals.
  • Another aspect of the present disclosure relates to an apparatus comprising at least one processor configured for: obtaining an input audio signal originated from at least one input sound producer; selecting at least one target sound producer from a dataset of candidate sound producers labelled with a plurality of domains based at least partially on a target domain of said plurality of domains and on the obtained input audio signal.
  • the at least one processor can be configured to perform the aforementioned method in any of its embodiments.
  • Another aspect of the present disclosure relates to a computer program product comprising instructions which, when executed by at least one processor, cause the at least one processor to perform the aforementioned method in any of its embodiments.
  • At least some embodiments of the present disclosure relate to a computer program product comprising instructions which when executed by a processor cause the processor to perform a method comprising: obtaining an input audio signal originated from at least one input sound producer; selecting at least one target sound producer from a dataset of candidate sound producers labelled with a plurality of domains, said selecting being based at least partially on a target domain of said plurality of domains and on the obtained input audio signal.
  • At least some embodiments of the present disclosure relate to a non-transitory computer readable medium having stored thereon instructions which when executed by a processor cause the processor to perform a method comprising: obtaining an input audio signal originated from at least one input sound producer; selecting at least one target sound producer from a dataset of candidate sound producers labelled with a plurality of domains, said selecting being based at least partially on a target domain of said plurality of domains and on the obtained input audio signal.

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a method including: - obtaining an input audio signal originated from an input sound producer; - selecting a target sound producer from a dataset of candidate sound producers labelled with a plurality of domains, the selecting being based at least partially on a target domain of the plurality of domains and on the obtained input audio signal. The present disclosure also relates to the corresponding apparatus, computer program product and medium.

Description

Systems and Methods for sound conversion
Introduction
The technical field of the one or more embodiments of the present disclosure is related to sound conversion, for instance Voice Conversion (VC). Sound conversion can still raise issues that it would be helpful to address.
Description
According to a first aspect, the present principles enable at least one of the above disadvantages to be resolved by proposing a method for performing a sound (or audio) conversion on an audio signal. The method can be performed, for instance, before and/or during an encoding of an audio signal, or of the audio part of an audiovisual signal, at a transmitter side, or after and/or during a decoding of an audio signal, or of the audio part of an audiovisual signal, at a receiver side.
According to another aspect, there is provided an apparatus. The apparatus comprises a processor. The processor can be configured to perform a sound conversion on an audio signal, or on the audio part of an audiovisual signal, by executing any of the aforementioned methods.
The apparatus can be comprised in and/or coupled to an encoder and/or a decoder, so as to be adapted to perform a sound conversion on the audio signal, or on the audio part of an audiovisual signal, before and/or during an encoding of the audio and/or audiovisual signal at a transmitter side, or after and/or during a decoding of the audio and/or audiovisual signal at a receiver side.
According to another general aspect of at least one embodiment, there is provided a device comprising an apparatus according to any of the decoding embodiments; and at least one of (i) an antenna configured to receive a signal, the signal including a video block, (ii) a band limiter configured to limit the received signal to a band of frequencies that includes the video block, or (iii) a display configured to display an output representative of a video block.
According to another general aspect of at least one embodiment, there is provided a non-transitory computer readable medium containing data content generated according to any of the described embodiments (for instance encoding embodiments) or variants.
According to another general aspect of at least one embodiment, there is provided a signal comprising audio and/or video data generated according to any of the described encoding embodiments or variants.
According to another general aspect of at least one embodiment, a bitstream is formatted to include data content generated according to any of the described encoding embodiments or variants. According to another general aspect of at least one embodiment, there is provided a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the described decoding embodiments or variants.
Brief description of the drawings
• Figure 1 shows a typical processor arrangement in which the described embodiments may be implemented;
• Figure 2 shows a generic, standard encoding scheme.
• Figure 3 shows a generic, standard decoding scheme.
• Figure 4 shows an overview of an audio conversion method of the present disclosure according to at least some embodiments;
• Figure 5 shows high-level block diagrams of an audio conversion according to at least some exemplary embodiments of the present disclosure, adapted to voice age conversion;
• Figure 6 represents a block diagram of an exemplary architecture of the sound conversion workflow according to at least some embodiments of the present disclosure.
• Figures 7A, 7B and 7C represent several implementations of a Target Speaker Selection model according to at least some embodiments of the present disclosure.
• Figures 8A and 8B represent exemplary implementations of a CycleGAN Voice Conversion (CGVC) according to at least some embodiments of the present disclosure.
• Figure 9 represents an exemplary End-to-end waveform Voice Conversion architecture according to at least some embodiments of the present disclosure.
• Figures 10A to 10C show a comparison between the encoding of four samples x for different encoder types.
• Figure 11 shows a comparison between stacked causal convolutions (top) and stacked dilated causal convolutions (bottom).
• Figure 12 shows a comparison between activated and bypassed Gradient Reversal Layer (GRL) in an experimental environment.
It is to be noted that the drawings illustrate example embodiments and that the embodiments of the present disclosure are not limited to the illustrated embodiments.
Detailed description
Audio (or sound) conversion can be of great interest for many technical fields. When applied to voices, audio conversion is often known as “speech conversion” or “voice conversion”. Voice Conversion can be seen as a set of techniques focusing on modifying para-linguistic information of a speech without modifying its explicit semantic content. For instance, words present in a sentence before a voice conversion of the sentence will still be present in the converted sentence, but the pronunciation of some words and the rhythm of the sentence can be modified by the voice conversion. Voice conversion can modify, for instance, at least some parameters of a speech and alter the perception of the speaker(s) of the speech by a third person. Voice conversion can be of interest in many end-user applications and/or in professional applications, like audiovisual production (for instance movie or documentary production). Voice conversion can be used, for instance, in order to anonymize voices, to transform a speaker’s voice into another person’s voice (which is known as “identity conversion”), to alter the gender or the accent of a speaker, and/or to give an emotional touch to a speech.
At least some embodiments of the present disclosure propose an audio conversion of an input audio signal based on at least one reference (or target) audio signal, the at least one reference (or target) audio signal being chosen at least partially automatically. In some embodiments, the at least one target audio signal can be selected automatically according to a classification criterion (for instance input to the system or obtained via communication and/or storage means). In some embodiments, several candidate audio signals can be selected automatically according to different classification criteria, the at least one target audio signal being further selected among the selected candidate audio signals.
At least some embodiments of the present disclosure can be used for voice conversion, for instance for aging or de-aging a voice. More precisely, at least some embodiments of the present disclosure relate to ways of making a voice sound naturally older or younger than it actually is, while preserving the voice speaker identity, so that one can consider that the speaker of the original and converted voice is the same person, but aged (or de-aged).
At least some embodiments of the present disclosure can, for instance, permit to de-correlate age from identity in a voice signal.
Aging and/or de-aging voices (also called hereinafter “sound age conversion”) can permit, for instance, to match a fictional character’s age when dubbing an animation movie, or to de-age an actor’s voice in order to re-record an old and heavily distorted recording of that actor made decades before.
As we age, the human body undergoes important physiological changes, such as calcification of cartilage, vocal atrophy, reduced mucosal wave, and reduced pulmonary pressure, lung volumes and elasticity. Some perceptual aspects of voice like hoarseness, breathiness or instability seem to be correlated with age; however, the manner and extent to which a human voice changes acoustically with age is not yet clearly defined.
Studies on the effects of age on voice have yielded different, sometimes contradictory, results.
At least some embodiments of the present disclosure propose a method 400 comprising: obtaining 420 an input audio signal originated from at least one input sound producer; and obtaining 440 at least one target sound producer based at least partially on the obtained input audio signal.
The method can further comprise, in at least some embodiments, applying 460 at least one audio conversion between the obtained input audio signal and at least one target audio signal associated with the at least one target sound producer.
In the present disclosure, the term “signal” is to be understood as any sequence of audio data, like an audio stream, an audio file, or an audio sample.
In at least some of the embodiments illustrated, the input, target and output audio signals are human voice signals. However, it is to be understood that other embodiments of the present disclosure can be applied to input, target and output audio signals being audio sounds of at least one other type. The at least one other type of sounds can include sounds originating from at least one living being other than a human being, for instance sounds made by a kind and/or breed of animals (like sounds originating from dogs, or sounds originating from Briard breed dogs), and/or sounds originating from a non-living being, for instance a mechanical or electronic object (like a motor vehicle for instance).
For simplicity, the sound producer of the input, target and/or output audio signal(s) will also be referred to hereinafter as the input, target and/or output “speaker”, even if, as explicitly explained above, the present disclosure can find applications for many kinds of sound producers, including human or non-human sound producers.
Depending upon embodiments, the input audio signal can be obtained via an input user interface like a microphone, from a storage medium, or via a communication interface for instance.
Depending upon embodiments, the output audio signal can be rendered via an output user interface like a speaker, on a storage medium, or transmitted via a communication interface for instance.
The input, target and/or output audio signal can be an audio component of an input, target and/or output multimedia content.
The obtaining of the at least one target sound producer can be performed differently depending upon embodiments. For instance, in at least some embodiments, at least one identifier of the at least one target sound producer, or the at least one corresponding target audio signal, can be obtained by using a random pick inside a sound producers dataset and/or an audio signals dataset associating audio signals with sound producers of the sound producers dataset. The sound producers dataset can store information regarding at least one sound producer (used as a reference), like an identifier of the sound producer, at least one indication regarding at least one classification (like at least one classifier), and/or at least one indication adapted to determine a belonging of the sound producer to a class. Of course, depending upon embodiments, the sound producers dataset and the audio signals dataset can be implemented by separate storage units or can be gathered in a single storage unit. For simplicity, the sound producers dataset and the audio signals dataset will be commonly referred to hereinafter as a dataset gathering both the sound producers and audio signals datasets. The sound producers of the sound producers dataset can be of a same type or of heterogeneous types. Hereinafter, the present disclosure will often simply refer to “sound producer(s)” (or “speaker(s)”) and “sound producer(s) id” (or “speaker(s) id”), instead of “sound producer identifier(s)” (or “speaker identifier(s)”), for simplicity.
At least some embodiments of the present disclosure propose to obtain a target sound producer via the use of machine learning techniques, for instance deep learning models (like the Target Speaker Selection models that are detailed more deeply hereinafter). At least some embodiments of the present disclosure propose to perform voice conversion of the input audio signal based at least partially on machine learning techniques, for instance deep learning models. Indeed, deep learning techniques can be helpful for finding hidden structure inside data, like the characteristics that make age in voices. Figure 5 illustrates some block diagrams of an exemplary embodiment of the present disclosure that can be used for a voice age conversion workflow 500, using a Target Speaker Selection (TSS) block 510 (based on a Target Speaker Selection model) and a voice conversion block 520. Both the TSS block and the voice conversion (VC) block are illustrated both as “black boxes” and as exemplary “white boxes” more detailed than the black boxes. The same numeral reference is associated with the black and white boxes of a same block.
The method 400, presented in figure 4, is detailed hereafter in connection with figures 4 and 5.
In at least some embodiments of the present disclosure, the method 400 can comprise obtaining 420 an input audio signal (AC1) and extracting 442 acoustic descriptors from the obtained input audio signal (AC1). For instance, when the input audio signal is a speech (or human voice), the extracted acoustic descriptors can include Mel-Frequency Cepstral Coefficients (MFCCs) of a speech envelope of the raw waveform of the input audio signal. Indeed, when applied to sound age conversion, the Target Speaker Selection model of at least some embodiments of the present disclosure is intended to be age-agnostic (i.e., its classification abilities should not rely, or should only lightly rely, on aspects of speech which are affected by age). The use of MFCCs (instead of another time-frequency representation or the raw waveform), for instance only low-order MFCCs, can permit to discard pitch information, thus giving the model no possibility to extract parameters related to shimmer or jitter for instance. The acoustic descriptor extraction can be performed via a vocoder, like the WORLD vocoder for instance.
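As a purely illustrative sketch (the use of the librosa library, the sampling rate and the number of coefficients are assumptions and not part of the described embodiments, which mention the WORLD vocoder), low-order MFCCs can be extracted from a raw waveform as follows:

```python
# Minimal sketch: extract low-order MFCCs (assumption: librosa; any extractor would do).
import librosa
import numpy as np

def extract_low_order_mfcc(path, sr=16000, n_mfcc=13):
    """Load a waveform and return low-order MFCCs, which discard pitch-related detail."""
    y, sr = librosa.load(path, sr=sr)                         # raw waveform, resampled
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape: (n_mfcc, frames)
    return mfcc.astype(np.float32)
```

Keeping only a small number of coefficients is one simple way to obtain the pitch-agnostic descriptors mentioned above.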
In at least some embodiments, like in the illustrated embodiments of figure 5, obtaining 440 at least one target speaker can comprise selecting 444 at least one speaker by using a Target Speaker Selection (TSS) model, adapted to provide at least one target speaker by “matching” the acoustic descriptors extracted from the input audio signal with the acoustic descriptors of at least some audio signals of the at least one dataset. The matched audio signals are called herein “target” audio signals. For example, a matching can be performed based on a proximity criterion between at least some of the input audio signal acoustic descriptors and at least some of the acoustic descriptors of at least some of the dataset audio signals.
In the exemplary embodiments illustrated, the one or more datasets comprise audio signals associated with a speaker identifier and, optionally, an association of the audio signals and/or the speaker identifier with at least one identifier of at least one label.
The set of labels or classifiers can differ upon the embodiments of the present disclosure. In the exemplary embodiments illustrated, the one or more datasets comprise audio signals associated with at least one age class.
The table “Table 1” hereinafter gives some exemplary age classes, adapted to a human voice dataset, where each age class (or domain) is identified by a class name (or label) and can be associated with voice signals of speakers belonging to some exemplary age range.

Class name  | Age range
Child       | from 0 to 10 years old
Teen        | from 10 to 18 years old
Young adult | from 18 to 30 years old
Adult       | from 30 to 50 years old
Senior      | from 50 to 70 years old
Elder       | beyond 70 years old

Table 1: first example of age classes
Of course, the age classes indicated in “Table 1” have only an illustrative purpose and can differ upon embodiments. For instance, the number of classes and/or the class names can be different, and/or the age ranges can vary. An age class can also comprise additional information, like at least one information relating to a type of individual (kind of animal, human, ...), a gender and/or an accent, a cultural indication, an educational level, an indication regarding a disease known as having an impact on voice, and so on. Depending upon embodiments, the age ranges can be disjoint (like when clustering some elements) or at least some of the age ranges can present some intersections. (Cases with non-disjoint ranges can be considered as cases with disjoint ranges, as ranges with intersections can be split to produce smaller disjoint ranges.) Also, a same set of voice signals can be classified according to several sets of age ranges. For instance, a same set of voice signals can be classified both according to the age ranges of Table 1 and according to the age ranges of “Table 2” detailed hereinafter.

Class name | Age range
Baby       | from 0 to 2 and a half years old
Child      | from 2 and a half to 12 years old
Teenager   | from 12 to 18 years old
Adult      | from 18 to 65 years old
Senior     | beyond 65 years old

Table 2: second example of age classes
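As a purely illustrative sketch (the helper below simply restates the ranges of Table 2 and is not part of any described embodiment), assigning a disjoint age class to a speaker, and re-labelling the same speakers against another table of ranges, can be done as follows:

```python
# Hypothetical helper: maps a speaker age (in years) to the class names of Table 2.
TABLE_2 = [("Baby", 0, 2.5), ("Child", 2.5, 12), ("Teenager", 12, 18),
           ("Adult", 18, 65), ("Senior", 65, float("inf"))]

def age_class(age_years, table=TABLE_2):
    """Return the name of the (disjoint) age range containing age_years."""
    for name, low, high in table:
        if low <= age_years < high:
            return name
    raise ValueError("age must be non-negative")

# The same speakers can be re-labelled against another table (e.g. Table 1)
# simply by passing a different list of (name, low, high) ranges.
```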
According to at least one embodiment of the present disclosure, the obtaining 440 of at least one target audio signal can be performed based on Machine Learning techniques, like Artificial Neural Networks (ANNs) (or Deep Neural Networks (DNNs)), which have shown state-of-the-art performance in a variety of domains such as computer vision, speech recognition, natural language processing, etc.
An Artificial Neural Network (ANN) is a mathematical model trying to reproduce the architecture of the human brain, and is thus composed of interconnected neurons, each of them having multiple inputs and one output. Given a specific combination of inputs, each neuron may activate itself, producing an output that is passed to the next layer of neurons. Formally, the activation of a neuron is written as follows:

$$y = f\left(\sum_{i} w_i x_i + b\right),$$

where $x_i$ and $y$ are respectively the inputs and the output of the neuron, $w_i$ is the i-th weight associated with the i-th input, $b$ is a bias and $f$ is an activation function. We can define a neural layer by stacking multiple neurons in parallel, therefore yielding the following formulation:

$$y = f(Wx + b).$$

This formulation allows us to define a neural layer as a biased matrix product followed by an element-wise non-linear transformation, also called a Linear Layer.
Now that we have defined what a neural layer is, we can simply describe an artificial neural network as a series of neural layers, each layer sending its outputs to the following layer’s inputs. The activation function f must be non-linear, otherwise the network can be reduced to a simple biased matrix operation.
Let $f_\theta$ be the function defined by an artificial neural network whose set of parameters (weights and biases) is denoted $\theta$, mapping an input $x \in X$ into an output $y \in Y$. The goal of our artificial network is to make it approximate the function $f$. This can be done by defining a loss function $L$, and then optimizing $\theta$ to obtain an optimal artificial network $f_{\theta^*}$.
The loss function can thus help assessing the model and, as a consequence, can be helpful to improve (or, in simpler words, “optimize”) the model.
The optimization of the model is often performed during a training (or learning) phase, involving annotated (or labelled) signals that permit to check the consistency of a classification. The performance of a model can thus heavily depend on the training (or learning step) of the network. Often, the more accurate and numerous the training data are, the more reliable the trained network is.
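As a purely illustrative sketch of the generic training procedure described above (the use of PyTorch, the layer sizes and the choice of loss are assumptions, not elements of the described embodiments), a small network $f_\theta$ can be trained by gradient descent on a loss $L$ over labelled examples as follows:

```python
import torch
import torch.nn as nn

# A network f_theta: a series of linear layers y = f(Wx + b) with non-linear activations.
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 10),                       # e.g. 10 output classes
)
loss_fn = nn.CrossEntropyLoss()              # the loss function L
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def training_step(x, y):
    """One gradient-descent update of the parameters theta on a labelled batch (x, y)."""
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                          # gradients of L w.r.t. theta
    opt.step()                               # theta <- theta - lr * grad
    return loss.item()

# Example with random data (illustration only).
x = torch.randn(8, 128)
y = torch.randint(0, 10, (8,))
print(training_step(x, y))
```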
However, in some domains, like in the exemplary domain of sound age conversion, very few parallel audio signals (like speech utterances of two different speakers saying the same sentence) exist for multiple domains (such as the separate age classes “young” and “old” for instance). Furthermore, the very few existing parallel audio signals can be too short to be used. Also, for voice age conversion for instance, there is no satisfying dataset of a single speaker speaking at different ages. Indeed, audio acquisition techniques have evolved over the years and, furthermore, the preservation of audio signals stored on a storage medium over several decades can be uncertain. As a consequence, the audio quality of a signal captured nowadays can be very different from, for instance, the quality of an audio signal captured in the 1970s-1980s.
It can thus be very difficult to obtain a ground truth (that is to say, parallel signals of a same speaker at different ages, in the exemplary case of voice age conversion).
The present disclosure thus proposes, in at least some of its embodiments, to use a TSS model that does not require parallel datasets for training.
Figure 6 illustrates an exemplary Target Speaker Selection (TSS) model 510, compatible with the embodiments of figure 5, that can be used for obtaining at least one target speaker and/or the corresponding target audio signal. The exemplary Target Speaker Selection architecture of figure 6 can be used, according to at least some embodiments of the present disclosure, for aging or de-aging a voice. The TSS model of figure 6 yields, for instance, the identifier of at least one target speaker from the labelled “young” / “old” elements of the dataset(s). When used for aging purposes, for instance, the goal of the TSS model is to find the speaker(s) from the “old” dataset whose voices most closely match the voices of the speakers from the “young” dataset, no matter how old those voices are.
The exemplary target audio signal selection model of figure 6 comprises at least three blocks: a feature extraction block 620 (or feature extractor) comprising at least one feature extraction model (noted as TSS1 in figure 5), a domain confusion block 640 comprising at least one domain confusion model (noted as TSS3 in figure 5) and a classification block 660 (or identifier classifier) comprising at least one classification model (noted as TSS4 in figure 5). The feature extraction model of the feature extraction block 620 can reduce the dimension of the acoustic descriptors of the input audio signal that are provided to the TSS model 510, while keeping the information of the acoustic descriptors that is needed later for the processing. For instance, high-dimensional acoustic descriptors (like Mel-Frequency Cepstral Coefficients) of the input audio signal can be reduced to a simpler vector (TSS2). As an example, the feature extraction model can reduce a 3072-dimensional vector to a 128-dimensional vector. The feature extraction block 620 outputs at least one vector z which is then fed to the classification block 660 in order to predict an identifier of a speaker (a digit number for instance).
The method of the present disclosure comprises training the TSS model (not specifically illustrated in figure 4) and then using the trained TSS model on the input audio signal.
During the training of the TSS model, the at least one feature extraction model is made age-agnostic by using the at least one domain confusion model (TSS3) of the domain confusion block 640. More precisely, during training, audio descriptors of training audio signals are input to the feature extraction block, whose output is provided both to the classifier and to the domain confusion model, which predicts the class (the age class for instance) of a training audio signal.
Once the TSS is trained, acoustic descriptors of the input audio signal are input to the feature extraction block, whose output is provided to the classifier block. The providing of the output to the domain confusion model is only optional and can be omitted in some embodiments.
The training dataset can be split into several sub-datasets, comprising at least one source dataset and one target dataset, each sub-dataset comprising audio streams corresponding to at least one value of the set of classifiers and each classifier corresponding to at least one sub-dataset. For example, there can be one source dataset for the age class “young” and one target dataset for the age class “old” for aging purposes, and vice versa for de-aging. In some embodiments, during the training, the TSS model can take fixed-length MFCCs of signals from both the source and target datasets as input.
If the signal comes from the target dataset, the TSS model is trained to determine the speaker id and the age class. If the signal comes from the source dataset, the model only tries to determine the age class.
Domain Confusion (or Domain Adaptation) techniques are presented in more detail in connection with some experimental results later. At least some embodiments of the present disclosure can use a method embedding domain adaptation in the representation learning, as it allows a classification based on a representation that is both discriminative and domain-invariant. In the exemplary embodiments of figures 5 and 6, the TSS model is trained to classify speakers from the dataset.
From the TSS model we can define two losses: a classification loss Lcl and a domain confusion loss Ldc, the first one measuring how well the TSS model predicts the speaker identifiers, the latter measuring how well the TSS model predicts the domain (for instance the age class) of the input signal.
As we want the feature extraction to be domain-agnostic (for instance, in the at least one exemplary embodiment of figure 6, to be independent of the age classes), we want to minimize the classification loss while maximizing the domain confusion loss.
We can write the following weight update rule:

$$\theta_{fe} \leftarrow \theta_{fe} - \mu \left( \frac{\partial L_{cl}}{\partial \theta_{fe}} - \lambda \frac{\partial L_{dc}}{\partial \theta_{fe}} \right), \qquad
\theta_{cl} \leftarrow \theta_{cl} - \mu \frac{\partial L_{cl}}{\partial \theta_{cl}}, \qquad
\theta_{dc} \leftarrow \theta_{dc} - \mu \frac{\partial L_{dc}}{\partial \theta_{dc}},$$

where $\theta_{fe}$, $\theta_{cl}$, $\theta_{dc}$ are respectively the weights associated with the feature extraction model, the classifier and the domain confusion model, $\mu$ is the learning rate, and $\lambda$ is a scalar defining a trade-off between minimizing the classification loss and maximizing the domain confusion loss.
This update rule is very similar to a standard gradient descent rule, and can be implemented directly by using a Gradient Reversal Layer (GRL), which is a pseudo-function defined by

$$R(x) = x, \qquad \frac{dR}{dx} = -I,$$

where $I$ is the identity matrix.
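By way of illustration only, a Gradient Reversal Layer can be sketched as follows (PyTorch autograd is an assumption; only the identity forward pass and the sign-flipped, λ-scaled backward pass are essential to the description above):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; gradient multiplied by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # dR/dx = -lambda * I : reverse (and scale) the incoming gradient.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage: features -> grad_reverse(features) -> domain confusion model.
# Minimizing the domain loss downstream then maximizes it w.r.t. the feature extractor,
# which is exactly the trade-off expressed by the update rule above.
```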
According to figure 6, once the target speaker selection model is trained, the features (TSS2) extracted by the feature extraction model from an input audio signal can be classified using the classification model (TSS4) in order to yield at least one matching (or target) speaker identity.
The Target Speaker Selection model of figure 6 is a Convolutional Neural Network (CNN) that can take fixed-length MFCCs as input. In the exemplary use case of figure 6, during training, we have two sub-datasets (young / old). Thus, we need to define a source dataset and a target dataset, the target dataset containing the speaker(s) that will eventually be chosen by the TSS model as the voice conversion system’s target.
The TSS model can be trained differently upon embodiments. For instance, the training process of the TSS model can depend on the dataset (source or target) from which the speech signals are taken. If a signal comes from the dataset used as the target dataset, the TSS model is trained to determine the speaker id and to determine from which dataset the signal is coming. If the signal comes from the source dataset, the model only tries to determine from which dataset the signal is coming. According to at least some embodiments of the present disclosure, a Gradient Reversal Layer can also be applied after the feature space (e.g., just after the feature space) in order to achieve Domain Confusion between both datasets and thus decrease the impact of the domain on the extracted features output to the classifier. The intent is to obtain extracted features that are invariant, or almost invariant, with age.
Dotted boxes with an “x n” in figure 6 are used for indicating that a group of units can be sequentially repeated n times. The F_1 symbol describes a Gradient Reversal Layer and is used, during training, to achieve domain confusion between the source and target datasets used for training.
The feature extraction model of the feature extraction block 620 can comprise a group of functions (or units) comprising at least one Convolutional Layer (CONV), at least one “batch norm” unit, and/or at least one Rectified Linear Unit (ReLU). The group of functions can be repeated sequentially several times (like 1, 3 or 5 times). The convolutional unit (CONV) outputs data to the batch norm unit, which in turn outputs data to the ReLU.
The classification model of the classification block 660 and the domain confusion model of the domain confusion block 640 can each comprise a group of functions comprising at least one Linear Layer (LIN) unit that outputs data to at least one Rectified Linear Unit (ReLU) function. The group of functions can be repeated sequentially several times (like 1, 3, 5, 6, 7, or 9 times). In the classification and/or domain confusion model, the output of the last ReLU function can be input to at least one “SoftMax” function.
The output of the classification block is a speaker id, while the output of the domain confusion block is a speaker age in the exemplary embodiments of figure 6.
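To make the block structure described above concrete, the following sketch assembles a hypothetical TSS network (the layer sizes, repetition counts and the use of PyTorch are assumptions; only the CONV / batch norm / ReLU feature extractor and the two LIN / ReLU / SoftMax heads mirror the description):

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # CONV -> batch norm -> ReLU, repeated "x n" times in the feature extractor.
    return nn.Sequential(nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
                         nn.BatchNorm1d(c_out), nn.ReLU())

def head(dim_in, n_out, n_layers=3):
    # (LIN -> ReLU) x n, followed by a SoftMax, for the classifier / domain confusion blocks.
    layers = []
    for _ in range(n_layers):
        layers += [nn.Linear(dim_in, dim_in), nn.ReLU()]
    layers += [nn.Linear(dim_in, n_out), nn.Softmax(dim=-1)]
    return nn.Sequential(*layers)

class TSS(nn.Module):
    def __init__(self, n_mfcc=13, n_speakers=50, n_domains=2, feat_dim=128):
        super().__init__()
        self.extractor = nn.Sequential(conv_block(n_mfcc, 64), conv_block(64, 64),
                                       nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                       nn.Linear(64, feat_dim))
        self.classifier = head(feat_dim, n_speakers)      # outputs a speaker id
        self.domain_head = head(feat_dim, n_domains)      # outputs a domain (e.g. age class)

    def forward(self, mfcc):                              # mfcc: (batch, n_mfcc, frames)
        z = self.extractor(mfcc)                          # reduced feature vector z
        # During training, z would first pass through the Gradient Reversal Layer
        # (see the grad_reverse sketch earlier) before the domain confusion head.
        return self.classifier(z), self.domain_head(z)
```

The number of speakers, domains and the 128-dimensional feature size are placeholders chosen only to echo the dimensions mentioned in the description.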
Figures 7A, 7B and 7C present exemplary implementations and/or variants of the exemplary Target Speaker Selection architecture of figure 6.
The exemplary TSS model of figure 7A is a speaker classifier that permits classifying voice signals of different domains (herein age classes: “young” and “old”). According to figure 7A, the model is trained to classify which speaker from the domain “young” is speaking, while the domain confusion model ensures a domain-agnostic feature extraction.
A voice signal of the dataset is associated with an identifier of a speaker of a known age (so as to deduce an associated age class) or is labelled with a known age class (like “young” or “old”). The classifier is made age-agnostic during training (as explained hereinbefore). It basically learns to classify voices of speakers belonging to a single domain (“young” as an example) from a dataset of voice signals of both domains. The use of a speaker age identification model alongside a gradient reversal layer can help ensure that the network only uses information shared between young and old speakers, and thus helps the identification of the speakers to be more age-agnostic.
Once the model has been trained, the obtaining of at least one target audio signal can comprise feeding the network with signals from an “old” speaker, and selecting, as the at least one target audio signal, the audio signal associated with one or more “young” speakers yielded by the model as being the closest young speaker(s) to the input “old” speaker.
In the exemplary embodiments of figures 6 and 7A, the TSS model is applied between two values (thus inside a 2-domain setting). However, according to other embodiments, the TSS model can be an n-domain model with n classifiers, where every dataset associated with a classifier can be a target dataset, as illustrated by figure 7B. In the exemplary embodiment of figure 7B, a domain (or classifier) can be an age class (for instance one of the age classes of Table 1 or Table 2 already introduced), thus allowing a more precise tuning of the age conversion. In such an embodiment, the method can comprise obtaining, in addition to an input audio signal, a desired age class of the output audio signal, or several audio signals can be output by voice converting several target audio signals corresponding to different domains. The output can further comprise a probability of an output audio signal. The probability of an output audio signal can be based, for instance, on a probability of a corresponding target speaker in its domain.
The training of the TSS model of the exemplary embodiment of figure 7B can be similar to the training of the TSS model of figures 6 and 7A, with however a training of each classifier to classify its corresponding dataset, and a training of the domain confusion model of the TSS model so as to find the dataset domain from which a training input signal is coming.
According to at least one of the embodiments, a new target domain can be added after the training of the TSS model, without training the whole TSS model again (and thus without using the domain confusion model together with the classifiers). For instance, as illustrated by figure 7C, a mapping between a source speaker and at least one speaker of the new target domain can be inferred once the TSS model is trained, by simply projecting the source and target speaker identifiers and the speaker identifiers of the new target domain onto a feature space and performing a nearest neighbor search.
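A nearest neighbor search of the kind mentioned above can be sketched as follows (numpy and the Euclidean distance are assumptions; any distance over the feature space would fit the description):

```python
import numpy as np

def nearest_target_speaker(source_embedding, target_embeddings):
    """Return the index of the new-domain speaker whose projected embedding is
    closest to the source speaker embedding (Euclidean distance in the feature space z)."""
    dists = np.linalg.norm(target_embeddings - source_embedding[None, :], axis=1)
    return int(np.argmin(dists))

# source_embedding:  (feat_dim,) projection of the source speaker identifier
# target_embeddings: (n_new_domain_speakers, feat_dim) projections of the new domain's speakers
```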
According to the exemplary embodiment of figure 4, the method can comprise performing 460 a voice conversion between the input audio signal and at least one target audio signal associated with the obtained at least one target speaker.
More precisely, a raw waveform of the input audio signal can be fed to the voice conversion block (AC3) alongside the at least one obtained target audio signal associated with the obtained at least one target speaker. The audio signal(s) output by the voice conversion can be a converted audio signal, similar to the ones produced by the at least one target speaker of the obtained target audio signal(s) (AC4). Different voice conversion techniques can be used depending upon embodiments. Some embodiments can be based on a voice conversion technique adapted to perform voice identity conversion, for instance a Generative Adversarial Network (GAN) based conversion using a vocoder like WORLD, such as the CycleGAN Voice Conversion model illustrated by figures 8A and 8B. Some embodiments of the present disclosure can use an end-to-end waveform Auto-Encoder based conversion, like the one illustrated by figure 9.
A CycleGAN model may need to be trained for each input audio signal, which requires powerful processing means and a long processing time, but it can help obtain a very intelligible output voice. Unlike a CycleGAN model, a VQVAE only needs to be trained once. Thus, embodiments using a VQVAE can be better adapted for implementation in a device with limited processing capabilities and/or in applications requiring a limited processing time.
In a general aspect, a Generative Adversarial Network (GAN) model is based on two models, a generator and a discriminator. The generator tries to capture the data distribution, while the discriminator tries to estimate the probability that a signal came from the training data rather than from the generator.
In the exemplary embodiments of figures 8A and 8B, the voice conversion performed for obtaining 460 the output audio signal(s) from the target audio signal can involve a Generative Adversarial Network (GAN) based conversion that permits achieving domain-to-domain translation by learning a mapping between two domains in the absence of paired examples, like a Cycle Generative Adversarial Network applied to Voice Conversion (CycleGAN-VC) model.
According to the exemplary embodiments of figures 8A and 8B, for instance, the CycleGAN-VC model is adapted to achieve domain translations for voices from different speakers (translating, for instance, an input audio signal comprising the voice of an input speaker into at least one output audio signal comprising the same content but with the voice of one of the at least one target speakers).
The exemplary CycleGAN Voice Conversion of figures 8A and 8B comprises a generator and a discriminator. The generator tries to convert audio from one speaker into audio from another (target) speaker, while the discriminator tries to find from which speaker the converted speech is coming. The losses involved during the training are the following: a Cycle Consistency Loss, making sure the model keeps contextual information (like phonemes and/or words), and an adversarial loss penalizing the generator if it outputs audio that does not correspond to the target speaker. The CycleGAN Voice Conversion can be based on different audio features depending upon embodiments. In the exemplary embodiment of figure 5, it can be based on audio features extracted with a WORLD analysis-synthesis pipeline, as illustrated by figure 8C. In some embodiments, the method can perform a voice conversion based on an End-to-end waveform Voice Conversion architecture. The End-to-end waveform Voice Conversion architecture can vary upon embodiments. In the exemplary embodiment of figure 9, for instance, the End-to-end waveform Voice Conversion architecture can be a voice conversion comprising a Vector-Quantized Variational Auto-Encoder (VQVAE) using a simple waveform encoder alongside an autoregressive WaveNet decoder.
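As an illustration of the two training losses named above (cycle consistency and adversarial), the sketch below computes them for hypothetical generators G_xy and G_yx and a discriminator D_y; these names and the use of PyTorch are assumptions, not the exact networks of figures 8A and 8B:

```python
import torch
import torch.nn.functional as F

def cyclegan_generator_losses(x, G_xy, G_yx, D_y):
    """x: features of the source speaker.
    G_xy: source -> target generator, G_yx: target -> source generator,
    D_y: discriminator outputting the probability that its input is real target audio."""
    fake_y = G_xy(x)
    pred = D_y(fake_y)
    # Adversarial loss: the generator is penalized when D_y flags its output as not
    # coming from the target speaker.
    adv_loss = F.binary_cross_entropy(pred, torch.ones_like(pred))
    # Cycle-consistency loss: converting to the target domain and back should preserve
    # the contextual information (phonemes, words).
    cycle_loss = F.l1_loss(G_yx(fake_y), x)
    return adv_loss, cycle_loss
```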
An autoencoder can be seen as a neural network aiming at copying its input to its output. We can split this network into two segments, referred to herein as the encoder q and the decoder p. The encoder q maps the input data x to a point in a latent space Z, which is then mapped back to the original input by the decoder p. Both q and p can be neural networks trained by minimizing the following objective

$$L(q, p) = \lVert x - (p \circ q)(x) \rVert^2,$$

in order to approximate the optimal encoder $q^*$ and decoder $p^*$:

$$q^*, p^* = \underset{q, p}{\arg\min}\; L(q, p).$$
The point of such a network can be confusing, as it looks like it is just copying its input. It actually becomes interesting when the latent point z = q(x) is of smaller dimensionality than the input, as it forces the network to extract a compact representation of the input in order to get good reconstructions.
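A minimal bottleneck autoencoder of this kind could be sketched as follows (layer sizes are illustrative assumptions; the essential point is that the latent dimension is smaller than the input dimension):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, dim_in=256, dim_z=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, 64), nn.ReLU(), nn.Linear(64, dim_z))
        self.decoder = nn.Sequential(nn.Linear(dim_z, 64), nn.ReLU(), nn.Linear(64, dim_in))

    def forward(self, x):
        z = self.encoder(x)          # compact latent point z = q(x), dim_z << dim_in
        return self.decoder(z)       # reconstruction (p o q)(x)

x = torch.randn(4, 256)
model = AutoEncoder()
loss = ((x - model(x)) ** 2).mean() # L(q, p) = || x - (p o q)(x) ||^2
```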
One possible use case of autoencoders is manipulating high-level attributes through latent vector manipulations (translation, rotation, ...). However, in the “classic” autoencoder, nothing encourages the encoder to produce a contiguous or compact latent space. Efficient latent translation might thus not always be achieved, as some points in the latent space may have never been seen by the decoder.
In order to solve this issue, the encoding / decoding process of an autoencoder can be modified according to a probabilistic point of view.
Let X = {x^(i)} be a dataset of N independent and identically distributed continuous variables x generated by a random process involving a random variable z. The generation process is as follows:
- A value z^(i) is sampled from a prior distribution pθ(z);
- A value x^(i) is generated by a conditional distribution pθ(x | z), where pθ stands for a parametric distribution p with true parameters θ.
We are interested in finding θ so that we can achieve approximate posterior inference of z given x and approximate generation of x given z. Given that the posterior distribution pθ(z | x) is intractable, we need to introduce a recognition model qφ(z | x) as an approximation of pθ(z | x). We can now refer to the recognition model qφ(z | x) as a probabilistic encoder, and to pθ(x | z) as a probabilistic decoder.
This stochastic re-writing of an autoencoder leads to a Variational Auto-Encoder (VAE). VAEs can still be composed of neural networks, except that both the encoder and the decoder yield parameters defining a probability distribution instead of a latent point / a reconstruction. This reflects the fact that, given a latent vector z and a sufficiently small ε, pθ(x | z) should be close to pθ(x | z + ε): we therefore get something close to a contiguous space. The compactness of such a latent space can then be obtained by applying some constraints on the prior pθ(z), like a Kullback-Leibler divergence or a Maximum Mean Discrepancy, as shown in figures 10A to 10C.
Figures 10A, 10B and 10C show a comparison between the encodings of four samples x for different encoder types. Figure 10A illustrates a classic autoencoder, neither contiguous nor compact, allowing very good straight-through reconstruction but without any latent manipulation. Figure 10B illustrates a variational autoencoder, whose latent space is contiguous but filled with holes where data has never been encoded / decoded. In this scenario, an interpolation between the upper and the lower area of the latent space is impossible. Figure 10C illustrates a regularized variational autoencoder, where the prior pθ(z) is forced to match a manually defined distribution. This space is contiguous and compact, allowing smooth interpolation between points.
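For illustration purposes only, a variational encoder / decoder with a Kullback-Leibler constraint on the prior could be sketched as follows (dimensions and architecture are arbitrary examples):

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, in_dim=512, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 128)
        self.mu = nn.Linear(128, latent_dim)       # the encoder outputs distribution parameters ...
        self.logvar = nn.Linear(128, latent_dim)   # ... instead of a single latent point
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(z), mu, logvar

x = torch.randn(8, 512)
recon, mu, logvar = VAE()(x)
rec_loss = ((x - recon) ** 2).mean()
kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # pushes the posterior towards N(0, I)
loss = rec_loss + kl_loss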
When using Deep Learning based models to analyze / generate audio, time-frequency representations can provide a compact bi-dimensional representation of audio that can be used as is in image-based deep learning architectures. Latent representations of spectrograms can also be used. The phase information in a spectrogram can be crucial to reconstruct audio, but it is really unstructured, which makes it difficult to learn when trying to autoencode it. Hence, most of the time the phase is discarded, which however makes the reconstruction process difficult. Some approaches can be used to invert such spectrograms, like the Griffin-Lim algorithm, which iteratively estimates the missing phase, or the Multi-Head Convolutional Neural Network, which takes an amplitude spectrogram as input and yields an audio waveform. Both of those approaches allow the reconstruction of audio.
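As a non-limiting example, phase reconstruction with the Griffin-Lim algorithm can be performed with the librosa library as follows, where the file name and the STFT parameters are placeholders:

import librosa

y, sr = librosa.load("speech.wav", sr=16000)
mag = abs(librosa.stft(y, n_fft=1024, hop_length=256))       # amplitude spectrogram, phase discarded
y_rec = librosa.griffinlim(mag, n_iter=60, hop_length=256)   # phase iteratively re-estimated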
The use of a generative model for raw audio can help obtain a better audio quality than some of the above approaches. With a generative model, the joint probability of a waveform x = {x_1, ..., x_T} can be modelled as a product of conditional probabilities as follows:
p(x) = ∏ (t = 1 to T) p(x_t | x_1, ..., x_{t-1})
Hence, each audio sample x_t can be conditioned on all the previous timesteps. As generating a sample given all the previous ones would be impossible, both from a memory and from a computation-time point of view, only the T previous samples are taken into account to generate the next one. In order to build a neural network with a large enough receptive field, a type of convolution called dilated convolution can be used.
As an example, stacked non-dilated causal convolutions can yield a receptive field of size 5 for a stack of 5 convolutions, whereas stacked dilated causal convolutions yield a receptive field of size 16 for the same number of convolutions.
Dilated convolutions can allow the overall network to have a much larger receptive field, and hence allow the capture of much longer time dependencies. In order not to violate the ordering in which the data is modelled, all inputs are padded so as to obtain causal convolutions: every sample only relies on the previous ones, as stated in the above equation.
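A dilated causal convolution can for instance be sketched as follows, where the input is padded on the left only so that every output sample depends on the current and past samples; channel sizes are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # left padding keeps the convolution causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                  # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

x = torch.randn(1, 32, 16000)
y = CausalDilatedConv1d(32, dilation=8)(x)                 # same time length, larger receptive field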
Depending upon embodiments of the present disclosure, different variants of WaveNet can be used. For instance, some embodiments use a DeepVoice variant, like the exemplary DeepVoice variant whose architecture is shown in figure 11. It comprises an initial convolution layer followed by several stacked residual blocks, and finally two rectified convolution layers and a SoftMax. Each residual block is composed of a dilated convolution, whose dilation factor d is determined by the following equation: d = 2^(i mod c), where i stands for the block number and c is the cycle size. During our experiments, we have chosen to use 30 residual blocks with a cycle size of 10.
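The resulting dilation schedule for these illustrative values (30 blocks, cycle size 10, kernel size 2) can be computed as follows:

blocks, cycle, kernel = 30, 10, 2
dilations = [2 ** (i % cycle) for i in range(blocks)]            # d = 2^(i mod c)
# -> [1, 2, 4, ..., 512] repeated three times
receptive_field = 1 + sum((kernel - 1) * d for d in dilations)   # samples seen by the last layer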
According to figure 12, T samples are fed to the network alongside a local condition (spectrograms, latent space, ...) and a global condition (speaker embedding). The model outputs the predicted next sample given the T previous ones. The process can be repeated over this prediction in order to generate an audio signal.
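A hypothetical generation loop following this process could be sketched as follows, where the model object, the conditions and the 256-way output distribution are assumptions made for the example:

import torch

def generate(model, seed, local_cond, global_cond, n_samples, T):
    samples = list(seed)                                   # the most recent samples
    for _ in range(n_samples):
        context = torch.tensor(samples[-T:]).unsqueeze(0)  # the T previous samples
        probs = model(context, local_cond, global_cond)    # probabilities over 256 quantization levels
        nxt = torch.multinomial(probs.squeeze(0), 1).item()
        samples.append(nxt)                                # the prediction is fed back to the model
    return samples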
Before it is fed to the model, the audio waveform is first transformed using a μ-law companding transform and then encoded into a 256-dimensional one-hot vector. The resulting audio is an 8-bit μ-law signal sampled at 16 kHz. In a variant, each audio sample can be modeled using a discrete mixture of logistics alongside a higher sampling rate in order to get a higher audio quality, eventually resulting in a longer training time.
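For illustration, the standard 8-bit μ-law companding and one-hot encoding of a waveform normalized to [-1, 1] can be written as follows:

import numpy as np

def mu_law_encode(x, mu=255):
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # mu-law companding to [-1, 1]
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)           # quantization to 256 levels

x = np.random.uniform(-1, 1, 16000)      # one second of dummy audio at 16 kHz
codes = mu_law_encode(x)                 # integers in [0, 255]
one_hot = np.eye(256)[codes]             # 256-dimensional one-hot vectors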
Some embodiments of the present disclosure can use another model generating raw audio than WaveNet. For instance, some embodiments of the present disclosure can use a parallel version of WaveNet taking advantage of Inverse Autoregressive Flows in order to provide a generative model with real-time abilities. Some embodiments of the present disclosure can use a recurrent version of WaveNet called WaveRNN; because its core features are inference speed and reduced computational cost, such a model can be adapted to generate raw waveforms on an embedded device. Some embodiments of the present disclosure can use a flow-based version of WaveNet in order to simplify the training procedure involved in parallel WaveNet.
When trained with voice samples of n different speakers (with n an integer greater than or equal to 2), the speakers being identified accordingly as speaker 1 ... speaker n, the global condition can be expressed as an n-dimensional vector filled with the integer value "0", except for the i-th dimension of the vector, which is filled with the integer value "1", where i is an integer value that corresponds to an identifier of a speaker. In order to achieve Voice Conversion from a speaker A to a speaker B, one may simply reconstruct a speech sample from speaker A while giving the VQVAE the speaker identifier of speaker B.
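The global condition described above can be illustrated as follows; the vqvae object in the commented line is a placeholder for the trained model, not an actual API:

import numpy as np

def speaker_condition(i, n):
    v = np.zeros(n, dtype=np.int64)   # n-dimensional vector filled with 0
    v[i] = 1                          # 1 at the dimension identifying the speaker
    return v

# voice conversion A -> B: reconstruct a speaker-A utterance while giving the model speaker B's identifier
# converted = vqvae(speech_of_speaker_a, speaker_condition(speaker_b_id, n))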
The different embodiments described above can be used in any combination. For instance, a first embodiment can use a random pick to obtain a target speaker, followed by a voice conversion based on CGVC; a second embodiment can use a TSS model as described above, followed by a voice conversion based on CGVC; a third embodiment can use a random pick followed by a voice conversion based on VQVAE; and/or a fourth embodiment can use a TSS model as described above, followed by a voice conversion based on VQVAE.
The present disclosure can be applied in many technical domains. Some exemplary use cases include aging or de-aging speaker voices, or recreating someone's voice at a given age, in the technical field of movie production. Some other exemplary use cases belong to the technical field of call applications, where it can be useful for an aged person to get a younger voice during a call (like a phone call) in order to feel more self-confident, or to appear more reliable from a distant caller's (or a distant called person's) point of view. Use cases of the present disclosure can also be found in dynamic age adaptation applications. For instance, at least some embodiments of the voice conversion method of the present disclosure can help improve the client experience during a dialogue between an employee of a call center and a client, by permitting the employee's voice to be modified so as to sound as old as the client's voice, thus reducing the age gap between them.
As already pointed out in the present disclosure, age is just one possible application, as the target speaker selection model can be used with other parameters (with other classifiers), such as parameters related to language in applications dealing with translation, in order to find a speaker speaking another language who matches an actor's voice, or parameters related to accent, for the purpose of getting rid of an accent, in order to find a matching speaker with a neutral accent.
Some embodiments of the present disclosure can also find applications in domains other than voice. For instance, in movie production, some embodiments of the present disclosure can be used for aging the motor sound produced by a vehicle.
Experimental results

Some experimental results are detailed below for an exemplary embodiment adapted to the new technical domain of speech age conversion. As there is no prior work achieving speech age conversion, the experimental results cannot be compared with results of prior works. Both objective evaluations and subjective evaluations of the results of different embodiments are proposed.
Objective evaluation can be based on different parameters, like parameters adapted to measure or evaluate the quality of the output speech (in terms of audio quality and/or intelligibility, for instance). For instance, objective evaluation can rely at least partially on shimmer, jitter and/or on the harmonic-to-noise ratio (HNR). The HNR, which describes the amount of noise present in a speaker's voice, can be an interesting acoustic descriptor regarding speech age, as the HNR tends to decrease with age and seems more reliable than other acoustic parameters (like jitter) for discriminating elder / young adult voices.
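As a rough, simplified illustration of how a frame-level HNR estimate can be derived from the normalized autocorrelation peak (dedicated tools such as Praat are typically used in practice, and this sketch ignores several refinements):

import numpy as np

def frame_hnr(frame, sr, fmin=75, fmax=400):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]                                  # normalized autocorrelation
    lo, hi = int(sr / fmax), int(sr / fmin)          # plausible pitch-period range in samples
    r = np.clip(ac[lo:hi].max(), 1e-6, 1 - 1e-6)     # harmonic-to-total energy ratio
    return 10 * np.log10(r / (1 - r))                # HNR in dB; lower values mean a noisier voice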
Subjective evaluation can be based on perceptual tests permitting to obtain, via the feedback of one or more third persons, some parameters adapted to measure or evaluate at least one quality criterion, for instance parameters related to identity coherence as assessed by third person(s) answering questions regarding the (presumed) identity of the speaker of both the input and the output speech (like the question "Does the converted speech sound like it has been said by the same speaker at a different age?"), or parameters related to the age of the input and/or output speech (like the question "How old is the converted voice?").
Some tests can also be performed for evaluating only the TSS model.
For instance, in order to measure the effect of the Gradient Reversal Layer, the TSS model can be tested on an exemplary simple dataset comprising training and test elements, like the dataset known as the "Modified National Institute of Standards and Technology database" ("MNIST"), which is one of the most used hand-written digit datasets, where every digit is a 14*14 black and white image.
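The Gradient Reversal Layer itself is commonly implemented as an identity in the forward pass whose gradient is negated (and scaled) in the backward pass; a generic PyTorch-style sketch, not necessarily identical to the layer used in the TSS model, is given below:

import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)                        # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None        # reversed gradient reaches the feature extractor

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)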
Domain Confusion, or Domain Adaptation, is a set of techniques allowing the training of a model when there is a shift between the training and test distributions (where the training (resp. test) distribution refers to the distribution from which the training (resp. test) dataset is sampled). For example, for problems lacking labeled data, an option can be to create a synthetic, and therefore fully labeled, training dataset, which inevitably has a slightly different distribution from actual data.
The test dataset can be spatially altered, with a random sinusoidal distortion for instance, so that each element of the dataset is still recognizable, but the overall distribution of the test set is shifted from that of the training set.
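One possible way of applying such a random sinusoidal distortion to a small grayscale image is sketched below; the amplitude and frequency values are arbitrary:

import numpy as np
from scipy.ndimage import map_coordinates

def sinusoidal_distort(img, amp=1.5, freq=0.5):
    h, w = img.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    phase = np.random.uniform(0, 2 * np.pi)
    xx_src = xx + amp * np.sin(2 * np.pi * freq * yy / h + phase)   # columns shifted sinusoidally
    return map_coordinates(img, [yy, xx_src], order=1, mode="nearest")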
The TSS model can be trained with and without GRL, the classification accuracy and confusion being then compared on both the train and test sets. Results are reported in Table 3 below. Table 3 shows a comparison of the accuracy and confusion of the model for both the original and distorted datasets, and for both a bypassed and an activated Gradient Reversal Layer. The use of the Gradient Reversal Layer shows a significant accuracy increase when classifying the test dataset.
                                Accuracy    Confusion
w/o GRL, original dataset         99%          0%
w/o GRL, distorted dataset        72%          0%
w/ GRL, original dataset          99%         38%
w/ GRL, distorted dataset         81%         35%

Table 3
Activation of the GRL can increase the overlap between samples from the original (non-distorted) and distorted datasets, and can thus help obtain a model that is more accurate on both the original (non-distorted) and distorted domains.
Figure 12 shows how the feature space can be organized for the original and distorted MNIST datasets. The feature space z is projected on a bi-dimensional plane with the T-SNE algorithm for visualization. When comparing the feature space with and without GRL, we see that the use of the GRL makes the training and test domains overlap, which might explain the accuracy increase shown in Table 3 above. As shown by figure 12, when the GRL is activated, the samples from the original and distorted datasets overlap considerably more than when the GRL is bypassed.
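The bi-dimensional projection can for instance be obtained with the scikit-learn implementation of T-SNE; the feature array below is a placeholder for the features extracted by the TSS model:

import numpy as np
from sklearn.manifold import TSNE

z = np.random.randn(2000, 64)                            # placeholder feature vectors
z_2d = TSNE(n_components=2, perplexity=30).fit_transform(z)
# z_2d can then be scatter-plotted, coloring each point by its domain (original vs. distorted)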
Additional Embodiments and information
This application describes a variety of aspects, including tools, features, embodiments, models, approaches, etc. Many of these aspects are described with specificity and, at least to show the individual characteristics, are often described in a manner that may sound limiting. However, this is for purposes of clarity in description, and does not limit the application or scope of those aspects. Indeed, all of the different aspects can be combined and interchanged to provide further aspects. Moreover, the aspects can be combined and interchanged with aspects described in earlier filings as well.
The aspects described and contemplated in this application can be implemented in many different forms. Figure 1 below provides some embodiments, but other embodiments are contemplated and the discussion of all or part of the Figures does not limit the breadth of the implementations. At least one of the aspects generally relates to conversion of an audio signal, or an audio component of an audiovisual signal, and at least one other aspect generally relates to transmitting a converted bitstream. These and other aspects can be implemented as a method, an apparatus, a computer readable storage medium having stored thereon instructions for converting audio data according to any of the methods described, and/or a computer readable storage medium having stored thereon a bitstream generated according to any of the methods described.
In the present application, the terms “voice”, or “speech” may be used interchangeably.
Various methods are described herein, and each of the methods comprises one or more steps or actions for achieving the described method. Unless a specific order of steps or actions is required for proper operation of the method, the order and/or use of specific steps and/or actions may be modified or combined.
Various numeric values are used in the present application (for example regarding a number of groups of units in a model or sub-model). The specific values are for example purposes and the aspects described are not limited to these specific values.
Figure 4 illustrates an exemplary method 400 for audio conversion. Variations of this method 400 are contemplated, but the audio converting method 400 has been described above for purposes of clarity without describing all expected variations. The method for audio conversion can further be part of a method for encoding or a method for decoding as illustrated by figures 2 and 3.
At least some embodiments relate to improving compression efficiency compared to existing video compression systems such as HEVC (HEVC refers to High Efficiency Video Coding, also known as H.265 and MPEG-H Part 2 described in "ITU-T H.265 Telecommunication standardization sector of ITU (10/2014), series H: audiovisual and multimedia systems, infrastructure of audiovisual services - coding of moving video, High efficiency video coding, Recommendation ITU-T H.265"), or compared to video compression systems under development such as VVC (Versatile Video Coding, a new standard being developed by JVET, the Joint Video Experts Team).
To achieve high compression efficiency, image and video coding schemes usually employ prediction, including spatial and/or motion vector prediction, and transforms to leverage spatial and temporal redundancy in the video content. Generally, intra or inter prediction is used to exploit the intra or inter frame correlation, then the differences between the original image and the predicted image, often denoted as prediction errors or prediction residuals, are transformed, quantized, and entropy coded. To reconstruct the video, the compressed data are decoded by inverse processes corresponding to the entropy coding, quantization, transform, and prediction. Mapping and inverse mapping processes can be used in an encoder and decoder to achieve improved coding performance. Indeed, for better coding efficiency, signal mapping may be used. Mapping aims at better exploiting the samples codewords values distribution of the video pictures. Figure 2 illustrates an encoder 100. Variations of this encoder 100 are contemplated, but the encoder 100 is described below for purposes of clarity without describing all expected variations.
Before being encoded, the video sequence may go through pre-encoding processing (101), for example, applying a color transform to the input color picture (e.g., conversion from RGB 4:4:4 to YCbCr4:2:0), or performing a remapping of the input picture components in order to get a signal distribution more resilient to compression (for instance using a histogram equalization of one of the color components) and/or performing sound conversion of an audio part of the video sequence according to some embodiments of the present disclosure. Metadata can be associated with the pre-processing and attached to the bitstream.
In the encoder 100, a picture is encoded by the encoder elements as described below. The picture to be encoded is partitioned (102) and processed in units of, for example, CUs. Each unit is encoded using, for example, either an intra or inter mode. When a unit is encoded in an intra mode, it performs intra prediction (160). In an inter mode, motion estimation (175) and compensation (170) are performed. The encoder decides (105) which one of the intra mode or inter mode to use for encoding the unit, and indicates the intra/inter decision by, for example, a prediction mode flag. Prediction residuals are calculated, for example, by subtracting (110) the predicted block from the original image block.
The prediction residuals are then transformed (125) and quantized (130). The quantized transform coefficients, as well as motion vectors and other syntax elements, are entropy coded (145) to output a bitstream. The encoder can skip the transform and apply quantization directly to the non-transformed residual signal. The encoder can bypass both transform and quantization, i.e., the residual is coded directly without the application of the transform or quantization processes.
The encoder decodes an encoded block to provide a reference for further predictions. The quantized transform coefficients are de-quantized (140) and inverse transformed (150) to decode prediction residuals. Combining (155) the decoded prediction residuals and the predicted block, an image block is reconstructed. In-loop filters (165) are applied to the reconstructed picture to perform, for example, deblocking/SAO (Sample Adaptive Offset) filtering to reduce encoding artifacts. The filtered image is stored at a reference picture buffer (180).
Figure 3 illustrates a block diagram of a video decoder 200. In the decoder 200, a bitstream is decoded by the decoder elements as described below. Video decoder 200 generally performs a decoding pass reciprocal to the encoding pass as described in Figure 2. The encoder 100 also generally performs video decoding as part of encoding video data.
In particular, the input of the decoder includes a video bitstream, which can be generated by video encoder 100. The bitstream is first entropy decoded (230) to obtain transform coefficients, motion vectors, and other coded information. The picture partition information indicates how the picture is partitioned. The decoder may therefore divide (235) the picture according to the decoded picture partitioning information. The transform coefficients are de-quantized (240) and inverse transformed (250) to decode the prediction residuals. Combining (255) the decoded prediction residuals and the predicted block, an image block is reconstructed. The predicted block can be obtained (270) from intra prediction (260) or motion-compensated prediction (i.e., inter prediction) (275). In-loop filters (265) are applied to the reconstructed image. The filtered image is stored at a reference picture buffer (280).
The decoded picture can further go through post-decoding processing (285), for example, an inverse color transform (e.g. conversion from YCbCr 4:2:0 to RGB 4:4:4) or an inverse remapping performing the inverse of the remapping process performed in the pre-encoding processing (101), and/or performing sound conversion of an audio part of the decoded video sequence according to some embodiments of the present disclosure. The post-decoding processing can use metadata derived in the pre-encoding processing and signaled in the bitstream.
Figure 1 illustrates a block diagram of an example of a system 1000 in which various aspects and embodiments are implemented. System 1000 can be embodied as a device including the various components described below and is configured to perform one or more of the aspects described in this document. Examples of such devices include, but are not limited to, various electronic devices such as personal computers, laptop computers, smartphones, tablet computers, digital multimedia set top boxes, digital television receivers, personal video recording systems, connected home appliances, and servers. Elements of system 1000, singly or in combination, can be embodied in a single integrated circuit (IC), multiple ICs, and/or discrete components. For example, in at least one embodiment, the processing and encoder/decoder elements of system 1000 are distributed across multiple ICs and/or discrete components. In various embodiments, the system 1000 is communicatively coupled to one or more other systems, or other electronic devices, via, for example, a communications bus or through dedicated input and/or output ports. In various embodiments, the system 1000 is configured to implement one or more of the aspects described in this document.
The system 1000 includes at least one processor 1010 configured to execute instructions loaded therein for implementing, for example, the various aspects described in this document. Processor 1010 can include embedded memory, input output interface, and various other circuitries as known in the art. The system 1000 includes at least one memory 1020 (e.g., a volatile memory device, and/or a non-volatile memory device). System 1000 includes a storage device 1040, which can include non-volatile memory and/or volatile memory, including, but not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Random Access Memory (RAM), Dynamic Random-Access Memory (DRAM), Static Random-Access Memory (SRAM), flash, magnetic disk drive, and/or optical disk drive. The storage device 1040 can include an internal storage device, an attached storage device (including detachable and non-detachable storage devices), and/or a network accessible storage device, as non-limiting examples.
System 1000 includes an audio converter module 1030 configured, for example, to process data to provide a converted audio signal, and the audio converter module 1030 can include its own processor and memory. The audio converter module 1030 represents module(s) that can be included in a device to perform the audio converting functions. Audio converter module 1030 can be implemented as a separate element of system 1000 or can be incorporated within processor 1010 as a combination of hardware and software as known to those skilled in the art.
Program code to be loaded onto processor 1010 or audio converter module 1030 to perform the various aspects described in this document can be stored in storage device 1040 and subsequently loaded onto memory 1020 for execution by processor 1010. In accordance with various embodiments, one or more of processor 1010, memory 1020, storage device 1040, and audio converter module 1030 can store one or more of various items during the performance of the processes described in this document. Such stored items can include, but are not limited to, the input audio signal or portions of the input audio signal, the target dataset (including target audio signals and optionally associated information relating to a target domain), the output audio signal or portions of the output audio signal, at least some of the coefficients of at least one of the neural network(s) introduced above, at least one classifier set (or an identifier of at least one classifier set) used by the TSS model, value of at least one classifier used by the TSS model, the bitstream, matrices, variables, and intermediate or final results from the processing of equations, formulas, operations, and operational logic.
In some embodiments, memory inside of the processor 1010 and/or the audio converter module 1030 is used to store instructions and to provide working memory for processing that is needed during audio conversion. In other embodiments, however, a memory external to the processing device (for example, the processing device can be either the processor 1010 or the audio converter module 1030) is used for one or more of these functions. The external memory can be the memory 1020 and/or the storage device 1040, for example, a dynamic volatile memory and/or a non-volatile flash memory. In several embodiments, an external non-volatile flash memory is used to store the operating system of, for example, a television. In at least one embodiment, a fast-external dynamic volatile memory such as a RAM is used as working memory for audio converting operations. The input to the elements of system 1000 can be provided through various input devices as indicated in block 1130. Such input devices include, but are not limited to, (i) a radio frequency (RF) portion that receives an RF signal transmitted, for example, over the air by a broadcaster, (ii) a Component (COMP) input terminal (or a set of COMP input terminals), (iii) a Universal Serial Bus (USB) input terminal, and/or (iv) a High Definition Multimedia Interface (HDMI) input terminal. Other examples, not shown in Figure 3, include composite video.
In various embodiments, the input devices of block 1130 have associated respective input processing elements as known in the art. For example, the RF portion can be associated with elements suitable for (i) selecting a desired frequency (also referred to as selecting a signal, or band-limiting a signal to a band of frequencies), (ii) down converting the selected signal, (iii) band-limiting again to a narrower band of frequencies to select (for example) a signal frequency band which can be referred to as a channel in certain embodiments, (iv) demodulating the down converted and band-limited signal, (v) performing error correction, and (vi) demultiplexing to select the desired signal of data packets. The RF portion of various embodiments includes one or more elements to perform these functions, for example, frequency selectors, signal selectors, band-limiters, channel selectors, filters, downconverters, demodulators, error correctors, and demultiplexers. The RF portion can include a tuner that performs various of these functions, including, for example, down converting the received signal to a lower frequency (for example, an intermediate frequency or a near-baseband frequency) or to baseband. In one set-top box embodiment, the RF portion and its associated input processing element receives an RF signal transmitted over a wired (for example, cable) medium, and performs frequency selection by filtering, down converting, and filtering again to a desired frequency band. Various embodiments rearrange the order of the above-described (and other) elements, remove some of these elements, and/or add other elements performing similar or different functions. Adding elements can include inserting elements in between existing elements, such as, for example, inserting amplifiers and an analog-to-digital converter. In various embodiments, the RF portion includes an antenna.
Additionally, the USB and/or HDMI terminals can include respective interface processors for connecting system 1000 to other electronic devices across USB and/or HDMI connections. It is to be understood that various aspects of input processing, for example, Reed-Solomon error correction, can be implemented, for example, within a separate input processing IC or within processor 1010 as necessary. Similarly, aspects of USB or HDMI interface processing can be implemented within separate interface ICs or within processor 1010 as necessary. The demodulated, error corrected, and demultiplexed signal is provided to various processing elements, including, for example, processor 1010, and encoder/decoder 1030 operating in combination with the memory and storage elements to process the data stream as necessary for presentation on an output device.
Various elements of system 1000 can be provided within an integrated housing. Within the integrated housing, the various elements can be interconnected and transmit data therebetween using a suitable connection arrangement 1140, for example, an internal bus as known in the art, including the Inter-IC (I2C) bus, wiring, and printed circuit boards.
The system 1000 includes communication interface 1050 that enables communication with other devices via communication channel 1060. The communication interface 1050 can include, but is not limited to, a transceiver configured to transmit and to receive data over communication channel 1060. The communication interface 1050 can include, but is not limited to, a modem or network card and the communication channel 1060 can be implemented, for example, within a wired and/or a wireless medium.
Data is streamed, or otherwise provided, to the system 1000, in various embodiments, using a wireless network such as a Wi-Fi network, for example IEEE 802.11 (IEEE refers to the Institute of Electrical and Electronics Engineers). The Wi-Fi stream of these embodiments is received over the communications channel 1060 and the communications interface 1050 which are adapted for Wi-Fi communications. The communications channel 1060 of these embodiments is typically connected to an access point or router that provides access to external networks including the Internet for allowing streaming applications and other over-the- top communications. Other embodiments provide streamed data to the system 1000 using a set-top box that delivers the data over the HDMI connection of the input block 1130. Still other embodiments provide streamed data to the system 1000 using the RF connection of the input block 1130. As indicated above, various embodiments provide data in a non-streaming manner. Additionally, various embodiments use wireless networks other than Wi-Fi, for example a cellular network or a Bluetooth network.
The system 1000 can provide an output signal to various output devices, including a display 1100, speakers 1110, and other peripheral devices 1120. The display 1100 of various embodiments includes one or more of, for example, a touchscreen display, an organic light- emitting diode (OLED) display, a curved display, and/or a foldable display. The display 1100 can be for a television, a tablet, a laptop, a cell phone (mobile phone), or another device. The display 1100 can also be integrated with other components (for example, as in a smart phone), or separate (for example, an external monitor for a laptop). The other peripheral devices 1120 include, in various examples of embodiments, one or more of a stand-alone digital video disc (or digital versatile disc) (DVR, for both terms), a disk player, a stereo system, and/or a lighting system. Various embodiments use one or more peripheral devices 1120 that provide a function based on the output of the system 1000. For example, a disk player performs the function of playing the output of the system 1000. In various embodiments, control signals are communicated between the system 1000 and the display 1100, speakers 1110, or other peripheral devices 1120 using signaling such as AV. Link, Consumer Electronics Control (CEC), or other communications protocols that enable device-to-device control with or without user intervention. The output devices can be communicatively coupled to system 1000 via dedicated connections through respective interfaces 1070, 1080, and 1090. Alternatively, the output devices can be connected to system 1000 using the communications channel 1060 via the communications interface 1050. The display 1100 and speakers 1110 can be integrated in a single unit with the other components of system 1000 in an electronic device such as, for example, a television. In various embodiments, the display interface 1070 includes a display driver, such as, for example, a timing controller (T Con) chip.
The display 1100 and speaker 1110 can alternatively be separate from one or more of the other components, for example, if the RF portion of input 1130 is part of a separate set-top box. In various embodiments in which the display 1100 and speakers 1110 are external components, the output signal can be provided via dedicated output connections, including, for example, HDMI ports, USB ports, or COMP outputs.
The embodiments can be carried out by computer software implemented by the processor 1010 or by hardware, or by a combination of hardware and software. As a nonlimiting example, the embodiments can be implemented by one or more integrated circuits. The memory 1020 can be of any type appropriate to the technical environment and can be implemented using any appropriate data storage technology, such as optical memory devices, magnetic memory devices, semiconductor-based memory devices, fixed memory, and removable memory, as non-limiting examples. The processor 1010 can be of any type appropriate to the technical environment, and can encompass one or more of microprocessors, general purpose computers, special purpose computers, and processors based on a multicore architecture, as non-limiting examples.
Various implementations involve outputting the converted audio signal. "Output", as used in this application, can encompass all or part of the processes performed, for example, on the converted audio signal and/or on the selected audio signal and associated information on which the audio target is based, in order to produce a final output suitable to be rendered on a speaker and/or on a display (like a picture of a speaker (or individual) from which a selected audio signal comes). In various embodiments, such processes include one or more of the processes typically performed by an output or rendering device.
Whether the phrase “audio converting process” is intended to refer specifically to a subset of operations or generally to the broader process will be clear based on the context of the specific descriptions and is believed to be well understood by those skilled in the art. Note that the syntax elements as used herein, are descriptive terms. As such, they do not preclude the use of other syntax element names.
When a figure is presented as a flow diagram, it should be understood that it also provides a block diagram of a corresponding apparatus. Similarly, when a figure is presented as a block diagram, it should be understood that it also provides a flow diagram of a corresponding method/process.
Various embodiments refer to optimization. There are different approaches to solve an optimization problem. For example, the approaches may be based on an extensive testing of all options, including all considered modes or parameter values. Other approaches only evaluate a subset of the possible options.
The implementations and aspects described herein can be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed can also be implemented in other forms (for example, an apparatus or program). An apparatus can be implemented in, for example, appropriate hardware, software, and firmware. The methods can be implemented in, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants ("PDAs"), and other devices that facilitate communication of information between end-users.
Reference to “one embodiment” or “an embodiment” or “one implementation” or “an implementation”, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of the phrase “in one embodiment” or “in an embodiment” or “in one implementation” or “in an implementation”, as well any other variations, appearing in various places throughout this application are not necessarily all referring to the same embodiment.
Additionally, this application may refer to “determining” various pieces of information. Determining the information can include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.
Further, this application may refer to “accessing” various pieces of information. Accessing the information can include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, moving the information, copying the information, calculating the information, determining the information, predicting the information, or estimating the information. Additionally, this application may refer to “receiving” various pieces of information. Receiving is, as with “accessing”, intended to be a broad term. Receiving the information can include one or more of, for example, accessing the information, or retrieving the information (for example, from memory). Further, “receiving” is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
It is to be appreciated that the use of any of the following "/", "and/or", and "at least one of", for example, in the cases of "A/B", "A and/or B" and "at least one of A and B", is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B). As a further example, in the cases of "A, B, and/or C" and "at least one of A, B, and C", such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C). This may be extended, as is clear to one of ordinary skill in this and related arts, for as many items as are listed.
While the preceding relates to the verb form of the word “signal”, the word “signal” can also be used herein as a noun. As will be evident to one of ordinary skill in the art, implementations can produce a variety of signals formatted to carry information that can be, for example, stored or transmitted. The information can include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal can be formatted to carry the bitstream of a described embodiment. Such a signal can be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting can include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries can be, for example, analog or digital information. The signal can be transmitted over a variety of different wired or wireless links, as is known. The signal can be stored on a processor-readable medium.
We describe a number of embodiments. Features of these embodiments can be provided alone or in any combination, across various claim categories and types. Further, embodiments can include one or more of the following features, devices, or aspects, alone or in any combination, across various claim categories and types:
• A process or device to perform an audio conversion.
• A bitstream or signal that includes one or more of the described syntax elements, or variations thereof.
• A method, process, apparatus, medium storing instructions, medium storing data, or signal according to any of the embodiments described.
• Creating and/or transmitting and/or receiving and/or decoding a bitstream or signal that includes one or more of the described syntax elements, or variations thereof.
• A TV, set-top box, cell phone, tablet, or other electronic device that performs an audio conversion method(s) according to any of the embodiments described.
As pointed out above, the present disclosure encompasses many implementations.
For instance, some aspects of the present disclosure relate to a method comprising: obtaining an input audio signal originated from at least one input sound producer; selecting at least one target sound producer from a dataset of candidate sound producers labelled with a plurality of domains, said selecting being based at least partially on a target domain of said plurality of domains and on the obtained input audio signal.
According to at least some embodiments of the present disclosure, the method comprises applying at least one audio conversion between the obtained input audio signal and at least one target audio signal associated with the at least one target sound producer.
According to at least some embodiments of the present disclosure, said selecting is performed by using a first deep Neural Network classifying audio features output by a second Deep Neural Network from said obtained input audio signal .
According to at least some embodiments of the present disclosure, the method comprises training said first and said second deep neural network using a third neural network.
According to at least some embodiments of the present disclosure, said training comprises training said first neural network to classify audio features output by said second network according to identities of a first part of said candidate sound producers.
According to at least some embodiments of the present disclosure, said training comprises training said third network to associate audio features output by said second network according to their domains.
According to at least some embodiments of the present disclosure, said training comprises training said second network, by using as input a first part of said candidate sound producers, to extract audio features which minimize a first loss function associated to said first neural network while maximizing a loss function associated to said third neural network.
According to at least some embodiments of the present disclosure, one of the plurality of domains is used for training said first and said second deep neural network using a domain confusion network.

According to at least some embodiments of the present disclosure, said plurality of domains is a plurality of age ranges.
According to at least some embodiments of the present disclosure, said input audio signal and said target audio signal are non-parallel audio signals.
Another aspect of the present disclosure relates to an apparatus comprising at least one processor configured for : obtaining an input audio signal originated from at least one input sound producer; selecting at least one target sound producer from a dataset of candidate sound producers labelled with a plurality of domains based at least partially on a target domain of said plurality of domains and on the obtained input audio signal.
The at least one processor can be configured to perform the aforementioned method in any of its embodiments.
Another aspect of the present disclosure relates to a computer program product comprising instructions which when executed by at least one processor cause the at least one processor to perform the aforementioned method in any of its embodiments.
For instance, at least some embodiments of the present disclosure relate to a computer program product comprising instructions which when executed by a processor cause the processor to perform a method comprising: obtaining an input audio signal originated from at least one input sound producer; selecting at least one target sound producer from a dataset of candidate sound producers labelled with a plurality of domains, said selecting being based at least partially on a target domain of said plurality of domains and on the obtained input audio signal.
At least one other aspect of the present disclosure relates to a non-transitory computer readable medium having stored thereon instructions which when executed by a processor cause the processor to perform the aforementioned method in any of its embodiments.
For instance, at least some embodiments of the present disclosure relate to a non- transitory computer readable medium having stored thereon instructions which when executed by a processor cause the processor to perform a method comprising: obtaining an input audio signal originated from at least one input sound producer; selecting at least one target sound producer from a dataset of candidate sound producers labelled with a plurality of domains, said selecting being based at least partially on a target domain of said plurality of domains and on the obtained input audio signal.

Claims

1. An apparatus comprising at least one processor configured for: obtaining an input audio signal originated from at least one input sound producer; selecting at least one target sound producer from a dataset of candidate sound producers labelled with a plurality of domains, said selecting being based at least partially on a target domain of said plurality of domains and on the obtained input audio signal.
2. A method comprising: obtaining an input audio signal originated from at least one input sound producer; selecting at least one target sound producer from a dataset of candidate sound producers labelled with a plurality of domains, said selecting being based at least partially on a target domain of said plurality of domains and on the obtained input audio signal.
3. The apparatus of claim 1, said at least one processor being configured for, or the method of claim 2 comprising, applying at least one audio conversion between the obtained input audio signal and at least one target audio signal associated with the at least one target sound producer.
4. The apparatus of claim 1 or 3 or the method of claim 2 or 3 wherein said selecting is performed by using a first deep Neural Network classifying audio features output by a second Deep Neural Network from said obtained input audio signal .
5. The apparatus of claim 1 or 3 or 4, said at least one processor being configured for, or the method of any of claims 2 to 4 comprising, training said first and said second deep neural network using a third neural network.
6. The apparatus of claim 5 or the method of claim 5 wherein said training comprises training said first neural network to classify audio features output by said second network according to identities of a first part of said candidate sound producers;
7. The apparatus of claim 5 or 6 or the method of claim 5 or 6 wherein said training comprises training said third network to associate audio features output by said second network according to their domains;
8. The apparatus of claim 5 or 6 or 7 or the method of any of claims 5 to 7 wherein said training comprises training said second network, by using as input a first part of said candidate sound producers, to extract audio features which minimize a first loss function associated to said first neural network while maximizing a loss function associated to said third neural network.
9. The apparatus of any of claims 5 or 6 to 8 or the method of any of claims 5 to 8 wherein one of the plurality of domains is used for training said first and said second deep neural network using a domain confusion network.
10. The apparatus of any claims 1 or 3 to 9 or the method of any of claims 3 to 9 wherein said plurality of domains is a plurality of age ranges.
11. The apparatus of any claims 1 or 3 to 10 or the method of any of claims 3 to 10 wherein said input audio signal and said target audio signal are non-parallel audio signals.
12. A computer program product comprising instructions which when executed by a processor cause the processor to perform the method of claim 2.
13. A non-transitory computer readable medium having stored thereon instructions which when executed by a processor cause the processor to perform the method of claim 2.
PCT/EP2020/071576 2019-08-12 2020-07-30 Systems and methods for sound conversion WO2021028236A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP19306009 2019-08-12
EP19306009.2 2019-08-12

Publications (1)

Publication Number Publication Date
WO2021028236A1 true WO2021028236A1 (en) 2021-02-18

Family

ID=67659173

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2020/071576 WO2021028236A1 (en) 2019-08-12 2020-07-30 Systems and methods for sound conversion

Country Status (1)

Country Link
WO (1) WO2021028236A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160352907A1 (en) * 2015-06-01 2016-12-01 AffectLayer, Inc. Coordinating voice calls between representatives and customers to influence an outcome of the call
EP3457401A1 (en) * 2017-09-18 2019-03-20 Thomson Licensing Method for modifying a style of an audio object, and corresponding electronic device, computer readable program products and computer readable storage medium
CN109637551A (en) * 2018-12-26 2019-04-16 出门问问信息科技有限公司 Phonetics transfer method, device, equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327584A (en) * 2021-05-28 2021-08-31 平安科技(深圳)有限公司 Language identification method, device, equipment and storage medium
CN113327584B (en) * 2021-05-28 2024-02-27 平安科技(深圳)有限公司 Language identification method, device, equipment and storage medium
CN113436609A (en) * 2021-07-06 2021-09-24 南京硅语智能科技有限公司 Voice conversion model and training method thereof, voice conversion method and system
CN113436609B (en) * 2021-07-06 2023-03-10 南京硅语智能科技有限公司 Voice conversion model, training method thereof, voice conversion method and system
CN113555026A (en) * 2021-07-23 2021-10-26 平安科技(深圳)有限公司 Voice conversion method, device, electronic equipment and medium
CN113555026B (en) * 2021-07-23 2024-04-19 平安科技(深圳)有限公司 Voice conversion method, device, electronic equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20746226

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20746226

Country of ref document: EP

Kind code of ref document: A1