CN113470622B - Conversion method and device capable of converting any voice into multiple voices - Google Patents

Conversion method and device capable of converting any voice into multiple voices

Publication number
CN113470622B
Authority
CN
China
Prior art keywords
network
feature
fundamental frequency
channel
variance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111035937.9A
Other languages
Chinese (zh)
Other versions
CN113470622A (en)
Inventor
曹艳艳
陈佩云
高君效
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chipintelli Technology Co Ltd
Original Assignee
Chipintelli Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chipintelli Technology Co Ltd filed Critical Chipintelli Technology Co Ltd
Priority to CN202111035937.9A priority Critical patent/CN113470622B/en
Publication of CN113470622A publication Critical patent/CN113470622A/en
Application granted granted Critical
Publication of CN113470622B publication Critical patent/CN113470622B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A conversion method and device for converting an arbitrary voice into a plurality of voices. The conversion method comprises the steps of: preparing corpora of a plurality of target speakers as training corpora; extracting the ppg features of each training corpus; obtaining comprehensive characteristics; acquiring the encoding characteristics of the target speakers in the training set to obtain a mean simulation feature γ and a variance simulation feature β; and training a conversion model capable of converting the comprehensive characteristics into Mel features. The mean simulation feature γ and the variance simulation feature β serve as the style input of the conversion model, the comprehensive characteristics serve as its content input, and the Mel spectra of different speakers are decoded, thereby synthesizing different voices. The invention better decouples the spoken-content information and reduces the influence of inaccurate ppg features extracted by the speech recognition model on voice conversion.

Description

Conversion method and device capable of converting any voice into multiple voices
Technical Field
The invention belongs to the technical field of voice synthesis, and particularly relates to a conversion method and a conversion device capable of converting any voice into a plurality of voices.
Background
Voice conversion is a technology that converts source voice data into the voice data of a specified speaker while keeping the spoken content consistent. Traditional voice-changing techniques turn the original audio into a machine-like sound through speech signal processing, for example by adjusting the pitch and the speaking rate, and the conversion mode is limited. Unlike traditional voice changing, voice conversion can control the emotion, prosody and other characteristics of the target voice while guaranteeing that the spoken content remains the same. Voice conversion can be used in scenarios such as virtual anchors, voice reshaping, prosody/emotion conversion and speaking-style conversion.
According to the training data supplied, voice conversion can be divided into parallel-data conversion and non-parallel-data conversion. Parallel-data conversion requires different speakers to provide recordings of the same utterances, which is difficult to satisfy in practice. More and more studies therefore investigate voice conversion on non-parallel data, and the application of deep learning has greatly improved its conversion quality. According to application requirements, voice conversion can also be classified as one-to-one, many-to-many, one-to-many and so on, where one-to-many converts one person's voice into the voices of several people. Deep-learning approaches to voice conversion mainly include: methods based on adversarial learning (CycleGAN, StarGAN and the like), and methods based on a speech recognition system, which use a speech recognition model to extract speaker-independent information, the phonetic posteriorgram (ppg), train a conversion model from ppg to audio features to obtain the target speaker's voice information, and feed the result to a vocoder to obtain the converted audio data.
Adversarial-learning-based methods can achieve good results when converting within the training set, but their drawback is that only the voices of speakers in the training set can be converted. Methods based on a speech recognition model, by contrast, can convert any timbre, but they rely on the accuracy of the speech recognition.
Disclosure of Invention
In order to overcome the technical defects in the prior art, the invention discloses a conversion method and a conversion device capable of converting any voice into a plurality of voices.
The invention discloses a conversion method capable of converting any voice into a plurality of voices, which comprises a training method and a synthesis method, wherein the training method comprises the following steps:
step 1, preparing corpora of a plurality of target speakers as training corpora, wherein each corpus comprises audio and corresponding speaker information, and extracting original Mel characteristics of the training corpora;
building a first preprocessing network, a second preprocessing network, an affine layer and a conversion model; the number of output channels of the first preprocessing network and the second preprocessing network is the same, and the down-sampling rate of the second preprocessing network is consistent with the down-sampling rate when the ppg characteristics of the training corpus are extracted;
wherein the first and second pre-processing networks comprise an instance normalization layer;
the instance normalization layer is calculated as follows:
µ_c = (1/W) · Σ_{n=1}^{W} M_c[n]    (1)
σ_c² = (1/W) · Σ_{n=1}^{W} (M_c[n] − µ_c)²    (2)
the normalized feature map is:
M̂_c[n] = (M_c[n] − µ_c) / √(σ_c² + ε)    (3)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c denotes the c-th channel of the feature map, W is the dimension of each channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, and M̂_c[n] is the normalized feature value;
step 2, extracting the ppg characteristics of each training corpus;
step 3, feeding the obtained ppg features into the first preprocessing network for processing;
step 4, calculating fundamental frequency features f0 of the training corpus audio data, taking logarithmic value logf0 to obtain fundamental frequency logarithmic features lf0, calculating voiced and unvoiced sound marks of the audio data, and splicing the fundamental frequency logarithmic features and the unvoiced and voiced sound marks to obtain fundamental frequency-unvoiced and voiced sound splicing features lf 0-uv;
sending the fundamental frequency-unvoiced and voiced sound splicing characteristics into a second preprocessing network for processing;
adding the results of the first and second preprocessing network processing to obtain a comprehensive characteristic;
step 5, acquiring the encoding characteristics of the target speaker in the training set, and transforming the encoding characteristics through an affine layer to obtain a mean value simulation characteristic gamma and a variance simulation characteristic beta;
step 6, inputting the mean value simulation characteristic gamma and the variance simulation characteristic beta obtained in the step 5 as the style of a conversion model, inputting the comprehensive characteristics obtained in the step 4 as the content of the conversion model, and generating converted Mel characteristics through the conversion model;
calculating a loss function according to the converted Mel feature and the original Mel feature;
the conversion model adopts a coder-decoder network framework, and comprises a coding network and a decoding network;
the coding network part codes the comprehensive characteristics obtained in the step 4, and the decoding network part decodes the coding result output by the coding network to obtain corresponding Mel characteristics;
the decoding network part comprises a convolution layer and an activation layer, the convolution layer is connected with the activation layer, and the activation layer is connected with a self-adaptive instance normalization layer;
the adaptive instance normalization layer is calculated according to formula (4):
AdaIN(M_c[n]) = γ_c · (M_c[n] − µ_c) / √(σ_c² + ε) + β_c = γ_c · M̂_c[n] + β_c    (4)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, M̂_c[n] is the normalized feature value, and γ_c and β_c are the values of the c-th channel of the mean simulation feature γ and the variance simulation feature β obtained in step 5;
step 7, using the loss function to update the first preprocessing network, the second preprocessing network, the affine layer and the conversion model;
and 8, repeating the steps 2 to 7 until the loss function is converged and the training is finished.
Preferably, the updating in step 7 is performed by gradient descent with back-propagation.
Preferably, the corpus in step 1 includes corpora of different languages; in step 2, the ppg features of each language are extracted and then spliced, after which step 3 is performed.
Preferably, in step 2, WeNet is used to extract the ppg features of the corpus.
Preferably, the conversion model adopts an encoder-decoder network framework, which comprises an encoding network and a decoding network;
the coding network part codes the comprehensive characteristics obtained in the step 4, and the decoding network part decodes the coding result output by the coding network to obtain corresponding Mel characteristics;
the decoding network part comprises a convolution layer and an activation layer, the convolution layer is connected with the activation layer, and the activation layer is connected with a self-adaptive instance normalization layer;
the adaptive instance normalization layer is calculated according to formula (4):
AdaIN(M_c[n]) = γ_c · (M_c[n] − µ_c) / √(σ_c² + ε) + β_c = γ_c · M̂_c[n] + β_c    (4)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, M̂_c[n] is the normalized feature value, and γ_c and β_c are the values of the c-th channel of the mean simulation feature γ and the variance simulation feature β obtained in step 5.
Preferably, the synthesis method comprises the following steps:
s9, extracting the ppg characteristics of the converted audio and sending the ppg characteristics into a first preprocessing network;
s10, extracting fundamental frequency logarithmic features of the converted audio and any audio of the target speaker, calculating a mean value and a variance, and performing linear mapping according to a formula to obtain a mapped feature lf 0':
lf0′ = (lf0_s − µ_s) · σ_t / σ_s + µ_t    (5)
where lf0_s is the fundamental frequency logarithmic feature of the audio to be converted, µ_s is the mean of the fundamental frequency logarithmic feature of the audio to be converted, µ_t is the mean of the target speaker's fundamental frequency logarithmic feature, σ_s is the variance of the fundamental frequency logarithmic feature of the audio to be converted, and σ_t is the variance of the target speaker's fundamental frequency logarithmic feature;
s11, splicing the mapped feature lf 0' with the unvoiced and voiced sound marks of the converted audio to obtain a fundamental frequency-unvoiced and voiced sound splicing feature;
sending the fundamental frequency-unvoiced and voiced sound splicing characteristics into a second preprocessing network for processing;
adding the results of the first and second preprocessing network processing to obtain a comprehensive characteristic;
acquiring the encoding characteristics of a speaker of a target speaker, and transforming the encoding characteristics of the speaker through an affine layer to obtain a mean value simulation characteristic gamma and a variance simulation characteristic beta;
s12, inputting the mean value simulation feature gamma and the variance simulation feature beta obtained in the step S11 as conversion model styles, inputting the comprehensive features obtained in the step S11 as conversion model contents, and generating converted Mel features through a conversion model; converting the Mel characteristic input vocoder into audio;
the first preprocessing network, the second preprocessing network, the affine layer and the conversion model in the steps S9-S12 are obtained after the training of the training method is completed.
The invention also discloses a conversion device capable of converting any voice into a plurality of voices, comprising a ppg feature extraction module, an LF0 feature extraction module and a speaker coding extraction module. The ppg feature extraction module is used for extracting ppg features, and the LF0 feature extraction module is used for extracting the fundamental frequency-unvoiced and voiced sound splicing feature. The ppg feature extraction module and the LF0 feature extraction module are connected to a first preprocessing network and a second preprocessing network respectively; the first preprocessing network and the second preprocessing network are further connected to two input ends of an adder, and the adder adds the features received at its input ends;
the output end of the adder is connected with the content input end of the conversion model;
the speaker code extraction module is used for extracting speaker code features, and is connected with an affine layer which is connected with the style input end of the conversion model; the output end of the conversion model is connected with a vocoder;
the first preprocessing network and the second preprocessing network comprise an instance normalization layer;
the instance normalization layer is calculated as follows:
µ_c = (1/W) · Σ_{n=1}^{W} M_c[n]    (1)
σ_c² = (1/W) · Σ_{n=1}^{W} (M_c[n] − µ_c)²    (2)
the normalized feature map is:
M̂_c[n] = (M_c[n] − µ_c) / √(σ_c² + ε)    (3)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c denotes the c-th channel of the feature map, W is the dimension of each channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, and M̂_c[n] is the normalized feature value;
the conversion model adopts an encoder-decoder network framework, which comprises an encoding network and a decoding network;
the decoding network part comprises a convolution layer and an activation layer, the convolution layer is connected with the activation layer, and the activation layer is connected with a self-adaptive instance normalization layer;
the adaptive instance normalization layer is calculated according to formula (4):
AdaIN(M_c[n]) = γ_c · (M_c[n] − µ_c) / √(σ_c² + ε) + β_c = γ_c · M̂_c[n] + β_c    (4)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, M̂_c[n] is the normalized feature value, and γ_c and β_c are the values of the c-th channel of the mean simulation feature γ and the variance simulation feature β obtained in step 5;
the LF0 feature extraction module extracts fundamental frequency logarithmic features of the converted audio and any audio of the target speaker, calculates mean and variance, and performs linear mapping according to a formula to obtain mapped features LF 0':
lf0′ = (lf0_s − µ_s) · σ_t / σ_s + µ_t    (5)
where lf0_s is the fundamental frequency logarithmic feature of the audio to be converted, µ_s is the mean of the fundamental frequency logarithmic feature of the audio to be converted, µ_t is the mean of the target speaker's fundamental frequency logarithmic feature, σ_s is the variance of the fundamental frequency logarithmic feature of the audio to be converted, and σ_t is the variance of the target speaker's fundamental frequency logarithmic feature.
Preferably, a splicing module is further connected between the ppg feature extraction module and the first preprocessing network.
The invention can better decouple the speaking-content information and reduce the influence of inaccurate ppg features extracted by the speech recognition model on voice conversion. Combining the ppg feature with the added fundamental frequency-unvoiced and voiced sound splicing feature improves the handling of audio detail information during voice conversion; in particular, for cross-lingual conversion, artifacts such as Chinese-accented English are noticeably reduced.
By adding the speaker code to the conversion model, the invention can convert any voice into the voices of multiple speakers; provided the speaker coding model is trained well enough, conversion from arbitrary input speech can be achieved.
Drawings
FIG. 1 is a schematic flow chart of a conversion method according to the present invention;
FIG. 2 is a schematic diagram of an embodiment of a first pre-processing network according to the present invention, and a second pre-processing network may also adopt the structure shown in FIG. 2;
FIG. 3 is a schematic diagram of an embodiment of step 5 and step 6 according to the present invention;
fig. 4 is a schematic diagram of an embodiment of the conversion device according to the present invention.
Detailed Description
The following provides a more detailed description of the present invention.
The method for converting any voice into a plurality of voices, as shown in fig. 1, includes the following steps:
step 1, preparing corpora of a plurality of target speakers as training corpora, wherein each corpus comprises audio and corresponding speaker information, and extracting original Mel characteristics of the training corpora;
the target speaker is a conversion target at the time of voice conversion, that is, it is desired to convert an arbitrary audio into an audio having the same characteristics as the voice of the target speaker.
Building a first preprocessing network, a second preprocessing network, an affine layer and a conversion model; the number of output channels of the first preprocessing network and the second preprocessing network is the same, and the down-sampling rate of the second preprocessing network is consistent with the down-sampling rate when the ppg characteristics of the training corpus are extracted;
wherein the first and second pre-processing networks comprise an instance normalization layer;
the instance normalization layer is calculated as follows:
µ_c = (1/W) · Σ_{n=1}^{W} M_c[n]    (1)
σ_c² = (1/W) · Σ_{n=1}^{W} (M_c[n] − µ_c)²    (2)
the normalized feature map is:
M̂_c[n] = (M_c[n] − µ_c) / √(σ_c² + ε)    (3)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c denotes the c-th channel of the feature map, W is the dimension of each channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, and M̂_c[n] is the normalized feature value;
step 2, extracting the ppg characteristics of each training corpus;
step 3, feeding the obtained ppg features into the first preprocessing network for processing;
step 4, calculating a fundamental frequency feature f0 of the audio data of the training sample, taking a logarithmic value logf0 to obtain a fundamental frequency logarithmic feature lf0, calculating a voiced and unvoiced label of the audio data, and splicing the fundamental frequency logarithmic feature and the voiced and unvoiced label to obtain a fundamental frequency-unvoiced and voiced splicing feature lf 0-uv;
sending the fundamental frequency-unvoiced and voiced sound splicing characteristics into a second preprocessing network for processing;
adding the results of the first and second preprocessing network processing to obtain a comprehensive characteristic;
step 5, acquiring the encoding characteristics of the target speaker in the training set, and transforming the encoding characteristics through an affine layer to obtain a mean value simulation characteristic gamma and a variance simulation characteristic beta;
step 6, inputting the mean value simulation characteristic gamma and the variance simulation characteristic beta obtained in the step 5 as the style of a conversion model, inputting the comprehensive characteristics obtained in the step 4 as the content of the conversion model, and generating converted Mel characteristics through the conversion model;
calculating a loss function according to the converted Mel feature and the original Mel feature;
step 7, using the loss function to update the first preprocessing network, the second preprocessing network, the affine layer and the conversion model;
and 8, repeating the steps 2 to 7 until the loss function is converged and the training is finished.
In step 1, the conversion model adopts an encoder-decoder network framework, in which the encoding network (encoder) encodes the comprehensive characteristics and the decoding network (decoder) decodes the encoder output to obtain the corresponding Mel features. The encoder-decoder network is an existing general framework: the model input is first encoded and then decoded into the target output, and when it is applied, the network layers used by the encoder and the decoder need to be selected and designed specifically.
The decoder network of the invention may comprise convolution layers, activation layers and the like, with each convolution layer followed by an activation layer; an adaptive instance normalization layer (AdaIN for short) is embedded after the activation layer and, as shown in Fig. 3, takes the mean simulation feature γ and the variance simulation feature β as the style input of the conversion model.
An adaptive instance normalization layer is used in style migration, where the adaptive instance normalization layer inputs include content inputs and style inputs, and channel-wise means and standard deviations of the content inputs are matched to channel-wise means and standard deviations of the style inputs.
The calculation method of the adaptive instance normalization layer of the decoder network may be as in equation (4).
AdaIN(M_c[n]) = γ_c · (M_c[n] − µ_c) / √(σ_c² + ε) + β_c = γ_c · M̂_c[n] + β_c    (4)
where M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, M̂_c[n] is the normalized feature value, and γ_c and β_c are the values of the c-th channel of the mean simulation feature γ and the variance simulation feature β obtained in step 5.
The mean µ_c and variance σ_c² are calculated in the same way as in formulas (1) and (2).
The global speaker characteristics, namely the mean simulation feature γ and the variance simulation feature β, are embedded into the decoding network through equation (4), thereby enabling conversion to multiple speakers.
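For illustration, here is a minimal PyTorch-style sketch of an adaptive instance normalization operation of the kind described by equation (4); the function and argument names are illustrative assumptions, not identifiers from the patent.

import torch


def adaptive_instance_norm(content, gamma, beta, eps=1e-5):
    """Equation (4): normalize each channel of the content feature map to zero
    mean and unit variance, then rescale with the speaker-derived gamma (scale)
    and beta (shift).

    content: (batch, channels, frames) decoder feature map
    gamma, beta: (batch, channels) produced by the affine layers of step 5
    """
    mu = content.mean(dim=-1, keepdim=True)                   # per-channel mean, eq. (1)
    var = content.var(dim=-1, unbiased=False, keepdim=True)   # per-channel variance, eq. (2)
    normalized = (content - mu) / torch.sqrt(var + eps)       # eq. (3)
    return gamma.unsqueeze(-1) * normalized + beta.unsqueeze(-1)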
In one embodiment, step 2 may use WeNet (a production-oriented speech recognition toolkit open-sourced by the Mobvoi speech team together with the speech laboratory of Northwestern Polytechnical University) to extract the ppg (phonetic posteriorgram) features of the corpus; the ppg features correspond to the output of the encoder layer of the WeNet model.
When conversion among multiple languages is involved, the corresponding ppg features are obtained through the WeNet model of each language, the multi-language ppg features are spliced, and step 3 is then performed.
In step 3, the obtained ppg features are fed into the first preprocessing network prenet1 for processing. In the prenet1 network, an instance normalization layer (IN) is added after each one-dimensional convolution layer and activation layer, and several such blocks are cascaded, as shown in Fig. 2; the result of the prenet1 processing is denoted prenet1_out.
The prenet1 network uses one-dimensional convolutions, and the instance normalization layer is calculated as follows:
µ_c = (1/W) · Σ_{n=1}^{W} M_c[n]    (1)
σ_c² = (1/W) · Σ_{n=1}^{W} (M_c[n] − µ_c)²    (2)
where M_c denotes the c-th channel of the feature map, W is the dimension of each channel, and M_c[n] is the value of the n-th dimension of channel M_c; formulas (1) and (2) give the mean µ_c and variance σ_c² of each channel. The normalized feature map is then obtained as:
M̂_c[n] = (M_c[n] − µ_c) / √(σ_c² + ε)    (3)
the stable constant epsilon is a small constant value, so that the numerical value after normalization is prevented from being unstable. And sending the normalized characteristic value to a subsequent network model. The audio content information can be better decoupled through the processing of the step.
And 4, calculating a fundamental frequency feature f0 of the audio data of the training sample, taking a logarithmic value logf0 to obtain a fundamental frequency logarithmic feature lf0, calculating a voiced and unvoiced label of the audio data, and splicing the fundamental frequency logarithmic feature and the voiced and unvoiced label to obtain a fundamental frequency-unvoiced and voiced splicing feature lf 0-uv.
The fundamental frequency feature f0 can be calculated with reference to M. Morise, H. Kawahara and H. Katayose, "Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech," AES 35th International Conference, CD-ROM Proceedings, Feb. 2009.
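The lf0-uv feature of step 4 can be sketched as follows; this example assumes the pyworld package (a WORLD/DIO implementation) as the F0 estimator and an illustrative frame period, and is not the patent's own implementation.

import numpy as np
import pyworld as pw  # assumed F0 estimator; any reliable F0 extractor would do


def lf0_uv_features(audio, sample_rate, frame_period_ms=10.0):
    """Step 4 sketch: F0 -> log F0 (lf0) and voiced/unvoiced flag (uv),
    concatenated into the lf0-uv splicing feature of shape (frames, 2)."""
    audio = audio.astype(np.float64)
    f0, t = pw.dio(audio, sample_rate, frame_period=frame_period_ms)
    f0 = pw.stonemask(audio, f0, t, sample_rate)   # refine the coarse DIO estimate

    uv = (f0 > 0).astype(np.float32)               # 1 = voiced frame, 0 = unvoiced
    lf0 = np.zeros_like(f0, dtype=np.float32)
    lf0[f0 > 0] = np.log(f0[f0 > 0])               # log F0 on voiced frames only

    return np.stack([lf0, uv], axis=-1)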
The fundamental frequency-unvoiced and voiced sound splicing feature lf0-uv is fed into the second preprocessing network prenet2, whose output is denoted prenet2_out. The number of output channels of prenet2 must match the number of output channels of prenet1, and the down-sampling rate of prenet2 must match the down-sampling rate of WeNet. The outputs of the two preprocessing networks, prenet1_out and prenet2_out, are then added to obtain the comprehensive characteristic.
Preferably, the second preprocessing network prenet2 consists of several one-dimensional convolutional layers and is structurally identical to the first preprocessing network prenet1. Fig. 2 shows a specific embodiment of the first preprocessing network; the second preprocessing network can also be implemented in the manner shown in Fig. 2.
In step 5, the speaker code of each target speaker in the training set is acquired; it is extracted with a dedicated neural network. As shown in Fig. 3, the speaker coding feature is transformed by two affine layers to obtain the mean simulation feature γ and the variance simulation feature β, which simulate the mean and the variance of the style feature, respectively. The models and algorithms for extracting the speaker code, and the affine-layer transformation that produces the mean simulation feature γ and the variance simulation feature β, are prior art.
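A minimal sketch of the affine transformation of step 5, assuming a 256-dimensional speaker embedding as in the embodiment below; two linear layers map the speaker code to the mean simulation feature γ and the variance simulation feature β.

import torch.nn as nn


class StyleAffine(nn.Module):
    """Step 5 sketch: speaker embedding -> (gamma, beta) for the AdaIN layers."""

    def __init__(self, speaker_dim=256, channels=256):
        super().__init__()
        self.to_gamma = nn.Linear(speaker_dim, channels)  # mean simulation feature
        self.to_beta = nn.Linear(speaker_dim, channels)   # variance simulation feature

    def forward(self, speaker_embedding):        # (batch, speaker_dim)
        return self.to_gamma(speaker_embedding), self.to_beta(speaker_embedding)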
Step 6, inputting the mean value simulation characteristic gamma and the variance simulation characteristic beta obtained in the step 5 as the style of a conversion model, inputting the comprehensive characteristics obtained in the step 4 as the content of the conversion model, and generating converted Mel characteristics through the conversion model;
calculating a loss function from the converted Mel feature and the original Mel feature; the loss function is typically the difference between the converted Mel feature and the original Mel feature.
Step 7, using the loss function to update the first preprocessing network, the second preprocessing network, the affine layer and the conversion model;
the method for updating each parameter in the first preprocessing network, the second preprocessing network, the affine layer and the conversion model by using the loss function is the prior art, and usually adopts gradient descent and reverse conduction modes for updating.
And 8, repeating the steps 2 to 7 until the loss function is converged and the training is finished.
When the loss function value no longer decreases, or decreases only marginally, the loss function can be considered to have converged and training is complete. After training, the resulting first preprocessing network, second preprocessing network, affine layer and conversion model are used for subsequent speech synthesis.
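Steps 2 to 7 can be summarized by the following hedged training-step sketch; prenet1, prenet2, style_affine and converter stand for the networks described above (converter being the encoder-decoder conversion model), and the L1 loss is one reasonable choice for the Mel-feature difference.

import torch.nn.functional as F


def train_step(batch, prenet1, prenet2, style_affine, converter, optimizer):
    """One training iteration over a batch of (ppg, lf0-uv, speaker embedding, Mel)."""
    ppg, lf0_uv, speaker_emb, mel_target = batch

    content = prenet1(ppg) + prenet2(lf0_uv)       # step 4: comprehensive feature
    gamma, beta = style_affine(speaker_emb)        # step 5: style parameters
    mel_pred = converter(content, gamma, beta)     # step 6: converted Mel feature

    loss = F.l1_loss(mel_pred, mel_target)         # difference to the original Mel
    optimizer.zero_grad()
    loss.backward()                                # step 7: back-propagation
    optimizer.step()                               # gradient descent update
    return loss.item()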
The synthesis method comprises the following steps:
s9, extracting the ppg characteristics of the converted audio and sending the ppg characteristics into a first preprocessing network;
s10, extracting fundamental frequency logarithmic features of the converted audio and any audio of the target speaker, calculating a mean value and a variance, and performing linear mapping according to a formula to obtain a mapped feature lf 0':
lf0′ = (lf0_s − µ_s) · σ_t / σ_s + µ_t    (5)
where lf0_s is the fundamental frequency logarithmic feature of the audio to be converted, µ_s is the mean of the fundamental frequency logarithmic feature of the audio to be converted, µ_t is the mean of the target speaker's fundamental frequency logarithmic feature, σ_s is the variance of the fundamental frequency logarithmic feature of the audio to be converted, and σ_t is the variance of the target speaker's fundamental frequency logarithmic feature (a code sketch of this mapping is given after the synthesis steps);
s11, splicing the mapped feature lf 0' with the unvoiced and voiced sound marks of the converted audio to obtain a fundamental frequency-unvoiced and voiced sound splicing feature;
sending the fundamental frequency-unvoiced and voiced sound splicing characteristics into a second preprocessing network for processing;
adding the results of the first and second preprocessing network processing to obtain a comprehensive characteristic;
acquiring a speaker code of a target speaker, and transforming the speaker code through an affine layer to obtain a mean value simulation feature gamma and a variance simulation feature beta;
s12, inputting the mean value simulation feature gamma and the variance simulation feature beta obtained in the step S11 as conversion model styles, inputting the comprehensive features obtained in the step S11 as conversion model contents, and generating converted Mel features through a conversion model; the mel features are input into the vocoder and converted into audio.
The first preprocessing network, the second preprocessing network, the affine layer and the conversion model in the steps S9-S12 are obtained after the training of the training method is completed.
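The log-F0 mapping of equation (5), used in step S10, can be sketched as follows; the sigma values are treated as standard-deviation-like scales and the voiced-frame masking is an assumption of this sketch rather than something the patent specifies.

import numpy as np


def map_lf0(lf0_source, mu_t, sigma_t):
    """Equation (5) sketch: map the source log-F0 statistics onto the target
    speaker's statistics (mu_t, sigma_t), leaving unvoiced frames untouched."""
    voiced = lf0_source > 0
    mu_s = lf0_source[voiced].mean()
    sigma_s = lf0_source[voiced].std()

    lf0_mapped = lf0_source.copy()
    lf0_mapped[voiced] = (lf0_source[voiced] - mu_s) / sigma_s * sigma_t + mu_t
    return lf0_mapped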
The invention also discloses a conversion device capable of converting any voice into a plurality of voices, comprising a ppg feature extraction module, an LF0 feature extraction module and a speaker code extraction module, which are used respectively to extract the ppg features of step 3, the lf0-uv features of step 4 and the speaker code of step 5.
The ppg characteristic extraction module and the LF0 characteristic extraction module are respectively connected with a first preprocessing network and a second preprocessing network, the first preprocessing network and the second preprocessing network are also connected with two input ends of an adder, and the output end of the adder is connected with the content input end of the conversion model;
the adder outputs the comprehensive features to a conversion model, the speaker code extraction module is connected with an affine layer, and the affine layer is connected with a style input end of the conversion model; and the output end of the conversion model is connected with the vocoder.
The instance normalization layer is calculated as follows:
µ_c = (1/W) · Σ_{n=1}^{W} M_c[n]    (1)
σ_c² = (1/W) · Σ_{n=1}^{W} (M_c[n] − µ_c)²    (2)
The normalized feature map is:
M̂_c[n] = (M_c[n] − µ_c) / √(σ_c² + ε)    (3)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c denotes the c-th channel of the feature map, W is the dimension of each channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, and M̂_c[n] is the normalized feature value;
the conversion model adopts an encoder-decoder network framework, which comprises an encoding network and a decoding network;
the decoding network part comprises a convolution layer and an active layer, wherein the convolution layer is connected with the active layer, and the active layer is connected with an adaptive instance normalization layer.
The adaptive instance normalization layer is calculated according to formula (4):
AdaIN(M_c[n]) = γ_c · (M_c[n] − µ_c) / √(σ_c² + ε) + β_c = γ_c · M̂_c[n] + β_c    (4)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, M̂_c[n] is the normalized feature value, and γ_c and β_c are the values of the c-th channel of the mean simulation feature γ and the variance simulation feature β obtained in step 5.
By adopting the device, the voice conversion method can be realized. When the PPG feature extraction module is used in multiple languages, a splicing module is connected between the PPG feature extraction module and the first preprocessing network, and the PPG features of different languages are spliced and then input into the first preprocessing network.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
Prepare the training corpus: multi-speaker Chinese audio data and multi-speaker English audio data, with speaker identifiers labeled.
The ppg features are extracted with WeNet (a production-oriented speech recognition toolkit open-sourced by the Mobvoi speech team together with the speech laboratory of Northwestern Polytechnical University); the ppg features correspond to the output of the encoder layer of the WeNet model.
When conversion among multiple languages is involved, the corresponding ppg features are obtained through the WeNet model of each language and the multi-language ppg features are spliced. The WeNet model can be trained from scratch, or a publicly released pre-trained model can be used for ppg extraction. When the WeNet model performs speech recognition it down-samples the audio features; in this embodiment the down-sampling factor is set to 4.
The spliced Chinese-English mixed ppg features are fed into the prenet1 network. In this embodiment prenet1 uses three one-dimensional convolutional layers, each followed by an activation layer and an instance normalization layer (IN layer). After the three convolution blocks, the result is added to the ppg values obtained in step 2 to give the prenet1 output prenet1_out.
The lf0-uv features are extracted with a frame length consistent with that of the acoustic features fed into the recognition model in step 1. The lf0-uv features are sent into the prenet2 network, which likewise uses three blocks of one-dimensional convolution + ReLU + IN layers, with the number of output channels of its last layer set to the same value as that of prenet1.
The convolution strides of the last two layers are set to [2, 2], so that after down-sampling the output has the same frame dimension as the ppg features of step 2. The output, denoted prenet2_out, is added to prenet1_out.
In this embodiment, the method of "Generalized End-to-End Loss for Speaker Verification" (L. Wan, Q. Wang et al., ICASSP 2018) is adopted to train a deep-learning multi-speaker coding model. The speaker of each audio recording is encoded as a 256-dimensional vector, which is passed through two fully connected layers acting as the affine layers to obtain the mean simulation feature γ and the variance simulation feature β.
The encoder-decoder network in this embodiment follows the paper "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search" (Jaehyeon Kim, Sungwon Kim et al., NeurIPS 2020): the input of the encoder is replaced by the output of step 4, speaker information is added to the decoder, and the AdaIN layer of formula (4) is embedded in the decoder's coupling layers. The decoder output is the Mel spectrogram feature.
In this embodiment, a WaveRNN vocoder model as described in "Efficient Neural Audio Synthesis" (N. Kalchbrenner, E. Elsen, K. Simonyan, S. Dieleman, K. Kavukcuoglu et al., ICML 2018) is used as the vocoder.
Once the encoder-decoder network and the other relevant models have been obtained through the above steps, for any source audio signal it suffices to extract its Chinese and English ppg features and perform model inference as described. The converted lf0 feature is computed from the source audio and the target speaker's audio according to formula (5), and the target speaker code is obtained as in step 5; the source audio can then be converted into the target speaker's voice through the encoder-decoder network and the vocoder model.
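Putting the pieces together, a hedged end-to-end inference sketch for this embodiment might look as follows; every model object passed in (ppg_extractor, prenet1, prenet2, style_affine, converter, vocoder) is assumed to be a trained component as described above, and lf0_uv_features and map_lf0 are the illustrative helpers sketched earlier.

import torch


@torch.no_grad()
def convert(source_audio, sample_rate, target_speaker_emb, target_mu, target_sigma,
            ppg_extractor, prenet1, prenet2, style_affine, converter, vocoder):
    ppg = ppg_extractor(source_audio, sample_rate)                 # step S9, e.g. WeNet encoder output

    lf0_uv = lf0_uv_features(source_audio, sample_rate)            # source lf0-uv feature
    lf0_uv[:, 0] = map_lf0(lf0_uv[:, 0], target_mu, target_sigma)  # steps S10-S11, eq. (5)
    lf0_uv = torch.from_numpy(lf0_uv).T.unsqueeze(0)               # (1, 2, frames)

    content = prenet1(ppg) + prenet2(lf0_uv)       # prenet2 down-samples to the ppg frame rate
    gamma, beta = style_affine(target_speaker_emb)                 # style input
    mel = converter(content, gamma, beta)                          # step S12: converted Mel
    return vocoder(mel)                                            # waveform in the target voice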
The foregoing describes preferred embodiments of the present invention. Where they are not plainly contradictory, the preferred embodiments may be combined with one another in any manner. The specific parameters in the embodiments and examples serve only to illustrate the inventors' verification process clearly and are not intended to limit the scope of the invention, which is defined by the claims; equivalent structural changes made on the basis of the description and drawings of the present invention are likewise intended to fall within its scope.

Claims (7)

1. A conversion method for converting an arbitrary speech into a plurality of speeches, comprising a training method and a synthesis method, wherein the training method comprises the steps of:
step 1, preparing corpora of a plurality of target speakers as training corpora, wherein each corpus comprises audio and corresponding speaker information, and extracting original Mel characteristics of the training corpora;
building a first preprocessing network, a second preprocessing network, an affine layer and a conversion model; the number of output channels of the first preprocessing network and the second preprocessing network is the same, and the down-sampling rate of the second preprocessing network is consistent with the down-sampling rate when the ppg characteristics of the training corpus are extracted;
wherein the first and second pre-processing networks comprise an instance normalization layer;
the instance normalization layer is calculated as follows:
µ_c = (1/W) · Σ_{n=1}^{W} M_c[n]    (1)
σ_c² = (1/W) · Σ_{n=1}^{W} (M_c[n] − µ_c)²    (2)
the normalized feature map is:
M̂_c[n] = (M_c[n] − µ_c) / √(σ_c² + ε)    (3)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c denotes the c-th channel of the feature map, W is the dimension of each channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, and M̂_c[n] is the normalized feature value;
step 2, extracting the ppg characteristics of each training corpus;
step 3, feeding the obtained ppg features into the first preprocessing network for processing;
step 4, calculating fundamental frequency features f0 of the training corpus audio data, taking logarithmic value logf0 to obtain fundamental frequency logarithmic features lf0, calculating voiced and unvoiced sound marks of the audio data, and splicing the fundamental frequency logarithmic features and the unvoiced and voiced sound marks to obtain fundamental frequency-unvoiced and voiced sound splicing features lf 0-uv;
sending the fundamental frequency-unvoiced and voiced sound splicing characteristics into a second preprocessing network for processing;
adding the results of the first and second preprocessing network processing to obtain a comprehensive characteristic;
step 5, acquiring the encoding characteristics of the target speaker in the training set, and transforming the encoding characteristics through an affine layer to obtain a mean value simulation characteristic gamma and a variance simulation characteristic beta;
step 6, inputting the mean value simulation characteristic gamma and the variance simulation characteristic beta obtained in the step 5 as the style of a conversion model, inputting the comprehensive characteristics obtained in the step 4 as the content of the conversion model, and generating converted Mel characteristics through the conversion model;
calculating a loss function according to the converted Mel feature and the original Mel feature;
the conversion model adopts a coder-decoder network framework, and comprises a coding network and a decoding network;
the coding network part codes the comprehensive characteristics obtained in the step 4, and the decoding network part decodes the coding result output by the coding network to obtain corresponding Mel characteristics;
the decoding network part comprises a convolution layer and an activation layer, the convolution layer is connected with the activation layer, and the activation layer is connected with a self-adaptive instance normalization layer;
the adaptive instance normalization layer is calculated according to formula (4):
AdaIN(M_c[n]) = γ_c · (M_c[n] − µ_c) / √(σ_c² + ε) + β_c = γ_c · M̂_c[n] + β_c    (4)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, M̂_c[n] is the normalized feature value, and γ_c and β_c are the values of the c-th channel of the mean simulation feature γ and the variance simulation feature β obtained in step 5;
step 7, using the loss function to update the first preprocessing network, the second preprocessing network, the affine layer and the conversion model;
and 8, repeating the steps 2 to 7 until the loss function is converged and the training is finished.
2. The method of claim 1, wherein the updating in step 7 is performed by gradient descent with back-propagation.
3. The method as claimed in claim 1, wherein the corpus in step 1 contains corpora of different languages, and in step 2 the ppg features of each language are extracted and then concatenated, after which step 3 is performed.
4. The method according to claim 1, wherein WeNet is used to extract the ppg features of the corpus in step 2.
5. The method of converting from an arbitrary speech to a plurality of speeches according to claim 1, wherein said synthesizing method comprises the steps of:
s9, extracting the ppg characteristics of the converted audio and sending the ppg characteristics into a first preprocessing network;
s10, extracting fundamental frequency logarithmic features of the converted audio and any audio of the target speaker, calculating a mean value and a variance, and performing linear mapping according to a formula to obtain a mapped feature lf 0':
lf0′ = (lf0_s − µ_s) · σ_t / σ_s + µ_t    (5)
where lf0_s is the fundamental frequency logarithmic feature of the audio to be converted, µ_s is the mean of the fundamental frequency logarithmic feature of the audio to be converted, µ_t is the mean of the target speaker's fundamental frequency logarithmic feature, σ_s is the variance of the fundamental frequency logarithmic feature of the audio to be converted, and σ_t is the variance of the target speaker's fundamental frequency logarithmic feature;
s11, splicing the mapped feature lf 0' with the unvoiced and voiced sound marks of the converted audio to obtain a fundamental frequency-unvoiced and voiced sound splicing feature;
sending the fundamental frequency-unvoiced and voiced sound splicing characteristics into a second preprocessing network for processing;
adding the results of the first and second preprocessing network processing to obtain a comprehensive characteristic;
acquiring the encoding characteristics of a speaker of a target speaker, and transforming the encoding characteristics of the speaker through an affine layer to obtain a mean value simulation characteristic gamma and a variance simulation characteristic beta;
s12, inputting the mean value simulation feature gamma and the variance simulation feature beta obtained in the step S11 as conversion model styles, inputting the comprehensive features obtained in the step S11 as conversion model contents, and generating converted Mel features through a conversion model; converting the Mel characteristic input vocoder into audio;
the first preprocessing network, the second preprocessing network, the affine layer and the conversion model in the steps S9-S12 are obtained after the training of the training method is completed.
6. A conversion device capable of converting any voice into a plurality of voices, characterized by comprising a ppg feature extraction module, an LF0 feature extraction module and a speaker coding extraction module, wherein the ppg feature extraction module is used for extracting ppg features and the LF0 feature extraction module is used for extracting the fundamental frequency-unvoiced and voiced sound splicing feature; the ppg feature extraction module and the LF0 feature extraction module are connected to a first preprocessing network and a second preprocessing network respectively, the first preprocessing network and the second preprocessing network are further connected to two input ends of an adder, and the adder adds the features received at its input ends;
the output end of the adder is connected with the content input end of the conversion model;
the speaker code extraction module is used for extracting speaker code features, and is connected with an affine layer which is connected with the style input end of the conversion model; the output end of the conversion model is connected with a vocoder;
the first preprocessing network and the second preprocessing network comprise an instance normalization layer;
the instance normalization layer is calculated as follows:
µ_c = (1/W) · Σ_{n=1}^{W} M_c[n]    (1)
σ_c² = (1/W) · Σ_{n=1}^{W} (M_c[n] − µ_c)²    (2)
the normalized feature map is:
M̂_c[n] = (M_c[n] − µ_c) / √(σ_c² + ε)    (3)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c denotes the c-th channel of the feature map, W is the dimension of each channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, and M̂_c[n] is the normalized feature value;
the conversion model adopts an encoder-decoder network framework, which comprises an encoding network and a decoding network;
the decoding network part comprises a convolution layer and an activation layer, the convolution layer is connected with the activation layer, and the activation layer is connected with a self-adaptive instance normalization layer;
the adaptive instance normalization layer is calculated according to formula (4):
AdaIN(M_c[n]) = γ_c · (M_c[n] − µ_c) / √(σ_c² + ε) + β_c = γ_c · M̂_c[n] + β_c    (4)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, M̂_c[n] is the normalized feature value, and γ_c and β_c are the values of the c-th channel of the mean simulation feature γ and the variance simulation feature β obtained in step 5;
the LF0 feature extraction module extracts fundamental frequency logarithmic features of the converted audio and any audio of the target speaker, calculates mean and variance, and performs linear mapping according to a formula to obtain mapped features LF 0':
lf0′ = (lf0_s − µ_s) · σ_t / σ_s + µ_t    (5)
where lf0_s is the fundamental frequency logarithmic feature of the audio to be converted, µ_s is the mean of the fundamental frequency logarithmic feature of the audio to be converted, µ_t is the mean of the target speaker's fundamental frequency logarithmic feature, σ_s is the variance of the fundamental frequency logarithmic feature of the audio to be converted, and σ_t is the variance of the target speaker's fundamental frequency logarithmic feature.
7. The apparatus according to claim 6, wherein a concatenation module is further connected between the PPG feature extraction module and the first preprocessing network, and the concatenation module concatenates PPG features of different languages and inputs the result to the first preprocessing network.
CN202111035937.9A 2021-09-06 2021-09-06 Conversion method and device capable of converting any voice into multiple voices Active CN113470622B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111035937.9A CN113470622B (en) 2021-09-06 2021-09-06 Conversion method and device capable of converting any voice into multiple voices

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111035937.9A CN113470622B (en) 2021-09-06 2021-09-06 Conversion method and device capable of converting any voice into multiple voices

Publications (2)

Publication Number Publication Date
CN113470622A CN113470622A (en) 2021-10-01
CN113470622B true CN113470622B (en) 2021-11-19

Family

ID=77867524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111035937.9A Active CN113470622B (en) 2021-09-06 2021-09-06 Conversion method and device capable of converting any voice into multiple voices

Country Status (1)

Country Link
CN (1) CN113470622B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782052A (en) * 2021-11-15 2021-12-10 北京远鉴信息技术有限公司 Tone conversion method, device, electronic equipment and storage medium
CN114333865A (en) * 2021-12-22 2022-04-12 广州市百果园网络科技有限公司 Model training and tone conversion method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111785261A (en) * 2020-05-18 2020-10-16 南京邮电大学 Cross-language voice conversion method and system based on disentanglement and explanatory representation
CN112133278A (en) * 2020-11-20 2020-12-25 成都启英泰伦科技有限公司 Network training and personalized speech synthesis method for personalized speech synthesis model
CN112259072A (en) * 2020-09-25 2021-01-22 北京百度网讯科技有限公司 Voice conversion method and device and electronic equipment
CN112767958A (en) * 2021-02-26 2021-05-07 华南理工大学 Zero-learning-based cross-language tone conversion system and method

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10176819B2 (en) * 2016-07-11 2019-01-08 The Chinese University Of Hong Kong Phonetic posteriorgrams for many-to-one voice conversion
CN106504741B (en) * 2016-09-18 2019-10-25 广东顺德中山大学卡内基梅隆大学国际联合研究院 A kind of phonetics transfer method based on deep neural network phoneme information
CN110930981A (en) * 2018-09-20 2020-03-27 深圳市声希科技有限公司 Many-to-one voice conversion system
CN109346107B (en) * 2018-10-10 2022-09-30 中山大学 LSTM-based method for inversely solving pronunciation of independent speaker
KR20200094493A (en) * 2019-01-30 2020-08-07 김남형 Operating Method for Voice-Conversion Application with Phonetic-Posteriorgram Extractor , TTS and Vocoder
CN110223705B (en) * 2019-06-12 2023-09-15 腾讯科技(深圳)有限公司 Voice conversion method, device, equipment and readable storage medium
CN111462769B (en) * 2020-03-30 2023-10-27 深圳市达旦数生科技有限公司 End-to-end accent conversion method
CN111862939A (en) * 2020-05-25 2020-10-30 北京捷通华声科技股份有限公司 Prosodic phrase marking method and device
CN113345452B (en) * 2021-04-27 2024-04-26 北京搜狗科技发展有限公司 Voice conversion method, training method, device and medium of voice conversion model
CN113345431A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Cross-language voice conversion method, device, equipment and medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111785261A (en) * 2020-05-18 2020-10-16 南京邮电大学 Cross-language voice conversion method and system based on disentanglement and explanatory representation
CN112259072A (en) * 2020-09-25 2021-01-22 北京百度网讯科技有限公司 Voice conversion method and device and electronic equipment
CN112133278A (en) * 2020-11-20 2020-12-25 成都启英泰伦科技有限公司 Network training and personalized speech synthesis method for personalized speech synthesis model
CN112767958A (en) * 2021-02-26 2021-05-07 华南理工大学 Zero-learning-based cross-language tone conversion system and method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A Modularized Neural Network with Language-Specific Output Layers for Cross-Lingual Voice Conversion; Yi Zhou et al.; 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU); 2020-02-20; full text *
FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech; Yi Ren et al.; arxiv.org/abs/2006.04558v6; 2021-03-04; full text *
Generalized End-to-End Loss for Speaker Verification; Li Wan et al.; 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2018-09-13; full text *
Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram; Shengkui Zhao et al.; ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2021-05-13; full text *
Research on Voice Conversion Based on Deep Learning; 赖家豪; China Master's Theses Full-text Database; 2018-12-31; full text *

Also Published As

Publication number Publication date
CN113470622A (en) 2021-10-01

Similar Documents

Publication Publication Date Title
JP7436709B2 (en) Speech recognition using unspoken text and speech synthesis
US20230043916A1 (en) Text-to-speech processing using input voice characteristic data
JP2022107032A (en) Text-to-speech synthesis method using machine learning, device and computer-readable storage medium
CN108899009B (en) Chinese speech synthesis system based on phoneme
CN112863483A (en) Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
US20070213987A1 (en) Codebook-less speech conversion method and system
JP2024023421A (en) Two-level speech prosody transfer
JP7228998B2 (en) speech synthesizer and program
CN113470622B (en) Conversion method and device capable of converting any voice into multiple voices
KR20230133362A (en) Generate diverse and natural text-to-speech conversion samples
Moon et al. Mist-tacotron: End-to-end emotional speech synthesis using mel-spectrogram image style transfer
CN112509550A (en) Speech synthesis model training method, speech synthesis device and electronic equipment
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
Dua et al. Spectral warping and data augmentation for low resource language ASR system under mismatched conditions
Nosek et al. Cross-lingual neural network speech synthesis based on multiple embeddings
Zhao et al. Research on voice cloning with a few samples
Bae et al. Hierarchical and multi-scale variational autoencoder for diverse and natural non-autoregressive text-to-speech
WO2010104040A1 (en) Voice synthesis apparatus based on single-model voice recognition synthesis, voice synthesis method and voice synthesis program
Venkatagiri Speech recognition technology applications in communication disorders
CN115359778A (en) Confrontation and meta-learning method based on speaker emotion voice synthesis model
WO2022039636A1 (en) Method for synthesizing speech and transmitting the authentic intonation of a clonable sample
Chen et al. Diffusion transformer for adaptive text-to-speech
JP2021085943A (en) Voice synthesis device and program
CN113628609A (en) Automatic audio content generation
KR102426020B1 (en) Method and apparatus for Speech Synthesis Containing Emotional Rhymes with Scarce Speech Data of a Single Speaker

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant