WO2021021305A1 - Obtaining a singing voice detection model - Google Patents

Obtaining a singing voice detection model

Info

Publication number
WO2021021305A1
Authority
WO
WIPO (PCT)
Prior art keywords
singing voice
detection model
speech
clips
voice detection
Prior art date
Application number
PCT/US2020/036869
Other languages
English (en)
French (fr)
Inventor
Yuanbo HOU
Jian Luan
Kao-Ping SOONG
Original Assignee
Microsoft Technology Licensing, Llc
Priority date
Filing date
Publication date
Application filed by Microsoft Technology Licensing, Llc
Publication of WO2021021305A1

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00Details of electrophonic musical instruments
    • G10H1/0008Associated control or indicating means
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/056Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction or identification of individual instrumental parts, e.g. melody, chords, bass; Identification or separation of instrumental parts by their characteristic voices or timbres
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/311Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/315Sound category-dependent sound synthesis processes [Gensound] for musical use; Sound category-specific synthesis-controlling parameters or control means therefor
    • G10H2250/455Gensound singing voices, i.e. generation of human voices for musical applications, vocal singing sounds or intelligible words at a desired pitch or with desired vocal effects, e.g. by phoneme synthesis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals

Definitions

  • Singing voice detection techniques may be used for determining endpoints of singing voice in music clips, e.g., determining singing voice regions and non-singing voice regions in polyphonic music clips, etc.
  • a polyphonic music clip may refer to an audio clip containing singing voices and accompaniments that are mixed together.
  • Successful detection of singing voice regions in polyphonic music clips is critical to Music Information Retrieval (MIR) tasks.
  • MIR tasks may comprise, e.g., music summarization, music retrieval, music annotation, music genre classification, singing voice separation, etc.
  • Embodiments of the present disclosure propose methods and apparatuses for obtaining a singing voice detection model.
  • a plurality of speech clips and a plurality of instrumental music clips may be synthesized into a plurality of audio clips.
  • a speech detection model may be trained with the plurality of audio clips. At least a part of the speech detection model may be transferred to a singing voice detection model.
  • the singing voice detection model may be trained with a set of polyphonic music clips.
  • FIG.1 illustrates an exemplary application of singing voice detection according to an embodiment.
  • FIG.2 illustrates an exemplary application of singing voice detection according to an embodiment.
  • FIG.3 illustrates an exemplary process for obtaining a singing voice detection model based on transfer learning according to an embodiment.
  • FIG.4 illustrates an exemplary implementation of a speech detection model according to an embodiment.
  • FIG.5 illustrates an exemplary implementation of a singing voice detection model according to an embodiment.
  • FIG.6 illustrates a flowchart of an exemplary method for obtaining a singing voice detection model according to an embodiment.
  • FIG.7 illustrates an exemplary apparatus for obtaining a singing voice detection model according to an embodiment.
  • FIG.8 illustrates an exemplary apparatus for obtaining a singing voice detection model according to an embodiment.
  • Deep Neural Networks may be used for estimating an Ideal Binary Spectrogram Mask that represents spectrogram bins in which singing voices are more prominent than accompaniments.
  • a temporal and timbre feature-based model may be established based on a Convolutional Neural Network (CNN), for boosting the performance in MIR.
  • Recurrent Neural Networks (RNN) may be employed to predict soft masks that are multiplied with the original signal to obtain a desired isolated region.
  • the training of the above systems requires a large-scale, accurately labeled polyphonic music clip dataset, in which endpoints of singing voices, accompaniments, etc. are annotated at the frame level.
  • such a large-scale labeled dataset is usually not available, and manual labeling is time-consuming and expensive. Therefore, only a small-scale labeled polyphonic music clip dataset may be practically used for training these systems.
  • transfer learning has been proposed to extract knowledge learned from a source task and apply the knowledge to a similar but different target task.
  • the transfer learning may alleviate the problem of insufficient training data for the target task and tends to produce a more generalized model.
  • a CNN for music annotation may be trained based on a dataset containing different genres of songs, and then be transferred to other music-related classification and regression tasks, e.g., singing voice detection.
  • transfer learning-based singing voice detection may only transfer singing voice knowledge among different genres of songs.
  • Embodiments of the present disclosure propose knowledge transfer from speech to singing voice.
  • a speech detection model for a source task of speech detection may be trained firstly, and then a part of the speech detection model may be transferred to a singing voice detection model for a target task of singing voice detection, and the singing voice detection model may be further trained with a small amount of labeled polyphonic music clips.
  • Transferring of latent representations learned from speech clips may improve the performance of singing voice detection.
  • the learned latent representations will retain relevant information of the source task of speech detection and transfer the information to the target task of singing voice detection.
  • sharing of knowledge between speeches in the source task and singing voices in the target task may enable the singing voice detection model to understand human voices, including speech, singing voice, etc., in a more general and robust manner.
  • a speech clip may comprise only voices of human speaking, and an instrumental music clip may comprise only sounds of instruments being played. Speech clips and instrumental music clips may be synthesized together to form a large-scale audio clip training dataset for training the speech detection model.
  • the singing voice detection model may be further trained or optimized with a polyphonic music clip training dataset containing a small number of labeled polyphonic music clips. Benefiting from the knowledge transferred from speech detection, although only a small number of labeled polyphonic music clips is used, the obtained singing voice detection model will still have higher accuracy than conventional singing voice detection models.
  • the speech detection model may employ, e.g., a CNN to perform the source task of distinguishing between speech and non-speech in an audio clip.
  • the singing voice detection model may employ, e.g., a convolutional recurrent neural network (CRNN) to perform the target task of singing voice detection in a polyphonic music clip.
  • at least a part of the CNN, e.g., at least some convolutional layers, in the speech detection model may be transferred to the CRNN of the singing voice detection model.
  • Different knowledge transfer modes may be employed.
  • in one mode, the part of the singing voice detection model that is transferred from the speech detection model may retain its original parameters.
  • in another mode, the parameters of the part of the singing voice detection model that is transferred from the speech detection model may be adapted or refined with the polyphonic music clip training dataset.
  • the embodiments of the present disclosure overcome the problem of insufficient training data for training a singing voice detection model, make the obtained singing voice detection model contain voice knowledge in both speech and singing voice, and enable feature extraction to represent voices more efficiently.
  • the proposed transfer learning approach may enable the feature extraction trained in the source task to be efficiently adapted to the target task, and different knowledge transfer modes may be employed.
  • the singing voice detection model obtained according to the embodiments of the present disclosure may be applied for various scenarios.
  • the singing voice detection model may be applied in an intelligent singing assistance system with a function of automatically helping singing.
  • based on the detection result, the system may prompt the lyrics in real time or automatically play the voice of the next sentence from the original song.
  • the singing voice detection model may be applied for a pre-processing for the separation of singing voices from accompaniments.
  • the singing voice detection model may detect at least regions that need not be separated in a polyphonic music clip, e.g., singing-voice-only regions or accompaniment-only regions, thereby reducing the amount of processing in the separation of singing voices from accompaniments and improving the efficiency of the separation.
  • the singing voice detection model may be applied for music structure decomposition. For example, singing voice parts, accompaniment parts, silence or mute parts, etc. in a target music may be identified with at least the singing voice detection model.
  • the singing voice detection model may be applied for a pre-processing for music recommendation, song library management, etc.
  • the singing voice detection model may be used for segmenting music or songs in a music library or song library in advance to extract a series of regions containing singing voice. These extracted singing voice regions facilitate efficient retrieval of corresponding music or songs during music recommendation, song library management, etc.
  • FIG.1 illustrates an exemplary application 100 of singing voice detection according to an embodiment.
  • a singing voice detection model obtained according to an embodiment of the present disclosure may be used for detecting singing voice regions and non-singing voice regions in a polyphonic music clip.
  • a singing voice region may refer to a region including a singing voice of a singer in a polyphonic music clip
  • a non-singing voice region may refer to a region not including a singing voice of a singer in a polyphonic music clip.
  • Each singing voice region may be defined by corresponding singing voice endpoints, e.g., defined by a singing voice start timepoint and a singing voice end timepoint.
  • Each non-singing voice region may be defined by corresponding non-singing voice endpoints, e.g., defined by a non-singing voice start timepoint and a non-singing voice end timepoint.
  • the singing voice detection model may perform singing voice detection based on spectrograms.
  • a waveform of the polyphonic music clip to be detected may be firstly converted into a spectrogram.
  • the spectrogram may be further provided to the singing voice detection model as input.
  • the singing voice detection model may generate a detection result by processing the spectrogram, wherein the detection result identifies singing-voice regions and non-singing voice regions in the polyphonic music clip.
  • the singing voice detection model may achieve binary classification of frames in the polyphonic music clip, e.g., classifying each frame as singing voice or non-singing voice. After classifying the frames, adjacent frames having the same category may be collectively identified as a singing voice region or a non-singing voice region, thereby forming the final detection result.
  • the detection result may comprise: identifying a region from time t1 to time t2 as a non-singing voice region; identifying a region from time t2 to time t3 as a singing voice region; identifying a region from time t3 to time t4 as a non-singing voice region; and identifying a region from time t4 to time t5 as a singing voice region, etc.
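  • As an illustration of this post-processing (a minimal sketch, not taken from the patent text), the following Python snippet merges hypothetical frame-level binary predictions into labeled regions; the 10 ms frame hop is an assumed value.

```python
from itertools import groupby

def frames_to_regions(frame_labels, hop_seconds=0.01):
    """Merge per-frame binary labels (1 = singing voice, 0 = non-singing voice)
    into contiguous regions defined by start/end timepoints (assumed 10 ms hop)."""
    regions = []
    start = 0
    for label, group in groupby(frame_labels):
        end = start + len(list(group))
        regions.append({
            "label": "singing voice" if label == 1 else "non-singing voice",
            "start_s": start * hop_seconds,
            "end_s": end * hop_seconds,
        })
        start = end
    return regions

# Example: four non-singing frames followed by six singing voice frames
print(frames_to_regions([0, 0, 0, 0, 1, 1, 1, 1, 1, 1]))
```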
  • FIG.2 illustrates an exemplary application 200 of singing voice detection according to an embodiment.
  • a singing voice detection model obtained according to an embodiment of the present disclosure may be used for detecting singing voice regions, accompaniment regions and silence regions in a polyphonic music clip.
  • a singing voice region may refer to a region including a singing voice of a singer in a polyphonic music clip
  • an accompaniment region may refer to a region including sounds of instruments being played in a polyphonic music clip
  • a silence region may refer to a region not including any sounds in a polyphonic music clip.
  • Each singing voice region may be defined by corresponding singing voice endpoints, e.g., defined by a singing voice start timepoint and a singing voice end timepoint.
  • Each accompaniment region may be defined by corresponding accompaniment endpoints, e.g., defined by an accompaniment start timepoint and an accompaniment end timepoint.
  • Each silence region may be defined by corresponding silence endpoints, e.g., defined by a silence start timepoint and a silence end timepoint.
  • the singing voice detection model may perform singing voice detection based on spectrograms.
  • a waveform of the polyphonic music clip to be detected may be firstly converted into a spectrogram.
  • the spectrogram may be further provided to the singing voice detection model as an input feature.
  • the singing voice detection model may generate a detection result by processing the spectrogram, wherein the detection result identifies singing voice regions, accompaniment regions and silence regions in the polyphonic music clip.
  • the singing voice detection model may achieve triple classification of frames in the polyphonic music clip, e.g., classifying each frame as at least one of singing voice, accompaniment and silence.
  • each frame may have one or more categories, e.g., if the current frame corresponds to a singer’s singing with accompaniment, this frame may have two categories of singing voice and accompaniment.
  • adjacent frames having the same category may be collectively identified as a singing voice region, an accompaniment region or a silence region, thereby forming the final detection result.
  • the detection result may comprise: identifying a region from time t1 to time t3 as an accompaniment region; identifying a region from time t2 to time t4 as a singing voice region; identifying a region from time t4 to time t5 as a silence region; identifying a region from time t5 to time t7 as an accompaniment region; and identifying a region from time t6 to time t7 as a singing voice region, etc.
  • the accompaniment region overlaps with the singing voice region between time t2 and time t3, which indicates that the polyphonic music clip comprises both singing voice and accompaniment between time t2 and time t3.
  • the singing voice detection tasks involved in the present disclosure are not limited to these exemplary applications, but may also cover any applications that aim to detect singing voice regions and one or more types of other annotated regions in a polyphonic music clip.
  • FIG.3 illustrates an exemplary process 300 for obtaining a singing voice detection model based on transfer learning according to an embodiment.
  • transfer learning is used for extracting voice knowledge from a source task of speech detection, and applying the extracted voice knowledge to a target task of singing voice detection to perform singing voice detection.
  • the problem that training data for the target task of singing voice detection is insufficient to train a good singing voice detection model may be overcome.
  • a CNN in a speech detection model may be trained for detecting speech regions in a synthesized audio clip. Voice knowledge learned from a large-scale audio clip training dataset in the source task may be transferred to the target task.
  • a small-scale polyphonic music clip training dataset containing a small number of labeled polyphonic music clips collected for the target task may be used for further training or optimizing a CRNN in the singing voice detection model, so as to perform singing voice detection in a polyphonic music clip.
  • a large number of speech clips 302 and instrumental music clips 304 may be obtained respectively.
  • the speech clips 302 may be collected from the network or obtained from any content sources, and may be any type of recording containing only voices of human speaking, e.g., speech recordings, news broadcast recordings, storytelling recordings, etc.
  • the instrumental music clips 304 may be collected from the network or obtained from any content sources, and may be any type of recording containing only sounds of instruments being played, e.g., pure music, etc.
  • the instrumental music clips 304 may also broadly comprise any non-speech sound recordings, e.g., recordings of sounds existing in nature, recordings of artificially simulated sounds, etc.
  • the speech clips 302 and the instrumental music clips 304 may be synthesized into a plurality of audio clips 306.
  • one or more speech clips and one or more instrumental music clips may be provided to a plurality of different audio tracks according to a specific timing, so as to synthesize an audio clip.
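  • A minimal sketch of such a synthesis step is shown below, assuming the clips are available as mono waveform arrays at a common sample rate; the offset and gain values are purely illustrative.

```python
import numpy as np

def synthesize_audio_clip(speech, instrumental, sr=16000,
                          speech_offset_s=2.0, speech_gain=1.0):
    """Place a speech clip onto an instrumental track at a chosen offset,
    mimicking two audio tracks mixed according to a specific timing."""
    offset = int(speech_offset_s * sr)
    length = max(len(instrumental), offset + len(speech))
    mix = np.zeros(length, dtype=np.float32)
    mix[:len(instrumental)] += instrumental
    mix[offset:offset + len(speech)] += speech_gain * speech
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 0 else mix  # normalize to avoid clipping

# speech and instrumental are assumed to be 1-D float arrays at the same
# sample rate, e.g. loaded from the speech clips 302 and instrumental clips 304.
```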
  • a large-scale audio clip training dataset 308 for training the speech detection model may be formed based on the synthesized audio clips 306.
  • Each audio clip in the audio clip training dataset 308 may comprise a plurality of frame-level labels indicating whether there exists speech.
  • speech regions, in which there exists speech, in the speech clips may be determined firstly. Each speech region is identified by a pair of speech endpoints including, e.g., a speech start timepoint and a speech end timepoint. Then, frame-level speech labels are added to frames in the speech clips based on the determined speech regions. For example, a label indicating the existence of speech is added to frames located in the speech regions, and a label indicating the absence of speech is added to frames not located in any speech region. Accordingly, the audio clips synthesized with the labeled speech clips also have a plurality of frame-level labels indicating the existence or absence of speech.
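  • The labeling step could be sketched as follows, assuming a fixed frame hop; the helper name and parameters are illustrative and not the patent's implementation.

```python
def regions_to_frame_labels(speech_regions, n_frames, hop_seconds=0.01):
    """speech_regions: list of (start_s, end_s) speech endpoints in seconds.
    Returns n_frames labels: 1 where speech exists, 0 elsewhere."""
    labels = [0] * n_frames
    for start_s, end_s in speech_regions:
        start = int(start_s / hop_seconds)
        end = min(n_frames, int(end_s / hop_seconds) + 1)
        for i in range(start, end):
            labels[i] = 1
    return labels

# e.g. a 1-second clip at a 10 ms hop with speech between 0.2 s and 0.6 s
print(sum(regions_to_frame_labels([(0.2, 0.6)], n_frames=100)))  # 41 frames labeled 1
```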
  • the audio clip training dataset 308 containing a large number of labeled synthesized audio clips may be used for training a speech detection model 310.
  • the speech detection model 310 may perform a source task for detecting speech in an audio clip.
  • the speech detection model 310 may classify each frame in an audio clip as speech or not, and may further determine speech regions and non-speech regions in the audio clip.
  • the speech detection model 310 may be based on a CNN comprising one or more convolutional layers. The CNN may be trained for recognizing speech regions in an audio clip.
  • a singing voice detection model 320 may be constructed.
  • the singing voice detection model 320 may perform a target task of singing voice detection.
  • the singing voice detection model 320 may perform a target task for detecting singing voice in a polyphonic music clip.
  • the singing voice detection model 320 may classify each frame in a polyphonic music clip as singing voice or not, and may further determine singing voice regions and non-singing voice regions in the polyphonic music clip.
  • the singing voice detection model 320 may perform a target task for detecting singing voice, accompaniment and silence in a polyphonic music clip.
  • the singing voice detection model 320 may classify each frame in a polyphonic music clip as singing voice, accompaniment and/or silence, and may further determine singing voice regions, accompaniment regions and silence regions in the polyphonic music clip.
  • the singing voice detection model 320 may be based on CRNN.
  • the CRNN may comprise, e.g., CNN 322 and RNN 324.
  • at least a part of the CNN 312 in the speech detection model 310 may be transferred to the CNN 322 in the singing voice detection model 320.
  • the entire CNN 312, e.g., all the convolutional layers, may be transferred to the singing voice detection model 320 as the CNN 322.
  • alternatively, only a part of the CNN 312, e.g., one or more convolutional layers, may be transferred to the CNN 322 as a part of the CNN 322.
  • the singing voice detection model 320 may be further trained or optimized.
  • a set of polyphonic music clips 326 may be obtained, and the set of polyphonic music clips 326 may be used for forming a polyphonic music clip training dataset 328 for training or optimizing the singing voice detection model 320.
  • the polyphonic music clip training dataset 328 may comprise only a small amount of labeled polyphonic music clips. According to different target tasks of singing voice detection performed by the singing voice detection model 320, the polyphonic music clips 326 may have corresponding frame-level labels.
  • each polyphonic music clip in the polyphonic music clip training dataset 328 may comprise a plurality of frame-level labels indicating whether there exists singing voice. For example, a label indicating the existence of singing voice is added to frames located in singing voice regions in a polyphonic music clip, and a label indicating the absence of singing voice is added to frames not located in any singing voice region. If the singing voice detection model 320 performs a target task for detecting singing voice, accompaniment and silence in a polyphonic music clip, each polyphonic music clip in the polyphonic music clip training dataset 328 may comprise a plurality of frame-level labels indicating whether there exists singing voice, accompaniment and/or silence.
  • a label indicating the existence of singing voice is added to frames located in singing voice regions in a polyphonic music clip
  • a label indicating the existence of accompaniment is added to frames located in accompaniment regions
  • a label indicating the existence of silence is added to frames located in silence regions.
  • the polyphonic music clip training dataset 328 containing labeled polyphonic music clips may be used for training or optimizing the singing voice detection model 320.
  • through the transferring, the singing voice detection model 320 may obtain the knowledge about speech learned in the source task, and through further training or optimization with the polyphonic music clip training dataset 328, it may be better adapted to the singing voice data of the target task, thereby mitigating the mismatch problem that a detection model trained only with synthesized audio clips cannot match the data of the target task well.
  • the singing voice detection model 320 obtained through the process 300 may be used for performing a singing voice detection task on input polyphonic music clips with high accuracy.
  • FIG.4 illustrates an exemplary implementation of a speech detection model according to an embodiment.
  • the speech detection model 420 shown in FIG.4 may correspond to the speech detection model 310 in FIG.3.
  • Input 410 of the speech detection model 420 may be an audio clip.
  • a waveform of the audio clip may be converted into a spectrogram, and the spectrogram is used as the input 410.
  • the audio clip may be an audio clip synthesized from speech clips and instrumental music clips.
  • the spectrogram converted from the waveform of the audio clip may be a Mel spectrogram, e.g., log Mel spectrogram, etc., which is a 2D representation that is used for approximating human auditory perception and has high computational efficiency.
  • the following discussion takes an audio clip representation in the form of log Mel spectrogram as an input feature of the speech detection model 420.
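  • For illustration, such a log Mel spectrogram input feature might be computed with librosa as sketched below; the sample rate, number of Mel bands, FFT size and hop length are assumed values not specified here.

```python
import librosa
import numpy as np

def log_mel_spectrogram(path, sr=16000, n_mels=64, n_fft=1024, hop_length=160):
    """Load an audio clip and convert its waveform into a log Mel spectrogram
    of shape (frames, mel bands), used as the 2D input feature."""
    waveform, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max).T  # time axis first
```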
  • the speech detection model 420 may be based on CNN.
  • the speech detection model 420 may comprise a CNN 430.
  • the CNN 430 may comprise one or more convolutional layers stacked in sequence, e.g., convolutional layer 432, convolutional layer 436, convolutional layer 440, etc.
  • each convolutional layer may be further attached with a corresponding pooling layer, e.g., pooling layer 434, pooling layer 438, pooling layer 442, etc.
  • pooling layers may be, e.g., max-pooling layers.
  • the structure of the CNN 430 shown in FIG.4 is only exemplary, and depending on specific application requirements or design constraints, the CNN 430 may also have any other structure, e.g., comprising more or fewer convolutional layers, omitting pooling layers, adding layers for other processes, etc.
  • the input of the CNN 430 may adopt a moving data block.
  • the moving data block may comprise the current frame, the preceding L frames of the current frame, and the succeeding L frames of the current frame.
  • the shift between consecutive blocks may be, e.g., one frame.
  • Each moving data block may contain 2L + 1 frames.
  • the value of L determines the range of context visible at each frame, which may be set empirically.
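  • A possible construction of the moving data blocks is sketched below; the value of L and the edge-padding strategy are assumptions for illustration.

```python
import numpy as np

def moving_blocks(log_mel, L=25):
    """Build one (2L + 1)-frame context block per frame, shifted by one frame
    between consecutive blocks. log_mel has shape (frames, mel_bands)."""
    padded = np.pad(log_mel, ((L, L), (0, 0)), mode="edge")  # repeat edge frames
    n_frames = log_mel.shape[0]
    return np.stack([padded[i:i + 2 * L + 1] for i in range(n_frames)])

# Resulting shape: (n_frames, 2L + 1, mel_bands), one block per frame.
```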
  • the convolutional layers in the CNN 430 may be used for extracting spatial location information.
  • the convolutional layers may learn local shift-invariant patterns from the input log Mel spectrogram feature.
  • pooling may be further applied to the frequency axis only.
  • a convolutional layer may be represented by (filters, (receptive field in time, receptive field in frequency)), e.g., (64, (3, 3)).
  • a pooling layer may be represented by (pooling length in time, pooling length in frequency), e.g., (1, 4).
  • batch normalization may be used to accelerate training convergence.
  • gated linear units (GLUs) may be applied in the convolutional layers. The GLUs provide a linear path for gradient propagation while retaining non-linear capabilities through, e.g., a sigmoid operation.
  • the GLU may be defined as Y = (W ∗ X + b) ⊙ σ(V ∗ X + c), where W and V denote convolutional filters, b and c denote biases, X denotes the input features or the feature maps of the intermediate layers, σ denotes the sigmoid function, and ⊙ denotes element-wise multiplication.
  • the speech detection model 420 may further comprise an output layer 444.
  • the output layer 444 may comprise two output units having, e.g., softmax, which may indicate whether the current input corresponds to speech. It should be appreciated that although not shown in FIG.4, a ReLU-based fully-connected layer may optionally be included between the pooling layer 442 and the output layer 444.
  • the speech detection model 420 may classify frames in an audio clip as speech or non-speech, and these classification results may form the final speech detection result 450.
  • the speech detection result 450 may be represented as frame-level speech or non-speech labels for frames in the audio clip.
  • the speech detection result 450 may be an integration of frame-level speech or non-speech labels, and is represented as speech regions and non-speech regions as identified in the audio clip.
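  • A hedged PyTorch sketch of a CNN of this general shape is given below: GLU-gated convolutional layers with (64, (3, 3)) filters, batch normalization, (1, 4) frequency-only max-pooling, and a two-unit output for speech/non-speech. The number of layers and all sizes are assumptions, since the exact configuration is not fixed here.

```python
import torch
import torch.nn as nn

class GLUConvBlock(nn.Module):
    """Gated convolution: batch-normalized (W*X + b) gated by sigmoid(V*X + c),
    followed by max-pooling along the frequency axis only."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.linear = nn.Conv2d(in_ch, out_ch, kernel_size=(3, 3), padding=1)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=(3, 3), padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.pool = nn.MaxPool2d(kernel_size=(1, 4))

    def forward(self, x):
        y = self.bn(self.linear(x)) * torch.sigmoid(self.gate(x))
        return self.pool(y)

class SpeechDetectionCNN(nn.Module):
    """Speech / non-speech classifier over (2L + 1, n_mels) log Mel blocks."""
    def __init__(self, n_mels=64):
        super().__init__()
        self.cnn = nn.Sequential(
            GLUConvBlock(1, 64), GLUConvBlock(64, 64), GLUConvBlock(64, 64))
        self.fc = nn.Linear(64, 2)  # 64 mel bands pooled by 4 * 4 * 4 -> 1 bin

    def forward(self, blocks):             # blocks: (batch, 2L + 1, n_mels)
        x = self.cnn(blocks.unsqueeze(1))  # -> (batch, 64, 2L + 1, 1)
        x = x.mean(dim=2).flatten(1)       # average over time within the block
        return self.fc(x)                  # logits for speech vs. non-speech
```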
  • FIG.5 illustrates an exemplary implementation of a singing voice detection model according to an embodiment.
  • the singing voice detection model 520 shown in FIG.5 may correspond to the singing voice detection model 320 in FIG.3.
  • Input 510 of the singing voice detection model 520 may be a polyphonic music clip.
  • a waveform of the polyphonic music clip may be converted into a spectrogram, and the spectrogram is used as the input 510.
  • the spectrogram converted from the waveform of the polyphonic music clip may be a Mel spectrogram, e.g., log Mel spectrogram.
  • the following discussion takes a polyphonic music clip representation in the form of log Mel spectrogram as an input feature of the singing voice detection model 520.
  • the singing voice detection model 520 may be based on CRNN.
  • the singing voice detection model 520 may comprise a CNN 530.
  • the CNN 530 may comprise one or more convolutional layers stacked in sequence, e.g., convolutional layer 532, convolutional layer 536, convolutional layer 540, etc.
  • the convolutional layers in the CNN 530 may be used for extracting spatial location information.
  • each convolutional layer may be further attached with a corresponding pooling layer, e.g., pooling layer 534, pooling layer 538, pooling layer 542, etc.
  • These pooling layers may be, e.g., max-pooling layers.
  • the structure of the CNN 530 shown in FIG.5 is only exemplary, and depending on specific application requirements or design constraints, the CNN 530 may also have any other structure, e.g., comprising more or fewer convolutional layers, omitting pooling layers, adding layers for other processes, etc.
  • the input of the CNN 530 may also adopt a moving data block.
  • the moving data block may comprise the current frame, the preceding L frames of the current frame, and the succeeding L frames of the current frame.
  • the shift between consecutive blocks may be, e.g., one frame.
  • Each moving data block may contain 2L + 1 frames.
  • the value of L determines the range of context visible at each frame, which may be set empirically.
  • the singing voice detection model 520 may further comprise an RNN 550.
  • the RNN 550 may learn timing information and capture long-term temporal contextual information.
  • the RNN 550 may utilize recurrent neurons, e.g., simple RNN, gated recurrent unit (GRU), long short-term memory (LSTM) network, etc., for learning the timing information.
  • a recurrent neuron in the RNN 550 may have a feedback loop for feeding the learned information back to its own neuron in order to record historical information. Therefore, at the next instant, the current information and the existing historical information may be combined to jointly make a decision.
  • in order to jointly make a decision in combination with contextual information, the RNN 550 may also be based on a bidirectional recurrent neural network.
  • for each recurrent neuron in the bidirectional recurrent neural network, the information flow propagates not only from front to back, but also from back to front, so that the recurrent neuron may know past information and future information within a certain time range, thereby making better decisions.
  • the singing voice detection model 520 may further comprise an output layer 552.
  • the output layer 552 may generate a classification result for the current input.
  • the classification result may be singing voice or non-singing voice, or may be singing voice, accompaniment, or silence.
  • the classification result generated by the singing voice detection model 520 may form a final singing voice detection result 560.
  • the singing voice detection result 560 may be represented as frame-level classification labels for frames in the polyphonic music clip, e.g., singing voice or non-singing voice, or, e.g., singing voice, accompaniment or silence.
  • the singing voice detection result 560 may be an integration of frame-level classification results, and is represented as singing voice regions and non-singing voice regions, or singing voice regions, accompaniment regions and silent regions, as identified in the polyphonic music clip.
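  • A corresponding CRNN sketch is given below, reusing the convolutional stack from the CNN sketch above and adding a bidirectional GRU and a classification head; the three output classes (singing voice, accompaniment, silence) and all sizes are assumptions.

```python
import torch.nn as nn

class SingingVoiceCRNN(nn.Module):
    """CNN feature extractor (optionally transferred from the speech model)
    followed by a bidirectional GRU and a classification output layer."""
    def __init__(self, cnn, n_classes=3, hidden=64):
        super().__init__()
        self.cnn = cnn                       # e.g. SpeechDetectionCNN().cnn
        self.rnn = nn.GRU(input_size=64, hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, blocks):               # blocks: (batch, 2L + 1, n_mels)
        x = self.cnn(blocks.unsqueeze(1))    # -> (batch, 64, 2L + 1, 1)
        x = x.squeeze(-1).transpose(1, 2)    # -> (batch, 2L + 1, 64) sequence
        h, _ = self.rnn(x)                   # capture temporal context
        center = h.size(1) // 2              # classify the block's center frame
        return self.out(h[:, center])        # per-frame class logits
```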
  • the CNN 530 in the singing voice detection model 520 may be constructed by the transferring from the CNN 430 of the speech detection model 420.
  • at least one of the convolutional layer 532, the convolutional layer 536, and the convolutional layer 540 in the CNN 530 may be from corresponding convolutional layers in the CNN 430.
  • the CNN 530 may have various construction approaches. In one construction approach, all the convolutional layers in the CNN 430 may be transferred to the CNN 530, and accordingly, the convolutional layer 532, the convolutional layer 536, and the convolutional layer 540 may correspond to the convolutional layer 432, the convolutional layer 436 and the convolutional layer 440 respectively.
  • a part of the convolutional layers in the CNN 430 may be transferred to the CNN 530.
  • the convolutional layer 432 is transferred to the CNN 530 as the convolutional layer 532, or only the convolutional layer 432 and the convolutional layer 436 are transferred to the CNN 530 as the convolutional layer 532 and the convolutional layer 536.
  • one or more convolutional layers located at a bottom level in the CNN 430 may be transferred to the CNN 530 as the corresponding convolutional layers at a bottom level in the CNN 530, wherein the convolutional layers at the bottom level may refer to those convolutional layers closer to the input 410 or 510.
  • the bottom-level convolutional layers may contain more generic features that are useful for both the source task and the target task.
  • the bottom-level convolutional layers learn the basic and local features of sound, while the high-level convolutional layers learn high-level representations and knowledge that may be less relevant to the target task.
  • the singing voice in the target task is more complicated than the speech in the source task, because the singing voice will change with the accompaniment. Therefore, high-level representations of sound learned by the high-level convolutional layers in the CNN 430 from speech may not match the target task, so transferring this knowledge is less helpful for the target task.
  • the above transfer from the CNN 430 to the CNN 530 may employ various knowledge transfer modes.
  • in one transfer mode, which may be called a fixed mode, knowledge from the source task may be applied directly to the target task.
  • for example, parameters learned by the convolutional layers in the CNN 430 are directly transferred to the CNN 530, and these parameters are fixed or retained in subsequent training of the singing voice detection model 520.
  • for example, if the convolutional layer 432 in the CNN 430 is transferred to the CNN 530 as the convolutional layer 532, the convolutional layer 532 will fix those parameters previously learned by the convolutional layer 432 and will not change these parameters in the subsequent training process.
  • in another transfer mode, which may be called a fine-tuning mode, the CNN 530 considers new knowledge learned from the target task domain, in addition to the knowledge from the source task. For example, parameters learned by the convolutional layers in the CNN 430 are firstly transferred to the CNN 530 as initial values of the corresponding convolutional layers, and then, during the training of the singing voice detection model 520 with a polyphonic music clip training dataset, the transferred parameters are adapted or fine-tuned continuously, so that new knowledge in the target task of singing voice detection may be learned and knowledge from both the source task and the target task may be integrated, thus obtaining a more generic and more robust model.
  • the knowledge transfer modes and the various construction approaches of the CNN 530 described above may be arbitrarily combined.
  • the fine-tuning mode may be employed to adapt or fine-tune the parameters of the transferred convolutional layers.
  • the CNN 530 may have a structure similar to that of the CNN 430. Those convolutional layers in the CNN 530 that are not transferred from the CNN 430 may be trained in the process of training the singing voice detection model with polyphonic music clips. Moreover, optionally, the pooling layers in the CNN 530 may be transferred from the CNN 430 along with the corresponding convolutional layers, or may be reconstructed.
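  • The two knowledge transfer modes could be expressed as follows, assuming the SpeechDetectionCNN and SingingVoiceCRNN sketches above: the trained convolutional stack is copied into the CRNN and then either frozen (fixed mode) or left trainable (fine-tuning mode).

```python
import copy

def transfer_cnn(speech_model, fixed=True):
    """Build a singing voice CRNN whose convolutional stack is initialized
    from a trained speech detection model (sketches defined above)."""
    crnn = SingingVoiceCRNN(cnn=copy.deepcopy(speech_model.cnn), n_classes=3)
    for p in crnn.cnn.parameters():
        # Fixed mode: transferred parameters are retained and frozen.
        # Fine-tuning mode (fixed=False): they stay trainable and are adapted
        # with the polyphonic music clip training dataset.
        p.requires_grad = not fixed
    return crnn

# In fine-tuning mode, a smaller learning rate is often used for the
# transferred layers, e.g. via per-parameter-group optimizer settings.
```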
  • FIG.6 illustrates a flowchart of an exemplary method 600 for obtaining a singing voice detection model according to an embodiment.
  • At 610, a plurality of speech clips and a plurality of instrumental music clips may be synthesized into a plurality of audio clips.
  • At 620, a speech detection model may be trained with the plurality of audio clips.
  • At 630, at least a part of the speech detection model may be transferred to a singing voice detection model.
  • At 640, the singing voice detection model may be trained with a set of polyphonic music clips.
  • the speech detection model may perform a source task for detecting speech in an audio clip.
  • Each of the plurality of audio clips may comprise a plurality of frame-level labels indicating whether there exists speech.
  • the speech detection model is based on a CNN comprising one or more convolutional layers.
  • the transferring may comprise: transferring at least one convolutional layer in the one or more convolutional layers to the singing voice detection model.
  • the at least one convolutional layer may be located at a bottom level of the one or more convolutional layers.
  • Each of the one or more convolutional layers may connect to a corresponding pooling layer.
  • the singing voice detection model may perform a target task for detecting singing voice in a polyphonic music clip.
  • Each of the set of polyphonic music clips may comprise a plurality of frame-level labels indicating whether there exists singing voice.
  • the singing voice detection model may perform a target task for detecting singing voice, accompaniment and silence in a polyphonic music clip.
  • Each of the set of polyphonic music clips may comprise a plurality of frame-level labels indicating whether there exists singing voice, accompaniment and/or silence.
  • the singing voice detection model may be based on a CRNN, the CRNN comprising a CNN and an RNN.
  • the CNN may comprise at least one convolutional layer transferred from the speech detection model.
  • the training the singing voice detection model may comprise: fixing parameters of the at least one convolutional layer.
  • the training the singing voice detection model may comprise: adapting parameters of the at least one convolutional layer with the set of polyphonic music clips.
  • inputs to the speech detection model and the singing voice detection model may be in a Mel spectrogram form.
  • the method 600 may further comprise any steps/processes for obtaining a singing voice detection model according to the above embodiments of the present disclosure.
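  • Putting the steps of the method 600 together, an end-to-end flow might look like the following sketch; train is a placeholder for an ordinary supervised training loop, and the other helper names refer to the illustrative sketches above rather than to the patent itself.

```python
def train(model, labeled_clips):
    """Placeholder for an ordinary supervised training loop (omitted here)."""
    pass

def obtain_singing_voice_detection_model(speech_clips, instrumental_clips,
                                         polyphonic_clips, fixed=False):
    """Illustrative end-to-end flow of the method 600 (steps 610-640)."""
    # 610: synthesize speech clips and instrumental music clips into audio clips
    audio_clips = [synthesize_audio_clip(s, i)
                   for s, i in zip(speech_clips, instrumental_clips)]
    # 620: train the speech detection model with the synthesized audio clips
    speech_model = SpeechDetectionCNN()
    train(speech_model, audio_clips)
    # 630: transfer at least a part of the speech detection model
    singing_model = transfer_cnn(speech_model, fixed=fixed)
    # 640: train the singing voice detection model with polyphonic music clips
    train(singing_model, polyphonic_clips)
    return singing_model
```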
  • FIG.7 illustrates an exemplary apparatus 700 for obtaining a singing voice detection model according to an embodiment.
  • the apparatus 700 may comprise: an audio clip synthesizing module 710, for synthesizing a plurality of speech clips and a plurality of instrumental music clips into a plurality of audio clips; a speech detection model training module 720, for training a speech detection model with the plurality of audio clips; a transferring module 730, for transferring at least a part of the speech detection model to a singing voice detection model; and a singing voice detection model training module 740, for training the singing voice detection model with a set of polyphonic music clips.
  • the apparatus 700 may further comprise any other modules configured for obtaining a singing voice detection model according to the above embodiments of the present disclosure.
  • FIG.8 illustrates an exemplary apparatus 800 for obtaining a singing voice detection model according to an embodiment.
  • the apparatus 800 may comprise at least one processor 810 and a memory 820 storing computer-executable instructions.
  • the processor 810 may: synthesize a plurality of speech clips and a plurality of instrumental music clips into a plurality of audio clips; train a speech detection model with the plurality of audio clips; transfer at least a part of the speech detection model to a singing voice detection model; and train the singing voice detection model with a set of polyphonic music clips.
  • the processor 810 may further perform any steps/processes for obtaining a singing voice detection model according to the above embodiments of the present disclosure.
  • the embodiments of the present disclosure may be embodied in a non- transitory computer-readable medium.
  • the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for obtaining a singing voice detection model according to the above embodiments of the present disclosure.
  • modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
  • processors are described in connection with various apparatus and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether these processors are implemented as hardware or software will depend on the specific application and the overall design constraints imposed on the system.
  • a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, a micro-controller, a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic device (PLD), state machine, gate logic, discrete hardware circuitry, and other suitable processing components configured to perform the various functions described in this disclosure.
  • the functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, a micro-controller, a DSP, or another suitable platform.
  • Software should be considered broadly to represent instructions, instruction sets, code, code segments, program code, programs, subroutines, software modules, applications, software applications, software packages, routines, subroutines, objects, running threads, processes, functions, and the like. Software may reside on computer readable medium.
  • Computer readable medium may include, for example, a memory, which may be, for example, a magnetic storage device (e.g., a hard disk, a floppy disk, a magnetic strip), an optical disk, a smart card, a flash memory device, a random access memory (RAM), a read only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk.
  • a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)
PCT/US2020/036869 2019-07-30 2020-06-10 Obtaining a singing voice detection model WO2021021305A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910694160.3 2019-07-30
CN201910694160.3A CN112309428B (zh) 2019-07-30 Obtaining a singing voice detection model

Publications (1)

Publication Number Publication Date
WO2021021305A1 true WO2021021305A1 (en) 2021-02-04

Family

ID=71899957

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/036869 WO2021021305A1 (en) 2019-07-30 2020-06-10 Obtaining a singing voice detection model

Country Status (2)

Country Link
CN (1) CN112309428B (zh)
WO (1) WO2021021305A1 (zh)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4524634B2 (ja) * 2005-03-02 2010-08-18 株式会社国際電気通信基礎技術研究所 歌声評定装置およびプログラム
JP5811837B2 (ja) * 2011-12-27 2015-11-11 ヤマハ株式会社 表示制御装置及びプログラム
CN104091600B (zh) * 2014-03-21 2015-11-11 腾讯科技(深圳)有限公司 一种歌声位置检测方法及装置
CN104616663A (zh) * 2014-11-25 2015-05-13 重庆邮电大学 一种结合hpss的mfcc-多反复模型的音乐分离方法
CN107680611B (zh) * 2017-09-13 2020-06-16 电子科技大学 基于卷积神经网络的单通道声音分离方法
CN109903773B (zh) * 2019-03-13 2021-01-08 腾讯音乐娱乐科技(深圳)有限公司 音频处理方法、装置及存储介质

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
ALE KORETZKY: "Audio AI: isolating vocals from stereo music using Convolutional Neural Networks | by Ale Koretzky | Towards Data Science", 4 February 2019 (2019-02-04), XP055728359, Retrieved from the Internet <URL:https://towardsdatascience.com/audio-ai-isolating-vocals-from-stereo-music-using-convolutional-neural-networks-210532383785> [retrieved on 20200907] *
ARORA PRERNA ET AL: "A study on transfer learning for acoustic event detection in a real life scenario", 2017 IEEE 19TH INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP), IEEE, 16 October 2017 (2017-10-16), pages 1 - 6, XP033271590, DOI: 10.1109/MMSP.2017.8122258 *
MAVADDATI SAMIRA ED - KHATEB FABIAN ET AL: "A Novel Singing Voice Separation Method Based on a Learnable Decomposition Technique", CIRCUITS, SYSTEMS AND SIGNAL PROCESSING, CAMBRIDGE, MS, US, vol. 39, no. 7, 8 January 2020 (2020-01-08), pages 3652 - 3681, XP037127830, ISSN: 0278-081X, [retrieved on 20200108], DOI: 10.1007/S00034-019-01338-0 *
PO-SEN HUANG ET AL: "Singing-Voice Separation From Monaural Recordings Using Deep Recurrent Neural Networks", ISMIR 2014, 31 October 2014 (2014-10-31), XP055729050, DOI: 10.5281/zenodo.1415678 *
SHINGCHERND YOU ET AL: "Comparative study of singing voice detection based on deep neural networks and ensemble learning", HUMAN-CENTRIC COMPUTING AND INFORMATION SCIENCES, BIOMED CENTRAL LTD, LONDON, UK, vol. 8, no. 1, 26 November 2018 (2018-11-26), pages 1 - 18, XP021263029, DOI: 10.1186/S13673-018-0158-1 *
STOLLER DANIEL ET AL: "Jointly Detecting and Separating Singing Voice: A Multi-Task Approach", 6 June 2018, ANNUAL INTERNATIONAL CONFERENCE ON THE THEORY AND APPLICATIONS OF CRYPTOGRAPHIC TECHNIQUES, EUROCRYPT 2018; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER, BERLIN, HEIDELBERG, PAGE(S) 329 - 339, ISBN: 978-3-642-17318-9, XP047474369 *
SWAMINATHAN RUPAK VIGNESH ET AL: "Improving Singing Voice Separation Using Attribute-Aware Deep Network", 2019 INTERNATIONAL WORKSHOP ON MULTILAYER MUSIC REPRESENTATION AND PROCESSING (MMRP), IEEE, 23 January 2019 (2019-01-23), pages 60 - 65, XP033529291, DOI: 10.1109/MMRP.2019.8665379 *
TAKAHASHI NAOYA ET AL: "Improving Voice Separation by Incorporating End-To-End Speech Recognition", ICASSP 2020 - 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 4 May 2020 (2020-05-04), pages 41 - 45, XP033793497, DOI: 10.1109/ICASSP40776.2020.9053845 *
WEI TSUNG LU ET AL: "Vocal Melody Extraction with Semantic Segmentation and Audio-symbolic Domain Transfer Learning", ISMIR 2018, 26 September 2018 (2018-09-26), XP055728084, DOI: 10.5281/zenodo.1492466 *
YIN-JYUN LUO ET AL: "Learning Domain-Adaptive Latent Representations of Music Signals Using Variational Autoencoders", ISMIR 2018, 26 September 2018 (2018-09-26), XP055728801, DOI: 10.5281/zenodo.1492501 *

Also Published As

Publication number Publication date
CN112309428A (zh) 2021-02-02
CN112309428B (zh) 2024-03-19

Similar Documents

Publication Publication Date Title
Böck et al. Deconstruct, Analyse, Reconstruct: How to improve Tempo, Beat, and Downbeat Estimation.
Han et al. Deep convolutional neural networks for predominant instrument recognition in polyphonic music
Tzanetakis et al. Marsyas: A framework for audio analysis
Wu et al. Multi-instrument automatic music transcription with self-attention-based instance segmentation
Gururani et al. Instrument Activity Detection in Polyphonic Music using Deep Neural Networks.
Hung et al. Frame-level instrument recognition by timbre and pitch
Friedland et al. The ICSI RT-09 speaker diarization system
Vogl et al. Drum transcription from polyphonic music with recurrent neural networks
Su et al. TENT: Technique-Embedded Note Tracking for Real-World Guitar Solo Recordings.
Tzanetakis et al. A framework for audio analysis based on classification and temporal segmentation
Wu et al. Music chord recognition based on midi-trained deep feature and blstm-crf hybird decoding
Hou et al. Transfer learning for improving singing-voice detection in polyphonic instrumental music
Huang et al. Improving lyrics alignment through joint pitch detection
Mounika et al. Music genre classification using deep learning
Murthy et al. Singer identification from smaller snippets of audio clips using acoustic features and DNNs
Arumugam et al. An efficient approach for segmentation, feature extraction and classification of audio signals
WO2019053544A1 (en) IDENTIFICATION OF AUDIOS COMPONENTS IN AN AUDIO MIX
Ullrich et al. Music transcription with convolutional sequence-to-sequence models
Wang et al. Musicyolo: A vision-based framework for automatic singing transcription
Amarasinghe et al. Supervised learning approach for singer identification in sri lankan music
Stark Musicians and machines: Bridging the semantic gap in live performance
Shi et al. Which Ones Are Speaking? Speaker-Inferred Model for Multi-Talker Speech Separation.
Yang et al. Highlighting root notes in chord recognition using cepstral features and multi-task learning
WO2021021305A1 (en) Obtaining a singing voice detection model
Barthet et al. Speech/music discrimination in audio podcast using structural segmentation and timbre recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20750538

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20750538

Country of ref document: EP

Kind code of ref document: A1