WO2020228226A1 - Pure music detection method and apparatus, and storage medium - Google Patents

Pure music detection method and apparatus, and storage medium

Info

Publication number
WO2020228226A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
human voice
feature
processed
detected
Prior art date
Application number
PCT/CN2019/109638
Other languages
English (en)
French (fr)
Inventor
王征韬
Original Assignee
腾讯音乐娱乐科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯音乐娱乐科技(深圳)有限公司 filed Critical 腾讯音乐娱乐科技(深圳)有限公司
Publication of WO2020228226A1 publication Critical patent/WO2020228226A1/zh

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/81: Detection of presence or absence of voice signals for discriminating voice from music

Definitions

  • The invention relates to the field of audio processing, and in particular to a pure music detection method, device, and storage medium.
  • Pure music refers to music that does not contain lyrics; such music narrates and expresses the author's emotions purely through the music itself. Pure music can be played on natural musical instruments (such as piano, violin, guitar, etc.) or on electroacoustic instruments, so whether the audio contains human voice is usually used to distinguish whether the audio is pure music.
  • The embodiments of the present invention provide a pure music detection method, device, and storage medium, which are used to improve the accuracy of pure music detection.
  • An embodiment of the present invention provides a pure music detection method, the method including: acquiring audio to be detected; performing human voice separation processing on the audio to be detected to obtain an audio segment to be processed; extracting audio features of the audio segment to be processed, the audio features including a Mel (mel) feature and a vocal proportion feature; inputting the audio features into a trained human voice detection network model; and obtaining an output result of the trained human voice detection network model.
  • If it is determined according to the output result that the audio segment to be processed does not contain human voice, it is determined that the audio to be detected is pure music.
  • Correspondingly, an embodiment of the present invention also provides a pure music detection device, which includes:
  • a first acquiring unit, configured to acquire the audio to be detected;
  • a processing unit, configured to perform human voice separation processing on the audio to be detected to obtain an audio segment to be processed;
  • an extraction unit, configured to extract audio features of the audio segment to be processed, the audio features including a Mel (mel) feature and a vocal proportion feature;
  • an input unit, configured to input the audio features into a trained human voice detection network model;
  • a first determining unit, configured to obtain an output result of the trained human voice detection network model;
  • a second determining unit, configured to determine that the audio to be detected is pure music when it is determined according to the output result that the audio segment to be processed does not contain human voice.
  • In some embodiments, the device further includes:
  • a second acquiring unit, configured to acquire multiple audio samples, where the audio samples are samples for which it is known whether they are pure music;
  • a third determining unit, configured to determine the audio features of the audio sample according to the audio sample;
  • an adding unit, configured to add the audio features to a training sample set;
  • a training unit, configured to train the human voice detection network model according to the training sample set to obtain the trained human voice detection network model.
  • In some embodiments, the third determining unit is specifically configured to:
  • perform human voice separation processing on the audio sample to obtain an audio segment; and
  • extract the audio features of the audio segment to determine the audio features.
  • In some embodiments, the third determining unit is further specifically configured to:
  • perform the human voice separation processing on the audio samples through an Hourglass model.
  • When the audio feature is the mel feature, the extraction unit is specifically configured to:
  • perform an STFT transform on the audio segment to be processed to obtain an STFT spectrum;
  • convert the STFT spectrum to obtain a mel spectrum; and
  • perform logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel feature.
  • In some embodiments, the extraction unit is further specifically configured to:
  • extract the mel feature of the audio segment to be processed; and
  • extract the vocal proportion feature of the audio segment to be processed.
  • When the audio feature is the vocal proportion feature, the extraction unit is specifically configured to:
  • normalize the audio segment to be processed to obtain a normalized audio segment to be processed;
  • perform silence filtering on the normalized audio segment to be processed to obtain a filtered audio segment to be processed; and
  • determine the vocal proportion feature according to the duration corresponding to the filtered audio segment to be processed and the duration of the audio to be detected.
  • In some embodiments, the processing unit is specifically configured to:
  • perform the human voice separation processing on the audio to be detected through an Hourglass model.
  • An embodiment of the present invention also provides a storage medium storing a plurality of instructions, the instructions being suitable for being loaded by a processor to execute the steps in any of the pure music detection methods provided in the embodiments of the present invention.
  • In the embodiments of the present invention, the audio to be detected is obtained; human voice separation processing is performed on the audio to be detected to obtain an audio segment to be processed; audio features of the audio segment to be processed are then extracted, the audio features including a Mel (mel) feature and a vocal proportion feature; the audio features are input into a trained human voice detection network model; an output result of the trained human voice detection network model is obtained; and if it is determined according to the output result that the audio segment to be processed does not contain human voice, it is determined that the audio to be detected is pure music.
  • The embodiments of the present invention perform pure music detection on audio segments separated from the audio to be detected; whole-song detection is not needed, the length of audio that must be examined is relatively short, and the accuracy of pure music detection can be improved.
  • Fig. 1 is a system schematic diagram of a pure music detection device provided by an embodiment of the present invention.
  • Fig. 2 is a schematic flowchart of a pure music detection method provided by an embodiment of the present invention.
  • Fig. 3 is a schematic structural diagram of the basic convolutional network in a human voice detection network model provided by an embodiment of the present invention.
  • Fig. 4 is a schematic structural diagram of the encoding layer in a human voice detection network model provided by an embodiment of the present invention.
  • Fig. 5 is another schematic flowchart of a pure music detection method provided by an embodiment of the present invention.
  • Fig. 6 is a schematic structural diagram of a pure music detection device provided by an embodiment of the present invention.
  • Fig. 7 is another schematic structural diagram of a pure music detection device provided by an embodiment of the present invention.
  • Fig. 8 is a schematic structural diagram of a server provided by an embodiment of the present invention.
  • Fig. 9 is a schematic structural diagram of a terminal provided by an embodiment of the present invention.
  • The terms "first" and "second" in the present invention are used to distinguish different objects, not to describe a specific order.
  • The terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion.
  • For example, a process, method, system, product, or device that includes a series of steps or modules is not limited to the listed steps or modules, but optionally also includes steps or modules that are not listed, or other steps or modules inherent to the process, method, product, or device.
  • Pure music refers to music that does not contain lyrics. We can determine whether music is pure music according to whether it contains human voice.
  • In the prior art, whether audio is pure music is generally judged from the whole song. Some songs contain scattered human voices and are usually still regarded as pure music, so the accuracy of pure music detection is not high.
  • In addition, if a model is to be trained to judge whether whole-song audio is pure music, the difficulty is that whole-song length varies greatly, from tens of seconds to tens of minutes; training a model on such long audio to judge whether it is pure music makes the model difficult to train and optimize.
  • Therefore, the embodiments of the present invention provide a pure music detection method, device, and storage medium.
  • The audio to be detected is obtained, and human voice separation processing is performed on the audio to be detected to obtain an audio segment to be processed.
  • Audio features of the audio segment to be processed are then extracted, and whether the audio segment to be processed contains human voice is detected according to the audio features; if it contains no human voice, it is determined that the audio to be detected is pure music.
  • The embodiments of the present invention perform pure music detection on audio segments separated from the audio to be detected; whole-song detection is not needed, the length of audio that must be examined is relatively short, and the accuracy of pure music detection can be improved.
  • Specifically, when detecting whether the audio segment to be processed contains human voice according to the audio features in this embodiment, the audio features can be input into a trained human voice detection network model, and whether the audio segment to be processed contains human voice is then determined according to the output result of the trained human voice detection network model.
  • When training the human voice detection network model, the audio features separated and extracted from the audio samples are used as the training samples.
  • The audio features are relatively short, so when the audio features are used as training samples, it is relatively easy to train and optimize the model.
  • the pure music detection method provided by the embodiment of the present invention can be implemented in a pure music detection device, and the pure music detection device can be integrated in a terminal or server and other equipment.
  • FIG. 1 is a system diagram of a pure music detection device provided by an embodiment of the present invention.
  • the system can be used for model training and pure music detection.
  • the model provided by the embodiment of the present invention is a deep learning network model, such as a human voice detection network model.
  • When training the model, training audio samples are obtained in advance, and human voice separation is then performed on the audio samples to obtain audio segments.
  • The audio features of the audio sample are determined from the audio segments; the audio features are then added to the training sample set, and the human voice detection network model is trained according to the training sample set to obtain the trained human voice detection network model.
  • FIG. 2 is a schematic flowchart of a pure music detection method provided by an embodiment of the present invention. The method includes:
  • The pure music detection device in the present invention can be integrated in a terminal, a server, or other equipment, where the equipment includes but is not limited to computers, smart TVs, smart speakers, mobile phones, tablet computers, and other devices.
  • The pure music detection device in this embodiment includes a trained human voice detection network model, which is a deep learning network model. Before acquiring the audio to be detected, this embodiment needs to train the human voice detection network first, so that the human voice detection network can be used to detect whether the audio to be detected contains human voice.
  • the details can be as follows:
  • The audio samples are samples for which it is known whether they are pure music.
  • The audio samples include long audio samples and short audio samples.
  • Long audio samples are audio samples that are tens of minutes long.
  • Short audio samples are audio with a duration of only a few seconds. The duration of long audio is longer than that of short audio, and the specific durations of long and short audio are not limited here.
  • First, the audio sample needs to be separated into audio segments and pure music segments through the Hourglass model.
  • The audio segments mentioned in this embodiment are the "human voice segments" extracted from the audio samples.
  • The Hourglass model is trained on the DSD100 data set (human voice and other musical instruments such as drums, bass, guitar, etc.). This model can support blind source separation of any number of sources, and can separate the human voice and the instrument sounds in the audio.
  • The original audio should be merged into mono and downsampled to 8k when fed into the Hourglass model.
  • The downsampling rate is not limited here; besides 8k, the audio can also be downsampled to other values according to the specific situation.
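  • As a concrete illustration, this preprocessing can be sketched in Python as follows; librosa is assumed to be available, and the wrapper name is an assumption made for illustration:

```python
# Minimal preprocessing sketch: merge to mono and downsample to 8 kHz
# before handing the audio to the separation model.
import librosa

def preprocess_for_separation(path, target_sr=8000):
    # librosa.load resamples to target_sr and merges channels to mono
    audio, sr = librosa.load(path, sr=target_sr, mono=True)
    return audio, sr
```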
  • The audio segment may be a real "vocal segment" (in which case the label corresponding to the sample should be non-pure music) or a false "vocal segment" (that is, an instrument sound misidentified as a human voice, in which case the label corresponding to the sample should be pure music); that is, it may or may not contain human voice.
  • Feature extraction is then performed on the audio segments to obtain the audio features, where the audio features can include the mel feature and the vocal ratio feature (Vocal_ratio).
  • Extracting the mel feature specifically includes: performing an STFT transform on the audio segment to obtain the STFT spectrum; converting the STFT spectrum to obtain the mel spectrum; and performing logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel feature.
  • The logarithmic processing and first-order difference processing on the mel spectrum are specifically as follows: (1) add 1 to the mel spectrum and then take the logarithm, that is, log(1+x), where x is the mel spectrum; (2) take the first-order difference of the result of (1) along the time direction, take its absolute value, and superimpose it with the result of (1) to form the mel feature.
  • Finally, the result is scaled in time to a fixed length, where the fixed length can be 2000 frames; the specific length is not limited here.
  • The mel feature is a spectrum feature obtained through a filter bank that matches the characteristics of human hearing, and it reflects the time-frequency structure of the audio.
  • The mel feature contains the information needed to determine whether the audio is pure music or human voice, so it can serve as a basis for the pure music judgment, and we can therefore use the mel feature to train the model.
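  • A minimal sketch of this mel-feature pipeline, assuming librosa and illustrative parameter values (n_mels, the 2000-frame target) that the text leaves open:

```python
import numpy as np
import librosa

def extract_mel_feature(audio, sr=8000, n_mels=128, fixed_len=2000):
    """STFT -> mel spectrum -> log(1 + x), stacked with the absolute
    first-order difference along time, then scaled to a fixed length."""
    stft = np.abs(librosa.stft(audio))                        # STFT spectrum
    mel = librosa.feature.melspectrogram(S=stft ** 2, sr=sr, n_mels=n_mels)
    log_mel = np.log1p(mel)                                   # (1) log(1 + x)
    diff = np.abs(np.diff(log_mel, axis=1))                   # (2) |1st-order diff| along time
    diff = np.pad(diff, ((0, 0), (1, 0)))                     # keep time axes aligned
    feat = np.stack([log_mel, diff], axis=0)                  # superimpose (1) and (2)
    # scale in time to a fixed number of frames (e.g. 2000) by index sampling
    idx = np.linspace(0, feat.shape[-1] - 1, fixed_len).astype(int)
    return feat[..., idx]
```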
  • When extracting the vocal proportion feature, the specific method is to normalize the audio segment (to 0-1) to obtain a normalized audio segment; perform silence filtering on the normalized audio segment (the filter threshold can be set to 20 dB, or to other values according to the specific situation; the specific filter threshold is not limited here) to obtain a filtered audio segment; and determine the vocal proportion feature according to the duration corresponding to the filtered audio segment and the duration of the audio sample.
  • The value of the vocal proportion feature of pure music is usually low, and the value of the vocal proportion feature of non-pure music is usually higher, so we can use the vocal proportion feature to train the model.
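  • A sketch of this vocal-proportion computation under the assumptions above (0-1 normalization, a roughly 20 dB silence threshold, durations measured in samples; the exact dB convention is an assumption):

```python
import numpy as np

def vocal_ratio(separated_clip, total_samples, silence_db=-20.0):
    """Normalize the separated 'vocal' clip to 0-1, drop samples below
    the silence threshold, and divide the remaining duration by the
    duration of the full audio."""
    x = np.abs(separated_clip)
    x = x / (x.max() + 1e-9)                    # normalize to 0-1
    level_db = 20.0 * np.log10(x + 1e-9)        # amplitude in dB
    voiced = level_db > silence_db              # filter out near-silence
    return float(voiced.sum()) / float(total_samples)
```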
  • In some embodiments, the vocal proportion feature may not be extracted.
  • After the audio features of the above audio samples are extracted, they need to be added to the training sample set.
  • Since it is known whether the above audio samples are pure music, we can add a label to the audio features extracted from each audio sample to indicate whether the audio features reflect pure music or non-pure music, so a sample in the training sample set consists of (mel feature, vocal proportion feature, label).
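  • Putting the pieces together, one training sample can be assembled as the (mel feature, vocal proportion feature, label) triple; hourglass_separate is a hypothetical wrapper around the trained Hourglass model, and the 1 = pure music label encoding is an assumption:

```python
def make_training_sample(audio_sample, is_pure_music):
    # separate the candidate "vocal" clip from the sample (hypothetical wrapper)
    clip = hourglass_separate(audio_sample)
    mel = extract_mel_feature(clip)
    ratio = vocal_ratio(clip, total_samples=len(audio_sample))
    label = 1 if is_pure_music else 0           # assumed label encoding
    return mel, ratio, label
```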
  • After the training sample set is obtained, the human voice detection network model is trained according to the training sample set. Specifically, a set of audio features is input into a preset human voice detection network model to obtain a prediction result, and the human voice detection network model is then trained by comparing the prediction result with the label of that set of audio features.
  • The audio features extracted from the audio samples are used as the training samples for training.
  • The audio features are relatively short, so using the audio features as training samples makes it easier to train and optimize the model.
  • The audio to be detected needs to be separated into audio segments and pure music segments through the Hourglass model.
  • The audio segments to be processed mentioned in this embodiment are the "human voice segments" extracted from the audio to be detected.
  • The Hourglass model in this step recalls all possible voice in the audio at the expense of allowing some instruments to be misidentified, so as not to miss any human voice and to ensure a high recall rate.
  • The audio segments to be processed may be real "human voice segments" or false "human voice segments"; that is, they may or may not contain human voice. In this case, it is necessary to determine in the subsequent steps whether the audio segment to be processed is a real human voice segment.
  • The audio features in the embodiment of the present invention can be used to indicate whether the audio is pure music or human voice.
  • The audio segments to be processed are extracted from the audio to be detected through the Hourglass model.
  • Feature extraction then needs to be performed on the audio segments to be processed that are output by the Hourglass model to obtain the audio features.
  • The audio features can include the mel feature and the vocal proportion feature.
  • Extracting the mel feature specifically includes: performing a short-time Fourier transform (STFT) on the audio segment to be processed to obtain the STFT spectrum; converting the STFT spectrum to obtain the mel spectrum; and performing logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel feature.
  • The logarithmic processing and first-order difference processing on the mel spectrum are specifically as follows: (1) add 1 to the mel spectrum and then take the logarithm, that is, log(1+x), where x is the mel spectrum; (2) take the first-order difference of the result of (1) along the time direction, take its absolute value, and superimpose it with the result of (1) to form the mel feature.
  • When extracting the vocal proportion feature, the specific method is to normalize the audio segment to be processed (to 0-1) to obtain a normalized audio segment to be processed; perform silence filtering on it to obtain a filtered audio segment; and determine the vocal proportion feature according to the duration corresponding to the filtered audio segment and the duration of the audio to be detected.
  • In some embodiments, the vocal proportion feature may not be extracted. That is, if the trained human voice detection network model was trained on both the mel feature and the vocal proportion feature, the vocal proportion feature needs to be extracted here; if it was trained on the mel feature only, the vocal proportion feature does not need to be extracted here.
  • Besides the above, the audio features in the embodiment of the present invention may also include other audio features that can indicate whether the audio is human voice or pure music.
  • The specific feature type is not limited here.
  • After the audio features of the audio to be detected are obtained, the audio features are input into the trained human voice detection network model, and the output result is obtained, where the output result can be used to determine whether the audio segment to be processed contains human voice.
  • If the output result is greater than 0.5, it is determined that the audio segment to be processed does not contain human voice; if the output result is less than 0.5, it is determined that the audio segment to be processed contains human voice. For example, if the output result is 1, it is determined that the audio segment to be processed does not contain human voice; if it is 0, it is determined that the audio segment to be processed contains human voice.
  • The Vocal_ratio can additionally be output in this embodiment. According to statistics, we find that the vocal_ratio of pure music is usually low, while the vocal_ratio of non-pure music is generally higher.
  • Therefore, the vocal_ratio can be used as a physically meaningful value for manual reference.
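  • The decision rule sketched from this description, with the vocal_ratio surfaced alongside the verdict for manual reference (the 0.5 threshold is taken from the text above):

```python
def decide(model_score, ratio, threshold=0.5):
    # scores above the threshold mean "no human voice", i.e. pure music
    return {"pure_music": model_score > threshold, "vocal_ratio": ratio}
```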
  • If the audio segment to be processed does not contain human voice, it is determined that the audio to be detected is pure music.
  • In the embodiment of the present invention, the audio to be detected is obtained; human voice separation processing is performed on the audio to be detected to obtain the audio segment to be processed; audio features of the audio segment to be processed are then extracted, the audio features including a Mel (mel) feature and a vocal proportion feature; the audio features are input into the trained human voice detection network model; the output result of the trained human voice detection network model is obtained; and if it is determined according to the output result that the audio segment to be processed does not contain human voice, it is determined that the audio to be detected is pure music.
  • The embodiment of the present invention performs pure music detection on audio segments separated from the audio to be detected; whole-song detection is not needed, the length of audio that must be examined is relatively short, and the accuracy of pure music detection can be improved.
  • In this embodiment, the pure music detection device is specifically integrated in a server as an example for description.
  • First, the server obtains a large number of audio samples through multiple channels.
  • The audio samples are samples for which it is known whether they are pure music.
  • The audio samples include long audio samples and short audio samples.
  • Long audio samples are tens of minutes long.
  • Short audio samples are audio with a duration of only a few seconds. The duration of long audio is longer than that of short audio, and the specific durations of long and short audio are not limited here.
  • Then, the audio features of the audio sample are determined according to the audio sample. Specifically, the audio sample needs to be subjected to human voice separation processing to obtain an audio segment; the audio features of the audio segment are then extracted to determine the audio features.
  • The audio sample needs to be separated into audio segments and pure music segments through the Hourglass model.
  • The audio segments mentioned in this embodiment are the "human voice segments" extracted from the audio samples.
  • The original audio should be merged into mono and downsampled to 8k when fed into the Hourglass model.
  • The downsampling rate is not limited here; besides 8k, the audio can also be downsampled to other values according to the specific situation.
  • The audio segments may be real "human voice segments" or false "human voice segments"; that is, they may or may not contain human voice.
  • Then, the audio segments output by the Hourglass model need to undergo feature extraction to obtain the audio features, where the audio features can include the mel feature and the vocal ratio feature (Vocal_ratio).
  • Extracting the mel feature specifically includes: performing an STFT transform on the audio segment to obtain the STFT spectrum; converting the STFT spectrum to obtain the mel spectrum; and performing logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel feature.
  • The logarithmic processing and first-order difference processing on the mel spectrum are specifically as follows: (1) add 1 to the mel spectrum and then take the logarithm, that is, log(1+x), where x is the mel spectrum; (2) take the first-order difference of the result of (1) along the time direction, take its absolute value, and superimpose it with the result of (1) to form the mel feature.
  • Finally, the result is scaled in time to a fixed length, where the fixed length can be 2000 frames; the specific length is not limited here.
  • The mel feature is a spectrum feature obtained through a filter bank that matches the characteristics of human hearing, and it reflects the time-frequency structure of the audio.
  • The mel feature contains the information needed to determine whether the audio is pure music or human voice, so it can serve as a basis for the pure music judgment, and we can therefore use the mel feature to train the model.
  • When extracting the vocal proportion feature, the specific method is to normalize the audio segment (to 0-1) to obtain a normalized audio segment; perform silence filtering on the normalized audio segment (the filter threshold can be set to 20 dB, or to other values according to the specific situation; the specific filter threshold is not limited here) to obtain a filtered audio segment; and determine the vocal proportion feature according to the duration corresponding to the filtered audio segment and the duration of the audio sample.
  • The value of the vocal proportion feature of pure music is usually low, and the value of the vocal proportion feature of non-pure music is usually higher, so we can use the vocal proportion feature to train the model.
  • In some embodiments, the vocal proportion feature may not be extracted. In this case, only the mel feature needs to be extracted when performing pure music detection later.
  • Then, the audio features are added to the training sample set. More specifically, after the audio features of the above audio samples are extracted, the audio features need to be added to the training sample set. Since it is known whether each audio sample is pure music, we can add a label to the audio features extracted from the audio sample to indicate whether the audio features reflect pure music or non-pure music, so a sample in the training sample set consists of (mel feature, vocal proportion feature, label).
  • Then, the human voice detection network model is trained according to the training sample set to obtain the trained human voice detection network model; that is, after the training sample set is obtained, the human voice detection network model is trained according to the training sample set.
  • Specifically, a set of audio features is input into a preset human voice detection network model to obtain a prediction result, and the human voice detection network model is then trained by comparing the prediction result with the label of that set of audio features.
  • The audio features extracted from the audio samples are used as the training samples for training.
  • The audio features are relatively short, so using the audio features as training samples makes it easier to train and optimize the model.
  • The trained human voice detection network model in the embodiments of the present invention is composed of a basic convolutional network, an encoding layer, a feature fusion layer, and a fully connected classification layer.
  • The basic convolutional network is a network without dilation (expansion) coefficients.
  • A schematic diagram of the basic convolutional network is shown in Figure 3. The feature map formed after stacking multiple basic convolutional layers is a (timestep, feature) matrix, which needs to be converted into a vector in order to perform the subsequent classification.
  • the structure of the encoding layer is shown in Figure 4.
  • The encoding layer learns the importance of the data at each time step through a convolution with only one convolution kernel (a softmax mask); this importance value is then multiplied with the data row by row, and the products are summed along the time axis to obtain a feature vector.
  • This technique is equivalent to encoding the features distributed over the time steps into one point, so it is called the encoding layer.
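  • A minimal PyTorch sketch of the encoding layer as described (a single-kernel convolution producing a softmax mask over time, followed by a weighted sum); the layer size is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncodingLayer(nn.Module):
    """Learns a per-time-step importance with one convolution kernel
    (softmax mask) and sums the weighted data over the time axis,
    encoding the (timestep, feature) matrix into one feature vector."""
    def __init__(self, n_features):
        super().__init__()
        self.score = nn.Conv1d(n_features, 1, kernel_size=1)  # one kernel

    def forward(self, x):                        # x: (batch, features, timesteps)
        w = F.softmax(self.score(x), dim=-1)     # importance of each time step
        return (x * w).sum(dim=-1)               # weighted sum along time
```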
  • After obtaining the training sample set, we input the samples in the training sample set into the human voice detection network model.
  • Specifically, the mel feature is input into the basic convolutional network in the human voice detection network model and then encoded to a fixed length, after which the feature fusion layer adds the vocal proportion feature.
  • Finally, the classification result is output through the fully connected classification layer; the classification result is compared with the label corresponding to the sample, and the weights of the network are adjusted according to the comparison error until the model converges.
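  • The full stack and one training step can then be sketched as follows, reusing torch, nn, and F from the sketch above; channel counts, the optimizer, and the loss are assumptions, and the random tensors merely stand in for real batches:

```python
class VocalDetectionNet(nn.Module):
    """Basic convolutional network -> encoding layer -> fusion with the
    scalar vocal-proportion feature -> fully connected classifier."""
    def __init__(self, n_mels=128, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(               # basic conv network, no dilation
            nn.Conv1d(2 * n_mels, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.encode = EncodingLayer(hidden)      # fixed-length coding
        self.classify = nn.Linear(hidden + 1, 1) # +1 fuses the vocal ratio

    def forward(self, mel, ratio):               # mel: (batch, 2*n_mels, frames)
        h = self.encode(self.conv(mel))
        h = torch.cat([h, ratio.unsqueeze(-1)], dim=-1)   # feature fusion
        return torch.sigmoid(self.classify(h)).squeeze(-1)

# one illustrative optimization step on a batch from the training sample set
model = VocalDetectionNet()
optimizer = torch.optim.Adam(model.parameters())
mel_batch = torch.randn(4, 2 * 128, 2000)        # stands in for real mel features
ratio_batch = torch.rand(4)                      # stands in for real vocal ratios
label_batch = torch.randint(0, 2, (4,)).float()  # 1 = pure music (assumed encoding)
loss = F.binary_cross_entropy(model(mel_batch, ratio_batch), label_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()                                  # repeat until the model converges
```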
  • The pure music detection device in the present invention can be integrated in a terminal, a server, or other equipment, where the equipment includes but is not limited to computers, smart TVs, smart speakers, mobile phones, tablet computers, and other devices.
  • The original audio should be merged into mono and downsampled to 8k when fed into the Hourglass model.
  • The downsampling rate is not limited here; besides 8k, the audio can also be downsampled to other values according to the specific situation.
  • The audio segments to be processed may be real "human voice segments" or false "human voice segments"; that is, they may or may not contain human voice. In this case, it is necessary to determine in the subsequent steps whether the audio segment to be processed is a real human voice segment.
  • The audio segments to be processed are extracted from the audio to be detected through the Hourglass model.
  • Feature extraction then needs to be performed on the audio segments to be processed that are output by the Hourglass model to obtain the audio features.
  • The audio features can include the mel feature and the vocal proportion feature.
  • The specific steps are: performing an STFT transform on the audio segment to be processed to obtain the STFT spectrum; converting the STFT spectrum to obtain the mel spectrum; and performing logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel feature.
  • The logarithmic processing and first-order difference processing on the mel spectrum are specifically as follows: (1) add 1 to the mel spectrum and then take the logarithm, that is, log(1+x), where x is the mel spectrum; (2) take the first-order difference of the result of (1) along the time direction, take its absolute value, and superimpose it with the result of (1) to form the mel feature.
  • Finally, the result is scaled in time to a fixed length, where the fixed length can be 2000 frames; the specific length is not limited here.
  • When extracting the vocal proportion feature, the specific method is to normalize the audio segment to be processed (to 0-1) to obtain a normalized audio segment to be processed; perform silence filtering on it to obtain a filtered audio segment; and determine the vocal proportion feature according to the duration corresponding to the filtered audio segment and the duration of the audio to be detected.
  • In some embodiments, the vocal proportion feature may not be extracted. That is, if the trained human voice detection network model was trained on both the mel feature and the vocal proportion feature, the vocal proportion feature needs to be extracted here; if it was trained on the mel feature only, the vocal proportion feature does not need to be extracted here.
  • In addition to the mel feature and the vocal proportion feature, there may be other audio features that can indicate whether the audio is human voice or pure music.
  • The specific feature type is not limited here.
  • The audio features are input into the built-in trained human voice detection network model, and whether the audio segment to be processed contains human voice is then determined according to the output result of the human voice detection network model.
  • Step 505: determine whether the audio segment to be processed contains human voice according to the output result of the trained human voice detection network model; if it does not contain human voice, execute step 506; if it does, execute step 507.
  • If the output result is greater than 0.5, it is determined that the audio segment to be processed does not contain human voice; if the output result is less than 0.5, it is determined that the audio segment to be processed contains human voice. For example, if the output result is 1, it is determined that the audio segment to be processed does not contain a real human voice; if it is 0, it is determined that the audio segment to be processed contains a real human voice.
  • The Vocal_ratio can additionally be output in this embodiment. According to statistics, we find that the vocal_ratio of pure music is usually low, while the vocal_ratio of non-pure music is generally higher.
  • Therefore, the vocal_ratio can be used as a physically meaningful value for manual reference.
  • Step 506: if the audio segment to be processed does not contain human voice, it means that the audio to be detected does not contain human voice (because the audio segments that may contain human voice have already been separated out for detection), and the audio to be detected can then be determined to be pure music.
  • Step 507: if the audio segment to be processed contains human voice, that is, the audio segment extracted from the audio to be detected contains human voice, it is determined that the audio to be detected is not pure music.
  • In the embodiment of the present invention, the audio to be detected is obtained; human voice separation processing is performed on the audio to be detected to obtain the audio segment to be processed; audio features of the audio segment to be processed are then extracted, the audio features including a Mel (mel) feature and a vocal proportion feature; the audio features are input into the trained human voice detection network model; the output result of the trained human voice detection network model is obtained; and if it is determined according to the output result that the audio segment to be processed does not contain human voice, it is determined that the audio to be detected is pure music.
  • The embodiment of the present invention performs pure music detection on audio segments separated from the audio to be detected; whole-song detection is not needed, the length of audio that must be examined is relatively short, and the accuracy of pure music detection can be improved.
  • FIG. 6 is a schematic structural diagram of a pure music detection device provided by an embodiment of the present invention.
  • the pure music detection device 600 may include a first acquisition unit 601, a processing unit 602, an extraction unit 603, an input unit 604, a first determination unit 605, and a second determination unit 606, wherein:
  • the first acquiring unit 601 is configured to acquire audio to be detected
  • the processing unit 602 is configured to perform human voice separation processing on the audio to be detected to obtain audio clips to be processed;
  • the extraction unit 603 is configured to extract audio features of the audio segment to be processed, where the audio features include a Mel (mel) feature and a vocal proportion feature;
  • the input unit 604 is configured to input the audio features into the trained human voice detection network model
  • the first determining unit 605 is configured to obtain the output result of the trained human voice detection network model
  • the second determining unit 606 is configured to determine that the audio to be detected is pure music when it is determined that the audio segment to be processed does not contain human voice according to the output result.
  • the apparatus 600 further includes:
  • the second acquiring unit 607 is configured to acquire multiple audio samples, where the audio samples are samples for which it is known whether they are pure music;
  • the third determining unit 608 is configured to determine the audio feature of the audio sample according to the audio sample
  • the adding unit 609 is configured to add the audio feature to the training sample set
  • the training unit 610 is configured to train the human voice detection network model according to the training sample set to obtain the trained human voice detection network model.
  • In some embodiments, the third determining unit 608 is specifically configured to:
  • human voice separation processing is performed on the audio sample to obtain an audio segment; and
  • the audio features of the audio segment are extracted to determine the audio features.
  • In some embodiments, the third determining unit 608 is further specifically configured to:
  • the human voice separation processing is performed on the audio samples through the Hourglass model.
  • the extraction unit 603 is specifically configured to:
  • an STFT transform is performed on the audio segment to be processed to obtain an STFT spectrum;
  • the STFT spectrum is converted to obtain a mel spectrum; and
  • logarithmic processing and first-order difference processing are performed on the mel spectrum to obtain the mel feature.
  • the extraction unit 603 is further specifically configured to:
  • the mel feature of the audio segment to be processed is extracted; and
  • the vocal proportion feature of the audio segment to be processed is extracted.
  • When the audio feature is the vocal proportion feature, the extraction unit 603 is specifically configured to:
  • the audio segment to be processed is normalized to obtain a normalized audio segment;
  • silence filtering is performed on the normalized audio segment to obtain a filtered audio segment; and
  • the vocal proportion feature is determined according to the duration corresponding to the filtered audio segment to be processed and the duration of the audio to be detected.
  • the processing unit 602 is specifically configured to:
  • the human voice separation processing is performed on the to-be-detected audio through the Hourglass model.
  • In the embodiment of the present invention, the first acquiring unit 601 obtains the audio to be detected; the processing unit 602 performs human voice separation processing on the audio to be detected to obtain the audio segment to be processed; and the extraction unit 603 then extracts the audio features of the audio segment to be processed,
  • where the audio features include a Mel (mel) feature and a vocal proportion feature;
  • the input unit 604 inputs the audio features into the trained human voice detection network model;
  • the first determining unit 605 obtains the output result of the trained human voice detection network model; and if it is determined according to the output result that the audio segment to be processed does not contain human voice, the second determining unit 606 determines that the audio to be detected is pure music.
  • the embodiment of the present invention performs pure music detection on audio fragments separated from the audio to be detected, without the need for whole song detection, the audio length to be detected is relatively short, and the accuracy of pure music detection can be improved.
  • The embodiment of the present invention also provides a server. FIG. 8 shows a schematic structural diagram of the server involved in the embodiment of the present invention, specifically:
  • the server may include a processor 801 with one or more processing cores, a memory 802 with one or more computer-readable storage media, a power supply 803, an input unit 804, and other components.
  • Those skilled in the art can understand that the server structure shown in FIG. 8 does not constitute a limitation on the server, and the server may include more or fewer components than shown in the figure, a combination of certain components, or a different arrangement of components, where:
  • the processor 801 is the control center of the server. It uses various interfaces and lines to connect the various parts of the entire server, and performs the various functions of the server and processes data by running or executing the software programs and/or modules stored in the memory 802 and calling the data stored in the memory 802, thereby monitoring the server as a whole.
  • Optionally, the processor 801 may include one or more processing cores. Preferably, the processor 801 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may not be integrated into the processor 801.
  • the memory 802 may be used to store software programs and modules.
  • the processor 801 executes various functional applications and data processing by running the software programs and modules stored in the memory 802.
  • The memory 802 may mainly include a program storage area and a data storage area.
  • The program storage area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, etc.), and the like; the data storage area may store data created according to the use of the server, and the like.
  • In addition, the memory 802 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the memory 802 may further include a memory controller to provide the processor 801 with access to the memory 802.
  • the server also includes a power supply 803 for supplying power to various components.
  • the power supply 803 may be logically connected to the processor 801 through a power management system, so that functions such as charging, discharging, and power consumption management can be managed through the power management system.
  • The power supply 803 may also include one or more DC or AC power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and any other such components.
  • The server may further include an input unit 804, which can be used to receive input number or character information and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.
  • Although not shown, the server may also include a display unit and the like, which will not be described in detail here.
  • Specifically, in this embodiment, the processor 801 in the server loads the executable files corresponding to the processes of one or more application programs into the memory 802 according to the following instructions, and the processor 801 runs the application programs stored in the memory 802 to realize various functions, as follows:
  • the server provided in this embodiment obtains the audio to be detected; performs human voice separation processing on the audio to be detected to obtain the audio segment to be processed; and then extracts the audio features of the audio segment to be processed;
  • whether the audio segment to be processed contains human voice is detected according to the audio features; if it contains no human voice, it is determined that the audio to be detected is pure music.
  • The embodiment of the present invention performs pure music detection on audio segments separated from the audio to be detected; whole-song detection is not needed, the length of audio that must be examined is relatively short, and the accuracy of pure music detection can be improved.
  • an embodiment of the present invention also provides a terminal.
  • As shown in FIG. 9, the terminal may include a radio frequency (RF) circuit 901, a memory 902 including one or more computer-readable storage media, an input unit 903, a display unit 904, a sensor 905, an audio circuit 906, a wireless fidelity (WiFi) module 907, a processor 908 including one or more processing cores, a power supply 909, and other components.
  • The RF circuit 901 can be used to receive and send signals in the process of sending and receiving information or during a call. In particular, after receiving the downlink information of a base station, the RF circuit hands it to the one or more processors 908 for processing; in addition, it sends uplink data to the base station.
  • Generally, the RF circuit 901 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a subscriber identity module (SIM) card, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and so on.
  • the RF circuit 901 can also communicate with the network and other devices through wireless communication.
  • The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and so on.
  • the memory 902 may be used to store software programs and modules.
  • the processor 908 executes various functional applications and data processing by running the software programs and modules stored in the memory 902.
  • The memory 902 may mainly include a program storage area and a data storage area.
  • The program storage area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, etc.), and the like; the data storage area may store data (such as audio data, a phone book, etc.) created according to the use of the terminal.
  • In addition, the memory 902 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • Correspondingly, the memory 902 may further include a memory controller to provide the processor 908 and the input unit 903 with access to the memory 902.
  • the input unit 903 can be used to receive input digital or character information, and generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.
  • the input unit 903 may include a touch-sensitive surface and other input devices.
  • The touch-sensitive surface, also known as a touch screen or touchpad, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch-sensitive surface with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program.
  • the touch-sensitive surface may include two parts: a touch detection device and a touch controller.
  • the touch detection device detects the user's touch position, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and then sends it To the processor 908, and can receive and execute commands sent by the processor 908.
  • In addition, the touch-sensitive surface can be realized in multiple types, such as resistive, capacitive, infrared, and surface acoustic wave types.
  • the input unit 903 may also include other input devices. Specifically, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackball, mouse, and joystick.
  • the display unit 904 can be used to display information input by the user or information provided to the user and various graphical user interfaces of the terminal. These graphical user interfaces can be composed of graphics, text, icons, videos, and any combination thereof.
  • The display unit 904 may include a display panel. Optionally, the display panel may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
  • Further, the touch-sensitive surface may cover the display panel. When the touch-sensitive surface detects a touch operation on or near it, the operation is transmitted to the processor 908 to determine the type of the touch event, and the processor 908 then provides a corresponding visual output on the display panel according to the type of the touch event.
  • Although in FIG. 9 the touch-sensitive surface and the display panel are two independent components realizing the input and output functions, in some embodiments the touch-sensitive surface and the display panel may be integrated to realize the input and output functions.
  • the terminal may also include at least one sensor 905, such as a light sensor, a motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor can adjust the brightness of the display panel according to the brightness of the ambient light, and the proximity sensor can turn off the display panel and/or backlight when the terminal is moved to the ear .
  • As one kind of motion sensor, the gravity acceleration sensor can detect the magnitude of acceleration in various directions (usually three axes), and can detect the magnitude and direction of gravity when stationary.
  • The audio circuit 906, a speaker, and a microphone can provide an audio interface between the user and the terminal.
  • The audio circuit 906 can transmit the electrical signal converted from received audio data to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts a collected sound signal into an electrical signal, which is received by the audio circuit 906 and converted into audio data.
  • After the audio data is processed by the audio data output processor 908, it is sent, for example, to another terminal via the RF circuit 901, or the audio data is output to the memory 902 for further processing.
  • The audio circuit 906 may also include an earphone jack to provide communication between a peripheral earphone and the terminal.
  • WiFi is a short-distance wireless transmission technology.
  • Through the WiFi module 907, the terminal can help users send and receive e-mails, browse web pages, and access streaming media; it provides users with wireless broadband Internet access.
  • Although FIG. 9 shows the WiFi module 907, it is understandable that it is not a necessary component of the terminal and can be omitted as needed without changing the essence of the invention.
  • The processor 908 is the control center of the terminal. It uses various interfaces and lines to connect the various parts of the entire mobile phone, and performs the various functions of the terminal and processes data by running or executing the software programs and/or modules stored in the memory 902 and calling the data stored in the memory 902, thereby monitoring the mobile phone as a whole.
  • Optionally, the processor 908 may include one or more processing cores. Preferably, the processor 908 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may not be integrated into the processor 908.
  • The terminal also includes a power source 909 (such as a battery) for supplying power to the various components.
  • Preferably, the power source can be logically connected to the processor 908 through a power management system, so that functions such as charging, discharging, and power-consumption management are realized through the power management system.
  • The power supply 909 may also include one or more DC or AC power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and any other such components.
  • the terminal may also include a camera, a Bluetooth module, etc., which will not be repeated here.
  • Specifically, in this embodiment, the processor 908 in the terminal loads the executable files corresponding to the processes of one or more application programs into the memory 902 according to the following instructions, and the processor 908 runs the application programs stored in the memory 902 to realize various functions:
  • the terminal obtains the audio to be detected; performs human voice separation processing on the audio to be detected to obtain the audio segment to be processed; and then extracts the audio features of the audio segment to be processed;
  • the audio features are input into the trained human voice detection network model; whether the audio segment to be processed contains human voice is determined according to the output result of the trained human voice detection network model; and if it contains no human voice, it is determined that the audio to be detected is pure music.
  • the embodiment of the present invention performs pure music detection on audio fragments separated from the audio to be detected, without the need for whole song detection, the audio length to be detected is relatively short, and the accuracy of pure music detection can be improved.
  • an embodiment of the present invention provides a storage medium in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any pure music detection method provided in the embodiments of the present invention.
  • For example, the instructions can execute the steps of any pure music detection method described in the foregoing embodiments.
  • The storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
  • Since the instructions stored in the storage medium can execute the steps in any pure music detection method provided in the embodiments of the present invention, they can realize the beneficial effects that can be achieved by any pure music detection method provided in the embodiments of the present invention; for details, refer to the previous embodiments, which will not be repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Auxiliary Devices For Music (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A pure music detection method and apparatus, and a storage medium. The method includes: acquiring audio to be detected (201); performing human voice separation processing on the audio to be detected to obtain an audio segment to be processed (202); then extracting audio features of the audio segment to be processed, the audio features including a Mel (mel) feature and a vocal proportion feature (203); inputting the audio features into a trained human voice detection network model (204); obtaining an output result of the trained human voice detection network model (205); and if it is determined according to the output result that the audio segment to be processed does not contain human voice, determining that the audio to be detected is pure music (206). The method performs pure music detection on an audio segment separated from the audio to be detected; whole-song detection is not needed, the length of audio that must be examined is relatively short, and the accuracy of pure music detection can be improved.

Description

Pure music detection method and apparatus, and storage medium
This application claims priority to Chinese patent application No. 201910398945.6, entitled "Pure music detection method and apparatus, and storage medium", filed with the China Patent Office on May 14, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of audio processing, and in particular to a pure music detection method and apparatus, and a storage medium.
Background
Pure music refers to music that does not contain lyrics; such music narrates and expresses the author's emotions purely through the music itself. Pure music may be performed on natural instruments (such as piano, violin, guitar, etc.) or on electroacoustic instruments, so whether the audio contains human voice is usually used to distinguish whether the audio is pure music.
In the process of implementing the present invention, the inventor found that in the prior art, whether music is pure music usually has to be judged from the whole song. However, music containing scattered human voices is also commonly regarded as pure music, so the accuracy of pure music detection is not high.
技术问题
本发明实施例提供一种纯音乐检测方法、装置及存储介质,用于提高纯音乐检测的准确率。
技术解决方案
本发明实施例提供一种纯音乐检测方法、装置及存储介质,用于提高纯音乐检测的准确率。
An embodiment of the present invention provides a pure music detection method, the method including:
obtaining audio to be detected;
performing human voice separation processing on the audio to be detected to obtain an audio segment to be processed;
extracting audio features of the audio segment to be processed, the audio features including a mel feature and a vocal-ratio feature;
inputting the audio features into a trained human voice detection network model;
obtaining an output result of the trained human voice detection network model;
if it is determined from the output result that the audio segment to be processed contains no human voice, determining that the audio to be detected is pure music.
Correspondingly, an embodiment of the present invention further provides a pure music detection apparatus, the apparatus including:
a first obtaining unit, configured to obtain audio to be detected;
a processing unit, configured to perform human voice separation processing on the audio to be detected to obtain an audio segment to be processed;
an extraction unit, configured to extract audio features of the audio segment to be processed, the audio features including a mel feature and a vocal-ratio feature;
an input unit, configured to input the audio features into a trained human voice detection network model;
a first determining unit, configured to obtain an output result of the trained human voice detection network model;
a second determining unit, configured to determine that the audio to be detected is pure music when it is determined from the output result that the audio segment to be processed contains no human voice.
Optionally, in some embodiments, the apparatus further includes:
a second obtaining unit, configured to obtain a plurality of audio samples, the audio samples being audio samples known to be pure music or not;
a third determining unit, configured to determine audio features of the audio samples from the audio samples;
an adding unit, configured to add the audio features to a training sample set;
a training unit, configured to train a human voice detection network model with the training sample set to obtain the trained human voice detection network model.
Optionally, in some embodiments, the third determining unit is specifically configured to:
perform human voice separation processing on the audio samples to obtain audio segments;
extract audio features of the audio segments to determine the audio features.
Optionally, in some embodiments, the third determining unit is further specifically configured to:
perform the human voice separation processing on the audio samples through an Hourglass model.
Optionally, in some embodiments, when the audio feature is the mel feature, the extraction unit is specifically configured to:
perform an STFT transform on the audio segment to be processed to obtain an STFT spectrum;
convert the STFT spectrum to obtain a mel spectrum;
perform logarithm processing and first-order difference processing on the mel spectrum to obtain the mel feature.
Optionally, in some embodiments, the extraction unit is further specifically configured to:
extract the mel feature of the audio segment to be processed;
extract the vocal-ratio feature of the audio segment to be processed.
Optionally, in some embodiments, when the audio feature is the vocal-ratio feature, the extraction unit is specifically configured to:
normalize the audio segment to be processed to obtain a normalized audio segment to be processed;
perform silence filtering on the normalized audio segment to be processed to obtain a filtered audio segment to be processed;
determine the vocal-ratio feature from the duration corresponding to the filtered audio segment to be processed and the duration of the audio to be detected.
Optionally, in some embodiments, the processing unit is specifically configured to:
perform the human voice separation processing on the audio to be detected through an Hourglass model.
An embodiment of the present invention further provides a storage medium storing a plurality of instructions, the instructions being suitable for being loaded by a processor to execute the steps in any one of the pure music detection methods provided in the embodiments of the present invention.
Beneficial Effects
In the embodiments of the present invention, audio to be detected is obtained; human voice separation processing is performed on the audio to be detected to obtain an audio segment to be processed; audio features of the audio segment to be processed are then extracted, the audio features including a mel feature and a vocal-ratio feature; the audio features are input into a trained human voice detection network model; an output result of the trained human voice detection network model is obtained; and if it is determined from the output result that the audio segment to be processed contains no human voice, the audio to be detected is determined to be pure music. The embodiments of the present invention perform pure music detection on the audio segment separated from the audio to be detected, so whole-song detection is not needed; the audio to be detected is relatively short, and the accuracy of pure music detection can be improved.
Description of the Drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without creative work.
Fig. 1 is a system diagram of a pure music detection apparatus provided by an embodiment of the present invention.
Fig. 2 is a schematic flowchart of a pure music detection method provided by an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of the basic convolutional network in the human voice detection network model provided by an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of the encoding layer in the human voice detection network model provided by an embodiment of the present invention.
Fig. 5 is another schematic flowchart of a pure music detection method provided by an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a pure music detection apparatus provided by an embodiment of the present invention.
Fig. 7 is another schematic structural diagram of a pure music detection apparatus provided by an embodiment of the present invention.
Fig. 8 is a schematic structural diagram of a server provided by an embodiment of the present invention.
Fig. 9 is a schematic structural diagram of a terminal provided by an embodiment of the present invention.
Embodiments of the Invention
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative work fall within the scope of protection of the present invention.
The terms "first", "second", and the like in the present invention are used to distinguish different objects, not to describe a particular order. In addition, the terms "include" and "have", and any variants of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or modules is not limited to the listed steps or modules, but optionally further includes steps or modules that are not listed, or optionally further includes other steps or modules inherent to the process, method, product, or device.
Reference to an "embodiment" herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present invention. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
Pure music is music that contains no lyrics; whether music is pure music can be determined by whether the music contains a human voice.
However, in the prior art, whether audio is pure music is generally judged from the entire track. Some songs contain scattered human voices and are usually still regarded as pure music, so the accuracy of pure music detection is not high. Moreover, if a model is to be trained to judge whether an entire track is pure music, a problem arises: the length of a full track varies greatly, from tens of seconds to tens of minutes, and training a model on overly long audio makes the model difficult to train and optimize.
Therefore, the embodiments of the present invention provide a pure music detection method, apparatus, and storage medium. In this embodiment, audio to be detected is obtained; human voice separation processing is performed on the audio to be detected to obtain an audio segment to be processed; audio features of the audio segment to be processed are then extracted; whether the audio segment to be processed contains a human voice is detected from the audio features; and if it contains no human voice, the audio to be detected is determined to be pure music. The embodiments of the present invention perform pure music detection on the audio segment separated from the audio to be detected, so whole-song detection is not needed; the audio to be detected is relatively short, and the accuracy of pure music detection can be improved.
Specifically, in this embodiment, when detecting from the audio features whether the audio segment to be processed contains a human voice, the audio features may be input into a trained human voice detection network model, and whether the audio segment to be processed contains a human voice is then determined from the output result of the trained human voice detection network model.
When training the human voice detection network model, the audio features separated and extracted from the audio samples are used as training samples. The audio features are relatively short, so when audio features are used as training samples, the model is relatively easy to train and optimize.
The pure music detection method provided by the embodiments of the present invention can be implemented in a pure music detection apparatus, and the pure music detection apparatus can be integrated in a device such as a terminal or a server.
Please refer to Fig. 1, which is a system diagram of a pure music detection apparatus provided by an embodiment of the present invention; the system can be used for model training and for pure music detection. The model provided by the embodiment of the present invention is a deep learning network model, for example a human voice detection network model. When training the model, training audio samples are obtained in advance; human voice separation is performed on the audio samples to obtain audio segments; the audio features of the audio samples are determined from the audio segments; the audio features are added to a training sample set; and the human voice detection network model is trained with the training sample set to obtain the trained human voice detection network model. When performing pure music detection, the audio to be detected is first obtained; human voice separation processing is performed on the audio to be detected to obtain an audio segment to be processed; audio features of the audio segment to be processed are then extracted; finally, the audio features are input into the trained human voice detection network model, and whether the audio segment to be processed contains a human voice is determined from the output result of the trained human voice detection network model. If it contains no human voice, the audio to be detected is determined to be pure music; otherwise, it is not pure music.
Detailed descriptions are given below. The order of description of the following embodiments does not limit the order of specific implementation.
Please refer to Fig. 2, which is a schematic flowchart of a pure music detection method provided by an embodiment of the present invention. The method includes:
201. Obtain audio to be detected.
In this embodiment, when it is necessary to detect whether audio to be detected is pure music, the audio to be detected first needs to be input into a pure music detection apparatus. The pure music detection apparatus in the present invention can be integrated in a device such as a terminal or a server, where the device includes, but is not limited to, a computer, a smart TV, a smart speaker, a mobile phone, a tablet computer, and the like.
The pure music detection apparatus in this embodiment includes a trained human voice detection network model, which is a deep learning network model. Before obtaining the audio to be detected, this embodiment needs to train the human voice detection network first, so that the human voice detection network can be used to detect whether the audio to be detected contains a human voice.
Specifically, the human voice detection network can be trained as follows:
a. Obtain a plurality of audio samples.
First, a plurality of audio samples need to be obtained. The audio samples are audio samples known to be pure music or not, and include long audio samples and short audio samples: a long audio sample is audio whose duration reaches tens of minutes, and a short audio sample is audio whose duration is only a few seconds. The duration of a long audio sample is greater than that of a short audio sample; the specific durations of long and short audio samples are not limited here.
b. Determine the audio features of the audio samples from the audio samples.
Specifically, human voice separation processing needs to be performed on the audio samples to obtain audio segments; the audio features of the audio segments are then extracted to determine the audio features.
More specifically, in some embodiments, the audio samples need to be subjected to human voice separation processing through an Hourglass model and separated into audio segments and pure music segments, where the audio segments mentioned in this embodiment are the "vocal segments" extracted from the audio samples.
The Hourglass model is trained on the DSD100 data set (human voice plus other instruments such as drums, bass, and guitar); it supports blind source separation with arbitrarily many sources and can separate the human voice from the instrument sounds in audio.
It should be noted that when human voice separation is performed through an Hourglass model in the prior art, some instruments that sound similar to the human voice are usually misidentified as human voice and retained, so the separation is insufficient and the accuracy is inadequate. In the embodiment of the present invention, the Hourglass model needs to be suitably trained to raise its recall rate; that is, the Hourglass model in the embodiment of the present invention accepts the misidentification of some instruments as the cost of recalling all the human voice in the audio, so that no human voice is missed and a high recall rate is guaranteed.
The original audio should be merged to mono and downsampled to 8 kHz before being fed into the Hourglass model; the target rate is not specifically limited here, and besides 8 kHz the audio can be downsampled to other values depending on the circumstances.
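As a concrete illustration of this preprocessing, a minimal Python sketch using librosa follows; the file name is hypothetical, and the 8 kHz target rate simply follows the text:

    import librosa

    # Merge the original audio to mono and downsample to 8 kHz before
    # feeding it to the Hourglass separation model; other rates are possible.
    audio, sr = librosa.load("track_to_check.mp3", sr=8000, mono=True)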
Since the recall rate of the Hourglass model in the present invention is relatively high, an audio segment may be a true "vocal segment" (in which case the label corresponding to the sample should be non-pure-music) or a false "vocal segment" (that is, instrument sound misidentified as human voice, in which case the label corresponding to the sample should be pure music); that is, it may or may not contain a human voice.
After the audio segments are extracted from the audio samples through the Hourglass model, feature extraction needs to be performed on the audio segments output by the Hourglass model to obtain the audio features, where the audio features may include a mel feature and a vocal-ratio feature (Vocal_ratio).
The mel feature is extracted as follows: an STFT transform is performed on the audio segment to obtain an STFT spectrum; the STFT spectrum is converted to obtain a mel spectrum; and logarithm processing and first-order difference processing are performed on the mel spectrum to obtain the mel feature. Specifically, the logarithm processing and first-order difference processing are: (1) add 1 to the mel spectrum and take the logarithm, that is, compute log(1 + x), where x is the mel spectrum; (2) take the first-order difference of the result of (1) along the time direction, take its absolute value, and superimpose it on the result of (1) to form the mel feature. In some embodiments the result of (1) also needs to be scaled in time to a fixed length, which may be 2000 frames; the specific length is not limited here.
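The extraction just described might be sketched in Python as follows. This is a sketch, not the patented implementation: the number of mel bands is an assumed parameter, "superimpose" is read here as element-wise addition (channel stacking is an equally plausible reading), and the scaling of the time axis to a fixed length is omitted.

    import numpy as np
    import librosa

    def extract_mel_feature(segment, sr=8000, n_mels=128):
        # STFT of the separated vocal segment -> STFT spectrum
        stft = np.abs(librosa.stft(segment))
        # Convert the STFT spectrum into a mel spectrum
        mel = librosa.feature.melspectrogram(S=stft ** 2, sr=sr, n_mels=n_mels)
        # (1) add 1 to the mel spectrum and take the logarithm: log(1 + x)
        log_mel = np.log1p(mel)
        # (2) absolute first-order difference of (1) along the time axis,
        # left-padded by one frame so both terms have the same shape
        diff = np.pad(np.abs(np.diff(log_mel, axis=1)), ((0, 0), (1, 0)))
        # Superimpose (2) onto (1) to form the mel feature; scaling the
        # time axis to a fixed length (e.g. 2000 frames) is omitted here
        return log_mel + diff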
The mel feature is a spectral feature obtained with a filter bank that matches the auditory characteristics of the human ear, and it reflects the time-frequency structure of the audio. The mel feature contains the information for judging whether audio is pure music or human voice, so it can serve as a basis for pure music judgment, and we can use the mel feature to train the model.
The vocal-ratio feature is extracted as follows: the audio segment is normalized (to 0-1) to obtain a normalized audio segment; silence filtering is performed on the normalized audio segment (the filtering threshold may be set to 20 dB, or to other values depending on the circumstances; the specific filtering threshold is not limited here) to obtain a filtered audio segment; and the vocal-ratio feature is determined from the duration corresponding to the filtered audio segment and the duration of the full track, that is, Vocal_ratio = non-silent vocal duration / full-track duration.
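A corresponding Python sketch of the vocal-ratio computation, assuming librosa's silence splitting as the filtering step; the 20 dB threshold follows the text, and reading it as librosa's top_db parameter is an assumption:

    import numpy as np
    import librosa

    def vocal_ratio(segment, full_track_duration_s, sr=8000, top_db=20):
        # Normalize the separated vocal segment (peak magnitude scaled
        # to 1, one reading of the 0-1 normalization in the text)
        peak = np.max(np.abs(segment))
        normalized = segment / peak if peak > 0 else segment
        # Filter out silence and keep the non-silent intervals
        intervals = librosa.effects.split(normalized, top_db=top_db)
        voiced_s = sum(end - start for start, end in intervals) / sr
        # Vocal_ratio = non-silent vocal duration / full-track duration
        return voiced_s / full_track_duration_s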
The value of the vocal-ratio feature of pure music is usually low, while that of non-pure music is usually high, so we can use the vocal-ratio feature to train the model.
In some embodiments, if the model does not need to be trained with the vocal-ratio feature, the vocal-ratio feature may be left unextracted.
c. Add the audio features to the training sample set.
When the audio features of the above audio samples have been extracted, the audio features need to be added to the training samples. Since the above audio samples are samples known to be pure music or not, we can attach a label to the audio features extracted from each audio sample to indicate whether the audio features reflect pure music or non-pure music, so one sample in the training sample set consists of (mel feature, vocal-ratio feature, label).
d. Train the human voice detection network model with the training sample set to obtain the trained human voice detection network model.
After the training sample set is obtained, the human voice detection network model is trained with the training sample set. Specifically, a group of audio features is input into a preset human voice detection network model to predict a result, and the human voice detection network model is then trained with the predicted result and the label of that group of audio features.
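One such training step might look like the following PyTorch-style sketch; the binary cross-entropy loss and the label convention (1 = pure music, i.e. no real human voice) are assumptions, since the text does not name them:

    import torch.nn.functional as F

    def train_step(model, optimizer, mel, vocal_ratio, label):
        # Predict a result from one group of audio features, then fit
        # the prediction to that group's label (assumed: 1 = pure music)
        optimizer.zero_grad()
        prediction = model(mel, vocal_ratio)       # probability in [0, 1]
        loss = F.binary_cross_entropy(prediction, label)
        loss.backward()                            # adjust network weights
        optimizer.step()
        return loss.item()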
When training the human voice detection network model, the audio features separated and extracted from the audio samples are used as training samples. The audio features are relatively short, so when audio features are used as training samples, the model is relatively easy to train and optimize.
202. Perform human voice separation processing on the audio to be detected to obtain an audio segment to be processed.
In some embodiments, the audio to be detected needs to be subjected to human voice separation processing through an Hourglass model and separated into an audio segment and a pure music segment, where the audio segment mentioned in this embodiment is the "vocal segment" extracted from the audio to be detected.
The Hourglass model in this step accepts the misidentification of some instruments as the cost of recalling all the human voice in the audio, so that no human voice is missed and a high recall rate is guaranteed.
Since the recall rate of the Hourglass model in the present invention is relatively high, the audio segment to be processed may be a true "vocal segment" or a false one; that is, it may or may not contain a human voice. The subsequent steps are then needed to judge whether the audio segment to be processed is a true vocal segment.
203. Extract the audio features of the audio segment to be processed.
The audio features in the embodiments of the present invention can be used to indicate whether the audio is pure music or human voice.
After the audio segment to be processed is extracted from the audio to be detected through the Hourglass model, feature extraction needs to be performed on the audio segment to be processed output by the Hourglass model to obtain the audio features, where the audio features may include the mel feature and the vocal-ratio feature.
The mel feature is extracted as follows: a short-time Fourier transform (STFT) is performed on the audio segment to be processed to obtain an STFT spectrum; the STFT spectrum is converted to obtain a mel spectrum; and logarithm processing and first-order difference processing are performed on the mel spectrum to obtain the mel feature. Specifically: (1) add 1 to the mel spectrum and take the logarithm, that is, compute log(1 + x), where x is the mel spectrum; (2) take the first-order difference of the result of (1) along the time direction, take its absolute value, and superimpose it on the result of (1) to form the mel feature. In some embodiments the result of (1) also needs to be scaled in time to a fixed length, which may be 2000 frames; the specific length is not limited here.
The vocal-ratio feature is extracted as follows: the audio segment to be processed is normalized (to 0-1) to obtain a normalized audio segment to be processed; silence filtering is performed on the normalized audio segment to be processed (the filtering threshold may be set to 20 dB, or to other values depending on the circumstances; the specific filtering threshold is not limited here) to obtain a filtered audio segment to be processed; and the vocal-ratio feature is determined from the duration corresponding to the filtered audio segment to be processed and the duration of the audio to be detected, that is, Vocal_ratio = non-silent vocal duration / full-track duration.
In some embodiments, depending on the type of the trained human voice detection network model, the vocal-ratio feature may be left unextracted: if the trained human voice detection network model was trained on both the mel feature and the vocal-ratio feature, the vocal-ratio feature needs to be extracted; if it was trained only on the mel feature, the vocal-ratio feature does not need to be extracted.
It should be noted that, besides the mel feature and the vocal-ratio feature, the audio features in the embodiments of the present invention may also include other audio features that can indicate whether audio is human voice or pure music; the specific feature types are not limited here.
204. Input the audio features into the trained human voice detection network model.
In some embodiments, after the audio features of the audio to be detected are obtained, the audio features are input into the trained human voice detection network model, and the output result for the audio features is obtained, where the output result can be used to judge whether the audio segment to be processed contains a human voice.
205. Obtain the output result of the trained human voice detection network model.
In some embodiments, if the output result is greater than 0.5, it is determined that the audio segment to be processed contains no human voice; if the output result is less than 0.5, it is determined that the audio segment to be processed contains a human voice. For example, if the output result is 1, it is determined that the audio segment to be processed contains no human voice; if it is 0, it is determined that the audio segment to be processed contains a human voice.
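Expressed as code, the decision rule above is simply:

    def contains_no_human_voice(output, threshold=0.5):
        # Outputs above the threshold are read as "no human voice",
        # so the audio to be detected will be classified as pure music
        return output > threshold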
In some embodiments, Vocal_ratio can also be output additionally. Statistics show that the vocal_ratio of pure music is usually low, while that of non-pure music is generally high, so vocal_ratio can serve as a physically meaningful value for manual reference.
206. If it is determined from the output result that the audio segment to be processed contains no human voice, determine that the audio to be detected is pure music.
In this embodiment, when it is judged that the audio segment to be processed contains no human voice, the audio to be detected is determined to be pure music; otherwise, the audio to be detected is determined not to be pure music.
In the embodiment of the present invention, audio to be detected is obtained; human voice separation processing is performed on the audio to be detected to obtain an audio segment to be processed; audio features of the audio segment to be processed are then extracted, the audio features including a mel feature and a vocal-ratio feature; the audio features are input into the trained human voice detection network model; the output result of the trained human voice detection network model is obtained; and if it is determined from the output result that the audio segment to be processed contains no human voice, the audio to be detected is determined to be pure music. The embodiment of the present invention performs pure music detection on the audio segment separated from the audio to be detected, so whole-song detection is not needed; the audio to be detected is relatively short, and the accuracy of pure music detection can be improved.
The method described in the preceding embodiments is illustrated in further detail below by way of example.
In this embodiment, the description takes as an example the pure music detection apparatus being specifically integrated in a server.
(1) Model training.
First, the server obtains a large number of audio samples through multiple channels. The audio samples are audio samples known to be pure music or not, and include long audio samples and short audio samples: a long audio sample is audio whose duration reaches tens of minutes, and a short audio sample is audio whose duration is only a few seconds. The duration of a long audio sample is greater than that of a short audio sample; the specific durations of long and short audio samples are not limited here.
Then, the audio features of the audio samples are determined from the audio samples. Specifically, human voice separation processing needs to be performed on the audio samples to obtain audio segments; the audio features of the audio segments are then extracted to determine the audio features.
More specifically, in some embodiments, the audio samples need to be subjected to human voice separation processing through an Hourglass model and separated into audio segments and pure music segments, where the audio segments mentioned in this embodiment are the "vocal segments" extracted from the audio samples.
It should be noted that when human voice separation is performed through an Hourglass model in the prior art, some instruments that sound similar to the human voice are usually misidentified as human voice and retained, so the separation is insufficient and the accuracy is inadequate. In the embodiment of the present invention, the Hourglass model needs to be suitably trained to raise its recall rate; that is, the Hourglass model in the embodiment of the present invention accepts the misidentification of some instruments as the cost of recalling all the human voice in the audio, so that no human voice is missed and a high recall rate is guaranteed.
The original audio should be merged to mono and downsampled to 8 kHz before being fed into the Hourglass model; the target rate is not specifically limited here, and besides 8 kHz the audio can be downsampled to other values depending on the circumstances.
Since the recall rate of the Hourglass model in the present invention is relatively high, an audio segment may be a true "vocal segment" or a false one; that is, it may or may not contain a human voice.
After the audio segments are extracted from the audio samples through the Hourglass model, feature extraction needs to be performed on the audio segments output by the Hourglass model to obtain the audio features, where the audio features may include the mel feature and the vocal-ratio feature (Vocal_ratio).
The mel feature is extracted as follows: an STFT transform is performed on the audio segment to obtain an STFT spectrum; the STFT spectrum is converted to obtain a mel spectrum; and logarithm processing and first-order difference processing are performed on the mel spectrum to obtain the mel feature. Specifically: (1) add 1 to the mel spectrum and take the logarithm, that is, compute log(1 + x), where x is the mel spectrum; (2) take the first-order difference of the result of (1) along the time direction, take its absolute value, and superimpose it on the result of (1) to form the mel feature. In some embodiments the result of (1) also needs to be scaled in time to a fixed length, which may be 2000 frames; the specific length is not limited here.
The mel feature is a spectral feature obtained with a filter bank that matches the auditory characteristics of the human ear, and it reflects the time-frequency structure of the audio. The mel feature contains the information for judging whether audio is pure music or human voice, so it can serve as a basis for pure music judgment, and we can use the mel feature to train the model.
The vocal-ratio feature is extracted as follows: the audio segment is normalized (to 0-1) to obtain a normalized audio segment; silence filtering is performed on the normalized audio segment (the filtering threshold may be set to 20 dB, or to other values depending on the circumstances; the specific filtering threshold is not limited here) to obtain a filtered audio segment; and the vocal-ratio feature is determined from the duration corresponding to the filtered audio segment and the duration of the full track, that is, Vocal_ratio = non-silent vocal duration / full-track duration.
The value of the vocal-ratio feature of pure music is usually low, while that of non-pure music is usually high, so we can use the vocal-ratio feature to train the model.
In some embodiments, if the model does not need to be trained with the vocal-ratio feature, the vocal-ratio feature may be left unextracted; in that case, only the mel feature needs to be extracted in the subsequent pure music detection.
Of course, in typical embodiments, besides training the model with the mel feature, the vocal-ratio feature is also used for training, to further improve the accuracy of model detection.
Next, the audio features are added to the training sample set. More specifically, when the audio features of the above audio samples have been extracted, the audio features need to be added to the training samples. Since the above audio samples are samples known to be pure music or not, we can attach a label to the audio features extracted from each audio sample to indicate whether the audio features reflect pure music or non-pure music, so one sample in the training sample set consists of (mel feature, vocal-ratio feature, label).
Finally, the human voice detection network model is trained with the training sample set to obtain the trained human voice detection network model. That is, after the training sample set is obtained, the human voice detection network model is trained with it: a group of audio features is input into a preset human voice detection network model to predict a result, and the human voice detection network model is then trained with the predicted result and the label of that group of audio features.
When training the human voice detection network model, the audio features separated and extracted from the audio samples are used as training samples. The audio features are relatively short, so when audio features are used as training samples, the model is relatively easy to train and optimize.
In some embodiments, the human voice detection network model in the embodiment of the present invention consists of a basic convolutional network + an encoding layer + a feature fusion layer + a fully connected classification layer.
In some embodiments, the basic convolutional network is chosen as a network without dilation coefficients; a schematic diagram of one basic convolutional layer is shown in Fig. 3. The feature map formed by stacking multiple basic convolutional layers is a matrix of size (timestep, feature); to enable the subsequent classification, it needs to be converted into a vector. We designed an encoding layer (encoder) to accomplish this task reasonably; its structure is shown in Fig. 4. The encoding layer learns the importance of the data at each time step (a softmax mask) through a convolution with only one kernel, then multiplies the data row by row by these importance values, and sums the products along the time axis to obtain a feature vector. This technique effectively encodes the features distributed over the time steps into one point, hence the name encoding layer.
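A minimal PyTorch sketch of such an encoding layer, under the assumption that the feature map is laid out as (batch, features, timestep):

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        """Encodes a (timestep, feature) map into one feature vector: a
        convolution with a single kernel scores each time step, a softmax
        turns the scores into an importance mask, and the mask-weighted
        features are summed along the time axis."""

        def __init__(self, n_features):
            super().__init__()
            self.score = nn.Conv1d(n_features, 1, kernel_size=1)

        def forward(self, x):                 # x: (batch, features, timestep)
            mask = torch.softmax(self.score(x), dim=-1)  # softmax mask
            return (x * mask).sum(dim=-1)                # (batch, features)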
Specifically, after the training sample set is constructed, the samples in the training sample set are input into the human voice detection network model: the mel feature is first input into the basic convolutional network of the human voice detection network model, followed by fixed-length encoding; the vocal-ratio feature is then added at the feature fusion layer; after feature fusion, the fully connected classification layer outputs a classification result; the classification result is compared with the label corresponding to the sample, and the weights of the basic convolutional network are adjusted according to the comparison error until the model converges.
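Reusing the Encoder sketch above, the whole network described in this paragraph might be assembled as below. The depth, filter counts, and classifier width are invented for illustration (Figs. 3 and 4 define the actual structure), but the flow (basic convolutions without dilation, fixed-length encoding, fusion with the vocal-ratio feature, fully connected classification) follows the text:

    class VocalDetectionNet(nn.Module):
        """Basic convolutional network -> encoding layer -> feature fusion
        with the vocal-ratio feature -> fully connected classifier."""

        def __init__(self, n_mels=128, n_filters=64):
            super().__init__()
            self.conv = nn.Sequential(        # basic convolutions, no dilation
                nn.Conv1d(n_mels, n_filters, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv1d(n_filters, n_filters, kernel_size=3, padding=1),
                nn.ReLU(),
            )
            self.encoder = Encoder(n_filters)  # fixed-length encoding
            self.classifier = nn.Sequential(   # fully connected classification
                nn.Linear(n_filters + 1, 32),
                nn.ReLU(),
                nn.Linear(32, 1),
                nn.Sigmoid(),
            )

        def forward(self, mel, vocal_ratio):   # mel: (batch, n_mels, timestep)
            encoded = self.encoder(self.conv(mel))
            fused = torch.cat([encoded, vocal_ratio.unsqueeze(1)], dim=1)
            return self.classifier(fused).squeeze(1)   # value in [0, 1]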
(2) Pure music detection.
As shown in Fig. 5, based on the above trained human voice detection network model, another flow of the pure music detection method can be as follows:
501. Obtain audio to be detected.
In this embodiment, when it is necessary to detect whether audio to be detected is pure music, the audio to be detected first needs to be input into a pure music detection apparatus. The pure music detection apparatus in the present invention can be integrated in a device such as a terminal or a server, where the device includes, but is not limited to, a computer, a smart TV, a smart speaker, a mobile phone, a tablet computer, and the like.
502. Perform human voice separation processing on the audio to be detected through the Hourglass model to obtain an audio segment to be processed.
It should be noted that when human voice separation is performed through an Hourglass model in the prior art, some instruments that sound similar to the human voice are usually misidentified as human voice and retained, so the separation is insufficient and the accuracy is inadequate. In the embodiment of the present invention, the Hourglass model needs to be suitably trained to raise its recall rate; that is, the Hourglass model in the embodiment of the present invention accepts the misidentification of some instruments as the cost of recalling all the human voice in the audio, so that no human voice is missed and a high recall rate is guaranteed.
The original audio should be merged to mono and downsampled to 8 kHz before being fed into the Hourglass model; the target rate is not specifically limited here, and besides 8 kHz the audio can be downsampled to other values depending on the circumstances.
Since the recall rate of the Hourglass model in the present invention is relatively high, the audio segment to be processed may be a true "vocal segment" or a false one; that is, it may or may not contain a human voice. The subsequent steps are then needed to judge whether the audio segment to be processed is a true vocal segment.
503. Extract the mel feature and the vocal-ratio feature of the audio segment to be processed.
After the audio segment to be processed is extracted from the audio to be detected through the Hourglass model, feature extraction needs to be performed on the audio segment to be processed output by the Hourglass model to obtain the audio features, where the audio features may include the mel feature and the vocal-ratio feature.
The mel feature is extracted as follows: an STFT transform is performed on the audio segment to be processed to obtain an STFT spectrum; the STFT spectrum is converted to obtain a mel spectrum; and logarithm processing and first-order difference processing are performed on the mel spectrum to obtain the mel feature. Specifically: (1) add 1 to the mel spectrum and take the logarithm, that is, compute log(1 + x), where x is the mel spectrum; (2) take the first-order difference of the result of (1) along the time direction, take its absolute value, and superimpose it on the result of (1) to form the mel feature. In some embodiments the result of (1) also needs to be scaled in time to a fixed length, which may be 2000 frames; the specific length is not limited here.
The vocal-ratio feature is extracted as follows: the audio segment to be processed is normalized (to 0-1) to obtain a normalized audio segment to be processed; silence filtering is performed on the normalized audio segment to be processed (the filtering threshold may be set to 20 dB, or to other values depending on the circumstances; the specific filtering threshold is not limited here) to obtain a filtered audio segment to be processed; and the vocal-ratio feature is determined from the duration corresponding to the filtered audio segment to be processed and the duration of the audio to be detected, that is, Vocal_ratio = non-silent vocal duration / full-track duration.
In some embodiments, depending on the type of the trained human voice detection network model, the vocal-ratio feature may be left unextracted: if the trained human voice detection network model was trained on both the mel feature and the vocal-ratio feature, the vocal-ratio feature needs to be extracted; if it was trained only on the mel feature, the vocal-ratio feature does not need to be extracted.
It should be noted that, besides the mel feature and the vocal-ratio feature, the audio features in the embodiments of the present invention may also include other audio features that can indicate whether audio is human voice or pure music; the specific feature types are not limited here.
504. Input the audio features into the trained human voice detection network model.
Specifically, the audio features are input into the built-in trained human voice detection network model, and whether the audio segment to be processed contains a human voice is then judged from the output result of the human voice detection network model.
That is, in this embodiment, audio to be detected first needs to be obtained; human voice separation processing is then performed on the audio to be detected through the Hourglass model to obtain the "vocal data"; the mel feature and the vocal-ratio feature are extracted from the "vocal data"; the mel feature is input into the basic convolutional network of the above trained human voice detection network model, followed by fixed-length encoding; the vocal-ratio feature is then added at the feature fusion layer; and after feature fusion, the fully connected classification layer outputs the classification result, that is, the result of whether the "vocal data" contains a real human voice.
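The flow of steps 501-504 could then be chained as in the following sketch; separate_vocals is a hypothetical stand-in for the trained Hourglass model (not reproduced here), and extract_mel_feature and vocal_ratio are the sketches from the training section:

    import librosa
    import torch

    def detect_pure_music(path, model, separate_vocals):
        # 501/502: load as mono 8 kHz, then separate the vocal segment
        audio, sr = librosa.load(path, sr=8000, mono=True)
        segment = separate_vocals(audio)
        # 503: extract the mel feature and the vocal-ratio feature
        mel = torch.tensor(extract_mel_feature(segment, sr)).unsqueeze(0).float()
        ratio = torch.tensor([vocal_ratio(segment, len(audio) / sr, sr)]).float()
        # 504: run the trained human voice detection network model
        with torch.no_grad():
            output = model(mel, ratio).item()
        return output > 0.5      # True -> no human voice -> pure music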
505. Determine from the output result of the trained human voice detection network model whether the audio segment to be processed contains a human voice; if it does not, execute step 506; if it does, execute step 507.
In some embodiments, if the output result is greater than 0.5, it is determined that the audio segment to be processed contains no human voice; if the output result is less than 0.5, it is determined that the audio segment to be processed contains a human voice. For example, if the output result is 1, it is determined that the audio segment to be processed contains no real human voice; if it is 0, it is determined that the audio segment to be processed contains a real human voice.
In some embodiments, Vocal_ratio can also be output additionally. Statistics show that the vocal_ratio of pure music is usually low, while that of non-pure music is generally high, so vocal_ratio can serve as a physically meaningful value for manual reference.
506. Determine that the audio to be detected is pure music.
If the audio segment to be processed contains no human voice, the audio to be detected likewise contains no human voice (because the audio segments that might be human voice have already been separated out for detection); in this case, it can be determined that the audio to be detected is pure music.
507. Determine that the audio to be detected is not pure music.
If the audio segment to be processed contains a human voice, then, since the audio segments extracted from the audio to be detected contain the human voice, the audio to be detected also contains a human voice; in this case, it can be determined that the audio to be detected is not pure music.
In the embodiment of the present invention, audio to be detected is obtained; human voice separation processing is performed on the audio to be detected to obtain an audio segment to be processed; audio features of the audio segment to be processed are then extracted, the audio features including a mel feature and a vocal-ratio feature; the audio features are input into the trained human voice detection network model; the output result of the trained human voice detection network model is obtained; and if it is determined from the output result that the audio segment to be processed contains no human voice, the audio to be detected is determined to be pure music. The embodiment of the present invention performs pure music detection on the audio segment separated from the audio to be detected, so whole-song detection is not needed; the audio to be detected is relatively short, and the accuracy of pure music detection can be improved.
An embodiment of the present invention further provides a pure music detection apparatus. As shown in Fig. 6, which is a schematic structural diagram of a pure music detection apparatus provided by an embodiment of the present invention, the pure music detection apparatus 600 may include a first obtaining unit 601, a processing unit 602, an extraction unit 603, an input unit 604, a first determining unit 605, and a second determining unit 606, where:
the first obtaining unit 601 is configured to obtain audio to be detected;
the processing unit 602 is configured to perform human voice separation processing on the audio to be detected to obtain an audio segment to be processed;
the extraction unit 603 is configured to extract audio features of the audio segment to be processed, the audio features including a mel feature and a vocal-ratio feature;
the input unit 604 is configured to input the audio features into a trained human voice detection network model;
the first determining unit 605 is configured to obtain an output result of the trained human voice detection network model;
the second determining unit 606 is configured to determine that the audio to be detected is pure music when it is determined from the output result that the audio segment to be processed contains no human voice.
As shown in Fig. 7, in some embodiments, the apparatus 600 further includes:
a second obtaining unit 607, configured to obtain a plurality of audio samples, the audio samples being audio samples known to be pure music or not;
a third determining unit 608, configured to determine audio features of the audio samples from the audio samples;
an adding unit 609, configured to add the audio features to a training sample set;
a training unit 610, configured to train a human voice detection network model with the training sample set to obtain the trained human voice detection network model.
In some embodiments, the third determining unit 608 is specifically configured to:
perform human voice separation processing on the audio samples to obtain audio segments;
extract audio features of the audio segments to determine the audio features.
Optionally, in some embodiments, the third determining unit 608 is further specifically configured to:
perform the human voice separation processing on the audio samples through an Hourglass model.
In some embodiments, when the audio feature is the mel feature, the extraction unit 603 is specifically configured to:
perform an STFT transform on the audio segment to be processed to obtain an STFT spectrum;
convert the STFT spectrum to obtain a mel spectrum;
perform logarithm processing and first-order difference processing on the mel spectrum to obtain the mel feature.
In some embodiments, the extraction unit 603 is further specifically configured to:
extract the mel feature of the audio segment to be processed;
extract the vocal-ratio feature of the audio segment to be processed.
Optionally, in some embodiments, when the audio feature is the vocal-ratio feature, the extraction unit 603 is further specifically configured to:
normalize the audio segment to be processed to obtain a normalized audio segment to be processed;
perform silence filtering on the normalized audio segment to be processed to obtain a filtered audio segment to be processed;
determine the vocal-ratio feature from the duration corresponding to the filtered audio segment to be processed and the duration of the audio to be detected.
Optionally, in some embodiments, the processing unit 602 is specifically configured to:
perform the human voice separation processing on the audio to be detected through an Hourglass model.
In the embodiment of the present invention, the first obtaining unit 601 obtains audio to be detected; the processing unit 602 performs human voice separation processing on the audio to be detected to obtain an audio segment to be processed; the extraction unit 603 then extracts the audio features of the audio segment to be processed, the audio features including a mel feature and a vocal-ratio feature; the input unit 604 inputs the audio features into the trained human voice detection network model; the first determining unit 605 obtains the output result of the trained human voice detection network model; and if it is determined from the output result that the audio segment to be processed contains no human voice, the second determining unit 606 determines that the audio to be detected is pure music. The embodiment of the present invention performs pure music detection on the audio segment separated from the audio to be detected, so whole-song detection is not needed; the audio to be detected is relatively short, and the accuracy of pure music detection can be improved.
An embodiment of the present invention further provides a server. Fig. 8 shows a schematic structural diagram of the server involved in the embodiment of the present invention. Specifically:
the server may include components such as a processor 801 with one or more processing cores, a memory 802 with one or more computer-readable storage media, a power source 803, and an input unit 804. Those skilled in the art can understand that the server structure shown in Fig. 8 does not constitute a limitation on the server; it may include more or fewer components than shown, combine certain components, or use a different component arrangement. Here:
the processor 801 is the control center of the server; it connects all parts of the entire server with various interfaces and lines, and performs the various functions of the server and processes data by running or executing the software programs and/or modules stored in the memory 802 and invoking the data stored in the memory 802, thereby monitoring the server as a whole. Optionally, the processor 801 may include one or more processing cores; preferably, the processor 801 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the above modem processor may also not be integrated into the processor 801.
The memory 802 can be used to store software programs and modules, and the processor 801 executes various functional applications and data processing by running the software programs and modules stored in the memory 802. The memory 802 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, the application programs required by at least one function (such as a sound playback function and an image playback function), and the like; the data storage area may store data created according to the use of the server, and the like. In addition, the memory 802 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. Correspondingly, the memory 802 may also include a memory controller to provide the processor 801 with access to the memory 802.
The server also includes a power source 803 that supplies power to the various components. Preferably, the power source 803 can be logically connected to the processor 801 through a power management system, so that functions such as charging, discharging, and power-consumption management are realized through the power management system. The power source 803 may also include one or more DC or AC power supplies, a recharging system, a power-failure detection circuit, a power converter or inverter, a power status indicator, and any other such components.
The server may further include an input unit 804, which can be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
Although not shown, the server may also include a display unit and the like, which will not be described again here. Specifically, in this embodiment, the processor 801 in the server loads the executable files corresponding to the processes of one or more application programs into the memory 802 according to the following instructions, and the processor 801 runs the application programs stored in the memory 802 to realize various functions, as follows:
obtaining audio to be detected; performing human voice separation processing on the audio to be detected to obtain an audio segment to be processed; extracting audio features of the audio segment to be processed; inputting the audio features into a trained human voice detection network model; determining, from the output result of the trained human voice detection network model, whether the audio segment to be processed contains a human voice; and if it contains no human voice, determining that the audio to be detected is pure music.
For the above operations, refer to the preceding embodiments; details are not repeated here.
As can be seen from the above, the server provided by this embodiment obtains audio to be detected; performs human voice separation processing on the audio to be detected to obtain an audio segment to be processed; then extracts the audio features of the audio segment to be processed; detects from the audio features whether the audio segment to be processed contains a human voice; and if it contains no human voice, determines that the audio to be detected is pure music. The embodiment of the present invention performs pure music detection on the audio segment separated from the audio to be detected, so whole-song detection is not needed; the audio to be detected is relatively short, and the accuracy of pure music detection can be improved.
Correspondingly, an embodiment of the present invention further provides a terminal. As shown in Fig. 9, the terminal may include components such as a radio frequency (RF) circuit 901, a memory 902 including one or more computer-readable storage media, an input unit 903, a display unit 904, a sensor 905, an audio circuit 906, a wireless fidelity (WiFi) module 907, a processor 908 including one or more processing cores, and a power source 909. Those skilled in the art can understand that the terminal structure shown in Fig. 9 does not constitute a limitation on the terminal; it may include more or fewer components than shown, combine certain components, or use a different component arrangement. Here:
the RF circuit 901 can be used to receive and send signals while receiving and sending information or during a call; in particular, after the downlink information of a base station is received, it is handed over to one or more processors 908 for processing; in addition, uplink data is sent to the base station. Generally, the RF circuit 901 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a subscriber identity module (SIM) card, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 901 can also communicate with networks and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and the like.
The memory 902 can be used to store software programs and modules, and the processor 908 executes various functional applications and data processing by running the software programs and modules stored in the memory 902. The memory 902 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, the application programs required by at least one function (such as a sound playback function and an image playback function), and the like; the data storage area may store data created according to the use of the terminal (such as audio data and a phone book), and the like. In addition, the memory 902 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. Correspondingly, the memory 902 may also include a memory controller to provide the processor 908 and the input unit 903 with access to the memory 902.
The input unit 903 can be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. Specifically, in one embodiment, the input unit 903 may include a touch-sensitive surface and other input devices. The touch-sensitive surface, also called a touch display screen or touchpad, can collect the user's touch operations on or near it (such as operations performed by the user on or near the touch-sensitive surface with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch-sensitive surface may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 908, and can receive and execute commands sent by the processor 908. In addition, the touch-sensitive surface can be implemented in multiple types such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch-sensitive surface, the input unit 903 may also include other input devices. Specifically, the other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, and the like.
The display unit 904 can be used to display information input by the user or provided to the user, as well as the various graphical user interfaces of the terminal; these graphical user interfaces may consist of graphics, text, icons, video, and any combination of them. The display unit 904 may include a display panel, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like. Further, the touch-sensitive surface may cover the display panel; when the touch-sensitive surface detects a touch operation on or near it, it transmits the operation to the processor 908 to determine the type of the touch event, and the processor 908 then provides a corresponding visual output on the display panel according to the type of the touch event. Although in Fig. 9 the touch-sensitive surface and the display panel implement the input and output functions as two independent components, in some embodiments the touch-sensitive surface and the display panel can be integrated to implement the input and output functions.
The terminal may also include at least one sensor 905, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor can adjust the brightness of the display panel according to the brightness of the ambient light, and the proximity sensor can turn off the display panel and/or the backlight when the terminal is moved to the ear. As one kind of motion sensor, the gravity acceleration sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used in applications that recognize the posture of the mobile phone (such as switching between landscape and portrait, related games, and magnetometer posture calibration), vibration-recognition-related functions (such as pedometer and tapping), and the like. As for other sensors that the terminal can also be configured with, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, details are not described again here.
The audio circuit 906, a speaker, and a microphone can provide an audio interface between the user and the terminal. The audio circuit 906 can transmit the electrical signal converted from the received audio data to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts the collected sound signal into an electrical signal, which is received by the audio circuit 906 and converted into audio data; after the audio data is output to the processor 908 for processing, it is sent through the RF circuit 901 to, for example, another terminal, or the audio data is output to the memory 902 for further processing. The audio circuit 906 may also include an earphone jack to provide communication between a peripheral headset and the terminal.
WiFi is a short-range wireless transmission technology. Through the WiFi module 907, the terminal can help the user send and receive e-mail, browse web pages, access streaming media, and so on; it provides the user with wireless broadband Internet access. Although Fig. 9 shows the WiFi module 907, it can be understood that it is not a necessary component of the terminal and can be omitted as needed without changing the essence of the invention.
The processor 908 is the control center of the terminal; it connects all parts of the entire mobile phone with various interfaces and lines, and performs the various functions of the terminal and processes data by running or executing the software programs and/or modules stored in the memory 902 and invoking the data stored in the memory 902, thereby monitoring the mobile phone as a whole. Optionally, the processor 908 may include one or more processing cores; preferably, the processor 908 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the above modem processor may also not be integrated into the processor 908.
The terminal also includes a power source 909 (such as a battery) that supplies power to the various components. Preferably, the power source can be logically connected to the processor 908 through a power management system, so that functions such as charging, discharging, and power-consumption management are realized through the power management system. The power source 909 may also include one or more DC or AC power supplies, a recharging system, a power-failure detection circuit, a power converter or inverter, a power status indicator, and any other such components.
Although not shown, the terminal may also include a camera, a Bluetooth module, and the like, which will not be described again here. Specifically, in this embodiment, the processor 908 in the terminal loads the executable files corresponding to the processes of one or more application programs into the memory 902 according to the following instructions, and the processor 908 runs the application programs stored in the memory 902 to realize various functions:
obtaining audio to be detected; performing human voice separation processing on the audio to be detected to obtain an audio segment to be processed; extracting audio features of the audio segment to be processed; detecting from the audio features whether the audio segment to be processed contains a human voice; and if it contains no human voice, determining that the audio to be detected is pure music.
For the above operations, refer to the preceding embodiments; details are not repeated here.
As can be seen from the above, the terminal provided by this embodiment obtains audio to be detected; performs human voice separation processing on the audio to be detected to obtain an audio segment to be processed; then extracts the audio features of the audio segment to be processed; inputs the audio features into the trained human voice detection network model; determines from the output result of the trained human voice detection network model whether the audio segment to be processed contains a human voice; and if it contains no human voice, determines that the audio to be detected is pure music. The embodiment of the present invention performs pure music detection on the audio segment separated from the audio to be detected, so whole-song detection is not needed; the audio to be detected is relatively short, and the accuracy of pure music detection can be improved.
Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by instructions, or by instructions controlling the relevant hardware; the instructions can be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present invention provides a storage medium in which a plurality of instructions are stored; the instructions can be loaded by a processor to execute the steps in any one of the pure music detection methods provided in the embodiments of the present invention. For example, the instructions can perform the following steps:
obtaining audio to be detected; performing human voice separation processing on the audio to be detected to obtain an audio segment to be processed; extracting audio features of the audio segment to be processed; inputting the audio features into a trained human voice detection network model; determining from the output result of the trained human voice detection network model whether the audio segment to be processed contains a human voice; and if it contains no human voice, determining that the audio to be detected is pure music.
For the specific implementation of each of the above operations, refer to the preceding embodiments; details are not repeated here.
The storage medium may include: read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or the like.
Since the instructions stored in the storage medium can execute the steps in any pure music detection method provided in the embodiments of the present invention, they can achieve the beneficial effects achievable by any pure music detection method provided in the embodiments of the present invention; for details, refer to the preceding embodiments, which will not be repeated here.
The pure music detection method, apparatus, and storage medium provided by the embodiments of the present invention have been introduced in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the descriptions of the above embodiments are only intended to help understand the method of the present invention and its core idea. Meanwhile, for those skilled in the art, there will be changes in the specific implementation and scope of application based on the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

  1. A pure music detection method, comprising:
    obtaining audio to be detected;
    performing human voice separation processing on the audio to be detected to obtain an audio segment to be processed;
    extracting audio features of the audio segment to be processed, the audio features comprising a mel feature and a vocal-ratio feature;
    inputting the audio features into a trained human voice detection network model;
    obtaining an output result of the trained human voice detection network model; and
    if it is determined from the output result that the audio segment to be processed contains no human voice, determining that the audio to be detected is pure music.
  2. The method according to claim 1, wherein before the inputting the audio features into the trained human voice detection network model, the method further comprises:
    obtaining a plurality of audio samples, the audio samples being audio samples known to be pure music or not;
    determining audio features of the audio samples from the audio samples;
    adding the audio features to a training sample set; and
    training a human voice detection network model with the training sample set to obtain the trained human voice detection network model.
  3. The method according to claim 2, wherein the determining audio features of the audio samples from the audio samples comprises:
    performing human voice separation processing on the audio samples to obtain audio segments; and
    extracting audio features of the audio segments to determine the audio features.
  4. The method according to claim 3, wherein the performing human voice separation processing on the audio samples comprises:
    performing the human voice separation processing on the audio samples through an Hourglass model.
  5. The method according to claim 1, wherein when the audio feature is the mel feature, the extracting audio features of the audio segment to be processed comprises:
    performing a short-time Fourier transform (STFT) on the audio segment to be processed to obtain an STFT spectrum;
    converting the STFT spectrum to obtain a mel spectrum; and
    performing logarithm processing and first-order difference processing on the mel spectrum to obtain the mel feature.
  6. The method according to claim 1, wherein when the audio feature is the vocal-ratio feature, the extracting audio features of the audio segment to be processed comprises:
    normalizing the audio segment to be processed to obtain a normalized audio segment to be processed;
    performing silence filtering on the normalized audio segment to be processed to obtain a filtered audio segment to be processed; and
    determining the vocal-ratio feature from the duration corresponding to the filtered audio segment to be processed and the duration of the audio to be detected.
  7. The method according to any one of claims 1 to 6, wherein the performing human voice separation processing on the audio to be detected comprises:
    performing the human voice separation processing on the audio to be detected through an Hourglass model.
  8. A pure music detection apparatus, comprising:
    a first obtaining unit, configured to obtain audio to be detected;
    a processing unit, configured to perform human voice separation processing on the audio to be detected to obtain an audio segment to be processed;
    an extraction unit, configured to extract audio features of the audio segment to be processed, the audio features comprising a mel feature and a vocal-ratio feature;
    an input unit, configured to input the audio features into a trained human voice detection network model;
    a first determining unit, configured to obtain an output result of the trained human voice detection network model; and
    a second determining unit, configured to determine that the audio to be detected is pure music when it is determined from the output result that the audio segment to be processed contains no human voice.
  9. A terminal, comprising a processor and a memory, the memory storing a computer program, wherein the processor, when invoking the computer program in the memory, executes the pure music detection method according to any one of claims 1 to 7.
  10. A storage medium storing a plurality of instructions, the instructions being suitable for being loaded by a processor to execute the steps in the pure music detection method according to any one of claims 1 to 7.
PCT/CN2019/109638 2019-05-14 2019-09-30 Pure music detection method, apparatus, and storage medium WO2020228226A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910398945.6 2019-05-14
CN201910398945.6A CN110097895B (zh) 2019-05-14 2019-05-14 Pure music detection method, apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2020228226A1 true WO2020228226A1 (zh) 2020-11-19

Family

ID=67447961

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/109638 WO2020228226A1 (zh) 2019-05-14 2019-09-30 一种纯音乐检测方法、装置及存储介质

Country Status (2)

Country Link
CN (1) CN110097895B (zh)
WO (1) WO2020228226A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097895B (zh) * 2019-05-14 2021-03-16 腾讯音乐娱乐科技(深圳)有限公司 一种纯音乐检测方法、装置及存储介质
CN110648656A (zh) * 2019-08-28 2020-01-03 北京达佳互联信息技术有限公司 语音端点检测方法、装置、电子设备及存储介质
CN112259119B (zh) * 2020-10-19 2021-11-16 深圳市策慧科技有限公司 基于堆叠沙漏网络的音乐源分离方法
CN114615534A (zh) * 2022-01-27 2022-06-10 海信视像科技股份有限公司 显示设备及音频处理方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150081298A1 (en) * 2013-09-17 2015-03-19 Kabushiki Kaisha Toshiba Speech processing apparatus and method
CN108320756A (zh) * 2018-02-07 2018-07-24 Guangzhou Kugou Computer Technology Co., Ltd. Method and apparatus for detecting whether audio is pure music audio
CN108538311A (zh) * 2018-04-13 2018-09-14 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Audio classification method, apparatus, and computer-readable storage medium
CN109166593A (zh) * 2018-08-17 2019-01-08 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Audio data processing method, apparatus, and storage medium
CN109308901A (zh) * 2018-09-29 2019-02-05 Baidu Online Network Technology (Beijing) Co., Ltd. Singer identification method and apparatus
CN110097895A (zh) * 2019-05-14 2019-08-06 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Pure music detection method, apparatus, and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8005666B2 (en) * 2006-10-24 2011-08-23 National Institute Of Advanced Industrial Science And Technology Automatic system for temporal alignment of music audio signal with lyrics
CN102956230B (zh) * 2011-08-19 2017-03-01 Dolby Laboratories Licensing Corporation Method and device for song detection on audio signals
CN102982804B (zh) * 2011-09-02 2017-05-03 Dolby Laboratories Licensing Corporation Audio classification method and system
CN104078050A (zh) * 2013-03-26 2014-10-01 Dolby Laboratories Licensing Corporation Apparatus and method for audio classification and audio processing
CN104347067B (zh) * 2013-08-06 2017-04-12 Huawei Technologies Co., Ltd. Audio signal classification method and apparatus
CN103680517A (zh) * 2013-11-20 2014-03-26 Huawei Technologies Co., Ltd. Audio signal processing method, apparatus, and device
CN103646649B (zh) * 2013-12-30 2016-04-13 Institute of Automation, Chinese Academy of Sciences Efficient speech detection method
CN108538309B (zh) * 2018-03-01 2021-09-21 Hangzhou Xiaoying Innovation Technology Co., Ltd. Singing voice detection method
CN108877783B (zh) * 2018-07-05 2021-08-31 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Method and apparatus for determining the audio type of audio data
CN109545191B (zh) * 2018-11-15 2022-11-25 University of Electronic Science and Technology of China Real-time detection method for the starting position of the human voice in a song

Also Published As

Publication number Publication date
CN110097895B (zh) 2021-03-16
CN110097895A (zh) 2019-08-06

Similar Documents

Publication Publication Date Title
CN109166593B (zh) Audio data processing method, apparatus, and storage medium
CN109087669B (zh) Audio similarity detection method, apparatus, storage medium, and computer device
CN110853618B (zh) Language identification method, model training method, apparatus, and device
CN109903773B (zh) Audio processing method, apparatus, and storage medium
WO2020228226A1 (zh) Pure music detection method, apparatus, and storage medium
CN111883091B (zh) Audio noise reduction method and training method for an audio noise reduction model
CN103440862B (zh) Method, apparatus, and device for synthesizing speech and music
US9414174B2 Method and apparatus for controlling audio output
CN108470571B (zh) Audio detection method, apparatus, and storage medium
CN106847307B (zh) Signal detection method and apparatus
CN107229629B (zh) Audio recognition method and apparatus
CN110830368B (zh) Instant messaging message sending method and electronic device
WO2022089098A1 (zh) Pitch adjustment method, apparatus, and computer storage medium
CN109243488B (zh) Audio detection method, apparatus, and storage medium
CN107731241B (zh) Method, apparatus, and storage medium for processing audio signals
JP2017509009A (ja) Tracking music in an audio stream
CN110568926A (zh) Sound signal processing method and terminal device
CN113157240A (zh) Speech processing method, apparatus, device, storage medium, and computer program product
CN112906369A (zh) Lyrics file generation method and apparatus
CN115798459A (zh) Audio processing method, apparatus, storage medium, and electronic device
WO2017215615A1 (zh) Sound effect processing method and mobile terminal
CN109346102B (zh) Method, apparatus, and storage medium for detecting popping at the beginning of audio
CN112259076B (zh) Voice interaction method, apparatus, electronic device, and computer-readable storage medium
WO2020118560A1 (zh) Recording method, apparatus, electronic device, and computer-readable storage medium
CN111739493B (zh) Audio processing method, apparatus, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19928495

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19928495

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.03.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19928495

Country of ref document: EP

Kind code of ref document: A1