WO2020228226A1 - Pure music detection method, device and storage medium - Google Patents
Pure music detection method, device and storage medium
- Publication number
- WO2020228226A1 (PCT/CN2019/109638)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- human voice
- feature
- processed
- detected
- Prior art date
Classifications
- G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272: Voice signal separating
- G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
- G10L25/78: Detection of presence or absence of voice signals
- G10L25/81: Detection of presence or absence of voice signals for discriminating voice from music
Definitions
- The invention relates to the field of audio processing, and in particular to a pure music detection method, device, and storage medium.
- Pure music refers to music that does not contain lyrics; it narrates and expresses the author's emotions through the music alone. Pure music can be played on natural instruments (such as piano, violin, or guitar) or on electroacoustic instruments, so whether audio is pure music is usually judged by whether it contains human voice.
- The embodiments of the present invention provide a pure music detection method, device, and storage medium, which are used to improve the accuracy of pure music detection.
- An embodiment of the present invention provides a pure music detection method. The method includes: acquiring audio to be detected; performing human voice separation processing on the audio to be detected to obtain an audio segment to be processed; extracting audio features of the audio segment to be processed; inputting the audio features into a trained human voice detection network model; and obtaining an output result of the model. If the output result indicates that the audio segment to be processed does not contain human voice, it is determined that the audio to be detected is pure music.
- an embodiment of the present invention also provides a pure music detection device, which includes:
- the first acquiring unit is used to acquire the audio to be detected
- a processing unit configured to perform human voice separation processing on the audio to be detected to obtain audio clips to be processed
- An extraction unit configured to extract audio features of the to-be-processed audio segment, the audio features including a mel feature and a vocal proportion feature;
- the input unit is used to input the audio features into the trained human voice detection network model
- the first determining unit is configured to obtain the output result of the trained human voice detection network model
- the second determining unit is configured to determine that the audio to be detected is pure music when it is determined that the audio segment to be processed does not contain human voice according to the output result.
- the device further includes:
- the second acquiring unit is configured to acquire multiple audio samples, where the audio samples are audio samples that are known to be pure music;
- the third determining unit is configured to determine the audio feature of the audio sample according to the audio sample
- the training unit is used to train the human voice detection network model according to the training sample set to obtain the trained human voice detection network model.
- the third determining unit is specifically configured to:
- the audio feature of the audio segment is extracted, and the audio feature is determined.
- the third determining unit is further specifically configured to:
- the human voice separation processing is performed on the audio samples through the Hourglass model.
- the extraction unit is specifically configured to:
- An STFT transform is performed on the to-be-processed audio segment to obtain an STFT spectrum, and the STFT spectrum is converted to obtain a mel spectrum;
- Logarithmic processing and first-order difference processing are performed on the mel spectrum to obtain the mel feature.
- the extraction unit is further specifically configured to:
- Extracting the vocal proportion feature of the audio segment to be processed.
- the extraction unit is specifically configured to:
- the human voice ratio feature is determined according to the duration corresponding to the filtered audio clip to be processed and the duration of the audio to be detected.
- the processing unit is specifically configured to:
- the human voice separation processing is performed on the to-be-detected audio through the Hourglass model.
- An embodiment of the present invention also provides a storage medium. The storage medium stores a plurality of instructions, and the instructions are adapted to be loaded by a processor to execute the steps of any pure music detection method provided in the embodiments of the present invention.
- The embodiment of the present invention acquires the audio to be detected; performs human voice separation processing on the audio to be detected to obtain an audio segment to be processed; extracts the audio features of the audio segment, the audio features including a mel feature and a vocal proportion feature; inputs the audio features into a trained human voice detection network model; and obtains the output result of the model. If the output result indicates that the audio segment to be processed does not contain human voice, the audio to be detected is determined to be pure music.
- Because the embodiment of the present invention performs pure music detection on the audio segments separated from the audio to be detected rather than on the whole song, the audio actually examined is relatively short, which improves the accuracy of pure music detection.
- Fig. 1 is a system schematic diagram of a pure music detection device provided by an embodiment of the present invention.
- FIG. 2 is a schematic flowchart of a pure music detection method provided by an embodiment of the present invention.
- Fig. 3 is a schematic structural diagram of a basic convolutional network in a human voice detection network model provided by an embodiment of the present invention.
- FIG. 4 is a schematic diagram of the structure of the coding layer in the human voice detection network model provided by an embodiment of the present invention.
- FIG. 5 is another schematic flowchart of a pure music detection method provided by an embodiment of the present invention.
- Fig. 6 is a schematic structural diagram of a pure music detection device provided by an embodiment of the present invention.
- FIG. 7 is another schematic structural diagram of a pure music detecting device provided by an embodiment of the present invention.
- FIG. 8 is a schematic structural diagram of a server provided by an embodiment of the present invention.
- FIG. 9 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
- The terms "first" and "second" in the present invention are used to distinguish different objects rather than to describe a specific order.
- The terms "including" and "having", and any variations thereof, are intended to cover non-exclusive inclusion.
- A process, method, system, product, or device that includes a series of steps or modules is not limited to the listed steps or modules; it optionally includes unlisted steps or modules, or other steps or modules inherent to the process, method, product, or device.
- Pure music refers to music that does not contain lyrics; whether music is pure music can be determined according to whether it contains human voice.
- In the related art, the whole-song audio is generally used to determine whether audio is pure music. Songs that contain only scattered vocals are then often treated as pure music, so the accuracy of pure music detection is not high.
- Moreover, if a model is trained to judge whether whole-song audio is pure music, the length of a whole song varies from tens of seconds to tens of minutes; training the model on such long audio makes the model difficult to train and optimize.
- the embodiments of the present invention provide a pure music detection method, device, and storage medium.
- The audio to be detected is acquired, and human voice separation processing is performed on it to obtain the audio segment to be processed;
- The audio features of the audio segment to be processed are extracted, and whether the segment contains human voice is detected according to the audio features; if it does not contain human voice, the audio to be detected is determined to be pure music.
- Because the embodiment of the present invention performs pure music detection on the audio segments separated from the audio to be detected rather than on the whole song, the audio actually examined is relatively short, which improves the accuracy of pure music detection.
- When detecting whether the to-be-processed audio segment contains human voice according to the audio features in this embodiment, the audio features can be input into the trained human voice detection network model, and whether the segment contains human voice is then determined according to the model's output result.
- The audio features separated and extracted from the audio samples are used as the training samples.
- Compared with whole songs, these audio features are relatively short, so using them as training samples makes the model easier to train and optimize.
- the pure music detection method provided by the embodiment of the present invention can be implemented in a pure music detection device, and the pure music detection device can be integrated in a terminal or server and other equipment.
- FIG. 1 is a system diagram of a pure music detection device provided by an embodiment of the present invention.
- the system can be used for model training and pure music detection.
- the model provided by the embodiment of the present invention is a deep learning network model, such as a human voice detection network model.
- Audio samples for training are obtained in advance, and human voice separation is performed on them to obtain audio segments.
- The audio features of each audio sample are determined from its audio segments; the audio features are added to the training sample set, and the human voice detection network model is trained on this set to obtain the trained human voice detection network model.
- FIG. 2 is a schematic flowchart of a pure music detection method provided by an embodiment of the present invention. The method includes:
- The pure music detection device in the present invention can be integrated in a terminal, a server, or other equipment, where the equipment includes but is not limited to computers, smart TVs, smart speakers, mobile phones, and tablet computers.
- The pure music detection device in this embodiment includes a trained human voice detection network model, which is a deep learning network model. Before acquiring the audio to be detected, this embodiment first trains the human voice detection network so that it can be used to detect whether the audio to be detected contains human voice.
- the details can be as follows:
- The audio samples are samples for which it is known whether they are pure music.
- The audio samples include long audio samples and short audio samples.
- Long audio samples are audio samples that last tens of minutes.
- Short audio samples last only a few seconds. The duration of long audio is greater than that of short audio; the specific durations of long and short audio are not limited here.
- The audio sample needs to be separated by the Hourglass model into audio segments and pure music segments.
- The audio segments mentioned in this embodiment are the "human voice segments" extracted from the audio samples.
- The Hourglass model is trained on the DSD100 data set (human voice plus other sources such as drums, bass, and guitar). The model supports blind source separation of arbitrarily many sources and can separate the human voice and the instrument sounds in the audio.
- Before being fed into the Hourglass model, the original audio should be merged into mono and downsampled to 8 kHz.
- The downsampling rate is not limited here; besides 8 kHz, the audio can be downsampled to other values according to the specific situation.
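- As an illustration only, the mono downmix and resampling described above could be done as follows. This is a minimal sketch assuming the librosa library; the function name and file path are placeholders, not part of the patent.

```python
# Minimal preprocessing sketch: merge to mono and downsample to 8 kHz
# before human voice separation. librosa is an assumed dependency.
import librosa

def preprocess(path, target_sr=8000):
    # librosa.load resamples and downmixes to mono in a single call
    y, sr = librosa.load(path, sr=target_sr, mono=True)
    return y, sr

audio, sr = preprocess("song.wav")  # "song.wav" is a placeholder file name
```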
- The separated audio segment may be a real "vocal segment" (in which case the sample's label should be non-pure music) or a false "vocal segment" (an instrument sound misidentified as human voice, in which case the sample's label should be pure music); that is, it may or may not actually contain human voice.
- The audio features can include mel features and vocal proportion features (Vocal_ratio).
- Extracting the mel feature specifically includes: performing an STFT transform on the audio segment to obtain the STFT spectrum; converting the STFT spectrum to obtain the mel spectrum; and performing logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel feature.
- The logarithmic processing and first-order difference processing are as follows: (1) add 1 to the mel spectrum and take the logarithm, that is, log(1+x), where x is the mel spectrum; (2) take the first-order difference of the result of (1) along the time axis, take its absolute value, and stack it with the result of (1) to form the mel feature.
- The result is then scaled to a fixed length in time; the fixed length can be 2000 frames, and the specific length is not limited here.
- the mel feature is a frequency spectrum feature obtained by a filter bank that meets the characteristics of human hearing, and reflects the time-frequency structure of audio.
- The mel feature contains the information needed to determine whether the audio is pure music or human voice, so it can serve as a basis for the pure music judgment, and we can use the mel feature to train the model.
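- The mel-feature pipeline above (STFT, mel conversion, log(1+x), absolute first-order difference, scaling to a fixed number of frames) can be sketched as follows. This is an illustrative reading assuming librosa and numpy; the mel-band count and framing parameters are assumptions, not values fixed by the patent.

```python
# Sketch of the mel feature: log(1 + mel) stacked with the absolute
# first-order time difference, scaled to a fixed length of 2000 frames.
import librosa
import numpy as np

def mel_feature(y, sr, n_mels=128, target_frames=2000):
    spec = np.abs(librosa.stft(y)) ** 2                 # STFT spectrum (power)
    mel = librosa.feature.melspectrogram(S=spec, sr=sr, n_mels=n_mels)
    log_mel = np.log1p(mel)                             # log(1 + x)
    diff = np.abs(np.diff(log_mel, axis=1))             # 1st-order difference in time
    diff = np.pad(diff, ((0, 0), (1, 0)))               # re-align the frame counts
    feat = np.stack([log_mel, diff], axis=-1)           # superimpose the two results
    # scale to a fixed number of frames along the time axis
    idx = np.linspace(0, feat.shape[1] - 1, target_frames).astype(int)
    return feat[:, idx, :]
```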
- The specific method is: normalize the audio segment (to 0-1) to obtain the normalized audio segment; perform silence filtering on the normalized audio segment (the filter threshold can be set to 20 dB or to other values according to the specific situation; the specific threshold is not limited here) to obtain the filtered audio segment; and determine the vocal proportion feature from the duration of the filtered audio segment and the duration of the original audio.
- The vocal proportion feature of pure music is usually low, and that of non-pure music is usually higher, so we can also use the vocal proportion feature to train the model.
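- One possible reading of the vocal-proportion computation is sketched below, assuming numpy; the frame length and the exact interpretation of the 20 dB silence threshold are assumptions, since the patent only names the threshold value.

```python
# Sketch of Vocal_ratio: normalize, drop near-silent frames, then take
# the ratio of the remaining vocal duration to the total audio duration.
import numpy as np

def vocal_ratio(vocals, total_duration_s, sr, frame_len=1024, threshold_db=-20.0):
    x = np.abs(vocals) / (np.max(np.abs(vocals)) + 1e-9)   # normalize magnitudes to 0-1
    n = len(x) // frame_len
    frames = x[: n * frame_len].reshape(n, frame_len)
    level_db = 20 * np.log10(np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-9)
    voiced = level_db > threshold_db                        # silence filtering
    voiced_duration_s = voiced.sum() * frame_len / sr       # duration kept after filtering
    return voiced_duration_s / total_duration_s
```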
- the human voice proportion feature may not be extracted.
- The audio features of the above audio samples then need to be added to the training sample set.
- Because it is known whether each of the above audio samples is pure music, a label can be attached to the audio features extracted from each sample to indicate whether the features reflect pure music or non-pure music; a sample in the training sample set therefore consists of (mel feature, vocal proportion feature, label).
- After the training sample set is obtained, the human voice detection network model is trained on it: a group of audio features is input into the preset human voice detection network model to obtain a prediction result, and the model is trained by comparing the prediction result with the label of that group of features.
- The audio features extracted from the audio samples are used as the training samples.
- Because the audio features are relatively short, using them as training samples makes the model easier to train and optimize.
- The audio to be detected needs to be separated by the Hourglass model into audio segments and pure music segments.
- The audio segments mentioned in this embodiment are the "human voice segments" extracted from the audio to be detected.
- In this step, the Hourglass model recalls all the vocals in the audio at the cost of misidentifying some instrument sounds, so that no human voice is missed and a high recall rate is guaranteed.
- The to-be-processed audio segments may therefore be real "human voice segments" or false ones; that is, they may or may not actually contain human voice. Subsequent steps determine whether a to-be-processed segment is a real human voice segment.
- The audio features in the embodiment of the present invention can be used to indicate whether the audio is pure music or human voice.
- The to-be-processed audio segments are extracted from the audio to be detected through the Hourglass model;
- Feature extraction then needs to be performed on the to-be-processed segments output by the Hourglass model to obtain the audio features;
- The audio features can include the mel feature and the vocal proportion feature.
- Extracting the mel feature specifically includes: performing a short-time Fourier transform (STFT) on the audio segment to be processed to obtain the STFT spectrum; converting the STFT spectrum to obtain the mel spectrum; and performing logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel feature.
- The logarithmic processing and first-order difference processing are as follows: (1) add 1 to the mel spectrum and take the logarithm, that is, log(1+x), where x is the mel spectrum; (2) take the first-order difference of the result of (1) along the time axis, take its absolute value, and stack it with the result of (1) to form the mel feature.
- The specific method is: normalize the audio segment to be processed (to 0-1) to obtain the normalized segment; perform silence filtering on the normalized segment to obtain the filtered segment; and determine the vocal proportion feature from the duration of the filtered segment and the duration of the audio to be detected.
- The vocal proportion feature may be omitted: if the trained human voice detection network model was trained on both the mel feature and the vocal proportion feature, the vocal proportion feature must be extracted here; if it was trained on the mel feature alone, the vocal proportion feature need not be extracted.
- The audio features in the embodiment of the present invention may also include other features that can indicate whether the audio is human voice or pure music.
- The specific feature types are not limited here.
- After the audio features of the audio to be detected are obtained, they are input into the trained human voice detection network model, and the output result is obtained; the output result can be used to determine whether the audio segment to be processed contains human voice.
- If the output result is greater than 0.5, it is determined that the audio segment to be processed does not contain human voice; if the output result is less than 0.5, it is determined that the segment contains human voice. For example, an output of 1 means the segment contains no human voice, and an output of 0 means it does.
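- The decision rule can be stated compactly as below; `model` is a stand-in for the trained human voice detection network, whose concrete API the patent does not specify.

```python
# Sketch of the output thresholding: scores above 0.5 mean "no human voice".
def is_pure_music(model, features, threshold=0.5):
    score = model.predict(features)   # hypothetical predict() call
    return score > threshold          # True -> no vocals -> pure music
```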
- The Vocal_ratio can additionally be output in this embodiment. According to statistics, the vocal_ratio of pure music is usually low, while that of non-pure music is generally higher.
- The vocal_ratio can therefore serve as a physically meaningful value for manual reference.
- If the audio segment to be processed does not contain human voice, it is determined that the audio to be detected is pure music.
- The embodiment of the present invention acquires the audio to be detected; performs human voice separation processing on the audio to be detected to obtain an audio segment to be processed; extracts the audio features of the audio segment, the audio features including a mel feature and a vocal proportion feature; inputs the audio features into a trained human voice detection network model; and obtains the output result of the model. If the output result indicates that the audio segment to be processed does not contain human voice, the audio to be detected is determined to be pure music.
- Because the embodiment of the present invention performs pure music detection on the audio segments separated from the audio to be detected rather than on the whole song, the audio actually examined is relatively short, which improves the accuracy of pure music detection.
- In the following, the pure music detection device is described as being integrated in a server, by way of example.
- the server obtains a large number of audio samples through multiple channels.
- The audio samples are samples for which it is known whether they are pure music.
- The audio samples include long audio samples and short audio samples.
- Long audio samples are tens of minutes long.
- Short audio samples last only a few seconds. The duration of long audio is greater than that of short audio; the specific durations of long and short audio are not limited here.
- the audio feature of the audio sample is determined according to the audio sample. Specifically, the audio sample needs to be subjected to human voice separation processing to obtain an audio segment; then the audio feature of the audio segment is extracted to determine the audio feature.
- The audio sample needs to be separated by the Hourglass model into audio segments and pure music segments.
- The audio segments mentioned in this embodiment are the "human voice segments" extracted from the audio samples.
- Before being fed into the Hourglass model, the original audio should be merged into mono and downsampled to 8 kHz.
- The downsampling rate is not limited here; besides 8 kHz, the audio can be downsampled to other values according to the specific situation.
- The audio segments may be real "human voice segments" or false ones; that is, they may or may not contain human voice.
- The audio segments output from the Hourglass model need to undergo feature extraction to obtain audio features, where the audio features can include mel features and vocal ratio features (Vocal_ratio).
- Extracting the mel feature specifically includes: performing an STFT transform on the audio segment to obtain the STFT spectrum; converting the STFT spectrum to obtain the mel spectrum; and performing logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel feature.
- The logarithmic processing and first-order difference processing are as follows: (1) add 1 to the mel spectrum and take the logarithm, that is, log(1+x), where x is the mel spectrum; (2) take the first-order difference of the result of (1) along the time axis, take its absolute value, and stack it with the result of (1) to form the mel feature.
- The result is then scaled to a fixed length in time; the fixed length can be 2000 frames, and the specific length is not limited here.
- the mel feature is a frequency spectrum feature obtained by a filter bank that meets the characteristics of human hearing, and reflects the time-frequency structure of audio.
- The mel feature contains the information needed to determine whether the audio is pure music or human voice, so it can serve as a basis for the pure music judgment, and we can use the mel feature to train the model.
- The specific method is: normalize the audio segment (to 0-1) to obtain the normalized audio segment; perform silence filtering on the normalized audio segment (the filter threshold can be set to 20 dB or to other values according to the specific situation; the specific threshold is not limited here) to obtain the filtered audio segment; and determine the vocal proportion feature from the duration of the filtered audio segment and the duration of the original audio.
- The vocal proportion feature of pure music is usually low, and that of non-pure music is usually higher, so we can also use the vocal proportion feature to train the model.
- the human voice proportion feature may not be extracted. In this case, only the mel feature needs to be extracted when performing pure music detection later.
- The audio features are added to the training sample set. More specifically, once the audio features of the above audio samples are extracted, they need to be added to the training samples. Because it is known whether each audio sample is pure music, a label can be attached to the audio features extracted from each sample to indicate whether the features reflect pure music or non-pure music; a sample in the training sample set therefore consists of (mel feature, vocal proportion feature, label).
- The human voice detection network model is trained on the training sample set to obtain the trained human voice detection network model; that is, after the training sample set is obtained, the model is trained on it.
- A group of audio features is input into the preset human voice detection network model to obtain a prediction result, and the model is trained by comparing the prediction result with the label of that group of features.
- The audio features extracted from the audio samples are used as the training samples.
- Because the audio features are relatively short, using them as training samples makes the model easier to train and optimize.
- The trained human voice detection network model in the embodiments of the present invention is composed of a basic convolutional network, a coding layer, a feature fusion layer, and a fully connected classification layer.
- The basic convolutional network is a network without dilation (expansion) coefficients.
- A schematic diagram of the basic convolutional network is shown in Figure 3. The feature map formed after stacking multiple basic convolutional layers is a (timestep, feature) matrix, which must be converted into a vector before the subsequent classification can be performed.
- the structure of the encoding layer is shown in Figure 4.
- The coding layer learns the importance of the data at each time step through a convolution with a single kernel followed by a softmax (a softmax mask); this importance value is then multiplied with the data row by row, and the products are summed along the time axis to obtain a feature vector.
- This technique encodes the features distributed over the time steps into a single point, hence the name "coding layer".
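- A minimal sketch of such a coding layer is shown below, written with tensorflow.keras as an assumed framework (the patent names no framework); it scores each time step with a single-kernel convolution, applies a softmax over time, and sums the weighted features.

```python
# Coding layer: softmax-mask pooling from (timestep, feature) to one vector.
import tensorflow as tf

class CodingLayer(tf.keras.layers.Layer):
    def build(self, input_shape):
        # one convolution kernel that scores the importance of each time step
        self.score_conv = tf.keras.layers.Conv1D(filters=1, kernel_size=1)

    def call(self, x):                                       # x: (batch, timestep, feature)
        weights = tf.nn.softmax(self.score_conv(x), axis=1)  # softmax mask over time
        return tf.reduce_sum(weights * x, axis=1)            # weighted sum -> feature vector
```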
- After the training sample set is obtained, the samples in it are input into the human voice detection network model.
- Specifically, the mel feature is fed into the basic convolutional network in the human voice detection network model and encoded to a fixed length by the coding layer, and the vocal proportion feature is then appended at the feature fusion layer.
- The classification result is output through the fully connected classification layer; the result is compared with the label of the sample, and the weights of the basic convolutional network are adjusted according to the comparison error until the model converges.
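- Reusing the CodingLayer sketch above, the overall network (basic convolutional network, fixed-length coding, fusion with the vocal proportion feature, and fully connected classification) might be assembled as follows; the layer sizes and training configuration are illustrative assumptions, not values from the patent.

```python
# Sketch of the full human voice detection network and its training setup.
def build_model(timesteps=2000, n_features=128):
    mel_in = tf.keras.Input(shape=(timesteps, n_features))   # mel feature
    ratio_in = tf.keras.Input(shape=(1,))                    # vocal proportion feature
    x = mel_in
    for filters in (32, 64, 128):                            # basic convolutional network
        x = tf.keras.layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.MaxPool1D(2)(x)
    x = CodingLayer()(x)                                     # fixed-length coding
    x = tf.keras.layers.Concatenate()([x, ratio_in])         # feature fusion layer
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # fully connected classifier
    model = tf.keras.Model([mel_in, ratio_in], out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Training compares predictions against the labels until convergence, e.g.:
# model.fit([mel_batch, ratio_batch], labels, epochs=...)
```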
- The pure music detection device in the present invention can be integrated in a terminal, a server, or other equipment, where the equipment includes but is not limited to computers, smart TVs, smart speakers, mobile phones, and tablet computers.
- Before being fed into the Hourglass model, the original audio should be merged into mono and downsampled to 8 kHz.
- The downsampling rate is not limited here; besides 8 kHz, the audio can be downsampled to other values according to the specific situation.
- The to-be-processed audio segments may be real "human voice segments" or false ones; that is, they may or may not actually contain human voice. In this case, subsequent steps determine whether a to-be-processed segment is a real human voice segment.
- The to-be-processed audio segments are extracted from the audio to be detected through the Hourglass model;
- Feature extraction then needs to be performed on the to-be-processed segments output by the Hourglass model to obtain the audio features;
- The audio features can include the mel feature and the vocal proportion feature.
- The specific steps are: performing an STFT transform on the to-be-processed audio segment to obtain the STFT spectrum; converting the STFT spectrum to obtain the mel spectrum; and performing logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel feature.
- The logarithmic processing and first-order difference processing are as follows: (1) add 1 to the mel spectrum and take the logarithm, that is, log(1+x), where x is the mel spectrum; (2) take the first-order difference of the result of (1) along the time axis, take its absolute value, and stack it with the result of (1) to form the mel feature.
- The result is then scaled to a fixed length in time; the fixed length can be 2000 frames, and the specific length is not limited here.
- The specific method is: normalize the audio segment to be processed (to 0-1) to obtain the normalized segment; perform silence filtering on the normalized segment to obtain the filtered segment; and determine the vocal proportion feature from the duration of the filtered segment and the duration of the audio to be detected.
- The vocal proportion feature may be omitted: if the trained human voice detection network model was trained on both the mel feature and the vocal proportion feature, the vocal proportion feature must be extracted here; if it was trained on the mel feature alone, the vocal proportion feature need not be extracted.
- Besides the mel feature and the vocal proportion feature, the audio features may include other features that can indicate whether the audio is human voice or pure music.
- The specific feature types are not limited here.
- the audio feature is input into a built-in trained human voice detection network model, and then according to the output result of the human voice detection network model, it is determined whether the audio segment to be processed contains human voice.
- Step 505: determine whether the to-be-processed audio segment contains human voice according to the output result of the trained human voice detection network model; if it does not contain human voice, execute step 506; if it does, execute step 507.
- If the output result is greater than 0.5, it is determined that the audio segment to be processed does not contain human voice; if the output result is less than 0.5, it is determined that the segment contains human voice. For example, an output of 1 means the segment contains no real human voice, and an output of 0 means it does.
- The Vocal_ratio can additionally be output in this embodiment. According to statistics, the vocal_ratio of pure music is usually low, while that of non-pure music is generally higher.
- The vocal_ratio can therefore serve as a physically meaningful value for manual reference.
- If the audio segment to be processed does not contain human voice, the audio to be detected contains no human voice either (because all segments that might be human voice were separated out for detection). In this case, the audio to be detected is determined to be pure music.
- If the audio segment to be processed contains human voice, that is, the segment extracted from the audio to be detected contains human voice, the audio to be detected is determined not to be pure music.
- The embodiment of the present invention acquires the audio to be detected; performs human voice separation processing on the audio to be detected to obtain an audio segment to be processed; extracts the audio features of the audio segment, the audio features including a mel feature and a vocal proportion feature; inputs the audio features into a trained human voice detection network model; and obtains the output result of the model. If the output result indicates that the audio segment to be processed does not contain human voice, the audio to be detected is determined to be pure music.
- Because the embodiment of the present invention performs pure music detection on the audio segments separated from the audio to be detected rather than on the whole song, the audio actually examined is relatively short, which improves the accuracy of pure music detection.
- FIG. 6 is a schematic structural diagram of a pure music detection device provided by an embodiment of the present invention.
- the pure music detection device 600 may include a first acquisition unit 601, a processing unit 602, an extraction unit 603, an input unit 604, a first determination unit 605, and a second determination unit 606, wherein:
- the first acquiring unit 601 is configured to acquire audio to be detected
- the processing unit 602 is configured to perform human voice separation processing on the audio to be detected to obtain audio clips to be processed;
- The extraction unit 603 is configured to extract audio features of the to-be-processed audio segment, where the audio features include a mel feature and a vocal proportion feature;
- the input unit 604 is configured to input the audio features into the trained human voice detection network model
- the first determining unit 605 is configured to obtain the output result of the trained human voice detection network model
- the second determining unit 606 is configured to determine that the audio to be detected is pure music when it is determined that the audio segment to be processed does not contain human voice according to the output result.
- the apparatus 600 further includes:
- The second acquiring unit 607 is configured to acquire multiple audio samples, where the audio samples are samples for which it is known whether they are pure music;
- the third determining unit 608 is configured to determine the audio feature of the audio sample according to the audio sample
- the adding unit 609 is configured to add the audio feature to the training sample set
- the training unit 610 is configured to train the human voice detection network model according to the training sample set to obtain the trained human voice detection network model.
- the third determining unit 608 is specifically configured to:
- the audio feature of the audio segment is extracted, and the audio feature is determined.
- The third determining unit 608 is also specifically configured to:
- the human voice separation processing is performed on the audio samples through the Hourglass model.
- the extraction unit 603 is specifically configured to:
- STFT transformation is performed on the to-be-processed audio segment to obtain an STFT spectrum
- Logarithmic processing and first-order difference processing are performed on the mel spectrum to obtain the mel feature.
- the extraction unit 603 is further specifically configured to:
- Extracting the characteristics of the proportion of human voice in the audio segment to be processed Extracting the characteristics of the proportion of human voice in the audio segment to be processed.
- the extraction unit 603 is further specifically configured to:
- the human voice ratio feature is determined according to the duration corresponding to the filtered audio clip to be processed and the duration of the audio to be detected.
- the processing unit 602 is specifically configured to:
- the human voice separation processing is performed on the to-be-detected audio through the Hourglass model.
- The first obtaining unit 601 obtains the audio to be detected; the processing unit 602 performs human voice separation processing on the audio to be detected to obtain the audio segment to be processed; the extraction unit 603 then extracts the audio features of the audio segment to be processed,
- where the audio features include a mel feature and a vocal proportion feature;
- The input unit 604 inputs the audio features into the trained human voice detection network model;
- The first determining unit 605 obtains the output result of the trained human voice detection network model; if the output result indicates that the audio segment to be processed does not contain human voice, the second determining unit 606 determines that the audio to be detected is pure music.
- Because the embodiment of the present invention performs pure music detection on the audio segments separated from the audio to be detected rather than on the whole song, the audio actually examined is relatively short, which improves the accuracy of pure music detection.
- the embodiment of the present invention also provides a server, as shown in FIG. 8, which shows a schematic structural diagram of the server involved in the embodiment of the present invention, specifically:
- The server may include a processor 801 with one or more processing cores, a memory 802 with one or more computer-readable storage media, a power supply 803, an input unit 804, and other components.
- Those skilled in the art can understand that the server structure shown in FIG. 8 does not constitute a limitation on the server; the server may include more or fewer components than shown, combine certain components, or arrange the components differently. Among them:
- The processor 801 is the control center of the server. It connects the various parts of the entire server through various interfaces and lines, performs the various functions of the server and processes data by running or executing the software programs and/or modules stored in the memory 802 and calling the data stored in the memory 802, thereby monitoring the server as a whole.
- Optionally, the processor 801 may include one or more processing cores; preferably, the processor 801 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, and application programs, while the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 801.
- the memory 802 may be used to store software programs and modules.
- the processor 801 executes various functional applications and data processing by running the software programs and modules stored in the memory 802.
- the memory 802 may mainly include a program storage area and a data storage area.
- The program storage area may store an operating system, an application program required by at least one function (such as a sound playback function or an image playback function), and the like; the data storage area may store data created by the use of the server, and the like.
- the memory 802 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
- the memory 802 may further include a memory controller to provide the processor 801 with access to the memory 802.
- the server also includes a power supply 803 for supplying power to various components.
- the power supply 803 may be logically connected to the processor 801 through a power management system, so that functions such as charging, discharging, and power consumption management can be managed through the power management system.
- the power supply 803 may also include any components such as one or more DC or AC power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.
- The server may further include an input unit 804, which can be used to receive input digital or character information and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.
- the server may also include a display unit, etc., which will not be repeated here.
- Specifically, in this embodiment, the processor 801 in the server loads the executable files corresponding to the processes of one or more application programs into the memory 802 according to the following instructions, and runs the application programs stored in the memory 802 to realize various functions, as follows:
- The server provided in this embodiment acquires the audio to be detected; performs human voice separation processing on it to obtain the audio segment to be processed; then extracts the audio features of the segment;
- detects whether the audio segment to be processed contains human voice according to the audio features; and if it does not contain human voice, determines that the audio to be detected is pure music.
- the embodiment of the present invention performs pure music detection on audio fragments separated from the audio to be detected, without the need for whole song detection, the audio length to be detected is relatively short, and the accuracy of pure music detection can be improved.
- an embodiment of the present invention also provides a terminal.
- The terminal may include a radio frequency (RF) circuit 901, a memory 902 including one or more computer-readable storage media, an input unit 903, a display unit 904, a sensor 905, an audio circuit 906, a wireless fidelity (WiFi) module 907, a processor 908 including one or more processing cores, a power supply 909, and other components.
- The RF circuit 901 can be used to receive and send signals while information is being transmitted or during a call. In particular, after receiving downlink information from a base station, it hands the information to one or more processors 908 for processing; it also sends uplink data to the base station.
- Generally, the RF circuit 901 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like.
- the RF circuit 901 can also communicate with the network and other devices through wireless communication.
- The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and so on.
- the memory 902 may be used to store software programs and modules.
- the processor 908 executes various functional applications and data processing by running the software programs and modules stored in the memory 902.
- the memory 902 may mainly include a program storage area and a data storage area.
- The program storage area may store an operating system, an application program required by at least one function (such as a sound playback function or an image playback function), and the like; the data storage area may store data (such as audio data or a phone book) created by the use of the terminal.
- the memory 902 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
- The memory 902 may further include a memory controller to provide the processor 908 and the input unit 903 with access to the memory 902.
- the input unit 903 can be used to receive input digital or character information, and generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.
- the input unit 903 may include a touch-sensitive surface and other input devices.
- The touch-sensitive surface, also known as a touch screen or touchpad, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch-sensitive surface with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program.
- the touch-sensitive surface may include two parts: a touch detection device and a touch controller.
- The touch detection device detects the position of the user's touch and the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 908, and can receive and execute commands sent by the processor 908.
- The touch-sensitive surface can be implemented in multiple types, such as resistive, capacitive, infrared, and surface acoustic wave.
- the input unit 903 may also include other input devices. Specifically, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackball, mouse, and joystick.
- the display unit 904 can be used to display information input by the user or information provided to the user and various graphical user interfaces of the terminal. These graphical user interfaces can be composed of graphics, text, icons, videos, and any combination thereof.
- The display unit 904 may include a display panel, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
- Further, the touch-sensitive surface may cover the display panel. When the touch-sensitive surface detects a touch operation on or near it, it transmits the operation to the processor 908 to determine the type of the touch event, and the processor 908 then provides a corresponding visual output on the display panel according to the type of the touch event.
- Although in FIG. 9 the touch-sensitive surface and the display panel are two independent components implementing the input and output functions, in some embodiments the touch-sensitive surface and the display panel may be integrated to implement the input and output functions.
- the terminal may also include at least one sensor 905, such as a light sensor, a motion sensor, and other sensors.
- The light sensor may include an ambient light sensor and a proximity sensor; the ambient light sensor can adjust the brightness of the display panel according to the ambient light, and the proximity sensor can turn off the display panel and/or the backlight when the terminal is moved to the ear.
- the gravity acceleration sensor can detect the magnitude of acceleration in various directions (usually three-axis), and can detect the magnitude and direction of gravity when it is stationary.
- the audio circuit 906, speakers, and microphones can provide an audio interface between the user and the terminal.
- The audio circuit 906 can transmit the electrical signal converted from received audio data to the speaker, which converts it into a sound signal for output; conversely, the microphone converts a collected sound signal into an electrical signal, which the audio circuit 906 receives and converts into audio data.
- After the audio data is processed by the processor 908, it is sent via the RF circuit 901 to, for example, another terminal, or output to the memory 902 for further processing.
- The audio circuit 906 may also include an earphone jack to provide communication between a peripheral earphone and the terminal.
- WiFi is a short-distance wireless transmission technology.
- Through the WiFi module 907, the terminal can help users send and receive e-mails, browse web pages, and access streaming media; it provides users with wireless broadband Internet access.
- Although FIG. 9 shows the WiFi module 907, it is understandable that it is not a necessary component of the terminal and can be omitted as needed without changing the essence of the invention.
- The processor 908 is the control center of the terminal. It connects the various parts of the entire mobile phone through various interfaces and lines, performs the various functions of the terminal and processes data by running or executing the software programs and/or modules stored in the memory 902 and calling the data stored in the memory 902, thereby monitoring the mobile phone as a whole.
- Optionally, the processor 908 may include one or more processing cores; preferably, the processor 908 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, and application programs, while the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 908.
- The terminal also includes a power supply 909 (such as a battery) for supplying power to the various components.
- Preferably, the power supply can be logically connected to the processor 908 through a power management system, so that functions such as charging, discharging, and power management are realized through the power management system.
- the power supply 909 may also include one or more DC or AC power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and any other components.
- the terminal may also include a camera, a Bluetooth module, etc., which will not be repeated here.
- Specifically, in this embodiment, the processor 908 in the terminal loads the executable files corresponding to the processes of one or more application programs into the memory 902 according to the following instructions, and runs the application programs stored in the memory 902 to realize various functions:
- The terminal acquires the audio to be detected; performs human voice separation processing on it to obtain the audio segment to be processed; extracts the audio features of the segment; inputs the audio features into the trained human voice detection network model; determines, according to the output result of the model, whether the segment contains human voice; and if it does not contain human voice, determines that the audio to be detected is pure music.
- Because the embodiment of the present invention performs pure music detection on the audio segments separated from the audio to be detected rather than on the whole song, the audio actually examined is relatively short, which improves the accuracy of pure music detection.
- an embodiment of the present invention provides a storage medium in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any pure music detection method provided in the embodiments of the present invention.
- the instruction can perform the following steps:
- the storage medium may include: read-only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), disk or CD, etc.
- since the instructions stored in the storage medium can execute the steps in any pure music detection method provided in the embodiments of the present invention, they can achieve the beneficial effects achievable by any pure music detection method provided in the embodiments of the present invention; for details, refer to the previous embodiments, which will not be repeated here.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Auxiliary Devices For Music (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims (10)
- A pure music detection method, characterized by comprising: obtaining audio to be detected; performing human voice separation processing on the audio to be detected to obtain an audio segment to be processed; extracting audio features of the audio segment to be processed, the audio features including a Mel (mel) feature and a human voice proportion feature; inputting the audio features into a trained human voice detection network model; obtaining an output result of the trained human voice detection network model; and, if it is determined according to the output result that the audio segment to be processed does not contain human voice, determining that the audio to be detected is pure music.
- The method according to claim 1, characterized in that, before the audio features are input into the trained human voice detection network model, the method further comprises: obtaining a plurality of audio samples, each audio sample being known to be pure music or not; determining audio features of the audio samples according to the audio samples; adding the audio features to a training sample set; and training a human voice detection network model according to the training sample set to obtain the trained human voice detection network model (an illustrative sketch of this training-set construction follows the claims).
- The method according to claim 2, characterized in that determining the audio features of the audio samples according to the audio samples comprises: performing human voice separation processing on the audio samples to obtain audio segments; and extracting the audio features of the audio segments.
- The method according to claim 3, characterized in that performing human voice separation processing on the audio samples comprises: performing the human voice separation processing on the audio samples by means of an Hourglass model.
- The method according to claim 1, characterized in that, when the audio feature is the mel feature, extracting the audio features of the audio segment to be processed comprises: performing a short-time Fourier transform (STFT) on the audio segment to be processed to obtain an STFT spectrum; converting the STFT spectrum to obtain a mel spectrum; and performing logarithm processing and first-order difference processing on the mel spectrum to obtain the mel feature (an illustrative sketch of this step follows the claims).
- The method according to claim 1, characterized in that, when the audio feature is the human voice proportion feature, extracting the audio features of the audio segment to be processed comprises: normalizing the audio segment to be processed to obtain a normalized audio segment; filtering silence from the normalized audio segment to obtain a filtered audio segment; and determining the human voice proportion feature according to the duration of the filtered audio segment and the duration of the audio to be detected (an illustrative sketch of this step also follows the claims).
- The method according to any one of claims 1 to 6, characterized in that performing human voice separation processing on the audio to be detected comprises: performing the human voice separation processing on the audio to be detected by means of an Hourglass model.
- A pure music detection apparatus, characterized by comprising: a first obtaining unit configured to obtain audio to be detected; a processing unit configured to perform human voice separation processing on the audio to be detected to obtain an audio segment to be processed; an extraction unit configured to extract audio features of the audio segment to be processed, the audio features including a Mel (mel) feature and a human voice proportion feature; an input unit configured to input the audio features into a trained human voice detection network model; a first determining unit configured to obtain an output result of the trained human voice detection network model; and a second determining unit configured to determine that the audio to be detected is pure music when it is determined according to the output result that the audio segment to be processed does not contain human voice.
- A terminal, characterized by comprising a processor and a memory, the memory storing a computer program, wherein the processor executes the pure music detection method according to any one of claims 1 to 7 when invoking the computer program in the memory.
- A storage medium, characterized in that the storage medium stores a plurality of instructions, the instructions being adapted to be loaded by a processor to execute the steps of the pure music detection method according to any one of claims 1 to 7.
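The following non-limiting Python sketch illustrates the training-set construction described in claims 2 and 3, reusing the hypothetical separation_model and extract_features placeholders introduced earlier; the list-based packaging of the features is an assumption for illustration.

```python
def build_training_set(samples, labels, separation_model, extract_features, sr):
    """Training-set construction along the lines of claims 2 and 3.

    samples: waveforms known to be pure music or not;
    labels:  1 if a sample contains human voice, 0 otherwise.
    separation_model and extract_features are hypothetical stand-ins.
    """
    feature_set = []
    for waveform in samples:
        # Human voice separation on the known sample -> audio segment.
        segment = separation_model.separate(waveform, sr)
        # Extract the audio features of that segment.
        feature_set.append(extract_features(segment, sr))
    return feature_set, list(labels)
```

The human voice detection network model would then be trained on this set with any standard supervised training loop to obtain the trained model of claim 1.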
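For the mel feature of claim 5, one plausible realization using librosa is sketched below; the FFT size, hop length, mel-band count, and the choice to stack the log-mel spectrum with its first-order difference are illustrative assumptions, not values taken from the disclosure.

```python
import numpy as np
import librosa


def mel_feature(segment, sr, n_fft=2048, hop_length=512, n_mels=128):
    """Mel feature along the lines of claim 5 (parameters are assumptions)."""
    # STFT of the audio segment to be processed -> STFT spectrum.
    stft = np.abs(librosa.stft(segment, n_fft=n_fft, hop_length=hop_length))
    # Convert the power STFT spectrum into a mel spectrum.
    mel = librosa.feature.melspectrogram(S=stft ** 2, sr=sr, n_mels=n_mels)
    # Logarithm of the mel spectrum.
    log_mel = librosa.power_to_db(mel)
    # First-order difference of the log-mel spectrum.
    delta = librosa.feature.delta(log_mel, order=1)
    # Stack both as the mel feature (the exact packaging is an assumption).
    return np.concatenate([log_mel, delta], axis=0)
```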
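For the human voice proportion feature of claim 6, a minimal NumPy sketch follows; the RMS silence threshold and frame length are assumptions chosen for illustration.

```python
import numpy as np


def voice_proportion_feature(segment, sr, total_duration,
                             silence_threshold=0.01, frame_length=1024):
    """Human voice proportion feature along the lines of claim 6."""
    # Normalization: scale the separated segment to unit peak amplitude.
    peak = np.max(np.abs(segment))
    normalized = segment / peak if peak > 0 else segment
    # Silence filtering: keep only frames whose RMS exceeds the threshold.
    n_frames = len(normalized) // frame_length
    frames = normalized[: n_frames * frame_length].reshape(n_frames, frame_length)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    voiced_duration = np.count_nonzero(rms > silence_threshold) * frame_length / sr
    # Ratio of voiced duration to the duration of the audio to be detected.
    return voiced_duration / total_duration
```

A segment with little or no audible vocal energy yields a ratio near zero, which, together with the mel feature, gives the detection network a direct cue that the track is pure music.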
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910398945.6 | 2019-05-14 | ||
CN201910398945.6A CN110097895B (zh) | 2019-05-14 | 2019-05-14 | Pure music detection method, apparatus and storage medium
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020228226A1 true WO2020228226A1 (zh) | 2020-11-19 |
Family
ID=67447961
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/109638 WO2020228226A1 (zh) | 2019-09-30 | Pure music detection method, apparatus and storage medium
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110097895B (zh) |
WO (1) | WO2020228226A1 (zh) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110097895B (zh) * | 2019-05-14 | 2021-03-16 | 腾讯音乐娱乐科技(深圳)有限公司 | Pure music detection method, apparatus and storage medium |
CN110648656A (zh) * | 2019-08-28 | 2020-01-03 | 北京达佳互联信息技术有限公司 | Voice endpoint detection method and apparatus, electronic device, and storage medium |
CN112259119B (zh) * | 2020-10-19 | 2021-11-16 | 深圳市策慧科技有限公司 | Music source separation method based on a stacked hourglass network |
CN114615534A (zh) * | 2022-01-27 | 2022-06-10 | 海信视像科技股份有限公司 | Display device and audio processing method |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150081298A1 (en) * | 2013-09-17 | 2015-03-19 | Kabushiki Kaisha Toshiba | Speech processing apparatus and method |
CN108320756A (zh) * | 2018-02-07 | 2018-07-24 | 广州酷狗计算机科技有限公司 | Method and apparatus for detecting whether audio is pure music audio |
CN108538311A (zh) * | 2018-04-13 | 2018-09-14 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio classification method and apparatus, and computer-readable storage medium |
CN109166593A (zh) * | 2018-08-17 | 2019-01-08 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio data processing method and apparatus, and storage medium |
CN109308901A (zh) * | 2018-09-29 | 2019-02-05 | 百度在线网络技术(北京)有限公司 | Singer identification method and apparatus |
CN110097895A (zh) * | 2019-05-14 | 2019-08-06 | 腾讯音乐娱乐科技(深圳)有限公司 | Pure music detection method, apparatus and storage medium |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8005666B2 (en) * | 2006-10-24 | 2011-08-23 | National Institute Of Advanced Industrial Science And Technology | Automatic system for temporal alignment of music audio signal with lyrics |
CN102956230B (zh) * | 2011-08-19 | 2017-03-01 | 杜比实验室特许公司 | Method and device for song detection of an audio signal |
CN102982804B (zh) * | 2011-09-02 | 2017-05-03 | 杜比实验室特许公司 | Audio classification method and system |
CN104078050A (zh) * | 2013-03-26 | 2014-10-01 | 杜比实验室特许公司 | Device and method for audio classification and audio processing |
CN104347067B (zh) * | 2013-08-06 | 2017-04-12 | 华为技术有限公司 | Audio signal classification method and apparatus |
CN103680517A (zh) * | 2013-11-20 | 2014-03-26 | 华为技术有限公司 | Audio signal processing method, apparatus and device |
CN103646649B (zh) * | 2013-12-30 | 2016-04-13 | 中国科学院自动化研究所 | An efficient speech detection method |
CN108538309B (zh) * | 2018-03-01 | 2021-09-21 | 杭州小影创新科技股份有限公司 | A singing voice detection method |
CN108877783B (zh) * | 2018-07-05 | 2021-08-31 | 腾讯音乐娱乐科技(深圳)有限公司 | Method and apparatus for determining the audio type of audio data |
CN109545191B (zh) * | 2018-11-15 | 2022-11-25 | 电子科技大学 | Real-time detection method for the onset position of the human voice in a song |
2019
- 2019-05-14 CN CN201910398945.6A patent/CN110097895B/zh active Active
- 2019-09-30 WO PCT/CN2019/109638 patent/WO2020228226A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN110097895B (zh) | 2021-03-16 |
CN110097895A (zh) | 2019-08-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109166593B (zh) | Audio data processing method and apparatus, and storage medium | |
CN109087669B (zh) | Audio similarity detection method and apparatus, storage medium, and computer device | |
CN110853618B (zh) | Language identification method, model training method, apparatus and device | |
CN109903773B (zh) | Audio processing method and apparatus, and storage medium | |
WO2020228226A1 (zh) | Pure music detection method, apparatus and storage medium | |
CN111883091B (zh) | Audio noise reduction method and method for training an audio noise reduction model | |
CN103440862B (zh) | Method, apparatus and device for synthesizing speech and music | |
US9414174B2 (en) | Method and apparatus for controlling audio output | |
CN108470571B (zh) | Audio detection method and apparatus, and storage medium | |
CN106847307B (zh) | Signal detection method and apparatus | |
CN107229629B (zh) | Audio recognition method and apparatus | |
CN110830368B (zh) | Instant messaging message sending method and electronic device | |
WO2022089098A1 (zh) | Pitch adjustment method and apparatus, and computer storage medium | |
CN109243488B (zh) | Audio detection method and apparatus, and storage medium | |
CN107731241B (zh) | Method, apparatus and storage medium for processing an audio signal | |
JP2017509009A (ja) | Tracking music in an audio stream | |
CN110568926A (zh) | Sound signal processing method and terminal device | |
CN113157240A (zh) | Speech processing method, apparatus, device, storage medium, and computer program product | |
CN112906369A (zh) | Lyrics file generation method and apparatus | |
CN115798459A (zh) | Audio processing method and apparatus, storage medium, and electronic device | |
WO2017215615A1 (zh) | Sound effect processing method and mobile terminal | |
CN109346102B (zh) | Method, apparatus and storage medium for detecting popping at the beginning of audio | |
CN112259076B (zh) | Voice interaction method and apparatus, electronic device, and computer-readable storage medium | |
WO2020118560A1 (zh) | Recording method and apparatus, electronic device, and computer-readable storage medium | |
CN111739493B (zh) | Audio processing method and apparatus, and storage medium | |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19928495; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19928495; Country of ref document: EP; Kind code of ref document: A1
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.03.2022)