CN110097895B - Pure music detection method, pure music detection device and storage medium

Pure music detection method, pure music detection device and storage medium

Info

Publication number
CN110097895B
CN110097895B (application CN201910398945.6A)
Authority
CN
China
Prior art keywords: audio, processed, human voice, features, detected
Prior art date
Legal status
Active
Application number
CN201910398945.6A
Other languages
Chinese (zh)
Other versions
CN110097895A (en)
Inventor
王征韬
Current Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN201910398945.6A
Publication of CN110097895A
Priority to PCT/CN2019/109638
Application granted
Publication of CN110097895B


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/81: Detection of presence or absence of voice signals for discriminating voice from music

Abstract

The embodiment of the invention discloses a pure music detection method, a pure music detection device and a storage medium, wherein the pure music detection method comprises the following steps: acquiring audio to be detected; performing human voice separation processing on the audio to be detected to obtain an audio clip to be processed; extracting the audio features of the audio clip to be processed; inputting the audio features into a trained human voice detection network model; determining whether the audio clip to be processed contains human voice according to the output result of the trained human voice detection network model; and if no human voice is contained, determining that the audio to be detected belongs to pure music. Because the embodiment performs pure music detection on the audio clip separated from the audio to be detected rather than on the whole piece of music, the audio actually examined is shorter, which can improve the accuracy of pure music detection.

Description

Pure music detection method, pure music detection device and storage medium
Technical Field
The invention relates to the field of audio processing, in particular to a pure music detection method, a pure music detection device and a storage medium.
Background
Pure music refers to music without lyrics, which expresses the emotion of the author through melody alone and can be played by natural musical instruments (such as piano, violin and guitar) or electroacoustic instruments. Whether audio is pure music is therefore usually determined by whether the audio contains human voice.
In the prior art, whether a piece of music is pure music generally has to be judged from the whole piece. Some songs contain only scattered human voices yet are generally regarded as pure music, so the accuracy of pure music detection is not high.
Disclosure of Invention
The embodiment of the invention provides a pure music detection method, a pure music detection device and a storage medium, which are used for improving the accuracy of pure music detection.
The embodiment of the invention provides a pure music detection method, which comprises the following steps:
acquiring audio to be detected;
carrying out human voice separation processing on the audio to be detected to obtain an audio clip to be processed;
extracting the audio features of the audio clip to be processed;
inputting the audio features into a trained human voice detection network model;
determining whether the audio clip to be processed contains human voice according to the output result of the trained human voice detection network model;
and if no human voice is contained, determining that the audio to be detected belongs to pure music.
Correspondingly, the embodiment of the invention also provides a pure music detection device, which comprises:
the first acquisition unit is used for acquiring the audio to be detected;
the processing unit is used for carrying out human voice separation processing on the audio to be detected to obtain an audio clip to be processed;
the extraction unit is used for extracting the audio features of the audio clip to be processed;
the input unit is used for inputting the audio features into the trained human voice detection network model;
a first determining unit, configured to determine whether the audio segment to be processed contains human voice according to an output result of the trained human voice detection network model;
and the second determining unit is used for determining that the audio to be detected belongs to pure music when the audio clip to be processed does not contain human voice.
Optionally, in some embodiments, the apparatus further comprises:
a second obtaining unit, configured to obtain a plurality of audio samples, where the audio samples are known to be pure music or not;
a third determining unit, configured to determine an audio feature of the audio sample according to the audio sample;
an adding unit, configured to add the audio feature to a training sample set;
and the training unit is used for training the human voice detection network model according to the training sample set to obtain the trained human voice detection network model.
Optionally, in some embodiments, the third determining unit is specifically configured to:
carrying out human voice separation processing on the audio sample to obtain an audio segment;
and extracting audio features from the audio segment to serve as the audio features of the audio sample.
Optionally, in some embodiments, the third determining unit is further specifically configured to:
and carrying out human voice separation processing on the audio sample through a Hourglass model.
Optionally, in some embodiments, the audio features include mel features, and the extraction unit is specifically configured to:
performing STFT on the audio segment to be processed to obtain an STFT spectrum;
converting the STFT spectrum to obtain a mel spectrum;
and performing logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel features.
Optionally, in some embodiments, the audio features include mel features and human voice proportion features, and the extracting unit is further specifically configured to:
extracting mel characteristics of the audio clip to be processed;
and extracting the human voice ratio characteristics of the audio clip to be processed.
Optionally, in some embodiments, the extracting unit is further specifically configured to:
performing STFT on the audio segment to be processed to obtain an STFT spectrum;
converting the STFT spectrum to obtain a mel spectrum;
and performing logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel features.
Optionally, in some embodiments, the extracting unit is further specifically configured to:
normalizing the audio clip to be processed to obtain a normalized audio clip to be processed;
filtering out silence from the normalized audio clip to be processed to obtain a filtered audio clip to be processed;
and determining the human voice ratio characteristic according to the time length corresponding to the filtered audio clip to be processed and the time length of the audio to be detected.
Optionally, in some embodiments, the processing unit is specifically configured to:
and carrying out human voice separation processing on the audio to be detected through a Hourglass model.
The embodiment of the present invention further provides a storage medium, where the storage medium stores a plurality of instructions, and the instructions are suitable for being loaded by a processor to perform any of the steps in the pure music detection method provided in the embodiment of the present invention.
The embodiment of the invention obtains the audio to be detected; performs human voice separation processing on the audio to be detected to obtain an audio clip to be processed; then extracts the audio features of the audio clip to be processed; inputs the audio features into a trained human voice detection network model; determines whether the audio clip to be processed contains human voice according to the output result of the model; and, if no human voice is contained, determines that the audio to be detected belongs to pure music. Because pure music detection is performed on the audio clip separated from the audio to be detected rather than on the whole piece of music, the audio actually examined is shorter, which can improve the accuracy of pure music detection.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a system diagram of a pure music detection apparatus according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart of a pure music detection method according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a basic convolutional network in a human voice detection network model according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of an encoding layer in a human voice detection network model according to an embodiment of the present invention.
Fig. 5 is another schematic flow chart of a pure music detection method according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a pure music detecting apparatus according to an embodiment of the present invention.
Fig. 7 is another schematic structural diagram of a pure music detecting device according to an embodiment of the present invention.
Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present invention.
Fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first" and "second", etc. in the present invention are used for distinguishing different objects, not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or modules is not limited to the listed steps or modules but may alternatively include other steps or modules not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Pure music refers to music that does not contain lyrics, and we can determine whether the music belongs to pure music according to whether human voice is contained in the music.
However, in the prior art, whether audio is pure music is generally determined from the entire song. Some songs contain only scattered human voices yet are generally regarded as pure music, so the detection accuracy for pure music is not high.
Therefore, the embodiment of the invention provides a pure music detection method, a pure music detection device and a storage medium. The embodiment obtains the audio to be detected; performs human voice separation processing on the audio to be detected to obtain an audio clip to be processed; then extracts the audio features of the audio clip to be processed; detects whether the audio clip to be processed contains human voice according to the audio features; and, if no human voice is contained, determines that the audio to be detected belongs to pure music. Because pure music detection is performed on the audio clip separated from the audio to be detected rather than on the whole piece of music, the audio actually examined is shorter, which can improve the accuracy of pure music detection.
Specifically, in this embodiment, when detecting whether the audio segment to be processed includes a human voice according to the audio feature, the audio feature may be input into a trained human voice detection network model; and then determining whether the audio clip to be processed contains the human voice according to the output result of the trained human voice detection network model.
When the human voice detection network model is trained, the audio features separated and extracted from the audio samples are used as training samples for training, and the length of the audio features is short, so that when the audio features are used as the training samples for training, the model can be trained and optimized easily.
The pure music detection method provided by the embodiment of the invention can be realized in a pure music detection device, and the pure music detection device can be integrated in equipment such as a terminal or a server.
Referring to fig. 1, fig. 1 is a schematic diagram of the system of a pure music detection apparatus according to an embodiment of the present invention, which can be used for model training and pure music detection. The model provided by the embodiment of the invention is a deep learning network model, such as a human voice detection network model. When the model is trained, audio samples for training are obtained in advance; human voice separation is then performed on each audio sample to obtain an audio segment, the audio features of the sample are determined from the audio segment, the audio features are added to a training sample set, and the human voice detection network model is trained on the training sample set to obtain the trained model. When pure music detection is performed, the audio to be detected is first acquired; human voice separation processing is performed on it to obtain an audio clip to be processed; the audio features of the audio clip to be processed are then extracted; finally, the audio features are input into the trained human voice detection network model, and whether the audio clip to be processed contains human voice is determined from the model's output result. If it does not, the audio to be detected is determined to be pure music; otherwise, it is determined not to be pure music.
Detailed descriptions are given separately below; the order in which the embodiments are described does not limit the order of implementation.
Referring to fig. 2, fig. 2 is a schematic flow chart illustrating a pure music detection method according to an embodiment of the present invention. The method comprises the following steps:
201. and acquiring the audio to be detected.
In this embodiment, when it is required to detect whether the audio to be detected is pure music, the audio to be detected needs to be input into a pure music detection device, where the pure music detection device in the present invention may be integrated in a terminal or a server, and the device includes, but is not limited to, a computer, a smart television, a smart sound box, a mobile phone, a tablet computer, and the like.
The pure music detection device in this embodiment includes a trained human voice detection network model, which is a deep learning network model. Before acquiring the audio to be detected, this embodiment first needs to train the human voice detection network so that it can be used to detect whether the audio to be detected contains human voice.
Specifically, when training the human voice detection network, the following steps may be specifically performed:
a. a plurality of audio samples is obtained.
First, a plurality of audio samples is acquired, where it is known whether each audio sample is pure music. The audio samples include long audio samples and short audio samples: the long audio samples have durations of tens of minutes, while the short audio samples last only a few seconds. The duration of a long audio sample is longer than that of a short one; the specific durations are not limited here.
b. An audio feature of the audio sample is determined from the audio sample.
Specifically, human voice separation processing needs to be performed on the audio sample to obtain an audio segment; the audio features are then extracted from the audio segment and taken as the audio features of the sample.
More specifically, in some embodiments, the audio sample needs to undergo human voice separation processing by the Hourglass model and be separated into an audio segment and a pure music segment, where the audio segment mentioned in this embodiment is the "vocal segment" extracted from the audio sample.
The Hourglass model is obtained by training on the DSD100 data set (human voice plus other instruments such as drums, bass and guitar); it supports blind source separation of arbitrarily many sources and can separate the human voice from the instrument sounds in audio.
It should be noted that, in the prior art, when human voice separation is performed with the Hourglass model, some instruments that sound similar to the human voice are usually mistakenly recognized as human voice and retained; the separation is therefore not clean and its precision is limited.
The original audio sent to the Hourglass model is first merged into a single channel and down-sampled to 8 kHz; the specific degree of down-sampling is not limited here, and values other than 8 kHz may be used according to the specific situation.
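As a minimal illustration of this preprocessing, the sketch below (assuming the librosa library; the 8 kHz rate comes from the description above, while the function and variable names are our own) folds a song to one channel and resamples it before separation. The Hourglass call at the end is a hypothetical placeholder, not a real API.

```python
import librosa

def preprocess_for_separation(path: str, target_sr: int = 8000):
    """Merge a song into a single channel and down-sample it for separation.

    librosa.load averages the channels when mono=True and resamples to the
    requested rate, matching the preprocessing described for the Hourglass model.
    """
    waveform, sr = librosa.load(path, sr=target_sr, mono=True)
    return waveform, sr

# Hypothetical usage: the vocal stem would then come from the separation model,
# e.g. vocal = hourglass_separate(waveform)  # placeholder, not a real API
```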
Since the Hourglass model in the present invention has a high recall rate, the audio segment may be a true "vocal segment" (in which case the label of the sample should be non-pure music) or a false "vocal segment" (instrument sound mistakenly recognized as human voice, in which case the label of the sample should be pure music); that is, it may or may not contain human voice.
After the audio segments are extracted from the audio samples by the Hourglass model, feature extraction needs to be performed on the segments output by the model to obtain the audio features, where the audio features may include mel features and a human voice proportion feature (Vocal_ratio).
When the mel features are extracted, the method specifically comprises the following steps: performing STFT on the audio segment to obtain an STFT spectrum; converting the STFT spectrum to obtain a mel spectrum; and performing logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel features. The logarithmic and first-order difference processing proceeds as follows: (1) add 1 to the mel spectrum and take the logarithm, obtaining log(1+x), where x is the mel spectrum; (2) take the first-order difference of the result of (1) along the time direction, take its absolute value, and then combine the absolute value with the result of (1) to form the mel features. In some embodiments, the result of (1) needs to be scaled in time to a fixed length, where the fixed length may be 2000 frames; the specific length is not limited here.
The mel features are spectral features obtained through a filter bank that conforms to the auditory characteristics of the human ear, and they reflect the time-frequency structure of the audio. Since the mel features contain information for judging whether audio is pure music or human voice, they can serve as a basis for pure music judgment and can be used to train the model.
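To make the above steps concrete, here is a minimal sketch assuming librosa, NumPy, and SciPy. The FFT parameters and mel-band count are illustrative assumptions; the 2000-frame target length comes from the description; and stacking the log-mel spectrum with its absolute first-order difference is one possible reading of "combine the absolute value with the result of (1)".

```python
import numpy as np
import librosa
from scipy.ndimage import zoom

def mel_features(segment: np.ndarray, sr: int = 8000,
                 n_mels: int = 128, target_frames: int = 2000) -> np.ndarray:
    """Mel features: STFT -> mel spectrum -> log(1 + x) -> first-order difference."""
    stft = np.abs(librosa.stft(segment, n_fft=1024, hop_length=256))
    mel = librosa.feature.melspectrogram(S=stft ** 2, sr=sr, n_mels=n_mels)
    log_mel = np.log1p(mel)                              # step (1): log(1 + x)
    # scale the time axis to the fixed length (2000 frames in the description)
    log_mel = zoom(log_mel, (1, target_frames / log_mel.shape[1]), order=1)
    diff = np.abs(np.diff(log_mel, axis=1))              # step (2): |first-order difference|
    diff = np.pad(diff, ((0, 0), (1, 0)))                # pad one frame so shapes match
    # combine the two parts along the frequency axis (our reading of "combine")
    return np.concatenate([log_mel, diff], axis=0)       # (2 * n_mels, target_frames)
```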
When the human voice proportion feature is extracted, specifically, the audio segment is normalized (to the range 0-1) to obtain a normalized audio segment; silence filtering is performed on the normalized audio segment (the filtering threshold may be set to 20 dB, or to another value according to the specific situation; the specific threshold is not limited here) to obtain a filtered audio segment; and the human voice proportion feature is determined from the duration of the filtered audio segment and the duration of the audio to be detected, i.e., Vocal_ratio = duration of human voice in the non-silent regions / duration of the full song.
The value of the human voice proportion feature is generally low for pure music and generally high for non-pure music, so the feature can be used to train the model.
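A possible implementation of this feature is sketched below, again assuming librosa; the 20 dB threshold comes from the description, and librosa.effects.split stands in for the silence-filtering step.

```python
import numpy as np
import librosa

def vocal_ratio(vocal_segment: np.ndarray, sr: int, full_song_duration: float) -> float:
    """Vocal_ratio = duration of non-silent vocal audio / duration of the full song."""
    peak = np.max(np.abs(vocal_segment))
    normalized = vocal_segment / peak if peak > 0 else vocal_segment  # normalize to 0-1
    # keep only intervals within 20 dB of the peak, i.e. filter out silence
    intervals = librosa.effects.split(normalized, top_db=20)
    voiced_seconds = sum(end - start for start, end in intervals) / sr
    return voiced_seconds / full_song_duration
```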
In some embodiments, if the model does not need to be trained using the human voice proportion feature, the human voice proportion feature may not be extracted.
c. The audio features are added to a set of training samples.
When the audio features of an audio sample have been extracted, they need to be added to the training sample set. Since it is known whether the audio sample is pure music, we can attach a label to the audio features extracted from it, indicating whether they reflect pure music or non-pure music; thus, one group of samples in the training sample set consists of (mel features, human voice proportion feature, label).
d. And training the voice detection network model according to the training sample set to obtain the trained voice detection network model.
After the training sample set is obtained, the human voice detection network model is trained on it. Specifically, a group of audio features is input into a preset human voice detection network model to obtain a prediction result, and the model is then trained using the prediction result and the label of that group of audio features.
When the human voice detection network model is trained, the audio features separated and extracted from the audio samples are used as training samples for training, and the length of the audio features is short, so that when the audio features are used as the training samples for training, the model can be trained and optimized easily.
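As a minimal sketch of one such training step, assuming PyTorch and a binary cross-entropy objective (the model itself is sketched in the detailed embodiment further below; none of these names come from the patent):

```python
import torch
import torch.nn as nn

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               mel: torch.Tensor, vocal_ratio: torch.Tensor,
               label: torch.Tensor) -> float:
    """Predict on one group of audio features, then train on the prediction and label."""
    optimizer.zero_grad()
    prediction = model(mel, vocal_ratio)                 # forward pass
    loss = nn.functional.binary_cross_entropy(prediction, label.float())
    loss.backward()                                      # back-propagate the error
    optimizer.step()                                     # adjust the network weights
    return loss.item()
```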
202. And carrying out human voice separation processing on the audio to be detected to obtain an audio clip to be processed.
In some embodiments, human voice separation processing needs to be performed on the audio to be detected through the Hourglass model, separating it into an audio segment and a pure music segment, where the audio segment mentioned in this embodiment is the "vocal segment" extracted from the audio to be detected.
In the method, the Hourglass model in this step accepts, as a cost, that some instruments may be mistakenly identified as human voice, so that all human voices in the audio are recalled and none is missed, ensuring a high recall rate.
Because the recall rate of the Hourglass model in the present invention is relatively high, the audio segment to be processed may be a true vocal segment or a false one; that is, it may or may not contain human voice. It is therefore necessary to determine in the subsequent steps whether the audio segment to be processed is a true vocal segment.
203. And extracting the audio features of the audio clip to be processed.
The audio features in the embodiment of the present invention may be used to indicate whether the audio is pure music or human voice.
After the audio segment to be processed is extracted from the audio to be detected through the Hourglass model, feature extraction needs to be performed on the segment output by the model to obtain the audio features, where the audio features may include mel features and the human voice proportion feature.
When the mel features are extracted, the method specifically comprises the following steps: performing short-time Fourier transform (STFT) on the audio segment to be processed to obtain an STFT spectrum; converting the STFT spectrum to obtain a mel spectrum; and performing logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel features. The logarithmic and first-order difference processing proceeds as follows: (1) add 1 to the mel spectrum and take the logarithm, obtaining log(1+x), where x is the mel spectrum; (2) take the first-order difference of the result of (1) along the time direction, take its absolute value, and then combine the absolute value with the result of (1) to form the mel features. In some embodiments, the result of (1) needs to be scaled in time to a fixed length, where the fixed length may be 2000 frames; the specific length is not limited here.
When the human voice proportion feature is extracted, specifically, the audio segment to be processed is normalized (to the range 0-1) to obtain a normalized audio segment to be processed; silence filtering is performed on the normalized audio clip to be processed (the filtering threshold may be set to 20 dB, or to another value according to the specific situation; the specific threshold is not limited here) to obtain a filtered audio clip to be processed; and the human voice proportion feature is determined from the duration of the filtered audio clip to be processed and the duration of the audio to be detected, i.e., Vocal_ratio = duration of human voice in the non-silent regions / duration of the full song.
In some embodiments, whether the human voice proportion feature is extracted depends on the type of the trained human voice detection network model: if the model was trained on both the mel features and the human voice proportion feature, the human voice proportion feature needs to be extracted; if the model was trained only on the mel features, it does not.
It should be noted that, in the embodiment of the present invention, besides the mel features and the human voice proportion feature, there may be other audio features capable of indicating whether audio is human voice or pure music; the specific feature types are not limited here.
204. And inputting the audio features into the trained human voice detection network model.
In some embodiments, after the audio features of the audio to be detected are obtained, they are input into the trained human voice detection network model to obtain an output result, which may be used to determine whether the audio segment to be processed contains human voice.
205. And determining whether the audio clip to be processed contains human voice according to the output result of the trained human voice detection network model.
In some embodiments, if the output result is greater than 0.5, it is determined that the audio segment to be processed does not contain human voice; if the output result is less than 0.5, it is determined that the audio segment to be processed contains human voice. For example, an output of 1 indicates that the segment does not contain human voice, and an output of 0 indicates that it does.
In some embodiments, the Vocal_ratio may additionally be output. Statistics show that the Vocal_ratio of pure music is generally low while that of non-pure music is generally high, so the Vocal_ratio is a physically meaningful value that can be consulted manually.
206. And if no human voice is contained, determining that the audio to be detected belongs to pure music.
In this embodiment, when it is determined that the audio segment to be processed does not contain human voice, it is determined that the audio to be detected belongs to pure music at this time, and otherwise, it is determined that the audio to be detected does not belong to pure music.
The embodiment of the invention obtains the audio to be detected; performs human voice separation processing on the audio to be detected to obtain an audio clip to be processed; then extracts the audio features of the audio clip to be processed; inputs the audio features into a trained human voice detection network model; determines whether the audio clip to be processed contains human voice according to the output result of the model; and, if no human voice is contained, determines that the audio to be detected belongs to pure music. Because pure music detection is performed on the audio clip separated from the audio to be detected rather than on the whole piece of music, the audio actually examined is shorter, which can improve the accuracy of pure music detection.
The method according to the preceding embodiment is illustrated in further detail below by way of example.
In the present embodiment, the pure music detection apparatus will be described by taking as an example that it is specifically integrated in a server.
And (I) training a model.
Firstly, the server acquires a large number of audio samples through various channels, where it is known whether each audio sample is pure music. The audio samples include long audio samples and short audio samples: the long audio samples have durations of tens of minutes, while the short audio samples last only a few seconds. The duration of a long audio sample is longer than that of a short one; the specific durations are not limited here.
Then, the audio features of each audio sample are determined from the sample. Specifically, human voice separation processing is performed on the audio sample to obtain an audio segment; the audio features are then extracted from the audio segment and taken as the audio features of the sample.
More specifically, in some embodiments, the audio sample needs to undergo human voice separation processing by the Hourglass model and be separated into an audio segment and a pure music segment, where the audio segment mentioned in this embodiment is the "vocal segment" extracted from the audio sample.
It should be noted that, in the prior art, when human voice separation is performed with the Hourglass model, some instruments that sound similar to the human voice are usually mistakenly recognized as human voice and retained; the separation is therefore not clean and its precision is limited.
The original audio sent to the Hourglass model is first merged into a single channel and down-sampled to 8 kHz; the specific degree of down-sampling is not limited here, and values other than 8 kHz may be used according to the specific situation.
Because the recall rate of the Hourglass model in the present invention is relatively high, the audio segment may be a true "vocal segment" or a false one; that is, it may or may not contain human voice.
After the audio segments are extracted from the audio samples by the Hourglass model, feature extraction needs to be performed on the segments output by the model to obtain the audio features, where the audio features may include mel features and a human voice proportion feature (Vocal_ratio).
When the mel features are extracted, the method specifically comprises the following steps: performing STFT on the audio segment to obtain an STFT spectrum; converting the STFT spectrum to obtain a mel spectrum; and performing logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel features. The logarithmic and first-order difference processing proceeds as follows: (1) add 1 to the mel spectrum and take the logarithm, obtaining log(1+x), where x is the mel spectrum; (2) take the first-order difference of the result of (1) along the time direction, take its absolute value, and then combine the absolute value with the result of (1) to form the mel features. In some embodiments, the result of (1) needs to be scaled in time to a fixed length, where the fixed length may be 2000 frames; the specific length is not limited here.
The mel features are spectral features obtained through a filter bank that conforms to the auditory characteristics of the human ear, and they reflect the time-frequency structure of the audio. Since the mel features contain information for judging whether audio is pure music or human voice, they can serve as a basis for pure music judgment and can be used to train the model.
When the human voice proportion feature is extracted, specifically, the audio segment is normalized (to the range 0-1) to obtain a normalized audio segment; silence filtering is performed on the normalized audio segment (the filtering threshold may be set to 20 dB, or to another value according to the specific situation; the specific threshold is not limited here) to obtain a filtered audio segment; and the human voice proportion feature is determined from the duration of the filtered audio segment and the duration of the audio to be detected, i.e., Vocal_ratio = duration of human voice in the non-silent regions / duration of the full song.
The value of the human voice proportion feature is generally low for pure music and generally high for non-pure music, so the feature can be used to train the model.
In some embodiments, if the model does not need to be trained with the human voice proportion feature, this feature may not be extracted; in that case, only the mel features need to be extracted during subsequent pure music detection.
Of course, in the general embodiment, besides training the model with the mel features, training with the human voice proportion feature is also needed to further improve the accuracy of model detection.
Then, the audio features are added to the training sample set. More specifically, when the audio features of an audio sample have been extracted, they need to be added to the training sample set. Since it is known whether the audio sample is pure music, we can attach a label to the audio features extracted from it, indicating whether they reflect pure music or non-pure music; thus, one group of samples in the training sample set consists of (mel features, human voice proportion feature, label).
And finally, the human voice detection network model is trained according to the training sample set to obtain the trained human voice detection network model. That is, after the training sample set is obtained, the model is trained on it: specifically, a group of audio features is input into a preset human voice detection network model to obtain a prediction result, and the model is then trained using the prediction result and the label of that group of audio features.
When the human voice detection network model is trained, the audio features separated and extracted from the audio samples are used as training samples for training, and the length of the audio features is short, so that when the audio features are used as the training samples for training, the model can be trained and optimized easily.
In some embodiments, the trained human voice detection network model in the embodiments of the present invention is formed by a base convolutional network + an encoding layer + a feature fusion layer + a fully-connected classification layer.
In some embodiments, the basic convolutional network is a network without dilation (expansion) coefficients; a schematic diagram of one layer of the basic convolutional network is shown in fig. 3. The feature map formed after stacking multiple basic convolutional layers needs to be converted into a vector so that subsequent classification is possible. We designed an encoding layer (encoder) to accomplish this task; its structure is shown in fig. 4. The encoding layer learns the importance of the data at each time step (a softmax mask) through a convolution with only one convolution kernel, multiplies the data at each time step by its importance value, and sums the products along the time axis to obtain a feature vector. This technique is equivalent to encoding features distributed over the time steps into a single point, and is therefore called an encoding layer.
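A minimal PyTorch interpretation of the encoding layer is sketched below; the single-kernel convolution, the softmax importance mask, and the summation along the time axis follow the description above, while the tensor layout is an assumption.

```python
import torch
import torch.nn as nn

class EncodingLayer(nn.Module):
    """Collapse a (batch, channels, time) feature map into a (batch, channels) vector."""

    def __init__(self, channels: int):
        super().__init__()
        # a convolution with only one kernel scores each time step
        self.score = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # softmax over time turns the scores into an importance mask
        mask = torch.softmax(self.score(x), dim=-1)      # (batch, 1, time)
        # weight each time step by its importance and sum along the time axis
        return (x * mask).sum(dim=-1)                    # (batch, channels)
```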
Specifically, after the training sample set is constructed, the samples in it are input into the human voice detection network model: the mel features are fed into the basic convolutional network, fixed-length encoding is then performed, and the human voice proportion feature is added at the feature fusion layer; after feature fusion, a classification result is output through the fully-connected classification layer. The classification result is compared with the label corresponding to the sample, and the weights of the basic convolutional network are adjusted according to the comparison error until the model converges.
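Assembling the pieces, the sketch below reuses the EncodingLayer above and strings together a stand-in basic convolutional network, the fixed-length encoding, the fusion of the human voice proportion feature, and the fully-connected classification layer. The layer sizes and the use of 1-D convolutions are assumptions for illustration; figs. 3 and 4 define the actual architecture.

```python
class VoiceDetector(nn.Module):
    """Basic conv network + encoding layer + feature fusion + fully-connected classifier."""

    def __init__(self, feat_bins: int = 256, channels: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(                   # stand-in basic conv network
            nn.Conv1d(feat_bins, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels), nn.ReLU(),
        )
        self.encoder = EncodingLayer(channels)           # fixed-length encoding
        self.classifier = nn.Sequential(                 # fusion + classification
            nn.Linear(channels + 1, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, mel: torch.Tensor, vocal_ratio: torch.Tensor) -> torch.Tensor:
        # mel: (batch, feat_bins, time); vocal_ratio: (batch, 1)
        h = self.encoder(self.backbone(mel))             # (batch, channels)
        fused = torch.cat([h, vocal_ratio], dim=1)       # fuse the scalar feature
        return self.classifier(fused).squeeze(1)         # probability per sample
```

The default feat_bins of 256 matches the stacked log-mel plus difference (2 × 128 bands) produced by the earlier feature sketch.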
And (II) pure music detection.
As shown in fig. 5, based on the trained human voice detection network model, another flow of the pure music detection method may be as follows:
501. and acquiring the audio to be detected.
In this embodiment, when it is required to detect whether the audio to be detected is pure music, the audio to be detected needs to be input into a pure music detection device, where the pure music detection device in the present invention may be integrated in a terminal or a server, and the device includes, but is not limited to, a computer, a smart television, a smart sound box, a mobile phone, a tablet computer, and the like.
502. And carrying out human voice separation processing on the audio to be detected through a Hourglass model to obtain an audio fragment to be processed.
It should be noted that, in the prior art, when human voice separation is performed with the Hourglass model, some instruments that sound similar to the human voice are usually mistakenly recognized as human voice and retained; the separation is therefore not clean and its precision is limited.
The original audio sent to the Hourglass model is first merged into a single channel and down-sampled to 8 kHz; the specific degree of down-sampling is not limited here, and values other than 8 kHz may be used according to the specific situation.
Because the recall rate of the Hourglass model in the present invention is relatively high, the audio segment to be processed may be a true vocal segment or a false one; that is, it may or may not contain human voice. It is therefore necessary to determine in the subsequent steps whether the audio segment to be processed is a true vocal segment.
503. And extracting mel characteristics and human voice proportion characteristics of the audio clip to be processed.
After the audio segment to be processed is extracted from the audio to be detected through the Hourglass model, feature extraction needs to be performed on the segment output by the model to obtain the audio features, where the audio features may include mel features and the human voice proportion feature.
When the mel features are extracted, the method specifically comprises the following steps: performing STFT on the audio segment to be processed to obtain an STFT spectrum; converting the STFT spectrum to obtain a mel spectrum; and performing logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel features. The logarithmic and first-order difference processing proceeds as follows: (1) add 1 to the mel spectrum and take the logarithm, obtaining log(1+x), where x is the mel spectrum; (2) take the first-order difference of the result of (1) along the time direction, take its absolute value, and then combine the absolute value with the result of (1) to form the mel features. In some embodiments, the result of (1) needs to be scaled in time to a fixed length, where the fixed length may be 2000 frames; the specific length is not limited here.
When the human voice proportion feature is extracted, specifically, the audio segment to be processed is normalized (to the range 0-1) to obtain a normalized audio clip to be processed; silence filtering is performed on the normalized audio clip to be processed (the filtering threshold may be set to 20 dB, or to another value according to the specific situation; the specific threshold is not limited here) to obtain a filtered audio clip to be processed; and the human voice proportion feature is determined from the duration of the filtered audio clip to be processed and the duration of the audio to be detected, i.e., Vocal_ratio = duration of human voice in the non-silent regions / duration of the full song.
In some embodiments, whether the human voice proportion feature is extracted depends on the type of the trained human voice detection network model: if the model was trained on both the mel features and the human voice proportion feature, the human voice proportion feature needs to be extracted; if the model was trained only on the mel features, it does not.
It should be noted that, in the embodiment of the present invention, besides the mel features and the human voice proportion feature, there may be other audio features capable of indicating whether audio is human voice or pure music; the specific feature types are not limited here.
504. And inputting the audio features into the trained human voice detection network model.
Specifically, the audio features are input into the built-in trained human voice detection network model, and whether the audio clip to be processed contains human voice is then judged according to the model's output result.
In this embodiment, the audio to be detected is first acquired; human voice separation processing is then performed on it with the Hourglass model to obtain the "vocal data"; mel features and the human voice proportion feature are extracted from the "vocal data"; the mel features are input into the basic convolutional network of the trained human voice detection network model and fixed-length encoding is performed; the human voice proportion feature is added at the feature fusion layer; and after feature fusion, a classification result is output through the fully-connected classification layer, i.e., whether the "vocal data" contains true human voice.
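For illustration, this flow could be glued together as follows, reusing the hedged helpers sketched earlier (preprocess_for_separation, mel_features, vocal_ratio, VoiceDetector); hourglass_separate remains a hypothetical placeholder for the Hourglass model.

```python
import torch

def is_pure_music(path: str, model: "VoiceDetector", threshold: float = 0.5) -> bool:
    waveform, sr = preprocess_for_separation(path)       # mono, 8 kHz
    vocal = hourglass_separate(waveform)                 # hypothetical separation call
    mel = torch.from_numpy(mel_features(vocal, sr)).float().unsqueeze(0)
    ratio = torch.tensor([[vocal_ratio(vocal, sr, len(waveform) / sr)]])
    with torch.no_grad():
        output = model(mel, ratio).item()
    # per the description, an output above 0.5 means "no human voice",
    # i.e. the audio to be detected belongs to pure music
    return output > threshold
```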
505. And determining whether the audio clip to be processed contains human voice according to the output result of the trained human voice detection network model; if not, executing step 506, and if so, executing step 507.
In some embodiments, if the output result is greater than 0.5, it is determined that the audio segment to be processed does not contain human voice; if the output result is less than 0.5, it is determined that the audio segment to be processed contains human voice. For example, an output of 1 indicates that the segment does not contain true human voice, and an output of 0 indicates that it does.
In some embodiments, the Vocal_ratio may additionally be output. Statistics show that the Vocal_ratio of pure music is generally low while that of non-pure music is generally high, so the Vocal_ratio is a physically meaningful value that can be consulted manually.
506. Determining that the audio to be detected belongs to pure music.
If the audio segment to be processed does not contain human voice, the audio to be detected does not contain human voice either (any segment that might be human voice has already been separated out and examined), and it can be determined that the audio to be detected belongs to pure music.
507. Determining that the audio to be detected does not belong to pure music.
If the audio segment to be processed contains human voice, then, since this segment was extracted from the audio to be detected, the audio to be detected also contains human voice, and it can be determined that the audio to be detected does not belong to pure music.
The embodiment of the invention obtains the audio to be detected; performs human voice separation processing on the audio to be detected to obtain an audio clip to be processed; then extracts the audio features of the audio clip to be processed; inputs the audio features into a trained human voice detection network model; determines whether the audio clip to be processed contains human voice according to the output result of the model; and, if no human voice is contained, determines that the audio to be detected belongs to pure music. Because pure music detection is performed on the audio clip separated from the audio to be detected rather than on the whole piece of music, the audio actually examined is shorter, which can improve the accuracy of pure music detection.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a pure music detection apparatus according to an embodiment of the present invention. The pure music detection apparatus 600 may include a first acquisition unit 601, a processing unit 602, an extraction unit 603, an input unit 604, a first determination unit 605, and a second determination unit 606, wherein:
a first obtaining unit 601, configured to obtain an audio to be detected;
the processing unit 602 is configured to perform voice separation processing on the audio to be detected to obtain an audio segment to be processed;
an extracting unit 603, configured to extract an audio feature of the audio segment to be processed;
an input unit 604, configured to input the audio feature into a trained human voice detection network model;
a first determining unit 605, configured to determine whether the audio segment to be processed includes a human voice according to an output result of the trained human voice detection network model;
a second determining unit 606, configured to determine that the audio to be detected belongs to pure music when the audio segment to be processed does not include human voice.
As shown in fig. 7, in some embodiments, the apparatus 600 further comprises:
a second obtaining unit 607, configured to obtain a plurality of audio samples, where the audio samples are known to be pure music or not;
a third determining unit 608, configured to determine an audio feature of the audio sample according to the audio sample;
an adding unit 609, configured to add the audio feature to a training sample set;
and the training unit 610 is used for training the human voice detection network model according to the training sample set to obtain the trained human voice detection network model.
In some embodiments, the third determining unit 608 is specifically configured to:
carrying out human voice separation processing on the audio sample to obtain an audio segment;
and extracting audio features from the audio segment to serve as the audio features of the audio sample.
Optionally, in some embodiments, the third determining unit 608 is further specifically configured to:
and carrying out human voice separation processing on the audio sample through a Hourglass model.
In some embodiments, the audio features include mel features, and the extracting unit 603 is specifically configured to:
performing STFT on the audio segment to be processed to obtain an STFT spectrum;
converting the STFT spectrum to obtain a mel spectrum;
and performing logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel features.
In some embodiments, the audio features include mel features and human voice ratio features, and the extracting unit 603 is further specifically configured to:
extracting mel characteristics of the audio clip to be processed;
and extracting the human voice ratio characteristics of the audio clip to be processed.
Optionally, in some embodiments, the extracting unit 603 is further specifically configured to:
performing STFT on the audio segment to be processed to obtain an STFT spectrum;
converting the STFT spectrum to obtain a mel spectrum;
and performing logarithmic processing and first-order difference processing on the mel spectrum to obtain the mel features.
Optionally, in some embodiments, the extracting unit 603 is further specifically configured to:
normalizing the audio clip to be processed to obtain a normalized audio clip to be processed;
filtering out silence from the normalized audio clip to be processed to obtain a filtered audio clip to be processed;
and determining the human voice ratio characteristic according to the time length corresponding to the filtered audio clip to be processed and the time length of the audio to be detected.
Optionally, in some embodiments, the processing unit 602 is specifically configured to:
and carrying out human voice separation processing on the audio to be detected through a Hourglass model.
According to the embodiment of the invention, the audio to be detected is acquired through the first acquisition unit 601; the processing unit 602 performs human voice separation processing on the audio to be detected to obtain an audio clip to be processed; the extraction unit 603 then extracts the audio features of the audio clip to be processed; the input unit 604 inputs the audio features into the trained human voice detection network model; the first determining unit 605 determines whether the audio clip to be processed contains human voice according to the output result of the model; and if no human voice is contained, the second determining unit 606 determines that the audio to be detected belongs to pure music. Because pure music detection is performed on the audio clip separated from the audio to be detected rather than on the whole piece of music, the audio actually examined is shorter, which can improve the accuracy of pure music detection.
An embodiment of the present invention further provides a server, as shown in fig. 8, which shows a schematic structural diagram of the server according to the embodiment of the present invention, specifically:
the server may include components such as a processor 801 of one or more processing cores, memory 802 of one or more computer-readable storage media, a power supply 803, and an input unit 804. Those skilled in the art will appreciate that the server architecture shown in FIG. 8 is not meant to be limiting, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 801 is the control center of the server; it connects the various parts of the entire server using various interfaces and lines, and performs the various functions of the server and processes data by running or executing the software programs and/or modules stored in the memory 802 and calling the data stored in the memory 802, thereby monitoring the server as a whole. Optionally, the processor 801 may include one or more processing cores; preferably, the processor 801 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 801.
The memory 802 may be used to store software programs and modules, and the processor 801 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 802. The memory 802 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the server, and the like. Further, the memory 802 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 802 may also include a memory controller to provide the processor 801 with access to the memory 802.
The server further includes a power supply 803 for supplying power to the various components. Preferably, the power supply 803 may be logically connected to the processor 801 via a power management system, so that charging, discharging, and power consumption management are handled by the power management system. The power supply 803 may also include one or more DC or AC power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and other such components.
The server may further include an input unit 804, and the input unit 804 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the server may further include a display unit and the like, which will not be described in detail herein. Specifically, in this embodiment, the processor 801 in the server loads the executable file corresponding to the process of one or more application programs into the memory 802 according to the following instructions, and the processor 801 runs the application programs stored in the memory 802, thereby implementing various functions as follows:
acquiring audio to be detected; carrying out human voice separation processing on the audio to be detected to obtain an audio clip to be processed; extracting the audio features of the audio clip to be processed; inputting the audio features into a trained human voice detection network model; determining whether the audio clip to be processed contains human voice according to the output result of the trained human voice detection network model; and if no human voice is contained, determining that the audio to be detected belongs to pure music.
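Tying these instructions together, a hedged end-to-end sketch of the detection flow might look as follows; voice_net and the 0.5 decision threshold are assumptions, and separate_vocals and extract_mel_features are the illustrative helpers sketched earlier.

    import librosa

    def is_pure_music(path, hourglass_model, voice_net, clip_s=30.0):
        # Acquire the audio to be detected
        y, sr = librosa.load(path, sr=None, mono=True)
        # Human voice separation, then take a short clip to process
        vocals = separate_vocals(y, hourglass_model)
        clip = vocals[: int(clip_s * sr)]
        # Extract audio features and query the trained detection model
        feats = extract_mel_features(clip, sr)
        prob_voice = voice_net.predict(feats)  # hypothetical trained model
        # No human voice found => the audio belongs to pure music
        return prob_voice < 0.5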
For the above operations, reference may be made to the previous embodiments; details are not repeated here.
As can be seen from the above, the server provided in this embodiment acquires the audio to be detected; performs human voice separation processing on the audio to be detected to obtain an audio clip to be processed; then extracts the audio features of the audio clip to be processed; detects whether the audio clip to be processed contains human voice according to the audio features; and if no human voice is contained, determines that the audio to be detected belongs to pure music. Because the embodiment of the invention performs pure music detection only on the audio clip separated from the audio to be detected, the whole track need not be examined; the audio actually analyzed is shorter, which can improve the accuracy of pure music detection.
Accordingly, an embodiment of the present invention further provides a terminal. As shown in fig. 9, the terminal may include a Radio Frequency (RF) circuit 901, a memory 902 including one or more computer-readable storage media, an input unit 903, a display unit 904, a sensor 905, an audio circuit 906, a Wireless Fidelity (WiFi) module 907, a processor 908 including one or more processing cores, and a power supply 909. Those skilled in the art will appreciate that the terminal structure shown in fig. 9 does not constitute a limitation of the terminal; the terminal may include more or fewer components than those shown, may combine some components, or may use a different arrangement of components. Wherein:
The RF circuit 901 may be used to receive and transmit signals during message transmission or a call; in particular, it receives downlink information from a base station and hands it over to the one or more processors 908 for processing, and transmits uplink data to the base station. In general, the RF circuit 901 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 901 can also communicate with networks and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Message Service (SMS), and the like.
The memory 902 may be used to store software programs and modules, and the processor 908 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 902. The memory 902 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal, and the like. Further, the memory 902 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 902 may also include a memory controller to provide the processor 908 and the input unit 903 with access to the memory 902.
The input unit 903 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. In particular, in one specific embodiment, the input unit 903 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by the user on or near it (for example, operations performed on or near the touch-sensitive surface with a finger, a stylus, or any other suitable object or attachment) and drive the corresponding connection device according to a preset program. Optionally, the touch-sensitive surface may comprise two parts: a touch detection device and a touch controller. The touch detection device detects the position touched by the user, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 908, and receives and executes commands sent by the processor 908. In addition, the touch-sensitive surface may be implemented using resistive, capacitive, infrared, surface acoustic wave, and other types. Besides the touch-sensitive surface, the input unit 903 may include other input devices. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 904 may be used to display information input by or provided to a user and various graphical user interfaces of the terminal, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 904 may include a Display panel, and may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is communicated to the processor 908 to determine the type of touch event, and the processor 908 provides a corresponding visual output on the display panel according to the type of touch event. Although in FIG. 9 the touch sensitive surface and the display panel are two separate components to implement input and output functions, in some embodiments the touch sensitive surface may be integrated with the display panel to implement input and output functions.
The terminal may also include at least one sensor 905, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that may turn off the display panel and/or the backlight when the terminal is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when the mobile phone is stationary, and can be used for applications of recognizing the posture of the mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured in the terminal, detailed description is omitted here.
The audio circuit 906, a speaker, and a microphone may provide an audio interface between the user and the terminal. The audio circuit 906 may transmit the electrical signal converted from received audio data to the speaker, which converts it into a sound signal for output; conversely, the microphone converts a collected sound signal into an electrical signal, which the audio circuit 906 receives and converts into audio data; after the audio data is processed by the processor 908, it is sent via the RF circuit 901 to, for example, another terminal, or output to the memory 902 for further processing. The audio circuit 906 may also include an earbud jack so that a peripheral headset can communicate with the terminal.
WiFi is a short-range wireless transmission technology. Through the WiFi module 907, the terminal can help the user receive and send e-mails, browse web pages, access streaming media, and the like; it provides the user with wireless broadband Internet access. Although fig. 9 shows the WiFi module 907, it is understood that the module is not an essential part of the terminal and may be omitted as needed without changing the essence of the invention.
The processor 908 is a control center of the terminal, connects various parts of the entire handset by various interfaces and lines, and performs various functions of the terminal and processes data by operating or executing software programs and/or modules stored in the memory 902 and calling data stored in the memory 902, thereby performing overall monitoring of the handset. Optionally, processor 908 may include one or more processing cores; preferably, the processor 908 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It is to be appreciated that the modem processor described above may not be integrated into processor 908.
The terminal also includes a power supply 909 (such as a battery) that supplies power to the various components. Preferably, the power supply may be logically connected to the processor 908 via a power management system, so that charging, discharging, and power consumption management are handled by the power management system. The power supply 909 may also include one or more DC or AC power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and other such components.
Although not shown, the terminal may further include a camera, a bluetooth module, and the like, which will not be described herein. Specifically, in this embodiment, the processor 908 in the terminal loads the executable file corresponding to the process of one or more application programs into the memory 902 according to the following instructions, and the processor 908 runs the application programs stored in the memory 902, thereby implementing various functions:
acquiring audio to be detected; carrying out human voice separation processing on the audio to be detected to obtain an audio clip to be processed; extracting the audio features of the audio clip to be processed; detecting whether the audio clip to be processed contains human voice according to the audio features; and if no human voice is contained, determining that the audio to be detected belongs to pure music.
For the above operations, reference may be made to the previous embodiments; details are not repeated here.
As can be seen from the above, the terminal provided in this embodiment acquires the audio to be detected; performs human voice separation processing on the audio to be detected to obtain an audio clip to be processed; then extracts the audio features of the audio clip to be processed; inputs the audio features into a trained human voice detection network model; and determines whether the audio clip to be processed contains human voice according to the output result of the trained human voice detection network model; if no human voice is contained, the audio to be detected is determined to belong to pure music. Because the embodiment of the invention performs pure music detection only on the audio clip separated from the audio to be detected, the whole track need not be examined; the audio actually analyzed is shorter, which can improve the accuracy of pure music detection.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present invention provide a storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any pure music detection method provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
acquiring audio to be detected; carrying out human voice separation processing on the audio to be detected to obtain an audio clip to be processed; extracting the audio features of the audio clip to be processed; inputting the audio features into a trained human voice detection network model; determining whether the audio clip to be processed contains human voice according to the output result of the trained human voice detection network model; and if no human voice is contained, determining that the audio to be detected belongs to pure music.
For the implementation of the above operations, refer to the foregoing embodiments; details are not repeated here.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any pure music detection method provided in the embodiments of the present invention, the beneficial effects that can be achieved by any pure music detection method provided in the embodiments of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The pure music detection method, the pure music detection device, and the storage medium provided by the embodiments of the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help in understanding the method and its core idea. Meanwhile, those skilled in the art may, following the idea of the invention, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the invention.

Claims (11)

1. A pure music detection method, comprising:
acquiring audio to be detected;
carrying out human voice separation processing on the audio to be detected to obtain an audio clip to be processed;
extracting the audio features of the audio clip to be processed;
inputting the audio features into a trained human voice detection network model;
determining whether the audio clip to be processed contains human voice according to the output result of the trained human voice detection network model;
and if human voice is not contained, determining that the audio to be detected belongs to pure music.
2. The method of claim 1, wherein prior to said inputting the audio features into the trained human voice detection network model, the method further comprises:
obtaining a plurality of audio samples, wherein whether each audio sample is pure music is known;
determining the audio features of the audio sample according to the audio sample;
adding the audio features to a training sample set;
and training a human voice detection network model according to the training sample set to obtain the trained human voice detection network model.
3. The method of claim 2, wherein the determining the audio features of the audio sample according to the audio sample comprises:
carrying out human voice separation processing on the audio sample to obtain an audio segment;
and extracting the audio features of the audio segment as the audio features of the audio sample.
4. The method of claim 3, wherein the carrying out human voice separation processing on the audio sample comprises:
and carrying out human voice separation processing on the audio sample through an Hourglass model.
5. The method of claim 1, wherein the audio features comprise mel features, and wherein the extracting the audio features of the audio clip to be processed comprises:
performing short-time Fourier transform (STFT) on the audio clip to be processed to obtain an STFT spectrum;
converting the STFT spectrum to obtain a mel spectrum;
and performing logarithm and first-order difference processing on the mel spectrum to obtain the mel features.
6. The method according to claim 1, wherein the audio features comprise mel features and human voice ratio features, and the extracting the audio features of the audio clip to be processed comprises:
extracting the mel features of the audio clip to be processed;
and extracting the human voice ratio features of the audio clip to be processed.
7. The method according to claim 6, wherein the extracting the mel features of the audio clip to be processed comprises:
performing STFT on the audio clip to be processed to obtain an STFT spectrum;
converting the STFT spectrum to obtain a mel spectrum;
and performing logarithm and first-order difference processing on the mel spectrum to obtain the mel features.
8. The method according to claim 6, wherein the extracting the human voice ratio features of the audio clip to be processed comprises:
normalizing the audio clip to be processed to obtain a normalized audio clip to be processed;
filtering out the silent portions of the normalized audio clip to be processed to obtain a filtered audio clip to be processed;
and determining the human voice ratio feature according to the duration of the filtered audio clip to be processed and the duration of the audio to be detected.
9. The method according to any one of claims 1 to 8, wherein the performing human voice separation processing on the audio to be detected comprises:
and carrying out human voice separation processing on the audio to be detected through an Hourglass model.
10. A pure music detection device, comprising:
the first acquisition unit is used for acquiring the audio to be detected;
the processing unit is used for carrying out human voice separation processing on the audio to be detected to obtain an audio clip to be processed;
the extraction unit is used for extracting the audio features of the audio clip to be processed;
the input unit is used for inputting the audio features into the trained human voice detection network model;
the first determining unit is used for determining whether the audio clip to be processed contains human voice according to the output result of the trained human voice detection network model;
and the second determining unit is used for determining that the audio to be detected belongs to pure music when the audio clip to be processed does not contain human voice.
11. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the pure music detection method according to any one of claims 1 to 9.
CN201910398945.6A 2019-05-14 2019-05-14 Pure music detection method, pure music detection device and storage medium Active CN110097895B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910398945.6A CN110097895B (en) 2019-05-14 2019-05-14 Pure music detection method, pure music detection device and storage medium
PCT/CN2019/109638 WO2020228226A1 (en) 2019-05-14 2019-09-30 Instrumental music detection method and apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910398945.6A CN110097895B (en) 2019-05-14 2019-05-14 Pure music detection method, pure music detection device and storage medium

Publications (2)

Publication Number Publication Date
CN110097895A CN110097895A (en) 2019-08-06
CN110097895B (en) 2021-03-16

Family

ID=67447961

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910398945.6A Active CN110097895B (en) 2019-05-14 2019-05-14 Pure music detection method, pure music detection device and storage medium

Country Status (2)

Country Link
CN (1) CN110097895B (en)
WO (1) WO2020228226A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097895B (en) * 2019-05-14 2021-03-16 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Pure music detection method, pure music detection device and storage medium
CN110648656A (en) * 2019-08-28 2020-01-03 Beijing Dajia Internet Information Technology Co., Ltd. Voice endpoint detection method and device, electronic equipment and storage medium
CN112259119B (en) * 2020-10-19 2021-11-16 Shenzhen Cehui Technology Co., Ltd. Music source separation method based on stacked hourglass network


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8005666B2 (en) * 2006-10-24 2011-08-23 National Institute Of Advanced Industrial Science And Technology Automatic system for temporal alignment of music audio signal with lyrics
JP6158006B2 (en) * 2013-09-17 2017-07-05 株式会社東芝 Audio processing apparatus, method, and program
CN103680517A (en) * 2013-11-20 2014-03-26 华为技术有限公司 Method, device and equipment for processing audio signals
CN103646649B (en) * 2013-12-30 2016-04-13 中国科学院自动化研究所 A kind of speech detection method efficiently
CN108538311B (en) * 2018-04-13 2020-09-15 腾讯音乐娱乐科技(深圳)有限公司 Audio classification method, device and computer-readable storage medium
CN109308901A (en) * 2018-09-29 2019-02-05 百度在线网络技术(北京)有限公司 Chanteur's recognition methods and device
CN110097895B (en) * 2019-05-14 2021-03-16 腾讯音乐娱乐科技(深圳)有限公司 Pure music detection method, pure music detection device and storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102956230A (en) * 2011-08-19 2013-03-06 杜比实验室特许公司 Method and device for song detection of audio signal
CN102982804A (en) * 2011-09-02 2013-03-20 杜比实验室特许公司 Method and system of voice frequency classification
CN104078050A (en) * 2013-03-26 2014-10-01 杜比实验室特许公司 Device and method for audio classification and audio processing
CN106409310A (en) * 2013-08-06 2017-02-15 华为技术有限公司 Audio signal classification method and device
CN108320756A (en) * 2018-02-07 2018-07-24 广州酷狗计算机科技有限公司 It is a kind of detection audio whether be absolute music audio method and apparatus
CN108538309A (en) * 2018-03-01 2018-09-14 杭州趣维科技有限公司 A kind of method of song detecting
CN108877783A (en) * 2018-07-05 2018-11-23 腾讯音乐娱乐科技(深圳)有限公司 The method and apparatus for determining the audio types of audio data
CN109166593A (en) * 2018-08-17 2019-01-08 腾讯音乐娱乐科技(深圳)有限公司 audio data processing method, device and storage medium
CN109545191A (en) * 2018-11-15 2019-03-29 电子科技大学 The real-time detection method of voice initial position in a kind of song

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Audio Classification and Segmentation Based on Beats and Key Background Models; Wang Dongdong; China Master's Theses Full-text Database, Information Science and Technology Series; 2018-02-15 (No. 02); pp. 8-41 *

Also Published As

Publication number Publication date
CN110097895A (en) 2019-08-06
WO2020228226A1 (en) 2020-11-19

Similar Documents

Publication Publication Date Title
CN109166593B (en) Audio data processing method, device and storage medium
CN112863547B (en) Virtual resource transfer processing method, device, storage medium and computer equipment
CN109903773B (en) Audio processing method, device and storage medium
KR102261552B1 (en) Providing Method For Voice Command and Electronic Device supporting the same
CN109558512B (en) Audio-based personalized recommendation method and device and mobile terminal
CN109256146B (en) Audio detection method, device and storage medium
CN106782600B (en) Scoring method and device for audio files
CN110097895B (en) Pure music detection method, pure music detection device and storage medium
CN106847307B (en) Signal detection method and device
CN110830368B (en) Instant messaging message sending method and electronic equipment
CN107229629B (en) Audio recognition method and device
KR20160106075A (en) Method and device for identifying a piece of music in an audio stream
CN107731241B (en) Method, apparatus and storage medium for processing audio signal
CN109872710B (en) Sound effect modulation method, device and storage medium
CN109885162B (en) Vibration method and mobile terminal
CN109243488B (en) Audio detection method, device and storage medium
CN110335629B (en) Pitch recognition method and device of audio file and storage medium
WO2017215615A1 (en) Sound effect processing method and mobile terminal
CN109240486B (en) Pop-up message processing method, device, equipment and storage medium
CN108600559B (en) Control method and device of mute mode, storage medium and electronic equipment
CN106486119B (en) A kind of method and apparatus identifying voice messaging
CN111522592A (en) Intelligent terminal awakening method and device based on artificial intelligence
CN109346102B (en) Method and device for detecting audio beginning crackle and storage medium
CN112259076B (en) Voice interaction method, voice interaction device, electronic equipment and computer readable storage medium
CN106782614B (en) Sound quality detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant