CN108257614A - Method and system for labeling audio data - Google Patents
Method and system for labeling audio data
- Publication number
- CN108257614A CN108257614A CN201611247230.3A CN201611247230A CN108257614A CN 108257614 A CN108257614 A CN 108257614A CN 201611247230 A CN201611247230 A CN 201611247230A CN 108257614 A CN108257614 A CN 108257614A
- Authority
- CN
- China
- Prior art keywords
- audio
- classification label
- audio data
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/51—specially adapted for particular use, for comparison or discrimination
- G10L25/12—characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
- G10L25/24—characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/30—characterised by the analysis technique, using neural networks
- G10L25/45—characterised by the type of analysis window
Abstract
The present invention provides a method and system for labeling audio data. The method includes: receiving audio data to be labeled; obtaining an audio segment of the audio data to be labeled; analyzing the segment with at least one pre-trained training model to determine the segment's classification label; and labeling the audio data corresponding to the segment with that classification label. This realizes automated labeling of audio data and improves labeling accuracy.
Description
Technical field
The present invention relates to the field of audio analysis and processing technology, and in particular to a method and system for labeling audio data.
Background technology
With the rapid development of audio collection and Internet technologies, a large volume of audio data (for example, songs) is uploaded to the network every day. Genre classification of audio data helps users quickly find the audio they like. Traditional audio classification, however, labels audio data through manual screening and annotation, which requires substantial labor and time; moreover, personal subjective factors cause classification results to vary widely, so the accuracy of audio data labeling is low.
Invention content
The present invention provides a method and system for labeling audio data. By extracting feature vectors from part of the audio data, it completes automated labeling of audio data and improves labeling accuracy.
In a first aspect, an embodiment of the present invention provides a method for labeling audio data, the method including:
Receiving audio data to be labeled;
Obtaining an audio segment of the audio data to be labeled, analyzing the segment with at least one pre-trained training model, and determining the classification label of the segment;
Labeling the audio data corresponding to the segment with the classification label.
By obtaining an audio segment of the audio data to be labeled, analyzing the segment with a trained model, and labeling the corresponding audio data with the resulting classification label, the method realizes automated labeling of audio data and improves labeling accuracy.
Optionally, in one design, before the audio segment is analyzed with the at least one pre-trained training model, the method further includes:
Obtaining, according to at least one classification label, multiple audio data to be trained for each classification label;
Obtaining the audio segments of the multiple audio data to be trained for each classification label, and extracting feature vectors of the audio segments;
Training on the feature vectors of the multiple audio segments corresponding to the at least one classification label, to obtain at least one training model corresponding to the at least one classification label.
Optionally, in one design, extracting the feature vector of an audio segment includes:
Extracting the feature vector of the audio segment using Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP).
Optionally, in one design, before the feature vector of the audio segment is extracted, the method further includes:
Applying a Hamming window to the audio segment.
Optionally, in one design, training on the feature vectors of the multiple audio segments corresponding to the at least one classification label includes:
Training on the feature vectors of the multiple audio segments corresponding to the at least one classification label using a convolutional neural network (CNN).
In a second aspect, an embodiment of the present invention provides a system, the system including:
A receiving unit, configured to receive audio data to be labeled;
A processing unit, configured to obtain an audio segment of the audio data to be labeled, analyze the segment with at least one pre-trained training model, and determine the classification label of the segment;
The processing unit being further configured to label the audio data corresponding to the segment with the classification label.
By obtaining an audio segment of the audio data to be labeled, analyzing it with a trained model, and labeling the corresponding audio data with the resulting classification label, the system realizes automated labeling of audio data and improves labeling accuracy.
Optionally, in one design, the system further includes a training unit;
The processing unit is further configured to obtain, according to at least one classification label, multiple audio data to be trained for each classification label;
The processing unit is further configured to obtain the audio segments of the multiple audio data to be trained for each classification label, and to extract the feature vectors of the audio segments;
The training unit is configured to train on the feature vectors of the multiple audio segments corresponding to the at least one classification label, to obtain at least one training model corresponding to the at least one classification label.
Optionally, in one design, the processing unit extracts the feature vector of the audio segment using Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP).
Optionally, in one design, the processing unit is further configured to apply a Hamming window to the audio segment.
Optionally, in one design, the training unit trains on the feature vectors of the multiple audio segments corresponding to the at least one classification label using a convolutional neural network (CNN).
Based on the method and system for labeling audio data provided by the present invention, an audio segment of the audio data to be classified is taken and, by means of a pre-trained training model, the audio data is classified and labeled, realizing automated labeling of audio data and improving labeling accuracy.
Description of the drawings
In order to illustrate the technical solution of the embodiments of the present invention more clearly, it will make below to required in the embodiment of the present invention
Attached drawing is briefly described, it should be apparent that, drawings described below is only some embodiments of the present invention, for
For those of ordinary skill in the art, without creative efforts, other are can also be obtained according to these attached drawings
Attached drawing.
Fig. 1 is a flowchart of a method for labeling audio data according to an embodiment of the present invention;
Fig. 2 is a flowchart of a model training method according to an embodiment of the present invention;
Fig. 3 shows labeling results according to an embodiment of the present invention;
Fig. 4 is a structural diagram of a system according to an embodiment of the present invention.
Specific embodiment
The present invention provides a method and system for labeling audio data, suitable for classifying audio data, for example classifying songs by genre and labeling the genre.
The technical solutions of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for labeling audio data according to an embodiment of the present invention. As shown in Fig. 1, the method may include the following steps:
S110: Receive audio data to be labeled.
The audio data to be labeled is audio data pending classification, for example audio data in an audio database that needs to be classified by genre. More specifically, songs in a music library are classified by genre, i.e., given genre classification labels, such as pop (POP), rock (Rock), hip-hop (Rap), jazz (Jazz), blues (Blues), classical (Classical), punk (Punk), metal (Metal), Latin music (Latin Music), reggae (Reggae), new age (New Age), country or folk music (Folk Music or Country Music), electronic dance music (Electronic Dance), children's songs (Child Music), folk instrumental music, folk songs, world (World) music, HiFi music, and so on.
S120: Obtain an audio segment of the audio data to be labeled, analyze the segment with at least one pre-trained training model, and determine the classification label of the segment.
In the embodiment of the present invention, only part of the audio data to be labeled is obtained, to speed up acquisition: a 30-second audio segment is extracted from the audio data to be labeled. The specific acquisition process is: sample at a rate of 16 kHz with a frame length of 512 samples and a frame shift of 16 ms (i.e., 256 samples), to obtain the frames of the audio segment. In the embodiment of the present invention, 1875 frames are obtained per song, consistent with the original audio data.
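The framing scheme described above (16 kHz sampling, 512-sample frames, a 16 ms / 256-sample frame shift, and 1875 frames from a 30-second clip) can be sketched as follows. This is an illustrative reconstruction rather than code from the patent; in particular, zero-padding the tail is an assumption needed to arrive at exactly 1875 frames.

```python
import numpy as np

SR = 16_000      # sample rate stated in the description
FRAME_LEN = 512  # samples per frame (32 ms at 16 kHz)
HOP = 256        # frame shift: 16 ms = 256 samples

def frame_signal(x, frame_len=FRAME_LEN, hop=HOP):
    """Split a 1-D signal into overlapping frames, zero-padding the tail."""
    n_frames = int(np.ceil(len(x) / hop))
    pad = (n_frames - 1) * hop + frame_len - len(x)
    x = np.pad(x, (0, max(pad, 0)))
    # index matrix: row i covers samples [i*hop, i*hop + frame_len)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

clip = np.zeros(30 * SR)          # stand-in for a 30-second audio clip
frames = frame_signal(clip)
print(frames.shape)               # (1875, 512)
```

With these constants, a 30-second clip (480,000 samples) yields exactly the 1875 frames the description mentions.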
In the embodiment of the present invention, before the audio segment is analyzed with the at least one pre-trained training model, the at least one training model must be trained; the specific training process is described with reference to Fig. 2.
The audio segment is analyzed with the trained training model(s) to determine its classification. Optionally, in the embodiment of the present invention, AlexNet is used as the training model to analyze the audio segment. Compared with other models such as LeNet, AlexNet has the following advantages: a larger network (5 convolutional layers + 3 fully connected layers + 1 softmax layer); measures against overfitting (dropout, data augmentation, and local response normalization, LRN); and the ability to compute on multiple graphics processing units (Graphics Processing Unit, GPU) in parallel, which speeds up computation and shortens the training time, and thus the analysis time per audio segment.
In the embodiment of the present invention, the audio data labeling system may be deployed in a client/server (Client/Server, CS) structure, with the server side deployed in a distributed manner. The client executes S110 and S120: after obtaining the audio segment of the audio data to be labeled, it sends the server a call request for the at least one training model; the server calls the training model according to the request, analyzes the audio segment, and determines its classification label. The CS deployment enables the training models to process audio data in parallel and improves the response speed to client requests.
S130: Label the audio data corresponding to the audio segment with the classification label.
With the labeling method provided by the embodiment of the present invention, an audio segment of the audio data to be labeled is obtained, a trained model analyzes the segment, and the audio data corresponding to the segment is labeled with the classification label, realizing automated labeling of audio data and improving labeling accuracy.
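The final labeling step can be sketched as follows. The patent does not specify the form of the model's output, so the label names and score values below are hypothetical; the sketch only shows turning per-class scores into probabilities and tagging the audio data with the highest-scoring label.

```python
import numpy as np

LABELS = ["Pop", "Rock", "Rap", "Jazz"]   # a subset of the genre labels, for illustration

def softmax(z):
    e = np.exp(z - z.max())               # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 0.5, 0.1, -1.0])  # stand-in model outputs for one segment
probs = softmax(logits)
label = LABELS[int(np.argmax(probs))]     # classification label for the audio data
print(label)                              # Pop
```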
Fig. 2 is a flowchart of a model training method according to an embodiment of the present invention. As shown in Fig. 2, the method may include the following steps:
S210: Obtain, according to at least one classification label, multiple audio data to be trained for each classification label.
In deep learning on audio data, the basic principle for selecting the training set must first be determined: when training the models, the training set is the collection of multiple audio data to be trained, obtained for each classification label according to the at least one classification label.
For example, the at least one classification label may be 20 classification labels, or 20 genre labels: pop (POP), rock (Rock), hip-hop (Rap), jazz (Jazz), blues (Blues), classical (Classical), punk (Punk), metal (Metal), Latin (Latin), reggae (Reggae), new age (New Age), country or folk music (Folk Music or Country Music), electronic dance music (Electronic Dance), children's songs (Child Music), folk instrumental music, folk songs, world (World) music, HiFi music, and other music genres. According to the 20 genres, a training set for each genre is selected from the audio database, each genre contributing multiple audio data to be trained. In the embodiment of the present invention, 1000 songs to be trained may be selected per genre, with manual screening during selection to improve the quality of the music to be trained.
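The training-set bookkeeping above (20 genre labels with about 1000 songs each) amounts to roughly 20,000 labeled clips. A minimal sketch, in which the exact label strings and the clip identifiers are assumptions for illustration (the description names about eighteen genres and says "and so on", so the list is padded to 20 with placeholder names):

```python
# 20 genre labels; names loosely follow the description, last two are placeholders
genres = ["Pop", "Rock", "Rap", "Jazz", "Blues", "Classical", "Punk", "Metal",
          "Latin", "Reggae", "New Age", "Country", "Electronic Dance",
          "Child Music", "Folk Instrumental", "Folk Song", "World", "HiFi",
          "Genre19", "Genre20"]
SONGS_PER_GENRE = 1000

# map each (synthetic) clip id to its genre label
training_set = {f"{g}/{i:04d}": g for g in genres for i in range(SONGS_PER_GENRE)}
print(len(training_set))   # 20000
```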
S220: Obtain the audio segments of the multiple audio data to be trained for each classification label, and extract the feature vectors of the audio segments.
In the embodiment of the present invention, to speed up processing, a 30-second segment is cut from each audio data. Specifically, the segment is sampled at a rate of 16 kHz with a frame length of 512 samples and a frame shift of 16 ms (256 samples), to obtain the frames of the audio segment.
Optionally, in the embodiment of the present invention, the obtained audio segment is processed with a Hamming window. Hamming windowing is a common function processing step and, for brevity, is not described further here.
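As a sketch of the Hamming-window step (an illustrative reconstruction, not code from the patent), each 512-sample frame is multiplied element-wise by a Hamming window before spectral analysis, tapering the frame edges to reduce spectral leakage:

```python
import numpy as np

FRAME_LEN = 512
window = np.hamming(FRAME_LEN)           # w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1))
frames = np.ones((4, FRAME_LEN))         # stand-in for framed audio data
windowed = frames * window               # taper every frame toward its edges

print(window[0])                         # edge value, 0.54 - 0.46 = 0.08
```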
Feature vectors are then extracted from the processed audio segment. Optionally, in the embodiment of the present invention, Mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC) and perceptual linear prediction (PLP) may be used to extract the feature vector of the audio segment. For example: for each preprocessed song, the first 20 MFCC dimensions are extracted, along with 9 dimensions of RASTA-PLP cepstrum and 21 dimensions of RASTA-PLP spectrum; the mean and variance of the resulting MFCC and RASTA-PLP feature vectors are then computed, so that each music segment is represented by a 100-dimensional feature vector.
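The 100-dimensional clip representation described above (20 MFCC + 9 RASTA-PLP cepstrum + 21 RASTA-PLP spectrum dimensions per frame, summarized by per-dimension mean and variance) can be sketched as follows. Computing real MFCC and RASTA-PLP values is outside the scope of this sketch, so random arrays stand in for the per-frame features; only the pooling into a fixed 100-dimensional vector is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 1875                                     # frames in one 30-second clip

mfcc = rng.standard_normal((n_frames, 20))          # first 20 MFCC dims (stand-in)
plp_cepstrum = rng.standard_normal((n_frames, 9))   # 9-dim RASTA-PLP cepstrum (stand-in)
plp_spectrum = rng.standard_normal((n_frames, 21))  # 21-dim RASTA-PLP spectrum (stand-in)

per_frame = np.hstack([mfcc, plp_cepstrum, plp_spectrum])   # (1875, 50)
# per-dimension mean and variance over time -> fixed-length clip vector
clip_vector = np.concatenate([per_frame.mean(axis=0), per_frame.var(axis=0)])
print(clip_vector.shape)   # (100,)
```

The mean/variance pooling is what makes clips of any length comparable: 50 per-frame dimensions become 50 means plus 50 variances, i.e. 100 numbers per segment.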
It should be noted that Mel-frequency cepstral coefficients (MFCC) model the auditory characteristics of the human ear. For musical characteristics, MFCC represents a music signal more accurately than other short-time characteristic parameters, which is why this application uses MFCC. Perceptual linear prediction (PLP) is a robust characteristic parameter that simulates properties of the human auditory system; it is more robust than other speech feature parameters, and the RASTA filtering it passes through smooths the variation between adjacent frames in the short-time spectral analysis. In addition, spectral emphasis is applied to the obtained PLP cepstral parameters to sharpen the spectral peaks. Finally, the mean and variance of the obtained short-time characteristic parameters are computed, to capture the correlation between the frames of each characteristic parameter.
S230: Train on the feature vectors of the multiple audio segments corresponding to the at least one classification label, to obtain at least one training model corresponding to the at least one classification label.
Optionally, in the embodiment of the present invention, a convolutional neural network (Convolutional Neural Network, CNN) is used to train on the feature vectors of the multiple audio segments corresponding to the at least one classification label, obtaining at least one training model corresponding to the at least one classification label. A CNN is a feedforward neural network whose artificial neurons respond to surrounding units within a limited coverage area; it performs outstandingly on image processing. It consists of alternating convolutional layers and pooling layers.
The training method provided by the embodiment of the present invention trains a convolutional neural network model on the extracted feature vectors, successfully reducing manual labeling with its subjective factors.
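As a minimal illustration of the convolutional and pooling layers a CNN is built from (not the patent's actual network; a single random filter stands in for a learned one), one convolution / ReLU / max-pooling pass over a 100-dimensional clip vector looks like:

```python
import numpy as np

def conv1d(x, w, b=0.0):
    """'Valid' 1-D convolution (cross-correlation, as CNN libraries implement it)."""
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)]) + b

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal(100)    # a 100-dim clip feature vector
w = rng.standard_normal(5)      # one 5-tap filter (random stand-in for a learned filter)

h = max_pool(relu(conv1d(x, w)))
print(h.shape)                  # (48,): 100-5+1 = 96 conv outputs, halved by pooling
```

A real network stacks several such conv/pool stages, followed by fully connected layers and a softmax over the classification labels; the weights are learned from the labeled training set rather than drawn at random.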
A model trained with this training method can reach a recognition accuracy of 98.58%, as shown in Fig. 3.
Fig. 3 shows labeling results according to an embodiment of the present invention. Fig. 3(a) shows the labeling result for folk songs; Fig. 3(b) for classical songs; Fig. 3(c) for DJ songs; Fig. 3(d) for children's songs. In Figs. 3(a) to 3(d), the abscissa represents the dimension and the ordinate the corresponding dimension value.
As Figs. 3(a) to 3(d) show, among the labeling results for these styles, only the DJ style in Fig. 3(c) fluctuates strongly; the other three styles present a broadly rising trend. The labeling accuracies for Figs. 3(a), 3(b), 3(c), and 3(d) reach 98.73%, 98.97%, 99.73%, and 98.17%, respectively.
Figs. 1 to 3 above describe in detail the training process of the training models, the labeling process for audio data to be labeled, and the analysis of the results of labeling audio data with the training models trained as in Fig. 2. The system provided by the embodiment of the present invention is described in detail below with reference to Fig. 4.
Fig. 4 is a structural diagram of a system according to an embodiment of the present invention. As shown in Fig. 4, the system may include a receiving unit 310 and a processing unit 320.
The receiving unit 310 is configured to receive audio data to be labeled.
The processing unit 320 is configured to obtain an audio segment of the audio data to be labeled, analyze the segment with at least one pre-trained training model, determine the classification label of the segment, and label the audio data corresponding to the segment with the classification label.
The detailed process is identical to S110, S120, and S130 in Fig. 1; for details, refer to S110, S120, and S130 of Fig. 1. For brevity, it is not repeated here.
By obtaining an audio segment of the audio data to be labeled, analyzing it with a trained model, and labeling the audio data corresponding to the segment with the classification label, the system realizes automated labeling of audio data and improves labeling accuracy.
Optionally, in the embodiment of the present invention, as shown in Fig. 4, the system may further include a training unit 330.
The processing unit 320 obtains, according to at least one classification label, the multiple audio data to be trained for each classification label; it obtains the audio segments of the multiple audio data to be trained for each classification label and extracts the feature vectors of the audio segments.
The training unit 330 is configured to train on the feature vectors of the multiple audio segments corresponding to the at least one classification label, obtaining at least one training model corresponding to the at least one classification label.
During training, the training samples corresponding to each classification label, i.e., the multiple audio data to be trained, must first be obtained according to the classification labels; the music segments of the multiple audio data to be trained are then obtained, and the feature vectors of the audio segments are extracted.
Optionally, in the embodiment of the present invention, the processing unit 320 applies a Hamming window to the obtained audio segments, and the processed audio segments of each classification label are extracted according to the classification labels.
In the embodiment of the present invention, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) may be used to extract the feature vectors of the audio segments.
The training unit 330 then trains on the feature vectors of the multiple audio segments corresponding to the at least one classification label, using a convolutional neural network (CNN).
The detailed process is identical to S210, S220, and S230 of Fig. 2; for details, refer to S210, S220, and S230 of Fig. 2. For brevity, it is not repeated here.
With the system provided by the embodiment of the present invention, an audio segment of the audio data to be labeled is obtained, a trained model analyzes the segment, and the audio data corresponding to the segment is labeled with the classification label, realizing automated labeling of audio data and improving labeling accuracy.
The specific embodiments described above further describe in detail the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the foregoing is merely a specific embodiment of the present invention and is not intended to limit its scope of protection; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (10)
- 1. A method for labeling audio data, characterized in that the method includes: receiving audio data to be labeled; obtaining an audio segment of the audio data to be labeled; analyzing the audio segment with at least one pre-trained training model to determine the classification label of the audio segment; and labeling the audio data corresponding to the audio segment with the classification label.
- 2. The method according to claim 1, characterized in that, before the audio segment is analyzed with the at least one pre-trained training model, the method further includes: obtaining, according to at least one classification label, multiple audio data to be trained for each classification label; obtaining the audio segments of the multiple audio data to be trained for each classification label, and extracting the feature vectors of the audio segments; and training on the feature vectors of the multiple audio segments corresponding to the at least one classification label, to obtain at least one training model corresponding to the at least one classification label.
- 3. The method according to claim 2, characterized in that extracting the feature vector of the audio segment includes: extracting the feature vector of the audio segment using Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP).
- 4. The method according to claim 2, characterized in that, before the feature vector of the audio segment is extracted, the method further includes: applying a Hamming window to the audio segment.
- 5. The method according to any one of claims 2 to 4, characterized in that training on the feature vectors of the multiple audio segments corresponding to the at least one classification label includes: training on the feature vectors of the multiple audio segments corresponding to the at least one classification label using a convolutional neural network (CNN).
- 6. A system, characterized in that the system includes: a receiving unit, configured to receive audio data to be labeled; and a processing unit, configured to obtain an audio segment of the audio data to be labeled, analyze the audio segment with at least one pre-trained training model, and determine the classification label of the audio segment; the processing unit being further configured to label the audio data corresponding to the audio segment with the classification label.
- 7. The system according to claim 6, characterized in that the system further includes a training unit; the processing unit is further configured to obtain, according to at least one classification label, multiple audio data to be trained for each classification label; the processing unit is further configured to obtain the audio segments of the multiple audio data to be trained for each classification label, and to extract the feature vectors of the audio segments; and the training unit is configured to train on the feature vectors of the multiple audio segments corresponding to the at least one classification label, to obtain at least one training model corresponding to the at least one classification label.
- 8. The system according to claim 7, characterized in that the processing unit extracts the feature vector of the audio segment using Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP).
- 9. The system according to claim 7, characterized in that the processing unit is further configured to apply a Hamming window to the audio segment.
- 10. The system according to any one of claims 7 to 9, characterized in that the training unit trains on the feature vectors of the multiple audio segments corresponding to the at least one classification label using a convolutional neural network (CNN).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611247230.3A CN108257614A (en) | 2016-12-29 | 2016-12-29 | The method and its system of audio data mark |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611247230.3A CN108257614A (en) | 2016-12-29 | 2016-12-29 | The method and its system of audio data mark |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108257614A true CN108257614A (en) | 2018-07-06 |
Family
ID=62720722
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611247230.3A Pending CN108257614A (en) | 2016-12-29 | 2016-12-29 | The method and its system of audio data mark |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108257614A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109065075A (en) * | 2018-09-26 | 2018-12-21 | 广州势必可赢网络科技有限公司 | A kind of method of speech processing, device, system and computer readable storage medium |
CN109166593A (en) * | 2018-08-17 | 2019-01-08 | 腾讯音乐娱乐科技(深圳)有限公司 | audio data processing method, device and storage medium |
CN109408660A (en) * | 2018-08-31 | 2019-03-01 | 安徽四创电子股份有限公司 | A method of the music based on audio frequency characteristics is classified automatically |
CN110517671A (en) * | 2019-08-30 | 2019-11-29 | 腾讯音乐娱乐科技(深圳)有限公司 | A kind of appraisal procedure of audio-frequency information, device and storage medium |
CN110584701A (en) * | 2019-08-23 | 2019-12-20 | 杭州智团信息技术有限公司 | Labeling identification system and method for bowel sounds |
CN110689040A (en) * | 2019-08-19 | 2020-01-14 | 广州荔支网络技术有限公司 | Sound classification method based on anchor portrait |
CN110782917A (en) * | 2019-11-01 | 2020-02-11 | 广州美读信息技术有限公司 | Poetry reciting style classification method and system |
CN110930997A (en) * | 2019-12-10 | 2020-03-27 | 四川长虹电器股份有限公司 | Method for labeling audio by using deep learning model |
CN112420070A (en) * | 2019-08-22 | 2021-02-26 | 北京峰趣互联网信息服务有限公司 | Automatic labeling method and device, electronic equipment and computer readable storage medium |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104616664A (en) * | 2015-02-02 | 2015-05-13 | 合肥工业大学 | Method for recognizing audio based on spectrogram significance test |
CN105872855A (en) * | 2016-05-26 | 2016-08-17 | 广州酷狗计算机科技有限公司 | Labeling method and device for video files |
CN105895110A (en) * | 2016-06-30 | 2016-08-24 | 北京奇艺世纪科技有限公司 | Method and device for classifying audio files |
Non-Patent Citations (2)
Title |
---|
Meng Zihou (孟子厚) et al.: *Analysis of the Distinctive Features of Chinese Speech* (《汉语语音区别特征分析》), National Defense Industry Press, 30 June 2016 * |
迷之飞翔 (Mizhi Feixiang): "Caffe deep learning notes by Xue Kaiyu: sound recognition based on convolutional neural networks (CNN)", https://www.docin.com/p-1441307242.html * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109166593A (en) * | 2018-08-17 | 2019-01-08 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio data processing method, device and storage medium |
CN109408660A (en) * | 2018-08-31 | 2019-03-01 | 安徽四创电子股份有限公司 | Automatic music classification method based on audio features |
CN109065075A (en) * | 2018-09-26 | 2018-12-21 | 广州势必可赢网络科技有限公司 | Speech processing method, device, system and computer-readable storage medium |
CN110689040A (en) * | 2019-08-19 | 2020-01-14 | 广州荔支网络技术有限公司 | Sound classification method based on anchor portrait |
CN110689040B (en) * | 2019-08-19 | 2022-10-18 | 广州荔支网络技术有限公司 | Sound classification method based on anchor portrait |
CN112420070A (en) * | 2019-08-22 | 2021-02-26 | 北京峰趣互联网信息服务有限公司 | Automatic labeling method and device, electronic equipment and computer readable storage medium |
CN110584701A (en) * | 2019-08-23 | 2019-12-20 | 杭州智团信息技术有限公司 | Labeling identification system and method for bowel sounds |
CN110517671A (en) * | 2019-08-30 | 2019-11-29 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio information evaluation method, device and storage medium |
CN110782917A (en) * | 2019-11-01 | 2020-02-11 | 广州美读信息技术有限公司 | Poetry reciting style classification method and system |
CN110782917B (en) * | 2019-11-01 | 2022-07-12 | 广州美读信息技术有限公司 | Poetry reciting style classification method and system |
CN110930997A (en) * | 2019-12-10 | 2020-03-27 | 四川长虹电器股份有限公司 | Method for labeling audio by using deep learning model |
CN110930997B (en) * | 2019-12-10 | 2022-08-16 | 四川长虹电器股份有限公司 | Method for labeling audio by using deep learning model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108257614A (en) | Method and system for audio data labeling | |
Chen et al. | The AMG1608 dataset for music emotion recognition | |
Rozgic et al. | Emotion Recognition using Acoustic and Lexical Features. | |
CN105895087A (en) | Voice recognition method and apparatus | |
Tran et al. | Ensemble application of ELM and GPU for real-time multimodal sentiment analysis | |
Dissanayake et al. | Speech emotion recognition ‘in the wild’ using an autoencoder |
Mokhsin et al. | Automatic music emotion classification using artificial neural network based on vocal and instrumental sound timbres. | |
CN107221344A (en) | Speech emotion transfer method |
CN113813609A (en) | Game music style classification method and device, readable medium and electronic equipment | |
Wu et al. | The DKU-LENOVO Systems for the INTERSPEECH 2019 Computational Paralinguistic Challenge. | |
Zhang et al. | Convolutional neural network with spectrogram and perceptual features for speech emotion recognition | |
CN111462774B (en) | Music emotion credible classification method based on deep learning | |
Wang | Research on recognition and classification of folk music based on feature extraction algorithm | |
CN111859008B (en) | Music recommendation method and terminal |
Chaudhary et al. | Automatic music emotion classification using hashtag graph | |
CN111402919A (en) | Game cavity style identification method based on multiple scales and multiple views | |
Unni et al. | A Technique to Detect Music Emotions Based on Machine Learning Classifiers | |
Mezghani et al. | Multifeature speech/music discrimination based on mid-term level statistics and supervised classifiers | |
Matsane et al. | The use of automatic speech recognition in education for identifying attitudes of the speakers | |
Ricard et al. | Bag of MFCC-based Words for Bird Identification. | |
Khanna et al. | Recognizing emotions from human speech | |
Li et al. | Multi-modal emotion recognition based on speech and image | |
Choudhury et al. | Music Genre Classification Using Convolutional Neural Network | |
Kamińska et al. | Polish emotional speech recognition based on the committee of classifiers | |
CN114446323B (en) | Dynamic multi-dimensional music emotion analysis method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20180706 |