CN108205535A - The method and its system of Emotion tagging - Google Patents
The method and its system of Emotion tagging
- Publication number
- CN108205535A CN108205535A CN201611169265.XA CN201611169265A CN108205535A CN 108205535 A CN108205535 A CN 108205535A CN 201611169265 A CN201611169265 A CN 201611169265A CN 108205535 A CN108205535 A CN 108205535A
- Authority
- CN
- China
- Prior art keywords
- audio data
- training
- trained
- marked
- sonograph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 53
- 230000008451 emotion Effects 0.000 title claims abstract description 28
- 238000012549 training Methods 0.000 claims abstract description 137
- 238000012545 processing Methods 0.000 claims description 27
- 241001269238 Data Species 0.000 claims description 20
- 230000009466 transformation Effects 0.000 claims description 9
- 238000001228 spectrum Methods 0.000 claims description 6
- 230000002996 emotional effect Effects 0.000 abstract description 18
- 238000011161 development Methods 0.000 abstract description 3
- 230000008569 process Effects 0.000 description 10
- 238000004458 analytical method Methods 0.000 description 7
- 238000010586 diagram Methods 0.000 description 5
- 238000013527 convolutional neural network Methods 0.000 description 4
- 230000006870 function Effects 0.000 description 4
- 208000019901 Anxiety disease Diseases 0.000 description 3
- 230000036506 anxiety Effects 0.000 description 3
- 238000012512 characterization method Methods 0.000 description 3
- 238000010224 classification analysis Methods 0.000 description 2
- 230000004913 activation Effects 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000019771 cognition Effects 0.000 description 1
- 238000007906 compression Methods 0.000 description 1
- 238000007405 data analysis Methods 0.000 description 1
- 238000007418 data mining Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000012938 design process Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000013508 migration Methods 0.000 description 1
- 230000005012 migration Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 238000004088 simulation Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
- 238000012800 visualization Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/686—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Abstract
The present invention provides an emotion tagging method and system. The method includes: receiving audio data to be tagged; analyzing the audio data to be tagged using at least one pre-trained model, converting the audio data into a spectrogram to determine its emotion label; and tagging the audio data with that emotion label. By passing the audio data to be recognized through a pre-trained model and recognizing it as a spectrogram image, the emotions of audio data are labeled automatically. This realizes automated emotion classification of audio data, reduces learning cost, and shortens the development cycle.
Description
Technical field
The present invention relates to the field of data analysis technologies, and in particular to an emotion tagging method and system.
Background technology
The automatic tagging of music emotion draws on music theory, psychology, signal processing, pattern recognition, data mining, and other fields. The professional span is large, the automatic recognition process is complex, and the design process places excessive demands on technical staff.
Invention content
The present invention provides an emotion tagging method and system that converts the recognition of audio data into image recognition to complete the automatic tagging of song emotions, reducing learning cost and shortening the development cycle.
In a first aspect, an embodiment of the present invention provides an emotion tagging method, including:
receiving audio data to be tagged;
analyzing the audio data to be tagged using at least one pre-trained model, converting the audio data into a spectrogram to determine its emotion label;
tagging the audio data to be tagged with the emotion label.
By passing the audio data to be recognized through a pre-trained model and recognizing it as a spectrogram, the emotions of audio data are labeled automatically, realizing automated emotion classification of audio data, reducing learning cost, and shortening the development cycle.
Optionally, before the audio data to be tagged is analyzed using the at least one pre-trained model and its emotion label is determined, the method further includes:
obtaining, for each of at least one emotion label, a corresponding training set comprising multiple pieces of training audio data;
converting each piece of training audio data in the training set into a spectrogram;
performing model training on the spectrograms converted from the training audio data in the training set to obtain a trained model.
Optionally, converting the training audio data in the training set into spectrograms includes: applying a Fourier transform to each piece of training audio data to obtain the spectrograms of the training audio data.
Optionally, after the training audio data in the training set are converted into spectrograms, the method further includes:
scaling the spectrograms converted from the training audio data;
performing model training on the scaled spectrograms to obtain the trained model.
Optionally, the method performs model training on the spectrograms using an AlexNet model to obtain the trained model.
In a second aspect, an embodiment of the present invention provides a system, including:
a receiving unit, configured to receive audio data to be tagged;
an analysis unit, configured to analyze the audio data to be tagged using at least one pre-trained model, converting the audio data into a spectrogram to determine its emotion label;
a processing unit, configured to tag the audio data to be tagged with the emotion label.
By passing the audio data to be recognized through a pre-trained model and recognizing it as a spectrogram, the emotions of audio data are labeled automatically, realizing automated emotion classification of audio data, reducing learning cost, and shortening the development cycle.
Optionally, the system further includes a training unit.
The processing unit is further configured to obtain, for each of at least one emotion label, a corresponding training set comprising multiple pieces of training audio data.
The processing unit is further configured to convert each piece of training audio data in the training set into a spectrogram.
The training unit is configured to perform model training on the spectrograms converted from the training audio data in the training set to obtain the trained model.
Optionally, the processing unit is specifically configured to apply a Fourier transform to each piece of training audio data to obtain the spectrograms of the training audio data.
Optionally, the processing unit is further configured to scale the spectrograms converted from the training audio data, and the training unit is further configured to perform model training on the scaled spectrograms to obtain the trained model.
Optionally, the training unit performs model training on the spectrograms using an AlexNet model to obtain the trained model.
With the emotion tagging method and system provided by the present invention, the audio data to be recognized is passed through a pre-trained model, converted into a spectrogram, and recognized; the emotions of the audio data are thereby labeled, automated emotion classification of audio data is realized, learning cost is reduced, and the development cycle is shortened.
Description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required in the embodiments are briefly introduced below. Apparently, the drawings described below show merely some embodiments of the present invention, and those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of an emotion tagging method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a model training method provided by an embodiment of the present invention;
Fig. 3 is a flowchart of another model training method provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a system provided by an embodiment of the present invention.
Specific embodiment
The present invention provides an emotion tagging method and system suitable for emotion classification of audio data, for example, songs.
The technical solutions of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of an emotion tagging method provided by an embodiment of the present invention. As shown in Fig. 1, the method may include the following steps:
S110: Receive audio data to be tagged.
The audio data to be tagged is audio data to be recognized, that is, audio data awaiting emotion classification. Emotion classification is performed when there is audio data that needs to be recognized, for example the audio data in an audio database. More specifically, songs in a music library are classified by emotion labels such as "vivifying", "gratifying", "dejected", and "anxious".
S120: Analyze the audio data to be tagged using at least one pre-trained model, and determine the emotion label of the audio data to be recognized.
In the embodiment of the present invention, before the audio data to be tagged is analyzed using the at least one trained model — in other words, before the emotion classification analysis is performed — at least one model needs to be trained; the specific training process is described with reference to Fig. 2.
According to the at least one trained model, the audio data to be tagged is converted into a spectrogram and recognized, its emotion classification is determined, and its emotion label is thereby determined.
During the emotion analysis performed by the at least one trained model, the deployment of the model may adopt either of the following two schemes:
Scheme one: the at least one trained model is deployed in graphics processor (Graphics Processing Unit, GPU) mode to a separate GPU cluster, and the audio data, such as digital music, is migrated to that cluster for tagging.
Scheme two: the at least one trained model is deployed to the CPU cluster where the audio data, such as digital music, resides, and tagging is performed locally; the model deployment mode is CPU mode.
Since an audio tagging task involves a large amount of audio data, data migration is difficult; and although a GPU computes faster, its cost is high. Therefore, the deployment of scheme two is usually used to analyze the audio data to be tagged.
S130: Tag the audio data to be recognized with the emotion label.
By passing the audio data to be recognized through a pre-trained model and recognizing it as a spectrogram, the emotions of audio data are labeled automatically, realizing automated emotion classification of audio data, reducing learning cost, and shortening the development cycle.
Fig. 2 is a flowchart of a model training method provided by an embodiment of the present invention. As shown in Fig. 2, the method may include the following steps:
S210: Obtain, for each of at least one emotion label, a corresponding training set comprising multiple pieces of training audio data.
Using the popular Thayer emotion model, the emotions of audio data are each split into two classes along two axes, energy and stress, which together form multiple emotion classes such as "vivifying", "gratifying", "dejected", and "anxious" — that is, the emotion labels.
For each of the at least one emotion label (for example vivifying, gratifying, dejected, anxious), a predetermined number of training audio files, such as 1000, are extracted from the audio database. The embodiment of the present invention does not limit the number of training audio files extracted per emotion label; it may be determined according to the required training precision. In general, the more training audio data extracted, the higher the precision of the trained model.
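The two-axis split described above can be sketched as a small helper. Note that the axis names, the zero thresholds, and the assignment of the four example labels to quadrants are illustrative assumptions for this sketch — the patent names the Thayer model and the labels but does not fix the mapping.

```python
def thayer_label(energy: float, stress: float) -> str:
    """Map an (energy, stress) pair to one of four emotion labels.

    Values are relative to the axis midpoints; the quadrant-to-label
    mapping below is a hypothetical assignment, not from the patent.
    """
    if energy >= 0 and stress >= 0:
        return "anxious"      # high energy, high stress
    if energy >= 0 and stress < 0:
        return "vivifying"    # high energy, low stress
    if energy < 0 and stress >= 0:
        return "dejected"     # low energy, high stress
    return "gratifying"       # low energy, low stress

if __name__ == "__main__":
    print(thayer_label(0.8, -0.3))  # "vivifying"
```

In practice these per-label buckets would name the folders or index keys from which the 1000-odd training files per label are drawn.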
S220: Convert each piece of training audio data in the training set into a spectrogram.
The sound corresponding to audio data mainly involves three dimensions: time, frequency, and energy. Common sound visualizations include the waveform, the spectrum, and the spectrogram. The waveform characterizes the time-domain information of the audio and loses frequency information; the spectrum characterizes the frequency-domain information but loses the time dimension; the spectrogram characterizes time, frequency, and energy simultaneously. Since the lossy compression of audio data involves all three dimensions of sound, to ensure the completeness of the information representation, the present application uses the spectrograms of the training audio data as the input to model training.
In the embodiment of the present invention, the training audio data may be converted into spectrograms by a Fourier transform — specifically, by a short-time Fourier transform (STFT). Compared with the ordinary Fourier transform, the STFT introduces a window function and can provide information on how the frequency content of the signal changes over time. In the end, each piece of training audio data is converted into a spectrogram whose horizontal axis represents time, whose vertical axis represents frequency, and whose intensity represents energy.
It should be noted that the window function serves to reduce spectral energy leakage when the training audio data is Fourier-transformed. Different truncation functions may be used to truncate the signal; in the embodiment of the present invention such a truncation function may be called a window function, or simply a "window".
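The STFT-based conversion above can be sketched in a few lines of NumPy. The Hann window, frame length of 512 samples, and hop of 256 are illustrative choices — the patent does not fix these parameters.

```python
# Minimal sketch of converting audio into a log-magnitude spectrogram
# via a short-time Fourier transform, using only NumPy.
import numpy as np

def spectrogram(signal: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    """Return a (time-frames x frequency-bins) log-magnitude spectrogram.

    The Hann window reduces spectral leakage when each frame is
    Fourier-transformed, as the description notes.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    mag = np.abs(np.fft.rfft(frames, axis=1))  # energy per time/frequency cell
    return 20.0 * np.log10(mag + 1e-10)        # dB scale for display

# Example: a 1 kHz tone sampled at 16 kHz for one second.
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
print(spec.shape)  # (frames, frame_len // 2 + 1)
```

Rendering `spec` as an image (time on the horizontal axis, frequency on the vertical axis, dB magnitude as intensity) yields exactly the kind of spectrogram described above; the energy of the 1 kHz tone appears as a bright horizontal line.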
S230: Perform model training on the spectrograms converted from the training audio data in the training set to obtain a trained model.
In the embodiment of the present invention, a convolutional neural network (Convolutional Neural Network, CNN) may be used for model training. A convolutional neural network is a feedforward neural network that approximately simulates the human visual cognition process and is widely used in image processing.
Optionally, the present invention adopts AlexNet as the training model, because the model has about 60 million parameters and strong expressive capacity, so it easily learns rich features. In addition, AlexNet uses techniques such as ReLU, LRN, and Dropout, which effectively alleviate problems such as activation-function saturation and model over-fitting while improving computational performance. The training process is accelerated with CUDA and graphics processors (Graphics Processing Unit, GPU) to shorten the training time.
The model trained by the training method of the embodiment of the present invention analyzes the audio data to be tagged and completes the emotion classification of the audio data, thereby realizing S120 in Fig. 1.
Optionally, in the embodiment of the present invention, as shown in Fig. 3, after the training audio data in the training set are converted into spectrograms, the method may further include:
S240: Scale the spectrograms converted from the training audio data.
In practical applications, the accuracy of the trained model and the speed at which the trained model classifies the emotions of the audio data to be tagged must be considered together. In the embodiment of the present invention, image scaling may be used to increase the speed at which the trained model classifies the audio data to be tagged.
Optionally, in the embodiment of the present invention, the scaling strategy may use bilinear interpolation, which balances pixel continuity and computational complexity. Meanwhile, to retain as much audio information as possible, the image should not be compressed too much. For example, considering that the song emotion analysis scenario has modest real-time requirements, an image size of 256*256 may be used to maximize the expressive power of the audio data and thus ensure high recognition accuracy.
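The bilinear scaling step can be sketched as follows: each output pixel is a weighted average of the four nearest input pixels. A real pipeline would typically use a library resizer (e.g. PIL or OpenCV); this NumPy sketch only illustrates the strategy the text describes.

```python
# Minimal NumPy sketch of bilinear-interpolation image scaling.
import numpy as np

def bilinear_resize(img: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    h, w = img.shape
    # Map each output coordinate back into the input grid.
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    # Blend horizontally along the top and bottom rows, then vertically.
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

# Example: shrink a 512x512 spectrogram image to the 256*256 size
# mentioned above (the 512x512 input size is an assumed example).
spec = np.arange(512 * 512, dtype=float).reshape(512, 512)
small = bilinear_resize(spec, 256, 256)
print(small.shape)  # (256, 256)
```

Averaging neighboring pixels rather than simply dropping them is what preserves pixel continuity at modest computational cost, as the text notes.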
S250: Perform model training on the scaled spectrograms to obtain the trained model.
The processing of this method/step is the same as that of method/step S230 in Fig. 2 and, for brevity of description, is not repeated here.
Fig. 1 to Fig. 3 above describe in detail the training process of the model and the emotion tagging process of the audio data to be tagged. The system provided by an embodiment of the present invention is described in detail below with reference to Fig. 4.
Fig. 4 shows a system provided by an embodiment of the present invention. The system may include: a receiving unit 410, an analysis unit 420, and a processing unit 430.
The receiving unit 410 is configured to receive audio data to be tagged.
The analysis unit 420 is configured to analyze the audio data to be tagged using at least one pre-trained model and determine the emotion label of the audio data to be tagged.
The processing unit 430 is configured to tag the audio data to be tagged with the emotion label.
Specifically, when audio data to be tagged is received and emotion classification and tagging are required, emotion analysis is performed on the audio data to be tagged by the at least one pre-trained model: the audio data to be tagged is converted into a spectrogram, the spectrogram is recognized, and the emotion analysis of the audio data to be tagged is thereby completed and its result tagged.
It should be noted that the at least one pre-trained model used in the embodiment of the present invention may be the at least one model trained by S210, S220, and S230 in Fig. 2, or by S210, S220, S240, and S250 in Fig. 3.
Optionally, in the embodiment of the present invention, as shown in Fig. 4, the system may further include a training unit 440.
The processing unit 430 is further configured to obtain, for each of at least one emotion label, a corresponding training set comprising multiple pieces of training audio data.
The processing unit 430 is further configured to convert each piece of training audio data in the training set into a spectrogram.
The training unit 440 is configured to perform model training on the spectrograms converted from the training audio data in the training set to obtain the trained model.
In the embodiment of the present invention, the processing unit 430 and the training unit 440 may carry out the model-training method/steps S210, S220, and S230 of Fig. 2, or S210, S220, S240, and S250 of Fig. 3; for details, refer to the descriptions of the methods/steps of Fig. 2 and Fig. 3, which, for brevity of description, are not repeated here.
Optionally, in the embodiment of the present invention, the processing unit 430 is specifically configured to apply a Fourier transform to each piece of training audio data to obtain the spectrograms of the training audio data.
It should be noted that this processing by the processing unit 430 is the same as the description of S220 in Fig. 2 and, for brevity of description, is not repeated here.
Optionally, in the embodiment of the present invention, the processing unit 430 is further configured to scale the spectrograms converted from the training audio data, and the training unit 440 is further configured to perform model training on the scaled spectrograms to obtain the trained model.
This process is the same as the description of S240 and S250 in Fig. 3 and, for brevity of description, is not repeated here. With the method provided by the embodiment of the present invention, both the accuracy with which the trained model classifies the emotions of the audio data to be tagged and the speed of that emotion classification analysis can be improved.
Optionally, in the embodiment of the present invention, the training unit 440 may perform model training on the spectrograms using an AlexNet model to obtain the trained model, thereby improving computational performance. In the embodiment of the present invention, the training may also use CUDA+GPU to shorten the training time.
The specific embodiments above further describe in detail the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the above is merely a specific embodiment of the present invention and is not intended to limit the protection scope of the present invention; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
- 1. An emotion tagging method, characterized in that the method includes: receiving audio data to be tagged; analyzing the audio data to be tagged using at least one pre-trained model to determine the emotion label of the audio data to be tagged; and tagging the audio data to be tagged with the emotion label.
- 2. The method according to claim 1, characterized in that before the audio data to be tagged is analyzed using the at least one pre-trained model and the emotion label of the audio data to be tagged is determined, the method further includes: obtaining, for each of at least one emotion label, a corresponding training set comprising multiple pieces of training audio data; converting each piece of training audio data in the training set into a spectrogram; and performing model training on the spectrograms converted from the training audio data in the training set to obtain a trained model.
- 3. The method according to claim 2, characterized in that converting the training audio data in the training set into spectrograms includes: applying a Fourier transform to each piece of training audio data to obtain the spectrograms of the training audio data.
- 4. The method according to claim 2 or 3, characterized in that after the training audio data in the training set are converted into spectrograms, the method further includes: scaling the spectrograms converted from the training audio data; and performing model training on the scaled spectrograms to obtain the trained model.
- 5. The method according to claim 2, characterized in that the method includes performing model training on the spectrograms using an AlexNet model to obtain the trained model.
- 6. A system, characterized in that the system comprises: a receiving unit, configured to receive audio data to be tagged; an analysis unit, configured to analyze the audio data to be tagged using at least one pre-trained model and determine the emotion label of the audio data to be tagged; and a processing unit, configured to tag the audio data to be tagged with the emotion label.
- 7. The system according to claim 6, characterized in that the system further comprises a training unit; the processing unit is further configured to obtain, for each of at least one emotion label, a corresponding training set comprising multiple pieces of training audio data; the processing unit is further configured to convert each piece of training audio data in the training set into a spectrogram; and the training unit is configured to perform model training on the spectrograms converted from the training audio data in the training set to obtain the trained model.
- 8. The system according to claim 7, characterized in that the processing unit is specifically configured to apply a Fourier transform to each piece of training audio data to obtain the spectrograms of the training audio data.
- 9. The system according to claim 7 or 8, characterized in that the processing unit is further configured to scale the spectrograms converted from the training audio data; and the training unit is further configured to perform model training on the scaled spectrograms to obtain the trained model.
- 10. The system according to claim 7, characterized in that the training unit performs model training on the spectrograms using an AlexNet model to obtain the trained model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611169265.XA CN108205535A (en) | 2016-12-16 | 2016-12-16 | The method and its system of Emotion tagging |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611169265.XA CN108205535A (en) | 2016-12-16 | 2016-12-16 | The method and its system of Emotion tagging |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108205535A true CN108205535A (en) | 2018-06-26 |
Family
ID=62601677
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611169265.XA Pending CN108205535A (en) | 2016-12-16 | 2016-12-16 | The method and its system of Emotion tagging |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108205535A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109036465A (en) * | 2018-06-28 | 2018-12-18 | 南京邮电大学 | Speech-emotion recognition method |
CN109800720A (en) * | 2019-01-23 | 2019-05-24 | 平安科技(深圳)有限公司 | Emotion identification model training method, Emotion identification method, apparatus, equipment and storage medium |
CN112233700A (en) * | 2020-10-09 | 2021-01-15 | 平安科技(深圳)有限公司 | Audio-based user state identification method and device and storage medium |
CN112420070A (en) * | 2019-08-22 | 2021-02-26 | 北京峰趣互联网信息服务有限公司 | Automatic labeling method and device, electronic equipment and computer readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101201980A (en) * | 2007-12-19 | 2008-06-18 | 北京交通大学 | Remote Chinese language teaching system based on voice affection identification |
CN101599271A (en) * | 2009-07-07 | 2009-12-09 | 华中科技大学 | A kind of recognition methods of digital music emotion |
CN104616664A (en) * | 2015-02-02 | 2015-05-13 | 合肥工业大学 | Method for recognizing audio based on spectrogram significance test |
CN104882144A (en) * | 2015-05-06 | 2015-09-02 | 福州大学 | Animal voice identification method based on double sound spectrogram characteristics |
-
2016
- 2016-12-16 CN CN201611169265.XA patent/CN108205535A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101201980A (en) * | 2007-12-19 | 2008-06-18 | 北京交通大学 | Remote Chinese language teaching system based on voice affection identification |
CN101599271A (en) * | 2009-07-07 | 2009-12-09 | 华中科技大学 | A kind of recognition methods of digital music emotion |
CN104616664A (en) * | 2015-02-02 | 2015-05-13 | 合肥工业大学 | Method for recognizing audio based on spectrogram significance test |
CN104882144A (en) * | 2015-05-06 | 2015-09-02 | 福州大学 | Animal voice identification method based on double sound spectrogram characteristics |
Non-Patent Citations (1)
Title |
---|
W.Q.ZHENG ET AL: "An experimental study of speech emotion recognition based on deep convolutional neural networks", 《2015 INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109036465A (en) * | 2018-06-28 | 2018-12-18 | Nanjing University of Posts and Telecommunications | Speech emotion recognition method |
CN109036465B (en) * | 2018-06-28 | 2021-05-11 | Nanjing University of Posts and Telecommunications | Speech emotion recognition method |
CN109800720A (en) * | 2019-01-23 | 2019-05-24 | Ping An Technology (Shenzhen) Co., Ltd. | Emotion recognition model training method, emotion recognition method, apparatus, device and storage medium |
CN109800720B (en) * | 2019-01-23 | 2023-12-22 | Ping An Technology (Shenzhen) Co., Ltd. | Emotion recognition model training method, emotion recognition method, apparatus, device and storage medium |
CN112420070A (en) * | 2019-08-22 | 2021-02-26 | Beijing Fengqu Internet Information Service Co., Ltd. | Automatic labeling method and device, electronic equipment and computer-readable storage medium |
CN112233700A (en) * | 2020-10-09 | 2021-01-15 | Ping An Technology (Shenzhen) Co., Ltd. | Audio-based user state identification method and device, and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Coffey et al. | DeepSqueak: a deep learning-based system for detection and analysis of ultrasonic vocalizations | |
US10565983B2 (en) | Artificial intelligence-based acoustic model training method and apparatus, device and storage medium | |
US20200097820A1 (en) | Method and apparatus for classifying class, to which sentence belongs, using deep neural network | |
CN108205535A (en) | Emotion tagging method and system |
CN108509411A (en) | Semantic analysis method and device |
CN108763326A (en) | Method for building a sentiment analysis model based on feature-diversified convolutional neural networks |
CN105373529A (en) | Intelligent word segmentation method based on hidden Markov models |
CN109800720A (en) | Emotion recognition model training method, emotion recognition method, apparatus, device and storage medium |
ATE386989T1 (en) | Method and apparatus for decoding handwritten characters |
CN106886580A (en) | Deep-learning-based image sentiment polarity analysis method |
CN106816147A (en) | Speech recognition system based on a binary neural network acoustic model |
CN108875045 (en) | Method and system for executing a machine learning process for text classification |
CN110010136A (en) | Training and text analysis method, apparatus, medium and device for a prosody prediction model |
CN115294427A (en) | Stylized image description generation method based on transfer learning | |
CN110610698A (en) | Voice labeling method and device | |
CN108090099A (en) | Text processing method and device |
CN109522413B (en) | Construction method and device of medical term library for guided medical examination | |
CN103345623B (en) | Activity recognition method based on robust relative priority |
CN111883101B (en) | Model training and speech synthesis method, device, equipment and medium | |
Helaly et al. | Deep convolution neural network implementation for emotion recognition system | |
CN116959492A (en) | Dance motion determination method and device, electronic equipment and storage medium | |
CN111402919A (en) | Game cavity style identification method based on multiple scales and multiple views | |
CN113591472B (en) | Lyric generation method, lyric generation model training method and device and electronic equipment | |
CN113626614B (en) | Method, device, equipment and storage medium for constructing information text generation model | |
CN112347150B (en) | Method and device for labeling academic label of student and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 2018-06-26 |