CN110008372A - Model generating method, audio-frequency processing method, device, terminal and storage medium - Google Patents
Model generating method, audio-frequency processing method, device, terminal and storage medium
- Publication number
- CN110008372A CN110008372A CN201910134036.1A CN201910134036A CN110008372A CN 110008372 A CN110008372 A CN 110008372A CN 201910134036 A CN201910134036 A CN 201910134036A CN 110008372 A CN110008372 A CN 110008372A
- Authority
- CN
- China
- Prior art keywords
- audio
- sample
- mark
- audio data
- feature vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the present invention provide a model generation method, an audio processing method, a device, a terminal, and a computer-readable storage medium. The model generation method includes: annotating sample audio data with a music emotion to obtain annotated audio samples; cutting the annotated audio samples into multiple annotated audio data segments of a preset length; processing each annotated audio data segment into annotated sample audio-segment feature vectors of multiple preset dimensions to serve as an annotated sample set; updating the music-emotion label of each annotated sample audio-segment feature vector in the annotated sample set to obtain an annotated sample audio training set; and training on the annotated sample audio training set with a deep-learning method to obtain a first music-emotion annotation model. Target audio data can then be input into the first music-emotion annotation model to obtain its music-emotion label.
Description
Technical field
The present invention relates to the field of network technology, and in particular to a model generation method, an audio processing method, a device, a terminal, and a computer-readable storage medium.
Background technique
With the spread and development of video and audio networks, many video and audio websites have emerged, making it convenient for users to search for videos or audio of interest and greatly enriching their lives. Music is an art that reflects the emotions of real human life, and short video is a popular form of expression.
At present, video and audio websites store large amounts of user-made and officially produced audio and video data for users to use, and the content of a video is often matched to the emotion of its music in order to express feeling. As audio and video grow in popularity, the amount of data increases daily, and an efficient algorithm that automatically analyses music emotion is urgently needed to structure this data. Annotating the music emotion of audio, or of video containing music, according to such an emotional classification is therefore an essential element in structuring audio and video data.
In the prior art, the music emotion of content on audio and video websites is usually annotated manually, which is inefficient and costly. How to annotate the music emotion of the audio and video data stored on audio and video websites efficiently and accurately is therefore a technical problem that currently needs to be solved.
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide a model generation method, an audio processing method, a device, a terminal, and a computer-readable storage medium, so as to solve the technical problem of annotating the music emotion of music-related video data or audio data stored on video websites.
To solve the above-mentioned problems, the present invention is achieved through the following technical solutions:
A first aspect provides a model generation method, the method comprising:
annotating the music emotion of sample audio data to obtain annotated audio samples;
cutting the annotated audio samples into multiple annotated audio data segments of a preset length;
processing each annotated audio data segment into annotated sample audio-segment feature vectors of multiple preset dimensions to serve as an annotated sample set;
updating the music-emotion label of each annotated sample audio-segment feature vector in the annotated sample set to obtain an annotated sample audio training set;
training on the annotated sample audio training set with a deep-learning method to obtain a first music-emotion annotation model.
A second aspect provides an audio processing method, the method comprising:
receiving a request to annotate target audio data with a music emotion;
according to the annotation request, annotating the music emotion of the target audio data using a music-emotion annotation model.
A third aspect provides a model generation device, the device comprising:
an annotated-audio-sample generation module, configured to annotate the music emotion of sample audio data to obtain annotated audio samples;
an annotated-audio-data-segment acquisition module, configured to cut the annotated audio samples into multiple annotated audio data segments of a preset length;
an annotated-sample-set determination module, configured to process each annotated audio data segment into annotated sample audio-segment feature vectors of multiple preset dimensions to serve as an annotated sample set;
an annotated-sample-audio-training-set generation module, configured to update the music-emotion label of each annotated sample audio-segment feature vector in the annotated sample set to obtain an annotated sample audio training set;
a first-music-emotion-annotation-model training module, configured to train on the annotated sample audio training set with a deep-learning method to obtain a first music-emotion annotation model.
A fourth aspect provides an audio processing device, the device comprising:
a music-emotion annotation request receiving module, configured to receive a request to annotate target audio data with a music emotion;
a music-emotion annotation module, configured to annotate, according to the annotation request and using a music-emotion annotation model, the music emotion of the target audio data.
A fifth aspect provides a terminal, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, where the computer program, when executed by the processor, realises the steps of the above model generation method or the steps of the above audio processing method.
A sixth aspect provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, realises the steps of the above model generation method or the steps of the above audio processing method.
Compared with the prior art, the embodiments of the present invention include the following advantages:
In the embodiments of the present invention, the audio data on an audio/video website is annotated with preset music-emotion labels, preprocessed, cut into segments, and processed into feature vectors of a preset dimension; the music-emotion labels are then updated to obtain an annotated sample audio training set, which is trained with a deep-learning method to obtain a first music-emotion annotation model. Target audio data is then input into this first music-emotion annotation model, which outputs its music emotion. The music emotions are preset, for example pop music, hip-hop music, rock music, rhythm and blues, and so on. In this way, the purpose of annotating audio and video data with music-emotion labels is realised for all preset music emotions, and the annotation is performed accurately and efficiently for all kinds of audio and video data; the beneficial effect is the efficient and accurate music-emotion annotation of audio and video data.
The above is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the contents of the specification, and in order to make the above and other objects, features and advantages of the present invention more comprehensible, specific embodiments of the present invention are set out below.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present application.
Brief description of the drawings
Fig. 1 is a flow chart of a model generation method provided by an embodiment of the present invention;
Fig. 1A is a schematic diagram of an audio signal provided by an embodiment of the present invention;
Fig. 1B is a schematic diagram of audio-data windowing provided by an embodiment of the present invention;
Fig. 2 is a flow chart of an audio processing method provided by an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of a model generation device provided by an embodiment of the present invention;
Fig. 4 is a structural schematic diagram of an audio processing device provided by an embodiment of the present invention.
Specific embodiment
In order to make the above objects, features and advantages of the present invention clearer and more comprehensible, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Referring to Fig. 1, a flow chart of a model generation method provided by an embodiment of the present invention, the method specifically includes:
Step 101: annotate the music emotion of sample audio data to obtain annotated audio samples.
In the embodiment of the present invention, the sample audio data is extracted from the audio and video data sets stored on the back end of an audio/video website. Such data sets are generally stored with a time mark, for example the set of audio and video data freely uploaded by users and officially produced and uploaded in the first quarter of a year; the audio data in these sets is extracted as the audio samples.
For example, audio data extracted from video data may serve as the audio samples, audio data may be used directly as the audio samples, or the audio data extracted from video data may be combined with natively stored audio data sets to form the audio samples.
The specific method of extracting the audio data from video data is described as follows.
The video data is read through the packet-reading method RTMP_ReadPacket of the real-time messaging protocol (RTMP, Real Time Messaging Protocol) to obtain the video and the corresponding audio data:
1. obtain the audio sync packet in the video data;
2. parse the audio header decoding data AACDecoderSpecificInfo and the audio data configuration information AudioSpecificConfig in the audio sync packet, where AudioSpecificConfig is used to generate the ADTS (including the sample rate, channel number, and frame length of the audio data);
3. obtain the other audio packets in the video data and parse out the raw audio data (i.e. the ES stream);
4. package the AAC ES stream into the ADTS format through the audio data header of the AAC decoder, i.e. add a 7-byte ADTS header before the AAC ES stream so that the audio data content can be parsed.
In this way, by parsing the audio data packets extracted from the video data, the specific content of the audio data is parsed out, that is, the audio content in the video data has been extracted.
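The 7-byte ADTS header mentioned in step 4 can be sketched as follows. This is a minimal illustration of the conventional header layout (syncword, profile, sampling-frequency index, channel configuration, 13-bit frame length), not the patent's implementation; the field defaults are assumptions.

```python
def build_adts_header(frame_len, profile=1, sf_index=4, channels=2):
    """Build the 7-byte ADTS header prepended to each AAC ES frame.

    frame_len is the total frame length *including* the 7 header bytes.
    profile=1 (AAC-LC) and sf_index=4 (44.1 kHz) are illustrative defaults.
    """
    hdr = bytearray(7)
    hdr[0] = 0xFF                                  # syncword 0xFFF (high 8 bits)
    hdr[1] = 0xF1                                  # syncword low bits, MPEG-4, no CRC
    hdr[2] = (profile << 6) | (sf_index << 2) | (channels >> 2)
    hdr[3] = ((channels & 0x03) << 6) | ((frame_len >> 11) & 0x03)
    hdr[4] = (frame_len >> 3) & 0xFF               # frame-length bits 10..3
    hdr[5] = ((frame_len & 0x07) << 5) | 0x1F      # frame-length bits 2..0 + fullness
    hdr[6] = 0xFC                                  # buffer fullness low bits
    return bytes(hdr)

def parse_adts_frame_length(hdr):
    """Recover the 13-bit frame length from an ADTS header."""
    return ((hdr[3] & 0x03) << 11) | (hdr[4] << 3) | (hdr[5] >> 5)
```

A decoder scanning the stream locates the 0xFFF syncword and then uses the recovered frame length to step from frame to frame.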
It should be understood that the method of extracting the audio data from the video data is not limited to the foregoing description, and the embodiment of the present invention places no restriction on the extraction method.
After the audio samples are obtained by the above method, the annotated audio samples are obtained by annotating the audio samples with seven predefined music-emotion labels (e.g., happy (Happy), tender (Tender), exciting (Exciting), funny (Funny), sad (Sad), scary (Scary), and angry (Angry)).
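A minimal sketch of such a label table, assuming arbitrary integer ids for the seven emotions listed above (the ids are an illustration, not part of the patent):

```python
# The seven preset music-emotion labels named in the text; integer ids are
# assumed here purely for training convenience.
EMOTION_LABELS = {
    "Happy": 0, "Tender": 1, "Exciting": 2, "Funny": 3,
    "Sad": 4, "Scary": 5, "Angry": 6,
}

def annotate(sample_id, emotion):
    """Attach one of the preset music-emotion labels to an audio sample."""
    if emotion not in EMOTION_LABELS:
        raise ValueError(f"unknown music emotion: {emotion}")
    return {"sample": sample_id, "emotion": emotion, "label": EMOTION_LABELS[emotion]}
```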
Step 102: cut the annotated audio samples into multiple annotated audio data segments of a preset length.
In practical applications, non-uniform lengths of the annotated audio samples will cause data errors during batch processing, so the audio data needs to be cut to finally obtain training samples that meet a preset standard. For example, a total of 16,955 samples may be obtained, averaging 2,422 per class, each sample being about 10 seconds long.
Specifically, the annotated audio samples are split to obtain N annotated audio data segments of a preset size. The annotated audio samples may be imported into a preset audio cutter; the duration of the cut audio data segments can be set in advance, and the audio cutter can perform batch cutting according to that duration.
Of course, the embodiment of the present invention places no restriction on the type of audio cutter.
It should be understood that different models have different preset requirements for training samples, and the embodiment of the present invention therefore places no restriction on the specific length of the audio segments.
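The cutting into uniform preset-length segments can be sketched as follows; dropping the short trailing remainder is one possible policy, assumed here only for illustration:

```python
def cut_segments(samples, segment_len):
    """Cut a 1-D sequence of audio samples into equal, non-overlapping
    segments of segment_len samples, dropping any short remainder,
    so that every training sample has the same preset length."""
    n = len(samples) // segment_len
    return [samples[i * segment_len:(i + 1) * segment_len] for i in range(n)]
```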
Step 103: process each annotated audio data segment into annotated sample audio-segment feature vectors of multiple preset dimensions to serve as the annotated sample set.
Preferably, step 103 further comprises:
Sub-step 1031: perform framing on each annotated audio data segment to obtain multiple framed annotated audio data segments of each annotated audio data segment.
Specifically, as shown in Fig. 1A, a speech signal is non-stationary on a macroscopic scale but stationary on a microscopic scale, exhibiting short-term stationarity (for the speech signal selected in the box in the figure, the signal can be considered approximately unchanged within 10-30 ms). The signal can therefore be divided into short segments for processing, each segment being called a frame (chunk). Of course, the duration of each segment is not limited to the 10-30 ms described above, and the embodiment of the present invention places no restriction on the frame duration.
Each annotated audio data segment is therefore further divided into smaller, frame-sized framed audio data.
Sub-step 1032: multiply each framed annotated audio data segment by a windowing function to obtain the annotated windowed audio data segment of each framed annotated audio data segment.
Specifically, during framing each frame repeats a part of its neighbour: the tail of the previous frame and the head of the current frame each take a portion that overlaps, after which windowing is applied. In this way the overall speech signal does not suffer excessive attenuation at the two ends of each frame as a result of windowing; the overlap between frames achieved during framing makes the audio signal after windowing more continuous.
The framed audio data obtained above is then windowed: the original audio signal, shown in the left part of Fig. 1B, is multiplied by the windowing function shown in the middle part of Fig. 1B to obtain the log spectrum of each frame of audio data in the frequency domain, shown in the right part of Fig. 1B. A signal that was originally non-periodic (such as the framed audio data) thus exhibits some characteristics of a periodic function, and the result is determined as the annotated windowed audio data segment of the framed audio data.
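The framing-with-overlap and windowing described above can be sketched as follows; the Hamming window and the specific frame/hop sizes are assumptions for illustration (10-30 ms frames are typical):

```python
import numpy as np

def frame_and_window(signal, frame_len, hop_len):
    """Split a signal into overlapping frames (hop_len < frame_len, so the
    tail of one frame overlaps the head of the next) and multiply each
    frame by a Hamming window to suppress edge discontinuities."""
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frames.append(signal[start:start + frame_len] * window)
    return np.array(frames)
```

With a 50% hop, every sample (away from the ends) falls inside two frames, which is exactly the overlap that keeps the windowed signal continuous.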
Sub-step 1033: perform a Mel transform on each annotated windowed audio data segment to obtain the annotated Mel spectrum data of each annotated audio data segment.
Further, in order to display the sound characteristics of the windowed audio data obtained after framing and windowing intuitively, a Mel transform needs to be performed to convert the audio data into annotated Mel spectrum data. The unit of frequency is the hertz (Hz), and the range of frequencies the human ear can hear is 20-20,000 Hz, but the ear does not perceive the Hz scale linearly. For example, if a person has adapted to a tone of 1,000 Hz and the pitch frequency is raised to 2,000 Hz, the ear perceives only a slight increase in pitch, nowhere near a doubling of frequency. If the ordinary frequency scale is converted to the Mel frequency scale, however, the ear's perception of frequency becomes a linear relationship; that is, on the Mel scale, if the Mel frequencies of two pieces of speech differ by a factor of two, the pitch the human ear perceives also differs by roughly a factor of two. This has the beneficial effect of making the audio data visualisable.
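The Hz-to-mel mapping behind this discussion is commonly given by m = 2595 · log10(1 + f / 700); the specific formula is a standard convention, not stated in the text. A sketch:

```python
import math

def hz_to_mel(f_hz):
    """Standard Hz-to-mel mapping: equal steps on the mel scale are
    perceived as roughly equal pitch intervals."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Note that hz_to_mel(2000) is well below 2 * hz_to_mel(1000), which is exactly the point made above: doubling the frequency does not double the perceived pitch.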
Sub-step 1034: convert each piece of annotated Mel spectrum data into a feature vector of a preset dimension to obtain the annotated sample audio-segment feature vector of each piece of annotated Mel spectrum data.
In this step, the Mel spectrum image data above is converted into feature vectors that a machine can identify. A common model for converting image data into machine-readable feature vectors is the BVLC GoogLeNet model; of course, in practical applications the conversion is not limited to the foregoing description, and the embodiment of the present invention places no restriction on it.
Preferably, step 1034 further comprises:
Sub-step 10341: determine the Mel spectrum data corresponding to each frame of audio data in the annotated Mel spectrum data as the sample framed Mel spectrum data.
In this step, the Mel spectrum data figure corresponding to each frame of audio in the annotated audio data segments obtained above is extracted and determined as the framed Mel spectrum data.
Sub-step 10342: convert the sample framed Mel spectrum data into sample framed audio feature vectors.
In this step, each piece of framed Mel spectrogram data is converted into a feature vector.
Specifically, the framed Mel spectrum image data above is converted into framed audio feature vectors by an image-feature-vector conversion model. A well-known conversion model is the BVLC GoogLeNet model, a 22-layer deep convolutional network that can detect the feature vectors of 1,000 different image types.
Of course, the image-feature conversion method is not limited to the foregoing description, and the embodiment of the present invention places no restriction on it.
Sub-step 10343: splice the sample framed audio feature vectors of a preset number of frames to obtain annotated sample audio-segment feature vectors of a preset dimension.
In this step, after the sample framed audio feature vectors of the framed Mel spectrum data are obtained in step 10342, multiple sample framed audio feature vectors are merged into one annotated sample audio-segment feature vector of a preset dimension. For example, each framed audio feature vector is a 128-dimensional feature vector for one second of audio; for audio processing, the information contained in one second is not enough to characterise the concrete type of the audio data, so context-adjacent framed audio feature vectors are merged. That is, the feature vectors corresponding to 3 seconds of audio data, i.e. 3 framed audio feature vectors, are spliced into a feature vector of 128 * 3 = 384 dimensions.
Of course, the preset dimension is not necessarily the 384 dimensions mentioned above; it may also be the preset-dimension audio feature vector obtained by combining five, or ten, framed audio feature vectors. The setting of the preset dimension depends mainly on whether the audio data contains enough information for subsequent processing, so the embodiment of the present invention places no restriction on the specific value of the preset dimension.
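The splicing of consecutive per-frame vectors into one wider segment vector can be sketched as follows (3 × 128 → 384, as in the example above; the counts are configurable):

```python
import numpy as np

def splice_frames(frame_vectors, context=3):
    """Concatenate `context` consecutive per-frame feature vectors
    (e.g. 3 x 128-d) into one segment vector (e.g. 384-d), giving each
    training sample temporal context. Leftover frames that do not fill
    a whole group are dropped."""
    n = len(frame_vectors) // context
    return np.array([np.concatenate(frame_vectors[i * context:(i + 1) * context])
                     for i in range(n)])
```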
Sub-step 1035: assemble the annotated sample audio-segment feature vectors into the annotated sample set.
In this step, all the annotated sample audio-segment feature vectors above are stored as one set, which serves as the annotated sample set.
Step 104: update the music-emotion label of each annotated sample audio-segment feature vector in the annotated sample set to obtain the annotated sample audio training set.
In this step, note that directly downloaded audio data typically carries strong noise (data noise and label noise respectively); if a music-emotion annotation model were trained on it directly, its accuracy would be low. The annotated sample audio training set is therefore further cleaned: a music-emotion model is trained with the annotated sample audio training set, and the label noise of the samples of each class is then cleaned with this model, finally yielding a high-quality music-emotion data set. The specific steps are as follows.
Preferably, step 104 further comprises:
Sub-step 1041: extract annotated sample audio-segment feature vectors from the annotated sample set according to a preset proportion and determine them as the training sample feature set.
In this step, suppose the annotated sample set contains 16,955 annotated sample audio-segment feature vectors in total, averaging 2,422 per music-emotion class, each sample being about 10 seconds long. A preset proportion (e.g. 50%) of them is taken, of which a part (e.g. 20% of the total) is extracted as the training sample feature set to serve as the second training sample features, and the remaining 30% serves as the test sample features.
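One way to sketch a proportional split of the annotated set (the fractions and the random seed are illustrative assumptions, matching the 20%/30% example only by parameter choice):

```python
import random

def split_annotated_set(samples, train_frac=0.4, seed=0):
    """Shuffle the annotated set and split it: train_frac of it becomes
    the initial training features, the rest the to-be-cleaned test
    features. With a 50% extract, train_frac=0.4 reproduces the 20%/30%
    split of the total described above."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    k = int(len(shuffled) * train_frac)
    return shuffled[:k], shuffled[k:]
```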
Sub-step 1042: train on the training sample feature set with a predetermined deep-learning method to obtain a second music-emotion annotation model.
In this step, the second training sample features are trained with a predetermined deep-learning algorithm to obtain the second music-emotion annotation model. The predetermined deep-learning algorithm may be a Softmax classifier; of course, in practical applications it is not limited to a Softmax classifier, and the embodiment of the present invention places no restriction on the specific deep-learning method.
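For reference, the Softmax function that such a classifier applies to its class scores can be sketched as follows (a standard, numerically stable formulation, not the patent's implementation):

```python
import numpy as np

def softmax(logits):
    """Map a vector of class scores to a probability distribution over
    the music-emotion classes."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()          # shift for numerical stability; result unchanged
    e = np.exp(z)
    return e / e.sum()
```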
Sub-step 1043: take the annotated sample audio-segment feature vectors remaining in the annotated sample set as the test sample feature set, and input the test sample feature set into the second music-emotion annotation model, so that the second music-emotion annotation model outputs the music-emotion label of each annotated sample audio-segment feature vector in the test sample feature set, generating the updated annotated sample set.
The remaining 30% of the 50% extracted from the 16,955 samples is divided into three parts as the test set to be cleaned, which is input into the trained second music-emotion annotation model.
In this step, each round of music-emotion annotation proceeds as follows: the test data that has just been annotated is added back into the training set and the model is trained again, generating an updated second music-emotion annotation model; then the next 10% of the test set is extracted, annotated, and after annotation put into the training set to train the updated second music-emotion annotation model a second time. This continues until the whole test set has been returned to the training set, at which point the data in the training set is the cleaned sample data, i.e. the updated annotated sample set.
Sub-step 1044: merge the updated annotated sample set with the training sample feature set and determine the result as the annotated sample audio training set.
In this step, the updated annotated sample set that has completed cleaning is merged with the training sample feature set to serve as the annotated sample audio training set.
It should be understood that repeatedly inputting unannotated sample data into the second music-emotion annotation model for annotation, and retraining the model with the training samples updated after each round, can effectively improve annotation accuracy: the larger the training set, the higher the accuracy of the model used for annotation. Through the second music-emotion annotation model obtained by repeated training, the music labels of all the annotated test sets, combined with the updated annotated sample set above, finally yield the annotated sample audio training set.
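The iterative annotate-and-retrain cleaning loop described above can be sketched as follows; `fit` and `predict` stand in for the real deep-learning classifier, and the round count is illustrative:

```python
def clean_labels(train, test, fit, predict, n_rounds=3):
    """Sketch of the label-cleaning loop: train a model, relabel one
    slice of the held-out set with it, fold the relabelled slice back
    into the training set, and retrain, until the test set is empty.
    `train` and `test` are lists of (sample, label) pairs."""
    train, test = list(train), list(test)
    chunk = max(1, len(test) // n_rounds)
    model = fit(train)
    while test:
        batch, test = test[:chunk], test[chunk:]
        relabelled = [(x, predict(model, x)) for x, _ in batch]
        train.extend(relabelled)
        model = fit(train)           # retrain on the enlarged set
    return train, model
```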
Step 105: train on the annotated sample audio training set with a deep-learning method to obtain the first music-emotion annotation model.
In this step, the annotated sample audio training set obtained above is trained again with the predetermined deep-learning method to finally obtain the first music-emotion annotation model. This effectively reduces the labour cost of manually annotating music labels in the training samples, increases the amount of training sample data, and improves both model-training efficiency and annotation accuracy.
In the embodiment of the present invention, the music emotion of sample audio data is annotated to obtain annotated audio samples; the annotated audio samples are cut into multiple annotated audio data segments of a preset length; each annotated audio data segment is processed into annotated sample audio-segment feature vectors of multiple preset dimensions to serve as an annotated sample set; the music-emotion label of each annotated sample audio-segment feature vector in the annotated sample set is updated to obtain an annotated sample audio training set; and the annotated sample audio training set is trained with a deep-learning method to obtain a first music-emotion annotation model, which can efficiently and accurately annotate audio data that carries no music-emotion label.
Referring to Fig. 2, a flow chart of an audio processing method provided by an embodiment of the present invention, the method may specifically include the following steps:
Step 201: receive a request to annotate target audio data with a music emotion.
In the embodiment of the present invention, a back-end server receives the music-emotion annotation request sent by a user through an application interface. The annotation request usually targets one or more of the numerous video data sets or audio data sets stored by the server, where an audio/video data set is usually stored by date, or may be a data set stored by uploading-user identifier. For example, the audios and videos uploaded by users in February are stored as one set and the officially uploaded audios and videos as another, and the annotation request is initiated for one or more selected sets.
In practical applications, the music-emotion annotation request is initiated for an audio data set or a video data set. For a video data set, the audio data in the video data set must be extracted as the target audio data; an audio data set is processed further directly as the target audio data.
The method of extracting the audio data from video data has been described in detail in step 101 and is not repeated here.
Of course, the concrete way in which the audio/video sets are stored is not limited to the foregoing description, and the embodiment of the present invention places no restriction on it.
By performing music-emotion analysis on large data sets in this way, massive short-video or short-audio data can be analysed for music emotion automatically and efficiently, so as to realise the purpose of personalised recommendation to users.
Step 202: according to the annotation request, annotate the music emotion of the target audio data using a music-emotion annotation model.
Preferably, step 202 further comprises:
Sub-step 2021: according to the annotation request, divide the target audio data into audio data segments of a preset length.
In the embodiment of the present invention, the extracted audio data is split to obtain N audio data segments of a preset size. The audio data above may be imported into a preset audio cutter; when cutting, the duration of the cut audio data segments can be selected manually, and the audio cutter can perform batch cutting.
Of course, the embodiment of the present invention places no restriction on the type of audio cutting method.
Sub-step 2022: process each audio data segment into feature vectors of a preset dimension.
In this step, each audio data segment obtained above is preprocessed and then converted into second audio feature vectors of the preset dimension, as specifically described below.
Preferably, sub-step 2022 further comprises:
Sub-step 20221: perform framing on each audio data segment to obtain framed audio data segments.
In this step, the second audio signal in each audio data segment above will undergo framing, windowing and the Mel transform.
The framing is as shown in Fig. 1A: a speech signal is non-stationary on a macroscopic scale but stationary on a microscopic scale, exhibiting short-term stationarity (as shown in the box, the speech signal can be considered approximately unchanged within 10-30 ms). The speech signal can therefore be divided into short segments for processing, each segment being called a frame (chunk). Of course, the duration of each segment is not limited to the 10-30 ms described above, and the embodiment of the present invention places no restriction on the frame duration.
Sub-step 20222: multiplying the framed audio data segments by a windowing function to obtain windowed audio data segments.
During framing, the frames are not extracted back to back; instead, adjacent frames overlap: the tail of the previous frame and the head of the current frame share a portion before windowing is applied. In this way the overall speech signal is not excessively attenuated at the two ends of each frame by the windowing process; the overlap between frames realized during framing makes the windowed audio signal more continuous.
The framed audio data segments obtained above are then windowed: the original audio signal, shown in the left part of Figure 1B, is multiplied by the windowing function shown in the middle part of Figure 1B to obtain the logarithmic spectrum of each frame of audio data in the frequency domain, shown in the right part of Figure 1B. In this way a speech signal that is originally non-periodic exhibits some characteristics of a periodic function, yielding the windowed audio data segments.
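The windowing step can be sketched as below; the Hamming window is an assumption for illustration, since the embodiment does not name a particular windowing function:

```python
import numpy as np

def window_frames(frames: np.ndarray) -> np.ndarray:
    """Multiply each overlapping frame by a Hamming window.

    The window tapers the two ends of each frame; because adjacent frames
    overlap, the tapered portions are compensated by neighbouring frames,
    so the overall signal is not excessively attenuated.
    """
    window = np.hamming(frames.shape[1])
    return frames * window  # broadcasts the window over every frame

frames = np.ones((4, 8))         # 4 toy frames of 8 samples each
windowed = window_frames(frames)
```

Each row of `windowed` is the original frame shaped by the window, ready for the spectral transform of the next sub-step.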
Sub-step 20223: performing a Mel transformation on the windowed audio data segments to obtain the Mel spectrum data of the audio data segment.
Further, in order to present the sound characteristics of the framed and windowed audio data intuitively, a Mel transformation is performed on the windowed audio data segments to convert the audio data into Mel spectrum data, which has the beneficial effect of displaying the sound characteristics intuitively.
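As a rough illustration of the Mel transformation: real implementations use triangular Mel filter banks, whereas the equal-width Mel-band pooling below is a simplification, and all function names are hypothetical:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Standard Mel-scale mapping (2595 * log10(1 + f/700))."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_spectrum(windowed_frames: np.ndarray, n_mels: int, sample_rate: int) -> np.ndarray:
    """Power spectrum per frame, pooled into Mel-spaced bands."""
    spec = np.abs(np.fft.rfft(windowed_frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(windowed_frames.shape[1], d=1.0 / sample_rate)
    # Equal-width bands on the Mel axis (triangular filters in practice)
    edges = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_mels + 1)
    band = np.digitize(hz_to_mel(freqs), edges[1:-1])
    return np.stack([np.bincount(band, weights=spec[i], minlength=n_mels)
                     for i in range(spec.shape[0])])

rng = np.random.default_rng(0)
mels = mel_spectrum(rng.normal(size=(98, 400)), n_mels=64, sample_rate=16000)
```

The result is one 64-band Mel spectrum per frame, i.e. the "Mel spectrogram" that the later sub-steps treat as an image.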
Sub-step 20223: converting the Mel spectrum data into a feature vector of a preset dimension.
Preferably, this sub-step further comprises:
Sub-step 202231: determining the Mel spectrum data corresponding to each frame of audio data in the Mel spectrum data as framed Mel spectrum data.
In this step, the Mel spectrum data corresponding to each frame of audio in the audio data segment obtained above is intercepted as framed Mel spectrum data; that is, the segmented Mel spectrogram data is determined as the framed Mel spectrum data.
Sub-step 202232: converting the framed Mel spectrum data into framed audio feature vectors.
In this step, each piece of framed Mel spectrogram data is converted into a feature vector.
Specifically, the framed Mel spectrum image data is converted into framed audio feature vectors by an image feature vector transformation model. A well-known image feature vector transformation model is the BVLC GoogLeNet model, a 22-layer deep convolutional network that can convert images of 1000 different categories into machine-readable feature vectors.
Of course, the image feature conversion method is not limited to the foregoing description, and the embodiments of the present invention are not limited thereto.
Sub-step 202233: splicing the framed audio feature vectors of a preset number of frames to obtain a feature vector of the preset dimension.
In this step, after the framed audio feature vectors of the framed Mel spectrum data are obtained, multiple framed audio feature vectors are merged into one audio feature vector of the preset dimension. For example, suppose each framed audio feature vector is a 128-dimensional feature vector corresponding to one second of audio data. For audio data processing, the information contained in one second is not enough to characterize the concrete type of the audio data, so contextually adjacent framed audio feature vectors are merged: the feature vector corresponding to 3 seconds of audio data is obtained by splicing 3 framed audio feature vectors, generating a feature vector of 128*3=384 dimensions.
Of course, the preset dimension is not necessarily the 384 dimensions mentioned above; it may also be the combination of five framed audio feature vectors, or an audio feature vector composed of ten framed audio feature vectors. The setting of the preset dimension depends mainly on whether the audio data contains enough information for subsequent processing; therefore, the embodiment of the present invention does not limit the specific value of the preset dimension.
Sub-step 2023: inputting the feature vectors into the music emotion labeling model, so that the music emotion labeling model outputs the music emotion label of each feature vector.
In this step, the spliced audio feature vectors of the preset dimension are input into the trained first music emotion labeling model, which outputs the music emotion label of each audio feature vector.
Music emotions may be divided into Happy, Tender, Exciting, Funny, Sad, Scary, and Angry.
Of course, the music emotions are not limited to those listed above, and the present invention places no restriction on this.
Sub-step 2024: obtaining the number of music emotion labels of each audio data segment in the target audio data.
In this step, each audio data item of non-fixed duration is processed into multiple audio feature vectors; after a music emotion label is output for each audio feature vector, the entire audio data item carries multiple music emotion labels. A voting mechanism is therefore needed: the number of occurrences of each music emotion label among the audio feature vectors of the entire audio data item is counted.
The audio data is divided into small segments of 3-5 s (segments of 8-10 s are also commonly used); each small segment then undergoes framing, windowing, and Mel transformation to obtain image feature data, and each piece of image feature data yields one music emotion label, so a piece of audio data may include multiple music emotion labels.
For example, for video data with a duration of 5 minutes, each 3-second data segment may correspond to a different label, so the entire 5-minute video data is composed of 100 type labels, and the number of occurrences of each type is obtained.
Sub-step 2025: determining the music emotion corresponding to the music emotion label with the maximum count, or whose count is greater than or equal to a preset threshold, as the music emotion of the target audio data.
In this step, as described above, after the counts corresponding to the 100 music emotion labels of the 5-minute video data are obtained, the most frequent music emotion label is determined as the music emotion label of the 5-minute video data; alternatively, the music emotion label counts are sorted and the top-N labels are taken as the music emotions of the audio data segment.
Of course, in practical applications, a count threshold may also be preset; when the count of a certain music emotion label exceeds that threshold, the label is set as a music emotion label of the video data. For example, among 100 music emotion labels, suppose the preset label count threshold is 30 and the labels with more than 30 occurrences are rock music and traditional music; then the music emotion labels of the audio data are rock music and traditional music, these labels are determined as the music emotion labels of the video data corresponding to the audio data, and in subsequent recommendation operations they may be merged into traditional rock music.
The music emotion labeling method of the present invention is illustrated below by a specific example:
1) When labeling the music emotion of video data, the audio data of the video data is obtained first.
2) The acquired audio signal undergoes framing, windowing, and Mel transformation to obtain the Mel spectrogram of the audio data.
3) The Mel spectrogram is input into a VGGish depth model to obtain a feature vector of the preset dimension for the Mel spectrogram.
4) The preset-dimension feature vectors are input into a music emotion labeling model trained in advance with the machine learning algorithm Softmax Classifier, and the preset type label of each preset-dimension feature vector is obtained, such as hip-hop, rock, pop, folk, classical, or electronic.
5) Finally, the type whose music emotion label count is the largest in the audio data, or the labels exceeding a preset threshold, are determined as the music emotion corresponding to the audio data.
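The five steps above can be stitched together as in the following sketch, where the VGGish feature extractor and the Softmax classifier are replaced by toy stand-ins, since the real models are outside the scope of this illustration:

```python
from collections import Counter

def label_video_audio(audio_segments, extract_features, classify):
    """One label per segment, then a majority vote over the whole clip."""
    labels = [classify(extract_features(seg)) for seg in audio_segments]
    winner, _ = Counter(labels).most_common(1)[0]
    return winner

# Toy stand-ins: "features" are segment means, the "model" thresholds them.
segments = [[0.9, 0.8], [0.7, 0.9], [0.1, 0.2]]
emotion = label_video_audio(
    segments,
    extract_features=lambda seg: sum(seg) / len(seg),
    classify=lambda x: "Exciting" if x > 0.5 else "Tender",
)
```

In a real pipeline, `extract_features` would be the Mel-spectrogram-plus-VGGish front end and `classify` the trained emotion model; the voting logic is unchanged.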
The embodiment of the invention provides an audio processing method. After a request to perform music emotion labeling on target audio data is received, the target audio data is obtained and, according to the labeling request, divided into audio data segments of a preset length; each audio data segment is processed into a feature vector of a preset dimension; the feature vectors are input into the trained first music emotion labeling model to label the music emotion labels; the number of music emotion labels of each audio data segment in the target audio data is obtained; and according to the music emotion label counts, the final music emotion of the corresponding video data is determined. This achieves efficient batch labeling of the music emotion of audio in video data, saves the labor cost of music emotion labeling, and improves music emotion labeling efficiency.
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations, but those skilled in the art should understand that the embodiments of the present invention are not limited by the described sequence of actions, because according to the embodiments of the present invention some steps may be performed in other sequences or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the related actions are not necessarily required by the embodiments of the present invention.
Referring to Fig. 3, which is a structural schematic diagram of a model generating apparatus 300 provided in an embodiment of the present invention, the apparatus may specifically comprise the following modules:
An annotated audio sample generation module 301, configured to label the music emotion of sample audio data to obtain annotated audio samples;
An annotated audio data segment obtaining module 302, configured to cut the annotated audio samples into multiple annotated audio data segments of a preset length;
An annotated sample set determining module 303, configured to process each annotated audio data segment into multiple annotated sample audio segment feature vectors of a preset dimension, to serve as an annotated sample set;
Preferably, the annotated sample set determining module 303 comprises:
An annotated audio data segment generation submodule, configured to perform frame-splitting processing on each annotated audio data segment to obtain multiple framed annotated audio data segments of each annotated audio data segment;
An annotated windowed audio data segment generation submodule, configured to multiply each framed annotated audio data segment by a windowing function to obtain the annotated windowed audio data segment of each framed annotated audio data segment;
An annotated Mel spectrum data obtaining submodule, configured to perform a Mel transformation on each annotated windowed audio data segment to obtain the annotated Mel spectrum data of each annotated audio data segment;
An annotated sample audio segment feature vector obtaining submodule, configured to convert each piece of annotated Mel spectrum data into a feature vector of the preset dimension, to obtain the annotated sample audio segment feature vector of each piece of annotated Mel spectrum data;
Preferably, the annotated sample audio segment feature vector obtaining submodule comprises:
A sample framed Mel spectrum data determination unit, configured to determine the Mel spectrum data corresponding to each frame of audio data in the annotated Mel spectrum data as sample framed Mel spectrum data;
A sample framed audio feature vector obtaining unit, configured to convert the sample framed Mel spectrum data into sample framed audio feature vectors;
An annotated sample audio segment feature vector obtaining unit, configured to splice the sample framed audio feature vectors of a preset number of frames to obtain annotated sample audio segment feature vectors of the preset dimension.
An annotated sample set determining submodule, configured to combine the annotated sample audio segment feature vectors into the annotated sample set.
An annotated sample audio training set generation module 304, configured to update the music emotion labels of the annotated sample audio segment feature vectors in the annotated sample set to obtain an annotated sample audio training set;
Preferably, the annotated sample audio training set generation module 304 comprises:
A second training sample feature generation submodule, configured to extract annotated sample audio segment feature vectors from the annotated sample set according to a preset ratio, and determine them as a training sample feature set;
A second music emotion labeling model training module, configured to train the training sample feature set by a predetermined deep learning method to obtain a second music emotion labeling model;
An updated annotated sample set generation submodule, configured to take the annotated sample audio segment feature vectors remaining in the annotated sample set as a test sample feature set, and input the test sample feature set into the second music emotion labeling model, so that the second music emotion labeling model outputs the music emotion label of each annotated sample audio segment feature vector in the test sample feature set, generating an updated annotated sample set;
An annotated sample audio training set obtaining submodule, configured to merge the updated annotated sample set with the training sample feature set, and determine the result as the annotated sample audio training set.
A first music emotion labeling model training module 305, configured to train the annotated sample audio training set using a deep learning method to obtain a first music emotion labeling model.
In this embodiment of the present invention, the annotated audio sample generation module labels the music emotion of sample audio data to obtain annotated audio samples; the annotated audio data segment obtaining module cuts the annotated audio samples into multiple annotated audio data segments of a preset length; the annotated sample set determining module processes each annotated audio data segment into multiple annotated sample audio segment feature vectors of a preset dimension, which serve as an annotated sample set; the annotated sample audio training set generation module updates the music emotion labels of the annotated sample audio segment feature vectors in the annotated sample set to obtain an annotated sample audio training set; and the first music emotion labeling model training module trains the annotated sample audio training set using a deep learning method to obtain a first music emotion labeling model, which can efficiently and accurately apply music emotion labels to audio data that has no music emotion label.
Optionally, in another embodiment, as shown in Figure 4, an audio processing apparatus 400 is provided; the apparatus comprises:
A music emotion labeling request receiving module 401, configured to receive a request to perform music emotion labeling on target audio data;
A music emotion labeling module 402, configured to label the music emotion of the target audio data according to the labeling request using a music emotion labeling model.
Preferably, the music emotion labeling module 402 comprises:
An audio data segment obtaining submodule, configured to divide the target audio data into audio data segments of a preset length according to the labeling request;
A feature vector obtaining unit, configured to process each audio data segment into an audio segment feature vector of a preset dimension;
Preferably, the feature vector obtaining unit comprises:
A framed audio data segment obtaining unit, configured to perform frame-splitting processing on each audio data segment to obtain framed audio data segments;
A windowed audio data segment obtaining unit, configured to multiply the framed audio data segments by a windowing function to obtain windowed audio data segments;
A Mel spectrum data obtaining unit, configured to perform a Mel transformation on the windowed audio data segments to obtain the Mel spectrum data of the audio data segment;
A feature vector obtaining unit, configured to convert the Mel spectrum data into the audio segment feature vector of the preset dimension.
Preferably, the feature vector obtaining unit comprises:
A framed Mel spectrum data determination subunit, configured to determine the Mel spectrum data corresponding to each frame of audio data in the Mel spectrum data as framed Mel spectrum data;
A framed audio feature vector obtaining subunit, configured to convert the framed Mel spectrum data into framed audio feature vectors;
A feature vector obtaining subunit, configured to splice the framed audio feature vectors of a preset number of frames to obtain the feature vector of the preset dimension.
A music emotion label obtaining submodule, configured to input the feature vectors into the music emotion labeling model, so that the music emotion labeling model outputs the music emotion label of each feature vector;
A music emotion label count obtaining submodule, configured to obtain the number of music emotion labels of each audio data segment in the target audio data;
A music emotion label determination submodule, configured to determine the music emotion corresponding to the music emotion label with the maximum count, or whose count is greater than or equal to a preset threshold, as the music emotion of the target audio data.
In this embodiment of the present invention, the music emotion labeling request receiving module receives a request to perform music emotion labeling on target audio data, and the music emotion labeling module labels the music emotion of the target audio data according to the labeling request using a music emotion labeling model. This achieves efficient batch labeling of the music emotion of audio in video data, saves the labor cost of music emotion labeling, and improves music emotion labeling efficiency.
As for the apparatus embodiments, since they are basically similar to the method embodiments, the description is relatively simple; for related parts, refer to the description of the method embodiments.
In an embodiment of the present invention, upon receiving a video search request input by a user, the labels and label types in the video search request are first annotated, the labels and label types are input into a video semantic label independence model, semantically independent labels are screened out, and a video search is performed on the semantically independent labels to obtain videos matching the semantically independent labels. The embodiment of the present invention searches according to the screened semantically independent labels, reducing irrelevant video search results recalled due to mistakenly searched labels, thereby improving the accuracy of video search.
Optionally, an embodiment of the present invention further provides a terminal, comprising a processor and a memory, with a computer program stored on the memory and runnable on the processor. When executed by the processor, the computer program implements each process of the above model generating method or audio processing method embodiments and can achieve the same technical effect; to avoid repetition, details are not described here again.
Optionally, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements each process of the above model generating method or audio processing method embodiments and can achieve the same technical effect; to avoid repetition, details are not described here again. The computer-readable storage medium is, for example, a read-only memory (Read-Only Memory, ROM for short), a random access memory (Random Access Memory, RAM for short), a magnetic disk, or an optical disc.
Each embodiment in this specification is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts between the embodiments, reference may be made to one another.
It should be understood by those skilled in the art that the embodiments of the present invention may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present invention may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing terminal device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing terminal device, so that a series of operation steps are executed on the computer or other programmable terminal device to produce computer-implemented processing, and the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although the preferred embodiments of the embodiments of the present invention have been described, once those skilled in the art learn the basic inventive concept, additional changes and modifications can be made to these embodiments. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present invention.
Finally, it should be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another, without necessarily requiring or implying any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device including a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to the process, method, article, or terminal device. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or terminal device including the element.
The model generating method, audio processing method, apparatus, terminal, and computer-readable storage medium provided by the present invention have been described in detail above. Specific examples have been used herein to illustrate the principles and implementations of the present invention; the above descriptions of the embodiments are only intended to help understand the method of the present invention and its core ideas. Meanwhile, for those skilled in the art, there will be changes in the specific implementation and application scope according to the ideas of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.
Claims (18)
1. A model generating method, comprising:
labeling the music emotion of sample audio data to obtain annotated audio samples;
cutting the annotated audio samples into multiple annotated audio data segments of a preset length;
processing each annotated audio data segment into multiple annotated sample audio segment feature vectors of a preset dimension, to serve as an annotated sample set;
updating the music emotion labels of the annotated sample audio segment feature vectors in the annotated sample set to obtain an annotated sample audio training set; and
training the annotated sample audio training set using a deep learning method to obtain a first music emotion labeling model.
2. The method according to claim 1, wherein updating the music emotion labels of the annotated sample audio segment feature vectors in the annotated sample set to obtain the annotated sample audio training set comprises:
extracting annotated sample audio segment feature vectors from the annotated sample set according to a preset ratio, and determining them as a training sample feature set;
training the training sample feature set by a predetermined deep learning method to obtain a second music emotion labeling model;
taking the annotated sample audio segment feature vectors remaining in the annotated sample set as a test sample feature set, and inputting the test sample feature set into the second music emotion labeling model, so that the second music emotion labeling model outputs the music emotion label of each annotated sample audio segment feature vector in the test sample feature set, generating an updated annotated sample set; and
merging the updated annotated sample set with the training sample feature set, and determining the result as the annotated sample audio training set.
3. The method according to claim 1, wherein processing each annotated audio data segment into multiple annotated sample audio segment feature vectors of a preset dimension, to serve as an annotated sample set, comprises:
performing frame-splitting processing on each annotated audio data segment to obtain multiple framed annotated audio data segments of each annotated audio data segment;
multiplying each framed annotated audio data segment by a windowing function to obtain the annotated windowed audio data segment of each framed annotated audio data segment;
performing a Mel transformation on each annotated windowed audio data segment to obtain the annotated Mel spectrum data of each annotated audio data segment;
converting each piece of annotated Mel spectrum data into a feature vector of the preset dimension to obtain the annotated sample audio segment feature vector of each piece of annotated Mel spectrum data; and
combining the annotated sample audio segment feature vectors into the annotated sample set.
4. The method according to claim 3, wherein converting each piece of annotated Mel spectrum data into a feature vector of the preset dimension to obtain the annotated sample audio segment feature vector of each piece of annotated Mel spectrum data comprises:
determining the Mel spectrum data corresponding to each frame of audio data in the annotated Mel spectrum data as sample framed Mel spectrum data;
converting the sample framed Mel spectrum data into sample framed audio feature vectors; and
splicing the sample framed audio feature vectors of a preset number of frames to obtain the annotated sample audio segment feature vector of the preset dimension.
5. An audio processing method, comprising:
receiving a request to perform music emotion labeling on target audio data; and
labeling the music emotion of the target audio data according to the labeling request using a music emotion labeling model, wherein the music emotion labeling model is obtained using the method according to any one of claims 1 to 4.
6. The method according to claim 5, wherein labeling the music emotion of the target audio data according to the labeling request using the music emotion labeling model comprises:
dividing the target audio data into audio data segments of a preset length according to the labeling request;
processing each audio data segment into a feature vector of a preset dimension;
inputting the feature vectors into the music emotion labeling model, so that the music emotion labeling model outputs the music emotion label of each feature vector;
obtaining the number of music emotion labels of each audio data segment in the target audio data; and
determining the music emotion corresponding to the music emotion label with the maximum count, or whose count is greater than or equal to a preset threshold, as the music emotion of the target audio data.
7. The method according to claim 6, wherein processing each audio data segment into a feature vector of a preset dimension comprises:
performing frame-splitting processing on each audio data segment to obtain framed audio data segments;
multiplying the framed audio data segments by a windowing function to obtain windowed audio data segments;
performing a Mel transformation on the windowed audio data segments to obtain the Mel spectrum data of the audio data segment; and
converting the Mel spectrum data into the feature vector of the preset dimension.
8. The method according to claim 7, wherein converting the Mel spectrum data into the feature vector of the preset dimension comprises:
determining the Mel spectrum data corresponding to each frame of audio data in the Mel spectrum data as framed Mel spectrum data;
converting the framed Mel spectrum data into framed audio feature vectors; and
splicing the framed audio feature vectors of a preset number of frames to obtain the feature vector of the preset dimension.
9. A model generation apparatus, comprising:
an annotated audio sample generation module, configured to annotate the music emotion of sample audio data to obtain an annotated audio sample;
an annotated audio data segment obtaining module, configured to cut the annotated audio sample into multiple annotated audio data segments of a preset length;
an annotated sample set determination module, configured to process each annotated audio data segment into an annotated sample feature vector of a preset dimension, the vectors forming an annotated sample set;
an annotated sample audio training set generation module, configured to update the music emotion label of each annotated sample audio segment feature vector in the annotated sample set to obtain an annotated sample audio training set; and
a first music emotion labeling model training module, configured to train on the annotated sample audio training set using a deep learning method to obtain a first music emotion labeling model.
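The data-preparation flow of claim 9 (segment each annotated track, featurize each segment, propagate the track label to its segments) can be sketched as below. The mean/std "feature vector" and the toy tracks are placeholders; a real system would use the Mel features of claims 7-8 and train a deep network on the result.

```python
import numpy as np

def cut_segments(audio, seg_len):
    """Cut an annotated track into fixed-length segments; each segment
    inherits the track-level emotion label."""
    n = len(audio) // seg_len
    return [audio[i * seg_len:(i + 1) * seg_len] for i in range(n)]

def build_training_set(tracks, labels, seg_len, featurize):
    """Assemble the annotated sample set: segment every track, featurize
    each segment, and replicate the track label onto its segments."""
    X, y = [], []
    for audio, label in zip(tracks, labels):
        for seg in cut_segments(audio, seg_len):
            X.append(featurize(seg))
            y.append(label)
    return np.array(X), np.array(y)

# Toy stand-ins: a 2-dim mean/std feature and two synthetic "tracks".
featurize = lambda seg: np.array([seg.mean(), seg.std()])
tracks = [np.random.randn(3000), np.random.randn(4500)]
X, y = build_training_set(tracks, ["happy", "sad"], seg_len=1000, featurize=featurize)
print(X.shape, list(y))  # (7, 2) ['happy', 'happy', 'happy', 'sad', 'sad', 'sad', 'sad']
```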
10. The apparatus according to claim 9, wherein the annotated sample audio training set generation module comprises:
a second training sample feature generation submodule, configured to extract annotated sample audio segment feature vectors from the annotated sample set according to a preset ratio and determine them as a training sample feature set;
a second music emotion labeling model training submodule, configured to train on the training sample feature set using a predetermined deep learning method to obtain a second music emotion labeling model;
an updated annotated sample set generation submodule, configured to use the remaining annotated sample audio segment feature vectors in the annotated sample set as a test sample feature set, and to input the test sample feature set into the second music emotion labeling model, so that the second music emotion labeling model outputs the music emotion label of each annotated sample audio segment feature vector in the test sample feature set, generating an updated annotated sample set; and
an annotated sample audio training set acquisition submodule, configured to merge the updated annotated sample set with the training sample feature set and determine the result as the annotated sample audio training set.
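Claim 10 describes a self-training style loop: train a second model on a ratio-selected subset, relabel the held-out remainder with it, then merge both splits into the final training set. The sketch below substitutes a nearest-centroid classifier for the patent's (unspecified) deep learning method purely so the flow is runnable; the split ratio and data are made up.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Toy stand-in for the 'second music emotion labeling model':
    one centroid per class instead of a deep network."""
    classes = sorted(set(y))
    centroids = np.stack([X[np.array(y) == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(model, X):
    classes, centroids = model
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return [classes[i] for i in dists.argmin(axis=1)]

def expand_training_set(X, y, ratio=0.5, seed=0):
    """Claim 10's flow: take `ratio` of the segments as the training split,
    relabel the rest with the trained model, merge both splits."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    k = int(len(X) * ratio)
    train_idx, rest_idx = idx[:k], idx[k:]
    model = nearest_centroid_fit(X[train_idx], [y[i] for i in train_idx])
    new_labels = nearest_centroid_predict(model, X[rest_idx])
    X_full = np.concatenate([X[train_idx], X[rest_idx]])
    y_full = [y[i] for i in train_idx] + new_labels
    return X_full, y_full

# Two well-separated synthetic clusters standing in for segment features.
X = np.vstack([np.random.randn(20, 2) + 3, np.random.randn(20, 2) - 3])
y = ["happy"] * 20 + ["sad"] * 20
X_full, y_full = expand_training_set(X, y, ratio=0.5)
print(len(X_full), len(y_full))  # 40 40
```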
11. The apparatus according to claim 9, wherein the annotated sample set determination module comprises:
an annotated audio data segment generation submodule, configured to perform frame division on each annotated audio data segment to obtain multiple framed annotated audio data segments of each annotated audio data segment;
an annotated windowed audio data segment generation submodule, configured to multiply each framed annotated audio data segment by a window function to obtain the annotated windowed audio data segment of each framed annotated audio data segment;
an annotated Mel spectrum data obtaining submodule, configured to apply a Mel transform to each annotated windowed audio data segment to obtain the annotated Mel spectrum data of each annotated audio data segment;
an annotated sample audio segment feature vector obtaining submodule, configured to convert each piece of annotated Mel spectrum data into a feature vector of a preset dimension, obtaining the annotated sample audio segment feature vector of each piece of annotated Mel spectrum data; and
an annotated sample set determination submodule, configured to combine the annotated sample audio segment feature vectors into the annotated sample set.
12. The apparatus according to claim 11, wherein the annotated sample audio segment feature vector obtaining submodule comprises:
a sample framed Mel spectrum data determination unit, configured to determine the Mel spectrum data corresponding to each frame of audio data in the annotated Mel spectrum data as sample framed Mel spectrum data;
a sample framed audio feature vector acquisition unit, configured to convert the sample framed Mel spectrum data into sample framed audio feature vectors; and
an annotated sample audio segment feature vector obtaining unit, configured to splice the sample framed audio feature vectors of a preset number of frames to obtain the annotated sample audio segment feature vector of the preset dimension.
13. An audio processing apparatus, comprising:
a music emotion annotation request receiving module, configured to receive an annotation request for performing music emotion annotation on target audio data; and
a music emotion annotation module, configured to annotate, according to the annotation request and using a music emotion labeling model, the music emotion of the target audio data.
14. The apparatus according to claim 13, wherein the music emotion annotation module comprises:
an audio data segment acquisition submodule, configured to divide, according to the annotation request, the target audio data into audio data segments of a preset length;
a feature vector acquisition submodule, configured to process each audio data segment into a feature vector of a preset dimension;
a music emotion label acquisition submodule, configured to input the feature vector into the music emotion labeling model, so that the music emotion labeling model outputs the music emotion label of the feature vector;
a music emotion label count acquisition submodule, configured to obtain the count of the music emotion label of each audio data segment in the target audio data; and
a music emotion label determination submodule, configured to determine, as the music emotion of the target audio data, the music emotion corresponding to the label with the largest count, or to a music emotion label whose count is greater than or equal to a preset threshold.
15. The apparatus according to claim 14, wherein the feature vector acquisition submodule comprises:
a framed audio data segment obtaining unit, configured to perform frame division on each audio data segment to obtain framed audio data segments;
a windowed audio data segment obtaining unit, configured to multiply the framed audio data segments by a window function to obtain windowed audio data segments;
a Mel spectrum data obtaining unit, configured to apply a Mel transform to the windowed audio data segments to obtain the Mel spectrum data of the audio data segment; and
a feature vector obtaining unit, configured to convert the Mel spectrum data into the feature vector of the preset dimension.
16. The apparatus according to claim 15, wherein the feature vector obtaining unit comprises:
a framed Mel spectrum data determination subunit, configured to determine the Mel spectrum data corresponding to each frame of audio data in the Mel spectrum data as framed Mel spectrum data;
a framed audio feature vector obtaining subunit, configured to convert the framed Mel spectrum data into framed audio feature vectors; and
a feature vector obtaining subunit, configured to splice the framed audio feature vectors of a preset number of frames to obtain the feature vector of the preset dimension.
17. A terminal, comprising: a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the computer program, when executed by the processor, implements the steps of the model generation method according to any one of claims 1 to 4, or the steps of the audio processing method according to any one of claims 5 to 8.
18. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the model generation method according to any one of claims 1 to 4, or the steps of the audio processing method according to any one of claims 5 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910134036.1A CN110008372A (en) | 2019-02-22 | 2019-02-22 | Model generating method, audio-frequency processing method, device, terminal and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110008372A true CN110008372A (en) | 2019-07-12 |
Family
ID=67165970
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910134036.1A Pending CN110008372A (en) | 2019-02-22 | 2019-02-22 | Model generating method, audio-frequency processing method, device, terminal and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110008372A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108053836A (en) * | 2018-01-18 | 2018-05-18 | 成都嗨翻屋文化传播有限公司 | A kind of audio automation mask method based on deep learning |
US20180276540A1 (en) * | 2017-03-22 | 2018-09-27 | NextEv USA, Inc. | Modeling of the latent embedding of music using deep neural network |
CN108764372A (en) * | 2018-06-08 | 2018-11-06 | Oppo广东移动通信有限公司 | Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set |
CN108897829A (en) * | 2018-06-22 | 2018-11-27 | 广州多益网络股份有限公司 | Modification method, device and the storage medium of data label |
CN108962279A (en) * | 2018-07-05 | 2018-12-07 | 平安科技(深圳)有限公司 | New Method for Instrument Recognition and device, electronic equipment, the storage medium of audio data |
CN109147826A (en) * | 2018-08-22 | 2019-01-04 | 平安科技(深圳)有限公司 | Music emotion recognition method, device, computer equipment and computer storage medium |
KR101943075B1 (en) * | 2017-11-06 | 2019-01-28 | 주식회사 아티스츠카드 | Method for automatical tagging metadata of music content using machine learning |
- 2019-02-22: Application CN201910134036.1A filed in China; published as CN110008372A, status: Pending
Non-Patent Citations (1)
Title |
---|
Han Ning, "Research on Automatic Music Annotation Technology Based on Deep Neural Networks", China Master's Theses Full-text Database, Information Science and Technology Series * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111586494A (en) * | 2020-04-30 | 2020-08-25 | 杭州慧川智能科技有限公司 | Intelligent strip splitting method based on audio and video separation |
CN114025216A (en) * | 2020-04-30 | 2022-02-08 | 网易(杭州)网络有限公司 | Media material processing method, device, server and storage medium |
CN111586494B (en) * | 2020-04-30 | 2022-03-11 | 腾讯科技(深圳)有限公司 | Intelligent strip splitting method based on audio and video separation |
CN114025216B (en) * | 2020-04-30 | 2023-11-17 | 网易(杭州)网络有限公司 | Media material processing method, device, server and storage medium |
CN111898753A (en) * | 2020-08-05 | 2020-11-06 | 字节跳动有限公司 | Music transcription model training method, music transcription method and corresponding device |
CN111898753B (en) * | 2020-08-05 | 2024-07-02 | 字节跳动有限公司 | Training method of music transcription model, music transcription method and corresponding device |
CN112687280A (en) * | 2020-12-25 | 2021-04-20 | 浙江弄潮儿智慧科技有限公司 | Biodiversity monitoring system with frequency spectrum-time space interface |
CN112687280B (en) * | 2020-12-25 | 2023-09-12 | 浙江弄潮儿智慧科技有限公司 | Biodiversity monitoring system with frequency spectrum-time space interface |
CN113257283A (en) * | 2021-03-29 | 2021-08-13 | 北京字节跳动网络技术有限公司 | Audio signal processing method and device, electronic equipment and storage medium |
CN113257283B (en) * | 2021-03-29 | 2023-09-26 | 北京字节跳动网络技术有限公司 | Audio signal processing method and device, electronic equipment and storage medium |
CN113362218A (en) * | 2021-05-21 | 2021-09-07 | 北京百度网讯科技有限公司 | Data processing method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109977255A (en) | Model generating method, audio-frequency processing method, device, terminal and storage medium | |
CN110008372A (en) | Model generating method, audio-frequency processing method, device, terminal and storage medium | |
CN111488489B (en) | Video file classification method, device, medium and electronic equipment | |
CN105244026B (en) | A kind of method of speech processing and device | |
CN107526809B (en) | Method and device for pushing music based on artificial intelligence | |
CN110880198A (en) | Animation generation method and device | |
CN113658577B (en) | Speech synthesis model training method, audio generation method, equipment and medium | |
CN107680584B (en) | Method and device for segmenting audio | |
CN113573161B (en) | Multimedia data processing method, device, equipment and storage medium | |
CN111125384B (en) | Multimedia answer generation method and device, terminal equipment and storage medium | |
CN111508466A (en) | Text processing method, device and equipment and computer readable storage medium | |
CN109982137A (en) | Model generating method, video marker method, apparatus, terminal and storage medium | |
CN114842826A (en) | Training method of speech synthesis model, speech synthesis method and related equipment | |
CN109376145B (en) | Method and device for establishing movie and television dialogue database and storage medium | |
CN116416962A (en) | Audio synthesis method, device, equipment and storage medium | |
CN110889008B (en) | Music recommendation method and device, computing device and storage medium | |
CN114125506A (en) | Voice auditing method and device | |
Wu et al. | Cold start problem for automated live video comments | |
CN115578998A (en) | Speech synthesis method, electronic device, and storage medium | |
CN113112993B (en) | Audio information processing method and device, electronic equipment and storage medium | |
CN114155829A (en) | Speech synthesis method, speech synthesis device, readable storage medium and electronic equipment | |
CN113299271B (en) | Speech synthesis method, speech interaction method, device and equipment | |
CN113808579B (en) | Detection method and device for generated voice, electronic equipment and storage medium | |
CN117174069A (en) | Speech synthesis method, device, equipment and storage medium | |
CN116843805B (en) | Method, device, equipment and medium for generating virtual image containing behaviors |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190712 |