CN110008372A - Model generating method, audio-frequency processing method, device, terminal and storage medium - Google Patents
Model generating method, audio-frequency processing method, device, terminal and storage medium
- Publication number
- CN110008372A CN110008372A CN201910134036.1A CN201910134036A CN110008372A CN 110008372 A CN110008372 A CN 110008372A CN 201910134036 A CN201910134036 A CN 201910134036A CN 110008372 A CN110008372 A CN 110008372A
- Authority
- CN
- China
- Prior art keywords
- audio
- sample
- mark
- audio data
- feature vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the present invention provide a model generation method, an audio processing method, a device, a terminal, and a computer-readable storage medium. The model generation method includes: annotating sample audio data with a music emotion to obtain annotated audio samples; cutting the annotated audio samples into multiple annotated audio data segments of a preset length; processing each annotated audio data segment into annotated sample audio-segment feature vectors of multiple preset dimensions to serve as an annotated sample set; updating the music-emotion label of each annotated sample audio-segment feature vector in the annotated sample set to obtain an annotated sample audio training set; and training on the annotated sample audio training set with a deep-learning method to obtain a first music-emotion annotation model. Target audio data can then be input into the first music-emotion annotation model to obtain its music-emotion label.
Description
Technical field
The present invention relates to the field of network technology, and in particular to a model generation method, an audio processing method, a device, a terminal, and a computer-readable storage medium.
Background technique
With the spread and development of video and audio networks, many video and audio websites have emerged, making it convenient for users to search for videos or audio of interest and greatly enriching their lives. Music is an art that reflects the emotions of real human life, and short video is a popular form of expression.
At present, video and audio websites store large amounts of user-made and officially produced audio and video data for users to use, and the content of a video is often matched to the emotion of its music in order to express feeling. As audio and video grow in popularity, the amount of data increases daily, and an efficient algorithm that automatically analyses music emotion is urgently needed to structure this data. Annotating the music emotion of audio, or of video containing music, according to such an emotional classification is therefore an essential element in structuring audio and video data.
In the prior art, the music emotion of content on audio and video websites is usually annotated manually, which is inefficient and costly. How to annotate the music emotion of the audio and video data stored on audio and video websites efficiently and accurately is therefore a technical problem that currently needs to be solved.
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide a model generation method, an audio processing method, a device, a terminal, and a computer-readable storage medium, so as to solve the technical problem of annotating the music emotion of music-related video data or audio data stored on video websites.
To solve the above-mentioned problems, the present invention is achieved through the following technical solutions:
A first aspect provides a model generation method, the method comprising:
annotating the music emotion of sample audio data to obtain annotated audio samples;
cutting the annotated audio samples into multiple annotated audio data segments of a preset length;
processing each annotated audio data segment into annotated sample audio-segment feature vectors of multiple preset dimensions to serve as an annotated sample set;
updating the music-emotion label of each annotated sample audio-segment feature vector in the annotated sample set to obtain an annotated sample audio training set;
training on the annotated sample audio training set with a deep-learning method to obtain a first music-emotion annotation model.
A second aspect provides an audio processing method, the method comprising:
receiving a request to annotate target audio data with a music emotion;
according to the annotation request, annotating the music emotion of the target audio data using a music-emotion annotation model.
A third aspect provides a model generation device, the device comprising:
an annotated-audio-sample generation module, configured to annotate the music emotion of sample audio data to obtain annotated audio samples;
an annotated-audio-data-segment acquisition module, configured to cut the annotated audio samples into multiple annotated audio data segments of a preset length;
an annotated-sample-set determination module, configured to process each annotated audio data segment into annotated sample audio-segment feature vectors of multiple preset dimensions to serve as an annotated sample set;
an annotated-sample-audio-training-set generation module, configured to update the music-emotion label of each annotated sample audio-segment feature vector in the annotated sample set to obtain an annotated sample audio training set;
a first-music-emotion-annotation-model training module, configured to train on the annotated sample audio training set with a deep-learning method to obtain a first music-emotion annotation model.
A fourth aspect provides an audio processing device, the device comprising:
a music-emotion annotation request receiving module, configured to receive a request to annotate target audio data with a music emotion;
a music-emotion annotation module, configured to annotate, according to the annotation request and using a music-emotion annotation model, the music emotion of the target audio data.
A fifth aspect provides a terminal, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, where the computer program, when executed by the processor, realises the steps of the above model generation method or the steps of the above audio processing method.
A sixth aspect provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, realises the steps of the above model generation method or the steps of the above audio processing method.
Compared with the prior art, the embodiments of the present invention include the following advantages:
In the embodiments of the present invention, the audio data on an audio/video website is annotated with preset music-emotion labels, preprocessed, cut into segments, and processed into feature vectors of a preset dimension; the music-emotion labels are then updated to obtain an annotated sample audio training set, which is trained with a deep-learning method to obtain a first music-emotion annotation model. Target audio data is then input into this first music-emotion annotation model, which outputs its music emotion. The music emotions are preset, for example pop music, hip-hop music, rock music, rhythm and blues, and so on. In this way, the purpose of annotating audio and video data with music-emotion labels is realised for all preset music emotions, and the annotation is performed accurately and efficiently for all kinds of audio and video data; the beneficial effect is the efficient and accurate music-emotion annotation of audio and video data.
The above is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the contents of the specification, and in order to make the above and other objects, features and advantages of the present invention more comprehensible, specific embodiments of the present invention are set out below.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present application.
Brief description of the drawings
Fig. 1 is a flow chart of a model generation method provided by an embodiment of the present invention;
Fig. 1A is a schematic diagram of an audio signal provided by an embodiment of the present invention;
Fig. 1B is a schematic diagram of audio-data windowing provided by an embodiment of the present invention;
Fig. 2 is a flow chart of an audio processing method provided by an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of a model generation device provided by an embodiment of the present invention;
Fig. 4 is a structural schematic diagram of an audio processing device provided by an embodiment of the present invention.
Specific embodiment
In order to make the above objects, features and advantages of the present invention clearer and more comprehensible, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Referring to Fig. 1, a flow chart of a model generation method provided by an embodiment of the present invention, the method specifically includes:
Step 101: annotate the music emotion of sample audio data to obtain annotated audio samples.
In the embodiment of the present invention, the sample audio data is extracted from the audio and video data sets stored on the back end of an audio/video website. Such data sets are generally stored with a time mark, for example the set of audio and video data freely uploaded by users and officially produced and uploaded in the first quarter of a year; the audio data in these sets is extracted as the audio samples.
For example, audio data extracted from video data may serve as the audio samples, audio data may be used directly as the audio samples, or the audio data extracted from video data may be combined with natively stored audio data sets to form the audio samples.
The specific method of extracting the audio data from video data is described as follows.
The video data is read through the packet-reading method RTMP_ReadPacket of the real-time messaging protocol (RTMP, Real Time Messaging Protocol) to obtain the video and the corresponding audio data:
1. obtain the audio sync packet in the video data;
2. parse the audio header decoding data AACDecoderSpecificInfo and the audio data configuration information AudioSpecificConfig in the audio sync packet, where AudioSpecificConfig is used to generate the ADTS (including the sample rate, channel number, and frame length of the audio data);
3. obtain the other audio packets in the video data and parse out the raw audio data (i.e. the ES stream);
4. package the AAC ES stream into the ADTS format through the audio data header of the AAC decoder, i.e. add a 7-byte ADTS header before the AAC ES stream so that the audio data content can be parsed.
In this way, by parsing the audio data packets extracted from the video data, the specific content of the audio data is parsed out, that is, the audio content in the video data has been extracted.
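The 7-byte ADTS header mentioned in step 4 can be sketched as follows. This is a minimal illustration of the conventional header layout (syncword, profile, sampling-frequency index, channel configuration, 13-bit frame length), not the patent's implementation; the field defaults are assumptions.

```python
def build_adts_header(frame_len, profile=1, sf_index=4, channels=2):
    """Build the 7-byte ADTS header prepended to each AAC ES frame.

    frame_len is the total frame length *including* the 7 header bytes.
    profile=1 (AAC-LC) and sf_index=4 (44.1 kHz) are illustrative defaults.
    """
    hdr = bytearray(7)
    hdr[0] = 0xFF                                  # syncword 0xFFF (high 8 bits)
    hdr[1] = 0xF1                                  # syncword low bits, MPEG-4, no CRC
    hdr[2] = (profile << 6) | (sf_index << 2) | (channels >> 2)
    hdr[3] = ((channels & 0x03) << 6) | ((frame_len >> 11) & 0x03)
    hdr[4] = (frame_len >> 3) & 0xFF               # frame-length bits 10..3
    hdr[5] = ((frame_len & 0x07) << 5) | 0x1F      # frame-length bits 2..0 + fullness
    hdr[6] = 0xFC                                  # buffer fullness low bits
    return bytes(hdr)

def parse_adts_frame_length(hdr):
    """Recover the 13-bit frame length from an ADTS header."""
    return ((hdr[3] & 0x03) << 11) | (hdr[4] << 3) | (hdr[5] >> 5)
```

A decoder scanning the stream locates the 0xFFF syncword and then uses the recovered frame length to step from frame to frame.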
It should be understood that the method of extracting the audio data from the video data is not limited to the foregoing description, and the embodiment of the present invention places no restriction on the extraction method.
After the audio samples are obtained by the above method, the annotated audio samples are obtained by annotating the audio samples with seven predefined music-emotion labels (e.g., happy (Happy), tender (Tender), exciting (Exciting), funny (Funny), sad (Sad), scary (Scary), and angry (Angry)).
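A minimal sketch of such a label table, assuming arbitrary integer ids for the seven emotions listed above (the ids are an illustration, not part of the patent):

```python
# The seven preset music-emotion labels named in the text; integer ids are
# assumed here purely for training convenience.
EMOTION_LABELS = {
    "Happy": 0, "Tender": 1, "Exciting": 2, "Funny": 3,
    "Sad": 4, "Scary": 5, "Angry": 6,
}

def annotate(sample_id, emotion):
    """Attach one of the preset music-emotion labels to an audio sample."""
    if emotion not in EMOTION_LABELS:
        raise ValueError(f"unknown music emotion: {emotion}")
    return {"sample": sample_id, "emotion": emotion, "label": EMOTION_LABELS[emotion]}
```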
Step 102: cut the annotated audio samples into multiple annotated audio data segments of a preset length.
In practical applications, non-uniform lengths of the annotated audio samples will cause data errors during batch processing, so the audio data needs to be cut to finally obtain training samples that meet a preset standard. For example, a total of 16,955 samples may be obtained, averaging 2,422 per class, each sample being about 10 seconds long.
Specifically, the annotated audio samples are split to obtain N annotated audio data segments of a preset size. The annotated audio samples may be imported into a preset audio cutter; the duration of the cut audio data segments can be set in advance, and the audio cutter can perform batch cutting according to that duration.
Of course, the embodiment of the present invention places no restriction on the type of audio cutter.
It should be understood that different models have different preset requirements for training samples, and the embodiment of the present invention therefore places no restriction on the specific length of the audio segments.
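The cutting into uniform preset-length segments can be sketched as follows; dropping the short trailing remainder is one possible policy, assumed here only for illustration:

```python
def cut_segments(samples, segment_len):
    """Cut a 1-D sequence of audio samples into equal, non-overlapping
    segments of segment_len samples, dropping any short remainder,
    so that every training sample has the same preset length."""
    n = len(samples) // segment_len
    return [samples[i * segment_len:(i + 1) * segment_len] for i in range(n)]
```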
Step 103: process each annotated audio data segment into annotated sample audio-segment feature vectors of multiple preset dimensions to serve as the annotated sample set.
Preferably, step 103 further comprises:
Sub-step 1031: perform framing on each annotated audio data segment to obtain multiple framed annotated audio data segments of each annotated audio data segment.
Specifically, as shown in Fig. 1A, a speech signal is non-stationary on a macroscopic scale but stationary on a microscopic scale, exhibiting short-term stationarity (for the speech signal selected in the box in the figure, the signal can be considered approximately unchanged within 10-30 ms). The signal can therefore be divided into short segments for processing, each segment being called a frame (chunk). Of course, the duration of each segment is not limited to the 10-30 ms described above, and the embodiment of the present invention places no restriction on the frame duration.
Each annotated audio data segment is therefore further divided into smaller, frame-sized framed audio data.
Sub-step 1032: multiply each framed annotated audio data segment by a windowing function to obtain the annotated windowed audio data segment of each framed annotated audio data segment.
Specifically, during framing each frame repeats a part of its neighbour: the tail of the previous frame and the head of the current frame each take a portion that overlaps, after which windowing is applied. In this way the overall speech signal does not suffer excessive attenuation at the two ends of each frame as a result of windowing; the overlap between frames achieved during framing makes the audio signal after windowing more continuous.
The framed audio data obtained above is then windowed: the original audio signal, shown in the left part of Fig. 1B, is multiplied by the windowing function shown in the middle part of Fig. 1B to obtain the log spectrum of each frame of audio data in the frequency domain, shown in the right part of Fig. 1B. A signal that was originally non-periodic (such as the framed audio data) thus exhibits some characteristics of a periodic function, and the result is determined as the annotated windowed audio data segment of the framed audio data.
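The framing-with-overlap and windowing described above can be sketched as follows; the Hamming window and the specific frame/hop sizes are assumptions for illustration (10-30 ms frames are typical):

```python
import numpy as np

def frame_and_window(signal, frame_len, hop_len):
    """Split a signal into overlapping frames (hop_len < frame_len, so the
    tail of one frame overlaps the head of the next) and multiply each
    frame by a Hamming window to suppress edge discontinuities."""
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frames.append(signal[start:start + frame_len] * window)
    return np.array(frames)
```

With a 50% hop, every sample (away from the ends) falls inside two frames, which is exactly the overlap that keeps the windowed signal continuous.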
Sub-step 1033: perform a Mel transform on each annotated windowed audio data segment to obtain the annotated Mel spectrum data of each annotated audio data segment.
Further, in order to display the sound characteristics of the windowed audio data obtained after framing and windowing intuitively, a Mel transform needs to be performed to convert the audio data into annotated Mel spectrum data. The unit of frequency is the hertz (Hz), and the range of frequencies the human ear can hear is 20-20,000 Hz, but the ear does not perceive the Hz scale linearly. For example, if a person has adapted to a tone of 1,000 Hz and the pitch frequency is raised to 2,000 Hz, the ear perceives only a slight increase in pitch, nowhere near a doubling of frequency. If the ordinary frequency scale is converted to the Mel frequency scale, however, the ear's perception of frequency becomes a linear relationship; that is, on the Mel scale, if the Mel frequencies of two pieces of speech differ by a factor of two, the pitch the human ear perceives also differs by roughly a factor of two. This has the beneficial effect of making the audio data visualisable.
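The Hz-to-mel mapping behind this discussion is commonly given by m = 2595 · log10(1 + f / 700); the specific formula is a standard convention, not stated in the text. A sketch:

```python
import math

def hz_to_mel(f_hz):
    """Standard Hz-to-mel mapping: equal steps on the mel scale are
    perceived as roughly equal pitch intervals."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Note that hz_to_mel(2000) is well below 2 * hz_to_mel(1000), which is exactly the point made above: doubling the frequency does not double the perceived pitch.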
Sub-step 1034: convert each piece of annotated Mel spectrum data into a feature vector of a preset dimension to obtain the annotated sample audio-segment feature vector of each piece of annotated Mel spectrum data.
In this step, the Mel spectrum image data above is converted into feature vectors that a machine can identify. A common model for converting image data into machine-readable feature vectors is the BVLC GoogLeNet model; of course, in practical applications the conversion is not limited to the foregoing description, and the embodiment of the present invention places no restriction on it.
Preferably, step 1034 further comprises:
Sub-step 10341: determine the Mel spectrum data corresponding to each frame of audio data in the annotated Mel spectrum data as the sample framed Mel spectrum data.
In this step, the Mel spectrum data figure corresponding to each frame of audio in the annotated audio data segments obtained above is extracted and determined as the framed Mel spectrum data.
Sub-step 10342: convert the sample framed Mel spectrum data into sample framed audio feature vectors.
In this step, each piece of framed Mel spectrogram data is converted into a feature vector.
Specifically, the framed Mel spectrum image data above is converted into framed audio feature vectors by an image-feature-vector conversion model. A well-known conversion model is the BVLC GoogLeNet model, a 22-layer deep convolutional network that can detect the feature vectors of 1,000 different image types.
Of course, the image-feature conversion method is not limited to the foregoing description, and the embodiment of the present invention places no restriction on it.
Sub-step 10343: splice the sample framed audio feature vectors of a preset number of frames to obtain annotated sample audio-segment feature vectors of a preset dimension.
In this step, after the sample framed audio feature vectors of the framed Mel spectrum data are obtained in step 10342, multiple sample framed audio feature vectors are merged into one annotated sample audio-segment feature vector of a preset dimension. For example, each framed audio feature vector is a 128-dimensional feature vector for one second of audio; for audio processing, the information contained in one second is not enough to characterise the concrete type of the audio data, so context-adjacent framed audio feature vectors are merged. That is, the feature vectors corresponding to 3 seconds of audio data, i.e. 3 framed audio feature vectors, are spliced into a feature vector of 128 * 3 = 384 dimensions.
Of course, the preset dimension is not necessarily the 384 dimensions mentioned above; it may also be the preset-dimension audio feature vector obtained by combining five, or ten, framed audio feature vectors. The setting of the preset dimension depends mainly on whether the audio data contains enough information for subsequent processing, so the embodiment of the present invention places no restriction on the specific value of the preset dimension.
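The splicing of consecutive per-frame vectors into one wider segment vector can be sketched as follows (3 × 128 → 384, as in the example above; the counts are configurable):

```python
import numpy as np

def splice_frames(frame_vectors, context=3):
    """Concatenate `context` consecutive per-frame feature vectors
    (e.g. 3 x 128-d) into one segment vector (e.g. 384-d), giving each
    training sample temporal context. Leftover frames that do not fill
    a whole group are dropped."""
    n = len(frame_vectors) // context
    return np.array([np.concatenate(frame_vectors[i * context:(i + 1) * context])
                     for i in range(n)])
```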
Sub-step 1035: assemble the annotated sample audio-segment feature vectors into the annotated sample set.
In this step, all the annotated sample audio-segment feature vectors above are stored as one set, which serves as the annotated sample set.
Step 104: update the music-emotion label of each annotated sample audio-segment feature vector in the annotated sample set to obtain the annotated sample audio training set.
In this step, note that directly downloaded audio data typically carries strong noise (data noise and label noise respectively); if a music-emotion annotation model were trained on it directly, its accuracy would be low. The annotated sample audio training set is therefore further cleaned: a music-emotion model is trained with the annotated sample audio training set, and the label noise of the samples of each class is then cleaned with this model, finally yielding a high-quality music-emotion data set. The specific steps are as follows.
Preferably, step 104 further comprises:
Sub-step 1041: extract annotated sample audio-segment feature vectors from the annotated sample set according to a preset proportion and determine them as the training sample feature set.
In this step, suppose the annotated sample set contains 16,955 annotated sample audio-segment feature vectors in total, averaging 2,422 per music-emotion class, each sample being about 10 seconds long. A preset proportion (e.g. 50%) of them is taken, of which a part (e.g. 20% of the total) is extracted as the training sample feature set to serve as the second training sample features, and the remaining 30% serves as the test sample features.
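One way to sketch a proportional split of the annotated set (the fractions and the random seed are illustrative assumptions, matching the 20%/30% example only by parameter choice):

```python
import random

def split_annotated_set(samples, train_frac=0.4, seed=0):
    """Shuffle the annotated set and split it: train_frac of it becomes
    the initial training features, the rest the to-be-cleaned test
    features. With a 50% extract, train_frac=0.4 reproduces the 20%/30%
    split of the total described above."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    k = int(len(shuffled) * train_frac)
    return shuffled[:k], shuffled[k:]
```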
Sub-step 1042: train on the training sample feature set with a predetermined deep-learning method to obtain a second music-emotion annotation model.
In this step, the second training sample features are trained with a predetermined deep-learning algorithm to obtain the second music-emotion annotation model. The predetermined deep-learning algorithm may be a Softmax classifier; of course, in practical applications it is not limited to a Softmax classifier, and the embodiment of the present invention places no restriction on the specific deep-learning method.
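For reference, the Softmax function that such a classifier applies to its class scores can be sketched as follows (a standard, numerically stable formulation, not the patent's implementation):

```python
import numpy as np

def softmax(logits):
    """Map a vector of class scores to a probability distribution over
    the music-emotion classes."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()          # shift for numerical stability; result unchanged
    e = np.exp(z)
    return e / e.sum()
```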
Sub-step 1043: take the annotated sample audio-segment feature vectors remaining in the annotated sample set as the test sample feature set, and input the test sample feature set into the second music-emotion annotation model, so that the second music-emotion annotation model outputs the music-emotion label of each annotated sample audio-segment feature vector in the test sample feature set, generating the updated annotated sample set.
The remaining 30% of the 50% extracted from the 16,955 samples is divided into three parts as the test set to be cleaned, which is input into the trained second music-emotion annotation model.
In this step, each round of music-emotion annotation proceeds as follows: the test data that has just been annotated is added back into the training set and the model is trained again, generating an updated second music-emotion annotation model; then the next 10% of the test set is extracted, annotated, and after annotation put into the training set to train the updated second music-emotion annotation model a second time. This continues until the whole test set has been returned to the training set, at which point the data in the training set is the cleaned sample data, i.e. the updated annotated sample set.
Sub-step 1044: merge the updated annotated sample set with the training sample feature set and determine the result as the annotated sample audio training set.
In this step, the updated annotated sample set that has completed cleaning is merged with the training sample feature set to serve as the annotated sample audio training set.
It should be understood that repeatedly inputting unannotated sample data into the second music-emotion annotation model for annotation, and retraining the model with the training samples updated after each round, can effectively improve annotation accuracy: the larger the training set, the higher the accuracy of the model used for annotation. Through the second music-emotion annotation model obtained by repeated training, the music labels of all the annotated test sets, combined with the updated annotated sample set above, finally yield the annotated sample audio training set.
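The iterative annotate-and-retrain cleaning loop described above can be sketched as follows; `fit` and `predict` stand in for the real deep-learning classifier, and the round count is illustrative:

```python
def clean_labels(train, test, fit, predict, n_rounds=3):
    """Sketch of the label-cleaning loop: train a model, relabel one
    slice of the held-out set with it, fold the relabelled slice back
    into the training set, and retrain, until the test set is empty.
    `train` and `test` are lists of (sample, label) pairs."""
    train, test = list(train), list(test)
    chunk = max(1, len(test) // n_rounds)
    model = fit(train)
    while test:
        batch, test = test[:chunk], test[chunk:]
        relabelled = [(x, predict(model, x)) for x, _ in batch]
        train.extend(relabelled)
        model = fit(train)           # retrain on the enlarged set
    return train, model
```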
Step 105: train on the annotated sample audio training set with a deep-learning method to obtain the first music-emotion annotation model.
In this step, the annotated sample audio training set obtained above is trained again with the predetermined deep-learning method to finally obtain the first music-emotion annotation model. This effectively reduces the labour cost of manually annotating music labels in the training samples, increases the amount of training sample data, and improves both model-training efficiency and annotation accuracy.
In the embodiment of the present invention, the music emotion of sample audio data is annotated to obtain annotated audio samples; the annotated audio samples are cut into multiple annotated audio data segments of a preset length; each annotated audio data segment is processed into annotated sample audio-segment feature vectors of multiple preset dimensions to serve as an annotated sample set; the music-emotion label of each annotated sample audio-segment feature vector in the annotated sample set is updated to obtain an annotated sample audio training set; and the annotated sample audio training set is trained with a deep-learning method to obtain a first music-emotion annotation model, which can efficiently and accurately annotate audio data that carries no music-emotion label.
Referring to Fig. 2, a flow chart of an audio processing method provided by an embodiment of the present invention, the method may specifically include the following steps:
Step 201: receive a request to annotate target audio data with a music emotion.
In the embodiment of the present invention, a back-end server receives the music-emotion annotation request sent by a user through an application interface. The annotation request usually targets one or more of the numerous video data sets or audio data sets stored by the server, where an audio/video data set is usually stored by date, or may be a data set stored by uploading-user identifier. For example, the audios and videos uploaded by users in February are stored as one set and the officially uploaded audios and videos as another, and the annotation request is initiated for one or more selected sets.
In practical applications, the music-emotion annotation request is initiated for an audio data set or a video data set. For a video data set, the audio data in the video data set must be extracted as the target audio data; an audio data set is processed further directly as the target audio data.
The method of extracting the audio data from video data has been described in detail in step 101 and is not repeated here.
Of course, the concrete way in which the audio/video sets are stored is not limited to the foregoing description, and the embodiment of the present invention places no restriction on it.
By performing music-emotion analysis on large data sets in this way, massive short-video or short-audio data can be analysed for music emotion automatically and efficiently, so as to realise the purpose of personalised recommendation to users.
Step 202: according to the annotation request, annotate the music emotion of the target audio data using a music-emotion annotation model.
Preferably, step 202 further comprises:
Sub-step 2021: according to the annotation request, divide the target audio data into audio data segments of a preset length.
In the embodiment of the present invention, the extracted audio data is split to obtain N audio data segments of a preset size. The audio data above may be imported into a preset audio cutter; when cutting, the duration of the cut audio data segments can be selected manually, and the audio cutter can perform batch cutting.
Of course, the embodiment of the present invention places no restriction on the type of audio cutting method.
Sub-step 2022: process each audio data segment into feature vectors of a preset dimension.
In this step, each audio data segment obtained above is preprocessed and then converted into second audio feature vectors of the preset dimension, as specifically described below.
Preferably, sub-step 2022 further comprises:
Sub-step 20221: perform framing on each audio data segment to obtain framed audio data segments.
In this step, the second audio signal in each audio data segment above will undergo framing, windowing and the Mel transform.
The framing is as shown in Fig. 1A: a speech signal is non-stationary on a macroscopic scale but stationary on a microscopic scale, exhibiting short-term stationarity (as shown in the box, the speech signal can be considered approximately unchanged within 10-30 ms). The speech signal can therefore be divided into short segments for processing, each segment being called a frame (chunk). Of course, the duration of each segment is not limited to the 10-30 ms described above, and the embodiment of the present invention places no restriction on the frame duration.
Sub-step 20222: multiplying the framed audio data segments by a windowing function to obtain windowed audio data segments.
During framing, the frames are not extracted back to back; instead, adjacent frames overlap: the tail of the previous frame and the head of the current frame share a portion before windowing is applied. In this way the overall speech signal is not excessively attenuated at the two ends of each frame by the windowing process; the overlap between frames realized during framing makes the windowed audio signal more continuous.
The framed audio data segments obtained above are then windowed: the original audio signal, shown in the left part of Figure 1B, is multiplied by the windowing function shown in the middle part of Figure 1B to obtain the logarithmic spectrum of each frame of audio data in the frequency domain, shown in the right part of Figure 1B. In this way a speech signal that is originally non-periodic exhibits some characteristics of a periodic function, yielding the windowed audio data segments.
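The windowing step can be sketched as below; the Hamming window is an assumption for illustration, since the embodiment does not name a particular windowing function:

```python
import numpy as np

def window_frames(frames: np.ndarray) -> np.ndarray:
    """Multiply each overlapping frame by a Hamming window.

    The window tapers the two ends of each frame; because adjacent frames
    overlap, the tapered portions are compensated by neighbouring frames,
    so the overall signal is not excessively attenuated.
    """
    window = np.hamming(frames.shape[1])
    return frames * window  # broadcasts the window over every frame

frames = np.ones((4, 8))         # 4 toy frames of 8 samples each
windowed = window_frames(frames)
```

Each row of `windowed` is the original frame shaped by the window, ready for the spectral transform of the next sub-step.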
Sub-step 20223: performing a Mel transformation on the windowed audio data segments to obtain the Mel spectrum data of the audio data segment.
Further, in order to present the sound characteristics of the framed and windowed audio data intuitively, a Mel transformation is performed on the windowed audio data segments to convert the audio data into Mel spectrum data, which has the beneficial effect of displaying the sound characteristics intuitively.
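As a rough illustration of the Mel transformation: real implementations use triangular Mel filter banks, whereas the equal-width Mel-band pooling below is a simplification, and all function names are hypothetical:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Standard Mel-scale mapping (2595 * log10(1 + f/700))."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_spectrum(windowed_frames: np.ndarray, n_mels: int, sample_rate: int) -> np.ndarray:
    """Power spectrum per frame, pooled into Mel-spaced bands."""
    spec = np.abs(np.fft.rfft(windowed_frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(windowed_frames.shape[1], d=1.0 / sample_rate)
    # Equal-width bands on the Mel axis (triangular filters in practice)
    edges = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_mels + 1)
    band = np.digitize(hz_to_mel(freqs), edges[1:-1])
    return np.stack([np.bincount(band, weights=spec[i], minlength=n_mels)
                     for i in range(spec.shape[0])])

rng = np.random.default_rng(0)
mels = mel_spectrum(rng.normal(size=(98, 400)), n_mels=64, sample_rate=16000)
```

The result is one 64-band Mel spectrum per frame, i.e. the "Mel spectrogram" that the later sub-steps treat as an image.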
Sub-step 20223: converting the Mel spectrum data into a feature vector of a preset dimension.
Preferably, this sub-step further comprises:
Sub-step 202231: determining the Mel spectrum data corresponding to each frame of audio data in the Mel spectrum data as framed Mel spectrum data.
In this step, the Mel spectrum data corresponding to each frame of audio in the audio data segment obtained above is intercepted as framed Mel spectrum data; that is, the segmented Mel spectrogram data is determined as the framed Mel spectrum data.
Sub-step 202232: converting the framed Mel spectrum data into framed audio feature vectors.
In this step, each piece of framed Mel spectrogram data is converted into a feature vector.
Specifically, the framed Mel spectrum image data is converted into framed audio feature vectors by an image feature vector transformation model. A well-known image feature vector transformation model is the BVLC GoogLeNet model, a 22-layer deep convolutional network that can convert images of 1000 different categories into machine-readable feature vectors.
Of course, the image feature conversion method is not limited to the foregoing description, and the embodiments of the present invention are not limited thereto.
Sub-step 202233: splicing the framed audio feature vectors of a preset number of frames to obtain a feature vector of the preset dimension.
In this step, after the framed audio feature vectors of the framed Mel spectrum data are obtained, multiple framed audio feature vectors are merged into one audio feature vector of the preset dimension. For example, suppose each framed audio feature vector is a 128-dimensional feature vector corresponding to one second of audio data. For audio data processing, the information contained in one second is not enough to characterize the concrete type of the audio data, so contextually adjacent framed audio feature vectors are merged: the feature vector corresponding to 3 seconds of audio data is obtained by splicing 3 framed audio feature vectors, generating a feature vector of 128*3=384 dimensions.
Of course, the preset dimension is not necessarily the 384 dimensions mentioned above; it may also be the combination of five framed audio feature vectors, or an audio feature vector composed of ten framed audio feature vectors. The setting of the preset dimension depends mainly on whether the audio data contains enough information for subsequent processing; therefore, the embodiment of the present invention does not limit the specific value of the preset dimension.
Sub-step 2023: inputting the feature vectors into the music emotion labeling model, so that the music emotion labeling model outputs the music emotion label of each feature vector.
In this step, the spliced audio feature vectors of the preset dimension are input into the trained first music emotion labeling model, which outputs the music emotion label of each audio feature vector.
Music emotions may be divided into Happy, Tender, Exciting, Funny, Sad, Scary, and Angry.
Of course, the music emotions are not limited to those listed above, and the present invention places no restriction on this.
Sub-step 2024: obtaining the number of music emotion labels of each audio data segment in the target audio data.
In this step, each audio data item of non-fixed duration is processed into multiple audio feature vectors; after a music emotion label is output for each audio feature vector, the entire audio data item carries multiple music emotion labels. A voting mechanism is therefore needed: the number of occurrences of each music emotion label among the audio feature vectors of the entire audio data item is counted.
The audio data is divided into small segments of 3-5 s (segments of 8-10 s are also commonly used); each small segment then undergoes framing, windowing, and Mel transformation to obtain image feature data, and each piece of image feature data yields one music emotion label, so a piece of audio data may include multiple music emotion labels.
For example, for video data with a duration of 5 minutes, each 3-second data segment may correspond to a different label, so the entire 5-minute video data is composed of 100 type labels, and the number of occurrences of each type is obtained.
Sub-step 2025: determining the music emotion corresponding to the music emotion label with the maximum count, or whose count is greater than or equal to a preset threshold, as the music emotion of the target audio data.
In this step, as described above, after the counts corresponding to the 100 music emotion labels of the 5-minute video data are obtained, the most frequent music emotion label is determined as the music emotion label of the 5-minute video data; alternatively, the music emotion label counts are sorted and the top-N labels are taken as the music emotions of the audio data segment.
Of course, in practical applications, a count threshold may also be preset; when the count of a certain music emotion label exceeds that threshold, the label is set as a music emotion label of the video data. For example, among 100 music emotion labels, suppose the preset label count threshold is 30 and the labels with more than 30 occurrences are rock music and traditional music; then the music emotion labels of the audio data are rock music and traditional music, these labels are determined as the music emotion labels of the video data corresponding to the audio data, and in subsequent recommendation operations they may be merged into traditional rock music.
The music emotion labeling method of the present invention is illustrated below by a specific example:
1) When labeling the music emotion of video data, the audio data of the video data is obtained first.
2) The acquired audio signal undergoes framing, windowing, and Mel transformation to obtain the Mel spectrogram of the audio data.
3) The Mel spectrogram is input into a VGGish depth model to obtain a feature vector of the preset dimension for the Mel spectrogram.
4) The preset-dimension feature vectors are input into a music emotion labeling model trained in advance with the machine learning algorithm Softmax Classifier, and the preset type label of each preset-dimension feature vector is obtained, such as hip-hop, rock, pop, folk, classical, or electronic.
5) Finally, the type whose music emotion label count is the largest in the audio data, or the labels exceeding a preset threshold, are determined as the music emotion corresponding to the audio data.
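The five steps above can be stitched together as in the following sketch, where the VGGish feature extractor and the Softmax classifier are replaced by toy stand-ins, since the real models are outside the scope of this illustration:

```python
from collections import Counter

def label_video_audio(audio_segments, extract_features, classify):
    """One label per segment, then a majority vote over the whole clip."""
    labels = [classify(extract_features(seg)) for seg in audio_segments]
    winner, _ = Counter(labels).most_common(1)[0]
    return winner

# Toy stand-ins: "features" are segment means, the "model" thresholds them.
segments = [[0.9, 0.8], [0.7, 0.9], [0.1, 0.2]]
emotion = label_video_audio(
    segments,
    extract_features=lambda seg: sum(seg) / len(seg),
    classify=lambda x: "Exciting" if x > 0.5 else "Tender",
)
```

In a real pipeline, `extract_features` would be the Mel-spectrogram-plus-VGGish front end and `classify` the trained emotion model; the voting logic is unchanged.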
The embodiment of the invention provides an audio processing method. After a request to perform music emotion labeling on target audio data is received, the target audio data is obtained and, according to the labeling request, divided into audio data segments of a preset length; each audio data segment is processed into a feature vector of a preset dimension; the feature vectors are input into the trained first music emotion labeling model to label the music emotion labels; the number of music emotion labels of each audio data segment in the target audio data is obtained; and according to the music emotion label counts, the final music emotion of the corresponding video data is determined. This achieves efficient batch labeling of the music emotion of audio in video data, saves the labor cost of music emotion labeling, and improves music emotion labeling efficiency.
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations, but those skilled in the art should understand that the embodiments of the present invention are not limited by the described sequence of actions, because according to the embodiments of the present invention some steps may be performed in other sequences or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the related actions are not necessarily required by the embodiments of the present invention.
Referring to Fig. 3, which is a structural schematic diagram of a model generating apparatus 300 provided in an embodiment of the present invention, the apparatus may specifically comprise the following modules:
An annotated audio sample generation module 301, configured to label the music emotion of sample audio data to obtain annotated audio samples;
An annotated audio data segment obtaining module 302, configured to cut the annotated audio samples into multiple annotated audio data segments of a preset length;
An annotated sample set determining module 303, configured to process each annotated audio data segment into multiple annotated sample audio segment feature vectors of a preset dimension, to serve as an annotated sample set;
Preferably, the annotated sample set determining module 303 comprises:
An annotated audio data segment generation submodule, configured to perform frame-splitting processing on each annotated audio data segment to obtain multiple framed annotated audio data segments of each annotated audio data segment;
An annotated windowed audio data segment generation submodule, configured to multiply each framed annotated audio data segment by a windowing function to obtain the annotated windowed audio data segment of each framed annotated audio data segment;
An annotated Mel spectrum data obtaining submodule, configured to perform a Mel transformation on each annotated windowed audio data segment to obtain the annotated Mel spectrum data of each annotated audio data segment;
An annotated sample audio segment feature vector obtaining submodule, configured to convert each piece of annotated Mel spectrum data into a feature vector of the preset dimension, to obtain the annotated sample audio segment feature vector of each piece of annotated Mel spectrum data;
Preferably, the annotated sample audio segment feature vector obtaining submodule comprises:
A sample framed Mel spectrum data determination unit, configured to determine the Mel spectrum data corresponding to each frame of audio data in the annotated Mel spectrum data as sample framed Mel spectrum data;
A sample framed audio feature vector obtaining unit, configured to convert the sample framed Mel spectrum data into sample framed audio feature vectors;
An annotated sample audio segment feature vector obtaining unit, configured to splice the sample framed audio feature vectors of a preset number of frames to obtain annotated sample audio segment feature vectors of the preset dimension.
An annotated sample set determining submodule, configured to combine the annotated sample audio segment feature vectors into the annotated sample set.
An annotated sample audio training set generation module 304, configured to update the music emotion labels of the annotated sample audio segment feature vectors in the annotated sample set to obtain an annotated sample audio training set;
Preferably, the annotated sample audio training set generation module 304 comprises:
A second training sample feature generation submodule, configured to extract annotated sample audio segment feature vectors from the annotated sample set according to a preset ratio, and determine them as a training sample feature set;
A second music emotion labeling model training module, configured to train the training sample feature set by a predetermined deep learning method to obtain a second music emotion labeling model;
An updated annotated sample set generation submodule, configured to take the annotated sample audio segment feature vectors remaining in the annotated sample set as a test sample feature set, and input the test sample feature set into the second music emotion labeling model, so that the second music emotion labeling model outputs the music emotion label of each annotated sample audio segment feature vector in the test sample feature set, generating an updated annotated sample set;
An annotated sample audio training set obtaining submodule, configured to merge the updated annotated sample set with the training sample feature set, and determine the result as the annotated sample audio training set.
A first music emotion labeling model training module 305, configured to train the annotated sample audio training set using a deep learning method to obtain a first music emotion labeling model.
In this embodiment of the present invention, the annotated audio sample generation module labels the music emotion of sample audio data to obtain annotated audio samples; the annotated audio data segment obtaining module cuts the annotated audio samples into multiple annotated audio data segments of a preset length; the annotated sample set determining module processes each annotated audio data segment into multiple annotated sample audio segment feature vectors of a preset dimension, which serve as an annotated sample set; the annotated sample audio training set generation module updates the music emotion labels of the annotated sample audio segment feature vectors in the annotated sample set to obtain an annotated sample audio training set; and the first music emotion labeling model training module trains the annotated sample audio training set using a deep learning method to obtain a first music emotion labeling model, which can efficiently and accurately apply music emotion labels to audio data that has no music emotion label.
Optionally, in another embodiment, as shown in Figure 4, an audio processing apparatus 400 is provided; the apparatus comprises:
A music emotion labeling request receiving module 401, configured to receive a request to perform music emotion labeling on target audio data;
A music emotion labeling module 402, configured to label the music emotion of the target audio data according to the labeling request using a music emotion labeling model.
Preferably, the music emotion labeling module 402 comprises:
An audio data segment obtaining submodule, configured to divide the target audio data into audio data segments of a preset length according to the labeling request;
A feature vector obtaining unit, configured to process each audio data segment into an audio segment feature vector of a preset dimension;
Preferably, the feature vector obtaining unit comprises:
A framed audio data segment obtaining unit, configured to perform frame-splitting processing on each audio data segment to obtain framed audio data segments;
A windowed audio data segment obtaining unit, configured to multiply the framed audio data segments by a windowing function to obtain windowed audio data segments;
A Mel spectrum data obtaining unit, configured to perform a Mel transformation on the windowed audio data segments to obtain the Mel spectrum data of the audio data segment;
A feature vector obtaining unit, configured to convert the Mel spectrum data into the audio segment feature vector of the preset dimension.
Preferably, the feature vector obtaining unit comprises:
A framed Mel spectrum data determination subunit, configured to determine the Mel spectrum data corresponding to each frame of audio data in the Mel spectrum data as framed Mel spectrum data;
A framed audio feature vector obtaining subunit, configured to convert the framed Mel spectrum data into framed audio feature vectors;
A feature vector obtaining subunit, configured to splice the framed audio feature vectors of a preset number of frames to obtain the feature vector of the preset dimension.
A music emotion label obtaining submodule, configured to input the feature vectors into the music emotion labeling model, so that the music emotion labeling model outputs the music emotion label of each feature vector;
A music emotion label count obtaining submodule, configured to obtain the number of music emotion labels of each audio data segment in the target audio data;
A music emotion label determination submodule, configured to determine the music emotion corresponding to the music emotion label with the maximum count, or whose count is greater than or equal to a preset threshold, as the music emotion of the target audio data.
In this embodiment of the present invention, the music emotion labeling request receiving module receives a request to perform music emotion labeling on target audio data, and the music emotion labeling module labels the music emotion of the target audio data according to the labeling request using a music emotion labeling model. This achieves efficient batch labeling of the music emotion of audio in video data, saves the labor cost of music emotion labeling, and improves music emotion labeling efficiency.
As for the apparatus embodiments, since they are basically similar to the method embodiments, the description is relatively simple; for related parts, refer to the description of the method embodiments.
In an embodiment of the present invention, upon receiving a video search request input by a user, the labels and label types in the video search request are first annotated, the labels and label types are input into a video semantic label independence model, semantically independent labels are screened out, and a video search is performed on the semantically independent labels to obtain videos matching the semantically independent labels. The embodiment of the present invention searches according to the screened semantically independent labels, reducing irrelevant video search results recalled due to mistakenly searched labels, thereby improving the accuracy of video search.
Optionally, an embodiment of the present invention further provides a terminal, comprising a processor and a memory, with a computer program stored on the memory and runnable on the processor. When executed by the processor, the computer program implements each process of the above model generating method or audio processing method embodiments and can achieve the same technical effect; to avoid repetition, details are not described here again.
Optionally, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements each process of the above model generating method or audio processing method embodiments and can achieve the same technical effect; to avoid repetition, details are not described here again. The computer-readable storage medium is, for example, a read-only memory (Read-Only Memory, ROM for short), a random access memory (Random Access Memory, RAM for short), a magnetic disk, or an optical disc.
Each embodiment in this specification is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts between the embodiments, reference may be made to one another.
It should be understood by those skilled in the art that the embodiments of the present invention may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present invention may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing terminal device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing terminal device, so that a series of operation steps are executed on the computer or other programmable terminal device to produce computer-implemented processing, and the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although the preferred embodiments of the embodiments of the present invention have been described, once those skilled in the art learn the basic inventive concept, additional changes and modifications can be made to these embodiments. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present invention.
Finally, it should be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another, without necessarily requiring or implying any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device including a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to the process, method, article, or terminal device. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or terminal device including the element.
The model generating method, audio processing method, apparatus, terminal, and computer-readable storage medium provided by the present invention have been described in detail above. Specific examples have been used herein to illustrate the principles and implementations of the present invention; the above descriptions of the embodiments are only intended to help understand the method of the present invention and its core ideas. Meanwhile, for those skilled in the art, there will be changes in the specific implementation and application scope according to the ideas of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.
Claims (18)
1. A model generating method, comprising:
labeling the music emotion of sample audio data to obtain annotated audio samples;
cutting the annotated audio samples into multiple annotated audio data segments of a preset length;
processing each annotated audio data segment into multiple annotated sample audio segment feature vectors of a preset dimension, to serve as an annotated sample set;
updating the music emotion labels of the annotated sample audio segment feature vectors in the annotated sample set to obtain an annotated sample audio training set; and
training the annotated sample audio training set using a deep learning method to obtain a first music emotion labeling model.
2. The method according to claim 1, wherein updating the music emotion labels of the annotated sample audio segment feature vectors in the annotated sample set to obtain the annotated sample audio training set comprises:
extracting annotated sample audio segment feature vectors from the annotated sample set according to a preset ratio, and determining them as a training sample feature set;
training the training sample feature set by a predetermined deep learning method to obtain a second music emotion labeling model;
taking the annotated sample audio segment feature vectors remaining in the annotated sample set as a test sample feature set, and inputting the test sample feature set into the second music emotion labeling model, so that the second music emotion labeling model outputs the music emotion label of each annotated sample audio segment feature vector in the test sample feature set, generating an updated annotated sample set; and
merging the updated annotated sample set with the training sample feature set, and determining the result as the annotated sample audio training set.
3. The method according to claim 1, wherein processing each annotated audio data segment into multiple annotated sample audio segment feature vectors of a preset dimension, to serve as an annotated sample set, comprises:
performing frame-splitting processing on each annotated audio data segment to obtain multiple framed annotated audio data segments of each annotated audio data segment;
multiplying each framed annotated audio data segment by a windowing function to obtain the annotated windowed audio data segment of each framed annotated audio data segment;
performing a Mel transformation on each annotated windowed audio data segment to obtain the annotated Mel spectrum data of each annotated audio data segment;
converting each piece of annotated Mel spectrum data into a feature vector of the preset dimension to obtain the annotated sample audio segment feature vector of each piece of annotated Mel spectrum data; and
combining the annotated sample audio segment feature vectors into the annotated sample set.
4. The method according to claim 3, wherein converting each piece of annotated Mel spectrum data into a feature vector of the preset dimension to obtain the annotated sample audio segment feature vector of each piece of annotated Mel spectrum data comprises:
determining the Mel spectrum data corresponding to each frame of audio data in the annotated Mel spectrum data as sample framed Mel spectrum data;
converting the sample framed Mel spectrum data into sample framed audio feature vectors; and
splicing the sample framed audio feature vectors of a preset number of frames to obtain the annotated sample audio segment feature vector of the preset dimension.
5. An audio processing method, comprising:
receiving a request to perform music emotion labeling on target audio data; and
labeling the music emotion of the target audio data according to the labeling request using a music emotion labeling model, wherein the music emotion labeling model is obtained using the method according to any one of claims 1 to 4.
6. The method according to claim 5, wherein labeling the music emotion of the target audio data according to the labeling request using the music emotion labeling model comprises:
dividing the target audio data into audio data segments of a preset length according to the labeling request;
processing each audio data segment into a feature vector of a preset dimension;
inputting the feature vectors into the music emotion labeling model, so that the music emotion labeling model outputs the music emotion label of each feature vector;
obtaining the number of music emotion labels of each audio data segment in the target audio data; and
determining the music emotion corresponding to the music emotion label with the maximum count, or whose count is greater than or equal to a preset threshold, as the music emotion of the target audio data.
7. The method according to claim 6, wherein processing each audio data segment into a feature vector of a preset dimension comprises:
performing frame-splitting processing on each audio data segment to obtain framed audio data segments;
multiplying the framed audio data segments by a windowing function to obtain windowed audio data segments;
performing a Mel transformation on the windowed audio data segments to obtain the Mel spectrum data of the audio data segment; and
converting the Mel spectrum data into the feature vector of the preset dimension.
8. The method according to claim 7, wherein converting the Mel spectrum data into the feature vector of the preset dimension comprises:
determining the Mel spectrum data corresponding to each frame of audio data in the Mel spectrum data as framed Mel spectrum data;
converting the framed Mel spectrum data into framed audio feature vectors; and
splicing the framed audio feature vectors of a preset number of frames to obtain the feature vector of the preset dimension.
9. A model generation apparatus, comprising:
an annotated audio sample generation module, configured to annotate the music emotion of sample audio data to obtain an annotated audio sample;
an annotated audio data segment obtaining module, configured to cut the annotated audio sample into multiple annotated audio data segments of a preset length;
an annotated sample set determination module, configured to process each annotated audio data segment into an annotated sample feature vector of a preset dimension, the vectors forming an annotated sample set;
an annotated sample audio training set generation module, configured to update the music emotion label of each annotated sample audio segment feature vector in the annotated sample set to obtain an annotated sample audio training set; and
a first music emotion labeling model training module, configured to train on the annotated sample audio training set using a deep learning method to obtain a first music emotion labeling model.
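The data-preparation flow of claim 9 (segment each annotated track, featurize each segment, propagate the track label to its segments) can be sketched as below. The mean/std "feature vector" and the toy tracks are placeholders; a real system would use the Mel features of claims 7-8 and train a deep network on the result.

```python
import numpy as np

def cut_segments(audio, seg_len):
    """Cut an annotated track into fixed-length segments; each segment
    inherits the track-level emotion label."""
    n = len(audio) // seg_len
    return [audio[i * seg_len:(i + 1) * seg_len] for i in range(n)]

def build_training_set(tracks, labels, seg_len, featurize):
    """Assemble the annotated sample set: segment every track, featurize
    each segment, and replicate the track label onto its segments."""
    X, y = [], []
    for audio, label in zip(tracks, labels):
        for seg in cut_segments(audio, seg_len):
            X.append(featurize(seg))
            y.append(label)
    return np.array(X), np.array(y)

# Toy stand-ins: a 2-dim mean/std feature and two synthetic "tracks".
featurize = lambda seg: np.array([seg.mean(), seg.std()])
tracks = [np.random.randn(3000), np.random.randn(4500)]
X, y = build_training_set(tracks, ["happy", "sad"], seg_len=1000, featurize=featurize)
print(X.shape, list(y))  # (7, 2) ['happy', 'happy', 'happy', 'sad', 'sad', 'sad', 'sad']
```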
10. The apparatus according to claim 9, wherein the annotated sample audio training set generation module comprises:
a second training sample feature generation submodule, configured to extract annotated sample audio segment feature vectors from the annotated sample set according to a preset ratio and determine them as a training sample feature set;
a second music emotion labeling model training submodule, configured to train on the training sample feature set using a predetermined deep learning method to obtain a second music emotion labeling model;
an updated annotated sample set generation submodule, configured to use the remaining annotated sample audio segment feature vectors in the annotated sample set as a test sample feature set, and to input the test sample feature set into the second music emotion labeling model, so that the second music emotion labeling model outputs the music emotion label of each annotated sample audio segment feature vector in the test sample feature set, generating an updated annotated sample set; and
an annotated sample audio training set acquisition submodule, configured to merge the updated annotated sample set with the training sample feature set and determine the result as the annotated sample audio training set.
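Claim 10 describes a self-training style loop: train a second model on a ratio-selected subset, relabel the held-out remainder with it, then merge both splits into the final training set. The sketch below substitutes a nearest-centroid classifier for the patent's (unspecified) deep learning method purely so the flow is runnable; the split ratio and data are made up.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Toy stand-in for the 'second music emotion labeling model':
    one centroid per class instead of a deep network."""
    classes = sorted(set(y))
    centroids = np.stack([X[np.array(y) == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(model, X):
    classes, centroids = model
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return [classes[i] for i in dists.argmin(axis=1)]

def expand_training_set(X, y, ratio=0.5, seed=0):
    """Claim 10's flow: take `ratio` of the segments as the training split,
    relabel the rest with the trained model, merge both splits."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    k = int(len(X) * ratio)
    train_idx, rest_idx = idx[:k], idx[k:]
    model = nearest_centroid_fit(X[train_idx], [y[i] for i in train_idx])
    new_labels = nearest_centroid_predict(model, X[rest_idx])
    X_full = np.concatenate([X[train_idx], X[rest_idx]])
    y_full = [y[i] for i in train_idx] + new_labels
    return X_full, y_full

# Two well-separated synthetic clusters standing in for segment features.
X = np.vstack([np.random.randn(20, 2) + 3, np.random.randn(20, 2) - 3])
y = ["happy"] * 20 + ["sad"] * 20
X_full, y_full = expand_training_set(X, y, ratio=0.5)
print(len(X_full), len(y_full))  # 40 40
```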
11. The apparatus according to claim 9, wherein the annotated sample set determination module comprises:
an annotated audio data segment generation submodule, configured to perform frame division on each annotated audio data segment to obtain multiple framed annotated audio data segments of each annotated audio data segment;
an annotated windowed audio data segment generation submodule, configured to multiply each framed annotated audio data segment by a window function to obtain the annotated windowed audio data segment of each framed annotated audio data segment;
an annotated Mel spectrum data obtaining submodule, configured to apply a Mel transform to each annotated windowed audio data segment to obtain the annotated Mel spectrum data of each annotated audio data segment;
an annotated sample audio segment feature vector obtaining submodule, configured to convert each piece of annotated Mel spectrum data into a feature vector of a preset dimension, obtaining the annotated sample audio segment feature vector of each piece of annotated Mel spectrum data; and
an annotated sample set determination submodule, configured to combine the annotated sample audio segment feature vectors into the annotated sample set.
12. The apparatus according to claim 11, wherein the annotated sample audio segment feature vector obtaining submodule comprises:
a sample framed Mel spectrum data determination unit, configured to determine the Mel spectrum data corresponding to each frame of audio data in the annotated Mel spectrum data as sample framed Mel spectrum data;
a sample framed audio feature vector acquisition unit, configured to convert the sample framed Mel spectrum data into sample framed audio feature vectors; and
an annotated sample audio segment feature vector obtaining unit, configured to splice the sample framed audio feature vectors of a preset number of frames to obtain the annotated sample audio segment feature vector of the preset dimension.
13. An audio processing apparatus, comprising:
a music emotion annotation request receiving module, configured to receive an annotation request for performing music emotion annotation on target audio data; and
a music emotion annotation module, configured to annotate, according to the annotation request and using a music emotion labeling model, the music emotion of the target audio data.
14. The apparatus according to claim 13, wherein the music emotion annotation module comprises:
an audio data segment acquisition submodule, configured to divide, according to the annotation request, the target audio data into audio data segments of a preset length;
a feature vector acquisition submodule, configured to process each audio data segment into a feature vector of a preset dimension;
a music emotion label acquisition submodule, configured to input the feature vector into the music emotion labeling model, so that the music emotion labeling model outputs the music emotion label of the feature vector;
a music emotion label count acquisition submodule, configured to obtain the count of the music emotion label of each audio data segment in the target audio data; and
a music emotion label determination submodule, configured to determine, as the music emotion of the target audio data, the music emotion corresponding to the label with the largest count, or to a music emotion label whose count is greater than or equal to a preset threshold.
15. The apparatus according to claim 14, wherein the feature vector acquisition submodule comprises:
a framed audio data segment obtaining unit, configured to perform frame division on each audio data segment to obtain framed audio data segments;
a windowed audio data segment obtaining unit, configured to multiply the framed audio data segments by a window function to obtain windowed audio data segments;
a Mel spectrum data obtaining unit, configured to apply a Mel transform to the windowed audio data segments to obtain the Mel spectrum data of the audio data segment; and
a feature vector obtaining unit, configured to convert the Mel spectrum data into the feature vector of the preset dimension.
16. The apparatus according to claim 15, wherein the feature vector obtaining unit comprises:
a framed Mel spectrum data determination subunit, configured to determine the Mel spectrum data corresponding to each frame of audio data in the Mel spectrum data as framed Mel spectrum data;
a framed audio feature vector obtaining subunit, configured to convert the framed Mel spectrum data into framed audio feature vectors; and
a feature vector obtaining subunit, configured to splice the framed audio feature vectors of a preset number of frames to obtain the feature vector of the preset dimension.
17. A terminal, comprising: a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the computer program, when executed by the processor, implements the steps of the model generation method according to any one of claims 1 to 4, or the steps of the audio processing method according to any one of claims 5 to 8.
18. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the model generation method according to any one of claims 1 to 4, or the steps of the audio processing method according to any one of claims 5 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910134036.1A CN110008372A (en) | 2019-02-22 | 2019-02-22 | Model generating method, audio-frequency processing method, device, terminal and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110008372A true CN110008372A (en) | 2019-07-12 |
Family
ID=67165970
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910134036.1A Pending CN110008372A (en) | 2019-02-22 | 2019-02-22 | Model generating method, audio-frequency processing method, device, terminal and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110008372A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108053836A (en) * | 2018-01-18 | 2018-05-18 | 成都嗨翻屋文化传播有限公司 | A kind of audio automation mask method based on deep learning |
US20180276540A1 (en) * | 2017-03-22 | 2018-09-27 | NextEv USA, Inc. | Modeling of the latent embedding of music using deep neural network |
CN108764372A (en) * | 2018-06-08 | 2018-11-06 | Oppo广东移动通信有限公司 | Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set |
CN108897829A (en) * | 2018-06-22 | 2018-11-27 | 广州多益网络股份有限公司 | Modification method, device and the storage medium of data label |
CN108962279A (en) * | 2018-07-05 | 2018-12-07 | 平安科技(深圳)有限公司 | New Method for Instrument Recognition and device, electronic equipment, the storage medium of audio data |
CN109147826A (en) * | 2018-08-22 | 2019-01-04 | 平安科技(深圳)有限公司 | Music emotion recognition method, device, computer equipment and computer storage medium |
KR101943075B1 (en) * | 2017-11-06 | 2019-01-28 | 주식회사 아티스츠카드 | Method for automatical tagging metadata of music content using machine learning |
- 2019-02-22: Application CN201910134036.1A filed in China; published as CN110008372A, status: Pending
Non-Patent Citations (1)
Title |
---|
Han Ning, "Research on Automatic Music Annotation Technology Based on Deep Neural Networks", China Master's Theses Full-text Database, Information Science and Technology Series * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111586494A (en) * | 2020-04-30 | 2020-08-25 | 杭州慧川智能科技有限公司 | Intelligent strip splitting method based on audio and video separation |
CN114025216A (en) * | 2020-04-30 | 2022-02-08 | 网易(杭州)网络有限公司 | Media material processing method, device, server and storage medium |
CN111586494B (en) * | 2020-04-30 | 2022-03-11 | 腾讯科技(深圳)有限公司 | Intelligent strip splitting method based on audio and video separation |
CN114025216B (en) * | 2020-04-30 | 2023-11-17 | 网易(杭州)网络有限公司 | Media material processing method, device, server and storage medium |
CN111898753A (en) * | 2020-08-05 | 2020-11-06 | 字节跳动有限公司 | Music transcription model training method, music transcription method and corresponding device |
CN111898753B (en) * | 2020-08-05 | 2024-07-02 | 字节跳动有限公司 | Training method of music transcription model, music transcription method and corresponding device |
CN112687280A (en) * | 2020-12-25 | 2021-04-20 | 浙江弄潮儿智慧科技有限公司 | Biodiversity monitoring system with frequency spectrum-time space interface |
CN112687280B (en) * | 2020-12-25 | 2023-09-12 | 浙江弄潮儿智慧科技有限公司 | Biodiversity monitoring system with frequency spectrum-time space interface |
CN113257283A (en) * | 2021-03-29 | 2021-08-13 | 北京字节跳动网络技术有限公司 | Audio signal processing method and device, electronic equipment and storage medium |
CN113257283B (en) * | 2021-03-29 | 2023-09-26 | 北京字节跳动网络技术有限公司 | Audio signal processing method and device, electronic equipment and storage medium |
CN113362218A (en) * | 2021-05-21 | 2021-09-07 | 北京百度网讯科技有限公司 | Data processing method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109977255A (en) | Model generating method, audio-frequency processing method, device, terminal and storage medium | |
CN110008372A (en) | Model generating method, audio-frequency processing method, device, terminal and storage medium | |
CN111488489B (en) | Video file classification method, device, medium and electronic equipment | |
CN105244026B (en) | A kind of method of speech processing and device | |
CN107526809B (en) | Method and device for pushing music based on artificial intelligence | |
CN110880198A (en) | Animation generation method and device | |
CN113658577B (en) | Speech synthesis model training method, audio generation method, equipment and medium | |
CN107680584B (en) | Method and device for segmenting audio | |
CN113573161B (en) | Multimedia data processing method, device, equipment and storage medium | |
CN111125384B (en) | Multimedia answer generation method and device, terminal equipment and storage medium | |
CN111508466A (en) | Text processing method, device and equipment and computer readable storage medium | |
CN109982137A (en) | Model generating method, video marker method, apparatus, terminal and storage medium | |
CN114842826A (en) | Training method of speech synthesis model, speech synthesis method and related equipment | |
CN109376145B (en) | Method and device for establishing movie and television dialogue database and storage medium | |
CN116416962A (en) | Audio synthesis method, device, equipment and storage medium | |
CN110889008B (en) | Music recommendation method and device, computing device and storage medium | |
CN114125506A (en) | Voice auditing method and device | |
Wu et al. | Cold start problem for automated live video comments | |
CN115578998A (en) | Speech synthesis method, electronic device, and storage medium | |
CN113112993B (en) | Audio information processing method and device, electronic equipment and storage medium | |
CN114155829A (en) | Speech synthesis method, speech synthesis device, readable storage medium and electronic equipment | |
CN113299271B (en) | Speech synthesis method, speech interaction method, device and equipment | |
CN113808579B (en) | Detection method and device for generated voice, electronic equipment and storage medium | |
CN117174069A (en) | Speech synthesis method, device, equipment and storage medium | |
CN116843805B (en) | Method, device, equipment and medium for generating virtual image containing behaviors |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190712 |