CN116486838A - Music emotion recognition method and system, electronic equipment and storage medium
- Publication number: CN116486838A
- Application number: CN202310572293.XA
- Authority: CN (China)
- Legal status: Pending
Classifications
- G10L25/63—Speech or voice analysis techniques specially adapted for estimating an emotional state
- G10L15/063—Training of speech recognition systems; creation of reference templates, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/16—Speech classification or search using artificial neural networks
- G10L25/03—Speech or voice analysis techniques characterised by the type of extracted parameters
Abstract
The application provides a music emotion recognition method and system, an electronic device and a storage medium, belonging to the technical field of digital medical technology. The method comprises the following steps: acquiring an audio sample set comprising a sample audio file and an initial emotion label; inputting the sample audio file into an audio transcription model to obtain an audio symbol score; inputting the sample audio file and the audio symbol score into an initial recognition model; extracting audio features from the sample audio file through an audio recognition sub-model to obtain audio domain features; extracting symbol features from the audio symbol score through a symbol recognition sub-model to obtain symbol domain features; determining a predicted emotion label of the sample audio file according to the audio domain features and the symbol domain features; determining a music emotion recognition model according to the predicted emotion label and the initial emotion label; and inputting a target audio file to be identified into the music emotion recognition model to obtain an emotion recognition result. The embodiment of the application improves the recognition accuracy of the emotion categories of music pieces.
Description
Technical Field
The present application relates to the field of digital medical technology, and in particular, to a music emotion recognition method and system, an electronic device, and a storage medium.
Background
In intelligent music therapy scenarios in the digital medical field, selecting music pieces with a required emotion type can play an auxiliary role in treating mental illness, supporting neurological rehabilitation and addressing other conditions. Music emotion recognition technology uses a computer to analyze and process the characteristics of a given music piece in order to recognize the emotion information it contains; by recognizing the emotion categories of different music pieces with this technology, the auxiliary role in intelligent music therapy scenarios can be better realized. However, existing music emotion recognition methods have drawbacks in recognizing the emotion categories of music pieces. For example, because the arrangement of musical symbols and the composition methods differ greatly between data sets formed from different music types, existing methods find it difficult to learn the influence of the note flow of a whole piece on its overall emotion, which reduces the recognition accuracy of emotion categories. Therefore, how to improve the recognition accuracy of emotion categories of music pieces has become an urgent technical problem to be solved.
Disclosure of Invention
The embodiments of the application mainly aim to provide a music emotion recognition method and system, an electronic device and a storage medium, with the goal of improving the recognition accuracy of emotion categories of music pieces.
To achieve the above object, a first aspect of an embodiment of the present application provides a music emotion recognition method, including:
acquiring an audio sample set, wherein the audio sample set comprises a sample audio file and an initial emotion tag of the sample audio file;
inputting the sample audio file into a pre-trained audio transcription model for format standardization processing to obtain an audio symbol music score;
inputting the sample audio file and the audio symbol score to a pre-constructed initial recognition model, wherein the initial recognition model comprises an audio recognition sub-model and a symbol recognition sub-model;
extracting audio features of the sample audio file through the audio recognition sub-model to obtain audio domain features;
extracting the symbol characteristics of the audio symbol music score through the symbol recognition sub-model to obtain symbol domain characteristics;
carrying out emotion classification on the sample audio file according to the audio domain characteristics and the symbol domain characteristics, and determining a predicted emotion label of the sample audio file;
parameter adjustment is carried out on the initial recognition model according to the predicted emotion label and the initial emotion label, so that a music emotion recognition model is obtained;
acquiring a target audio file to be identified, and inputting the target audio file into the music emotion recognition model for emotion recognition processing to obtain an emotion recognition result.
In some embodiments, the audio recognition sub-model includes an audio preprocessing module and an audio feature extraction module, and the audio feature extraction is performed on the sample audio file by the audio recognition sub-model to obtain audio domain features, including:
performing frequency spectrum conversion on the sample audio file through the audio preprocessing module to obtain a target sample frequency spectrum;
and carrying out frequency spectrum feature extraction on the target sample frequency spectrum through the audio feature extraction module to obtain the audio domain feature.
In some embodiments, the symbol recognition sub-model includes a symbol preprocessing module and a symbol feature extraction module, and the symbol feature extraction is performed on the audio symbol score by the symbol recognition sub-model to obtain a symbol domain feature, including:
performing note embedding processing on the audio symbol music score through the symbol preprocessing module to obtain a sample note sequence;
and extracting sequence features of the sample note sequence through the symbol feature extraction module to obtain the symbol domain features.
In some embodiments, the performing, by the symbol preprocessing module, note embedding processing on the audio symbol score to obtain a sample note sequence includes:
extracting information from each audio symbol in the audio symbol music score to obtain note information of each audio symbol, wherein the note information comprises note time information, note dynamics information and note pitch information;
acquiring the position information of each audio symbol in the audio symbol music score;
arranging the note time information according to the position information to obtain a first note sequence;
arranging the note dynamics information according to the position information to obtain a second note sequence;
arranging the note pitch information according to the position information to obtain a third note sequence;
and combining the first note sequence, the second note sequence and the third note sequence to obtain a sample note sequence.
In some embodiments, the initial recognition model further includes a feature fusion sub-model and a classifier, the emotion classifying the sample audio file according to the audio domain features and the symbol domain features, and determining a predicted emotion tag of the sample audio file includes:
Performing feature fusion on the audio domain features and the symbol domain features through the feature fusion sub-model to obtain sample target features;
carrying out emotion type prediction on the sample target characteristics through the classifier to obtain a label prediction result, wherein the label prediction result comprises sample emotion prediction values of the sample audio files belonging to different emotion labels respectively;
and determining the predictive emotion labels of the sample audio files by carrying out numerical comparison on all the sample emotion predictive values.
In some embodiments, the performing parameter adjustment on the initial recognition model according to the predicted emotion tag and the initial emotion tag to obtain a music emotion recognition model includes:
determining a model loss value according to the predicted emotion label and the initial emotion label;
and adjusting model parameters of the audio recognition sub-model and the symbol recognition sub-model according to the model loss value, and continuously training the adjusted initial recognition model based on the audio sample set until the model loss value meets a preset training ending condition so as to obtain a music emotion recognition model.
In some embodiments, the emotion recognition result includes a target emotion tag, and the inputting the target audio file into the music emotion recognition model to perform emotion recognition processing, to obtain the emotion recognition result includes:
Inputting the target audio file into the music emotion recognition model to predict emotion types, so as to obtain a target emotion predicted value; the target emotion predicted value is used for representing emotion categories to which the target audio file belongs;
and determining the target emotion labels of the target audio files by carrying out numerical comparison on all the target emotion predicted values.
To achieve the above object, a second aspect of the embodiments of the present application proposes a music emotion recognition system, including:
the sample set acquisition module is used for acquiring an audio sample set, wherein the audio sample set comprises a sample audio file and an initial emotion tag of the sample audio file;
the audio transcription module is used for inputting the sample audio file into a pre-trained audio transcription model for format standardization processing to obtain an audio symbol music score;
the model input module is used for inputting the sample audio file and the audio symbol music score into a pre-built initial recognition model, wherein the initial recognition model comprises an audio recognition sub-model and a symbol recognition sub-model;
the audio processing module is used for extracting audio features of the sample audio file through the audio recognition sub-model to obtain audio domain features;
the symbol processing module is used for extracting symbol characteristics of the audio symbol music score through the symbol recognition sub-model to obtain symbol domain characteristics;
the emotion classification module is used for performing emotion classification on the sample audio file according to the audio domain characteristics and the symbol domain characteristics, and determining a predicted emotion label of the sample audio file;
the parameter adjustment module is used for carrying out parameter adjustment on the initial recognition model according to the predicted emotion label and the initial emotion label to obtain a music emotion recognition model;
the emotion recognition module is used for acquiring a target audio file to be recognized, inputting the target audio file into the music emotion recognition model for emotion recognition processing, and obtaining an emotion recognition result.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, which includes a memory and a processor, the memory storing a computer program, the processor implementing the method according to the first aspect when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method of the first aspect.
According to the music emotion recognition method and system, the electronic equipment and the storage medium, an audio sample set is obtained, and the audio sample set comprises a sample audio file and an initial emotion label of the sample audio file; inputting the sample audio file into a pre-trained audio transcription model for format standardization processing to obtain an audio symbol music score; inputting a sample audio file and an audio symbol music score into a pre-constructed initial recognition model, wherein the initial recognition model comprises an audio recognition sub-model and a symbol recognition sub-model; extracting audio features of the sample audio file through the audio recognition sub-model to obtain audio domain features; extracting the symbol characteristics of the audio symbol music score through the symbol recognition sub-model to obtain symbol domain characteristics; carrying out emotion classification on the sample audio file according to the audio domain features and the symbol domain features, and determining a predicted emotion label of the sample audio file; parameter adjustment is carried out on the initial recognition model according to the predicted emotion label and the initial emotion label, and a music emotion recognition model is obtained; and acquiring a target audio file to be identified, inputting the target audio file into a music emotion recognition model for emotion recognition processing, and obtaining an emotion recognition result. Therefore, by the music emotion recognition method, the recognition accuracy of emotion categories of the music pieces can be effectively improved.
Drawings
FIG. 1 is a method flow chart of a music emotion recognition method provided in an embodiment of the present application;
FIG. 2 is a flow chart of the method of step S140 in FIG. 1;
FIG. 3 is a flow chart of the method of step S150 in FIG. 1;
FIG. 4 is a flow chart of the method of step S310 in FIG. 3;
FIG. 5 is a flow chart of the method of step S160 in FIG. 1;
FIG. 6 is a flow chart of the method of step S170 in FIG. 1;
FIG. 7 is a flow chart of the method of step S180 in FIG. 1;
fig. 8 is a schematic structural diagram of a music emotion recognition system according to an embodiment of the present application;
fig. 9 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although a functional module division is shown in the system block diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that in the block diagrams, or in an order different from that shown. The terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular sequence or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several terms used in this application are explained:
artificial intelligence (Artificial Intelligence, AI): a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Musical Instrument Digital Interface (MIDI): an industry-standard electronic communication protocol that defines note and performance codes for electronic musical instruments and playback devices (such as synthesizers), allowing electronic instruments, computers, mobile phones and other stage performance equipment to be connected, adjusted and synchronized with one another so that performance data can be exchanged in real time.
Convolutional neural network (Convolutional Neural Networks, CNN): a multi-layer supervised-learning neural network in which the convolutional layers and pooling (subsampling) layers of the hidden layers are the core modules that implement its feature extraction function. The network model uses gradient descent to minimize a loss function and adjusts the weight parameters of the network layer by layer through back-propagation, improving accuracy through repeated iterative training.
Token: a computer term referring both to a token (temporary credential) in computer authentication, which represents the right to perform certain operations, and to a token (lexical unit) in lexical analysis.
In intelligent music therapy scenarios in the digital medical field, selecting music pieces with a required emotion type can play an auxiliary role in treating mental illness, supporting neurological rehabilitation and addressing other conditions. Music emotion recognition (Music Emotion Recognition, MER) technology analyzes and processes the characteristics of a given music piece with a computer to recognize the emotion information it contains; recognizing the emotion categories of different music pieces with this technology helps to realize the auxiliary role in intelligent music therapy scenarios. In audio-domain music emotion recognition methods, the audio domain information refers to the audio waveform of the whole song, which contains not only the semantic information of the score but also information such as intensity, rhythm and timbre that is not contained in a symbolic representation; moreover, because different performers interpret the same score differently, audio-domain analysis can yield higher detection accuracy. Most existing audio-domain methods analyze a short-time Fourier transform spectrum or a Mel spectrum, and because such spectra resemble computer images, the analysis is mainly performed by deep neural network structures based on convolutional neural networks. However, this approach has the following drawbacks: (1) it requires a large amount of training data, and deep neural networks often need large amounts of well-labeled training data; (2) generalization between data sets of different types is poor, for example the performance gap of a music emotion recognition model between the classical-music MAESTRO data set and the pop-music EMOPIA data set is large; (3) because convolutional neural networks capture long-distance dependencies poorly, networks of insufficient depth find it difficult to learn the influence of the note flow of the whole piece on its overall emotion.
In symbol-domain music emotion recognition methods, the symbol domain information refers to a symbolic representation of the music. Most existing symbol-domain methods follow natural language processing approaches and treat the sequence of notes as a natural language sequence. Specifically, the notes are first converted into note embeddings, and the embedded features are then mapped onto a high-dimensional space. Long Short-Term Memory (LSTM) networks or Transformer structures are commonly used to analyze the embedded notes and capture the influence of the note flow of the whole piece on the musical emotion, so as to capture very long-distance dependencies in the note sequence. However, symbol-domain analysis has the following drawbacks: (1) because the arrangement of musical symbols and the composition methods differ greatly between data sets of different types, the model may show a sharp drop in performance when faced with unseen music types; (2) because symbol domain information does not contain additional information such as instrument timbre and the emotion injected by the performer during the performance, and the same score may express completely different emotions when performed by different people, such distinctions cannot be perceived by symbol-domain analysis alone.
Therefore, existing music emotion recognition methods have drawbacks in recognizing the emotion categories of music pieces, which reduces the recognition accuracy of those emotion categories. How to improve the recognition accuracy of emotion categories of music pieces has thus become an urgent technical problem to be solved.
On this basis, the embodiments of the application provide a music emotion recognition method and system, an electronic device and a storage medium, aiming to improve the recognition accuracy of emotion categories of music pieces.
The embodiment of the application can acquire and process the related data based on the artificial intelligence technology. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The music emotion recognition method provided in the embodiments of the application may be applied to a terminal, to a server, or to software running in a terminal or a server. In some embodiments, the terminal may be a smartphone, a tablet computer, a notebook computer, a desktop computer, or the like; the server may be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data and artificial intelligence platforms; the software may be an application that implements the music emotion recognition method, but is not limited to the above forms.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It should be noted that, in each specific embodiment of the present application, when related processing is required according to user information, user behavior data, user history data, user location information, and other data related to user identity or characteristics, permission or consent of the user is obtained first, and the collection, use, processing, and the like of these data comply with related laws and regulations and standards. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through a popup window or a jump to a confirmation page or the like, and after the independent permission or independent consent of the user is explicitly acquired, necessary user related data for enabling the embodiment of the application to normally operate is acquired.
Referring to fig. 1, fig. 1 is an optional flowchart of a music emotion recognition method according to an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S110 to S180:
step S110, an audio sample set is obtained, wherein the audio sample set comprises a sample audio file and an initial emotion tag of the sample audio file;
step S120, inputting a sample audio file into a pre-trained audio transcription model for format standardization processing to obtain an audio symbol music score;
Step S130, inputting the sample audio file and the audio symbol music score into a pre-constructed initial recognition model, wherein the initial recognition model comprises an audio recognition sub-model and a symbol recognition sub-model;
step S140, extracting audio features of the sample audio file through the audio recognition sub-model to obtain audio domain features;
step S150, extracting the symbol characteristics of the audio symbol music score through the symbol recognition sub-model to obtain symbol domain characteristics;
step S160, carrying out emotion classification on the sample audio file according to the audio domain features and the symbol domain features, and determining a predicted emotion label of the sample audio file;
step S170, carrying out parameter adjustment on the initial recognition model according to the predicted emotion label and the initial emotion label to obtain a music emotion recognition model;
step S180, obtaining a target audio file to be identified, inputting the target audio file into a music emotion recognition model for emotion recognition processing, and obtaining an emotion recognition result.
In steps S110 to S180 illustrated in the embodiment of the present application, due to the drawbacks of the existing music emotion recognition method in recognizing the emotion type of the music piece, the detection precision of the emotion type of the music piece is not enough, and in the embodiment of the present application, an audio sample set is obtained, where the audio sample set includes a sample audio file and an initial emotion tag of the sample audio file; inputting the sample audio file into a pre-trained audio transcription model for format standardization processing to obtain an audio symbol music score; inputting a sample audio file and an audio symbol music score into a pre-constructed initial recognition model, wherein the initial recognition model comprises an audio recognition sub-model and a symbol recognition sub-model; extracting audio features of the sample audio file through the audio recognition sub-model to obtain audio domain features; extracting the symbol characteristics of the audio symbol music score through the symbol recognition sub-model to obtain symbol domain characteristics; carrying out emotion classification on the sample audio file according to the audio domain features and the symbol domain features, and determining a predicted emotion label of the sample audio file; parameter adjustment is carried out on the initial recognition model according to the predicted emotion label and the initial emotion label, and a music emotion recognition model is obtained; and acquiring a target audio file to be identified, inputting the target audio file into a music emotion recognition model for emotion recognition processing, and obtaining an emotion recognition result. The method and the device can improve the accuracy of identifying the emotion categories of the music pieces.
It should be noted that the application scenario in the embodiments of the present application may include a user-side device and a server-side device, where the user-side device is configured to send a target audio file to be identified to the server-side device, and the server-side device is configured to execute the music emotion recognition method provided in the embodiments of the present application after obtaining the target audio file input by the user-side device. Specifically, taking an intelligent music therapy scenario in the digital medical field as an example, by adopting this music emotion recognition method with higher recognition accuracy, a music piece that better meets the emotion requirements can be selected for a patient so as to play an auxiliary role in music therapy.
In step S110 of some embodiments, a set of audio samples is obtained, the set of audio samples including a plurality of sample audio files and an initial emotion tag corresponding to each sample audio file.
It should be noted that the audio sample set may be an existing music data set, such as the POP909 data set, the Million Song Dataset, the MAESTRO data set consisting mainly of classical music, or the EMOPIA data set consisting mainly of popular music; the audio sample set may also be a pre-constructed set, that is, sample audio files may be obtained from other devices, networks, or storage spaces such as databases and data carriers. The music content of the sample audio files includes, but is not limited to, piano music, guitar music and drum music, and may be determined based on the requirements of the actual application scenario, which is not limited here. The audio files of different music types and their corresponding pre-labeled initial emotion labels are integrated to obtain the audio sample set.
It should be noted that the sample audio files may be audio files of different music types, and the music types may be classical music types, popular music types, and the like.
It should be noted that the initial emotion label represents one emotion category of the sample audio file and may be, for example, a happy label, a sad label or a soothing label. The different initial emotion labels in the acquired audio sample set may be mapped to corresponding numeric labels, for example the happy label may be labeled 1 and the sad label labeled 2.
In step S120 of some embodiments, in order to capture in depth the influence of the overall note flow of the sample audio file on the musical emotion, the sample audio file is input into a pre-trained audio transcription model for format normalization processing to obtain an audio symbol score, so that music emotion recognition of the sample audio file in the symbol domain can be performed on the basis of the audio symbol score. Compared with prior methods that require complete lyric annotation, converting the sample audio file into score form reduces the difficulty of recognition while maintaining recognition accuracy, thereby improving recognition efficiency.
It should be noted that an audio symbol score refers to a digitized/symbolic representation of the score corresponding to a sample audio file; it may be, for example, a MIDI score.
It should be noted that, since one emotion category corresponds to one sample audio file, the audio symbol score also corresponds to the initial emotion label.
It should be noted that the audio transcription model is trained as follows. A transcription training set is constructed, which includes a plurality of transcription audio recordings and a sample MIDI score corresponding to each recording. An initial neural network model is obtained, and audio features are first extracted from each transcription audio recording to obtain an audio feature vector. The initial neural network model is then trained on the transcription training set until the total loss function of the model converges, and the trained model is determined to be the audio transcription model. The input of the model is the audio feature vector corresponding to each transcription audio recording, and the output is the predicted score corresponding to that recording. A model loss value is determined from the sample MIDI score, the predicted score and the total loss function, and the parameters of the initial neural network model are adjusted according to the model loss value until the model converges.
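A minimal sketch of one such transcription training iteration is given below, assuming a bidirectional LSTM that maps per-frame audio feature vectors to a frame-wise piano-roll target derived from the sample MIDI score; the piano-roll formulation, network sizes, optimizer and loss choice are illustrative assumptions rather than details stated in the embodiment.

```python
import torch
import torch.nn as nn

class TranscriptionModel(nn.Module):
    """Maps per-frame audio feature vectors to per-frame note activations (a piano roll)."""
    def __init__(self, feat_dim: int = 229, hidden: int = 256, n_pitches: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_pitches)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, frames, feat_dim) -> (batch, frames, n_pitches) logits
        out, _ = self.lstm(audio_features)
        return self.head(out)

model = TranscriptionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()  # loss between the predicted score and the sample MIDI score

def transcription_step(audio_features: torch.Tensor, piano_roll_target: torch.Tensor) -> float:
    """One parameter adjustment step of the audio transcription model."""
    optimizer.zero_grad()
    loss = criterion(model(audio_features), piano_roll_target)
    loss.backward()
    optimizer.step()
    return loss.item()
```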
It should be noted that the initial neural network model may include, but is not limited to, any one of a recurrent neural network (Recurrent Neural Networks, RNN), a Long Short-Term Memory network (LSTM), a gated recurrent unit (Gated Recurrent Unit, GRU) and the like, which may be determined based on the requirements of the actual application scenario and is not limited here.
It should be noted that the transcription training set may be the large-scale MAESTRO data set, which includes about 70 GB of piano recording segments and the corresponding MIDI score files.
In step S130 of some embodiments, in addition to learning the semantic information of the score corresponding to the sample audio file, the audio emotion information such as intensity, rhythm and timbre that is not contained in the symbolic representation of the sample audio file also needs to be learned. For example, in an intelligent music therapy scenario in the digital medical field, non-symbolic information such as intensity, rhythm and timbre in an audio file can also change the emotion category of that file. The sample audio file and its corresponding audio symbol score are therefore input into a pre-built initial recognition model, where the initial recognition model comprises an audio recognition sub-model and a symbol recognition sub-model. The audio recognition sub-model is used to learn the audio emotion information such as intensity, rhythm and timbre that is not contained in the symbolic representation of the sample audio file, and the symbol recognition sub-model is used to learn symbolic information such as time, dynamics and pitch in the sample audio file. By combining the symbol domain and the audio domain for music emotion recognition, the embodiment of the application enhances the performance and generalization capability of the model on unfamiliar data sets, while the complementary relationship between the two domains improves the recognition accuracy of the model.
In step S140 of some embodiments, audio feature extraction is performed on the sample audio file through the audio recognition sub-model, so as to obtain audio domain features, so as to implement analysis on the sample audio file in the audio domain according to the audio domain features.
Specifically, referring to fig. 2, fig. 2 is an optional flowchart of step S140 provided in the embodiment of the present application. In some embodiments, the audio recognition sub-model includes an audio preprocessing module and an audio feature extraction module, and step S140 may include, but is not limited to, step S210 and step S220:
step S210, performing frequency spectrum conversion on a sample audio file through an audio preprocessing module to obtain a target sample frequency spectrum;
step S220, the frequency spectrum feature extraction is carried out on the frequency spectrum of the target sample through the audio feature extraction module, and the audio domain feature is obtained.
In step S210 of some embodiments, because the Fourier transform only reflects the characteristics of a signal in the frequency domain, it cannot describe how the signal varies over time. To analyze the sample audio file in the audio domain, the file is first preprocessed in the audio domain: the audio preprocessing module performs spectrum conversion on the sample audio file to obtain a target sample spectrum in the form of a short-time Fourier transform spectrum.
It should be noted that the target sample spectrum may be calculated using an audio preprocessing function of an audio processing library (for example, librosa).
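A minimal sketch of this audio preprocessing is shown below, assuming librosa is the audio processing library in question; the sampling rate, FFT size and hop length are illustrative assumptions.

```python
import librosa
import numpy as np

def sample_to_spectrum(path: str, sr: int = 22050, n_fft: int = 2048, hop_length: int = 512) -> np.ndarray:
    """Load a sample audio file and return a log-magnitude short-time Fourier transform spectrum."""
    y, sr = librosa.load(path, sr=sr, mono=True)                  # waveform of the sample audio file
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)    # complex STFT
    spectrum = librosa.amplitude_to_db(np.abs(stft), ref=np.max)  # log scale for numerical stability
    return spectrum                                               # shape: (1 + n_fft // 2, frames)
```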
In step S220 of some embodiments, after obtaining a target sample spectrum in the form of a short-time fourier transform spectrum, in order to learn detailed information such as timbre, rhythm distribution, and the like of a sample audio file, a spectrum feature extraction is performed on the target sample spectrum by using an audio feature extraction module, so as to obtain audio domain features. The audio feature extraction module can perform audio analysis by adopting a convolutional neural network structure, namely, an audio analysis model is constructed by adopting the convolutional neural network structure, and the convolutional neural network performs multiple downsampling on an input target sample frequency spectrum, so that the learned audio domain features comprise detail information reflecting tone color, such as frequency spectrum waveform edges, and rhythm distribution information of music.
It should be noted that, in the training process of the audio analysis model, an audio analysis data set may be constructed according to the above audio sample set, where the audio analysis data set includes a plurality of audio analysis spectrums and initial audio emotion tags corresponding to each audio analysis spectrum, and the audio analysis data set may be acquired from other devices, networks, and from a storage space such as a database, a data carrier, or the like, to acquire an audio analysis spectrum based on data transmission. And constructing an initial audio model by adopting a convolutional neural network structure, extracting audio characteristics from an audio analysis frequency spectrum to obtain a sample audio characteristic vector, and carrying out classification prediction according to the sample audio characteristic vector to determine a target audio emotion label. And determining an audio analysis loss value according to the initial audio emotion tag and the target audio emotion tag. Training an initial audio model based on the audio analysis data set until an audio analysis loss value corresponding to the model meets a preset ending condition, and determining the model after training is ended as an audio analysis model.
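A minimal sketch of such a convolutional audio feature extraction module is given below; the layer configuration and output feature dimension are illustrative assumptions, not values specified by the embodiment.

```python
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """Convolutional audio feature extraction: target sample spectrum -> audio domain feature."""
    def __init__(self, feature_dim: int = 256):
        super().__init__()
        # Repeated convolution + downsampling captures timbre detail (spectral waveform edges)
        # and rhythm distribution at progressively coarser time scales.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, feature_dim)

    def forward(self, spectrum: torch.Tensor) -> torch.Tensor:
        # spectrum: (batch, 1, freq_bins, frames) -> audio domain feature: (batch, feature_dim)
        return self.proj(self.backbone(spectrum).flatten(1))
```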
In step S150 of some embodiments, because it is difficult, when discriminating an audio file in the audio domain alone, to extract and abstract the melody of the whole piece and then learn the influence of the distribution and flow of that melody on emotion, the application uses joint analysis with the symbol domain to solve this problem. Specifically, symbol feature extraction is performed on the audio symbol score through the symbol recognition sub-model to obtain symbol domain features. Performing symbol-domain analysis on the audio symbol score through the symbol recognition sub-model helps the final initial recognition model capture long-distance dependencies and obtain accurate prediction output from the note flow distribution in the audio symbol score, thereby reducing the interference caused by redundant noise information in the audio domain.
Specifically, referring to fig. 3, fig. 3 is an optional flowchart of step S150 provided in the embodiment of the present application. In some embodiments, the symbol recognition sub-model includes a symbol preprocessing module and a symbol feature extraction module, and step S150 may include, but is not limited to, step S310 and step S320:
step S310, performing note embedding processing on the audio symbol music score through a symbol preprocessing module to obtain a sample note sequence;
Step S320, extracting sequence features of the sample note sequence through a symbol feature extraction module to obtain symbol domain features.
In step S310 of some embodiments, in order to simulate a real language sequence as closely as possible from the score when performing symbol-domain analysis on the audio symbol score, note embedding processing is first performed on the audio symbol score by the symbol preprocessing module to obtain a sample note sequence.
Specifically, referring to fig. 4, fig. 4 is an optional flowchart of step S310 provided in the embodiment of the present application. In some embodiments, step S310 may specifically include, but is not limited to including, step S410 to step S460:
step S410, extracting information from each audio symbol in the audio symbol score to obtain note information of each audio symbol, wherein the note information comprises note time information, note strength information and note pitch information;
step S420, the position information of each audio symbol in the audio symbol music score is obtained;
step S430, arranging the note time information according to the position information to obtain a first note sequence;
step S440, arranging the note dynamics information according to the position information to obtain a second note sequence;
Step S450, arranging the note pitch information according to the position information to obtain a third note sequence;
step S460, merging the first note sequence, the second note sequence and the third note sequence to obtain a sample note sequence.
In step S410 of some embodiments, the note events in the MIDI score file are extracted; that is, information is extracted from each audio symbol in the symbolically represented audio symbol score to obtain the note information of each audio symbol, where the note information includes note time information, note dynamics information and note pitch information. The note time information represents the duration token corresponding to the audio symbol, the note dynamics information represents the dynamics (velocity) token corresponding to the audio symbol, and the note pitch information represents the pitch token corresponding to the audio symbol.
In steps S420 to S460 of some embodiments, in order to simulate a real language sequence as closely as possible from the score, the duration and position-coding information of the notes are embedded together with the notes. Specifically, the position information of each audio symbol in the audio symbol score is acquired; in practice, the position information represents the time order of the notes. In the embodiment of the application, the note time information, note dynamics information and note pitch information can be treated as a triple for each audio symbol, and the triples are arranged according to the position information to obtain the sample note sequence. Specifically, the note time information is arranged according to the position information to obtain a first note sequence; the note dynamics information is arranged according to the position information to obtain a second note sequence; and the note pitch information is arranged according to the position information to obtain a third note sequence. The first, second and third note sequences are then merged to obtain a sample note sequence that simulates a real language sequence.
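A minimal sketch of this note embedding preprocessing is shown below, assuming the audio symbol score is parsed with pretty_midi (the embodiment does not name a parser); each audio symbol is reduced to a (duration, dynamics, pitch) triple and the triples are ordered by their position in the score.

```python
import pretty_midi

def score_to_note_sequence(midi_path: str):
    """Convert an audio symbol score (MIDI file) into a sample note sequence of triples."""
    midi = pretty_midi.PrettyMIDI(midi_path)
    notes = sorted(
        (n for inst in midi.instruments for n in inst.notes),
        key=lambda n: n.start,                    # position information: time order in the score
    )
    durations = [n.end - n.start for n in notes]  # first note sequence: note time information
    velocities = [n.velocity for n in notes]      # second note sequence: note dynamics information
    pitches = [n.pitch for n in notes]            # third note sequence: note pitch information
    # Merge the three sequences into the sample note sequence.
    return list(zip(durations, velocities, pitches))
```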
In step S320 of some embodiments, to help the initial recognition model obtain prediction output from the flow direction and overall distribution of the notes during symbol-domain analysis, sequence feature extraction is performed on the sample note sequence by the symbol feature extraction module to obtain symbol domain features. The symbol feature extraction module may build a symbol analysis model using a Transformer structure, so that the symbol domain features are obtained through the symbol analysis model.
It should be noted that the symbol analysis model may be trained as follows. A symbol analysis data set may be constructed from the above audio sample set; the symbol analysis data set includes a plurality of symbol analysis scores and an initial score emotion tag corresponding to each symbol analysis score, and the symbol analysis scores may be acquired via data transmission from other devices, networks, or storage spaces such as databases and data carriers. An initial symbol model is built using a Transformer structure; note embedding processing is performed on a symbol analysis score to obtain a note sequence, sequence features are extracted from the note sequence to obtain a sample symbol feature vector, and classification prediction is performed on the sample symbol feature vector to determine a target score emotion tag. A symbol analysis loss value is determined from the initial score emotion tag and the target score emotion tag. The initial symbol model is trained on the symbol analysis data set until its symbol analysis loss value meets a preset ending condition, and the model at the end of training is determined to be the symbol analysis model.
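A minimal sketch of a Transformer-based symbol feature extraction module is given below; the vocabulary sizes, model dimension, number of layers and mean pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SymbolBranch(nn.Module):
    """Transformer encoder over the sample note sequence: note triples -> symbol domain feature."""
    def __init__(self, pitch_vocab: int = 128, velocity_vocab: int = 128,
                 d_model: int = 256, max_len: int = 2048):
        super().__init__()
        self.pitch_emb = nn.Embedding(pitch_vocab, d_model)        # pitch token embedding
        self.velocity_emb = nn.Embedding(velocity_vocab, d_model)  # dynamics token embedding
        self.duration_proj = nn.Linear(1, d_model)                 # continuous duration token
        self.pos_emb = nn.Embedding(max_len, d_model)              # position coding of each note
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, duration: torch.Tensor, velocity: torch.Tensor, pitch: torch.Tensor) -> torch.Tensor:
        # duration: (batch, seq, 1) floats; velocity, pitch: (batch, seq) integer tokens
        pos = torch.arange(pitch.size(1), device=pitch.device).unsqueeze(0)
        x = (self.duration_proj(duration) + self.velocity_emb(velocity)
             + self.pitch_emb(pitch) + self.pos_emb(pos))
        return self.encoder(x).mean(dim=1)                         # symbol domain feature: (batch, d_model)
```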
In step S160 of some embodiments, in order to better capture long-distance dependencies in the music, strengthen the model's resistance to interference in the audio domain, and thereby enhance its performance and generalization capability on untrained data sets, joint analysis is performed on the audio domain and the symbol domain: the sample audio file is emotion-classified according to the audio domain features and the symbol domain features, and the predicted emotion label of the sample audio file is determined.
Specifically, referring to fig. 5, fig. 5 is an optional flowchart of step S160 provided in the embodiment of the present application. In some embodiments, the initial recognition model further includes a feature fusion sub-model and a classifier, and step S160 may specifically include, but is not limited to, steps S510 to S530:
step S510, carrying out feature fusion on the audio domain features and the symbol domain features through a feature fusion sub-model to obtain sample target features;
step S520, carrying out emotion type prediction on sample target characteristics through a classifier to obtain a label prediction result, wherein the label prediction result comprises sample emotion prediction values of sample audio files belonging to different emotion labels respectively;
in step S530, the predicted emotion labels of the sample audio file are determined by comparing the values of all the sample emotion predicted values.
In steps S510 to S530 of some embodiments, in order to better exploit the redundancy and complementarity of information between the audio and symbol modalities, feature fusion is performed on the audio domain features and the symbol domain features through the feature fusion sub-model to obtain sample target features. Emotion category prediction is then performed on the sample target features through the classifier to obtain a label prediction result, where the label prediction result includes the sample emotion predicted values of the sample audio file for the different emotion labels. All of the sample emotion predicted values are compared numerically, and the emotion label corresponding to the largest sample emotion predicted value is determined to be the predicted emotion label of the sample audio file. In this way, the embodiment of the application can eliminate the repeated learning of pitch information in the audio domain analysis and the symbol domain analysis, while reinforcing the important information unique to each of them, such as the rhythm and timbre information in the audio domain features and the overall note flow information in the symbol domain features, thereby improving the accuracy with which the model identifies the emotion categories of music pieces.
It should be noted that, in the embodiments of the present application, a cross-modal attention mechanism is used to fuse feature information between the audio domain analysis and the symbol domain analysis. This cross-modal attention is essentially multi-head self-attention in which the Query vector carries information from one domain analysis (audio or symbol) while the Key and Value vectors carry information from the other domain analysis; after the multi-head attention is computed on this basis, its output is the fused feature of the two analysis modalities.
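A possible realization of this cross-modal fusion, sketched with PyTorch's nn.MultiheadAttention; the feature dimension, temporal pooling and concatenation choices are assumptions rather than the application's prescribed design:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuses audio-domain and symbol-domain features with multi-head attention.
    The Query comes from one modality; the Key and Value come from the other."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.audio_queries_symbol = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.symbol_queries_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio_feat, symbol_feat):
        # audio_feat: (batch, T_a, d_model); symbol_feat: (batch, T_s, d_model)
        a2s, _ = self.audio_queries_symbol(audio_feat, symbol_feat, symbol_feat)  # Q=audio, K=V=symbol
        s2a, _ = self.symbol_queries_audio(symbol_feat, audio_feat, audio_feat)   # Q=symbol, K=V=audio
        # Pool over time and concatenate the two fused representations
        return torch.cat([a2s.mean(dim=1), s2a.mean(dim=1)], dim=-1)              # sample target features
```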
In step S170 of some embodiments, in order to obtain a music emotion recognition model combining the audio domain analysis and the symbol domain analysis, parameter adjustment is performed on the initial recognition model according to the predicted emotion label and the initial emotion label, so as to obtain the music emotion recognition model.
Specifically, referring to fig. 6, fig. 6 is an optional flowchart of step S170 provided in the embodiment of the present application. In some embodiments, step S170 may specifically include, but is not limited to including, step S610 and step S620:
step S610, determining a model loss value according to the predicted emotion label and the initial emotion label;
and step S620, adjusting model parameters of the audio recognition sub-model and the symbol recognition sub-model according to the model loss value, and continuously training the adjusted initial recognition model based on the audio sample set until the model loss value meets the preset training ending condition so as to obtain the music emotion recognition model.
In steps S610 and S620 of some embodiments, the model loss value is determined from the predicted emotion label and the initial emotion label, i.e., the model loss value is obtained by a weighted calculation over the audio analysis loss value of the audio domain analysis and the symbol analysis loss value of the symbol domain analysis. The model parameters of the audio recognition sub-model and the symbol recognition sub-model are then adjusted according to the model loss value, and the adjusted initial recognition model continues to be trained on the audio sample set until the model loss value meets the preset training ending condition, so as to obtain the music emotion recognition model. The preset training ending condition may be that the model loss value is smaller than a preset loss value threshold.
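For illustration, a hedged sketch of such a joint training loop; the equal loss weights, the loss threshold and the assumption that the model returns separate audio-domain and symbol-domain logits are invented for this example:

```python
import torch

AUDIO_WEIGHT, SYMBOL_WEIGHT = 0.5, 0.5   # assumed weighting coefficients
LOSS_THRESHOLD = 0.05                    # assumed preset loss value threshold

def joint_training(model, loader, optimizer, loss_fn):
    """model(spectrum, note_tokens) is assumed to return per-domain logits."""
    while True:
        epoch_loss = 0.0
        for spectrum, note_tokens, labels in loader:
            optimizer.zero_grad()
            audio_logits, symbol_logits = model(spectrum, note_tokens)
            audio_loss = loss_fn(audio_logits, labels)      # audio analysis loss value
            symbol_loss = loss_fn(symbol_logits, labels)    # symbol analysis loss value
            model_loss = AUDIO_WEIGHT * audio_loss + SYMBOL_WEIGHT * symbol_loss
            model_loss.backward()                           # adjusts both sub-models' parameters
            optimizer.step()
            epoch_loss += model_loss.item()
        if epoch_loss / len(loader) < LOSS_THRESHOLD:       # preset training ending condition
            return model                                    # music emotion recognition model
```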
In step S180 of some embodiments, in practical application, a target audio file to be identified is obtained, and the target audio file is input into a music emotion recognition model for emotion recognition processing, so as to obtain an emotion recognition result.
It should be noted that the present application can classify the target audio file to be identified by emotion. For example, in an intelligent music therapy scenario in the digital medical field, when a patient needs to calm an anxious mood, a doctor may input, to a device on which the music emotion recognition system of the present application is installed, an instruction to play an audio clip labeled as relaxing. The device then randomly selects, from the emotion-classified database, an audio segment whose emotion is classified under the relaxing label and plays it to soothe the patient, thereby assisting music therapy scenarios such as the treatment of mental disorders and neurological rehabilitation. By adopting a music emotion recognition method with high recognition accuracy, the target audio file can be classified by emotion with high accuracy, so that music segments that better match the emotional requirement can be selected for the patient, which provides useful support for music therapy.
It should be noted that the present application may also be used in recommendation and information retrieval of music software, that is, recommending music information of the same emotion classification according to the emotion types favored by the target object.
Referring to fig. 7, fig. 7 is an optional flowchart of step S180 provided in the embodiment of the present application. In some embodiments, the emotion recognition result includes a target emotion tag, and step S180 may specifically include, but is not limited to, steps S710 and S720:
step S710, inputting the target audio file into a music emotion recognition model for emotion type prediction to obtain a target emotion predicted value; the target emotion predicted value is used for representing emotion categories to which the target audio file belongs;
step S720, determining the target emotion tags of the target audio file by comparing all the target emotion predicted values.
Specifically, inputting the target audio file into a trained music emotion recognition model for emotion recognition processing to obtain an emotion recognition result, wherein the emotion recognition result comprises target emotion predicted values of the target audio file belonging to different emotion labels respectively. And comparing the values of all the target emotion predicted values, and determining that the emotion label with the maximum target emotion predicted value is the target emotion label of the target audio file.
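A minimal inference sketch of steps S710 and S720, assuming the trained music emotion recognition model can be called directly on the target sample spectrum and note tokens and returns fused per-label scores; the label names and this forward interface are assumptions:

```python
import torch

EMOTION_LABELS = ["happy", "sad", "relaxed", "tense"]   # assumed label set

@torch.no_grad()
def recognize_emotion(model, target_spectrum, target_note_tokens):
    """Steps S710/S720: predict per-label values, then take the maximum."""
    model.eval()
    scores = model(target_spectrum, target_note_tokens)   # assumed forward interface
    preds = torch.softmax(scores, dim=-1)                 # target emotion predicted values
    best = preds.argmax(dim=-1).item()                    # numerical comparison of all values
    return EMOTION_LABELS[best]                           # target emotion label
```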
Referring to fig. 8, the embodiment of the present application further provides a music emotion recognition system, which can implement the above music emotion recognition method, where the system includes:
a sample set obtaining module 810, configured to obtain an audio sample set, where the audio sample set includes a sample audio file and an initial emotion tag of the sample audio file;
the audio transcription module 820 is configured to input the sample audio file into a pre-trained audio transcription model for format standardization processing, so as to obtain an audio symbol score;
a model input module 830 for inputting the sample audio file and the audio-symbol score into a pre-constructed initial recognition model, the initial recognition model including an audio recognition sub-model and a symbol recognition sub-model;
the audio processing module 840 is configured to perform audio feature extraction on the sample audio file through the audio recognition sub-model to obtain audio domain features;
the symbol processing module 850 is configured to perform symbol feature extraction on the audio symbol score through the symbol recognition sub-model to obtain symbol domain features;
the emotion classification module 860 is configured to perform emotion classification on the sample audio file according to the audio domain features and the symbol domain features, and determine a predicted emotion tag of the sample audio file;
the parameter adjustment module 870 is configured to perform parameter adjustment on the initial recognition model according to the predicted emotion tag and the initial emotion tag, so as to obtain a music emotion recognition model;
the emotion recognition module 880 is configured to obtain a target audio file to be recognized, input the target audio file into the music emotion recognition model, and perform emotion recognition processing to obtain an emotion recognition result.
The specific implementation of the music emotion recognition system is basically the same as the specific embodiment of the music emotion recognition method, and will not be described herein.
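For orientation only, a skeleton showing how the modules of fig. 8 might be composed; every class and method name here is invented for illustration and is not part of the application:

```python
class MusicEmotionRecognitionSystem:
    """Illustrative composition of the modules in fig. 8; all names are invented."""
    def __init__(self, transcription_model, recognition_model):
        self.transcription_model = transcription_model   # pre-trained audio transcription model
        self.recognition_model = recognition_model       # initial / trained recognition model

    def train(self, audio_sample_set):
        for sample_audio, initial_label in audio_sample_set:
            score = self.transcription_model.transcribe(sample_audio)             # audio symbol score
            predicted_label = self.recognition_model.classify(sample_audio, score)
            self.recognition_model.adjust(predicted_label, initial_label)         # parameter adjustment

    def recognize(self, target_audio):
        score = self.transcription_model.transcribe(target_audio)
        return self.recognition_model.classify(target_audio, score)               # emotion recognition result
```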
The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the music emotion recognition method when executing the computer program. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 910 may be implemented by a general purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc. for executing related programs to implement the technical solutions provided in the embodiments of the present application;
the memory 920 may be implemented in the form of a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM). The memory 920 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present application are implemented through software or firmware, the relevant program code is stored in the memory 920 and invoked by the processor 910 to execute the music emotion recognition method of the embodiments of the present application;
an input/output interface 930 for inputting and outputting information;
the communication interface 940 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g., USB, network cable, etc.), or may implement communication in a wireless manner (e.g., mobile network, WIFI, bluetooth, etc.);
a bus 950 for transferring information between components of the device (e.g., processor 910, memory 920, input/output interface 930, and communication interface 940);
wherein processor 910, memory 920, input/output interface 930, and communication interface 940 implement communication connections among each other within the device via a bus 950.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the music emotion recognition method when being executed by a processor.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
According to the music emotion recognition method and system, the electronic device and the storage medium, an audio sample set is obtained, where the audio sample set includes a sample audio file and an initial emotion tag of the sample audio file. The sample audio file is input into a pre-trained audio transcription model for format standardization processing to obtain an audio symbol score; the sample audio file and the audio symbol score are input to a pre-constructed initial recognition model, which includes an audio recognition sub-model and a symbol recognition sub-model. In the audio domain analysis, spectrum conversion is performed on the sample audio file through the audio preprocessing module to obtain a target sample spectrum, and spectrum feature extraction is performed on the target sample spectrum through the audio feature extraction module to obtain the audio domain features. By performing audio domain analysis on the sample audio file, the learned audio domain features include detail information that reflects timbre, such as spectral waveform edges, as well as the rhythm distribution information of the music.
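As a hedged illustration of this audio-domain branch, the sketch below computes a log-mel spectrogram with librosa and extracts features with a small convolutional network; the sample rate, mel-band count and network shape are assumptions, not the application's prescribed configuration:

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def audio_preprocess(path, sr=22050, n_mels=128):
    """Spectrum conversion of the sample audio file into a target sample spectrum."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)           # log-mel target sample spectrum

class AudioFeatureExtractor(nn.Module):
    """Extracts audio domain features from the target sample spectrum."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(32, out_dim)

    def forward(self, spectrum):                           # (batch, 1, n_mels, frames)
        h = self.conv(spectrum).flatten(1)
        return self.proj(h)                                # audio domain features
```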
In addition, when emotion is judged from the audio domain alone, it is difficult to extract and abstract the melody of the whole piece of music, and thus to learn how the overall melodic distribution and flow affect emotion. Therefore, in the symbol domain analysis, the note time information, note dynamics information and note pitch information are arranged and combined according to the position information of each audio symbol in the audio symbol score to obtain a sample note sequence. Sequence feature extraction is then performed on the sample note sequence through the symbol feature extraction module to obtain the symbol domain features. Next, in order to better utilize the redundancy and complementarity of information between the audio and symbol modalities, feature fusion is performed on the audio domain features and the symbol domain features through the feature fusion sub-model to obtain sample target features, and emotion type prediction is performed on the sample target features through the classifier to obtain a label prediction result, so as to determine the predicted emotion label of the sample audio file. Parameter adjustment is performed on the initial recognition model according to the predicted emotion label and the initial emotion label to obtain the music emotion recognition model. A target audio file to be identified is acquired and input into the music emotion recognition model for emotion recognition processing, so as to obtain target emotion predicted values of the target audio file for the different emotion categories, thereby determining the target emotion label corresponding to the target audio file. Compared with existing single audio-domain analysis methods, this approach can better capture long-distance dependencies in music, thereby improving the anti-interference capability of the model in the audio domain.
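A minimal sketch of the note arrangement step described in this paragraph; the note tuple layout and function name are assumptions:

```python
from typing import List, Tuple

# (position_in_score, note_time, note_dynamics, note_pitch) - assumed layout
Note = Tuple[int, float, int, int]

def build_sample_note_sequence(notes: List[Note]) -> List[Tuple[float, int, int]]:
    ordered = sorted(notes, key=lambda n: n[0])           # arrange by position in the audio symbol score
    time_seq = [n[1] for n in ordered]                    # first note sequence (note time information)
    dynamics_seq = [n[2] for n in ordered]                # second note sequence (note dynamics information)
    pitch_seq = [n[3] for n in ordered]                   # third note sequence (note pitch information)
    return list(zip(time_seq, dynamics_seq, pitch_seq))   # combined sample note sequence
```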
Meanwhile, the present application performs joint analysis on the audio domain and the symbol domain, adding audio-domain information such as timbre and rhythm that existing single symbol-domain analysis methods lack, which enhances the performance and generalization capability of the model on unseen data sets; this complementary relationship also improves the accuracy with which the model recognizes the emotion category of a music fragment. In addition, the cross-modal attention fusion mechanism provided in the embodiments of the present application fuses the audio domain features and the symbol domain features; compared with existing multi-modal joint analysis methods, it can better exploit the redundancy and complementarity between the two analysis modalities, selectively emphasizing the useful parts of the two modalities' information and giving them higher weight, thereby improving model performance.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and as those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not constitute limitations of the embodiments of the present application, and may include more or fewer steps than shown, or may combine certain steps, or different steps.
The system embodiments described above are merely illustrative, in that the units illustrated as separate components may or may not be physically separate, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in this application, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative; for instance, the division of the above units is merely a logical functional division, and there may be other divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection shown or discussed between components may be indirect coupling or communication connection through some interfaces, systems or units, and may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing a program.
Preferred embodiments of the present application are described above with reference to the accompanying drawings, and thus do not limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.
Claims (10)
1. A method for identifying musical emotion, said method comprising:
acquiring an audio sample set, wherein the audio sample set comprises a sample audio file and an initial emotion tag of the sample audio file;
inputting the sample audio file into a pre-trained audio transcription model for format standardization processing to obtain an audio symbol music score;
inputting the sample audio file and the audio symbol score to a pre-constructed initial recognition model, wherein the initial recognition model comprises an audio recognition sub-model and a symbol recognition sub-model;
extracting audio features of the sample audio file through the audio recognition sub-model to obtain audio domain features;
extracting the symbol characteristics of the audio symbol music score through the symbol recognition sub-model to obtain symbol domain characteristics;
carrying out emotion classification on the sample audio file according to the audio domain characteristics and the symbol domain characteristics, and determining a predicted emotion label of the sample audio file;
performing parameter adjustment on the initial recognition model according to the predicted emotion label and the initial emotion label to obtain a music emotion recognition model;
and acquiring a target audio file to be identified, inputting the target audio file into the music emotion recognition model for emotion recognition processing, and obtaining an emotion recognition result.
2. The method of claim 1, wherein the audio recognition sub-model includes an audio preprocessing module and an audio feature extraction module, and the audio feature extraction is performed on the sample audio file by the audio recognition sub-model to obtain audio domain features, including:
performing frequency spectrum conversion on the sample audio file through the audio preprocessing module to obtain a target sample frequency spectrum;
and carrying out frequency spectrum feature extraction on the target sample frequency spectrum through the audio feature extraction module to obtain the audio domain feature.
3. The method of claim 1, wherein the symbol recognition sub-model includes a symbol preprocessing module and a symbol feature extraction module, the performing symbol feature extraction on the audio symbol score by the symbol recognition sub-model to obtain symbol domain features, including:
performing note embedding processing on the audio symbol music score through the symbol preprocessing module to obtain a sample note sequence;
and extracting sequence features of the sample note sequence through the symbol feature extraction module to obtain the symbol domain features.
4. The method of claim 3, wherein the performing, by the symbol preprocessing module, the note embedding process on the audio symbol score to obtain a sample note sequence comprises:
information extraction is carried out on each audio symbol in the audio symbol music score to obtain note information of each audio symbol, wherein the note information comprises note time information, note dynamics information and note pitch information;
acquiring the position information of each audio symbol in the audio symbol music score;
arranging the note time information according to the position information to obtain a first note sequence;
arranging the note dynamics information according to the position information to obtain a second note sequence;
arranging the note pitch information according to the position information to obtain a third note sequence;
and combining the first note sequence, the second note sequence and the third note sequence to obtain a sample note sequence.
5. The method of any one of claims 1 to 4, wherein the initial recognition model further comprises a feature fusion sub-model and a classifier, wherein the emotion classifying the sample audio file based on the audio domain features and the symbol domain features, determining a predicted emotion tag for the sample audio file, comprises:
performing feature fusion on the audio domain features and the symbol domain features through the feature fusion sub-model to obtain sample target features;
carrying out emotion type prediction on the sample target characteristics through the classifier to obtain a label prediction result, wherein the label prediction result comprises sample emotion prediction values of the sample audio files belonging to different emotion labels respectively;
and determining the predictive emotion labels of the sample audio files by carrying out numerical comparison on all the sample emotion predictive values.
6. The method according to any one of claims 1 to 4, wherein said performing parameter adjustment on said initial recognition model according to said predicted emotion label and said initial emotion label to obtain a music emotion recognition model comprises:
determining a model loss value according to the predicted emotion label and the initial emotion label;
and adjusting model parameters of the audio recognition sub-model and the symbol recognition sub-model according to the model loss value, and continuously training the adjusted initial recognition model based on the audio sample set until the model loss value meets a preset training ending condition so as to obtain a music emotion recognition model.
7. The method according to any one of claims 1 to 4, wherein the emotion recognition result includes a target emotion tag, and the inputting the target audio file into the music emotion recognition model performs emotion recognition processing to obtain the emotion recognition result includes:
inputting the target audio file into the music emotion recognition model to predict emotion types, so as to obtain a target emotion predicted value; the target emotion predicted value is used for representing emotion categories to which the target audio file belongs;
and determining the target emotion labels of the target audio files by carrying out numerical comparison on all the target emotion predicted values.
8. A musical emotion recognition system, said system comprising:
the system comprises a sample set acquisition module, a sample set generation module and a sample set generation module, wherein the sample set acquisition module is used for acquiring an audio sample set, and the audio sample set comprises a sample audio file and an initial emotion tag of the sample audio file;
the audio transcription module is used for inputting the sample audio file into a pre-trained audio transcription model for format standardization processing to obtain an audio symbol music score;
the model input module is used for inputting the sample audio file and the audio symbol music score into a pre-built initial recognition model, wherein the initial recognition model comprises an audio recognition sub-model and a symbol recognition sub-model;
the audio processing module is used for extracting audio features of the sample audio file through the audio recognition sub-model to obtain audio domain features;
the symbol processing module is used for extracting symbol characteristics of the audio symbol music score through the symbol recognition sub-model to obtain symbol domain characteristics;
the emotion classification module is used for performing emotion classification on the sample audio file according to the audio domain characteristics and the symbol domain characteristics, and determining a predicted emotion label of the sample audio file;
the parameter adjustment module is used for carrying out parameter adjustment on the initial recognition model according to the predicted emotion label and the initial emotion label to obtain a music emotion recognition model;
the emotion recognition module is used for acquiring a target audio file to be recognized, inputting the target audio file into the music emotion recognition model for emotion recognition processing, and obtaining an emotion recognition result.
9. An electronic device comprising a memory storing a computer program and a processor implementing the method of any of claims 1 to 7 when the computer program is executed by the processor.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310572293.XA CN116486838A (en) | 2023-05-19 | 2023-05-19 | Music emotion recognition method and system, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310572293.XA CN116486838A (en) | 2023-05-19 | 2023-05-19 | Music emotion recognition method and system, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116486838A true CN116486838A (en) | 2023-07-25 |
Family
ID=87225209
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310572293.XA Pending CN116486838A (en) | 2023-05-19 | 2023-05-19 | Music emotion recognition method and system, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116486838A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117828537A (en) * | 2024-03-04 | 2024-04-05 | 北京建筑大学 | Music emotion recognition method and device based on CBA model |
CN117828537B (en) * | 2024-03-04 | 2024-05-17 | 北京建筑大学 | Music emotion recognition method and device based on CBA model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||