CN117056557A - Music playing method, device, equipment and storage medium - Google Patents

Music playing method, device, equipment and storage medium

Info

Publication number
CN117056557A
Authority
CN
China
Prior art keywords
music
emotion recognition
recognition result
emotion
lyrics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311026921.0A
Other languages
Chinese (zh)
Inventor
李进丽
高翥
曾通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Geely Holding Group Co Ltd
Geely Automobile Research Institute Ningbo Co Ltd
Original Assignee
Zhejiang Geely Holding Group Co Ltd
Geely Automobile Research Institute Ningbo Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Geely Holding Group Co Ltd, Geely Automobile Research Institute Ningbo Co Ltd filed Critical Zhejiang Geely Holding Group Co Ltd
Priority to CN202311026921.0A
Publication of CN117056557A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/685 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using automatically derived transcript of audio data, e.g. lyrics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Library & Information Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The application provides a music playing method, device, equipment and storage medium. The method comprises the following steps: target music is acquired; according to the music characteristics of the target music, emotion recognition results corresponding to each sentence of lyrics of the target music are obtained; and in the process of playing the target music, the body state of the displayed virtual object is controlled according to the emotion recognition result corresponding to each sentence of lyrics of the target music. The method of the application improves the recognition accuracy of music emotion, enriches the music playing modes and styles, and improves the audio-visual experience of users.

Description

Music playing method, device, equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a music playing method, apparatus, device, and storage medium.
Background
With the advancement of technology and living standards, automobiles have gradually evolved from mere means of transport into a third space of modern life, and car music services have become one of the standard configurations of automobile intelligence and vehicle networking. The existing car music service mainly plays music through the car audio via the car machine system, and either adjusts the configuration of the car audio according to the emotion of the played music to improve the listening experience, or links the ambient lights inside the car with the played music to improve the visual experience.
However, the existing playing modes of the car music service suffer from low accuracy of music emotion recognition and a single playing mode.
Disclosure of Invention
The application provides a music playing method, device, equipment and storage medium, which are used for solving the problems in the prior art that the playing mode of the car music service has low accuracy of music emotion recognition and a single playing mode.
In a first aspect, the present application provides a music playing method, including:
acquiring target music;
according to the music characteristics of the target music, obtaining emotion recognition results corresponding to each sentence of lyrics of the target music;
and in the process of playing the target music, controlling the body state of the displayed virtual object according to the emotion recognition result corresponding to each sentence of lyrics of the target music.
Optionally, the obtaining, according to the music features of the target music, an emotion recognition result corresponding to each lyric of the target music includes:
performing time slicing on the target music to obtain a plurality of music pieces of the target music;
according to the music characteristics of the music pieces, obtaining emotion recognition results corresponding to the music pieces;
and taking the emotion recognition result corresponding to the music piece as the emotion recognition result corresponding to each lyric contained in the music piece.
Optionally, the music features include at least two sub-features among audio, lyrics, and a music score image; and the obtaining, according to the music features of the music piece, the emotion recognition result corresponding to the music piece includes the following steps:
obtaining an initial emotion recognition result corresponding to the sub-feature by using an emotion recognition model corresponding to the sub-feature of the music piece and the sub-feature;
and obtaining emotion recognition results corresponding to the music pieces according to the initial emotion recognition results corresponding to the sub-features.
Optionally, the sub-feature is a music score image; the method for obtaining the initial emotion recognition result corresponding to the sub-feature by using the emotion recognition model corresponding to the sub-feature of the music piece and the sub-feature comprises the following steps:
extracting a musical character sequence feature according to the music score image;
and obtaining an initial emotion recognition result corresponding to the music score image by using the character sequence characteristics and the emotion recognition model corresponding to the music score image.
Optionally, the obtaining, according to the initial emotion recognition result corresponding to each sub-feature, the emotion recognition result corresponding to the music piece includes:
Generating an emotion recognition result sequence according to the initial emotion recognition result corresponding to each sub-feature;
and inputting the emotion recognition result sequence into a preset emotion classification model to obtain emotion recognition results corresponding to the music pieces and confidence degrees of the emotion recognition results.
Optionally, in the process of playing the target music, according to the emotion recognition result corresponding to each sentence of lyrics of the target music, controlling the posture of the displayed virtual object includes:
acquiring the physical state of the virtual object corresponding to each lyric of the target music according to the emotion recognition result corresponding to each lyric of the target music, the confidence coefficient of the emotion recognition result and the mapping relation among the emotion recognition result, the confidence coefficient and the physical state;
and when the lyrics corresponding to the target music are played, controlling the virtual object to execute the corresponding body state of the lyrics.
Optionally, the obtaining the emotion recognition result corresponding to each lyric of the target music includes:
aiming at any sentence of lyrics, according to the emotion recognition result corresponding to the sentence of lyrics and the emotion recognition result corresponding to the lyrics of the next sentence, acquiring the emotion change between the sentence of lyrics and the next sentence of lyrics;
And according to the emotion change, adjusting an emotion recognition result of the lyrics of the sentence at least one target time point so as to transition the emotion from the emotion recognition result corresponding to the lyrics of the sentence to the emotion recognition result corresponding to the lyrics of the next sentence.
In a second aspect, the present application provides a music playing device comprising:
the acquisition module is used for acquiring target music;
the processing module is used for acquiring emotion recognition results corresponding to each sentence of lyrics of the target music according to the music characteristics of the target music;
and the control module is used for controlling the body state of the displayed virtual object according to the emotion recognition result corresponding to each sentence of lyrics of the target music in the process of playing the target music.
In a third aspect, the present application provides an electronic device comprising: a processor, a communication interface, and a memory; the processor is respectively in communication connection with the communication interface and the memory;
the memory stores computer-executable instructions;
the communication interface performs communication interaction with external equipment;
the processor executes computer-executable instructions stored by the memory to implement the method of any one of the first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium having stored therein computer-executable instructions for implementing the music playing method according to any one of the first aspects when executed by a processor.
In a fifth aspect, the present application provides a computer program product for implementing the music playing method according to any one of the first aspects when executed by a processor.
According to the music playing method, device, equipment and storage medium of the application, the target music is obtained, the emotion recognition result corresponding to each lyric of the target music is obtained according to the music characteristics of the target music, and in the process of playing the target music, the body state of the displayed virtual object is controlled according to the emotion recognition result corresponding to each lyric of the target music, so that the body state of the virtual object is the body state corresponding to the emotion recognition result of that lyric. This improves the recognition accuracy of music emotion, enriches the music playing modes and styles, and improves the audio-visual experience of the user.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic flow chart of a music playing method according to an embodiment of the present application;
fig. 2 is a schematic view of a scene of music playing according to an embodiment of the present application;
fig. 3 is a flowchart of another music playing method according to an embodiment of the present application;
fig. 4 is a flowchart of another music playing method according to an embodiment of the present application;
fig. 5 is a flowchart of another music playing method according to an embodiment of the present application;
fig. 6 is a flowchart of another music playing method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a music playing device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Specific embodiments of the present application have been shown by way of the above drawings and will be described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but rather to illustrate the inventive concepts to those skilled in the art by reference to the specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the appended claims.
For easy understanding, a detailed description will be given of a music playing method of the existing car music service:
currently, there are multimedia solutions created based on a three-dimensional (3D) rendering engine, visually providing 3D dynamic lyrics and animations based on music rhythms to achieve real-time rendering. For example, the emotion characteristics of the music can be extracted from the music audio, the music emotion is analyzed through the music emotion analysis processor, and the color change of the atmosphere lamp corresponding to the emotion characteristics is determined so as to reflect the emotion change of the music, so that the audio-visual experience of a user is improved.
At present, music emotion is mainly identified through machine learning and deep learning methods. For example, features such as rhythm and timbre can be extracted from the audio data of the music, and machine learning models such as the k-nearest neighbor (kNN) algorithm and the support vector machine (SVM) can be trained with the corresponding emotion labels to identify the music emotion. Alternatively, deep learning methods based on convolutional neural networks (CNN), recurrent neural networks (RNN), attention mechanisms, and the like may be used to focus on important features in the audio and learn the correspondence between these features and emotions, thereby identifying the music emotion.
However, the music emotion of a piece of music is not only reflected in its audio data; the lyrics and the music score also play an important role. Existing music emotion recognition methods recognize the music emotion from the audio data alone, and therefore suffer from low recognition accuracy. In addition, linking the music emotion only by adjusting the configuration of the in-car audio system and the color changes of the in-car ambient lights leads to a rather single playing mode and style and a poor user experience.
In view of the above, the present application provides a music playing method: a virtual object is constructed, at least two features among the audio, lyrics, and music score image of the target music are extracted, the music emotion of the target music and its changes are identified according to these features, and the body state of the virtual object is controlled according to the music emotion, thereby improving the recognition accuracy of the music emotion, enriching the music playing modes and styles, and improving the audio-visual experience of the user.
The execution subject of the music playing method provided by the application may be a terminal device with a data processing function, a processing chip of the terminal device, or software or program code implementing the music playing method. When the execution subject is a terminal device with a data processing function, the terminal device may be, for example, a computing device such as a vehicle's car machine, a mobile phone, or a computer, on which software or program code running the music playing method is deployed, so that emotion recognition is performed on the music currently played by the computing device and the body state of the virtual object is controlled through the software or program code. The execution subject may also be a cloud platform with a data processing function. When the execution subject is a cloud platform, emotion recognition of the music currently played by the computing device and control of the body state of the virtual object are performed in the cloud. The cloud platform may be logically divided into several parts according to actual requirements, each part having a different function. Parts of the platform may be deployed in any two or three of an electronic device (on the user side), an edge environment, and a cloud environment. An edge environment is an environment that includes a collection of edge electronic devices close to the electronic device, such as edge servers and edge kiosks with computing power. The parts of the data processing platform deployed in different environments or devices cooperatively implement the functions of the data processing platform. It should be understood that the method does not restrictively divide which part of the data processing platform is deployed in which environment; in practical applications, adaptive deployment can be performed according to the computing capacity of the electronic device, the resource occupation of the edge environment and the cloud environment, or specific application requirements.
In the following, taking the application scenario of a car machine playing music, with the car machine as the execution subject of the method, as an example, the technical solution of the application and how it solves the above technical problems are described in detail through specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a music playing method according to an embodiment of the present application. As shown in fig. 1, the method may include:
s101, acquiring target music.
The target music may be the music currently played by the car machine of the vehicle, or music to be played by the car machine. The target music may be obtained from a storage medium of the car machine system, downloaded from a network, obtained through a communication connection from another terminal device connected to the car machine system, or obtained from an external storage medium connected to the car machine system, where the other terminal device may be a device such as a mobile phone or a computer that has a Bluetooth or network connection with the car machine, and the external storage medium may be, for example, a removable hard disk, a USB flash drive, or the like.
S102, according to the music characteristics of the target music, obtaining emotion recognition results corresponding to each lyric of the target music.
The music features include audio, lyrics, music score images, and the like. The music features may be extracted from the target music by feature extraction models: the audio features may be extracted from the audio data of the target music, the lyric features from the lyric text corresponding to the target music, and the music score image features from the music score image corresponding to the target music. The application does not limit the feature extraction models used to extract the music features; for example, a Transformer model, a Long Short-Term Memory (LSTM) model, and the like may be used, with a feature extraction model suited to each type of music feature used to extract that feature.
The emotion recognition result corresponding to each sentence of lyrics can be obtained in at least the following ways. Implementation A: any one sub-feature is selected, the emotion recognition result of that sub-feature for each sentence of lyrics is obtained by using a single-modal recognition model, and this emotion recognition result is taken as the emotion recognition result corresponding to that sentence of lyrics in the target music.
Implementation B: the emotion recognition result of each music feature is obtained by using single-modal recognition models, and these emotion recognition results are fused to obtain the emotion recognition result corresponding to each sentence of lyrics of the target music. The same single-modal recognition model or different single-modal recognition models may be used for the different music features. The fusion method may be, for example, a Stacking method, a mean method, a voting method, or the like.
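As a concrete illustration of the fusion step in Implementation B, a minimal Python sketch of the voting method is given below; the modality names, the label set, and the tie-break rule are assumptions made for the example and are not prescribed by the application.

```python
from collections import Counter

def fuse_by_voting(modal_results):
    """Fuse per-modality emotion labels by majority vote.

    modal_results maps a modality name ("audio", "lyrics", "score_image")
    to the emotion label predicted by its single-modal recognition model
    for one sentence of lyrics.
    """
    votes = Counter(modal_results.values())
    label, count = votes.most_common(1)[0]
    if count > 1:
        return label
    # No majority: fall back to the audio result (an arbitrary tie-break
    # rule chosen for this sketch, not specified by the application).
    return modal_results.get("audio", label)

# Two of the three single-modal models agree, so "happy" wins the vote.
print(fuse_by_voting({"audio": "happy", "lyrics": "happy", "score_image": "calm"}))
```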
One possible implementation is to perform emotion recognition on each sentence of lyrics separately. For each sentence of lyrics of the target music, the audio and the music score corresponding to that sentence can be determined by acquiring the time marks of the audio, the lyrics, and the music score image. Illustratively, when a sentence of lyrics lasts from 1 minute 15 seconds to 1 minute 30 seconds, the audio data in the period from 1 minute 15 seconds to 1 minute 30 seconds is acquired from the audio, and the musical character sequence in the same period is acquired from the music score image.
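A minimal sketch of this time-mark alignment is shown below; the variable names and the per-character timestamp representation are illustrative assumptions, not structures defined by the application.

```python
def slice_by_lyric_time(audio_samples, sample_rate, chars, char_times, start_s, end_s):
    """Cut out the audio samples and musical characters that fall inside one
    lyric sentence's time window, e.g. 75 s to 90 s (1 min 15 s to 1 min 30 s).

    audio_samples: PCM samples of the whole song; char_times: timestamps (in
    seconds) aligned one-to-one with the musical character sequence chars."""
    lo, hi = int(start_s * sample_rate), int(end_s * sample_rate)
    audio_slice = audio_samples[lo:hi]
    char_slice = [c for c, t in zip(chars, char_times) if start_s <= t < end_s]
    return audio_slice, char_slice

# Toy usage: a 3-minute "song" of silence sampled at 100 Hz and four characters.
audio = [0.0] * (180 * 100)
segment_audio, segment_chars = slice_by_lyric_time(
    audio, 100, ["C4", "E4", "G4", "C5"], [70.0, 76.0, 82.0, 95.0], 75.0, 90.0)
print(len(segment_audio), segment_chars)   # -> 1500 ['E4', 'G4']
```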
In another possible implementation, the target music is time-sliced into a plurality of music pieces, and the emotion recognition result of each music piece is taken as the emotion recognition result of each sentence of lyrics covered by that music piece.
S103, in the process of playing target music, according to emotion recognition results corresponding to each sentence of lyrics of the target music, controlling the body states of the displayed virtual objects.
The virtual object may be, for example, a virtual character, a virtual graphic, or the like. The virtual object may be displayed on a screen in the vehicle, for example the central control screen, the front-passenger screen, a screen on a seat, or a screen at another position in the vehicle; the virtual object may also be an augmented reality (AR) pattern displayed in the vehicle interior, on the vehicle exterior, in a nearby area, or the like. Taking a virtual character as an example, the body state of the virtual object may include the expression, the action, and the like of the virtual character. Fig. 2 is a schematic view of a music playing scene according to an embodiment of the present application. As shown in fig. 2, the scene is the interior of a vehicle and includes a virtual character on the central control screen of the vehicle, and the virtual character can exhibit different body states according to the emotion recognition result of the target music.
The expression and/or action of the virtual character may differ for different emotions of the target music. For example, when the emotion of the target music is happy, the virtual character may be designed with a smiling face, wide-open eyes, both hands raised, and other expressions and/or actions that characterize happiness; when the emotion of the target music is sad, the virtual character may be designed with a crying face, furrowed brows, both hands covering the face, and other expressions and/or actions that characterize sadness; when the emotion of the target music is anger, the virtual character may be designed with an angry expression, tightly knitted brows, clenched fists, and other expressions and/or actions that characterize anger.
The emotion recognition result may include emotion category, emotion style, and by way of example, the emotion recognition result may be as shown in table 1 below:
TABLE 1
The body state parameters of the virtual object when each sentence of lyrics of the target music is played are determined according to the preset mapping relation between the body state of the virtual object and the emotion recognition result corresponding to each sentence of lyrics, and the body state of the virtual object is controlled, according to these parameters, to be the body state corresponding to the emotion recognition result of that sentence of lyrics.
According to the method provided by the embodiment of the application, the target music is obtained, the emotion recognition result corresponding to each sentence of lyrics of the target music is obtained according to the music characteristics of the target music, and in the process of playing the target music, the body state of the displayed virtual object is controlled according to the emotion recognition result corresponding to each sentence of lyrics, so that the body state of the virtual object is the body state corresponding to the emotion recognition result of that sentence of lyrics. This improves the recognition accuracy of music emotion, enriches the music playing modes and styles, and improves the audio-visual experience of users.
Next, taking the case of time-slicing the target music into a plurality of music pieces as an example, how the emotion recognition result corresponding to each sentence of lyrics of the target music is obtained according to the music characteristics of the target music in step S102 is described in detail.
Fig. 3 is a flowchart of another music playing method according to an embodiment of the present application. As shown in fig. 3, the foregoing step S102 may include:
S301, performing time slicing on the target music to obtain a plurality of music pieces of the target music.
The music pieces may be divided, for example, according to the verses and choruses of the target music, or according to sections of the melody, for example the prelude, the interlude, and the like. The division time points of the target music are determined according to the different division strategies, and the target music is time-sliced to obtain the plurality of music pieces. The durations of the plurality of music pieces may be uniform or non-uniform.
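A minimal sketch of turning such division time points into music-piece boundaries follows; how the split points themselves are detected (verse/chorus analysis, prelude detection, and so on) is outside the sketch, and the numbers are invented for illustration.

```python
def slice_music(duration_s, split_points_s):
    """Convert division time points (seconds) into (start, end) music pieces."""
    bounds = [0.0] + sorted(split_points_s) + [duration_s]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

# A 200-second song split at the end of the prelude (15 s), the first verse
# (75 s) and the first chorus (110 s) yields four music pieces.
print(slice_music(200.0, [15.0, 75.0, 110.0]))
# -> [(0.0, 15.0), (15.0, 75.0), (75.0, 110.0), (110.0, 200.0)]
```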
S302, according to the music characteristics of the music piece, obtaining the emotion recognition result corresponding to the music piece.
When the music piece does not contain lyrics, for example when it is a segment such as a prelude or an interlude, the music features include two sub-features: the audio and the music score image of the music piece; when the music piece contains lyrics, for example when it is a verse or chorus segment, the music features include at least two sub-features among the audio, the lyrics, and the music score image of the music piece.
One possible implementation manner is to obtain the confidence coefficient corresponding to the emotion recognition result of each sub-feature of the music piece, and take the emotion recognition result of the sub-feature with the highest confidence coefficient as the emotion recognition result corresponding to the music piece.
In another possible implementation manner, at least two sub-features of the audio, lyrics and music score image in each music piece are extracted through the feature extraction model in the step S102, and emotion recognition is performed on the at least two sub-features of the music piece, so as to obtain an emotion recognition result corresponding to the music piece. The steps of the implementation are as follows:
S3021, obtaining an initial emotion recognition result corresponding to the sub-feature by using the emotion recognition model corresponding to the sub-feature of the music piece and the sub-feature.
The emotion recognition model corresponding to a sub-feature is determined according to the mapping relation between the sub-features of the music piece and emotion recognition models, and the sub-feature itself; for example, the emotion recognition model corresponding to the audio sub-feature is a Bert classification model, the emotion recognition model corresponding to the lyrics sub-feature is an Electra classification model, and the emotion recognition model corresponding to the music score image sub-feature is an LSTM model or an attention-based LSTM model.
Optionally, each sub-feature may be trained by using different emotion recognition models, and after training, performance evaluation indexes of each emotion recognition model, such as recognition accuracy, recall rate, model reasoning time, etc., are obtained, and according to the advantages and disadvantages of these performance evaluation indexes, the emotion recognition model most suitable for the sub-feature is determined.
Inputting the sub-features into an emotion recognition model corresponding to the sub-features of the music piece, and outputting an initial emotion recognition result corresponding to the sub-features.
S3022, obtaining emotion recognition results corresponding to the music pieces according to the initial emotion recognition results corresponding to the sub-features.
Fusing the initial emotion recognition results corresponding to the sub-features, and taking the fused results as emotion recognition results corresponding to the music piece. The fusion method may be, for example, a fusion method such as a stacking method, an average method, or a voting method.
Taking the Stacking method as an example, the initial emotion recognition results corresponding to the sub-features may be used as the model input of an Electra classification model; the trained Electra classification model fuses the initial emotion recognition results corresponding to the sub-features and outputs the emotion recognition result corresponding to the music piece.
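A minimal sketch of the Stacking idea is given below. To stay self-contained it uses a scikit-learn logistic-regression meta-learner on synthetic probabilities instead of the Electra classification model described above, and the three-class label set and modality order are assumptions made only for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

EMOTIONS = ["happy", "calm", "sad"]  # illustrative label set

def stack_row(per_modal_probs):
    """Concatenate each modality's class-probability vector (audio, lyrics,
    score image) into one stacking-input row for the meta classifier."""
    return np.concatenate([per_modal_probs["audio"],
                           per_modal_probs["lyrics"],
                           per_modal_probs["score"]])

# Toy training data: per-modality probability vectors for labelled segments.
rng = np.random.default_rng(0)
X_train = rng.random((60, 9))
y_train = rng.choice(EMOTIONS, size=60)
meta = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Fuse the first-level results of one music piece.
row = stack_row({"audio":  np.array([0.7, 0.2, 0.1]),
                 "lyrics": np.array([0.6, 0.3, 0.1]),
                 "score":  np.array([0.4, 0.4, 0.2])}).reshape(1, -1)
print(meta.predict(row)[0], meta.predict_proba(row).max())
```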
The obtained emotion recognition result corresponding to the music piece may include only the emotion recognition result itself, such as the emotion category and/or emotion style in Table 1, and optionally may further include the confidence of the emotion recognition result. The confidence is obtained through the model processing of the fusion method and can represent the degree of exaggeration of each emotion category and/or emotion style; for example, the higher the confidence, the higher the degree of exaggeration of the emotion category and/or emotion style corresponding to the emotion recognition result. Illustratively, when the emotion category corresponding to the emotion recognition result is "happy", the higher the confidence, the happier the characterized emotion, and the happier the body state presented by the virtual object.
S303, taking the emotion recognition result corresponding to the music piece as the emotion recognition result corresponding to each lyric contained in the music piece.
That is, the emotion recognition result corresponding to each sentence of lyrics covered by the music piece is the emotion recognition result corresponding to that music piece, and the emotion recognition result corresponding to each sentence of lyrics may change only when playback moves on to the next music piece.
In the following, an exemplary description is given of how each sub-feature is used to obtain an emotion recognition model corresponding to the sub-feature, and an initial emotion recognition result corresponding to the sub-feature is obtained by the sub-feature.
When the sub-feature is the music score image:
from the score image, a musical character sequence feature is extracted. The sequence of musical symbols in the score image may be digitized, for example, by acquiring the sequence of musical symbols characteristic of the score image from the score image via a transducer model. Wherein the emotions corresponding to the different character sequence features are different. The specific steps of acquiring the character sequence features of the music score image from the music score image through the transducer model may refer to the prior art, and the present application will not be described herein.
An initial emotion recognition result corresponding to the music score image is obtained by using the character sequence feature and the emotion recognition model corresponding to the music score image. The emotion recognition model may be, for example, a preset emotion recognition model trained with a character-sequence-feature training set and emotion labels, or an emotion recognition model that meets actual performance requirements, selected from a plurality of emotion recognition models according to their performance evaluation indexes. For example, the emotion recognition model corresponding to the music score image may be an Electra model, which is used as a classifier to perform emotion recognition on the character sequence feature and obtain the initial emotion recognition result corresponding to the character sequence feature.
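The following PyTorch sketch illustrates the LSTM option mentioned earlier for the music score image branch: the digitized musical character sequence is embedded and classified into emotion categories. The vocabulary size, hidden size, and four-class output are assumptions, and the model is shown untrained, so the output only has the shape of an initial emotion recognition result.

```python
import torch
import torch.nn as nn

class ScoreEmotionClassifier(nn.Module):
    """Embed the musical character sequence extracted from the score image
    and classify its emotion with a single-layer LSTM."""
    def __init__(self, vocab_size=128, emb_dim=64, hidden=128, num_emotions=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_emotions)

    def forward(self, char_ids):              # char_ids: (batch, seq_len) int64
        x = self.embed(char_ids)
        _, (h, _) = self.lstm(x)               # h: (1, batch, hidden)
        return self.head(h[-1])                # emotion logits per sequence

# One forward pass over a dummy 20-character sequence from one music piece.
model = ScoreEmotionClassifier()
logits = model(torch.randint(0, 128, (1, 20)))
probs = torch.softmax(logits, dim=-1)          # candidate emotions + confidences
print(probs)
```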
When the sub-feature is audio:
and intercepting the audio data corresponding to the music piece, extracting the audio features in the audio data through a feature extraction model, and obtaining an initial emotion recognition result corresponding to the audio through an emotion recognition model corresponding to the audio. The emotion recognition model may be, for example, a preset emotion recognition model trained by an audio feature training set and an emotion tag, or may be an emotion recognition model which meets actual performance requirements and is determined from a plurality of emotion recognition models according to performance evaluation indexes of the models. For example, if the finally determined emotion recognition model is a Bert classification model, the audio features may be input into the Bert classification model for emotion recognition, so as to obtain probabilities of each emotion type and/or each emotion style, and the emotion type and/or emotion style with the highest probability is used as an initial emotion recognition result corresponding to the audio features.
When the sub-feature is lyrics:
each lyric included in the music piece is obtained, text characteristics corresponding to the lyrics in the music piece are extracted through a text recognition mode, and an initial emotion recognition result corresponding to the lyrics is obtained through an emotion recognition model corresponding to the lyrics. The emotion recognition model may be, for example, a preset emotion recognition model trained by a lyric feature training set and an emotion tag, or may be an emotion recognition model which meets actual performance requirements and is determined according to performance evaluation indexes of models from multiple emotion recognition models, and the multiple emotion recognition models may include, for example, fast-Text models, LSTM models, bert models, electric models, and the like.
Next, taking the case where the emotion recognition result corresponding to the music piece in S3022 further includes the confidence of the emotion recognition result as an example, how the emotion recognition result corresponding to the music piece is obtained according to the initial emotion recognition results corresponding to the sub-features is described in detail. Fig. 4 is a flowchart of another music playing method according to an embodiment of the present application. As shown in fig. 4, step S3022 may include:
S401, generating an emotion recognition result sequence according to the initial emotion recognition result corresponding to each sub-feature.
The initial emotion recognition result corresponding to each sub-feature may include at least one emotion recognition result and the confidence of that emotion recognition result. For example, the initial emotion recognition result corresponding to a sub-feature may include multiple emotion recognition results (i.e., multiple emotion categories and/or emotion styles); different emotion recognition results have different confidences, representing the different probabilities of the emotion recognition results for that sub-feature. The different emotion recognition results are ranked according to their confidences to generate the emotion recognition result sequence of that sub-feature.
S402, inputting the emotion recognition result sequence into a preset emotion classification model to obtain emotion recognition results corresponding to the music pieces and confidence degrees of the emotion recognition results.
The emotion recognition result sequences of the plurality of sub-features are combined to generate the input matrix of the preset emotion classification model; the emotion classification model processes the input matrix and outputs the emotion recognition result obtained by fusing the emotion recognition results of the plurality of sub-features, together with the confidence of the fused emotion recognition result.
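A minimal sketch of the data preparation in S401 and S402 follows: each sub-feature's candidate results are sorted by confidence and the sequences are stacked into the input of the preset emotion classification model. The (emotion, confidence) tuple representation and the modality order are assumptions, and the classification model itself is not implemented here.

```python
def result_sequence(candidates):
    """Sort one sub-feature's candidate (emotion, confidence) pairs by
    confidence, highest first, to form its emotion recognition result sequence."""
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

def build_input_matrix(per_modal_candidates):
    """Stack the per-sub-feature sequences into the matrix fed to the preset
    emotion classification model (one row per sub-feature)."""
    order = ("audio", "lyrics", "score")
    return [result_sequence(per_modal_candidates[m]) for m in order]

matrix = build_input_matrix({
    "audio":  [("calm", 0.20), ("happy", 0.72), ("sad", 0.08)],
    "lyrics": [("calm", 0.55), ("happy", 0.40), ("sad", 0.05)],
    "score":  [("happy", 0.60), ("calm", 0.30), ("sad", 0.10)],
})
print(matrix[0])   # -> [('happy', 0.72), ('calm', 0.2), ('sad', 0.08)]
```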
Fig. 5 is a flowchart of another music playing method according to an embodiment of the present application in the implementation manner shown in fig. 4. As shown in fig. 5, in the foregoing step S103, in the process of playing the target music, according to the emotion recognition result corresponding to each lyric of the target music, the method for controlling the posture of the displayed virtual object may include:
S501, acquiring the body state of the virtual object corresponding to each sentence of lyrics according to the emotion recognition result corresponding to each sentence of lyrics of the target music, the confidence of the emotion recognition result, and the mapping relation among the emotion recognition result, the confidence, and the body state.
The mapping relation among the emotion recognition result, the confidence, and the body state means that, for each emotion recognition result, different confidences correspond to different body states of that emotion recognition result. For example, taking the case where the emotion recognition result is "happy" and the virtual object is a virtual character, the mapping relationship between the confidence and the body state may be as shown in Table 2 below:
TABLE 2
Confidence level      Body state
(0, 0.5]              Expression is smile intensity 1, no action
(0.5, 0.6]            Expression is smile intensity 2, no action
(0.6, 0.7]            Expression is smile intensity 3, no action
(0.7, 0.8]            Expression is smile intensity 4, clapping action
(0.8, 0.9]            Expression is smile intensity 5, raising both hands excitedly
(0.9, 1.0]            Expression is smile intensity 6, jumping action
The smile intensity can be achieved by controlling different changes of the mouth and eyes of the virtual character; for example, the higher the smile intensity, the more the corners of the virtual character's mouth turn upward, the wider the mouth opens, and the greater the change at the corners of the eyes.
It should be understood that the foregoing is merely an illustration taking the emotion recognition result "happy" as an example; for other emotion recognition results, the mapping relation among the emotion recognition result, the confidence, and the body state may be set with reference to Table 2, and the body states corresponding to different emotion recognition results may be set according to actual requirements, which is not limited by the present application.
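A minimal lookup corresponding to Table 2 might look as follows; the action identifiers are placeholders for whatever renderer animations are actually used, and only the "happy" emotion is covered, mirroring the table.

```python
# (upper bound of the confidence interval, smile intensity, action), as in Table 2.
HAPPY_POSTURES = [
    (0.5, 1, None),
    (0.6, 2, None),
    (0.7, 3, None),
    (0.8, 4, "clap"),
    (0.9, 5, "raise_both_hands"),
    (1.0, 6, "jump"),
]

def happy_posture(confidence):
    """Map the confidence of a 'happy' recognition result to its body state."""
    for upper, smile, action in HAPPY_POSTURES:
        if confidence <= upper:
            return {"expression": f"smile_{smile}", "action": action}
    return {"expression": "smile_6", "action": "jump"}

print(happy_posture(0.85))   # -> {'expression': 'smile_5', 'action': 'raise_both_hands'}
```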
S502, when playing the lyrics corresponding to the target music, controlling the virtual object to execute the corresponding body states of the lyrics.
One possible implementation is to control the virtual object to perform the body state corresponding to the lyrics by replacing the rendering model of the virtual object: for example, the emotion recognition result of the lyrics of the target music and the confidence of that emotion recognition result are determined, the body state of the virtual object is determined accordingly, and the rendering model of the virtual object corresponding to that body state is selected for display.
In another possible implementation, the virtual object is controlled to perform the body state corresponding to the lyrics through an animation effect. For example, the current body state of the virtual object is determined, a transitional animation effect from the current body state to the body state corresponding to the emotion recognition result of the lyrics and its confidence is determined, the virtual object is controlled to perform the body state corresponding to the lyrics through the transitional animation effect, and the corresponding body state is maintained while the lyrics of the target music are being played.
According to the method provided by the embodiment of the application, the initial emotion recognition result of each sub-feature is output through the emotion recognition model corresponding to each of the plurality of sub-features of the music piece of the target music. The initial emotion recognition results of the multiple sub-features are then fused through a preset emotion classification model, which performs a further step of emotion recognition. By fusing and judging information of multiple modalities such as audio, lyrics, and music score images, the emotion recognition result of the music piece is comprehensively determined from aspects such as the music style, the lyric theme, and the singing effect of the music piece of the target music; compared with the existing single-modal music emotion recognition methods, this improves the accuracy of the emotion recognition result of the music piece of the target music.
Further, in the target music, the emotions corresponding to different music pieces, or to different sentences of lyrics, may be different. Therefore, emotion changes are also involved during the playing of the target music, and how to control the virtual object to cope with such emotion changes is another problem to be solved.
Fig. 6 is a flowchart of another music playing method according to an embodiment of the present application. As shown in fig. 6, the foregoing step S102 may further include:
S601, for any sentence of lyrics, acquiring the emotion change between that sentence of lyrics and the next sentence of lyrics according to the emotion recognition result corresponding to that sentence of lyrics and the emotion recognition result corresponding to the next sentence of lyrics.
Case a: the emotion recognition result corresponding to the lyrics of the sentence and the emotion recognition result corresponding to the lyrics of the next sentence are not the same emotion recognition result. For example, if the emotion recognition result corresponding to the lyrics of the sentence is happy and the emotion recognition result corresponding to the lyrics of the next sentence is excited, the emotion change between the lyrics of the sentence and the lyrics of the next sentence changes from happy to excited.
Case B: the emotion recognition result corresponding to the lyrics of the sentence is the same as the emotion recognition result corresponding to the lyrics of the next sentence, but the confidence level of the emotion recognition result corresponding to the lyrics of the sentence is different from the confidence level of the emotion recognition result corresponding to the lyrics of the next sentence. For example, both the emotion recognition result corresponding to the lyrics of the sentence and the emotion recognition result corresponding to the lyrics of the next sentence are happy, but the confidence level of the emotion recognition result corresponding to the lyrics of the sentence is 0.85, and the confidence level of the emotion recognition result corresponding to the lyrics of the next sentence is 0.65, and the emotion between the lyrics of the sentence and the lyrics of the next sentence changes from a happy emotion with a confidence level of 0.85 to a happy emotion with a confidence level of 0.65.
S602, according to the emotion change, adjusting an emotion recognition result of the lyrics of the sentence at least one target time point to transition emotion from the emotion recognition result corresponding to the lyrics of the sentence to the emotion recognition result corresponding to the lyrics of the next sentence.
The duration of the emotion change and the step length of each incremental change within it are determined according to the emotion change. The number of intermediate emotions required to complete the emotion change, and the corresponding intermediate emotions, are then determined according to the duration and the step length. Timing starts at the time point at which the emotion change begins, and each time a step length elapses, the virtual object switches to the intermediate emotion corresponding to that step, until the emotion change is completed.
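The timing logic just described can be sketched minimally as follows; it only computes when each intermediate emotion is triggered, and the 2-second duration with a 0.5-second step anticipates the case A illustration below.

```python
def transition_schedule(duration_s, step_s):
    """Return the time offsets (seconds from the start of the emotion change)
    at which the body state switches to the next intermediate emotion."""
    steps = int(round(duration_s / step_s))
    return [round((i + 1) * step_s, 3) for i in range(steps)]

# A 2-second emotion change sampled every 0.5 s switches at 0.5, 1.0, 1.5 and
# 2.0 seconds: three intermediate emotions plus the final target emotion.
print(transition_schedule(2.0, 0.5))   # -> [0.5, 1.0, 1.5, 2.0]
```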
Illustratively, the description will be continued with examples in case a and case B described above:
In case A, the emotion changes from happy to excited; if the duration of the emotion change is 2 seconds and the step length is 0.5 seconds, 3 intermediate emotion changes are required to complete the change from happy to excited. The intermediate emotions can be obtained by splitting the animation of the virtual object changing from the happy state to the excited state. For example, if that animation comprises 80 frames, the 80 frames are divided into four parts of 20 frames each; in the 0.5 seconds from the start of the emotion change to the first step, the first 20 frames are played, in the 0.5 seconds from the first step to the second step the next 20 frames are played, and so on until the emotion change is completed.
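The 80-frame example can be sketched as splitting the transition animation into equal chunks, one per step; the frame count and step count come from the example above, everything else is illustrative.

```python
def frame_chunks(total_frames, num_steps):
    """Split a transition animation into equal chunks of frames, one chunk
    played per step of the emotion change."""
    size = total_frames // num_steps
    return [(i * size, (i + 1) * size) for i in range(num_steps)]

# 80 frames over 4 steps -> play frames 0-19, then 20-39, 40-59, 60-79.
print(frame_chunks(80, 4))   # -> [(0, 20), (20, 40), (40, 60), (60, 80)]
```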
In case B, the emotion changes from a happy emotion with a confidence of 0.85 to a happy emotion with a confidence of 0.65; assuming that the duration of the emotion change is 1 second and the step length is 0.5 seconds, 1 intermediate emotion (i.e., 2 emotion changes) is required to complete the change. The confidence corresponding to the intermediate emotion may be determined by dividing the confidence difference by the number of emotion changes; here it is 0.75, and the body state of the virtual object for a happy emotion with a confidence of 0.75 is obtained. After the emotion change starts, the body state of the virtual object at the starting moment is the body state corresponding to the happy emotion with a confidence of 0.85; after 0.5 seconds it is the body state corresponding to the happy emotion with a confidence of 0.75; and after 1 second (i.e., when the emotion change is completed) it is the body state corresponding to the happy emotion with a confidence of 0.65.
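For case B, the confidence of the intermediate emotion can be sketched as a simple linear interpolation matching the 0.85 to 0.75 to 0.65 example; the rounding is only cosmetic.

```python
def interpolate_confidence(start_conf, end_conf, num_changes):
    """Evenly spaced confidences between two body states of the same emotion:
    the step is the confidence difference divided by the number of changes."""
    step = (end_conf - start_conf) / num_changes
    return [round(start_conf + step * i, 3) for i in range(num_changes + 1)]

print(interpolate_confidence(0.85, 0.65, 2))   # -> [0.85, 0.75, 0.65]
```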
According to the method provided by the embodiment of the application, the emotion change between a sentence of lyrics and the next sentence of lyrics is acquired according to the emotion recognition result corresponding to that sentence of lyrics and the emotion recognition result corresponding to the next sentence of lyrics, and the emotion recognition result of that sentence of lyrics at at least one target time point is adjusted according to the emotion change, so that the emotion transitions from the emotion recognition result corresponding to that sentence of lyrics to the emotion recognition result corresponding to the next sentence of lyrics. This improves the smoothness and naturalness of the change in the body state when the virtual object is controlled to follow the emotion change, and improves the audio-visual experience of the user.
Fig. 7 is a schematic structural diagram of a music playing device according to an embodiment of the present application. As shown in fig. 7, the music playing device may include: the acquisition module 11, the processing module 12 and the control module 13.
An acquisition module 11, configured to acquire target music.
The processing module 12 is configured to obtain, according to the music feature of the target music, a emotion recognition result corresponding to each lyric of the target music.
And the control module 13 is used for controlling the body state of the displayed virtual object according to the emotion recognition result corresponding to each lyric of the target music in the process of playing the target music.
In one possible implementation, the processing module 12 is specifically configured to time slice the target music to obtain a plurality of pieces of music of the target music. And obtaining emotion recognition results corresponding to the music piece according to the music characteristics of the music piece. And taking the emotion recognition result corresponding to the music piece as the emotion recognition result corresponding to each lyric covered by the music piece.
In this implementation, optionally, if the music features include at least two sub-features among audio, lyrics, and a music score image, the processing module 12 is specifically configured to obtain an initial emotion recognition result corresponding to each sub-feature by using the emotion recognition model corresponding to the sub-feature of the music piece and the sub-feature, and to obtain the emotion recognition result corresponding to the music piece according to the initial emotion recognition results corresponding to the sub-features.
Alternatively, when the sub-feature is a music score image, the processing module 12 is specifically configured to extract a musical character sequence feature according to the music score image. And obtaining an initial emotion recognition result corresponding to the music score image by using the character sequence characteristics and the emotion recognition model corresponding to the music score image.
In another possible implementation manner, the processing module 12 is specifically configured to generate an emotion recognition result sequence according to the initial emotion recognition result corresponding to each of the sub-features. And inputting the emotion recognition result sequence into a preset emotion classification model to obtain emotion recognition results corresponding to the music pieces and the confidence level of the emotion recognition results.
In this implementation manner, optionally, the control module 13 is specifically configured to obtain, according to the emotion recognition result corresponding to each lyric of the target music, the confidence level of the emotion recognition result, and the mapping relationship among the emotion recognition result, the confidence level, and the posture, the posture of the virtual object corresponding to each lyric. And when the lyrics corresponding to the target music are played, controlling the virtual object to execute the corresponding body states of the lyrics.
In any implementation manner, the processing module 12 is specifically configured to obtain, for any lyric, a change in emotion between the lyric of the sentence and the lyric of the next sentence according to an emotion recognition result corresponding to the lyric of the sentence and an emotion recognition result corresponding to the lyric of the next sentence. And according to the emotion change, adjusting an emotion recognition result of the lyrics of the sentence at least one target time point so as to transition the emotion from the emotion recognition result corresponding to the lyrics of the sentence to the emotion recognition result corresponding to the lyrics of the next sentence.
The music playing device provided by the embodiment of the application can execute the music playing method in the method embodiment, and the implementation principle and the technical effect are similar, and are not repeated here.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device is used for executing the music playing method, and may be, for example, the car machine. As shown in fig. 8, the electronic device 800 may include: at least one processor 801, a memory 802, a communication interface 803.
A memory 802 for storing programs. In particular, the program may include program code including computer-operating instructions.
Memory 802 may comprise high-speed RAM and may also include non-volatile memory, such as at least one disk memory.
The processor 801 is configured to execute computer-executable instructions stored in the memory 802 to implement the methods described in the foregoing method embodiments. The processor 801 may be a CPU or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC) or one or more integrated circuits configured to implement embodiments of the present application.
The processor 801 may communicate with external devices, such as a mobile phone connected to the car machine or a network server, through the communication interface 803. In a specific implementation, if the communication interface 803, the memory 802, and the processor 801 are implemented independently, they may be connected to each other and communicate with each other through a bus. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on, but this does not mean that there is only one bus or one type of bus.
Alternatively, in a specific implementation, if the communication interface 803, the memory 802, and the processor 801 are implemented on a single chip, the communication interface 803, the memory 802, and the processor 801 may communicate with one another through internal interfaces.
The present application also provides a computer-readable storage medium, which may include: a U-disk (USB flash drive), a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or another medium in which program code may be stored. Specifically, the computer-readable storage medium stores program instructions for the methods in the foregoing embodiments.
The present application also provides a program product, which includes execution instructions stored in a readable storage medium. At least one processor of a computing device may read the execution instructions from the readable storage medium, and execution of the instructions by the at least one processor causes the computing device to implement the music playing method described above.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced with equivalents, and such modifications and replacements do not depart from the spirit of the present application.

Claims (10)

1. A music playing method, comprising:
acquiring target music;
according to the music characteristics of the target music, obtaining emotion recognition results corresponding to each sentence of lyrics of the target music;
and in the process of playing the target music, controlling the body state of the displayed virtual object according to the emotion recognition result corresponding to each sentence of lyrics of the target music.
2. The method of claim 1, wherein the obtaining, according to the music feature of the target music, the emotion recognition result corresponding to each lyric of the target music includes:
performing time slicing on the target music to obtain a plurality of music pieces of the target music;
according to the music characteristics of the music pieces, obtaining emotion recognition results corresponding to the music pieces;
and taking the emotion recognition result corresponding to the music piece as the emotion recognition result corresponding to each lyric contained in the music piece.
3. The method of claim 2, wherein the music feature comprises: at least two sub-features among audio, lyrics, and a music score image; and the obtaining, according to the music feature of the music piece, the emotion recognition result corresponding to the music piece comprises the following steps:
obtaining an initial emotion recognition result corresponding to the sub-feature by using the sub-feature of the music piece and an emotion recognition model corresponding to the sub-feature;
and obtaining emotion recognition results corresponding to the music pieces according to the initial emotion recognition results corresponding to the sub-features.
4. The method of claim 3, wherein the sub-feature is a music score image; and the obtaining the initial emotion recognition result corresponding to the sub-feature by using the sub-feature of the music piece and the emotion recognition model corresponding to the sub-feature comprises the following steps:
extracting a musical character sequence feature from the music score image;
and obtaining an initial emotion recognition result corresponding to the music score image by using the character sequence feature and the emotion recognition model corresponding to the music score image.
5. The method of claim 3, wherein the obtaining the emotion recognition result corresponding to the music piece according to the initial emotion recognition result corresponding to each sub-feature includes:
generating an emotion recognition result sequence according to the initial emotion recognition result corresponding to each sub-feature;
and inputting the emotion recognition result sequence into a preset emotion classification model to obtain the emotion recognition result corresponding to the music piece and the confidence level of the emotion recognition result.
6. The method according to claim 5, wherein the controlling the posture of the virtual object according to the emotion recognition result corresponding to each lyric of the target music during the playing of the target music includes:
acquiring the posture of the virtual object corresponding to each sentence of lyrics of the target music according to the emotion recognition result corresponding to each sentence of lyrics of the target music, the confidence level of the emotion recognition result, and the mapping relationship among the emotion recognition result, the confidence level, and the posture;
and when each sentence of lyrics of the target music is played, controlling the virtual object to perform the posture corresponding to that sentence of lyrics.
7. The method according to any one of claims 2-6, wherein the obtaining the emotion recognition result corresponding to each lyric of the target music includes:
for any sentence of lyrics, acquiring the emotion change between that sentence of lyrics and the next sentence of lyrics according to the emotion recognition result corresponding to that sentence of lyrics and the emotion recognition result corresponding to the next sentence of lyrics;
and adjusting, according to the emotion change, the emotion recognition result of that sentence of lyrics at at least one target time point, so that the emotion transitions from the emotion recognition result corresponding to that sentence of lyrics to the emotion recognition result corresponding to the next sentence of lyrics.
8. A music playing device, comprising:
the acquisition module is used for acquiring target music;
the processing module is used for acquiring emotion recognition results corresponding to each sentence of lyrics of the target music according to the music characteristics of the target music;
and the control module is used for controlling the body state of the displayed virtual object according to the emotion recognition result corresponding to each sentence of lyrics of the target music in the process of playing the target music.
9. An electronic device, comprising: a processor, a memory, and a communication interface, wherein the processor is in communication connection with the communication interface and the memory, respectively;
the memory stores computer-executable instructions;
the communication interface performs communication interaction with external equipment;
the processor executes computer-executable instructions stored in the memory to implement the method of any one of claims 1-7.
10. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor are for implementing a music playing method as claimed in any one of claims 1 to 7.
CN202311026921.0A 2023-08-15 2023-08-15 Music playing method, device, equipment and storage medium Pending CN117056557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311026921.0A CN117056557A (en) 2023-08-15 2023-08-15 Music playing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311026921.0A CN117056557A (en) 2023-08-15 2023-08-15 Music playing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117056557A true CN117056557A (en) 2023-11-14

Family

ID=88654778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311026921.0A Pending CN117056557A (en) 2023-08-15 2023-08-15 Music playing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117056557A (en)

Similar Documents

Publication Publication Date Title
EP3803846B1 (en) Autonomous generation of melody
CN109862393B (en) Method, system, equipment and storage medium for dubbing music of video file
CN111415677B (en) Method, apparatus, device and medium for generating video
CN110838286A (en) Model training method, language identification method, device and equipment
CN108346427A (en) A kind of audio recognition method, device, equipment and storage medium
CN110853617B (en) Model training method, language identification method, device and equipment
EP3616190A1 (en) Automatic song generation
CN109801349B (en) Sound-driven three-dimensional animation character real-time expression generation method and system
CN109254669A (en) A kind of expression picture input method, device, electronic equipment and system
CN114073854A (en) Game method and system based on multimedia file
CN107767850A (en) A kind of singing marking method and system
CN113238654A (en) Multi-modal based reactive response generation
CN110263218A (en) Video presentation document creation method, device, equipment and medium
CN116704085A (en) Avatar generation method, apparatus, electronic device, and storage medium
CN114373480A (en) Training method of voice alignment network, voice alignment method and electronic equipment
CN104270501B (en) The head portrait setting method of a kind of contact person in address list and relevant apparatus
CN117152308B (en) Virtual person action expression optimization method and system
CN110781327B (en) Image searching method and device, terminal equipment and storage medium
CN111859008B (en) Music recommending method and terminal
CN117056557A (en) Music playing method, device, equipment and storage medium
US20220414472A1 (en) Computer-Implemented Method, System, and Non-Transitory Computer-Readable Storage Medium for Inferring Audience's Evaluation of Performance Data
CN116564269A (en) Voice data processing method and device, electronic equipment and readable storage medium
CN114443889A (en) Audio acquisition method and device, electronic equipment and storage medium
CN112633136B (en) Video analysis method, device, electronic equipment and storage medium
JP2015176592A (en) Animation generation device, animation generation method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination