CN116504206A - Camera capable of identifying environment and generating music - Google Patents

Camera capable of identifying environment and generating music

Info

Publication number
CN116504206A
CN116504206A (application CN202310264005.4A)
Authority
CN
China
Prior art keywords
music
emotion
module
scene
hidden layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310264005.4A
Other languages
Chinese (zh)
Other versions
CN116504206B (en)
Inventor
孙鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wolf Vision Technology Co ltd
Original Assignee
Shenzhen Wolf Vision Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Wolf Vision Technology Co ltd filed Critical Shenzhen Wolf Vision Technology Co ltd
Priority to CN202310264005.4A priority Critical patent/CN116504206B/en
Publication of CN116504206A publication Critical patent/CN116504206A/en
Application granted granted Critical
Publication of CN116504206B publication Critical patent/CN116504206B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G10H1/0025 Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/02 Casings; Cabinets; Supports therefor; Mountings therein
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00 Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/075 Musical metadata derived from musical analysis or for use in electrophonic musical instruments
    • G10H2240/085 Mood, i.e. generation, detection or selection of a particular emotional content or atmosphere in a musical piece
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention relates to a camera that identifies its environment and generates music. The camera comprises a camera module, a loudspeaker, a scene analysis module, an emotion recognition module and a music generation module. When the camera module captures an image of the user's facial expression, the emotion recognition module extracts the user's emotional features, a database is searched for music fragments labelled with the corresponding emotional features, the fragment data are input into a music generation model, and the model generates music with the corresponding emotional character, which is output to the loudspeaker. Because the camera captures both the surrounding environment and the user's face, a background melody in a matching style can be generated from the surroundings and the user's mood, and music of different styles can be switched as the user's expression changes, making it convenient for people facing the pressures of life or work to enjoy music-based decompression and relieve stress.

Description

Camera capable of identifying environment and generating music
Technical Field
The invention relates to the technical field of music generation, and in particular to a camera that identifies its environment and generates music.
Background
Music has a subtle influence on people's minds and bodies. With the development of the Internet and cloud music, music occupies an ever larger share of daily life and quietly regulates physical and mental health. People can feel this effect clearly in everyday situations: playing suitable music on suitable occasions relaxes body and mind considerably. For example, listening to a passionate, stirring symphony when one's mood is low allows that low mood to be released to some extent, and light music can soothe irritability. To relieve pressure and create a comfortable, positive atmosphere, factories, enterprises and shopping malls mask environmental noise by broadcasting music, creating a relaxed and comfortable environment. However, current background-music control systems can only select tracks at random; they cannot adjust the volume according to a person's mood, cannot change the sound effects, cannot match the listener's emotional state, and lack a human touch. At the same time, repetitive music easily becomes tiresome and does not help relieve pressure.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provide a camera that identifies the environment and generates music.
The technical scheme adopted for solving the technical problems is as follows:
A camera for identifying an environment and generating music is constructed, the camera comprising a camera module, a loudspeaker, a scene analysis module, an emotion recognition module and a music generation module;
the camera module comprises an angle adjusting mechanism and a camera mounted on the angle adjusting mechanism, wherein the camera acquires multi-angle environment images through the angle adjusting mechanism and transmits the images to the emotion recognition module and the scene analysis module;
the emotion recognition module is used for detecting whether a face exists in the environment image, if so, recognizing the facial expression of the user through an expression recognition technology, acquiring emotion characteristics corresponding to the facial expression, and transmitting the result to the music generation module;
the scene analysis module is used for acquiring scene characteristics through the Internet and transmitting the scene characteristics to the music generation module, wherein the scene characteristics comprise weather characteristics and time period characteristics, and the time period characteristics comprise night, morning, noon, afternoon and dusk;
the music generation module comprises a music database and a music generation model, wherein the music database stores a plurality of music fragments, each fragment being marked with a style label corresponding to a scene feature or an emotion feature; the music generation module randomly selects music fragments whose style labels correspond to the received scene features or emotion features, converts the fragments into matrices, and inputs the matrices into the music generation model to generate a background melody of the corresponding style;
the music generation model is a deep belief network model comprising an input layer, hidden layers and an output layer; there are five hidden layers, the numbers of nodes of the first to third hidden layers decrease in turn, the fourth hidden layer has the same number of nodes as the second, and the fifth hidden layer has the same number of nodes as the first; each hidden layer receives data from the layer before it and outputs data to the layer after it; the music generation model outputs a binary matrix, which is converted into music and output to the loudspeaker; each column of the binary matrix represents a duration (time step) and each row represents a pitch;
the loudspeaker is used for playing the background melody generated by the music generation module.
Preferably, an expression recognition model is arranged in the emotion recognition module, and the training method of the expression recognition model comprises a preprocessing step, a deep feature learning step and a deep feature classification step, wherein:
the preprocessing step comprises: detecting a human face with a face detector, removing the background and non-face regions to obtain a facial image, and aligning the facial image; randomly cropping from the four corners and the centre of the facial image and then flipping horizontally; then applying illumination normalization and pose normalization to the facial image;
the deep feature learning step uses a convolutional neural network, as follows: the facial image is convolved by convolution layers to produce a number of activated feature maps of specific types, global average pooling is applied by a pooling layer, and finally a fully connected layer converts the 2D feature maps into a 1D feature vector and outputs it; the global features of the face are thus obtained, and the model parameters of the expression recognition model are trained on these global features to obtain the expression recognition model;
the deep feature classification step comprises: adding a softmax classifier to the expression recognition model after parameter learning is completed, computing with the softmax classifier the probability that the facial image belongs to each expression, and taking the emotion type with the largest probability as the emotion of the facial image; the emotional features are divided into seven types: anger, aversion, fear, happiness, sadness, surprise and neutral.
Preferably, the aligning step includes: the detected face is normalized to a size of 48 x 48 so that the interocular distance of the face reaches a preset value and both eyes are located at preset vertical coordinates.
Preferably, the music fragments are stored in the music database in the form of MIDI files, the music information in a MIDI file is extracted as a matrix, the matrix extracted from the MIDI file is a binary matrix stored as a two-dimensional scatter diagram, each column of the matrix represents one sixteenth-note time step, and each row represents a pitch.
Preferably, the visible layer is the input layer of a restricted Boltzmann machine and the hidden layer is the feature extraction layer of the restricted Boltzmann machine; the number of nodes of the second hidden layer is one quarter of that of the first hidden layer, and the number of nodes of the third hidden layer is one quarter of that of the second hidden layer.
Preferably, the music generation module classifies the anger, aversion, fear and sadness expressions as negative emotions, happiness and surprise as positive emotions, and neutral as normal emotion; when the identified emotion is positive, the loudspeaker's output volume is increased by 4-5 dB, the deep bass and treble are both increased by 2-3 dB, and surround sound is switched on; when the identified emotion is normal, the loudspeaker sets the output volume to 45-50 dB, the bass to 22-25 dB and the treble to 18-21 dB, and surround sound is switched on; when the identified emotion is negative, the loudspeaker reduces the output volume by 5-6 dB, switches off the deep bass, reduces the treble by 2-3 dB, and switches off surround sound.
Preferably, the system further comprises a radar module and a pickup, wherein the radar module comprises a microwave radar and a laser radar, and the laser radar is used for detecting whether a personnel target exists in a preset range and acquiring the moving speed of the personnel target; the microwave radar acquires micro-motion signals of the target personnel, and screens out heartbeat signals according to the micro-motion signals to acquire personnel heartbeat information; the pickup is used for collecting environmental sounds and transmitting collected sound information to the scene analysis module.
Preferably, the scene analysis module is further used to identify background sounds in the sound information and to identify a plurality of objects in the environment image and obtain their names; a scene database is arranged in the scene analysis module and stores text associated with each object and text associated with each background sound. The scene analysis module splices the text corresponding to the objects and the text corresponding to the background sounds into a scene text, obtains the syllable of each character in the scene text, and arranges the syllables in the order of the characters; the music generation module then matches each note of the background melody to the syllable of each character in turn. The music generation module stretches or compresses the syllable durations so that each syllable lasts as long as its corresponding note, pitch-shifts the syllables of the Chinese characters, and mixes them with the background melody it has generated.
Preferably, the scene analysis module is further configured to obtain the change in a person target's moving speed and the change in loudness of the sound information over a period of time, and to analyse the brightness values in the environment images over that period; the scene database stores a moving-speed-to-pitch comparison table, a background-noise-loudness-to-duration comparison table and a brightness-to-timbre comparison table; if the radar module detects a person target, the scene analysis module obtains the corresponding pitch changes from the changes in the target's moving speed, the corresponding note durations from the background-noise loudness, and the corresponding timbre from the brightness value, then splices the pitches, durations and timbre into a complete score and converts the score into music.
The invention has the following beneficial effects. The camera module is installed in public places such as shopping malls, factories and parks; it captures environment images and transmits them to the emotion recognition module, which detects whether a face is present. If a face is present, the user's facial expression is recognized and the corresponding emotional features are obtained; if the emotion is recognized as negative or positive, the emotional features are transmitted to the music generation module, and if the emotion is normal or no face is recognized, the scene analysis module obtains the time-period and weather features and transmits them to the music generation module. Music fragments matching the emotional or scene features are retrieved, converted one by one into binary matrices and fed into the music generation model, which generates music from the input data and sends it to the loudspeaker; the loudspeaker plays the background melody generated by the music generation module. Because the camera captures both the surrounding environment and the user's face, a background melody in a matching style can be generated from the surroundings and the user's mood, and music of different styles can be switched as expressions change, making it convenient for people under life or work pressure to enjoy music-based decompression and relieve stress. Generating music suited to the recognized expression can also be used in medical settings such as psychological diagnosis and treatment, relaxing the patient and improving the therapeutic effect, or applied to music selection in public places such as shopping malls and restaurants to improve the consumer experience. It can further mitigate noise pollution in industrial parks, markedly reducing the harmful effects of noise on people's health while lowering work pressure and helping regulate mood.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the present invention will be further described with reference to the accompanying drawings and embodiments, in which the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained by those skilled in the art without inventive effort:
FIG. 1 is a block diagram of a camera for recognizing an environment and generating music in accordance with a preferred embodiment of the present invention;
fig. 2 is a schematic diagram of a music generation model of a camera that recognizes an environment and generates music according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the following description will be made in detail with reference to the technical solutions in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by a person skilled in the art without any inventive effort, are intended to be within the scope of the present invention, based on the embodiments of the present invention.
The camera for recognizing the environment and generating music according to the preferred embodiment of the invention, as shown in fig. 1, comprises a camera module, a loudspeaker, a scene analysis module, an emotion recognition module and a music generation module;
the camera module comprises an angle adjusting mechanism and a camera mounted on it; the camera acquires multi-angle environment images through the angle adjusting mechanism and transmits the images to the emotion recognition module and the scene analysis module. Through the angle adjusting mechanism the camera can capture several environment images from different angles at once, and images can be captured repeatedly at a freely configurable interval, preferably once every 30 minutes in this embodiment; the angle adjusting mechanism is specifically a motorized mechanical arm;
the emotion recognition module detects whether a face is present in the environment image; if so, it recognizes the user's facial expression by expression recognition, obtains the emotional features corresponding to the facial expression, and transmits the result to the music generation module; the recognized expression is one of anger, aversion, fear, happiness, sadness, surprise and neutral, with anger, aversion, fear and sadness classified as negative emotions, happiness and surprise as positive emotions, and neutral as normal emotion;
the scene analysis module obtains scene features through the Internet and transmits them to the music generation module; the scene features comprise weather features and time-period features, the time-period features comprising night, morning, noon, afternoon and dusk, and the weather features comprising clear, light rain, heavy rain, cloudy and heavy fog;
the music generation module comprises a music database and a music generation model; the music database stores a plurality of music fragments, each marked with a style label corresponding to a scene feature or an emotion feature; the module randomly selects fragments whose style labels correspond to the received scene or emotion features, converts them into matrices, and inputs the matrices into the music generation model to generate a background melody of the corresponding style;
the music generation model is a deep belief network comprising an input layer, hidden layers and an output layer; there are five hidden layers, the numbers of nodes of the first to third hidden layers decrease in turn, the fourth hidden layer has the same number of nodes as the second, and the fifth hidden layer has the same number of nodes as the first; each hidden layer receives data from the layer before it and outputs data to the layer after it; the model outputs a binary matrix, which is converted into music and output to the loudspeaker; each column of the binary matrix represents a duration (time step) and each row represents a pitch;
and the loudspeaker is used for playing the background melody generated by the music generation module.
People working in environments such as factories are under heavy pressure and often need to relax after work, relieving that pressure through music. Existing broadcast systems, when a shuffle function is selected, simply pick and play tracks at random from the hundreds of MP3 files stored in external memory. Because randomly played music takes no account of the listener's mood or tastes, a track that does not match those tastes will not relieve pressure and may even dampen the mood, affecting work efficiency and physical and mental health. The camera for recognizing the environment and generating music provided by the invention captures images of the surroundings, transmits them to the emotion recognition module, detects whether a face is present, recognizes the facial expression, and generates music in different styles for different expressions. A user can enrol their face through the face recognition system and associate it with favourite music; when the camera recognizes that user, a piece is selected from the associated music and fed into the music generation module, and the music generation model produces new music in the same style, so the user is not bored by repetition. The camera detects facial emotion accurately and in real time, and the beat and rhythm of the generated music can be modulated: a slower tempo helps the user calm down, while a moderately faster tempo can raise the heart rate slightly so that the user feels energized and cheerful (the acceleration has an upper threshold and must not push the heart rate too high, which would be unhealthy).
The camera module is installed indoors and captures environment images, which are transmitted to the emotion recognition module. The emotion recognition module detects whether a face is present; if so, the user's facial expression is recognized and the corresponding emotional features obtained. If the emotion is recognized as negative or positive, the emotional features are transmitted to the music generation module; if the emotion is normal or no face is recognized, the scene analysis module obtains the time-period and weather features and transmits them to the music generation module.
For example, if the emotion recognition module detects no face in the environment image, the weather feature obtained by the scene analysis module is sunny and the time-period feature is morning, then the database is searched for music fragments labelled sunny and morning and a number of fragments are selected; the number may be 12 to 36 and is preferably 20 in this embodiment. The fragments are converted one by one into binary matrices and fed into the music generation model, which generates music from the input data and sends it to the loudspeaker, and the loudspeaker plays the background melody generated by the music generation module. Because the camera captures both the surrounding environment and the user's face, a background melody in a matching style can be generated from the surroundings and the user's mood, and music of different styles can be switched as expressions change, helping people under life or work pressure to enjoy music-based decompression and relieve stress.
The application can also be installed outdoors or at an entrance, and in particular in leisure public places such as parks and squares. The camera module captures outdoor environment images from several angles through the angle adjusting mechanism; the expression detection module detects multiple faces in the images, recognizes the expression of each face, determines the expression that occurs most frequently, and sends that emotional feature to the music generation module, which generates music corresponding to it and plays it through the loudspeaker, so that passers-by can release their stress.
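As an illustration of the majority-vote selection just described, the following Python sketch picks the most frequent expression among the detected faces and maps it to the negative/positive/normal grouping used by the music generation module; the function names and label strings are illustrative and not part of the patent.

from collections import Counter

# The seven expression categories used by the emotion recognition module.
NEGATIVE = {"anger", "aversion", "fear", "sadness"}
POSITIVE = {"happiness", "surprise"}

def dominant_expression(face_expressions):
    """Return the expression that occurs most often among all detected faces.

    face_expressions: list of per-face labels, e.g. ["happiness", "fear", "happiness"].
    """
    if not face_expressions:
        return None  # no faces: fall back to scene features (weather / time period)
    label, _count = Counter(face_expressions).most_common(1)[0]
    return label

def emotion_class(expression):
    """Map a single expression to the negative / positive / normal grouping."""
    if expression in NEGATIVE:
        return "negative"
    if expression in POSITIVE:
        return "positive"
    return "normal"  # "neutral" or unknown

# Example: three pedestrians detected, two look happy and one looks afraid.
print(emotion_class(dominant_expression(["happiness", "fear", "happiness"])))  # -> positive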
As shown in fig. 2, the input layer of the music generation model of this embodiment is the input layer of a Restricted Boltzmann Machine (RBM), and the hidden layer is the feature extraction layer of the restricted Boltzmann machine. The music generation model initially comprises an input layer, one hidden layer and an output layer. The input layer is trained with a restricted Boltzmann machine and the encoding part is kept as a new hidden layer (the first hidden layer); a further hidden layer (the fifth hidden layer) is created between the output layer and the existing hidden layer, with its weights initialized to those of the first hidden layer. The first hidden layer is then trained with a restricted Boltzmann machine and its encoding part kept as another new hidden layer (the second hidden layer); a further hidden layer (the fourth hidden layer) is added between the output layer and the existing hidden layer, with its weights initialized to those of the second hidden layer. The hidden layer that existed initially becomes the third hidden layer. After all layers have been added, backpropagation is run over the whole deep belief network to fine-tune the parameters;
the pieces of music are stored in the database in the form of MIDI files, and the pieces of music information in the MIDI files are extracted in the form of a matrix of (time, pitch) and stored in the npz file in the form of a sparse matrix. The pretty midi library provides traversing notes (notes) in each track and deriving the pitch (pitch) of each Note, the Note start time (Note on) and Note end time (Note off), dividing the start and end times by the length of the sixteen notes (60 seconds/120 BPM/4), respectively, to derive the corresponding positions of the start and end times in the matrix.
The matrix extracted from MIDI file is binary matrix, which is stored in the form of two-dimensional scatter diagram, each column in binary matrix represents a 16-note tone value, and each row represents a pitch. In the figure, 1 and 0 indicate the presence and absence of a specific time note.
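The MIDI-to-matrix conversion described above can be sketched as follows in Python, assuming the pretty_midi and scipy packages; the 120 BPM sixteenth-note grid follows the 60 s / 120 BPM / 4 formula in the text, while the pitch range of 128 and the file names are assumptions made for illustration.

import numpy as np
import pretty_midi
from scipy import sparse

SIXTEENTH = 60.0 / 120.0 / 4.0  # length of a sixteenth note at 120 BPM, in seconds

def midi_to_binary_matrix(path, n_pitches=128):
    """Convert a MIDI file into a binary (pitch x time-step) matrix."""
    pm = pretty_midi.PrettyMIDI(path)
    notes = [n for inst in pm.instruments if not inst.is_drum for n in inst.notes]
    if not notes:
        return sparse.csr_matrix((n_pitches, 0), dtype=np.uint8)
    n_steps = int(np.ceil(max(n.end for n in notes) / SIXTEENTH))
    roll = np.zeros((n_pitches, n_steps), dtype=np.uint8)
    for n in notes:
        start = int(round(n.start / SIXTEENTH))       # column of the note-on time
        end = max(start + 1, int(round(n.end / SIXTEENTH)))  # column of the note-off time
        roll[n.pitch, start:end] = 1                  # 1 marks an active pitch at that step
    return sparse.csr_matrix(roll)

# Save the sparse matrix as an .npz file, as the description suggests (file names are illustrative).
sparse.save_npz("piece_0001.npz", midi_to_binary_matrix("piece_0001.mid"))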
The visible layer is the input layer of the restricted Boltzmann machine and the hidden layer is its feature extraction layer; the number of nodes of the second hidden layer is one quarter of that of the first hidden layer, and the number of nodes of the third hidden layer is one quarter of that of the second. The output is again a binary matrix in which each column represents a sixteenth-note time step and each row a pitch, with 1 and 0 indicating the presence or absence of a note at a given time; the binary matrix is converted into music and output to the loudspeaker. The first hidden layer has 1024 nodes, the second 256 nodes and the third 16 nodes.
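A compact numpy sketch of this structure is given below. The encoder uses the stated node counts (1024, 256 and 16); the input size of 128 pitches by 64 sixteenth-note steps is an assumption, the greedy layer-wise pretraining is reduced to a few CD-1 steps on random stand-in data, and the fourth and fifth hidden layers simply reuse (tie) the weights of the second and first hidden layers, with the backpropagation fine-tuning step omitted.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Bernoulli-Bernoulli restricted Boltzmann machine trained with one-step contrastive divergence."""
    def __init__(self, n_visible, n_hidden):
        self.W = rng.normal(0, 0.01, (n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)

    def hidden(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0, lr=0.05):
        h0 = self.hidden(v0)
        h_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible(h_sample)
        h1 = self.hidden(v1)
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b_v += lr * (v0 - v1).mean(axis=0)
        self.b_h += lr * (h0 - h1).mean(axis=0)

# Flattened binary piano-roll input: 128 pitches x 64 sixteenth-note steps (assumed size).
n_input = 128 * 64
sizes = [n_input, 1024, 256, 16]   # encoder: input -> hidden 1 -> hidden 2 -> hidden 3

# Greedy layer-wise pretraining: each RBM is trained on the codes of the previous one.
data = (rng.random((32, n_input)) < 0.05).astype(float)   # stand-in for real music fragments
rbms, codes = [], data
for n_vis, n_hid in zip(sizes[:-1], sizes[1:]):
    rbm = RBM(n_vis, n_hid)
    for _ in range(10):
        rbm.cd1_step(codes)
    rbms.append(rbm)
    codes = rbm.hidden(codes)

def generate(v):
    """Encode through hidden layers 1-3, decode through the mirrored layers 4-5 (tied weights)."""
    h = v
    for rbm in rbms:            # hidden layers 1, 2, 3 (1024 -> 256 -> 16 nodes)
        h = rbm.hidden(h)
    for rbm in reversed(rbms):  # hidden layers 4, 5 and output reuse the mirrored weights
        h = rbm.visible(h)
    return (h > 0.5).astype(np.uint8)   # binarise back into a piano-roll matrix

print(generate(data[:1]).reshape(128, 64).shape)   # (128, 64)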
As shown in fig. 1, an expression recognition model is arranged in the emotion recognition module; the training method of the expression recognition model comprises a preprocessing step, a deep feature learning step and a deep feature classification step, wherein:
the preprocessing step comprises: detecting a human face with a face detector, removing the background and non-face regions to obtain a facial image, and aligning the facial image; randomly cropping from the four corners and the centre of the facial image and then flipping horizontally; then applying illumination normalization and pose normalization to the facial image;
the deep feature learning step uses a convolutional neural network, as follows: the facial image is convolved by convolution layers to produce a number of activated feature maps of specific types, global average pooling is applied by a pooling layer, and finally a fully connected layer converts the 2D feature maps into a 1D feature vector and outputs it; the global features of the face are thus obtained, and the model parameters of the expression recognition model are trained on these global features to obtain the expression recognition model;
the deep feature classification step comprises: adding a softmax classifier to the expression recognition model after parameter learning is completed, computing with the softmax classifier the probability that the facial image belongs to each expression, and taking the emotion type with the largest probability as the emotion of the facial image; the emotional features are divided into seven types: anger, aversion, fear, happiness, sadness, surprise and neutral.
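A minimal PyTorch sketch of the convolution, global-average-pooling, fully-connected and softmax pipeline described above is given below; the channel counts and layer depth are assumptions made for illustration, and only the overall structure follows the description.

import torch
import torch.nn as nn

class ExpressionNet(nn.Module):
    """48x48 grayscale face -> 7 expression probabilities (anger, aversion, fear,
    happiness, sadness, surprise, neutral). Channel counts are illustrative."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(          # convolution layers produce activated feature maps
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.fc = nn.Linear(128, n_classes)     # fully connected layer -> 1D output

    def forward(self, x):
        x = self.gap(self.features(x)).flatten(1)
        return self.fc(x)                        # raw scores; softmax is applied below

model = ExpressionNet()
face = torch.rand(1, 1, 48, 48)                  # one aligned 48x48 face image (random stand-in)
probs = torch.softmax(model(face), dim=1)        # probability of each expression
labels = ["anger", "aversion", "fear", "happiness", "sadness", "surprise", "neutral"]
print(labels[int(probs.argmax())], float(probs.max()))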
The alignment processing step comprises the following steps: the detected face is normalized to a size of 48 x 48 so that the interocular distance of the face reaches a preset value and both eyes are located at preset vertical coordinates.
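A small sketch of this alignment step follows, assuming the two eye centres have already been located (for example from the face detector's landmarks); the target eye height and inter-ocular distance inside the 48 x 48 canvas are illustrative values, since the description only says they are preset.

import cv2
import numpy as np

def align_face(gray_face, left_eye, right_eye, size=48, eye_y=0.35, eye_dist=0.5):
    """Rotate and scale the face so both eyes sit on a fixed vertical coordinate and at a
    preset inter-ocular distance, then resample to a size x size image."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))        # roll angle of the face
    current_dist = np.hypot(rx - lx, ry - ly)
    scale = (eye_dist * size) / current_dist                 # bring eyes to the preset distance
    eyes_center = ((lx + rx) / 2.0, (ly + ry) / 2.0)
    M = cv2.getRotationMatrix2D(eyes_center, angle, scale)
    # shift so the eye midpoint lands at (size/2, eye_y * size)
    M[0, 2] += size / 2.0 - eyes_center[0]
    M[1, 2] += eye_y * size - eyes_center[1]
    return cv2.warpAffine(gray_face, M, (size, size), flags=cv2.INTER_LINEAR)

# usage (eye coordinates are illustrative): aligned = align_face(face_crop, (30, 40), (70, 42))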
The music generation module determines the music style corresponding to each emotion from big data, and the music fragments stored in the music database are labelled with the corresponding music styles. After music has played for a preset time, the camera captures facial expressions again, the emotion recognition module obtains the person's emotion, and the emotions before and after are compared and analysed. When the detected emotion is positive or negative, the interval at which the camera module captures environment images is shortened from once every 30 minutes to once every 3 minutes, so that the person's emotion can be tracked in near real time and the music adjusted as the emotion changes.
The emotion recognition module is connected to a depression detection module for further depression screening of people who remain in negative emotion for a long time. The seven emotion scores produced by the facial emotion recognition module are fed into guided viewing of positive, negative and neutral material, and the user's emotional feedback is analysed along three dimensions: stability, variability and sensitivity. Combined with weights reflecting, from psychological statistics, how strongly positive, negative and neutral material influences the population, a quantified depression index score is obtained. The distribution of scores is analysed and, given sufficient data, the value where the distribution concentrates serves as a baseline; two thresholds above it divide the scores into three depression levels: normal, mild depression and major depression. Normal indicates that the user's stress level is ordinary, mild depression indicates elevated stress, and major depression indicates very high stress. The style labels of the fragments in the music library include normal, mild depression and major depression; the depression level detected by the depression detection module is transmitted to the music generation module, which retrieves music with the corresponding style label from the library, selects a number of fragments, converts them one by one into binary matrices, inputs them into the music generation model, and the model generates music from the input data and sends it to the loudspeaker, which plays the background melody generated by the music generation module.
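The weighted quantification and two-threshold grading can be sketched as follows; the weights, baseline and thresholds are purely illustrative assumptions and are not specified in the description.

def depression_index(stability, variability, sensitivity,
                     weights=(0.4, 0.3, 0.3)):
    """Combine the three feedback dimensions into one quantified score (0-100).
    The weights are illustrative stand-ins for the psychological-statistics weights."""
    w_s, w_v, w_sen = weights
    return 100.0 * (w_s * stability + w_v * variability + w_sen * sensitivity)

def depression_level(score, baseline=30.0, t1=20.0, t2=40.0):
    """Two thresholds above the distribution baseline split scores into three levels."""
    if score < baseline + t1:
        return "normal"           # normal stress level
    if score < baseline + t2:
        return "mild depression"  # elevated stress level
    return "major depression"     # very high stress level

print(depression_level(depression_index(0.7, 0.5, 0.6)))  # -> mild depression for these illustrative values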
The music generation module may further comprise a music prediction model and a preset motif melody model, and generate a first-voice pitch sequence and a first-voice duration sequence based on the music prediction model and the preset motif melody model; the preset motif melody model is a model, obtained from the music style types of a MIDI data set and a Markov model, that can generate motif melody rules;
curve fitting is performed on the first-voice pitch sequence and the first-voice duration sequence to obtain a second-voice pitch sequence and a second-voice duration sequence;
and the first-voice pitch sequence, the first-voice duration sequence, the second-voice pitch sequence and the second-voice duration sequence are synthesized to obtain a two-voice melody.
A target audio file is acquired and a music feature sequence is determined for the target audio file, the music feature sequence comprising a plurality of music feature fragments;
a preset MIDI data set is up-sampled and encoded to obtain a MIDI data up-sampling sequence, the preset MIDI data set comprising a plurality of pieces of music in preset styles;
a hidden Markov model is obtained and outputs an environment data sequence for the music feature sequence;
the hidden Markov model generates the environment data sequence for the music feature sequence as follows: when the Nth music feature fragment is processed, the minimum cost corresponding to each action node in the action state transition relation is determined, together with the minimum-cost path corresponding to that minimum cost, where N is a positive integer greater than 1 and the minimum-cost path comprises one or more action nodes;
when the Nth music feature fragment is the last one, the minimum costs of the action nodes are compared to obtain a target action node;
and the minimum-cost path corresponding to the target action node is used to generate the environment data sequence for the music feature sequence.
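The minimum-cost path selection over action nodes can be sketched as a small dynamic-programming routine; the cost functions, node set and feature segments below are illustrative placeholders, since the description does not specify them.

import math

def min_cost_sequence(feature_segments, action_nodes, node_cost, transition_cost):
    """For each music feature segment keep, per action node, the minimum total cost
    and the path achieving it; after the last segment pick the cheapest node."""
    # costs[node] = (minimum cost so far, path of action nodes giving that cost)
    costs = {n: (node_cost(feature_segments[0], n), [n]) for n in action_nodes}
    for seg in feature_segments[1:]:
        new_costs = {}
        for n in action_nodes:
            best = (math.inf, None)
            for prev, (c_prev, path) in costs.items():
                c = c_prev + transition_cost(prev, n) + node_cost(seg, n)
                if c < best[0]:
                    best = (c, path + [n])
            new_costs[n] = best
        costs = new_costs
    # the target action node is the one with the smallest accumulated cost
    target = min(costs, key=lambda n: costs[n][0])
    return costs[target][1]

# Illustrative usage with made-up costs: segments are loudness values, nodes are 0/1/2.
segments = [0.2, 0.8, 0.5]
path = min_cost_sequence(
    segments, action_nodes=[0, 1, 2],
    node_cost=lambda seg, n: abs(seg * 2 - n),      # how well a node fits a segment
    transition_cost=lambda a, b: 0.3 * abs(a - b),  # penalise jumping between nodes
)
print(path)  # prints the chosen action-node sequence, here [1, 1, 1]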
The music generation module classifies the anger, aversion, fear and sadness expressions as negative emotions, happiness and surprise as positive emotions, and neutral as normal emotion; when the identified emotion is positive, the loudspeaker's output volume is increased by 4-5 dB, the deep bass and treble are both increased by 2-3 dB, and surround sound is switched on; when the identified emotion is normal, the loudspeaker sets the output volume to 45-50 dB, the bass to 22-25 dB and the treble to 18-21 dB, and surround sound is switched on; when the identified emotion is negative, the loudspeaker reduces the output volume by 5-6 dB, switches off the deep bass, reduces the treble by 2-3 dB, and switches off surround sound.
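These speaker adjustments can be expressed as a simple lookup; the midpoints of the stated dB ranges are used for concreteness, and the settings dictionary is a placeholder for whatever speaker-control interface is actually used.

def speaker_settings(emotion_group, current):
    """Return new speaker settings for the given emotion grouping.

    current: dict with 'volume', 'deep_bass', 'treble' in dB and 'surround' (bool).
    Midpoints of the ranges in the description are used (e.g. +4.5 dB for 4-5 dB)."""
    s = dict(current)
    if emotion_group == "positive":
        s["volume"] += 4.5          # raise output volume by 4-5 dB
        s["deep_bass"] += 2.5       # raise deep bass by 2-3 dB
        s["treble"] += 2.5          # raise treble by 2-3 dB
        s["surround"] = True
    elif emotion_group == "normal":
        s.update(volume=47.5, deep_bass=23.5, treble=19.5, surround=True)
    else:  # negative
        s["volume"] -= 5.5          # lower output volume by 5-6 dB
        s["deep_bass"] = 0          # switch the deep bass off
        s["treble"] -= 2.5          # lower treble by 2-3 dB
        s["surround"] = False
    return s

print(speaker_settings("negative", {"volume": 50, "deep_bass": 20, "treble": 18, "surround": True}))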
This embodiment also comprises a radar module and a sound pickup. The radar module comprises a microwave radar and a laser radar; the laser radar detects whether a person target is present within a preset range and obtains the target's moving speed, while the microwave radar acquires micro-motion signals of the target person, from which heartbeat signals are extracted to obtain heartbeat information. The pickup collects environmental sound and transmits the collected sound information to the scene analysis module. The heartbeat signals extracted from the micro-motion signals are used to plot a heart-rate curve. If a person's expression is anger, fear or surprise, the heartbeat signal is used to judge whether the person's heart rate exceeds a preset value: if it does not, the person's emotion is judged to be normal; if it does, a person showing anger or fear is judged to be in a negative emotion, and a person showing surprise is judged to be in a positive emotion.
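The heartbeat-based disambiguation can be sketched as follows; the 95 bpm threshold is an illustrative assumption, as the description only speaks of a preset value.

def refine_emotion(expression, heart_rate_bpm, threshold_bpm=95):
    """Use the radar-derived heart rate to disambiguate high-arousal expressions.

    If the heart rate stays below the preset value the person is treated as calm
    (normal emotion); above it, anger/fear count as negative and surprise as positive."""
    if expression not in ("anger", "fear", "surprise"):
        return None  # other expressions are classified from the face alone
    if heart_rate_bpm <= threshold_bpm:
        return "normal"
    return "negative" if expression in ("anger", "fear") else "positive"

print(refine_emotion("surprise", 110))  # -> positive
print(refine_emotion("fear", 80))       # -> normal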
The scene analysis module is also used to identify background sounds in the sound information, such as wind, rain, laughter and crying, and at the same time to identify a plurality of objects in the environment image and obtain their names; the objects may include trees, lawns, tables and chairs, street lamps, doors, windows, buildings and the like. A scene database is arranged in the scene analysis module, storing text associated with each object (passages about the object) and text associated with each background sound (passages about the background sound). The scene analysis module splices the text corresponding to the objects and the text corresponding to the background sounds into a scene text, obtains the syllable of each character in the scene text, and arranges the syllables in the order of the characters; the music generation module then matches each note of the background melody to the syllable of each character in turn. The music generation module stretches or compresses the syllable durations so that each syllable lasts exactly as long as its corresponding note, pitch-shifts the syllables of the Chinese characters, and mixes them with the background melody it has generated. By converting the scene into Chinese characters and matching their syllables to the notes of the generated background melody, the melody is tied to the scene and the people in it at the level of detail, strengthening the association between the final score, the scene and the people, and giving passers-by a deeper sense of musical immersion.
The syllable duration of each syllable is determined from the background melody generated by the music generation module; the syllables correspond one-to-one with the characters, and each syllable comprises at least one phoneme. The phoneme duration of each phoneme within a syllable is determined from the syllable duration, and the pronunciation time of each character is the total duration of its phonemes. It will be appreciated that each syllable is composed of phonemes, the smallest phonetic units; for example, the syllable "feng" is composed of the phoneme "f" and the phoneme "eng". A duration prediction model can be trained in advance: the preset duration prediction model determines the phoneme duration of each phoneme in a syllable from the syllable duration, and the phoneme durations can be stretched or compressed. From the phoneme durations the pronunciation time of each character is obtained. For example, the preset prediction model may give the syllable "feng" a syllable duration of 80 ms, with a phoneme duration of 30 ms for "f" and 50 ms for "eng".
If wind is recognized among the background sounds, a passage related to wind is retrieved from the text library; for example, the first line retrieved might be "风急天高猿啸哀" (the wind is strong, the sky is high, the apes' cries are mournful), whose syllables are "feng/ji/tian/gao/yuan/xiao/ai". The syllable duration of each character is determined from the singing duration of the corresponding note of the background melody, and the stretching or compression is achieved by changing the phoneme durations. For example, if the note for a character lasts 90 ms and the corresponding syllable "feng" was predicted at 80 ms, the phoneme "f" may keep its 30 ms while the phoneme "eng" is stretched to 60 ms, so that the scene text and the background melody correspond one to one.
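A minimal sketch of fitting syllable durations to note durations follows, using the "feng" example above; the proportional scaling is only one simple way to stretch or compress the phonemes (the example in the text stretches only the final phoneme), so it is an assumption rather than the patent's exact rule.

def fit_syllable_to_note(phoneme_durations_ms, note_duration_ms):
    """Scale each phoneme so the whole syllable lasts exactly as long as its note.

    phoneme_durations_ms: dict like {"f": 30, "eng": 50} predicted by a duration model.
    """
    total = sum(phoneme_durations_ms.values())
    scale = note_duration_ms / total
    return {ph: round(d * scale, 1) for ph, d in phoneme_durations_ms.items()}

# Syllable "feng" predicted at 80 ms (f: 30 ms, eng: 50 ms); its note lasts 90 ms.
print(fit_syllable_to_note({"f": 30, "eng": 50}, 90))   # {'f': 33.8, 'eng': 56.2}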
The scene analysis module is also used to obtain the change in a person target's moving speed and the change in loudness of the sound information over a period of time, and to analyse the brightness values in the environment images over that period. The scene database stores a moving-speed-to-pitch comparison table, a background-noise-loudness-to-duration comparison table and a brightness-to-timbre comparison table; the person's moving speed and the background-noise loudness are sampled every 20 ms, and the brightness every 2 hours. If the radar module detects a person target, the scene analysis module obtains the corresponding pitch changes from the changes in the target's moving speed, the corresponding note durations from the background-noise loudness, and the corresponding timbre from the brightness value, then splices the pitches, durations and timbre into a complete score and converts the score into music. Because the score is tied to the scene and the people in it at the level of detail, the association between the generated score, the scene and the people is strengthened and pedestrians are given a deeper sense of musical immersion.
When the scene analysis module detects that at least one person target's speed reaches 1 m/s and the movement lasts longer than a preset value (which may be set to more than 2 minutes), it obtains the change in background-noise loudness over that period, looks up the pitch corresponding to the moving speed and the note duration corresponding to the background-noise loudness in the database comparison tables, and obtains the corresponding timbre from the brightness value; the comparison tables are as follows:
when the brightness value is 1-100nit, the tone is organ tone, when the brightness value is 100-1000nit, the tone is guitar tone, when the brightness value is 1000-2000nit, the tone is guitar tone, when the brightness value is 2000-5000nit, the tone is blow tube tone, when the brightness value is more than 5000nit, the tone is synthesized dominant tone.
The moving speed of a person target detected by the radar ranges from 1 m/s to 5 m/s, and the corresponding pitch ranges from 1 kHz to 3 kHz.
The background-noise loudness ranges from 2 dB to 20 dB, and the note durations comprise whole notes, half notes, quarter notes, eighth notes, sixteenth notes and thirty-second notes: a loudness of 2-5 dB gives a whole note, 6-8 dB a half note, 9-11 dB a quarter note, 12-14 dB an eighth note, 15-17 dB a sixteenth note, and 18-20 dB a thirty-second note.
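The three comparison tables can be sketched as simple lookup functions; the linear speed-to-pitch interpolation within the stated 1-5 m/s and 1-3 kHz ranges is an assumption, while the loudness and brightness mappings follow the values given above (including the repeated guitar entry, kept as written).

def pitch_from_speed(speed_mps):
    """Map a person's moving speed (1-5 m/s) linearly onto a pitch of 1-3 kHz (assumed linear)."""
    speed = min(max(speed_mps, 1.0), 5.0)
    return 1000.0 + (speed - 1.0) / 4.0 * 2000.0   # Hz

def duration_from_loudness(loudness_db):
    """Background-noise loudness (2-20 dB) -> note duration, per the table above."""
    table = [(5, "whole"), (8, "half"), (11, "quarter"),
             (14, "eighth"), (17, "sixteenth"), (20, "thirty-second")]
    for upper, duration in table:
        if loudness_db <= upper:
            return duration
    return "thirty-second"

def timbre_from_brightness(brightness_nit):
    """Ambient brightness -> instrument timbre, per the table above."""
    if brightness_nit <= 100:
        return "organ"
    if brightness_nit <= 2000:
        return "guitar"            # the description lists guitar for both 100-1000 and 1000-2000 nit
    if brightness_nit <= 5000:
        return "wind pipe"
    return "synth lead"

print(pitch_from_speed(2.0), duration_from_loudness(10), timbre_from_brightness(1500))
# -> 1500.0 quarter guitar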
The score is converted into music and tied to the scene and the people in it at the level of detail, which strengthens the association between the generated score, the scene and the people and gives the people in the camera's scene a deeper sense of musical immersion. Splitting the score and the environmental factors into fine-grained elements and then matching them ensures variety in the generated scores while still allowing a score to be generated successfully from the environmental factors, avoiding the repetition that can arise when many scores are generated. Combining the users' movement speed with the composition process improves the user experience and sense of participation, which is particularly suitable for relaxing in public places; natural sounds, human voices, animal calls, birdsong and running water are blended with the generated music, creating a strong atmospheric effect for the people in the scene.
It will be understood that modifications and variations will be apparent to those skilled in the art from the foregoing description, and it is intended that all such modifications and variations be included within the scope of the following claims.

Claims (8)

1. A camera for recognizing an environment and generating music, characterized by comprising a camera module, a loudspeaker, a scene analysis module, an emotion recognition module and a music generation module;
the camera module comprises an angle adjusting mechanism and a camera mounted on the angle adjusting mechanism, wherein the camera acquires multi-angle environment images through the angle adjusting mechanism and transmits the images to the emotion recognition module and the scene analysis module;
the emotion recognition module is used for detecting whether a face exists in the environment image, if so, recognizing the facial expression of the user through an expression recognition technology, acquiring emotion characteristics corresponding to the facial expression, and transmitting the result to the music generation module;
the scene analysis module is used for acquiring scene characteristics through the Internet and transmitting the scene characteristics to the music generation module, wherein the scene characteristics comprise weather characteristics and time period characteristics, and the time period characteristics comprise night, morning, noon, afternoon and dusk;
the music generation module comprises a music database and a music generation model, wherein the music database stores a plurality of music fragments, each fragment being marked with a style label corresponding to a scene feature or an emotion feature; the music generation module randomly selects music fragments whose style labels correspond to the received scene features or emotion features, converts the fragments into matrices, and inputs the matrices into the music generation model to generate a background melody of the corresponding style;
the music generation model is a deep belief network model comprising an input layer, hidden layers and an output layer, wherein there are five hidden layers, the numbers of nodes of the first to third hidden layers decrease in turn, the fourth hidden layer has the same number of nodes as the second, and the fifth hidden layer has the same number of nodes as the first; each hidden layer receives data from the layer before it and outputs data to the layer after it; the music generation model outputs a binary matrix, which is converted into music and output to the loudspeaker; each column of the binary matrix represents a duration and each row represents a pitch;
the loudspeaker is used for playing the background melody generated by the music generation module.
2. The camera for recognizing an environment and generating music according to claim 1, further comprising a radar module and a pickup, wherein the radar module comprises a microwave radar and a laser radar, and the laser radar is used for detecting whether a person target exists in a preset range and acquiring the moving speed of the person target; the microwave radar acquires micro-motion signals of the target personnel, and screens out heartbeat signals according to the micro-motion signals to acquire personnel heartbeat information; the pickup is used for collecting environmental sounds and transmitting collected sound information to the scene analysis module.
3. The camera for recognizing an environment and generating music according to claim 1, wherein the scene analysis module is further used to identify background sounds in the sound information and to identify a plurality of objects in the environment image and obtain their names; a scene database is provided in the scene analysis module, storing text corresponding to the objects and text corresponding to the background sounds; the scene analysis module splices the text corresponding to the objects and the text corresponding to the background sounds into a scene text, obtains the syllable of each character in the scene text, and arranges the syllables in the order of the characters; the music generation module matches each note of the background melody to the syllable of each character in turn, stretches or compresses the syllable durations so that each syllable lasts as long as its corresponding note, pitch-shifts the syllables of the Chinese characters, and mixes them with the background melody generated by the music generation module.
4. The camera for recognizing an environment and generating music according to claim 2 or 3, wherein the scene analysis module is further configured to obtain the change in a person target's moving speed and the change in loudness of the sound information over a period of time, and to analyse the brightness values in the environment images over that period; the scene database stores a moving-speed-to-pitch comparison table, a background-noise-loudness-to-duration comparison table and a brightness-to-timbre comparison table; if the radar module detects a person target, the scene analysis module obtains the corresponding pitch changes from the changes in the target's moving speed, the corresponding note durations from the background-noise loudness, and the corresponding timbre from the brightness value, then splices the pitches, durations and timbre into a complete score and converts the score into music.
5. The camera for recognizing environment and generating music according to claim 1, wherein an expression recognition model is provided in the emotion recognition module, and the training method of the expression recognition model comprises a preprocessing step, a deep feature learning step and a deep feature classifying step; wherein, the liquid crystal display device comprises a liquid crystal display device,
the pretreatment steps comprise: detecting a human face, deleting a background and a non-human face area, obtaining a facial image, and carrying out alignment treatment on the facial image; randomly clipping from four corners and the center of the face image, and then horizontally turning over; then carrying out illumination normalization and posture normalization processing on the facial image;
the deep learning step adopts convolutional neural network processing, and the specific method comprises the following steps: convolving the facial image through a convolution layer and generating a plurality of activated feature images of specific types, carrying out global average pooling through a pooling layer, and finally converting the 2D feature image into a 1D feature image through a full connection layer and outputting the 1D feature image; obtaining global characteristics of the face, and training model parameters of the expression recognition model according to the global characteristics of the face to obtain the expression recognition model;
the deep feature classification step comprises: adding a softmax classification algorithm to the expression recognition model after parameter learning is completed, calculating the probability that the face image belongs to each expression through the softmax classification algorithm, and taking the emotion type with the largest probability as the emotion of the face image; emotions are classified into seven types: anger, aversion, fear, happiness, sadness, surprise, and neutral (a minimal sketch of these three steps follows this claim).
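The three steps of claim 5 can be pictured with a short PyTorch sketch; the crop size, channel counts, and layer sizes are assumptions for illustration, and the illumination and pose normalization steps are omitted.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

EMOTIONS = ["anger", "aversion", "fear", "happiness", "sadness", "surprise", "neutral"]

def five_crops_with_flips(face: np.ndarray, crop: int):
    """Preprocessing: crop from the four corners and the center, then add horizontal flips."""
    h, w = face.shape
    t, l = h - crop, w - crop
    crops = [face[:crop, :crop], face[:crop, l:], face[t:, :crop], face[t:, l:],
             face[t // 2:t // 2 + crop, l // 2:l // 2 + crop]]
    return crops + [np.fliplr(c) for c in crops]

class ExpressionNet(nn.Module):
    """Deep feature learning: convolution -> global average pooling -> fully connected layer."""
    def __init__(self, num_classes: int = len(EMOTIONS)):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)     # global average pooling over each feature map
        self.fc = nn.Linear(64, num_classes)   # 2D feature maps -> 1D output vector

    def forward(self, x):
        return self.fc(self.gap(self.conv(x)).flatten(1))

def classify(model: ExpressionNet, face: np.ndarray) -> str:
    """Deep feature classification: softmax over seven emotions, take the largest probability."""
    crops = five_crops_with_flips(face, crop=44)
    batch = torch.tensor(np.stack(crops), dtype=torch.float32).unsqueeze(1)
    probs = F.softmax(model(batch), dim=-1).mean(dim=0)   # average predictions over the ten crops
    return EMOTIONS[int(probs.argmax())]

print(classify(ExpressionNet(), np.random.rand(48, 48)))
```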
6. The camera for recognizing an environment and generating music according to claim 1, wherein the music generation module classifies the anger, aversion, fear, and sadness expressions as negative emotions, happiness and surprise as positive emotions, and neutral as a normal emotion; when the recognized emotion is a positive emotion, the speaker increases the output volume by 4-5 dB, raises both the bass and the treble by 2-3 dB, and turns on surround sound; when the recognized emotion is a normal emotion, the speaker adjusts the output volume to 45-50 dB, the bass to 22-25 dB, and the treble to 18-21 dB, and turns on surround sound; when the recognized emotion is a negative emotion, the speaker reduces the output volume by 5-6 dB, turns off the bass, lowers the treble by 2-3 dB, and turns off surround sound.
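A sketch of the emotion-to-speaker mapping in this claim; the dictionary keys and return format are assumptions, while the dB figures follow the ranges stated above (the lower bound of each range is used).

```python
# Sketch: map the recognized emotion class to playback adjustments.
NEGATIVE = {"anger", "aversion", "fear", "sadness"}
POSITIVE = {"happiness", "surprise"}

def speaker_settings(emotion: str) -> dict:
    if emotion in POSITIVE:
        return {"volume_delta_db": +4, "bass_delta_db": +2, "treble_delta_db": +2, "surround": True}
    if emotion in NEGATIVE:
        return {"volume_delta_db": -5, "bass": "off", "treble_delta_db": -2, "surround": False}
    # neutral / normal emotion: absolute targets rather than deltas
    return {"volume_db": 45, "bass_db": 22, "treble_db": 18, "surround": True}

print(speaker_settings("happiness"))
```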
7. The camera for recognizing an environment and generating music according to claim 1, wherein the pieces of music are stored in a music database in the form of MIDI files, the music information in the MIDI files is extracted in matrix form, the matrices extracted from the MIDI files are binary matrices stored as two-dimensional scatter diagrams, each column in a matrix represents a sixteenth-note time value, and each row represents a pitch.
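A sketch of building the binary matrix described in this claim from already-parsed note events; the sixteenth-note quantization grid, the fixed number of time steps, and the note-list format are assumptions.

```python
# Sketch: quantize note events onto a binary pitch-by-time grid
# (rows = MIDI pitches, columns = sixteenth-note steps).
import numpy as np

def notes_to_binary_matrix(notes, steps_per_beat: int = 4, num_steps: int = 64):
    """notes: iterable of (midi_pitch, start_beat, end_beat)."""
    roll = np.zeros((128, num_steps), dtype=np.uint8)
    for pitch, start, end in notes:
        a = int(round(start * steps_per_beat))
        b = max(a + 1, int(round(end * steps_per_beat)))
        roll[pitch, a:min(b, num_steps)] = 1
    return roll

matrix = notes_to_binary_matrix([(60, 0.0, 1.0), (64, 1.0, 2.0), (67, 2.0, 4.0)])
```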
8. The camera for recognizing an environment and generating music according to claim 1, wherein the visible layer is the input layer of the restricted Boltzmann machine, the hidden layers are the feature extraction layers of the restricted Boltzmann machine, the number of nodes of the second hidden layer is one quarter of that of the first hidden layer, and the number of nodes of the third hidden layer is one quarter of that of the second hidden layer.
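A sketch of the stacked restricted Boltzmann machine sizing rule in this claim, showing only the feature-extraction (visible-to-hidden) pass; the visible-layer size and the first hidden-layer size are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBMLayer:
    """Minimal Bernoulli RBM layer used here only for its feature-extraction pass."""
    def __init__(self, n_visible, n_hidden, rng):
        self.W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
        self.b_hidden = np.zeros(n_hidden)

    def extract(self, v):
        return sigmoid(v @ self.W + self.b_hidden)   # hidden-unit activation probabilities

rng = np.random.default_rng(0)
n_visible, n_h1 = 128 * 64, 512                       # assumed visible and first-hidden sizes
layers = [RBMLayer(n_visible, n_h1, rng),
          RBMLayer(n_h1, n_h1 // 4, rng),             # second hidden layer: 1/4 of the first
          RBMLayer(n_h1 // 4, n_h1 // 16, rng)]       # third hidden layer: 1/4 of the second

x = rng.integers(0, 2, size=(1, n_visible)).astype(float)   # flattened binary score matrix
for layer in layers:
    x = layer.extract(x)
```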
CN202310264005.4A 2023-03-18 2023-03-18 Camera capable of identifying environment and generating music Active CN116504206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310264005.4A CN116504206B (en) 2023-03-18 2023-03-18 Camera capable of identifying environment and generating music

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310264005.4A CN116504206B (en) 2023-03-18 2023-03-18 Camera capable of identifying environment and generating music

Publications (2)

Publication Number Publication Date
CN116504206A true CN116504206A (en) 2023-07-28
CN116504206B CN116504206B (en) 2024-02-20

Family

ID=87327436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310264005.4A Active CN116504206B (en) 2023-03-18 2023-03-18 Camera capable of identifying environment and generating music

Country Status (1)

Country Link
CN (1) CN116504206B (en)

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04229458A (en) * 1990-12-27 1992-08-18 Sanyo Electric Co Ltd Musical tone reproducing device
JPH05297837A (en) * 1992-04-16 1993-11-12 Yashima Denki Co Ltd Music information and image information converting device
JP2007086572A (en) * 2005-09-26 2007-04-05 Yamaha Corp Image display device and program
KR20070059253A (en) * 2005-12-06 2007-06-12 최종민 The method for transforming the language into symbolic melody
CN101409070A (en) * 2008-03-28 2009-04-15 徐开笑 Music reconstruction method base on movement image analysis
CN102750964A (en) * 2012-07-30 2012-10-24 西北工业大学 Method and device used for controlling background music and based on facial expression
KR20150112048A (en) * 2014-03-25 2015-10-07 서강대학교산학협력단 music-generation method based on real-time image
US20170262256A1 (en) * 2016-03-10 2017-09-14 Panasonic Automotive Systems Company of America, Division of Panasonic Corporation of North America Environment based entertainment
CN107169430A (en) * 2017-05-02 2017-09-15 哈尔滨工业大学深圳研究生院 Reading environment audio strengthening system and method based on image procossing semantic analysis
CN107248406A (en) * 2017-06-29 2017-10-13 上海青声网络科技有限公司 A kind of method and device for automatically generating terrible domestic animals song
CN107802938A (en) * 2017-11-23 2018-03-16 中国科学院心理研究所 A kind of music electrical stimulation analgesia method
CN108197185A (en) * 2017-12-26 2018-06-22 努比亚技术有限公司 A kind of music recommends method, terminal and computer readable storage medium
CN110555126A (en) * 2018-06-01 2019-12-10 微软技术许可有限责任公司 Automatic generation of melodies
CN110516593A (en) * 2019-08-27 2019-11-29 京东方科技集团股份有限公司 A kind of emotional prediction device, emotional prediction method and display device
CN110853605A (en) * 2019-11-15 2020-02-28 中国传媒大学 Music generation method and device and electronic equipment
CN113923517A (en) * 2021-09-30 2022-01-11 北京搜狗科技发展有限公司 Background music generation method and device and electronic equipment
CN115391590A (en) * 2022-08-31 2022-11-25 安徽江淮汽车集团股份有限公司 Vehicle-mounted music pushing method and system based on face recognition
CN115633136A (en) * 2022-10-12 2023-01-20 杭州万像文化科技有限公司 Full-automatic music video generation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
梁晓晶: "音乐信息可视化研究", 《中国博士学位论文全文数据库 哲学与人文科学辑》, no. 8, pages 1 *

Also Published As

Publication number Publication date
CN116504206B (en) 2024-02-20

Similar Documents

Publication Publication Date Title
Zhang et al. Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching
CN108492817B (en) Song data processing method based on virtual idol and singing interaction system
US9685152B2 (en) Technology for responding to remarks using speech synthesis
CN110085263B (en) Music emotion classification and machine composition method
Kraus Of sound mind: How our brain constructs a meaningful sonic world
CN112270933A (en) Audio identification method and device
Van Zijl et al. The sound of emotion: The effect of performers’ experienced emotions on auditory performance characteristics
Coutinho et al. Singing and emotion
CN113238654A (en) Multi-modal based reactive response generation
Ohnaka et al. Visual onoma-to-wave: environmental sound synthesis from visual onomatopoeias and sound-source images
CN114093386A (en) Education-oriented multi-dimensional singing evaluation method
CN116504206B (en) Camera capable of identifying environment and generating music
US20230343321A1 (en) Method and apparatus for processing virtual concert, device, storage medium, and program product
Clark et al. Iconic pitch expresses vertical space
CN108922505B (en) Information processing method and device
Ruvolo et al. Auditory mood detection for social and educational robots
Sato Voice quality conversion using interactive evolution of prosodic control
Cervantes et al. Embedded design of an emotion-aware music player
Nazir et al. Deep learning end to end speech synthesis: A review
Zhang et al. Fundamental frequency adjustment and formant transition based emotional speech synthesis
Ribaldini Heavy metal vocal technique terminology compendium: A poietic perspective
Van Rossum et al. “Pitch” accent in alaryngeal speech
Zheng et al. The Extraction Method of Emotional Feature Based on Children's Spoken Speech
Thilag et al. Speech emotion recognition using deep learning
Seddon Recurrence in acousmatic music

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant