WO2022218027A1 - Audio playback method, apparatus, computer-readable storage medium, and electronic device - Google Patents

Audio playback method, apparatus, computer-readable storage medium, and electronic device

Info

Publication number
WO2022218027A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
information
audio
target
emotion
Prior art date
Application number
PCT/CN2022/076239
Other languages
English (en)
French (fr)
Inventor
朱长宝
牛建伟
余凯
Original Assignee
深圳地平线机器人科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳地平线机器人科技有限公司
Priority to JP2022573581A (patent JP7453712B2, ja)
Priority to US18/247,754 (patent US20240004606A1, en)
Publication of WO2022218027A1 (zh)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/165Management of the audio stream, e.g. setting of volume, audio stream path
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/166Detection; Localisation; Normalisation using acquisition arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • G06V40/176Dynamic expression
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00Details of electrophonic musical instruments
    • G10H1/0008Associated control or indicating means
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/1815Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/028Voice signal separating using properties of sound source
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/056Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction or identification of individual instrumental parts, e.g. melody, chords, bass; Identification or separation of instrumental parts by their characteristic voices or timbres
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/066Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for pitch analysis as part of wider processing for musical purposes, e.g. transcription, musical performance evaluation; Pitch recognition, e.g. in polyphonic sounds; Estimation or use of missing fundamental
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/075Musical metadata derived from musical analysis or for use in electrophonic musical instruments
    • G10H2240/085Mood, i.e. generation, detection or selection of a particular emotional content or atmosphere in a musical piece
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/121Musical libraries, i.e. musical databases indexed by musical parameters, wavetables, indexing schemes using musical parameters, musical rule bases or knowledge bases, e.g. for automatic composing methods
    • G10H2240/131Library retrieval, i.e. searching a database or selecting a specific musical piece, segment, pattern, rule or parameter set
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Definitions

  • The present disclosure relates to the field of computer technology, and in particular to an audio playback method, an apparatus, a computer-readable storage medium, and an electronic device.
  • Embodiments of the present disclosure provide an audio playback method, an apparatus, a computer-readable storage medium, and an electronic device.
  • An embodiment of the present disclosure provides an audio playback method. The method includes: acquiring intent judgment data collected for at least one user in a target space; determining, based on the intent judgment data, a target vocalization intent of the at least one user; determining, based on the target vocalization intent, feature information representing a current feature of the at least one user; and extracting audio corresponding to the feature information from a preset audio library and playing it.
  • An audio playback apparatus includes: an acquisition module for acquiring intent judgment data collected for at least one user in a target space; a first determination module for determining, based on the intent judgment data, a target vocalization intent of the at least one user; a second determination module for determining, based on the target vocalization intent, feature information representing a current feature of the at least one user; and a first playback module for extracting audio corresponding to the feature information from a preset audio library and playing it.
  • A computer-readable storage medium stores a computer program, and the computer program is used to execute the above audio playback method.
  • An electronic device includes: a processor; and a memory for storing instructions executable by the processor, where the processor reads the executable instructions from the memory and executes them to implement the above audio playback method.
  • Intent judgment data is collected for at least one user in the target space, and the target vocalization intent of the at least one user is determined from it. The feature information is then determined according to the target vocalization intent, and finally the audio corresponding to the feature information is extracted from the preset audio library and played. The electronic device can thus determine the user's target vocalization intent automatically and, when it determines that the user has a vocalization intent, play audio automatically. The user does not need to actively trigger audio playback, which reduces the steps needed to play audio and makes audio playback more convenient.
  • In addition, the played audio is matched to the user's characteristics, so the audio that the user wants to hear is played more accurately and the automatically played audio is better targeted.
  • FIG. 1 is a system diagram to which the present disclosure is applied.
  • FIG. 2 is a schematic flowchart of an audio playback method provided by an exemplary embodiment of the present disclosure.
  • FIG. 3 is a schematic flowchart of an audio playback method provided by another exemplary embodiment of the present disclosure.
  • FIG. 4 is a schematic flowchart of an audio playback method provided by another exemplary embodiment of the present disclosure.
  • FIG. 5 is a schematic flowchart of an audio playback method provided by another exemplary embodiment of the present disclosure.
  • FIG. 6 is a schematic flowchart of an audio playback method provided by another exemplary embodiment of the present disclosure.
  • FIG. 7 is a schematic flowchart of an audio playback method provided by another exemplary embodiment of the present disclosure.
  • FIG. 8 is a schematic flowchart of an audio playback method provided by another exemplary embodiment of the present disclosure.
  • FIG. 9 is a schematic structural diagram of an audio playback device provided by an exemplary embodiment of the present disclosure.
  • FIG. 10 is a schematic structural diagram of an audio playback apparatus provided by another exemplary embodiment of the present disclosure.
  • FIG. 11 is a structural diagram of an electronic device provided by an exemplary embodiment of the present disclosure.
  • "A plurality" may refer to two or more, and "at least one" may refer to one, two, or more.
  • The term "and/or" in the present disclosure merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean three cases: A exists alone, A and B exist at the same time, or B exists alone.
  • The character "/" in the present disclosure generally indicates an "or" relationship between the associated objects.
  • Embodiments of the present disclosure can be applied to electronic devices such as terminal devices, computer systems, servers, etc., which can operate with numerous other general-purpose or special-purpose computing system environments or configurations.
  • Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the foregoing systems.
  • Electronic devices such as terminal devices, computer systems, and servers may be described in the general context of computer-system-executable instructions, such as program modules, executed by a computer system.
  • Program modules may include routines, programs, object programs, components, logic, data structures, and the like that perform particular tasks or implement particular abstract data types.
  • Computer systems/servers may be implemented in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media, including storage devices.
  • In current audio playback systems, the user usually needs to manually select the audio to be played, or trigger playback through voice recognition, gesture recognition, or the like. These approaches require the user to actively interact with the audio playback system and cannot automatically determine the user's vocalization intent, so audio playback is not convenient enough. They also cannot automatically play audio that matches the user's characteristics, so audio playback is not well targeted.
  • FIG. 1 illustrates an exemplary system architecture 100 of an audio playback method or audio playback apparatus to which embodiments of the present disclosure may be applied.
  • The system architecture 100 may include a terminal device 101, a network 102, a server 103, and an information collection device 104.
  • The network 102 is a medium that provides a communication link between the terminal device 101 and the server 103.
  • The network 102 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
  • The user can use the terminal device 101 to interact with the server 103 through the network 102 to receive or send messages and the like.
  • Various communication client applications, such as audio players, video players, web browser applications, and instant messaging tools, may be installed on the terminal device 101.
  • The terminal device 101 may be any of various electronic devices capable of audio playback, including but not limited to mobile terminals such as vehicle-mounted terminals, mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), and PMPs (portable multimedia players), as well as stationary terminals such as digital TVs, desktop computers, and smart home appliances.
  • The information collection device 104 may be any of various devices for collecting user-related information (including intent judgment data), including but not limited to at least one of the following: a camera, a microphone, and the like.
  • The terminal device 101 is located in a space 105 of limited extent, and the information collection device 104 is associated with the space 105.
  • The information collection device 104 may be installed inside the space 105 to collect information such as images and sounds of the user, or may be installed outside the space 105 to collect information such as images and sounds around the space 105.
  • The space 105 may be any of various confined spaces, such as the interior of a vehicle or the interior of a room.
  • The server 103 may be a server that provides various services, such as a background audio server that supports the audio played on the terminal device 101.
  • The background audio server can process the received intent judgment data to obtain information such as the user's target vocalization intent, the user's feature information, and the audio to be played.
  • The audio playback method provided by the embodiments of the present disclosure may be executed by the server 103 or by the terminal device 101; correspondingly, the audio playback apparatus may be set in the server 103 or in the terminal device 101.
  • The audio playback method provided by the embodiments of the present disclosure may also be performed jointly by the terminal device 101 and the server 103.
  • For example, the steps of acquiring intent judgment data and determining the target vocalization intent are performed by the terminal device 101, and the steps of determining feature information and extracting audio are performed by the server 103; correspondingly, the modules included in the audio playback apparatus may be distributed between the terminal device 101 and the server 103.
  • The numbers of terminal devices, networks, servers, and information collection devices in FIG. 1 are merely illustrative. Depending on implementation needs, there can be any number of each.
  • The above system architecture may omit the network and the server and include only the terminal device and the information collection device.
  • FIG. 2 is a schematic flowchart of an audio playback method provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device (the terminal device 101 or the server 103 shown in FIG. 1). As shown in FIG. 2, the method includes the following steps:
  • Step 201: Acquire intent judgment data collected for at least one user in a target space.
  • The electronic device may acquire intent judgment data collected for at least one user in the target space.
  • The target space is, for example, the space 105 in FIG. 1.
  • The intent judgment data may be any data used to determine the user's intent, including but not limited to at least one of the following: the user's face image data, the user's voice, and the like.
  • Step 202: Determine, based on the intent judgment data, a target vocalization intent of the at least one user.
  • The electronic device may determine the target vocalization intent of the at least one user based on the intent judgment data.
  • The vocalization type represented by the target vocalization intent may be preset.
  • The target vocalization intent may include, but is not limited to, at least one of the following: a singing intent, a recitation intent, and the like.
  • The electronic device may choose how to determine the target vocalization intent according to the type of the intent judgment data.
  • For example, when the intent judgment data includes the user's face image data, emotion recognition can be performed on the face image to obtain an emotion type; if the emotion type is joy, it can be determined that the at least one user has a target vocalization intent (e.g., a singing intent).
  • As another example, when the intent judgment data includes a voice signal emitted by the user, the voice signal can be recognized; if the recognition result indicates that the user is humming, it can be determined that the target vocalization intent exists.
  • Step 203: Determine, based on the target vocalization intent, feature information representing a current feature of the at least one user.
  • The electronic device may determine feature information that characterizes the current feature of the at least one user.
  • The current features of the user may include, but are not limited to, at least one of the following: the user's emotion, the number of users, the user's listening habits, and the like.
  • The electronic device may determine the feature information in a manner suited to each of these features. For example, a face image of the user captured by a camera may be obtained, and emotion recognition may be performed on the face image to obtain feature information representing the user's current emotion.
  • As another example, the user's historical playback records may be acquired, and the type of audio that the user habitually listens to may be determined from those records and used as the feature information.
  • Step 204: Extract audio corresponding to the feature information from the preset audio library and play it.
  • The electronic device may extract audio corresponding to the feature information from a preset audio library and play it.
  • The preset audio library may be set in the above electronic device, or in another electronic device communicatively connected with it.
  • The feature information corresponds to an audio type. The electronic device can determine the type of audio to be played according to the feature information and select an audio item of that type to play (for example, by play count or at random).
  • For example, if the feature information indicates that the user's current emotion is joy, audio marked as the joy type may be extracted from the preset audio library and played.
  • As another example, if the feature information indicates that the user is accustomed to listening to rock music, audio of the rock type may be extracted from the preset audio library and played.
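The selection in step 204 can be illustrated with a minimal Python sketch. It is not the implementation described in the disclosure: the `Track` record, its `play_count` field, the `select_audio` function, and the sample library are all hypothetical, and the "most played" strategy is only one possible reading of the selection criterion mentioned above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    """Hypothetical audio library entry."""
    title: str
    audio_type: str  # label matched against the feature information, e.g. "joy" or "rock"
    play_count: int  # hypothetical popularity field


def select_audio(library, audio_type, strategy="most_played", rng=None):
    """Return one track of the requested type, or None if the library has none."""
    candidates = [t for t in library if t.audio_type == audio_type]
    if not candidates:
        return None
    if strategy == "random":
        # Use the supplied random.Random instance, or the random module itself.
        return (rng or random).choice(candidates)
    # Default strategy: the candidate with the highest play count.
    return max(candidates, key=lambda t: t.play_count)


LIBRARY = [
    Track("Sunny Day", "joy", 120),
    Track("Bright Morning", "joy", 340),
    Track("Loud Guitars", "rock", 90),
]
```

With this sample library, a request for the "joy" type under the default strategy returns the more frequently played of the two joy tracks, and a request for a type that is not in the library returns `None`.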
  • Intent judgment data is collected for at least one user in the target space, the user's target vocalization intent is determined according to the intent judgment data, the feature information is determined according to the target vocalization intent, and finally the audio corresponding to the feature information is extracted from the preset audio library and played. The electronic device can thus determine the user's target vocalization intent on its own, and the user does not need to trigger audio playback. This reduces the steps the user must take to play audio and makes audio playback more convenient.
  • In addition, the played audio is matched to the user's characteristics, so the audio that the user wants to hear is played more accurately and the automatically played audio is better targeted.
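The overall flow of steps 201 to 204 can be sketched as follows. This is an illustrative outline only: the function `play_for_users`, the dictionary-based data format, and the stub detectors are assumptions made for the example, and real intent detection and feature extraction would be done by trained models.

```python
from typing import Callable, Dict, List, Optional


def play_for_users(
    intent_data: dict,
    detect_intent: Callable[[dict], Optional[str]],
    extract_feature: Callable[[dict, str], str],
    library: Dict[str, List[str]],
    play: Callable[[str], None],
) -> Optional[str]:
    """Run steps 202-204 on already collected intent judgment data (step 201)."""
    # Step 202: determine the target vocalization intent, if any.
    intent = detect_intent(intent_data)
    if intent is None:
        return None  # no vocalization intent, so nothing is played
    # Step 203: determine feature information for the current user(s).
    feature = extract_feature(intent_data, intent)
    # Step 204: look up audio matching the feature information and play it.
    tracks = library.get(feature, [])
    if not tracks:
        return None
    play(tracks[0])
    return tracks[0]


# Hypothetical stand-ins for the recognition models.
def demo_detect(data: dict) -> Optional[str]:
    return "singing" if data.get("emotion") == "joy" or data.get("humming") else None


def demo_feature(data: dict, intent: str) -> str:
    return data.get("emotion", "neutral")


played: List[str] = []
result = play_for_users(
    {"emotion": "joy"}, demo_detect, demo_feature, {"joy": ["Bright Morning"]}, played.append
)
```

Passing the detectors and the playback function in as callables keeps the sketch independent of where each step runs, which matches the option of splitting the steps between the terminal device and the server.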
  • The target vocalization intent of the at least one user may be determined in any of the following manners:
  • Manner 1: In response to determining that the intent judgment data includes a face image of the at least one user, input the face image into a pre-trained third emotion recognition model to obtain emotion category information; if the emotion category information is preset emotion type information, determine that the at least one user has the target vocalization intent.
  • The third emotion recognition model may be obtained in advance by training a preset initial model with a preset training sample set.
  • The training samples in the training sample set may include sample face images and corresponding emotion category information.
  • The electronic device can use a sample face image as the input of the initial model (for example, one including a convolutional neural network, a classifier, etc.) and the emotion category information corresponding to that sample face image as the expected output, and train the initial model to obtain the third emotion recognition model.
  • The preset emotions represented by the preset emotion type information may be various emotions such as excitement, joy, or sadness.
  • When the emotion type information output by the third emotion recognition model indicates that the user's emotion is one of the preset emotions, it is determined that the at least one user has the target vocalization intent. For example, when the emotion type information indicates that the user is excited, the user may want to sing to express that mood, and it is determined that the user has a singing intent.
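The decision rule in Manner 1 can be sketched as below, assuming the emotion recognition model returns a probability per emotion label. The function name, the score format, and the `min_score` confidence floor are assumptions added for illustration; the disclosure only requires that the recognized emotion be one of the preset emotions.

```python
# Preset emotions taken from the examples in the text.
PRESET_EMOTIONS = {"excitement", "joy", "sadness"}


def has_target_intent(emotion_scores, preset=PRESET_EMOTIONS, min_score=0.5):
    """Decide whether a vocalization intent exists from classifier scores.

    emotion_scores: dict mapping emotion label -> probability
                    (hypothetical output format of the emotion model).
    min_score:      assumed confidence floor, not part of the disclosure.
    """
    if not emotion_scores:
        return False
    # Take the most probable emotion as the recognized emotion category.
    label, score = max(emotion_scores.items(), key=lambda kv: kv[1])
    return label in preset and score >= min_score
```

For instance, scores of 0.8 for "joy" and 0.2 for "neutral" yield an intent, while a dominant "neutral" score does not.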
  • Manner 2: In response to determining that the intent judgment data includes voice information of the at least one user, perform speech recognition on the voice information to obtain a speech recognition result; if the speech recognition result indicates that the at least one user is instructing audio to be played, determine that the at least one user has the target vocalization intent.
  • Methods for performing speech recognition on voice information are prior art and are not described in detail here.
  • For example, if a user's speech is recognized as "this song is good, I want to sing", it is determined that the at least one user has the target vocalization intent (i.e., a singing intent).
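Once speech recognition has produced text, the check in Manner 2 reduces to deciding whether the text expresses an instruction. The sketch below uses simple pattern matching for that last step only. The trigger phrases and the function name are hypothetical, and a production system would more likely use a natural language understanding model than fixed patterns.

```python
import re

# Hypothetical trigger phrases; the text gives only "I want to sing" as an example.
INTENT_PATTERNS = [
    re.compile(r"\bi want to sing\b"),
    re.compile(r"\bplay (a|the|this|some) (song|music)\b"),
]


def transcript_indicates_intent(transcript: str) -> bool:
    """Return True if the recognized text matches any trigger phrase."""
    text = transcript.lower()
    return any(pattern.search(text) for pattern in INTENT_PATTERNS)
```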
  • Manner 3: In response to determining that the intent judgment data includes voice information of the at least one user, perform melody recognition on the voice information to obtain a melody recognition result; if the melody recognition result indicates that the at least one user is vocalizing in a target form, determine that the at least one user has the target vocalization intent.
  • Vocalization in the target form corresponds to the target vocalization intent.
  • For example, vocalization in the target form may include singing, reciting, humming, and the like.
  • Methods for performing melody recognition on voice information are prior art and usually proceed as follows: the melody recognition model extracts the melody from the input human voice through note segmentation and pitch extraction, and the melody extraction yields a note sequence.
  • The electronic device then matches the note sequence output by the melody recognition model against the note sequences of the audio in the audio library.
  • If the similarity between the output note sequence and the note sequence of some audio is greater than a preset similarity threshold, the user is singing (i.e., vocalizing in the target form), and it is determined that the at least one user has the target vocalization intent.
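The note-sequence matching step can be sketched with Python's standard `difflib` module. The disclosure does not specify a similarity measure, so everything here is an assumption: notes are represented as MIDI pitch numbers, sequences are compared by their pitch intervals so that a melody sung in a different key still matches, and the threshold of 0.8 is arbitrary.

```python
from difflib import SequenceMatcher


def intervals(notes):
    """Pitch differences between consecutive notes (transposition-invariant)."""
    return [b - a for a, b in zip(notes, notes[1:])]


def melody_similarity(sung, reference):
    """Similarity in [0, 1] between two note sequences given as MIDI numbers."""
    if len(sung) < 2 or len(reference) < 2:
        return 0.0  # too short to form any interval
    matcher = SequenceMatcher(None, intervals(sung), intervals(reference), autojunk=False)
    return matcher.ratio()


def is_singing(sung, library, threshold=0.8):
    """True if the sung notes exceed the threshold against any library melody."""
    return any(melody_similarity(sung, ref) > threshold for ref in library.values())


# Hypothetical library entry: the opening notes of a well-known tune.
MELODIES = {"twinkle": [60, 60, 67, 67, 69, 69, 67]}
sung_transposed = [62, 62, 69, 69, 71, 71, 69]  # same tune, two semitones higher
```

Comparing intervals rather than absolute pitches is a design choice for the example: users rarely sing in the key of the recording, so an absolute comparison would miss most real matches.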
  • This implementation provides several ways to determine the user's target vocalization intent, so the intent can be detected comprehensively and with high accuracy through multimodal methods such as emotion recognition, speech recognition, and melody recognition. Audio can then be played to the user based on the target vocalization intent without manual operation, which makes audio playback more convenient.
  • The feature information may be determined in at least one of the following manners:
  • Manner 1: Obtain historical audio playback records for the at least one user; determine listening habit information of the at least one user based on the historical audio playback records; and determine the feature information based on the listening habit information.
  • The electronic device may acquire the historical audio playback records locally or remotely. The listening habit information characterizes features such as the types of audio the user often listens to and the user's listening time. For example, the audio type listened to most often in the historical audio playback records may be taken as the listening habit information. Generally, the listening habit information may be included in the feature information.
  • Manner 2: Obtain a face image of at least one user, input the face image into a pre-trained fourth emotion recognition model, and obtain emotion category information representing the current emotion of the at least one user; and determine feature information based on the emotion category information.
  • The fourth emotion recognition model may be a neural network model for classifying the emotion of face images. It may be the same as or different from the third emotion recognition model described in the above-mentioned optional implementation, but its training method is basically the same as that of the third emotion recognition model and is not repeated here.
  • Generally, the emotion category information may be included in the feature information.
  • Manner 3: Obtain an environment image of the environment in which the at least one user is located, input the environment image into a pre-trained environment recognition model, and obtain environment type information; and determine feature information based on the environment type information.
  • the environment image may be obtained by photographing an environment other than the target space by a camera.
  • The environment recognition model may be a neural network model for classifying environment images. The electronic device may obtain it in advance by training a preset initial model on a preset training sample set.
  • The training samples in the training sample set may include sample environment images and corresponding environment type information.
  • The electronic device can use the sample environment image as the input of the initial model (for example, a model including a convolutional neural network, a classifier, etc.) and the environment type information corresponding to the input sample environment image as the expected output of the initial model, and train the initial model to obtain the above environment recognition model.
  • the environment type information is used to represent the type of the environment where the at least one user is located.
  • The type of the environment may be a location type such as a suburb, a highway, or a village, or a weather type such as a sunny day, a rainy day, or a snowy day.
  • Generally, the environment type information may be included in the feature information.
  • Manner 4: Obtain an in-space image by photographing the target space; determine the number of people in the target space based on the in-space image; and determine feature information based on the number of people.
  • The in-space image may be an image captured by a camera set in the target space, and there may be one or more in-space images. The electronic device may detect the people in each in-space image using an existing object detection method and count them. Generally, the number of people may be included in the feature information.
  • This implementation can comprehensively detect the current state of the user and obtain more comprehensive feature information, which helps extract audio of interest to the user in a more targeted manner based on the feature information and improves the accuracy of playing audio for the user.
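  • As an illustration of the manner based on historical audio playback records, the sketch below takes the audio type with the most plays as the listening habit information. The record layout (a list of dictionaries with an `audio_type` key) and the function name are assumptions made for the example.

```python
from collections import Counter


def most_listened_type(history):
    """Return the most frequently played audio type, or None for an empty history.

    `history` is a hypothetical list of playback records, each a dict
    with an "audio_type" key.
    """
    counts = Counter(record["audio_type"] for record in history)
    if not counts:
        return None
    # most_common(1) returns a one-element list of (type, count) pairs.
    return counts.most_common(1)[0][0]
```

  • For example, `most_listened_type([{"audio_type": "rock"}, {"audio_type": "pop"}, {"audio_type": "rock"}])` returns `"rock"`.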
  • step 204 may be performed as follows:
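  • In response to determining that the feature information includes listening habit information, audio corresponding to the listening habit is extracted and played.
  • In response to determining that the feature information includes emotion category information, audio corresponding to the emotion category information is extracted and played.
  • In response to determining that the feature information includes environment type information, audio corresponding to the environment type information is extracted and played.
  • In response to determining that the feature information includes the number of people, audio corresponding to the number of people is extracted and played.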
  • As an example, if the listening habit information indicates that the user often listens to rock music, rock-type audio may be extracted and played. If the emotion category information indicates that the user's current emotion is happy, fast-paced audio may be extracted and played. If the environment type information indicates that the user is currently outdoors in the wild, audio with a soothing rhythm may be extracted and played. If the determined number of users is 2 or more, chorus-type audio may be extracted and played.
  • When the feature information includes at least two of listening habit information, emotion category information, environment type information, and number of people, the intersection of the audio contained in the audio types corresponding to each kind of information may be taken as the audio to be played.
  • the extracted audio can be more attractive to the user, thereby improving the accuracy of playing the audio for the user.
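  • The intersection rule for multiple kinds of feature information can be sketched as below. The function name and the dictionary layout (feature name mapped to the set of candidate audio ids selected for that feature) are assumptions; the text does not say what to do when the intersection is empty, so the sketch simply returns an empty set in that case.

```python
def select_audio(candidates_by_feature):
    """Return the audio ids common to every feature's candidate set.

    `candidates_by_feature` maps a feature name (e.g. "habit", "emotion")
    to the collection of audio ids matching that feature.
    """
    candidate_sets = [set(ids) for ids in candidates_by_feature.values()]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)
```

  • For example, with candidates `{"a", "b", "c"}` for listening habits, `{"b", "c"}` for emotion, and `{"c", "d"}` for number of people, only audio `"c"` satisfies all three.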
  • With further reference to FIG. 3, a schematic flowchart of still another embodiment of the audio playing method is shown. As shown in FIG. 3, on the basis of the above-mentioned embodiment shown in FIG. 2, the following steps may be further included after step 204:
  • Step 205: Extract user audio information from the current mixed sound signal.
  • the above-mentioned mixed sound signal may be a signal collected by the information collection device 104 (ie, a microphone) as shown in FIG. 1 , which is arranged in the above-mentioned target space.
  • User audio information is the sound made by a user.
  • the sound signal collected by the microphone includes a noise signal, or includes a sound signal sent by at least two users at the same time, and the sound signal collected at this time is a mixed sound signal. That is, the mixed sound signal may include the noise signal, or may include the sound information uttered by the user, or may include both the noise signal and the sound signal uttered by the user.
  • Existing speech separation methods (for example, the Blind Source Separation (BSS) method, the Auditory Scene Analysis (ASA) method, etc.) may be used to extract, from the mixed sound signal, the user audio information corresponding to each user.
  • Step 206: If the user audio information meets a preset condition, play the user audio information.
  • The electronic device may analyze the extracted user audio information and play it if it satisfies the preset condition. As an example, if the electronic device recognizes that the user audio information indicates that the user is singing, it plays the user audio information through the speaker at an amplified volume. Alternatively, if the electronic device recognizes that the melody of the sound made by the user, as represented by the user audio information, matches the currently playing audio, it plays the user audio information.
  • steps 205-206 are performed while the audio described in step 204 is playing.
  • the played audio can be music.
  • User audio information is extracted in real time from the mixed sound signal currently produced by at least one user. If the user audio information matches the music being played, the user audio information is played, thereby realizing a scenario in which the user sings along to the music.
  • An existing feedback cancellation method may also be used to filter out, from the sound signal collected by the microphone, the sound played by the speaker, thereby reducing the interference of feedback with the playback of user audio information.
  • In the method provided by the embodiment corresponding to FIG. 3, the user audio information can be played simultaneously with the audio extracted from the preset audio library. There is no need to provide the user with a separate microphone dedicated to playing the user's voice: the microphone that collects the mixed sound of the users in the target space is enough to extract the sound made by the user from the mixed sound signal and play it together with the currently playing audio. This simplifies the hardware required for playing user audio information and makes it more convenient for the user to realize the target vocalization intention.
  • In addition, playing only the user audio information that meets the preset condition avoids the interference that would be caused by playing back user chat and other such content.
  • step 205 further includes the following steps:
  • Step 2051: Acquire initial audio information collected by an audio collection device set in the target space.
  • the initial audio information may include a mixed sound signal.
  • the audio collection device is a device included in the information collection device 104 shown in FIG. 1 .
  • the number of audio collection devices may be one or more, and the number of channels of initial audio information is consistent with the number of audio collection devices, that is, each audio collection device collects a channel of initial audio information.
  • the number of audio collection devices may match the number of seats in the vehicle. That is, an audio capture device is installed near each seat.
  • Step 2052: Perform vocal separation on the initial audio information to obtain at least one channel of user audio information.
  • Each channel of user audio information corresponds to one user.
  • the electronic device can use the existing voice separation method to extract the user audio information corresponding to each user from the initial audio information.
  • a blind source separation algorithm may be used to separate at least one channel of user audio information from the initial audio information.
  • at least one channel of user audio information can be separated from the initial audio information collected by each audio collection device by using an existing microphone array-based speech separation algorithm.
  • In the method provided by the embodiment corresponding to FIG. 4, the user audio information of multiple users can be collected in real time, and each channel of user audio information is free of sound interference from other users. The user audio information played subsequently can therefore clearly reflect the voice of each user, which improves the quality of playing the voices of multiple users.
  • step 206 in the foregoing embodiment corresponding to FIG. 3 may be performed as follows:
  • the volume of at least one channel of user audio information is adjusted to the target volume respectively, the user audio information after the volume adjustment is synthesized, and the synthesized user audio information is played.
  • the target volume corresponding to each channel of user audio information may be the same or different.
  • For example, the volume of the loudest channel of user audio information can be used as the target volume, and the volumes of the other channels adjusted to it; alternatively, a fixed volume can be set as the target volume, and every channel of user audio information adjusted to that same target volume.
  • each channel of user audio information can be combined into stereo playback, or combined into the same channel for playback.
  • the volume of each user audio information being played can be made consistent or reach the respective set volume, so as to prevent the volume from being too low during playback due to the low volume emitted by the user.
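  • The volume adjustment and synthesis described above can be sketched as follows. The sketch makes several assumptions that are not in the text: each channel is a list of float samples, "volume" is measured as root-mean-square (RMS) level, the loudest channel sets the target when no fixed target is given, and the adjusted channels are averaged into a single mono channel.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of float samples (0.0 if empty)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def adjust_to_target(samples, target_rms):
    """Scale samples so that their RMS level equals target_rms."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # silence cannot be scaled to a non-zero level
    gain = target_rms / current
    return [s * gain for s in samples]


def mix_channels(channels, target_rms=None):
    """Bring every channel to the target volume, then average into one channel."""
    if target_rms is None:
        # Default: use the loudest channel's volume as the target volume.
        target_rms = max(rms(ch) for ch in channels)
    adjusted = [adjust_to_target(ch, target_rms) for ch in channels]
    length = min(len(ch) for ch in adjusted)
    return [sum(ch[i] for ch in adjusted) / len(adjusted) for i in range(length)]
```

  • Averaging rather than summing keeps the mixed signal within the original sample range; a stereo variant would instead route the adjusted channels to the left and right outputs.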
  • the above-mentioned step 206 may play user audio information based on at least one of the following methods:
  • Manner 1: Perform melody recognition on the user audio information to obtain the user melody information; match the user melody information with the melody information of the currently playing audio, and play the user audio information based on the obtained first matching result.
  • The method of performing melody recognition on user audio information is known in the prior art and usually proceeds as follows: melody extraction is performed on the user audio information input to the melody recognition model through note segmentation and pitch (fundamental frequency) extraction, and the resulting note sequence is taken as the melody information.
  • The electronic device then calculates the similarity between the melody information output by the melody recognition model and the melody information of the currently playing audio. If the similarity (that is, the first matching result) is greater than or equal to a preset first similarity threshold, it can be determined that the first matching result meets the preset condition, and the user audio information can be played.
  • Manner 2: Perform speech recognition on the user audio information to obtain a speech recognition result; match the speech recognition result with the corresponding text information of the currently playing audio, and play the user audio information based on the obtained second matching result.
  • the speech recognition result may be text information. It should be noted that the method for performing speech recognition on user audio information is in the prior art, and details are not described herein again.
  • The corresponding text information of the currently playing audio is text information for which a correspondence with the audio has been established in advance. For example, if the currently playing audio is a song, the corresponding text information may be the lyrics; if the currently playing audio is a poetry reading, the corresponding text information is the original text of the poem being read.
  • The electronic device may calculate the similarity between the speech recognition result and the above-mentioned corresponding text information. If the similarity (i.e., the second matching result) is greater than or equal to a preset second similarity threshold, it may be determined that the second matching result meets the preset condition, and the user audio information can be played.
  • The electronic device may execute either of the above-mentioned Manner 1 and Manner 2 to play user audio information.
  • Manner 1 and Manner 2 may also be performed simultaneously; if it is determined, based on the first matching result and the second matching result, that the user audio information can be played under both manners, the user audio information is played.
  • Manner 1 and/or Manner 2 may be performed for each channel of user audio information.
  • In this implementation, the user audio information is played only when certain conditions are met, which avoids playing user audio information that is irrelevant to the currently playing audio, makes the played user audio information match the currently playing audio more closely, and thereby improves the quality of playing the user audio information.
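  • The gating logic of Manner 1 and Manner 2 can be sketched as follows, assuming the melody information is a note sequence and the speech recognition result is a string. The function names, the `mode` parameter, the default thresholds, and the use of `difflib.SequenceMatcher` as the similarity measure are illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1] between two sequences (lists of notes or strings)."""
    return SequenceMatcher(None, a, b).ratio()


def should_play(user_notes, audio_notes, user_text, audio_text,
                mode="both", melody_threshold=0.7, text_threshold=0.7):
    """Decide whether to play the user audio information.

    mode="melody" applies only the melody match, mode="text" only the
    text match, and mode="both" requires both matches to pass.
    """
    melody_ok = similarity(user_notes, audio_notes) >= melody_threshold
    text_ok = similarity(user_text, audio_text) >= text_threshold
    if mode == "melody":
        return melody_ok
    if mode == "text":
        return text_ok
    if mode == "both":
        return melody_ok and text_ok
    raise ValueError(f"unknown mode: {mode!r}")
```

  • With several channels of user audio information, the same function would simply be called once per channel.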
  • the above step 206 further includes:
  • the pitch of the user's audio information is determined.
  • the method for determining the pitch of the user audio information is in the prior art, and details are not described herein again.
  • Step 1: Adjust the pitch of the currently playing audio to a target pitch that matches the pitch of the user audio information.
  • The pitch of the currently playing audio may be compared with the pitch of the user audio information; if the difference between the two is outside a preset difference range, the pitch of the currently playing audio is adjusted so that its difference from the pitch of the user audio information falls within the preset difference range.
  • As an example, suppose the user audio information is audio of the user singing and the currently playing audio is song music. If the pitch of the user audio information is higher or lower than the pitch of the currently playing music, the pitch of the music can be adjusted to suit the pitch of the user's singing. In other words, the difficulty of singing along with the played music is adjusted so that the user can better adapt to it.
  • Step 2: Output recommendation information for recommending audio corresponding to the pitch of the user audio information.
  • the audio corresponding to the pitch of the user audio information may be the audio whose difference value from the pitch of the user audio information is within a preset difference value range.
  • the recommended information can be output in the form of prompt sounds, displayed text, images, etc. After the recommended information is output, the user can choose whether to play the recommended audio, so that the pitch of the replayed audio matches the user's pitch.
  • this implementation makes the pitch of the played audio automatically adapt to the user's pitch, so that the playback effect of the user's audio information is better. There is no need to adjust the pitch of the played audio through active means such as manual or voice control, which improves the convenience of adjusting the audio.
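  • The decision in Step 1 can be sketched as below. The sketch assumes that both pitches are given as fundamental frequencies in hertz and that the preset difference range is expressed in semitones; the function names and the default range of 2 semitones are hypothetical. Shifting by the rounded offset leaves a residual difference of at most half a semitone.

```python
import math


def semitone_offset(user_hz, audio_hz):
    """Signed pitch difference, in semitones, of the user relative to the audio."""
    return 12.0 * math.log2(user_hz / audio_hz)


def transpose_amount(user_hz, audio_hz, max_diff_semitones=2.0):
    """Whole semitones to shift the playing audio by, or 0 if already in range."""
    diff = semitone_offset(user_hz, audio_hz)
    if abs(diff) <= max_diff_semitones:
        return 0
    return round(diff)
```

  • For instance, a user singing at 220 Hz against audio centered on 440 Hz is one octave low, so `transpose_amount(220.0, 440.0)` returns `-12`.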
  • With further reference to FIG. 5, a schematic flowchart of still another embodiment of the audio playing method is shown. As shown in FIG. 5, on the basis of the above-mentioned embodiment shown in FIG. 3, the following steps may be further included after step 206:
  • Step 207: Determine a target user corresponding to the user audio information from at least one user, and acquire a face image of the target user.
  • the face image may be an image captured by a camera provided in the target space and included in the information acquisition device 104 in FIG. 1 .
  • When the electronic device extracts the user audio information from the mixed sound signal, it can determine the position of the sound source corresponding to the user audio information based on an existing speech separation method (for example, an existing microphone-array-based multi-zone speech separation method that determines which position in the target space the user audio information corresponds to). The position of the sound source is the position of the user. The user at that position can be identified in an image capturing the users, and the face image of the user corresponding to the user audio information can then be obtained.
  • Step 208: Input the respective face images of the at least one user into the pre-trained first emotion recognition model to obtain emotion category information corresponding to the at least one user. That is to say, in this step, the face image of the target user corresponding to the user audio information is input into the pre-trained first emotion recognition model, and the emotion category information corresponding to the target user is obtained.
  • The first emotion recognition model may be the same as at least one of the third emotion recognition model and the fourth emotion recognition model described in the above-mentioned optional implementations, or may be different from both, but its training method is basically the same as that of at least one of the third and fourth emotion recognition models and is not repeated here.
  • Step 209: Based on the emotion category information, determine a first score representing the degree of matching between the emotion of at least one user and the type of the currently playing audio. If the emotion category information in this step is the emotion category information corresponding to the target user, the determined first score represents the degree of matching between the emotion of the target user and the type of the currently playing audio.
  • the first score may be obtained based on a probability value corresponding to the output emotion category information calculated by the first emotion recognition model.
  • The first emotion recognition model can classify the input face image to obtain multiple pieces of emotion category information and the probability value corresponding to each, and the emotion category information with the maximum probability value can be determined as the emotion category information of the face image recognized this time.
  • If the emotion category information of the face image recognized this time is a single type, the first score may be determined according to the probability corresponding to that type of emotion category information. If it includes multiple types of emotion category information, the emotion category information that matches the type of the currently playing audio can be selected from among them as the target emotion category information, and the first score is then determined based on the probability corresponding to the target emotion category information. The larger the value of the first score, the greater the degree of matching with the currently playing audio.
  • the corresponding relationship between the type of the currently playing audio and the emotion category information may be preset. For example, if the type of the currently playing audio is marked as "cheerful", the first score may be obtained based on the probability corresponding to the emotion category information representing the cheerful emotion output by the model.
  • Step 210: Based on the first score, determine and output the score of the user audio information.
  • the score of the user's audio information can be output in various ways, such as displaying on a display screen, outputting the sound of the score through a speaker, and the like.
  • the first score may be determined as the score of the user audio information.
  • step 209 may be performed as follows: based on the user audio information, determine a second score representing the degree of matching between the user audio information and the currently played audio, that is, in this step, the second score is determined based on the user audio information, The second score is used to represent the degree of matching between the user audio information and the currently playing audio.
  • Step 210 may be performed as follows: based on the second score, determine and output the score of the user audio information.
  • the second score may be determined by using an existing method for scoring user audio information. For example, when the user audio information indicates that the user is singing, the second score may be determined based on an existing singing scoring method. Further, the second score may be determined as the score of the user's audio information.
  • step 210 may also be performed as follows: based on the first score and the second score, determine and output the score of the user audio information.
  • the first score and the second score may be weighted and summed based on the preset weights corresponding to the first score and the second score respectively, to obtain the score of the user audio information.
  • The method provided by the embodiment corresponding to FIG. 5 determines the score of the user audio information based on face image recognition and/or audio scoring, so that the score fully reflects the degree of matching between the user audio information and the played audio, which improves the accuracy of scoring the user audio information.
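  • The two scoring rules above can be sketched as follows: the first score is read from the probability that the emotion model assigns to the emotion category preset for the current audio type, and the overall score is a weighted sum of the first and second scores. The mapping table, the 0 to 100 scale, the function names, and the default weights are all assumptions made for illustration.

```python
# Hypothetical preset correspondence between audio type labels and emotion categories.
AUDIO_TYPE_TO_EMOTION = {"cheerful": "happy", "soothing": "calm", "melancholy": "sad"}


def first_score(emotion_probs, audio_type, scale=100.0):
    """Score from the probability of the emotion matching the audio type.

    `emotion_probs` maps each emotion category to the probability output
    by the emotion recognition model for the user's face image.
    """
    target_emotion = AUDIO_TYPE_TO_EMOTION[audio_type]
    return scale * emotion_probs.get(target_emotion, 0.0)


def overall_score(first, second, w_first=0.5, w_second=0.5):
    """Weighted sum of the emotion-based and audio-based scores."""
    return w_first * first + w_second * second
```

  • With model output `{"happy": 0.75, "calm": 0.2, "sad": 0.05}` and audio marked "cheerful", the first score is 75; combined with a singing score of 85 under equal weights, the overall score is 80.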
  • step 208 may be performed as follows:
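  • The face images of the at least one user are input into the first emotion recognition model to obtain a first emotion category information sequence for each user, in which each item of emotion category information corresponds to one face image subsequence.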
  • the number of the user's face images is at least two, that is, the user's face image sequence is input to the first emotion recognition model.
  • The face image sequence of a given user may be a sequence of image frames captured of that user's face.
  • the sequence of emotion category information can be represented in the form of a vector, where each value in the vector corresponds to a subsequence of face images and represents a certain emotion category.
  • Each facial image subsequence may include at least one facial image.
  • As an example, the duration of the currently playing audio is 3 minutes, and the user's face is filmed for those 3 minutes during playback. The 3-minute face image sequence can be divided into 100 face image subsequences, each of which is input into the first emotion recognition model in turn, yielding a vector of 100 values as the emotion category information sequence.
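  • The division into subsequences in the example above can be sketched as follows. The `classify` argument is a hypothetical stand-in for the first emotion recognition model (any callable that maps a list of frames to an emotion category value), and the splitting into contiguous, near-equal chunks is an assumption; the text only states that each subsequence contains at least one face image.

```python
def split_into_subsequences(frames, n_subsequences):
    """Split frames into n contiguous subsequences of near-equal length."""
    n = len(frames)
    # Boundaries are spread evenly over the frame indices.
    bounds = [round(i * n / n_subsequences) for i in range(n_subsequences + 1)]
    return [frames[bounds[i]:bounds[i + 1]] for i in range(n_subsequences)]


def emotion_sequence(frames, n_subsequences, classify):
    """Apply the emotion classifier to each subsequence, giving one value each."""
    return [classify(sub) for sub in split_into_subsequences(frames, n_subsequences)]
```

  • For a 3-minute recording at an assumed 25 frames per second (4500 frames), splitting into 100 subsequences gives 45 frames per subsequence and a 100-value emotion category information sequence.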
  • Step 2091: Acquire the video corresponding to the currently playing audio, and extract the facial image sequence of the target person from the video.
  • the target person may be a person related to the currently playing audio.
  • For example, if the currently playing audio is a song, the corresponding video may be a video that includes images of the singer of the song, and the target person may be the singer of the song or a person performing along with the song.
  • The target person can be set manually in advance, or can be obtained by the electronic device recognizing the video. For example, based on an existing mouth motion recognition method, the person whose mouth motion frequency matches the rhythm of the song is identified as the target person.
  • the electronic device can use an existing facial image detection method to extract the facial image sequence of the target person from the image frames included in the video according to the preset or recognized target person.
  • Step 2092: Input the facial image sequence into the first emotion recognition model to obtain the second emotion category information sequence.
  • This step is basically the same as the above step of determining the first emotion category information sequence, and will not be repeated here.
  • Step 2093: Determine the similarity between the first emotion category information sequence and the second emotion category information sequence.
  • the first emotion category information sequence and the second emotion category information sequence may both be in the form of vectors, and the electronic device may determine the distance between the vectors, and determine the similarity based on the distance (for example, the inverse of the distance is the similarity).
  • Step 2094: Based on the similarity, determine a first score.
  • the similarity may be determined as the first score, or the similarity may be scaled according to a preset ratio to obtain the first score.
  • This implementation can accurately determine the degree of consistency between the user's emotion and the emotion in the original video, so the obtained first score more accurately reflects how well the user's emotion matches the currently playing audio, thereby improving the accuracy of scoring the user audio information.
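  • Steps 2093 and 2094 can be sketched as follows, treating both emotion category information sequences as equal-length numeric vectors. The text suggests the inverse of the distance as one possible similarity; the sketch uses 1 / (1 + distance) instead, which is an assumption chosen so that identical sequences give a similarity of 1 rather than a division by zero. The Euclidean distance, the function names, and the scale factor are likewise illustrative.

```python
import math


def sequence_similarity(seq_a, seq_b):
    """Similarity in (0, 1] derived from the Euclidean distance between vectors."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(seq_a, seq_b)))
    return 1.0 / (1.0 + distance)


def score_from_similarity(similarity, scale=100.0):
    """Scale the similarity by a preset ratio to obtain the score."""
    return scale * similarity
```

  • Because emotion categories are labels rather than magnitudes, a per-position agreement rate (the fraction of subsequences where both sequences hold the same category) may be a more natural similarity in practice.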
  • With further reference to FIG. 7, a schematic flowchart of still another embodiment of the audio playing method is shown. As shown in FIG. 7, on the basis of the above-mentioned embodiment shown in FIG. 3, the following steps may be further included after step 206:
  • Step 211: Determine a target user corresponding to the user audio information from at least one user, and acquire a face image of the target user.
  • This step is basically the same as the above-mentioned step 207, and will not be repeated here.
  • Step 212: Input the face image of the target user corresponding to the user audio information, together with the user audio information, into the pre-trained second emotion recognition model to obtain emotion category information.
  • The second emotion recognition model in this step is different from the above-mentioned first, third, and fourth emotion recognition models.
  • The second emotion recognition model can receive an image and audio as input at the same time and jointly analyze them to output emotion category information.
  • the second emotion recognition model can be obtained by training a preset initial model for training the second emotion recognition model using a preset training sample set in advance.
  • the training samples in the training sample set may include sample face images, sample audio information, and corresponding emotion category information.
  • The electronic device can use the sample face image and sample audio information as the input of the initial model (for example, a model including a neural network, a classifier, etc.) and the emotion category information corresponding to the input sample face image and sample audio information as the expected output of the initial model, and train the initial model to obtain the above-mentioned second emotion recognition model.
  • The neural network included in the initial model can determine the feature information of the input sample face image and sample audio information, and the classifier can classify the feature information. The actual output is compared with the expected output, and the parameters of the initial model are adjusted so that the gap between the actual output and the expected output gradually decreases until convergence, whereby the second emotion recognition model is obtained.
  • Step 213: Based on the emotion category information, determine and output a score representing the matching degree between the emotion of the target user corresponding to the user audio information and the type of the currently playing audio.
  • the score may be obtained based on the probability value corresponding to the output emotion category information calculated by the second emotion recognition model.
  • the method for determining the score based on the probability value is basically the same as the method for determining the first score in the foregoing step 209, and details are not repeated here.
  • The above is the method provided by the embodiment corresponding to FIG. 7.
  • step 212 may be performed as follows:
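  • The face image of the target user corresponding to the user audio information and the user audio information are input into the second emotion recognition model to obtain a third emotion category information sequence, in which each item of emotion category information corresponds to one face image subsequence.
  • The definition of the third emotion category information sequence is basically the same as that of the above-mentioned first emotion category information sequence and is not repeated here.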
  • step 213 can be performed as follows:
  • Step 2131: Acquire the video corresponding to the currently playing audio, and extract the facial image sequence of the target person from the video.
  • This step is basically the same as the above-mentioned step 2091, and will not be repeated here.
  • Step 2132: Input the facial image sequence and the currently playing audio into the second emotion recognition model to obtain a fourth emotion category information sequence.
  • This step is basically the same as the above step of determining the third emotion category information sequence, and will not be repeated here.
  • Step 2133: Determine the similarity between the third emotion category information sequence and the fourth emotion category information sequence.
  • The third emotion category information sequence and the fourth emotion category information sequence may both be in the form of vectors, and the electronic device may determine the distance between the vectors and determine the similarity based on the distance (for example, the inverse of the distance is the similarity).
  • Step 2134: Based on the similarity, determine a score representing the degree of matching between the emotion of the user corresponding to the user audio information and the type of the currently playing audio.
  • the similarity may be determined as a score, or the similarity may be scaled according to a preset ratio to obtain a score.
  • Because the third emotion category information sequence and the fourth emotion category information sequence in this implementation are each obtained from face images together with audio information, images and audio are integrated during emotion classification, and the two emotion category information sequences therefore represent emotions more accurately. The score determined from the similarity between the two sequences can thus more accurately represent the degree of agreement between the user's emotion and the emotion of the original video, which further improves the accuracy of scoring the user audio information.
  • FIG. 9 is a schematic structural diagram of an audio playback device provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to electronic equipment.
  • The audio playback apparatus includes: an acquisition module 901, configured to acquire intent judgment data collected for at least one user in a target space; a first determination module 902, configured to determine, based on the intent judgment data, the target vocalization intention of the at least one user; a second determination module 903, configured to determine, based on the target vocalization intention, feature information representing the current features of the at least one user; and a first playing module 904, configured to extract, from a preset audio library, and play the audio corresponding to the feature information.
  • The acquisition module 901 may acquire the intent judgment data collected for at least one user in the target space.
  • The target space (e.g., the space 105 in FIG. 1) may be any of various spaces, such as the interior of a vehicle or the interior of a room.
  • The intent judgment data may be various kinds of information used to determine the user's intention, including but not limited to at least one of the following: the user's face image, the user's voice, and the like.
  • The first determination module 902 may determine the target vocalization intention of the at least one user based on the intent judgment data.
  • The vocalization type represented by the target vocalization intention may be preset.
  • The target vocalization intention may include, but is not limited to, at least one of the following: a singing intention, a recitation intention, and the like.
  • The first determination module 902 may select a corresponding manner of determining the target vocalization intention according to the type of the intent judgment data.
  • As an example, when the intent judgment data includes a face image of the user, emotion recognition may be performed on the face image to obtain the emotion type; if the emotion type is joy, it may be determined that the at least one user has the target vocalization intention (e.g., a singing intention).
  • When the intent judgment data includes a voice signal produced by the user, the voice signal may be recognized; if the recognition result indicates that the user is humming, it may be determined that the user has the target vocalization intention.
  • The second determination module 903 may determine feature information representing the current features of the at least one user.
  • The current features of the user may include, but are not limited to, at least one of the following: the user's emotion, the number of users, the user's listening habits, and the like.
  • The second determination module 903 may determine the feature information in a manner corresponding to each of these features. For example, a face image of the user captured by a camera may be acquired, and emotion recognition may be performed on the face image to obtain feature information representing the user's current emotion. As another example, the user's historical playback records may be acquired, and the type of audio the user habitually listens to may be determined from the records and used as the feature information.
  • The first playing module 904 may extract, from the preset audio library, and play the audio corresponding to the feature information.
  • The preset audio library may be arranged in the above electronic device, or in another electronic device communicatively connected to it.
  • The feature information corresponds to an audio type; the first playing module 904 may determine the type of audio to play according to the feature information and select audio of that type (for example, by play count or at random) for playback.
  • As an example, when the feature information indicates that the user's current emotion is joy, audio marked as the joy type may be extracted from the preset audio library and played.
  • When the feature information indicates that the user is accustomed to listening to rock music, rock-type audio may be extracted from the preset audio library and played.
  • FIG. 10 is a schematic structural diagram of an audio playback apparatus provided by another exemplary embodiment of the present disclosure.
  • The apparatus further includes: an extraction module 905, configured to extract user audio information from the current mixed sound signal; and a second playing module 906, configured to play the user audio information when the user audio information meets a preset condition.
  • The apparatus further includes: a third determination module 907, configured to determine, from the at least one user, the target user corresponding to the user audio information and to acquire a face image of the target user; a first emotion recognition module 908, configured to input the face image of the target user corresponding to the user audio information into a pre-trained first emotion recognition model to obtain the emotion category information corresponding to the target user; a fourth determination module 909, configured to determine, based on the emotion category information, a first score representing the degree of matching between the emotion of the target user corresponding to the user audio information and the type of the currently playing audio; and/or a fifth determination module 910, configured to determine, based on the user audio information, a second score representing the degree of matching between the user audio information and the currently playing audio; and a sixth determination module 911, configured to determine and output the score of the user audio information based on the first score and/or the second score.
  • The first emotion recognition module 908 includes: a first emotion recognition unit 9081, configured to input the face image of the at least one user into the first emotion recognition model to obtain a first emotion category information sequence corresponding to the at least one user, wherein each piece of emotion category information in the first emotion category information sequence corresponds to a face image subsequence; and a first determination unit 9082, configured to determine, based on the emotion category information, the first score representing the degree of matching between the emotion of the at least one user and the type of the currently playing audio, including: a first acquiring unit 9083, configured to acquire the video corresponding to the currently playing audio and extract the face image sequence of the target person from the video; a second emotion recognition unit 9084, configured to input the face image sequence into the first emotion recognition model to obtain a second emotion category information sequence; a second determination unit 9085, configured to determine the similarity between the first emotion category information sequence and the second emotion category information sequence; and a third determination unit 9086, configured to determine the first score based on the similarity.
  • The apparatus further includes: a seventh determination module 912, configured to determine, from the at least one user, the target user corresponding to the user audio information and to acquire a face image of the target user; a second emotion recognition module 913, configured to input the face image of the target user corresponding to the user audio information, together with the user audio information, into a pre-trained second emotion recognition model to obtain emotion category information; and an eighth determination module 914, configured to determine and output, based on the emotion category information, a score representing the degree of matching between the emotion of the target user corresponding to the user audio information and the type of the currently playing audio.
  • The second emotion recognition module 913 is further configured to: input the face image of the user corresponding to the user audio information, together with the user audio information, into the second emotion recognition model to obtain a third emotion category information sequence, wherein each piece of emotion category information in the third emotion category information sequence corresponds to a face image subsequence. The eighth determination module 914 includes: a second acquisition unit 9141, configured to acquire the video corresponding to the currently playing audio and extract the facial image sequence of the target person from the video; a third emotion recognition unit 9142, configured to input the facial image sequence and the currently playing audio into the second emotion recognition model to obtain a fourth emotion category information sequence; a fourth determination unit 9143, configured to determine the similarity between the third emotion category information sequence and the fourth emotion category information sequence; and a fifth determination unit 9144, configured to determine, based on the similarity, a score representing the degree of matching between the user's emotion corresponding to the user audio information and the type of the currently playing audio.
  • The extraction module 905 includes: a third acquisition unit 9051, configured to acquire initial audio information collected by an audio collection device arranged in the target space, where the initial audio information includes a mixed sound signal; and a separation unit 9052, configured to perform voice separation on the initial audio information to obtain at least one channel of user audio information, where each channel of user audio information corresponds to one user.
  • the second playing module 906 is further configured to: adjust the volume of at least one channel of user audio information to the target volume respectively, synthesize the adjusted user audio information, and play the synthesized user audio information.
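  • The volume adjustment and synthesis described above can be sketched as follows. This is only an illustration under stated assumptions: each channel of user audio information is a list of float samples in [-1, 1], "volume" is interpreted as the root-mean-square (RMS) level, and synthesis is a sample-wise sum with clipping. The function names and the default target level are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def adjust_to_target(samples, target_rms):
    """Scale one channel so that its RMS level equals target_rms."""
    level = rms(samples)
    if level == 0.0:
        return list(samples)  # a silent channel stays silent
    gain = target_rms / level
    return [s * gain for s in samples]


def mix_channels(channels, target_rms=0.1):
    """Bring every channel to the target volume, then sum them sample by sample."""
    adjusted = [adjust_to_target(ch, target_rms) for ch in channels]
    length = max((len(ch) for ch in adjusted), default=0)
    mixed = []
    for i in range(length):
        total = sum(ch[i] for ch in adjusted if i < len(ch))
        mixed.append(max(-1.0, min(1.0, total)))  # clip to the valid range
    return mixed


quiet_user = [0.01, -0.01, 0.01, -0.01]
loud_user = [0.5, -0.5, 0.5, -0.5]
synthesized = mix_channels([quiet_user, loud_user])
```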
  • The second playing module 906 includes: a first melody recognition unit 9061, configured to perform melody recognition on the user audio information to obtain user melody information, match the user melody information against the melody information of the currently playing audio, and play the user audio information based on the resulting first matching result; and/or a first speech recognition unit 9062, configured to perform speech recognition on the user audio information to obtain a speech recognition result, match the speech recognition result against the text information corresponding to the currently playing audio, and play the user audio information based on the resulting second matching result.
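  • The two matching conditions can be sketched in a few lines. The sketch assumes that melody recognition and speech recognition have already produced a note sequence and a text string, uses the standard-library difflib.SequenceMatcher ratio as a stand-in similarity measure, and treats either match as sufficient for playback. The thresholds and function names are hypothetical; the disclosure does not prescribe a particular matching algorithm.

```python
from difflib import SequenceMatcher


def sequence_similarity(a, b):
    """Similarity in [0, 1] between two sequences (note lists or strings)."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def should_play_user_audio(user_notes, audio_notes, user_text, audio_text,
                           melody_threshold=0.8, text_threshold=0.8):
    """Play the user audio if the melody or the recognized text matches."""
    melody_ok = sequence_similarity(user_notes, audio_notes) >= melody_threshold
    text_ok = sequence_similarity(user_text, audio_text) >= text_threshold
    return melody_ok or text_ok


song_notes = [60, 62, 64, 65, 67]           # hypothetical MIDI note numbers
song_lyrics = "twinkle twinkle little star"  # hypothetical lyric text
```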
  • The second playing module 906 includes: a sixth determination unit 9063, configured to determine the pitch of the user audio information; and an adjustment unit 9064, configured to adjust the pitch of the currently playing audio to a target pitch that matches the pitch of the user audio information; and/or an output unit 9065, configured to output recommendation information for recommending audio corresponding to the pitch of the user audio information.
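  • A rough sketch of the pitch handling follows. It assumes that the pitch of the user audio information and of each audio item can be summarized as a single representative frequency in Hz, and it expresses the adjustment as a whole number of equal-tempered semitones. The function names, the sample frequencies, and the two-semitone recommendation window are hypothetical.

```python
import math


def semitone_shift(user_pitch_hz, audio_pitch_hz):
    """Semitones by which the audio should be shifted to reach the user's pitch."""
    return round(12 * math.log2(user_pitch_hz / audio_pitch_hz))


def target_pitch(audio_pitch_hz, shift):
    """Frequency obtained after shifting the audio by `shift` semitones."""
    return audio_pitch_hz * 2 ** (shift / 12)


def recommend_by_pitch(user_pitch_hz, library_pitches_hz, max_semitones=2):
    """Names of audio items whose pitch lies within a few semitones of the user's."""
    return [name for name, pitch in library_pitches_hz.items()
            if abs(semitone_shift(user_pitch_hz, pitch)) <= max_semitones]


shift = semitone_shift(220.0, 440.0)  # the user sings one octave below the audio
```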
  • The first determination module 902 includes: a fourth emotion recognition unit 9021, configured to, in response to determining that the intent judgment data includes a face image of at least one user, input the face image into a pre-trained third emotion recognition model to obtain emotion category information, and, if the emotion category information is preset emotion type information, determine that the at least one user has the target vocalization intention; or a second speech recognition unit 9022, configured to, in response to determining that the intent judgment data includes voice information of at least one user, perform speech recognition on the voice information to obtain a speech recognition result, and, if the speech recognition result indicates that the at least one user instructs audio to be played, determine that the at least one user has the target vocalization intention; or a second melody recognition unit 9023, configured to, in response to determining that the intent judgment data includes voice information of at least one user, perform melody recognition on the voice information to obtain a melody recognition result, and, if the melody recognition result indicates that the at least one user is vocalizing in a target form, determine that the at least one user has the target vocalization intention.
  • The second determination module 903 includes: a seventh determination unit 9031, configured to acquire historical audio playback records for the at least one user, determine listening habit information of the at least one user based on the historical audio playback records, and determine the feature information based on the listening habit information; and/or a fifth emotion recognition unit 9032, configured to acquire a face image of the at least one user, input the face image into a pre-trained fourth emotion recognition model to obtain emotion category information representing the current emotion of the at least one user, and determine the feature information based on the emotion category information; and/or an environment recognition unit 9033, configured to acquire an environment image of the environment where the at least one user is located, input the environment image into a pre-trained environment recognition model to obtain environment type information, and determine the feature information based on the environment type information; and/or an eighth determination unit 9034, configured to acquire an in-space image obtained by photographing the target space, determine the number of people in the target space based on the in-space image, and determine the feature information based on the number of people.
  • The first playing module 904 includes: a first playing unit 9041, configured to extract and play audio corresponding to the listening habit in response to determining that the feature information includes listening habit information; a second playing unit 9042, configured to extract and play audio corresponding to the emotion category information in response to determining that the feature information includes emotion category information; a third playing unit 9043, configured to extract and play audio corresponding to the environment type information in response to determining that the feature information includes environment type information; and a fourth playing unit 9044, configured to extract and play audio corresponding to the number of people in response to determining that the feature information includes the number of people.
  • The audio playback apparatus provided by the above embodiments of the present disclosure collects intent judgment data for at least one user in the target space, determines the user's target vocalization intention according to the intent judgment data, determines feature information according to the target vocalization intention, and finally extracts the audio corresponding to the feature information from the preset audio library and plays it. The electronic device can thus automatically determine the user's target vocalization intention and, when it determines that the user has the vocalization intention, automatically play the audio without the user actively triggering playback, which reduces the steps the user performs to play audio and improves the convenience of audio playback. In addition, the played audio is adapted to the features of the user, so that the audio the user wants to listen to is played more accurately and automatic playback is better targeted.
  • The electronic device may be either or both of the terminal device 101 and the server 103 shown in FIG. 1, or a stand-alone device independent of them; the stand-alone device may communicate with the terminal device 101 and the server 103 to receive the collected input signals from them.
  • FIG. 11 illustrates a block diagram of an electronic device according to an embodiment of the present disclosure.
  • the electronic device 1100 includes one or more processors 1101 and a memory 1102 .
  • The processor 1101 may be a central processing unit (CPU) or another form of processing unit with data processing capability and/or instruction execution capability, and may control other components in the electronic device 1100 to perform desired functions.
  • Memory 1102 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • Volatile memory may include, for example, random access memory (Random Access Memory, RAM) and/or cache memory (cache).
  • the non-volatile memory may include, for example, a read-only memory (Read-Only Memory, ROM), a hard disk, a flash memory, and the like.
  • One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 1101 may execute the program instructions to implement the audio playback method and/or other desired functions of the various embodiments of the present disclosure above.
  • Various contents such as intent decision data, feature information, audio, and the like may also be stored in the computer-readable storage medium.
  • the electronic device 1100 may also include an input device 1103 and an output device 1104 interconnected by a bus system and/or other form of connection mechanism (not shown).
  • the input device 1103 may be a device such as a camera, a microphone, and the like, for inputting intention judgment data.
  • the input device 1103 may be a communication network connector for receiving the input intention judgment data from the terminal device 101 and the server 103 .
  • the output device 1104 can output various information, including extracted audio, to the outside.
  • the output devices 1104 may include, for example, displays, speakers, and communication networks and their connected remote output devices, among others.
  • the electronic device 1100 may also include any other appropriate components according to the specific application.
  • Embodiments of the present disclosure may also be computer program products comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps of the audio playback method according to various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
  • The computer program product may include program code for performing the operations of the embodiments of the present disclosure, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • Embodiments of the present disclosure may also be computer-readable storage media having computer program instructions stored thereon that, when executed by a processor, cause the processor to perform the steps of the audio playback method according to various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
  • the computer-readable storage medium may employ any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may include, for example, but not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses or devices, or a combination of any of the above.
  • Examples of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the methods and apparatus of the present disclosure may be implemented in many ways.
  • the methods and apparatus of the present disclosure may be implemented in software, hardware, firmware, or any combination of software, hardware, and firmware.
  • the above-described order of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise.
  • the present disclosure can also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing methods according to the present disclosure.
  • the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
  • each component or each step may be decomposed and/or recombined. These disaggregations and/or recombinations should be considered equivalents of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

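Embodiments of the present disclosure disclose an audio playback method and apparatus, a computer-readable storage medium, and an electronic device. The method includes: acquiring intent judgment data collected for at least one user in a target space; determining, based on the intent judgment data, that the at least one user has a target vocalization intention, and then determining feature information representing the current features of the at least one user; and extracting, from a preset audio library, and playing the audio corresponding to the feature information. In the embodiments of the present disclosure, the electronic device automatically determines the user's target vocalization intention, so the user does not need to actively trigger audio playback; this reduces the steps the user performs to play audio and makes audio playback more convenient. In addition, by determining the user's current features, the played audio is adapted to those features, so that the audio the user wants to listen to is played more precisely and automatic playback is better targeted.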

Description

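Audio playback method and apparatus, computer-readable storage medium, and electronic device
Technical Field
The present disclosure relates to the field of computer technology, and in particular to an audio playback method and apparatus, a computer-readable storage medium, and an electronic device.
Background
In recent years, with the continuous spread of intelligent electronic devices, the means of human-computer interaction have become increasingly rich. People can interact with devices through speech recognition, gesture recognition, and so on. For example, in the field of smart vehicles, a user can control in-vehicle electronic devices through manual operation, voice control, and the like, for instance to start music playback, turn the air conditioner on or off, set navigation, or modify navigation. When controlling an audio playback device, the user currently mainly uses manual control, speech recognition, and similar means to actively make the device play music, turn on the radio, and so on.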
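Summary
Embodiments of the present disclosure provide an audio playback method and apparatus, a computer-readable storage medium, and an electronic device.
An embodiment of the present disclosure provides an audio playback method, the method including: acquiring intent judgment data collected for at least one user in a target space; determining, based on the intent judgment data, the target vocalization intention of the at least one user; determining, based on the target vocalization intention, feature information representing the current features of the at least one user; and extracting, from a preset audio library, and playing the audio corresponding to the feature information.
According to another aspect of the embodiments of the present disclosure, an audio playback apparatus is provided, the apparatus including: an acquisition module, configured to acquire intent judgment data collected for at least one user in a target space; a first determination module, configured to determine, based on the intent judgment data, the target vocalization intention of the at least one user; a second determination module, configured to determine, based on the target vocalization intention, feature information representing the current features of the at least one user; and a first playing module, configured to extract, from a preset audio library, and play the audio corresponding to the feature information.
According to another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided; the computer-readable storage medium stores a computer program, and the computer program is used to execute the above audio playback method.
According to another aspect of the embodiments of the present disclosure, an electronic device is provided, the electronic device including: a processor; and a memory for storing instructions executable by the processor; the processor is configured to read the executable instructions from the memory and execute the instructions to implement the above audio playback method.
Based on the audio playback method and apparatus, computer-readable storage medium, and electronic device provided by the above embodiments of the present disclosure, intent judgment data is collected for at least one user in the target space, the target vocalization intention of the at least one user is determined from the intent judgment data, feature information is then determined according to the target vocalization intention, and finally the audio corresponding to the feature information is extracted from the preset audio library and played. The electronic device thus automatically determines the user's target vocalization intention and, when it determines that the user has a vocalization intention, automatically plays the audio without the user actively triggering playback, which reduces the steps the user performs to play audio and makes audio playback more convenient. In addition, by determining the user's current features, the played audio is adapted to those features, so that the audio the user wants to listen to is played more precisely and automatic playback is better targeted.
The technical solutions of the present disclosure are described in further detail below with reference to the drawings and embodiments.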
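Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the more detailed description of the embodiments of the present disclosure in conjunction with the accompanying drawings. The drawings are provided for a further understanding of the embodiments of the present disclosure, constitute a part of the specification, serve together with the embodiments to explain the present disclosure, and do not limit the present disclosure. In the drawings, the same reference numerals generally denote the same components or steps.
FIG. 1 is a diagram of a system to which the present disclosure is applicable.
FIG. 2 is a schematic flowchart of an audio playback method provided by an exemplary embodiment of the present disclosure.
FIG. 3 is a schematic flowchart of an audio playback method provided by another exemplary embodiment of the present disclosure.
FIG. 4 is a schematic flowchart of an audio playback method provided by another exemplary embodiment of the present disclosure.
FIG. 5 is a schematic flowchart of an audio playback method provided by another exemplary embodiment of the present disclosure.
FIG. 6 is a schematic flowchart of an audio playback method provided by another exemplary embodiment of the present disclosure.
FIG. 7 is a schematic flowchart of an audio playback method provided by another exemplary embodiment of the present disclosure.
FIG. 8 is a schematic flowchart of an audio playback method provided by another exemplary embodiment of the present disclosure.
FIG. 9 is a schematic structural diagram of an audio playback apparatus provided by an exemplary embodiment of the present disclosure.
FIG. 10 is a schematic structural diagram of an audio playback apparatus provided by another exemplary embodiment of the present disclosure.
FIG. 11 is a structural diagram of an electronic device provided by an exemplary embodiment of the present disclosure.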
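Detailed Description
Hereinafter, exemplary embodiments according to the present disclosure are described in detail with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure, and it should be understood that the present disclosure is not limited by the exemplary embodiments described here.
It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
Those skilled in the art will understand that terms such as "first" and "second" in the embodiments of the present disclosure are only used to distinguish different steps, devices, modules, and the like; they neither carry any particular technical meaning nor indicate a necessary logical order among them.
It should also be understood that, in the embodiments of the present disclosure, "a plurality of" may mean two or more, and "at least one" may mean one, two, or more.
It should also be understood that any component, data, or structure mentioned in the embodiments of the present disclosure can generally be understood as one or more, unless explicitly limited or the context indicates otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association between associated objects and indicates that three relationships are possible; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the character "/" in the present disclosure generally indicates an "or" relationship between the objects before and after it.
It should also be understood that the description of the embodiments in the present disclosure emphasizes the differences between the embodiments; for their identical or similar aspects, the embodiments may be referred to one another, and for brevity these are not repeated.
Meanwhile, it should be understood that, for ease of description, the dimensions of the parts shown in the drawings are not drawn to actual scale.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present disclosure or its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be regarded as part of the specification.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be discussed further in subsequent drawings.
Embodiments of the present disclosure may be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems.
Electronic devices such as terminal devices, computer systems, and servers may be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, object programs, components, logic, data structures, and the like, which perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment, in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media including storage devices.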
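Application Overview
Current audio playback systems usually require the user to manually select the audio to play, or to trigger playback through speech recognition, gesture recognition, and the like. These approaches generally require the user to actively interact with the audio playback system. The system cannot automatically determine the user's vocalization intention, so audio playback is not convenient enough; and it cannot automatically play audio suited to the user's features, so audio playback is also not well targeted.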
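Exemplary System
FIG. 1 shows an exemplary system architecture 100 to which the audio playback method or audio playback apparatus of the embodiments of the present disclosure may be applied.
As shown in FIG. 1, the system architecture 100 may include a terminal device 101, a network 102, a server 103, and an information collection device 104. The network 102 is the medium that provides a communication link between the terminal device 101 and the server 103. The network 102 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user may use the terminal device 101 to interact with the server 103 through the network 102 to receive or send messages and the like. Various communication client applications may be installed on the terminal device 101, such as audio players, video players, web browser applications, and instant messaging tools.
The terminal device 101 may be any of various electronic devices capable of audio playback, including but not limited to mobile terminals such as in-vehicle terminals, mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), and PMPs (portable multimedia players), and fixed terminals such as digital TVs, desktop computers, and smart home appliances.
The information collection device 104 may be any of various devices for collecting user-related information (including intent judgment data), including but not limited to at least one of the following: a camera, a microphone, and the like.
Generally, the terminal device 101 is arranged in a space 105 of limited extent, and the information collection device 104 is associated with the space 105. For example, the information collection device 104 may be arranged inside the space 105 to collect various information such as images and sounds of the user, or may be arranged outside the space 105 to collect various information such as images and sounds around the space 105. The space 105 may be any of various spaces of limited extent, such as the interior of a vehicle or the interior of a room.
The server 103 may be a server that provides various services, such as a background audio server that supports the audio played on the terminal device 101. The background audio server may process the received intent judgment data to obtain information such as the user's target vocalization intention, the user's feature information, and the audio to be played.
It should be noted that the audio playback method provided by the embodiments of the present disclosure may be executed by the server 103 or by the terminal device 101; accordingly, the audio playback apparatus may be arranged in the server 103 or in the terminal device 101. The audio playback method provided by the embodiments of the present disclosure may also be executed jointly by the terminal device 101 and the server 103; for example, the steps of acquiring the intent judgment data and determining the target vocalization intention are executed by the terminal device 101, and the steps of determining the feature information and extracting the audio are executed by the server 103; accordingly, the modules included in the audio playback apparatus may be distributed between the terminal device 101 and the server 103.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, servers, and information collection devices, as required by the implementation. For example, when the preset audio library is arranged locally, the above system architecture may include only the terminal device and the information collection device, without the network and the server.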
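Exemplary Method
FIG. 2 is a schematic flowchart of an audio playback method provided by an exemplary embodiment of the present disclosure. This embodiment may be applied to an electronic device (such as the terminal device 101 or the server 103 shown in FIG. 1). As shown in FIG. 2, the method includes the following steps:
Step 201: Acquire intent judgment data collected for at least one user in a target space.
In this embodiment, the electronic device may acquire intent judgment data collected for at least one user in the target space. The target space (e.g., the space 105 in FIG. 1) may be any of various spaces, such as the interior of a vehicle or the interior of a room. The intent judgment data may be various kinds of data used to determine the user's intention, including but not limited to at least one of the following: the user's face image data, the voice uttered by the user, and the like.
Step 202: Determine, based on the intent judgment data, the target vocalization intention of the at least one user.
In this embodiment, the electronic device may determine, based on the intent judgment data, the target vocalization intention of the at least one user. The vocalization type represented by the target vocalization intention may be preset. For example, the target vocalization intention may include, but is not limited to, at least one of the following: a singing intention, a recitation intention, and the like. The electronic device may select a corresponding manner of determining the target vocalization intention according to the type of the intent judgment data.
As an example, when the intent judgment data includes the user's face image data, emotion recognition may be performed on the face image to obtain the emotion type; if the emotion type is joy, it may be determined that the at least one user has the target vocalization intention (e.g., a singing intention). When the intent judgment data includes a sound signal produced by the user, the sound signal may be recognized; if the recognition result indicates that the user is humming, it may be determined that the user has the target vocalization intention.
Step 203: Determine, based on the target vocalization intention, feature information representing the current features of the at least one user.
In this embodiment, the electronic device may determine feature information representing the current features of the at least one user. The user's current features may include, but are not limited to, at least one of the following: the user's emotion, the number of users, the user's listening habits, and the like. The electronic device may determine the feature information in a manner corresponding to each of these features. For example, a face image of the user captured by a camera may be acquired, and emotion recognition may be performed on the face image to obtain feature information representing the user's current emotion. As another example, the user's historical playback records may be acquired, and the type of audio the user habitually listens to may be determined from the records and used as the feature information.
Step 204: Extract, from the preset audio library, and play the audio corresponding to the feature information.
In this embodiment, the electronic device may extract, from the preset audio library, and play the audio corresponding to the feature information. The preset audio library may be arranged in the electronic device itself or in another electronic device communicatively connected to it. The feature information corresponds to an audio type; the electronic device may determine the type of audio to play according to the feature information and select audio of that type (for example, by play count or at random) for playback.
As an example, when the feature information indicates that the user's current emotion is joy, audio marked as the joy type may be extracted from the preset audio library and played. When the feature information indicates that the user habitually listens to rock music, rock-type audio may be extracted from the preset audio library and played.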
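The selection in step 204 can be sketched as follows. This is only an illustration: the preset audio library is modeled as an in-memory list of records, and the field names (title, type, play_count), the sample entries, and the function name are hypothetical, not taken from the disclosure.

```python
import random

# Hypothetical stand-in for the preset audio library.
AUDIO_LIBRARY = [
    {"title": "Song A", "type": "joy", "play_count": 120},
    {"title": "Song B", "type": "joy", "play_count": 450},
    {"title": "Song C", "type": "rock", "play_count": 80},
]


def select_audio(library, audio_type, strategy="play_count", rng=None):
    """Pick one audio item of the given type, by play count or at random."""
    candidates = [item for item in library if item["type"] == audio_type]
    if not candidates:
        return None  # nothing of this type in the library
    if strategy == "play_count":
        return max(candidates, key=lambda item: item["play_count"])
    if strategy == "random":
        return (rng or random).choice(candidates)
    raise ValueError(f"unknown strategy: {strategy}")


most_played_joy = select_audio(AUDIO_LIBRARY, "joy")
```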
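The method provided by the above embodiment of the present disclosure collects intent judgment data for at least one user in the target space, determines the user's target vocalization intention from the intent judgment data, determines feature information according to the target vocalization intention, and finally extracts the audio corresponding to the feature information from the preset audio library and plays it. The electronic device thus determines the user's target vocalization intention on its own initiative, without the user triggering playback, and automatically plays the audio when it determines that the user has a vocalization intention, which reduces the steps the user performs to play audio and makes audio playback more convenient. In addition, by determining the user's current features, the played audio is adapted to those features, so that the audio the user wants to listen to is played more precisely and automatic playback is better targeted.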
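In some optional implementations, in step 202 above, the target vocalization intention of the at least one user may be determined in any of the following ways:
Way 1: In response to determining that the intent judgment data includes a face image of at least one user, input the face image into a pre-trained third emotion recognition model to obtain emotion category information; if the emotion category information is preset emotion type information, determine that the at least one user has the target vocalization intention.
The third emotion recognition model may be obtained in advance by training a preset initial model with a preset set of training samples. Each training sample in the set may include a sample face image and the corresponding emotion category information. The electronic device may take the sample face image as the input of the initial model (for example, one including a convolutional neural network, a classifier, and the like), take the emotion category information corresponding to the input sample face image as the expected output of the initial model, and train the initial model to obtain the third emotion recognition model.
The preset emotion represented by the preset emotion type information may be any of various emotions such as excitement, joy, or sadness. When the emotion type information output by the third emotion recognition model indicates that the user's emotion is the preset emotion, it is determined that the at least one user has the target vocalization intention. For example, when the emotion type information indicates that the user is excited, the user may want to sing to express that mood, and it is determined that the user has a singing intention.
Way 2: In response to determining that the intent judgment data includes voice information of at least one user, perform speech recognition on the voice information to obtain a speech recognition result; if the speech recognition result indicates that the at least one user instructs audio to be played, determine that the at least one user has the target vocalization intention.
Methods of performing speech recognition on voice information are prior art and are not described here. As an example, when a user is recognized as saying "This song is nice, I want to sing it", it is determined that the at least one user has the target vocalization intention (i.e., a singing intention).
Way 3: In response to determining that the intent judgment data includes voice information of at least one user, perform melody recognition on the voice information to obtain a melody recognition result; if the melody recognition result indicates that the at least one user is vocalizing in a target form, determine that the at least one user has the target vocalization intention.
The target form of vocalization corresponds to the target vocalization intention. For example, the target form of vocalization may include singing, recitation, humming, and the like. Methods of performing melody recognition on voice information are prior art and usually proceed as follows: the human voice input to a melody recognition model undergoes melody extraction through note segmentation and pitch extraction, and a note sequence is obtained from the melody extraction. The electronic device then matches the note sequence output by the melody recognition model against the note sequences of the audio in the audio library; if the similarity between the output note sequence and the note sequence of some audio is greater than a preset similarity threshold, the user is singing (i.e., vocalizing in the target form), and it is determined that the at least one user has the target vocalization intention.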
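The note-sequence matching at the end of Way 3 can be sketched as follows. The sketch assumes that pitch extraction has already produced one fundamental frequency per note, converts each frequency to the nearest MIDI note number using the standard formula 69 + 12 * log2(f / 440), and uses the fraction of aligned equal notes as a deliberately simple similarity. The library contents, the threshold, and the function names are hypothetical.

```python
import math


def hz_to_midi(freq_hz):
    """Nearest MIDI note number for a frequency (A4 = 440 Hz = note 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


def note_similarity(notes_a, notes_b):
    """Fraction of aligned positions holding the same note, over the longer length."""
    if not notes_a or not notes_b:
        return 0.0
    same = sum(1 for x, y in zip(notes_a, notes_b) if x == y)
    return same / max(len(notes_a), len(notes_b))


def is_singing(pitches_hz, library_notes, threshold=0.8):
    """True if the user's note sequence matches some audio in the library."""
    user_notes = [hz_to_midi(f) for f in pitches_hz]
    return any(note_similarity(user_notes, notes) > threshold
               for notes in library_notes.values())


library_notes = {"song": [60, 62, 64, 65]}  # hypothetical note sequence (C4 D4 E4 F4)
```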
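This implementation provides multiple ways of determining the user's target vocalization intention, so that the intention is detected comprehensively through multiple modalities such as emotion recognition, speech recognition, and melody recognition. Detection accuracy is higher, and audio can subsequently be played for the user based on the target vocalization intention without manual operation, which makes audio playback more convenient.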
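One way the three ways could be combined is sketched below. It assumes that the recognition models have already produced their outputs (an emotion label, a recognized text string, and a flag from melody recognition) and treats the intention as present if any available modality indicates it. The preset emotions, the trigger phrases, and the function names are hypothetical placeholders; keyword matching stands in for whatever speech understanding an implementation would use.

```python
PRESET_EMOTIONS = {"excited", "joy", "sad"}      # hypothetical preset emotion types
PLAY_PHRASES = ("want to sing", "play a song")   # hypothetical trigger phrases


def intent_from_emotion(emotion_category):
    """Way 1: the recognized emotion is one of the preset emotion types."""
    return emotion_category in PRESET_EMOTIONS


def intent_from_speech(recognized_text):
    """Way 2: the recognized speech asks for audio to be played."""
    text = recognized_text.lower()
    return any(phrase in text for phrase in PLAY_PHRASES)


def has_target_intent(emotion_category=None, recognized_text=None, melody_detected=None):
    """Return True if any modality that is present indicates the intention."""
    checks = []
    if emotion_category is not None:
        checks.append(intent_from_emotion(emotion_category))
    if recognized_text is not None:
        checks.append(intent_from_speech(recognized_text))
    if melody_detected is not None:
        checks.append(bool(melody_detected))  # Way 3: result of melody recognition
    return any(checks)
```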
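In some optional implementations, in step 203, the feature information may be determined in at least one of the following ways:
Way 1: Acquire historical audio playback records for the at least one user; determine listening habit information of the at least one user based on the historical audio playback records; determine the feature information based on the listening habit information.
The electronic device may acquire the historical audio playback records locally or remotely. The listening habit information represents features such as the types of audio the user often listens to and the listening times. For example, the audio type listened to most often may be determined from the historical audio playback records and used as the listening habit information. Generally, the listening habit information may be included in the feature information.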
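Determining the most frequently played audio type can be sketched with a counter. The record format (a list of dictionaries with a type field) and the function name are assumptions made for illustration only.

```python
from collections import Counter


def listening_habit(history):
    """Return the audio type that appears most often in the playback history."""
    counts = Counter(record["type"] for record in history)
    if not counts:
        return None  # no history, so no habit can be derived
    return counts.most_common(1)[0][0]


history = [
    {"title": "Song C", "type": "rock"},
    {"title": "Song D", "type": "pop"},
    {"title": "Song E", "type": "rock"},
]
habit = listening_habit(history)
```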
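Manner 2: acquire a face image of the at least one user and input the face image into a pre-trained fourth emotion recognition model to obtain emotion category information representing the current emotion of the at least one user; determine the feature information based on the emotion category information.
The fourth emotion recognition model may be a neural network model for classifying emotions from face images. It may be the same as or different from the third emotion recognition model described in the optional implementation above, but it is trained in substantially the same way as the third emotion recognition model, so the training is not described again here. Typically, the emotion category information may be used as information included in the feature information.
Manner 3: acquire an environment image of the environment in which the at least one user is located and input the environment image into a pre-trained environment recognition model to obtain environment type information; determine the feature information based on the environment type information.
The environment image may be captured by a camera photographing the environment outside the target space. The environment recognition model may be a neural network model for classifying environment images; the electronic device may obtain it in advance by training a preset initial model on a preset training sample set. A training sample in the set may include a sample environment image and the corresponding environment type information. The electronic device may use the sample environment image as the input of the initial model (for example, one including a convolutional neural network, a classifier, and the like), use the environment type information corresponding to the input sample environment image as the expected output of the initial model, and train the initial model to obtain the environment recognition model.
The environment type information represents the type of environment in which the at least one user is located. As an example, the environment type may be a location type such as suburb, highway, or countryside, or a weather type such as sunny, rainy, or snowy. Typically, the environment type information may be used as information included in the feature information.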
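Manner 4: acquire an in-space image obtained by photographing the target space; determine the number of people in the target space based on the in-space image; determine the feature information based on the number of people.
The in-space image may be an image captured by a camera installed in the target space, and there may be one or more in-space images. The electronic device may use an existing object detection method to identify the people in each in-space image and count them. Typically, the number of people may be used as information included in the feature information.
By providing the above four manners of determining the user's feature information, this implementation can detect the user's current state comprehensively and obtain more complete feature information. This in turn helps extract audio of interest to the user in a more targeted way and improves the precision of audio playback for the user.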
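In some optional implementations, based on the above four manners of determining the feature information, step 204 may be performed as follows:
In response to determining that the feature information includes listening habit information, extract and play audio corresponding to the listening habit.
In response to determining that the feature information includes emotion category information, extract and play audio corresponding to the emotion category information.
In response to determining that the feature information includes environment type information, extract and play audio corresponding to the environment type information.
In response to determining that the feature information includes the number of people, extract and play audio corresponding to the number of people.
As an example, if the listening habit information indicates that the user likes rock music, rock-type audio may be extracted and played. If the emotion category information indicates that the user's current emotion is happiness, fast-tempo audio may be extracted and played. If the environment type information indicates that the user is currently in open country, audio with a soothing tempo may be extracted and played. If the determined number of users is two or more, chorus-type audio may be extracted and played.
It should be noted that when the feature information includes at least two of the listening habit information, the emotion category information, the environment type information, and the number of people, the intersection of the audio contained in the audio types corresponding to each kind of information may be taken as the audio to be played.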
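The intersection rule in the preceding paragraph can be illustrated as below. The library layout (a mapping from audio type to a set of titles) and the function name `candidate_audio` are assumptions for this sketch; how each kind of feature information maps to an audio type is left to the caller.

```python
def candidate_audio(library, audio_types):
    """Return the titles that belong to every one of the given audio types.

    library maps an audio type to the set of titles of that type;
    audio_types lists the types derived from each kind of feature information.
    """
    title_sets = [library.get(audio_type, set()) for audio_type in audio_types]
    if not title_sets:
        return set()
    # Only titles present under all requested types survive the intersection.
    return set.intersection(*title_sets)
```

In practice the intersection may be empty (for example, no rock track is also a chorus track), so a real system would need some fallback, such as relaxing one of the constraints. The disclosure does not specify one.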
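Because this implementation uses feature information that represents the user's features comprehensively, the extracted audio is more appealing to the user, which improves the precision of audio playback for the user.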
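With further reference to FIG. 3, a schematic flowchart of yet another embodiment of the audio playback method is shown. As shown in FIG. 3, on the basis of the embodiment shown in FIG. 2, the following steps may further be included after step 204:
Step 205: extract user audio information from the current mixed sound signal.
The mixed sound signal may be a signal collected by the information collection device 104 (that is, a microphone) shown in FIG. 1 and installed in the target space. User audio information is the sound produced by one user. The sound signal collected by a microphone usually includes a noise signal, or includes sound signals produced by at least two users at the same time; in that case the collected sound signal is a mixed sound signal. In other words, the mixed sound signal may include a noise signal, or sound information produced by users, or both. In this embodiment, an existing speech separation method (for example, blind source separation (BSS) or auditory scene analysis (ASA)) may be used to extract the user audio information corresponding to each user from the mixed sound signal.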
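Step 206: play the user audio information if it meets a preset condition.
Specifically, the electronic device may analyze the extracted user audio information and play it if it meets the preset condition. As an example, if the electronic device recognizes that the user audio information indicates that the user is singing, it plays the user audio information through the speaker at amplified volume. Alternatively, if the electronic device recognizes that the melody of the sound produced by the user matches the currently playing audio, it plays the user audio information.
Steps 205 and 206 are usually performed while the audio described in step 204 is being played. For example, the played audio may be music; while the music is playing, user audio information is extracted in real time from the mixed sound signal currently produced by the at least one user, and if the user audio information matches the music being played, the user audio information is played, so that the user can sing along with the music.
Optionally, an existing feedback cancellation method may be used to filter out the sound signal from the speaker that the microphone picks up, thereby reducing the interference of feedback with the playback of the user audio information.
In the method provided by the embodiment corresponding to FIG. 3, user audio information is extracted from the mixed sound signal and played, so that it can be played together with the audio extracted from the preset audio library. No dedicated microphone needs to be provided for each user's voice: the microphone that collects the mixed sound of the users in the target space is sufficient to extract each user's voice from the mixed sound signal and play it together with the currently playing audio. This simplifies the hardware needed to play user audio information and makes it more convenient for the user to realize the target vocalization intent. In addition, playing only user audio information that meets the preset condition avoids the interference that would result from playing back content such as users' conversation.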
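With further reference to FIG. 4, a schematic flowchart of yet another embodiment of the audio playback method is shown. As shown in FIG. 4, on the basis of the embodiment shown in FIG. 3, step 205 further includes the following steps:
Step 2051: acquire initial audio information collected by an audio collection device installed in the target space. The initial audio information may include the mixed sound signal.
The audio collection device is a device included in the information collection device 104 shown in FIG. 1. There may be one or more audio collection devices, and the number of channels of initial audio information equals the number of audio collection devices; that is, each audio collection device collects one channel of initial audio information. As an example, when the target space is the interior of a vehicle, the number of audio collection devices may match the number of seats in the vehicle, with one audio collection device installed near each seat.
Step 2052: perform voice separation on the initial audio information to obtain at least one channel of user audio information.
Each channel of the at least one channel of user audio information corresponds to one user. Specifically, the electronic device may use an existing speech separation method to extract the user audio information corresponding to each user from the initial audio information. As an example, a blind source separation algorithm may be used to separate at least one channel of user audio information from the initial audio information. Alternatively, when there are two or more audio collection devices, an existing microphone-array-based speech separation algorithm may be used to separate at least one channel of user audio information from the initial audio information collected by the audio collection devices.
In the method provided by the embodiment corresponding to FIG. 4, voice separation is performed on the initial audio information to obtain at least one channel of user audio information. The user audio information of each of multiple users can thus be collected in real time while audio is playing, and each channel is free of interference from other users' voices, so that the user audio information played subsequently reflects each user's voice clearly and the quality of playing multiple users' voices is improved.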
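In some optional implementations, based on steps 2051 and 2052 above, step 206 in the embodiment corresponding to FIG. 3 may be performed as follows:
Adjust the volume of each channel of the at least one channel of user audio information to a target volume, synthesize the volume-adjusted user audio information, and play the synthesized user audio information. The target volume for each channel may be the same or different. For example, the volume of the loudest channel may be taken as the target volume and the volume of every other channel adjusted to it; alternatively, a fixed volume may be set as the target volume and every channel set to that same target volume. Further, the channels of user audio information may be synthesized into stereo for playback or into a single channel for playback.
Adjusting the volume of each channel of user audio information and synthesizing the channels before playback makes the volumes of the played user audio information consistent, or brings each to its own set volume, and prevents a user who vocalizes quietly from being played back too quietly.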
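A minimal sketch of the "scale every channel to the loudest one, then mix to a single channel" option follows. It assumes each channel is a list of float samples and measures volume as root-mean-square (RMS) amplitude; both choices, and all function names, are assumptions for illustration rather than details from the disclosure.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a list of samples (0.0 for an empty list)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def scale_to(samples, target_rms):
    """Scale samples so that their RMS equals target_rms; silence is left as is."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)
    gain = target_rms / current
    return [s * gain for s in samples]


def mix_to_mono(channels):
    """Bring every channel to the loudest channel's RMS, then average them."""
    if not channels:
        return []
    target = max(rms(channel) for channel in channels)
    scaled = [scale_to(channel, target) for channel in channels]
    length = min(len(channel) for channel in scaled)
    return [sum(ch[i] for ch in scaled) / len(scaled) for i in range(length)]
```

Averaging rather than summing keeps the mixed signal within the original amplitude range; a fixed target volume could be used instead by passing a constant to `scale_to`.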
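In some optional implementations, based on the embodiment corresponding to FIG. 3, step 206 may play the user audio information in at least one of the following manners:
Manner 1: perform melody recognition on the user audio information to obtain user melody information; match the user melody information against the melody information of the currently playing audio, and play the user audio information based on the resulting first matching result.
Methods for performing melody recognition on user audio information are existing technology and typically proceed as follows: melody extraction is performed on the user audio information input to a melody recognition model by means of note segmentation and fundamental-tone extraction, and the resulting note sequence is taken as the melody information. The electronic device then computes the similarity between the melody information output by the melody recognition model and the melody information of the currently playing audio. If the similarity (that is, the first matching result) is greater than or equal to a preset first similarity threshold, the first matching result is determined to meet the preset condition and the user audio information may be played.
Manner 2: perform speech recognition on the user audio information to obtain a speech recognition result; match the speech recognition result against the text information corresponding to the currently playing audio, and play the user audio information based on the resulting second matching result.
The speech recognition result may be text information. Methods for performing speech recognition on user audio information are existing technology and are not described here. The text information corresponding to the currently playing audio is text information associated with the audio in advance. For example, if the currently playing audio is a song, the corresponding text information may be its lyrics; if the currently playing audio is a poetry reading, the corresponding text information is the original text of the poem being read. The electronic device may compute the similarity between the speech recognition result and the corresponding text information. If the similarity (that is, the second matching result) is greater than or equal to a preset second similarity threshold, the second matching result is determined to meet the preset condition and the user audio information may be played.
It should be understood that the electronic device may perform either Manner 1 or Manner 2 to play the user audio information. It may also perform both, in which case the user audio information is played if, based on the first matching result and the second matching result, both manners determine that it may be played. It should also be noted that when there is more than one channel of user audio information, Manner 1 and/or Manner 2 may be performed on each channel.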
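The gating logic of Manner 1 and Manner 2 might be sketched as below. The text similarity uses the standard-library `difflib.SequenceMatcher` as a stand-in for whatever measure a real system would use; the function names, the default thresholds of 0.8, and the convention that an omitted score means "that manner was not run" are all assumptions for this example.

```python
from difflib import SequenceMatcher


def text_similarity(recognized_text, reference_text):
    """Similarity in [0, 1] between recognized speech and the reference text."""
    return SequenceMatcher(None, recognized_text, reference_text).ratio()


def should_play(melody_score=None, text_score=None,
                melody_threshold=0.8, text_threshold=0.8):
    """Decide whether to play the user audio.

    Pass only the score(s) of the manner(s) actually performed. When both
    are given, both must reach their thresholds, as described above.
    """
    checks = []
    if melody_score is not None:
        checks.append(melody_score >= melody_threshold)
    if text_score is not None:
        checks.append(text_score >= text_threshold)
    # With no scores at all there is no evidence, so do not play.
    return bool(checks) and all(checks)
```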
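By performing melody recognition and/or speech recognition on the user audio information, this implementation plays the user audio information only when it meets certain conditions. This avoids playing user audio information that is unrelated to the currently playing audio, makes the played user audio information match the currently playing audio more closely, and thereby improves the quality of the played user audio information.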
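In some optional implementations, based on the method of the embodiment corresponding to FIG. 3, step 206 further includes:
First, determine the pitch of the user audio information. Methods for determining the pitch of user audio information are existing technology and are not described here.
Then, perform at least one of the following steps:
Step 1: adjust the pitch of the currently playing audio to a target pitch that matches the pitch of the user audio information.
Specifically, the pitch of the currently playing audio may be compared with the pitch of the user audio information. If the difference between the two falls outside a preset difference range, the pitch of the currently playing audio is adjusted so that its difference from the pitch of the user audio information falls within the preset difference range.
As an example, when the user audio information is audio of the user singing and the currently playing audio is the music of a song, if the pitch of the user audio information is determined to be higher or lower than that of the music being played, the pitch of the music may be adjusted dynamically to suit the pitch at which the user is singing. That is, the difficulty of singing along with the played music is adjusted so that the music suits the user better.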
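One way to express the "difference range" check of Step 1 is in semitones, using the standard relation that two frequencies f1 and f2 are 12·log2(f1/f2) semitones apart. The sketch below assumes that both pitches are available as fundamental frequencies in hertz and that the preset difference range is a symmetric tolerance in semitones; the function names and the default tolerance are hypothetical.

```python
import math


def semitone_gap(user_hz, audio_hz):
    """Signed distance in semitones from the audio's pitch to the user's pitch."""
    return 12.0 * math.log2(user_hz / audio_hz)


def pitch_shift_needed(user_hz, audio_hz, tolerance=1.0):
    """Return the whole-semitone shift to apply to the playing audio.

    Returns 0 when the gap is already within the tolerance. Otherwise the
    gap is rounded to the nearest semitone, which leaves a residual of at
    most half a semitone (inside any tolerance of 0.5 or more).
    """
    gap = semitone_gap(user_hz, audio_hz)
    if abs(gap) <= tolerance:
        return 0
    return round(gap)
```

For instance, a user singing at 220 Hz over accompaniment centered on 440 Hz would call for shifting the accompaniment down by 12 semitones (one octave).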
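Step 2: output recommendation information that recommends audio corresponding to the pitch of the user audio information.
Audio corresponding to the pitch of the user audio information may be audio whose pitch differs from the pitch of the user audio information by an amount within the preset difference range. The recommendation information may be output as a prompt tone, displayed text, an image, or the like. After the recommendation information is output, the user may choose whether to play the recommended audio, so that the pitch of the newly played audio matches the user's pitch.
By determining the pitch of the user audio information and adjusting the played audio based on that pitch, this implementation automatically adapts the pitch of the played audio to the user's pitch, so that the user audio information sounds better when played. The user also does not need to adjust the pitch of the played audio actively, for example by hand or by voice control, which makes adjusting the audio more convenient.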
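With further reference to FIG. 5, a schematic flowchart of yet another embodiment of the audio playback method is shown. As shown in FIG. 5, on the basis of the embodiment shown in FIG. 3, the following steps may further be included after step 206:
Step 207: determine, from among the at least one user, the target user corresponding to the user audio information, and acquire a face image of the target user.
The face image may be an image captured by a camera that is included in the information collection device 104 of FIG. 1 and installed in the target space. Specifically, when extracting the user audio information from the mixed sound signal, the electronic device may use an existing speech separation method to determine the position of the sound source corresponding to the user audio information (for example, using an existing microphone-array-based multi-zone speech separation method to determine which position in the target space the user audio information corresponds to). The position of the sound source is the position of the user, and the user's position can be determined from images captured of the user, so the face image of the user corresponding to the user audio information can be obtained.
Step 208: input the face image of each of the at least one user into a pre-trained first emotion recognition model to obtain emotion category information corresponding to each of the at least one user. In other words, in this step the face image of the target user corresponding to the user audio information is input into the pre-trained first emotion recognition model, and emotion category information corresponding to the target user is obtained.
The first emotion recognition model may be the same as or different from at least one of the third emotion recognition model and the fourth emotion recognition model described in the optional implementations above, but it is trained in substantially the same way as at least one of them, so the training is not described again here.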
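Step 209: based on the emotion category information, determine a first score representing the degree of match between the emotion of the at least one user and the type of the currently playing audio. If the emotion category information in this step is that of the target user, the first score represents the degree of match between the target user's emotion and the type of the currently playing audio.
The first score may be obtained from the probability value that the first emotion recognition model computes for the emotion category information it outputs. Typically, the first emotion recognition model classifies the input face image and obtains multiple items of emotion category information together with a probability value for each; the emotion category information with the highest probability value may be determined as the emotion category information of the face image recognized this time.
If the face image recognized this time has a single item of emotion category information, the first score may be determined from the probability corresponding to that item. If it has multiple items of emotion category information, the item that matches the type of the currently playing audio may be selected from among them as the target emotion category information, and the first score is then determined from the probability corresponding to the target emotion category information. The larger the first score, the better the match with the currently playing audio. The correspondence between the type of the currently playing audio and the emotion category information may be set in advance. For example, if the currently playing audio is labeled "cheerful", the first score may be obtained from the probability that the model outputs for the emotion category information representing a cheerful emotion.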
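Deriving the first score from the model's class probabilities can be sketched as follows. The probability dictionary stands in for the output of the first emotion recognition model, and the preset correspondence between audio types and emotion categories is a plain mapping; both layouts and the name `first_score` are assumptions for illustration.

```python
def first_score(probabilities, audio_type, type_to_emotion):
    """Return the probability of the emotion that matches the playing audio's type.

    probabilities maps each emotion category to the model's probability for it;
    type_to_emotion is the preset correspondence from audio type to emotion.
    Returns 0.0 when no correspondence is configured for the audio type.
    """
    target_emotion = type_to_emotion.get(audio_type)
    if target_emotion is None:
        return 0.0
    return probabilities.get(target_emotion, 0.0)
```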
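Step 210: based on the first score, determine and output a score for the user audio information.
Specifically, the score for the user audio information may be output in various ways, for example by displaying it on a screen or by outputting a spoken score through the speaker. The score for the user audio information may be determined in several ways; as an example, the first score may be taken as the score for the user audio information.
Alternatively, step 209 may be performed as follows: based on the user audio information, determine a second score representing the degree of match between the user audio information and the currently playing audio.
Step 210 may then be performed as follows: based on the second score, determine and output a score for the user audio information.
The second score may be determined using an existing method for scoring user audio information. For example, when the user audio information indicates that the user is singing, the second score may be determined using an existing singing scoring method. Further, the second score may be taken as the score for the user audio information.
Optionally, step 210 may also be performed as follows: based on the first score and the second score, determine and output a score for the user audio information.
For example, a weighted sum of the first score and the second score may be computed using a preset weight for each, yielding the score for the user audio information.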
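The weighted combination is a one-line computation; the sketch below only fixes names and default weights, both of which are assumptions (the disclosure says only that the weights are preset).

```python
def combined_score(first, second, w_first=0.5, w_second=0.5):
    """Weighted sum of the emotion-based first score and the audio-based second score."""
    return w_first * first + w_second * second
```

If both input scores lie in the same range and the weights sum to 1, the combined score stays in that range, which keeps the displayed score easy to interpret.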
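The method provided by the embodiment corresponding to FIG. 5 determines the score for the user audio information based on face image recognition and/or audio scoring, so that the score fully reflects how well the user audio information matches the played audio and the user audio information is scored more accurately.
In some optional implementations, step 208 may be performed as follows:
Input the face image of each of the at least one user into the first emotion recognition model to obtain a first emotion category information sequence corresponding to each of the at least one user. Each item of emotion category information in the first emotion category information sequence corresponds to one face image subsequence. In this embodiment, there are at least two face images of the user; that is, what is input into the first emotion recognition model is a face image sequence of the user. Typically, a user's face image sequence may consist of the face images contained in a video captured of that user's face. The emotion category information sequence may be represented as a vector in which each value corresponds to one face image subsequence and denotes an emotion category. Each face image subsequence may include at least one face image. As an example, if the currently playing audio lasts 3 minutes and the user's face is filmed for those 3 minutes, the 3-minute face image sequence may be divided into 100 face image subsequences, and each subsequence is input into the first emotion recognition model in turn, yielding a vector of 100 values as the emotion category information sequence.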
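Dividing a face image sequence into a fixed number of subsequences, as in the 100-subsequence example, could be done as in this sketch. The disclosure does not say how the split is made; contiguous, near-equal chunks are an assumption, as is the name `split_into_subsequences`. Frames are treated as opaque list items.

```python
def split_into_subsequences(frames, count):
    """Split frames into up to `count` contiguous, near-equal, non-empty chunks."""
    if count <= 0:
        raise ValueError("count must be positive")
    if not frames:
        return []
    # Never create more chunks than frames, so every chunk holds at least one frame.
    count = min(count, len(frames))
    size, extra = divmod(len(frames), count)
    chunks, start = [], 0
    for index in range(count):
        # The first `extra` chunks take one additional frame each.
        end = start + size + (1 if index < extra else 0)
        chunks.append(frames[start:end])
        start = end
    return chunks
```

Each chunk would then be passed to the emotion recognition model, and the per-chunk outputs collected in order to form the emotion category information sequence.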
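Based on the first emotion category information sequence, and as shown in FIG. 6, the first score may be determined in step 209 by the following steps:
Step 2091: acquire the video corresponding to the currently playing audio and extract a face image sequence of a target person from the video.
The target person may be a person related to the currently playing audio. For example, if the currently playing audio is a song, the corresponding video may be a video containing images of the song's singer, and the target person may be the singer or a person performing along with the song. The target person may be set manually in advance or obtained by the electronic device through recognition on the video, for example by using an existing mouth movement recognition method to identify the person whose mouth movement frequency matches the rhythm of the song as the target person.
The electronic device may use an existing face image detection method to extract the face image sequence of the target person, as set in advance or as recognized, from the image frames of the video.
Step 2092: input the face image sequence into the first emotion recognition model to obtain a second emotion category information sequence.
This step is substantially the same as the step of determining the first emotion category information sequence described above and is not described again here.
Step 2093: determine the similarity between the first emotion category information sequence and the second emotion category information sequence.
The first emotion category information sequence and the second emotion category information sequence may both be in vector form. The electronic device may determine the distance between the vectors and determine the similarity based on the distance (for example, taking the reciprocal of the distance as the similarity).
Step 2094: determine the first score based on the similarity.
As an example, the similarity may be taken as the first score, or the similarity may be scaled by a preset ratio to obtain the first score.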
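Steps 2093 and 2094 can be sketched as below, with two deliberate departures from the text that should be read as assumptions. First, because the sequence entries are category labels, the distance used here is the Hamming distance (the number of positions where the labels differ) rather than a numeric vector distance. Second, the similarity is 1 / (1 + distance) instead of the plain reciprocal, so that two identical sequences (distance 0) do not cause a division by zero.

```python
def sequence_similarity(user_sequence, reference_sequence):
    """Similarity in (0, 1] between two equal-length emotion category sequences."""
    if len(user_sequence) != len(reference_sequence):
        raise ValueError("sequences must have the same length")
    # Hamming distance: count the positions whose category labels differ.
    distance = sum(1 for a, b in zip(user_sequence, reference_sequence) if a != b)
    return 1.0 / (1.0 + distance)


def score_from_similarity(similarity, scale=100.0):
    """Scale the similarity by a preset ratio to obtain the score."""
    return similarity * scale
```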
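By comparing the user's first emotion category information sequence with the second emotion category information sequence of the target person in the original video, this implementation can accurately determine how closely the user's emotion agrees with the emotion of the original video. The resulting first score reflects more accurately how well the user's emotion agrees with the currently playing audio, which improves the accuracy of scoring the user audio information.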
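With further reference to FIG. 7, a schematic flowchart of yet another embodiment of the audio playback method is shown. As shown in FIG. 7, on the basis of the embodiment shown in FIG. 3, the following steps may further be included after step 206:
Step 211: determine, from among the at least one user, the target user corresponding to the user audio information, and acquire a face image of the target user.
This step is substantially the same as step 207 above and is not described again here.
Step 212: input the face image of the target user corresponding to the user audio information, together with the user audio information, into a pre-trained second emotion recognition model to obtain emotion category information.
The second emotion recognition model in this step differs from the first, third, and fourth emotion recognition models described above. The second emotion recognition model can receive both an image and audio as input, analyze them jointly, and output emotion category information. It may be obtained in advance by training a preset initial model on a preset training sample set. A training sample in the set may include a sample face image, sample audio information, and the corresponding emotion category information. The electronic device may use the sample face image and the sample audio information as the input of the initial model (for example, one including a neural network, a classifier, and the like), use the emotion category information corresponding to the input sample face image and sample audio information as the expected output of the initial model, and train the initial model to obtain the second emotion recognition model. Typically, the neural network included in the initial model determines feature information of the input sample face image and sample audio information, and the classifier classifies that feature information; the actual output is compared with the expected output, and the parameters of the initial model are adjusted so that the gap between the actual output and the expected output gradually shrinks until convergence, yielding the second emotion recognition model.
Step 213: based on the emotion category information, determine and output a score representing the degree of match between the emotion of the target user corresponding to the user audio information and the type of the currently playing audio.
The score may be obtained from the probability value that the second emotion recognition model computes for the emotion category information it outputs. The method of determining the score from the probability value is substantially the same as the method of determining the first score in step 209 above and is not described again here.
In the method provided by the embodiment corresponding to FIG. 7, the face image and the user audio information are input into the second emotion recognition model together and the score is obtained directly, without scoring the face image and the user audio information separately. This simplifies the scoring steps and improves scoring efficiency. Because the second emotion recognition model classifies on the basis of the combined features of the input face image and user audio information, the score accurately reflects how well the user's voice matches the played audio.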
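In some optional implementations, step 212 may be performed as follows:
Input the face image of the user corresponding to the user audio information, together with the user audio information, into the second emotion recognition model to obtain a third emotion category information sequence. Each item of emotion category information in the third emotion category information sequence corresponds to one face image subsequence. The third emotion category information sequence is defined in substantially the same way as the first emotion category information sequence above and is not described again here.
On this basis, as shown in FIG. 8, step 213 may be performed as follows:
Step 2131: acquire the video corresponding to the currently playing audio and extract a face image sequence of a target person from the video.
This step is substantially the same as step 2091 above and is not described again here.
Step 2132: input the face image sequence and the currently playing audio into the second emotion recognition model to obtain a fourth emotion category information sequence.
This step is substantially the same as the step of determining the third emotion category information sequence described above and is not described again here.
Step 2133: determine the similarity between the third emotion category information sequence and the fourth emotion category information sequence.
The third emotion category information sequence and the fourth emotion category information sequence may both be in vector form. The electronic device may determine the distance between the vectors and determine the similarity based on the distance (for example, taking the reciprocal of the distance as the similarity).
Step 2134: based on the similarity, determine a score representing the degree of match between the emotion of the user corresponding to the user audio information and the type of the currently playing audio.
As an example, the similarity may be taken as the score, or the similarity may be scaled by a preset ratio to obtain the score.
Because the third and fourth emotion category information sequences in this implementation are obtained from both face images and audio information, the emotion classification combines image and audio, and the two sequences therefore represent emotion more accurately. The score determined from the similarity between the two sequences consequently represents more accurately how closely the user's emotion agrees with the emotion of the original video, further improving the accuracy of scoring the user audio information.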
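Exemplary Apparatus
FIG. 9 is a schematic structural diagram of an audio playback apparatus provided by an exemplary embodiment of the present disclosure. This embodiment may be applied to an electronic device. As shown in FIG. 9, the audio playback apparatus includes: an acquisition module 901, configured to acquire intent determination data collected for at least one user in a target space; a first determination module 902, configured to determine, based on the intent determination data, the target vocalization intent of the at least one user; a second determination module 903, configured to determine, based on the target vocalization intent, feature information representing the current features of the at least one user; and a first playback module 904, configured to extract audio corresponding to the feature information from a preset audio library and play it.
In this embodiment, the acquisition module 901 may acquire intent determination data collected for at least one user in a target space. The target space (for example, the space 105 in FIG. 1) may be any of various spaces, such as the interior of a vehicle or the interior of a room. The intent determination data may be any information used to determine the user's intent, including but not limited to at least one of the following: a face image of the user, speech produced by the user, and the like.
In this embodiment, the first determination module 902 may determine, based on the intent determination data, the target vocalization intent of the at least one user. The type of vocalization represented by the target vocalization intent may be set in advance. For example, the target vocalization intent may include but is not limited to at least one of the following: a singing intent, a reciting intent, and the like. The first determination module 902 may select a corresponding way of determining the target vocalization intent according to the type of the intent determination data.
As an example, when the intent determination data includes a face image of the user, emotion recognition may be performed on the face image to obtain an emotion type; if the emotion type is joy, it may be determined that the at least one user has the target vocalization intent (for example, a singing intent). When the intent determination data includes a sound signal produced by the user, the sound signal may be recognized; if the recognition result indicates that the user is humming, it may be determined that the target vocalization intent is present.
In this embodiment, the second determination module 903 may determine feature information representing the current features of the at least one user. The user's current features may include but are not limited to at least one of the following: the user's emotion, the number of users, the user's listening habits, and the like. The second determination module 903 may determine the feature information in the way that corresponds to each of these features. For example, it may acquire a face image of the user captured by a camera and perform emotion recognition on the face image to obtain feature information representing the user's current emotion. As another example, it may acquire the user's historical playback record and determine from it the type of audio the user habitually listens to as the feature information.
In this embodiment, the first playback module 904 may extract audio corresponding to the feature information from the preset audio library and play it. The preset audio library may be provided in the electronic device itself or in another electronic device communicatively connected to it. The feature information corresponds to an audio type: the first playback module 904 may determine the type of audio to be played according to the feature information and select audio of that type (for example, by play count or at random) for playback.
As an example, when the feature information indicates that the user's current emotion is joy, audio labeled as the joy type may be extracted from the preset audio library and played. When the feature information indicates that the user habitually listens to rock music, rock-type audio may be extracted from the preset audio library and played.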
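The way modules 901 to 904 hand data to one another can be illustrated with a small skeleton. This is a structural sketch only: each module is modeled as a plain callable supplied by the caller, and the class name, field names, and `run_once` method are hypothetical rather than taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class AudioPlaybackApparatus:
    """Chains four callables standing in for modules 901, 902, 903 and 904."""

    acquire: Callable[[], Any]                        # acquisition module 901
    determine_intent: Callable[[Any], Optional[str]]  # first determination module 902
    determine_features: Callable[[str], dict]         # second determination module 903
    play: Callable[[dict], Any]                       # first playback module 904

    def run_once(self):
        """Run one pass; return the playback result, or None if no intent is found."""
        data = self.acquire()
        intent = self.determine_intent(data)
        if intent is None:
            # No target vocalization intent was detected, so nothing is played.
            return None
        features = self.determine_features(intent)
        return self.play(features)
```

Keeping the modules as injected callables makes it straightforward to swap, for example, a face-based intent detector for a melody-based one without touching the control flow.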
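Referring to FIG. 10, FIG. 10 is a schematic structural diagram of an audio playback apparatus provided by another exemplary embodiment of the present disclosure.
In some optional implementations, the apparatus further includes: an extraction module 905, configured to extract user audio information from the current mixed sound signal; and a second playback module 906, configured to play the user audio information if the user audio information meets a preset condition.
In some optional implementations, the apparatus further includes: a third determination module 907, configured to determine, from among the at least one user, the target user corresponding to the user audio information and acquire a face image of the target user; a first emotion recognition module 908, configured to input the face image of the target user corresponding to the user audio information into a pre-trained first emotion recognition model to obtain emotion category information corresponding to the target user; a fourth determination module 909, configured to determine, based on the emotion category information, a first score representing the degree of match between the emotion of the target user corresponding to the user audio information and the type of the currently playing audio; and/or a fifth determination module 910, configured to determine, based on the user audio information, a second score representing the degree of match between the user audio information and the currently playing audio; and a sixth determination module 911, configured to determine and output a score for the user audio information based on the first score and/or the second score.
In some optional implementations, the first emotion recognition module 908 includes: a first emotion recognition unit 9081, configured to input the face image of each of the at least one user into the first emotion recognition model to obtain a first emotion category information sequence corresponding to each of the at least one user, where each item of emotion category information in the first emotion category information sequence corresponds to one face image subsequence; a first determination unit 9082, configured to determine, based on the emotion category information, a first score representing the degree of match between the emotion of the at least one user and the type of the currently playing audio, including: a first acquisition unit 9083, configured to acquire the video corresponding to the currently playing audio and extract a face image sequence of a target person from the video; a second emotion recognition unit 9084, configured to input the face image sequence into the first emotion recognition model to obtain a second emotion category information sequence; a second determination unit 9085, configured to determine the similarity between the first emotion category information sequence and the second emotion category information sequence; and a third determination unit 9086, configured to determine the first score based on the similarity.
In some optional implementations, the apparatus further includes: a seventh determination module 912, configured to determine, from among the at least one user, the target user corresponding to the user audio information and acquire a face image of the target user; a second emotion recognition module 913, configured to input the face image of the target user corresponding to the user audio information, together with the user audio information, into a pre-trained second emotion recognition model to obtain emotion category information; and an eighth determination module 914, configured to determine and output, based on the emotion category information, a score representing the degree of match between the emotion of the target user corresponding to the user audio information and the type of the currently playing audio.
In some optional implementations, the second emotion recognition module 913 is further configured to: input the face image of the user corresponding to the user audio information, together with the user audio information, into the second emotion recognition model to obtain a third emotion category information sequence, where each item of emotion category information in the third emotion category information sequence corresponds to one face image subsequence. The eighth determination module 914 includes: a second acquisition unit 9141, configured to acquire the video corresponding to the currently playing audio and extract a face image sequence of a target person from the video; a third emotion recognition unit 9142, configured to input the face image sequence and the currently playing audio into the second emotion recognition model to obtain a fourth emotion category information sequence; a fourth determination unit 9143, configured to determine the similarity between the third emotion category information sequence and the fourth emotion category information sequence; and a fifth determination unit 9144, configured to determine, based on the similarity, a score representing the degree of match between the emotion of the user corresponding to the user audio information and the type of the currently playing audio.
In some optional implementations, the extraction module 905 includes: a third acquisition unit 9051, configured to acquire initial audio information collected by an audio collection device installed in the target space, the initial audio information including the mixed sound signal; and a separation unit 9052, configured to perform voice separation on the initial audio information to obtain at least one channel of user audio information, where each channel of the at least one channel of user audio information corresponds to one user.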
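In some optional implementations, the second playback module 906 is further configured to: adjust the volume of each channel of the at least one channel of user audio information to a target volume, synthesize the volume-adjusted user audio information, and play the synthesized user audio information.
In some optional implementations, the second playback module 906 includes: a first melody recognition unit 9061, configured to perform melody recognition on the user audio information to obtain user melody information, match the user melody information against the melody information of the currently playing audio, and play the user audio information based on the resulting first matching result; and/or a first speech recognition unit 9062, configured to perform speech recognition on the user audio information to obtain a speech recognition result, match the speech recognition result against the text information corresponding to the currently playing audio, and play the user audio information based on the resulting second matching result.
In some optional implementations, the second playback module 906 includes: a sixth determination unit 9063, configured to determine the pitch of the user audio information; an adjustment unit 9064, configured to adjust the pitch of the currently playing audio to a target pitch that matches the pitch of the user audio information; and/or an output unit 9065, configured to output recommendation information that recommends audio corresponding to the pitch of the user audio information.
In some optional implementations, the first determination module 902 includes: a fourth emotion recognition unit 9021, configured to, in response to determining that the intent determination data includes a face image of the at least one user, input the face image into a pre-trained third emotion recognition model to obtain emotion category information, and, if the emotion category information is preset emotion type information, determine that the at least one user has the target vocalization intent; or a second speech recognition unit 9022, configured to, in response to determining that the intent determination data includes sound information of the at least one user, perform speech recognition on the sound information to obtain a speech recognition result, and, if the speech recognition result indicates that the at least one user instructs audio to be played, determine that the at least one user has the target vocalization intent; or a second melody recognition unit 9023, configured to, in response to determining that the intent determination data includes sound information of the at least one user, perform melody recognition on the sound information to obtain a melody recognition result, and, if the melody recognition result indicates that the at least one user is producing a target form of vocalization, determine that the at least one user has the target vocalization intent.
In some optional implementations, the second determination module 903 includes: a seventh determination unit 9031, configured to acquire a historical audio playback record for the at least one user, determine listening habit information of the at least one user based on the historical audio playback record, and determine the feature information based on the listening habit information; and/or a fifth emotion recognition unit 9032, configured to acquire a face image of the at least one user, input the face image into a pre-trained fourth emotion recognition model to obtain emotion category information representing the current emotion of the at least one user, and determine the feature information based on the emotion category information; and/or an environment recognition unit 9033, configured to acquire an environment image of the environment in which the at least one user is located, input the environment image into a pre-trained environment recognition model to obtain environment type information, and determine the feature information based on the environment type information; and/or an eighth determination unit 9034, configured to acquire an in-space image obtained by photographing the target space, determine the number of people in the target space based on the in-space image, and determine the feature information based on the number of people.
In some optional implementations, the first playback module 904 includes: a first playback unit 9041, configured to, in response to determining that the feature information includes listening habit information, extract and play audio corresponding to the listening habit; a second playback unit 9042, configured to, in response to determining that the feature information includes emotion category information, extract and play audio corresponding to the emotion category information; a third playback unit 9043, configured to, in response to determining that the feature information includes environment type information, extract and play audio corresponding to the environment type information; and a fourth playback unit 9044, configured to, in response to determining that the feature information includes the number of people, extract and play audio corresponding to the number of people.
In the audio playback apparatus provided by the above embodiment of the present disclosure, intent determination data is collected for at least one user in a target space, the target vocalization intent of the user is determined from the intent determination data, feature information is determined based on the target vocalization intent, and audio corresponding to the feature information is finally extracted from the preset audio library and played. The electronic device thus determines the user's target vocalization intent automatically and, when it determines that the user has a vocalization intent, plays audio automatically without the user having to trigger playback. This reduces the number of steps the user must perform to play audio and makes audio playback more convenient. In addition, because the user's current features are determined and the played audio is adapted to them, the audio the user wants to hear is played more precisely and automatic playback is better targeted.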
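Exemplary Electronic Device
An electronic device according to an embodiment of the present disclosure is described below with reference to FIG. 11. The electronic device may be either or both of the terminal device 101 and the server 103 shown in FIG. 1, or a stand-alone device independent of them. The stand-alone device may communicate with the terminal device 101 and the server 103 to receive the collected input signals from them.
FIG. 11 is a block diagram of an electronic device according to an embodiment of the present disclosure.
As shown in FIG. 11, the electronic device 1100 includes one or more processors 1101 and a memory 1102.
The processor 1101 may be a central processing unit (CPU) or another form of processing unit having data processing capability and/or instruction execution capability, and may control other components in the electronic device 1100 to perform desired functions.
The memory 1102 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 1101 may run the program instructions to implement the audio playback methods of the embodiments of the present disclosure described above and/or other desired functions. Various content such as intent determination data, feature information, and audio may also be stored in the computer-readable storage medium.
In one example, the electronic device 1100 may further include an input device 1103 and an output device 1104, which are interconnected by a bus system and/or another form of connection mechanism (not shown).
For example, when the electronic device is the terminal device 101 or the server 103, the input device 1103 may be a device such as a camera or a microphone for inputting intent determination data. When the electronic device is a stand-alone device, the input device 1103 may be a communication network connector for receiving the input intent determination data from the terminal device 101 and the server 103.
The output device 1104 may output various information, including the extracted audio, to the outside. The output device 1104 may include, for example, a display, a speaker, and a communication network and the remote output devices connected to it.
Of course, for simplicity, FIG. 11 shows only some of the components of the electronic device 1100 that are relevant to the present disclosure, and omits components such as buses and input/output interfaces. In addition, the electronic device 1100 may include any other appropriate components depending on the specific application.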
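Exemplary Computer Program Product and Computer-Readable Storage Medium
In addition to the above methods and devices, an embodiment of the present disclosure may also be a computer program product that includes computer program instructions which, when run by a processor, cause the processor to perform the steps of the audio playback method according to the various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
The computer program product may contain program code for performing the operations of the embodiments of the present disclosure written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
Furthermore, an embodiment of the present disclosure may also be a computer-readable storage medium storing computer program instructions which, when run by a processor, cause the processor to perform the steps of the audio playback method according to the various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
The computer-readable storage medium may be any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example and without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
The basic principles of the present disclosure have been described above in connection with specific embodiments. It should be noted, however, that the merits, advantages, and effects mentioned in the present disclosure are only examples and not limitations, and they should not be regarded as necessary for every embodiment of the present disclosure. In addition, the specific details disclosed above are provided only for the purposes of illustration and ease of understanding, not limitation; they do not require that the present disclosure be implemented with those specific details.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar the embodiments may be referred to one another. The system embodiments correspond substantially to the method embodiments and are therefore described more briefly; for related details, refer to the corresponding description of the method embodiments.
The block diagrams of the components, apparatuses, devices, and systems in the present disclosure are illustrative examples only and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown. As those skilled in the art will recognize, these components, apparatuses, devices, and systems may be connected, arranged, and configured in any manner. Words such as "include", "comprise", and "have" are open-ended terms that mean "including but not limited to" and may be used interchangeably with that phrase. The words "or" and "and" as used herein mean "and/or" and may be used interchangeably with it, unless the context clearly indicates otherwise. The phrase "such as" as used herein means "such as but not limited to" and may be used interchangeably with it.
The methods and apparatuses of the present disclosure may be implemented in many ways, for example by software, hardware, firmware, or any combination of software, hardware, and firmware. The order of the method steps given above is for illustration only; the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated. In addition, in some embodiments, the present disclosure may be implemented as programs recorded on a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. The present disclosure therefore also covers recording media storing programs for performing the methods according to the present disclosure.
It should also be noted that in the apparatuses, devices, and methods of the present disclosure, the components or steps may be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalents of the present disclosure.
The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of the present disclosure. The present disclosure is therefore not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The above description has been given for the purposes of illustration and description. It is not intended to limit the embodiments of the present disclosure to the forms disclosed herein. Although a number of example aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, changes, additions, and sub-combinations thereof.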

Claims (12)

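  1. An audio playback method, comprising:
    acquiring intent determination data collected for at least one user in a target space;
    determining, based on the intent determination data, a target vocalization intent of the at least one user;
    determining, based on the target vocalization intent, feature information representing current features of the at least one user;
    extracting audio corresponding to the feature information from a preset audio library and playing the audio.
  2. The method according to claim 1, wherein, after the extracting and playing audio corresponding to the feature information, the method further comprises:
    extracting user audio information from a current mixed sound signal;
    playing the user audio information if the user audio information meets a preset condition.
  3. The method according to claim 2, wherein, after the playing the user audio information, the method further comprises:
    determining, from among the at least one user, a target user corresponding to the user audio information, and acquiring a face image of the target user;
    inputting the face image of the target user corresponding to the user audio information into a pre-trained first emotion recognition model to obtain emotion category information corresponding to the target user;
    determining, based on the emotion category information, a first score representing the degree of match between the emotion of the target user corresponding to the user audio information and the type of the currently playing audio; and/or
    determining, based on the user audio information, a second score representing the degree of match between the user audio information and the currently playing audio;
    determining and outputting a score for the user audio information based on the first score and/or the second score.
  4. The method according to claim 2, wherein, after the playing the user audio information, the method further comprises:
    determining, from among the at least one user, a target user corresponding to the user audio information, and acquiring a face image of the target user;
    inputting the face image of the target user corresponding to the user audio information, together with the user audio information, into a pre-trained second emotion recognition model to obtain emotion category information;
    determining and outputting, based on the emotion category information, a score representing the degree of match between the emotion of the target user corresponding to the user audio information and the type of the currently playing audio.
  5. The method according to claim 4, wherein the inputting the face image of the target user corresponding to the user audio information, together with the user audio information, into a pre-trained second emotion recognition model to obtain emotion category information comprises:
    inputting the face image of the target user corresponding to the user audio information, together with the user audio information, into the second emotion recognition model to obtain a third emotion category information sequence, wherein each item of emotion category information in the third emotion category information sequence corresponds to one face image subsequence;
    and the determining, based on the emotion category information, a score representing the degree of match between the emotion of the target user corresponding to the user audio information and the type of the currently playing audio comprises:
    acquiring a video corresponding to the currently playing audio, and extracting a face image sequence of a target person from the video;
    inputting the face image sequence and the currently playing audio into the second emotion recognition model to obtain a fourth emotion category information sequence;
    determining the similarity between the third emotion category information sequence and the fourth emotion category information sequence;
    determining, based on the similarity, a score representing the degree of match between the emotion of the user corresponding to the user audio information and the type of the currently playing audio.
  6. The method according to claim 2, wherein the extracting user audio information from a current mixed sound signal comprises:
    acquiring initial audio information collected by an audio collection device installed in the target space, the initial audio information including the mixed sound signal;
    performing voice separation on the initial audio information to obtain at least one channel of user audio information, wherein each channel of the at least one channel of user audio information corresponds to one user.
  7. The method according to claim 2, wherein the playing the user audio information based on the user audio information comprises:
    performing melody recognition on the user audio information to obtain user melody information; matching the user melody information against melody information of the currently playing audio, and playing the user audio information based on the resulting first matching result; and/or
    performing speech recognition on the user audio information to obtain a speech recognition result; matching the speech recognition result against text information corresponding to the currently playing audio, and playing the user audio information based on the resulting second matching result.
  8. The method according to claim 1, wherein the determining, based on the intent determination data, a target vocalization intent of the at least one user comprises:
    in response to determining that the intent determination data includes a face image of the at least one user, inputting the face image into a pre-trained third emotion recognition model to obtain emotion category information; and if the emotion category information is preset emotion type information, determining that the at least one user has the target vocalization intent; or
    in response to determining that the intent determination data includes sound information of the at least one user, performing speech recognition on the sound information to obtain a speech recognition result; and if the speech recognition result indicates that the at least one user instructs audio to be played, determining that the at least one user has the target vocalization intent; or
    in response to determining that the intent determination data includes sound information of the at least one user, performing melody recognition on the sound information to obtain a melody recognition result; and if the melody recognition result indicates that the at least one user is producing a target form of vocalization, determining that the at least one user has the target vocalization intent.
  9. The method according to claim 1, wherein the determining feature information representing current features of the at least one user comprises:
    acquiring a historical audio playback record for the at least one user; determining listening habit information of the at least one user based on the historical audio playback record; and determining the feature information based on the listening habit information; and/or
    acquiring a face image of the at least one user, and inputting the face image into a pre-trained fourth emotion recognition model to obtain emotion category information representing the current emotion of the at least one user; and determining the feature information based on the emotion category information; and/or
    acquiring an environment image of the environment in which the at least one user is located, and inputting the environment image into a pre-trained environment recognition model to obtain environment type information; and determining the feature information based on the environment type information; and/or
    acquiring an in-space image obtained by photographing the target space; determining the number of people in the target space based on the in-space image; and determining the feature information based on the number of people.
  10. An audio playback apparatus, comprising:
    an acquisition module, configured to acquire intent determination data collected for at least one user in a target space;
    a first determination module, configured to determine, based on the intent determination data, a target vocalization intent of the at least one user;
    a second determination module, configured to determine feature information representing current features of the at least one user;
    a first playback module, configured to extract audio corresponding to the feature information from a preset audio library and play the audio.
  11. A computer-readable storage medium storing a computer program, the computer program being used to perform the method according to any one of claims 1 to 9.
  12. An electronic device, comprising:
    a processor;
    a memory for storing instructions executable by the processor;
    the processor being configured to read the executable instructions from the memory and execute the instructions to implement the method according to any one of claims 1 to 9.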
PCT/CN2022/076239 2021-04-16 2022-02-14 音频播放方法、装置、计算机可读存储介质及电子设备 WO2022218027A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022573581A JP7453712B2 (ja) 2021-04-16 2022-02-14 オーディオ再生方法、装置、コンピュータ可読記憶媒体及び電子機器
US18/247,754 US20240004606A1 (en) 2021-04-16 2022-02-14 Audio playback method and apparatus, computer readable storage medium, and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110410353.9 2021-04-16
CN202110410353.9A CN113126951B (zh) 2021-04-16 2021-04-16 音频播放方法、装置、计算机可读存储介质及电子设备

Publications (1)

Publication Number Publication Date
WO2022218027A1 true WO2022218027A1 (zh) 2022-10-20

Family

ID=76777173

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/076239 WO2022218027A1 (zh) 2021-04-16 2022-02-14 音频播放方法、装置、计算机可读存储介质及电子设备

Country Status (4)

Country Link
US (1) US20240004606A1 (zh)
JP (1) JP7453712B2 (zh)
CN (1) CN113126951B (zh)
WO (1) WO2022218027A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113126951B (zh) * 2021-04-16 2024-05-17 深圳地平线机器人科技有限公司 音频播放方法、装置、计算机可读存储介质及电子设备
CN114120939B (zh) * 2021-11-26 2024-10-11 合肥若叶无间网络科技有限公司 一种古琴调音器的实现方法
CN114999534A (zh) * 2022-06-10 2022-09-02 中国第一汽车股份有限公司 一种车载音乐的播放控制方法、装置、设备和存储介质

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107632814A (zh) * 2017-09-25 2018-01-26 珠海格力电器股份有限公司 音频信息的播放方法、装置和系统、存储介质、处理器
US10102851B1 (en) * 2013-08-28 2018-10-16 Amazon Technologies, Inc. Incremental utterance processing and semantic stability determination
CN110111795A (zh) * 2019-04-23 2019-08-09 维沃移动通信有限公司 一种语音处理方法及终端设备
CN110413250A (zh) * 2019-06-14 2019-11-05 华为技术有限公司 一种语音交互方法、装置及系统
CN111199732A (zh) * 2018-11-16 2020-05-26 深圳Tcl新技术有限公司 一种基于情感的语音交互方法、存储介质及终端设备
CN111523981A (zh) * 2020-04-29 2020-08-11 深圳追一科技有限公司 虚拟试用方法、装置、电子设备及存储介质
CN112397065A (zh) * 2020-11-04 2021-02-23 深圳地平线机器人科技有限公司 语音交互方法、装置、计算机可读存储介质及电子设备
CN113126951A (zh) * 2021-04-16 2021-07-16 深圳地平线机器人科技有限公司 音频播放方法、装置、计算机可读存储介质及电子设备

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000099014A (ja) * 1998-09-22 2000-04-07 Ntt Data Corp 音楽自動採点装置、音楽自動採点モデル作成装置、音楽自動採点モデル作成方法、音楽自動採点方法、及び、記録媒体
JP2000330576A (ja) * 1999-05-19 2000-11-30 Taito Corp カラオケの歌唱評価方法と装置
JP4037081B2 (ja) 2001-10-19 2008-01-23 パイオニア株式会社 情報選択装置及び方法、情報選択再生装置並びに情報選択のためのコンピュータプログラム
JP2004163590A (ja) 2002-11-12 2004-06-10 Denso Corp 再生装置及びプログラム
JP4916005B2 (ja) 2007-02-28 2012-04-11 株式会社第一興商 カラオケシステム
US8583615B2 (en) * 2007-08-31 2013-11-12 Yahoo! Inc. System and method for generating a playlist from a mood gradient
CN102970427A (zh) * 2012-11-16 2013-03-13 广东欧珀移动通信有限公司 一种手机播放歌曲的方法
US10373611B2 (en) * 2014-01-03 2019-08-06 Gracenote, Inc. Modification of electronic system operation based on acoustic ambience classification
JP6409652B2 (ja) 2015-03-30 2018-10-24 ブラザー工業株式会社 カラオケ装置、プログラム
JP6866715B2 (ja) * 2017-03-22 2021-04-28 カシオ計算機株式会社 情報処理装置、感情認識方法、及び、プログラム
CN107609034A (zh) * 2017-08-09 2018-01-19 深圳市汉普电子技术开发有限公司 一种智能音箱的音频播放方法、音频播放装置及存储介质
WO2019114426A1 (zh) * 2017-12-15 2019-06-20 蔚来汽车有限公司 车载音乐的匹配方法、装置及车载智能控制器
JP6944391B2 (ja) 2018-01-31 2021-10-06 株式会社第一興商 カラオケ装置
CN108848416A (zh) * 2018-06-21 2018-11-20 北京密境和风科技有限公司 音视频内容的评价方法和装置
CN109299318A (zh) * 2018-11-13 2019-02-01 百度在线网络技术(北京)有限公司 音乐推荐的方法、装置、存储介质和终端设备
CN111754965B (zh) * 2019-03-29 2023-11-14 比亚迪股份有限公司 车载k歌装置、方法和车辆
CN110096611A (zh) * 2019-04-24 2019-08-06 努比亚技术有限公司 一种歌曲推荐方法、移动终端及计算机可读存储介质
CN110197677A (zh) * 2019-05-16 2019-09-03 北京小米移动软件有限公司 一种播放控制方法、装置及播放设备
CN111968611B (zh) * 2020-08-12 2024-04-23 上海仙塔智能科技有限公司 K歌方法、车载终端及计算机可读存储介质

Also Published As

Publication number Publication date
US20240004606A1 (en) 2024-01-04
CN113126951B (zh) 2024-05-17
CN113126951A (zh) 2021-07-16
JP2023527473A (ja) 2023-06-28
JP7453712B2 (ja) 2024-03-21

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2022573581

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22787257

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18247754

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.02.2024)