WO2023173285A1 - Audio processing method and apparatus, electronic device, and computer-readable storage medium - Google Patents


Info

Publication number
WO2023173285A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
audio
information
virtual sound
sound source
Prior art date
Application number
PCT/CN2022/080925
Other languages
French (fr)
Chinese (zh)
Inventor
莫品西
边云锋
高建正
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority to PCT/CN2022/080925
Publication of WO2023173285A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S5/00: Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S7/00: Indicating arrangements; Control arrangements, e.g. balance control

Definitions

  • the present application relates to the field of audio processing technology, specifically, to an audio processing method, device, electronic equipment and computer-readable storage medium.
  • the present application provides an audio processing method, device, electronic equipment and computer-readable storage medium, to address the problem in the related art that audio processing cannot meet users' personalized needs.
  • an audio processing method is provided.
  • the method is applied to an electronic device.
  • the electronic device includes at least two players and a motion sensor.
  • the method includes:
  • the virtual sound field is established based on the positional relationship between the at least two players and the user's left and right hearing organs;
  • an audio signal corresponding to each of the at least two players is determined, and the audio signal is played through the corresponding player.
  • in a second aspect, an audio processing device includes a processor, a memory, and a computer program stored in the memory and executable by the processor; when the processor executes the computer program, the method embodiment described in the first aspect is implemented.
  • an electronic device in a third aspect, includes at least two players and a motion sensor.
  • the electronic device further includes a processor, a memory, and a computer program stored in the memory and executable by the processor;
  • a computer-readable storage medium is provided. Several computer instructions are stored on the computer-readable storage medium. When the computer instructions are executed, the method embodiment described in the first aspect is implemented.
  • a virtual sound field can be established based on the positional relationship between the at least two players and the user's left and right hearing organs, so that the orientation information, in the virtual sound field, of the virtual sound source of the audio to be played can be obtained.
  • the motion sensor can obtain the user's motion information and generate a sound source orientation adjustment instruction based on that information; the instruction can then be used to adjust the orientation of the virtual sound source, determine a new audio signal, and play it through the at least two players. This allows the user to hear a sound source whose position changes while moving, and to feel the position of the sound source in the audio change with the user's movement. The audio processing solution of this embodiment therefore generates audio whose motion sound effects match the user's movements, achieving a realistic spatial sound effect experience.
  • Figure 1 is a schematic diagram of a real concert scene according to an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of an audio processing method according to an embodiment of the present application.
  • FIG. 3A is a schematic flowchart of an audio processing method according to another embodiment of the present application.
  • Figure 3B is a schematic flowchart of step 303 according to an embodiment of the present application.
  • Figure 3C is a schematic diagram of the orientation of each discrete frequency in a coordinate system according to an embodiment of the present application.
  • Figure 3D is a schematic diagram showing user action changes in a coordinate system according to an embodiment of the present application.
  • FIG. 4 is a hardware structure diagram of an audio processing device according to an embodiment of the present application.
  • FIG. 5 is a hardware structure diagram of an electronic device according to an embodiment of the present application.
  • the user faces the concert stage.
  • To the left of the user is a performer playing gongs
  • To the right of the user is a performer playing drums.
  • the sounds of the gongs and drums travel to the user's ears from the user's front left and front right respectively, and the user can feel the gongs and drums sounding at the front left and front right.
  • the concert stage is on the left side of the user
  • the drums are in front of the user's left
  • the gongs are behind the user's left side.
  • the existing audio processing solution cannot express the above changes, because the audio signals played are pre-recorded.
  • if the position of the recording equipment is fixed, for example the gongs and drums in the audio signal are fixed at the front left and front right of the recording equipment, then when listening to the music with headphones the user will always perceive the gongs and drums in the audio signal as coming from the front left and front right; when the user turns sideways, he cannot feel the relative position of the gongs, the drums and himself change as in a real concert scene.
  • if the position of the recording equipment is not fixed, for example the recording equipment moves during the recording process,
  • then the position of the gongs and drums, as the sound sources in the recorded audio signal, changes relative to the recording equipment; alternatively,
  • the position of the recording equipment may be fixed while the positions of the gongs and drums as sound sources change. Audio recorded in this way can make listeners perceive the position changes of the gongs and drums, so a dynamic effect of changing sound can be achieved;
  • however, this method must be realized by changing the position of the recording equipment or of the sound source during recording. Once recording is complete, the change of the sound source's position in the audio is fixed, and its dynamic effect does not match the actions of the user listening to the audio signal.
  • FIG. 2 is a flowchart of an audio processing method according to an exemplary embodiment of the present application.
  • This method can be applied to an electronic device.
  • the electronic device includes at least two players and a motion sensor.
  • the method includes the following steps:
  • in step 202, the orientation information, in the virtual sound field, of the virtual sound source of the audio to be played is obtained.
  • the virtual sound field is established based on the positional relationship between the at least two players and the user's left and right hearing organs.
  • in step 204, the motion sensor is used to obtain the user's motion information, and a sound source orientation adjustment instruction is generated based on the motion information.
  • in step 206, based on the sound source orientation adjustment instruction, the orientation information of the virtual sound source and the audio to be played, an audio signal corresponding to each of the at least two players is determined, and the audio signal is played through the corresponding player.
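The three steps above can be sketched as a minimal processing loop. The sketch below is illustrative only: it stands in for the patent's HRTF-based rendering with a simple constant-power pan law, and all function names and the level-difference azimuth estimate are assumptions, not part of the disclosure.

```python
import numpy as np

def get_source_azimuth(audio_stereo):
    """Step 202 (sketch): estimate the virtual source azimuth of a stereo
    frame from the inter-channel level difference."""
    left, right = audio_stereo
    rms_l = np.sqrt(np.mean(left ** 2)) + 1e-12
    rms_r = np.sqrt(np.mean(right ** 2)) + 1e-12
    # Map the level ratio to an azimuth in [-90, 90] degrees (crude pan law).
    return np.degrees(np.arctan2(rms_r - rms_l, rms_r + rms_l)) * 2

def adjust_azimuth(azimuth_deg, head_yaw_deg):
    """Step 204 (sketch): a head turn of +x degrees moves a world-fixed
    source by -x degrees in the head-relative frame."""
    return azimuth_deg - head_yaw_deg

def render(mono, azimuth_deg):
    """Step 206 (sketch): re-pan a mono frame to two players with a
    constant-power pan law."""
    theta = np.radians(np.clip(azimuth_deg, -90, 90))
    pan = (theta + np.pi / 2) / np.pi          # 0 = full left, 1 = full right
    gain_l = np.cos(pan * np.pi / 2)
    gain_r = np.sin(pan * np.pi / 2)
    return np.stack([gain_l * mono, gain_r * mono])
```

A frame rendered at azimuth 0 leaves both players with equal gain; re-estimating the azimuth of a rendered frame recovers the angle, which is the sense in which steps 202 and 206 are inverses in this toy model.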
  • the electronic device in this embodiment can be any device with audio processing capabilities, such as a smartphone, a camera device, a VR device, a wearable device or a computer that supports audio playback, and so on.
  • the electronic device is connected to an audio player, that is, at least two players, such as ordinary headphones, bone conduction headphones or speakers, etc.
  • the motion sensor includes any sensor that can collect data from which the user's motion can be detected, such as an inertial measurement sensor, a visual sensor (monocular or binocular) or a lidar, etc.
  • the at least two players and motion sensors can be mounted on the device that performs the above method embodiments.
  • a wearable device is equipped with a player and a motion sensor, and the wearable device can obtain the audio to be played, Obtain the data collected by the motion sensor to determine the user's motion information, and after processing the audio to be played, control the player to play the processed audio signal.
  • the player and the motion sensor may be respectively connected to the device executing the above method embodiment.
  • the player and the motion sensor may each be connected to a device such as a mobile terminal; the mobile terminal obtains the audio to be played, determines the user's motion information from the data collected by the motion sensor and, after processing the audio to be played, plays the processed audio signal through the player.
  • the audio may be played by the player immediately after processing, or the processed audio may be stored and played through the player later when needed; or, if the device executing the above method embodiment is not directly connected to the player, the device can send the processed audio to another device that is connected to the player, and that device controls the player's playback.
  • a virtual sound field can be established based on the positional relationship between the at least two players and the user's left and right hearing organs, so that the orientation information, in the virtual sound field, of the virtual sound source of the audio to be played can be obtained.
  • the motion sensor can obtain the user's motion information and generate a sound source orientation adjustment instruction based on that information; the instruction can then be used to adjust the orientation of the virtual sound source, determine a new audio signal, and play it through the at least two players.
  • the user can thus hear a sound source whose position changes while moving, and can feel the position of the sound source in the audio change with the user's movement. The audio processing solution of this embodiment can therefore generate audio whose motion sound effects match the user's movements, achieving a realistic spatial sound effect experience.
  • the device when the user is listening to music with headphones, if the user turns right, the device obtains the user's action information of turning right, adjusts the orientation information of the virtual sound source in the audio signal through the above solution, and generates a dynamic audio signal in real time.
  • the position of the virtual sound source in the audio signal relative to the user changes with the user's movements, thus realizing a VR sound effect in which the sound follows the movement of the person, and the user can feel the sound source changing in real time with his/her movements.
  • this embodiment can also meet users' personalized sound effect requirements: the user can control the azimuth changes of the virtual sound source through actions.
  • the user can customize the correspondence between his actions and the azimuth changes of the virtual sound source, and thus flexibly adjust the position of the virtual sound source according to his own needs.
  • the sound field refers to the area where sound waves exist in the medium.
  • the physical quantity describing the sound field can be sound pressure, particle vibration velocity, displacement or medium density, etc., and is generally a function of position and time. The relationship between changes in the spatial position and time of these physical quantities in the sound field is described by the acoustic wave equation.
  • the sound source is in a uniform and isotropic medium, and the sound field in which the influence of the boundary is negligible can be called a free sound field.
  • the player plays the audio.
  • the virtual sound field in this embodiment can be established based on the positional relationship between the at least two players and the user's left and right hearing organs.
  • the orientation information describes the location of the virtual sound source in the virtual sound field.
  • the orientation information can be described in a coordinate system established relative to the user, for example with the origin of the coordinate system on the user; alternatively, a coordinate system can be established relative to the players. In practical applications, the orientation information of the virtual sound source can be flexibly described, as needed, by the coordinate information of the virtual sound source in that coordinate system.
  • the virtual sound source can be determined by signal recognition of the audio to be played; after the virtual sound source in the audio to be played is identified, its orientation information in the virtual sound field is determined. In practical applications, this can be implemented through audio processing algorithms or trained neural networks. In other examples, it can be implemented based on frequency-domain analysis of the signal: for example, a frequency-domain signal is obtained from the audio to be played, the orientation of each discrete frequency is extracted from the frequency-domain signal, and the orientation of each discrete frequency is adjusted in subsequent steps, so that the position of the virtual sound source in the virtual sound field changes with the user's actions.
  • the orientation information, in the virtual sound field, of the virtual sound source of the audio to be played is obtained by extracting multiple discrete frequencies from the audio to be played, and is determined based on the orientation information of the multiple discrete frequencies in the virtual sound field; in this embodiment, the sound signal generated by a real sound source can be decomposed into multiple discrete frequencies.
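As an illustrative sketch of decomposing a real sound signal into discrete frequencies, a short FFT example (the function name and framing are assumptions, not from the disclosure):

```python
import numpy as np

def discrete_frequencies(signal, sample_rate):
    """Decompose a real-valued signal frame into its discrete frequencies.

    Returns (frequencies_hz, complex_amplitudes); each bin can then be
    treated as an independent sound source, as the text describes.
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs, spectrum
```

For a 100 Hz sine sampled at 1 kHz over one second, the dominant bin falls exactly at 100 Hz.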
  • the audio to be played is an audio signal of at least two channels; the orientation information of the multiple discrete frequencies in the virtual sound field is obtained in the following manner: based on the amplitude ratio and/or phase difference of the audio signals of the at least two channels, the orientation information of the multiple discrete frequencies in the virtual sound field is obtained using the first coordinate system and the binaural transfer function.
  • the first coordinate system is established based on the positional relationship between the at least two players and the user's left and right hearing organs.
  • each discrete frequency can be treated as an independent sound source distributed in the coordinate system, representing the various directions of the sound source.
  • there are usually at least two players, so the audio to be played is usually an audio signal of at least two channels.
  • the direction of each independent sound source in space can be calculated based on the amplitude ratio and phase difference of the two-channel frequency-domain signals, combined with the model of sound propagation to both ears.
  • the model of sound propagation to both ears can be implemented using the binaural transfer function.
  • the binaural transfer function is also called the Head Related Transfer Function (HRTF); this function describes the transmission process of sound waves from the sound source to both ears. Alternatively, a model of sound propagation in the free sound field can be referenced, etc. If the audio to be played is a mono signal, it can be expanded by copying into identical left- and right-channel signals.
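A rough sketch of estimating a per-frequency orientation from the amplitude ratio and phase difference of two channels. This is a crude stand-in for the HRTF-based model described above: the inter-channel level difference is mapped to a lateral angle, and the phase difference to an arrival-time cue; all names and the mappings themselves are illustrative assumptions.

```python
import numpy as np

def per_bin_azimuth(left, right, sample_rate):
    """Estimate an azimuth for every discrete frequency of a stereo frame.

    Returns (azimuth_deg_per_bin, time_difference_per_bin). The azimuth comes
    from the level-difference cue; the time difference from the phase cue
    (physically meaningful only below roughly 1.5 kHz, where phase is
    unambiguous).
    """
    spec_l = np.fft.rfft(left)
    spec_r = np.fft.rfft(right)
    freqs = np.fft.rfftfreq(len(left), d=1.0 / sample_rate)

    # Level-difference cue: magnitude ratio mapped to [-90, +90] degrees.
    mag_l = np.abs(spec_l) + 1e-12
    mag_r = np.abs(spec_r) + 1e-12
    azimuth = np.degrees(np.arctan2(mag_r - mag_l, mag_r + mag_l)) * 2

    # Phase-difference cue: inter-channel phase converted to seconds.
    itd = np.angle(spec_r * np.conj(spec_l)) / (2 * np.pi * np.maximum(freqs, 1.0))
    return azimuth, itd
```

A tone present only in the right channel comes out at +90 degrees at its bin; identical channels come out at 0.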
  • the motion sensor may include a sensor worn on the user.
  • the electronic device is connected to the sensor.
  • the data collected by the sensor may be sent to the electronic device, and the electronic device determines the user's motion information.
  • one or more of the following implementation methods may also be included:
  • the electronic device is a wearable electronic device
  • the motion sensor includes an inertial measurement sensor (IMU, Inertial Measurement Unit), that is, the motion sensor is built into the electronic device.
  • using the motion sensor to obtain the user's motion information includes: determining the user's motion information based on the measurement data of the inertial measurement sensor.
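As an illustrative sketch (not from the disclosure) of turning inertial measurement data into motion information, head yaw can be obtained by integrating the gyroscope's angular velocity over time:

```python
import numpy as np

def yaw_from_gyro(gyro_z_rad_s, dt_s):
    """Integrate z-axis angular velocity (rad/s), sampled every dt_s seconds,
    into a yaw angle in degrees (0 = initial orientation).

    Real systems fuse the gyro with accelerometer/magnetometer data to limit
    drift; plain integration is enough for this sketch.
    """
    yaw_rad = np.cumsum(np.asarray(gyro_z_rad_s) * dt_s)
    return np.degrees(yaw_rad)
```

A constant turn rate of pi/2 rad/s held for one second integrates to a 90-degree head turn.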
  • the electronic device is a wearable electronic device
  • the motion sensor includes a first image sensor; that is, the electronic device is equipped with one or more first image sensors.
  • the wearable electronic device is worn on the user's head
  • the first image sensor faces the user's eyes
  • using the motion sensor to obtain the user's motion information includes: acquiring an image collected by the first image sensor, and acquiring the motion information of the user's eyeballs based on the image.
  • the electronic device is a wearable electronic device
  • the motion sensor includes one or more second image sensors.
  • the observation range of the one or more second image sensors covers the activity space of the user's hand; using the motion sensor to obtain the user's motion information includes: acquiring images collected by the one or more second image sensors and, if the user's hand is detected in the images, obtaining the motion information of the user's hand.
  • the user's action information may be determined using any of the above methods, or the action information may be determined using a combination of at least two of the above methods, which is not limited in this embodiment.
  • the audio to be played has a certain playback duration.
  • One of the implementation methods may be used to determine the user's action information in certain time periods, and other implementation methods may be used to determine the user's action information in other time periods.
  • the first image sensor in the aforementioned embodiment can collect an image of the user's eyes, the electronic device determines the action information of the user's eyeballs based on the image, and adjusts the audio at the current time based on the action information.
  • the second image sensor of the aforementioned embodiment can collect an image of the user's hand.
  • the electronic device determines the motion information of the user's hand based on the image, and adjusts the position of the virtual sound source in the audio at the current time based on that motion information.
  • different types of action information may be involved in these adjustments.
  • this embodiment may adopt a variety of implementations for collecting user actions, so that the position of the virtual sound source in the virtual sound field can vary with different types of user actions.
  • the user's action information can include actions of a variety of body parts, such as any one or more of hand movements, head movements, eye movements or leg movements; in practical applications, this can be set as needed or set by the user.
  • the user's action information may be a vector of information used to describe user action changes, which may include multiple types of information, such as any one or more of direction information, distance information or speed information of the user's action.
  • the user's action information can be calculated based on a set coordinate system; or it can include information describing changes in the user's actions, where a change refers to the change of the user's action at the current moment relative to another action.
  • the reference action can be chosen in a variety of ways: for example, the previous collection cycle, a set time before the current moment, or a set number of collection cycles before the current moment, etc.; or the change can also be calculated within a set coordinate system as mentioned above, and that coordinate system can be flexibly determined as needed.
  • it can be a coordinate system established based on the user's initial posture when using the player, or a coordinate system established based on the player's posture when used by the user; or it can be another customized coordinate system, etc., which is not limited in this embodiment.
  • a sound source orientation adjustment instruction can be generated based on the action information.
  • the corresponding relationship between the user's action information and the sound source orientation adjustment instruction can be preset by the technician; that is, what kind of action information generates what kind of sound source orientation adjustment instruction can be decided by the technician. In other examples, it can also be decided by the user.
  • the user can be provided with a setting function. For example, the user can set the action information of one or more body parts to trigger the sound source orientation adjustment instruction, or set any one or more of direction information, distance information or speed information to be used to describe the user's action information.
  • the amount of change of the user's action can be determined based on the user's action information; based on the amount of change of the user's action and the first mapping relationship, the amount of change used to adjust the orientation of the virtual sound source in the virtual sound field is generated.
  • the change amount of the user's action can be used to quickly determine the change amount of the virtual sound source's orientation in the virtual sound field, thereby improving the speed of audio processing.
  • it also makes the change in the virtual sound source's orientation in the virtual sound field correspond to the change in the user's action: for example, the larger the user's action, the larger the change in the virtual sound source's orientation in the virtual sound field; the two are positively correlated.
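A minimal sketch of such a first mapping relationship, assuming a simple linear, positively correlated map from the action change to the azimuth change (the gain value and all names are illustrative, not from the disclosure):

```python
def first_mapping(action_delta_deg, gain=1.0):
    """First mapping relationship (sketch): a larger action change produces a
    proportionally larger change in the virtual source's orientation."""
    return gain * action_delta_deg

class VirtualSource:
    """A virtual sound source identified by its azimuth in the virtual field."""

    def __init__(self, azimuth_deg):
        self.azimuth_deg = azimuth_deg

    def apply_adjustment(self, action_delta_deg, gain=1.0):
        # Sound source orientation adjustment instruction: shift the azimuth
        # by the mapped amount and wrap it back into [-180, 180).
        self.azimuth_deg = ((self.azimuth_deg
                             + first_mapping(action_delta_deg, gain)
                             + 180.0) % 360.0) - 180.0
        return self.azimuth_deg
```

Swapping the gain per user (or per audio type) is one way to realize the different preset first mapping relationships discussed below.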
  • the orientation information, in the virtual sound field, of the virtual sound source of the audio to be played can be determined through the first coordinate system, and the change amount of the user's action can also be determined through the first coordinate system; that is, both are determined in the same coordinate system.
  • in this embodiment, in order to make the position of the virtual sound source in the virtual sound field change with the user's action, the change amount of the user's action and the orientation information of the virtual sound source are determined using the same first coordinate system, so the relationship between the two can be determined quickly and accurately, which facilitates establishing the mapping relationship.
  • the first coordinate system may include: a coordinate system established based on the user's initial posture using the at least two players. As an example, the coordinate system represents the user's initial posture.
  • the virtual sound field and the virtual sound source can be defined to be stationary relative to the coordinate system, while the person moves relative to the coordinate system. Therefore, the user's actual action parameters in the space represent the person's movement relative to the virtual sound field, and the user's actual action information is converted into the person's movement relative to the virtual sound field.
  • for example, the user's hand sliding to the right can be defined as the person moving to the right; alternatively, it can correspond to the user dragging the sound field to the right, that is, the person moving to the left relative to the sound field.
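The convention above can be sketched as follows, assuming the sound field stays fixed in the first coordinate system while the listener moves (the function names and angle wrapping are illustrative):

```python
def head_relative_azimuth(world_azimuth_deg, listener_yaw_deg):
    """The source's azimuth as heard: world azimuth minus listener yaw,
    wrapped into [-180, 180). The field is stationary; the person moves."""
    return ((world_azimuth_deg - listener_yaw_deg + 180.0) % 360.0) - 180.0

def drag_sound_field(world_azimuth_deg, drag_deg):
    """Dragging the field right by x degrees equals the person moving left
    relative to the field: the source's world azimuth shifts by +x."""
    return ((world_azimuth_deg + drag_deg + 180.0) % 360.0) - 180.0
```

Turning the head by exactly the source's azimuth brings the source to the front (0 degrees), matching the world-fixed-field interpretation in the text.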
  • the first mapping relationship can be implemented in multiple ways.
  • the first mapping relationship is preset.
  • one kind of first mapping relationship may be preset to apply to all users, or different first mapping relationships may be set according to different user types.
  • Different first mapping relationships can also be set based on various factors such as different audio types and application scenarios.
  • during implementation, the electronic device can select an appropriate first mapping relationship from multiple preset first mapping relationships based on one or more of the aforementioned factors. For example, if users of different ages, genders or heights perform the same action, the electronic device of this embodiment can generate different audio signals based on different first mapping relationships; that is, the changes in the orientation of the virtual sound source in the virtual sound field may differ in the audio heard by different users.
  • the first mapping relationship may be generated by obtaining the user's setting instructions.
  • the electronic device implementing this embodiment or other electronic devices connected to the electronic device of this embodiment may provide a user interface.
  • a setting function is provided, through which the user's setting instructions can be obtained, and then the first mapping relationship is generated according to the user's setting instructions.
  • this embodiment provides the function of customizing the mapping relationship, which allows users to customize the mapping relationship they need. Therefore, different users can have different first mapping relationships; that is, for the same audio, when different users make the same action, the generated audio signals differ, and the changes in the orientation of the virtual sound source in the virtual sound field differ in the audio heard by different users.
  • the first mapping relationship may also be determined based on the user's historical behavior data. For example, with the user's authorization and consent, the user's historical behavior data is obtained.
  • the historical behavior data may include one or more kinds of information, such as the user's historical action information and/or the user's historically played audio information, which are used to analyze the user's action characteristics, audio preferences and other personalized information, and on that basis a first mapping relationship that conforms to the user's preferences is generated.
  • the first mapping relationship may be determined using any of the above methods, or a combination of at least two of the above methods may be used, which is not limited in this embodiment.
  • the first mapping relationship can be determined using the aforementioned preset method; after that, the user's preference data regarding the orientation-adjusted audio is obtained, such as the user's evaluation data of the audio after the orientation adjustment, and the user's historical behavior data continues to be collected, thereby generating a new first mapping relationship, which is then used to implement the solution of this embodiment.
  • the audio signal corresponding to each of the at least two players may be determined based on the sound source position adjustment instruction, the orientation information of the virtual sound source, and the audio to be played, The audio signal is played through the corresponding player, so that the position of the virtual sound source in the virtual sound field changes with the user's actions.
  • the audio to be played has at least two virtual sound sources, and the type information of the at least two virtual sound sources differs; based on this, in this embodiment, generating the sound source orientation adjustment instruction based on the action information may include: based on the action information, generating a sound source orientation adjustment instruction for adjusting the orientation of each of the at least two virtual sound sources, so that, with the user's actions, virtual sound sources with different type information have different amounts of change in their orientation in the virtual sound field.
  • the sound source position adjustment instruction can be used to adjust the position of each virtual sound source, and the change amount of the position of each virtual sound source is different.
  • a single action of the user can make the azimuth change of each virtual sound source in the played audio different, so that the user can feel the different changes of different virtual sound sources in the audio.
  • for example, the audio to be played includes two sound sources; after the user takes an action, one sound source moves from one direction to another through a larger amount of change, while the other sound source reaches its new direction through a smaller amount of change.
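A hedged sketch of per-type adjustment amounts, assuming a per-type gain table (the type names and gain values are illustrative, not from the disclosure):

```python
# One user action moves differently-typed virtual sources by different amounts.
TYPE_GAINS = {"vocal": 1.0, "drums": 0.5, "gong": 0.25}

def adjust_sources(sources, action_delta_deg):
    """sources: {name: (azimuth_deg, type)}. The same action change shifts
    every source, but each type's azimuth moves by its own gained amount."""
    return {
        name: (az + TYPE_GAINS.get(kind, 1.0) * action_delta_deg, kind)
        for name, (az, kind) in sources.items()
    }
```

With a 40-degree action change, a "vocal" source moves the full 40 degrees while a "gong" source moves only 10, giving the two-source behavior described above.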
  • virtual sound sources with different types of information have different amounts of change in the orientation of the virtual sound field, which can be automatically determined by the electronic device.
  • the type information of the sound source can be identified, where the type information includes one or more of the following: the orientation information of the virtual sound source of the audio to be played in the virtual sound field, timbre information, or volume information, so that different sound sources can be accurately distinguished through one or more of the above kinds of information.
  • the solution of this embodiment can be applied to a video playback scene.
  • the video playback scene includes audio to be played and multi-frame images synchronized with the audio to be played. Based on this, in this embodiment, after the user performs an action, the displayed image can change, and the position of the virtual sound source in the audio can also change.
  • the electronic device includes a display area; generating a sound source azimuth adjustment instruction based on the action information includes: generating a sound source azimuth adjustment instruction and an image display instruction based on the action information. Playing the audio signal through the corresponding player includes: controlling the corresponding player to play the audio signal and, at the same time, based on the image display instruction, acquiring multiple frames of images displayed synchronously with the audio to be played and displaying them in the display area; the images include a first pixel area related to the virtual sound source, and both the change of the first pixel area in the multi-frame images and the orientation of the virtual sound source in the virtual sound field change with the user's actions.
  • the electronic device of this embodiment can be a VR device.
  • the VR device displays the multi-frame images of the video and simultaneously plays the audio related to the displayed picture; for example, the audio comes from a virtual object in the video.
  • the virtual object occupies the first pixel area in the image.
  • the VR device can adjust the position of the virtual object in the display area based on the user's actions and simultaneously adjust the orientation of the virtual sound source; because the image and the orientation of the sound source are adjusted at the same time, visual VR effects and audio VR effects can be achieved simultaneously.
  • the user wears a VR device and the display area displays a concert stage.
  • the sound source in the audio is the performer.
  • the user can move the performer to the left side of the display area through hand movements.
  • when the VR device detects the user's action, it generates an image display instruction so that, in the newly displayed image, the performer's pixel area is displayed on the left side of the display area, and, through the sound source azimuth adjustment instruction, the orientation of the virtual sound source in the virtual sound field also moves to the left along with the user's action.
  • the electronic device may also adjust the orientation of the virtual sound source based on changes in the image.
  • for example, the electronic device includes a display area, the display area is used to display multi-frame images synchronized with the audio to be played, and each image includes a first pixel area related to the virtual sound source;
  • the method also includes: when displaying the current image, obtaining the position change of the first pixel area in the current image relative to the image before the current image; and, based on the position change, the orientation information of the virtual sound source and the audio to be played, determining an audio signal corresponding to each of the at least two players and controlling the corresponding player to play the audio signal, so that the orientation of the virtual sound source in the virtual sound field changes as the position changes.
  • the embodiment in which the user changes the image and the orientation of the virtual sound source through actions and the embodiment in which an image change causes the orientation of the virtual sound source to change can be used alone or in combination. For example, during video playback, if the user performs no action and only the image changes, the orientation of the virtual sound source can be changed based on the image change; if the user performs an action, both the image and the orientation of the virtual sound source can be changed. This can be determined as needed in practical applications, and this embodiment does not limit it.
  • the determining the audio signal corresponding to each of the at least two players includes: obtaining scene information of the current audio playback scene, determining a binaural transfer function corresponding to the scene information, and determining the audio signal corresponding to each of the at least two players according to the binaural transfer function.
  • the binaural transfer function is also called Head Related Transfer Functions (HRTF), which describes the transmission process of sound waves from the sound source to both ears.
  • when the electronic device is in different scenarios, the propagation process from the sound source to the user's left and right auditory systems has different characteristics. Based on this, a variety of binaural transfer functions can be preset, and when determining the audio signal, the electronic device can select the binaural transfer function consistent with the current scene, so that the processed audio signal matches the current scene and has better sound effects.
  • the scene information includes one or more of the following: audio type information, user type information, time information or environmental information of the user's environment, etc.
  • different audio types can use different sound effect processing, and the type of the audio can be identified through an audio classification algorithm or a trained neural network; different users may have different preferences, so different user types can use different sound effect processing, and the user type information can be determined through user information with the user's authorization and consent; the time information can be obtained by the electronic device through the network or other methods.
  • the environmental information of the user's environment can be determined in a variety of ways: for example, it can be determined through images of the environment collected by the image sensor of the electronic device, or by collecting sound information of the surrounding environment, or by acquiring geographical location information from the electronic device, etc.; this embodiment does not limit this. In practical applications, any one or a combination of the above items of information can be selected to determine the scene information as needed.
  • a variety of binaural transfer functions can be preset, so that the electronic device can select an appropriate binaural transfer function for sound effect processing according to needs during processing.
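A minimal sketch of such a selection follows. The preset HRTF names and the scene fields used as keys are hypothetical assumptions for illustration; the application does not prescribe them:

```python
# Hypothetical table of preset binaural transfer functions (HRTFs), keyed by
# scene information (audio type, environment). Names are illustrative only.
PRESET_HRTFS = {
    ("music", "indoor"): "hrtf_concert_hall",
    ("music", "outdoor"): "hrtf_open_air",
    ("speech", "indoor"): "hrtf_small_room",
}
DEFAULT_HRTF = "hrtf_generic"

def select_hrtf(audio_type: str, environment: str) -> str:
    """Pick the preset binaural transfer function matching the current scene,
    falling back to a generic preset when no match exists."""
    return PRESET_HRTFS.get((audio_type, environment), DEFAULT_HRTF)
```

In practice the table values would be actual HRTF filter sets rather than names, and the key could combine any of the scene items listed above.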
  • the audio to be played has at least two virtual sound sources; the determining the audio signal corresponding to each of the at least two players includes: determining the binaural transfer function corresponding to each of the virtual sound sources, and using the corresponding binaural transfer functions to determine the audio signal corresponding to each of the at least two players, so that different virtual sound sources in the audio signal have different sound effects.
  • different binaural transfer functions can also be used to process different virtual sound sources, so that different virtual sound sources have different sound effects.
  • for example, the audio to be played includes the singing of a singer relatively close in front of the user, and the sound of footsteps, far behind the user, moving from one side of the user to the other.
  • the audio processing of this embodiment can automatically convert ordinary binaural audio signals into VR sound effects, that is, simulate sound effects such as translation and rotation of sounds.
  • the audio stream can be any audio signal that supports binaural playback, and the VR sound effects can be controlled by user actions or user action information set by audio editing software.
  • This embodiment can be applied to electronic devices and audio software with binaural playback functions, such as headphones, mobile phones, cameras, VR devices, wearable devices that support audio playback, etc., as well as software tools with audio editing functions, etc.
  • This embodiment can realize VR sound effects in which the sound follows the movement of the human body. The user can hear different sounds according to different movements and postures, and the changes in the sound are consistent with the movement of the human body.
  • Step 301 Obtain the original audio signal.
  • Step 302 Obtain the user's action information.
  • Step 303 Perform VR sound effect processing on the original audio signal according to the user's action information.
  • Step 304 Output audio signals with VR sound effects.
  • the audio signal originally used for direct playback can be obtained.
  • the audio signal can be derived from the decoded signal of the audio file, or the audio stream signal in the streaming media, etc. If the audio signal is a mono signal, it can be copied into a two-channel signal with the same left and right sides. If the audio signal is a two-channel signal, it can be used directly.
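The mono-to-two-channel duplication described above can be sketched as follows (a trivial helper, shown only to make the step concrete):

```python
def mono_to_stereo(x):
    """Duplicate a mono sample sequence into identical but independent
    left and right channel copies, as described for mono input."""
    return list(x), list(x)
```

A two-channel input would skip this step and be used directly.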
  • in step 302, the user's action information is obtained; as an example, the user's action information can be determined based on the data collected by the action sensor.
  • the action information is related to VR sound effects, and optionally can be six degrees of freedom action information.
  • Six degrees of freedom of motion can simulate human movement in the sound field.
  • the six degrees of freedom include three rotational degrees of freedom (α, β, γ) and three translational degrees of freedom (x, y, z).
  • the number of degrees of freedom may be less than six, for example, there may be only three rotational degrees of freedom, or only one rotational degree of freedom, etc., which is not limited in this embodiment.
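For illustration, the six-degrees-of-freedom action information can be held in a simple container, with a helper that reduces it to a single rotational degree of freedom as the text allows. The field units (radians, metres) and the choice of γ as the retained rotation are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Pose6DoF:
    """Six-degrees-of-freedom action information: three rotations and
    three translations (units assumed: radians and metres)."""
    alpha: float = 0.0  # rotation about X
    beta: float = 0.0   # rotation about Y
    gamma: float = 0.0  # rotation about Z
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0

def yaw_only(pose: Pose6DoF) -> Pose6DoF:
    """Reduce to one rotational degree of freedom, e.g. keep only gamma."""
    return Pose6DoF(gamma=pose.gamma)
```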
  • the user's movements and postures can be obtained through inertial sensors (such as those in VR headsets, mobile phones, smart watches, etc.), through image sensors (such as identifying head movements, hand movements, eye movements, etc. from images), or by acquiring electroencephalogram signals through sensors to obtain movement instructions from the brain; alternatively, movement methods can also be defined through software, and any other way of obtaining external motion and posture can be used.
  • in step 303, VR sound effect processing is performed on the original audio signal according to the user's action information.
  • the audio signal can be a two-channel or multi-channel audio signal.
  • the output audio signal can be played through a player, stored, or sent to other devices to be played by players on those devices, etc.
  • FIG. 3B is a flow chart of an embodiment of the aforementioned step 303, which may include the following embodiments:
  • the audio to be played in this embodiment can be mono or dual-channel. If it is a mono signal, it can be copied into a dual-channel signal.
  • the sampling frequency of the time-domain signal of the audio to be played is f_s.
  • this embodiment uses two examples for description.
  • N sampling points can be extracted from the two-channel time-domain signal x_m(t) at intervals of L sampling points as one frame signal, expressed as x_m(n)_l.
  • N is called the frame length, L is called the frame shift, 0 < L ≤ N, and N can be a power of 2.
  • windowing is performed on the l-th frame of the binaural original time-domain signal x_m(n)_l; the l-th frame of the windowed time-domain signal is x′_m(n)_l = x_m(n)_l · h_ana(n), n = 0, 1, …, N−1.
  • h_ana(n) is the N-point analysis window function.
  • the window functions include a sine window or a Hamming window.
  • an N-point discrete Fourier transform (DFT) is performed on the windowed time-domain signal to obtain the l-th frame complex spectrum X_m(k)_l = Σ_{n=0}^{N−1} x′_m(n)_l · e^{−j2πkn/N}, k = 0, 1, …, N−1.
  • e is the natural constant, and j is the unit imaginary number.
  • the DFT can be implemented by a fast Fourier transform (FFT).
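The analysis steps above (framing, windowing, DFT) can be sketched as follows. The sine window shape and the zero-padding of a short final frame are assumptions consistent with, but not mandated by, the text:

```python
import numpy as np

def analysis_frame(x, l, N, L):
    """Extract the l-th frame of N samples with frame shift L from a channel
    signal x, apply a sine analysis window h_ana(n), and return the N-point
    complex spectrum X(k) via FFT. A short tail frame is zero-padded."""
    frame = np.zeros(N)
    seg = x[l * L : l * L + N]
    frame[:len(seg)] = seg                               # x_m(n)_l
    h_ana = np.sin(np.pi * (np.arange(N) + 0.5) / N)     # sine analysis window
    windowed = frame * h_ana                             # x'_m(n)_l
    return np.fft.fft(windowed)                          # X_m(k)_l
```

Running this per channel and per frame yields the two-channel complex spectra used in the azimuth estimation below.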
  • a virtual static coordinate system O-XYZ can be established to represent the initial posture of the person.
  • the initial posture includes the user's posture at the moment when the solution of this embodiment starts to be executed, such as the moment when the user puts on the headset, or the moment when the device is triggered to start obtaining the audio to be played.
  • the coordinate system can also be established based on other methods such as a player, which is not limited in this embodiment.
  • each discrete frequency can be regarded as an independent sound source distributed in the coordinate system, representing sound arriving from various directions around the person, as shown in Figure 3C, which is an azimuth diagram of the sound source of each discrete frequency in the coordinate system.
  • each dot in Figure 3C represents a discrete frequency, and the sound source azimuth of each frequency can be expressed either in the Cartesian coordinate system or in the spherical coordinate system.
  • the azimuth of each independent sound source in space can be calculated based on the amplitude ratio and phase difference of the two-channel frequency-domain signals, combined with a model of sound propagation to the two ears.
  • the model of sound propagation to the two ears can use the head-related transfer function (HRTF), or the propagation model of sound in a free sound field can be referred to. Since a mono signal is copied and expanded into identical left and right channel signals, all of its sound sources can be considered to come from directly in front of the user.
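A sketch of extracting the raw interaural cues (amplitude ratio and phase difference) from the two-channel complex spectra follows; converting these cues into per-frequency azimuths via an HRTF or free-field model is not shown and would depend on the chosen propagation model:

```python
import numpy as np

def interaural_cues(XL, XR, eps=1e-12):
    """Per-frequency amplitude ratio and phase difference of the left and
    right channel complex spectra. eps guards against division by zero."""
    level_ratio = np.abs(XL) / (np.abs(XR) + eps)
    phase_diff = np.angle(XL * np.conj(XR))  # phase(L) - phase(R)
    return level_ratio, phase_diff
```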
  • the user's action information in this embodiment includes a posture with six degrees of freedom: three rotational degrees of freedom (α, β, γ) and three translational degrees of freedom (x, y, z); in practical applications, fewer than six degrees of freedom may be used, as needed.
  • the data collected by the motion sensor can be used to determine the user's action information, where the user's action information can refer to the user's actual action parameters, that is, the user's actual action information in space.
  • the virtual sound field and the virtual sound source can be defined to be stationary relative to the aforementioned coordinate system O-XYZ, while the person moves relative to the coordinate system. Therefore, the user's actual movement parameters represent the person's movement relative to the virtual sound field.
  • the user's actual movement information needs to be converted into the movement of the person relative to the virtual sound field.
  • for example, the user's hand sliding to the right can be defined as the person moving to the right; it can also be defined as dragging the sound field to the right, that is, the person moving to the left relative to the sound field. Therefore, the same action information of the user can correspond to different motion parameters in the coordinate system. Based on this, this embodiment can predefine the relationship between the user's actual action information and the motion parameters in the coordinate system.
  • the detected actual action information of the user in space can be converted into motion in the coordinate system, that is, the aforementioned six degrees of freedom (α, β, γ) and (x, y, z); the specific way in which the user's actual action information in space is converted into the corresponding six degrees of freedom can be customized as needed in practical applications.
  • This embodiment does not limit this.
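The two conventions described above can be sketched as a single conversion function; the sign convention (positive = rightward) is an assumption for illustration:

```python
def hand_slide_to_motion(dx_hand, convention="move_listener"):
    """Convert a detected hand slide (positive = to the right) into listener
    motion along x in the virtual-sound-field coordinate system, under one of
    the two predefined conventions described above."""
    if convention == "move_listener":
        return dx_hand        # hand right => listener moves right
    if convention == "drag_field":
        return -dx_hand       # hand right => field dragged right => listener moves left
    raise ValueError("unknown convention")
```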
  • based on the azimuth of each discrete-frequency sound source, the transfer functions for propagation to the two ears can be calculated from the head-related transfer function or from the propagation model of the sound source in the free field.
  • the transfer functions of the left and right ears are T_L(k) and T_R(k) respectively; both are complex numbers with amplitude and phase components. If the propagation model of the sound source in the free sound field is adopted, the equivalent distance between the two ears needs to be determined; the distance between the two ears can be preset, or measured by the at least two players, etc.
  • the two-channel signal is synthesized into one main audio signal X_main(k), which can be obtained by linear superposition of the two-channel complex spectra X_m(k)_l.
  • after the azimuth processing, the l-th frame time-domain signal of each channel can be generated by an N-point inverse discrete Fourier transform of the processed complex spectrum, multiplied by a synthesis window: x″_m(n)_l = h_syn(n) · (1/N) Σ_{k=0}^{N−1} X̂_m(k)_l · e^{j2πkn/N}, where X̂_m(k)_l denotes the processed l-th frame complex spectrum of channel m.
  • h syn (n) is a synthetic window function.
  • the synthetic window function may include a sine window or a Hamming window.
  • the N-point time-domain signals of the l-th frame of the left and right channels are then overlap-accumulated with the previous frames, respectively.
  • the overlap-accumulated M-point output signals of the l-th frame of the left and right channels are the first M elements of x‴_L(n)_l and x‴_R(n)_l respectively.
  • x_L(n)_l and x_R(n)_l are the left and right channel output audio signals of the l-th frame, which have VR sound effects that move with the person.
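The overlap-accumulation step above can be sketched as follows, assuming the hop between synthesis frames equals the frame shift:

```python
import numpy as np

def overlap_add(frames, hop):
    """Overlap-accumulate a list of equal-length synthesis frames with the
    given hop (frame shift), a sketch of the reconstruction step above."""
    N = len(frames[0])
    out = np.zeros(hop * (len(frames) - 1) + N)
    for l, frame in enumerate(frames):
        out[l * hop : l * hop + N] += frame
    return out
```

With a matched analysis/synthesis window pair (e.g. two sine windows), this reconstruction is free of blocking artifacts at frame boundaries.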
  • this embodiment enables any mono or two-channel sound source to have, during playback, a VR sound effect that follows the person's movement; for example, while moving, the person can experience an absolutely still virtual sound field. For instance, if the sound is initially in front, then when the person turns to the right, the sound turns to the left relative to the person, giving the user the illusion that the sound is stationary in a virtual space.
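The "stationary sound field" illusion described above amounts to rotating each source azimuth by the negative of the head yaw; the sketch below assumes azimuth and yaw share the same angular sign convention:

```python
import math

def relative_azimuth(az_source, yaw_head):
    """Azimuth of a space-fixed source relative to the listener after a head
    yaw: turning the head one way makes the source appear to turn the other
    way. Result is wrapped to [-pi, pi)."""
    a = az_source - yaw_head
    return (a + math.pi) % (2 * math.pi) - math.pi
```

For a source initially straight ahead (azimuth 0), a head turn of +π/6 yields a relative azimuth of −π/6, matching the behavior described in the text.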
  • the above audio processing method embodiments can be implemented by software, or can be implemented by hardware or a combination of software and hardware.
  • taking software implementation as an example, as a device in the logical sense, it is formed by the processor of the device where it is located reading the corresponding computer program instructions from a non-volatile memory into memory and running them.
  • FIG. 4 is a hardware structure diagram of the audio processing device 400 that implements the audio processing method of this embodiment.
  • in addition, the audio processing device used to implement the audio processing method may usually also include other hardware according to its actual functions, which will not be described again.
  • the processor 401 implements the following steps when executing the computer program:
  • an audio signal corresponding to each of the at least two players is determined, and the audio signal is played through the corresponding player.
  • the audio to be played has at least two virtual sound sources, and the type information of the at least two virtual sound sources is different;
  • the processor 401 executes the generation of sound source orientation adjustment instructions based on the action information, including:
  • a sound source orientation adjustment instruction for adjusting the orientation of each virtual sound source in the at least two virtual sound sources is generated, so that, along with the user's actions, virtual sound sources with different type information have different amounts of change in their orientation in the virtual sound field.
  • the type information includes one or more of the following:
  • the electronic device includes a display area
  • the processor 401 executes the generation of a sound source orientation adjustment instruction based on the action information, including: generating a sound source orientation adjustment instruction and an image display instruction based on the action information;
  • the processor 401 executes the step of playing the audio signal through the corresponding player, including:
  • the corresponding player is controlled to play the audio signal and, at the same time, multiple frames of images displayed synchronously with the audio to be played are obtained based on the image display instruction and displayed in the display area; wherein each image includes the first pixel area related to the virtual sound source, and both the first pixel area in the multi-frame images and the orientation of the virtual sound source in the virtual sound field change with the user's actions.
  • the electronic device includes a display area, the display area is used to display a multi-frame image synchronized with the audio to be played, the image includes a first pixel area related to the virtual sound source;
  • the processor 401 also performs:
  • an audio signal corresponding to each of the at least two players is determined, and the corresponding player is controlled to play the audio signal.
  • the user's action information includes the change amount of the user's action.
  • the processor 401 executes the generation of sound source orientation adjustment instructions based on the action information, including:
  • a sound source orientation adjustment instruction for adjusting the amount of change of the virtual sound source in the virtual sound field is generated; wherein the first mapping relationship includes: the corresponding relationship between the amount of change of the user's action and the amount of change of the virtual sound source's orientation in the virtual sound field.
  • the first mapping relationship is preset, or generated by obtaining the user's setting instructions, or determined based on the user's historical behavior data.
  • the orientation information of the virtual sound source in the virtual sound field in the audio to be played is determined through the first coordinate system
  • the change amount of the user action is determined through the first coordinate system.
  • the first coordinate system is a coordinate system established based on the user's initial posture using the at least two players.
  • the electronic device is a wearable electronic device, and the motion sensor includes an inertial measurement sensor;
  • the processor 401 executes the method of using the motion sensor to obtain the user's motion information, including:
  • based on the data collected by the inertial measurement sensor, the user's action information is determined.
  • the electronic device is a wearable electronic device
  • the motion sensor includes a first image sensor; when the wearable electronic device is worn on the user's head, the first image sensor faces the user's eyes;
  • the processor 401 executes the method of using the motion sensor to obtain the user's motion information, including:
  • the electronic device is a wearable electronic device
  • the motion sensor includes one or more second image sensors; when the wearable electronic device is worn on the user's head, the observation range of the one or more second image sensors covers the activity space of the user's hands;
  • the processor 401 executes the method of using the motion sensor to obtain the user's motion information, including:
  • the processor 401 performs the determining the audio signal corresponding to each of the at least two players, including:
  • Scene information of the current audio playback scene is obtained, a binaural transfer function corresponding to the scene information is determined, and an audio signal corresponding to each of the at least two players is determined based on the binaural transfer function.
  • the scene information includes one or more of the following: audio type information, user type information, time information, or environmental information of the user's environment.
  • the audio to be played has at least two virtual sound sources
  • the processor 401 performs the determination of the audio signal corresponding to each player in the at least two players, including:
  • the orientation information of the virtual sound source in the virtual sound field of the audio to be played is obtained by extracting multiple discrete frequencies from the audio to be played and is determined based on the orientation information of the multiple discrete frequencies in the virtual sound field.
  • the audio to be played is an audio signal of at least two channels; the orientation information of the multiple discrete frequencies in the virtual sound field is obtained in the following manner:
  • the orientation information of the multiple discrete frequencies in the virtual sound field is obtained based on the first coordinate system and the binaural transfer function.
  • an embodiment of the application further provides an electronic device 500, including: at least two players 510 and a motion sensor 520; the electronic device also includes a processor 530, a memory 540, and a computer program executable by the processor;
  • an audio signal corresponding to each of the at least two players is determined, and the audio signal is played through the corresponding player.
  • the audio to be played has at least two virtual sound sources, and the type information of the at least two virtual sound sources is different;
  • the processor executes the generation of sound source orientation adjustment instructions based on the action information, including:
  • a sound source orientation adjustment instruction for adjusting the orientation of each virtual sound source in the at least two virtual sound sources is generated, so that, along with the user's actions, virtual sound sources with different type information have different amounts of change in their orientation in the virtual sound field.
  • the type information includes one or more of the following:
  • the electronic device includes a display area
  • the processor executes the generation of a sound source orientation adjustment instruction based on the action information, including: generating a sound source orientation adjustment instruction and an image display instruction based on the action information;
  • the processor executes the step of playing the audio signal through the corresponding player, including:
  • the corresponding player is controlled to play the audio signal and, at the same time, multiple frames of images displayed synchronously with the audio to be played are obtained based on the image display instruction and displayed in the display area; wherein each image includes the first pixel area related to the virtual sound source, and both the first pixel area in the multi-frame images and the orientation of the virtual sound source in the virtual sound field change with the user's actions.
  • the electronic device includes a display area, the display area is used to display a multi-frame image synchronized with the audio to be played, the image includes a first pixel area related to the virtual sound source;
  • the processor also executes:
  • an audio signal corresponding to each of the at least two players is determined, and the corresponding player is controlled to play the audio signal.
  • the user's action information includes the change amount of the user's action
  • the processor executes the generation of sound source orientation adjustment instructions based on the action information, including:
  • a sound source orientation adjustment instruction for adjusting the amount of change of the virtual sound source in the virtual sound field is generated; wherein the first mapping relationship includes: the corresponding relationship between the amount of change of the user's action and the amount of change of the virtual sound source's orientation in the virtual sound field.
  • the first mapping relationship is preset, or generated by obtaining the user's setting instructions, or determined based on the user's historical behavior data.
  • the orientation information of the virtual sound source in the virtual sound field in the audio to be played is determined through the first coordinate system
  • the change amount of the user action is determined through the first coordinate system.
  • the first coordinate system is a coordinate system established based on the user's initial posture using the at least two players.
  • the electronic device is a wearable electronic device, and the motion sensor includes an inertial measurement sensor;
  • the processor executes the method of using the motion sensor to obtain the user's motion information, including:
  • based on the data collected by the inertial measurement sensor, the user's action information is determined.
  • the electronic device is a wearable electronic device
  • the motion sensor includes a first image sensor; when the wearable electronic device is worn on the user's head, the first image sensor faces the user's eyes;
  • the processor executes the method of using the motion sensor to obtain the user's motion information, including:
  • the electronic device is a wearable electronic device
  • the motion sensor includes one or more second image sensors; when the wearable electronic device is worn on the user's head, the observation range of the one or more second image sensors covers the activity space of the user's hands;
  • the processor executes the method of using the motion sensor to obtain the user's motion information, including:
  • the processor performing the determining an audio signal corresponding to each of the at least two players includes:
  • Scene information of the current audio playback scene is obtained, a binaural transfer function corresponding to the scene information is determined, and an audio signal corresponding to each of the at least two players is determined based on the binaural transfer function.
  • the scene information includes one or more of the following: audio type information, user type information, time information, or environmental information of the user's environment.
  • the audio to be played has at least two virtual sound sources
  • the processor performs the determining of an audio signal corresponding to each of the at least two players, including:
  • the orientation information of the virtual sound source in the virtual sound field of the audio to be played is obtained by extracting multiple discrete frequencies from the audio to be played and is determined based on the orientation information of the multiple discrete frequencies in the virtual sound field.
  • the audio to be played is an audio signal of at least two channels; the orientation information of the multiple discrete frequencies in the virtual sound field is obtained in the following manner:
  • the orientation information of the multiple discrete frequencies in the virtual sound field is obtained based on the first coordinate system and the binaural transfer function.
  • Embodiments of this specification also provide a computer-readable storage medium, which stores a number of computer instructions. When executed, the computer instructions implement the steps of the audio processing method described in any embodiment.
  • Embodiments of the present description may take the form of a computer program product implemented on one or more storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having program code embodied therein.
  • Storage media available for computers include permanent and non-permanent, removable and non-removable media, and can be implemented by any method or technology to store information.
  • Information may be computer-readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to: phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, Magnetic tape cassettes, tape magnetic disk storage or other magnetic storage devices or any other non-transmission medium can be used to store information that can be accessed by a computing device.
  • As for the device embodiments, since they basically correspond to the method embodiments, refer to the partial description of the method embodiments for relevant details.
  • The device embodiments described above are only illustrative.
  • The units described as separate components may or may not be physically separated.
  • The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. Persons of ordinary skill in the art can understand and implement this without creative effort.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

Provided in one or more embodiments of the present application are an audio processing method and apparatus, an electronic device, and a computer-readable storage medium. The audio processing method is applied to the electronic device, the electronic device comprising at least two players and a motion sensor. The method comprises: acquiring, in respect of an audio to be played, azimuth information of a virtual sound source in a virtual sound field, the virtual sound field being established on the basis of a positional relationship between the at least two players and the left and right auditory organs of a user; acquiring motion information of the user by means of the motion sensor, and generating a sound source azimuth adjustment instruction on the basis of the motion information; and, on the basis of the sound source azimuth adjustment instruction, the azimuth information of the virtual sound source, and said audio, determining an audio signal corresponding to each of the at least two players, and playing the audio signals by means of the corresponding players.

Description

Audio processing method, apparatus, electronic device, and computer-readable storage medium

Technical Field
The present application relates to the field of audio processing technology and, specifically, to an audio processing method, apparatus, electronic device, and computer-readable storage medium.
Background Art
With the continuous advancement of audio processing technology, users' expectations for their audio listening experience keep rising. Existing binaural audio playback solutions generally output the original audio signal directly through the player, so every user hears the same sound; such solutions can no longer meet users' personalized needs.
Summary of the Invention
In view of this, the present application provides an audio processing method, apparatus, electronic device, and computer-readable storage medium to solve the technical problem in the related art that audio processing cannot meet users' personalized needs.
In a first aspect, an audio processing method is provided. The method is applied to an electronic device that includes at least two players and a motion sensor, and the method includes:
obtaining the azimuth information of a virtual sound source of the audio to be played in a virtual sound field, the virtual sound field being established based on the positional relationship between the at least two players and the user's left and right hearing organs;
using the motion sensor to obtain the user's motion information, and generating a sound source azimuth adjustment instruction based on the motion information; and
determining, based on the sound source azimuth adjustment instruction, the azimuth information of the virtual sound source, and the audio to be played, an audio signal corresponding to each of the at least two players, and playing the audio signals through the corresponding players.
In a second aspect, an audio processing apparatus is provided. The apparatus includes a processor, a memory, and a computer program stored on the memory and executable by the processor. When the processor executes the computer program, the method embodiments described in the first aspect are implemented.
In a third aspect, an electronic device is provided. The electronic device includes at least two players and a motion sensor, and further includes a processor, a memory, and a computer program stored on the memory and executable by the processor.
When the processor executes the computer program, the method embodiments described in the first aspect are implemented.
In a fourth aspect, a computer-readable storage medium is provided. Several computer instructions are stored on the computer-readable storage medium, and when the computer instructions are executed, the method embodiments described in the first aspect are implemented.
By applying the solution provided by this application, a virtual sound field can be established based on the positional relationship between the at least two players and the user's left and right hearing organs, so that the azimuth information of the virtual sound source of the audio to be played in the virtual sound field can be obtained. In addition, the motion sensor can be used to obtain the user's motion information and generate a sound source azimuth adjustment instruction based on it; this instruction can then be used to adjust the azimuth of the virtual sound source, determine new audio signals, and play them through the at least two players. In this way, when the user moves, the user hears a sound source whose azimuth changes, that is, the user perceives the azimuth of the sound source as following his or her own motion. The audio processing solution of this embodiment therefore generates audio whose sound effects match the user's motion, achieving a realistic spatial audio experience.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Figure 1 is a schematic diagram of a real concert scene according to an embodiment of the present application.

Figure 2 is a schematic flowchart of an audio processing method according to an embodiment of the present application.

Figure 3A is a schematic flowchart of an audio processing method according to another embodiment of the present application.

Figure 3B is a schematic flowchart of step 303 according to an embodiment of the present application.

Figure 3C is a schematic diagram of the azimuths of the discrete frequencies in a coordinate system according to an embodiment of the present application.

Figure 3D is a schematic diagram of user motion changes represented in a coordinate system according to an embodiment of the present application.

Figure 4 is a hardware structure diagram of an audio processing apparatus according to an embodiment of the present application.

Figure 5 is a hardware structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application.
Ordinary binaural audio playback solutions do not process the original audio signal; they simply play it back, so the user only experiences a static effect. For example, if a user puts on headphones to listen to music, every user hears the same music at any time. With the rise of personalized needs and VR (Virtual Reality) devices, static sound effects increasingly fail to meet users' needs and expectations.
As shown in Figure 1, take a real concert scene as an example. The user faces the concert stage: to the user's front left a performer is striking a gong, and to the front right a performer is beating a drum. The gong and drum sounds reach the user's two ears from the front left and front right respectively, and the user perceives them in those directions. When the user turns to the right, the stage is now on the user's left: the drum is to the front left and the gong is to the rear left. After the turn, the positions of the user's left and right ears relative to the gong and drum have changed; from the user's perspective, the gong and drum occupy new positions, and the human auditory system perceives this change in the orientation of the gong and drum sounds relative to the listener.
However, when the user listens to music through headphones, existing audio processing solutions cannot reproduce the above changes, because the played audio signal is pre-recorded. If, during recording, the position of the recording device is fixed, with the gong and drum sounds fixed at the device's front left and front right, then a user listening through headphones always perceives the gong and drum sounds as coming from the front left and front right; when the user turns sideways, he cannot perceive the change in the orientation of the gong and drum sounds relative to himself as in a real concert. If the position of the recording device is not fixed, for example the device moves during recording so that the gong and drum sounds, as sound sources, change position relative to it, or, equivalently, the device is fixed while the sound sources move, then the recorded audio does let the listener perceive gong and drum sounds with changing positions. Although this achieves a dynamic sound effect, it must be produced at recording time by moving the recording device or the sound sources; once recording is finished, the changes in sound source position in the audio are fixed, and the dynamic effect does not match the user's motion while listening to the audio signal.
Based on this, Figure 2 is a flowchart of an audio processing method according to an exemplary embodiment of the present application. The method can be applied to an electronic device that includes at least two players and a motion sensor, and the method includes the following steps:
In step 202, the azimuth information of a virtual sound source of the audio to be played in a virtual sound field is obtained, the virtual sound field being established based on the positional relationship between the at least two players and the user's left and right hearing organs.
In step 204, the motion sensor is used to obtain the user's motion information, and a sound source azimuth adjustment instruction is generated based on the motion information.
In step 206, based on the sound source azimuth adjustment instruction, the azimuth information of the virtual sound source, and the audio to be played, an audio signal corresponding to each of the at least two players is determined, and the audio signals are played through the corresponding players.
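As a concrete illustration of steps 204-206, the sketch below shifts a source azimuth by the user's head yaw and derives one signal per player. The constant-power panning used here is an illustrative assumption; the patent does not fix a particular rendering algorithm, and a full implementation would typically render with HRTFs instead.

```python
import numpy as np

def render_frame(frame, source_azimuth_deg, user_yaw_deg):
    """Steps 204-206 in miniature: shift the virtual source opposite to the
    user's head turn, then split one mono frame into two player signals.
    Constant-power panning stands in for a full HRTF renderer (assumption)."""
    # A head turn of +yaw makes a world-fixed source appear at azimuth - yaw.
    relative_az = source_azimuth_deg - user_yaw_deg
    # Map azimuth in [-90, 90] degrees (full left .. full right) to pan in [0, 1].
    pan = (np.clip(relative_az, -90.0, 90.0) + 90.0) / 180.0
    left = frame * np.cos(pan * np.pi / 2.0)
    right = frame * np.sin(pan * np.pi / 2.0)
    return left, right
```

With this convention, a source at -90 degrees feeds only the left player, and turning the head 90 degrees to the right moves a front source fully into the left player, which is the "sound follows the listener" effect the method describes.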
The electronic device in this embodiment can be any device with audio processing capabilities, such as a smartphone, a camera device, a VR device, a wearable device supporting audio playback, or a computer. The electronic device is connected to audio players, that is, the at least two players, such as ordinary headphones, bone conduction headphones, or speakers. The motion sensor can be any sensor whose collected data can detect the user's motion, such as an inertial measurement unit, a vision sensor (monocular or binocular), or a lidar.
In some examples, the at least two players and the motion sensor are both mounted on the device that performs the above method embodiments. For example, a wearable device equipped with players and a motion sensor can obtain the audio to be played, determine the user's motion information from the data collected by the motion sensor, process the audio, and control the players to play the processed audio signals.
In other examples, the players and the motion sensor are separately connected to the device that performs the above method embodiments. For example, they may be connected to a mobile terminal, which obtains the audio to be played, determines the user's motion information from the data collected by the motion sensor, processes the audio, and plays the processed audio signals through the players. Playback may occur as soon as the audio is processed, or the processed audio may be stored and played later when needed. Alternatively, if the device performing the method is not directly connected to the players, it can send the processed audio to another device that is connected to the players, and that other device controls the players to play it.
In this embodiment, a virtual sound field can be established based on the positional relationship between the at least two players and the user's left and right hearing organs, so that the azimuth information of the virtual sound source of the audio to be played in the virtual sound field is obtained. The motion sensor is used to obtain the user's motion information and generate a sound source azimuth adjustment instruction; the instruction is then used to adjust the azimuth of the virtual sound source, determine new audio signals, and play them through the at least two players. The user thus hears a sound source whose azimuth changes as the user moves, and perceives the azimuth of the sound source as following his or her own motion. The audio processing solution of this embodiment can therefore generate audio whose sound effects match the user's motion, achieving a realistic spatial audio experience.
For example, when a user listens to music through headphones and turns to the right, the device obtains the motion information of the right turn and, through the above solution, adjusts the azimuth information of the virtual sound source in the audio signal, generating a dynamic audio signal in real time. The azimuth of the virtual sound source relative to the user changes with the user's motion, realizing a VR sound effect in which the sound follows the listener, so the user can feel the sound source changing in real time with his or her own movements. In addition, this embodiment can provide personalized sound effects: the user can control the azimuth change of the virtual sound source through motion, that is, the user can customize the correspondence between his or her motions and the azimuth changes of the virtual sound source, and thus flexibly adjust the azimuth of the virtual sound source according to his or her own needs.
Regarding step 202: in some examples, a sound field is the region of a medium in which sound waves exist. The physical quantities describing a sound field, such as sound pressure, particle velocity, displacement, or medium density, are generally functions of position and time, and the relationship between their spatial and temporal variations is described by the acoustic wave equation. A sound field in a uniform, isotropic medium in which boundary effects are negligible is called a free sound field. In this embodiment, audio is played by the players, and the virtual sound field can be established based on the positional relationship between the at least two players and the user's left and right hearing organs; the azimuth information describes where the virtual sound source is located in the virtual sound field. As an example, a coordinate system can be established based on the user, for instance with its origin on the user; establishing the coordinate system based on the players is also optional, and in practice the choice can be made flexibly as needed. The azimuth information of the virtual sound source is then described by its coordinates in that coordinate system.
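One concrete choice of such a user-centred coordinate system is sketched below; the particular axes and origin (x forward, y left, z up, origin at the head) are assumptions for illustration, since the patent leaves the frame to the implementer. A source described by azimuth, elevation, and distance is converted to Cartesian coordinates:

```python
import math

def spherical_to_cartesian(azimuth_deg, elevation_deg, distance):
    """Convert a source position in a head-centred spherical frame
    (assumed convention: x forward, y left, z up; azimuth measured
    counter-clockwise from straight ahead) into Cartesian coordinates."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance * math.cos(el) * math.cos(az)  # forward component
    y = distance * math.cos(el) * math.sin(az)  # leftward component
    z = distance * math.sin(el)                 # upward component
    return x, y, z
```

For instance, a source straight ahead at 1 m maps to (1, 0, 0), and a source 90 degrees to the left at 2 m maps to (0, 2, 0) in this convention.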
In some examples, the virtual sound source can be determined by signal recognition of the audio to be played: after the virtual sound source in the audio is identified, its azimuth information in the virtual sound field is determined; in practice this can be implemented with audio processing algorithms or a trained neural network. In other examples, it can be implemented by frequency-domain analysis: a frequency-domain signal is obtained from the audio to be played, the azimuth of each discrete frequency is extracted from it, and in subsequent steps the azimuth of each discrete frequency is adjusted so that the azimuth of the virtual sound source in the virtual sound field changes with the user's motion. As an example, the azimuth information of the virtual sound source of the audio to be played in the virtual sound field is determined by extracting multiple discrete frequencies from the audio to be played and using the azimuth information of those discrete frequencies in the virtual sound field; that is, in this embodiment the sound signal produced by a real sound source can be decomposed into multiple discrete frequencies.
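The extraction of "multiple discrete frequencies" can be read, for instance, as a DFT of one audio frame, with each bin then treated as an independent source. The windowing and frame length below are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def discrete_frequencies(frame, sample_rate):
    """Decompose one audio frame into discrete frequency components via
    the real FFT. Each returned bin can be treated as an independent
    virtual source whose azimuth is adjusted separately. The Hann window
    (to reduce spectral leakage) is an assumed implementation detail."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return freqs, spectrum
```

A 1024-sample frame at 48 kHz yields 513 bins spaced 46.875 Hz apart; a pure tone placed exactly on a bin shows up as a peak at that bin's frequency.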
In some examples, the audio to be played is an audio signal of at least two channels, and the azimuth information of the multiple discrete frequencies in the virtual sound field is obtained as follows: based on the amplitude ratio and/or phase difference of the at least two channels, using a first coordinate system and a binaural transfer function. The first coordinate system is established based on the positional relationship between the at least two players and the user's left and right hearing organs; the component at each discrete frequency can be treated as an independent sound source distributed in this coordinate system, representing the various azimuths of the sources. Since there are usually at least two players, the audio to be played is usually an audio signal of at least two channels, and the spatial azimuth of each independent source can be computed from the amplitude ratio and phase difference of the two-channel frequency-domain signals combined with a model of sound propagation to the two ears. Optionally, that propagation model can be implemented with a binaural transfer function, also called Head-Related Transfer Functions (HRTF), which describe the transmission of sound waves from a source to the two ears; a model of sound propagation in a free sound field may also be used. If the audio to be played is a mono signal, it can be duplicated into identical left and right channels.
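A minimal sketch of the per-frequency cues mentioned above: the amplitude ratio (level difference) and phase difference between two channels, computed per DFT bin. Turning these cues into an actual azimuth additionally requires a propagation model such as an HRTF set, which is not reproduced here:

```python
import numpy as np

def interaural_cues(left_frame, right_frame):
    """Per-frequency interaural cues from one stereo frame: the channel
    level difference in dB and the wrapped phase difference in radians.
    A propagation model (e.g. an HRTF set) would map these to azimuths."""
    eps = 1e-12  # avoid log/division problems in silent bins
    L = np.fft.rfft(left_frame)
    R = np.fft.rfft(right_frame)
    level_diff_db = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    phase_diff = np.angle(L * np.conj(R))  # wrapped to [-pi, pi]
    return level_diff_db, phase_diff
```

For a tone whose right channel is half the amplitude of the left, the level difference at the tone's bin is 20*log10(2), about 6 dB, with zero phase difference.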
In practical applications, there are many ways to obtain the user's motion information with the motion sensor. For example, the motion sensor may include a sensor worn on the user's body and connected to the electronic device; the data it collects is sent to the electronic device, which determines the user's motion information. Other examples include one or more of the following implementations:
For example, the electronic device is a wearable electronic device, and the motion sensor includes an inertial measurement unit (IMU) built into the device. Obtaining the user's motion information with the motion sensor then includes: determining the user's motion information from the measurement data of the inertial measurement unit.
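A minimal sketch of deriving head-turn motion information from IMU data, assuming the z axis is the vertical (yaw) axis: integrate the angular rate into a heading. Real wearables would fuse accelerometer or magnetometer data to fight gyroscope drift, which is omitted here:

```python
import numpy as np

def integrate_yaw(gyro_z_rad_s, dt, yaw0=0.0):
    """Integrate z-axis angular-rate samples (rad/s) at a fixed sample
    interval dt into a yaw trajectory; yaw[i] is the heading after
    sample i. Drift correction via sensor fusion is deliberately omitted."""
    return yaw0 + np.cumsum(np.asarray(gyro_z_rad_s) * dt)
```

For instance, a constant rate of 0.5 rad/s sampled 10 times at 0.1 s intervals integrates to a 0.5 rad head turn.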
Alternatively, the electronic device is a wearable electronic device, and the motion sensor includes a first image sensor; that is, the device carries one or more first image sensors that face the user's eyes when the device is worn on the user's head. Obtaining the user's motion information with the motion sensor then includes: acquiring the images collected by the first image sensor and obtaining the motion information of the user's eyes from those images. Optionally, this can be implemented in various ways, for example with an eye-tracking algorithm or a trained neural network.
Alternatively, the electronic device is a wearable electronic device, and the motion sensor includes one or more second image sensors whose observation range, when the device is worn on the user's head, covers the space in which the user's hands move. Obtaining the user's motion information with the motion sensor then includes: acquiring the images collected by the one or more second image sensors and, if a user's hand is detected in an image, obtaining the motion information of the hand. Optionally, this can be implemented in various ways, for example with a hand recognition algorithm or a trained neural network.
In practical applications, the user's motion information can be determined by any one of the above methods or by a combination of at least two of them; this embodiment does not limit this. As an example, since the audio to be played has a certain playback duration, one implementation can be used to determine the user's motion information in some time periods and another implementation in other periods. For instance, when the user's eyes move, the first image sensor of the foregoing embodiment collects images of the user's eyes, the electronic device determines the eye motion information from the images, and adjusts the azimuth of the virtual sound source in the current audio accordingly; later, when the user moves a hand, the second image sensor collects images of the hand, and the device adjusts the azimuth based on the hand motion information. During playback the user may produce different types of motion information, and by combining multiple ways of capturing user motion, the azimuth of the virtual sound source in the virtual sound field can vary with the user's different types of motion.
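The switching between capture methods over time could look like the toy dispatcher below; the priority order (eyes before hands) is an assumption, since the text only says several methods may be combined:

```python
def pick_motion_source(eye_motion, hand_motion):
    """Select which sensor's reading drives the azimuth adjustment for the
    current period. Inputs are the latest detected motion values, or None
    when that sensor saw no motion; the eyes-first priority is assumed."""
    if eye_motion is not None:
        return ("eye", eye_motion)
    if hand_motion is not None:
        return ("hand", hand_motion)
    return ("none", 0.0)  # no motion detected this period
```

The tag in the returned pair lets downstream code apply a per-body-part mapping from motion to azimuth change.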
Optionally, the user's motion information may include motions of various body parts, such as any one or more of hand, head, eye, or leg motions; in practice this can be set as needed, or set by the user. Optionally, the user's motion information may be a vector describing the change of the user's motion, and may include multiple types of information, such as any one or more of the direction, distance, or speed of the motion.
Optionally, the user's motion information can be computed within a set coordinate system. The motion information may also include information describing the change of the user's motion, that is, the change of the current motion relative to some reference, which can be realized in various ways: the previous sampling period, a set moment before the current moment, a set number of sampling periods before the current moment, or a computation within the aforementioned set coordinate system. That coordinate system can be chosen flexibly as needed: for example, a coordinate system established from the user's initial posture when starting to use the players, a coordinate system established from the pose of the players while in use, or another custom coordinate system; this embodiment does not limit it.
After the user's action information is obtained, a sound source orientation adjustment instruction can be generated based on the action information. In practical applications, in some examples, the correspondence between the user's action information and the sound source orientation adjustment instruction may be preset by a technician; that is, which action information generates which sound source orientation adjustment instruction is decided by the technician. In other examples, it may be decided by the user: for example, a setting function can be provided through which the user sets the action information of one or more body parts used to trigger the sound source orientation adjustment instruction, or sets any one or more of direction information, distance information, or speed information to describe the user's action information.
In some examples, the amount of change of the user's action can be determined from the user's action information; based on this amount of change and a first mapping relationship, a sound source orientation adjustment instruction is generated for adjusting the amount of change of the virtual sound source's orientation in the virtual sound field. The first mapping relationship includes the correspondence between the amount of change of the user's action and the amount of change of the virtual sound source's orientation in the virtual sound field. That is, in this embodiment these two quantities correspond: through the first mapping relationship, the change in the user's action can be quickly used to determine the change in the virtual sound source's orientation, which improves the speed of audio processing and makes the orientation change of the virtual sound source correspond to the change in the user's action. For example, a larger user action produces a larger change in the virtual sound source's orientation in the virtual sound field; the two are positively correlated.
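As an illustrative sketch (not the patent's own implementation), a first mapping relationship can be as simple as a per-body-part gain between the action change and the azimuth change; the body-part names and gain values below are assumptions:

```python
# Hypothetical "first mapping relationship": amount of user-action change
# -> amount of virtual-source azimuth change, via a per-body-part gain.
FIRST_MAPPING = {
    "hand": 1.0,   # degrees of azimuth per degree of hand sweep (assumed)
    "head": 0.5,   # head motion moves the source half as much (assumed)
}

def azimuth_delta(body_part: str, action_delta_deg: float) -> float:
    """Azimuth change of the virtual source for a given user-action change."""
    return FIRST_MAPPING[body_part] * action_delta_deg
```

A larger action yields a proportionally larger azimuth change, matching the positive correlation described above; swapping the table per user type or per audio type gives the different preset mappings mentioned later.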
In some examples, the orientation information of the virtual sound source in the virtual sound field of the audio to be played may be determined through a first coordinate system, and the amount of change of the user's action may be determined through the same first coordinate system; that is, both are determined in the same coordinate system. In this embodiment, to make the orientation of the virtual sound source in the virtual sound field change with the user's actions, determining the change in the user's action and the orientation information of the virtual sound source in the same first coordinate system allows the relationship between the two to be established quickly and accurately, which facilitates building the mapping relationship. Optionally, the first coordinate system may include a coordinate system established based on the user's initial posture when using the at least two players. As an example, this coordinate system represents the user's initial posture; optionally, the virtual sound field and the virtual sound source can be defined as stationary relative to this coordinate system, while the person moves relative to it. The user's actual action parameters in space therefore represent the person's motion relative to the virtual sound field, and the user's actual action information is converted into motion of the person relative to the virtual sound field. For example, a rightward slide of the user's hand can be defined as the user dragging the sound field to the right, that is, the person moving to the left relative to the sound field.
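A minimal sketch of this conversion, under the assumption that the field is static in the first coordinate system: the source position relative to the person is obtained by subtracting the person's displacement (a rightward drag of the field is equivalent to a leftward displacement of the person).

```python
def source_relative_to_user(source_pos, user_disp):
    # The virtual field and source are static in the first coordinate
    # system; subtracting the person's displacement gives the source
    # position as heard by the (moved) listener.
    return tuple(s - d for s, d in zip(source_pos, user_disp))

# Dragging the field 0.5 m to the right = the person moving 0.5 m left,
# i.e. user_disp = (-0.5, 0, 0): the source appears 0.5 m further right.
```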
In practical applications, the first mapping relationship can be implemented in multiple ways. In some examples, the first mapping relationship is preset: as examples, a single first mapping relationship may be preset to apply to all users, different first mapping relationships may be set for different user types, or different first mapping relationships may be set according to various factors such as audio type and application scenario. During implementation, the electronic device can select a suitable first mapping relationship from the preset ones according to one or more of these factors. For example, when users of different ages, genders, or heights perform the same action, the electronic device of this embodiment can generate different audio signals based on different first mapping relationships; that is, the change in the virtual sound source's orientation in the virtual sound field heard by different users may differ.
In other examples, the first mapping relationship may be generated by obtaining the user's setting instructions. For example, the electronic device implementing this embodiment, or another electronic device connected to it, can provide a setting function through a user interface; the user's setting instructions are obtained through this function, and the first mapping relationship is then generated from them. On this basis, this embodiment provides a custom-mapping function that lets users define the mapping relationship they need. Different users can therefore have different first mapping relationships: for the same audio, when different users perform the same action, the generated audio signals differ, and the change in the virtual sound source's orientation in the virtual sound field heard by different users differs.
In other examples, the first mapping relationship may also be determined from the user's historical behavior data. For example, with the user's authorization and consent, the user's historical behavior data is obtained; it may include one or more kinds of information such as the user's historical action information and/or information about audio the user has historically played. From this, personalized information such as the user's action characteristics or audio preferences is analyzed, and a first mapping relationship that matches the user's preferences is generated.
In practical applications, the first mapping relationship may be determined by any one of the above methods, or by a combination of at least two of them; this embodiment does not limit this. For example, for a given user, if no historical behavior data has been obtained when this embodiment is first implemented, the first mapping relationship can be determined by the aforementioned preset method. Afterwards, the user's preference data for the orientation-adjusted audio (for example, the user's evaluations of the adjusted audio) and the user's ongoing historical behavior data can be collected to generate a new first mapping relationship, which is then used to implement the solution of this embodiment.
In this embodiment, the audio signal corresponding to each of the at least two players can be determined based on the sound source orientation adjustment instruction, the orientation information of the virtual sound source, and the audio to be played, and the audio signal is played through the corresponding player, so that the orientation of the virtual sound source in the virtual sound field changes with the user's actions.
In some examples, the audio to be played has at least two virtual sound sources, and the type information of the at least two virtual sound sources differs. On this basis, in this embodiment, generating the sound source orientation adjustment instruction based on the action information may include: based on the action information, generating a sound source orientation adjustment instruction that adjusts the orientation of each of the at least two virtual sound sources, so that, with the user's action, virtual sound sources with different type information have different amounts of orientation change in the virtual sound field. As an example, the sound source orientation adjustment instruction can be used to adjust the orientation of each virtual sound source, with a different amount of change for each; a single user action thus produces a different orientation change for each virtual sound source in the played audio, so that the user can perceive the different changes of the different virtual sound sources. As an example, if the audio to be played includes two sound sources, then after the user acts, one source may move from one orientation to another through a larger amount of change, while the other source may move from one orientation to another through a smaller amount of change. In practical applications, the different amounts of orientation change for virtual sound sources with different type information can be determined automatically by the electronic device, for example by identifying the type information of the sound sources, where the type information includes one or more of the following: the orientation information of the virtual sound source in the virtual sound field of the audio to be played, timbre information, or volume information, so that different sound sources can be accurately distinguished through one or more of these kinds of information.
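A hedged sketch of per-source adjustment: each virtual source type gets its own scale factor, so one user action moves different sources by different amounts. The type names and factors below are illustrative assumptions, not values from the patent.

```python
TYPE_SCALE = {"vocal": 1.0, "footsteps": 0.25}  # assumed per-type factors

def per_source_azimuth_deltas(action_delta_deg, source_types):
    """source_types: {source_id: type}. Returns a per-source azimuth change
    so the same action moves different source types by different amounts."""
    return {sid: TYPE_SCALE[t] * action_delta_deg
            for sid, t in source_types.items()}
```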
The solution of this embodiment can be applied to a video playback scene, which includes audio to be played and multiple frames of images synchronized with the audio. On this basis, in this embodiment, when the user performs an action, both the image picture and the orientation of the virtual sound source in the audio can change. As an example, the electronic device includes a display area; generating the sound source orientation adjustment instruction based on the action information includes generating both a sound source orientation adjustment instruction and an image display instruction based on the action information; and playing the audio signal through the corresponding player includes controlling the corresponding player to play the audio signal while, based on the image display instruction, acquiring the multiple frames of images displayed in synchronization with the audio to be played and displaying them in the display area. The images include a first pixel area related to the virtual sound source, and both the change of the first pixel area across the frames and the orientation of the virtual sound source in the virtual sound field vary with the user's actions.
This embodiment can be applied to scenarios such as VR. As an example, the electronic device of this embodiment may be a VR device. When the user uses the VR device, it displays the frames of a video and simultaneously plays audio related to the picture; for example, the audio originates from a virtual object in the video, and that virtual object occupies the first pixel area in the image. The VR device can, based on the user's actions, adjust the position of the virtual object in the display area while adjusting the orientation of the virtual sound source; since the image picture and the sound source orientation are adjusted simultaneously, the VR effect is achieved both visually and in the audio. As an example, a user wearing a VR device sees a concert stage in the display area, with a performer playing at the center of the stage; the sound source in the audio is that performer. The user can, through a hand gesture, indicate moving the performer to the left side of the stage. After detecting the user's action, the VR device generates an image display instruction so that in the newly displayed frames the performer's pixel area appears on the left side of the display area, and through the sound source orientation adjustment instruction, the orientation of the virtual sound source in the virtual sound field also moves to the left with the user's action.
In the VR and similar scenarios of the above embodiments, user actions may not be involved at all, with only the picture changing. On this basis, in other examples, the electronic device may adjust the virtual sound source based on changes in the image picture. For example, the electronic device includes a display area used to display multiple frames of images synchronized with the audio to be played, where the images include a first pixel area related to the virtual sound source. The method further includes: when displaying the current image, obtaining the position change of the first pixel area in the current image relative to a preceding image; and, based on the position change, the orientation information of the virtual sound source, and the audio to be played, determining the audio signal corresponding to each of the at least two players and controlling the corresponding player to play the audio signal, so that the orientation of the virtual sound source in the virtual sound field changes as the position changes.
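One plausible way to couple the two, sketched under assumed display parameters (the linear mapping and the 90° horizontal field of view are assumptions, not from the patent): the horizontal shift of the first pixel area between frames is converted into an azimuth change of the virtual source.

```python
def pixel_shift_to_azimuth_delta(dx_px, display_width_px, h_fov_deg=90.0):
    # Linear approximation: a shift across the full display width sweeps
    # the full horizontal field of view of the display.
    return dx_px / display_width_px * h_fov_deg
```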
In practical applications, the embodiment in which the user's action changes both the image picture and the orientation of the virtual sound source, and the embodiment in which a picture change alone changes the orientation of the virtual sound source, can be used individually or in combination. For example, during video playback, if the user takes no action and only the picture changes, the orientation of the virtual sound source can be changed based on the picture change; if the user does act, both the picture and the orientation of the virtual sound source can change. This can be determined as needed in practical applications, and this embodiment does not limit it.
In some examples, determining the audio signal corresponding to each of the at least two players includes: obtaining scene information of the current audio playback scene, determining the binaural transfer function corresponding to the scene information, and determining the audio signal corresponding to each of the at least two players according to that binaural transfer function. In this embodiment, the binaural transfer function, also called the Head Related Transfer Function (HRTF), describes the transmission of sound waves from a sound source to the two ears. In practical applications, when the electronic device is in different scenes, the propagation from the sound source to the user's left and right auditory systems has different characteristics. On this basis, multiple binaural transfer functions can be preset, and when determining the audio signal, the electronic device can select the binaural transfer function matching the current scene, so that the processed audio signal fits the scene and has better sound effects. As examples, the scene information includes one or more of the following: audio type information, user type information, time information, or environmental information of the user's surroundings. For example, different audio types can use different sound effect processing, and the audio type can be identified through an audio classification algorithm or a trained neural network. Different users may have different preferences, so different user types can use different sound effect processing; user type information can be determined from user information with the user's authorization and consent. Time information can be obtained by the electronic device through the network or other means, and different sound effect processing can be used when the user listens to audio in different time periods. The environmental information of the user's surroundings can be determined in multiple ways, for example from images of the environment captured by the electronic device's image sensor, from sound information collected from the surroundings, or from geographic location information obtained by the electronic device; this embodiment does not limit this. In practical applications, any one or a combination of the above kinds of information can be selected as needed to determine the scene information, and multiple binaural transfer functions can be preset so that during processing the electronic device can select a suitable binaural transfer function for sound effect processing.
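A minimal sketch of scene-based selection, assuming a preset bank of binaural transfer functions keyed by scene information; the keys and HRTF names are illustrative assumptions.

```python
# Hypothetical preset bank: (audio type, environment) -> HRTF identifier.
HRTF_BANK = {
    ("music", "indoor"): "hrtf_small_room",
    ("music", "outdoor"): "hrtf_free_field",
    ("speech", "indoor"): "hrtf_near_field",
}

def select_hrtf(audio_type, environment):
    # Fall back to a default when the scene is not in the preset bank.
    return HRTF_BANK.get((audio_type, environment), "hrtf_default")
```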
In some examples, the audio to be played has at least two virtual sound sources, and determining the audio signal corresponding to each of the at least two players includes: determining the binaural transfer function corresponding to each virtual sound source, and using the determined binaural transfer functions to determine the audio signal corresponding to each of the at least two players, so that different virtual sound sources in the audio signal have different sound effects. In this embodiment, when the audio to be played includes at least two virtual sound sources, different binaural transfer functions can be applied to different virtual sound sources, giving them different sound effects. For example, the audio to be played may include a singer's voice at a short distance in front of the user, and footsteps at a greater distance behind the user moving from one side to the other; by applying different binaural transfer functions to these two sources, different sound effects can be achieved, for example enhancing the footsteps through their binaural transfer function.
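A sketch of per-source rendering, under the assumption that each binaural transfer function is applied as an FIR filter: every source is convolved with its own (single-ear) impulse response and the results are summed. A real renderer would do this per ear and typically in the frequency domain; this shows only the structure.

```python
def convolve(signal, impulse):
    """Plain FIR convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(impulse) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse):
            out[i + j] += s * h
    return out

def render_sources(sources):
    """sources: list of (signal, hrtf_impulse) pairs. Each source is
    filtered with its own binaural transfer function, then summed."""
    length = max(len(s) + len(h) - 1 for s, h in sources)
    mix = [0.0] * length
    for s, h in sources:
        for i, v in enumerate(convolve(s, h)):
            mix[i] += v
    return mix
```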
Next, this is illustrated through a further embodiment.
The audio processing of this embodiment can automatically convert an ordinary binaural audio signal into audio with VR sound effects, that is, sound effects that simulate translation, rotation, and so on of the sound. The audio stream can be any audio signal that supports binaural playback, and the VR sound effects can be controlled by user actions or by user action information set in audio editing software. This embodiment can be applied to electronic devices and audio software with binaural playback capability, such as headphones, mobile phones, cameras, VR devices, and wearable devices that support audio playback, as well as software tools with audio editing functions. This embodiment realizes VR sound effects in which the sound follows the person: the user hears different sounds for different movements and postures, and the changes in the sound are consistent with the body's movement. For example, based on external motion settings, a static audio signal is processed in real time or in post-processing to generate a dynamic audio signal for the user to play through a binaural device. The VR sound effect moves with the person; as an example, the motion of the audio can be decomposed into six degrees of freedom: three rotational and three translational. The processing of this embodiment may include the steps shown in Figure 3A:
Step 301: Obtain the original audio signal.
Step 302: Obtain the user's action information.
Step 303: Perform VR sound effect processing on the original audio signal according to the user's action information.
Step 304: Output an audio signal with VR sound effects.
For Step 301, obtaining the original audio signal: the audio signal originally intended for direct playback can be obtained. As an example, the audio signal may come from the decoded signal of an audio file, or from the audio stream of streaming media. If the audio signal is mono, it can be duplicated into a two-channel signal with identical left and right channels. If the audio signal is already two-channel, it can be used directly.
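The mono-to-stereo duplication described above can be sketched as:

```python
def to_two_channel(signal, num_channels):
    """Duplicate a mono signal into identical left/right channels;
    pass a two-channel signal through unchanged."""
    if num_channels == 1:
        return [list(signal), list(signal)]
    return signal
```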
For Step 302, obtaining the user's action information: as an example, the user's action information can be determined from data collected by motion sensors. This action information is related to the VR sound effects and, optionally, can be six-degree-of-freedom action information. Motion in six degrees of freedom can simulate a person's movement within the sound field; the six degrees of freedom include three rotational degrees of freedom (α, β, γ) and three translational degrees of freedom (x, y, z). In practical applications there may also be fewer than six degrees of freedom, for example only three rotational degrees of freedom, or only one; this embodiment does not limit this. In practical applications, determining the user's action information from motion sensor data can be implemented in many ways: for example, inertial sensors can capture the user's motion and posture (for example in a VR headset, mobile phone, or smart watch); image sensors can capture the user's motion and posture (for example, recognizing head movement, hand movement, or eye movement from images); sensors can capture EEG signals to obtain the motion commands issued by the brain; the motion can be defined in software; or any other means of obtaining external motion and posture can be used.
For Step 303: perform VR sound effect processing on the original audio signal according to the user's action information.
For Step 304, outputting an audio signal with VR sound effects: the audio signal can be a two-channel or multi-channel audio signal. The output audio signal can be played through a player, stored, or sent to other devices to be played by their players.
Optionally, Figure 3B shows a flow chart of one embodiment of the aforementioned Step 303, which may include the following:
(1) Obtain the two-channel frequency-domain signal from the audio to be played.
The audio to be played in this embodiment can be mono or two-channel; a mono signal can be duplicated into a two-channel signal. The sampling frequency of the time-domain signal of the audio to be played is f_s, and the time-domain signal of each channel is x_m(t), where m is the channel (microphone) index, m = 1, 2, and t is the discrete sampling time index, t = 1, 2, …. This embodiment uses two channels as an example.
Optionally, every L sampling points, N sampling points can be extracted from the two-channel time-domain signal x_m(t) as one frame of signal, denoted x_m(n)_l, where n is the time index within a frame, n = 1, 2, …, N, and l is the frame index, l = 1, 2, …. Here N is called the frame length and L the frame shift, with 0 < L ≤ N, and N may be a power of 2.
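The framing step can be sketched as follows (frame length N, frame shift L):

```python
def extract_frames(x, N, L):
    """Extract N-sample frames every L samples (0 < L <= N);
    frames overlap when L < N."""
    return [x[i:i + N] for i in range(0, len(x) - N + 1, L)]
```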
Windowing is applied to the l-th frame of the two-channel original time-domain signal x_m(n)_l; the windowed time-domain signal of the l-th frame, x′_m(n)_l, is

x′_m(n)_l = x_m(n)_l · h_ana(n),  n = 1, 2, …, N

where h_ana(n) is an N-point analysis window function; commonly used window functions include the sine window and the Hamming window. A discrete Fourier transform (DFT) is applied to each channel's windowed time-domain signal x′_m(n)_l of the l-th frame to obtain the N-point complex spectrum X_m(k)_l of the m-th channel of the l-th frame, where k is the discrete frequency index, k = 1, 2, …, N:

X_m(k)_l = Σ_{n=1}^{N} x′_m(n)_l · e^{−j2π(k−1)(n−1)/N}

where e is the natural constant and j = √(−1) is the imaginary unit. As an example, the fast Fourier transform (FFT) can be used in the implementation to accelerate the computation.
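The windowing and DFT above can be sketched in plain Python (0-based indices, equivalent to the 1-based formula; in practice an FFT routine would be used for speed):

```python
import cmath
import math

def sine_window(N):
    # One common choice of the N-point analysis window h_ana(n).
    return [math.sin(math.pi * (n + 0.5) / N) for n in range(N)]

def dft(frame):
    # X(k) = sum_n x(n) * e^{-j*2*pi*k*n/N}: the N-point complex spectrum.
    N = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

def windowed_spectrum(frame):
    # Apply the analysis window, then transform the frame.
    w = sine_window(len(frame))
    return dft([x * h for x, h in zip(frame, w)])
```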
(2) Extract the sound direction of each discrete frequency from the two-channel frequency-domain signal.
In this embodiment, a virtual stationary coordinate system O-XYZ can be established, representing the person's initial posture. The initial posture is the user's posture when the solution of this embodiment starts executing, for example the moment the user puts on the headphones, or the moment the device is triggered to start acquiring the audio to be played. Of course, in practical applications the coordinate system can also be established in other ways, for example based on the players; this embodiment does not limit this.
As an example, each discrete frequency can be treated as an independent sound source distributed in this coordinate system, representing sound arriving from various directions around the person, as shown in Figure 3C, a schematic diagram of the directions of the discrete frequencies in the coordinate system. Each dot in Figure 3C represents one discrete frequency, and the source direction of each frequency can be denoted p_k, usually expressed in the Cartesian coordinate system as (x_k, y_k, z_k) and in the spherical coordinate system as (r_k, θ_k, φ_k).
The direction of each independent sound source in space can be calculated from the amplitude ratio and phase difference of the two-channel frequency-domain signal, combined with a model of sound propagation to the two ears. Optionally, the propagation model may use a head-related transfer function (HRTF), or may refer to a model of sound propagation in a free sound field. Since a mono signal is expanded by duplication into identical left and right channels, in that case all sources can be considered to come from directly in front of the user.
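A minimal sketch of estimating a per-bin lateral direction from the inter-channel amplitude ratio; the linear panning-gain-to-angle mapping used here is a simplifying assumption, not the HRTF-based inversion described in the text:

```python
import numpy as np

def bin_azimuth(X_L, X_R):
    """Per-frequency lateral angle estimate from channel magnitudes.

    Assumes a stereo amplitude-panning model: a bin panned fully left maps
    to -pi/2, fully right to +pi/2, equal magnitudes to 0 (straight ahead).
    """
    g = np.abs(X_R) / (np.abs(X_L) + np.abs(X_R) + 1e-12)  # panning gain in [0, 1]
    return (g - 0.5) * np.pi
```

A full implementation would additionally use the inter-channel phase difference, as the text notes.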
(3) Obtain the user's motion information.
The user's motion information in this embodiment includes a six-degree-of-freedom pose: three rotational degrees of freedom (α, β, γ) and three translational degrees of freedom (x, y, z); in practical applications, fewer than six degrees of freedom may be used as needed. In this embodiment, data collected by a motion sensor can be used to determine the user's motion information, where the user's motion information may refer to the user's actual motion parameters, that is, the user's actual movement in space.
In this embodiment, for a more immersive experience, the virtual sound field and virtual sound sources can be defined as stationary relative to the aforementioned coordinate system O-XYZ, while the person moves relative to that coordinate system. The user's actual motion parameters therefore represent the person's movement relative to the virtual sound field.
In this embodiment, the user's actual motion information needs to be converted into the person's motion relative to the virtual sound field. For example, the user sliding a hand to the right can be defined as the person moving to the right; correspondingly, it can also be defined as dragging the sound field to the right, i.e., the person moving to the left relative to the sound field. The same user motion information can therefore correspond to different motion parameters in the coordinate system. Based on this, this embodiment can predefine the relationship between the user's actual motion information and the motion parameters in the coordinate system. For example, the detected actual motion of the user in space can be converted into motion in this coordinate system, i.e., the aforementioned six degrees of freedom (α, β, γ) and (x, y, z). The specific conversion from the user's actual motion information to the corresponding six degrees of freedom can be customized as needed in practical applications and is not limited in this embodiment.
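Since the text leaves the action-to-motion mapping application-defined, one hypothetical way to predefine it is a lookup table; the gesture names and numeric values below are invented purely for illustration:

```python
# Hypothetical mapping from detected user actions to 6-DoF motion in O-XYZ.
# Rotation is (alpha, beta, gamma) in radians; translation is (x, y, z) in meters.
GESTURE_TO_MOTION = {
    "swipe_right": {"rotation": (0.0, 0.0, 0.0), "translation": (0.1, 0.0, 0.0)},
    "turn_head_right": {"rotation": (-0.26, 0.0, 0.0), "translation": (0.0, 0.0, 0.0)},
}

def motion_params(gesture):
    """Return (alpha, beta, gamma, x, y, z); unknown gestures map to no motion."""
    m = GESTURE_TO_MOTION.get(gesture)
    return m["rotation"] + m["translation"] if m else (0.0,) * 6
```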
(4) Calculate the relative direction and relative translation parameters of the sound. The six-degree-of-freedom motion represents the person's motion: the person moves from the O-XYZ coordinate system to the O′-X′Y′Z′ coordinate system, as shown in Figure 3D, a schematic diagram of the user's change of pose expressed in the coordinate system. The discrete-frequency source coordinates p_k in the coordinate system O-XYZ map to p′_k in the new coordinate system O′-X′Y′Z′.
(5) Calculate the binaural transfer functions. In this embodiment, based on a head-related transfer function or a free-field propagation model of the sound source, the transfer functions from each discrete-frequency source at direction p′_k to the two ears can be calculated. The transfer functions of the left and right ears are T_L(k) and T_R(k) respectively; both are complex, with magnitude and phase components. If the free-field propagation model is used, the equivalent spacing between the two ears needs to be determined; this spacing may be preset, or measured by means of the at least two players, and so on.
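For the free-field branch of step (5), a delay-only transfer-function sketch based on the interaural time difference; the 0.18 m ear spacing is an assumed default, and real HRTFs also include magnitude and spectral effects this model omits:

```python
import numpy as np

def free_field_transfer(azimuth, freqs, ear_distance=0.18, c=343.0):
    """Phase-only binaural transfer functions for a plane wave.

    azimuth: source angle in radians (0 = straight ahead, positive = left);
    freqs: array of discrete frequencies in Hz. Each ear receives +/- half
    of the interaural time difference (ITD) as a pure phase shift.
    """
    itd = (ear_distance / c) * np.sin(azimuth)        # interaural time difference
    T_L = np.exp(-1j * 2 * np.pi * freqs * (-itd / 2))  # left ear leads for positive azimuth
    T_R = np.exp(-1j * 2 * np.pi * freqs * (+itd / 2))
    return T_L, T_R
```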
(6) Reconstruct the binaural frequency-domain signals. The two channels are combined into one main audio signal X_main(k)_l, obtained by linear superposition of the two-channel complex spectra X_m(k)_l:

X_main(k)_l = Σ_{m=1}^{2} X_m(k)_l
Then, combining the binaural transfer functions T_L(k) and T_R(k), the left-channel spectrum signal X_L(k)_l and the right-channel spectrum signal X_R(k)_l can be generated as follows:

X_L(k)_l = T_L(k) · X_main(k)_l
X_R(k)_l = T_R(k) · X_main(k)_l
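Step (6) reduces to a sum and two element-wise products per frame; a sketch with placeholder unit-magnitude transfer functions (the real T_L, T_R come from step (5)):

```python
import numpy as np

# X: (2, N) complex spectra of the two input channels for one frame
rng = np.random.default_rng(1)
X = rng.standard_normal((2, 8)) + 1j * rng.standard_normal((2, 8))

X_main = X.sum(axis=0)  # X_main(k) = X_1(k) + X_2(k), linear superposition

# placeholder phase-only transfer functions (stand-ins for step (5) output)
T_L = np.exp(-1j * 0.1 * np.arange(8))
T_R = np.exp(+1j * 0.1 * np.arange(8))

X_L = T_L * X_main      # X_L(k) = T_L(k) X_main(k)
X_R = T_R * X_main      # X_R(k) = T_R(k) X_main(k)
```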
(7) Output the binaural time-domain signals. An inverse discrete Fourier transform (IDFT) is applied to the l-th frame's left and right channel spectrum signals X_L(k)_l and X_R(k)_l to obtain the l-th frame's left and right channel time-domain signals x′_L(n)_l and x′_R(n)_l, n = 1, 2, …, N. Optionally, an inverse fast Fourier transform (IFFT) can be used to accelerate the computation:

x′_L(n)_l = (1/N) Σ_{k=1}^{N} X_L(k)_l · e^(j·2π(k−1)(n−1)/N)
x′_R(n)_l = (1/N) Σ_{k=1}^{N} X_R(k)_l · e^(j·2π(k−1)(n−1)/N)
A synthesis window is applied to the l-th frame's left and right channel time-domain signals, giving the l-th frame's windowed time-domain signals for the left and right channels:

x″_L(n)_l = x′_L(n)_l · h_syn(n)
x″_R(n)_l = x′_R(n)_l · h_syn(n)

where h_syn(n) is the synthesis window function; optionally, the synthesis window function may be a sine window, a Hamming window, etc.
Optionally, the overlap-add method is used to obtain the l-th frame's overlap-added N-point time-domain signals for the left and right channels:

x″′_L(n)_l = x″_L(n)_l + x″′_L(n+M)_{l−1}
x″′_R(n)_l = x″_R(n)_l + x″′_R(n+M)_{l−1}

where x″′_L(n+M)_{l−1} and x″′_R(n+M)_{l−1} are the overlap-added N-point time-domain signals of frame l−1 for the left and right channels; if l = 1, then x″′_L(n+M)_{l−1} = 0 and x″′_R(n+M)_{l−1} = 0.
Therefore, the l-th frame's overlap-added M-point output signals for the left and right channels are the first M elements of x″′_L(n)_l and x″′_R(n)_l:

x_L(n)_l = x″′_L(n)_l,  n = 1, 2, …, M
x_R(n)_l = x″′_R(n)_l,  n = 1, 2, …, M

x_L(n)_l and x_R(n)_l are the l-th frame's left and right channel output audio signals, which carry a VR sound effect that follows the person's movement.
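The overlap-add bookkeeping of step (7) can be sketched as follows, where M is the frame hop and each frame contributes its first M overlap-added samples to the output:

```python
import numpy as np

def overlap_add_frames(frames, M):
    """Overlap-add synthesis for one channel.

    frames: (num_frames, N) array of windowed time-domain frames with hop M.
    Each frame is accumulated with the tail (samples M..N) of the previous
    frame's accumulation, as in x'''_l(n) = x''_l(n) + x'''_{l-1}(n+M), and
    the first M samples of each accumulated frame are emitted as output.
    """
    N = frames.shape[1]
    prev_tail = np.zeros(N - M)
    out = []
    for f in frames:
        acc = f.astype(float).copy()
        acc[: N - M] += prev_tail   # add previous frame's overlapping tail
        out.append(acc[:M])         # first M samples form this frame's output
        prev_tail = acc[M:]         # carry the tail into the next frame
    return np.concatenate(out)
```

With matched sine analysis/synthesis windows and M = N/2 (50% overlap), this scheme reconstructs the input signal, which is why the text pairs the analysis and synthesis windows.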
It can be seen from the above that this embodiment enables any mono or two-channel source to be played back with a VR sound effect that follows the person's movement: for example, while moving, the person experiences an absolutely stationary virtual sound field. Taking a sound initially located straight ahead as an example, as the person turns to the right, the sound turns to the left relative to the person, giving the user the illusion that the sound is stationary in a virtual space.
The above audio processing method embodiments may be implemented by software, by hardware, or by a combination of software and hardware. Taking software implementation as an example, as a device in the logical sense, it is formed by the processor where it is located reading the corresponding computer program instructions from non-volatile memory into memory and running them. At the hardware level, Figure 4 shows a hardware structure diagram of an audio processing apparatus 400 implementing the audio processing method of this embodiment. In addition to the processor 401 and the memory 402 shown in Figure 4, the audio processing apparatus used to implement this audio processing method in the embodiments may, according to its actual functions, also include other hardware, which will not be described in detail here.
In this embodiment, the processor 401 implements the following steps when executing the computer program:

obtaining the orientation information of a virtual sound source of the audio to be played in a virtual sound field, where the virtual sound field is established based on the positional relationship between at least two players and the user's left and right hearing organs;

using a motion sensor to obtain the user's motion information, and generating a sound source orientation adjustment instruction based on the motion information;

based on the sound source orientation adjustment instruction, the orientation information of the virtual sound source, and the audio to be played, determining an audio signal corresponding to each of the at least two players, and playing the audio signal through the corresponding player.

In some examples, the audio to be played has at least two virtual sound sources, and the type information of the at least two virtual sound sources differs;

the processor 401 performing the generating of the sound source orientation adjustment instruction based on the motion information includes:

based on the motion information, generating a sound source orientation adjustment instruction for adjusting the orientation of each of the at least two virtual sound sources, so that as the user moves, the orientations of virtual sound sources with different type information change by different amounts in the virtual sound field.

In some examples, the type information includes one or more of the following: the orientation information of the virtual sound source of the audio to be played in the virtual sound field, timbre information, or volume information.

In some examples, the electronic device includes a display area;

the processor 401 performing the generating of the sound source orientation adjustment instruction based on the motion information includes: generating a sound source orientation adjustment instruction and an image display instruction based on the motion information;

the processor 401 performing the playing of the audio signal through the corresponding player includes:

controlling the corresponding player to play the audio signal, while obtaining, based on the image display instruction, multiple frames of images displayed in synchronization with the audio to be played and displaying them in the display area; the images include a first pixel region related to the virtual sound source, and both the change of the first pixel region across the multiple frames and the orientation of the virtual sound source in the virtual sound field vary with the user's motion.

In some examples, the electronic device includes a display area for displaying multiple frames of images synchronized with the audio to be played, the images including a first pixel region related to the virtual sound source; the processor 401 further performs:

when displaying the current image, obtaining the position change of the first pixel region in the current image relative to a preceding image;

based on the position change, the orientation information of the virtual sound source, and the audio to be played, determining an audio signal corresponding to each of the at least two players, and controlling the corresponding player to play the audio signal.

In some examples, the user's motion information includes an amount of change of the user's motion, and the processor 401 performing the generating of the sound source orientation adjustment instruction based on the motion information includes:

based on the amount of change of the user's motion and a first mapping relationship, generating a sound source orientation adjustment instruction for adjusting the amount of change of the virtual sound source's orientation in the virtual sound field; the first mapping relationship includes a correspondence between the amount of change of the user's motion and the amount of change of the virtual sound source's orientation in the virtual sound field.

In some examples, the first mapping relationship is preset, or generated by obtaining a setting instruction from the user, or determined according to the user's historical behavior data.

In some examples, the orientation information of the virtual sound source of the audio to be played in the virtual sound field is determined through a first coordinate system; the amount of change of the user's motion is determined through the first coordinate system.

In some examples, the first coordinate system is a coordinate system established based on the user's initial posture when using the at least two players.

In some examples, the electronic device is a wearable electronic device, and the motion sensor includes an inertial measurement sensor; the processor 401 performing the obtaining of the user's motion information with the motion sensor includes: determining the user's motion information according to measurement data of the inertial measurement sensor.

In some examples, the electronic device is a wearable electronic device, and the motion sensor includes a first image sensor; when the wearable electronic device is worn on the user's head, the first image sensor faces the user's eyes; the processor 401 performing the obtaining of the user's motion information with the motion sensor includes: obtaining an image captured by the first image sensor, and obtaining motion information of the user's eyeballs based on the image.

In some examples, the electronic device is a wearable electronic device, and the motion sensor includes one or more second image sensors; when the wearable electronic device is worn on the user's head, the observation range of the one or more second image sensors covers the activity space of the user's hands; the processor 401 performing the obtaining of the user's motion information with the motion sensor includes: obtaining images captured by the one or more second image sensors and, if the user's hand is detected in the images, obtaining motion information of the user's hand.

In some examples, the processor 401 performing the determining of the audio signal corresponding to each of the at least two players includes: obtaining scene information of the current audio playback scene, determining a binaural transfer function corresponding to the scene information, and determining the audio signal corresponding to each of the at least two players according to the binaural transfer function.

In some examples, the scene information includes one or more of the following: audio type information, user type information, time information, or environmental information of the user's surroundings.

In some examples, the audio to be played has at least two virtual sound sources; the processor 401 performing the determining of the audio signal corresponding to each of the at least two players includes: determining a binaural transfer function corresponding to each virtual sound source, and using the determined binaural transfer functions to determine the audio signal corresponding to each of the at least two players, so that different virtual sound sources in the audio signal have different sound effects.

In some examples, the orientation information of the virtual sound source of the audio to be played in the virtual sound field is determined by extracting multiple discrete frequencies from the audio to be played and using the orientation information of the multiple discrete frequencies in the virtual sound field.

In some examples, the audio to be played is an audio signal of at least two channels; the orientation information of the multiple discrete frequencies in the virtual sound field is obtained as follows: according to the amplitude ratio and/or phase difference of the audio signals of the at least two channels, and based on the first coordinate system and a binaural transfer function, obtaining the orientation information of the multiple discrete frequencies in the virtual sound field.
As shown in Figure 5, an embodiment of the present application further provides an electronic device 500, including: at least two players 510 and a motion sensor 520; the electronic device further includes a processor 530, a memory 540, and a computer program stored in the memory and executable by the processor;

when executing the computer program, the processor implements the following steps:

obtaining the orientation information of a virtual sound source of the audio to be played in a virtual sound field, where the virtual sound field is established based on the positional relationship between at least two players and the user's left and right hearing organs;

using a motion sensor to obtain the user's motion information, and generating a sound source orientation adjustment instruction based on the motion information;

based on the sound source orientation adjustment instruction, the orientation information of the virtual sound source, and the audio to be played, determining an audio signal corresponding to each of the at least two players, and playing the audio signal through the corresponding player.

In some examples, the audio to be played has at least two virtual sound sources, and the type information of the at least two virtual sound sources differs;

the processor performing the generating of the sound source orientation adjustment instruction based on the motion information includes:

based on the motion information, generating a sound source orientation adjustment instruction for adjusting the orientation of each of the at least two virtual sound sources, so that as the user moves, the orientations of virtual sound sources with different type information change by different amounts in the virtual sound field.

In some examples, the type information includes one or more of the following: the orientation information of the virtual sound source of the audio to be played in the virtual sound field, timbre information, or volume information.

In some examples, the electronic device includes a display area;

the processor performing the generating of the sound source orientation adjustment instruction based on the motion information includes: generating a sound source orientation adjustment instruction and an image display instruction based on the motion information;

the processor performing the playing of the audio signal through the corresponding player includes:

controlling the corresponding player to play the audio signal, while obtaining, based on the image display instruction, multiple frames of images displayed in synchronization with the audio to be played and displaying them in the display area; the images include a first pixel region related to the virtual sound source, and both the change of the first pixel region across the multiple frames and the orientation of the virtual sound source in the virtual sound field vary with the user's motion.

In some examples, the electronic device includes a display area for displaying multiple frames of images synchronized with the audio to be played, the images including a first pixel region related to the virtual sound source; the processor further performs:

when displaying the current image, obtaining the position change of the first pixel region in the current image relative to a preceding image;

based on the position change, the orientation information of the virtual sound source, and the audio to be played, determining an audio signal corresponding to each of the at least two players, and controlling the corresponding player to play the audio signal.

In some examples, the user's motion information includes an amount of change of the user's motion, and the processor performing the generating of the sound source orientation adjustment instruction based on the motion information includes:

based on the amount of change of the user's motion and a first mapping relationship, generating a sound source orientation adjustment instruction for adjusting the amount of change of the virtual sound source's orientation in the virtual sound field; the first mapping relationship includes a correspondence between the amount of change of the user's motion and the amount of change of the virtual sound source's orientation in the virtual sound field.

In some examples, the first mapping relationship is preset, or generated by obtaining a setting instruction from the user, or determined according to the user's historical behavior data.

In some examples, the orientation information of the virtual sound source of the audio to be played in the virtual sound field is determined through a first coordinate system; the amount of change of the user's motion is determined through the first coordinate system.

In some examples, the first coordinate system is a coordinate system established based on the user's initial posture when using the at least two players.

In some examples, the electronic device is a wearable electronic device, and the motion sensor includes an inertial measurement sensor; the processor performing the obtaining of the user's motion information with the motion sensor includes: determining the user's motion information according to measurement data of the inertial measurement sensor.

In some examples, the electronic device is a wearable electronic device, and the motion sensor includes a first image sensor; when the wearable electronic device is worn on the user's head, the first image sensor faces the user's eyes; the processor performing the obtaining of the user's motion information with the motion sensor includes: obtaining an image captured by the first image sensor, and obtaining motion information of the user's eyeballs based on the image.

In some examples, the electronic device is a wearable electronic device, and the motion sensor includes one or more second image sensors; when the wearable electronic device is worn on the user's head, the observation range of the one or more second image sensors covers the activity space of the user's hands; the processor performing the obtaining of the user's motion information with the motion sensor includes: obtaining images captured by the one or more second image sensors and, if the user's hand is detected in the images, obtaining motion information of the user's hand.

In some examples, the processor performing the determining of the audio signal corresponding to each of the at least two players includes: obtaining scene information of the current audio playback scene, determining a binaural transfer function corresponding to the scene information, and determining the audio signal corresponding to each of the at least two players according to the binaural transfer function.

In some examples, the scene information includes one or more of the following: audio type information, user type information, time information, or environmental information of the user's surroundings.

In some examples, the audio to be played has at least two virtual sound sources; the processor performing the determining of the audio signal corresponding to each of the at least two players includes: determining a binaural transfer function corresponding to each virtual sound source, and using the determined binaural transfer functions to determine the audio signal corresponding to each of the at least two players, so that different virtual sound sources in the audio signal have different sound effects.

In some examples, the orientation information of the virtual sound source of the audio to be played in the virtual sound field is determined by extracting multiple discrete frequencies from the audio to be played and using the orientation information of the multiple discrete frequencies in the virtual sound field.

In some examples, the audio to be played is an audio signal of at least two channels; the orientation information of the multiple discrete frequencies in the virtual sound field is obtained as follows: according to the amplitude ratio and/or phase difference of the audio signals of the at least two channels, and based on the first coordinate system and a binaural transfer function, obtaining the orientation information of the multiple discrete frequencies in the virtual sound field.
本说明书实施例还提供一种计算机可读存储介质,所述可读存储介质上存储有若干计算机指令,所述计算机指令被执行时实任一实施例所述音频处理方法的步骤。Embodiments of this specification also provide a computer-readable storage medium, which stores a number of computer instructions. When executed, the computer instructions implement the steps of the audio processing method described in any embodiment.
本说明书实施例可采用在一个或多个其中包含有程序代码的存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。计算机可用存储介质包括永久性和非永久性、可移动和非可移动媒体,可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括但不限于:相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可 编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁带磁磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。Embodiments of the present description may take the form of a computer program product implemented on one or more storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having program code embodied therein. Storage media available for computers include permanent and non-permanent, removable and non-removable media, and can be implemented by any method or technology to store information. Information may be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to: phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, Magnetic tape cassettes, tape magnetic disk storage or other magnetic storage devices or any other non-transmission medium can be used to store information that can be accessed by a computing device.
As for the apparatus embodiments, since they substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant details. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Persons of ordinary skill in the art can understand and implement the solution without creative effort.
The division of steps in the various methods above is only for clarity of description. In implementation, steps may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is included, they all fall within the protection scope of this patent. Adding insignificant modifications or insignificant designs to an algorithm or process without changing its core design also falls within the protection scope of this application.
Descriptions such as "specific example" or "some examples" mean that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of this specification. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
It should be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. The terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The method and apparatus provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, persons of ordinary skill in the art may make changes to the specific implementation and application scope based on the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.
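For a concrete picture of the pipeline described above (motion information → sound source orientation adjustment → per-player audio signals), the following minimal Python sketch renders one mono frame to two players under head tracking. It is not the patented implementation: the function name `render_frame`, the sine-law panning, and the fixed maximum interaural delay of roughly 0.66 ms are illustrative assumptions standing in for a real binaural transfer function.

```python
import numpy as np

def render_frame(mono, source_az_deg, head_yaw_deg, sr=48000):
    """Render one mono frame to left/right player signals.

    Hypothetical sketch: the head-relative azimuth is the virtual
    source azimuth minus the tracked head yaw (the "orientation
    adjustment"); crude per-ear gain and delay models stand in for
    the binaural transfer function.
    """
    rel_az = np.radians(source_az_deg - head_yaw_deg)
    # Crude level-difference model: equal-power pan by sin(azimuth).
    g_left = np.sqrt(0.5 * (1.0 - np.sin(rel_az)))
    g_right = np.sqrt(0.5 * (1.0 + np.sin(rel_az)))
    # Crude time-difference model: up to ~0.66 ms interaural delay,
    # applied to the ear farther from the source.
    itd = 0.00066 * np.sin(rel_az)
    shift = int(round(abs(itd) * sr))
    left = np.roll(mono, shift if itd > 0 else 0) * g_left
    right = np.roll(mono, shift if itd < 0 else 0) * g_right
    return left, right
```

With a source fixed at 90 degrees to the user's right, turning the head 90 degrees toward it equalizes the two ear signals, which is the perceptual effect the head-tracked adjustment aims for.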

Claims (20)

  1. An audio processing method, wherein the method is applied to an electronic device, the electronic device comprises at least two players and a motion sensor, and the method comprises:
    obtaining azimuth information of a virtual sound source of audio to be played in a virtual sound field, the virtual sound field being established based on the positional relationship between the at least two players and the left and right auditory organs of a user;
    obtaining motion information of the user by using the motion sensor, and generating a sound source orientation adjustment instruction based on the motion information; and
    determining, based on the sound source orientation adjustment instruction, the azimuth information of the virtual sound source, and the audio to be played, an audio signal corresponding to each of the at least two players, and playing the audio signal through the corresponding player.
  2. The method according to claim 1, wherein the audio to be played has at least two virtual sound sources, and type information of the at least two virtual sound sources is different; and
    the generating a sound source orientation adjustment instruction based on the motion information comprises:
    generating, based on the motion information, a sound source orientation adjustment instruction for adjusting the orientation of each of the at least two virtual sound sources, so that, following the user's motion, the orientations of virtual sound sources with different type information change by different amounts in the virtual sound field.
  3. The method according to claim 2, wherein the type information comprises one or more of the following:
    azimuth information of the virtual sound source in the virtual sound field, timbre information, or volume information of the audio to be played.
  4. The method according to claim 1, wherein the electronic device comprises a display area;
    the generating a sound source orientation adjustment instruction based on the motion information comprises: generating a sound source orientation adjustment instruction and an image display instruction based on the motion information; and
    the playing the audio signal through the corresponding player comprises:
    controlling the corresponding player to play the audio signal, and at the same time, based on the image display instruction, obtaining multiple frames of images displayed in synchronization with the audio to be played and displaying them in the display area; wherein the images comprise a first pixel area related to the virtual sound source, and both the change of the first pixel area across the multiple frames of images and the orientation of the virtual sound source in the virtual sound field change with the user's motion.
  5. The method according to claim 1, wherein the electronic device comprises a display area, the display area is used to display multiple frames of images synchronized with the audio to be played, and the images comprise a first pixel area related to the virtual sound source; the method further comprises:
    when displaying a current image, obtaining a position change of the first pixel area in the current image relative to an image preceding the current image; and
    determining, based on the position change, the azimuth information of the virtual sound source, and the audio to be played, an audio signal corresponding to each of the at least two players, and controlling the corresponding player to play the audio signal.
  6. The method according to claim 1, wherein the motion information of the user comprises a change amount of the user's motion, and the generating a sound source orientation adjustment instruction based on the motion information comprises:
    generating, based on the change amount of the user's motion and a first mapping relationship, a sound source orientation adjustment instruction for adjusting a change amount of the orientation of the virtual sound source in the virtual sound field; wherein the first mapping relationship comprises a correspondence between the change amount of the user's motion and the change amount of the orientation of the virtual sound source in the virtual sound field.
  7. The method according to claim 6, wherein the first mapping relationship is preset, generated by obtaining a setting instruction of the user, or determined according to historical behavior data of the user.
  8. The method according to claim 6, wherein the azimuth information of the virtual sound source of the audio to be played in the virtual sound field is determined through a first coordinate system; and
    the change amount of the user's motion is determined through the first coordinate system.
  9. The method according to claim 8, wherein the first coordinate system comprises a coordinate system established based on an initial posture of the user when using the at least two players.
  10. The method according to claim 1, wherein the electronic device is a wearable electronic device, and the motion sensor comprises an inertial measurement sensor; and
    the obtaining motion information of the user by using the motion sensor comprises:
    determining the motion information of the user according to measurement data of the inertial measurement sensor.
  11. The method according to claim 1, wherein the electronic device is a wearable electronic device, the motion sensor comprises a first image sensor, and when the wearable electronic device is worn on the user's head, the first image sensor faces the user's eyes; and
    the obtaining motion information of the user by using the motion sensor comprises:
    obtaining an image collected by the first image sensor, and obtaining motion information of the user's eyeballs based on the image.
  12. The method according to claim 1, wherein the electronic device is a wearable electronic device, the motion sensor comprises one or more second image sensors, and when the wearable electronic device is worn on the user's head, the observation range of the one or more second image sensors covers the activity space of the user's hands; and
    the obtaining motion information of the user by using the motion sensor comprises:
    obtaining images collected by the one or more second image sensors, and, if the user's hand is detected from the images, obtaining motion information of the user's hand.
  13. The method according to claim 1, wherein the determining an audio signal corresponding to each of the at least two players comprises:
    obtaining scene information of a current audio playback scene, determining a binaural transfer function corresponding to the scene information, and determining the audio signal corresponding to each of the at least two players according to the binaural transfer function.
  14. The method according to claim 13, wherein the scene information comprises one or more of the following: audio type information, user type information, time information, or environmental information of the environment in which the user is located.
  15. The method according to claim 1, wherein the audio to be played has at least two virtual sound sources; and
    the determining an audio signal corresponding to each of the at least two players comprises:
    determining a binaural transfer function corresponding to each of the virtual sound sources, and determining, by using the determined binaural transfer functions, the audio signal corresponding to each of the at least two players, so that different virtual sound sources in the audio signal have different sound effects.
  16. The method according to claim 1 or 8, wherein the azimuth information of the virtual sound source of the audio to be played in the virtual sound field is determined by extracting a plurality of discrete frequencies from the audio to be played and according to azimuth information of the plurality of discrete frequencies in the virtual sound field.
  17. The method according to claim 16, wherein the audio to be played is an audio signal of at least two channels, and the azimuth information of the plurality of discrete frequencies in the virtual sound field is obtained in the following manner:
    obtaining, based on a first coordinate system and a binaural transfer function, the azimuth information of the plurality of discrete frequencies in the virtual sound field according to the amplitude ratio and/or phase difference of the audio signals of the at least two channels.
  18. An audio processing apparatus, wherein the apparatus comprises a processor, a memory, and a computer program stored on the memory and executable by the processor, and the processor implements the method according to any one of claims 1 to 17 when executing the computer program.
  19. An electronic device, wherein the electronic device comprises at least two players and a motion sensor; the electronic device further comprises a processor, a memory, and a computer program stored on the memory and executable by the processor; and
    the processor implements the method according to any one of claims 1 to 17 when executing the computer program.
  20. A computer-readable storage medium, wherein a number of computer instructions are stored on the computer-readable storage medium, and when the computer instructions are executed, the steps of the method according to any one of claims 1 to 17 are implemented.
PCT/CN2022/080925 2022-03-15 2022-03-15 Audio processing method and apparatus, electronic device, and computer-readable storage medium WO2023173285A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/080925 WO2023173285A1 (en) 2022-03-15 2022-03-15 Audio processing method and apparatus, electronic device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/080925 WO2023173285A1 (en) 2022-03-15 2022-03-15 Audio processing method and apparatus, electronic device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2023173285A1 true WO2023173285A1 (en) 2023-09-21

Family

ID=88022075

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/080925 WO2023173285A1 (en) 2022-03-15 2022-03-15 Audio processing method and apparatus, electronic device, and computer-readable storage medium

Country Status (1)

Country Link
WO (1) WO2023173285A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1244094A (en) * 1998-07-30 2000-02-09 财团法人资讯工业策进会 3D space sound effect processing system and method
CN105263075A (en) * 2015-10-12 2016-01-20 深圳东方酷音信息技术有限公司 Earphone equipped with directional sensor and 3D sound field restoration method thereof
CN106658344A (en) * 2016-11-15 2017-05-10 北京塞宾科技有限公司 Holographic audio rendering control method
US20180232941A1 (en) * 2017-02-10 2018-08-16 Sony Interactive Entertainment LLC Paired local and global user interfaces for an improved augmented reality experience
CN108957761A (en) * 2012-12-18 2018-12-07 精工爱普生株式会社 Display device and its control method, head-mounted display apparatus and its control method
WO2020102994A1 (en) * 2018-11-20 2020-05-28 深圳市欢太科技有限公司 3d sound effect realization method and apparatus, and storage medium and electronic device
CN112071326A (en) * 2020-09-07 2020-12-11 三星电子(中国)研发中心 Sound effect processing method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22931324

Country of ref document: EP

Kind code of ref document: A1