WO2020073563A1 - Method and apparatus for processing an audio signal - Google Patents

Method and apparatus for processing an audio signal

Info

Publication number
WO2020073563A1
Authority
WO
WIPO (PCT)
Prior art keywords
head
audio signal
channel audio
processed
initial
Prior art date
Application number
PCT/CN2019/072948
Other languages
English (en)
French (fr)
Inventor
黄传增
Original Assignee
北京字节跳动网络技术有限公司 (Beijing ByteDance Network Technology Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 北京字节跳动网络技术有限公司 (Beijing ByteDance Network Technology Co., Ltd.)
Priority to JP2020545268A (JP7210602B2)
Priority to GB2100831.3A (GB2590256B)
Priority to US16/980,119 (US11425524B2)
Publication of WO2020073563A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/74 Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 1/00 Two-channel systems
    • H04S 1/002 Non-adaptive circuits, e.g. manually adjustable or static, for enhancing the sound image or the spatial distribution
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S 7/303 Tracking of listener position or orientation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/307 Frequency adjustment, e.g. tone control
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 2201/00 Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
    • H04R 2201/40 Details of arrangements for obtaining desired directional characteristic by combining a number of identical transducers covered by H04R1/40 but not provided for in any of its subgroups
    • H04R 2201/403 Linear arrays of transducers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 2499/00 Aspects covered by H04R or H04S not otherwise provided for in their subgroups
    • H04R 2499/10 General applications
    • H04R 2499/15 Transducers incorporated in visual displaying devices, e.g. televisions, computer displays, laptops
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/13 Aspects of volume control, not necessarily automatic, in stereophonic sound systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]

Definitions

  • Embodiments of the present disclosure relate to the field of computer technology, and in particular, to methods and devices for processing audio signals.
  • the embodiments of the present disclosure propose a method and apparatus for processing audio signals.
  • an embodiment of the present disclosure provides a method for processing an audio signal, the method including: acquiring a target user's head image and an audio signal to be processed; determining, based on the head image, the target user's head pose angle and the distance between a target sound source and the target user's head; and inputting the head pose angle, the distance, and the audio signal to be processed into a preset head-related transfer function to obtain a processed left-channel audio signal and a processed right-channel audio signal, wherein the head-related transfer function characterizes the correspondence of the head pose angle, the distance, and the audio signal to be processed to the processed left-channel audio signal and the processed right-channel audio signal.
  • determining the head pose angle of the target user based on the head image includes: inputting the head image into a pre-trained head pose recognition model to obtain the head pose angle of the target user, wherein the head pose recognition model characterizes the correspondence between a head image and the head pose angle of the user represented by that head image.
  • the head pose recognition model is pre-trained according to the following steps: acquiring multiple sample head images and, for each sample head image, a corresponding sample head pose angle; and, using a machine learning method, training the head pose recognition model by taking the sample head images as input and the corresponding sample head pose angles as the desired output.
  • determining the distance between the target sound source and the head of the target user includes: determining the size of the head image; and determining the distance between the target sound source and the target user's head based on a preset correspondence between head image size and distance.
  • the method further includes: obtaining a predetermined loudness difference between an initial left-channel audio signal and an initial right-channel audio signal as an initial loudness difference; and adjusting the loudness of the processed left-channel audio signal and the processed right-channel audio signal separately, so that the difference between the adjusted loudness difference of the two processed signals and the initial loudness difference is within a first preset range.
  • the method further includes: acquiring a predetermined binaural time difference between the initial left-channel audio signal and the initial right-channel audio signal as an initial binaural time difference; and adjusting the binaural time difference between the processed left-channel audio signal and the processed right-channel audio signal, so that the difference between the adjusted binaural time difference of the two processed signals and the initial binaural time difference is within a second preset range.
  • an embodiment of the present disclosure provides an apparatus for processing audio signals.
  • the apparatus includes: a first acquisition unit configured to acquire a target user's head image and an audio signal to be processed; a determination unit configured to determine, based on the head image, the head pose angle of the target user and the distance between the target sound source and the target user's head; and a processing unit configured to input the head pose angle, the distance, and the audio signal to be processed into a preset head-related transfer function to obtain a processed left-channel audio signal and a processed right-channel audio signal, where the head-related transfer function characterizes the correspondence between the head pose angle, the distance, and the audio signal to be processed and the processed left-channel and right-channel audio signals.
  • the determining unit includes: a recognition module configured to input the head image into a pre-trained head pose recognition model to obtain the head pose angle of the target user, wherein the head pose recognition model characterizes the correspondence between a head image and the head pose angle of the user represented by that head image.
  • the head pose recognition model is pre-trained according to the following steps: acquiring multiple sample head images and, for each sample head image, a corresponding sample head pose angle; and, using a machine learning method, training the head pose recognition model by taking the sample head images as input and the corresponding sample head pose angles as the desired output.
  • the determination unit includes: a first determination module configured to determine the size of the head image; and a second determination module configured to determine the distance between the target sound source and the target user's head based on a preset correspondence between head image size and distance.
  • the apparatus further includes: a second acquisition unit configured to acquire a predetermined loudness difference between the initial left-channel audio signal and the initial right-channel audio signal as the initial loudness difference; and a first adjustment unit configured to adjust the loudness of the processed left-channel audio signal and the processed right-channel audio signal separately, so that the difference between the adjusted loudness difference of the two processed signals and the initial loudness difference is within the first preset range.
  • the device further includes: a third acquisition unit configured to acquire a predetermined binaural time difference between the initial left-channel audio signal and the initial right-channel audio signal as the initial binaural time difference; and a second adjustment unit configured to adjust the binaural time difference between the processed left-channel audio signal and the processed right-channel audio signal, so that the difference between the adjusted binaural time difference of the two processed signals and the initial binaural time difference is within the second preset range.
  • an embodiment of the present disclosure provides a terminal device including: one or more processors; and a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors implement the method described in any one of the implementation manners of the first aspect.
  • an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, which when executed by a processor implements the method described in any one of the implementation manners of the first aspect.
  • in the method and apparatus for processing audio signals provided by the embodiments of the present disclosure, the target user's head image and the audio signal to be processed are acquired, the head image is used to determine the target user's head pose angle and the distance between the target sound source and the target user's head, and finally the head pose angle, the distance, and the audio signal to be processed are input into the preset head-related transfer function to obtain the processed left-channel audio signal and the processed right-channel audio signal. The audio signal is thus adjusted using the head image and the head-related transfer function, which increases the flexibility of processing the audio signal and helps simulate a near-real audio playback effect (a high-level pipeline sketch follows below).
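  • To make the flow concrete, here is a minimal sketch of the pipeline just summarized. It assumes a three-stage decomposition; the stage functions are passed in as parameters so the sketch stays self-contained, and their names (estimate_head_pose, estimate_source_distance, apply_hrtf) are illustrative placeholders rather than names from the disclosure. Candidate implementations of each stage are sketched later in this description.

```python
# Minimal pipeline sketch (an assumed decomposition, not verbatim from the patent).
import numpy as np

def process_audio(head_image, audio, hrir_db,
                  estimate_head_pose, estimate_source_distance, apply_hrtf):
    """head_image: HxWx3 array; audio: 1-D mono signal; hrir_db: HRIR lookup table."""
    pose = estimate_head_pose(head_image)            # (pitch, yaw[, roll]) in degrees
    distance = estimate_source_distance(head_image)  # source-to-head distance, meters
    # HRTF rendering produces the processed left/right channel signals.
    left, right = apply_hrtf(pose, distance, audio, hrir_db)
    return left, right
```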
  • FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure can be applied;
  • FIG. 2 is a flowchart of one embodiment of a method for processing audio signals according to an embodiment of the present disclosure
  • FIG. 3 is an exemplary schematic diagram of a head posture angle of a method for processing audio signals according to an embodiment of the present disclosure
  • FIG. 4 is another exemplary schematic diagram of a head posture angle of a method for processing audio signals according to an embodiment of the present disclosure
  • FIG. 5 is a schematic diagram of an application scenario of a method for processing audio signals according to an embodiment of the present disclosure
  • FIG. 6 is a flowchart of still another embodiment of a method for processing audio signals according to an embodiment of the present disclosure
  • FIG. 7 is a schematic structural diagram of an embodiment of an apparatus for processing audio signals according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of a terminal device suitable for implementing embodiments of the present disclosure.
  • FIG. 1 illustrates an exemplary system architecture 100 of a method for processing audio signals or an apparatus for processing audio signals to which embodiments of the present disclosure may be applied.
  • the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105.
  • the network 104 is a medium used to provide a communication link between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages, and so on.
  • Various communication client applications may be installed on the terminal devices 101, 102, and 103, such as audio playback applications, video playback applications, and social platform software.
  • the terminal devices 101, 102, and 103 may be hardware or software.
  • the terminal devices 101, 102, and 103 are hardware, they may be various electronic devices that support audio playback and include cameras.
  • the terminal devices 101, 102, and 103 are software, they can be installed in the above electronic device. It can be implemented as multiple software or software modules (for example, software or software modules used to provide distributed services), or as a single software or software module. There is no specific limit here.
  • the server 105 may be a server that provides various services, such as a background audio server that supports audio played on the terminal devices 101, 102, and 103.
  • the background audio server can send audio to the terminal device to play on the terminal device.
  • the method for processing audio signals provided by the embodiments of the present disclosure is generally performed by terminal devices 101, 102, and 103. Accordingly, the apparatus for processing audio signals may be provided in terminal devices 101, 102, 103.
  • the server can be hardware or software.
  • the server can be implemented as a distributed server cluster composed of multiple servers or as a single server.
  • the server is software, it may be implemented as multiple software or software modules (for example, software or software modules for providing distributed services), or as a single software or software module. There is no specific limit here.
  • terminal devices, networks, and servers in FIG. 1 are only schematic. According to the implementation needs, there can be any number of terminal devices, networks and servers. In the case where the head image and audio signal to be processed do not need to be acquired from a remote location, the above system architecture may not include the network and the server.
  • the method for processing audio signals includes the following steps:
  • Step 201 Acquire the head image of the target user and the audio signal to be processed.
  • the execution subject of the method for processing audio signals may acquire, remotely or locally, the target user's head image, the left-channel audio signal to be processed, and the right-channel audio signal to be processed, through a wired or wireless connection.
  • the target user may be a user within the shooting range of the camera on the terminal device shown in FIG. 1 (for example, a user using the terminal device shown in FIG. 1).
  • the audio signal to be processed may be an audio signal to be processed that is stored in the execution subject in advance.
  • the audio signal to be processed may be an audio segment of the audio currently being played on the execution subject that has not yet been played.
  • the duration of the audio clip may be a preset duration, such as 5 seconds, 10 seconds, and so on.
  • Step 202 based on the head image, determine the head posture angle of the target user, and determine the distance between the target sound source and the target user's head.
  • the above-mentioned execution subject may determine the head posture angle of the target user, and determine the distance between the target sound source and the target user's head.
  • the head pose angle can be used to characterize the deflection of the target user's face orientation relative to the camera used to capture the head image of the target user.
  • the head pose angle may include three angles, pitch, yaw, and roll, which respectively represent up-and-down rotation (nodding), left-and-right rotation (turning), and in-plane rotation (tilting) of the head.
  • the x-axis, y-axis, and z-axis are three axes of the rectangular coordinate system.
  • the z axis may be the optical axis of the camera on the terminal device 301
  • the y axis may be a straight line that passes through the center point of the top contour of the person's head and is perpendicular to the horizontal plane without the person's head turning sideways.
  • the pitch angle can be the angle of rotation of the face about the x axis
  • the yaw angle can be the angle of rotation of the face about the y axis
  • the roll angle can be the angle of rotation of the face about the z axis.
  • the determined head posture angle may not include the roll angle described above.
  • point A in the figure is the target sound source.
  • the target sound source is at the same position as the camera, and the determined head pose angle includes the yaw angle and the pitch angle.
  • the above-mentioned execution subject may perform head pose estimation on the two-dimensional head image according to various existing head pose estimation methods.
  • the method of head pose estimation may include but is not limited to the following methods: a method based on a machine learning model, a coordinate transformation method based on key points of a human face, and the like.
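  • As an illustration of the key-point-based coordinate transformation approach, the following sketch estimates pitch, yaw, and roll with OpenCV's solvePnP. The generic 3D landmark coordinates, the landmark ordering, and the approximate camera intrinsics are all assumptions made for the example, not values from the disclosure.

```python
import cv2
import numpy as np

# Rough 3D positions (in mm) of a few facial landmarks in a generic head model.
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),        # nose tip
    (0.0, -63.6, -12.5),    # chin
    (-43.3, 32.7, -26.0),   # left eye outer corner
    (43.3, 32.7, -26.0),    # right eye outer corner
    (-28.9, -28.9, -24.1),  # left mouth corner
    (28.9, -28.9, -24.1),   # right mouth corner
])

def head_pose_from_keypoints(image_points, frame_w, frame_h):
    """image_points: 6x2 array of detected 2D landmarks (same order as above)."""
    # Approximate camera intrinsics from the frame size (no lens distortion).
    focal = frame_w
    camera_matrix = np.array([[focal, 0, frame_w / 2],
                              [0, focal, frame_h / 2],
                              [0, 0, 1]], dtype=np.float64)
    ok, rvec, _tvec = cv2.solvePnP(
        MODEL_POINTS, np.asarray(image_points, dtype=np.float64),
        camera_matrix, None)
    rot, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    # Decompose into pitch / yaw / roll (x-, y-, z-axis rotations, in degrees).
    pitch = np.degrees(np.arctan2(rot[2, 1], rot[2, 2]))
    yaw = np.degrees(np.arctan2(-rot[2, 0], np.hypot(rot[2, 1], rot[2, 2])))
    roll = np.degrees(np.arctan2(rot[1, 0], rot[0, 0]))
    return pitch, yaw, roll
```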
  • the execution subject may determine the head gesture angle of the target user based on the head image according to the following steps:
  • the head pose recognition model is used to characterize the correspondence between the head image and the head gesture angle of the user represented by the head image.
  • the above-mentioned head posture recognition model may include a feature extraction part and a correspondence table.
  • the feature extraction part can be used to extract features from the head image to generate feature vectors.
  • the feature extraction part may be a convolutional neural network, deep neural network, or the like.
  • the correspondence relationship table may be a correspondence table pre-formulated by a technician based on statistics of a large number of feature vectors and head posture angles, and storing correspondence relationships between a plurality of feature vectors and head posture angles. In this way, the above-mentioned head pose recognition model can first use the feature extraction part to extract the features of the head image, thereby generating the target feature vector.
  • the target feature vector is compared with the feature vectors in the correspondence table in sequence; if a feature vector in the correspondence table is the same as or sufficiently similar to the target feature vector, the head pose angle that this feature vector maps to in the table is taken as the head pose angle of the target user, as sketched below.
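  • A minimal sketch of this table lookup, assuming cosine similarity as the "same or similar" test; the table contents, feature dimensionality, and threshold are illustrative placeholders, not values from the disclosure.

```python
import numpy as np

# Table rows: feature vectors paired with (pitch, yaw) angles, built offline.
TABLE_FEATURES = np.random.rand(100, 128)            # placeholder feature vectors
TABLE_ANGLES = np.random.uniform(-60, 60, (100, 2))  # placeholder pose angles

def lookup_pose(target_feature, threshold=0.9):
    # Cosine similarity between the target vector and every table entry.
    a = TABLE_FEATURES / np.linalg.norm(TABLE_FEATURES, axis=1, keepdims=True)
    b = target_feature / np.linalg.norm(target_feature)
    sims = a @ b
    best = int(np.argmax(sims))
    if sims[best] >= threshold:    # a "same or similar" feature vector was found
        return TABLE_ANGLES[best]  # the head pose angle that vector maps to
    return None                    # no sufficiently similar entry in the table
```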
  • the above-mentioned head pose recognition model may be obtained by the above-mentioned execution subject or another electronic device through training in the following steps: First, obtain multiple sample head images and, for each sample head image, the corresponding sample head pose angle.
  • the sample head posture angle is the head posture angle of the head of the person indicated by the sample head image, which is pre-labeled on the sample head image. Then, using the machine learning method, the sample head image among the multiple sample head images is used as input, and the sample head posture angle corresponding to the input sample head image is used as the desired output, and the head posture recognition model is trained.
  • the above-mentioned head posture recognition model may be a model obtained by training the initialized artificial neural network.
  • the initialized artificial neural network may be an untrained artificial neural network or an artificial neural network whose training has not been completed.
  • Each layer of the initialized artificial neural network may be set with initial parameters, and the parameters may be continuously adjusted during the training process of the artificial neural network (for example, the parameters are adjusted using a back propagation algorithm).
  • the initialized artificial neural network may be any of various types of untrained or not-fully-trained artificial neural networks.
  • the initialized artificial neural network may be a convolutional neural network (for example, it may include a convolutional layer, a pooling layer, a fully connected layer, etc.).
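  • A minimal training sketch along these lines: a small convolutional network regresses the pose angles, and its parameters are adjusted by backpropagation. The architecture, input size, and hyperparameters are illustrative assumptions, not values from the disclosure.

```python
import torch
import torch.nn as nn

# Small CNN regressing (pitch, yaw, roll) from a 64x64 RGB head image.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
    nn.Linear(64, 3),  # output: pitch, yaw, roll
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(sample_images, sample_angles):
    """sample_images: (N, 3, 64, 64) tensor; sample_angles: (N, 3) tensor of labels."""
    optimizer.zero_grad()
    pred = model(sample_images)          # forward pass
    loss = loss_fn(pred, sample_angles)  # desired output = labelled sample angles
    loss.backward()                      # backpropagation
    optimizer.step()                     # parameter update
    return loss.item()
```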
  • in this way, the head pose of the target user can be monitored in real time, and hardware such as a head-mounted device can be avoided, which simplifies the hardware structure and reduces hardware cost.
  • the above-mentioned execution subject may determine the distance between the target sound source and the target user's head based on the head image.
  • the above-mentioned execution subject may determine the distance between the target sound source and the target user's head according to the following steps:
  • the size of the head image may be the size of the head image area identified from the captured image using an existing target detection model (such as SSD (Single Shot MultiBox Detector) or DPM (Deformable Part Model)). The size of the area can be characterized in various ways; for example, it may be the length or width of the smallest rectangle enclosing the head image area, or the radius of the smallest circle enclosing the head image area.
  • the above-mentioned correspondence may be characterized by a preset correspondence table in which head image sizes and the corresponding distances are stored; the above-mentioned execution subject may then look up the distance corresponding to the determined head image size.
  • alternatively, the correspondence may be characterized by a preset conversion formula, and the execution subject may use the conversion formula to calculate the distance between the target sound source and the target user's head from the determined size of the head image, as in the sketch below.
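  • A minimal sketch of such a conversion formula, assuming a pinhole-camera relation in which the head's apparent width is inversely proportional to its distance; the average head width and the focal length constants are illustrative assumptions, not values from the disclosure.

```python
AVG_HEAD_WIDTH_M = 0.15   # assumed average physical head width, in meters
FOCAL_LENGTH_PX = 800.0   # assumed camera focal length, in pixels

def distance_from_head_width(head_width_px: float) -> float:
    """Estimate camera-to-head distance (m) from the head's width in pixels."""
    return FOCAL_LENGTH_PX * AVG_HEAD_WIDTH_M / head_width_px

# e.g. a head spanning 120 px is estimated at 800 * 0.15 / 120 = 1.0 m.
```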
  • the above-mentioned execution subject may use existing methods for determining face key points, determine face key points in the head image, and determine the size of the image area including the determined face key points.
  • the way of characterizing the size of this image area may be the same as in the above example.
  • the above-mentioned execution subject may determine the distance between the target sound source and the head of the target user based on the preset correspondence between the size of the image area and the distance.
  • the representation of the correspondence in this example may be the same as the above example, and will not be repeated here.
  • the target sound source may be an actual electronic device that outputs audio signals.
  • for example, the electronic device that outputs the audio signal may be the above-mentioned terminal device including the camera; alternatively, the target sound source may be a virtual sound source at a target position determined by the above-mentioned execution subject.
  • the distance between the target sound source and the target user's head may be the distance, determined as in the above example, between the electronic device outputting the audio signal and the target user's head; alternatively, the distance between the target sound source (that is, the virtual sound source) and the target user's head may be calculated from the determined distance (for example, by multiplying it by a preset coefficient or adding a preset distance to it).
  • Step 203 Input the head posture angle, distance, and audio signal to be processed into a preset head-related transfer function to obtain a processed left channel audio signal and a processed right channel audio signal.
  • the above-mentioned execution subject may input the head pose angle, the distance, and the audio signal to be processed into a preset head-related transfer function (Head Related Transfer Function, HRTF) to obtain the processed left-channel audio signal and the processed right-channel audio signal.
  • the head-related transfer function is used to characterize the correspondence between the head attitude angle, the distance, the audio signal to be processed and the processed left channel audio signal and the processed right channel audio signal.
  • the head-related transfer function (also called binaural transfer function) describes the transmission process of sound waves from the sound source to both ears. It is the result of comprehensive filtering of sound waves by human physiological structure (such as head, pinna and torso, etc.). Because the head-related transfer function contains information about sound source localization, it is very important for the study of binaural hearing and psychoacoustics. In practical applications, the use of headphones or speakers to output signals processed with head-related transfer functions can simulate various spatial auditory effects.
  • the HRTF may include two parts, namely a left HRTF and a right HRTF.
  • the above-mentioned execution subject may input the head pose angle, the determined distance, and the audio signal to be processed into the left HRTF and the right HRTF respectively; the left HRTF outputs the processed left-channel audio signal, and the right HRTF outputs the processed right-channel audio signal (see the sketch below).
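  • A minimal sketch of this rendering step, assuming the HRTFs are applied in the time domain as head-related impulse responses (HRIRs) selected from a measured database; the database layout and the 5-degree / 0.1 m quantization grid are illustrative assumptions, not details from the disclosure.

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_hrtf(pose, distance, audio, hrir_db):
    """pose: (pitch, yaw) in degrees; audio: 1-D float array (mono);
    hrir_db: dict mapping a quantized (pitch, yaw, distance) key to an
    (hrir_left, hrir_right) pair of 1-D impulse responses."""
    # Quantize the query so it hits a measured grid point (assumed 5 deg / 0.1 m grid).
    key = (round(pose[0] / 5) * 5, round(pose[1] / 5) * 5, round(distance, 1))
    hrir_left, hrir_right = hrir_db[key]
    # Convolve the mono input with each ear's impulse response.
    left = fftconvolve(audio, hrir_left, mode="full")[: len(audio)]
    right = fftconvolve(audio, hrir_right, mode="full")[: len(audio)]
    return left, right
```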
  • the processed left-channel audio signal and the processed right-channel audio signal may differ in loudness (interaural level difference, ILD) and in arrival time (interaural time difference, ITD).
  • loudness, also known as volume, describes the perceived intensity of the sound and represents the subjective perception of the human ear.
  • the unit of measurement is the sone: the loudness of a 1 kHz pure tone at a sound pressure level of 40 dB is defined as 1 sone.
  • Binaural time difference refers to the difference between the times at which sound from the source reaches the listener's two ears.
  • the above-mentioned execution subject may output the processed left channel audio signal and the processed right channel audio signal in various ways.
  • the processed left-channel audio signal and the processed right-channel audio signal can be played using headphones, speakers, and the like; alternatively, they can be output to a preset storage area for storage.
  • FIG. 5 is a schematic diagram of an application scenario of the method for processing audio signals according to this embodiment.
  • music is being played on the terminal device 501
  • the terminal device 501 first captures the head image 503 of the target user 502, and then the terminal device 501 acquires the audio signal 504 to be processed.
  • the to-be-processed audio signal 504 is a not-yet-played segment of the audio currently being played.
  • the terminal device 501 determines the head pose angle 505 of the target user based on the head image 503 (for example, using a pre-trained head pose recognition model to identify the head pose angle), and determines the distance 506 between the target sound source and the head of the target user 502 (for example, according to the correspondence between head image size and distance).
  • the target sound source is the terminal device 501.
  • the terminal device 501 inputs the head posture angle 505, the distance 506, and the audio signal 504 to be processed into a preset head-related transfer function 507 to obtain the processed left channel audio signal 508 and the processed right channel audio signal 509.
  • in the method provided by the above embodiment of the present disclosure, the target user's head image and the audio signal to be processed are acquired, the head image is used to determine the target user's head pose angle and the distance between the target sound source and the target user's head, and finally the head pose angle, the distance, and the audio signal to be processed are input into the preset head-related transfer function to obtain the processed left-channel audio signal and the processed right-channel audio signal. Adjusting the audio signal using the head image and the head-related transfer function in this way improves the flexibility of processing the audio signal and helps simulate a near-real audio playback effect.
  • FIG. 6 shows a flow 600 of yet another embodiment of a method for processing audio signals.
  • the process 600 of the method for processing audio signals includes the following steps:
  • Step 601 Acquire the head image of the target user and the audio signal to be processed.
  • step 601 is basically the same as step 201 in the embodiment corresponding to FIG. 2 and will not be repeated here.
  • Step 602 Based on the head image, determine the head pose angle of the target user, and determine the distance between the target sound source and the head of the target user.
  • step 602 is basically the same as step 202 in the embodiment corresponding to FIG. 2 and will not be repeated here.
  • Step 603 Input the head posture angle, distance and to-be-processed audio signal into a preset head-related transfer function to obtain the processed left channel audio signal and the processed right channel audio signal.
  • step 603 is basically the same as step 203 in the embodiment corresponding to FIG. 2 and will not be repeated here.
  • Step 604 Acquire a predetermined loudness difference between the initial left channel audio signal and the initial right channel audio signal as the initial loudness difference.
  • the execution subject of the method for processing audio signals may acquire the predetermined loudness difference between the initial left channel audio signal and the initial right channel audio signal.
  • the initial left channel audio signal and the initial right channel audio signal may be unprocessed audio signals pre-stored in the above-mentioned execution subject.
  • the unprocessed audio signal and the aforementioned audio signal to be processed may be generated based on the same audio file.
  • the initial left channel audio signal and the initial right channel audio signal may be audio signals extracted from an audio file
  • the to-be-processed audio signal may be an audio clip extracted from the audio file being played and not yet played.
  • the above-mentioned execution subject may determine the loudness of the initial left-channel audio signal and of the initial right-channel audio signal in advance, and take the difference between the two determined loudness values as the loudness difference between the initial left-channel audio signal and the initial right-channel audio signal. It should be noted that determining the loudness of an audio signal is a well-known technique that has been widely studied and applied, and is not described again here.
  • Step 605 Adjust the loudness of the processed left-channel audio signal and the processed right-channel audio signal separately, so that the difference between the adjusted loudness difference of the two processed signals and the initial loudness difference is within the first preset range.
  • the execution subject adjusts the loudness of the processed left-channel audio signal and the processed right-channel audio signal separately, so that the difference between the adjusted loudness difference of the two processed signals and the initial loudness difference is within the first preset range.
  • the first preset range may be a preset loudness-difference range, for example, 0 sone, ±1 sone, and so on.
  • for example, if the loudness of the initial left-channel audio signal is A and the loudness of the initial right-channel audio signal is B, the loudness of the processed left-channel audio signal is adjusted to be close to A and the loudness of the processed right-channel audio signal is adjusted to be close to B, so that the difference between the adjusted loudness difference and the initial loudness difference is within the first preset range.
  • in this way, the loudness difference between the processed left-channel audio signal and the processed right-channel audio signal can be restored to the initial loudness difference, which helps avoid sudden changes in the loudness of the audio signal when playing audio (a sketch follows below).
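  • A minimal sketch of this restoration step, using the RMS level in dB as a stand-in for loudness (the disclosure does not fix a particular loudness measure; a sone-scale loudness model could be substituted); tolerance handling for the first preset range is omitted for brevity.

```python
import numpy as np

def rms_db(x):
    """RMS level of a signal in dB (epsilon guards against log of zero)."""
    return 20 * np.log10(np.sqrt(np.mean(np.square(x))) + 1e-12)

def restore_level_difference(left, right, initial_diff_db):
    """Scale the right channel so that (left level - right level) returns
    to the initial difference."""
    current_diff = rms_db(left) - rms_db(right)
    correction_db = current_diff - initial_diff_db
    right_adjusted = right * (10 ** (correction_db / 20))
    return left, right_adjusted
```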
  • the foregoing execution subject may further perform the following steps:
  • the predetermined binaural time difference between the initial left channel audio signal and the initial right channel audio signal is obtained as the initial binaural time difference.
  • the initial left channel audio signal and the initial right channel audio signal are the same as the initial left channel audio signal and the initial right channel audio signal described in step 604, and will not be repeated here.
  • the above-mentioned execution subject may determine the binaural time difference between the initial left channel audio signal and the initial right channel audio signal according to the existing method for determining the binaural time difference between the left and right channels. It should be noted that the method of determining the binaural time difference between the left and right channels is a well-known technology that has been widely researched and applied at present, and will not be repeated here.
  • the second preset range may be a preset binaural-time-difference range, for example, 0 seconds, ±0.1 seconds, and so on.
  • the binaural time difference between the processed left-channel audio signal and the processed right-channel audio signal can be adjusted by adjusting the start playback times of the two signals.
  • in this way, the binaural time difference between the processed left-channel audio signal and the processed right-channel audio signal can be restored to the initial binaural time difference, which helps avoid sudden changes in the binaural time difference when playing audio and thus helps better simulate a real sound field (see the sketch below).
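  • A minimal sketch of this step, implementing the start-time adjustment as a sample-accurate shift of one channel; the sign convention and the rounding to whole samples are assumptions made for the example, and sub-sample interpolation is omitted.

```python
import numpy as np

def restore_itd(left, right, current_itd_s, initial_itd_s, sample_rate):
    """Shift the right channel's start time so the binaural time difference
    returns to its initial value. Convention assumed here: positive ITD
    means the left channel leads."""
    shift = int(round((initial_itd_s - current_itd_s) * sample_rate))
    if shift > 0:    # delay the right channel by `shift` samples
        right = np.concatenate([np.zeros(shift), right[:len(right) - shift]])
    elif shift < 0:  # advance the right channel by `-shift` samples
        right = np.concatenate([right[-shift:], np.zeros(-shift)])
    return left, right
```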
  • the process 600 of the method for processing audio signals in this embodiment highlights the steps of adjusting the loudness of the processed left-channel audio signal and the processed right-channel audio signal. Therefore, the solution described in this embodiment can restore the loudness difference between the processed left-channel and right-channel audio signals to the initial loudness difference, thereby helping to avoid sudden changes in the loudness of the audio signal when playing audio.
  • the present disclosure provides an embodiment of an apparatus for processing audio signals, which corresponds to the method embodiment shown in FIG. 2,
  • the device can be specifically applied to various electronic devices.
  • the apparatus 700 for processing audio signals of this embodiment includes: a first acquisition unit 701 configured to acquire a target user's head image and an audio signal to be processed; a determination unit 702 configured to determine, based on the head image, the head pose angle of the target user and the distance between the target sound source and the target user's head; and a processing unit 703 configured to input the head pose angle, the distance, and the audio signal to be processed into a preset head-related transfer function to obtain the processed left-channel audio signal and the processed right-channel audio signal, where the head-related transfer function characterizes the correspondence between the head pose angle, the distance, and the audio signal to be processed and the processed left-channel and right-channel audio signals.
  • the first acquiring unit 701 may acquire the head image of the target user and the left-channel to-be-processed audio signal and the right-channel to-be-processed audio signal remotely or locally through a wired connection or a wireless connection.
  • the target user may be a user within the shooting range of the camera on the terminal device shown in FIG. 1 (for example, a user using the terminal device shown in FIG. 1).
  • the audio signal to be processed may be an audio signal to be processed that is stored in the device 700 in advance.
  • the audio signal to be processed may be an audio segment that is currently included in the audio currently being played on the device 700 and has not been played.
  • the duration of the audio clip may be a preset duration, such as 5 seconds, 10 seconds, and so on.
  • the determining unit 702 may determine the head posture angle of the target user, and determine the distance between the target sound source and the target user's head.
  • the head pose angle can be used to characterize the deflection of the target user's face orientation relative to the camera used to capture the head image of the target user.
  • the above determination unit 702 may perform head pose estimation on the two-dimensional head image according to various existing head pose estimation methods.
  • the method of head pose estimation may include but is not limited to the following methods: a method based on a machine learning model, a coordinate transformation method based on key points of a human face, and the like.
  • the above determination unit 702 may determine the distance between the target sound source and the target user's head based on the head image.
  • the above-mentioned determination unit 702 may determine the key points of the face in the head image using the existing method for determining the key points of the face, and determine the size of the image area including the determined key points of the face. Then, the above determination unit 702 may determine the distance between the target sound source and the head of the target user based on the preset correspondence between the size of the image area and the distance.
  • the target sound source may be an actual electronic device that outputs audio signals.
  • for example, the electronic device that outputs the audio signal may be the above-mentioned terminal device including the camera; alternatively, the target sound source may be a virtual sound source at a target position determined by the above-mentioned execution subject.
  • the distance between the target sound source and the target user's head may be the distance, determined as in the above example, between the electronic device outputting the audio signal and the target user's head; alternatively, the distance between the target sound source (that is, the virtual sound source) and the target user's head may be calculated from the determined distance (for example, by multiplying it by a preset coefficient or adding a preset distance to it).
  • the processing unit 703 inputs the head pose angle, the distance, and the audio signal to be processed into a preset head-related transfer function (Head Related Transfer Function, HRTF) to obtain the processed left-channel audio signal and the processed right-channel audio signal.
  • the head-related transfer function is used to characterize the correspondence between the head attitude angle, the distance, the audio signal to be processed and the processed left channel audio signal and the processed right channel audio signal.
  • the head-related transfer function (also called binaural transfer function) describes the transmission process of sound waves from the sound source to both ears. It is the result of comprehensive filtering of sound waves by human physiological structure (such as head, pinna and torso, etc.). Because the head-related transfer function contains information about sound source localization, it is of great significance for the study of binaural hearing and psychoacoustics. In practical applications, the use of headphones or speakers to output signals processed with head-related transfer functions can simulate various spatial auditory effects.
  • the HRTF may include two parts, namely a left HRTF and a right HRTF.
  • the processing unit 703 may input the head pose angle, the determined distance, and the audio signal to be processed into the left HRTF and the right HRTF respectively; the left HRTF outputs the processed left-channel audio signal, and the right HRTF outputs the processed right-channel audio signal.
  • the processed left-channel audio signal and the processed right-channel audio signal may differ in loudness (interaural level difference, ILD) and in arrival time (interaural time difference, ITD).
  • loudness, also known as volume, describes the perceived intensity of the sound and represents the subjective perception of the human ear.
  • the unit of measurement is the sone: the loudness of a 1 kHz pure tone at a sound pressure level of 40 dB is defined as 1 sone.
  • Binaural time difference refers to the difference between the times at which sound from the source reaches the listener's two ears.
  • the determining unit 702 may include: a recognition module (not shown in the figure) configured to input the head image into a pre-trained head pose recognition model to obtain the head pose angle of the target user, wherein the head pose recognition model characterizes the correspondence between a head image and the head pose angle of the user represented by that head image.
  • the head pose recognition model is trained in advance as follows: acquiring multiple sample head images and, for each sample head image, the corresponding sample head pose angle; and, using a machine learning method, training the head pose recognition model by taking the sample head images as input and the corresponding sample head pose angles as the desired output.
  • the determining unit 702 may include: a first determining module (not shown in the figure) configured to determine the size of the head image; and a second determining module (not shown in the figure) configured to determine the distance between the target sound source and the target user's head based on a preset correspondence between head image size and distance.
  • the apparatus 700 may further include: a second acquisition unit (not shown in the figure) configured to acquire a predetermined loudness difference between the initial left-channel audio signal and the initial right-channel audio signal as the initial loudness difference; and a first adjustment unit (not shown in the figure) configured to adjust the loudness of the processed left-channel audio signal and the processed right-channel audio signal separately, so that the difference between the adjusted loudness difference of the two processed signals and the initial loudness difference is within the first preset range.
  • the apparatus 700 may further include: a third acquisition unit (not shown in the figure) configured to acquire a predetermined binaural time difference between the initial left-channel audio signal and the initial right-channel audio signal as the initial binaural time difference; and a second adjustment unit (not shown in the figure) configured to adjust the binaural time difference between the processed left-channel audio signal and the processed right-channel audio signal, so that the difference between the adjusted binaural time difference of the two processed signals and the initial binaural time difference is within the second preset range.
  • in the device provided by the above embodiments of the present disclosure, the target user's head image and the audio signal to be processed are acquired, the head image is used to determine the target user's head pose angle and the distance between the target sound source and the target user's head, and finally the head pose angle, the distance, and the audio signal to be processed are input into the preset head-related transfer function to obtain the processed left-channel audio signal and the processed right-channel audio signal. Adjusting the audio signal using the head image and the head-related transfer function in this way improves the flexibility of processing the audio signal and helps simulate a near-real audio playback effect.
  • Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (for example, car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
  • the terminal device shown in FIG. 8 is only an example, and should not bring any limitation to the functions and use scope of the embodiments of the present disclosure.
  • the terminal device 800 may include a processing device (such as a central processing unit or a graphics processor) 801, which can perform various appropriate operations and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from the storage device 808 into a random access memory (RAM) 803.
  • in the RAM 803, various programs and data necessary for the operation of the terminal device 800 are also stored.
  • the processing device 801, ROM 802, and RAM 803 are connected to each other via a bus 804.
  • An input / output (I / O) interface 805 is also connected to the bus 804.
  • the following devices can be connected to the I/O interface 805: input devices 806 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 807 such as a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 808 such as a magnetic tape or a hard disk; and a communication device 809.
  • the communication device 809 may allow the terminal device 800 to perform wireless or wired communication with other devices to exchange data.
  • although FIG. 8 shows a terminal device 800 having various devices, it should be understood that it is not required to implement or include all the devices shown; more or fewer devices may alternatively be implemented or provided.
  • the process described above with reference to the flowchart may be implemented as a computer software program.
  • embodiments of the present disclosure include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart.
  • the computer program may be downloaded and installed from the network through the communication device 809, or installed from the storage device 808, or installed from the ROM 802.
  • when the computer program is executed by the processing device 801, the above-described functions defined in the methods of the embodiments of the present disclosure are performed.
  • the computer-readable medium described in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: electrical connections with one or more wires, portable computer diskettes, hard drives, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
  • the computer-readable signal medium may include a data signal that is propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • the computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and the computer-readable signal medium may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
  • the program code contained on the computer readable medium may be transmitted using any appropriate medium, including but not limited to: electric wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
  • The above computer-readable medium may be included in the above terminal device, or it may exist alone without being assembled into the terminal device.
  • The computer-readable medium carries one or more programs. When the one or more programs are executed by the terminal device, the terminal device is caused to: acquire a head image of a target user and a to-be-processed audio signal; determine, based on the head image, a head pose angle of the target user, and determine the distance between a target sound source and the head of the target user; and input the head pose angle, the distance, and the to-be-processed audio signal into a preset head-related transfer function to obtain a processed left-channel audio signal and a processed right-channel audio signal, where the head-related transfer function is used to characterize the correspondence of head pose angle, distance, and to-be-processed audio signal to processed left-channel and right-channel audio signals.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar.
  • The program code may execute entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • The remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • Each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures; for example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams and/or flowcharts, and combinations of blocks therein, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • The units described in the embodiments of the present disclosure may be implemented in software or in hardware.
  • The name of a unit does not in some cases constitute a limitation on the unit itself.
  • For example, the first acquisition unit may also be described as "a unit that acquires a head image of a target user and a to-be-processed audio signal".

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Stereophonic System (AREA)

Abstract

Embodiments of the present disclosure disclose a method and device for processing an audio signal. A specific implementation of the method includes: acquiring a head image of a target user and a to-be-processed audio signal; determining, based on the head image, a head pose angle of the target user, and determining the distance between a target sound source and the head of the target user; and inputting the head pose angle, the distance, and the to-be-processed audio signal into a preset head-related transfer function to obtain a processed left-channel audio signal and a processed right-channel audio signal, where the head-related transfer function is used to characterize the correspondence of head pose angle, distance, and to-be-processed audio signal to processed left-channel and right-channel audio signals. This implementation improves the flexibility of audio signal processing and helps simulate a near-realistic audio playback effect.

Description

Method and device for processing audio signal
This patent application claims priority to Chinese Patent Application No. 201811190415.4, filed on October 12, 2018 by the applicant Beijing Microlive Vision Technology Co., Ltd. and entitled "Method and Device for Processing Audio Signal", the entire content of which is incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular to a method and device for processing an audio signal.
Background
As Internet technology and electronic technology become ever more closely integrated, users' demands on the intelligence and user-friendliness of electronic devices keep rising. Mobile phones and portable electronic terminals are increasingly widely used, and multimedia functions are among the applications users rely on most.
In the current field of audio processing, the usual way to simulate a near-realistic sound field is to adjust the loudness difference between the left and right channels and to adjust the interaural time difference between the left and right channels.
Summary
Embodiments of the present disclosure propose a method and device for processing an audio signal.
In a first aspect, embodiments of the present disclosure provide a method for processing an audio signal, the method including: acquiring a head image of a target user and a to-be-processed audio signal; determining, based on the head image, a head pose angle of the target user, and determining the distance between a target sound source and the head of the target user; and inputting the head pose angle, the distance, and the to-be-processed audio signal into a preset head-related transfer function to obtain a processed left-channel audio signal and a processed right-channel audio signal, where the head-related transfer function is used to characterize the correspondence of head pose angle, distance, and to-be-processed audio signal to processed left-channel and right-channel audio signals.
In some embodiments, determining the head pose angle of the target user based on the head image includes: inputting the head image into a pre-trained head pose recognition model to obtain the head pose angle of the target user, where the head pose recognition model is used to characterize the correspondence between a head image and the head pose angle of the user represented by that head image.
In some embodiments, the head pose recognition model is trained in advance as follows: acquiring a plurality of sample head images and the sample head pose angle corresponding to each sample head image; and, using a machine learning method, training the head pose recognition model with the sample head images as input and the corresponding sample head pose angles as expected output.
In some embodiments, determining the distance between the target sound source and the head of the target user includes: determining the size of the head image; and determining the distance between the target sound source and the head of the target user based on a preset correspondence between head image sizes and distances.
In some embodiments, after the processed left-channel audio signal and the processed right-channel audio signal are obtained, the method further includes: acquiring a predetermined loudness difference between an initial left-channel audio signal and an initial right-channel audio signal as an initial loudness difference; and adjusting the loudness of the processed left-channel audio signal and of the processed right-channel audio signal respectively, so that the difference between the loudness difference of the loudness-adjusted processed left-channel and right-channel audio signals and the initial loudness difference falls within a first preset range.
In some embodiments, the method further includes: acquiring a predetermined interaural time difference between the initial left-channel audio signal and the initial right-channel audio signal as an initial interaural time difference; and adjusting the interaural time difference between the processed left-channel audio signal and the processed right-channel audio signal, so that the difference between the interaural time difference of the adjusted processed left-channel and right-channel audio signals and the initial interaural time difference falls within a second preset range.
In a second aspect, embodiments of the present disclosure provide a device for processing an audio signal, the device including: a first acquisition unit configured to acquire a head image of a target user and a to-be-processed audio signal; a determination unit configured to determine, based on the head image, a head pose angle of the target user and the distance between a target sound source and the head of the target user; and a processing unit configured to input the head pose angle, the distance, and the to-be-processed audio signal into a preset head-related transfer function to obtain a processed left-channel audio signal and a processed right-channel audio signal, where the head-related transfer function is used to characterize the correspondence of head pose angle, distance, and to-be-processed audio signal to processed left-channel and right-channel audio signals.
In some embodiments, the determination unit includes: a recognition module configured to input the head image into a pre-trained head pose recognition model to obtain the head pose angle of the target user, where the head pose recognition model is used to characterize the correspondence between a head image and the head pose angle of the user represented by that head image.
In some embodiments, the head pose recognition model is trained in advance as follows: acquiring a plurality of sample head images and the sample head pose angle corresponding to each sample head image; and, using a machine learning method, training the head pose recognition model with the sample head images as input and the corresponding sample head pose angles as expected output.
In some embodiments, the determination unit includes: a first determination module configured to determine the size of the head image; and a second determination module configured to determine the distance between the target sound source and the head of the target user based on a preset correspondence between head image sizes and distances.
In some embodiments, the device further includes: a second acquisition unit configured to acquire a predetermined loudness difference between an initial left-channel audio signal and an initial right-channel audio signal as an initial loudness difference; and a first adjustment unit configured to adjust the loudness of the processed left-channel audio signal and of the processed right-channel audio signal respectively, so that the difference between the loudness difference of the loudness-adjusted processed left-channel and right-channel audio signals and the initial loudness difference falls within a first preset range.
In some embodiments, the device further includes: a third acquisition unit configured to acquire a predetermined interaural time difference between the initial left-channel audio signal and the initial right-channel audio signal as an initial interaural time difference; and a second adjustment unit configured to adjust the interaural time difference between the processed left-channel audio signal and the processed right-channel audio signal, so that the difference between the adjusted interaural time difference and the initial interaural time difference falls within a second preset range.
In a third aspect, embodiments of the present disclosure provide a terminal device including: one or more processors; and a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method described in any implementation of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, where the computer program, when executed by a processor, implements the method described in any implementation of the first aspect.
The method and device for processing an audio signal provided by the embodiments of the present disclosure acquire a head image of a target user and a to-be-processed audio signal, then use the head image to determine the head pose angle of the target user and the distance between a target sound source and the target user's head, and finally input the head pose angle, the distance, and the to-be-processed audio signal into a preset head-related transfer function to obtain a processed left-channel audio signal and a processed right-channel audio signal. The audio signal is thus adjusted using the head image and the head-related transfer function, which improves the flexibility of audio signal processing and helps simulate a near-realistic audio playback effect.
Brief Description of the Drawings
Other features, objects, and advantages of the present disclosure will become more apparent upon reading the following detailed description of non-limiting embodiments, made with reference to the accompanying drawings:
FIG. 1 is a diagram of an exemplary system architecture to which an embodiment of the present disclosure may be applied;
FIG. 2 is a flowchart of an embodiment of a method for processing an audio signal according to an embodiment of the present disclosure;
FIG. 3 is an exemplary schematic diagram of head pose angles for the method for processing an audio signal according to an embodiment of the present disclosure;
FIG. 4 is another exemplary schematic diagram of head pose angles for the method for processing an audio signal according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an application scenario of the method for processing an audio signal according to an embodiment of the present disclosure;
FIG. 6 is a flowchart of another embodiment of the method for processing an audio signal according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an embodiment of a device for processing an audio signal according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a terminal device suitable for implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely explain the relevant disclosure and do not limit it. It should also be noted that, for ease of description, only the parts relevant to the disclosure are shown in the drawings.
It should be noted that, where no conflict arises, the embodiments of the present disclosure and the features in the embodiments may be combined with one another. The present disclosure is described in detail below with reference to the drawings and in conjunction with the embodiments.
FIG. 1 shows an exemplary system architecture 100 to which the method or device for processing an audio signal of embodiments of the present disclosure may be applied.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is the medium providing communication links between the terminal devices 101, 102, and 103 and the server 105, and may include various connection types such as wired links, wireless links, or fiber-optic cables.
A user may use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications, such as audio playback applications, video playback applications, and social platform software, may be installed on the terminal devices 101, 102, and 103.
The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be any of various electronic devices that support audio playback and include a camera. When they are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, software or modules for providing distributed services) or as a single piece of software or a single module; no specific limitation is imposed here.
The server 105 may be a server providing various services, for example a back-end audio server that supports the audio played on the terminal devices 101, 102, and 103. The back-end audio server may send audio to a terminal device for playback on that device.
It should be noted that the method for processing an audio signal provided by the embodiments of the present disclosure is generally executed by the terminal devices 101, 102, and 103; accordingly, the device for processing an audio signal may be provided in the terminal devices 101, 102, and 103.
It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, software or modules for providing distributed services) or as a single piece of software or a single module; no specific limitation is imposed here.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; there may be any number of terminal devices, networks, and servers as required by the implementation. When the head image and the audio signal to be processed need not be obtained remotely, the system architecture may omit the network and the server.
With continued reference to FIG. 2, a flow 200 of an embodiment of the method for processing an audio signal according to the present disclosure is shown. The method for processing an audio signal includes the following steps:
Step 201: acquire a head image of a target user and a to-be-processed audio signal.
In this embodiment, the execution subject of the method for processing an audio signal (for example, the terminal device shown in FIG. 1) may acquire, remotely or locally via a wired or wireless connection, the head image of the target user as well as the left-channel and right-channel audio signals to be processed. The target user may be a user within the shooting range of the camera of the terminal device shown in FIG. 1 (for example, the user of that terminal device). The to-be-processed audio signal may be an audio signal stored in advance in the execution subject that is to be processed. As an example, the to-be-processed audio signal may be a not-yet-played segment of the audio currently playing on the execution subject; the duration of the segment may be a preset duration, for example 5 seconds or 10 seconds.
Step 202: based on the head image, determine a head pose angle of the target user, and determine the distance between a target sound source and the head of the target user.
In this embodiment, based on the head image acquired in step 201, the execution subject may determine the head pose angle of the target user and the distance between the target sound source and the target user's head. The head pose angle may characterize how far the frontal orientation of the target user's face is deflected relative to the camera that captured the head image.
In practice, the head pose angle may include three angles, pitch, yaw, and roll, which respectively represent rotation up and down, rotation left and right, and rotation within the horizontal plane. As shown in FIG. 3, the x axis, y axis, and z axis are the three axes of a rectangular coordinate system, where the z axis may be the optical axis of the camera of the terminal device 301, and the y axis may be the straight line that passes through the center of the top contour of the person's head and is perpendicular to the horizontal plane when the head is not turned sideways. The pitch angle may be the angle by which the face rotates about the x axis, the yaw angle the angle by which it rotates about the y axis, and the roll angle the angle by which it rotates about the z axis. In the rectangular coordinate system of FIG. 3, when the person's head turns, a ray is determined that starts at the origin of the coordinate system and passes through the midpoint of the line connecting the centers of the person's two eyeballs; the angles between this ray and the x, y, and z axes may be determined as the head pose angles (one way such angles could be computed is sketched below).
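Purely as an illustration of the geometry above, and not part of the original disclosure, the following minimal sketch shows one common way to obtain yaw and pitch from the direction vector of such a ray under the stated axis convention; the function name and the example vector are assumptions:

import numpy as np

def pitch_yaw_from_direction(direction):
    # Normalize the ray direction; z is the camera's optical axis,
    # y points up, x points sideways, as in FIG. 3.
    x, y, z = direction / np.linalg.norm(direction)
    yaw = np.degrees(np.arctan2(x, z))                 # rotation about the y axis
    pitch = np.degrees(np.arctan2(y, np.hypot(x, z)))  # rotation about the x axis
    return pitch, yaw

# A head turned slightly to one side and tilted slightly upward:
print(pitch_yaw_from_direction(np.array([0.2, 0.1, 1.0])))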
It should be noted that, in this embodiment, the determined head pose angles may exclude the roll angle mentioned above. As shown in FIG. 4, point A in the figure is the target sound source, which is located at the same position as the camera, and the determined head pose angles include θ (the yaw angle) and φ (the pitch angle).
It should also be noted that the execution subject may perform head pose estimation on the two-dimensional head image using any of various existing head pose estimation methods, including but not limited to methods based on machine learning models and coordinate-transformation methods based on facial key points.
In some optional implementations of this embodiment, the execution subject may determine the head pose angle of the target user based on the head image as follows:
Input the head image into a pre-trained head pose recognition model to obtain the head pose angle of the target user, where the head pose recognition model is used to characterize the correspondence between a head image and the head pose angle of the user represented by that head image.
As an example, the head pose recognition model may include a feature extraction part and a correspondence table. The feature extraction part may extract features from the head image to generate a feature vector; for example, it may be a convolutional neural network or a deep neural network. The correspondence table may be a table, compiled in advance by technicians from statistics on a large number of feature vectors and head pose angles, that stores the correspondences between multiple feature vectors and head pose angles. The head pose recognition model may thus first use the feature extraction part to extract features of the head image and generate a target feature vector, and then compare the target feature vector with the feature vectors in the correspondence table one by one; if a feature vector in the table is identical or similar to the target feature vector, the head pose angle corresponding to that feature vector in the table is taken as the head pose angle of the target user (a sketch of this lookup is given below).
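A minimal sketch of this feature-vector lookup, assuming a fixed random projection as a stand-in for the CNN/DNN feature extractor and placeholder table entries; every name and value here is illustrative rather than taken from the disclosure:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correspondence table: unit feature vectors paired with
# (pitch, yaw) angles; real entries would come from the statistics above.
TABLE_FEATURES = rng.standard_normal((100, 128))
TABLE_FEATURES /= np.linalg.norm(TABLE_FEATURES, axis=1, keepdims=True)
TABLE_ANGLES = rng.uniform(-60.0, 60.0, (100, 2))

# Stand-in for the feature extraction part: project the flattened
# 64x64 grayscale head image to a 128-dimensional unit vector.
PROJECTION = rng.standard_normal((64 * 64, 128))

def extract_features(head_image):
    v = head_image.reshape(-1) @ PROJECTION
    return v / np.linalg.norm(v)

def look_up_pose(head_image):
    target = extract_features(head_image)
    similarities = TABLE_FEATURES @ target          # cosine similarity per entry
    return TABLE_ANGLES[int(np.argmax(similarities))]

print(look_up_pose(rng.random((64, 64))))           # (pitch, yaw) of best match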
In some optional implementations of this embodiment, the head pose recognition model may be trained in advance by the execution subject or another electronic device as follows. First, a plurality of sample head images and the sample head pose angle corresponding to each sample head image are acquired, where a sample head pose angle is the head pose angle, annotated in advance, of the head of the person indicated by the sample head image. Then, using a machine learning method, the sample head images are taken as input and the corresponding sample head pose angles as expected output to train the head pose recognition model.
The head pose recognition model may be obtained by training an initialized artificial neural network. The initialized artificial neural network may be an untrained or not-fully-trained artificial neural network whose layers are set with initial parameters that can be continually adjusted during training (for example, using the back-propagation algorithm). The initialized artificial neural network may be any type of untrained or not-fully-trained artificial neural network; for example, it may be a convolutional neural network (which may include convolutional layers, pooling layers, fully connected layers, and so on). A minimal training sketch follows.
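A minimal sketch of such training, assuming TensorFlow/Keras, an arbitrary small CNN, a 64x64 grayscale input, and randomly generated stand-in data; the architecture and all sizes are assumptions for illustration only:

import numpy as np
import tensorflow as tf

# Small CNN regressor from a head image to (pitch, yaw, roll).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3),                      # pitch, yaw, roll
])
# Mean-squared error against the annotated sample angles; fitting runs
# back-propagation, the parameter-adjustment step mentioned above.
model.compile(optimizer="adam", loss="mse")

sample_images = np.random.rand(8, 64, 64, 1).astype("float32")        # stand-in data
sample_angles = np.random.uniform(-60, 60, (8, 3)).astype("float32")
model.fit(sample_images, sample_angles, epochs=2, batch_size=4, verbose=0)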
By determining the head pose angle from the head image of the target user, the target user's head pose can be monitored in real time without resorting to hardware such as head-mounted devices, thereby simplifying the hardware structure and reducing hardware cost.
In this embodiment, the execution subject may determine the distance between the target sound source and the head of the target user based on the head image.
As one example, the execution subject may determine the distance between the target sound source and the head of the target user as follows:
First, determine the size of the head image. As an example, the size of the head image may be the size of the head image region recognized from the head image by the execution subject using an existing object detection model (for example, SSD (Single Shot MultiBox Detector) or DPM (Deformable Part Model)). The size may be characterized in various ways, for example as the length or width of the smallest rectangle containing the head image region, or the radius of the smallest circle containing it.
Then, determine the distance between the target sound source and the head of the target user based on a preset correspondence between head image sizes and distances. Specifically, as an example, the correspondence may be characterized by a preset correspondence table storing head image sizes and the corresponding distances, from which the execution subject can look up the distance corresponding to the determined head image size. As another example, the correspondence may be characterized by a preset conversion formula with which the execution subject computes the distance from the determined head image size; for example, the conversion formula may be y = kx, where k is a preset ratio, x is the size of the head image, and y is the distance between the target sound source and the head of the user represented by the head image. A sketch of the table-based variant follows.
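A sketch of the table-based lookup, assuming linear interpolation between neighboring table entries; the sizes and distances below are placeholder values, not from the disclosure:

import numpy as np

# Hypothetical size-to-distance table: width of the detected head region
# in pixels versus distance in metres.
HEAD_WIDTHS_PX = np.array([60.0, 90.0, 120.0, 180.0, 240.0])
DISTANCES_M = np.array([1.20, 0.80, 0.60, 0.40, 0.30])

def distance_from_head_size(head_width_px):
    # Interpolate the distance for a measured head-image size.
    return float(np.interp(head_width_px, HEAD_WIDTHS_PX, DISTANCES_M))

print(distance_from_head_size(100.0))   # roughly 0.73 m with the table above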
As another example, the execution subject may use an existing method for determining facial key points to determine the facial key points in the head image and the size of the image region containing them, where the size may be characterized in the same way as in the example above. The execution subject may then determine the distance between the target sound source and the head of the target user based on a preset correspondence between image region sizes and distances; the correspondence may be characterized in the same way as in the example above, which is not repeated here.
It should be noted that the target sound source may be the actual electronic device outputting the audio signal (usually the terminal device including the camera mentioned above), or it may be a virtual sound source located at a target position determined by the execution subject. Accordingly, the distance between the target sound source and the target user's head may be the distance, determined as in the example above, between the audio-outputting electronic device and the target user's head; or it may be obtained by computing on that determined distance (for example, multiplying it by a preset coefficient or adding a preset distance) to yield the distance between the target sound source (that is, the virtual sound source) and the target user's head.
Step 203: input the head pose angle, the distance, and the to-be-processed audio signal into a preset head-related transfer function to obtain a processed left-channel audio signal and a processed right-channel audio signal.
In this embodiment, the execution subject may input the head pose angle, the distance, and the to-be-processed audio signal into a preset head-related transfer function (HRTF) to obtain the processed left-channel and right-channel audio signals, where the head-related transfer function is used to characterize the correspondence of head pose angle, distance, and to-be-processed audio signal to processed left-channel and right-channel audio signals.
Specifically, the head-related transfer function (also known as the binaural transfer function) describes the transmission of sound waves from a sound source to the two ears. It is the result of the comprehensive filtering of sound waves by human physiological structures (such as the head, the pinnae, and the torso). Because the head-related transfer function contains information about sound source localization, it is of great significance to research on binaural hearing and psychoacoustics. In practical applications, outputting signals processed with head-related transfer functions through earphones or loudspeakers can simulate a variety of spatial auditory effects.
Generally, the HRTF may include two parts, a left HRTF and a right HRTF. The execution subject may input the head pose angle, the determined distance, and the to-be-processed audio signal into the left HRTF and the right HRTF respectively; the left HRTF outputs the processed left-channel audio signal, and the right HRTF outputs the processed right-channel audio signal. In practice, the processed left-channel and right-channel audio signals may exhibit interaural level differences (ILD) and an interaural time difference (ITD). Loudness, also called volume, describes how loud a sound is and represents the human ear's subjective perception of sound; its unit is the sone, with 1 sone defined as the loudness of a 1 kHz pure tone at a sound pressure level of 40 dB. The interaural time difference is the difference between the arrival times of sound from a source at the listener's two ears. Through the processing of the steps above, the loudness difference and the interaural time difference between the processed left-channel and right-channel audio signals can be brought close to a real scene, helping simulate a near-realistic audio playback effect.
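A minimal rendering sketch, assuming the left and right HRTFs are available as a measured impulse-response pair (HRIRs) selected from a database by the head pose angles, and approximating the distance dependence with a simple 1/r gain; both assumptions go beyond what the disclosure itself specifies:

import numpy as np
from scipy.signal import fftconvolve

def render_binaural(pending_audio, hrir_left, hrir_right, distance, ref_distance=1.0):
    # Convolve the mono signal with each ear's impulse response and
    # attenuate with a crude 1/r law relative to a reference distance.
    gain = ref_distance / max(float(distance), 1e-3)
    left = gain * fftconvolve(pending_audio, hrir_left, mode="full")
    right = gain * fftconvolve(pending_audio, hrir_right, mode="full")
    return left, right

# Toy usage: white noise through two made-up 256-tap impulse responses.
rng = np.random.default_rng(0)
mono = rng.standard_normal(48000)
left, right = render_binaural(mono, rng.standard_normal(256) * 0.05,
                              rng.standard_normal(256) * 0.05, distance=0.6)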
Optionally, after obtaining the processed left-channel and right-channel audio signals, the execution subject may output them in various ways, for example playing them through devices such as earphones or loudspeakers, or writing them to a preset storage area.
With continued reference to FIG. 5, FIG. 5 is a schematic diagram of an application scenario of the method for processing an audio signal according to this embodiment. In the application scenario of FIG. 5, music is playing on the terminal device 501. The terminal device 501 first captures a head image 503 of the target user 502 and then acquires a to-be-processed audio signal 504, where the to-be-processed audio signal 504 is the not-yet-played segment of the currently playing audio. Then, based on the head image 503, the terminal device 501 determines the head pose angle 505 of the target user (for example, using a pre-trained head pose recognition model) and the distance 506 between the target sound source and the head of the target user 502 (for example, from the correspondence between head image sizes and distances), the target sound source being the terminal device 501 itself. Finally, the terminal device 501 inputs the head pose angle 505, the distance 506, and the to-be-processed audio signal 504 into a preset head-related transfer function 507 to obtain a processed left-channel audio signal 508 and a processed right-channel audio signal 509.
In the method provided by the above embodiment of the present disclosure, a head image of a target user and a to-be-processed audio signal are acquired; the head image is used to determine the target user's head pose angle and the distance between a target sound source and the target user's head; and the head pose angle, the distance, and the to-be-processed audio signal are input into a preset head-related transfer function to obtain processed left-channel and right-channel audio signals. The audio signal is thus adjusted using the head image and the head-related transfer function, which improves the flexibility of audio signal processing and helps simulate a near-realistic audio playback effect.
With further reference to FIG. 6, a flow 600 of another embodiment of the method for processing an audio signal is shown. The flow 600 of the method for processing an audio signal includes the following steps:
Step 601: acquire a head image of a target user and a to-be-processed audio signal.
In this embodiment, step 601 is substantially the same as step 201 in the embodiment corresponding to FIG. 2 and is not repeated here.
Step 602: based on the head image, determine a head pose angle of the target user, and determine the distance between a target sound source and the head of the target user.
In this embodiment, step 602 is substantially the same as step 202 in the embodiment corresponding to FIG. 2 and is not repeated here.
Step 603: input the head pose angle, the distance, and the to-be-processed audio signal into a preset head-related transfer function to obtain a processed left-channel audio signal and a processed right-channel audio signal.
In this embodiment, step 603 is substantially the same as step 203 in the embodiment corresponding to FIG. 2 and is not repeated here.
Step 604: acquire a predetermined loudness difference between an initial left-channel audio signal and an initial right-channel audio signal as an initial loudness difference.
In this embodiment, the execution subject of the method for processing an audio signal (for example, the terminal device shown in FIG. 1) may acquire the predetermined loudness difference between the initial left-channel and right-channel audio signals. The initial left-channel and right-channel audio signals may be unprocessed audio signals stored in advance in the execution subject; the unprocessed audio signals and the to-be-processed audio signal may be generated from the same audio file. For example, the initial left-channel and right-channel audio signals may be audio signals extracted from a certain audio file, while the to-be-processed audio signal may be the not-yet-played segment extracted from that audio file as it is being played.
In this embodiment, the execution subject may determine in advance the loudness of the initial left-channel audio signal and of the initial right-channel audio signal respectively, and take the difference between the two as the loudness difference between the initial left-channel and right-channel audio signals. It should be noted that methods for determining the loudness of an audio signal are well-known techniques that are widely researched and applied at present and are not described here.
Step 605: adjust the loudness of the processed left-channel audio signal and of the processed right-channel audio signal respectively, so that the difference between the loudness difference of the loudness-adjusted processed left-channel and right-channel audio signals and the initial loudness difference falls within a first preset range.
In this embodiment, the execution subject adjusts the loudness of the processed left-channel and right-channel audio signals respectively so that the difference between their adjusted loudness difference and the initial loudness difference falls within the first preset range, where the first preset range may be a preset loudness-difference range, for example 0 sone or ±1 sone.
As an example, assuming the loudness of the initial left-channel audio signal is A and the loudness of the initial right-channel audio signal is B, the loudness of the processed left-channel audio signal is adjusted to be close to A and the loudness of the processed right-channel audio signal to be close to B, so that the difference between the adjusted loudness difference and the initial loudness difference falls within the first preset range.
By adjusting the loudness of the processed left-channel and right-channel audio signals, their loudness difference can be restored to the initial loudness difference, which helps avoid abrupt changes in loudness during audio playback (a sketch of such a restoration follows).
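A sketch of such a restoration, using RMS level in dB as a stand-in for loudness in sone (an assumption for illustration; the disclosure does not fix the loudness measure or the adjustment rule):

import numpy as np

def rms_db(signal):
    # RMS level in dB, a simple proxy for perceived loudness.
    return 20.0 * np.log10(np.sqrt(np.mean(signal ** 2)) + 1e-12)

def restore_level_difference(left, right, initial_left, initial_right):
    # Rescale the processed channels so their level difference matches the
    # initial one, splitting the correction evenly between the channels.
    target = rms_db(initial_left) - rms_db(initial_right)   # initial loudness difference
    current = rms_db(left) - rms_db(right)
    correction = (target - current) / 2.0
    return left * 10 ** (correction / 20.0), right * 10 ** (-correction / 20.0)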
In some optional implementations of this embodiment, after step 603 above, the execution subject may further perform the following steps:
First, acquire a predetermined interaural time difference between the initial left-channel audio signal and the initial right-channel audio signal as an initial interaural time difference. Specifically, the initial left-channel and right-channel audio signals are the same as those described in step 604 and are not repeated here. The execution subject may determine in advance the interaural time difference between the initial left-channel and right-channel audio signals according to existing methods for determining the interaural time difference between left and right channels. It should be noted that such methods are well-known techniques that are widely researched and applied at present and are not described here.
Then, adjust the interaural time difference between the processed left-channel audio signal and the processed right-channel audio signal, so that the difference between the adjusted interaural time difference and the initial interaural time difference falls within a second preset range, where the second preset range may be a preset interaural-time-difference range, for example 0 seconds or ±0.1 seconds.
As an example, the interaural time difference between the processed left-channel and right-channel audio signals can be adjusted by adjusting their starting playback times (see the sketch after this paragraph). Adjusting the interaural time difference restores it to the initial interaural time difference, which helps avoid abrupt changes in the interaural time difference during playback and better simulates a real sound field. As can be seen from FIG. 6, compared with the embodiment corresponding to FIG. 2, the flow 600 of the method for processing an audio signal in this embodiment highlights the step of adjusting the loudness of the processed left-channel and right-channel audio signals. The scheme described in this embodiment can thus restore the loudness of the processed left-channel and right-channel audio signals to the initial loudness, helping avoid abrupt changes in loudness during audio playback.
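A sketch of the start-time adjustment, assuming sample-aligned channels of equal length and a fixed sample rate; the function name and padding scheme are illustrative assumptions:

import numpy as np

def set_start_time_offset(left, right, itd_seconds, sample_rate=48000):
    # Shift the start of one channel so the interaural time difference is
    # itd_seconds (positive: the right channel starts later). Both channels
    # are zero-padded so they keep equal length.
    shift = int(round(abs(itd_seconds) * sample_rate))
    pad = np.zeros(shift, dtype=left.dtype)
    if itd_seconds >= 0:
        right = np.concatenate([pad, right])
        left = np.concatenate([left, pad])
    else:
        left = np.concatenate([pad, left])
        right = np.concatenate([right, pad])
    return left, right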
With further reference to FIG. 7, as an implementation of the methods shown in the figures above, the present disclosure provides an embodiment of a device for processing an audio signal. This device embodiment corresponds to the method embodiment shown in FIG. 2, and the device may specifically be applied to various electronic devices.
As shown in FIG. 7, the device 700 for processing an audio signal of this embodiment includes: a first acquisition unit 701 configured to acquire a head image of a target user and a to-be-processed audio signal; a determination unit 702 configured to determine, based on the head image, a head pose angle of the target user and the distance between a target sound source and the head of the target user; and a processing unit 703 configured to input the head pose angle, the distance, and the to-be-processed audio signal into a preset head-related transfer function to obtain a processed left-channel audio signal and a processed right-channel audio signal, where the head-related transfer function is used to characterize the correspondence of head pose angle, distance, and to-be-processed audio signal to processed left-channel and right-channel audio signals.
In this embodiment, the first acquisition unit 701 may acquire, remotely or locally via a wired or wireless connection, the head image of the target user as well as the left-channel and right-channel audio signals to be processed. The target user may be a user within the shooting range of the camera of the terminal device shown in FIG. 1 (for example, the user of that terminal device). The to-be-processed audio signal may be an audio signal stored in advance in the device 700 that is to be processed. As an example, it may be a not-yet-played segment of the audio currently playing on the device 700, whose duration may be a preset duration, for example 5 seconds or 10 seconds.
In this embodiment, the determination unit 702 may determine the head pose angle of the target user and the distance between the target sound source and the target user's head, where the head pose angle may characterize how far the frontal orientation of the target user's face is deflected relative to the camera that captured the head image.
It should be noted that the determination unit 702 may perform head pose estimation on the two-dimensional head image using any of various existing head pose estimation methods, including but not limited to methods based on machine learning models and coordinate-transformation methods based on facial key points.
In this embodiment, the determination unit 702 may determine the distance between the target sound source and the target user's head based on the head image. As an example, the determination unit 702 may use an existing method for determining facial key points to determine the facial key points in the head image and the size of the image region containing them, and may then determine the distance between the target sound source and the target user's head based on a preset correspondence between image region sizes and distances.
It should be noted that the target sound source may be the actual electronic device outputting the audio signal (usually the terminal device including the camera mentioned above), or it may be a virtual sound source located at a target position determined by the execution subject. Accordingly, the distance between the target sound source and the target user's head may be the distance, determined as in the example above, between the audio-outputting electronic device and the target user's head; or it may be obtained by computing on that determined distance (for example, multiplying it by a preset coefficient or adding a preset distance) to yield the distance between the target sound source (that is, the virtual sound source) and the target user's head.
In this embodiment, the processing unit 703 inputs the head pose angle, the distance, and the to-be-processed audio signal into a preset head-related transfer function (HRTF) to obtain the processed left-channel and right-channel audio signals, where the head-related transfer function is used to characterize the correspondence of head pose angle, distance, and to-be-processed audio signal to processed left-channel and right-channel audio signals.
Specifically, the head-related transfer function (also known as the binaural transfer function) describes the transmission of sound waves from a sound source to the two ears. It is the result of the comprehensive filtering of sound waves by human physiological structures (such as the head, the pinnae, and the torso). Because it contains information about sound source localization, it is of great significance to research on binaural hearing and psychoacoustics. In practical applications, outputting signals processed with head-related transfer functions through earphones or loudspeakers can simulate a variety of spatial auditory effects.
Generally, the HRTF may include two parts, a left HRTF and a right HRTF. The processing unit 703 may input the head pose angle, the determined distance, and the to-be-processed audio signal into the left HRTF and the right HRTF respectively; the left HRTF outputs the processed left-channel audio signal, and the right HRTF outputs the processed right-channel audio signal. In practice, the processed left-channel and right-channel audio signals may exhibit interaural level differences (ILD) and an interaural time difference (ITD). Loudness, also called volume, describes how loud a sound is and represents the human ear's subjective perception of sound; its unit is the sone, with 1 sone defined as the loudness of a 1 kHz pure tone at a sound pressure level of 40 dB. The interaural time difference is the difference between the arrival times of sound from a source at the listener's two ears. Through the processing of the steps above, the loudness difference and the interaural time difference between the processed left-channel and right-channel audio signals can be brought close to a real scene, helping simulate a near-realistic audio playback effect.
In some optional implementations of this embodiment, the determination unit 702 may include: a recognition module (not shown) configured to input the head image into a pre-trained head pose recognition model to obtain the head pose angle of the target user, where the head pose recognition model is used to characterize the correspondence between a head image and the head pose angle of the user represented by that head image.
In some optional implementations of this embodiment, the head pose recognition model is trained in advance as follows: acquiring a plurality of sample head images and the sample head pose angle corresponding to each sample head image; and, using a machine learning method, training the head pose recognition model with the sample head images as input and the corresponding sample head pose angles as expected output.
In some optional implementations of this embodiment, the determination unit 702 may include: a first determination module (not shown) configured to determine the size of the head image; and a second determination module (not shown) configured to determine the distance between the target sound source and the head of the target user based on a preset correspondence between head image sizes and distances.
In some optional implementations of this embodiment, the device 700 may further include: a second acquisition unit (not shown) configured to acquire a predetermined loudness difference between an initial left-channel audio signal and an initial right-channel audio signal as an initial loudness difference; and a first adjustment unit (not shown) configured to adjust the loudness of the processed left-channel audio signal and of the processed right-channel audio signal respectively, so that the difference between the loudness difference of the loudness-adjusted processed left-channel and right-channel audio signals and the initial loudness difference falls within a first preset range.
In some optional implementations of this embodiment, the device 700 may further include: a third acquisition unit (not shown) configured to acquire a predetermined interaural time difference between the initial left-channel audio signal and the initial right-channel audio signal as an initial interaural time difference; and a second adjustment unit (not shown) configured to adjust the interaural time difference between the processed left-channel audio signal and the processed right-channel audio signal, so that the difference between the adjusted interaural time difference and the initial interaural time difference falls within a second preset range.
The device provided by the above embodiment of the present disclosure acquires a head image of a target user and a to-be-processed audio signal, uses the head image to determine the target user's head pose angle and the distance between a target sound source and the target user's head, and finally inputs the head pose angle, the distance, and the to-be-processed audio signal into a preset head-related transfer function to obtain processed left-channel and right-channel audio signals, thereby adjusting the audio signal using the head image and the head-related transfer function, improving the flexibility of audio signal processing, and helping simulate a near-realistic audio playback effect.
Referring now to FIG. 8, a schematic structural diagram of a terminal device 800 suitable for implementing embodiments of the present disclosure is shown. Terminal devices in embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The terminal device shown in FIG. 8 is merely an example and should not impose any limitation on the functions and scope of use of embodiments of the present disclosure.
As shown in FIG. 8, the terminal device 800 may include a processing device (for example, a central processing unit, a graphics processing unit, or the like) 801 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage device 808 into a random access memory (RAM) 803. Various programs and data required for the operation of the terminal device 800 are also stored in the RAM 803. The processing device 801, the ROM 802, and the RAM 803 are connected to one another via a bus 804, to which an input/output (I/O) interface 805 is also connected.
Generally, the following devices may be connected to the I/O interface 805: an input device 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 807 including, for example, a liquid crystal display (LCD), speaker, vibrator, and the like; a storage device 808 including, for example, a magnetic tape, hard disk, and the like; and a communication device 809. The communication device 809 may allow the terminal device 800 to communicate with other devices wirelessly or by wire to exchange data. Although FIG. 8 shows a terminal device 800 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 809, installed from the storage device 808, or installed from the ROM 802. When the computer program is executed by the processing device 801, the above-described functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code; such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; it may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to electric wires, optical cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The above computer-readable medium may be included in the above terminal device, or it may exist alone without being assembled into the terminal device. The above computer-readable medium carries one or more programs, and when the one or more programs are executed by the terminal device, the terminal device is caused to: acquire a head image of a target user and a to-be-processed audio signal; determine, based on the head image, a head pose angle of the target user, and determine the distance between a target sound source and the head of the target user; and input the head pose angle, the distance, and the to-be-processed audio signal into a preset head-related transfer function to obtain a processed left-channel audio signal and a processed right-channel audio signal, where the head-related transfer function is used to characterize the correspondence of head pose angle, distance, and to-be-processed audio signal to processed left-channel and right-channel audio signals.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar. The program code may execute entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code containing one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures; for example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks therein, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a unit does not in some cases constitute a limitation on the unit itself; for example, the first acquisition unit may also be described as "a unit that acquires a head image of a target user and a to-be-processed audio signal".
The above description is merely a preferred embodiment of the present disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example technical solutions formed by replacing the above features with technical features of similar functions disclosed in (but not limited to) the present disclosure.

Claims (14)

  1. A method for processing an audio signal, comprising:
    acquiring a head image of a target user and a to-be-processed audio signal;
    determining, based on the head image, a head pose angle of the target user, and determining a distance between a target sound source and the head of the target user;
    inputting the head pose angle, the distance, and the to-be-processed audio signal into a preset head-related transfer function to obtain a processed left-channel audio signal and a processed right-channel audio signal, wherein the head-related transfer function is used to characterize the correspondence of head pose angle, distance, and to-be-processed audio signal to processed left-channel and right-channel audio signals.
  2. The method according to claim 1, wherein determining the head pose angle of the target user based on the head image comprises:
    inputting the head image into a pre-trained head pose recognition model to obtain the head pose angle of the target user, wherein the head pose recognition model is used to characterize the correspondence between a head image and the head pose angle of the user represented by the head image.
  3. The method according to claim 2, wherein the head pose recognition model is trained in advance as follows:
    acquiring a plurality of sample head images and a sample head pose angle corresponding to each sample head image among the plurality of sample head images;
    using a machine learning method, training the head pose recognition model with the sample head images among the plurality of sample head images as input and the sample head pose angles corresponding to the input sample head images as expected output.
  4. The method according to claim 1, wherein determining the distance between the target sound source and the head of the target user comprises:
    determining a size of the head image;
    determining the distance between the target sound source and the head of the target user based on a preset correspondence between head image sizes and distances.
  5. The method according to any one of claims 1-4, wherein, after obtaining the processed left-channel audio signal and the processed right-channel audio signal, the method further comprises:
    acquiring a predetermined loudness difference between an initial left-channel audio signal and an initial right-channel audio signal as an initial loudness difference;
    adjusting the loudness of the processed left-channel audio signal and of the processed right-channel audio signal respectively, so that a difference between the loudness difference of the loudness-adjusted processed left-channel and right-channel audio signals and the initial loudness difference falls within a first preset range.
  6. The method according to claim 5, wherein the method further comprises:
    acquiring a predetermined interaural time difference between the initial left-channel audio signal and the initial right-channel audio signal as an initial interaural time difference;
    adjusting the interaural time difference between the processed left-channel audio signal and the processed right-channel audio signal, so that a difference between the adjusted interaural time difference and the initial interaural time difference falls within a second preset range.
  7. A device for processing an audio signal, comprising:
    a first acquisition unit configured to acquire a head image of a target user and a to-be-processed audio signal;
    a determination unit configured to determine, based on the head image, a head pose angle of the target user and a distance between a target sound source and the head of the target user;
    a processing unit configured to input the head pose angle, the distance, and the to-be-processed audio signal into a preset head-related transfer function to obtain a processed left-channel audio signal and a processed right-channel audio signal, wherein the head-related transfer function is used to characterize the correspondence of head pose angle, distance, and to-be-processed audio signal to processed left-channel and right-channel audio signals.
  8. The device according to claim 7, wherein the determination unit comprises:
    a recognition module configured to input the head image into a pre-trained head pose recognition model to obtain the head pose angle of the target user, wherein the head pose recognition model is used to characterize the correspondence between a head image and the head pose angle of the user represented by the head image.
  9. The device according to claim 8, wherein the head pose recognition model is trained in advance as follows:
    acquiring a plurality of sample head images and a sample head pose angle corresponding to each sample head image among the plurality of sample head images;
    using a machine learning method, training the head pose recognition model with the sample head images among the plurality of sample head images as input and the sample head pose angles corresponding to the input sample head images as expected output.
  10. The device according to claim 7, wherein the determination unit comprises:
    a first determination module configured to determine a size of the head image;
    a second determination module configured to determine the distance between the target sound source and the head of the target user based on a preset correspondence between head image sizes and distances.
  11. The device according to any one of claims 7-10, wherein the device further comprises:
    a second acquisition unit configured to acquire a predetermined loudness difference between an initial left-channel audio signal and an initial right-channel audio signal as an initial loudness difference;
    a first adjustment unit configured to adjust the loudness of the processed left-channel audio signal and of the processed right-channel audio signal respectively, so that a difference between the loudness difference of the loudness-adjusted processed left-channel and right-channel audio signals and the initial loudness difference falls within a first preset range.
  12. The device according to claim 11, wherein the device further comprises:
    a third acquisition unit configured to acquire a predetermined interaural time difference between the initial left-channel audio signal and the initial right-channel audio signal as an initial interaural time difference;
    a second adjustment unit configured to adjust the interaural time difference between the processed left-channel audio signal and the processed right-channel audio signal, so that a difference between the adjusted interaural time difference and the initial interaural time difference falls within a second preset range.
  13. A terminal device, comprising:
    one or more processors;
    a storage device on which one or more programs are stored,
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1-6.
  14. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-6.
PCT/CN2019/072948 2018-10-12 2019-01-24 Method and device for processing audio signal WO2020073563A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2020545268A JP7210602B2 (ja) 2018-10-12 2019-01-24 Method and device for processing audio signals
GB2100831.3A GB2590256B (en) 2018-10-12 2019-01-24 Method and device for processing audio signal
US16/980,119 US11425524B2 (en) 2018-10-12 2019-01-24 Method and device for processing audio signal

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811190415.4A CN111050271B (zh) 2018-10-12 2018-10-12 Method and device for processing audio signal
CN201811190415.4 2018-10-12

Publications (1)

Publication Number Publication Date
WO2020073563A1 true WO2020073563A1 (zh) 2020-04-16

Family

ID=70164992

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/072948 WO2020073563A1 (zh) 2018-10-12 2019-01-24 Method and device for processing audio signal

Country Status (5)

Country Link
US (1) US11425524B2 (zh)
JP (1) JP7210602B2 (zh)
CN (1) CN111050271B (zh)
GB (1) GB2590256B (zh)
WO (1) WO2020073563A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200049020A (ko) * 2018-10-31 2020-05-08 Samsung Electronics Co., Ltd. Method for displaying content in response to a voice command, and electronic device therefor
CN112637755A (zh) * 2020-12-22 2021-04-09 Guangzhou Panyu Juda Car Audio Equipment Co., Ltd. Audio playback control method, device, and playback system based on wireless connection
CN113099373B (zh) * 2021-03-29 2022-09-23 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Method, device, terminal, and storage medium for sound field width expansion
CN114501297B (zh) * 2022-04-02 2022-09-02 Beijing Honor Device Co., Ltd. Audio processing method and electronic device

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPQ896000A0 (en) 2000-07-24 2000-08-17 Seeing Machines Pty Ltd Facial image processing system
EP1424685A1 (en) * 2002-11-28 2004-06-02 Sony International (Europe) GmbH Method for generating speech data corpus
CN102860041A (zh) * 2010-04-26 2013-01-02 Cambridge Mechatronics Ltd Loudspeaker with position tracking of the listener
CN101938686B (zh) * 2010-06-24 2013-08-21 Institute of Acoustics, Chinese Academy of Sciences Measurement system and measurement method for head-related transfer functions in an ordinary environment
KR101227932B1 (ko) 2011-01-14 2013-01-30 Korea Electronics Technology Institute Multi-channel multi-track audio system and audio processing method
JP2014131140A (ja) 2012-12-28 2014-07-10 Yamaha Corp Communication system, AV receiver, and communication adapter device
CN104010265A (zh) * 2013-02-22 2014-08-27 Dolby Laboratories Licensing Corp Audio spatial rendering device and method
JP6147603B2 (ja) 2013-07-31 2017-06-14 KDDI Corp Voice transmission device and voice transmission method
WO2015162947A1 (ja) * 2014-04-22 2015-10-29 Sony Corp Information reproduction device, information reproduction method, information recording device, and information recording method
JP2016199124A (ja) * 2015-04-09 2016-12-01 之彦 須崎 Sound field control device and application method
US10595148B2 (en) 2016-01-08 2020-03-17 Sony Corporation Sound processing apparatus and method, and program
WO2017120767A1 (zh) * 2016-01-12 2017-07-20 Shenzhen Dlodlo New Technology Co., Ltd. Head pose prediction method and device
CN105760824B (zh) * 2016-02-02 2019-02-01 Beijing Evolver Robotics Technology Co., Ltd. Moving human body tracking method and system
US9591427B1 (en) * 2016-02-20 2017-03-07 Philip Scott Lyren Capturing audio impulse responses of a person with a smartphone
CN108038474B (zh) * 2017-12-28 2020-04-14 Shenzhen Lifei Technology Co., Ltd. Face detection method, method for training convolutional neural network parameters, device, and medium
WO2019246044A1 (en) * 2018-06-18 2019-12-26 Magic Leap, Inc. Head-mounted display systems with power saving functionality

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030007648A1 (en) * 2001-04-27 2003-01-09 Christopher Currell Virtual audio system and techniques
CN104392241A (zh) * 2014-11-05 2015-03-04 University of Electronic Science and Technology of China Head pose estimation method based on hybrid regression
CN107168518A (zh) * 2017-04-05 2017-09-15 Beijing Pico Technology Co., Ltd. Synchronization method and device for a head-mounted display, and head-mounted display
CN107182011A (zh) * 2017-07-21 2017-09-19 Shenzhen Tinno Technology Co., Ltd., Shanghai Branch Audio playback method and system, mobile terminal, and WiFi earphone

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2604019A * 2020-12-16 2022-08-24 Nvidia Corp Visually tracked spatial audio
WO2023058466A1 (ja) * 2021-10-06 2023-04-13 Sony Group Corp Information processing device and data structure

Also Published As

Publication number Publication date
JP7210602B2 (ja) 2023-01-23
CN111050271A (zh) 2020-04-21
US20210029486A1 (en) 2021-01-28
GB202100831D0 (en) 2021-03-10
GB2590256A (en) 2021-06-23
CN111050271B (zh) 2021-01-29
US11425524B2 (en) 2022-08-23
GB2590256B (en) 2023-04-26
JP2021535632A (ja) 2021-12-16

Similar Documents

Publication Publication Date Title
WO2020073563A1 (zh) Method and device for processing audio signal
US10585486B2 (en) Gesture interactive wearable spatial audio system
US11765538B2 (en) Wearable electronic device (WED) displays emoji that plays binaural sound
US20160241980A1 (en) Adaptive ambisonic binaural rendering
US11356795B2 (en) Spatialized audio relative to a peripheral device
US11297456B2 (en) Moving an emoji to move a location of binaural sound
TWI709131B (zh) Audio scene processing technology
US20190246231A1 (en) Method of improving localization of surround sound
WO2023045980A1 (zh) Audio signal playback method and device, and electronic device
US10582329B2 (en) Audio processing device and method
CN117835121A (zh) Stereo reproduction method, computer, microphone device, speaker device, and television
EP3625975B1 (en) Incoherent idempotent ambisonics rendering
WO2020155908A1 (zh) Method and device for generating information
CN114339582B (zh) Dual-channel audio processing method, direction-sense filter generation method, device, and medium
US10390167B2 (en) Ear shape analysis device and ear shape analysis method
CN114630240B (zh) Direction filter generation method, audio processing method, device, and storage medium
WO2024027315A1 (zh) Audio processing method and device, electronic device, storage medium, and program product
CN117793611A (zh) Method for generating stereo sound, method for playing stereo sound, device, and storage medium
WO2022093162A1 (en) Calculation of left and right binaural signals for output
CN116193196A (zh) Virtual surround sound rendering method, device, equipment, and storage medium
CN118053442A (zh) Training data generation method and device, electronic device, and storage medium
CN113674751A (zh) Audio processing method and device, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19871599

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020545268

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 202100831

Country of ref document: GB

Kind code of ref document: A

Free format text: PCT FILING DATE = 20190124

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29/07/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19871599

Country of ref document: EP

Kind code of ref document: A1