WO2020073563A1 - Method and apparatus for processing audio signals - Google Patents
Method and apparatus for processing audio signals
- Publication number
- WO2020073563A1 (application PCT/CN2019/072948)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- head
- audio signal
- channel audio
- processed
- initial
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06T7/74—Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S1/00—Two-channel systems
- H04S1/002—Non-adaptive circuits, e.g. manually adjustable or static, for enhancing the sound image or the spatial distribution
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/302—Electronic adaptation of stereophonic sound system to listener position or orientation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/302—Electronic adaptation of stereophonic sound system to listener position or orientation
- H04S7/303—Tracking of listener position or orientation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/307—Frequency adjustment, e.g. tone control
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R2201/00—Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
- H04R2201/40—Details of arrangements for obtaining desired directional characteristic by combining a number of identical transducers covered by H04R1/40 but not provided for in any of its subgroups
- H04R2201/403—Linear arrays of transducers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R2499/00—Aspects covered by H04R or H04S not otherwise provided for in their subgroups
- H04R2499/10—General applications
- H04R2499/15—Transducers incorporated in visual displaying devices, e.g. televisions, computer displays, laptops
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/11—Positioning of individual sound objects, e.g. moving airplane, within a sound field
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/13—Aspects of volume control, not necessarily automatic, in stereophonic sound systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2420/00—Techniques used in stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/01—Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
Definitions
- Embodiments of the present disclosure relate to the field of computer technology, and in particular, to methods and devices for processing audio signals.
- the embodiments of the present disclosure propose a method and apparatus for processing audio signals.
- an embodiment of the present disclosure provides a method for processing an audio signal, the method including: acquiring a target user's head image and an audio signal to be processed; based on the head image, determining the target user's head posture angle and the distance between the target sound source and the target user's head; and inputting the head posture angle, the distance, and the audio signal to be processed into a preset head-related transfer function to obtain a processed left channel audio signal and a processed right channel audio signal, wherein the head-related transfer function is used to characterize the correspondence between the head posture angle, the distance, and the audio signal to be processed, and the processed left channel audio signal and the processed right channel audio signal.
- determining the head posture angle of the target user based on the head image includes: inputting the head image into a pre-trained head posture recognition model to obtain the head posture angle of the target user, wherein the head posture recognition model is used to characterize the correspondence between a head image and the head posture angle of the user represented by that head image.
- the head posture recognition model is pre-trained according to the following steps: acquiring multiple sample head images and the sample head posture angle corresponding to each sample head image; then, using a machine learning method, training the head posture recognition model with the sample head images as input and the corresponding sample head posture angles as the desired output.
- determining the distance between the target sound source and the head of the target user includes: determining the size of the head image; and determining the distance between the target sound source and the target user's head based on a preset correspondence between head image size and distance.
- the method further includes: acquiring a predetermined loudness difference between the initial left channel audio signal and the initial right channel audio signal as the initial loudness difference; and adjusting the loudness of the processed left channel audio signal and the processed right channel audio signal separately, so that, after adjustment, the difference between the loudness difference of the processed left channel audio signal and the processed right channel audio signal and the initial loudness difference is within a first preset range.
- the method further includes: acquiring a predetermined binaural time difference between the initial left channel audio signal and the initial right channel audio signal as the initial binaural time difference; and adjusting the binaural time difference between the processed left channel audio signal and the processed right channel audio signal, so that, after adjustment, the difference between the binaural time difference of the processed left channel audio signal and the processed right channel audio signal and the initial binaural time difference is within a second preset range.
- an embodiment of the present disclosure provides an apparatus for processing audio signals.
- the apparatus includes: a first acquisition unit configured to acquire a target user's head image and an audio signal to be processed; a determination unit configured to determine the head posture angle of the target user based on the head image, and to determine the distance between the target sound source and the target user's head; and a processing unit configured to input the head posture angle, the distance, and the audio signal to be processed into a preset head-related transfer function to obtain a processed left channel audio signal and a processed right channel audio signal, where the head-related transfer function is used to characterize the correspondence between the head posture angle, the distance, and the audio signal to be processed, and the processed left channel audio signal and the processed right channel audio signal.
- the determining unit includes: a recognition module configured to input the head image into a pre-trained head posture recognition model to obtain the head posture angle of the target user, wherein the head posture recognition model is used to characterize the correspondence between a head image and the head posture angle of the user represented by that head image.
- the head posture recognition model is pre-trained according to the following steps: acquiring multiple sample head images and the sample head posture angle corresponding to each sample head image; then, using a machine learning method, training the head posture recognition model with the sample head images as input and the corresponding sample head posture angles as the desired output.
- the determination unit includes: a first determination module configured to determine the size of the head image; and a second determination module configured to determine the distance between the target sound source and the target user's head based on a preset correspondence between head image size and distance.
- the apparatus further includes: a second acquisition unit configured to acquire a predetermined loudness difference between the initial left channel audio signal and the initial right channel audio signal as the initial loudness difference; and a first adjustment unit configured to adjust the loudness of the processed left channel audio signal and the processed right channel audio signal separately, so that, after adjustment, the difference between the loudness difference of the processed left channel audio signal and the processed right channel audio signal and the initial loudness difference is within a first preset range.
- the apparatus further includes: a third acquisition unit configured to acquire a predetermined binaural time difference between the initial left channel audio signal and the initial right channel audio signal as the initial binaural time difference; and a second adjustment unit configured to adjust the binaural time difference between the processed left channel audio signal and the processed right channel audio signal, so that, after adjustment, the difference between the binaural time difference of the processed left channel audio signal and the processed right channel audio signal and the initial binaural time difference is within a second preset range.
- an embodiment of the present disclosure provides a terminal device including: one or more processors; and a storage device on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any one of the implementation manners of the first aspect.
- an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, which when executed by a processor implements the method described in any one of the implementation manners of the first aspect.
- the method and apparatus for processing audio signals provided by the embodiments of the present disclosure acquire the target user's head image and the audio signal to be processed, use the head image to determine the target user's head posture angle and the distance between the target sound source and the target user's head, and finally input the head posture angle, the distance, and the audio signal to be processed into the preset head-related transfer function to obtain the processed left channel audio signal and the processed right channel audio signal. Adjusting the audio signal with the head image and the head-related transfer function in this way increases the flexibility of audio signal processing and helps simulate a near-real audio playback effect.
- FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure can be applied;
- FIG. 2 is a flowchart of one embodiment of a method for processing audio signals according to the present disclosure;
- FIG. 3 is an exemplary schematic diagram of the head posture angle in the method for processing audio signals according to an embodiment of the present disclosure;
- FIG. 4 is another exemplary schematic diagram of the head posture angle in the method for processing audio signals according to an embodiment of the present disclosure;
- FIG. 5 is a schematic diagram of an application scenario of the method for processing audio signals according to an embodiment of the present disclosure;
- FIG. 6 is a flowchart of still another embodiment of the method for processing audio signals according to the present disclosure;
- FIG. 7 is a schematic structural diagram of an embodiment of an apparatus for processing audio signals according to an embodiment of the present disclosure.
- FIG. 8 is a schematic structural diagram of a terminal device suitable for implementing embodiments of the present disclosure.
- FIG. 1 illustrates an exemplary system architecture 100 of a method for processing audio signals or an apparatus for processing audio signals to which embodiments of the present disclosure may be applied.
- the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105.
- the network 104 is a medium used to provide a communication link between the terminal devices 101, 102, 103 and the server 105.
- the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
- the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages, and so on.
- Various communication client applications may be installed on the terminal devices 101, 102, and 103, such as audio playback applications, video playback applications, and social platform software.
- the terminal devices 101, 102, and 103 may be hardware or software.
- if the terminal devices 101, 102, and 103 are hardware, they may be various electronic devices that support audio playback and include cameras.
- if the terminal devices 101, 102, and 103 are software, they can be installed in the above electronic devices. They can be implemented as multiple pieces of software or software modules (for example, to provide distributed services), or as a single piece of software or software module. No specific limitation is made here.
- the server 105 may be a server that provides various services, such as a background audio server that supports audio played on the terminal devices 101, 102, and 103.
- the background audio server can send audio to the terminal device to play on the terminal device.
- the method for processing audio signals provided by the embodiments of the present disclosure is generally performed by terminal devices 101, 102, and 103. Accordingly, the apparatus for processing audio signals may be provided in terminal devices 101, 102, 103.
- the server can be hardware or software.
- the server can be implemented as a distributed server cluster composed of multiple servers or as a single server.
- if the server is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services), or as a single piece of software or software module. No specific limitation is made here.
- the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; any number of terminal devices, networks, and servers may be provided according to implementation needs. In the case where the head image and the audio signal to be processed do not need to be acquired remotely, the above system architecture may omit the network and the server.
- the method for processing audio signals includes the following steps:
- Step 201: Acquire the head image of the target user and the audio signal to be processed.
- the execution subject of the method for processing audio signals may acquire, from a remote or local location through a wired or wireless connection, the target user's head image and the left channel and right channel audio signals to be processed.
- the target user may be a user within the shooting range of the camera on the terminal device shown in FIG. 1 (for example, a user using the terminal device shown in FIG. 1).
- the audio signal to be processed may be an audio signal to be processed that is stored in the execution subject in advance.
- the audio signal to be processed may also be the segment of the audio currently being played on the execution subject that has not yet been played.
- the duration of the audio clip may be a preset duration, such as 5 seconds, 10 seconds, and so on.
- Step 202: Based on the head image, determine the head posture angle of the target user, and determine the distance between the target sound source and the target user's head.
- the above-mentioned execution subject may determine the head posture angle of the target user, and determine the distance between the target sound source and the target user's head.
- the head posture angle can be used to characterize the degree of deflection of the target user's face orientation relative to the camera used to capture the head image of the target user.
- the head posture angle may include three angles, pitch, yaw, and roll, which respectively represent up-and-down rotation (nodding), left-and-right rotation (turning), and in-plane rotation (tilting) of the head.
- the x-axis, y-axis, and z-axis are three axes of the rectangular coordinate system.
- the z axis may be the optical axis of the camera on the terminal device 301
- the y axis may be a straight line that passes through the center point of the top contour of the person's head and is perpendicular to the horizontal plane when the head is not tilted sideways.
- the pitch angle can be the angle of rotation of the face about the x axis
- the yaw angle can be the angle of rotation of the face about the y axis
- the roll angle can be the angle of rotation of the face about the z axis.
- the determined head posture angle may not include the roll angle described above.
- point A in the figure is the target sound source.
- the target sound source is at the same position as the camera, and the determined head posture angle includes a yaw angle and a pitch angle.
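- As a non-limiting sketch of the geometry above: if a head pose estimator returns a 3x3 rotation matrix for the head (the patent does not fix a convention), the pitch, yaw, and roll angles can be recovered as follows. The axis convention (x right, y up, z along the camera's optical axis) and the decomposition order R = Ry(yaw) · Rx(pitch) · Rz(roll) are assumptions for illustration only.

```python
import numpy as np

def rotation_matrix_to_pitch_yaw_roll(R):
    """Decompose a 3x3 rotation matrix into (pitch, yaw, roll) in degrees.

    Assumes R = Ry(yaw) @ Rx(pitch) @ Rz(roll) with x to the right, y up,
    and z along the camera's optical axis; conventions vary between systems.
    """
    pitch = np.degrees(np.arcsin(-R[1, 2]))          # rotation about x
    yaw = np.degrees(np.arctan2(R[0, 2], R[2, 2]))   # rotation about y
    roll = np.degrees(np.arctan2(R[1, 0], R[1, 1]))  # rotation about z
    return pitch, yaw, roll

# Identity rotation: the face looks straight at the camera.
print(rotation_matrix_to_pitch_yaw_roll(np.eye(3)))  # (0.0, 0.0, 0.0)
```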
- the above-mentioned execution subject may perform head pose estimation on the two-dimensional head image according to various existing head pose estimation methods.
- the method of head pose estimation may include but is not limited to the following methods: a method based on a machine learning model, a coordinate transformation method based on key points of a human face, and the like.
- the execution subject may determine the head posture angle of the target user based on the head image as follows: input the head image into a pre-trained head posture recognition model to obtain the head posture angle of the target user, where the head posture recognition model is used to characterize the correspondence between a head image and the head posture angle of the user represented by that head image.
- the above-mentioned head posture recognition model may include a feature extraction part and a correspondence table.
- the feature extraction part can be used to extract features from the head image to generate feature vectors.
- the feature extraction part may be a convolutional neural network, deep neural network, or the like.
- the correspondence relationship table may be a correspondence table pre-formulated by a technician based on statistics of a large number of feature vectors and head posture angles, and storing correspondence relationships between a plurality of feature vectors and head posture angles. In this way, the above-mentioned head pose recognition model can first use the feature extraction part to extract the features of the head image, thereby generating the target feature vector.
- the target feature vector is compared with the feature vectors in the correspondence table in sequence; if a feature vector in the table is the same as or sufficiently similar to the target feature vector, the head posture angle that this feature vector maps to in the table is taken as the head posture angle of the target user.
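- A minimal sketch of this table lookup, assuming cosine similarity as the "same or similar" criterion (the patent does not specify the similarity measure or threshold); `table_vecs` and `table_angles` are hypothetical names for the stored correspondence table.

```python
import numpy as np

def look_up_pose(target_vec, table_vecs, table_angles, threshold=0.95):
    """Return the head posture angle whose stored feature vector is most
    similar (by cosine similarity) to the target feature vector, or None
    if nothing in the table is similar enough."""
    table_vecs = np.asarray(table_vecs, dtype=float)
    target = np.asarray(target_vec, dtype=float)
    sims = table_vecs @ target / (
        np.linalg.norm(table_vecs, axis=1) * np.linalg.norm(target))
    best = int(np.argmax(sims))
    return table_angles[best] if sims[best] >= threshold else None
```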
- the above-mentioned head posture recognition model may be obtained by the above-mentioned execution subject or another electronic device through training as follows: first, obtain multiple sample head images and, for each sample head image, the corresponding sample head posture angle.
- the sample head posture angle is the pre-labeled head posture angle of the person shown in the sample head image. Then, using a machine learning method, the sample head images are used as input, the corresponding sample head posture angles are used as the desired output, and the head posture recognition model is trained.
- the above-mentioned head posture recognition model may be a model obtained by training the initialized artificial neural network.
- the initialized artificial neural network may be an untrained artificial neural network or an incompletely trained artificial neural network.
- Each layer of the initialized artificial neural network may be set with initial parameters, and the parameters may be continuously adjusted during the training process of the artificial neural network (for example, the parameters are adjusted using a back propagation algorithm).
- the initialized artificial neural network may be any of various types of untrained or incompletely trained artificial neural networks.
- the initialized artificial neural network may be a convolutional neural network (for example, it may include a convolutional layer, a pooling layer, a fully connected layer, etc.).
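- The following is a minimal, illustrative PyTorch sketch of such an initialized convolutional network trained by backpropagation to regress head posture angles; the layer sizes, the 64x64 grayscale input, and the MSE loss are assumptions, and the random tensors below stand in for labeled sample head images.

```python
import torch
import torch.nn as nn

# Minimal CNN that regresses (pitch, yaw, roll) from a 64x64 grayscale head image.
class HeadPoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 64 -> 32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 32 -> 16
        )
        self.head = nn.Linear(32 * 16 * 16, 3)    # three posture angles

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = HeadPoseNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in batch: real training would iterate over labeled sample head images.
images = torch.randn(8, 1, 64, 64)   # sample head images (input)
angles = torch.randn(8, 3)           # labeled sample posture angles (desired output)

optimizer.zero_grad()
loss = loss_fn(model(images), angles)
loss.backward()                      # backpropagation, as mentioned in the text
optimizer.step()
```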
- in this way, the head posture of the target user can be monitored in real time without hardware such as a head-mounted device, simplifying the hardware structure and reducing hardware cost.
- the above-mentioned execution subject may determine the distance between the target sound source and the target user's head based on the head image.
- the above-mentioned execution subject may determine the distance between the target sound source and the target user's head according to the following steps:
- the size of the head image may be the size of the head image region identified from the captured image using an existing target detection model (such as SSD (Single Shot MultiBox Detector) or DPM (Deformable Parts Model)).
- the size of the region can be characterized in various ways; for example, it may be the length or width of the smallest rectangle containing the head image region, or the radius of the smallest circle containing the head image region.
- the above-mentioned correspondence may be characterized by a preset correspondence table in which head image sizes and corresponding distances are stored; the execution subject may look up, based on the determined head image size, the corresponding distance in the table.
- alternatively, the correspondence may be characterized by a preset conversion formula, and the execution subject may use the conversion formula to calculate the distance between the target sound source and the target user's head from the determined head image size (one possible formula is sketched below).
- alternatively, the above-mentioned execution subject may use an existing face key point detection method to determine the face key points in the head image, and determine the size of the image area containing the determined face key points.
- the characterizing method of the size of the image area may be the same as the above example.
- the above-mentioned execution subject may determine the distance between the target sound source and the head of the target user based on the preset correspondence between the size of the image area and the distance.
- the representation of the correspondence in this example may be the same as the above example, and will not be repeated here.
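- As one example of a preset conversion formula (referenced above), a pinhole-camera relation can map the detected head region's pixel size to a distance; the focal length and average physical head width used here are illustrative assumptions, not values from the patent.

```python
def distance_from_head_width(pixel_width, focal_length_px=1000.0,
                             real_head_width_m=0.16):
    """Estimate camera-to-head distance from the pixel width of the detected
    head region using the pinhole camera model:
        distance = focal_length * real_width / pixel_width
    Both constants are assumed values for illustration."""
    return focal_length_px * real_head_width_m / pixel_width

# A 160-pixel-wide head with these assumed constants is about 1 m away.
print(distance_from_head_width(160.0))  # 1.0
```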
- the target sound source may be an actual electronic device that outputs audio signals.
- the electronic device that outputs the audio signal may be the above-mentioned terminal device including a camera; alternatively, the target sound source may be a virtual sound source determined by the above-mentioned execution subject and located at a target position.
- the distance between the target sound source and the target user's head may be the distance, determined as in the above example, between the electronic device outputting the audio signal and the target user's head; or the distance between the target sound source (that is, the virtual sound source) and the target user's head may be calculated from the determined distance (for example, by multiplying it by a preset coefficient, or adding a preset distance).
- Step 203: Input the head posture angle, the distance, and the audio signal to be processed into a preset head-related transfer function to obtain a processed left channel audio signal and a processed right channel audio signal.
- the above-mentioned execution subject may input the head posture angle, the distance, and the audio signal to be processed into a preset head-related transfer function (HRTF) to obtain the processed left channel audio signal and the processed right channel audio signal.
- the head-related transfer function is used to characterize the correspondence between the head attitude angle, the distance, the audio signal to be processed and the processed left channel audio signal and the processed right channel audio signal.
- the head-related transfer function (also called binaural transfer function) describes the transmission process of sound waves from the sound source to both ears. It is the result of comprehensive filtering of sound waves by human physiological structure (such as head, pinna and torso, etc.). Because the head-related transfer function contains information about sound source localization, it is very important for the study of binaural hearing and psychoacoustics. In practical applications, the use of headphones or speakers to output signals processed with head-related transfer functions can simulate various spatial auditory effects.
- the HRTF may include two parts, namely a left HRTF and a right HRTF.
- the above-mentioned execution subject may input the head posture angle, the determined distance, and the audio signal to be processed into the left HRTF and the right HRTF respectively; the left HRTF outputs the processed left channel audio signal, and the right HRTF outputs the processed right channel audio signal.
- the processed left channel audio signal and the processed right channel audio signal may have an interaural level difference (ILD, a loudness difference) and an interaural time difference (ITD).
- loudness, also known as volume, describes the strength of a sound as subjectively perceived by the human ear. Its unit of measurement is the sone: a pure tone at 1 kHz with a sound pressure level of 40 dB is defined to have a loudness of 1 sone.
- the binaural time difference refers to the difference between the times at which the sound from the source reaches the listener's two ears.
- the above-mentioned execution subject may output the processed left channel audio signal and the processed right channel audio signal in various ways.
- the processed left channel audio signal and the processed right channel audio signal can be played using headphones, speakers, and the like; alternatively, they can be output to a preset storage area for storage.
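- In time-domain terms, applying a left and a right HRTF amounts to convolving the signal to be processed with a pair of head-related impulse responses (HRIRs) selected for the current head posture angle and distance. A minimal sketch, assuming such an HRIR pair is already available (the toy impulse responses below are stand-ins, not measured data):

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_hrtf(mono, hrir_left, hrir_right):
    """Render a mono signal to binaural left/right channels by convolving it
    with the head-related impulse responses (time-domain HRTFs) selected for
    the current head posture angle and source distance."""
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return left, right

# Toy impulse responses: the right ear is 10 samples later and quieter,
# a crude stand-in for a measured HRIR pair at one (angle, distance).
mono = np.random.randn(48000)
hrir_l = np.zeros(64); hrir_l[0] = 1.0
hrir_r = np.zeros(64); hrir_r[10] = 0.5
left, right = apply_hrtf(mono, hrir_l, hrir_r)
```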
- FIG. 5 is a schematic diagram of an application scenario of the method for processing audio signals according to this embodiment.
- music is being played on the terminal device 501
- the terminal device 501 first captures the head image 503 of the target user 502, and then acquires the audio signal 504 to be processed.
- the audio signal 504 to be processed is the segment of the currently playing audio that has not yet been played.
- the terminal device 501 determines the head posture angle 505 of the target user based on the head image 503 (for example, using a pre-trained head posture recognition model), and determines the distance 506 between the target sound source and the head of the target user 502 (for example, according to the correspondence between head image size and distance).
- the target sound source is the terminal device 501.
- the terminal device 501 inputs the head posture angle 505, the distance 506, and the audio signal 504 to be processed into a preset head-related transfer function 507 to obtain the processed left channel audio signal 508 and the processed right channel audio signal 509.
- the method provided by the above embodiment of the present disclosure acquires the target user's head image and the audio signal to be processed, uses the head image to determine the target user's head posture angle and the distance between the target sound source and the target user's head, and finally inputs the head posture angle, the distance, and the audio signal to be processed into the preset head-related transfer function to obtain the processed left channel audio signal and the processed right channel audio signal. Adjusting the audio signal with the head image and the head-related transfer function improves the flexibility of audio signal processing and helps to simulate a near-real audio playback effect.
- FIG. 6 shows a flow 600 of yet another embodiment of a method for processing audio signals.
- the process 600 of the method for processing audio signals includes the following steps:
- Step 601: Acquire the head image of the target user and the audio signal to be processed.
- step 601 is basically the same as step 201 in the embodiment corresponding to FIG. 2 and will not be repeated here.
- Step 602: Based on the head image, determine the head posture angle of the target user, and determine the distance between the target sound source and the head of the target user.
- step 602 is basically the same as step 202 in the embodiment corresponding to FIG. 2 and will not be repeated here.
- Step 603: Input the head posture angle, the distance, and the audio signal to be processed into a preset head-related transfer function to obtain the processed left channel audio signal and the processed right channel audio signal.
- step 603 is basically the same as step 203 in the embodiment corresponding to FIG. 2 and will not be repeated here.
- Step 604: Acquire a predetermined loudness difference between the initial left channel audio signal and the initial right channel audio signal as the initial loudness difference.
- the execution subject of the method for processing audio signals may acquire the predetermined loudness difference between the initial left channel audio signal and the initial right channel audio signal.
- the initial left channel audio signal and the initial right channel audio signal may be unprocessed audio signals pre-stored in the above-mentioned execution subject.
- the unprocessed audio signal and the aforementioned audio signal to be processed may be generated based on the same audio file.
- the initial left channel audio signal and the initial right channel audio signal may be audio signals extracted from an audio file
- the to-be-processed audio signal may be an audio clip extracted from the audio file being played and not yet played.
- the above-mentioned execution subject may determine the loudness of the initial left channel audio signal and of the initial right channel audio signal in advance, and take the difference between the two determined loudnesses as the loudness difference between the initial left channel audio signal and the initial right channel audio signal. It should be noted that determining the loudness of an audio signal is a well-known technique that has been widely studied and applied, and is not described again here.
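- For illustration only: since the patent treats loudness measurement as well-known and leaves the model unspecified, the sketch below uses RMS level in dB as a crude stand-in for a true sone-scale loudness model.

```python
import numpy as np

def level_db(signal):
    """RMS level in dB -- a simple stand-in for a psychoacoustic loudness
    model, which the patent leaves unspecified."""
    rms = np.sqrt(np.mean(np.square(signal)))
    return 20.0 * np.log10(rms + 1e-12)  # epsilon guards against log(0)

def loudness_difference(left, right):
    """Left-minus-right level difference; applied to the initial channel
    signals, this plays the role of the 'initial loudness difference'."""
    return level_db(left) - level_db(right)
```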
- Step 605: Adjust the loudness of the processed left channel audio signal and the processed right channel audio signal separately, so that, after adjustment, the difference between the loudness difference of the two processed signals and the initial loudness difference is within the first preset range.
- the above-mentioned execution subject adjusts the loudness of the processed left channel audio signal and the processed right channel audio signal separately, so that, after adjustment, the difference between the loudness difference of the two processed signals and the initial loudness difference is within the first preset range.
- the first preset range may be a preset loudness difference range, for example, 0 sone or ±1 sone.
- as an example, if the loudness of the initial left channel audio signal is A and the loudness of the initial right channel audio signal is B, the loudness of the processed left channel audio signal is adjusted to be close to A, and the loudness of the processed right channel audio signal is adjusted to be close to B, so that the difference between the adjusted signals' loudness difference and the initial loudness difference falls within the first preset range.
- in this way, the loudness difference between the processed left channel audio signal and the processed right channel audio signal can be restored to the initial loudness difference, which helps to avoid sudden changes in the loudness of the audio signal during playback.
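- One way to realize step 605 under the same RMS-level stand-in: scale one processed channel by a gain chosen so the level difference returns to the initial difference. Adjusting only the right channel is an assumption made here for brevity.

```python
import numpy as np

def rms_db(x):
    """Same RMS-level helper as in the earlier sketch."""
    return 20.0 * np.log10(np.sqrt(np.mean(np.square(x))) + 1e-12)

def restore_loudness_difference(left, right, initial_diff_db):
    """Scale the processed right channel so the left/right level difference
    returns (to within floating point) to the initial difference."""
    current_diff = rms_db(left) - rms_db(right)
    gain = 10.0 ** ((current_diff - initial_diff_db) / 20.0)
    return left, right * gain
```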
- the foregoing execution subject may further perform the following steps:
- the predetermined binaural time difference between the initial left channel audio signal and the initial right channel audio signal is obtained as the initial binaural time difference.
- the initial left channel audio signal and the initial right channel audio signal are the same as the initial left channel audio signal and the initial right channel audio signal described in step 604, and will not be repeated here.
- the above-mentioned execution subject may determine the binaural time difference between the initial left channel audio signal and the initial right channel audio signal according to the existing method for determining the binaural time difference between the left and right channels. It should be noted that the method of determining the binaural time difference between the left and right channels is a well-known technology that has been widely researched and applied at present, and will not be repeated here.
- the second preset range may be a preset binaural time difference range, for example, 0 seconds, ⁇ 0.1 seconds, and so on.
- the binaural time difference between the processed left channel audio signal and the processed right channel audio signal can be adjusted by adjusting the start playback times of the processed left channel audio signal and the processed right channel audio signal, as sketched below.
- in this way, the binaural time difference between the processed left channel audio signal and the processed right channel audio signal can be restored to the initial binaural time difference, which helps to avoid sudden changes in the binaural time difference during playback and to better simulate the real sound field.
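- A sketch of the start-time adjustment described above: shifting one processed channel by a whole number of samples so the binaural time difference returns to the initial one. The sign convention (positive ITD means the right ear lags) and the externally supplied current ITD are assumptions; the patent treats ITD estimation as well-known.

```python
import numpy as np

def shift_samples(x, delay):
    """Delay (positive) or advance (negative) a signal by whole samples,
    zero-padding so the channel keeps its original length."""
    if delay >= 0:
        return np.concatenate([np.zeros(delay), x])[:len(x)]
    return np.concatenate([x[-delay:], np.zeros(-delay)])

def restore_itd(left, right, initial_itd_s, current_itd_s, sample_rate=48000):
    """Adjust the relative start time of the two processed channels so the
    binaural time difference returns to the initial one."""
    correction = int(round((initial_itd_s - current_itd_s) * sample_rate))
    # Assumed convention: positive ITD means the right ear lags the left.
    return left, shift_samples(right, correction)
```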
- compared with the embodiment corresponding to FIG. 2, the process 600 of the method for processing audio signals in this embodiment highlights the steps of adjusting the loudness of the processed left channel audio signal and the processed right channel audio signal. The solution described in this embodiment can therefore restore the loudness difference between the two processed signals to the initial loudness difference, helping to avoid sudden changes in the loudness of the audio signal during playback.
- the present disclosure provides an embodiment of an apparatus for processing audio signals, which corresponds to the method embodiment shown in FIG. 2.
- the apparatus can be applied to various electronic devices.
- the apparatus 700 for processing audio signals of this embodiment includes: a first acquisition unit 701 configured to acquire a target user's head image and an audio signal to be processed; a determination unit 702 configured to determine the head posture angle of the target user based on the head image, and to determine the distance between the target sound source and the target user's head; and a processing unit 703 configured to input the head posture angle, the distance, and the audio signal to be processed into a preset head-related transfer function to obtain a processed left channel audio signal and a processed right channel audio signal, where the head-related transfer function is used to characterize the correspondence between the head posture angle, the distance, and the audio signal to be processed, and the processed left channel audio signal and the processed right channel audio signal.
- the first acquiring unit 701 may acquire the head image of the target user and the left-channel to-be-processed audio signal and the right-channel to-be-processed audio signal remotely or locally through a wired connection or a wireless connection.
- the target user may be a user within the shooting range of the camera on the terminal device shown in FIG. 1 (for example, a user using the terminal device shown in FIG. 1).
- the audio signal to be processed may be an audio signal to be processed that is stored in the device 700 in advance.
- the audio signal to be processed may be an audio segment that is currently included in the audio currently being played on the device 700 and has not been played.
- the duration of the audio clip may be a preset duration, such as 5 seconds, 10 seconds, and so on.
- the determining unit 702 may determine the head posture angle of the target user, and determine the distance between the target sound source and the target user's head.
- the head posture angle can be used to characterize the degree of deflection of the target user's face orientation relative to the camera used to capture the head image of the target user.
- the above determination unit 702 may perform head pose estimation on the two-dimensional head image according to various existing head pose estimation methods.
- the method of head pose estimation may include but is not limited to the following methods: a method based on a machine learning model, a coordinate transformation method based on key points of a human face, and the like.
- the above determination unit 702 may determine the distance between the target sound source and the target user's head based on the head image.
- the above-mentioned determination unit 702 may determine the key points of the face in the head image using the existing method for determining the key points of the face, and determine the size of the image area including the determined key points of the face. Then, the above determination unit 702 may determine the distance between the target sound source and the head of the target user based on the preset correspondence between the size of the image area and the distance.
- the target sound source may be an actual electronic device that outputs audio signals.
- the electronic device that outputs the audio signal may be the above-mentioned terminal device including a camera; alternatively, the target sound source may be a virtual sound source determined by the above-mentioned execution subject and located at a target position.
- the distance between the target sound source and the target user's head may be the distance, determined as in the above example, between the electronic device outputting the audio signal and the target user's head; or the distance between the target sound source (that is, the virtual sound source) and the target user's head may be calculated from the determined distance (for example, by multiplying it by a preset coefficient, or adding a preset distance).
- the processing unit 703 inputs the head posture angle, the distance, and the audio signal to be processed into a preset head-related transfer function (HRTF) to obtain the processed left channel audio signal and the processed right channel audio signal.
- the head-related transfer function is used to characterize the correspondence between the head attitude angle, the distance, the audio signal to be processed and the processed left channel audio signal and the processed right channel audio signal.
- the head-related transfer function (also called binaural transfer function) describes the transmission process of sound waves from the sound source to both ears. It is the result of comprehensive filtering of sound waves by human physiological structure (such as head, pinna and torso, etc.). Because the head-related transfer function contains information about sound source localization, it is of great significance for the study of binaural hearing and psychoacoustics. In practical applications, the use of headphones or speakers to output signals processed with head-related transfer functions can simulate various spatial auditory effects.
- the HRTF may include two parts, namely a left HRTF and a right HRTF.
- the processing unit 703 may input the head posture angle, the determined distance, and the audio signal to be processed into the left HRTF and the right HRTF respectively; the left HRTF outputs the processed left channel audio signal, and the right HRTF outputs the processed right channel audio signal.
- the processed left channel audio signal and the processed right channel audio signal may have an interaural level difference (ILD, a loudness difference) and an interaural time difference (ITD).
- loudness, also known as volume, describes the strength of a sound as subjectively perceived by the human ear. Its unit of measurement is the sone: a pure tone at 1 kHz with a sound pressure level of 40 dB is defined to have a loudness of 1 sone.
- the binaural time difference refers to the difference between the times at which the sound from the source reaches the listener's two ears.
- the determining unit 702 may include: a recognition module (not shown in the figure) configured to input the head image into a pre-trained head posture recognition model to obtain the head posture angle of the target user, wherein the head posture recognition model is used to characterize the correspondence between a head image and the head posture angle of the user represented by that head image.
- the head posture recognition model is trained in advance as follows: acquire multiple sample head images and the sample head posture angle corresponding to each sample head image; then, using a machine learning method, train the head posture recognition model with the sample head images as input and the corresponding sample head posture angles as the desired output.
- the determining unit 702 may include: a first determination module (not shown in the figure) configured to determine the size of the head image; and a second determination module (not shown in the figure) configured to determine the distance between the target sound source and the target user's head based on a preset correspondence between head image size and distance.
- the apparatus 700 may further include: a second acquisition unit (not shown in the figure) configured to acquire a predetermined loudness difference between the initial left channel audio signal and the initial right channel audio signal as the initial loudness difference; and a first adjustment unit (not shown in the figure) configured to adjust the loudness of the processed left channel audio signal and the processed right channel audio signal separately, so that, after adjustment, the difference between the loudness difference of the two processed signals and the initial loudness difference is within the first preset range.
- the apparatus 700 may further include: a third acquisition unit (not shown in the figure) configured to acquire a predetermined binaural time difference between the initial left channel audio signal and the initial right channel audio signal as the initial binaural time difference; and a second adjustment unit (not shown in the figure) configured to adjust the binaural time difference between the processed left channel audio signal and the processed right channel audio signal, so that, after adjustment, the difference between the binaural time difference of the two processed signals and the initial binaural time difference is within the second preset range.
- the apparatus provided by the above embodiment of the present disclosure acquires the target user's head image and the audio signal to be processed, uses the head image to determine the target user's head posture angle and the distance between the target sound source and the target user's head, and finally inputs the head posture angle, the distance, and the audio signal to be processed into the preset head-related transfer function to obtain the processed left channel audio signal and the processed right channel audio signal. Adjusting the audio signal with the head image and the head-related transfer function improves the flexibility of audio signal processing and helps to simulate a near-real audio playback effect.
- Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (for example, car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
- the terminal device shown in FIG. 8 is only an example, and should not bring any limitation to the functions and use scope of the embodiments of the present disclosure.
- the terminal device 800 may include a processing device (such as a central processing unit or a graphics processor) 801, which can perform various appropriate operations and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from the storage device 808 into a random access memory (RAM) 803.
- in the RAM 803, various programs and data necessary for the operation of the terminal device 800 are also stored.
- the processing device 801, ROM 802, and RAM 803 are connected to each other via a bus 804.
- An input/output (I/O) interface 805 is also connected to the bus 804.
- the following devices can be connected to the I/O interface 805: input devices 806 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 807 such as a liquid crystal display (LCD), speaker, and vibrator; storage devices 808 such as a magnetic tape or hard disk; and a communication device 809.
- the communication device 809 may allow the terminal device 800 to perform wireless or wired communication with other devices to exchange data.
- although FIG. 8 shows a terminal device 800 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may be implemented or provided instead.
- the process described above with reference to the flowchart may be implemented as a computer software program.
- embodiments of the present disclosure include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart.
- the computer program may be downloaded and installed from the network through the communication device 809, or installed from the storage device 808, or installed from the ROM 802.
- when the computer program is executed by the processing device 801, the above-described functions defined in the methods of the embodiments of the present disclosure are executed.
- the computer-readable medium described in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
- the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
- the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
- The computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above.
- The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and it may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
- The program code contained on the computer-readable medium may be transmitted using any appropriate medium, including but not limited to: an electric wire, an optical cable, RF (radio frequency), or any suitable combination of the foregoing.
- The above-mentioned computer-readable medium may be included in the above-mentioned terminal device, or it may exist alone without being assembled into the terminal device.
- The computer-readable medium carries one or more programs. When the one or more programs are executed by the terminal device, the terminal device is caused to: acquire the head image of the target user and the audio signal to be processed; determine, based on the head image, the head posture angle of the target user and the distance between the target sound source and the head of the target user; and input the head posture angle, the distance, and the audio signal to be processed into a preset head-related transfer function to obtain the processed left-channel audio signal and the processed right-channel audio signal, where the head-related transfer function is used to characterize the correspondence of the head posture angle, the distance, and the to-be-processed audio signal with the processed left-channel audio signal and the processed right-channel audio signal.
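As a rough illustration of the head-related transfer function step in that flow, the sketch below selects the nearest entry from a hypothetical table of measured head-related impulse responses, indexed by yaw angle and distance, and applies it to the audio by FFT-based convolution. The grid, the random placeholder filters, and the nearest-neighbor lookup are assumptions for illustration only; the disclosure does not prescribe a particular HRTF representation or interpolation scheme.

```python
import numpy as np
from numpy.fft import rfft, irfft

# Hypothetical preset: HRIRs measured on a grid of (yaw degrees, distance m).
# Shape: (n_yaw, n_dist, 2 ears, filter_length).
YAWS = np.array([-90.0, -45.0, 0.0, 45.0, 90.0])
DISTS = np.array([0.5, 1.0, 2.0])
HRIR_TABLE = np.random.default_rng(0).normal(size=(5, 3, 2, 128)) * 0.05

def apply_hrtf(yaw, distance, audio):
    """Nearest-neighbor HRIR lookup followed by fast convolution."""
    i = int(np.argmin(np.abs(YAWS - yaw)))
    j = int(np.argmin(np.abs(DISTS - distance)))
    hrir_l, hrir_r = HRIR_TABLE[i, j]
    n = len(audio) + HRIR_TABLE.shape[-1] - 1  # full linear-convolution length
    spec = rfft(audio, n)
    left = irfft(spec * rfft(hrir_l, n), n)
    right = irfft(spec * rfft(hrir_r, n), n)
    return left, right

# Example: one second of noise rendered for a 30-degree yaw at 1.2 m.
left, right = apply_hrtf(30.0, 1.2, np.random.default_rng(1).normal(size=16000))
```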
- Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages.
- The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
- The remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
- Each block in the flowchart or block diagrams may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logic functions.
- It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved.
- Each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented with a dedicated hardware-based system that performs the specified functions or operations, or with a combination of dedicated hardware and computer instructions.
- The units described in the embodiments of the present disclosure may be implemented in software or in hardware.
- The name of a unit does not, in some cases, constitute a limitation on the unit itself. For example, the first acquisition unit may also be described as "a unit that acquires the head image of the target user and the audio signal to be processed".
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- General Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Human Computer Interaction (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Medical Informatics (AREA)
- Multimedia (AREA)
- Stereophonic System (AREA)
Abstract
Description
Claims (14)
- A method for processing an audio signal, comprising: acquiring a head image of a target user and an audio signal to be processed; determining, based on the head image, a head posture angle of the target user, and determining a distance between a target sound source and the head of the target user; and inputting the head posture angle, the distance, and the audio signal to be processed into a preset head-related transfer function to obtain a processed left-channel audio signal and a processed right-channel audio signal, wherein the head-related transfer function is used to characterize the correspondence of the head posture angle, the distance, and the audio signal to be processed with the processed left-channel audio signal and the processed right-channel audio signal.
- The method according to claim 1, wherein the determining, based on the head image, the head posture angle of the target user comprises: inputting the head image into a pre-trained head posture recognition model to obtain the head posture angle of the target user, wherein the head posture recognition model is used to characterize the correspondence between a head image and the head posture angle of the user represented by the head image.
- The method according to claim 2, wherein the head posture recognition model is trained in advance through the following steps: acquiring a plurality of sample head images and the sample head posture angles corresponding to the sample head images; and using a machine learning method, training with the sample head images as input and the corresponding sample head posture angles as expected output to obtain the head posture recognition model.
- The method according to claim 1, wherein the determining the distance between the target sound source and the head of the target user comprises: determining the size of the head image; and determining the distance between the target sound source and the head of the target user based on a preset correspondence between head image sizes and distances.
- The method according to any one of claims 1-4, wherein, after obtaining the processed left-channel audio signal and the processed right-channel audio signal, the method further comprises: acquiring a predetermined loudness difference between an initial left-channel audio signal and an initial right-channel audio signal as an initial loudness difference; and adjusting the loudness of the processed left-channel audio signal and of the processed right-channel audio signal respectively, so that the difference between the loudness difference of the loudness-adjusted processed left-channel and right-channel audio signals and the initial loudness difference falls within a first preset range.
- The method according to claim 5, further comprising: acquiring a predetermined interaural time difference between the initial left-channel audio signal and the initial right-channel audio signal as an initial interaural time difference; and adjusting the interaural time difference between the processed left-channel audio signal and the processed right-channel audio signal, so that the difference between the interaural time difference of the adjusted processed left-channel and right-channel audio signals and the initial interaural time difference falls within a second preset range.
- An apparatus for processing an audio signal, comprising: a first acquisition unit configured to acquire a head image of a target user and an audio signal to be processed; a determination unit configured to determine, based on the head image, a head posture angle of the target user, and to determine a distance between a target sound source and the head of the target user; and a processing unit configured to input the head posture angle, the distance, and the audio signal to be processed into a preset head-related transfer function to obtain a processed left-channel audio signal and a processed right-channel audio signal, wherein the head-related transfer function is used to characterize the correspondence of the head posture angle, the distance, and the audio signal to be processed with the processed left-channel audio signal and the processed right-channel audio signal.
- The apparatus according to claim 7, wherein the determination unit comprises: a recognition module configured to input the head image into a pre-trained head posture recognition model to obtain the head posture angle of the target user, wherein the head posture recognition model is used to characterize the correspondence between a head image and the head posture angle of the user represented by the head image.
- The apparatus according to claim 8, wherein the head posture recognition model is trained in advance through the following steps: acquiring a plurality of sample head images and the sample head posture angles corresponding to the sample head images; and using a machine learning method, training with the sample head images as input and the corresponding sample head posture angles as expected output to obtain the head posture recognition model.
- The apparatus according to claim 7, wherein the determination unit comprises: a first determination module configured to determine the size of the head image; and a second determination module configured to determine the distance between the target sound source and the head of the target user based on a preset correspondence between head image sizes and distances.
- The apparatus according to any one of claims 7-10, further comprising: a second acquisition unit configured to acquire a predetermined loudness difference between an initial left-channel audio signal and an initial right-channel audio signal as an initial loudness difference; and a first adjustment unit configured to adjust the loudness of the processed left-channel audio signal and of the processed right-channel audio signal respectively, so that the difference between the loudness difference of the loudness-adjusted processed left-channel and right-channel audio signals and the initial loudness difference falls within a first preset range.
- The apparatus according to claim 11, further comprising: a third acquisition unit configured to acquire a predetermined interaural time difference between the initial left-channel audio signal and the initial right-channel audio signal as an initial interaural time difference; and a second adjustment unit configured to adjust the interaural time difference between the processed left-channel audio signal and the processed right-channel audio signal, so that the difference between the interaural time difference of the adjusted processed left-channel and right-channel audio signals and the initial interaural time difference falls within a second preset range.
- A terminal device, comprising: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-6.
- A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the method according to any one of claims 1-6.
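As an aside for readers tracing claims 5 and 6 above, the sketch below shows one way the loudness-difference and interaural-time-difference adjustments could be realized. It is a simplified illustration under stated assumptions, with RMS level standing in for loudness and the cross-correlation peak lag standing in for the interaural time difference; it is not the claimed method itself.

```python
import numpy as np

def rms_db(x):
    """Loudness proxy: RMS level in dB (a simplification of claim 5's loudness)."""
    return 20.0 * np.log10(np.sqrt(np.mean(np.square(x))) + 1e-12)

def itd_samples(left, right):
    """ITD proxy: lag (in samples) of the cross-correlation peak (claim 6)."""
    corr = np.correlate(left, right, mode="full")
    return int(np.argmax(corr)) - (len(right) - 1)

def match_ild(left, right, initial_ild_db):
    """Scale the right channel so the L/R level difference equals the
    predetermined initial loudness difference."""
    current = rms_db(left) - rms_db(right)
    right = right * (10.0 ** ((current - initial_ild_db) / 20.0))
    return left, right

def match_itd(left, right, initial_itd):
    """Shift the right channel so the interaural time difference matches the
    predetermined initial ITD. np.roll is a circular shift, used for brevity."""
    shift = itd_samples(left, right) - initial_itd
    return left, np.roll(right, shift)

# Example: restore a 3 dB level difference and an ITD of -8 samples.
fs = 16000
left = np.sin(2 * np.pi * 500 * np.arange(fs) / fs)
right = 0.5 * np.roll(left, 12)
left, right = match_ild(left, right, initial_ild_db=3.0)
left, right = match_itd(left, right, initial_itd=-8)
```

In practice the tolerance checks of the claims (the first and second preset ranges) would gate whether any adjustment is applied at all.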
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2020545268A JP7210602B2 (ja) | 2018-10-12 | 2019-01-24 | Method and device for processing audio signals |
GB2100831.3A GB2590256B (en) | 2018-10-12 | 2019-01-24 | Method and device for processing audio signal |
US16/980,119 US11425524B2 (en) | 2018-10-12 | 2019-01-24 | Method and device for processing audio signal |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811190415.4A CN111050271B (zh) | 2018-10-12 | 2018-10-12 | Method and device for processing audio signal |
CN201811190415.4 | 2018-10-12 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020073563A1 (zh) | 2020-04-16 |
Family
ID=70164992
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/072948 WO2020073563A1 (zh) | Method and device for processing audio signal | 2018-10-12 | 2019-01-24 |
Country Status (5)
Country | Link |
---|---|
US (1) | US11425524B2 (zh) |
JP (1) | JP7210602B2 (zh) |
CN (1) | CN111050271B (zh) |
GB (1) | GB2590256B (zh) |
WO (1) | WO2020073563A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2604019A (en) * | 2020-12-16 | 2022-08-24 | Nvidia Corp | Visually tracked spacial audio |
WO2023058466A1 (ja) * | 2021-10-06 | 2023-04-13 | Sony Group Corporation | Information processing device and data structure |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20200049020A (ko) * | 2018-10-31 | 2020-05-08 | Samsung Electronics Co., Ltd. | Method for displaying content in response to a voice command, and electronic device therefor |
CN112637755A (zh) * | 2020-12-22 | 2021-04-09 | Guangzhou Panyu Juda Car Audio Equipment Co., Ltd. | Audio playback control method, device and playback system based on wireless connection |
CN113099373B (zh) * | 2021-03-29 | 2022-09-23 | Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. | Method, device, terminal and storage medium for sound field width expansion |
CN114501297B (zh) * | 2022-04-02 | 2022-09-02 | Beijing Honor Device Co., Ltd. | Audio processing method and electronic device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030007648A1 (en) * | 2001-04-27 | 2003-01-09 | Christopher Currell | Virtual audio system and techniques |
CN104392241A (zh) * | 2014-11-05 | 2015-03-04 | University of Electronic Science and Technology of China | Head posture estimation method based on mixed regression |
CN107168518A (zh) * | 2017-04-05 | 2017-09-15 | Beijing Pico Technology Co., Ltd. | Synchronization method and device for a head-mounted display, and head-mounted display |
CN107182011A (zh) * | 2017-07-21 | 2017-09-19 | Shenzhen Taihengnuo Technology Co., Ltd. Shanghai Branch | Audio playback method and system, mobile terminal, and WiFi earphone |
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AUPQ896000A0 (en) | 2000-07-24 | 2000-08-17 | Seeing Machines Pty Ltd | Facial image processing system |
EP1424685A1 (en) * | 2002-11-28 | 2004-06-02 | Sony International (Europe) GmbH | Method for generating speech data corpus |
CN102860041A (zh) * | 2010-04-26 | 2013-01-02 | Cambridge Mechatronics Ltd. | Loudspeakers with position tracking of a listener |
CN101938686B (zh) * | 2010-06-24 | 2013-08-21 | Institute of Acoustics, Chinese Academy of Sciences | Measurement system and measurement method for head-related transfer functions in an ordinary environment |
KR101227932B1 (ko) | 2011-01-14 | 2013-01-30 | Korea Electronics Technology Institute | Multi-channel multi-track audio system and audio processing method |
JP2014131140A (ja) | 2012-12-28 | 2014-07-10 | Yamaha Corp | Communication system, AV receiver, and communication adapter device |
CN104010265A (zh) * | 2013-02-22 | 2014-08-27 | Dolby Laboratories Licensing Corp. | Audio spatial rendering device and method |
JP6147603B2 (ja) | 2013-07-31 | 2017-06-14 | KDDI Corporation | Voice transmission device and voice transmission method |
WO2015162947A1 (ja) * | 2014-04-22 | 2015-10-29 | Sony Corporation | Information reproduction device, information reproduction method, information recording device, and information recording method |
JP2016199124A (ja) * | 2015-04-09 | 2016-12-01 | 之彦 須崎 | Sound field control device and application method |
US10595148B2 (en) | 2016-01-08 | 2020-03-17 | Sony Corporation | Sound processing apparatus and method, and program |
WO2017120767A1 (zh) * | 2016-01-12 | 2017-07-20 | Shenzhen Dlodlo New Technology Co., Ltd. | Head posture prediction method and device |
CN105760824B (zh) * | 2016-02-02 | 2019-02-01 | Beijing Evolver Robotics Technology Co., Ltd. | Moving human body tracking method and system |
US9591427B1 (en) * | 2016-02-20 | 2017-03-07 | Philip Scott Lyren | Capturing audio impulse responses of a person with a smartphone |
CN108038474B (zh) * | 2017-12-28 | 2020-04-14 | Shenzhen Lifei Technology Co., Ltd. | Face detection method, training method for convolutional neural network parameters, device, and medium |
WO2019246044A1 (en) * | 2018-06-18 | 2019-12-26 | Magic Leap, Inc. | Head-mounted display systems with power saving functionality |
- 2018
- 2018-10-12 CN CN201811190415.4A patent/CN111050271B/zh active Active
- 2019
- 2019-01-24 WO PCT/CN2019/072948 patent/WO2020073563A1/zh active Application Filing
- 2019-01-24 US US16/980,119 patent/US11425524B2/en active Active
- 2019-01-24 JP JP2020545268A patent/JP7210602B2/ja active Active
- 2019-01-24 GB GB2100831.3A patent/GB2590256B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030007648A1 (en) * | 2001-04-27 | 2003-01-09 | Christopher Currell | Virtual audio system and techniques |
CN104392241A (zh) * | 2014-11-05 | 2015-03-04 | University of Electronic Science and Technology of China | Head posture estimation method based on mixed regression |
CN107168518A (zh) * | 2017-04-05 | 2017-09-15 | Beijing Pico Technology Co., Ltd. | Synchronization method and device for a head-mounted display, and head-mounted display |
CN107182011A (zh) * | 2017-07-21 | 2017-09-19 | Shenzhen Taihengnuo Technology Co., Ltd. Shanghai Branch | Audio playback method and system, mobile terminal, and WiFi earphone |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2604019A (en) * | 2020-12-16 | 2022-08-24 | Nvidia Corp | Visually tracked spacial audio |
WO2023058466A1 (ja) * | 2021-10-06 | 2023-04-13 | Sony Group Corporation | Information processing device and data structure |
Also Published As
Publication number | Publication date |
---|---|
JP7210602B2 (ja) | 2023-01-23 |
CN111050271A (zh) | 2020-04-21 |
US20210029486A1 (en) | 2021-01-28 |
GB202100831D0 (en) | 2021-03-10 |
GB2590256A (en) | 2021-06-23 |
CN111050271B (zh) | 2021-01-29 |
US11425524B2 (en) | 2022-08-23 |
GB2590256B (en) | 2023-04-26 |
JP2021535632A (ja) | 2021-12-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020073563A1 (zh) | Method and device for processing audio signal | |
US10585486B2 (en) | Gesture interactive wearable spatial audio system | |
US11765538B2 (en) | Wearable electronic device (WED) displays emoji that plays binaural sound | |
US20160241980A1 (en) | Adaptive ambisonic binaural rendering | |
US11356795B2 (en) | Spatialized audio relative to a peripheral device | |
US11297456B2 (en) | Moving an emoji to move a location of binaural sound | |
TWI709131B (zh) | Audio scene processing technology | |
US20190246231A1 (en) | Method of improving localization of surround sound | |
WO2023045980A1 (zh) | Audio signal playback method and apparatus, and electronic device | |
US10582329B2 (en) | Audio processing device and method | |
CN117835121A (zh) | Stereo reproduction method, computer, microphone device, speaker device, and television | |
EP3625975B1 (en) | Incoherent idempotent ambisonics rendering | |
WO2020155908A1 (zh) | Method and device for generating information | |
CN114339582B (zh) | Dual-channel audio processing, directional filter generation method, apparatus, and medium | |
US10390167B2 (en) | Ear shape analysis device and ear shape analysis method | |
CN114630240B (zh) | Directional filter generation method, audio processing method, apparatus, and storage medium | |
WO2024027315A1 (zh) | Audio processing method and apparatus, electronic device, storage medium, and program product | |
CN117793611A (zh) | Method for generating stereo sound, method for playing stereo sound, device, and storage medium | |
WO2022093162A1 (en) | Calculation of left and right binaural signals for output | |
CN116193196A (zh) | Virtual surround sound rendering method, apparatus, device, and storage medium | |
CN118053442A (zh) | Training data generation method, apparatus, electronic device, and storage medium | |
CN113674751A (zh) | Audio processing method, apparatus, electronic device, and storage medium | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19871599; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 2020545268; Country of ref document: JP; Kind code of ref document: A |
| ENP | Entry into the national phase | Ref document number: 202100831; Country of ref document: GB; Kind code of ref document: A; Free format text: PCT FILING DATE = 20190124 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29/07/2021) |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19871599; Country of ref document: EP; Kind code of ref document: A1 |