CN109410912B - Audio processing method and device, electronic equipment and computer readable storage medium - Google Patents

Audio processing method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN109410912B
CN109410912B (application CN201811400323.4A)
Authority
CN
China
Prior art keywords
audio information, audio, sound, information, processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811400323.4A
Other languages
Chinese (zh)
Other versions
CN109410912A (en)
Inventor
马永振
朱旭光
梅航
叶希喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tencent Information Technology Co Ltd
Original Assignee
Shenzhen Tencent Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Tencent Information Technology Co Ltd
Priority to CN201811400323.4A
Publication of CN109410912A
Application granted
Publication of CN109410912B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/165 Management of the audio stream, e.g. setting of volume, audio stream path
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S1/00 Two-channel systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/305 Electronic adaptation of stereophonic audio signals to reverberation of the listening space
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L2013/021 Overlap-add techniques

Abstract

The embodiments of the present application provide an audio processing method and apparatus, an electronic device, and a computer-readable storage medium, relating to the field of multimedia technologies. The method comprises: obtaining audio information to be processed and audio information recorded through a human head microphone; determining audio information of a preset type from the audio information to be processed and processing it through a preset plug-in; and performing sound mixing processing on the audio information recorded through the human head microphone and the processed audio information. The embodiments of the present application can improve the sense of localization and the sense of space of sound, and thus improve the user's listening experience when watching videos.

Description

Audio processing method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of multimedia technologies, and in particular, to an audio processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
With the development of information technology, the video field has advanced as well, covering, for example, computer graphics (CG) for games, virtual reality (VR) game CG, and motion comics. To let users experience video content more fully, the audio information synthesized into the video content must be processed accordingly; how to process that audio information so that users enjoy a better listening experience when watching the video content has therefore become a key problem.
In the prior art, audio information synthesized with video content is processed by means of Ambisonics reproduction. However, because the Ambisonics technique blurs sound-source localization, and because of the limiting factor that far-field localization of sound is poor, the sense of localization and the sense of space of the sound are insufficiently expressed, and the user's listening experience when watching the video is consequently poor.
Disclosure of Invention
The present application provides an audio processing method and apparatus, an electronic device, and a computer-readable storage medium, which are intended to solve the problems that the sense of localization and the sense of space of sound are insufficiently expressed and that the user experience when watching a video is poor. The technical solution is as follows:
in a first aspect, a method of audio processing is provided, the method comprising:
acquiring audio information to be processed and audio information recorded by a human head microphone;
determining audio information of a preset type from the audio information to be processed, and processing the preset type of audio information through a preset plug-in; and performing sound mixing processing on the audio information recorded by the human head microphone and the processed audio information.
In a possible implementation manner, before the obtaining of the audio information to be processed and the audio information recorded by the human head microphone, the method further includes:
in the audio information recording process, determining the microphones used for current recording based on the distances between the sound source and the microphones;
and recording corresponding audio information through the determined microphone.
In one possible implementation manner, determining the microphone currently used for recording based on the distances between the sound source and the respective microphones, and recording the corresponding audio information through the determined microphone, includes:
when it is detected that the distance between the sound source and the human head microphone satisfies a first preset condition, determining that the microphone currently used for recording is the human head microphone, and recording the corresponding audio information through the human head microphone;
and when it is detected that the distance between the sound source and the condenser microphone satisfies a second preset condition, determining that the microphone currently used for recording is the condenser microphone, and recording the corresponding audio information through the condenser microphone.
In one possible implementation manner, the performing of the sound mixing processing on the audio information recorded by the human head microphone and the processed audio information includes:
and carrying out sound mixing processing on the audio information recorded by the human head microphone and the processed audio information in a linear superposition mode.
In one possible implementation manner, performing sound mixing processing on the audio information recorded by the human head microphone and the processed audio information in a linear superposition manner includes:
linearly superposing the audio information recorded by the human head microphone and the processed audio information;
dividing the linearly superposed audio mixing signal into at least two audio mixing signal intensity intervals according to the audio intensity;
respectively carrying out audio intensity contraction on each audio mixing signal intensity interval by adopting corresponding contraction proportions;
superposing at least two audio mixing signal intensity intervals subjected to audio intensity contraction;
wherein the contraction proportion adopted for each mixed-signal intensity interval is inversely related to the audio intensity corresponding to that intensity interval.
In a possible implementation manner, after the sound mixing processing of the audio information recorded by the human head microphone and the processed audio information, the method further includes:
and synthesizing the audio information after the audio mixing process and the video information to be synthesized.
In one possible implementation manner, synthesizing the audio information after the audio mixing process and the video information to be synthesized includes:
respectively encoding the audio information after the audio mixing processing and the video information to be synthesized to obtain the audio information after the encoding processing and the video information after the encoding processing;
and synthesizing the audio information after the coding processing and the video information after the coding processing.
In a possible implementation manner, after the audio information after the sound mixing processing and the video information to be synthesized are respectively encoded to obtain the encoded audio information and the encoded video information, the method further includes:
determining a video frame rate corresponding to the coded video information;
interleaving the encoded audio information and the encoded video information based on a video frame rate corresponding to the encoded video information to obtain an encoded interleaving queue;
synthesizing the audio information after the coding processing and the video information after the coding processing, including:
and synthesizing the coded interleaving queues.
In one possible implementation manner, the preset plug-in is a head-related transfer function (HRTF) plug-in.
In a second aspect, there is provided an apparatus for audio processing, the apparatus comprising:
the acquisition module is used for acquiring audio information to be processed and audio information recorded by a human head microphone;
the first determining module is used for determining audio information of a preset type from the audio information to be processed acquired by the acquiring module;
the plug-in processing module is used for processing the audio information of the preset type determined by the first determining module through a preset plug-in;
and the sound mixing processing module is used for carrying out sound mixing processing on the audio information recorded by the head microphone and the audio information processed by the plug-in processing module.
In one possible implementation, the audio information to be processed includes at least one of the following:
ambient sound information; sound effect information; audio information recorded by a condenser microphone; background music information.
In one possible implementation, the apparatus further includes: the second determining module and the recording module;
the second determining module is used for determining the microphones used for current recording based on the distances between the sound source and the microphones in the audio information recording process;
and the recording module is used for recording the corresponding audio information through the microphone determined by the second determining module.
In a possible implementation manner, the second determining module is specifically configured to determine that the microphone currently used for recording is the human head microphone when it is detected that the distance between the sound source and the human head microphone satisfies a first preset condition;
the recording module is specifically configured to record the corresponding audio information through the human head microphone determined by the second determining module;
the second determining module is further specifically configured to determine that the microphone currently used for recording is the condenser microphone when it is detected that the distance between the sound source and the condenser microphone satisfies a second preset condition;
and the recording module is further specifically configured to record the corresponding audio information through the condenser microphone determined by the second determining module.
In a possible implementation manner, the sound mixing processing module is specifically configured to perform sound mixing processing on the audio information recorded by the human head microphone and the processed audio information in a linear superposition manner.
In one possible implementation manner, the sound mixing processing module includes: the device comprises a superposition unit, a division unit and an audio intensity contraction unit;
the superposition unit is used for linearly superposing the audio information recorded by the human head microphone and the processed audio information;
the dividing unit is used for dividing the audio mixing signal linearly superposed by the superposing unit into at least two audio mixing signal intensity intervals according to the audio intensity;
the audio intensity contraction unit is used for respectively performing audio intensity contraction on each mixed signal intensity interval divided by the dividing unit by adopting a corresponding contraction proportion;
the superposition unit is also used for superposing at least two mixed sound signal intensity intervals subjected to audio intensity contraction by the audio intensity contraction unit;
wherein the contraction proportion adopted for each mixed-signal intensity interval is inversely related to the audio intensity corresponding to that intensity interval.
In one possible implementation, the apparatus further includes: a synthesis module;
and the synthesis module is used for synthesizing the audio information subjected to the sound mixing processing by the sound mixing processing module and the video information to be synthesized.
In one possible implementation, the synthesis module includes: an encoding unit and a synthesizing unit;
the encoding unit is used for respectively encoding the audio information after the audio mixing processing and the video information to be synthesized to obtain the audio information after the encoding processing and the video information after the encoding processing;
and the synthesizing unit is used for synthesizing the audio information coded and processed by the coding unit and the video information coded and processed by the coding unit.
In one possible implementation, the apparatus further includes: a third determining module and an interleaving module;
the third determining module is used for determining a video frame rate corresponding to the coded video information;
the interleaving module is used for interleaving the encoded audio information and the encoded video information based on the video frame rate corresponding to the encoded video information determined by the third determining module to obtain an encoded interleaving queue;
and the synthesizing module is specifically configured to synthesize the encoded interleaving queue obtained by the interleaving module.
In one possible implementation manner, the preset plug-in is a head-related transfer function (HRTF) plug-in.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the audio processing method according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the method of audio processing shown in the first aspect.
The technical solutions provided by the embodiments of the present application have the following beneficial effects:
Compared with the prior art, in which audio information synthesized with video is processed by way of Ambisonics (high-fidelity stereophonic) reproduction, the embodiments of the present application obtain the audio information to be processed and the audio information recorded through a human head microphone, determine the preset type of audio information from the audio information to be processed, process the preset type of audio information through a preset plug-in, and then perform sound mixing processing on the audio information recorded through the human head microphone and the processed audio information. In other words, the audio information belonging to the preset type is first processed through the preset plug-in and then synthesized with the audio information recorded through the human head microphone. Because both recording through the human head microphone and processing through the preset plug-in improve the spatial localization effect of the audio information, the sense of sound localization and the sense of space of the audio information can be improved, which in turn improves the user's listening experience, especially when watching videos.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of an audio processing method according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of another audio processing apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device for audio processing according to an embodiment of the present disclosure;
FIG. 5 is a schematic view of a human head microphone;
FIG. 6 is a schematic diagram of a system for loading effectors on classified sample groups;
FIG. 7 is a diagram of a plug-in for processing a predetermined type of audio information;
FIG. 8 is a schematic representation of a Cartesian ear model coordinate system;
FIG. 9 is a schematic diagram of the determination of the position in three dimensions of a sound by inputting values within a plug-in;
FIG. 10 is a schematic diagram of the positional relationship of sound to a listener;
FIG. 11 is a schematic diagram of adjusting the volume of an input sound source by adjusting the GAIN button in the plug-in;
FIG. 12 is a schematic diagram illustrating adjustment of the distance between six surfaces and a listener;
FIG. 13 is a schematic diagram of the DAMPING button used to optimize a sound frequency band;
FIG. 14 is a schematic diagram of a real-time sound field auditory simulation (REALTIME AURALISATION) button;
fig. 15 is a schematic diagram of a reverberation processing approach;
FIG. 16a is a schematic diagram of parameter adjustment for 3D audio output from a headphone loudspeaker;
FIG. 16b is a schematic diagram of parameter adjustment for the output of 3D audio from a speaker;
fig. 17 is a schematic view showing the degree of transition between three-dimensional audio obtained by binaural sound reproduction and three-dimensional audio obtained by binaural sound synthesis;
fig. 18 is a schematic view of volume adjustment by adjusting a compression effector during mixing;
FIG. 19 is a diagram of an interleaving queue before encoding in an embodiment of the present application;
FIG. 20 is a parameter diagram of a video format in the synthesized multimedia message;
FIG. 21 is a diagram illustrating the formatting of output audio information after synthesis;
fig. 22 is a schematic view of an overall manufacturing process of the sound cartoon in the embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, where like or similar reference numerals refer to the same or similar elements or to elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present application, and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any combination of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms referred to in this application will first be introduced and explained:
three-dimensional (3Dimensions, 3D) audio: 3D audio is a relatively general concept, if existing surround sound is said, such as 5.1 standard or 7.1 standard, they are a two-dimensional plane sound standard, and the 3D sound standard will be a standard including the height and depth of sound that the listener can perceive. The mainstream view at present is to use sounds to which HRTF playback is applied as a way of rendering 3D audio.
Ambisonics: a brief analysis of the theoretical basis of Ambisonics shows that, given the sound pressure on a plane in a certain region, the sound pressure at any point can be calculated through the sound pressure gradient. In a stereo field, the three-dimensional coordinate system is a spherical coordinate system, each spherical layer being called an order; Ambisonics decomposes the original sound field three-dimensionally and picks up the sound pressure in each direction (plane), and during playback it reconstructs the original sound field through loudspeakers distributed reasonably and uniformly over a spherical structure.
Binaural recording: a technology different from ordinary stereo pickup, capturing the sound after it has been filtered by the structures of the human body on its way to the ears; it is achieved by placing microphones in the left and right ear canals of a model human head to pick up the sound.
Head-related transfer function (HRTF): combined with binaural recording, the filtering performed by the human body can be turned into a filter encoding. Obtaining the HRTF encoding can follow the convolution-reverberation approach: in an environment with little spatial influence, an impulse response (usually a transient pulse or a sweep signal) is picked up using binaural recording, giving a binaurally recorded impulse response signal called the head-related impulse response (HRIR); comparing it with the original impulse response yields the HRTF encoding. Thus, by HRTF-encoding a recorded mono sound we obtain sound that differs between the left and right ears, i.e., binaural stereo, and this HRTF-encoded binaural stereo contains three-dimensional spatial information.
Sound spatialization: in a real listening environment, we perceive the information of sound in space using only our ears, including the localization of the sound and the spatial perception of the sound (the distance from the sound source to ourselves).
Condenser microphone: a microphone that converts the sound signal into an electrical signal by using changes in capacitance.
Human head microphone: a microphone equipped with auricles, ear canals, a skull, hair, and shoulders, whose "skin" and "bones" are made of materials closest to those of the human body. It records audio in a "simulated human head" two-channel mode: two miniature omnidirectional microphones are placed inside the ear canals of a dummy head almost identical to a real one (close to where the human eardrum would be), simulating the whole process by which the human ear hears sound.
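To make the HRTF term above concrete: since the HRTF is in essence a pair of audio filters, binaural rendering comes down to convolving a mono signal with a measured left/right HRIR pair. The following is a minimal illustrative sketch in Python with NumPy (the function and array names are placeholders for illustration, not anything defined in the patent):

    import numpy as np

    def render_binaural(mono, hrir_left, hrir_right):
        # Convolve the mono signal with the left- and right-ear HRIRs;
        # the differences between the two ears carry the 3D spatial cues.
        left = np.convolve(mono, hrir_left)
        right = np.convolve(mono, hrir_right)
        return np.stack([left, right], axis=-1)  # (samples, 2) binaural stereo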
In the prior art, in a video with a 360-degree rotatable viewing angle or a virtual reality (VR) video, the sound sources are placed and encoded in the Ambisonics manner, discrete transformation of the sound sources is performed as required, decoding is performed through the HRTF, and the result is finally output and replayed through earphones.
Sound effects produced with this technique do change in real time as the camera rotates, but Ambisonics is a technical means that blurs sound-source localization, and under the limiting factor of poor far-field localization, the localization and the sense of space of the sound are insufficiently expressed.
To solve the above problem, the sound design here aims to restore the listening sensation of a real scene, so that the user can experience a complete 3D sound scene in a 2D plane video or a VR video, and even in free-viewpoint video content, making the user a participant in the video's story rather than a mere listener. This is achieved specifically by the following means:
the embodiment of the present application takes the form of a 3D-audio immersive experience; technically, the post-production of the sound is carried out by combining human head recording with an HRTF plug-in, matched to the listening experience of the video content.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
An embodiment of the present application provides an audio processing method, as shown in fig. 1, the method includes:
and S101, acquiring audio information to be processed and audio information recorded by a human head microphone.
For the embodiment of the application, in the process of recording the audio information through the human head microphone, namely refraction, diffraction and diffraction of sound waves by imitating human auricles, auditory canals, human craniums, shoulders and the like all can affect the sound human head recording technology to a certain extent. Acoustically, this effect is described by HRTFs, i.e. "head related transfer functions". Just because of the influence of HRTFs, the human brain can empirically determine the direction and distance from which sound is emitted. In the embodiment of the present application, the advantage of recording audio information through the head is that the performance of the sound of the ear-attaching sound is much better than that of the plug-in, i.e. the sound close to the ear-attaching sound, such as the blowing sound and the sense of hearing of "whisper," and such performance can be referred to as spontaneous perceptual Meridian reaction (ASMR) type sound; the disadvantages are as follows: the synchronous recording mode has high requirement on the site environment; it is difficult to perform recording in stages; sound positioning of recorded contents, and incapability of changing the space sense in the later period; the requirements for walking and performing during recording are high, the audibility is difficult to predict, and a large amount of tests are needed; the sound of a far sound field generally shows localization. In the embodiment of the present application, a human head microphone is shown in fig. 5.
For the embodiment of the application, the audio information to be processed includes at least one of the following:
ambient sound information; sound effect information; audio information recorded by a condenser microphone; background music information.
For the embodiment of the present application, the audio information recorded by the condenser microphone may be the voices of the voice actors recorded through the condenser microphone; the ambient sound information, the sound effect information, and the background music information may be pre-recorded or pre-synthesized audio information. The embodiments of the present application do not limit this.
Step S102: determining the preset type of audio information from the audio information to be processed, and processing the preset type of audio information through a preset plug-in.
For the embodiment of the present application, the audio information to be processed can be divided into 3D audio information and non-3D audio information according to the grouping in the project. For example, the 3D class of audio information may include dialogue audio information and action sounds, while the non-3D class includes ambient sound information, background music information, and special-effect audio information. When grouping the classified samples, as shown in fig. 6, the dialogue audio, ambient audio, and special-effect audio groups are arranged from top to bottom, with many sound tracks under each group; an effector then only needs to be loaded on each group rather than on every track, which saves resource consumption.
For the embodiment of the present application, the preset type of audio information may be 3D type audio information.
For the embodiment of the present application, the preset type of audio information (the 3D class) is processed through the preset plug-in so that the processed audio has a surround effect. Processing through the preset plug-in has the advantage that, because point sound sources are processed, later adjustment of localization and of the sense of space is convenient. The disadvantages are that close-to-ear sound is not distinct, and that when there are too many point sound sources, the amount of sound-source control in post-processing becomes large.
For the embodiment of the present application, stereo material is adopted for the ambient sound, a choice arrived at by testing. Specifically, two approaches were tested:
(1) downloading ambient sound in B-format and decoding it into binaural stereo using the Ambisonic Toolkit (ATK) and the FB360 plug-in, where B-format is the sound standard format of Ambisonics and FB360 is an audio plug-in from Facebook;
(2) taking mono and stereo ambient sound, encoding and converting it with the ATK and FB360 plug-ins, and then decoding it into binaural stereo.
With both methods, the resulting ambient sound sensation certainly differs from direct stereo, but the surround sensation is mediocre. First, four-track B-format sound is not a particularly good recording mode for representing ambient sound localization, because such a recording is essentially a three-dimensional Mid/Side (MS) recording and ignores the sound intensity difference; an ambient recording mode based on the quadraphonic (Quad) layout is better.
Moreover, the Ambisonics approach is quite popular in VR sound, mainly because sound-field rotation lets it combine conveniently with head rotation; at the same time it can be used for ambient sound as a sound-field reconstruction format that blurs point sound sources, so that when the head rotates the whole sound field rotates with it, saving resource consumption. In a game scene, however, the sound changes little when moving toward or away from a placed or recorded point sound source.
In the embodiment of the present application, though, these advantages of Ambisonics are not really needed, because the ambient sound is often used only as a transitional expression; therefore stereo material is used as the ambient sound.
Step S103: performing sound mixing processing on the audio information recorded by the human head microphone and the processed audio information.
For the embodiment of the present application, the audio information after the sound mixing processing is output through earphones.
Compared with the prior art, in which audio information synthesized with a video is processed by Ambisonics reproduction, the audio processing method provided by the embodiment of the present application acquires the audio information to be processed and the audio information recorded by a human head microphone, determines the preset type of audio information from the audio information to be processed, processes the preset type of audio information through a preset plug-in, and performs sound mixing processing on the audio information recorded by the human head microphone and the processed audio information. In other words, the audio information belonging to the preset type is processed through the preset plug-in and then synthesized with the audio information recorded through the human head microphone. Because both recording through the human head microphone and processing through the preset plug-in improve the spatial localization effect of the audio information, the sense of sound localization and the sense of space of the audio information can be improved, which in turn improves the user's listening experience, especially when watching videos.
In another possible implementation manner of the embodiment of the present application, the preset plug-in is a head-related transfer function (HRTF) plug-in.
Weighing the advantages and disadvantages of human head microphone recording against plug-in post-processing, recording with the human head microphone cannot be used extensively throughout a work, but the unique close-to-ear performance of human head recording is a highlight, so several passages recorded with the human head microphone are arranged in the recorded works (such as mobile game CG, VR game CG, and motion comics).
For the embodiment of the present application, when determining the plug-in used in post-production, we compared several mainstream plug-ins. Because we target audio information for video content, we considered plug-ins that can be used in a digital audio workstation (DAW), such as DearVR, Oculus, Ambipan, and FB360, and finally decided to use DearVR to process the preset type of audio information for the following reasons, as shown in fig. 7. The embodiment of the present application is mainly described with the DearVR plug-in, but is not limited to it.
DearVR has the following advantages as a plug-in: (1) DearVR integrates reverberation and early reflections inside the plug-in, with adjustable Damping and Gain, and the reverberation can also select the space type;
(2) choice of output mode: DearVR's output is more flexible, supporting more than just binaural output;
(3) it is better than Oculus in terms of sound automation settings and, as the number of sound sources grows, supports displaying single-track sound sources; the Oculus plug-in inside a DAW is positioned as a test plug-in for a game engine, and its experience within the DAW is not particularly good.
For the embodiment of the present application, when DearVR is selected as the preset plug-in, the processing of the preset type of audio information through the preset plug-in in step S102 may specifically include processing the preset type of audio information through the DearVR plug-in as follows:
(1) A Cartesian model (Cartesian Mode) is selected, whose coordinate system is shown in fig. 8. As shown in fig. 9, the white dot in the region represents a sound source; values can be set by dragging the mouse or by typing into the input boxes corresponding to X, Y, and Z, for example -9.58 in the X box, 0.00 in the Y box, and 6.33 in the Z box, where the X-, Y-, and Z-axis directions in fig. 9 correspond to the X, Y, and Z directions of the coordinate system in fig. 8. The three-dimensional position of a sound is determined by entering these values. Further, for a sound object without a fixed position, the motion of the sound object is set through automation. For example, as shown in fig. 10, X, Y, and Z represent the distance relationship between a sound and the listener: the sound is 2.48 meters (m) from the listener in the X direction, 0.47 m in the Y direction, and 0.00 m in the Z direction;
(2) The volume gain of the input sound source is adjusted (here, the gain does not affect the subsequent early reflections and reverberation), so the volume balance between sounds can be adjusted without affecting the spatial hearing; specifically, the GAIN button shown in fig. 11 is adjusted to set the volume of the input sound source.
(3) Use of early reflections: the reflection module of DearVR generates early reflections according to the position of the sound source; as the sound object moves, the reflection pattern also adapts to the signal change in real time, for example real-time changes can be produced on six surfaces (left, front, right, top, back, and bottom).
The use of early reflections in step (3) may specifically include: a. adjusting the distances between the six surfaces (left, front, right, top, back, and bottom) and the listener, in the manner shown in fig. 12; b. adjusting the low-pass filter of the early reflections, to avoid the rather harsh high-frequency sound that early reflections can otherwise generate; the frequency band to optimize is 500 Hz to 19999 Hz, and this optimization is performed by adjusting the DAMPING button in fig. 13; c. real-time change selection: the real-time sound field auditory simulation (REALTIME AURALISATION) button shown in fig. 14 reflects changes in the relative position of the sound object and the wall surfaces, letting the listener hear the correspondingly generated early reflections in real time. Real-time changes in the early reflections enhance the three-dimensional localization of the sound source.
(4) Reverberation processing: reverberation is loaded onto the sound to enhance the sense of space. For example, as shown in fig. 15, in the reverberation (REVERB) module of the DearVR plug-in, a Cinema effect space can be selected under the real-time auditory simulation options; the space size can be set by adjusting Size, and the low-pass filter can be controlled by adjusting DAMPING.
For the embodiment of the present application, after the 3D audio information has been processed by DearVR, the audio information is output in a preset mode. Specifically, for 3D audio played back through earphones, two-channel Binaural output is selected, as shown in fig. 16a; for 3D audio played back through loudspeakers, 2.0 Stereo is chosen, as shown in fig. 16b. The only difference between the two is whether the HRTF parameters are loaded.
In another possible implementation manner of the embodiment of the present application, before step S101 the method further includes step Sa (not shown in the figure) and step Sb (not shown in the figure), wherein,
and step Sa, in the audio information recording process, determining the microphone currently used for recording based on the distance between the sound source and each microphone.
For the embodiment of the application, in the design, the sound which needs to be listened to by a listener in a close distance is recorded by using a human head microphone; in the actual recording process, the planned dubbing actors dub by taking the head microphone as the center and taking 3-5 meters (m) as the radius, and simultaneously recording circular sounds with the distance between a part of dubbing actors and the head microphone being less than 10cm as the radius as exaggerated sound expression to achieve a certain ASMR effect.
Step Sb: recording the corresponding audio information through the determined microphone.
In another possible implementation manner of the embodiment of the present application, steps Sa and Sb include step Sab1 (not shown in the figure) and step Sab2 (not shown in the figure), wherein,
and step Sab1, when the distance between the sound source and the head microphone is detected to meet a first preset condition, determining that the currently recorded microphone is the head microphone, and recording corresponding audio information through the head microphone.
For example, if the first preset condition is not greater than 5m, recording of audio information by the head microphone is performed when it is detected that the distance between the sound source and the head microphone is not greater than 5 m.
And step Sab2, when the distance between the sound source and the capacitor microphone is detected to meet a second preset condition, determining that the microphone currently used for recording is the capacitor microphone, and recording corresponding audio information through the capacitor microphone.
For example, if the second preset condition is greater than 5m, recording audio information through the condenser microphone when it is detected that the distance between the sound source and the condenser microphone is greater than 5 m.
For the embodiment of the present application, both recording through the human head microphone and processing the 3D class of audio information through the preset plug-in serve to highlight 3D sound. In terms of sound expression, 3D sound can be highlighted by the following means:
(1) sounds occurring behind the body attract the listener's attention more, so many sound sources in the scene are arranged to appear behind the body;
(2) high-frequency sounds, such as the sounds of metal or high-heeled shoes, are more distinct in their representation of distance and localization;
(3) if the aim is specifically to evoke fear or a special hearing sensation in the listener, a sudden far-to-near (ideally right at the ear) change of the sound can be arranged;
(4) if a long-lasting sound needs to express direction and a sense of space, a circling or surrounding trajectory can be used (that is, the sound image moves front, back, left, and right), so that the listener hears the 3D expression more clearly; this usually requires a fairly balanced volume expression, similar to a piece of light music or a monologue with small dynamics;
(5) when using DearVR, if the trajectory change of a 3D sound is to be expressed, a larger distance change is usually chosen.
Another possible implementation manner of the embodiment of the present application may include, before step S103: performing frequency-band adjustment on the three-dimensional audio obtained by binaural synthesis (namely, the processed audio information) and the three-dimensional audio obtained by binaural reproduction (namely, the audio information recorded by the human head microphone), so as to achieve auditory matching.
Specifically, for a moving sound object, the three-dimensional audio obtained by binaural synthesis transitions into the three-dimensional audio obtained by binaural reproduction, or vice versa. Because the two are acquired in different ways, sounds of the same type need a certain amount of optimization so that the transition in listening sensation is smooth. Since the essence of the HRTF is an audio filter, we use an EQ effector for this transition optimization, while avoiding excessive use of the EQ effect, which would weaken the localization; the Frequency (FREQ), GAIN, and Q are adjusted to transition from the binaurally synthesized three-dimensional audio to the binaurally reproduced three-dimensional audio, or the other way around, as shown in fig. 17, where the spectral curves in the figure represent the degree of the transition.
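As a concrete illustration of such an EQ stage, the sketch below implements a standard peaking biquad using the well-known Audio EQ Cookbook formulas, whose FREQ, GAIN, and Q parameters match those named above; it is an assumed stand-in for this kind of effector, not the DearVR effector or the patent's actual EQ:

    import numpy as np
    from scipy.signal import lfilter

    def peaking_eq(x, fs, freq_hz, gain_db, q):
        # Audio EQ Cookbook peaking filter: boost or cut gain_db around
        # freq_hz, with bandwidth set by q; then filter the signal x.
        a_lin = 10.0 ** (gain_db / 40.0)
        w0 = 2.0 * np.pi * freq_hz / fs
        alpha = np.sin(w0) / (2.0 * q)
        b = np.array([1 + alpha * a_lin, -2 * np.cos(w0), 1 - alpha * a_lin])
        a = np.array([1 + alpha / a_lin, -2 * np.cos(w0), 1 - alpha / a_lin])
        return lfilter(b / a[0], a / a[0], x)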
For the embodiment of the present application, step S103 performs the sound mixing processing on the audio information recorded by the human head microphone and the processed audio information; the audio being mixed therefore includes both three-dimensional and non-three-dimensional audio information. During mixing, the volume of the non-three-dimensional audio is mainly reduced in the passages that highlight the three-dimensional sound effect, and a compression effector is used to adjust the volume dynamically. As shown in fig. 18, when a non-three-dimensional track receives the volume of a three-dimensional track, it changes according to the settings of the compression effector and the volume; the spectrum in fig. 18 represents the change in volume.
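Ducking the non-3D tracks under a compressor driven by the 3D tracks is essentially sidechain compression. A minimal sketch of that idea follows, assuming floating-point sample arrays and made-up threshold/ratio values; it is not the patent's effector:

    import numpy as np

    def duck_non3d(non3d, three_d, threshold=0.1, ratio=4.0, smooth=0.999):
        # Follow the 3D track's level with a one-pole envelope smoother;
        # when it exceeds the threshold, compress the non-3D track's gain
        # so the 3D material stays in the foreground.
        env = 0.0
        out = np.empty_like(non3d)
        for i, s in enumerate(three_d):
            env = smooth * env + (1.0 - smooth) * abs(s)
            gain = 1.0
            if env > threshold:
                gain = (threshold + (env - threshold) / ratio) / env
            out[i] = non3d[i] * gain
        return out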
In another possible implementation manner of the embodiment of the present application, step S103 may specifically include step S1031 (not shown in the figure), wherein,
Step S1031: performing the sound mixing processing on the audio information recorded by the human head microphone and the processed audio information in a linear superposition manner.
For the embodiment of the present application, step S1031 may specifically be: performing the sound mixing processing on the audio information recorded by the human head microphone and the processed audio information by linear superposition with averaging; the sound mixing processing may also be performed through steps S10311 to S10314, which are described in detail below and not repeated here.
For the embodiment of the present application, to avoid distortion after linear superposition, the result of the linear summation is averaged: if N sounds are mixed, the summation result is divided by N, which is equivalent to multiplying each sample by a weight coefficient of 1/N. This effectively avoids the distortion problem.
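A minimal sketch of this averaging mix (Python/NumPy, with hypothetical names; the tracks are assumed to be equal-length floating-point arrays):

    import numpy as np

    def mix_average(tracks):
        # Summing N tracks and dividing by N equals weighting each sample
        # by 1/N, keeping the mix inside the valid range (no overflow).
        stacked = np.asarray(tracks, dtype=np.float64)  # shape (N, samples)
        return stacked.sum(axis=0) / len(stacked)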
In another possible implementation manner of the embodiment of the application, step S1031 may specifically include step S10311 (not shown in the figure), step S10312 (not shown in the figure), step S10313 (not shown in the figure), and step S10314 (not shown in the figure), wherein,
and step S10311, linearly superposing the audio information recorded by the human head microphone and the processed audio information.
And step S10312, dividing the linearly superposed sound mixing signal into at least two sound mixing signal intensity intervals according to the audio intensity.
For the embodiment of the present application, step S10312 may specifically include: and determining signals of the mixed sound signals in different audio intensity distribution intervals as at least two mixed sound signal intensity intervals according to a plurality of audio intensity distribution intervals with equal lengths which are divided in advance.
Among the pre-divided audio intensity distribution intervals of equal length, the nth audio intensity distribution interval is:
[(n-1) × 2^(Q-1), n × 2^(Q-1)], where n ≥ 1 and Q is a preset constant.
Step S10313: performing audio intensity contraction on each mixed-signal intensity interval using the corresponding contraction ratio.
The contraction ratio adopted for each mixed-signal intensity interval is inversely related to the audio intensity corresponding to that interval.
For the embodiment of the present application, step S10313 may include: the contraction ratio corresponding to the mixed-signal intensity interval falling in the nth audio intensity distribution interval is ((k-1)/k) × (1/k)^n, where k is a predetermined shrinkage factor.
For the embodiment of the present application, because low- and medium-intensity signals occur with higher probability in a speech signal than high-intensity signals, different contraction schemes can be adopted for them; that is, the linearly superposed mixed audio signal is compressed region by region. Lower-intensity signals use a larger contraction ratio, which preserves their identifiability while still contracting them; high-intensity signals use a smaller contraction ratio, which ensures that no audio signal overflow occurs while still retaining a certain identifiability. The contraction ratio is the ratio between the contracted signal intensity and the original signal intensity; for example, if the original intensity is 100 and the contraction ratio is 50%, the contracted intensity is 50.
For example, with the nth audio intensity distribution interval divided as [(n-1) × 2^(Q-1), n × 2^(Q-1)] and the linearly superposed mixed signal divided into several intensity-interval signals, the contraction ratio corresponding to the mixed-signal intensity interval falling in the nth audio intensity distribution interval is ((k-1)/k) × (1/k)^n, where k is a predetermined shrinkage factor, typically a multiple of 2 such as 8 or 16. In a preferred embodiment, k = 8 and Q = 16.
Step S10314: superposing the at least two mixed-signal intensity intervals after the audio intensity contraction.
For the embodiment of the present application, with this sound mixing processing method, the linearly superposed mixed signal is partitioned by intensity, and different contraction ratios are then applied to the different mixed-signal intensity intervals, thereby avoiding overflow distortion.
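The sketch below gives one plausible reading of steps S10311 to S10314 as a piecewise companding curve (Python/NumPy, with k = 8 and Q = 16 as in the preferred embodiment); the per-sample loop and the interpretation of "superposing the contracted intervals" are assumptions for illustration, not the patent's exact implementation:

    import numpy as np

    def contraction_mix(tracks, k=8, Q=16):
        # Linearly superpose the tracks, then shrink each intensity interval
        # [(n-1)*2**(Q-1), n*2**(Q-1)) by ((k-1)/k) * (1/k)**n; each sample
        # keeps the contracted widths of all full intervals below its own
        # interval, plus its contracted remainder within that interval.
        mixed = np.asarray(tracks, dtype=np.float64).sum(axis=0)
        width = 2.0 ** (Q - 1)
        ratio = lambda n: ((k - 1) / k) * (1.0 / k) ** n
        out = np.empty_like(mixed)
        for i, s in enumerate(mixed):
            mag = abs(s)
            n = int(mag // width) + 1  # interval index of this sample, n >= 1
            shrunk = sum(width * ratio(m) for m in range(1, n))
            shrunk += (mag - (n - 1) * width) * ratio(n)
            out[i] = np.sign(s) * shrunk
        return out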
For the embodiment of the present application, step S103 may further include: sending the audio information recorded by the human head microphone and the processed audio information to a terminal device, where the terminal device performs mixing and decoding through target players equal in number to the pieces of audio information, the pieces of audio information being in the target format supported by the target players used for the mixing and decoding.
The target format may be the streaming-media FLASH VIDEO (FLV) format.
For the embodiment of the application, for volume adjustment during sound mixing processing, the volume of non-3D audio information is reduced for a sound part needing to highlight a 3D sound effect. Specifically, the volume is dynamically adjusted by using the compression effector.
In another possible implementation manner of the embodiment of the present application, step S103 may be followed by step S104 (not shown in the figure), wherein,
and step S104, synthesizing the audio information after the sound mixing processing and the video information to be synthesized.
For the embodiment of the application, the audio information after the audio mixing process is synthesized with the video information to be synthesized to obtain the multimedia information for outputting, such as a sound cartoon and the like.
In another possible implementation manner of the embodiment of the present application, the step S104 may specifically include a step S1041 (not shown in the figure) and a step S1042 (not shown in the figure), wherein,
step S1041, respectively encoding the audio information after the audio mixing process and the video information to be synthesized, to obtain the audio information after the encoding process and the video information after the encoding process.
Step S1042, the audio information after encoding processing and the video information after encoding processing are synthesized.
For the embodiment of the present application, step S1041 may further include: interleaving the audio information after the sound mixing processing and the video information to be synthesized, based on the video frame rate of the video information to be synthesized, to form a pre-encoding interleaving queue.
For the embodiment of the present application, interleaving the mixed audio information and the video information to be synthesized before encoding, to form the pre-encoding interleaving queue, keeps the audio and the video synchronized in the played media file. As shown in fig. 19, in the pre-encoding interleaving queue, video frames Vi and audio frames Ai are arranged alternately in sequence, and any video frame Vi has its corresponding audio frame Ai; for example, the video frame V2 has its corresponding audio frame A2.
For the embodiment of the present application, interleaving is performed according to formula (1) to obtain the pre-encoding interleaving queue:
nBitA = nChannel × nSampleRate × nBit × (1 / nFramerate) / 8    (1)
Formula (1) calculates the number of bytes nBitA contained in the frame of audio information Ai corresponding to any frame of video information Vi in the pre-encoding interleaving queue, where nChannel is the number of channels of the mixed audio information, nSampleRate is its sampling rate, nBit is its quantization bit depth, and nFramerate is the video frame rate of the video information to be synthesized. For example, assume the video frame rate nFramerate of the video information to be synthesized is 30 frames per second (other parameters are not considered), and the mixed audio parameters are nChannel = 2 channels, nSampleRate = 48000 Hz, and nBit = 24 bits; then by formula (1) the number of bytes in each frame of audio information Ai is nBitA = 2 × 48000 × 24 × (1/30) / 8 = 9600 bytes.
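A quick sketch of formula (1) and the worked example above (Python; the names are direct transliterations of the formula's symbols):

    def bytes_per_audio_frame(n_channel, sample_rate, n_bit, frame_rate):
        # Formula (1): bytes of mixed audio paired with one video frame.
        return n_channel * sample_rate * n_bit * (1.0 / frame_rate) / 8.0

    # Worked example from the text: 2 channels, 48000 Hz, 24-bit, 30 fps.
    print(bytes_per_audio_frame(2, 48000, 24, 30))  # -> 9600.0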
In another possible implementation manner of the embodiment of the present application, after the step S1041, the method may further include: step Sc (not shown) and step Sd (not shown), wherein,
Step Sc: determining a video frame rate corresponding to the encoded video information.
Step Sd: interleaving the encoded audio information and the encoded video information based on the video frame rate corresponding to the encoded video information, to obtain an encoded interleaving queue.
For the embodiment of the present application, step Sc and step Sd may include: counting, in the encoded audio-video queue, the number of bytes consumed by each frame of encoded audio information and by each frame of encoded video information, so as to obtain the duration of each frame of encoded audio information and the duration of each frame of encoded video information respectively; and interleaving the encoded audio information and the encoded video information in the encoded audio-video queue based on these per-frame durations, to obtain the encoded interleaving queue.
The difference between the duration of any frame of encoded video information in the encoded interleaving queue and the duration of the frame of encoded audio information corresponding to that encoded video information is less than or equal to a preset threshold value.
For the embodiment of the application, interleaving the encoded audio information and the encoded video information avoids audio-video desynchronization in the synthesized file, thereby improving the user experience.
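As an illustration only (not part of the patent text), a Python sketch of duration-based interleaving, assuming the per-frame durations have already been derived from the bytes consumed by each encoded frame as described above:

```python
def post_encode_interleave(video_frames, audio_frames, v_dur, a_dur):
    """Merge encoded packets by accumulated duration: always emit the stream
    that is currently behind, so the audio and video timelines never drift
    apart by more than one frame duration (the role of the preset threshold).

    video_frames / audio_frames: lists of encoded packets.
    v_dur / a_dur: duration in seconds of one encoded video / audio frame.
    """
    queue, vt, at, vi, ai = [], 0.0, 0.0, 0, 0
    while vi < len(video_frames) or ai < len(audio_frames):
        video_next = ai >= len(audio_frames) or (
            vi < len(video_frames) and vt <= at)
        if video_next:
            queue.append(video_frames[vi])   # video is behind: emit Vi
            vt += v_dur
            vi += 1
        else:
            queue.append(audio_frames[ai])   # audio is behind: emit Ai
            at += a_dur
            ai += 1
    return queue
```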
In another possible implementation manner of the embodiment of the present application, step S1042 specifically includes: step S10421 (not shown), in which,
Step S10421: synthesizing the encoded interleaving queue.
For the embodiment of the present application, the synthesized multimedia information (including video information and audio information) may be encapsulated with the Motion JPEG (MJPEG) compression format and the MOV encapsulation format; as shown in fig. 20, the video codec is MJPEG and the container format is MOV.
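As an illustration only (not part of the patent text), one common way to obtain such an MJPEG-in-MOV file is the ffmpeg tool (our assumption; the patent names only the formats), invoked here from Python with hypothetical file names:

```python
import subprocess

# Wrap MJPEG video and 24-bit PCM audio in a MOV container with ffmpeg.
subprocess.run([
    "ffmpeg",
    "-i", "video_in.mov",    # hypothetical input video
    "-i", "audio_mix.wav",   # hypothetical mixed audio
    "-c:v", "mjpeg",         # Motion JPEG video codec
    "-c:a", "pcm_s24le",     # 24-bit little-endian PCM audio
    "-shortest",             # stop at the shorter of the two streams
    "out.mov",               # MOV encapsulation format
], check=True)
```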
For the embodiment of the present application, when the audio information is output after synthesis, the overall sound is output in a 48 kHz, 24-bit PCM, WAV format with stereo output; as shown in fig. 21, the sample rate is 48000 Hz, the output format is WAV, the bit depth is 24-bit pulse code modulation (PCM), and the channel layout is stereo.
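As an illustration only (not part of the patent text), a Python sketch that writes mixed samples in exactly this output format — 48 kHz, 24-bit PCM, stereo WAV — using the standard-library wave module; the helper name and the float-input convention are assumptions:

```python
import wave

import numpy as np

def write_wav_24bit_stereo(path, samples, rate=48000):
    """Write float samples of shape (n, 2) in [-1, 1] as 24-bit PCM stereo WAV."""
    ints = (np.clip(samples, -1.0, 1.0) * (2**23 - 1)).astype("<i4")
    # Keep the low 3 bytes of each little-endian int32 -> packed 24-bit PCM.
    pcm24 = ints.view(np.uint8).reshape(-1, 4)[:, :3].tobytes()
    with wave.open(path, "wb") as w:
        w.setnchannels(2)      # stereo
        w.setsampwidth(3)      # 3 bytes per sample = 24-bit
        w.setframerate(rate)   # 48 kHz
        w.writeframes(pcm24)
```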
The above embodiments may be applied to various fields, including but not limited to: mobile games, VR games, dynamic comics, and the like. More particularly,
taking a sound cartoon as an example, as shown in fig. 22: after the Intellectual Property (IP) authorization of the sound cartoon is obtained, the audio corresponding to the sound cartoon is designed and recomposed. The audio information to be synthesized for each frame of image is acquired (including at least one of the audio information of a dubbing actor recorded by a human head microphone, the audio information of the dubbing actor recorded by a condenser microphone, background music audio information, environmental sound information, and sound effect audio information). Then, at least one of the audio information of the dubbing actor recorded by the condenser microphone, the background music information, the environmental sound information, and the sound effect audio information is processed by the preset plug-in; the audio information processed by the plug-in and the audio information of the dubbing actor recorded by the human head microphone are subjected to sound mixing processing; and the audio information after sound mixing is synthesized with the frame of image, so as to complete the making of the sound cartoon.
For the embodiment of the application, the sound cartoon is taken as an example to describe the product side. A sound cartoon is a video that uses a plurality of static pictures, generates a dynamic switching effect between the pictures through special-effect processing, and combines this with audio expression. Specifically,
the audio information of a sound cartoon is designed by the following means:
(1) reasonably planning the correspondence between the moving direction of the sound source and the static picture;
(2) ensuring the continuity and rationality of sound movement during picture switching and scene switching;
(3) carefully designing near-ear sounds so that the auditory sensation matches that of the real world, evolving the traditional sound expression into a sound expression with a sense of presence;
(4) in terms of sound types, avoiding voice-over and monologue as the means of pushing the story forward, and instead converting to dialogue to reproduce the scene content of the story.
With the above processing and design of the audio information, a sound cartoon product can achieve the following effects:
(1) the story is expressed by restoring real scenes: it is told through dialogue within the scenes of the characters, and the sound content is enriched with action sounds and environmental sounds; similar to cinematic sound expression, a more realistic auditory experience is restored, and surreal artistic sound expression is used only in certain specific passages;
(2) with earphone listening as the medium and immersive sound as the means of expression, the audience experiences the story as if personally on the scene; the traditional voice-over and monologue mode is avoided, and voice elements at realistic distances are used, providing a near-ear auditory experience and a sound-localization experience beyond those of traditional film and television;
(3) in terms of scene reproduction, besides enriching the sound expression elements, all sounds take 3D audio as the form of expression, with human head recording and HRTF plug-ins as the production means, so that the sound expression is no longer flat and 2D, and the 3D sound expression further achieves the design goal of real scene sound.
As shown in fig. 2, the audio processing apparatus 20 according to the embodiment of the present application may include: an obtaining module 21, a first determining module 22, a plug-in processing module 23, a sound mixing processing module 24, and a synthesizing module 25, wherein,
an obtaining module 21, configured to obtain audio information to be processed and audio information recorded through a human head microphone;
a first determining module 22, configured to determine a preset type of audio information from the audio information to be processed acquired by the obtaining module 21;
a plug-in processing module 23, configured to process, through a preset plug-in, the preset type of audio information determined by the first determining module 22; and
a sound mixing processing module 24, configured to perform sound mixing processing on the audio information recorded through the human head microphone and acquired by the obtaining module 21 and the audio information processed by the plug-in processing module 23.
The embodiment of the application provides an audio processing apparatus. Compared with the prior art, in which the audio information to be synthesized with a video is processed by means of Ambisonics, the embodiment of the application obtains the audio information to be processed and the audio information recorded through a human head microphone, determines the preset type of audio information from the audio information to be processed, processes the preset type of audio information through the preset plug-in, and then performs sound mixing processing on the audio information recorded through the human head microphone and the processed audio information. That is, the audio information belonging to the preset type is processed through the preset plug-in and then synthesized with the audio information recorded through the human head microphone. Since both recording through a human head microphone and processing through the preset plug-in improve the spatial localization effect of audio information, the sound localization sense and spatial sense of the audio information can be improved, and the user's auditory experience, especially when watching videos, can be further improved.
The audio processing apparatus of the embodiment of the present application can execute the audio processing method provided by the foregoing method embodiment, and the implementation principles thereof are similar, and are not described herein again.
As shown in fig. 3, the audio processing apparatus 30 according to an embodiment of the present application may include: an obtaining module 31, a first determining module 32, a plug-in processing module 33, and a sound mixing processing module 34, wherein,
an obtaining module 31, configured to obtain audio information to be processed and audio information recorded through a human head microphone.
The obtaining module 31 in fig. 3 has the same or similar function as the obtaining module 21 in fig. 2.
A first determining module 32, configured to determine a preset type of audio information from the audio information to be processed acquired by the obtaining module 31.
The first determining module 32 in fig. 3 has the same or similar function as the first determining module 22 in fig. 2.
A plug-in processing module 33, configured to process, through a preset plug-in, the preset type of audio information determined by the first determining module 32.
The plug-in processing module 33 in fig. 3 has the same or similar function as the plug-in processing module 23 in fig. 2.
A sound mixing processing module 34, configured to perform sound mixing processing on the audio information recorded through the human head microphone and acquired by the obtaining module 31 and the audio information processed by the plug-in processing module 33.
The sound mixing processing module 34 in fig. 3 has the same or similar function as the sound mixing processing module 24 in fig. 2.
In another possible implementation manner of the embodiment of the present application, the audio information to be processed includes at least one of the following:
ambient sound information; sound effect information; audio information recorded by a condenser microphone; background music information.
Further, as shown in fig. 3, the apparatus 30 further includes: a second determination module 36, a recording module 37, wherein,
a second determining module 36, configured to determine, during the recording of audio information, the microphone currently used for recording based on the distances between the sound source and each microphone.
For the embodiment of the present application, the second determining module 36 and the first determining module 32 may be the same determining module, or may be two different determining modules; this is not limited in the embodiment of the present application.
A recording module 37, configured to record the corresponding audio information through the microphone determined by the second determining module 36.
In a possible implementation manner of the embodiment of the present application, the second determining module 36 is specifically configured to determine that the microphone currently used for recording is a human head microphone when it is detected that a distance between a sound source and the human head microphone satisfies a first preset condition.
The recording module 37 is specifically configured to record corresponding audio information through the head microphone determined by the second determining module 36.
The second determining module 36 is specifically configured to determine that the microphone currently used for recording is a condenser microphone when it is detected that the distance between the sound source and the condenser microphone satisfies a second preset condition.
The recording module 37 is specifically configured to record corresponding audio information through the condenser microphone determined by the second determining module 36.
In another possible implementation manner of the embodiment of the present application, the sound mixing processing module 34 is specifically configured to perform sound mixing processing on the audio information recorded by the human head microphone and the processed audio information in a linear superposition manner.
In another possible implementation manner of the embodiment of the present application, as shown in fig. 3, the sound mixing processing module 34 includes: a superimposing unit 341, a dividing unit 342, and an audio intensity contracting unit 343, wherein,
the superimposing unit 341 is configured to linearly superimpose the audio information recorded through the human head microphone and the processed audio information;
the dividing unit 342 is configured to divide the mixed signal linearly superimposed by the superimposing unit 341 into at least two sound mixing signal intensity intervals according to audio intensity; and
the audio intensity contracting unit 343 is configured to perform audio intensity contraction on each sound mixing signal intensity interval divided by the dividing unit 342, using the corresponding contraction proportion.
The superimposing unit 341 is further configured to superimpose the at least two sound mixing signal intensity intervals subjected to audio intensity contraction by the audio intensity contracting unit 343.
The contraction proportion adopted for each sound mixing signal intensity interval is inversely proportional to the audio intensity corresponding to that sound mixing signal intensity interval.
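As an illustration only (not part of the patent text), a Python sketch of this segmented mixing; the interval edges and contraction ratios are our assumptions, chosen only so that the ratios decrease as the audio intensity grows, as described above:

```python
import numpy as np

def mix_with_contraction(head_mic, processed,
                         edges=(0.5, 1.0, 2.0), ratios=(1.0, 0.6, 0.2)):
    """Linear superposition followed by per-interval intensity contraction.

    head_mic / processed: float arrays in [-1, 1] at the same sample rate.
    edges / ratios are illustrative: louder intervals get smaller contraction
    ratios, so the superposed result stays within [-1, 1] without hard clipping.
    """
    mixed = head_mic + processed                   # linear superposition
    mag = np.abs(mixed)
    out = np.zeros_like(mixed)
    prev = 0.0
    for edge, ratio in zip(edges, ratios):
        # portion of each sample's magnitude falling in the interval [prev, edge)
        portion = np.clip(mag, prev, edge) - prev
        out += portion * ratio                     # contract this interval
        prev = edge
    return np.sign(mixed) * out                    # superpose contracted intervals
```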
Further, as shown in fig. 3, the apparatus 30 further includes: a synthesizing module 35, wherein
the synthesizing module 35 is configured to synthesize the audio information subjected to sound mixing processing by the sound mixing processing module 34 with the video information to be synthesized.
In another possible implementation manner of the embodiment of the present application, as shown in fig. 3, the synthesis module 35 includes: an encoding unit 351 and a synthesizing unit 352, wherein,
the encoding unit 351 is configured to encode the audio information after the audio mixing process and the video information to be synthesized respectively to obtain encoded audio information and encoded video information.
A synthesizing unit 352 for synthesizing the audio information encoded by the encoding unit 351 and the video information encoded by the encoding unit.
In another possible implementation manner of the embodiment of the present application, as shown in fig. 3, the apparatus 30 further includes: a third determining module 38, an interleaving module 39, wherein,
a third determining module 38, configured to determine a video frame rate corresponding to the encoded video information.
In the embodiment of the present application, the third determining module 38, the second determining module 36, and the first determining module 32 may all be the same determining module, may all be different determining modules, or any two of them may be the same determining module; this is not limited in the embodiment of the present application.
The third determining module 38, the second determining module 36, and the first determining module 32 are shown in fig. 3 as three different determining modules, but fig. 3 does not limit the implementation to this arrangement.
An interleaving module 39, configured to interleave the encoded audio information and the encoded video information based on the video frame rate corresponding to the encoded video information determined by the third determining module 38, so as to obtain an encoded interleaving queue;
the synthesizing module 35 is specifically configured to synthesize the interleaving queues after being encoded by the interleaving module 39.
In another possible implementation manner of the embodiment of the present application, the preset plug-in is a head-related transfer function (HRTF) plug-in.
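As an illustration only (not part of the patent text), the core filtering step of such an HRTF plug-in can be sketched as convolution with a pair of head-related impulse responses; the HRIR data and function names are assumptions, and a real plug-in would additionally interpolate filters as the source moves and generate early reflections:

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(mono, hrir_left, hrir_right):
    """Binaural rendering for one static source direction.

    mono: 1-D source signal; hrir_left / hrir_right: head-related impulse
    responses for that direction (obtaining them is outside this sketch).
    """
    left = fftconvolve(mono, hrir_left)       # filter for the left ear
    right = fftconvolve(mono, hrir_right)     # filter for the right ear
    return np.stack([left, right], axis=-1)   # (n, 2) binaural signal
```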
The embodiment of the application provides another audio processing apparatus. Compared with the prior art, in which the audio information to be synthesized with a video is processed by means of Ambisonics, the embodiment of the application obtains the audio information to be processed and the audio information recorded through a human head microphone, determines the preset type of audio information from the audio information to be processed, processes the preset type of audio information through the preset plug-in, and performs sound mixing processing on the audio information recorded through the human head microphone and the processed audio information. That is, the audio information belonging to the preset type is processed through the preset plug-in and then synthesized with the audio information recorded through the human head microphone. Since both recording through a human head microphone and processing through the preset plug-in improve the spatial localization effect of audio information, the sound localization sense and spatial sense of the audio information can be improved, and the user's auditory experience, especially when watching videos, can be further improved.
The audio processing apparatus according to the embodiment of the present application can execute the audio processing method shown in the foregoing method embodiment, and the implementation principles thereof are similar and will not be described herein again.
An embodiment of the present application provides an electronic device. As shown in fig. 4, the electronic device 4000 includes: a processor 4001 and a memory 4003. The processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. Note that, in practical applications, the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiment of the present application.
The processor 4001 is applied in the embodiment of the present application to implement the functions of the obtaining module, the first determining module, the plug-in processing module, and the sound mixing processing module shown in fig. 2 or fig. 3, and/or the functions of the synthesizing module, the second determining module, the recording module, the third determining module, and the interleaving module shown in fig. 3. The transceiver 4004 comprises a receiver and a transmitter and is used in the embodiment of the present application for information interaction with other electronic devices.
The processor 4001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that performs computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path that carries information between the aforementioned components. Bus 4002 may be a PCI bus, EISA bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
Memory 4003 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. The processor 4001 is used to execute application code stored in the memory 4003 to implement the actions of the apparatus for audio processing provided by the embodiment shown in fig. 2 or fig. 3.
The embodiment of the application provides an electronic device. Compared with the prior art, in which the audio information to be synthesized with a video is processed by means of Ambisonics, the embodiment of the application obtains the audio information to be processed and the audio information recorded through a human head microphone, determines the preset type of audio information from the audio information to be processed, processes the preset type of audio information through the preset plug-in, and then performs sound mixing processing on the audio information recorded through the human head microphone and the processed audio information. That is, the audio information belonging to the preset type is processed through the preset plug-in and then synthesized with the audio information recorded through the human head microphone. Since both recording through a human head microphone and processing through the preset plug-in improve the spatial localization effect of audio information, the sound localization sense and spatial sense of the audio information can be improved, and the user's auditory experience, especially when watching videos, can be further improved.
The present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method of audio processing described in the above method embodiments.
The computer-readable storage medium provided by the embodiment of the present application, compared with the prior art in which the audio information to be synthesized with a video is processed by means of Ambisonics, obtains the audio information to be processed and the audio information recorded through a human head microphone, determines the preset type of audio information from the audio information to be processed, processes the preset type of audio information through the preset plug-in, and performs sound mixing processing on the audio information recorded through the human head microphone and the processed audio information. That is, the audio information belonging to the preset type is processed through the preset plug-in and then synthesized with the audio information recorded through the human head microphone. Since both recording through a human head microphone and processing through the preset plug-in improve the spatial localization effect of audio information, the sound localization sense and spatial sense of the audio information can be improved, and the user's auditory experience, especially when watching videos, can be further improved.
The computer-readable storage medium provided by the embodiment of the application is applicable to any of the foregoing method embodiments, and details are not described herein again.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and their order of execution is not necessarily sequential; they may be performed in turns or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.

Claims (11)

1. A method of audio processing, comprising:
acquiring audio information to be processed and audio information recorded by a human head microphone;
determining preset type audio information from audio information to be processed, and processing the preset type audio information through a preset plug-in; the processing of the audio information of the preset type through the preset plug-in comprises the following steps: determining a three-dimensional position of a sound source based on a point sound source, generating early reflections according to the three-dimensional position of the sound source; the early reflections change in real time with changes in the three-dimensional position of the sound source; the preset type of audio information comprises 3D type of audio information; the 3D audio information comprises at least one of dialogue audio information and action sound information;
carrying out sound mixing processing on the audio information recorded by the human head microphone and the processed audio information; wherein the sound mixing process includes reducing a volume of non-3D-like audio information; the non-3D audio information comprises at least one of environmental sound information, background music information and special effect audio information;
respectively encoding the audio information after the audio mixing processing and the video information to be synthesized to obtain the audio information after the encoding processing and the video information after the encoding processing;
and synthesizing the audio information after the coding processing and the video information after the coding processing.
2. The method of claim 1, wherein the audio information to be processed comprises at least one of:
ambient sound information; sound effect information; audio information recorded by a condenser microphone; background music information.
3. The method of claim 1 or 2, wherein before the obtaining of the audio information to be processed and the audio information recorded by the human head microphone, the method further comprises:
in the audio information recording process, determining the microphones used for current recording based on the distances between the sound source and the microphones;
and recording corresponding audio information through the determined microphone.
4. The method of claim 3, wherein the determining, based on the distances between the sound source and each microphone, of the microphone currently used for recording, and the recording of corresponding audio information through the determined microphone, comprise:
when it is detected that the distance between the sound source and a human head microphone satisfies a first preset condition, determining that the microphone currently used for recording is the human head microphone, and recording the corresponding audio information through the human head microphone;
when it is detected that the distance between the sound source and a condenser microphone satisfies a second preset condition, determining that the microphone currently used for recording is the condenser microphone, and recording the corresponding audio information through the condenser microphone.
5. The method of claim 1, wherein the sound mixing the audio information recorded by the human head microphone and the processed audio information comprises:
and carrying out sound mixing processing on the audio information recorded by the human head microphone and the processed audio information in a linear superposition mode.
6. The method according to claim 5, wherein the processing of sound mixing the audio information recorded by the human head microphone and the processed audio information by linear superposition comprises:
linearly superposing the audio information recorded by the human head microphone and the processed audio information;
dividing the linearly superposed audio mixing signal into at least two audio mixing signal intensity intervals according to the audio intensity;
respectively carrying out audio intensity contraction on each audio mixing signal intensity interval by adopting corresponding contraction proportions;
superposing the at least two sound mixing signal intensity intervals subjected to audio intensity contraction;
wherein the contraction proportion adopted for each sound mixing signal intensity interval is inversely proportional to the audio intensity corresponding to that sound mixing signal intensity interval.
7. The method according to claim 1, wherein after the audio information after the sound mixing processing and the video information to be synthesized are respectively encoded to obtain the encoded audio information and the encoded video information, the method further comprises:
determining a video frame rate corresponding to the video information after the encoding processing;
interleaving the audio information after the coding processing and the video information after the coding processing based on a video frame rate corresponding to the video information after the coding processing to obtain an interleaving queue after the coding;
the synthesizing the audio information after the encoding processing and the video information after the encoding processing includes:
and synthesizing the coded interleaving queues.
8. The method of claim 1, wherein the preset plug-in is a head-related transfer function (HRTF) plug-in.
9. An apparatus for audio processing, comprising:
the acquisition module is used for acquiring audio information to be processed and audio information recorded by a human head microphone;
the first determining module is used for determining audio information of a preset type from the audio information to be processed acquired by the acquiring module;
the plug-in processing module is used for processing the audio information of the preset type determined by the first determining module through a preset plug-in; the processing of the audio information of the preset type through the preset plug-in comprises the following steps: determining a three-dimensional position of a sound source based on a point sound source, generating early reflections according to the three-dimensional position of the sound source; the early reflections change in real time with changes in the three-dimensional position of the sound source; the preset type of audio information comprises 3D type of audio information; the 3D audio information comprises at least one of dialogue audio information and action sound information;
the sound mixing processing module is used for carrying out sound mixing processing on the audio information recorded by the human head microphone and the audio information processed by the plug-in processing module; wherein the sound mixing process includes reducing a volume of non-3D-like audio information; the non-3D audio information comprises at least one of environmental sound information, background music information and special effect audio information;
respectively encoding the audio information after the audio mixing processing and the video information to be synthesized to obtain the audio information after the encoding processing and the video information after the encoding processing;
and synthesizing the audio information after the coding processing and the video information after the coding processing.
10. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to: method of performing audio processing according to any of claims 1 to 8.
11. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of audio processing of any one of claims 1 to 8.
CN201811400323.4A 2018-11-22 2018-11-22 Audio processing method and device, electronic equipment and computer readable storage medium Active CN109410912B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811400323.4A CN109410912B (en) 2018-11-22 2018-11-22 Audio processing method and device, electronic equipment and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN109410912A CN109410912A (en) 2019-03-01
CN109410912B true CN109410912B (en) 2021-12-10

Family

ID=65474610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811400323.4A Active CN109410912B (en) 2018-11-22 2018-11-22 Audio processing method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN109410912B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110225432B (en) * 2019-05-10 2021-08-31 中国船舶重工集团公司第七一五研究所 Stereo listening method for sonar target
CN113539279A (en) * 2020-04-16 2021-10-22 腾讯科技(深圳)有限公司 Audio data processing method and device and computer readable storage medium
CN113875265A (en) * 2020-04-20 2021-12-31 深圳市大疆创新科技有限公司 Audio signal processing method, audio processing device and recording equipment
CN111866664A (en) * 2020-07-20 2020-10-30 深圳市康冠商用科技有限公司 Audio processing method, device, equipment and computer readable storage medium
CN112530589A (en) * 2020-12-01 2021-03-19 中国科学院深圳先进技术研究院 Method, device and system for triggering ASMR, electronic equipment and storage medium
CN112951199B (en) * 2021-01-22 2024-02-06 杭州网易云音乐科技有限公司 Audio data generation method and device, data set construction method, medium and equipment
CN113971969B (en) * 2021-08-12 2023-03-24 荣耀终端有限公司 Recording method, device, terminal, medium and product

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102404573A (en) * 2011-11-28 2012-04-04 深圳市万兴软件有限公司 Method and device for synchronously processing audio and video
CN105263093A (en) * 2015-10-12 2016-01-20 深圳东方酷音信息技术有限公司 Omnibearing audio acquisition apparatus, omnibearing audio editing apparatus, and omnibearing audio acquisition and editing system
CN105719653A (en) * 2016-01-28 2016-06-29 腾讯科技(深圳)有限公司 Mixing processing method and device
CN106531177A (en) * 2016-12-07 2017-03-22 腾讯科技(深圳)有限公司 Audio treatment method, a mobile terminal and system
KR101725952B1 (en) * 2015-12-21 2017-04-11 서울대학교산학협력단 The method and system regarding down mix sound source of n chanel to optimized binaural sound source for user by using user's head related transfer function information
CN108777832A (en) * 2018-06-13 2018-11-09 上海艺瓣文化传播有限公司 A kind of real-time 3D sound fields structure and mixer system based on the video object tracking

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8787584B2 (en) * 2011-06-24 2014-07-22 Sony Corporation Audio metrics for head-related transfer function (HRTF) selection or adaptation

Also Published As

Publication number Publication date
CN109410912A (en) 2019-03-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant