WO2023216119A1 - Audio signal encoding method and apparatus, electronic device, and storage medium - Google Patents


Info

Publication number
WO2023216119A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
audio
target
scene
input format
Prior art date
Application number
PCT/CN2022/092082
Other languages
English (en)
Chinese (zh)
Inventor
高硕
Original Assignee
北京小米移动软件有限公司 (Beijing Xiaomi Mobile Software Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京小米移动软件有限公司
Priority to PCT/CN2022/092082
Priority to CN202280001342.8A
Publication of WO2023216119A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis

Definitions

  • the present disclosure relates to the field of communication technology, and in particular, to an audio signal encoding method, device, electronic equipment and storage medium.
  • users negotiate an audio format when establishing voice communication.
  • the negotiated audio format is always used to transmit the local user's audio signal in the audio format to the remote user.
  • the audio scene of the user's voice communication may change, and the audio signal in this audio format may not be able to provide the remote user with the real audio scene information of the local user in the changed audio scene, resulting in a poor user experience. This is a problem that urgently needs to be solved.
  • Embodiments of the present disclosure provide an audio signal encoding method, device, electronic device and storage medium, which enables a remote user to obtain audio scene information of the audio scene where the local user is located, thereby improving user experience.
  • embodiments of the present disclosure provide an audio signal encoding method, which includes: acquiring an audio signal; determining an audio scene type corresponding to the audio signal; determining a target input format audio signal according to the audio scene type and the audio signal; and encoding the target input format audio signal to generate a target encoding code stream.
  • the audio signal is obtained; the audio scene type corresponding to the audio signal is determined; the target input format audio signal is determined based on the audio scene type and audio signal; the target input format audio signal is encoded to generate the target encoding code stream.
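As a rough sketch of the four-step method above (not the patented implementation; every name here is a hypothetical placeholder), the scene analysis model, the format generator, and the encoding core can be wired together as interchangeable callables:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class EncodedStream:
    """Hypothetical container for the target encoding code stream."""
    scene_type: str
    payload: bytes

def encode_audio(
    samples: Sequence[float],
    classify_scene: Callable[[Sequence[float]], str],
    select_format: Callable[[str, Sequence[float]], object],
    encode: Callable[[object], bytes],
) -> EncodedStream:
    """Four steps from the text: acquire -> determine scene type ->
    determine target input format signal -> encode into a code stream."""
    scene_type = classify_scene(samples)                # scene analysis model
    target_signal = select_format(scene_type, samples)  # format generator
    payload = encode(target_signal)                     # encoding core(s)
    return EncodedStream(scene_type, payload)
```

The stub callables below only illustrate the data flow; a real encoder would plug in a trained scene model and codec cores here.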
  • determining the audio scene type corresponding to the audio signal includes: obtaining audio characteristic parameters of the audio signal; and inputting the audio characteristic parameters into an audio scene type analysis model to determine the audio scene type corresponding to the audio signal.
  • determining the target input format audio signal according to the audio scene type and the audio signal includes: determining the target audio signal input format according to the audio scene type and/or the audio signal; and determining the target input format audio signal according to the target audio signal input format and the audio signal.
  • Determining the target audio signal input format according to the audio scene type and the audio signal includes:
  • in response to the audio scene type indicating that the audio scene includes at least one main audio signal, determining the target audio signal input format to be an object-based signal input format;
  • in response to the audio scene type indicating that the audio scene includes at least one main audio signal and a background audio signal, determining the target audio signal input format to be an object-based signal input format and a scene-based signal input format;
  • in response to the audio scene type indicating that the audio scene includes only a background audio signal, determining the target audio signal input format to be a scene-based signal input format.
  • determining the target input format audio signal according to the target audio signal input format and the audio signal includes:
  • in response to determining that the target audio signal input format is an object-based signal input format, determining that the target input format audio signal is the object-based audio signal among the audio signals;
  • in response to determining that the target audio signal input format is an object-based signal input format and a scene-based signal input format, determining that the target input format audio signal is the object-based audio signal and the scene-based audio signal among the audio signals;
  • in response to determining that the target audio signal input format is a scene-based signal input format, determining that the target input format audio signal is the scene-based audio signal among the audio signals.
  • encoding the target input format audio signal to generate a target encoding code stream includes:
  • determining a target encoding core according to the target input format audio signal, or according to the audio scene type and the target input format audio signal; encoding the target input format audio signal according to the target encoding core to obtain encoding parameters; and performing code stream multiplexing according to the encoding parameters to generate the target encoding code stream.
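The code stream multiplexing step can be pictured with a toy framing scheme (purely illustrative; the tag values and layout are assumptions, not the IVAS bitstream format): each encoded sub-stream is written as a 1-byte format tag, a 4-byte big-endian length, then its payload.

```python
import struct

def multiplex(payloads):
    """Pack (tag, payload) pairs into one byte stream:
    tag (1 byte) + length (4 bytes, big-endian) + payload."""
    out = bytearray()
    for tag, data in payloads:
        out += struct.pack(">BI", tag, len(data))
        out += data
    return bytes(out)

def demultiplex(stream):
    """Inverse of multiplex(): recover the (tag, payload) pairs."""
    pairs, i = [], 0
    while i < len(stream):
        tag, n = struct.unpack_from(">BI", stream, i)
        i += 5  # header size of ">BI"
        pairs.append((tag, stream[i:i + n]))
        i += n
    return pairs
```

A decoder-side demultiplexer simply walks the headers back out, which is why the length prefix is included.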
  • determining the target signal encoding core according to the target input format audio signal includes:
  • in response to the target input format audio signal being an object-based audio signal, determining that the target encoding core of the object audio data in the object-based audio signal is an object audio data encoding core, and that the target encoding core of the metadata in the object-based audio signal is an object metadata parameter encoding core;
  • in response to the target input format audio signal being a scene-based audio signal, determining that the target encoding core is a scene audio data encoding core;
  • in response to the target input format audio signal being a channel-based audio signal, determining that the target encoding core is a channel audio data encoding core;
  • in response to the target input format audio signal being a MASA-based audio signal, determining that the target encoding core of the audio data in the MASA-based audio signal is a scene audio data encoding core, and that the target encoding core of the spatial auxiliary metadata in the MASA-based audio signal is a spatial auxiliary metadata parameter encoding core;
  • in response to the target input format audio signal being a mixed format audio signal, determining that the target encoding core is the encoding core selected for each format audio signal in the mixed format audio signal, wherein the mixed format signal includes at least two of an object-based audio signal, a scene-based audio signal, a channel-based audio signal, and a MASA-based audio signal.
  • determining the target encoding core according to the audio scene type and the target input format audio signal includes:
  • determining that the target encoding core of the object audio data is an object audio data encoding core, that the target encoding core of the metadata in the object-based audio signal is an object metadata parameter encoding core, and that the target encoding core of the scene-based audio signal is a scene audio data encoding core;
  • in response to the target input format audio signal being a scene-based audio signal, determining that the target encoding core of the scene-based audio signal is a scene audio data encoding core.
  • embodiments of the present disclosure provide an audio signal encoding device.
  • the audio signal encoding device includes: a signal acquisition unit configured to acquire an audio signal; a type determination unit configured to determine the audio scene type corresponding to the audio signal; a target signal determination unit configured to determine a target input format audio signal according to the audio scene type and the audio signal; and an encoding processing unit configured to encode the target input format audio signal to generate a target encoding code stream.
  • the type determination unit includes:
  • a parameter acquisition module configured to acquire audio characteristic parameters of the audio signal
  • a model processing module configured to input the audio feature parameters into an audio scene type analysis model to determine the audio scene type corresponding to the audio signal.
  • the target signal determination unit includes: a target format determination module configured to determine the target audio signal input format according to the audio scene type and/or the audio signal; and a target signal determination module configured to determine the target input format audio signal according to the target audio signal input format and the audio signal.
  • the target format determination module is specifically configured as:
  • in response to the audio scene type indicating that the audio scene includes at least one main audio signal, determining the target audio signal input format to be an object-based signal input format;
  • in response to the audio scene type indicating that the audio scene includes at least one main audio signal and a background audio signal, determining the target audio signal input format to be an object-based signal input format and a scene-based signal input format;
  • in response to the audio scene type indicating that the audio scene includes only a background audio signal, determining the target audio signal input format to be a scene-based signal input format.
  • the target signal determination module is specifically configured as:
  • in response to determining that the target audio signal input format is an object-based signal input format, determining that the target input format audio signal is the object-based audio signal among the audio signals;
  • in response to determining that the target audio signal input format is an object-based signal input format and a scene-based signal input format, determining that the target input format audio signal is the object-based audio signal and the scene-based audio signal among the audio signals;
  • in response to determining that the target audio signal input format is a scene-based signal input format, determining that the target input format audio signal is the scene-based audio signal among the audio signals.
  • the encoding processing unit includes: an encoding core determination module configured to determine a target encoding core according to the target input format audio signal, or according to the audio scene type and the target input format audio signal;
  • a parameter acquisition module configured to encode the target input format audio signal according to the target encoding core to obtain encoding parameters;
  • a code stream generation module configured to perform code stream multiplexing according to the encoding parameters to generate the target encoding code stream.
  • the encoding core determination module is specifically configured as:
  • in response to the target input format audio signal being an object-based audio signal, determining that the target encoding core of the object audio data in the object-based audio signal is an object audio data encoding core, and that the target encoding core of the metadata in the object-based audio signal is an object metadata parameter encoding core;
  • in response to the target input format audio signal being a scene-based audio signal, determining that the target encoding core is a scene audio data encoding core;
  • in response to the target input format audio signal being a channel-based audio signal, determining that the target encoding core is a channel audio data encoding core;
  • in response to the target input format audio signal being a MASA-based audio signal, determining that the target encoding core of the audio data in the MASA-based audio signal is a scene audio data encoding core, and that the target encoding core of the spatial auxiliary metadata in the MASA-based audio signal is a spatial auxiliary metadata parameter encoding core;
  • in response to the target input format audio signal being a mixed format audio signal, determining that the target encoding core is the encoding core selected for each format audio signal in the mixed format audio signal, wherein the mixed format signal includes at least two of an object-based audio signal, a scene-based audio signal, a channel-based audio signal, and a MASA-based audio signal.
  • the encoding core determination module is specifically configured as:
  • determining that the target encoding core of the object audio data is an object audio data encoding core, that the target encoding core of the metadata in the object-based audio signal is an object metadata parameter encoding core, and that the target encoding core of the scene-based audio signal is a scene audio data encoding core;
  • in response to the target input format audio signal being a scene-based audio signal, determining that the target encoding core of the scene-based audio signal is a scene audio data encoding core.
  • embodiments of the present disclosure provide an electronic device, which includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor so that the at least one processor can execute the method described in the first aspect.
  • embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the method described in the first aspect.
  • embodiments of the present disclosure provide a computer program product, including computer instructions, characterized in that, when executed by a processor, the computer instructions implement the method described in the first aspect.
  • Figure 1 is a flow chart of an audio signal encoding method provided by an embodiment of the present disclosure
  • Figure 2 is a flow chart of another audio signal encoding method provided by an embodiment of the present disclosure.
  • Figure 3 is a flow chart of yet another audio signal encoding method provided by an embodiment of the present disclosure.
  • Figure 4 is a flow chart of yet another audio signal encoding method provided by an embodiment of the present disclosure.
  • Figure 5 is a flow chart of yet another audio signal encoding method provided by an embodiment of the present disclosure.
  • Figure 6 is a structural diagram of an audio signal encoding device provided by an embodiment of the present disclosure.
  • Figure 7 is a structural diagram of a type determination unit in the audio signal encoding device provided by an embodiment of the present disclosure
  • Figure 8 is a structural diagram of a target signal determination unit in the audio signal encoding device provided by an embodiment of the present disclosure
  • Figure 9 is a structural diagram of a coding processing unit in the audio signal coding device provided by an embodiment of the present disclosure.
  • FIG. 10 is a structural diagram of an electronic device according to an embodiment of the present disclosure.
  • In the present disclosure, "at least one" can also be described as one or more, and "a plurality" can be two, three, four, or more; the present disclosure does not limit this.
  • When technical features are distinguished by "first", "second", "third", "A", "B", "C", "D", etc., the technical features so described are in no particular order.
  • each table in this disclosure can be configured or predefined.
  • the values of the information in each table are only examples and can be configured as other values, which is not limited by this disclosure.
  • it is not necessarily required to configure all the correspondences shown in each table.
  • the corresponding relationships shown in some rows may not be configured.
  • appropriate adjustments, such as splitting or merging, can be made based on the above tables.
  • the names of the parameters shown in the titles of the above tables may also be other names understandable by the communication device, and the values or expressions of the parameters may also be other values or expressions understandable by the communication device.
  • other data structures can also be used, such as arrays, queues, containers, stacks, linear lists, pointers, linked lists, trees, graphs, structures, classes, heaps, hash tables, etc.
  • the first generation of mobile communication technology is the first generation of wireless cellular technology and is an analog mobile communication network.
  • the 3G mobile communication system was proposed by the ITU (International Telecommunication Union) for international mobile communications in 2000.
  • 4G is a further improvement on 3G technology.
  • both data and voice use an all-IP approach to provide real-time HD+Voice services for voice and audio.
  • the EVS codec used can achieve high-quality compression of both voice and audio.
  • the voice and audio communication services described above have expanded from narrowband signals to ultra-wideband and even full-band services, but they are still monophonic services. People's demand for high-quality audio continues to increase: compared with monophonic audio, stereo audio has a sense of orientation and distribution for each sound source and improves clarity.
  • Three signal formats, including channel-based multi-channel audio signals, object-based audio signals, and scene-based audio signals, can provide three-dimensional audio services.
  • the IVAS codec for immersive voice and audio services being standardized by the Third Generation Partnership Project 3GPP SA4 can support the coding and decoding requirements of the above three signal formats.
  • Terminal devices that can support 3D audio services include mobile phones, computers, tablets, conference system equipment, augmented reality AR/virtual reality VR equipment, cars, etc.
  • the audio scene in which the local user is located may be constantly changing.
  • the local user communicates in real time with a remote user in a quiet and empty outdoor place.
  • the local user's terminal device chooses to use the mono signal format as the input signal format, which can well convey the local user's audio scene to the remote user.
  • a bird may fly over during a certain period of time, and the bird's call is an important audio element of the current audio scene.
  • the bird's call cannot be well transmitted to the remote user.
  • the local user's terminal device can analyze the audio scene in which the local user is located in real time, and use the obtained audio scene type to guide the audio signal generator to output the optimal audio format signal, thereby ensuring that the selected audio format signal can better represent the audio scene of the local user, so that the remote user can well obtain the audio scene information of the audio scene where the local user is located, improving the user experience.
  • Figure 1 is a flow chart of an audio signal encoding method provided by an embodiment of the present disclosure.
  • the method may include but is not limited to the following steps:
  • when the local user establishes voice communication with any remote user, the local user can establish voice communication with the terminal device of the remote user through the local terminal device, wherein the terminal device of the local user can acquire the audio signal in real time from the sound information of the environment where the local user is located.
  • the sound information of the environment where the local user is located includes the sound information emitted by the local user, the sound information of surrounding things, etc.
  • Sound information of surrounding things such as: sound information of driving vehicles, bird calls, wind sound information, sound information of other users around the local user, etc.
  • the terminal device is an entity on the user side that is used to receive or transmit signals, such as mobile phones, computers, tablets, watches, walkie-talkies, conference system equipment, augmented reality AR/virtual reality VR equipment, cars, etc.
  • Terminal equipment can also be called user equipment (user equipment, UE), mobile station (mobile station, MS), mobile terminal equipment (mobile terminal, MT), etc.
  • the terminal device can be a car with communication functions, a smart car, a mobile phone, a wearable device, a tablet computer (Pad), a computer with wireless transceiver functions, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, wireless terminal equipment in industrial control, in self-driving, in remote medical surgery, in smart grid, in transportation safety, in smart city, in smart home, etc.
  • the embodiments of the present disclosure do not limit the specific technology and specific equipment form used by the terminal equipment.
  • the terminal device of the local user can obtain the sound information of the environment where the user is located through a recording device, such as a microphone, provided in or used in conjunction with the terminal device, and further generate the audio signal.
  • the obtained audio signal can be analyzed to obtain the audio scene type corresponding to the audio signal.
  • the audio signal may include sound information emitted by the local user and/or sound information of surrounding things.
  • the audio scene type corresponding to the audio signal can be determined according to the content included in the audio signal.
  • audio scene types include, for example: offices, theaters, cars, train stations, large shopping malls, etc.
  • for each audio scene type, an input format audio signal of one audio signal input format can be selected, or input format audio signals of multiple audio signal input formats can be selected.
  • the audio scene types can also be divided in other ways, or in combination with other ways; for example, an audio scene type may indicate that the audio signal includes only at least one main audio signal, that it includes at least one main audio signal and a background audio signal, or that it includes only at least one background audio signal.
  • the audio scene type can be set in advance as needed, and the embodiment of the present disclosure does not impose specific restrictions on this.
  • one or more audio signal input formats can be selected for each audio scene type.
  • the selected audio signal input format can also be determined according to the method of obtaining the audio signal, for example, according to the number of channels and the spatial layout relationship of the audio signal.
  • S3: determine the target input format audio signal according to the audio scene type and the audio signal.
  • when the audio signal is acquired and the audio scene type corresponding to the audio signal is determined, the target input format audio signal can be further determined based on the audio scene type and the audio signal. Wherein, when the audio signal includes one or more input format audio signals, the target input format audio signal is determined to be one or more input format audio signals in the audio signal according to the audio scene type and the audio signal.
  • the audio signal may include one or more input format audio signals.
  • the corresponding relationship between the audio scene type and the input format audio signal can be set in advance; in the case of determining the audio scene type, the target input format audio signal is determined based on the audio scene type and the corresponding relationship.
  • the target input format audio signal in the audio signal can also be determined according to the audio scene type, the corresponding relationship, and the input format audio signal included in the audio signal.
  • when the target input format audio signal is determined, in order to communicate with the remote user, the target input format audio signal is sent to the terminal device of the remote user; the audio signal in the target input format needs to be encoded to generate the target encoding code stream, which is then sent to the terminal device of the remote user.
  • after receiving the target encoding code stream, the remote user's terminal device decodes the target encoding code stream to obtain the local user's voice information.
  • an audio signal (audio scene signal) is obtained, the audio scene of the audio signal is analyzed, the audio scene type is determined, and the audio scene type and the audio signal are input to the audio signal format generator to obtain the target input format audio signal (audio format signal).
  • the target encoding core used is determined according to the target input format audio signal and/or the audio scene type. For example, when the target input format audio signal is determined to be an object-based audio signal, the target encoding core used is determined to be the object audio data encoding core; when it is determined that the target input format audio signal is a scene-based audio signal, the target encoding core used is determined to be the scene audio data encoding core; when it is determined that the target input format audio signal is an object-based audio signal and a scene-based audio signal, it is determined that the target encoding cores used are the object audio data encoding core and the scene audio data encoding core; and so on.
  • the target input format audio signal (audio format signal) is encoded using the determined target encoding core, and the target encoding code stream is generated through code stream multiplexing and sent to the terminal device of the remote user.
  • after receiving the target encoding code stream, the remote user's terminal device decodes the target encoding code stream to obtain the local user's voice information.
  • the audio signal is obtained; the audio scene type corresponding to the audio signal is determined; the target input format audio signal is determined based on the audio scene type and the audio signal; the target input format audio signal is encoded to generate the target encoding code stream. This ensures that the selected audio format signal can better represent the local user's audio scene, allowing the remote user to obtain audio scene information of the local user's audio scene, thereby improving the user experience.
  • Figure 3 is a flow chart of another audio signal encoding method provided by an embodiment of the present disclosure.
  • the method may include but is not limited to the following steps:
  • the obtained audio signal can be analyzed to obtain the audio scene type corresponding to the audio signal.
  • determining the audio scene type corresponding to the audio signal includes: obtaining audio characteristic parameters of the audio signal; inputting the audio characteristic parameters into the audio scene type analysis model to determine the audio scene type corresponding to the audio signal.
  • the audio characteristic parameters of the audio signal are obtained, such as linear prediction cepstral coefficients, the zero-crossing rate, mel-frequency cepstral coefficients, etc.
  • audio scene type analysis models include, for example, the HMM (hidden Markov model), the Gaussian mixture model, or other mathematical models.
  • the audio scene type analysis model can determine the audio scene type of the audio signal according to the audio characteristic parameters of the audio signal, wherein the audio scene type analysis model can be obtained through pre-training or can be preset; methods in the related art can be used, and the embodiments of the present disclosure do not specifically limit this.
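To make the feature-extraction step concrete, here is a minimal sketch of one of the named feature parameters, the zero-crossing rate, fed to a stand-in classifier. The `model` callable is a placeholder for the pre-trained HMM/GMM scene analysis model; nothing here is from the patent's actual implementation.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ; one of the
    audio characteristic parameters named alongside MFCCs and LPC
    cepstral coefficients."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / max(len(frame) - 1, 1)

def classify_scene(frame, model):
    """Build a (here one-element) feature vector and hand it to the
    scene analysis model, represented as any callable mapping a
    feature vector to a scene-type label."""
    features = [zero_crossing_rate(frame)]
    return model(features)
```

A real front end would compute a richer feature vector per frame and average over many frames before classification.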
  • the target audio signal input format is determined according to the audio scene type, including:
  • in response to the audio scene type characterizing the audio scene as including at least one main audio signal, determining the target audio signal input format to be an object-based signal input format;
  • in response to the audio scene type characterizing the audio scene as including at least one main audio signal and a background audio signal, determining the target audio signal input format to be an object-based signal input format and a scene-based signal input format;
  • in response to the audio scene type characterizing the audio scene as including only a background audio signal, determining the target audio signal input format to be a scene-based signal input format.
  • when it is determined that the audio scene type indicates that the audio scene includes at least one main audio signal, the target audio signal input format may be determined to be an object-based signal input format.
  • when it is determined that the audio scene type indicates that the audio scene includes at least one main audio signal and a background audio signal, the target audio signal input format may be determined to be an object-based signal input format and a scene-based signal input format.
  • when it is determined that the audio scene type indicates that the audio scene includes at least one main audio signal and a background audio signal, the audio signal can be determined to be a mixed format audio signal, including input format audio signals of at least two audio signal input formats.
  • for example, in response to the audio scene type representing that the audio scene includes at least one main audio signal and a background audio signal, the audio signal is determined to be a mixed format audio signal, the target audio signal input format is an object-based signal input format, and the input format includes at least two of a mono-format-based signal, a stereo-format-based signal, a MASA-format-based signal, a channel-format-based signal, an FOA-format-based signal, and an HOA-format-based signal.
  • when it is determined that the audio scene type indicates that the audio scene includes only a background audio signal, the target audio signal input format may be determined to be a scene-based signal input format.
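The scene-type-to-input-format decision rule described above can be sketched as a small function (an illustrative reading of the text, with assumed string labels, not the patented logic): main audio only maps to the object-based format, main plus background maps to both object-based and scene-based, and background only maps to the scene-based format.

```python
def select_input_format(has_main_audio: bool, has_background_audio: bool) -> set:
    """Return the set of target audio signal input formats implied by
    whether the analyzed scene contains main and/or background audio."""
    formats = set()
    if has_main_audio:
        formats.add("object")   # main audio -> object-based input format
    if has_background_audio:
        formats.add("scene")    # background audio -> scene-based input format
    return formats
```

When both flags are true the result has two formats, which corresponds to the mixed format case discussed below.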
  • determining the target input format audio signal according to the target audio signal input format and the audio signal includes:
  • in response to determining that the target audio signal input format is an object-based signal input format, determining that the target input format audio signal is the object-based audio signal among the audio signals;
  • in response to determining that the target audio signal input format is an object-based signal input format and a scene-based signal input format, determining that the target input format audio signal is the object-based audio signal and the scene-based audio signal among the audio signals;
  • in response to determining that the target audio signal input format is a scene-based signal input format, determining that the target input format audio signal is the scene-based audio signal among the audio signals.
  • when it is determined that the target audio signal input format is an object-based signal input format, it is determined that the target input format audio signal is the object-based audio signal among the audio signals.
  • when it is determined that the target audio signal input format is an object-based signal input format and a scene-based signal input format, the target input format audio signal may be determined to be the object-based audio signal and the scene-based audio signal among the audio signals.
  • when it is determined that the target audio signal input format is a scene-based signal input format, the target input format audio signal may be determined to be the scene-based audio signal among the audio signals.
  • S30: Determine the target encoding core according to the target input format audio signal, or determine the target encoding core according to the audio scene type and the target input format audio signal.
  • the target signal encoding core is determined according to the target input format audio signal, including:
  • in response to the target input format audio signal being an object-based audio signal, the target encoding core of the object audio data in the object-based audio signal is determined to be the object audio data encoding core, and the target encoding core of the metadata in the object-based audio signal is determined to be the object metadata parameter encoding core;
  • in response to the target input format audio signal being a scene-based audio signal, determining the target encoding core to be the scene audio data encoding core;
  • in response to the target input format audio signal being a channel-based audio signal, determining the target encoding core to be the channel audio data encoding core;
  • in response to the target input format audio signal being a MASA-based audio signal, determining that the target encoding core of the audio data in the MASA-based audio signal is the scene audio data encoding core, and that the target encoding core of the spatial auxiliary metadata in the MASA-based audio signal is the spatial auxiliary metadata parameter encoding core;
  • in response to the target input format audio signal being a mixed format audio signal, the target encoding core is determined to be the encoding core selected for each format audio signal in the mixed format audio signal, where the mixed format signal includes at least two of an object-based audio signal, a scene-based audio signal, a channel-based audio signal, and a MASA-based audio signal.
  • the target encoding core based on the object audio data in the object audio signal can be determined.
  • the object audio data encoding core it is determined that the target encoding core based on the metadata in the object audio signal is the object metadata parameter encoding core.
  • the target encoding core when it is determined that the target input format audio signal is based on a scene audio signal, the target encoding core may be determined to be a scene audio data encoding core.
  • the target encoding core when it is determined that the target input format audio signal is based on the channel audio signal, the target encoding core may be determined to be the channel audio data encoding core.
  • the target input format audio signal is a spatial audio MASA audio signal based on auxiliary metadata
  • the MASA-based audio signal includes audio data and spatial auxiliary metadata
  • it can be determined that the MASA-based audio signal is The target coding core of the audio data is the scene audio data coding core, and it can be determined that the target coding core based on the spatial auxiliary metadata in the MASA audio signal is the spatial auxiliary metadata parameter coding core.
  • the target encoding core when it is determined that the target input format audio signal is a mixed format audio signal, can be determined to be the encoding core selected for the format audio signal in the mixed format audio signal, where the mixed format signal includes an object-based At least two of an audio signal, a scene-based audio signal, a channel-based audio signal, and a MASA-based audio signal.
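The format-to-core mapping above can be sketched as a simple dispatch table. The core names are illustrative labels only, since the disclosure does not name concrete encoder implementations:

```python
# Hypothetical dispatch from a target input format to its encoding core(s).
# Core names are placeholders for whatever encoders an implementation uses.

CORE_MAP = {
    "object": ["object_audio_data_core", "object_metadata_parameter_core"],
    "scene": ["scene_audio_data_core"],
    "channel": ["channel_audio_data_core"],
    # MASA audio data reuses the scene core; its spatial auxiliary
    # metadata gets a dedicated parameter core.
    "masa": ["scene_audio_data_core",
             "spatial_auxiliary_metadata_parameter_core"],
}

def select_encoding_cores(signal_formats):
    """signal_formats: list of format names. A mixed format signal lists
    at least two formats, and each contributes its own core(s)."""
    cores = []
    for fmt in signal_formats:
        cores.extend(CORE_MAP[fmt])
    return cores
```

Note how the mixed-format case falls out for free: iterating over the constituent formats reproduces the rule that each format audio signal in the mix gets the encoding core selected for that format.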
  • determining the target encoding core according to the audio scene type and the target input format audio signal includes:
  • in response to the target input format audio signal being an object-based audio signal and a scene-based audio signal, determining that the target encoding core for the object audio data in the object-based audio signal is the object audio data encoding core, that the target encoding core for the metadata in the object-based audio signal is the object metadata parameter encoding core, and that the target encoding core for the scene-based audio signal is the scene audio data encoding core;
  • in response to the target input format audio signal being a scene-based audio signal, determining that the target encoding core for the scene-based audio signal is the scene audio data encoding core.
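For the scene-type-driven variant, routing can be expressed per component of the split signal. A hedged sketch, with the dict keys and core names assumed for illustration:

```python
# Hypothetical routing once scene analysis has split the capture into
# object and/or scene components: each component is sent to its core.

def route_scene_components(signal):
    """signal: dict with optional keys "object_audio", "object_metadata",
    and "scene_audio"; returns a mapping core name -> payload."""
    routing = {}
    if "object_audio" in signal:
        # object audio data and its metadata each get their own core
        routing["object_audio_data_core"] = signal["object_audio"]
        routing["object_metadata_parameter_core"] = signal["object_metadata"]
    if "scene_audio" in signal:
        routing["scene_audio_data_core"] = signal["scene_audio"]
    return routing
```

A pure scene capture yields a single route to the scene audio data encoding core; a mixed object-plus-scene capture yields all three routes listed in the paragraph above.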
  • S40 Encode the target input format audio signal according to the target encoding core to obtain encoding parameters.
  • S50 Perform code stream multiplexing according to the encoding parameters to generate the target encoded code stream.
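Steps S40 and S50, encoding with the selected core and then multiplexing the resulting parameters into one target code stream, can be sketched end to end. The serialization format below (JSON) is purely illustrative; a real codec emits quantized bits, not text:

```python
import json

# Hypothetical sketch of S40/S50: each encoding core turns its signal into
# encoding parameters, and the multiplexer packs all parameter sets into
# one serialized target code stream.

def encode_with_core(core_name, signal):
    # stand-in for a real encoding core
    return {"core": core_name, "parameters": signal}

def multiplex(parameter_sets):
    # pack every core's parameters into a single serialized code stream
    return json.dumps(parameter_sets, sort_keys=True).encode("utf-8")

params = [encode_with_core("scene_audio_data_core", "hoa-frame-0")]
target_stream = multiplex(params)
```

The two-stage shape is the point: parameters from any number of cores (object data, metadata, scene data) all feed one multiplexer, which is what lets mixed-format signals share a single target code stream.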
  • In the embodiments of the present disclosure, the audio signal is obtained, the audio scene type corresponding to the audio signal is determined, and the target input format audio signal is determined according to the audio scene type and/or the audio signal. When the target input format audio signal is an object-based audio signal, the target encoding core for the object audio data in the object-based audio signal is determined to be the object audio data encoding core, and the target encoding core for the metadata in the object-based audio signal is determined to be the object metadata parameter encoding core. After encoding, the code stream is multiplexed to obtain the target encoded code stream. This ensures that the selected audio format signal can better represent the audio scene of the local user, so that the remote user can well obtain the audio scene information of the audio scene where the local user is located, improving the user experience.
  • Similarly, the audio signal is obtained, the audio scene type corresponding to the audio signal is determined, and the target input format audio signal is determined according to the audio scene type and/or the audio signal. When the target input format audio signal is a MASA-based audio signal, the target encoding core for the audio data in the MASA audio signal is determined to be the scene audio data encoding core, and the target encoding core for the spatial auxiliary metadata in the MASA audio signal is determined to be the spatial auxiliary metadata parameter encoding core, where the spatial auxiliary metadata in the MASA audio signal is optional. After encoding, the code stream is multiplexed to obtain the target encoded code stream, thus ensuring that the selected audio format signal can better characterize the audio scene of the local user, allowing the remote user to obtain the audio scene information of the audio scene where the local user is located, and improving the user experience.
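The MASA branch described above splits one input signal across two cores; a minimal sketch, assuming a dict-shaped MASA signal with an optional metadata field:

```python
# Hypothetical routing of a MASA audio signal: the audio data goes to the
# scene audio data encoding core, while the spatial auxiliary metadata,
# when present, goes to the spatial auxiliary metadata parameter core.

def route_masa(masa_signal):
    routes = {"scene_audio_data_core": masa_signal["audio_data"]}
    metadata = masa_signal.get("spatial_auxiliary_metadata")
    if metadata is not None:  # the metadata is optional in a MASA signal
        routes["spatial_auxiliary_metadata_parameter_core"] = metadata
    return routes
```

The `get`/`None` check captures the optionality stated in the text: a MASA signal without spatial auxiliary metadata is still encodable, using the scene audio data encoding core alone.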
  • Figure 6 is a structural diagram of an audio signal encoding device provided by an embodiment of the present disclosure.
  • the audio signal encoding device 1 includes: a signal acquisition unit 11 , a type determination unit 12 , a target signal determination unit 13 and an encoding processing unit 14 .
  • the signal acquisition unit 11 is configured to acquire audio signals.
  • the type determining unit 12 is configured to determine the audio scene type corresponding to the audio signal.
  • the target signal determining unit 13 is configured to determine the target input format audio signal according to the audio scene type and the audio signal.
  • the encoding processing unit 14 is configured to encode the target input format audio signal and generate a target encoded code stream.
  • the type determination unit 12 includes: a parameter acquisition module 121 and a model processing module 122.
  • the parameter acquisition module 121 is configured to acquire audio characteristic parameters of the audio signal.
  • the model processing module 122 is configured to input the audio feature parameters into the audio scene type analysis model to determine the audio scene type corresponding to the audio signal.
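The two-module flow above (extract audio feature parameters, then feed them to an analysis model) can be sketched with toy features and a toy rule. Both the features (frame energy, peak level) and the threshold rule are assumptions standing in for whatever model the disclosure actually uses:

```python
# Hypothetical sketch of the type determination unit: a feature extractor
# (parameter acquisition module) followed by a scene-type analysis model
# (model processing module). The model here is a trivial threshold rule.

def extract_features(samples):
    energy = sum(x * x for x in samples) / max(len(samples), 1)
    peak = max((abs(x) for x in samples), default=0.0)
    return {"energy": energy, "peak": peak}

def analyze_scene_type(features):
    # toy stand-in for the analysis model: a dominant peak suggests a
    # main (object) source; otherwise treat the capture as diffuse
    if features["peak"] > 0.5 and features["energy"] > 0.01:
        return "object_dominant_scene"
    return "diffuse_scene"

scene_type = analyze_scene_type(extract_features([0.9, -0.8, 0.7, -0.6]))
```

In a real system the analysis model would be a trained classifier and the feature set far richer, but the interface is the same: audio feature parameters in, audio scene type out.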
  • the target signal determination unit 13 includes: a target format determination module 131 and a target signal determination module 132 .
  • the target format determining module 131 is configured to determine the target audio signal input format according to the audio scene type and/or the audio signal.
  • the target signal determining module 132 is configured to determine the target input format audio signal according to the target audio signal input format and the audio signal.
  • the target format determination module 131 is specifically configured to:
  • in response to the audio scene type characterizing that the audio scene includes at least one main audio signal, determine the target audio signal input format to be an object-based signal input format;
  • determine the target audio signal input format to be an object-based signal input format and a scene-based signal input format; or
  • determine the target audio signal input format to be a scene-based signal input format.
  • the target signal determination module 132 is specifically configured to:
  • in response to determining that the target audio signal input format is an object-based signal input format, determine that the target input format audio signal is an object-based audio signal among the audio signals;
  • determine that the target input format audio signal is an object-based audio signal and a scene-based audio signal among the audio signals; or
  • determine that the target input format audio signal is a scene-based audio signal among the audio signals.
  • the encoding processing unit 14 includes: an encoding core determination module 141, a parameter acquisition module 142, and a code stream generation module 143.
  • the encoding core determination module 141 is configured to determine the target encoding core according to the target input format audio signal, or to determine the target encoding core according to the audio scene type and the target input format audio signal.
  • the parameter acquisition module 142 is configured to encode the target input format audio signal according to the target encoding core to obtain encoding parameters.
  • the code stream generation module 143 is configured to perform code stream multiplexing according to encoding parameters and generate a target code stream.
  • the encoding core determination module 141 is specifically configured to:
  • determine that the target encoding core for the object audio data in the object-based audio signal is the object audio data encoding core, and that the target encoding core for the metadata in the object-based audio signal is the object metadata parameter encoding core;
  • determine the target encoding core to be the scene audio data encoding core;
  • determine the target encoding core to be the channel audio data encoding core;
  • determine that the target encoding core for the audio data in the MASA audio signal is the scene audio data encoding core, and that the target encoding core for the spatial auxiliary metadata in the MASA audio signal is the spatial auxiliary metadata parameter encoding core; or
  • determine the target encoding core to be the encoding core selected for each format audio signal in the mixed format audio signal, where the mixed format signal includes at least two of an object-based audio signal, a scene-based audio signal, a channel-based audio signal, and a MASA-based audio signal.
  • alternatively, the encoding core determination module 141 is specifically configured to:
  • determine that the target encoding core for the object audio data in the object-based audio signal is the object audio data encoding core, that the target encoding core for the metadata in the object-based audio signal is the object metadata parameter encoding core, and that the target encoding core for the scene-based audio signal is the scene audio data encoding core; or
  • in response to the target input format audio signal being a scene-based audio signal, determine that the target encoding core is the scene audio data encoding core.
  • the audio signal encoding device provided by the embodiments of the present disclosure can perform the audio signal encoding method as described in some of the above embodiments, and its beneficial effects are the same as those of the audio signal encoding method described above, which will not be described again here.
  • FIG. 10 is a structural diagram of an electronic device 100 for performing an audio signal encoding method according to an exemplary embodiment.
  • the electronic device 100 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
  • the electronic device 100 may include one or more of the following components: a processing component 101 , a memory 102 , a power supply component 103 , a multimedia component 104 , an audio component 105 , an input/output (I/O) interface 106 , and a sensor. component 107, and communications component 108.
  • the processing component 101 generally controls the overall operations of the electronic device 100, such as operations associated with display, phone calls, data communications, camera operations, and recording operations.
  • the processing component 101 may include one or more processors 1011 to execute instructions to complete all or part of the steps of the above method. Additionally, processing component 101 may include one or more modules that facilitate interaction between processing component 101 and other components. For example, processing component 101 may include a multimedia module to facilitate interaction between multimedia component 104 and processing component 101 .
  • Memory 102 is configured to store various types of data to support operations at electronic device 100 . Examples of such data include instructions for any application or method operating on the electronic device 100, contact data, phonebook data, messages, pictures, videos, etc.
  • the memory 102 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as SRAM (Static Random-Access Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM (Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), ROM (Read-Only Memory), magnetic memory, flash memory, magnetic disk or optical disk.
  • Power supply component 103 provides power to various components of electronic device 100 .
  • Power supply components 103 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic device 100 .
  • Multimedia component 104 includes a touch-sensitive display screen that provides an output interface between the electronic device 100 and the user.
  • the touch display screen may include LCD (Liquid Crystal Display) and TP (Touch Panel).
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide action.
  • multimedia component 104 includes a front-facing camera and/or a rear-facing camera. When the electronic device 100 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data.
  • Each front-facing camera and rear-facing camera can be a fixed optical lens system or have focal length and optical zoom capability.
  • Audio component 105 is configured to output and/or input audio signals.
  • the audio component 105 includes a MIC (Microphone), and when the electronic device 100 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal.
  • the received audio signals may be further stored in memory 102 or sent via communications component 108 .
  • audio component 105 also includes a speaker for outputting audio signals.
  • the I/O interface 106 provides an interface between the processing component 101 and a peripheral interface module.
  • the peripheral interface module may be a keyboard, a click wheel, a button, etc. These buttons may include, but are not limited to: Home button, Volume buttons, Start button, and Lock button.
  • Sensor component 107 includes one or more sensors for providing various aspects of status assessment for electronic device 100 .
  • the sensor component 107 can detect the open/closed state of the electronic device 100 and the relative positioning of components, such as the display and the keypad of the electronic device 100; the sensor component 107 can also detect a change in the position of the electronic device 100 or a component of the electronic device 100, the presence or absence of user contact with the electronic device 100, the orientation or acceleration/deceleration of the electronic device 100, and a change in the temperature of the electronic device 100.
  • Sensor assembly 107 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
  • the sensor component 107 may also include a light sensor, such as a CMOS (Complementary Metal Oxide Semiconductor) or CCD (Charge-Coupled Device) image sensor, for use in imaging applications.
  • the sensor component 107 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • Communication component 108 is configured to facilitate wired or wireless communication between electronic device 100 and other devices.
  • the electronic device 100 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component 108 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 108 also includes an NFC (Near Field Communication) module to facilitate short-range communication.
  • the NFC module can be implemented based on RFID (Radio Frequency Identification) technology, IrDA (Infrared Data Association) technology, UWB (Ultra Wide Band) technology, BT (Bluetooth) technology and other technologies.
  • the electronic device 100 may be implemented by one or more ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), DSPDs (Digital Signal Processing Devices), PLDs (Programmable Logic Devices), FPGAs (Field Programmable Gate Arrays), controllers, microcontrollers, microprocessors or other electronic components, for performing the above audio signal encoding method.
  • the electronic device 100 provided by the embodiments of the present disclosure can perform the audio signal encoding method as described in some of the above embodiments, and its beneficial effects are the same as those of the audio signal encoding method described above, which will not be described again here.
  • the present disclosure also proposes a storage medium. When instructions in the storage medium are executed by a processor of an electronic device, the electronic device can perform the audio signal encoding method as described above.
  • the storage medium can be a ROM (Read-Only Memory), RAM (Random Access Memory), CD-ROM (Compact Disc Read-Only Memory), magnetic tape, floppy disk, optical data storage device, etc.
  • the present disclosure also provides a computer program product. When the computer program is executed by a processor of an electronic device, the electronic device can perform the audio signal encoding method as described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Stereophonic System (AREA)

Abstract

Audio signal encoding method and apparatus, electronic device, and storage medium. The method comprises: acquiring an audio signal (S1); determining an audio scene type corresponding to the audio signal (S2); determining a target input format audio signal according to the audio scene type and the audio signal (S3); and encoding the target input format audio signal to generate a target encoded code stream (S4). It can therefore be ensured that the selected audio format signal better characterizes the audio scene of a local user, such that a remote user can well obtain the audio scene information of the audio scene where the local user is located, and the user experience is improved.
PCT/CN2022/092082 2022-05-10 2022-05-10 Audio signal encoding method and apparatus, electronic device and storage medium WO2023216119A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2022/092082 WO2023216119A1 (fr) 2022-05-10 2022-05-10 Audio signal encoding method and apparatus, electronic device and storage medium
CN202280001342.8A CN117813652A (zh) 2022-05-10 2022-05-10 Audio signal encoding method and apparatus, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/092082 WO2023216119A1 (fr) 2022-05-10 2022-05-10 Audio signal encoding method and apparatus, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2023216119A1 true WO2023216119A1 (fr) 2023-11-16

Family

ID=88729306

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/092082 WO2023216119A1 (fr) 2022-05-10 2022-05-10 Audio signal encoding method and apparatus, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN117813652A (fr)
WO (1) WO2023216119A1 (fr)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101393741A (zh) * 2007-09-19 2009-03-25 中兴通讯股份有限公司 Audio signal classification device and classification method in a wideband audio codec
US20110246207A1 * 2010-04-02 2011-10-06 Korea Electronics Technology Institute Apparatus for playing and producing realistic object audio
US20160225377A1 * 2013-10-17 2016-08-04 Socionext Inc. Audio encoding device and audio decoding device
CN112767956A (zh) * 2021-04-09 2021-05-07 腾讯科技(深圳)有限公司 Audio encoding method and apparatus, computer device, and medium
CN113948099A (zh) * 2021-10-18 2022-01-18 北京金山云网络技术有限公司 Audio encoding method, audio decoding method, apparatus, and electronic device
CN114125639A (zh) * 2021-12-06 2022-03-01 维沃移动通信有限公司 Audio signal processing method and apparatus, and electronic device
CN114299967A (zh) * 2020-09-22 2022-04-08 华为技术有限公司 Audio encoding and decoding method and apparatus

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117373465A (zh) * 2023-12-08 2024-01-09 富迪科技(南京)有限公司 A voice and audio signal switching system
CN117373465B (zh) * 2023-12-08 2024-04-09 富迪科技(南京)有限公司 A voice and audio signal switching system

Also Published As

Publication number Publication date
CN117813652A (zh) 2024-04-02

Similar Documents

Publication Publication Date Title
US9966084B2 (en) Method and device for achieving object audio recording and electronic apparatus
EP3046309B1 (fr) Procédé, dispositif et système pour la projection sur un écran
US20170304735A1 (en) Method and Apparatus for Performing Live Broadcast on Game
US11567729B2 (en) System and method for playing audio data on multiple devices
WO2017181551A1 (fr) Procédé et dispositif de traitement vidéo
CN106454644B (zh) 音频播放方法及装置
JP7361890B2 (ja) 通話方法、通話装置、通話システム、サーバ及びコンピュータプログラム
EP4044578A1 (fr) Procédé de traitement audio et dispositif électronique
WO2021244159A1 (fr) Procédé et appareil de traduction, écouteur et appareil de stockage d'écouteur
CN115273831A (zh) 语音转换模型训练方法、语音转换方法和装置
CN112286481A (zh) 音频输出方法及电子设备
WO2023216119A1 (fr) Procédé et appareil de codage de signal audio, dispositif électronique et support d'enregistrement
WO2021244135A1 (fr) Procédé et appareil de traduction, et casque d'écoute
CN110767203B (zh) 音频处理方法、装置及移动终端及存储介质
CN104682908A (zh) 控制音量的方法及装置
CN111739538B (zh) 一种翻译方法、装置、耳机和服务器
CN115550559B (zh) 视频画面显示方法、装置、设备和存储介质
CN110213531B (zh) 监控录像处理方法及装置
WO2024000534A1 (fr) Procédé et appareil de codage de signal audio, dispositif électronique et support de stockage
CN109712629B (zh) 音频文件的合成方法及装置
WO2023212879A1 (fr) Procédé et appareil de génération de données audio d'objet, dispositif électronique, et support de stockage
CN116830193A (zh) 音频码流信号处理方法、装置、电子设备和存储介质
CN114007101B (zh) 融合显示设备的处理方法、设备及存储介质
EP4030294A1 (fr) Procédé de commande de fonction, dispositif de commande de fonction et support de stockage
EP4167580A1 (fr) Procédé de commande audio, système et dispositif électronique

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 202280001342.8

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22941082

Country of ref document: EP

Kind code of ref document: A1