CN116456038A - Audio processing method and device, nonvolatile storage medium and electronic equipment - Google Patents

Audio processing method and device, nonvolatile storage medium and electronic equipment Download PDF

Info

Publication number
CN116456038A
CN116456038A (application number CN202210018390.XA)
Authority
CN
China
Prior art keywords: target, target object, audio signal, determining, audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210018390.XA
Other languages
Chinese (zh)
Inventor
江建亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Guangzhou Shizhen Information Technology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Guangzhou Shizhen Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd, Guangzhou Shizhen Information Technology Co Ltd filed Critical Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN202210018390.XA priority Critical patent/CN116456038A/en
Publication of CN116456038A publication Critical patent/CN116456038A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/04Synchronising
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04Time compression or expansion
    • G10L21/055Time compression or expansion for synchronising with other signals, e.g. video signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen

Abstract

The application discloses an audio processing method, an audio processing device, a nonvolatile storage medium and electronic equipment. Wherein the method comprises the following steps: determining a first position, a second position where a target object displayed by the image display device is located, and a third position where each speaker in the speaker group is located; determining a first positional relationship of the target object relative to the first position according to the first position and the second position, and determining a second positional relationship of each speaker relative to the first position according to the first position and the third position; processing target audio data corresponding to the target object into a first audio signal based on the first and second positional relationships, wherein the first audio signal corresponds to a target speaker in the speaker group; and sending the first audio signal to the target loudspeaker according to the corresponding relation between the first audio signal and the target loudspeaker. The method and the device solve the technical problem that the spatial directions of the sound and the sounding object are inconsistent when the user watches the video played by the video terminal.

Description

Audio processing method and device, nonvolatile storage medium and electronic equipment
Technical Field
The present invention relates to the field of audio processing, and in particular, to an audio processing method, an apparatus, a nonvolatile storage medium, and an electronic device.
Background
Devices with which users watch video typically include an image display device and a sound playing device. The image display device may include any screen for displaying video images, for example a projection curtain, and the sound playing device may be a single speaker or a group of speakers.
When a group of speakers is used to play the sound, the sound emitted by a sounding object on the screen is played directly by the speakers, so the spatial positions of sound and picture may be inconsistent; that is, the position of the sounding object that the user determines visually differs from the position determined audibly. This problem can leave the user unsure of which object on the screen is sounding, which in turn reduces video call efficiency or makes a video ambiguous to follow when watching, directly affecting the experience of the video user.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the application provides an audio processing method and device, a nonvolatile storage medium and electronic equipment, which are used for at least solving the technical problem that when a user watches a video played by a video terminal, the spatial directions of perceived sound and a sound object are inconsistent.
According to an aspect of an embodiment of the present application, there is provided an audio processing method including: determining a first position, a second position where a target object is located, and a third position where each loudspeaker in a loudspeaker set is located, wherein the target object is displayed through an image display device, and the first position is located at a position where an optical signal sent by the image display device can be received; determining a first positional relationship of the target object relative to the first position based on the first position and the second position, and determining a second positional relationship of each speaker relative to the first position based on the first position and the third position; processing target audio data corresponding to the target object into at least one first audio signal based on the first positional relationship and the second positional relationship, wherein the at least one first audio signal corresponds to at least one target speaker in the speaker group; and sending the at least one first audio signal to the at least one target loudspeaker according to the corresponding relation between the at least one first audio signal and the at least one target loudspeaker.
Optionally, determining the second position where the target object is located includes: determining an image position relation between the target object and the image display device according to display parameters of the image display device and image data corresponding to the target audio data, wherein the target object is positioned in a display picture generated by the image display device based on the image data; and determining a fourth position where the image display device is located, and determining the second position where the target object is located according to the fourth position and the image position relation.
Optionally, the determining the image position relationship between the target object and the image display device includes: determining a first image position of the target object in the display screen based on the image data; determining a second image position of the display picture in the image display device based on the display parameters of the image display device; the image positional relationship of the target object and the image display device is determined based on the first image position and the second image position.
Optionally, the determining a first positional relationship of the target object relative to the first position according to the first position and the second position, and determining a second positional relationship of each speaker relative to the first position according to the first position and the third position includes: determining a first vector corresponding to the target object according to the first position and the second position, wherein the first vector represents the first position relation; and determining a second vector corresponding to each loudspeaker according to the first position and the third position, wherein the second vector represents the second position relation.
Optionally, processing the target audio data corresponding to the target object into at least one first audio signal based on the first positional relationship and the second positional relationship, including: determining a plurality of speakers in the speaker group as target speakers based on the first vector and the second vector, wherein the target object is located in a range of a closed figure, and the closed figure is obtained by connecting the target speakers; generating a plurality of audio processing parameters based on the unit direction vector of the first vector and the unit direction vector of a plurality of third vectors, wherein the plurality of third vectors are a plurality of vectors corresponding to the target speakers in the second vector, and the plurality of audio processing parameters are in one-to-one correspondence with the plurality of target speakers; and processing the target audio data based on the plurality of audio processing parameters to obtain a plurality of first audio signals corresponding to the plurality of target loudspeakers one by one.
Optionally, the generating a plurality of audio processing parameters based on the unit direction vector of the first vector and the unit direction vectors of the plurality of third vectors includes: calculating, by a vector base amplitude panning (VBAP) method, based on the unit direction vector of the first vector and the unit direction vectors of the plurality of third vectors, a plurality of amplitude weights in one-to-one correspondence with the plurality of target speakers, wherein the audio processing parameters include the amplitude weights.
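As a hedged illustration (not code from the patent itself), the pairwise two-dimensional case of VBAP can be sketched in Python: the amplitude weights for a speaker pair are found by inverting the 2x2 matrix whose rows are the speakers' unit direction vectors, then normalising for constant perceived loudness. All direction values and names here are assumptions for illustration.

```python
import math

def unit(v):
    """Unit direction vector of a 2-D vector."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def vbap_gains_2d(source_dir, spk1_dir, spk2_dir):
    """Amplitude weights (g1, g2) for a speaker pair so that the panned
    source is perceived from `source_dir`; all arguments are 2-D vectors."""
    p = unit(source_dir)
    l1, l2 = unit(spk1_dir), unit(spk2_dir)
    # Solve g1*l1 + g2*l2 = p by Cramer's rule on the 2x2 system.
    det = l1[0] * l2[1] - l1[1] * l2[0]
    g1 = (p[0] * l2[1] - p[1] * l2[0]) / det
    g2 = (l1[0] * p[1] - l1[1] * p[0]) / det
    # Normalise so that g1^2 + g2^2 = 1 (constant loudness).
    norm = math.sqrt(g1 * g1 + g2 * g2)
    return g1 / norm, g2 / norm
```

For a source straight ahead between speakers at +/-45 degrees, the two weights come out equal, as expected for symmetric panning.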
Optionally, before the sending the at least one first audio signal to the at least one target speaker, the method further comprises: mixing the first audio signal with a second audio signal to obtain a channel audio signal, wherein the second audio signal and the first audio signal both correspond to the same target loudspeaker; and sending the channel audio signal to the target loudspeaker corresponding to the first audio signal.
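A minimal sketch of the mixing step just described, assuming normalised floating-point samples in [-1, 1]; the clamping policy is an illustrative choice, not something the patent specifies.

```python
def mix_channel(first_signal, second_signal):
    """Sum two per-sample audio signals destined for the same target
    speaker into one channel audio signal, clamping to [-1, 1]."""
    return [max(-1.0, min(1.0, a + b))
            for a, b in zip(first_signal, second_signal)]
```
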
According to another aspect of the embodiments of the present application, there is also provided an audio processing apparatus, including: a first determining module, configured to determine a first position, a second position where a target object is located, and a third position where each speaker in a speaker group is located, where the target object is displayed by an image display device, and the first position is located at a position where an optical signal sent by the image display device can be received; a second determining module configured to determine a first positional relationship of the target object with respect to the first position according to the first position and the second position, and determine a second positional relationship of each speaker with respect to the first position according to the first position and the third position; a processing module, configured to process target audio data corresponding to the target object into at least one first audio signal based on the first positional relationship and the second positional relationship, where the at least one first audio signal corresponds to at least one target speaker in the speaker group; and the sending module is used for sending the at least one first audio signal to the at least one target loudspeaker according to the corresponding relation between the at least one first audio signal and the at least one target loudspeaker.
According to still another aspect of the embodiments of the present application, there is further provided a non-volatile storage medium, where the non-volatile storage medium includes a stored program, and when the program runs, the device in which the non-volatile storage medium is controlled to execute any one of the audio processing methods described above.
According to still another aspect of the embodiments of the present application, there is also provided an electronic device, including: a memory and a processor, wherein the memory includes a stored program; the processor is configured to execute a program stored in the memory, where the program executes any one of the audio processing methods described above.
In the embodiment of the application, the first position, the second position where the target object displayed by the image display device is located, and the third position where each loudspeaker in the loudspeaker group is located are determined; determining a first positional relationship of the target object relative to the first position according to the first position and the second position, and determining a second positional relationship of each speaker relative to the first position according to the first position and the third position; processing target audio data corresponding to the target object into a first audio signal based on the first and second positional relationships, wherein the first audio signal corresponds to a target speaker in the speaker group; according to the corresponding relation between the first audio signal and the target loudspeaker, the first audio signal is sent to the target loudspeaker, and the purpose of providing the loudspeaker with the audio signal corresponding to the relative position relation between the sounding target object and the listening user is achieved, so that the technical effect of improving the consistency of the object seen by the user and the sound heard in the space direction when the user watches the video is achieved, and the technical problem that the user perceives that the space direction of the sound and the sounding object is inconsistent when watching the video played by the video terminal is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 shows a hardware block diagram of a computer terminal for implementing an audio processing method;
fig. 2 is a flow chart of an audio processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative setup space coordinate system provided in accordance with an embodiment of the present application;
FIG. 4 is a schematic diagram of an alternative set-up image coordinate system provided in accordance with an embodiment of the present application;
FIG. 5 is a flow chart of a method of audio processing of multiple video objects provided in accordance with an alternative embodiment of the present application;
fig. 6 is a block diagram of an audio processing apparatus according to an embodiment of the present application.
Detailed Description
In order to make the solution of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some terms or terminology appearing in the description of the embodiments of the present application are explained as follows:
Vector Base Amplitude Panning (VBAP for short): a spatial sound reproduction method that distributes an audio signal over a set of speakers by weighting the amplitude fed to each speaker; it may be used to feed audio to the speakers.
In accordance with the embodiments of the present application, there is provided an embodiment of a method of audio processing, it being noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
The method embodiment provided in the first embodiment of the present application may be executed in a mobile terminal, a computer terminal or a similar computing device. Fig. 1 shows a block diagram of a hardware structure of a computer terminal for implementing an audio processing method. As shown in fig. 1, the computer terminal 10 may include one or more processors 102 (shown as 102a, 102b, ..., 102n) and a memory 104 for storing data (the processor 102 may include, but is not limited to, a microprocessor such as an MCU or a processing device such as an FPGA programmable logic device). In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any other combination thereof. Furthermore, the data processing circuit may be a single stand-alone processing module, or may be incorporated, in whole or in part, into any of the other elements in the computer terminal 10. As referred to in the embodiments of the present application, the data processing circuit acts as a kind of processor control (for example, selection of a variable-resistance termination path connected to an interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the audio processing method in the embodiments of the present application, and the processor 102 executes the software programs and modules stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the method of application program described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10.
Fig. 2 is a flow chart of an audio processing method according to an embodiment of the present application, as shown in fig. 2, the method includes the following steps:
step S202, determining a first position, a second position where a target object is located, and a third position where each speaker in the speaker group is located, where the target object is displayed by the image display device, and the first position is located where an optical signal sent by the image display device can be received.
Alternatively, the first location may be a location where a listener or a user watching a video is located, or may be another fixed location in front of the image display device. When a user views a video picture through an image display device, each speaker in the speaker group is used to play sound from the video. The target object is an object in a screen displayed by the image display device, for example, may be a person object in the screen which is speaking, or may be a sound-producing object or sound-producing animal in the screen. Wherein the image display device may be a display screen.
Step S204, determining a first position relation of the target object relative to the first position according to the first position and the second position, and determining a second position relation of each loudspeaker relative to the first position according to the first position and the third position.
Alternatively, in step S202 to step S204, the position and the positional relationship may be defined by establishing a spatial coordinate system. For example, the first position, the second position, and the third position are defined as coordinates in a three-dimensional coordinate system, and the positional relationship between the positions is further defined according to the coordinates of the positions.
Specifically, fig. 3 is a schematic diagram of an optional establishment of a space coordinate system provided according to an embodiment of the present application. As shown in fig. 3, the space coordinate system may take the first position where the user is located as the origin of coordinates, and be established according to the origin of coordinates and the position and attitude of the display screen: the direction parallel to the horizontal frame of the display screen and pointing to the left in fig. 3 is taken as the positive X-axis direction of the space coordinate system, the direction parallel to the vertical frame of the display screen and pointing upward in fig. 3 is taken as the positive Z-axis direction, and the direction pointing perpendicularly at the display screen through the origin of coordinates is taken as the positive Y-axis direction. In the spatial coordinate system established in this alternative manner, the coordinates representing the third position of each speaker in the speaker group may be noted as (x_i, y_i, z_i), i = 1, 2, ..., L, where i denotes the i-th speaker in the speaker group, L denotes the total number of speakers in the speaker group, and x_i, y_i, z_i are the projections of the vector from the coordinate origin to the i-th speaker onto the X, Y and Z directions, respectively. Alternatively, the position of each speaker in the spatial coordinate system may be calculated from its arrangement position, where a speaker may be mounted in the same complete machine as the display screen, or may exist separately from the display screen.
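The speaker coordinates (x_i, y_i, z_i) in this coordinate system can be illustrated with a small Python sketch; the layout, the distances and the choice of origin below are hypothetical values, not taken from the patent.

```python
# Hypothetical layout (coordinates in metres): the listener sits at the
# coordinate origin and the display screen stands 2 m away along +Y.
speaker_positions = [
    (-1.2, 2.0, 0.0),  # speaker 1, left of the screen
    (1.2, 2.0, 0.0),   # speaker 2, right of the screen
    (0.0, 2.0, 0.8),   # speaker 3, above the screen
]

def speaker_vectors(positions, origin=(0.0, 0.0, 0.0)):
    """Coordinates (x_i, y_i, z_i) of each speaker, i.e. the projections of
    the vector from the coordinate origin to the i-th speaker on X, Y, Z."""
    return [tuple(c - o for c, o in zip(pos, origin)) for pos in positions]
```

With the listener at the origin, each speaker's coordinate triple is itself the second vector of the optional embodiment above.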
Further, the first positional relationship and the second positional relationship may be represented by a vector from the first position to the second position and the first position to the third position.
Step S206, processing the target audio data corresponding to the target object into at least one first audio signal based on the first positional relationship and the second positional relationship, wherein the at least one first audio signal corresponds to at least one target speaker in the speaker group.
In this step, the target audio data corresponding to the target object may cover the following scenes: the target audio data is the audio data corresponding to the target object in the video audio when the target object sounds in the video; or the target audio data may be the complete audio data of the video to which the target object belongs. In the first scenario, the audio data of the video to which the target object belongs may be marked, and the audio data generated by the sounding of the target object may be marked to obtain the target audio data. It should be noted that the number of the first audio signals may be one or more, and the number of first audio signals obtained by the processing may depend on the number of target speakers. After each target speaker produces sound driven by its corresponding first audio signal, a user at the first position perceives the visual direction and the auditory direction of the target object as the same; that is, for the first position, the auditory effect of driving the at least one target speaker with the first audio signals is the same as the auditory effect of a single speaker sounding directly at the second position.
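Assuming amplitude weights such as those from the optional VBAP embodiment have already been computed, producing one first audio signal per target speaker can be sketched as a simple per-sample scaling. This is a hedged illustration; the buffer format (mono, normalised floats) is an assumption.

```python
def first_audio_signals(target_audio, amplitude_weights):
    """Scale one mono buffer of target audio data by each target speaker's
    amplitude weight, yielding one first audio signal per target speaker."""
    return [[w * sample for sample in target_audio] for w in amplitude_weights]

# Two target speakers with hypothetical VBAP weights of ~0.707 each:
signals = first_audio_signals([0.2, -0.4, 0.1], [0.7071, 0.7071])
```
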
Step S208, according to the correspondence between the at least one first audio signal and the at least one target speaker, the at least one first audio signal is sent to the at least one target speaker. After the target loudspeaker receives the first audio signal, sound production is driven according to the first audio signal.
In the above step, by determining the first position, the second position where the target object displayed by the image display device is located, and the third position where each speaker in the speaker group is located; determining a first positional relationship of the target object relative to the first position according to the first position and the second position, and determining a second positional relationship of each speaker relative to the first position according to the first position and the third position; processing target audio data corresponding to the target object into a first audio signal based on the first and second positional relationships, wherein the first audio signal corresponds to a target speaker in the speaker group; according to the corresponding relation between the first audio signal and the target loudspeaker, the first audio signal is sent to the target loudspeaker, and the purpose of providing the loudspeaker with the audio signal corresponding to the relative position relation between the sounding target object and the listening user is achieved, so that the technical effect of improving the consistency of the object seen by the user and the sound heard in the space direction when the user watches the video is achieved, and the technical problem that the user perceives that the space direction of the sound and the sounding object is inconsistent when watching the video played by the video terminal is solved.
As an alternative embodiment, determining the second position where the target object is located may be as follows: determining an image position relation between a target object and the image display device according to display parameters of the image display device and image data corresponding to target audio data, wherein the target object is positioned in a display picture generated by the image display device based on the image data; and determining a fourth position where the image display device is located, and determining a second position where the target object is located according to the relation between the fourth position and the image position.
Optionally, the image display device may include a display screen, and the display parameters may include resolution of the display screen, that is, the number of pixels in the horizontal direction and the vertical direction of the display screen, and may further include parameter information such as a region position, a region size, and the like of the video in which the target object is displayed in the display screen.
The display screen may display the image data corresponding to the target audio data in full screen, or may display the image in only one window in the display screen. For example, the display screen may display video frames of a plurality of groups of video calls, the target object may be a person speaking in a group of video calls, the image data of the video includes the frame of the person, and the audio data of the video includes the sound of the person. The image data corresponding to the target object can occupy a small area in the display screen only, so that the image position relationship between the target object and the display screen can be determined according to the display parameters of the display screen. In addition, the fourth position where the image display device is located may be a position coordinate of the center of the display screen in the spatial coordinate system, and other definition manners may also be adopted. And according to the image position relation between the target object and the display screen and the position coordinate of the display screen in the space coordinate system, carrying out substitution calculation to obtain a second position of the target object in the space coordinate system.
FIG. 4 is a schematic diagram of an alternative image coordinate system establishment, according to an embodiment of the present application, optionally, a two-dimensional image coordinate system parallel to the plane of the display screen may be established as shown in FIG. 4, and the image position relationship between the target object and the display screen may be determined according to the two-dimensional image coordinate system:
let the number of horizontal pixels of the display screen be m+1 and the number of vertical pixels be n+1. Four vertices of the square display are defined as A, B, C and D, and the positions of A, B, C and D in the image coordinate system are (0, 0), (0, N), (N, M), and (M, 0). And determining the coordinates (m, n) of the target object in the image coordinate system, and further determining the image position relationship between the target object and the display screen, namely the angle and the distance between the target object and the upper left corner of the display screen or the angle and the distance between the target object and the center of the display screen.
As an alternative embodiment, determining the image positional relationship of the target object and the image display device may be as follows: determining a first image position of a target object in a display screen based on the image data; determining a second image position of the display picture in the image display device based on the display parameters of the image display device; an image positional relationship of the target object and the image display device is determined based on the first image position and the second image position.
In this optional embodiment, the first image position of the target object in the display picture to which it belongs may be acquired first; then, based on the display parameters, the second image position where that display picture is located is determined in the image display device; position substitution then yields the image positional relationship of the target object in the image display device. Optionally, when determining the first image position, the position of the mouth of the target object in its display picture may be identified as the first image position, so as to achieve a more accurate auditory-azimuth fit. Alternatively, the center position of the display picture to which the target object belongs may be taken as the position of the target object.
Alternatively, in the coordinate spaces established according to FIG. 3 and FIG. 4, the second position where the target object is located may be determined as follows. Based on the fourth position of the display screen, the four vertices of the display screen are represented in the spatial coordinate system as (X_A, Y_A, Z_A), (X_B, Y_B, Z_B), (X_C, Y_C, Z_C) and (X_D, Y_D, Z_D); in this alternative embodiment it may be determined that Y_A = Y_B = Y_C = Y_D. Then, according to the coordinates (m, n) of the target object in the image coordinate system, the coordinates (X_E, Y_E, Z_E) of the second position where the target object is located are obtained by linear interpolation as follows:

(X_E, Y_E, Z_E) = ((X_D - X_A) * m/M, Y_A, (Z_B - Z_A) * n/N).
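A minimal sketch of this linear interpolation follows. The function and parameter names are illustrative; the X_A and Z_A offsets are included as an assumption on our part so that the mapping also works when vertex A is not at the spatial origin (when A is at the origin it reduces to the formula above):

```python
def pixel_to_space(m, n, M, N, A, B, D):
    """Map image coordinates (m, n) to the spatial coordinate system by
    linear interpolation between screen vertices A, B and D, each given
    as an (X, Y, Z) tuple. The screen plane has constant Y."""
    x = A[0] + (D[0] - A[0]) * m / M   # horizontal interpolation A -> D
    z = A[2] + (B[2] - A[2]) * n / N   # vertical interpolation A -> B
    return (x, A[1], z)
```

For a 2 m wide, 1 m tall screen with A at the spatial origin, the pixel at the screen center maps to the center of the screen plane.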
As an alternative embodiment, the first positional relationship and the second positional relationship may be determined as follows: a first vector corresponding to the target object is determined according to the first position and the second position, where the first vector represents the first positional relationship; and a second vector corresponding to each speaker is determined according to the first position and the third position, where the second vectors represent the second positional relationship. Both the first vector and the second vectors comprise a direction and a distance. For example, in the spatial coordinate system, the direction of the first vector may be the direction in which the origin of coordinates points to the second position, and its magnitude is the length of the line segment from the origin of coordinates to the second position.
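The first and second vectors described above can be sketched as a (unit direction, distance) pair computed from two positions. The helper name is illustrative:

```python
import math

def direction_and_distance(origin, point):
    """Vector from `origin` (e.g. the first position, taken as the
    coordinate origin) to `point` (e.g. the second or third position),
    returned as a (unit_direction, distance) pair."""
    v = [p - o for o, p in zip(origin, point)]
    dist = math.sqrt(sum(c * c for c in v))
    unit = [c / dist for c in v] if dist > 0 else [0.0, 0.0, 0.0]
    return unit, dist
```

Applying it to the first and second positions gives the first vector; applying it to the first position and each third position gives the second vectors.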
As an alternative embodiment, processing the target audio data corresponding to the target object into at least one first audio signal based on the first positional relationship and the second positional relationship may proceed as follows: determine, based on the first vector and the second vectors, a plurality of speakers in the speaker group as target speakers, where the target object is located within the range of a closed figure obtained by connecting the target speakers; generate a plurality of audio processing parameters based on the unit direction vector of the first vector and the unit direction vectors of a plurality of third vectors, where the third vectors are the vectors among the second vectors that correspond to the target speakers, and the audio processing parameters are in one-to-one correspondence with the target speakers; and process the target audio data based on the plurality of audio processing parameters to obtain a plurality of first audio signals in one-to-one correspondence with the plurality of target speakers.
Optionally, "the target object is located within the range of the closed figure" means that, viewed from the first position where the user is located towards the target object, the target object lies within the polygon formed by connecting the plurality of target speakers. Connection here refers not to a physical or communication connection between the target speakers but to a geometric connection in space. After the plurality of target speakers are selected, an audio processing parameter may be assigned to each target speaker based on the unit direction vector of the target object and the unit direction vectors of the target speakers, and the target audio data may be processed with each of these parameters to obtain the first audio signals.
In addition, in the process of processing the target audio data with the audio processing parameters, audio-enhancement signal processing methods, including but not limited to automatic gain control (Automatic Gain Control, AGC), acoustic echo cancellation (Acoustic Echo Canceller, AEC) and automatic noise suppression (Automatic Noise Suppression, ANS), may also be applied.
As an alternative embodiment, generating the plurality of audio processing parameters based on the unit direction vector of the first vector and the unit direction vectors of the plurality of third vectors may proceed as follows: a vector base amplitude panning method is applied to the unit direction vector of the first vector and the unit direction vectors of the plurality of third vectors, yielding a plurality of amplitude weights in one-to-one correspondence with the plurality of target speakers, where the audio processing parameters include the amplitude weights.
Optionally, when the unit direction vector of the first vector coincides with the unit direction vector of one of the second vectors, the speaker corresponding to that second vector may be determined directly as the target speaker, and the first audio signal corresponding to the first vector may be sent directly to it. In this optional step, the direction of the target object corresponding to the first audio signal is the same as the direction of that target speaker in space, so the speaker may directly play the sound of the video in which the target object appears; the user then perceives the target object in the same direction through both vision and hearing.
Alternatively, taking three selected target speakers as an example, vector base amplitude panning (VBAP) may be used to pan the audio signal: panning the target audio data yields three first audio signals corresponding to the target speakers at the three positions. A specific implementation algorithm is as follows:
First, the coordinates (X_i, Y_i, Z_i) of each speaker in the speaker group in the spatial coordinate system are determined, and the unit direction vector of each speaker and its distance from the origin of the spatial reference coordinate system are calculated:

e_i = (X_i, Y_i, Z_i) / r_i, where r_i = sqrt(X_i^2 + Y_i^2 + Z_i^2).

Further, the coordinates (X_E, Y_E, Z_E) of the target object in the spatial coordinate system are obtained, and the unit direction vector of the target object and its distance from the origin of the spatial reference coordinate system are calculated:

e_E = (X_E, Y_E, Z_E) / r_E, where r_E = sqrt(X_E^2 + Y_E^2 + Z_E^2).
Three unit direction vectors are then selected from the unit direction vectors of the speakers such that the unit direction vector of the target object lies within the triangular area formed by the three selected vectors. The three vectors correspond to three target speakers, denoted speaker 1, speaker 2 and speaker 3.
VBAP spatial decoding is then performed with the unit direction vectors of the three selected speakers and the unit direction vector of the target object, yielding the amplitude weights with which the target audio data of the target object is fed to the three speakers. Assume the unit direction vectors of the three target speakers are e_1, e_2 and e_3, and the unit direction vector of the target object is e_E. The amplitude weights of the target audio data corresponding to the three target speakers then satisfy

e_E = w1 * e_1 + w2 * e_2 + w3 * e_3, i.e. (w1, w2, w3) = e_E^T * [e_1 e_2 e_3]^{-1},

where w1 represents the amplitude weight corresponding to speaker 1, w2 the amplitude weight corresponding to speaker 2, and w3 the amplitude weight corresponding to speaker 3.
The obtained amplitude weights of the three target speakers are then normalized, i.e. each of the three weights is multiplied by a normalization factor, giving the normalized amplitude weights with which the audio stream of the target object is replayed by the three speakers. Power normalization may be used, with normalization factor

w_av = 1 / sqrt(w1^2 + w2^2 + w3^2).
The target audio data is processed with the normalized amplitude weights, thereby obtaining the three first audio signals corresponding to the three target speakers.
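The VBAP steps above can be sketched as follows, assuming (as a simplification) mono target audio given as a sample array. Function names are illustrative; the weight solve and power normalization follow the formulas in the preceding paragraphs:

```python
import numpy as np

def vbap_weights(e1, e2, e3, e_t):
    """Solve e_t = w1*e1 + w2*e2 + w3*e3 for the amplitude weights,
    then power-normalize them. All inputs are 3-element unit vectors."""
    L = np.column_stack([e1, e2, e3])           # speaker directions as columns
    w = np.linalg.solve(L, np.asarray(e_t, dtype=float))
    # All weights non-negative <=> the target direction lies inside the
    # triangle spanned by the three speaker directions.
    w_av = 1.0 / np.sqrt(np.sum(w ** 2))        # power-normalization factor
    return w * w_av

def pan(audio, weights):
    """Scale the mono target audio into three first audio signals."""
    return [np.asarray(audio, dtype=float) * wi for wi in weights]
```

For speakers on the three coordinate axes and a target object in the diagonal direction, the three normalized weights come out equal, as expected by symmetry.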
As an alternative embodiment, before sending the at least one first audio signal to the at least one target speaker, the first audio signal may be mixed with a second audio signal to obtain a channel audio signal, where the second audio signal and the first audio signal both correspond to the same target speaker; and sending the channel audio signal to a target loudspeaker corresponding to the first audio signal.
Alternatively, multiple windows may be opened in different areas of the image display device to display multiple video frames, and the audio data of each video may be processed into its own set of audio signals. The audio signals to be fed to the same speaker are mixed to obtain the channel audio signal corresponding to that speaker, and the channel audio signal is then fed to the speaker. In this way, a user at the first position hears the sounds of multiple videos simultaneously, and the auditory direction of each video's sound matches the visual direction of its frame.
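The per-speaker mixing described above can be sketched as summing all signals routed to the same speaker. The `feeds` structure of (speaker_id, signal) pairs is illustrative, not from the patent text:

```python
import numpy as np

def mix_per_speaker(feeds):
    """Mix all first/second audio signals routed to the same speaker
    into one channel audio signal per speaker."""
    mixed = {}
    for spk, sig in feeds:
        sig = np.asarray(sig, dtype=float)
        mixed[spk] = sig if spk not in mixed else mixed[spk] + sig
    return mixed
```

Each entry of the returned dictionary is the channel audio signal sent to the corresponding speaker.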
Fig. 5 is a flowchart of an audio processing method for multiple video objects according to an alternative embodiment of the present application. The method shown in Fig. 5 may be applied to a processor responsible for processing video streams in a local device, for example the CPU of a host computer or a dedicated DSP chip. As shown in Fig. 5, after the processor receives n video objects (i.e., video stream data), the audio and video are first decoded, and the decoded image data is sent to the image display device for visual playback of the video. Meanwhile, the audio data is handed to an audio enhancement signal processing module for basic audio-enhancement processing such as automatic gain control, acoustic echo cancellation and automatic noise suppression; the processed audio data then undergoes target panning to obtain the first audio signals corresponding to the multiple target speakers. After the audio data of the multiple video objects have each been signal-processed and panned, the first and second audio signals corresponding to the same target speaker are mixed, and each mixed audio signal is sent to its corresponding target speaker, which then produces sound.
According to an embodiment of the present application, an audio processing apparatus for implementing the above audio processing method is also provided. Fig. 6 is a structural block diagram of the audio processing apparatus according to an embodiment of the present application. As shown in Fig. 6, the audio processing apparatus includes a first determining module 62, a second determining module 64, a processing module 66 and a transmitting module 68, which are described below.
A first determining module 62, configured to determine a first position, a second position where a target object is located, and a third position where each speaker in the speaker group is located, where the target object is displayed by the image display device, and the first position is located at a position where an optical signal sent by the image display device can be received;
a second determining module 64, coupled to the first determining module 62, for determining a first positional relationship of the target object with respect to the first position according to the first position and the second position, and determining a second positional relationship of each speaker with respect to the first position according to the first position and the third position;
a processing module 66, coupled to the second determining module 64, for processing the target audio data corresponding to the target object into at least one first audio signal based on the first positional relationship and the second positional relationship, wherein the at least one first audio signal corresponds to at least one target speaker in the speaker group;
And a transmitting module 68, coupled to the processing module 66, for transmitting the at least one first audio signal to the at least one target speaker according to the correspondence between the at least one first audio signal and the at least one target speaker.
Here, the first determining module 62, the second determining module 64, the processing module 66 and the transmitting module 68 correspond to steps S202 to S208 in embodiment 1; the examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to what is disclosed in embodiment 1. It should be noted that the above modules may run as part of the apparatus in the computer terminal 10 provided in embodiment 1.
Embodiments of the present application may provide an electronic device, optionally, in this embodiment, the electronic device may be located in a computer device, and the computer device may be located in at least one network device of a plurality of network devices of a computer network. The electronic device includes a memory and a processor.
The memory may be used to store software programs and modules, such as program instructions/modules corresponding to the audio processing methods and apparatuses in the embodiments of the present application, and the processor executes the software programs and modules stored in the memory, thereby executing various functional applications and data processing, that is, implementing the audio processing methods described above. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located relative to the processor, which may be connected to the computer terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor may call the information and the application program stored in the memory through the transmission device to perform the following steps: determining a first position, a second position where a target object is located, and a third position where each loudspeaker in the loudspeaker set is located, wherein the target object is displayed through the image display device, and the first position is located at a position where an optical signal sent by the image display device can be received; determining a first positional relationship of the target object relative to the first position based on the first position and the second position, and determining a second positional relationship of each speaker relative to the first position based on the first position and the third position; processing target audio data corresponding to the target object into at least one first audio signal based on the first positional relationship and the second positional relationship, wherein the at least one first audio signal corresponds to at least one target speaker in the speaker group; and sending the at least one first audio signal to the at least one target loudspeaker according to the corresponding relation between the at least one first audio signal and the at least one target loudspeaker.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware of a terminal device. The program may be stored in a non-volatile storage medium, and the storage medium may include: a flash disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk or an optical disk, among others.
Embodiments of the present application also provide a nonvolatile storage medium. Alternatively, in the present embodiment, the above-described nonvolatile storage medium may be used to store program codes executed by the audio processing method provided in the above-described embodiment 1.
Alternatively, in this embodiment, the above-mentioned nonvolatile storage medium may be located in any one of the computer terminals in the computer terminal group in the computer network, or in any one of the mobile terminals in the mobile terminal group.
Optionally, in the present embodiment, the non-volatile storage medium is arranged to store program code for performing the steps of: determining a first position, a second position where a target object is located, and a third position where each loudspeaker in the loudspeaker set is located, wherein the target object is displayed through the image display device, and the first position is located at a position where an optical signal sent by the image display device can be received; determining a first positional relationship of the target object relative to the first position based on the first position and the second position, and determining a second positional relationship of each speaker relative to the first position based on the first position and the third position; processing target audio data corresponding to the target object into at least one first audio signal based on the first positional relationship and the second positional relationship, wherein the at least one first audio signal corresponds to at least one target speaker in the speaker group; and sending the at least one first audio signal to the at least one target loudspeaker according to the corresponding relation between the at least one first audio signal and the at least one target loudspeaker.
The foregoing embodiment numbers of the present application are merely for description and do not imply that one embodiment is better or worse than another.
In the foregoing embodiments of the present application, each embodiment is described with its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technology content may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and the division of the units, for example, may be a logic function division, and may be implemented in another manner, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a non-volatile storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and improvements without departing from the principles of the present application, and such modifications and improvements are also regarded as falling within the scope of protection of the present application.

Claims (10)

1. An audio processing method, comprising:
determining a first position, a second position where a target object is located, and a third position where each loudspeaker in a loudspeaker set is located, wherein the target object is displayed through an image display device, and the first position is located at a position where an optical signal sent by the image display device can be received;
determining a first positional relationship of the target object relative to the first position based on the first position and the second position, and determining a second positional relationship of each speaker relative to the first position based on the first position and the third position;
processing target audio data corresponding to the target object into at least one first audio signal based on the first positional relationship and the second positional relationship, wherein the at least one first audio signal corresponds to at least one target speaker in the speaker group;
And sending the at least one first audio signal to the at least one target loudspeaker according to the corresponding relation between the at least one first audio signal and the at least one target loudspeaker.
2. The method of claim 1, wherein determining the second location at which the target object is located comprises:
determining an image position relation between the target object and the image display device according to display parameters of the image display device and image data corresponding to the target audio data, wherein the target object is positioned in a display picture generated by the image display device based on the image data;
and determining a fourth position where the image display device is located, and determining the second position where the target object is located according to the fourth position and the image position relation.
3. The method of claim 2, wherein the determining the image positional relationship of the target object with the image display device comprises:
determining a first image position of the target object in the display screen based on the image data;
determining a second image position of the display picture in the image display device based on the display parameters of the image display device;
The image positional relationship of the target object and the image display device is determined based on the first image position and the second image position.
4. The method of claim 1, wherein the determining a first positional relationship of the target object relative to the first position based on the first position and the second position, and determining a second positional relationship of each speaker relative to the first position based on the first position and the third position, comprises:
determining a first vector corresponding to the target object according to the first position and the second position, wherein the first vector represents the first position relation; the method comprises the steps of,
and determining second vectors corresponding to the speakers respectively according to the first position and the third position, wherein the second vectors represent the second position relation.
5. The method of claim 4, wherein processing the target audio data corresponding to the target object into at least one first audio signal based on the first positional relationship and the second positional relationship comprises:
determining a plurality of speakers in the speaker group as target speakers based on the first vector and the second vector, wherein the target object is located in a range of a closed figure, and the closed figure is obtained by connecting the target speakers;
Generating a plurality of audio processing parameters based on the unit direction vector of the first vector and the unit direction vector of a plurality of third vectors, wherein the plurality of third vectors are a plurality of vectors corresponding to the target speakers in the second vector, and the plurality of audio processing parameters are in one-to-one correspondence with the plurality of target speakers;
and processing the target audio data based on the plurality of audio processing parameters to obtain a plurality of first audio signals corresponding to the plurality of target loudspeakers one by one.
6. The method of claim 5, wherein generating a plurality of audio processing parameters based on the unit direction vector of the first vector and the unit direction vectors of the plurality of third vectors comprises:
and calculating by adopting a vector-based amplitude feed method based on the unit direction vector of the first vector and the unit direction vectors of the plurality of third vectors to obtain a plurality of amplitude weights corresponding to the plurality of target loudspeakers one by one, wherein the audio processing parameters comprise the amplitude weights.
7. The method of claim 1, wherein prior to said transmitting the at least one first audio signal to the at least one target speaker, further comprising:
Mixing the first audio signal with a second audio signal to obtain a channel audio signal, wherein the second audio signal and the first audio signal both correspond to the same target loudspeaker;
and sending the channel audio signal to the target loudspeaker corresponding to the first audio signal.
8. An audio processing apparatus, comprising:
a first determining module, configured to determine a first position, a second position where a target object is located, and a third position where each speaker in a speaker group is located, where the target object is displayed by an image display device, and the first position is located at a position where an optical signal sent by the image display device can be received;
a second determining module configured to determine a first positional relationship of the target object with respect to the first position according to the first position and the second position, and determine a second positional relationship of each speaker with respect to the first position according to the first position and the third position;
a processing module, configured to process target audio data corresponding to the target object into at least one first audio signal based on the first positional relationship and the second positional relationship, where the at least one first audio signal corresponds to at least one target speaker in the speaker group;
And the sending module is used for sending the at least one first audio signal to the at least one target loudspeaker according to the corresponding relation between the at least one first audio signal and the at least one target loudspeaker.
9. A non-volatile storage medium, characterized in that the non-volatile storage medium comprises a stored program, wherein the program, when run, controls a device in which the non-volatile storage medium is located to perform the audio processing method of any one of claims 1 to 7.
10. An electronic device, comprising: a memory and a processor, wherein,
the memory comprises a stored program;
the processor is configured to execute a program stored in the memory, wherein the program executes the audio processing method according to any one of claims 1 to 7.
CN202210018390.XA 2022-01-07 2022-01-07 Audio processing method and device, nonvolatile storage medium and electronic equipment Pending CN116456038A (en)


Publications (1)

Publication Number Publication Date
CN116456038A true CN116456038A (en) 2023-07-18

Family

ID=87120730



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination