CN115118919A - Audio processing method, apparatus, device, storage medium, and program product - Google Patents

Audio processing method, apparatus, device, storage medium, and program product

Info

Publication number
CN115118919A
CN115118919A
Authority
CN
China
Prior art keywords
voice data
far-end voice
video conference
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210736921.9A
Other languages
Chinese (zh)
Inventor
崔洋洋
余俊澎
王星宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Youme Information Technology Co., Ltd.
Original Assignee
Shanghai Youme Information Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Youme Information Technology Co., Ltd.
Priority to CN202210736921.9A
Publication of CN115118919A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00Circuits for transducers, loudspeakers or microphones
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L2021/02082Noise filtering the noise being echo, reverberation of the speech
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2430/00Signal processing covered by H04R, not provided for in its groups
    • H04R2430/01Aspects of volume control, not necessarily automatic, in sound systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present application relates to an audio processing method, apparatus, computer device, storage medium and computer program product. The method comprises the following steps: acquiring near-end voice data acquired by near-end video conference equipment; acquiring target far-end voice data, wherein the target far-end voice data is voice data transmitted by far-end video conference equipment; obtaining the correlation between the near-end voice data and the target far-end voice data, and determining the system delay time corresponding to the near-end voice data based on the correlation; and based on the system delay time, carrying out echo cancellation processing on the near-end voice data to obtain the processed voice data. By adopting the method, the audio quality of the video conference can be improved.

Description

Audio processing method, apparatus, device, storage medium, and program product
Technical Field
The present application relates to the field of video conferencing technologies, and in particular, to an audio processing method, apparatus, device, storage medium, and program product.
Background
A video conference system is a system in which individuals or groups in two or more different places transmit audio, video, and file data to one another through transmission lines and multimedia equipment, so as to realize real-time, interactive communication and hold a conference across sites simultaneously. Remote training, meetings, and teaching across multiple sites can be conveniently carried out through a video conference system.
With the development of related technologies, video conferencing has become increasingly popular. How to improve audio quality during a video conference has therefore become an urgent problem to be solved.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide an audio processing method, apparatus, computer device, computer-readable storage medium, and computer program product capable of improving the audio quality of a video conference.
In a first aspect, the present application provides an audio processing method. The method comprises the following steps:
acquiring near-end voice data acquired by near-end video conference equipment;
acquiring target far-end voice data, wherein the target far-end voice data is voice data transmitted by far-end video conference equipment;
obtaining the correlation between the near-end voice data and the target far-end voice data, and determining the system delay time corresponding to the near-end voice data based on the correlation;
and based on the system delay time, carrying out echo cancellation processing on the near-end voice data to obtain the processed voice data.
In one embodiment, the obtaining the correlation between the near-end speech data and the target far-end speech data includes:
acquiring first short-time average energy corresponding to near-end voice data and second short-time average energy corresponding to target far-end voice data;
and acquiring the correlation degree based on the first short-time average energy and the second short-time average energy.
In one embodiment, the acquiring target far-end voice data includes:
receiving candidate far-end voice data sent by each far-end video conference device in a plurality of far-end video conference devices;
and determining the playing sequence of each candidate far-end voice data, and acquiring target far-end voice data from the candidate far-end voice data based on the playing sequence.
In one embodiment, obtaining the target far-end speech data from the plurality of candidate far-end speech data based on the playing order includes:
storing part of the candidate far-end voice data in the plurality of candidate far-end voice data into a preset buffer area based on the playing sequence;
and extracting target far-end voice data with preset data length from a preset cache region.
In one embodiment, the method further comprises:
acquiring an initial volume value of indoor sound acquired by indoor acquisition equipment when near-end video conference equipment plays audio;
acquiring a first distance between an indoor participant and near-end video conference equipment, and determining a target indoor volume value based on the first distance;
and adjusting the equipment volume of the near-end video conference equipment based on the target indoor volume value and the initial volume value.
In one embodiment, adjusting the device volume of the near-end video conference device based on the target indoor volume value and the initial volume value comprises:
acquiring the volume adjustment step length of the equipment;
and performing multiple times of adjustment on the equipment volume of the near-end video conference equipment based on the equipment volume adjustment step length until the difference value between the indoor sound volume value acquired by the indoor acquisition equipment after certain adjustment and the target indoor volume value is smaller than a preset threshold value.
In a second aspect, the present application further provides an audio processing apparatus. The device includes:
the first acquisition module is used for acquiring near-end voice data acquired by near-end video conference equipment;
the second acquisition module is used for acquiring target far-end voice data, and the target far-end voice data is voice data transmitted by far-end video conference equipment;
the determining module is used for acquiring the correlation between the near-end voice data and the target far-end voice data and determining the system delay time corresponding to the near-end voice data based on the correlation;
and the eliminating module is used for carrying out echo elimination processing on the near-end voice data based on the system delay time to obtain the processed voice data.
In one embodiment, the determining module is specifically configured to:
acquiring first short-time average energy corresponding to near-end voice data and second short-time average energy corresponding to target far-end voice data;
and acquiring the correlation degree based on the first short-time average energy and the second short-time average energy.
In one embodiment, the second obtaining module is specifically configured to:
receiving candidate far-end voice data sent by each far-end video conference device in a plurality of far-end video conference devices;
and determining the playing sequence of each candidate far-end voice data, and acquiring target far-end voice data from the candidate far-end voice data based on the playing sequence.
In one embodiment, the second obtaining module is further specifically configured to:
storing part of the candidate far-end voice data in the plurality of candidate far-end voice data into a preset buffer area based on the playing sequence;
and extracting target far-end voice data with preset data length from a preset cache region.
In one embodiment, the apparatus is further configured to:
acquiring an initial volume value of indoor sound acquired by indoor acquisition equipment when near-end video conference equipment plays audio;
acquiring a first distance between an indoor participant and near-end video conference equipment, and determining a target indoor volume value based on the first distance;
adjusting the device volume of the near-end video conference device based on the target indoor volume value and the initial volume value.
In one embodiment, the apparatus is further specifically configured to:
acquiring the volume adjustment step length of the equipment;
and performing multiple times of adjustment on the equipment volume of the near-end video conference equipment based on the equipment volume adjustment step length until the difference value between the indoor sound volume value acquired by the indoor acquisition equipment after certain adjustment and the target indoor volume value is smaller than a preset threshold value.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory in which a computer program is stored and a processor which, when executing the computer program, implements the audio processing method according to any of the first aspects described above.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the audio processing method according to any of the above first aspects.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which, when executed by a processor, implements the audio processing method as described in any of the above first aspects.
With the above audio processing method, apparatus, computer device, storage medium, and computer program product, near-end voice data acquired by the near-end video conference device is acquired; target far-end voice data, which is voice data transmitted by the far-end video conference device, is acquired; the correlation between the near-end voice data and the target far-end voice data is obtained, and the system delay time corresponding to the near-end voice data is determined based on the correlation; and echo cancellation processing is performed on the near-end voice data based on the system delay time to obtain processed voice data. In the embodiments of the present application, the correlation between the near-end voice data and the target far-end voice data is obtained, and the system delay time corresponding to the near-end voice data is determined based on the correlation and used for echo cancellation, so that the system delay time is determined in real time, the influence of system delay jitter on echo cancellation is reduced, the echo cancellation effect on the audio data is improved, and the quality of the audio played by the far-end video conference device is further improved.
Drawings
FIG. 1 is a diagram of an exemplary audio processing method;
FIG. 2 is a flow diagram of an audio processing method in one embodiment;
FIG. 3 is a schematic flow chart of step 103 in one embodiment;
FIG. 4 is a schematic flow chart of step 102 in one embodiment;
FIG. 5 is a flow chart illustrating step 302 according to one embodiment;
FIG. 6 is a diagram of a receive signal buffer in one embodiment;
FIG. 7 is a flowchart illustrating an audio processing method according to another embodiment;
FIG. 8 is a flow chart illustrating an audio processing method according to another embodiment;
FIG. 9 is a flow chart illustrating an audio processing method according to another embodiment;
FIG. 10 is a block diagram showing the structure of an audio processing apparatus according to an embodiment;
FIG. 11 is a block diagram showing the construction of an audio processing apparatus according to another embodiment;
FIG. 12 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The embodiment of the application provides an audio processing method. The execution subject of the audio processing method may be an audio processing apparatus, and the audio processing apparatus may be implemented, in software, hardware, or a combination of software and hardware, as part or all of a terminal or a server. The terminal may be a personal computer, a notebook computer, a media player, a smart television, a smartphone, a tablet computer, a portable wearable device, or the like; the server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
In the following method embodiments, a terminal is taken as the execution subject for description. It is understood that the method can also be applied to a server, or to a system comprising a terminal and a server, where it is realized through interaction between the terminal and the server.
The application scenario of the embodiments of the present application is described by taking two video conference participants, participant A and participant B, as an example. As shown in fig. 1, an echo problem exists during a remote video conference: after the sound signal of the far-end indoor participant (i.e., participant A) is transmitted to the near end (i.e., participant B), it is played through the speaker of the near-end video conference device, is then picked up by the microphone of the near-end video conference device after a series of acoustic reflections, and is finally transmitted back to the far-end indoor participant (i.e., participant A). Therefore, in order to improve the quality of the audio data heard by the participants, echo cancellation needs to be performed on the audio data.
Please refer to fig. 2, which shows a flowchart of an audio processing method according to an embodiment of the present application. As shown in fig. 2, the audio processing method may include the steps of:
step 101, near-end voice data acquired by near-end video conference equipment is acquired.
The near-end voice data can be a near-end voice signal picked up by a microphone of the near-end video conference device; or near-end voice data obtained by performing voice processing on the near-end voice signal.
Optionally, the speech processing procedure includes noise reduction processing, fourier transform, and the like.
And 102, acquiring target far-end voice data.
The target far-end voice data is voice data transmitted by the far-end video conference equipment; the playing time corresponding to the target far-end voice data is before the picking-up time of the near-end voice data.
Optionally, the target far-end voice data may be a far-end voice signal picked up by a microphone of the far-end video conference device; or the far-end voice data obtained by processing the far-end voice signal.
Optionally, the video conference system composed of the near-end video conference device and the far-end video conference device transmits multimedia data based on the TCP/IP protocol. Specifically, when voice data is transmitted, the voice data is segmented based on a target data length to obtain a plurality of data segments, and each data segment is then encapsulated to obtain a plurality of RTP data packets; specifically, an RTP data packet is obtained after a TCP header and an IP header are added to each data segment, and if the length of a segmented voice segment is smaller than the target data length, zeros are appended to the tail of that voice segment so that its length reaches the target data length. When the target far-end voice data needs to be acquired, the corresponding RTP data packets are decapsulated to obtain the corresponding data segments, the data segments are merged, and the target far-end voice data is obtained based on the merged voice data.
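Purely as an illustration of the segmentation and reassembly described above, a Python sketch of splitting voice data into fixed-length segments with zero padding and merging received segments back together might look like this; the function names and the use of numpy are assumptions of the sketch, not details taken from the patent.

```python
import numpy as np

def segment_voice(samples: np.ndarray, target_len: int) -> list[np.ndarray]:
    """Split voice data into fixed-length segments, zero-padding the last one."""
    segments = []
    for start in range(0, len(samples), target_len):
        seg = samples[start:start + target_len]
        if len(seg) < target_len:                          # tail shorter than the target data length
            seg = np.pad(seg, (0, target_len - len(seg)))  # append zeros to reach target_len
        segments.append(seg)
    return segments

def reassemble(segments: list[np.ndarray]) -> np.ndarray:
    """Merge the received data segments back into a single far-end voice stream."""
    return np.concatenate(segments)
```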
Step 103, obtaining the correlation between the near-end voice data and the target far-end voice data, and determining the system delay time corresponding to the near-end voice data based on the correlation.
The system delay time refers to a timing delay time caused by network jitter of the video conference system.
Optionally, the correlation between the near-end speech data and the target far-end speech data is calculated by using a cross-correlation function.
Specifically, the cross-correlation function is:
$$R_{xd}(\tau)=\sum_{n} x(n)\,d(n+\tau),\qquad 0\le\tau\le T_{max}F_{s}$$
wherein x(n) represents the target far-end voice data, d(n) represents the near-end voice signal, $T_{max}$ denotes the maximum delay of the set delay-estimate search, $F_{s}$ represents the sampling rate of the voice signal, D represents the time length corresponding to the system delay, n represents the n-th sampling point in the target far-end voice data, and τ represents the offset variable.
After the correlation $R_{xd}(D)$ is obtained, the value of D corresponding to the peak of $R_{xd}$ is determined, and the system delay time $T'_{s}$ is obtained from D according to $T'_{s}=D-T_{d}$, where $T_{d}$ represents the direct delay of the audio from playing to microphone pick-up, typically 0.01-0.02 s; optionally, $T_{d}$ is set to 0.015 s.
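For illustration only, the cross-correlation-based delay estimation above could be sketched in Python (with numpy) as follows; the exhaustive search over lags, the unnormalized correlation, and the function name estimate_system_delay are assumptions of the sketch rather than the patent's reference implementation.

```python
import numpy as np

def estimate_system_delay(x: np.ndarray, d: np.ndarray,
                          fs: int, t_max: float, t_d: float = 0.015) -> float:
    """Estimate the system delay T'_s from far-end data x(n) and near-end data d(n).

    fs    : sampling rate F_s
    t_max : maximum delay of the delay-estimate search, in seconds
    t_d   : direct delay from playback to microphone pick-up (typically 0.01-0.02 s)
    """
    max_lag = int(t_max * fs)
    best_lag, best_r = 0, -np.inf
    for lag in range(max_lag + 1):
        n = min(len(x), len(d) - lag)
        if n <= 0:
            break
        r = float(np.dot(x[:n], d[lag:lag + n]))   # cross-correlation R_xd(lag)
        if r > best_r:
            best_r, best_lag = r, lag
    delay_d = best_lag / fs          # time length D corresponding to the correlation peak
    return delay_d - t_d             # system delay T'_s = D - T_d
```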
And step 104, based on the system delay time, performing echo cancellation processing on the near-end voice data to obtain processed voice data.
Optionally, echo cancellation processing is performed on the near-end voice data based on the obtained system delay time and the echo cancellation algorithm to cancel echo data included in the near-end voice data. Wherein, the echo cancellation algorithm may be an adaptive filtering algorithm.
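Because the patent only names adaptive filtering as one possible echo cancellation algorithm, the following is a minimal NLMS (normalized least-mean-squares) sketch showing how the estimated system delay could be used to align the far-end reference before filtering; the filter length, step size, and alignment strategy are assumptions of the sketch, not the patent's specific algorithm.

```python
import numpy as np

def nlms_echo_cancel(near: np.ndarray, far: np.ndarray, delay_samples: int,
                     taps: int = 256, mu: float = 0.5, eps: float = 1e-8) -> np.ndarray:
    """Remove the echo of the delay-compensated far-end signal from the near-end signal."""
    ref = np.concatenate([np.zeros(delay_samples), far])[:len(near)]  # align reference by system delay
    w = np.zeros(taps)            # adaptive filter weights
    buf = np.zeros(taps)          # most recent reference samples, newest first
    out = np.zeros(len(near))
    for n in range(len(near)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        echo_hat = w @ buf                         # estimated echo component
        e = near[n] - echo_hat                     # near-end speech with echo removed
        w += mu * e * buf / (buf @ buf + eps)      # NLMS weight update
        out[n] = e
    return out
```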
In the embodiment, near-end voice data acquired by near-end video conference equipment is acquired; acquiring target far-end voice data, wherein the target far-end voice data is voice data transmitted by far-end video conference equipment; obtaining the correlation between the near-end voice data and the target far-end voice data, and determining the system delay time corresponding to the near-end voice data based on the correlation; and based on the system delay time, carrying out echo cancellation processing on the near-end voice data to obtain the processed voice data. According to the embodiment of the application, the correlation between the near-end voice data and the target far-end voice data is obtained, and the system delay time corresponding to the near-end voice data is determined based on the cross correlation for echo cancellation, so that the purpose of determining the system delay time in real time is achieved, the influence of system delay jitter on echo cancellation is reduced, the echo cancellation effect of audio data is improved, and the audio quality played by far-end video conference equipment is further improved.
In the embodiment of the present application, as shown in fig. 3, based on the embodiment shown in fig. 2, the embodiment relates to an implementation process for acquiring a correlation between near-end voice data and target far-end voice data in step 103, including step 201 and step 202:
step 201, a first short-time average energy corresponding to the near-end voice data and a second short-time average energy corresponding to the target far-end voice data are obtained.
Optionally, dividing the near-end voice data into a plurality of data segments according to the target data length; and obtaining the first short-time average energy by using a short-time average energy calculation formula. The short-time average energy calculation formula is expressed as follows:
$$X(m)=\frac{1}{L}\sum_{n=(m-1)L+1}^{mL} x^{2}(n),\qquad m=1,2,\dots,N$$
wherein X(m) is the first short-time average energy of the m-th data segment, x(n) represents the near-end voice data, n represents the n-th sampling point, N represents the total number of data segments contained in the voice data, and L represents the number of sampling points contained in each data segment.
Similarly, the short-time average energy calculation formula is applied to the target far-end voice data d(n) to obtain the second short-time average energy D(m).
Step 202, a correlation is obtained based on the first short-time average energy and the second short-time average energy.
Optionally, the correlation $R_{XD}(D)$ is calculated by using the following formula:
$$R_{XD}(\tau)=\sum_{m} X(m)\,D(m+\tau)$$
wherein $R_{XD}(D)$ is the correlation, D represents the number of data segments corresponding to the system delay (i.e., the offset at which $R_{XD}$ peaks), and τ represents the offset variable.
Optionally, after obtaining the D value, the system delay time may be determined according to a time length corresponding to each data segment.
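As an illustration of steps 201-202, the two short-time average energy sequences and the energy-domain correlation search could be sketched as follows; the exhaustive search over segment lags and the function names are assumptions made for the sketch, not details taken from the patent.

```python
import numpy as np

def short_time_energy(signal: np.ndarray, seg_len: int) -> np.ndarray:
    """Average energy of each length-L data segment of the signal."""
    n_seg = len(signal) // seg_len
    segs = signal[:n_seg * seg_len].reshape(n_seg, seg_len)
    return (segs ** 2).mean(axis=1)

def delay_in_segments(x_energy: np.ndarray, d_energy: np.ndarray, max_lag: int) -> int:
    """Return the segment offset D that maximizes the correlation of the two energy sequences."""
    best_lag, best_r = 0, -np.inf
    for lag in range(max_lag + 1):
        n = min(len(x_energy), len(d_energy) - lag)
        if n <= 0:
            break
        r = float(np.dot(x_energy[:n], d_energy[lag:lag + n]))  # R_XD(lag)
        if r > best_r:
            best_r, best_lag = r, lag
    return best_lag   # multiply by the segment duration to obtain the system delay time
```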
In this embodiment, a first short-time average energy corresponding to the near-end voice data and a second short-time average energy corresponding to the target far-end voice data are obtained, and the correlation is obtained based on the first short-time average energy and the second short-time average energy. Calculating the correlation from the short-time average energies reduces the interference of high-frequency components of the voice signals with the cross-correlation calculation and also reduces the amount of calculation required to obtain the correlation.
In one embodiment, the number of far-end video conference devices is not less than 2, and the plurality of far-end video conference devices and the near-end video conference device form a video conference system based on the DVTS plus topology structure diagram. As shown in fig. 4, based on the embodiment shown in fig. 2, the implementation process of acquiring the target far-end voice data in step 102 includes the following steps:
step 301, receiving candidate far-end voice data sent by each far-end video conference device in a plurality of far-end video conference devices.
Step 302, determining the playing order of each candidate far-end voice data, and acquiring the target far-end voice data from the candidate far-end voice data based on the playing order.
Optionally, a received data list corresponding to the near-end video conference device is determined according to the device identification information of the near-end video conference device, where the received data list includes the device identification information of the far-end video conference devices subscribed to by the near-end video conference device.
Optionally, the received data list further includes a priority order of each remote video conference device. And determining the playing order of the candidate far-end voice data based on the priority order.
In this embodiment, candidate far-end voice data sent by each of the plurality of far-end video conference devices is received, the playing order of each candidate far-end voice data is determined, and the target far-end voice data is acquired from the candidate far-end voice data based on the playing order. In this way, the target far-end voice data can be determined when a plurality of far-end video conference devices exist, and conflicts among the plurality of far-end video conference devices are avoided.
In the embodiment of the present application, as shown in fig. 5, based on the embodiment shown in fig. 4, the step 302 of acquiring target far-end voice data from a plurality of candidate far-end voice data based on the playing order includes steps 401 and 402:
step 401, storing a part of the candidate far-end voice data in the plurality of candidate far-end voice data into a preset buffer area based on the playing order.
Optionally, as shown in fig. 6, the preset buffer area is a received signal buffer area. Specifically, the multiplexer determines, according to the device identification information (i.e., the user name) of the near-end video conference device, the candidate far-end voice data corresponding to the near-end video conference device, and writes the determined candidate far-end voice data into the received signal buffer area.
Optionally, the number of the candidate far-end speech data stored in the preset buffer is related to the length of the preset buffer.
Step 402, extracting target far-end voice data with a preset data length from a preset buffer area.
Optionally, the preset data length is an integer multiple of the sampling interval of the voice data and is a power of 2.
Optionally, after the target far-end voice data is extracted, the target far-end voice data is deleted from the preset buffer area. Each time the target far-end voice data is extracted, the echo cancellation process is started.
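A minimal sketch of the preset buffer behaviour in steps 401-402 is given below, assuming a hypothetical ReceivedSignalBuffer class and a power-of-two frame length of 1024 samples; neither name nor value is specified by the patent.

```python
from collections import deque
import numpy as np

class ReceivedSignalBuffer:
    """Stores candidate far-end voice data in play order and hands out fixed-length frames."""

    def __init__(self, frame_len: int = 1024):    # power-of-two preset data length (assumed)
        self.frame_len = frame_len
        self.samples = deque()

    def write(self, chunk: np.ndarray) -> None:
        """Append candidate far-end voice data in its playing order."""
        self.samples.extend(chunk.tolist())

    def read_frame(self) -> np.ndarray | None:
        """Extract one frame of target far-end voice data and delete it from the buffer."""
        if len(self.samples) < self.frame_len:
            return None                            # not enough data buffered yet
        frame = [self.samples.popleft() for _ in range(self.frame_len)]
        return np.asarray(frame)
```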
In this embodiment, based on the playing order, part of the candidate far-end speech data in the plurality of candidate far-end speech data is stored in the preset buffer area, and the target far-end speech data with the preset data length is extracted from the preset buffer area, so that the real-time determination of the target far-end speech data is realized, and the determination of the subsequent system delay time is facilitated.
In the embodiment of the present application, based on any one of the above embodiments, as shown in fig. 7, the audio processing method further includes steps 501, 502, and 503:
step 501, acquiring an initial volume value of indoor sound acquired by an indoor acquisition device when a near-end video conference device plays audio.
Optionally, the sound sensor is used to collect the indoor sound.
Optionally, the initial volume value may be a volume value of indoor sound when the near-end video conference device plays the debugging audio before the video conference starts, or may be a volume value of indoor sound when the near-end video conference device plays the far-end audio in the video conference process.
Step 502, a first distance between an indoor participant and a near-end video conference device is obtained, and a target indoor volume value is determined based on the first distance.
Wherein the indoor participant is a near-end participant.
Optionally, a first distance between the indoor participant and the near-end video conference device is obtained through the distance sensor. Specifically, the distance sensor is an infrared sensor.
Alternatively, the distance sensor may be disposed on a side facing the front of the near-end video conference device. The distance between the distance sensor and the front of the near-end video conference device may be set to 0.4-0.7m, for example, the distance may be set to 0.5 m.
When there are a plurality of indoor participants, the first distance can be determined in either of the following two implementation modes (a sketch of both modes is given after them):
Mode 1: obtain the distance between each participant and the near-end video conference device, calculate the mean value of the plurality of distances, and take the calculated mean value as the first distance.
Mode 2: acquire the position information of the indoor participants and determine a densely populated area; obtain the distances between the indoor participants located in the densely populated area and the near-end video conference device, calculate the average value of these distances, and take the obtained average value as the first distance.
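The two modes above could be sketched as follows; representing participant positions as coordinates and using a median-spread rule to stand in for the "densely populated area" are simplifying assumptions of the sketch, not the patent's method.

```python
import numpy as np

def first_distance_mean(distances: list[float]) -> float:
    """Mode 1: mean of each participant's distance to the near-end device."""
    return float(np.mean(distances))

def first_distance_dense_area(positions: np.ndarray, device_pos: np.ndarray) -> float:
    """Mode 2 (simplified): average distance of the participants in the densest cluster."""
    centroid = positions.mean(axis=0)
    spread = np.linalg.norm(positions - centroid, axis=1)
    dense = positions[spread <= np.median(spread)]   # crude stand-in for the dense area
    return float(np.linalg.norm(dense - device_pos, axis=1).mean())
```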
Optionally, the terminal is provided with a mapping relation table of the target indoor volume value and the distance between the indoor participants and the near-end video conference equipment; and obtaining the target indoor volume value based on the mapping relation table and the first distance.
Specifically, as shown in fig. 8, when the audio is played, the position of the sound source is determined, and the sound source position is associated with the target indoor sound volume value, so as to generate the mapping relationship table. When the volume of the near-end video conference equipment is adjusted, the positions of indoor participants are obtained (namely, the positions of users are tracked), a first distance between the indoor participants and the near-end video conference equipment is determined (namely, the distance between the users and a sound source is estimated), and the equipment volume of the near-end video conference equipment is adjusted according to the first distance.
Optionally, when the first distance is smaller than a preset distance, a prompt message is sent to remind the participant that the distance is too close, so as to protect the eyes; in addition, the duration for which the first distance persists is obtained, and when the duration exceeds a preset duration, the brightness of the near-end video conference device is adjusted.
Step 503, adjusting the device volume of the near-end video conference device based on the target indoor volume value and the initial volume value.
Optionally, the volume of the corresponding target device is determined according to the target indoor volume value, and the device volume of the near-end video conference device is adjusted to the volume of the target device.
Optionally, obtaining a device volume adjustment step length; and performing multiple times of adjustment on the equipment volume of the near-end video conference equipment based on the equipment volume adjustment step length until the difference value between the indoor sound volume value acquired by the indoor acquisition equipment after certain adjustment and the target indoor volume value is smaller than a preset threshold value. Optionally, the value range of the volume step is 4dB to 9 dB.
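A sketch of this stepwise adjustment loop is shown below; the callables for reading the room volume and getting/setting the device volume are hypothetical placeholders, and the default step size and threshold are merely example values within the ranges given above.

```python
from typing import Callable

def adjust_volume(target_db: float,
                  measure_room_db: Callable[[], float],   # hypothetical: indoor sound sensor reading (dB)
                  get_volume: Callable[[], float],        # hypothetical: current device volume
                  set_volume: Callable[[float], None],    # hypothetical: set device volume
                  step_db: float = 5.0, threshold_db: float = 1.0,
                  max_iters: int = 20) -> None:
    """Repeatedly nudge the device volume until the measured room volume is close to the target."""
    for _ in range(max_iters):
        diff = target_db - measure_room_db()
        if abs(diff) < threshold_db:               # difference below the preset threshold: stop
            return
        direction = 1.0 if diff > 0 else -1.0
        set_volume(get_volume() + direction * step_db)   # one adjustment of one step length
```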
In the embodiment, the initial volume value of the indoor sound acquired by the indoor acquisition device when the near-end video conference device plays the audio is acquired, the first distance between the indoor participant and the near-end video conference device is acquired, the target indoor volume value is determined based on the first distance, and the device volume of the near-end video conference device is adjusted based on the target indoor volume value and the initial volume value, so that the purpose of adjusting the device volume of the near-end video conference device based on the distance between the indoor participant and the near-end video conference device is achieved, and the sound quality heard by the participant in the video conference process is improved.
In an embodiment of the present application, as shown in fig. 9, the embodiment provides an audio processing method, including the steps of:
step 601, receiving candidate far-end voice data sent by each far-end video conference device in a plurality of far-end video conference devices.
Step 602, storing a part of the candidate far-end voice data in the plurality of candidate far-end voice data into a preset buffer area based on the playing order.
Step 603, obtaining near-end voice data collected by the near-end video conference device.
Step 604, extracting target far-end voice data with a preset data length from the preset buffer area, where the target far-end voice data is voice data transmitted by the far-end video conference device.
Step 605, obtain a first short-time average energy corresponding to the near-end voice data and a second short-time average energy corresponding to the target far-end voice data.
Step 606, obtaining a correlation based on the first short-time average energy and the second short-time average energy, and determining a system delay time corresponding to the near-end voice data based on the correlation.
Step 607, based on the system delay time, performing echo cancellation processing on the near-end voice data to obtain processed voice data.
Step 608, an initial volume value of the indoor sound collected by the indoor collection device when the near-end video conference device plays the audio is obtained.
Step 609, a first distance between the indoor participant and the near-end video conference device is obtained, and the target indoor volume value is determined based on the first distance.
Step 610, obtaining the step size of the volume adjustment of the device.
Step 611, the device volume of the near-end video conference device is adjusted for multiple times based on the device volume adjustment step length until the difference between the indoor sound volume value acquired by the indoor acquisition device after a certain adjustment and the target indoor volume value is smaller than the preset threshold.
It should be understood that, although the steps in the flowcharts of the above embodiments are displayed in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the steps are not strictly limited to that order and may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the present application further provides an audio processing apparatus for implementing the audio processing method mentioned above. The implementation scheme for solving the problem provided by the apparatus is similar to the implementation scheme described in the method, so specific limitations in one or more embodiments of the audio processing apparatus provided below may refer to the limitations on the audio processing method in the foregoing, and details are not described here again.
In one embodiment, as shown in fig. 10, there is provided an audio processing apparatus including: the device comprises a first acquisition module, a second acquisition module, a determination module and a elimination module, wherein:
the first acquisition module is used for acquiring near-end voice data acquired by near-end video conference equipment;
the second acquisition module is used for acquiring target far-end voice data, and the target far-end voice data is voice data transmitted by far-end video conference equipment;
the determining module is used for acquiring the correlation between the near-end voice data and the target far-end voice data and determining the system delay time corresponding to the near-end voice data based on the correlation;
and the eliminating module is used for carrying out echo elimination processing on the near-end voice data based on the system delay time to obtain the processed voice data.
In one embodiment, the determining module is specifically configured to:
acquiring first short-time average energy corresponding to near-end voice data and second short-time average energy corresponding to target far-end voice data;
and acquiring the correlation degree based on the first short-time average energy and the second short-time average energy.
In an embodiment, the second obtaining module is specifically configured to:
receiving candidate far-end voice data sent by each far-end video conference device in a plurality of far-end video conference devices;
and determining the playing sequence of each candidate far-end voice data, and acquiring target far-end voice data from the candidate far-end voice data based on the playing sequence.
In an embodiment, the second obtaining module is further specifically configured to:
storing part of the candidate far-end voice data in the plurality of candidate far-end voice data into a preset buffer area based on the playing sequence;
and extracting target far-end voice data with preset data length from a preset cache region.
In one embodiment, the apparatus is further configured to:
acquiring an initial volume value of indoor sound acquired by indoor acquisition equipment when near-end video conference equipment plays audio;
acquiring a first distance between an indoor participant and near-end video conference equipment, and determining a target indoor volume value based on the first distance;
adjusting the device volume of the near-end video conference device based on the target indoor volume value and the initial volume value.
In one embodiment, the apparatus is further specifically configured to:
acquiring the volume adjustment step length of the equipment;
and performing multiple times of adjustment on the equipment volume of the near-end video conference equipment based on the equipment volume adjustment step length until the difference value between the indoor sound volume value acquired by the indoor acquisition equipment after certain adjustment and the target indoor volume value is smaller than a preset threshold value.
In one embodiment, the audio processing apparatus is used in a video conference system composed of a plurality of far-end video conference devices and a near-end video conference device based on the DVTS plus topology structure diagram. The video conference system includes: the clients (comprising the plurality of far-end video conference devices and the near-end video conference device), which complete the collection, encoding, receiving, decoding, and playing of audio signals; a conference control server, which performs signaling communication with the clients and the forwarding servers, including client management (login and logout of the clients and the type of audio data to be received) and forwarding server management (updating the joining and leaving states of the clients to the forwarding servers in real time, and informing the forwarding servers of the received data list of each client); and the forwarding servers, which receive and forward the audio data.
In the embodiment of the application, as shown in fig. 11, the audio processing apparatus includes:
the signal acquisition and audio separation module is used for acquiring near-end voice data acquired by near-end video conference equipment;
the forwarding server cluster is used for receiving candidate far-end voice data sent by each far-end video conference device in the plurality of far-end video conference devices;
a received signal buffer for storing a part of the candidate far-end speech data in the plurality of candidate far-end speech data into a preset buffer based on the playing order;
the shunt selector is used for determining candidate far-end voice data corresponding to a received data list corresponding to the near-end video conference equipment according to the equipment identification information (namely the user name) of the near-end video conference equipment;
the reference signal direct delay buffer is used for extracting target far-end voice data with preset data length from a preset buffer area;
and the system delay estimator is used for acquiring first short-time average energy corresponding to the near-end voice data and second short-time average energy corresponding to the target far-end voice data, acquiring correlation based on the first short-time average energy and the second short-time average energy, and determining system delay time corresponding to the near-end voice data based on the correlation.
And the echo canceller is used for carrying out echo cancellation processing on the near-end voice data based on the system delay time to obtain the processed voice data.
The various modules in the audio processing device described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 12. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement an audio processing method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 12 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring near-end voice data acquired by near-end video conference equipment;
acquiring target far-end voice data, wherein the target far-end voice data is voice data transmitted by far-end video conference equipment;
obtaining the correlation between the near-end voice data and the target far-end voice data, and determining the system delay time corresponding to the near-end voice data based on the correlation;
and based on the system delay time, carrying out echo cancellation processing on the near-end voice data to obtain the processed voice data.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring first short-time average energy corresponding to near-end voice data and second short-time average energy corresponding to target far-end voice data; and acquiring the correlation degree based on the first short-time average energy and the second short-time average energy.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
receiving candidate far-end voice data sent by each far-end video conference device in a plurality of far-end video conference devices; and determining the playing sequence of each candidate far-end voice data, and acquiring target far-end voice data from the candidate far-end voice data based on the playing sequence.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
storing part of the candidate far-end voice data in the plurality of candidate far-end voice data into a preset buffer area based on the playing sequence; and extracting target far-end voice data with a preset data length from a preset cache region.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring an initial volume value of indoor sound acquired by indoor acquisition equipment when near-end video conference equipment plays audio; acquiring a first distance between an indoor participant and near-end video conference equipment, and determining a target indoor volume value based on the first distance; adjusting the device volume of the near-end video conference device based on the target indoor volume value and the initial volume value.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring the volume adjustment step length of the equipment; and performing multiple times of adjustment on the equipment volume of the near-end video conference equipment based on the equipment volume adjustment step length until the difference value between the indoor sound volume value acquired by the indoor acquisition equipment after certain adjustment and the target indoor volume value is smaller than a preset threshold value.
In one embodiment, a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring near-end voice data acquired by near-end video conference equipment;
acquiring target far-end voice data, wherein the target far-end voice data is voice data transmitted by far-end video conference equipment;
obtaining the correlation between the near-end voice data and the target far-end voice data, and determining the system delay time corresponding to the near-end voice data based on the correlation;
and based on the system delay time, carrying out echo cancellation processing on the near-end voice data to obtain the processed voice data.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring first short-time average energy corresponding to near-end voice data and second short-time average energy corresponding to target far-end voice data; and acquiring the correlation degree based on the first short-time average energy and the second short-time average energy.
In one embodiment, the computer program when executed by the processor further performs the steps of:
receiving candidate far-end voice data sent by each far-end video conference device in a plurality of far-end video conference devices; and determining the playing sequence of each candidate far-end voice data, and acquiring target far-end voice data from the candidate far-end voice data based on the playing sequence.
In one embodiment, the computer program when executed by the processor further performs the steps of:
storing part of the candidate far-end voice data in the plurality of candidate far-end voice data into a preset buffer area based on the playing sequence; and extracting target far-end voice data with preset data length from a preset cache region.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring an initial volume value of indoor sound acquired by indoor acquisition equipment when near-end video conference equipment plays audio; acquiring a first distance between an indoor participant and near-end video conference equipment, and determining a target indoor volume value based on the first distance; adjusting the device volume of the near-end video conference device based on the target indoor volume value and the initial volume value.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring the volume adjustment step length of the equipment; and performing multiple times of adjustment on the equipment volume of the near-end video conference equipment based on the equipment volume adjustment step length until the difference value between the indoor sound volume value acquired by the indoor acquisition equipment after certain adjustment and the target indoor volume value is smaller than a preset threshold value.
In one embodiment, a computer program product is provided, comprising a computer program which when executed by a processor performs the steps of:
acquiring near-end voice data acquired by near-end video conference equipment;
acquiring target far-end voice data, wherein the target far-end voice data is voice data transmitted by far-end video conference equipment;
obtaining the correlation between the near-end voice data and the target far-end voice data, and determining the system delay time corresponding to the near-end voice data based on the correlation;
and based on the system delay time, carrying out echo cancellation processing on the near-end voice data to obtain the processed voice data.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring first short-time average energy corresponding to near-end voice data and second short-time average energy corresponding to target far-end voice data; and acquiring the correlation degree based on the first short-time average energy and the second short-time average energy.
In one embodiment, the computer program when executed by the processor further performs the steps of:
receiving candidate far-end voice data sent by each far-end video conference device in a plurality of far-end video conference devices; and determining the playing sequence of each candidate far-end voice data, and acquiring target far-end voice data from the candidate far-end voice data based on the playing sequence.
In one embodiment, the computer program when executed by the processor further performs the steps of:
storing part of the candidate far-end voice data in the plurality of candidate far-end voice data into a preset cache region based on the playing sequence; and extracting target far-end voice data with preset data length from a preset cache region.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring an initial volume value of indoor sound acquired by indoor acquisition equipment when near-end video conference equipment plays audio; acquiring a first distance between an indoor participant and near-end video conference equipment, and determining a target indoor volume value based on the first distance; adjusting the device volume of the near-end video conference device based on the target indoor volume value and the initial volume value.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring the volume adjustment step length of the equipment; and performing multiple times of adjustment on the equipment volume of the near-end video conference equipment based on the equipment volume adjustment step length until the difference value between the indoor sound volume value acquired by the indoor acquisition equipment after certain adjustment and the target indoor volume value is smaller than a preset threshold value.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases. The non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the various embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, or the like.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method of audio processing, the method comprising:
acquiring near-end voice data acquired by near-end video conference equipment;
acquiring target far-end voice data, wherein the target far-end voice data is voice data transmitted by far-end video conference equipment;
acquiring a degree of correlation between the near-end voice data and the target far-end voice data, and determining a system delay time corresponding to the near-end voice data based on the degree of correlation;
and based on the system delay time, carrying out echo cancellation processing on the near-end voice data to obtain the processed voice data.
2. The method of claim 1, wherein acquiring the degree of correlation between the near-end voice data and the target far-end voice data comprises:
acquiring first short-time average energy corresponding to the near-end voice data and second short-time average energy corresponding to the target far-end voice data;
and acquiring the degree of correlation based on the first short-time average energy and the second short-time average energy.
3. The method of claim 1, wherein acquiring the target far-end voice data comprises:
receiving candidate far-end voice data sent by each of a plurality of far-end video conference devices;
and determining the playing order of each candidate far-end voice data, and acquiring the target far-end voice data from a plurality of candidate far-end voice data based on the playing order.
4. The method of claim 3, wherein acquiring the target far-end voice data from the plurality of candidate far-end voice data based on the playing order comprises:
storing part of the candidate far-end voice data in a preset buffer area based on the playing order;
and extracting the target far-end voice data with a preset data length from the preset buffer area.
5. The method of claim 1, further comprising:
acquiring an initial volume value of indoor sound acquired by indoor acquisition equipment when the near-end video conference equipment plays audio;
acquiring a first distance between an indoor participant and the near-end video conference equipment, and determining a target indoor volume value based on the first distance;
adjusting a device volume of the near-end video conference equipment based on the target indoor volume value and the initial volume value.
6. The method of claim 5, wherein adjusting the device volume of the near-end video conference equipment based on the target indoor volume value and the initial volume value comprises:
acquiring a device volume adjustment step length;
and adjusting the device volume of the near-end video conference equipment in multiple steps based on the device volume adjustment step length, until the difference between the indoor sound volume value acquired by the indoor acquisition equipment after a given adjustment and the target indoor volume value is smaller than a preset threshold value.
7. An audio processing apparatus, characterized in that the apparatus comprises:
a first acquisition module, configured to acquire near-end voice data acquired by near-end video conference equipment;
a second acquisition module, configured to acquire target far-end voice data, wherein the target far-end voice data is voice data transmitted by far-end video conference equipment;
a determining module, configured to acquire a degree of correlation between the near-end voice data and the target far-end voice data, and to determine a system delay time corresponding to the near-end voice data based on the degree of correlation;
and a cancellation module, configured to perform echo cancellation processing on the near-end voice data based on the system delay time to obtain processed voice data.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, realizes the steps of the method of any one of claims 1 to 6.
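Purely as an informal illustration of the delay-estimation step recited in claims 1 and 2 (and not as the claimed implementation), the following Python sketch correlates short-time average energy envelopes of the near-end and target far-end voice data to pick a system delay; the 10 ms frame length at 16 kHz, the search range, and the NumPy-based code are assumptions made for this sketch.

import numpy as np

def short_time_average_energy(signal, frame_len=160):
    # Average energy per frame of `frame_len` samples (10 ms at 16 kHz).
    n_frames = len(signal) // frame_len
    frames = np.asarray(signal[:n_frames * frame_len], dtype=np.float64).reshape(n_frames, frame_len)
    return np.mean(frames ** 2, axis=1)

def estimate_system_delay(near_end, far_end, frame_len=160, max_delay_frames=50):
    # Choose the frame offset at which the two energy envelopes are most correlated.
    e_near = short_time_average_energy(near_end, frame_len)
    e_far = short_time_average_energy(far_end, frame_len)
    best_delay, best_corr = 0, -np.inf
    for delay in range(max_delay_frames):
        n = min(len(e_near) - delay, len(e_far))
        if n <= 1:
            break
        corr = np.corrcoef(e_near[delay:delay + n], e_far[:n])[0, 1]
        if corr > best_corr:
            best_corr, best_delay = corr, delay
    return best_delay * frame_len      # system delay time, in samples

# The estimated delay could then be used to align the far-end reference with the
# near-end voice data before echo cancellation (e.g., by an adaptive filter).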
CN202210736921.9A 2022-06-27 2022-06-27 Audio processing method, apparatus, device, storage medium, and program product Pending CN115118919A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210736921.9A CN115118919A (en) 2022-06-27 2022-06-27 Audio processing method, apparatus, device, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210736921.9A CN115118919A (en) 2022-06-27 2022-06-27 Audio processing method, apparatus, device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN115118919A true CN115118919A (en) 2022-09-27

Family

ID=83331269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210736921.9A Pending CN115118919A (en) 2022-06-27 2022-06-27 Audio processing method, apparatus, device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN115118919A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116015993A (en) * 2022-12-13 2023-04-25 南京大鱼半导体有限公司 Audio signal processing method and terminal
CN117896469A (en) * 2024-03-15 2024-04-16 腾讯科技(深圳)有限公司 Audio sharing method, device, computer equipment and storage medium
CN117896469B (en) * 2024-03-15 2024-05-31 腾讯科技(深圳)有限公司 Audio sharing method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN115118919A (en) Audio processing method, apparatus, device, storage medium, and program product
US10104337B2 (en) Displaying a presenter during a video conference
US7808521B2 (en) Multimedia conference recording and manipulation interface
US11521638B2 (en) Audio event detection method and device, and computer-readable storage medium
CN110970045B (en) Mixing processing method, mixing processing device, electronic equipment and storage medium
CN110956969B (en) Live broadcast audio processing method and device, electronic equipment and storage medium
CN105376515B (en) Rendering method, the apparatus and system of communication information for video communication
US20140022402A1 (en) Method and apparatus for automatic capture of multimedia information
JP7473676B2 (en) AUDIO PROCESSING METHOD, APPARATUS, READABLE MEDIUM AND ELECTRONIC DEVICE
CN105099795A (en) Jitter buffer level estimation
WO2024060644A1 (en) Echo cancellation method and apparatus, and electronic device and storage medium
CN110660403B (en) Audio data processing method, device, equipment and readable storage medium
CN108833825B (en) Method, device, equipment and storage medium for determining speaker terminal in video conference
CN110061814A (en) A kind of voice delay jitter control method, device, electronic equipment and storage medium
CN113257267B (en) Method for training interference signal elimination model and method and equipment for eliminating interference signal
CN114979344A (en) Echo cancellation method, device, equipment and storage medium
CN114615381A (en) Audio data processing method and device, electronic equipment, server and storage medium
US11562761B2 (en) Methods and apparatus for enhancing musical sound during a networked conference
CN113038247A (en) Intelligent recording and broadcasting method for electronic whiteboard
CN110289013B (en) Multi-audio acquisition source detection method and device, storage medium and computer equipment
CN110730408A (en) Audio parameter switching method and device, electronic equipment and storage medium
CN114900730B (en) Method and device for acquiring delay estimation steady state value, electronic equipment and storage medium
CN111756723B (en) Audio processing method, device and equipment applied to multi-party call
CN117896469B (en) Audio sharing method, device, computer equipment and storage medium
CN111757159B (en) Multimedia data synchronization method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination