CN113628638A - Audio processing method, device, equipment and storage medium - Google Patents

Audio processing method, device, equipment and storage medium

Info

Publication number
CN113628638A
CN113628638A (application CN202110883330.XA)
Authority
CN
China
Prior art keywords
voice
signal
speech
signal parameter
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110883330.XA
Other languages
Chinese (zh)
Inventor
吴泰云
何宇术
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Haiyi Zhixin Technology Co Ltd
Original Assignee
Shenzhen Haiyi Zhixin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Haiyi Zhixin Technology Co Ltd filed Critical Shenzhen Haiyi Zhixin Technology Co Ltd
Priority to CN202110883330.XA priority Critical patent/CN113628638A/en
Publication of CN113628638A publication Critical patent/CN113628638A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants

Abstract

An embodiment of the invention relates to an audio processing method, apparatus, device and storage medium. The audio processing method includes: receiving, by a first device, a first voice and a second voice; determining a first signal parameter set corresponding to the first voice and a second signal parameter set corresponding to the second voice; comparing the signal strengths of the first signal parameters and the second signal parameters to determine the signal strengths of the first voice and the second voice, and selecting whichever of the first voice and the second voice has the greater signal strength as a target voice; performing voice activity detection on the first voice and the second voice to obtain a detection result; and controlling the current transmission voice of the first device according to the detection result and the target voice, the transmission voice being the first voice or the second voice. Whether the transmission voice needs to be switched is judged from the detected signal parameters, which avoids the voice fluctuation caused by switching directly upon receiving a voice, and voice activity detection prevents speech from being cut off, improving the stability of voice transmission.

Description

Audio processing method, device, equipment and storage medium
Technical Field
Embodiments of the present invention relate to the field of audio processing, and in particular, to an audio processing method, apparatus, device, and storage medium.
Background
At present, with the growing demand for telecommunication, conference systems have gradually become the main medium for public teleconferences. Traditional conference phones, however, are bulky and poorly portable, so related conference-system products have trended toward miniaturization in recent years, and more and more users choose small portable conference speakerphones for remote multi-person meetings.
In the related art, a single conference speakerphone covers only a small area and has limited loudspeaker power. In a large conference room, at least two speakerphones may therefore be connected so that all participants can hear the meeting content. However, audio interference can occur between the speakerphones (for example, speakerphone A simultaneously receives the sound picked up by its own microphone and the sound that speakerphone B sends to it; if both voices are transmitted at the same time, audio interference appears at speakerphone A), which degrades the audio-conference experience.
Disclosure of Invention
In view of the above, to solve the technical problems or some of the technical problems, embodiments of the present invention provide an audio processing method, apparatus, device and storage medium.
In a first aspect, an embodiment of the present invention provides an audio processing method, including:
the method comprises the steps that first voice and second voice are received by first equipment, wherein the first voice is the voice received by the first equipment, and the second voice is the voice received by the second equipment and sent to the first equipment;
determining a first signal parameter set corresponding to the first voice and a second signal parameter set corresponding to the second voice, wherein the first signal parameter set comprises a plurality of first signal parameters, the second signal parameter set comprises a plurality of second signal parameters, and the first signal parameters and the second signal parameters are both used for indicating signal strength and weakness;
comparing the signal strengths of the first signal parameters and the second signal parameters to determine the signal strengths of the first voice and the second voice, and selecting whichever of the first voice and the second voice has the greater signal strength as a target voice;
performing voice activity detection on the first voice and the second voice to obtain a detection result;
and controlling the current transmission voice of the first equipment according to the detection result and the target voice, wherein the transmission voice is the first voice or the second voice.
In a possible implementation manner, the controlling the current transmission voice of the first device according to the detection result and the target voice includes:
if the detection result is that a speech frame exists in the first speech or the second speech, controlling the first equipment to keep the current transmission speech unchanged; if the detection result indicates that no speech frame exists in the first speech and the second speech, judging whether the target speech is the same as the transmission speech; when the target voice is different from the transmission voice, controlling the first equipment to switch the current transmission voice into the target voice; and controlling the first equipment to keep the current transmission voice unchanged when the target voice is the same as the transmission voice.
In one possible embodiment, the determining a first set of signal parameters corresponding to the first speech and a second set of signal parameters corresponding to the second speech includes:
preprocessing the first voice and the second voice to obtain a multi-frame voice signal corresponding to the first voice and a multi-frame voice signal corresponding to the second voice, wherein the preprocessing at least comprises: framing processing and downsampling processing;
determining a first signal parameter corresponding to each frame of voice signal in the first voice to obtain the first signal parameter set, wherein each first signal parameter in the first signal parameter set carries first timing information, and the first timing information is the same as the timing of each frame of voice signal in the first voice;
determining a second signal parameter corresponding to each frame of voice signal in the second voice to obtain a second signal parameter set, where each second signal parameter in the second signal parameter set carries second timing information, and the second timing information is the same as the timing of each frame of voice signal in the second voice.
In one possible embodiment, the comparing the signal strengths of the first signal parameter and the second signal parameter to determine the signal strengths of the first voice and the second voice, and selecting whichever of the first voice and the second voice has the greater signal strength as the target voice includes:
comparing, under the condition that the first timing information is consistent with the second timing information, the signal strength corresponding to the first signal parameter of each frame of voice signal in the first voice with the signal strength corresponding to the second signal parameter of each frame of voice signal in the second voice;
and determining, from the first voice and the second voice, the target voice as the one whose signal parameters corresponding to one or more consecutive frames of voice signals indicate the greatest signal strength.
In one possible embodiment, the first signal parameter comprises a signal-to-noise ratio and the second signal parameter comprises a signal-to-noise ratio;
the determining a first signal parameter corresponding to each frame of voice signal in the first voice comprises:
filtering each frame of voice signal in the first voice by adopting wiener filtering;
determining a first signal-to-noise ratio corresponding to each frame of voice signals in the first voice after filtering;
the determining a second signal parameter corresponding to each frame of voice signal in the second voice includes:
filtering each frame of voice signal in the second voice by adopting wiener filtering;
and determining a second signal-to-noise ratio corresponding to each frame of voice signals in the second voice after filtering.
In one possible embodiment, the method further comprises:
when determining that the first voice and the second voice have delay according to the time sequence information, determining the delay time between the first voice and the second voice through a cross-correlation function;
performing a time alignment operation of the first voice and the second voice based on the delay time.
In one possible embodiment, the method further comprises: and in the process of executing switching operation of the current transmission voice of the first equipment, controlling the current transmission voice to fade out and controlling the target voice to fade in so as to smooth the process of switching from the transmission voice to the target voice.
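The fade-out/fade-in smoothing can be sketched as a linear crossfade (an illustrative sketch; the linear ramp, window length, and function name are assumptions not taken from the embodiment):

```python
import numpy as np

def crossfade(old_voice, new_voice):
    """Fade out the current transmission voice while fading in the
    target voice over one switch-over window.

    Both arguments are 1-D float arrays of equal length (the samples
    covering the switching window)."""
    ramp = np.linspace(0.0, 1.0, len(old_voice))  # 0 -> 1 across the window
    # The old voice's weight falls from 1 to 0 as the new voice's rises.
    return old_voice * (1.0 - ramp) + new_voice * ramp
```

At the start of the window the output is entirely the old transmission voice; at the end it is entirely the target voice, so the switch produces no audible step.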
In a second aspect, an embodiment of the present invention provides an audio processing apparatus, including:
the device comprises a receiving module, a processing module and a processing module, wherein the receiving module is used for receiving a first voice and a second voice, the first voice is a voice received by first equipment, and the second voice is a voice received by second equipment and sent to the first equipment;
a first determining module, configured to determine a first signal parameter set corresponding to the first voice and a second signal parameter set corresponding to the second voice, where the first signal parameter set includes a plurality of first signal parameters, the second signal parameter set includes a plurality of second signal parameters, and both the first signal parameter and the second signal parameter are used to indicate signal strength and weakness;
the second determining module is used for comparing the signal intensity of the first signal parameter with the signal intensity of the second signal parameter to determine the signal intensity of the first voice and the second voice, and selecting one of the first voice and the second voice with the maximum signal intensity as a target voice;
the detection module is used for carrying out voice activity detection on the first voice and the second voice to obtain a detection result;
and the control module is used for controlling the current transmission voice of the first equipment according to the detection result and the target voice, wherein the transmission voice is the first voice or the second voice.
In a third aspect, an embodiment of the present invention provides an apparatus, including: a processor and a memory, the processor being configured to execute an audio processing program stored in the memory to implement the audio processing method of any of the first aspects.
In a fourth aspect, an embodiment of the present invention provides a storage medium, where one or more programs are stored, and the one or more programs are executable by one or more processors to implement the audio processing method according to any one of the first aspects.
According to the audio processing scheme provided by the embodiments of the invention, a first voice and a second voice are received by a first device; a first signal parameter set corresponding to the first voice and a second signal parameter set corresponding to the second voice are determined; the signal strengths of the first signal parameters and the second signal parameters are compared to determine the signal strengths of the first voice and the second voice, and whichever of the two has the greater signal strength is selected as a target voice; voice activity detection is performed on the target voice and a third voice, the third voice being the transmission voice of the first device; and the transmission voice of the first device is controlled according to the detection result. Whether the transmission voice needs to be switched is judged from the detected signal parameters, which avoids voice fluctuation caused by switching directly upon receiving a voice, and voice activity detection prevents speech from being cut off, improving the stability of voice transmission.
Drawings
Fig. 1 is a schematic flowchart of an audio processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another audio processing method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For the convenience of understanding of the embodiments of the present invention, the following description will be further explained with reference to specific embodiments, which are not to be construed as limiting the embodiments of the present invention.
Fig. 1 is a schematic flowchart of an audio processing method according to an embodiment of the present invention, and as shown in fig. 1, the method specifically includes:
s11, the first device receives a first voice and a second voice, where the first voice is a voice received by the first device, and the second voice is a voice received by the second device and sent to the first device.
The audio processing method provided by the embodiment of the invention is applied to the field of audio and video conferences, and the signal intensity detection and the voice activity detection are carried out on at least two voices received by the current equipment, so that whether the current equipment needs to execute the voice transmission switching or not is determined.
In this embodiment, the first device and the second device are devices having an audio transceiving function in an audio/video conference, and the first device and the second device may be disposed in the same conference room or different conference rooms, where the first device is a master device, the second device is a slave device, the number of the first devices is one, the number of the second devices may be one or more, and the second device and the first device establish a wired (e.g., a network cable, a data cable, etc.) or wireless (e.g., a local area network, etc.) connection.
The method comprises the steps that the first equipment receives external voice and voice sent to the first equipment by the second equipment in real time, the first voice is used as the external voice received by the first equipment, and the second voice is used as the voice received by the second equipment and sent to the first equipment.
Further, switching between the voices becomes relevant only after the first device has received both voices; that is, it determines which voice the first device plays and which voice the first device sends to the second device for the second device to play.
S12, determining a first signal parameter set corresponding to the first voice and a second signal parameter set corresponding to the second voice, wherein the first signal parameter set comprises a plurality of first signal parameters, the second signal parameter set comprises a plurality of second signal parameters, and the first signal parameters and the second signal parameters are both used for indicating signal strength.
In this embodiment, the first voice and the second voice received by the first device are framed, that is, the first voice and the second voice are split according to a fixed time length, so as to obtain a plurality of first voice frame sets corresponding to the first voice and a plurality of second voice frame sets corresponding to the second voice.
Further, a signal parameter is obtained for each voice frame in the first voice frame set and in the second voice frame set; the signal parameter indicates the strength of the signal and may be a signal-to-noise ratio or a signal strength. This yields a first signal parameter set corresponding to the first voice frame set and a second signal parameter set corresponding to the second voice frame set.
S13, comparing the signal strengths of the first signal parameters and the second signal parameters to determine the signal strengths of the first voice and the second voice, and selecting whichever of the first voice and the second voice has the greater signal strength as the target voice.
For the first signal parameter set corresponding to the first voice and the second signal parameter set corresponding to the second voice, the signal parameters are compared. Because the first voice corresponds to the first signal parameters and the second voice corresponds to the second signal parameters, comparing the signal strengths of the first signal parameters and the second signal parameters determines the relative signal quality of the first voice and the second voice, and therefore whether the first voice or the second voice is taken as the target voice; the signal quality here may be compared in terms of signal-to-noise ratio or in terms of volume. The target voice can be understood as the voice that may need to be switched to, although further judgment on the target voice is still required.
The comparison of the signal parameters can be understood as comparing, in time order or another form, the signal parameters of the first voice frame set corresponding to the first voice with those of the second voice frame set corresponding to the second voice. Specifically, the signal strengths corresponding to the signal parameters of voice frames with the same timing in the first voice frame set and the second voice frame set are compared.
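A minimal sketch of this frame-by-frame comparison follows (illustrative only; the function name and the majority-vote tie-breaking are assumptions, since the embodiment does not specify how the per-frame results are aggregated):

```python
def pick_target(first_params, second_params):
    """Given time-aligned per-frame signal parameters (e.g. SNR values)
    for the first and second voice, return which voice wins more of the
    frame-by-frame strength comparisons over the window."""
    first_wins = sum(p1 > p2 for p1, p2 in zip(first_params, second_params))
    # Ties and losses both count toward the second voice in this sketch.
    return "first" if first_wins > len(first_params) - first_wins else "second"
```

For example, with per-frame SNRs of [10, 12, 9] dB against [8, 7, 11] dB, the first voice wins two of three frames and is chosen as the target voice.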
And S14, carrying out voice activity detection on the first voice and the second voice to obtain a detection result.
S15, controlling the current transmission voice of the first device according to the detection result and the target voice, wherein the transmission voice is the first voice or the second voice.
In this embodiment, before controlling the current transmission Voice of the first device, Voice Activity Detection (VAD) needs to be performed on the first Voice and the second Voice to obtain a Detection result; the VAD detection process can be understood as performing speech frame detection on speech, that is, determining whether there is a conference participant speaking in the speech.
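The embodiment does not fix a particular VAD algorithm; as an illustrative stand-in, a simple energy-threshold detector over the framed signal might look like this (the threshold value and function name are assumptions):

```python
import numpy as np

def has_speech_frame(frames, energy_threshold=1e-3):
    """Return True if any frame's mean power exceeds the threshold,
    i.e. the voice is judged to contain at least one speech frame."""
    return any(float(np.mean(np.square(f))) > energy_threshold for f in frames)
```

Real conference devices typically use statistical or model-based VAD rather than a fixed energy threshold, but the interface, frames in and a speech/no-speech decision out, is the same.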
If the detection result is that a speech frame exists in the first voice or the second voice, the first device is controlled to keep the current transmission voice unchanged; if the detection result is that no speech frame exists in either the first voice or the second voice, the first device is controlled to perform the operation of switching the transmission voice to the target voice.
Specifically, through VAD detection on the first voice and the second voice: if the detection result is that a speech frame exists in the first voice or the second voice, it indicates that a conference participant is speaking in the voice received by the first device or the second device; at this time the current transmission voice of the first device is the first voice or the second voice, and to keep that transmission uninterrupted, the first device is controlled to keep the current transmission voice unchanged. If the detection result is that no speech frame exists in either voice, it indicates that no conference participant is speaking in the voices received by the first device and the second device, and the signal parameter of the target voice is the best; the first device is then controlled to perform the operation of switching the transmission voice to the target voice. Switching the transmission voice to the voice with the best signal parameter without losing any participant's speech improves the experience of the whole audio and video conference.
According to the audio processing method provided by the embodiment of the invention, a first voice and a second voice are received by a first device; a first signal parameter set corresponding to the first voice and a second signal parameter set corresponding to the second voice are determined; the signal strengths of the first signal parameters and the second signal parameters are compared to determine the signal strengths of the first voice and the second voice, and whichever of the two has the greater signal strength is selected as a target voice; voice activity detection is performed on the target voice and a third voice, the third voice being the transmission voice of the first device; and the transmission voice of the first device is controlled according to the detection result. Whether the transmission voice needs to be switched is judged from the detected signal parameters, which avoids voice fluctuation caused by switching directly upon receiving a voice, and voice activity detection prevents speech from being cut off, improving the stability of voice transmission.
Fig. 2 is a schematic flow chart of another audio processing method according to an embodiment of the present invention, as shown in fig. 2, the method specifically includes:
s21, the first device receives a first voice and a second voice, where the first voice is a voice received by the first device, and the second voice is a voice received by the second device and sent to the first device.
In this embodiment, the description takes as an example the case where the first device and the second device are disposed in the same conference room and the number of second devices is 1 (accordingly, the number of second voices is also 1). It should be noted that when the number of second devices is greater than 1, the determination process and the control logic are similar to the single-second-device case, and reference may be made to the related description of this embodiment.
Further, the first device receives external voice and voice sent to the first device by the second device in real time, the first voice is used as the external voice received by the first device, and the second voice is used as the voice received by the second device and sent to the first device.
The first voice and the second voice may each be a set comprising one or more speech frames, and the length of a speech frame is determined by framing the voice into fixed time intervals, for example, 20 ms per speech frame.
It should be noted that when the first device and the second device are connected in a wired manner, the voice delay between them can be ignored; when they are connected wirelessly, a voice delay may exist between the first device and the second device, making it difficult to obtain an accurate comparison between the first voice and the second voice.
Specifically, when determining that the first voice and the second voice have delay according to timing information (the timing information refers to the timing of each frame of voice signal in the first voice or the second voice), determining the delay time between the first voice and the second voice through a cross-correlation function; performing a time alignment operation of the first voice and the second voice based on the delay time.
The cross-correlation function may be:

G(w) = X1(w) · X2*(w)

wherein X1 and X2 are the signals of the first voice and the second voice after fast Fourier transform, and w is the frequency-domain variable; that is, X1 is the first signal of the first voice in the frequency domain, X2 is the second signal of the second voice in the frequency domain, and X2* denotes the complex conjugate of X2.

Then, according to the above formula, an inverse Fourier operation is performed on the product of the first signal and the second signal, converting the frequency-domain signal into a time-domain signal, specifically:

R(t) = IFFT(G(w)) = IFFT(X1(w) · X2*(w))

When R takes its maximum value, the optimal delay time t0 is obtained:

t0 = argmax R(t)
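The FFT-based delay estimation described above can be sketched as follows (an illustrative sketch; the function name and the sign convention, where a positive result means the first signal lags the second, are assumptions):

```python
import numpy as np

def estimate_delay(x1, x2):
    """Estimate, in samples, how far x1 lags behind x2 by locating the
    peak of their cross-correlation, computed in the frequency domain."""
    n = len(x1) + len(x2) - 1                 # full correlation length
    X1 = np.fft.rfft(x1, n)                   # first signal, frequency domain
    X2 = np.fft.rfft(x2, n)                   # second signal, frequency domain
    r = np.fft.irfft(X1 * np.conj(X2), n)     # R(t) = IFFT(X1 * conj(X2))
    # Reorder the circular result so lags run from -(len(x2)-1) to len(x1)-1.
    r = np.concatenate((r[-(len(x2) - 1):], r[:len(x1)]))
    lags = np.arange(-(len(x2) - 1), len(x1))
    return int(lags[np.argmax(r)])            # delay at the correlation peak
```

For equal-length inputs where the first voice is a copy of the second delayed by d samples, the function returns d; the result can then drive the time-alignment operation of S21.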
S22, preprocessing the first voice and the second voice to obtain a multi-frame voice signal corresponding to the first voice and a multi-frame voice signal corresponding to the second voice, where the preprocessing at least includes: framing processing and downsampling processing.
In this embodiment, to improve the computational efficiency of the whole transmission-voice switching process, the first voice and the second voice are preprocessed. The preprocessing may include framing and downsampling: framing can be understood as splitting the voice by a fixed time length so that the time intervals of the resulting speech frames are consistent, and downsampling may include sampling the voice at a preset frequency so that the sampled speech frames share the same sampling rate.
In one example of this embodiment, the first voice and the second voice are framed at 20 ms per speech frame and downsampled to 8 kHz.
The framing interval and the downsampling frequency may be set according to actual requirements, which is not specifically limited in this embodiment.
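The 20 ms / 8 kHz example can be sketched as follows (illustrative only; the 48 kHz input rate and plain decimation without an anti-aliasing filter are simplifying assumptions, and the function name is hypothetical):

```python
import numpy as np

def preprocess(signal, in_rate=48000, out_rate=8000, frame_ms=20):
    """Downsample by integer decimation, then split into fixed-length frames.

    A production system would low-pass filter before decimating to avoid
    aliasing; plain slicing keeps the sketch short."""
    step = in_rate // out_rate                 # assumes an integer rate ratio
    down = signal[::step]
    frame_len = out_rate * frame_ms // 1000    # 160 samples at 8 kHz, 20 ms
    n_frames = len(down) // frame_len          # drop any trailing partial frame
    return down[:n_frames * frame_len].reshape(n_frames, frame_len)
```

One second of 48 kHz audio thus becomes 50 frames of 160 samples each.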
S23, determining a first signal parameter corresponding to each frame of voice signal in the first voice to obtain a first signal parameter set, where each first signal parameter in the first signal parameter set carries first timing information, and the first timing information is the same as the timing of each frame of voice signal in the first voice.
After framing the first voice, obtaining a voice frame set with time sequence information, wherein the set comprises one or more frames of voice signals, and obtaining a first signal parameter corresponding to each frame of voice signal in the set so as to further obtain a first signal parameter set corresponding to the first voice.
The voice frames after framing carry timing information, so the obtained first signal parameter set also carries timing information; that is, each first signal parameter in the first signal parameter set carries first timing information, and the first timing information is the same as the timing of each frame of voice signal in the first voice.
In this embodiment, the first signal parameter may be a signal-to-noise ratio, and accordingly, the acquiring the first signal parameter information may include: filtering each frame of voice signal in the first voice by adopting wiener filtering; and determining a first signal-to-noise ratio corresponding to each frame of voice signals in the first voice after filtering.
Specifically, in practical applications, excessive noise makes the raw signal-to-noise ratio of the voice inaccurate. Therefore, noise estimation is performed on each frame of the received voice signal, the speech probability is corrected to obtain noise-reduction parameters, and a filter is used to obtain a relatively accurate signal-to-noise ratio; that is, the signal-to-noise ratio is first estimated and then refined by filtering.
In an example of the embodiment of the invention, the signal-to-noise ratio may be estimated in the manner of "Robust Signal-to-Noise Ratio Estimation Based on Waveform Distribution Analysis", in which speech frames are modeled by a Gamma distribution and noise is modeled by a Gaussian function (wiener filtering may be applied in advance: noise estimation is performed on each frame of the received voice signal, and the speech probability is then corrected to obtain noise-reduction parameters), yielding an estimated signal-to-noise ratio; Kalman filtering is then applied to the estimated signal-to-noise ratio to obtain a relatively accurate signal-to-noise ratio.
In an alternative of the embodiment of the present invention, when there is no noise in the environment of the audio/video conference, or the noise is small enough to be negligible, the first signal parameter may be the signal strength, and the corresponding signal strength value may be obtained directly from each frame of voice signal in the first voice.
S24, determining a second signal parameter corresponding to each frame of voice signal in the second voice, to obtain a second signal parameter set, where each second signal parameter in the second signal parameter set carries second timing information, and the timing of the second timing information is the same as the timing of each frame of voice signal in the second voice.
The step of obtaining the corresponding second signal parameter set according to the second voice is similar to the step of obtaining the first signal parameter set according to the first voice; for brevity, reference may be made to the related description in S22, and details are not repeated here.
S25, comparing, when the first timing information is consistent with the second timing information, a signal strength corresponding to a first signal parameter of each frame of speech signal in the first speech with a signal strength corresponding to a second signal parameter of each frame of speech signal in the second speech.
In this embodiment, according to the first signal parameter set and the second signal parameter set carrying timing information obtained in the foregoing steps, when the first timing information is consistent with the second timing information (i.e., there is no delay), the signal strength corresponding to the first signal parameter of each frame of voice signal in the first voice is compared with the signal strength corresponding to the second signal parameter of each frame of voice signal in the second voice; the comparison may be of signal-to-noise ratio or of signal strength magnitude.
In an example, the voice frame set corresponding to the first voice includes: voice frame A1 (20 ms), voice frame A2 (40 ms), voice frame A3 (60 ms), and so on; the voice frame set corresponding to the second voice includes: voice frame B1 (20 ms), voice frame B2 (40 ms), voice frame B3 (60 ms), and so on. The signal strength corresponding to the signal parameter of voice frame A1 is compared with that of voice frame B1, and so on frame by frame.
And S26, determining, from the first voice or the second voice, the target voice whose signal parameters over one or more consecutive frames of voice signals correspond to the maximum signal strength.
In this embodiment, in the process of comparing the signal strength corresponding to the first signal parameter of each frame of voice signal in the first voice with the signal strength corresponding to the second signal parameter of each frame of voice signal in the second voice, the voice in which the signal strength corresponding to the signal parameters of one frame or multiple consecutive frames of voice signals is the largest is determined from the first voice or the second voice and taken as the target voice.
In an example, if the signal strength corresponding to the signal parameters of five consecutive voice frames of one voice is the maximum, that voice is taken as the target voice; that is, if the signal strength corresponding to a voice's signal parameters remains the maximum within 1 s, the voice is taken as the target voice.
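The frame-by-frame comparison of S25 and the consecutive-frame selection of S26 might be sketched as follows; `run=5` mirrors the five-consecutive-frames example, and the two parameter lists are assumed already time-aligned:

```python
def pick_target(params_a, params_b, run=5):
    """Compare frame-aligned signal parameters of two voices and pick
    the one whose parameter is larger for `run` consecutive frames.
    Returns 'A', 'B', or None if neither wins a full run."""
    streak_a = streak_b = 0
    for pa, pb in zip(params_a, params_b):  # same timing assumed (S25)
        if pa > pb:
            streak_a, streak_b = streak_a + 1, 0
        elif pb > pa:
            streak_a, streak_b = 0, streak_b + 1
        else:
            streak_a = streak_b = 0
        if streak_a >= run:   # S26: sustained maximum wins
            return 'A'
        if streak_b >= run:
            return 'B'
    return None
```

Requiring a sustained run rather than a single-frame win keeps momentary fluctuations from triggering a candidate switch.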
After the target voice is determined, to avoid unnecessary switching (the largest signal parameter does not necessarily represent the voice that should currently be transmitted), VAD detection also needs to be performed on the target voice and the current transmission voice.
And S27, carrying out voice activity detection on the first voice and the second voice to obtain a detection result.
S28, if the detection result indicates that there is a speech frame in the first speech or the second speech, controlling the first device to keep the current transmission speech unchanged.
In this embodiment, VAD detection is performed on the first voice and the second voice. The detection process determines whether a speech frame exists in the first voice or the second voice (i.e., whether a conference participant is speaking), including: judging whether a short pause between adjacent words or syllables belongs to the same utterance. For example, if the currently detected audio is "today the weather is good", it is checked whether the gap within or between words such as "weather" is merely a brief pause in the same utterance, or whether a genuine silent period with no audio input is detected. Through VAD detection, when a speech frame exists in either the first voice or the second voice — that is, although the target voice has the best signal strength, a conference participant is still speaking in the first voice or the second voice — the transmission of that voice must not be interrupted. Therefore, the transmission voice is not switched: the first device is controlled to keep the current transmission voice unchanged, and the next speech frames of the first voice and the second voice still need to be judged. This avoids abrupt sentence breaks; for example, a word such as "weather" will not be cut off halfway through its pronunciation, leaving an incomplete syllable, because of a voice switch.
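The embodiment does not fix a particular VAD algorithm. A minimal energy-threshold VAD with a hangover (so that short pauses inside one utterance are not mistaken for silence) could look like the sketch below; the threshold and hangover values are illustrative assumptions:

```python
def vad_has_speech(frames, energy_threshold=0.01, hangover=3):
    """Minimal energy-based VAD sketch: a frame counts as speech if its
    mean power exceeds the threshold; `hangover` frames bridge short
    gaps inside one utterance (e.g. between syllables of a word), so a
    mid-word pause is not reported as end of speech."""
    remaining = 0
    has_speech = False
    for frame in frames:
        power = sum(s * s for s in frame) / len(frame)
        if power >= energy_threshold:
            remaining = hangover
            has_speech = True
        elif remaining > 0:
            remaining -= 1
            has_speech = True   # still inside the same utterance
    return has_speech
```

A production system would typically use a statistical or model-based VAD instead, but the hangover idea — bridging brief intra-utterance gaps — is the part this step relies on.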
S29, if the detection result is that no speech frame exists in the first speech and the second speech, judging whether the target speech is the same as the transmission speech; when the target voice is different from the transmission voice, controlling the first equipment to switch the current transmission voice into the target voice; and controlling the first equipment to keep the current transmission voice unchanged when the target voice is the same as the transmission voice.
In this embodiment, when VAD detects that no speech frame exists in either the first voice or the second voice — that is, the target voice has the best signal strength and no conference participant is speaking in either voice — the transmission voice may be switched. The specific switching process requires comparing the current transmission voice of the first device with the target voice.
Specifically, when the detection result is that no speech frame exists in the first voice and the second voice, the step of judging whether the target voice is the same as the transmission voice is executed. When the target voice is different from the transmission voice (for example, the target voice is the first voice and the transmission voice is the second voice, or the target voice is the second voice and the transmission voice is the first voice), the first device is controlled to switch the current transmission voice to the target voice; when the target voice is the same as the transmission voice (i.e., both are the first voice or both are the second voice), the first device is controlled to keep the current transmission voice unchanged.
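The keep-or-switch rule of S28-S29 reduces to a small decision function; this is a sketch of the control flow only, with illustrative names:

```python
def switch_decision(current, target, speech_in_first, speech_in_second):
    """Return (voice_to_transmit, action) per S28/S29:
    never switch while either voice still contains speech frames;
    otherwise switch only when the target differs from the current
    transmission voice."""
    if speech_in_first or speech_in_second:
        return current, 'keep'       # S28: someone is still speaking
    if target != current:
        return target, 'switch'      # S29: safe to hand over
    return current, 'keep'           # S29: already transmitting the target
```

The function is evaluated once per frame (or per decision interval), so a switch is only ever executed in a gap that VAD confirms is genuine silence.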
In an alternative of the embodiment of the present invention, during the switching operation of the current transmission voice of the first device, the current transmission voice is controlled to fade out and the target voice is controlled to fade in, so that the process of switching from the transmission voice to the target voice is smooth.
It should be noted that: the condition for controlling the current transmission voice to perform the handover to the target voice may include: the target voice is a first voice, and the transmission voice is a second voice; or the target voice is the second voice, and the transmission voice is the first voice; that is, the process may be understood as a process of switching the first voice to the second voice, or a process of switching the second voice to the first voice.
Specifically, during the switching process, the current transmission voice is controlled to fade out and the target voice is controlled to fade in. Fading out means that the voice signal corresponding to the transmission voice gradually changes from its current volume to silence; fading in means that the voice signal corresponding to the target voice gradually changes from silence to its normal volume. The purpose of the fade-out and fade-in is to make the switch from the transmission voice to the target voice smooth, so that users experience the switch imperceptibly and problems such as audible glitches caused by a hard switch are avoided, improving user experience. The fade duration may be set according to actual needs (for example, 20 ms); this embodiment is not particularly limited in this respect.
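A linear crossfade over a short window (e.g. the 20 ms mentioned above) is one way to realize the simultaneous fade-out and fade-in; the embodiment does not specify the fade curve, so the linear ramp here is an assumption:

```python
def crossfade(old_tail, new_head):
    """Blend the tail of the outgoing transmission voice with the head
    of the target voice over one window: the old signal ramps from full
    gain to zero while the new signal ramps from zero to full gain.
    Both inputs must be the same length (the fade window in samples)."""
    n = len(old_tail)
    return [old_tail[i] * (1 - i / n) + new_head[i] * (i / n)
            for i in range(n)]
```

At 16 kHz, a 20 ms window corresponds to 320 samples; equal-power (square-root) ramps are a common alternative when constant perceived loudness matters.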
According to the audio processing method provided by the embodiment of the present invention, a first device receives a first voice and a second voice; a first signal parameter set corresponding to the first voice and a second signal parameter set corresponding to the second voice are determined; the signal strengths of the first signal parameters and the second signal parameters are compared to determine the signal strengths of the first voice and the second voice, and the one of the first voice and the second voice with the maximum signal strength is selected as the target voice; voice activity detection is performed on the target voice and a third voice, where the third voice is the transmission voice of the first device; and the transmission voice of the first device is controlled according to the detection result. By combining signal parameter detection with voice activity detection to decide whether the transmission voice needs to be switched, the voice with the better signal quality is adopted as the voice to be switched to, and speech frame detection judges whether a conference participant is speaking, which avoids the poor conference experience caused by switching the received voice directly; when the transmission voice does need to be switched, the fade-out and fade-in approach makes the switching process smooth, improving the user experience of the conference participants.
In practical applications, multiple audio receiving devices are often placed in a large conference room, and a participant may walk while speaking. When the participant is close to the first device, the scheme of this embodiment selects the first voice by comparison and plays the first voice; as the participant gradually approaches the second device while walking, the scheme selects the higher-quality second voice by comparison and plays the second voice. Because speech frames are detected by VAD, even if the participant speaks continuously while walking, no voice is interrupted abruptly, the volume remains stable throughout, and the volume does not drop as the distance increases.
Fig. 3 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present invention, and as shown in fig. 3, the apparatus specifically includes:
a receiving module 31, configured to receive a first voice and a second voice, where the first voice is a voice received by a first device, and the second voice is a voice received by a second device and sent to the first device;
a first determining module 32, configured to determine a first signal parameter set corresponding to the first voice and a second signal parameter set corresponding to the second voice, where the first signal parameter set includes a plurality of first signal parameters, the second signal parameter set includes a plurality of second signal parameters, and both the first signal parameter and the second signal parameter are used for indicating signal strength and weakness;
a second determining module 33, configured to compare signal strengths of the first signal parameter and the second signal parameter to determine signal strengths of the first voice and the second voice, and select one of the first voice and the second voice with the highest signal strength as a target voice;
a detection module 34, configured to perform voice activity detection on the first voice and the second voice to obtain a detection result;
a control module 35, configured to control the first device to keep the current transmission voice unchanged if the detection result indicates that a speech frame exists in the first voice or the second voice; if the detection result indicates that no speech frame exists in the first speech and the second speech, judging whether the target speech is the same as the transmission speech; when the target voice is different from the transmission voice, controlling the first equipment to switch the current transmission voice into the target voice; and controlling the first equipment to keep the current transmission voice unchanged when the target voice is the same as the transmission voice.
In a possible embodiment, the control module 35 is specifically configured to control the first device not to perform the switching operation of the transmission voice if the detection result indicates that a voice frame exists in the target voice or a voice frame exists in the third voice; and if the detection result indicates that no speech frame exists in the target speech and the third speech, controlling the first device to execute switching of the transmission speech from the third speech to the target speech.
In a possible implementation manner, the first determining module 32 is specifically configured to preprocess the first voice and the second voice to obtain a multi-frame voice signal corresponding to the first voice and a multi-frame voice signal corresponding to the second voice, where the preprocessing at least includes framing processing and downsampling processing; determine a first signal parameter corresponding to each frame of voice signal in the first voice to obtain a first signal parameter set, where each first signal parameter in the first signal parameter set carries first timing information, and the first timing information is the same as the timing of each frame of voice signal in the first voice; and determine a second signal parameter corresponding to each frame of voice signal in the second voice to obtain a second signal parameter set, where each second signal parameter in the second signal parameter set carries second timing information, and the second timing information is the same as the timing of each frame of voice signal in the second voice.
In a possible embodiment, the second determining module 33 is specifically configured to, when the first timing information is consistent with the second timing information, compare the signal strength corresponding to the first signal parameter of each frame of voice signal in the first voice with the signal strength corresponding to the second signal parameter of each frame of voice signal in the second voice; and determine, from the first voice or the second voice, the target voice in which the signal strength corresponding to the signal parameters of one or more consecutive frames of voice signals is the maximum.
In one possible embodiment, the first signal parameter comprises a signal-to-noise ratio and the second signal parameter comprises a signal-to-noise ratio;
the first determining module 32 is specifically configured to perform filtering processing on each frame of voice signal in the first voice by using wiener filtering; determining a first signal-to-noise ratio corresponding to each frame of voice signals in the first voice after filtering; filtering each frame of voice signal in the second voice by adopting wiener filtering; and determining a second signal-to-noise ratio corresponding to each frame of voice signals in the second voice after filtering.
In one possible embodiment, the apparatus further comprises: a delay processing module 36, configured to determine a delay time between the first voice and the second voice through a cross-correlation function when it is determined that there is a delay between the first voice and the second voice according to the timing information; performing a time alignment operation of the first voice and the second voice based on the delay time.
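Delay estimation by cross-correlation, as performed by the delay processing module 36, can be sketched as a brute-force search over candidate lags (a real implementation would typically use an FFT-based correlation for efficiency; this direct form is for illustration only):

```python
def estimate_delay(a, b, max_lag):
    """Estimate the lag of `b` relative to `a` by maximising the
    cross-correlation over lags in [-max_lag, max_lag]. A positive
    result means `b` lags `a` by that many samples, so `b` should be
    shifted earlier by `best_lag` samples to time-align the voices."""
    best_lag, best_corr = 0, float('-inf')
    for lag in range(-max_lag, max_lag + 1):
        lo = max(0, -lag)
        hi = min(len(a), len(b) - lag)
        corr = sum(a[i] * b[i + lag] for i in range(lo, hi))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag
```

Once the lag is known, the alignment operation mentioned in the text amounts to dropping or delaying `best_lag` samples from the lagging voice before the frame-wise comparison of S25.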
In a possible embodiment, the control module 35 is further configured to control the current transmission voice to fade out and the target voice to fade in during the current transmission voice of the first device performs a switching operation, so that a process of switching from the transmission voice to the target voice tends to be smooth.
The audio processing apparatus provided in this embodiment may be the audio processing apparatus shown in fig. 3, and may perform all the steps of the audio processing method shown in fig. 1-2, so as to achieve the technical effects of the audio processing method shown in fig. 1-2, and for brevity, it is not described herein again.
Fig. 4 is a schematic structural diagram of an apparatus according to an embodiment of the present invention, where the apparatus 400 shown in fig. 4 includes: at least one processor 401, memory 402, at least one network interface 404, and other user interfaces 403. The various components in device 400 are coupled together by a bus system 405. It is understood that the bus system 405 is used to enable connection communication between these components. The bus system 405 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 405 in fig. 4.
The user interface 403 may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, trackball, touch pad, or touch screen, among others).
It will be appreciated that memory 402 in embodiments of the invention may be either volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. Volatile memory may be Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 402 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 402 stores the following elements, executable units or data structures, or a subset thereof, or an expanded set thereof: an operating system 4021 and application programs 4022.
The operating system 4021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is configured to implement various basic services and process hardware-based tasks. The application programs 4022 include various application programs, such as a Media Player (Media Player), a Browser (Browser), and the like, for implementing various application services. A program for implementing the method according to the embodiment of the present invention may be included in the application 4022.
In this embodiment of the present invention, by calling a program or an instruction stored in the memory 402, specifically, a program or an instruction stored in the application 4022, the processor 401 is configured to execute the method steps provided by the method embodiments, for example, including:
receiving a first voice and a second voice, wherein the first voice is a voice received by first equipment, and the second voice is a voice received by second equipment and sent to the first equipment; determining a first signal parameter set corresponding to the first voice and a second signal parameter set corresponding to the second voice, wherein the first signal parameter set comprises a plurality of first signal parameters, the second signal parameter set comprises a plurality of second signal parameters, and the first signal parameters and the second signal parameters are both used for indicating signal strength and weakness; comparing the signal intensity of the first signal parameter with the signal intensity of the second signal parameter to determine the signal intensity of the first voice and the second voice, and selecting one of the first voice and the second voice with the maximum signal intensity as a target voice; performing voice activity detection on the first voice and the second voice to obtain a detection result; and controlling the current transmission voice of the first equipment according to the detection result and the target voice, wherein the transmission voice is the first voice or the second voice.
In a possible embodiment, if the detection result is that a speech frame exists in the first speech or the second speech, controlling the first device to keep the current transmission speech unchanged; if the detection result indicates that no speech frame exists in the first speech and the second speech, judging whether the target speech is the same as the transmission speech; when the target voice is different from the transmission voice, controlling the first equipment to switch the current transmission voice into the target voice; and controlling the first equipment to keep the current transmission voice unchanged when the target voice is the same as the transmission voice.
In a possible embodiment, the first voice and the second voice are preprocessed to obtain a multi-frame voice signal corresponding to the first voice and a multi-frame voice signal corresponding to the second voice, where the preprocessing at least includes framing processing and downsampling processing; a first signal parameter corresponding to each frame of voice signal in the first voice is determined to obtain a first signal parameter set, where each first signal parameter in the first signal parameter set carries first timing information, and the first timing information is the same as the timing of each frame of voice signal in the first voice; and a second signal parameter corresponding to each frame of voice signal in the second voice is determined to obtain a second signal parameter set, where each second signal parameter in the second signal parameter set carries second timing information, and the second timing information is the same as the timing of each frame of voice signal in the second voice.
In a possible embodiment, when the first timing information is consistent with the second timing information, the signal strength corresponding to the first signal parameter of each frame of voice signal in the first voice is compared with the signal strength corresponding to the second signal parameter of each frame of voice signal in the second voice; and the target voice in which the signal strength corresponding to the signal parameters of one or more consecutive frames of voice signals is the maximum is determined from the first voice or the second voice.
In one possible embodiment, the first signal parameter comprises a signal-to-noise ratio and the second signal parameter comprises a signal-to-noise ratio; filtering each frame of voice signal in the first voice by adopting wiener filtering; determining a first signal-to-noise ratio corresponding to each frame of voice signals in the first voice after filtering; filtering each frame of voice signal in the second voice by adopting wiener filtering; and determining a second signal-to-noise ratio corresponding to each frame of voice signals in the second voice after filtering.
In one possible implementation, when it is determined that there is a delay between the first voice and the second voice according to the timing information, determining a delay time between the first voice and the second voice through a cross-correlation function; performing a time alignment operation of the first voice and the second voice based on the delay time.
In one possible embodiment, during the switching operation of the current transmission voice of the first device, the current transmission voice is controlled to fade out and the target voice is controlled to fade in, so that the process of switching from the transmission voice to the target voice tends to be smooth.
The method disclosed in the above embodiments of the present invention may be applied to the processor 401, or implemented by the processor 401. The processor 401 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 401. The Processor 401 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software elements in the decoding processor. The software elements may be located in RAM, flash memory, ROM, PROM, EEPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 402, and the processor 401 reads the information in the memory 402 and completes the steps of the method in combination with the hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The device provided in this embodiment may be the device shown in fig. 4, and may perform all the steps of the audio processing method shown in fig. 1-2, so as to achieve the technical effect of the audio processing method shown in fig. 1-2, which is described with reference to fig. 1-2 for brevity and will not be described herein again.
The embodiment of the invention also provides a storage medium (computer readable storage medium). The storage medium herein stores one or more programs. Among others, the storage medium may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid state disk; the memory may also comprise a combination of memories of the kind described above.
The one or more programs in the storage medium are executable by one or more processors to implement the audio processing method performed on the audio processing device side described above.
The processor is configured to execute the audio processing program stored in the memory to implement the following steps of the audio processing method executed on the audio processing device side:
receiving a first voice and a second voice, wherein the first voice is a voice received by first equipment, and the second voice is a voice received by second equipment and sent to the first equipment; determining a first signal parameter set corresponding to the first voice and a second signal parameter set corresponding to the second voice, wherein the first signal parameter set comprises a plurality of first signal parameters, the second signal parameter set comprises a plurality of second signal parameters, and the first signal parameters and the second signal parameters are both used for indicating signal strength and weakness; comparing the signal intensity of the first signal parameter with the signal intensity of the second signal parameter to determine the signal intensity of the first voice and the second voice, and selecting one of the first voice and the second voice with the maximum signal intensity as a target voice; performing voice activity detection on the first voice and the second voice to obtain a detection result; and controlling the current transmission voice of the first equipment according to the detection result and the target voice, wherein the transmission voice is the first voice or the second voice.
In a possible embodiment, if the detection result is that a speech frame exists in the first speech or the second speech, controlling the first device to keep the current transmission speech unchanged; if the detection result indicates that no speech frame exists in the first speech and the second speech, judging whether the target speech is the same as the transmission speech; when the target voice is different from the transmission voice, controlling the first equipment to switch the current transmission voice into the target voice; and controlling the first equipment to keep the current transmission voice unchanged when the target voice is the same as the transmission voice.
In a possible embodiment, the first voice and the second voice are preprocessed to obtain a multi-frame voice signal corresponding to the first voice and a multi-frame voice signal corresponding to the second voice, where the preprocessing at least includes framing processing and downsampling processing; a first signal parameter corresponding to each frame of voice signal in the first voice is determined to obtain a first signal parameter set, where each first signal parameter in the first signal parameter set carries first timing information, and the first timing information is the same as the timing of each frame of voice signal in the first voice; and a second signal parameter corresponding to each frame of voice signal in the second voice is determined to obtain a second signal parameter set, where each second signal parameter in the second signal parameter set carries second timing information, and the second timing information is the same as the timing of each frame of voice signal in the second voice.
In a possible embodiment, when the first timing information is consistent with the second timing information, the signal strength corresponding to the first signal parameter of each frame of voice signal in the first voice is compared with the signal strength corresponding to the second signal parameter of each frame of voice signal in the second voice; and the target voice in which the signal strength corresponding to the signal parameters of one or more consecutive frames of voice signals is the maximum is determined from the first voice or the second voice.
In one possible embodiment, the first signal parameter comprises a signal-to-noise ratio and the second signal parameter comprises a signal-to-noise ratio. Each frame of voice signal in the first voice is filtered using Wiener filtering, and a first signal-to-noise ratio corresponding to each filtered frame of voice signal in the first voice is determined; each frame of voice signal in the second voice is filtered using Wiener filtering, and a second signal-to-noise ratio corresponding to each filtered frame of voice signal in the second voice is determined.
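A sketch of a per-frame SNR estimate built on SciPy's Wiener filter: the filtered frame is taken as the signal estimate and the residual as noise. This is one common heuristic; the patent does not specify its exact SNR formula, and `mysize=15` is an assumed window size.

```python
import numpy as np
from scipy.signal import wiener

def frame_snr(frame, mysize=15):
    """Estimate the SNR (in dB) of one voice frame via Wiener filtering."""
    frame = np.asarray(frame, dtype=float)
    clean = wiener(frame, mysize=mysize)   # denoised estimate of the frame
    noise = frame - clean                  # residual treated as noise
    eps = 1e-12                            # guard against log(0) on silence
    return 10.0 * np.log10((np.sum(clean ** 2) + eps)
                           / (np.sum(noise ** 2) + eps))
```

A strongly tonal frame should score well above a frame of white noise, which is what the per-frame comparison between the two voices relies on.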
In one possible implementation, when it is determined, according to the timing information, that there is a delay between the first voice and the second voice, the delay time between the first voice and the second voice is determined through a cross-correlation function, and a time alignment operation is performed on the first voice and the second voice based on the delay time.
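The cross-correlation approach can be sketched with NumPy: the lag at which the cross-correlation peaks is the estimated delay, and alignment trims the leading samples of the lagging signal. Helper names are illustrative.

```python
import numpy as np

def estimate_delay(x, y):
    """Estimate the lag (in samples) of y relative to x via the
    peak of the full cross-correlation."""
    corr = np.correlate(y, x, mode="full")
    return int(np.argmax(corr)) - (len(x) - 1)

def align(x, y, lag):
    """Time-align the two signals by dropping the leading samples of
    whichever one lags, then truncating to a common length."""
    if lag > 0:
        y = y[lag:]
    elif lag < 0:
        x = x[-lag:]
    n = min(len(x), len(y))
    return x[:n], y[:n]
```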
In one possible embodiment, during a switching operation on the current transmission voice of the first device, the current transmission voice is controlled to fade out while the target voice is controlled to fade in, so that the switch from the transmission voice to the target voice is smooth.
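The fade-out/fade-in can be realized as a linear crossfade; the fade length below (160 samples, i.e. 10 ms at 16 kHz) is an assumed value, and the patent does not specify a fade curve.

```python
import numpy as np

def crossfade(old, new, fade_len=160):
    """Fade the old stream out and the new stream in over fade_len
    samples, then continue with the new stream."""
    ramp = np.linspace(0.0, 1.0, fade_len)          # fade-in gain for `new`
    mixed = old[:fade_len] * (1.0 - ramp) + new[:fade_len] * ramp
    return np.concatenate([mixed, new[fade_len:]])
```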
Those of skill in the art will further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-described embodiments further illustrate the objects, technical solutions, and advantages of the present invention in detail. It should be understood that they are merely exemplary embodiments of the present invention and are not intended to limit its scope; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention shall fall within the scope of the present invention.

Claims (10)

1. An audio processing method, comprising:
receiving, by a first device, a first voice and a second voice, wherein the first voice is a voice received by the first device, and the second voice is a voice received by a second device and sent to the first device;
determining a first signal parameter set corresponding to the first voice and a second signal parameter set corresponding to the second voice, wherein the first signal parameter set comprises a plurality of first signal parameters, the second signal parameter set comprises a plurality of second signal parameters, and both the first signal parameters and the second signal parameters are used for indicating signal strength;
comparing the signal strength indicated by the first signal parameters with the signal strength indicated by the second signal parameters, and selecting, from the first voice and the second voice, the voice with the maximum signal strength as a target voice;
performing voice activity detection on the first voice and the second voice to obtain a detection result;
and controlling a current transmission voice of the first device according to the detection result and the target voice, wherein the transmission voice is the first voice or the second voice.
2. The method of claim 1, wherein the controlling the current transmission voice of the first device according to the detection result and the target voice comprises:
if the detection result indicates that a speech frame exists in the first voice or the second voice, controlling the first device to keep the current transmission voice unchanged;
if the detection result indicates that no speech frame exists in either the first voice or the second voice, judging whether the target voice is the same as the transmission voice; when the target voice is different from the transmission voice, controlling the first device to switch the current transmission voice to the target voice; and when the target voice is the same as the transmission voice, controlling the first device to keep the current transmission voice unchanged.
3. The method of claim 1, wherein the determining a first set of signal parameters corresponding to the first speech and a second set of signal parameters corresponding to the second speech comprises:
preprocessing the first voice and the second voice to obtain a multi-frame voice signal corresponding to the first voice and a multi-frame voice signal corresponding to the second voice, wherein the preprocessing at least comprises: framing processing and downsampling processing;
determining a first signal parameter corresponding to each frame of voice signal in the first voice to obtain the first signal parameter set, wherein each first signal parameter in the first signal parameter set carries first timing information, and the first timing information is the same as the timing of the corresponding frame of voice signal in the first voice;
determining a second signal parameter corresponding to each frame of voice signal in the second voice to obtain the second signal parameter set, wherein each second signal parameter in the second signal parameter set carries second timing information, and the second timing information is the same as the timing of the corresponding frame of voice signal in the second voice.
4. The method of claim 3, wherein the comparing the signal strength indicated by the first signal parameters with the signal strength indicated by the second signal parameters and selecting the voice with the maximum signal strength as the target voice comprises:
in a case that the first timing information is consistent with the second timing information, comparing the signal strength indicated by the first signal parameter of each frame of voice signal in the first voice with the signal strength indicated by the second signal parameter of the corresponding frame of voice signal in the second voice;
and determining, from the first voice and the second voice, the voice whose signal parameters corresponding to one or more consecutive frames of voice signals indicate the maximum signal strength as the target voice.
5. The method of claim 3, wherein the first signal parameter comprises a signal-to-noise ratio and the second signal parameter comprises a signal-to-noise ratio;
the determining a first signal parameter corresponding to each frame of voice signal in the first voice comprises:
filtering each frame of voice signal in the first voice using Wiener filtering;
determining the first signal-to-noise ratio corresponding to each filtered frame of voice signal in the first voice;
the determining a second signal parameter corresponding to each frame of voice signal in the second voice includes:
filtering each frame of voice signal in the second voice using Wiener filtering;
and determining the second signal-to-noise ratio corresponding to each filtered frame of voice signal in the second voice.
6. The method of claim 1, further comprising:
when it is determined, according to the timing information, that there is a delay between the first voice and the second voice, determining the delay time between the first voice and the second voice through a cross-correlation function;
performing a time alignment operation of the first voice and the second voice based on the delay time.
7. The method according to any one of claims 1-6, further comprising:
and in a process of performing a switching operation on the current transmission voice of the first device, controlling the current transmission voice to fade out and controlling the target voice to fade in, so that the process of switching from the transmission voice to the target voice is smooth.
8. An audio processing apparatus, comprising:
a receiving module, configured to receive a first voice and a second voice, wherein the first voice is a voice received by a first device, and the second voice is a voice received by a second device and sent to the first device;
a first determining module, configured to determine a first signal parameter set corresponding to the first voice and a second signal parameter set corresponding to the second voice, wherein the first signal parameter set comprises a plurality of first signal parameters, the second signal parameter set comprises a plurality of second signal parameters, and both the first signal parameters and the second signal parameters are used to indicate signal strength;
a second determining module, configured to compare the signal strength indicated by the first signal parameters with the signal strength indicated by the second signal parameters, and to select, from the first voice and the second voice, the voice with the maximum signal strength as a target voice;
a detection module, configured to perform voice activity detection on the first voice and the second voice to obtain a detection result;
and a control module, configured to control a current transmission voice of the first device according to the detection result and the target voice, wherein the transmission voice is the first voice or the second voice.
9. A device, comprising: a processor and a memory, wherein the processor is configured to execute an audio processing program stored in the memory to implement the audio processing method of any one of claims 1 to 7.
10. A storage medium storing one or more programs executable by one or more processors to implement the audio processing method of any one of claims 1 to 7.
CN202110883330.XA 2021-07-30 2021-07-30 Audio processing method, device, equipment and storage medium Pending CN113628638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110883330.XA CN113628638A (en) 2021-07-30 2021-07-30 Audio processing method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113628638A true CN113628638A (en) 2021-11-09

Family

ID=78382320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110883330.XA Pending CN113628638A (en) 2021-07-30 2021-07-30 Audio processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113628638A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020085697A1 (en) * 2000-12-29 2002-07-04 Simard Frederic F. Apparatus and method for packet-based media communications
JP2011257627A (en) * 2010-06-10 2011-12-22 Murata Mach Ltd Voice recognition device and recognition method
US20170018282A1 (en) * 2015-07-16 2017-01-19 Chunghwa Picture Tubes, Ltd. Audio processing system and audio processing method thereof
CN110415718A (en) * 2019-09-05 2019-11-05 腾讯科技(深圳)有限公司 The method of signal generation, audio recognition method and device based on artificial intelligence
WO2019227579A1 (en) * 2018-05-29 2019-12-05 平安科技(深圳)有限公司 Conference information recording method and apparatus, computer device, and storage medium
CN110619895A (en) * 2019-09-06 2019-12-27 Oppo广东移动通信有限公司 Directional sound production control method and device, sound production equipment, medium and electronic equipment
CN110648692A (en) * 2019-09-26 2020-01-03 苏州思必驰信息科技有限公司 Voice endpoint detection method and system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
陈楚婷: "Bedroom intelligent voice control system", 《科技创新与应用》, no. 8, pp. 99-100 *
陈立春: "Voice endpoint detection and enhancement methods in a real-time voice acquisition system", 《电声技术》, vol. 37, no. 05, pp. 42-44 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination