WO2021056999A1 - Voice call method, device, electronic equipment, and computer-readable storage medium - Google Patents

Voice call method, device, electronic equipment, and computer-readable storage medium

Info

Publication number
WO2021056999A1
WO2021056999A1 PCT/CN2020/081385 CN2020081385W WO2021056999A1 WO 2021056999 A1 WO2021056999 A1 WO 2021056999A1 CN 2020081385 W CN2020081385 W CN 2020081385W WO 2021056999 A1 WO2021056999 A1 WO 2021056999A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
audio collection
signal
voice call
moment
Prior art date
Application number
PCT/CN2020/081385
Other languages
English (en)
French (fr)
Inventor
李岳鹏
刘志鹏
朱睿
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to JP2021558866A priority Critical patent/JP7290749B2/ja
Priority to EP20868976.0A priority patent/EP3920516B1/en
Publication of WO2021056999A1 publication Critical patent/WO2021056999A1/zh
Priority to US17/460,160 priority patent/US11875808B2/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72448 User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions
    • H04M1/72454 User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions according to context-related or environment-related conditions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/002 Applications of echo suppressors or cancellers in telephonic connections
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/60 Substation equipment, e.g. for use by subscribers including speech amplifiers
    • H04M1/6008 Substation equipment, e.g. for use by subscribers including speech amplifiers in the transmitter circuit
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M9/00 Arrangements for interconnection not involving centralised switching
    • H04M9/08 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • H04M9/082 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic using echo cancellers

Definitions

  • This application relates to the field of computer technology. Specifically, this application relates to a voice call method, device, electronic equipment, and computer-readable storage medium.
  • An embodiment of the present application provides a voice call method executed by an electronic device, the method including:
  • determining the target audio collection device at the current moment from among the audio collection devices.
  • An embodiment of the present application provides a voice call method executed by an electronic device, the method including:
  • using the audio collection device, among the at least two audio collection devices, that corresponds to the pre-configured information as the target audio collection device, and determining the voice call state at the initial moment;
  • making a voice call with the peer device.
  • an embodiment of the present application provides a voice call device, which includes:
  • the call status acquisition module is used to acquire the voice call status at the historical moment of the terminal system, and at least two audio collection devices are provided on the terminal system;
  • the signal energy acquisition module is used to acquire the first voice signal collected by each audio collection device at the current moment, and respectively determine the signal energy of each first voice signal;
  • the target audio collection device determining module is used to determine the target audio collection device from each audio collection device based on the voice call state at the historical moment and the signal energy of each first voice signal.
  • an embodiment of the present application provides a voice call device, which includes:
  • the trigger operation receiving module is used to receive the user's voice call trigger operation;
  • the device opening module is used to turn on, based on the voice call trigger operation, the audio playback device and the at least two audio collection devices on the terminal system;
  • the initial determination module is configured to use, for the initial moment of the voice call, the audio collection device among the at least two audio collection devices that corresponds to the pre-configured information as the target audio collection device, and to determine the voice call state at the initial moment;
  • the voice call module is used to make, for each current moment of the voice call other than the initial moment, a voice call with the peer device based on the voice signal collected by the target audio collection device determined by the method provided in the first aspect or any one of the embodiments of the first aspect.
  • an embodiment of the present application provides an electronic device, which includes a memory, a processor, an audio playback device, and at least two audio collection devices;
  • the audio playback device is used to play voice signals;
  • the at least two audio collection devices are used to collect voice signals;
  • a computer program is stored in the memory;
  • the processor is configured to execute the computer program to implement the method provided in the first aspect or the second aspect.
  • An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the method provided in the first aspect or the second aspect is implemented.
  • FIG. 1 is a schematic structural diagram of a mobile phone terminal;
  • FIG. 2 is a schematic flowchart of a voice call method provided by an embodiment of this application;
  • FIG. 3 is a schematic diagram of the implementation process of a voice call in an example of an embodiment of this application;
  • FIG. 4 is a schematic diagram of the implementation process of call state estimation and microphone selection in an example of an embodiment of this application;
  • FIG. 5 is a schematic diagram of a selection result of a target microphone in an example of an embodiment of this application;
  • FIG. 6 is a schematic flowchart of a voice call method provided by an embodiment of this application;
  • FIG. 7 is a schematic diagram of an application scenario in an example of this application;
  • FIG. 8 is a structural block diagram of a voice call device provided by an embodiment of this application;
  • FIG. 9 is a structural block diagram of a voice call device provided by an embodiment of this application;
  • FIG. 10 is a schematic structural diagram of an electronic device provided by an embodiment of this application.
  • Near end: the local end in the communication network during a voice call.
  • Far end: the opposite end in the communication network during a voice call.
  • Near-end device: the call device used by the near-end speaker in a voice call. The near-end device is equipped with audio collection devices (such as microphones) and audio playback devices (such as speakers and receivers).
  • Far-end device: the call device used by the far-end speaker in a voice call. The far-end device is equipped with audio collection devices (such as microphones) and audio playback devices (such as speakers and receivers).
  • Near-end voice signal: in a voice call, the voice of the near-end speaker as collected by the audio collection device of the near-end device.
  • Far-end voice signal: in a voice call, the voice of the far-end speaker, collected by the audio collection device of the far-end device and then transmitted to the near-end device through the communication network.
  • Echo signal: the voice signal collected by the audio collection device of the near-end device after the far-end voice signal is played by the audio playback device of the near-end device.
  • Echo cancellation: the process of filtering out the echo signal from the voice signal collected by the audio collection device of the near-end device.
  • Far-end single talk: the call state during a voice call in which a far-end voice signal exists and no near-end voice signal exists.
  • Near-end single talk: the call state during a voice call in which no far-end voice signal exists and a near-end voice signal exists.
  • Two-end intercom: the call state during a voice call in which both a far-end voice signal and a near-end voice signal exist.
  • The microphone signal with the higher signal amplitude is usually used as the input for subsequent applications. This choice can effectively enhance the voice when only near-end voice is present. However, in a scene with strong far-end speech, both microphones collect strong echoes, so selecting the microphone signal with the larger signal amplitude is likely to select the one with the larger echo. As a result, voice enhancement fails to achieve the desired effect and may even reduce voice call quality.
  • Suppose the two devices making a voice call are A and B. For user a of device A, device A is the near-end device and device B is the corresponding far-end device. For user b of device B, device B is the near-end device and device A is the corresponding far-end device.
  • The following description takes device A as the near-end device. In this case, the voice signal of the local speaker (user a) collected by the audio collection device of A is the near-end voice signal, and the voice signal produced by the speaker at the opposite end (user b) is the far-end voice signal.
  • After the far-end voice signal is played by the audio playback device on A, the voice signal collected by the audio collection device on A is the echo signal. The process of canceling the echo signal in the voice signal collected by the audio collection device of A is echo cancellation.
  • One approach is to select the input microphone according to the signal amplitude of the voice signal collected by each microphone: of the two microphones, the one whose collected voice signal has the higher signal amplitude is selected as the input microphone, that is, the voice signal with the highest signal amplitude is used as the input voice signal for subsequent voice enhancement processing.
  • When the far-end speech is strong, the voice signals collected by both microphones contain strong echoes, so the microphone with the larger signal amplitude may be the microphone with the larger echo, which causes echo leakage and reduces the quality of the voice call.
  • FIG. 1 shows a common layout of the audio collection devices and audio playback devices of a mobile phone.
  • The phone is equipped with a microphone 201 above the screen (referred to as the top microphone), a microphone 202 below the screen (referred to as the bottom microphone), a receiver 203 at the top of the screen, and a speaker 204 at the bottom of the screen.
  • In the hands-free scene, the speaker 204 at the bottom of the mobile phone plays the far-end voice signal, which is collected by the microphones and forms an echo. Because the bottom microphone 202 is closer to the speaker, the echo it collects is relatively large. Because the near-end speaker is far from the mobile phone, the near-end voice energy collected by the two microphones is relatively close. The signal collected by the top microphone is therefore the better input for subsequent processing.
  • In the handheld scene, the receiver 203 at the top of the mobile phone plays the far-end voice signal, and the near-end speaker holds the mobile phone and speaks close to the bottom of the screen. The near-end voice signal collected by the bottom microphone 202 is larger, while the echo collected by the top microphone 201 is larger, so the signal collected by the bottom microphone is the better input for subsequent processing.
  • Choosing the bottom microphone for the handheld scene and the top microphone for the hands-free scene is therefore a more reasonable choice.
  • However, the actual call scene is more complicated. For example, in the hands-free scene the user may also put the mouth close to the bottom microphone 202, in which case selecting the top microphone 201 cannot effectively obtain the near-end voice signal.
  • the embodiments of the present application provide a voice call method, which provides another more reasonable way of selecting audio collection equipment, and can effectively improve the voice call effect.
  • FIG. 2 is a schematic flowchart of a voice call method provided by an embodiment of the application.
  • The method may be executed by an electronic device such as a terminal system or a server. As shown in FIG. 2, the method may include the following steps.
  • Step S101 Obtain the voice call status at the historical moment of the terminal system. At least two audio collection devices are provided on the terminal system.
  • the terminal system may be a terminal device that integrates audio playback equipment, audio acquisition equipment, and processors.
  • The specific type of the terminal device is not limited in the embodiments of this application, as long as it is a device that can conduct voice calls, including but not limited to mobile phones, PADs, etc.
  • The terminal system can also be a voice call system composed of mutually independent audio playback devices, audio collection devices, and processors.
  • For example, the terminal system can be a video conference system that contains multiple audio collection devices (such as microphones), one or more audio playback devices (such as speakers), and processors; the audio collection devices and audio playback devices can be distributed according to actual needs, such as the layout of the conference venue.
  • When the terminal system is a terminal device, the method can be executed by the processor integrated in the terminal device or by the server corresponding to the terminal device. When the terminal system is a voice call system composed of mutually independent components, the method can be executed by the processor in the terminal system or by the server corresponding to the terminal system. In other words, the method can be executed by an electronic device such as the terminal system or the server.
  • For the terminal system that executes the voice call method, the terminal system is the near-end device of the current voice call, and the peer device that conducts the voice call with the terminal system is the far-end device.
  • Each moment in the voice call can be understood as a time point at which the target audio collection device is re-determined.
  • The historical moment may include one or more moments before the current moment.
  • The interval between two adjacent moments can be set according to actual needs, for example 0.02 seconds. With that interval, if the current moment is 0.20 seconds into the voice call, the historical moment is 0.18 seconds into the voice call.
  • the voice call state represents the near-end voice condition and the far-end voice condition in the voice call.
  • From the voice call state at any moment, it can be determined whether a near-end voice signal and a far-end voice signal exist at that moment.
  • For example, if the voice call state at 0.20 seconds into the voice call is far-end single talk, a far-end voice signal exists but no near-end voice signal exists at 0.20 seconds into the voice call.
  • The audio collection devices on the terminal system can be microphones or other types of audio collection devices.
  • The specific type and number of the at least two audio collection devices, and their locations on the terminal system, are not limited in the embodiments of this application.
  • For example, the at least two audio collection devices can be two microphones set respectively above and below the screen on the front of a mobile phone, as in the dual-microphone arrangement shown in FIG. 1. Other arrangements, such as placement on the back of the phone, are also possible and are not specifically limited in the embodiments of this application.
  • Step S102 Obtain the first voice signal collected by each audio collection device at the current moment, and determine the signal energy of each first voice signal respectively.
  • the first voice signal may include a near-end voice signal, an echo signal, and an environmental noise signal.
  • The echo signal and the environmental noise signal need to be eliminated, so that the signal transmitted to the far-end device is the near-end voice signal.
  • The signal types and signal energy levels contained in the first voice signals collected by the audio collection devices of the near-end device differ. The signal energy of each first voice signal reflects the strength of the voice signals it contains and can therefore serve as a basis for subsequently determining the target audio collection device.
  • the signal energy of the speech signal can be determined according to the signal amplitude or peak envelope of the speech signal.
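As a minimal sketch of Step S102, the signal energy of each first voice signal can be computed per frame from the sample amplitudes. The mean-square measure, the function names `frame_energy` and `energies_per_device`, and the device labels `top_mic` and `bottom_mic` are assumptions made for illustration; the text only states that energy can be derived from the signal amplitude or the peak envelope.

```python
def frame_energy(samples):
    """Return the mean squared amplitude of one frame of samples.

    Assumption: mean-square energy is used as the energy measure; a
    peak-envelope measure would be an equally valid reading of the text.
    """
    if not samples:
        return 0.0
    return sum(x * x for x in samples) / len(samples)


def energies_per_device(frames):
    """Map each audio collection device name to the energy of its frame."""
    return {name: frame_energy(samples) for name, samples in frames.items()}


# Hypothetical frames collected at the same moment by two microphones.
frames = {
    "top_mic": [0.1, -0.1, 0.1, -0.1],
    "bottom_mic": [0.5, -0.5, 0.5, -0.5],
}
energies = energies_per_device(frames)
```

With the 0.02-second interval mentioned earlier and an assumed sampling rate of 16 kHz, each frame would hold 320 samples.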
  • The step numbers S101 and S102 do not limit the order of the two steps. Step S101 can be executed before step S102, step S102 can be executed before step S101, or the two steps can be executed at the same time. That is, in implementing the embodiments of this application, the order in which the voice call state of the near-end device at the historical moment and the signal energy of the first voice signal collected by each audio collection device at the current moment are obtained is not limited.
  • Step S103 based on the voice call state at the historical moment and the signal energy of each first voice signal, determine the target audio collection device at the current moment from each audio collection device.
  • The voice call state at the current moment can be estimated from the voice call state at the historical moment; that is, the voice call state at the historical moment is taken as the voice call state at the current moment.
  • When the historical moment includes a single moment, the voice call state at that moment is taken as the voice call state at the historical moment, and that moment can be the moment immediately preceding the current moment.
  • When the historical moment includes multiple moments, the voice call state at the historical moment can be determined in either of the following ways: obtain the voice call state at each moment and take the state that occurs most often as the voice call state at the historical moment; or take the voice call state at the moment closest to the current moment as the voice call state at the historical moment.
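The two ways of reducing several historical moments to one voice call state can be sketched as follows. The function names and the string labels for the states are hypothetical, and the tie-breaking rule (prefer the most recent state) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter


def most_frequent_state(states):
    """Return the state that occurs most often.

    `states` is ordered from oldest to newest.
    Assumption: on a tie, the state seen most recently wins.
    """
    if not states:
        raise ValueError("at least one historical state is required")
    counts = Counter(states)
    top = max(counts.values())
    for state in reversed(states):
        if counts[state] == top:
            return state


def most_recent_state(states):
    """Return the state at the moment closest to the current moment."""
    if not states:
        raise ValueError("at least one historical state is required")
    return states[-1]


# Hypothetical states at three historical moments, oldest first.
history = ["near_end_single_talk", "far_end_single_talk", "far_end_single_talk"]
by_majority = most_frequent_state(history)
by_recency = most_recent_state(history)
```

Majority voting smooths over a single misclassified moment, while taking the most recent state reacts faster to a change of speaker; which fits better depends on how many moments the history covers.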
  • The types of voice signals collected by the audio collection devices differ with the voice call state. For example, in far-end single talk the audio collection devices collect echo signals, and in near-end single talk they collect near-end voice signals (generally together with noise signals). The voice call state therefore indicates whether an echo signal, a near-end voice signal, and so on are present in the first voice signal collected by an audio collection device; that is, the types of signals included in the first voice signal can be determined from the voice call state at the historical moment.
  • For example, if the voice call state at the historical moment is near-end single talk, a near-end voice signal exists and no far-end voice signal exists at the current moment. Because the echo signal arises from the far-end voice signal, it can be determined that the first voice signal contains no echo signal.
  • Once the types of signals contained in each first voice signal are known, the signal energy level of a specific type of voice signal contained in each first voice signal (in other words, collected by each audio collection device) can be determined from the signal energy of each first voice signal.
  • For example, if the voice call state at the historical moment is near-end single talk, each first voice signal contains a near-end voice signal and generally also an environmental noise signal. Because the signal energy of the environmental noise in each first voice signal is basically similar, the signal energy of the near-end voice signal in a first voice signal is positively correlated with the signal energy of that first voice signal: the greater the signal energy of the first voice signal, the greater the signal energy of the near-end voice signal collected by the corresponding audio collection device. In this case, the audio collection device that collected the first voice signal with the greatest signal energy can serve as the target audio collection device.
  • In this way, the relative signal energy of the specific type of voice signal collected by each audio collection device in a given voice call state can be determined.
  • The first voice signal collected by the target audio collection device determined for the current moment is the one more conducive to subsequent voice enhancement processing in the corresponding voice call state. Generally, such a first voice signal contains a near-end voice signal with larger signal energy or an echo signal with smaller signal energy.
  • Because the voice call state at the historical moment and the signal energy of each first voice signal together determine the relative signal energy of the specific type of voice signal collected by each audio collection device, they can be used to determine the target audio collection device in that voice call state.
  • Because the voice call state at the historical moment is also taken into account, the situation in which the first voice signal collected by the determined target audio collection device contains the largest echo signal can be effectively avoided.
  • The process of determining the target audio collection device does not depend on the call scene of the near-end device, which avoids the situation in which the determined target audio collection device cannot collect an effective near-end voice signal.
  • For a moment other than the initial moment of the voice call, the target audio collection device at that moment can be determined according to the method provided in the embodiments of this application.
  • For the initial moment of the voice call, the target audio collection device can be pre-designated, can be any one of the at least two audio collection devices, or can be selected using an existing determination method, such as determining the target audio collection device based on the call scene.
  • The interaction process between the terminal system and the server in this solution may include the following. At the initial moment of the voice call, the server sends the pre-configuration information of the target audio collection device to the terminal system, and the terminal system selects the target audio collection device from the at least two audio collection devices according to the received pre-configuration information. Alternatively, the pre-configuration information is stored in the terminal system itself, and the terminal system selects the target audio collection device from the at least two audio collection devices according to that information.
  • At the current moment, the server receives the first voice signals collected by the at least two audio collection devices and sent by the terminal system, obtains the signal energy of each first voice signal, and determines the target audio collection device at the current moment according to the voice call state at the historical moment and the signal energy of each received first voice signal.
  • The voice call method uses the voice call state at the historical moment, combined with the signal energy of the voice signal collected by each audio collection device, to determine as the target audio collection device at the current moment the audio collection device whose voice signal is more conducive to subsequent voice enhancement processing in the given voice call state.
  • Because the process of determining the target audio collection device does not rely only on the signal energy of the voice signal collected by each audio collection device or on the call scene of the near-end device, it avoids the problem in the related technology of a large echo or a weak near-end voice in the voice signal collected by the determined target audio collection device, which improves the effect of the voice call.
  • the voice call state at a historical moment is determined in the following manner:
  • the voice call state at the historical moment is determined.
  • the voice call status can indicate the near-end voice status and the far-end voice status in the voice call, and the corresponding voice call status can be determined through the near-end voice status and the far-end voice status during the voice call.
  • Whether a far-end voice signal exists at the historical moment can be determined by judging whether the terminal system received a far-end voice signal at the historical moment. For example, if the voice signal received by the terminal system at the historical moment contains the voice of the far-end speaker (i.e., a far-end voice signal), it is determined that a far-end voice signal exists at the historical moment.
  • Whether a near-end voice signal exists at the historical moment can be determined by judging whether the voice signal collected by any audio collection device on the terminal system at the historical moment contains a near-end voice signal. For example, if the voice signal collected by any audio collection device at the historical moment contains the voice of the near-end speaker (i.e., a near-end voice signal), it is determined that a near-end voice signal exists at the historical moment.
  • The determination can be made based on characteristics of the near-end voice signal and the far-end voice signal, such as signal energy and signal waveform. For example, a voice signal in the first voice signal whose signal energy is within a preset range is determined to be a near-end voice signal.
  • When the historical moment includes multiple moments, the above solution is used to determine the voice call state at each moment separately, after which the voice call state at the historical moment is determined from them.
  • The voice call state at the historical moment can be determined from the voice call states at the multiple moments as described above: take the state that occurs most often among the states at those moments as the voice call state at the historical moment; or take the voice call state at the moment closest to the current moment as the voice call state at the historical moment.
  • The corresponding interaction process between the terminal system and the server may include the following. At the historical moment, the server receives the far-end voice signal and each first voice signal sent by the terminal system. The server obtains the first determination result according to whether the received far-end voice signal is 0, and obtains the second determination result according to whether a near-end voice signal exists in the received first voice signals. The voice call state at the historical moment is then determined according to the first determination result and the second determination result.
  • determining whether there is a near-end voice signal at a historical moment includes:
  • The second voice signal may include a near-end voice signal, an echo signal, an environmental noise signal, and so on.
  • After echo cancellation is performed on the second voice signal, the second voice signal can be considered to no longer contain the echo signal. The influence of the echo signal is thus excluded when determining whether a near-end voice signal exists, which makes the determination more accurate.
  • Echo cancellation of the second voice signal collected by the target audio collection device at the historical moment is in any case a necessary operation in the voice call. Selecting the echo-cancelled second voice signal as the object of judgment therefore adds no extra processing step to the voice call.
  • The corresponding interaction process between the terminal system and the server may include the following. At the historical moment, the server receives the second voice signal collected by the target audio collection device and sent by the terminal system, and determines whether a near-end voice signal exists in the second voice signal.
  • The voice call state is one of far-end single talk, near-end single talk, two-end intercom, or no one talking.
  • Determining the voice call state at the historical moment according to the first determination result and the second determination result includes:
  • if a far-end voice signal exists and no near-end voice signal exists, the voice call state at the historical moment is far-end single talk;
  • if no far-end voice signal exists and a near-end voice signal exists, the voice call state at the historical moment is near-end single talk;
  • if both a far-end voice signal and a near-end voice signal exist, the voice call state at the historical moment is two-end intercom;
  • if neither a far-end voice signal nor a near-end voice signal exists, the voice call state at the historical moment is no one talking.
  • the voice call state can be summarized into four states: far-end single-talk, near-end single-talk, two-end intercom, or unmanned talk.
  • in most cases during a voice call, one party is speaking and the other party is listening; only occasionally are both parties speaking at the same time or is neither party speaking.
  • the far-end and near-end single-talk states therefore appear more frequently, while the two-end intercom and no-one-speaking states appear less often.
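The mapping from the two determination results to the four call states can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `CallState` and `classify_call_state` are hypothetical, and the two boolean inputs stand for the first determination result (far-end voice present) and the second determination result (near-end voice present).

```python
from enum import Enum


class CallState(Enum):
    # Hypothetical labels for the four states named in the text.
    FAR_END_SINGLE_TALK = "far-end single talk"
    NEAR_END_SINGLE_TALK = "near-end single talk"
    TWO_END_INTERCOM = "two-end intercom"
    NO_ONE_SPEAKING = "no one speaking"


def classify_call_state(far_end_present: bool, near_end_present: bool) -> CallState:
    """Combine the first and second determination results into a call state."""
    if far_end_present and near_end_present:
        return CallState.TWO_END_INTERCOM
    if far_end_present:
        return CallState.FAR_END_SINGLE_TALK
    if near_end_present:
        return CallState.NEAR_END_SINGLE_TALK
    return CallState.NO_ONE_SPEAKING
```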
  • the target audio collection device at the current moment is determined from each audio collection device, including:
  • the audio collection device corresponding to the first voice signal with the smallest signal energy is determined as the target audio collection device at the current moment
  • the audio collection device corresponding to the first voice signal with the largest signal energy is determined as the target audio collection device at the current moment
  • the target audio collection device determined at the historical moment is determined as the target audio collection device at the current moment.
  • the voice call state at the historical moment is the far-end single talk
  • the first voice signal collected by each audio collection device in the near-end device contains an echo signal and an environmental noise signal
  • the signal energy of each first voice signal is positively correlated with the signal energy of the echo signal it contains, so to minimize the signal energy of the echo signal in the voice signal used for subsequent speech enhancement processing
  • the audio collection device corresponding to the first voice signal with the smallest signal energy is selected as the target audio collection device, that is, the first voice signal with the smallest signal energy is used as the input signal for subsequent speech enhancement processing.
  • when the voice call state at the historical moment is near-end single talk, the first voice signal collected by each audio collection device in the near-end device includes the near-end voice signal and the environmental noise signal, and the signal energy of each first voice signal is positively correlated with the signal energy of the near-end voice signal it contains. To maximize the signal energy of the near-end voice signal in the voice signal used for subsequent speech enhancement processing, the audio collection device corresponding to the first voice signal with the largest signal energy is selected as the target audio collection device, that is, the first voice signal with the largest signal energy is used as the input signal for subsequent speech enhancement processing.
  • the voice call status at the historical moment is two-end intercom
  • the current voice call state is also two-end intercom.
  • the magnitude of the signal energy of the first voice signal collected by each audio collection device in the near-end device is related to both the signal energy of the echo signal and the signal energy of the near-end voice signal.
  • the signal energy of the first voice signal therefore cannot be used to determine the signal energies of the echo signal and the near-end voice signal it contains.
  • in addition, the duration of two-end intercom is generally short.
  • it is sufficient to keep the target audio collection device unchanged, so the target audio collection device determined at the historical moment is used as the target audio collection device at the current moment.
  • the current voice call state is estimated to be no one speaking, and the first voice signal collected by each audio collection device in the near-end device includes neither the echo signal nor the near-end voice signal, and generally the duration of intercom at both ends is short.
  • the target audio collection device determined at the historical moment is taken as the target audio collection device at the current moment.
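The selection rule described above can be sketched as follows. The function names, the string labels for the call states, and the use of a sum of squares as the "signal energy" are assumptions made for illustration; the source does not fix how the energy is computed.

```python
def signal_energy(frame):
    """Assumed energy measure: sum of squared samples of one frame."""
    return sum(x * x for x in frame)


def select_target_device(prev_state, frames, prev_target):
    """Pick the index of the target audio collection device for the current moment.

    prev_state  -- call state at the historical moment (hypothetical string labels)
    frames      -- one frame of first-voice-signal samples per device
    prev_target -- index of the target device chosen at the historical moment
    """
    energies = [signal_energy(f) for f in frames]
    indices = range(len(frames))
    if prev_state == "far_end_single_talk":
        # Energy tracks the echo, so take the quietest device.
        return min(indices, key=energies.__getitem__)
    if prev_state == "near_end_single_talk":
        # Energy tracks the near-end voice, so take the loudest device.
        return max(indices, key=energies.__getitem__)
    # Two-end intercom or no one speaking: keep the previous target.
    return prev_target
```

For example, with `frames = [[0.1, -0.1], [0.5, 0.5]]` the far-end single-talk case picks device 0 and the near-end single-talk case picks device 1.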
  • the method may also include:
  • the target audio capture device at the current moment is determined as the target audio capture device within the preset time period after the current moment.
  • if the call status has been far-end single talk for a long period of time, that is, the talker at the opposite end has been speaking continuously, this status is likely to continue during the subsequent call. Therefore, each time the voice call status is determined, the number of consecutive far-end single-talk determinations can be recorded. For example, a counter can be set: if the call status is far-end single talk, the counter value is incremented by 1; if it is any other call status, the counter is cleared, and counting restarts the next time the call status is determined to be far-end single talk.
  • if the count exceeds the set value, the target audio collection device at the current moment can be used directly as the target audio collection device for the subsequent call. Alternatively, it can be used as the target audio collection device for a certain period of the subsequent call, and after that period has passed the target audio collection device is again determined by the method described in the foregoing embodiment. If the count does not exceed the set value, the method described in the foregoing embodiment is used to determine the target audio collection device.
  • the corresponding interaction process between the terminal system and the server may include: the server collects statistics on the call status at each moment, and if it determines that the number of consecutive times the voice call status before the current moment has been far-end single talk is greater than the set value, the target audio collection device at the current moment is determined as the target audio collection device within the preset time period after the current moment.
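The counter described above can be sketched as a small helper. The class name `FarEndStreakLock`, its parameters, and the choice to express the "preset time period" as a number of moments are assumptions for illustration only.

```python
class FarEndStreakLock:
    """Hold the target device after a long run of far-end single talk (sketch)."""

    def __init__(self, threshold: int, hold_moments: int):
        self.threshold = threshold        # the "set value" for the streak length
        self.hold_moments = hold_moments  # assumed length of the preset period
        self.count = 0                    # consecutive far-end single-talk moments
        self.remaining = 0                # moments left in the current hold
        self.locked_target = None

    def observe(self, state: str, current_target: int) -> None:
        """Update the counter with the call state determined at this moment."""
        if state == "far_end_single_talk":
            self.count += 1
        else:
            self.count = 0  # any other state clears the counter
        if self.count > self.threshold and self.remaining == 0:
            self.locked_target = current_target
            self.remaining = self.hold_moments

    def next_target(self):
        """Return the held device, or None to fall back to normal selection."""
        if self.remaining > 0:
            self.remaining -= 1
            return self.locked_target
        return None
```

A caller would invoke `observe` once per moment and use `next_target()` before running the regular selection; a return value of `None` means the regular method applies.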
  • the method further includes:
  • the echo-cancelled first voice signal is sent to the far-end device.
  • the first voice signal collected by the target audio collection device may include a near-end voice signal, an echo signal, and an environmental noise signal. Therefore, to avoid echo leakage during the voice call, echo cancellation must be performed on the first voice signal before it is sent to the far-end device. Voice detection is then performed on the echo-cancelled first voice signal. If a near-end voice signal is present, the signal is sent to the far-end device. If no near-end voice signal is present, the signal contains only residual echo and environmental noise, and it is not sent to the far-end device.
  • the corresponding interaction process between the terminal system and the server may include: the server performs echo cancellation on the first voice signal collected by the target audio collection device at the current moment, and if there is a near-end voice signal in the echo-cancelled first voice signal, the near-end voice signal is sent to the far-end device.
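The send decision described above can be sketched as follows. The source does not say how near-end voice is detected; the mean-energy threshold used here is a simple stand-in, and `has_near_end_voice`, `frame_to_send`, and `energy_threshold` are hypothetical names.

```python
def has_near_end_voice(frame, energy_threshold=1e-3):
    """Assumed detector: mean squared amplitude above a fixed threshold."""
    if not frame:
        return False
    mean_energy = sum(x * x for x in frame) / len(frame)
    return mean_energy > energy_threshold


def frame_to_send(echo_cancelled_frame, energy_threshold=1e-3):
    """Return the frame if it should go to the far-end device, else None."""
    if has_near_end_voice(echo_cancelled_frame, energy_threshold):
        return echo_cancelled_frame
    # Only residual echo and environmental noise remain: send nothing.
    return None
```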
  • performing echo cancellation on the first voice signal collected by the target audio collection device at the current moment specifically includes:
  • the echo signal in the first voice signal collected by the target audio collection device at the current moment is obtained;
  • the echo propagation path function can be understood as the mapping between the far-end voice signal and the echo signal received by the audio collection device; that is, substituting the far-end voice signal at the current moment into the echo propagation path function at the current moment yields the corresponding echo signal.
  • the corresponding echo signal is obtained from the echo propagation path function, and this echo signal is then removed from the first voice signal to complete the echo cancellation of the first voice signal.
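A minimal sketch of this step, assuming the echo propagation path function is modelled as a finite impulse response (a list of taps) applied to the far-end signal. The source does not specify the form of the function, so this model and the names `estimate_echo` and `cancel_echo` are illustrative assumptions.

```python
def estimate_echo(far_end, path):
    """Apply an assumed FIR echo path: echo[n] = sum_k path[k] * far_end[n - k]."""
    echo = []
    for n in range(len(far_end)):
        acc = 0.0
        for k, tap in enumerate(path):
            if n - k >= 0:
                acc += tap * far_end[n - k]
        echo.append(acc)
    return echo


def cancel_echo(mic, far_end, path):
    """Subtract the estimated echo from the first voice signal, sample by sample."""
    echo = estimate_echo(far_end, path)
    return [m - e for m, e in zip(mic, echo)]
```

With an all-zero far-end signal the estimated echo is zero, so the first voice signal passes through unchanged.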
  • the first voice signal after echo cancellation and before the echo cancellation remain unchanged.
  • the method may further include:
  • the echo propagation path function at the historical moment is updated to obtain the echo propagation path function at the current moment.
  • the parameters of the echo propagation path function at a given moment can be modified, that is, the function can be updated, to obtain the echo propagation path function at the next moment. When there is no far-end voice signal at the historical moment, the first voice signal contains no echo signal and there is no residual echo signal, so the echo propagation path function at the current moment is the same as the echo propagation path function at the historical moment.
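One common way to update an echo path estimate from the residual is the normalized least-mean-squares (NLMS) adaptive filter. The source does not name its update rule, so NLMS is swapped in here purely as an illustration, again assuming an FIR path; `nlms_update`, `adapt_path`, `step`, and `eps` are hypothetical names.

```python
def nlms_update(path, far_history, residual, step=0.5, eps=1e-8):
    """One NLMS step.

    far_history[k] is the far-end sample k moments ago; residual is the
    first voice signal minus the estimated echo at this moment.
    """
    power = sum(x * x for x in far_history) + eps
    return [h + step * residual * x / power for h, x in zip(path, far_history)]


def adapt_path(far_end, mic, num_taps, step=0.5):
    """Run the update over a whole signal, starting from an all-zero path."""
    path = [0.0] * num_taps
    for n in range(len(far_end)):
        history = [far_end[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        echo_estimate = sum(h * x for h, x in zip(path, history))
        residual = mic[n] - echo_estimate
        path = nlms_update(path, history, residual, step)
    return path
```

When the far-end history is all zeros and the residual is zero, `nlms_update` returns the path unchanged, which matches the no-far-end case described in the text.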
  • sending the echo-cancelled first voice signal to the remote device specifically includes:
  • the environmental noise signal and the residual echo signal in the first voice signal after echo cancellation are removed, and the obtained voice signal is sent to the remote device.
  • Subsequent speech enhancement processing includes removing environmental noise signals and residual echo signals.
  • the following uses an example to further illustrate the embodiments of the present application.
  • the example is described with a terminal system as the execution subject. It is assumed that the near-end device in a voice call is a mobile phone.
  • the mobile phone shown in Figure 1 is taken as an example.
  • the phone is equipped with two audio collection devices, the top microphone 201 and the bottom microphone 202, and also includes a receiver 203 and a speaker 204.
  • both the top microphone 201 and the bottom microphone 202 can collect a first voice signal, and both the receiver 203 and the speaker 204 can play the received far-end voice.
  • Fig. 3 shows a schematic diagram of the principle of the mobile phone in this example for selecting a target audio collection device.
  • the mobile phone may include a call state estimation and microphone selector 301, an echo estimator 302, an echo canceller 303, and a speech enhancement processor 304.
  • the call state estimation and microphone selector 301 is used to determine the voice call state at each moment, and to determine the target microphone according to the voice call state at the historical moment and the signal energy of the voice signals collected by the top microphone and the bottom microphone at the current moment.
  • the echo estimator 302 is used to estimate the echo signal at the current moment according to the input far-end speech signal.
  • the echo canceller 303 is used to perform echo cancellation on the input voice signal according to the input echo signal. The echo canceller 303 can be understood as an adder, where "-" and "+" represent removal and accumulation of the input signal, respectively.
  • the speech enhancement processor 304 is configured to perform subsequent enhancement processing on the input speech signal, including removing residual echo signals and environmental noise signals.
  • call state estimation and microphone selector 301, echo estimator 302, and speech enhancement processor 304 may be physical devices with corresponding functions, or may be applications capable of implementing corresponding functions.
  • the implementation process of the voice call in the mobile phone at the current moment may include the following steps:
  • Step 1-1 After the mobile phone receives the far-end voice signal, it plays the voice of the far-end talker through the speaker or the receiver. The top microphone and the bottom microphone collect the sound signal of the near-end speaker, the sound signal of the far-end speaker, and the environmental noise signal to obtain two corresponding first voice signals, and input the two first voice signals to the call state estimation and microphone selector 301.
  • Step 1-2 The call state estimation and microphone selector 301 determines the target microphone according to the previously obtained voice call state at the historical moment and the two first voice signals input by the top microphone and the bottom microphone, and inputs the first voice signal collected by the target microphone into the echo canceller 303.
  • Steps 1-3 the echo estimator 302 estimates the echo signal according to the input far-end speech signal, and inputs the echo signal into the echo canceller 303.
  • Steps 1-4 The echo canceller 303 performs echo cancellation on the first voice signal collected by the target microphone according to the echo signal input by the echo estimator 302, and inputs the echo canceled first voice signal to the voice enhancement processor 304.
  • the speech enhancement processor 304 performs further speech enhancement processing on the echo-cancelled first voice signal, which may include removing environmental noise signals and residual echo signals, and then sends the enhanced first voice signal to the far-end device.
  • the echo canceller 303 also inputs the echo-cancelled first voice signal to the call state estimation and microphone selector 301 and to the echo estimator 302. The call state estimation and microphone selector 301 determines the voice call state at the current moment from this input, for use in determining the target microphone at the next moment, and the echo estimator 302 can update itself, for example by updating the echo propagation path function, based on the residual echo signal in the echo-cancelled first voice signal.
  • Fig. 4 shows a schematic diagram of an alternative structure of the call state estimation and microphone selector.
  • the call state estimation and microphone selector may include: a first peak envelope detection module 401, a second peak envelope detection module 402, a far-end voice activity detection module 403, a near-end voice activity detection module 404, a call state estimation module 405, a microphone selection module 406, and an audio mixing module 407.
  • the first peak envelope detection module 401 is used to detect the size of the peak envelope of the voice signal collected by the top microphone
  • the second peak envelope detection module 402 is used to detect the size of the peak envelope of the voice signal collected by the bottom microphone.
  • the far-end voice activity detection module 403 is used to detect whether there is a far-end voice signal at each call moment
  • the near-end voice activity detection module 404 is used to detect whether there is a near-end voice signal at each call moment.
  • the call state estimation module 405 is used to determine the call state at each moment according to whether there is a near-end voice signal and whether there is a far-end voice signal at that moment, that is, according to the determination results of the far-end voice activity detection module 403 and the near-end voice activity detection module 404.
  • the microphone selection module 406 is used to determine the target microphone selection result according to the input peak envelope size of the voice signal collected by the top microphone and the input peak envelope size of the voice signal collected by the bottom microphone.
  • the sound mixing module 407 is configured to output the first voice signal collected by the target microphone according to the input target microphone selection result.
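The source does not describe how modules 401 and 402 compute the peak envelope. One conventional approach is a peak follower with instant attack and exponential release, sketched below; the function name `peak_envelope` and the `release` factor are assumptions.

```python
def peak_envelope(samples, release=0.99):
    """Track the peak envelope of a signal (assumed instant-attack follower).

    The envelope jumps up to any larger absolute sample value and otherwise
    decays by the factor `release` per sample.
    """
    envelope = 0.0
    out = []
    for x in samples:
        level = abs(x)
        envelope = level if level > envelope else envelope * release
        out.append(envelope)
    return out
```

The last value of the returned list can then serve as the "peak envelope size" handed to the microphone selection step.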
  • the mixing module 407 may be a physical device with corresponding functions, or may be an application program capable of implementing corresponding functions. Based on the structure shown in FIG. 4, the process of determining the target microphone of the mobile phone at the current moment may include the following steps:
  • Step 2-1 The first peak envelope detection module 401 detects the peak envelope size of the first voice signal collected by the top microphone, the second peak envelope detection module 402 detects the peak envelope size of the first voice signal collected by the bottom microphone, and the two peak envelope sizes are input to the microphone selection module 406.
  • Step 2-2 The microphone selection module 406 determines the target microphone selection result according to the voice call state at the historical moment determined by the call state estimation module 405 and the two input peak envelope sizes, and inputs the target microphone selection result to the mixing module 407.
  • the call state estimation module 405 determines the voice call state at the historical moment according to the first determination result, determined by the far-end voice activity detection module 403, of whether there is a far-end voice signal at the historical moment, and the second determination result, determined by the near-end voice activity detection module 404, of whether there is a near-end voice signal at the historical moment.
  • if the voice call state at the historical moment is far-end single talk, the microphone selection module 406 determines the microphone corresponding to the first voice signal with the smaller signal energy as the target microphone; if the voice call state at the historical moment is near-end single talk, the microphone selection module 406 determines the microphone corresponding to the first voice signal with the larger signal energy as the target microphone; if the voice call state at the historical moment is two-end intercom or no one speaking, the microphone selection module 406 determines the target microphone at the historical moment as the target microphone.
  • Step 2-3 The audio mixing module 407 mixes and selects the first voice signal collected by the two microphones according to the input target microphone selection result, and outputs the voice signal of the target microphone.
  • a smooth transition time window can be set to ensure continuous transition.
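The smooth transition window can be sketched as a cross-fade between the previously selected microphone and the newly selected one. The linear fade and the names `crossfade_switch` and `window` are assumptions; the source only states that a window can be set.

```python
def crossfade_switch(old_mic, new_mic, window):
    """Blend from old_mic to new_mic over the first `window` samples.

    Uses an assumed linear ramp; after the window the output is new_mic only.
    """
    out = []
    for n, (a, b) in enumerate(zip(old_mic, new_mic)):
        weight = min(1.0, (n + 1) / window) if window > 0 else 1.0
        out.append((1.0 - weight) * a + weight * b)
    return out
```

With `window=0` the switch is immediate, which corresponds to hard selection without a transition.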
  • the call state estimation module 405 also needs to further determine the voice call state at the current moment for the selection of the target microphone at the next moment.
  • the process may specifically include the following steps:
  • Step 3-1 The far-end voice activity detection module 403 determines whether there is a far-end voice signal at the current moment according to the input far-end voice signal at the current moment (the far-end voice shown in the figure), and the near-end voice activity detection module 404 determines whether there is a near-end voice signal at the current moment according to the input echo-cancelled first voice signal collected by the target microphone at the current moment (the first voice after echo cancellation shown in the figure). The two determination results are input to the call state estimation module 405.
  • Step 3-2 The call state estimation module 405 determines the voice call state at the current moment according to the two input confirmation results.
  • if there is a far-end voice signal and no near-end voice signal, the voice call state at the current moment is far-end single talk; if there is no far-end voice signal and there is a near-end voice signal, the voice call state at the current moment is near-end single talk; if there is both a far-end voice signal and a near-end voice signal, the voice call state at the current moment is two-end intercom; if there is neither a far-end voice signal nor a near-end voice signal, the voice call state at the current moment is no one speaking.
  • the solution provided by the embodiments of the application selects the target audio collection device through a comprehensive analysis of the voice signals collected by the multiple audio collection devices of the terminal system, the voice signal played by the audio playback device, and the call state of the device. Compared with related technologies, it can effectively improve the overall performance of voice calls.
  • Figure 5 shows a schematic diagram of the effect of the terminal system on microphone selection in a hands-free call scenario.
  • the mobile phone includes two microphones, namely microphone a and microphone b.
  • the mobile phone conducts voice calls hands-free.
  • the time-domain waveform of the voice signal collected by microphone a is shown in waveform a in the figure
  • the time-domain waveform of the voice signal collected by microphone b is shown in waveform b in the figure
  • the time-domain waveform of the voice signal played by the speaker is shown in the c waveform in the figure
  • the target microphone selection result is shown in the curve d in the figure.
  • the result shown by S1 in the curve d indicates that the target microphone is a
  • the result shown in S2 in the curve d indicates that the target microphone is b.
  • the abscissa represents time (only part of the time is shown in the figure), and the unit is seconds (s).
  • the ordinate represents the magnitude of signal energy, specifically the amplitude of the signal.
  • the interval between two adjacent moments is 0.1s
  • the historical time is 0.2s
  • the actual voice detection result at 0.2s is that there is no near-end voice signal and there is a far-end voice signal, so the voice call state at 0.2s is determined to be far-end single talk, and the microphone corresponding to the voice signal with the smaller signal energy should be selected as the target microphone at 0.3s. Waveform a and waveform b show that at 0.3s the signal energy of the voice signal collected by microphone a is less than the signal energy of the voice signal collected by microphone b, so microphone a should be selected as the target microphone at 0.3s.
  • the actual detection result of the voice signal is that there is neither a far-end voice signal nor a near-end voice signal.
  • the two microphones collected essentially no signal during this time period, and no far-end voice signal was detected, that is, no far-end voice signal was received and the speaker played no voice signal. The voice call state at each moment in the time period can therefore be determined to be no one speaking, and the target microphone at the historical moment can be determined as the target microphone at the current moment; that is, microphone a is still selected as the target microphone at each moment in the period.
  • the target microphone in this period is microphone b
  • the selection process of the target microphone is as follows: the actual voice detection result in this period is that a near-end voice signal is present, so the microphone whose collected voice signal has the greater signal energy should be selected as the target microphone at each moment in this period. Waveform a and waveform b show that the energy of the voice signal collected by microphone b is greater than the energy of the voice signal collected by microphone a during this period, so microphone b is selected as the target microphone at every moment in the period.
  • the corresponding historical time is 4.0s
  • the voice detection result at 4.0s is that there are both a near-end voice signal and a far-end voice signal, so the voice call state at 4.0s is determined to be two-end intercom, and the target microphone at the historical moment is determined as the target microphone at the current moment; that is, the target microphone at 4.0s, microphone a, is used as the target microphone at 4.1s.
  • the selection of the target microphone at each moment of the voice call in the above example can be realized, which will not be repeated here. It has been verified by experiments that the solution provided by this application can select the corresponding target microphone in a specific voice call state, which can effectively improve the voice call effect.
  • FIG. 6 is a schematic flowchart of a voice call method provided by an embodiment of the application. As shown in FIG. 6, the method may include the following steps.
  • Step 501 Receive a user's voice call trigger operation.
  • the triggering operation of the voice call refers to an instruction to start the voice call, which may be a user's click operation on the corresponding voice call application, or an instruction of the user to start the voice call through voice or text input.
  • Step 502 based on the voice call trigger operation, turn on the audio playback device and at least two audio collection devices on the terminal system.
  • the specific device type of the terminal system is not limited in this embodiment of the application, as long as it is a device that can conduct voice calls, including but not limited to mobile phones, PADs, etc.
  • the audio playback device set on it can be a speaker, and the audio collection device can be a microphone.
  • the embodiments of this application do not limit the specific type and number of the audio playback device and the at least two audio collection devices, or the positions of the audio collection devices on the terminal system.
  • the terminal system can provide a corresponding interactive interface for voice calls.
  • the corresponding position on the interactive interface can display the icon of the voice playback device and at least two audio collection device icons.
  • the color or shape of the icon indicates the corresponding device.
  • Step 503 For the initial moment of the voice call, use the audio collection device of the at least two audio collection devices corresponding to the pre-configured information as the target audio collection device, and determine the voice call state at the initial moment.
  • the target audio collection device corresponding to the pre-configured information can be a pre-designated one of the at least two audio collection devices or any one of them, or can be selected by an existing method for determining the target audio collection device, for example by determining the target audio collection device at the initial moment based on the call scene.
  • Step 504 For the current time of the voice call except the initial time, based on the voice signal collected by the target audio collection device determined by the method provided in the above embodiment, a voice call is performed with the opposite terminal device.
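Steps 503 and 504 can be sketched as a loop over call moments in which the state observed at one moment drives the selection at the next. The function `run_call`, its input layout, and the default pre-configured device index are hypothetical. In the described scheme the near-end detection comes from the echo-cancelled target signal, whereas here both detection flags are simply supplied as inputs.

```python
def run_call(moments, initial_target=0):
    """Return the target device index chosen at each moment of a call.

    moments -- list of (far_end_present, near_end_present, energies) tuples,
               where energies holds one signal energy per audio collection device
    initial_target -- assumed pre-configured device used at the initial moment
    """
    targets = []
    target = initial_target
    previous = None  # (far_end_present, near_end_present) at the historical moment
    for far_present, near_present, energies in moments:
        if previous is not None:
            prev_far, prev_near = previous
            if prev_far and not prev_near:
                target = energies.index(min(energies))  # far-end single talk
            elif prev_near and not prev_far:
                target = energies.index(max(energies))  # near-end single talk
            # otherwise keep the target from the historical moment
        targets.append(target)
        previous = (far_present, near_present)
    return targets
```

Because the decision at each moment uses the state of the previous moment, a change of talker affects the selection one moment later.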
  • the voice call method uses the voice call state at the historical moment, combined with the signal energy of the voice signal collected by each audio collection device, to determine as the target audio collection device at the current moment the audio collection device whose voice signal is more conducive to subsequent speech enhancement processing in the specific voice call state.
  • because the process of determining the target audio collection device does not rely only on the signal energy of the voice signal collected by each audio collection device or on the call scene of the near-end device, it avoids the prior-art problem of a large echo or a weak near-end voice in the voice signal collected by the selected target audio collection device, and improves the voice call effect.
  • the voice call method provided in the embodiments of the present application can be applied to any terminal system with multiple microphones (taking dual microphones as an example) in the voice call process.
  • it can be applied to related applications involving voice call scenarios, using the dual microphones on the terminal system to suppress the echo during the call, enhance the near-end voice volume, and improve the call quality.
  • APP voice conference application
  • the user when you open the application, the user (the avatar in the figure represents the current user) can enter the meeting interface, and after turning on the microphone, you can start speaking, as shown in the figure
  • the user can also invite (by clicking the invite button) other users to participate in the session on the meeting interface, and can also perform screen sharing, video recording by turning on the camera, and APP settings.
  • the user's speech will be collected by two microphones on the terminal system, and the voices of other online users will also be collected by the microphone after being played by the device, causing other online users to hear their own speech, that is, echo.
  • the echo canceller can be built into the APP to eliminate the echo of other users collected by the microphone, and only retain the voice of the local user to improve the conference experience.
  • the dual-microphone voice enhancement module of the terminal system (which can be used specifically for determining the target microphone, transmitting voice signals, and so on) can be used to select the target microphone and to send a voice signal to the terminal systems of other users based on the voice signal collected by the selected target microphone. Note that in practical applications the dual-microphone voice enhancement module can be turned on or off automatically when the microphone switch is turned on or off, without the user having to perform other operations such as switching microphones.
  • the embodiment of the present application also provides a voice call device.
  • the device 600 may include a call state acquisition module 601, a signal energy acquisition module 602, and a target audio collection device determining module 603, where:
  • the call status acquisition module 601 is used to acquire the voice call status of the terminal system at a historical moment, and at least two audio collection devices are provided on the terminal system;
  • the signal energy acquisition module 602 is configured to acquire the first voice signal collected by each audio collection device at the current moment, and respectively determine the signal energy of each first voice signal;
  • the target audio collection device determining module 603 is configured to determine the target audio collection device from each audio collection device based on the voice call state at the historical moment and the signal energy of each first voice signal.
  • the voice call device uses the voice call state at the historical moment, combined with the signal energy of the voice signal collected by each audio collection device, to determine as the target audio collection device at the current moment the audio collection device whose voice signal is more conducive to subsequent speech enhancement processing in the specific voice call state.
  • because the process of determining the target audio collection device does not rely only on the signal energy of the voice signal collected by each audio collection device or on the call scene of the near-end device, it avoids the prior-art problem of a large echo or a weak near-end voice in the voice signal collected by the selected target audio collection device, and improves the voice call effect.
  • the device further includes a call status determination module, which is used to determine the voice call status of the terminal system; when determining the voice call status at a historical moment, the module is specifically used to:
  • the voice call state at the historical moment is determined.
  • the call status determination module determines whether there is a near-end voice signal at a historical moment, it is specifically used to:
  • the voice call state includes far-end single talk, near-end single talk, two-end intercom, or unmanned talk.
  • the call status determination module determines the voice call status at a historical moment according to the first determination result and the second determination result, it is specifically used to:
  • the voice call state at the historical moment is the far-end single talk
  • the voice call state at the historical moment is near-end single talk
  • the voice call state at the historical moment is two-end intercom
  • the voice call state at the historical moment is that no one is speaking.
  • the target audio collection device determining module 603 is specifically configured to:
  • the audio collection device corresponding to the first voice signal with the smallest signal energy is determined as the target audio collection device at the current moment
  • the audio collection device corresponding to the first voice signal with the largest signal energy is determined as the target audio collection device at the current moment;
  • the target audio collection device at the historical moment is determined as the target audio collection device at the current moment.
  • the target audio collection device determining module 603 is further configured to:
  • the voice call state at the historical moment is far-end single talk
  • the device further includes a signal sending module, configured to:
  • the echo-cancelled first voice signal is sent to the peer device of the voice call.
  • when the signal sending module performs echo cancellation on the first voice signal collected by the target audio collection device at the current moment, it is specifically configured to:
  • the echo propagation path function at the current moment is obtained in the following manner:
  • the echo propagation path function at the historical moment is updated to obtain the echo propagation path function at the current moment.
  • FIG. 9 is a structural block diagram of a voice call device provided by an embodiment of the application.
  • the device 700 may include: a trigger operation receiving module 701, a device activation module 702, an initial determination module 703, and a voice call module 704, where:
  • the trigger operation receiving module 701 is used to receive the user's voice call trigger operation
  • the device activation module 702 is configured to activate an audio playback device and at least two audio collection devices on the terminal system based on a voice call trigger operation;
  • the initial determination module 703 is configured to use the audio collection device of the at least two audio collection devices corresponding to the pre-configured information as the target audio collection device for the initial moment of the voice call, and determine the voice call state at the initial moment;
  • the voice call module 704 is configured to, for a current moment of the voice call other than the initial moment, conduct the voice call with the peer device based on the voice signal collected by the target audio collection device determined by the method provided in the first aspect, any optional embodiment of the first aspect, or the third aspect.
  • the voice call device uses the voice call state at the historical moment, combined with the signal energy of the voice signal collected by each audio collection device, to identify the voice signal that is more conducive to subsequent voice enhancement processing in a specific voice call state
  • the audio collection device corresponding to that voice signal is determined as the target audio collection device at the current moment.
  • the process of determining the target audio collection device does not rely solely on the signal energy of the voice signal collected by each audio collection device or on the call scene of the near-end device
  • this avoids the prior-art problem of a large echo or a weak near-end voice in the voice signal collected by the identified target audio collection device, and improves the voice call effect.
  • an embodiment of the present application also provides an electronic device, which includes a memory, a processor, an audio playback device, and at least two audio collection devices, where the audio playback device is used to play voice signals; the at least two audio collection devices are used to collect voice signals; a computer program is stored in the memory; when the processor executes the computer program, the method provided in any embodiment of the present application is implemented, specifically in the following cases:
  • Case 1: obtain the voice call state of the terminal system at the historical moment, where at least two audio collection devices are provided on the terminal system; obtain the first voice signal collected by each audio collection device at the current moment, and determine the signal energy of each first voice signal; based on the voice call state at the historical moment and the signal energy of each first voice signal, determine the target audio collection device at the current moment from the audio collection devices.
  • Case 2: receive the user's voice call trigger operation; based on the voice call trigger operation, turn on the audio playback device and at least two audio collection devices on the terminal system; for the initial moment of the voice call, use the audio collection device, among the at least two audio collection devices, that corresponds to the pre-configured information as the target audio collection device, and determine the voice call state at the initial moment; for a current moment of the voice call other than the initial moment, conduct the voice call with the peer device based on the voice signal collected by the target audio collection device determined by the method provided in Case 1.
  • the embodiments of the present application also provide a computer-readable storage medium, and the computer-readable storage medium stores a computer program, and when the program is executed by a processor, the method shown in any embodiment of the present application is implemented. It can be understood that the computer-readable storage medium stores a computer program corresponding to the voice call method provided in any embodiment of the present application.
  • FIG. 10 shows a schematic structural diagram of an electronic device to which an embodiment of the present application is applicable.
  • the electronic device 800 shown in FIG. 10 includes: a processor 801, a memory 803, an audio playback device 805, and at least two audio collection devices 806.
  • the processor 801, the audio playback device 805, and at least two audio collection devices 806 are connected to the memory 803, for example, connected via a bus 802.
  • the electronic device 800 may further include a transceiver 804.
  • the electronic device 800 can exchange data with other electronic devices through the transceiver 804.
  • the transceiver 804 is not limited to one, and the structure of the electronic device 800 does not constitute a limitation to the embodiment of the present application.
  • the processor 801 is applied in the embodiment of the present application, and is used to implement the function of the voice call device shown in FIG. 8 or FIG. 9.
  • the processor 801 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA, or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. It can implement or execute various exemplary logical blocks, modules, and circuits described in conjunction with the disclosure of this application.
  • the processor 801 may also be a combination that implements computing functions, for example, includes a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and so on.
  • the bus 802 may include a path for transferring information between the above-mentioned components.
  • the bus 802 may be a PCI bus, an EISA bus, or the like.
  • the bus 802 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used to represent in FIG. 10, but it does not mean that there is only one bus or one type of bus.
  • the memory 803 can be a ROM or another type of static storage device that can store static information and instructions, a RAM or another type of dynamic storage device that can store information and instructions, or an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited to these.
  • the memory 803 is used to store application program codes for executing the solutions of the present application, and the processor 801 controls the execution.
  • the processor 801 is configured to execute application program codes stored in the memory 803 to implement the actions of the voice call device provided in the embodiment shown in FIG. 8 or FIG. 9.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Environmental & Geological Engineering (AREA)
  • Quality & Reliability (AREA)
  • Telephone Function (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

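This application provides a voice call method and device, an electronic device, and a computer-readable storage medium. The method includes: obtaining the voice call state of a terminal system at a historical moment, where at least two audio collection devices are provided on the terminal system; obtaining the first voice signal collected by each audio collection device at the current moment, and determining the signal energy of each first voice signal; and determining the target audio collection device at the current moment from the audio collection devices, based on the voice call state at the historical moment and the signal energy of each first voice signal.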

Description

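Voice call method and device, electronic device, and computer-readable storage medium
This application claims priority to Chinese patent application No. 201910906728.3, entitled "Voice call method and device, electronic device, and computer-readable storage medium", filed with the China National Intellectual Property Administration on September 24, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a voice call method and device, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of science and technology, people can make voice calls through terminal systems such as smartphones, smart watches, and tablet computers. To improve call quality, terminal system manufacturers equip their devices with dual microphones for sound collection. Dual microphones yield two corresponding voice signals, on which a voice enhancement scheme can be designed.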
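Summary
In a first aspect, an embodiment of this application provides a voice call method, performed by an electronic device, the method including:
obtaining the voice call state of a terminal system at a historical moment, where at least two audio collection devices are provided on the terminal system;
obtaining the first voice signal collected by each audio collection device at the current moment, and determining the signal energy of each first voice signal;
determining the target audio collection device at the current moment from the audio collection devices, based on the voice call state at the historical moment and the signal energy of each first voice signal.
In a second aspect, an embodiment of this application provides a voice call method, performed by an electronic device, the method including:
receiving a voice call trigger operation from a user;
turning on, based on the voice call trigger operation, an audio playback device and at least two audio collection devices on the terminal system;
for the initial moment of the voice call, using the audio collection device, among the at least two audio collection devices, that corresponds to pre-configured information as the target audio collection device, and determining the voice call state at the initial moment;
for a current moment of the voice call other than the initial moment, conducting the voice call with the peer device based on the voice signal collected by the target audio collection device determined by the method provided in the first aspect or any embodiment of the first aspect.
In a third aspect, an embodiment of this application provides a voice call device, the device including:
a call state acquisition module, configured to obtain the voice call state of a terminal system at a historical moment, where at least two audio collection devices are provided on the terminal system;
a signal energy acquisition module, configured to obtain the first voice signal collected by each audio collection device at the current moment, and determine the signal energy of each first voice signal;
a target audio collection device determining module, configured to determine the target audio collection device from the audio collection devices, based on the voice call state at the historical moment and the signal energy of each first voice signal.
In a fourth aspect, an embodiment of this application provides a voice call device, the device including:
a trigger operation receiving module, configured to receive a voice call trigger operation from a user;
a device activation module, configured to turn on, based on the voice call trigger operation, an audio playback device and at least two audio collection devices on the terminal system;
an initial determination module, configured to, for the initial moment of the voice call, use the audio collection device, among the at least two audio collection devices, that corresponds to pre-configured information as the target audio collection device, and determine the voice call state at the initial moment;
a voice call module, configured to, for a current moment of the voice call other than the initial moment, conduct the voice call with the peer device based on the voice signal collected by the target audio collection device determined by the method provided in the first aspect or any embodiment of the first aspect.
In a fifth aspect, an embodiment of this application provides an electronic device including a memory, a processor, an audio playback device, and at least two audio collection devices;
the audio playback device is configured to play voice signals;
the at least two audio collection devices are configured to collect voice signals;
the memory stores a computer program;
the processor is configured to execute the computer program to implement the method provided in the first aspect or the second aspect.
In a sixth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method provided in the first aspect or the second aspect.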
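Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the drawings used in describing the embodiments are briefly introduced below.
FIG. 1 is a schematic structural diagram of a mobile phone terminal;
FIG. 2 is a schematic flowchart of a voice call method provided by an embodiment of this application;
FIG. 3 is a schematic diagram of the implementation process of a voice call in an example of an embodiment of this application;
FIG. 4 is a schematic diagram of the implementation process of call state estimation and microphone selection in an example of an embodiment of this application;
FIG. 5 is a schematic diagram of target microphone selection results in an example of an embodiment of this application;
FIG. 6 is a schematic flowchart of a voice call method provided by an embodiment of this application;
FIG. 7 is a schematic diagram of an application scenario in an example of this application;
FIG. 8 is a structural block diagram of a voice call device provided by an embodiment of this application;
FIG. 9 is a structural block diagram of a voice call device provided by an embodiment of this application;
FIG. 10 is a schematic structural diagram of an electronic device provided by an embodiment of this application.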
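Detailed Description
Embodiments of this application are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain this application, and shall not be construed as limiting the invention.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural. It should be further understood that the word "include" used in the specification of this application indicates the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include a wireless connection or wireless coupling. The phrase "and/or" as used herein includes all or any one of, and all combinations of, one or more of the associated listed items.
To make the objectives, technical solutions, and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the drawings.
First, several terms used in this application are introduced and explained: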
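Near end: the local end of the communication network in a voice call.
Far end: the peer end of the communication network in a voice call.
Near-end device: the call device used by the near-end speaker in a voice call. It is provided with audio collection devices (for example, microphones) and audio playback devices (for example, a speaker and an earpiece).
Far-end device: the call device used by the far-end speaker in a voice call. It is provided with audio collection devices (for example, microphones) and audio playback devices (for example, a speaker and an earpiece).
Near-end voice signal: in a voice call, the voice signal of the near-end speaker's speech as collected by the audio collection device of the near-end device.
Far-end voice signal: in a voice call, the voice signal of the far-end speaker's speech that is collected by the audio collection device of the far-end device and then transmitted to the near-end device over the communication network.
Echo signal: in a voice call, the voice signal collected by the audio collection device of the near-end device after the far-end voice signal is played by the audio playback device of the near-end device.
Echo cancellation: the process of filtering the echo signal out of the voice signal collected by the audio collection device of the near-end device.
Far-end single talk: the call state in which a far-end voice signal is present and no near-end voice signal is present.
Near-end single talk: the call state in which no far-end voice signal is present and a near-end voice signal is present.
Double talk: the call state in which both a far-end voice signal and a near-end voice signal are present.
No one talking: the call state in which neither a far-end voice signal nor a near-end voice signal is present.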
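In a call system with a dual-microphone terminal system, the microphone signal with the higher amplitude is usually used as the input for subsequent processing. This choice effectively enhances the voice when only near-end voice is present. However, in a scene with strong far-end voice, both microphones may pick up a strong echo, and choosing the signal with the larger amplitude is then likely to select the microphone signal with the larger echo. As a result, voice enhancement falls short of the expected effect and may even reduce voice call quality.
Suppose the two devices in a voice call are A and B. From the perspective of user a of device A, device A is the near-end device, and B is the corresponding far-end device, that is, the peer device. Likewise, for user b of device B, device B is the near-end device and device A is the corresponding far-end device.
The following description takes device A as the near-end device. In that case, the voice signal of the local speaker (user a) collected by A's audio collection device is the near-end voice signal, and the voice signal that B sends to A (produced by the peer speaker, user b) is the far-end voice signal. The voice signal collected by A's audio collection device after the far-end voice signal is played by A's audio playback device is the echo signal, and the process of removing the echo signal from the voice signal collected by A's audio collection device is echo cancellation. During a voice call between user a and user b, from A's point of view: when A receives a far-end voice signal from B (user b is talking) and the voice signal collected by A's audio collection device contains no near-end voice signal (user a is not talking), the call state is far-end single talk; when user b is not talking and only user a is talking, the call state is near-end single talk; when both users are talking, the call state is double talk; when neither user is talking, the call state is no one talking.
In a call system whose terminal system has two audio collection devices, one of the two microphones must be selected as the input microphone for voice enhancement. The related art generally selects the input microphone in one of the following two ways:
One way selects the input microphone by the amplitude of the collected voice signals: of the two microphones, the one whose collected voice signal has the higher amplitude is selected, so the voice signal with the highest amplitude becomes the input for subsequent voice enhancement processing. However, when a strong far-end voice signal is present in the call, the voice signals collected by both microphones contain strong echo, and the microphone with the larger amplitude may be the one with the larger echo. This causes echo leakage and lowers voice call quality.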
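The other way selects the input microphone according to the call scene of the terminal system. Taking a dual-microphone terminal system as an example, FIG. 1 shows a common layout of the audio collection devices and audio playback devices of a mobile phone. As shown in FIG. 1, the phone has a top microphone (top mic for short) 201 above the screen, a bottom microphone (bottom mic for short) 202 below the screen, an earpiece 203 at the top of the screen, and a speaker 204 at the bottom of the screen.
In the hands-free scene, the speaker 204 at the bottom of the phone plays the far-end voice signal, which the microphones pick up as echo. Because the bottom mic 202 is close to the speaker, it collects a relatively large echo; the near-end speaker is far from the phone, so the two microphones collect near-end voice of similar energy. In this case the signal collected by the top mic is the better input for subsequent processing.
In the handheld scene, the earpiece 203 at the top of the phone plays the far-end voice signal, and the near-end speaker holds the phone and speaks close to the lower part of the screen. The bottom mic 202 then collects a larger near-end voice signal while the top mic 201 collects a larger echo, so the signal collected by the bottom mic is the better choice for subsequent processing.
In short, selecting the bottom mic in the handheld scene and the top mic in the hands-free scene is the more reasonable choice. Real call scenes are more complex, however. In the hands-free scene the user may also bring their mouth close to the bottom mic 202, in which case selecting the top mic 201 cannot effectively capture the near-end voice signal. Moreover, audio playback and collection devices vary widely across phone models: microphones can be placed in many positions, and some phones have dual speakers for stereo playback. All of this makes it impossible to tie microphone selection simply to the scene. Scene-based microphone selection therefore has a narrow range of application and cannot guarantee that the selected microphone is reasonable.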
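To address the above technical problems in existing voice call schemes, the embodiments of this application provide a voice call method that offers a more reasonable way of selecting the audio collection device and can effectively improve the voice call effect.
The technical solutions of this application, and how they solve the above technical problems, are described in detail below through specific embodiments. The specific embodiments below can be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of this application are described below with reference to the drawings.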
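FIG. 2 is a schematic flowchart of a voice call method provided by an embodiment of this application. The method may be performed by an electronic device such as a terminal system or a server. As shown in FIG. 2, the method may include the following steps.
Step S101: obtain the voice call state of the terminal system at a historical moment, where at least two audio collection devices are provided on the terminal system.
Note that the terminal system may be a terminal device that integrates components such as an audio playback device, audio collection devices, and a processor. The embodiments of this application do not limit the specific type of the terminal device, as long as it can make voice calls; examples include but are not limited to mobile phones and tablets (PADs). The terminal system may also be a voice call system assembled from mutually independent audio playback devices, audio collection devices, and a processor. For example, the terminal system may be a video conference system containing multiple audio collection devices (such as microphones), one or more audio playback devices (such as speakers), and a processor, where the audio collection and playback devices can be distributed according to actual needs such as the conference venue.
It can be understood that when the terminal system is a terminal device integrating these components, the method may be performed by the processor integrated in the terminal device or by a server corresponding to the terminal device; when the terminal system is a voice call system assembled from multiple independent devices, the method may be performed by the processor in the terminal system or by a server corresponding to the terminal system. In short, the method may be performed by an electronic device such as a terminal system or a server.
For the terminal system performing the voice call method, that terminal system is the near-end device of the current voice call, and the peer device with which it conducts the call is the far-end device.
Each moment in a voice call can be understood as a point in time at which the target audio collection device is re-determined. The historical moment may include one or more moments, and the interval between two moments can be set as needed. For example, if the interval is set to 0.02 seconds and the current moment is 0.20 s into the call, the historical moment is 0.18 s into the call.
The voice call state characterizes the near-end and far-end voice conditions in the call: from the voice call state at any moment, it can be determined whether a near-end voice signal and a far-end voice signal are present at that moment. For example, if the voice call state at 0.20 s into the call is far-end single talk, then a far-end voice signal is present at 0.20 s and no near-end voice signal is present.
The audio collection devices provided on the terminal system may be microphones or other types of audio collection device. The embodiments of this application do not limit the specific type or number of the at least two audio collection devices, or their positions on the terminal system. For a mobile phone, for example, the at least two audio collection devices may be two microphones placed above and below the front of the screen, as in the dual-microphone arrangement shown in FIG. 1, or arranged otherwise, for example on the back of the phone.
Step S102: obtain the first voice signal collected by each audio collection device at the current moment, and determine the signal energy of each first voice signal.
The first voice signal may contain a near-end voice signal, an echo signal, an environmental noise signal, and so on. In a voice call, the echo signal and environmental noise signal need to be removed so that what is transmitted to the far-end device is the near-end voice signal. The first voice signals collected by the different audio collection devices of the near-end device differ in both the kinds of signal they contain and their signal energy. The signal energy of each first voice signal reflects the magnitude of the voice signals it contains, and can therefore serve as a basis for subsequently determining the target audio collection device. In practice, the signal energy of a voice signal can be determined from its amplitude, its peak envelope, or the like.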
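As a minimal sketch of step S102, the signal energy of one frame could be measured as the mean squared amplitude, with the peak amplitude as an alternative measure. The function names and the frame-as-list representation are illustrative assumptions; the application only says the energy can be derived from the amplitude or peak envelope.

```python
from typing import Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (hypothetical energy measure)."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)


def peak_amplitude(frame: Sequence[float]) -> float:
    """Largest absolute sample value in the frame; 0.0 for an empty frame."""
    return max((abs(s) for s in frame), default=0.0)
```

Either measure yields one number per audio collection device per moment, which is all the later selection step needs.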
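Note that the step numbers of S101 and S102 do not constrain their order. Step S101 may be performed before step S102, step S102 before step S101, or both at the same time. In other words, the embodiments of this application do not limit the order of obtaining the voice call state of the near-end device at the historical moment and obtaining the signal energy of the first voice signals collected by the audio collection devices at the current moment.
Step S103: based on the voice call state at the historical moment and the signal energy of each first voice signal, determine the target audio collection device at the current moment from the audio collection devices.
Specifically, in practice the voice call state generally stays unchanged over a short interval of a voice call. The voice call state at the current moment can therefore be estimated from the voice call state at the historical moment, that is, the state at the historical moment is taken as the state at the current moment. When the historical moment contains a single moment, the voice call state at that moment is used, and that moment may be the one immediately preceding the current moment. When the historical moment includes multiple moments, its voice call state can be determined in either of the following ways: obtain the voice call state at each moment and take the state that occurs most often; or take the state at the moment closest to the current moment.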
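The two ways of reducing several historical moments to one voice call state can be sketched as follows. The state labels, the `mode` parameter, and the tie-breaking rule (prefer the most recent of the tied states) are assumptions for illustration; the application does not say how ties are resolved.

```python
from collections import Counter
from typing import Sequence


def history_state(states: Sequence[str], mode: str = "majority") -> str:
    """Reduce per-moment call states (oldest first) to one historical state."""
    if not states:
        raise ValueError("at least one historical moment is required")
    if mode == "latest":
        # The moment closest to the current moment is the last one.
        return states[-1]
    counts = Counter(states)
    best = max(counts.values())
    # Assumed tie-break: among the most frequent states, prefer the most recent.
    for state in reversed(states):
        if counts[state] == best:
            return state
    raise AssertionError("unreachable")
```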
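Because different voice call states exist, the kinds of voice signal collected by the audio collection devices also differ. For example, in far-end single talk the audio collection devices collect the echo signal; in near-end single talk they collect the near-end voice signal (noise is generally present as well). The voice call state therefore indicates whether an echo signal, a near-end voice signal, and so on are present in the collected first voice signal; in other words, the kinds of signal contained in the first voice signal can be determined from the voice call state at the historical moment. For example, if the state at the historical moment is near-end single talk, then at the current moment a near-end voice signal is present and no far-end voice signal is present. Since the echo signal arises from the far-end voice signal, it can be concluded that the first voice signal contains no echo signal.
Once the kinds of signal in each first voice signal are determined from the voice call state at the historical moment, the signal energy of each first voice signal reveals the signal energy of the particular kind of voice signal it contains, that is, the energy of that kind of signal as collected by each audio collection device. For example, if the state at the historical moment is near-end single talk, each first voice signal is determined to contain a near-end voice signal, generally along with environmental noise. Because the environmental noise energy is roughly the same across the first voice signals, the energy of the near-end voice signal in a first voice signal is positively correlated with that first voice signal's total energy: the larger the energy of the first voice signal, the larger the energy of the near-end voice signal it contains, and thus the larger the near-end voice energy collected by the corresponding audio collection device. The audio collection device that collected the first voice signal with the larger energy can then be used as the target audio collection device.
In summary, based on the voice call state at the historical moment and the signal energy of each first voice signal, the relative magnitudes of the energy of a particular kind of voice signal collected by the audio collection devices in a particular voice call state can be determined.
Further, the first voice signal collected by the target audio collection device determined for the current moment is the one more conducive to subsequent voice enhancement processing in the corresponding call state. Generally, such a first voice signal contains a near-end voice signal with larger energy or an echo signal with smaller energy. Since those relative magnitudes can be determined as described above, the target audio collection device in a particular voice call state can be determined based on the voice call state at the historical moment and the signal energy of each first voice signal.
Thus, when the target audio collection device is determined, the voice call state at the historical moment is taken into account in addition to the signal energy of the first voice signals, which effectively avoids selecting the device whose first voice signal contains the largest echo. At the same time, the determination does not depend on the call scene of the near-end device, which also avoids selecting a device that fails to capture a usable near-end voice signal.
It can be understood that, except at the initial moment, the target audio collection device at any moment of the voice call can be determined by the method provided in the embodiments of this application. For the initial moment, the target audio collection device may be specified in advance, chosen arbitrarily from the at least two audio collection devices, or selected by an existing method, for example based on the call scene.
Note that when the method is performed by a server, the interaction between the terminal system and the server may include the following. At the initial moment of the voice call, the server sends pre-configured information about the target audio collection device to the terminal system, and the terminal system selects the target audio collection device from the at least two audio collection devices according to the received information; alternatively, the pre-configured information is stored in the terminal system itself, and the terminal system selects accordingly. At the current moment, the server receives the first voice signals collected by the at least two audio collection devices from the terminal system, obtains the signal energy of each, and determines the target audio collection device at the current moment based on the voice call state at the historical moment and those energies.
The voice call method provided by the embodiments of this application uses the voice call state at the historical moment, combined with the signal energy of the voice signals collected by the audio collection devices, to determine as the target audio collection device at the current moment the device whose voice signal is more conducive to subsequent voice enhancement processing in the specific voice call state. The determination does not rely solely on the signal energy of the collected voice signals or on the call scene of the near-end device. This avoids the problem in the related art that the voice signal collected by the determined target audio collection device has a large echo or a weak near-end voice, and improves the voice call effect.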
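In an embodiment of this application, the voice call state at the historical moment is determined as follows:
determine whether a far-end voice signal is present at the historical moment, to obtain a first determination result;
determine whether a near-end voice signal is present at the historical moment, to obtain a second determination result;
determine the voice call state at the historical moment according to the first determination result and the second determination result.
The voice call state indicates the near-end and far-end voice conditions in the call, so the corresponding voice call state can be determined from those conditions.
Specifically, whether a far-end voice signal is present at the historical moment can be determined by checking whether the terminal system received a far-end voice signal at that moment. For example, if the voice signal received by the terminal system at the historical moment contains the far-end speaker's voice (the far-end voice signal), a far-end voice signal is determined to be present. Whether a near-end voice signal is present at the historical moment can be determined by checking whether the voice signal collected at that moment by any audio collection device on the terminal system contains a near-end voice signal. For example, if it contains the near-end speaker's voice (the near-end voice signal), a near-end voice signal is determined to be present.
It can be understood that whether a near-end or far-end voice signal is present in a voice signal can be judged from characteristics such as signal energy and waveform. For example, a voice signal in the first voice signal whose signal energy falls within a preset range can be determined to be a near-end voice signal.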
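The energy-range check mentioned above could look like the sketch below. The mean-squared energy measure and the `low`/`high` bounds are hypothetical; the application only states that a preset range is used, and a practical detector would likely also use waveform features.

```python
from typing import Sequence


def has_near_end_voice(frame: Sequence[float], low: float, high: float) -> bool:
    """Return True when the frame energy lies inside the preset range [low, high]."""
    if not frame:
        return False
    energy = sum(s * s for s in frame) / len(frame)  # mean squared amplitude
    return low <= energy <= high
```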
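It can be understood that when the historical moment includes multiple moments, the scheme above is applied to determine the voice call state at each moment, and the voice call state at the historical moment is then determined from them as described earlier: take the state that occurs most often among the moments, or take the state at the moment closest to the current moment.
Note that when the method is performed by a server, the interaction between the terminal system and the server may include: at the historical moment, the server receives the far-end signal and the first voice signals sent by the terminal system; the server obtains the first determination result according to whether the received far-end voice signal is 0, and obtains the second determination result according to whether a near-end voice signal is present in the received first voice signals; it then determines the voice call state at the historical moment from the two results.
In an embodiment of this application, determining whether a near-end voice signal is present at the historical moment includes:
obtaining the second voice signal collected at the historical moment by the target audio collection device of the historical moment;
performing echo cancellation on the second voice signal, and determining whether a near-end voice signal is present in the echo-cancelled second voice signal.
Specifically, after the target audio collection device at the historical moment is determined, the second voice signal it collects must undergo echo cancellation and subsequent voice enhancement processing. The second voice signal may contain a near-end voice signal, an echo signal, an environmental noise signal, and so on. After echo cancellation, the second voice signal can be regarded as no longer containing an echo signal, so the influence of the echo is excluded when checking for a near-end voice signal, which makes the result more accurate. Moreover, echo cancellation on this second voice signal is a necessary operation in the voice call anyway, so using the echo-cancelled second voice signal as the object of judgment adds no extra processing step.
Note that when the method is performed by a server, the interaction between the terminal system and the server may include: at the historical moment, the server receives the second voice signal collected by the target audio collection device, sent by the terminal system, and determines whether a near-end voice signal is present in it.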
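In an embodiment of this application, the voice call state includes far-end single talk, near-end single talk, double talk, or no one talking.
In an embodiment of this application, determining the voice call state at the historical moment according to the first determination result and the second determination result includes:
if the first determination result is that a far-end voice signal is present and the second determination result is that no near-end voice signal is present, the voice call state at the historical moment is far-end single talk;
if the first determination result is that no far-end voice signal is present and the second determination result is that a near-end voice signal is present, the voice call state at the historical moment is near-end single talk;
if the first determination result is that a far-end voice signal is present and the second determination result is that a near-end voice signal is present, the voice call state at the historical moment is double talk;
if the first determination result is that no far-end voice signal is present and the second determination result is that no near-end voice signal is present, the voice call state at the historical moment is no one talking.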
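The four rules above amount to a two-bit truth table. A minimal sketch, in which the enum and function names are illustrative rather than taken from the application:

```python
from enum import Enum


class CallState(Enum):
    FAR_END_SINGLE_TALK = "far_end_single_talk"
    NEAR_END_SINGLE_TALK = "near_end_single_talk"
    DOUBLE_TALK = "double_talk"
    NO_TALK = "no_talk"


def classify_call_state(far_present: bool, near_present: bool) -> CallState:
    """Map the first (far-end) and second (near-end) determination results to a state."""
    if far_present and near_present:
        return CallState.DOUBLE_TALK
    if far_present:
        return CallState.FAR_END_SINGLE_TALK
    if near_present:
        return CallState.NEAR_END_SINGLE_TALK
    return CallState.NO_TALK
```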
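It can be understood that the voice call states in a voice call can be summarized as four states: far-end single talk, near-end single talk, double talk, and no one talking. In an actual voice call, one party usually talks while the other listens; less often both talk at once or neither talks. The single-talk states therefore occur more often, and double talk and no one talking less often.
In an embodiment of this application, determining the target audio collection device at the current moment from the audio collection devices, based on the voice call state at the historical moment and the signal energy of each first voice signal, includes:
if the voice call state at the historical moment is far-end single talk, determining the audio collection device corresponding to the first voice signal with the smallest signal energy as the target audio collection device at the current moment;
if the voice call state at the historical moment is near-end single talk, determining the audio collection device corresponding to the first voice signal with the largest signal energy as the target audio collection device at the current moment;
if the voice call state at the historical moment is double talk or no one talking, determining the target audio collection device determined at the historical moment as the target audio collection device at the current moment.
Specifically, when the voice call state at the historical moment is far-end single talk, the state at the current moment is estimated to be far-end single talk as well. The first voice signals collected by the audio collection devices of the near-end device contain echo and environmental noise, so the energy of each first voice signal is positively correlated with the energy of the echo it contains. To minimize the echo energy in the signal used for subsequent voice enhancement processing, the audio collection device corresponding to the first voice signal with the smallest energy is selected as the target audio collection device; that is, the first voice signal with the smallest energy becomes the input to subsequent voice enhancement processing.
When the voice call state at the historical moment is near-end single talk, the state at the current moment is estimated to be near-end single talk as well. The first voice signals collected by the audio collection devices of the near-end device contain near-end voice and environmental noise, so the energy of each first voice signal is positively correlated with the energy of the near-end voice it contains. To maximize the near-end voice energy in the signal used for subsequent voice enhancement processing, the audio collection device corresponding to the first voice signal with the largest energy is selected as the target audio collection device; that is, the first voice signal with the largest energy becomes the input to subsequent voice enhancement processing.
When the voice call state at the historical moment is double talk, the state at the current moment is estimated to be double talk as well. The energy of each first voice signal then depends on both the echo energy and the near-end voice energy, so the energies of the echo and the near-end voice it contains cannot be determined from the total energy. Double talk generally lasts only a short time, so to keep the voice call stable the target audio collection device can be left unchanged: the target audio collection device determined at the historical moment is used as the target audio collection device at the current moment.
When the voice call state at the historical moment is no one talking, the state at the current moment is estimated to be no one talking as well. The first voice signals collected by the audio collection devices of the near-end device contain neither echo nor near-end voice. This state also generally lasts only a short time, so to keep the voice call stable the target audio collection device can be left unchanged: the target audio collection device determined at the historical moment is used as the target audio collection device at the current moment.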
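The selection rule for the four states can be sketched as a single function. Devices are identified by their index in the `energies` list and the state labels are plain strings; both are assumptions made to keep the example self-contained.

```python
from typing import Sequence


def select_target(state: str, energies: Sequence[float], previous_target: int) -> int:
    """Pick the target device index from the historical state and current energies."""
    indices = range(len(energies))
    if state == "far_end_single_talk":
        # Only echo (plus noise) is present: the quietest device has the least echo.
        return min(indices, key=lambda i: energies[i])
    if state == "near_end_single_talk":
        # Only near-end voice (plus noise): the loudest device hears the talker best.
        return max(indices, key=lambda i: energies[i])
    # Double talk or no one talking: keep the previous choice for stability.
    return previous_target
```

With two microphones at energies `[0.2, 0.9]`, far-end single talk selects index 0 and near-end single talk selects index 1, while the other two states return whatever `previous_target` was.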
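In an embodiment of this application, if the voice call state at the historical moment is far-end single talk, the method may further include:
determining the number of consecutive times the voice call state has been far-end single talk before the current moment;
if the number is greater than a set value, determining the target audio collection device at the current moment as the target audio collection device for a preset time period after the current moment.
Specifically, in practice, if the call state has been far-end single talk for a long continuous stretch of the call, that is, only the peer party has been talking, it can be assumed that this state is very likely to continue. Therefore, each time the voice call state is determined, the number of consecutive far-end single-talk states can be recorded. For example, a counter can be used: if the call state is far-end single talk, the counter is incremented by 1; for any other call state, the counter is reset to zero, and counting starts again the next time far-end single talk is determined. If the consecutive count exceeds the set value, the target audio collection device at the current moment can be used directly as the target for the rest of the call, or for a certain time period of the subsequent call, after which the target is again determined in the manner described in the foregoing embodiments. If the count does not exceed the set value, the target is determined in the manner described in the foregoing embodiments.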
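The counter described above can be sketched as a small class. The class name, the string state label, and the boolean return convention (True means "the count now exceeds the set value, so the current target may be held") are illustrative assumptions.

```python
class FarEndStreak:
    """Counts consecutive far-end single-talk moments against a set value."""

    def __init__(self, set_value: int) -> None:
        self.set_value = set_value
        self.count = 0

    def update(self, state: str) -> bool:
        """Record one moment's state; return True when the streak exceeds the set value."""
        if state == "far_end_single_talk":
            self.count += 1
        else:
            self.count = 0  # any other state clears the counter
        return self.count > self.set_value
```

How long the target is then held (the rest of the call, or a preset period) is left to the caller, since the application allows either.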
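Note that when the method is performed by a server, the interaction between the terminal system and the server may include: the server keeps statistics on the call state at each moment, and if it determines that the number of consecutive times the voice call state has been far-end single talk before the current moment is greater than the set value, it determines the target audio collection device at the current moment as the target audio collection device for a preset time period after the current moment.
In an embodiment of this application, the method further includes:
performing echo cancellation on the first voice signal collected by the target audio collection device at the current moment;
if a near-end voice signal is present in the echo-cancelled first voice signal, sending the echo-cancelled first voice signal to the far-end device.
Specifically, as described above, the first voice signal collected by the target audio collection device may contain a near-end voice signal, an echo signal, an environmental noise signal, and so on. To avoid echo leakage in the call, echo cancellation must be performed on the first voice signal before it is sent to the far-end device. Voice detection is then performed on the echo-cancelled first voice signal: if a near-end voice signal is present, the signal is sent to the far-end device; if not, the signal contains only residual echo and environmental noise and is not sent.
Note that when the method is performed by a server, the interaction between the terminal system and the server may include: the server performs echo cancellation on the first voice signal collected by the target audio collection device at the current moment, and if a near-end voice signal is present in the echo-cancelled first voice signal, sends the near-end voice signal to the far-end device.
In an embodiment of this application, performing echo cancellation on the first voice signal collected by the target audio collection device at the current moment specifically includes:
obtaining the far-end voice signal at the current moment;
obtaining, based on the far-end voice signal at the current moment and the echo propagation path function at the current moment, the echo signal in the first voice signal collected by the target audio collection device at the current moment;
performing echo cancellation, based on the echo signal, on the first voice signal collected by the target audio collection device at the current moment.
The echo propagation path function can be understood as the mapping between the far-end voice signal and the echo signal received by the audio collection device: substituting the far-end voice signal at the current moment into the echo propagation path function at the current moment yields the corresponding echo signal.
Specifically, when a far-end voice signal is present at the current moment, the corresponding echo signal is obtained from the echo propagation path function and then removed from the first voice signal, completing echo cancellation. When no far-end voice signal is present at the current moment, the first voice signal contains no echo signal either, so the first voice signal is unchanged by echo cancellation.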
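A common way to realize an echo propagation path function is a finite impulse response (FIR) filter applied to recent far-end samples. The application does not specify the form of the function, so the FIR model, the function names, and the newest-first sample ordering below are all assumptions.

```python
from typing import Sequence


def estimate_echo(far_end_history: Sequence[float], echo_path: Sequence[float]) -> float:
    """Estimated echo sample: dot product of FIR coefficients and far-end samples.

    far_end_history[0] is the newest far-end sample (assumed ordering).
    """
    return sum(h * x for h, x in zip(echo_path, far_end_history))


def cancel_echo(mic_sample: float, far_end_history: Sequence[float],
                echo_path: Sequence[float]) -> float:
    """Subtract the estimated echo from the microphone sample."""
    return mic_sample - estimate_echo(far_end_history, echo_path)
```

When the far-end history is all zeros the estimate is zero and the microphone sample passes through unchanged, matching the behavior described for moments without a far-end voice signal.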
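In an embodiment of this application, the method may further include:
performing echo cancellation on the second voice signal collected by the target audio collection device selected at the historical moment, to obtain the residual echo signal at the historical moment;
updating the echo propagation path function at the historical moment based on the residual echo signal at the historical moment, to obtain the echo propagation path function at the current moment.
Specifically, the echo signal derived at each moment from that moment's far-end voice signal and echo propagation path function deviates somewhat from the actual echo signal at that moment. To reduce the deviation at the next moment, the residual echo signal left after echo cancellation at each moment can be used to correct the parameters of that moment's echo propagation path function, that is, to update it into the echo propagation path function for the next moment. It can be understood that when no far-end voice signal is present at the historical moment, the voice signal collected at that moment contains no echo signal and hence no residual echo signal, so the echo propagation path function at the current moment is the same as that at the historical moment.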
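The application does not name an update rule. One widely used choice for adapting an FIR echo path from the residual is the normalized least mean squares (NLMS) algorithm, sketched here as a stand-in; the step size, the regularization constant `eps`, and the FIR model itself are assumptions, not the patent's stated method.

```python
from typing import List, Sequence


def nlms_update(echo_path: Sequence[float], far_end_history: Sequence[float],
                residual: float, step: float = 0.5, eps: float = 1e-8) -> List[float]:
    """One NLMS step: nudge each coefficient along the far-end sample it multiplies.

    residual is the echo-cancelled sample (microphone sample minus estimated echo).
    """
    power = sum(x * x for x in far_end_history) + eps  # eps avoids division by zero
    scale = step * residual / power
    return [h + scale * x for h, x in zip(echo_path, far_end_history)]
```

With an all-zero far-end history the coefficients are returned unchanged, which is consistent with the statement that the function stays the same when no far-end voice signal is present.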
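In an embodiment of this application, sending the echo-cancelled first voice signal to the far-end device specifically includes:
removing the environmental noise signal and the residual echo signal from the echo-cancelled first voice signal, and sending the resulting voice signal to the far-end device.
Specifically, after echo cancellation is performed on the first voice signal, subsequent voice enhancement processing is needed to further improve voice call quality. This processing includes removing the environmental noise signal, the residual echo signal, and so on.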
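The embodiments of this application are further described below through an example, with the terminal system as the executing entity. Suppose the near-end device in the voice call is a mobile phone, and take the phone shown in FIG. 1 as an example: it has two audio collection devices, the top microphone (top mic) 201 and the bottom microphone (bottom mic) 202, as well as an earpiece 203 and a speaker 204. Both the top mic 201 and the bottom mic 202 can collect the first voice signal, and both the earpiece 203 and the speaker 204 can play the received far-end voice.
FIG. 3 shows the principle by which the phone in this example selects the target audio collection device. As shown in FIG. 3, the phone may include a call state estimation and microphone selector 301, an echo estimator 302, and a voice enhancement processor 304. The call state estimation and microphone selector 301 determines the voice call state at each moment, and determines the target microphone from the voice call state at the historical moment and the signal energy of the voice signals collected by the top mic and the bottom mic at the current moment. The echo estimator 302 estimates the echo signal at the current moment from the input far-end voice signal. The echo canceller 303 performs echo cancellation on the input voice signal according to the input echo signal; it can be understood as an adder, in which "-" and "+" denote removing and accumulating an input signal, respectively. The voice enhancement processor 304 performs subsequent enhancement processing on the input voice signal, including removing the residual echo signal and the environmental noise signal.
Note that the call state estimation and microphone selector 301, the echo estimator 302, and the voice enhancement processor 304 may be physical components with the corresponding functions, or applications that implement those functions.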
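Based on the scheme provided by the embodiments of this application, the voice call in the phone at the current moment may proceed as follows:
Step 1-1: after receiving the far-end voice signal, the phone plays the far-end speaker's voice through the speaker or the earpiece. The top mic and the bottom mic each collect the near-end speaker's voice, the far-end speaker's voice, and environmental noise, yielding two first voice signals, which are input to the call state estimation and microphone selector 301.
Step 1-2: the call state estimation and microphone selector 301 determines the target microphone from the previously obtained voice call state at the historical moment and the two first voice signals input from the top mic and the bottom mic, and inputs the first voice signal collected by the target microphone to the echo canceller 303.
Step 1-3: the echo estimator 302 estimates the echo signal from the input far-end voice signal and inputs the echo signal to the echo canceller 303.
Step 1-4: the echo canceller 303 performs echo cancellation on the first voice signal collected by the target microphone, according to the echo signal input by the echo estimator 302, and inputs the echo-cancelled first voice signal to the voice enhancement processor 304.
Step 1-5: the voice enhancement processor 304 performs further voice enhancement processing on the echo-cancelled first voice signal, which may include removing the environmental noise signal and the residual echo signal, and then sends the enhanced first voice signal to the far-end device.
In addition, as shown in the figure, in step 1-4 the echo canceller 303 also inputs the echo-cancelled first voice signal to the call state estimation and microphone selector 301 and to the echo estimator 302. The call state estimation and microphone selector 301 uses this input to determine the voice call state at the current moment, for use in determining the target microphone at the next moment. The echo estimator 302 can update itself, for example update the echo propagation path function, according to the residual echo signal in the echo-cancelled first voice signal.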
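FIG. 4 shows an optional structure of the call state estimation and microphone selector. As shown in FIG. 4, it may include: a first peak envelope detection module 401, a second peak envelope detection module 402, a far-end voice activity detection module 403, a near-end voice activity detection module 404, a call state estimation module 405, a microphone selection module 406, and a mixing module 407.
The first peak envelope detection module 401 detects the peak envelope magnitude of the voice signal collected by the top mic, and the second peak envelope detection module 402 detects the peak envelope magnitude of the voice signal collected by the bottom mic. The far-end voice activity detection module 403 detects whether a far-end voice signal is present at each call moment, and the near-end voice activity detection module 404 detects whether a near-end voice signal is present at each call moment. The call state estimation module 405 determines the call state at each moment according to whether a near-end voice signal and a far-end voice signal are present, that is, according to the judgments of the far-end voice activity detection module 403 and the near-end voice activity detection module 404. The microphone selection module 406 determines the target microphone selection result from the input peak envelope magnitudes of the voice signals collected by the top mic and the bottom mic. The mixing module 407 outputs the first voice signal collected by the target microphone according to the input target microphone selection result.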
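One simple way a peak envelope detection module could work is a peak follower that jumps up to each new peak and decays otherwise. The application does not describe the detector's internals, so the decay constant and function name here are hypothetical.

```python
from typing import List, Sequence


def peak_envelope(samples: Sequence[float], decay: float = 0.99,
                  start: float = 0.0) -> List[float]:
    """Track the peak envelope: rise instantly to |sample|, otherwise decay geometrically."""
    envelope = start
    out = []
    for sample in samples:
        envelope = max(abs(sample), envelope * decay)
        out.append(envelope)
    return out
```

Comparing the last envelope value of each microphone then gives the per-device magnitude that the microphone selection module takes as input.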
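Note that the first peak envelope detection module 401, the second peak envelope detection module 402, the far-end voice activity detection module 403, the near-end voice activity detection module 404, the call state estimation module 405, the microphone selection module 406, and the mixing module 407 may be physical components with the corresponding functions, or applications that implement those functions. Based on the structure shown in FIG. 4, the target microphone of the phone at the current moment may be determined as follows:
Step 2-1: the first peak envelope detection module 401 detects the peak envelope magnitude of the first voice signal collected by the top mic, the second peak envelope detection module 402 detects the peak envelope magnitude of the first voice signal collected by the bottom mic, and the two magnitudes are input to the microphone selection module 406.
Step 2-2: the microphone selection module 406 determines the target microphone selection result from the voice call state at the historical moment determined by the call state estimation module 405 and the two input peak envelope magnitudes, and inputs the result to the mixing module 407.
Specifically, the call state estimation module 405 determines the voice call state at the historical moment from the first determination result, produced by the far-end voice activity detection module 403, of whether a far-end voice signal was present at the historical moment, and the second determination result, produced by the near-end voice activity detection module 404, of whether a near-end voice signal was present at the historical moment.
If the voice call state at the historical moment is far-end single talk, the microphone selection module 406 determines the microphone corresponding to the first voice signal with the smaller signal energy as the target microphone; if it is near-end single talk, the module determines the microphone corresponding to the first voice signal with the larger signal energy as the target microphone; if it is double talk or no one talking, the module keeps the target microphone determined at the historical moment as the target microphone.
Step 2-3: the mixing module 407 mixes and routes the first voice signals collected by the two microphones according to the input target microphone selection result, and outputs the voice signal of the target microphone. When switching from one microphone signal to the other, a smooth transition time window can be set to keep the transition continuous.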
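The smooth transition window in step 2-3 could be a linear crossfade over one frame, as in this sketch. The linear ramp and the one-frame window length are assumptions; the application only says a transition window can be set.

```python
from typing import List, Sequence


def crossfade(old_frame: Sequence[float], new_frame: Sequence[float]) -> List[float]:
    """Linearly fade from the previous microphone's frame to the new microphone's frame."""
    n = len(old_frame)
    if n != len(new_frame):
        raise ValueError("frames must have the same length")
    if n == 1:
        return [new_frame[0]]
    out = []
    for i, (old, new) in enumerate(zip(old_frame, new_frame)):
        weight = i / (n - 1)  # 0.0 at the first sample, 1.0 at the last
        out.append((1.0 - weight) * old + weight * new)
    return out
```

The output starts on the old microphone's signal and ends on the new one, so no sample-to-sample jump is introduced at the switch.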
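In addition, the call state estimation module 405 must also determine the voice call state at the current moment, for use in selecting the target microphone at the next moment. This may include the following steps:
Step 3-1: the far-end voice activity detection module 403 determines whether a far-end voice signal is present at the current moment from the input far-end voice signal at the current moment (the far-end voice shown in the figure); the near-end voice activity detection module 404 determines whether a near-end voice signal is present at the current moment from the input echo-cancelled first voice signal collected by the target microphone at the current moment (the echo-cancelled first voice shown in the figure); the two results are input to the call state estimation module 405.
Step 3-2: the call state estimation module 405 determines the voice call state at the current moment from the two input results.
Specifically, if a far-end voice signal is present and no near-end voice signal is present, the voice call state at the current moment is far-end single talk; if no far-end voice signal is present and a near-end voice signal is present, it is near-end single talk; if both are present, it is double talk; if neither is present, it is no one talking.
The scheme provided by the embodiments of this application selects the target audio collection device by jointly analyzing the voice signals collected by the multiple audio collection devices of the terminal system, the voice signal played by the audio playback device, and the call state of the device. Compared with the related art, it can effectively improve the overall performance of the voice call.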
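As an example based on the voice call scheme provided by the embodiments of this application, again with the terminal system as the executing entity, FIG. 5 shows the effect of microphone selection by a terminal system (a mobile phone) in a hands-free call. The phone has two microphones, microphone a and microphone b, and conducts the voice call in hands-free mode. Waveform a in the figure is the time-domain waveform of the voice signal collected by microphone a, waveform b is that of microphone b, waveform c is that of the voice signal played by the speaker, and curve d is the target microphone selection result. In this example, S1 on curve d indicates that the target microphone is a, and S2 indicates that it is b. The horizontal axis is time in seconds (s); only part of the time span is shown. For waveforms a to c, the vertical axis represents signal energy, specifically the signal amplitude.
Specifically, suppose in this example that adjacent moments are 0.1 s apart. From 0 s to 1 s of the call, curve d shows that microphone a is selected as the target microphone. The selection proceeds as follows for any moment in that period, for example 0.3 s, whose historical moment is 0.2 s. The actual voice detection result at 0.2 s is that no near-end voice signal is present and a far-end voice signal is present, so the voice call state at 0.2 s is far-end single talk. At 0.3 s the microphone whose voice signal has the smaller energy should therefore be selected. Waveforms a and b show that at 0.3 s the energy collected by microphone a is smaller than that collected by microphone b, so microphone a is selected as the target microphone at 0.3 s.
As another example, from 1 s to 1.5 s of the call, the actual detection result is that neither a far-end nor a near-end voice signal is present. Waveforms a and b over this period also show that the two microphones collected essentially no signal, and no far-end voice signal was detected, that is, none was received and the speaker played nothing. The voice call state at every moment in this period is therefore no one talking, so the target microphone at the historical moment is kept as the target microphone at the current moment: microphone a remains the target microphone throughout this period.
As another example, from 1.5 s to 2.4 s of the call, curve d shows that the target microphone is microphone b. The selection proceeds as follows: the actual voice detection result for this period is that a near-end voice signal is present and no far-end voice signal is present, so the voice call state at every moment in the period is near-end single talk. The microphone whose collected voice signal has the larger energy should therefore be selected at each moment. Waveforms a and b show that in this period the energy collected by microphone b is larger than that collected by microphone a, so microphone b is selected as the target microphone at every moment in the period.
As another example, in the period from 3.6 s to 4.6 s of the call, take 4.1 s, whose historical moment is 4.0 s. The actual detection result at 4.0 s is that both a near-end and a far-end voice signal are present, so the voice call state at 4.0 s is double talk. The target microphone at the historical moment is therefore kept: the target microphone at 4.0 s, microphone a, is used as the target microphone at 4.1 s.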
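The walk-through above can be traced with a small simulation that applies the previous moment's state to the current moment's energies. The input frames in the usage note are invented for illustration and are not read from FIG. 5; the presence flags stand in for the voice activity detectors, and device indices replace microphone names.

```python
from typing import List, Sequence, Tuple

Frame = Tuple[bool, bool, Sequence[float]]  # (far-end present, near-end present, energies)


def classify(far_present: bool, near_present: bool) -> str:
    """Two-flag call state classification."""
    if far_present and near_present:
        return "double_talk"
    if far_present:
        return "far_end_single_talk"
    if near_present:
        return "near_end_single_talk"
    return "no_talk"


def run_selection(frames: Sequence[Frame], initial_target: int,
                  initial_state: str = "no_talk") -> List[int]:
    """Return the target device index chosen at each moment."""
    target, state = initial_target, initial_state
    chosen = []
    for far_present, near_present, energies in frames:
        indices = range(len(energies))
        if state == "far_end_single_talk":
            target = min(indices, key=lambda i: energies[i])
        elif state == "near_end_single_talk":
            target = max(indices, key=lambda i: energies[i])
        chosen.append(target)
        # This moment's state becomes the historical state for the next moment.
        state = classify(far_present, near_present)
    return chosen
```

For instance, with hypothetical frames of two far-end-only moments, one silent moment, and two near-end-only moments, starting from device 1, the selection switches to the quieter device 0 one moment after far-end talk begins, holds through the silence, and switches to the louder device 1 one moment after near-end talk begins.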
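Likewise, the target microphone at every moment of the voice call in the above example can be selected based on the scheme provided by the embodiments of this application, which is not repeated here. Experiments have verified that the scheme provided by this application selects the appropriate target microphone in each specific voice call state and effectively improves the voice call effect.
FIG. 6 is a schematic flowchart of a voice call method provided by an embodiment of this application. As shown in FIG. 6, the method may include the following steps.
Step 501: receive a voice call trigger operation from a user.
The voice call trigger operation is an instruction to start a voice call. It may be the user's tap on the corresponding voice call application, or an instruction to start a voice call that the user enters by voice or text.
Step 502: based on the voice call trigger operation, turn on the audio playback device and at least two audio collection devices on the terminal system.
The embodiments of this application do not limit the specific type of the terminal system, as long as it can make voice calls; examples include but are not limited to mobile phones and tablets (PADs). The audio playback device on it may be a speaker, and the audio collection devices may be microphones. The embodiments of this application do not limit the specific type or number of the audio playback device and the at least two audio collection devices, or the positions of the audio collection devices on the terminal system.
In practice, the terminal system can provide a corresponding interactive interface for the voice call. Icons for the audio playback device and the at least two audio collection devices can be displayed at corresponding positions on the interface, with the color or shape of each icon indicating whether the corresponding device is on or off.
Step 503: for the initial moment of the voice call, use the audio collection device, among the at least two audio collection devices, that corresponds to the pre-configured information as the target audio collection device, and determine the voice call state at the initial moment.
Specifically, the target audio collection device corresponding to the pre-configured information may be specified in advance, chosen arbitrarily from the at least two audio collection devices, or selected by an existing method, for example based on the call scene.
Step 504: for a current moment of the voice call other than the initial moment, conduct the voice call with the peer device based on the voice signal collected by the target audio collection device determined by the method provided in the above embodiments.
The voice call method provided by the embodiments of this application uses the voice call state at the historical moment, combined with the signal energy of the voice signals collected by the audio collection devices, to determine as the target audio collection device at the current moment the device whose voice signal is more conducive to subsequent voice enhancement processing in the specific voice call state. The determination does not rely solely on the signal energy of the collected voice signals or on the call scene of the near-end device. This avoids the prior-art problem that the voice signal collected by the determined target audio collection device has a large echo or a weak near-end voice, and improves the voice call effect.
The voice call method provided by the embodiments of this application is applicable to the voice call process of any terminal system with multiple microphones (dual microphones are taken as an example). For instance, it can be applied in applications involving voice call scenes, using the dual microphones on the terminal system to suppress echo during the call, boost the near-end voice volume, and improve call quality. Take a voice conference application (APP) as an example. As shown in FIG. 7, after opening the application, the user (represented by the avatar in the figure) can enter the conference interface and start speaking once the microphone is turned on. As shown in the figure, the user can also invite other users to join the session from the conference interface (by tapping the invite button), share the screen, record video by turning on the camera, and configure the APP. The user's speech is then collected by the two microphones on the terminal system. The voices of other online users, once played by the device, are also collected by the microphones, which would cause those users to hear their own speech, that is, echo. An echo canceller can be built into the APP to remove the echo of other users collected by the microphones and keep only the local user's speech, improving the conference experience. During the voice call, the dual-microphone voice enhancement module of the terminal system (which can be used to determine the target microphone, send the voice signal, and so on) selects the target microphone and, based on the voice signal collected by the selected target microphone, sends the voice signal to the other users' terminal systems. Note that in practice the dual-microphone voice enhancement module can be turned on or off automatically along with the microphone switch, without requiring the user to perform other operations such as switching microphones.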
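Corresponding to the voice call method provided by the embodiments of this application, an embodiment of this application further provides a voice call device. As shown in FIG. 8, the device 600 may include a call state acquisition module 601, a signal energy acquisition module 602, and a target audio collection device determining module 603, where:
the call state acquisition module 601 is configured to obtain the voice call state of the terminal system at a historical moment, where at least two audio collection devices are provided on the terminal system;
the signal energy acquisition module 602 is configured to obtain the first voice signal collected by each audio collection device at the current moment, and determine the signal energy of each first voice signal;
the target audio collection device determining module 603 is configured to determine the target audio collection device from the audio collection devices, based on the voice call state at the historical moment and the signal energy of each first voice signal.
The voice call device provided by the embodiments of this application uses the voice call state at the historical moment, combined with the signal energy of the voice signals collected by the audio collection devices, to determine as the target audio collection device at the current moment the device whose voice signal is more conducive to subsequent voice enhancement processing in the specific voice call state. The determination does not rely solely on the signal energy of the collected voice signals or on the call scene of the near-end device. This avoids the prior-art problem that the voice signal collected by the determined target audio collection device has a large echo or a weak near-end voice, and improves the voice call effect.
In an embodiment of this application, the device further includes a call state determination module configured to determine the voice call state of the terminal system. When determining the voice call state at the historical moment, the module is specifically configured to:
determine whether a far-end voice signal is present at the historical moment, to obtain a first determination result;
determine whether a near-end voice signal is present at the historical moment, to obtain a second determination result;
determine the voice call state at the historical moment according to the first determination result and the second determination result.
In an embodiment of this application, when determining whether a near-end voice signal is present at the historical moment, the call state determination module is specifically configured to:
obtain the second voice signal collected at the historical moment by the target audio collection device of the historical moment;
perform echo cancellation on the second voice signal, and determine whether a near-end voice signal is present in the echo-cancelled second voice signal.
In an embodiment of this application, the voice call state includes far-end single talk, near-end single talk, double talk, or no one talking.
In an embodiment of this application, when determining the voice call state at the historical moment according to the first determination result and the second determination result, the call state determination module is specifically configured to:
when the first determination result is that a far-end voice signal is present and the second determination result is that no near-end voice signal is present, determine that the voice call state at the historical moment is far-end single talk;
when the first determination result is that no far-end voice signal is present and the second determination result is that a near-end voice signal is present, determine that the voice call state at the historical moment is near-end single talk;
when the first determination result is that a far-end voice signal is present and the second determination result is that a near-end voice signal is present, determine that the voice call state at the historical moment is double talk;
when the first determination result is that no far-end voice signal is present and the second determination result is that no near-end voice signal is present, determine that the voice call state at the historical moment is no one talking.
In an embodiment of this application, the target audio collection device determining module 603 is specifically configured to:
when the voice call state at the historical moment is far-end single talk, determine the audio collection device corresponding to the first voice signal with the smallest signal energy as the target audio collection device at the current moment;
when the voice call state at the historical moment is near-end single talk, determine the audio collection device corresponding to the first voice signal with the largest signal energy as the target audio collection device at the current moment;
when the voice call state at the historical moment is double talk or no one talking, determine the target audio collection device at the historical moment as the target audio collection device at the current moment.
In an embodiment of this application, the target audio collection device determining module 603 is further configured to:
when the voice call state at the historical moment is far-end single talk, determine the number of consecutive times the voice call state has been far-end single talk before the current moment, and if that number is greater than a set value, determine the target audio collection device at the current moment as the target audio collection device after the current moment.
In an embodiment of this application, the device further includes a signal sending module, configured to:
perform echo cancellation on the first voice signal collected by the target audio collection device at the current moment;
if a near-end voice signal is present in the echo-cancelled first voice signal, send the echo-cancelled first voice signal to the peer device of the voice call.
In an embodiment of this application, when performing echo cancellation on the first voice signal collected by the target audio collection device at the current moment, the signal sending module is specifically configured to:
obtain the far-end voice signal at the current moment;
determine, based on the far-end voice signal at the current moment and the echo propagation path function at the current moment, the echo signal in the first voice signal collected by the target audio collection device at the current moment;
perform echo cancellation, based on the echo signal, on the first voice signal collected by the target audio collection device at the current moment.
In an embodiment of this application, the echo propagation path function at the current moment is obtained as follows:
perform echo cancellation on the second voice signal collected by the target audio collection device at the historical moment, to obtain the residual echo signal at the historical moment;
update the echo propagation path function at the historical moment based on the residual echo signal at the historical moment, to obtain the echo propagation path function at the current moment.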
图9为本申请实施例提供的一种语音通话装置的结构框图,如图9所示,该装置700可以包括:触发操作接收模块701、设备开启模块702、初始确定模块703以及语音通话模块704,其中:
触发操作接收模块701用于接收用户的语音通话触发操作;
设备开启模块702用于基于语音通话触发操作,开启终端系统上的音频播放设备和至少两个音频采集设备;
初始确定模块703用于对于语音通话的初始时刻,将预配置信息所对应的至少两个音频采集设备中的音频采集设备作为目标音频采集设备,并确定初始时刻的语音通话状态;
语音通话模块704用于对于语音通话的除初始时刻之外的当前时刻,基于第一方面、第一方面任一可选实施例或第三方面所提供的方法所确定出的目标音频采集设备所采集的语音信号,与对端设备进行语音通话。
本申请实施例提供的语音通话装置,利用历史时刻的语音通话状态,结合各音频采集设备采集的语音信号的信号能量,将特定语音通 话状态下更有利于后续的语音增强处理的语音信号对应的音频采集设备确定为当前时刻的目标音频采集设备,目标音频采集设备确定过程中不是仅依赖于各音频采集设备采集到的语音信号的信号能量或近端设备的通话场景,因此避免了现有技术中确定出的目标音频采集设备所采集到的语音信号中回声较大或近端语音较小的问题,提高了语音通话的效果。
基于相同的原理,本申请实施例还提供了一种电子设备,该电子设备包括存储器、处理器、音频播放设备、以及至少两个音频采集设备,其中,音频播放设备用于播放语音信号;至少两个音频采集设备用于采集语音信号;存储器中存储有计算机程序;处理器执行该计算机程序时,实现本申请任一实施例中所提供的方法,具体可实现如下几种情况:
情况一:获取终端系统历史时刻的语音通话状态,终端系统上设置有至少两个音频采集设备;获取各音频采集设备在当前时刻采集到的第一语音信号,并分别确定各第一语音信号的信号能量;基于历史时刻的语音通话状态、以及各第一语音信号的信号能量,从各音频采集设备中确定当前时刻的目标音频采集设备。
情况二:接收用户的语音通话触发操作;基于语音通话触发操作,开启终端系统上的音频播放设备和至少两个音频采集设备;对于语音通话的初始时刻,将预配置信息所对应的至少两个音频采集设备中的音频采集设备作为目标音频采集设备,并确定初始时刻的语音通话状态;对于语音通话的除初始时刻之外的当前时刻,基于情况一所提供的方法所确定出的目标音频采集设备所采集的语音信号,与对端设备进行语音通话。
本申请实施例还提供了一种计算机可读存储介质,该计算机可读存储介质上存储有计算机程序,该程序被处理器执行时实现本申请任一实施例所示的方法。可以理解的是,该计算机可读存储介质中存储的是本申请任一实施例提供的语音通话方法对应的计算机程序。
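FIG. 10 is a schematic structural diagram of an electronic device to which the embodiments of this application are applicable. As shown in FIG. 10, the electronic device 800 includes a processor 801, a memory 803, an audio playback device 805, and at least two audio collection devices 806. The processor 801, the audio playback device 805, and the at least two audio collection devices 806 are connected to the memory 803, for example through a bus 802. The electronic device 800 may further include a transceiver 804, through which it can exchange data with other electronic devices. It should be noted that, in practice, the number of transceivers 804 is not limited to one, and the structure of the electronic device 800 does not limit the embodiments of this application.
The processor 801, as applied in the embodiments of this application, implements the functions of the voice call apparatus shown in FIG. 8 or FIG. 9.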
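The processor 801 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the disclosure of this application. The processor 801 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 802 may include a path that conveys information between the above components. The bus 802 may be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is used in FIG. 10, but this does not mean that there is only one bus or one type of bus.
The memory 803 may be a ROM or another type of static storage device capable of storing static information and instructions; a RAM or another type of dynamic storage device capable of storing information and instructions; an EEPROM, a CD-ROM, or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like); a magnetic disk storage medium or another magnetic storage device; or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The memory 803 stores the application program code for executing the solutions of this application, and execution is controlled by the processor 801. The processor 801 executes the application program code stored in the memory 803 to implement the actions of the voice call apparatus provided in the embodiment shown in FIG. 8 or FIG. 9.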
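It should be understood that, although the steps in the flowcharts of the accompanying drawings are shown in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict order restriction on their execution, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be performed at different moments, and they are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The above are only some implementations of this application. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principles of this application, and such improvements and refinements shall also be regarded as falling within the protection scope of this application.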

Claims (15)

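  1. A voice call method, characterized by comprising:
    obtaining a voice call state of a terminal system at a historical moment, the terminal system being provided with at least two audio collection devices;
    obtaining a first voice signal collected by each of the audio collection devices at a current moment, and determining the signal energy of each first voice signal;
    determining, based on the voice call state at the historical moment and the signal energy of each first voice signal, a target audio collection device at the current moment from among the audio collection devices.
  2. The method according to claim 1, characterized in that the voice call state at the historical moment is determined by:
    determining whether a far-end voice signal is present at the historical moment, to obtain a first determination result;
    determining whether a near-end voice signal is present at the historical moment, to obtain a second determination result;
    determining the voice call state at the historical moment according to the first determination result and the second determination result.
  3. The method according to claim 2, characterized in that the determining whether a near-end voice signal is present at the historical moment comprises:
    obtaining a second voice signal collected at the historical moment by the target audio collection device at the historical moment;
    performing echo cancellation on the second voice signal, and determining whether a near-end voice signal is present in the echo-cancelled second voice signal.
  4. The method according to claim 2, characterized in that the voice call state includes at least far-end single talk, near-end single talk, double talk, or no one talking.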
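  5. The method according to claim 4, characterized in that the determining the voice call state at the historical moment according to the first determination result and the second determination result comprises:
    if the first determination result is that a far-end voice signal is present and the second determination result is that no near-end voice signal is present, the voice call state at the historical moment is far-end single talk;
    if the first determination result is that no far-end voice signal is present and the second determination result is that a near-end voice signal is present, the voice call state at the historical moment is near-end single talk;
    if the first determination result is that a far-end voice signal is present and the second determination result is that a near-end voice signal is present, the voice call state at the historical moment is double talk;
    if the first determination result is that no far-end voice signal is present and the second determination result is that no near-end voice signal is present, the voice call state at the historical moment is no one talking.
  6. The method according to claim 4, characterized in that the determining, based on the voice call state at the historical moment and the signal energy of each first voice signal, the target audio collection device at the current moment from among the audio collection devices comprises:
    if the voice call state at the historical moment is far-end single talk, determining the audio collection device corresponding to the first voice signal with the lowest signal energy as the target audio collection device at the current moment;
    if the voice call state at the historical moment is near-end single talk, determining the audio collection device corresponding to the first voice signal with the highest signal energy as the target audio collection device at the current moment;
    if the voice call state at the historical moment is double talk or no one talking, determining the target audio collection device at the historical moment as the target audio collection device at the current moment.
  7. The method according to claim 6, characterized in that, if the voice call state at the historical moment is far-end single talk, the method further comprises:
    determining the number of consecutive times the voice call state has been far-end single talk before the current moment;
    if the number is greater than a set value, determining the target audio collection device at the current moment as the target audio collection device for moments after the current moment.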
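  8. The method according to any one of claims 1 to 7, characterized by further comprising:
    performing echo cancellation on the first voice signal collected by the target audio collection device at the current moment;
    if a near-end voice signal is present in the echo-cancelled first voice signal, sending the echo-cancelled first voice signal to a peer device of the voice call.
  9. The method according to claim 8, characterized in that the performing echo cancellation on the first voice signal collected by the target audio collection device at the current moment specifically comprises:
    obtaining a far-end voice signal at the current moment;
    determining, based on the far-end voice signal at the current moment and an echo propagation path function at the current moment, an echo signal in the first voice signal collected by the target audio collection device at the current moment;
    performing echo cancellation on the first voice signal collected by the target audio collection device at the current moment based on the echo signal.
  10. The method according to claim 9, characterized in that the echo propagation path function at the current moment is obtained by:
    performing echo cancellation on the second voice signal collected by the target audio collection device at the historical moment, to obtain a residual echo signal at the historical moment;
    updating the echo propagation path function at the historical moment based on the residual echo signal at the historical moment, to obtain the echo propagation path function at the current moment.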
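  11. A voice call method, characterized by comprising:
    receiving a voice call trigger operation from a user;
    enabling, based on the voice call trigger operation, an audio playback device and at least two audio collection devices on a terminal system;
    for an initial moment of the voice call, using as a target audio collection device the one audio collection device, among the at least two audio collection devices, that corresponds to preconfigured information, and determining the voice call state at the initial moment;
    for each current moment of the voice call other than the initial moment, conducting the voice call with a peer device based on the voice signal collected by the target audio collection device determined by the method according to any one of claims 1 to 10.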
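  12. A voice call apparatus, characterized by comprising:
    a call state obtaining module, configured to obtain a voice call state of a terminal system at a historical moment, the terminal system being provided with at least two audio collection devices;
    a signal energy obtaining module, configured to obtain a first voice signal collected by each of the audio collection devices at a current moment, and determine the signal energy of each first voice signal;
    a target audio collection device determining module, configured to determine, based on the voice call state at the historical moment and the signal energy of each first voice signal, a target audio collection device from among the audio collection devices.
  13. A voice call apparatus, characterized by comprising:
    a trigger operation receiving module, configured to receive a voice call trigger operation from a user;
    a device enabling module, configured to enable, based on the voice call trigger operation, an audio playback device and at least two audio collection devices on a terminal system;
    an initial determining module, configured to, for an initial moment of the voice call, use as a target audio collection device the audio collection device, among the at least two audio collection devices, that corresponds to preconfigured information, and determine the voice call state at the initial moment;
    a voice call module, configured to, for each current moment of the voice call other than the initial moment, conduct the voice call with a peer device based on the voice signal collected by the target audio collection device determined by the method according to any one of claims 1 to 10.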
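  14. An electronic device, characterized in that the electronic device comprises a memory, a processor, an audio playback device, and at least two audio collection devices;
    the audio playback device is configured to play voice signals;
    the at least two audio collection devices are configured to collect voice signals;
    the memory stores a computer program;
    the processor is configured to execute the computer program to implement the method according to any one of claims 1 to 11.
  15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the method according to any one of claims 1 to 11.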
PCT/CN2020/081385 2019-09-24 2020-03-26 语音通话方法、装置、电子设备及计算机可读存储介质 WO2021056999A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2021558866A JP7290749B2 (ja) 2019-09-24 2020-03-26 音声通話方法並びにその、装置、電子機器及びコンピュータプログラム
EP20868976.0A EP3920516B1 (en) 2019-09-24 2020-03-26 Voice call method and apparatus, electronic device, and computer-readable storage medium
US17/460,160 US11875808B2 (en) 2019-09-24 2021-08-27 Voice call method and apparatus, electronic device, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910906728.3 2019-09-24
CN201910906728.3A CN110602327B (zh) 2019-09-24 2019-09-24 语音通话方法、装置、电子设备及计算机可读存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/460,160 Continuation US11875808B2 (en) 2019-09-24 2021-08-27 Voice call method and apparatus, electronic device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2021056999A1 true WO2021056999A1 (zh) 2021-04-01

Family

ID=68862870

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/081385 WO2021056999A1 (zh) 2019-09-24 2020-03-26 语音通话方法、装置、电子设备及计算机可读存储介质

Country Status (5)

Country Link
US (1) US11875808B2 (zh)
EP (1) EP3920516B1 (zh)
JP (1) JP7290749B2 (zh)
CN (1) CN110602327B (zh)
WO (1) WO2021056999A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110602327B (zh) 2019-09-24 2021-06-25 腾讯科技(深圳)有限公司 语音通话方法、装置、电子设备及计算机可读存储介质
CN112151051B (zh) * 2020-09-14 2023-12-19 海尔优家智能科技(北京)有限公司 音频数据的处理方法和装置及存储介质
CN115208976B (zh) * 2021-04-13 2024-10-18 深圳市万普拉斯科技有限公司 通话通道切换的处理方法、装置、通话设备和存储介质
CN113452855B (zh) * 2021-06-03 2022-05-27 杭州网易智企科技有限公司 啸叫处理方法、装置、电子设备及存储介质
CN113555030B (zh) * 2021-07-29 2024-05-31 杭州萤石软件有限公司 音频信号的处理方法、装置及设备
WO2023238419A1 (ja) * 2022-06-07 2023-12-14 サントリーホールディングス株式会社 携帯情報端末、情報処理システム、携帯情報端末の制御方法及びプログラム
CN115334413B (zh) * 2022-07-15 2024-07-12 北京达佳互联信息技术有限公司 语音信号处理方法、系统、装置及电子设备
CN117935835B (zh) * 2024-03-22 2024-06-07 浙江华创视讯科技有限公司 音频降噪方法、电子设备以及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120057717A1 (en) * 2010-09-02 2012-03-08 Sony Ericsson Mobile Communications Ab Noise Suppression for Sending Voice with Binaural Microphones
CN105847497A (zh) * 2016-03-28 2016-08-10 乐视控股(北京)有限公司 一种语音信号处理方法及装置
CN106953961A (zh) * 2017-04-28 2017-07-14 苏州科技大学 一种双麦克风的手机语音应用装置及其应用方法
CN110602327A (zh) * 2019-09-24 2019-12-20 腾讯科技(深圳)有限公司 语音通话方法、装置、电子设备及计算机可读存储介质

Family Cites Families (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6424662A (en) * 1987-07-21 1989-01-26 Nippon Telegraph & Telephone Voice conference equipment
JPH07336790A (ja) * 1994-06-13 1995-12-22 Nec Corp マイクロホンシステム
US8401178B2 (en) * 2008-09-30 2013-03-19 Apple Inc. Multiple microphone switching and configuration
US8041054B2 (en) 2008-10-31 2011-10-18 Continental Automotive Systems, Inc. Systems and methods for selectively switching between multiple microphones
CN101719969B (zh) * 2009-11-26 2013-10-02 美商威睿电通公司 判断双端对话的方法、系统以及消除回声的方法和系统
US9729344B2 (en) * 2010-04-30 2017-08-08 Mitel Networks Corporation Integrating a trigger button module into a mass audio notification system
JP2012075039A (ja) * 2010-09-29 2012-04-12 Sony Corp 制御装置、および制御方法
CN102710839B (zh) * 2012-04-27 2017-11-28 华为技术有限公司 一种提升语音通话效果的方法及通信终端
CN107257317B (zh) * 2012-05-04 2022-06-03 江虹 在通信终端设备之间已建立通信信道的即时通信系统和方法
CN105513596B (zh) * 2013-05-29 2020-03-27 华为技术有限公司 一种语音控制方法和控制设备
CN104639719A (zh) * 2013-11-11 2015-05-20 中兴通讯股份有限公司 一种通话方法和通信终端
US9451360B2 (en) * 2014-01-14 2016-09-20 Cisco Technology, Inc. Muting a sound source with an array of microphones
CN104092801A (zh) * 2014-05-22 2014-10-08 中兴通讯股份有限公司 智能终端通话降噪方法及智能终端
US9712915B2 (en) 2014-11-25 2017-07-18 Knowles Electronics, Llc Reference microphone for non-linear and time variant echo cancellation
WO2016123560A1 (en) 2015-01-30 2016-08-04 Knowles Electronics, Llc Contextual switching of microphones
GB2536742B (en) 2015-08-27 2017-08-09 Imagination Tech Ltd Nearend speech detector
KR20170052056A (ko) * 2015-11-03 2017-05-12 삼성전자주식회사 전자 장치 및 그의 음향 에코 저감 방법
CN107181853B (zh) * 2016-03-10 2020-10-09 深圳富泰宏精密工业有限公司 麦克风切换方法及应用该方法的电子装置
CN106101365A (zh) * 2016-06-29 2016-11-09 北京小米移动软件有限公司 通话过程中调整麦克风的方法及装置
US20210407668A1 (en) * 2017-02-28 2021-12-30 19Labs, Inc. Systems and methods for maintaining privacy and security while real time monitoring a plurality of patients over the internet
CN107547704A (zh) * 2017-09-28 2018-01-05 奇酷互联网络科技(深圳)有限公司 通话mic的切换方法、装置和移动终端
CN108076226B (zh) * 2017-12-22 2020-08-21 Oppo广东移动通信有限公司 一种通话质量调整的方法、移动终端及存储介质
CN108234766A (zh) * 2017-12-29 2018-06-29 努比亚技术有限公司 麦克风切换方法、移动终端及计算机可读存储介质
US11404073B1 (en) * 2018-12-13 2022-08-02 Amazon Technologies, Inc. Methods for detecting double-talk
CN110166615A (zh) * 2019-05-28 2019-08-23 努比亚技术有限公司 自动切换通话上行信号源的方法、装置、终端及存储介质
US11114109B2 (en) * 2019-09-09 2021-09-07 Apple Inc. Mitigating noise in audio signals

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3920516A4

Also Published As

Publication number Publication date
EP3920516C0 (en) 2023-12-06
US20210390969A1 (en) 2021-12-16
EP3920516A4 (en) 2022-05-04
US11875808B2 (en) 2024-01-16
EP3920516A1 (en) 2021-12-08
CN110602327A (zh) 2019-12-20
CN110602327B (zh) 2021-06-25
EP3920516B1 (en) 2023-12-06
JP7290749B2 (ja) 2023-06-13
JP2022528683A (ja) 2022-06-15

Similar Documents

Publication Publication Date Title
WO2021056999A1 (zh) 语音通话方法、装置、电子设备及计算机可读存储介质
CN105513596B (zh) 一种语音控制方法和控制设备
JP5085556B2 (ja) エコー除去の構成
US20090046866A1 (en) Apparatus capable of performing acoustic echo cancellation and a method thereof
US10978085B2 (en) Doppler microphone processing for conference calls
WO2013127302A1 (zh) 一种防止外放扬声器与麦克风声音串扰的方法及终端
USRE49462E1 (en) Adaptive noise cancellation for multiple audio endpoints in a shared space
WO2020228404A1 (zh) 即时通讯的音质优化方法、装置及设备
US8744524B2 (en) User interface tone echo cancellation
CN109256145B (zh) 基于终端的音频处理方法、装置、终端和可读存储介质
CN110660403B (zh) 一种音频数据处理方法、装置、设备及可读存储介质
JP2008211526A (ja) 音声入出力装置及び音声入出力方法
US9858944B1 (en) Apparatus and method for linear and nonlinear acoustic echo control using additional microphones collocated with a loudspeaker
WO2019144722A1 (zh) 一种闭音提示方法及装置
CN113488066B (zh) 音频信号处理方法、音频信号处理装置及存储介质
CN112217948B (zh) 语音通话的回声处理方法、装置、设备及存储介质
CN111292760B (zh) 发声状态检测方法及用户设备
CN114979344A (zh) 回声消除方法、装置、设备及存储介质
CN110971769A (zh) 通话信号的处理方法、装置、电子设备及存储介质
EP4184507A1 (en) Headset apparatus, teleconference system, user device and teleconferencing method
CN111383648B (zh) 一种回波消除方法和装置
CN105704334B (zh) 电话重拨方法及装置
CN114495967A (zh) 降低混响的方法、装置、通信系统及存储介质
CN113470675A (zh) 音频信号处理方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20868976

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020868976

Country of ref document: EP

Effective date: 20210901

ENP Entry into the national phase

Ref document number: 2021558866

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE