WO2022142984A1 - Speech processing method, apparatus, system, intelligent terminal, and electronic device - Google Patents

Speech processing method, apparatus, system, intelligent terminal, and electronic device

Info

Publication number: WO2022142984A1
Authority: WO (WIPO (PCT))
Prior art keywords: audio information, flow, stream, recognition, processing
Application number: PCT/CN2021/134864
Other languages: English (en), French (fr)
Inventor: 杨智烨
Original Assignee: 北京字节跳动网络技术有限公司
Application filed by 北京字节跳动网络技术有限公司
Priority to EP21913725.4A (published as EP4243019A4)
Priority to US18/254,568 (published as US20240105198A1)
Publication of WO2022142984A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities, audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis, using predictive techniques
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/12 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters, the excitation function being a code excitation, e.g. in code excited linear prediction [CELP] vocoders
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering, the noise being echo, reverberation of the speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2201/00 Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M2201/38 Displays
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2201/00 Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M2201/40 Electronic components, circuits, software, systems or apparatus used in telephone systems using speech recognition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/002 Applications of echo suppressors or cancellers in telephonic connections

Definitions

  • The embodiments of the present disclosure relate to the technical fields of computers, voice processing, and network communications, and in particular to a voice processing method, apparatus, system, intelligent terminal, electronic device, and storage medium.
  • A conference refers to the use of modern means of communication to achieve the purpose of convening a meeting.
  • Conferences can include remote conferences, and remote conferences can mainly include telephone conferences, network conferences, and video conferences.
  • In the related art, a voice processing method applied in a conference scenario includes: the local conference device collects the audio information corresponding to the local user and sends it to the peer conference device; correspondingly, the peer conference device collects the audio information of the peer user and sends it to the local conference device, wherein the audio information is used for voice calls.
  • However, the traditional voice processing method has at least the following technical problem: implementing a conference through audio information used only for voice calls presents the conference content in few dimensions and with a relatively low degree of richness, resulting in relatively low conference quality.
  • Embodiments of the present disclosure provide a voice processing method, apparatus, system, intelligent terminal, electronic device, and storage medium, so as to overcome the problem of low conference quality in the related art.
  • In a first aspect, an embodiment of the present disclosure provides a speech processing method, including: collecting audio information during a conference; generating a call stream and a recognition stream respectively according to the audio information, wherein the call stream is used for voice calls and the recognition stream is used for speech recognition; and sending the call stream and the recognition stream.
  • In a second aspect, an embodiment of the present disclosure provides an intelligent terminal, where the intelligent terminal includes: a microphone array, a processor, and a communication module;
  • the microphone array is configured to collect audio information during the conference;
  • the processor is configured to respectively generate a call stream and a recognition stream according to the audio information, wherein the call stream is used for voice calls and the recognition stream is used for speech recognition;
  • the communication module is configured to send the call stream and the recognition stream.
  • In a third aspect, an embodiment of the present disclosure provides a voice processing apparatus, the apparatus comprising:
  • a collection module, configured to collect audio information during the conference;
  • a generating module, configured to respectively generate a call stream and a recognition stream according to the audio information, wherein the call stream is used for voice calls and the recognition stream is used for speech recognition;
  • a sending module, configured to send the call stream and the recognition stream.
  • In a fourth aspect, an embodiment of the present disclosure provides a speech processing system, where the system includes: a first terminal device and the intelligent terminal described in the second aspect above; or a first terminal device and the apparatus described in the third aspect above, wherein the first terminal device is a terminal device participating in the conference.
  • In a fifth aspect, embodiments of the present disclosure provide an electronic device, including: at least one processor and a memory;
  • the memory stores computer-executable instructions;
  • the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the speech processing method described in the first aspect and various possible designs of the first aspect.
  • In a sixth aspect, embodiments of the present disclosure provide a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium; when a processor executes the computer-executable instructions, the speech processing method described in the first aspect and various possible designs of the first aspect is implemented.
  • In a seventh aspect, embodiments of the present disclosure provide a computer program product, which includes a computer program carried on a non-transitory computer-readable medium; when executed by a processor, the computer program implements the speech processing method described in the first aspect and various possible designs of the first aspect.
  • In an eighth aspect, an embodiment of the present disclosure provides a computer program that, when executed by a processor, implements the speech processing method described in the first aspect and various possible designs of the first aspect.
  • The voice processing method, apparatus, system, intelligent terminal, electronic device, and storage medium provided in these embodiments include: collecting audio information during a conference; generating a call stream and a recognition stream respectively according to the audio information, wherein the call stream is used for voice calls and the recognition stream is used for speech recognition; and sending the call stream and the recognition stream respectively. The technical feature of generating both a call stream and a recognition stream based on the audio information avoids the problem in the related art that the presentation of the conference content is single in dimension and relatively low in richness, so that the conference content corresponding to the audio information has more dimensions and is richer. This helps the users of the conference understand the conference content, thereby improving the accuracy, intelligence, and quality of the conference, and also improving the user's meeting experience.
  • FIG. 1 is a schematic diagram of an application scenario of a speech processing method according to an embodiment of the present disclosure;
  • FIG. 2 is a schematic flowchart of a speech processing method according to an embodiment of the present disclosure;
  • FIG. 3 is a schematic flowchart of a voice processing method according to another embodiment of the present disclosure;
  • FIG. 4 is a schematic diagram of an application scenario of a speech processing method according to another embodiment of the present disclosure;
  • FIG. 5 is a schematic diagram of the principle of a speech processing method according to an embodiment of the present disclosure;
  • FIG. 6 is a schematic diagram of the processor shown in FIG. 5;
  • FIG. 7 is a schematic diagram of a speech processing apparatus according to an embodiment of the present disclosure;
  • FIG. 8 is a schematic diagram of a speech processing apparatus according to another embodiment of the present disclosure;
  • FIG. 9 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure.
  • The voice processing method provided by the embodiments of the present disclosure can be applied to conference scenarios, and specifically to remote conference scenarios.
  • The remote conference can include teleconferencing, web conferencing, and video conferencing.
  • FIG. 1 is a schematic diagram of an application scenario of a speech processing method according to an embodiment of the present disclosure.
  • The application scenario may include: a server, at least two terminal devices, and a user corresponding to each terminal device.
  • FIG. 1 exemplarily shows n terminal devices, that is, the number of participants is n.
  • The server may establish a communication link with each terminal device and implement information interaction with each terminal device based on the communication link, so that the users corresponding to the terminal devices can communicate through a remote conference.
  • The remote conference includes multiple users, one user may correspond to one terminal device, and each user may consist of one or more people, which is not limited in this embodiment.
  • For example, a remote conference includes multiple users, and the users are staff members of different enterprises; for another example, a remote conference includes two users, and the two users are staff members of different departments of the same enterprise; for another example, a remote conference includes two users, where one user is multiple staff members of an enterprise and the other user is an individual user, and so on.
  • Terminal devices may be mobile terminals, such as mobile phones (or "cellular" phones) and computers with mobile terminals, for example, portable, pocket-sized, hand-held, computer-built-in, or vehicle-mounted mobile devices that exchange voice and/or data with a radio access network; a terminal device may also be a smart speaker, a Personal Communication Service (PCS) phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA), a tablet computer, a wireless modem, a handheld device (handset), a laptop computer, a Machine Type Communication (MTC) terminal, or other equipment; a terminal device may also be called a system, a Subscriber Unit, a Subscriber Station, a Mobile Station, a Remote Station, a Remote Terminal, an Access Terminal, a User Terminal, a User Agent, a User Device, or User Equipment, which is not limited here.
  • Elements in the application scenario can be adaptively added on the basis of the above example, such as increasing the number of terminal devices; elements in the application scenario can also be adaptively removed on the basis of the above example, such as reducing the number of terminal devices and/or removing the server.
  • In the related art, each terminal device can collect the audio information of its corresponding user, generate a call stream (for voice calls) according to the audio information, and send the call stream to the server based on the communication link between the terminal device and the server.
  • Correspondingly, the server can send the call stream, based on the other communication links, to the terminal devices corresponding to those links.
  • Correspondingly, the other terminal devices can output the call stream, so that all the users corresponding to the other terminal devices can hear the voice and content of the user corresponding to each terminal device.
  • However, the information sent by each terminal device only includes the call stream, resulting in a relatively single display dimension of the conference content in the remote conference and low intelligence.
  • The above examples are only used to illustrate application scenarios to which the speech processing method of this embodiment is applicable, and should not be construed as limiting the application scenarios of the speech processing method of the embodiments of the present disclosure.
  • The voice processing method in this embodiment may also be applied to other conference scenarios (e.g., local conference scenarios), or to other scenarios in which voice processing needs to be performed on audio information.
  • Accordingly, the inventor of the present disclosure obtained the inventive concept of the present disclosure through creative work: based on the audio information, a call stream and a recognition stream are respectively generated, where the call stream is used for voice calls and the recognition stream is used for speech recognition, so as to realize diversity of the conference content and improve users' meeting experience.
  • FIG. 2 is a schematic flowchart of a speech processing method according to an embodiment of the present disclosure.
  • As shown in FIG. 2, the method includes:
  • S101: Collect audio information during the conference.
  • The execution body of this embodiment may be a voice processing apparatus, and the voice processing apparatus may be a terminal device, a server, a processor, a chip, or the like, which is not limited in this embodiment.
  • For example, the voice processing apparatus may be a terminal device as shown in FIG. 1, such as at least one of terminal device 1 to terminal device n in FIG. 1.
  • Taking terminal device n as an example, when user n makes a speech, terminal device n can collect the corresponding audio information.
  • S102: Generate a call stream and a recognition stream respectively according to the audio information, wherein the call stream is used for voice calls, and the recognition stream is used for speech recognition.
  • That is, the voice processing apparatus separately generates, based on the collected audio information, a call stream for voice calls and a recognition stream for speech recognition.
  • Continuing the above example, this step can be understood as: terminal device n processes the audio information and generates a call stream and a recognition stream respectively, as sketched below.
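  • For orientation only, the two-stream structure of S102 can be sketched as follows; the helper functions are hypothetical placeholders (not part of the disclosure) for the processing chains detailed in the embodiment of FIG. 3.

```python
import numpy as np

def preprocess(x):
    # Placeholder for the shared front end (S203-S205): echo cancellation,
    # echo residual suppression, and de-reverberation.
    return x - np.mean(x)

def enhance_clarity(x):
    # Placeholder for the clarity-enhancement chain that yields the call stream.
    return np.clip(x * 1.5, -1.0, 1.0)

def keep_fidelity(x):
    # Placeholder for the fidelity chain that yields the recognition stream.
    return x.copy()

def process_audio(pcm):
    # S102 as a branch point: one capture buffer, two output streams.
    cleaned = preprocess(pcm)
    return enhance_clarity(cleaned), keep_fidelity(cleaned)

call_stream, recognition_stream = process_audio(np.zeros(16000))
```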
  • The technical solution provided in this embodiment includes the technical feature of respectively generating a call stream and a recognition stream based on audio information. This avoids the problem in the related art that conference content is presented in a relatively single way, which may make the conference content inaccurate, that is, users may obtain the conference content incorrectly. The solution thus helps the users of the conference understand the conference content, thereby improving the accuracy of the conference, improving the intelligence and quality of the conference, and also improving the user's meeting experience.
  • Exemplarily, terminal device n can send the call stream and the recognition stream to the server respectively, and the server can send the call stream and the recognition stream to the other terminal devices, such as terminal device 1.
  • Terminal device 1 outputs the call stream, so that user 1 can hear the voice content of the remote conference corresponding to the call stream, that is, user 1 can hear the speech content of user n; terminal device 1 also outputs the text content corresponding to the recognition stream, that is, user 1 can see the speech content of user n.
  • This embodiment provides a voice processing method, the method including: collecting audio information during a conference; generating a call stream and a recognition stream respectively according to the audio information, wherein the call stream is used for voice calls and the recognition stream is used for speech recognition; and sending the call stream and the recognition stream. By generating both a call stream for voice calls and a recognition stream for speech recognition, the method avoids the situation in the related art where audio information is processed in a relatively single way to obtain the content used to characterize the conference, so that the conference content corresponding to the audio information has more dimensions and is richer, thereby improving the accuracy, intelligence, and quality of the conference and improving the user's meeting experience.
  • FIG. 3 is a schematic flowchart of a voice processing method according to another embodiment of the present disclosure.
  • As shown in FIG. 3, the method includes:
  • S201: Collect audio information during the conference.
  • FIG. 4 is a schematic diagram of an application scenario of a voice processing method according to another embodiment of the present disclosure;
  • FIG. 5 is a schematic diagram of the principle of a voice processing method according to an embodiment of the present disclosure.
  • As shown in FIG. 4, the application scenario includes: an intelligent terminal, a first terminal device, a second terminal device, and a cloud server.
  • The first terminal device is a device through which the first participant user conducts a remote conference with the second participant user;
  • the second terminal device is a device through which the second participant user conducts a remote conference with the first participant user.
  • In FIG. 4, the smart terminal and the second terminal device are two independent devices, while in other embodiments the smart terminal may be integrated into the second terminal device; this embodiment does not limit the external form of the smart terminal.
  • In this scenario, this step can be understood as: when the second participant user makes a speech, the intelligent terminal can collect the corresponding audio information.
  • A microphone or a microphone array may be provided in the smart terminal, and the audio information is collected through the microphone or the microphone array.
  • The number of microphones in the microphone array can be set based on requirements, historical records, experiments, and the like; for example, the number of microphones is 6.
  • S202: Convert the signal type of the audio information, wherein the signal type includes an analog signal and a digital signal, the signal type of the audio information before conversion is an analog signal, and the signal type of the converted audio information is a digital signal.
  • Exemplarily, an analog-to-digital converter can be provided in the smart terminal; the microphone array sends the collected audio information of the analog signal to the analog-to-digital converter, and the analog-to-digital converter converts the audio information of the analog signal into audio information of a digital signal, so as to improve the efficiency and accuracy of subsequent processing of the audio information.
  • S203: Perform echo cancellation processing on the converted audio information to obtain a residual signal.
  • Exemplarily, a processor may be provided in the intelligent terminal; the processor is connected to the analog-to-digital converter to receive the converted audio information sent by the analog-to-digital converter, and the processor can perform echo cancellation processing on the converted audio information, where FIG. 6 is a schematic diagram of the processor shown in FIG. 5.
  • In one example, the method for echo cancellation processing may include: determining an echo signal corresponding to the audio information, and performing cancellation processing on the echo signal according to the acquired reference signal to obtain the residual signal.
  • Specifically, the echo path corresponding to the audio information can be estimated according to the microphone array and the speaker of the smart terminal; according to the echo path and the obtained reference signal (such as the reference signal obtained from the power amplifier in the speaker), the echo signal received by the microphone array is estimated; the difference between the reference signal and the echo signal is then calculated, and this difference is the residual signal, that is, the echo-cancelled signal.
  • In another example, the method for echo cancellation processing may further include: providing an adaptive filter in the processor, where the adaptive filter estimates an approximate echo path to approximate the real echo path, thereby obtaining an estimated echo signal; the echo signal is removed from the mixed signal of pure speech and echo to realize cancellation of the echo. The adaptive filter may specifically be a Finite Impulse Response (FIR) filter.
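  • As a minimal sketch of such an adaptive FIR echo canceller (the disclosure names only an adaptive FIR filter; the NLMS update rule, tap count, and step size below are assumptions made for illustration):

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=256, mu=0.5, eps=1e-8):
    # mic: microphone signal (near-end speech mixed with echo)
    # ref: far-end reference signal (e.g., the loudspeaker feed)
    # Returns the residual signal after echo cancellation.
    w = np.zeros(taps)                      # adaptive FIR estimate of the echo path
    out = np.zeros_like(mic, dtype=float)
    for n in range(taps, len(mic)):
        x = ref[n - taps:n][::-1]           # most recent reference samples
        echo_hat = w @ x                    # estimated echo at sample n
        e = mic[n] - echo_hat               # residual after cancellation
        w += mu * e * x / (x @ x + eps)     # normalised LMS update
        out[n] = e
    return out
```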
  • S204: Perform echo residual suppression processing on the residual signal to obtain a residual-echo-suppressed signal.
  • Exemplarily, the echo residual suppression processing method may include: performing Fourier transform on the residual signal to obtain a frequency domain signal, determining a frequency domain adjustment parameter corresponding to the frequency domain signal, adjusting the frequency domain signal according to the frequency domain adjustment parameter, and performing inverse Fourier transform on the adjusted frequency domain signal to obtain the residual-echo-suppressed signal.
  • Specifically, a deep learning neural network can be preset in the processor. The processor performs Fourier transform on the residual signal to obtain a frequency domain signal, the frequency domain signal is sent to the deep learning neural network, and the deep learning neural network outputs a mask in the frequency domain (the mask represents the probability of background noise in the frequency domain). The frequency domain signal is multiplied by the mask to obtain the processed frequency domain signal, and inverse Fourier transform is performed on the processed frequency domain signal to obtain the residual-echo-suppressed signal.
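  • A minimal sketch of this transform-mask-inverse-transform suppression, where `mask_model` stands in for the pretrained deep network (its architecture is not specified by the disclosure and is assumed here):

```python
import numpy as np
from scipy.signal import stft, istft

def suppress_residual_echo(residual, mask_model, fs=16000, nperseg=512):
    # Fourier transform -> frequency-domain mask -> inverse transform.
    _, _, spec = stft(residual, fs=fs, nperseg=nperseg)
    mask = mask_model(np.abs(spec))          # per-bin weighting in [0, 1]
    _, cleaned = istft(spec * mask, fs=fs, nperseg=nperseg)
    return cleaned

# Trivial stand-in "network" for demonstration: keep bins above the median energy.
demo_mask = lambda mag: (mag > np.median(mag)).astype(float)
suppressed = suppress_residual_echo(np.random.randn(16000), demo_mask)
```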
  • S205: Perform de-reverberation processing on the residual-echo-suppressed signal to obtain a de-reverberated signal.
  • In one example, the method for de-reverberation processing may include: constructing a multi-channel linear prediction (MCLP) model, where the multi-channel linear prediction model characterizes the current signal (that is, the residual-echo-suppressed signal) as a linear combination with the signals of the past several frames.
  • The signals of the past several frames are convolved based on the multi-channel linear prediction model to obtain the reverberation component of the current signal; the reverberation component is subtracted from the current signal, and the de-reverberated signal is obtained.
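  • A single-frequency-bin sketch of such linear-prediction de-reverberation (WPE-style; the delay, filter order, and iteration count are assumptions, since the disclosure does not fix them):

```python
import numpy as np

def mclp_dereverb_bin(X, delay=3, order=10, iters=3, eps=1e-8):
    # X: complex STFT frames of one frequency bin, shape (T,).
    # The reverberant tail is predicted from frames older than `delay`
    # and subtracted from the current frame.
    T = X.shape[0]
    D = X.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(D) ** 2, eps)    # per-frame variance estimate
        A = np.zeros((T, order), dtype=complex)  # delayed-frame regressors
        for k in range(order):
            A[delay + k:, k] = X[:T - delay - k]
        W = A.conj().T / lam                     # variance-weighted regressors
        g = np.linalg.solve(W @ A + eps * np.eye(order), W @ X)
        D = X - A @ g                            # subtract predicted reverberation
    return D

dereverbed = mclp_dereverb_bin(np.random.randn(200) + 1j * np.random.randn(200))
```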
  • In another example, the method for de-reverberation processing may further include: determining the Mel Frequency Cepstrum Coefficients (MFCC) corresponding to each microphone in the microphone array, determining the cepstral coefficient differences between adjacent microphones, and constructing the de-reverberated signal based on the cepstral coefficient differences.
  • S206: Process the de-reverberated signal according to different processing methods to obtain a call stream and a recognition stream.
  • Exemplarily, S206 may include: performing clarity enhancement processing on the de-reverberated signal to obtain the call stream; and performing fidelity processing on the de-reverberated signal to obtain the recognition stream.
  • As shown in FIG. 6, the processor may include a preprocessor, a clarity enhancement processor, and a fidelity processor, where the preprocessor is used to perform the echo cancellation processing, the echo residual suppression processing, and the de-reverberation processing, the clarity enhancement processor is used to enhance the clarity of the signal processed by the preprocessor, and the fidelity processor is used to perform fidelity processing on the signal processed by the preprocessor.
  • In one example, the clarity enhancement processing may include: basic spectral subtraction.
  • Exemplarily, basic spectral subtraction can be understood as: presetting a basic frequency range, and removing the components of the de-reverberated signal outside the basic frequency range, as sketched below.
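  • The disclosure describes "basic spectral subtraction" only loosely; the sketch below shows the textbook magnitude-domain variant, under the assumption (made here for illustration) that the leading frames are speech-free:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(x, fs=16000, nperseg=512, noise_frames=10):
    # Estimate the noise magnitude from leading frames, subtract it from every
    # frame, floor at zero, and resynthesise with the original phase.
    _, _, S = stft(x, fs=fs, nperseg=nperseg)
    noise_mag = np.abs(S[:, :noise_frames]).mean(axis=1, keepdims=True)
    mag = np.maximum(np.abs(S) - noise_mag, 0.0)
    _, y = istft(mag * np.exp(1j * np.angle(S)), fs=fs, nperseg=nperseg)
    return y

denoised = spectral_subtract(np.random.randn(16000))
```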
  • In another example, the noise reduction processing may include: Wiener filter noise reduction.
  • Exemplarily, Wiener filter noise reduction can be understood as: training the filter based on a preset mean square error criterion, and filtering the de-reverberated signal with the filter, so that the error between the filtered de-reverberated signal and the pure de-reverberated signal is less than a preset error threshold.
  • In yet another example, the clarity enhancement processing may include: sequentially performing beam processing, synthetic noise reduction processing, minimum beam processing, suppression noise reduction processing, vocal equalization processing, and automatic gain control on the de-reverberated signal to obtain the call stream.
  • Exemplarily, the method of beam processing may include: determining multiple sets of beam signals corresponding to the de-reverberated signal.
  • Specifically, a generalized sidelobe canceller (GSC) model is established, the de-reverberated signal is input into the generalized sidelobe canceller model, and multiple sets of beam signals in the horizontal space are output.
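  • A minimal time-domain GSC sketch for an M-microphone array (the steering weights, blocking matrix, and LMS step size are illustrative assumptions):

```python
import numpy as np

def gsc_output(frames, steering, mu=0.01):
    # frames: (M, T) time-aligned microphone signals
    # steering: (M,) weights for the look direction (assumed known)
    M, T = frames.shape
    fbf = steering @ frames / M                  # fixed beamformer output
    # Blocking matrix: pairwise differences cancel the look direction.
    B = np.eye(M - 1, M) - np.eye(M - 1, M, k=1)
    refs = B @ frames                            # (M-1, T) target-blocked noise refs
    w = np.zeros(M - 1)
    out = np.zeros(T)
    for n in range(T):
        y = fbf[n] - w @ refs[:, n]              # adaptive interference cancellation
        w += mu * y * refs[:, n]                 # LMS update of the lower path
        out[n] = y
    return out

beam = gsc_output(np.random.randn(6, 1600), np.ones(6))
```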
  • Exemplarily, the method of synthetic noise reduction processing may include: determining an expected estimate of the de-reverberated signal (that is, an estimate of the clean signal of the audio information), and phase-synthesizing the expected estimate of the de-reverberated signal with the multiple sets of beam signals to obtain a denoised beam signal.
  • Specifically, the amplitude spectrum of the de-reverberated signal is modeled, the amplitude spectra of speech and noise are constructed to obey a Gaussian distribution, the steady-state noise of the meeting is obtained, and the posterior signal-to-noise ratio of the de-reverberated signal is estimated, from which the expected estimate of the de-reverberated signal is obtained; the expected estimate of the de-reverberated signal is then phase-synthesized with the multiple groups of beams to obtain the denoised beam signal.
  • Exemplarily, the method for minimum beam processing may include: determining the energy ratio between the beam signal with the maximum energy and the beam signal with the minimum energy among the denoised beam signals, and determining a normalized beam signal according to the energy ratio.
  • Exemplarily, the method for suppression noise reduction processing may include: determining a mask of the normalized beam signal, and suppressing the non-stationary noise of the normalized beam signal according to the mask of the normalized beam signal to obtain a suppressed beam signal.
  • Specifically, a recurrent neural network can be preset; the normalized beam signal is input to the recurrent neural network, the recurrent neural network outputs the mask of the normalized beam signal (the mask represents the probability that a component is background noise), and the mask is multiplied with the non-stationary noise to obtain the suppressed beam signal.
  • Exemplarily, the method for vocal equalization processing may include: compensating the suppressed beam signal in a preset frequency band to obtain a compensated beam signal.
  • Specifically, a segmented peak filter can be preset, and the suppressed beam signal output by the noise reduction is compensated in a preset frequency band (which can be set based on requirements, historical records, experiments, and the like, and is not limited in this embodiment) to obtain the compensated beam signal, so that the perceived sound quality corresponding to the compensated beam signal is higher.
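  • One plausible building block for such a segmented peak filter is the standard peaking biquad (RBJ audio-EQ form); the centre frequency, gain, and Q below are assumptions:

```python
import numpy as np
from scipy.signal import lfilter

def peaking_eq(x, fs, f0=2500.0, gain_db=3.0, q=1.0):
    # Boost (or cut) a band around f0 while leaving other bands untouched.
    A = 10 ** (gain_db / 40)
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
    return lfilter(b / a[0], a / a[0], x)

compensated = peaking_eq(np.random.randn(16000), fs=16000)
```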
  • In one example, the method for automatic gain control may include: performing Fourier transform on the compensated beam signal to obtain a power spectrum, and inputting the power spectrum into a preset convolutional neural network to obtain the speech existence probability of the current frame; if the speech existence probability of the current frame is greater than a preset probability threshold, it is determined that speech exists in the current frame, and a gradually increasing gain is applied to the compensated beam signal until the gain of the compensated beam signal is stable, so as to obtain the call stream.
  • In another example, the method for automatic gain control may include the following steps (see the sketch after these steps):
  • Step 1: Determine the gain weight according to the compensated beam signal and a preset equal loudness curve.
  • The equal loudness curve can be used to characterize a curve corresponding to compensated beam signals with relatively high user satisfaction, and is determined based on experiments or the like.
  • Specifically, the compensated beam signal may be mapped to the equal loudness curve, and the gain weight may be determined based on the difference between the two.
  • Step 2: Perform enhancement processing on the compensated beam signal according to the gain weight.
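  • A sketch compressing Step 1 and Step 2 into one loop (for brevity, a flat RMS target stands in for the equal loudness curve, and an energy threshold stands in for the convolutional speech-presence network; all numeric values are assumptions):

```python
import numpy as np

def speech_prob(seg, thresh=1e-4):
    # Placeholder for the speech-presence estimate described above.
    return float(np.mean(seg ** 2) > thresh)

def smoothed_agc(x, target_rms=0.1, frame=160, max_step_db=0.5):
    # Ramp the gain gradually toward the target while speech is present,
    # holding it otherwise, so the applied gain stabilises over time.
    gain_db = 0.0
    out = np.zeros_like(x, dtype=float)
    for i in range(0, len(x) - frame + 1, frame):
        seg = x[i:i + frame]
        rms = np.sqrt(np.mean(seg ** 2)) + 1e-12
        if speech_prob(seg) > 0.5:
            desired_db = 20.0 * np.log10(target_rms / rms)
            gain_db += np.clip(desired_db - gain_db, -max_step_db, max_step_db)
        out[i:i + frame] = seg * 10.0 ** (gain_db / 20.0)
    return out

leveled = smoothed_agc(np.random.randn(16000) * 0.01)
```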
  • In one example, the fidelity processing method may include: performing voiceprint recognition processing on the de-reverberated signal.
  • Exemplarily, the intelligent terminal performs feature extraction processing on the de-reverberated signal to obtain features of the de-reverberated signal, such as pitch, intensity, length, and timbre, and restores the de-reverberated signal based on these features to obtain the recognition stream, so that the recognition stream has relatively low distortion.
  • In another example, the method for fidelity processing may include: performing angle-of-arrival estimation processing and beam selection processing on the de-reverberated signal.
  • Exemplarily, the method for angle-of-arrival estimation may include: performing multiple signal classification (MUSIC) processing on the de-reverberated signal to obtain a directional spectrum, and determining the sound source direction corresponding to the de-reverberated signal according to the directional spectrum.
  • Specifically, multiple signal classification processing is performed on the de-reverberated signal to obtain the directional spectrum of the de-reverberated signal over frequency and time; a histogram corresponding to the directional spectrum can be constructed according to frequency and time, and the sound source direction of the de-reverberated signal can be determined based on the histogram, as sketched below.
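  • A narrowband MUSIC pseudo-spectrum sketch for a linear array (the array geometry, source count, and scan grid are assumptions; peaks of the spectrum indicate the sound source direction):

```python
import numpy as np

def music_spectrum(R, mic_pos, freq, n_src=1, c=343.0,
                   angles=np.linspace(0.0, 180.0, 181)):
    # R: (M, M) spatial covariance of one STFT bin at frequency `freq`
    # mic_pos: (M,) microphone positions along the array axis, in metres
    _, vecs = np.linalg.eigh(R)
    En = vecs[:, :-n_src]                        # noise subspace (smallest eigenvalues)
    p = np.zeros(len(angles))
    for i, a in enumerate(angles):
        tau = np.cos(np.deg2rad(a)) * mic_pos / c
        sv = np.exp(-2j * np.pi * freq * tau)    # steering vector for angle a
        p[i] = 1.0 / np.abs(sv.conj() @ En @ En.conj().T @ sv)
    return angles, p

mics = np.arange(6) * 0.05                       # hypothetical 6-mic array, 5 cm pitch
angles, p = music_spectrum(np.eye(6, dtype=complex), mics, freq=1000.0)
```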
  • Exemplarily, the beam selection processing method may include: determining the start point, the end point, and the controllable power response (steered response power) of the de-reverberated signal according to the sound source direction, and selecting the recognition stream from the de-reverberated signal according to the start point, the end point, and the controllable power response.
  • It should be noted that, when the methods shown in FIG. 6 perform clarity enhancement processing on the audio information, the order of the processing methods can be adjusted freely; for example, the suppression noise reduction processing may be performed first, followed by the minimum beam processing, and so on.
  • In one example, the smart terminal can send the call stream to the cloud server through the communication module; correspondingly, the cloud server can send the call stream to the first terminal device, and the first terminal device can perform voice broadcast based on the call stream.
  • Similarly, the intelligent terminal can send the recognition stream to the cloud server; correspondingly, the cloud server can send the recognition stream to the first terminal device, and the first terminal device can perform text display based on the recognition stream.
  • The cloud server may also perform speech recognition based on the recognition stream to obtain a recognition result (that is, the transcribed text), and send the recognition stream and/or the transcribed text to the first terminal device, and the first terminal device can perform text display on the transcribed text.
  • Of course, the recognition stream and/or the transcribed text can also be stored by the cloud server.
  • The cloud server may also send the recognition stream and/or the transcribed text to the second terminal device; correspondingly, the second terminal device may perform text display on the transcribed text.
  • The cloud server may also send the recognition stream and/or the transcribed text to a third terminal device; correspondingly, the third terminal device may perform text display on the transcribed text.
  • The third terminal device may be a terminal device outside the remote conference; that is to say, the third terminal device is a device that has a display function and can display the transcribed text, and the number of third terminal devices is not limited in this embodiment.
  • In another example, the smart terminal can send the call stream to the second terminal device through the communication module, and the software for conducting the conference runs on the second terminal device. Correspondingly, the second terminal device can send the call stream to the first terminal device based on the conference software, and the first terminal device may perform voice broadcast based on the call stream. For the principle by which the intelligent terminal sends the recognition stream, reference may be made to the above example, which will not be repeated here.
  • In yet another example, a server can be added to the application scenario shown in FIG. 4 and the second terminal device in FIG. 4 can be removed; the intelligent terminal can send the call stream to the added server, the added server can send the call stream to the first terminal device, and correspondingly, the first terminal device can perform voice broadcast based on the call stream. As described in the above example, the intelligent terminal can send the recognition stream to the cloud server; correspondingly, the cloud server can send the recognition stream to the first terminal device, and the first terminal device may perform text display based on the recognition stream.
  • Similarly, the cloud server can also perform speech recognition based on the recognition stream to obtain the recognition result (that is, the transcribed text), and send the recognition stream and/or the transcribed text to the first terminal device, and the first terminal device can display the transcribed text; of course, the recognition stream and/or the transcribed text can also be stored by the cloud server.
  • In yet another example, the cloud server in FIG. 4 can be removed. For example, the smart terminal can send the call stream and the recognition stream to the second terminal device; correspondingly, the second terminal device can send the call stream and the recognition stream to the first terminal device, and correspondingly, the first terminal device may perform voice broadcast based on the call stream, determine the transcribed text based on the recognition stream, and perform text display based on the transcribed text.
  • In yet another example, the second terminal device in FIG. 4 can be removed. For example, the smart terminal can send the call stream and the recognition stream to the cloud server; correspondingly, the cloud server can send the call stream and the recognition stream to the first terminal device, and correspondingly, the first terminal device may perform voice broadcast based on the call stream, determine the transcribed text based on the recognition stream, and perform text display based on the transcribed text.
  • In yet another example, the smart terminal may send the recognition stream to the second terminal device; correspondingly, the second terminal device may send the recognition stream to the first terminal device, and the first terminal device may determine the transcribed text based on the recognition stream and perform text display based on the transcribed text. The intelligent terminal can send the call stream to the cloud server; correspondingly, the cloud server can send the call stream to the first terminal device, and the first terminal device can perform voice broadcast based on the call stream.
  • In yet another example, the smart terminal may send the call stream and the recognition stream to the second terminal device; correspondingly, the second terminal device may send the call stream and the recognition stream to the cloud server, the cloud server may send the call stream and the recognition stream to the first terminal device, and correspondingly, the first terminal device may perform voice broadcast based on the call stream, determine the transcribed text based on the recognition stream, and perform text display based on the transcribed text.
  • Exemplarily, the communication module may include: a Universal Serial Bus (USB) interface, Wireless Fidelity (Wi-Fi), and Bluetooth.
  • Exemplarily, the smart terminal can be connected to the second terminal device based on any one of the USB interface, Wi-Fi, and Bluetooth; the smart terminal can be connected to the cloud server based on Wi-Fi; and the second terminal device can be connected to the cloud server based on Wi-Fi.
  • In some embodiments, a memory can be provided in the smart terminal, and the memory can be connected to the processor.
  • In one example, the memory can receive the recognition stream sent by the processor, sequentially perform encoding processing and compression processing on the recognition stream, and store it; in another example, the memory may receive the recognition stream already encoded and compressed by the processor, and store the received processed recognition stream.
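  • As a loose illustration of the encode-compress-store order (zlib stands in for whatever codec the terminal actually uses, and the file path is hypothetical):

```python
import zlib

def store_recognition_stream(pcm_bytes, path="recognition.bin"):
    # Compress the already-encoded stream, then persist it.
    compressed = zlib.compress(pcm_bytes, 6)
    with open(path, "wb") as f:
        f.write(compressed)
    return len(compressed)

stored_size = store_recognition_stream(b"\x00" * 16000)
```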
  • It should be noted that the smart terminal can also receive the call stream, or the call stream and the recognition stream, sent by the first terminal device. When the smart terminal receives the call stream sent by the first terminal device, voice playback can be performed based on the call stream, and when the smart terminal receives the recognition stream sent by the first terminal device, text display can also be performed based on the recognition stream.
  • For example, the first user participating in the conference can make a speech by means of the first terminal device, and the first terminal device can collect the corresponding audio information and generate a call stream based on the audio information.
  • Exemplarily, a speaker can be provided in the smart terminal, and the smart terminal can perform voice playback of the call stream through the speaker.
  • In some embodiments, the first terminal device may also generate the call stream and the recognition stream based on the above method and send both the call stream and the recognition stream to the second terminal device; the second terminal device may send the call stream to the smart terminal for voice playback by the intelligent terminal, and the second terminal device performs text display based on the recognition stream.
  • In other embodiments, the smart terminal can interact with the first terminal device directly, without intermediate forwarding by the second terminal device, and a display can also be provided in the smart terminal.
  • On the one hand, the smart terminal can play voice through the speaker; on the other hand, the smart terminal can display text through the display.
  • The display can be used to characterize a device that displays text, such as a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and an Organic Light Emitting Display (OLED), which is not limited in the embodiments of the present application.
  • The embodiments of the present disclosure further provide an intelligent terminal.
  • Exemplarily, the intelligent terminal may include: a microphone array, a processor, and a communication module (not shown in the figure);
  • the microphone array is configured to collect audio information during the conference;
  • the processor is configured to respectively generate a call stream and a recognition stream according to the audio information, wherein the call stream is used for voice calls and the recognition stream is used for speech recognition;
  • the communication module is configured to send the call stream and the recognition stream.
  • In some embodiments, the processor is configured to process the audio information according to different processing methods to obtain the call stream and the recognition stream.
  • In some embodiments, the processor is configured to perform clarity enhancement processing on the audio information to obtain the call stream, and to perform fidelity processing on the audio information to obtain the recognition stream.
  • In some embodiments, the processor is configured to perform noise reduction processing and automatic gain control on the audio information to obtain the call stream.
  • In some embodiments, the processor is configured to perform beam selection processing on the audio information to obtain the recognition stream.
  • In some embodiments, the processor is configured to perform echo cancellation processing on the audio information.
  • In some embodiments, the intelligent terminal further includes:
  • a loudspeaker, configured to perform voice broadcast of the call stream sent by the first terminal device participating in the conference.
  • In some embodiments, the intelligent terminal further includes:
  • an analog-to-digital converter, configured to convert the signal type of the audio information to obtain converted audio information, wherein the signal type of the converted audio information is a digital signal;
  • correspondingly, the processor is configured to perform echo cancellation processing on the converted audio information.
  • In some embodiments, the intelligent terminal further includes:
  • a memory, configured to store the recognition stream.
  • In some embodiments, the processor is configured to perform encoding processing and compression processing on the recognition stream;
  • the memory is configured to store the processed recognition stream.
  • In some embodiments, the communication module includes any one of a Universal Serial Bus (USB) interface, Wi-Fi, and Bluetooth.
  • The embodiments of the present disclosure further provide a voice processing apparatus.
  • FIG. 7 is a schematic diagram of a speech processing apparatus according to an embodiment of the present disclosure.
  • As shown in FIG. 7, the apparatus includes:
  • a collection module 11, configured to collect audio information during the conference;
  • a generating module 12, configured to respectively generate a call stream and a recognition stream according to the audio information, wherein the call stream is used for voice calls and the recognition stream is used for speech recognition;
  • a sending module 13, configured to send the call stream and the recognition stream.
  • In some embodiments, the generating module 12 is configured to process the audio information according to different processing methods to obtain the call stream and the recognition stream.
  • In some embodiments, the generating module 12 is configured to perform clarity enhancement processing on the audio information to obtain the call stream, and to perform fidelity processing on the audio information to obtain the recognition stream.
  • In some embodiments, the generating module 12 is configured to perform noise reduction processing and automatic gain control on the audio information to obtain the call stream.
  • In some embodiments, the generating module 12 is configured to perform beam selection processing on the audio information to obtain the recognition stream.
  • In some embodiments, the generating module 12 is configured to perform echo cancellation processing on the audio information.
  • In some embodiments, the signal type of the audio information is an analog signal, and the apparatus further includes: a conversion module 14, configured to convert the signal type of the audio information to obtain converted audio information, wherein the signal type of the converted audio information is a digital signal.
  • In some embodiments, the apparatus further includes: a storage module 15, configured to store the recognition stream.
  • In some embodiments, the storage module 15 is configured to perform encoding processing and compression processing on the recognition stream, and to store the processed recognition stream.
  • The embodiments of the present disclosure further provide an electronic device and a storage medium.
  • The electronic device 900 may be a terminal device or a server.
  • The terminal device may include, but is not limited to, mobile terminals such as smart speakers, mobile phones, notebook computers, digital broadcast receivers, Personal Digital Assistants (PDA), tablet computers (Portable Android Device, PAD), Portable Media Players (PMP), and in-vehicle terminals (such as in-vehicle navigation terminals), as well as fixed terminals such as digital TVs (Television) and desktop computers.
  • The electronic device shown in FIG. 9 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • The electronic device 900 may include a processing device (such as a central processing unit, a graphics processor, etc.) 901, which may execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage device 908 into a Random Access Memory (RAM) 903.
  • In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are also stored.
  • The processing device 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904.
  • An Input/Output (I/O) interface 905 is also connected to the bus 904.
  • Generally, the following devices can be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 907 including, for example, a Liquid Crystal Display (LCD), speaker, vibrator, etc.; storage devices 908 including, for example, magnetic tape, hard disk, etc.; and a communication device 909.
  • The communication device 909 may allow the electronic device 900 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 9 shows an electronic device 900 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.
  • Embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • In such an embodiment, the computer program may be downloaded and installed from the network via the communication device 909, or installed from the storage device 908, or installed from the ROM 902.
  • When the computer program is executed by the processing device 901, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • An embodiment of the present disclosure further provides a computer program, which, when executed by a processor, executes the speech processing method provided by any of the foregoing embodiments.
  • The computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • The computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
  • More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an Erasable Programmable Read Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the program code embodied on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: an electrical wire, an optical fiber cable, radio frequency (RF), and the like, or any suitable combination of the above.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • the aforementioned computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to execute the methods shown in the foregoing embodiments.
  • Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure may be implemented in software or in hardware.
  • the name of a unit does not, in some cases, constitute a limitation of the unit itself; for example, the first obtaining unit may also be described as "a unit that obtains at least two Internet Protocol addresses".
  • exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice processing method, apparatus, and system, a smart terminal, an electronic device (800), and a storage medium. The method includes: collecting audio information during a conference (S101, S201); generating a call stream and a recognition stream respectively according to the audio information (S102), where the call stream is used for a voice call and the recognition stream is used for voice recognition; and sending the call stream and the recognition stream respectively (S103, S207). With the technical solution of generating a call stream and a recognition stream respectively on the basis of the audio information, the conference content corresponding to the determined audio information is presented in more numerous and richer dimensions, which improves the accuracy of the conference, improves the intelligence and quality of the conference, and also improves the users' conference experience.

Description

Voice processing method, apparatus, system, smart terminal, and electronic device
Cross-Reference to Related Application
This application claims priority to the Chinese patent application No. 202011598381.X, filed on December 29, 2020 and entitled "Voice Processing Method, Apparatus, System, Smart Terminal, and Electronic Device", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate to the technical fields of computers, voice processing, and network communication, and in particular to a voice processing method, apparatus, and system, a smart terminal, an electronic device, and a storage medium.
Background
A conference refers to achieving the purpose of holding a meeting by means of modern communication. Conferences may include remote conferences, and remote conferences may mainly include telephone conferences, web conferences, and video conferences.
At present, the voice processing method applied in conference scenarios includes: a local conference device collects the audio information corresponding to the local user and sends it to a peer conference device; correspondingly, the peer conference device collects the audio information of the peer user and sends it to the local conference device, where the audio information is used for a voice call.
However, the conventional voice processing method has at least the following technical problem: implementing a conference through audio information that is used only for a voice call may cause the conference content to be presented in few dimensions and with relatively low richness, so that the quality of the conference is relatively low.
Summary
Embodiments of the present disclosure provide a voice processing method, apparatus, and system, a smart terminal, an electronic device, and a storage medium, so as to overcome the problem in the related art that the quality of a conference is relatively low.
In a first aspect, an embodiment of the present disclosure provides a voice processing method, including:
collecting audio information during a conference;
generating a call stream and a recognition stream respectively according to the audio information, where the call stream is used for a voice call and the recognition stream is used for voice recognition; and
sending the call stream and the recognition stream.
In a second aspect, an embodiment of the present disclosure provides a smart terminal, including: a microphone array, a processor, and a communication module;
the microphone array is configured to collect audio information during a conference;
the processor is configured to generate a call stream and a recognition stream respectively according to the audio information, where the call stream is used for a voice call and the recognition stream is used for voice recognition; and
the communication module is configured to send the call stream and the recognition stream.
In a third aspect, an embodiment of the present disclosure provides a voice processing apparatus, including:
a collection module, configured to collect audio information during a conference;
a generation module, configured to generate a call stream and a recognition stream respectively according to the audio information, where the call stream is used for a voice call and the recognition stream is used for voice recognition; and
a sending module, configured to send the call stream and the recognition stream.
In a fourth aspect, an embodiment of the present disclosure provides a voice processing system, including: a first terminal device and the smart terminal according to the second aspect above; or a first terminal device and the apparatus according to the third aspect above, where the first terminal device is a terminal device participating in the conference.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor and a memory;
the memory stores computer-executable instructions; and
the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the voice processing method according to the first aspect above and various possible designs of the first aspect.
In a sixth aspect, an embodiment of the present disclosure provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the voice processing method according to the first aspect above and various possible designs of the first aspect.
In a seventh aspect, an embodiment of the present disclosure provides a computer program product, including a computer program carried on a non-transitory computer-readable medium, where the computer program, when executed by a processor, performs the voice processing method according to the first aspect above and various possible designs of the first aspect.
In an eighth aspect, an embodiment of the present disclosure provides a computer program which, when executed by a processor, performs the voice processing method according to the first aspect above and various possible designs of the first aspect.
The voice processing method, apparatus, and system, smart terminal, electronic device, and storage medium provided by this embodiment include: collecting audio information during a conference, generating a call stream and a recognition stream respectively according to the audio information, where the call stream is used for a voice call and the recognition stream is used for voice recognition, and sending the call stream and the recognition stream respectively. The technical solution featuring the generation of a call stream and a recognition stream respectively on the basis of the audio information avoids the problem in the related art that the conference content is presented in a single dimension and with relatively low richness, so that the conference content corresponding to the determined audio information is presented in more numerous and richer dimensions; this improves the conference users' understanding of the conference content, thereby improving the accuracy of the conference, improving the intelligence and quality of the conference, and also improving the users' conference experience.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the accompanying drawings in the following description are some embodiments of the present disclosure, and a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application scenario of a voice processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of a voice processing method according to an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of a voice processing method according to another embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an application scenario of a voice processing method according to another embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the principle of a voice processing method according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of the processor shown in FIG. 5;
FIG. 7 is a schematic diagram of a voice processing apparatus according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a voice processing apparatus according to another embodiment of the present disclosure;
FIG. 9 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are some rather than all of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort shall fall within the protection scope of the present disclosure.
The voice processing method provided by the embodiments of the present disclosure can be applied to conference application scenarios, and in particular to remote-conference application scenarios, where a remote conference refers to achieving the purpose of holding a conference across regions by means of modern communication, and a remote conference system may include telephone conferences, web conferences, video conferences, and the like.
FIG. 1 is a schematic diagram of an application scenario of a voice processing method according to an embodiment of the present disclosure.
As shown in FIG. 1, the application scenario may include: a server, at least two terminal devices, and the user corresponding to each terminal device. FIG. 1 exemplarily shows n terminal devices, that is, the number of conference participants is n.
Exemplarily, the server may establish a communication link with each terminal device and exchange information with each terminal device based on the communication link, so that the users corresponding to the respective terminal devices can communicate through the remote conference.
A remote conference includes multiple parties of users; one party of users may correspond to one terminal device, and the number of users in each party may be one or more, which is not limited in this embodiment. For example, a remote conference includes multiple parties of users, and the multiple parties are multiple staff members of different enterprises; for another example, a remote conference includes two parties of users, and the two parties are multiple staff members of different departments of the same enterprise; for still another example, a remote conference includes two parties of users, where one party is multiple staff members of an enterprise and the other party is an individual user, and so on.
The terminal device may be a mobile terminal, such as a mobile phone (also called a "cellular" phone) or a computer with a mobile terminal, for example, a portable, pocket-sized, handheld, computer-built-in, or vehicle-mounted mobile apparatus that exchanges speech and/or data with a radio access network; the terminal device may also be a smart speaker, a Personal Communication Service (PCS) phone, a cordless telephone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA), a tablet computer, a wireless modem, a handset, a laptop computer, a Machine Type Communication (MTC) terminal, or the like; the terminal device may also be called a system, a subscriber unit, a subscriber station, a mobile station, a mobile, a remote station, a remote terminal, an access terminal, a user terminal, a user agent, a user device or user equipment, and so on, which is not limited herein.
It should be noted that the above example is only used to exemplarily illustrate application scenarios to which the voice processing method of the embodiments of the present disclosure may be applied, and should not be construed as limiting the application scenarios. For example, elements in the application scenario may be adaptively added on the basis of the above example, such as increasing the number of terminal devices; for another example, elements in the application scenario may be adaptively removed on the basis of the above example, such as reducing the number of terminal devices and/or removing the server, and so on.
In the related art, each terminal device may collect the audio information of the user corresponding to that terminal device, generate a call stream (used for a voice call) according to the audio information, and send the call stream to the server based on the communication link between that terminal device and the server; the server may send the call stream, based on the other communication links, to the terminal devices corresponding to those links, and the other terminal devices may all output the call stream, so that the users corresponding to the other terminal devices can all hear the voice and content of the user corresponding to that terminal device.
However, what each terminal device transmits includes only the call stream, so that the conference content in the remote conference is presented in a rather single dimension and the intelligence is low.
It should be noted that the above example is only used to exemplarily illustrate application scenarios to which the voice processing method of this embodiment may be applied, and should not be construed as limiting the application scenarios of the voice processing method of the embodiments of the present disclosure. For example, the voice processing method of this embodiment may also be applied to other conference scenarios (such as local conference scenarios), or to other scenarios where voice processing needs to be performed on audio information.
Through creative work, the inventor of the present disclosure arrived at the inventive concept of the present disclosure: generating a call stream and a recognition stream respectively on the basis of the audio information, where the call stream is used for a voice call and the recognition stream is used for voice recognition, thereby achieving diversity of the conference content used for the conference and improving the users' conference experience.
The technical solutions of the present disclosure and how they solve the above technical problems are described in detail below with specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of the present disclosure are described below with reference to the accompanying drawings.
Please refer to FIG. 2, which is a schematic flowchart of a voice processing method according to an embodiment of the present disclosure.
As shown in FIG. 2, the method includes:
S101: Collect audio information during a conference.
Exemplarily, the execution subject of this embodiment may be a voice processing apparatus, and the voice processing apparatus may be a terminal device, a server, a processor, a chip, or the like, which is not limited in this embodiment.
For example, when the voice processing method of this embodiment is applied to the application scenario shown in FIG. 1, the voice processing apparatus may be a terminal device as shown in FIG. 1, such as at least one of terminal device 1 to terminal device n in FIG. 1.
Correspondingly, taking terminal device n as an example, when user n speaks, terminal device n may collect the corresponding audio information.
S102: Generate a call stream and a recognition stream respectively according to the audio information, where the call stream is used for a voice call and the recognition stream is used for voice recognition.
In this embodiment, the voice processing apparatus generates, based on the collected audio information, a call stream used for a voice call and a recognition stream used for voice recognition. In connection with the above example, this step can be understood as: terminal device n processes the audio information and generates the call stream and the recognition stream respectively.
It should be noted that the technical solution provided in this embodiment, featuring the generation of a call stream and a recognition stream respectively on the basis of the audio information, avoids the problem in the related art that the conference content used for the conference is rather single, which may cause the conference content received by the peer user to be inaccurate, i.e., the problem of users obtaining wrong conference content; it improves the conference users' understanding of the conference content, thereby improving the accuracy of the conference, improving the intelligence and quality of the conference, and also improving the users' conference experience.
S103: Send the call stream and the recognition stream.
In connection with the application scenario shown in FIG. 1, if the voice processing apparatus is terminal device n, then in one possible technical solution, terminal device n may send the call stream and the recognition stream respectively to the server, and the server may send the call stream and the recognition stream to terminal device 1. Correspondingly, terminal device 1 outputs the call stream, and user 1 can hear the voice content of the remote conference corresponding to the call stream, that is, user 1 can hear the speech of user n; terminal device 1 also outputs the text content corresponding to the recognition stream, that is, user 1 can see the content of user n's speech.
Based on the above analysis, this embodiment provides a voice processing method, including: collecting audio information during a conference; generating a call stream and a recognition stream respectively according to the audio information, where the call stream is used for a voice call and the recognition stream is used for voice recognition; and sending the call stream and the recognition stream. By generating both a call stream for the voice call and a recognition stream for voice recognition, it is possible to avoid the problem in the related art that the audio information is processed in a rather single way to obtain the content representing the conference; the conference content corresponding to the determined audio information becomes more numerous and richer, thereby improving the accuracy of the conference, improving the intelligence and quality of the conference, and also improving the users' conference experience.
Please refer to FIG. 3, which is a schematic flowchart of a voice processing method according to another embodiment of the present disclosure.
As shown in FIG. 3, the method includes:
S201: Collect audio information during a conference.
To give readers a deeper understanding of the technical solution of this embodiment and of the differences between this technical solution and related technical solutions, the voice processing method shown in FIG. 3 is now described in more detail with reference to FIG. 4 and FIG. 5, where FIG. 4 is a schematic diagram of an application scenario of a voice processing method according to another embodiment of the present disclosure, and FIG. 5 is a schematic diagram of the principle of a voice processing method according to an embodiment of the present disclosure.
As shown in FIG. 4, the application scenario includes: a smart terminal, a first terminal device, a second terminal device, and a cloud server. The first terminal device is the device with which a first participating user holds a remote conference with a second participating user, and the smart terminal and the second terminal device are the devices with which the second participating user holds the remote conference with the first participating user.
It should be noted that in the application scenario shown in FIG. 4, the smart terminal and the second terminal device are two independent devices, while in other embodiments the smart terminal may be integrated into the second terminal device; the external form of the smart terminal is not limited in this embodiment.
In connection with the application scenario shown in FIG. 4, this step can be understood as: when the second participating user speaks, the smart terminal may collect the corresponding audio information.
As can be seen from FIG. 5, in one possible implementation, a microphone or a microphone array may be provided in the smart terminal, and the audio information is collected through the microphone or the microphone array.
It should be noted that the number of microphones in the microphone array can be set based on requirements, historical records, experiments, and the like; for example, the number of microphones is six.
S202: Convert the signal type of the audio information, where the signal types include analog signals and digital signals, the signal type of the audio information before conversion is an analog signal, and the signal type of the audio information after conversion is a digital signal.
As can be seen from FIG. 5, in one possible implementation, an analog-to-digital converter may be provided in the smart terminal; the microphone array sends the collected analog audio information to the analog-to-digital converter, and the analog-to-digital converter converts the analog audio information into digital audio information, so as to improve the efficiency and accuracy of subsequent processing of the audio information.
S203: Perform echo cancellation processing on the converted audio information to obtain a residual signal.
As can be seen from FIG. 5 and FIG. 6, in one possible implementation, a processor may be provided in the smart terminal; the processor is connected to the analog-to-digital converter and is configured to receive the converted audio information sent by the analog-to-digital converter, and the processor may perform echo cancellation processing on the converted audio information, where FIG. 6 is a schematic diagram of the processor shown in FIG. 5.
In one example, the echo cancellation method may include: determining an echo signal corresponding to the audio information, and cancelling the echo signal according to an acquired reference signal to obtain a residual signal.
For example, the echo path corresponding to the audio information may be estimated from the microphone array and the loudspeaker of the smart terminal; the echo signal received by the microphone array is estimated from the echo path and an acquired reference signal (such as a reference signal acquired from the power amplifier in the loudspeaker); and the difference between the reference signal and the echo signal is calculated, which difference is the residual signal, i.e., the signal from which the echo has been cancelled.
In another example, the echo cancellation method may also include: an adaptive filter may be provided in the processor; the adaptive filter can estimate an approximate echo path to approach the true echo path, thereby obtaining an estimated echo signal, and the echo is cancelled by removing the echo signal from the mixed signal of clean speech and echo; the adaptive filter may specifically be a Finite Impulse Response (FIR) filter.
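To make the adaptive-filter idea concrete, the following is a minimal normalized least-mean-squares (NLMS) echo canceller in Python with NumPy. It is a generic illustrative sketch rather than the implementation of this disclosure; the function name, tap count, and step size are assumptions made here.

    import numpy as np

    def nlms_echo_cancel(mic, ref, taps=256, mu=0.1, eps=1e-8):
        """Cancel the echo of `ref` (far-end reference) from `mic`.

        An adaptive FIR filter `w` estimates the echo path; the returned
        residual is the microphone signal minus the estimated echo.
        """
        w = np.zeros(taps)               # adaptive FIR echo-path estimate
        buf = np.zeros(taps)             # most recent reference samples
        out = np.zeros_like(mic)
        for n in range(len(mic)):
            buf = np.roll(buf, 1)
            buf[0] = ref[n]
            e = mic[n] - w @ buf                        # residual sample
            w += (mu / (buf @ buf + eps)) * e * buf     # NLMS update
            out[n] = e
        return out

When the reference is the signal driving the loudspeaker, the filter converges toward the acoustic echo path, and the residual approximates the near-end speech that is then passed on to the residual-suppression stage of S204.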
S204: Perform residual echo suppression processing on the residual signal to obtain a residual-echo-suppressed signal.
In some embodiments, the residual echo suppression method may include: performing a Fourier transform on the residual signal to obtain a frequency-domain signal, determining frequency-domain adjustment parameters corresponding to the frequency-domain signal, adjusting the frequency-domain signal according to the frequency-domain adjustment parameters, and performing an inverse Fourier transform on the adjusted frequency-domain signal to obtain the residual-echo-suppressed signal.
For example, a deep learning neural network may be preset in the processor; the processor performs a Fourier transform on the residual signal to obtain a frequency-domain signal and feeds the frequency-domain signal into the deep learning neural network, and the deep learning neural network outputs a mask in the frequency domain (the mask represents the probability of background noise in the frequency domain); the frequency-domain signal is multiplied by the mask to obtain a processed frequency-domain signal, and an inverse Fourier transform is performed on the processed frequency-domain signal to obtain the residual-echo-suppressed signal.
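As a rough sketch of this frequency-domain masking step, the code below windows each frame, applies a per-bin mask to its spectrum, and transforms back. The mask function stands in for the deep learning network; the Hann window and the demo mask rule are assumptions for illustration only.

    import numpy as np

    def mask_frames(frames, mask_fn):
        """Apply a frequency-domain mask to each frame and resynthesize."""
        out = []
        for frame in frames:
            spec = np.fft.rfft(frame * np.hanning(len(frame)))
            mask = mask_fn(np.abs(spec))          # values in [0, 1]
            out.append(np.fft.irfft(spec * mask, n=len(frame)))
        return np.array(out)

    # Placeholder "network": keep bins that are loud relative to the median.
    demo_mask = lambda mag: np.clip(mag / (np.median(mag) + 1e-8), 0.0, 1.0)

A production system would use overlap-add synthesis and a trained network in place of `demo_mask`.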
S205: Perform dereverberation processing on the residual-echo-suppressed signal to obtain a dereverberated signal.
In one example, the dereverberation method may include: building a Multichannel Linear Prediction (MCLP) model, where the MCLP model represents that the residual-echo-suppressed signal is a linear combination of the current signal (i.e., the residual-echo-suppressed signal) and the signals of several past frames; by convolving the signals of several past frames based on the MCLP model, the reverberant part of the current signal can be obtained, and the dereverberated signal can be obtained by subtracting the reverberant part from the current signal.
In another example, the dereverberation method may also include: determining the Mel Frequency Cepstrum Coefficients (MFCC) corresponding to each microphone in the microphone array, determining the MFCC differences between adjacent microphones, and constructing the dereverberated signal based on the MFCC differences.
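The linear-prediction view of dereverberation can be illustrated with a toy single-channel version of MCLP: a delayed linear predictor estimates the late reverberation from past samples and subtracts it. Real MCLP works per frequency band across several channels with iterative variance weighting; the order and delay used below are illustrative assumptions.

    import numpy as np

    def lp_dereverb(x, order=10, delay=3):
        """Subtract a delayed linear prediction of x as a reverberation
        estimate. A toy stand-in for multichannel linear prediction."""
        n = len(x)
        past = np.zeros((n, order))           # delayed past samples
        for k in range(order):
            lag = delay + k + 1
            past[lag:, k] = x[:n - lag]
        g, *_ = np.linalg.lstsq(past, x, rcond=None)   # tail predictor
        return x - past @ g                   # direct-path estimate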
S206: Process the dereverberated signal in different processing manners to obtain the call stream and the recognition stream.
In some embodiments, S206 may include: performing clarity enhancement processing on the dereverberated signal to obtain the call stream; and performing fidelity processing on the dereverberated signal to obtain the recognition stream.
It should be noted that in the schematic diagram shown in FIG. 5, the processor may include a pre-processor, a clarity enhancement processor, and a fidelity processor, where the pre-processor is configured to perform the echo cancellation processing, the residual echo suppression processing, and the dereverberation processing; the clarity enhancement processor is configured to perform clarity enhancement processing on the signal processed by the pre-processor; and the fidelity processor is configured to perform fidelity processing on the signal processed by the pre-processor.
Exemplarily, the clarity enhancement processing is described as follows:
In one example, the clarity enhancement processing may include: basic spectral subtraction. Basic spectral subtraction can be understood as: presetting a basic frequency range, and removing the components of the dereverberated signal outside the basic frequency range.
In another example, the noise reduction processing may include: Wiener filtering noise reduction. Wiener filtering noise reduction can be understood as: training a filter based on a preset mean squared error, and filtering the dereverberated signal with the filter, so that the error between the filtered dereverberated signal and the clean dereverberated signal is smaller than a preset error threshold.
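The spectral-subtraction idea fits in a few lines. In the sketch below, the per-bin noise magnitude estimate is assumed to be given (in practice it would be tracked during non-speech frames), and the spectral floor is an illustrative assumption used to limit musical noise.

    import numpy as np

    def spectral_subtract(frames, noise_mag, floor=0.05):
        """Subtract a noise magnitude estimate per bin and resynthesize
        with the noisy phase."""
        out = []
        for frame in frames:
            spec = np.fft.rfft(frame)
            mag = np.maximum(np.abs(spec) - noise_mag,
                             floor * np.abs(spec))      # keep a floor
            out.append(np.fft.irfft(mag * np.exp(1j * np.angle(spec)),
                                    n=len(frame)))
        return np.array(out)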
As can be seen from FIG. 6, in still another example, the clarity enhancement processing may include: performing beam processing, synthesis noise reduction processing, minimum beam processing, suppression noise reduction processing, voice equalization processing, and automatic gain control on the dereverberated signal in sequence to obtain the call stream.
Exemplarily, the beam processing method may include: determining multiple groups of beam signals corresponding to the dereverberated signal.
For example, a General Sidelobe Canceller (GSC) model is built, the dereverberated signal is input to the GSC model, and multiple groups of beam signals in horizontal space are output.
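A GSC combines a fixed beamformer with an adaptive blocking branch; the sketch below implements only the fixed part, forming one delay-and-sum beam per candidate direction for a uniform linear array, which yields the "multiple groups of beam signals" used by the later stages. The array spacing, steering angles, and frequency-domain delay handling are assumptions, not values from this disclosure.

    import numpy as np

    def delay_and_sum(mics, fs, spacing=0.035, c=343.0,
                      angles_deg=(0, 45, 90, 135, 180)):
        """Form one beam per steering angle; `mics` is (num_mics, n)."""
        m, n = mics.shape
        freqs = np.fft.rfftfreq(n, 1.0 / fs)
        spec = np.fft.rfft(mics, axis=1)
        beams = []
        for ang in np.deg2rad(angles_deg):
            # Per-microphone arrival delay of a plane wave from `ang`,
            # compensated in the frequency domain before averaging.
            tau = np.arange(m) * spacing * np.cos(ang) / c
            steer = np.exp(2j * np.pi * freqs[None, :] * tau[:, None])
            beams.append(np.fft.irfft((spec * steer).mean(axis=0), n=n))
        return np.array(beams)        # (num_beams, num_samples)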
Exemplarily, the synthesis noise reduction method may include: determining an expectation estimate of the dereverberated signal (i.e., the clean signal of the audio information) from the multiple groups of beam signals, and performing phase synthesis on the expectation estimate of the dereverberated signal and the multiple groups of beams to obtain noise-reduced beam signals.
For example, the magnitude spectrum of the dereverberated signal is modeled, assuming that the magnitude spectra of speech and noise follow Gaussian distributions; the stationary noise of the conference is acquired and the a posteriori signal-to-noise ratio of the dereverberated signal is estimated; an expectation estimate of the dereverberated signal is obtained according to Bayes' theorem and the a posteriori signal-to-noise ratio; and the expectation estimate of the dereverberated signal is synthesized with the phases of the multiple groups of beams to obtain the noise-reduced beam signals.
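A textbook stand-in for this expectation-estimate step is the per-bin Wiener gain derived from the a posteriori signal-to-noise ratio, sketched below. The disclosure describes a Bayesian estimate under Gaussian magnitude models; this simplification only approximates it.

    import numpy as np

    def wiener_gain(noisy_power, noise_power):
        """Per-bin gain G = max(SNR - 1, 0) / SNR, with the a posteriori
        SNR taken as noisy power over stationary-noise power."""
        snr_post = noisy_power / np.maximum(noise_power, 1e-12)
        return np.maximum(snr_post - 1.0, 0.0) / np.maximum(snr_post, 1e-12)

Multiplying each beam's spectrum by such a gain before the phase synthesis yields noise-reduced beam signals.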
Exemplarily, the minimum beam processing method may include: determining the energy ratio between the beam signal with the maximum energy and the beam signal with the minimum energy among the noise-reduced beam signals, and determining a normalized beam signal according to the energy ratio.
For example, determine the beam signal with the maximum energy among the noise-reduced beam signals and the beam signal with the minimum energy among the noise-reduced beam signals, calculate the energy ratio between the maximum energy and the minimum energy, and determine whether the energy ratio is greater than a preset ratio threshold; if so, accumulate the maximum-energy beam signal in a normalized manner to obtain the normalized beam signal.
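A minimal version of this maximum-to-minimum energy comparison might look as follows; the ratio threshold and the fallback used when the ratio is small are assumptions made here for illustration.

    import numpy as np

    def pick_dominant_beam(beams, ratio_threshold=4.0):
        """Return the strongest beam, normalized to unit RMS, when it
        clearly dominates the weakest; otherwise average all beams."""
        energy = np.sum(beams ** 2, axis=1)
        hi, lo = energy.max(), max(energy.min(), 1e-12)
        if hi / lo > ratio_threshold:
            best = beams[np.argmax(energy)]
            return best / (np.sqrt(np.mean(best ** 2)) + 1e-12)
        return beams.mean(axis=0)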
Exemplarily, the suppression noise reduction method may include: determining a mask of the normalized beam signal, and suppressing the non-stationary noise of the normalized beam signal according to the mask of the normalized beam signal to obtain a suppressed beam signal.
For example, a recurrent neural network may be preset; the normalized beam signal is output to the recurrent neural network, and the recurrent neural network outputs the mask of the normalized beam signal; the mask of the normalized beam signal (the mask represents the probability that the non-stationary noise is background noise) may be multiplied by the non-stationary noise to obtain the suppressed beam signal.
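The sketch below shows how such a mask could be applied frame by frame, with exponential smoothing as a crude stand-in for the recurrent network's temporal memory. The smoothing factor and the mask convention (1 = keep the bin) are assumptions.

    import numpy as np

    def apply_smoothed_mask(frames_spec, masks, alpha=0.7):
        """Multiply each frame's spectrum (T, bins) by a temporally
        smoothed per-bin mask; `masks` stands in for the RNN output."""
        smoothed = np.zeros_like(masks[0])
        out = np.empty_like(frames_spec)
        for t in range(len(frames_spec)):
            smoothed = alpha * smoothed + (1 - alpha) * masks[t]
            out[t] = frames_spec[t] * smoothed
        return out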
Exemplarily, the voice equalization method may include: compensating the suppressed beam signal in a preset frequency band to obtain a compensated beam signal.
For example, segmented peak filters may be preset, and the suppressed beam signal output by the noise reduction is compensated in a preset frequency band (which can be set based on requirements, historical records, experiments, and the like, and is not limited in this embodiment) to obtain the compensated beam signal, so that the perceived sound quality corresponding to the compensated beam signal is higher.
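One band of such a segmented equalizer can be realized as a standard peaking biquad (RBJ audio-EQ cookbook form), as sketched below with SciPy; the center frequency, gain, and Q are illustrative, and a full equalizer would cascade several sections over different bands.

    import numpy as np
    from scipy.signal import lfilter

    def peaking_eq(x, fs, f0=3000.0, gain_db=4.0, q=1.0):
        """Boost (or cut) `gain_db` around f0 with a peaking biquad."""
        a = 10 ** (gain_db / 40)
        w0 = 2 * np.pi * f0 / fs
        alpha = np.sin(w0) / (2 * q)
        b = np.array([1 + alpha * a, -2 * np.cos(w0), 1 - alpha * a])
        den = np.array([1 + alpha / a, -2 * np.cos(w0), 1 - alpha / a])
        return lfilter(b / den[0], den / den[0], x)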
In one example, the automatic gain control method may include: performing a Fourier transform on the compensated beam signal to obtain a power spectrum, and inputting the power spectrum into a preset convolutional neural network to obtain the speech presence probability of the current frame; if the speech presence probability of the current frame is greater than a preset probability threshold, it is determined that speech is present in the current frame, and a gradually increasing gain is applied to the compensated beam signal until the gain of the compensated beam signal is stable, thereby obtaining the call stream.
In another example, the automatic gain control method may include the following steps:
Step 1: Determine a gain weight according to the compensated beam signal and a preset equal-loudness curve.
The equal-loudness curve can be used to represent a curve, determined through experiments and the like, corresponding to a compensated beam signal with relatively high user satisfaction.
In this step, specifically, the compensated beam signal may be mapped onto the equal-loudness curve, and the gain weight is determined based on the difference between the two.
Step 2: Perform enhancement processing on the compensated beam signal according to the gain weight, as in the sketch below.
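A compact sketch of the gain-ramping behaviour common to both examples is given below; the frame energy gate stands in for the CNN speech-presence estimate, and the target level, attack rate, and gate threshold are assumptions made here.

    import numpy as np

    def simple_agc(frames, target_rms=0.1, attack=0.02, gate=1e-4):
        """Frame-wise AGC: when a frame looks like speech, ramp the
        gain gradually toward the value that reaches `target_rms`."""
        gain, out = 1.0, []
        for f in frames:
            rms = np.sqrt(np.mean(f ** 2))
            if rms ** 2 > gate:                      # crude speech gate
                desired = target_rms / (rms + 1e-12)
                gain += attack * (desired - gain)    # gradual gain ramp
            out.append(f * gain)
        return np.array(out)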
Exemplarily, the fidelity processing is described as follows:
In one example, the fidelity processing method may include: performing voiceprint recognition processing on the dereverberated signal.
For example, the smart terminal performs feature extraction on the dereverberated signal to obtain features of the dereverberated signal such as pitch, intensity, duration, and timbre, and restores the dereverberated signal based on features such as pitch, intensity, duration, and timbre to obtain the recognition stream, so that the recognition stream has relatively low distortion.
As can be seen from FIG. 6, in another example, the fidelity processing method may include: performing direction-of-arrival estimation processing and beam selection processing on the dereverberated signal.
Exemplarily, the direction-of-arrival estimation method may include: performing multiple signal classification processing on the dereverberated signal to obtain a direction spectrum, and determining the sound source direction corresponding to the dereverberated signal according to the direction spectrum.
For example, multiple signal classification processing is performed on the dereverberated signal to obtain a direction spectrum of the dereverberated signal over frequency and time; a histogram corresponding to the direction spectrum can be built from frequency and time, and the sound source direction of the dereverberated signal is determined based on the histogram.
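For reference, a narrowband MUSIC spatial spectrum for a uniform linear array can be computed as sketched below; peaks of the returned spectrum indicate candidate source directions. Evaluating a single frequency bin with fixed-length snapshots is a simplification of the frequency-time direction spectrum described above, and the array geometry and frame length are assumptions.

    import numpy as np

    def music_doa(mics, fs, freq_bin, spacing=0.035, c=343.0,
                  n_sources=1, frame=256):
        """Return (angles_deg, spatial spectrum) for one frequency bin.
        `mics` is (num_mics, num_samples), num_samples >> `frame`."""
        m, n = mics.shape
        snaps = np.array([np.fft.rfft(mics[:, i:i + frame], axis=1)[:, freq_bin]
                          for i in range(0, n - frame, frame)]).T
        r = snaps @ snaps.conj().T / snaps.shape[1]   # spatial covariance
        _, vecs = np.linalg.eigh(r)                   # ascending eigenvalues
        noise = vecs[:, :m - n_sources]               # noise subspace
        f = freq_bin * fs / frame
        angles = np.arange(0, 181)
        spec = []
        for ang in np.deg2rad(angles):
            a = np.exp(-2j * np.pi * f * np.arange(m)
                       * spacing * np.cos(ang) / c)   # steering vector
            spec.append(1.0 / np.linalg.norm(noise.conj().T @ a) ** 2)
        return angles, np.array(spec)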
Exemplarily, the beam selection method may include: determining the start point, the end point, and the steered response power of the dereverberated signal according to the sound source direction, and selecting the recognition stream from the dereverberated signal according to the start point, the end point, and the steered response power.
It should be noted that when the above methods are used to process the audio information to obtain the call stream and the recognition stream, only some of the methods may be used to process the audio information, and the order of the methods for processing the audio information may be adjusted accordingly.
For example, when performing clarity enhancement processing on the audio information, only noise reduction processing and automatic gain processing may be used; for another example, when performing fidelity processing on the audio information, only beam selection processing may be used; for still another example, when using the methods shown in FIG. 6 to perform clarity enhancement processing on the audio information, the order of the processing methods may be adjusted at will, such as performing suppression noise reduction processing first and then minimum beam processing, and so on.
S207: Send the call stream and the recognition stream.
In connection with the application scenario shown in FIG. 4 and the schematic diagram shown in FIG. 5, in one example, the smart terminal may send the call stream to the cloud server through the communication module; correspondingly, the cloud server may distribute the call stream to the first terminal device, and the first terminal device may perform voice broadcast based on the call stream. The smart terminal may send the recognition stream to the cloud server; correspondingly, the cloud server may send the recognition stream to the first terminal device, and the first terminal device may perform text display based on the recognition stream.
Alternatively, the cloud server may perform voice recognition based on the recognition stream to obtain a recognition result (i.e., a transcribed text) and send the recognition stream and/or the transcribed text to the first terminal device, and the first terminal device may display the transcribed text; of course, the cloud server may also store the recognition stream and/or the transcribed text.
As shown in FIG. 5, in some embodiments, the cloud server may also send the recognition stream and/or the transcribed text to the second terminal device; correspondingly, the second terminal device may display the transcribed text.
As shown in FIG. 5, in some embodiments, the cloud server may also send the recognition stream and/or the transcribed text to a third terminal device; correspondingly, the third terminal device may display the transcribed text. Taking the application scenario shown in FIG. 4 as an example, the third terminal device may be a terminal device that is not in the remote conference. That is, the third terminal device may be a device with a display function and can display the transcribed text, and the number of third terminal devices is not limited in this embodiment.
In another example, the smart terminal may send the call stream to the second terminal device through the communication module, and conference software runs on the second terminal device; correspondingly, the second terminal device may send the call stream to the first terminal device based on the conference software, and the first terminal device may perform voice broadcast based on the call stream; for the principle by which the smart terminal sends the recognition stream, reference may be made to the above example, and details are not repeated here.
In another example, a server may be added to the application scenario shown in FIG. 4 and the second terminal device in FIG. 4 may be removed; the smart terminal may send the call stream to the added server; correspondingly, the added server may send the call stream to the first terminal device, and the first terminal device may perform voice broadcast based on the call stream; as described in the above example, the smart terminal may send the recognition stream to the cloud server; correspondingly, the cloud server may send the recognition stream to the first terminal device, and the first terminal device may perform text display based on the recognition stream.
Similarly, the cloud server may also perform voice recognition based on the recognition stream to obtain a recognition result (i.e., a transcribed text) and send the recognition stream and/or the transcribed text to the first terminal device, and the first terminal device may display the transcribed text; of course, the cloud server may also store the recognition stream and/or the transcribed text.
In another example, the cloud server in FIG. 4 may be removed: for example, the smart terminal may send the call stream and the recognition stream to the second terminal device; correspondingly, the second terminal device may send the call stream and the recognition stream to the first terminal device; correspondingly, the first terminal device may perform voice broadcast based on the call stream, determine the transcribed text based on the recognition stream, and display the transcribed text.
In still another example, the second terminal device in FIG. 4 may be removed: for example, the smart terminal may send the call stream and the recognition stream to the cloud server; correspondingly, the cloud server may send the call stream and the recognition stream to the first terminal device; correspondingly, the first terminal device may perform voice broadcast based on the call stream, determine the transcribed text based on the recognition stream, and display the transcribed text.
In yet another example, the smart terminal may send the recognition stream to the second terminal device; correspondingly, the second terminal device may send the recognition stream to the first terminal device; correspondingly, the first terminal device may determine the transcribed text based on the recognition stream and display the transcribed text; the smart terminal may send the call stream to the cloud server; correspondingly, the cloud server may send the call stream to the first terminal device; correspondingly, the first terminal device may perform voice broadcast based on the call stream.
In a further example, the smart terminal may send the call stream and the recognition stream to the second terminal device; correspondingly, the second terminal device may send the call stream and the recognition stream to the cloud server; correspondingly, the cloud server may send the call stream and the recognition stream to the first terminal device; correspondingly, the first terminal device may perform voice broadcast based on the call stream, determine the transcribed text based on the recognition stream, and display the transcribed text.
In some embodiments, the communication module may include: a Universal Serial Bus (USB) interface, Wireless Fidelity (WiFi), and Bluetooth.
Exemplarily, the smart terminal may be connected to the second terminal device based on any one of the USB interface, WiFi, and Bluetooth; the smart terminal may be connected to the cloud server based on WiFi; and the second terminal device may be connected to the cloud server based on WiFi.
S208: Perform encoding processing and compression processing on the recognition stream, and store the processed recognition stream.
In connection with the schematic diagram shown in FIG. 5, a memory may be provided in the smart terminal, and the memory may be connected to the processor. In one example, the memory may receive the recognition stream sent by the processor and perform encoding processing, compression processing, and storage on the recognition stream in sequence; in another example, the memory may receive the recognition stream that has been encoded and compressed by the processor and store the received processed recognition stream.
It should be noted that in this embodiment, storing the processed recognition stream can avoid the problems of high cost and low reliability caused by manual recording by a conference recorder, so that the speech content in the conference is automatically recorded, which facilitates subsequent query and tracing, improves the intelligence of the conference, and improves the conference experience of the participants.
It should be understood that a conference is a process in which the participating parties communicate with each other; therefore, in some embodiments, the smart terminal may receive the call stream, or the call stream and the recognition stream, sent by the first terminal device; when the smart terminal receives the call stream sent by the first terminal device, it may perform voice playback based on the call stream, and when the smart terminal receives the recognition stream sent by the first terminal device, it may also perform text display based on the recognition stream.
For example, in connection with the application scenario shown in FIG. 4 and the schematic diagram shown in FIG. 5, the first participating user may speak with the help of the first terminal device; the first terminal device may then collect the corresponding audio information and generate a call stream; correspondingly, a loudspeaker may be provided in the smart terminal, and the smart terminal may play the call stream through the loudspeaker.
In other embodiments, the first terminal device may also generate a call stream and a recognition stream based on the above method and send both the call stream and the recognition stream to the second terminal device; the second terminal device may send the call stream to the smart terminal for voice playback by the smart terminal, and the second terminal device performs text display based on the recognition stream.
In some embodiments, the smart terminal and the first terminal device may interact directly without intermediate forwarding by the second terminal device, and a display may also be provided in the smart terminal; on the one hand, the smart terminal may perform voice playback through the loudspeaker, and on the other hand, the smart terminal may perform text display through the display.
Exemplarily, the display may represent a device for displaying text, such as a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, an Organic Light Emitting Display (OLED), and so on, which is not limited in the embodiments of the present application.
According to another aspect of the embodiments of the present disclosure, an embodiment of the present disclosure further provides a smart terminal.
As can be seen from FIG. 5, the smart terminal may include: a microphone array, a processor, and a communication module (not shown in the figure);
the microphone array is configured to collect audio information during a conference;
the processor is configured to generate a call stream and a recognition stream respectively according to the audio information, where the call stream is used for a voice call and the recognition stream is used for voice recognition; and
the communication module is configured to send the call stream and the recognition stream.
In some embodiments, the processor is configured to process the audio information in different processing manners to obtain the call stream and the recognition stream.
In some embodiments, the processor is configured to perform clarity enhancement processing on the audio information to obtain the call stream, and to perform fidelity processing on the audio information to obtain the recognition stream.
In some embodiments, the processor is configured to perform noise reduction processing and automatic gain control on the audio information to obtain the call stream.
In some embodiments, the processor is configured to perform beam selection processing on the audio information to obtain the recognition stream.
In some embodiments, the processor is configured to perform echo cancellation processing on the audio information.
As can be seen from FIG. 5, in some embodiments, the smart terminal further includes:
a loudspeaker, configured to perform voice broadcast of the call stream sent by a first terminal device participating in the conference.
As can be seen from FIG. 5, in some embodiments, the smart terminal further includes:
an analog-to-digital converter, configured to convert the signal type of the audio information to obtain converted audio information, where the signal type of the converted audio information is a digital signal.
In some embodiments, the processor is configured to perform echo cancellation processing on the converted audio information.
As can be seen from FIG. 5, in some embodiments, the smart terminal further includes:
a memory, configured to store the recognition stream.
In some embodiments, the processor is configured to perform encoding processing and compression processing on the recognition stream; and
the memory is configured to store the processed recognition stream.
In some embodiments, the transceiver includes any one of: a Universal Serial Bus interface, Wireless Fidelity, and Bluetooth.
According to another aspect of the embodiments of the present disclosure, an embodiment of the present disclosure further provides a voice processing apparatus.
Please refer to FIG. 7, which is a schematic diagram of a voice processing apparatus according to an embodiment of the present disclosure.
As shown in FIG. 7, the apparatus includes:
a collection module 11, configured to collect audio information during a conference;
a generation module 12, configured to generate a call stream and a recognition stream respectively according to the audio information, where the call stream is used for a voice call and the recognition stream is used for voice recognition; and
a sending module 13, configured to send the call stream and the recognition stream.
In some embodiments, the generation module 12 is configured to process the audio information in different processing manners to obtain the call stream and the recognition stream.
In some embodiments, the generation module 12 is configured to perform clarity enhancement processing on the audio information to obtain the call stream, and to perform fidelity processing on the audio information to obtain the recognition stream.
In some embodiments, the generation module 12 is configured to perform noise reduction processing and automatic gain control on the audio information to obtain the call stream.
In some embodiments, the generation module 12 is configured to perform beam selection processing on the audio information to obtain the recognition stream.
In some embodiments, the generation module 12 is configured to perform echo cancellation processing on the audio information.
As can be seen from FIG. 8, in some embodiments, the signal type of the audio information is an analog signal, and the apparatus further includes: a conversion module 14, configured to convert the signal type of the audio information to obtain converted audio information, where the signal type of the converted audio information is a digital signal.
As can be seen from FIG. 8, in some embodiments, the apparatus further includes: a storage module 15, configured to store the recognition stream.
In some embodiments, the storage module 15 is configured to perform encoding processing and compression processing on the recognition stream and store the processed recognition stream.
According to another aspect of the embodiments of the present disclosure, the embodiments of the present disclosure further provide an electronic device and a storage medium.
Referring to FIG. 9, it shows a schematic structural diagram of an electronic device 900 suitable for implementing the embodiments of the present disclosure, and the electronic device 900 may be a terminal device or a server. The terminal device may include, but is not limited to, mobile terminals such as smart speakers, mobile phones, notebook computers, digital broadcast receivers, Personal Digital Assistants (PDAs), tablet computers (Portable Android Devices, PADs), Portable Media Players (PMPs), and vehicle-mounted terminals (such as vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs (Televisions) and desktop computers. The electronic device shown in FIG. 9 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 9, the electronic device 900 may include a processing device (such as a central processing unit or a graphics processor) 901, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage device 908 into a random access memory (RAM) 903. The RAM 903 also stores various programs and data necessary for the operation of the electronic device 900. The processing device 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Generally, the following devices can be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 907 including, for example, a Liquid Crystal Display (LCD), a speaker, and a vibrator; storage devices 908 including, for example, magnetic tape and a hard disk; and a communication device 909. The communication device 909 may allow the electronic device 900 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 9 shows an electronic device 900 having various devices, it should be understood that it is not required to implement or have all of the illustrated devices; more or fewer devices may alternatively be implemented or provided.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 909, or installed from the storage device 908, or installed from the ROM 902. When the computer program is executed by the processing device 901, the above functions defined in the methods of the embodiments of the present disclosure are performed.
An embodiment of the present disclosure further provides a computer program which, when executed by a processor, performs the voice processing method provided by any of the above embodiments.
It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and the computer-readable signal medium may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: an electrical wire, an optical cable, radio frequency (RF), and the like, or any suitable combination of the above.
The above computer-readable medium may be included in the above electronic device, or may exist alone without being assembled into the electronic device.
The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a part of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a unit does not, in some cases, constitute a limitation of the unit itself; for example, the first obtaining unit may also be described as "a unit that obtains at least two Internet Protocol addresses".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that can be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of the machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
The above description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features with similar functions disclosed in (but not limited to) the present disclosure.
In addition, although the operations are depicted in a specific order, this should not be understood as requiring that these operations be performed in the specific order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. On the contrary, the specific features and actions described above are merely example forms of implementing the claims.

Claims (20)

  1. A voice processing method, the method comprising:
    collecting audio information during a conference;
    generating a call stream and a recognition stream respectively according to the audio information, wherein the call stream is used for a voice call and the recognition stream is used for voice recognition; and
    sending the call stream and the recognition stream.
  2. The method according to claim 1, wherein generating a call stream and a recognition stream respectively according to the audio information comprises:
    processing the audio information in different processing manners to obtain the call stream and the recognition stream.
  3. The method according to claim 2, wherein processing the audio information in different processing manners to obtain the call stream and the recognition stream comprises:
    performing clarity enhancement processing on the audio information to obtain the call stream; and
    performing fidelity processing on the audio information to obtain the recognition stream.
  4. The method according to claim 3, wherein performing clarity enhancement processing on the audio information to obtain the call stream comprises:
    performing noise reduction processing and automatic gain control on the audio information to obtain the call stream.
  5. The method according to claim 3 or 4, wherein performing fidelity processing on the audio information to obtain the recognition stream comprises:
    performing beam selection processing on the audio information to obtain the recognition stream.
  6. The method according to any one of claims 3 to 5, wherein before performing clarity enhancement processing on the audio information to obtain the call stream and performing fidelity processing on the audio information to obtain the recognition stream, the method further comprises:
    performing echo cancellation processing on the audio information.
  7. The method according to any one of claims 1 to 6, wherein the method is applied to a smart terminal, and sending the call stream and the recognition stream comprises:
    sending, by the smart terminal, the recognition stream to a cloud server, wherein the recognition stream is used for voice recognition by the cloud server, and the recognition stream and/or a recognition result of performing voice recognition on the recognition stream is sent through the cloud server to a first terminal device participating in the conference; and
    sending, by the smart terminal, the call stream to the cloud server, and distributing the call stream to the first terminal device through the cloud server.
  8. A smart terminal, comprising: a microphone array, a processor, and a communication module, wherein:
    the microphone array is configured to collect audio information during a conference;
    the processor is configured to generate a call stream and a recognition stream respectively according to the audio information, wherein the call stream is used for a voice call and the recognition stream is used for voice recognition; and
    the communication module is configured to send the call stream and the recognition stream.
  9. The smart terminal according to claim 8, wherein the processor is configured to process the audio information in different processing manners to obtain the call stream and the recognition stream.
  10. The smart terminal according to claim 9, wherein the processor is configured to perform clarity enhancement processing on the audio information to obtain the call stream, and to perform fidelity processing on the audio information to obtain the recognition stream.
  11. The smart terminal according to claim 10, wherein the processor is configured to perform noise reduction processing and automatic gain control on the audio information to obtain the call stream.
  12. The smart terminal according to claim 10 or 11, wherein the processor is configured to perform beam selection processing on the audio information to obtain the recognition stream.
  13. The smart terminal according to any one of claims 10 to 12, wherein the processor is configured to perform echo cancellation processing on the audio information.
  14. The smart terminal according to any one of claims 8 to 13, further comprising:
    a loudspeaker, configured to perform voice broadcast of a call stream sent by a first terminal device participating in the conference.
  15. A voice processing apparatus, comprising:
    a collection module, configured to collect audio information during a conference;
    a generation module, configured to generate a call stream and a recognition stream respectively according to the audio information, wherein the call stream is used for a voice call and the recognition stream is used for voice recognition; and
    a sending module, configured to send the call stream and the recognition stream.
  16. A voice processing system, comprising:
    a first terminal device and the smart terminal according to any one of claims 8 to 14; or
    a first terminal device and the voice processing apparatus according to claim 15, wherein the first terminal device is a terminal device participating in a conference.
  17. An electronic device, comprising: at least one processor and a memory, wherein:
    the memory stores computer-executable instructions; and
    the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the voice processing method according to any one of claims 1 to 7.
  18. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the voice processing method according to any one of claims 1 to 7.
  19. A computer program product, comprising a computer program carried on a computer-readable medium, wherein the computer program, when executed by a processor, performs the voice processing method according to any one of claims 1 to 7.
  20. A computer program which, when executed by a processor, performs the voice processing method according to any one of claims 1 to 7.
PCT/CN2021/134864 2020-12-29 2021-12-01 Voice processing method, apparatus, system, smart terminal, and electronic device WO2022142984A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21913725.4A EP4243019A4 (en) 2020-12-29 2021-12-01 VOICE PROCESSING METHOD, APPARATUS AND SYSTEM, INTELLIGENT TERMINAL AND ELECTRONIC DEVICE
US18/254,568 US20240105198A1 (en) 2020-12-29 2021-12-01 Voice processing method, apparatus and system, smart terminal and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011598381.XA 2020-12-29 2020-12-29 Voice processing method, apparatus, system, smart terminal, and electronic device
CN202011598381.X 2020-12-29

Publications (1)

Publication Number Publication Date
WO2022142984A1 true WO2022142984A1 (zh) 2022-07-07

Family

ID=75647014

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/134864 WO2022142984A1 (zh) Voice processing method, apparatus, system, smart terminal, and electronic device 2020-12-29 2021-12-01

Country Status (4)

Country Link
US (1) US20240105198A1 (zh)
EP (1) EP4243019A4 (zh)
CN (1) CN112750452A (zh)
WO (1) WO2022142984A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750452A (zh) Voice processing method, apparatus, system, smart terminal, and electronic device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108335697A * 2018-01-29 2018-07-27 北京百度网讯科技有限公司 Conference recording method, apparatus, device, and computer-readable medium
US10262674B1 (en) * 2018-06-26 2019-04-16 Capital One Services, Llc Doppler microphone processing for conference calls
CN110797043A * 2019-11-13 2020-02-14 苏州思必驰信息科技有限公司 Real-time conference speech transcription method and system
CN111145751A * 2019-12-31 2020-05-12 百度在线网络技术(北京)有限公司 Audio signal processing method and apparatus, and electronic device
GB2581518A (en) * 2019-02-22 2020-08-26 Software Hothouse Ltd System and method for teleconferencing exploiting participants' computing devices
CN111883123A * 2020-07-23 2020-11-03 平安科技(深圳)有限公司 AI recognition-based conference minutes generation method, apparatus, device, and medium
CN112750452A * 2020-12-29 2021-05-04 北京字节跳动网络技术有限公司 Voice processing method, apparatus, system, smart terminal, and electronic device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8983844B1 (en) * 2012-07-31 2015-03-17 Amazon Technologies, Inc. Transmission of noise parameters for improving automatic speech recognition
US9984674B2 (en) * 2015-09-14 2018-05-29 International Business Machines Corporation Cognitive computing enabled smarter conferencing
CN108597518A * 2018-03-21 2018-09-28 安徽咪鼠科技有限公司 Intelligent microphone system for conference recording based on speech recognition
US10771272B1 (en) * 2019-11-01 2020-09-08 Microsoft Technology Licensing, Llc Throttling and prioritization for multichannel audio and/or multiple data streams for conferencing

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108335697A * 2018-01-29 2018-07-27 北京百度网讯科技有限公司 Conference recording method, apparatus, device, and computer-readable medium
US10262674B1 (en) * 2018-06-26 2019-04-16 Capital One Services, Llc Doppler microphone processing for conference calls
GB2581518A (en) * 2019-02-22 2020-08-26 Software Hothouse Ltd System and method for teleconferencing exploiting participants' computing devices
CN110797043A * 2019-11-13 2020-02-14 苏州思必驰信息科技有限公司 Real-time conference speech transcription method and system
CN111145751A * 2019-12-31 2020-05-12 百度在线网络技术(北京)有限公司 Audio signal processing method and apparatus, and electronic device
CN111883123A * 2020-07-23 2020-11-03 平安科技(深圳)有限公司 AI recognition-based conference minutes generation method, apparatus, device, and medium
CN112750452A * 2020-12-29 2021-05-04 北京字节跳动网络技术有限公司 Voice processing method, apparatus, system, smart terminal, and electronic device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4243019A4 *

Also Published As

Publication number Publication date
EP4243019A1 (en) 2023-09-13
CN112750452A (zh) 2021-05-04
US20240105198A1 (en) 2024-03-28
EP4243019A4 (en) 2024-03-27

Similar Documents

Publication Publication Date Title
US11297178B2 (en) Method, apparatus, and computer-readable media utilizing residual echo estimate information to derive secondary echo reduction parameters
US8606249B1 (en) Methods and systems for enhancing audio quality during teleconferencing
CN112071328B (zh) Audio noise reduction
CN108447496B (zh) Microphone array-based speech enhancement method and apparatus
US20160308929A1 (en) Conferencing based on portable multifunction devices
CN111402915A (zh) Signal processing method, apparatus, and system
US9191519B2 (en) Echo suppressor using past echo path characteristics for updating
US9832299B2 (en) Background noise reduction in voice communication
CN105793922B (zh) Device, method, and computer-readable medium for multipath audio processing
CN111556210B (zh) Call voice processing method and apparatus, terminal device, and storage medium
WO2014161334A1 (zh) Voice call method and apparatus
WO2013121749A1 (ja) Echo cancellation device, echo cancellation method, and call device
CN109215672B (zh) Sound information processing method, apparatus, and device
TWI573133B (zh) Audio processing system and method
WO2019143429A1 (en) Noise reduction in an audio system
WO2022142984A1 (zh) Voice processing method, apparatus, system, smart terminal, and electronic device
US20120140918A1 (en) System and method for echo reduction in audio and video telecommunications over a network
US11363147B2 (en) Receive-path signal gain operations
US20200344545A1 (en) Audio signal adjustment
CN114979344A (zh) Echo cancellation method, apparatus, device, and storage medium
CN107170461B (zh) Voice signal processing method and apparatus
US9531884B2 (en) Stereo echo suppressing device, echo suppressing device, stereo echo suppressing method, and non-transitory computer-readable recording medium storing stereo echo suppressing program
Fukui et al. Acoustic echo canceller software for VoIP hands-free application on smartphone and tablet devices
CN113299310B (zh) Sound signal processing method and apparatus, electronic device, and readable storage medium
WO2023093292A1 (zh) Multi-channel echo cancellation method and related apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913725

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18254568

Country of ref document: US

ENP Entry into the national phase

Ref document number: 2021913725

Country of ref document: EP

Effective date: 20230607

NENP Non-entry into the national phase

Ref country code: DE