WO2015085959A1 - Voice processing method and apparatus - Google Patents

Voice processing method and apparatus

Info

Publication number
WO2015085959A1
WO2015085959A1 · PCT/CN2015/072099
Authority
WO
WIPO (PCT)
Prior art keywords
voice
network
quality
scene
coding
Prior art date
Application number
PCT/CN2015/072099
Other languages
English (en)
French (fr)
Inventor
刘洪
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2015085959A1
Priority to US15/174,321 (granted as US9978386B2)
Priority to US15/958,879 (granted as US10510356B2)

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis, using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G10L 19/167 Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G10L 19/18 Vocoders using multiple modes
    • G10L 19/22 Mode decision, i.e. based on audio signal content versus external parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0316 Speech enhancement by changing the amplitude
    • G10L 21/0324 Details of processing therefor
    • G10L 21/034 Automatic adjustment
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 2021/02082 Noise filtering, the noise being echo or reverberation of the speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/93 Discriminating between voiced and unvoiced parts of speech signals
    • G10L 2025/932 Decision in previous or following frames

Definitions

  • the present invention relates to the field of information technology, and in particular, to a voice processing method and apparatus.
  • DSP digital signal processing
  • if a multi-channel speech signal is acquired, a mixing process may be required before the speech encoding packet is obtained; other sound effects can also be processed before the speech encoding packet is obtained.
  • in the prior art, the voice stream is processed in a uniform manner: for scenes with high sound quality requirements, the sound quality requirement cannot be met, while scenes with low sound quality requirements waste resources by occupying more system resources than necessary.
  • a scheme that processes voice streams in a uniform manner therefore cannot adapt to the voice requirements of today's many scenarios.
  • the embodiments of the present invention provide a voice processing method and device, which are used to provide a voice processing solution based on a voice application scenario, so that the voice processing solution is adapted to the voice application scenario.
  • a voice processing method applied to a network including:
  • a voice processing device is applied to a network, including:
  • a detecting unit configured to detect a current voice application scenario in the network
  • a determining unit configured to determine a voice quality requirement of the current voice application scenario and a requirement for the network
  • a parameter configuration unit configured to configure a voice processing parameter corresponding to the voice application scenario detected by the detecting unit, based on the determined requirement for voice quality and requirements for the network;
  • the voice processing unit is configured to perform voice processing on the voice signal collected in the voice application scenario according to the voice processing parameter configured by the parameter configuration unit.
  • voice application scenarios with different voice quality requirements correspond to different voice processing parameters, so voice processing parameters suited to the current voice application scenario can be determined.
  • voice processing is then performed with the parameters adapted to the current voice application scenario, so the processing scheme matches the scenario and the technical effect of saving system resources while still satisfying the sound quality requirement is achieved.
  • FIG. 1A is a schematic flowchart of a method according to an embodiment of the present invention.
  • FIG. 1B is a schematic flowchart of a method according to an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of a method according to an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of a method according to an embodiment of the present invention.
  • FIG. 4A is a schematic structural view of an apparatus according to an embodiment of the present invention.
  • FIG. 4B is a schematic structural view of an apparatus according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
  • in this document, "voice" broadly refers both to audio containing speech produced by the vocal organs and to audio in which no one is speaking.
  • for example, the voice may be the speech of both parties to a call together with the silence between their utterances, or audio containing background sound in addition to the speech.
  • the voice may also be concert audio in which no one speaks.
  • a voice application scenario refers to a scenario in which voice is involved, such as a call, a chat, a show, and the like.
  • a voice processing method 100 is provided, which is applied to a network and includes:
  • Step S1 detecting a current voice application scenario in the network
  • Step S2 determining a voice quality requirement of the current voice application scenario and a requirement for the network
  • Step S3 configuring voice processing parameters corresponding to the voice application scenario based on the determined requirements for voice quality and requirements for the network;
  • Step S4 Perform voice processing on the voice signal collected in the voice application scenario according to the voice processing parameter.
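  • Steps S1 to S4 above can be sketched as a small pipeline. The following Python sketch is purely illustrative; the function names, scenario labels, and requirement values are hypothetical, since the patent does not specify them.

```python
def detect_scenario(network):
    # Step S1 (hypothetical detection): assume the network state
    # carries the active scenario label.
    return network.get("scenario", "normal_chat")

def determine_requirements(scenario):
    # Step S2: map the scenario to voice-quality and network requirements
    # (labels are illustrative placeholders).
    table = {
        "game":        ("low_quality",  "low_traffic"),
        "normal_chat": ("mid_quality",  "low_traffic"),
        "hq_live":     ("high_quality", "high_bandwidth"),
    }
    return table.get(scenario, ("mid_quality", "low_traffic"))

def configure_parameters(scenario, quality_req, network_req):
    # Step S3: pick parameters satisfying both requirements.
    return {"scenario": scenario, "quality": quality_req, "network": network_req}

def apply_processing(signal, params):
    # Step S4 placeholder: a real engine would run AEC/NS/AGC/encoding here.
    return {"params": params, "frames": len(signal)}

def process_voice(signal, network):
    scenario = detect_scenario(network)                                # S1
    quality_req, network_req = determine_requirements(scenario)        # S2
    params = configure_parameters(scenario, quality_req, network_req)  # S3
    return apply_processing(signal, params)                            # S4
```

The point of the sketch is only the control flow: detection feeds requirement determination, which feeds parameter configuration, which governs the actual signal processing.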
  • the voice application scenarios include: a network game scenario, a call chat scenario, a high-quality no-video network chat scenario, a high-quality network live broadcast or high-quality video network chat scenario, and a super-high-quality network live broadcast or super-high-quality video network chat scenario.
  • the requirements for the network include requirements on network speed, uplink and downlink bandwidth, network traffic, or network delay.
  • the voice processing parameters may include at least one of: the voice sampling rate, whether acoustic echo cancellation is enabled, whether noise suppression is enabled, the intensity of noise attenuation, whether automatic gain control is enabled, whether voice activity detection is enabled, the number of silence frames, the code rate, the coding complexity, whether forward error correction is enabled, the network packing mode, and the network packet transmission mode.
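  • The parameter set just enumerated can be modeled as a plain record. The field names and example values below are hypothetical illustrations; the patent only lists which parameters exist, not their encodings.

```python
from dataclasses import dataclass

@dataclass
class VoiceParams:
    # Field names are illustrative, not from the patent.
    sample_rate: int      # voice sampling rate (Hz)
    aec_on: bool          # acoustic echo cancellation enabled
    ns_on: bool           # noise suppression enabled
    ns_attenuation: str   # intensity of noise attenuation ("high"/"low")
    agc_on: bool          # automatic gain control enabled
    vad_on: bool          # voice activity detection enabled
    silence_frames: str   # number of silence frames ("high"/"low")
    bitrate: str          # code rate ("low"/"def"/"high")
    complexity: str       # coding complexity ("low"/"def"/"high")
    fec_on: bool          # forward error correction enabled
    pack_mode: int        # voice frames per network packet
    send_mode: int        # 1 = single send, 2 = dual send

# Example instance loosely following the game-scene settings described later.
game_params = VoiceParams(8000, True, True, "high", True, True,
                          "high", "low", "high", True, 2, 1)
```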
  • the embodiment of the invention provides a voice processing method, as shown in FIG. 1B, including steps 101-103.
  • the scenario detection may be an automatic detection process performed by the device, or the user may set the scenario mode manually; the manner in which the voice application scenario is obtained does not affect the implementation of the embodiments of the present invention and is not limited here.
  • the above voice application scenario refers to the current application scenario that the voice processing targets. It may be any scenario in the current computer technology field in which voice is used; those skilled in the art know that there are many such scenarios, which cannot be exhaustively listed in the embodiments of the present invention. Nevertheless, the embodiments exemplify several representative voice application scenarios.
  • the voice application scenario includes at least one of: a game scenario (Game Talk Mode, GTM, also called the chat mode of the game scene); an ordinary call chat scenario (Normal Talk Mode, NTM, also called the general call chat mode); a high-quality no-video chat scenario (High Quality Mode, HQM, also called the no-video chat mode in a high-quality scene); a high-quality live broadcast or high-quality video chat scenario (High Quality with Video Mode, HQVM, also called the high-quality live broadcast mode or the video chat mode in a high-quality scene); and a super-high-quality live broadcast or super-high-quality video chat scenario (Super Quality with Video Mode, SQVM, also called the super-high-quality live broadcast mode or the video chat mode in a super-high-quality scene).
  • different scenes place different requirements on voice quality.
  • the game scene has the lowest voice quality requirements, but it demands high CPU speed and leaves few CPU (Central Processing Unit) resources available for voice processing.
  • live-broadcast-related scenes are comparatively high-fidelity and require special sound processing; in the high-quality modes, more CPU resources and network traffic are needed to ensure that the sound quality meets user needs.
  • the speech processing parameters are the guiding parameters that determine how the speech processing is performed. Those skilled in the art know that there are many options for controlling speech processing, that the system resources occupied by each possible choice are predictable, and that the effect of each kind of processing on voice quality is likewise predictable. Based on the voice quality requirements and resource consumption constraints of the various application scenarios, those skilled in the art can determine how the voice processing parameters should be selected.
  • before voice processing is performed, the corresponding voice processing parameters need to be determined; they may be preset locally, for example in the form of a configuration table, implemented as follows:
  • voice processing parameters corresponding to each voice application scenario are preset in the voice processing device, and each voice application scenario corresponds to different voice quality;
  • configuring the voice processing parameters corresponding to the current voice application scenario then amounts to looking up the preset parameters for that scenario.
  • the voice processing parameters include at least one of: the voice sampling rate, whether acoustic echo cancellation is enabled, whether noise suppression (NS) is enabled, the intensity of noise attenuation, whether automatic gain control (AGC) is enabled, whether voice activity detection is enabled, the number of silence frames, the code rate, the coding complexity, whether forward error correction is enabled, the network packing mode, and the network packet transmission mode.
  • the choice of these parameters changes the system resources occupied by the voice processing, and the effect of each choice on voice quality is also predictable by those skilled in the art. For the application scenarios exemplified above, the embodiments of the present invention further provide a preferred setting scheme, in which scenarios with higher voice quality requirements use higher-standard voice processing parameters, as follows:
  • the voice processing parameters in the game scene are set to: acoustic echo cancellation on, noise suppression on, strong noise attenuation, automatic gain control on, voice activity detection on, a high number of silence frames, low code rate, high coding complexity, forward error correction enabled, network packing mode of one voice coding packet per two voice frames, and single network packet transmission;
  • the voice processing parameters in the call chat scene are set to: acoustic echo cancellation on, noise suppression on, low noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, low code rate, high coding complexity, forward error correction enabled, network packing mode of one voice coding packet per three voice frames, and single network packet transmission;
  • the voice processing parameters in the high-quality no-video chat scene are set to: acoustic echo cancellation on, noise suppression on, low noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, code rate at the default value, coding complexity at the default value, forward error correction enabled, network packing mode of one voice coding packet per voice frame, and single network packet transmission;
  • the voice processing parameters in the high-quality live scene or the high-quality video chat scene are set to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, code rate at the default value, coding complexity at the default value, forward error correction enabled, network packing mode of one voice coding packet per voice frame, and dual network packet transmission;
  • the voice processing parameters in the super-high-quality live scene or the super-high-quality video chat scene are set to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, high code rate, coding complexity at the default value, forward error correction disabled, network packing mode of one voice coding packet per voice frame, and single network packet transmission.
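  • The per-scenario settings above amount to a preset configuration table. A sketch of such a table follows; the dict keys and the "high" value for the game scene's silence-frame setting are hypothetical fillings of the description above, not verbatim patent content.

```python
# Illustrative mode configuration table following the settings described
# in the text. Keys: aec/ns/agc/vad = on-off switches, att = noise
# attenuation, agg = silence-frame aggressiveness, br = code rate,
# cplx = coding complexity, fec = forward error correction,
# pack = voice frames per packet, send = transmissions per packet.
MODE_TABLE = {
    "GTM":  {"aec": True,  "ns": True,  "att": "high", "agc": True,  "vad": True,
             "agg": "high", "br": "low",  "cplx": "high", "fec": True,
             "pack": 2, "send": 1},
    "NTM":  {"aec": True,  "ns": True,  "att": "low",  "agc": True,  "vad": True,
             "agg": "low",  "br": "low",  "cplx": "high", "fec": True,
             "pack": 3, "send": 1},
    "HQM":  {"aec": True,  "ns": True,  "att": "low",  "agc": True,  "vad": True,
             "agg": "low",  "br": "def",  "cplx": "def",  "fec": True,
             "pack": 1, "send": 1},
    "HQVM": {"aec": False, "ns": False, "att": None,   "agc": False, "vad": False,
             "agg": None,   "br": "def",  "cplx": "def",  "fec": True,
             "pack": 1, "send": 2},
    "SQVM": {"aec": False, "ns": False, "att": None,   "agc": False, "vad": False,
             "agg": None,   "br": "high", "cplx": "def",  "fec": False,
             "pack": 1, "send": 1},
}

def configure(mode):
    # Look up the preset voice processing parameters for a detected mode.
    return MODE_TABLE[mode]
```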
  • the voice sampling rate may further be influenced by controlling the number of channels.
  • the multi-channel configuration according to the embodiments of the present invention includes two or more channels; the specific number of channels is not limited by the embodiments of the present invention.
  • the preferred setting scheme for the voice sampling rate of various application scenarios is as follows:
  • the voice sampling rate in the game scenario and the call chat scenario is set to a mono low sampling rate with a low code rate;
  • the high-quality no-video chat scene, the high-quality live broadcast or high-quality video chat scene, and the super-high-quality live broadcast or super-high-quality video chat scene set the voice sampling rate to a multi-channel high sampling rate with a high code rate;
  • here the high code rate denotes a bit rate higher than the low code rate above.
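  • The sampling-rate split above can be sketched as a small selector. The concrete rate numbers below are hypothetical examples; the patent only distinguishes "mono low sampling rate" from "multi-channel high sampling rate".

```python
# Modes that use a mono low sampling rate and low code rate, per the text.
LOW_RATE_MODES = {"GTM", "NTM"}

def sampling_config(mode):
    # Hypothetical numeric presets illustrating the two tiers described.
    if mode in LOW_RATE_MODES:
        return {"channels": 1, "sample_rate": 8000, "bitrate": "low"}
    return {"channels": 2, "sample_rate": 48000, "bitrate": "high"}
```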
  • voice application scenarios with different voice quality requirements correspond to different voice processing parameters, so voice processing parameters suited to the current voice application scenario can be determined.
  • the collected voice is then processed with the parameters adapted to the current scenario to obtain a voice coding packet, so the processing scheme matches the voice application scenario and the technical effect of saving system resources while still satisfying the sound quality requirement is achieved.
  • the following examples are not exhaustive of the alternatives and therefore should not be construed as limiting the embodiments of the present invention. Specifically:
  • performing the above voice processing on the collected voice signal to obtain a speech coding packet includes:
  • if the background sound is currently turned on, it is determined whether the voice is input from the microphone. If it is, digital signal processing is performed, and after the digital signal processing of the microphone voice stream is completed, the stream is mixed with the background sound, speech-encoded, and packed to obtain a speech coding packet; if the voice is not input from the microphone, then after voice acquisition is completed the voice is mixed, speech-encoded, and packed to obtain a speech coding packet;
  • if the background sound is not turned on, the collected speech signal is digitally processed to obtain speech frames; each frame undergoes voice activity detection to determine whether it is a silence frame, and the non-silence frames are speech-encoded and packed to obtain a speech coding packet.
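  • The two branches just described (background sound on vs. off) can be sketched as follows. All functions here are toy placeholders standing in for the real DSP, VAD, codec, and packer, which the patent does not specify.

```python
def dsp(frame):
    # Placeholder digital signal processing (preprocessing/AEC/NS/AGC).
    return frame

def is_silence(frame):
    # Placeholder VAD: treat near-zero energy as silence.
    return sum(abs(s) for s in frame) < 1e-6

def encode(frame):
    # Placeholder speech encoder: one byte per sample for illustration.
    return bytes(len(frame))

def pack(encoded_frames):
    # Placeholder packer: concatenate encoded frames into one packet.
    return b"".join(encoded_frames)

def build_voice_packet(mic_frames, background=None):
    processed = [dsp(f) for f in mic_frames]
    if background is not None:
        # Background sound on: mix microphone audio with the background.
        frames = [[m + b for m, b in zip(mf, bf)]
                  for mf, bf in zip(processed, background)]
    else:
        # No background sound: drop silence frames found by VAD.
        frames = [f for f in processed if not is_silence(f)]
    return pack(encode(f) for f in frames)
```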
  • the digital signal processing includes at least one of voice signal pre-processing, echo cancellation, noise suppression, and automatic gain control.
  • supporting voice calls in different scenarios is a problem that voice engine designers face: game chat scenes, normal chat scenes, high-quality chat scenes, high-quality live scenes (general video mode), super-high-quality live scenes (mainly for singing), and so on. Because different scenes place different requirements on sound quality, CPU efficiency, uplink and downlink traffic, and other parameters, the speech engine algorithm must be designed to meet these different user needs.
  • existing voice call software does not distinguish these application scenarios and processes the voice stream in a uniform manner, which causes specific problems in the above scenarios. For example, a game scene does not require especially high sound quality, but it does require that gameplay not be affected; if the scene is not treated differently, voice processing incurs excessive CPU overhead.
  • FIG. 2 is only a general framework diagram; the steps differ by mode and are optional (i.e., some may not need to be performed). For the specific parameters used in each step of FIG. 2, refer to the mode configuration table, Table 1.
  • the scene detection in this step determines the voice application scenario.
  • the following five scenarios are mainly considered: a normal chat scenario, a game chat scenario, a high-quality chat scenario, a high-quality live broadcast scenario, and a super-high-quality live broadcast scenario.
  • the acquisition can be performed through a microphone.
  • This step starts the collection thread and performs voice collection according to the configuration of the engine.
  • the game chat scene uses a mono low sampling rate; several other application scenarios use a two-channel high sampling rate;
  • some application scenes have background sound, such as the accompaniment at a concert; others have no background sound, such as a plain voice chat scene.
  • This step performs the determination of the source of the speech.
  • this step needs to determine whether the voice data collection of each microphone is completed.
  • the mixing combines the background sound with the microphone sound.
  • alternatively, the mixing may be skipped at the sender and performed at the opposite end, i.e., the receiving end of the voice coding packet. For example, in a chat room scene the receiving end may have the same background sound available, in which case the mixing can be carried out at the receiving end of the speech coding packet.
  • the encoding module selects the most suitable algorithm according to different application scenarios.
  • in game mode or normal chat mode, FEC (Forward Error Correction) is generally enabled, uplink and downlink traffic is reduced, and resistance to packet loss is improved; a low-code-rate, low-complexity encoder is generally chosen in these modes, whereas the high-quality modes use a high-code-rate, high-complexity encoder.
  • for details on how to configure the voice coding parameters, refer to Table 1.
  • the voice frames are packed to obtain a voice coding packet; once packing is complete, the packet can be sent to the corresponding receiving end.
  • the voice activity detection (VAD) in step 211 determines whether the current frame is a silence frame; if it is, the frame can be discarded, and if not, the flow proceeds to the speech coding of step 208.
  • Att is short for attenuate: high means more noise attenuation, low means less noise attenuation;
  • agg is short for aggressive: high means more silence frames are generated, low means fewer silence frames;
  • br is short for bit rate: low means a low code rate, high a high code rate, and def the default code rate;
  • fec indicates the forward error correction coding method; enabling fec significantly strengthens resistance to packet loss;
  • pack mode indicates the network packing mode; currently there are three modes: three voice frames per packet, two voice frames per packet, and one voice frame per packet;
  • send mode indicates the network packet transmission mode: single means each network packet is sent only once, dual means each network packet is sent twice.
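  • The pack mode and send mode just defined can be sketched concretely. This is an illustration only; the grouping and retransmission helpers below are hypothetical, not the patent's implementation.

```python
def packetize(frames, frames_per_packet):
    # Pack mode: group encoded voice frames into network packets of
    # 1, 2, or 3 frames each (the three modes described above).
    return [frames[i:i + frames_per_packet]
            for i in range(0, len(frames), frames_per_packet)]

def send(packets, send_mode, transmit):
    # Send mode: 1 = single (each packet sent once),
    # 2 = dual (each packet sent twice).
    for pkt in packets:
        for _ in range(send_mode):
            transmit(pkt)
```

Dual send trades extra uplink traffic for loss resilience, which matches its use in the high-quality video mode described in the configuration above.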
  • the DSP algorithm flow chart includes the following steps:
  • this step pre-processes the voice signal collected by the microphone, mainly performing DC-blocking filtering and high-pass filtering to remove DC offset and ultra-low-frequency noise, so that subsequent signal processing is more stable.
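  • A standard structure for the DC-blocking filtering mentioned above is the one-pole DC blocker y[n] = x[n] - x[n-1] + r * y[n-1]. The patent does not specify the filter, so the following is only a minimal illustrative sketch.

```python
def dc_block(samples, r=0.995):
    # One-pole DC-blocking filter: removes DC offset and very low
    # frequencies while passing the speech band. r close to 1 keeps
    # the cutoff low. This is an assumed, not patent-specified, filter.
    out, x_prev, y_prev = [], 0.0, 0.0
    for x in samples:
        y = x - x_prev + r * y_prev
        out.append(y)
        x_prev, y_prev = x, y
    return out
```

Feeding a constant (pure DC) input through the filter shows the offset decaying toward zero, which is the property the pre-processing step relies on.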
  • Echo cancellation: this step performs echo cancellation on the pre-processed signal to remove the echo picked up by the microphone.
  • Noise suppression: after the echo canceller's output signal passes through noise suppression (NS), the signal-to-noise ratio and the intelligibility of the speech signal are improved.
  • the above scheme can significantly reduce CPU usage and uplink and downlink traffic in game mode, while in the high-quality modes the sound quality is significantly improved. The above therefore provides a voice processing solution based on the voice application scenario, adapting the processing to the scenario and saving system resources while the sound quality requirement is met.
  • a voice processing apparatus 400 is provided, applied to a network and including:
  • the detecting unit 4001 is configured to detect a current voice application scenario in the network.
  • a determining unit 4002 configured to determine a voice quality requirement of the current voice application scenario and a requirement for the network
  • a parameter configuration unit 4003 configured to configure a voice processing parameter corresponding to the voice application scenario detected by the detecting unit, according to the determined requirement for voice quality and a requirement for the network;
  • the voice processing unit 4004 is configured to perform voice processing on the voice signal collected in the voice application scenario according to the voice processing parameters configured by the parameter configuration unit.
  • a voice processing device as shown in FIG. 4B, includes:
  • the detecting unit 401 is configured to detect a current voice application scenario
  • the parameter configuration unit 402 is configured to configure the voice processing parameters corresponding to the voice application scenario acquired by the detecting unit 401, where application scenarios with higher voice quality requirements correspond to higher-standard voice processing parameters;
  • the voice processing unit 403 is configured to perform voice processing on the collected voice signal according to the voice processing parameters configured by the parameter configuration unit 402 to obtain a voice coding package.
  • the sending unit 404 is configured to send the voice encoding packet obtained by the voice processing unit 403 to the voice receiving end.
  • the foregoing scenario detection may be an automatic detection process performed by the device, or the user may set the scenario mode manually; the manner in which the voice application scenario is obtained does not affect the implementation of the embodiments of the present invention and is not limited here.
  • the speech processing parameters are the guiding parameters that determine how the speech processing is performed. Those skilled in the art know that there are many options for controlling speech processing, that the system resources occupied by each possible choice are predictable, and that the effect of each kind of processing on voice quality is likewise predictable. Based on the voice quality requirements and resource consumption constraints of the various application scenarios, those skilled in the art can determine how the voice processing parameters should be selected.
  • voice application scenarios with different voice quality requirements correspond to different voice processing parameters, so voice processing parameters suited to the current voice application scenario can be determined.
  • the collected voice is then processed with the parameters adapted to the current scenario to obtain a voice coding packet, so the processing scheme matches the voice application scenario and the technical effect of saving system resources while still satisfying the sound quality requirement is achieved.
  • the voice processing parameter may be preset locally, for example, in the form of a configuration table.
  • the specific implementation is as follows:
  • voice processing parameters corresponding to each voice application scenario are preset in the voice processing device, and each voice application scenario corresponds to a different voice quality;
  • the parameter configuration unit 402 is configured to configure the voice processing parameters for the current voice application scenario according to these preset per-scenario parameters.
  • the voice processing parameters configured by the parameter configuration unit 402 include at least one of: the voice sampling rate, whether acoustic echo cancellation is enabled, whether noise suppression is enabled, the intensity of noise attenuation, whether automatic gain control is enabled, whether voice activity detection is enabled, the number of silence frames, the code rate, the coding complexity, whether forward error correction is enabled, the network packing mode, and the network packet transmission mode.
  • the process of performing voice processing on the collected voice signal to obtain a voice coding packet can be chosen according to need; different control parameters imply different control flows.
  • an example of one alternative is given in this embodiment. Those skilled in the art will appreciate that the following example is not exhaustive of the alternatives and should not be construed as limiting the embodiments of the present invention. Specifically:
  • the voice processing unit 403 is configured to: if the background sound is currently turned on, determine whether the voice is input from the microphone; if it is, perform digital signal processing, then mix the processed stream with the background sound, speech-encode it, and pack it to obtain a speech coding packet; if the voice is not input from the microphone, mix, speech-encode, and pack the voice after acquisition is completed to obtain a speech coding packet; if the background sound is not turned on, digitally process the collected speech signal to obtain speech frames, run voice activity detection on each frame to determine whether it is a silence frame, and speech-encode and pack the non-silence frames to obtain a speech coding packet.
  • the foregoing voice processing unit 403, for performing the foregoing digital signal processing includes: performing at least one of a voice signal pre-processing, an echo cancellation, a noise suppression, and an automatic gain control.
  • the above-mentioned voice application scenario refers to the current application scenario that the voice processing is targeted for. Therefore, the above voice application scenario may be various application scenarios that can be applied to voice in the current computer technology field, and those skilled in the art may know that voice can be currently used. There are many application scenarios, which cannot be exhaustive for the embodiments of the present invention. However, the embodiments of the present invention are still exemplified by several representative voice application scenarios.
  • the detecting unit 401 is configured to obtain the voice application scenario, which includes at least one of: a game scene, a call chat scene, a high-quality no-video chat scene, a high-quality live scene or high-quality video chat scene, and a super-high-quality live scene or super-high-quality video chat scene.
  • in different voice application scenarios, the requirements on voice quality will be different.
  • the game scene has the lowest voice quality requirements, but places high demands on CPU (Central Processing Unit) speed, so fewer CPU resources are available for voice processing.
  • live-broadcast-related scenes are relatively high-fidelity and require special sound processing; in high-quality mode, more CPU resources and network traffic are needed to ensure that the sound quality meets user needs.
  • the parameter configuration unit 402 is configured such that the voice processing parameters in the game scene are set to: acoustic echo cancellation on, noise suppression on, strong noise attenuation, automatic gain control on, voice activity detection on, a configured number of silence frames, low code rate, high coding complexity, forward error correction enabled, the network packing mode being two voice frames packed into one voice coding packet, and the network packet transmission mode being single transmission;
  • the voice processing parameters in the call chat scene are set to: acoustic echo cancellation on, noise suppression on, low noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, low code rate, high coding complexity,
  • forward error correction enabled, the network packing mode being three voice frames packed into one voice coding packet, and the network packet transmission mode being single transmission;
  • the voice processing parameters in the high-quality no-video chat scene are set to: acoustic echo cancellation on, noise suppression on, low noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, default code rate, default coding
  • complexity, forward error correction enabled, the network packing mode being one voice frame packed into one voice coding packet, and the network packet transmission mode being single transmission;
  • the voice processing parameters in the high-quality live scene or the high-quality video chat scene are set to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, default code rate, default coding complexity, forward error correction enabled, the network packing mode being one voice frame packed into one voice coding packet, and the network packet transmission mode being dual transmission;
  • the voice processing parameters in the super-high-quality live scene or the super-high-quality video chat scene are set to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, high code rate, default coding complexity, forward error correction disabled, the network packing mode being one voice frame packed into one voice coding packet, and the network packet transmission mode being single transmission.
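Purely as an illustrative sketch, the parameter sets listed above could be kept in a local configuration table (a configuration-table form is mentioned elsewhere in this document). The key names, scene identifiers, and the "low"/"high"/"default" levels below are invented for readability and are not terms defined by the embodiment.

```python
# Hypothetical scene -> voice processing parameter table mirroring the listing above.
SCENE_PARAMS = {
    "game": {
        "aec": True, "noise_suppression": True, "noise_attenuation": "strong",
        "agc": True, "vad": True, "code_rate": "low", "complexity": "high",
        "fec": True, "frames_per_packet": 2, "send_mode": "single",
    },
    "call_chat": {
        "aec": True, "noise_suppression": True, "noise_attenuation": "low",
        "agc": True, "vad": True, "code_rate": "low", "complexity": "high",
        "fec": True, "frames_per_packet": 3, "send_mode": "single",
    },
    "hq_no_video_chat": {
        "aec": True, "noise_suppression": True, "noise_attenuation": "low",
        "agc": True, "vad": True, "code_rate": "default", "complexity": "default",
        "fec": True, "frames_per_packet": 1, "send_mode": "single",
    },
    "hq_live_or_video_chat": {
        "aec": False, "noise_suppression": False, "noise_attenuation": None,
        "agc": False, "vad": False, "code_rate": "default", "complexity": "default",
        "fec": True, "frames_per_packet": 1, "send_mode": "dual",
    },
    "super_hq_live_or_video_chat": {
        "aec": False, "noise_suppression": False, "noise_attenuation": None,
        "agc": False, "vad": False, "code_rate": "high", "complexity": "default",
        "fec": False, "frames_per_packet": 1, "send_mode": "single",
    },
}

def configure(scene):
    # Look up the preset parameters matching the detected scene.
    return SCENE_PARAMS[scene]
```

Once a table like this exists, configuring the parameters for a detected scene reduces to a single lookup.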
  • in addition to the control of the voice sampling rate, the voice quality may be further influenced by controlling the number of channels.
  • the multi-channel configuration according to the embodiment of the present invention includes two or more channels; the specific number of channels is not limited in the embodiment of the present invention.
  • the parameter configuration unit 402 is configured such that the configured voice processing parameters include: in the game scene and the call chat scene, the voice sampling rate is set to a mono low sampling rate; in the high-quality no-video chat scene, the high-quality live scene or high-quality video chat scene, and the super-high-quality live scene or super-high-quality video chat scene, the voice sampling rate is set to a multi-channel high sampling rate.
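The channel and sampling-rate preferences above could likewise be tabulated. The concrete Hz values below are hypothetical placeholders chosen only to make the "mono low" versus "multi-channel high" distinction concrete; the embodiment does not specify numeric rates.

```python
# Scene -> (number of channels, sampling rate in Hz). The 8000/44100 values
# are illustrative placeholders, not figures from the embodiment.
SAMPLING = {
    "game":                        (1, 8000),    # mono, low sampling rate
    "call_chat":                   (1, 8000),    # mono, low sampling rate
    "hq_no_video_chat":            (2, 44100),   # multi-channel, high rate
    "hq_live_or_video_chat":       (2, 44100),
    "super_hq_live_or_video_chat": (2, 44100),
}

def sampling_config(scene):
    channels, rate = SAMPLING[scene]
    return channels, rate
```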
  • the embodiment of the present invention further provides another voice processing device, as shown in FIG. 5, comprising: a receiver 501, a transmitter 502, a processor 503, and a memory 504;
  • the processor 503 is configured to detect a current voice application scenario and configure voice processing parameters corresponding to the voice application scenario, where an application scenario with higher voice quality requirements corresponds to higher-standard voice processing parameters; to perform voice processing on the collected voice signal according to the voice processing parameters to obtain a voice coding package; and to send the voice coding package to the voice receiving end.
  • the foregoing scenario detection may be an automatic detection process performed by the device, or may be performed by receiving the user's setting of a scene mode; the manner of obtaining the voice application scenario does not affect the implementation of the embodiment of the present invention and is not limited herein.
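Since the embodiment leaves the acquisition mechanism open (automatic detection or a user-set scene mode), the following trivial sketch merely shows the two paths side by side; the scene names and the foreground-application heuristic are hypothetical.

```python
def current_scene(user_mode=None, foreground_app=None):
    """Return the voice application scenario from either source."""
    if user_mode is not None:
        return user_mode                 # explicit user setting of a scene mode
    if foreground_app == "game":         # naive automatic detection (hypothetical)
        return "game"
    return "call_chat"                   # fall back to an ordinary call scene
```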
  • the voice processing parameters are the guiding standard parameters used to determine how the voice processing is performed. Those skilled in the art will know that there are many options for controlling voice processing, that the changes in system resources occupied by each possible choice are predictable, and that the resulting changes in voice quality are likewise predictable; based on the voice quality requirements and resource consumption requirements of the various application scenarios, those skilled in the art can determine how the voice processing parameters are selected.
  • voice application scenarios with different voice quality requirements correspond to different voice processing parameters, so that voice processing parameters compatible with the current voice application scenario can be determined.
  • the collected voice signal is processed using the voice processing parameters adapted to the current voice application scenario to obtain a voice coding package, so that the voice processing solution is adapted to the current voice application scenario; this achieves the technical effect of saving system resources while still satisfying the sound quality requirement.
  • the voice processing parameter may be preset locally, for example, in the form of a configuration table.
  • the specific implementation is as follows:
  • each voice processing device presets voice processing parameters corresponding to each voice application scenario, where each voice application scenario corresponds to a different voice quality; the processor 503 being configured to configure the voice processing parameters corresponding to the voice application scenario includes: configuring, according to the preset voice processing parameters corresponding to each voice application scenario, the voice processing parameters corresponding to the above voice application scenario.
  • the embodiment of the present invention also gives examples of the voice processing parameters used to make the control decision, as follows.
  • the processor 503 is configured such that the configured voice processing parameters include at least one of: voice sampling rate, whether acoustic echo cancellation is on, whether noise suppression is on, intensity of noise attenuation, whether automatic gain control is on, whether voice activity detection is on, number of silence frames, code rate, coding complexity, whether forward error correction is enabled, network packing mode, and network packet transmission mode.
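Two of the parameters just listed — the network packing mode (how many voice frames go into one voice coding packet) and the network packet transmission mode (single versus dual transmission) — can be illustrated with the following sketch. The function names are invented here; dual transmission simply sends each packet twice, trading bandwidth for loss resilience.

```python
def pack_frames(encoded_frames, frames_per_packet):
    """Group consecutive encoded frames into voice coding packages."""
    return [b"".join(encoded_frames[i:i + frames_per_packet])
            for i in range(0, len(encoded_frames), frames_per_packet)]

def send_packets(packets, send_mode, transmit):
    """Send each packet once ('single') or twice ('dual') via transmit()."""
    repeats = 2 if send_mode == "dual" else 1
    for pkt in packets:
        for _ in range(repeats):
            transmit(pkt)
```

With `frames_per_packet=2` and five frames, the last packet simply carries the one remaining frame.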
  • the processor 503 is configured to perform voice processing on the collected voice signal to obtain a voice coding package, including: if the background sound is currently enabled, determining whether the voice is input by the microphone; if the voice is input by the microphone, performing digital signal processing, and after the digital signal processing of the microphone-input voice stream is completed, mixing in the background sound, performing voice coding, and packing to obtain a voice coding package; if the voice is not input by the microphone, mixing in the background sound after the voice collection is completed, then performing voice coding and packing to obtain a voice coding package; if the background sound is not currently turned on, performing digital signal processing on the collected voice signal to obtain voice frames, performing voice activity detection on the obtained voice frames to determine whether each is a silent frame, performing voice coding on the non-silent frames, and packing them to obtain a voice coding package.
  • the foregoing processor 503 is configured to perform the foregoing digital signal processing, including at least one of: voice signal pre-processing, echo cancellation, noise suppression, and automatic gain control.
  • the above voice application scenario refers to the application scenario currently targeted by the voice processing. Accordingly, the voice application scenario may be any of the various application scenarios in the current computer technology field in which voice can be used. Those skilled in the art will appreciate that there are many such scenarios, which the embodiments of the present invention cannot list exhaustively; nevertheless, several representative voice application scenarios are given below by way of example.
  • the voice application scenario includes at least one of: a game scene, a call chat scene, a high-quality no-video chat scene, a high-quality live scene or high-quality video chat scene, and a super-high-quality live scene or super-high-quality video chat scene.
  • in different voice application scenarios, the requirements on voice quality will be different.
  • the game scene has the lowest voice quality requirements, but places high demands on CPU (Central Processing Unit) speed, so fewer CPU resources are available for voice processing.
  • live-broadcast-related scenes are relatively high-fidelity and require special sound processing; in high-quality mode, more CPU resources and network traffic are needed to ensure that the sound quality meets user needs.
  • the selection of the parameters may cause changes in the system resources occupied by the voice processing, and those skilled in the art can predict the changes in voice quality that the various voice processing options will cause; based on the various application scenarios exemplified above, the embodiment of the present invention also provides a preferred setting scheme, as follows:
  • the processor 503 is configured to set the voice processing parameters in the game scene to: acoustic echo cancellation on, noise suppression on, strong noise attenuation, automatic gain control on, voice activity detection on, a configured number of silence frames, low code rate, high coding complexity, forward error correction enabled, the network packing mode being two voice frames packed into one voice coding packet, and the network packet transmission mode being single transmission;
  • the voice processing parameters in the call chat scene are set to: acoustic echo cancellation on, noise suppression on, low noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, low code rate,
  • high coding complexity, forward error correction enabled, the network packing mode being three voice frames packed into one voice coding packet, and the network packet transmission mode being single transmission;
  • the voice processing parameters in the high-quality no-video chat scene are set to: acoustic echo cancellation on, noise suppression on, low noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, default code rate, default coding
  • complexity, forward error correction enabled, the network packing mode being one voice frame packed into one voice coding packet, and the network packet transmission mode being single transmission;
  • the voice processing parameters in the high-quality live scene or the high-quality video chat scene are set to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, default code rate, default coding complexity, forward error correction enabled, the network packing mode being one voice frame packed into one voice coding packet, and the network packet transmission mode being dual transmission;
  • the voice processing parameters in the super-high-quality live scene or the super-high-quality video chat scene are set to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, high code rate, default coding complexity, forward error correction disabled, the network packing mode being one voice frame packed into one voice coding packet, and the network packet transmission mode being single transmission.
  • in addition to the control of the voice sampling rate, the voice quality may be further influenced by controlling the number of channels.
  • the multi-channel configuration according to the embodiment of the present invention includes two or more channels; the specific number of channels is not limited in the embodiment of the present invention.
  • in a preferred setting scheme for the voice sampling rate in the different application scenarios, the processor 503 is configured to set the voice sampling rate in the game scene and the call chat scene to a mono low sampling rate, and to set the voice sampling rate in the high-quality no-video chat scene, the high-quality live scene or high-quality video chat scene, and the super-high-quality live scene or super-high-quality video chat scene to a multi-channel high sampling rate.
  • the embodiment of the present invention further provides another voice processing device.
  • Referring to FIG. 6, for convenience of description, only the parts related to the embodiment of the present invention are shown; for specific technical details not disclosed, refer to the method part of the embodiments of the present invention. The terminal may be any terminal device including a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a POS (Point of Sales) terminal, an in-vehicle computer, and the like; the following description takes the terminal being a mobile phone as an example:
  • FIG. 6 is a block diagram showing a partial structure of a mobile phone related to a terminal provided by an embodiment of the present invention.
  • the mobile phone includes components such as: a radio frequency (RF) circuit 610, a memory 620, an input unit 630, a display unit 640, a sensor 650, a voice circuit 660, a wireless fidelity (WiFi) module 670, a processor 680, and a power supply 690.
  • the RF circuit 610 can be used for receiving and transmitting signals during information transmission and reception or during a call. Specifically, after receiving downlink information from the base station, it passes the information to the processor 680 for processing, and it also sends uplink data to the base station. Generally, the RF circuit 610 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 610 can also communicate with the network and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and the like.
  • the memory 620 can be used to store software programs and modules, and the processor 680 executes various functional applications and data processing of the mobile phone by running software programs and modules stored in the memory 620.
  • the memory 620 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the mobile phone (such as voice data, a phone book, etc.).
  • the memory 620 can include a high-speed random access memory, and can also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the input unit 630 can be configured to receive input numeric or character information and to generate key signal inputs related to user settings and function controls of the handset.
  • the input unit 630 may include a touch panel 631 and other input devices 632.
  • the touch panel 631, also referred to as a touch screen, can collect touch operations by the user on or near it (such as operations performed by the user on or near the touch panel 631 using a finger, a stylus, or any other suitable object or accessory), and drive the corresponding connection device according to a preset program.
  • the touch panel 631 can include two parts: a touch detection device and a touch controller.
  • the touch detection device detects the user's touch orientation and the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to
  • the processor 680, and can receive and execute commands sent by the processor 680. In addition, the touch panel 631 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave.
  • Besides the touch panel 631, the input unit 630 may also include other input devices 632. Specifically, the other input devices 632 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons and a switch button), a trackball, a mouse, a joystick, and the like.
  • the display unit 640 can be used to display information input by the user or information provided to the user as well as various menus of the mobile phone.
  • the display unit 640 can include a display panel 641.
  • the display panel 641 can be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
  • the touch panel 631 can cover the display panel 641. When the touch panel 631 detects a touch operation on or near it, it transmits the operation to the processor 680 to determine the type of the touch event, and the processor 680 then provides a corresponding visual output on the display panel 641 according to the type of the touch event.
  • Although in FIG. 6 the touch panel 631 and the display panel 641 are two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 631 may be integrated with the display panel 641 to implement the input and output functions of the mobile phone.
  • the handset can also include at least one type of sensor 650, such as a light sensor, motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 641 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 641 and/or the backlight when the mobile phone moves to the ear.
  • the accelerometer sensor can detect the magnitude of acceleration in all directions (usually three axes), and can detect the magnitude and direction of gravity when stationary;
  • it can be used in applications that identify the posture of the mobile phone (such as landscape/portrait switching, related games, and magnetometer attitude calibration) and in vibration-recognition related functions (such as a pedometer and tapping). As for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor that may also be configured on the mobile phone, details are not described herein.
  • the voice circuit 660, the speaker 661, and the microphone 662 can provide a voice interface between the user and the mobile phone.
  • the voice circuit 660 can transmit the electrical signal, converted from the received voice data, to the speaker 661, where it is converted into a sound signal for output;
  • on the other hand, the microphone 662 converts the collected sound signal into an electrical signal, which the voice circuit 660 receives and converts into voice data; the voice data is output to the processor 680 for processing and then sent, for example, to another mobile phone via the RF circuit 610, or output to the memory 620 for further processing.
  • WiFi is a short-range wireless transmission technology
  • the mobile phone can help users to send and receive emails, browse web pages, and access streaming media through the WiFi module 670, which provides users with wireless broadband Internet access.
  • Although FIG. 6 shows the WiFi module 670, it can be understood that it is not an essential component of the mobile phone and can be omitted as needed without changing the essence of the invention.
  • the processor 680 is a control center of the mobile phone, and connects various parts of the entire mobile phone by using various interfaces and lines.
  • the processor 680 monitors the mobile phone as a whole by running or executing the software programs and/or modules stored in the memory 620 and invoking the data stored in the memory 620, thereby performing the various functions of the mobile phone and processing data.
  • the processor 680 may include one or more processing units; preferably, the processor 680 may integrate an application processor and a modem processor, where the application processor mainly processes an operating system, a user interface, an application, and the like.
  • the modem processor primarily handles wireless communications. It will be appreciated that the above described modem processor may also not be integrated into the processor 680.
  • the handset also includes a power source 690 (such as a battery) that supplies power to the various components.
  • the power source can be logically coupled to the processor 680 through a power management system, so that functions such as charging, discharging, and power consumption management are handled through the power management system.
  • the mobile phone may further include a camera, a Bluetooth module, and the like, and details are not described herein again.
  • In the embodiment of the present invention, the processor 680 included in the terminal can execute the instructions in the memory 620 and further has the following functions:
  • the processor 680 is configured to detect a current voice application scenario and configure voice processing parameters corresponding to the voice application scenario, where the higher the voice quality requirement of an application scenario, the higher the standard of the corresponding voice processing parameters;
  • the collected voice signal is subjected to voice processing according to the voice processing parameters to obtain a voice coding package, and the voice coding package is transmitted to the voice receiving end.
  • the foregoing scenario detection may be an automatic detection process performed by the device, or may be performed by receiving the user's setting of a scene mode; the manner of obtaining the voice application scenario does not affect the implementation of the embodiment of the present invention and is not limited herein.
  • the voice processing parameters are the guiding standard parameters used to determine how the voice processing is performed. Those skilled in the art will know that there are many options for controlling voice processing, that the changes in system resources occupied by each possible choice are predictable, and that the resulting changes in voice quality are likewise predictable; based on the voice quality requirements and resource consumption requirements of the various application scenarios, those skilled in the art can determine how the voice processing parameters are selected.
  • voice application scenarios with different voice quality requirements correspond to different voice processing parameters, so that voice processing parameters compatible with the current voice application scenario can be determined.
  • the collected voice signal is processed using the voice processing parameters adapted to the current voice application scenario to obtain a voice coding package, so that the voice processing solution is adapted to the current voice application scenario; this achieves the technical effect of saving system resources while still satisfying the sound quality requirement.
  • the voice processing parameter may be preset locally, for example, in the form of a configuration table.
  • the specific implementation is as follows:
  • each voice processing device presets voice processing parameters corresponding to each voice application scenario, where each voice application scenario corresponds to a different voice quality.
  • the processor 680 being configured to configure voice processing parameters corresponding to the voice application scenario includes: configuring, according to the preset voice processing parameters corresponding to each voice application scenario, the voice processing parameters corresponding to the above voice application scenario.
  • the embodiment of the present invention also gives examples of the voice processing parameters used to make the control decision, as follows.
  • the processor 680 is configured such that the configured voice processing parameters include at least one of: voice sampling rate, whether acoustic echo cancellation is on, whether noise suppression is on, intensity of noise attenuation, whether automatic gain control is on, whether voice activity detection is on, number of silence frames, code rate, coding complexity, whether forward error correction is enabled, network packing mode, and network packet transmission mode.
  • the processor 680 is configured to perform voice processing on the collected voice signal to obtain a voice coding package, including: if the background sound is currently turned on, determining whether the voice is input by the microphone; if the voice is input by the microphone, performing digital signal processing, and after the digital signal processing of the microphone-input voice stream is completed, mixing in
  • the background sound, performing voice coding, and packing to obtain a voice coding package; if the voice is not input by the microphone, mixing in the background sound after the voice collection is completed, then performing voice coding and packing to obtain a voice coding package; if the background sound is not currently enabled, performing digital signal processing on the collected voice signal to obtain voice frames, performing voice activity detection on the obtained voice frames to determine whether each is a silent frame, performing voice coding on the non-silent frames, and packing them to obtain a voice coding package.
  • the foregoing processor 680 is configured to perform the foregoing digital signal processing, including at least one of: voice signal pre-processing, echo cancellation, noise suppression, and automatic gain control.
  • the above voice application scenario refers to the application scenario currently targeted by the voice processing. Accordingly, the voice application scenario may be any of the various application scenarios in the current computer technology field in which voice can be used. Those skilled in the art will appreciate that there are many such scenarios, which the embodiments of the present invention cannot list exhaustively; nevertheless, several representative voice application scenarios are given below by way of example.
  • the voice application scenario includes at least one of: a game scene, a call chat scene, a high-quality no-video chat scene, a high-quality live scene or high-quality video chat scene, and a super-high-quality live scene or super-high-quality video chat scene.
  • in different voice application scenarios, the requirements on voice quality will be different.
  • the game scene has the lowest voice quality requirements, but places high demands on CPU (Central Processing Unit) speed, so fewer CPU resources are available for voice processing.
  • live-broadcast-related scenes are relatively high-fidelity and require special sound processing; in high-quality mode, more CPU resources and network traffic are needed to ensure that the sound quality meets user needs.
  • the selection of the parameters may cause changes in the system resources occupied by the voice processing, and those skilled in the art can predict the changes in voice quality that the various voice processing options will cause; based on the various application scenarios exemplified above, the embodiment of the present invention also provides a preferred setting scheme, as follows:
  • the processor 680 is configured to set the voice processing parameters in the game scene to: acoustic echo cancellation on, noise suppression on, strong noise attenuation, automatic gain control on, voice activity detection on, a configured number of silence frames, low code rate, high coding complexity, forward error correction enabled, the network packing mode being two voice frames packed into one voice coding packet, and the network packet transmission mode being single transmission;
  • the voice processing parameters in the call chat scene are set to: acoustic echo cancellation on, noise suppression on, low noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, low code rate, high coding complexity,
  • forward error correction enabled, the network packing mode being three voice frames packed into one voice coding packet, and the network packet transmission mode being single transmission;
  • the voice processing parameters in the high-quality no-video chat scene are set to: acoustic echo cancellation on, noise suppression on, low noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, default code rate, default coding
  • complexity, forward error correction enabled, the network packing mode being one voice frame packed into one voice coding packet, and the network packet transmission mode being single transmission;
  • the voice processing parameters in the high-quality live scene or the high-quality video chat scene are set to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, default code rate, default coding complexity, forward error correction enabled, the network packing mode being one voice frame packed into one voice coding packet, and the network packet transmission mode being dual transmission;
  • the voice processing parameters in the super-high-quality live scene or the super-high-quality video chat scene are set to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, high code rate, default coding complexity, forward error correction disabled, the network packing mode being one voice frame packed into one voice coding packet, and the network packet transmission mode being single transmission.
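Forward error correction, which the parameter lists above switch on or off per scene, can be illustrated with a deliberately minimal XOR-parity scheme: one parity packet per group lets any single lost packet in that group be rebuilt. Real systems typically use stronger codes (e.g. Reed-Solomon); this sketch only conveys the bandwidth-for-robustness trade-off, and all names are invented here.

```python
def xor_bytes(a, b):
    # XOR two equal-length byte strings.
    return bytes(x ^ y for x, y in zip(a, b))

def add_parity(packets):
    """Append one XOR parity packet covering the whole group."""
    parity = packets[0]
    for p in packets[1:]:
        parity = xor_bytes(parity, p)
    return packets + [parity]

def recover(received, lost_index):
    """Rebuild the packet at lost_index by XOR-ing all the others (incl. parity)."""
    others = [p for i, p in enumerate(received)
              if i != lost_index and p is not None]
    out = others[0]
    for p in others[1:]:
        out = xor_bytes(out, p)
    return out
```

Turning FEC off, as in the super-high-quality scene above, simply skips the parity packet and the extra bandwidth it costs.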
  • in addition to the control of the voice sampling rate, the voice quality may be further influenced by controlling the number of channels.
  • the multi-channel according to the embodiment of the present invention includes two channels or more channels, and the specific number of channels is implemented by the present invention.
  • the processor 680 is configured to set the voice sampling rate in the game scenario and the call chat scenario to be: mono. Low sampling rate; the voice sampling rate is set to: multi-channel high sampling rate in high-quality no-video chat scene, high-quality live scene or high-quality video chat scene, and super high-quality live scene or super high-quality video chat scene.
  • the included units are divided only according to functional logic, but are not limited to the foregoing division, as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are merely for convenience of distinguishing them from each other and are not intended to limit the protection scope of the present invention.
  • the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.

Abstract

A voice processing method and device. The voice processing method includes: detecting a current voice application scene in the network (S1); determining the voice-quality requirement and the network requirement of the current voice application scene (S2); configuring, based on the determined voice-quality requirement and network requirement, voice processing parameters corresponding to the voice application scene (S3); and performing voice processing, according to the voice processing parameters, on a voice signal collected in the voice application scene (S4).

Description

Voice Processing Method and Device

This application claims priority to Chinese Patent Application No. 201310661273.6, filed with the Chinese Patent Office on December 9, 2013 and entitled "Voice processing method and device", which is incorporated herein by reference in its entirety.

Technical Field

The present invention relates to the field of information technology, and in particular to a voice processing method and device.
Background

With the popularity of Internet voice calls, voice calls have gradually become an indispensable part of users' daily lives. For example, network chat rooms, in-game chat, and live network voice broadcasting all involve network voice call technology.

To implement a network voice call, the following procedure is performed on the voice collection device side:

1. Collect a voice signal. This step collects the user's voice; the voice signal may be collected by a device such as a microphone.

2. Perform digital signal processing (DSP) on the voice signal to obtain voice coding packets. This step processes the collected voice signal; the processing may include echo cancellation, noise suppression, and the like.

If multiple channels of voice signals are collected, mixing may also be needed before the voice coding packets are obtained. Other sound-effect processing may also be performed on the voice before the voice coding packets are obtained.

3. Send the obtained voice coding packets to the voice receiving end.

At present, voice streams are processed in a uniform way for all application scenes. Scenes with high sound-quality requirements cannot meet those requirements, while scenes with low sound-quality requirements waste resources by occupying more system resources than necessary. The current uniform processing scheme therefore cannot adapt to today's multi-scene voice requirements.
Summary

In view of this, embodiments of the present invention provide a voice processing method and device, which provide a voice processing scheme based on the voice application scene, so that the voice processing scheme adapts to the voice application scene.

A voice processing method, applied in a network, includes:

detecting a current voice application scene in the network;

determining the voice-quality requirement and the network requirement of the current voice application scene;

configuring, based on the determined voice-quality requirement and network requirement, voice processing parameters corresponding to the voice application scene; and

performing voice processing, according to the voice processing parameters, on a voice signal collected in the voice application scene.

A voice processing device, applied in a network, includes:

a detection unit, configured to detect a current voice application scene in the network;

a determining unit, configured to determine the voice-quality requirement and the network requirement of the current voice application scene;

a parameter configuration unit, configured to configure, based on the determined voice-quality requirement and network requirement, voice processing parameters corresponding to the voice application scene detected by the detection unit; and

a voice processing unit, configured to perform voice processing, according to the voice processing parameters configured by the parameter configuration unit, on a voice signal collected in the voice application scene.

As can be seen from the above technical solutions, voice application scenes with different voice-quality requirements correspond to different voice processing parameters, so that voice processing parameters suited to the current voice application scene can be determined. Performing voice processing with parameters suited to the current scene adapts the voice processing scheme to that scene, thereby saving system resources while meeting the sound-quality requirements.
Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

FIG. 1A is a schematic flowchart of a method according to an embodiment of the present invention;
FIG. 1B is a schematic flowchart of a method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a method according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a method according to an embodiment of the present invention;
FIG. 4A is a schematic structural diagram of a device according to an embodiment of the present invention;
FIG. 4B is a schematic structural diagram of a device according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a device according to an embodiment of the present invention; and
FIG. 6 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

As used herein, "voice" broadly refers to audio containing speech produced by the vocal organs, as well as audio in which such speech is silent. For example, voice may be the speech uttered by both parties of a call together with the silence between utterances, or audio containing speech and the background sound of the speech environment. As another example, voice may be concert audio in which speech is silent.

As used herein, a "voice application scene" is a scene involving voice, such as a call, a chat, or a performance.

Referring to FIG. 1, according to an embodiment of the present invention, a voice processing method 100 is provided. The method is applied in a network and includes:

Step S1: detecting a current voice application scene in the network;

Step S2: determining the voice-quality requirement and the network requirement of the current voice application scene;

Step S3: configuring, based on the determined voice-quality requirement and network requirement, voice processing parameters corresponding to the voice application scene; and

Step S4: performing voice processing, according to the voice processing parameters, on a voice signal collected in the voice application scene.

According to one example, the voice application scene includes: a network game scene, a call chat scene, a high-quality no-video network chat scene, a high-quality network live scene or high-quality video network chat scene, and a super-high-quality network live scene or super-high-quality video network chat scene.

According to another example, the network requirement includes a requirement on network speed, a requirement on uplink/downlink bandwidth, a requirement on network traffic, or a requirement on network latency.

According to various examples, the voice processing parameters may include at least one of: voice sampling rate, whether acoustic echo cancellation is enabled, whether noise suppression is enabled, noise attenuation strength, whether automatic gain control is enabled, whether voice activity detection is enabled, number of silence frames, coding rate, coding complexity, whether forward error correction is enabled, network packet mode, and network packet sending mode.
An embodiment of the present invention provides a voice processing method, as shown in FIG. 1B, including steps 101-103.

101: Detect the current voice application scene.

The scene detection may be an automatic detection process performed by the device, or may be a scene-mode setting made by the user. The specific way the voice application scene is obtained does not affect the implementation of this embodiment of the present invention and is therefore not limited here.

The voice application scene is the current application scene targeted by the voice processing, and may be any scene in the current computer field in which voice is used. Such scenes are too numerous to enumerate exhaustively, but several representative ones are described here by way of example. Optionally, the voice application scene includes at least one of: a game scene (Game Talk Mode, GTM, also called the chat mode of a game scene), a call chat scene (Normal Talk Mode, NTM, also called the ordinary call chat mode), a high-quality no-video chat scene (High Quality Mode, HQM, also called the no-video chat mode in a high-quality scene), a high-quality live scene or high-quality video chat scene (High Quality with Video Mode, HQVM, also called the high-quality live mode or the video chat mode in a high-quality scene), and a super-high-quality live scene or super-high-quality video chat scene (Super Quality with Video Mode, SQV, the super-high-quality live mode or the video chat mode in a super-high-quality scene).

Different voice application scenes place different requirements on voice quality. For example, the game scene has the lowest voice-quality requirement, but requires low occupation of the current network bandwidth and few CPU (Central Processing Unit) resources for voice processing. Live-broadcast-related scenes require relatively high fidelity and special sound-effect processing. High-quality modes consume more CPU resources and network traffic to ensure that the sound quality meets user needs.

102: Configure voice processing parameters corresponding to the voice application scene; the higher the voice-quality requirement of a scene, the higher the standard of its corresponding voice processing parameters.

Voice processing parameters are guiding standard parameters that determine how voice processing is performed. A person skilled in the art knows that there are many possible choices in controlling voice processing, can predict how each choice changes the system resources occupied by voice processing and the resulting voice quality, and can therefore determine how to select the voice processing parameters based on the voice-quality and resource-consumption requirements of each application scene.

After the voice application scene is obtained, the corresponding voice processing parameters need to be determined. The voice processing parameters may be preset locally, for example stored in the form of a configuration table. Optionally, voice processing parameters corresponding to each voice application scene are preset in the voice processing device, with each scene corresponding to a different voice quality; configuring the voice processing parameters corresponding to the voice application scene then includes: configuring them according to the preset voice processing parameters corresponding to each voice application scene.

Optionally, the voice processing parameters include at least one of: voice sampling rate, whether acoustic echo cancellation is enabled, whether noise suppression (NS) is enabled, noise attenuation strength, whether automatic gain control (AGC) is enabled, whether voice activity detection is enabled, number of silence frames, coding rate, coding complexity, whether forward error correction is enabled, network packet mode, and network packet sending mode.

Based on the application scenes exemplified in the foregoing embodiment, preferred settings are also given. That scenes with higher voice-quality requirements correspond to higher-standard voice processing parameters includes:
In the game scene, the voice processing parameters are set to: acoustic echo cancellation on, noise suppression on, strong noise attenuation, automatic gain control on, voice activity detection on, a high number of silence frames, a low coding rate, a high coding complexity, forward error correction on, a network packet mode of one voice coding packet per two voice frames, and single network packet transmission;

in the call chat scene, the voice processing parameters are set to: acoustic echo cancellation on, noise suppression on, weak noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, a low coding rate, a high coding complexity, forward error correction on, a network packet mode of one voice coding packet per three voice frames, and single network packet transmission;

in the high-quality no-video chat scene, the voice processing parameters are set to: acoustic echo cancellation on, noise suppression on, weak noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, a default coding rate, a default coding complexity, forward error correction on, a network packet mode of one voice coding packet per voice frame, and single network packet transmission;

in the high-quality live scene or high-quality video chat scene, the voice processing parameters are set to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, a default coding rate, a default coding complexity, forward error correction on, a network packet mode of one voice coding packet per voice frame, and double network packet transmission;

in the super-high-quality live scene or super-high-quality video chat scene, the voice processing parameters are set to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, a high coding rate, a default coding complexity, forward error correction off, a network packet mode of one voice coding packet per voice frame, and single network packet transmission.
The voice sampling rate may further be influenced by controlling the number of channels. Multi-channel as referred to in the embodiments of the present invention includes two or more channels; the specific number of channels is not limited by the embodiments. Preferred sampling-rate settings for the various application scenes are as follows. Optionally, in the game scene and the call chat scene the voice sampling rate is set to a mono low sampling rate with a low coding rate; in the high-quality no-video chat scene, the high-quality live scene or high-quality video chat scene, and the super-high-quality live scene or super-high-quality video chat scene the voice sampling rate is set to a multi-channel high sampling rate with a high coding rate, the high coding rate being a rate higher than the low coding rate.
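The per-scene settings above amount to a lookup table keyed by the detected scene. A minimal sketch of such a configuration table in Python follows; the field names, scene keys, and `configure` function are illustrative assumptions, not the patent's own code:

```python
# Illustrative per-scene voice-engine configuration table mirroring the
# settings described above. All names are hypothetical.
SCENE_CONFIG = {
    "game": dict(aec="on", ns="on", att="high", agc="on", vad="on",
                 agg="high", br="low", com="high", fec="on",
                 frames_per_packet=2, send_mode="single",
                 channels=1, sample_rate="low"),
    "call_chat": dict(aec="on", ns="on", att="low", agc="on", vad="on",
                      agg="low", br="low", com="high", fec="on",
                      frames_per_packet=3, send_mode="single",
                      channels=1, sample_rate="low"),
    "hq_no_video": dict(aec="on", ns="on", att="low", agc="on", vad="on",
                        agg="low", br="def", com="def", fec="on",
                        frames_per_packet=1, send_mode="single",
                        channels=2, sample_rate="high"),
    "hq_video_or_live": dict(aec="off", ns="off", agc="off", vad="off",
                             br="def", com="def", fec="on",
                             frames_per_packet=1, send_mode="double",
                             channels=2, sample_rate="high"),
    "shq_video_or_live": dict(aec="off", ns="off", agc="off", vad="off",
                              br="high", com="def", fec="off",
                              frames_per_packet=1, send_mode="single",
                              channels=2, sample_rate="high"),
}

def configure(scene: str) -> dict:
    """Return the preset voice processing parameters for a detected scene."""
    return SCENE_CONFIG[scene]
```

A detected scene string then selects the whole parameter set in one step, which is the "preset configuration table" approach the text describes.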
103: Perform voice processing on the collected voice signal according to the voice processing parameters to obtain voice coding packets, and send the voice coding packets to the voice receiving end.

In the above embodiment, voice application scenes with different voice-quality requirements correspond to different voice processing parameters, so that voice processing parameters suited to the current scene can be determined. Performing voice processing with such parameters to obtain voice coding packets adapts the voice processing scheme to the current scene, thereby saving system resources while meeting the sound-quality requirements.
In the process of performing voice processing on the collected voice signal to obtain voice coding packets, control parameters may be selected as needed, and different control parameters lead to different control flows. This embodiment gives one optional example; a person skilled in the art will understand that the following example is not an exhaustive list of options and should not be construed as limiting. Optionally, performing voice processing on the collected voice signal to obtain voice coding packets includes:

if background sound is currently enabled, determining whether the voice is microphone input; if so, performing digital signal processing on the microphone voice stream and, after the digital signal processing is complete, mixing it with the background sound, performing voice coding, and packaging to obtain voice coding packets; if not, performing mixing, voice coding, and packaging after the voice collection is complete to obtain voice coding packets;

if background sound is not currently enabled, performing digital signal processing on the collected voice signal to obtain voice frames, performing voice activity detection on the obtained voice frames to determine whether each is a silence frame, and performing voice coding and packaging on the non-silence frames to obtain voice coding packets.

Optionally, the digital signal processing includes at least one of: voice signal pre-processing, echo cancellation, noise suppression, and automatic gain control.
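The two branches above (background sound enabled vs. not) can be sketched in Python. The `dsp`, `mix`, `encode`, and `is_silence` placeholders stand in for the real DSP, mixing, coding, and voice-activity-detection stages and are assumptions for illustration only:

```python
def process(frames, background=None, from_mic=True):
    """Sketch of the sender-side flow: DSP, optional mixing, VAD, coding.

    `frames` is a list of captured audio frames; each stage below is a
    trivial stand-in for the corresponding module in the text.
    """
    dsp = lambda f: f                     # pre-process + AEC + NS + AGC
    mix = lambda fs, bg: fs + [bg]        # mix microphone audio with background
    encode = lambda fs: ["pkt:" + str(f) for f in fs]  # coding + packaging
    is_silence = lambda f: f == 0         # VAD: zero frames count as silence

    if background is not None:
        # Background sound enabled: DSP only applies to microphone input,
        # then mix with the background before coding and packaging.
        voiced = [dsp(f) for f in frames] if from_mic else list(frames)
        return encode(mix(voiced, background))

    # No background sound: DSP, then drop silence frames before coding.
    voiced = [dsp(f) for f in frames]
    return encode([f for f in voiced if not is_silence(f)])
```

With no background sound, silence frames are discarded before coding; with background sound, everything is mixed and coded, matching the branch structure described above.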
The following embodiments describe specific application scenes of the embodiments of the present invention in more detail.

Voice calls in different scenes are a problem voice designers must face, for example the game chat scene, the ordinary chat scene, the high-quality chat scene, the high-quality live scene (the general video mode), and the super-high-quality live scene (mainly for concerts). Because different scenes place different requirements on sound quality and effects, CPU efficiency, uplink/downlink traffic, and other metrics, the voice engine algorithm must be designed per scene to meet different user needs. Existing voice call software, however, does not distinguish these application scenes and processes the voice stream uniformly, which causes the following specific problems: 1. In the game mode, very high sound quality is not needed, but the game must not lag; without differentiated processing, excessive CPU overhead and excessive uplink/downlink traffic degrade the gaming experience. 2. In the high-quality mode, sound quality processed in the ordinary voice chat mode clearly cannot meet user needs. 3. For concerts, high-fidelity music and special sound-effect processing are needed. Based on these technical problems, the embodiments of the present invention design different voice processing methods for different application scenes, achieving the most reasonable resource cost in each scene while meeting the quality requirements.
The sender-side flow of the multi-scene voice engine technique is shown in FIG. 2. FIG. 2 is only a general framework diagram; in different modes each step is optional (i.e., it may be skipped). For the specific parameters used in the steps of FIG. 2, refer to the mode configuration table, Table 1.

201: Scene detection: determine the current voice application scene.

The scene detection in this step detects the voice application scene of the voice. In the examples of this embodiment there are mainly five scenes: the ordinary chat scene, the game chat scene, the high-quality chat scene, the high-quality live scene, and the super-high-quality live scene.

202: Voice signal collection.

For the voice processing end, collection may be performed through a microphone.

This step starts the collection thread and collects voice according to the engine configuration. The ordinary chat scene and the game chat scene use a mono low sampling rate; the other scenes use a two-channel high sampling rate.

203: Determine whether background sound is enabled; if yes, go to 204; if no, go to 210.

Some application scenes have background sound, for example the accompaniment of a concert. Others, such as the voice chat scene, do not.

204: Determine whether the signal is a microphone signal; if yes, go to 205; otherwise go to 206.

This step determines the source of the voice.

205: Perform DSP processing.

The specific DSP flow is described in more detail in a later embodiment.

206: Determine whether voice data collection is complete; if yes, go to 207; otherwise go to 202.

For a scheme that collects voice through microphones, this step determines whether the voice data collection of every microphone channel is complete.

207: Mixing.

In this step, the background sound and the microphone sound are mixed. Alternatively, mixing may be skipped here and performed at the peer end, i.e., the receiving end of the voice coding packets. For example, in a chat-room scene the background sound received by each receiving end may be the same, that is, the receiving ends also have the background sound; in that case mixing can be performed entirely at the receiving end.

208: Voice coding.

This step compresses the mixed voice signal, saving traffic. The coding module selects the most suitable algorithm for each application scene. The game mode and the ordinary chat mode generally enable FEC (Forward Error Correction), which reduces uplink/downlink traffic while improving packet-loss resilience, and generally select a low-rate, low-complexity coder; the high-quality mode selects a high-rate, high-complexity coder. For the specific voice coding parameters, refer to Table 1.

209: Package the voice frames to obtain voice coding packets. After packaging, the packets can be sent to the corresponding receiving end.

In this step, different packaging lengths and packaging modes are selected for different scenes; for the specific parameter control, refer to Table 1.

210: Perform DSP processing.

211: Perform voice activity detection (Voice Activity Detection, VAD).

212: The voice activity detection of step 211 determines whether the current frame is a silence frame. A silence frame can be discarded; otherwise, go to the voice coding of step 208.
Table 1. Voice engine algorithm configuration information for each voice application scene
Notes: 1. "on" means the module is enabled; "off" means it is disabled.

2. "att" is short for attenuate; the "high" mode means more noise attenuation, "low" means less.

3. "agg" is short for aggressive; "high" means more silence frames are produced, "low" means fewer.

4. "com" is short for complexity; "high" means high complexity, which gives better sound quality at the same coding rate.

5. "br" is short for bit rate; "low" means a low coding rate, "high" a high coding rate, and "def" the default coding rate.

6. "fec" denotes the forward error correction coding mode; with fec enabled, packet-loss resilience is significantly improved.

7. "pack mode" denotes the network packet mode; there are currently three modes: three voice frames per packet, two voice frames per packet, and one voice frame per packet.

8. "send mode" denotes the network packet sending mode; "single" means each network packet is sent once, "double" means each network packet is sent twice.
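The pack mode and send mode of notes 7 and 8 can be illustrated with a short sketch; it only shows the frame grouping and packet duplication, with no real coding or network I/O, and the function name is an assumption:

```python
def packetize(frames, frames_per_packet, send_mode="single"):
    """Group voice frames into coding packets per the pack mode, and
    duplicate each packet when the send mode is 'double'
    (each network packet is sent twice)."""
    packets = [tuple(frames[i:i + frames_per_packet])
               for i in range(0, len(frames), frames_per_packet)]
    if send_mode == "double":
        # Emit every packet twice, in order.
        packets = [p for p in packets for _ in (0, 1)]
    return packets
```

For example, the call chat scene (three frames per packet, single send) and the high-quality video scene (one frame per packet, double send) group and duplicate differently for the same frame stream.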
The DSP algorithm flow, shown in FIG. 3, includes the following steps:

301: Voice signal pre-processing. The voice signal collected by the microphone is pre-processed, mainly with DC-blocking and high-pass filtering, to remove DC offset and ultra-low-frequency noise, making subsequent signal processing more stable.

302: Echo cancellation. Echo cancellation is performed on the pre-processed signal to cancel the echo signal picked up by the microphone.

303: Noise suppression. After the echo-processor output passes through noise suppression (NS), the signal-to-noise ratio and intelligibility of the voice signal are improved.

304: Automatic gain control. After the noise-suppressed signal passes through the automatic gain control module, the voice signal becomes smoother and more even.
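The DC-blocking of step 301 can be illustrated with a standard first-order DC-blocking filter, y[n] = x[n] - x[n-1] + r*y[n-1]. This is a common textbook form, not necessarily the exact filter the patent uses:

```python
def dc_block(samples, r=0.995):
    """First-order DC-blocking high-pass filter:
    y[n] = x[n] - x[n-1] + r * y[n-1].

    Removes the DC offset (0 Hz) while passing the voice band; r close
    to 1 gives a very low cutoff frequency.
    """
    x_prev = 0.0
    y_prev = 0.0
    out = []
    for x in samples:
        y = x - x_prev + r * y_prev
        out.append(y)
        x_prev, y_prev = x, y
    return out
```

Fed a constant (pure DC) input, the output decays geometrically toward zero, which is exactly the behavior step 301 needs before echo cancellation, noise suppression, and AGC.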
Experiments show that, with the above scheme, CPU usage and uplink/downlink traffic are significantly reduced in the game mode, and sound quality is clearly improved in the super-high-quality video mode. The above therefore provides a voice processing scheme based on the voice application scene, which adapts the scheme to the scene and saves system resources while meeting the sound-quality requirements.
Referring to FIG. 4A, according to an embodiment of the present invention, a voice processing device 400 is provided, applied in a network and including:

a detection unit 4001, configured to detect a current voice application scene in the network;

a determining unit 4002, configured to determine the voice-quality requirement and the network requirement of the current voice application scene;

a parameter configuration unit 4003, configured to configure, based on the determined voice-quality requirement and network requirement, voice processing parameters corresponding to the voice application scene detected by the detection unit; and

a voice processing unit 4004, configured to perform voice processing, according to the voice processing parameters configured by the parameter configuration unit, on a voice signal collected in the voice application scene.

A voice processing device, as shown in FIG. 4B, includes:

a detection unit 401, configured to detect the current voice application scene;

a parameter configuration unit 402, configured to configure voice processing parameters corresponding to the voice application scene obtained by the detection unit 401, where scenes with higher voice-quality requirements correspond to higher-standard voice processing parameters;

a voice processing unit 403, configured to perform voice processing, according to the voice processing parameters configured by the parameter configuration unit 402, on the collected voice signal to obtain voice coding packets; and

a sending unit 404, configured to send the voice coding packets obtained by the voice processing unit 403 to the voice receiving end.
The scene detection may be an automatic detection process performed by the device, or may be receiving the user's scene-mode setting; the specific way of obtaining the voice application scene does not affect the implementation of this embodiment and is not limited here.

Voice processing parameters are guiding standard parameters that determine how voice processing is performed. A person skilled in the art knows the many possible choices in controlling voice processing, can predict how each choice changes the system resources occupied by voice processing and the resulting voice quality, and can therefore determine how to select the parameters based on each scene's voice-quality and resource-consumption requirements.

In the above embodiment, voice application scenes with different voice-quality requirements correspond to different voice processing parameters, so that parameters suited to the current scene can be determined; performing voice processing with them to obtain voice coding packets adapts the scheme to the current scene, saving system resources while meeting the sound-quality requirements.

After the voice application scene is obtained, the corresponding voice processing parameters need to be determined. They may be preset locally, for example stored in the form of a configuration table. Optionally, voice processing parameters corresponding to each voice application scene are preset in the voice processing device, with each scene corresponding to a different voice quality.

The parameter configuration unit 402 is configured to configure the voice processing parameters corresponding to the voice application scene according to the preset voice processing parameters corresponding to each voice application scene.

Optionally, the voice processing parameters configured by the parameter configuration unit 402 include at least one of: voice sampling rate, whether acoustic echo cancellation is enabled, whether noise suppression is enabled, noise attenuation strength, whether automatic gain control is enabled, whether voice activity detection is enabled, number of silence frames, coding rate, coding complexity, whether forward error correction is enabled, network packet mode, and network packet sending mode.

Optionally, the voice processing unit 403 is configured to: if background sound is currently enabled, determine whether the voice is microphone input; if so, perform digital signal processing on the microphone voice stream and, after it is complete, mix with the background sound, perform voice coding, and package to obtain voice coding packets; if not, perform mixing, voice coding, and packaging after the voice collection is complete to obtain voice coding packets; if background sound is not currently enabled, perform digital signal processing on the collected voice signal to obtain voice frames, perform voice activity detection on the obtained frames to determine whether each is a silence frame, and perform voice coding and packaging on the non-silence frames to obtain voice coding packets.

Optionally, the digital signal processing performed by the voice processing unit 403 includes at least one of: voice signal pre-processing, echo cancellation, noise suppression, and automatic gain control.

Optionally, the voice application scene obtained by the detection unit 401 includes at least one of: a game scene, a call chat scene, a high-quality no-video chat scene, a high-quality live scene or high-quality video chat scene, and a super-high-quality live scene or super-high-quality video chat scene.

Different voice application scenes place different requirements on voice quality. For example, the game scene has the lowest voice-quality requirement but requires low network-bandwidth occupation and few CPU (Central Processing Unit) resources for voice processing; live-broadcast-related scenes require relatively high fidelity and special sound-effect processing; and high-quality modes consume more CPU resources and network traffic to ensure the sound quality meets user needs.

Based on the application scenes exemplified above, preferred settings are also given. The voice processing parameters configured by the parameter configuration unit 402 include: in the game scene, acoustic echo cancellation on, noise suppression on, strong noise attenuation, automatic gain control on, voice activity detection on, a high number of silence frames, a low coding rate, a high coding complexity, forward error correction on, a network packet mode of one voice coding packet per two voice frames, and single network packet transmission;

in the call chat scene: acoustic echo cancellation on, noise suppression on, weak noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, a low coding rate, a high coding complexity, forward error correction on, a network packet mode of one voice coding packet per three voice frames, and single network packet transmission;

in the high-quality no-video chat scene: acoustic echo cancellation on, noise suppression on, weak noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, a default coding rate, a default coding complexity, forward error correction on, a network packet mode of one voice coding packet per voice frame, and single network packet transmission;

in the high-quality live scene or high-quality video chat scene: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, a default coding rate, a default coding complexity, forward error correction on, a network packet mode of one voice coding packet per voice frame, and double network packet transmission;

in the super-high-quality live scene or super-high-quality video chat scene: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, a high coding rate, a default coding complexity, forward error correction off, a network packet mode of one voice coding packet per voice frame, and single network packet transmission.

The voice sampling rate may further be influenced by controlling the number of channels; multi-channel as referred to in the embodiments includes two or more channels, the specific number not being limited. Optionally, the parameter configuration unit 402 sets the voice sampling rate to a mono low sampling rate in the game scene and the call chat scene, and to a multi-channel high sampling rate in the high-quality no-video chat scene, the high-quality live scene or high-quality video chat scene, and the super-high-quality live scene or super-high-quality video chat scene.
An embodiment of the present invention further provides another voice processing device, as shown in FIG. 5, including: a receiver 501, a transmitter 502, a processor 503, and a memory 504.

The processor 503 is configured to: detect the current voice application scene; configure voice processing parameters corresponding to the voice application scene, where scenes with higher voice-quality requirements correspond to higher-standard voice processing parameters; perform voice processing on the collected voice signal according to the voice processing parameters to obtain voice coding packets; and send the voice coding packets to the voice receiving end.

The scene detection may be an automatic detection process performed by the device, or may be receiving the user's scene-mode setting; the specific way of obtaining the voice application scene does not affect the implementation of this embodiment and is not limited here.

Voice processing parameters are guiding standard parameters that determine how voice processing is performed; a person skilled in the art can predict how each possible control choice changes the system resources occupied by voice processing and the resulting voice quality, and can therefore determine how to select the parameters based on each scene's voice-quality and resource-consumption requirements.

In the above embodiment, voice application scenes with different voice-quality requirements correspond to different voice processing parameters, so that parameters suited to the current scene can be determined; performing voice processing with them to obtain voice coding packets adapts the scheme to the current scene, saving system resources while meeting the sound-quality requirements.

After the voice application scene is obtained, the corresponding voice processing parameters need to be determined; they may be preset locally, for example stored in the form of a configuration table. Optionally, voice processing parameters corresponding to each voice application scene are preset in the voice processing device, each scene corresponding to a different voice quality; the processor 503 configuring the voice processing parameters corresponding to the voice application scene includes: configuring them according to the preset voice processing parameters corresponding to each scene.

Optionally, the voice processing parameters configured by the processor 503 include at least one of: voice sampling rate, whether acoustic echo cancellation is enabled, whether noise suppression is enabled, noise attenuation strength, whether automatic gain control is enabled, whether voice activity detection is enabled, number of silence frames, coding rate, coding complexity, whether forward error correction is enabled, network packet mode, and network packet sending mode.

Optionally, the processor 503 performing voice processing on the collected voice signal to obtain voice coding packets includes: if background sound is currently enabled, determining whether the voice is microphone input; if so, performing digital signal processing on the microphone voice stream and, after it is complete, mixing with the background sound, performing voice coding, and packaging to obtain voice coding packets; if not, performing mixing, voice coding, and packaging after the voice collection is complete; if background sound is not currently enabled, performing digital signal processing on the collected voice signal to obtain voice frames, performing voice activity detection on them to determine silence frames, and performing voice coding and packaging on the non-silence frames to obtain voice coding packets.

Optionally, the digital signal processing performed by the processor 503 includes at least one of: voice signal pre-processing, echo cancellation, noise suppression, and automatic gain control.

Optionally, the voice application scene includes at least one of: a game scene, a call chat scene, a high-quality no-video chat scene, a high-quality live scene or high-quality video chat scene, and a super-high-quality live scene or super-high-quality video chat scene. Different scenes place different requirements on voice quality: the game scene has the lowest voice-quality requirement but requires low network-bandwidth occupation and few CPU (Central Processing Unit) resources for voice processing; live-broadcast-related scenes require relatively high fidelity and special sound-effect processing; and high-quality modes consume more CPU resources and network traffic to ensure the sound quality meets user needs. Based on these scenes, preferred settings are as follows. The processor 503 sets the voice processing parameters in the game scene to: acoustic echo cancellation on, noise suppression on, strong noise attenuation, automatic gain control on, voice activity detection on, a high number of silence frames, a low coding rate, a high coding complexity, forward error correction on, a network packet mode of one voice coding packet per two voice frames, and single network packet transmission;

in the call chat scene to: acoustic echo cancellation on, noise suppression on, weak noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, a low coding rate, a high coding complexity, forward error correction on, a network packet mode of one voice coding packet per three voice frames, and single network packet transmission;

in the high-quality no-video chat scene to: acoustic echo cancellation on, noise suppression on, weak noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, a default coding rate, a default coding complexity, forward error correction on, a network packet mode of one voice coding packet per voice frame, and single network packet transmission;

in the high-quality live scene or high-quality video chat scene to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, a default coding rate, a default coding complexity, forward error correction on, a network packet mode of one voice coding packet per voice frame, and double network packet transmission;

in the super-high-quality live scene or super-high-quality video chat scene to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, a high coding rate, a default coding complexity, forward error correction off, a network packet mode of one voice coding packet per voice frame, and single network packet transmission.

The voice sampling rate may further be influenced by controlling the number of channels; multi-channel as referred to in the embodiments includes two or more channels, the specific number not being limited. Optionally, the processor 503 sets the voice sampling rate to a mono low sampling rate in the game scene and the call chat scene, and to a multi-channel high sampling rate in the high-quality no-video chat scene, the high-quality live scene or high-quality video chat scene, and the super-high-quality live scene or super-high-quality video chat scene.
An embodiment of the present invention further provides another voice processing device, as shown in FIG. 6. For ease of description, only the parts relevant to this embodiment are shown; for specific technical details not disclosed, refer to the method part of the embodiments. The terminal may be any terminal device including a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a POS (Point of Sales) terminal, or an in-vehicle computer; the mobile phone is taken as an example:

FIG. 6 is a block diagram of part of the structure of a mobile phone related to the terminal provided by this embodiment. Referring to FIG. 6, the mobile phone includes components such as a radio frequency (RF) circuit 610, a memory 620, an input unit 630, a display unit 640, a sensor 650, a voice circuit 660, a wireless fidelity (WiFi) module 670, a processor 680, and a power supply 690. A person skilled in the art will understand that the structure shown in FIG. 6 does not limit the mobile phone, which may include more or fewer components than shown, combine certain components, or arrange the components differently.

The components of the mobile phone are introduced below with reference to FIG. 6:

The RF circuit 610 may be used to receive and send signals during information transmission or a call; in particular, it receives downlink information from the base station and passes it to the processor 680 for processing, and sends uplink data to the base station. Typically, the RF circuit 610 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), and a duplexer. In addition, the RF circuit 610 may communicate with networks and other devices by wireless communication, which may use any communication standard or protocol, including but not limited to Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, and Short Messaging Service (SMS).

The memory 620 may be used to store software programs and modules. By running the software programs and modules stored in the memory 620, the processor 680 executes the various functional applications and data processing of the mobile phone. The memory 620 may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created according to the use of the mobile phone (such as voice data and a phone book). In addition, the memory 620 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.

The input unit 630 may be used to receive input digit or character information and to generate key signal input related to the user settings and function control of the mobile phone. Specifically, the input unit 630 may include a touch panel 631 and other input devices 632. The touch panel 631, also called a touch screen, may collect the user's touch operations on or near it (such as operations performed on or near the touch panel 631 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 631 may include a touch detection device and a touch controller: the touch detection device detects the user's touch position and the signal produced by the touch operation, and passes the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch-point coordinates, sends them to the processor 680, and can receive and execute commands sent by the processor 680. The touch panel 631 may be implemented as a resistive, capacitive, infrared, or surface-acoustic-wave panel, among other types. Besides the touch panel 631, the input unit 630 may also include other input devices 632, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and an on/off key), a trackball, a mouse, and a joystick.

The display unit 640 may be used to display information input by the user or provided to the user, as well as the various menus of the mobile phone. The display unit 640 may include a display panel 641, which may optionally be configured in the form of a Liquid Crystal Display (LCD) or an Organic Light-Emitting Diode (OLED) display. Further, the touch panel 631 may cover the display panel 641; when the touch panel 631 detects a touch operation on or near it, it passes the operation to the processor 680 to determine the type of the touch event, and the processor 680 then provides corresponding visual output on the display panel 641 according to the type of the touch event. Although in FIG. 6 the touch panel 631 and the display panel 641 are two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 631 and the display panel 641 may be integrated to implement those functions.

The mobile phone may further include at least one sensor 650, such as a light sensor, a motion sensor, or another sensor. Specifically, the light sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor may adjust the brightness of the display panel 641 according to the ambient light, and the proximity sensor may turn off the display panel 641 and/or the backlight when the phone is moved to the ear. As one kind of motion sensor, an accelerometer can detect the magnitude of acceleration in each direction (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that recognize phone posture (such as landscape/portrait switching, related games, and magnetometer posture calibration) and for vibration-recognition-related functions (such as a pedometer or tapping). Other sensors that may also be configured on the phone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described here.

The voice circuit 660, a loudspeaker 661, and a microphone 662 may provide a voice interface between the user and the mobile phone. The voice circuit 660 may transmit the electrical signal converted from received voice data to the loudspeaker 661, which converts it into a sound signal for output; on the other hand, the microphone 662 converts the collected sound signal into an electrical signal, which is received by the voice circuit 660 and converted into voice data; after the voice data is output to the processor 680 for processing, it is sent, for example, to another mobile phone via the RF circuit 610, or output to the memory 620 for further processing.

WiFi is a short-range wireless transmission technology. Through the WiFi module 670, the mobile phone can help the user send and receive e-mail, browse web pages, and access streaming media, providing the user with wireless broadband Internet access. Although FIG. 6 shows the WiFi module 670, it is understood that it is not an essential component of the mobile phone and may be omitted as needed without changing the essence of the invention.

The processor 680 is the control center of the mobile phone. It connects all parts of the phone through various interfaces and lines, and performs the phone's functions and processes data by running or executing the software programs and/or modules stored in the memory 620 and calling the data stored in the memory 620, thereby monitoring the phone as a whole. Optionally, the processor 680 may include one or more processing units; preferably, the processor 680 may integrate an application processor, which mainly handles the operating system, user interface, and application programs, and a modem processor, which mainly handles wireless communication. It is understood that the modem processor may also not be integrated into the processor 680.

The mobile phone further includes the power supply 690 (such as a battery) that supplies power to the components; preferably, the power supply may be logically connected to the processor 680 through a power management system, so that functions such as charging, discharging, and power-consumption management are implemented through the power management system.

Although not shown, the mobile phone may further include a camera, a Bluetooth module, and the like, which are not described here.
According to an embodiment of the present invention, the processor 680 may execute instructions in the memory 620 to perform the following operations:

detecting a current voice application scene in the network;

determining the voice-quality requirement and the network requirement of the current voice application scene;

configuring, based on the determined voice-quality requirement and network requirement, voice processing parameters corresponding to the voice application scene; and

performing voice processing, according to the voice processing parameters, on a voice signal collected in the voice application scene.

In this embodiment of the present invention, the processor 680 included in the terminal further has the following functions:

The processor 680 is configured to: detect the current voice application scene; configure voice processing parameters corresponding to the voice application scene, where scenes with higher voice-quality requirements correspond to higher-standard voice processing parameters; perform voice processing on the collected voice signal according to the voice processing parameters to obtain voice coding packets; and send the voice coding packets to the voice receiving end.

The scene detection may be an automatic detection process performed by the device, or may be receiving the user's scene-mode setting; the specific way of obtaining the voice application scene does not affect the implementation of this embodiment and is not limited here.

Voice processing parameters are guiding standard parameters that determine how voice processing is performed; a person skilled in the art can predict how each possible control choice changes the system resources occupied by voice processing and the resulting voice quality, and can therefore determine how to select the parameters based on each scene's voice-quality and resource-consumption requirements.

In the above embodiment, voice application scenes with different voice-quality requirements correspond to different voice processing parameters, so that parameters suited to the current scene can be determined; performing voice processing with them to obtain voice coding packets adapts the scheme to the current scene, saving system resources while meeting the sound-quality requirements.

After the voice application scene is obtained, the corresponding voice processing parameters need to be determined; they may be preset locally, for example stored in the form of a configuration table. Optionally, voice processing parameters corresponding to each voice application scene are preset in the voice processing device, each scene corresponding to a different voice quality; the processor 680 configuring the voice processing parameters corresponding to the voice application scene includes: configuring them according to the preset voice processing parameters corresponding to each scene.

Optionally, the voice processing parameters configured by the processor 680 include at least one of: voice sampling rate, whether acoustic echo cancellation is enabled, whether noise suppression is enabled, noise attenuation strength, whether automatic gain control is enabled, whether voice activity detection is enabled, number of silence frames, coding rate, coding complexity, whether forward error correction is enabled, network packet mode, and network packet sending mode.

Optionally, the processor 680 performing voice processing on the collected voice signal to obtain voice coding packets includes: if background sound is currently enabled, determining whether the voice is microphone input; if so, performing digital signal processing on the microphone voice stream and, after it is complete, mixing with the background sound, performing voice coding, and packaging to obtain voice coding packets; if not, performing mixing, voice coding, and packaging after the voice collection is complete; if background sound is not currently enabled, performing digital signal processing on the collected voice signal to obtain voice frames, performing voice activity detection on them to determine silence frames, and performing voice coding and packaging on the non-silence frames to obtain voice coding packets.

Optionally, the digital signal processing performed by the processor 680 includes at least one of: voice signal pre-processing, echo cancellation, noise suppression, and automatic gain control.

Optionally, the voice application scene includes at least one of: a game scene, a call chat scene, a high-quality no-video chat scene, a high-quality live scene or high-quality video chat scene, and a super-high-quality live scene or super-high-quality video chat scene. Different scenes place different requirements on voice quality: the game scene has the lowest voice-quality requirement but requires low network-bandwidth occupation and few CPU (Central Processing Unit) resources for voice processing; live-broadcast-related scenes require relatively high fidelity and special sound-effect processing; and high-quality modes consume more CPU resources and network traffic to ensure the sound quality meets user needs. Based on these scenes, preferred settings are as follows. The processor 680 sets the voice processing parameters in the game scene to: acoustic echo cancellation on, noise suppression on, strong noise attenuation, automatic gain control on, voice activity detection on, a high number of silence frames, a low coding rate, a high coding complexity, forward error correction on, a network packet mode of one voice coding packet per two voice frames, and single network packet transmission;

in the call chat scene to: acoustic echo cancellation on, noise suppression on, weak noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, a low coding rate, a high coding complexity, forward error correction on, a network packet mode of one voice coding packet per three voice frames, and single network packet transmission;

in the high-quality no-video chat scene to: acoustic echo cancellation on, noise suppression on, weak noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, a default coding rate, a default coding complexity, forward error correction on, a network packet mode of one voice coding packet per voice frame, and single network packet transmission;

in the high-quality live scene or high-quality video chat scene to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, a default coding rate, a default coding complexity, forward error correction on, a network packet mode of one voice coding packet per voice frame, and double network packet transmission;

in the super-high-quality live scene or super-high-quality video chat scene to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, a high coding rate, a default coding complexity, forward error correction off, a network packet mode of one voice coding packet per voice frame, and single network packet transmission.

The voice sampling rate may further be influenced by controlling the number of channels; multi-channel as referred to in the embodiments includes two or more channels, the specific number not being limited. Optionally, the processor 680 sets the voice sampling rate to a mono low sampling rate in the game scene and the call chat scene, and to a multi-channel high sampling rate in the high-quality no-video chat scene, the high-quality live scene or high-quality video chat scene, and the super-high-quality live scene or super-high-quality video chat scene.
It should be noted that, in the above device embodiments, the included units are divided only according to functional logic, but are not limited to that division, as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are merely for ease of distinguishing them from each other and are not intended to limit the protection scope of the present invention.

In addition, a person of ordinary skill in the art will understand that all or part of the steps in the above method embodiments may be completed by a program instructing the related hardware; the corresponding program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

The above are only preferred specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the embodiments of the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (19)

  1. A voice processing method, applied in a network, comprising:
    detecting a current voice application scene in the network;
    determining the voice-quality requirement and the network requirement of the current voice application scene;
    configuring, based on the determined voice-quality requirement and network requirement, voice processing parameters corresponding to the voice application scene; and
    performing voice processing, according to the voice processing parameters, on a voice signal collected in the voice application scene.
  2. The method according to claim 1, wherein the voice application scene comprises: a network game scene, a call chat scene, a high-quality no-video network chat scene, a high-quality network live scene or high-quality video network chat scene, and a super-high-quality network live scene or super-high-quality video network chat scene.
  3. The method according to claim 1, wherein the network requirement comprises a requirement on network speed, a requirement on uplink/downlink bandwidth, a requirement on network traffic, or a requirement on network latency.
  4. The method according to claim 1, further comprising:
    presetting voice processing parameters corresponding to each voice application scene; and
    configuring the voice processing parameters corresponding to the voice application scene according to the preset voice processing parameters corresponding to each voice application scene.
  5. The method according to claim 1 or 4, wherein the voice processing parameters comprise at least one of:
    voice sampling rate, whether acoustic echo cancellation is enabled, whether noise suppression is enabled, noise attenuation strength, whether automatic gain control is enabled, whether voice activity detection is enabled, number of silence frames, coding rate, coding complexity, whether forward error correction is enabled, network packet mode, and network packet sending mode.
  6. The method according to claim 5, wherein performing voice processing on the collected voice signal comprises:
    if background sound is currently enabled, determining whether the voice is microphone input; if so, performing digital signal processing on the microphone voice stream and, after the digital signal processing is complete, mixing with the background sound, performing voice coding, and packaging to obtain voice coding packets; if not, performing mixing, voice coding, and packaging after the voice collection is complete to obtain voice coding packets;
    if background sound is not currently enabled, performing digital signal processing on the collected voice signal to obtain voice frames, performing voice activity detection on the obtained voice frames to determine whether each is a silence frame, and performing voice coding and packaging on the non-silence frames to obtain voice coding packets.
  7. The method according to claim 6, wherein the digital signal processing comprises at least one of:
    voice signal pre-processing, echo cancellation, noise suppression, and automatic gain control.
  8. The method according to claim 5, wherein:
    in the game scene, the voice processing parameters are set to: acoustic echo cancellation on, noise suppression on, strong noise attenuation, automatic gain control on, voice activity detection on, a high number of silence frames, a low coding rate, a high coding complexity, forward error correction on, a network packet mode of one voice coding packet per two voice frames, and single network packet transmission;
    in the call chat scene, the voice processing parameters are set to: acoustic echo cancellation on, noise suppression on, weak noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, a low coding rate, a high coding complexity, forward error correction on, a network packet mode of one voice coding packet per three voice frames, and single network packet transmission;
    in the high-quality no-video chat scene, the voice processing parameters are set to: acoustic echo cancellation on, noise suppression on, weak noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, a default coding rate, a default coding complexity, forward error correction on, a network packet mode of one voice coding packet per voice frame, and single network packet transmission;
    in the high-quality live scene or high-quality video chat scene, the voice processing parameters are set to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, a default coding rate, a default coding complexity, forward error correction on, a network packet mode of one voice coding packet per voice frame, and double network packet transmission;
    in the super-high-quality live scene or super-high-quality video chat scene, the voice processing parameters are set to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, a high coding rate, a default coding complexity, forward error correction off, a network packet mode of one voice coding packet per voice frame, and single network packet transmission.
  9. The method according to claim 8, wherein:
    in the game scene and the call chat scene, the voice sampling rate is set to a mono low sampling rate with a low coding rate; and
    in the high-quality no-video chat scene, the high-quality live scene or high-quality video chat scene, and the super-high-quality live scene or super-high-quality video chat scene, the voice sampling rate is set to a multi-channel high sampling rate with a high coding rate, the high coding rate being a rate higher than the low coding rate.
  10. A voice processing device, applied in a network, comprising:
    a detection unit, configured to detect a current voice application scene in the network;
    a determining unit, configured to determine the voice-quality requirement and the network requirement of the current voice application scene;
    a parameter configuration unit, configured to configure, based on the determined voice-quality requirement and network requirement, voice processing parameters corresponding to the voice application scene detected by the detection unit; and
    a voice processing unit, configured to perform voice processing, according to the voice processing parameters configured by the parameter configuration unit, on a voice signal collected in the voice application scene.
  11. The device according to claim 10, wherein the voice application scene comprises: a network game scene, a call chat scene, a high-quality no-video network chat scene, a high-quality network live scene or high-quality video network chat scene, and a super-high-quality network live scene or super-high-quality video network chat scene.
  12. The device according to claim 10, wherein the network requirement comprises a requirement on network speed, a requirement on uplink/downlink bandwidth, a requirement on network traffic, or a requirement on network latency.
  13. The device according to claim 10, wherein:
    the parameter configuration unit is configured to configure the voice processing parameters corresponding to the voice application scene according to preset voice processing parameters corresponding to each voice application scene.
  14. The device according to claim 10 or 13, wherein:
    the voice processing parameters configured by the parameter configuration unit comprise at least one of: voice sampling rate, whether acoustic echo cancellation is enabled, whether noise suppression is enabled, noise attenuation strength, whether automatic gain control is enabled, whether voice activity detection is enabled, number of silence frames, coding rate, coding complexity, whether forward error correction is enabled, network packet mode, and network packet sending mode.
  15. The device according to claim 14, wherein:
    the voice processing unit is configured to: if background sound is currently enabled, determine whether the voice is microphone input; if so, perform digital signal processing on the microphone voice stream and, after the digital signal processing is complete, mix with the background sound, perform voice coding, and package to obtain voice coding packets; if not, perform mixing, voice coding, and packaging after the voice collection is complete to obtain voice coding packets; if background sound is not currently enabled, perform digital signal processing on the collected voice signal to obtain voice frames, perform voice activity detection on the obtained voice frames to determine whether each is a silence frame, and perform voice coding and packaging on the non-silence frames to obtain voice coding packets.
  16. The device according to claim 15, wherein:
    the digital signal processing performed by the voice processing unit comprises at least one of: voice signal pre-processing, echo cancellation, noise suppression, and automatic gain control.
  17. The device according to claim 11, wherein:
    the parameter configuration unit is configured to:
    set the voice processing parameters in the game scene to: acoustic echo cancellation on, noise suppression on, strong noise attenuation, automatic gain control on, voice activity detection on, a high number of silence frames, a low coding rate, a high coding complexity, forward error correction on, a network packet mode of one voice coding packet per two voice frames, and single network packet transmission;
    set the voice processing parameters in the call chat scene to: acoustic echo cancellation on, noise suppression on, weak noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, a low coding rate, a high coding complexity, forward error correction on, a network packet mode of one voice coding packet per three voice frames, and single network packet transmission;
    set the voice processing parameters in the high-quality no-video chat scene to: acoustic echo cancellation on, noise suppression on, weak noise attenuation, automatic gain control on, voice activity detection on, a low number of silence frames, a default coding rate, a default coding complexity, forward error correction on, a network packet mode of one voice coding packet per voice frame, and single network packet transmission;
    set the voice processing parameters in the high-quality live scene or high-quality video chat scene to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, a default coding rate, a default coding complexity, forward error correction on, a network packet mode of one voice coding packet per voice frame, and double network packet transmission; and
    set the voice processing parameters in the super-high-quality live scene or super-high-quality video chat scene to: acoustic echo cancellation off, noise suppression off, automatic gain control off, voice activity detection off, a high coding rate, a default coding complexity, forward error correction off, a network packet mode of one voice coding packet per voice frame, and single network packet transmission.
  18. The device according to claim 17, wherein:
    the voice processing parameters configured by the parameter configuration unit comprise: in the game scene and the call chat scene, a voice sampling rate set to a mono low sampling rate with a low coding rate; in the high-quality no-video chat scene, the high-quality live scene or high-quality video chat scene, and the super-high-quality live scene or super-high-quality video chat scene, a voice sampling rate set to a multi-channel high sampling rate with a high coding rate, the high coding rate being a rate higher than the low coding rate.
  19. A non-transitory computer-readable storage medium storing computer-executable instructions that, when run in a computer, perform the following steps:
    detecting a current voice application scene in a network;
    determining the voice-quality requirement and the network requirement of the current voice application scene;
    configuring, based on the determined voice-quality requirement and network requirement, voice processing parameters corresponding to the voice application scene; and
    performing voice processing, according to the voice processing parameters, on a voice signal collected in the voice application scene.
PCT/CN2015/072099 2013-12-09 2015-02-02 Voice processing method and device WO2015085959A1

Priority Applications (2)

Application Number Priority Date Filing Date Title
US15/174,321 US9978386B2 (en) 2013-12-09 2016-06-06 Voice processing method and device
US15/958,879 US10510356B2 (en) 2013-12-09 2018-04-20 Voice processing method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310661273.6 2013-12-09
CN201310661273.6A CN103617797A (zh) 2013-12-09 2013-12-09 一种语音处理方法,及装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/174,321 Continuation US9978386B2 (en) 2013-12-09 2016-06-06 Voice processing method and device

Publications (1)

Publication Number Publication Date
WO2015085959A1 true WO2015085959A1 (zh) 2015-06-18

Family

ID=50168500

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/072099 WO2015085959A1 (zh) 2013-12-09 2015-02-02 语音处理方法及装置

Country Status (3)

Country Link
US (2) US9978386B2 (zh)
CN (1) CN103617797A (zh)
WO (1) WO2015085959A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106254677A (zh) * 2016-09-19 2016-12-21 深圳市金立通信设备有限公司 一种情景模式设置方法及终端
US20220059101A1 (en) * 2019-11-27 2022-02-24 Tencent Technology (Shenzhen) Company Limited Voice processing method and apparatus, computer-readable storage medium, and computer device

Families Citing this family (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617797A (zh) 2013-12-09 2014-03-05 腾讯科技(深圳)有限公司 一种语音处理方法,及装置
CN105280188B (zh) * 2014-06-30 2019-06-28 美的集团股份有限公司 基于终端运行环境的音频信号编码方法和系统
CN105609102B (zh) * 2014-11-21 2021-03-16 中兴通讯股份有限公司 一种语音引擎参数配置方法和装置
CN104967960B (zh) * 2015-03-25 2018-03-20 腾讯科技(深圳)有限公司 语音数据处理方法、游戏直播中的语音数据处理方法和系统
CN104867359B (zh) * 2015-06-02 2017-04-19 阔地教育科技有限公司 一种直录播系统中的音频处理方法及系统
US10284703B1 (en) * 2015-08-05 2019-05-07 Netabla, Inc. Portable full duplex intercom system with bluetooth protocol and method of using the same
CN105141730B (zh) * 2015-08-27 2017-11-14 腾讯科技(深圳)有限公司 音量控制方法及装置
CN106506437B (zh) * 2015-09-07 2021-03-16 腾讯科技(深圳)有限公司 一种音频数据处理方法,及设备
CN106878533B (zh) * 2015-12-10 2021-03-19 北京奇虎科技有限公司 一种移动终端的通信方法和装置
CN105682209A (zh) * 2016-04-05 2016-06-15 广东欧珀移动通信有限公司 一种降低移动终端通话功耗的方法及移动终端
CN105959481B (zh) * 2016-06-16 2019-04-30 Oppo广东移动通信有限公司 一种场景音效的控制方法、及电子设备
CN106126176B (zh) * 2016-06-16 2018-05-29 广东欧珀移动通信有限公司 一种音效配置方法及移动终端
US10187504B1 (en) * 2016-09-23 2019-01-22 Apple Inc. Echo control based on state of a device
CN107846605B (zh) * 2017-01-19 2020-09-04 湖南快乐阳光互动娱乐传媒有限公司 主播端流媒体数据生成系统及方法、网络直播系统及方法
CN107122159B (zh) * 2017-04-20 2020-04-17 维沃移动通信有限公司 一种在线音频的品质切换方法及移动终端
CN107358956B (zh) * 2017-07-03 2020-12-29 中科深波科技(杭州)有限公司 一种语音控制方法及其控制模组
CN107861814B (zh) 2017-10-31 2023-01-06 Oppo广东移动通信有限公司 资源配置方法及设备
CN108055417B (zh) * 2017-12-26 2020-09-29 杭州叙简科技股份有限公司 一种基于语音检测回音抑制切换音频处理系统及方法
CN108335701B (zh) * 2018-01-24 2021-04-13 青岛海信移动通信技术股份有限公司 一种进行声音降噪的方法及设备
CN109003620A (zh) * 2018-05-24 2018-12-14 北京潘达互娱科技有限公司 一种回音消除方法、装置、电子设备及存储介质
CN108766454A (zh) * 2018-06-28 2018-11-06 浙江飞歌电子科技有限公司 一种语音噪声抑制方法及装置
CN109273017B (zh) * 2018-08-14 2022-06-21 Oppo广东移动通信有限公司 编码控制方法、装置以及电子设备
CN110970032A (zh) * 2018-09-28 2020-04-07 深圳市冠旭电子股份有限公司 一种音箱语音交互控制的方法及装置
CN111145770B (zh) * 2018-11-02 2022-11-22 北京微播视界科技有限公司 音频处理方法和装置
CN109378008A (zh) * 2018-11-05 2019-02-22 网易(杭州)网络有限公司 一种游戏的语音数据处理方法和装置
CN109743528A (zh) * 2018-12-29 2019-05-10 广州市保伦电子有限公司 一种视频会议的音频采集与播放优化方法、装置及介质
CN109885275B (zh) * 2019-02-13 2022-08-19 杭州新资源电子有限公司 一种音频调控方法、设备及计算机可读存储介质
CN110072011B (zh) * 2019-04-24 2021-07-20 Oppo广东移动通信有限公司 调整码率方法及相关产品
CN110138650A (zh) * 2019-05-14 2019-08-16 北京达佳互联信息技术有限公司 即时通讯的音质优化方法、装置及设备
CN110634485B (zh) * 2019-10-16 2023-06-13 声耕智能科技(西安)研究院有限公司 语音交互服务处理器及处理方法
CN110827838A (zh) * 2019-10-16 2020-02-21 云知声智能科技股份有限公司 一种基于opus的语音编码方法及装置
CN111210826B (zh) * 2019-12-26 2022-08-05 深圳市优必选科技股份有限公司 语音信息处理方法、装置、存储介质和智能终端
CN111511002B (zh) * 2020-04-23 2023-12-05 Oppo广东移动通信有限公司 检测帧率的调节方法和装置、终端和可读存储介质
CN114299967A (zh) * 2020-09-22 2022-04-08 华为技术有限公司 音频编解码方法和装置
CN112565057B (zh) * 2020-11-13 2022-09-23 广州市百果园网络科技有限公司 一种可扩展业务的语聊房服务方法及装置
CN113053405B (zh) * 2021-03-15 2022-12-09 中国工商银行股份有限公司 基于音频场景下的音频原始数据处理方法及装置
CN113113046B (zh) * 2021-04-14 2024-01-19 杭州网易智企科技有限公司 音频处理的性能检测方法、装置、存储介质及电子设备
CN113611318A (zh) * 2021-06-29 2021-11-05 华为技术有限公司 一种音频数据增强方法及相关设备
CN113488076A (zh) * 2021-06-30 2021-10-08 北京小米移动软件有限公司 音频信号处理方法及装置
CN113555024B (zh) * 2021-07-30 2024-02-27 北京达佳互联信息技术有限公司 实时通信音频处理方法、装置、电子设备和存储介质
CN113923065B (zh) * 2021-09-06 2023-11-24 贵阳语玩科技有限公司 基于聊天室音频的跨版本通信方法、系统、介质及服务器
CN114121033B (zh) * 2022-01-27 2022-04-26 深圳市北海轨道交通技术有限公司 基于深度学习的列车广播语音增强方法和系统
CN114448957B (zh) * 2022-01-28 2024-03-29 上海小度技术有限公司 音频数据传输方法和装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1980293A (zh) * 2005-12-03 2007-06-13 鸿富锦精密工业(深圳)有限公司 静音处理装置及方法
JP2009130499A (ja) * 2007-11-21 2009-06-11 Toshiba Corp コンテンツ再生装置、コンテンツ処理システム及びコンテンツ処理方法
CN101719962A (zh) * 2009-12-14 2010-06-02 深圳华为通信技术有限公司 提高手机通话音质的方法及利用该方法的手机
CN102014205A (zh) * 2010-11-19 2011-04-13 中兴通讯股份有限公司 语音通话质量的处理方法及装置
US20120195370A1 (en) * 2011-01-28 2012-08-02 Rodolfo Vargas Guerrero Encoding of Video Stream Based on Scene Type
CN103617797A (zh) * 2013-12-09 2014-03-05 腾讯科技(深圳)有限公司 一种语音处理方法,及装置
CN103716437A (zh) * 2012-09-28 2014-04-09 华为终端有限公司 控制音质和音量的方法和装置

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2281680B (en) * 1993-08-27 1998-08-26 Motorola Inc A voice activity detector for an echo suppressor and an echo suppressor
US6782361B1 (en) * 1999-06-18 2004-08-24 Mcgill University Method and apparatus for providing background acoustic noise during a discontinued/reduced rate transmission mode of a voice transmission system
JP3912003B2 (ja) * 2000-12-12 2007-05-09 株式会社日立製作所 通信装置
JP4556574B2 (ja) * 2004-09-13 2010-10-06 日本電気株式会社 通話音声生成装置及び方法
CN101166377A (zh) * 2006-10-17 2008-04-23 施伟强 一种多语种环绕立体声的低码率编解码方案
US8031857B2 (en) * 2006-12-19 2011-10-04 Scenera Technologies, Llc Methods and systems for changing a communication quality of a communication session based on a meaning of speech data
US20080147411A1 (en) * 2006-12-19 2008-06-19 International Business Machines Corporation Adaptation of a speech processing system from external input that is not directly related to sounds in an operational acoustic environment
JP5198477B2 (ja) * 2007-03-05 2013-05-15 テレフオンアクチーボラゲット エル エム エリクソン(パブル) 定常的な背景雑音の平滑化を制御するための方法及び装置
CN101320563B (zh) * 2007-06-05 2012-06-27 华为技术有限公司 一种背景噪声编码/解码装置、方法和通信设备
KR101476138B1 (ko) * 2007-06-29 2014-12-26 삼성전자주식회사 코덱의 구성 설정 방법 및 이를 적용한 코덱
CN101237489A (zh) * 2008-03-05 2008-08-06 北京邮电大学 基于语音通信内容的处理方法和装置
EP2266231B1 (en) * 2008-04-17 2017-10-04 Telefonaktiebolaget LM Ericsson (publ) Coversational interactivity measurement and estimation for real-time media
US9327193B2 (en) * 2008-06-27 2016-05-03 Microsoft Technology Licensing, Llc Dynamic selection of voice quality over a wireless system
KR101523590B1 (ko) * 2009-01-09 2015-05-29 한국전자통신연구원 Codec mode control method and terminal for an integrated internet protocol network
JP5605573B2 (ja) * 2009-02-13 2014-10-15 日本電気株式会社 Multi-channel acoustic signal processing method, system, and program
US20130144617A1 (en) * 2010-04-13 2013-06-06 Nec Corporation Background noise cancelling device and method
JP5644359B2 (ja) * 2010-10-21 2014-12-24 ヤマハ株式会社 Speech processing apparatus
US20120166188A1 (en) * 2010-12-28 2012-06-28 International Business Machines Corporation Selective noise filtering on voice communications
CN103219011A (zh) * 2012-01-18 2013-07-24 联想移动通信科技有限公司 Noise reduction method, apparatus, and communication terminal

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1980293A (zh) * 2005-12-03 2007-06-13 鸿富锦精密工业(深圳)有限公司 Mute processing apparatus and method
JP2009130499A (ja) * 2007-11-21 2009-06-11 Toshiba Corp Content playback device, content processing system, and content processing method
CN101719962A (zh) * 2009-12-14 2010-06-02 深圳华为通信技术有限公司 Method for improving call quality of a mobile phone and mobile phone using the method
CN102014205A (zh) * 2010-11-19 2011-04-13 中兴通讯股份有限公司 Method and apparatus for processing voice call quality
US20120195370A1 (en) * 2011-01-28 2012-08-02 Rodolfo Vargas Guerrero Encoding of Video Stream Based on Scene Type
CN103716437A (zh) * 2012-09-28 2014-04-09 华为终端有限公司 Method and apparatus for controlling sound quality and volume
CN103617797A (zh) * 2013-12-09 2014-03-05 腾讯科技(深圳)有限公司 Voice processing method and apparatus

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106254677A (zh) * 2016-09-19 2016-12-21 深圳市金立通信设备有限公司 Scene mode setting method and terminal
US20220059101A1 (en) * 2019-11-27 2022-02-24 Tencent Technology (Shenzhen) Company Limited Voice processing method and apparatus, computer-readable storage medium, and computer device
US11869516B2 (en) * 2019-11-27 2024-01-09 Tencent Technology (Shenzhen) Company Limited Voice processing method and apparatus, computer-readable storage medium, and computer device

Also Published As

Publication number Publication date
US9978386B2 (en) 2018-05-22
US20180240468A1 (en) 2018-08-23
CN103617797A (zh) 2014-03-05
US10510356B2 (en) 2019-12-17
US20160284358A1 (en) 2016-09-29

Similar Documents

Publication Publication Date Title
WO2015085959A1 (zh) Voice processing method and apparatus
CN105872253B (zh) Live-broadcast sound processing method and mobile terminal
WO2021098405A1 (zh) Data transmission method and apparatus, terminal, and storage medium
WO2015058656A1 (zh) Live broadcast control method and anchor device
KR101540896B1 (ko) Generating a masking signal on an electronic device
CN104902116B (zh) Method and apparatus for time alignment of audio data with a reference signal
WO2016184295A1 (zh) Instant messaging method, user equipment, and system
KR20110054609A (ko) Method and apparatus for remotely controlling a Bluetooth device
WO2021184920A1 (zh) Sound masking method and apparatus, and terminal device
WO2013127367A1 (zh) Speech recognition method and terminal for instant messaging
JP7361890B2 (ja) Call method, call apparatus, call system, server, and computer program
CN108712566A (zh) Voice assistant wake-up method and mobile terminal
CN106506437B (zh) Audio data processing method and device
WO2017215661A1 (zh) Scene sound effect control method and electronic device
CN108492837B (zh) Method, apparatus, and storage medium for detecting audio burst white noise
WO2017101260A1 (zh) Audio switching method, apparatus, and storage medium
WO2022037261A1 (zh) Audio playback and device management method and apparatus
WO2015078349A1 (zh) Method and apparatus for switching microphone sound pickup state
CN103677582A (zh) Method for controlling an electronic device, and electronic device
CN109889665B (zh) Volume adjustment method, mobile terminal, and storage medium
EP1783600A2 (en) Method for arbitrating audio data output apparatuses
WO2020118560A1 (zh) Recording method and apparatus, electronic device, and computer-readable storage medium
US8781138B2 (en) Method for outputting background sound and mobile communication terminal using the same
WO2019076289A1 (zh) Method for reducing power consumption of an electronic device, and electronic device
WO2018035873A1 (zh) Audio data processing method, terminal device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15727849

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 26.10.2016)

122 Ep: pct application non-entry in european phase

Ref document number: 15727849

Country of ref document: EP

Kind code of ref document: A1