WO2022267754A1 - Speech encoding method, speech decoding method, apparatus, computer device and storage medium - Google Patents

Speech encoding method, speech decoding method, apparatus, computer device and storage medium Download PDF

Info

Publication number
WO2022267754A1
WO2022267754A1 PCT/CN2022/093329 CN2022093329W
Authority
WO
WIPO (PCT)
Prior art keywords
frequency band
target
feature information
voice
initial
Prior art date
Application number
PCT/CN2022/093329
Other languages
English (en)
French (fr)
Inventor
Liang Junbin (梁俊斌)
Original Assignee
Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Priority to EP22827252.2A (EP4362013A1)
Publication of WO2022267754A1
Priority to US18/124,496 (US20230238009A1)

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/038Speech enhancement, e.g. noise reduction or echo cancellation using band spreading techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0204Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using subband decomposition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture

Definitions

  • the present application relates to the field of computer technology, in particular to a speech encoding and speech decoding method, device, computer equipment, storage medium and computer program product.
  • Voice codec technology can be applied to voice storage and voice transmission.
  • a voice collection device needs to be used in conjunction with a voice encoder, and the sampling rate of the voice collection device needs to be within the sampling rate range supported by the voice encoder, so that the voice signal collected by the device can be encoded by the voice encoder.
  • the playback of the voice signal also depends on the voice decoder.
  • the voice decoder can only decode and play voice signals whose sampling rate is within the sampling rate range it supports, so it can only play speech signals within that supported range.
  • the collection of voice signals is limited by the sampling rate supported by the existing voice encoder, and the playback of voice signals is likewise limited by the sampling rate supported by the existing voice decoder, which imposes relatively large limitations.
  • a speech encoding method, a speech decoding method, a device, computer equipment, a storage medium and a computer program product are provided.
  • a speech encoding method performed by a speech sending end, said method comprising:
  • the compressed voice signal is encoded by the voice encoding module to obtain coded voice data corresponding to the voice signal to be processed; the target sampling rate corresponding to the compressed voice signal is less than or equal to the supported sampling rate corresponding to the voice encoding module, and the target sampling rate is smaller than the sampling rate corresponding to the voice signal to be processed.
  • a speech coding device comprising:
  • a frequency band feature information acquisition module configured to obtain initial frequency band feature information corresponding to the speech signal to be processed;
  • the first target feature information determining module is configured to obtain target feature information corresponding to the first frequency band based on the initial feature information corresponding to the first frequency band in the initial frequency band feature information;
  • the second target feature information determination module is configured to perform feature compression on the initial feature information corresponding to the second frequency band in the initial frequency band feature information, to obtain target feature information corresponding to the compressed frequency band; the frequency of the first frequency band is lower than the frequency of the second frequency band, and the frequency interval of the second frequency band is greater than the frequency interval of the compressed frequency band;
  • a compressed voice signal generating module configured to obtain intermediate frequency band feature information based on the target feature information corresponding to the first frequency band and the target feature information corresponding to the compressed frequency band, and to obtain the compressed voice signal corresponding to the voice signal to be processed based on the intermediate frequency band feature information;
  • a speech signal encoding module configured to encode the compressed speech signal through the speech encoding module to obtain encoded speech data corresponding to the speech signal to be processed; the target sampling rate corresponding to the compressed speech signal is less than or equal to the supported sampling rate corresponding to the speech encoding module, and the target sampling rate is lower than the sampling rate corresponding to the speech signal to be processed.
  • a computer device comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps of the speech encoding method described above.
  • one or more non-transitory computer-readable storage media on which computer-readable instructions are stored that, when executed by one or more processors, cause the one or more processors to perform the steps of the above speech encoding method.
  • a computer program product or computer program comprising computer-readable instructions stored on a computer-readable storage medium; one or more processors of a computer device read the computer-readable instructions from the computer-readable storage medium and execute them, so that the computer device performs the steps of the above speech encoding method.
  • a voice decoding method performed by a voice receiving end, said method comprising:
  • the coded voice data is obtained by performing voice compression processing on the voice signal to be processed;
  • generating target frequency band feature information corresponding to the decoded speech signal, and obtaining extended feature information corresponding to the first frequency band based on the target feature information corresponding to the first frequency band in the target frequency band feature information;
  • the frequency of the first frequency band is less than the frequency of the compressed frequency band, and the frequency interval of the compressed frequency band is smaller than the frequency interval of the second frequency band;
  • a speech decoding device comprising:
  • a voice data acquisition module configured to acquire coded voice data obtained by performing voice compression processing on the voice signal to be processed;
  • the voice signal decoding module is used to decode the coded voice data through the voice decoding module to obtain a decoded voice signal, and the target sampling rate corresponding to the decoded voice signal is less than or equal to the supported sampling rate corresponding to the voice decoding module;
  • the first extended feature information determination module is configured to generate target frequency band feature information corresponding to the decoded speech signal, and obtain extended feature information corresponding to the first frequency band based on the target feature information corresponding to the first frequency band in the target frequency band feature information;
  • the second extended feature information determination module is used to perform feature expansion on the target feature information corresponding to the compressed frequency band in the target frequency band feature information to obtain the extended feature information corresponding to the second frequency band; the frequency of the first frequency band is less than the frequency of the compressed frequency band, and the frequency interval of the compressed frequency band is smaller than the frequency interval of the second frequency band;
  • a target voice signal determination module configured to obtain extended frequency band feature information based on the extended feature information corresponding to the first frequency band and the extended feature information corresponding to the second frequency band, and to obtain, based on the extended frequency band feature information, the target voice signal corresponding to the voice signal to be processed; the sampling rate of the target voice signal is greater than the target sampling rate, and the target voice signal is used for playback.
  • a computer device comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps of the above speech decoding method.
  • one or more non-transitory computer-readable storage media on which computer-readable instructions are stored that, when executed by one or more processors, cause the one or more processors to perform the steps of the above speech decoding method.
  • a computer program product or computer program comprising computer-readable instructions stored on a computer-readable storage medium; one or more processors of a computer device read the computer-readable instructions from the computer-readable storage medium and execute them, so that the computer device performs the steps of the above speech decoding method.
  • Fig. 1 is an application environment diagram of the speech encoding and speech decoding methods in an embodiment;
  • Fig. 2 is a schematic flow chart of a speech encoding method in an embodiment;
  • Fig. 3 is a schematic flow chart of performing feature compression on initial feature information to obtain target feature information in an embodiment;
  • Fig. 4 is a schematic diagram of a mapping relationship between an initial sub-frequency band and a target sub-frequency band in an embodiment;
  • Fig. 5 is a schematic flow chart of a speech decoding method in an embodiment;
  • Fig. 6A is a schematic flow chart of a speech encoding and decoding method in an embodiment;
  • Fig. 6B is a schematic diagram of frequency domain signals before and after compression in an embodiment;
  • Fig. 6C is a schematic diagram of speech signals before and after compression in an embodiment;
  • Fig. 6D is a schematic diagram of frequency domain signals before and after expansion in an embodiment;
  • Fig. 6E is a schematic diagram of a speech signal to be processed and a target speech signal in an embodiment;
  • Fig. 7A is a structural block diagram of a speech encoding device in an embodiment;
  • Fig. 7B is a structural block diagram of a speech encoding device in another embodiment;
  • Fig. 8 is a structural block diagram of a speech decoding device in an embodiment;
  • Fig. 9 is an internal structure diagram of a computer device in an embodiment;
  • Fig. 10 is an internal structure diagram of a computer device in an embodiment.
  • the speech encoding and speech decoding methods provided in this application can be applied to the application environment shown in FIG. 1 .
  • the voice sending end 102 communicates with the voice receiving end 104 through the network.
  • the speech sending end may also be called a speech encoding end, and is mainly used for performing speech encoding.
  • the voice receiving end may also be called a voice decoding end, which is mainly used for voice decoding.
  • the voice sending end 102 and the voice receiving end 104 can be terminals or servers; the terminals can be, but are not limited to, desktop computers, notebook computers, smart phones, tablet computers, Internet of Things devices and portable wearable devices; the Internet of Things devices can be smart speakers, smart TVs, smart air conditioners, smart in-vehicle devices, etc.
  • Portable wearable devices can be smart watches, smart bracelets, head-mounted devices, and the like.
  • the server may be implemented by an independent server, a server cluster composed of multiple servers, or a cloud server.
  • the speech sending end obtains the initial frequency band feature information corresponding to the speech signal to be processed; the speech sending end can obtain the target feature information corresponding to the first frequency band based on the initial feature information corresponding to the first frequency band in the initial frequency band feature information, and perform feature compression on the initial feature information corresponding to the second frequency band in the initial frequency band feature information to obtain target feature information corresponding to the compressed frequency band.
  • the frequency of the first frequency band is smaller than the frequency of the second frequency band
  • the frequency interval of the second frequency band is larger than the frequency interval of the compressed frequency band.
  • the voice sending end obtains the intermediate frequency band feature information based on the target feature information corresponding to the first frequency band and the target feature information corresponding to the compressed frequency band, obtains the compressed voice signal corresponding to the voice signal to be processed based on the intermediate frequency band feature information, and encodes the compressed voice signal through the voice encoding module to obtain encoded voice data corresponding to the voice signal to be processed.
  • the target sampling rate corresponding to the compressed speech signal is less than or equal to the supported sampling rate corresponding to the speech encoding module, and the target sampling rate is smaller than the sampling rate corresponding to the speech signal to be processed.
  • the voice sending end can send the coded voice data to the voice receiving end, so that the voice receiving end performs voice restoration processing on the coded voice data, obtains a target voice signal corresponding to the voice signal to be processed, and plays the target voice signal.
  • the voice sending end can also store the encoded voice data locally. When it needs to be played, the voice sending end performs voice restoration processing on the encoded voice data, obtains the target voice signal corresponding to the voice signal to be processed, and plays the target voice signal.
  • the speech signal to be processed at any sampling rate can be compressed through the frequency band feature information, reducing its sampling rate to a sampling rate supported by the speech encoder; the target sampling rate corresponding to the resulting compressed speech signal is smaller than the sampling rate corresponding to the speech signal to be processed, so a compressed speech signal with a low sampling rate is obtained. Because the sampling rate of the compressed speech signal is less than or equal to the sampling rate supported by the speech encoder, the speech encoder can smoothly encode the compressed speech signal, and the encoded speech data obtained by the encoding process can finally be transmitted to the speech decoder.
  • the voice receiving end obtains the coded voice data and decodes it through the voice decoding module to obtain a decoded voice signal; the coded voice data can be sent by the voice sending end, or can be obtained by the voice receiving end locally performing voice compression on the voice signal to be processed.
  • the voice receiving end generates the target frequency band feature information corresponding to the decoded voice signal, obtains the extended feature information corresponding to the first frequency band based on the target feature information corresponding to the first frequency band in the target frequency band feature information, and performs feature expansion on the target feature information corresponding to the compressed frequency band in the target frequency band feature information to obtain the extended feature information corresponding to the second frequency band.
  • the frequency of the first frequency band is smaller than the frequency of the compressed frequency band
  • the frequency interval of the compressed frequency band is smaller than the frequency interval of the second frequency band.
  • the voice receiving end obtains the extended frequency band feature information based on the extended feature information corresponding to the first frequency band and the extended feature information corresponding to the second frequency band, and obtains the target voice signal corresponding to the voice signal to be processed based on the extended frequency band feature information; the sampling rate of the target voice signal is greater than the target sampling rate corresponding to the decoded speech signal.
  • the voice receiving end plays the target voice signal.
  • after obtaining the coded speech data obtained through speech compression processing, the coded speech data can be decoded to obtain a decoded speech signal, and the sampling rate of the decoded speech signal can be increased by expanding the frequency band feature information to obtain the target speech signal for playback.
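As a minimal sketch of this expansion step (pure Python; the repeat-the-amplitude rule, the function name, and the grouping factor are illustrative assumptions, the simplest inverse of average-based compression rather than the patent's exact method):

```python
def expand_band(compressed_amplitudes, group_size):
    # Feature expansion: map each target frequency point of the compressed
    # band back onto the group of second-band frequency points it stood
    # for, by repeating its amplitude. Real systems may instead shape the
    # restored band's spectral envelope.
    expanded = []
    for a in compressed_amplitudes:
        expanded.extend([a] * group_size)
    return expanded
```

With a grouping factor of 3, two compressed-band amplitudes expand back into six second-band amplitudes.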
  • the playback of the voice signal is not limited by the sampling rate supported by the voice decoder.
  • the encoded voice data may pass through a server, which may be implemented by an independent server, a server cluster composed of multiple servers, or a cloud server.
  • the voice receiving end and the voice sending end can swap roles; that is, the voice receiving end can also serve as a voice sending end, and the voice sending end can also serve as a voice receiving end.
  • a speech coding method is provided, and the method is applied to the speech sending end in FIG. 1 as an example, including the following steps:
  • Step S202: acquire initial frequency band feature information corresponding to the speech signal to be processed.
  • the voice signal to be processed refers to the voice signal collected by the voice collection device.
  • the speech signal to be processed may be a speech signal collected in real time by the speech collection device, and the speech sending end may perform frequency band compression and coding processing on the newly collected speech signal in real time to obtain coded speech data.
  • the speech signal to be processed can also be a speech signal collected historically by the speech collection device, and the speech sending end can obtain the speech signal collected at historical time from the database as the speech signal to be processed, and perform frequency band compression and encoding processing on the speech signal to be processed to obtain the coded speech data.
  • the voice sending end can store the coded voice data, and decode and play the coded voice data when it needs to be played.
  • the voice sending end can also send the encoded voice signal to the voice receiving end, and the voice receiving end decodes and plays the encoded voice data.
  • the speech signal to be processed is a time-domain signal, which can reflect the change of the speech signal over time.
  • Frequency band compression can reduce the sampling rate of the speech signal while keeping the speech content intelligible.
  • Frequency band compression refers to compressing a speech signal of a large frequency band into a speech signal of a small frequency band, wherein the speech signal of a small frequency band and the speech signal of a large frequency band have the same low-frequency information.
  • the initial frequency band feature information refers to the feature information of the speech signal to be processed in the frequency domain.
  • the feature information of the speech signal in the frequency domain includes the amplitude and phase of multiple frequency points within a frequency bandwidth (i.e., frequency band).
  • a frequency point represents a specific frequency.
  • according to Shannon's theorem, the sampling rate of a voice signal is twice its frequency band. For example, if the sampling rate of the voice signal is 48 kHz, the frequency band of the voice signal is 24 kHz, specifically 0-24 kHz; if the sampling rate of the voice signal is 16 kHz, the frequency band of the voice signal is 8 kHz, specifically 0-8 kHz.
  • the voice sending end may use the voice signal collected by the local voice collection device as the voice signal to be processed, and locally extract the frequency domain feature of the voice signal to be processed as the initial frequency band feature information corresponding to the voice signal to be processed.
  • the voice sending end can use a time-frequency domain conversion algorithm to convert the time-domain signal into a frequency-domain signal, thereby extracting the frequency-domain features of the voice signal to be processed; for example, a custom time-frequency domain conversion algorithm, the Laplace transform, the Z-transform, or the Fourier transform.
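The time-to-frequency conversion described above can be sketched with a naive discrete Fourier transform (a minimal pure-Python illustration, not code from the patent; the function name `dft` and the frame contents are assumptions):

```python
import cmath

def dft(frame):
    # Naive discrete Fourier transform of a real-valued time-domain frame.
    # Returns per-frequency-point amplitudes and phases, i.e. the kind of
    # frequency-domain feature information described above.
    n = len(frame)
    spectrum = []
    for k in range(n):
        s = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
        spectrum.append(s)
    amplitudes = [abs(s) for s in spectrum]
    phases = [cmath.phase(s) for s in spectrum]
    return amplitudes, phases
```

In practice an FFT would replace this O(n²) loop; the sketch only shows that each frequency point carries an amplitude and a phase.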
  • Step S204: obtain target feature information corresponding to the first frequency band based on the initial feature information corresponding to the first frequency band in the initial frequency band feature information.
  • a frequency band is a frequency interval composed of a subset of frequencies within the overall frequency bandwidth.
  • a frequency band may consist of at least one frequency segment.
  • the initial frequency band corresponding to the speech signal to be processed includes a first frequency band and a second frequency band, and the frequency of the first frequency band is lower than the frequency of the second frequency band.
  • the voice sending end may divide the initial frequency band feature information into initial feature information corresponding to the first frequency band and initial feature information corresponding to the second frequency band. That is, the initial feature information of the frequency band may be divided into initial feature information corresponding to the low frequency band and initial feature information corresponding to the high frequency band.
  • the initial feature information corresponding to the low-frequency band mainly determines the content information of the speech, for example, the specific semantic content "what time do you get off work", and the initial feature information corresponding to the high-frequency band mainly determines the texture of the speech, for example, a hoarse and deep voice.
  • the initial feature information refers to the feature information corresponding to each frequency before frequency band compression, and the target feature information refers to the feature information corresponding to each frequency after frequency band compression.
  • the speech signal to be processed needs to be band-compressed to reduce its sampling rate.
  • in addition to reducing the sampling rate of the speech signal to be processed, frequency band compression must also ensure that the semantic content remains unchanged and naturally intelligible. Since the semantic content of speech depends on the low-frequency information in the speech signal, the speech sending end can divide the initial frequency band feature information into initial feature information corresponding to the first frequency band and initial feature information corresponding to the second frequency band.
  • the initial feature information corresponding to the first frequency band is low frequency information in the speech signal to be processed
  • the initial feature information corresponding to the second frequency band is high frequency information in the speech signal to be processed.
  • the voice sending end can keep the low-frequency information unchanged and compress the high-frequency information. Therefore, the voice sending end can obtain the target feature information corresponding to the first frequency band based on the initial feature information corresponding to the first frequency band in the initial frequency band feature information, that is, use the initial feature information corresponding to the first frequency band as the target feature information corresponding to the first frequency band in the intermediate frequency band feature information. In other words, before and after frequency band compression, the low-frequency information remains unchanged and consistent.
  • the voice sending end may divide the initial frequency band into a first frequency band and a second frequency band based on preset frequencies.
  • the preset frequency may be set based on expert knowledge; for example, the preset frequency is set to 6 kHz. If the sampling rate of the speech signal is 48 kHz, the initial frequency band corresponding to the speech signal is 0-24 kHz, the first frequency band is 0-6 kHz, and the second frequency band is 6-24 kHz.
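Dividing per-frequency-point features at a preset cutoff can be sketched as follows (a hypothetical helper; `split_bands` and the 6000 Hz default are illustrative assumptions, not from the patent, and the convention that the boundary frequency falls in the second band is arbitrary):

```python
def split_bands(freqs, feats, cutoff_hz=6000.0):
    # Split per-frequency-point feature information into the first (low)
    # frequency band and the second (high) frequency band at the preset
    # cutoff; points at or above the cutoff fall in the second band.
    first = [(f, x) for f, x in zip(freqs, feats) if f < cutoff_hz]
    second = [(f, x) for f, x in zip(freqs, feats) if f >= cutoff_hz]
    return first, second
```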
  • Step S206: perform feature compression on the initial feature information corresponding to the second frequency band in the initial frequency band feature information to obtain target feature information corresponding to the compressed frequency band; the frequency of the first frequency band is less than the frequency of the second frequency band, and the frequency interval of the second frequency band is greater than the frequency interval of the compressed frequency band.
  • feature compression compresses the feature information corresponding to a large frequency band into the feature information corresponding to a small frequency band, extracting and concentrating the feature information.
  • the second frequency band represents a large frequency band
  • the compressed frequency band represents a small frequency band, that is, the frequency interval of the second frequency band is greater than that of the compressed frequency band, that is, the length of the second frequency band is greater than the length of the compressed frequency band.
  • the minimum frequency in the second frequency band can be the same as the minimum frequency in the compressed frequency band, while the maximum frequency in the second frequency band is obviously greater than the maximum frequency in the compressed frequency band.
  • the compressed frequency band can be 6-8 kHz, 6-16 kHz, etc.
  • Feature compression can also be considered as compressing the feature information corresponding to the high frequency band into the feature information corresponding to the low frequency band.
  • the voice sending end when performing frequency band compression, mainly compresses high-frequency information in the voice signal.
  • the voice sending end may perform feature compression on initial feature information corresponding to the second frequency band in the initial frequency band feature information to obtain target feature information corresponding to the compressed frequency band.
  • the initial frequency band feature information includes amplitudes and phases corresponding to multiple initial voice frequency points.
  • the voice sending end can compress the amplitudes and phases of the initial speech frequency points corresponding to the second frequency band in the initial frequency band feature information to obtain the amplitudes and phases of the target speech frequency points corresponding to the compressed frequency band, and obtain the target feature information corresponding to the compressed frequency band based on the amplitudes and phases of the target speech frequency points.
  • compressing the amplitude or phase may mean taking the average of the amplitudes or phases of the initial speech frequency points corresponding to the second frequency band as the amplitude or phase of the target speech frequency point corresponding to the compressed frequency band, taking a weighted average of them, or using other compression methods.
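The average-based compression described above can be sketched like this (a pure-Python illustration under the stated averaging rule; the function name and the fixed grouping factor are assumptions for illustration):

```python
def compress_band(amplitudes, group_size):
    # Compress second-band amplitudes into a narrower band: each group of
    # `group_size` adjacent initial frequency points is averaged into one
    # target frequency point (a weighted average or another rule could be
    # substituted).
    target = []
    for i in range(0, len(amplitudes), group_size):
        group = amplitudes[i:i + group_size]
        target.append(sum(group) / len(group))
    return target
```

For instance, compressing a 6-24 kHz second band into a 6-8 kHz compressed band would use a grouping factor of 9, since 18 kHz of bandwidth must fit into 2 kHz.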
  • the compression of the amplitude or phase can also be performed in segments.
  • the voice sending end can compress only the amplitudes of the initial speech frequency points corresponding to the second frequency band in the initial frequency band feature information to obtain the amplitudes of the target speech frequency points corresponding to the compressed frequency band; among the initial speech frequency points corresponding to the second frequency band, it finds the initial speech frequency point whose frequency is consistent with that of the target speech frequency point, uses that point as the intermediate speech frequency point, and uses its phase as the phase of the target speech frequency point; based on the amplitudes and phases of the target speech frequency points, the target feature information corresponding to the compressed frequency band is obtained.
  • for example, the phases of the initial speech frequency points corresponding to 6-8 kHz in the second frequency band can be used as the phases of the target speech frequency points corresponding to 6-8 kHz in the compressed frequency band.
  • Step S208, the middle frequency band feature information is obtained based on the target feature information corresponding to the first frequency band and the target feature information corresponding to the compressed frequency band, and the compressed speech signal corresponding to the speech signal to be processed is obtained based on the middle frequency band feature information.
  • the middle frequency band feature information refers to feature information obtained after performing band compression on the initial frequency band feature information.
  • the compressed voice signal refers to the voice signal obtained after the frequency band of the voice signal to be processed is compressed.
  • Band compression can reduce the sampling rate of the speech signal while keeping the speech content intelligible. It can be understood that the sampling rate of the voice signal to be processed is greater than the corresponding sampling rate of the compressed voice signal.
  • the voice sending end can obtain the middle frequency band feature information based on the target feature information corresponding to the first frequency band and the target feature information corresponding to the compressed frequency band.
  • the characteristic information of the intermediate frequency band is a frequency domain signal.
  • the voice transmitting end may convert the frequency domain signal into a time domain signal, thereby obtaining a compressed voice signal.
  • the voice sending end can use a frequency domain-time domain conversion algorithm to convert the frequency domain signal into a time domain signal, for example, a custom frequency domain-time domain conversion algorithm, an inverse Laplace transform algorithm, an inverse Z transform algorithm, an inverse Fourier transform algorithm, etc.
  • the sampling rate of the speech signal to be processed is 48khz
  • the initial frequency band is 0-24khz.
  • the voice sending end may obtain the initial feature information corresponding to 0-6khz from the initial frequency band feature information, and directly use the initial feature information corresponding to 0-6khz as the target feature information corresponding to 0-6khz.
  • the voice sending end can obtain initial feature information corresponding to 6-24khz from the initial frequency band feature information, and compress the initial feature information corresponding to 6-24khz into target feature information corresponding to 6-8khz.
  • the voice sending end can generate a compressed voice signal based on the target feature information corresponding to 0-8khz, and the target sampling rate corresponding to the compressed voice signal is 16khz.
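The 48khz-to-16khz example above can be sketched end to end. This is an illustrative assumption of one way to realize the scheme, not the patent's specified implementation: the frame length, the use of `numpy.fft.rfft`/`irfft`, and the plain-mean amplitude statistic are all choices made for the sketch.

```python
import numpy as np

def band_compress_frame(frame, fs_in=48000, fs_out=16000, split_hz=6000):
    """Compress a 48 khz frame's 0-24 khz spectrum into 0-8 khz and
    return a 16 khz frame (hypothetical sketch of the scheme)."""
    n = len(frame)
    spec = np.fft.rfft(frame)                 # bins cover 0-24 khz
    n_out = n * fs_out // fs_in               # e.g. 480 -> 160 samples
    m = n_out // 2 + 1                        # output bins cover 0-8 khz
    freqs = np.fft.rfftfreq(n, 1.0 / fs_in)   # same bin width as the output
    out = spec[:m].copy()                     # 0-6 khz copied unchanged
    comp = freqs[:m] > split_hz               # compressed band 6-8 khz
    src = freqs > split_hz                    # second band 6-24 khz
    # amplitude: mean of the 6-24 khz amplitudes; phase: original 6-8 khz phase
    out[comp] = np.abs(spec[src]).mean() * np.exp(1j * np.angle(out[comp]))
    return np.fft.irfft(out, n_out)           # 16 khz compressed frame

t = np.arange(480) / 48000.0
frame = np.cos(2 * np.pi * 1000 * t)          # 1 khz tone, 100 hz bin width
compressed = band_compress_frame(frame)       # 160 samples at 16 khz
```

Because the output frame length is the sampling-rate ratio times shorter, the frequency resolution of the two spectra matches, so the 0-6khz bins can be copied across directly.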
  • the sampling rate of the speech signal to be processed can be higher than the sampling rate supported by the speech coder, so the voice sending end can perform frequency band compression on the speech signal to be processed, compressing the high-sampling-rate speech signal to be processed into a speech signal at a sampling rate supported by the speech coder, so that the speech coder can successfully encode it.
  • the sampling rate of the speech signal to be processed can also be equal to or lower than the sampling rate supported by the speech encoder; in that case, the voice sending end can still perform frequency band compression, compressing the speech signal to be processed from its normal sampling rate to an even lower sampling rate, thereby reducing the amount of calculation when the speech encoder performs encoding and reducing the amount of data transmitted, so that the voice signal can be quickly transmitted to the voice receiving end through the network.
  • the frequency band corresponding to the middle frequency band feature information and the frequency band corresponding to the initial frequency band feature information may be the same or different.
  • if the frequency band corresponding to the characteristic information of the intermediate frequency band is the same as the frequency band corresponding to the characteristic information of the initial frequency band, then in the characteristic information of the intermediate frequency band there is specific characteristic information in the first frequency band and the compressed frequency band, and the characteristic information corresponding to each frequency greater than the compressed frequency band is zero.
  • the initial frequency band feature information includes the amplitude and phase of multiple frequency points on 0-24khz
  • the intermediate frequency band feature information includes the amplitude and phase of multiple frequency points on 0-24khz
  • the first frequency band is 0-6khz
  • the second frequency band is 8-24khz
  • the compressed frequency band is 6-8khz.
  • each frequency point on 0-24khz has a corresponding amplitude and phase.
  • each frequency point on 0-8khz has a corresponding amplitude and phase
  • each frequency point on 8-24khz has a corresponding amplitude and phase, both of which are zero.
  • the voice sending end needs to first convert the middle frequency band feature information into a time domain signal, and then down-sample the time domain signal to obtain a compressed voice signal.
  • the frequency band corresponding to the characteristic information of the intermediate frequency band is composed of the first frequency band and the compressed frequency band
  • the frequency band corresponding to the characteristic information of the initial frequency band is composed of the first frequency band and the second frequency band composition.
  • the initial frequency band feature information includes the amplitude and phase of multiple frequency points on 0-24khz
  • the intermediate frequency band feature information includes the amplitude and phase of multiple frequency points on 0-8khz
  • the first frequency band is 0-6khz
  • the second frequency band is 8-24khz
  • the compressed frequency band is 6-8khz.
  • each frequency point on 0-24khz has a corresponding amplitude and phase.
  • each frequency point on 0-8khz has a corresponding amplitude and phase.
  • If the frequency band corresponding to the middle frequency band feature information is different from the frequency band corresponding to the initial frequency band feature information, the voice transmitting end may directly convert the middle frequency band feature information into a time domain signal to obtain the compressed voice signal.
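The two conversion paths above can be illustrated with a toy spectrum. The bin counts, the 100 hz grid, and the scale factor are illustrative assumptions; the naive take-every-third-sample decimation is only safe here because every bin above 8khz is exactly zero.

```python
import numpy as np

n, n_out = 480, 160          # 48 khz frame vs 16 khz frame, same 100 hz bins
m = n_out // 2 + 1           # number of bins up to 8 khz

spec = np.zeros(n // 2 + 1, dtype=complex)   # full 0-24 khz spectrum
spec[10] = 240.0                             # a single 1 khz component

# Path 1: intermediate band == initial band (8-24 khz bins all zero):
# convert to the time domain at 48 khz first, then down-sample by 3.
full = np.fft.irfft(spec, n)
path1 = 3 * full[::3]        # factor 3 compensates irfft's 1/n scaling

# Path 2: intermediate band == first band + compressed band only:
# convert the shorter spectrum directly at the 16 khz target rate.
path2 = np.fft.irfft(spec[:m], n_out)
```

Both paths produce the same 16khz compressed frame, which is why the sending end may skip the down-sampling step whenever the intermediate band feature information only spans the first band plus the compressed band.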
  • Step S210, the compressed speech signal is encoded by the speech encoding module to obtain encoded speech data corresponding to the speech signal to be processed; the target sampling rate corresponding to the compressed speech signal is less than or equal to the supported sampling rate corresponding to the speech encoding module, and the target sampling rate is less than the sampling rate corresponding to the speech signal to be processed.
  • the speech coding module is a module for coding the speech signal.
  • the speech coding module can be hardware or software.
  • the supported sampling rate corresponding to the speech encoding module refers to the maximum sampling rate supported by the speech encoding module, that is, the upper limit of the sampling rate. It can be understood that, if the supported sampling rate of the speech encoding module is 16khz, the speech encoding module can encode the speech signal whose sampling rate is less than or equal to 16khz.
  • the speech sending end can compress the speech signal to be processed into a compressed speech signal, so that the sampling rate of the compressed speech signal meets the sampling rate requirement of the speech encoding module.
  • the voice coding module supports processing voice signals whose sampling rate is less than or equal to the upper limit of the sampling rate.
  • the voice sending end can encode the compressed voice signal through the voice coding module to obtain coded voice data corresponding to the voice signal to be processed.
  • the coded voice data is code stream data. If the coded voice data is only stored locally and does not need to be transmitted over the network, then the voice sending end can perform voice coding on the compressed voice signal through the voice coding module to obtain coded voice data. If the coded voice data needs to be further transmitted to the voice receiving end, then the voice transmitting end can perform voice coding on the compressed voice signal through the voice coding module to obtain the first voice data, and channel code the first voice data to obtain the coded voice data.
  • friends can conduct voice chat on an instant messaging application of a terminal.
  • a user can send a voice message to a friend on a conversation interface in an instant messaging application.
  • friend A sends a voice message to friend B
  • the terminal corresponding to friend A is the voice sending end
  • the terminal corresponding to friend B is the voice receiving end.
  • the voice sending end can obtain the trigger operation of the friend A acting on the voice collection control on the conversation interface to collect the voice signal, and collect the voice signal of the friend A through the microphone to obtain the voice signal to be processed.
  • the initial sampling rate corresponding to the voice signal to be processed can be 48khz, and the voice signal to be processed has good sound quality and an ultra-wide frequency band, specifically 0-24khz.
  • the voice sending end performs Fourier transform processing on the voice signal to be processed to obtain initial frequency band feature information corresponding to the voice signal to be processed, and the initial frequency band feature information includes frequency domain information in the range of 0-24khz. After performing nonlinear frequency band compression on the 0-24khz frequency domain information, the voice sending end concentrates it on 0-8khz.
  • Specifically, the initial characteristic information corresponding to 0-6khz in the initial frequency band characteristic information can be kept unchanged, and the initial feature information corresponding to 6-24khz is compressed to 6-8khz.
  • the voice sending end generates a compressed voice signal based on the 0-8khz frequency domain information obtained after nonlinear frequency band compression, and the target sampling rate corresponding to the compressed voice signal is 16khz.
  • the voice sending end can encode the compressed voice signal through a conventional voice coder supporting 16khz to obtain encoded voice data, and send the encoded voice data to the voice receiving end.
  • the sampling rate corresponding to the encoded voice data is consistent with the target sampling rate.
  • After receiving the coded voice data, the voice receiving end can perform decoding processing and nonlinear frequency band extension processing to obtain the target voice signal, and the sampling rate of the target voice signal is consistent with the initial sampling rate.
  • the voice receiving end can obtain the trigger operation of the friend B acting on the voice message on the conversation interface to play the voice signal, and play the target voice signal with a high sampling rate through the loudspeaker.
  • when the terminal acquires a recording operation triggered by the user, the terminal can collect the user's voice signal through the microphone to obtain the voice signal to be processed.
  • the terminal performs Fourier transform processing on the speech signal to be processed to obtain initial frequency band characteristic information corresponding to the speech signal to be processed, and the initial frequency band characteristic information includes frequency domain information in the range of 0-24khz.
  • After the terminal compresses the 0-24khz frequency domain information through nonlinear frequency band compression, it concentrates the 0-24khz frequency domain information on 0-8khz. Specifically, the initial characteristic information corresponding to 0-6khz in the initial frequency band characteristic information can be kept unchanged.
  • the terminal compresses the initial feature information corresponding to 6-24khz to 6-8khz.
  • the terminal generates a compressed voice signal based on the 0-8khz frequency domain information obtained after the nonlinear frequency band compression, and the target sampling rate corresponding to the compressed voice signal is 16khz.
  • the terminal can encode the compressed voice signal through a conventional voice coder supporting 16khz to obtain coded voice data, and store the coded voice data.
  • when the terminal acquires the playback operation on the recording triggered by the user, the terminal can perform voice restoration processing on the coded voice data to obtain a target voice signal, and play the target voice signal.
  • the coded voice data may carry compressed identification information, and the compressed identification information is used to identify frequency band mapping information between the second frequency band and the compressed frequency band. Then, when the voice sending end or the voice receiving end is performing voice restoration processing, it can perform voice restoration processing on the coded voice data based on the compressed identification information to obtain the target voice signal.
  • the maximum frequency in the compressed frequency band may be determined based on the supported sampling rate corresponding to the speech coding module on the speech sending end.
  • For example, if the supported sampling rate of the voice encoding module is 16khz, the corresponding frequency band is 0-8khz, so the maximum frequency in the compressed frequency band can be 8khz.
  • the maximum frequency in the compressed frequency band can also be less than 8khz. Even if the maximum value of the frequency in the compressed frequency band is less than 8khz, the voice encoding module supporting a sampling rate of 16khz can encode the corresponding compressed voice signal.
  • the maximum frequency in the compressed frequency band may also be a default frequency, and the default frequency may be determined based on supported sampling rates corresponding to various existing speech coding modules. For example, among the supported sampling rates of various known voice coding modules, the minimum value is 16khz, so the default frequency can be set to 8khz.
  • In the above speech encoding method, after the initial frequency band feature information corresponding to the speech signal to be processed is obtained, the initial feature information corresponding to the first frequency band in the initial frequency band feature information is used as the target feature information corresponding to the first frequency band, and the initial feature information corresponding to the second frequency band is subjected to feature compression to obtain the target feature information corresponding to the compressed frequency band. The middle frequency band feature information is obtained based on the target feature information corresponding to the first frequency band and the target feature information corresponding to the compressed frequency band, the compressed speech signal corresponding to the speech signal to be processed is obtained based on the middle frequency band feature information, and the compressed speech signal is encoded by the speech encoding module to obtain the coded voice data corresponding to the speech signal to be processed, where the target sampling rate corresponding to the compressed voice signal is less than or equal to the supported sampling rate corresponding to the voice coding module.
  • In this way, a speech signal to be processed at any sampling rate can be compressed through frequency band feature compression, reducing its sampling rate to a sampling rate supported by the speech encoder; the target sampling rate corresponding to the compressed speech signal obtained after compression is lower than the sampling rate corresponding to the speech signal to be processed. Having obtained a compressed speech signal with a low sampling rate, the speech coder can smoothly encode it, and the coded speech data obtained by the encoding processing can finally be transmitted to the speech receiving end.
  • obtaining initial frequency band feature information corresponding to the speech signal to be processed includes:
  • the voice collection device refers to a device for collecting voice, for example, a microphone.
  • Fourier transform processing refers to performing Fourier transform on the speech signal to be processed to convert the time domain signal into a frequency domain signal.
  • the frequency domain signal can reflect the characteristic information of the speech signal to be processed in the frequency domain.
  • the initial frequency band feature information is the frequency domain signal.
  • the initial voice frequency point refers to a frequency point in the initial frequency band feature information corresponding to the speech signal to be processed.
  • the voice sending end can obtain the voice signal to be processed collected by the voice collection device, perform Fourier transform processing on the voice signal to be processed to convert the time domain signal into a frequency domain signal, and extract the feature information of the voice signal to be processed in the frequency domain to obtain the initial frequency band feature information.
  • the initial frequency band feature information is composed of initial amplitudes and initial phases respectively corresponding to multiple initial speech audio points. Among them, the phase of the frequency point determines the smoothness of the speech, the amplitude of the low-frequency frequency point determines the specific semantic content of the speech, and the amplitude of the high-frequency frequency point determines the texture of the speech.
  • the frequency range formed by all the initial speech audio points is the initial frequency band corresponding to the speech signal to be processed.
  • the initial frequency band feature information corresponding to the speech signal to be processed can be quickly obtained.
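A minimal sketch of this step, assuming a single analysis frame and `numpy.fft.rfft` as the Fourier transform; the tone frequency and frame length are toy values chosen for illustration.

```python
import numpy as np

fs = 48000
t = np.arange(480) / fs
frame = np.cos(2 * np.pi * 3000 * t)          # stand-in "speech" frame

spec = np.fft.rfft(frame)                     # time domain -> frequency domain
freqs = np.fft.rfftfreq(len(frame), 1.0 / fs) # bin frequencies, 0-24 khz
amp, phase = np.abs(spec), np.angle(spec)     # initial amplitude and initial
                                              # phase per voice frequency point
```

The pairs `(amp[i], phase[i])` at frequencies `freqs[i]` play the role of the initial frequency band feature information, and the span of `freqs` (0-24khz here) is the initial frequency band.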
  • the initial feature information corresponding to the second frequency band in the initial frequency band feature information is subjected to feature compression to obtain target feature information corresponding to the compressed frequency band, including:
  • Step S302, perform frequency band division on the second frequency band to obtain at least two sequentially arranged initial sub-frequency bands.
  • Step S304, perform frequency band division on the compressed frequency band to obtain at least two sequentially arranged target sub-frequency bands.
  • frequency band division refers to dividing a frequency band into multiple sub-frequency bands.
  • the division of the second frequency band or the compressed frequency band by the voice sending end may be a linear division or a non-linear division.
  • the voice sending end may divide the second frequency band linearly, that is, divide the second frequency band equally.
  • the second frequency band is 6-24khz, and the second frequency band can be evenly divided into three initial sub-frequency bands of equal size, namely 6-12khz, 12-18khz, and 18-24khz.
  • the voice sending end may also perform non-linear frequency band division on the second frequency band, that is, the second frequency band is not evenly divided.
  • the second frequency band is 6-24khz
  • the second frequency band can be non-linearly divided into five initial sub-frequency bands, namely 6-8khz, 8-10khz, 10-12khz, 12-18khz, and 18-24khz.
  • the voice sending end may perform frequency band division on the second frequency band to obtain at least two sequentially arranged initial sub-frequency bands, and perform frequency band division on the compressed frequency band to obtain at least two sequentially arranged target sub-frequency bands.
  • the number of initial sub-frequency bands and the number of target sub-frequency bands may be the same or different.
  • the number of the initial sub-frequency bands is the same as the number of the target sub-frequency bands, there is a one-to-one correspondence between the initial sub-frequency bands and the target sub-frequency bands.
  • multiple initial sub-frequency bands may correspond to one target sub-frequency band, or one initial sub-frequency band may correspond to multiple target sub-frequency bands.
  • Step S306, based on the sub-frequency band ordering of the initial sub-frequency bands and the target sub-frequency bands, determine the target sub-frequency band corresponding to each initial sub-frequency band.
  • the voice sending end may determine the target sub-frequency bands corresponding to the respective initial sub-frequency bands based on the sub-frequency band sorting of the initial sub-frequency bands and the target sub-frequency bands.
  • the voice sending end may associate the initial sub-frequency bands with the same order with the target sub-frequency bands.
  • the initial sub-bands arranged in order are 6-8khz, 8-10khz, 10-12khz, 12-18khz, 18-24khz
  • the target sub-bands arranged in order are 6-6.4khz, 6.4-6.8khz, 6.8-7.2khz, 7.2-7.6khz, 7.6-8khz
  • 6-8khz corresponds to 6-6.4khz
  • 8-10khz corresponds to 6.4-6.8khz
  • 10-12khz corresponds to 6.8-7.2khz
  • 12-18khz corresponds to 7.2-7.6khz
  • 18-24khz corresponds to 7.6-8khz.
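The order-based association in this example can be written down directly. The sub-band lists are the ones given in the text; `dict(zip(...))` is just one convenient way to record the one-to-one mapping.

```python
# initial sub-bands of the second band (6-24khz) and target sub-bands of
# the compressed band (6-8khz), both listed in order, in khz
initial_subbands = [(6, 8), (8, 10), (10, 12), (12, 18), (18, 24)]
target_subbands = [(6.0, 6.4), (6.4, 6.8), (6.8, 7.2), (7.2, 7.6), (7.6, 8.0)]

# same rank -> associated pair (step S306 for equal sub-band counts)
mapping = dict(zip(initial_subbands, target_subbands))
```

When the counts differ, `zip` would silently drop the surplus entries, which is why the one-to-many or many-to-one association described next is needed instead.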
  • When the number of initial sub-frequency bands differs from the number of target sub-frequency bands, the voice sending end can establish a one-to-one correspondence between the higher-ranked initial sub-frequency bands and the corresponding target sub-frequency bands, and establish a one-to-many or many-to-one association between the remaining lower-ranked initial sub-frequency bands and target sub-frequency bands. For example, when the number of sorted initial sub-frequency bands is greater than the number of target sub-frequency bands, a many-to-one relationship is established.
  • Step S308, use the initial feature information of the current initial sub-frequency band corresponding to the current target sub-frequency band as the first intermediate feature information, obtain from the initial frequency band feature information the initial feature information corresponding to the sub-frequency band whose frequency band information is consistent with that of the current target sub-frequency band as the second intermediate feature information, and obtain the target feature information corresponding to the current target sub-frequency band based on the first intermediate feature information and the second intermediate feature information.
  • the characteristic information corresponding to a frequency band includes an amplitude and a phase corresponding to at least one frequency point.
  • the voice sending end can compress only the amplitude, while the phase keeps the original phase.
  • the current target sub-frequency band refers to the target sub-frequency band that currently generates target feature information.
  • the voice sending end can use the initial feature information of the current initial sub-frequency band corresponding to the current target sub-frequency band as the first intermediate feature information, and the first intermediate feature information is used to determine the amplitudes of the frequency points in the target feature information corresponding to the current target sub-frequency band.
  • the voice sending end can obtain, from the initial frequency band feature information, the initial feature information corresponding to the sub-frequency band whose frequency band information is consistent with that of the current target sub-frequency band as the second intermediate feature information, and the second intermediate feature information is used to determine the phases of the frequency points in the target feature information corresponding to the current target sub-frequency band. Therefore, the voice sending end can obtain the target feature information corresponding to the current target sub-frequency band based on the first intermediate feature information and the second intermediate feature information.
  • the initial frequency band characteristic information includes initial characteristic information corresponding to 0-24khz.
  • the current target sub-band is 6-6.4khz
  • the initial sub-band corresponding to the current target sub-band is 6-8khz.
  • the voice sending end can obtain the target feature information corresponding to 6-6.4 khz based on the initial feature information corresponding to 6-8 khz and the initial feature information corresponding to 6-6.4 khz in the initial frequency band feature information.
  • Step S310, based on the target feature information corresponding to each target sub-frequency band, the target feature information corresponding to the compressed frequency band is obtained.
  • the voice transmitting end can obtain the target feature information corresponding to the compressed frequency band based on the target feature information corresponding to each target sub-frequency band; that is, the target feature information corresponding to the target sub-frequency bands together constitutes the target feature information corresponding to the compressed frequency band.
  • In this way, the reliability of feature compression can be improved, and the difference between the initial feature information corresponding to the second frequency band and the target feature information corresponding to the compressed frequency band can be reduced.
  • a target speech signal with a relatively high similarity to the speech signal to be processed can be recovered during subsequent frequency band expansion.
  • both the first intermediate feature information and the second intermediate feature information include initial amplitudes and initial phases corresponding to a plurality of initial voice audio points.
  • Based on the initial amplitudes corresponding to the initial voice frequency points in the first intermediate feature information, the target amplitude of each target voice frequency point corresponding to the current target sub-frequency band is obtained; based on the target amplitude and target phase of each target voice frequency point corresponding to the current target sub-frequency band, the target feature information corresponding to the current target sub-frequency band is obtained.
  • the voice sending end can perform statistics on the initial amplitudes corresponding to the initial voice frequency points in the first intermediate feature information, and use the calculated statistical value as the target amplitude of each target voice frequency point corresponding to the current target sub-frequency band. For the phase of the frequency points, the voice sending end may obtain the target phase of each target voice frequency point corresponding to the current target sub-frequency band based on the initial phases corresponding to the initial voice frequency points in the second intermediate feature information.
  • the voice sending end can obtain, from the second intermediate feature information, the initial phase of the initial voice frequency point whose frequency is consistent with that of the target voice frequency point, and use it as the target phase of the target voice frequency point; that is, the target phase corresponding to the target voice frequency point follows the original phase.
  • the statistical value may be an arithmetic mean value, a weighted mean value, or the like.
  • the voice sending end can calculate the arithmetic mean value of the initial amplitudes corresponding to the initial voice frequency points in the first intermediate feature information, and use the calculated arithmetic mean value as the target amplitude of each target voice frequency point corresponding to the current target sub-frequency band.
  • the voice sending end may also calculate the weighted average of the initial amplitudes corresponding to the initial voice frequency points in the first intermediate feature information, and use the calculated weighted average as the target amplitude of each target voice frequency point corresponding to the current target sub-frequency band.
  • Since the center frequency point of a frequency band is more important, the voice sending end can assign a higher weight to the initial amplitude of the center frequency point of the frequency band and lower weights to the initial amplitudes of the other frequency points in the frequency band, and then take the weighted average of the initial amplitudes of the frequency points to obtain the weighted average value.
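A sketch of the center-weighted average described above. The specific weight values are illustrative assumptions; the text only says the center frequency point gets a higher weight than the others.

```python
import numpy as np

def weighted_band_amplitude(amps, center_weight=2.0):
    """Weighted mean of a band's bin amplitudes, with extra weight on the
    center bin (hypothetical weights illustrating the idea)."""
    w = np.ones(len(amps))
    w[len(amps) // 2] = center_weight   # boost the center frequency point
    return float(np.average(amps, weights=w))

band_amps = np.array([1.0, 2.0, 9.0, 4.0, 5.0])   # toy initial amplitudes
target_amp = weighted_band_amplitude(band_amps)    # pulled toward the center
```

With these numbers the plain mean would be 4.2, while the center-weighted mean is 5.0, reflecting the larger contribution of the center frequency point.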
  • the voice sending end may further subdivide the initial sub-frequency band corresponding to the current target sub-frequency band and the current target sub-frequency band, to obtain at least two sequentially arranged first sub-frequency bands corresponding to the initial sub-frequency band and at least two sequentially arranged second sub-frequency bands corresponding to the current target sub-frequency band.
  • the voice sending end can establish an association relationship between the first sub-frequency bands and the second sub-frequency bands according to their ordering, and use the statistical value of the initial amplitudes corresponding to the initial voice frequency points in the current first sub-frequency band as the target amplitude of each target voice frequency point in the second sub-frequency band corresponding to the current first sub-frequency band.
  • the current target sub-frequency band is 6-6.4khz
  • the initial sub-frequency band corresponding to the current target sub-frequency band is 6-8khz.
  • the initial sub-frequency band and the current target sub-band are equally divided to obtain two first sub-frequency bands (6-7khz and 7-8khz) and two second sub-frequency bands (6-6.2khz and 6.2khz-6.4khz).
  • 6-7khz corresponds to 6-6.2khz
  • 7-8khz corresponds to 6.2khz-6.4khz.
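For this subdivision example, the per-pair statistic can be sketched as follows; the bin amplitudes are toy values (one bin per 500 hz) chosen only to illustrate that each half is averaged separately.

```python
import numpy as np

# after halving, 6-7khz maps to 6-6.2khz and 7-8khz maps to 6.2-6.4khz
amps_6_7khz = np.array([4.0, 6.0])    # toy bin amplitudes in 6-7khz
amps_7_8khz = np.array([1.0, 3.0])    # toy bin amplitudes in 7-8khz

target_6_62 = amps_6_7khz.mean()      # amplitude for target bins in 6-6.2khz
target_62_64 = amps_7_8khz.mean()     # amplitude for target bins in 6.2-6.4khz
```

Compared with averaging all of 6-8khz into one value, the finer pairing preserves the difference between the lower and upper halves of the initial sub-frequency band.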
  • If the frequency band corresponding to the initial frequency band feature information is equal to the frequency band corresponding to the intermediate frequency band feature information, then the number of initial voice frequency points corresponding to the initial frequency band feature information is equal to the number of target voice frequency points corresponding to the intermediate frequency band feature information.
  • For example, the frequency bands corresponding to the initial frequency band feature information and the middle frequency band feature information are both 0-24khz, and in the initial frequency band feature information and the middle frequency band feature information, the amplitudes and phases of the voice frequency points corresponding to 0-6khz are the same.
  • the target amplitude of the target voice frequency point corresponding to 6-8khz is calculated based on the initial amplitudes of the initial voice frequency points corresponding to 6-24khz in the initial frequency band feature information, and the target phase of the target voice frequency point corresponding to 6-8khz follows the initial phase of the initial voice frequency point corresponding to 6-8khz in the initial frequency band feature information.
  • the target amplitude and target phase of the target voice frequency points corresponding to 8-24khz are zero.
  • If the frequency band corresponding to the initial frequency band feature information is greater than the frequency band corresponding to the intermediate frequency band feature information, then the number of initial voice frequency points corresponding to the initial frequency band feature information is greater than the number of target voice frequency points corresponding to the intermediate frequency band feature information. Further, the ratio of the number of initial voice frequency points to the number of target voice frequency points may be the same as the bandwidth ratio of the initial frequency band feature information to the intermediate frequency band feature information, so as to facilitate the conversion of amplitude and phase between frequency points.
  • the number of initial voice audio points corresponding to the initial frequency band feature information may be 1024, and the number of target voice audio points corresponding to the intermediate frequency band feature information may be 512.
  • the amplitude and phase of the audio frequency points corresponding to 0-6khz are the same.
  • the target amplitude of the target voice audio point corresponding to 6-12khz is calculated based on the initial amplitude of the initial voice audio point corresponding to 6-24khz in the initial frequency band feature information, and the target phase of the target voice audio point corresponding to 6-12khz follows the initial phase of the initial voice audio point corresponding to 6-12khz in the initial frequency band feature information.
  • the amplitude of the target voice audio point is a statistical value of the amplitudes of the corresponding initial voice audio points, and the statistical value can reflect the average level of the amplitudes of the initial voice audio points.
  • the phase of the target voice audio point follows the original phase, which can further reduce the difference between the initial feature information corresponding to the second frequency band and the target feature information corresponding to the compressed frequency band. In this way, a target speech signal with a relatively high similarity to the speech signal to be processed can be recovered during subsequent frequency band expansion. Keeping the original phase of the target speech audio point can also reduce the amount of calculation and improve the efficiency of determining the target feature information.
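The compression described above can be sketched as follows. This is a minimal illustrative Python sketch, not the patent's implementation: the function name `compress_band`, the bin layout, and the use of a simple mean as the statistical value are assumptions; the patent only requires that low-band bins are kept, compressed-band amplitudes are statistics of second-band amplitudes, compressed-band phases follow the original phases, and third-band bins are cleared.

```python
import cmath

def compress_band(bins, keep, comp, total):
    """Compress a one-sided complex spectrum (illustrative layout):
    - bins[:keep]        low band (e.g. 0-6khz): copied unchanged
    - bins[keep:comp]    compressed band (e.g. 6-8khz): mean amplitude of
                         the whole second band, phase follows the original
    - bins[comp:total]   third band: cleared (invalid information)
    Assumes (total - keep) is divisible by (comp - keep), matching the
    equal-bandwidth-ratio condition in the text."""
    out = list(bins[:keep])                      # low band: amplitude and phase unchanged
    n_src, n_dst = total - keep, comp - keep
    group = n_src // n_dst                       # second-band bins averaged per target bin
    for i in range(n_dst):
        src = bins[keep + i * group : keep + (i + 1) * group]
        mag = sum(abs(z) for z in src) / len(src)    # statistical (mean) amplitude
        phase = cmath.phase(bins[keep + i])          # phase follows the original bin
        out.append(cmath.rect(mag, phase))
    out.extend([0j] * (total - comp))            # third band set to zero
    return out
```

With 12 bins, keeping 3 and compressing into 3, each target bin carries the mean amplitude of a group of 3 second-band bins while its phase is unchanged.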
  • obtaining the intermediate frequency band characteristic information based on the target characteristic information corresponding to the first frequency band and the target characteristic information corresponding to the compressed frequency band, and obtaining the compressed speech signal corresponding to the speech signal to be processed based on the intermediate frequency band characteristic information, includes:
  • combining the target feature information corresponding to the first frequency band and the target feature information corresponding to the compressed frequency band to obtain the intermediate frequency band feature information; performing inverse Fourier transform processing on the intermediate frequency band feature information to obtain an intermediate voice signal whose sampling rate is consistent with the sampling rate corresponding to the voice signal to be processed; and down-sampling the intermediate voice signal based on the supported sampling rate to obtain the compressed voice signal.
  • the third frequency band is a frequency band composed of frequencies between the maximum frequency of the compressed frequency band and the maximum frequency of the second frequency band.
  • the inverse Fourier transform process is to perform inverse Fourier transform on the characteristic information of the intermediate frequency band, and convert the frequency domain signal into a time domain signal. Both the intermediate speech signal and the compressed speech signal are time domain signals.
  • the down-sampling process refers to filtering and sampling the speech signal in the time domain. For example, if the sampling rate of the signal is 48khz, it means that 48k points are collected in one second; if the sampling rate of the signal is 16khz, it means that 16k points are collected in one second.
  • the voice sending end can keep the number of voice audio points unchanged when performing frequency band compression, and change the amplitude and phase of some voice audio points, thus obtaining the intermediate frequency band characteristic information. Furthermore, the voice sending end can quickly perform inverse Fourier transform processing on the intermediate frequency band feature information to obtain the intermediate voice signal, whose sampling rate is consistent with the sampling rate corresponding to the voice signal to be processed. Then, the voice sending end performs down-sampling processing on the intermediate voice signal, reducing its sampling rate to the supported sampling rate of the voice encoder or below, to obtain the compressed voice signal.
  • the target characteristic information corresponding to the first frequency band follows the initial characteristic information corresponding to the first frequency band in the initial frequency band characteristic information; the target characteristic information corresponding to the compressed frequency band is obtained based on the initial characteristic information corresponding to the second frequency band in the initial frequency band characteristic information; and the target characteristic information corresponding to the third frequency band is set to invalid information, that is, the target feature information corresponding to the third frequency band is cleared.
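The down-sampling step (for example, 48k samples per second reduced to 16k samples per second, as in the 48khz/16khz example above) can be roughly illustrated as follows. This sketch uses a crude moving-average filter before decimation; a real implementation would use a proper anti-aliasing low-pass filter, and the function name and signature are assumptions for illustration.

```python
def downsample(signal, in_rate, out_rate):
    """Naive down-sampling sketch: filter, then keep every n-th sample.
    Assumes in_rate is an integer multiple of out_rate."""
    assert in_rate % out_rate == 0
    factor = in_rate // out_rate
    # crude low-pass: average each window of `factor` samples
    filtered = [sum(signal[i:i + factor]) / factor
                for i in range(0, len(signal) - factor + 1)]
    # decimate: keep one sample out of every `factor`
    return filtered[::factor]
```

Reducing a 48khz signal to 16khz this way keeps one filtered sample out of every three, so one second of 48k samples yields 16k samples.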
  • the compressed speech signal is encoded by the speech encoding module to obtain encoded speech data corresponding to the speech signal to be processed, including:
  • Voice encoding is performed on the compressed voice signal by the voice encoding module to obtain first voice data; channel coding is performed on the first voice data to obtain coded voice data.
  • speech coding is used to compress the data rate of the speech signal and remove the redundancy in the signal.
  • Speech coding is to encode the analog voice signal and convert the analog signal into a digital signal, so as to reduce the transmission bit rate and carry out digital transmission.
  • Speech coding may also be called source coding. It should be noted that speech encoding does not change the sampling rate of the speech signal.
  • the encoded code stream data can completely restore the speech signal before encoding through decoding processing.
  • the frequency band compression will change the sampling rate of the voice signal.
  • the voice signal after frequency band compression cannot be exactly restored to the voice signal before frequency band compression after frequency band expansion, but the semantic content conveyed by the voice signal before and after frequency band compression is the same, so the listener's understanding is not affected.
  • the voice sending end can use voice coding methods such as waveform coding, parametric coding (sound source coding) and hybrid coding to code the compressed voice signal.
  • channel coding is used to improve the stability of data transmission. Owing to interference and fading in mobile communication and network transmission, errors may occur during voice signal transmission. Therefore, error correction and error detection techniques, that is, error correction and error detection coding, need to be applied to the digital signal to enhance its ability to resist various kinds of interference in the channel and to improve the reliability of voice transmission.
  • the correction and error detection coding of the digital signal to be transmitted in the channel is channel coding.
  • the voice sending end may perform channel coding on the first voice data by using channel coding methods such as convolutional coding and Turbo coding.
  • the voice transmitting end may perform voice coding on the compressed voice signal through the voice coding module to obtain first voice data, and then perform channel coding on the first voice data to obtain coded voice data.
  • the speech coding module may integrate only a speech coding algorithm; in that case, the speech sending end can perform speech coding on the compressed speech signal through the speech coding module to obtain the first speech data, and then perform channel coding on the first speech data through other modules or software programs.
  • the speech coding module can also be integrated with a speech coding algorithm and a channel coding algorithm at the same time.
  • the speech sending end performs speech coding on the compressed speech signal through the speech coding module to obtain the first speech data, and performs channel coding on the first speech data through the speech coding module to obtain the encoded voice data.
  • performing speech coding and channel coding on the compressed speech signal can reduce the amount of data transmitted by the speech signal and ensure the stability of the speech signal transmission.
  • the method also includes:
  • the coded voice data is sent to the voice receiving end, so that the voice receiving end performs voice restoration processing on the coded voice data to obtain a target voice signal corresponding to the voice signal to be processed, and the target voice signal is used for playing.
  • the voice receiving end refers to a device for decoding voice
  • the voice receiving end can receive the voice data sent by the voice sending end, and decode and play the received voice data.
  • speech restoration processing is used to restore the coded speech data to a playable speech signal, for example, to restore a decoded speech signal with a low sampling rate to a speech signal with a high sampling rate, or to decode code stream data with a small data amount into a speech signal with a large data amount.
  • the voice sending end can send the coded voice data to the voice receiving end.
  • the voice receiver can perform voice restoration processing on the coded voice data to obtain a target voice signal corresponding to the voice signal to be processed, so as to play the target voice signal.
  • the voice receiving end may only decode the coded voice data to obtain a compressed voice signal, use the compressed voice signal as the target voice signal, and play the compressed voice signal.
  • although the sampling rate of the compressed speech signal is lower than that of the originally collected speech signal to be processed, the semantic content reflected by the compressed speech signal is consistent with that of the speech signal to be processed, so the compressed speech signal can still be understood by the listener.
  • when performing voice restoration processing, the voice receiving end can decode the coded voice data to obtain a compressed voice signal, restore the compressed voice signal with a low sampling rate to a voice signal with a high sampling rate, and use the restored voice signal as the target voice signal.
  • the target voice signal refers to a voice signal obtained by performing band extension on the compressed voice signal corresponding to the voice signal to be processed, and the sampling rate of the target voice signal is consistent with the sampling rate of the voice signal to be processed.
  • the target speech signal restored by the frequency band expansion is not completely consistent with the original speech signal to be processed, but the semantics reflected by the target speech signal and the speech signal to be processed The content is consistent.
  • the target voice signal has a wider frequency band, contains richer information, has better sound quality, and the sound is clear and intelligible.
  • the coded voice data can be applied to voice communication and voice transmission. Compressing the high-sampling-rate speech signal into a low-sampling-rate speech signal before transmitting can reduce the cost of speech transmission.
  • the encoded voice data is sent to the voice receiving end, so that the voice receiving end performs voice restoration processing on the encoded voice data, obtains a target voice signal corresponding to the voice signal to be processed, and plays the target voice signal, including:
  • the compressed identification information corresponding to the voice signal to be processed is obtained; the encoded voice data and the compressed identification information are sent to the voice receiving end, so that the voice receiving end decodes the encoded voice data to obtain the compressed voice signal, and performs frequency band extension on the compressed voice signal based on the compressed identification information to obtain the target voice signal.
  • the compressed identification information is used to identify frequency band mapping information between the second frequency band and the compressed frequency band.
  • the frequency band mapping information includes the sizes of the second frequency band and the compressed frequency band, and the mapping relationship (correspondence relationship, association relationship) between the sub-frequency bands of the second frequency band and those of the compressed frequency band.
  • Band extension can increase the sampling rate of the speech signal while keeping the speech content intelligible.
  • the frequency band extension refers to expanding the speech signal of the small frequency band to the speech signal of the large frequency band, wherein the speech signal of the small frequency band and the speech signal of the large frequency band have the same low-frequency information.
  • the voice receiving end may assume that the coded voice data has undergone frequency band compression, automatically decode the coded voice data to obtain a compressed voice signal, and perform frequency band expansion on the compressed voice signal to obtain a target voice signal.
  • when the voice sending end sends encoded voice data to the voice receiving end, it can simultaneously send the compressed identification information to the voice receiving end, so that the voice receiving end can quickly identify whether the coded voice data has undergone frequency band compression, as well as the frequency band mapping information used in the compression, and thus determine whether to directly decode and play the coded voice data, or to perform the corresponding frequency band expansion after decoding before playing.
  • in order to save the computing resources of the speech sending end, for a speech signal whose sampling rate is lower than or equal to the supported sampling rate of the speech encoder, the speech sending end can choose to encode it directly using the traditional speech processing method and send it to the speech receiving end.
  • if the voice sending end compresses the frequency band of the voice signal to be processed, the voice sending end can generate compressed identification information corresponding to the voice signal to be processed based on the second frequency band and the compressed frequency band, and send the encoded voice data and the compressed identification information to the voice receiving end, so that the voice receiving end performs frequency band expansion on the compressed voice signal based on the frequency band mapping information corresponding to the compressed identification information to obtain the target voice signal.
  • the compressed voice signal is obtained by decoding and processing the coded voice data at the voice receiving end.
  • the voice sending end can directly obtain a pre-agreed special identifier as the compressed identification information, where the special identifier indicates that the compressed voice signal is obtained by performing frequency band compression based on default frequency band mapping information.
  • the voice receiving end can decode the coded voice data to obtain a compressed voice signal, and perform frequency band extension on the compressed voice signal based on the default frequency band mapping information to obtain a target voice signal. If multiple kinds of frequency band mapping information are stored between the voice sending end and the voice receiving end, the voice sending end and the voice receiving end may agree on preset identifiers corresponding to the various frequency band mapping information.
  • the different frequency band mapping information may be that the sizes of the second frequency band and the compressed frequency band are different, the sub-frequency bands are divided in different ways, and so on.
  • the voice sending end can obtain, as the compressed identification information, the preset identifier corresponding to the frequency band mapping information actually used for feature compression based on the second frequency band and the compressed frequency band.
  • the audio receiving end can perform frequency band extension on the decoded compressed audio signal based on the frequency band mapping information corresponding to the compressed identification information to obtain the target audio signal.
  • the compressed identification information may also directly include specific frequency band mapping information.
  • dedicated frequency band mapping information can be designed for different application programs, for example, for applications with high sound quality requirements such as singing applications, and for applications with low sound quality requirements such as instant messaging applications.
  • in this case, the compressed identification information may also be an application program identifier.
  • after receiving the coded voice data and the compressed identification information, the voice receiving end can perform the corresponding frequency band expansion on the decoded compressed voice signal based on the frequency band mapping information corresponding to the application program identifier to obtain the target voice signal.
  • the encoded voice data and compressed identification information are sent to the voice receiving end, so that the voice receiving end can more accurately perform frequency band extension on the decoded compressed voice signal, and obtain a target voice signal with a high degree of restoration.
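The compressed identification information can be thought of as a key into a registry of pre-agreed frequency band mapping information shared by both ends. The sketch below is purely illustrative: the identifier names, field names, and frequency values are assumptions, not taken from the patent.

```python
# Hypothetical registry of pre-agreed identifiers -> frequency band mapping
# information, assumed to be stored identically at sender and receiver.
BAND_MAPPINGS = {
    "default":   {"second_band": (6000, 24000), "compressed_band": (6000, 8000)},
    "singing":   {"second_band": (6000, 24000), "compressed_band": (6000, 12000)},
    "messaging": {"second_band": (4000, 16000), "compressed_band": (4000, 6000)},
}

def resolve_mapping(compressed_id):
    """Receiver-side lookup: map the compressed identification information
    (a pre-agreed identifier or application id) to band mapping info.
    No identifier means no band compression: decode and play directly."""
    if compressed_id is None:
        return None
    # unknown identifiers fall back to the default mapping
    return BAND_MAPPINGS.get(compressed_id, BAND_MAPPINGS["default"])
```

The receiver then uses the resolved mapping to decide between direct playback and frequency band expansion after decoding.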
  • a voice decoding method is provided, and the method is applied to the voice receiving end in Figure 1 as an example, including the following steps:
  • step S502 coded voice data is acquired, and the coded voice data is obtained by performing voice compression processing on the voice signal to be processed.
  • the speech compression process is used to compress the speech signal to be processed into code stream data that can be transmitted, for example, compress the speech signal with high sampling rate into speech signal with low sampling rate, and then encode the speech signal with low sampling rate into code stream data, or encode a voice signal with a large amount of data into stream data with a small amount of data.
  • the audio receiving end acquires encoded audio data, wherein the encoded audio data may be obtained by the audio receiving end itself performing encoding processing on the speech signal to be processed, or may be received by the audio receiving end from the audio sending end.
  • the encoded speech data may be obtained by directly encoding the speech signal to be processed, or may be obtained by performing frequency band compression on the speech signal to be processed to obtain a compressed speech signal and then encoding the compressed speech signal.
  • step S504 the coded voice data is decoded by the voice decoding module to obtain a decoded voice signal, and the target sampling rate corresponding to the decoded voice signal is less than or equal to the supported sampling rate corresponding to the voice decoding module.
  • the voice decoding module is a module for decoding voice signals.
  • the voice decoding module can be hardware or software.
  • the voice encoding module and the voice decoding module can be integrated on one module.
  • the supported sampling rate corresponding to the speech decoding module refers to the maximum sampling rate supported by the speech decoding module, that is, the upper limit of the sampling rate. It can be understood that if the supported sampling rate of the speech decoding module is 16khz, then the speech decoding module can decode the speech signal whose sampling rate is less than or equal to 16khz.
  • the coded voice data can be decoded by the voice decoding module to obtain a decoded voice signal, and the voice signal before encoding can be restored.
  • the voice decoding module supports processing voice signals whose sampling rate is less than or equal to the upper limit of the sampling rate.
  • the speech signal is decoded into a time domain signal.
  • the voice receiving end may decode the encoded voice data to obtain a decoded voice signal.
  • Step S506 generating target frequency band feature information corresponding to the decoded speech signal, and obtaining extended feature information corresponding to the first frequency band based on the target feature information corresponding to the first frequency band in the target frequency band feature information.
  • the target frequency band corresponding to the decoded voice signal includes a first frequency band and a compressed frequency band, and the frequency of the first frequency band is smaller than that of the compressed frequency band.
  • the voice receiving end may divide the target frequency band characteristic information into target characteristic information corresponding to the first frequency band and target characteristic information corresponding to the compressed frequency band. That is, the target frequency band feature information may be divided into target feature information corresponding to a low frequency band and target feature information corresponding to a high frequency band.
  • the target feature information refers to feature information corresponding to each frequency before the frequency band is expanded, and the extended feature information refers to feature information corresponding to each frequency after the frequency band is expanded.
  • the voice receiving end may extract frequency domain features of the decoded voice signal, converting the time domain signal into a frequency domain signal, to obtain the target frequency band feature information corresponding to the decoded voice signal. It can be understood that if the sampling rate of the speech signal to be processed is higher than the supported sampling rate corresponding to the speech encoding module, the speech encoding end performs frequency band compression on the speech signal to be processed to reduce its sampling rate; in this case, the speech receiving end needs to extend the frequency band of the decoded voice signal so as to restore the high-sampling-rate voice signal to be processed, and the decoded voice signal is a compressed voice signal. If the voice signal to be processed has not undergone frequency band compression, the voice receiving end can still perform frequency band expansion on the decoded voice signal to increase its sampling rate and enrich the frequency domain information.
  • when performing frequency band expansion, in order to ensure that the semantic content remains unchanged and natural and intelligible, the voice receiving end can keep the low-frequency information unchanged and expand the high-frequency information. Therefore, the voice receiving end can obtain the extended feature information corresponding to the first frequency band based on the target feature information corresponding to the first frequency band in the target frequency band feature information, that is, use the target feature information corresponding to the first frequency band in the target frequency band feature information as the extended feature information corresponding to the first frequency band. In other words, before and after the frequency band extension, the low-frequency information remains unchanged and consistent. Similarly, the voice receiving end may divide the target frequency band into the first frequency band and the compressed frequency band based on the preset frequency.
  • Step S508, performing feature expansion on the target feature information corresponding to the compressed frequency band in the target frequency band feature information to obtain the expanded feature information corresponding to the second frequency band; the frequency of the first frequency band is less than the frequency of the compressed frequency band, and the frequency range of the compressed frequency band is smaller than the frequency range of the second frequency band.
  • the feature extension is to expand the feature information corresponding to the small frequency band to the feature information corresponding to the large frequency band, so as to enrich the feature information.
  • the compressed frequency band represents a small frequency band
  • the second frequency band represents a large frequency band, that is, the frequency interval of the compressed frequency band is smaller than that of the second frequency band, that is, the length of the compressed frequency band is smaller than that of the second frequency band.
  • the voice receiving end when performing frequency band extension, mainly expands the high-frequency information in the voice signal.
  • the voice receiving end may perform feature expansion on the target feature information corresponding to the compressed frequency band in the target frequency band feature information to obtain extended feature information corresponding to the second frequency band.
  • the target frequency band feature information includes amplitudes and phases corresponding to multiple target voice audio points.
  • the voice receiving end can copy the amplitude of the target voice audio point corresponding to the compressed frequency band in the target frequency band feature information to obtain the amplitude of the initial voice audio point corresponding to the second frequency band, and copy or randomly assign the phase of the target voice audio point corresponding to the compressed frequency band in the target frequency band feature information to obtain the phase of the initial voice audio point corresponding to the second frequency band, thereby obtaining the extended feature information corresponding to the second frequency band.
  • the copying may also be performed further in sections.
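A minimal sketch of this amplitude-copy / random-phase expansion might look as follows; the bin layout, the function name `expand_band`, and the choice of cyclic copying are assumptions for illustration, not the patent's method.

```python
import cmath
import random

def expand_band(bins, keep, comp, total, seed=0):
    """Expand a one-sided complex spectrum (illustrative layout):
    - bins[:comp]   low band plus compressed band: kept unchanged
    - the range [comp, total) is filled by copying amplitudes from the
      compressed band [keep, comp) cyclically, with randomly assigned
      phases (the text allows copied or random phases)."""
    rng = random.Random(seed)                 # seeded for reproducibility
    out = list(bins[:comp])
    n_src = comp - keep
    for i in range(total - comp):
        mag = abs(bins[keep + i % n_src])     # copy amplitude from compressed band
        phase = rng.uniform(-cmath.pi, cmath.pi)  # random phase
        out.append(cmath.rect(mag, phase))
    return out
```

With 4 input bins, keeping the first 2 and treating the next 2 as the compressed band, expanding to 8 bins repeats the compressed-band amplitudes over the new high-band bins.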
  • Step S510, obtaining the extended frequency band characteristic information based on the extended characteristic information corresponding to the first frequency band and the extended characteristic information corresponding to the second frequency band, and obtaining the target speech signal corresponding to the speech signal to be processed based on the extended frequency band characteristic information; the sampling rate of the target speech signal is greater than the target sampling rate, and the target speech signal is used for playback.
  • the extended frequency band feature information refers to feature information obtained by extending the target frequency band feature information.
  • the target speech signal refers to a speech signal obtained after the decoded speech signal is subjected to frequency band extension.
  • Band extension can increase the sampling rate of the speech signal while keeping the speech content intelligible. It can be understood that the sampling rate of the target speech signal is greater than the corresponding sampling rate of the decoded speech signal.
  • the voice receiving end obtains the extended frequency band feature information based on the extended feature information corresponding to the first frequency band and the extended feature information corresponding to the second frequency band.
  • the extended frequency band feature information is a frequency domain signal.
  • the voice receiving end can convert the frequency domain signal into a time domain signal to obtain a target voice signal. For example, the voice receiving end performs inverse Fourier transform processing on the feature information of the extended frequency band to obtain the target voice signal.
  • the sampling rate of the decoded speech signal is 16khz
  • the target frequency band is 0-8khz.
  • the voice receiving end can obtain the target feature information corresponding to 0-6khz from the target frequency band feature information, and directly use the target feature information corresponding to 0-6khz as the extended feature information corresponding to 0-6khz.
  • the voice receiving end can obtain target feature information corresponding to 6-8khz from the target frequency band feature information, and expand the target feature information corresponding to 6-8khz into extended feature information corresponding to 6-24khz.
  • the speech receiving end can generate the target speech signal based on the extended feature information corresponding to 0-24khz, and the sampling rate corresponding to the target speech signal is 48khz.
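The bin bookkeeping in this 16khz-to-48khz example can be checked with a short helper. The one-sided 512-bin spectrum size is an assumed example value (matching the 1024/512 point counts mentioned earlier), and the function is illustrative only.

```python
def extension_plan(decoded_rate, keep_hz, out_rate, n_bins):
    """Bookkeeping for band extension: given the decoded sampling rate,
    the boundary below which bins are kept unchanged, the output sampling
    rate, and the number of one-sided spectrum bins for the decoded signal,
    return (kept bins, compressed-band bins, total output bins)."""
    nyq_in, nyq_out = decoded_rate // 2, out_rate // 2   # e.g. 8khz and 24khz
    hz_per_bin = nyq_in / n_bins                          # frequency resolution per bin
    keep_bins = int(keep_hz / hz_per_bin)                 # 0-6khz bins kept as-is
    comp_bins = n_bins - keep_bins                        # 6-8khz bins to expand from
    out_bins = int(nyq_out / hz_per_bin)                  # bins covering 0-24khz
    return keep_bins, comp_bins, out_bins
```

For the example above (decoded at 16khz, keep 0-6khz, output at 48khz, 512 bins), 384 bins are kept, 128 compressed-band bins are expanded, and the output spectrum has 1536 bins, i.e. three times as many, consistent with tripling the sampling rate.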
  • the target voice signal is used for playing, and after obtaining the target voice signal, the voice receiving end can play the target voice signal through a loudspeaker.
  • the encoded speech data is obtained by performing speech compression processing on the speech signal to be processed; the speech decoding module decodes the encoded speech data to obtain the decoded speech signal, whose target sampling rate is less than or equal to the supported sampling rate corresponding to the speech decoding module; the target frequency band feature information corresponding to the decoded speech signal is generated, and the extended feature information corresponding to the first frequency band is obtained based on the target feature information corresponding to the first frequency band in the target frequency band feature information; feature expansion is performed on the target feature information corresponding to the compressed frequency band in the target frequency band feature information to obtain the extended feature information corresponding to the second frequency band, where the frequency of the first frequency band is less than the frequency of the compressed frequency band and the frequency interval of the compressed frequency band is smaller than the frequency interval of the second frequency band; and the extended frequency band feature information is obtained based on the extended feature information corresponding to the first frequency band and the extended feature information corresponding to the second frequency band, from which the target speech signal is obtained.
  • the coded voice data obtained through the voice compression process can be decoded to obtain the decoded voice signal, and the sampling rate of the decoded voice signal can be increased through the expansion of the frequency band feature information to obtain a target speech signal.
  • the playback of the voice signal is not limited by the sampling rate supported by the voice decoder.
  • the high-sampling rate voice signal with richer information can also be played.
  • the coded voice data is decoded and processed by the voice decoding module to obtain a decoded voice signal, including:
  • channel decoding can be considered as an inverse process of channel coding.
  • Speech decoding can be considered as the inverse process of speech coding.
  • the voice receiving end first performs channel decoding on the coded voice data to obtain the second voice data, and then performs voice decoding on the second voice data through the voice decoding module to obtain the decoded voice signal.
  • the speech decoding module may integrate only a speech decoding algorithm; in that case, the speech receiving end can perform channel decoding on the coded speech data through other modules or software programs to obtain the second speech data, and then perform speech decoding on the second speech data through the speech decoding module.
  • the voice decoding module can also be integrated with a voice decoding algorithm and a channel decoding algorithm at the same time, then the voice receiving end can perform channel decoding on the encoded voice data through the voice decoding module to obtain the second voice data, and perform voice decoding on the second voice data through the voice decoding module Get the decoded speech signal.
  • binary data can be restored to a time-domain signal to obtain a speech signal.
  • the feature extension is performed on the target feature information corresponding to the compressed frequency band in the target frequency band feature information to obtain the extended feature information corresponding to the second frequency band, including:
  • Obtain frequency band mapping information, the frequency band mapping information being used to determine the mapping relationship between at least two target sub-frequency bands corresponding to the compressed frequency band and at least two initial sub-frequency bands corresponding to the second frequency band; based on the frequency band mapping information, perform feature expansion on the target feature information corresponding to the compressed frequency band in the target frequency band feature information to obtain the extended feature information corresponding to the second frequency band.
  • the frequency band mapping information is used to determine a mapping relationship between at least two target sub-frequency bands corresponding to the compressed frequency band and at least two initial sub-frequency bands corresponding to the second frequency band.
  • the speech coding end performs feature compression on initial feature information corresponding to the second frequency band in the initial frequency band feature information based on the mapping relationship, to obtain target feature information corresponding to the compressed frequency band.
  • the speech decoding end performs feature expansion on the target feature information corresponding to the compressed frequency band in the target frequency band feature information based on the mapping relationship, so as to restore the initial feature information corresponding to the second frequency band as closely as possible and obtain the extended feature information corresponding to the second frequency band.
  • the voice receiving end may obtain frequency band mapping information, and perform feature expansion on target feature information corresponding to the compressed frequency band in the target frequency band feature information based on the frequency band mapping information, to obtain extended feature information corresponding to the second frequency band.
  • the voice receiving end and the voice sending end may agree on default frequency band mapping information in advance.
  • the voice sending end performs feature compression based on the default frequency band mapping information, and the voice receiving end performs feature expansion based on the default frequency band mapping information.
  • the voice receiving end and the voice sending end may also agree in advance on various candidate frequency band mapping information.
  • the voice sending end selects one kind of frequency band mapping information for feature compression, generates compression identification information, and sends it to the voice receiving end, so that the voice receiving end can determine the corresponding frequency band mapping information based on the compression identification information and then perform feature expansion based on that frequency band mapping information.
  • the voice receiving end can also simply assume by default that the decoded voice signal is a voice signal obtained through frequency band compression.
  • the frequency band mapping information can be preset and unified frequency band mapping information.
  • by extending the target feature information corresponding to the compressed frequency band in the target frequency band feature information to obtain the extended feature information corresponding to the second frequency band, relatively accurate extended feature information can be obtained, which helps to restore a higher-quality target speech signal.
  • the encoded speech data carries compressed identification information.
  • Get frequency band mapping information including:
  • Frequency band mapping information is acquired based on the compressed identification information.
  • when performing frequency band compression, the voice sending end can generate compression identification information based on the frequency band mapping information used in feature compression, and associate the encoded voice data corresponding to the compressed voice signal with the corresponding compression identification information, so that when the frequency band is subsequently extended, the voice receiving end can obtain the corresponding frequency band mapping information based on the compression identification information carried by the encoded voice data, and perform frequency band expansion on the decoded voice signal based on that frequency band mapping information.
  • the voice sending end can generate compressed identification information based on the frequency band mapping information used in feature compression, and then the voice sending end sends the encoded voice data and the compressed identification information to the voice receiving end.
  • the voice receiving end can obtain the frequency band mapping information based on the compressed identification information and perform frequency band extension on the decoded voice signal obtained through decoding.
  • the decoded voice signal is obtained through frequency band compression, and correct frequency band mapping information can be quickly obtained, thereby restoring a more accurate target voice signal.
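The identifier-to-mapping lookup described above might be sketched as follows; the registry contents, identifier values, and names are illustrative assumptions, not values taken from this application:

```python
# Hypothetical registry of candidate band-mapping tables agreed in advance
# by the sending and receiving ends; keys are the compression identifiers
# carried alongside the coded voice data. All values are illustrative.
CANDIDATE_BAND_MAPS = {
    1: {(6000, 24000): (6000, 8000)},          # one coarse mapping (Hz)
    2: {(6000, 8000): (6000, 6400),            # finer sub-band mapping,
        (8000, 10000): (6400, 6800),           # matching Fig. 6B of the text
        (10000, 12000): (6800, 7200),
        (12000, 18000): (7200, 7600),
        (18000, 24000): (7600, 8000)},
}

def band_map_for(compression_id):
    """Return the band-mapping table selected by a compression identifier."""
    return CANDIDATE_BAND_MAPS[compression_id]
```

With such a registry, the receiving end only needs the small integer identifier from the coded stream to recover the full mapping used by the sender.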
  • the target feature information corresponding to the compressed frequency band in the target frequency band feature information is subjected to feature expansion to obtain the extended feature information corresponding to the second frequency band, including:
  • obtain the extended feature information corresponding to the current initial sub-frequency band based on the third intermediate feature information and the fourth intermediate feature information; obtain the extended feature information corresponding to the second frequency band based on the extended feature information corresponding to each initial sub-frequency band.
  • the voice receiving end can determine the mapping relationship between at least two target sub-frequency bands corresponding to the compressed frequency band and at least two initial sub-frequency bands corresponding to the second frequency band, so that by performing feature extension on the target feature information corresponding to each target sub-frequency band, it can obtain the extended feature information of the initial sub-frequency band corresponding to each target sub-frequency band, and finally obtain the extended feature information corresponding to the second frequency band.
  • the current initial sub-frequency band refers to the initial sub-frequency band for which extended feature information is currently to be generated.
  • when generating the extended feature information corresponding to the current initial sub-frequency band, the voice receiving end can use the target feature information of the current target sub-frequency band corresponding to the current initial sub-frequency band as the third intermediate feature information; the third intermediate feature information is used to determine the amplitudes of the frequency points in the extended feature information corresponding to the current initial sub-frequency band.
  • the voice receiving end can obtain, from the target frequency band feature information, the target feature information corresponding to the sub-frequency band consistent with the frequency band information of the current initial sub-frequency band as the fourth intermediate feature information; the fourth intermediate feature information is used to determine the phases of the frequency points in the extended feature information corresponding to the current initial sub-frequency band.
  • the voice receiving end can obtain extended feature information corresponding to the current initial sub-frequency band based on the third intermediate feature information and the fourth intermediate feature information.
  • the speech receiving end can obtain the extended feature information corresponding to the second frequency band based on the extended feature information corresponding to each initial sub-frequency band; the extended feature information corresponding to the initial sub-frequency bands together forms the extended feature information corresponding to the second frequency band.
  • the target frequency band feature information includes target feature information corresponding to 0-8khz.
  • the current initial sub-band is 6-8khz
  • the target sub-band corresponding to the current initial sub-band is 6-6.4khz.
  • the voice receiving end can obtain the extended feature information corresponding to 6-8khz based on the target feature information corresponding to 6-6.4khz and the target feature information corresponding to 6-8khz in the target frequency band feature information.
  • feature extension performed by further subdividing the compressed frequency band and the second frequency band can improve the reliability of feature extension and reduce the difference between the extended feature information corresponding to the second frequency band and the initial feature information corresponding to the second frequency band. In this way, a target speech signal with relatively high similarity to the speech signal to be processed can finally be restored.
  • both the third intermediate feature information and the fourth intermediate feature information include target amplitudes and target phases corresponding to multiple target voice audio points.
  • the extended feature information corresponding to the current initial sub-frequency band is obtained, including:
  • obtain the reference amplitude of each initial voice frequency point corresponding to the current initial sub-frequency band based on the target amplitude corresponding to each target voice frequency point in the third intermediate feature information; when the fourth intermediate feature information is empty, add a random disturbance value to the phase of each initial voice frequency point corresponding to the current initial sub-frequency band to obtain the reference phase of each initial voice frequency point corresponding to the current initial sub-frequency band; when the fourth intermediate feature information is not empty, obtain the reference phase of each initial voice frequency point corresponding to the current initial sub-frequency band based on the target phase corresponding to each target voice frequency point in the fourth intermediate feature information; obtain the extended feature information corresponding to the current initial sub-frequency band based on the reference amplitude and reference phase of each initial voice frequency point corresponding to the current initial sub-frequency band.
  • the voice receiving end may use the target amplitude corresponding to each target voice frequency point in the third intermediate feature information as the reference amplitude of each initial voice frequency point corresponding to the current initial sub-frequency band.
  • when the fourth intermediate feature information is empty, the voice receiving end adds a random perturbation value to the target phase of each target voice frequency point corresponding to the current target sub-frequency band to obtain the reference phase of each initial voice frequency point corresponding to the current initial sub-frequency band. It can be understood that if the fourth intermediate feature information is empty, the current initial sub-frequency band does not exist in the target frequency band feature information, and this part has no energy or phase.
  • since the signal needs an amplitude and a phase at each frequency point, the amplitude can be obtained by copying, and the phase can be obtained by adding a random disturbance value. Moreover, the human ear is not sensitive to high-frequency phase, so randomly assigning phases in the high-frequency part has little effect. If the fourth intermediate feature information is not empty, the voice receiving end can obtain, from the fourth intermediate feature information, the target phase of the target voice frequency point consistent with the frequency of the initial voice frequency point as the reference phase of that initial voice frequency point; that is, the reference phase corresponding to the initial voice frequency point may follow the original phase.
  • the random disturbance value is a random phase value. It can be understood that the value of the reference phase needs to be within the value range of the phase.
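The amplitude-copy and phase rules above can be sketched in numpy as follows; the function name, the point-count handling, and the uniform phase range are assumptions for illustration:

```python
import numpy as np

def subband_reference(target_amps, orig_phases, n_points, rng=None):
    """Sketch: build reference amplitudes/phases for one initial sub-band.

    target_amps : target amplitudes of the mapped target sub-band
                  (third intermediate feature information)
    orig_phases : phases at the same frequencies in the decoded spectrum
                  (fourth intermediate feature information), or None when empty
    n_points    : number of frequency points in the initial sub-band
    """
    rng = rng or np.random.default_rng()
    # Amplitude: copy the target amplitudes across the wider band.
    reps = -(-n_points // len(target_amps))        # ceiling division
    ref_amp = np.tile(target_amps, reps)[:n_points]
    if orig_phases is None:
        # Band absent from the decoded spectrum: random phase, kept
        # inside the valid phase range [-pi, pi).
        ref_phase = rng.uniform(-np.pi, np.pi, n_points)
    else:
        # Band present in the decoded spectrum: follow the original phase.
        ref_phase = orig_phases[:n_points]
    return ref_amp, ref_phase
```

The random branch corresponds to the "fourth intermediate feature information is empty" case; the other branch follows the original phase.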
  • for example, suppose the target frequency band feature information includes target feature information corresponding to 0-8khz, and the extended frequency band feature information includes extended feature information corresponding to 0-24khz.
  • if the current initial sub-band is 6-8khz and the target sub-band corresponding to it is 6-6.4khz, the voice receiving end can use the target amplitude of each target voice frequency point corresponding to 6-6.4khz as the reference amplitude of each initial voice frequency point corresponding to 6-8khz, and use the target phase of each target voice frequency point corresponding to 6-6.4khz as the reference phase of each initial voice frequency point corresponding to 6-8khz.
  • similarly, if the current initial sub-band is 8-10khz and the target sub-band corresponding to it is 6.4-6.8khz, the voice receiving end can use the target amplitude of each target voice frequency point corresponding to 6.4-6.8khz as the reference amplitude of each initial voice frequency point corresponding to 8-10khz, and use the target phase of each target voice frequency point corresponding to 6.4-6.8khz plus a random perturbation value as the reference phase of each initial voice frequency point corresponding to 8-10khz.
  • the number of initial voice frequency points in the extended frequency band feature information may be equal to the number of initial voice frequency points in the initial frequency band feature information.
  • the number of initial voice frequency points corresponding to the second frequency band in the extended frequency band feature information is greater than the number of target voice frequency points corresponding to the compressed frequency band in the target frequency band feature information, and the ratio of the number of initial voice frequency points to the number of target voice frequency points equals the ratio of the frequency band of the extended frequency band feature information to the frequency band of the target frequency band feature information.
  • the amplitude of an initial voice frequency point is the amplitude of the corresponding target voice frequency point, and the phase of an initial voice frequency point follows the original phase or is a random value, which can reduce the difference between the extended feature information corresponding to the second frequency band and the initial feature information corresponding to the second frequency band.
  • the present application also provides an application scenario, where the above speech encoding and speech decoding methods are applied.
  • the application of the speech encoding and speech decoding methods in this application scenario is as follows:
  • the coding and decoding of voice signals plays an important role in modern communication systems.
  • the coding and decoding of voice signals can effectively reduce the bandwidth of voice signal transmission, and play a decisive role in saving voice information storage and transmission costs and ensuring the integrity of voice information during communication network transmission.
  • the clarity of speech is directly related to the spectral frequency.
  • Traditional fixed-line telephones use narrowband speech with a sampling rate of 8khz; the sound quality is poor, the sound is fuzzy, and intelligibility is low.
  • VoIP (voice over IP) telephones usually use wideband voice with a sampling rate of 16khz, which has good sound quality and clear, intelligible sound; an even better sound-quality experience comes from ultra-wideband or even full-band voice, whose sampling rate can reach 48khz and which preserves the sound with higher fidelity.
  • Speech encoders used at different sampling rates are different encoders, or different modes of the same encoder, and the corresponding speech coding stream sizes also differ.
  • Traditional speech encoders only support speech signals with a specific sampling rate.
  • AMR-NB (Adaptive Multi-Rate Narrowband speech codec)
  • AMR-WB (Adaptive Multi-Rate Wideband speech codec)
  • The higher the sampling rate, the greater the bandwidth consumed by the speech coding stream. To obtain a better voice experience, the voice band must be widened, for example by increasing the sampling rate from 8khz to 16khz or even 48khz; but existing solutions must then modify or replace the voice codec of the existing client and backend transmission system, and the increase in bandwidth inevitably raises operating costs. In other words, the end-to-end voice sampling rate in existing solutions is limited by the settings of the voice codec, and it is impossible to break through the voice band to obtain a better sound-quality experience; improving the sound-quality experience requires modifying the voice codec parameters or replacing the codec with one that supports a higher sampling rate, which inevitably leads to system upgrades, increased operating costs, and a larger development workload and development cycle.
  • With the method of the present application, the speech sampling rate of an existing call system can be upgraded without such changes, achieving a call experience beyond the existing speech band and effectively improving voice clarity and intelligibility, while operating costs are basically unaffected.
  • the voice sending end collects a high-quality voice signal and performs nonlinear frequency band compression processing on it, compressing the original high-sampling-rate voice signal into a low-sampling-rate voice signal supported by the voice encoder of the communication system.
  • the voice sending end performs voice coding and channel coding on the compressed voice signal, and finally transmits it to the voice receiving end through the network.
  • the voice sending end can compress the high-frequency part of the signal. For example, for a full-band 48khz signal (that is, the sampling rate is 48khz and the frequency range is within 24khz), all frequency band information is concentrated into the 16khz signal range (that is, the sampling rate is 16khz and the frequency band range is within 8khz), high-frequency content above the 16khz sampling range is suppressed to zero, and the signal is then down-sampled to a 16khz signal.
  • the low-sampling rate signal obtained through nonlinear frequency band compression processing can be encoded by a conventional 16khz speech encoder to obtain code stream data.
  • the essence of nonlinear frequency band compression is to leave the part of the spectrum below 6khz unmodified and to compress only the spectrum signal from 6khz to 24khz.
  • the frequency band mapping information may be as shown in FIG. 6B during frequency band compression. Before compression, the frequency band of the voice signal is 0-24khz, the first frequency band is 0-6khz, and the second frequency band is 6-24khz.
  • the second frequency band can be further subdivided into 6-8khz, 8-10khz, 10-12khz, 12-18khz, 18-24khz, a total of 5 sub-bands.
  • the frequency band of the voice signal can still be 0-24khz
  • the first frequency band is 0-6khz
  • the compressed frequency band is 6-8khz
  • the third frequency band is 8-24khz.
  • the compressed frequency band can be further subdivided into 6-6.4khz, 6.4-6.8khz, 6.8-7.2khz, 7.2-7.6khz, 7.6-8khz, a total of 5 sub-bands.
  • 6-8khz corresponds to 6-6.4khz
  • 8-10khz corresponds to 6.4-6.8khz
  • 10-12khz corresponds to 6.8-7.2khz
  • 12-18khz corresponds to 7.2-7.6khz
  • 18-24khz corresponds to 7.6-8khz.
  • the amplitude and phase of each frequency point are obtained after fast Fourier transform of the speech signal with high sampling rate.
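As a minimal sketch of this step (the frame length and rectangular per-frame windowing are assumptions):

```python
import numpy as np

def frame_features(frame, sample_rate):
    """Amplitude and phase of every frequency point of one speech frame."""
    spec = np.fft.rfft(frame)                        # fast Fourier transform
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sample_rate)
    return freqs, np.abs(spec), np.angle(spec)
```

Each returned triple gives, per frequency point, the frequency in Hz, the amplitude, and the phase that the compression and expansion steps below operate on.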
  • the information on the first band remains unchanged.
  • the statistical value of the amplitudes of the frequency points in each sub-band on the left side of Figure 6B is used as the amplitude of the frequency points in the corresponding sub-band on the right, and the phases of the frequency points in the right sub-band can use the original phase values.
  • the amplitudes of the frequency points in the left 6khz-8khz band are summed and averaged; the average value is used as the amplitude of each frequency point in the right 6khz-6.4khz band, and the phase value of each frequency point in the right 6khz-6.4khz band is the original phase value.
  • the amplitude and phase information of the frequency points in the third frequency band is cleared.
  • the frequency domain signal of 0-24khz on the right is processed by inverse Fourier transform and down-sampling to obtain the compressed speech signal.
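The compression steps just described (per-sub-band amplitude averaging, clearing the third band, inverse transform, down-sampling) can be sketched as follows; single-frame processing, the plain mean as the statistical value, and the naive decimation without an anti-alias filter are simplifying assumptions:

```python
import numpy as np

# Sub-band mapping of Fig. 6B (Hz), transcribed from the text:
# each source band on the left is squeezed into the target band on the right.
BAND_MAP = [
    ((6000, 8000),  (6000, 6400)),
    ((8000, 10000), (6400, 6800)),
    ((10000, 12000), (6800, 7200)),
    ((12000, 18000), (7200, 7600)),
    ((18000, 24000), (7600, 8000)),
]

def compress_frame(x, sr=48000, out_sr=16000):
    """Nonlinear band compression of one 48khz frame into a 16khz frame."""
    n = len(x)
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n, 1.0 / sr)
    amp, phase = np.abs(spec), np.angle(spec)
    new_amp = amp.copy()                 # 0-6khz stays untouched
    for (lo, hi), (tlo, thi) in BAND_MAP:
        src = (freqs >= lo) & (freqs < hi)
        tgt = (freqs >= tlo) & (freqs < thi)
        # Mean source-band amplitude becomes the amplitude of every
        # target-band point; target points keep their original phase.
        new_amp[tgt] = amp[src].mean()
    # Third frequency band (above 8khz): amplitudes cleared to zero.
    new_amp[freqs >= 8000] = 0.0
    y = np.fft.irfft(new_amp * np.exp(1j * phase), n)
    return y[:: sr // out_sr]            # naive 48khz -> 16khz decimation
```

Because everything above 8khz is zeroed before decimation, the naive stride-3 down-sampling introduces no aliasing here, though a production system would still filter first.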
  • (a) is the speech signal before compression
  • (b) is the speech signal after compression.
  • the upper part is the time domain signal
  • the lower part is the frequency domain signal.
  • Although the low-sampling-rate speech signal obtained after nonlinear frequency band compression is not as clear as the original high-sampling-rate speech signal, the sound remains naturally intelligible without perceivable noise or discomfort, so even if the speech receiving end is existing, unmodified network equipment, the call experience is not hindered. The method of the present application therefore has better compatibility.
  • after receiving the code stream data, the voice receiving end performs channel decoding and voice decoding on it, and then performs nonlinear frequency band extension processing to restore the low-sampling-rate voice signal to a high-sampling-rate voice signal. Finally, the high-sampling-rate voice signal is played.
  • the nonlinear frequency band expansion process re-expands the compressed 6khz-8khz signal into the 6khz-24khz spectral signal; that is, after the Fourier transform, the amplitudes of the frequency points in each sub-band before expansion are used as the amplitudes of the frequency points in the corresponding sub-band after expansion, and the phases either follow the original phases or are obtained by adding a random disturbance value to the phase values of the frequency points in the sub-band before expansion.
  • after the expanded spectrum signal is inverse Fourier transformed, a high-sampling-rate voice signal can be obtained.
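The expansion steps can likewise be sketched; embedding the decoded spectrum by zero-padding and the uniform random phase above 8khz are simplifying assumptions:

```python
import numpy as np

# The Fig. 6B mapping read in the expansion direction (Hz): each
# compressed sub-band on the left is spread back over the band on the right.
BAND_MAP = [
    ((6000, 6400), (6000, 8000)),
    ((6400, 6800), (8000, 10000)),
    ((6800, 7200), (10000, 12000)),
    ((7200, 7600), (12000, 18000)),
    ((7600, 8000), (18000, 24000)),
]

def expand_frame(y, sr=16000, out_sr=48000, rng=None):
    """Nonlinear band expansion of one decoded 16khz frame to 48khz."""
    rng = rng or np.random.default_rng(0)
    factor = out_sr // sr
    n = len(y) * factor
    spec_lo = np.fft.rfft(y)
    # Embed the decoded 0-8khz spectrum in a 0-24khz frame,
    # scaled so the time-domain amplitude is preserved.
    spec = np.zeros(n // 2 + 1, dtype=complex)
    spec[: len(spec_lo)] = spec_lo * factor
    freqs = np.fft.rfftfreq(n, 1.0 / out_sr)
    amp, phase = np.abs(spec), np.angle(spec)
    amp0 = amp.copy()                    # read sources before overwriting
    for (slo, shi), (lo, hi) in BAND_MAP:
        src = (freqs >= slo) & (freqs < shi)   # compressed sub-band
        tgt = (freqs >= lo) & (freqs < hi)     # sub-band to restore
        amp[tgt] = amp0[src].mean()            # copy the amplitude statistic
        if lo >= 8000:
            # No original phase survives above 8khz: assign random phases
            # (the ear is insensitive to high-band phase).
            phase[tgt] = rng.uniform(-np.pi, np.pi, int(tgt.sum()))
        # For 6-8khz the decoded ("original") phase is followed unchanged.
    return np.fft.irfft(amp * np.exp(1j * phase), n)
```

Note the `amp0` copy: the 6-8khz source region overlaps the first restored band, so the source amplitudes must be read before any target band is overwritten.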
  • in this way, the original speech codec can realize a super-band codec effect, achieving a call experience beyond the existing voice band and effectively improving voice clarity and intelligibility.
  • in addition to voice calls, the voice encoding and decoding methods of the present application can also be applied to voice content storage, such as voice in video, voice messages, and other scenarios involving speech codec applications.
  • Although the steps in the flowcharts of FIG. 2, FIG. 3, and FIG. 5 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless otherwise specified herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some of the steps in FIG. 2, FIG. 3, and FIG. 5 may include multiple sub-steps or stages, which are not necessarily executed at the same time but may be executed at different times; their execution sequence is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
  • a speech encoding device may adopt a software module or a hardware module, or a combination of the two, as part of a computer device.
  • the device specifically includes: a frequency band feature information acquisition module 702, a first target feature information determination module 704, a second target feature information determination module 706, a compressed speech signal generation module 708, and a speech signal encoding module 710, wherein:
  • the frequency band feature information acquisition module 702 is configured to acquire initial frequency band feature information corresponding to the speech signal to be processed.
  • the first target feature information determining module 704 is configured to obtain target feature information corresponding to the first frequency band based on the initial feature information corresponding to the first frequency band in the initial frequency band feature information.
  • the second target feature information determination module 706 is configured to perform feature compression on the initial feature information corresponding to the second frequency band in the initial frequency band feature information to obtain target feature information corresponding to the compressed frequency band; the frequency of the first frequency band is less than the frequency of the second frequency band, and the frequency interval of the second frequency band is larger than the frequency interval of the compressed frequency band.
  • the compressed speech signal generating module 708 is configured to obtain intermediate frequency band characteristic information based on the target characteristic information corresponding to the first frequency band and the target characteristic information corresponding to the compressed frequency band, and obtain a compressed speech signal corresponding to the speech signal to be processed based on the intermediate frequency band characteristic information.
  • The speech signal encoding module 710 is configured to encode the compressed speech signal through the speech encoding module to obtain the coded speech data corresponding to the speech signal to be processed; the target sampling rate corresponding to the compressed speech signal is less than or equal to the supported sampling rate corresponding to the speech encoding module, and the target sampling rate is smaller than the sampling rate corresponding to the speech signal to be processed.
  • Before speech encoding, the above speech encoding device can compress a speech signal to be processed at any sampling rate through frequency band feature information, reducing its sampling rate to the sampling rate supported by the speech encoder; the target sampling rate corresponding to the compressed speech signal obtained after compression is smaller than the sampling rate corresponding to the speech signal to be processed. Because the sampling rate of the compressed speech signal is less than or equal to the sampling rate supported by the speech encoder, the speech encoder can successfully encode the compressed speech signal, and the coded speech data obtained by the encoding process can finally be transmitted to the speech receiving end.
  • the frequency band characteristic information acquisition module is also used to acquire the speech signal to be processed collected by the speech collection device, perform Fourier transform processing on the speech signal to be processed, and obtain the initial frequency band characteristic information.
  • the initial frequency band feature information includes the initial amplitudes and initial phases corresponding to a plurality of initial voice frequency points.
  • the second target feature information determination module includes:
  • the frequency band dividing unit is configured to divide the second frequency band into frequency bands to obtain at least two sequentially arranged initial sub-frequency bands; perform frequency band division on the compressed frequency band to obtain at least two sequentially arranged target sub-frequency bands.
  • a frequency band association unit configured to determine the target sub-frequency band corresponding to each initial sub-frequency band based on the sub-frequency band ordering of the initial sub-frequency bands and the target sub-frequency bands;
  • An information conversion unit configured to use the initial feature information of the current initial sub-frequency band corresponding to the current target sub-frequency band as the first intermediate feature information, obtain from the initial frequency band feature information the initial feature information corresponding to the sub-frequency band consistent with the frequency band information of the current target sub-frequency band as the second intermediate feature information, and obtain the target feature information corresponding to the current target sub-frequency band based on the first intermediate feature information and the second intermediate feature information;
  • the information determining unit is configured to obtain target feature information corresponding to the compressed frequency band based on the target feature information corresponding to each target sub-frequency band.
  • both the first intermediate feature information and the second intermediate feature information include initial amplitudes and initial phases corresponding to a plurality of initial voice audio points.
  • the information conversion unit is also used to obtain the target amplitude of each target voice frequency point corresponding to the current target sub-band based on the statistical value of the initial amplitudes corresponding to the initial voice frequency points in the first intermediate feature information, and to obtain the target phase of each target voice frequency point corresponding to the current target sub-band based on the initial phase corresponding to each initial voice frequency point in the second intermediate feature information.
  • the compressed speech signal generation module is further configured to determine a third frequency band based on the frequency difference between the compressed frequency band and the second frequency band, set the target feature information corresponding to the third frequency band as invalid information, obtain the intermediate frequency band feature information based on the target feature information corresponding to the first frequency band, the target feature information corresponding to the compressed frequency band, and the target feature information corresponding to the third frequency band, perform inverse Fourier transform on the intermediate frequency band feature information to obtain an intermediate voice signal whose sampling rate is consistent with that of the voice signal to be processed, and down-sample the intermediate voice signal based on the supported sampling rate to obtain the compressed voice signal.
  • the speech signal coding module is further configured to perform speech coding on the compressed speech signal by the speech coding module to obtain first speech data, and perform channel coding on the first speech data to obtain coded speech data.
  • the speech encoding device further includes:
  • the voice data sending module 712 is used to send the coded voice data to the voice receiving end, so that the voice receiving end performs voice restoration processing on the coded voice data to obtain a target voice signal corresponding to the voice signal to be processed; the target voice signal is used for playback.
  • the voice data sending module is also used to obtain the compression identification information corresponding to the voice signal to be processed based on the second frequency band and the compressed frequency band, and to send the encoded voice data and the compression identification information to the voice receiving end, so that the voice receiving end decodes the encoded voice data to obtain a compressed voice signal and performs frequency band expansion on it based on the compression identification information to obtain the target voice signal.
  • a speech decoding device is provided.
  • the device can be implemented as a software module or a hardware module, or a combination of the two, as a part of computer equipment.
  • the device specifically includes: a voice data acquisition module 802, a voice signal decoding module 804, a first extended feature information determination module 806, a second extended feature information determination module 808, and a target voice signal determination module 810, wherein:
  • the voice data acquisition module 802 is configured to acquire coded voice data, which is obtained by performing voice compression processing on the voice signal to be processed.
  • the voice signal decoding module 804 is configured to decode the encoded voice data through the voice decoding module to obtain a decoded voice signal, and the target sampling rate corresponding to the decoded voice signal is less than or equal to the supported sampling rate corresponding to the voice decoding module.
  • the first extended feature information determination module 806 is configured to generate target frequency band feature information corresponding to the decoded speech signal, and obtain extended feature information corresponding to the first frequency band based on the target feature information corresponding to the first frequency band in the target frequency band feature information.
  • the second extended feature information determination module 808 is used to perform feature extension on the target feature information corresponding to the compressed frequency band in the target frequency band feature information to obtain the extended feature information corresponding to the second frequency band; the frequency of the first frequency band is lower than the frequency of the compressed frequency band, and the frequency interval of the compressed frequency band is smaller than the frequency interval of the second frequency band.
  • the target voice signal determination module 810 is configured to obtain extended frequency band feature information based on the extended feature information corresponding to the first frequency band and the extended feature information corresponding to the second frequency band, and obtain the target voice signal corresponding to the voice signal to be processed based on the extended frequency band feature information; the sampling rate of the target speech signal is greater than the target sampling rate, and the target speech signal is used for playback.
  • after obtaining the coded speech data produced by speech compression processing, the above-mentioned speech decoding device can decode the coded speech data to obtain a decoded speech signal and, through extension of the frequency band feature information, raise the sampling rate of the decoded speech signal to obtain the target speech signal for playback.
  • the playback of the voice signal is therefore not limited by the sampling rate supported by the voice decoder.
  • a high-sampling-rate voice signal carrying richer information can also be played.
  • the speech signal decoding module is further configured to perform channel decoding on the encoded speech data to obtain second speech data, and to perform speech decoding on the second speech data through the speech decoding module to obtain a decoded speech signal.
  • the second extended feature information determination module includes:
  • a mapping information acquisition unit configured to acquire frequency band mapping information, where the frequency band mapping information is used to determine a mapping relationship between at least two target sub-frequency bands corresponding to the compressed frequency band and at least two initial sub-frequency bands corresponding to the second frequency band;
  • a feature extension unit configured to perform feature extension on the target feature information corresponding to the compressed frequency band in the target frequency band feature information based on the frequency band mapping information, to obtain extended feature information corresponding to the second frequency band.
  • the coded voice data carries compressed identification information
  • the mapping information obtaining unit is further configured to obtain frequency band mapping information based on the compressed identification information.
  • the feature extension unit is further configured to use the target feature information of the current target sub-frequency band corresponding to the current initial sub-frequency band as third intermediate feature information; obtain, from the target frequency band feature information, the target feature information corresponding to the sub-frequency band whose frequency band information is consistent with the current initial sub-frequency band as fourth intermediate feature information; obtain the extended feature information corresponding to the current initial sub-frequency band based on the third intermediate feature information and the fourth intermediate feature information; and obtain the extended feature information corresponding to the second frequency band based on the extended feature information corresponding to each initial sub-frequency band.
  • both the third intermediate feature information and the fourth intermediate feature information include target amplitudes and target phases corresponding to a plurality of target speech frequency points.
  • the feature extension unit is further configured to obtain the reference amplitude of each initial speech frequency point corresponding to the current initial sub-frequency band based on the target amplitudes corresponding to the target speech frequency points.
  • when the fourth intermediate feature information is empty, a random perturbation value is added to the phase of each initial speech frequency point corresponding to the current initial sub-frequency band to obtain the reference phase of each initial speech frequency point corresponding to the current initial sub-frequency band.
  • when the fourth intermediate feature information is not empty, the reference phase is obtained based on the target phase corresponding to each target speech frequency point in the fourth intermediate feature information.
  • the extended feature information corresponding to the current initial sub-frequency band is obtained based on the reference amplitude and reference phase of each initial speech frequency point corresponding to the current initial sub-frequency band.
  • Each module in the above speech encoding and speech decoding devices can be fully or partially realized by software, hardware and combinations thereof.
  • the above-mentioned modules can be embedded in or independent of the processor in the computer device in the form of hardware, and can also be stored in the memory of the computer device in the form of software, so that the processor can invoke and execute the corresponding operations of the above-mentioned modules.
  • a computer device is provided.
  • the computer device may be a terminal, and its internal structure may be as shown in FIG. 9 .
  • the computer device includes a processor, a memory, a communication interface, a display screen and an input device connected through a system bus. The processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer readable instructions.
  • the internal memory provides an environment for the execution of the operating system and computer readable instructions in the non-volatile storage medium.
  • the communication interface of the computer device is used to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be realized through WIFI, an operator network, NFC (Near Field Communication) or other technologies.
  • when the computer-readable instructions are executed by the processor, a speech decoding method or a speech encoding method is implemented.
  • the display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen.
  • the input device of the computer device may be a touch layer covering the display screen, a button, a trackball or a touchpad provided on the casing of the computer device, or an external keyboard, touchpad or mouse.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 10 .
  • the computer device includes a processor, a memory and a network interface connected by a system bus. The processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions and a database.
  • the internal memory provides an environment for the execution of the operating system and computer readable instructions in the non-volatile storage medium.
  • the database of the computer device is used to store coded speech data, frequency band mapping information and other data.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
  • FIGS. 9 and 10 are only block diagrams of partial structures related to the solution of this application and do not constitute a limitation on the computer equipment to which the solution of this application is applied.
  • the computer device may include more or fewer components than shown in the figures, combine certain components, or have a different arrangement of components.
  • a computer device including a memory and one or more processors is provided, where computer-readable instructions are stored in the memory; when the one or more processors execute the computer-readable instructions, the steps in the above method embodiments are implemented.
  • a computer-readable storage medium which stores computer-readable instructions, and when the computer-readable instructions are executed by one or more processors, the steps in the foregoing method embodiments are implemented.
  • a computer program product or computer program comprising computer readable instructions stored in a computer readable storage medium.
  • One or more processors of the computer device read the computer-readable instructions from the computer-readable storage medium, and one or more processors execute the computer-readable instructions, so that the computer device executes the steps in the foregoing method embodiments.
  • Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory or optical memory, etc.
  • Volatile memory can include Random Access Memory (RAM) or external cache memory.
  • RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).


Abstract

This application relates to a speech encoding method, a speech decoding method, apparatuses, a computer device, a storage medium and a computer program product. The method includes: obtaining initial frequency band feature information corresponding to a speech signal to be processed (S202); obtaining target feature information corresponding to a first frequency band based on initial feature information corresponding to the first frequency band in the initial frequency band feature information (S204); performing feature compression on initial feature information corresponding to a second frequency band in the initial frequency band feature information to obtain target feature information corresponding to a compressed frequency band, the frequency of the first frequency band being lower than the frequency of the second frequency band, and the frequency interval of the second frequency band being larger than the frequency interval of the compressed frequency band (S206); obtaining intermediate frequency band feature information based on the target feature information corresponding to the first frequency band and the target feature information corresponding to the compressed frequency band, and obtaining a compressed speech signal corresponding to the speech signal to be processed based on the intermediate frequency band feature information (S208); and encoding the compressed speech signal by a speech encoding module to obtain coded speech data corresponding to the speech signal to be processed, a target sampling rate corresponding to the compressed speech signal being less than or equal to a supported sampling rate corresponding to the speech encoding module, and the target sampling rate being less than the sampling rate corresponding to the speech signal to be processed (S210).

Description

Speech encoding and speech decoding methods, apparatuses, computer device and storage medium

This application claims priority to Chinese Patent Application No. 2021106931609, entitled "Speech encoding and speech decoding methods, apparatuses, computer device and storage medium", filed with the China National Intellectual Property Administration on June 22, 2021, the entire contents of which are incorporated herein by reference.

Technical Field

This application relates to the field of computer technology, and in particular to speech encoding and speech decoding methods, apparatuses, a computer device, a storage medium and a computer program product.

Background

With the development of computer technology, speech codec technology has emerged. Speech codec technology can be applied to speech storage and speech transmission.

In conventional technology, a speech capture device must be used together with a matching speech encoder: the sampling rate of the capture device must fall within the range of sampling rates supported by the encoder, so that the captured speech signal can be encoded for storage or transmission. Likewise, the playback of a speech signal depends on a speech decoder, which can only decode and play speech signals whose sampling rate falls within the range it supports.

However, in conventional methods, the capture of a speech signal is constrained by the sampling rates supported by existing speech encoders, and the playback of a speech signal is constrained by the sampling rates supported by existing speech decoders, which is a significant limitation.
Summary

According to various embodiments of this application, speech encoding and speech decoding methods, apparatuses, a computer device, a storage medium and a computer program product are provided.

A speech encoding method, executed by a speech sending end, the method including:

obtaining initial frequency band feature information corresponding to a speech signal to be processed;

obtaining target feature information corresponding to a first frequency band based on initial feature information corresponding to the first frequency band in the initial frequency band feature information;

performing feature compression on initial feature information corresponding to a second frequency band in the initial frequency band feature information to obtain target feature information corresponding to a compressed frequency band, the frequency of the first frequency band being lower than the frequency of the second frequency band, and the frequency interval of the second frequency band being larger than the frequency interval of the compressed frequency band;

obtaining intermediate frequency band feature information based on the target feature information corresponding to the first frequency band and the target feature information corresponding to the compressed frequency band, and obtaining a compressed speech signal corresponding to the speech signal to be processed based on the intermediate frequency band feature information; and

encoding the compressed speech signal by a speech encoding module to obtain coded speech data corresponding to the speech signal to be processed, a target sampling rate corresponding to the compressed speech signal being less than or equal to a supported sampling rate corresponding to the speech encoding module, and the target sampling rate being less than the sampling rate corresponding to the speech signal to be processed.
A speech encoding apparatus, the apparatus including:

a frequency band feature information obtaining module, configured to obtain initial frequency band feature information corresponding to a speech signal to be processed;

a first target feature information determination module, configured to obtain target feature information corresponding to a first frequency band based on initial feature information corresponding to the first frequency band in the initial frequency band feature information;

a second target feature information determination module, configured to perform feature compression on initial feature information corresponding to a second frequency band in the initial frequency band feature information to obtain target feature information corresponding to a compressed frequency band, the frequency of the first frequency band being lower than the frequency of the second frequency band, and the frequency interval of the second frequency band being larger than the frequency interval of the compressed frequency band;

a compressed speech signal generating module, configured to obtain intermediate frequency band feature information based on the target feature information corresponding to the first frequency band and the target feature information corresponding to the compressed frequency band, and to obtain a compressed speech signal corresponding to the speech signal to be processed based on the intermediate frequency band feature information; and

a speech signal encoding module, configured to encode the compressed speech signal by a speech encoding module to obtain coded speech data corresponding to the speech signal to be processed, a target sampling rate corresponding to the compressed speech signal being less than or equal to a supported sampling rate corresponding to the speech encoding module, and the target sampling rate being less than the sampling rate corresponding to the speech signal to be processed.

A computer device, including a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps of the above speech encoding method.

One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above speech encoding method.

A computer program product or computer program, including computer-readable instructions stored in a computer-readable storage medium; one or more processors of a computer device read the computer-readable instructions from the computer-readable storage medium and execute them, causing the computer device to perform the steps of the above speech encoding method.
A speech decoding method, executed by a speech receiving end, the method including:

obtaining coded speech data, the coded speech data being obtained by performing speech compression processing on a speech signal to be processed;

decoding the coded speech data by a speech decoding module to obtain a decoded speech signal, a target sampling rate corresponding to the decoded speech signal being less than or equal to a supported sampling rate corresponding to the speech decoding module;

generating target frequency band feature information corresponding to the decoded speech signal, and obtaining extended feature information corresponding to a first frequency band based on target feature information corresponding to the first frequency band in the target frequency band feature information;

performing feature extension on target feature information corresponding to a compressed frequency band in the target frequency band feature information to obtain extended feature information corresponding to a second frequency band, the frequency of the first frequency band being lower than the frequency of the compressed frequency band, and the frequency interval of the compressed frequency band being smaller than the frequency interval of the second frequency band; and

obtaining extended frequency band feature information based on the extended feature information corresponding to the first frequency band and the extended feature information corresponding to the second frequency band, and obtaining a target speech signal corresponding to the speech signal to be processed based on the extended frequency band feature information, the sampling rate of the target speech signal being greater than the target sampling rate, and the target speech signal being used for playback.
A speech decoding apparatus, the apparatus including:

a speech data obtaining module, configured to obtain coded speech data, the coded speech data being obtained by performing speech compression processing on a speech signal to be processed;

a speech signal decoding module, configured to decode the coded speech data by a speech decoding module to obtain a decoded speech signal, a target sampling rate corresponding to the decoded speech signal being less than or equal to a supported sampling rate corresponding to the speech decoding module;

a first extended feature information determination module, configured to generate target frequency band feature information corresponding to the decoded speech signal, and to obtain extended feature information corresponding to a first frequency band based on target feature information corresponding to the first frequency band in the target frequency band feature information;

a second extended feature information determination module, configured to perform feature extension on target feature information corresponding to a compressed frequency band in the target frequency band feature information to obtain extended feature information corresponding to a second frequency band, the frequency of the first frequency band being lower than the frequency of the compressed frequency band, and the frequency interval of the compressed frequency band being smaller than the frequency interval of the second frequency band; and

a target speech signal determination module, configured to obtain extended frequency band feature information based on the extended feature information corresponding to the first frequency band and the extended feature information corresponding to the second frequency band, and to obtain a target speech signal corresponding to the speech signal to be processed based on the extended frequency band feature information, the sampling rate of the target speech signal being greater than the target sampling rate, and the target speech signal being used for playback.

A computer device, including a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps of the above speech decoding method.

One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above speech decoding method.

A computer program product or computer program, including computer-readable instructions stored in a computer-readable storage medium; one or more processors of a computer device read the computer-readable instructions from the computer-readable storage medium and execute them, causing the computer device to perform the steps of the above speech decoding method.

Details of one or more embodiments of this application are set forth in the accompanying drawings and the description below. Other features, objectives and advantages of this application will become apparent from the specification, the drawings and the claims.
Brief Description of the Drawings

To describe the technical solutions in the embodiments of this application more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are merely some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

FIG. 1 is a diagram of an application environment of the speech encoding and speech decoding methods in an embodiment;

FIG. 2 is a schematic flowchart of a speech encoding method in an embodiment;

FIG. 3 is a schematic flowchart of performing feature compression on initial feature information to obtain target feature information in an embodiment;

FIG. 4 is a schematic diagram of the mapping relationship between initial sub-frequency bands and target sub-frequency bands in an embodiment;

FIG. 5 is a schematic flowchart of a speech decoding method in an embodiment;

FIG. 6A is a schematic flowchart of speech encoding and decoding methods in an embodiment;

FIG. 6B is a schematic diagram of frequency-domain signals before and after compression in an embodiment;

FIG. 6C is a schematic diagram of speech signals before and after compression in an embodiment;

FIG. 6D is a schematic diagram of frequency-domain signals before and after extension in an embodiment;

FIG. 6E is a schematic diagram of the speech signal to be processed and the target speech signal in an embodiment;

FIG. 7A is a structural block diagram of a speech encoding apparatus in an embodiment;

FIG. 7B is a structural block diagram of a speech encoding apparatus in another embodiment;

FIG. 8 is a structural block diagram of a speech decoding apparatus in an embodiment;

FIG. 9 is an internal structure diagram of a computer device in an embodiment;

FIG. 10 is an internal structure diagram of a computer device in an embodiment.

Detailed Description

To make the objectives, technical solutions and advantages of this application clearer, this application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
The speech encoding and speech decoding methods provided by this application can be applied to the application environment shown in FIG. 1, in which a speech sending end 102 communicates with a speech receiving end 104 through a network. The speech sending end, also called the speech encoding end, is mainly used for speech encoding; the speech receiving end, also called the speech decoding end, is mainly used for speech decoding. The speech sending end 102 and the speech receiving end 104 may each be a terminal or a server. The terminal may be, but is not limited to, a desktop computer, a notebook computer, a smartphone, a tablet computer, an Internet-of-Things device or a portable wearable device; the Internet-of-Things device may be a smart speaker, a smart television, a smart air conditioner, a smart vehicle-mounted device, etc., and the portable wearable device may be a smart watch, a smart band, a head-mounted device, etc. The server 104 may be implemented as an independent server, a server cluster composed of multiple servers, or a cloud server.
Specifically, the speech sending end obtains the initial frequency band feature information corresponding to the speech signal to be processed. The sending end may obtain the target feature information corresponding to the first frequency band based on the initial feature information corresponding to the first frequency band in the initial frequency band feature information, and perform feature compression on the initial feature information corresponding to the second frequency band to obtain the target feature information corresponding to the compressed frequency band, where the frequency of the first frequency band is lower than that of the second frequency band and the frequency interval of the second frequency band is larger than that of the compressed frequency band. The sending end obtains intermediate frequency band feature information based on the target feature information corresponding to the first frequency band and to the compressed frequency band, obtains a compressed speech signal corresponding to the speech signal to be processed based on the intermediate frequency band feature information, and encodes the compressed speech signal by the speech encoding module to obtain coded speech data corresponding to the speech signal to be processed. The target sampling rate corresponding to the compressed speech signal is less than or equal to the supported sampling rate of the speech encoding module, and less than the sampling rate of the speech signal to be processed. The sending end may send the coded speech data to the speech receiving end, so that the receiving end performs speech restoration processing on it to obtain and play a target speech signal corresponding to the speech signal to be processed; alternatively, the sending end may store the coded speech data locally and, when playback is needed, perform the speech restoration processing itself to obtain and play the target speech signal.

With the above speech encoding method, before speech encoding, a speech signal to be processed of any sampling rate can have its frequency band feature information compressed so that its sampling rate is reduced to a rate supported by the speech encoder: the target sampling rate of the resulting compressed speech signal is lower than the sampling rate of the speech signal to be processed, yielding a low-sampling-rate compressed speech signal. Because the sampling rate of the compressed speech signal is less than or equal to the sampling rate supported by the speech encoder, the encoder can encode it smoothly, and the resulting coded speech data can finally be transmitted to the speech decoding end.

The speech receiving end obtains the coded speech data and decodes it through the speech decoding module to obtain a decoded speech signal; the coded speech data may have been sent by the speech sending end, or produced locally by the receiving end through speech compression processing of the speech signal to be processed. The receiving end generates target frequency band feature information corresponding to the decoded speech signal, obtains extended feature information corresponding to the first frequency band based on the target feature information corresponding to the first frequency band, and performs feature extension on the target feature information corresponding to the compressed frequency band to obtain extended feature information corresponding to the second frequency band, where the frequency of the first frequency band is lower than that of the compressed frequency band and the frequency interval of the compressed frequency band is smaller than that of the second frequency band. The receiving end obtains extended frequency band feature information based on the extended feature information corresponding to the first and second frequency bands, and obtains the target speech signal corresponding to the speech signal to be processed based on the extended frequency band feature information, where the sampling rate of the target speech signal is greater than the target sampling rate of the decoded speech signal. Finally, the receiving end plays the target speech signal.

With the above speech decoding method, after obtaining coded speech data produced by speech compression processing, the coded speech data can be decoded to obtain a decoded speech signal, and through extension of the frequency band feature information, the sampling rate of the decoded speech signal can be raised to obtain the target speech signal for playback. Playback of the speech signal is thus not limited by the sampling rate supported by the speech decoder, and a high-sampling-rate speech signal carrying richer information can also be played.

It can be understood that during transmission the coded speech data may pass through a server, which may be implemented as an independent server, a server cluster composed of multiple servers, or a cloud server. The speech receiving end and the speech sending end are interchangeable: the receiving end can also act as a sending end, and the sending end as a receiving end.
In an embodiment, as shown in FIG. 2, a speech encoding method is provided. The method is described by taking its application to the speech sending end in FIG. 1 as an example, and includes the following steps:

Step S202: obtain initial frequency band feature information corresponding to a speech signal to be processed.

Here, the speech signal to be processed refers to a speech signal captured by a speech capture device. It may be captured in real time, in which case the speech sending end performs frequency band compression and encoding on the latest captured signal in real time to obtain coded speech data. It may also be a historically captured signal, in which case the sending end retrieves a historically captured speech signal from a database as the speech signal to be processed and performs frequency band compression and encoding on it to obtain coded speech data. The sending end may store the coded speech data and decode it for playback when needed, or send the encoded speech signal to the speech receiving end, which decodes and plays it. The speech signal to be processed is a time-domain signal and reflects how the speech signal changes over time.

Frequency band compression can reduce the sampling rate of a speech signal while keeping the speech content intelligible. It refers to compressing a speech signal with a wide frequency band into one with a narrow frequency band, the two signals sharing the same low-frequency information.

The initial frequency band feature information refers to the feature information of the speech signal to be processed in the frequency domain, which includes the amplitudes and phases of multiple frequency points within a frequency bandwidth (that is, a frequency band); a frequency point represents a specific frequency. According to the Shannon theorem, the sampling rate of a speech signal is twice its frequency band: for example, a speech signal sampled at 48 kHz has a frequency band of 24 kHz (specifically 0-24 kHz), and one sampled at 16 kHz has a band of 8 kHz (specifically 0-8 kHz).

Specifically, the speech sending end may take the speech signal captured by the local speech capture device as the speech signal to be processed and locally extract its frequency-domain features as the initial frequency band feature information. A time-domain to frequency-domain conversion algorithm may be used to convert the time-domain signal into a frequency-domain signal, for example a custom conversion algorithm, the Laplace transform, the Z transform or the Fourier transform.
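As a rough illustration of this step (not the patent's implementation; the function and variable names are hypothetical), the frequency band feature information of a signal frame can be obtained with a discrete Fourier transform, yielding a per-frequency-point amplitude and phase:

```python
import numpy as np

def band_features(frame: np.ndarray):
    """Return the amplitude and phase of each frequency point of a frame."""
    spectrum = np.fft.rfft(frame)   # time domain -> frequency domain
    amplitude = np.abs(spectrum)    # amplitude of each frequency point
    phase = np.angle(spectrum)      # phase of each frequency point
    return amplitude, phase

# a 1024-sample frame of a 48 kHz signal covers the 0-24 kHz band;
# a 1 kHz tone should produce a peak near bin 1000/48000*1024 ≈ 21
frame = np.sin(2 * np.pi * 1000 * np.arange(1024) / 48000)
amp, ph = band_features(frame)
```

A 1024-point real FFT yields 513 frequency points (bins 0 through 512), each carrying one amplitude/phase pair.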
Step S204: obtain target feature information corresponding to the first frequency band based on the initial feature information corresponding to the first frequency band in the initial frequency band feature information.

Here, a frequency segment is a frequency interval made up of part of the frequencies in a frequency band, and a band may consist of at least one segment. The initial frequency band of the speech signal to be processed includes the first frequency band and the second frequency band, the frequency of the first being lower than that of the second. The speech sending end may divide the initial frequency band feature information into initial feature information corresponding to the first frequency band and initial feature information corresponding to the second frequency band, that is, into low-band and high-band initial feature information. The low-band feature information mainly determines the content of the speech, for example the specific semantic content "what time do we get off work", while the high-band feature information mainly determines the texture of the speech, for example a hoarse, deep voice.

Initial feature information refers to the feature information of each frequency before band compression; target feature information refers to the feature information of each frequency after band compression.

Specifically, if the sampling rate of the speech signal to be processed is higher than that supported by the speech encoder, the signal cannot be encoded directly, so frequency band compression is needed to reduce its sampling rate. During band compression, besides lowering the sampling rate, the semantic content must remain unchanged, natural and intelligible. Since the semantic content of speech depends on the low-frequency information of the signal, the sending end may divide the initial frequency band feature information into the first-band part (the low-frequency information) and the second-band part (the high-frequency information). To keep the speech intelligible and readable, the sending end may keep the low-frequency information unchanged during compression and compress only the high-frequency information. Therefore, the sending end can obtain the target feature information of the first frequency band from its initial feature information, taking the initial feature information of the first band as the target feature information of the first band in the intermediate frequency band feature information; that is, the low-frequency information is identical before and after band compression.

In an embodiment, the speech sending end may divide the initial frequency band into the first and second frequency bands based on a preset frequency, which may be set from expert knowledge, for example 6 kHz. If the sampling rate of the speech signal is 48 kHz, its initial frequency band is 0-24 kHz, the first frequency band is 0-6 kHz and the second frequency band is 6-24 kHz.
Step S206: perform feature compression on the initial feature information corresponding to the second frequency band in the initial frequency band feature information to obtain target feature information corresponding to the compressed frequency band; the frequency of the first frequency band is lower than that of the second frequency band, and the frequency interval of the second frequency band is larger than that of the compressed frequency band.

Here, feature compression condenses the feature information of a large band into the feature information of a small band. The second frequency band represents the large band and the compressed frequency band the small one: the frequency interval of the second band is larger than that of the compressed band, i.e. the second band is longer than the compressed band. It can be understood that, for a seamless junction between the first band and the compressed band, the minimum frequency of the second band can equal the minimum frequency of the compressed band, in which case the maximum frequency of the second band is clearly higher than that of the compressed band. For example, if the first band is 0-6 kHz and the second band is 6-24 kHz, the compressed band may be 6-8 kHz, 6-16 kHz, and so on. Feature compression can also be viewed as compressing the feature information of a high band into that of a low band.

Specifically, during band compression the speech sending end mainly compresses the high-frequency information of the speech signal. The sending end can perform feature compression on the initial feature information corresponding to the second frequency band to obtain the target feature information corresponding to the compressed frequency band.

In an embodiment, the initial frequency band feature information includes the amplitudes and phases of multiple initial speech frequency points. During feature compression, the sending end may compress both the amplitudes and the phases of the initial frequency points of the second band to obtain the amplitudes and phases of the target frequency points of the compressed band, and obtain the target feature information of the compressed band from them. Compressing an amplitude or phase may mean taking the arithmetic mean of the amplitudes or phases of the initial frequency points of the second band as the amplitude or phase of a target frequency point of the compressed band, or taking a weighted mean, or using another compression method; besides compressing the whole band at once, compression can also be performed piecewise.

Further, to reduce the difference between the target and initial feature information, the sending end may compress only the amplitudes of the initial frequency points of the second band to obtain the amplitudes of the target frequency points of the compressed band and, among the initial frequency points of the second band, find those whose frequencies coincide with the target frequency points of the compressed band as intermediate frequency points, taking the phases of the intermediate frequency points as the phases of the target frequency points, and obtain the target feature information of the compressed band from the resulting amplitudes and phases. For example, if the second band is 6-24 kHz and the compressed band is 6-8 kHz, the phases of the initial frequency points at 6-8 kHz of the second band can be used as the phases of the target frequency points at 6-8 kHz of the compressed band.
Step S208: obtain intermediate frequency band feature information based on the target feature information corresponding to the first frequency band and that corresponding to the compressed frequency band, and obtain the compressed speech signal corresponding to the speech signal to be processed based on the intermediate frequency band feature information.

Here, the intermediate frequency band feature information is the feature information obtained by band-compressing the initial frequency band feature information, and the compressed speech signal is the signal obtained by band-compressing the speech signal to be processed. Band compression lowers the sampling rate while keeping the content intelligible; it can be understood that the sampling rate of the speech signal to be processed is greater than that of the compressed speech signal.

Specifically, the speech sending end obtains the intermediate frequency band feature information from the target feature information of the first frequency band and of the compressed frequency band. The intermediate frequency band feature information is a frequency-domain signal; after obtaining it, the sending end can convert this frequency-domain signal into a time-domain signal to obtain the compressed speech signal, using a frequency-domain to time-domain conversion algorithm such as a custom algorithm, the inverse Laplace transform, the inverse Z transform or the inverse Fourier transform.

For example, suppose the sampling rate of the speech signal to be processed is 48 kHz and its initial band is 0-24 kHz. The sending end can take the initial feature information for 0-6 kHz from the initial frequency band feature information and use it directly as the target feature information for 0-6 kHz, and compress the initial feature information for 6-24 kHz into target feature information for 6-8 kHz. From the target feature information for 0-8 kHz the sending end can generate the compressed speech signal, whose target sampling rate is 16 kHz.

It can be understood that when the sampling rate of the speech signal to be processed is above the rate supported by the speech encoder, band compression reduces the high-sampling-rate signal to a rate the encoder supports, so that the encoder can encode it successfully. The sampling rate of the speech signal may also equal or fall below the encoder's supported rate, in which case band compression to an even lower rate reduces the computation required during encoding and the amount of data transmitted, allowing the speech signal to reach the speech receiving end quickly over the network.

In an embodiment, the frequency band of the intermediate frequency band feature information may be the same as or different from that of the initial frequency band feature information. When the bands are the same, specific feature information exists in the first frequency band and the compressed frequency band of the intermediate information, while the feature information of all frequencies above the compressed band is zero. For example, the initial information includes the amplitudes and phases of frequency points over 0-24 kHz and so does the intermediate information, with the first band at 0-6 kHz, the second band at 6-24 kHz and the compressed band at 6-8 kHz. In the initial information, each frequency point over 0-24 kHz has a corresponding amplitude and phase; in the intermediate information, each point over 0-8 kHz has a corresponding amplitude and phase, while the amplitudes and phases of the points over 8-24 kHz are all zero. If the two bands are the same, the sending end must first convert the intermediate information into a time-domain signal and then downsample that signal to obtain the compressed speech signal.

When the bands differ, the band of the intermediate information consists of the first frequency band plus the compressed frequency band, while the band of the initial information consists of the first frequency band plus the second frequency band. For example, the initial information includes the amplitudes and phases of frequency points over 0-24 kHz and the intermediate information those over 0-8 kHz, with the first band at 0-6 kHz, the second band at 6-24 kHz and the compressed band at 6-8 kHz: every point of 0-24 kHz has an amplitude and phase in the initial information, and every point of 0-8 kHz in the intermediate information. In this case the sending end can directly convert the intermediate information into a time-domain signal to obtain the compressed speech signal.
Step S210: encode the compressed speech signal by the speech encoding module to obtain the coded speech data corresponding to the speech signal to be processed; the target sampling rate of the compressed speech signal is less than or equal to the supported sampling rate of the speech encoding module, and the target sampling rate is less than the sampling rate of the speech signal to be processed.

Here, the speech encoding module is a module for encoding speech signals and may be hardware or software. Its supported sampling rate is the maximum sampling rate it supports, i.e. its upper limit: if the supported rate is 16 kHz, the module can encode speech signals sampled at 16 kHz or below.

Specifically, by band-compressing the speech signal to be processed, the speech sending end compresses it into the compressed speech signal so that the latter's sampling rate meets the encoder's requirement; the speech encoding module supports signals sampled at or below its upper limit. The sending end encodes the compressed signal through the speech encoding module to obtain the coded speech data, which is bitstream data. If the coded data is only stored locally and needs no network transmission, the sending end can simply perform speech encoding on the compressed signal to obtain it; if the coded data is to be transmitted to a speech receiving end, the sending end performs speech encoding on the compressed signal to obtain first speech data and then channel encoding on the first speech data to obtain the coded speech data.

For example, in a voice chat scenario, friends chat by voice through an instant messaging application on their terminals, and a user can send a voice message to a friend in the conversation interface. When friend A sends a voice message to friend B, A's terminal is the speech sending end and B's terminal the receiving end. The sending end captures A's speech through the microphone upon A's trigger operation on the voice-capture control of the conversation interface, obtaining the speech signal to be processed. When a high-quality microphone is used, the initial sampling rate may be 48 kHz, giving good sound quality and an ultra-wide band of 0-24 kHz. The sending end performs a Fourier transform on the signal to obtain the initial frequency band feature information covering 0-24 kHz, then applies nonlinear band compression to concentrate the 0-24 kHz frequency-domain information into 0-8 kHz: specifically, the initial feature information for 0-6 kHz is kept unchanged and that for 6-24 kHz is compressed into 6-8 kHz. From the resulting 0-8 kHz frequency-domain information the sending end generates the compressed speech signal with a target sampling rate of 16 kHz, encodes it with a conventional speech encoder supporting 16 kHz, and sends the coded speech data, whose sampling rate matches the target sampling rate, to the receiving end. After receiving the data, the receiving end decodes it and performs nonlinear band extension to obtain the target speech signal, whose sampling rate matches the initial sampling rate; upon B's trigger operation on the voice message in the conversation interface, the terminal plays the high-sampling-rate target signal through the loudspeaker.

In a recording scenario, when the terminal receives a user-triggered recording operation, it captures the user's speech through the microphone to obtain the speech signal to be processed. The terminal performs a Fourier transform to obtain the initial frequency band feature information covering 0-24 kHz, applies nonlinear band compression to concentrate the 0-24 kHz frequency-domain information into 0-8 kHz (keeping the 0-6 kHz initial feature information unchanged and compressing the 6-24 kHz information into 6-8 kHz), generates the compressed speech signal with a 16 kHz target sampling rate, encodes it with a conventional encoder supporting 16 kHz, and stores the coded speech data. When the terminal receives a user-triggered playback operation, it performs speech restoration processing on the coded data to obtain and play the target speech signal.

In an embodiment, the coded speech data may carry compression identification information identifying the band mapping information between the second frequency band and the compressed frequency band. The sending or receiving end can then perform speech restoration processing on the coded data based on the compression identification information to obtain the target speech signal.

In an embodiment, the maximum frequency of the compressed band may be determined from the sampling rate supported by the speech encoding module on the sending end. For example, if the supported rate is 16 kHz, a signal sampled at 16 kHz corresponds to a 0-8 kHz band, so the maximum frequency of the compressed band can be 8 kHz; it may of course also be below 8 kHz, and even then an encoder supporting 16 kHz can encode the corresponding compressed signal. The maximum frequency may also be a default frequency determined from the supported sampling rates of various existing speech encoding modules; for example, if the minimum among the known supported rates is 16 kHz, the default frequency can be set to 8 kHz.

With the above speech encoding method, the initial frequency band feature information of the speech signal to be processed is obtained; the target feature information of the first frequency band is obtained from the corresponding initial feature information; the initial feature information of the second frequency band is feature-compressed into the target feature information of the compressed band, where the frequency of the first band is lower than that of the second and the frequency interval of the second band is larger than that of the compressed band; the intermediate frequency band feature information is obtained from the target feature information of the first band and of the compressed band, the compressed speech signal is obtained from the intermediate information, and the compressed signal is encoded by the speech encoding module into the coded speech data, the compressed signal's target sampling rate being no greater than the module's supported rate. In this way, before encoding, a speech signal of any sampling rate can be brought down, through compression of its band feature information, to a sampling rate the encoder supports: the target sampling rate of the compressed signal is lower than the sampling rate of the speech signal to be processed. Because the compressed signal's rate is no higher than the encoder's supported rate, the encoder can encode it smoothly, and the resulting coded speech data can be transmitted to the speech receiving end.
In an embodiment, obtaining the initial frequency band feature information corresponding to the speech signal to be processed includes:

obtaining the speech signal to be processed captured by a speech capture device, and performing Fourier transform processing on it to obtain the initial frequency band feature information, which includes the initial amplitudes and initial phases corresponding to multiple initial speech frequency points.

Here, a speech capture device is a device for capturing speech, for example a microphone. Fourier transform processing converts the time-domain speech signal into a frequency-domain signal that reflects the feature information of the signal in the frequency domain; the initial frequency band feature information is that frequency-domain signal, and the initial speech frequency points are the frequency points within it.

Specifically, the speech sending end obtains the captured signal, performs a Fourier transform to convert the time-domain signal into a frequency-domain one, and extracts the frequency-domain feature information to obtain the initial frequency band feature information, which consists of the initial amplitudes and initial phases of the initial frequency points. The phase of a frequency point determines the smoothness of the speech, the amplitude of a low frequency point determines the specific semantic content, and the amplitude of a high frequency point determines the texture. The frequency range spanned by all initial frequency points is the initial frequency band of the speech signal to be processed.

In an embodiment, a fast Fourier transform of the speech signal to be processed yields N initial frequency points, where N is usually an integer power of 2 and the points are uniformly distributed. For example, with N = 1024 and an initial band of 24 kHz, the resolution of the initial frequency points is 24k/1024 = 23.4375, i.e. there is one initial frequency point every 23.4375 Hz. It can be understood that, to maintain high resolution, signals of different sampling rates can be transformed into different numbers of frequency points: the higher the sampling rate, the more initial frequency points the fast Fourier transform yields.

In the above embodiment, by performing Fourier transform processing on the speech signal to be processed, its initial frequency band feature information can be obtained quickly.
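The bin-resolution arithmetic above can be checked directly; this is only a sketch of the numbers used in the 1024-point, 24 kHz example in the text:

```python
# resolution of the initial speech frequency points for an N-point FFT
n_points = 1024      # number of FFT frequency points (an integer power of 2)
band_hz = 24_000     # initial frequency band of a 48 kHz signal, in Hz
resolution = band_hz / n_points
print(resolution)    # one frequency point every 23.4375 Hz
```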
In an embodiment, as shown in FIG. 3, performing feature compression on the initial feature information corresponding to the second frequency band in the initial frequency band feature information to obtain the target feature information corresponding to the compressed frequency band includes:

Step S302: divide the second frequency band into at least two initial sub-frequency bands arranged in order.

Step S304: divide the compressed frequency band into at least two target sub-frequency bands arranged in order.

Here, band division means cutting one frequency band into multiple sub-bands. The speech sending end may divide the second frequency band or the compressed frequency band linearly or nonlinearly. Taking the second frequency band as an example, the sending end may divide it linearly, i.e. into equal parts: a 6-24 kHz band can be divided into three equal initial sub-bands, 6-12 kHz, 12-18 kHz and 18-24 kHz. It may also divide it nonlinearly, i.e. unequally: a 6-24 kHz band can be divided into five initial sub-bands, 6-8 kHz, 8-10 kHz, 10-12 kHz, 12-18 kHz and 18-24 kHz.

Specifically, the sending end divides the second band into at least two ordered initial sub-bands and the compressed band into at least two ordered target sub-bands. The numbers of initial and target sub-bands may be equal or not. When they are equal, initial and target sub-bands correspond one to one; when they differ, several initial sub-bands may correspond to one target sub-band, or one initial sub-band to several target sub-bands.

Step S306: determine the target sub-frequency band corresponding to each initial sub-frequency band based on the ordering of the initial and target sub-frequency bands.

Specifically, the sending end can determine each initial sub-band's target sub-band from the sub-band ordering. When the numbers of initial and target sub-bands are equal, the sending end associates initial and target sub-bands with the same ordinal position. Referring to FIG. 4, with ordered initial sub-bands 6-8 kHz, 8-10 kHz, 10-12 kHz, 12-18 kHz and 18-24 kHz and ordered target sub-bands 6-6.4 kHz, 6.4-6.8 kHz, 6.8-7.2 kHz, 7.2-7.6 kHz and 7.6-8 kHz, 6-8 kHz corresponds to 6-6.4 kHz, 8-10 kHz to 6.4-6.8 kHz, 10-12 kHz to 6.8-7.2 kHz, 12-18 kHz to 7.2-7.6 kHz and 18-24 kHz to 7.6-8 kHz. When the numbers differ, the sending end may establish one-to-one associations between the leading initial and target sub-bands and between the trailing initial and target sub-bands, and one-to-many or many-to-one associations between the middle ones; for example, when there are more middle initial sub-bands than middle target sub-bands, many-to-one associations are established.
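The positional pairing described above (the five initial sub-bands of 6-24 kHz mapped onto the five target sub-bands of 6-8 kHz, as in the FIG. 4 example) can be sketched as follows; the data structure is illustrative, not part of the patent:

```python
# initial sub-bands of the second frequency band and target sub-bands of
# the compressed frequency band, in kHz, following the FIG. 4 example
initial_subbands = [(6, 8), (8, 10), (10, 12), (12, 18), (18, 24)]
target_subbands = [(6.0, 6.4), (6.4, 6.8), (6.8, 7.2), (7.2, 7.6), (7.6, 8.0)]

# equal counts: associate sub-bands that share the same ordinal position
band_mapping = dict(zip(initial_subbands, target_subbands))
print(band_mapping[(12, 18)])   # (7.2, 7.6)
```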
Step S308: take the initial feature information of the current initial sub-frequency band corresponding to the current target sub-frequency band as first intermediate feature information; obtain, from the initial frequency band feature information, the initial feature information of the sub-band whose band information is consistent with the current target sub-band as second intermediate feature information; and obtain the target feature information of the current target sub-band based on the first and second intermediate feature information.

Specifically, the feature information of a band includes the amplitude and phase of at least one frequency point. During feature compression, the speech sending end may compress only the amplitudes while the phases follow the original ones. The current target sub-band is the target sub-band for which target feature information is currently being generated. When generating it, the sending end takes the initial feature information of the current initial sub-band corresponding to the current target sub-band as the first intermediate feature information, which is used to determine the amplitudes of the frequency points in the target feature information of the current target sub-band. The sending end obtains from the initial frequency band feature information the initial feature information of the sub-band whose band information matches the current target sub-band as the second intermediate feature information, which is used to determine the phases of those frequency points. The target feature information of the current target sub-band is then obtained from the first and second intermediate feature information.

For example, the initial frequency band feature information includes the initial feature information for 0-24 kHz; the current target sub-band is 6-6.4 kHz and its corresponding initial sub-band is 6-8 kHz. The sending end can obtain the target feature information for 6-6.4 kHz from the initial feature information for 6-8 kHz and the initial feature information for 6-6.4 kHz.

Step S310: obtain the target feature information corresponding to the compressed frequency band based on the target feature information corresponding to each target sub-frequency band.

Specifically, after obtaining the target feature information of each target sub-band, the sending end composes the target feature information of the compressed frequency band from the target feature information of all target sub-bands.

In the above embodiment, performing feature compression by further subdividing the second frequency band and the compressed frequency band improves the reliability of the compression and reduces the difference between the initial feature information of the second band and the target feature information of the compressed band, so that a target speech signal with high similarity to the speech signal to be processed can be restored during subsequent band extension.
In an embodiment, both the first and the second intermediate feature information include the initial amplitudes and initial phases of multiple initial speech frequency points. Obtaining the target feature information of the current target sub-band from the first and second intermediate feature information includes:

obtaining the target amplitude of each target frequency point of the current target sub-band based on a statistic of the initial amplitudes of the initial frequency points in the first intermediate feature information; obtaining the target phase of each target frequency point of the current target sub-band based on the initial phases of the initial frequency points in the second intermediate feature information; and obtaining the target feature information of the current target sub-band from the target amplitudes and target phases of its target frequency points.

Specifically, for amplitudes, the speech sending end computes a statistic over the initial amplitudes of the initial frequency points in the first intermediate feature information and uses it as the target amplitude of each target frequency point of the current target sub-band. For phases, the sending end obtains the target phase of each target frequency point from the initial phases of the initial frequency points in the second intermediate feature information: it can take the initial phase of the initial frequency point whose frequency coincides with the target frequency point as the target phase, i.e. the target phase follows the original phase. The statistic may be an arithmetic mean, a weighted mean, and so on.

For example, the sending end can compute the arithmetic mean of the initial amplitudes of the initial frequency points in the first intermediate feature information and use it as the target amplitude of each target frequency point of the current target sub-band.

The sending end can also compute a weighted mean of those initial amplitudes and use it as the target amplitude. For example, since the center frequency point of a band is generally more important, the sending end can assign a higher weight to the initial amplitude of the center point of a band and lower weights to the other points, then take the weighted average of the initial amplitudes.

The sending end can also further subdivide the initial sub-band corresponding to the current target sub-band and the current target sub-band itself, obtaining at least two ordered first sub-bands of the initial sub-band and at least two ordered second sub-bands of the current target sub-band. It can associate first and second sub-bands by their ordering and use the statistic of the initial amplitudes of the frequency points in the current first sub-band as the target amplitude of the frequency points in its associated second sub-band. For example, for the current target sub-band 6-6.4 kHz with initial sub-band 6-8 kHz, dividing both equally gives two first sub-bands (6-7 kHz and 7-8 kHz) and two second sub-bands (6-6.2 kHz and 6.2-6.4 kHz), with 6-7 kHz associated to 6-6.2 kHz and 7-8 kHz to 6.2-6.4 kHz. The arithmetic mean of the initial amplitudes of the frequency points in 6-7 kHz becomes the target amplitude of the points in 6-6.2 kHz, and the arithmetic mean of the initial amplitudes in 7-8 kHz becomes the target amplitude of the points in 6.2-6.4 kHz.

In an embodiment, if the band of the initial frequency band feature information equals that of the intermediate frequency band feature information, the number of initial frequency points equals the number of target frequency points. For example, both bands are 24 kHz; in both, the amplitudes and phases of the points over 0-6 kHz are the same. In the intermediate information, the target amplitudes of the points over 6-8 kHz are computed from the initial amplitudes of the points over 6-24 kHz of the initial information, the target phases of the points over 6-8 kHz follow the initial phases of the points over 6-8 kHz, and the target amplitudes and phases of the points over 8-24 kHz are zero.

If the band of the initial information is larger than that of the intermediate information, the number of initial frequency points is larger than the number of target frequency points. Further, the ratio of the numbers of points can equal the ratio of the band widths, to ease the conversion of amplitudes and phases between points. For example, with an initial band of 24 kHz and an intermediate band of 12 kHz, there may be 1024 initial frequency points and 512 target frequency points. In both, the points over 0-6 kHz keep the same amplitudes and phases; in the intermediate information, the target amplitudes of the points over 6-12 kHz are computed from the initial amplitudes of the points over 6-24 kHz of the initial information, and the target phases of the points over 6-12 kHz follow the initial phases of the points over 6-12 kHz.

In the above embodiment, in the target feature information of the compressed band, the amplitude of a target frequency point is a statistic of the amplitudes of the corresponding initial frequency points, reflecting their average level, and the phase of a target frequency point follows the original phase, which further reduces the difference between the initial feature information of the second band and the target feature information of the compressed band. Subsequent band extension can thus restore a target speech signal highly similar to the speech signal to be processed. Reusing the original phases also reduces computation and improves the efficiency of determining the target feature information.
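A minimal sketch of the compression rule just described: every target frequency point of a target sub-band receives a statistic (here the arithmetic mean) of the initial sub-band's amplitudes, while each target phase reuses the original phase at the same frequency. The function and argument names are illustrative, not from the patent:

```python
import numpy as np

def compress_subband(init_amp: np.ndarray, init_phase_same_freq: np.ndarray):
    """init_amp: amplitudes of the initial sub-band (e.g. the 6-8 kHz bins);
    init_phase_same_freq: phases of the bins that share the target
    sub-band's frequencies (e.g. the 6-6.4 kHz bins)."""
    # one statistic (arithmetic mean) becomes the amplitude of every target bin
    target_amp = np.full(len(init_phase_same_freq), np.mean(init_amp))
    target_phase = init_phase_same_freq.copy()   # phases are reused as-is
    return target_amp, target_phase

amp, phase = compress_subband(np.array([4.0, 2.0, 6.0]), np.array([0.1, 0.2]))
print(amp)    # [4. 4.]
```

A weighted mean (e.g. emphasizing the center frequency point) could replace `np.mean` without changing the structure.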
In an embodiment, obtaining the intermediate frequency band feature information based on the target feature information corresponding to the first frequency band and to the compressed frequency band, and obtaining the compressed speech signal corresponding to the speech signal to be processed based on the intermediate frequency band feature information, includes:

determining a third frequency band based on the frequency difference between the compressed frequency band and the second frequency band, and setting the target feature information of the third band to invalid information; obtaining the intermediate frequency band feature information from the target feature information of the first band, of the compressed band and of the third band; performing inverse Fourier transform processing on the intermediate frequency band feature information to obtain an intermediate speech signal whose sampling rate is consistent with that of the speech signal to be processed; and downsampling the intermediate speech signal based on the supported sampling rate to obtain the compressed speech signal.

Here, the third frequency band is composed of the frequencies between the maximum frequency of the compressed band and the maximum frequency of the second band. Inverse Fourier transform processing performs an inverse Fourier transform on the intermediate band feature information, converting the frequency-domain signal into a time-domain signal; both the intermediate speech signal and the compressed speech signal are time-domain signals.

Downsampling means filtering and resampling the speech signal in the time domain. For example, a sampling rate of 48 kHz means 48k points are collected per second, and a rate of 16 kHz means 16k points per second.

Specifically, to speed up the conversion between frequency- and time-domain signals, during band compression the speech sending end can keep the number of frequency points unchanged and modify the amplitudes and phases of some of them to obtain the intermediate band feature information. The sending end can then quickly inverse-Fourier-transform the intermediate information into the intermediate speech signal, whose sampling rate matches that of the speech signal to be processed, and downsample the intermediate signal to the encoder's supported sampling rate or below to obtain the compressed speech signal. In the intermediate information, the target feature information of the first band follows the initial feature information of the first band, the target feature information of the compressed band is obtained from the initial feature information of the second band, and the target feature information of the third band is set to invalid information, i.e. cleared to zero.

In the above embodiment, when processing the frequency-domain signal, keeping the frequency band unchanged, converting the frequency-domain signal into a time-domain signal and then lowering the sampling rate by downsampling reduces the complexity of frequency-domain signal processing.
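The keep-the-band-then-downsample step can be sketched as follows (illustrative only; names are hypothetical): the frequency points above the compressed band are zeroed, an inverse FFT returns a time-domain signal at the original 48 kHz rate, and decimation lowers it to the encoder's supported 16 kHz rate (48 kHz to 16 kHz is a factor of 3).

```python
import numpy as np

def to_compressed_signal(spectrum: np.ndarray, keep_bins: int, factor: int):
    """Zero the bins above the compressed band (the 'third band'), return to
    the time domain, then downsample by an integer factor."""
    mid = spectrum.copy()
    mid[keep_bins:] = 0                   # third band set to invalid (zero)
    intermediate = np.fft.irfft(mid)      # same sampling rate as the input
    # naive decimation; the zeroed high bins act as the anti-alias filter here
    return intermediate[::factor]

# 513 bins of a 1024-sample, 48 kHz frame; ~8/24 of them cover 0-8 kHz
spec = np.fft.rfft(np.random.randn(1024))
compressed = to_compressed_signal(spec, keep_bins=171, factor=3)
print(len(compressed))   # 342
```

A production resampler would normally apply an explicit low-pass filter before decimating; here the spectral zeroing already removed the content above the compressed band.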
In an embodiment, encoding the compressed speech signal by the speech encoding module to obtain the coded speech data corresponding to the speech signal to be processed includes:

performing speech encoding on the compressed speech signal through the speech encoding module to obtain first speech data, and performing channel encoding on the first speech data to obtain the coded speech data.

Here, speech encoding compresses the data rate of the speech signal and removes redundancy from it: it encodes the analog speech signal into a digital signal, lowering the transmission bit rate for digital transmission. Speech encoding is also called source encoding. Note that speech encoding does not change the sampling rate of the signal: decoding the resulting bitstream can fully restore the pre-encoding speech signal. Band compression, by contrast, does change the sampling rate, and band extension cannot restore the pre-compression signal exactly; but the semantic content conveyed before and after band compression is the same, so the listener's understanding is unaffected. The sending end may use waveform coding, parametric (source) coding, hybrid coding or other speech coding schemes for the compressed signal.

Channel encoding improves the stability of data transmission. Because mobile communication and network transmission suffer interference and fading, errors may occur during speech transmission, so error-correction and error-detection coding is applied to the digital signal to strengthen its resistance to the various kinds of interference in the channel and improve the reliability of speech transmission. The error-correction/detection coding of the digital signal to be sent over the channel is channel encoding; the sending end may use convolutional codes, Turbo codes or other channel coding schemes for the first speech data.

Specifically, during encoding, the sending end performs speech encoding on the compressed signal through the speech encoding module to obtain the first speech data, then channel encoding on the first speech data to obtain the coded speech data. It can be understood that the speech encoding module may integrate only a speech coding algorithm, in which case the sending end performs speech encoding through the module and channel encoding through another module or software program; the module may also integrate both algorithms, performing speech encoding to obtain the first speech data and then channel encoding to obtain the coded speech data.

In the above embodiment, performing speech encoding and channel encoding on the compressed speech signal reduces the amount of data transmitted and safeguards the stability of speech transmission.
In an embodiment, the method further includes:

sending the coded speech data to the speech receiving end, so that the receiving end performs speech restoration processing on the coded speech data to obtain the target speech signal corresponding to the speech signal to be processed, the target speech signal being used for playback.

Here, the speech receiving end is a device for speech decoding; it can receive the speech data sent by the sending end and decode and play it. Speech restoration processing restores the coded speech data to a playable speech signal, for example restoring the decoded low-sampling-rate signal to a high-sampling-rate one, or decoding the small bitstream data into the larger speech signal.

Specifically, the sending end sends the coded speech data to the receiving end. After receiving it, the receiving end performs speech restoration processing on it to obtain the target speech signal corresponding to the speech signal to be processed, and plays the target signal.

During restoration, the receiving end may simply decode the coded data into the compressed speech signal and play it as the target signal. Although the sampling rate of the compressed signal is lower than that of the originally captured speech signal, the two convey the same semantic content and the compressed signal is still intelligible to the listener.

Of course, to further improve the clarity and intelligibility of playback, the receiving end may decode the coded data into the compressed signal and then restore the low-sampling-rate compressed signal to a high-sampling-rate signal, taking the restored signal as the target signal. In this case the target signal is the signal obtained by band-extending the compressed signal, and its sampling rate matches that of the speech signal to be processed. It can be understood that some information is lost during band compression, so the target signal restored by band extension is not exactly identical to the original speech signal; but the semantic content of the two is the same, and compared with the compressed signal, the target signal has a wider band, richer information, better sound quality and clear, intelligible sound.

In the above embodiment, the coded speech data can be applied to voice communication and voice transmission: compressing a high-sampling-rate speech signal into a low-sampling-rate one before transmission reduces the cost of speech transmission.
在一个实施例中,将编码语音数据发送至语音接收端,以使语音接收端对编码语音数据进行语音还原处理,得到待处理语音信号对应的目标语音信号,并播放目标语音信号,包括:
基于第二频段和压缩频段得到待处理语音信号对应的压缩标识信息;将编码语音数据和压缩标识信息发送至语音接收端,以使语音接收端对编码语音数据进行解码处理得到压缩语音信号,基于压缩标识信息对压缩语音信号进行频带扩展,得到目标语音信号。
其中,压缩标识信息用于标识第二频段和压缩频段之间的频段映射信息。频段映射信息包括第二频段和压缩频段的大小、第二频段和压缩频段的子频段之间的映射关系(对应关系、关联关系)。频带扩展可以在保持语音内容可懂的情况下,提高语音信号的采样率。频带扩展是指将小频带的语音信号扩展为大频带的语音信号,其中,小频带的语音信号和大频带的语音信号之间具有相同的低频信息。
具体地,语音接收端接收到编码语音数据后,可以默认编码语音数据经过了频带压缩,自动对编码语音数据进行解码处理得到压缩语音信号,对压缩语音信号进行频带扩展,得到目标语音信号。但是考虑到兼容传统语音处理方法以及特征压缩时频段映射信息的多样性,语音发送端在将编码语音数据发送至语音接收端时,可以同步将压缩标识信息发送至语音接收端,以便语音接收端快速识别该编码语音数据是否经过频带压缩,以及进行频带压缩时的频段映射信息,从而决定对编码语音数据是直接解码播放,还是解码后需要经过对应的频段扩展才进行播放。在一个实施例中,为了节省语音发送端的计算资源,针对采样率原本就小于或等于语音编码器的支持采样率的语音信号,语音发送端可以选择采用传统语音处理方法直接编码处理后发送至语音接收端。
若语音发送端对待处理语音信号进行了频带压缩,语音发送端可以基于第二频段和压缩频段生成待处理语音信号对应的压缩标识信息,将编码语音数据和压缩标识信息发送至语音接收端,以便语音接收端基于压缩标识信息对应的频段映射信息对压缩语音信号进行频带扩展,得到目标语音信号。压缩语音信号是语音接收端对编码语音数据进行解码处理得到的。
此外,若语音发送端和语音接收端之间约定了默认的频段映射信息,在基于第二频段和压缩频段生成待处理语音信号对应的压缩标识信息时,语音发送端就可以直接获取预先约定的特殊标识作为压缩标识信息,特殊标识用于标识压缩语音信号是基于默认的频段映射信息进行频带压缩得到的。语音接收端接收到编码语音数据和压缩标识信息后,可以对编码语音数据进行解码处理得到压缩语音信号,基于默认的频段映射信息对压缩语音信号进行频带扩展,得到目标语音信号。若语音发送端和语音接收端之间存储有多种频段映射信息,语音发送端和语音接收端之间可以约定各种频段映射信息分别对应的预设标识。不同的频段映射信息可以是第二频段和压缩频段的大小不同,子频段的划分方法不同等。在基于第二频段和压缩频段生成待处理语音信号对应的压缩标识信息时,语音发送端可以基于第二频段和压缩频段在进行特征压缩时所使用的频段映射信息获取对应的预设标识作为压缩标识信息。语音接收端接收到编码语音数据和压缩标识信息后,可以基于该压缩标识信息对应的频段映射信息对解码得到的压缩语音信号进行频带扩展,得到目标语音信号。当然,压缩标识信息也可以直接包括具体的频段映射信息。
可以理解,对压缩语音信号进行频带扩展的具体过程可以参照后续语音解码方法中各个相关实施例所述的方法,例如步骤S506至步骤S510所述的方法。
在一个实施例中,针对不同的应用程序可以设计专用的频段映射信息。例如,针对音质要求高的应用程序(例如唱歌应用程序)可以设计在特征压缩时采用数量较多的子频段,从而最大限度地保留原始语音信号的整体频域特征、频点幅值的整体变化趋势。针对音质要求低的应用程序(例如即时通信应用程序)可以设计在特征压缩时采用数量较少的子频段,从而在保障语义可懂的情况下加快压缩速度。因此,压缩标识信息也可以是应用程序标识。语音接收端接收到编码语音数据和压缩标识信息后,可以基于应用程序标识对应的频段映射信息对解码得到的压缩语音信号进行对应的频带扩展,得到目标语音信号。
上述实施例中,将编码语音数据和压缩标识信息发送至语音接收端,可以使语音接收端比较准确地对解码得到的压缩语音信号进行频带扩展,得到还原度高的目标语音信号。
在一个实施例中,如图5所示,提供了一种语音解码方法,以该方法应用于图1中的语音接收端为例进行说明,包括以下步骤:
步骤S502,获取编码语音数据,编码语音数据是对待处理语音信号进行语音压缩处理得到的。
其中,语音压缩处理用于将待处理语音信号压缩为可以传输的码流数据,例如,将高采样率的语音信号压缩为低采样率的语音信号,再将低采样率的语音信号编码为码流数据,或者将数据量大的语音信号编码为数据量小的码流数据。
具体地,语音接收端获取编码语音数据,其中,编码语音数据可以是语音接收端对待处理语音信号进行编码处理得到的,也可以是语音接收端接收语音发送端发送的。编码语音数据可以是对待处理语音信号进行编码处理得到的,也可以是对待处理语音信号进行频带压缩得到压缩语音信号,对压缩语音信号进行编码处理得到的。
步骤S504,通过语音解码模块对编码语音数据进行解码处理得到解码语音信号,解码语音信号对应的目标采样率小于或等于语音解码模块对应的支持采样率。
其中,语音解码模块是用于对语音信号进行解码处理的模块。语音解码模块可以是硬件,也可以是软件。语音编码模块和语音解码模块可以集成在一个模块上。语音解码模块对应的支持采样率是指语音解码模块支持的最大采样率,也就是采样率上限。可以理解,若语音解码模块对应的支持采样率为16khz,那么语音解码模块可以对采样率小于或等于16khz的语音信号进行解码处理。
具体地,语音接收端获取到编码语音数据后,可以通过语音解码模块对编码语音数据进行解码处理得到解码语音信号,还原出编码前的语音信号。语音解码模块支持处理采样率小于或等于采样率上限的语音信号。解码语音信号为时域信号。
可以理解,若编码语音数据是在语音接收端本地生成的,语音接收端对编码语音数据进行解码处理也可以是对编码语音数据进行语音解码得到解码语音信号。
步骤S506,生成解码语音信号对应的目标频带特征信息,基于目标频带特征信息中第一频段对应的目标特征信息得到第一频段对应的扩展特征信息。
其中,解码语音信号对应的目标频带包括第一频段和压缩频段,第一频段的频率小于压缩频段的频率。语音接收端可以将目标频带特征信息划分为第一频段对应的目标特征信息和压缩频段对应的目标特征信息。也就是,可以将目标频带特征信息划分为低频段对应的目标特征信息和高频段对应的目标特征信息。目标特征信息是指频带扩展前各个频率对应的特征信息,扩展特征信息是指频带扩展后各个频率对应的特征信息。
具体地,语音接收端可以提取解码语音信号的频域特征,将时域信号转换为频域信号,得到解码语音信号对应的目标频带特征信息。可以理解,若待处理语音信号的采样率高于语音编码模块对应的支持采样率,那么语音编码端是对待处理语音信号进行了频带压缩来降低待处理语音信号的采样率,此时语音接收端就需要对解码语音信号进行频带扩展,从而还原出高采样率的待处理语音信号,此时,解码语音信号为压缩语音信号。若待处理语音信号没有经过频带压缩,语音接收端也可以对解码语音信号进行频带扩展,提高解码语音信号的采样率和丰富频域信息。
在进行频带扩展时,为了保障语义内容保持不变、自然可懂,语音接收端可以保持低频信息不变,对高频信息进行扩展。因此,语音接收端可以基于目标频带特征信息中第一频段对应的目标特征信息得到第一频段对应的扩展特征信息,即将目标频带特征信息中第一频段对应的目标特征信息作为扩展频带特征信息中第一频段对应的扩展特征信息。也就是,频带扩展前后,低频信息保持不变,低频信息是一致的。与编码端同理,语音接收端可以基于预设频率将目标频带划分为第一频段和压缩频段。
步骤S508,对目标频带特征信息中压缩频段对应的目标特征信息进行特征扩展,得到第二频段对应的扩展特征信息;第一频段的频率小于压缩频段的频率,压缩频段的频率区间小于第二频段的频率区间。
其中,特征扩展是为了将小频段对应的特征信息扩展到大频段对应的特征信息中,丰富特征信息。压缩频段代表小频段,第二频段代表大频段,即压缩频段的频率区间小于第二频段的频率区间,也就是,压缩频段的长度小于第二频段的长度。
具体地,在进行频带扩展时,语音接收端主要是对语音信号中的高频信息进行扩展。语音接收端可以对目标频带特征信息中压缩频段对应的目标特征信息进行特征扩展,得到第二频段对应的扩展特征信息。
在一个实施例中,目标频带特征信息包括多个目标语音频点对应的幅值和相位。在进行特征扩展时,语音接收端可以对目标频带特征信息中压缩频段对应的目标语音频点的幅值进行复制得到第二频段对应的初始语音频点的幅值,对目标频带特征信息中压缩频段对应的目标语音频点的相位进行复制或随机赋值得到第二频段对应的初始语音频点的相位,从而得到第二频段对应的扩展特征信息。对幅值进行复制除了整体复制,还可以进一步分段复制。
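上述"复制幅值、相位沿用或随机赋值"的思路可以用如下示意代码勾勒(基于numpy;为简化起见采用等比例拉伸代替前文的分段子频段映射,函数名与频点宽度均为示例性假设,并非本申请的具体实现):

```python
import numpy as np

def expand_band(amps, phases, comp_lo, comp_hi, tgt_hi, bin_hz):
    """示意:低频段保持不变,压缩频段的幅值等比例拉伸到更宽的第二频段,
    新增频点的相位随机赋值。"""
    n_bins = int(tgt_hi / bin_hz) + 1
    lo, hi = int(comp_lo / bin_hz), int(comp_hi / bin_hz)
    out_amp = np.zeros(n_bins)
    out_ph = np.random.uniform(-np.pi, np.pi, n_bins)   # 默认:随机相位
    out_amp[:lo], out_ph[:lo] = amps[:lo], phases[:lo]  # 第一频段沿用原信息
    factor = (n_bins - lo) // max(hi - lo, 1)           # 等比例拉伸倍数
    out_amp[lo:lo + (hi - lo) * factor] = np.repeat(amps[lo:hi], factor)
    return out_amp, out_ph

# 频点宽度为 2khz:0-6khz 保持不变,6-8khz 拉伸到 6-24khz
amp, ph = expand_band(np.arange(5.0), np.zeros(5), 6000, 8000, 24000, 2000)
print(amp)  # 前 3 个频点不变,其余频点的幅值均复制自压缩频段
```

如前文所述,实际还可以对幅值进一步分段复制,并按子频段映射关系选择相位的处理方式。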
步骤S510,基于第一频段对应的扩展特征信息和第二频段对应的扩展特征信息得到扩展频带特征信息,基于扩展频带特征信息得到待处理语音信号对应的目标语音信号,目标语音信号的采样率大于目标采样率,目标语音信号用于播放。
其中,扩展频带特征信息是指对目标频带特征信息进行扩展后得到的特征信息。目标语音信号是指解码语音信号进行频带扩展后得到的语音信号。频带扩展可以在保持语音内容可懂的情况下,提高语音信号的采样率。可以理解,目标语音信号的采样率大于解码语音信号对应的采样率。
具体地,语音接收端基于第一频段对应的扩展特征信息和第二频段对应的扩展特征信息得到扩展频带特征信息。扩展频带特征信息是频域信号,在得到扩展频带特征信息后,语音接收端可以将频域信号转换为时域信号,从而得到目标语音信号。例如,语音接收端对扩展频带特征信息进行傅里叶反变换处理,得到目标语音信号。
举例说明,解码语音信号的采样率为16khz,目标频带为0-8khz。语音接收端可以从目标频带特征信息中获取0-6khz对应的目标特征信息,将0-6khz对应的目标特征信息直接作为0-6khz对应的扩展特征信息。语音接收端可以从目标频带特征信息中获取6-8khz对应的目标特征信息,将6-8khz对应的目标特征信息扩展为6-24khz对应的扩展特征信息。语音接收端基于0-24khz对应的扩展特征信息可以生成目标语音信号,目标语音信号对应的采样率为48khz。
目标语音信号用于播放,在得到目标语音信号后,语音接收端可以通过扬声器播放目标语音信号。
上述语音解码方法中,通过获取编码语音数据,编码语音数据是对待处理语音信号进行语音压缩处理得到的,通过语音解码模块对编码语音数据进行解码处理得到解码语音信号,解码语音信号对应的目标采样率小于或等于语音解码模块对应的支持采样率,生成解码语音信号对应的目标频带特征信息,基于目标频带特征信息中第一频段对应的目标特征信息得到第一频段对应的扩展特征信息,对目标频带特征信息中压缩频段对应的目标特征信息进行特征扩展,得到第二频段对应的扩展特征信息;第一频段的频率小于压缩频段的频率,压缩频段的频率区间小于第二频段的频率区间,基于第一频段对应的扩展特征信息和第二频段对应的扩展特征信息得到扩展频带特征信息,基于扩展频带特征信息得到待处理语音信号对应的目标语音信号,目标语音信号的采样率大于目标采样率,目标语音信号用于播放。这样,获取到经过语音压缩处理得到的编码语音数据后,可以对编码语音数据进行解码处理得到解码语音信号,通过频带特征信息的扩展,可以将解码语音信号的采样率升高,得到用于播放的目标语音信号。语音信号的播放并不会受制于语音解码器所支持的采样率,在语音播放时,也可以播放信息更丰富的高采样率语音信号。
在一个实施例中,通过语音解码模块对编码语音数据进行解码处理得到解码语音信号,包括:
对编码语音数据进行信道解码,得到第二语音数据;通过语音解码模块对第二语音数据进行语音解码,得到解码语音信号。
具体地,信道解码可以认为是信道编码的逆过程。语音解码可以认为是语音编码的逆过程。语音接收端在对编码语音数据进行解码处理时,先对编码语音数据进行信道解码,得到第二语音数据,再通过语音解码模块对第二语音数据进行语音解码,得到解码语音信号。可以理解,语音解码模块可以只集成有语音解码算法,那么语音接收端可以通过其他模块、软件程序对编码语音数据进行信道解码,再通过语音解码模块对第二语音数据进行语音解码。语音解码模块也可以同时集成有语音解码算法和信道解码算法,那么语音接收端可以通过语音解码模块对编码语音数据进行信道解码得到第二语音数据,通过语音解码模块对第二语音数据进行语音解码得到解码语音信号。
上述实施例中,基于信道解码和语音解码,可以将二进制数据还原为时域信号,得到语音信号。
在一个实施例中,对目标频带特征信息中压缩频段对应的目标特征信息进行特征扩展,得到第二频段对应的扩展特征信息,包括:
获取频段映射信息,频段映射信息用于确定压缩频段对应的至少两个目标子频段和第二频段对应的至少两个初始子频段之间的映射关系;基于频段映射信息对目标频带特征信息中压缩频段对应的目标特征信息进行特征扩展,得到第二频段对应的扩展特征信息。
其中,频段映射信息用于确定压缩频段对应的至少两个目标子频段和第二频段对应的至少两个初始子频段之间的映射关系。在进行特征压缩时,语音编码端是基于该映射关系对初始频带特征信息中第二频段对应的初始特征信息进行特征压缩,得到压缩频段对应的目标特征信息。那么,在进行特征扩展时,语音解码端基于该映射关系对目标频带特征信息中压缩频段对应的目标特征信息进行特征扩展,才能最大限度还原出第二频段对应的初始特征信息,得到第二频段对应的扩展特征信息。
具体地,语音接收端可以获取频段映射信息,基于频段映射信息对目标频带特征信息中压缩频段对应的目标特征信息进行特征扩展,得到第二频段对应的扩展特征信息。语音接收端和语音发送端可以预先约定默认的频段映射信息。语音发送端基于默认的频段映射信息进行特征压缩,语音接收端基于默认的频段映射信息进行特征扩展。语音接收端和语音发送端也可以预先约定多种候选的频段映射信息。语音发送端从中选择一种频段映射信息进行特征压缩,并生成压缩标识信息发送至语音接收端,从而语音接收端可以基于压缩标识信息确定对应的频段映射信息,进而基于该频段映射信息进行特征扩展。无论解码语音信号是否经过频带压缩,语音接收端都可以直接默认解码语音信号是经过频带压缩才得到的语音信号,此时频段映射信息可以是预先设置的、统一的频段映射信息。
上述实施例中,基于频段映射信息对目标频带特征信息中压缩频段对应的目标特征信息进行特征扩展,得到第二频段对应的扩展特征信息,能够得到比较准确的扩展特征信息,有助于得到还原度较高的目标语音信号。
在一个实施例中,编码语音数据携带压缩标识信息。获取频段映射信息,包括:
基于压缩标识信息获取频段映射信息。
具体地,语音接收端在进行频带压缩时,可以基于特征压缩时所采用的频段映射信息生成压缩标识信息,将压缩语音信号对应的编码语音数据和对应的压缩标识信息进行关联,从而后续在进行频带扩展时,语音接收端可以基于编码语音数据携带的压缩标识信息获取相应的频段映射信息,基于频段映射信息对解码得到的解码语音信号进行频带扩展。例如,语音发送端在进行频带压缩时,可以基于特征压缩时所采用的频段映射信息生成压缩标识信息,后续语音发送端将编码语音数据和压缩标识信息一并发送至语音接收端。语音接收端就可以基于压缩标识信息获取频段映射信息对解码得到的解码语音信号进行频带扩展。
上述实施例中,基于压缩标识信息可以确定解码语音信号是经过频带压缩得到的,可以快速获取到正确的频段映射信息,从而还原出比较准确的目标语音信号。
在一个实施例中,基于频段映射信息对目标频带特征信息中压缩频段对应的目标特征信息进行特征扩展,得到第二频段对应的扩展特征信息,包括:
将当前初始子频段对应的当前目标子频段的目标特征信息作为第三中间特征信息,从目标频带特征信息中,获取与当前初始子频段的频段信息一致的子频段对应的目标特征信息作为第四中间特征信息,基于第三中间特征信息和第四中间特征信息得到当前初始子频段对应的扩展特征信息;基于各个初始子频段对应的扩展特征信息得到第二频段对应的扩展特征信息。
具体地,语音接收端基于频段映射信息就可以确定压缩频段对应的至少两个目标子频段和第二频段对应的至少两个初始子频段之间的映射关系,从而基于各个目标子频段对应的目标特征信息进行特征扩展可以得到各个目标子频段分别对应的初始子频段的扩展特征信息,最终得到第二频段对应的扩展特征信息。当前初始子频段是指当前待生成扩展特征信息的初始子频段。在生成当前初始子频段对应的扩展特征信息时,语音接收端可以将当前初始子频段对应的当前目标子频段的目标特征信息作为第三中间特征信息,第三中间特征信息用于确定当前初始子频段对应的扩展特征信息中频点的幅值,语音接收端可以从目标频带特征信息中,获取与当前初始子频段的频段信息一致的子频段对应的目标特征信息作为第四中间特征信息,第四中间特征信息用于确定当前初始子频段对应的扩展特征信息中频点的相位。因此,语音接收端可以基于第三中间特征信息和第四中间特征信息得到当前初始子频段对应的扩展特征信息。在得到各个初始子频段对应的扩展特征信息后,语音接收端可以基于各个初始子频段对应的扩展特征信息得到第二频段对应的扩展特征信息,由各个初始子频段对应的扩展特征信息组成第二频段对应的扩展特征信息。
举例说明,目标频带特征信息包括0-8khz对应的目标特征信息。当前初始子频段为6-8khz,当前初始子频段对应的目标子频段为6-6.4khz。语音接收端可以基于目标频带特征信息中6-6.4khz对应的目标特征信息和6-8khz对应的目标特征信息得到6-8khz对应的扩展特征信息。
上述实施例中,通过对压缩频段和第二频段进一步细分来进行特征扩展,能够提高特征扩展的可靠性,降低第二频段对应的扩展特征信息和第二频段对应的初始特征信息之间的差异。这样,最终能够还原出与待处理语音信号相似度比较高的目标语音信号。
在一个实施例中,第三中间特征信息和第四中间特征信息均包括多个目标语音频点对应的目标幅值和目标相位。基于第三中间特征信息和第四中间特征信息得到当前初始子频段对应的扩展特征信息,包括:
基于第三中间特征信息中各个目标语音频点对应的目标幅值,得到当前初始子频段对应的各个初始语音频点的参考幅值;当第四中间特征信息为空时,对当前初始子频段对应的各个初始语音频点的相位增加随机扰动值,得到当前初始子频段对应的各个初始语音频点的参考相位;当第四中间特征信息不为空时,基于第四中间特征信息中各个目标语音频点对应的目标相位得到当前初始子频段对应的各个初始语音频点的参考相位;基于当前初始子频段对应的各个初始语音频点的参考幅值和参考相位,得到当前初始子频段对应的扩展特征信息。
具体地,针对频点的幅值,语音接收端可以将第三中间特征信息中各个目标语音频点对应的目标幅值作为当前初始子频段对应的各个初始语音频点的参考幅值。针对频点的相位,若第四中间特征信息为空,语音接收端对当前目标子频段对应的各个目标语音频点的目标相位加上随机扰动值,得到当前初始子频段对应的各个初始语音频点的参考相位。可以理解,若第四中间特征信息为空,说明在目标频带特征信息中当前初始子频段是不存在的,这部分是没有能量的,其相位也是没有的,但是从频域信号转为时域信号需要频点具备幅值和相位,幅值可以通过复制得到,相位则可以加上随机扰动值得到。并且,人耳对高频相位不敏感,对高频部分的相位随机赋值影响不大。若第四中间特征信息不为空,语音接收端可以从第四中间特征信息中获取与初始语音频点的频率一致的目标语音频点的目标相位作为初始语音频点的参考相位,也就是,初始语音频点对应的参考相位可以沿用原相位。其中,随机扰动值为随机的相位值。可以理解,参考相位的数值需要在相位的取值范围内。
举例说明,目标频带特征信息包括0-8khz对应的目标特征信息,扩展频带特征信息包括0-24khz对应的扩展特征信息。若当前初始子频段为6-8khz,当前初始子频段对应的目标子频段为6-6.4khz,则语音接收端可以将6-6.4khz对应的各个目标语音频点的目标幅值作为6-8khz对应的各个初始语音频点的参考幅值,将6-6.4khz对应的各个目标语音频点的目标相位作为6-8khz对应的各个初始语音频点的参考相位。若当前初始子频段为8-10khz,当前初始子频段对应的目标子频段为6.4-6.8khz,则语音接收端可以将6.4-6.8khz对应的各个目标语音频点的目标幅值作为8-10khz对应的各个初始语音频点的参考幅值,将6.4-6.8khz对应的各个目标语音频点的目标相位加上随机扰动值作为8-10khz对应的各个初始语音频点的参考相位。
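上例中按子频段映射进行幅值复制与相位处理的过程,可以用如下示意代码表达(基于numpy;BAND_MAP 为假设的子频段映射关系,频点宽度、函数名均为示例;为简化起见,新增频点的相位直接随机赋值,实际也可如前文所述在原相位基础上加随机扰动值):

```python
import numpy as np

# 子频段映射(khz):压缩后的目标子频段 -> 扩展后的初始子频段
BAND_MAP = [((6.0, 6.4), (6.0, 8.0)), ((6.4, 6.8), (8.0, 10.0)),
            ((6.8, 7.2), (10.0, 12.0)), ((7.2, 7.6), (12.0, 18.0)),
            ((7.6, 8.0), (18.0, 24.0))]

def expand_subbands(amp, ph, bin_hz, out_bins):
    """示意:各目标子频段的幅值复制到对应的初始子频段;
    8khz 以内的频点沿用原相位,8khz 以上的新增频点相位随机赋值。"""
    out_amp, out_ph = np.zeros(out_bins), np.zeros(out_bins)
    n_low = int(6000 / bin_hz)
    out_amp[:n_low], out_ph[:n_low] = amp[:n_low], ph[:n_low]  # 第一频段不变
    for (src_lo, src_hi), (dst_lo, dst_hi) in BAND_MAP:
        s0, s1 = int(src_lo * 1000 / bin_hz), int(src_hi * 1000 / bin_hz)
        d0, d1 = int(dst_lo * 1000 / bin_hz), int(dst_hi * 1000 / bin_hz)
        out_amp[d0:d1] = np.repeat(amp[s0:s1], (d1 - d0) // (s1 - s0))
        if dst_hi * 1000 <= 8000:    # 该频段原本存在:沿用原相位
            out_ph[d0:d1] = ph[d0:d1]
        else:                        # 该频段为新增:相位随机赋值
            out_ph[d0:d1] = np.random.uniform(-np.pi, np.pi, d1 - d0)
    return out_amp, out_ph

# 频点宽度 400hz:压缩谱 0-8khz 共 20 个频点,扩展为 0-24khz 共 60 个频点
out_amp, out_ph = expand_subbands(np.arange(20.0), np.full(20, 0.5), 400, 60)
```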
在一个实施例中,扩展频带特征信息中初始语音频点的数量可以等于初始频带特征信息中初始语音频点的数量。扩展频带特征信息中第二频段对应的初始语音频点的数量大于目标频带特征信息中压缩频段对应的目标语音频点的数量,并且,初始语音频点和目标语音频点的数量比值为扩展频带特征信息与目标频带特征信息的频带比值。
上述实施例中,在第二频段对应的扩展特征信息中,初始语音频点的幅值为对应的目标语音频点的幅值,初始语音频点的相位沿用原相位或为随机值,能够降低第二频段对应的扩展特征信息和第二频段对应的初始特征信息之间的差异。
本申请还提供一种应用场景,该应用场景应用上述的语音编码、语音解码方法。具体地,该语音编码、语音解码方法在该应用场景的应用如下:
语音信号的编解码在现代通讯系统中占有重要的地位。语音信号的编解码可以有效降低语音信号传输的带宽,对于节省语音信息存储、传输成本,保障通信网络传输过程中的语音信息完整性方面起了决定性作用。
语音的清晰度与语谱频带有直接关系,传统固定电话是窄带语音,其采样率为8khz,音质较差,声音比较模糊,可懂度较低;而目前的VoIP(Voice over Internet Protocol,基于IP的语音传输)电话通常是宽带语音,其采样率为16khz,音质较好,声音清晰可懂;而更好的音质体验是超宽带甚至全带语音,其采样率可以达到48khz,声音的保真度更高。不同采样率下采用的语音编码器是不一样的或者是同一个编码器的不同模式,其对应的语音编码码流大小也是不同的。传统的语音编码器只支持处理特定采样率的语音信号,例如AMR-NB(Adaptive Multi Rate-Narrow Band Speech Codec,自适应多速率窄带语音编码)编码器就只支持8khz及以下的输入信号,AMR-WB(Adaptive Multi-Rate-Wideband Speech Codec,自适应多速率宽带语音编码)编码器只支持16khz及以下的输入信号。
此外,一般情况下采样率越高需要消耗的语音编码码流带宽越大。如果要更优质的语音体验,则需要提升语音频带,例如采样率从8khz提升到16khz甚至48khz等,但现有方案必须修改替换现有客户端、后台传输系统的语音编解码器,同时语音传输带宽增加,势必造成运营成本增加。可以理解,现有方案中端到端的语音采样率受制于语音编码器的设置,无法突破语音频带得到更好的音质体验,如果要提升音质体验,必须修改语音编解码器参数或替换为支持更高采样率的其它语音编解码器。这势必带来系统的升级、运营成本的增加,以及较大的开发工作量和开发周期。
但是,采用本申请的语音编码、语音解码方法,在无需改变现有通话系统的语音编解码和信号传输系统的前提下,可以升级现有通话系统的语音采样率,实现超越现有语音频带的通话体验,有效提升语音清晰度和可懂度,并且运营成本基本不受影响。
参考图6A,语音发送端采集高质量的语音信号,对语音信号进行非线性频带压缩处理,将原来高采样率的语音信号通过非线性频带压缩处理压缩成通话系统的语音编码器支持的低采样率的语音信号。语音发送端再对压缩后的语音信号进行语音编码、信道编码,最终通过网络传送到语音接收端。
1、非线性频带压缩处理
鉴于人耳对低频信号敏感,而对高频不敏感的特性,语音发送端可以把高频部分的信号进行频带压缩,例如,全带48khz信号(即采样率为48khz,频带范围在24khz以内)经过非线性频带压缩后,把所有频带信息都集中到16khz信号范围(即采样率为16khz,频带范围在8khz以内),而高于16khz采样范围的高频信号则抑制为零,然后经过降采样到16khz信号。经过非线性频带压缩处理得到的低采样率信号就可以使用常规的16khz的语音编码器进行编码得到码流数据。
以全带48khz信号为例,非线性频带压缩的实质是对语谱(即频谱)6khz以下的信号不做修改,仅对6khz~24khz的语谱信号进行压缩。若是将全带48khz信号压缩到16khz信号,在进行频带压缩时,频段映射信息可以如图6B所示。压缩前,语音信号的频带为0-24khz,第一频段为0-6khz,第二频段为6-24khz。第二频段可以进一步细分为6-8khz、8-10khz、10-12khz、12-18khz、18-24khz,共5个子频段。压缩后,语音信号的频带可以仍然为0-24khz,第一频段为0-6khz,压缩频段为6-8khz,第三频段为8-24khz。压缩频段可以进一步细分为6-6.4khz、6.4-6.8khz、6.8-7.2khz、7.2-7.6khz、7.6-8khz,共5个子频段。6-8khz与6-6.4khz对应,8-10khz与6.4-6.8khz对应,10-12khz与6.8-7.2khz对应,12-18khz与7.2-7.6khz对应,18-24khz与7.6-8khz对应。
首先对高采样率的语音信号进行快速傅里叶变换后得到各个频点的幅值及相位。第一频段的信息保持不变。将图6B左边各子频段中频点的幅值的统计值作为右边对应子频段中频点的幅值,右边子频段中频点的相位则可以沿用原有相位值。例如左边6khz-8khz中各频点幅值相加后求平均值,该平均值作为右边6khz-6.4khz中各频点的幅值,而右边6khz-6.4khz中各频点的相位值为原来的相位值。第三频段中频点的幅值和相位信息清零。右边0-24khz的频域信号经过反傅里叶变换和降采样处理得到压缩后的语音信号。参考图6C,(a)为压缩前的语音信号,(b)为压缩后的语音信号。图6C中上半部分为时域信号,下半部分为频域信号。
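上述按子频段求幅值均值、相位沿用、第三频段清零的压缩步骤,可以用如下示意代码表达(基于numpy;COMP_MAP 对应图6B所示的映射关系,频点宽度等均为示例性假设,并非本申请的具体实现):

```python
import numpy as np

# 子频段映射(khz):压缩前的子频段 -> 压缩后的子频段
COMP_MAP = [((6.0, 8.0), (6.0, 6.4)), ((8.0, 10.0), (6.4, 6.8)),
            ((10.0, 12.0), (6.8, 7.2)), ((12.0, 18.0), (7.2, 7.6)),
            ((18.0, 24.0), (7.6, 8.0))]

def compress_subbands(amp, ph, bin_hz):
    """示意:左边各子频段幅值的平均值作为右边对应子频段各频点的幅值;
    右边子频段各频点沿用原相位;8khz 以上(第三频段)清零。"""
    out_amp, out_ph = amp.copy(), ph.copy()
    for (src_lo, src_hi), (dst_lo, dst_hi) in COMP_MAP:
        s0, s1 = int(src_lo * 1000 / bin_hz), int(src_hi * 1000 / bin_hz)
        d0, d1 = int(dst_lo * 1000 / bin_hz), int(dst_hi * 1000 / bin_hz)
        out_amp[d0:d1] = amp[s0:s1].mean()  # 幅值取统计值(平均值)
        # out_ph[d0:d1] 保留原有相位值
    cut = int(8000 / bin_hz)
    out_amp[cut:] = 0.0                     # 第三频段:幅值和相位信息清零
    out_ph[cut:] = 0.0
    return out_amp, out_ph

# 频点宽度 200hz,0-24khz 共 120 个频点
c_amp, c_ph = compress_subbands(np.arange(120.0), np.full(120, 0.3), 200)
print(c_amp[30], c_amp[38])  # 34.5 104.5:分别为 6-8khz、18-24khz 的幅值均值
```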
可以理解,经过非线性频带压缩后的低采样率语音信号虽然清晰度不如原始高采样率语音信号,但声音信号自然可懂,不会有可感知的杂音和不适感,所以即使语音接收端为现网设备,在没有经过改造情况下也不妨碍通话体验。因此,本申请的方法具有较好的兼容性。
参考图6A,语音接收端接收到码流数据后,对码流数据进行信道解码、语音解码后,再通过非线性频带扩展处理,将低采样率的语音信号还原为高采样率的语音信号,最终对高采样率的语音信号进行播放。
2、非线性频带扩展处理
参考图6D,与非线性频带压缩处理相反,非线性频带扩展处理是将压缩后的6khz-8khz信号重新扩展到6khz-24khz的语谱信号,即傅里叶变换后,扩展前子频段中频点的幅值将作为扩展后对应子频段中频点的幅值,而相位则沿用原相位或将扩展前子频段中频点的相位值加随机扰动值。经过扩展后的频谱信号经过反傅里叶变换后可以得到高采样率的语音信号,虽然不是完美还原,但从听感上比较接近原始的高采样语音信号,主观体验上有显著提升。参考图6E,(a)为原始高采样率语音信号的频谱(即待处理语音信号对应的频谱信息),(b)为扩展后高采样语音信号的频谱(即目标语音信号对应的频谱信息)。
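扩展后的频谱信号经反傅里叶变换得到高采样率时域信号,这一步可以用如下基于numpy的示意代码表达(帧长、频点数均为示例性假设):

```python
import numpy as np

def spectrum_to_signal(amp, ph, n_samples):
    """示意:由扩展后的幅值与相位重建复数频谱,再经反傅里叶变换得到时域帧。"""
    spectrum = amp * np.exp(1j * ph)
    return np.fft.irfft(spectrum, n=n_samples)

# 0-24khz 的扩展频谱(61 个频点)-> 48khz 采样率下 120 点的时域帧
amp, ph = np.zeros(61), np.zeros(61)
amp[3] = 1.0  # 仅含一个单音分量
frame = spectrum_to_signal(amp, ph, 120)
print(frame.shape)  # (120,)
```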
本实施例中,在现有通话系统的基础上做少量改造就可以达成音质提升的效果,而且不对通话成本造成影响。通过本申请的语音编码、语音解码方法可以使原有的语音编解码器实现超频带编解码效果,实现超越现有语音频带的通话体验,有效提升语音清晰度和可懂度。
可以理解,本申请的语音编码、语音解码方法除了应用于语音通话,也可以应用于语音类的内容存储,例如视频里面的语音,还有语音消息等涉及语音编解码应用的场景。
应该理解的是,虽然图2、图3、图5的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,图2、图3、图5中的至少一部分步骤可以包括多个步骤或者多个阶段,这些步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤中的步骤或者阶段的至少一部分轮流或者交替地执行。
在一个实施例中,如图7A所示,提供了一种语音编码装置,该装置可以采用软件模块或硬件模块,或者是二者的结合成为计算机设备的一部分,该装置具体包括:频带特征信息获取模块702、第一目标特征信息确定模块704、第二目标特征信息确定模块706、压缩语音信号生成模块708和语音信号编码模块710,其中:
频带特征信息获取模块702,用于获取待处理语音信号对应的初始频带特征信息。
第一目标特征信息确定模块704,用于基于初始频带特征信息中第一频段对应的初始特征信息得到第一频段对应的目标特征信息。
第二目标特征信息确定模块706,用于对初始频带特征信息中第二频段对应的初始特征信息进行特征压缩,得到压缩频段对应的目标特征信息,第一频段的频率小于第二频段的频率,第二频段的频率区间大于压缩频段的频率区间。
压缩语音信号生成模块708,用于基于第一频段对应的目标特征信息和压缩频段对应的目标特征信息得到中间频带特征信息,基于中间频带特征信息得到待处理语音信号对应的压缩语音信号。
语音信号编码模块710,用于通过语音编码模块对压缩语音信号进行编码处理,得到待处理语音信号对应的编码语音数据,压缩语音信号对应的目标采样率小于或等于语音编码模块对应的支持采样率,目标采样率小于待处理语音信号对应的采样率。
上述语音编码装置,在语音编码前,可以将任意采样率的待处理语音信号通过频带特征信息的压缩,将待处理语音信号的采样率降低到语音编码器所支持的采样率,得到低采样率的压缩语音信号,压缩语音信号对应的目标采样率小于待处理语音信号对应的采样率。因为压缩语音信号的采样率小于或等于语音编码器所支持的采样率,所以通过语音编码器可以顺利对压缩语音信号进行编码处理,最终可以将编码处理得到的编码语音数据传输到语音接收端。
在一个实施例中,频带特征信息获取模块还用于获取语音采集设备采集的待处理语音信号,对待处理语音信号进行傅里叶变换处理,得到初始频带特征信息,初始频带特征信息包括多个初始语音频点对应的初始幅值和初始相位。
在一个实施例中,第二目标特征信息确定模块包括:
频段划分单元,用于对第二频段进行频段划分,得到至少两个按序排列的初始子频段;对压缩频段进行频段划分,得到至少两个按序排列的目标子频段。
频段关联单元,用于基于初始子频段和目标子频段的子频段排序,确定各个初始子频段分别对应的目标子频段;
信息转换单元,用于将当前目标子频段对应的当前初始子频段的初始特征信息作为第一中间特征信息,从初始频带特征信息中,获取与当前目标子频段的频段信息一致的子频段对应的初始特征信息作为第二中间特征信息,基于第一中间特征信息和第二中间特征信息得到当前目标子频段对应的目标特征信息;
信息确定单元,用于基于各个目标子频段对应的目标特征信息得到压缩频段对应的目标特征信息。
在一个实施例中,第一中间特征信息和第二中间特征信息均包括多个初始语音频点对应的初始幅值和初始相位。信息转换单元还用于基于第一中间特征信息中各个初始语音频点对应的初始幅值的统计值,得到当前目标子频段对应的各个目标语音频点的目标幅值,基于第二中间特征信息中各个初始语音频点对应的初始相位,得到当前目标子频段对应的各个目标语音频点的目标相位,基于当前目标子频段对应的各个目标语音频点的目标幅值和目标相位,得到当前目标子频段对应的目标特征信息。
在一个实施例中,压缩语音信号生成模块还用于基于压缩频段和第二频段的频率差异确定第三频段,将第三频段对应的目标特征信息设置为无效信息,基于第一频段对应的目标特征信息、压缩频段对应的目标特征信息和第三频段对应的目标特征信息得到中间频带特征信息,对中间频带特征信息进行傅里叶反变换处理,得到中间语音信号,中间语音信号对应的采样率和待处理语音信号对应的采样率一致,基于支持采样率对中间语音信号进行降采样处理,得到压缩语音信号。
在一个实施例中,语音信号编码模块还用于通过语音编码模块对压缩语音信号进行语音编码,得到第一语音数据,对第一语音数据进行信道编码,得到编码语音数据。
在一个实施例中,如图7B所示,语音编码装置还包括:
语音数据发送模块712,用于将编码语音数据发送至语音接收端,以使语音接收端对编码语音数据进行语音还原处理,得到待处理语音信号对应的目标语音信号;目标语音信号用于播放。
在一个实施例中,语音数据发送模块还用于基于第二频段和压缩频段得到待处理语音信号对应的压缩标识信息,将编码语音数据和压缩标识信息发送至语音接收端,以使语音接收端对编码语音数据进行解码处理得到压缩语音信号,基于压缩标识信息对压缩语音信号进行频带扩展,得到目标语音信号。
在一个实施例中,如图8所示,提供了一种语音解码装置,该装置可以采用软件模块或硬件模块,或者是二者的结合成为计算机设备的一部分,该装置具体包括:语音数据获取模块802、语音信号解码模块804、第一扩展特征信息确定模块806、第二扩展特征信息确定模块808、目标语音信号确定模块810,其中:
语音数据获取模块802,用于获取编码语音数据,编码语音数据是对待处理语音信号进行语音压缩处理得到的。
语音信号解码模块804,用于通过语音解码模块对编码语音数据进行解码处理得到解码语音信号,解码语音信号对应的目标采样率小于或等于语音解码模块对应的支持采样率。
第一扩展特征信息确定模块806,用于生成解码语音信号对应的目标频带特征信息,基于目标频带特征信息中第一频段对应的目标特征信息得到第一频段对应的扩展特征信息。
第二扩展特征信息确定模块808,用于对目标频带特征信息中压缩频段对应的目标特征信息进行特征扩展,得到第二频段对应的扩展特征信息;第一频段的频率小于压缩频段的频率,压缩频段的频率区间小于第二频段的频率区间。
目标语音信号确定模块810,用于基于第一频段对应的扩展特征信息和第二频段对应的扩展特征信息得到扩展频带特征信息,基于扩展频带特征信息得到待处理语音信号对应的目标语音信号,目标语音信号的采样率大于目标采样率,目标语音信号用于播放。
上述语音解码装置,获取到经过语音压缩处理得到的编码语音数据后,可以对编码语音数据进行解码处理得到解码语音信号,通过频带特征信息的扩展,可以将解码语音信号的采样率升高,得到用于播放的目标语音信号。语音信号的播放并不会受制于语音解码器所支持的采样率,在语音播放时,也可以播放信息更丰富的高采样率语音信号。
在一个实施例中,语音信号解码模块还用于对编码语音数据进行信道解码,得到第二语音数据,通过语音解码模块对第二语音数据进行语音解码,得到解码语音信号。
在一个实施例中,第二扩展特征信息确定模块包括:
映射信息获取单元,用于获取频段映射信息,频段映射信息用于确定压缩频段对应的至少两个目标子频段和第二频段对应的至少两个初始子频段之间的映射关系;
特征扩展单元,用于基于频段映射信息对目标频带特征信息中压缩频段对应的目标特征信息进行特征扩展,得到第二频段对应的扩展特征信息。
在一个实施例中,编码语音数据携带压缩标识信息,映射信息获取单元还用于基于压缩标识信息获取频段映射信息。
在一个实施例中,特征扩展单元还用于将当前初始子频段对应的当前目标子频段的目标特征信息作为第三中间特征信息,从目标频带特征信息中,获取与当前初始子频段的频段信息一致的子频段对应的目标特征信息作为第四中间特征信息,基于第三中间特征信息和第四中间特征信息得到当前初始子频段对应的扩展特征信息,基于各个初始子频段对应的扩展特征信息得到第二频段对应的扩展特征信息。
在一个实施例中,第三中间特征信息和第四中间特征信息均包括多个目标语音频点对应的目标幅值和目标相位,特征扩展单元还用于基于第三中间特征信息中各个目标语音频点对应的目标幅值,得到当前初始子频段对应的各个初始语音频点的参考幅值,当第四中间特征信息为空时,对当前初始子频段对应的各个初始语音频点的相位增加随机扰动值,得到当前初始子频段对应的各个初始语音频点的参考相位,当第四中间特征信息不为空时,基于第四中间特征信息中各个目标语音频点对应的目标相位得到当前初始子频段对应的各个初始语音频点的参考相位,基于当前初始子频段对应的各个初始语音频点的参考幅值和参考相位,得到当前初始子频段对应的扩展特征信息。
关于语音编码、语音解码装置的具体限定可以参见上文中对于语音编码、语音解码方法的限定,在此不再赘述。上述语音编码、语音解码装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在一个实施例中,提供了一种计算机设备,该计算机设备可以是终端,其内部结构图可以如图9所示。该计算机设备包括通过系统总线连接的处理器、存储器、通信接口、显示屏和输入装置。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机可读指令。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的通信接口用于与外部的终端进行有线或无线方式的通信,无线方式可通过WIFI、运营商网络、NFC(近场通信)或其他技术实现。该计算机可读指令被一个或多个处理器执行时以实现一种语音解码方法,该计算机可读指令被一个或多个处理器执行时以实现一种语音编码方法。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏,该计算机设备的输入装置可以是显示屏上覆盖的触摸层,也可以是计算机设备外壳上设置的按键、轨迹球或触控板,还可以是外接的键盘、触控板或鼠标等。
在一个实施例中,提供了一种计算机设备,该计算机设备可以是服务器,其内部结构图可以如图10所示。该计算机设备包括通过系统总线连接的处理器、存储器和网络接口。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机可读指令和数据库。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的数据库用于存储编码语音数据、频段映射信息等数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令被一个或多个处理器执行时以实现一种语音编码方法,该计算机可读指令被一个或多个处理器执行时以实现一种语音解码方法。
本领域技术人员可以理解,图9、图10中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在一个实施例中,还提供了一种计算机设备,包括存储器和一个或多个处理器,存储器中存储有计算机可读指令,该一个或多个处理器执行计算机可读指令时实现上述各方法实施例中的步骤。
在一个实施例中,提供了一种计算机可读存储介质,存储有计算机可读指令,该计算机可读指令被一个或多个处理器执行时实现上述各方法实施例中的步骤。
在一个实施例中,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机可读指令,该计算机可读指令存储在计算机可读存储介质中。计算机设备的一个或多个处理器从计算机可读存储介质读取该计算机可读指令,一个或多个处理器执行该计算机可读指令,使得该计算机设备执行上述各方法实施例中的步骤。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一非易失性计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和易失性存储器中的至少一种。非易失性存储器可包括只读存储器(Read-Only Memory,ROM)、磁带、软盘、闪存或光存储器等。易失性存储器可包括随机存取存储器(Random Access Memory,RAM)或外部高速缓冲存储器。作为说明而非局限,RAM可以是多种形式,比如静态随机存取存储器(Static Random Access Memory,SRAM)或动态随机存取存储器(Dynamic Random Access Memory,DRAM)等。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对发明专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (19)

  1. 一种语音编码方法,由语音发送端执行,所述方法包括:
    获取待处理语音信号对应的初始频带特征信息;
    基于所述初始频带特征信息中第一频段对应的初始特征信息得到第一频段对应的目标特征信息;
    对所述初始频带特征信息中第二频段对应的初始特征信息进行特征压缩,得到压缩频段对应的目标特征信息,所述第一频段的频率小于所述第二频段的频率,所述第二频段的频率区间大于所述压缩频段的频率区间;
    基于所述第一频段对应的目标特征信息和所述压缩频段对应的目标特征信息得到中间频带特征信息,基于所述中间频带特征信息得到所述待处理语音信号对应的压缩语音信号;
    通过语音编码模块对所述压缩语音信号进行编码处理,得到所述待处理语音信号对应的编码语音数据,所述压缩语音信号对应的目标采样率小于或等于所述语音编码模块对应的支持采样率,所述目标采样率小于所述待处理语音信号对应的采样率。
  2. 根据权利要求1所述的方法,所述获取待处理语音信号对应的初始频带特征信息,包括:
    获取语音采集设备采集的待处理语音信号;
    对所述待处理语音信号进行傅里叶变换处理,得到所述初始频带特征信息,所述初始频带特征信息包括多个初始语音频点对应的初始幅值和初始相位。
  3. 根据权利要求1所述的方法,所述对所述初始频带特征信息中第二频段对应的初始特征信息进行特征压缩,得到压缩频段对应的目标特征信息,包括:
    对所述第二频段进行频段划分,得到至少两个按序排列的初始子频段;
    对所述压缩频段进行频段划分,得到至少两个按序排列的目标子频段;
    基于初始子频段和目标子频段的子频段排序,确定各个初始子频段分别对应的目标子频段;
    将当前目标子频段对应的当前初始子频段的初始特征信息作为第一中间特征信息,从初始频带特征信息中,获取与当前目标子频段的频段信息一致的子频段对应的初始特征信息作为第二中间特征信息,基于所述第一中间特征信息和所述第二中间特征信息得到所述当前目标子频段对应的目标特征信息;
    基于各个目标子频段对应的目标特征信息得到所述压缩频段对应的目标特征信息。
  4. 根据权利要求3所述的方法,所述第一中间特征信息和所述第二中间特征信息均包括多个初始语音频点对应的初始幅值和初始相位;
    所述基于所述第一中间特征信息和所述第二中间特征信息得到所述当前目标子频段对应的目标特征信息,包括:
    基于所述第一中间特征信息中各个初始语音频点对应的初始幅值的统计值,得到所述当前目标子频段对应的各个目标语音频点的目标幅值;
    基于所述第二中间特征信息中各个初始语音频点对应的初始相位,得到所述当前目标子频段对应的各个目标语音频点的目标相位;
    基于所述当前目标子频段对应的各个目标语音频点的目标幅值和目标相位,得到所述当前目标子频段对应的目标特征信息。
  5. 根据权利要求1所述的方法,所述基于所述第一频段对应的目标特征信息和所述压缩频段对应的目标特征信息得到中间频带特征信息,基于所述中间频带特征信息得到所述待处理语音信号对应的压缩语音信号,包括:
    基于所述压缩频段和所述第二频段的频率差异确定第三频段,将所述第三频段对应的目标特征信息设置为无效信息;
    基于所述第一频段对应的目标特征信息、所述压缩频段对应的目标特征信息和所述第三频段对应的目标特征信息得到中间频带特征信息;
    对所述中间频带特征信息进行傅里叶反变换处理,得到中间语音信号,所述中间语音信号对应的采样率和所述待处理语音信号对应的采样率一致;
    基于所述支持采样率对所述中间语音信号进行降采样处理,得到所述压缩语音信号。
  6. 根据权利要求1所述的方法,所述通过语音编码模块对所述压缩语音信号进行编码处理,得到所述待处理语音信号对应的编码语音数据,包括:
    通过所述语音编码模块对所述压缩语音信号进行语音编码,得到第一语音数据;
    对所述第一语音数据进行信道编码,得到所述编码语音数据。
  7. 根据权利要求1至6任意一项所述的方法,所述方法还包括:
    将所述编码语音数据发送至语音接收端,以使所述语音接收端对所述编码语音数据进行语音还原处理,得到所述待处理语音信号对应的目标语音信号,所述目标语音信号用于播放。
  8. 根据权利要求7所述的方法,所述将所述编码语音数据发送至语音接收端,以使所述语音接收端对所述编码语音数据进行语音还原处理,得到所述待处理语音信号对应的目标语音信号,包括:
    基于所述第二频段和所述压缩频段得到所述待处理语音信号对应的压缩标识信息;
    将所述编码语音数据和所述压缩标识信息发送至所述语音接收端,以使所述语音接收端对所述编码语音数据进行解码处理得到压缩语音信号,基于所述压缩标识信息对所述压缩语音信号进行频带扩展,得到所述目标语音信号。
  9. 一种语音解码方法,由语音接收端执行,所述方法包括:
    获取编码语音数据,所述编码语音数据是对待处理语音信号进行语音压缩处理得到的;
    通过语音解码模块对所述编码语音数据进行解码处理得到解码语音信号,所述解码语音信号对应的目标采样率小于或等于所述语音解码模块对应的支持采样率;
    生成所述解码语音信号对应的目标频带特征信息,基于所述目标频带特征信息中第一频段对应的目标特征信息得到第一频段对应的扩展特征信息;
    对所述目标频带特征信息中压缩频段对应的目标特征信息进行特征扩展,得到第二频段对应的扩展特征信息;所述第一频段的频率小于所述压缩频段的频率,所述压缩频段的频率区间小于所述第二频段的频率区间;
    基于所述第一频段对应的扩展特征信息和所述第二频段对应的扩展特征信息得到扩展频带特征信息,基于所述扩展频带特征信息得到所述待处理语音信号对应的目标语音信号,所述目标语音信号的采样率大于所述目标采样率,所述目标语音信号用于播放。
  10. 根据权利要求9所述的方法,所述通过语音解码模块对所述编码语音数据进行解码处理得到解码语音信号,包括:
    对所述编码语音数据进行信道解码,得到第二语音数据;
    通过所述语音解码模块对所述第二语音数据进行语音解码,得到所述解码语音信号。
  11. 根据权利要求9所述的方法,所述对所述目标频带特征信息中压缩频段对应的目标特征信息进行特征扩展,得到第二频段对应的扩展特征信息,包括:
    获取频段映射信息,所述频段映射信息用于确定所述压缩频段对应的至少两个目标子频段和所述第二频段对应的至少两个初始子频段之间的映射关系;
    基于所述频段映射信息对所述目标频带特征信息中压缩频段对应的目标特征信息进行特征扩展,得到所述第二频段对应的扩展特征信息。
  12. 根据权利要求11所述的方法,所述编码语音数据携带压缩标识信息,所述获取频段映射信息,包括:
    基于所述压缩标识信息获取所述频段映射信息。
  13. 根据权利要求11所述的方法,所述基于所述频段映射信息对所述目标频带特征信息中压缩频段对应的目标特征信息进行特征扩展,得到所述第二频段对应的扩展特征信息,包括:
    将当前初始子频段对应的当前目标子频段的目标特征信息作为第三中间特征信息,从目标频带特征信息中,获取与当前初始子频段的频段信息一致的子频段对应的目标特征信息作为第四中间特征信息,基于所述第三中间特征信息和所述第四中间特征信息得到所述当前初始子频段对应的扩展特征信息;
    基于各个初始子频段对应的扩展特征信息得到所述第二频段对应的扩展特征信息。
  14. 根据权利要求13所述的方法,所述第三中间特征信息和所述第四中间特征信息均包括多个目标语音频点对应的目标幅值和目标相位;
    所述基于所述第三中间特征信息和所述第四中间特征信息得到所述当前初始子频段对应的扩展特征信息,包括:
    基于所述第三中间特征信息中各个目标语音频点对应的目标幅值,得到所述当前初始子频段对应的各个初始语音频点的参考幅值;
    当所述第四中间特征信息为空时,对所述当前初始子频段对应的各个初始语音频点的相位增加随机扰动值,得到所述当前初始子频段对应的各个初始语音频点的参考相位;
    当所述第四中间特征信息不为空时,基于所述第四中间特征信息中各个目标语音频点对应的目标相位得到所述当前初始子频段对应的各个初始语音频点的参考相位;
    基于所述当前初始子频段对应的各个初始语音频点的参考幅值和参考相位,得到所述当前初始子频段对应的扩展特征信息。
  15. 一种语音编码装置,所述装置包括:
    频带特征信息获取模块,用于获取待处理语音信号对应的初始频带特征信息;
    第一目标特征信息确定模块,用于基于所述初始频带特征信息中第一频段对应的初始特征信息得到第一频段对应的目标特征信息;
    第二目标特征信息确定模块,用于对所述初始频带特征信息中第二频段对应的初始特征信息进行特征压缩,得到压缩频段对应的目标特征信息,所述第一频段的频率小于所述第二频段的频率,所述第二频段的频率区间大于所述压缩频段的频率区间;
    压缩语音信号生成模块,用于基于所述第一频段对应的目标特征信息和所述压缩频段对应的目标特征信息得到中间频带特征信息,基于所述中间频带特征信息得到所述待处理语音信号对应的压缩语音信号;
    语音信号编码模块,用于通过语音编码模块对所述压缩语音信号进行编码处理,得到所述待处理语音信号对应的编码语音数据,所述压缩语音信号对应的目标采样率小于或等于所述语音编码模块对应的支持采样率,所述目标采样率小于所述待处理语音信号对应的采样率。
  16. 一种语音解码装置,所述装置包括:
    语音数据获取模块,用于获取编码语音数据,所述编码语音数据是对待处理语音信号进行语音压缩处理得到的;
    语音信号解码模块,用于通过语音解码模块对所述编码语音数据进行解码处理得到解码语音信号,所述解码语音信号对应的目标采样率小于或等于所述语音解码模块对应的支持采样率;
    第一扩展特征信息确定模块,用于生成所述解码语音信号对应的目标频带特征信息,基于所述目标频带特征信息中第一频段对应的目标特征信息得到第一频段对应的扩展特征信息;
    第二扩展特征信息确定模块,用于对所述目标频带特征信息中压缩频段对应的目标特征信息进行特征扩展,得到第二频段对应的扩展特征信息;所述第一频段的频率小于所述压缩频段的频率,所述压缩频段的频率区间小于所述第二频段的频率区间;
    目标语音信号确定模块,用于基于所述第一频段对应的扩展特征信息和所述第二频段对应的扩展特征信息得到扩展频带特征信息,基于所述扩展频带特征信息得到所述待处理语音信号对应的目标语音信号,所述目标语音信号的采样率大于所述目标采样率,所述目标语音信号用于播放。
  17. 一种计算机设备,包括存储器和一个或多个处理器,所述存储器存储有计算机可读指令,其特征在于,所述一个或多个处理器执行所述计算机可读指令时实现权利要求1至8或9至14中任一项所述的方法的步骤。
  18. 一种计算机可读存储介质,存储有计算机可读指令,其特征在于,所述计算机可读指令被一个或多个处理器执行时实现权利要求1至8或9至14中任一项所述的方法的步骤。
  19. 一种计算机程序产品,包括计算机可读指令,所述计算机可读指令被一个或多个处理器执行时实现权利要求1至8或9至14中任一项所述的方法的步骤。
PCT/CN2022/093329 2021-06-22 2022-05-17 语音编码、语音解码方法、装置、计算机设备和存储介质 WO2022267754A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22827252.2A EP4362013A1 (en) 2021-06-22 2022-05-17 Speech coding method and apparatus, speech decoding method and apparatus, computer device, and storage medium
US18/124,496 US20230238009A1 (en) 2021-06-22 2023-03-21 Speech coding method and apparatus, speech decoding method and apparatus, computer device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110693160.9A CN115512711A (zh) 2021-06-22 2021-06-22 语音编码、语音解码方法、装置、计算机设备和存储介质
CN202110693160.9 2021-06-22

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/124,496 Continuation US20230238009A1 (en) 2021-06-22 2023-03-21 Speech coding method and apparatus, speech decoding method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022267754A1

Family

ID=84499351

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/093329 WO2022267754A1 (zh) 2021-06-22 2022-05-17 语音编码、语音解码方法、装置、计算机设备和存储介质

Country Status (4)

Country Link
US (1) US20230238009A1 (zh)
EP (1) EP4362013A1 (zh)
CN (1) CN115512711A (zh)
WO (1) WO2022267754A1 (zh)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1677491A (zh) * 2004-04-01 2005-10-05 北京宫羽数字技术有限责任公司 一种增强音频编解码装置及方法
CN1905373A (zh) * 2005-07-29 2007-01-31 上海杰得微电子有限公司 一种音频编解码器的实现方法
CN101604527A (zh) * 2009-04-22 2009-12-16 网经科技(苏州)有限公司 VoIP环境下基于G.711编码隐藏传送宽频语音的方法
CN102522092A (zh) * 2011-12-16 2012-06-27 大连理工大学 一种基于g.711.1的语音带宽扩展的装置和方法
CN104508740A (zh) * 2012-06-12 2015-04-08 全盛音响有限公司 双重兼容无损音频带宽扩展
CN104737227A (zh) * 2012-11-05 2015-06-24 松下电器(美国)知识产权公司 语音音响编码装置、语音音响解码装置、语音音响编码方法和语音音响解码方法
CN107925388A (zh) * 2016-02-17 2018-04-17 弗劳恩霍夫应用研究促进协会 用于增强瞬时处理的后置处理器、预处理器、音频编码器、音频解码器及相关方法
CN110832582A (zh) * 2017-03-31 2020-02-21 弗劳恩霍夫应用研究促进协会 用于处理音频信号的装置和方法
CN111402908A (zh) * 2020-03-30 2020-07-10 Oppo广东移动通信有限公司 语音处理方法、装置、电子设备和存储介质


Also Published As

Publication number Publication date
US20230238009A1 (en) 2023-07-27
CN115512711A (zh) 2022-12-23
EP4362013A1 (en) 2024-05-01


Legal Events

121  Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22827252; Country of ref document: EP; Kind code of ref document: A1)
WWE  Wipo information: entry into national phase (Ref document number: 2022827252; Country of ref document: EP)
NENP Non-entry into the national phase (Ref country code: DE)
ENP  Entry into the national phase (Ref document number: 2022827252; Country of ref document: EP; Effective date: 20240122)