CN110933233A - Mobile terminal and method for processing voice communication - Google Patents

Mobile terminal and method for processing voice communication

Info

Publication number
CN110933233A
CN110933233A
Authority
CN
China
Prior art keywords
play
offset state
value
samples
buffer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010067617.0A
Other languages
Chinese (zh)
Other versions
CN110933233B (en)
Inventor
李小兵
黄一火
冯子龙
许昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ASR Microelectronics Co Ltd
Original Assignee
Marvell World Trade Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Marvell World Trade Ltd filed Critical Marvell World Trade Ltd
Priority to CN202010067617.0A
Publication of CN110933233A
Application granted
Publication of CN110933233B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/72Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72433User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for voice messaging, e.g. dictaphones
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/72Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72448User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions

Abstract

The application discloses a method for a mobile terminal to process voice communication, comprising the following steps. Step S10: when voice communication starts, preset the positions of the read pointer and the write pointer in the play buffer of the receiving end, and set a reference value and a threshold value; the read pointer is preset before the write pointer; the threshold value is less than the reference value. Step S20: every first time interval, detect the distance between the read pointer and the write pointer in the play buffer; this distance is called the detection distance. Step S30: represent the historical state of the play buffer with an offset state; shift the offset state left by two bits, and determine the new value of the offset state by comparing the detection distance with the sum and difference of the reference value and the threshold value. Step S40: based on VAD detection of the voice PCM data stream, combined with the offset state, decide whether to interpolate or decimate samples. The method is used to counteract the influence of delay on voice communication.

Description

Mobile terminal and method for processing voice communication
Technical Field
The application relates to a method for processing voice communication by a mobile terminal.
Background
A mobile terminal (UE), typified by the mobile phone, is an indispensable part of daily life, and delay is an important indicator of the quality of voice communication on a mobile terminal. When the end-to-end delay is below 150 ms, it is imperceptible. When the end-to-end delay exceeds 150 ms but is less than 450 ms, people can perceive it but can tolerate it, and conversation is not affected. When the end-to-end delay is greater than 1 s, conversation is seriously affected and the user experience is poor. The audio test specification of China Mobile Communications Group Co., Ltd. requires the end-to-end delay of a mobile phone in the transmitting and receiving directions to be no more than 220 ms, and recommends it be within 185 ms; otherwise, certification cannot be passed.
Referring to fig. 1, this is the transmission process of voice from capture to play. When voice communication is carried out between two mobile terminals, the voice transmission process comprises three parts: firstly, voice data are collected from a sending end and sent to network equipment after being processed, secondly, the voice data are transmitted among the network equipment, and thirdly, the voice data are sent to a receiving end from the network equipment and played. Each of the foregoing parts will generate a delay, and this application only discusses the delay introduced at the mobile terminal. Neither the delay between the mobile terminal and the network device nor the delay between the network device and the network device is within the scope of the discussion of this application.
In a CS (circuit switched) voice link, speech enhancement processing, encoding and decoding are performed on an audio signal every 20 ms. Taking an 8K sampling rate as an example, this means that the audio signal is sampled 8000 times per second. At the transmitting end, a 160 sample (sample) PCM data stream (PCM stream) collected from a Microphone (MIC) is subjected to upstream speech enhancement processing, then encoded, and then sent to a modem (modem) for transmission. At the receiving end, the voice packet (voice packet) sent by the modem is decoded to be a PCM data stream of 160 samples, then the PCM data stream is subjected to downstream voice enhancement processing, and finally the 160 samples are written to a digital audio path and a DAC (digital-to-analog conversion module) through an SSP (Synchronous Serial Port), and are played through a speaker after digital-to-analog conversion.
The delay at the transmitting end mainly comprises the delay introduced by sound collection, by the speech enhancement processing algorithm, and by the encoding algorithm. Since most vocoders require frames of 20 ms, collected voice data must wait in a buffer until a full frame is available before it is taken out for encoding, which introduces delay. With a 20 ms frame, at least 20 ms of delay is introduced.
The delay of the receiving end mainly comprises decoding algorithm delay, downlink voice enhancement processing delay and playing delay. Before playing, in order to keep the playing fluency, a level of buffer area is added before the voice data enters a DMA (direct memory access) controller, which also introduces delay.
The implementation of the CS voice link described above depends on the frequency accuracy of a crystal oscillator. A TCXO (temperature compensated crystal oscillator) performs better than a VCXO (voltage controlled crystal oscillator) or a DCXO (digitally compensated crystal oscillator) and has high frequency stability, but it is also more expensive. To save cost, VCXOs and DCXOs are still used in many mobile terminals. In addition, crystal aging also degrades frequency stability.
If the frequency of the crystal oscillator is inaccurate, the 20 ms frame length at the transmitting end is inaccurate, and the collection of samples is also inaccurate. This can cause samples to accumulate or run short; long-term accumulation lengthens the call delay, and the receiving end plays invalid data, generating noise and degrading the user experience. Assuming the crystal oscillator deviates by slightly less than 3 ppm (parts per million), after 1 s of talk time the delay increases by nearly 3 μs; after one hour it increases by 9.6 ms; after 2 hours it increases by 19.2 ms, which is about the duration of one 20 ms voice frame.
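The arithmetic above is straightforward to check. The following minimal sketch (not part of the patent; the function name is illustrative) computes the delay accumulated by a given crystal frequency deviation:

```python
def accumulated_drift_ms(ppm: float, seconds: float) -> float:
    """Timing error in milliseconds accumulated over `seconds` of talk
    time by a crystal whose frequency deviates by `ppm` parts per million."""
    return ppm * 1e-6 * seconds * 1e3

# A deviation of 3 ppm accumulates about 3 microseconds of drift per second.
assert abs(accumulated_drift_ms(3, 1) - 0.003) < 1e-9
# At 2.67 ppm, two hours of talk approaches one 20 ms voice frame of drift.
assert abs(accumulated_drift_ms(2.67, 7200) - 19.224) < 1e-6
```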
If the frequency of the crystal oscillator is inaccurate, the playback clock of the receiving end is also inaccurate, and the smaller the play buffer, the greater the risk of overflow. This not only increases voice delay; one must also consider how large the play buffer needs to be to avoid overflow while consuming as little memory as possible.
Disclosure of Invention
The technical problem to be solved by the application is to counteract the influence on voice communication of delay caused by an unstable crystal oscillator, an aging crystal oscillator, or clock deviation.
In order to solve the above technical problem, the present application provides a method for a mobile terminal to process voice communication, comprising the following steps. Step S10: when voice communication starts, preset the positions of the read pointer and the write pointer in the play buffer of the receiving end, and set a reference value and a threshold value; the read pointer is preset before the write pointer; the threshold value is less than the reference value. Step S20: every first time interval, detect the distance between the read pointer and the write pointer in the play buffer; this distance is called the detection distance. Step S30: represent the historical state of the play buffer with an offset state; shift the offset state left by two bits, and determine the new value of the offset state by comparing the detection distance with the sum and difference of the reference value and the threshold value. Step S40: based on VAD detection of the voice PCM data stream, combined with the offset state, decide whether to interpolate or decimate samples.
The method monitors the distance between the read pointer and the write pointer in the play buffer in real time and, without affecting the content and quality of the uplink and downlink voice, adaptively interpolates or decimates samples according to the behavior of the crystal oscillator, thereby avoiding sample accumulation, buffer overflow, and the introduction of noise.
Further, in step S10, the read pointer in the play buffer is preset 3 to 10ms before the write pointer. This is a preferred implementation.
Further, the reference value represents a distance of the read pointer from the write pointer under normal conditions; when the sampling rate is 8k, the reference value is set to be 160; the reference value is set to 320 at a sampling rate of 16 k. This is a preferred implementation.
Further, the threshold value represents a fluctuation range introduced by the completion of the speech enhancement processing and decoding at the receiving end; when the sampling rate is 8k, the threshold value is set to be 8; at a sampling rate of 16k, the threshold is set to 16. This is a preferred implementation.
Further, the first time is 60ms or an integer multiple of 60 ms. This is a preferred implementation.
Further, in step S30, the offset state is used to characterize whether the detection distance of the read pointer and the write pointer in the play buffer is more or less than the reference value, and the offset state is initialized to 0. This is a preferred implementation.
Further, in step S30, the offset state is 8 bits and records the offset states of 4 consecutive play buffers; every 2 bits represents the offset state of one play buffer, the binary number 10 indicating that the play buffer holds more samples and the binary number 01 indicating that it holds fewer. This is a preferred implementation.
Further, in step S30, the reference value, threshold value, detection distance, and offset state are denoted reference, thresh, drift, and DriftState, respectively. If drift > reference + thresh, DriftState is left-shifted by two bits and ORed with 0x2 to form its new value. If drift < reference - thresh, DriftState is left-shifted by two bits and ORed with 0x1 to form its new value. If reference - thresh ≤ drift ≤ reference + thresh, the current value of DriftState is only left-shifted by two bits to form its new value. This is a specific implementation of step S30.
Further, in step S40, if the VAD detects that the number of non-speech frames continuously reaches the first number and the offset state is 0xAA, the samples retained in the play-out buffer are extracted. If the number of non-speech frames detected by the VAD continuously reaches the first number and the offset state is 0x55, the samples retained in the play-out buffer are interpolated. This is a specific implementation of step S40.
Further, the value of the first number is between 4 and 8. This is a preferred implementation.
Further, in step S40, if the VAD detects that the number of the non-speech frames continuously reaches the first number and the offset state is 0xAA, the samples retained in the play buffer are extracted, and the samples in the acquisition buffer are interpolated. If the number of non-speech frames detected by the VAD continuously reaches the first number and the offset state is 0x55, the samples retained in the play-out buffer are interpolated while the samples in the acquisition buffer are decimated. This is a preferred implementation.
The application also provides a mobile terminal for processing voice communication, comprising an initial setting unit, a detection unit, an offset state calculating unit, and an adjusting unit. The initial setting unit is used to preset the positions of the read pointer and the write pointer in the play buffer of the receiving end when voice communication starts, and to set a reference value and a threshold value; the read pointer is preset before the write pointer; the threshold value is less than the reference value. The detection unit is used to detect, every first time interval, the distance between the read pointer and the write pointer in the play buffer, the distance being called the detection distance. The offset state calculating unit is used to represent the historical state of the play buffer with an offset state, to shift the offset state left by two bits, and to determine the new value of the offset state by comparing the detection distance with the sum and difference of the reference value and the threshold value. The adjusting unit is used to decide, based on VAD detection of the voice PCM data stream combined with the offset state, whether to interpolate or decimate samples.
The technical effects achieved by the present application include the following aspects.
First, cost can be saved. The crystal oscillator of the mobile terminal can adopt VCXO and DCXO with lower cost, and the algorithm can adaptively offset the time delay introduced by the instability of the crystal oscillator under the condition of not influencing the conversation content and quality.
And secondly, the influence caused by crystal oscillator aging can be resisted, and the service life of the product is prolonged.
Thirdly, even if the clock of the mobile terminal and the network clock have deviation, the algorithm can also adaptively offset the time delay caused by the deviation, and the smoothness of the conversation is ensured.
Drawings
Fig. 1 is a schematic diagram of the transmission process of voice from acquisition to playing.
Fig. 2 is a flowchart of a method for processing voice communication by a mobile terminal provided by the present application.
Fig. 3 is a first schematic diagram of the acquisition buffer at the transmitting end and the play buffer at the receiving end after the voice communication starts.
Fig. 4 is a second schematic diagram of the acquisition buffer area of the transmitting end and the play buffer area of the receiving end after the voice communication starts.
Fig. 5 is a schematic structural diagram of a mobile terminal provided in the present application.
The reference numbers in the figures illustrate: 10 is an initial setting unit; 20 is a detection unit; 30 is an offset state calculating unit; 40 is an adjusting unit.
Detailed Description
Referring to fig. 2, the method for processing voice communication by a mobile terminal provided by the present application includes the following steps.
Step S10: when voice communication starts, the positions of the read pointer and the write pointer in the play buffer of the receiving end are preset, and a reference value and a threshold value are set. The read pointer is normally preset before the write pointer in order to prevent the play buffer at the receiving end from being read empty. Preferably, the read pointer is preset 3 to 10 ms before the write pointer. The reference value is denoted reference and represents the distance between the read pointer and the write pointer under normal circumstances. The reference value is typically set to 160 for a sampling rate of 8 k and to 320 for a sampling rate of 16 k. The present application takes an 8 k sampling rate as an example. The threshold is denoted thresh and mainly accounts for the fluctuation introduced by the completion times of speech enhancement processing and decoding at the receiving end. In general, the threshold is set to about 8 at a sampling rate of 8 k and about 16 at a sampling rate of 16 k. The threshold value is less than the reference value.
Step S20: the distance between the read pointer and the write pointer in the play buffer is detected every first time t1, called the detection distance, and denoted as Drift. Preferably, the first time t1 is 60ms or an integer multiple of 60 ms. 60ms is chosen to ensure that there are an integer number of speech frames (20 ms for one speech frame) under various communication schemes. Especially in the 2G system, the interval of the voice frames is repeated at 18.46ms, 23.075 ms. And the sum of the three is equal to about 60 ms.
Step S30: the history state of the play buffer is represented by an offset state DriftState, which is used to characterize whether the read pointer in the play buffer is more or less than the detection distance Drift of the write pointer compared to the reference value thresh, and the offset state DriftState is initialized to 0. For example, the offset state DriftState is 8 bits (bit), and the offset states of 4 consecutive play buffers of 20ms can be recorded. Every 2 bits represents a 20ms shift status of the play buffer, binary 10 indicates that the number of samples in the play buffer is more, and binary 01 indicates that the number of samples in the play buffer is less.
Shifting the offset state left by two bits means that every bit of DriftState moves two positions toward the high end: the highest two bits are shifted out and the lowest two bits are filled with 0. The new value of the offset state DriftState is then determined by comparing the detection distance drift of a given play buffer period with the sum and difference of the reference value reference and the threshold thresh.
If drift > reference + thresh, the number of samples remaining in the play buffer is more than normal, and the new value of DriftState is the current value ORed with 0x2. 0x2 is the hexadecimal number 2, i.e. the binary number 10. This happens, for example, when the crystal slows down and 160 samples cannot be played within 20 ms. Every first time t1, the offset state DriftState is left-shifted by two bits and then ORed with 0x2.
If drift < reference - thresh, the number of samples remaining in the play buffer is less than normal, and the new value of DriftState is the current value ORed with 0x1. 0x1 is the hexadecimal number 1, i.e. the binary number 01. For example, when the crystal runs fast, every first time t1 the offset state DriftState is left-shifted by two bits and then ORed with 0x1.
If reference - thresh ≤ drift ≤ reference + thresh, the number of samples remaining in the play buffer is within the normal range; the current value of DriftState is only left-shifted by two bits to form its new value, and no interpolation or decimation of samples follows.
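A minimal sketch of the step S30 update rule, under the 8 k sampling rate defaults given above (function and variable names are illustrative, not from the patent):

```python
REFERENCE = 160  # normal read/write pointer distance at an 8 k sampling rate
THRESH = 8       # tolerated fluctuation at an 8 k sampling rate

def update_drift_state(state: int, drift: int) -> int:
    """Left-shift the 8-bit history by two bits, then record 10 (surplus),
    01 (deficit), or 00 (normal) for the current detection interval."""
    state = (state << 2) & 0xFF  # keep only the 4 most recent 2-bit records
    if drift > REFERENCE + THRESH:
        state |= 0x2  # binary 10: more samples than normal remain buffered
    elif drift < REFERENCE - THRESH:
        state |= 0x1  # binary 01: fewer samples than normal remain buffered
    return state      # in-range drift leaves the low two bits at 00

# Four consecutive surplus readings yield 0xAA (binary 10101010).
state = 0
for _ in range(4):
    state = update_drift_state(state, 180)
assert state == 0xAA
```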
Step S40: based on the VAD (Voice activity detection) detection from the Voice PCM data stream, the combined offset state DriftState determines whether to interpolate or extract the number of samples.
If the number of non-speech frames (e.g. noise or silence) detected by VAD continuously reaches the first number N1 and the offset state DriftState is 0xAA, the samples retained in the play buffer are decimated. Preferably, the first number N1 takes a value between 4 and 8. 0xAA is a hexadecimal number, i.e. the binary number 10101010. Taking N1 = 4 as an example: if VAD detects 4 consecutive frames of noise or silence and the play buffer holds too many samples, decimation is performed over 4 consecutive frames, e.g. each frame of 160 samples is decimated to 159 samples before being placed in the 4 consecutive play buffers.
If the number of non-speech frames detected by VAD continuously reaches the first number N1 and the offset state DriftState is 0x55, the samples retained in the play buffer are interpolated. Preferably, the first number N1 takes a value between 4 and 8. 0x55 is a hexadecimal number, i.e. the binary number 01010101. Taking N1 = 4 as an example: if VAD detects 4 consecutive frames of noise or silence and the play buffer holds too few samples, interpolation is performed over 4 consecutive frames, e.g. each frame of 160 samples is interpolated to 161 samples before being placed in the 4 consecutive play buffers.
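Combining the two conditions of step S40, the decision can be sketched as follows (names are illustrative; N1 = 4 follows the example above):

```python
N1 = 4  # required run of consecutive non-speech (noise/silence) frames

def adjustment(non_speech_run: int, drift_state: int) -> str:
    """Choose the play-buffer adjustment for the next N1 frames."""
    if non_speech_run < N1:
        return "none"          # only adjust during sustained non-speech
    if drift_state == 0xAA:
        return "decimate"      # history shows a persistent sample surplus
    if drift_state == 0x55:
        return "interpolate"   # history shows a persistent sample deficit
    return "none"

assert adjustment(4, 0xAA) == "decimate"
assert adjustment(4, 0x55) == "interpolate"
assert adjustment(2, 0xAA) == "none"   # speech too recent: leave untouched
assert adjustment(4, 0x00) == "none"   # mixed history: leave untouched
```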
Please refer to fig. 3, a schematic diagram of the acquisition buffer at the transmitting end and the play buffer at the receiving end after voice communication starts. At the transmitting end, the audio signal collected by the ADC (analog-to-digital converter) is written into the acquisition buffer through the SSP, and the read pointer of the acquisition buffer reads 160 samples for uplink speech enhancement processing. At the receiving end, the write pointer of the play buffer writes 160 samples of processed downlink voice into the play buffer, and 160 samples at a time are written to the DAC (digital-to-analog converter) through the SSP.
In the play buffer of the receiving end, the distance between the read pointer and the write pointer changes for two reasons. The first is that the time at which the DSP (digital signal processor) finishes voice encoding and decoding varies with the communication scenario. For example, when a call is made while accessing the internet, the DSP has much to do and voice encoding and decoding do not have high priority, so the DSP is interrupted by other tasks and the completion time is delayed. The second is that when the crystal oscillator is unstable and the frequency inaccurate, the numbers of samples in the acquisition buffer and the play buffer deviate from the nominal values. For example, in the downlink direction, if the crystal slows down, 160 samples may not finish playing within 20 ms, and the excess samples remain in the play buffer.
In step S40, interpolation and decimation of samples are performed only when VAD detects non-speech frames, mainly so as not to affect the subsequent echo cancellation.
Taking the 8 k sampling rate as an example, every 20 ms the receiving direction should write 160 samples into the play buffer and the transmitting direction should read 160 samples from the acquisition buffer, as shown in fig. 4. Relative to fig. 3, fig. 4 adds, in the uplink direction, a FIFO (first-in first-out) hardware buffer between the SSP and the acquisition buffer, and a voice transmission buffer into which data is read from the acquisition buffer; in the downlink direction it adds a voice reception buffer from which data is written into the play buffer, and a FIFO hardware buffer between the play buffer, from which data is read to the SSP, and the SSP.
If the offset state DriftState is 0xAA, then in the transmitting direction 159 samples are read from the acquisition buffer, interpolated to 160 samples, and then passed to uplink speech enhancement processing. In the playing direction, each 20 ms of samples in the voice reception buffer is decimated (Decimator) from 160 samples to 159, and then written into the play buffer to be played.
If DriftState is 0x55, then in the transmitting direction 161 samples are read from the acquisition buffer into the voice transmission buffer, decimated to 160 samples, and then passed to uplink speech enhancement processing. In the playing direction, each 20 ms of samples in the voice reception buffer is interpolated (Interpolator) from 160 samples to 161, and then written into the play buffer to be played.
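The per-frame sample-count changes (160 to 159, or 160 to 161) can be realized in many ways; one simple possibility, not specified by the patent, is linear resampling of the frame:

```python
def resample_frame(frame, out_len):
    """Linearly resample a PCM frame to `out_len` samples, preserving
    the first and last sample values."""
    in_len = len(frame)
    out = []
    for i in range(out_len):
        pos = i * (in_len - 1) / (out_len - 1)  # map output index into input
        lo = int(pos)
        hi = min(lo + 1, in_len - 1)
        frac = pos - lo
        out.append(frame[lo] * (1 - frac) + frame[hi] * frac)
    return out

frame = [float(i) for i in range(160)]           # one 20 ms frame at 8 kHz
assert len(resample_frame(frame, 159)) == 159    # decimation for DriftState 0xAA
assert len(resample_frame(frame, 161)) == 161    # interpolation for DriftState 0x55
assert resample_frame(frame, 161)[0] == frame[0]
assert resample_frame(frame, 161)[-1] == frame[-1]
```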
Referring to fig. 5, the mobile terminal for processing voice communication provided by the present application includes an initial setting unit 10, a detecting unit 20, an offset state calculating unit 30, and an adjusting unit 40.
The initial setting unit 10 is configured to preset the positions of a read pointer and a write pointer in a play buffer of a receiving end at the start of voice communication, and set a reference value and a threshold value.
The detection unit 20 is configured to detect a distance, called a detection distance, denoted as Drift, between the read pointer and the write pointer in the play buffer at a first time t 1.
The offset state calculating unit 30 is configured to represent the historical state of the play buffer by an offset state DriftState, which characterizes whether the detection distance Drift between the read pointer and the write pointer in the play buffer is more or less than the reference value reference; the offset state DriftState is initialized to 0. The offset state calculating unit 30 is further configured to shift the offset state DriftState left by two bits, and to determine the new value of the offset state DriftState by comparing the detection distance drift of a given play buffer period with the sum and difference of the reference value reference and the threshold thresh. If drift > reference + thresh, the new value of DriftState is the current value ORed with 0x2. If drift < reference - thresh, the new value of DriftState is the current value ORed with 0x1. If reference - thresh ≤ drift ≤ reference + thresh, the number of samples remaining in the play buffer is within the normal range, and the current value of DriftState is only left-shifted by two bits to form its new value.
The adjusting unit 40 is configured to decide whether to interpolate or decimate samples according to the VAD detection of the voice PCM data stream combined with the offset state DriftState. If VAD detects that the number of non-speech frames continuously reaches the first number N1 and the offset state DriftState is 0xAA, the samples remaining in the play buffer are decimated. If VAD detects that the number of non-speech frames continuously reaches the first number N1 and the offset state DriftState is 0x55, the samples remaining in the play buffer are interpolated.
The above are merely preferred embodiments of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (12)

1. A method for processing voice communication by a mobile terminal is characterized by comprising the following steps:
step S10: when voice communication starts, presetting the positions of the read pointer and the write pointer in the play buffer of the receiving end, and setting a reference value and a threshold value; the read pointer is preset before the write pointer; the threshold value is less than the reference value;
step S20: detecting, every first time interval, the distance between the read pointer and the write pointer in the play buffer, the distance being called the detection distance;
step S30: representing the historical state of the play buffer with an offset state, shifting the offset state left by two bits, and determining the new value of the offset state by comparing the detection distance with the sum and difference of the reference value and the threshold value;
step S40: based on VAD detection of the voice PCM data stream, combined with the offset state, deciding whether to interpolate or decimate samples.
2. The method of claim 1, wherein in the step S10, the read pointer in the play buffer is preset 3 to 10ms before the write pointer.
3. The method of claim 1, wherein the reference value indicates the distance between the read pointer and the write pointer under normal conditions; when the sampling rate is 8 kHz, the reference value is set to 160; when the sampling rate is 16 kHz, the reference value is set to 320.
4. The method of claim 1, wherein the threshold value indicates the fluctuation range introduced by the speech enhancement processing and decoding at the receiving end; when the sampling rate is 8 kHz, the threshold value is set to 8; when the sampling rate is 16 kHz, the threshold value is set to 16.
5. The method of claim 1, wherein the first time is 60ms or an integer multiple of 60 ms.
6. The method of claim 1, wherein in step S30, the offset state is used to characterize whether the detection distance between the read pointer and the write pointer in the play buffer is greater or less than the reference value, and the offset state is initialized to 0.
7. The method of claim 1, wherein in step S30, the offset state is 8 bits long, recording the offset states of 4 consecutive detections of the play buffer; every 2 bits represent one offset state, the binary number 10 indicating too many samples in the play buffer and the binary number 01 indicating too few samples in the play buffer.
8. The method of claim 1, wherein in step S30, the reference value, the threshold value, the detection distance, and the offset state are denoted as reference, thresh, drift, and DriftState, respectively;
if drift is greater than reference + thresh, the result of the OR operation between the current value of DriftState and 0x2 is taken as the new value of DriftState;
if drift is less than reference - thresh, the result of the OR operation between the current value of DriftState and 0x1 is taken as the new value of DriftState;
if drift is not less than reference - thresh and not greater than reference + thresh, the current value of DriftState is left-shifted by two bits and taken as the new value of DriftState.
9. The method of claim 1, wherein in step S40, if the VAD detects that the number of consecutive non-speech frames reaches the first number and the offset state is 0xAA, the samples remaining in the play buffer are decimated;
if the VAD detects that the number of consecutive non-speech frames reaches the first number and the offset state is 0x55, the samples remaining in the play buffer are interpolated.
10. The method of claim 9, wherein the first number takes on a value between 4 and 8.
11. The method of claim 9, wherein in step S40, if the VAD detects that the number of consecutive non-speech frames reaches the first number and the offset state is 0xAA, the samples remaining in the play buffer are decimated while the samples in the acquisition buffer are interpolated;
if the VAD detects that the number of consecutive non-speech frames reaches the first number and the offset state is 0x55, the samples remaining in the play buffer are interpolated while the samples in the acquisition buffer are decimated.
12. A mobile terminal for processing voice communication, characterized by comprising an initial setting unit, a detection unit, an offset state calculation unit, and an adjusting unit;
the initial setting unit is configured to preset, when voice communication starts, the positions of a read pointer and a write pointer in a play buffer of a receiving end, and to set a reference value and a threshold value; the read pointer is preset before the write pointer; the threshold value is less than the reference value;
the detection unit is configured to detect the distance between the read pointer and the write pointer in the play buffer every first time period, the distance being called a detection distance;
the offset state calculation unit is configured to represent the history state of the play buffer by an offset state, and is further configured to left-shift the offset state by two bits and determine a new value of the offset state according to a comparison of the detection distance against the sum or difference of the reference value and the threshold value;
the adjusting unit is configured to determine whether to interpolate or decimate samples according to the VAD detection result from the voice PCM data stream jointly with the offset state.
CN202010067617.0A 2020-01-21 2020-01-21 Mobile terminal and method for processing voice communication Active CN110933233B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010067617.0A CN110933233B (en) 2020-01-21 2020-01-21 Mobile terminal and method for processing voice communication

Publications (2)

Publication Number Publication Date
CN110933233A true CN110933233A (en) 2020-03-27
CN110933233B CN110933233B (en) 2020-08-07

Family

ID=69854447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010067617.0A Active CN110933233B (en) 2020-01-21 2020-01-21 Mobile terminal and method for processing voice communication

Country Status (1)

Country Link
CN (1) CN110933233B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1464685A (en) * 2002-06-13 2003-12-31 优创科技(深圳)有限公司 Method for processing acoustic frequency flow playback in network terminal buffer
CN1627747A (en) * 2003-12-09 2005-06-15 华为技术有限公司 Method of realizing dynamic adjusting dithered buffer in procedure of voice transmission
CN101243630A (en) * 2005-06-15 2008-08-13 硅谷实验室公司 Synchronizing a modem and vocoder of a mobile station
CN101331733A (en) * 2005-12-16 2008-12-24 杜比瑞典公司 Apparatus for generating and interpreting a data stream having a series of segments using data in subsequent data frames
CN102113049A (en) * 2008-10-06 2011-06-29 三菱电机株式会社 Signal reception device and signal transmission system
CN106559706A (en) * 2015-09-28 2017-04-05 青岛海信电器股份有限公司 A kind of processing method of voice data, device and system
CN107424639A (en) * 2016-05-13 2017-12-01 景相科技股份有限公司 Synchronous sound access method and use its sound playing system
US20180316482A1 (en) * 2015-06-26 2018-11-01 Olympus Corporation Sampling rate synchronization between transmitters and receivers

Also Published As

Publication number Publication date
CN110933233B (en) 2020-08-07

Similar Documents

Publication Publication Date Title
EP1423930B1 (en) Method and apparatus for reducing synchronization delay in packet-based voice terminals by resynchronizing during talk spurts
EP1751744B1 (en) Method and apparatus providing continuous adaptive control of voice packet buffer at receiver terminal
US7450601B2 (en) Method and communication apparatus for controlling a jitter buffer
EP1440375B1 (en) Network media playout
US7457282B2 (en) Method and apparatus providing smooth adaptive management of packets containing time-ordered content at a receiving terminal
US7120259B1 (en) Adaptive estimation and compensation of clock drift in acoustic echo cancellers
JP4744444B2 (en) STREAM DATA RECEIVING / REPRODUCING DEVICE, COMMUNICATION SYSTEM, AND STREAM DATA RECEIVING / REPRODUCING METHOD
JP2007511939A5 (en)
US20040143675A1 (en) Resynchronizing drifted data streams with a minimum of noticeable artifacts
CN110933233B (en) Mobile terminal and method for processing voice communication
CN101335796B (en) Decoding scheduling method and device
JP4130612B2 (en) Packet processing device
JP2005184383A (en) System, apparatus and method for real-time data communication
JP3187953B2 (en) Wireless communication device
US7586937B2 (en) Receiving device and method
JP5234845B2 (en) Packet transmitting / receiving apparatus, method, and program
JP2005142757A (en) Receiving apparatus and method
JP3669660B2 (en) Call system
JP2006253843A (en) Voice data interpolator and voice data interpolation method
JP2005294983A (en) VoIP TRANSMISSION RECEPTION SYSTEM
JP5806719B2 (en) Voice packet reproducing apparatus, method and program thereof
JP2005323029A (en) Gateway device for asynchronous voice communication
JP2000092122A (en) Audio transmission equipment
JPH02288441A (en) Voice packet reception circuit
WO2009078535A1 (en) Apparatus and method for generating fixed delay of variable-bandwidth multicodec

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 201203 No. 399, Keyuan Road, Zhangjiang High-tech Park, Pudong New Area, Shanghai

Patentee after: Aojie Technology Co., Ltd

Address before: 201203 No. 399, Keyuan Road, Zhangjiang High-tech Park, Pudong New Area, Shanghai

Patentee before: Aojie Technology (Shanghai) Co.,Ltd.

CP02 Change in the address of a patent holder

Address after: 201203 Floor 9, building 10, No. 399, Keyuan Road, China (Shanghai) free trade pilot zone, Pudong New Area, Shanghai

Patentee after: Aojie Technology Co., Ltd

Address before: 201203 No. 399, Keyuan Road, Zhangjiang High-tech Park, Pudong New Area, Shanghai

Patentee before: Aojie Technology Co., Ltd
