CN111883170A - Voice signal processing method and system, audio processing chip and electronic equipment


Info

Publication number
CN111883170A
Authority
CN
China
Prior art keywords
signal
energy spectrum
voice
packet loss
modulation
Prior art date
Legal status
Granted
Application number
CN202010271015.7A
Other languages
Chinese (zh)
Other versions
CN111883170B (en)
Inventor
方桂萍
肖全之
Current Assignee
Zhuhai Jieli Technology Co Ltd
Original Assignee
Zhuhai Jieli Technology Co Ltd
Priority date
Application filed by Zhuhai Jieli Technology Co Ltd
Priority to CN202010271015.7A
Publication of CN111883170A
Application granted
Publication of CN111883170B
Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/21 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention relates to a method and system for processing a voice signal containing packet loss data, a voice processing chip, a computer-readable medium and an electronic device. The processing method comprises the following steps: first, acquiring a voice signal containing packet loss data and a packet loss judgment signal corresponding to the voice signal, the packet loss judgment signal carrying, for each audio data packet in the voice signal, the information of whether that packet is packet loss data; then converting the voice signal into a frequency domain signal, recorded as a first signal, and generating a modulation signal from the packet loss judgment signal; modulating the first signal with the modulation signal to obtain a correction signal; and then performing an inverse Fourier transform on the correction signal to obtain a pre-output signal. The processing method of the invention avoids retransmission of the voice signal as far as possible, reduces the load on the transmission bandwidth and improves the real-time performance of the voice signal.

Description

Voice signal processing method and system, audio processing chip and electronic equipment
Technical Field
The invention relates to the technical field of communication, in particular to a processing method and a processing system of a voice signal containing packet loss data, an audio processing chip, electronic equipment and a computer readable storage medium.
Background
With the spread of wireless connectivity, audio communication over Bluetooth, Wi-Fi and similar links is almost ubiquitous. In practice, however, the transmission of a voice signal is often disturbed by factors such as the environment or the antenna, so parts of the voice signal are easily lost, which is very unpleasant for the listener.
In the prior art, one way to resist packet loss is to retransmit the voice signal, but bandwidth limits, real-time requirements and other factors mean that the signal cannot be retransmitted indefinitely, so the transmitted data may still be defective. Other anti-packet-loss methods add redundant information during encoding for error correction, which inevitably increases the amount of data to be transmitted and the bandwidth burden.
Disclosure of Invention
Based on the above situation, a primary objective of the present invention is to provide a processing method and a processing system for a voice signal containing packet loss data, an audio processing chip, an electronic device, and a computer-readable storage medium, so as to solve the problems of increased bandwidth burden and reduced real-time performance caused by retransmission or additional coding in the prior art.
To achieve this objective, the invention adopts the following technical solution:
a first aspect of the present invention provides a method for processing a voice signal containing packet loss data, including:
S100: acquiring a voice signal containing packet loss data and a packet loss judgment signal corresponding to the voice signal, wherein the packet loss judgment signal is a sequence formed by the error flags of all audio data packets in the voice signal; if the data in an audio data packet is packet loss data, the corresponding error flag is 0, otherwise the error flag is 1;
S300: performing a Fourier transform on the packet loss judgment signal and computing its energy spectrum to obtain a packet loss energy spectrum, and generating the modulation signal from the packet loss energy spectrum;
S400: converting the voice signal into a voice frequency domain signal and computing its energy spectrum to obtain a voice energy spectrum;
S500: selecting the L peak values with the largest energy in the voice energy spectrum; taking each such peak value as a main peak, selecting several secondary peaks symmetrically on the left and right sides of the main peak in the voice energy spectrum to generate a modulation energy spectrum; then generating the gain coefficients corresponding to each modulation energy spectrum from the modulation signal and that modulation energy spectrum, and performing multiple modulation correction processing on the voice frequency domain signal with these gain coefficients to obtain a correction signal; wherein the position of each peak value in a modulation energy spectrum is the position of that peak value in the voice energy spectrum, and the number of secondary peaks in each modulation energy spectrum is the smaller of the number of secondary peaks to the left and to the right of its main peak in the voice energy spectrum;
S700: performing an inverse Fourier transform on the correction signal to obtain a pre-output signal.
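For orientation only, a minimal numpy sketch of how steps S100 to S700 might chain together for one audio frame is given below. The function and variable names are illustrative rather than taken from the patent, and make_modulation_signal / modulate_peaks stand in for the S310/S320 and S510-S570 sketches given later in the detailed description.

```python
import numpy as np

def process_lossy_frame(pcm_frame, error_flags, L=5):
    # S100: packet loss judgment signal, one flag per data point
    # (1 = correct data, 0 = lost data).
    judge = np.asarray(error_flags, dtype=float)

    # S400: voice frequency domain signal (first signal) and its energy spectrum.
    first_signal = np.fft.rfft(pcm_frame)
    amp = np.abs(first_signal) ** 2

    # S300: modulation signal from the packet loss judgment signal
    # (n = 0.2 x spectrum length per a preferred embodiment; see the
    # S310/S320 sketch further below).
    delta = make_modulation_signal(judge, n=max(2, int(0.2 * len(amp))))

    # S500: modulation correction around the L strongest peaks
    # (see the S510-S570 sketch further below).
    corrected = modulate_peaks(first_signal, amp, delta, L)

    # S700: inverse Fourier transform gives the pre-output signal.
    return np.fft.irfft(corrected, n=len(pcm_frame))
```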
Preferably, the method further includes, between steps S100 and S300, the steps of:
S200: judging whether the voice signal contains a human voice; if so, executing S300; if not, executing S600;
S600: generating a correction signal using the background sound estimation frequency domain energy spectrum, and then performing S700; the background sound estimation frequency domain energy spectrum is the frequency domain energy spectrum generated from the closest preceding voice signal that does not contain packet loss data.
Preferably, the step S300 includes:
S310: performing a Fourier transform on the packet loss judgment signal and computing its energy spectrum to obtain a packet loss energy spectrum;
S320: selecting the main lobe and the part of the side lobes adjacent to the main lobe in the packet loss energy spectrum, normalizing them, and replacing the amplitude of the main lobe by its reciprocal to generate the modulation signal delta.
Preferably, the step S500 includes:
S510: the voice energy spectrum is recorded as AMP, and the L peak values with the largest energy are recorded as AMP[Ki], where AMP[Ki] represents the i-th peak in the voice energy spectrum, located at the Ki-th position; with each AMP[Ki] as a main peak, several secondary peaks on its left and right sides in the voice energy spectrum are selected to generate a modulation energy spectrum; Ki = 0, 1, 2, …, n-1; i = 1, 2, 3, …, L; n is the length of the modulation signal delta; L is greater than or equal to 4 and less than or equal to 6;
S520: setting i = 1;
S530: setting j = 0, and multiplying the value at the [Ki+j]-th position of the voice frequency domain signal by delta[j], the result being the new value at the [Ki+j]-th position of the voice frequency domain signal; where delta[j] is the value at the j-th position of the modulation signal delta;
S540: calculating j = j + 1;
determining whether rate1 = 1 - delta[j] × AMP[Ki]/AMP[Ki+j] is less than 0; if so, setting the value at the [Ki+j]-th position of the voice frequency domain signal to 0; if not, multiplying the value at the [Ki+j]-th position of the voice frequency domain signal by rate1, the result being the new value at that position;
determining whether rate2 = 1 - delta[j] × AMP[Ki]/AMP[Ki-j] is less than 0; if so, setting the value at the [Ki-j]-th position of the voice frequency domain signal to 0; if not, multiplying the value at the [Ki-j]-th position of the voice frequency domain signal by rate2, the result being the new value at that position;
S550: judging whether j is less than n; if so, returning to S540; if not, executing S560;
S560: judging whether i is less than L + 1; if so, setting i = i + 1 and returning to S530; if not, executing S570;
S570: taking the corrected voice frequency domain signal as the correction signal;
where, for each value of i, delta[0] together with each rate1 and rate2 forms the gain coefficients corresponding to AMP[Ki].
Preferably, the length n of the modulation signal delta is 0.2 × the length of the voice energy spectrum.
Preferably, the step S700 is followed by:
S800: performing windowed frame overlapping on the pre-output signal using a frame overlap signal to obtain an actual output signal;
S900: updating the frame overlap signal using the pre-output signal;
wherein the initial value of the frame overlap signal is 0.
Preferably, the step S700 further comprises:
S580: calculating the correlation degree between the frequency domain energy spectrum of the correction signal and the correlation frequency domain energy spectrum;
S590: judging whether the correlation degree is smaller than a preset correlation threshold; if so, setting the frame overlap window length to a first window length; otherwise, setting the frame overlap window length to a second window length, the second window length being larger than the first window length;
the step S800 specifically includes:
windowing frame overlapping is carried out on the pre-output signal by using the frame overlapping signal according to the frame overlapping window length to obtain an actual output signal, and the correlation frequency domain energy spectrum is updated by using the frequency domain energy spectrum of the correction signal;
wherein the initial value of the correlation frequency domain energy spectrum is 0.
Preferably, the step S580 includes:
S581: selecting the u peak values with the largest amplitude in the correlation frequency domain energy spectrum and recording their positions, these positions forming a first array X;
S582: computing the energy spectrum of the correction signal, recorded as the correction energy spectrum, selecting the w peak values with the largest amplitude in the correction energy spectrum and recording their positions, these positions forming a second array Y;
S583: for each X[m] in the first array X, finding the value closest to it in the second array Y, denoted Y[v], and computing Δ[m] = |X[m] - Y[v]|, where m = 1, 2, …, u;
S584: the correlation degree is the sum of all Δ[m];
wherein u is less than w.
Preferably, the voice signal is a voice signal obtained by bluetooth.
A second aspect of the present invention provides a system for processing a voice signal containing packet loss data, including:
the acquisition module is used for acquiring a voice signal containing packet loss data and a packet loss judgment signal corresponding to the voice signal; the packet loss judgment signal is a sequence formed by error marks of all audio data packets in the voice signal, wherein if the data of the audio data packets is packet loss data, the error mark corresponding to the data is 0, otherwise, the error mark is 1;
the correction module is used for performing a Fourier transform on the packet loss judgment signal and computing its energy spectrum to obtain a packet loss energy spectrum, and generating the modulation signal from the packet loss energy spectrum; converting the voice signal into a voice frequency domain signal and computing its energy spectrum to obtain a voice energy spectrum; then selecting the L peak values with the largest energy in the voice energy spectrum, taking each such peak value as a main peak, selecting several secondary peaks symmetrically on the left and right sides of the main peak in the voice energy spectrum to generate a modulation energy spectrum, then generating a plurality of gain coefficients corresponding to each modulation energy spectrum from the modulation signal and each modulation energy spectrum, and cyclically performing multiple modulation correction processing on the voice frequency domain signal using the plurality of gain coefficients to obtain the correction signal; the position of each peak value in a modulation energy spectrum is the position of that peak value in the voice energy spectrum, and the number of secondary peaks in each modulation energy spectrum is the smaller of the number of secondary peaks to the left and to the right of its main peak in the voice energy spectrum;
the output module is used for carrying out Fourier inversion on the corrected signal to obtain a pre-output signal;
the acquisition module is in signal connection with the correction module, and the correction module is in signal connection with the output module.
Preferably,
the acquisition module comprises an input cache unit, and the input cache unit is used for storing the acquired voice signal and the packet loss judgment signal;
the correction module comprises a modulation unit and a correction unit, wherein the modulation unit is used for carrying out Fourier transform on the packet loss judgment signal, solving an energy spectrum to obtain a packet loss energy spectrum, and generating the modulation signal through the packet loss energy spectrum; the correction unit is used for converting the voice signal into a voice frequency domain signal and solving an energy spectrum to obtain a voice energy spectrum; then selecting the L peak values with the largest energy in the voice energy spectrum, taking each peak value with the largest energy as a main peak, selecting a plurality of secondary peaks which are symmetrical on the left side and the right side of the main peak in the voice energy spectrum to generate a modulation energy spectrum, then generating a plurality of gain coefficients corresponding to each modulation energy spectrum through the modulation signal and each modulation energy spectrum, and circularly performing a plurality of times of modulation correction processing on the voice frequency domain signal by using the plurality of gain coefficients to obtain the correction signal;
the input buffer unit and the correction unit share one buffer space.
A third aspect of the present invention provides an audio processing chip, where the audio processing chip is capable of executing any one of the above processing methods for a voice signal containing packet loss data.
A fourth aspect of the present invention provides an electronic device, having the above audio processing chip, wherein the voice signal processed by the audio processing chip is a voice signal obtained through bluetooth; the electronic equipment is a mobile phone, a Bluetooth microphone, a Bluetooth sound box or a Bluetooth earphone.
A fifth aspect of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when executed, implements the method for processing a voice signal containing packet loss data according to any one of the above.
The processing method for a voice signal containing packet loss data directly uses the information, carried in the voice signal itself, of whether each audio data packet is packet loss data to generate a packet loss judgment signal, generates a modulation signal from the packet loss judgment signal, and then corrects the voice signal with the modulation signal. Because the information on whether data has been lost is contained in the voice signal itself, no extra redundant information needs to be added: the voice signal's own information can be used for correction, and transmission of the voice signal does not increase the bandwidth load. In addition, the method does not require repeated retransmission of lost packet data, which improves the real-time performance of the voice signal.
Other advantages of the present invention will be described in the detailed description, and those skilled in the art will understand the technical features and technical solutions presented in the description.
Drawings
Preferred embodiments of the present invention will be described below with reference to the accompanying drawings. In the figure:
FIG. 1 is a flow chart of a preferred embodiment of a method for processing a speech signal according to the present invention;
fig. 2 is a waveform diagram of a packet loss determination signal of a voice signal containing packet loss data according to a preferred embodiment of the voice signal processing method provided in the present invention;
fig. 3 is a waveform diagram of a packet loss energy spectrum corresponding to the packet loss determination signal shown in fig. 2;
FIG. 4 is a graph comparing waveforms of a packet loss energy spectrum corresponding to a speech signal and a packet loss energy spectrum corresponding to a correct speech signal in the embodiment shown in FIG. 2;
FIG. 5 is a waveform comparison diagram of a speech energy spectrum and a modified energy spectrum according to a preferred embodiment of the speech signal processing method provided by the present invention;
fig. 6 is a waveform comparison diagram of a packet loss signal and an actual output signal corresponding to the packet loss signal in the speech signal processing method according to the present invention;
fig. 7 is a system diagram of a speech signal processing system according to a preferred embodiment of the present invention.
Detailed Description
The present invention will be described below based on examples, but it is not limited to these examples. In the following detailed description, certain specific details are set forth in order to provide a thorough understanding of the invention; to avoid obscuring the essence of the invention, well-known methods, procedures and components are not described in detail.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale. Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
It should be noted that, although the logic sequence of some embodiments is shown in the drawings of the present invention, the present invention is not limited to the logic sequence given in the drawings, and in some cases, the processing method of the speech signal according to the present invention may be executed in a logic sequence different from that of the drawings.
In communication over Bluetooth, wireless and similar links, a voice signal is transmitted between a host and a slave. For example, with a mobile phone as the host and an earphone as the slave, when the phone sends voice to the earphone for the wearer to listen to, the earphone receives and decodes the communication signal sent by the phone to obtain a voice signal, which it then outputs. As noted in the background, interference during this transmission readily causes audio data packets to be lost.
In view of the above problem, the present invention provides a method for processing a voice signal containing packet loss data, as shown in fig. 1, including the steps of:
S100: acquiring a voice signal containing packet loss data (hereinafter referred to as a packet loss signal) and a packet loss judgment signal corresponding to the voice signal. The packet loss judgment signal is a sequence formed by the error flags of each audio data packet: if the data in an audio data packet is packet loss data, the corresponding error flag is 0; otherwise it is 1. That is, the voice signal includes a plurality of audio data packets, each audio data packet includes a plurality of data (such as 20, 30, 40, 50 and so on), and each data corresponds to an error flag; if the data is packet loss data the corresponding error flag is 0, and if the data is non-packet-loss data (i.e. correct data) the corresponding error flag is 1. In this way each audio data packet forms an error flag subsequence, and these subsequences together form the sequence; fig. 2 shows a waveform of such a packet loss judgment signal. The information on whether data is packet loss data comes from the voice signal itself (the check code in the voice signal, as described below);
S300: performing a Fourier transform on the packet loss judgment signal to obtain a second signal, and computing the energy spectrum of the second signal to obtain a packet loss energy spectrum, as shown in fig. 3; then generating a modulation signal from the packet loss energy spectrum;
S400: converting the voice signal (the packet loss signal) into a voice frequency domain signal, recorded as a first signal, and computing its energy spectrum to obtain a voice energy spectrum;
S500: selecting the L peak values with the largest energy in the voice energy spectrum; taking each such peak value as a main peak, selecting several secondary peaks symmetrically on the left and right sides of the main peak in the voice energy spectrum to generate a modulation energy spectrum; then generating the gain coefficients corresponding to each modulation energy spectrum from the modulation signal and that modulation energy spectrum, and performing multiple modulation correction processing on the voice frequency domain signal using these gain coefficients to obtain a correction signal; the position of each peak value in a modulation energy spectrum is the position of that peak value in the voice energy spectrum, and the number of secondary peaks in each modulation energy spectrum is the smaller of the number of secondary peaks to the left and to the right of its main peak in the voice energy spectrum.
S700: performing an inverse Fourier transform on the correction signal to obtain a pre-output signal. The pre-output signal is the signal obtained by correcting the packet loss signal; both the pre-output signal and the packet loss signal are time-domain signals, ready for subsequent output.
The packet loss signal processing method directly uses the information on whether each audio data packet in the voice signal is packet loss data (i.e. the check code described below) to generate a packet loss judgment signal, generates a modulation signal from the packet loss judgment signal, and then corrects the current voice signal with the modulation signal. The method relies on the fact that a voice signal is periodic over a period of time and contains much redundant information, so it makes full use of the voice signal's own information, without adding extra coding information, to correct the packet loss signal. Because the information on whether packets have been lost is contained in the voice signal, the bandwidth burden is not increased during transmission; at the same time, the communication signal corresponding to the packet loss signal does not need to be retransmitted repeatedly, so the real-time performance of the voice signal can be improved.
In step S100, a packet loss judgment signal formed from 1s and 0s is simple in form, makes it easy to see which audio data packets have been lost, and makes those packets easier to locate and correct. Fig. 2 is a schematic diagram of the packet loss judgment signal corresponding to a voice signal containing lost audio data; it can be seen that packet loss occurs at the troughs of the rectangular wave, and the audio data packets at those troughs need to be corrected.
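As a small illustration of the judgment signal described above, the sketch below uses a hypothetical packet representation (a list of pcm/check-code pairs, not a structure defined in the patent) to expand one error flag per decoded audio data packet into the per-data 0/1 sequence of fig. 2.

```python
import numpy as np

def build_loss_judgment(packets):
    """Expand per-packet error flags into the 0/1 packet loss judgment signal (S100).

    `packets` is assumed to be a list of (pcm_samples, crc_ok) pairs produced by the
    decoder; every data point of a lost packet is flagged 0, every correct one 1.
    """
    judge = []
    for pcm_samples, crc_ok in packets:
        flag = 1.0 if crc_ok else 0.0
        judge.extend([flag] * len(pcm_samples))
    return np.asarray(judge)
```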
In step S500, rather than directly using every data point of the voice energy spectrum to correct the first signal, the L peak values with the largest energy in the voice energy spectrum are selected; each such peak value is taken as a main peak, several secondary peaks symmetrically on its left and right sides in the voice energy spectrum are selected to generate a modulation energy spectrum, the gain coefficients corresponding to each modulation energy spectrum are generated from the modulation signal and that modulation energy spectrum, and the first signal is subjected to multiple modulation correction passes using these gain coefficients to obtain a correction signal; the position of each peak value in a modulation energy spectrum is the position of that peak value in the voice energy spectrum, and the number of secondary peaks in each modulation energy spectrum is the smaller of the number of secondary peaks to the left and to the right of its main peak in the voice energy spectrum. Correcting the voice signal multiple times in this way brings the actual output signal corresponding to the packet loss signal closer to the correct signal, giving the user a better listening experience.
The execution sequence of the steps S300 and S400 is not limited, and the step S300 may be executed first, and then the step S400 may be executed; or executing step S400 first and then executing step S300; or step S300 is performed simultaneously with step S400.
It is understood that the voice signal may be in one of several states: it may contain a human voice (a human voice signal), or it may be a background sound signal containing no human voice; in either case it can be corrected directly with the steps above. In order to improve the efficiency of correcting a voice signal containing packet loss data, a preferred embodiment of the present invention handles a voice signal containing a human voice and a voice signal that is background sound in different ways. Specifically, the following steps are added between steps S100 and S300:
S200: judging whether the voice signal contains a human voice; if so, executing S300; if not, executing S600;
S600: generating a correction signal using the background sound estimation frequency domain energy spectrum, then performing S700; the background sound estimation frequency domain energy spectrum is the frequency domain energy spectrum of the voice signal that is closest to the current voice signal and contains no packet loss data, i.e. it comes from the closest correct signal (a voice signal containing no packet loss data) preceding the current packet loss signal (the voice signal currently containing packet loss data).
That is, if the current packet loss signal is a background sound signal, the background sound estimation frequency domain energy spectrum is directly used for correction, and for a voice signal containing human voice, the method of steps S300 to S500 is adopted to correct the voice signal in consideration of more information contained therein, so that various complex operations can be reduced, and the processing efficiency of the packet loss signal is improved.
In step S200, when determining whether the voice signal contains a human voice, either the current voice signal or the previous voice signal may be examined. For judging whether the current or the previous voice signal contains a human voice, any method commonly used by those skilled in the art may be adopted; for example, the energy of the voice signal may be calculated and, if it exceeds a certain threshold, a human voice is considered present (the signal is a human voice signal). The many variations based on an energy threshold are not listed here. However, in complex application environments such a judgment may fail, its accuracy may suffer, and the amount of computation is large.
To address this, a preferred embodiment of the present invention makes the judgment using the previous voice signal (i.e. the closest preceding normal signal): if that signal contains a human voice, the current voice signal is also considered to contain a human voice; if that signal is a background sound signal, the current voice signal is also considered to be a background sound signal. The voice/no-voice state is updated only when the voice signal is a normal signal (i.e. contains no packet loss data packet); otherwise the state of the previous voice signal is used directly. In effect, the invention uses the state of the normal signal closest to and preceding the current packet loss signal. Specifically, before S100 the method further includes judging whether the voice signal is a packet loss signal and, when it is a normal signal, further judging whether it contains a human voice. Before S100 the method thus further includes:
S000: collecting a voice signal and judging whether it contains packet loss data; if so, executing S100, and if not, executing S010;
S010: acquiring the voice signal (which contains no packet loss data), converting it into a frequency domain signal recorded as a third signal, and computing the energy spectrum of the third signal to obtain a fourth signal;
S020: performing a Fourier transform on the fourth signal to obtain a fifth signal;
S030: calculating the proportion B of low-frequency components in the fifth signal relative to the whole spectrum, and estimating the actual likelihood P2 that the voice signal contains a human voice from the proportion B and the estimated likelihood P1; where the estimated likelihood P1 is an estimate of the likelihood that a human voice is present, with an initial value of 0;
S040: taking the actual likelihood P2 as the new estimated likelihood P1, and judging whether the actual likelihood is greater than a voice possibility threshold; if so, the voice signal contains a human voice (it is a human voice signal); if not, it does not (it is a background sound signal).
Specifically, step S030 includes:
S031: calculating the proportion B of low-frequency components in the fifth signal relative to the whole spectrum;
S032: calculating the estimated minimum likelihood M that the voice signal contains a human voice as M = 0.9 × min(M, (1-B)) + 0.1 × (1-B), where min(M, (1-B)) is the smaller of M and (1-B);
S033: calculating the actual likelihood P2 = P1 × 0.7 + (1-B-M) × 0.5/(1-M).
In this embodiment, the voice possibility threshold in step S040 is selected to be 0.4-0.7, such as 0.4, 0.5, 0.6, 0.7, etc., and preferably, the voice possibility threshold is 0.5, so as to distinguish whether voice is included in the voice signal as much as possible.
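A compact sketch of the S030-S040 update is shown below; the fraction of the spectrum counted as "low frequency" is an assumption, since the text does not fix it numerically, and the small epsilon terms guard against division by zero.

```python
import numpy as np

def update_voice_likelihood(fifth_signal, M, P1, low_fraction=0.25, threshold=0.5):
    mag = np.abs(fifth_signal)
    n_low = max(1, int(low_fraction * len(mag)))
    B = mag[:n_low].sum() / (mag.sum() + 1e-12)               # S031: low-frequency ratio

    M = 0.9 * min(M, 1.0 - B) + 0.1 * (1.0 - B)               # S032: estimated minimum likelihood
    P2 = P1 * 0.7 + (1.0 - B - M) * 0.5 / (1.0 - M + 1e-12)   # S033: actual likelihood

    has_voice = P2 > threshold                                # S040: compare with the voice threshold
    return has_voice, M, P2                                   # P2 is carried over as the next P1
```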
Further, the background sound estimation frequency domain energy spectrum used in step S600 is updated only when the voice signal is a correct signal. In a preferred embodiment of the present invention it is generated in different ways depending on whether the voice signal contains a human voice, reusing the judgment described above: if step S040 determines that the voice signal contains a human voice, step S050 is executed to generate the background sound estimation frequency domain energy spectrum; if it does not contain a human voice, step S060 is executed:
S050: calculating the background sound estimation frequency domain energy spectrum Q = min(Q, fourth signal); then, using this result, updating the background sound estimation frequency domain energy spectrum as Q = (Q × (1-P2) + P2 × fourth signal) × 0.6 + audio energy spectrum × 0.4;
S060: calculating the background sound estimation frequency domain energy spectrum Q = min(Q, fourth signal); then, using this result, updating it again as Q = Q × 0.09 + 0.01 × fourth signal.
In practice, after updating the background sound estimation energy spectrum in step S050 or S060, the method further performs an inverse Fourier transform on the third signal to obtain the pre-output signal; that is, in step S700 the third signal is used directly in place of the correction signal, so that the normal signal is output.
Generating the correction signal in step S600 specifically includes: using the background sound estimation frequency domain energy spectrum as the amplitude and random numbers as the phase to generate the correction signal.
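A sketch of the S050/S060 update of Q and of the S600 correction-signal generation follows. The "audio energy spectrum" of S050 is passed in as given because the text does not define it further here, and Q is used directly as the amplitude following the wording of S600; both are assumptions of this sketch rather than details fixed by the patent.

```python
import numpy as np

def update_background_spectrum(Q, fourth_signal, audio_energy_spectrum, P2, has_voice):
    Q = np.minimum(Q, fourth_signal)              # common first step of S050 / S060
    if has_voice:
        # S050: blend the previous estimate, the current spectrum and the audio energy spectrum
        Q = (Q * (1.0 - P2) + P2 * fourth_signal) * 0.6 + audio_energy_spectrum * 0.4
    else:
        # S060: slow update, dominated by the previous estimate
        Q = Q * 0.09 + 0.01 * fourth_signal
    return Q

def correction_from_background(Q, rng=None):
    # S600: background sound estimated energy spectrum as amplitude, random numbers as phase
    rng = np.random.default_rng() if rng is None else rng
    phase = rng.uniform(-np.pi, np.pi, size=np.shape(Q))
    return Q * np.exp(1j * phase)
```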
The background sound estimation frequency domain energy spectrum is thus updated in different ways depending on whether the correct (non-packet-loss) signal is a human voice signal. In particular, when it is a human voice signal the fourth signal is not used directly; instead, various proportions of the previous voice signal and the current human voice signal are combined, so that when packet loss occurs the actual output signal has better continuity and the human voice does not suddenly switch to background sound.
It should be noted that the initial values of the estimated minimum likelihood M, the estimated likelihood P1 and the background sound estimated frequency domain energy spectrum Q are all 0.
Although in the above embodiment the background sound estimation frequency domain energy spectrum is used to generate the correction signal when the packet loss signal is a background sound signal, the invention is not limited to this: the background sound of the normal signal preceding the packet loss signal may also be extracted directly, using any extraction method commonly used by those skilled in the art, such as filtering out the high-frequency energy.
In fact, the voice signal includes a plurality of audio data packets, and each audio data packet includes the data itself and a check code. The check code makes it possible to determine whether the data in each audio data packet is erroneous, that is, whether it is packet loss data, and hence whether the voice signal containing that packet includes packet loss data. Step S100 specifically includes:
S110: receiving a transmitted data packet and decoding it to obtain a decoded audio data packet, which includes the data pcm and a check code; the check code is used to judge whether each data in the audio data packet is packet loss data; the current audio data packet is then stored into the voice signal (comprising the data pcm and the packet-loss information);
S120: judging whether the accumulated voice signal has reached a preset length (the length of one voice signal); if so, taking out the voice signal and executing S300 (or S200, when step S200 is included); if not, returning to S110.
In this embodiment, in step S110, if the check code shows that the data in an audio data packet is packet loss data, the corresponding error flag is stored as 0; if the check code shows that the data is non-packet-loss data, the corresponding error flag is stored as 1. Each audio data packet thus corresponds to a subsequence of 1s and 0s, and together these form an error flag sequence of 1s and 0s, i.e. the packet loss judgment signal, which is taken out at the same time as the voice signal.
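A sketch of the S110/S120 receive-and-buffer loop is given below; receive_packet and decode are placeholder callables standing in for the transport and codec layers, not names taken from the patent.

```python
def collect_frame(receive_packet, decode, frame_len):
    pcm_buf, flag_buf = [], []
    while len(pcm_buf) < frame_len:          # S120: keep going until one frame is buffered
        packet = receive_packet()
        pcm, crc_ok = decode(packet)         # S110: decoded pcm data plus the check-code verdict
        pcm_buf.extend(pcm)
        flag_buf.extend([1 if crc_ok else 0] * len(pcm))
    return pcm_buf[:frame_len], flag_buf[:frame_len]
```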
For generating the modulation signal, the entire packet loss energy spectrum could be used directly. However, using the whole packet loss energy spectrum involves too much data and hurts operation efficiency, and the packet loss energy spectrum is symmetric, as shown in fig. 3; in a preferred embodiment of the present invention, therefore, only half of the data is selected. Moreover, even selecting an entire half, such as all the data to the right of the main lobe (the peak with the largest amplitude) in fig. 3, still involves too much computation, and the side lobes (peaks with smaller amplitude than the main lobe, located on its left and right sides) only have appreciable amplitude close to the main lobe, their amplitude being small farther away. More preferably, therefore, only the main lobe and the portion on one side of it are selected to generate the modulation signal delta. Specifically, step S300 includes:
S310: performing a Fourier transform on the packet loss judgment signal to obtain a second signal, and computing the energy spectrum of the second signal to obtain a packet loss energy spectrum;
S320: selecting the main lobe and the part of the side lobes adjacent to the main lobe in the packet loss energy spectrum, normalizing them, and replacing the amplitude of the main lobe by its reciprocal to generate the modulation signal delta.
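A sketch of S310/S320 might look as follows. The normalisation constant is left as a parameter because the text only states that the selected main lobe and partial side lobes are normalised; with the default used here the main lobe normalises to 1, whereas the worked example given later normalises it to about 0.795 before taking the reciprocal, so the constant is an open assumption of this sketch.

```python
import numpy as np

def make_modulation_signal(judge, n, norm=None):
    # S310: Fourier transform of the packet loss judgment signal, then its energy spectrum
    loss_spectrum = np.abs(np.fft.fft(judge)) ** 2

    # S320: keep the main lobe and the adjacent part of the side lobes on one side
    main = int(np.argmax(loss_spectrum))
    delta = loss_spectrum[main:main + n].astype(float).copy()

    # Normalise (the exact constant is an assumption), then replace the main-lobe
    # amplitude by its reciprocal.
    delta /= (norm if norm is not None else delta.max() + 1e-12)
    delta[0] = 1.0 / delta[0]
    return delta
```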
Referring to fig. 4, the dashed curve is the packet loss amplitude spectrum corresponding to the packet loss signal, recorded as a first packet loss amplitude spectrum; the solid curve is the packet loss amplitude spectrum of the correct voice signal corresponding to the packet loss signal, recorded as a second packet loss amplitude spectrum. It can be seen from the figure that at the main lobe the peak of the first packet loss amplitude spectrum is lower than that of the second, while at the side lobes the peaks of the first packet loss amplitude spectrum are higher than those of the second.
In this embodiment, step S500 includes:
S510: the voice energy spectrum of the packet loss signal is recorded as AMP, and the L peak values with the largest energy are recorded as AMP[Ki]; with each AMP[Ki] as a main peak, several secondary peaks on its left and right sides in the voice energy spectrum are selected to generate a modulation energy spectrum. That is, the L peak values with the largest energy in the voice energy spectrum are selected: AMP[K1], AMP[K2], AMP[K3], AMP[K4], AMP[K5], …, AMP[KL]; the positions Ki of these L peaks are of course all different. Here AMP[Ki] represents the i-th peak in the voice energy spectrum, located at the Ki-th position in the voice energy spectrum; Ki = 0, 1, 2, …, n-1; i = 1, 2, 3, …, L; n is the length of the modulation signal delta; L is greater than or equal to 4 and less than or equal to 6, i.e. L is 4, 5 or 6. Next, the modulation signal delta and the L peaks, together with the data in their vicinity, are used to update the values at the corresponding positions of the first signal;
S520: setting i = 1;
S530: setting j = 0, and updating the value at the Ki-th position of the first signal using delta[j] × AMP[Ki+j]; where delta[j] is the value at the j-th position of the modulation signal delta, and AMP[Ki+j] is the value at the [Ki+j]-th position of the voice energy spectrum AMP;
S540: calculating j = j + 1;
determining whether rate1 = 1 - delta[j] × AMP[Ki]/AMP[Ki+j] is less than 0; if so, setting the value at the [Ki+j]-th position of the first signal to 0; if not, multiplying the value at the [Ki+j]-th position of the first signal by rate1, the result being the new value at the [Ki+j]-th position of the first signal;
determining whether rate2 = 1 - delta[j] × AMP[Ki]/AMP[Ki-j] is less than 0; if so, setting the value at the [Ki-j]-th position of the first signal to 0; if not, multiplying the value at the [Ki-j]-th position of the first signal by rate2, the result being the new value at the [Ki-j]-th position of the first signal;
S550: judging whether j is less than n; if so, returning to S540; if not, executing S560;
S560: judging whether i is less than L + 1; if so, setting i = i + 1 and returning to S530; if not, executing S570;
S570: taking the modified first signal as the correction signal.
Here, for each value of i, delta[0] together with each rate1 and rate2 forms the gain coefficients corresponding to AMP[Ki]. That is to say, the L peak values with the largest energy in the voice energy spectrum AMP are selected first, and then, in order of decreasing energy, the 2n-1 values of the first signal around each peak are updated once using that peak, the values on its left and right sides and the corresponding values of the modulation signal delta; the first signal is thus updated L times, yielding the correction signal. Updating the first signal multiple times in this way brings the corrected speech closer to the correct signal corresponding to the packet loss signal, and so better improves the user's listening experience.
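The loop of S510-S570 could be sketched as follows. A plain argsort is used to pick the L largest energies, which is only an approximation of "the L peak values with the largest energy", and positions near the spectrum edges are simply clipped, in line with the boundary note a little further below.

```python
import numpy as np

def modulate_peaks(first_signal, amp, delta, L=5):
    sig = first_signal.copy()
    n = len(delta)
    # S510: positions of the L largest values of the voice energy spectrum AMP,
    # processed in decreasing order of energy.
    peaks = np.argsort(amp)[::-1][:L]
    for Ki in peaks:
        # S530 with j = 0: scale the main-peak bin by delta[0] (the claim's formulation).
        sig[Ki] = delta[0] * sig[Ki]
        for j in range(1, n):                            # S540 / S550
            for pos in (Ki + j, Ki - j):                 # right side uses rate1, left side rate2
                if 0 <= pos < len(sig):                  # fewer neighbours near the edges
                    rate = 1.0 - delta[j] * amp[Ki] / (amp[pos] + 1e-12)
                    sig[pos] = 0.0 if rate < 0 else rate * sig[pos]
    return sig
```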
The value of L takes into account information such as the length of the voice signal and the number of main formants in the sound; it also weighs the fact that too many peak values means too much computation, affecting the timeliness of the voice signal, while too few peaks may miss some major frequencies. L is therefore chosen to be 4, 5 or 6, preferably 5.
The length n of the modulation signal delta may be selected according to the length of the voice energy spectrum, i.e. it is related to the length of the voice signal; preferably, n is 0.2 × the length of the voice energy spectrum. With this length, the higher-amplitude side lobes are included so that the amplitude over that interval can be suppressed as far as possible, while too much data, which would hurt operation efficiency, is avoided; this achieves a high restoration rate and improves the real-time performance of the voice signal output.
Referring to fig. 4, and taking L = 5 as an example, the 100th to 140th data points of the packet loss amplitude spectrum corresponding to the packet loss signal (the dashed curve) are extracted and normalized to [0.795, 0.191128, 0.152861, 0.099290, 0.042668, 0.005000, 0.035136, 0.044618, 0.036097, 0.016605, …]. After the first value is replaced by the result of 1/0.795, the generated modulation signal delta is [1.2579, 0.191128, 0.152861, 0.099290, 0.042668, 0.005000, 0.035136, 0.044618, 0.036097, 0.016605, …]. Setting delta in this way allows the amplitude at the packet loss positions to be increased and the remaining amplitudes to be reduced, bringing them closer to the correct values. The five peak values with the largest amplitude in the voice energy spectrum, AMP[K1], AMP[K2], AMP[K3], AMP[K4], AMP[K5], are then found. Taking the first peak AMP[K1] as an example, the correction of the first signal A proceeds as follows, where A[Ki] denotes the value at the [Ki]-th position of the first signal; it can be seen that the voice energy spectrum AMP corresponds one-to-one to the data of the first signal A. Thus A[K1] = delta[0] × AMP[K1]. Since the data on both sides of [K1] are processed in the same way, the data to the right of the main lobe in the figure are taken as an example: for A[K1+1], it is first determined whether rate1 = 1 - delta[1] × AMP[K1]/AMP[K1+1] is less than 0; if rate1 is less than 0, then A[K1+1] = 0; otherwise A[K1+1] = rate1 × A[K1+1]. The same applies, by analogy, to A[K1+2], A[K1+3], …, A[K1+n-1]. The data to the left of A[K1] are likewise updated in turn, except that rate2 is used for the update. Then AMP[K2], AMP[K3], AMP[K4], AMP[K5] are used in the same way to update the first signal A several more times, and the first signal A finally obtained is the updated correction signal. Referring to fig. 5, the dashed curve is the waveform of the voice energy spectrum before correction, and the solid curve is the waveform of the energy spectrum of the correction signal (i.e. the frequency domain energy spectrum obtained from the correction signal).
It should be noted that if the number of data points on either side of the [Ki] position in the first signal is less than n-1, steps S540 and S550 need not be executed all the way to j = n-1: however many data points there are on each side of the [Ki] position, only that many are replaced.
In the above embodiments, the pre-output signal may be directly used as the actual output signal for output. In a preferred embodiment of the present invention, in order to improve the continuity of the output of the modified speech signal, step S700 is followed by:
S800: performing windowed frame overlapping on the pre-output signal using the frame overlap signal to obtain an actual output signal;
S900: updating the frame overlap signal using the pre-output signal;
that is, the second half of the frame overlap signal and the first half of the pre-output signal are taken as the actual output signal, and the pre-output signal is then used to update the frame overlap signal for the next output instant. The initial value of the frame overlap signal is 0.
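One plausible reading of S800/S900, with a linear cross-fade over the frame overlap window length (the window shape itself is not specified in the text, so the linear ramp is an assumption), is sketched below; both signals are assumed to be at least win_len samples long.

```python
import numpy as np

def overlap_and_update(pre_output, frame_overlap, win_len):
    # S800: blend the tail of the stored frame overlap signal with the head of the
    # current pre-output signal over win_len samples (assumed linear window).
    fade_in = np.linspace(0.0, 1.0, win_len)
    out = pre_output.copy()
    out[:win_len] = frame_overlap[-win_len:] * (1.0 - fade_in) + pre_output[:win_len] * fade_in
    # S900: the current pre-output signal becomes the frame overlap signal for the next frame.
    return out, pre_output.copy()
```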
As for the frame overlap window length used in the windowed frame overlapping (i.e. the length of the second half or the first half), the same window length may be chosen for every voice signal, or different window lengths may be chosen for different voice signals. In a preferred embodiment of the present invention, step S700 further includes:
S580: calculating the correlation degree between the frequency domain energy spectrum of the correction signal (i.e. the corrected energy spectrum) and the correlation frequency domain energy spectrum;
S590: judging whether the correlation degree is smaller than a preset correlation threshold; if so, setting the frame overlap window length to a first window length; otherwise, setting the frame overlap window length to a second window length, the second window length being longer than the first window length;
in this case, step S800 specifically includes:
windowing and frame overlapping are carried out on the pre-output signal according to the frame overlapping window length by using the frame overlapping signal, so as to obtain an actual output signal; in practice, when packet loss occurs, the positions of the peak values of the corrected energy spectrum of the correction signal and the voice energy spectrum before correction are basically unchanged, so that the voice energy spectrum can be directly used in step S580; when the voice signal is a packet loss signal, the frequency domain energy spectrum of the pre-output signal in step S800 may also directly use the voice energy spectrum;
wherein, the initial value of the correlation frequency domain energy spectrum is 0. In fact, whether the previous speech signal is a packet loss signal or not, the correlation frequency-domain energy spectrum is actually the frequency-domain energy spectrum (i.e. the fourth signal) of the pre-output signal (the third signal when the previous speech signal is a normal signal) corresponding to the previous speech signal. In the embodiment including step S600, after the correction signal is obtained in step S600, the above steps S580 and S590 need to be executed to determine the frame overlap window length used by the packet loss signal in S800.
In this way the method fully accounts for the fluctuation between the spectra of adjacent voice signals and sets the frame overlap window length according to the relation between the current pre-output signal and the previous one, i.e. different frame overlap window lengths are chosen for different conditions. When the correlation indicates that the correction signal and the correlation frequency domain energy spectrum are similar, the shorter window length suffices; otherwise their similarity is too small and a jump is likely at output, so the actual output signal is generated with the larger frame overlap window length to reduce the jump as much as possible. The continuity between the current voice signal and the previous one is thus fully taken into account, so that jumps in the actual output, or blurring of the details of the voice signal, are avoided as far as possible.
Specifically, the first window length and the second window length are set in relation to the sampling rate of the voice signal; to improve operation efficiency, the second window length may be twice the first window length, for example a first window length of 64 and a second window length of 128.
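Combining S580/S590 with the example lengths above, a tiny selection helper might look like this; the threshold value of 6 comes from the worked example further below and is only illustrative.

```python
def choose_overlap_window(correlation, corr_threshold=6, first_len=64, second_len=128):
    # A small correlation value means the peak positions barely moved, so the short
    # window is enough (S590); otherwise use the longer window to soften a possible jump.
    return first_len if correlation < corr_threshold else second_len
```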
In the step S580, the correlation may be obtained by a method including:
S581: selecting the u peak values with the largest amplitude in the correlation frequency domain energy spectrum and recording their positions, these positions forming a first array X;
S582: computing the energy spectrum of the correction signal, recorded as the corrected energy spectrum, selecting the w peak values with the largest amplitude in the corrected energy spectrum and recording their positions, these positions forming a second array Y;
S583: for each X[m] (m = 1, 2, …, u) in the first array X, finding the value closest to it in the second array Y, denoted Y[v], and computing Δ[m] = |X[m] - Y[v]|;
S584: the correlation degree is the sum of all Δ[m];
where u is less than w, preferably with w - u = 2; u may be between 2 and 4 and w between 4 and 6, for example u is 2, 3 or 4 and w is 4, 5 or 6.
Calculating the correlation degree in this way selects only part of the data in the correlation frequency domain energy spectrum and the corrected energy spectrum, which avoids a large amount of computation and improves the real-time performance of the voice signal output; and because the selected positions are those with the largest amplitudes, computing the similarity from these values reflects the similarity of the two amplitude spectra well. Since the spectrum always fluctuates between adjacent voice signals, if u were larger than w some of the positions taken from the correlation frequency domain energy spectrum would not be main peaks and might be introduced by side lobes and the like; they would then fail to reflect the overall trend of the correlation frequency domain energy spectrum, introduce errors instead, and disturb the correlation degree. In other words, if u and w are too large, far more peak positions are selected than there are main peaks, small values are computed, and the reliability of the correlation degree suffers; if u and w are too small, the trend of the frequency domain energy spectrum may not be reflected and the reliability of the correlation degree also drops. The invention therefore takes u smaller than w, so that the positions selected from the correlation frequency domain energy spectrum lie, as far as possible, among those selected from the corrected energy spectrum.
Further preferably, u is selected to be three and w to be five. In practical operation, since the restoration of the first signal does not substantially change the peak positions of the voice energy spectrum, the positions of the L peaks selected in step S510 can be used directly to form the second array Y, further reducing the amount of computation. In one embodiment, the positions of the three peaks with the largest amplitude in the correlation frequency domain energy spectrum form the first array X = (4, 17, 19); the positions of the five peaks with the largest amplitude in the correction energy spectrum (here the L = 5 peak positions from step S510 are used directly) form the second array Y = (7, 10, 5, 23, 20). Then Δ[1] = |4 − 5| = 1, Δ[2] = |17 − 20| = 3, Δ[3] = |19 − 20| = 1, and the correlation degree is 5. If the preset correlation threshold is set to 6, since 5 is smaller than 6, the first window length is selected. A minimal sketch of this computation is given below.
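As a minimal illustration only, the correlation-degree computation of steps S581–S584 and the numerical example above can be sketched as follows; the function name and the use of plain Python lists are illustrative choices, not part of the patent.

```python
def correlation_degree(X, Y):
    """Steps S583-S584: for every position X[m], take the closest position
    Y[v] in the second array and sum the absolute position differences."""
    return sum(min(abs(x - y) for y in Y) for x in X)

# Numerical example from the description (u = 3, w = 5).
X = [4, 17, 19]           # positions of the 3 largest peaks of the correlation frequency domain energy spectrum
Y = [7, 10, 5, 23, 20]    # positions of the 5 largest peaks of the correction energy spectrum

assert correlation_degree(X, Y) == 5             # |4-5| + |17-20| + |19-20|
use_first_window = correlation_degree(X, Y) < 6  # 5 < 6, so the first window length is chosen
```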
The execution order of step S581 and step S582 is not limited, and step S581 may be executed first and step S582 may be executed later, step S582 may be executed first and step S581 may be executed later, or step S581 and step S582 may be executed simultaneously.
Although the above embodiment describes the case where, when a packet loss signal is detected, the pre-output signal is windowed and frame-overlapped before being output, in fact, when the voice signal is a normal signal, the actual output signal is also updated by windowing frame overlapping in order to improve the continuity of the voice signal output. Consequently, the correlation frequency domain energy spectrum is actually the frequency domain energy spectrum (i.e., the fourth signal) of the pre-output signal corresponding to the previous voice signal (which is the third signal when that voice signal is a normal signal), regardless of whether the previous voice signal is a packet loss signal. A rough sketch of such windowed frame overlapping follows.
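Purely as an illustrative sketch, the windowing frame overlap of steps S800/S900 can be pictured as a cross-fade at the frame boundary between the stored frame overlap signal and the current pre-output signal. The raised-cosine fade, the fixed maximum overlap length and the buffer handling below are assumptions; the patent only specifies that a frame overlap signal with initial value 0 is used and that the window length is either the first or the second window length.

```python
import numpy as np

MAX_OVERLAP = 128   # second window length from the example in the description

def overlap_and_output(pre_output, frame_overlap, window_len):
    """S800: window and frame-overlap the pre-output signal with the stored
    frame-overlap signal; S900: update the frame-overlap signal.
    `frame_overlap` holds the last MAX_OVERLAP samples of the previous
    pre-output signal and starts as all zeros."""
    fade_in = 0.5 - 0.5 * np.cos(np.pi * np.arange(window_len) / window_len)  # assumed window shape
    actual = np.array(pre_output, dtype=float)
    # Cross-fade the head of the current frame with the tail kept from the previous frame.
    actual[:window_len] = (fade_in * actual[:window_len]
                           + (1.0 - fade_in) * frame_overlap[-window_len:])
    new_overlap = np.array(pre_output[-MAX_OVERLAP:], dtype=float)            # S900
    return actual, new_overlap
```

A larger window_len (the second window length) smooths the frame boundary more strongly at the cost of blurring detail, which is why it is reserved for the low-similarity case described above.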
Referring to fig. 6, one of the curves is the waveform of a voice signal containing packet loss data, and the solid curve is the waveform of the actual output signal obtained by processing that voice signal with the voice processing method of the present invention.
In addition, although the above only describes updating the correlation frequency domain energy spectrum for packet loss signals, in practical applications the voice signal is often an accurate signal; in this case, if the voice signal contains human voice, the correlation frequency domain energy spectrum is also updated, except that the frequency domain energy spectrum of the voice signal itself is used directly for the update.
The present invention also provides a processing system for a voice signal containing packet loss data, which can be used to execute the above processing method; the processing method, however, is not limited to being executed by this processing system. As shown in fig. 7, the processing system includes:
an obtaining module 10, configured to obtain a voice signal containing packet loss data and a packet loss judgment signal corresponding to the voice signal; the packet loss judgment signal is a sequence formed by error marks of all audio data packets in the voice signal, wherein, if the data of an audio data packet is packet loss data, the error mark corresponding to that data is 0, otherwise the error mark is 1;
the correction module 20 is configured to perform Fourier transform on the packet loss judgment signal, solve an energy spectrum to obtain a packet loss energy spectrum, and generate a modulation signal according to the packet loss energy spectrum; to convert the voice signal into a voice frequency domain signal and solve an energy spectrum to obtain a voice energy spectrum; and then to select the L peak values with the largest energy in the voice energy spectrum, take each of these peak values as a main peak, select a plurality of secondary peaks symmetric on the left and right sides of the main peak in the voice energy spectrum to generate a modulation energy spectrum, generate a plurality of gain coefficients corresponding to each modulation energy spectrum through the modulation signal and each modulation energy spectrum, and cyclically perform multiple rounds of modulation correction processing on the voice frequency domain signal using the plurality of gain coefficients to obtain a correction signal; the position of each peak value in a modulation energy spectrum is the position of that peak value in the voice energy spectrum, and the number of secondary peaks on each side of the main peak in each modulation energy spectrum is the smaller of the number of secondary peaks on the left side and the number of secondary peaks on the right side of that main peak in the voice energy spectrum (a sketch of the modulation-signal generation is given after this list);
an output module 30, configured to perform inverse fourier transform on the modified signal to obtain a pre-output signal;
the acquiring module 10 is in signal connection with the correcting module 20, and the correcting module 20 is in signal connection with the output module 30.
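To make the role of the correction module's modulation step more concrete, the sketch below generates a modulation signal from the packet loss judgment signal along the lines described above. How many side lobes are kept, how the normalization is applied and where the main lobe is assumed to sit are all illustrative assumptions on points the text leaves open.

```python
import numpy as np

def modulation_signal(loss_flags, n):
    """Sketch of the modulation step: FFT of the packet loss judgment signal
    (0 = lost packet, 1 = intact), energy spectrum, then a modulation signal
    delta of length n (roughly 0.2 x the voice-energy-spectrum length per claim 5)."""
    spectrum = np.abs(np.fft.rfft(loss_flags)) ** 2   # packet loss energy spectrum
    delta = spectrum[:n].copy()                       # main lobe plus adjacent side lobes (assumed at the low end)
    main = float(delta.max())
    if main > 0.0:
        delta /= main                                 # normalize to the main-lobe amplitude
        delta[int(np.argmax(delta))] = 1.0 / main     # replace the main-lobe amplitude by its reciprocal
    return delta
```

In a real implementation the lobe selection would be tied to the actual packet structure; the sketch only illustrates the order of operations: Fourier transform, energy spectrum, lobe selection, normalization, and reciprocal of the main lobe.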
The processing system takes into account that the voice signal is periodic over a period of time and contains a large amount of redundant information, so it can correct a voice signal containing packet loss data by making full use of the information already present in the voice signal, without adding extra coding information. Because the indication of which data packets are lost accompanies the voice signal and no extra redundant information needs to be added, the transmission does not increase the bandwidth burden; at the same time, the processing system does not require repeated retransmission of lost packet data, which better preserves the real-time performance of the voice signal.
In practical operation, the processing system may apply for one or more buffer units for each module. In a preferred embodiment, the obtaining module 10 includes an input buffer unit configured to store the obtained voice signal and the packet loss judgment signal. The correction module 20 includes a modulation unit and a correction unit: the modulation unit is configured to perform Fourier transform on the packet loss judgment signal and solve an energy spectrum to obtain a packet loss energy spectrum, so as to generate the modulation signal; the correction unit is configured to convert the voice signal into a voice frequency domain signal and solve an energy spectrum to obtain a voice energy spectrum, then select the L peak values with the largest energy in the voice energy spectrum, take each of these peak values as a main peak, select a plurality of secondary peaks symmetric on the left and right sides of the main peak in the voice energy spectrum to generate a modulation energy spectrum, generate a plurality of gain coefficients corresponding to each modulation energy spectrum through the modulation signal and each modulation energy spectrum, and cyclically perform multiple rounds of modulation correction processing on the voice frequency domain signal using the plurality of gain coefficients to obtain the correction signal. The input buffer unit and the correction unit share one buffer space; that is, the first signal and the correction signal occupy the same space, so that after each correction the correction signal directly overwrites the original first signal (a sketch of this in-place correction is given below). This improves the utilization of the buffer space and reduces the occupation of system resources. Of course, the input buffer unit and the correction unit may also each apply for a separate buffer space.
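As a sketch only, the in-place correction in the shared buffer might look like the following. The function and variable names are illustrative, the small epsilon guard is an addition, and the gain rules follow the reading of steps S530–S560 given in claim 4 below.

```python
def modulate_in_place(freq_signal, amp, delta, peak_positions):
    """Correction unit writing back into the shared input buffer: the
    correction signal directly overwrites the first signal in `freq_signal`."""
    n = len(delta)
    eps = 1e-12                                        # added guard for empty spectrum bins
    for ki in peak_positions:                          # positions Ki of the L largest peaks
        freq_signal[ki] *= delta[0]                    # j = 0: scale the main-peak bin
        for j in range(1, n):                          # j = 1 .. n-1: symmetric side bins
            for pos in (ki + j, ki - j):
                if 0 <= pos < len(freq_signal):        # assumption: bins outside the spectrum are skipped
                    rate = 1.0 - delta[j] * amp[ki] / max(amp[pos], eps)
                    freq_signal[pos] = 0.0 if rate < 0 else freq_signal[pos] * rate
    return freq_signal                                 # same object is returned: no extra buffer is allocated
```

Calling this with the buffered voice frequency domain signal and receiving the same array back mirrors the shared buffer space described above; allocating a separate output array instead corresponds to the variant in which the input buffer unit and the correction unit each apply for their own buffer space.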
The voice signal in the processing method and the processing system can be a voice signal obtained through Bluetooth or a voice signal obtained through wifi.
The invention also provides an audio processing chip, which can execute the processing method of the voice signal containing the packet loss data in any embodiment.
The invention also provides an electronic device comprising the above audio processing chip. When the voice signal is obtained through Bluetooth, the electronic device may be a mobile phone, a Bluetooth microphone, a Bluetooth sound box or a Bluetooth earphone, i.e., the electronic device is a Bluetooth device. In practical use, two such electronic devices can transmit data to each other, for example a mobile phone and a Bluetooth earphone, and each of the two devices may be provided with the audio processing chip. Of course, the electronic devices may also all be wireless devices.
Furthermore, the present invention also provides a computer-readable storage medium, such as an optical disc, a USB flash drive, a hard disk, a flash memory or other types of storage media, on which a computer program is stored; when executed, the computer program implements the processing method of the voice signal according to any of the above embodiments. The computer program may present a demo visual dialog box when executed, or may simply be an executable exe file.
It will be appreciated by those skilled in the art that the above-described preferred embodiments may be freely combined and superimposed with one another provided there is no conflict.
It will be understood that the embodiments described above are illustrative only and not restrictive, and that various obvious and equivalent modifications and substitutions for details described herein may be made by those skilled in the art without departing from the basic principles of the invention.

Claims (14)

1. A method for processing a voice signal containing packet loss data, comprising the steps of:
s100: acquiring a voice signal containing packet loss data and a packet loss judgment signal corresponding to the voice signal, wherein the packet loss judgment signal is a sequence formed by error marks of all audio data packets in the voice signal; if the data in an audio data packet is packet loss data, the error mark corresponding to that data is 0, otherwise the error mark is 1;
s300: performing Fourier transform on the packet loss judgment signal, solving an energy spectrum to obtain a packet loss energy spectrum, and generating a modulation signal through the packet loss energy spectrum;
s400: converting the voice signal into a voice frequency domain signal, and solving an energy spectrum to obtain a voice energy spectrum;
s500: selecting the L peak values with the largest energy in the voice energy spectrum, taking each of these peak values as a main peak, selecting a plurality of secondary peaks symmetric on the left and right sides of the main peak in the voice energy spectrum to generate modulation energy spectrums, then generating gain coefficients corresponding to the modulation energy spectrums through the modulation signal and the modulation energy spectrums, and performing multiple rounds of modulation correction processing on the voice frequency domain signal by using the gain coefficients to obtain a correction signal; wherein the position of each peak value in a modulation energy spectrum is the position of that peak value in the voice energy spectrum, and the number of secondary peaks on each side of the main peak in each modulation energy spectrum is the smaller of the number of secondary peaks on the left side and the number of secondary peaks on the right side of that main peak in the voice energy spectrum;
s700: performing an inverse Fourier transform on the correction signal to obtain a pre-output signal.
2. The processing method according to claim 1, wherein between the steps S100 and S300, further comprising the steps of:
s200: judging whether the voice signal contains human voice; if yes, executing S300; if not, executing S600;
s600: generating a correction signal using the background sound estimation frequency domain energy spectrum, and then performing S700; the background sound estimation frequency domain energy spectrum is a frequency domain energy spectrum generated by a voice signal which is closest to the current voice signal and does not contain packet loss data.
3. The processing method according to claim 1, wherein the step S300 comprises:
s310: performing Fourier transform on the packet loss judgment signal, and solving an energy spectrum to obtain a packet loss energy spectrum;
s320: selecting the main lobe and part of the side lobes adjacent to the main lobe in the packet loss energy spectrum, normalizing the main lobe and the part of the side lobes, and replacing the amplitude of the main lobe by its reciprocal, so as to generate the modulation signal delta.
4. The processing method according to claim 3, wherein the step S500 comprises:
s510: recording the voice energy spectrum as AMP and the L peak values with the largest energy as AMP[Ki]_i; at each AMP[Ki]_i, selecting a plurality of secondary peaks on the left and right sides of the main peak in the voice energy spectrum to generate a modulation energy spectrum, wherein AMP[Ki]_i represents the i-th peak in the voice energy spectrum, located at the Ki-th position of the voice energy spectrum; Ki = 0, 1, 2, …, n−1; i = 1, 2, 3, …, L; n is the length of the modulation signal delta; and L is greater than or equal to 4 and less than or equal to 6;
s520: setting i to 1;
s530: setting j to 0, and multiplying the value at the [Ki+j]-th position of the voice frequency domain signal by delta[j] as the new value at the [Ki+j]-th position of the voice frequency domain signal; wherein delta[j] refers to the value at the j-th position of the modulation signal delta;
s540: updating j to j + 1;
determining whether rate1 = 1 − delta[j] × AMP[Ki]_i / AMP[Ki+j] is less than 0; if yes, setting the value at the [Ki+j]-th position of the first signal to 0; if not, multiplying the value at the [Ki+j]-th position of the voice frequency domain signal by rate1 as the value at the [Ki+j]-th position of the first signal;
determining whether rate2 = 1 − delta[j] × AMP[Ki]_i / AMP[Ki−j] is less than 0; if yes, setting the value at the [Ki−j]-th position of the first signal to 0; if not, multiplying the value at the [Ki−j]-th position of the voice frequency domain signal by rate2 as the value at the [Ki−j]-th position of the voice frequency domain signal;
s550: judging whether j is smaller than n, if so, returning to S540; if not, executing S560;
s560: judging whether i is smaller than L + 1; if so, setting i to i + 1 and returning to S530; if not, executing S570;
s570: taking the corrected voice frequency domain signal as a correction signal;
wherein, for each value of i, delta[0] together with each of the rate1 and rate2 values constitutes the gain coefficients corresponding to AMP[Ki]_i.
5. The processing method according to claim 4, wherein the length n of the modulation signal delta is 0.2 times the length of the voice energy spectrum.
6. The processing method according to any one of claims 1 to 5, characterized in that said step S700 is followed by further comprising:
s800: performing windowing frame overlapping on the pre-output signal by using a frame overlap signal to obtain an actual output signal;
s900: updating the frame overlap signal using the pre-output signal;
wherein the initial value of the frame overlapping signal is 0.
7. The processing method according to claim 6, wherein the step S700 is preceded by:
s580: calculating the correlation degree between the frequency domain energy spectrum of the correction signal and the correlation frequency domain energy spectrum;
s590: judging whether the correlation degree is smaller than a preset correlation threshold value; if so, setting the frame overlapping window length as a first window length; otherwise, setting the frame overlapping window length as a second window length, wherein the second window length is larger than the first window length;
the step S800 specifically includes:
windowing frame overlapping is carried out on the pre-output signal by using the frame overlapping signal according to the frame overlapping window length to obtain an actual output signal, and the correlation frequency domain energy spectrum is updated by using the frequency domain energy spectrum of the correction signal;
wherein the initial value of the correlation frequency domain energy spectrum is 0.
8. The processing method according to claim 7, wherein the step S580 comprises:
s581: selecting u peak values with the maximum amplitude from the correlation frequency domain energy spectrum, and recording corresponding positions of the u peak values, wherein the positions form a first array X;
s582: solving an energy spectrum of the correction signal, recording the energy spectrum as a correction energy spectrum, selecting w peak values with the maximum amplitude in the correction energy spectrum, and recording corresponding positions of the w peak values, wherein the positions form a second array Y;
s583: for each X[m] in the first array X, finding the value closest to it in the second array Y, denoted as Y[v]; then Δ[m] = |X[m] − Y[v]|, where m = 1, 2, …, u;
s584: the correlation is the sum of all Δ [ m ];
wherein u is less than w.
9. The processing method according to any of claims 1 to 8, characterized in that the voice signal is a voice signal obtained by Bluetooth.
10. A system for processing a voice signal containing lost packet data, comprising:
the acquisition module is used for acquiring a voice signal containing packet loss data and a packet loss judgment signal corresponding to the voice signal; the packet loss judgment signal is a sequence formed by error marks of all audio data packets in the voice signal, wherein if the data of the audio data packets is packet loss data, the error mark corresponding to the data is 0, otherwise, the error mark is 1;
the correction module is used for performing Fourier transform on the packet loss judgment signal, solving an energy spectrum to obtain a packet loss energy spectrum, and generating a modulation signal through the packet loss energy spectrum; converting the voice signal into a voice frequency domain signal, and solving an energy spectrum to obtain a voice energy spectrum; then selecting the L peak values with the largest energy in the voice energy spectrum, taking each of these peak values as a main peak, selecting a plurality of secondary peaks symmetric on the left and right sides of the main peak in the voice energy spectrum to generate a modulation energy spectrum, then generating a plurality of gain coefficients corresponding to each modulation energy spectrum through the modulation signal and each modulation energy spectrum, and cyclically performing multiple rounds of modulation correction processing on the voice frequency domain signal by using the plurality of gain coefficients to obtain a correction signal; wherein the position of each peak value in a modulation energy spectrum is the position of that peak value in the voice energy spectrum, and the number of secondary peaks on each side of the main peak in each modulation energy spectrum is the smaller of the number of secondary peaks on the left side and the number of secondary peaks on the right side of that main peak in the voice energy spectrum;
the output module is used for performing an inverse Fourier transform on the correction signal to obtain a pre-output signal;
the acquisition module is in signal connection with the correction module, and the correction module is in signal connection with the output module.
11. The processing system of claim 10,
the acquisition module comprises an input cache unit, and the input cache unit is used for storing the acquired voice signal and the packet loss judgment signal;
the correction module comprises a modulation unit and a correction unit, wherein the modulation unit is used for carrying out Fourier transform on the packet loss judgment signal, solving an energy spectrum to obtain a packet loss energy spectrum, and generating the modulation signal through the packet loss energy spectrum; the correction unit is used for converting the voice signal into a voice frequency domain signal and solving an energy spectrum to obtain a voice energy spectrum; then selecting the L peak values with the largest energy in the voice energy spectrum, taking each peak value with the largest energy as a main peak, selecting a plurality of secondary peaks which are symmetrical on the left side and the right side of the main peak in the voice energy spectrum to generate a modulation energy spectrum, then generating a plurality of gain coefficients corresponding to each modulation energy spectrum through the modulation signal and each modulation energy spectrum, and circularly performing a plurality of times of modulation correction processing on the voice frequency domain signal by using the plurality of gain coefficients to obtain the correction signal;
the input buffer unit and the correction unit share one buffer space.
12. An audio processing chip, wherein the audio processing chip is capable of executing the method for processing a voice signal containing packet loss data according to any one of claims 1 to 9.
13. An electronic device, comprising the audio processing chip of claim 12, wherein the voice signal processed by the audio processing chip is a voice signal obtained by bluetooth; the electronic equipment is a mobile phone, a Bluetooth microphone, a Bluetooth sound box or a Bluetooth earphone.
14. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed, implements the method for processing a voice signal containing packet loss data according to any one of claims 1 to 9.
CN202010271015.7A 2020-04-08 2020-04-08 Voice signal processing method and system, audio processing chip and electronic equipment Active CN111883170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010271015.7A CN111883170B (en) 2020-04-08 2020-04-08 Voice signal processing method and system, audio processing chip and electronic equipment


Publications (2)

Publication Number Publication Date
CN111883170A true CN111883170A (en) 2020-11-03
CN111883170B CN111883170B (en) 2023-09-08

Family

ID=73153985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010271015.7A Active CN111883170B (en) 2020-04-08 2020-04-08 Voice signal processing method and system, audio processing chip and electronic equipment

Country Status (1)

Country Link
CN (1) CN111883170B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080046236A1 (en) * 2006-08-15 2008-02-21 Broadcom Corporation Constrained and Controlled Decoding After Packet Loss
CN104347076A (en) * 2013-08-09 2015-02-11 中国电信股份有限公司 Network audio packet loss concealment method and device
EP3023983A1 (en) * 2014-11-21 2016-05-25 AKG Acoustics GmbH Method of packet loss concealment in ADPCM codec and ADPCM decoder with PLC circuit
EP3454336A1 (en) * 2017-09-12 2019-03-13 Dolby Laboratories Licensing Corp. Packet loss concealment for critically-sampled filter bank-based codecs using multi-sinusoidal detection
CN110400569A (en) * 2018-04-24 2019-11-01 安凯(广州)微电子技术有限公司 Bluetooth audio frequency restorative procedure and terminal device
CN109167650A (en) * 2018-10-18 2019-01-08 珠海市杰理科技股份有限公司 Bluetooth receiver and bluetooth encode frame detection method
CN111883171A (en) * 2020-04-08 2020-11-03 珠海市杰理科技股份有限公司 Audio signal processing method and system, audio processing chip and Bluetooth device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PREDRAM S K: "Audio packet loss concealment using spectral motion", ICASSP 2014 *
LIU CHENDONG: "Research on waveform-domain packet loss resistance technology in voice communication", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117061039A (en) * 2023-10-09 2023-11-14 成都思为交互科技有限公司 Broadcast signal monitoring device, method, system, equipment and medium
CN117061039B (en) * 2023-10-09 2024-01-19 成都思为交互科技有限公司 Broadcast signal monitoring device, method, system, equipment and medium

Also Published As

Publication number Publication date
CN111883170B (en) 2023-09-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 519075 No. 333, Kexing Road, Xiangzhou District, Zhuhai City, Guangdong Province
Applicant after: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.
Address before: Floor 1-107, building 904, ShiJiHua Road, Zhuhai City, Guangdong Province
Applicant before: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.
GR01 Patent grant