CN116259322A - Audio data compression method and related products

Info

Publication number: CN116259322A
Authority: CN (China)
Prior art keywords: data, voice, sub, compression, audio
Legal status: Pending
Application number: CN202111510547.2A
Other languages: Chinese (zh)
Inventor: 梁俊斌 (Liang Junbin)
Current/Original Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.; priority application: CN202111510547.2A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis

Abstract

The application relates to the technical field of audio and video, and in particular to an audio data compression method, an audio data compression apparatus, a computer-readable medium, an electronic device, and a computer program product. The method comprises the following steps: acquiring a target compression amount for data compression of audio data, the target compression amount being the difference in the data amount of the audio data before and after compression; classifying the audio data to obtain at least two categories of audio sub-data; allocating a target compression ratio to each of the at least two categories of audio sub-data according to the target compression amount, the target compression ratio being the ratio of the compression amount of the audio sub-data to its data amount before compression; and compressing the audio sub-data according to the target compression ratios. The method can apply differentiated data compression to different categories of audio sub-data, improving the playback quality of the compressed audio.

Description

Audio data compression method and related products
Technical Field
The application relates to the technical field of audio and video, and in particular to an audio data compression method, an audio data compression apparatus, a computer-readable medium, an electronic device, and a computer program product.
Background
In business applications such as audio/video calls and live streaming, sound signals are collected at the sender terminal, compressed and encoded, transmitted or distributed over a network to the receiver terminal, and finally decoded and played there. The sender can normally ensure that the speech-coded data packets are sent smoothly and uniformly, but because of unpredictable jitter in the transmission network, the arrival times of the packets at the receiver terminal are unstable: sometimes no packet arrives for a long time, and sometimes a very large number of packets arrive within a short time, so that directly playing the packets produces intermittent sound. When the receiver gets a large number of packets in a short time, the playback channel is easily congested and the buffer may even overflow, causing playback delay, lag, and sound interruption.
Disclosure of Invention
An object of the present application is to provide an audio data compression method, an audio data compression apparatus, a computer-readable medium, an electronic device, and a computer program product, which overcome, at least to some extent, the problem of poor audio playback stability in the related art.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned in part by the practice of the application.
According to an aspect of the embodiments of the present application, there is provided an audio data compression method, including:
acquiring a target compression amount for data compression of audio data, wherein the target compression amount is the difference in the data amount of the audio data before and after compression;
classifying the audio data to obtain at least two categories of audio sub-data;
allocating a target compression ratio to each of the at least two categories of audio sub-data according to the target compression amount, wherein the target compression ratio is the ratio of the compression amount of the audio sub-data to its data amount before compression;
and compressing the audio sub-data according to the target compression ratio.
According to an aspect of the embodiments of the present application, there is provided an audio data compression apparatus, the apparatus including:
an acquisition module configured to acquire a target compression amount for data compression of audio data, the target compression amount being the difference in the data amount of the audio data before and after compression;
a classification module configured to classify the audio data to obtain at least two categories of audio sub-data;
an allocation module configured to allocate a target compression ratio to each of the at least two categories of audio sub-data according to the target compression amount, the target compression ratio being the ratio of the compression amount of the audio sub-data to its data amount before compression;
and a compression module configured to compress the audio sub-data according to the target compression ratio.
According to an aspect of the embodiments of the present application, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements an audio data compression method as in the above technical solution.
According to an aspect of the embodiments of the present application, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the audio data compression method as in the above technical solution via execution of the executable instructions.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the audio data compression method as in the above technical solution.
In the technical solution provided by the embodiments of the application, the audio data is classified into different categories of audio sub-data, and each category is assigned its own compression ratio. Different categories of audio sub-data can thus be compressed differentially, so that each category is adaptively compressed and played back at an appropriate speed, improving the playback quality of the compressed audio.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which the technical solutions of the embodiments of the present application may be applied.
Fig. 2 shows the placement of an audio-video encoding device and an audio-video decoding device in a streaming environment.
Fig. 3 illustrates a related-art method that alleviates playback-data congestion through compressed playback.
Fig. 4 shows a flow chart of the steps of a method of audio data compression in one embodiment of the present application.
Fig. 5 shows the effect of classifying audio data according to whether it carries voice content, in an embodiment of the present application.
FIG. 6 is a flowchart illustrating steps for speech rate estimation for speech sub-data in one embodiment of the present application.
FIG. 7 is a flowchart illustrating steps for assigning target compression ratios to speech sub-data and non-speech sub-data in one embodiment of the present application.
FIG. 8 is a flowchart illustrating steps for assigning target compression ratios to speech segments having different speech rate levels in one embodiment of the present application.
Fig. 9 shows a block diagram of an audio data compression apparatus according to an embodiment of the present application.
Fig. 10 shows a block diagram of a computer system suitable for use in implementing embodiments of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present application. One skilled in the relevant art will recognize, however, that the aspects of the application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
The specific embodiments of the present application involve user-related data such as voice and video. When the various embodiments of the application are applied to specific products or technologies, user permission or consent is required, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
It should be noted that references herein to "a plurality" mean two or more. "And/or" describes an association between objects and covers three cases: for example, "A and/or B" may mean that A exists alone, that A and B both exist, or that B exists alone. The character "/" generally indicates that the associated objects are in an "or" relationship.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which the technical solutions of the embodiments of the present application may be applied.
As shown in fig. 1, the system architecture 100 includes a plurality of terminal devices that can communicate with each other through, for example, a network 150. For example, the system architecture 100 may include a first terminal device 110 and a second terminal device 120 interconnected by a network 150. In the embodiment of fig. 1, the first terminal apparatus 110 and the second terminal apparatus 120 perform unidirectional data transmission.
For example, the first terminal device 110 may encode audio-video data (e.g., an audio-video data stream collected by the terminal device 110) for transmission over the network 150 to the second terminal device 120. The encoded audio-video data is transmitted as one or more encoded audio-video code streams. The second terminal device 120 may receive the encoded audio-video data from the network 150, decode it to recover the audio-video data, and play or display content according to the recovered data.
In one embodiment of the present application, the system architecture 100 may include a third terminal device 130 and a fourth terminal device 140 that perform bi-directional transmission of encoded audiovisual data, such as may occur during an audiovisual conference. For bi-directional data transmission, each of the third terminal device 130 and the fourth terminal device 140 may encode audio-video data (e.g., an audio-video data stream collected by the terminal device) for transmission to the other of the third terminal device 130 and the fourth terminal device 140 over the network 150. Each of the third terminal apparatus 130 and the fourth terminal apparatus 140 may also receive encoded audio-video data transmitted by the other of the third terminal apparatus 130 and the fourth terminal apparatus 140, and may decode the encoded audio-video data to restore the audio-video data, and play or display the content according to the restored audio-video data.
In the embodiment of fig. 1, the first, second, third and fourth terminal apparatuses 110, 120, 130 and 140 may be servers, personal computers and smart phones, but the principles disclosed herein may not be limited thereto. Embodiments disclosed herein are applicable to laptop computers, tablet computers, media players, and/or dedicated audio video conferencing devices. Network 150 represents any number of networks that transfer encoded audio-video data between first terminal device 110, second terminal device 120, third terminal device 130, and fourth terminal device 140, including, for example, wired and/or wireless communication networks. The communication network 150 may exchange data in circuit-switched and/or packet-switched channels. The network may include a telecommunications network, a local area network, a wide area network, and/or the internet. For the purposes of this application, the architecture and topology of network 150 may be irrelevant to the operation disclosed herein, unless explained below.
In one embodiment of the present application, fig. 2 illustrates the placement of an audio-video encoding device and an audio-video decoding device in a streaming environment. The subject matter disclosed herein is equally applicable to other audio-video enabled applications including, for example, audio-video conferencing, digital TV (television), storing compressed audio-video on digital media including CDs, DVDs, memory sticks, etc.
The streaming system may include a collection subsystem 213, which may include an audio-video source 201 such as a microphone or camera that creates an uncompressed audio-video data stream 202. The audio-video data stream 202 is depicted as a bold line to emphasize its high data volume compared to the encoded audio-video data 204 (or encoded audio-video code stream 204). The stream 202 is processed by the electronic device 220, which comprises the audio-video encoding device 203 coupled to the audio-video source 201. The audio-video encoding device 203 may include hardware, software, or a combination of both to implement or embody aspects of the disclosed subject matter as described in more detail below. The encoded audio-video data 204 (or encoded audio-video code stream 204) is depicted as a thin line to emphasize its lower data volume, and may be stored on the streaming server 205 for future use. One or more streaming client subsystems, such as client subsystem 206 and client subsystem 208 in fig. 2, may access the streaming server 205 to retrieve copies 207 and 209 of the encoded audio-video data 204. The client subsystem 206 may include an audio-video decoding device 210, for example in an electronic device 230. The audio-video decoding device 210 decodes the incoming copy 207 of the encoded audio-video data and produces an output audio-video data stream 211 that can be presented on an output 212 (e.g., speaker, display) or another presentation device. In some streaming systems, the encoded audio-video data 204, 207, and 209 (e.g., audio-video code streams) may be encoded according to some audio-video encoding/compression standard.
It should be noted that electronic device 220 and electronic device 230 may include other components not shown in the figures. For example, electronic device 220 may include an audio-video decoding device, and electronic device 230 may also include an audio-video encoding device.
Under normal conditions, the audio/video data sender can send the encoded data packets to the receiver smoothly and uniformly. When the transmission network jitters, however, the times at which the receiver obtains the packets become unstable, so that playback suffers from delay, lag, intermittent sound, and similar phenomena. To address this, the related art generally uses data buffering: the receiver temporarily stores the received audio and video data in a data buffer to reduce the influence of network jitter. Once playback data congestion occurs in the buffer, a compressed-playback strategy can be adopted to reduce the playback delay and lag.
Fig. 3 illustrates a related art method for alleviating the problem of congestion of play data by using data compression play. As shown in fig. 3, the method for compressing and playing the received audio data at the data receiving end includes the following steps.
Step S301: the compressed data packets received from the network are decoded to obtain the sound signal.
Step S302: the decoded sound signal is stored in a play buffer. The sound signal obtained by decoding could be played and output directly; however, to avoid the poor playback stability caused by network jitter, it is first stored in a play buffer, while data is read from the buffer and played in the order in which it was stored.
Step S303: it is judged whether the data stored in the play buffer is congested. Under normal network transmission, the write and read speeds of the play buffer are roughly equal, so the buffer stays in a dynamically balanced storage state. Under severe network jitter, the data-receiving speed of the play buffer may exceed the playback speed, and if the to-be-played data accumulated in the buffer exceeds the expected amount of buffered data, congestion occurs. If congestion occurs, step S304 is performed; otherwise, step S305 is performed.
Step S304: a data compression ratio is calculated according to the congestion degree of the play buffer, and the buffered data is played at variable speed according to that ratio. After the buffered data is compressed, its playback duration is shorter than the original duration, so the buffered data is consumed quickly and the congestion of the play buffer is relieved.
Step S305: when no congestion occurs, the buffered data is played and output at the normal playback speed.
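The related-art flow above amounts to a small amount of buffer bookkeeping. The following Python sketch illustrates steps S301 to S305; the class name, the fixed expected depth, and the cap on the compression ratio are illustrative assumptions rather than details from the patent.

```python
from collections import deque

# A minimal sketch of the related-art play-buffer logic in Fig. 3,
# assuming frame-based buffering; names and policy details are illustrative.

class PlayBuffer:
    def __init__(self, expected_frames=10, max_ratio=0.5):
        self.frames = deque()
        self.expected_frames = expected_frames  # desired steady-state depth
        self.max_ratio = max_ratio              # cap on the compression ratio

    def write(self, frame):
        # S301/S302: decoded frames are appended in arrival order
        self.frames.append(frame)

    def read(self):
        # S303: congestion check against the expected depth
        backlog = len(self.frames) - self.expected_frames
        if backlog > 0:
            # S304: the compression ratio grows with the congestion degree
            ratio = min(backlog / len(self.frames), self.max_ratio)
            return self.frames.popleft(), ratio  # caller plays at variable speed
        # S305: no congestion, play at the normal speed (ratio 0)
        return (self.frames.popleft() if self.frames else None), 0.0
```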
Based on the above compressed-playback scheme, the buffered sound signals in the play buffer can be played at variable speed through data compression, alleviating data congestion in the buffer. However, compressing and time-scaling the sound signal is likely to make it sound unnatural: for example, the sound may suddenly accelerate, or sound robotic or unintelligible after acceleration.
To address the poor sound quality after compression and speed change in the related art, the embodiments of the present application analyze the content of the buffered data in the play buffer. By classifying the buffered audio data, different compression ratios can be configured for different categories of audio data, so that each category is accelerated with its own compression ratio, optimizing the playback quality after compression and speed change.
The audio data compression method, audio data compression apparatus, computer-readable medium, electronic device, and computer program product provided in the present application are described in detail below with reference to specific embodiments.
Fig. 4 shows a flowchart of steps of a method for compressing audio data in an embodiment of the present application, which may be performed by the terminal device shown in fig. 1 or the electronic device shown in fig. 2. As shown in fig. 4, the audio data compression method in the embodiment of the present application may mainly include the following steps S410 to S440.
Step S410: a target compression amount for data compression of audio data is acquired, the target compression amount being the difference in the data amount of the audio data before and after compression.
Step S420: the audio data is classified to obtain at least two categories of audio sub-data.
Step S430: a target compression ratio is allocated to each of the at least two categories of audio sub-data according to the target compression amount, the target compression ratio being the ratio of the compression amount of the audio sub-data to its data amount before compression.
Step S440: the audio sub-data is compressed according to the target compression ratio.
In the audio data compression method provided by the embodiments of the application, the audio data is classified into different categories of audio sub-data, and each category is assigned its own compression ratio. Different categories of audio sub-data can thus be compressed differentially, so that each category is adaptively compressed and played back at an appropriate speed, improving the playback quality of the compressed audio.
The following describes in detail each method step in the audio data compression method.
In step S410, a target compression amount for data compression of audio data, which is a difference in data amounts of the audio data before and after compression, is acquired.
In one embodiment of the present application, the target compression amount for data compression of the audio data may be determined from the amount of data stored in the audio buffer in real time. For example, the target compression amount may be positively correlated with the real-time stored data amount of the audio buffer: the larger the stored amount, the larger the target compression amount.
In one embodiment of the present application, the audio data in the audio buffer may be divided into data frames at fixed time intervals, for example one data frame per 20 ms of audio. The method monitors the real-time storage frame number of the audio data stored in the audio buffer, acquires the expected storage frame number of the audio buffer, and, when the real-time storage frame number is greater than the expected storage frame number, determines the target compression amount for data compression of the audio data according to the difference between the two.
Data packets arriving from the sender are decoded and written into the audio buffer to await playback, while audio data that has been played is removed from the buffer, so the amount of audio data stored in the buffer changes dynamically. By monitoring the real-time storage frame number of the audio data in the buffer and comparing it with the expected storage frame number, the target compression amount for data compression can be predicted. When the audio data is compressed according to the target compression amount, the real-time storage frame number fluctuates around the expected storage frame number.
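As an illustration of step S410, the sketch below derives the target compression amount from the two frame counts; the 20 ms frame length comes from the text, while the function and variable names are assumptions.

```python
# A minimal sketch of step S410, assuming 20 ms frames.

FRAME_MS = 20  # each data frame covers 20 ms of audio

def target_compression_amount(realtime_frames: int, expected_frames: int) -> int:
    """Return the target compression amount in frames.

    The amount is positively correlated with the backlog: when the
    real-time stored frame count exceeds the expected count, the excess
    is the number of frames the compression should remove."""
    return max(realtime_frames - expected_frames, 0)

# e.g. 60 buffered frames vs. 25 expected -> compress away 35 frames (700 ms)
assert target_compression_amount(60, 25) == 35
```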
In step S420, the audio data is classified to obtain at least two kinds of audio sub-data.
Fig. 5 shows the effect of classifying audio data according to whether it carries voice content, in an embodiment of the present application. As shown in fig. 5, voice activity detection is performed on each data frame in the audio data to determine whether it is a voice frame 501 or a non-voice frame 502; consecutively distributed voice frames 501 in the audio data are marked as voice sub-data 503 carrying voice content, and consecutively distributed non-voice frames 502 are marked as non-voice sub-data 504 not carrying voice content.
Voice activity detection (VAD) distinguishes speech regions from non-speech regions: by extracting features from each audio data frame, it predicts whether the frame is a voice frame carrying speech content or a non-voice frame that carries none. Non-voice frames may be silence frames or noise frames.
VAD algorithms fall into several families, including threshold-based, classifier-based, and model-based VAD. Threshold-based VAD distinguishes speech from non-speech by extracting time-domain features (short-time energy, short-time zero-crossing rate, etc.) or frequency-domain features (MFCC, spectral entropy, etc.) and setting suitable thresholds. Classifier-based VAD treats detection as a speech/non-speech binary classification problem and trains a classifier by machine learning. Model-based VAD uses a complete acoustic model and distinguishes speech segments from non-speech segments by decoding with global information.
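As a concrete illustration of the classification in fig. 5, the following sketch uses the simplest threshold-based VAD (short-time energy) to split a frame sequence into consecutively distributed speech and non-speech runs; the energy threshold is an illustrative assumption, and a real system might use the classifier- or model-based VAD described above.

```python
import numpy as np

# A minimal sketch of frame classification and run grouping, assuming
# 1-D float frames; the threshold value is illustrative.

def vad(frame: np.ndarray, energy_threshold: float = 1e-3) -> bool:
    """Return True for a voice frame, False for a non-voice frame."""
    return float(np.mean(frame ** 2)) > energy_threshold

def segment(frames: list[np.ndarray]) -> list[tuple[bool, list[np.ndarray]]]:
    """Group consecutively distributed frames of the same kind into
    voice sub-data (True) and non-voice sub-data (False)."""
    segments: list[tuple[bool, list[np.ndarray]]] = []
    for frame in frames:
        is_speech = vad(frame)
        if segments and segments[-1][0] == is_speech:
            segments[-1][1].append(frame)   # extend the current run
        else:
            segments.append((is_speech, [frame]))  # start a new run
    return segments
```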
In one embodiment of the present application, the voice sub-data carrying voice content may be further classified into voice segments with different speech rate levels. The higher the speech rate level, the faster the speech rate of the voice content carried by the voice sub-data.
In one embodiment of the present application, speech rate estimation is performed on the voice sub-data carrying voice content to obtain a speech rate state parameter representing its speech rate, and the voice sub-data is marked as voice segments with different speech rate levels according to that parameter.
In the embodiment of the application, one or more parameter thresholds for dividing the speech rate levels can be preconfigured, the speech rate state parameter obtained by speech rate estimation is compared with these thresholds, and the corresponding speech rate level is determined from the comparison result. For example, three speech rate levels may be defined from fast to slow: high-speech-rate, medium-speech-rate, and low-speech-rate speech signals. A high-speech-rate signal is a voice segment whose speech rate state parameter is greater than a first parameter threshold, a low-speech-rate signal is one whose parameter is less than a second parameter threshold, and a medium-speech-rate signal is one whose parameter lies between the two thresholds, where the second parameter threshold is less than the first.
In one embodiment of the present application, speech rate estimation of the voice sub-data carrying voice content may include: performing pitch detection on the voice sub-data to obtain the pitch period of each data frame in the voice sub-data; and estimating the speech rate of the voice sub-data from the time-domain variation state of the pitch period, where the time-domain variation state represents the variation trend of the pitch period over the time domain.
The pitch is, as the name implies, the basis of sound. Sound signals can be classified into unvoiced and voiced according to how the vocal cords vibrate. Voiced sound requires periodic vibration of the vocal cords and therefore shows marked periodicity; the frequency of this vibration is called the pitch frequency, and the corresponding period is the pitch period. In general, the pitch frequency depends strongly on the structure of an individual's vocal cords, so it can also be used to identify the source of a sound. Estimating the pitch period is called pitch detection, whose ultimate purpose is to find a trajectory that coincides exactly, or as closely as possible, with the vocal cord vibration frequency. The pitch period is one of the important parameters describing the excitation source in speech signal processing, with wide and important applications in speech synthesis, speech compression coding, speech recognition, speaker verification, and other fields.
Pitch detection methods can be broadly divided into three categories: 1) time-domain methods, which estimate the pitch period directly from the speech waveform; common examples are the autocorrelation method, the parallel processing method, the average magnitude difference method, and the data reduction method. 2) Frequency-domain methods, which estimate the pitch period by transforming the speech signal into the frequency domain: homomorphic analysis first removes the influence of the vocal tract to obtain the information belonging to the excitation, from which the pitch period is derived, most commonly by the cepstrum method; its drawback is algorithmic complexity, but its pitch estimation is good. 3) Hybrid methods, which first extract the vocal tract model parameters, filter the signal with them to obtain the sound source sequence, and finally obtain the pitch period by the autocorrelation method or the average magnitude difference method.
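As an illustration of the time-domain family, the sketch below estimates the pitch period by the autocorrelation method; the 16 kHz sampling rate, the 60-400 Hz search range, and the 0.3 peak-strength criterion are illustrative assumptions.

```python
import numpy as np

# A minimal autocorrelation pitch detector, assuming a mono frame of at
# least ~270 samples at 16 kHz (e.g. a 20 ms frame is 320 samples).

def pitch_period(frame: np.ndarray, fs: int = 16000,
                 fmin: float = 60.0, fmax: float = 400.0) -> int:
    """Return the pitch period in samples (0 if no clear peak, e.g. unvoiced)."""
    frame = frame - np.mean(frame)
    lo, hi = int(fs / fmax), int(fs / fmin)   # plausible lag range
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return 0                              # silent frame
    lag = lo + int(np.argmax(ac[lo:hi]))
    # accept the peak only if it is strong relative to the zero-lag energy
    return lag if ac[lag] > 0.3 * ac[0] else 0
```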
FIG. 6 is a flowchart illustrating steps for speech rate estimation of voice sub-data in one embodiment of the present application. As shown in fig. 6, building on the above embodiments, estimating the speech rate of the voice sub-data from the time-domain variation state of the pitch period may include the following steps S610 to S640.
Step S610: the pitch periods of two adjacent data frames are compared in the time domain to obtain the period variation trend and the period variation amplitude of the later frame's pitch period relative to the earlier frame's, where the period variation trend represents a rising, falling, or level trend of the pitch period, and the period variation amplitude represents the difference between the pitch periods of the later and earlier frames.
Step S620: the time-domain variation state of the pitch period between the two adjacent data frames is determined from the period variation trend and the period variation amplitude, where the time-domain variation state comprises at least two of a period-rising state, a period-falling state, and a period-level state.
In one embodiment of the present application, amplitude thresholds associated with the period variation trend are obtained: a first threshold related to the rising trend, which is a positive number, and a second threshold related to the falling trend, which is a negative number. If the period variation amplitude is smaller than the first threshold and larger than the second threshold, the time-domain variation state of the pitch period between the two adjacent frames is marked as the period-level state; if the amplitude is larger than the first threshold, the state is marked as the period-rising state; and if the amplitude is smaller than the second threshold, the state is marked as the period-falling state.
Step S630: the state switching frequency of the time-domain variation state is counted over the time domain, where the state switching frequency represents how many times the time-domain variation state switches from one state to another.
The pitch period variation between each pair of adjacent frames is classified into the three states "up", "flat", and "down", and adjacent frames in the same state are counted. Suppose, for example, that the pitch period states of ten adjacent frames are 0000111122. State "0" is the "up" state, meaning the pitch period of the current frame exceeds that of the previous frame; a threshold can be set for the comparison, and the state is "up" when the increase exceeds it. State "1" is the "flat" state, meaning the pitch period of the current frame equals, or differs only slightly from, that of the previous frame; if the difference is below a set threshold, the state is judged "flat". State "2" is the "down" state, meaning the pitch period of the current frame is smaller than that of the previous frame; if the decrease exceeds a set threshold, the state is judged "down". For 0000111122, the statistics are an "up" run of length 4 (four consecutive 0s), a "flat" run of length 4 (four consecutive 1s), and a "down" run of length 2 (two consecutive 2s); counting the runs per unit time approximates the state switching frequency, giving a count of 3 for this example.
Step S640: the speech rate of the voice sub-data is estimated from the state switching frequency, the speech rate of the voice sub-data being positively correlated with the state switching frequency.
In one embodiment of the present application, the number of data frames in the voice sub-data is obtained, and the speech rate state parameter representing the speech rate of the voice sub-data is determined from the ratio of the state switching frequency to that frame number.
If the number of data frames in the voice sub-data is cnt_v and the pitch period state switching count within those frames is cnt_p, the speech rate can be approximately represented by the speech rate state parameter rate_v = cnt_p / cnt_v. Two thresholds are set from empirical values, for example 0.08 and 0.15: if rate_v is below 0.08, the signal is a low-speech-rate speech signal; if rate_v is above 0.15, it is a high-speech-rate speech signal; and if rate_v lies between 0.08 and 0.15, it is a medium-speech-rate speech signal.
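Putting steps S610 to S640 together, the sketch below classifies adjacent-frame pitch period changes into the three states, counts state runs following the 0000111122 example (counted as 3 in the text), and applies the quoted 0.08/0.15 thresholds; the amplitude thresholds themselves are illustrative assumptions.

```python
# A minimal sketch of steps S610-S640; up_thr/down_thr are assumed
# amplitude thresholds in samples.

def pitch_states(periods, up_thr=2, down_thr=-2):
    """Classify each adjacent frame pair: 0 = up, 1 = flat, 2 = down."""
    states = []
    for prev, cur in zip(periods, periods[1:]):
        delta = cur - prev
        states.append(0 if delta > up_thr else 2 if delta < down_thr else 1)
    return states

def speech_rate_level(periods):
    states = pitch_states(periods)
    # count consecutive-state runs; 0000111122 yields cnt_p = 3
    cnt_p = sum(1 for i, s in enumerate(states) if i == 0 or s != states[i - 1])
    cnt_v = len(periods)                 # number of data frames
    rate_v = cnt_p / cnt_v               # speech rate state parameter
    if rate_v < 0.08:
        return "low"
    return "high" if rate_v > 0.15 else "medium"
```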
In one embodiment of the present application, the speech rate of the voice sub-data may also be estimated by curve fitting. In this embodiment, a distribution diagram of the pitch period over the time domain is drawn from the pitch period of each data frame in the voice sub-data; curve fitting is performed on the distribution diagram to obtain a time-domain variation curve representing the time-domain variation state of the pitch period; and the speech rate of the voice sub-data is estimated from the time-domain variation curve.
In one embodiment of the present application, estimating the speech rate of the voice sub-data from the time-domain variation curve includes: obtaining the number of data frames in the voice sub-data; counting the number of extreme points in the time-domain variation curve; and determining the speech rate state parameter representing the speech rate of the voice sub-data from the ratio of the number of extreme points to the number of frames.
Estimating the speech rate by curve fitting avoids high-frequency short-term fluctuations of the pitch period caused by detection errors and the like, improving the stability and reliability of the speech rate estimation.
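A minimal sketch of the curve-fitting variant follows; the polynomial fit and its degree are illustrative assumptions standing in for whatever smoothing a real implementation would use.

```python
import numpy as np

# A minimal sketch: fit a smooth curve to the pitch period distribution
# and count its extreme points; the degree-8 polynomial is an assumption.

def speech_rate_by_curve(periods: np.ndarray, degree: int = 8) -> float:
    """Return the speech rate state parameter: extreme points / frames."""
    x = np.arange(len(periods))
    fitted = np.polyval(np.polyfit(x, periods, degree), x)  # time-domain curve
    d = np.diff(fitted)
    # an extreme point is where the fitted curve's slope changes sign
    extrema = int(np.sum(d[:-1] * d[1:] < 0))
    return extrema / len(periods)
```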
In step S430, a target compression ratio is allocated to each of the at least two categories of audio sub-data according to the target compression amount, the target compression ratio being the ratio of the compression amount of the audio sub-data to its data amount before compression.
In one embodiment of the present application, the at least two types of audio sub-data obtained by classifying the audio data include speech sub-data carrying speech content and non-speech sub-data not carrying speech content.
FIG. 7 is a flowchart illustrating steps for assigning target compression ratios to speech sub-data and non-speech sub-data in one embodiment of the present application. As shown in fig. 7, on the basis of the above embodiment, in step S430, target compression ratios are respectively allocated to at least two kinds of audio sub-data according to target compression amounts, and the following steps S710 to S750 may be included.
Step S710: a first compression ratio is determined based on the target compression amount and the number of frames of the non-speech sub-data.
Let the number of data frames in the non-voice sub-data be N_unvoice, and let the target compression amount be N. The first compression ratio can then be expressed as the ratio of the target compression amount N to the frame number N_unvoice, i.e., N/N_unvoice.
Step S720: a target compression ratio is configured for the non-voice sub-data according to the smaller of the first compression ratio and a first compression ratio threshold.
The first compression ratio threshold represents the maximum compression ratio at which the non-voice sub-data may be compressed; for example, the first compression ratio threshold is 0.2. If N/N_unvoice is less than or equal to 0.2, the target compression ratio of the non-voice sub-data is configured as N/N_unvoice; if N/N_unvoice is greater than 0.2, the target compression ratio of the non-voice sub-data is configured as 0.2.
Step S730: the actual compression amount of the non-voice sub-data is determined according to the target compression ratio of the non-voice sub-data, and the expected compression amount of the voice sub-data is determined according to the target compression amount and the actual compression amount of the non-voice sub-data.
After the target compression ratio of the non-voice sub-data is determined in step S720, its actual compression amount can be derived from that ratio: for example, when the target compression ratio is N/N_unvoice, the actual compression amount of the non-voice sub-data is N; when the target compression ratio is the first compression ratio threshold 0.2, the actual compression amount is 0.2 × N_unvoice.
When the actual compression amount of the non-voice sub-data equals the target compression amount N, the desired compression amount of the voice sub-data is 0; that is, compressing the non-voice sub-data alone achieves the desired data compression. When the actual compression amount of the non-voice sub-data is determined to be N1 (with N1 smaller than the target compression amount N), the desired compression amount of the voice frames is M = N - N1.
Step S740: the second compression ratio is determined based on the desired compression amount and the number of frames of the speech sub-data.
In one embodiment of the present application, all the voice sub-data is compressed and played at a uniform compression ratio: the ratio of the desired compression amount M to the frame number N_voice of the voice sub-data, M/N_voice, is the second compression ratio.
Step S750: a target compression ratio is configured for the voice sub-data according to the smaller of the second compression ratio and a second compression ratio threshold.
To avoid the sound distortion caused by over-compressing voice data, the embodiment of the present application may set the second compression ratio threshold as the maximum compression ratio. When the second compression ratio determined in step S740 is less than (or equal to) the second compression ratio threshold, data compression is performed with the second compression ratio as the target compression ratio, so that the compression amount of the voice sub-data reaches the desired compression amount. When the second compression ratio determined in step S740 is greater than the second compression ratio threshold, the threshold itself is taken as the target compression ratio.
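The allocation of steps S710 to S750 can be summarized as follows; the 0.2 first compression ratio threshold comes from the text, while the second threshold value and all names are illustrative assumptions.

```python
# A minimal sketch of the allocation in Fig. 7 (steps S710-S750).

def allocate_ratios(N: int, n_unvoice: int, n_voice: int,
                    thr1: float = 0.2, thr2: float = 0.35):
    """Return (non-voice ratio, voice ratio) for target compression amount N."""
    # S710/S720: non-voice data absorbs as much compression as allowed
    r_unvoice = min(N / n_unvoice, thr1) if n_unvoice else 0.0
    # S730: actual non-voice compression and remaining desired amount
    n1 = r_unvoice * n_unvoice
    M = max(N - n1, 0.0)
    # S740/S750: spread the remainder uniformly over the voice frames
    r_voice = min(M / n_voice, thr2) if n_voice else 0.0
    return r_unvoice, r_voice

# e.g. N = 30 frames, 100 non-voice frames, 200 voice frames:
# non-voice takes 0.2 (20 frames), voice covers the remaining 10 -> 0.05
r_u, r_v = allocate_ratios(30, 100, 200)
assert (r_u, round(r_v, 6)) == (0.2, 0.05)
```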
In one embodiment of the present application, the voice sub-data may be divided into voice segments with different speech rate levels, and different target compression ratios may then be allocated to the segments according to their speech rate levels. The target compression ratio of a voice segment is negatively correlated with its speech rate level: a segment with a relatively fast speech rate is allocated a lower target compression ratio, and a segment with a relatively slow speech rate is allocated a higher one.
FIG. 8 is a flowchart illustrating steps for assigning target compression ratios to speech segments having different speech rate levels in one embodiment of the present application.
As shown in fig. 8, building on the above embodiment, determining the second compression ratio from the desired compression amount and the frame number of the voice sub-data, and configuring the target compression ratio for the voice sub-data from the smaller of the second compression ratio and the second compression ratio threshold, may include the following steps S810 to S840.
Step S810: voice frame weights are acquired that are negatively correlated with the speech rate level of each voice segment.
The higher the speech rate level, the faster the speech rate of the voice content carried in the segment, and voice frame weights are allocated to the segments of each level according to the negative correlation: for example, a low-speech-rate segment may be allocated a first voice frame weight a1, a medium-speech-rate segment a second weight a2 smaller than a1, and a high-speech-rate segment a third weight a3 smaller than a2. For example, a1 is configured as 1.3, a2 as 1.15, and a3 as 1.
Step S820: the frame numbers of the voice segments are weighted with the voice frame weights to obtain the weighted frame number of the voice sub-data.
The frame numbers of the voice segments in the voice sub-data are weighted and summed with their corresponding voice frame weights to obtain the weighted frame number of the voice sub-data. The frame count of each speech rate level can be counted in the audio buffer: for example, if the low-speech-rate frame count is K1, the medium-speech-rate count K2, and the high-speech-rate count K3, the weighted frame number of the voice sub-data is K = a1 × K1 + a2 × K2 + a3 × K3.
Step S830: a second compression ratio is determined based on a ratio of the desired compression amount to the weighted frame number.
By weighting the voice segments at different speech rate levels, the equivalent weighted frame number of the voice sub-data as a whole is obtained with the speech rates of the segments balanced, and the second compression ratio can be determined as the ratio of the desired compression amount M to the weighted frame number K, i.e., M/K.
Step S840: the smaller of the second compression ratio and the second compression ratio threshold is weighted with the voice frame weights to obtain the target compression ratio of each voice segment.
With the second compression ratio threshold denoted Cmax, the smaller of the threshold and the second compression ratio is selected as the overall target compression ratio of the voice sub-data, C = min(Cmax, M/K). The overall target compression ratio C is then weighted with the voice frame weight of each segment to obtain that segment's target compression ratio: for example, the target compression ratio of low-speech-rate segments is a1 × C, that of medium-speech-rate segments a2 × C, and that of high-speech-rate segments a3 × C.
By allocating different compression ratios to voice segments with different speech rates, voice frames are kept from being over-compressed into distorted, unpleasant sound, so that low-perceptibility compressed playback can be achieved.
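The weighted allocation of steps S810 to S840, with the example weights a1 = 1.3, a2 = 1.15, a3 = 1 from the text and an assumed Cmax, might look like this:

```python
# A minimal sketch of steps S810-S840; Cmax and all names are
# illustrative assumptions.

def per_level_ratios(M: float, K1: int, K2: int, K3: int,
                     a=(1.3, 1.15, 1.0), Cmax: float = 0.35):
    """Return target compression ratios for (low, medium, high) speech rates.

    M        - desired compression amount of the voice sub-data, in frames
    K1/K2/K3 - frame counts at the low/medium/high speech rate levels
    """
    a1, a2, a3 = a
    K = a1 * K1 + a2 * K2 + a3 * K3      # weighted frame number (S820)
    C = min(Cmax, M / K)                 # overall target ratio (S830/S840)
    # slower speech tolerates more compression, so it gets the larger ratio
    return a1 * C, a2 * C, a3 * C
```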
In step S440, the audio sub-data is compressed according to the target compression ratio.
Differentiated data compression may be performed on the audio sub-data according to the target compression ratios allocated for the different categories of audio sub-data.
In one embodiment of the present application, the audio data includes voice sub-data carrying voice content and non-voice sub-data not carrying voice content. Compressing the audio sub-data according to the target compression ratio then includes: deleting some of the non-voice frames in the non-voice sub-data according to its target compression ratio to obtain the compressed non-voice sub-data; and overlapping some of the voice frames in the voice sub-data according to its target compression ratio and waveform similarity to obtain the compressed voice sub-data.
For non-voice sub-data, directly deleting some non-voice frames improves compression efficiency and saves computing resources. For voice sub-data, to improve the playback quality after compression, overlapping is performed based on the waveform similarity between voice frames, i.e., the voice frames are overlapped by windowed fusion. For example, a voice frame is first selected from the voice sub-data as a reference frame; the voice frame with the highest waveform similarity to the reference frame is then selected through a sliding time window; the selected frame and the reference frame are overlapped and fused; and so on, until the compression of all the voice sub-data is completed. The sliding step of the time window is negatively correlated with the target compression ratio: for example, when the target compression ratio is 1/X, the sliding step is X, i.e., on average one overlap-fusion is performed every X voice frames.
In one embodiment of the present application, compressing the non-voice sub-data includes randomly selecting some non-voice frames to delete according to the target compression ratio: when the target compression ratio is 1/X, one non-voice frame is randomly selected out of every X non-voice frames and deleted. Alternatively, when the target compression ratio is 1/X, one frame may be deleted every X non-voice frames in time order. After the selected non-voice frames are deleted, the frames adjacent to each deletion are spliced with fade-in/fade-out processing to avoid the noise that deleting a frame would otherwise introduce. In some alternative embodiments, the same compression strategy used for voice sub-data may be applied to non-voice sub-data, such as overlapping based on the waveform similarity between non-voice frames.
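As a rough illustration of the two strategies in step S440, the sketch below deletes non-voice frames with a fade splice and overlap-adds voice frame pairs; the crossfade shape and the simplified pairwise fusion (a full implementation would search a sliding window for the most waveform-similar frame) are assumptions.

```python
import numpy as np

# A minimal sketch of step S440, assuming equal-length 1-D frames.

def compress_non_speech(frames: list[np.ndarray], X: int) -> list[np.ndarray]:
    """Delete roughly one in every X non-voice frames (target ratio 1/X),
    fading each deleted frame into its kept neighbour to avoid clicks."""
    if not frames:
        return []
    n = len(frames[0])
    fade_in, fade_out = np.linspace(0, 1, n), np.linspace(1, 0, n)
    out: list[np.ndarray] = []
    for i, f in enumerate(frames):
        if (i + 1) % X == 0 and out:
            out[-1] = out[-1] * fade_out + f * fade_in  # fade splice
        else:
            out.append(f.copy())
    return out

def compress_speech(frames: list[np.ndarray], X: int) -> list[np.ndarray]:
    """Roughly every X voice frames, overlap-add a frame with its
    neighbour; a real system picks the most waveform-similar frame."""
    out, i = [], 0
    while i < len(frames):
        if (i + 1) % X == 0 and i + 1 < len(frames):
            out.append(0.5 * (frames[i] + frames[i + 1]))  # fuse a pair
            i += 2
        else:
            out.append(frames[i])
            i += 1
    return out
```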
It should be noted that although the steps of the methods in the present application are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
The following describes an embodiment of an apparatus of the present application that may be used to perform the audio data compression method of the above-described embodiments of the present application. Fig. 9 shows a block diagram of an audio data compression apparatus according to an embodiment of the present application. As shown in fig. 9, the audio data compression device 900 may mainly include:
an acquisition module 910 configured to acquire a target compression amount for data-compressing audio data, the target compression amount being a difference in data amount of the audio data before and after compression;
the classification module 920 is configured to perform classification processing on the audio data to obtain at least two audio sub-data;
an allocation module 930 configured to allocate a target compression ratio, which is a ratio of a compression amount of the audio sub-data to a pre-compression data amount, to the at least two audio sub-data, respectively, according to the target compression amount;
and a compression module 940 configured to compress the audio sub-data according to the target compression ratio.
In one embodiment of the present application, based on the above embodiments, the obtaining module 910 may further include:
a frame number monitoring module 911 configured to monitor a real-time storage frame number of audio data stored in the audio buffer;
A frame number acquisition module 912 configured to acquire a desired storage frame number of the audio buffer;
the compression amount determining module 913 is configured to determine a target compression amount for data compression of the audio data according to a difference between the real-time storage frame number and the expected storage frame number when the real-time storage frame number is greater than the expected storage frame number.
In one embodiment of the present application, based on the above embodiments, the classification module 920 may further include:
a voice detection module 921 configured to perform voice activity detection on each data frame in the audio data to determine whether the data frame is a voice frame or a non-voice frame;
a voice tagging module 922 configured to tag consecutively distributed voice frames in the audio data as voice sub-data carrying voice content;
a non-speech marking module 923 configured to mark consecutively distributed non-voice frames in the audio data as non-voice sub-data not carrying voice content.
In one embodiment of the present application, based on the above embodiments, the voice markup module 922 may further include:
the speech speed estimation module is configured to perform speech speed estimation on the speech sub-data carrying the speech content to obtain a speech speed state parameter used for representing the speech speed of the speech sub-data;
And the voice segment marking module is configured to mark the voice sub-data into voice segments with different voice speed levels according to the voice speed state parameters.
In one embodiment of the present application, based on the above embodiments, the speech rate estimation module may be further configured to: perform pitch detection on the voice sub-data carrying the voice content to obtain the pitch period of each data frame in the voice sub-data; and estimate the speech rate of the voice sub-data from the time-domain variation state of the pitch period, the time-domain variation state representing the variation trend of the pitch period over the time domain.
In one embodiment of the present application, based on the above embodiments, the speech rate estimation module may be further configured to: comparing the pitch periods of two adjacent data frames in a time domain to obtain a period change trend and a period change amplitude of the pitch period of a next data frame relative to the pitch period of a previous data frame, wherein the period change trend is used for representing the rising, falling or leveling trend of the pitch period, and the period change amplitude is used for representing the pitch period difference value of the next data frame and the previous data frame; determining a time domain change state of the pitch period between two adjacent data frames according to the period change trend and the period change amplitude, wherein the time domain change state comprises at least two of a period rising state, a period reducing state or a period leveling state; counting state switching frequency of state switching of the time domain change state in a time domain, wherein the state switching frequency is used for representing the switching frequency of the time domain change state from one state to another state; and estimating the speech speed of the voice sub-data according to the state switching frequency, wherein the speech speed of the voice sub-data and the state switching frequency are in positive correlation.
In one embodiment of the present application, based on the above embodiments, the speech rate estimation module may be further configured to: acquire amplitude thresholds associated with the period change trend, wherein the amplitude thresholds comprise a first threshold related to the rising trend and a second threshold related to the falling trend, the first threshold is positive, and the second threshold is negative; if the period change amplitude is smaller than the first threshold and greater than the second threshold, mark the time domain change state of the pitch period between the two adjacent data frames as a period leveling state; if the period change amplitude is greater than the first threshold, mark the time domain change state as a period rising state; and if the period change amplitude is smaller than the second threshold, mark the time domain change state as a period falling state.
In one embodiment of the present application, based on the above embodiments, the speech rate estimation module may be further configured to: acquire the number of data frames in the voice sub-data; and determine a speech rate state parameter used for representing the speech rate of the voice sub-data according to the ratio of the state switching frequency to the number of frames.
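Putting the three preceding paragraphs together, one possible sketch of the state-switching estimator is given below; the threshold values are invented for illustration and are not taken from this application.

```python
RISE_THRESHOLD = 4    # assumed first threshold (positive), in samples
FALL_THRESHOLD = -4   # assumed second threshold (negative), in samples

def speech_rate_parameter(pitch_periods: list[int]) -> float:
    """Speech rate state parameter: the state-switching frequency of the
    pitch period's time domain change state, normalized by frame count."""
    if len(pitch_periods) < 3:
        return 0.0
    states = []
    for prev, cur in zip(pitch_periods, pitch_periods[1:]):
        delta = cur - prev                        # period change amplitude
        if delta > RISE_THRESHOLD:
            states.append("rising")
        elif delta < FALL_THRESHOLD:
            states.append("falling")
        else:
            states.append("leveling")
    switches = sum(a != b for a, b in zip(states, states[1:]))
    return switches / len(pitch_periods)          # larger => faster speech
```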
In one embodiment of the present application, based on the above embodiments, the speech rate estimation module may be configured to: draw a period distribution map of the pitch period in the time domain according to the pitch period of each data frame in the voice sub-data; perform curve fitting on the period distribution map to obtain a time domain change curve used for representing the time domain change state of the pitch period; and perform speech rate estimation on the voice sub-data according to the time domain change curve.
In one embodiment of the present application, based on the above embodiments, the speech rate estimation module may be further configured to: acquire the number of data frames in the voice sub-data; count the number of extreme points in the time domain change curve; and determine a speech rate state parameter used for representing the speech rate of the voice sub-data according to the ratio of the number of extreme points to the number of frames.
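A hedged sketch of this curve-fitting variant follows; the polynomial fit and its degree are arbitrary stand-ins for whatever fitting method is actually used.

```python
import numpy as np

def speech_rate_from_curve(pitch_periods: list[int], degree: int = 8) -> float:
    """Fit a smooth curve to the pitch-period distribution and take the
    density of its extreme points as the speech rate state parameter."""
    n = len(pitch_periods)
    if n < 3:
        return 0.0
    x = np.arange(n)
    coeffs = np.polyfit(x, pitch_periods, deg=min(degree, n - 1))
    slope = np.diff(np.polyval(coeffs, x))
    extrema = int(np.sum(np.sign(slope[:-1]) != np.sign(slope[1:])))
    return extrema / n
```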
In one embodiment of the present application, based on the above embodiments, the at least two audio sub-data include voice sub-data carrying voice content and non-voice sub-data not carrying voice content; the allocation module 930 includes:
a first compression ratio determining module 931 configured to determine a first compression ratio according to the target compression amount and the number of frames of the non-voice sub-data;
a non-voice compression ratio configuration module 932 configured to configure a target compression ratio for the non-voice sub-data according to the smaller of the first compression ratio and a first compression ratio threshold;
an expected compression amount determining module 933 configured to determine an actual compression amount of the non-voice sub-data according to the target compression ratio of the non-voice sub-data, and to determine an expected compression amount of the voice sub-data according to the target compression amount and the actual compression amount of the non-voice sub-data;
a second compression ratio determining module 934 configured to determine a second compression ratio according to the expected compression amount and the number of frames of the voice sub-data;
a voice compression ratio configuration module 935 configured to configure a target compression ratio for the voice sub-data according to the smaller of the second compression ratio and a second compression ratio threshold.
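The two-stage allocation can be sketched as follows: the non-voice sub-data absorbs as much of the target compression amount as its cap allows, and the remainder becomes the expected compression amount spread over the voice sub-data. The frame size and both ratio caps are illustrative assumptions.

```python
def allocate_ratios(target_amount: float, nonvoice_frames: int,
                    voice_frames: int, frame_size: int = 320,
                    nonvoice_cap: float = 0.9, voice_cap: float = 0.3):
    """Return (non-voice target ratio, voice target ratio), each being the
    fraction of that sub-data's pre-compression amount to remove."""
    nonvoice_amount = nonvoice_frames * frame_size
    first_ratio = target_amount / max(nonvoice_amount, 1)
    nonvoice_ratio = min(first_ratio, nonvoice_cap)
    actual_nonvoice = nonvoice_ratio * nonvoice_amount   # removed from non-voice
    expected_voice = max(target_amount - actual_nonvoice, 0.0)
    second_ratio = expected_voice / max(voice_frames * frame_size, 1)
    return nonvoice_ratio, min(second_ratio, voice_cap)
```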
In one embodiment of the present application, based on the above embodiments, the voice sub-data includes voice segments with different speech rate levels; the second compression ratio determining module 934 may be further configured to: acquire voice frame weights positively correlated with the speech rate levels of the voice segments; weight the number of frames of each voice segment according to the voice frame weights to obtain a weighted number of frames of the voice sub-data; and determine the second compression ratio according to the ratio of the expected compression amount to the weighted number of frames.
In one embodiment of the present application, based on the above embodiments, the voice sub-data includes voice segments with different speech rate levels; the voice compression ratio configuration module 935 may be further configured to: acquire voice frame weights positively correlated with the speech rate levels of the voice segments; and weight the smaller of the second compression ratio and the second compression ratio threshold according to the voice frame weights to obtain the target compression ratio of each voice segment.
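Under the assumption of a three-level speech rate scale with invented weights, the two weighted steps above might look like this sketch:

```python
# Assumed level-to-weight mapping, positively correlated with the speech
# rate level as the text requires; the concrete values are invented.
RATE_WEIGHTS = {"slow": 0.8, "normal": 1.0, "fast": 1.2}

def weighted_second_ratio(expected_amount: float, segments,
                          frame_size: int = 320) -> float:
    """Second compression ratio from the expected compression amount and
    the weight-adjusted frame count of the (level, n_frames) segments."""
    weighted_frames = sum(n_frames * RATE_WEIGHTS[level]
                          for level, n_frames in segments)
    return expected_amount / max(weighted_frames * frame_size, 1)

def per_segment_ratios(second_ratio: float, cap: float, segments):
    """Scale the capped second ratio by each segment's weight to obtain
    its own target compression ratio."""
    base = min(second_ratio, cap)
    return [(level, base * RATE_WEIGHTS[level]) for level, _ in segments]
```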
In one embodiment of the present application, based on the above embodiments, the audio data includes voice sub-data carrying voice content and non-voice sub-data not carrying voice content; the compression module 940 includes:
a non-voice deletion module 941 configured to delete some of the non-voice frames in the non-voice sub-data according to the target compression ratio to obtain compressed non-voice sub-data;
a voice superposition module 942 configured to superpose some of the voice frames in the voice sub-data according to the target compression ratio and waveform similarity to obtain compressed voice sub-data.
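To make the two compression paths concrete, a sketch follows. Reading "superposition according to waveform similarity" as a cross-fade merge of similar adjacent frames (in the spirit of SOLA-style time-scale modification) is an interpretation, and the similarity measure and threshold are assumptions.

```python
import numpy as np

def compress_nonvoice(frames: list, ratio: float) -> list:
    """Delete roughly `ratio` of the non-voice frames, spacing the
    deletions evenly across the sub-data."""
    keep = max(1, round(len(frames) * (1.0 - ratio)))
    idx = np.linspace(0, len(frames) - 1, keep).astype(int)
    return [frames[i] for i in idx]

def compress_voice(frames: list, ratio: float, sim_threshold: float = 0.8) -> list:
    """Merge the most similar adjacent frame pairs (equal-length arrays)
    by cross-fading until the requested fraction of frames is removed."""
    out, removed, budget, i = [], 0, int(len(frames) * ratio), 0
    while i < len(frames):
        if i + 1 < len(frames) and removed < budget:
            a, b = frames[i], frames[i + 1]
            sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
            if sim > sim_threshold:               # waveform similarity gate
                fade = np.linspace(1.0, 0.0, len(a))
                out.append(a * fade + b * (1.0 - fade))  # cross-fade merge
                removed += 1
                i += 2
                continue
        out.append(frames[i])
        i += 1
    return out
```

Deleting frames is tolerable for non-voice sub-data because it carries no speech content, while the cross-fade merge shortens voice sub-data with less audible discontinuity.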
Specific details of the audio data compression apparatus provided in the embodiments of the present application have been described in the corresponding method embodiments and are not repeated here.
Fig. 10 schematically shows a block diagram of a computer system for implementing an electronic device according to an embodiment of the present application.
It should be noted that, the computer system 1000 of the electronic device shown in fig. 10 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 10, the computer system 1000 includes a central processing unit 1001 (Central Processing Unit, CPU), which can execute various appropriate actions and processes according to a program stored in a read-only memory 1002 (Read-Only Memory, ROM) or a program loaded from a storage section 1008 into a random access memory 1003 (Random Access Memory, RAM). The random access memory 1003 also stores various programs and data necessary for system operation. The CPU 1001, the ROM 1002, and the RAM 1003 are connected to each other via a bus 1004. An input/output interface 1005 (I/O interface) is also connected to the bus 1004.
The following components are connected to the input/output interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a local area network card or a modem. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the input/output interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1010 as needed, so that a computer program read therefrom is installed into the storage section 1008 as needed.
In particular, according to embodiments of the present application, the processes described in the method flowcharts above may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program carried on a computer readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 1009, and/or installed from the removable medium 1011. When executed by the central processing unit 1001, the computer program performs the various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer readable program code carried therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium other than a computer readable storage medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless and wired media, or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit, in accordance with embodiments of the present application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of the embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software combined with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (e.g., a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes several instructions to cause a computing device (e.g., a personal computer, a server, a touch terminal, or a network device) to perform the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations that follow the general principles of the application and include such departures from the present disclosure as come within known or customary practice in the art to which the application pertains.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (18)

1. A method of audio data compression, comprising:
acquiring a target compression amount for data compression of audio data, wherein the target compression amount is the difference in data amount of the audio data before and after compression;
classifying the audio data to obtain at least two audio sub-data;
respectively distributing target compression ratios to the at least two audio sub-data according to the target compression amount, wherein the target compression ratio is the ratio of the compression amount of the audio sub-data to the data amount before compression;
and compressing the audio sub-data according to the target compression ratio.
2. The audio data compression method according to claim 1, wherein acquiring a target compression amount for data compression of the audio data comprises:
monitoring a real-time storage frame number of audio data stored in an audio buffer;
acquiring an expected storage frame number of the audio buffer;
and when the real-time storage frame number is greater than the expected storage frame number, determining a target compression amount for data compression of the audio data according to the difference between the real-time storage frame number and the expected storage frame number.
3. The audio data compression method according to claim 1, wherein classifying the audio data comprises:
performing voice activity detection on each data frame in the audio data to determine whether the data frame is a voice frame or a non-voice frame;
marking consecutively distributed voice frames in the audio data as voice sub-data carrying voice content;
and marking consecutively distributed non-voice frames in the audio data as non-voice sub-data not carrying voice content.
4. The audio data compression method according to claim 3, wherein classifying the audio data further comprises:
performing speech rate estimation on the voice sub-data carrying the voice content to obtain a speech rate state parameter used for representing the speech rate of the voice sub-data;
and marking the voice sub-data as voice segments with different speech rate levels according to the speech rate state parameter.
5. The method of audio data compression according to claim 4, wherein performing speech rate estimation on the voice sub-data carrying voice content comprises:
performing pitch detection on the voice sub-data carrying the voice content to obtain the pitch period of each data frame in the voice sub-data;
and performing speech rate estimation on the voice sub-data according to the time domain change state of the pitch period, wherein the time domain change state is used for representing the change trend of the pitch period in the time domain.
6. The method of audio data compression according to claim 5, wherein performing speech rate estimation on the voice sub-data according to the time domain change state of the pitch period comprises:
comparing the pitch periods of two adjacent data frames in the time domain to obtain a period change trend and a period change amplitude of the pitch period of the next data frame relative to the pitch period of the previous data frame, wherein the period change trend is used for representing a rising, falling or leveling trend of the pitch period, and the period change amplitude is used for representing the pitch period difference between the next data frame and the previous data frame;
determining a time domain change state of the pitch period between the two adjacent data frames according to the period change trend and the period change amplitude, wherein the time domain change state comprises at least two of a period rising state, a period falling state or a period leveling state;
counting a state switching frequency of the time domain change state in the time domain, wherein the state switching frequency is used for representing how often the time domain change state switches from one state to another;
and estimating the speech rate of the voice sub-data according to the state switching frequency, wherein the speech rate of the voice sub-data is positively correlated with the state switching frequency.
7. The method of audio data compression according to claim 6, wherein determining a time domain change state of the pitch period between two adjacent data frames from the period change trend and the period change amplitude comprises:
acquiring amplitude thresholds associated with the period change trend, wherein the amplitude thresholds comprise a first threshold related to the rising trend and a second threshold related to the falling trend, the first threshold is positive, and the second threshold is negative;
if the period change amplitude is smaller than the first threshold and greater than the second threshold, marking the time domain change state of the pitch period between the two adjacent data frames as a period leveling state;
if the period change amplitude is greater than the first threshold, marking the time domain change state as a period rising state;
and if the period change amplitude is smaller than the second threshold, marking the time domain change state as a period falling state.
8. The audio data compression method according to claim 6, wherein performing speech rate estimation on the voice sub-data according to the state switching frequency comprises:
acquiring the number of data frames in the voice sub-data;
and determining a speech rate state parameter used for representing the speech rate of the voice sub-data according to the ratio of the state switching frequency to the number of frames.
9. The method of audio data compression according to claim 5, wherein performing speech rate estimation on the voice sub-data according to the time domain change state of the pitch period comprises:
drawing a period distribution map of the pitch period in the time domain according to the pitch period of each data frame in the voice sub-data;
performing curve fitting on the period distribution map to obtain a time domain change curve used for representing the time domain change state of the pitch period;
and performing speech rate estimation on the voice sub-data according to the time domain change curve.
10. The method of audio data compression according to claim 9, wherein performing speech rate estimation on the voice sub-data according to the time domain change curve comprises:
acquiring the number of data frames in the voice sub-data;
counting the number of extreme points in the time domain change curve;
and determining a speech rate state parameter used for representing the speech rate of the voice sub-data according to the ratio of the number of extreme points to the number of frames.
11. The audio data compression method according to any one of claims 1 to 10, wherein the at least two audio sub-data comprise voice sub-data carrying voice content and non-voice sub-data not carrying voice content, and respectively distributing target compression ratios to the at least two audio sub-data according to the target compression amount comprises:
determining a first compression ratio according to the target compression amount and the number of frames of the non-voice sub-data;
configuring a target compression ratio for the non-voice sub-data according to the smaller of the first compression ratio and a first compression ratio threshold;
determining an actual compression amount of the non-voice sub-data according to the target compression ratio of the non-voice sub-data, and determining an expected compression amount of the voice sub-data according to the target compression amount and the actual compression amount of the non-voice sub-data;
determining a second compression ratio according to the expected compression amount and the number of frames of the voice sub-data;
and configuring a target compression ratio for the voice sub-data according to the smaller of the second compression ratio and a second compression ratio threshold.
12. The method of audio data compression according to claim 11, wherein the voice sub-data comprises voice segments with different speech rate levels, and determining a second compression ratio according to the expected compression amount and the number of frames of the voice sub-data comprises:
acquiring voice frame weights positively correlated with the speech rate levels of the voice segments;
weighting the number of frames of each voice segment according to the voice frame weights to obtain a weighted number of frames of the voice sub-data;
and determining the second compression ratio according to the ratio of the expected compression amount to the weighted number of frames.
13. The method of audio data compression according to claim 11, wherein the voice sub-data comprises voice segments with different speech rate levels, and configuring a target compression ratio for the voice sub-data according to the smaller of the second compression ratio and the second compression ratio threshold comprises:
acquiring voice frame weights positively correlated with the speech rate levels of the voice segments;
and weighting the smaller of the second compression ratio and the second compression ratio threshold according to the voice frame weights to obtain the target compression ratio of each voice segment.
14. The audio data compression method according to any one of claims 1 to 10, wherein the audio data comprises voice sub-data carrying voice content and non-voice sub-data not carrying voice content, and compressing the audio sub-data according to the target compression ratio comprises:
deleting some of the non-voice frames in the non-voice sub-data according to the target compression ratio to obtain compressed non-voice sub-data;
and superposing some of the voice frames in the voice sub-data according to the target compression ratio and waveform similarity to obtain compressed voice sub-data.
15. An audio data compression apparatus, comprising:
an acquisition module configured to acquire a target compression amount for data compression of audio data, the target compression amount being the difference in data amount of the audio data before and after compression;
a classification module configured to classify the audio data to obtain at least two audio sub-data;
an allocation module configured to respectively allocate target compression ratios to the at least two audio sub-data according to the target compression amount, each target compression ratio being the ratio of the compression amount of the audio sub-data to its data amount before compression;
and a compression module configured to compress the audio sub-data according to the target compression ratio.
16. A computer readable medium, characterized in that the computer readable medium has stored thereon a computer program which, when executed by a processor, implements the audio data compression method of any of claims 1 to 14.
17. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to cause the electronic device to perform the audio data compression method of any one of claims 1 to 14 via execution of the executable instructions.
18. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the audio data compression method of any one of claims 1 to 14.
CN202111510547.2A 2021-12-10 2021-12-10 Audio data compression method and related products Pending CN116259322A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111510547.2A CN116259322A (en) 2021-12-10 2021-12-10 Audio data compression method and related products

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111510547.2A CN116259322A (en) 2021-12-10 2021-12-10 Audio data compression method and related products

Publications (1)

Publication Number Publication Date
CN116259322A true CN116259322A (en) 2023-06-13

Family

ID=86678035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111510547.2A Pending CN116259322A (en) 2021-12-10 2021-12-10 Audio data compression method and related products

Country Status (1)

Country Link
CN (1) CN116259322A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113496705A (en) * 2021-08-19 2021-10-12 杭州华橙软件技术有限公司 Audio processing method and device, storage medium and electronic equipment
CN113496705B (en) * 2021-08-19 2024-03-08 杭州华橙软件技术有限公司 Audio processing method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
KR101726208B1 (en) Volume leveler controller and controlling method
US9558744B2 (en) Audio processing apparatus and audio processing method
CN102714034B (en) Signal processing method, device and system
US9916837B2 (en) Methods and apparatuses for transmitting and receiving audio signals
WO2010072115A1 (en) Signal classification processing method, classification processing device and encoding system
CN110782907B (en) Voice signal transmitting method, device, equipment and readable storage medium
WO2021143692A1 (en) Audio encoding and decoding methods and audio encoding and decoding devices
CN111464262B (en) Data processing method, device, medium and electronic equipment
CN111107284B (en) Real-time generation system and generation method for video subtitles
CN116259322A (en) Audio data compression method and related products
US10290303B2 (en) Audio compensation techniques for network outages
CN114363553A (en) Dynamic code stream processing method and device in video conference
CN112767955B (en) Audio encoding method and device, storage medium and electronic equipment
CN113823303A (en) Audio noise reduction method and device and computer readable storage medium
CN111816197A (en) Audio encoding method, audio encoding device, electronic equipment and storage medium
CN112423019A (en) Method and device for adjusting audio playing speed, electronic equipment and storage medium
CN111798858A (en) Audio playing method and device, electronic equipment and storage medium
CN115831132A (en) Audio encoding and decoding method, device, medium and electronic equipment
US11900951B2 (en) Audio packet loss concealment method, device and bluetooth receiver
CN114743571A (en) Audio processing method and device, storage medium and electronic equipment
US20130297311A1 (en) Information processing apparatus, information processing method and information processing program
CN111951821B (en) Communication method and device
CN115083440A (en) Audio signal noise reduction method, electronic device, and storage medium
CN113382241A (en) Video encoding method, video encoding device, electronic equipment and storage medium
TWI296406B (en)

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40086949

Country of ref document: HK

SE01 Entry into force of request for substantive examination