CN113178202A - Audio data processing method, device and equipment and readable storage medium - Google Patents

Audio data processing method, device and equipment and readable storage medium

Info

Publication number
CN113178202A
Authority
CN
China
Prior art keywords
voice
processed
packet
voice packet
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110484184.3A
Other languages
Chinese (zh)
Inventor
洪家明
李雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hytera Communications Corp Ltd
Original Assignee
Hytera Communications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hytera Communications Corp Ltd
Priority to CN202110484184.3A
Publication of CN113178202A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Telephone Function (AREA)

Abstract

The embodiments of the application provide an audio data processing method, apparatus, and device, and a readable storage medium. In response to receiving a voice packet to be processed, it is determined whether a target number is greater than or equal to a preset first overflow threshold, where the voice packet to be processed is a voice packet to be written into a cache and the target number is the number of voice packets in the cache. If the target number is greater than or equal to the first overflow threshold and a preset first filtering condition is met, the voice packet to be processed is discarded. The first filtering condition includes: the voice packet to be processed is a mute packet, where a mute packet is a voice packet whose voice energy value is smaller than a preset energy value. Because the voice packet to be processed has not yet been written into the cache, discarding it when the target number is greater than or equal to the first overflow threshold reduces the delay. Because its voice energy value is smaller than the preset energy value, discarding it does not affect the voice quality, so word loss is avoided and the voice quality is improved.

Description

Audio data processing method, device and equipment and readable storage medium
Technical Field
The present application relates to the field of multimedia audio/video technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for processing audio data.
Background
At present, a common technical means is to store the voice packets output by the source end in a jitter elimination buffer, from which the destination end reads them.
Enlarging the jitter elimination buffer can effectively absorb delay jitter. However, when too many packets are buffered, the current solution for preventing the delay from increasing is to discard the voice packets that cause the overflow. Discarding those packets leads to word loss, which reduces the voice quality.
Therefore, how to reduce the delay and improve the voice quality is a problem to be solved urgently.
Disclosure of Invention
The application provides an audio data processing method, apparatus, and device, and a readable storage medium, aiming to reduce delay and improve voice quality, as follows:
A method of processing audio data, comprising:
in response to receiving a voice packet to be processed, determining whether a target number is greater than or equal to a preset first overflow threshold, where the voice packet to be processed is a voice packet to be written into a cache, and the target number is the number of voice packets in the cache;
if the target number is greater than or equal to the first overflow threshold and a preset first filtering condition is met, discarding the voice packet to be processed, where the first filtering condition includes: the voice packet to be processed is a mute packet, and a mute packet is a voice packet whose voice energy value is smaller than a first preset energy value.
Optionally, the method further comprises:
if the target number is greater than or equal to the first overflow threshold and the voice packet to be processed is a voiceprint voice packet, writing the voice packet to be processed into the cache, where a voiceprint voice packet is a voice packet that contains a voiceprint.
Optionally, the method further comprises:
if the target number is smaller than the first overflow threshold, writing the voice packet to be processed into the cache.
Optionally, the method further comprises:
when a preset monitoring time is reached and no voice packet to be processed has been received, determining whether the target number is smaller than a preset underflow threshold;
if the target number is smaller than the underflow threshold, writing a null voice packet into the cache, where the null voice packet includes the mute packet.
Optionally, discarding the voice packet to be processed if the target number is greater than or equal to the first overflow threshold and the preset first filtering condition is met includes:
if the target number is greater than or equal to the first overflow threshold and smaller than a preset second overflow threshold, discarding the voice packet to be processed when the first filtering condition is met;
the first filtering condition further includes: the last N consecutive voice packets in the sequence order of the cache are mute packets, where N is a first preset value.
Optionally, the method further comprises:
if the target number is greater than or equal to the second overflow threshold and smaller than a preset third overflow threshold, discarding the voice packet to be processed when a preset second filtering condition is met;
the second filtering condition includes: the voice packet to be processed is a background voice packet, and the last M consecutive voice packets in the sequence order of the cache are background voice packets, where M is a second preset value and a background voice packet is a voice packet without a voiceprint.
Optionally, the method further comprises:
if the target number is greater than or equal to the third overflow threshold, discarding the voice packet to be processed when a preset third filtering condition is met;
the third filtering condition includes: the voice packet to be processed is a background voice packet.
An apparatus for processing audio data, comprising:
an overflow judging unit, configured to, in response to receiving a voice packet to be processed, determine whether a target number is greater than or equal to a preset first overflow threshold, where the voice packet to be processed is a voice packet to be written into a cache, and the target number is the number of voice packets in the cache;
a filtering unit, configured to discard the voice packet to be processed if the target number is greater than or equal to the first overflow threshold and a preset first filtering condition is met, where the first filtering condition includes: the voice packet to be processed is a mute packet, and a mute packet is a voice packet whose voice energy value is smaller than a preset energy value.
An audio data processing device, comprising: a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement each step of the audio data processing method described above.
A readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements each step of the audio data processing method described above.
It can be seen from the foregoing technical solutions that the audio data processing method, apparatus, device, and readable storage medium provided in the embodiments of the present application determine, in response to receiving a voice packet to be processed, whether a target number is greater than or equal to a preset first overflow threshold, where the voice packet to be processed is a voice packet to be written into a cache and the target number is the number of voice packets in the cache. If the target number is greater than or equal to the first overflow threshold and a preset first filtering condition is met, the voice packet to be processed is discarded. The first filtering condition includes: the voice packet to be processed is a mute packet, where a mute packet is a voice packet whose voice energy value is smaller than a preset energy value. Because the voice packet to be processed has not yet been written into the cache, discarding it when the target number is greater than or equal to the first overflow threshold reduces the delay. Because it is a mute packet, that is, its voice energy value is smaller than the preset energy value, discarding it does not affect the voice quality. Compared with directly discarding whichever voice packet causes the overflow, word loss is avoided and the voice quality is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a specific implementation of a method for processing audio data according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a specific implementation of another method for processing audio data according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of a specific implementation of a caching method according to an embodiment of the present application;
Fig. 4 is a schematic flowchart of a method for processing audio data according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an apparatus for processing audio data according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an audio data processing device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The voice data processing method provided by the embodiments of the application is applicable to, but not limited to, scenarios in which voice packets output by a source end are processed. Specifically, it is applied to a voice data processing device that is communicatively connected to both the source end and a jitter elimination buffer (referred to below as the buffer or the cache). The source end is the output end of the voice data, and the voice data to be processed by the device are the voice packets output by the source end.
Fig. 1 illustrates a specific implementation flow of an optional voice data processing method provided in an embodiment of the present application. The method specifically comprises the following steps:
S101, obtaining a voice packet to be processed.
In this embodiment, the voice packet to be processed is a voice packet that is output by the source end and is to be written into the cache.
S102, acquiring the type of the voice packet to be processed, and using the type as the tag of the voice packet to be processed.
In this embodiment, the voice packet to be processed is one of a voiceprint voice packet, a mute packet, and a background voice packet. A mute packet is a voice packet whose voice energy value is smaller than a first preset energy value (a value close to 0), a background voice packet is a voice packet without a voiceprint, and a voiceprint voice packet is a voice packet with a voiceprint.
An optional method for obtaining the type of the voice packet to be processed includes steps 1 to 3, as follows:
1. Determine whether the voice packet to be processed is a mute packet, at least according to the energy value of the voice data in the voice packet to be processed.
Specifically, the type of the voice packet to be processed is determined according to a preset correspondence between types and energy values. For example, a voice packet to be processed whose energy value lies within a first preset range (greater than or equal to 0 and smaller than the first preset energy value, where the first preset energy value is a value close to 0) is a mute packet.
It should be noted that VAD (Voice Activity Detection) may be used to determine whether the voice packet to be processed is a mute packet; for details, refer to the prior art.
2. Determine whether the voice packet to be processed is a voiceprint voice packet, at least according to the frequency value of the voice data in the voice packet to be processed.
It should be noted that a VD (Voice Detection) technique may be used to determine whether the voice packet to be processed is a voiceprint voice packet; for details, refer to the prior art.
3. If the voice packet to be processed is not a voiceprint voice packet, determine that it is a background voice packet.
It should be noted that a mute packet may also contain a voiceprint. However, because its voice energy value is smaller than the first preset energy value and close to 0, a voice packet that contains a voiceprint but whose energy value is smaller than the first preset energy value is treated as a mute packet.
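The three classification steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the mean-squared-amplitude energy measure, the threshold value `100.0`, and the names `PacketType`, `mean_energy`, and `classify_packet` are all assumptions, and the voiceprint decision is passed in as a boolean because the source defers that detection to the prior art.

```python
from enum import Enum
from typing import Sequence


class PacketType(Enum):
    MUTE = "mute"
    VOICEPRINT = "voiceprint"
    BACKGROUND = "background"


def mean_energy(samples: Sequence[int]) -> float:
    """Mean squared amplitude of PCM samples (assumed energy measure)."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def classify_packet(
    samples: Sequence[int],
    has_voiceprint: bool,
    mute_energy_threshold: float = 100.0,  # hypothetical "value close to 0"
) -> PacketType:
    """Return the tag for a packet, checking energy before voiceprint."""
    # Step 1: a low-energy packet is a mute packet even if it has a voiceprint.
    if mean_energy(samples) < mute_energy_threshold:
        return PacketType.MUTE
    # Step 2: has_voiceprint stands in for an external voice detector.
    if has_voiceprint:
        return PacketType.VOICEPRINT
    # Step 3: everything else is background sound.
    return PacketType.BACKGROUND
```

Checking energy first mirrors the note that a low-energy packet containing a voiceprint is still treated as a mute packet.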
S103, acquiring the number of voice packets in the cache, and recording it as the target number.
In this embodiment, the number of voice packets in the cache may be obtained in multiple ways. Optionally, it may be obtained by monitoring the voice packets entering through the input port of the cache and the voice packets leaving through its output port. For details, refer to the prior art.
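One way to picture the port-monitoring approach is a pair of counters, one incremented on each write and one on each read, whose difference is the target number. The sketch below is only an assumption about how such monitoring could look; `PacketCounter`, `on_write`, and `on_read` are hypothetical names that do not appear in the source.

```python
class PacketCounter:
    """Tracks cache occupancy from observed writes and reads (hypothetical)."""

    def __init__(self) -> None:
        self.written = 0  # packets seen at the input port
        self.read = 0     # packets seen at the output port

    def on_write(self) -> None:
        self.written += 1

    def on_read(self) -> None:
        # Ignore reads from an empty cache so the count never goes negative.
        if self.read < self.written:
            self.read += 1

    @property
    def target_number(self) -> int:
        """Number of voice packets currently in the cache."""
        return self.written - self.read
```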
It should be noted that, in this embodiment, the target number is acquired at the time the voice packet to be processed is acquired.
S104, if the target number is smaller than a preset first overflow threshold, writing the voice packet to be processed and its tag into the cache.
In this embodiment, a target number smaller than the first overflow threshold means that the voice packets in the cache have not overflowed and the delay is normal.
S105, if the target number is greater than or equal to the first overflow threshold and smaller than a preset second overflow threshold, determining whether a first preset condition is met.
In this embodiment, a target number greater than or equal to the first overflow threshold and smaller than the preset second overflow threshold means that the voice packets in the cache have overflowed, and the overflow is slight.
In this embodiment, the first preset condition includes: the last N consecutive tags in the sequence order of the cache all indicate mute packets, and the voice packet to be processed is a mute packet. It should be noted that the sequence position of a voice packet in the cache reflects the time at which it was written: a packet written earlier has an earlier position.
N is a first preset number. It should be noted that N+1 consecutive mute packets are defined as a continuous mute packet.
S106, if the first preset condition is met, discarding the voice packet to be processed, and writing the tag of the voice packet to be processed into the cache.
In this embodiment, the first preset condition being met means that the target number is greater than or equal to the first overflow threshold and smaller than the second overflow threshold, the last N consecutive tags in the sequence order of the cache all indicate mute packets, and the voice packet to be processed is a mute packet.
S107, if the first preset condition is not met, writing the voice packet to be processed and its tag into the cache.
It should be noted that discarding the voice packet to be processed when the first preset condition is met prevents the overflow from worsening while preserving the voice quality.
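The first preset condition reduces to a check on the tail of the tag sequence. A minimal sketch, assuming tags are stored as strings in write order; the names `last_n_are` and `first_condition_met` and the string tag values are hypothetical:

```python
from typing import Sequence

MUTE = "mute"  # assumed tag value for mute packets


def last_n_are(tags: Sequence[str], wanted: str, n: int) -> bool:
    """True if the n most recently written tags all equal `wanted`."""
    if n <= 0:
        return True  # nothing to check
    if len(tags) < n:
        return False  # not enough history to form a run
    return all(tag == wanted for tag in tags[-n:])


def first_condition_met(tags: Sequence[str], incoming_tag: str, n: int) -> bool:
    """Incoming packet is mute and would extend a run of n mute tags."""
    return incoming_tag == MUTE and last_n_are(tags, MUTE, n)
```

For example, with tags `["voiceprint", "mute", "mute"]` and `n = 2`, an incoming mute packet satisfies the condition, while with `n = 3` it does not.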
S108, if the target number is greater than or equal to the second overflow threshold and smaller than a preset third overflow threshold, determining whether a second preset condition is met.
In this embodiment, the second preset condition includes either of the following:
1. The voice packet to be processed is a mute packet.
2. The voice packet to be processed is a background voice packet, and the last M consecutive tags in the sequence order of the cache all indicate background voice packets.
M is a second preset number. It should be noted that M+1 consecutive background voice packets are defined as a continuous background voice packet.
S109, if the second preset condition is met, discarding the voice packet to be processed, and writing the tag of the voice packet to be processed into the cache.
S110, if the second preset condition is not met, writing the voice packet to be processed and its tag into the cache.
S111, if the target number is greater than or equal to the preset third overflow threshold, determining whether a third preset condition is met.
In this embodiment, the third preset condition includes: the voice packet to be processed is a mute packet, or the voice packet to be processed is a background voice packet.
S112, if the third preset condition is met, discarding the voice packet to be processed, and writing the tag of the voice packet to be processed into the cache.
S113, if the third preset condition is not met, writing the voice packet to be processed and its tag into the cache.
As the flow in Fig. 1 shows, the voice data processing method provided by this embodiment handles each received voice packet to be processed according to the target number and the type of that packet. The aim is to discard the voice packets that meet the corresponding condition when the target number falls within different ranges, that is, at different degrees of overflow, so as to reduce the delay while preserving the voice quality. Specifically:
If the target number is smaller than the first overflow threshold, the voice packet to be processed is written into the cache.
If the target number is greater than or equal to the first overflow threshold and the voice packet to be processed is a voiceprint voice packet, that is, a voice packet containing a voiceprint, the voice packet to be processed is written into the cache. Compared with the prior-art approach of directly discarding voice packets to reduce the delay, this avoids the loss of voice quality caused by word loss.
Further, if the target number is greater than or equal to the first overflow threshold and smaller than the second overflow threshold, the voice packet to be processed is discarded when the first filtering condition is met. The first filtering condition includes: the voice packet to be processed is a mute packet, and the last N consecutive voice packets in the sequence order of the cache are all mute packets.
If the target number is greater than or equal to the second overflow threshold and smaller than the preset third overflow threshold, the voice packet to be processed is discarded if it is a mute packet, and is also discarded if it is not a mute packet but meets the second filtering condition. The second filtering condition includes: the voice packet to be processed is a background voice packet, and the last M consecutive voice packets in the sequence order of the cache are all background voice packets.
If the target number is greater than or equal to the third overflow threshold, the voice packet to be processed is discarded if it is a mute packet, and is also discarded if it is not a mute packet but meets the third filtering condition. The third filtering condition includes: the voice packet to be processed is a background voice packet.
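The tiered rules summarized above can be expressed as a single decision function. The following is a sketch under stated assumptions rather than the patented implementation: the threshold values, the string tags, and the names `Thresholds`, `JitterBuffer`, `should_drop`, and `handle_packet` are all hypothetical. Following S106, S109, and S112, the tag is recorded even when the packet itself is dropped.

```python
from dataclasses import dataclass, field
from typing import List, Sequence

MUTE, BACKGROUND, VOICEPRINT = "mute", "background", "voiceprint"


@dataclass
class Thresholds:
    first: int = 10   # hypothetical first overflow threshold
    second: int = 20  # hypothetical second overflow threshold
    third: int = 30   # hypothetical third overflow threshold
    n: int = 2        # run length N for mute tags
    m: int = 2        # run length M for background tags


@dataclass
class JitterBuffer:
    packets: List[bytes] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)


def _tail_is(tags: Sequence[str], wanted: str, count: int) -> bool:
    """True if the last `count` tags all equal `wanted`."""
    if count <= 0:
        return True
    return len(tags) >= count and all(t == wanted for t in tags[-count:])


def should_drop(target_number: int, tag: str, tags: Sequence[str], th: Thresholds) -> bool:
    if target_number < th.first or tag == VOICEPRINT:
        return False  # no overflow, or a voiceprint packet: always keep
    if target_number < th.second:  # slight overflow: continuous mute only
        return tag == MUTE and _tail_is(tags, MUTE, th.n)
    if target_number < th.third:   # mute, or continuous background
        return tag == MUTE or (tag == BACKGROUND and _tail_is(tags, BACKGROUND, th.m))
    return tag in (MUTE, BACKGROUND)  # severe overflow


def handle_packet(buf: JitterBuffer, payload: bytes, tag: str, th: Thresholds) -> bool:
    """Write or drop the packet; return True if it was written."""
    drop = should_drop(len(buf.packets), tag, buf.tags, th)
    if not drop:
        buf.packets.append(payload)
    buf.tags.append(tag)  # the tag is written in both cases
    return not drop
```

Keeping the tag of a dropped packet means the run-length checks see the full history of received packet types, not only the packets that survived filtering.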
It should be noted that Fig. 1 only illustrates one specific implementation flow of the voice data processing method provided in this embodiment. In other scenarios, the present application also covers other optional implementation flows. For example, the target number may be acquired at any of several optional times (that is, preset monitoring times):
T1: the target number is acquired when the output port of the cache is called (that is, when a voice packet is read out of the cache).
T2: the target number is acquired in response to receiving the voice data to be processed (that is, when the source end outputs a voice packet), as in the flow shown in Fig. 1.
T3: the target number is acquired at a preset period.
T4: the target number is acquired at the initial time.
Fig. 2 illustrates a specific implementation flow of another optional voice data processing method provided in this embodiment, which specifically includes:
S201, in response to detecting that the output port of the cache is called, acquiring the number of voice packets in the cache.
It should be noted that, for a method of monitoring whether the output port of the cache is called, refer to the prior art.
S202, if the number of voice packets in the cache is smaller than a preset underflow threshold, writing a target number of null voice packets into the cache.
In this step, the target number is the difference between the underflow threshold and the number of voice packets in the cache, that is, the number of null voice packets to be written.
It should be noted that a null voice packet is a voice packet whose type is mute packet; for a method of generating null voice packets, refer to the prior art.
For example, if the underflow threshold is 5 and the number of voice packets in the cache is 3, then 2 null voice packets are written into the cache, so that the number of voice packets in the cache is not smaller than the underflow threshold and underflow is avoided.
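The arithmetic of this example can be sketched as a small helper. The names `null_packets_needed` and `fill_on_underflow`, and the all-zero 160-byte payload used as a null voice packet, are assumptions for illustration only; the source leaves null-packet generation to the prior art.

```python
from typing import List


def null_packets_needed(buffered: int, underflow_threshold: int) -> int:
    """Difference between the underflow threshold and the cache occupancy."""
    return max(0, underflow_threshold - buffered)


def fill_on_underflow(
    buffer: List[bytes],
    underflow_threshold: int,
    null_packet: bytes = b"\x00" * 160,  # hypothetical mute payload
) -> int:
    """Append null packets until the threshold is reached; return how many."""
    needed = null_packets_needed(len(buffer), underflow_threshold)
    buffer.extend([null_packet] * needed)
    return needed
```

With an underflow threshold of 5 and 3 buffered packets, `fill_on_underflow` appends 2 null packets, matching the example in the text.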
S203, in response to receiving the voice data to be processed, executing a first preset flow.
In this embodiment, the first preset flow includes S102 to S113 shown in Fig. 1; for details, refer to the above embodiment, which are not repeated here.
It should be noted that, compared with Fig. 1, the flow shown in Fig. 2 adds a check for underflow of the cache: when the number of voice packets in the cache is smaller than the underflow threshold, null voice packets are written into the cache to avoid underflow.
Taking an audio playing scenario as an example, this embodiment provides a caching flow based on the voice data processing method shown in Fig. 2. Fig. 3 illustrates an exemplary implementation of the caching method.
S301, determining the cache state according to the number of voice packets in the cache.
Specifically, the cache state is determined by comparing the number of voice packets in the cache with a plurality of preset thresholds. The cache states include a packet complement state, a normal state, a first compact cache state, a second compact cache state, and a third compact cache state.
If the number of voice packets in the cache is smaller than the underflow threshold, the cache state is the packet complement state.
If the number of voice packets in the cache is greater than or equal to the underflow threshold and smaller than the preset first overflow threshold, the cache state is the normal state.
If the number of voice packets in the cache is greater than or equal to the first overflow threshold and smaller than the second overflow threshold, the cache state is the first compact cache state.
If the number of voice packets in the cache is greater than or equal to the second overflow threshold and smaller than the preset third overflow threshold, the cache state is the second compact cache state.
If the number of voice packets in the cache is greater than or equal to the third overflow threshold, the cache state is the third compact cache state.
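The five-way state decision above is a chain of threshold comparisons. A minimal sketch follows; the enum `BufferState`, the function `buffer_state`, and the ordering check on the thresholds are assumptions added for illustration.

```python
from enum import Enum


class BufferState(Enum):
    PACKET_COMPLEMENT = "packet complement"
    NORMAL = "normal"
    FIRST_COMPACT = "first compact"
    SECOND_COMPACT = "second compact"
    THIRD_COMPACT = "third compact"


def buffer_state(count: int, underflow: int, first: int, second: int, third: int) -> BufferState:
    """Map the number of cached voice packets to a cache state."""
    # Assumed sanity check: the thresholds must be ordered for the ranges to make sense.
    if not underflow <= first <= second <= third:
        raise ValueError("thresholds must be non-decreasing")
    if count < underflow:
        return BufferState.PACKET_COMPLEMENT
    if count < first:
        return BufferState.NORMAL
    if count < second:
        return BufferState.FIRST_COMPACT
    if count < third:
        return BufferState.SECOND_COMPACT
    return BufferState.THIRD_COMPACT
```

Each boundary value belongs to the higher state, matching the "greater than or equal to" wording of the thresholds.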
S302, if the cache state is the packet complement state, writing null voice packets (mute packets) into the cache until the number of voice packets in the cache is not smaller than the underflow threshold, so as to avoid underflow of the cache.
S303, if the cache state is the normal state, in response to receiving a voice packet to be processed, writing the voice packet to be processed into the cache.
It can be understood that, when the cache state is the normal state, the cache neither overflows nor underflows. The cache is then in an ideal state, so this embodiment may write the voice packet to be processed into the cache directly.
It should be noted that the method may further use the type of the voice packet to be processed as a tag and write the tag into the cache; for details, refer to the above embodiment.
And S303, if the cache state is the first tightening cache state, responding to the received voice packet to be processed, and filtering the connected mute packet.
Specifically, whether the voice packet to be processed belongs to the connected silence packet is judged.
It should be noted that the continuous silence packet refers to a continuous silence packet composed of a preset number of voice packets, and the to-be-processed voice packet belonging to the continuous silence packet refers to a continuous silence packet composed of the to-be-processed voice packet and at least one voice packet located before the to-be-processed voice packet.
And if the voice packet to be processed belongs to the continuous mute packet, discarding the voice packet to be processed.
It should be noted that, in the first tightening cache state, the number of voice packets in the cache has reached the first overflow threshold, so the delay increases, but only slightly. In this state the method discards only those voice packets to be processed that belong to a run of continuous mute packets, which shortens the delay while preserving voice quality.
S304, if the cache state is the second tightening cache state, in response to receiving the voice packet to be processed, filtering out mute packets and continuous background voice packets.
Specifically, it is determined whether the voice packet to be processed is a mute packet, or whether it belongs to a run of continuous background voice packets. A run of continuous background voice packets is a preset number of consecutive voice packets whose types are all background voice packets. The voice packet to be processed belongs to such a run when the run consists of the voice packet to be processed and at least one voice packet immediately preceding it.
If the voice packet to be processed belongs to a run of continuous background voice packets, or its type is mute packet, the voice packet to be processed is discarded.
It should be noted that, in the second tightening cache state, the number of voice packets in the cache has reached the second overflow threshold, so the delay increases, and the increase is larger than in the first tightening cache state. In this state the method discards mute packets and voice packets to be processed that belong to a run of continuous background voice packets, which shortens the delay while preserving voice quality.
S305, if the cache state is the third tightening cache state, in response to receiving the voice packet to be processed, filtering out mute packets and background voice packets.
Specifically, it is determined whether the type of the voice packet to be processed is either mute packet or background voice packet, and if so, the voice packet to be processed is discarded.
It should be noted that, in the third tightening cache state, the number of voice packets in the cache has reached the third overflow threshold, so the delay increases, and the increase is larger than in the second tightening cache state. In this state the method discards every mute packet and every background voice packet, which shortens the delay while preserving voice quality.
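As an illustrative sketch only, the "continuous" conditions used in the first and second tightening cache states can be checked by looking at the types of the most recently cached packets. The names `PacketType` and `extends_run` and the parameter `run_length` are hypothetical and are introduced here purely for illustration; `run_length` stands in for the preset number of packets, and the three packet types are assumed to be mutually exclusive.

```python
from collections import deque
from enum import Enum


class PacketType(Enum):
    # Hypothetical labels for the three packet types discussed in the text.
    SILENCE = "silence"        # mute packet
    BACKGROUND = "background"  # background voice packet (no voiceprint)
    VOICEPRINT = "voiceprint"  # packet containing a voiceprint


def extends_run(tail, incoming, kind, run_length):
    """Return True if `incoming` is of type `kind` and the last `run_length`
    cached packet types in `tail` are all of type `kind` as well."""
    if incoming is not kind or len(tail) < run_length:
        return False
    recent = list(tail)[len(tail) - run_length:]
    return all(t is kind for t in recent)


# Example: the cache ends with two mute packets and a mute packet arrives.
cached_types = deque([PacketType.VOICEPRINT, PacketType.SILENCE, PacketType.SILENCE])
in_mute_run = extends_run(cached_types, PacketType.SILENCE, PacketType.SILENCE, 2)
```

Keeping only the packet types (rather than the payloads) in `tail` is a design choice of this sketch; an implementation could equally inspect the cache itself.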
In summary, this embodiment divides the tightening cache state into three levels: the first, second, and third tightening cache states. The levels correspond to different degrees of delay increase, and each level applies a different voice packet filtering condition, so that the playback delay is reduced, word loss is avoided, and the quality of the played voice is improved.
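The three-level scheme can be summarized in a short sketch. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: `CacheState`, `classify_cache_state`, and `should_discard` are hypothetical names, the thresholds are assumed to satisfy first <= second <= third, and the two `in_*_run` flags are assumed to have been computed elsewhere from the packets already in the cache.

```python
from enum import Enum


class CacheState(Enum):
    NORMAL = 0
    FIRST_TIGHTENING = 1
    SECOND_TIGHTENING = 2
    THIRD_TIGHTENING = 3


def classify_cache_state(count, first, second, third):
    """Map the number of cached packets to a cache state."""
    if count >= third:
        return CacheState.THIRD_TIGHTENING
    if count >= second:
        return CacheState.SECOND_TIGHTENING
    if count >= first:
        return CacheState.FIRST_TIGHTENING
    return CacheState.NORMAL


def should_discard(state, is_mute, is_background, in_mute_run, in_background_run):
    """Apply the filtering condition of the given state to one incoming packet."""
    if state is CacheState.THIRD_TIGHTENING:
        # Every mute packet and every background voice packet is dropped.
        return is_mute or is_background
    if state is CacheState.SECOND_TIGHTENING:
        # Mute packets, plus background packets that continue a background run.
        return is_mute or (is_background and in_background_run)
    if state is CacheState.FIRST_TIGHTENING:
        # Only mute packets that continue a run of mute packets.
        return is_mute and in_mute_run
    return False
```

Packets containing a voiceprint are never discarded by `should_discard`, which reflects the stated goal of avoiding word loss.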
It should be noted that the voice data processing method provided in this embodiment is not limited to audio playback scenarios and may also be applied in other scenarios. For example, in conference audio mixing, the method may be applied to the cache of each input channel of the mixer, so as to reduce the mixing input delay, avoid word loss, and improve mixing quality. As another example, the method may be applied to a voice gateway that interfaces with a third-party system, so as to reduce the voice delay of the output to the third-party system.
It should be further noted that the voice data processing method provided in the embodiments of the present application is not limited to the flows shown in Fig. 1 and Fig. 2. In one optional scenario, after the voice packet to be processed is discarded, a tag for the discarded voice packet is recorded according to the time of the discarded packet, and the tag does not need to be written into the cache. In another optional scenario, S108 to S113 are optional steps.
In summary, the voice data processing method provided in this embodiment can be generalized as the flow shown in Fig. 4, which may specifically include:
S401, in response to receiving the voice packet to be processed, determining whether the target number is greater than or equal to a preset first overflow threshold.
In this embodiment, the voice packets to be processed are voice packets to be written into the cache, and the target number is the number of the voice packets in the cache.
It should be noted that the target number can be obtained in various ways; for details, refer to the prior art.
S402, if the target number is greater than or equal to the first overflow threshold and a preset first filtering condition is met, discarding the voice packet to be processed.
In this embodiment, the first filtering condition includes: the voice packet to be processed is a mute packet, where a mute packet is a voice packet whose voice energy value is smaller than a first preset energy value. Whether the voice packet to be processed is a mute packet can be determined in various ways; for example, VAD (Voice Activity Detection) technology may be used, for which reference may be made to the prior art.
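The energy comparison that defines a mute packet can be illustrated as follows. This is a simplified sketch and not a full VAD: it assumes that a packet has already been decoded into a list of PCM sample values and that the voice energy value is the mean squared amplitude of the frame. The names `frame_energy`, `is_mute_packet`, and `energy_threshold` are hypothetical, and the text does not specify how the energy value is computed.

```python
def frame_energy(samples):
    """Mean squared amplitude of one frame of PCM samples (assumed definition)."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def is_mute_packet(samples, energy_threshold):
    """A packet counts as mute when its energy is below the preset threshold."""
    return frame_energy(samples) < energy_threshold
```

In this sketch an empty frame has energy 0.0 and is therefore classified as mute for any positive threshold.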
As can be seen from the foregoing technical solution, in the audio data processing method provided in this embodiment of the present application, in response to receiving a voice packet to be processed, it is determined whether a target number is greater than or equal to a preset first overflow threshold, where the voice packet to be processed is a voice packet to be written into a cache and the target number is the number of voice packets in the cache. If the target number is greater than or equal to the first overflow threshold and a preset first filtering condition is met, the voice packet to be processed is discarded. The first filtering condition includes: the voice packet to be processed is a mute packet, that is, a voice packet whose voice energy value is smaller than a preset energy value. Because the voice packet to be processed would otherwise be written into the cache, discarding it when the target number is greater than or equal to the first overflow threshold reduces the delay. Because the discarded packet is a mute packet, whose voice energy value is smaller than the preset energy value, discarding it does not affect voice quality. Compared with discarding voice packets to be processed without regard to their type, this avoids word loss and improves voice quality.
Fig. 5 is a schematic structural diagram of an apparatus for processing audio data according to an embodiment of the present application. As shown in Fig. 5, the apparatus may include:
an overflow determining unit 501, configured to determine, in response to receiving a voice packet to be processed, whether a target number is greater than or equal to a preset first overflow threshold, where the voice packet to be processed is a voice packet to be written into the cache, and the target number is the number of voice packets in the cache;
a filtering unit 502, configured to discard the voice packet to be processed if the target number is greater than or equal to the first overflow threshold and a preset first filtering condition is met, where the first filtering condition includes: the voice packet to be processed is a mute packet, and the mute packet is a voice packet with a voice energy value smaller than a preset energy value.
Optionally, the apparatus further comprises:
a first writing unit, configured to write the voice packet to be processed into the cache if the target number is greater than or equal to the first overflow threshold and the voice packet to be processed is a voiceprint voice packet, where a voiceprint voice packet is a voice packet that contains a voiceprint.
Optionally, the apparatus further comprises:
a second writing unit, configured to write the voice packet to be processed into the cache if the target number is smaller than the first overflow threshold.
Optionally, the apparatus further comprises:
a third writing unit, configured to determine whether the target number is smaller than a preset underflow threshold when a preset monitoring time is reached and no voice packet to be processed has been received;
and, if the target number is smaller than the underflow threshold, to write a null voice packet into the cache, where the null voice packet includes the mute packet.
Optionally, where the filtering unit is configured to discard the voice packet to be processed if the target number is greater than or equal to the first overflow threshold and the preset first filtering condition is met, the filtering unit is specifically configured to:
if the target number is greater than or equal to the first overflow threshold and less than a preset second overflow threshold, discard the voice packet to be processed if the first filtering condition is met;
the first filtering condition further includes: the N consecutive voice packets at the end of the cache sequence are mute packets, where N is a first preset value.
Optionally, the filtering unit is further configured to:
if the target number is greater than or equal to the second overflow threshold and less than a preset third overflow threshold, discard the voice packet to be processed if a preset second filtering condition is met;
the second filtering condition includes: the voice packet to be processed is a background voice packet, and the M consecutive voice packets at the end of the cache sequence are background voice packets, where M is a second preset value and a background voice packet is a voice packet that does not contain a voiceprint.
Optionally, the filtering unit is further configured to:
if the target number is greater than or equal to the third overflow threshold, discard the voice packet to be processed if a preset third filtering condition is met;
the third filtering condition includes: the voice packet to be processed is the background voice packet.
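The cooperation of the overflow determining unit, the filtering unit, and the writing units can be sketched with a single overflow threshold and an underflow threshold. This is a minimal illustration under stated assumptions: the class name `VoiceCache`, its method names, and the string type labels are hypothetical, the multi-level filtering conditions are omitted for brevity, and an empty mute packet is used here as the null voice packet.

```python
from collections import deque

# Hypothetical type labels used only in this sketch.
SILENCE = "silence"
BACKGROUND = "background"
VOICEPRINT = "voiceprint"


class VoiceCache:
    def __init__(self, overflow_threshold, underflow_threshold):
        self.packets = deque()
        self.overflow_threshold = overflow_threshold
        self.underflow_threshold = underflow_threshold

    def on_packet(self, packet_type, payload=b""):
        """Handle a received packet; return True if written, False if discarded."""
        overflow = len(self.packets) >= self.overflow_threshold
        if overflow and packet_type == SILENCE:
            return False  # filtering unit: drop mute packets on overflow
        self.packets.append((packet_type, payload))
        return True

    def on_monitor_timeout(self):
        """Called when the monitoring time elapses with no packet received.
        Returns True if a null (mute) packet was written to the cache."""
        if len(self.packets) < self.underflow_threshold:
            self.packets.append((SILENCE, b""))
            return True
        return False
```

For example, with `VoiceCache(overflow_threshold=2, underflow_threshold=1)`, a timeout on an empty cache writes one null packet, and a mute packet arriving once two packets are cached is discarded while a voiceprint packet is still written.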
Fig. 6 shows a schematic structural diagram of the audio data processing apparatus, which may include: at least one processor 601, at least one communication interface 602, at least one memory 603, and at least one communication bus 604;
in the embodiment of the present application, there is at least one each of the processor 601, the communication interface 602, the memory 603, and the communication bus 604, and the processor 601, the communication interface 602, and the memory 603 communicate with one another through the communication bus 604;
the processor 601 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, or the like;
the memory 603 may include a high-speed RAM, and may further include a non-volatile memory, such as at least one disk memory;
the memory stores a program, and the processor can execute the program stored in the memory to realize the steps of the audio data processing method provided by the embodiment of the application, as follows:
a method of processing audio data, comprising:
in response to receiving a voice packet to be processed, determining whether a target number is greater than or equal to a preset first overflow threshold, where the voice packet to be processed is a voice packet to be written into a cache, and the target number is the number of voice packets in the cache;
if the target number is greater than or equal to the first overflow threshold and a preset first filtering condition is met, discarding the voice packet to be processed, where the first filtering condition includes: the voice packet to be processed is a mute packet, and the mute packet is a voice packet with a voice energy value smaller than a first preset energy value.
Optionally, the method further comprises:
if the target number is greater than or equal to the first overflow threshold and the voice packet to be processed is a voiceprint voice packet, writing the voice packet to be processed into the cache, where a voiceprint voice packet is a voice packet that contains a voiceprint.
Optionally, the method further comprises:
if the target number is smaller than the first overflow threshold, writing the voice packet to be processed into the cache.
Optionally, the method further comprises:
when a preset monitoring time is reached and no voice packet to be processed has been received, determining whether the target number is smaller than a preset underflow threshold;
if the target number is smaller than the underflow threshold, writing a null voice packet into the cache, where the null voice packet includes the mute packet.
Optionally, the discarding the voice packet to be processed if the target number is greater than or equal to the first overflow threshold and a preset first filtering condition is met includes:
if the target number is greater than or equal to the first overflow threshold and less than a preset second overflow threshold, discarding the voice packet to be processed if the first filtering condition is met;
the first filtering condition further includes: the N consecutive voice packets at the end of the cache sequence are mute packets, where N is a first preset value.
Optionally, the method further comprises:
if the target number is greater than or equal to the second overflow threshold and less than a preset third overflow threshold, discarding the voice packet to be processed if a preset second filtering condition is met;
the second filtering condition includes: the voice packet to be processed is a background voice packet, and the M consecutive voice packets at the end of the cache sequence are background voice packets, where M is a second preset value and a background voice packet is a voice packet that does not contain a voiceprint.
Optionally, the method further comprises:
if the target number is greater than or equal to the third overflow threshold, discarding the voice packet to be processed if a preset third filtering condition is met;
the third filtering condition includes: the voice packet to be processed is the background voice packet.
An embodiment of the present application further provides a readable storage medium, where the readable storage medium may store a computer program adapted to be executed by a processor, and when the computer program is executed by the processor, the computer program implements the steps of the audio data processing method provided in the embodiment of the present application. These steps, including all optional steps, are the same as those listed above for the program stored in the memory 603 and executed by the processor 601.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of processing audio data, comprising:
in response to receiving a voice packet to be processed, determining whether a target number is greater than or equal to a preset first overflow threshold, wherein the voice packet to be processed is a voice packet to be written into a cache, and the target number is the number of voice packets in the cache;
if the target number is greater than or equal to the first overflow threshold and a preset first filtering condition is met, discarding the voice packet to be processed, wherein the first filtering condition includes: the voice packet to be processed is a mute packet, and the mute packet is a voice packet with a voice energy value smaller than a first preset energy value.
2. The method of claim 1, further comprising:
if the target number is greater than or equal to the first overflow threshold and the voice packet to be processed is a voiceprint voice packet, writing the voice packet to be processed into the cache, wherein the voiceprint voice packet is a voice packet that contains a voiceprint.
3. The method of claim 1, further comprising:
if the target number is smaller than the first overflow threshold, writing the voice packet to be processed into the cache.
4. The method of claim 1, further comprising:
when a preset monitoring time is reached and no voice packet to be processed has been received, determining whether the target number is smaller than a preset underflow threshold;
if the target number is smaller than the underflow threshold, writing a null voice packet into the cache, wherein the null voice packet includes the mute packet.
5. The method according to claim 1, wherein the discarding the voice packet to be processed if the target number is greater than or equal to the first overflow threshold and a preset first filtering condition is met comprises:
if the target number is greater than or equal to the first overflow threshold and less than a preset second overflow threshold, discarding the voice packet to be processed if the first filtering condition is met;
the first filtering condition further includes: the N consecutive voice packets at the end of the cache sequence are mute packets, wherein N is a first preset value.
6. The method of claim 5, further comprising:
if the target number is greater than or equal to the second overflow threshold and less than a preset third overflow threshold, discarding the voice packet to be processed if a preset second filtering condition is met;
the second filtering condition includes: the voice packet to be processed is a background voice packet, and the M consecutive voice packets at the end of the cache sequence are background voice packets, wherein M is a second preset value and a background voice packet is a voice packet that does not contain a voiceprint.
7. The method of claim 5, further comprising:
if the target number is greater than or equal to the third overflow threshold, discarding the voice packet to be processed if a preset third filtering condition is met;
the third filtering condition includes: the voice packet to be processed is the background voice packet.
8. An apparatus for processing audio data, comprising:
an overflow determining unit, configured to determine, in response to receiving a voice packet to be processed, whether a target number is greater than or equal to a preset first overflow threshold, wherein the voice packet to be processed is a voice packet to be written into a cache, and the target number is the number of voice packets in the cache;
a filtering unit, configured to discard the voice packet to be processed if the target number is greater than or equal to the first overflow threshold and a preset first filtering condition is met, wherein the first filtering condition includes: the voice packet to be processed is a mute packet, and the mute packet is a voice packet with a voice energy value smaller than a preset energy value.
9. An apparatus for processing audio data, comprising: a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the audio data processing method according to any one of claims 1 to 7.
10. A readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the audio data processing method according to any one of claims 1 to 7.
CN202110484184.3A 2021-04-30 2021-04-30 Audio data processing method, device and equipment and readable storage medium Pending CN113178202A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110484184.3A CN113178202A (en) 2021-04-30 2021-04-30 Audio data processing method, device and equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110484184.3A CN113178202A (en) 2021-04-30 2021-04-30 Audio data processing method, device and equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113178202A true CN113178202A (en) 2021-07-27

Family

ID=76925834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110484184.3A Pending CN113178202A (en) 2021-04-30 2021-04-30 Audio data processing method, device and equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113178202A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0281538A (en) * 1988-09-19 1990-03-22 Hitachi Ltd Voice packet processing system
JP2003046490A (en) * 2001-07-30 2003-02-14 Mitsubishi Electric Corp Voice transmission device
CN1463125A (en) * 2002-05-28 2003-12-24 华为技术有限公司 Large capacity realtime stream processing method for removing dithering in using buffer memory
US20080170562A1 (en) * 2007-01-12 2008-07-17 Accton Technology Corporation Method and communication device for improving the performance of a VoIP call
CN102761468A (en) * 2011-04-26 2012-10-31 中兴通讯股份有限公司 Method and system for adaptive adjustment of voice jitter buffer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination