CN110070878B - Decoding method of audio code stream and electronic equipment - Google Patents

Decoding method of audio code stream and electronic equipment

Info

Publication number
CN110070878B
CN110070878B (application CN201910234737.2A)
Authority
CN
China
Prior art keywords
code stream
audio code
conference terminal
decoding
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910234737.2A
Other languages
Chinese (zh)
Other versions
CN110070878A (en)
Inventor
李婧
胡小鹏
顾振华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Keda Technology Co Ltd
Original Assignee
Suzhou Keda Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Keda Technology Co Ltd filed Critical Suzhou Keda Technology Co Ltd
Priority to CN201910234737.2A priority Critical patent/CN110070878B/en
Publication of CN110070878A publication Critical patent/CN110070878A/en
Application granted granted Critical
Publication of CN110070878B publication Critical patent/CN110070878B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/167 Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes

Abstract

The invention relates to the technical field of voice conferences, and in particular to a decoding method of an audio code stream and electronic equipment. The method includes: receiving audio code streams sent by conference terminals; acquiring the mixing identifiers of all conference terminals, where a mixing identifier represents the mixing state of the corresponding conference terminal; determining in turn whether the mixing identifier of each conference terminal indicates that the terminal is not in the mix, and calculating the frame information amount of the audio code streams of the terminals that are not in the mix; screening the audio code streams of the terminals that are not in the mix based on the calculated frame information amount; and decoding the screened audio code streams. Because the decoding method evaluates the information amount of an audio code stream before the audio code streams sent by the conference terminals are decoded, a large number of conference terminals can be screened out, which greatly reduces the load on the decoder and improves its decoding efficiency.

Description

Decoding method of audio code stream and electronic equipment
Technical Field
The invention relates to the technical field of voice conferences, in particular to a decoding method of an audio code stream and electronic equipment.
Background
In recent years, with the development and popularization of internet technology, the demand for voice communication over the internet has kept growing. Voice teleconferencing is now widely used between branch offices, between departments, and for conferences of all sizes within customer organizations. In voice-over-IP (VOIP) technology, the voice signal is encoded and compressed with a compression algorithm, packed according to a network protocol, transmitted over the network and delivered to the receiving end; after receiving the data packets, the receiving end unpacks them according to the network protocol and, after decompression, restores the original voice signal.
To meet the requirements of different voice conferences, voice calls can be divided into distributed and centralized conference modes according to the network protocol used. In the distributed conference mode there is no conference platform or equipment for centralized control and management, and all conference terminals are equal. Conference management and voice data handling are both performed by the conference terminals, which can multicast voice data directly to other conference terminals and receive voice data from any conference terminal. This mode is typically used for audio on demand, one-to-one voice calls, or small sessions of up to about five people. When one-to-many or many-to-many communication is needed, especially a large-party conference with hundreds of simultaneous participants, and the voice requires processing such as format conversion and mixing, the centralized conference mode is generally adopted: based on a conference-platform/conference-terminal architecture, all conference terminals connect to the conference platform, the platform performs the mixing uniformly, and the mixed voice is then distributed to each conference terminal through voice channels in the network, so that multi-party voice communication is realized.
In the centralized conference mode, the conference platform manages the conference through a Multipoint Control Unit (MCU): the MCU decodes the voice data sent by all conference terminals, mixes the valid voice, and then encodes and compresses the result. In this process, either the conference terminal must judge which voice is valid, mark the voice data and send it to the MCU, or the MCU must first decode the voice data and then judge whether it is valid before the subsequent mixing. However, with the former approach the conference terminal side must meet a high data-processing performance requirement; with the latter, the MCU has to decode the voice data before it can identify valid voice, and for a conference with a large number of parties the number of required decoding channels grows accordingly. In that case, because of the limited performance of the conference platform, the condition of one decoder channel per decoding channel often cannot be satisfied, so decoding is generally performed by rotating decoders, which inevitably introduces a large delay and results in low decoding efficiency.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method for decoding an audio code stream, so as to solve the problem in the prior art that either a high processing-performance requirement is imposed on the conference terminal side or the decoding efficiency of the decoder is low.
According to a first aspect, an embodiment of the present invention provides a method for decoding an audio code stream, including:
receiving an audio code stream sent by a conference terminal;
acquiring the sound mixing identifiers of all the conference terminals, wherein the sound mixing identifiers are used for representing the sound mixing state of the conference terminals;
sequentially determining whether the mixing identifier of each conference terminal indicates that the conference terminal is not in the mix, and calculating the frame information amount of the audio code stream of each conference terminal that is not in the mix;
screening the audio code stream of the conference terminal which is not in the audio mixing based on the frame information amount obtained by calculation;
and decoding the screened audio code stream.
According to the audio code stream decoding method provided by the embodiment of the invention, before the audio code streams sent by the conference terminals are decoded, the frame information amount is calculated for the audio code streams generated by the conference terminals that are not in the mix, so that a large number of conference terminals can be screened out, which greatly reduces the load on the decoder and improves its decoding efficiency. At the same time, because the screening of conference terminals is completed on the conference platform side and is independent of the conference terminals, the method also lowers the requirements on the conference terminals and is compatible with conference terminals of any form.
With reference to the first aspect, in a first implementation manner of the first aspect, screening the audio code streams of the conference terminals that are not in the mix based on the calculated frame information amount includes:
judging whether the frame information quantity is smaller than a preset value or not;
and when the frame information quantity is smaller than a preset value, discarding the audio code stream corresponding to the frame information quantity.
According to the audio code stream decoding method provided by the embodiment of the invention, the number of conference terminals speaking at any moment in an audio conference is not large, because too many terminals speaking at the same time would degrade the conference experience. Therefore, in an audio conference most of the audio code streams sent by the conference terminals are silence packets, and the frame information amount of a silence packet is smaller than the preset value; discarding the audio code streams whose frame information amount is smaller than the preset value is therefore equivalent to discarding the silence packets, which relieves the load on the decoder before decoding.
With reference to the first aspect, in a second implementation manner of the first aspect, the decoding the filtered audio code stream includes:
creating a hierarchical decoder; wherein the hierarchical decoder includes a primary decoder and a secondary decoder, the primary decoder having a different core than the secondary decoder, the primary decoder having a same core as the mixer.
According to the audio code stream decoding method provided by the embodiment of the invention, a hierarchical decoder is created, and the primary decoder and the mixer run on the same core, so the audio code stream can be mixed directly after being decoded without inter-core communication, which improves mixing efficiency and reduces the use of transmission resources. In particular, the decoder hierarchy management mechanism is designed to accommodate the various types of conferences that may be held on the conference platform. For example: small-party conferences have few conference terminals each but there are many such conferences, so the resources on one core can serve an entire conference and the cores do not interfere with each other; large-party conferences have many conference terminals but there are few such conferences, and only one conference is held on one DSP, so apart from the core assigned to the mixer, the decoders on the other cores are called upon as the conference requires, which guarantees that decoder resources are sufficient.
With reference to the second implementation manner of the first aspect, in a third implementation manner of the first aspect, the decoding the filtered audio code stream includes:
judging whether the conference terminal corresponding to the screened audio code stream is a newly added conference terminal;
and when the conference terminal corresponding to the screened audio code stream is a newly added conference terminal, transmitting the audio code stream to a corresponding decoder for decoding according to the decoding resource of the decoder.
With reference to the third implementation manner of the first aspect, in the fourth implementation manner of the first aspect, the decoding the filtered audio code stream further includes:
when the conference terminal is a joined conference terminal, extracting the activity of the conference terminal;
and transmitting the audio code stream to a corresponding decoder for decoding according to the liveness and the decoding resources.
According to the decoding method of the audio code stream provided by the embodiment of the invention, the decoding resources corresponding to different conference terminals are distinguished by using the liveness, so that the audio code stream generated by the conference terminal with high liveness can be timely decoded, and the decoding efficiency is improved.
With reference to the fourth implementation manner of the first aspect, in the fifth implementation manner of the first aspect, the transmitting the audio code stream to a corresponding decoder for decoding according to the liveness and the decoding resources includes:
judging whether the activity degree is greater than or equal to a preset activity degree or not;
when the activity is greater than or equal to the preset activity, transmitting the audio code stream to the primary decoder;
when the activity is smaller than the preset activity, judging whether the primary decoder is idle;
when the primary decoder is idle, transmitting the audio code stream to the primary decoder;
and when the decoding resources of the primary decoder are not idle, transmitting the audio code stream to the secondary decoder.
In the audio code stream decoding method provided by the embodiment of the invention, the audio code streams generated by high-activity conference terminals are transmitted to the primary decoder for decoding, while the audio code streams generated by low-activity conference terminals are transmitted to the primary decoder only when it is idle and otherwise to the secondary decoder; this ensures that all audio code streams are decoded in time and improves decoding efficiency.
With reference to the first aspect or any one of the first to fifth embodiments of the first aspect, in a sixth embodiment of the first aspect, the method further includes:
judging whether the decoded audio code stream comprises effective sound;
and when the decoded audio code stream comprises the effective sound, the audio mixer is utilized to mix the decoded audio code stream, and the activity of the conference terminal is updated.
With reference to the sixth embodiment of the first aspect, in the seventh embodiment of the first aspect, the method further includes:
when the decoded audio code stream has no effective sound, judging whether the conference terminal participates in sound mixing;
when the conference terminal participates in mixing, calculating the time interval between the last mixing and the current time, and judging whether the mixing has timed out based on the time interval;
and when the time interval is within a preset time range, mixing the decoded audio code stream by using the mixer, and updating the activity of the conference terminal.
The decoding method of the audio code stream provided by the embodiment of the invention adopts a 'fast-in, slow-out' mechanism to ensure the continuity of speech: because of the natural pauses that occur while a person is speaking, valid voice data contains runs of consecutive silence packets, so a timeout threshold is specified to avoid channels frequently entering and leaving the mix.
According to a third aspect, an embodiment of the present invention further provides an electronic device, including:
the electronic device includes a memory and a processor that are communicatively connected with each other, where the memory stores computer instructions and the processor executes the computer instructions to perform the decoding method of the audio code stream according to the first aspect of the present invention or any implementation manner of the first aspect.
According to a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are configured to cause the computer to execute the method for decoding an audio code stream according to the first aspect of the present invention or any implementation manner of the first aspect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a decoding method of an audio code stream according to an embodiment of the present invention;
fig. 2 is a flowchart of a decoding method of an audio code stream according to an embodiment of the present invention;
fig. 3 is a flowchart of a decoding method of an audio code stream according to an embodiment of the present invention;
FIG. 4 is a partial flow diagram of a method for decoding an audio bitstream according to an embodiment of the present invention;
fig. 5 is a flowchart of a decoding method of an audio code stream according to an embodiment of the present invention;
fig. 6 is an application scenario of a decoding method of an audio code stream according to an embodiment of the present invention;
fig. 7 is a flowchart of a decoding method of an audio code stream according to an embodiment of the present invention;
fig. 8 is a flowchart of a decoding method of an audio code stream according to an embodiment of the present invention;
fig. 9 is a flowchart of a decoding method of an audio code stream according to an embodiment of the present invention;
fig. 10 is a block diagram of a structure of an apparatus for decoding an audio code stream according to an embodiment of the present invention;
fig. 11 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In accordance with an embodiment of the present invention, an embodiment of a method for decoding an audio code stream is provided. It is noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from the one here.
The embodiment provides a decoding method of an audio code stream, which can be used for electronic equipment, wherein the electronic equipment can be understood as a conference platform in a conference system. Fig. 1 is a flowchart of a decoding method of an audio code stream according to an embodiment of the present invention, as shown in fig. 1, the flowchart includes the following steps:
and S11, receiving the audio code stream sent by the conference terminal.
Before the conference starts, each conference terminal connects to the electronic equipment, and the electronic equipment numbers all participating conference terminals, assigning each one a unique channel number. When a conference terminal subsequently sends an audio code stream to the electronic equipment, the stream is sent based on the allocated channel number. The electronic equipment receives multiple audio code streams at the same time, coming from multiple conference terminals.
The channel numbers correspond to the conference terminals one to one, so the conference terminal corresponding to an audio code stream can be identified by its channel number during decoding, mixing and encoding.
It should be noted that, in the embodiment of the present invention, besides distinguishing conference terminals by channel number, other identification numbers may also be used, for example the device number of a terminal. The channel number is used as the example in the description below.
And S12, acquiring the mixing identifiers of all the conference terminals.
The mixing identifier is used to represent the mixing state of the conference terminal. Specifically, the electronic equipment stores the mixing state, address, terminal number and so on of the conference terminal corresponding to each channel number. The mixing state is represented by the mixing identifier; for example, when the conference terminal is in the mix, the mixing identifier is 1, and when the conference terminal is not in the mix, the mixing identifier is 0. Therefore, the electronic equipment can determine from the channel number whether a conference terminal is in the mix.
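A minimal sketch (not taken from the patent) of the per-channel bookkeeping described above, in C++; the struct and field names are illustrative assumptions, since the text only requires that the mixing state, address and terminal number be stored per channel number.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical per-channel record kept by the conference platform.
struct ChannelState {
    std::string address;     // network address of the conference terminal
    uint32_t    terminalId;  // terminal (device) number
    bool        inMix;       // mixing identifier: true = in the mix (1), false = not (0)
    int         activity;    // activity value, used later for decoder selection
};

// Channel number -> state, so the mixing state can be looked up
// from the channel number alone.
using ChannelTable = std::unordered_map<uint32_t, ChannelState>;
```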
S13, determining whether the mixing mark of each conference terminal indicates that the conference terminal is not in the mixing, and calculating the frame information amount of the audio code stream of the conference terminal which is not in the mixing.
The audio code stream is mixed and re-encoded after being decoded, and then sent to the conference terminals. The electronic equipment can therefore store, for each conference terminal, whether the audio code stream it generates participates in the mix; since conference terminals are distinguished by channel number, the electronic equipment can judge whether a conference terminal is in the mix from the mixing identifier in the storage unit corresponding to its channel number.
When the electronic equipment determines that the mixing identifier of a conference terminal indicates that the terminal is not in the mix, it calculates the information amount of that terminal's audio code stream. Specifically, the frame information amount of the audio code stream is calculated, that is, its entropy.
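The patent calls the quantity computed here the entropy of the frame but does not give a formula; the sketch below is one plausible reading, computing the Shannon entropy of the byte-value histogram of a single encoded frame. The function name and the use of bytes as symbols are assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Estimate the frame information amount as the Shannon entropy (bits per
// byte) of the frame's byte-value distribution. A silence packet encodes
// very regular data and therefore yields a low value.
double frameInformationAmount(const std::vector<uint8_t>& frame) {
    if (frame.empty()) return 0.0;
    std::size_t hist[256] = {0};
    for (uint8_t b : frame) ++hist[b];
    double entropy = 0.0;
    for (std::size_t count : hist) {
        if (count == 0) continue;
        const double p = static_cast<double>(count) / frame.size();
        entropy -= p * std::log2(p);   // contribution of this byte value
    }
    return entropy;   // 0 (perfectly regular) .. 8 (maximally random)
}
```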
And S14, screening the audio code stream of the conference terminal which is not in the mixed sound based on the frame information amount obtained by calculation.
After the frame information amount of an audio code stream is calculated, the electronic equipment uses it to screen the audio code streams of the conference terminals that are not in the mix; for example, each audio code stream may be screened by comparing its frame information amount with a preset value, or in other ways. The screening manner is not limited here, as long as the electronic equipment can use the frame information amount to screen out the useless audio code streams from all the audio code streams of the conference terminals that are not in the mix. A useless audio code stream may be a silence packet.
And S15, decoding the screened audio code stream.
After screening the audio code streams, the electronic equipment obtains the usable audio code streams and decodes them. The specific decoding manner is not limited here, as long as decoding can be achieved; the detailed decoding steps are described below.
According to the decoding method of the audio code stream provided by this embodiment, the information amount of an audio code stream is calculated before the audio code streams sent by the conference terminals are decoded, so a large number of conference terminals can be screened out, which greatly reduces the load on the decoder and improves its decoding efficiency. At the same time, because the screening of conference terminals is completed on the conference platform side and is independent of the conference terminals, the method also lowers the requirements on the conference terminals and is compatible with conference terminals of any form.
As an optional implementation of this embodiment, when the electronic equipment determines that the conference terminal corresponding to a channel number is in the mix, it sends the received audio code stream directly to the decoder after receiving it. For a conference terminal that is in the mix, many packets are silent because of the natural pauses in human speech, but these packets cannot be skipped, otherwise discontinuous speech would be heard; therefore, no information-amount evaluation is performed on the audio code streams generated by conference terminals that are in the mix.
Alternatively, two processors may be included in the electronic device, wherein one processor is configured to perform S11-S14 and the other processor is configured to perform S15. For example, the processor executing S11-S14 is a control chip of the conference platform, and the processor executing S15 is a multi-core processor of the conference platform.
The embodiment provides a decoding method of an audio code stream, which can be used in electronic equipment. Fig. 2 is a flowchart of a decoding method of an audio code stream according to an embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
and S21, receiving the audio code stream sent by the conference terminal.
Please refer to S11 in fig. 1, which is not described herein again.
And S22, acquiring the mixing identifiers of all the conference terminals.
The mixing mark is used for representing the mixing state of the conference terminal. Please refer to S12 in fig. 1, which is not described herein again.
S23, determining whether the mixing mark of each conference terminal indicates that the conference terminal is not in the mixing, and calculating the frame information amount of the audio code stream of the conference terminal which is not in the mixing.
Please refer to S13 in fig. 1, which is not described herein again.
And S24, screening the audio code stream of the conference terminal which is not in the mixed sound based on the frame information amount obtained by calculation.
After the electronic equipment has calculated the frame information amount of the audio code stream corresponding to each conference terminal, it determines whether the audio code stream is useful by threshold comparison. Specifically, the method includes the following steps:
and S241, judging whether the frame information amount is less than a preset value.
The electronic equipment judges whether the frame information amount of the audio code stream is smaller than a preset value, where the preset value represents the information amount of a silence packet. If the frame information amount is smaller than the preset value, the corresponding audio code stream is a silence packet and S242 is performed; otherwise, the corresponding audio code stream may go on to be decoded, and S25 is executed.
And S242, discarding the audio code stream corresponding to the frame information quantity.
And when the electronic equipment determines that the frame information quantity of a certain audio code stream is less than a preset value, discarding the audio code stream so as to reduce the subsequent decoding pressure.
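A sketch of the threshold comparison in S241/S242, reusing the frameInformationAmount() helper sketched earlier; the threshold value is purely illustrative, since the patent leaves the preset value open.

```cpp
#include <cstdint>
#include <vector>

double frameInformationAmount(const std::vector<uint8_t>& frame);  // sketched above

// Assumed preset value below which a frame is treated as a silence packet.
constexpr double kSilenceThreshold = 2.0;  // illustrative only

// Returns true if the frame passes screening (S25), false if it is
// discarded as a silence packet (S242).
bool passesScreening(const std::vector<uint8_t>& frame) {
    return frameInformationAmount(frame) >= kSilenceThreshold;
}
```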
And S25, decoding the screened audio code stream.
The electronic equipment sends the screened audio code streams to the decoder for decoding. Optionally, the decoder used for decoding is created before the conference starts. Because different conference types place different requirements on the decoder, the electronic equipment creates the decoder in real time according to the acquired conference type, so the requirements of different conference types can be met. For example, when the conference type is a small-party conference, several decoders may be created, all of them peers, and they are allocated directly based on decoding resources during decoding; when the conference type is a large-party conference, a hierarchical decoder may be created.
Further, in this embodiment the electronic equipment includes two processors: a first processor configured to execute S21-S24 and a second processor configured to execute S25. Because the audio code streams generated by the conference terminals are screened before decoding, on the one hand the decoding load is reduced, and on the other hand frequent communication between the first processor and the second processor is avoided, relieving the signaling load between them.
Further optionally, the first processor may be a control chip, configured to execute S21-S24 and to receive the decoded, mixed and encoded audio code streams and send them to each conference terminal; the second processor may be a multi-core processor, configured to execute S25, with decoders running on its cores. How the second processor decodes the screened audio code streams is described in detail below.
Compared with the embodiment shown in fig. 1, in the decoding method of the audio code stream provided by this embodiment, the number of conference terminals speaking at any moment in an audio conference is not large, because too many terminals speaking at the same time would degrade the conference experience; therefore, in an audio conference most of the audio code streams sent by the conference terminals are silence packets, and the information amount of a silence packet is smaller than the preset value, so discarding the audio code streams whose frame information amount is smaller than the preset value is equivalent to discarding the silence packets, which relieves the load on the decoder before decoding.
The embodiment provides a decoding method of an audio code stream, which can be used for electronic equipment, wherein the electronic equipment can be understood as a conference platform in a conference system. Fig. 3 is a flowchart of a decoding method of an audio code stream according to an embodiment of the present invention, and as shown in fig. 3, the flowchart includes the following steps:
s31, a hierarchical decoder is created.
The hierarchical decoder comprises a primary decoder and a secondary decoder, wherein the primary decoder and the secondary decoder have different cores, and the primary decoder and the mixer have the same core.
Furthermore, the hierarchical decoder runs on a multi-core processor, which may be a multi-core DSP or another type of multi-core processor; the type of multi-core processor is not limited. Taking a multi-core DSP as an example, when the multi-core DSP creates decoders based on the conference type, they are created hierarchically.
Core 0 of the multi-core DSP serves as the control core, responsible for data transmission and reception and message processing; the remaining 7 cores are resource cores, running the mixer, decoders and encoders. Because the resources on each core are limited, the number of decoder instances on each core is limited, and core 0 is responsible for using and allocating the decoder resources reasonably, which greatly improves on-core resource utilization. The primary decoders, the mixer and the encoder run on the same core, and the secondary decoders run on the other cores.
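A sketch of how the hierarchical decoder layout on the 8-core DSP described above could be represented. The core indices, the 20-decoder figure and the two secondary cores follow the embodiment's numbers; the data structures themselves are assumptions.

```cpp
#include <cstdint>
#include <initializer_list>
#include <vector>

// One decoder instance pinned to a DSP core.
struct DecoderSlot {
    uint8_t  core;       // resource core this decoder runs on
    bool     busy;       // currently decoding a channel?
    uint32_t channelNo;  // channel being decoded (valid only when busy)
};

// Hierarchical decoder: primary decoders share the mixer's core,
// secondary decoders live on other resource cores; core 0 is the
// control core and holds no decoders.
struct HierarchicalDecoder {
    uint8_t mixerCore;
    std::vector<DecoderSlot> primary;
    std::vector<DecoderSlot> secondary;
};

HierarchicalDecoder createHierarchicalDecoder() {
    HierarchicalDecoder h;
    h.mixerCore = 1;                                   // mixer, encoder, primary decoders
    for (int i = 0; i < 20; ++i)                       // 20 primary decoders (embodiment)
        h.primary.push_back({h.mixerCore, false, 0});
    for (uint8_t core : {uint8_t{2}, uint8_t{3}})      // secondary decoders on two other cores
        for (int i = 0; i < 20; ++i)
            h.secondary.push_back({core, false, 0});
    return h;
}
```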
And S32, receiving the audio code stream sent by the conference terminal.
Please refer to S22 in fig. 2 for details, which are not described herein.
And S33, acquiring the mixing identifiers of all the conference terminals.
The mixing mark is used for representing the mixing state of the conference terminal. Please refer to S23 in fig. 2 for details, which are not described herein.
S34, determining whether the mixing mark of each conference terminal indicates that the conference terminal is not in the mixing, and calculating the frame information amount of the audio code stream of the conference terminal which is not in the mixing.
Please refer to S23 in fig. 2 for details, which are not described herein.
And S35, screening the audio code stream of the conference terminal which is not in the mixed sound based on the frame information amount obtained by calculation.
Please refer to S24 in fig. 2 for details, which are not described herein.
And S36, decoding the screened audio code stream.
When decoding the screened audio code stream, the electronic equipment distinguishes whether the corresponding conference terminal is a newly added conference terminal or one that has already joined, and decodes accordingly. Specifically, the method includes the following steps:
and S361, judging whether the conference terminal corresponding to the screened audio code stream is a newly added conference terminal.
Because the screened audio code streams correspond to the channel numbers one to one, the multi-core processor can determine from the received audio code streams which channel numbers correspond to conference terminals that have already joined the conference, and it stores those channel numbers. When an audio code stream is received, its channel number is used to determine whether the corresponding conference terminal is a newly added conference terminal. When the conference terminal is newly added, S362 is performed; otherwise, S363 is executed.
And S362, transmitting the audio code stream to a corresponding decoder for decoding according to the decoding resources of the decoder.
When the multi-core processor determines that a conference terminal is newly added, it may first check whether a primary decoder on the same core as the mixer is idle and, when one is idle, transmit the audio code stream to that primary decoder; alternatively, it may examine the decoding resources of all decoders and transmit the audio code stream to an idle decoder, and so on.
And S363, extracting the activity of the conference terminal.
When the multi-core processor determines that a conference terminal has already joined, it has already received audio code streams from that terminal, and it uses a variable to record each conference terminal's activity. For example, the initial value of the variable is 1: when the multi-core processor receives an audio code stream sent by the conference terminal corresponding to channel number 1, the activity corresponding to channel number 1 is 1. If the audio code stream of channel number 1 then participates in the subsequent mixing, the activity is increased accordingly; if it does not participate in the subsequent mixing, the activity is decreased accordingly.
The multi-core processor distinguishes each conference terminal by adopting the channel number, so that the activity of the corresponding conference terminal can be extracted by utilizing the channel number.
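A small sketch of the activity bookkeeping just described, following the example in the text (initial value 1, increased when the channel's audio participates in the mix, decreased when it does not). The unit step and the lower bound of 0 are assumptions.

```cpp
// Illustrative activity update, indexed by channel number elsewhere.
void updateActivity(int& activity, bool participatedInMix) {
    if (participatedInMix) {
        ++activity;                 // terminal keeps speaking: more active
    } else if (activity > 0) {
        --activity;                 // not mixed this round: less active, floor at 0
    }
}
```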
And S364, transmitting the audio code stream to a corresponding decoder for decoding according to the liveness and the decoding resources.
The multi-core processor may transmit the audio code streams generated by high-activity conference terminals directly to the primary decoder and those generated by low-activity conference terminals to the secondary decoders, distributing the streams to idle secondary decoders based on each secondary decoder's decoding resources. Alternatively, when allocating the audio code stream of a low-activity conference terminal, it may first check whether the primary decoder is idle and, if so, transmit the stream to the primary decoder. Each time the multi-core processor receives an audio code stream, it judges the activity of the corresponding conference terminal: if the activity is low, the stream may be transmitted to a secondary decoder; if the activity is high, the stream may be transmitted to the primary decoder. For example, when the previous audio code stream of channel number 1 was received, its activity was 6 and the stream was transmitted to the primary decoder; when the current audio code stream is received, the activity is 2, so the stream may be transmitted to a secondary decoder. Correspondingly, a conference terminal whose previous stream went to a secondary decoder may have its current stream transmitted to the primary decoder. That is, whether an audio code stream is transmitted to the primary decoder or to a secondary decoder is determined by the activity of the conference terminal and may change in real time.
The decoding method of the audio code stream provided by this embodiment uses the activity to differentiate the decoding resources assigned to different conference terminals, which ensures that audio code streams generated by high-activity conference terminals are decoded in time and improves decoding efficiency.
As an alternative implementation manner of this embodiment, as shown in fig. 4, S364 includes the following steps:
and S4.1, judging whether the activity is greater than or equal to the preset activity.
And the multi-core processor judges whether the activity of each conference terminal is greater than or equal to a preset activity by using the channel number, wherein the preset activity can be specifically set according to actual conditions. When the activity degree is greater than or equal to the preset activity degree, executing S4.2; otherwise, S4.3 is performed.
And S4.2, transmitting the audio code stream to a primary decoder.
When the activity is greater than or equal to the preset activity, core 0 of the multi-core processor transmits the audio code stream to a primary decoder for decoding.
And S4.3, judging whether the primary decoder is idle or not.
And when the activity is less than the preset activity, the multi-core processor judges whether the primary decoder is idle or not. When the primary decoder is idle, S4.2 is executed; otherwise, S4.4 is performed.
And S4.4, transmitting the audio code stream to a secondary decoder.
In the decoding method of the audio code stream provided by this embodiment, the audio code streams generated by high-activity conference terminals are transmitted to the primary decoder for decoding, while the audio code streams generated by low-activity conference terminals are transmitted to the primary decoder when it is idle and otherwise to the secondary decoder; this ensures that all audio code streams are decoded in time and improves decoding efficiency.
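A sketch of the routing decision of steps S4.1 to S4.4; the preset activity value and the enum/function names are assumptions for illustration.

```cpp
enum class DecoderTier { Primary, Secondary };

constexpr int kPresetActivity = 3;   // assumed preset activity threshold

// Follows S4.1-S4.4: a high-activity channel always goes to the primary
// decoder; a low-activity channel goes there only when it is idle,
// otherwise it is handed to a secondary decoder.
DecoderTier routeAudioStream(int activity, bool primaryIdle) {
    if (activity >= kPresetActivity) return DecoderTier::Primary;   // S4.2
    if (primaryIdle)                 return DecoderTier::Primary;   // S4.3 -> S4.2
    return DecoderTier::Secondary;                                  // S4.4
}
```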
The embodiment provides a decoding method of an audio code stream, which can be used for electronic equipment. Fig. 5 is a flowchart of a decoding method of an audio code stream according to an embodiment of the present invention, and as shown in fig. 5, the flowchart includes the following steps:
s51, a hierarchical decoder is created.
The hierarchical decoder comprises a primary decoder and a secondary decoder, wherein the primary decoder and the secondary decoder have different cores, and the primary decoder and the mixer have the same core.
Please refer to S31 in fig. 3 for details, which are not described herein.
And S52, receiving the audio code stream sent by the conference terminal.
Please refer to S32 in fig. 3 for details, which are not described herein.
And S53, acquiring the mixing identifiers of all the conference terminals.
The mixing mark is used for representing the mixing state of the conference terminal. Please refer to S33 in fig. 3 for details, which are not described herein.
S54, determining whether the mixing mark of each conference terminal indicates that the conference terminal is not in the mixing, and calculating the frame information amount of the audio code stream of the conference terminal which is not in the mixing.
Please refer to S34 in fig. 3 for details, which are not described herein.
And S55, screening the audio code stream of the conference terminal which is not in the mixed sound based on the frame information amount obtained by calculation.
Please refer to S35 in fig. 3 for details, which are not described herein.
And S56, decoding the screened audio code stream.
Please refer to S36 in fig. 3 for details, which are not described herein.
And S57, judging whether the decoded audio code stream comprises effective sound.
After decoding, voice detection and speech-energy calculation are performed on the audio code stream. A stream decoded by the primary decoder and judged to contain voice can participate in the mix directly; a stream decoded by a secondary decoder must first be moved to the primary decoder and judged to contain voice again before it is allowed to participate in the mix. The multi-core processor determines whether the decoded audio code stream includes valid sound, for example by combining two criteria, Voice Activity Detection (VAD) and the speech energy value, to prevent misjudgment caused by a speaker whose volume is too low.
Specifically, the multi-core processor performs noise reduction on the decoded audio code stream, then calculates a VAD value and a voice energy value, and can judge whether effective sound is included by using a voice detection function. When the decoded audio code stream includes valid sound, executing S60; otherwise, S58 is executed.
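A sketch of the valid-sound check of S57 on decoded 16-bit PCM samples. The patent names the two criteria (a VAD decision and the speech energy value) but not how they are combined or thresholded, so the AND combination, the energy measure and the threshold below are assumptions.

```cpp
#include <cstdint>
#include <vector>

// Decide whether a decoded frame contains valid sound by requiring both
// a positive VAD decision and sufficient mean speech energy.
bool hasValidSound(const std::vector<int16_t>& pcm, bool vadSaysSpeech) {
    if (pcm.empty() || !vadSaysSpeech) return false;
    double energy = 0.0;
    for (int16_t s : pcm) energy += static_cast<double>(s) * s;
    energy /= pcm.size();                       // mean squared amplitude
    constexpr double kEnergyThreshold = 1.0e4;  // illustrative value
    return energy > kEnergyThreshold;
}
```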
And S58, judging whether the conference terminal participates in the sound mixing.
When the multi-core processor judges that the decoded audio code stream does not include effective sound, whether the conference terminal participates in sound mixing is judged, and when the conference terminal participates in the sound mixing, S59 is executed; otherwise, the audio code stream can be directly discarded.
And S59, calculating the time interval between the last mixing and the current time, and judging the mixing timeout based on the time interval.
When the multi-core processor judges that the conference terminal participates in the mix, it calculates the time interval between the last mixing and the current time and uses this interval to judge whether the mixing has timed out. When the time interval is within the preset range, the mixing has not timed out and S60 is executed; otherwise, the mixing has timed out and the audio code stream is discarded directly.
And S60, mixing the decoded audio code stream by using the mixer, and updating the activity of the conference terminal.
And when the multi-core processor performs sound mixing on the decoded audio code stream, updating the activity of the conference terminal corresponding to the channel number carried in the audio code stream, namely increasing the activity.
The reason for checking the time interval since the last mixing is that a person naturally pauses while speaking, so many packets are silence packets, but these packets cannot simply be skipped, otherwise discontinuous speech would be heard. The mixing resources use a timeout mechanism: a channel that is in the mix but has produced no sound for a long time is judged to have exited the mix, so that its mixing resources are released.
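A sketch of the timeout check behind S58-S60 ('fast-in, slow-out'): a silent frame from a channel already in the mix is still mixed while the gap since its last mixed frame stays inside the timeout window. The timeout length is an assumed value.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

constexpr std::chrono::milliseconds kMixTimeout{600};  // assumed timeout

// True while the channel should stay in the mix despite a silent frame;
// once the interval exceeds the timeout the channel drops out and its
// mixing resources can be released.
bool keepInMix(Clock::time_point lastMixed, Clock::time_point now) {
    return (now - lastMixed) <= kMixTimeout;
}
```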
The decoding method of the audio code stream provided by this embodiment adopts a 'fast-in, slow-out' mechanism that ensures the continuity of speech: because of the natural pauses that occur while a person is speaking, valid voice data contains runs of consecutive silence packets, so a timeout threshold is specified to avoid the audible discontinuity caused by channels frequently entering and leaving the mix.
As an optional implementation of this embodiment, before an audio code stream is mixed it may be determined whether the current mixing count has reached the preset mixing depth, that is, whether the mix is full. If the mix is not full, the stream is added to the mix directly; if the mixing depth is already full, there is no need to add it. This is because, when a conference is in open discussion, generally no more than 4 people speak at the same time, and if more than 4 people speak at once nothing intelligible can be heard; therefore a mixing depth (generally set to 6) is defined to indicate the number of conference terminals that may speak, i.e. be mixed, at the same time.
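A minimal sketch of the mixing-depth gate described above; the depth of 6 is the value given in the text, while the function name is an assumption.

```cpp
// Mixing depth: the maximum number of conference terminals mixed at once.
constexpr int kMixDepth = 6;

// A channel with valid sound is only added to the mix while the mix
// is not yet full.
bool canJoinMix(int currentlyMixedCount) {
    return currentlyMixedCount < kMixDepth;
}
```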
As a specific implementation manner of this embodiment, an application scenario of this embodiment is shown in fig. 6: a control chip 111 and a multi-core DSP112 are arranged in a conference central point (i.e., a conference platform or an electronic device), wherein the control chip 111 is configured to receive an audio code stream sent by a conference terminal, screen the audio code stream, and send the encoded audio code stream to each conference terminal; and the multi-core DSP112 decodes, mixes and codes the screened audio code stream. As shown in fig. 7, specifically includes the following steps:
step S1, the audio code stream sent to the conference platform by the conference terminal is screened by the control chip of the conference platform, and whether packet loss, data integrity, voice signal change, etc. occur is determined. When the audio code stream generated by the conference terminal does not participate in the audio mixing, the frame information amount of the audio code stream needs to be calculated, and then the usable real voice signal is sent to the DSP side.
In a large-number conference, all participating terminals are numbered, the newly added channel refers to the channel which just obtains the number, and the newly added channel does not have corresponding decoding resources on the DSP side, so that the decoding resources are allocated according to the received code stream condition. For a conference with a large number of parties, even if the discussion is started, people who speak at the same time generally do not exceed 6, so that code streams sent by a large number of conference terminals are all silent packets.
Step S2: in the multi-core DSP, a control core (core 0) is set aside for distributing data and messages; the primary decoders decode the audio data streams sent by high-activity conference terminals and sit on the same core as the mixer; the secondary decoders decode the audio data streams sent by low-activity conference terminals and may be distributed over one or two cores different from the mixer's core; the encoders encode the mixed audio data, encoding at least N+1 audio code streams (N is the mixing depth, with a value range of 3-6), and are created on the same core as the mixer.
Referring to fig. 8, Core0 on the DSP side receives the audio data from the control chip, first determines whether the conference terminal is newly added and whether it has high activity, then selects a primary or secondary decoder according to the usage of decoder resources, and decides whether the audio data is transmitted to Core1 or Core2. Specifically, Core0 follows this data-allocation principle: the decoders on the same core as the assigned mixer serve as that mixer's primary decoders, and the idle decoders on the other cores all serve as its secondary decoders. This not only avoids frequent inter-core communication when small-party conferences are held, but also ensures that sufficient decoder resources are available when a large-party conference is started.
Referring to fig. 9, in steps S3 to S4 the audio data transmitted into Core1 is first decoded and denoised, then the VAD and speech energy values are calculated, and a speech detection function is used to determine whether the frame is voiced or silent. If it is voiced, it can enter the mix directly. If it is silent, it is checked whether the channel was previously qualified for mixing (that is, whether it participated in the mix before); if so, a timeout check is made (whether multiple consecutive frames have not participated in the mix), and the mixing qualification is cancelled once the timeout is exceeded. If the channel has no mixing qualification, it is determined whether its activity has dropped to 0 (that is, the conference terminal has not spoken for a long time); for a conference terminal with activity 0, Core0 is notified so that the terminal is moved to a secondary decoder.
In steps S5 to S6 the audio data transmitted into Core2 is likewise first decoded and denoised, then the VAD and speech energy values are calculated, and a speech detection function is used to determine whether the frame is voiced or silent. If it is voiced, Core0 is notified so that the terminal is moved to a primary decoder; if it is silent, the data is not kept. This promotion and demotion step is specific to large-party conferences: because the decoding resources on one core are limited and one DSP has several cores, the decoding of inactive terminals is placed on other cores, which relieves the decoding pressure while improving the utilization of idle resources.
Step S7: the data mixed by the mixer is sent to the encoder group, which encodes N+1 audio code streams (N is the mixing depth, with a value range of 3-6).
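A sketch of the N+1 output pass implied by step S7: one full mix for the listening terminals plus, for each of the N speakers, a mix that leaves out that speaker's own voice. Deriving the per-speaker mixes by subtracting each speaker's frame from the full sum is an assumption; a real implementation would also clip or normalize the result.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// speakerFrames: decoded frames of the N terminals currently in the mix,
// all of equal length. Returns N+1 mixed buffers: [0] is the full mix,
// [1..N] exclude speaker 0..N-1 respectively.
std::vector<std::vector<int32_t>> buildMixOutputs(
        const std::vector<std::vector<int16_t>>& speakerFrames) {
    const std::size_t n = speakerFrames.size();
    const std::size_t len = n ? speakerFrames[0].size() : 0;

    std::vector<int32_t> fullMix(len, 0);
    for (const auto& frame : speakerFrames)
        for (std::size_t i = 0; i < len; ++i) fullMix[i] += frame[i];

    std::vector<std::vector<int32_t>> outputs;
    outputs.push_back(fullMix);                   // heard by non-speaking terminals
    for (const auto& frame : speakerFrames) {     // one echo-free mix per speaker
        std::vector<int32_t> mix(len);
        for (std::size_t i = 0; i < len; ++i) mix[i] = fullMix[i] - frame[i];
        outputs.push_back(mix);
    }
    return outputs;                               // N + 1 streams for the encoders
}
```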
Step S8: if a conference terminal has special requirements, the stream also passes through an adapter that encodes it into the audio format required by that terminal; the encoded audio data stream is then transmitted back to the control chip, which sends it on to the conference terminal.
In this embodiment the number of mixing channels is set to 6, that is, the mixing depth is 6 and N is 6, so the encoders need to encode, for each of the N speaking conference terminals, a voice code stream that excludes that terminal's own voice, plus one voice code stream containing all speaking conference terminals. In addition, there are 20 primary decoders, and the secondary decoders are the decoders on the other two cores (40 channels in total), which guarantees the decoding requirements of a conference with 130 conference terminals.
In this embodiment, a decoding apparatus for an audio code stream is further provided, where the apparatus is used to implement the foregoing embodiments and preferred embodiments, and details are not repeated for what has been described. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
The present embodiment provides an apparatus for decoding an audio code stream, as shown in fig. 10, including:
the first receiving module 71 is configured to receive an audio code stream sent by the conference terminal.
An obtaining module 72, configured to obtain the audio mixing identifiers of all the conference terminals, where the audio mixing identifiers are used to represent the audio mixing states corresponding to the conference terminals.
And the calculating module 73 is configured to sequentially determine whether the mixing identifier of each conference terminal indicates that the conference terminal is not in the mixing, and calculate a frame information amount of an audio code stream of the conference terminal that is not in the mixing.
And a screening module 74, configured to screen the audio code stream of the conference terminal not in the audio mixing based on the frame information amount obtained through calculation.
And a decoding module 75, configured to decode the screened audio code stream.
The decoding apparatus of the audio code stream provided by this embodiment calculates the information amount of an audio code stream before the audio code streams sent by the conference terminals are decoded, so a large number of conference terminals can be screened out, which greatly reduces the load on the decoder and improves its decoding efficiency.
The decoding apparatus of the audio code stream in this embodiment is presented in the form of a functional unit, where the unit refers to an ASIC circuit, a processor and a memory executing one or more software or fixed programs, and/or other devices capable of providing the above functions.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An embodiment of the present invention further provides an electronic device, which has the decoding apparatus for audio code stream shown in fig. 10.
Referring to fig. 11, fig. 11 is a schematic structural diagram of an electronic device according to an alternative embodiment of the present invention, and as shown in fig. 11, the electronic device may include: at least one processor 81, such as a CPU (Central Processing Unit), at least one communication interface 83, memory 84, and at least one communication bus 82. Wherein a communication bus 82 is used to enable the connection communication between these components. The communication interface 83 may include a Display (Display) and a Keyboard (Keyboard), and the optional communication interface 83 may also include a standard wired interface and a standard wireless interface. The Memory 84 may be a high-speed RAM Memory (volatile Random Access Memory) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The memory 84 may optionally be at least one memory device located remotely from the processor 81. Wherein the processor 81 may be in connection with the apparatus described in fig. 10, an application program is stored in the memory 84, and the processor 81 calls the program code stored in the memory 84 for performing any of the above-mentioned method steps.
The communication bus 82 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The communication bus 82 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 11, but this is not intended to represent only one bus or type of bus.
The memory 84 may include a volatile memory, such as a random-access memory (RAM); it may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 84 may also include a combination of the above types of memory.
The processor 81 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
The processor 81 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
Optionally, the memory 84 is also used to store program instructions. The processor 81 may call program instructions to implement the decoding method of the audio code stream as shown in the above embodiments of the present application.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (9)

1. A decoding method of an audio code stream is characterized by comprising the following steps:
receiving an audio code stream sent by a conference terminal;
acquiring the sound mixing identifiers of all the conference terminals, wherein the sound mixing identifiers are used for representing the sound mixing state corresponding to the conference terminals;
sequentially determining whether the sound mixing identifier of each conference terminal indicates that the conference terminal is not in the sound mixing, and calculating a frame information amount of the audio code stream of each conference terminal that is not in the sound mixing;
screening the audio code stream of each conference terminal that is not in the sound mixing based on the calculated frame information amount;
decoding the screened audio code stream;
before the step of decoding the screened audio code stream, the method further includes:
creating a hierarchical decoder, wherein the hierarchical decoder comprises a primary decoder and a secondary decoder, the primary decoder has a different core from the secondary decoder, and the primary decoder has the same core as the mixer.
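By way of illustration only, the following sketch restates the hierarchical-decoder structure of claim 1 as a small data structure. The field names and the example core labels ("48kHz-PCM", "16kHz-PCM") are assumptions for the sketch, not terms used by the claims.

```python
from dataclasses import dataclass

@dataclass
class HierarchicalDecoder:
    """Sketch of the hierarchical decoder in claim 1 (names are illustrative)."""
    mixer_core: str       # core used by the mixer
    primary_core: str     # primary decoder core; must match the mixer's core
    secondary_core: str   # secondary decoder core; deliberately different

    def __post_init__(self):
        assert self.primary_core == self.mixer_core, \
            "the primary decoder shares the mixer's core"
        assert self.secondary_core != self.primary_core, \
            "the secondary decoder uses a different core"

# Example with purely illustrative core labels:
decoders = HierarchicalDecoder(mixer_core="48kHz-PCM",
                               primary_core="48kHz-PCM",
                               secondary_core="16kHz-PCM")
```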
2. The decoding method according to claim 1, wherein the screening the audio code stream of each conference terminal that is not in the sound mixing based on the calculated frame information amount comprises:
judging whether the frame information amount is smaller than a preset value;
and when the frame information amount is smaller than the preset value, discarding the audio code stream corresponding to the frame information amount.
3. The decoding method according to claim 1, wherein the decoding the screened audio code stream comprises:
judging whether the conference terminal corresponding to the screened audio code stream is a newly added conference terminal;
and when the conference terminal corresponding to the screened audio code stream is a newly added conference terminal, transmitting the audio code stream to a corresponding decoder for decoding according to the decoding resources of the decoder.
4. The decoding method according to claim 3, wherein the decoding the screened audio code stream further comprises:
when the conference terminal is an already-joined conference terminal, extracting the activity of the conference terminal;
and transmitting the audio code stream to a corresponding decoder for decoding according to the activity and the decoding resources.
5. The decoding method according to claim 4, wherein the transmitting the audio code stream to a corresponding decoder for decoding according to the activity and the decoding resources comprises:
judging whether the activity is greater than or equal to a preset activity;
when the activity is greater than or equal to the preset activity, transmitting the audio code stream to the primary decoder;
when the activity is smaller than the preset activity, judging whether the primary decoder is idle;
when the primary decoder is idle, transmitting the audio code stream to the primary decoder;
and when the primary decoder is not idle, transmitting the audio code stream to the secondary decoder.
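For illustration, the routing rule of claims 3 to 5 can be summarized in the following sketch. The attribute names (is_new, activity, is_idle) and the numeric preset activity are assumptions introduced here; claim 3 leaves open how decoding resources are evaluated, and this sketch simply tries the primary decoder first.

```python
PRESET_ACTIVITY = 0.5  # assumed threshold; the claims do not fix a value

def route_stream(frame, terminal, primary, secondary):
    """Route a screened frame to the primary or secondary decoder (sketch of claims 3-5).

    terminal.is_new   -- True for a newly added conference terminal
    terminal.activity -- activity tracked for already-joined terminals
    primary/secondary -- decoder objects exposing is_idle() and decode()"""
    if terminal.is_new:
        # Newly added terminals are placed by available decoding resources;
        # here we simply try the primary decoder first (an assumption).
        target = primary if primary.is_idle() else secondary
    elif terminal.activity >= PRESET_ACTIVITY:
        # Sufficiently active terminals always use the primary decoder.
        target = primary
    elif primary.is_idle():
        # Less active terminals still use the primary decoder when it is idle.
        target = primary
    else:
        # Otherwise the stream falls back to the secondary decoder.
        target = secondary
    return target.decode(frame)
```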
6. The decoding method according to any one of claims 1-5, further comprising:
judging whether the decoded audio code stream comprises effective sound;
and when the decoded audio code stream comprises the effective sound, performing sound mixing on the decoded audio code stream by using a sound mixer, and updating the activity of the conference terminal.
7. The decoding method according to claim 6, further comprising:
when the decoded audio code stream has no effective sound, judging whether the conference terminal participates in sound mixing;
when the conference terminal participates in the sound mixing, calculating the time interval between the last sound mixing and the current time, and judging whether the sound mixing has timed out based on the time interval;
and when the time interval is within a preset time range, mixing the decoded audio code stream by using the mixer, and updating the activity of the conference terminal.
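For illustration, the mixing decision of claims 6 and 7 can be sketched as follows. The helper names, the timeout value, and the way the activity counter is updated are assumptions for the sketch; the claims do not fix these details.

```python
import time

MIX_TIMEOUT_S = 2.0  # assumed preset time range for the mixing timeout

def maybe_mix(terminal, pcm, mixer, has_effective_sound):
    """Decide whether a decoded stream enters the mixer (sketch of claims 6-7).

    terminal.in_mixing     -- whether the terminal currently participates in mixing
    terminal.last_mix_time -- timestamp of the terminal's last mixing
    terminal.activity      -- activity counter for the terminal
    mixer                  -- object exposing mix(pcm)"""
    now = time.monotonic()
    if has_effective_sound:
        # Claim 6: effective sound -> mix and update the terminal's activity.
        mixer.mix(pcm)
        terminal.activity += 1          # how activity is updated is an assumption
        terminal.last_mix_time = now
    elif terminal.in_mixing and (now - terminal.last_mix_time) <= MIX_TIMEOUT_S:
        # Claim 7: no effective sound, but the terminal was mixed recently enough
        # that the mixing has not timed out, so it stays in the mix.
        mixer.mix(pcm)
        terminal.activity += 1
        terminal.last_mix_time = now    # treated as a new mixing event (assumption)
    return terminal
```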
8. An electronic device, comprising:
a memory and a processor, wherein the memory and the processor are communicatively connected with each other, the memory stores computer instructions, and the processor executes the computer instructions to perform the decoding method of an audio code stream according to any one of claims 1 to 7.
9. A computer-readable storage medium storing computer instructions for causing a computer to perform the decoding method of an audio code stream according to any one of claims 1 to 7.
CN201910234737.2A 2019-03-26 2019-03-26 Decoding method of audio code stream and electronic equipment Active CN110070878B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910234737.2A CN110070878B (en) 2019-03-26 2019-03-26 Decoding method of audio code stream and electronic equipment


Publications (2)

Publication Number Publication Date
CN110070878A CN110070878A (en) 2019-07-30
CN110070878B true CN110070878B (en) 2021-05-04

Family

ID=67366795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910234737.2A Active CN110070878B (en) 2019-03-26 2019-03-26 Decoding method of audio code stream and electronic equipment

Country Status (1)

Country Link
CN (1) CN110070878B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110995946B (en) * 2019-12-25 2021-08-20 苏州科达科技股份有限公司 Sound mixing method, device, equipment, system and readable storage medium
CN111181965B (en) * 2019-12-30 2023-04-14 视联动力信息技术股份有限公司 Audio processing method and device
CN111951813A (en) * 2020-07-20 2020-11-17 腾讯科技(深圳)有限公司 Voice coding control method, device and storage medium
CN112188144B (en) * 2020-09-14 2023-03-24 浙江华创视讯科技有限公司 Audio transmission method and device, storage medium and electronic device
CN112118264A (en) * 2020-09-21 2020-12-22 苏州科达科技股份有限公司 Conference sound mixing method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1522541A (en) * 2001-07-06 2004-08-18 �ʼҷ����ֵ������޹�˾ Resource scalable decoding
CN101031959A (en) * 2005-04-15 2007-09-05 弗劳恩霍夫应用研究促进协会 Multi-channel hierarchical audio coding with compact side-information
CN101287124A (en) * 2008-06-02 2008-10-15 中国科学院计算技术研究所 Video decoding device and method
CN101743586A (en) * 2007-06-11 2010-06-16 弗劳恩霍夫应用研究促进协会 Audio encoder, encoding methods, decoder, decoding method, and encoded audio signal

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007257701A (en) * 2006-03-22 2007-10-04 Toshiba Corp Reproduction device
US7822811B2 (en) * 2006-06-16 2010-10-26 Microsoft Corporation Performance enhancements for video conferencing
CN101414462A (en) * 2007-10-15 2009-04-22 华为技术有限公司 Audio encoding method and multi-point audio signal mixing control method and corresponding equipment
CN102324235A (en) * 2007-10-19 2012-01-18 华为技术有限公司 Sound mixing encoding method, device and system
CN101414463B (en) * 2007-10-19 2011-08-10 华为技术有限公司 Method, apparatus and system for encoding mixed sound
JP5054787B2 (en) * 2007-12-28 2012-10-24 パナソニック株式会社 Audio mixing and playback device
CN101616222B (en) * 2009-07-22 2012-01-04 杭州华三通信技术有限公司 Sound recording method and equipment for conference call
US20150077509A1 (en) * 2013-07-29 2015-03-19 ClearOne Inc. System for a Virtual Multipoint Control Unit for Unified Communications
CN104539972A (en) * 2014-12-08 2015-04-22 中安消技术有限公司 Method and device for controlling video parallel decoding in multi-core processor
CN106534762B (en) * 2016-11-16 2019-12-13 深圳市捷视飞通科技股份有限公司 low-delay distributed audio processing method and system
CN109087656B (en) * 2017-06-14 2020-11-17 广东亿迅科技有限公司 Multimedia conference sound mixing method and device based on MCU
CN107276777B (en) * 2017-07-27 2020-05-29 苏州科达科技股份有限公司 Audio processing method and device of conference system


Also Published As

Publication number Publication date
CN110070878A (en) 2019-07-30

Similar Documents

Publication Publication Date Title
CN110070878B (en) Decoding method of audio code stream and electronic equipment
CN102226944B (en) Audio mixing method and equipment thereof
US20140205115A1 (en) Audio Mixing Method, Apparatus and System
CN112104836A (en) Audio mixing method, system, storage medium and equipment for audio server
CN107276777B (en) Audio processing method and device of conference system
CN107682752B (en) Method, device and system for displaying video picture, terminal equipment and storage medium
CN109712631B (en) Audio data transfer control method, device, system and readable storage medium storing program for executing
US20170104867A1 (en) In-service monitoring of voice quality in teleconferencing
WO2012079510A1 (en) Mute indication method and device applied to video conferencing
WO2014154065A2 (en) Data transmission method, media acquisition device, video conference terminal and storage medium
CN112118264A (en) Conference sound mixing method and system
US20070107507A1 (en) Mute processing apparatus and method for automatically sending mute frames
CN105704338A (en) Audio mixing method, audio mixing equipment and system
CN102915736A (en) Sound mixing processing method and system
CN114071131B (en) Test device, method, electronic device, and medium for video encoding and decoding device
CN104469032A (en) Sound mixing processing method and system
CN112260982B (en) Audio processing method and device
US20070129037A1 (en) Mute processing apparatus and method
US11800017B1 (en) Encoding a subset of audio input for broadcasting conferenced communications
EP3661194A1 (en) Video interaction management method and system, and computer-readable storage medium
CN105704327A (en) Call rejection method and call rejection system
US11783837B2 (en) Transcription generation technique selection
CN113572898B (en) Method and corresponding device for detecting silent abnormality in voice call
WO2022105706A1 (en) Voice call quality determination method and apparatus
CN111951821B (en) Communication method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant