CN114285830B - Voice signal processing method, device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN114285830B
CN114285830B (application CN202111572482.4A)
Authority
CN
China
Prior art keywords
voice
packet
target
sequence number
voice packet
Prior art date
Legal status
Active
Application number
CN202111572482.4A
Other languages
Chinese (zh)
Other versions
CN114285830A (en)
Inventor
许士宾
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111572482.4A
Publication of CN114285830A
Application granted
Publication of CN114285830B
Legal status: Active

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in wireless communication networks

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The disclosure provides a voice signal processing method and apparatus, an electronic device, and a readable storage medium, relating to the field of computer technology and in particular to voice technology. The scheme is as follows: when at least two voice signals sent by the SFU server are received, the voice packets of each signal are stored separately in a matching buffer queue; when the mixing condition for the multiple voice signals is met, the corresponding voice packets are obtained from each buffer queue to form a composite voice packet; a decoding module in the FreeSWITCH communication gateway decodes each voice packet in the composite voice packet to obtain a plurality of decoded voice signals; and the decoded voice signals are mixed to obtain a mixed voice packet, which is sent to the terminal. In this way, a terminal that cannot perform mixing itself can join the SFU conference and still hear the voice content of multiple other terminals.

Description

Voice signal processing method, device, electronic equipment and readable storage medium
Technical Field
The disclosure relates to the field of computer technology, in particular to the field of voice technology, and more particularly to a voice signal processing method and apparatus, an electronic device, and a readable storage medium.
Background
At present, various instant messaging tools mainly implement multi-terminal online voice interaction through a Selective Forwarding Unit (SFU) conference scheme. An SFU conference consists of a server and a number of terminals. The SFU does not mix audio and video; instead, it selects the audio and video streams of certain terminals in the conference according to a specific algorithm and forwards them to the other terminals, which must mix the audio themselves. The SFU is effectively an audio/video routing forwarder, which makes it suitable for large-scale deployment. If an SFU conference terminal wants to talk with a third-party terminal in a third-party network that uses a different communication protocol, a communication gateway is required to access that network.
In the prior art, FreeSWITCH (a software telephone switch) may be used as the communication gateway connecting SFU conference terminals and third-party terminals. FreeSWITCH is an open-source, multi-protocol soft-switching platform that implements a variety of communication functions. Technically, FreeSWITCH acts as a back-to-back user agent (Back-to-Back User Agent, B2BUA) that assists the connected parties in making audio and video calls. However, the B2BUA mechanism cannot mix the multiple voice streams forwarded in an SFU conference, so when the third-party terminal cannot perform mixing itself, it can only hear one of the voice streams at a time.
Disclosure of Invention
The disclosure provides a voice signal processing method, a voice signal processing device, electronic equipment and a readable storage medium.
According to an aspect of the present disclosure, there is provided a voice signal processing method, including:
when at least two voice signals sent by an SFU server are received, storing the voice packets of each signal separately in a matching buffer queue;
when the mixing condition for the multiple voice signals is met, obtaining the corresponding voice packets from each buffer queue to form a composite voice packet;
obtaining a plurality of decoded voice signals produced by a decoding module in the FreeSWITCH communication gateway decoding each voice packet in the composite voice packet; and
mixing the decoded voice signals to obtain a mixed voice packet, and sending the mixed voice packet to a terminal.
According to another aspect of the present disclosure, there is provided a voice signal processing method, including:
acquiring an original voice data packet sent by a terminal;
analyzing the original voice signal in the original voice data packet, and calculating the valid voice identification information and the voice energy value corresponding to the original voice signal;
forming, according to the valid voice identification information and the voice energy value, an improved voice data packet for the SFU server to use in its route selection processing;
and sending the improved voice data packet to the SFU server.
According to another aspect of the present disclosure, there is provided a voice signal processing apparatus including:
a voice packet storage module, configured to store the voice packets of each voice signal separately in a matching buffer queue when at least two voice signals sent by the SFU server are received;
a composite voice packet forming module, configured to obtain the corresponding voice packets from each buffer queue to form a composite voice packet when the mixing condition for the multiple voice signals is met;
a decoded voice signal acquisition module, configured to obtain a plurality of decoded voice signals produced by a decoding module in the FreeSWITCH communication gateway decoding each voice packet in the composite voice packet; and
a mixed packet sending module, configured to mix the decoded voice signals into a mixed voice packet and send it to a terminal.
According to another aspect of the present disclosure, there is provided a voice signal processing apparatus including:
the original voice data packet acquisition module is used for acquiring an original voice data packet sent by the terminal;
an original voice signal information calculation module, configured to analyze the original voice signal in the original voice data packet and calculate the valid voice identification information and the voice energy value corresponding to the original voice signal;
an improved voice data packet forming module, configured to form, according to the valid voice identification information and the voice energy value, an improved voice data packet for the SFU server to use in its route selection processing;
and the improved voice data packet sending module is used for sending the improved voice data packet to the SFU server.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the speech signal processing methods provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the voice signal processing method provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the speech signal processing method provided by the present disclosure.
The disclosed technique solves the problem of connecting a terminal to an SFU conference for communication: since the mixing can be performed by FreeSWITCH, the third-party terminal can hear multiple voice streams.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a method of processing speech signals according to an embodiment of the present disclosure;
FIG. 2a is a flow chart of another speech signal processing method according to an embodiment of the present disclosure;
FIG. 2b is a flow diagram of inserting a target voice packet into a target buffer queue according to an embodiment of the present disclosure;
FIG. 3a is a flow chart of yet another speech signal processing method according to an embodiment of the present disclosure;
FIG. 3b is a schematic diagram of a composite voice packet composition in accordance with an embodiment of the present disclosure;
FIG. 4a is a flow chart of yet another speech signal processing method according to an embodiment of the present disclosure;
FIG. 4b is a flow chart of yet another method of speech signal processing according to an embodiment of the present disclosure;
Fig. 5 is a schematic structural view of a voice signal processing apparatus according to an embodiment of the present disclosure;
Fig. 6 is a schematic structural view of yet another voice signal processing apparatus according to an embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device for implementing a speech signal processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic flow chart of a voice signal processing method according to an embodiment of the disclosure. The embodiment is applicable to connecting a third-party terminal, located in a third-party network with a different communication protocol, to an SFU conference for communication. The method may be performed by a voice signal processing apparatus, which may be implemented in software and/or hardware and integrated into an electronic device such as a FreeSWITCH communication gateway. In particular, the method may be performed by the left user agent in the FreeSWITCH communication gateway. Referring to fig. 1, the method includes the following steps:
step 110, when at least two paths of voice signals sent by the SFU server are received, each voice packet corresponding to each path of voice signal is independently stored in a matched buffer queue.
The SFU server selectively forwards the received voice signals rather than forwarding all of them: it selects among them according to a specific algorithm, and the selected signals are processed by FreeSWITCH before being forwarded to the third-party terminal. A voice signal received by the SFU server may come from a terminal using the same communication protocol as the SFU conference, or from a third-party terminal using a different one. In the latter case, the signal must undergo signal processing before the SFU server's selection processing; the specific processing is described in the following embodiments.
In the present disclosure, the specific algorithm used by the SFU server may be defined according to the actual situation and the application scenario. For example, the number of selected voice signals may have an upper limit: the SFU server may forward at most 6 voice signals to a participant of the SFU conference (whether a third-party terminal or another terminal).
Specifically, a voice packet received by the SFU server may carry valid voice identification information (vad) and a voice energy value (energy). For example, vad and energy may be pre-filled into the extension header of the packet's Real-time Transport Protocol (RTP) data before the packet is sent to the SFU server. The SFU server may then rank the voice packets by vad and energy. For example, it may select the n voice streams (n being a natural number greater than or equal to 2, configurable according to actual requirements) whose vad is true and whose energy is highest.
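The ranking step just described can be sketched as follows. This is a hedged illustration only: the function names, the RMS-based energy formula, and the vad threshold are assumptions of the sketch, not details taken from the disclosure.

```python
# Illustrative sketch of the SFU selection step: keep the n streams
# whose vad flag is true and whose energy is highest. The energy
# formula (RMS over one frame) and the threshold are assumed details.

def compute_vad_energy(pcm_samples, vad_threshold=500.0):
    """Return (vad, energy) for one frame of 16-bit PCM samples."""
    if not pcm_samples:
        return False, 0.0
    energy = (sum(s * s for s in pcm_samples) / len(pcm_samples)) ** 0.5
    return energy >= vad_threshold, energy

def select_streams(streams, n=6):
    """Pick up to n stream ids whose vad is true, highest energy first.

    streams maps a stream id to its latest (vad, energy) pair.
    """
    active = [(sid, energy) for sid, (vad, energy) in streams.items() if vad]
    active.sort(key=lambda item: item[1], reverse=True)
    return [sid for sid, _ in active[:n]]
```

With n = 2 and three streams of which one has vad false, only the two active streams would be forwarded, ordered by energy.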
In the present disclosure, each voice signal received by the left user agent in the FreeSWITCH communication gateway has already been selected by the SFU server according to the specific algorithm, so the left user agent only needs to receive and buffer it. When the number of terminals in the SFU conference is less than n, the SFU server may forward all voice streams to the left user agent.
In the prior art, a user agent in the FreeSWITCH communication gateway can only process one downlink voice signal. In the present disclosure, the left user agent may receive multiple voice signals at the same time and forward them to the third-party terminal after mixing. For the received signals, the left user agent must buffer the voice packets of each signal. Buffering the packets in queues avoids problems such as garbled mixing and noise caused by network jitter.
In particular, the voice packets may be buffered and ordered in an ordered queue. The memory for the queue nodes can be drawn from a simple, self-managed memory pool that can be recycled at any time, avoiding the memory fragmentation caused by frequently requesting and releasing memory from the system. For example, the left user agent may receive a voice packet through a socket, copy it into a queue node, and insert the node at the matching position in the queue.
By way of example, each voice signal may correspond to one terminal in the SFU conference and comprise a number of voice packets, and each signal is stored in its own buffer queue. The packets in each queue are buffered in sorted order, where the ordering key may be the packet's own sequence number or the order in which the packets were received.
And 120, when the mixing condition of the multipath voice signals is met, respectively acquiring corresponding voice packets from each buffer queue to form a composite voice packet.
In the prior art, the FreeSWITCH communication gateway implements communication through a 1:1 B2BUA exchange model, where 1:1 means receiving and forwarding a single voice signal. Two sessions are created in the 1:1 B2BUA model, each representing one user agent; the left one may be called the left user agent and the right one the right user agent. Each session is associated with a bridge thread (audio_bridge), and signaling, media transceiving, and processing are handled through callbacks. When FreeSWITCH receives a call request, it creates the left user agent, starts the corresponding state machine, associates an audio_bridge thread, and executes the dial plan. It then creates the right user agent, starts its state machine, and associates the audio_bridge thread. When the state machine enters the media-exchange state, the audio_bridge thread completes the media exchange between the two user agents through callbacks.
For example, receiving a voice packet may be done through the callback function io_read_frame, and sending through the callback function io_write_frame. The left user agent receives a voice packet through io_read_frame as follows: after the packet arrives from the network, it is copied into a switch_frame_t frame object, where frame is the parameter of io_read_frame. The audio_bridge thread obtains the received packet through io_read_frame and then calls the right user agent's io_write_frame callback to forward it to the terminal represented by the right user agent.
In the present disclosure, the FreeSWITCH communication gateway uses an n:1 B2BUA exchange model, where n:1 means receiving multiple voice signals and forwarding a single one. The left user agent therefore handles the received signals differently from the prior art: when the mixing condition for the multiple voice signals is met, it obtains the corresponding voice packets.
The mixing condition for the multiple voice signals can take various forms. For example, when the third-party terminal is determined to receive the mixed packet, it can communicate with the other terminals in the SFU conference. Specifically, the mixing condition may be the io_read_frame callback moment in the FreeSWITCH communication gateway. The left user agent's audio_bridge thread calls back io_read_frame and so can receive multiple voice signals; the voice packets are obtained indirectly from the buffer queues rather than directly at the network interface, which reduces the mixing noise caused by network jitter.
Specifically, when the left user agent fetches the corresponding packets from the buffer queues, the parameter frame of io_read_frame may be split into sub_frames: each voice packet is put into frame as a sub_frame, so from the audio_bridge thread's point of view the exchange still looks like 1:1. That is, the corresponding voice packets (which can be understood as the sub_frames) are fetched from the buffer queues to form a composite voice packet (which can be understood as the frame), with one voice packet taken from each buffer queue.
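The fetch described above (one packet per queue, bundled as the sub_frames of one frame) can be sketched as follows; the queue representation and the function name are assumptions of this sketch, since FreeSWITCH itself does this in C around the switch_frame_t object.

```python
# Illustrative sketch: pop one buffered voice packet from the head of
# each non-empty buffer queue and bundle them as the sub-frames of a
# single composite frame handed to the audio_bridge thread.

from collections import deque

def form_composite_packet(buffer_queues):
    """Take the head packet of every non-empty queue; return the sub-frames."""
    sub_frames = []
    for queue in buffer_queues:
        if queue:  # a queue may be empty if that stream is jittering
            sub_frames.append(queue.popleft())
    return sub_frames
```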
And 130, acquiring FREESWITCH a plurality of decoded voice signals obtained by decoding each voice packet in the composite voice packets by a decoding module in the communication gateway.
The decoding module in the FreeSWITCH communication gateway may be session_write_frame in session_write. Decoding the composite voice packet means splitting each sub_frame out of the parameter frame (i.e., obtaining each voice packet in the composite packet) and decoding each sub_frame independently. For example, if the decoded data is pulse code modulation (PCM) data, the resulting decoded voice signals can be denoted pcm_data1, pcm_data2, …, pcm_data6, and so on.
And 140, performing audio mixing processing on each decoded voice signal to obtain an audio mixing packet, and sending the audio mixing packet to the terminal.
Each decoded voice signal may be used as an input parameter of a mixing routine (mix). When the mix call completes, it outputs the mixed data (mix_pcm_data), which is encoded to obtain the mixed voice packet. The mixed packet may then be sent to the terminal via the right user agent's io_write callback function.
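A minimal mixing sketch consistent with this description is shown below: the decoded PCM frames are summed sample by sample. Additive mixing with saturation to the 16-bit range is an assumed detail, since the disclosure does not specify the mixing formula.

```python
# Illustrative mix(): sum equal-length frames of 16-bit PCM samples
# (pcm_data1, pcm_data2, ...) into mix_pcm_data, clipping each mixed
# sample to the int16 range to avoid wrap-around distortion.

def mix_pcm(decoded_frames):
    """Mix several decoded PCM frames into one frame (list of ints)."""
    mixed = []
    for samples in zip(*decoded_frames):
        total = sum(samples)
        mixed.append(max(-32768, min(32767, total)))
    return mixed
```

The saturating min/max keeps loud overlapping talkers from wrapping into noise, at the cost of clipping; real gateways often attenuate instead.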
Here the terminal refers to a third-party terminal in a third-party network, i.e., a network whose communication protocol differs from the SFU conference, such as the Public Switched Telephone Network (PSTN). The third-party terminal communicates over that network, e.g., a PSTN terminal. The PSTN terminal may be a mobile phone, a World Wide Web (Web) client, a Session Initiation Protocol (SIP) client, or another device that does not support mixing.
According to the technical scheme, when at least two voice signals sent by an SFU server are received, the voice packets of each signal are stored separately in a matching buffer queue; when the mixing condition for the multiple voice signals is met, the corresponding voice packets are obtained from each buffer queue to form a composite voice packet; a decoding module in the FreeSWITCH communication gateway decodes each voice packet in the composite packet to obtain a plurality of decoded voice signals; and the decoded signals are mixed into a mixed voice packet that is sent to the terminal, in particular a third-party terminal in a third-party network. This solves the problem of connecting the third-party terminal to an SFU conference: the mixing is performed by FreeSWITCH, so the third-party terminal can hear multiple voice streams, and handling the packets through buffer queues avoids garbled mixing.
Fig. 2a is a flow chart of another voice signal processing method according to an embodiment of the disclosure. This embodiment further refines the foregoing technical solution and may be combined with one or more of the foregoing implementations.
Specifically, in an optional embodiment of the present disclosure, storing each voice packet corresponding to each voice signal separately in a matched buffer queue includes:
when a target voice packet matching the first target path number is detected, extracting the first base sequence number of the target voice packet;
obtaining the current sequence number rollover count and the sequence number rollover threshold of the target buffer queue matching the target voice packet;
determining the target extended sequence number matching the target voice packet according to the first base sequence number, the current rollover count, and the rollover threshold; and
inserting the target voice packet into the target buffer queue according to the target extended sequence number.
Specifically, referring to fig. 2a, the voice signal processing method provided in the present disclosure includes the following steps:
Step 210, when at least two voice signals sent by the SFU server are received and a target voice packet matching the first target path number is detected, extract the first base sequence number of the target voice packet.
The first target path number may be understood as an identifier of the terminal, the buffer queue, or the target voice packet, between which there is a certain correspondence. For example, the first target path number may identify the target voice packets that one terminal sends to the SFU server and that are stored in the corresponding buffer queue. The packet's own sequence number (seq) may serve as the first base sequence number; specifically, it may be stored in the header of the target voice packet.
Illustratively, seq may be represented as a 16-bit unsigned integer with a maximum of 65535; when 65535 is exceeded, seq starts generating again from 0, i.e., the sequence number rolls over.
Step 220, obtain the current sequence number rollover count and the sequence number rollover threshold of the target buffer queue matching the target voice packet.
The current sequence number rollover count (roll_num) can be understood as the number of times seq has restarted from 0; roll_num is incremented by 1 each time a new rollover occurs in the target buffer queue. The sequence number rollover threshold can be understood as the maximum value of seq, for example 65535.
Step 230, determine the target extended sequence number matching the target voice packet according to the first base sequence number, the current rollover count, and the rollover threshold.
The target extended sequence number (ext_seq) can be understood as a sequence number re-derived from the first base sequence number, the current rollover count, and the rollover threshold. It resolves the rollover problem when storing target voice packets in the buffer queue, ensuring that the storage position of each target voice packet is unambiguous and correct.
Specifically, the target extended sequence number may be determined as ext_seq = roll_num × seq_max + seq, where seq_max is the sequence number rollover threshold. For example, when seq_max is 65535, ext_seq may be represented as a 32-bit unsigned integer, extending the original 16-bit representation and thereby solving the rollover problem.
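The computation can be sketched as follows. Two details are assumptions of this sketch rather than statements of the disclosure: the multiplier used is 65536 (the count of distinct 16-bit values) so that extended numbers never collide across rollovers, and a rollover is detected heuristically as a large backwards jump in seq.

```python
# Illustrative ext_seq = roll_num * seq_range + seq with rollover tracking.

SEQ_RANGE = 65536  # number of distinct 16-bit sequence values (0..65535)

class SeqExtender:
    """Maintain roll_num for one buffer queue and extend 16-bit seqs."""

    def __init__(self):
        self.roll_num = 0
        self.last_seq = None

    def extend(self, seq):
        # A jump backwards by more than half the range is read as a
        # wrap past 65535 back toward 0, so roll_num is incremented.
        if self.last_seq is not None and self.last_seq - seq > 32768:
            self.roll_num += 1
        self.last_seq = seq
        return self.roll_num * SEQ_RANGE + seq
```

Each buffer queue would keep its own extender, so packets inserted by ext_seq stay totally ordered even across many wraps.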
Step 240, insert the target voice packet into the target buffer queue according to the target extended sequence number.
When target packets are inserted into the target buffer queue according to ext_seq, the packets stay in order and no out-of-order condition occurs.
In an alternative embodiment of the present disclosure, inserting the target voice packet into the target buffer queue according to the target extended sequence number includes: detecting whether the target voice packet is the first voice packet of the set voice segment; if so, inserting the target voice packet into the target buffer queue according to the target extended sequence number; otherwise, determining the expected insertion sequence number from the last extended sequence number in the target buffer queue; if the expected insertion sequence number matches the target extended sequence number, inserting the target voice packet into the target buffer queue according to the target extended sequence number; if it does not match, determining either that a packet loss event has occurred or that the target voice packet is a retransmitted packet following a packet loss event; if a packet loss event has occurred, constructing a packet loss retransmission request packet and inserting the target voice packet into the target buffer queue according to the target extended sequence number; if the target voice packet is a retransmitted packet following a packet loss event, detecting whether it meets the timeliness requirement; if so, inserting it into the target buffer queue according to the target extended sequence number; otherwise, discarding it.
Fig. 2b is a schematic flow chart of inserting a target voice packet into a target buffer queue according to an embodiment of the disclosure. As shown in fig. 2b, it is first detected whether the target voice packet is the first voice packet of the set voice segment; this may be judged from the packet's target extended sequence number. If it is the first voice packet, it may be inserted directly into the target buffer queue, for example at the first position. If it is not, it is determined whether the target extended sequence number matches the expected insertion sequence number, where the expected insertion sequence number is the extended sequence number immediately following the last one in the queue.
As shown in fig. 2b, if the expected insertion sequence number matches (e.g., equals) the target extended sequence number, the target voice packet may be inserted directly into the target buffer queue, for example at the position after the current last position, i.e., the position corresponding to the expected insertion sequence number. If they do not match, either a packet loss event has occurred or the target voice packet is a retransmitted packet following a packet loss event: when the target extended sequence number precedes the expected insertion sequence number, the packet is a retransmission after a packet loss event; when it follows the expected insertion sequence number, a packet loss event has occurred.
As shown in fig. 2b, if it is determined that a packet loss event has occurred, a packet loss retransmission request packet may be constructed, and the target voice packet may be inserted into the target buffer queue, for example, at the position corresponding to the target extension sequence number.
As shown in fig. 2b, if it is determined that the target voice packet is a retransmitted data packet following a packet loss event, it may be detected whether the target voice packet meets the timeliness requirement, i.e., whether it has expired. If the target voice packet does not meet the timeliness requirement, i.e., has expired, it may be discarded directly. If the target voice packet meets the timeliness requirement, i.e., has not expired, it may be inserted into the target buffer queue, for example, at the position corresponding to the target extension sequence number.
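The insertion decision walked through above can be sketched as follows. This is an illustrative Python sketch with hypothetical names (`classify_packet`, `InsertAction`), assuming the expected insertion sequence number is simply the last extension sequence number in the queue plus one:

```python
from enum import Enum

class InsertAction(Enum):
    INSERT = "insert"                            # place packet at its position
    INSERT_AND_NACK = "insert_and_request_nack"  # also build a retransmission request
    DISCARD = "discard_expired"                  # expired retransmission, drop it

def classify_packet(target_ext_seq, last_ext_seq, is_first, is_expired):
    """Decide how to handle an arriving voice packet (hypothetical helper).

    last_ext_seq is the extension sequence number of the last packet in the
    target buffer queue, or None when the queue is still empty.
    """
    if is_first or last_ext_seq is None:
        return InsertAction.INSERT           # first packet: insert directly
    expected = last_ext_seq + 1              # expected insertion sequence number
    if target_ext_seq == expected:
        return InsertAction.INSERT           # in-order packet
    if target_ext_seq > expected:
        return InsertAction.INSERT_AND_NACK  # gap detected: packet loss event
    # target_ext_seq < expected: a retransmitted packet after a loss event
    return InsertAction.DISCARD if is_expired else InsertAction.INSERT
```

With the example numbers used below, a packet with extension sequence number 3504 arriving when 3501 is expected would be classified as a loss event and trigger a retransmission request.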
According to the method and the device, under the various detection conditions, the target voice packet is inserted at the matching position in the target buffer queue, so that the data packets can be ordered accurately, out-of-order storage of the data packets is avoided, packet loss can be remedied, and the timeliness of the data packets is guaranteed. Expired data packets can be discarded directly, avoiding invalid ordering of the data packets.
In an optional embodiment of the disclosure, constructing the packet loss retransmission request packet includes: determining second base sequence numbers corresponding to the lost voice packets according to the expected insertion sequence number and the target extension sequence number; and constructing the packet loss retransmission request packet according to the second base sequence numbers, and sending the packet loss retransmission request packet to the SFU server so as to request, via the SFU server, the terminal that sent the lost voice packets to retransmit them.
When it is determined that a packet loss event has occurred, a packet loss retransmission request packet can be constructed and packet loss retransmission performed, so that the retransmitted data packets are acquired after the packet loss event, ensuring that the third party terminal can obtain as much voice data as possible and avoiding loss of voice data.
Specifically, the expected insertion sequence number and each extension sequence number between the expected insertion sequence number and the target extension sequence number may be used as the second base sequence numbers corresponding to the lost voice packets. The second base sequence numbers may include one or more extension sequence numbers, i.e., one or more voice packets may have been lost. For example, with an expected insertion sequence number of 3501 and a target extension sequence number of 3504, it may be determined that three voice packets with ext_seq 3501, 3502, and 3503 are missing, i.e., the second base sequence numbers may be 3501, 3502, and 3503.
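The determination of the second base sequence numbers is a half-open range from the expected insertion sequence number up to, but not including, the target extension sequence number. A minimal sketch, with a hypothetical function name:

```python
def lost_sequence_numbers(expected_insertion_seq, target_ext_seq):
    """Return the extension sequence numbers of the lost voice packets
    (the second base sequence numbers): every number from the expected
    insertion sequence number up to, but not including, the target
    extension sequence number."""
    return list(range(expected_insertion_seq, target_ext_seq))
```

For the example in the text, `lost_sequence_numbers(3501, 3504)` yields the three missing numbers 3501, 3502, and 3503.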
Illustratively, the left user agent may construct a packet loss retransmission request packet (an RTCP NACK request packet) containing the second base sequence numbers and send it to the SFU server. The SFU server may forward the RTCP NACK request packet to the corresponding terminal (e.g., the terminal corresponding to the target buffer queue). After receiving the RTCP NACK request packet, the terminal may extract the second base sequence numbers therein, such as 3501, 3502, and 3503. If the terminal finds the voice packets corresponding to the second base sequence numbers in its local buffer, it can retransmit them to the SFU server; the SFU server can then send the acquired packet loss data to the left user agent, and the left user agent inserts the retransmitted data packets (packet loss data) into the corresponding positions in the target buffer queue according to the second base sequence numbers.
Step 250, when the mixing condition of the multi-path voice signals is satisfied, respectively obtaining the corresponding voice packets from each buffer queue to form a composite voice packet.
Step 260, obtaining a plurality of decoded voice signals obtained by decoding each voice packet in the composite voice packet by a decoding module in the FREESWITCH communication gateway.
Step 270, performing mixing processing on each decoded voice signal to obtain a mixing packet, and sending the mixing packet to the terminal.
According to the technical scheme, when at least two paths of voice signals sent by the SFU server are received and a target voice packet matching the first target path number is detected, the first base sequence number of the target voice packet is extracted; the current sequence number turnover count and the sequence number turnover threshold of the target buffer queue matching the target voice packet are acquired; a target extension sequence number matching the target voice packet is determined according to the first base sequence number, the current sequence number turnover count, and the sequence number turnover threshold; the target voice packet is inserted into the target buffer queue according to the target extension sequence number; when the mixing condition of the multipath voice signals is met, corresponding voice packets are respectively acquired from each buffer queue to form a composite voice packet; a plurality of decoded voice signals, obtained by decoding each voice packet in the composite voice packet by a decoding module in the FREESWITCH communication gateway, are acquired; and each decoded voice signal is mixed to obtain a mixing packet, which is sent to the terminal, in particular a third party terminal in a third party network. This solves the problem of the third party terminal accessing the SFU conference for communication: the acquired target voice packets can be inserted into the target buffer queue in order, ensuring the storage order and timeliness of the voice packets, reducing voice packet loss, ensuring the comprehensiveness of the data acquired by the terminal, and guaranteeing the service quality of the voice signals. In addition, mixing can be performed through FREESWITCH, so that the third party terminal accesses the SFU conference for communication and hears the multipath voice.
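The target extension sequence number recapped above maps a short base sequence number into a monotonically increasing space using the turnover (rollover) count. A sketch under the assumption that the turnover threshold is 65536, the 16-bit RTP sequence number space; the function name is illustrative:

```python
def extend_sequence_number(base_seq, turnover_count, turnover_threshold=65536):
    """Map a (possibly wrapped) base sequence number to a monotonically
    increasing extension sequence number. turnover_threshold=65536 is an
    assumption corresponding to the 16-bit RTP sequence number space."""
    return turnover_count * turnover_threshold + base_seq
```

For example, a base sequence number of 4 observed after one turnover maps to extension sequence number 65540, which sorts correctly after packets from before the wrap.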
Fig. 3a is a flow chart of yet another method for processing a speech signal according to an embodiment of the present disclosure, which is a further refinement of the foregoing technical solution, where the technical solution in the present embodiment may be combined with one or more of the foregoing implementations.
Specifically, in an optional embodiment of the present disclosure, when a mixing condition of multiple voice signals is satisfied, respectively acquiring corresponding voice packets from each buffer queue to form a composite voice packet, including:
when the mixing condition of the multipath voice signals is met, judging whether one voice packet can be acquired from each buffer queue;
if yes, forming the voice packets respectively acquired from each buffer queue into a composite voice packet;
if not, after waiting for a preset target duration, judging again whether one voice packet can be acquired from each buffer queue from which a voice packet was not acquired last time;
if it is determined that one voice packet can be acquired from each buffer queue from which a voice packet was not acquired last time, forming the voice packets acquired in the two attempts into a composite voice packet;
if it is determined that one voice packet cannot be acquired from each buffer queue from which a voice packet was not acquired last time, constructing a mute packet corresponding to each buffer queue from which a voice packet is still not acquired, and forming the voice packets acquired in the two attempts and the mute packets into a composite voice packet.
Specifically, referring to fig. 3a, the voice signal processing method provided in the present disclosure includes the following steps:
Step 310, when at least two paths of voice signals sent by the SFU server are received, each voice packet corresponding to each path of voice signal is separately stored in a matched buffer queue.
Step 320, when the mixing condition of the multi-path voice signals is satisfied, judging whether a voice packet can be obtained from each buffer queue; if yes, go to step 330; otherwise, step 340 is performed.
Wherein fig. 3b is a schematic diagram of a composite voice packet composition according to an embodiment of the present disclosure. As shown in fig. 3b, when a mixing condition of multiple voice signals is satisfied, such as io_read_frame callback, a voice packet may be obtained from each buffer queue.
Step 330, the voice packets respectively obtained from each buffer queue are formed into a composite voice packet; step 370 is performed.
As shown in fig. 3b, if one voice packet can be acquired from each buffer queue, the acquired voice packets can be directly formed into a composite voice packet. The decoding processing and mixing processing may then be performed on the basis of the composite voice packet.
Step 340, after waiting for the preset target duration, it is determined again whether a voice packet can be acquired from each buffer queue from which a voice packet was not acquired last time.
As shown in fig. 3b, if one voice packet cannot be acquired from each buffer queue, it is possible to wait for a preset target duration and then retry to acquire the voice packet from the buffer queue from which the voice packet was not acquired. The preset target duration may be a shorter duration, for example, 20 ms, so that the problem that the voice data transmission efficiency is too low due to too long waiting duration can be avoided.
Step 350, if it is determined that one voice packet can be acquired from each buffer queue from which a voice packet was not acquired last time, forming the voice packets acquired in the two attempts into a composite voice packet; step 370 is performed.
As shown in fig. 3b, if one voice packet can be acquired from each buffer queue from which no voice packet was acquired last time, the voice packets acquired in the two attempts may be formed into a composite voice packet. This ensures that the composite voice packet contains one voice packet for each path of voice data selected by the SFU server. The decoding processing and mixing processing may then be performed on the basis of the composite voice packet.
Step 360, if it is determined that one voice packet cannot be acquired from each buffer queue from which a voice packet was not acquired last time, constructing a mute packet corresponding to each buffer queue from which a voice packet is still not acquired, and forming the voice packets acquired in the two attempts and the mute packets into a composite voice packet.
As shown in fig. 3b, if one voice packet cannot be acquired from each buffer queue from which a voice packet was not acquired last time, a silence packet (CNG, comfort noise generation) corresponding to each such buffer queue may be constructed. The voice packets acquired in the two attempts and the mute packets are formed into a composite voice packet, and decoding processing and mixing processing are then performed on the composite voice packet. By inserting a mute packet for each buffer queue from which no voice packet was acquired, the composite voice packet contains one packet for each path of voice data selected by the SFU server, so that the mixed audio received by the terminal is closer to what a user would hear in a real situation, improving the authenticity of the voice effect.
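The try-once, wait, then substitute-silence logic of steps 320 to 360 can be sketched as follows. Names, the queue representation (plain lists), and the silence payload are all illustrative assumptions, not the actual implementation:

```python
import time

SILENCE_PACKET = b"\x00" * 160  # hypothetical CNG payload for one 20 ms frame

def form_composite_packet(queues, wait_seconds=0.02):
    """Try to take one packet from every buffer queue; retry once after a
    short wait (the preset target duration), then substitute a silence (CNG)
    packet for any queue that is still empty."""
    composite = {}
    for name, queue in queues.items():
        if queue:
            composite[name] = queue.pop(0)
    missing = [name for name in queues if name not in composite]
    if missing:
        time.sleep(wait_seconds)  # wait the preset target duration, e.g. 20 ms
        for name in missing:
            queue = queues[name]
            # take the packet if one arrived during the wait, else a mute packet
            composite[name] = queue.pop(0) if queue else SILENCE_PACKET
    return composite
```

The short wait bounds the extra delay, while the silence substitution guarantees the composite packet always holds one entry per selected voice path.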
In an alternative embodiment of the present disclosure, the voice packet includes valid voice identification information and a voice energy value; before judging whether one voice packet can be acquired from each buffer queue, the method further includes: detecting whether the number of voice packets currently cached in each buffer queue exceeds a preset quantity threshold, and determining an overload buffer queue according to the detection result; and eliminating, from the overload buffer queue, voice packets whose valid voice identification information is set to invalid and/or whose voice energy value is smaller than or equal to a preset energy threshold.
The valid voice identification information (vad) may represent the validity of the voice packet. For example, vad may indicate whether the data in the voice packet is voice data: when the data in the voice packet is voice data, vad may be true; when the data in the voice packet is non-voice data such as silence or noise, vad may be false. The voice energy value (energy) may represent the energy of the voice packet, for example, the amplitude of the data in the voice packet.
As shown in fig. 3b, it may be determined whether a buffer queue is an overload buffer queue before the voice packets are retrieved from the buffer queues. Specifically, when the mixing condition of the multipath voice signals is met, the left user agent can detect whether the number of voice packets currently cached in each buffer queue exceeds the preset quantity threshold, i.e., judge whether the buffer queue is too long. If the voice packets buffered in a buffer queue exceed the preset quantity threshold, the buffer queue can be determined to be an overload buffer queue. When a buffer queue is determined to be an overload buffer queue, the overload buffer queue can be accelerated according to vad and energy. Specifically, voice packets whose vad is false can be removed from the overload buffer queue; or voice packets whose energy is less than or equal to the preset energy threshold can be removed; or voice packets whose vad is invalid and whose energy is less than or equal to the preset energy threshold can be removed. By accelerating the overload buffer queue, invalid voice packets and/or voice packets with small voice energy values are left unprocessed, so that voice data transmission can be accelerated and the delay introduced by mixing processing reduced. A voice packet whose energy is less than or equal to the preset energy threshold can be understood as one whose content cannot be heard.
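Of the three elimination variants described, the following sketch implements the combined one (drop a packet only when its vad is false and its energy is at or below the threshold). The packet representation and names are illustrative assumptions:

```python
def accelerate_queue(queue, max_len, energy_threshold):
    """If the buffer queue holds more packets than max_len (i.e., it is an
    overload buffer queue), drop packets whose vad flag is False and whose
    energy is at or below the threshold; otherwise return it unchanged.
    Packets are dicts with 'vad' and 'energy' fields, as described above."""
    if len(queue) <= max_len:
        return queue  # not overloaded: keep every packet
    return [p for p in queue
            if p["vad"] or p["energy"] > energy_threshold]
```

A packet that is either valid voice (vad true) or loud enough survives; only inaudible non-voice packets are skipped, which shortens the queue without discarding content a listener would notice.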
Step 370, acquiring a plurality of decoded voice signals obtained by decoding each voice packet in the composite voice packet by a decoding module in the FREESWITCH communication gateway.
Step 380, mixing the decoded voice signals to obtain a mixing packet, and sending the mixing packet to the terminal.
According to the technical scheme, when at least two paths of voice signals sent by the SFU server are received, the voice packets corresponding to each path of voice signal are separately stored in matched buffer queues; when the mixing condition of the multipath voice signals is met, it is judged whether one voice packet can be acquired from each buffer queue; if yes, the voice packets respectively acquired from each buffer queue are formed into a composite voice packet; otherwise, after waiting for a preset target duration, it is judged again whether one voice packet can be acquired from each buffer queue from which a voice packet was not acquired last time; if so, the voice packets acquired in the two attempts are formed into a composite voice packet; if not, mute packets corresponding to each buffer queue from which a voice packet is still not acquired are constructed, and the voice packets acquired in the two attempts and the mute packets are formed into a composite voice packet; a plurality of decoded voice signals, obtained by decoding each voice packet in the composite voice packet by a decoding module in the FREESWITCH communication gateway, are acquired; and each decoded voice signal is mixed to obtain a mixing packet, which is sent to the terminal, in particular a third party terminal in a third party network. This solves the problem of the third party terminal accessing the SFU conference for communication, ensures that the composite voice packet contains one voice packet from each buffer queue, and makes the audio effect after mixing closer to the voice effect in a real environment; the delay introduced by the mixing processing can be compensated by the acceleration processing of the overload buffer queue; and mixing can be performed through FREESWITCH, so that the third party terminal accesses the SFU conference for communication and hears the multipath voice.
On the basis of the above embodiment, optionally, mixing the decoded speech signals to obtain a mixed packet includes: performing time domain superposition on each decoded voice signal to obtain a time domain superposition signal; and if the time domain superposition signal is determined to have signal amplitude overflow, performing overflow processing on the time domain superposition signal according to a preset overflow processing algorithm.
The time domain superposition of the decoded voice signals can be a time domain superposition of the quantized voice signals, which is equivalent to the superposition of sound waves in the air. If the signal amplitude overflows after superposition, overflow processing is needed. The preset overflow processing algorithm may be an averaging algorithm, a normalization algorithm, a clamping algorithm, or the like, which is not particularly limited in the present disclosure. Performing overflow processing through the preset overflow processing algorithm can avoid signal distortion.
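Of the overflow algorithms listed, the following sketch implements the clamping variant for 16-bit quantized samples; frames are modeled as plain integer lists and the function name is illustrative:

```python
def mix_frames(frames, sample_min=-32768, sample_max=32767):
    """Mix decoded 16-bit PCM frames by time domain superposition
    (sample-wise addition), then clamp any overflowing sample back into
    range. Averaging or normalization are alternative overflow algorithms."""
    mixed = []
    for samples in zip(*frames):  # same sample position across every frame
        total = sum(samples)      # superpose the sound waves
        mixed.append(max(sample_min, min(sample_max, total)))
    return mixed
```

Clamping keeps unaffected samples bit-exact and only distorts the overflowing peaks, whereas averaging lowers the loudness of every path; which trade-off is preferable depends on the conference content.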
Fig. 4a is a schematic flow chart of yet another voice signal processing method according to an embodiment of the disclosure, where the method is applicable to a case of accessing a third party terminal in a third party network with a different communication protocol into an SFU conference for communication, and the method may be performed by a voice signal processing apparatus, where the apparatus may be implemented by software and/or hardware, and integrated into an electronic device, such as a FREESWITCH communication gateway.
In particular, the speech signal processing method may be performed by a right-hand user agent in FREESWITCH communications gateway. Referring to fig. 4a, the method specifically comprises the steps of:
step 410, obtain the original voice data packet sent by the terminal.
Wherein the original voice data packet may be transmitted by a third party terminal via a third party network. The third party network may be a PSTN network; the third party terminal may be a PSTN terminal. The original voice data packet may be a voice data packet sent by the third party terminal to the right user agent.
Step 420, analyze the original voice signal in the original voice data packet, and calculate the effective voice identification information and voice energy value corresponding to the original voice signal.
Parsing the original voice signal may include decoding it, extracting the voice signal therein, and processing it. The right user agent may determine the validity and energy value of the original voice signal. For example, when determining that the original voice signal is a voice signal, the right user agent determines the valid voice identification information to be true; when determining that the original voice signal is a non-voice signal, it determines the valid voice identification information to be false. The right user agent may take the amplitude of the original voice signal as the energy value.
Step 430, forming an improved voice data packet which is adopted by the SFU server to perform routing processing according to the effective voice identification information and the voice energy value corresponding to the original voice signal.
The right user agent can add the valid voice identification information and the voice energy value to the header of the original voice data packet to form the improved voice data packet. The valid voice identification information and the voice energy value in the improved voice data packet can serve as the basis for the SFU server's routing processing. In this way, a voice signal produced by the third party terminal, which originally could not undergo routing processing by the SFU server, can be processed into an improved voice data packet, so that the third party terminal can join the SFU conference and the voice data it generates can be heard by the other terminals in the SFU conference.
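A minimal sketch of forming such an improved packet. The energy measure (mean absolute amplitude), the threshold, and the thresholding stand-in for a real VAD decision are all assumptions for illustration; a production system would use a proper voice activity detection algorithm:

```python
def build_improved_packet(original_payload, samples, energy_threshold=500):
    """Attach valid voice identification information (vad) and a voice
    energy value to the original voice data packet so the SFU server can
    use them for routing. Mean absolute amplitude stands in for the real
    energy computation; the threshold test stands in for a VAD algorithm."""
    energy = sum(abs(s) for s in samples) / max(len(samples), 1)
    vad = energy > energy_threshold  # crude stand-in for a VAD decision
    header = {"vad": vad, "energy": energy}
    return {"header": header, "payload": original_payload}
```

The original payload is carried through unchanged; only the header fields the SFU server needs for route selection are added.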
Step 440, send the modified voice data packet to the SFU server.
The signal in the improved voice data packet can be used as a voice signal to be processed by the SFU server in a routing way. When the SFU server selects the signal in the modified voice data packet, the signal in the modified voice data packet may be sent to the left user agent in the FREESWITCH communication gateway, and subsequent buffering, composite voice packet forming, decoding, mixing processing and other operations are performed, so that other terminals in the SFU conference hear the multipath voice signal containing the signal in the modified voice data packet.
According to the technical scheme, an original voice data packet sent by a third party terminal through a third party network is obtained; the original voice signal in the original voice data packet is analyzed, and the valid voice identification information and voice energy value corresponding to the original voice signal are calculated; an improved voice data packet, used by the SFU server for routing processing, is formed according to the valid voice identification information and the voice energy value; and the improved voice data packet is sent to the SFU server. This solves the problem of the third party terminal accessing the SFU conference for communication: the voice data of the third party terminal can participate in the selection process of the SFU server and thus be heard by the other terminals, thereby realizing access of the third party terminal to the SFU conference for communication.
Fig. 4b is a flow chart of yet another voice signal processing method according to an embodiment of the present disclosure. As shown in fig. 4b, a specific use procedure of the voice signal processing method provided in an embodiment of the present disclosure may be as follows. The SFU server may perform routing based on the received voice signals (e.g., the signals in improved voice data packets), and the selected voice signals may be sent to the left user agent in the FREESWITCH communication gateway. The media receiving thread of the left user agent may acquire the corresponding voice packets from the network and insert them into the corresponding buffer queues. When a voice packet is cached, it can be checked whether packet loss has occurred; if so, packet loss retransmission can be performed and the corresponding retransmitted data packet acquired. During voice packet buffering, it can also be checked whether a voice packet has expired; if so, the voice packet can be discarded. When the mixing condition of the multipath voice signals is met, the audio_bridge thread corresponding to the left user agent can call back the io_read_frame function to acquire voice packets: one voice packet can be acquired from each buffer queue as a sub_frame, and the sub_frames form a composite voice packet placed in a frame object. After the io_read_frame function returns, each sub_frame in the frame can be extracted and decoded to obtain a plurality of decoded voice signals, which are then mixed and encoded to obtain a mixing packet. The left user agent may invoke the io_write_frame function of the right user agent to send the mixing packet to the terminal, which may in particular be a third party terminal in a third party network.
The third party terminal can be accessed into the SFU conference, and the speech of a plurality of users in the SFU conference is heard.
Before the audio_bridge thread corresponding to the left user agent calls back the io_read_frame function to acquire voice packets, the left user agent may first check the length of each buffer queue to determine whether it is an overload buffer queue. If a buffer queue is determined to be an overload buffer queue, it can be accelerated. When the audio_bridge thread corresponding to the left user agent calls back the io_read_frame function to acquire voice packets, it can be judged whether one voice packet can be acquired from each buffer queue. If so, the composite voice packet can be formed directly. If not, after waiting 20 ms, acquisition can be retried from each buffer queue from which a voice packet was not acquired. If a voice packet can then be acquired, the composite voice packet can be formed from the voice packets acquired in the two attempts. If a voice packet still cannot be acquired after waiting 20 ms, a mute packet corresponding to each buffer queue from which no voice packet was acquired is inserted, and the composite voice packet is formed from the voice packets acquired in the two attempts and the mute packets.
As shown in fig. 4b, the terminal may send the original voice data packet to the right user agent. After receiving the original voice data packet, the right user agent can calculate the effective voice identification information and the voice energy value of the original voice data packet. For example, the calculation may be performed by a voice activity detection (Voice Activity Detection, VAD) algorithm. Noise reduction processing may also be performed thereafter. The right user agent can form an improved voice data packet according to the original voice data packet, the corresponding effective voice identification information and the voice energy value, and can call the io_write_frame function of the left user agent to send the improved voice data packet to the SFU server, so that voice data of a third party terminal in the third party network can participate in the route selection processing of the SFU service, and the effect that the third party terminal accesses the SFU conference for communication is achieved.
By the voice signal processing method provided by the embodiments of the disclosure, the terminal, in particular a third party terminal in a third party network, can access the SFU conference and hear a mix consistent with that heard by the other terminals, without missing any detail; the voice data of the third party terminal can also undergo routing processing by the SFU server and be sent to the other terminals, realizing indiscriminate communication with the other terminals in the SFU conference.
Fig. 5 is a schematic structural diagram of a voice signal processing apparatus according to an embodiment of the present disclosure, which may be provided in an electronic device. In particular, the apparatus may be implemented by the left user agent in the FREESWITCH communication gateway. As shown in fig. 5, the voice signal processing apparatus 500 includes: the voice packet storage module 510, the composite voice packet forming module 520, the decoded voice signal acquisition module 530, and the mixing packet sending module 540. Wherein:
The voice packet storage module is used for independently storing each voice packet corresponding to each voice signal in a matched cache queue when receiving at least two paths of voice signals sent by the SFU server;
The composite voice packet forming module is used for respectively acquiring corresponding voice packets from each buffer queue to form a composite voice packet when the mixing condition of the multipath voice signals is met;
the decoding voice signal acquisition module is used for acquiring a plurality of decoding voice signals obtained by decoding each voice packet in the composite voice packet by the decoding module in the FREESWITCH communication gateway;
And the mixing packet sending module is used for carrying out mixing processing on each decoded voice signal to obtain a mixing packet and sending the mixing packet to the terminal.
Optionally, the voice packet storage module 510 includes:
a first base sequence number extraction unit, configured to extract the first base sequence number of a target voice packet when the target voice packet matching the first target path number is detected;
a count and threshold acquisition unit, configured to acquire the current sequence number turnover count and the sequence number turnover threshold of the target buffer queue matching the target voice packet;
a target extension sequence number determination unit, configured to determine a target extension sequence number matching the target voice packet according to the first base sequence number, the current sequence number turnover count, and the sequence number turnover threshold;
and a target voice packet inserting unit, configured to insert the target voice packet into the target buffer queue according to the target extension sequence number.
Optionally, the target voice packet inserting unit includes:
a first voice packet detection subunit, configured to detect whether the target voice packet is the first voice packet corresponding to a set voice segment;
a first target voice packet inserting subunit, configured to insert the target voice packet into the target buffer queue according to the target extended sequence number if the target voice packet is detected to be the first voice packet corresponding to the set voice segment;
an expected insertion sequence number determining subunit, configured to determine an expected insertion sequence number according to the last extended sequence number in the target buffer queue if the target voice packet is detected not to be the first voice packet corresponding to the set voice segment;
a second target voice packet inserting subunit, configured to insert the target voice packet into the target buffer queue according to the target extended sequence number if the expected insertion sequence number matches the target extended sequence number;
a packet loss event or retransmission packet determining subunit, configured to determine, if the expected insertion sequence number does not match the target extended sequence number, that a packet loss event has occurred or that the target voice packet is a packet retransmitted after a packet loss event;
a third target voice packet inserting subunit, configured to construct a lost packet retransmission request packet if it is determined that a packet loss event has occurred, and insert the target voice packet into the target buffer queue according to the target extended sequence number;
and a fourth target voice packet inserting subunit, configured to detect, if it is determined that the target voice packet is a packet retransmitted after a packet loss event, whether the target voice packet meets the timeliness requirement; if so, insert the target voice packet into the target buffer queue according to the target extended sequence number; otherwise, discard the target voice packet.
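A minimal sketch of the insertion decision above, assuming the buffer queue stores extended sequence numbers in ascending order. The function name, return strings, and the `is_timely` hook are illustrative, not the patented implementation:

```python
from bisect import insort

def handle_packet(queue: list[int], target_ext: int, is_first: bool,
                  is_timely=lambda ext: True) -> str:
    """Route one packet into a buffer queue of extended sequence numbers."""
    if is_first or not queue:
        queue.append(target_ext)
        return "inserted"
    expected = queue[-1] + 1                 # expected insertion sequence number
    if target_ext == expected:
        queue.append(target_ext)
        return "inserted"
    if target_ext > expected:                # gap ahead: a loss event occurred
        # a real gateway would also build a NACK for [expected, target_ext)
        queue.append(target_ext)
        return "loss-detected"
    # target_ext < expected: a retransmitted copy of an earlier lost packet
    if is_timely(target_ext):
        insort(queue, target_ext)            # place it back in order
        return "retransmission-inserted"
    return "discarded"
```

The timely/late distinction matters because a retransmission that arrives after its playout slot would only delay the mix, so it is dropped rather than inserted.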
Optionally, the third target voice packet inserting subunit is specifically configured to:
determine a second basic sequence number corresponding to the lost voice packet according to the expected insertion sequence number and the target extended sequence number;
and construct a lost packet retransmission request packet according to the second basic sequence number, and send the lost packet retransmission request packet to the SFU server, so as to request, via the SFU server, that the terminal which sent the lost voice packet retransmit it.
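Under the same assumed 16-bit wrap (threshold 65536), the base sequence numbers to carry in the retransmission request can be recovered from the gap between the expected and the actually received extended sequence numbers. This helper is hypothetical, for illustration only:

```python
def lost_base_seqs(expected_ext: int, target_ext: int,
                   threshold: int = 65536) -> list[int]:
    """Recover the (wrapped) base sequence numbers of every packet lost
    between the expected and the received extended sequence numbers."""
    return [ext % threshold for ext in range(expected_ext, target_ext)]
```

Note that working in the extended domain and mapping back with a modulo handles losses that straddle a wrap point, e.g. a gap from 65535 to 65538 yields base numbers 65535, 0, 1.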
Optionally, the composite voice packet forming module 520 includes:
a voice packet acquisition judging unit, configured to judge, when the mixing condition of the multi-channel voice signals is met, whether one voice packet can be acquired from each buffer queue;
a first composite voice packet forming unit, configured to form the voice packets respectively acquired from the buffer queues into a composite voice packet if one voice packet can be acquired from each buffer queue;
a voice packet acquisition re-judging unit, configured to judge again, after waiting for a preset target duration, whether one voice packet can be acquired from each buffer queue from which no voice packet was acquired last time, if one voice packet cannot be acquired from each buffer queue;
a second composite voice packet forming unit, configured to form the voice packets acquired in the two rounds into a composite voice packet if it is determined that one voice packet can be acquired from each buffer queue from which no voice packet was acquired last time;
and a third composite voice packet forming unit, configured to, if it is determined that one voice packet still cannot be acquired from each buffer queue from which no voice packet was acquired last time, construct silence packets respectively corresponding to the buffer queues from which no voice packet can currently be acquired, and form the voice packets acquired in the two rounds and the silence packets into a composite voice packet.
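The two-round acquisition with silence substitution can be sketched in one mixing round as follows. The 320-byte silence payload (20 ms of 8 kHz 16-bit PCM) and the `wait_once` hook are assumptions for illustration; the patent fixes neither:

```python
def collect_for_mix(queues: dict[str, list], wait_once) -> dict[str, bytes]:
    """One mixing round: take a packet from every queue, wait once for
    the stragglers, then substitute a silence packet for any queue that
    is still empty so the mix is never blocked indefinitely."""
    SILENCE = b"\x00" * 320                      # illustrative silence payload
    # first round: pop one packet from every non-empty queue
    got = {k: q.pop(0) for k, q in queues.items() if q}
    missing = [k for k in queues if k not in got]
    if missing:
        wait_once()                              # wait the preset target duration
        for k in missing:                        # second round over stragglers only
            got[k] = queues[k].pop(0) if queues[k] else SILENCE
    return got
```

Waiting only once bounds the added latency to the preset target duration per round, while the silence packets keep every channel represented in the composite packet.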
Optionally, the voice packet includes: valid voice identification information and a voice energy value;
the apparatus further includes:
an overloaded buffer queue determining module, configured to detect, before judging whether one voice packet can be acquired from each buffer queue, whether the number of voice packets currently buffered in each buffer queue exceeds a preset quantity threshold, and determine the overloaded buffer queues according to the detection result;
and a voice packet removing module, configured to remove, from the overloaded buffer queues, the voice packets whose valid voice identification information is set to invalid and/or whose voice energy value is less than or equal to a preset energy threshold.
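A sketch of the overload pruning, assuming each buffered packet is an `(is_voice, energy, payload)` tuple. This layout is illustrative; the patent only requires that the valid voice flag and energy value travel with the packet:

```python
def prune_overloaded(queue: list, max_len: int, energy_threshold: float) -> list:
    """If a buffer queue exceeds the size threshold, keep only packets
    whose valid-voice flag is set and whose energy is above the threshold;
    otherwise leave the queue untouched."""
    if len(queue) <= max_len:
        return queue
    return [p for p in queue if p[0] and p[1] > energy_threshold]
```

Dropping silent or low-energy packets first shrinks a backed-up queue with the least audible impact, since those frames contribute little to the mix anyway.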
Optionally, the mixing packet sending module 540 includes:
a time domain superposition signal acquisition unit, configured to perform time domain superposition on the decoded voice signals to obtain a time domain superposition signal;
and a signal overflow processing unit, configured to perform overflow processing on the time domain superposition signal according to a preset overflow processing algorithm if it is determined that the amplitude of the time domain superposition signal overflows.
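Time domain superposition with one possible overflow treatment, hard clipping to the 16-bit PCM range, can be sketched as follows. The patent leaves the overflow processing algorithm open, so clipping is an assumed choice (attenuation-based schemes are a common alternative):

```python
def mix_pcm16(signals: list[list[int]]) -> list[int]:
    """Sum decoded 16-bit PCM frames sample by sample and clamp any
    overflowing sum back into the representable range (hard clipping)."""
    INT16_MIN, INT16_MAX = -32768, 32767
    mixed = [sum(samples) for samples in zip(*signals)]          # time domain superposition
    return [max(INT16_MIN, min(INT16_MAX, s)) for s in mixed]    # overflow processing
```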
The voice signal processing apparatus provided by this embodiment of the present disclosure can execute the voice signal processing method provided by any embodiment of the present disclosure, and has the functional modules and beneficial effects corresponding to the executed method.
Fig. 6 is a schematic structural diagram of yet another voice signal processing apparatus according to an embodiment of the present disclosure, which may be provided in an electronic device. In particular, the apparatus may be implemented by the right-side user agent in a FreeSWITCH communication gateway. As shown in fig. 6, the voice signal processing apparatus 600 includes: an original voice data packet acquisition module 610, an original voice signal information calculation module 620, an improved voice data packet forming module 630, and an improved voice data packet sending module 640. Wherein:
the original voice data packet acquisition module 610 is configured to acquire an original voice data packet sent by the terminal;
the original voice signal information calculation module 620 is configured to analyze the original voice signal in the original voice data packet, and calculate the valid voice identification information and the voice energy value corresponding to the original voice signal;
the improved voice data packet forming module 630 is configured to form, according to the valid voice identification information and the voice energy value corresponding to the original voice signal, an improved voice data packet on which the SFU server can perform route selection processing;
the improved voice data packet sending module 640 is configured to send the improved voice data packet to the SFU server.
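One way the valid voice flag and energy value could ride with the packet for SFU route selection is a small prepended header. The 5-byte `!Bf` layout below is an assumption for illustration; the patent does not fix a field format:

```python
import struct

def build_improved_packet(original: bytes, is_voice: bool, energy: float) -> bytes:
    """Prepend a header carrying the valid-voice flag (1 byte) and the
    voice energy value (big-endian float32) so an SFU can route on them
    without decoding the payload. Field layout is assumed, not specified."""
    header = struct.pack("!Bf", 1 if is_voice else 0, energy)
    return header + original
```

The SFU would then read only these five bytes to decide which channels to forward, leaving the original payload untouched for the gateway to decode.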
The voice signal processing apparatus provided by this embodiment of the present disclosure can execute the voice signal processing method provided by any embodiment of the present disclosure, and has the functional modules and beneficial effects corresponding to the executed method.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, for example, a voice signal processing method. For example, in some embodiments, the speech signal processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the above-described speech signal processing method may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the speech signal processing method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described here above can be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (12)

1. A method of speech signal processing, performed by a FREESWITCH communications gateway, comprising:
when receiving at least two channels of voice signals sent by the selective forwarding unit SFU server, independently storing the voice packets corresponding to each channel of voice signal in a matched buffer queue;
when the mixing condition of the multi-channel voice signals is met, judging whether one voice packet can be acquired from each buffer queue; wherein the mixing condition of the multi-channel voice signals is that the io_read_frame callback in the FreeSWITCH communication gateway is invoked; the FreeSWITCH communication gateway adopts an n:1 B2BUA switching model to implement communication, wherein n:1 means receiving multiple channels of voice signals and forwarding a single channel of voice signal;
If yes, the voice packets respectively acquired from each buffer queue form a composite voice packet;
if not, after waiting for the preset target time length, judging whether a voice packet can be acquired from each buffer queue which does not acquire the voice packet last time;
if it is determined that one voice packet can be acquired from each buffer queue from which no voice packet was acquired last time, forming the voice packets acquired in the two rounds into a composite voice packet;
if it is determined that one voice packet still cannot be acquired from each buffer queue from which no voice packet was acquired last time, constructing silence packets respectively corresponding to the buffer queues from which no voice packet can currently be acquired, and forming the voice packets acquired in the two rounds and the silence packets into a composite voice packet;
obtaining a plurality of decoded voice signals obtained by decoding each voice packet in the composite voice packet by a decoding module in the telephone softswitch FreeSWITCH communication gateway;
mixing each decoded voice signal to obtain a mixing packet, and sending the mixing packet to a terminal;
wherein independently storing the voice packets corresponding to each channel of voice signal in a matched buffer queue comprises:
when a target voice packet matched with a first target channel is detected, extracting a first basic sequence number of the target voice packet; wherein the first basic sequence number is a sequence number already present in the target voice packet and stored in the header of the target voice packet;
acquiring the current sequence number rollover count and the sequence number rollover threshold of a target buffer queue matched with the target voice packet;
determining a target extended sequence number matched with the target voice packet according to the first basic sequence number, the current rollover count, and the rollover threshold;
inserting the target voice packet into the target buffer queue according to the target extended sequence number;
wherein inserting the target voice packet into the target buffer queue according to the target extended sequence number comprises:
detecting whether the target voice packet is the first voice packet corresponding to a set voice segment;
if so, inserting the target voice packet into the target buffer queue according to the target extended sequence number;
otherwise, determining an expected insertion sequence number according to the last extended sequence number in the target buffer queue;
if the expected insertion sequence number matches the target extended sequence number, inserting the target voice packet into the target buffer queue according to the target extended sequence number;
if the expected insertion sequence number does not match the target extended sequence number, determining that a packet loss event has occurred, or determining that the target voice packet is a packet retransmitted after a packet loss event;
if it is determined that a packet loss event has occurred, constructing a lost packet retransmission request packet, and inserting the target voice packet into the target buffer queue according to the target extended sequence number;
if it is determined that the target voice packet is a packet retransmitted after a packet loss event, detecting whether the target voice packet meets the timeliness requirement; if so, inserting the target voice packet into the target buffer queue according to the target extended sequence number; otherwise, discarding the target voice packet.
2. The method of claim 1, wherein constructing the lost packet retransmission request packet comprises:
determining a second basic sequence number corresponding to the lost voice packet according to the expected insertion sequence number and the target extended sequence number;
and constructing the lost packet retransmission request packet according to the second basic sequence number, and sending the lost packet retransmission request packet to the SFU server, so as to request, via the SFU server, that the terminal which sent the lost voice packet retransmit it.
3. The method of claim 1, wherein the voice packet comprises: valid voice identification information and a voice energy value;
before judging whether one voice packet can be acquired from each buffer queue, the method further comprises:
detecting whether the number of voice packets currently buffered in each buffer queue exceeds a preset quantity threshold, and determining the overloaded buffer queues according to the detection result;
and removing, from the overloaded buffer queues, the voice packets whose valid voice identification information is set to invalid and/or whose voice energy value is less than or equal to a preset energy threshold.
4. The method of claim 1, wherein mixing each of the decoded speech signals to obtain a mixed packet comprises:
performing time domain superposition on each decoded voice signal to obtain a time domain superposition signal;
And if the time domain superposition signal is determined to have signal amplitude overflow, carrying out overflow processing on the time domain superposition signal according to a preset overflow processing algorithm.
5. A method of speech signal processing, comprising:
acquiring an original voice data packet sent by a terminal;
analyzing the original voice signal in the original voice data packet, and calculating the valid voice identification information and the voice energy value corresponding to the original voice signal;
processing, according to the valid voice identification information and the voice energy value corresponding to the original voice signal, the voice signal on which the selective forwarding unit SFU server originally could not perform route selection, to form an improved voice data packet on which the SFU server can perform route selection processing; wherein the improved voice data packet is obtained by adding the valid voice identification information and the voice energy value to the header of the original voice data packet, and the valid voice identification information and the voice energy value in the improved voice data packet serve as the basis for the SFU server to perform route selection processing;
sending the improved voice data packet to the SFU server to be route-selected by the SFU server as a voice signal;
when the SFU server selects the signal in the improved voice data packet, sending the signal in the improved voice data packet to the FreeSWITCH communication gateway as claimed in claim 1 for processing.
6. A speech signal processing device, configured in a FREESWITCH communication gateway, comprising:
The voice packet storage module is used for independently storing each voice packet corresponding to each voice signal in a matched cache queue when receiving at least two paths of voice signals sent by the selective forwarding unit SFU server;
the composite voice packet forming module comprises,
a voice packet acquisition judging unit, configured to judge, when the mixing condition of the multi-channel voice signals is met, whether one voice packet can be acquired from each buffer queue; wherein the mixing condition of the multi-channel voice signals is that the io_read_frame callback in the FreeSWITCH communication gateway is invoked; the FreeSWITCH communication gateway adopts an n:1 B2BUA switching model to implement communication, wherein n:1 means receiving multiple channels of voice signals and forwarding a single channel of voice signal;
a first composite voice packet forming unit, configured to form the voice packets respectively acquired from the buffer queues into a composite voice packet if one voice packet can be acquired from each buffer queue;
a voice packet acquisition re-judging unit, configured to judge again, after waiting for a preset target duration, whether one voice packet can be acquired from each buffer queue from which no voice packet was acquired last time, if one voice packet cannot be acquired from each buffer queue;
a second composite voice packet forming unit, configured to form the voice packets acquired in the two rounds into a composite voice packet if it is determined that one voice packet can be acquired from each buffer queue from which no voice packet was acquired last time;
a third composite voice packet forming unit, configured to, if it is determined that one voice packet still cannot be acquired from each buffer queue from which no voice packet was acquired last time, construct silence packets respectively corresponding to the buffer queues from which no voice packet can currently be acquired, and form the voice packets acquired in the two rounds and the silence packets into a composite voice packet;
The decoding voice signal acquisition module is used for acquiring a plurality of decoding voice signals obtained by decoding each voice packet in the composite voice packets by a decoding module in the telephone soft switch FREESWITCH communication gateway;
the audio mixing packet sending module is used for carrying out audio mixing processing on each decoded voice signal to obtain an audio mixing packet, and sending the audio mixing packet to a terminal;
wherein the voice packet storage module includes:
a first basic sequence number extracting unit, configured to extract, when a target voice packet matched with a first target channel is detected, a first basic sequence number of the target voice packet; wherein the first basic sequence number is a sequence number already present in the target voice packet and stored in the header of the target voice packet;
a count and threshold obtaining unit, configured to obtain the current sequence number rollover count and the sequence number rollover threshold of a target buffer queue matched with the target voice packet;
a target extended sequence number determining unit, configured to determine a target extended sequence number matched with the target voice packet according to the first basic sequence number, the current rollover count, and the rollover threshold;
and a target voice packet inserting unit, configured to insert the target voice packet into the target buffer queue according to the target extended sequence number;
wherein the target voice packet inserting unit includes:
a first voice packet detection subunit, configured to detect whether the target voice packet is the first voice packet corresponding to a set voice segment;
a first target voice packet inserting subunit, configured to insert the target voice packet into the target buffer queue according to the target extended sequence number if the target voice packet is detected to be the first voice packet corresponding to the set voice segment;
an expected insertion sequence number determining subunit, configured to determine an expected insertion sequence number according to the last extended sequence number in the target buffer queue if the target voice packet is detected not to be the first voice packet corresponding to the set voice segment;
a second target voice packet inserting subunit, configured to insert the target voice packet into the target buffer queue according to the target extended sequence number if the expected insertion sequence number matches the target extended sequence number;
a packet loss event or retransmission packet determining subunit, configured to determine, if the expected insertion sequence number does not match the target extended sequence number, that a packet loss event has occurred or that the target voice packet is a packet retransmitted after a packet loss event;
a third target voice packet inserting subunit, configured to construct a lost packet retransmission request packet if it is determined that a packet loss event has occurred, and insert the target voice packet into the target buffer queue according to the target extended sequence number;
and a fourth target voice packet inserting subunit, configured to detect, if it is determined that the target voice packet is a packet retransmitted after a packet loss event, whether the target voice packet meets the timeliness requirement; if so, insert the target voice packet into the target buffer queue according to the target extended sequence number; otherwise, discard the target voice packet.
7. The apparatus of claim 6, wherein the third target voice packet inserting subunit is specifically configured to:
determine a second basic sequence number corresponding to the lost voice packet according to the expected insertion sequence number and the target extended sequence number;
and construct a lost packet retransmission request packet according to the second basic sequence number, and send the lost packet retransmission request packet to the SFU server, so as to request, via the SFU server, that the terminal which sent the lost voice packet retransmit it.
8. The apparatus of claim 6, wherein the voice packet comprises: valid voice identification information and a voice energy value;
the apparatus further comprises:
an overloaded buffer queue determining module, configured to detect, before judging whether one voice packet can be acquired from each buffer queue, whether the number of voice packets currently buffered in each buffer queue exceeds a preset quantity threshold, and determine the overloaded buffer queues according to the detection result;
and a voice packet removing module, configured to remove, from the overloaded buffer queues, the voice packets whose valid voice identification information is set to invalid and/or whose voice energy value is less than or equal to a preset energy threshold.
9. The apparatus of claim 6, wherein the mix packet transmission module comprises:
A time domain superposition signal obtaining unit, configured to perform time domain superposition on each of the decoded speech signals to obtain a time domain superposition signal;
And the signal overflow processing unit is used for carrying out overflow processing on the time domain superposition signal according to a preset overflow processing algorithm if the time domain superposition signal is determined to have signal amplitude overflow.
10. A speech signal processing apparatus comprising:
the original voice data packet acquisition module is used for acquiring an original voice data packet sent by the terminal;
The original voice signal information calculation module is used for analyzing and processing the original voice signal in the original voice data packet and calculating effective voice identification information and voice energy value corresponding to the original voice signal;
an improved voice data packet forming module, configured to process, according to the valid voice identification information and the voice energy value corresponding to the original voice signal, the voice signal on which the selective forwarding unit SFU server originally could not perform route selection, to form an improved voice data packet on which the SFU server can perform route selection processing; wherein the improved voice data packet is obtained by adding the valid voice identification information and the voice energy value to the header of the original voice data packet, and the valid voice identification information and the voice energy value in the improved voice data packet serve as the basis for the SFU server to perform route selection processing;
an improved voice data packet sending module, configured to send the improved voice data packet to the SFU server to be route-selected by the SFU server as a voice signal;
wherein the improved voice data packet sending module is further configured to, when the SFU server selects the signal in the improved voice data packet, send the signal in the improved voice data packet to the FreeSWITCH communication gateway as claimed in claim 1 for processing.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-5.
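The claims leave the SFU's route-selection policy itself to the server. A common policy (an assumption here, not specified by the claims) is to forward only the top-N streams that currently carry valid voice, ranked by the energy value found in each improved packet header:

```python
def select_routes(streams, top_n=3):
    """Pick up to top_n streams to forward: keep only streams whose VAD
    flag marks valid voice, then rank them by energy value (descending).
    `streams` is a list of dicts with assumed keys: id, vad, energy."""
    active = [s for s in streams if s["vad"]]
    active.sort(key=lambda s: s["energy"], reverse=True)
    return [s["id"] for s in active[:top_n]]

streams = [
    {"id": "a", "vad": 1, "energy": 500},
    {"id": "b", "vad": 0, "energy": 900},   # no valid voice: excluded despite high energy
    {"id": "c", "vad": 1, "energy": 800},
    {"id": "d", "vad": 1, "energy": 100},
]
print(select_routes(streams, top_n=2))  # ['c', 'a']
```

Filtering on the VAD flag before ranking by energy is what lets the server drop silent or noise-only streams cheaply, without ever decoding audio.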
CN202111572482.4A 2021-12-21 2021-12-21 Voice signal processing method, device, electronic equipment and readable storage medium Active CN114285830B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111572482.4A CN114285830B (en) 2021-12-21 2021-12-21 Voice signal processing method, device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111572482.4A CN114285830B (en) 2021-12-21 2021-12-21 Voice signal processing method, device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN114285830A CN114285830A (en) 2022-04-05
CN114285830B true CN114285830B (en) 2024-05-24

Family

ID=80873648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111572482.4A Active CN114285830B (en) 2021-12-21 2021-12-21 Voice signal processing method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114285830B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023201944A1 (en) * 2022-04-22 2023-10-26 华为技术有限公司 Signal sending method and communication apparatus
CN116112126B (en) * 2023-04-12 2023-06-09 深圳市云天智能通讯有限公司 Self-adaptive noise reduction method and device based on conversation scene and communication terminal

Citations (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1853093A1 (en) * 2006-05-04 2007-11-07 LG Electronics Inc. Enhancing audio with remixing capability
CN102340426A (en) * 2010-07-26 2012-02-01 中国移动通信集团重庆有限公司 Method and device for evaluating voice quality of voice over Internet protocol (VOIP)
CN104167210A (en) * 2014-08-21 2014-11-26 华侨大学 Lightweight multi-party conference sound mixing method and device
CN104506287A (en) * 2014-12-29 2015-04-08 重庆邮电大学 Real-time sound communication method under TD-LTE (time division-long term evolution) emergency communication
CN107104974A (en) * 2017-05-18 2017-08-29 福建星网智慧科技股份有限公司 High-performance media forwarding method and device based on freeswitch
CN107395904A (en) * 2017-07-28 2017-11-24 河北远东通信系统工程有限公司 Distributed call recording method
CN107800902A (en) * 2017-09-15 2018-03-13 北京容联易通信息技术有限公司 Sound mixing method and system for multi-channel voice
CN109219078A (en) * 2017-07-04 2019-01-15 大唐移动通信设备有限公司 Voice loss processing method and device
CN109672610A (en) * 2018-12-26 2019-04-23 深圳市自然门科技有限公司 Multi-group speech real-time communication method and system
CN110225287A (en) * 2019-04-29 2019-09-10 视联动力信息技术股份有限公司 Audio processing method and device
CN110299144A (en) * 2018-03-21 2019-10-01 腾讯科技(深圳)有限公司 Audio mixing method, server and client
CN110661927A (en) * 2019-09-18 2020-01-07 平安科技(深圳)有限公司 Voice interaction method and device, computer equipment and storage medium
CN110689876A (en) * 2019-10-14 2020-01-14 腾讯科技(深圳)有限公司 Voice recognition method and device, electronic equipment and storage medium
CN110913421A (en) * 2018-09-18 2020-03-24 大唐移动通信设备有限公司 Method and device for determining voice packet number
CN111276152A (en) * 2020-04-30 2020-06-12 腾讯科技(深圳)有限公司 Audio processing method, terminal and server
CN111326166A (en) * 2020-02-25 2020-06-23 网易(杭州)网络有限公司 Voice processing method and device, computer readable storage medium and electronic equipment
CN111628992A (en) * 2020-05-26 2020-09-04 腾讯科技(深圳)有限公司 Multi-person call control method and device, electronic equipment and storage medium
CN112019947A (en) * 2020-08-14 2020-12-01 杭州飞畅科技有限公司 Emergency wireless network voice switch
CN112118264A (en) * 2020-09-21 2020-12-22 苏州科达科技股份有限公司 Conference sound mixing method and system
CN112382294A (en) * 2020-11-05 2021-02-19 北京百度网讯科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112767920A (en) * 2020-12-31 2021-05-07 深圳市珍爱捷云信息技术有限公司 Method, device, equipment and storage medium for recognizing call voice
CN112822183A (en) * 2020-12-30 2021-05-18 北京捷通华声科技股份有限公司 Voice processing method and device, computer readable storage medium and processor
CN112885329A (en) * 2021-02-02 2021-06-01 广州广哈通信股份有限公司 Control method and device for improving sound mixing quality and storage medium
CN113257257A (en) * 2021-07-14 2021-08-13 统信软件技术有限公司 Method, device and equipment for processing mixed sound of multiple paths of voice signals and storage medium
CN113746808A (en) * 2021-08-12 2021-12-03 杭州网易智企科技有限公司 Converged communication method for online conference, gateway, electronic device, and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100466671C (en) * 2004-05-14 2009-03-04 华为技术有限公司 Method and device for switching speeches
US8891740B2 (en) * 2012-04-11 2014-11-18 International Business Machines Corporation Voice input state identification


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Xuehui Zhang; Weiping Hu; Ying Zeng. Voice Pathology Detection Based on the Vocal Fold Signal and the Vocal Tract Signal Separation. 2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC). 2018, full text. *
Research on an IP voice conference system based on DSP+FreeSWITCH; Xu Wei; Modern Industrial Economy and Informatization (No. 02); full text *
Improvement of the E-Model speech quality evaluation model with packet analysis; Yang Yingjie; Zhao Wanma; Wang Chong; Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition); 2017-04-15 (No. 02); full text *
An improved audio mixing algorithm for video conference systems and its implementation; Zou Lizi; Liu Lin; Communications Technology (No. 02); full text *

Also Published As

Publication number Publication date
CN114285830A (en) 2022-04-05

Similar Documents

Publication Publication Date Title
CN114285830B (en) Voice signal processing method, device, electronic equipment and readable storage medium
US8605620B2 (en) System for transmitting high quality speech signals on a voice over internet protocol network
US7656861B2 (en) Method and apparatus for interleaving text and media in a real-time transport session
WO2017202290A1 (en) Method and device for transmitting speech data
US10187432B2 (en) Replaying content of a virtual meeting
KR20160126030A (en) Efficiently mixing voip data
CN110430102B (en) Telephone recording method based on IMS
WO2015192682A1 (en) Communication method and terminal
CN109714494B (en) Call bill generation method and system and call bill generation server
US20200045091A1 (en) On demand in-band signaling for conferences
US20170105141A1 (en) Method for shortening a delay in real-time voice communication and electronic device
US9912617B2 (en) Method and apparatus for voice communication based on voice activity detection
WO2019100475A1 (en) Data transmission method, apparatus and device, and storage medium
CN113746602A (en) Network real-time call method, device and system
KR20150026405A (en) Method for transmitting and receiving voice packet and electronic device implementing the same
CN113612759A (en) High-performance high-concurrency intelligent broadcasting system based on SIP protocol and implementation method
US11070665B2 (en) Voice over internet protocol processing method and related network device
CN107864161A (en) Data transmission method and device
CN111478864A (en) Data packet processing method and device, electronic equipment and storage medium
US20160182191A1 (en) Voip system
CN114242067A (en) Speech recognition method, apparatus, device and storage medium
CN105743806B (en) Energy-threshold-based voice buffer loss processing method for the TCP protocol
WO2020211614A1 (en) Media packet forwarding method, forwarding server, and storage medium
CN104702807B (en) VoIP communication system
CN114640893A (en) Method and device for connecting service end of real-time audio and video communication system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant