CN117995159A

CN117995159A - Voice signal processing method and device and electronic equipment

Info

Publication number: CN117995159A
Application number: CN202211377686.7A
Authority: CN
Inventors: 梁俊斌
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-11-04
Filing date: 2022-11-04
Publication date: 2024-05-07

Abstract

The application discloses a voice signal processing method, a voice signal processing device, and electronic equipment. The method comprises the following steps: during a voice call with a first call end, when the transmission network between the first call end and the current second call end is abnormal, receiving a target data packet sent by the first call end, wherein the target data packet carries target text information; performing a voice synthesis operation on the target text information to obtain a first voice signal corresponding to the target text information; and playing the first voice signal through a voice playing component on the second call end. This technical scheme solves the problem, present in related-art voice call processing methods, of poor call quality caused by network abnormality in the transmission network.

Description

Voice signal processing method and device and electronic equipment

Technical Field

The present application relates to the field of computers, and in particular, to a method and an apparatus for processing a voice signal, and an electronic device.

Background

At present, in the process of voice communication, a collection device collects sound to obtain a sound signal, the collected sound signal is compressed and encoded into audio code stream data through an audio encoder, and the audio code stream data is transmitted to a receiver through a transmission network; the receiver decodes the received audio code stream data through the audio decoder, restores the audio code stream data to a voice signal, and plays the restored voice signal.

Because the audio code stream data is transmitted through the transmission network, if the transmission quality is poor in a certain section of the transmission network link due to weak wireless signal coverage intensity, the abnormal transmission conditions such as packet loss, bandwidth limitation and the like can occur, so that the call quality is affected. As can be seen from the above, the voice call processing method in the related art has a problem of poor call quality due to network anomalies occurring in the transmission network.

Disclosure of Invention

The embodiment of the application provides a voice signal processing method, a voice signal processing device and electronic equipment, which at least solve the problem of poor call quality caused by network abnormality of a transmission network in the voice call processing method in the related technology.

According to an aspect of an embodiment of the present application, there is provided a method for processing a voice signal, including: during a voice call with a first call end, when the transmission network between the first call end and the current second call end is abnormal, receiving a target data packet sent by the first call end, wherein the target data packet carries target text information; performing a voice synthesis operation on the target text information to obtain a first voice signal corresponding to the target text information; and playing the first voice signal through a voice playing component on the second call end.

According to another aspect of the embodiment of the present application, there is also provided a method for processing a voice signal, including: during a voice call with a second call end, when the transmission network between the current first call end and the second call end is abnormal, acquiring a target voice signal to be transmitted; converting the target voice signal into target text information by performing voice recognition on the target voice signal; and sending a target data packet to the second call end through the transmission network, wherein the target data packet carries the target text information.

As an alternative, the method further comprises: performing speech rate detection on the target voice signal to obtain speech rate parameter information corresponding to the target voice signal, wherein the target data packet also carries the speech rate parameter information.

As an alternative, the method further comprises: extracting the sound characteristics of the target call object from the second voice signal acquired by the voice acquisition component of the first call end; and respectively matching the sound characteristics corresponding to each preset sound model in the set of preset sound models with the sound characteristics of the target call object to obtain the matching degree of each preset sound model and the target call object, and determining the model identification of the preset sound model with the highest matching degree with the target call object in the set of preset sound models as a target model identification, wherein the target model identification is sent to the second call end.

According to still another aspect of the embodiment of the present application, there is also provided a processing apparatus for a voice signal, including: a first receiving unit, configured to receive, during a voice call with a first call end, a target data packet sent by the first call end when the transmission network between the first call end and the current second call end is abnormal, wherein the target data packet carries target text information; an execution unit, configured to perform a voice synthesis operation on the target text information to obtain a first voice signal corresponding to the target text information; and a playing unit, configured to play the first voice signal through a voice playing component on the second call end.

As an alternative, the apparatus further comprises: the second receiving unit is used for receiving the voice data packet periodically sent by the first call end through the transmission network; a first determining unit, configured to determine a network state of the transmission network according to a receiving result of the voice data packet and an expected receiving result of the voice data packet, where the network state of the transmission network is used to indicate whether the transmission network is abnormal; the first sending unit is configured to send first indication information to the first call end according to a network state of the transmission network, where the first indication information is used to indicate whether the transmission network has an anomaly.

As an alternative, the first determining unit includes: the first determining module is used for determining the packet loss rate of the voice data packets according to the number of the received voice data packets and the number of the voice data packets expected to be received; the second determining module is used for determining that the transmission network is abnormal under the condition that the packet loss rate of the voice data packet is greater than or equal to a packet loss rate threshold value; and the third determining module is used for determining that the transmission network is normal under the condition that the packet loss rate of the voice data packet is smaller than a packet loss rate threshold value.
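
The packet-loss check described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the function names and the example threshold of 0.2 are assumptions chosen for the sketch.

```python
def packet_loss_rate(received: int, expected: int) -> float:
    """Fraction of the expected voice data packets that did not arrive."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    lost = max(expected - received, 0)  # never report a negative loss
    return lost / expected


def network_is_abnormal(received: int, expected: int, threshold: float) -> bool:
    # Abnormal when the loss rate is greater than or equal to the threshold,
    # normal when it is smaller, as stated in the text above.
    return packet_loss_rate(received, expected) >= threshold


# Hypothetical threshold value, used only for illustration.
EXAMPLE_THRESHOLD = 0.2
print(network_is_abnormal(80, 100, EXAMPLE_THRESHOLD))
print(network_is_abnormal(95, 100, EXAMPLE_THRESHOLD))
```

The receiving end would report this result to the first call end as the first indication information.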

As an alternative, the apparatus further comprises: a third receiving unit, configured to receive a plurality of groups of probe packets sent sequentially by the first call end through the transmission network in order of packet length from small to large, where each group of probe packets includes a plurality of probe packets with the same packet length, and the packet lengths corresponding to different groups of probe packets are different; a second determining unit, configured to determine a bandwidth probe value corresponding to the transmission network according to the reception result of each group of probe packets and the expected reception result of each group of probe packets, where the bandwidth probe value is the maximum packet length among the packet lengths of all groups of probe packets whose reception result is consistent with the expected reception result; a third determining unit, configured to determine that an abnormality exists in the transmission network when the bandwidth probe value is greater than or equal to a bandwidth probe threshold; a fourth determining unit, configured to determine that the transmission network is normal when the bandwidth probe value is smaller than the bandwidth probe threshold; and a second sending unit, configured to send second indication information to the first call end, where the second indication information is used to indicate whether the transmission network is abnormal.

As an alternative, the third receiving unit includes: the receiving module is used for receiving a target group detection packet sent by the first call end through the transmission network, wherein the target group detection packet corresponds to the target packet in length; the first sending module is configured to send third indication information to the first call end when a receiving result of the target group detection packet is consistent with an expected receiving result of the target group detection packet, where the third indication information is used to indicate the first call end to continue sending a next group of detection packets, and a packet length corresponding to the next group of detection packets is greater than the target packet length; and the second sending module is used for sending fourth indication information to the first call end when the receiving result of the target group detection packet is inconsistent with the expected receiving result of the target group detection packet, wherein the fourth indication information is used for indicating the first call end to stop sending the detection packet.
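
The group-by-group probing described above can be sketched as a simple loop. This is a simplified, single-process illustration: `send_group` is a hypothetical callback standing in for "send one group of probe packets and learn whether the receiver's result matched the expected result", and the function name is an assumption.

```python
from typing import Callable, Optional, Sequence


def probe_bandwidth(
    packet_lengths: Sequence[int],
    send_group: Callable[[int], bool],
) -> Optional[int]:
    """Return the largest packet length whose probe group was received as expected.

    Groups are tried in ascending order of packet length. Probing stops at the
    first group whose reception result differs from the expected result.
    """
    best: Optional[int] = None
    for length in sorted(packet_lengths):
        if not send_group(length):
            break  # corresponds to the "fourth indication": stop probing
        best = length  # corresponds to the "third indication": try a larger group
    return best


# Simulated link that only delivers probe groups of up to 400 bytes.
print(probe_bandwidth([200, 100, 400, 800], lambda n: n <= 400))
```

The returned value plays the role of the bandwidth probe value; `None` means that not even the smallest group was received as expected.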

As an alternative, the execution unit includes: and the execution module is used for executing text-to-speech conversion operation on the target text information by using a target sound model matched with the first call end to obtain the first speech signal.

As an alternative, the apparatus further comprises: the searching unit is used for searching the sound model matched with the target object identifier by using the target object identifier of the target call object of the first call end before the target sound model matched with the first call end is used for performing text-to-speech conversion operation on the target text information; a fifth determining unit configured to determine, as the target acoustic model, an acoustic model that matches the target object identifier if an acoustic model that matches the target object identifier is found; and a sixth determining unit, configured to determine, as the target acoustic model, an acoustic model identified by a target model identifier in a set of preset acoustic models without finding an acoustic model that matches the target object identifier, where the target model identifier is a model identifier indicated by the first call end.
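
The lookup-with-fallback logic above can be sketched as follows. The dictionaries and names are hypothetical stand-ins for whatever model storage an implementation uses.

```python
from typing import Dict, Hashable, TypeVar

Model = TypeVar("Model")


def select_sound_model(
    target_object_id: Hashable,
    personal_models: Dict[Hashable, Model],
    preset_models: Dict[Hashable, Model],
    target_model_id: Hashable,
) -> Model:
    """Prefer the sound model matched to the call object; otherwise use a preset one."""
    model = personal_models.get(target_object_id)
    if model is not None:
        return model
    # Fallback: the preset model whose identifier was indicated by the first call end.
    return preset_models[target_model_id]


personal = {"user-1": "model-of-user-1"}
presets = {"preset-a": "generic-a", "preset-b": "generic-b"}
print(select_sound_model("user-1", personal, presets, "preset-b"))
print(select_sound_model("user-2", personal, presets, "preset-b"))
```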

As an alternative, the apparatus further comprises: the extraction unit is used for extracting the sound characteristics of the target call object from the second voice signal acquired by the voice acquisition component of the first call end; the matching unit is used for respectively matching the sound characteristics corresponding to each preset sound model in the set of preset sound models with the sound characteristics of the target call object to obtain the matching degree of each preset sound model and the target call object; and a seventh determining unit, configured to determine, as the target model identifier, a model identifier of a preset acoustic model with the highest matching degree with the target call object in the set of preset acoustic models.
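
A minimal sketch of choosing the preset sound model with the highest matching degree is shown below. The text does not specify how the matching degree is computed; cosine similarity over fixed-length feature vectors is an assumption made only for this illustration, as are the function names.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_preset_model(
    speaker_features: Sequence[float],
    preset_features: Dict[str, Sequence[float]],
) -> str:
    """Return the identifier of the preset model most similar to the speaker."""
    return max(
        preset_features,
        key=lambda model_id: cosine_similarity(speaker_features, preset_features[model_id]),
    )


presets = {"preset-a": [0.0, 1.0], "preset-b": [1.0, 0.1]}
print(best_preset_model([1.0, 0.0], presets))
```

The returned identifier corresponds to the target model identifier that is sent to the second call end.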

As an alternative, the apparatus further comprises: and the adjusting unit is used for adjusting the speech speed parameter of the first speech signal according to the speech speed parameter indicated by the speech speed parameter information after the speech synthesis operation is carried out on the target text information to obtain the first speech signal corresponding to the target text information, so as to obtain the adjusted first speech signal, wherein the speech speed parameter information is carried in the target data packet.

As an alternative solution, the transmission link from the first call end to the second call end is a first transmission link in the transmission network, and the transmission link from the second call end to the first call end is a second transmission link in the transmission network; the apparatus further comprises: the first acquisition unit is used for carrying out voice acquisition through the voice acquisition component of the second call end under the condition that the second transmission link is normal, so as to obtain a third voice signal; the coding unit is used for carrying out voice coding on the third voice signal to obtain audio code stream data corresponding to the third voice signal; and the transmission unit is used for transmitting the audio code stream data to the first call end through the second transmission link.

As an alternative, the apparatus further comprises: a second acquisition unit, configured to perform voice acquisition through the voice acquisition component of the first call end when the transmission network is abnormal, so as to obtain a target voice signal; a recognition unit, configured to convert the target voice signal into the target text information by performing voice recognition on the target voice signal; and a third sending unit, configured to send the target data packet carrying the target text information to the second call end through the transmission network.

According to still another aspect of the embodiment of the present application, there is also provided a processing apparatus for a voice signal, including: an acquisition unit, configured to acquire, during a voice call with a second call end, a target voice signal to be transmitted when the transmission network between the current first call end and the second call end is abnormal; a conversion unit, configured to convert the target voice signal into target text information by performing voice recognition on the target voice signal; and a sending unit, configured to send a target data packet to the second call end through the transmission network, wherein the target data packet carries the target text information.

According to yet another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the above-described method of processing a speech signal when run.

According to yet another aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the processing method of the voice signal as above.

According to still another aspect of the embodiments of the present application, there is also provided an electronic device including a memory in which a computer program is stored, and a processor configured to execute the above-described processing method of a voice signal by the computer program.

In the embodiment of the application, if the transmission network between the sending end (the terminal equipment of the sending party) and the receiving end (the terminal equipment of the receiving party) is abnormal (such as a weak network environment) in the process of carrying out voice communication, the sending end sends a data packet carrying text information to the receiving party through the transmission network, and compared with voice data, the data quantity of the text information is smaller, so that the success rate of information transmission in the weak network environment can be improved; meanwhile, at the receiving side, if the transmitted text information is received in the voice call process, the text information is restored to a voice signal in a voice synthesis mode, and other operations are not required to be executed by a caller of the receiving side, namely, the voice call can still be carried out in a voice listening mode, so that the convenience of the voice call can be ensured, the technical effect of improving the call quality of the voice call is achieved, and the problem of poor call quality caused by network abnormality in a voice call processing method in the related art is solved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

FIG. 1 is a schematic illustration of an application environment of an alternative speech signal processing method according to an embodiment of the application;

FIG. 2 is a flow chart of an alternative method of processing a speech signal according to an embodiment of the application;

FIG. 3 is a schematic diagram of an alternative method of processing a speech signal according to an embodiment of the application;

FIG. 4 is a schematic diagram of another alternative method of processing a speech signal according to an embodiment of the application;

FIG. 5 is a schematic diagram of yet another alternative method of processing a speech signal according to an embodiment of the application;

FIG. 6 is a schematic diagram of yet another alternative method of processing a speech signal according to an embodiment of the application;

FIG. 7 is a schematic diagram of yet another alternative method of processing a speech signal according to an embodiment of the application;

FIG. 8 is a flow chart of yet another alternative method of processing a speech signal according to an embodiment of the application;

FIG. 9 is a flow chart of yet another alternative method of processing a speech signal according to an embodiment of the application;

FIG. 10 is a flow chart of yet another alternative method of processing a speech signal according to an embodiment of the application;

FIG. 11 is a block diagram of an alternative speech signal processing apparatus according to an embodiment of the present application;

FIG. 12 is a block diagram of an alternative speech signal processing apparatus according to an embodiment of the present application;

FIG. 13 is a block diagram of an alternative electronic device in accordance with an embodiment of the present application;

FIG. 14 is a block diagram of the architecture of a computer system of an alternative electronic device in accordance with an embodiment of the present application.

Detailed Description

In order that those skilled in the art may better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some embodiments of the present application, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

According to an aspect of the embodiments of the present application, a method for processing a voice signal is provided. As an optional implementation, the method may be applied to, but is not limited to, the environment shown in FIG. 1. The environment includes, but is not limited to, a first call end 102, a second call end 104, and a network 106; each call end includes, but is not limited to, a display, a processor, and a memory.

The specific process comprises the following steps:

In step S102, the first call end 102 performs voice recognition on the voice signal to be sent and converts the voice signal into text information.

In the process of performing a voice call between the first call end 102 and the second call end 104, if a transmission network between the first call end 102 and the second call end 104 is abnormal, the first call end 102 may perform voice recognition on a voice signal to be sent, and convert the voice signal into text information.

In step S104, the text information is sent to the second call end 104 through the network 106.

In step S106, after receiving the text information, the second call end 104 converts the text information into a voice signal and plays the voice signal.

Optionally, the call end includes, but is not limited to, at least one of the following: mobile phones (such as Android mobile phones and iOS mobile phones), notebook computers, tablet computers, palmtop computers, Mobile Internet Devices (MID for short), PADs, desktop computers, smart home appliances, and vehicle-mounted devices. The network 106 may include, but is not limited to, a wired network and a wireless network, where the wired network includes local area networks, metropolitan area networks, and wide area networks, and the wireless network includes Bluetooth, Wireless Fidelity (Wi-Fi for short), and other networks that enable wireless communication. The above is merely an example and does not limit the present embodiment in any way.

Optionally, the above method for processing the voice signal may be performed by the second call end 104 alone, or jointly by the first call end 102 and the second call end 104. As an optional implementation, taking execution by the second call end 104 as an example, FIG. 2 is a schematic flow chart of an alternative method for processing a voice signal according to an embodiment of the present application. As shown in FIG. 2, the flow of the method may include the following steps:

Step S202, in the process of carrying out voice call with the first call end, receiving a target data packet sent by the first call end under the condition that the transmission network between the first call end and the current second call end is abnormal, wherein the target data packet carries target text information.

The method for processing the voice signal in this embodiment can be applied to scenarios in which voice signals are processed during a voice call among a plurality of call ends. The call ends may be terminal devices of the same type or of different types, and may run the same operating system or different operating systems. The voice call may be initiated by the same application program running on the call ends and carried out through that application program, for example within an audio/video call or live broadcast service in the application program. The voice call may be of various types, for example a voice call based on an IP (Internet Protocol) network. This embodiment places no limitation on the call ends, the voice call, and so on.

In the related art, voice calls are carried out by compressing and transmitting the voice signal with a voice codec, that is, the voice signal is compressed by a voice coding model before transmission. As shown in FIG. 3, the flow of such a voice call is as follows: a voice signal is recorded at the transmitting end (the call end, which may be a terminal acquisition device); the voice signal is compressed and encoded by an audio encoder to obtain audio code stream data; the audio code stream data is transmitted to the receiving party through the transmission network; the receiving party decodes the data with an audio decoder to restore the voice signal; and the voice is played through a loudspeaker or similar device.

Take VoIP as an example. VoIP (Voice over Internet Protocol) is a voice communication mode based on an IP network: the voice data is encoded and compressed by a voice compression algorithm and then packed according to a network transmission protocol standard; the data packets are sent to the destination IP address through the IP network; and after the voice data packets are parsed and decompressed, the original voice signal is recovered, thereby transmitting the voice signal over the Internet.

In speech signal encoding, commonly used audio codecs (or speech codecs) include AMR, G.722, EVRC, SILK, OPUS, and AAC. These audio encoders can compress the input digital sound signal by a factor of ten to several tens, and the compressed code stream is transmitted to the receiver, where the corresponding decoder decodes it. However, the information content of a sound signal is large: for example, the wideband signal sampled at 16 kHz that is commonly used in VoIP has a bit rate of 256 kbps. Although the audio encoder compresses the input signal substantially and is suitable for most application scenarios, its compression capability is limited. A conventional speech encoder such as SILK usually compresses to a code rate of about 16 kbps, and even an ultra-low code rate encoder (LPCNet) can only compress down to a code rate of 1.6 kbps. In some special cases, such as poor network conditions, weak wireless signals, serious packet loss, or extremely low bandwidth, transmission still cannot proceed well, and problems such as sound loss and stuttering occur.
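
The figures quoted above can be checked with a short calculation. The 16-bit sample depth and single channel are assumptions (the text states only the 16 kHz sampling rate and the 256 kbps result, which these values reproduce); the function name is illustrative.

```python
def pcm_bit_rate_bps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Bit rate of uncompressed PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


# Assumed: 16-bit mono samples at the 16 kHz rate mentioned in the text.
raw_bps = pcm_bit_rate_bps(16_000, 16)
print(raw_bps)            # uncompressed bit rate in bit/s
print(raw_bps / 16_000)   # compression ratio at a 16 kbps code rate
print(raw_bps / 1_600)    # compression ratio at a 1.6 kbps code rate
```

Under these assumptions the 16 kbps and 1.6 kbps code rates correspond to compression ratios of 16 and 160 relative to the raw signal.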

For example, in some extremely weak and unstable network scenarios, such as mountainous areas and underground environments with poor wireless signals, the transmission network is affected by weak wireless coverage, limited bandwidth, and unstable signals. As a result, transmission quality is poor on some section of the end-to-end transmission network link between the call ends, which causes serious packet loss: all or most of the compressed voice data packets fail to reach the receiver, and the receiver experiences frequent stuttering, silence, and similar problems.

Taking VoIP as an example, the reliability of the transmission network directly affects the VoIP user experience. When serious packet loss occurs in the transmission network, for example when several consecutive frames fail to reach the receiving party (i.e., the answering party) as expected, the voice heard by the receiving party stutters; if packets are lost continuously for a long time, the receiving party cannot hear the other party's voice at all.

To ensure that the voice signal can still reach the receiver in a weak network environment, in this embodiment, when an abnormality of the transmission network is detected, the voice signal to be transmitted is converted into text information, for example by ASR (Automatic Speech Recognition) technology, and the text information is transmitted to the receiver. After receiving the text information, the answering party may restore it to a voice signal by TTS (Text To Speech) and play the reproduced voice signal. Because the voice signal is converted into text information for transmission, the original voice signal (a digital signal of 256 kbit per second) can be reduced to text information of only tens of bits, which improves the success rate of information transmission in a weak network environment.

The call end may be configured with at least two call modes, which may include a voice coding and decoding mode and an ASR/TTS mode. During a voice call, the call end can switch between call modes according to the network state: the voice coding and decoding mode is used when the network is normal, and the call end switches to the ASR/TTS mode when the network state is abnormal (for example, in a weak network environment). In the ASR/TTS mode, the sending end converts the voice signal into text information for transmission, and the receiving end restores the received text information to a voice signal. Switching between call modes may be performed by a mode control unit (which may be a program module) at the call end.

For the first call end, during a voice call between the first call end and the second call end, the first call end may initially transmit voice signals in the voice coding and decoding mode. If an abnormality is detected in the transmission network between the first call end and the second call end, for example packet loss or bandwidth limitation on the transmission link over which the first call end transmits information to the second call end (i.e., the aforementioned transmission network link), the first call end and the second call end may switch to the ASR/TTS mode. The mode switching may apply to the transmission link in one direction only: the ASR/TTS mode is adopted for the transmission link from the first call end to the second call end, while if the transmission link from the second call end to the first call end is normal, the voice call on that link continues in the voice coding and decoding mode.
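
The per-direction mode selection described above can be sketched as follows. The enum, function names, and dictionary keys are hypothetical; how each link's state is detected is outside the scope of this sketch.

```python
from enum import Enum
from typing import Dict


class CallMode(Enum):
    CODEC = "voice coding and decoding mode"
    ASR_TTS = "ASR/TTS mode"


def select_mode(link_abnormal: bool) -> CallMode:
    """Pick the call mode for one direction of the transmission network."""
    return CallMode.ASR_TTS if link_abnormal else CallMode.CODEC


def select_modes(first_to_second_abnormal: bool, second_to_first_abnormal: bool) -> Dict[str, CallMode]:
    # Each direction is decided independently, so one link can carry text
    # while the opposite link still carries encoded audio.
    return {
        "first_to_second": select_mode(first_to_second_abnormal),
        "second_to_first": select_mode(second_to_first_abnormal),
    }


print(select_modes(first_to_second_abnormal=True, second_to_first_abnormal=False))
```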

In the ASR/TTS mode, the first call end may send a target data packet carrying target text information to the second call end through the transmission network. The target text information may be obtained by performing speech recognition on a voice signal collected by the first call end, or may be text entered by the target call object of the first call end through an input interface of the first call end; this embodiment does not limit this. The second call end can receive the target data packet sent by the first call end and extract the target text information carried in it.

Optionally, to ensure that the data packet can be successfully transmitted to the second call end when the network is abnormal, the first call end may send the target data packet through a reliable network transmission protocol, and correspondingly, the second call end may receive the target data packet sent by the first call end through the reliable network transmission protocol. A reliable network transmission protocol is a network transmission protocol with a feedback mechanism, such as TCP (Transmission Control Protocol) or a modified reliable transmission protocol. When the network is normal, data may be transmitted through an unreliable transmission protocol, such as UDP (User Datagram Protocol).

For example, when severe continuous packet loss is detected, or when the upper limit of the bandwidth of the transmission network cannot support a normal call, the mode control unit of the transmitting end executes a command to switch from the voice codec mode to the ASR/TTS mode. At this point, the transmission protocol is changed from the unreliable transmission protocol used for the normal call (for example, UDP) to a reliable transmission protocol (for example, TCP). Information such as text (for example, the target data packet) is transmitted to the receiving end through the reliable network transmission protocol, and the receiving end converts the text into voice through TTS and plays it.
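The per-direction switching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this embodiment: the names `CallMode`, `Transport`, `ModeControlUnit`, and `update` are hypothetical, and the link state is assumed to be reported separately for each direction.

```python
from dataclasses import dataclass, field
from enum import Enum


class CallMode(Enum):
    CODEC = "voice codec"   # encode audio and send the audio code stream
    ASR_TTS = "ASR/TTS"     # send recognized text, synthesize at the receiver


class Transport(Enum):
    UDP = "unreliable (e.g., UDP)"
    TCP = "reliable (e.g., TCP)"


@dataclass
class ModeControlUnit:
    """Hypothetical mode controller that tracks one mode per link direction."""
    modes: dict = field(default_factory=dict)

    def update(self, link: str, abnormal: bool) -> tuple:
        """Choose the call mode and transport for one direction of the call."""
        mode = CallMode.ASR_TTS if abnormal else CallMode.CODEC
        self.modes[link] = mode
        transport = Transport.TCP if mode is CallMode.ASR_TTS else Transport.UDP
        return mode, transport


unit = ModeControlUnit()
# Link from the first call end to the second call end is abnormal.
uplink = unit.update("first->second", abnormal=True)
# The reverse link is normal, so it keeps the voice codec mode.
downlink = unit.update("second->first", abnormal=False)
```

Keeping the mode per direction reflects the statement that a normal link may continue to use the voice codec mode while the opposite link uses the ASR/TTS mode.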

It should be noted that, when data transmission is performed in the transmission network, the data (e.g., text information) may be encoded by the encoder at the sending end (e.g., the first call end) of the data, and the code stream data (e.g., text symbol encoded information) obtained by the encoding may be transmitted to the receiving end (e.g., the second call end) of the data through the transmission network; the received code stream data may be decoded by a decoder at the receiving end to recover the original data (e.g., text information).

Step S204, a voice synthesis operation is performed on the target text information, and a first voice signal corresponding to the target text information is obtained.

After receiving the target data packet, the second call end can extract the target text information from it and perform a voice synthesis operation on the target text information, converting it into a corresponding voice signal through text-to-speech technology to obtain the first voice signal. During voice synthesis, a sound model with multi-dimensional sound characteristics may be used to synthesize the target text information. Different callers may use the same sound model, or a sound model matched to the caller's sound characteristics may be used. The speech rate parameter of the synthesized first voice signal may be fixed or may be adjusted to follow the caller's speech rate, which is not limited in this embodiment.

Step S206, the first voice signal is played through the voice playing component on the second call end.

After the first voice signal is obtained, it can be played through a voice playing component on the second call end, where the voice playing component can be a speaker, an earphone, a Bluetooth device, or another component capable of playing voice.

Since the voice call can be a continuous process, when the transmission network is detected to be restored to a normal state, the ASR/TTS mode is stopped and the voice codec mode is restored. For example, if the transmission network is restored, the first call end may switch to the voice codec mode, and perform voice call with the second call end in the voice codec mode. The process of performing the voice call using the voice codec mode is similar to that described above, and will not be described here.

It should be noted that, since text information is transmitted through the transmission network, the amount of transmitted information is only one thousandth of that in the voice codec scheme, so the required transmission bandwidth can be greatly reduced and the success rate of data transmission is improved. The method is therefore applicable to voice communication in special network scenarios (for example, weak network environments).

According to the embodiment provided by the application, in the process of carrying out a voice call with the first call end, a target data packet sent by the first call end is received under the condition that the transmission network between the first call end and the current second call end is abnormal, wherein the target data packet carries target text information; a voice synthesis operation is performed on the target text information to obtain a first voice signal corresponding to the target text information; and the first voice signal is played through the voice playing component on the second call end. This solves the problem of poor call quality caused by network abnormality of the transmission network in the voice call processing method in the related technology, and improves the call quality of the voice call.

As an alternative, the method further includes:

S11, under the condition that a transmission network is abnormal, voice acquisition is carried out through a voice acquisition component of a first call end, and a target voice signal is obtained;

S12, converting the target voice signal into target text information by performing voice recognition on the target voice signal;

S13, the target data packet carrying the target text information is sent to the second call end through the transmission network.

For the first call end, when detecting that an abnormality occurs in the transmission network between the first call end and the second call end, the first call end may switch to the ASR/TTS mode. In an ASR/TTS mode, the first call end can acquire voice through a voice acquisition component on the first call end to obtain a target voice signal. The voice acquisition component may be a microphone array, pickup, or other component capable of voice signal acquisition.

Since voice call is a continuous process, the voice signal collected by performing data collection once may be a voice signal of a predetermined duration (for example, 5 s), or may be a voice signal divided based on a detected voice pause, that is, in the case where a voice pause is detected (for example, no valid voice signal is detected within 500 ms), the voice signal collected between voice pauses is determined as a voice signal collected once, or may be a voice signal determined by other means, which is not limited in this embodiment.
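The two segmentation rules mentioned above (a predetermined duration, or a detected voice pause) can be sketched together. This is an illustrative sketch only: the 20 ms frame length, the per-frame voiced/unvoiced flags (assumed to come from some voice activity detector), and the function name `split_segments` are assumptions, while the 500 ms and 5 s values are the examples given in the text.

```python
from typing import Iterable, List

FRAME_MS = 20            # assumed frame duration
PAUSE_MS = 500           # example pause length from the text
MAX_SEGMENT_MS = 5000    # example predetermined duration from the text


def split_segments(voiced_flags: Iterable[bool]) -> List[List[int]]:
    """Group voiced frame indices into segments.

    A segment is closed when a pause of PAUSE_MS is observed after some
    speech, or when the voiced frames in it add up to MAX_SEGMENT_MS.
    Only voiced frame indices are stored in a segment.
    """
    segments, current, silence_ms = [], [], 0
    for index, voiced in enumerate(voiced_flags):
        if voiced:
            current.append(index)
            silence_ms = 0
        else:
            silence_ms += FRAME_MS
        pause_hit = bool(current) and silence_ms >= PAUSE_MS
        too_long = len(current) * FRAME_MS >= MAX_SEGMENT_MS
        if pause_hit or too_long:
            segments.append(current)
            current, silence_ms = [], 0
    if current:
        segments.append(current)
    return segments


# 200 ms of speech, a 500 ms pause, then 100 ms of speech.
flags = [True] * 10 + [False] * 25 + [True] * 5
segments = split_segments(flags)
```

Each resulting segment would then be passed to speech recognition as one collected voice signal.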

For the collected target voice signal, the first call end may perform voice recognition on it (for example, through ASR technology) to obtain the target text information, and encapsulate the target text information in a data packet to obtain the target data packet. Besides the target text information, the target data packet may include other information, for example, speech rate parameter information and a sound model identifier, which is not limited herein. The first call end may transmit the target data packet to the second call end via the transmission network, and the target data packet may be transmitted using a reliable network transmission protocol.
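One possible encapsulation of the target data packet is sketched below. The packet layout is not specified in this embodiment, so the JSON encoding, the field names (`seq`, `text`, `speech_rate`, `model_id`), and the function names are all assumptions made for illustration.

```python
import json
from typing import Optional


def build_target_packet(text: str, seq: int,
                        speech_rate: Optional[float] = None,
                        model_id: Optional[str] = None) -> bytes:
    """Encapsulate recognized text and optional parameters as UTF-8 JSON."""
    payload = {"seq": seq, "text": text}
    if speech_rate is not None:
        payload["speech_rate"] = speech_rate   # optional speech rate parameter
    if model_id is not None:
        payload["model_id"] = model_id         # optional sound model identifier
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def parse_target_packet(data: bytes) -> dict:
    """Recover the fields on the receiving end."""
    return json.loads(data.decode("utf-8"))


packet = build_target_packet("hello", seq=1, speech_rate=1.2, model_id="preset-3")
fields = parse_target_packet(packet)
```

The receiving end would pass `fields["text"]` to its TTS engine and could use the optional fields to pick the speech rate and sound model.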

For example, as shown in fig. 4, the transmitting end recognizes the voice as text information through ASR technology. The text information, alone or combined with other information, can be transmitted to the receiving end at an extremely low bit rate through a reliable transmission protocol, and the receiving end reproduces the voice signal locally through TTS (text-to-speech) technology and plays it.

According to the embodiment of the application, when the transmission network is abnormal, the collected voice signal is converted into the text information, and the converted text information is transmitted to the opposite terminal of the call, so that the success rate of information transmission can be improved.

As an alternative, in the transmission network, the transmission link from the first call end to the second call end is a first transmission link, and the transmission link from the second call end to the first call end is a second transmission link. The first transmission link and the second transmission link may be different transmission links. If the second transmission link is also abnormal while the first transmission link is abnormal, the second call end may likewise adopt the ASR/TTS mode for data transmission, in a manner similar to that of the first call end, which is not repeated herein.

Optionally, in the case that the second transmission link is normal, the method further includes:

S21, voice acquisition is carried out through a voice acquisition component of the second call end, and a third voice signal is obtained;

S22, performing voice coding on the third voice signal to obtain audio code stream data corresponding to the third voice signal;

S23, transmitting the audio code stream data to the first call end through the second transmission link.

If the second transmission link is normal, the second call end can use the voice codec mode for the voice call in order to ensure call quality. The second call end may perform voice acquisition through its voice acquisition component (e.g., a microphone array, a pickup component, etc.) in a manner similar to that described above, to obtain a third voice signal. The second call end may use an audio encoder to perform voice coding on the third voice signal, obtain audio code stream data corresponding to the third voice signal, and transmit the audio code stream data to the first call end through the second transmission link. The manner of performing voice coding is similar to that described above and is not repeated here.

For example, as shown in fig. 5, when the first call end sends a voice signal to the second call end, the first call end converts the voice signal into text information for transmission, and the second call end restores the received text information into a voice signal for playing. When the second call end sends a voice signal to the first call end, the second call end carries out voice coding on the voice signal and transmits the audio code stream data obtained by coding to the first call end, and the first call end carries out voice decoding on the received audio code stream data and plays the voice signal obtained by decoding.

According to the embodiment provided by the application, the voice communication is carried out by adopting the voice coding and decoding mode for the normal transmission link of the network, so that the voice communication quality can be improved, and the communication experience of a user is further improved.

As an alternative, the method further includes:

S31, receiving a voice data packet periodically sent by a first call end through a transmission network;

S32, determining the network state of the transmission network according to the receiving result of the voice data packet and the expected receiving result of the voice data packet, wherein the network state of the transmission network is used for indicating whether the transmission network is abnormal or not;

S33, according to the network state of the transmission network, first indication information is sent to the first call end, wherein the first indication information is used for indicating whether the transmission network is abnormal or not.

During the call, whether the transmission network is currently abnormal can be determined from the voice data packets sent by the sending end and the voice data packets actually received by the receiving end. The first call end may send voice data packets to the second call end at regular intervals, where a voice data packet may carry a voice signal or may be an empty data packet used only for detecting the network state.

The second call end may receive, through the transmission network, the voice data packets periodically sent by the first call end. Each voice data packet sent by the first call end may or may not reach the second call end, that is, the second call end may receive all or only part of the voice data packets sent by the first call end. The second call end may determine which voice data packets it is expected to receive based on a preset interaction rule, or based on the preset interaction rule in combination with the sequence numbers of the received voice data packets.

Based on the receiving result of the voice data packets and the expected receiving result of the voice data packets, the second call end may determine the packet loss condition of the voice data packets, and thereby determine the network state of the transmission network (here, the first transmission link), that is, whether the transmission network is abnormal. The correspondence between the packet loss condition and the network state can take various forms. For example, if any packet loss occurs, or the number of lost packets exceeds a threshold, or the proportion of lost packets exceeds a threshold within a certain time, the network may be determined to be abnormal; otherwise, the network is determined to be normal.

For example, the transmitting end sends voice data packets to the receiving end at regular intervals, and each voice data packet may carry its sequence number. If the receiving end fails to receive a plurality of consecutive expected data packets over a long period, it may be determined that serious packet loss has occurred in the transmission network. For example, as shown in fig. 6, the receiving end receives the voice data packets with sequence numbers 1 and 8 but does not receive those with sequence numbers 2-7, so it can be determined that serious packet loss has occurred in the transmission network.
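The sequence-number check in the fig. 6 example can be sketched as follows. The function names and the idea of measuring the longest run of consecutive missing packets are illustrative assumptions. The embodiment does not fix a particular rule for what counts as serious loss.

```python
from typing import Iterable, List


def missing_sequence_numbers(received: Iterable[int]) -> List[int]:
    """Sequence numbers absent between the smallest and largest received."""
    seen = set(received)
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


def longest_gap(received: Iterable[int]) -> int:
    """Length of the longest run of consecutive missing packets."""
    ordered = sorted(set(received))
    return max((b - a - 1 for a, b in zip(ordered, ordered[1:])), default=0)


# Case from the example: packets 1 and 8 arrive, packets 2-7 do not.
lost = missing_sequence_numbers([1, 8])
gap = longest_gap([1, 8])
```

A receiving end could compare `gap` against a chosen limit (a hypothetical parameter) to decide that serious continuous packet loss has occurred.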

According to the network state of the transmission network, the second call end can send first indication information to the first call end so as to indicate whether the transmission network is abnormal or not. The first call end may receive the first indication information, and determine whether the transmission network is abnormal based on the indication of the first indication information, so as to switch the call mode based on the network state of the transmission network.

According to the embodiment of the application, the sending end continuously sends the voice data packet to the receiving end, the receiving end determines the network state of the transmission network based on the receiving condition of the voice data packet, and indicates the network state of the transmission network to the sending end, so that the convenience of network state determination can be improved.

As an alternative, determining the network state of the transmission network according to the receiving result of the voice data packet and the expected receiving result of the voice data packet includes:

S41, determining the packet loss rate of the voice data packets according to the number of the received voice data packets and the number of the expected received voice data packets;

S42, determining that the transmission network is abnormal under the condition that the packet loss rate of the voice data packet is greater than or equal to the packet loss rate threshold value;

S43, determining that the transmission network is normal under the condition that the packet loss rate of the voice data packet is smaller than the packet loss rate threshold.

In this embodiment, when determining the network state of the transmission network, whether the transmission network is abnormal may be determined based on whether the packet loss rate of the voice data packets reaches the set packet loss rate threshold. When the packet loss rate is greater than or equal to the packet loss rate threshold, the transmission network is determined to be abnormal; when the packet loss rate is less than the threshold, the transmission network is determined to be normal. The packet loss rate threshold may be preset, and may differ according to the requirements of different call scenarios, for example, a larger packet loss rate threshold may be set for a scenario with higher call quality requirements, and a smaller packet loss rate threshold may be set for a scenario with lower call quality requirements.

When determining the packet loss rate of the voice data packets, the packet loss rate may be determined based on the number of voice data packets actually received and the number of voice data packets expected to be received. For example, the ratio of the difference between the number of voice data packets expected to be received and the number actually received to the number expected to be received may be determined as the packet loss rate; other manners of determining the packet loss rate may also be used, which is not limited herein.
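The example ratio, (expected - received) / expected, and the threshold comparison of steps S42 and S43 can be written out as a short sketch. The function names and the sample threshold are illustrative assumptions.

```python
def packet_loss_rate(received: int, expected: int) -> float:
    """Fraction of expected voice data packets that did not arrive."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    if not 0 <= received <= expected:
        raise ValueError("received must be between 0 and expected")
    return (expected - received) / expected


def network_is_abnormal(received: int, expected: int, threshold: float) -> bool:
    """Abnormal when the loss rate is greater than or equal to the threshold."""
    return packet_loss_rate(received, expected) >= threshold


# Two of eight expected packets arrived, as in the fig. 6 example.
rate = packet_loss_rate(received=2, expected=8)
```

With a threshold of 0.25 (a hypothetical value), `network_is_abnormal(2, 8, 0.25)` reports an abnormal network, while `network_is_abnormal(8, 8, 0.25)` does not.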

According to the embodiment provided by the application, the packet loss rate is determined based on the actual receiving quantity and the expected receiving quantity of the voice data packets, and the network state of the transmission network is determined based on the packet loss rate, so that the convenience of network state determination can be improved.

As an alternative, the method further includes:

S51, receiving a plurality of groups of detection packets which are sequentially transmitted by a first call end through a transmission network in order of packet length from small to large, wherein each group of detection packets comprises a plurality of detection packets with the same packet length, and the packet lengths corresponding to different groups of detection packets are different;

S52, determining a bandwidth detection value corresponding to the transmission network according to the receiving result of each group of detection packets and the expected receiving result of each group of detection packets, wherein the bandwidth detection value is the maximum packet length in packet lengths corresponding to all groups of detection packets with the receiving results consistent with the expected receiving result in the plurality of groups of detection packets;

S53, determining that the transmission network is abnormal under the condition that the bandwidth detection value is smaller than the bandwidth detection threshold value;

S54, determining that the transmission network is normal under the condition that the bandwidth detection value is greater than or equal to the bandwidth detection threshold value;

S55, sending second indication information to the first call end, wherein the second indication information is used for indicating whether the transmission network is abnormal or not.

Considering that an abnormal transmission network may be caused by bandwidth limitation, in this embodiment the transmitting end periodically transmits bandwidth detection packets of different packet sizes (packet lengths). If reception fails once the detection packets exceed a certain size, it may be determined that the bandwidth of the transmission network is limited, and the maximum packet size of the bandwidth detection packets that were still received successfully may be used as the bandwidth detection value. If the bandwidth detection value is too small to support normal voice packet transmission, the network is considered abnormal.

When performing bandwidth detection, the first call end may sequentially send multiple groups of detection packets in order of packet length from small to large. Each group may include multiple detection packets with the same packet length; different groups may contain the same number of detection packets (e.g., a target number each), while their corresponding packet lengths differ. After sending a group of detection packets, the first call end may wait for indication information from the second call end indicating whether the current group of detection packets has been received in full, and determine based on the received indication information whether to continue sending the next group, where the packet length corresponding to the next group is greater than that of the current group.

The second call end can receive, through the transmission network, the multiple groups of detection packets sequentially sent by the first call end, and determine the bandwidth detection value corresponding to the transmission network according to the receiving result and the expected receiving result of each group of detection packets, where the bandwidth detection value is the maximum packet length among the packet lengths corresponding to the groups whose receiving results are consistent with their expected receiving results. If the bandwidth detection value is smaller than the bandwidth detection threshold, the transmission network is determined to be abnormal; if the bandwidth detection value is greater than or equal to the bandwidth detection threshold, the transmission network is determined to be normal. The second call end may send second indication information to the first call end to indicate whether the transmission network is abnormal.

If the current group of detection packets is not received in full, the next group of detection packets is not transmitted, and the bandwidth detection value is the maximum packet length among the packet lengths corresponding to the groups of detection packets that were received in full. The number of groups of bandwidth detection packets that the first call end expects to send to the second call end may be greater than or equal to the number of groups actually sent. For example, as shown in fig. 7, the transmitting end expects to transmit 10 groups of detection packets to the receiving end and transmits them sequentially in order of packet length from small to large. After each group is sent, the transmitting end waits for the indication information returned by the receiving end for the current group to determine whether the next group needs to be transmitted. If reception of a detection packet fails when the nth group is transmitted, the subsequent groups (which were expected to be transmitted but have not been sent) are not transmitted.
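The group-by-group probing procedure can be simulated with a short sketch. Everything here is an assumption made for illustration: `probe_bandwidth` and `simulated_link` are hypothetical names, the link is modelled as a function that reports whether a packet of a given length arrives, and the final comparison follows the explanation that a detection value too small to carry normal voice packets indicates an abnormal network.

```python
from typing import Callable, Sequence


def probe_bandwidth(packet_lengths: Sequence[int], group_size: int,
                    deliver: Callable[[int], bool]) -> int:
    """Send probe groups in ascending packet length and stop at the first
    group that is not received in full.

    Returns the largest packet length whose group was fully received,
    or 0 if even the smallest group failed.
    """
    detection_value = 0
    for length in sorted(packet_lengths):
        received = sum(1 for _ in range(group_size) if deliver(length))
        if received < group_size:
            break                      # later, larger groups are never sent
        detection_value = length
    return detection_value


sent = []  # records every probe packet that was actually sent


def simulated_link(length: int) -> bool:
    """Stand-in for the network: packets longer than 600 bytes are lost."""
    sent.append(length)
    return length <= 600


value = probe_bandwidth([200, 400, 600, 800, 1000], group_size=3,
                        deliver=simulated_link)
THRESHOLD = 800                        # hypothetical bandwidth detection threshold
abnormal = value < THRESHOLD
```

In this run the 800-byte group fails, so the 1000-byte group is never sent and the detection value is 600.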

According to the embodiment of the application, the transmitting end sequentially transmits multiple groups of detection packets with different packet lengths in order of packet length from small to large, and the network state of the transmission network is determined based on the maximum packet length among the packet lengths corresponding to the successfully received groups of detection packets, so that the comprehensiveness of network state detection can be improved.

As an alternative, receiving, through the transmission network, the multiple groups of detection packets sequentially sent by the first call end in order of packet length from small to large includes:

S61, receiving a target group of detection packets sent by the first call end through the transmission network, wherein the target group of detection packets corresponds to a target packet length;

S62, sending third indication information to the first call end under the condition that the receiving result of the target group detection packet is consistent with the expected receiving result of the target group detection packet, wherein the third indication information is used for indicating the first call end to continuously send the next group detection packet, and the packet length corresponding to the next group detection packet is larger than the target packet length;

S63, sending fourth indication information to the first call end when the receiving result of the target group detection packet is inconsistent with the expected receiving result of the target group detection packet, wherein the fourth indication information is used for indicating the first call end to stop sending the detection packet.

The group of detection packets currently sent by the first call end is the target group of detection packets, which corresponds to the target packet length, that is, the target group includes a plurality of detection packets whose packet length is the target packet length. If all the detection packets in the target group have been received, third indication information may be sent to the first call end to instruct it to continue sending the next group of detection packets, for example, the third indication information indicates that the target group of detection packets has been successfully received. At this time, the first call end may continue to transmit the next group of detection packets.

If the target group includes a detection packet that has not been received, fourth indication information may be sent to the first call end to instruct it to stop sending detection packets, for example, the fourth indication information indicates that the target group of detection packets has not been successfully received. At this time, the first call end may not send further detection packets.

It should be noted that network state detection, whether performed using voice data packets or by sending bandwidth detection packets, may be a continuous process. For example, it may be performed at scheduled times or periodically, so that the sending end can switch the call mode in time based on the network state of the transmission network. Although the receiving result of a group of detection packets is judged above by whether all detection packets in the group are received, other judgment criteria are not excluded, for example, whether the number of received detection packets exceeds a number threshold or whether the proportion received exceeds a proportion threshold, which is not limited in this embodiment.
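The receiving end's choice between the third and fourth indication information can be sketched as below. The `Indication` enum, the function name, and the `min_ratio` parameter are hypothetical: `min_ratio=1.0` corresponds to requiring that every detection packet in the group is received, and a lower value corresponds to the proportion-threshold alternative.

```python
from enum import Enum


class Indication(Enum):
    CONTINUE = 3   # third indication information: send the next group
    STOP = 4       # fourth indication information: stop sending probes


def indication_for_group(received_ids: set, expected_ids: set,
                         min_ratio: float = 1.0) -> Indication:
    """Decide which indication to return for one group of detection packets."""
    if not expected_ids:
        raise ValueError("expected_ids must not be empty")
    ratio = len(received_ids & expected_ids) / len(expected_ids)
    return Indication.CONTINUE if ratio >= min_ratio else Indication.STOP


full = indication_for_group({1, 2, 3}, {1, 2, 3})
partial = indication_for_group({1, 3}, {1, 2, 3})
relaxed = indication_for_group({1, 3}, {1, 2, 3}, min_ratio=0.6)
```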

According to the embodiment provided by the application, after the sending end sends one group of detection packets, whether to continue sending the detection packets is determined based on the indication information of the receiving end, so that the sending rationality of the detection packets can be improved, and the occupation of transmission resources is reduced.

As an alternative, performing a speech synthesis operation on the target text information to obtain a first speech signal corresponding to the target text information, including:

S71, performing text-to-speech conversion operation on the target text information by using the target sound model matched with the first call terminal to obtain a first speech signal.

When performing voice synthesis, a default sound model with multi-dimensional sound characteristics may be used to synthesize the text information, and different callers may use the same sound model, that is, the same sound model is used regardless of the caller's gender, timbre, and so on. Using the same sound model for voice synthesis can convey the information, but the synthesized voice sounds too mechanical, which affects the listening experience of the user.

In this embodiment, to adapt to different callers, a sound model matched to the caller's sound characteristics may be used for voice synthesis. The match may be exact, that is, voice synthesis is performed using a sound model obtained by training a TTS sound model with the voice data of the current caller. The match may also be approximate, that is, voice synthesis is performed using the sound model, selected from a set of preset sound models, that best matches the sound characteristics of the current caller.

For the target text information, a sound model matched with the first call end, that is, the target sound model, may first be determined, and a text-to-speech conversion operation is performed on the target text information using the target sound model to obtain the first voice signal. The sound model matched with the first call end may be determined when the voice call with the first call end is established, when an abnormality in the transmission network is detected (for example, detected for the first time), or at another time. The target sound model may be indicated by the first call end or determined by the second call end. It may be selected from a set of preset sound models (independent of any specific caller), or may be a sound model corresponding to the target call object of the first call end (the current caller at the first call end), and it may be acquired from a server, for example, a model library server, which is not limited in this embodiment.

According to the embodiment of the application, the voice synthesis efficiency can be improved by executing the text-to-voice operation by using the voice model matched with the transmitting end, and meanwhile, the voice signal synthesized by using the voice model matched with the transmitting end can represent the current caller of the transmitting party, so that the hearing experience of the receiving party can be improved.

As an alternative, before performing a text-to-speech conversion operation on the target text information using the target acoustic model matched with the first call end, the method further includes:

S81, searching a sound model matched with the target object identification by using the target object identification of the target call object of the first call end;

S82, when the sound model matched with the target object identifier is found, determining the sound model matched with the target object identifier as a target sound model;

S83, determining the sound model identified by the target model identification in a group of preset sound models as a target sound model under the condition that the sound model matched with the target object identification is not found, wherein the target model identification is the model identification indicated by the first call end.

In this embodiment, the target sound model matched with the first call end may be a sound model established for the target call object of the first call end, or may be a sound model, selected from a set of preset sound models, that matches the target call object. To this end, the target object identifier of the target call object may first be used to search for a sound model matching the target object identifier, where the target object identifier may be the account information used to log in to the application program on the first call end, or may be another object identifier that uniquely identifies the target call object. If a sound model matching the target object identifier is found, it may be determined as the target sound model.

For example, the receiving end may first determine whether a TTS voice model of the current caller exists, and if so, use the TTS voice model to perform a TTS text-to-speech operation.

The sound model matching the target object identifier may be obtained by training an initial TTS sound model with voice data of the target call object, that is, by training a sound model for the specific caller; the training may be performed on a TTS sound model training server. The voice data of the target call object may include voice data collected from the target call object during one or more voice calls. It may be obtained by acquiring, in the voice codec mode, audio code stream data of the target call object from the first call end or from call ends other than the first call end, and performing audio decoding on that audio code stream data. The trained sound models may be stored in a model library server. When the current voice call is established, the model library server sends the sound model of the target call object to the second call end for local storage.

If no sound model matched with the target object identifier is found, a sound model matched with the first call end may be selected from a group of preset sound models to obtain the target sound model. The selection may be made as follows: the sound model identified by the target model identifier in the group of preset sound models is determined as the target sound model. The target model identifier may be a pre-stored model identifier corresponding to the target call object (for example, a model serial number) or a model identifier indicated by the first call end, and it may be carried in the target data packet or in a data packet preceding the target data packet.
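The lookup-then-fallback selection of steps S81 to S83 can be sketched as follows. This is a minimal illustration only: the function name `select_sound_model`, the dictionary-based model stores, and the example identifiers are assumptions made for the sketch and are not part of the application.

```python
from typing import Dict


def select_sound_model(
    target_object_id: str,
    target_model_id: int,
    caller_models: Dict[str, str],
    preset_models: Dict[int, str],
) -> str:
    """Return the sound model to use for the text-to-speech operation."""
    # S81/S82: prefer a sound model trained for this specific caller.
    model = caller_models.get(target_object_id)
    if model is not None:
        return model
    # S83: otherwise fall back to the preset sound model whose identifier
    # was indicated by the first call end.
    return preset_models[target_model_id]


# Hypothetical local stores on the second call end.
caller_models = {"alice@example": "alice_tts_model"}
preset_models = {0: "preset_model_0", 1: "preset_model_1"}

print(select_sound_model("alice@example", 1, caller_models, preset_models))
print(select_sound_model("bob@example", 1, caller_models, preset_models))
```

In this sketch the caller-specific model always takes priority, and the target model identifier is consulted only when the lookup fails, which mirrors the order of steps S82 and S83.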

It should be noted that during a voice call both parties are usually fixed, that is, the caller usually remains unchanged, so the sound model only needs to be indicated once. It may be indicated when the voice call is established, in the first data packet after switching to the ASR/TTS mode, or at another time, which is not limited in this embodiment.

According to the embodiment provided by the application, the rationality of the determination of the sound model can be improved and the hearing experience of the receiver can be improved by searching the specific sound model established for the transmitting end and selecting the sound model appointed by the transmitting end from the preset sound models.

As an alternative, the method further includes:

S91, extracting the sound characteristics of the target call object from the second voice signal acquired by the voice acquisition component of the first call end;

S92, respectively matching the sound characteristics corresponding to each preset sound model in a group of preset sound models with the sound characteristics of the target call object to obtain the matching degree of each preset sound model and the target call object;

S93, determining the model identifier of the preset sound model with the highest matching degree with the target call object in the group of preset sound models as the target model identifier.

In this embodiment, the target model identifier may be determined by matching the sound feature of the target call object with the sound feature corresponding to each preset sound model. The voice acquisition component (e.g., a microphone array, a pickup component, etc.) of the first call end may perform voice acquisition to obtain a second voice signal. The voice acquisition may be performed in the voice codec mode or in the ASR/TTS mode: for example, it may be performed when the voice call starts (the call mode at this time may be the voice codec mode), or when an abnormality of the transmission network is detected (the call mode at this time may be the ASR/TTS mode). Here, the second voice signal may be the target voice signal or a voice signal different from the target voice signal.

The first call end may perform voice feature extraction on the second voice signal, and extract the voice feature of the target call object from the second voice signal, where the extracted voice feature may have multiple types, and may include, but not limited to, at least one of the following: formant characteristics, pitch period characteristics, etc. The local of the first call end may store the sound feature corresponding to the preset sound model, or the first call end may acquire the sound feature corresponding to the preset sound model from a model library server or other servers. The first call end may respectively match the sound feature corresponding to each preset sound model in the set of preset sound models with the sound feature of the target call object, and determine the matching degree between each preset sound model and the target call object.

The first call end may determine the preset sound model with the highest matching degree with the target call object in the group of preset sound models, and determine the model identifier of that preset sound model as the target model identifier. The first call end may send the target model identifier to the second call end directly after obtaining it, whether in the voice codec mode or in the ASR/TTS mode; alternatively, when an abnormality of the transmission network is detected, it may send the target model identifier together with the first data packet carrying text information. The target model identifier may be transmitted only once, may be transmitted with every data packet carrying text information, or may be transmitted in another manner, which is not limited herein.

For example, the sending end matches the voice feature of the current caller with the preset plurality of TTS voice model features, the matching process may be based on the principle that the weighted error of the multidimensional voice feature is minimum, the matching result is the model serial number of the TTS voice model closest to the current caller, and the model serial number, the ASR recognition result and other information (for example, speech speed detection result) are packaged into a data packet to be sent to the receiving end.
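The minimum-weighted-error matching mentioned above can be sketched as follows. The squared-error form, the feature layout (pitch period in milliseconds followed by two formant frequencies in hertz), the weights, and the numeric values are all assumptions chosen for illustration; the application does not specify the error function or the feature dimensions.

```python
from typing import Dict, Sequence


def weighted_error(
    a: Sequence[float], b: Sequence[float], weights: Sequence[float]
) -> float:
    """Weighted squared error between two multidimensional feature vectors."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights))


def match_preset_model(
    caller_features: Sequence[float],
    preset_features: Dict[int, Sequence[float]],
    weights: Sequence[float],
) -> int:
    """Return the serial number of the preset model closest to the caller.

    The smallest weighted error is treated as the highest matching degree.
    """
    return min(
        preset_features,
        key=lambda model_id: weighted_error(
            caller_features, preset_features[model_id], weights
        ),
    )


# Hypothetical features: [pitch period (ms), formant 1 (Hz), formant 2 (Hz)].
caller = [8.0, 500.0, 1500.0]
presets = {0: [8.3, 520.0, 1480.0], 1: [4.5, 700.0, 2100.0]}
weights = [1.0, 0.001, 0.001]

print(match_preset_model(caller, presets, weights))  # serial number 0
```

The weights let features with different units contribute comparably to the error; in practice they would have to be tuned for the chosen feature set.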

After receiving the data packet, the receiving end parses out the text information, the model serial number of the TTS sound model, and other information (e.g., speech rate information). When the ASR/TTS mode starts, the receiving end checks whether a TTS sound model of the caller at the sending end exists locally. If so, the receiving end calls that TTS sound model to perform the TTS text-to-speech operation. If not, TTS processing is performed with the preset TTS sound model closest to the caller, which is obtained through the model serial number parsed from the data packet.

According to the embodiment of the application, the voice characteristics of the caller are extracted from the collected voice signals, and the voice model used in voice synthesis is determined based on the matching degree of the voice characteristics corresponding to the preset voice model and the voice characteristics of the caller, so that the matching degree of the voice model and the caller can be improved, and the hearing experience of a user is improved.

As an alternative, after performing a speech synthesis operation on the target text information to obtain a first speech signal corresponding to the target text information, the method further includes:

S101, adjusting the speech rate parameter of the first speech signal according to the speech rate parameter indicated by the speech rate parameter information to obtain an adjusted first speech signal, wherein the speech rate parameter information is carried in a target data packet.

For the synthesized first voice signal, a default speech rate parameter could be used to set its speech rate; however, this may make the played sound signal stiff and unnatural, which affects the user's listening experience. In this embodiment, besides the target text information, the target data packet may also carry speech rate parameter information. The speech rate parameter information may be extracted from the target voice signal by the first call end, and may include the speech rate of each word in the target text information and the interval between different words. In addition, the target data packet may also carry other voice parameter information, such as intonation parameter information (for indicating the intonation parameter of each word).

The second call end may extract the speech rate parameter information from the target data packet and, after the first voice signal is restored, adjust the speech rate parameter of the first voice signal using the speech rate parameter indicated by the speech rate parameter information, thereby obtaining the adjusted first voice signal. The adjusted first voice signal may be played through the voice playing component of the second call end.

For example, the transmitting end converts the voice signal into text information, and obtains real-time speech speed parameter information through speech speed detection, encapsulates the ASR recognition result and the speech speed detection result into a data packet, and transmits the data packet to the receiving end. At the receiving end, the generated sound signal is subjected to speed regulation (the speed parameter is consistent with the speed parameter detected by the transmitting end) to obtain a final sound signal for playing.
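As a rough illustration of the speed regulation at the receiving end, the sketch below computes the time-stretch factor needed to bring a synthesized signal to the speech rate detected by the transmitting end. It assumes a single global speech rate expressed in words per second, whereas the speech rate parameter information described above may be per word; the function names are hypothetical, and the actual time-stretching of audio samples is not shown.

```python
def stretch_factor(synth_rate: float, target_rate: float) -> float:
    """Factor by which the synthesized signal's duration must be scaled.

    Rates are in words per second (an assumption of this sketch).
    A factor above 1 slows the speech down; below 1 speeds it up.
    """
    if synth_rate <= 0 or target_rate <= 0:
        raise ValueError("speech rates must be positive")
    return synth_rate / target_rate


def adjusted_duration(duration_s: float, synth_rate: float, target_rate: float) -> float:
    """Duration of the signal after speed regulation, in seconds."""
    return duration_s * stretch_factor(synth_rate, target_rate)


# TTS produced 3 s of speech at 4 words/s; the caller actually spoke at 2 words/s.
print(adjusted_duration(3.0, synth_rate=4.0, target_rate=2.0))  # 6.0
```

A real implementation would apply this factor with a pitch-preserving time-stretching method so that the caller's voice is slowed or sped up without changing its pitch.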

According to the embodiment provided by the application, the speech speed parameters of the restored speech signals are adjusted through the speech speed parameters indicated by the sending end, so that the authenticity of the restored speech signals can be improved, and the hearing experience of a user is improved.

The following explains a processing method of a voice signal in an embodiment of the present application with reference to an alternative example. In this optional example, the first call end is a transmitting end, the second call end is a receiving end, the voice call is VoIP, the voice collecting component is a microphone array, the voice playing component is a speaker, and the voice modes of the voice call may include a voice codec mode and an ASR/TTS mode.

In this optional example, a call manner of automatically switching ASR/TTS modes in a voice call process is provided, and in combination with fig. 8 and fig. 9, a flow of a processing method of a voice signal in this optional example may include the following steps:

Step S802, the transmitting end and the receiving end use the voice codec mode to perform a voice call.

During the call, the receiving end detects the network condition of the transmission network in real time. If the network condition is normal, the transmitting end and the receiving end may use the voice codec mode, in which the transmitting end encodes the voice and transmits it to the receiving end, and the receiving end decodes and plays it. The voice call flow in the voice codec mode may include the following steps:

In step S8021, the audio encoder at the transmitting end performs speech encoding on the speech signal obtained by recording, so as to obtain corresponding audio code stream data.

In step S8022, the transmitting end transmits the audio code stream data to the receiving end through the transmission network.

In step S8023, the receiving end receives data and detects the network status.

The data received by the receiving end may include audio code stream data. Meanwhile, in the call process, the condition of the transmission network can be detected in real time, for example, whether the current transmission network is abnormal can be judged by periodically sending voice data packets, bandwidth detection packets and the like by a sending end and counting the receiving success rate of the voice data packets, the bandwidth detection packets and the like by a receiving end. Correspondingly, the data received by the receiving end may also include a voice data packet, a bandwidth detection packet, etc. for performing network state detection, and the receiving end may perform network state detection based on the receiving result of the voice data packet, the bandwidth detection packet, etc.

In step S8024, the audio decoder at the receiving end performs speech decoding on the received audio code stream data, and restores the speech signal.

In step S8025, the speaker at the receiving end plays the recovered voice signal.

In step S804, the transmitting end performs mode control based on the network state detected by the receiving end.

The transmitting end and the receiving end may decide, based on the detected state of the transmission network, whether the current call uses the voice codec mode or the ASR/TTS mode. The receiving end may feed back the detected network state to the transmitting end. If the transmission network is abnormal (for example, severe packet loss persists for a long time, bandwidth is limited, or a normal call is impossible), use of the voice codec mode is suspended and the call switches to the ASR/TTS mode.
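The mode control of step S804 can be sketched as a small state holder that is driven by the network state fed back from the receiving end. The class and member names (`CallMode`, `ModeController`, `on_network_state`) are assumptions made for this sketch; the application only describes the behavior, not an interface.

```python
from enum import Enum


class CallMode(Enum):
    VOICE_CODEC = "voice_codec"
    ASR_TTS = "asr_tts"


class ModeController:
    """Tracks the call mode used by the transmitting end."""

    def __init__(self) -> None:
        # A call starts in the voice codec mode.
        self.mode = CallMode.VOICE_CODEC

    def on_network_state(self, abnormal: bool) -> bool:
        """Apply network state feedback; return True if the mode changed."""
        new_mode = CallMode.ASR_TTS if abnormal else CallMode.VOICE_CODEC
        changed = new_mode is not self.mode
        self.mode = new_mode
        return changed


controller = ModeController()
controller.on_network_state(abnormal=True)   # switch to ASR/TTS mode
controller.on_network_state(abnormal=False)  # network recovered: back to codec mode
print(controller.mode)
```

The return value lets the caller of the controller react only to actual switches, for example to start or stop the ASR component.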

In step S806, the transmitting end and the receiving end use ASR/TTS mode to make a voice call.

In the ASR/TTS mode, the transmitting end recognizes the text information from the voice signal through the ASR, and transmits the text information to the receiving end, and the receiving end restores the text information into the voice signal through the TTS and plays the voice signal.

The transmitting end recognizes text information from the voice through ASR, and detects the preset TTS sound model (a general TTS sound model) that best matches the current caller (the caller at the transmitting end) as well as the speech rate parameter information of the current caller. This information is transmitted to the receiving end through a reliable transmission protocol (such as TCP) at an extremely low code rate. The receiving end reproduces the sound signal locally through TTS text-to-speech technology (based on the best-matching preset TTS sound model or a TTS sound model of the specific person) and plays it. The voice call flow in the ASR/TTS mode may include the following steps:

In step S8061, the transmitting end recognizes text information from the collected voice signal through ASR, and detects the model serial number of the TTS sound model matched with the caller and the speech rate parameter information of the caller.

In step S8062, the transmitting end transmits the ASR recognition result, the model sequence number and the speech speed parameter information to the receiving end through the network.

When transmitting the ASR recognition result, the model serial number and the speech rate parameter information, the transmitting end may send them to the receiving end at an extremely low code rate through a reliable network transmission protocol (for example, TCP).

In step S8063, the receiving end parses the text information, the speech speed parameter and the model serial number from the received data packet.

In step S8064, the receiving end determines whether the TTS sound model of the current caller exists, if so, step S8065 is executed, otherwise, step S8066 is executed.

The audio code stream data obtained by the audio coding of the transmitting end can also be transmitted to a TTS sound model training server to perform model training of a TTS sound model of the current caller, and the TTS sound model corresponding to the current caller is obtained and stored in a model library server. The sender may store the TTS acoustic model based on the local caller list, for example, may obtain the corresponding TTS acoustic model from the model library server based on the caller identifier in the local caller list (may be the current caller or a potential caller determined based on the association relationship).

For the current call, the receiving end may determine whether the TTS sound model corresponding to the current caller is stored locally, if so, step S8065 is performed, otherwise, step S8066 is performed.

Step S8065, a TTS sound model of the current caller is determined.

Step S8066, selecting a preset TTS sound model corresponding to the model serial number indicated by the transmitting end.

In step S8067, TTS processing is performed using the TTS sound model of the current caller or the selected preset TTS sound model, and the sound signal is reproduced locally.

Step S8068, the speech speed parameter of the reproduced sound signal is adjusted according to the speech speed parameter indicated by the transmitting end, and the final speech signal is obtained. The final speech signal may be played through a speaker or the like at the receiving end.
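The packaging in step S8062 and the parsing in step S8063 can be sketched as follows. The JSON payload, the field names (`text`, `model_id`, `speech_rate`), and the 4-byte length prefix used to delimit packets on a byte stream such as TCP are all assumptions of this sketch; the application does not define a packet format.

```python
import json
import struct
from typing import Tuple


def build_packet(text: str, model_id: int, speech_rate: float) -> bytes:
    """Sender side: pack the ASR text, model serial number and speech rate."""
    payload = json.dumps(
        {"text": text, "model_id": model_id, "speech_rate": speech_rate},
        ensure_ascii=False,
    ).encode("utf-8")
    # 4-byte big-endian length prefix so the receiver can find packet
    # boundaries on a stream-oriented transport.
    return struct.pack("!I", len(payload)) + payload


def parse_packet(packet: bytes) -> Tuple[str, int, float]:
    """Receiver side: recover the text, model serial number and speech rate."""
    (length,) = struct.unpack("!I", packet[:4])
    payload = packet[4 : 4 + length]
    if len(payload) != length:
        raise ValueError("truncated packet")
    fields = json.loads(payload.decode("utf-8"))
    return fields["text"], fields["model_id"], fields["speech_rate"]


packet = build_packet("hello", model_id=3, speech_rate=2.5)
print(len(packet), parse_packet(packet))
```

A short text payload of this kind is only tens of bytes per utterance, which is consistent with the extremely low code rate described for the ASR/TTS mode.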

Meanwhile, when the network state detection result indicates that the network state has returned to normal, the mode control unit of the transmitting end switches back to the voice codec mode for the voice call. Either party in the current call may act as the transmitting end or the receiving end of a voice signal. Therefore, during the current call, either party may detect the network state from the opposite end to the local end in the manner described above, and adjust the call mode used for processing voice signals sent from the opposite end to the local end based on the detected network state.

With this optional example, switching between the voice codec mode and the ASR/TTS mode is performed based on the network state of the transmission network, so that the success rate of information transmission in a weak network environment can be improved and the user experience can be improved.

According to another aspect of the embodiment of the present application, a method for processing a voice signal is further provided. As an optional implementation manner, the method may be, but is not limited to being, applied to the environment shown in fig. 1, which has already been described and will not be repeated here.

Alternatively, the above-mentioned method for processing the voice signal may be performed by the first call end 102 alone, or may be performed by the first call end 102 and the second call end 104 together. As an alternative implementation manner, taking the processing method of the voice signal in the present embodiment performed by the first call end 102 as an example, fig. 10 is a schematic flow chart of another alternative processing method of the voice signal according to an embodiment of the present application, as shown in fig. 10, the flow chart of the processing method of the voice signal may include the following steps:

Step S1002, in the process of carrying out voice call with a second call terminal, under the condition that the transmission network between the current first call terminal and the second call terminal is abnormal, acquiring a target voice signal to be transmitted;

Step S1004, converting the target voice signal into target text information by performing voice recognition on the target voice signal;

In step S1006, the target data packet is sent to the second call end through the transmission network, where the target data packet carries the target text information.

In this embodiment, the manner of voice call, the first call end, the second call end, the transmission network, the network abnormality, and the voice recognition and the data packet transmission for the voice signal are similar to those in the foregoing embodiments, and will not be described herein.

According to the embodiment of the application, in the process of carrying out a voice call with the second call end, a target voice signal to be transmitted is acquired when the transmission network between the current first call end and the second call end is abnormal; the target voice signal is converted into target text information by performing voice recognition on it; and the target data packet carrying the target text information is sent to the second call end through the transmission network. This solves the problem of poor call quality caused by network abnormality of the transmission network in the voice call processing method in the related art, and improves the call quality of the voice call.

As an alternative, the method further includes: and carrying out speech rate detection on the target speech signal to obtain speech rate parameter information corresponding to the target speech signal, wherein the target data packet also carries the speech rate parameter information.

An optional example of this embodiment may refer to an example shown in the above-mentioned processing method of a voice signal, and will not be described herein.

As an alternative, the method further includes: extracting the sound characteristics of the target call object from the second voice signal acquired by the voice acquisition component of the first call end; and respectively matching the sound characteristics corresponding to each preset sound model in the set of preset sound models with the sound characteristics of the target call object to obtain the matching degree of each preset sound model and the target call object, and determining the model identification of the preset sound model with the highest matching degree with the target call object in the set of preset sound models as the target model identification, wherein the target model identification is sent to the second call end.

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.

According to still another aspect of the embodiment of the present application, there is also provided a voice signal processing apparatus for implementing the above-mentioned voice signal processing method. Fig. 11 is a block diagram of an alternative voice signal processing apparatus according to an embodiment of the present application, and as shown in fig. 11, the apparatus may include:

the first receiving unit 1102 is configured to receive, in a process of performing a voice call with the first call end, a target data packet sent by the first call end when a transmission network between the first call end and the current second call end is abnormal, where the target data packet carries target text information;

an execution unit 1104 for performing a speech synthesis operation on the target text information to obtain a first speech signal corresponding to the target text information;

a playing unit 1106, configured to play the first voice signal through a voice playing component on the second call end.

It should be noted that, the first receiving unit 1102 in this embodiment may be used to perform the step S202, the executing unit 1104 in this embodiment may be used to perform the step S204, and the playing unit 1106 in this embodiment may be used to perform the step S206.

According to the embodiment of the application, in the process of carrying out a voice call with the first call end, a target data packet sent by the first call end is received when the transmission network between the first call end and the current second call end is abnormal, where the target data packet carries target text information; a voice synthesis operation is performed on the target text information to obtain a first voice signal corresponding to the target text information; and the first voice signal is played through the voice playing component on the second call end. This solves the problem of poor call quality caused by abnormal signal transmission over the transmission network in the voice signal processing method in the related art, ensures information transmission over the transmission network, and improves the call quality.

As an alternative, the apparatus further includes:

the second receiving unit is used for receiving the voice data packet periodically sent by the first call end through the transmission network;

A first determining unit, configured to determine a network state of a transmission network according to a receiving result of the voice data packet and an expected receiving result of the voice data packet, where the network state of the transmission network is used to indicate whether the transmission network is abnormal;

The first sending unit is used for sending first indication information to the first call end according to the network state of the transmission network, wherein the first indication information is used for indicating whether the transmission network is abnormal or not.

As an alternative, the first determining unit includes:

The first determining module is used for determining the packet loss rate of the voice data packets according to the number of the received voice data packets and the number of the voice data packets expected to be received;

the second determining module is used for determining that the transmission network is abnormal under the condition that the packet loss rate of the voice data packet is greater than or equal to the packet loss rate threshold value;

and the third determining module is used for determining that the transmission network is normal under the condition that the packet loss rate of the voice data packet is smaller than the packet loss rate threshold value.
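The packet-loss-based decision made by the three determining modules above can be sketched as follows. The function names and the example threshold of 0.2 are assumptions for illustration; the comparison (abnormal when the packet loss rate is greater than or equal to the threshold) follows the modules as described.

```python
def packet_loss_rate(received: int, expected: int) -> float:
    """Fraction of the expected voice data packets that were not received."""
    if expected <= 0:
        raise ValueError("expected packet count must be positive")
    lost = max(expected - received, 0)
    return lost / expected


def is_network_abnormal(received: int, expected: int, loss_threshold: float) -> bool:
    """Abnormal when the packet loss rate reaches or exceeds the threshold."""
    return packet_loss_rate(received, expected) >= loss_threshold


# 80 of 100 expected packets arrived: loss rate 0.2, which meets a 0.2 threshold.
print(is_network_abnormal(80, 100, loss_threshold=0.2))  # True
print(is_network_abnormal(95, 100, loss_threshold=0.2))  # False
```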

As an alternative, the apparatus further includes:

A third receiving unit, configured to receive a plurality of groups of probe packets sequentially sent by the first call end through the transmission network in order of packet length from small to large, where each group of probe packets includes a plurality of probe packets with the same packet length, and packet lengths corresponding to different groups of probe packets are different;

A second determining unit, configured to determine, according to a reception result of each group of probe packets and an expected reception result of each group of probe packets, a bandwidth probe value corresponding to the transmission network, where the bandwidth probe value is a maximum packet length of packet lengths corresponding to all groups of probe packets in which the reception result of each group of probe packets is consistent with the expected reception result;

A third determining unit, configured to determine that an abnormality exists in the transmission network when the bandwidth detection value is greater than or equal to the bandwidth detection threshold;

A fourth determining unit, configured to determine that the transmission network is normal when the bandwidth detection value is smaller than the bandwidth detection threshold;

the second sending unit is used for sending second indication information to the first call end, wherein the second indication information is used for indicating whether the transmission network is abnormal or not.

As an alternative, the third receiving unit includes:

the receiving module is used for receiving a target group detection packet sent by the first call end through the transmission network, wherein the target group detection packet corresponds to the target packet in length;

The first sending module is used for sending third indication information to the first call end under the condition that the receiving result of the target group detection packet is consistent with the expected receiving result of the target group detection packet, wherein the third indication information is used for indicating the first call end to continuously send the next group of detection packets, and the packet length corresponding to the next group of detection packets is larger than the target packet length;

And the second sending module is used for sending fourth indication information to the first call end under the condition that the receiving result of the target group detection packet is inconsistent with the expected receiving result of the target group detection packet, wherein the fourth indication information is used for indicating the first call end to stop sending the detection packet.
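The group-by-group probing described by the receiving module and the two sending modules can be sketched as follows. Each group is modeled as a tuple of packet length, received count, and expected count, supplied in ascending order of packet length; this representation, the function name, and the return value of 0 when no group succeeds are assumptions of the sketch.

```python
from typing import Iterable, Tuple


def bandwidth_probe_value(groups: Iterable[Tuple[int, int, int]]) -> int:
    """Largest packet length whose probe group was received as expected.

    groups: (packet_length, received_count, expected_count) tuples in
    ascending order of packet_length.
    """
    probe_value = 0
    for packet_length, received, expected in groups:
        if received != expected:
            # Corresponds to the fourth indication information:
            # the first call end is told to stop sending probe packets.
            break
        # Corresponds to the third indication information:
        # the first call end continues with the next, longer group.
        probe_value = packet_length
    return probe_value


groups = [(200, 5, 5), (400, 5, 5), (800, 3, 5), (1600, 5, 5)]
print(bandwidth_probe_value(groups))  # 400
```

Because probing stops at the first group whose reception result is inconsistent with the expected result, the last successful group has the maximum packet length among all successful groups.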

As an alternative, the execution unit includes:

And the execution module is used for executing text-to-speech conversion operation on the target text information by using the target sound model matched with the first call terminal to obtain a first speech signal.

As an alternative, the apparatus further includes:

The searching unit is used for searching the sound model matched with the target object identifier by using the target object identifier of the target call object of the first call end before performing text-to-speech conversion operation on the target text information by using the target sound model matched with the first call end;

a fifth determining unit configured to determine, as the target acoustic model, the acoustic model matching the target object identifier if the acoustic model matching the target object identifier is found;

And a sixth determining unit, configured to determine, as the target acoustic model, an acoustic model identified by the target model identifier in a set of preset acoustic models without finding an acoustic model that matches the target object identifier, where the target model identifier is a model identifier indicated by the first call end.

As an alternative, the apparatus further includes:

The extraction unit is used for extracting the sound characteristics of the target call object from the second voice signal acquired by the voice acquisition component of the first call end;

The matching unit is used for respectively matching the sound characteristics corresponding to each preset sound model in the set of preset sound models with the sound characteristics of the target call object to obtain the matching degree of each preset sound model and the target call object;

and the seventh determining unit is used for determining the model identifier of the preset sound model with the highest matching degree with the target call object in the set of preset sound models as the target model identifier.

As an alternative, the apparatus further includes:

The adjusting unit is used for adjusting the speech speed parameter of the first speech signal according to the speech speed parameter indicated by the speech speed parameter information after the speech synthesis operation is carried out on the target text information to obtain the first speech signal corresponding to the target text information, and obtaining the adjusted first speech signal, wherein the speech speed parameter information is carried in the target data packet.

As an alternative scheme, the transmission link from the first call end to the second call end is a first transmission link in the transmission network, and the transmission link from the second call end to the first call end is a second transmission link in the transmission network; the device further comprises:

The first acquisition unit is used for acquiring voice through the voice acquisition component of the second call end under the condition that the second transmission link is normal, so as to obtain a third voice signal;

the coding unit is used for carrying out voice coding on the third voice signal to obtain audio code stream data corresponding to the third voice signal;

And the transmission unit is used for transmitting the audio code stream data to the first call end through the second transmission link.

As an alternative, the apparatus further includes:

The second acquisition unit is used for acquiring voice through the voice acquisition component of the first call end under the condition that the transmission network is abnormal, so as to obtain a target voice signal;

The recognition unit is used for converting the target voice signal into target text information by carrying out voice recognition on the target voice signal;

And the third sending unit is used for sending the target data packet carrying the target text information to the second call end through the transmission network.

According to still another aspect of the embodiment of the present application, there is also provided a voice signal processing apparatus for implementing the above-mentioned voice signal processing method. Fig. 12 is a block diagram of another alternative voice signal processing apparatus according to an embodiment of the present application, and as shown in fig. 12, the apparatus may include:

an obtaining unit 1202, configured to obtain, in a case where an abnormality occurs in a transmission network between a current first call end and a second call end during a voice call with the second call end, a target voice signal to be transmitted;

A conversion unit 1204, configured to convert the target voice signal into target text information by performing voice recognition on the target voice signal;

The sending unit 1206 is configured to send the target data packet to the second call end through the transmission network, where the target data packet carries the target text information.

It should be noted that the obtaining unit 1202 in this embodiment may be configured to perform the above-described step S1302, the conversion unit 1204 in this embodiment may be configured to perform the above-described step S1304, and the sending unit 1206 in this embodiment may be configured to perform the above-described step S1306.

The optional examples of this embodiment may refer to examples shown in the above-mentioned processing method of the voice signal, and will not be described herein.

According to still another aspect of the embodiment of the present application, there is further provided an electronic device for implementing the above-mentioned method for processing a voice signal, where the electronic device may be the first call end or the second call end shown in fig. 1. The present embodiment is described taking the electronic device as a terminal device as an example. As shown in fig. 13, the electronic device comprises a memory 1302 and a processor 1304, the memory 1302 having stored therein a computer program, the processor 1304 being arranged to perform the steps of any of the method embodiments described above by means of the computer program.

Alternatively, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of the computer network.

Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:

S1, in the process of carrying out voice call with a first call end, receiving a target data packet sent by the first call end under the condition that a transmission network between the first call end and a current second call end is abnormal, wherein the target data packet carries target text information;

S2, performing a voice synthesis operation on the target text information to obtain a first voice signal corresponding to the target text information;

S3, playing the first voice signal through a voice playing component on the second call end.

Alternatively, the processor may be configured to execute the following steps by a computer program:

S1, in the process of carrying out a voice call with a second call end, acquiring a target voice signal to be transmitted in the case that the transmission network between the current first call end and the second call end is abnormal;

S2, converting the target voice signal into target text information by performing voice recognition on the target voice signal;

S3, sending a target data packet to the second call end through the transmission network, wherein the target data packet carries the target text information.
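The two groups of steps above (recognition and sending on the first call end, synthesis and playback on the second call end) can be sketched together as follows. The JSON packet layout and the function names are assumptions for illustration; the speech recognizer, synthesizer, and player are passed in as callables because the application does not name specific engines.

```python
import json
from typing import Callable


def build_target_packet(
    target_voice_signal: bytes,
    recognize: Callable[[bytes], str],
) -> bytes:
    """First call end: convert speech to text and wrap it in a data packet."""
    target_text = recognize(target_voice_signal)
    # Hypothetical packet layout: a small JSON object carrying the text.
    return json.dumps({"type": "text", "text": target_text}).encode("utf-8")


def handle_target_packet(
    packet: bytes,
    synthesize: Callable[[str], bytes],
    play: Callable[[bytes], None],
) -> None:
    """Second call end: extract the text, synthesize speech, and play it."""
    target_text = json.loads(packet.decode("utf-8"))["text"]
    first_voice_signal = synthesize(target_text)
    play(first_voice_signal)
```

Because only text crosses the network, the packet is typically much smaller than the encoded audio it replaces, which is the property the scheme relies on when the transmission network is abnormal.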

Optionally, it will be understood by those skilled in the art that the structure shown in fig. 13 is only schematic, and the electronic device may also be a terminal device such as a smart phone (such as an Android mobile phone, an iOS mobile phone, etc.), a tablet computer, a palm computer, a MID, or a PAD. Fig. 13 does not limit the structure of the electronic device described above. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in FIG. 13, or have a different configuration than shown in FIG. 13.

The memory 1302 may be used to store software programs and modules, such as program instructions/modules corresponding to the method and apparatus for processing a voice signal in the embodiment of the present application, and the processor 1304 executes the software programs and modules stored in the memory 1302, thereby performing various functional applications and data processing, that is, implementing the method for processing a voice signal. Memory 1302 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 1302 may further include memory located remotely from processor 1304, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. Wherein the memory 1302 may be used for, but is not limited to, serializing files and compiling files.

As an example, the memory 1302 may include, but is not limited to, the first receiving unit 1102, the executing unit 1104, and the playing unit 1106 of the above-described voice signal processing apparatus. In addition, the memory may include, but is not limited to, other module units of the above-described voice signal processing apparatus, which are not described in detail in this example.

As another example, the memory 1302 may include, but is not limited to, the obtaining unit 1202, the conversion unit 1204, and the sending unit 1206 of the above-described voice signal processing apparatus. In addition, the memory may include, but is not limited to, other module units of the above-described voice signal processing apparatus, which are not described in detail in this example.

Optionally, the transmission device 1306 is configured to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission device 1306 comprises a network adapter (Network Interface Controller, NIC) which can be connected to other network devices and routers via a network cable so as to communicate with the internet or a local area network. In another example, the transmission device 1306 is a Radio Frequency (RF) module for communicating wirelessly with the internet.

In addition, the electronic device further includes: a display 1308 for displaying a call interface for voice calls; and a connection bus 1310 for connecting the respective module components in the above-described electronic device.

In other embodiments, the terminal device or the server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting a plurality of nodes through network communication. The nodes may form a Peer-To-Peer (P2P) network, and any type of computing device, such as a server, a terminal, etc., may become a node in the blockchain system by joining the Peer-To-Peer network.

According to one aspect of the present application, there is provided a computer program product comprising a computer program/instructions containing program code for executing the method shown in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1409 and/or installed from the removable medium 1411. When executed by the central processor 1401, the computer program performs the various functions provided by the embodiments of the present application. The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

Fig. 14 schematically shows a block diagram of a computer system of an electronic device for implementing an embodiment of the application. As shown in fig. 14, the computer system 1400 includes a central processing unit 1401 (Central Processing Unit, CPU) that can execute various appropriate actions and processes according to a program stored in a Read-Only Memory 1402 (ROM) or a program loaded from a storage section 1408 into a random access memory 1403 (Random Access Memory, RAM). In the random access memory 1403, various programs and data necessary for the system operation are also stored. The CPU 1401, the ROM 1402, and the RAM 1403 are connected to each other via a bus 1404. An input/output interface 1405 (Input/Output interface, i.e., I/O interface) is also connected to the bus 1404.

The following components are connected to the input/output interface 1405: an input section 1406 including a keyboard, a mouse, and the like; an output portion 1407 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 1408 including a hard disk or the like; and a communication section 1409 including a network interface card such as a local area network card, a modem, and the like. The communication section 1409 performs communication processing via a network such as the internet. The drive 1410 is also connected to the input/output interface 1405 as needed. Removable media 1411, such as magnetic disks, optical disks, magneto-optical disks, semiconductor memory, and the like, is installed as needed on the drive 1410 so that a computer program read therefrom is installed as needed into the storage portion 1408.

In particular, the processes described in the various method flowcharts may be implemented as computer software programs according to embodiments of the application. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1409 and/or installed from the removable medium 1411. When executed by the central processor 1401, the computer program performs the various functions defined in the system of the present application.

It should be noted that, the computer system 1400 of the electronic device shown in fig. 14 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.

According to one aspect of the present application, there is provided a computer-readable storage medium storing computer instructions. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the various alternative implementations of the above embodiments.

Optionally, in the present embodiment, the above-described computer-readable storage medium may be configured to store a computer program for performing the steps of any of the method embodiments described above.

Optionally, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be completed by a program instructing a terminal device to execute them, where the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, etc.

The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method described in the embodiments of the present application.

In the foregoing embodiments of the present application, the description of each embodiment has its own emphasis; for a portion that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In several embodiments provided by the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described apparatus embodiments are merely exemplary; for example, the division of the units is merely a logical function division, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, units, or modules, and may be in electrical or other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or at least two units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and these are also intended to fall within the scope of protection of the present application.

Claims

1. A method for processing a speech signal, comprising:

In the process of carrying out voice call with a first call end, receiving a target data packet sent by the first call end under the condition that a transmission network between the first call end and a current second call end is abnormal, wherein the target data packet carries target text information;

Performing voice synthesis operation on the target text information to obtain a first voice signal corresponding to the target text information;

and playing the first voice signal through a voice playing component on the second call end.

2. The method according to claim 1, wherein the method further comprises:

Receiving a voice data packet periodically sent by the first call end through the transmission network;

Determining a network state of the transmission network according to the receiving result of the voice data packet and the expected receiving result of the voice data packet, wherein the network state of the transmission network is used for indicating whether the transmission network is abnormal or not;

and sending first indication information to the first call end according to the network state of the transmission network, wherein the first indication information is used for indicating whether the transmission network is abnormal or not.

3. The method of claim 2, wherein said determining the network state of the transport network based on the received result of the voice data packet and the expected received result of the voice data packet comprises:

Determining the packet loss rate of the voice data packets according to the number of the received voice data packets and the number of the voice data packets expected to be received;

Determining that the transmission network is abnormal under the condition that the packet loss rate of the voice data packet is greater than or equal to a packet loss rate threshold value;

And under the condition that the packet loss rate of the voice data packet is smaller than a packet loss rate threshold value, determining that the transmission network is normal.
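The packet-loss test of claim 3 can be illustrated with the short sketch below. The function names and the example threshold are assumptions; the claim fixes only the comparison against a packet loss rate threshold, not its value.

```python
def packet_loss_rate(received: int, expected: int) -> float:
    """Fraction of expected voice data packets that did not arrive."""
    if expected <= 0:
        raise ValueError("expected must be a positive packet count")
    # Clamp at zero in case duplicates make received exceed expected.
    return max(0.0, (expected - received) / expected)


def network_is_abnormal(received: int, expected: int, loss_threshold: float) -> bool:
    """Abnormal when the loss rate is greater than or equal to the threshold."""
    return packet_loss_rate(received, expected) >= loss_threshold
```

With a hypothetical threshold of 0.2, receiving 80 of 100 expected packets is classified as abnormal, while receiving 95 of 100 is classified as normal.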

4. The method according to claim 1, wherein the method further comprises:

Receiving a plurality of groups of detection packets which are sequentially sent by the first call end through the transmission network in order of packet length from small to large, wherein each group of detection packets comprises a plurality of detection packets with the same packet length, and the packet lengths corresponding to different groups of detection packets are different;

Determining a bandwidth detection value corresponding to the transmission network according to the receiving result of each group of detection packets and the expected receiving result of each group of detection packets, wherein the bandwidth detection value is the maximum packet length in packet lengths corresponding to all groups of detection packets with the receiving results consistent with the expected receiving results in the groups of detection packets;

Determining that the transmission network is abnormal under the condition that the bandwidth detection value is larger than or equal to a bandwidth detection threshold value;

under the condition that the bandwidth detection value is smaller than a bandwidth detection threshold value, determining that the transmission network is normal;

and sending second indicating information to the first call end, wherein the second indicating information is used for indicating whether the transmission network is abnormal or not.
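The bandwidth detection value defined in claim 4 can be computed as sketched below. The sketch covers only the derivation of the value and leaves the threshold comparison aside. Treating "the receiving result is consistent with the expected receiving result" as "the number of packets received equals the number expected" is an assumption, as are the function name and the dictionary layout.

```python
from typing import Dict, Optional


def bandwidth_detection_value(
    received_counts: Dict[int, int],
    expected_counts: Dict[int, int],
) -> Optional[int]:
    """Return the largest packet length whose group arrived as expected.

    Both dictionaries map a group's packet length (bytes) to a packet count.
    Returns None when no group matched its expected result.
    """
    consistent_lengths = [
        length
        for length, expected in expected_counts.items()
        if received_counts.get(length, 0) == expected
    ]
    return max(consistent_lengths) if consistent_lengths else None
```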

5. The method of claim 4, wherein the receiving the plurality of groups of detection packets sequentially sent by the first call end through the transmission network in order of packet length from small to large comprises:

receiving a target group of detection packets sent by the first call end through the transmission network, wherein the target group of detection packets corresponds to a target packet length;

Transmitting third indication information to the first call end under the condition that the receiving result of the target group detection packet is consistent with the expected receiving result of the target group detection packet, wherein the third indication information is used for indicating the first call end to continuously transmit a next group of detection packets, and the packet length corresponding to the next group of detection packets is larger than the target packet length;

and sending fourth indication information to the first call end under the condition that the receiving result of the target group detection packet is inconsistent with the expected receiving result of the target group detection packet, wherein the fourth indication information is used for indicating the first call end to stop sending the detection packet.
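The group-by-group exchange of claim 5 can be simulated as below. The enum values, the tuple layout, and the equality test for a consistent receiving result are assumptions made for this sketch; in a real call the indications would be sent back to the first call end over the network rather than collected in a list.

```python
from enum import Enum
from typing import Iterable, List, Optional, Tuple


class Indication(Enum):
    CONTINUE = "third_indication"  # ask for the next, longer group
    STOP = "fourth_indication"     # ask the sender to stop probing


def respond_to_group(received: int, expected: int) -> Indication:
    """Choose the indication for one group of detection packets."""
    return Indication.CONTINUE if received == expected else Indication.STOP


def run_probe(
    groups: Iterable[Tuple[int, int, int]],
) -> Tuple[Optional[int], List[Indication]]:
    """Process groups of (packet_length, expected, received) in ascending length.

    Returns the last packet length that arrived as expected and the
    indications that would be sent to the first call end.
    """
    last_ok: Optional[int] = None
    sent: List[Indication] = []
    for packet_length, expected, received in groups:
        indication = respond_to_group(received, expected)
        sent.append(indication)
        if indication is Indication.STOP:
            break  # the sender stops, so later groups are never transmitted
        last_ok = packet_length
    return last_ok, sent
```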

6. The method of claim 1, wherein performing a speech synthesis operation on the target text information to obtain a first speech signal corresponding to the target text information comprises:

And performing text-to-speech conversion operation on the target text information by using a target sound model matched with the first call end to obtain the first speech signal.

7. The method of claim 6, wherein prior to performing a text-to-speech conversion operation on the target text information using a target sound model that matches the first call end, the method further comprises:

Searching a sound model matched with the target object identification by using the target object identification of the target call object of the first call end;

under the condition that the sound model matched with the target object identifier is found, determining the sound model matched with the target object identifier as the target sound model;

and under the condition that the sound model matched with the target object identifier is not found, determining the sound model identified by the target model identifier in a group of preset sound models as the target sound model, wherein the target model identifier is the model identifier indicated by the first call end.

8. The method of claim 7, wherein the method further comprises:

extracting the sound characteristics of the target call object from the second voice signal acquired by the voice acquisition component of the first call end;

Respectively matching the sound characteristics corresponding to each preset sound model in the set of preset sound models with the sound characteristics of the target call object to obtain the matching degree of each preset sound model and the target call object;

And determining the model identification of the preset sound model with the highest matching degree with the target call object in the set of preset sound models as the target model identification.

9. The method of claim 1, wherein after performing a speech synthesis operation on the target text information to obtain a first speech signal corresponding to the target text information, the method further comprises:

and adjusting the speech rate parameter of the first speech signal according to the speech rate parameter indicated by the speech rate parameter information to obtain the adjusted first speech signal, wherein the speech rate parameter information is carried in the target data packet.

10. The method of claim 1, wherein the transmission link from the first call end to the second call end is a first transmission link in the transmission network, and wherein the transmission link from the second call end to the first call end is a second transmission link in the transmission network; the method further comprises the steps of:

under the condition that the second transmission link is normal, voice acquisition is carried out through a voice acquisition component of the second call end, so that a third voice signal is obtained;

Performing voice coding on the third voice signal to obtain audio code stream data corresponding to the third voice signal;

and transmitting the audio code stream data to the first call end through the second transmission link.

11. The method according to any one of claims 1 to 10, further comprising:

Under the condition that the transmission network is abnormal, voice acquisition is carried out through a voice acquisition component of the first call end, so that a target voice signal is obtained;

converting the target voice signal into the target text information by carrying out voice recognition on the target voice signal;

And sending the target data packet carrying the target text information to the second call end through the transmission network.

12. A method for processing a speech signal, comprising:

In the process of carrying out voice call with a second call terminal, under the condition that the transmission network between the current first call terminal and the second call terminal is abnormal, acquiring a target voice signal to be transmitted;

Converting the target voice signal into target text information by carrying out voice recognition on the target voice signal;

And sending a target data packet to the second call end through the transmission network, wherein the target data packet carries the target text information.

13. A processing apparatus for a speech signal, comprising:

The first receiving unit is used for receiving a target data packet sent by a first call end under the condition that a transmission network between the first call end and a current second call end is abnormal in the process of carrying out voice call with the first call end, wherein the target data packet carries target text information;

the execution unit is used for executing voice synthesis operation on the target text information to obtain a first voice signal corresponding to the target text information;

and the playing unit is used for playing the first voice signal through a voice playing component on the second call end.

14. A processing apparatus for a speech signal, comprising:

an acquisition unit, used for acquiring a target voice signal to be transmitted in the case that a transmission network between a current first call end and a second call end is abnormal in the process of a voice call with the second call end;

The conversion unit is used for converting the target voice signal into target text information by carrying out voice recognition on the target voice signal;

And the sending unit is used for sending the target data packet to the second call end through the transmission network, wherein the target data packet carries the target text information.

15. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1-12 by means of the computer program.