WO2022135237A1

WO2022135237A1 - Voice processing method, terminal device, and storage medium

Info

Publication number: WO2022135237A1
Application number: PCT/CN2021/138389
Authority: WO
Inventors: 宁杰; 申呈洁; 鲍光照; 张岳; 渠畅
Original assignee: 华为技术有限公司
Priority date: 2020-12-25
Filing date: 2021-12-15
Publication date: 2022-06-30
Also published as: CN114694662A

Abstract

A voice processing method, a terminal device, and a storage medium. A first terminal device and a second terminal device are in a call state. The method comprises: the first terminal device obtains a voice signal input by a user (S501); when determining that the first terminal device is currently in the weak signal environment, and determining that both the first terminal device and the second terminal device support phonetic alphabet encoding and decoding, extracting a plurality of groups of information from the voice signal, each group of information comprising a phonetic alphabet, a tone, and a duration of a single word (S502); obtaining, according to an encoding table, encoding information corresponding to each group of information (S503); and sending the encoding information to the second terminal device (S504). The encoding table stores the correspondence between the phonetic alphabet, the tone, and the duration, and the encoding information. According to the voice processing method, transmission is carried out at an extremely low code rate in the weak signal environment, both parties of the call can clearly convey the pronunciation of the words, the user determines the semantic meaning according to the pronunciation, and the call effect is improved.

Description

Voice processing method, terminal device and storage medium

This application claims priority to Chinese patent application No. 202011568861.1, entitled "Voice processing method, terminal equipment and storage medium" and filed with the State Intellectual Property Office on December 25, 2020, the entire contents of which are incorporated by reference in this application.

Technical field

The embodiments of the present application relate to the field of communication technologies, and in particular, to a voice processing method, a terminal device, and a storage medium.

Background

When a user uses a terminal device to make a voice call, at the voice sender, the terminal device encodes the voice signal. Correspondingly, at the voice receiver, the terminal device decodes the received data and restores it to voice.

At present, terminal devices adopt traditional speech coding methods such as waveform coding. However, when the communication environment is poor, traditional speech coding causes severe signal distortion at the receiving end: the call becomes intermittent, noisy, or even silent, and the call effect is very poor.

Summary of the invention

Embodiments of the present application provide a voice processing method, terminal device, and storage medium. When the communication environment of the terminal device is poor, both parties to a call can clearly communicate the pronunciation of words, and users can judge semantics by pronunciation, which improves the effect of the call.

In a first aspect, a voice processing method is provided, which is applied to a first terminal device that is in a call state with a second terminal device. The method includes: acquiring a voice signal input by a user; when it is determined that the first terminal device is currently in a weak signal environment and that both the first terminal device and the second terminal device support phonetic symbol encoding and decoding, extracting multiple groups of information from the voice signal, where each group of information includes the phonetic symbol, tone, and duration of a single word, and phonetic symbol encoding and decoding refers to encoding and decoding the phonetic symbol, tone, and duration; obtaining the encoding information corresponding to each group of information according to an encoding table, where the encoding table stores the correspondence between the phonetic symbols, tones, and durations and the encoding information; and sending the encoding information to the second terminal device.

The voice processing method provided in the first aspect can be applied to a voice sender who conducts a voice call in a weak signal environment. At the voice sender, the phonetic symbol, pitch and duration of each word in the user's speech are extracted and encoded, to obtain encoded information of a preset bit length corresponding to each word. Since the encoded information has a preset bit length, it can be transmitted at a very low bit rate in a weak signal environment. By encoding the pronunciation and duration of a single word, the receiver can clearly restore the pronunciation of the sender user, so that the receiver user can acquire semantics according to the played pronunciation, which improves the call effect.

In a possible implementation, obtaining the encoding information corresponding to each group of information according to the encoding table includes: for each group of information, determining whether the common index table includes the phonetic symbol in the group of information; if the common index table includes the phonetic symbol in the group of information, obtaining the encoding information according to the common index table; and if the common index table does not include the phonetic symbol in the group of information, obtaining the encoding information according to the global index table.

In this implementation, the common index table is generated according to the common words of the user, and the number of common index values in the common index table is much smaller than the number of global index values in the global index table. The amount of search data is reduced, and the coding efficiency is improved.

In a possible implementation manner, determining that the first terminal device is currently in a weak signal environment includes: if it is determined that a target parameter satisfies a first preset condition, sending a first request message to the second terminal device, where the first request message is used to instruct the second terminal device to use phonetic symbol encoding and decoding, and the target parameter is used to indicate the signal state of the communication environment where the first terminal device is currently located; and receiving a first response message sent by the second terminal device, where the first response message is used to indicate that the second terminal device uses phonetic symbol encoding and decoding.

In a possible implementation manner, the first request message includes a hysteresis timer, and the hysteresis timer is used to indicate a delay time for the second terminal device to use phonetic symbol encoding and decoding.

In this implementation, setting a hysteresis timer reserves time for switching the voice codec mode. After the hysteresis timer expires, the first terminal device and the second terminal device start using phonetic symbol encoding and decoding at the same time, which improves the effect of switching the codec mode.
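The role of the hysteresis timer can be illustrated with a minimal sketch. It assumes that both devices note the same reference time and the same delay from the first request message; the class name `CodecSwitch`, the mode strings, and the use of seconds are hypothetical and do not come from the application. Transmission delay and clock differences are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodecSwitch:
    """Tracks a delayed switch of the codec mode (hypothetical helper)."""
    current_mode: str = "waveform"
    pending_mode: Optional[str] = None
    switch_at: Optional[float] = None  # time in seconds when the switch takes effect

    def schedule(self, new_mode: str, now: float, delay_s: float) -> None:
        """Record a pending switch that takes effect after the timer delay."""
        self.pending_mode = new_mode
        self.switch_at = now + delay_s

    def mode_at(self, now: float) -> str:
        """Return the codec mode to use at time `now`, applying an expired timer."""
        if self.pending_mode is not None and now >= self.switch_at:
            self.current_mode = self.pending_mode
            self.pending_mode = None
            self.switch_at = None
        return self.current_mode


# Both sides schedule the same switch with the same (made-up) delay of 0.5 s.
sender, receiver = CodecSwitch(), CodecSwitch()
sender.schedule("phonetic", now=10.0, delay_s=0.5)
receiver.schedule("phonetic", now=10.0, delay_s=0.5)
```

Before the timer expires both sides keep using the waveform codec, and from the expiry time onward both report the phonetic symbol codec, so neither side decodes with the wrong mode during the transition.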

In a possible implementation manner, determining that the first terminal device is currently in a weak signal environment includes: receiving a second request message sent by the second terminal device, where the second request message is used to instruct the first terminal device to use phonetic symbol encoding and decoding; and sending a second response message to the second terminal device, where the second response message is used to indicate that the first terminal device uses phonetic symbol encoding and decoding.

In a possible implementation manner, the method further includes: if it is determined that the target parameter satisfies a second preset condition, sending a third request message to the second terminal device, where the third request message is used to instruct the second terminal device to use waveform encoding and decoding, and the target parameter is used to indicate the signal state of the communication environment where the first terminal device is currently located; and receiving a third response message sent by the second terminal device, where the third response message is used to indicate that the second terminal device uses waveform encoding and decoding.

In a possible implementation manner, the method further includes: receiving a fourth request message sent by the second terminal device, where the fourth request message is used to instruct the first terminal device to use waveform encoding and decoding; and if it is determined that the target parameter satisfies the second preset condition, sending a fourth response message to the second terminal device, where the fourth response message is used to indicate that the first terminal device uses waveform encoding and decoding, and the target parameter is used to indicate the signal state of the communication environment where the first terminal device is currently located.

In the above implementation manners, when the communication environment in which the first terminal device and the second terminal device are located is not a weak signal environment, the devices can switch back to the traditional voice codec mode to improve the call effect.

In a possible implementation manner, determining that the first terminal device supports phonetic symbol encoding and decoding includes: if it is determined that the phonetic symbol encoding and decoding function is not enabled on the first terminal device, generating and outputting prompt information; receiving a first instruction from the user; and enabling the phonetic symbol encoding and decoding function according to the first instruction.

In a possible implementation manner, determining that both the first terminal device and the second terminal device support phonetic symbol encoding and decoding includes: sending first capability information to the second terminal device, where the first capability information is used to indicate that the first terminal device supports phonetic symbol encoding and decoding; and receiving first capability response information sent by the second terminal device, where the first capability response information is used to indicate that the second terminal device supports phonetic symbol encoding and decoding; or, receiving second capability information sent by the second terminal device, where the second capability information is used to indicate that the second terminal device supports phonetic symbol encoding and decoding; and sending second capability response information to the second terminal device, where the second capability response information is used to indicate that the first terminal device supports phonetic symbol encoding and decoding.

In a possible implementation manner, the method further includes: displaying a setting interface; receiving a user's operation in the setting interface; and in response to the operation, enabling the function of phonetic symbol encoding and decoding.

In a second aspect, a voice processing method is provided, which is applied to a second terminal device that is in a call state with a first terminal device. The method includes: receiving first information sent by the first terminal device; decoding the first information according to an encoding table to obtain multiple groups of information, where each group of information includes the phonetic symbol, tone, and duration of a single word, and the encoding table stores the correspondence between the phonetic symbols, tones, and durations and the encoding information; generating a voice signal according to the multiple groups of information; and playing the voice signal using a preset sound.

The voice processing method provided in the second aspect can be applied to a voice receiver who conducts a voice call in a weak signal environment. After the encoded information sent by the voice sender is transmitted through the channel, the voice receiver decodes the received information to obtain the phonetic symbol, pitch and duration of each word, thereby generating a complete and smooth voice signal and using the preset sound to play. Since the encoded information has a preset bit length, it can be transmitted at a very low bit rate in a weak signal environment. By encoding the pronunciation and duration of a single word, the receiver can clearly restore the pronunciation of the sender user, so that the receiver user can acquire semantics according to the played pronunciation, which improves the call effect.

In a possible implementation, decoding the first information according to the encoding table to obtain multiple groups of information includes: sequentially obtaining multiple pieces of encoding information from the first information, where the length of each piece of encoding information is a preset bit length; and for each piece of encoding information, obtaining the phonetic symbol, tone, and duration corresponding to the encoding information according to the encoding table.
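As a rough illustration of this decoding step, the sketch below reads consecutive fixed-length codes from a byte string and looks each one up in a table. The 22-bit length is the example value given in the description; the big-endian bit order, the padding at the end of the payload, the way the receiver learns the number of codes, and the contents of `TABLE` are all assumptions made for illustration.

```python
CODE_BITS = 22  # example preset bit length; other values are possible


def split_codes(payload: bytes, count: int, code_bits: int = CODE_BITS) -> list[int]:
    """Read `count` consecutive fixed-length codes from a big-endian bit stream."""
    total_bits = len(payload) * 8
    if count * code_bits > total_bits:
        raise ValueError("payload is shorter than the requested number of codes")
    stream = int.from_bytes(payload, "big")
    mask = (1 << code_bits) - 1
    return [
        (stream >> (total_bits - (i + 1) * code_bits)) & mask
        for i in range(count)
    ]


def decode(payload: bytes, count: int, table: dict) -> list:
    """Map each code to its (phonetic symbol, tone, duration) entry."""
    return [table[code] for code in split_codes(payload, count)]


# Hypothetical table entries and a payload holding two codes plus 4 padding bits.
TABLE = {5: ("ni", 3, "short"), 9: ("hao", 3, "short")}
payload = (((5 << CODE_BITS) | 9) << 4).to_bytes(6, "big")
groups = decode(payload, 2, TABLE)
```

Because every code has the same length, the receiver needs no delimiters between codes; it only needs the preset bit length and the number of codes in the payload.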

In a possible implementation manner, before receiving the first information sent by the first terminal device, the method further includes: receiving a first request message sent by the first terminal device, where the first request message is used to instruct the second terminal device to use phonetic symbol encoding and decoding, and phonetic symbol encoding and decoding refers to encoding and decoding phonetic symbols, tones, and durations; and sending a first response message to the first terminal device, where the first response message is used to indicate that the second terminal device uses phonetic symbol encoding and decoding.

In a possible implementation manner, before receiving the first information sent by the first terminal device, the method further includes: if it is determined that a target parameter satisfies a first preset condition, sending a second request message to the first terminal device, where the second request message is used to instruct the first terminal device to use phonetic symbol encoding and decoding, phonetic symbol encoding and decoding refers to encoding and decoding phonetic symbols, tones, and durations, and the target parameter is used to indicate the signal state of the communication environment where the second terminal device is currently located; and receiving a second response message sent by the first terminal device, where the second response message is used to indicate that the first terminal device uses phonetic symbol encoding and decoding.

In a possible implementation manner, the method further includes: receiving a third request message sent by the first terminal device, where the third request message is used to instruct the second terminal device to use waveform encoding and decoding; and if it is determined that the target parameter satisfies a second preset condition, sending a third response message to the first terminal device, where the third response message is used to indicate that the second terminal device uses waveform encoding and decoding, and the target parameter is used to indicate the signal state of the communication environment where the first terminal device is currently located.

In a possible implementation manner, the method further includes: if it is determined that the target parameter satisfies the second preset condition, sending a fourth request message to the first terminal device, where the fourth request message is used to instruct the first terminal device to use waveform encoding and decoding; and receiving a fourth response message sent by the first terminal device, where the fourth response message is used to indicate that the first terminal device uses waveform encoding and decoding.

In a possible implementation manner, before receiving the first information sent by the first terminal device, the method further includes: receiving first capability information sent by the first terminal device, where the first capability information is used to indicate that the first terminal device supports phonetic symbol encoding and decoding; and sending first capability response information to the first terminal device, where the first capability response information is used to indicate that the second terminal device supports phonetic symbol encoding and decoding, and phonetic symbol encoding and decoding refers to encoding and decoding phonetic symbols, tones, and durations; or, sending second capability information to the first terminal device, where the second capability information is used to indicate that the second terminal device supports phonetic symbol encoding and decoding; and receiving second capability response information sent by the first terminal device, where the second capability response information is used to indicate that the first terminal device supports phonetic symbol encoding and decoding.

In a possible implementation manner, the method further includes: displaying a setting interface; receiving a user's operation in the setting interface; and in response to the operation, enabling the function of phonetic symbol encoding and decoding, where phonetic symbol encoding and decoding refers to encoding and decoding phonetic symbols, tones, and durations.

In a third aspect, an apparatus is provided, including units or means for performing the steps in any of the above aspects.

In a fourth aspect, a terminal device is provided, including a processor, a memory, and a transceiver, where the transceiver is used for communicating with other devices, and the processor is used for calling a program stored in the memory to execute the method provided in any of the above aspects.

In a fifth aspect, a computer-readable storage medium is provided, and instructions are stored in the computer-readable storage medium, and when the instructions are executed on a computer or a processor, the method provided in any of the above aspects is implemented.

In a sixth aspect, a program product is provided. The program product includes a computer program stored in a readable storage medium. At least one processor of a device can read the computer program from the readable storage medium, and the at least one processor executes the computer program to cause the device to implement the method provided in any of the above aspects.

In any of the above aspects, in a possible implementation manner, the encoded information includes a first information component corresponding to phonetic symbols and tones, and a second information component corresponding to a duration.

In a possible implementation manner, the first information component includes a first information sub-component corresponding to a phonetic symbol and a second information sub-component corresponding to a tone.

In a possible implementation manner, the coding table includes a global index table and a common index table. The common index table is generated according to the number of words used by the user within a preset time period. The global index table includes a phonetic symbol and a global index value of the phonetic symbol. A phonetic symbol included in the common index table has a common index value and the global index value of that phonetic symbol in the global index table.

In a possible implementation manner, the target parameter includes at least one of the following: the location information of the terminal device, the cell identifier of the cell currently accessed by the terminal device, the signal strength of the signal received by the terminal device or the voice packet loss rate.

Description of drawings

FIG. 1 is a diagram of an application scenario to which an embodiment of the present application is applicable;

FIG. 2 is a schematic diagram of the principle when a terminal device performs a voice call;

FIG. 3 is a schematic diagram of a call effect using traditional voice codec in a weak signal environment;

FIG. 4 is a schematic diagram of a call effect provided by an embodiment of the present application in a weak signal environment;

FIG. 5 is a message interaction diagram of the voice processing method provided by the embodiment of the present application;

FIG. 6 is another message interaction diagram of the voice processing method provided by the embodiment of the present application;

FIG. 7 is another message interaction diagram of the voice processing method provided by the embodiment of the present application;

FIG. 8 is another message interaction diagram of the voice processing method provided by the embodiment of the present application;

FIG. 9 is another message interaction diagram of the voice processing method provided by the embodiment of the present application;

FIG. 10 is an interface diagram for setting the phonetic symbol encoding and decoding mode provided by an embodiment of the present application;

FIG. 11 is another message interaction diagram of the voice processing method provided by the embodiment of the present application;

FIG. 12 is another message interaction diagram of the voice processing method provided by the embodiment of the present application;

FIG. 13 is a schematic structural diagram of a terminal device provided by an embodiment of the present application;

FIG. 14 is another schematic structural diagram of a terminal device provided by an embodiment of the present application.

Detailed description

The embodiments of the present application are described below with reference to the accompanying drawings.

Exemplarily, FIG. 1 is a diagram of an application scenario to which this embodiment of the present application is applicable. As shown in FIG. 1, user A uses terminal device 100, user B uses terminal device 200, and user A and user B can conduct a voice call. This embodiment of the present application does not limit the type of the terminal device. For example, the terminal device may be a mobile phone, a tablet computer, a PDA, a wearable device, or the like.

FIG. 2 is a schematic diagram of the principle when a terminal device performs a voice call. As shown in FIG. 1 and FIG. 2 , when user A speaks and user B listens, the terminal device 100 is the voice sender, and the terminal device 200 is the voice receiver. The terminal device 100 acquires the voice signal input by the user A, performs voice encoding on the voice signal, and generates encoded information. The encoded information is received by the terminal device 200 after being transmitted through the channel. The terminal device 200 performs voice decoding on the received information, restores and generates a voice signal, and outputs it to user B.

It should be noted that FIG. 2 shows the voice codec part in the voice call process, and other processing processes are not limited.

The concepts in the embodiments of the present application are described below.

1. Voice coding

Speech coding has a broad meaning and a narrow meaning. In a broad sense, it refers to an encoding method that includes speech encoding at the sender and speech decoding at the receiver. The narrow meaning refers to speech encoding at the sender. For distinction, in the embodiments of the present application, encoding in a broad sense is referred to as encoding and decoding.

The purpose of speech codec is to digitize the speech signal, compress the transmission bandwidth of the speech signal, and improve the transmission rate of the channel.

There are many ways to implement speech encoding and decoding, for example, traditional speech encoding and decoding such as waveform encoding and decoding, feature encoding and decoding, and parameter encoding and decoding, as well as phonetic symbol encoding and decoding in the embodiments of the present application.

For the convenience of description, the traditional speech encoding and decoding methods in the embodiments of the present application take waveform encoding and decoding as an example for description.

2. Weak signal environment

When the terminal equipment is in a weak signal environment, the signal quality is poor. If the traditional voice encoding and decoding method is used, the transmitted voice code rate will be reduced, and the voice signal restored by the receiving end will be greatly distorted, resulting in discontinuity, noise or even no sound.

It should be noted that, in this embodiment of the present application, a terminal device being in a weak signal environment means that at least one of the two terminal devices in the call is in a weak signal environment. Exemplarily, taking the terminal device 100 in FIG. 1 as an example, the terminal device 100 being in a weak signal environment covers the following three scenarios: scenario 1, the terminal device 100 is in a weak signal environment; scenario 2, the terminal device 200 in a call with the terminal device 100 is in a weak signal environment; scenario 3, both the terminal device 100 and the terminal device 200 are in a weak signal environment.

The terminal device may acquire its own target parameter, and determine whether the terminal device is in a weak signal environment according to whether the target parameter satisfies a preset condition. In this embodiment of the present application, when the target parameter satisfies the first preset condition, it is determined that the terminal device is in a weak signal environment, and when the target parameter satisfies the second preset condition, it is determined that the terminal device is not in a weak signal environment. Different target parameters correspond to different first preset conditions and second preset conditions. Optionally, the target parameter may include at least one of the following: the location information of the terminal device, the cell identifier of the cell currently accessed by the terminal device, the signal strength of the signal received by the terminal device, or the voice packet loss rate.

Optionally, the target parameter is the location information of the terminal device, and the weak signal geographic range can be recorded in advance. When the location information of the terminal device is within the weak signal geographic range, it is determined that the terminal device is in a weak signal environment. On the contrary, when the location information of the terminal device is not within the weak signal geographic range, it is determined that the terminal device is not in a weak signal environment. This embodiment of the present application does not limit the geographic range of weak signals, for example, some mountainous areas, bridges, and other areas where it is difficult to deploy stations.
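A location-based check of this kind could be sketched as follows. The application does not say how the weak signal geographic range is recorded; the rectangular latitude/longitude boxes, the name `GeoBox`, and the coordinates below are hypothetical.

```python
from typing import Iterable, NamedTuple


class GeoBox(NamedTuple):
    """Axis-aligned latitude/longitude box (hypothetical representation)."""
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float


def in_weak_signal_range(lat: float, lon: float, ranges: Iterable[GeoBox]) -> bool:
    """Return True if the position lies inside any pre-recorded weak signal range."""
    return any(
        box.lat_min <= lat <= box.lat_max and box.lon_min <= lon <= box.lon_max
        for box in ranges
    )


# Made-up pre-recorded range, for illustration only.
WEAK_RANGES = [GeoBox(30.10, 30.20, 120.00, 120.15)]
```

A position inside the recorded box satisfies the first preset condition in this sketch, and a position outside every box satisfies the second preset condition.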

Optionally, the target parameter is the cell identifier of the cell currently accessed by the terminal device, and weak signal cell identifiers may be recorded in advance. When the cell identifier of the cell currently accessed by the terminal device is a weak signal cell identifier, it is determined that the terminal device is in a weak signal environment. On the contrary, when the cell identifier of the cell currently accessed by the terminal device is not a weak signal cell identifier, it is determined that the terminal device is not in a weak signal environment. This embodiment of the present application does not limit the weak signal cell identifiers. For example, in a chain distribution scenario such as high-speed rail, subway, or expressway, the signal coverage blind spots are relatively fixed, and the identifiers of the cells with poor signal can be recorded as weak signal cell identifiers.

Optionally, the target parameter is the signal strength of the signal received by the terminal device, a first threshold and a second threshold may be preset, and the second threshold is greater than or equal to the first threshold. When the signal strength of the signal received by the terminal device is less than or equal to the first threshold, it is determined that the terminal device is in a weak signal environment. When the signal strength of the signal received by the terminal device is greater than or equal to the second threshold, it is determined that the terminal device is not in a weak signal environment. This embodiment of the present application does not limit the values of the first threshold and the second threshold.
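The two-threshold decision can be sketched as below. The threshold values and the dBm unit are made up, and keeping the previous decision when the signal strength lies between the two thresholds is an assumption; the text above only specifies the behaviour at or beyond the thresholds.

```python
FIRST_THRESHOLD_DBM = -115.0   # hypothetical value: at or below this is weak
SECOND_THRESHOLD_DBM = -105.0  # hypothetical value: at or above this is not weak


def update_weak_signal(
    was_weak: bool,
    strength_dbm: float,
    first_threshold: float = FIRST_THRESHOLD_DBM,
    second_threshold: float = SECOND_THRESHOLD_DBM,
) -> bool:
    """Return the new weak-signal decision for one signal strength sample."""
    if second_threshold < first_threshold:
        raise ValueError("second threshold must be >= first threshold")
    if strength_dbm <= first_threshold:
        return True       # first preset condition satisfied
    if strength_dbm >= second_threshold:
        return False      # second preset condition satisfied
    return was_weak       # assumption: between thresholds, keep the previous state
```

With the second threshold strictly greater than the first, small fluctuations around a single boundary do not cause the codec mode to flip back and forth.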

Optionally, the target parameter is the voice packet loss rate of the terminal device, a third threshold and a fourth threshold may be preset, and the third threshold is greater than or equal to the fourth threshold. When the voice packet loss rate is greater than or equal to the third threshold, it is determined that the terminal device is in a weak signal environment. When the voice packet loss rate is less than or equal to the fourth threshold, it is determined that the terminal device is not in a weak signal environment. This embodiment of the present application does not limit the values of the third threshold and the fourth threshold.

3. Phonetic codec

In this embodiment of the present application, the phonetic symbol encoding and decoding refers to encoding and decoding the phonetic symbol, pitch, and duration of a single word according to a coding table. This embodiment of the present application does not limit the name of the phonetic codec, for example, it may also be called weak-signal high-definition speech codec.

The unit of phonetic symbol encoding can be a single word (character), a phrase, a sentence, or another unit. The embodiments of the present application are described by taking the single word as the unit, and each single word has three pieces of information: phonetic symbol, tone, and duration.

This embodiment of the present application does not limit the division standard of the duration. For example, the duration can include short, medium and long. Short means less than 0.5 seconds, medium means greater than or equal to 0.5 seconds and less than 2 seconds, and long means greater than or equal to 2 seconds. For another example, the duration may include four categories of 1 to 4. 1 means less than 0.5 seconds, 2 means greater than or equal to 0.5 seconds and less than 1 second, 3 means greater than or equal to 1 second and less than 2 seconds, 4 means greater than or equal to 2 seconds.
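The two example divisions above can be written as a small classifier. This is only a sketch of the example boundaries given in the text; the function names are hypothetical.

```python
from bisect import bisect_right

# Boundaries in seconds taken from the two examples in the text.
THREE_WAY_BOUNDS = [0.5, 2.0]
THREE_WAY_LABELS = ["short", "medium", "long"]
FOUR_WAY_BOUNDS = [0.5, 1.0, 2.0]


def duration_label(seconds: float) -> str:
    """Classify a duration as short (< 0.5 s), medium (0.5 s to < 2 s), or long (>= 2 s)."""
    return THREE_WAY_LABELS[bisect_right(THREE_WAY_BOUNDS, seconds)]


def duration_class(seconds: float) -> int:
    """Classify a duration into categories 1 to 4 using the second example division."""
    return bisect_right(FOUR_WAY_BOUNDS, seconds) + 1
```

`bisect_right` places a value equal to a boundary in the higher category, which matches the "greater than or equal to" wording of the examples.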

On the voice sender side, phonetic coding is performed on the phonetic symbol, tone and duration of each word according to the coding table to generate coding information with a preset bit length. This embodiment of the present application does not limit the value of the preset bit length, for example, 22 bits. After the encoded information is transmitted through the channel, it reaches the voice receiver, and the receiver performs phonetic decoding on the received information according to the coding table to obtain the phonetic symbol, pitch and duration of the single word.
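To give a feel for the resulting data volume, the arithmetic below uses the example length of 22 bits per word. The speaking rate of 4 words per second is an assumption chosen only for illustration, and framing or protocol overhead is ignored.

```python
CODE_BITS = 22  # example preset bit length from the text


def payload_bits(word_count: int, code_bits: int = CODE_BITS) -> int:
    """Number of bits needed to carry `word_count` fixed-length codes."""
    return word_count * code_bits


def bit_rate(words_per_second: float, code_bits: int = CODE_BITS) -> float:
    """Bit rate in bit/s for a given speaking rate, ignoring any overhead."""
    return words_per_second * code_bits


ten_word_sentence_bits = payload_bits(10)   # 220 bits
assumed_rate_bps = bit_rate(4)              # 88 bit/s at an assumed 4 words/s
```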

The definitions of phonetic symbols and tones differ between languages and are not limited in this embodiment of the present application. For example, in Chinese, a phonetic symbol refers to the pinyin of a Chinese character. For example, "你好" ("hello") corresponds to two phonetic symbols: "ni" for "你" and "hao" for "好". There are four tones in Chinese: Yinping (first tone), Yangping (second tone), Shangsheng (third tone), and Qusheng (fourth tone). For another example, in English, a phonetic symbol refers to an English word. For example, "Good morning" corresponds to two phonetic symbols, namely "good" and "morning". Tones in English may include, but are not limited to, at least two of the following: affirmative, interrogative, rising, falling, rising and falling, falling and rising, flat, high, and low.
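For Chinese, a (phonetic symbol, tone) pair can be represented with the common tone-number notation for pinyin, for example "ni3 hao3". The sketch below only parses such text into pairs; it assumes the pinyin and tone have already been recognised from the speech, which is outside its scope, and the function name is hypothetical.

```python
import re

# One lowercase pinyin syllable followed by a tone digit 1-4.
_SYLLABLE = re.compile(r"^([a-zü]+)([1-4])$")


def parse_numbered_pinyin(text: str) -> list[tuple[str, int]]:
    """Split tone-numbered pinyin into (phonetic symbol, tone) pairs."""
    pairs = []
    for token in text.lower().split():
        match = _SYLLABLE.match(token)
        if match is None:
            raise ValueError(f"not a tone-numbered pinyin syllable: {token!r}")
        pairs.append((match.group(1), int(match.group(2))))
    return pairs


greeting = parse_numbered_pinyin("ni3 hao3")  # pinyin of the two-character greeting
```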

4. Coding table, coding information, global index table and common index table

The encoding table is stored in the terminal device, and the encoding table stores the correspondence between phonetic symbols, tones, and durations and the encoding information. The terminal device completes phonetic symbol encoding or phonetic symbol decoding by looking up the encoding table. Optionally, the correspondence between phonetic symbols, tones, and durations and the encoding information can include at least one of the following: the correspondence between the phonetic symbols and the encoding information, the correspondence between the tones and the encoding information, the correspondence between the durations and the encoding information, the correspondence between a combination of phonetic symbols and tones and the encoding information, or the correspondence between a combination of phonetic symbols, tones, and durations and the encoding information. The encoding table may include one table or at least two tables according to different correspondences. According to different correspondences, the encoding information of the preset bit length may include information components corresponding to different combinations of phonetic symbols, tones, and durations.

Optionally, the encoded information may include a first information component corresponding to phonetic symbols and tones, and a second information component corresponding to a duration.

Optionally, the first information component includes a first information sub-component corresponding to a phonetic symbol and a second information sub-component corresponding to a tone. That is, the phonetic symbol, tone, and duration each correspond to a separate information component.

Optionally, considering that the user's commonly used vocabulary is limited, in order to save table lookup time and improve the efficiency of phonetic symbol encoding and decoding, the encoding table may include a global index table and a common index table. The global index table is used for encoding by the voice sender and decoding by the voice receiver, and the common index table is used for encoding by the voice sender. The global index table includes each phonetic symbol and the global index value of that phonetic symbol. The global index table can be understood as a complete set of words or phonetic symbols in a certain language, and the value range of the global index value is large. The common index table is generated according to how often the user uses each word within a preset time period. The embodiment of the present application does not limit the value of the preset time period, which may be, for example, 3 months, 6 months, or 1 year. Each phonetic symbol included in the common index table has a common index value as well as its global index value from the global index table.

When the terminal device performs phonetic symbol encoding and decoding according to the encoding table, it may first look up the commonly used index table. Because the common index table is generated according to the common words of the user, the number of common index values is small, and the table lookup time is short, which can improve the efficiency of phonetic symbol encoding and decoding. When no result is found in the common index table, the global index table is searched for phonetic symbol encoding and decoding.

This embodiment of the present application does not limit the value range of the global index value and the phonetic symbol ordering. For example, common words are sorted first.

This embodiment of the present application does not limit the value range of the commonly used index values and the phonetic symbol ordering. For example, the number of phonetic symbols can be 5000, and the commonly used index value can be 13 bits in length, representing a maximum of 8192 phonetic symbols, which can be sorted in descending order according to the usage frequency of single words within a preset time period.
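The relationship between index bit length and table capacity in the example above can be checked with a short Python sketch. The helper names `max_entries` and `bits_needed` are hypothetical and are not part of the application.

```python
def max_entries(bit_length):
    """Number of distinct index values a field of bit_length bits can hold."""
    return 1 << bit_length


def bits_needed(entry_count):
    """Smallest bit length whose value range covers entry_count entries."""
    return max(1, (entry_count - 1).bit_length())


# A 13-bit common index value covers at most 8192 phonetic symbols,
# which is enough for the 5000 symbols mentioned in the example.
print(max_entries(13), bits_needed(5000))
```

Under these assumptions, 5000 phonetic symbols need 13 bits, since 12 bits would cover only 4096 values.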

Optionally, the global index table and the commonly used index table may be updated periodically, and the update period is not limited in this embodiment of the present application.

The following takes the language of Chinese as an example to illustrate the coding table, coding information, global index table and common index table.

Optionally, in an implementation manner, the coding table includes: a Chinese global index table, a duration table, and a common index table. The encoded information includes a first information component and a second information component. The Chinese global index table is used to indicate the correspondence between the combination of word, phonetic symbol and tone and the first information component (global index value) in the encoded information. The global index value can be 20 bits, representing a maximum of 1,048,576 (about 1.05 million) words. The duration table is used to indicate the correspondence between the duration and the second information component in the encoded information. The common index table includes the word, phonetic symbol, tone, duration, global index value and common index value. This implementation is described below with reference to Tables 1 to 3. It is assumed that the encoded information is 22 bits long, wherein the first information component (global index value) is 20 bits long, and the second information component is 2 bits long. As shown in Table 1, the Chinese global index table can be understood as a complete set of Chinese words, and each word corresponds to a global index value. The tone ranges from 1 to 4, representing the first to fourth tones, respectively. As shown in Table 2, the duration includes three levels: short, medium, and long, and the corresponding index value is 2 bits long and serves as the second information component. As shown in Table 3, the common index value can uniquely distinguish the combination of word, phonetic symbol, tone and duration.

Table 1 Chinese global index table

Single word	Phonetic symbol	Tone	Global index value (20 bits)
好 (good)	hao	3	0x00001
坏 (bad)	huai	4	0x00002
多 (many)	duo	1	0x00003
少 (few)	shao	3	0x00004

Table 2 Duration table

Duration	Index value (2 bits, binary)
Short	00
Medium	01
Long	10

Table 3 Commonly used index table

Single word	Phonetic symbol	Tone	Duration	Global index value (20 bits)	Common index value (13 bits)
好 (good)	hao	3	Short	0x00001	0x0001
好 (good)	hao	3	Medium	0x00001	0x0002
坏 (bad)	huai	4	Long	0x00002	0x0003
多 (many)	duo	1	Short	0x00003	0x0004
少 (few)	shao	3	Long	0x00004	0x0005

Optionally, in another implementation manner, the encoding table includes: a Chinese global index table, a duration table, and a common index table. The encoded information includes a first information component and a second information component. For the Chinese global index table, see Table 1. For the duration table, see Table 2. For the common index table, see Table 4. As shown in Table 4, the common index value can uniquely distinguish the combination of word, phonetic symbol and tone.

In this implementation manner, the encoded information includes a first information component corresponding to phonetic symbols and tones, and a second information component corresponding to a duration. Since the durations are encoded separately, the difference of the durations may not be considered in the commonly used index table, which further reduces the number of commonly used index values and improves the rate of searching for the commonly used index table.

Table 4 Commonly used index table

Single word	Phonetic symbol	Tone	Global index value (20 bits)	Common index value (13 bits)
好 (good)	hao	3	0x00001	0x0001
坏 (bad)	huai	4	0x00002	0x0002
多 (many)	duo	1	0x00003	0x0003
少 (few)	shao	3	0x00004	0x0004

Optionally, in another implementation manner, the encoding table includes: a Chinese global index table, a tone table, a duration table, and a common index table. The encoded information includes a first information subcomponent, a second information subcomponent and a second information component. Exemplarily, as shown in Table 5, the Chinese global index table is used to indicate the correspondence between the phonetic symbols and the first information sub-component (global index value) in the encoded information. Exemplarily, as shown in Table 6, the tone table is used to indicate the correspondence between the tone and the second information sub-component in the encoded information (the index value of 2 bits in Table 6). The pitch ranges from 1 to 4, representing the first to fourth tones, respectively. The duration table can be found in Table 2. Exemplarily, as shown in Table 7, the commonly used index values can uniquely distinguish different phonetic symbols.

In this implementation, the phonetic symbol, tone and duration are separately encoded and decoded, and the difference between different words is not considered, which further reduces the number of common index values and the number of global index values, speeds up the search of the common index table and the global index table, and improves the encoding and decoding efficiency.

Table 5 Chinese global index table

Phonetic symbol	Global index value (18 bits)
hao	0x00001
huai	0x00002
duo	0x00003
shao	0x00004

Table 6 Tone Table

Tone	Index value (2 bits, binary)
1	00
2	01
3	10
4	11

Table 7 Commonly used index table

Phonetic symbol	Global index value (18 bits)	Common index value (12 bits)
hao	0x00001	0x0001
huai	0x00002	0x0002
duo	0x00003	0x0003
shao	0x00004	0x0004

It should be noted that when there are multiple languages, each language corresponds to a global index table. Optionally, the multiple languages may include, but are not limited to, at least two of the following: Chinese, English, German, French, Japanese, Korean, or dialects.

For example, the coding table includes three global index tables, which are Chinese global index table, English global index table and dialect global index table respectively. The Chinese global index table can include 380,000 phonetic symbols, of which the common phonetic symbols can be 100,000, and the common phonetic symbols are sorted first. The English global index table can include 280,000 phonetic symbols, of which the commonly used phonetic symbols can be 35,000, and the common phonetic symbols are sorted first. The dialect global index table can include 100,000 phonetic symbols.

It should be noted that this embodiment of the present application does not limit the value range of the global index value in each global index table. Optionally, all the global index tables can be numbered uniformly, which is convenient for quick searching during phonetic symbol encoding and decoding. For example, the encoding table includes a Chinese global index table and a dialect global index table. The value range of the global index value in the Chinese global index table is 1 to 100, with a maximum of 100 phonetic symbols. The global index value in the dialect global index table can be numbered from 101.
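The uniform numbering of several global index tables can be illustrated with a minimal Python sketch. It assumes each table reserves a fixed capacity and that ranges are assigned consecutively starting from 1. The function name `assign_ranges` and the dialect capacity of 50 are hypothetical.

```python
def assign_ranges(capacities):
    """Assign each global index table a consecutive, non-overlapping value range.

    capacities is a list of (table name, maximum number of phonetic symbols).
    Numbering starts at 1, as in the example in the text.
    """
    ranges = {}
    start = 1
    for name, capacity in capacities:
        ranges[name] = range(start, start + capacity)
        start += capacity
    return ranges


ranges = assign_ranges([("chinese", 100), ("dialect", 50)])
# The Chinese table uses 1 to 100, so the dialect table starts at 101.
print(ranges["chinese"][0], ranges["chinese"][-1], ranges["dialect"][0])
```

With non-overlapping ranges, a decoder could tell which table an index value belongs to from the value alone.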

Optionally, in order to ensure the phonetic symbol encoding and decoding efficiency, the sum of the number of phonetic symbols in all global index tables is less than a preset value, which is not limited in this embodiment of the present application, for example, 1 million.

5. Terminal equipment supports phonetic codec

Optionally, in an implementation manner, the terminal device supports the phonetic symbol encoding and decoding function by default. There is no related setting switch and no user setting is required, so the terminal device supports phonetic symbol encoding and decoding.

Optionally, in another implementation manner, the terminal device has the function of phonetic symbol encoding and decoding, and there is a related setting switch, which needs to be set by the user. In this case, the terminal device supporting phonetic symbol codec means that the phonetic symbol codec function is currently enabled on the terminal device through user settings. If the phonetic symbol codec function is currently disabled on the terminal device, the terminal device does not support phonetic symbol codec. This embodiment of the present application does not limit the manner in which the user enables or disables the phonetic symbol encoding and decoding function of the terminal device. For example, it can be controlled by any of the following: voice control, preset gesture control, and control by touch operation in the relevant interface.

At present, when a terminal device conducts a voice call, a traditional voice encoding and decoding method is usually used, for example, waveform encoding and decoding or feature encoding and decoding. When the signal of the communication environment in which the terminal device is located deteriorates, the transmitted voice code rate is reduced and the waveform features become sparser, so the waveform restored by the receiving end is more distorted relative to the original waveform. The call becomes intermittent, noisy or even silent, and the call effect is very poor. Exemplarily, FIG. 3 is a schematic diagram of a call effect using traditional voice codec in a weak signal environment. As shown in FIG. 3, user A and user B are currently in a weak signal environment, and for user B, the call is intermittent, and the effect is very poor.

In the voice processing method provided by the embodiment of the present application, when either of the two terminal devices conducting a voice call is in a weak signal environment, the voice sender encodes the phonetic symbol, tone and duration of each word spoken by the user to obtain encoded information with a preset bit length. After the encoded information is transmitted through the channel, the voice receiver decodes the received information to obtain the phonetic symbol, tone and duration of each word, thereby generating a complete and smooth voice signal and playing it with a preset sound. The voice processing method provided by the embodiment of the present application can transmit at an extremely low bit rate in a weak signal environment. Moreover, by encoding the pronunciation and duration of a single word, the receiver can clearly restore the pronunciation of the sender user, and the receiving user can then obtain the semantics according to the played pronunciation, which solves the problem of intermittent, noisy or even silent calls and improves the call effect. Exemplarily, FIG. 4 is a schematic diagram of a call effect provided by an embodiment of the present application in a weak signal environment. As shown in FIG. 4, user A and user B are currently in a weak signal environment. User B can hear the pinyin played by the terminal device with the preset sound, which clearly restores the pronunciation of the sender, user A. User B can judge the semantics from the pronunciation and understand the intention of user A, which improves the call effect.

Some application scenarios are illustrated below.

Optionally, in an application scenario, user A and user B are familiar with each other. When user A or user B is in a weak signal environment and needs to make a call, the voice processing method provided by the embodiment of the present application relies on fuzzy matching of homophones or approximate pronunciations. The receiving user does not need the exact word uttered by the sender to be recognized, and can rely on the familiarity between the two parties to accurately understand what the other party really wants to say from the pronunciation played by the terminal device, which improves the call effect.

Optionally, in another application scenario, user A is in an emergency or dangerous environment and the signal is poor. After user A establishes a call with user B, through the voice processing method provided in this embodiment of the present application, user A can send brief, important words to user B, and user B can accurately understand the intention of user A according to the pronunciation played by the terminal device, which improves the call effect.

The technical solutions of the present application will be described in detail below through specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.

The terms "first", "second", "third", "fourth", etc. (if any) in the embodiments of the present application are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.

FIG. 5 is a message interaction diagram of the voice processing method provided by the embodiment of the present application. This embodiment involves a first terminal device and a second terminal device, the first terminal device and the second terminal device are in a call state, the first terminal device is a voice sender, and the second terminal device is a voice receiver. As shown in FIG. 5 , the voice processing method provided in this embodiment may include:

S501. The first terminal device acquires a voice signal input by a user.

For example, in Fig. 4, the voice signal input by the user is the voice signal corresponding to the user A's voice "Hello, I'm on the mountain" processed by the first terminal device.

S502. When the first terminal device determines that it is currently in a weak signal environment, and determines that both the first terminal device and the second terminal device support phonetic symbol codec, it extracts multiple sets of information from the speech signal. Each set of information includes the phonetic symbol, tone and duration of a single word.

Wherein, determining that the first terminal device is currently in a weak signal environment may include: the first terminal device is in a weak signal environment, or the second terminal device is in a weak signal environment, or both the first terminal device and the second terminal device are in a weak signal environment. Regarding the weak signal environment and the implementation manner of judging whether the terminal device is in the weak signal environment, reference may be made to the above description, and details are not repeated here.

Wherein, reference may be made to the above description for the terminal device to support phonetic symbol encoding and decoding, which will not be repeated here.
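The precondition in S502 can be summarized as a small Python sketch. The function and parameter names are hypothetical. How each terminal measures its signal state is described elsewhere in the application and is represented here only by boolean inputs.

```python
def supports_phonetic_codec(has_switch, switch_enabled=False):
    """A device with no setting switch supports the codec by default;
    a device with a switch supports it only while the user has enabled it."""
    return switch_enabled if has_switch else True


def should_use_phonetic_codec(first_weak, second_weak,
                              first_supports, second_supports):
    """True when either side is in a weak signal environment
    and both sides support phonetic symbol codec."""
    weak_environment = first_weak or second_weak
    both_support = first_supports and second_supports
    return weak_environment and both_support


# The first terminal is in a weak signal area; the second has a switch that is enabled.
print(should_use_phonetic_codec(True, False,
                                supports_phonetic_codec(False),
                                supports_phonetic_codec(True, True)))
```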

Optionally, multiple sets of information can be extracted from the speech signal by way of waveform comparison. Specifically, the waveform of the speech signal is segmented. For each waveform, compare the waveform or waveform feature with the waveforms or waveform features of all locally pre-stored words, and determine the word with the greatest similarity in waveform or waveform feature among all the words as the word corresponding to the waveform. Optionally, the waveform of the speech signal is segmented, and waveform extraction can be performed by word, or segmented by a specified length. This embodiment does not limit the value of the specified length, for example, 1 second.

Because the phonetic symbol encoding and decoding used in the embodiment of the present application only requires that the pronunciation of the words is close, whether the meanings of the words are consistent may not be considered, and the efficiency of extracting multiple sets of information can be improved by means of waveform comparison.
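The waveform comparison described above might look like the following Python sketch. It assumes fixed-length segmentation and uses cosine similarity as the similarity measure. The application does not specify the measure, so this choice, the function names, and the toy templates are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sample sequences (zip truncates to the shorter one)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def segment(samples, length):
    """Split the waveform into consecutive segments of the specified length."""
    return [samples[i:i + length] for i in range(0, len(samples), length)]


def match_words(samples, templates, length):
    """For each segment, pick the pre-stored word whose waveform is most similar."""
    return [
        max(templates, key=lambda word: cosine_similarity(seg, templates[word]))
        for seg in segment(samples, length)
    ]


# Toy templates standing in for locally pre-stored word waveforms.
templates = {"hao": [1, 0, 1, 0], "ni": [0, 1, 0, 1]}
signal = [0.1, 0.9, 0.0, 1.1, 1.0, 0.1, 0.9, 0.0]
print(match_words(signal, templates, 4))
```

A real implementation would likely compare extracted waveform features rather than raw samples, which the text also allows.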

Optionally, multiple sets of information can be extracted from the speech signal through a neural network model or a machine model. Optionally, a neural network model or a machine model is used for semantic recognition, and corresponding words are output according to the input speech signal. Optionally, the neural network model or the machine model is used for speech recognition, and the corresponding phonetic symbols and tones are output according to the input speech signal.

Optionally, the duration of the word can be determined through energy detection. For example, the level of the duration and the energy threshold corresponding to each level can be preset, and the duration of the word is determined according to the comparison with multiple energy thresholds. Exemplarily, see Table 2 for the level of duration.
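One possible reading of the energy detection step is sketched below in Python. It assumes the total energy of a word's samples (sum of squares) is compared against preset thresholds, one per duration level. The threshold values and the function name are hypothetical, and the application does not state how the energy is computed.

```python
def classify_duration(samples, thresholds=((1.0, "short"), (4.0, "medium"))):
    """Map a word's total energy to a duration level.

    thresholds lists (upper energy limit, level) in ascending order;
    energy at or above the last limit is classified as "long".
    """
    energy = sum(x * x for x in samples)
    for limit, level in thresholds:
        if energy < limit:
            return level
    return "long"


# Same amplitude, increasing number of samples: more total energy, longer level.
print(classify_duration([0.5] * 2),
      classify_duration([0.5] * 8),
      classify_duration([0.5] * 20))
```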

For example, in FIG. 4, 6 groups of information can be extracted from the speech signal, namely the phonetic symbol, tone and duration of each of the words "你" (ni), "好" (hao), "我" (wo), "在" (zai), "山" (shan) and "上" (shang). For the single word "你", the phonetic symbol is ni, the tone is the third tone, and the duration is assumed to be short. For the single word "好", the phonetic symbol is hao, the tone is the third tone, and the duration is assumed to be medium.

S503. The first terminal device acquires the encoding information corresponding to each group of information according to the encoding table.

The coding table stores the correspondence between phonetic symbols, tones and durations and coding information, and the coding information has a preset bit length. Reference may be made to the above description, and details are not repeated here.

S504. The first terminal device sends the encoded information to the second terminal device.

Correspondingly, the encoded information is received by the second terminal device after being transmitted through the channel. In this embodiment, the information received by the second terminal device from the channel is called first information.

For example, in FIG. 4, assuming that the preset bit length of the encoded information is 22 bits, the first terminal device sends 6 pieces of encoded information to the second terminal device, corresponding to the words "你", "好", "我", "在", "山" and "上" respectively, for a total of 22 × 6 = 132 bits. Correspondingly, after the encoded information is transmitted through the channel, the second terminal device can receive the first information with a length of 132 bits from the channel.

S505. The second terminal device decodes the first information according to the coding table to obtain multiple sets of information. Each set of information includes the phonetic symbol, pitch and duration of the word.

Optionally, decode the first information according to the coding table to obtain multiple sets of information, which may include:

Multiple pieces of encoding information are sequentially acquired from the first information, and the length of each piece of encoding information is the preset bit length.

For each encoding information, the phonetic symbol, pitch and duration corresponding to the encoding information are obtained according to the encoding table.

For example, in FIG. 4 , the first information is 132 bits. First, obtain the first 22-bit encoding information from the first information, and obtain the phonetic symbol, tone and duration of the word corresponding to the encoding information according to the encoding table. Then, continue to acquire the second 22-bit encoding information from the first information, and acquire the phonetic symbol, tone and duration of the word corresponding to the encoding information according to the encoding table. By analogy, until the decoding of the first information is completed, the phonetic symbols, tones and duration of 6 single words are obtained.
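The sequential extraction of fixed-length codes can be sketched in Python as follows. The sketch assumes the first information is held as an integer whose most significant bits carry the first code. The function names and the sample code values are hypothetical.

```python
def join_codes(codes, code_bits=22):
    """Concatenate fixed-length codes into one integer, first code in the top bits."""
    value = 0
    for code in codes:
        value = (value << code_bits) | code
    return value


def split_codes(first_information, total_bits, code_bits=22):
    """Read codes of code_bits each from the first information, in order."""
    if total_bits % code_bits:
        raise ValueError("length is not a multiple of the preset bit length")
    count = total_bits // code_bits
    mask = (1 << code_bits) - 1
    return [
        (first_information >> (code_bits * (count - 1 - i))) & mask
        for i in range(count)
    ]


codes = [5, 9, 13, 6, 1, 2]            # six hypothetical 22-bit codes
packed = join_codes(codes)             # 6 * 22 = 132 bits in total
print(split_codes(packed, 132) == codes)
```

Passing the total length explicitly keeps leading zero bits of the first code from being lost.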

S506. The second terminal device generates a voice signal according to the multiple sets of information.

Since each set of information includes the phonetic symbol, tone and duration of a single word, the pronunciation and duration of each word can be restored to synthesize a complete and smooth speech signal.

S507. The second terminal device uses a preset sound to play the voice signal.

The preset voice is not limited in this embodiment, for example, it may be a male voice or a female voice.

It can be seen that the voice processing method provided in this embodiment can be applied to two terminal devices that conduct voice calls in a weak signal environment. At the voice sender, the phonetic symbol, tone and duration of each word in the user's speech are extracted and encoded to obtain encoded information of a preset bit length corresponding to each word. After the encoded information is transmitted through the channel, the voice receiver decodes the received information to obtain the phonetic symbol, tone and duration of each word, generates a complete and smooth voice signal, and plays it with the preset sound. In the voice processing method provided by the embodiment of the present application, since the encoded information has a preset bit length, it can be transmitted at an extremely low bit rate in a weak signal environment. By encoding and decoding the pronunciation and duration of a single word, the receiver can clearly restore the pronunciation of the sender user, and the receiving user can then obtain the semantics according to the played pronunciation. This solves the problem of intermittent, noisy or even silent calls that occurs with traditional voice encoding and decoding, and improves the call effect.

Optionally, in S503, the encoding information corresponding to each group of information is obtained according to the encoding table, which may include:

For each set of information, it is determined whether the phonetic symbols in the set of information are included in the common index table.

If the commonly used index table includes the phonetic symbols in the group of information, the coding information is obtained according to the commonly used index table.

If the phonetic symbols in the group of information are not included in the common index table, the encoding information is obtained according to the global index table.

Since the common index table is generated according to the common words of the user, the number of common index values in the common index table is much smaller than the number of global index values in the global index table. The common index table is therefore searched first for encoding, which reduces the amount of data searched. If no match is found in the common index table, the global index table is searched for encoding. This improves the encoding efficiency.
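The lookup order described in the steps above can be sketched in Python. The tables are keyed by phonetic symbol and reuse the sample values of Tables 5 and 7. The dictionary layout and the function name are assumptions made for illustration.

```python
# Hypothetical global index table: every phonetic symbol and its global index value.
GLOBAL_INDEX = {"hao": 0x00001, "huai": 0x00002, "duo": 0x00003, "shao": 0x00004}

# Hypothetical common index table: a small subset of frequently used symbols,
# each entry carrying both its common index value and its global index value.
COMMON_INDEX = {
    "hao": {"common": 0x0001, "global": 0x00001},
    "duo": {"common": 0x0003, "global": 0x00003},
}


def lookup_global_index(symbol):
    """Search the common index table first, then fall back to the global index table.

    Returns the global index value and the name of the table that answered.
    """
    entry = COMMON_INDEX.get(symbol)
    if entry is not None:
        return entry["global"], "common"
    return GLOBAL_INDEX[symbol], "global"


print(lookup_global_index("hao"), lookup_global_index("shao"))
```

Python dictionaries hide the cost difference between the two tables. The point of the sketch is only the order of the searches.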

Among them, the implementation manner of the coding table is different, and the manners of phonetic symbol encoding and phonetic symbol decoding are different. S503 and S505 are described below with examples.

Optionally, in an implementation manner, the encoding table shown in Table 1 to Table 3 above is used. The encoded information may include a first information component corresponding to phonetic symbols and tones, and a second information component corresponding to duration. In S503, for each group of information, the first terminal device may first search the common index table according to the single word and its phonetic symbol, tone and duration. If a match is found, the corresponding 20-bit global index value is used as the first information component. If no match is found, the first terminal device searches the Chinese global index table according to the single word and its phonetic symbol and tone, and uses the corresponding 20-bit global index value as the first information component. Then, the duration table is searched according to the duration of the single word, and the corresponding 2-bit index value is used as the second information component, thereby obtaining 22-bit encoded information. Correspondingly, in S505, the second terminal device acquires 22 bits of encoded information, of which the first 20 bits are the first information component and the last 2 bits are the second information component. The Chinese global index table is searched according to the first 20 bits to obtain the phonetic symbol and tone of the word. The duration table is searched according to the last 2 bits to obtain the duration of the word.
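A minimal Python sketch of this 20 + 2 bit layout follows, using the sample rows of Table 1 and Table 2. The dictionary representation and function names are assumptions, and the common-index-table lookup is omitted for brevity.

```python
# Sample rows of Table 1: (word, phonetic symbol, tone) -> 20-bit global index value.
GLOBAL_INDEX = {
    ("好", "hao", 3): 0x00001,
    ("坏", "huai", 4): 0x00002,
    ("多", "duo", 1): 0x00003,
    ("少", "shao", 3): 0x00004,
}
# Table 2: duration level -> 2-bit index value.
DURATION_INDEX = {"short": 0b00, "medium": 0b01, "long": 0b10}

# Reverse maps used by the receiver.
GLOBAL_LOOKUP = {value: key for key, value in GLOBAL_INDEX.items()}
DURATION_LOOKUP = {value: key for key, value in DURATION_INDEX.items()}


def encode(word, symbol, tone, duration):
    """Pack the global index value (first 20 bits) and the duration index (last 2 bits)."""
    return (GLOBAL_INDEX[(word, symbol, tone)] << 2) | DURATION_INDEX[duration]


def decode(code):
    """Split a 22-bit code and look both components up again."""
    word, symbol, tone = GLOBAL_LOOKUP[code >> 2]
    return word, symbol, tone, DURATION_LOOKUP[code & 0b11]


code = encode("好", "hao", 3, "medium")
print(format(code, "022b"), decode(code))
```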

Optionally, in another implementation manner, the coding table shown in Table 1, Table 2 and Table 4 above is used. The difference between this implementation manner and the above implementation manner is that the first terminal device searches the common index table according to the single word and its phonetic symbol and tone only. In this implementation manner, since the common index table does not consider the duration, the number of common index values is further reduced, the search speed is faster, and the coding efficiency is improved.

Optionally, in yet another implementation manner, the coding table shown in Table 2, Table 5, Table 6 and Table 7 above is used. The encoded information may include a first information component corresponding to phonetic symbols and tones, and a second information component corresponding to durations, where the first information component includes a first information subcomponent corresponding to phonetic symbols and a second information subcomponent corresponding to tones. In S503, for each group of information, the first terminal device can first search the common index table according to the phonetic symbol of the single word. If a match is found, the corresponding 18-bit global index value is used as the first information subcomponent. If no match is found, the Chinese global index table is searched according to the phonetic symbol of the single word, and the corresponding 18-bit global index value is used as the first information subcomponent. Then, the tone table is searched according to the tone of the single word, and the corresponding 2-bit index value is used as the second information subcomponent. Afterwards, the duration table is searched according to the duration of the single word, and the corresponding 2-bit index value is used as the second information component, thereby obtaining encoded information of 18 + 2 + 2 = 22 bits. Correspondingly, in S505, the second terminal device obtains 22 bits of encoded information, of which the first 18 bits are the first information subcomponent, the middle 2 bits are the second information subcomponent, and the last 2 bits are the second information component. The Chinese global index table is searched according to the first 18 bits to obtain the phonetic symbol of the single word. The tone table is searched according to the middle 2 bits to obtain the tone of the word. The duration table is searched according to the last 2 bits to obtain the duration of the word.

In this implementation manner, the phonetic symbol, pitch and duration are separately encoded and decoded, the number of common index values and the number of global index values are further reduced, the search speed is faster, and the encoding and decoding efficiency is improved.
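The 18 + 2 + 2 bit layout of this implementation can be sketched in Python with the sample values of Tables 5, 6 and 2. The dictionary representation and function names are assumptions. The common-index-table lookup is left out to keep the sketch short.

```python
# Table 5 sample rows: phonetic symbol -> 18-bit global index value.
SYMBOL_INDEX = {"hao": 0x00001, "huai": 0x00002, "duo": 0x00003, "shao": 0x00004}
# Table 6: tone -> 2-bit index value.
TONE_INDEX = {1: 0b00, 2: 0b01, 3: 0b10, 4: 0b11}
# Table 2: duration level -> 2-bit index value.
DURATION_INDEX = {"short": 0b00, "medium": 0b01, "long": 0b10}

SYMBOL_LOOKUP = {v: k for k, v in SYMBOL_INDEX.items()}
TONE_LOOKUP = {v: k for k, v in TONE_INDEX.items()}
DURATION_LOOKUP = {v: k for k, v in DURATION_INDEX.items()}


def encode(symbol, tone, duration):
    """Pack symbol (first 18 bits), tone (middle 2 bits) and duration (last 2 bits)."""
    return (SYMBOL_INDEX[symbol] << 4) | (TONE_INDEX[tone] << 2) | DURATION_INDEX[duration]


def decode(code):
    """Recover phonetic symbol, tone and duration from a 22-bit code."""
    symbol = SYMBOL_LOOKUP[code >> 4]
    tone = TONE_LOOKUP[(code >> 2) & 0b11]
    duration = DURATION_LOOKUP[code & 0b11]
    return symbol, tone, duration


code = encode("hao", 3, "medium")
print(format(code, "022b"), decode(code))
```

Because each component is looked up independently, the three tables stay small regardless of how many words share a pronunciation.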

Optionally, in another embodiment of the present application, an implementation manner of determining that the terminal device is in a weak signal environment in S502 is provided on the basis of the embodiment shown in FIG. 5 above. Through the negotiation between the first terminal device and the second terminal device, it is determined that phonetic symbol codec can be used when it is currently in a weak signal environment.

Optionally, in an implementation manner, as shown in FIG. 6 , determining that the first terminal device is currently in a weak signal environment may include:

S601. If the first terminal device determines that the target parameter of the first terminal device satisfies the first preset condition, send a first request message to the second terminal device, where the first request message is used to instruct the second terminal device to use phonetic symbol encoding and decoding.

Wherein, the target parameter of the first terminal device is used to indicate the signal state of the communication environment where the first terminal device is currently located. For the target parameter and the first preset condition, reference may be made to the above description of this application, and details are not repeated here.

Correspondingly, the second terminal device receives the first request message.

S602. The second terminal device sends a first response message to the first terminal device, where the first response message is used to instruct the second terminal device to use phonetic symbol encoding and decoding.

In this implementation, the first terminal device as the voice sender determines that it is currently in a weak signal environment according to its own target parameters, and actively initiates a negotiation of codec mode switching to the second terminal device, thereby ensuring that phonetic symbol codec is used in a timely manner, improving conversation quality.

Optionally, the first request message may include a first indication field, which is used to indicate that phonetic symbol codec is used. This embodiment of the present application does not limit the name of the first indication field.

Optionally, the first request message may include a hysteresis timer, and the hysteresis timer is used to indicate a delay time for the second terminal device to use phonetic symbol encoding and decoding.

Usually, the terminal device adopts the traditional voice codec by default. Setting the hysteresis timer reserves time for switching the voice codec mode. After the hysteresis timer expires, the first terminal device and the second terminal device use the phonetic symbol codec at the same time, which improves the codec mode switching effect.
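Exemplarily, the request/response exchange of S601 and S602 together with the hysteresis timer can be sketched as follows. This is only a minimal illustration under stated assumptions: the class and field names (`Terminal`, `RequestMessage`, `hysteresis_ms`), the use of received signal strength as the target parameter, and the threshold comparison standing in for the first preset condition are all hypothetical, and transmission delay is ignored so that both devices share one clock.

```python
from dataclasses import dataclass
from typing import Optional

WAVEFORM = "waveform"   # traditional voice codec, used by default
PHONETIC = "phonetic"   # phonetic symbol codec


@dataclass
class RequestMessage:
    use_phonetic_codec: bool  # hypothetical stand-in for the first indication field
    hysteresis_ms: int        # hypothetical encoding of the hysteresis timer


@dataclass
class ResponseMessage:
    use_phonetic_codec: bool


class Terminal:
    def __init__(self) -> None:
        self.codec = WAVEFORM
        self.switch_at_ms: Optional[int] = None

    def build_request(self, signal_dbm: float, threshold_dbm: float,
                      hysteresis_ms: int) -> Optional[RequestMessage]:
        # S601: send a request only if the (assumed) first preset condition holds.
        if signal_dbm < threshold_dbm:
            return RequestMessage(True, hysteresis_ms)
        return None

    def handle_request(self, request: RequestMessage, now_ms: int) -> ResponseMessage:
        # S602: the peer schedules its own switch and answers.
        self.switch_at_ms = now_ms + request.hysteresis_ms
        return ResponseMessage(True)

    def handle_response(self, request: RequestMessage,
                        response: ResponseMessage, now_ms: int) -> None:
        # The initiator schedules its switch once the response confirms the mode.
        if response.use_phonetic_codec:
            self.switch_at_ms = now_ms + request.hysteresis_ms

    def tick(self, now_ms: int) -> None:
        # Switch codec mode once the hysteresis timer has expired.
        if self.switch_at_ms is not None and now_ms >= self.switch_at_ms:
            self.codec = PHONETIC
            self.switch_at_ms = None
```

In this sketch both devices keep the traditional codec until the timer expires and then switch together; a real implementation would also need to account for message transit time.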

Optionally, in another implementation manner, as shown in FIG. 7 , determining that the first terminal device is currently in a weak signal environment may include:

S701. If the second terminal device determines that the target parameter of the second terminal device satisfies the first preset condition, send a second request message to the first terminal device, where the second request message is used to instruct the first terminal device to use phonetic symbol encoding and decoding.

Wherein, the target parameter of the second terminal device is used to indicate the signal state of the communication environment where the second terminal device is currently located. For the target parameter and the first preset condition, reference may be made to the above description of this application, and details are not repeated here.

Correspondingly, the first terminal device receives the second request message.

Optionally, the second request message may include a first indication field, and reference may be made to the above description of the first indication field, which will not be repeated here.

Optionally, the second request message may include a hysteresis timer, where the hysteresis timer is used to indicate a delay time for the first terminal device to use phonetic symbol encoding and decoding.

S702. The first terminal device sends a second response message to the second terminal device, where the second response message is used to instruct the first terminal device to use phonetic symbol encoding and decoding.

In this implementation, the second terminal device, as the voice receiver, determines that it is currently in a weak signal environment according to its own target parameters, and actively initiates a negotiation of codec mode switching with the first terminal device, so as to ensure that the phonetic symbol codec mode is used in time, thereby improving call quality.

Optionally, in the above process, if the first response message or the second response message is not received successfully, it may be retransmitted. Optionally, the number of retransmissions may be set, and the specific value is not limited in this embodiment.
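Exemplarily, a bounded retransmission scheme can be sketched as follows. The sketch assumes that the requesting device re-sends its request whenever no response arrives, which is one possible reading of the above; the function name `send_with_retransmission` and the callable `send` are hypothetical.

```python
from typing import Callable, Optional, Tuple


def send_with_retransmission(send: Callable[[], Optional[str]],
                             max_retransmissions: int) -> Tuple[Optional[str], int]:
    """Send once, then retransmit up to max_retransmissions times.

    `send` performs one transmission and returns the response, or None if no
    response was received. Returns (response, number_of_attempts).
    """
    attempts = 0
    for _ in range(1 + max_retransmissions):
        attempts += 1
        response = send()
        if response is not None:
            return response, attempts
    # All attempts failed; the caller may keep the current codec mode.
    return None, attempts
```

The number of retransmissions is a configurable parameter here, consistent with the specific value not being limited in this embodiment.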

Optionally, in another embodiment of the present application, based on the embodiment shown in FIG. 5 above, an implementation manner of determining that both the first terminal device and the second terminal device support phonetic symbol encoding and decoding in S502 is provided. Through capability negotiation between the first terminal device and the second terminal device, it is determined that both parties in the call support phonetic codec, and phonetic codec can be used when currently in a weak signal environment.

Optionally, in an implementation manner, as shown in FIG. 8 , the first terminal device determines that both the first terminal device and the second terminal device support phonetic symbol encoding and decoding, which may include:

S801. The first terminal device sends first capability information to the second terminal device, where the first capability information is used to indicate that the first terminal device supports phonetic symbol encoding and decoding.

Correspondingly, the second terminal device receives the first capability information sent by the first terminal device.

S802. The second terminal device sends first capability response information to the first terminal device, where the first capability response information is used to indicate that the second terminal device supports phonetic symbol encoding and decoding.

In this implementation manner, the first terminal device, which is the voice sender, actively initiates capability negotiation with the second terminal device, so as to ensure that the phonetic symbol encoding and decoding method is adopted in time to improve the quality of the call.

Optionally, in another implementation manner, as shown in FIG. 9 , the first terminal device determines that both the first terminal device and the second terminal device support phonetic symbol encoding and decoding, which may include:

S901. The second terminal device sends second capability information to the first terminal device, where the second capability information is used to indicate that the second terminal device supports phonetic symbol encoding and decoding.

Correspondingly, the first terminal device receives the second capability information sent by the second terminal device.

S902. The first terminal device sends second capability response information to the second terminal device, where the second capability response information is used to indicate that the first terminal device supports phonetic symbol encoding and decoding.

In this implementation manner, the second terminal device, which is the voice receiver, actively initiates capability negotiation with the first terminal device, so as to ensure that the phonetic symbol encoding and decoding method is adopted in time to improve the call quality.

It should be noted that the first capability information and the second capability information may be separate messages, or may be carried in existing messages. This embodiment does not limit the time of the capability negotiation process. For example, after the first terminal device establishes a connection with the second terminal device, capability negotiation may be performed during the alerting (ringing) message, and the ringing message may include a new audio codec capability field to indicate whether the terminal device supports phonetic symbol encoding and decoding.

Optionally, in the above process, if the first capability response information or the second capability response information is not received successfully, it may be retransmitted. Optionally, the number of retransmissions may be set, and the specific value is not limited in this embodiment.
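Exemplarily, the capability negotiation of S801/S802 (or S901/S902) can be sketched as follows. The message layout is an assumption made only for illustration: the dictionary keys `type` and `phonetic_codec` are hypothetical stand-ins for the capability field and are not defined by this application.

```python
from typing import Optional


def build_capability_message(supports_phonetic_codec: bool) -> dict:
    # Capability information sent by the initiating device (S801 or S901).
    return {"type": "capability", "phonetic_codec": supports_phonetic_codec}


def build_capability_response(supports_phonetic_codec: bool) -> dict:
    # Capability response information returned by the peer (S802 or S902).
    return {"type": "capability_response", "phonetic_codec": supports_phonetic_codec}


def both_support(local_supports: bool, response: Optional[dict]) -> bool:
    # Phonetic symbol codec may be used only if both parties support it.
    if not local_supports or response is None:
        return False
    return bool(response.get("phonetic_codec", False))
```

For example, `both_support(True, build_capability_response(True))` evaluates to `True`, while a missing response or a peer that does not support the codec yields `False`.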

Optionally, in a scenario where the user can set the terminal device to enable or disable the phonetic symbol encoding and decoding function, the terminal device supporting phonetic symbol encoding and decoding means that the phonetic symbol encoding and decoding function is currently enabled on the terminal device through user settings.

Optionally, if the function of phonetic symbol encoding and decoding is not currently enabled on the terminal device, the user can set it. The voice processing method provided in this embodiment may further include:

Display a setting interface.

Receive the user's operation in the setting interface.

In response to the operation, enable the phonetic symbol encoding and decoding function.

Exemplarily, FIG. 10 is an interface diagram for setting a phonetic symbol encoding and decoding mode provided by an embodiment of the present application. As shown in (a) of FIG. 10 , the terminal device currently displays a setting interface 1001, and the setting interface 1001 includes the function option “weak signal HD voice coding”, that is, the phonetic symbol coding and decoding function in the embodiment of the present application. The state of the control 1010 can display whether the phonetic symbol codec function is enabled on the current terminal device. In (a) of FIG. 10 , the phonetic symbol codec function is turned off. The user can perform a click operation on the control 1010. Correspondingly, the terminal device responds to the click operation to enable the phonetic symbol encoding and decoding function, as shown in (b) of FIG. 10 .

It should be noted that this embodiment does not limit the time for the user to set the phonetic symbol encoding and decoding function.

Optionally, if the terminal device currently does not have the phonetic symbol encoding and decoding function enabled, and the terminal device determines that it is currently in a weak signal environment, the voice processing method provided in this embodiment may further include:

Generate and output prompt information.

Receive a first instruction from the user.

Enable the phonetic symbol encoding and decoding function according to the first instruction.

This embodiment does not limit the implementation manner of the prompt information. For example, a prompt voice or prompt music may be played, or a prompt box or prompt message may pop up in the interface currently displayed by the terminal device.

Optionally, in yet another embodiment of the present application, an implementation manner of switching from phonetic symbol encoding and decoding to traditional speech encoding and decoding is provided on the basis of the foregoing embodiment. The communication environment where the two terminal devices of the call are located changes in real time and will not always be in a weak signal environment. The phonetic symbol encoding and decoding in the embodiment of the present application is more suitable for a weak signal environment and meets basic communication requirements. When the communication environment improves, it should switch back to the traditional voice codec in time to improve the user's call experience. Through the negotiation between the first terminal device and the second terminal device, when it is determined that the current environment is not a weak signal environment, traditional voice codec can be used.

Optionally, in an implementation manner, as shown in FIG. 11 , the voice processing method provided in this embodiment may further include:

S1101. If the first terminal device determines that the target parameter of the first terminal device satisfies the second preset condition, send a third request message to the second terminal device, where the third request message is used to instruct the second terminal device to use waveform encoding and decoding.

Wherein, the target parameter of the first terminal device is used to indicate the signal state of the communication environment where the first terminal device is currently located. For the target parameter and the second preset condition, reference may be made to the above description of this application, and details are not repeated here.

Correspondingly, the second terminal device receives the third request message.

Optionally, the third request message may include a second indication field, which is used to indicate the use of traditional speech codec. This embodiment of the present application does not limit the name of the second indication field. For example, the name could be "back to HD". Optionally, the second indication field may also indicate the start time point of using traditional speech codec.

S1102. If the second terminal device determines that the target parameter of the second terminal device meets the second preset condition, send a third response message to the first terminal device, where the third response message is used to instruct the second terminal device to use waveform encoding and decoding.

Wherein, the target parameter of the second terminal device is used to indicate the signal state of the communication environment where the second terminal device is currently located. For the target parameter and the second preset condition, reference may be made to the above description of this application, and details are not repeated here.

In this implementation manner, when the first terminal device, as the voice sender, determines that its signal environment has improved, it actively initiates a negotiation of codec mode switching with the second terminal device. When the second terminal device also determines that it is not currently in a weak signal environment, it returns a response message. This ensures that the traditional voice codec is used in a timely manner when the communication environment is good, improving call quality.

Optionally, in another implementation manner, as shown in FIG. 12 , the voice processing method provided in this embodiment may further include:

S1201. If the second terminal device determines that the target parameter of the second terminal device meets the second preset condition, send a fourth request message to the first terminal device, where the fourth request message is used to instruct the first terminal device to use waveform encoding and decoding.

Correspondingly, the first terminal device receives the fourth request message.

Optionally, the fourth request message may include a second indication field, and reference may be made to the above description of the second indication field, which will not be repeated here.

S1202. If the first terminal device determines that the target parameter of the first terminal device satisfies the second preset condition, send a fourth response message to the second terminal device, where the fourth response message is used to instruct the first terminal device to use waveform encoding and decoding.

In this implementation manner, when the second terminal device, as the voice receiver, determines that its signal environment has improved, it actively initiates a negotiation of codec mode switching with the first terminal device. When the first terminal device also determines that it is not currently in a weak signal environment, it returns a response message. This ensures that the traditional voice codec is used in a timely manner when the communication environment is good, improving call quality.

Optionally, in the above process, if the third response message or the fourth response message is not received successfully, it may be retransmitted. Optionally, the number of retransmissions may be set, and the specific value is not limited in this embodiment.
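Exemplarily, the switch-back negotiation of FIG. 11 and FIG. 12 can be sketched as follows. The sketch assumes, for illustration only, that the target parameter is a received signal strength and that the second preset condition is a simple threshold comparison; the function names and the threshold are hypothetical.

```python
def meets_second_condition(signal_dbm: float, threshold_dbm: float) -> bool:
    # Assumed form of the second preset condition: signal is strong enough.
    return signal_dbm >= threshold_dbm


def negotiate_switch_back(initiator_dbm: float, responder_dbm: float,
                          threshold_dbm: float) -> str:
    """Return the codec mode in use after the negotiation."""
    # S1101 / S1201: the initiator sends a request only if its own
    # target parameter satisfies the second preset condition.
    if not meets_second_condition(initiator_dbm, threshold_dbm):
        return "phonetic"
    # S1102 / S1202: the responder answers only if its own target
    # parameter also satisfies the second preset condition.
    if not meets_second_condition(responder_dbm, threshold_dbm):
        return "phonetic"
    return "waveform"
```

The point of the two checks is that the call returns to waveform encoding and decoding only when neither device is in a weak signal environment.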

It can be understood that, in order to realize the above-mentioned functions, the terminal device includes corresponding hardware and/or software modules for executing each function. The present application can be implemented in hardware or in the form of a combination of hardware and computer software in conjunction with the algorithm steps of each example described in conjunction with the embodiments disclosed herein. Whether a function is performed by hardware or computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functionality for each particular application in conjunction with the embodiments, but such implementations should not be considered beyond the scope of this application.

In this embodiment of the present application, the terminal device may be divided into functional modules according to the foregoing method examples. For example, each functional module may be divided according to each function, or two or more functions may be integrated into one processing module. It should be noted that, the division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation. It should be noted that the names of the modules in the embodiments of the present application are schematic, and the names of the modules are not limited in actual implementation.

In the case where each functional module is divided according to each function, FIG. 13 is a schematic structural diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 13 , the terminal device may include: a sending module 1301 , a processing module 1302 and a receiving module 1303 .

The sending module 1301 is used to send data to other devices. For example, encoding information, first request message, second response message, third request message, fourth response message, first capability information or second capability response information.

The receiving module 1303 is used for receiving data from other devices. For example, the first information, the second request message, the first response message, the fourth request message, the third response message, the second capability information or the first capability response information.

The processing module 1302 is used to obtain the voice signal input by the user, extract multiple sets of information from the voice signal, and obtain the coding information corresponding to each set of information according to the coding table; or to decode the first information according to the coding table to obtain multiple sets of information, generate a voice signal according to the multiple sets of information, and play the voice signal using a preset sound, etc.

It should be noted that, all relevant contents of the steps involved in the above method embodiments can be cited in the functional description of the corresponding functional module, which will not be repeated here.
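Exemplarily, the division into a sending module, a processing module, and a receiving module can be sketched as follows. This is only a schematic illustration: the class names, the in-memory list used as a channel, and the contents of the coding table are hypothetical, and extraction of information from the voice signal and speech synthesis are omitted.

```python
from typing import Dict, List, Optional, Tuple

Group = Tuple[str, int, int]  # (phonetic symbol, tone, duration) of a single word


class ProcessingModule:
    def __init__(self, coding_table: Dict[Group, int]) -> None:
        self.encode_table = dict(coding_table)
        # Reverse mapping used by the receiving side for decoding.
        self.decode_table = {code: group for group, code in coding_table.items()}

    def encode(self, groups: List[Group]) -> List[int]:
        return [self.encode_table[g] for g in groups]

    def decode(self, codes: List[int]) -> List[Group]:
        return [self.decode_table[c] for c in codes]


class SendingModule:
    def __init__(self, channel: list) -> None:
        self.channel = channel

    def send(self, data: List[int]) -> None:
        self.channel.append(data)


class ReceivingModule:
    def __init__(self, channel: list) -> None:
        self.channel = channel

    def receive(self) -> Optional[List[int]]:
        return self.channel.pop(0) if self.channel else None


# Hypothetical coding table entries, chosen only for this example.
TABLE: Dict[Group, int] = {("ni", 3, 2): 1, ("hao", 3, 3): 2}
```

A sender would pass the output of `ProcessingModule.encode` to `SendingModule.send`, and a receiver would pass the result of `ReceivingModule.receive` to `ProcessingModule.decode`.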

Please refer to FIG. 14 , which shows another structure of a terminal device provided by an embodiment of the present application. The terminal device includes: a processor 1401 , a receiver 1402 , a transmitter 1403 , a memory 1404 , and a bus 1405 . The processor 1401 includes one or more processing cores, and the processor 1401 executes various functional applications and information processing by running software programs and modules. The receiver 1402 and the transmitter 1403 may be implemented as a communication component, which may be a baseband chip. The memory 1404 is connected to the processor 1401 through the bus 1405 . The memory 1404 may be configured to store at least one program instruction, and the processor 1401 may be configured to execute the at least one program instruction, so as to implement the technical solutions of the foregoing embodiments. The implementation principle and technical effect thereof are similar to the related embodiments of the above method, and are not repeated here.

When the terminal is powered on, the processor can read the software program in the memory, interpret and execute the instructions of the software program, and process the data of the software program. When it is necessary to send data through the antenna, the processor performs baseband processing on the data to be sent and outputs the baseband signal to the control circuit. The control circuit performs radio frequency processing on the baseband signal and sends the radio frequency signal through the antenna in the form of electromagnetic waves. When data is sent to the terminal, the control circuit receives the radio frequency signal through the antenna, converts the radio frequency signal into a baseband signal, and outputs the baseband signal to the processor, which converts the baseband signal into data and processes the data.

Those skilled in the art can understand that, for the convenience of description, FIG. 14 only shows one memory and one processor. In an actual terminal, there may be multiple processors and memories. The memory may also be referred to as a storage medium or a storage device, etc., which is not limited in this embodiment of the present application.

As an optional implementation manner, the processor may include a baseband processor and a central processing unit. The baseband processor is mainly used to process communication data, and the central processing unit is mainly used to execute software programs and process data of the software programs. Those skilled in the art can understand that the baseband processor and the central processing unit may be integrated into one processor, or may be independent processors, which are interconnected through technologies such as a bus. Those skilled in the art can understand that a terminal may include multiple baseband processors to adapt to different network standards, a terminal may include multiple central processors to enhance its processing capability, and various components of the terminal may be connected through various buses. The baseband processor can also be expressed as a baseband processing circuit or a baseband processing chip. The central processing unit can also be expressed as a central processing circuit or a central processing chip. The function of processing the communication protocol and communication data may be built in the processor, or may be stored in the memory in the form of a software program, and the processor executes the software program to realize the baseband processing function. The memory can be integrated into the processor or independent of the processor. The memory includes a cache, which can store frequently accessed data/instructions.

In this embodiment of the present application, the processor may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. A general-purpose processor may be a microprocessor or any conventional processor or the like. The steps of the methods disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.

In this embodiment of the present application, the memory may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or may be a volatile memory, such as a random-access memory (RAM). The memory may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.

The memory in this embodiment of the present application may also be a circuit or any other device capable of implementing a storage function, for storing program instructions and/or data. The methods provided by the embodiments of the present application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, network equipment, user equipment, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (e.g., infrared, wireless, or microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., an SSD).

The embodiments of the present application provide a computer program product, which enables the terminal to execute the technical solutions in the foregoing embodiments when the computer program product runs on a terminal. The implementation principle and technical effect thereof are similar to those of the above-mentioned related embodiments, which will not be repeated here.

The embodiments of the present application provide a computer-readable storage medium, on which program instructions are stored, and when the program instructions are executed by a terminal, the terminal executes the technical solutions of the foregoing embodiments. The implementation principle and technical effect thereof are similar to those of the above-mentioned related embodiments, which will not be repeated here. To sum up, the above embodiments are only used to illustrate the technical solutions of the present application, but not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements to some of the technical features thereof; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the scope of the technical solutions of the embodiments of the present application.

Claims

A voice processing method, characterized in that it is applied to a first terminal device, the first terminal device and the second terminal device are in a call state, the method comprising:

Obtain the voice signal input by the user;

When it is determined that the first terminal device is currently in a weak signal environment, and it is determined that both the first terminal device and the second terminal device support phonetic symbol encoding and decoding, extracting multiple sets of information from the voice signal, where each set of information includes the phonetic symbol, tone, and duration of a single word; the phonetic symbol encoding and decoding refers to encoding and decoding the phonetic symbol, tone, and duration;

Obtain the coding information corresponding to each group of information according to the coding table; the coding table stores the correspondence between phonetic symbols, tones and duration and the coding information;

The encoded information is sent to the second terminal device.
The method according to claim 1, wherein the encoded information comprises a first information component corresponding to phonetic symbols and tones, and a second information component corresponding to a duration.
The method according to claim 2, wherein the first information components comprise first information sub-components corresponding to phonetic symbols and second information sub-components corresponding to tones.
The method according to claim 1, wherein the coding table includes a global index table and a common index table, the common index table is generated according to the number of words used by a user within a preset time period, and the global index table is The index table includes a phonetic symbol and a global index value of the phonetic symbol, and the phonetic symbol included in the common index table has a common index value and a global index value of the phonetic symbol in the global index table.
The method according to claim 4, characterized in that, acquiring the encoding information corresponding to each group of information according to the encoding table comprises:

For each group of information, determine whether the phonetic symbols in the group of information are included in the common index table;

If the phonetic symbols in the group of information are included in the common index table, the encoding information is acquired according to the common index table;

If the phonetic symbols in the group of information are not included in the common index table, the encoding information is acquired according to the global index table.
The method according to any one of claims 1-5, wherein determining that the first terminal device is currently in a weak signal environment comprises:

If it is determined that the target parameter satisfies the first preset condition, a first request message is sent to the second terminal device, where the first request message is used to instruct the second terminal device to use the phonetic symbol codec, and the target parameter is used to indicate the signal state of the communication environment where the first terminal device is currently located;

A first response message sent by the second terminal device is received, where the first response message is used to instruct the second terminal device to use the phonetic symbol codec.
The method according to claim 6, wherein the first request message includes a hysteresis timer, and the hysteresis timer is used to indicate a delay time for the second terminal device to use the phonetic symbol codec.
The method according to any one of claims 1-5, wherein determining that the first terminal device is currently in a weak signal environment comprises:

receiving a second request message sent by the second terminal device, where the second request message is used to instruct the first terminal device to use the phonetic symbol codec;

Send a second response message to the second terminal device, where the second response message is used to instruct the first terminal device to use the phonetic symbol codec.
The method according to any one of claims 1-8, wherein the method further comprises:

If it is determined that the target parameter satisfies the second preset condition, a third request message is sent to the second terminal device, where the third request message is used to instruct the second terminal device to use waveform encoding and decoding, and the target parameter is used to indicate the signal state of the communication environment where the first terminal device is currently located;

A third response message sent by the second terminal device is received, where the third response message is used to instruct the second terminal device to use the waveform codec.
The method according to any one of claims 1-8, wherein the method further comprises:

receiving a fourth request message sent by the second terminal device, where the fourth request message is used to instruct the first terminal device to use waveform codec;

If it is determined that the target parameter meets the second preset condition, a fourth response message is sent to the second terminal device, where the fourth response message is used to instruct the first terminal device to use the waveform codec, and the target parameter is used to indicate the signal state of the communication environment where the first terminal device is currently located.
The method according to claim 6 or 9 or 10, wherein the target parameter includes at least one of the following: location information of the first terminal device, a cell identifier of the cell currently accessed by the first terminal device, the signal strength of the signal received by the first terminal device, or the voice packet loss rate.
The method according to any one of claims 1-11, wherein determining that the first terminal device supports phonetic symbol encoding and decoding comprises:

If it is determined that the first terminal device does not have the function of phonetic symbol codec enabled, then generate and output prompt information;

receiving a first instruction from the user;

The function is turned on according to the first instruction.
The method according to any one of claims 1-11, wherein the determining that both the first terminal device and the second terminal device support phonetic symbol encoding and decoding, comprising:

sending first capability information to the second terminal device, where the first capability information is used to indicate that the first terminal device supports phonetic codec;

receiving first capability response information sent by the second terminal device, where the first capability response information is used to indicate that the second terminal device supports phonetic symbol encoding and decoding;

or,

receiving second capability information sent by the second terminal device, where the second capability information is used to indicate that the second terminal device supports phonetic codec;

sending second capability response information to the second terminal device, where the second capability response information is used to indicate that the first terminal device supports the phonetic symbol codec.
The method according to any one of claims 1-11, wherein the method further comprises:

displaying a setting interface;

receiving the user's operation in the setting interface;

in response to the operation, turning on the phonetic symbol codec function.
A voice processing method, characterized in that it is applied to a second terminal device, and the second terminal device is in a call state with the first terminal device, the method comprising:

receiving the first information sent by the first terminal device;

decoding the first information according to a coding table to obtain multiple groups of information, where each group of information includes the phonetic symbol, tone and duration of a single character, and the coding table stores the correspondence between the phonetic symbol, the tone and the duration, and the encoded information;

generating a voice signal according to the multiple groups of information;

playing the voice signal with a preset sound.
The method according to claim 15, wherein decoding the first information according to the coding table to obtain multiple groups of information comprises:

obtaining a plurality of pieces of encoded information in sequence from the first information, where the length of each piece of encoded information is a preset bit length;

for each piece of encoded information, acquiring the phonetic symbol, tone and duration corresponding to the encoded information according to the coding table.
The method according to claim 16, wherein the encoded information comprises a first information component corresponding to phonetic symbols and tones, and a second information component corresponding to a duration.
The method according to claim 17, wherein the first information component comprises a first information sub-component corresponding to phonetic symbols and a second information sub-component corresponding to tones.
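The fixed-length decoding described in the three preceding claims can be illustrated with a minimal Python sketch. The claims only require a preset bit length and the split into components, so everything concrete here is an assumption made for illustration: the 16-bit code length, the 9/3/4-bit field widths, the 40 ms duration step, the two-entry `PHONETIC_TABLE`, and the names `encode`, `decode` and `Syllable`.

```python
from typing import List, NamedTuple

# Hypothetical field widths; the claims only state "a preset bit length".
PHONETIC_BITS = 9   # first information sub-component: phonetic symbol index
TONE_BITS = 3       # second information sub-component: tone
DURATION_BITS = 4   # second information component: quantised duration
CODE_BITS = PHONETIC_BITS + TONE_BITS + DURATION_BITS  # 16 bits per character

# Hypothetical excerpt of a coding table (index -> phonetic symbol).
PHONETIC_TABLE = {0: "ni", 1: "hao"}
DURATION_STEP_MS = 40  # assumed quantisation step for the duration field


class Syllable(NamedTuple):
    phonetic: str
    tone: int
    duration_ms: int


def encode(phonetic_index: int, tone: int, duration_level: int) -> int:
    """Pack the three fields into one fixed-length code."""
    return (
        (phonetic_index << (TONE_BITS + DURATION_BITS))
        | (tone << DURATION_BITS)
        | duration_level
    )


def decode(first_information: bytes) -> List[Syllable]:
    """Read codes of CODE_BITS in sequence and look each one up."""
    step = CODE_BITS // 8
    groups = []
    for offset in range(0, len(first_information) - step + 1, step):
        code = int.from_bytes(first_information[offset:offset + step], "big")
        duration_level = code & ((1 << DURATION_BITS) - 1)
        tone = (code >> DURATION_BITS) & ((1 << TONE_BITS) - 1)
        phonetic_index = code >> (TONE_BITS + DURATION_BITS)
        groups.append(
            Syllable(
                PHONETIC_TABLE[phonetic_index],
                tone,
                duration_level * DURATION_STEP_MS,
            )
        )
    return groups
```

Under these assumptions each character costs two bytes on the wire, and the receiver recovers one group of phonetic symbol, tone and duration per code.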
The method according to claim 15, wherein the coding table includes a global index table and a commonly used index table, the commonly used index table is generated according to the number of words used by a user within a preset time period, the global index table includes phonetic symbols and a global index value of each phonetic symbol, and each phonetic symbol included in the commonly used index table has a common index value and the global index value of that phonetic symbol in the global index table.
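One way to picture the relationship between the two tables is the following Python sketch. It assumes the commonly used index table is built by ranking phonetic symbols by how often they occur in the user's recent speech. The claim does not fix the ranking rule, so that rule, the `GLOBAL_INDEX` contents and the function name `build_common_index_table` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical global index table: phonetic symbol -> global index value.
GLOBAL_INDEX = {"de": 0, "hao": 1, "ma": 2, "ni": 3, "shi": 4}


def build_common_index_table(
    used_phonetics: Iterable[str], size: int
) -> Dict[str, Tuple[int, int]]:
    """Map each frequently used phonetic symbol to
    (common index value, global index value).

    `used_phonetics` stands for the symbols observed within the preset
    time period; `size` caps the number of entries in the table.
    """
    counts = Counter(p for p in used_phonetics if p in GLOBAL_INDEX)
    # Most frequent first; ties are broken by global index for determinism.
    ranked = sorted(counts, key=lambda p: (-counts[p], GLOBAL_INDEX[p]))
    return {
        phonetic: (common_index, GLOBAL_INDEX[phonetic])
        for common_index, phonetic in enumerate(ranked[:size])
    }
```

A small common table lets frequent symbols be referenced with fewer bits, while the stored global index value keeps every entry traceable to the global index table.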
The method according to any one of claims 15-19, wherein before the receiving the first information sent by the first terminal device, the method further comprises:

receiving a first request message sent by the first terminal device, where the first request message is used to instruct the second terminal device to use the phonetic symbol codec, and the phonetic symbol codec refers to encoding and decoding phonetic symbols, tones and durations;

sending a first response message to the first terminal device, where the first response message is used to instruct the second terminal device to use the phonetic symbol codec.
The method according to claim 20, wherein the first request message includes a hysteresis timer, and the hysteresis timer is used to indicate a delay time for the second terminal device to use the phonetic symbol codec.
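The delayed switch implied by the hysteresis timer can be sketched as a small state holder in Python. This is only an illustration under stated assumptions: the message shape `CodecSwitchRequest`, the class `CodecState`, the codec labels `"waveform"` and `"phonetic"`, and the use of seconds for the delay are all hypothetical and not taken from the claims.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodecSwitchRequest:
    target_codec: str    # e.g. "phonetic" or "waveform" (hypothetical labels)
    hysteresis_s: float  # delay before the receiver starts using the codec


class CodecState:
    """Tracks the codec in use on the receiving terminal device."""

    def __init__(self, codec: str = "waveform") -> None:
        self.codec = codec
        self._pending: Optional[str] = None
        self._switch_at: float = 0.0

    def on_request(self, request: CodecSwitchRequest, now: float) -> dict:
        """Record the requested switch and return a response message."""
        self._pending = request.target_codec
        self._switch_at = now + request.hysteresis_s
        return {"type": "response", "codec": request.target_codec}

    def tick(self, now: float) -> str:
        """Apply a pending switch once the hysteresis delay has elapsed."""
        if self._pending is not None and now >= self._switch_at:
            self.codec = self._pending
            self._pending = None
        return self.codec
```

Passing the current time in explicitly keeps the sketch deterministic; a real terminal would presumably drive `tick` from its own clock.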
The method according to any one of claims 15-19, wherein before the receiving the first information sent by the first terminal device, the method further comprises:

if it is determined that the target parameter satisfies the first preset condition, sending a second request message to the first terminal device, where the second request message is used to instruct the first terminal device to use the phonetic symbol codec, the phonetic symbol codec refers to encoding and decoding phonetic symbols, tones and durations, and the target parameter is used to indicate the signal state of the communication environment where the second terminal device is currently located;

receiving a second response message sent by the first terminal device, where the second response message is used to instruct the first terminal device to use the phonetic symbol codec.
The method according to any one of claims 15-22, wherein the method further comprises:

receiving a third request message sent by the first terminal device, where the third request message is used to instruct the second terminal device to use waveform codec;

if it is determined that the target parameter meets the second preset condition, sending a third response message to the first terminal device, where the third response message is used to instruct the second terminal device to use the waveform codec, and the target parameter is used to indicate the signal state of the communication environment where the first terminal device is currently located.
The method according to any one of claims 15-22, wherein the method further comprises:

if it is determined that the target parameter satisfies the second preset condition, sending a fourth request message to the first terminal device, where the fourth request message is used to instruct the first terminal device to use the waveform codec;

receiving a fourth response message sent by the first terminal device, where the fourth response message is used to instruct the first terminal device to use the waveform codec.
The method according to any one of claims 22-24, wherein the target parameter includes at least one of the following: location information of the second terminal device, the cell identifier of the cell currently accessed by the second terminal device, the signal strength of the signal received by the second terminal device, or the voice packet loss rate.
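As an illustration of how a target parameter might drive the choice between the two codecs, the Python sketch below checks signal strength and voice packet loss rate against thresholds. The claims do not define the first and second preset conditions numerically, so the threshold values, the `TargetParameter` fields and the function names are assumptions chosen only to make the example concrete.

```python
from dataclasses import dataclass


@dataclass
class TargetParameter:
    signal_strength_dbm: float  # strength of the received signal
    packet_loss_rate: float     # voice packet loss rate, 0.0 to 1.0


# Hypothetical thresholds; the gap between them avoids rapid toggling.
WEAK_DBM, WEAK_LOSS = -110.0, 0.20
GOOD_DBM, GOOD_LOSS = -100.0, 0.05


def meets_first_preset_condition(p: TargetParameter) -> bool:
    """Weak signal environment: switch towards the phonetic symbol codec."""
    return p.signal_strength_dbm <= WEAK_DBM or p.packet_loss_rate >= WEAK_LOSS


def meets_second_preset_condition(p: TargetParameter) -> bool:
    """Signal has recovered: switch back towards the waveform codec."""
    return p.signal_strength_dbm >= GOOD_DBM and p.packet_loss_rate <= GOOD_LOSS


def next_codec(current: str, p: TargetParameter) -> str:
    """Return the codec to request next, or the current one if unchanged."""
    if current == "waveform" and meets_first_preset_condition(p):
        return "phonetic"
    if current == "phonetic" and meets_second_preset_condition(p):
        return "waveform"
    return current
```

Location information or a cell identifier could feed the same decision, for example through a lookup of cells known to have poor coverage, but that is not modelled here.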
The method according to any one of claims 15-25, wherein before the receiving the first information sent by the first terminal device, the method further comprises:

receiving first capability information sent by the first terminal device, where the first capability information is used to indicate that the first terminal device supports phonetic codec;

sending first capability response information to the first terminal device, where the first capability response information is used to indicate that the second terminal device supports the phonetic symbol codec;

or,

sending second capability information to the first terminal device, where the second capability information is used to indicate that the second terminal device supports the phonetic symbol codec;

receiving second capability response information sent by the first terminal device, where the second capability response information is used to indicate that the first terminal device supports the phonetic symbol codec.
The method according to any one of claims 15-25, wherein the method further comprises:

displaying a setting interface;

receiving the user's operation in the setting interface;

in response to the operation, turning on the phonetic symbol codec function, where the phonetic symbol codec refers to encoding and decoding phonetic symbols, tones and durations.
A terminal device, characterized in that it includes a processor, a memory and a transceiver, where the transceiver is used to communicate with other devices, and the processor is used to call a program stored in the memory to execute the method according to any one of claims 1-27.
A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions, and when the computer instructions are executed on a terminal device, the terminal device is caused to perform the method according to any one of claims 1-27.