CN102710539A - Method and device for transferring voice messages

Info

Publication number: CN102710539A
Application number: CN2012101335145A
Authority: CN
Inventors: 阮亚平; 李加周
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2012-05-02
Filing date: 2012-05-02
Publication date: 2012-10-03

Abstract

The invention discloses a method and a device for transferring voice messages. The method includes: starting a voice recognition module when the quality of voice communication is lowered; the terminal performing, through the voice recognition module, voice recognition on voice signals collected by a local voice input device, generating corresponding text messages and sending them to an opposite terminal; or the terminal sending the voice signals to a voice recognition cloud end through the voice recognition module, obtaining corresponding text messages from the voice recognition cloud end and sending them to the opposite terminal. By the method and the device, the effectiveness and timeliness of voice message transfer can be improved, and the quality of user experience is enhanced.

Description

Voice information transmission method and device

Technical Field

The present invention relates to the field of communications, and in particular, to a method and apparatus for transmitting voice information.

Background

Instant messaging is a basic Internet technology. Common instant messaging software currently integrates multiple real-time communication modes, such as text, voice and video, to meet users' diverse communication needs.

For two-way real-time communication, high-quality voice calls place greater demands on the network and terminal devices than text-based communication. On the one hand, packet loss, delay and jitter in the network can seriously affect call quality; on the other hand, the microphone, earphone, loudspeaker and ambient noise of the terminal can also affect call quality. Therefore, how to improve voice call quality in an instant messaging system under complex network and terminal environments is a problem to be solved.

Disclosure of Invention

The invention provides a method and a device for transmitting voice information, which aim to solve the problem of low voice call quality of an instant messaging system in the prior art.

The invention provides a voice information transmission method, which comprises the following steps:

starting a voice recognition module under the condition that the voice call quality is determined to be reduced;

the terminal carries out voice recognition on voice signals collected by local voice input equipment through a voice recognition module, generates corresponding text information and sends the text information to an opposite terminal; or the terminal sends the voice signal to the voice recognition cloud end through the voice recognition module, acquires corresponding text information from the voice recognition cloud end and sends the text information to the opposite end.

The invention also provides a voice information transmission device, comprising:

the starting module is used for starting the voice recognition module under the condition that the voice call quality is determined to be reduced;

the voice recognition module is used for carrying out voice recognition on voice signals collected by the local voice input equipment, generating corresponding text information and sending the text information to the opposite terminal; or sending the voice signal to a voice recognition cloud end, and acquiring corresponding text information from the voice recognition cloud end and sending the text information to the opposite end.

The invention has the following beneficial effects:

when the network or terminal environment can not ensure good voice call quality, the voice recognition technology is utilized to convert the voice into corresponding text information for transmission, the problem of low voice call quality of the instant messaging system in the prior art is solved, the effectiveness and timeliness of voice information transmission can be improved, and the user experience quality is improved.

Drawings

Fig. 1 is a flowchart of a method for transmitting voice information according to an embodiment of the present invention;

Fig. 2 is a detailed processing flowchart of the method for transmitting voice information according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of a transmitting end and a receiving end according to an embodiment of the present invention;

Fig. 4 is a flowchart of example 1 of an embodiment of the present invention;

Fig. 5 is a flowchart of example 2 of an embodiment of the present invention;

Fig. 6 is a schematic diagram of the scenario of example 3 of an embodiment of the present invention;

Fig. 7 is a flowchart of example 3 of an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of a voice information transmitting apparatus according to an embodiment of the present invention.

Detailed Description

To solve the problem of low voice call quality in prior-art instant messaging systems, the invention provides a voice information transmission method and device. For voice call applications in an instant messaging system, the method and device can automatically meet basic communication requirements whether the network quality drops or the terminal environment develops a fault or problem that hinders real-time voice communication, thereby greatly improving the user's quality of experience. The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and do not limit it.

Method embodiment

According to an embodiment of the present invention, a method for transmitting voice information is provided. Fig. 1 is a flowchart of the method; as shown in Fig. 1, the method includes the following processing:

Step 101, starting a voice recognition module under the condition that the voice call quality is determined to be reduced;

Step 101 specifically includes the following processing: the terminal automatically starts the voice recognition module when it determines that the current network condition and/or the terminal environment of the opposite terminal causes the voice call quality to be reduced; or the voice recognition module is started manually according to an operation of the user.

In step 101, the terminal determining that the current network condition causes the voice call quality to be reduced specifically includes the following processing:

1. Acquiring a network quality index carried in feedback information sent by the opposite terminal, wherein the network quality index indicates whether the packet loss rate, network jitter and/or delay value exceeds a preset first threshold. In practical applications, the first threshold may comprise several thresholds, corresponding respectively to the packet loss rate, the network jitter and the delay value.

2. If the network quality index indicates that the packet loss rate, network jitter and/or delay value exceeds the preset first threshold, determining that the current network condition causes the voice call quality to be reduced.
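The threshold check described above can be sketched as follows. This is only an illustrative sketch: the metric names, the threshold values and the function names `build_network_quality_index` and `network_degrades_call` are assumptions, since the patent gives neither concrete numbers nor a message format.

```python
from typing import Dict

# Hypothetical per-metric values for the "first threshold";
# the patent does not specify any numbers.
FIRST_THRESHOLD: Dict[str, float] = {
    "packet_loss_rate": 0.05,  # fraction of packets lost
    "jitter_ms": 50.0,
    "delay_ms": 400.0,
}


def build_network_quality_index(
    measured: Dict[str, float],
    thresholds: Dict[str, float] = FIRST_THRESHOLD,
) -> Dict[str, bool]:
    """Opposite-terminal side: flag each metric that exceeds its threshold."""
    return {name: measured[name] > limit for name, limit in thresholds.items()}


def network_degrades_call(index: Dict[str, bool]) -> bool:
    """Terminal side: any exceeded metric means the network reduces call quality."""
    return any(index.values())
```

In this sketch the opposite terminal does the comparison and only the resulting flags travel in the feedback information, which is one possible reading of "carries information about whether ... exceed a preset first threshold".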

In step 101, the terminal determining that the terminal environment of the opposite terminal causes the voice call quality to be reduced specifically includes the following processing:

1. Acquiring feedback information sent by the opposite terminal and, if it is determined from the feedback information that the voice output equipment of the opposite terminal cannot work normally, determining that the terminal environment of the opposite terminal causes the voice call quality to be reduced; or

2. Acquiring feedback information sent by the opposite terminal and, if it is determined from the feedback information that the environmental noise value of the opposite terminal exceeds a preset second threshold, determining that the terminal environment of the opposite terminal causes the voice call quality to be reduced. Specifically, the opposite terminal may obtain its environmental noise value by detecting the signal-to-noise ratio of the input voice signal, and report it in the feedback information.
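The two terminal-environment conditions can be sketched as a single predicate. The `EnvironmentFeedback` fields, the unit-less noise value and the `SECOND_THRESHOLD` constant below are hypothetical; the patent only states that the feedback reveals whether the output equipment works and what the environmental noise value is.

```python
from dataclasses import dataclass

# Hypothetical value for the "second threshold"; the patent gives no number or unit.
SECOND_THRESHOLD = 70.0


@dataclass
class EnvironmentFeedback:
    """Assumed shape of the feedback sent by the opposite terminal."""
    output_device_ok: bool  # False if the earphone/loudspeaker cannot work normally
    noise_value: float      # environmental noise value reported by the opposite terminal


def environment_degrades_call(
    feedback: EnvironmentFeedback,
    noise_threshold: float = SECOND_THRESHOLD,
) -> bool:
    """True if the opposite terminal's environment reduces voice call quality."""
    if not feedback.output_device_ok:
        return True
    return feedback.noise_value > noise_threshold
```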

Preferably, before the voice recognition module is started, prompt information can be output to prompt the user to choose whether to start the voice recognition module. If the user chooses no, the voice recognition module is not started, so as to save resources; if the user chooses yes, the voice recognition module is started.

Step 102, the terminal performs voice recognition on a voice signal acquired by local voice input equipment through the voice recognition module, generates corresponding text information and sends the text information to the opposite terminal; or the terminal sends the voice signal to the voice recognition cloud end through the voice recognition module, acquires corresponding text information from the voice recognition cloud end and sends the text information to the opposite terminal.

Specifically, the speech recognition module can perform segmented speech recognition on the speech signal collected by the local speech input device.
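As a rough illustration of segmented recognition, the sketch below cuts the captured samples into fixed-length segments and records a start time and duration for each. Fixed-length segmentation is an assumption made here for simplicity (the patent does not say how segments are chosen), and `recognize` is a hypothetical callback standing in for a local or cloud recognizer.

```python
from typing import Callable, List, Sequence, Tuple


def recognize_in_segments(
    samples: Sequence[int],
    sample_rate: int,
    segment_ms: int,
    recognize: Callable[[Sequence[int]], str],
) -> List[Tuple[int, int, str]]:
    """Return (start_ms, duration_ms, text) for each fixed-length segment."""
    step = sample_rate * segment_ms // 1000  # samples per segment
    if step <= 0:
        raise ValueError("segment_ms is too short for this sample rate")
    results = []
    for offset in range(0, len(samples), step):
        chunk = samples[offset:offset + step]
        start_ms = offset * 1000 // sample_rate
        duration_ms = len(chunk) * 1000 // sample_rate
        results.append((start_ms, duration_ms, recognize(chunk)))
    return results
```

A real system would more plausibly cut at pauses in speech; the fixed step is used only to keep the timing arithmetic visible.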

In step 102, after the corresponding text information is generated, time information corresponding to the text information may also be recorded, where the time information includes a start time and a duration.

In step 102, sending the text information to the opposite terminal specifically includes: sending the text information carrying the time information to the opposite terminal through an independent text channel or a voice stream channel, wherein the text information carries a "generated by voice recognition" attribute.
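One way to carry the text, its time information and the "generated by voice recognition" attribute in a single message is sketched below. The JSON encoding, the field names and the `source` marker value are assumptions; the patent does not define a wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RecognizedTextMessage:
    """Assumed message layout; field names are illustrative only."""
    text: str
    start_time_ms: int
    duration_ms: int
    source: str = "speech_recognition"  # marks the text as generated by voice recognition


def pack_text_message(message: RecognizedTextMessage) -> bytes:
    """Serialize the message for the text channel (JSON is an assumption)."""
    return json.dumps(asdict(message), ensure_ascii=False).encode("utf-8")


def parse_text_message(payload: bytes) -> RecognizedTextMessage:
    """Inverse of pack_text_message, used by the opposite terminal."""
    return RecognizedTextMessage(**json.loads(payload.decode("utf-8")))
```

The opposite terminal can then branch on `source` to decide whether the text is a candidate for text-to-voice playback or an ordinary typed message.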

After step 102 is performed, the peer needs to receive and present the text information.

Specifically, if the opposite terminal determines that the attribute of the text information is "generated by voice recognition", it can convert the text information into voice information through a text-to-voice conversion module and play the converted voice information according to the time information. Playing the converted voice information according to the time information specifically comprises: 1. determining, according to the time information, whether the voice packets in the time period corresponding to the text information have yet to be played; 2. if the voice packets have not yet been played, determining whether the packet loss rate of the voice packets is greater than a preset third threshold; if so, replacing the voice packets with the converted voice information and playing it; if not, ending the operation.
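The receiver-side decision can be sketched as a small predicate. The `VoiceSegmentState` bookkeeping, the way the packet loss rate is computed from expected and received packet counts, and the `THIRD_THRESHOLD` value are all assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical value for the "third threshold"; the patent gives no number.
THIRD_THRESHOLD = 0.2


@dataclass
class VoiceSegmentState:
    """Assumed receiver bookkeeping for the time period covered by one text message."""
    already_played: bool
    packets_expected: int
    packets_received: int


def should_replace_with_tts(
    state: VoiceSegmentState,
    loss_threshold: float = THIRD_THRESHOLD,
) -> bool:
    """True if the buffered voice packets should be replaced by converted voice."""
    if state.already_played:
        return False  # too late: the original audio has already been played
    if state.packets_expected == 0:
        return False  # nothing was expected in this period, so nothing to replace
    loss_rate = 1.0 - state.packets_received / state.packets_expected
    return loss_rate > loss_threshold
```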

The opposite end can also directly display the text information in a text mode.

It should be noted that, in the case that the opposite end is a forwarding device, the text information or the converted voice information is forwarded.

The above technical solutions of the embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Fig. 2 is a detailed processing flowchart of the method for transmitting voice information according to the embodiment of the present invention, and as shown in fig. 2, the method includes the following steps:

step 201, judging whether the network quality can ensure the call quality, if not, executing step 204, otherwise, executing step 202;

step 202, judging whether the terminal environment of the opposite terminal can ensure the call quality, if not, executing step 204, otherwise, executing step 203;

step 203, judging whether the user selects to manually start the voice recognition module, if so, executing step 204, otherwise, ending the operation;

step 204, starting a voice recognition module;

step 205, performing voice recognition on a voice signal acquired by local voice input equipment to generate corresponding text information;

step 206, sending the text message to the opposite terminal;

step 207, the opposite end receives and displays the text information.
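The branching in steps 201 to 204 reduces to a short decision function. The function and parameter names below are hypothetical; the three inputs stand for the outcomes of the network check, the opposite-terminal environment check and the user's manual choice.

```python
def should_start_recognition(
    network_ok: bool,
    peer_environment_ok: bool,
    user_requested: bool,
) -> bool:
    """Mirror of steps 201 to 203: decide whether step 204 is reached."""
    if not network_ok:           # step 201: network cannot ensure call quality
        return True
    if not peer_environment_ok:  # step 202: opposite terminal environment cannot
        return True
    return user_requested        # step 203: manual start by the user
```

Note that the checks are ordered: the user's manual choice is consulted only when both automatic checks pass.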

Fig. 3 is a schematic diagram of a transmitting end and a receiving end according to an embodiment of the present invention, and as shown in fig. 3, the transmitting end determines whether to perform voice recognition on collected voice information through network quality detection, terminal environment detection, and user setting detection, converts the voice information into text information and transmits the text information to an opposite end if voice recognition is required, and directly transmits the voice information (voice code) if voice recognition is not required. If the receiving end receives the text data, the text data can be directly displayed, and the text data can also be converted into the voice data to be played.

The above technical solutions of the embodiments of the present invention are described in detail below with reference to examples.

Example 1

The client A acquires the text information corresponding to the voice data segment through the voice recognition module under the condition that the current network condition is not good, and sends the text information to the client B, and the client B can show the text information to a user after receiving the text information and tries to convert the text information into voice for output. Fig. 4 is a flowchart of example 1 of the embodiment of the present invention, and as shown in fig. 4, includes the following processes:

Step 401, client A and client B conduct a voice call, and client B measures the packet loss rate; if the packet loss rate is higher than a set threshold, go to step 402; otherwise, end the operation.

Step 402, client B sends feedback information to client A.

Step 403, client A receives and parses the feedback information, and starts the voice recognition module.

Step 404, client A passes the collected voice signal to the voice recognition module, which analyzes it to obtain the corresponding text information.

Step 405, client A packs the generated text information and sends it to client B through a text transmission channel. The packed text information includes: the text information itself, the corresponding start time, the duration, and the "speech recognition generated" attribute.

Step 406, client B receives the text packet and parses out the text information, the start time, the duration, and the attribute value.

Step 407, client B displays the text information in the text dialog window.

Step 408, if the attribute value of the text information is "speech recognition generated", go to step 409; otherwise, end the operation.

Step 409, according to the start time and duration of the text information, check whether the received voice data packets in the corresponding time period have already been played; if not, go to step 410; otherwise, end the operation.

Step 410, determine whether the packet loss rate of those voice data packets is greater than a set threshold; if so, go to step 411; otherwise, end the operation.

Step 411, discard all voice data packets in the time period of the text information, and play in their place the voice obtained by text-to-voice conversion of the text information.

Example 2

When the user of client A is told by the user on the other side that the voice cannot be heard clearly, the user of client A actively starts the voice recognition module; the text information corresponding to the voice data segment is obtained through the voice recognition module and sent to client B, and client B can display the text information to its user after receiving it. Fig. 5 is a flowchart of example 2 of an embodiment of the present invention, and as shown in Fig. 5, includes the following processes:

Step 501, client A and client B start a voice call; when the user of client B cannot hear the other party clearly, that user says by voice that the call is "inaudible".

Step 502, if the user of client A hears the "inaudible" message from the user of client B, go to step 503; otherwise, end the operation.

Step 503, the user of client A chooses to start the voice recognition function.

Step 504, client A passes the collected voice signal to the voice recognition module, which analyzes it to obtain the corresponding text information, and client A sends the text information to client B through a text transmission channel.

Step 505, client B receives and parses the text packet.

Step 506, client B displays the text information in a text dialog window.

Example 3

Fig. 6 is a schematic view of a scenario of example 3 according to an embodiment of the present invention, and as shown in fig. 6, an Instant Messaging (IM) client a calls a fixed telephone C through a voice gateway server B and performs a voice call with the fixed telephone C. Under the condition that the current network condition is not good, the client A acquires text information corresponding to the voice data segment through the voice recognition module and sends the text information to the voice gateway server B, and the voice gateway server B tries to convert the text information into voice information after receiving the text information and forwards the voice information to the fixed telephone C. Fig. 7 is a flowchart of example 3 of the embodiment of the present invention, as shown in fig. 7, including the following processes:

Step 701, client A and fixed telephone C conduct a voice call through voice gateway server B, and voice gateway server B measures the packet loss rate of the packets received from client A; if the packet loss rate is higher than a set threshold, go to step 702; otherwise, end the operation.

Step 702, voice gateway server B sends feedback information to client A.

Step 703, client A receives and parses the feedback information, and starts the voice recognition module.

Step 704, client A passes the collected voice signal to the voice recognition module, which analyzes it to obtain the corresponding text information.

Step 705, client A packs the generated text information and sends it to voice gateway server B through a text transmission channel. The packed text information includes: the text information itself, the corresponding start time, and the duration.

Step 706, voice gateway server B receives the text packet and parses out the text information, the start time, and the duration.

Step 707, according to the start time and duration of the text information, voice gateway server B checks whether the received voice data packets in the corresponding time period have already been forwarded; if not, go to step 708; otherwise, end the operation.

Step 708, determine whether the packet loss rate of those voice data packets is greater than a set threshold; if so, go to step 709; otherwise, end the operation.

Step 709, discard all voice data packets in the time period corresponding to the text information, replace them with the voice obtained by text-to-voice conversion of the text information, and forward that voice to fixed telephone C.

In summary, with the technical solution of the embodiments of the present invention, when the network or terminal environment cannot ensure good voice call quality, the voice recognition technology is used to convert the voice into the corresponding text information for transmission, so as to solve the problem of low voice call quality of the instant messaging system in the prior art, improve the effectiveness and timeliness of voice information transmission, and improve the quality of user experience.

Device embodiment

According to an embodiment of the present invention, a voice information transmitting apparatus is provided. Fig. 8 is a schematic structural diagram of the apparatus; as shown in Fig. 8, the apparatus includes a starting module 80 and a voice recognition module 82. Each module of the embodiment of the present invention is described in detail below.

A starting module 80, configured to start the voice recognition module 82 in a case where it is determined that the voice call quality is degraded;

the starting module 80 is specifically configured to: the voice recognition module 82 is automatically started under the condition that the terminal determines that the current network condition and/or the terminal environment of the opposite terminal cause the voice call quality to be reduced; or, the voice recognition module 82 is manually started according to the operation of the user;

the starting module 80 specifically includes: a network condition determining submodule and a terminal environment determining submodule, wherein:

the network condition determining submodule is used for acquiring a network quality index carried in feedback information sent by an opposite terminal, wherein the network quality index carries information about whether a packet loss rate, network jitter and/or a delay value exceed a preset first threshold value; in practical applications, the first threshold may include a plurality of thresholds respectively corresponding to a packet loss rate, a network jitter, and a delay value; if the network quality index carries information that the packet loss rate, the network jitter and/or the delay value exceed a preset first threshold value, determining that the current network condition causes the voice call quality to be reduced;

the terminal environment determining submodule is used for acquiring feedback information sent by the opposite terminal, and determining that the voice output equipment of the opposite terminal cannot work normally according to the feedback information, so that the terminal environment of the opposite terminal is determined to cause the voice call quality to be reduced; or obtaining feedback information sent by the opposite terminal, and determining that the environmental noise value of the opposite terminal exceeds a preset second threshold value according to the feedback information, and determining that the terminal environment of the opposite terminal causes the voice call quality to be reduced.

The voice recognition module 82 is used for performing voice recognition on voice signals acquired by local voice input equipment, generating corresponding text information and sending the text information to an opposite terminal; or sending the voice signal to a voice recognition cloud end, and acquiring corresponding text information from the voice recognition cloud end and sending the text information to the opposite end.

The speech recognition module 82 is specifically configured to: carrying out segmented voice recognition on voice signals collected by local voice input equipment;

the speech recognition module 82 is further configured to: recording time information corresponding to the text information, wherein the time information comprises: start time, duration; sending text information carrying time information to an opposite terminal through an independent text channel or a voice stream channel, wherein the text information carries a voice recognition generation attribute;

Preferably, the above apparatus further comprises a prompt module, a display module and a forwarding module, wherein:

a prompt module, configured to output a prompt message to prompt a user to select whether to start the speech recognition module 82 before the start module 80 starts the speech recognition module 82; in the case that the user chooses no, the voice recognition module 82 is prohibited from being started;

the display module is used for receiving and displaying the text information sent by the voice recognition module 82;

wherein, the display module specifically comprises:

the voice display sub-module is used for determining whether the attribute of the text information is "generated by voice recognition" and, if so, converting the text information into voice information through the text-to-voice conversion module and playing the converted voice information according to the time information;

the text display sub-module is used for directly displaying the text information in text form;

the voice display sub-module is specifically configured to: determine, according to the time information, whether the voice packets in the time period corresponding to the text information have yet to be played; and, if the voice packets have not yet been played, determine whether the packet loss rate of the voice packets is greater than a preset third threshold and, if so, replace the voice packets with the converted voice information and play it;

and the forwarding module is used for forwarding the text information or the converted voice information under the condition that the opposite end is the forwarding device.

step 201, the determining module 80 determines whether the network quality can ensure the call quality, if not, step 204 is executed, otherwise, step 202 is executed;

step 202, the determining module 80 determines whether the terminal environment of the opposite terminal can ensure the call quality, if not, step 204 is executed, otherwise, step 203 is executed;

step 203, the starting module 82 judges whether the user selects to manually start the voice recognition module, if so, step 204 is executed, otherwise, the operation is ended;

step 204, the starting module 82 starts a voice recognition module;

step 205, the voice recognition module 84 performs voice recognition on the voice signal collected by the local voice input device to generate corresponding text information;

step 206, the voice recognition module 84 sends the text message to the opposite terminal;

in step 207, the opposite-end display module 86 receives and displays the text message.

Although the preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible, and the scope of the invention should not be limited to the embodiments described above.

Claims

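1. A method for transmitting voice information, comprising:

starting a voice recognition module under the condition that the voice call quality is determined to be reduced;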

the terminal performs voice recognition on voice signals acquired by local voice input equipment through the voice recognition module, generates corresponding text information and sends the text information to the opposite terminal; or the terminal sends the voice signal to a voice recognition cloud end through the voice recognition module, acquires corresponding text information from the voice recognition cloud end and sends the text information to an opposite terminal.

2. The method of claim 1, wherein, in the event that a decrease in voice call quality is determined, activating the voice recognition module specifically comprises:

automatically starting the voice recognition module when the terminal determines that the current network condition and/or the terminal environment of the opposite terminal causes the voice call quality to be reduced; or

manually starting the voice recognition module according to an operation of the user.

3. The method of claim 2,

the terminal determining that the current network condition causes the voice call quality to be reduced specifically comprises:

acquiring a network quality index carried in feedback information sent by the opposite terminal, wherein the network quality index indicates whether a packet loss rate, network jitter and/or a delay value exceeds a preset first threshold;

if the network quality index indicates that the packet loss rate, the network jitter and/or the delay value exceeds the preset first threshold, determining that the current network condition causes the voice call quality to be reduced;

the terminal determining that the terminal environment of the opposite terminal causes the voice call quality to be reduced specifically comprises:

acquiring feedback information sent by the opposite terminal and, if it is determined from the feedback information that the voice output equipment of the opposite terminal cannot work normally, determining that the terminal environment of the opposite terminal causes the voice call quality to be reduced; or

acquiring feedback information sent by the opposite terminal and, if it is determined from the feedback information that the environmental noise value of the opposite terminal exceeds a preset second threshold, determining that the terminal environment of the opposite terminal causes the voice call quality to be reduced.

4. The method of claim 2,

before automatically starting the speech recognition module, the method further comprises:

outputting prompt information to prompt a user to select whether to start the voice recognition module;

under the condition that the user selects no, forbidding to start the voice recognition module;

after generating the corresponding text information, the method further comprises:

recording time information corresponding to the text information, wherein the time information comprises: start time, duration;

sending the text information to the opposite terminal specifically includes:

and sending the text information carrying the time information to the opposite terminal through an independent text channel or a voice stream channel, wherein the text information carries a voice recognition generation attribute.

5. The method of claim 4, wherein the method further comprises: the opposite terminal receives and displays the text information;

the receiving and displaying the text information by the opposite terminal specifically includes:

if the opposite terminal determines that the attribute of the text information is "generated by voice recognition", converting the text information into voice information through a text-to-voice conversion module, and playing the converted voice information according to the time information; or

the opposite terminal directly displaying the text information in text form.

6. The method of claim 5, wherein playing the converted voice information according to the time information specifically comprises:

determining, according to the time information, whether the voice packets in the time period corresponding to the text information have yet to be played;

and, if the voice packets have not yet been played, determining whether the packet loss rate of the voice packets is greater than a preset third threshold and, if so, replacing the voice packets with the converted voice information and playing it.

7. The method of claim 5, wherein the method further comprises:

and forwarding the text information or the converted voice information under the condition that the opposite terminal is forwarding equipment.

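8. A voice information transmitting apparatus, comprising:

the starting module is used for starting the voice recognition module under the condition that the voice call quality is determined to be reduced;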

a voice recognition module, configured to perform voice recognition on voice signals collected by a local voice input device, generate corresponding text information, and send the text information to an opposite terminal; or to send the voice signals to a voice recognition cloud, obtain corresponding text information from the voice recognition cloud, and send the text information to the opposite terminal.

9. The apparatus of claim 8,

the starting module is specifically configured to: automatically start the voice recognition module when the terminal determines that the current network condition and/or the terminal environment of the opposite terminal causes a reduction in voice call quality; or manually start the voice recognition module according to an operation of the user;

the starting module specifically comprises:

a network condition determining sub-module, configured to obtain a network quality indicator carried in feedback information sent by the opposite terminal, wherein the network quality indicator carries information about whether a packet loss rate, a network jitter, and/or a delay value exceeds a preset first threshold; and, if the network quality indicator carries information that the packet loss rate, the network jitter, and/or the delay value exceeds the preset first threshold, to determine that the current network condition causes a reduction in voice call quality;

a terminal environment determining sub-module, configured to obtain feedback information sent by the opposite terminal and, upon determining from the feedback information that the voice output device of the opposite terminal cannot work normally, to determine that the terminal environment of the opposite terminal causes a reduction in voice call quality; or to obtain feedback information sent by the opposite terminal and, upon determining from the feedback information that the environmental noise value of the opposite terminal exceeds a preset second threshold, to determine that the terminal environment of the opposite terminal causes a reduction in voice call quality.
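The two determining sub-modules and the automatic start condition can be sketched together. This is an illustrative sketch under stated assumptions: the `Feedback` record, its field names, and the decibel unit for the noise value are hypothetical, and the feedback is assumed to carry ready-made flags saying whether each network metric exceeded the first threshold, as the claim describes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feedback:
    """Hypothetical feedback information sent by the opposite terminal."""
    loss_exceeds_first_threshold: bool = False
    jitter_exceeds_first_threshold: bool = False
    delay_exceeds_first_threshold: bool = False
    output_device_working: bool = True
    ambient_noise_db: Optional[float] = None  # None when no noise value was reported


def network_degrades_call(fb: Feedback) -> bool:
    """Network condition check: any metric flagged as over the first threshold."""
    return (
        fb.loss_exceeds_first_threshold
        or fb.jitter_exceeds_first_threshold
        or fb.delay_exceeds_first_threshold
    )


def terminal_environment_degrades_call(fb: Feedback, second_threshold_db: float) -> bool:
    """Terminal environment check: broken voice output device, or noise over the second threshold."""
    if not fb.output_device_working:
        return True
    return fb.ambient_noise_db is not None and fb.ambient_noise_db > second_threshold_db


def should_auto_start(fb: Feedback, second_threshold_db: float) -> bool:
    """Automatic start condition: network and/or terminal environment reduces call quality."""
    return network_degrades_call(fb) or terminal_environment_degrades_call(fb, second_threshold_db)
```

Manual start by user operation is outside this sketch; it would simply bypass `should_auto_start`.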

10. The apparatus of claim 9,

the voice recognition module is specifically configured to: perform segmented voice recognition on the voice signals collected by the local voice input device;

the voice recognition module is further configured to: record time information corresponding to the text information, wherein the time information comprises a start time and a duration; and send the text information carrying the time information to the opposite terminal through an independent text channel or a voice stream channel, wherein the text information carries an attribute indicating that it was generated by voice recognition;

the apparatus further comprises: a prompting module, configured to output prompt information before the starting module starts the voice recognition module, prompting the user to select whether to start the voice recognition module, and to prohibit the voice recognition module from being started if the user selects no;

a display module, configured to receive and display the text information sent by the voice recognition module;

the display module specifically comprises:

a voice display sub-module, configured to, upon determining from the attribute that the text information was generated by voice recognition, convert the text information into voice information through a text-to-voice conversion module and play the converted voice information according to the time information;

a text display sub-module, configured to directly display the text information in text form;

the voice display sub-module is specifically configured to:

determine, according to the time information, whether the voice packet in the time period corresponding to the text information has yet to be played; and if the voice packet has yet to be played, determine whether the packet loss rate of the voice packet is greater than a preset third threshold and, if so, replace the voice packet with the converted voice information and play the converted voice information;

the apparatus further comprises: a forwarding module, configured to forward the text information or the converted voice information if the opposite terminal is a forwarding device.