WO2020237886A1 - Voice and text conversion transmission method and system, and computer device and storage medium

Info

Publication number: WO2020237886A1
Application number: PCT/CN2019/103634
Authority: WO
Inventors: 齐燕
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-05-30
Filing date: 2019-08-30
Publication date: 2020-12-03
Also published as: CN110349581B; CN110349581A

Abstract

The present application provides a voice and text conversion transmission method and system, and a computer device and a storage medium. The method comprises: detecting whether a network transmission bandwidth belongs to an extremely low bandwidth or not; and if yes, starting a voice recognition system. A sending end identifies user voice information, converts the voice information into a target text with characteristic information, and sends the target text to a receiving end; and the receiving end receives the target text sent by the sending end, identifies the target text, converts the target text into the voice information, and plays the voice information.

Description

Voice and text conversion and transmission method, system, computer equipment and storage medium

This application claims priority to Chinese patent application No. 201910465416.3, filed with the Chinese Patent Office on May 30, 2019 and entitled "Speech and text conversion transmission methods, systems, computer equipment and storage media", the entire content of which is incorporated in this application by reference.

Technical field

This application relates to the field of communication technology, and in particular to a voice and text conversion transmission method, system, computer equipment and storage medium.

Background technique

At present, audio and video conferencing systems usually cope with poor network transmission and low bandwidth by reducing the bit rate of the video and audio. This approach does not work in scenarios with very low bandwidth, because the lowest bit rate of the audio and video encoding is still higher than the available bandwidth. In that case, audio information either cannot be transmitted or suffers packet loss, so the audio and video may be interrupted and the purpose of information transmission is not achieved. There is therefore an urgent need for a method that allows normal communication under extremely low bandwidth.

Technical problem

The main purpose of this application is to provide a voice and text conversion transmission method, system, computer equipment, and storage medium, aiming to solve the problem that audio conferences cannot be conducted under extremely low bandwidth.

Technical solutions

In order to achieve the above objective, this application provides a voice and text conversion transmission method, which includes the steps:

The sending end detects whether the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth, and detects whether a signal has been received indicating that the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth;

If the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth and/or a signal is received indicating that the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth, the speech-to-text system is activated and a signal to communicate through the speech-to-text system is sent to the receiving end;

The voice information spoken by the user is recognized through the voice-to-text system, and converted into target text, and the target text is sent to the receiving end, where the target text includes a feature code and a text field.

Further, the step of the sending end detecting whether the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth includes:

Monitor the current network speed of the sending end in real time, and compare the current network speed with the preset network speed;

If the current network speed is greater than 10% of the preset network speed, determining that the first current network transmission bandwidth of the sending end does not belong to extremely low bandwidth;

If the current network speed is less than or equal to 10% of the preset network speed, it is determined that the first current network transmission bandwidth of the sending end belongs to an extremely low bandwidth.

Further, the step of recognizing the voice information spoken by the user and converting it into target text includes:

Recognizing the voice information of the user, including semantic recognition and voiceprint recognition;

The voice information is converted into a text field, and the audio information features in the voice information are extracted to generate a feature code; the audio information features include a voiceprint spectrum and a PCM code stream, and the feature code is a string of symbols generated according to the voiceprint;

The feature code is added to the text field in a preset manner to obtain the target text.

Further, after the step of extracting audio information features in the voice information and generating a feature code, the method further includes:

Input the extracted audio information features into a preset voice model, and name the voice model with the generated feature code; the feature code serves as a unique identification identifier for calling the voice model;

Sending the voice model to the receiving end.

This application also proposes a voice and text conversion transmission method, including the steps:

The receiving end detects whether the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth, and detects whether a signal has been received indicating that the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth;

If the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth and/or a signal is received indicating that the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth, the text-to-speech system is activated and a signal to communicate through the text-to-speech system is sent to the sending end;

Receive the target text from the sender, recognize the target text, convert the target text into voice information, and play it.

Further, the step of the receiving end detecting whether the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth includes:

Monitor the current network speed of the receiving end in real time, and compare the current network speed with the preset network speed;

If the current network speed is greater than 10% of the preset network speed, determining that the second current network transmission bandwidth of the receiving end does not belong to extremely low bandwidth;

If the current network speed is less than or equal to 10% of the preset network speed, it is determined that the second current network transmission bandwidth of the receiving end belongs to an extremely low bandwidth.

Further, the step of receiving the target text from the sending end, recognizing the target text, and converting the target text into voice information, further includes:

Extract text fields based on the feature information attached to the target text;

Converting the text in the text field into pronunciation syllables to obtain the spectrum information and PCM code stream corresponding to the syllables;

Find the corresponding user's voice model in the local voice database according to the feature information attached to the target text;

The spectrum information and PCM code stream obtained by text conversion are replaced with the spectrum information and PCM code stream in the voice model of the corresponding user, to obtain the spectrum information and PCM code stream corresponding to both the user and the text field.

This application also proposes a voice and text conversion transmission system, including: a sending end and a receiving end;

The sending end is used to detect whether the first current network transmission bandwidth of the sending end belongs to an extremely low bandwidth, and to detect whether a signal whose second current network transmission bandwidth of the receiving end belongs to an extremely low bandwidth is received;

Recognizing the voice information spoken by the user through the voice-to-text system, converting it into target text, and sending the target text to the receiving end;

The receiving end is used to detect whether the second current network transmission bandwidth of the receiving end belongs to an extremely low bandwidth, and to detect whether a signal whose first current network transmission bandwidth of the sending end belongs to an extremely low bandwidth is received;

This application also proposes a computer device including a memory and a processor, the memory stores computer-readable instructions, and the processor implements the steps of any one of the above methods when the computer-readable instructions are executed by the processor.

This application also proposes a computer-readable storage medium on which computer-readable instructions are stored, and when the computer-readable instructions are executed by a processor, the steps of any one of the above methods are implemented.

Beneficial effect

The voice and text conversion transmission system, method, computer equipment, and storage medium provided in this application detect whether the network transmission bandwidth is extremely low. If it is, the voice recognition system is activated. The sending end recognizes the user's voice information, converts the voice information into target text with feature information, and sends the target text to the receiving end. The receiving end receives the target text sent by the sending end, recognizes the target text, converts it into voice information, and plays it. In this application, the system automatically detects the network bandwidth and adaptively switches the transmission mode, so it can still interact smoothly with the remote end when the network is not ideal. This solves the problem of voice transmission under extremely low bandwidth and achieves the purpose of information interaction. In addition, when text is converted into speech, the self-built speech model is used for the conversion, which improves fidelity.

Description of the drawings

Figure 1 is a schematic diagram of the steps of a voice and text conversion transmission method in an embodiment of the present application;

Figure 2 is a schematic diagram of the steps of another voice and text conversion transmission method in an embodiment of the present application;

Figure 3 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.

The best mode of the invention

In order to make the purpose, technical solutions, and advantages of this application clearer, the following further describes this application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application, and not used to limit the application.

Referring to Figure 1, this application proposes a voice and text conversion transmission method, including the steps:

S1. The sending end detects whether the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth, and detects whether a signal has been received indicating that the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth;

S2. If the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth and/or a signal is received indicating that the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth, start the speech-to-text system and send the receiving end a signal to communicate through the speech-to-text system;

S3. Recognize the voice information spoken by the user through the voice-to-text system, convert it into target text, and send the target text to the receiving end, where the target text includes a feature code and a text field.

As mentioned in step S1 above, network transmission is affected by the configuration of the user's computer software and hardware, the address of the website being browsed, the peer website, the bandwidth of the peer server and so on, so the actual rate when the user is online is usually lower than the theoretical rate. The network transmission bandwidth here refers to the data transmission capacity achieved in actual signal transmission; extremely low bandwidth means a rate at or below 10% of the theoretical value of the normal communication bandwidth. For example, for a 4 Mbit/s connection the theoretical transfer speed is 512 KB/s while the actual speed is about 400 KB/s, and extremely low bandwidth means that the speed is below about 52 KB/s. When the network transmission bandwidth is extremely low, data transmission is unstable and the packet loss rate increases, so a large amount of data cannot be transmitted normally.
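The arithmetic behind this example can be checked with a short sketch. The function name and the 1 MB = 1024 KB convention are assumptions chosen to reproduce the figures in the text; they are not part of the application.

```python
# Illustrative arithmetic only. The helper name and the 1024 KB-per-MB
# convention are assumptions made to match the 512 KB/s figure in the text.

def link_rate_to_kb_per_s(mbit_per_s: float) -> float:
    """Convert a nominal link rate in Mbit/s to a transfer speed in KB/s."""
    return mbit_per_s / 8 * 1024  # 8 bits per byte, 1024 KB per MB


theoretical_speed = link_rate_to_kb_per_s(4)      # theoretical speed of a 4 Mbit/s link
extremely_low_limit = theoretical_speed * 0.10    # 10% of the theoretical speed
```

Under these assumptions the limit works out to roughly 51 KB/s, which the text rounds to 52 KB/s.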

As described in step S2 above, once the current network is determined to be in an extremely low bandwidth state, the speech-to-text system is activated. Because the network speed is limited in this state, packet loss is very likely to occur in video and audio transmission; the role of the speech-to-text system is to ensure that the information used for communication can still be transmitted normally. The client of the speech-to-text system therefore needs to be started on the sending end. Sending the receiving end a signal to communicate through the speech-to-text system serves to prompt or control the receiving end to start the text-to-speech system client installed on it.
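The "and/or" condition of step S2 amounts to a logical OR of two observations. A minimal sketch follows; the `LinkState` type and the function name are hypothetical and only illustrate the decision.

```python
from dataclasses import dataclass


@dataclass
class LinkState:
    """Hypothetical container for the two observations used in step S2."""
    local_is_extremely_low: bool       # result of the local bandwidth check
    peer_reported_extremely_low: bool  # True if the peer's low-bandwidth signal arrived


def should_switch_to_text_mode(state: LinkState) -> bool:
    """Switch when either end (or both) is in an extremely low bandwidth state."""
    return state.local_is_extremely_low or state.peer_reported_extremely_low
```

In this sketch, a sending end that returns True would start its speech-to-text client and then notify the receiving end.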

As mentioned in step S3, the sending end is the terminal that sends out the target text; it may be a PC, a notebook computer, a tablet computer or another intelligent terminal device that can connect to the network. In this embodiment, bandwidth is divided into uplink bandwidth and downlink bandwidth. In theory the two do not affect each other, but IP transmission requires two-way interaction, so in practice there is some mutual impact. Because extremely low bandwidth hinders data transmission, the sending end can limit its downlink bandwidth to a minimum before sending the target text and restore it once sending is complete, which improves the efficiency of data transmission. Correspondingly, the receiving end receives the target text. Matching clients are installed on the sending end and the receiving end: the receiving end recognizes the target text through the client of the text-to-speech system, converts it into voice information and plays it.

In one embodiment, the step of the sending end detecting whether the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth includes:

S11. Monitor the current network speed of the sending end in real time, and compare the current network speed with a preset network speed;

S12. If the current network speed is greater than 10% of the preset network speed, determine that the first current network transmission bandwidth of the sending end does not belong to the extremely low bandwidth;

S13. If the current network speed is less than or equal to 10% of the preset network speed, it is determined that the first current network transmission bandwidth of the sending end belongs to an extremely low bandwidth.

In steps S11 to S13, put simply, network transmission bandwidth is measured in bits and network speed in bytes, where 1 byte = 8 bits. The network transmission bandwidth is therefore directly proportional to the network speed, and because detecting the network speed is much more convenient than detecting the network transmission bandwidth, this embodiment detects the network speed in order to determine the network transmission bandwidth. The preset network speed is the theoretical network speed of the connection in normal communication. The ratio of the current network speed to the preset network speed shows whether the network transmission bandwidth belongs to extremely low bandwidth.
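Steps S12 and S13 can be sketched as a single comparison. The function and constant names below are illustrative assumptions; only the 10% threshold and the "less than or equal to" boundary come from the text.

```python
EXTREMELY_LOW_RATIO = 0.10  # the 10% threshold stated in steps S12 and S13


def is_extremely_low_bandwidth(current_speed: float, preset_speed: float) -> bool:
    """Return True when the current speed is at or below 10% of the preset speed.

    Both speeds must use the same unit (for example KB/s).
    """
    if preset_speed <= 0:
        raise ValueError("preset_speed must be positive")
    return current_speed <= EXTREMELY_LOW_RATIO * preset_speed
```

For instance, with a preset speed of 512 KB/s, a measured speed of 400 KB/s would not count as extremely low, while 40 KB/s would.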

In one embodiment, the step S3 of recognizing the voice information spoken by the user through the voice-to-text system and converting it into target text includes:

S31. Recognize the voice information of the user, including semantic recognition and voiceprint recognition;

S32. Convert the voice information into a text field, and extract audio information features in the voice information to generate a feature code; the audio information features include a voiceprint spectrum and a PCM code stream, and the feature code is a string of symbols generated according to the voiceprint;

S33. Add a feature code to the text field in a preset manner to obtain the target text.

In step S31, the aforementioned voice information refers to the words spoken by the user, and the aforementioned text field refers to the text generated by recognizing the words spoken by the same user in a continuous period of time. The purpose of this step is to recognize what the user said and convert the content of the recognized user’s words into a paragraph of text.

In steps S32 to S33, the audio information features are the user's voiceprint spectrum and the PCM code stream in the recording file generated to capture what the user said. The feature code is a character string generated from the user's voiceprint features. Because a user's voiceprint features are unique, the generated character string is also unique and can serve as identification information for retrieving the corresponding speaker's voice model without error. To mark the feature code, special markers are added at the start and end of the character string (for example, ##feature code##text field). When the text field is recognized, the feature code is extracted automatically and does not affect recognition of the text field. Multiple target texts can also be packaged and compressed together, which makes them convenient to send and further reduces their size. Packaging and compressing multiple target texts at one time can also help prevent data loss during transmission.
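One way to picture the target text format and the packaging step is sketched below. The `##` delimiter follows the example in the text; the function names, the rule that a feature code may not contain the delimiter, and the use of JSON plus zlib for packaging are all assumptions, since the application does not specify them.

```python
import json
import zlib

DELIM = "##"  # delimiter taken from the "##feature code##text field" example


def build_target_text(feature_code: str, text_field: str) -> str:
    """Prefix the text field with the delimited feature code."""
    if DELIM in feature_code:
        raise ValueError("feature code must not contain the delimiter")
    return f"{DELIM}{feature_code}{DELIM}{text_field}"


def parse_target_text(target_text: str) -> tuple[str, str]:
    """Split a target text back into (feature_code, text_field)."""
    if not target_text.startswith(DELIM):
        raise ValueError("missing leading delimiter")
    feature_code, sep, text_field = target_text[len(DELIM):].partition(DELIM)
    if not sep:
        raise ValueError("missing closing delimiter")
    return feature_code, text_field


def pack_target_texts(target_texts: list[str]) -> bytes:
    """Package several target texts together and compress them (assumed scheme)."""
    return zlib.compress(json.dumps(target_texts, ensure_ascii=False).encode("utf-8"))


def unpack_target_texts(payload: bytes) -> list[str]:
    """Reverse of pack_target_texts."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))
```

Compression pays off mainly when several target texts are sent together; for a single short sentence the compressed payload may not be smaller than the original.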

In an embodiment, after the step S32 of extracting audio information features in the voice information and generating a feature code, the method further includes:

S3201. Input the extracted audio information features into a preset voice model, and name the voice model with the generated feature code; the feature code serves as a unique identification identifier for calling the voice model;

S3202. Send the voice model to the receiving end.

In steps S3201 to S3202, the extracted audio information features are input into the preset voice model as follows. Since the pronunciation of every word is composed of syllables, the preset voice model is designed to record the audio information features of all syllables spoken by the same user. These features are extracted from the user's recording file and input into the preset voice model, so that the resulting voice model holds the features of every syllable the user pronounces. The voice model is sent to the receiving end in step S3202. With the user's voice model available at the receiving end, the frequency characteristics of each syllable's pronunciation can be synthesized from the syllable features and converted into a PCM signal (through the inverse Fourier transform), producing a personalized voice with the user's voice characteristics that simulates the user's speech.
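The relationship between a voice model and its feature code can be sketched with a small data structure. The class names, the per-syllable list of floats and the in-memory store are hypothetical; the application only states that the model holds per-syllable features and is named with the feature code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VoiceModel:
    """Hypothetical per-user model: syllable features named by a feature code."""
    feature_code: str  # name of the model and the key used to call it
    syllables: dict[str, list[float]] = field(default_factory=dict)

    def add_syllable(self, syllable: str, spectrum: list[float]) -> None:
        """Record the spectral features extracted for one syllable."""
        self.syllables[syllable] = list(spectrum)


class VoiceDatabase:
    """Receiver-side store of voice models, keyed by feature code."""

    def __init__(self) -> None:
        self._models: dict[str, VoiceModel] = {}

    def store(self, model: VoiceModel) -> None:
        self._models[model.feature_code] = model

    def find(self, feature_code: str) -> Optional[VoiceModel]:
        """Return the model for this feature code, or None if unknown."""
        return self._models.get(feature_code)
```

Keying the store by feature code means the receiving end needs nothing beyond the code carried in the target text to select the right speaker.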

Referring to Figure 2, this application also proposes a voice and text conversion transmission method, including the steps:

S10. The receiving end detects whether the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth, and detects whether a signal has been received indicating that the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth;

S20. If the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth and/or a signal is received indicating that the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth, start the text-to-speech system and send the sending end a signal to communicate through the text-to-speech system;

S30. Receive the target text sent by the sender, recognize the target text, convert the target text into voice information, and play it.

As mentioned in step S10 above, network transmission is affected by the configuration of the user's computer software and hardware, the address of the website being browsed, the peer website, the bandwidth of the peer server and so on, so the actual rate when the user is online is usually lower than the theoretical rate. The network transmission bandwidth here refers to the data transmission capacity achieved in actual signal transmission; extremely low bandwidth means a rate at or below 10% of the theoretical value of the normal communication bandwidth. For example, for a 4 Mbit/s connection the theoretical transfer speed is 512 KB/s while the actual speed is about 400 KB/s, and extremely low bandwidth means that the speed is below about 52 KB/s. When the network transmission bandwidth is extremely low, data transmission is unstable and the packet loss rate increases, so a large amount of data cannot be transmitted normally.

As described in step S20 above, once the current network is determined to be in an extremely low bandwidth state, the text-to-speech system is activated. Because the network speed is limited in this state, packet loss may occur in video and audio transmission; the role of the text-to-speech system is to ensure that the information used for communication can still be transmitted normally. The client of the text-to-speech system therefore needs to be started on the receiving end. Sending the sending end a signal to communicate through the text-to-speech system serves to prompt or control the sending end to start the speech-to-text system client installed on it.

As mentioned in step S30 above, the sending end is the terminal that sends out the target text; it may be a PC, a notebook computer, a tablet computer or another intelligent terminal device that can connect to the network. In theory the uplink bandwidth and the downlink bandwidth do not affect each other, but IP transmission requires two-way interaction, so in practice there is some mutual impact. Because extremely low bandwidth hinders data transmission, the receiving end can limit its uplink bandwidth to a minimum while receiving the target text and restore it once reception is complete, which improves the efficiency of data transmission. Correspondingly, the sending end sends the target text. Matching clients are installed on the sending end and the receiving end: the sending end recognizes the voice information spoken by the user through the speech-to-text system, converts it into target text, and sends the target text to the receiving end.

In an embodiment, the step S10 of the receiving end detecting whether the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth includes:

S101: Monitor the current network speed of the receiving end in real time, and compare the current network speed with a preset network speed;

S102: If the current network speed is greater than 10% of the preset network speed, determine that the second current network transmission bandwidth of the receiving end does not belong to extremely low bandwidth;

S103: If the current network speed is less than or equal to 10% of the preset network speed, determine that the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth.

In steps S101 to S103, put simply, network transmission bandwidth is measured in bits and network speed in bytes, where 1 byte = 8 bits. The network transmission bandwidth is therefore directly proportional to the network speed, and because detecting the network speed is much more convenient than detecting the network transmission bandwidth, this embodiment detects the network speed in order to determine the network transmission bandwidth. The preset network speed is the theoretical network speed of the connection in normal communication. The ratio of the current network speed to the preset network speed shows whether the network transmission bandwidth belongs to extremely low bandwidth.

In one embodiment, the step S30 of receiving the target text from the sending end, recognizing the target text, and converting the target text into voice information further includes:

S301: Extract text fields according to feature information attached to the target text;

S302: Convert the text in the text field into pronunciation syllables, and obtain spectrum information and PCM code streams corresponding to the syllables;

S303: Search for a voice model corresponding to the user in the local voice database according to the feature information attached to the target text;

S304: Replace the spectrum information and PCM code stream obtained by text conversion with the spectrum information and PCM code stream in the voice model of the corresponding user, to obtain the spectrum information and PCM code stream corresponding to both the user and the text field.

In step S301, the target text has been converted by the sending end from the words spoken by a user. When the target text contains words spoken by multiple users, the feature information divides the target text into multiple segments, each carrying the feature information of the corresponding user. In other words, the target text is composed of multiple text fields, and the feature information attached to each text field indicates which specific user's speech it was converted from. For example, if analysis of the feature information shows that the target text contains feature A, feature B, feature A and feature C in that order, then the target text was converted from a passage spoken by user A, a passage spoken by user B, another passage spoken by user A and a passage spoken by user C.
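The A, B, A, C example can be sketched as splitting one target text into per-speaker segments. This assumes the segments are simply concatenated in the `##feature code##text field` form and that neither feature codes nor text fields contain `#` pairs; the application does not fix these details, and the function name is hypothetical.

```python
import re

# One segment: "##<feature code>##<text up to the next '##' or end of string>"
_SEGMENT = re.compile(r"##([^#]+)##(.*?)(?=##|\Z)", re.DOTALL)


def split_by_speaker(target_text: str) -> list[tuple[str, str]]:
    """Return (feature_code, text_field) pairs in the order they appear."""
    return _SEGMENT.findall(target_text)


segments = split_by_speaker("##A##hello##B##hi there##A##how are you##C##fine")
speakers = [code for code, _ in segments]
```

Here `speakers` lists the feature codes in speaking order, so the same user appearing twice (A in this example) yields two separate segments that map to the same voice model.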

In step S302, the text in the text field is converted into pronunciation syllables to obtain audio information, and the audio information includes the frequency spectrum information and the PCM code stream corresponding to the syllables.

In step S303, in addition to extracting text fields, the feature information attached to the target text is also used to find a voice model. The process is to compare the feature information attached to the target text with the user features contained in the voice model in the voice database. If the matching is successful, it indicates that the text field is the words spoken by the user corresponding to the voice model.

In step S304, the adjustment of the spectrum information and the PCM code stream means substituting the characteristic spectrum segments and PCM code stream from the user's voice model for the corresponding parts of the spectrum information and PCM code stream obtained by text conversion, that is, replacing them syllable by syllable. This yields audio information close to what the real user said, so the sound heard when the audio information is played is close to the user's original words.
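The syllable-by-syllable replacement can be sketched as a lookup that prefers the user's model. Representing a spectrum as a list of floats, and falling back to the generic spectrum when the user's model lacks a syllable, are assumptions added for illustration; the application does not describe a fallback.

```python
Spectrum = list[float]


def personalize_syllables(
    syllables: list[str],
    generic: dict[str, Spectrum],
    user_model: dict[str, Spectrum],
) -> list[Spectrum]:
    """Replace each generic syllable spectrum with the user's own where available."""
    result = []
    for syllable in syllables:
        if syllable in user_model:
            result.append(user_model[syllable])  # user's characteristic spectrum
        else:
            result.append(generic[syllable])     # assumed fallback, not in the source
    return result
```

The returned sequence would then be converted back to a PCM signal for playback.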

The generation of the feature information can be summarized as follows: the speaker's audio information features, such as the PCM code stream of the audio signal and the spectral characteristics of the sound, are extracted, and statistics on this information are accumulated over a long period. The spectral characteristics are obtained by transforming the PCM signal of the speech into the frequency domain with a Fourier transform, where the value at each frequency point represents the magnitude of that frequency component. Sound is composed of many sine waves of different frequencies, and the frequency characteristic refers to the size of the sine wave at each frequency. The PCM code stream is obtained by sampling an analog signal such as voice at regular intervals to discretize it, rounding each sampled value to the nearest quantization level, and representing the amplitude of each sampled pulse with a set of binary codes. The user's voice characteristics can be extracted from the frequency characteristics, for example by taking the energy corresponding to each frequency, or the average value and variance of the energy over all frequency points. The user's voice PCM signal is divided into small syllables, such as a, u, e, i, u, yu, etc.; the characteristics of these syllables are extracted and transmitted to the receiving end, where a corresponding model is established. The receiving end can then combine the received text with the syllable characteristics of the model to synthesize the frequency characteristics of the corresponding syllable pronunciation, and convert these frequency points into a PCM signal (through the inverse Fourier transform) to synthesize a personalized voice with the user's voice characteristics.
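The signal-processing steps named above (quantizing samples to PCM levels, transforming to the frequency domain, taking energy statistics, and transforming back) can be sketched with the standard library alone. This is a naive discrete Fourier transform for illustration; the function names, the 16-bit depth and the choice of features are assumptions, and a real system would likely use an FFT library.

```python
import cmath
import math


def quantize_pcm(value: float, bits: int = 16) -> int:
    """Round a sample in [-1.0, 1.0] to a signed integer PCM level."""
    max_level = 2 ** (bits - 1) - 1
    clipped = max(-1.0, min(1.0, value))
    return round(clipped * max_level)


def dft(samples: list[float]) -> list[complex]:
    """Naive discrete Fourier transform (O(n^2)), for illustration only."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum: list[complex]) -> list[float]:
    """Inverse transform back to real-valued time-domain samples."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def energy_features(spectrum: list[complex]) -> tuple[list[float], float, float]:
    """Energy per frequency point, plus mean and variance over all points."""
    energies = [abs(c) ** 2 for c in spectrum]
    mean = sum(energies) / len(energies)
    variance = sum((e - mean) ** 2 for e in energies) / len(energies)
    return energies, mean, variance


# One cycle of a sine wave sampled at 8 points, used as a toy "syllable".
samples = [math.sin(2 * math.pi * t / 8) for t in range(8)]
spectrum = dft(samples)
energies, mean_energy, variance_energy = energy_features(spectrum)
restored = inverse_dft(spectrum)
```

For this toy input the energy concentrates in the frequency points that correspond to the single sine wave, and `restored` matches `samples` up to floating-point error, which is the round trip the receiving end relies on.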

The voice and text conversion transmission method, system, computer equipment and storage medium proposed in this application detect whether the network transmission bandwidth is extremely low. If it is, the voice recognition system is activated. The sending end recognizes the user's voice information, converts the voice information into target text with feature information, and sends the target text to the receiving end. The receiving end receives the target text sent by the sending end, recognizes the target text, converts it into voice information, and plays it. In this application, the system automatically detects the network bandwidth and adaptively switches the transmission mode, so it can still interact smoothly with the remote end when the network is not ideal. This solves the problem of voice transmission under extremely low bandwidth and achieves the purpose of information interaction. In addition, when text is converted into speech, the self-built speech model is used for the conversion, which improves fidelity.

An embodiment of the present application also proposes a voice and text conversion transmission system, including: a sending end and a receiving end;

The sending end is used to: detect whether the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth, and detect whether a signal indicating that the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth is received; if the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth and/or the signal indicating that the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth is received, activate the voice-to-text system and send to the receiving end a signal for communicating through the voice-to-text system; and recognize the voice information spoken by the user through the voice-to-text system, convert it into target text, and send the target text to the receiving end;

The receiving end is used to: detect whether the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth, and detect whether a signal indicating that the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth is received; if the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth and/or the signal indicating that the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth is received, activate the text-to-speech system and send to the sending end a signal for communicating through the text-to-speech system; and receive the target text sent by the sending end, recognize the target text, convert the target text into voice information, and play it.
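As an illustrative sketch only, the activation condition shared by the sending end and the receiving end (the local bandwidth is extremely low, and/or a signal reporting extremely low bandwidth at the other end has been received) might be modelled as follows. The class name `Endpoint`, the signal name `USE_TEXT_MODE`, and the in-memory `outbox` list are hypothetical and are not taken from this application.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """One end of the call; the same rule applies to sending and receiving ends."""
    name: str
    local_bandwidth_extremely_low: bool = False
    peer_reported_extremely_low: bool = False
    text_mode: bool = False
    outbox: list = field(default_factory=list)

    def evaluate(self):
        """Activate text mode if either end is on extremely low bandwidth."""
        if self.local_bandwidth_extremely_low or self.peer_reported_extremely_low:
            if not self.text_mode:
                self.text_mode = True
                # Hypothetical signal asking the other end to communicate via text.
                self.outbox.append("USE_TEXT_MODE")
        return self.text_mode


sending_end = Endpoint("sending end", local_bandwidth_extremely_low=True)
receiving_end = Endpoint("receiving end")

sending_end.evaluate()                              # local condition triggers the switch
receiving_end.peer_reported_extremely_low = True    # the signal arrives at the other end
receiving_end.evaluate()
print(sending_end.text_mode, receiving_end.text_mode)
```

The signal is queued only on the first activation, so repeated evaluation while the bandwidth stays low does not resend it; this is a design choice of the sketch rather than a requirement stated in the text.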

Referring to FIG. 3, an embodiment of the present application also proposes a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for the operation of the operating system and the computer-readable instructions in the non-volatile storage medium. The database of the computer device is used to store data such as a guide plan library. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions are executed by the processor to realize a voice and text conversion transmission method.

The above-mentioned processor executes the steps of the above-mentioned method:

The voice information spoken by the user is recognized through the voice-to-text system, and converted into target text, and the target text is sent to the receiving end.

In another embodiment, the foregoing processor executes the steps of the foregoing method:

An embodiment of the present application also proposes a computer-readable storage medium on which computer-readable instructions are stored. When the computer-readable instructions are executed by a processor, a method for voice and text conversion and transmission is realized, including the steps:

In an embodiment, the step of recognizing the voice information spoken by the user and converting it into target text includes:

Identifying the voice information of the user;

Converting the voice information into text fields, and extracting audio information features in the voice information to generate a feature code;

In an embodiment, after the step of extracting audio information features in the voice information and generating a feature code, the method further includes:

Input the extracted audio information features into a preset voice model, and name the voice model with the generated feature code;

Sending the voice model to the receiving end.
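As an illustrative sketch only, naming a voice model with its feature code and preparing it for transmission to the receiving end might look as follows. The dictionary layout, the JSON encoding, and the function names are assumptions made for the example; this application does not specify a model format or a wire format.

```python
import json


def build_voice_model(feature_code, syllable_features):
    """Bundle per-syllable features into a model named by the feature code."""
    return {"name": feature_code, "syllables": dict(syllable_features)}


def serialize_voice_model(model):
    """Encode the model as UTF-8 JSON bytes (hypothetical wire format)."""
    return json.dumps(model, sort_keys=True).encode("utf-8")


def deserialize_voice_model(payload):
    """Decode a model on the receiving end."""
    return json.loads(payload.decode("utf-8"))


# Hypothetical feature code and toy per-syllable energy values.
model = build_voice_model("a1b2c3d4", {"a": [0.9, 0.1], "e": [0.2, 0.8]})
payload = serialize_voice_model(model)

# The receiving end stores the model under its name so it can be called later.
voice_database = {}
received = deserialize_voice_model(payload)
voice_database[received["name"]] = received
print(sorted(voice_database))
```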

Another embodiment of the present application also provides a computer-readable storage medium on which computer-readable instructions are stored. When the computer-readable instructions are executed by a processor, a method for converting and transmitting voice to text is realized, including the steps:

In one embodiment, the step of the receiving end detecting whether the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth includes:

In an embodiment, the step of receiving the target text from the sending end, recognizing the target text, and converting the target text into voice information further includes:

The above are only the preferred embodiments of this application and do not limit the scope of this application. Any equivalent structure or equivalent process transformation made using the content of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included in the scope of patent protection of this application.

Claims

A voice and text conversion transmission method, characterized in that it comprises the steps:

The sending end detects whether the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth, and detects whether a signal indicating that the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth is received;

If the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth and/or the signal indicating that the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth is received, the voice-to-text system is activated, and a signal for communicating through the voice-to-text system is sent to the receiving end;

The voice information spoken by the user is recognized through the voice-to-text system, and converted into target text, and the target text is sent to the receiving end, where the target text includes a feature code and a text field.
The voice and text conversion transmission method according to claim 1, wherein the step of the sending end detecting whether the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth comprises:

Monitor the current network speed of the sending end in real time, and compare the current network speed with the preset network speed;

If the current network speed is greater than 10% of the preset network speed, determining that the first current network transmission bandwidth of the sending end does not belong to extremely low bandwidth;

If the current network speed is less than or equal to 10% of the preset network speed, it is determined that the first current network transmission bandwidth of the sending end belongs to an extremely low bandwidth.
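As an illustrative sketch only, the comparison against 10% of the preset network speed might be written as follows. The function name and the choice of units are assumptions; how the current network speed is measured is not shown.

```python
def is_extremely_low_bandwidth(current_speed, preset_speed):
    """Return True when current_speed is at most 10% of preset_speed.

    Both speeds must use the same unit (for example kbit/s). The comparison is
    written as current * 10 <= preset to avoid floating-point rounding at the
    boundary when integer speeds are supplied.
    """
    if preset_speed <= 0:
        raise ValueError("preset_speed must be positive")
    return current_speed * 10 <= preset_speed


# Hypothetical preset of 1000 kbit/s.
print(is_extremely_low_bandwidth(100, 1000))  # exactly 10%: extremely low
print(is_extremely_low_bandwidth(101, 1000))  # just above 10%: not extremely low
```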
The voice and text conversion transmission method according to claim 1, wherein the step of recognizing the voice information spoken by the user and converting it into target text comprises:

Recognizing the voice information of the user, including semantic recognition and voiceprint recognition;

The voice information is converted into text fields, and the audio information features in the voice information are extracted to generate a feature code; the audio information features include a voiceprint spectrum and a PCM code stream, and the feature code is a string of symbols generated according to the voiceprint;

The feature code is added to the text field in a preset manner to obtain the target text.
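As an illustrative sketch only, generating a feature code from voiceprint values and adding it to the text field might be done as follows. The use of a SHA-256 digest, the 8-character code length, and the `|` separator are assumptions standing in for the unspecified "preset manner"; none of them is taken from this application.

```python
import hashlib

SEPARATOR = "|"  # hypothetical delimiter between feature code and text field


def make_feature_code(voiceprint, length=8):
    """Derive a short, deterministic string of symbols from voiceprint values."""
    quantised = ",".join(str(round(value)) for value in voiceprint)
    return hashlib.sha256(quantised.encode("utf-8")).hexdigest()[:length]


def build_target_text(feature_code, text_field):
    """Prefix the text field with the feature code to obtain the target text."""
    return f"{feature_code}{SEPARATOR}{text_field}"


def split_target_text(target_text):
    """Recover (feature_code, text_field); only the first separator is significant."""
    feature_code, _, text_field = target_text.partition(SEPARATOR)
    return feature_code, text_field


# Hypothetical voiceprint summary values and recognized text.
code = make_feature_code([6400.2, 12.7, 0.4])
target_text = build_target_text(code, "hello | world")
print(split_target_text(target_text) == (code, "hello | world"))
```

Because the feature code in this sketch consists only of hexadecimal characters, it can never contain the separator, so the text field may itself contain `|` without ambiguity.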
The voice and text conversion transmission method according to claim 3, characterized in that, after the step of extracting the audio information feature in the voice information and generating a feature code, it further comprises:

Input the extracted audio information features into a preset voice model, and name the voice model with the generated feature code; the feature code serves as a unique identification identifier for calling the voice model;

Sending the voice model to the receiving end.
A voice and text conversion transmission method, characterized in that it comprises the steps:

The receiving end detects whether the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth, and detects whether a signal indicating that the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth is received;

If the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth and/or the signal indicating that the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth is received, the text-to-speech system is activated, and a signal for communicating through the text-to-speech system is sent to the sending end;

Receive the target text from the sender, recognize the target text, convert the target text into voice information, and play it.
The voice and text conversion transmission method according to claim 5, wherein the step of the receiving end detecting whether the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth comprises:

Monitor the current network speed of the receiving end in real time, and compare the current network speed with the preset network speed;

If the current network speed is greater than 10% of the preset network speed, determining that the second current network transmission bandwidth of the receiving end does not belong to extremely low bandwidth;

If the current network speed is less than or equal to 10% of the preset network speed, it is determined that the second current network transmission bandwidth of the receiving end belongs to an extremely low bandwidth.
The voice and text conversion transmission method according to claim 5, wherein the step of receiving the target text sent by the sending end, recognizing the target text, and converting the target text into voice information further comprises:

Extract text fields based on the feature information attached to the target text;

Converting the text in the text field into pronunciation syllables to obtain the spectrum information and PCM code stream corresponding to the syllables;

Find the corresponding user's voice model in the local voice database according to the feature information attached to the target text;

The spectrum information and PCM code stream obtained by text conversion are exchanged with the spectrum information and PCM code stream in the voice model of the corresponding user to obtain the spectrum information and PCM code stream corresponding to the user and the text field.
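As an illustrative sketch only, the receiving-end steps (extracting the text field, converting it into syllables, finding the user's voice model by the attached feature information, and exchanging generic syllable spectra for the user's own) might be arranged as follows. The `|` separator, the whitespace-based syllable splitter, and the toy spectra are hypothetical; a real system would need a proper grapheme-to-syllable converter, and the final inverse Fourier transform to a PCM code stream is omitted.

```python
# Toy generic spectra for two syllables; placeholder values, not real speech data.
GENERIC_SYLLABLES = {"ni": [1.0, 0.0], "hao": [0.0, 1.0]}


def text_to_syllables(text_field):
    """Hypothetical converter: assumes the text is already space-separated syllables."""
    return text_field.split()


def personalised_spectra(target_text, voice_database, separator="|"):
    """Return one spectrum per syllable, preferring the identified user's model."""
    feature_code, _, text_field = target_text.partition(separator)
    # Look up the user's model by feature code; fall back to generic spectra only.
    user_model = voice_database.get(feature_code, {})
    spectra = []
    for syllable in text_to_syllables(text_field):
        generic = GENERIC_SYLLABLES[syllable]
        spectra.append(user_model.get(syllable, generic))
    return spectra


# Hypothetical local voice database keyed by feature code.
voice_database = {"a1b2c3d4": {"ni": [0.9, 0.1]}}
print(personalised_spectra("a1b2c3d4|ni hao", voice_database))
```

In this sketch a syllable missing from the user's model keeps its generic spectrum, and an unknown feature code yields generic output throughout; whether the application intends such a fallback is not stated.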
A voice and text conversion transmission system, which is characterized in that it comprises: a sending end and a receiving end;

The sending end is used to detect whether the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth, and to detect whether a signal indicating that the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth is received;

If the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth and/or the signal indicating that the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth is received, the voice-to-text system is activated, and a signal for communicating through the voice-to-text system is sent to the receiving end;

Recognizing the voice information spoken by the user through the voice-to-text system, converting it into target text, and sending the target text to the receiving end;

The receiving end is used to detect whether the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth, and to detect whether a signal indicating that the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth is received;

If the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth and/or the signal indicating that the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth is received, the text-to-speech system is activated, and a signal for communicating through the text-to-speech system is sent to the sending end;

Receive the target text from the sender, recognize the target text, convert the target text into voice information, and play it.
The voice and text conversion transmission system according to claim 8, wherein the sending end is further used for:

Monitor the current network speed of the sending end in real time, and compare the current network speed with the preset network speed;

If the current network speed is greater than 10% of the preset network speed, determining that the first current network transmission bandwidth of the sending end does not belong to extremely low bandwidth;

If the current network speed is less than or equal to 10% of the preset network speed, it is determined that the first current network transmission bandwidth of the sending end belongs to an extremely low bandwidth.
The voice and text conversion transmission system according to claim 8, wherein the sending end is further used for:

Recognizing the voice information of the user, including semantic recognition and voiceprint recognition;

The voice information is converted into text fields, and the audio information features in the voice information are extracted to generate a feature code; the audio information features include a voiceprint spectrum and a PCM code stream, and the feature code is a string of symbols generated according to the voiceprint;

The feature code is added to the text field in a preset manner to obtain the target text.
The voice and text conversion transmission system according to claim 8, wherein the sending end is further used for:

Input the extracted audio information features into a preset voice model, and name the voice model with the generated feature code; the feature code serves as a unique identification identifier for calling the voice model;

Sending the voice model to the receiving end.
The voice and text conversion transmission system according to claim 8, wherein the receiving end is further used for:

Extract text fields based on the feature information attached to the target text;

Converting the text in the text field into pronunciation syllables to obtain the spectrum information and PCM code stream corresponding to the syllables;

Find the corresponding user's voice model in the local voice database according to the feature information attached to the target text;

The spectrum information and PCM code stream obtained by text conversion are exchanged with the spectrum information and PCM code stream in the voice model of the corresponding user to obtain the spectrum information and PCM code stream corresponding to the user and the text field.
A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, characterized in that, when the processor executes the computer-readable instructions, the following steps of a voice and text conversion transmission method are realized:

The sending end detects whether the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth, and detects whether a signal indicating that the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth is received;

If the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth and/or the signal indicating that the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth is received, the voice-to-text system is activated, and a signal for communicating through the voice-to-text system is sent to the receiving end;

The voice information spoken by the user is recognized through the voice-to-text system, and converted into target text, and the target text is sent to the receiving end, where the target text includes a feature code and a text field.
The computer device according to claim 13, wherein the step of the sending end detecting whether the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth comprises:

Monitor the current network speed of the sending end in real time, and compare the current network speed with the preset network speed;

If the current network speed is greater than 10% of the preset network speed, determining that the first current network transmission bandwidth of the sending end does not belong to extremely low bandwidth;

If the current network speed is less than or equal to 10% of the preset network speed, it is determined that the first current network transmission bandwidth of the sending end belongs to an extremely low bandwidth.
The computer device according to claim 13, wherein the step of recognizing the voice information spoken by the user and converting it into target text comprises:

Recognizing the voice information of the user, including semantic recognition and voiceprint recognition;

The voice information is converted into text fields, and the audio information features in the voice information are extracted to generate a feature code; the audio information features include a voiceprint spectrum and a PCM code stream, and the feature code is a string of symbols generated according to the voiceprint;

The feature code is added to the text field in a preset manner to obtain the target text.
The computer device according to claim 15, characterized in that, after the step of extracting the audio information feature in the voice information and generating the feature code, it further comprises:

Input the extracted audio information features into a preset voice model, and name the voice model with the generated feature code; the feature code serves as a unique identification identifier for calling the voice model;

Sending the voice model to the receiving end.
A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, characterized in that, when the processor executes the computer-readable instructions, the following steps of a voice and text conversion transmission method are realized:

The receiving end detects whether the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth, and detects whether a signal indicating that the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth is received;

If the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth and/or the signal indicating that the first current network transmission bandwidth of the sending end belongs to extremely low bandwidth is received, the text-to-speech system is activated, and a signal for communicating through the text-to-speech system is sent to the sending end;

Receive the target text from the sender, recognize the target text, convert the target text into voice information, and play it.
18. The computer device according to claim 17, wherein the step of the receiving end detecting whether the second current network transmission bandwidth of the receiving end belongs to extremely low bandwidth comprises:

Monitor the current network speed of the receiving end in real time, and compare the current network speed with the preset network speed;

If the current network speed is greater than 10% of the preset network speed, determining that the second current network transmission bandwidth of the receiving end does not belong to extremely low bandwidth;

If the current network speed is less than or equal to 10% of the preset network speed, it is determined that the second current network transmission bandwidth of the receiving end belongs to an extremely low bandwidth.
19. The computer device according to claim 17, wherein the step of receiving the target text sent by the sending end, recognizing the target text, and converting the target text into voice information further comprises:

Extract text fields according to the feature information attached to the target text;

Converting the text in the text field into pronunciation syllables to obtain the spectrum information and PCM code stream corresponding to the syllables;

Find the corresponding user's voice model in the local voice database according to the feature information attached to the target text;

The spectrum information and PCM code stream obtained by text conversion are exchanged with the spectrum information and PCM code stream in the voice model of the corresponding user to obtain the spectrum information and PCM code stream corresponding to the user and the text field.
A computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions implement the steps of the method according to any one of claims 1 to 7 when executed by a processor.