CN111081247A - Method for speech recognition, terminal, server and computer-readable storage medium - Google Patents

Method for speech recognition, terminal, server and computer-readable storage medium

Info

Publication number
CN111081247A
CN111081247A
Authority
CN
China
Prior art keywords
recognized
server
voice
voice data
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911351762.5A
Other languages
Chinese (zh)
Inventor
刘海康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911351762.5A
Publication of CN111081247A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 - Network architectures or network communication protocols for network security
    • H04L63/04 - Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428 - Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/01 - Protocols
    • H04L67/02 - Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 - Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/16 - Implementation or adaptation of Internet protocol [IP], of transmission control protocol [TCP] or of user datagram protocol [UDP]
    • H04L69/161 - Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields
    • H04L69/162 - Implementation details of TCP/IP or UDP/IP stack architecture involving adaptations of sockets based mechanisms
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command

Abstract

The present disclosure provides a method for speech recognition and a corresponding terminal, server and computer-readable storage medium. The method comprises the following steps: acquiring voice data to be recognized; sending request information to a first server, wherein the request information comprises the voice data to be recognized; receiving response information from the first server, wherein the response information comprises at least two voice recognition results corresponding to the voice data to be recognized, and the at least two voice recognition results are obtained by respectively recognizing the voice data to be recognized by at least two voice recognition devices; and displaying the at least two voice recognition results.

Description

Method for speech recognition, terminal, server and computer-readable storage medium
Technical Field
The present disclosure relates to the field of speech recognition, and more particularly, to a method for speech recognition and a corresponding terminal, server and computer-readable storage medium.
Background
With the rapid development of the internet, voice recognition technology has been widely used. To apply voice recognition technology on a terminal such as a smartphone, a voice recognition application program needs to be downloaded and installed on the terminal. The speech recognition application may be referred to as a client. Specifically, the client may transmit voice data to be recognized to the background server through the hypertext transfer protocol (HTTP) or a socket protocol, receive a voice recognition result from the background server, and display that result.
However, the background server feeds back only a single speech recognition result to the client. Since the accuracy of the speech recognition depends entirely on that single result, when a certain fault tolerance is required of the speech recognition result, a single result may fail to meet it. In addition, a corresponding speech recognition application needs to be developed for each platform; for example, one application must be developed for the Android operating system and another for the iOS operating system. This not only wastes development resources but also requires the user to install different versions of the application for different operating systems, making user operation cumbersome and degrading the user experience. In addition, the communication between the client and the background server is in clear text, which reduces the security and privacy of the voice data transmitted between the client and the background server and poses a potential security hazard for the voice data.
Disclosure of Invention
To overcome the disadvantages of the prior art, the present disclosure proposes a method for speech recognition and a corresponding terminal, server and computer-readable storage medium.
According to one aspect of the present disclosure, a speech recognition method is provided. The method is executed by a terminal and comprises the following steps: acquiring voice data to be recognized; sending request information to a first server, wherein the request information comprises the voice data to be recognized; receiving response information from the first server, wherein the response information comprises at least two voice recognition results corresponding to the voice data to be recognized, and the at least two voice recognition results are obtained by respectively recognizing the voice data to be recognized by at least two voice recognition devices; and displaying the at least two voice recognition results.
According to an example of the present disclosure, the acquiring of voice data to be recognized includes: acquiring the voice data to be recognized by a second application program running in a first application program.
According to an example of the present disclosure, the method further includes: dividing the voice data to be recognized into at least two voice data blocks; wherein the request information includes the at least two voice data blocks.
According to an example of the present disclosure, the sending request information to the first server includes: sending the request information to the first server through an encrypted transport protocol; wherein receiving response information from the first server comprises: receiving the response information from the first server over the encrypted transport protocol.
According to an example of the present disclosure, wherein the encrypted transport protocol is a Secure Sockets Layer (SSL) based transport protocol.
According to an example of the present disclosure, the method further includes: displaying indication information, wherein the indication information indicates a speech recognition result with the highest accuracy among the at least two speech recognition results.
According to an example of the present disclosure, wherein the response information further includes the indication information.
According to an example of the present disclosure, the method further includes: determining the accuracy of each speech recognition result; and generating the indication information according to the accuracy of each voice recognition result.
According to another aspect of the present disclosure, a speech recognition method is provided. The method is performed by a first server and comprises the following steps: receiving request information from an application program, wherein the request information comprises voice data to be recognized; respectively sending the voice data to be recognized to each voice recognition device in at least two voice recognition devices; receiving one voice recognition result corresponding to the voice data to be recognized from each voice recognition device; and sending response information to the application program, wherein the response information comprises at least two voice recognition results corresponding to the voice data to be recognized.
According to an example of the present disclosure, wherein the application is running in another application.
According to an example of the present disclosure, the sending the voice data to be recognized to each of at least two voice recognition devices respectively comprises: converting the voice data to be recognized into data in a preset format; and transmitting the data in the predetermined format to each of the at least two voice recognition devices, respectively.
According to an example of the present disclosure, the receiving of request information from the application program includes: receiving the request information from the application program through an encrypted transmission protocol; and the sending of response information to the application program includes: sending the response information to the application program through the encrypted transmission protocol.
According to an example of the present disclosure, wherein the encrypted transport protocol is a secure socket layer based transport protocol.
According to an example of the present disclosure, the response information further includes indication information, wherein the indication information indicates a speech recognition result with the highest accuracy of the at least two speech recognition results.
According to an example of the present disclosure, the method further includes: determining the accuracy of each speech recognition result; and generating the indication information according to the accuracy of each voice recognition result.
According to another aspect of the present disclosure, a method for speech recognition is provided. The method is performed by a speech recognition device, comprising: receiving voice data from a first server; recognizing the received voice data to obtain a voice recognition result corresponding to the received voice data; and sending the obtained speech recognition result to the first server.
According to another aspect of the present disclosure, there is provided a terminal for voice recognition, including: an acquisition unit configured to acquire voice data to be recognized; a sending unit configured to send request information to a first server, wherein the request information includes the voice data to be recognized; a receiving unit configured to receive response information from the first server, wherein the response information includes at least two voice recognition results corresponding to the voice data to be recognized, and the at least two voice recognition results are obtained by at least two voice recognition devices respectively recognizing the voice data to be recognized; and a display unit configured to display the at least two voice recognition results.
According to an example of the present disclosure, the obtaining unit is configured as a second application program running in the first application program.
According to an example of the present disclosure, the sending unit is configured to divide the voice data to be recognized into at least two voice data blocks, wherein the request information includes the at least two voice data blocks.
According to an example of the present disclosure, wherein the sending unit is configured to send the request information to the first server through an encrypted transport protocol; wherein the receiving unit is configured to receive the response information from the first server over the encrypted transport protocol.
According to an example of the present disclosure, wherein the encrypted transport protocol is a secure socket layer based transport protocol.
According to an example of the present disclosure, the display unit is further configured to display indication information, wherein the indication information indicates a speech recognition result with the highest accuracy of the at least two speech recognition results.
According to an example of the present disclosure, wherein the response information further includes the indication information.
According to an example of the present disclosure, the terminal further comprises a processing unit configured to determine an accuracy of each speech recognition result; and generating the indication information according to the accuracy of each voice recognition result.
According to another aspect of the present disclosure, there is provided a server for voice recognition, including: a receiving unit configured to receive request information from an application program, wherein the request information includes voice data to be recognized; a transmitting unit configured to transmit the voice data to be recognized to each of at least two voice recognition apparatuses, respectively; the receiving unit is further configured to receive one voice recognition result corresponding to the voice data to be recognized from each voice recognition apparatus; and the transmitting unit is further configured to transmit response information to the application program, wherein the response information includes at least two voice recognition results corresponding to the voice data to be recognized.
According to an example of the present disclosure, wherein the receiving unit is configured to receive the request information from the application program through an encrypted transmission protocol; wherein the sending unit is configured to send the response information to the application program through the encrypted transport protocol.
According to an example of the present disclosure, wherein the encrypted transport protocol is a secure socket layer based transport protocol.
According to an example of the present disclosure, the response information further includes indication information, wherein the indication information indicates a speech recognition result with the highest accuracy of the at least two speech recognition results.
According to an example of the present disclosure, the server further comprises a processing unit configured to determine an accuracy of the respective speech recognition result; and generating the indication information according to the accuracy of each voice recognition result.
According to another aspect of the present disclosure, there is provided a voice recognition apparatus for voice recognition, including: a receiving unit configured to receive voice data from a first server; a recognition unit configured to recognize the received voice data to obtain one voice recognition result corresponding to the received voice data; and a transmitting unit configured to transmit the obtained voice recognition result to the first server.
According to another aspect of the present disclosure, there is provided a terminal for voice recognition, including: a processor; and a memory in which is stored a computer-executable program that, when executed by the processor, performs the method performed by the terminal.
According to another aspect of the present disclosure, there is provided a server for voice recognition, including: a processor; and a memory, wherein the memory has stored therein a computer-executable program that, when executed by the processor, performs the method performed by the server described above.
According to another aspect of the present disclosure, there is provided a voice recognition apparatus for voice recognition, including: a processor; and a memory, wherein the memory has stored therein a computer-executable program that, when executed by the processor, performs the method described above as being performed by a speech recognition device.
According to another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon instructions which, when executed by a processor, cause the processor to perform the above-described method performed by a terminal.
According to another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon instructions which, when executed by a processor, cause the processor to perform the above-described method performed by a server.
According to another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon instructions that, when executed by a processor, cause the processor to perform the above-described method performed by a speech recognition device.
According to the method for voice recognition and the corresponding terminal, server, voice recognition device and computer readable storage medium of the above aspects of the present disclosure, the application program may transmit voice data to be recognized to the server, receive at least two voice recognition results obtained by at least two voice recognition devices respectively recognizing the voice data to be recognized from the server, and display the at least two voice recognition results, thereby enabling the application program to obtain a plurality of voice recognition results from the server instead of a single voice recognition result, and further better satisfy the fault tolerance required by the service.
In addition, according to the method for voice recognition and the corresponding terminal, server, voice recognition device, and computer readable storage medium of the above aspects of the present disclosure, acquiring the voice data to be recognized may be acquiring the voice data to be recognized by the second application running in the first application, so that the method can be applied to any platform without downloading and installing the second application, thereby solving the cross-platform problem of the conventional client, avoiding respectively developing corresponding clients for different platforms, saving development resources, simplifying user operations, and improving user experience.
In addition, according to the method for voice recognition and the corresponding terminal, server, voice recognition device and computer-readable storage medium of the above aspects of the disclosure, communication between the application program and the server uses an encrypted transmission protocol, ensuring the security and privacy of the voice data.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1A shows a schematic diagram of an architecture of a speech recognition system according to an embodiment of the present disclosure.
FIG. 1B shows another schematic diagram of an architecture of a speech recognition system according to an embodiment of the present disclosure.
Fig. 2 shows a flow diagram of a method performed by a second application according to an embodiment of the present disclosure.
Fig. 3 shows a detailed schematic diagram of applying a method performed by a second application according to an embodiment of the present disclosure.
Fig. 4 shows a flow chart of a method performed by a first server according to an embodiment of the present disclosure.
Fig. 5 shows a flow chart of a method performed by each speech recognition device according to an embodiment of the present disclosure.
Fig. 6 shows a schematic diagram of interactions between a first server and a plurality of speech recognition devices according to an embodiment of the present disclosure.
FIG. 7A shows a schematic diagram of a speech recognition system initiating multiple passes of speech recognition according to an embodiment of the present disclosure.
FIG. 7B shows a schematic diagram of a speech recognition system displaying multiple speech recognition results according to an embodiment of the present disclosure.
Fig. 8 illustrates a schematic structural diagram of a terminal performing the method illustrated in fig. 2 according to an embodiment of the present disclosure.
Fig. 9 shows a schematic structural diagram of a first server for executing the method shown in fig. 4 according to an embodiment of the present disclosure.
Fig. 10 shows a schematic structural diagram of a speech recognition device for performing the method shown in fig. 5 according to an embodiment of the present disclosure.
Fig. 11 illustrates an architecture of a device according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numerals refer to like elements throughout. It should be understood that: the embodiments described herein are merely illustrative and should not be construed as limiting the scope of the disclosure.
First, a schematic diagram of an architecture of a speech recognition system according to an embodiment of the present disclosure is described with reference to fig. 1A. As shown in fig. 1A, the system 100 may include a terminal 110, a first server 120, and at least two voice recognition devices (e.g., voice recognition devices 130a, 130b, 130c, and 130 d).
In the present disclosure, a first application may run on the terminal 110, and a second application may run in the first application. Specifically, the "first application" described herein may be an application installed and running on the terminal 110, for example an application (such as WeChat) implementing a chat interaction function. The "second application" described herein may have the following feature: it runs in the first application and does not need to be downloaded and installed on the terminal 110. For example, within a first application, a second application may be run in a plug-and-play manner. For example, the second application may be an applet, such as an applet for speech recognition. That is, in the present disclosure, the second application runs in the first application; therefore, only the first application needs to be downloaded and installed on the terminal 110, and the second application does not. Because the second application does not need to be downloaded and installed on the terminal 110, it can be applied to any platform, which solves the cross-platform problem of conventional clients, avoids developing corresponding versions of the second application for each platform separately, saves development resources, simplifies user operation, and improves the user experience.
Further, in the present disclosure, the first application and the second application may be based on different programming languages. For example, the first application may be a native Android application and the second application may be based on the JavaScript programming language. Further, the first application may be referred to as a first client and/or the second application may be referred to as a second client.
Further, in the present disclosure, the second application may establish a communication link with the first server 120. The second application may then acquire the voice data to be recognized and send request information to the first server 120 over the established communication link, wherein the request information includes the voice data to be recognized. After receiving the request information, the first server 120 may send the voice data to be recognized to each of the voice recognition devices 130a, 130b, 130c, and 130d. The voice recognition device 130a may recognize the voice data to obtain a first voice recognition result and feed it back to the first server 120. Similarly, the voice recognition devices 130b, 130c, and 130d may recognize the voice data to obtain a second, a third, and a fourth voice recognition result, respectively, and feed them back to the first server 120. After receiving the first to fourth speech recognition results, the first server 120 may feed them back to the second application through the established communication link. Accordingly, the second application may display the first to fourth voice recognition results through a display component (e.g., a display) of the terminal 110, so that the user can select one of them as the result to use. With the system architecture shown in fig. 1A, the second application obtains multiple speech recognition results from the first server 120 instead of a single one, thereby better meeting the fault tolerance required by speech recognition.
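To make this exchange concrete, the following TypeScript sketch shows one possible shape for the messages passed between the second application and the first server 120. All field names here are illustrative assumptions made for this sketch, not formats prescribed by the disclosure.

```typescript
// Illustrative message shapes for the second-application <-> first-server
// exchange. Every name below is an assumption made for this sketch.

/** Request carrying the voice data to be recognized (steps S202/S401). */
interface RecognitionRequest {
  sessionId: string;    // correlates the blocks of one utterance
  audio: ArrayBuffer;   // pre-processed (e.g. MP3-compressed) voice data
  isLastChunk: boolean; // true when streaming the final voice data block
}

/** One result produced by a single speech recognition device. */
interface RecognitionResult {
  deviceId: string;     // which of the at least two recognition devices answered
  text: string;         // recognized text corresponding to the voice data
  accuracy?: number;    // optional score used to build the indication information
}

/** Response carrying at least two results plus optional indication info. */
interface RecognitionResponse {
  sessionId: string;
  results: RecognitionResult[]; // one entry per recognition device
  bestResultIndex?: number;     // indication info: the highest-accuracy result
}
```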
Further, in the present disclosure, the respective speech recognition results may be the same as or different from one another.
Further, in the present disclosure, the terminal 110 may be an electronic device such as a smartphone, a tablet, a laptop computer, or a desktop computer.
Further, in the present disclosure, the first server 120 may be a server that establishes a communication link with the second application and distributes voice data from the second application, such as a proxy server (Proxy Server).
Further, in the present disclosure, each voice recognition apparatus may be a server that recognizes voice data and outputs a voice recognition result. The voice recognition result may be text information corresponding to the voice data. According to an example of the present disclosure, each speech recognition device may include two modules, an adaptation module and a speech recognition module, wherein the adaptation module may be configured to receive speech data from the first server 120 and match the received speech data to the speech recognition module, and the speech recognition module may be configured to convert the speech data from the adaptation module into text information.
According to another example of the present disclosure, the adaptation module may be implemented as an adaptation server (Adapter Server), and the voice recognition module may be implemented as a voice recognition server. In this example, each speech recognition device may be replaced with an adaptation server and a speech recognition server, with a data processing channel between them. The "speech recognition server" described herein may also be referred to as a decoding server (Decoder Server). Fig. 1B shows another schematic diagram of the architecture of a speech recognition system based on this example. Compared with fig. 1A, in fig. 1B the voice recognition device 130a is replaced with an adaptation server 131a and a decoding server 132a, the voice recognition device 130b is replaced with an adaptation server 131b and a decoding server 132b, the voice recognition device 130c is replaced with an adaptation server 131c and a decoding server 132c, and the voice recognition device 130d is replaced with an adaptation server 131d and a decoding server 132d.
Next, a method performed by the second application according to an embodiment of the present disclosure will be described with reference to fig. 2. Fig. 2 shows a flow diagram of a method performed by a second application according to an embodiment of the present disclosure.
As shown in fig. 2, in step S201, voice data to be recognized is acquired. The voice data in step S201 may be an analog signal, for example, an analog voice signal. Alternatively, the voice data in step S201 may be a digital signal, for example, a digital voice signal.
According to one example of the present disclosure, the second application may collect voice data to be recognized. For example, the second application may collect voice data to be recognized through the voice input device. For example, the second application may collect voice data to be recognized by calling a microphone in terminal 110 in fig. 1A.
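As a rough illustration of this step, a WeChat-style applet could drive the platform recorder along the following lines. The `wx` recorder API shape declared below is an assumption made for this sketch; consult the actual platform documentation rather than treating it as authoritative.

```typescript
// Minimal sketch of collecting voice data in a WeChat-style applet (the
// second application). The `wx` recorder API declared here is an
// assumption for illustration only.

declare const wx: {
  getRecorderManager(): {
    start(options: { format: string; duration: number }): void;
    stop(): void;
    onStop(callback: (res: { tempFilePath: string }) => void): void;
  };
};

/** Starts recording; returns a function the UI can call to stop it. */
function collectVoiceData(onFinished: (filePath: string) => void): () => void {
  const recorder = wx.getRecorderManager();
  // Hand the recorded (already MP3-compressed) file to the caller.
  recorder.onStop((res) => onFinished(res.tempFilePath));
  // Begin recording when the user presses the voice button (cf. fig. 7A).
  recorder.start({ format: "mp3", duration: 60000 });
  return () => recorder.stop();
}
```

Recording in a compressed format such as MP3 at capture time also covers part of the preprocessing described below.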
According to another example of the present disclosure, the second application may receive voice data to be recognized from another terminal. For example, another terminal different from the terminal 110 in fig. 1A may collect voice data to be recognized and transmit the voice data to be recognized to the second application program, and accordingly, the second application program may receive the voice data to be recognized from the another terminal.
Then, in step S202, request information is sent to the first server, wherein the request information includes the voice data to be recognized.
According to one example of the present disclosure, the second application may send the request information to the first server through an encrypted transport protocol. In this example, the encrypted transport protocol may be a Secure Sockets Layer (SSL) based transport protocol. For example, it may be the SSL-based HTTP protocol, referred to as the HTTPS protocol. As another example, it may be the WebSocket protocol over SSL, referred to as the WebSocket Secure (WSS) protocol.
In the case where the second application sends the request information to the first server via the HTTPS protocol, the link between the second application and the first server is an HTTPS communication link, and accordingly, the second application may send the request information over that link. For example, the second application may send the request information to the first server using the POST method over the HTTPS communication link. Because HTTPS is an application-layer protocol, the data transmission task can be carried out over an HTTPS communication link established at the application layer, which ensures stable and secure data transmission without requiring the application layer to maintain complex network state.
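A minimal sketch of this HTTPS POST, assuming a fetch-style HTTP client; the endpoint URL and header name are invented for this example.

```typescript
// Minimal sketch of step S202 over HTTPS: POST the pre-processed voice
// data to the first server and await the response info. The endpoint
// URL and the correlation header are assumptions for this sketch.

async function sendRecognitionRequest(
  audio: ArrayBuffer,
  sessionId: string,
): Promise<unknown> {
  const response = await fetch("https://first-server.example.com/recognize", {
    method: "POST", // request info sent via the POST method
    headers: {
      "Content-Type": "application/octet-stream",
      "X-Session-Id": sessionId, // hypothetical correlation id
    },
    body: audio, // the voice data to be recognized
  });
  if (!response.ok) {
    throw new Error(`recognition request failed: ${response.status}`);
  }
  // The response info carries at least two recognition results (step S203).
  return response.json();
}
```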
In addition, in the case where the second application transmits the request information to the first server through the WSS protocol, the link between the second application and the first server is a WSS communication link, and accordingly, the second application can transmit the request information over that link. Further, in this case, the second application may convert the voice data to be recognized into binary data, and the request information includes the binary data.
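A corresponding sketch of the WSS variant, using the standard WebSocket API; the endpoint URL is again an assumption.

```typescript
// Minimal sketch of the WSS variant: open a Secure WebSocket link and
// send the voice data as a binary frame. The endpoint URL is an
// assumption for this sketch.

function sendOverWss(audio: ArrayBuffer): void {
  const socket = new WebSocket("wss://first-server.example.com/recognize");
  socket.binaryType = "arraybuffer"; // response frames arrive as binary
  socket.onopen = () => {
    socket.send(audio); // the voice data converted to binary data
  };
  socket.onmessage = (event) => {
    // Response info carrying the recognition results arrives here (step S203).
    console.log("response info:", event.data);
    socket.close();
  };
}
```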
Through these examples, communication between the second application and the first server uses an encrypted transport protocol, ensuring the security and privacy of the voice data.
Further, according to an example of the present disclosure, the second application may pre-process the voice data to be recognized and include the pre-processed voice data in the request information to be transmitted to the first server.
In the case where the speech data to be recognized is an analog speech signal, the preprocessing may include sampling, quantization, denoising, encoding, and compression. Specifically, the analog speech signal may be sampled and quantized to obtain a digital speech signal, the noise in the digital speech signal may then be removed, and the denoised digital speech signal may be encoded and compressed. In the case where the voice data to be recognized is already a digital voice signal, the preprocessing may include denoising, encoding, and compression: the noise in the digital speech signal is removed, and the denoised signal is encoded and compressed. In addition, in the present disclosure, the encoding and compression may use conventional audio encoding and compression techniques, such as MP3, OGG, or Windows Media Audio (WMA). Encoding and compressing the voice data to be recognized greatly reduces the amount of voice data to be transmitted, which is especially beneficial under heavy service traffic.
In addition, the preprocessing may further include fragmenting the voice data to be recognized. For example, in a case where the second application sends the request information to the first server through the WSS protocol, the preprocessing of the voice data to be recognized by the second application may further include fragmentation.
In particular, the second application may divide the voice data to be recognized into at least two voice data blocks. In this case, the request information includes the at least two voice data blocks. That is, the second application may transmit the voice data to the first server block by block rather than as a single whole. After receiving a voice data block, the first server can immediately send it to the voice recognition devices for recognition and receive a voice recognition result for that block, so that the result can be displayed promptly. In this way, the first server does not have to wait until the entire voice data has been completely and correctly received before forwarding it, receiving a recognition result for the whole utterance, and displaying that result; streaming recognition of the voice data is thereby realized.
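A minimal sketch of this fragmentation step; the block size is an arbitrary assumption.

```typescript
// Minimal sketch of dividing the voice data to be recognized into voice
// data blocks for streaming recognition. The block size is arbitrary.

function divideIntoBlocks(
  audio: ArrayBuffer,
  blockSize = 8 * 1024,
): ArrayBuffer[] {
  const blocks: ArrayBuffer[] = [];
  for (let offset = 0; offset < audio.byteLength; offset += blockSize) {
    // slice() copies [offset, offset + blockSize) and clamps at the end.
    blocks.push(audio.slice(offset, offset + blockSize));
  }
  return blocks;
}

// Each block can be sent as soon as it is ready, so the first server can
// start recognition before the whole utterance has been received.
```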
Further, according to an example of the present disclosure, in step S202, the second application may transmit a plurality of request messages to the first server, each of which may include voice data to be recognized. In this example, the second application may send every request message to the first server via the same encrypted transport protocol, for example the HTTPS protocol. Alternatively, the second application may send the request messages via different encrypted transport protocols; for example, it may send a first request message via the HTTPS protocol and a second request message via the WSS protocol.
Returning to fig. 2, in step S203, response information is received from the first server, where the response information includes at least two voice recognition results corresponding to the voice data to be recognized, and the at least two voice recognition results are obtained by at least two voice recognition devices respectively recognizing the voice data to be recognized.
According to an example of the present disclosure, in the case where the second application transmits the request information to the first server through the encrypted transmission protocol in step S202, the second application may receive the response information from the first server through the encrypted transmission protocol in step S203.
For example, in the case where the second application program transmits the request information to the first server through the HTTPS protocol in step S202, the second application program may receive the response information from the first server through the HTTPS protocol in step S203. For another example, in the case where the second application transmits the request information to the first server through the WSS protocol in step S202, the second application may receive the response information from the first server through the WSS protocol in step S203.
Then, in step S204, the at least two speech recognition results are displayed. For example, the second application may display the at least two speech recognition results using a display device. For example, the second application may display the at least two speech recognition results by calling a display component (e.g., a display) in the terminal 110 in fig. 1A.
According to one example of the present disclosure, the second application may also display indication information. The indication information may indicate a speech recognition result with the highest accuracy among the at least two speech recognition results.
In a first implementation, the second application may receive the indication information from the first server. For example, the first server may determine the speech recognition result with the highest accuracy among the at least two speech recognition results and indicate it to the second application through the indication information. Specifically, the first server may determine the accuracy of each speech recognition result using conventional methods. For example, the accuracy of each speech recognition result can be determined by performing semantic analysis on it and scoring how well its semantic logic coheres.
Further, in the first implementation, the response information in step S203 may include indication information so that the second application program receives the indication information from the first server.
Further, in a second implementation, the second application may generate the indication information itself. Specifically, the second application may determine the accuracy of each speech recognition result and generate the indication information accordingly. For example, the second application may determine the accuracy of each speech recognition result using conventional methods, such as performing semantic analysis on the result and scoring how well its semantic logic coheres.
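However the accuracy scores are obtained, turning them into the indication information reduces to picking the highest score. A minimal TypeScript sketch, with type and function names invented for this example:

```typescript
// Minimal sketch of generating the indication information: given the
// per-result accuracy scores (e.g. from semantic analysis), mark the
// most accurate result. All names here are assumptions.

interface ScoredResult {
  text: string;     // the recognized text
  accuracy: number; // estimated accuracy of this result
}

/** Returns the index of the highest-accuracy result. */
function buildIndicationInfo(results: ScoredResult[]): number {
  return results.reduce(
    (best, result, index) =>
      result.accuracy > results[best].accuracy ? index : best,
    0,
  );
}
```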
Further, according to an example of the present disclosure, the second application may further determine a speech recognition result with the highest accuracy of the at least two speech recognition results according to an operation of the user. For example, the second application may display the at least two speech recognition results, and accordingly, the user may determine and select a speech recognition result with the highest accuracy among the at least two speech recognition results. The second application may then determine, in response to the user's selection, the most accurate speech recognition result of the at least two speech recognition results.
In this example, after the second application determines the speech recognition result with the highest accuracy among the at least two speech recognition results, it may display only that result and omit the remaining results. Alternatively, the second application may display each of the at least two speech recognition results but add a decorative element to the most accurate one to highlight it. For example, the second application may add text information, such as "most accurate", to the most accurate speech recognition result, or add a brightly colored shading to it.

Next, a specific schematic diagram applying the method performed by the second application according to the embodiment of the present disclosure will be described with reference to fig. 3. Fig. 3 shows a detailed schematic diagram of applying a method performed by a second application according to an embodiment of the present disclosure. As shown in fig. 3, the user may initiate a multi-pass speech recognition process via the second application. The second application may establish a connection with the first server through the HTTPS protocol in order to transmit the voice data to be recognized and receive a plurality of voice recognition results. The second application may then collect voice data while the user speaks. When the user finishes speaking, the second application may preprocess the voice data, for example MP3 encoding and compression. The second application may then send the preprocessed voice data to the first server.
Through the embodiments of the present disclosure, the second application can send the voice data to be recognized to the first server, receive from the first server at least two voice recognition results obtained by at least two voice recognition devices respectively recognizing that voice data, and display those results. The second application thus obtains a plurality of voice recognition results from the first server instead of a single one, better meeting the fault tolerance required by the service.
In the following, a method performed by a first server according to an embodiment of the present disclosure will be described with reference to fig. 4. Fig. 4 shows a flow chart of a method performed by a first server according to an embodiment of the present disclosure.
As shown in fig. 4, in step S401, request information is received from an application, wherein the request information includes voice data to be recognized. The application in step S401 is run in another application. For example, the application in step S401 is the second application described above, and the other application is the first application described above.
According to one example of the present disclosure, the first server may receive the request information from the second application through an encrypted transport protocol. In this example, the encrypted transport protocol may be a Secure Sockets Layer (SSL) based transport protocol. For example, it may be the SSL-based HTTP protocol, referred to as the HTTPS protocol. As another example, it may be the WebSocket protocol over SSL, referred to as the WebSocket Secure (WSS) protocol.
In the case where the first server receives the request information from the second application via the HTTPS protocol, the link between the second application and the first server is an HTTPS communication link over which the first server may receive the request information from the second application accordingly.
Further, in a case where the first server receives the request information from the second application through the WSS protocol, the link between the second application and the first server is a WSS communication link, and accordingly, the first server can receive the request information from the second application through the WSS communication link. Further, in this case, the second application program may convert the voice data to be recognized into binary data, and the request information includes the binary data. Accordingly, the first server receives binary data from the second application.
Through these examples, communication between the second application and the first server uses an encrypted transport protocol, ensuring the security and privacy of the voice data.
Further, according to an example of the present disclosure, in a case where the second application preprocesses the voice data to be recognized, the first server may receive the preprocessed voice data from the second application. In an example where the pre-processing is to fragment the voice data to be recognized, the second application may divide the voice data to be recognized into at least two voice data blocks. In this case, the request information includes the at least two voice data blocks. Accordingly, the first server may receive at least two blocks of speech data from the second application.
In this example, upon receiving a voice data block, the first server may send it to the voice recognition devices for recognition and receive a voice recognition result for that block, so that the result can be displayed promptly. In this way, the first server does not have to wait until the entire voice data has been completely and correctly received before forwarding it, receiving a recognition result for the whole utterance, and displaying that result; streaming recognition of the voice data is thereby realized.
Then, in step S402, the voice data to be recognized is transmitted to each of at least two voice recognition apparatuses, respectively.
According to one example of the present disclosure, the first server may convert the voice data to be recognized into data of a predetermined format and transmit that data to each of the at least two voice recognition devices. The data of the predetermined format described herein may be structured data (StructData).
Then, in step S403, one speech recognition result corresponding to the speech data to be recognized is received from each speech recognition apparatus.
According to one example of the present disclosure, each voice recognition device may convert its voice recognition result into data of the predetermined format and return that data to the first server. The data of the predetermined format described here may be the structured data (StructData) described above.
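A minimal sketch of steps S402 and S403 on the first server, with the device transport abstracted behind a hypothetical `callDevice` function; nothing here is a prescribed interface.

```typescript
// Minimal sketch of steps S402-S403 on the first server: send the same
// voice data to every recognition device and collect one result from
// each. `callDevice` stands in for the real transport carrying the
// structured data and is purely hypothetical.

type CallDevice = (endpoint: string, audio: Uint8Array) => Promise<string>;

async function fanOutRecognition(
  audio: Uint8Array,
  deviceEndpoints: string[], // at least two voice recognition devices
  callDevice: CallDevice,
): Promise<string[]> {
  // Recognize in parallel; each device returns one recognition result,
  // and the collected results form the response info of step S404.
  return Promise.all(
    deviceEndpoints.map((endpoint) => callDevice(endpoint, audio)),
  );
}
```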
Then, in step S404, response information is transmitted to the application program. Specifically, in step S404, the first server transmits the response information to the second application. The response information may include at least two voice recognition results corresponding to the voice data to be recognized.
According to an example of the present disclosure, in the case where the first server receives the request information from the second application through an encrypted transmission protocol in step S401, the first server may transmit the response information to the second application through the same encrypted transmission protocol in step S404.
For example, in the case where the first server receives the request information from the second application through the HTTPS protocol in step S401, the first server may transmit the response information to the second application through the HTTPS protocol in step S404. As another example, in the case where the first server receives the request information through the WSS protocol in step S401, it may transmit the response information through the WSS protocol in step S404.
Further, according to an example of the present disclosure, the response information in step S404 may further include indication information, which may indicate a speech recognition result with the highest accuracy among the at least two speech recognition results. For example, the first server may determine a speech recognition result with the highest accuracy among the at least two speech recognition results, and indicate the speech recognition result with the highest accuracy to the second application through the indication information.
In this example, the first server may determine the accuracy of each speech recognition result using conventional methods. For example, the accuracy of each speech recognition result can be determined by performing semantic analysis on it and scoring how well its semantic logic coheres.
Through the embodiments of the present disclosure, the second application can send the voice data to be recognized to the first server, receive from the first server at least two voice recognition results obtained by at least two voice recognition devices respectively recognizing that voice data, and display those results. The second application thus obtains a plurality of voice recognition results from the first server instead of a single one, better meeting the fault tolerance required by the service.
Next, a method performed by each speech recognition device according to an embodiment of the present disclosure will be described with reference to fig. 5. Fig. 5 shows a flow chart of a method performed by each speech recognition device according to an embodiment of the present disclosure.
As shown in fig. 5, in step S501, voice data is received from a first server. In the present disclosure, the voice data in step S501 may be voice data preprocessed by the second application.
According to one example of the present disclosure, the first server may convert voice data into data of a predetermined format and transmit the data of the predetermined format to each of at least two voice recognition devices, respectively. Accordingly, the voice recognition device may receive the data in the predetermined format from the first server. The data of the predetermined format described herein may be structured data.
Then, in step S502, the received voice data is recognized to obtain one voice recognition result corresponding to the received voice data. For example, the speech recognition device may employ conventional speech recognition techniques, such as automatic speech recognition (ASR), to recognize the received speech data.
Then, in step S503, the obtained speech recognition result is transmitted to the first server. For example, the voice recognition device may convert the voice recognition result into data of a predetermined format and return that data to the first server. The data of the predetermined format described here may be the structured data (StructData) described above.
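A minimal sketch of steps S501 to S503 on one recognition device, with the decoder abstracted behind a hypothetical `decode` function; the field names are assumptions made for this example.

```typescript
// Minimal sketch of steps S501-S503 on one voice recognition device:
// receive structured data from the first server, recognize it, and
// return the result in the same format. `decode` stands in for a real
// ASR engine; all field names are assumptions for this sketch.

interface StructData {
  sessionId: string;   // identifies the utterance being recognized
  payload: Uint8Array; // the (pre-processed) voice data
}

interface StructResult {
  sessionId: string;
  text: string;        // the recognized text
}

async function handleRecognition(
  incoming: StructData,
  decode: (audio: Uint8Array) => Promise<string>, // hypothetical ASR call
): Promise<StructResult> {
  // Step S502: recognize the received voice data.
  const text = await decode(incoming.payload);
  // Step S503: return the result to the first server.
  return { sessionId: incoming.sessionId, text };
}
```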
Because each of the plurality of speech recognition devices feeds back a speech recognition result to the first server, with this embodiment the second application can receive from the first server at least two speech recognition results obtained by at least two speech recognition devices respectively recognizing the same speech data, and display those results. The second application thus obtains a plurality of speech recognition results from the first server instead of a single one, better meeting the fault tolerance required by the service.
Further, in an example in which each voice recognition device includes one adaptation server and one decoding server, the adaptation server may perform the above-described steps S501 and S503, and the decoding server may perform the above-described step S502.
In the following, a schematic diagram of the interaction between a first server and a plurality of speech recognition devices according to an embodiment of the present disclosure will be described with reference to fig. 6. Fig. 6 shows a schematic diagram of interactions between a first server and a plurality of speech recognition devices according to an embodiment of the present disclosure. In the example of fig. 6, the first server may be a proxy server, and each speech recognition device may include an adaptation server and a decoding server. Fig. 6 shows a proxy server and three speech recognition devices. As shown in fig. 6, after the first server establishes a link with each speech recognition device, it may send the voice data to be recognized to each of them. Each voice recognition device can process the voice data to obtain a voice recognition result and feed it back to the first server. The data exchanged between the first server and each voice recognition device can be structured data. Through this linkage between the proxy server and the plurality of voice recognition devices, multi-path voice recognition of the same voice data can be realized, improving the fault tolerance of the voice recognition.
Specific examples of speech recognition by the speech recognition system of the embodiments of the present disclosure will be described below with reference to figs. 7A-7B. Fig. 7A shows a schematic diagram of a speech recognition system initiating multiple passes of speech recognition according to an embodiment of the present disclosure. As shown in fig. 7A, WeChat (i.e., the first application) may be installed on the terminal, and an applet for speech recognition (i.e., the second application) may then be run in WeChat. When the user clicks the voice button, the second application may display a voice recognition interface. The second application may establish a link with the first server using the HTTPS protocol. When the user starts speaking, the second application can collect the voice data to be recognized and preprocess it. The second application may then send the preprocessed voice data to the first server for speech recognition by the plurality of speech recognition devices. Fig. 7B shows a schematic diagram of a speech recognition system displaying multiple speech recognition results according to an embodiment of the present disclosure. As shown in fig. 7B, the second application receives four speech recognition results from the first server and displays all four.
Hereinafter, a terminal corresponding to the method illustrated in fig. 2 according to an embodiment of the present disclosure is described with reference to fig. 8. Fig. 8 illustrates a schematic structural diagram of a terminal 800 for performing the method illustrated in fig. 2. Since the functions of the terminal 800 match the details of the method described above with reference to fig. 2, a detailed description of the overlapping content is omitted here for simplicity. As shown in fig. 8, the terminal 800 includes: an acquisition unit 810 configured to acquire voice data to be recognized; a transmitting unit 820 configured to transmit request information to a first server, wherein the request information includes the voice data to be recognized; a receiving unit 830 configured to receive response information from the first server, wherein the response information includes at least two voice recognition results corresponding to the voice data to be recognized, obtained by at least two voice recognition devices respectively recognizing that data; and a display unit 840 configured to display the at least two voice recognition results. The terminal 800 may include components other than these four units; however, since such components are unrelated to the content of the embodiments of the present disclosure, their illustration and description are omitted here.
In the present disclosure, the voice data acquired by the acquisition unit 810 may be an analog signal, for example, an analog voice signal. Alternatively, the voice data acquired by the acquisition unit 810 may be a digital signal, for example, a digital voice signal.
According to one example of the present disclosure, the obtaining unit 810 may collect voice data to be recognized. For example, the acquisition unit 810 may acquire voice data to be recognized through a voice input device. For example, the acquisition unit 810 may acquire voice data to be recognized by calling a microphone.
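As one hedged illustration of calling a microphone, the sketch below uses the third-party sounddevice package; the package choice, the duration, and the sample rate are assumptions rather than anything prescribed by the disclosure.

```python
import sounddevice as sd  # third-party microphone-capture package (assumed choice)

def record_utterance(seconds: float = 5.0, sample_rate: int = 16000):
    # Record mono 16-bit audio from the default microphone; the duration and
    # sample rate are illustrative values, not mandated by the disclosure.
    audio = sd.rec(int(seconds * sample_rate), samplerate=sample_rate,
                   channels=1, dtype="int16")
    sd.wait()  # block until the recording completes
    return audio  # numpy array of shape (frames, 1)
```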
According to another example of the present disclosure, the obtaining unit 810 may receive the voice data to be recognized from another terminal. For example, a terminal other than the terminal 800 may collect the voice data to be recognized and transmit it to the receiving unit 830; the receiving unit 830 then passes the received voice data to the obtaining unit 810.
According to an example of the present disclosure, the transmitting unit 820 may transmit the request information to the first server through an encrypted transmission protocol. In this example, the encrypted transmission protocol may be a Secure Socket Layer (SSL)-based transmission protocol. For example, it may be the SSL-based HTTP protocol, referred to as the HTTPS protocol, or the SSL-based WebSocket protocol, referred to as the WebSocket Secure (WSS) protocol.
In the case where the transmitting unit 820 transmits the request information to the first server through the HTTPS protocol, the link between the transmitting unit 820 and the first server is an HTTPS communication link, and the transmitting unit 820 may transmit the request information over that link, for example using the POST method. Because HTTPS is an application layer protocol, data transmission can be carried out by establishing an HTTPS communication link at the application layer, ensuring stable transmission and data security without requiring the application layer to maintain complex network relationships.
Further, in the case where the transmitting unit 820 transmits the request information to the first server through the WSS protocol, the link between the transmitting unit 820 and the first server is a WSS communication link, over which the transmitting unit 820 may transmit the request information. In this case, the second application may convert the voice data to be recognized into binary data (Binary Data), and the request information includes that binary data.
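A minimal client-side sketch of the two transport options might look as follows; the endpoint URLs are hypothetical, and the requests and websockets packages merely stand in for whatever HTTPS/WSS stack the second application actually uses.

```python
import requests    # HTTPS POST path
import websockets  # WSS path (third-party 'websockets' package)

# Hypothetical endpoints; the disclosure does not name the first server's URLs.
HTTPS_URL = "https://first-server.example.com/asr"
WSS_URL = "wss://first-server.example.com/asr"

def send_over_https(audio: bytes) -> dict:
    # POST the (preprocessed) voice data over an SSL-secured HTTP link.
    resp = requests.post(
        HTTPS_URL, data=audio,
        headers={"Content-Type": "application/octet-stream"},
    )
    resp.raise_for_status()
    return resp.json()  # response information carrying the recognition results

async def send_over_wss(audio: bytes) -> str:
    # Over WSS, the voice data is sent as a binary frame.
    async with websockets.connect(WSS_URL) as ws:
        await ws.send(audio)    # binary data
        return await ws.recv()  # recognition result(s)
```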
In these examples, communication between the terminal and the first server uses an encrypted transmission protocol, ensuring the security and privacy of the voice data.
Further, according to one example of the present disclosure, the transmitting unit 820 may preprocess the voice data to be recognized and include the preprocessed voice data in the request information sent to the first server.
In the case where the voice data to be recognized is an analog speech signal, the preprocessing may include sampling, quantization, denoising, encoding, and compression: the analog speech signal is sampled and quantized to obtain a digital speech signal, noise in the digital signal is removed, and the denoised signal is encoded and compressed. In the case where the voice data to be recognized is already a digital speech signal, the preprocessing may include denoising, encoding, and compression. The encoding and compression may use conventional audio codecs such as MP3, OGG, or Windows Media Audio (WMA). Encoding and compressing the voice data greatly reduces the amount of data to be transmitted, which is especially beneficial under heavy service traffic.
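A minimal sketch of this pipeline for an already-digital signal is shown below; the naive noise gate and zlib compression are illustrative stand-ins for real denoising and the MP3/OGG/WMA codecs named above.

```python
import zlib
import numpy as np

def preprocess(pcm: np.ndarray, noise_floor: float = 0.01) -> bytes:
    # `pcm` is assumed to be an already-sampled digital signal in [-1.0, 1.0].
    denoised = np.where(np.abs(pcm) < noise_floor, 0.0, pcm)  # crude noise gate
    encoded = (denoised * 32767).astype(np.int16)             # 16-bit PCM encoding
    return zlib.compress(encoded.tobytes())                   # generic compression
```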
In addition, the preprocessing may further include fragmenting the voice data to be recognized. For example, in a case where the transmitting unit 820 transmits the request information to the first server through the WSS protocol, the preprocessing of the voice data to be recognized by the transmitting unit 820 may further include fragmentation.
Specifically, the transmitting unit 820 may divide the voice data to be recognized into at least two voice data blocks, and the request information then includes those blocks. That is, the transmitting unit 820 sends the voice data to the first server block by block rather than as a whole. After receiving a block, the first server can immediately forward it to the voice recognition devices, receive the recognition result for that block, and have it displayed in time. The first server therefore does not need to wait until the entire voice data has been completely and correctly received before forwarding it to the voice recognition devices, receiving the result for the whole, and displaying it, thereby realizing streaming recognition of the voice data.
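The sketch below illustrates fragmentation plus per-block streaming over the hypothetical WSS endpoint from the earlier example; the 3200-byte block size (about 100 ms of 16 kHz, 16-bit mono audio) is an assumed value, not one the disclosure prescribes.

```python
import asyncio
import websockets  # third-party package, as in the earlier sketch

WSS_URL = "wss://first-server.example.com/asr"  # hypothetical endpoint

def fragment(audio: bytes, block_size: int = 3200):
    # Split the voice data into fixed-size voice data blocks.
    for offset in range(0, len(audio), block_size):
        yield audio[offset:offset + block_size]

async def stream_blocks(audio: bytes) -> None:
    # Send the utterance block by block; each block's result can be
    # displayed before the whole utterance has been transmitted.
    async with websockets.connect(WSS_URL) as ws:
        for block in fragment(audio):
            await ws.send(block)    # one voice data block
            print(await ws.recv())  # per-block recognition result, shown in time
```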
Further, according to an example of the present disclosure, the transmitting unit 820 may transmit multiple pieces of request information to the first server, each including voice data to be recognized. The transmitting unit 820 may send all of them through the same encrypted transmission protocol, for example all through HTTPS; alternatively, it may use different encrypted transmission protocols, for example sending the first piece of request information through HTTPS and the second through WSS.
According to an example of the present disclosure, in a case where the transmitting unit 820 transmits the request information to the first server through an encrypted transmission protocol, the receiving unit 830 may receive the response information from the first server through the encrypted transmission protocol.
Then, the display unit 840 displays the at least two voice recognition results. According to an example of the present disclosure, the display unit 840 may also display indication information indicating the speech recognition result with the highest accuracy among the at least two results.
According to an example of the present disclosure, the terminal 800 may further include a processing unit (not shown), which may be the second application. The processing unit may be configured to generate the indication information: it determines the accuracy of each speech recognition result and generates the indication information accordingly. The accuracy may be determined by any conventional method, for example by performing semantic analysis on each result and scoring how well its semantics conform to logical expectations.
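A minimal sketch of generating such indication information follows; score_fn stands in for whatever conventional accuracy estimator is used, which the disclosure leaves open.

```python
from typing import Callable

def pick_best(results: list[str], score_fn: Callable[[str], float]) -> int:
    # Score every candidate recognition result and return the index of the
    # most accurate one; that index can serve as the indication information.
    scores = [score_fn(r) for r in results]
    return max(range(len(results)), key=scores.__getitem__)

# Toy usage with a placeholder scorer (a real one might score semantic
# plausibility): pick_best(["text a", "text bb"], score_fn=len) -> 1
```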
Through the embodiments of the present disclosure, the second application program can send the voice data to be recognized to the first server, receive from the first server at least two voice recognition results obtained by at least two voice recognition devices respectively recognizing that data, and display those results. The second application program thus obtains multiple voice recognition results from the first server rather than a single result, better meeting the fault tolerance required by the service.
Hereinafter, a first server corresponding to the method illustrated in fig. 4 according to an embodiment of the present disclosure is described with reference to fig. 9. Fig. 9 illustrates a schematic structural diagram of a first server 900 for performing the method illustrated in fig. 4. Since the functions of the first server 900 match the details of the method described above with reference to fig. 4, a detailed description of the overlapping content is omitted here for simplicity. As shown in fig. 9, the first server 900 includes: a receiving unit 910 configured to receive request information from an application, wherein the request information includes voice data to be recognized; and a transmitting unit 920 configured to transmit the voice data to be recognized to each of at least two voice recognition devices. The receiving unit 910 is further configured to receive one voice recognition result corresponding to the voice data to be recognized from each voice recognition device, and the transmitting unit 920 is further configured to transmit response information to the application, wherein the response information includes at least two voice recognition results corresponding to the voice data to be recognized. The first server 900 may include components other than these two units; however, since such components are unrelated to the content of the embodiments of the present disclosure, their illustration and description are omitted here.
In the disclosed embodiments, an application may be running in another application. For example, the application may be the second application described above and the other application may be the first application described above.
According to an example of the present disclosure, the receiving unit 910 may receive the request information from the second application through an encrypted transmission protocol. In this example, the encrypted transmission protocol may be a Secure Socket Layer (SSL)-based transmission protocol, for example the SSL-based HTTP protocol (HTTPS) or the SSL-based WebSocket protocol (WebSocket Secure, WSS).
In a case where the receiving unit 910 receives the request information from the second application through the HTTPS protocol, the link between the second application and the first server is an HTTPS communication link, and accordingly, the receiving unit 910 may receive the request information from the second application through the HTTPS communication link.
In the case where the receiving unit 910 receives the request information from the second application through the WSS protocol, the link between the second application and the first server is a WSS communication link, over which the receiving unit 910 receives the request information. In this case, the second application may convert the voice data to be recognized into binary data, and the request information includes that binary data; accordingly, the receiving unit 910 receives binary data from the second application.
In this example, communication between the second application and the first server uses an encrypted transmission protocol, ensuring the security and privacy of the voice data.
Further, according to an example of the present disclosure, in a case where the second application preprocesses the voice data to be recognized, the receiving unit 910 may receive the preprocessed voice data from the second application. In an example where the pre-processing is to fragment the voice data to be recognized, the second application may divide the voice data to be recognized into at least two voice data blocks. In this case, the request information includes the at least two voice data blocks. Accordingly, the receiving unit 910 may receive at least two voice data blocks from the second application.
In this example, after the receiving unit 910 receives a voice data block, the sending unit 920 may forward that block to a voice recognition device, and the receiving unit 910 receives the recognition result for the block from the device so that it can be displayed in time. In this way, the first server does not need to wait until the entire voice data to be recognized has been completely and correctly received before forwarding it to the voice recognition devices, receiving the result for the whole, and displaying it, thereby realizing streaming recognition of the voice data.
Further, according to one example of the present disclosure, the transmitting unit 920 may convert the voice data to be recognized into data of a predetermined format and transmit that data to each of the at least two voice recognition devices. The predetermined format described here may be structured data (StructData).
According to one example of the present disclosure, each voice recognition device may likewise convert its voice recognition result into data of the predetermined format and return it to the first server; the format may be the structured data described above.
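Since the disclosure does not define the fields of the structured data, the sketch below shows one hypothetical JSON-backed layout that could carry voice data toward the recognition devices and results back to the first server.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class StructData:
    # Hypothetical fields; the disclosure does not define the actual layout.
    session_id: str
    payload: Optional[bytes] = None  # voice data, first server -> device
    text: Optional[str] = None       # recognition result, device -> first server

def to_wire(sd: StructData) -> bytes:
    d = asdict(sd)
    if d["payload"] is not None:
        d["payload"] = d["payload"].hex()  # bytes are not JSON-serializable
    return json.dumps(d).encode("utf-8")
```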
According to an example of the present disclosure, in a case where the receiving unit 910 receives the request information from the second application through an encrypted transmission protocol, the transmitting unit 920 may transmit the response information to the second application through the encrypted transmission protocol.
For example, in a case where the receiving unit 910 receives the request information from the second application through the HTTPS protocol, the transmitting unit 920 may transmit the response information to the second application through the HTTPS protocol. For another example, in a case where the receiving unit 910 receives the request information from the second application through the WSS protocol, the transmitting unit 920 may transmit the response information to the second application through the WSS protocol.
Further, according to an example of the present disclosure, the response information transmitted by the transmitting unit 920 may further include indication information indicating the speech recognition result with the highest accuracy among the at least two results. For example, the first server may further include a processing unit (not shown) configured to determine the most accurate result from the at least two speech recognition results and to indicate it to the second application through the indication information.
In this example, the processing unit may determine the accuracy of each speech recognition result by any conventional method, for example by performing semantic analysis on each result and scoring how well its semantics conform to logical expectations.
Through the embodiments of the present disclosure, the second application program can send the voice data to be recognized to the first server, receive from the first server at least two voice recognition results obtained by at least two voice recognition devices respectively recognizing that data, and display those results. The second application program thus obtains multiple voice recognition results from the first server rather than a single result, better meeting the fault tolerance required by the service.
Hereinafter, a voice recognition device corresponding to the method illustrated in fig. 5 according to an embodiment of the present disclosure is described with reference to fig. 10. Fig. 10 shows a schematic structural diagram of a speech recognition device 1000 for performing the method shown in fig. 5. Since the functions of the speech recognition device 1000 match the details of the method described above with reference to fig. 5, a detailed description of the overlapping content is omitted here for simplicity. As shown in fig. 10, the speech recognition device 1000 includes: a receiving unit 1010 configured to receive voice data from a first server; a recognition unit 1020 configured to recognize the received voice data to obtain one voice recognition result corresponding to it; and a transmitting unit 1030 configured to transmit the obtained voice recognition result to the first server. The speech recognition device 1000 may include components other than these three units; however, since such components are unrelated to the content of the embodiments of the present disclosure, their illustration and description are omitted here.
In the embodiment of the present disclosure, the voice data received by the receiving unit 1010 may be voice data preprocessed by the second application.
According to one example of the present disclosure, the first server may convert voice data into data of a predetermined format and transmit the data of the predetermined format to each of at least two voice recognition devices, respectively. Accordingly, the receiving unit 1010 may receive the data of the predetermined format from the first server. The data of the predetermined format described herein may be structured data.
Further, according to an example of the present disclosure, the recognition unit 1020 may employ conventional automatic speech recognition (ASR) technology to recognize the received voice data.
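As one off-the-shelf possibility (an illustrative choice, not the engine the disclosure prescribes), the recognition unit could wrap the third-party SpeechRecognition package:

```python
import speech_recognition as sr  # third-party 'SpeechRecognition' package

def recognize_block(wav_path: str) -> str:
    # Feed a stored audio block to an off-the-shelf recognizer; any
    # conventional ASR backend could replace recognize_google here.
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    return recognizer.recognize_google(audio)
```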
Further, according to an example of the present disclosure, the transmitting unit 1030 may convert the voice recognition result into data of a predetermined format and return it to the first server; the format may be the structured data described above.
Because each of the plurality of speech recognition devices feeds back a speech recognition result to the first server, with this embodiment the second application program can receive from the first server at least two speech recognition results, obtained by at least two speech recognition devices respectively recognizing the same speech data, and display those results. The second application program thus obtains multiple speech recognition results from the first server rather than a single result, better meeting the fault tolerance required by the service.
Furthermore, devices according to embodiments of the present disclosure (e.g., a terminal, a first server, a voice recognition device, etc.) may also be implemented by means of the computing device architecture shown in fig. 11. As shown in fig. 11, the computing device 1100 may include a bus 1110, one or more CPUs 1120, a read-only memory (ROM) 1130, a random access memory (RAM) 1140, communication ports 1150 for connecting to a network, input/output components 1160, a hard disk 1170, and the like. Storage devices in the computing device 1100, such as the ROM 1130 or the hard disk 1170, may store data or files used in processing and/or communication, as well as program instructions executed by the CPU. The computing device 1100 may also include a user interface 1180. Of course, the architecture shown in fig. 11 is merely exemplary, and one or more components may be omitted as needed when implementing different devices.
Embodiments of the present disclosure may also be implemented as a computer-readable storage medium storing computer-readable instructions. When executed by a processor, the instructions may cause the processor to perform a method according to the embodiments of the present disclosure described with reference to the above figures. The computer-readable storage medium includes, but is not limited to, volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory; the non-volatile memory may include, for example, read-only memory (ROM), a hard disk, or flash memory.
Those skilled in the art will appreciate that the present disclosure is susceptible to numerous variations and modifications. For example, the various devices or components described above may be implemented in hardware, software, firmware, or a combination of some or all of the three.
Furthermore, as used in this disclosure and in the claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The use of "first," "second," and similar terms does not indicate any order, quantity, or importance, but serves only to distinguish one element from another. Likewise, words such as "comprising" or "including" mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Terms such as "connected" or "coupled" are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
Furthermore, flow charts are used in this disclosure to illustrate operations performed by systems according to embodiments of the present disclosure. It should be understood that these operations are not necessarily performed in the exact order shown; various steps may be processed in reverse order or simultaneously, other operations may be added to the flows, and one or more steps may be removed from them.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
While the present disclosure has been described in detail above, it will be apparent to those skilled in the art that the present disclosure is not limited to the embodiments described in the present specification. The present disclosure can be implemented as modifications and variations without departing from the spirit and scope of the present disclosure defined by the claims. Accordingly, the description of the present specification is for the purpose of illustration and is not intended to be in any way limiting of the present disclosure.

Claims (15)

1. A method for speech recognition, comprising:
acquiring voice data to be recognized;
sending request information to a first server, wherein the request information comprises the voice data to be recognized;
receiving response information from the first server, wherein the response information comprises at least two voice recognition results corresponding to the voice data to be recognized, and the at least two voice recognition results are obtained by respectively recognizing the voice data to be recognized by at least two voice recognition devices; and
displaying the at least two voice recognition results.
2. The method of claim 1, wherein the obtaining speech data to be recognized comprises:
and acquiring the voice data to be recognized by a second application program running in the first application program.
3. The method of claim 1,
further comprising:
dividing the voice data to be recognized into at least two voice data blocks;
wherein the request information includes the at least two voice data blocks.
4. The method according to any one of claims 1 to 3,
wherein said sending request information to the first server comprises:
sending the request information to the first server through an encrypted transport protocol;
wherein receiving response information from the first server comprises:
receiving the response information from the first server over the encrypted transport protocol.
5. The method of claim 4, wherein the encrypted transport protocol is a Secure Socket Layer (SSL) based transport protocol.
6. The method of any of claims 1 to 3, further comprising:
displaying indication information, wherein the indication information indicates a speech recognition result with the highest accuracy in the at least two speech recognition results.
7. The method of claim 6, wherein the response information further includes the indication information.
8. The method of claim 6, further comprising:
determining the accuracy of each speech recognition result; and
generating the indication information according to the accuracy of each voice recognition result.
9. A method for speech recognition, comprising:
receiving request information from an application program, wherein the request information comprises voice data to be recognized;
respectively sending the voice data to be recognized to each voice recognition device in at least two voice recognition devices;
receiving one voice recognition result corresponding to the voice data to be recognized from each voice recognition device; and
sending response information to the application program, wherein the response information comprises at least two voice recognition results corresponding to the voice data to be recognized.
10. The method of claim 9, wherein the application runs in another application.
11. The method according to claim 9 or 10, wherein said sending the speech data to be recognized to each of at least two speech recognition devices, respectively, comprises:
converting the voice data to be recognized into data in a preset format; and
transmitting the data in the preset format to each of the at least two voice recognition devices, respectively.
12. A terminal for speech recognition, comprising:
an acquisition unit configured to acquire voice data to be recognized;
a sending unit configured to send request information to a first server, wherein the request information includes the voice data to be recognized;
a receiving unit configured to receive response information from the first server, wherein the response information includes at least two voice recognition results corresponding to the voice data to be recognized, and the at least two voice recognition results are obtained by at least two voice recognition devices respectively recognizing the voice data to be recognized; and
a display unit configured to display the at least two voice recognition results.
13. A server for speech recognition, comprising:
a receiving unit configured to receive request information from an application program, wherein the request information includes voice data to be recognized;
a transmitting unit configured to transmit the voice data to be recognized to each of at least two voice recognition apparatuses, respectively;
the receiving unit is further configured to receive one voice recognition result corresponding to the voice data to be recognized from each voice recognition apparatus; and
the transmitting unit is further configured to transmit response information to the application program, wherein the response information includes at least two voice recognition results corresponding to the voice data to be recognized.
14. A terminal for speech recognition, comprising:
a processor; and
memory, wherein the memory has stored therein a computer-executable program that, when executed by the processor, performs the method of any of claims 1-8.
15. A computer-readable storage medium having stored thereon instructions that, when executed by a processor, cause the processor to perform the method of any one of claims 1-8.
CN201911351762.5A 2019-12-24 2019-12-24 Method for speech recognition, terminal, server and computer-readable storage medium Pending CN111081247A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911351762.5A CN111081247A (en) 2019-12-24 2019-12-24 Method for speech recognition, terminal, server and computer-readable storage medium


Publications (1)

Publication Number Publication Date
CN111081247A true CN111081247A (en) 2020-04-28

Family

ID=70317436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911351762.5A Pending CN111081247A (en) 2019-12-24 2019-12-24 Method for speech recognition, terminal, server and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN111081247A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150162003A1 (en) * 2013-12-10 2015-06-11 Alibaba Group Holding Limited Method and system for speech recognition processing
CN105679319A (en) * 2015-12-29 2016-06-15 百度在线网络技术(北京)有限公司 Speech recognition processing method and device
US20170316781A1 (en) * 2015-02-13 2017-11-02 Tencent Technology (Shenzhen) Company Limited Remote electronic service requesting and processing method, server, and terminal
CN107769875A (en) * 2017-10-18 2018-03-06 北京华力创通科技股份有限公司 The voice broadcast method of satellite communication, device and system
CN109147818A (en) * 2018-10-30 2019-01-04 Oppo广东移动通信有限公司 Acoustic feature extracting method, device, storage medium and terminal device
CN109739971A (en) * 2019-01-03 2019-05-10 浙江百应科技有限公司 A method of full duplex Intelligent voice dialog is realized based on wechat small routine
CN109874034A (en) * 2019-01-14 2019-06-11 深圳市金锐显数码科技有限公司 TV speech remote control method, device and terminal device
CN110148416A (en) * 2019-04-23 2019-08-20 腾讯科技(深圳)有限公司 Audio recognition method, device, equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40022187)
SE01 Entry into force of request for substantive examination