CN112992141B - Communication method and device in voice recognition scene - Google Patents


Info

Publication number
CN112992141B
CN112992141B (application CN202110201275.1A)
Authority
CN
China
Prior art keywords
voice
terminal
recognition result
voice recognition
communication connection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110201275.1A
Other languages
Chinese (zh)
Other versions
CN112992141A (en)
Inventor
李明阳 (Li Mingyang)
毛鑫 (Mao Xin)
陈睿欣 (Chen Ruixin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110201275.1A priority Critical patent/CN112992141B/en
Publication of CN112992141A publication Critical patent/CN112992141A/en
Application granted
Publication of CN112992141B publication Critical patent/CN112992141B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/14 Session management
    • H04L 67/141 Setup of application sessions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/14 Session management
    • H04L 67/143 Termination or inactivation of sessions, e.g. event-controlled end of session
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a communication method and apparatus in a speech recognition scenario, relating to the technical fields of communications and artificial intelligence, including speech recognition. A specific embodiment comprises the following steps: in response to establishing a communication connection with a first terminal, receiving a subscription from a second terminal to the speech recognition results of the server, where the number of second terminals is at least one and the subscribed result is the recognition result of the speech transmitted by the first terminal over the communication connection; in response to receiving the speech uploaded by the first terminal, determining its speech recognition result and sending that result to the second terminal; and in response to obtaining a speech recognition end instruction, disconnecting the communication connection. A terminal can thus subscribe, over the communication connection, to the recognition results of the speech received by the server, so that the server can promptly and actively push those results to every terminal that needs them.

Description

Communication method and device in voice recognition scene
Technical Field
The present application relates to the field of computer technology, in particular to communications and artificial intelligence technologies including speech recognition, and more particularly to a communication method and apparatus in a speech recognition scenario.
Background
Speech recognition takes speech as its object of study: through speech signal processing and pattern recognition, a machine automatically recognizes and understands human speech.
In the prior art, speech recognition technology enables a machine to convert speech signals into corresponding text or commands through a process of recognition and understanding. The main method is pattern matching. In the training phase, the user speaks each word in the vocabulary once in turn, and its feature vector is stored as a template in a template library. In the recognition phase, the feature vector of the input speech is compared for similarity with each template in the library in turn, and the template with the highest similarity is output as the recognition result.
Disclosure of Invention
Provided are a communication method, a communication apparatus, an electronic device and a storage medium for a speech recognition scenario.
According to a first aspect, there is provided a communication method in a speech recognition scenario, for a server, comprising: in response to establishing a communication connection with a first terminal, receiving a subscription from a second terminal to the speech recognition results of the server, where the number of second terminals is at least one and the subscribed result is the recognition result of the speech transmitted by the first terminal over the communication connection; in response to receiving the speech uploaded by the first terminal, determining its speech recognition result and sending that result to the second terminal; and in response to obtaining a speech recognition end instruction, disconnecting the communication connection.
According to a second aspect, there is provided a communication method in a speech recognition scenario for a second terminal, the method comprising: in response to the server establishing a communication connection with a first terminal, subscribing to the server for its speech recognition results, where the number of second terminals is at least one and the subscribed results are for the speech uploaded by the first terminal over the communication connection; and receiving the speech recognition result determined by the server for the speech uploaded by the first terminal, where the server disconnects the communication connection in response to obtaining a speech recognition end instruction.
According to a third aspect, there is provided a communication apparatus in a speech recognition scenario, for a server, comprising: a receiving unit configured to receive, in response to establishment of a communication connection with a first terminal, a subscription from a second terminal to the speech recognition results of the server, where the number of second terminals is at least one and the subscribed result is the recognition result of the speech transmitted by the first terminal over the communication connection; a determining unit configured to determine, in response to receiving the speech uploaded by the first terminal, its speech recognition result and send that result to the second terminal; and a disconnection unit configured to disconnect the communication connection in response to obtaining a speech recognition end instruction.
According to a fourth aspect, there is provided a communication apparatus in a speech recognition scenario for a second terminal, the apparatus comprising: a subscription unit configured to subscribe to the server for its speech recognition results in response to the server establishing a communication connection with a first terminal, where the number of second terminals is at least one and the subscribed results are for the speech uploaded by the first terminal over the communication connection; and a result receiving unit configured to receive the speech recognition result determined by the server for the speech uploaded by the first terminal, where the server disconnects the communication connection in response to obtaining a speech recognition end instruction.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor, the memory storing instructions executable by the at least one processor to enable the at least one processor to perform the method of any embodiment of the communication method in a speech recognition scenario.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method according to any one of the embodiments of the communication method in a speech recognition scenario.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of the embodiments of the communication method in a speech recognition scenario.
According to the scheme of the application, a terminal can subscribe, over the communication connection, to the recognition results of the speech received by the server, so that the server can promptly and actively push those results to every terminal that needs them. This avoids the prior-art problems in which the terminal that uploads the speech must periodically poll the server for the recognition result, which both harms timeliness and leaves other terminals that need the result unable to obtain it.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading the following detailed description of non-limiting embodiments, made with reference to the accompanying drawings, in which:
FIG. 1 is an exemplary system architecture diagram in which some embodiments of the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a communication method in a speech recognition scenario according to the present application;
FIG. 3 is a schematic illustration of one application scenario of a communication method in a speech recognition scenario according to the present application;
FIG. 4 is a flow chart of yet another embodiment of a communication method in a speech recognition scenario according to the present application;
FIG. 5 is a schematic diagram of one embodiment of a communication device in a speech recognition scenario according to the present application;
fig. 6 is a block diagram of an electronic device for implementing a communication method in a speech recognition scenario of an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 in which embodiments of a communication method in a speech recognition scenario or a communication device in a speech recognition scenario of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as video-type applications, live applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices with display screens, including but not limited to smartphones, tablets, e-book readers, and laptop and desktop computers. When they are software, they may be installed in the electronic devices listed above and implemented either as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server providing various services, such as a background server providing support for the terminal devices 101, 102, 103. The background server may analyze and process the received data such as voice, and feed back the processing result (e.g., the voice recognition result) to the terminal device.
It should be noted that, the communication method in the voice recognition scenario provided in the embodiment of the present application may be executed by the server 105 or the terminal devices 101, 102, 103, and accordingly, the communication apparatus in the voice recognition scenario may be disposed in the server 105 or the terminal devices 101, 102, 103.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a communication method in a speech recognition scenario according to the present application is shown. The communication method in the speech recognition scenario is applied to a server and includes the following steps:
in step 201, in response to establishing a communication connection with the first terminal, a subscription of the second terminal to the voice recognition result of the server is received, where the number of the second terminals is at least one, and the subscribed voice recognition result is a voice recognition result of the voice transmitted by the first terminal through the communication connection.
In this embodiment, the execution body on which the communication method in the speech recognition scenario runs (for example, the server shown in fig. 1) may receive a subscription from the second terminal to the server's speech recognition results once the first terminal has established a communication connection (for example, a WebSocket connection) with the server; specifically, the execution body may receive the subscription request and confirm it, i.e. accept the subscription. Once a terminal has subscribed to the server's speech recognition results, it is entitled to receive the results the server actively pushes. The subscribed result is the recognition result of the speech uploaded by the first terminal over the communication connection. There may be one or more second terminals, i.e. multiple terminals may all subscribe to the server's speech recognition results.
In practice, the execution body may generate a unique session identifier (session id) for the communication connection, which can be used to track the speech recognition process carried out over that connection. Accordingly, the speech recognition process here is the recognition process for a session comprising at least one piece of speech.
Specifically, the first terminal is the caller of the server's speech recognition service, and the second terminal is a subscriber. The second terminals may include the first terminal (for example, the second terminal is the first terminal itself, or the second terminals comprise the first terminal together with other terminals), or may include only terminals other than the first terminal.
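The subscription relationship above can be sketched as a minimal publish/subscribe registry. The `RecognitionBroker` name, the `deliver` callback, and the return string are illustrative assumptions; the patent does not prescribe an implementation.

```python
class RecognitionBroker:
    """Minimal publish/subscribe sketch: second terminals subscribe to the
    server's recognition results, and the server pushes each new result to
    all current subscribers.  The caller (first terminal) may itself be
    among the subscribers."""

    def __init__(self):
        self.subscribers = {}

    def subscribe(self, terminal_id, deliver):
        # `deliver` is the callback used to push results to this terminal.
        self.subscribers[terminal_id] = deliver
        return "subscription success"

    def push(self, result):
        # Actively push a recognition result to every subscribed terminal.
        for deliver in self.subscribers.values():
            deliver(result)
```

Because delivery is push-based, subscribed terminals receive each result as soon as the server produces it, rather than polling for it.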
Step 202, in response to receiving the voice uploaded by the first terminal, determining a voice recognition result of the voice, and transmitting the voice recognition result to the second terminal.
In this embodiment, upon receiving the speech uploaded by the first terminal, the execution body may determine the recognition result of that speech and then send it to each subscribed second terminal.
In practice, the execution body may perform speech recognition locally to determine the result, or may send the speech to another electronic device and use the result fed back by that device.
In step 203, in response to obtaining the speech recognition end instruction, the communication connection is disconnected.
In this embodiment, the execution body may disconnect the communication connection in response to obtaining a speech recognition end instruction. In practice, the execution body may receive a speech recognition end instruction sent by another electronic device (for example, uploaded by the first terminal), or may itself generate the instruction when a preset condition is triggered.
With the method provided by this embodiment, a terminal can subscribe, over the communication connection, to the recognition results of the speech received by the server, so that the server can promptly and actively push those results to every terminal that needs them. This avoids the prior-art problems in which the terminal that uploads the speech must periodically poll the server for the recognition result, which both harms timeliness and leaves other terminals that need the result unable to obtain it.
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the communication method in the speech recognition scenario according to the present embodiment. In the application scenario of fig. 3, in step 301, the first terminal (i.e. the caller) receives a user operation triggering recognition. In step 302, the first terminal sends a communication connection request to the server. In step 303, the server establishes a communication connection between the first terminal and the server. In step 304, the server returns to the first terminal a message indicating that the communication connection was established successfully. In step 305, the first terminal sends the voice to the server in segments, that is, in the form of voice segments. In step 306, the server writes the voice sent by the first terminal into an asynchronous queue. In step 307, the server returns to the first terminal a message indicating that the voice was received successfully. In step 308, the second terminal (i.e. the subscriber) subscribes to the server for the speech recognition results. In step 309, optionally, the server accepts the subscription and returns a subscription success message to the second terminal. In step 310, the server starts a new thread and uses it to perform real-time speech recognition on the voice in the asynchronous queue. In step 311, the server actively pushes the speech recognition result to the second terminal. In step 312, the second terminal displays the result to the user in real time. In step 313, the first terminal generates a speech recognition end instruction after detecting that the user performs an end-recognition operation. In step 314, the first terminal sends the speech recognition end instruction to the server.
In step 315, the server ends the speech recognition and disconnects the communication connection. In step 316, the server returns to the first terminal a message indicating the result of the speech recognition process.
With further reference to fig. 4, a flow 400 of yet another embodiment of a communication method in a speech recognition scenario is shown. The process 400, for a server, includes the following steps:
in step 401, in response to establishing a communication connection with the first terminal, a subscription of the second terminal to the voice recognition result of the server is received, where the number of the second terminals is at least one, and the subscribed voice recognition result is a voice recognition result of the voice transmitted by the first terminal through the communication connection.
In this embodiment, an execution body (for example, a server shown in fig. 1) on which the communication method in the speech recognition scenario runs may receive a subscription of the second terminal to the speech recognition result of the server under the condition that the first terminal establishes a communication connection with the server.
Step 402: in response to receiving a voice segment collected and sent by the first terminal within a first time period, determining a speech recognition result of the voice segment, where the voice to which the segment belongs includes at least one voice segment.
In this embodiment, the execution body may determine the recognition result of a voice segment upon receiving it from the first terminal. The first terminal may upload the voice in segments, and its transmission may be in real time.
Specifically, a voice segment is the voice collected by the first terminal during the first time period. For example, every 100 ms, the first terminal transmits the immediately preceding 100 ms of voice data to the server in real time.
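The 100 ms segmented upload can be sketched as follows. The audio format is an assumption for illustration (16 kHz, 16-bit mono PCM, so 100 ms is 3200 bytes); the patent does not fix a format.

```python
def iter_chunks(pcm: bytes, sample_rate: int = 16000,
                sample_width: int = 2, chunk_ms: int = 100):
    """Yield successive chunk_ms-sized slices of a mono PCM buffer,
    mimicking a first terminal that uploads voice in 100 ms segments."""
    step = sample_rate * sample_width * chunk_ms // 1000
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]
```

A terminal would send each yielded slice to the server as soon as it is captured, instead of waiting for the whole utterance.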
In step 403, in response to obtaining the speech recognition end instruction, the communication connection is disconnected.
In this embodiment, the execution body may disconnect the communication connection in response to obtaining a speech recognition end instruction. In practice, the execution body may receive a speech recognition end instruction sent by another electronic device (for example, uploaded by the first terminal), or may itself generate the instruction when a preset condition is triggered.
The first terminal in this embodiment may upload the voice in segments. Compared with uploading the whole voice at once, this improves the real-time performance of voice uploading and recognition, thereby reducing the time consumed by speech recognition and shortening the recognition process.
In some optional implementations of this embodiment, before sending the speech recognition result to the second terminal, the method may further include: in response to obtaining the segment recognition results of all voice segments in the voice, performing error correction on the total recognition result formed from those segment results; sending the speech recognition result to the second terminal then comprises sending the error-corrected total result.
In these implementations, once a segment recognition result has been obtained for every voice segment in the voice, the execution body may perform error correction on the total recognition result composed of all the segment results, thereby obtaining the corrected result. In practice, the segment results may be spliced in order of upload time to form the total result. Error correction may be performed in various ways, for example by checking whether the total result corresponds to the whole voice, or whether it reads fluently.
These implementations send the error-corrected total recognition result to the second terminal as a whole, which helps ensure the accuracy of the transmitted result.
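The splice-then-correct step can be sketched as below. The trivial whitespace cleanup stands in for a real error-correction model, and all names are illustrative assumptions.

```python
def assemble_total_result(segment_results):
    """segment_results: (upload_time, text) pairs, one per voice segment.
    Splice the segment recognition results in order of upload time to
    form the total recognition result."""
    ordered = sorted(segment_results, key=lambda pair: pair[0])
    return "".join(text for _, text in ordered)

def correct(total, corrector=None):
    """Apply an error-correction pass to the total result.  The default
    whitespace cleanup is only a placeholder for a real corrector."""
    if corrector is not None:
        return corrector(total)
    return " ".join(total.split())
```

The corrected string, rather than the raw per-segment results, is what would be pushed to the second terminals in this optional implementation.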
In some optional implementations of any of the above embodiments, establishing a communication connection with the first terminal may include: establishing the connection upon receiving a communication connection request sent by the first terminal in response to detecting a user operation that triggers recognition. Correspondingly, disconnecting the communication connection in response to obtaining the speech recognition end instruction includes: disconnecting upon receiving a speech recognition end instruction sent by the first terminal in response to detecting that the user ends the recognition operation.
In these implementations, the execution body may establish a communication connection with the first terminal upon receiving its communication connection request, which the first terminal sends in response to detecting a user operation triggering recognition; in practice, this is an operation the user performs to make the first terminal start receiving voice and then request a connection from the server. Likewise, the execution body may disconnect upon receiving a speech recognition end instruction from the first terminal, triggered by an end-recognition operation the user performs when wishing to end the recognition process.
These implementations let the user trigger both the start and the end of the speech recognition process, increasing the user's control over the whole process.
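A minimal sketch of this user-driven connection lifecycle follows; the class name, method names, and return strings are illustrative assumptions, not the patent's protocol.

```python
class ConnectionLifecycle:
    """Tracks whether a recognition session is open, driven by the two
    user operations described above (trigger recognition / end recognition)."""

    def __init__(self):
        self.connected = False

    def on_trigger_recognition(self):
        # Corresponds to the terminal's communication connection request.
        self.connected = True
        return "communication connection established"

    def on_end_recognition(self):
        # Corresponds to the speech recognition end instruction.
        self.connected = False
        return "communication connection disconnected"
```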
In some optional implementations of any of the above embodiments, after receiving the subscription of the second terminal to the speech recognition results of the server and before disconnecting the communication connection, the method further includes: receiving and confirming subscriptions of other terminals to the speech recognition results of the server, and treating those terminals together with the existing second terminals as the new set of second terminals.
In these implementations, the execution body may accept subscriptions from other terminals at any time before the communication connection is disconnected and include them among the second terminals. In this way, once a speech recognition result to be sent is obtained, the execution body sends it to every second terminal, including the newly added ones.
Thus, other terminals can continue to subscribe to the server's speech recognition results at any time before the communication connection is disconnected, satisfying their need for those results.
In some optional implementations of any of the embodiments above, the voice uploaded by the first terminal is stored in an asynchronous queue; determining a speech recognition result of the speech and transmitting the speech recognition result to the second terminal, comprising: starting an asynchronous thread, acquiring uploaded voice from an asynchronous queue, performing real-time voice recognition on the voice to obtain a voice recognition result, and writing the voice recognition result into a shared variable; and sending the voice recognition result in the shared variable to the second terminal.
In these alternative implementations, the executing entity may store the voice (e.g., the voice segment) uploaded by the first terminal in an asynchronous queue. And then, the execution main body can start an asynchronous thread, acquire the stored voice from the asynchronous queue by utilizing the asynchronous thread and perform real-time voice recognition on the voice, so as to obtain a voice recognition result and store the voice recognition result into a shared variable. Then, the executing body may send the speech recognition result in the shared variable to the second terminal by using another thread.
By using an asynchronous thread and an asynchronous queue, these implementations decouple the steps from one another, avoiding strict ordering constraints between them. In addition, the shared variable enables fast storage of and fast access to the voice recognition result, improving the efficiency with which the result is communicated.
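The queue-plus-shared-variable pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's actual implementation: `recognize` is a stand-in for a real-time recognition engine, and a `None` sentinel stands in for the voice recognition end instruction.

```python
import queue
import threading

audio_queue = queue.Queue()      # asynchronous queue of uploaded voice segments
shared_result = {"text": ""}     # shared variable holding the recognition result
result_lock = threading.Lock()

def recognize(segment):
    # Placeholder for real-time voice recognition; assumed for illustration.
    return segment.upper()

def recognition_worker():
    # Asynchronous thread: drain the queue, recognize each segment,
    # and write the accumulated result into the shared variable.
    while True:
        segment = audio_queue.get()
        if segment is None:      # sentinel standing in for the end instruction
            break
        text = recognize(segment)
        with result_lock:
            shared_result["text"] += text

def sender(collect):
    # A separate thread would read the shared variable and push the
    # result to each subscribed second terminal.
    with result_lock:
        collect(shared_result["text"])

worker = threading.Thread(target=recognition_worker)
worker.start()
for seg in ["hello ", "world"]:
    audio_queue.put(seg)         # the first terminal uploads segments over time
audio_queue.put(None)
worker.join()

pushed = []
sender(pushed.append)
```

Because the uploading side only enqueues and the recognition side only dequeues, neither step blocks the other, which is the decoupling the implementation aims for.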
The application also provides a communication method in a voice recognition scenario for the second terminal, the method comprising: in response to the server establishing a communication connection with the first terminal, subscribing to the server for its voice recognition results, where the number of second terminals is at least one and the subscribed results are for the voice uploaded by the first terminal through the communication connection; and receiving the voice recognition result determined by the server for the voice uploaded by the first terminal, where the server disconnects the communication connection in response to obtaining a voice recognition end instruction.
With the method provided by this embodiment, a terminal can subscribe to the recognition results of the voice received by the server over the communication connection, so that the server can actively and promptly push those results to every terminal that needs them. This avoids the problems of the prior art, in which the terminal that uploads the voice must periodically poll the server for results, leading to poor timeliness and leaving other terminals that need the results unable to obtain them.
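The second terminal's side of this exchange can be sketched as a small message handler. The sketch assumes a WebSocket-style push channel; the message shapes, the `"speech_results"` topic name, and the class are all hypothetical and only illustrate the subscribe-then-receive flow, not an actual protocol from the application.

```python
import json

class SecondTerminal:
    """Hypothetical sketch of a second terminal that subscribes to the
    server's voice recognition results and receives pushed messages."""

    def __init__(self):
        self.results = []
        self.connected = True

    def on_connection_established(self, subscribe):
        # Subscribe as soon as the server establishes its communication
        # connection with the first terminal.
        subscribe({"action": "subscribe", "topic": "speech_results"})

    def on_message(self, raw):
        msg = json.loads(raw)
        if msg.get("type") == "result":
            # A voice recognition result pushed by the server.
            self.results.append(msg["text"])
        elif msg.get("type") == "close":
            # The server disconnected after the recognition end instruction.
            self.connected = False

term = SecondTerminal()
sent = []
term.on_connection_established(sent.append)
term.on_message(json.dumps({"type": "result", "text": "hello world"}))
term.on_message(json.dumps({"type": "close"}))
```

The terminal never polls: results arrive as pushed messages, and the session ends when the server-side disconnect is observed.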
With further reference to fig. 5, as an implementation of the methods shown in the foregoing figures, the present application provides an embodiment of a communication device in a voice recognition scenario. This device embodiment corresponds to the method embodiment shown in fig. 2 and, apart from the features described below, may include the same or corresponding features and effects. The device can be applied to various electronic devices.
As shown in fig. 5, the communication device 500 in a voice recognition scenario of this embodiment is used for a server and includes: a receiving unit 501, a determining unit 502, and a disconnecting unit 503. The receiving unit 501 is configured to receive, in response to establishing a communication connection with the first terminal, the second terminal's subscription to the voice recognition results of the server, where the number of second terminals is at least one and the subscribed results are recognition results of the voice transmitted by the first terminal through the communication connection. The determining unit 502 is configured to determine, in response to receiving the voice uploaded by the first terminal, a voice recognition result of the voice and send it to the second terminal. The disconnecting unit 503 is configured to disconnect the communication connection in response to obtaining a voice recognition end instruction.
In this embodiment, the specific processing of the receiving unit 501, the determining unit 502 and the disconnecting unit 503 of the communication device 500 in the speech recognition scenario and the technical effects thereof may refer to the relevant descriptions of step 201, step 202 and step 203 in the corresponding embodiment of fig. 2, and are not repeated here.
In some optional implementations of this embodiment, the determining unit is further configured to determine the voice recognition result of the voice, in response to receiving the voice uploaded by the first terminal, as follows: in response to receiving a voice segment collected and sent by the first terminal in a first time period, determining a recognition result of the voice segment, where the voice to which the segment belongs comprises at least one voice segment.
In some optional implementations of this embodiment, the apparatus further includes: an error correction unit configured to perform error correction processing on a total recognition result composed of the segment recognition results of each voice segment in the voice where the voice segment is located in response to obtaining the segment recognition result of each voice segment before transmitting the voice recognition result to the second terminal; and a determining unit further configured to perform transmitting the speech recognition result to the second terminal as follows: and sending the result of error correction processing on the total identification result to the second terminal.
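The error-correction step can be illustrated with a toy example. The dictionary-based `correct` function below is purely hypothetical; a real system would apply a language model or other correction pass over the total result concatenated from the per-segment recognition results.

```python
def correct(total_text):
    # Hypothetical error-correction pass over the concatenated total
    # recognition result; a real implementation would be model-based.
    fixes = {"speach": "speech", "recogniton": "recognition"}
    words = total_text.split()
    return " ".join(fixes.get(w, w) for w in words)

# Per-segment recognition results for the segments of one voice.
segment_results = ["speach", "recogniton", "result"]

# Total recognition result composed from the segment results.
total = " ".join(segment_results)

# The error-corrected total result is what gets sent to the second terminal.
corrected = correct(total)
```

Correcting the concatenated total, rather than each segment in isolation, lets the correction step use cross-segment context.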
In some optional implementations of this embodiment, the receiving unit is further configured to respond to establishing a communication connection with the first terminal as follows: if a communication connection request, sent by the first terminal in response to detecting that the user triggered a recognition operation, is received, establishing the communication connection with the first terminal. The disconnecting unit is further configured to disconnect the communication connection, in response to obtaining a voice recognition end instruction, as follows: if a voice recognition end instruction, sent by the first terminal in response to detecting that the user ended the recognition operation, is received, disconnecting the communication connection.
In some optional implementations of this embodiment, the apparatus further includes: a newly added unit configured to, after the second terminal's subscription to the voice recognition results of the server is received and before the communication connection is disconnected, receive subscriptions from other terminals to the voice recognition results of the server and treat those other terminals, together with the second terminal, as the new second terminals.
In some optional implementations of this embodiment, the voice uploaded by the first terminal is stored in an asynchronous queue, and the determining unit is further configured to determine the voice recognition result of the voice and send it to the second terminal as follows: starting an asynchronous thread that takes the uploaded voice from the asynchronous queue and performs real-time voice recognition on it to obtain a voice recognition result, writing the result into a shared variable; and sending the voice recognition result in the shared variable to the second terminal.
The application also provides a communication device in a voice recognition scenario for the second terminal, the device comprising: a subscription unit configured to subscribe to the server for its voice recognition results in response to the server establishing a communication connection with the first terminal, where the number of second terminals is at least one and the subscribed results are for the voice uploaded by the first terminal through the communication connection; and a result receiving unit configured to receive the voice recognition result determined by the server for the voice uploaded by the first terminal, where the server disconnects the communication connection in response to obtaining a voice recognition end instruction.
According to embodiments of the present application, there is also provided an electronic device, a readable storage medium and a computer program product.
As shown in fig. 6, there is a block diagram of an electronic device for the communication method in a voice recognition scenario according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are exemplary only and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 6, the electronic device includes: one or more processors 601, a memory 602, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other ways as needed. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, with each device providing some of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 601 is taken as an example in fig. 6.
The memory 602 is a non-transitory computer-readable storage medium provided herein. The memory stores instructions executable by at least one processor, so that the at least one processor performs the communication method in a voice recognition scenario provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the communication method in a voice recognition scenario provided herein.

As a non-transitory computer-readable storage medium, the memory 602 may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the communication method in a voice recognition scenario in the embodiments of the present application (e.g., the receiving unit 501, the determining unit 502, and the disconnecting unit 503 shown in fig. 5). By running the non-transitory software programs, instructions, and modules stored in the memory 602, the processor 601 executes the various functional applications and data processing of the server, that is, implements the communication method in a voice recognition scenario of the above method embodiments.

The memory 602 may include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function, and the data storage area may store data created through the use of the electronic device for communication in a voice recognition scenario, and the like. In addition, the memory 602 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 602 may optionally include memory located remotely from the processor 601, and such remote memory may be connected over a network to the electronic device for communication in a voice recognition scenario. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the communication method in a voice recognition scenario may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603, and the output device 604 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 6.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for communication in a voice recognition scenario; it may be, for example, a touch screen, keypad, mouse, trackpad, touchpad, pointing stick, one or more mouse buttons, trackball, or joystick. The output device 604 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or a middleware component (e.g., an application server), or a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, a host product in a cloud computing service system that overcomes the defects of traditional physical hosts and virtual private server (VPS) services, namely high management difficulty and weak business scalability. The server may also be a server of a distributed system or a server combined with a blockchain.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor, for example described as: a processor including a receiving unit, a determining unit, and a disconnecting unit. The names of these units do not, in some cases, limit the units themselves; for example, the disconnecting unit may also be described as "a unit that disconnects the communication connection in response to obtaining a voice recognition end instruction".
As another aspect, the present application also provides a computer-readable medium, which may be included in the apparatus described in the above embodiments or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: receive, in response to establishing a communication connection with the first terminal, the second terminal's subscription to the voice recognition results of the server, where the number of second terminals is at least one and the subscribed results are recognition results of the voice transmitted by the first terminal through the communication connection; determine, in response to receiving the voice uploaded by the first terminal, a voice recognition result of the voice and send it to the second terminal; and disconnect the communication connection in response to obtaining a voice recognition end instruction.
The foregoing description covers only the preferred embodiments of the present application and explains the technical principles employed. Those skilled in the art will appreciate that the scope of the invention involved in this application is not limited to the specific combinations of the features described above, and is intended to cover other embodiments formed by any combination of those features or their equivalents without departing from the inventive concept, for example, embodiments formed by replacing the features described above with technical features of similar functions disclosed in this application (but not limited thereto).

Claims (14)

1. A communication method in a speech recognition scenario, for a server, the method comprising:
in response to establishing a communication connection with a first terminal, receiving a subscription of a second terminal to the voice recognition results of the server, wherein the number of the second terminals is at least one, and the subscribed voice recognition results are voice recognition results of the voice transmitted by the first terminal through the communication connection;
in response to receiving the voice uploaded by the first terminal, determining a voice recognition result of the voice and sending the voice recognition result to the second terminal, wherein, in each time period, the first terminal sends the voice fragments collected in that time period to the server;
responsive to obtaining a speech recognition end instruction, disconnecting the communication connection;
the voice uploaded by the first terminal is stored in an asynchronous queue;
the determining the voice recognition result of the voice and sending the voice recognition result to the second terminal includes:
starting an asynchronous thread, acquiring uploaded voice from the asynchronous queue by utilizing the asynchronous thread, performing real-time voice recognition on the voice to obtain a voice recognition result, and writing the voice recognition result into a shared variable; and transmitting the voice recognition result in the shared variable to the second terminal by using another thread.
2. The method of claim 1, wherein determining, in response to receiving the voice uploaded by the first terminal, a voice recognition result of the voice comprises:
in response to receiving a voice fragment collected and sent by the first terminal in a first time period, determining a voice recognition result of the voice fragment, wherein the voice in which the voice fragment is located comprises at least one voice fragment.
3. The method of claim 2, wherein prior to said transmitting the speech recognition result to the second terminal, the method further comprises:
in response to obtaining the segment recognition result of each voice segment in the voice in which the voice segment is located, performing error correction processing on the total recognition result formed by the segment recognition results of the voice segments; and
the step of sending the voice recognition result to the second terminal comprises the following steps:
and sending the error-corrected total recognition result to the second terminal.
4. The method of claim 1, wherein the responding to the communication connection being established with the first terminal comprises:
if a communication connection request sent by the first terminal in response to detection of user trigger identification operation is received, establishing communication connection with the first terminal; and
the disconnecting of the communication connection in response to obtaining the voice recognition end instruction comprises:
and if the voice recognition ending instruction sent by the first terminal in response to the detection of ending the recognition operation of the user is received, disconnecting the communication connection.
5. The method of claim 1, wherein after the receiving the subscription of the second terminal to the voice recognition result of the server, and before the disconnecting the communication connection, the method further comprises:
receiving subscriptions of other terminals to the voice recognition results of the server, and taking the other terminals and the second terminal as new second terminals.
6. A communication method in a speech recognition scenario for a second terminal, the method comprising:
in response to a server establishing a communication connection with a first terminal, subscribing to the server for the voice recognition results of the server, wherein the number of the second terminals is at least one, the subscribed voice recognition results are for the voice uploaded by the first terminal through the communication connection, and, in each time period, the first terminal sends the voice fragments collected in that time period to the server;
receiving a voice recognition result determined by the server for the voice uploaded by the first terminal, wherein the server disconnects the communication connection in response to obtaining a voice recognition end instruction;
the voice uploaded by the first terminal is stored in an asynchronous queue;
the server determines a voice recognition result of the voice and sends the voice recognition result to the second terminal, and the method comprises the following steps: starting an asynchronous thread, acquiring uploaded voice from the asynchronous queue by utilizing the asynchronous thread, performing real-time voice recognition on the voice to obtain a voice recognition result, and writing the voice recognition result into a shared variable; and transmitting the voice recognition result in the shared variable to the second terminal by using another thread.
7. A communication device in a speech recognition scenario for a server, the device comprising:
a receiving unit configured to receive, in response to establishment of a communication connection with a first terminal, subscription of a second terminal to a speech recognition result of the server, where the number of the second terminals is at least one, the subscribed speech recognition result is a speech recognition result of speech transmitted by the first terminal through the communication connection, and each time period, the first terminal transmits a speech segment acquired in the time period to the server;
a determining unit configured to determine a voice recognition result of the voice in response to receiving the voice uploaded by the first terminal, and send the voice recognition result to the second terminal;
a disconnection unit configured to disconnect the communication connection in response to obtaining a voice recognition end instruction;
the voice uploaded by the first terminal is stored in an asynchronous queue;
the determining unit is further configured to perform the determining of the speech recognition result of the speech and send the speech recognition result to the second terminal as follows:
starting an asynchronous thread, acquiring uploaded voice from the asynchronous queue by utilizing the asynchronous thread, performing real-time voice recognition on the voice to obtain a voice recognition result, and writing the voice recognition result into a shared variable; and transmitting the voice recognition result in the shared variable to the second terminal by using another thread.
8. The apparatus of claim 7, wherein the determining unit is further configured to perform the determining a speech recognition result of the speech in response to receiving the speech uploaded by the first terminal in such a manner that:
and in response to receiving a voice fragment collected and sent by the first terminal in a first time period, determining a voice recognition result of the voice fragment, wherein the voice in which the voice fragment is located comprises at least one voice fragment.
9. The apparatus of claim 8, wherein the apparatus further comprises:
an error correction unit configured to perform error correction processing on a total recognition result composed of the segment recognition results of each voice segment in the voice where the voice segment is located in response to obtaining the segment recognition result of each voice segment before the voice recognition result is transmitted to the second terminal; and
the determining unit is further configured to perform the sending of the speech recognition result to the second terminal as follows:
and sending the error-corrected total recognition result to the second terminal.
10. The apparatus of claim 7, wherein the receiving unit is further configured to perform the responsive to establishing a communication connection with the first terminal as follows:
if a communication connection request sent by the first terminal in response to detection of user trigger identification operation is received, establishing communication connection with the first terminal; and
The disconnection unit is further configured to execute the disconnection of the communication connection in response to obtaining a speech recognition end instruction in the following manner:
and if the voice recognition ending instruction sent by the first terminal in response to the detection of ending the recognition operation of the user is received, disconnecting the communication connection.
11. The apparatus of claim 7, wherein the apparatus further comprises:
a newly added unit configured to, after the subscription of the second terminal to the voice recognition results of the server is received and before the communication connection is disconnected, receive subscriptions of other terminals to the voice recognition results of the server and take the other terminals and the second terminal as new second terminals.
12. A communication device in a speech recognition scenario for a second terminal, the device comprising:
a subscription unit configured to, in response to a server establishing a communication connection with a first terminal, subscribe to the server for the voice recognition results of the server, wherein the number of the second terminals is at least one, the subscribed voice recognition results are for the voice uploaded by the first terminal through the communication connection, and, in each time period, the first terminal sends the voice fragments collected in that time period to the server;
a result receiving unit configured to receive a voice recognition result determined by the server for the voice uploaded by the first terminal, wherein the server disconnects the communication connection in response to obtaining a voice recognition end instruction;
the voice uploaded by the first terminal is stored in an asynchronous queue;
the server determines a voice recognition result of the voice and sends the voice recognition result to the second terminal, and the method comprises the following steps: starting an asynchronous thread, acquiring uploaded voice from the asynchronous queue by utilizing the asynchronous thread, performing real-time voice recognition on the voice to obtain a voice recognition result, and writing the voice recognition result into a shared variable; and transmitting the voice recognition result in the shared variable to the second terminal by using another thread.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202110201275.1A 2021-02-23 2021-02-23 Communication method and device in voice recognition scene Active CN112992141B (en)

Publications (2)

Publication Number Publication Date
CN112992141A CN112992141A (en) 2021-06-18
CN112992141B true CN112992141B (en) 2023-06-16

Family

ID=76349662

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110201275.1A Active CN112992141B (en) 2021-02-23 2021-02-23 Communication method and device in voice recognition scene

Country Status (1)

Country Link
CN (1) CN112992141B (en)

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100007625A (en) * 2008-07-14 2010-01-22 엘지전자 주식회사 Mobile terminal and method for displaying menu thereof
CN102045206B (en) * 2011-01-26 2013-10-23 杭州华三通信技术有限公司 Alarm pushing method and device and system thereof
CN102685151A (en) * 2012-06-05 2012-09-19 陈云昊 Method for filtering and transmitting speech
CN105208081A (en) * 2015-08-14 2015-12-30 深圳联友科技有限公司 Method and system realizing network communication through subscription mode
EP3429237A1 (en) * 2017-07-13 2019-01-16 Airbus Defence and Space Oy Group communication
CN109036421A (en) * 2018-08-10 2018-12-18 珠海格力电器股份有限公司 Information-pushing method and household appliance
CN110491370A (en) * 2019-07-15 2019-11-22 北京大米科技有限公司 A kind of voice stream recognition method, device, storage medium and server
CN110417809A (en) * 2019-08-09 2019-11-05 西藏宁算科技集团有限公司 It is a kind of based on the air navigation aid and system analyzed in real time
CN110808031A (en) * 2019-11-22 2020-02-18 大众问问(北京)信息科技有限公司 Voice recognition method and device and computer equipment
CN112115726A (en) * 2020-09-18 2020-12-22 北京嘀嘀无限科技发展有限公司 Machine translation method, device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN112992141A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
CN112416284B (en) Method, apparatus, device and storage medium for sharing screen
CN111934840A (en) Communication method of client and server, gateway, electronic equipment and storage medium
CN112235417B (en) Method and device for sending debugging instruction
KR20210156243A (en) Training methods of deep-running frameworks, devices and storage media
CN111510480B (en) Request sending method and device and first server
CN111770176A (en) Traffic scheduling method and device
EP3828739B1 (en) Parallelization of authentication strategies
CN112069137B (en) Method, device, electronic equipment and computer readable storage medium for generating information
US11165916B2 (en) Information processing method, information processing apparatus, and non-transitory recording medium storing instructions for executing an information processing method
CN113724398A (en) Augmented reality method, apparatus, device and storage medium
CN112992141B (en) Communication method and device in voice recognition scene
WO2023169193A1 (en) Method and device for generating smart contract
CN111615171A (en) Access method and device of wireless local area network
CN111818046B (en) Method, apparatus, device and storage medium for interaction information
CN112516584B (en) Game role control method and device
CN113542802B (en) Video transition method and device
CN113641439B (en) Text recognition and display method, device, electronic equipment and medium
CN112752323A (en) Method and device for changing hotspot access state
CN112770415A (en) Information processing method and device about wireless hotspot
CN113419915A (en) Cloud terminal desktop stillness determination method and device
CN113419880A (en) Cloud mobile phone root authority acquisition method, related device and computer program product
CN112153754A (en) Point-to-point connection method and device, electronic equipment and storage medium
CN111767989A (en) Neural network training method and device
CN112579032A (en) Intelligent equipment guiding method and device
CN111857887B (en) Data acquisition method, device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant