CN110853643A - Method, device, equipment and storage medium for voice recognition in fast application - Google Patents


Info

Publication number
CN110853643A
CN110853643A (application number CN201911129442.5A)
Authority
CN
China
Prior art keywords
voice recognition
terminal
fast application
voice signal
recognition server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911129442.5A
Other languages
Chinese (zh)
Inventor
董红光
吴华
范宏伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd
Priority claimed from CN201911129442.5A
Publication of CN110853643A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/28: Constructional details of speech recognition systems
    • G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure provides a method, an apparatus, a device, and a storage medium for voice recognition in a fast application, applied to the field of computers. The method includes: acquiring a first voice signal, where the first voice signal is a voice signal received by a first fast application in a terminal in a running state; sending the first voice signal to a first voice recognition server, where the first voice recognition server is the voice recognition server corresponding to the first fast application, and the voice recognition servers corresponding to at least two fast applications are different; and receiving a voice recognition result sent by the first voice recognition server. By sending the voice signal acquired by the terminal to the voice recognition server designated by the currently running fast application, the method and the device make the voice recognition result more accurate.

Description

Method, device, equipment and storage medium for voice recognition in fast application
Technical Field
The present disclosure relates to the field of computers, and in particular, to a method, an apparatus, a device, and a storage medium for performing speech recognition in a fast application.
Background
The fast application is a novel application form based on a hardware platform. A fast application is developed with a front-end technology stack, rendered natively, and combines the advantages of an HTML5 (HyperText Markup Language 5) page with those of a native application. A fast application runs on a fast application framework integrated into the operating system.
In the related art, after receiving a voice signal, a terminal running a fast application sends the voice signal to a voice recognition server designated by the terminal to perform voice recognition.
In the process of implementing the present disclosure, the inventors found that the above approach has at least the following defect: different fast applications correspond to different voice recognition scenarios, yet no matter which fast application the terminal is currently running, the terminal sends the voice signal to the voice recognition server designated by the terminal itself. Recognition therefore cannot take the current fast application's scenario into account, and the accuracy of the voice recognition result is low.
Disclosure of Invention
The embodiments of the present disclosure provide a method, an apparatus, a device, and a storage medium for voice recognition in a fast application, which can address the following problem: different fast applications correspond to different voice recognition scenarios, yet no matter which fast application the terminal is currently running, the terminal sends the voice signal to the voice recognition server designated by the terminal itself, so recognition cannot take the current fast application's scenario into account and the accuracy of the voice recognition result is low. The technical solution is as follows:
according to an aspect of the present disclosure, there is provided a method for performing speech recognition in a fast application, the method being applied to a terminal, at least one fast application running in the terminal, the fast application being an application that runs based on a fast application framework integrated in an operating system and does not need to be manually installed, the method comprising:
acquiring a first voice signal, wherein the first voice signal is a voice signal received by a first fast application in the terminal in a running state;
sending the first voice signal to a first voice recognition server, wherein the first voice recognition server is a voice recognition server corresponding to the first fast application, and the voice recognition servers corresponding to at least two fast applications are different;
and receiving a voice recognition result sent by the first voice recognition server.
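The three claimed steps (acquire the signal, send it to the application's own server, receive the result) can be sketched as a small terminal-side routine. This is a minimal illustration only, not the patented implementation; the registry, the `transport` callable, and all names and URLs are assumptions for the sketch.

```python
# Hypothetical sketch of the claimed flow: a voice signal acquired while a
# fast application runs is routed to that application's own recognition
# server. All names and URLs here are illustrative.

# Per-application registry: at least two fast applications map to
# different recognition servers.
APP_SERVERS = {
    "catering_app": "http://asr.catering.example",
    "music_app": "http://asr.music.example",
}

def recognize(app_id, voice_signal, transport):
    """Route `voice_signal` to the server of the running fast application
    and return the recognition result the server sends back."""
    server = APP_SERVERS[app_id]
    return transport(server, voice_signal)

# A fake transport standing in for the network round trip:
def fake_transport(server, signal):
    return f"result({signal})@{server}"

print(recognize("catering_app", "order noodles", fake_transport))
# → result(order noodles)@http://asr.catering.example
```

The key design point the claims emphasize is only the routing table: each fast application carries its own server, instead of every application sharing one terminal-designated server.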
Optionally, the acquiring the first voice signal includes:
displaying a user interface of the first fast application, wherein the user interface is an interface of the first fast application in a running state and comprises at least one user interface control;
collecting the first voice signal in the foreground display process of the user interface;
the sending the first voice signal to a first voice recognition server includes:
and sending the first voice signal and a first candidate operation instruction to a first voice recognition server, wherein the first candidate operation instruction is a candidate operation instruction corresponding to the user interface control, and the first candidate operation instruction is used for assisting the first voice recognition server in recognizing the first voice signal.
Optionally, the user interface includes a display portion and a hidden portion, and before sending the first voice signal and the first candidate operation instruction to the first voice recognition server, the method further includes:
obtaining a candidate operation instruction of a displayed control, wherein the displayed control is a user interface control corresponding to the display part, and the displayed control is part or all of the at least one user interface control;
determining the candidate operation instruction of the displayed control as the first candidate operation instruction.
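The two steps above can be sketched as follows: only controls in the display portion contribute candidate operation instructions. This is a hypothetical illustration; the control representation and field names (`visible`, `instructions`) are assumptions, not the patent's data model.

```python
# Hypothetical sketch: collect candidate operation instructions only from
# the displayed controls (the "display portion" of the user interface) and
# determine them as the first candidate operation instructions.
def first_candidate_instructions(controls):
    """controls: list of dicts with a 'visible' flag and an
    'instructions' list. Returns the first candidate instructions."""
    candidates = []
    for control in controls:
        if control["visible"]:          # displayed controls only
            candidates.extend(control["instructions"])
    return candidates

ui = [
    {"name": "search_box", "visible": True,  "instructions": ["search"]},
    {"name": "scroll",     "visible": True,  "instructions": ["scroll up", "scroll down"]},
    {"name": "pay_button", "visible": False, "instructions": ["pay"]},  # hidden part
]
print(first_candidate_instructions(ui))  # hidden control's instruction excluded
```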
Optionally, the method further comprises:
when the first fast application does not have a corresponding first voice recognition server, sending the first voice signal to a default voice recognition server;
and receiving a voice recognition result sent by the default voice recognition server.
Optionally, the method further comprises:
acquiring a second voice signal, wherein the second voice signal is a voice signal received when the terminal does not run the fast application;
sending the second voice signal to a default voice recognition server;
and receiving a voice recognition result sent by the default voice recognition server.
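The two optional fallbacks above (no server for the fast application, or no fast application running) both route to a default voice recognition server. A minimal sketch, with invented names and placeholder URLs:

```python
# Hypothetical fallback logic: when the running fast application has no
# recognition server of its own, or no fast application is running at all,
# the voice signal goes to the default server.
DEFAULT_SERVER = "http://asr.default.example"
APP_SERVERS = {"first_fast_app": "http://asr.app1.example"}

def pick_server(app_id):
    """Return the recognition server for the current terminal state;
    `app_id` is None when no fast application is running."""
    if app_id is None:
        return DEFAULT_SERVER
    return APP_SERVERS.get(app_id, DEFAULT_SERVER)

print(pick_server("first_fast_app"))  # app-specific server
print(pick_server("other_app"))       # no own server: default
print(pick_server(None))              # no fast app running: default
```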
Optionally, the method further comprises:
and when the fast application is not operated on the terminal, stopping acquiring the voice signal.
According to another aspect of the present disclosure, there is provided an apparatus for performing voice recognition in a fast application, the apparatus being a part of a terminal, at least one fast application running in the terminal, the fast application being an application that runs based on a fast application framework integrated in an operating system and does not need to be manually installed, the apparatus comprising:
the terminal comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is configured to acquire a first voice signal, and the first voice signal is a voice signal received by a first fast application in the terminal in a running state;
a sending module configured to send the first voice signal to a first voice recognition server, the first voice recognition server being the voice recognition server corresponding to the first fast application, where the voice recognition servers corresponding to at least two fast applications are different;
a receiving module configured to receive the voice recognition result sent by the first voice recognition server.
Optionally, the apparatus further comprises: a display module;
the display module is configured to display a user interface of the first fast application, wherein the user interface is an interface of the first fast application in a running state, and the user interface comprises at least one user interface control;
the acquisition module is further configured to acquire the first voice signal in a foreground display process of the user interface;
the sending module is further configured to send the first voice signal and a first candidate operation instruction to a first voice recognition server, where the first candidate operation instruction is a candidate operation instruction corresponding to the user interface control, and the first candidate operation instruction is used to assist the first voice recognition server in recognizing the first voice signal.
Optionally, the user interface includes a display portion and a hidden portion, the display portion is a portion of the user interface displayed on the terminal, and the hidden portion is a portion of the user interface not displayed on the terminal, the apparatus further includes: a determination module;
the obtaining module is further configured to obtain a candidate operation instruction of a displayed control, where the displayed control is a user interface control corresponding to the display portion, and the displayed control is a part or all of the at least one user interface control;
the determination module is configured to determine the candidate operation instruction of the displayed control as the first candidate operation instruction.
Optionally, the sending module is further configured to send the first voice signal to a default voice recognition server when the first fast application does not have a corresponding first voice recognition server;
the receiving module is further configured to receive the voice recognition result sent by the default voice recognition server.
Optionally, the obtaining module is further configured to obtain a second voice signal, where the second voice signal is a voice signal received when the fast application is not running on the terminal;
the sending module is further configured to send the second voice signal to a default voice recognition server;
the receiving module is further configured to receive the voice recognition result sent by the default voice recognition server.
Optionally, the obtaining module is further configured to stop obtaining the voice signal when the fast application is not running on the terminal.
According to another aspect of the present disclosure, there is provided a computer device including: a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement the method of speech recognition in a fast application as described above.
According to another aspect of the present disclosure, there is provided a computer-readable storage medium having stored therein at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the method for voice recognition in a fast application as described above.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
the voice signal acquired by the terminal is sent to the voice recognition server appointed by the fast application running at present, so that voice recognition can be carried out under the voice recognition rule provided by the fast application developer, the developer can set the voice recognition rule attached to the fast application according to the type and the use scene of the fast application, for example, the fast application in catering industry can recognize the voice signal according to the specific voice recognition rule, so that the voice recognition result is close to the voice instruction of the fast application in catering industry, and the voice recognition result is more accurate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a block diagram of an implementation environment provided by an exemplary embodiment of the present disclosure;
FIG. 2 is a flow chart of a method for speech recognition in a fast application provided by an exemplary embodiment of the present disclosure;
FIG. 3 is a flow chart of a method for speech recognition in a fast application provided by another exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a user interface for a method of speech recognition in a fast application provided by another exemplary embodiment of the present disclosure;
FIG. 5 is a flow chart of a method for speech recognition in a fast application provided by another exemplary embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a user interface for a method of speech recognition in a fast application provided by another exemplary embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a fast application framework provided by an exemplary embodiment of the present disclosure;
FIG. 8 is a schematic illustration of a fast application initial use provided by an exemplary embodiment of the present disclosure;
FIG. 9 is a block diagram of an apparatus for speech recognition in a fast application provided by an exemplary embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
First, terms related to embodiments of the present disclosure are introduced:
Fast application: a novel application form based on a mobile phone hardware platform. A fast application is developed with a front-end technology stack, rendered natively, and combines the advantages of an HTML5 (HyperText Markup Language 5) page with those of a native application. A fast application can be used without installation after being downloaded, and occupies far less memory than a native application, generally only a few hundred KB. It downloads and decompresses in one step, without popping up an installation interface. Updates are fetched and applied in real time, without installing an update package. A fast application runs on a fast application framework integrated into the operating system. Any native application may be re-developed as a fast application, such as a video, social, picture, music, reading, learning and education, financial, life, office, travel, shopping, or game application.
Voice recognition: also known as automatic speech recognition (ASR), the conversion of the vocabulary content of human speech into computer-readable input, such as keystrokes, binary codes, or character sequences; that is, the process by which the terminal converts an acquired voice signal into a computer-readable signal.
User interface (UI) controls: any visual control or element visible on an application's user interface, such as a picture, input box, text box, button, or label. Some UI controls respond to user operations; for example, the user can trigger a scroll control on the user interface to scroll the page up and down. Illustratively, UI controls also include controls that are not visible on the user interface but still respond to user operations; for example, there may be a region of the user interface where clicking triggers a terminal screenshot.
Fig. 1 shows a block diagram of a computer system provided by an exemplary embodiment of the present disclosure. The computer system 200 includes a terminal 220 and a server 240.
The terminal 220 has fast applications installed and running. The quick application may be any one of a video application, a social application, a picture application, a music application, a reading application, a learning and education application, a financial application, a life application, an office application, a trip application, a shopping application, and a game application. The terminal 220 is a terminal used by a user, and the user uses the terminal 220 to run a fast application and perform at least one of clicking, browsing, querying, communicating, paying, and sending and receiving data.
The terminal 220 is connected to the server 240 through a wireless network or a wired network.
The server 240 includes at least one of a single server, a plurality of servers, a cloud computing platform, and a virtualization center. Illustratively, the server 240 includes a processor 244 and a memory 242, the memory 242 in turn including a display module 2421, a control module 2422, and a receiving module 2423. The server 240 is used to provide background services for applications. Alternatively, the server 240 undertakes the primary computing work and the terminal 220 the secondary computing work; or the server 240 undertakes the secondary computing work and the terminal 220 the primary; or the server 240 and the terminal 220 compute cooperatively using a distributed computing architecture.
The device types of the terminal 220 include at least one of a smartphone, a tablet, a smart home device, a television, an e-book reader, an MP3 player, an MP4 player, a laptop portable computer, and a desktop computer. In the following embodiments, the terminal is illustrated as a smartphone.
Those skilled in the art will appreciate that the number of terminals described above may be greater or fewer. For example, the number of the terminals may be only one, or several tens or hundreds of the terminals, or more. The number of terminals and the type of the device are not limited in the embodiments of the present disclosure.
Fig. 2 is a flowchart illustrating a method for performing speech recognition in a fast application according to an exemplary embodiment of the present disclosure, which may be applied to the terminal 220 in the computer system shown in fig. 1 or other terminals in the computer system.
The method is applied to a terminal, at least one fast application runs in the terminal, the fast application runs on the basis of a fast application framework integrated in an operating system and does not need to be manually installed, and the method comprises the following steps:
step 102, a first voice signal is obtained, wherein the first voice signal is a voice signal received by a first fast application in the terminal in a running state.
The terminal acquires a first voice signal, wherein the first voice signal is a voice signal received by a first application in the terminal in a running state.
The voice signal is an audio signal that the terminal acquires from its environment. Illustratively, the terminal collects the voice signal through a built-in recording module or an externally connected recording device, such as a microphone or a microphone array. Illustratively, the voice signal is a voice instruction issued by the user to the terminal.
The first voice signal is a voice signal acquired by the terminal while the terminal is running the first fast application. Illustratively, "running the first fast application" means the first fast application runs in the foreground of the terminal. Illustratively, every voice signal the terminal acquires from the moment it starts running the first fast application until it exits the first fast application is a first voice signal; that is, no matter how many voice signals, or how many segments of voice signal, the terminal acquires during this period, each is a first voice signal.
Illustratively, after the terminal moves the first fast application to the background, the acquired voice signal no longer belongs to the first voice signal. Likewise, when the first fast application is running in the foreground but the terminal enters a sleep state, i.e., the display no longer shows the user interface of the first fast application or goes black, the voice signal acquired by the terminal does not belong to the first voice signal.
Illustratively, the running state refers to a state running in the foreground of the terminal.
Illustratively, the first fast application is no longer in the running state after the terminal enters the sleep state. In the sleep state, the display device or display module of the terminal shows a screen saver or a black screen.
The first fast application is simply a fast application. Illustratively, it is the fast application running in the foreground of the terminal. Illustratively, only one fast application runs in the foreground; alternatively, when two or more fast applications run in the foreground, the first fast application is the one currently being controlled by the user, which may be the fast application that most recently received a user control instruction or the one most recently opened by the user.
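The selection rule above, picking the fast application most recently controlled by the user when several run in the foreground, could be sketched as follows. The `last_control_time` field and all names are assumptions for illustration, not the patent's design.

```python
# Hypothetical rule for choosing the "first fast application" when two or
# more fast applications run in the foreground: take the one the user
# controlled most recently.
def first_fast_application(foreground_apps):
    """foreground_apps: list of dicts with 'name' and 'last_control_time'.
    Returns the name of the app controlled last, or None if the list is empty."""
    if not foreground_apps:
        return None
    latest = max(foreground_apps, key=lambda app: app["last_control_time"])
    return latest["name"]

apps = [
    {"name": "reading_app", "last_control_time": 100},
    {"name": "music_app",   "last_control_time": 250},  # controlled last
]
print(first_fast_application(apps))  # → music_app
```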
Step 105, sending the first voice signal to a first voice recognition server, where the first voice recognition server is a voice recognition server corresponding to a first fast application, and where the voice recognition servers corresponding to at least two fast applications are different.
The terminal transmits the first voice signal to the first voice recognition server.
The voice recognition server is a server that performs voice recognition; it converts a voice signal into a computer-readable input signal. Illustratively, the voice recognition server receives a voice signal sent by the terminal, performs voice recognition, and returns the voice recognition result to the terminal.
The first voice recognition server is the voice recognition server corresponding to the first fast application. Illustratively, the first voice recognition server is dedicated to the first fast application. Illustratively, the first fast application includes the IP (Internet Protocol) address of the first voice recognition server; when the terminal runs the first fast application, it acquires the first voice signal and sends it to the first voice recognition server according to that IP address. For example, the voice recognition server corresponding to the first fast application may be changed, e.g., from the first voice recognition server to a second voice recognition server; in that case, while the first fast application runs, the first voice signal acquired by the terminal is sent to the second voice recognition server instead of the first.
Illustratively, multiple fast applications can run on the terminal, and different fast applications may correspond to different voice recognition servers. For example, when a second fast application runs in the foreground, the voice signal acquired by the terminal is sent to a third voice recognition server corresponding to the second fast application. Illustratively, different fast applications may also correspond to the same voice recognition server; for example, when the second fast application runs in the foreground, the voice signal acquired by the terminal is sent to the first voice recognition server, which in this case also corresponds to the second fast application.
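Both behaviors described here, remapping an application's server and letting two applications share one server, amount to edits of an app-to-server table. A hypothetical sketch with placeholder URLs:

```python
# Hypothetical registry edits: an application's recognition server can be
# remapped (first -> second server), and two fast applications may point
# at the same server. URLs are placeholders, not real endpoints.
servers = {"first_fast_app": "http://server-1.example"}
servers["first_fast_app"] = "http://server-2.example"    # remapped
servers["second_fast_app"] = "http://server-2.example"   # shared server
print(servers["first_fast_app"] == servers["second_fast_app"])  # → True
```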
Step 106, receiving a voice recognition result sent by the first voice recognition server.
And the terminal receives the voice recognition result sent by the first voice recognition server.
For example, after the terminal sends the first voice signal to the first voice recognition server, the first voice recognition server may return a voice recognition result obtained after performing voice recognition on the first voice signal to the terminal.
The voice recognition result is the result of performing voice recognition on the first voice signal. Illustratively, the voice recognition result is a computer-readable input signal, i.e., an input instruction that a computer can recognize, for example: an instruction signal generated when the computer receives a trigger operation on a UI control, or an instruction signal generated when the computer receives input from a mouse or keyboard. For example, when the first voice signal is "search for sugar", the terminal can perform the "search for sugar" operation according to the corresponding voice recognition result; that is, the result corresponds to the combination of an instruction signal selecting the search box, an instruction signal typing "sugar" into the search box, and an instruction signal clicking to start the search.
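The "search for sugar" example above can be sketched as a mapping from a recognized phrase to a sequence of instruction signals. The instruction tuples and control names are invented for illustration; the patent does not specify this representation.

```python
# Hypothetical sketch: turn a recognition result into the combination of
# computer-readable instruction signals described in the text (select the
# search box, type the query, click search).
def to_instruction_signals(result):
    """Map a recognized phrase to a list of (action, target) instruction
    signals; unrecognized phrases map to no instructions."""
    prefix = "search for "
    if result.startswith(prefix):
        query = result[len(prefix):]
        return [
            ("select", "search_box"),
            ("type", query),
            ("click", "search_button"),
        ]
    return []

print(to_instruction_signals("search for sugar"))
```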
In summary, in the method provided by this embodiment, the voice signal acquired by the terminal is sent to the voice recognition server designated by the currently running fast application, so voice recognition is performed under the voice recognition rules provided by that fast application's developer. The developer can set voice recognition rules tailored to the fast application's type and usage scenario; for example, a catering fast application can recognize voice signals under catering-specific rules, so the recognition result stays close to that application's typical voice instructions and is more accurate.
Illustratively, the terminal also sends the operations currently receivable on the user interface to the first voice recognition server, helping the server perform voice recognition more accurately.
Fig. 3 is a flowchart illustrating a method for speech recognition in a fast application according to another exemplary embodiment of the present disclosure. The method may be applied in the terminal 220 in a computer system as shown in fig. 1 or in other terminals in the computer system. The method comprises the following steps:
step 101, judging whether the foreground application is a fast application.
The terminal determines whether the foreground application is a fast application.
First, the terminal needs to determine whether the foreground application is a fast application. If it is, step 1021 is executed; if not, step 301 is executed.
Illustratively, the terminal may skip step 101. For example, after acquiring a segment of voice signal, the terminal decides which server to send it to according to its current running state: if a fast application is running, the terminal sends the voice signal to the voice recognition server designated by that fast application; if no fast application is running, or the fast application has no designated voice recognition server, the terminal sends the voice signal to the default voice recognition server.
A foreground application is an application running in the foreground of the terminal. Illustratively, running in the foreground means the user interface corresponding to the application is shown on the terminal's display. The user interface corresponding to an application presents information related to that application; for example, a first application may display the words "first application" on its interface.
Step 1021, displaying a user interface of the first fast application, wherein the user interface is an interface of the first fast application in a running state, and the user interface comprises at least one user interface control.
The terminal displays a user interface of the first fast application, wherein the user interface is an interface of the first fast application in a running state and comprises at least one user interface control.
The user interface is a medium for interaction and information exchange between the system and the user; it converts between the internal form of information and a form acceptable to humans. Illustratively, the user interface includes UI controls, some of which may respond to user operations. Illustratively, the user interface has regular boundaries, that is, the user interface is located within those boundaries. The boundaries may be boundary lines visible on the terminal, or the boundaries of the display screen of the terminal, in which case the user interface is displayed on the terminal in full screen.
Illustratively, the user interface of the first fast application is the interface displayed on the terminal, or the foremost interface displayed on the terminal, when the first fast application runs in the foreground of the terminal. For example, when the terminal opens multiple applications simultaneously and their user interfaces are displayed on the terminal in an overlapping manner, the user interface of the first fast application is the one displayed at the forefront and not blocked by the user interfaces of other applications.
For example, as shown in fig. 4, a black border frames the user interface 501 of the first fast application. At least one UI control is provided on the user interface 501. For example, as shown in fig. 4, an invisible scroll control is provided on the user interface 501; the scroll control can receive a user's vertical or horizontal sliding operation, and the terminal scrolls the page in the user interface accordingly. As another example, as shown in fig. 4, there is a selection control 504 on the user interface 501; a user triggering the selection control 504 can switch the page in the user interface 501 from an A page to a B page. Illustratively, the user interface of the first fast application does not include status bars native to the operating system of the terminal, such as the terminal status bar 502 located above the user interface 501 and the terminal handle bar 503 located below the user interface 501 in fig. 4.
Step 1022, during the foreground display process of the user interface, a first voice signal is collected.
The terminal collects a first voice signal in the foreground display process of the user interface.
Illustratively, when the user interface of the first fast application is displayed at the forefront on the display of the terminal, that is, when the first fast application is the application running in the foreground, the speech signal collected by the terminal is the first speech signal. When the display of the terminal shows no screen, or the application displayed at the forefront is not the first fast application, the voice signal collected by the terminal is not the first voice signal.
Illustratively, when the terminal is collecting the first voice signal, the terminal may issue a voice collection prompt to inform the user that the terminal is collecting the voice signal. The voice collection prompt can be prompt information displayed on the terminal, a prompt sound emitted by the terminal, or the like. For example, the prompt may be an icon on the terminal indicating that voice is being collected, or a prompt tone issued when the voice signal is obtained.
Step 103, judging whether the first fast application has a corresponding first voice recognition server.
The terminal judges whether the first fast application has a corresponding first voice recognition server. If there is a corresponding first speech recognition server, then go to step 105 or step 1051, otherwise go to step 201.
For example, a developer of the first fast application or a user of the fast application may separately set a designated voice recognition server for the first fast application. That is, when the terminal runs the first fast application, the collected first voice signal is sent to the voice recognition server designated by the developer or the user for recognition, so as to ensure the information security of the developer and the user.
For example, after the developer or user of the first fast application designates the first speech recognition server for it, the address of the first speech recognition server may be stored in the first fast application. When the terminal collects a first speech signal while the first fast application is running, the terminal may directly read the stored address of the first speech recognition server and send the first speech signal to that address.
Step 1051, sending the first voice signal and a first candidate operation instruction to a first voice recognition server, where the first candidate operation instruction is a candidate operation instruction corresponding to a user interface control, and the first candidate operation instruction is used to assist the first voice recognition server in recognizing the first voice signal.
The terminal sends the first voice signal and the first candidate operation instruction to the first voice recognition server.
Illustratively, the user interface control in step 1051 has two cases:
The first case is all UI controls that may be present in the first fast application. That is, the user interface of the first fast application may switch among many pages; for example, a takeaway application may have a home page with category controls, and tapping a category control enters the page corresponding to that control. The user interface controls of the first case include all UI controls in all pages that can be opened from the user interface (whether already opened or not yet opened).
The second case is all UI controls in the currently open page of the user interface of the first fast application. That is, when the user only slides up, down, left, or right without switching pages, the candidate set covers all UI controls in that page.
The candidate operation instruction is a control instruction corresponding to the user interface control. Illustratively, the candidate operation instruction is an instruction signal generated after the terminal receives a triggering operation on the user interface control. For example, the candidate operation instruction is a control instruction corresponding to a trigger operation that can be received by the user interface control. For example, a user interface control may correspond to a plurality of candidate operation instructions. For example, the scroll control on the user interface may receive four trigger operations of sliding up, sliding down, sliding left, and sliding right, and then the scroll control has four candidate operation instructions corresponding to the four trigger operations of sliding up, sliding down, sliding left, and sliding right, respectively.
The first candidate operation instruction comprises at least one candidate operation instruction. Illustratively, the first candidate operation instruction is a candidate operation instruction corresponding to at least one user interface control.
Illustratively, the first candidate operation instruction is a candidate speech recognition result for the first speech signal. For example, the first speech recognition server may preferentially select the speech recognition result corresponding to the first speech signal from among the candidate operation instructions, or the first speech recognition server may obtain the speech recognition result corresponding to the first speech signal with reference to the candidate operation instructions.
Illustratively, the terminal sends the first candidate operation instruction to the first voice recognition server for assisting the first voice recognition server in performing voice recognition on the first voice signal. Illustratively, the first candidate operation instruction is used to inform the first speech recognition server which instructions the current terminal can receive, and the first speech recognition server can recognize the first speech signal more accurately according to the first candidate operation instruction. For example, if the first candidate operation instruction includes a candidate operation instruction corresponding to a slide-up operation of the scroll control and the first speech signal is "flip-up", the first speech recognition server can easily obtain an operation instruction corresponding to a slide-up operation as a speech recognition result by referring to the first candidate operation instruction based on the first speech signal.
For example, the terminal may send the first voice signal and the first candidate operation instruction to the first voice recognition server simultaneously, or may send the first voice signal and the first candidate operation instruction to the first voice recognition server separately. For example, the terminal may send the first candidate operation instruction to the first speech recognition server in real time each time a page switch is performed on the user interface of the first fast application.
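As a hedged sketch of how a recognition server might use the candidate operation instructions to bias its result (the matching rule and every name below are assumptions made for illustration, not the server's actual algorithm):

```python
# Illustrative only: prefer a candidate operation instruction whose
# keyword phrases appear in the raw transcript, falling back to the
# plain transcript when no candidate matches.
def recognize_with_candidates(transcript, candidates):
    """candidates maps an operation instruction to its keyword phrases."""
    for instruction, keywords in candidates.items():
        if any(keyword in transcript for keyword in keywords):
            return instruction
    return transcript

# Assumed candidates for the scroll control described above.
scroll_candidates = {
    "slide_up": ["flip up", "slide up"],
    "slide_down": ["flip down", "slide down"],
}
```

With these candidates, a raw transcript such as "please flip up" resolves to the `slide_up` instruction, matching the "flip-up" example above.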
And 106, receiving a voice recognition result sent by the first voice recognition server.
And the terminal receives the voice recognition result sent by the first voice recognition server.
Illustratively, the terminal will perform corresponding operations according to the speech recognition result. For example, when the terminal acquires the speech recognition result corresponding to "play first music", the terminal performs an operation of playing "first music".
Step 201, when the first fast application has no corresponding first voice recognition server, sending the first voice signal to a default voice recognition server.
When the first fast application does not have the corresponding first voice recognition server, the terminal sends the first voice signal to the default voice recognition server.
Illustratively, the first speech signal and the first candidate operation instruction are sent to a default speech recognition server when the first fast application does not have a corresponding first speech recognition server.
For example, when neither the developer nor the user of the first fast application has designated a voice recognition server for it, the terminal may send the first voice signal to the default voice recognition server for voice recognition.
The default speech recognition server is the speech recognition server specified by the fast application framework, or a speech recognition server that the user specifies uniformly for all fast applications. The speech recognition server specified by the fast application framework is one provided by the developer of the fast application framework. For example, the developer of the fast application framework may designate the fourth speech recognition server as the default; if the user does not change the default and the first fast application has no designated speech recognition server, the terminal sends the first speech signal to the fourth speech recognition server. If the framework developer designates the fourth speech recognition server as the default, but the user changes the default to the fifth speech recognition server, and the first fast application has no designated speech recognition server, the terminal sends the first speech signal to the fifth speech recognition server.
For example, the priority order for choosing the speech recognition server may be:
priority 1: first, send to the voice recognition server specially designated by the user for the first fast application;
priority 2: if the user has not designated one for the first fast application, send to the default voice recognition server as changed by the user;
priority 3: if the user has not changed the default voice recognition server, send to the first voice recognition server designated by the developer of the first fast application;
priority 4: if the developer of the first fast application has not designated a voice recognition server, send to the default voice recognition server preset by the developer of the fast application framework.
Priority 2 and priority 3 can also be swapped according to the settings of the user or of the developer of the first fast application. For example, to ensure the security of the user's information, the user may require that all voice information be sent to a voice recognition server designated by the user, so that priority 2 is higher than priority 3; if the user makes no such setting, the developer of the first fast application may set priority 3 higher than priority 2 to ensure that the information of the first fast application is not leaked.
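The four priorities (before any user or developer swap of priorities 2 and 3) can be sketched as a simple fall-through. The configuration keys are hypothetical names, not part of the patent:

```python
# Minimal sketch of the default priority chain described above.
# Each key is an assumed configuration field; the first one set wins.
PRIORITY_KEYS = (
    "user_per_app_server",       # priority 1: user's choice for this app
    "user_default_server",       # priority 2: default changed by the user
    "developer_server",          # priority 3: designated by the developer
    "framework_default_server",  # priority 4: fast application framework
)

def pick_asr_server(config):
    for key in PRIORITY_KEYS:
        if config.get(key):
            return config[key]
    return None
```

Swapping priorities 2 and 3, as the text allows, amounts to reordering `PRIORITY_KEYS`.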
Step 202, receiving a voice recognition result sent by a default voice recognition server.
And the terminal receives a voice recognition result sent by the default voice recognition server.
And after receiving the first voice signal, the default voice recognition server performs voice recognition on the first voice signal and returns a voice recognition result corresponding to the first voice signal to the terminal.
For example, the default speech recognition server may also receive a first candidate operation instruction, and assist speech recognition using the first candidate operation instruction.
Step 301, a second voice signal is obtained, where the second voice signal is a voice signal received when no fast application is running on the terminal.
The terminal acquires a second voice signal, wherein the second voice signal is a voice signal received when the terminal does not run the fast application.
Illustratively, when the first fast application is not running in the foreground of the terminal (for example, the terminal is not currently running any application or fast application, the foreground of the terminal is running another application or fast application, or the terminal has entered a sleep state with the screen off after running the first fast application), the voice signal acquired by the terminal is no longer the first voice signal.
The second speech signal is a speech signal that is distinct from the first speech signal. Illustratively, among all the voice signals collected by the terminal, the voice signals except the first voice signal belong to the second voice signal.
Step 302, sending the second voice signal to a default voice recognition server.
And the terminal sends the second voice signal to a default voice recognition server.
Illustratively, if the terminal needs to perform voice recognition on the second voice signal, the terminal sends the second voice signal to the corresponding voice recognition server according to the terminal state when the terminal acquires the second voice signal.
For example, if the second voice signal is a voice signal acquired when the foreground of the terminal does not run any application, the terminal sends the second voice signal to the default voice recognition server.
For example, if the second voice signal is a voice signal acquired when the foreground of the terminal is running a second fast application, the terminal may determine whether the second fast application has a designated voice recognition server; if so, the terminal sends the second voice signal to the designated voice recognition server, and if not, to the default voice recognition server.
Step 303, receiving a voice recognition result sent by the default voice recognition server.
And the terminal receives a voice recognition result sent by the default voice recognition server.
For example, when the terminal sends the second voice signal to the default voice recognition server, the terminal receives a voice recognition result returned by the default voice recognition server.
Illustratively, the terminal performs corresponding operation according to a voice recognition result corresponding to the second voice signal.
In summary, in the method provided in this embodiment, the terminal determines its state when acquiring the first voice signal, judges whether the first fast application has a corresponding voice recognition server, and finally decides whether to send the first voice signal to the first voice recognition server or to another voice recognition server. The user or the fast application developer can thus designate the server that performs voice recognition, which ensures the information security of the user and of the first fast application.
The first candidate operation instruction corresponding to the user interface control on the first fast application user interface is sent to the first voice recognition server, so that the first voice recognition server is assisted to perform voice recognition on the first voice signal, the accuracy of a voice recognition result is improved, and the fitting degree of the voice recognition result and the first fast application is improved.
For example, the first candidate operation instruction may also be a candidate operation instruction corresponding to all user interface controls currently displayed on the terminal by the user interface.
Illustratively, the terminal may also stop collecting voice signals when the foreground is not running the fast application.
Fig. 5 is a flowchart illustrating a method for speech recognition in a fast application according to another exemplary embodiment of the present disclosure, which may be applied to the terminal 220 in the computer system shown in fig. 1 or to other terminals in the computer system. Unlike the method of speech recognition in the fast application shown in fig. 3, steps 1041 and 1042 are added before step 1051, and steps 301, 302, and 303 are replaced with step 401.
Step 1041, obtaining candidate operation instructions of the displayed control, where the displayed control is a user interface control corresponding to the display portion, and the displayed control is a portion or all of at least one user interface control.
The terminal obtains a candidate operation instruction of a displayed control, wherein the displayed control is a user interface control corresponding to the display part, and the displayed control is part or all of at least one user interface control.
Illustratively, the user interface includes a display portion and a hidden portion. The display portion is the portion of the user interface displayed on the terminal, and the hidden portion is the portion of the user interface not displayed on the terminal.
Illustratively, due to the limited size of the display screen of the terminal, the user interface cannot be displayed on the terminal in its entirety. The user may browse the various portions of the user interface through sliding or scrolling operations, up, down, left, or right.
Exemplarily, the part of the user interface displayed on the terminal is determined as the display portion of the user interface; the part that is not displayed on the terminal but can be brought into view by a sliding or scrolling operation is determined as the hidden portion of the user interface.
Illustratively, the displayed portion and the hidden portion of the user interface belong to the same page on the user interface.
For example, as shown in fig. 6, the user interface 501 of the first fast application displays only a part of the screen on the terminal 600, the part displayed on the terminal 600 is a display part 601, and the part of the user interface 501 other than the display part 601 is a hidden part 602.
Illustratively, the terminal does not need to acquire the candidate operation instructions corresponding to all the user interface controls on the user interface 501, but only needs to acquire the candidate operation instructions corresponding to the user interface controls on the display portion. For example, as shown in fig. 6, the first control 603 is located in the hidden portion 602, and the terminal does not need to acquire the candidate operation instruction corresponding to the first control 603. The second control 604 is located in the display portion 601, and the terminal obtains a candidate operation instruction corresponding to the second control 604.
Illustratively, the displayed control is a user interface control present in the displayed portion of the user interface. Illustratively, the displayed controls may be all of the user interface controls in step 1051, or may be a portion of the user interface controls in step 1051.
Illustratively, when the display portion of the user interface changes, the terminal re-acquires the candidate operation instructions of the displayed controls.
Step 1042, determining the candidate operation instruction of the displayed control as a first candidate operation instruction.
The terminal determines the candidate operation instruction of the displayed control as a first candidate operation instruction.
For example, the terminal determines the candidate operation instruction corresponding to the displayed control as the first candidate operation instruction. That is, the terminal sends to the first voice recognition server only the operation instructions that the UI controls currently displayed on the terminal can receive, so that the first voice recognition server can obtain a voice recognition result that can be executed on the display portion.
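Steps 1041 and 1042 can be sketched as a viewport filter. The control representation below (pixel bounds plus an instruction list) is an assumption made for illustration only:

```python
# Hypothetical sketch: keep only the candidate operation instructions of
# controls whose vertical bounds intersect the visible viewport
# [0, viewport_height); the result is the first candidate instruction set.
def displayed_candidates(controls, viewport_height):
    instructions = []
    for control in controls:
        visible = control["bottom"] > 0 and control["top"] < viewport_height
        if visible:
            instructions.extend(control["instructions"])
    return instructions
```

In the fig. 6 example, a control in the hidden portion 602 (below the viewport) contributes nothing, while a control in the display portion 601 contributes its instructions.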
Step 401, when the terminal does not run the fast application, stopping acquiring the voice signal.
Illustratively, when no fast application is running on the terminal, or the fast application running in the foreground has no recording permission, the terminal stops acquiring the voice signal. That is, the terminal does not turn on the microphone for recording.
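A minimal sketch of this gate, assuming a dict-shaped description of the foreground application (the field names are illustrative, not from the patent):

```python
# Record only when a fast application with recording permission is in
# the foreground (step 401); otherwise keep the microphone off.
def should_record(foreground_app):
    return bool(
        foreground_app
        and foreground_app.get("is_fast_app")
        and foreground_app.get("has_record_permission")
    )
```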
In summary, in the method provided in this embodiment, only the candidate operation instruction corresponding to the UI control already displayed on the terminal is sent to the speech recognition server, so as to assist the speech recognition server in obtaining the speech recognition result that can be executed on the display portion, thereby improving the accuracy of speech recognition.
The terminal stops recording when no fast application is running or the fast application has no recording permission, which further protects the user's privacy and information security.
For example, the methods provided in the above embodiments can be split and then freely combined into new embodiments.
FIG. 7 illustrates a block diagram 100 of a fast application framework shown in an exemplary embodiment of the present disclosure, the fast application framework including: a scenario portal 120, a fast application engine 140, and Operating System (OS) infrastructure and hardware 160.
The scene portal 120 includes at least one of a negative one-screen, a global search, a lock screen, a desktop, an application marketplace, a browser, and a two-dimensional code. The appearance of the scene portal 120 may be in the form of a page and a card.
The fast application engine 140 includes a front end framework 141, a generic scenario 142, a lightweight scenario 143, an embedded SDK (Software Development Kit) 144, and business access 145.
The front-end framework 141 includes MVVM (Model-View-ViewModel), V-DOM (virtual DOM), routing, a basic API (Application Programming Interface), a service API, UI (User Interface) components, and the like;
the general scene 142 and the lightweight scene 143 include a JavaScript engine, a standard rendering engine, an extreme rendering engine, an end-cloud-core acceleration, a security mechanism, an emerging scene (AI (Artificial Intelligence), AR (Augmented Reality), etc.), a system integration (application management, rights management, etc.);
service access 145 includes Push (Push), account/payment, etc.
OS infrastructure & hardware 160 includes: a graphics library, native controls, system services, a GPU (Graphics Processing Unit)/NPU (Neural-network Processing Unit), etc.
From the execution path level, there is a standard HTML5 approach supporting generic Web (web page) scenes (typically through the system's WebView component or a customized WebView), and a JS (JavaScript) + Native approach supporting a lighter-weight, faster experience. The architecture of the fast application engine is briefly described below at three levels.
1) Application development (front end framework + Components and API capability)
The front-end design of fast applications mirrors and integrates the design ideas of mainstream front-end frameworks (Vue, React, etc.): applications are built in a componentized manner, an MVVM design pattern with data binding at its core is adopted, performance is improved through V-DOM, and a concise, clear Vue-like template syntax is chosen. Layout is correspondingly simplified. From the perspective of new application forms, native UI mapping, and capability opening, a set of component and API specifications needs to be defined to facilitate rapid application development.
2) System integration (application management, card-embedded SDK, security mechanism, etc.)
The fast application, as a complete application form, can be deeply integrated with the system, run like a native application, and interact with the system. Fast applications currently take two forms: a full-screen independent-application form and an embedded card form. In the independent-application form, the user experience is similar to that of a native application, with complete life cycle management, page management, routing, and so on. A fast application can be hosted on an android Activity, with each page hosted on a Fragment, and instances are managed and controlled through an independent background Service. The card is the other form: through the embedded SDK it is embedded, as an independent local control, into every corner of the system, presenting dynamic content in a lightweight way. In terms of security isolation, a sandbox mechanism, process isolation, and permission control, combined with support from the operating system layer, can achieve better security guarantees.
3) Performance experience and emerging scenes (JavaScript engine, rendering engine, end-cloud-core acceleration, emerging scenes)
In the aspects of interactive experience, resource overhead, stability and the like, the fast application realizes effective combination of a front-end development mode, native rendering and platform capacity by introducing a native rendering path.
Different from other application-layer cross-platform frameworks, fast applications are rooted in the mobile phone's operating system, enabling deep integration from the chip through the operating system to the cloud. Taking startup acceleration as an example of end-cloud combination, optimization at the network link layer and cooperative rendering between the cloud and the device can greatly accelerate application startup. Meanwhile, special capabilities of the hardware platform can be integrated to further improve the experience. For example, combined with a mobile phone AI chip, the computing power of the NPU can be integrated into the fast application engine, so that AI scenes (face recognition, image super-resolution, etc.) can execute with low latency and high performance on the device side, effectively protecting user privacy and saving bandwidth.
Fig. 8 shows a flowchart illustrating the start-up of a fast application according to an exemplary embodiment of the present disclosure, including:
1) When started for the first time, a user click triggers downloading of the fast application's program package, while initialization of the fast application engine is carried out at the same time. After the whole fast application package is downloaded and verified, the JavaScript file of the first page to be displayed is loaded and rendered.
2) Page rendering comprises JavaScript loading, execution of page and JavaScript framework logic, layout operations, and drawing of native UI controls. When the logic in the page executes, one or more network requests are generated (from the page to the application's own or third-party servers), and the data returned by these requests drives re-rendering of the page until the first-screen content is completely displayed.
Network requests, JavaScript execution, typesetting, and drawing are not in a simple serial relationship; they are interwoven in parallel and together affect the rendering performance of the whole page, and they are strongly related to the logic of the page design, the network conditions, and the running state of the device.
Due to the particularity of fast applications, fast applications are expected to take on more tasks and provide more functions. As the functions of fast applications are gradually improved, the embodiments of the present disclosure provide a new fast application capability: using it, different fast applications can perform voice recognition with their own designated voice recognition servers, which expands the functions of fast applications and improves their processing capability and practicability.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
Fig. 9 is a schematic structural diagram illustrating an apparatus for speech recognition in a fast application according to an exemplary embodiment of the present disclosure. The apparatus may be implemented as part or all of a terminal by software, hardware, or a combination of both. The terminal runs at least one fast application, and the fast application runs based on a fast application framework integrated in an operating system without requiring manual installation. The apparatus 700 includes:
an obtaining module 701 configured to obtain a first voice signal, where the first voice signal is a voice signal received by a first fast application in the terminal in an operating state;
a sending module 702 configured to send the first voice signal to a first voice recognition server 706, where the first voice recognition server 706 is the voice recognition server corresponding to the first fast application, and at least two fast applications correspond to different voice recognition servers;
a receiving module 703 configured to receive the voice recognition result sent by the first voice recognition server.
Optionally, the apparatus further comprises: a display module 704;
the display module 704 configured to display a user interface of the first fast application, where the user interface is an interface of the first fast application in a running state, and the user interface includes at least one user interface control;
the obtaining module 701 is further configured to collect the first voice signal during a foreground display process of the user interface;
the sending module 702 is further configured to send the first voice signal and a first candidate operation instruction to the first voice recognition server 706, where the first candidate operation instruction is a candidate operation instruction corresponding to the user interface control, and the first candidate operation instruction is used to assist the first voice recognition server 706 in recognizing the first voice signal.
Optionally, the user interface includes a display portion and a hidden portion, the display portion is a portion of the user interface displayed on the terminal, and the hidden portion is a portion of the user interface not displayed on the terminal, the apparatus further includes: a determination module 705;
the obtaining module 701 is further configured to obtain a candidate operation instruction of a displayed control, where the displayed control is a user interface control corresponding to the display portion, and the displayed control is a part or all of the at least one user interface control;
the determining module 705 is configured to determine the candidate operation instruction of the displayed control as the first candidate operation instruction.
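The selection of candidate instructions from displayed controls only, as described by the obtaining and determining modules above, can be sketched as follows. The control representation (a vertical position plus attached instructions) is a hypothetical assumption for illustration:

```python
def first_candidate_instructions(controls, viewport_height):
    """Collect the candidate operation instructions of displayed controls.

    `controls` is a list of dicts, each with a vertical position `y` and
    the candidate instructions attached to that control; a control is
    treated as displayed when its position falls inside the viewport,
    and hidden controls contribute nothing.
    """
    displayed = [c for c in controls if 0 <= c["y"] < viewport_height]
    instructions = []
    for control in displayed:
        instructions.extend(control["instructions"])
    return instructions
```

Restricting the candidates to the display portion keeps the hint set small and matched to what the user can actually see and operate.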
Optionally, the sending module 702 is further configured to send the first voice signal to a default voice recognition server 707 when the first fast application does not have a corresponding first voice recognition server 706;
the receiving module 703 is further configured to receive the voice recognition result sent by the default voice recognition server 707.
Optionally, the obtaining module 701 is further configured to obtain a second voice signal, where the second voice signal is a voice signal received when the fast application is not running on the terminal;
the sending module 702 is further configured to send the second voice signal to a default voice recognition server 707;
the receiving module 703 is further configured to receive the voice recognition result sent by the default voice recognition server 707.
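The two optional fallback behaviors above (no corresponding server for the running fast application, and no fast application running at all) reduce to a simple routing rule, sketched here with hypothetical identifiers:

```python
# Illustrative routing of the fallback behaviors described above;
# URLs and application identifiers are assumptions.
APP_SERVERS = {"com.example.music": "https://asr.music.example.com"}
DEFAULT_SERVER = "https://asr.default.example.com"

def route_voice_signal(running_app_id):
    """Pick the destination server for a voice signal:
    - no fast application running (second voice signal) -> default server
    - running application without a corresponding server -> default server
    - otherwise -> the application's own voice recognition server
    """
    if running_app_id is None:
        return DEFAULT_SERVER
    return APP_SERVERS.get(running_app_id, DEFAULT_SERVER)
```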
Optionally, the obtaining module 701 is further configured to stop obtaining the voice signal when the fast application is not running on the terminal.
It should be noted that, when the apparatus provided in the foregoing embodiment implements the function of performing voice recognition in a fast application, the division into the above functional modules is merely illustrative; in actual applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 10 is a block diagram illustrating an apparatus 1000 for speech recognition in fast applications, according to an example embodiment. For example, the apparatus 1000 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 10, the apparatus 1000 may include one or more of the following components: processing component 1002, memory 1004, power component 1006, multimedia component 1008, audio component 1010, Input/Output (I/O) interface 1012, sensor component 1014, and communications component 1016.
The processing component 1002 generally controls the overall operation of the device 1000, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 1002 may include one or more processors 1020 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 1002 may include one or more modules that facilitate interaction between the processing component 1002 and other components. For example, the processing component 1002 may include a multimedia module to facilitate interaction between the multimedia component 1008 and the processing component 1002.
The memory 1004 is configured to store various types of data to support operations at the apparatus 1000. Examples of such data include instructions for any application or method operating on the device 1000, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1004 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as Static Random-Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
The power supply component 1006 provides power to the various components of the device 1000. The power components 1006 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 1000.
The multimedia component 1008 includes a screen that provides an output interface between the device 1000 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gesture actions on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1008 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the device 1000 is in an operating mode, such as a shooting mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 1010 is configured to output and/or input audio signals. For example, the audio component 1010 may include a Microphone (MIC) configured to receive external audio signals when the device 1000 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 1004 or transmitted via the communication component 1016. In some embodiments, audio component 1010 also includes a speaker for outputting audio signals.
I/O interface 1012 provides an interface between processing component 1002 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 1014 includes one or more sensors for providing various aspects of status assessment for the device 1000. For example, the sensor assembly 1014 may detect an open/closed state of the device 1000, the relative positioning of components, such as a display and keypad of the device 1000, a change in position of the device 1000 or a component of the device 1000, the presence or absence of user contact with the device 1000, the orientation or acceleration/deceleration of the device 1000, and a change in temperature of the device 1000. The sensor assembly 1014 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 1014 may also include a photosensor, such as a CMOS (Complementary Metal-Oxide-Semiconductor) or CCD (Charge-Coupled Device) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1014 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1016 is configured to facilitate communications between the apparatus 1000 and other devices in a wired or wireless manner. The device 1000 may access a wireless network based on a communication standard, such as Wi-Fi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1016 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1016 further includes a Near Field Communication (NFC) module to facilitate short-range communications.
In an exemplary embodiment, the apparatus 1000 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 1004 comprising instructions, executable by the processor 1020 of the device 1000 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM (Read-Only Memory), a Random Access Memory (RAM), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium, wherein instructions of the storage medium, when executed by a processor of a device 1000, enable the device 1000 to perform a method for speech recognition in a fast application, the method being applied in a terminal having at least one fast application running therein, the fast application being an application that runs based on a fast application framework integrated in an operating system and does not need to be manually installed, the method comprising:
acquiring a first voice signal, wherein the first voice signal is a voice signal received by a first fast application in the terminal in a running state;
sending the first voice signal to a first voice recognition server, wherein the first voice recognition server is a voice recognition server corresponding to the first fast application, and the voice recognition servers corresponding to at least two fast applications are different;
and receiving a voice recognition result sent by the first voice recognition server.
Optionally, the acquiring the first voice signal includes:
displaying a user interface of the first fast application, wherein the user interface is an interface of the first fast application in a running state and comprises at least one user interface control;
collecting the first voice signal in the foreground display process of the user interface;
the sending the first voice signal to a first voice recognition server includes:
and sending the first voice signal and a first candidate operation instruction to a first voice recognition server, wherein the first candidate operation instruction is a candidate operation instruction corresponding to the user interface control, and the first candidate operation instruction is used for assisting the first voice recognition server in recognizing the first voice signal.
Optionally, the user interface includes a display portion and a hidden portion, the display portion being a portion of the user interface displayed on the terminal and the hidden portion being a portion of the user interface not displayed on the terminal; before the sending the first voice signal and the first candidate operation instruction to the first voice recognition server, the method further includes:
obtaining a candidate operation instruction of a displayed control, wherein the displayed control is a user interface control corresponding to the display part, and the displayed control is part or all of the at least one user interface control;
determining the candidate operation instruction of the displayed control as the first candidate operation instruction.
Optionally, the method further comprises:
when the first fast application does not have a corresponding first voice recognition server, sending the first voice signal to a default voice recognition server;
and receiving a voice recognition result sent by the default voice recognition server.
Optionally, the method further comprises:
acquiring a second voice signal, wherein the second voice signal is a voice signal received when the terminal does not run the fast application;
sending the second voice signal to a default voice recognition server;
and receiving a voice recognition result sent by the default voice recognition server.
Optionally, the method further comprises:
and when no fast application is running on the terminal, stopping acquiring the voice signal.
The present disclosure also provides a terminal, including: the apparatus includes a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement the method for speech recognition in a fast application provided by the above method embodiments.
The present disclosure also provides a computer device, comprising: a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement the method for speech recognition in a fast application provided by the above-described method embodiments.
The present disclosure also provides a computer-readable storage medium having at least one instruction, at least one program, a set of codes, or a set of instructions stored therein, which is loaded and executed by a processor to implement the method for speech recognition in a fast application provided by the above method embodiments.
It should be understood that reference to "a plurality" herein means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (14)

1. A method for performing voice recognition in a fast application, the method being applied to a terminal, at least one fast application running in the terminal, the fast application being an application running based on a fast application framework integrated in an operating system and not requiring manual installation, the method comprising:
acquiring a first voice signal, wherein the first voice signal is a voice signal received by a first fast application in the terminal in a running state;
sending the first voice signal to a first voice recognition server, wherein the first voice recognition server is a voice recognition server corresponding to the first fast application, and the voice recognition servers corresponding to at least two fast applications are different;
and receiving a voice recognition result sent by the first voice recognition server.
2. The method of claim 1, wherein the obtaining the first speech signal comprises:
displaying a user interface of the first fast application, wherein the user interface is an interface of the first fast application in a running state and comprises at least one user interface control;
collecting the first voice signal in the foreground display process of the user interface;
the sending the first voice signal to a first voice recognition server includes:
and sending the first voice signal and a first candidate operation instruction to a first voice recognition server, wherein the first candidate operation instruction is a candidate operation instruction corresponding to the user interface control, and the first candidate operation instruction is used for assisting the first voice recognition server in recognizing the first voice signal.
3. The method according to claim 2, wherein the user interface includes a display portion and a hidden portion, the display portion being a portion of the user interface displayed on the terminal, the hidden portion being a portion of the user interface not displayed on the terminal,
before the sending the first voice signal and the first candidate operation instruction to the first voice recognition server, the method further includes:
obtaining a candidate operation instruction of a displayed control, wherein the displayed control is a user interface control corresponding to the display part, and the displayed control is part or all of the at least one user interface control;
determining the candidate operation instruction of the displayed control as the first candidate operation instruction.
4. The method of any of claims 1 to 3, further comprising:
when the first fast application does not have a corresponding first voice recognition server, sending the first voice signal to a default voice recognition server;
and receiving a voice recognition result sent by the default voice recognition server.
5. The method of any of claims 1 to 3, further comprising:
acquiring a second voice signal, wherein the second voice signal is a voice signal received when the terminal does not run the fast application;
sending the second voice signal to a default voice recognition server;
and receiving a voice recognition result sent by the default voice recognition server.
6. The method of any of claims 1 to 5, further comprising:
and when no fast application is running on the terminal, stopping acquiring the voice signal.
7. An apparatus for performing speech recognition in a fast application, the apparatus being a part of a terminal, at least one fast application running in the terminal, the fast application being an application that runs based on a fast application framework integrated in an operating system and does not require manual installation, the apparatus comprising:
the terminal comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is configured to acquire a first voice signal, and the first voice signal is a voice signal received by a first fast application in the terminal in a running state;
a sending module configured to send the first voice signal to a first voice recognition server, the first voice recognition server being a voice recognition server corresponding to the first fast application, wherein the voice recognition servers corresponding to at least two of the fast applications are different;
a receiving module configured to receive the voice recognition result sent by the first voice recognition server.
8. The apparatus of claim 7, further comprising: a display module;
the display module is configured to display a user interface of the first fast application, wherein the user interface is an interface of the first fast application in a running state, and the user interface comprises at least one user interface control;
the acquisition module is further configured to acquire the first voice signal in a foreground display process of the user interface;
the sending module is further configured to send the first voice signal and a first candidate operation instruction to a first voice recognition server, where the first candidate operation instruction is a candidate operation instruction corresponding to the user interface control, and the first candidate operation instruction is used to assist the first voice recognition server in recognizing the first voice signal.
9. The apparatus of claim 8, wherein the user interface comprises a display portion and a hidden portion, the display portion being a portion of the user interface displayed on the terminal, the hidden portion being a portion of the user interface not displayed on the terminal, the apparatus further comprising: a determination module;
the acquisition module is further configured to acquire a candidate operation instruction of a displayed control, where the displayed control is a user interface control corresponding to the display portion, and the displayed control is a part or all of the at least one user interface control;
the determination module is configured to determine the candidate operation instruction of the displayed control as the first candidate operation instruction.
10. The apparatus according to any one of claims 7 to 9,
the sending module is further configured to send the first voice signal to a default voice recognition server when the first fast application does not have a corresponding first voice recognition server;
the receiving module is further configured to receive the voice recognition result sent by the default voice recognition server.
11. The apparatus according to any one of claims 7 to 9,
the acquisition module is further configured to acquire a second voice signal, where the second voice signal is a voice signal received when the fast application is not running on the terminal;
the sending module is further configured to send the second voice signal to a default voice recognition server;
the receiving module is further configured to receive the voice recognition result sent by the default voice recognition server.
12. The apparatus according to any one of claims 7 to 11,
the acquisition module is further configured to stop acquiring the voice signal when the fast application is not running on the terminal.
13. A computer device, comprising: a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement the method for speech recognition in a fast application according to any of claims 1 to 6.
14. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement a method for speech recognition in a fast application according to any one of claims 1 to 6.
CN201911129442.5A 2019-11-18 2019-11-18 Method, device, equipment and storage medium for voice recognition in fast application Pending CN110853643A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911129442.5A CN110853643A (en) 2019-11-18 2019-11-18 Method, device, equipment and storage medium for voice recognition in fast application


Publications (1)

Publication Number Publication Date
CN110853643A true CN110853643A (en) 2020-02-28

Family

ID=69602015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911129442.5A Pending CN110853643A (en) 2019-11-18 2019-11-18 Method, device, equipment and storage medium for voice recognition in fast application

Country Status (1)

Country Link
CN (1) CN110853643A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112637796A (en) * 2020-12-21 2021-04-09 彩讯科技股份有限公司 Office information response method, system, server and storage medium based on 5G

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040002031A (en) * 2002-06-29 2004-01-07 주식회사 케이티 Method of Sharing Speech Engine with a Plurality of Applications
KR20160059026A (en) * 2014-11-17 2016-05-26 주식회사 엘지유플러스 Event Practicing System based on Voice Memo on Mobile, Mobile Control Server and Mobile Control Method, Mobile and Application Practicing Method therefor
CN107491286A (en) * 2017-07-05 2017-12-19 广东艾檬电子科技有限公司 Pronunciation inputting method, device, mobile terminal and the storage medium of mobile terminal
WO2018038385A2 (en) * 2016-08-23 2018-03-01 삼성전자 주식회사 Method for voice recognition and electronic device for performing same
CN107799115A (en) * 2016-08-29 2018-03-13 法乐第(北京)网络科技有限公司 A kind of audio recognition method and device
CN108305626A (en) * 2018-01-31 2018-07-20 百度在线网络技术(北京)有限公司 The sound control method and device of application program
CN109087639A (en) * 2018-08-02 2018-12-25 泰康保险集团股份有限公司 Method for voice recognition, device, electronic equipment and computer-readable medium
CN109817220A (en) * 2017-11-17 2019-05-28 阿里巴巴集团控股有限公司 Audio recognition method, apparatus and system


Similar Documents

Publication Publication Date Title
US11301131B2 (en) Method for split-screen display, terminal, and non-transitory computer readable storage medium
CN111026396B (en) Page rendering method and device, electronic equipment and storage medium
CN110990105B (en) Interface display method and device, electronic equipment and storage medium
CN110874217B (en) Interface display method and device for quick application and storage medium
CN110990075B (en) Method, device, equipment and storage medium for starting fast application
CN111459586B (en) Remote assistance method, device, storage medium and terminal
CN111026491B (en) Interface display method, device, electronic equipment, server and storage medium
CN111767554B (en) Screen sharing method and device, storage medium and electronic equipment
CN112230909A (en) Data binding method, device and equipment of small program and storage medium
CN111026490A (en) Page rendering method and device, electronic equipment and storage medium
CN113268212A (en) Screen projection method and device, storage medium and electronic equipment
CN110851108A (en) Electronic equipment operation method and device, electronic equipment and storage medium
CN111078325B (en) Application program running method and device, electronic equipment and storage medium
CN110968362B (en) Application running method, device and storage medium
CN110865863B (en) Interface display method and device for fast application and storage medium
CN110971974B (en) Configuration parameter creating method, device, terminal and storage medium
US11243679B2 (en) Remote data input framework
US11210449B2 (en) Page display method and device and storage medium
US11507633B2 (en) Card data display method and apparatus, and storage medium
CN110908629A (en) Electronic equipment operation method and device, electronic equipment and storage medium
CN110853643A (en) Method, device, equipment and storage medium for voice recognition in fast application
CN110865864A (en) Interface display method, device and equipment for fast application and storage medium
CN111104183B (en) Application program running method and device, electronic equipment and storage medium
CN110891194B (en) Comment information display method and device, terminal and storage medium
CN111008050B (en) Page task execution method, device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination