CN110827815A - Voice recognition method, terminal, system and computer storage medium - Google Patents

Voice recognition method, terminal, system and computer storage medium

Info

Publication number
CN110827815A
Authority
CN
China
Prior art keywords
information
terminal
initial
target
voice
Prior art date
Legal status
Granted
Application number
CN201911081516.2A
Other languages
Chinese (zh)
Other versions
CN110827815B (en)
Inventor
肖明
李凌志
陆伟峰
朱荣昌
唐僖僖
Current Assignee
Shenzhen Transsion Holdings Co Ltd
Original Assignee
Shenzhen Transsion Holdings Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Transsion Holdings Co Ltd filed Critical Shenzhen Transsion Holdings Co Ltd
Priority to CN201911081516.2A priority Critical patent/CN110827815B/en
Publication of CN110827815A publication Critical patent/CN110827815A/en
Application granted granted Critical
Publication of CN110827815B publication Critical patent/CN110827815B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72433 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for voice messaging, e.g. dictaphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72469 User interfaces specially adapted for cordless or mobile telephones for operating the device by selecting functions from two or more displayed items, e.g. menus or icons
    • H04M1/72472 User interfaces specially adapted for cordless or mobile telephones for operating the device by selecting functions from two or more displayed items, e.g. menus or icons wherein the items are sorted according to specific criteria, e.g. frequency of use
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72484 User interfaces specially adapted for cordless or mobile telephones wherein functions are triggered by incoming communication events
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2201/00 Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M2201/40 Electronic components, circuits, software, systems or apparatus used in telephone systems using speech recognition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2201/00 Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M2201/42 Graphical user interfaces
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2250/00 Details of telephonic subscriber devices
    • H04M2250/56 Details of telephonic subscriber devices including a user help function
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2250/00 Details of telephonic subscriber devices
    • H04M2250/74 Details of telephonic subscriber devices with voice recognition means

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the invention discloses a voice recognition method, a terminal, a system and a computer storage medium. The method comprises the following steps: receiving first voice information, converting the first voice information into initial information, and outputting the initial information; outputting candidate information corresponding to the initial information when a first operation on the initial information is detected; acquiring a correction object when a second operation on the candidate information is detected; and updating the initial information according to the correction object to obtain and/or output target information. The invention enables rapid and accurate recognition and modification of voice information and improves voice recognition efficiency.

Description

Voice recognition method, terminal, system and computer storage medium
Technical Field
The present invention relates to the field of computer application technologies, and in particular, to a speech recognition method, a terminal, a system, and a computer storage medium.
Background
With the continuous development of smartphones, voice assistant functions have sprung up rapidly and become popular with users: a user can hold an intelligent conversation with the voice assistant and solve some problems through it.
At present, a user can communicate by voice with a mobile phone's voice assistant, but current voice recognition technology cannot fully handle continuous pronunciation and similar phenomena, so recognition errors occur easily. When the device mis-recognizes a piece of voice information, the error is usually reflected in the voice assistant's display interface, for example by displaying the wrong recognition result, so the voice instruction initiated by the user is recognized incorrectly, which directly affects the accuracy with which the terminal device executes the instruction. In response, the user may re-initiate the voice instruction. However, on the one hand, a piece of voice information is often mis-recognized in only a few character objects, and on the other hand, recognition of the re-initiated voice instruction may still be inaccurate, so the current recognition scheme suffers from low recognition efficiency.
Disclosure of Invention
The embodiment of the invention provides a voice recognition method, a terminal, a system and a computer storage medium, which can efficiently obtain a voice recognition result.
In one aspect, a first embodiment of the present invention provides a speech recognition method, including:
receiving first voice information, converting the first voice information into initial information, and outputting the initial information; outputting candidate information corresponding to the initial information when a first operation for the initial information is detected; when a second operation aiming at the candidate information is detected, acquiring a correction object; and updating the initial information according to the correction object to obtain and/or output target information.
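The four claimed steps can be modeled as a small session object. The sketch below is purely illustrative: the class and method names, and the representation of the first and second operations as plain function calls, are assumptions made for clarity and are not part of the disclosure.

```python
class VoiceSession:
    """Illustrative model of the claimed receive/candidate/correct flow."""

    def __init__(self, candidates_by_span):
        # Maps a possibly mis-recognized span to its candidate objects.
        self.candidates_by_span = candidates_by_span
        self.initial = None

    def receive_voice(self, best_hypothesis):
        # Step 1: convert the first voice information into initial
        # information and output it.
        self.initial = best_hypothesis
        return self.initial

    def first_operation(self, span):
        # Step 2: a first operation on `span` outputs the candidate
        # information corresponding to the initial information.
        return self.candidates_by_span.get(span, [])

    def second_operation(self, span, correction_object):
        # Steps 3-4: the selected candidate becomes the correction object;
        # the initial information is updated into the target information.
        self.initial = self.initial.replace(span, correction_object)
        return self.initial
```

In this model, correcting a single mis-recognized span takes one first operation and one second operation rather than a full re-dictation, which is the efficiency gain the claims describe.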
Optionally, receiving first voice information, and converting the first voice information into initial information includes: acquiring a target initial object obtained according to the conversion in the first voice information; identifying a type of the target initial object; and obtaining initial information according to the type of the target initial object.
Optionally, the obtaining initial information according to the target initial object type includes: calling an association database corresponding to the target initial object type, and if an object with a matching degree meeting a first preset threshold value with the target initial object is found in the association database, updating the target initial object according to the found object to obtain initial information; or, searching an object corresponding to the target initial object type in a networking manner, and if an object with a matching degree meeting a second preset threshold value with the target initial object is found, updating the target initial object according to the found object to obtain initial information; the first preset threshold and the second preset threshold are the same or different.
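The threshold-based database lookup in this claim can be sketched as follows. The use of `difflib.SequenceMatcher` as the "matching degree" metric and the 0.8 default threshold are assumptions for illustration; the patent does not specify how the matching degree is computed.

```python
from difflib import SequenceMatcher

def match_in_database(target_object, database, threshold=0.8):
    """Return the database object whose matching degree with the target
    initial object meets the preset threshold (best match wins), or None
    when no object qualifies."""
    best, best_score = None, 0.0
    for obj in database:
        # Similarity ratio in [0, 1] stands in for the matching degree.
        score = SequenceMatcher(None, target_object.lower(), obj.lower()).ratio()
        if score >= threshold and score > best_score:
            best, best_score = obj, score
    return best
```

When the lookup returns None, the claim's alternative branch (a networked search with its own threshold) would apply.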
Optionally, when the type of the target initial object is identified as a contact type, taking an address book database stored in the terminal as the association database; and/or when the type of the target initial object is identified as the application name type, taking an application database recorded by the terminal as the association database; and/or when the type of the target initial object is identified to be an unknown type, networking and searching for the object corresponding to the type of the target initial object.
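The type-to-database dispatch in this claim amounts to a three-way branch. A minimal sketch, with the type labels and the None-means-networked-search convention invented for illustration:

```python
def pick_association_source(object_type, contacts, installed_apps):
    """Select the association database per the claim: the terminal's
    address book for a contact-type object, its recorded application
    database for an application-name type, and None (meaning: fall back
    to a networked search) for an unknown type."""
    if object_type == "contact":
        return contacts          # address book database stored in the terminal
    if object_type == "app_name":
        return installed_apps    # application database recorded by the terminal
    return None                  # unknown type: search over the network
```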
Optionally, semantic analysis is performed on the initial information or the target information, and a control instruction is output.
Optionally, the position for outputting the initial information includes at least one of a current interface, a preset fixed screen region, and a floating window, or the position for outputting the candidate information includes at least one of a current interface, a preset fixed screen region, and a floating window, or the position for outputting the target information includes at least one of a current interface, a preset fixed screen region, and a floating window, or the position for outputting the control instruction includes at least one of a current interface, a preset fixed screen region, and a floating window; the initial information, the candidate information, the target information and the control instruction display positions are the same or different.
Optionally, the method is applied to a voice recognition system, where the voice recognition system includes at least one first terminal and at least one second terminal, the first terminal is configured to receive the first voice information, and the second terminal is configured to output the control instruction.
Optionally, the method further comprises: outputting a voice correction identifier; re-inputting voice correction information through the voice correction identifier; updating the initial information according to the voice correction information; and/or outputting the initial information and/or the voice correction information.
Optionally, the method is applied to the voice recognition system, where the voice recognition system includes at least one first terminal and at least one second terminal, the first terminal is configured to receive first voice information, and the second terminal is configured to output the voice correction identifier and receive the re-entered voice correction information.
Optionally, the type of the initial information or the type of the target information includes at least one of text, image, audio, video, and file; and/or the first operation or the second operation comprises at least one of: a long press, repeated presses, a slide, a mid-air gesture operation, and N click operations in which the time interval between two adjacent clicks is less than a preset threshold, where N is an integer greater than or equal to 2; the first operation and the second operation are the same or different.
Optionally, the position for outputting the initial information includes at least one of a current interface, a preset fixed screen region, and a floating window, or the position for outputting the candidate information includes at least one of a current interface, a preset fixed screen region, and a floating window, or the position for outputting the target information includes at least one of a current interface, a preset fixed screen region, and a floating window; the display positions of the initial information, the candidate information and the target information are the same or different.
Optionally, the method is applied to the speech recognition system, where the speech recognition system includes at least one first terminal and at least one second terminal, the first terminal is configured to receive first speech information, and the second terminal is configured to output the initial information or output candidate information or output the target information.
In another aspect, an embodiment of the present invention further provides a speech recognition method applied to a speech recognition system, where the speech recognition system includes at least one first terminal and at least one second terminal. The method includes: receiving first voice information from the first terminal, and converting the first voice information into initial information; and obtaining and/or outputting target information according to the initial information.
Optionally, the obtaining and/or outputting the target information according to the initial information includes: outputting the initial information on the first terminal and/or the second terminal; when a first operation aiming at the initial information is detected, outputting candidate information corresponding to the initial information on the first terminal and/or the second terminal; when a second operation aiming at the candidate information is detected, acquiring a correction object; updating the initial information according to the correction object to obtain target information, and/or outputting the target information on the first terminal and/or the second terminal.
Optionally, the position for outputting the initial information includes at least one of a current interface, a preset fixed screen region, and a floating window, or the position for outputting the candidate information includes at least one of a current interface, a preset fixed screen region, and a floating window, or the position for outputting the target information includes at least one of a current interface, a preset fixed screen region, and a floating window; the display positions of the initial information, the candidate information and the target information are the same or different; and/or the first operation or the second operation comprises at least one of: a long press, repeated presses, a slide, a mid-air gesture operation, and N click operations in which the time interval between two adjacent clicks is less than a preset threshold, where N is an integer greater than or equal to 2; the first operation and the second operation are the same or different.
Optionally, the receiving the first voice message from the first terminal, and converting the first voice message into initial information includes: acquiring a target initial object obtained according to the conversion in the first voice information; identifying a type of the target initial object; and obtaining initial information according to the type of the target initial object.
Optionally, the step of obtaining initial information according to the target initial object type includes: calling an association database corresponding to the target initial object type, and if an object with a matching degree meeting a first preset threshold value with the target initial object is found in the association database, updating the target initial object according to the found object to obtain initial information; or, searching an object corresponding to the target initial object type in a networking manner, and if an object with a matching degree meeting a second preset threshold value with the target initial object is found, updating the target initial object according to the found object to obtain initial information; the first preset threshold and the second preset threshold are the same or different.
Optionally, when the type of the target initial object is identified as a contact type, taking an address book database stored in the terminal as the association database; and/or when the type of the target object is identified as the application name type, taking a system application database recorded by the terminal as the associated database; and/or when the type of the target initial object is identified to be an unknown type, networking and searching for the object corresponding to the type of the target initial object.
Optionally, semantic analysis is performed on the initial information or the target information, and a control instruction is output.
The position of outputting the initial information through the first terminal and/or the second terminal comprises at least one of a current interface, a preset fixed screen area and a floating window, or the position of outputting the candidate information through the first terminal and/or the second terminal comprises at least one of a current interface, a preset fixed screen area and a floating window, or the position of outputting the target information through the first terminal and/or the second terminal comprises at least one of a current interface, a preset fixed screen area and a floating window, or the position of outputting the control instruction through the first terminal and/or the second terminal comprises at least one of a current interface, a preset fixed screen area and a floating window; the initial information, the candidate information, the target information and the control instruction output positions are the same or different.
Optionally, outputting a speech correction identifier through the first terminal and/or the second terminal; receiving the voice correction information re-entered through the voice correction identifier; updating the initial information according to the voice correction information; and/or outputting the voice correction information and/or the initial information through the first terminal and/or the second terminal.
Optionally, the position where the initial information is output through the first terminal and/or the second terminal includes at least one of a current interface, a preset fixed screen region, and a floating window, or the position where the candidate information is output includes at least one of a current interface, a preset fixed screen region, and a floating window, or the position where the target information is output includes at least one of a current interface, a preset fixed screen region, and a floating window; the output positions of the initial information, the candidate information and the target information are the same or different.
The type of the initial information or the type of the target information comprises at least one of text, image, audio, video and file.
Correspondingly, an embodiment of the present invention further provides an intelligent terminal, comprising a processor, a memory and a user interface that are interconnected, wherein the memory is configured to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform the above-mentioned speech recognition method.
Accordingly, an embodiment of the present invention further provides a speech recognition system, including at least a first terminal and at least a second terminal, where the first terminal includes a first display, a first processor and a first memory, and the second terminal includes a second display, a second processor and a second memory, where the first memory and/or the second memory is used to store a computer program, the computer program includes program instructions, and the first processor and/or the second processor is configured to call the program instructions to execute the above-mentioned speech recognition method.
Optionally, the first terminal includes a first display screen, and/or the second terminal includes a second display screen.
Accordingly, the embodiment of the present invention further provides a computer storage medium, in which program instructions are stored, and when the program instructions are executed, the computer storage medium is used for implementing the above-mentioned voice recognition method.
The received voice information is recognized and converted into initial information; when a first operation is detected, candidate information corresponding to the initial information is obtained; a correction object is obtained according to a second operation performed on the candidate information; and the initial information is corrected accordingly. By selectively replacing and modifying only the mis-recognized content in the initial information, the operation is simple, time is saved, and users' needs are met: voice information can be recognized and modified quickly and accurately, and voice recognition efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 2a is a user interface diagram of a speech recognition method provided by an embodiment of the present invention;
FIG. 2b is a user interface diagram of a speech recognition method provided by an embodiment of the present invention;
FIG. 2c is a user interface diagram of a speech recognition method provided by an embodiment of the present invention;
FIG. 3 is a flow chart of another speech recognition method provided by the embodiment of the invention;
FIG. 4 is a flow chart of another speech recognition method provided by the embodiments of the present invention;
FIG. 5 is a schematic structural diagram of a speech recognition system according to an embodiment of the present invention;
FIG. 6 is a flowchart of another speech recognition method according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an intelligent terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of the invention and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the listed steps or modules, but may alternatively include other steps or modules not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments. The mobile terminals described herein also include, but are not limited to, devices that can accept voice information, such as mobile phones, personal computers, tablet computers, in-vehicle systems, televisions, etc., that have touch sensitive surfaces (e.g., touch screens or touch pads).
Speech recognition technology uses computers and digital signal processing to accurately recognize human speech (characters, words, clauses, sentences and the like). Recognition is based on extracting effective features from the speech to be recognized to form a speech pattern, comparing that pattern with sample patterns in a speech library stored in the terminal device's memory, and identifying the characters, words and so on through pattern classification. The speech recognition process is thus a process of recognizing language components such as syllables or words. Although speech recognition has been studied extensively, the complexity of speech means that recognition of continuous speech, large vocabularies and dialects remains imperfect and accuracy is not high, so correcting errors in the speech recognition result is essential.
Based on the above description, an embodiment of the present invention describes a speech recognition method in conjunction with fig. 1, fig. 2a, fig. 2b, and fig. 2c, where fig. 1 is a schematic flow chart of a speech recognition method provided in an embodiment of the present invention, and fig. 2a, fig. 2b, and fig. 2c are user interface diagrams of a speech recognition method of the present invention, the method may be executed by an intelligent terminal, and the intelligent terminal may be, for example, a smartphone, a tablet computer, an intelligent wearable device, a vehicle-mounted system, a television, and the like, where the method specifically includes the following steps:
s101, receiving first voice information, converting the first voice information into initial information 202, and outputting the initial information 202;
in one embodiment, before the first voice information is received, it may be detected whether the voice assistant function of the terminal device is turned on; if not, an instruction to turn on the voice assistant is sent to the terminal device. The first voice information allows the user to communicate with the terminal device's voice assistant; through it the user may make a call, create a note, send an email, open a system application, and so on. For example, the user may send a voice message on the human-computer interaction interface 201 shown in fig. 2a, where the voice is "help me sent a message to julier".
In one embodiment, recognizing the first voice information means matching it against the terminal device's speech library and screening text information whose pronunciation closely approximates the first voice information, yielding a pronunciation matching degree for each piece of text. At least one piece of text information whose matching degree with the first voice information meets a preset threshold is kept; the threshold may be stored in advance by the system or be user-defined (for example 80%, 85% or 95%), and the matching degree is calculated during recognition by the device providing the voice assistant function. The text information meeting the preset threshold may be sorted by matching degree; the text with the highest matching degree is selected as the initial information and the rest become candidate text information. For example, with a preset threshold of 90%, the first voice information may be converted into several pieces of text information during recognition; those meeting the threshold, sorted by pronunciation matching degree, are: "Julie", "job", and "Julian", and the text with the highest pronunciation matching degree is selected as the final output.
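The threshold-filter-then-sort procedure just described can be sketched directly. The `(text, score)` tuple representation and the example scores are illustrative assumptions, not values from the patent:

```python
def rank_hypotheses(scored_hypotheses, threshold=0.90):
    """Keep hypotheses whose pronunciation matching degree meets the
    preset threshold, sort them best-first, and split them into the
    initial information and the remaining candidate text information."""
    kept = sorted((h for h in scored_hypotheses if h[1] >= threshold),
                  key=lambda h: h[1], reverse=True)
    if not kept:
        return None, []           # nothing met the preset threshold
    initial = kept[0][0]          # highest matching degree wins
    candidates = [text for text, _ in kept[1:]]
    return initial, candidates
```

The returned candidates are exactly what step S102 later displays when the user performs the first operation.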
In one embodiment, the voice libraries include, but are not limited to, dialect voice libraries corresponding to the dialects of different regions, language voice libraries corresponding to the languages of different countries, and the like. When the first voice information input by the user is received, the geographical position information of the mobile terminal user may be obtained, the corresponding dialect voice library or language voice library is loaded according to the position information, and the first voice information input by the user is recognized, thereby improving the efficiency and accuracy of voice recognition.
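Selecting a voice library from location information can be sketched as a simple lookup with a fallback. The region codes and library names below are hypothetical placeholders; a real system would key on finer-grained location data and load actual acoustic models:

```python
# Hypothetical region-to-library mapping (illustrative names only).
VOICE_LIBRARIES = {
    "CN-GD": "cantonese_dialect_library",
    "CN-SH": "shanghainese_dialect_library",
    "FR":    "french_language_library",
}

def load_voice_library(region_code, default="standard_library"):
    # Fall back to a standard library when no regional library exists.
    return VOICE_LIBRARIES.get(region_code, default)

print(load_voice_library("CN-GD"))  # -> cantonese_dialect_library
print(load_voice_library("US"))     # -> standard_library
```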
S102, when a first operation aiming at the initial information 202 is detected, outputting candidate information corresponding to the initial information 202;
in one embodiment, when a first operation on the initial information 202 is detected, a target object 203 selected by the first operation is determined, and a preview interface 204 is displayed, wherein the preview interface displays a candidate object 205 of the target object, and the candidate information comprises the target object 203 selected by the first operation and the preview interface 204;
The first operation is a pressing operation performed by the user on the touch display screen with respect to the initial information 202; the incorrectly recognized text information in the initial information 202 is selected through the first operation and is used to determine the target object 203 that the user wants to modify. The first operation includes at least one of: a long press, repeated presses, a slide, an air gesture operation, and N click operations in which the time interval between two adjacent clicks is smaller than a preset threshold, where N is an integer greater than or equal to 2.
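Distinguishing a long press from N rapid clicks can be sketched from touch timestamps. The threshold values and function name are illustrative assumptions, not taken from the patent:

```python
def classify_operation(tap_times_ms, press_duration_ms,
                       long_press_ms=500, max_gap_ms=300, n=2):
    """Classify a touch gesture as 'long_press', 'multi_click', or 'other'.

    tap_times_ms: press-down timestamps of each tap;
    press_duration_ms: how long the final press was held."""
    if press_duration_ms >= long_press_ms:
        return "long_press"
    gaps = [b - a for a, b in zip(tap_times_ms, tap_times_ms[1:])]
    # N clicks where every adjacent interval is below the preset threshold.
    if len(tap_times_ms) >= n and all(g < max_gap_ms for g in gaps):
        return "multi_click"
    return "other"

print(classify_operation([0], 800))        # -> long_press
print(classify_operation([0, 200], 50))    # -> multi_click
print(classify_operation([0, 900], 50))    # -> other
```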
In one embodiment, the target object may be a combination of one or more characters, for example, "Julie" in fig. 2a, or the whole sentence "Help me send a message to Julie". The target object may be highlighted, including but not limited to changing its color, font size, or font weight, or underlining it.
In one embodiment, a preview interface 204 is displayed on the user interface 201. The preview interface 204 is preferably placed beside the initial information or the target object, but may be located at any position of the user interface 201. The preview interface 204 is used to display the candidate objects and includes, but is not limited to, a scrolling pop-up interface and a page-turning pop-up interface.
The candidate objects are other pieces of text information whose matching degree, obtained by matching the first voice information against the voice library while converting the first voice information into text information, satisfies the preset threshold. As shown in fig. 2b, the matching degrees of "Jolia" and "Julian" satisfy the preset threshold during the conversion, so both "Jolia" and "Julian" can be candidate objects.
S103, acquiring a correction object when a second operation aiming at the candidate information is detected;
The second operation is the operation by which the user selects the correct character object while browsing the candidate objects. The second operation may include a long press whose duration is greater than a preset threshold, or N click operations in which the time interval between two consecutive clicks is less than a preset threshold, where N is an integer greater than or equal to 2, for example 2, 3, 4, or 5. The first operation and the second operation may be the same or different.
And S104, updating the initial information according to the correction object to obtain and/or output target information.
In one embodiment, the target information is obtained by directly replacing the target object with the correction object. In one embodiment, the displayed target information needs to be distinguished from the initial information, for example, but not limited to, by displaying the string "corrected information" on the user interface to separate the initial dialog information from the voice-updated text information.
In one embodiment, the type of the initial information or the type of the target information may further include text, image, audio, video, and file. After the user inputs the first voice information, the voice information can be translated into text information, and the text information can also be converted into image, audio, or video form to be presented on the user interface.
In the embodiment of the invention, the received voice information is recognized and converted into initial information, the incorrectly recognized target object selected by the user is obtained, candidate objects for correcting the target object are displayed directly on the current user interface, a correction object is then obtained, and the initial information is updated according to the correction object. Replacing a single incorrectly recognized character object in this way is simple to operate, saves time, meets the user's needs, allows voice information to be recognized and modified quickly and accurately, and improves voice recognition efficiency.
Referring to fig. 2b and fig. 3, fig. 3 is a flowchart of another speech recognition method according to the present invention, which may be executed by a smart terminal, for example a smart phone, a tablet computer, a smart wearable device, an in-vehicle system, or a television. The method specifically includes the following steps, where S302-S305 correspond to step S101 in the first embodiment:
s301, displaying a user interface; in one embodiment, a user interface is displayed at the terminal device, which may display dialog information of the user with a voice assistant.
S302, acquiring a target initial object obtained by converting the first voice information;
Receiving first voice information and converting it into text information; in the process of converting the first voice information into the text information, a part of the character objects in the text information is acquired as target initial objects. The acquired target initial objects include, but are not limited to, the following categories: character combinations with a low usage rate in the word stock; character objects that clearly do not conform to the sentence structure, found by analyzing the structure of the text information; and, according to semantic analysis of the text information, character object types that can clearly be found in a storage system of the terminal device, such as a person name, an application name, a system tool name, or a search engine name, where the character object types that can be found in the storage system of the terminal device may be stored in advance. For example, the text information converted from the first voice information is "help me send a message to Julie", where semantic analysis shows that the character object "Julie" is the recipient of the message and is likely to be one of the user's contacts; that is, "Julie" is acquired as the target initial object in the text information.
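One of the heuristics above, picking out the object that follows a preposition in a send-style sentence, can be sketched as a toy pattern match. The regular expression and function name are illustrative assumptions only:

```python
import re

def find_target_initial_object(text):
    """Pick the likely target initial object: here, the trailing word
    following the preposition 'to' (a deliberately simple heuristic)."""
    match = re.search(r"\bto\s+(\w+)\s*$", text)
    return match.group(1) if match else None

print(find_target_initial_object("help me send a message to Julie"))  # -> Julie
print(find_target_initial_object("open the calculator"))              # -> None
```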
S303, identifying the type of the target initial object;
In one embodiment, the types of the target initial object include, but are not limited to: a contact type, an application name type, a system tool type, and an application program type, which can be searched directly in a database of the terminal. In one embodiment, the method for identifying the type of the text information may infer the meaning of a sentence according to its structure and sentence type, and identify the type accordingly. The mobile phone voice assistant may store the basic structures of sentences and, when converting the first voice information, mark the structure of the converted text information, mark text character objects whose structure is clearly abnormal, or mark character objects that can be found in a storage system of the terminal device, such as a person name, an application name, a system tool name, or a search engine name. For example, to identify the type of the target initial object "Julie": in the whole sentence "send a message to Julie", the word "to" is followed by an object and corresponds to "send message", so it is concluded by association that "Julie" may be the recipient of the "message" and is likely of the contact type.
In one embodiment, the type of the target initial object may also be identified by association and inference on its existing character objects in combination with the sentence structure; the mobile phone voice assistant may have a basic word stock and may make associations on the converted text information. For example, the voice information the user wants to input is "hello, please help me to turn on the flashlight", but the converted text information contains the uncommon word combination "receive flashlight", so the target initial object is "receive flashlight". According to the character "flashlight" in the target initial object and the preceding verb "turn on", it is inferred that the target initial object is a noun combination naming the flashlight application, that is, an application name.
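The verb-context inference described in the last two paragraphs can be sketched as a small rule table. The hint rules and function name are illustrative assumptions; a real assistant would use far richer semantic analysis:

```python
def infer_object_type(text, target):
    """Guess the type of `target` from the verb context preceding it.
    The two hint rules below are illustrative only."""
    before = text[: text.find(target)].lower()
    if "send" in before and " to " in before:
        return "contact"          # "...send a message to <target>"
    if "turn on" in before or "open" in before:
        return "application"      # "...turn on the <target>"
    return "unknown"

print(infer_object_type("help me send a message to Julie", "Julie"))       # -> contact
print(infer_object_type("please help me turn on the receive flashlight",
                        "receive flashlight"))                             # -> application
```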
S304, calling an association database corresponding to the type of the target initial object, or searching for an object corresponding to the type of the target initial object in a networked manner. In one embodiment, when the type of the target initial object has been identified, it is first necessary to determine the association database of the target initial object; the type of the target initial object and its association database may be stored in advance, or may be obtained through intelligent analysis and association by the voice assistant. For example, when the type of the target initial object is identified as the contact type, the address book database stored by the terminal is used as the association database, and once determined, the association database is called after authorization.
When the type of the target initial object is identified as the contact type, the address book database stored by the terminal is used as the association database. For example, when the target initial object "Julie" is identified as the contact type, the terminal device grants the voice recognition device access to the address book database, and character objects related to "Julie" are searched in the association database on an authorized basis.
When the type of the target initial object is identified as the application name type, the system application database recorded by the terminal is used as the association database. For example, when the target initial object "receive flashlight" is identified as the application name type, the terminal device grants the voice recognition device access to the system application database, and character objects related to "receive flashlight" are searched on an authorized basis.
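Dispatching to the association database by type and fuzzy-searching it can be sketched as follows. The database contents, type keys, and cutoff value are hypothetical stand-ins for the terminal's address book and system application databases:

```python
from difflib import get_close_matches

# Stand-ins for the terminal's address book and system application databases.
ASSOCIATION_DATABASES = {
    "contact":     ["Julia", "Julian", "Jolia", "Bob"],
    "application": ["Flashlight", "Calculator", "Notes"],
}

def search_association_database(object_type, target, cutoff=0.5):
    """Return candidate corrections from the database for this type,
    best match first (empty list if the type is unknown)."""
    database = ASSOCIATION_DATABASES.get(object_type, [])
    return get_close_matches(target, database, n=3, cutoff=cutoff)

print(search_association_database("contact", "Julie"))
# -> ['Julia', 'Julian', 'Jolia']
```

The best match would update the target initial object, and the rest would be retained as candidate objects, mirroring steps S304-S305.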
In one embodiment, when the type of the target initial object does not conform to any type stored in the terminal in advance and the terminal cannot analyze the target initial object, the terminal may search for an object corresponding to the type of the target initial object in a networked manner.
S305, if an object whose matching degree with the target initial object satisfies a preset threshold is found in the association database or through a networked search, updating the target initial object according to the found object to obtain initial information;
In one embodiment, the association database corresponding to the type of the target initial object is called, and if an object whose matching degree with the target initial object satisfies a first preset threshold is found in the association database, the target initial object is updated according to the found object to obtain the initial information. The first preset threshold may be stored in advance by the system or may be user-defined; it may be 80%, 85%, 95%, or the like, and the matching degree against the first threshold is calculated during voice recognition by the device providing the voice assistant function. One or more character objects satisfying the first preset threshold may be found; the character object with the highest matching degree is selected to update the target initial object, and the remaining character objects may serve as candidate objects.
In one embodiment, an object corresponding to the type of the target initial object is searched in a networked manner, and if an object whose matching degree with the target initial object satisfies a second preset threshold is found, the target initial object is updated according to the found object to obtain the initial information. The second preset threshold may be stored in advance by the system or may be user-defined, and the matching degree against the second threshold is calculated during voice recognition by the device providing the voice assistant function.
S306, detecting the operation acted on the user interface;
In one embodiment, after the initial information is generated according to the voice information and output on the user interface, the user can judge whether the initial information displayed on the user interface is correct. If it is incorrect, the user can modify it by performing a text operation on the incorrectly recognized information in the initial information, or click the voice correction identifier displayed on the user interface to re-input voice information for modification.
S307, when a first operation aiming at the initial information is detected, outputting candidate information corresponding to the initial information;
in one embodiment, when a first operation aiming at the initial information is detected, a target object selected by the first operation is determined, and a preview interface is displayed, wherein the preview interface displays candidate objects of the target object, and the candidate information comprises the target object selected by the first operation and the preview interface;
In one embodiment, the first operation is a pressing operation performed by the user on the touch display screen with respect to the initial information; the incorrectly recognized text information in the initial information is selected through the first operation and used to determine the target object that the user wants to modify. The first operation includes at least one of: a long press, repeated presses, a slide, an air gesture operation, and N click operations in which the time interval between two adjacent clicks is smaller than a preset threshold, where N is an integer greater than or equal to 2.
S308, when a second operation aimed at the candidate information is detected, acquiring a correction object, and updating the initial information according to the correction object to obtain and/or output target information. The second operation includes at least one of: a long press, repeated presses, a slide, an air gesture operation, and N click operations in which the time interval between two adjacent clicks is smaller than a preset threshold, where N is an integer greater than or equal to 2; the second operation may be the same as or different from the first operation.
Updating the initial information means obtaining the target information by directly replacing the target object with the correction object. The displayed target information needs to be distinguished from the initial information, for example, but not limited to, by displaying the string "corrected information" on the user interface to separate the initial dialog information from the voice-updated text information.
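The replace-and-label step can be sketched in a few lines. The function name and output format are illustrative assumptions:

```python
def update_initial_information(initial, target_object, correction_object):
    """Replace the incorrectly recognized target object with the correction
    object and label the result so it is distinguishable on the interface."""
    target_info = initial.replace(target_object, correction_object, 1)
    return f"corrected information: {target_info}"

print(update_initial_information(
    "help me send a message to Julie", "Julie", "Julian"))
# -> corrected information: help me send a message to Julian
```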
In one embodiment, in S306, when the user is detected to operate the speech correction identifier, the method may further include the following steps:
s3011, when detecting the operation aiming at the voice correction identifier, re-inputting voice correction information through the voice correction identifier;
In one embodiment, the voice correction identifier is always displayed on the user interface; or, when a first operation aimed at the initial information is detected, it is determined that the initial information was recognized incorrectly, and the voice correction identifier is displayed. The voice correction identifier may be displayed on the same user interface at the same time as the preview interface, with a display position preferably directly below the user interface, as indicated by 207 in fig. 2b.
The voice correction identifier is used for the user to re-input voice information, re-entering the voice of a selected, incorrectly recognized character object. When the user presses the voice correction identifier, the microphone is opened to collect voice information for as long as the user keeps pressing. The correction icon may be a microphone icon, a loudspeaker icon, or the like, and text may be placed beside it to prompt the user to re-enter voice information, such as "Please say it again…".
When a pressing operation is detected in the touch area of the voice correction identifier on the user interface, the microphone is opened and the user's voice correction information is collected during the pressing time. The voice correction information may be a re-entered whole sentence, or voice information re-entered only for the incorrect target object in the initial information.
When the user presses the incorrectly recognized character object in the initial information to select it as the target object, the target object is marked and may be highlighted so that the user can distinguish it from other characters, where highlighting includes but is not limited to changing its color, font size, or font weight, or underlining it. With the target object marked, when a pressing operation is detected in the touch area of the voice correction identifier on the user interface, the microphone is opened to collect voice correction information during the pressing time, and the collected voice correction information is used to correct the target object.
S3012, updating the initial information according to the voice correction information; and/or, outputting the initial information and/or the voice correction information;
In one embodiment, the voice-updated text information converted from the voice correction information may be displayed directly on the user interface. After the target object is marked, voice correction information is re-input through the voice correction identifier, the target object in the initial information is directly replaced with the text information converted from the voice correction information, the initial information is updated, and the voice-updated text information is displayed on the user interface. The displayed voice-updated text information needs to be distinguished from the initial information, including but not limited to the following way: displaying the string "corrected information" on the user interface to separate the initial dialog information from the voice-updated text information.
In one embodiment, if the user cannot correct the target object by the above methods, the invention may further provide a manual modification mode. After selecting the target object, the user may enter the manual modification mode by right-clicking to pop up a keyboard page, which may be displayed on the same user interface as the preview page; the user can then enter the correct character object through the keyboard to replace the target object and update the initial information.
In one embodiment, the target object and the correction object may be associated, so that when the target object is acquired again, the corresponding correction object is preferentially displayed. The target initial object and the character object used to update it may be associated, so that when the target initial object is acquired again, the associated character object is preferentially displayed. The voice correction information and the voice correction text information may be associated, so that when the voice correction information is acquired, the voice correction text information is preferentially displayed. The target object and the corresponding character object manually modified by the user may also be associated, so that when the target object is acquired again, the user's manually modified character object is preferentially displayed.
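The association mechanism above amounts to remembering accepted corrections and preferring them on re-acquisition. A minimal sketch, with a hypothetical class name:

```python
class CorrectionMemory:
    """Remember pairs of (misrecognized object, accepted correction) so that
    the correction is preferred the next time the same object appears."""

    def __init__(self):
        self._associations = {}

    def associate(self, target_object, correction_object):
        self._associations[target_object] = correction_object

    def preferred(self, target_object):
        # Fall back to the object itself when no association exists.
        return self._associations.get(target_object, target_object)

memory = CorrectionMemory()
memory.associate("Julie", "Julian")
print(memory.preferred("Julie"))   # -> Julian
print(memory.preferred("Bob"))     # -> Bob
```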
S309, performing semantic analysis on the initial information or the target information, and outputting a control instruction;
In one embodiment, the control instruction can be generated through recognition by a device with a mobile phone voice assistant function: after semantic analysis of the target information, a corresponding control instruction is generated to call a corresponding program. For example, the target information is "help me send a message to Julian"; after analysis, it is known that the user needs to send a message and that the recipient is "Julian", so an instruction for sending a message to "Julian" is generated. The control instruction may be computer code obtained through programming; through the control instruction, the mailbox function is called, the contact's mailbox address is found, and the mail is sent.
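The semantic-analysis-to-instruction step can be sketched as a tiny intent parser. The single send pattern and the instruction record format are illustrative assumptions, not the patent's actual instruction encoding:

```python
import re

def to_control_instruction(target_info):
    """Turn a corrected utterance into a small instruction record."""
    match = re.search(r"send a (message|mail) to (\w+)", target_info)
    if match:
        return {"action": "send_" + match.group(1), "recipient": match.group(2)}
    return {"action": "unknown"}

print(to_control_instruction("help me send a message to Julian"))
# -> {'action': 'send_message', 'recipient': 'Julian'}
```

The resulting record could then be dispatched to the mailbox or messaging program, as the paragraph above describes.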
In an embodiment, the display positions of the initial information, the candidate information, the target information, and the control instruction may include at least one of a current interface, a preset fixed screen area, and a floating window, and the display positions of the current interface, the preset fixed screen area, and the floating window may be the same or different.
In one embodiment, the method is applied to the voice recognition system, which comprises at least one first terminal and at least one second terminal, wherein the first terminal is used for receiving the first voice information and the second terminal is used for outputting the voice correction identifier and receiving re-entered voice correction information. The first terminal may be a terminal device used for receiving voice information, such as a microphone, an earphone, or a loudspeaker; the second terminal may be a terminal device capable of displaying control instructions, such as a mobile phone, a personal computer, a tablet computer, an in-vehicle system, or a television; and the first terminal and the second terminal may be connected in a wired or wireless manner.
In one embodiment, the method is applied to the voice recognition system, the voice recognition system comprises at least one first terminal and at least one second terminal, the first terminal is used for receiving the first voice information, and the second terminal is used for outputting the control instruction.
In the embodiment of the invention, the received voice information is recognized and converted into text information, the type of the initial object in the text information is identified, and the corresponding corrected characters are obtained automatically by searching a terminal database or the network; alternatively, the incorrectly recognized target object is found manually by the user for correction. Voice information can thus be recognized and modified quickly and accurately, improving voice recognition efficiency.
Referring to fig. 4 again, fig. 4 is a flowchart of another speech recognition method according to the present invention, which may be executed by a smart terminal, for example a smart phone, a tablet computer, a smart wearable device, a vehicle-mounted system, or a television. The method specifically includes the following steps:
S401, before the first voice information is received, it is detected whether the voice assistant of the terminal device is turned on; if not, an instruction for turning on the voice assistant is sent to the terminal device, and the voice assistant is started.
S402, receiving first voice information; in one embodiment, the first voice information may be collected from the voice information of the user through a microphone of the terminal device.
S403, converting the first voice information into text information;
In one embodiment, the first voice information may be matched against a voice library of the terminal device to obtain text information with a high pronunciation matching degree with the first voice information. The voice library includes, but is not limited to, dialect voice libraries corresponding to the dialects of different regions, language voice libraries corresponding to the languages of different countries, and the like; when the first voice information is received, the geographical position information of the mobile terminal user may be obtained, and the corresponding dialect voice library or language voice library loaded according to the position information.
S404, displaying the initial information on a user interface, wherein the initial information is the text information selected as having the highest matching degree with the first voice information, and the user interface is used for displaying the dialog process of a natural conversation between the voice assistant and the user;
S405, judging whether the initial information is correct: the user judges whether the initial text is correct. When a first operation aimed at the initial information is detected on the touch screen, the initial information is incorrect, and step S406 is executed; if the first operation is not detected, the initial information is correct, and step S410 is executed.
S406, if the initial information is judged to be incorrect, marking the target object selected by the first operation and displaying a voice correction identifier on the user interface. In one embodiment, the user may select the incorrect target object through the first operation, where the first operation may include a long press whose duration is greater than a preset threshold, or N click operations in which the time interval between two consecutive clicks is less than a preset threshold, where N is an integer greater than or equal to 2, for example 2, 3, or 4.
S407, re-inputting voice correction information through the voice correction identifier;
In one embodiment, the user may re-enter whole-sentence voice correction information without selecting a character object; or the incorrectly recognized target object may be selected and the voice correction information entered only for that target object, in which case the voice correction information is used only for correcting the target object.
S408, updating the initial information according to the voice correction information: the voice correction information is converted into text information in the same way as the first voice information was converted into the initial text information, and the initial information is updated according to that text information.
S409, displaying the initial information and/or the voice correction information, wherein the voice-updated text information may be obtained by converting directly input whole-sentence voice correction information, or may be voice-updated text information that updates the initial information through voice correction.
S410, after the voice-updated text information is displayed, performing semantic analysis on it to generate a control instruction; for the method of generating the control instruction, refer to S309 in the foregoing embodiment.
The received voice information is recognized and converted into the initial information, the incorrectly recognized target object selected by the user is obtained, and voice correction information is re-entered to update the target object or the initial information.
Referring again to fig. 5 and fig. 6, fig. 5 shows a speech recognition system according to an embodiment of the present invention, and fig. 6 shows another speech recognition method according to the present invention; the method shown in fig. 6 can be applied to the system shown in fig. 5. The voice recognition system comprises at least one first terminal 501 and at least one second terminal 502. The first terminal may be a terminal device that can receive voice information, such as a microphone, an earphone, a loudspeaker, a mobile phone, or a personal computer; the second terminal may be a terminal device that has a display function and can execute control instructions, such as a mobile phone, a personal computer, a tablet computer, an in-vehicle system, or a television; and the first terminal and the second terminal may be connected in a wired manner or communicate wirelessly, for example via WIFI or Bluetooth.
In one embodiment, the first terminal may include a first processor and a first memory, and may further include a microphone circuit for receiving voice information; the second terminal may include a second processor and a second memory, and may further include a display screen. The first terminal and the second terminal may further include communication interfaces for communication. The first memory and/or the second memory stores a computer program comprising program instructions, and the first processor and/or the second processor is configured to call the program instructions to execute the voice recognition method shown in fig. 6; for the specific implementation of S601-S608, refer to S301-S308:
S601, the first terminal receives first voice information and sends it to the second terminal, and the second terminal converts the first voice information. In one embodiment, in S601 the first terminal may instead receive the first voice information and convert it itself.
S602, the second terminal acquires the target initial object obtained by converting the first voice information.
S603, the second terminal identifies the type of the target initial object; in one embodiment, the type of the initial object includes, but is not limited to, a contact type, an application name type, a system tool type, and an application program type, which can be directly found in a database of the terminal.
S604, the second terminal calls the associated database corresponding to the target initial object type or searches for the object corresponding to the target initial object type in a networking mode.
In one embodiment, when the second terminal identifies that the type of the target initial object is a contact type, an address book database stored by the terminal is used as the association database.
In one embodiment, when the second terminal identifies the type of the target object as an application name type, a system application database recorded by the terminal is used as the association database.
In one embodiment, when the type of the target initial object is identified as unknown, the second terminal searches for an object corresponding to the type of the target initial object in a networking manner.
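The type-to-database dispatch described in S603-S604 can be sketched as follows. This is a minimal illustration under stated assumptions: the type names and the two databases are invented for this sketch, and a real terminal would instead query its on-device contacts provider and installed-application list.

```python
# Hypothetical sketch of S603-S604: choose an association database by the
# identified type of the target initial object. The type names and the two
# databases below are invented for illustration.

ADDRESS_BOOK = ["Alice Zhang", "Bob Li"]          # assumed contact database
APP_DATABASE = ["WeChat", "Camera", "Settings"]   # assumed application database

def select_association_database(object_type):
    """Return the association database for a recognized type, or None to
    signal that a networked search is needed (unknown type)."""
    if object_type == "contact":
        return ADDRESS_BOOK
    if object_type == "app_name":
        return APP_DATABASE
    return None  # unknown type: fall back to a networked search
```

Returning `None` for an unknown type mirrors the embodiment above, where the second terminal falls back to searching for the object in a networked manner.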
S605, if the second terminal finds, in the association database or through a networked search, an object whose matching degree with the target initial object meets a preset threshold, the second terminal updates the initial object according to the found object to obtain initial information;
in an embodiment, the second terminal may call an association database corresponding to the target initial object type, and if an object whose matching degree with the target initial object satisfies a first preset threshold is found in the association database, update the target initial object according to the found object to obtain initial information.
In an embodiment, the second terminal may search for an object corresponding to the target initial object type in a network, and update the target initial object according to the searched object to obtain the initial information if the object whose matching degree with the target initial object satisfies a second preset threshold is found. The first preset threshold and the second preset threshold may be the same or different.
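The threshold matching of S605 and the two preset thresholds above can be sketched with a generic string-similarity measure. The `difflib` ratio and the concrete threshold values are assumptions for illustration; the patent does not define how the matching degree is computed.

```python
# Hypothetical sketch of S605: pick the database object whose matching
# degree with the target initial object meets a preset threshold.
from difflib import SequenceMatcher

FIRST_PRESET_THRESHOLD = 0.8   # assumed value for the local association database
SECOND_PRESET_THRESHOLD = 0.7  # assumed value for the networked search

def best_match(target, candidates, threshold):
    """Return the candidate whose matching degree with `target` meets the
    threshold (highest score wins), or None if no candidate qualifies."""
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, target, cand).ratio()
        if score >= threshold and score > best_score:
            best, best_score = cand, score
    return best

def update_initial_object(target, database):
    """Replace the target initial object with the matched object, or keep
    it unchanged when nothing in the database qualifies."""
    match = best_match(target, database, FIRST_PRESET_THRESHOLD)
    return match if match is not None else target
```

For example, a misrecognized "Alise" would be replaced by the stored contact "Alice", while a target with no sufficiently similar database entry is kept as-is for the later manual or voice correction steps.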
In one embodiment, steps S601-S605 may also be performed by the first terminal: the first terminal converts the first voice information into initial information and sends the initial information to the second terminal.
S606, the first terminal and/or the second terminal detect an operation acting on a user interface; in an embodiment, after the second terminal generates the initial information according to the voice information and outputs the initial information on the user interface, the user can determine whether the displayed initial information is correct; if not, the user can perform a text-modification operation on the incorrectly recognized part of the initial information on the second terminal, or click the voice correction identifier displayed on the user interface to re-enter voice information for modification.
S607, when a first operation aiming at the initial information is detected, outputting candidate information corresponding to the initial information on the first terminal and/or the second terminal;
in one embodiment, when the second terminal detects a first operation on the initial information, a target object selected by the first operation is determined, and a preview interface is displayed, wherein the preview interface displays candidate objects of the target object, and the candidate information comprises the preview interface corresponding to the target object selected by the first operation; in one embodiment, the first operation comprises: a long press, repeated presses, a slide, an air gesture operation, or N click operations, wherein the time interval between two adjacent click operations is less than a preset threshold and N is an integer greater than or equal to 2.
S608, when a second operation aiming at the candidate information is detected, acquiring a correction object; updating the initial information according to the correction object to obtain target information, and/or outputting the target information on the first terminal and/or the second terminal.
In one embodiment, the second operation comprises: a long press, repeated presses, a slide, an air gesture operation, or N click operations, wherein the time interval between two adjacent click operations is less than a preset threshold and N is an integer greater than or equal to 2; the first operation and the second operation may be the same or different.
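The N-click operation described above, where every gap between adjacent clicks must stay below the preset threshold, can be sketched as follows; the timestamp representation and the threshold value are assumptions.

```python
# Hypothetical sketch of the N-click operation: N taps (N >= 2) count as one
# operation only when every gap between adjacent taps is below a preset
# threshold. Timestamps are in seconds; the threshold value is assumed.

CLICK_INTERVAL_THRESHOLD = 0.3  # assumed preset threshold, in seconds

def is_n_click(timestamps, n):
    """Return True when `timestamps` form a valid N-click operation."""
    if n < 2 or len(timestamps) != n:
        return False
    return all(later - earlier < CLICK_INTERVAL_THRESHOLD
               for earlier, later in zip(timestamps, timestamps[1:]))
```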
In an embodiment, S606, S607, and S608 may also be executed by the first terminal: the first terminal receives the first voice information, converts it into the initial information, acquires the target object of the initial information and displays candidate objects of the target object on a preview interface, acquires the correction object according to the selected object, and updates the initial information according to the correction object to obtain and/or display the target information.
In one embodiment, when the operation detected on the user interface in S606 is an operation on a voice correction identifier, this embodiment may further include the following steps: S6011, when the operation on the voice correction identifier is detected, the second terminal receives voice correction information re-entered through the voice correction identifier;
S6012, updating the initial information according to the voice correction information; and/or outputting the voice correction information and/or the initial information through the second terminal.
In one embodiment, S6011-S6012 may be further performed by the first terminal, and when the first terminal detects an operation for a speech correction identifier, the first terminal re-enters speech correction information through the speech correction identifier, receives the speech correction information, and updates the initial information according to the speech correction information; and/or outputting the voice correction information and/or the initial information through the first terminal.
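The update of S6012 can be sketched as follows. The replace-first-occurrence merge strategy is an assumption made for this sketch; the patent leaves open how the re-entered voice correction information is merged into the initial information.

```python
# Hypothetical sketch of S6011-S6012: merging re-entered voice correction
# information into the initial information. The replace-first-occurrence
# strategy is an assumption, not the patent's prescribed behavior.

def apply_voice_correction(initial_info, erroneous_segment, correction_text):
    """Replace the first occurrence of the erroneous segment in the
    initial information with the recognized correction text."""
    if erroneous_segment not in initial_info:
        return initial_info  # nothing matched; keep the initial information
    return initial_info.replace(erroneous_segment, correction_text, 1)
```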
S609, the second terminal performs semantic analysis on the initial information or the target information and outputs and displays a control instruction; in one embodiment, the second terminal generates the control instruction from the target information and calls a corresponding program.
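A minimal sketch of the semantic analysis in S609, assuming a rule-based verb table; the verb table, the instruction format, and the parsing by simple word splitting are all invented for illustration, since the patent does not specify a parsing algorithm.

```python
# Hypothetical sketch of S609: a rule-based lookup that maps the (corrected)
# target information to a control instruction. The verb table and the
# instruction string format are invented for illustration.

VERB_TABLE = {
    "call": "dialer.start_call",
    "open": "launcher.open_app",
    "play": "player.play_media",
}

def to_control_instruction(target_info):
    """Split the target information into a verb and its argument, and
    build a control instruction string; None means display-only text."""
    verb, _, arg = target_info.partition(" ")
    action = VERB_TABLE.get(verb.lower())
    if action is None:
        return None
    return f"{action}({arg!r})"
```

With this sketch, corrected target information such as "call Alice" yields an instruction that the terminal can use to invoke the corresponding program, while text with no recognized verb is simply displayed.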
In one embodiment, the type of the initial information or the type of the target information includes at least one of text, image, audio, video, and file.
In an embodiment, the display positions of the initial information, the candidate information, the target information, and the control instruction may include at least one of a current interface, a preset fixed screen area, and a floating window, and the display positions of the current interface, the preset fixed screen area, and the floating window may be the same or different.
In the embodiment of the invention, the first terminal and the second terminal are used together: the first terminal receives voice information, and the second terminal converts the voice information into text information and modifies the text information both by text and by voice. The first terminal and the second terminal can be connected in a wired or wireless manner, which meets users' diverse needs for voice reception; moreover, when the second terminal itself has no voice receiving function, the voice recognition function is still realized through cooperation with the first terminal, so the applicability is better.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a speech recognition device of the present invention. The device of the embodiment of the present invention may be disposed in an intelligent terminal, which may specifically be a terminal such as a smart phone, a tablet computer, an intelligent wearable device, a vehicle-mounted system, or a television. The device includes the following modules: the semantic analysis module 701 is configured to receive first voice information, convert the first voice information into initial information, and output the initial information.
In an embodiment, while the semantic analysis module 701 operates, the speech recognition apparatus further operates the searching module 702. The searching module 702 is configured to obtain a target initial object obtained according to the conversion in the first voice information; identify a type of the target initial object; and obtain initial information according to the type of the target initial object.

In an embodiment, the searching module 702 is further configured to call an association database corresponding to the target initial object type, and if an object whose matching degree with the target initial object meets a first preset threshold is found in the association database, update the target initial object according to the found object to obtain initial information.

In an embodiment, the searching module 702 is further configured to search for an object corresponding to the target initial object type in a networked manner, and if an object whose matching degree with the target initial object meets a second preset threshold is found, update the target initial object according to the found object to obtain initial information.
In one embodiment, the searching module 702 is further configured to, when the type of the target initial object is identified as a contact type, use a database of an address book stored in the terminal as the association database.
In one embodiment, the searching module 702 is further configured to, when the type of the target initial object is identified as the application name type, use an application database recorded by the terminal as the association database.
In one embodiment, the search module 702 is further configured to, when the type of the target initial object is identified as unknown, network search for an object corresponding to the target initial object type.
A display module 703, configured to output candidate information corresponding to the initial information when a first operation on the initial information is detected;
an obtaining module 704, configured to obtain a correction object when a second operation on the candidate information is detected;
the correcting module 705 is configured to update the initial information according to the correcting object, and obtain and/or output target information.
In one embodiment, the apparatus further includes a voice correction module 706: the voice correction module 706 is configured to output a voice correction identifier; receive voice correction information re-entered through the voice correction identifier; update the initial information according to the voice correction information; and/or output the voice correction information.
And the command generating module 707 is configured to perform semantic analysis on the initial information or the target information, and output a control instruction.
It is to be understood that, for specific implementation of each functional module in the embodiments of the present invention, reference may be made to the description related to the foregoing method embodiment, which is not described herein again.
In the embodiment of the invention, the incorrectly recognized content in the initial information is selectively replaced and modified; the operation is simple, saves time, and meets users' needs, so that voice information can be recognized and modified quickly and accurately, improving voice recognition efficiency.

Referring to fig. 8 again, fig. 8 is a schematic structural diagram of the intelligent terminal of the present invention. The intelligent terminal according to the embodiment of the invention may be a terminal such as a smart phone, a tablet computer, or an intelligent wearable device. The intelligent terminal at least comprises a processor 801, a storage device 802 and a user interface 803, which are connected with each other; the storage device 802 is used for storing a computer program comprising program instructions, and the processor 801 is used for executing the program instructions.
The user interface 803 may be a touch display capable of receiving a user's input operation, a microphone capable of receiving voice information input by the user, a speaker capable of issuing a voice prompt to the user, or the like.
The storage device 802 may include a volatile memory, such as a random-access memory (RAM); it may also include a non-volatile memory, such as a flash memory or a solid-state drive (SSD); the storage device may also comprise a combination of the above kinds of memory.
The processor 801 may be a Central Processing Unit (CPU). The processor may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or the like. The PLD may be a field-programmable gate array (FPGA), a General Array Logic (GAL), or the like.
In one embodiment, the storage device 802 is further configured to store program instructions that the processor 801 may invoke: receiving first voice information, converting the first voice information into initial information, and outputting the initial information; outputting candidate information corresponding to the initial information when a first operation for the initial information is detected; when a second operation aiming at the candidate information is detected, acquiring a correction object; and updating the initial information according to the correction object to obtain and/or output target information.
In one embodiment, the processor 801, when receiving first voice information and converting the first voice information into initial information, acquires a target initial object obtained according to conversion in the first voice information; identifying a type of the target initial object; and obtaining initial information according to the type of the target initial object.
In an embodiment, when the processor 801 executes the obtaining of the initial information according to the target initial object type, the processor 801 is specifically configured to invoke an association database corresponding to the target initial object type, and if an object whose matching degree with the target initial object meets a first preset threshold is found in the association database, update the target initial object according to the found object to obtain the initial information; or, searching an object corresponding to the target initial object type in a networking manner, and if the object with the target initial object matching degree meeting a second preset threshold is found, updating the target initial object according to the found object to obtain initial information.
In an embodiment, when the processor 801 executes the obtaining of the initial information according to the type of the target initial object, the processor 801 is specifically configured to, when the type of the target initial object is identified as a contact type, use an address book database stored in a terminal as the association database; and/or when the type of the target initial object is identified as the application name type, taking an application database recorded by the terminal as the association database; and/or when the type of the target initial object is identified to be an unknown type, networking and searching for the object corresponding to the type of the target initial object.
In one embodiment, the processor 801 is further configured to output a speech correction identifier; re-inputting voice correction information through the voice correction identifier; updating the initial information according to the voice correction information; and/or outputting the initial information and/or the voice correction information.
In one embodiment, the processor 801 is further configured to perform semantic analysis on the initial information or the target information, and output a control instruction.
It is to be understood that, for the specific implementation of the processor 801 in the embodiment of the present invention, reference may be made to the description related to the foregoing method embodiment, which is not repeated herein.
Furthermore, the present invention also discloses a computer storage medium, in which program instructions are stored, and when executed, the program instructions are used for implementing the speech recognition method as described in fig. 1 or fig. 3, fig. 4 or fig. 6.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the invention has been described with reference to a number of embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (27)

1. A speech recognition method, comprising the steps of:
receiving first voice information, converting the first voice information into initial information, and outputting the initial information;
outputting candidate information corresponding to the initial information when a first operation for the initial information is detected;
when a second operation aiming at the candidate information is detected, acquiring a correction object;
and updating the initial information according to the correction object to obtain and/or output target information.
2. The method of claim 1, wherein receiving a first voice message and converting the first voice message into an initial message comprises:
acquiring a target initial object obtained according to the conversion in the first voice information;
identifying a type of the target initial object;
and obtaining initial information according to the type of the target initial object.
3. The method of claim 2, wherein obtaining initial information according to the target initial object type comprises:
calling an association database corresponding to the target initial object type, and if an object whose matching degree with the target initial object meets a first preset threshold is found in the association database, updating the target initial object according to the found object to obtain initial information; or,
searching an object corresponding to the target initial object type in a networking manner, and if an object with a matching degree meeting a second preset threshold value with the target initial object is found, updating the target initial object according to the found object to obtain initial information;
the first preset threshold and the second preset threshold are the same or different.
4. The method of claim 3,
when the type of the target initial object is identified as the contact type, taking an address book database stored by the terminal as the association database; and/or,
when the type of the target initial object is identified as the application name type, taking an application database recorded by the terminal as the association database; and/or,
and when the type of the target initial object is identified to be an unknown type, networking and searching for an object corresponding to the type of the target initial object.
5. The method of any of claims 1 to 4, further comprising:
and performing semantic analysis on the initial information or the target information, and outputting a control instruction.
6. The method of claim 5,
the position for outputting the initial information comprises at least one of a current interface, a preset fixed screen area and a floating window, or
The position for outputting the candidate information comprises at least one of a current interface, a preset fixed screen area and a floating window, or
The position for outputting the target information comprises at least one of a current interface, a preset fixed screen area and a floating window, or
The position for outputting the control instruction comprises at least one of a current interface, a preset fixed screen area and a floating window;
the initial information, the candidate information, the target information and the control instruction output positions are the same or different.
7. The method according to claim 5, wherein the method is applied to a voice recognition system comprising at least one first terminal and at least one second terminal, the first terminal being used for receiving the first voice information, and the second terminal being used for outputting the control instruction.
8. The method of any of claims 1 to 4, further comprising:
outputting a voice correction identifier;
re-inputting voice correction information through the voice correction identifier;
updating the initial information according to the voice correction information; and/or,
and outputting the initial information and/or the voice correction information.
9. The method according to claim 8, wherein the method is applied to a speech recognition system comprising at least one first terminal and at least one second terminal, the first terminal being configured to receive the first speech information, and the second terminal being configured to output the speech correction identifier and receive the re-entered speech correction information.
10. The method according to any one of claims 1 to 4,
the type of the initial information or the type of the target information comprises at least one of text, image, audio, video and file; and/or,
the first operation or the second operation includes: a long press, repeated presses, a slide, an air gesture operation, or N click operations, wherein the time interval between two adjacent click operations is less than a preset threshold and N is an integer greater than or equal to 2;
the first operation and the second operation are the same or different.
11. The method according to any one of claims 1 to 4,
the position for outputting the initial information comprises at least one of a current interface, a preset fixed screen area and a floating window, or
The position for outputting the candidate information comprises at least one of a current interface, a preset fixed screen area and a floating window, or
The position for outputting the target information comprises at least one of a current interface, a preset fixed screen area and a floating window;
the output positions of the initial information, the candidate information and the target information are the same or different.
12. The method according to any one of claims 1 to 4, wherein the method is applied to a speech recognition system comprising at least one first terminal and at least one second terminal, the first terminal being configured to receive the first speech information, and the second terminal being configured to output the initial information, the candidate information, or the target information.
13. A speech recognition method, the method being applied to a speech recognition system, the speech recognition system comprising at least one first terminal and at least one second terminal, the method comprising:
receiving first voice information from the first terminal, and converting the first voice information into initial information;
and obtaining and/or outputting target information according to the initial information.
14. The method of claim 13, wherein the obtaining and/or outputting of target information according to the initial information includes:
outputting the initial information on the first terminal and/or the second terminal;
when a first operation aiming at the initial information is detected, outputting a candidate object corresponding to the initial information on the first terminal and/or the second terminal;
when a second operation aiming at the candidate information is detected, acquiring a correction object;
updating the initial information according to the correction object to obtain target information, and/or outputting the target information on the first terminal and/or the second terminal.
15. The method of claim 14, wherein:
the position for outputting the initial information comprises at least one of a current interface, a preset fixed screen area and a floating window, or
The position for outputting the candidate information comprises at least one of a current interface, a preset fixed screen area and a floating window, or
The position for outputting the target information comprises at least one of a current interface, a preset fixed screen area and a floating window;
the display positions of the initial information, the candidate information and the target information are the same or different;
and/or,
the first operation or the second operation includes: a long press, repeated presses, a slide, an air gesture operation, or N click operations, wherein the time interval between two adjacent click operations is less than a preset threshold and N is an integer greater than or equal to 2;
the first operation and the second operation are the same or different.
16. The method according to any one of claims 13 to 15, wherein the receiving the first voice message from the first terminal and converting the first voice message into the initial message comprises:
acquiring a target initial object obtained according to the conversion in the first voice information;
identifying a type of the target initial object;
and obtaining initial information according to the type of the target initial object.
17. The method of claim 16, wherein the step of obtaining initial information according to the target initial object type comprises:
calling an association database corresponding to the target initial object type, and if an object whose matching degree with the target initial object meets a first preset threshold is found in the association database, updating the target initial object according to the found object to obtain initial information; or,
searching an object corresponding to the target initial object type in a networking manner, and if an object with a matching degree meeting a second preset threshold value with the target initial object is found, updating the target initial object according to the found object to obtain initial information;
the first preset threshold and the second preset threshold are the same or different.
18. The method of claim 17,
when the type of the target initial object is identified as the contact type, taking an address book database stored by the terminal as the association database; and/or,
when the type of the target initial object is identified as the application name type, taking a system application database recorded by the terminal as the association database; and/or,
and when the type of the target initial object is identified to be an unknown type, networking and searching for an object corresponding to the type of the target initial object.
19. The method of any of claims 13 to 15, further comprising:
and performing semantic analysis on the initial information or the target information, and outputting a control instruction.
20. The method of claim 19,
the position of the initial information output by the first terminal and/or the second terminal comprises at least one of a current interface, a preset fixed screen area and a floating window, or
The position of the candidate information output by the first terminal and/or the second terminal comprises at least one of a current interface, a preset fixed screen area and a floating window, or
The position of the target information output by the first terminal and/or the second terminal comprises at least one of a current interface, a preset fixed screen area and a floating window, or
The position of the control instruction output by the first terminal and/or the second terminal comprises at least one of a current interface, a preset fixed screen area and a floating window;
the initial information, the candidate information, the target information and the control instruction output positions are the same or different.
21. The method of any of claims 13 to 15, further comprising:
outputting a voice correction identifier through the first terminal and/or the second terminal;
receiving the voice correction information re-entered through the voice correction identifier;
updating the initial information according to the voice correction information; and/or,
and outputting the voice correction information and/or the initial information through the first terminal and/or the second terminal.
22. The method of claim 21,
the position of the initial information output by the first terminal and/or the second terminal comprises at least one of a current interface, a preset fixed screen area and a floating window, or
The position for outputting the candidate information comprises at least one of a current interface, a preset fixed screen area and a floating window, or
The position for outputting the target information comprises at least one of a current interface, a preset fixed screen area and a floating window;
the output positions of the initial information, the candidate information and the target information are the same or different.
23. The method according to any one of claims 13 to 15,
the type of the initial information or the type of the target information comprises at least one of text, image, audio, video and file.
24. An intelligent terminal comprising a display, a processor and a memory, wherein the memory is configured to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform the speech recognition method according to any one of claims 1 to 6 or 8 or 10 to 11.
25. A speech recognition system comprising at least a first terminal, at least a second terminal, wherein the first terminal comprises a first processor and a first memory, and the second terminal comprises a second processor and a second memory, wherein the first memory and/or the second memory is/are adapted to store a computer program comprising program instructions, and wherein the first processor and/or the second processor are/is configured to invoke the program instructions to perform a speech recognition method according to any of claims 7 or 9 or 12 or 13 to 23.
26. The system according to claim 25, characterized in that the first terminal comprises a first display screen and/or the second terminal comprises a second display screen.
27. A computer storage medium having stored thereon program instructions for implementing a speech recognition method according to any one of claims 1 to 23 when executed.
CN201911081516.2A 2019-11-07 2019-11-07 Voice recognition method, terminal, system and computer storage medium Active CN110827815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911081516.2A CN110827815B (en) 2019-11-07 2019-11-07 Voice recognition method, terminal, system and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911081516.2A CN110827815B (en) 2019-11-07 2019-11-07 Voice recognition method, terminal, system and computer storage medium

Publications (2)

Publication Number Publication Date
CN110827815A true CN110827815A (en) 2020-02-21
CN110827815B CN110827815B (en) 2022-07-15

Family

ID=69553147

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911081516.2A Active CN110827815B (en) 2019-11-07 2019-11-07 Voice recognition method, terminal, system and computer storage medium

Country Status (1)

Country Link
CN (1) CN110827815B (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103021403A (en) * 2012-12-31 2013-04-03 威盛电子股份有限公司 Voice recognition based selecting method and mobile terminal device and information system thereof
CN103366741A (en) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 Voice input error correction method and system
CN103578469A (en) * 2012-08-08 2014-02-12 百度在线网络技术(北京)有限公司 Method and device for showing voice recognition result
CN103645876A (en) * 2013-12-06 2014-03-19 百度在线网络技术(北京)有限公司 Voice inputting method and device
CN103903613A (en) * 2014-03-10 2014-07-02 联想(北京)有限公司 Information processing method and electronic device
CN105100352A (en) * 2015-06-24 2015-11-25 小米科技有限责任公司 Method and device for acquiring contact information
CN105808197A (en) * 2014-12-30 2016-07-27 联想(北京)有限公司 Information processing method and electronic device
CN106683677A (en) * 2015-11-06 2017-05-17 阿里巴巴集团控股有限公司 Method and device for recognizing voice
CN106933561A (en) * 2015-12-31 2017-07-07 北京搜狗科技发展有限公司 Pronunciation inputting method and terminal device
US20170272402A1 (en) * 2014-12-04 2017-09-21 Dongguan Yulong Telecommunication Tech Co., Ltd. Method and electronic device for searching for special contacts


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112036350A (en) * 2020-09-07 2020-12-04 山东山科数字经济研究院有限公司 User investigation method and system based on government affair cloud
CN112036350B (en) * 2020-09-07 2022-01-28 山东山科数字经济研究院有限公司 User investigation method and system based on government affair cloud


Similar Documents

Publication Publication Date Title
CN106251869B (en) Voice processing method and device
CN110444198B (en) Retrieval method, retrieval device, computer equipment and storage medium
CN107632980B (en) Voice translation method and device for voice translation
US9508028B2 (en) Converting text strings into number strings, such as via a touchscreen input
JP5048174B2 (en) Method and apparatus for recognizing user utterance
CN105931644B (en) A kind of audio recognition method and mobile terminal
JP7116088B2 (en) Speech information processing method, device, program and recording medium
JP2005196140A (en) Method for inputting text
CN109326284B (en) Voice search method, apparatus and storage medium
CN110827803A (en) Method, device and equipment for constructing dialect pronunciation dictionary and readable storage medium
CN107564526B (en) Processing method, apparatus and machine-readable medium
KR20190051600A (en) Apparatus and method for recommending function of vehicle
CN111949255A (en) Script compiling method, device, equipment and storage medium based on voice
CN109002184A (en) A kind of association method and device of input method candidate word
CN110910903A (en) Speech emotion recognition method, device, equipment and computer readable storage medium
CN111583919A (en) Information processing method, device and storage medium
CN101405693A (en) Personal synergic filtering of multimodal inputs
US20040176139A1 (en) Method and wireless communication device using voice recognition for entering text characters
CN110069143B (en) Information error correction preventing method and device and electronic equipment
CN109688271A (en) The method, apparatus and terminal device of contact information input
CN112735396A (en) Speech recognition error correction method, device and storage medium
CN110827815B (en) Voice recognition method, terminal, system and computer storage medium
CN111414772A (en) Machine translation method, device and medium
JP2003140690A (en) Information system, electronic equipment, and program
KR100919227B1 (en) The method and apparatus for recognizing speech for navigation system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant