CN113299290A - Method and device for speech recognition, electronic equipment and readable storage medium


Info

Publication number: CN113299290A
Application number: CN202110369099.2A
Authority: CN (China)
Legal status: Pending
Prior art keywords: information, voice, input, target application, application program
Other languages: Chinese (zh)
Inventor: 梁浩
Assignee (original and current): Vivo Mobile Communication Co Ltd
Application filed by: Vivo Mobile Communication Co Ltd
Related application: PCT/CN2022/085338 (WO2022213986A1)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • User Interface Of Digital Computer (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a speech recognition method, a speech recognition device, electronic equipment, and a readable storage medium, and belongs to the field of communication technology. The method includes: receiving a first input from a user when voice information has been acquired, the first input being an input that triggers the starting of a target application program; and, in response to the first input, displaying, through the target application program, first key information in the voice information, where the first key information is associated with the type of the target application program.

Description

Method and device for speech recognition, electronic equipment and readable storage medium
Technical Field
The application belongs to the technical field of communication, and particularly relates to a voice recognition method, a voice recognition device, electronic equipment and a readable storage medium.
Background
With the development of electronic technology, voice chat has become one of the main ways of chatting remotely. Voice chat mainly includes phone calls, voice calls, and voice messages. During such voice chats, key information such as names, phone numbers, addresses, meeting times, and meeting places usually has to be recorded manually with pen and paper or typed into text editing software.
In the related art, after the voice chat ends, the stored voice information is put through speech recognition in text editing software to generate text information, and the user then saves the key information from that text by screening it manually.
However, key information makes up only a small proportion of the voice information of a voice chat, so the resulting text contains a large amount of redundant information, which lowers the efficiency of obtaining the key information.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method, an apparatus, an electronic device, and a readable storage medium for speech recognition, which can solve the problem of low efficiency in acquiring key information during speech recognition.
In order to solve the technical problem, the present application is implemented as follows:
in a first aspect, an embodiment of the present application provides a method for speech recognition. The method comprises the following steps: receiving a first input of a user under the condition of acquiring voice information; and responding to the first input, and displaying first key information in the voice information through a target application program, wherein the first key information is associated with the type of the target application program.
In a second aspect, an embodiment of the present application provides an apparatus for speech recognition. The device includes: the device comprises a first receiving module and a first display module; the first receiving module is configured to receive a first input of a user when the voice information is acquired; the first display module is configured to display, through a target application program, first key information in the voice information in response to the first input, where the first key information is associated with a type of the target application program.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, and when executed by the processor, the program or instructions implement the steps of the method as provided in the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium on which a program or instructions are stored, which when executed by a processor implement the steps of the method as provided in the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method as provided in the first aspect.
In a sixth aspect, the present application provides a computer program product stored in a non-volatile storage medium, the program product being executed by at least one processor to implement the method as provided in the first aspect.
In the embodiments of the application, when voice information is acquired and a first input is received, the first input can be responded to and first key information in the voice information can be displayed through the target application, where the first key information is associated with the type of the target application. Because the first key information associated with the target application is extracted from the voice information and displayed directly, the efficiency of extracting key information is improved, pointless recognition of the rest of the voice information is avoided, and the human-computer interaction performance of the electronic equipment is improved.
Drawings
Fig. 1 is a schematic diagram of a speech recognition method according to an embodiment of the present application;
fig. 2 is a second schematic diagram of a speech recognition method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a chat interface provided in an embodiment of the present application;
FIG. 4 is a diagram illustrating a chat interface receiving a screen recognition gesture according to an embodiment of the present application;
fig. 5 is a schematic diagram of displaying an application identifier according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a method for receiving a first input according to an embodiment of the present application;
fig. 7 is a schematic diagram of a method for displaying first key information according to an embodiment of the present disclosure;
fig. 8 is a schematic view of a display interface of first key information provided in an embodiment of the present application;
fig. 9 is a schematic diagram of a method for acquiring first key information according to an embodiment of the present application;
fig. 10 is a third schematic diagram of a speech recognition method according to an embodiment of the present application;
fig. 11 is a schematic diagram of editing first key information according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an apparatus for speech recognition according to an embodiment of the present application;
fig. 13 is a second schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 14 is a hardware schematic diagram of an electronic device according to an embodiment of the present disclosure;
fig. 15 is a second hardware schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application.
The terms "first", "second", and the like in the description and claims of the present application are used to distinguish between similar elements and do not necessarily describe a particular sequence or chronological order. It should be understood that data used in this way may be interchanged where appropriate, so that the embodiments of the application can be practiced in orders other than those illustrated or described here. Moreover, "first", "second", and the like are generally used generically and do not limit the number of objects; for example, a "first" object may be one object or more than one. In addition, "and/or" in the description and claims denotes at least one of the connected objects, and the character "/" generally indicates that the objects before and after it are in an "or" relationship.
The speech recognition method provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.
Consider a telephone-shopping conversation scenario: user A calls customer B through an electronic device and asks whether customer B wants to purchase a product; if so, attribute information such as the quantity, model, color, and delivery time of the product to be purchased needs to be recorded. In the related art, during the call between user A and customer B, a recording function is started and the call content is recorded to generate voice information. After the call ends, if customer B needs to purchase a product, user A starts a text editing application, selects the voice information in it, recognizes the voice information as text information, and then saves the text information. During this speech recognition, problems such as customer B's colloquial wording, accent, and sentence breaks may make the recognized text inaccurate, so the text deviates from the call; moreover, the key information that user A actually needs to record is mixed in with the rest of the text, so the efficiency of obtaining the key information is low.
With reference to the above scenario, in the embodiments of the application, if customer B needs to purchase a product, user A may provide the first input while the electronic device is acquiring the voice information of the call between user A and customer B, so that the electronic device displays the first key information from the voice information through the text editing application. Because the first key information associated with the text editing application is extracted from the voice information and displayed directly, the efficiency of extracting key information is improved, pointless recognition of the rest of the voice information is avoided, and the human-computer interaction performance of the electronic equipment is improved. Moreover, because the electronic device displays the first key information during the call between user A and customer B, user A can immediately confirm with customer B whether the key information has been recorded correctly, ensuring that the record is consistent with the product attributes the customer wants.
As shown in fig. 1, an embodiment of the present application provides a method of speech recognition. The method may include steps 101 and 102 described below. The method is described below taking an apparatus for speech recognition as the execution subject.
Step 101: the voice recognition device receives a first input from a user when the voice information is acquired.
In this embodiment, the first input is an input for triggering the starting of the target application program. Illustratively, the first input may include at least one of: clicking an icon corresponding to the target application program, a screen gesture, clicking a mechanical key corresponding to the target application program, clicking a virtual key corresponding to the target application program, or an input in any other feasible manner.
Optionally, in this embodiment of the application, if the first input is an input of a target application identifier by a user, and the target application identifier is used to indicate the target application program, as shown in fig. 2, before the first input of the user is received in step 101, the method for speech recognition provided in this embodiment of the application may further include step 201 and step 202.
Step 201: the means for speech recognition receives a third input from the user.
Step 202: the means for speech recognition displays at least one application identification in response to the third input.
In this embodiment of the present application, each application identifier is used to indicate an application program, and the at least one application identifier includes a target application identifier.
It should be noted that the third input may include at least one of: clicking an icon corresponding to the target application program, a screen-recognition gesture, clicking a mechanical key corresponding to the target application program, or an input in any other feasible manner.
For example, as shown in fig. 3, when it is detected that the electronic device is in a normal voice call, the speech recognition apparatus displays the voice chat interface corresponding to the voice call. Next, as shown in fig. 4 and 5, after the electronic device detects the user's screen-recognition gesture (i.e., the third input) on the voice chat interface, it responds to the gesture by displaying, on an application-identifier interface, the application identifiers corresponding to five candidate applications, and the user can select the application to be started (i.e., the target application program).
In this way, by displaying the application identifiers of multiple applications, the user is presented with several candidate applications and can conveniently select a target application suitable for recording the text information corresponding to the voice information, which improves the human-computer interaction performance of the electronic equipment.
Optionally, in this embodiment of the present application, as shown in fig. 6, step 101 may be implemented by step 601 and step 602:
step 601: in the process of carrying out voice call, the voice recognition device carries out voice recording on the call content of the voice call to acquire voice information.
The voice information is the voice information recorded by the voice recognition device in the voice call process.
Step 602: after the voice call is finished or in the voice call process, the voice recognition device receives a first input of a user.
Therefore, by recording the content of the voice call in real time, the electronic equipment can respond to the user's first input at any time and extract the key information related to the target application program without waiting for the voice call to end, which improves speech recognition efficiency as well as the human-computer interaction performance of the electronic equipment.
Further optionally, in this embodiment of the application, when the voice recognition device performs voice recording on the call content of the voice call, any one of the following voice recording storage manners may be selected for recording: automatically recording and storing all voice calls, recording and storing voice calls in response to user input, and automatically recording and caching all voice calls.
It should be noted that, because the voice call includes multiple voice call modes such as telephone, voice chat, or voice message, different voice recording storage modes can be selected according to different implementation mechanisms of the voice call modes in the embodiments of the present application.
For example, when a voice call is implemented as a telephone call, the voice signal is generally transmitted through a base station and usually disappears as soon as it has been transmitted to the speech recognition apparatus and played. To preserve the call content, the call can therefore be recorded in the mode of recording and storing the voice call in response to a user input.
For example, when a voice call is implemented as voice chat, it relies on the network cloud to record and forward the voice; because network transmission is unstable, part of the call content is usually already cached in the speech recognition apparatus, so all voice calls can be automatically recorded and cached.
For example, when a voice call is implemented as voice messages, the data volume is small and repeated playback is likely to be needed, so all voice calls can be automatically recorded and stored to make repeated playback convenient for the user.
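As an illustration of this selection among storage modes, the Python sketch below maps each call type to a recording storage strategy following the three examples above; the type names and strategy labels are assumptions made for illustration, not identifiers from the application.

```python
from enum import Enum, auto

class CallType(Enum):
    PHONE = auto()          # voice signal relayed through a base station
    VOICE_CHAT = auto()     # voice recorded and forwarded by the network cloud
    VOICE_MESSAGE = auto()  # short clips likely to be replayed

class StorageMode(Enum):
    RECORD_ON_USER_INPUT = auto()   # record and store only when the user asks
    AUTO_RECORD_AND_CACHE = auto()  # record everything into a temporary cache
    AUTO_RECORD_AND_STORE = auto()  # record everything into persistent storage

# Mapping follows the examples given in the description above.
STORAGE_POLICY = {
    CallType.PHONE: StorageMode.RECORD_ON_USER_INPUT,
    CallType.VOICE_CHAT: StorageMode.AUTO_RECORD_AND_CACHE,
    CallType.VOICE_MESSAGE: StorageMode.AUTO_RECORD_AND_STORE,
}

def choose_storage_mode(call_type: CallType) -> StorageMode:
    """Pick a voice-recording storage mode for the given call type."""
    return STORAGE_POLICY[call_type]

if __name__ == "__main__":
    print(choose_storage_mode(CallType.VOICE_CHAT))  # StorageMode.AUTO_RECORD_AND_CACHE
```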
Further optionally, in this embodiment of the application, voice information stored in the cache mode is held temporarily in a cache space of the speech recognition apparatus. Because the cache space is limited, the space occupied by the voice information needs to be cleared periodically or aperiodically so that other processes in the apparatus can run normally. Generally, the voice information is cleared when any one of the following occurs: the voice call ends, the caching time of the voice information reaches a preset duration, a first input corresponding to the voice information is received, or an input for clearing cached data is received.
Step 102: the speech recognition device displays first key information in the speech information through the target application program in response to the first input.
In the embodiment of the present application, the first key information is associated with the type of the target application, where the type of the target application can be understood as the function the target application can perform. Illustratively, if the type of the target application program is an address book type, the names and telephone numbers in the voice information are the first key information associated with the address book type; if the target application program is of a shopping type, the buyer, brand, product model, and product quantity in the voice information are the first key information associated with the shopping type.
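As a minimal sketch of this association, the mapping below pairs a few application types with the key fields they care about; the type names and field lists are illustrative assumptions rather than an exhaustive list from the application.

```python
# Hypothetical mapping from target-application type to the key fields it is associated with.
KEY_FIELDS_BY_APP_TYPE = {
    "address_book": ["name", "phone_number"],
    "shopping": ["buyer", "brand", "product_model", "quantity"],
    "memo": ["event", "time", "place"],
}

def key_fields_for(app_type: str) -> list:
    """Return the key fields associated with a target-application type."""
    return KEY_FIELDS_BY_APP_TYPE.get(app_type, [])

print(key_fields_for("address_book"))  # ['name', 'phone_number']
```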
In one example, if the front-end interface of the electronic device is a display interface of the application identifier when the first input is received, the application program corresponding to the application identifier for which the first input is directed is directly determined as the target application program.
In another example, if the front-end interface of the electronic device is a voice chat interface when the first input is received, before step 102, the method for voice recognition provided by the embodiment of the present application may further include step 102a or step 102b of determining the target application.
Step 102 a: when the voice call is detected to be carried out, the voice recognition device determines the application program associated with the first parameter of the first input as the target application program.
Step 102 b: in the case where the end of the voice call is detected, the voice recognition apparatus determines the application program associated with the second parameter of the first input as the target application program.
Illustratively, the first input corresponds to different target applications at different times of the voice call. For example, during a voice call, the target application is a first application corresponding to the first input, and after the voice call is ended, the target application is a second application corresponding to the first input.
It should be noted that the same screen gesture may trigger different operations when the electronic device displays different interfaces. For example, a three-finger downward swipe on the conventional home interface of the electronic device starts a camera application, while the same swipe on a display interface corresponding to the speech recognition method starts a memo application. To distinguish this case from the conventional home interface after the call has ended, the speech recognition apparatus may determine the application program associated with the second parameter of the first input as the target application program only when it detects that the voice call has ended and the time difference between the end time of the voice call and the current time falls within a preset time period.
Illustratively, taking the first input as a two-finger horizontal swipe: if a voice call is detected to be in progress at the moment of the first input, the target application program is determined to be translation software, so the voice information can be translated quickly; if the voice call is detected to have ended at the moment of the first input, the target application program is determined to be memo software, so the voice information can be conveniently recorded and stored.
Thus, the same first input given at different times determines different target applications; that is, one first input corresponds to multiple response results, so fewer input types can produce more responses.
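This call-state-dependent dispatch might look like the following sketch; the gesture name, the application names, and the 30-second window standing in for the preset time period are assumptions made for illustration only.

```python
import time
from typing import Optional

# Hypothetical gesture-to-application tables for the two call states.
IN_CALL_APPS = {"two_finger_horizontal_swipe": "translator"}
AFTER_CALL_APPS = {"two_finger_horizontal_swipe": "memo"}

PRESET_WINDOW_S = 30.0  # assumed length of the preset time period after the call ends

def resolve_target_app(gesture: str, call_active: bool,
                       call_end_time: Optional[float]) -> Optional[str]:
    """Map a first-input gesture to a target application based on the call state."""
    if call_active:
        return IN_CALL_APPS.get(gesture)        # application tied to the first parameter
    if call_end_time is not None and time.time() - call_end_time <= PRESET_WINDOW_S:
        return AFTER_CALL_APPS.get(gesture)     # application tied to the second parameter
    return None  # fall back to the device's normal gesture handling

# Example: the gesture arrives 5 s after the call ended, so memo software is chosen.
print(resolve_target_app("two_finger_horizontal_swipe", False, time.time() - 5))
```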
Optionally, after determining the target application corresponding to the first input, the speech recognition apparatus starts the target application in response to the first input. If a voice call is detected to be in progress, the voice call program is suspended and the target application is switched to run in the foreground; if the voice call is detected to have ended, the voice call program is ended and the target application is switched to run in the foreground.
Optionally, after receiving the first input, the speech recognition apparatus determines the target application corresponding to the first input and then the specific speech content included in the voice information; it may open a new information interface while starting the target application, or open that interface in response to a user input after the target application has started.
Further alternatively, the voice recognition apparatus may display the first key information in the voice information on a newly created interface of the target application.
Optionally, after the target application displays the first key information, the speech recognition apparatus performs at least one of the following operations on the first key information: saving it, jumping to a dialing page, jumping to a message-sending page, or editing it again. It should be noted that these operations on the first key information are performed by the target application.
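As a small sketch of these post-display operations, the dispatcher below maps each operation name to a placeholder handler; the handler names and their print statements are hypothetical stand-ins for the target application's real behaviour.

```python
from typing import Callable, Dict

def save(info: str) -> None:
    print(f"saved: {info}")

def open_dial_page(info: str) -> None:
    print(f"dialing page opened with: {info}")

def open_message_page(info: str) -> None:
    print(f"message-sending page opened with: {info}")

def edit_again(info: str) -> None:
    print(f"editing: {info}")

# Operations the target application may perform on the displayed first key information.
OPERATIONS: Dict[str, Callable[[str], None]] = {
    "save": save,
    "dial": open_dial_page,
    "send_message": open_message_page,
    "edit": edit_again,
}

def perform(operation: str, first_key_info: str) -> None:
    OPERATIONS[operation](first_key_info)

perform("dial", "135xxxxxxxx")  # dialing page opened with: 135xxxxxxxx
```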
In the speech recognition method provided by the embodiments of the application, when voice information is acquired and a first input is received, the first input can be responded to and first key information in the voice information can be displayed through the target application program, where the first key information is associated with the type of the target application program. Because the first key information associated with the target application program is extracted from the voice information and displayed directly, the efficiency of extracting key information is improved, pointless recognition of the rest of the voice information is avoided, and the human-computer interaction performance of the electronic equipment is improved.
Alternatively, as shown in fig. 7, in the embodiment of the present application, step 102 may be implemented by steps 701 to 703.
Step 701: the means for speech recognition launches the target application in response to the first input.
In the embodiments of the application, the target application program is determined according to whether the front-end interface of the electronic device is the application-identifier display interface or the voice chat interface when the first input is received, and the target application program is then started. It should be noted that, to reduce the number of operation steps, a new information interface may be opened directly while the target application is started, so that the first key information can be displayed conveniently.
Step 702: the voice recognition device acquires first key information corresponding to a target application program in the voice information.
In the embodiment of the present application, step 702 may be implemented by step 702a and step 702 b.
Step 702 a: the speech recognition device extracts key fields included in a newly established information interface in the target application program.
Step 702 b: the speech recognition apparatus determines, according to the key fields and their rule-matching patterns, the first key information corresponding to each key field in the first text information.
In the embodiment of the application, the key fields included in the new information interface of the target application program are extracted, and the first key information in the first text information can then be extracted according to those key fields using templates, word lists, and rule matching.
For example, as shown in fig. 8, assume the target application is an address book whose key fields include name, phone number, and remarks; in the example the name is Li Si, the phone number is 135xxxxxxxx, and the address is No. 1 School Street. A rule-matching pattern can be set based on the phone number being a 7-8 digit landline number or an 11-digit mobile number, and the first key information corresponding to the key field "phone number" is extracted accordingly.
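A minimal sketch of this rule matching, assuming plain regular expressions for each address-book key field; the patterns, the sample sentence, and the helper name extract_key_info are simplified illustrations rather than the application's actual rules.

```python
import re

# Simplified rule-matching patterns for the address-book key fields (illustrative only).
FIELD_PATTERNS = {
    # 11-digit mobile number starting with 1, or a 7-8 digit landline number
    "phone_number": re.compile(r"\b(?:1\d{10}|\d{7,8})\b"),
    # very rough name rule: the word following "my name is" / "this is"
    "name": re.compile(r"(?:my name is|this is)\s+([A-Za-z]+)", re.IGNORECASE),
    # very rough address rule: "No. <digits> ..." up to the next comma or period
    "address": re.compile(r"(No\.\s*\d+[^,.]*)", re.IGNORECASE),
}

def extract_key_info(text: str) -> dict:
    """Return the first match for each key field found in the recognized text."""
    result = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            result[field] = match.group(1) if match.groups() else match.group(0)
    return result

text = "Hello, this is LiSi, my number is 13500000000, I live at No. 1 School Street."
print(extract_key_info(text))
# {'phone_number': '13500000000', 'name': 'LiSi', 'address': 'No. 1 School Street'}
```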
In the embodiment of the present application, the display modes used to display the first key information in the interface of the target application include, but are not limited to, bold display, italic display, and highlighted display.
Step 703: the speech recognition device displays the first key information in the target application program.
In the embodiment of the application, the first key information corresponding to each key field in the voice information is recognized according to the key fields of the new information interface, and the first key information is displayed on that new information interface of the target application program.
In this way, the target application program interface is started directly and the first key information corresponding to the key fields of its new information interface is obtained, achieving real-time speech recognition.
Further optionally, in this embodiment of the application, acquiring the first key information specifically includes acquiring the voice information and recognizing, in the voice information, the first key information corresponding to the target application program. Each step is described separately below.
Example one: obtaining voice information
Further optionally, in the embodiment of the present application, in order to reduce the amount of voice data that needs to be recognized and the proportion of redundant information, and thereby improve recognition efficiency, the acquisition of voice information in step 702 may be implemented by step 702c or step 702d.
Step 702 c: when a voice call is detected to be in progress, the speech recognition apparatus determines the voice information to be first voice information, where the first voice information is the voice information corresponding to a preset time period before the input time of the first input.
Step 702 d: when the voice call is detected to have ended, the speech recognition apparatus determines the voice information to be second voice information, where the second voice information is all the voice information recorded during the voice call.
Illustratively, when a voice call is detected to be in progress and the user hears the other party mention some key content (e.g., a phone number, an address, an appointment time), the user provides the first input. Because the voice information from the preset time period before the input time of the first input usually contains the information the user needs to record, taking the input time of the first input as the reference time node makes it possible to acquire voice information with a smaller data volume and a lower proportion of redundant information, which improves speech recognition efficiency.
Illustratively, when the end of the voice call is detected, most of the content of the voice information may be first key information for the user, or the first key information may be scattered across the whole call; the voice information is therefore all the voice information recorded during the voice call.
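A sketch of how the window might be chosen under steps 702c and 702d, assuming the recording is kept as a list of (timestamp, audio_chunk) pairs and using a 60-second stand-in for the preset time period; both assumptions are illustrative only.

```python
PRESET_WINDOW_S = 60.0  # assumed preset time period before the first input

def select_voice_info(recording, first_input_time, call_active):
    """recording: list of (timestamp, audio_chunk) pairs captured during the call.

    Returns the first voice information (a window before the first input) while the
    call is in progress, or the second voice information (everything) after it ends.
    """
    if call_active:
        start = first_input_time - PRESET_WINDOW_S
        return [chunk for ts, chunk in recording if start <= ts <= first_input_time]
    return [chunk for _, chunk in recording]

# Example: chunks recorded at t = 0, 30, 90, 120 s; first input at t = 125 s.
recording = [(0.0, b"a"), (30.0, b"b"), (90.0, b"c"), (120.0, b"d")]
print(select_voice_info(recording, 125.0, call_active=True))   # [b'c', b'd']
print(select_voice_info(recording, 125.0, call_active=False))  # [b'a', b'b', b'c', b'd']
```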
Further optionally, in this embodiment of the application, after the target application is started and before it is closed, the speech recognition apparatus can avoid missing content that needs to be recorded from the voice call by updating the voice information in real time. Similar to step 702c, acquiring the voice information in step 702 may also include step 702e.
Step 702 e: when a voice call is detected to be in progress, the speech recognition apparatus extracts updated voice information at a preset interval.
Here, the updated voice information is the voice information recorded from the voice call during the corresponding preset interval.
For example, after the speech recognition device extracts the updated speech information, it extracts the second key information from the updated speech information based on the target application program, and then displays the first key information and the second key information in the interface of the target application program.
For example, person A and person B have a conversation about purchasing computers. Person A tells person B that 10 computers of brand X, model 1566, need to be purchased, and person B recognizes the call content of the two parties through the electronic device, so the key call content, for example "person B needs to purchase 10 computers of brand X, model 1566" (i.e., the first key information), is displayed in the application interface of the target application program. Then, while confirming the order with person A, person B is told that 5 computers of brand Y, model 1588, need to be purchased. At this time, the electronic device recognizes the call content again and obtains new key call content, for example "5 computers of brand Y, model 1588" (i.e., the second key information), and updates the key call content displayed in the application interface of the target application program accordingly.
In this way, updated voice information is continuously generated as the voice call proceeds, so the first key information and the second key information can be extracted from it and displayed in real time through the target application program.
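The update loop might be sketched as follows, iterating once per preset interval over simulated transcripts; the extract_order_info helper, its regular expression, and the transcripts are illustrative assumptions standing in for whatever extractor the target application actually uses.

```python
import re

def merge_key_info(display_state: dict, new_info: dict) -> dict:
    """Update the key information shown in the target application's interface."""
    display_state.update(new_info)  # later values (second key information) replace earlier ones
    return display_state

def extract_order_info(text: str) -> dict:
    """Toy extractor standing in for the rule matching shown earlier (illustrative only)."""
    m = re.search(r"(\d+) computers of brand (\w+), model (\d+)", text)
    return {"quantity": m.group(1), "brand": m.group(2), "model": m.group(3)} if m else {}

# Simulated transcripts, one per preset interval of the ongoing call.
interval_transcripts = [
    "we need 10 computers of brand X, model 1566",
    "actually make that 5 computers of brand Y, model 1588",
]

displayed = {}
for transcript in interval_transcripts:              # one iteration per preset interval
    displayed = merge_key_info(displayed, extract_order_info(transcript))
    print(displayed)
# The first interval yields the first key information; the second interval updates the
# display with the second key information: {'quantity': '5', 'brand': 'Y', 'model': '1588'}
```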
Further optionally, in this embodiment of the application, acquiring the voice information in step 702 may further include: the speech recognition apparatus filters interference information out of the voice information according to a preset voiceprint recognition algorithm and regenerates the voice information. It should be noted that during a voice chat the environment may contain whistles, animal sounds, rain, wind, and the like, so filtering out this interference information can improve the accuracy of speech recognition.
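A heavily simplified sketch of such voiceprint-based filtering, assuming an externally supplied embedding function, enrolled voiceprints for the call participants, and a cosine-similarity threshold of 0.7; all of these are assumptions, not the preset voiceprint recognition algorithm referred to above.

```python
import math
from typing import Callable, List, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filter_interference(
    segments: List[bytes],
    embed: Callable[[bytes], Sequence[float]],    # hypothetical voiceprint embedding function
    enrolled_voiceprints: List[Sequence[float]],  # voiceprints of the call participants
    threshold: float = 0.7,                       # assumed similarity threshold
) -> List[bytes]:
    """Keep only segments whose voiceprint matches one of the enrolled speakers."""
    kept = []
    for seg in segments:
        emb = embed(seg)
        if any(cosine_similarity(emb, vp) >= threshold for vp in enrolled_voiceprints):
            kept.append(seg)  # segments matching no speaker (whistles, rain, wind) are dropped
    return kept
```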
Example two: recognizing the first key information corresponding to the target application program in the voice information
Further optionally, as shown in fig. 9, in the embodiment of the present application, the recognizing text information included in the speech information in step 702 specifically includes steps 901 to 903.
Step 901: the voice recognition device converts the voice information into target text information, and extracts first text information corresponding to a target application program from the target text information.
Step 902: the means for speech recognition obtains at least one type of information.
Step 903: according to the at least one piece of type information, the speech recognition apparatus deletes, from the first text information, the text whose type matches a preset type, obtaining the first key information.
In this embodiment of the present application, the first text information includes first key information.
In the embodiment of the application, the voice information is converted into the target text information by extracting the audio features of the voice information and then converting those features into text through scoring by an acoustic model and a language model.
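The scoring idea can be illustrated with the toy sketch below, which simply combines two precomputed scores to pick the better transcript hypothesis; the hypotheses, their scores, and the 0.8 language-model weight are made up for illustration and do not describe any real decoder.

```python
# Candidate transcripts with assumed per-hypothesis scores (log-probabilities).
hypotheses = [
    {"text": "the party starts at sixteen",  "acoustic": -12.1, "language": -4.0},
    {"text": "the party starts at six teen", "acoustic": -11.8, "language": -7.5},
]

LM_WEIGHT = 0.8  # assumed language-model weight

def total_score(h: dict) -> float:
    """Combine the acoustic-model and language-model scores for one hypothesis."""
    return h["acoustic"] + LM_WEIGHT * h["language"]

best = max(hypotheses, key=total_score)
print(best["text"])  # "the party starts at sixteen"
```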
In an embodiment of the application, the at least one piece of type information is used to indicate the types of information contained in the first text information. Illustratively, the text information types may include abnormally repeated words, spoken (filler) words, time words, and place words. Take, for example, the sentence "this, the gathering at a hotel on the south side side of a bank on Yellow River Avenue starts at 16:00 on March 2 and ends at 21:00; can you make that schedule": the extra "side" is an abnormally repeated word, "this" is a spoken word, "March 2, 16:00, and 21:00" are time words, and "Yellow River Avenue, a bank, and a hotel" are place words.
It should be noted that the abnormally repeated words, spoken words, and local dialect words in the example above get in the way of obtaining the first key information, so the types of repeated words, spoken words, and dialect words can be set as the preset types. The speech recognition apparatus deletes, from the first text information, the text whose type matches the preset types, that is, it deletes the extra "side" and the "this" in the example above, and obtains the first key information "the gathering at a hotel on the south side of a bank on Yellow River Avenue starts at 16:00 on March 2 and ends at 21:00; can you make that schedule".
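A minimal sketch of this type-based deletion, assuming a small filler-word set and a rule that collapses immediately repeated words; the word list and the regular expression are illustrative, not the preset types or the spoken-word list discussed below.

```python
import re

FILLER_WORDS = {"this", "um", "uh", "like"}  # assumed spoken/filler vocabulary

def remove_preset_types(text: str) -> str:
    """Drop filler words and collapse abnormal immediate repetitions ('side side' -> 'side')."""
    # collapse abnormally repeated words such as "side side"
    text = re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    # drop standalone filler words
    words = [w for w in text.split() if w.lower().strip(",.") not in FILLER_WORDS]
    return " ".join(words)

sentence = "this, the gathering at a hotel on the south side side of a bank starts at 16:00"
print(remove_preset_types(sentence))
# "the gathering at a hotel on the south side of a bank starts at 16:00"
```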
Illustratively, the spoken-vocabulary optimization in the text optimization process includes:
Mode 1: perform spoken-text analysis on the first text information using a preset spoken-word list. To build the preset spoken-word list, the user can record a passage of spontaneous speech through voice input; the speech is recognized into spoken text information, which is displayed and edited so that the words corresponding to the user's habitual pet phrases are kept, and these spoken words are finally merged to obtain the spoken-word list.
Mode 2: perform spoken-text analysis on the first text information by adding a language model (a language model trained specifically on common spoken words).
Specifically, for simple spoken words that occur consecutively, such as "this, this" (followed by a long pause), the spoken words matching a preset type are recognized. The recognized spoken words can then be shown to the user with highlighting or other emphasis so that the user can choose whether to delete them from the final output, or they can be deleted with a one-key deletion option of the voice input method, or deleted automatically.
In this way, by matching the types contained in the first text information against the preset types and deleting its spoken vocabulary, the readability of the first key information can be improved.
Optionally, as shown in fig. 10, in the embodiment of the present application, after step 102, the method for speech recognition provided in the embodiment of the present application may further include step 1001 and step 1002.
Step 1001: the means for speech recognition receives a second input from the user.
Step 1002: the speech recognition device processes the first key information in response to the second input in an editing processing manner corresponding to the second input.
In an embodiment of the application, the second input is an editing input of the first key information by the user. As shown in fig. 11, assuming the target application is an address book, the editing processing corresponding to the second input is as follows: after the information to be modified is selected by clicking, it appears in the edit bar, and the user deletes, splices, or re-enters it. The original content of the first key information can be updated in real time as the user modifies it, or replaced once the user's modification is complete.
Illustratively, if the information to be modified is information that is repeated multiple times, then after the user deletes, splices, or re-enters it, all repeated occurrences are corrected uniformly so that the information can be integrated quickly.
Optionally, in this embodiment of the present application, after step 102, the method for speech recognition provided in this embodiment of the present application may further include: and generating a temporary cache control, wherein the temporary cache control is used for caching the first key information.
Illustratively, if the target application program is a memo and the first key information is a phone number, the phone number can be dialed directly through the temporary cache control, so the displayed first key information can be used directly and the user is spared the tedious process of copying and pasting the number in order to make the call.
It should be noted that, in the method for speech recognition provided in the embodiment of the present application, the execution subject may be a device for speech recognition, or a control module of the device for speech recognition for executing the method for speech recognition. In the embodiment of the present application, a method for performing speech recognition by a speech recognition apparatus is taken as an example, and the speech recognition apparatus provided in the embodiment of the present application is described. However, in practical applications, the main body of the above-mentioned speech recognition method may also be other devices or apparatuses that can execute the speech recognition method, and this is not limited in this embodiment of the present application.
As shown in fig. 12, an embodiment of the present application provides a speech recognition apparatus. The speech recognition apparatus includes a first receiving module 1201 and a first display module 1202;
the first receiving module 1201 is configured to receive a first input of a user when the voice information is acquired;
the first display module 1202 is configured to display, by a target application program, first key information in the voice information in response to a first input received by the first receiving module 1201, where the first key information is associated with a type of the target application program.
Optionally, the first receiving module 1201 is configured to: in the process of carrying out voice call, carrying out voice recording on the call content of the voice call to acquire the voice information; and receiving a first input of a user after the voice call is ended or in the voice call process.
Optionally, as shown in fig. 13, the apparatus further includes: a determination module 1203;
the determining module 1203 is configured to: before the first display module 1202, in response to the first input, displays the first key information in the voice information through the target application program, determine the application program associated with the first parameter of the first input as the target application program when a voice call is detected to be in progress;
the determining module 1203 is further configured to: before the first display module 1202, in response to the first input, displays the first key information in the voice information through the target application program, determine the application program associated with the second parameter of the first input as the target application program when the end of the voice call is detected.
Optionally, the first display module 1202 is configured to: responding to the first input, and starting a target application program; acquiring first key information corresponding to the target application program in the voice information; and displaying the first key information in the target application program.
Optionally, the first display module 1202 is specifically configured to: converting the voice information into target text information, and extracting first text information corresponding to the target application program from the target text information, wherein the first text information comprises the first key information; acquiring at least one type information, wherein the at least one type information is used for indicating the type of the information contained in the first text information; and deleting the text information with the type matched with the preset type in the first text information according to the at least one type information to obtain the first key information.
Optionally, as shown in fig. 13, the apparatus further includes: a second receiving module 1204 and a first processing module 1205;
the second receiving module 1204 is configured to receive a second input of the user after the first display module 1202 displays the first key information in the voice information through the target application, where the second input is an editing input of the first key information by the user;
the first processing module 1205 is configured to respond to the second input received by the second receiving module 1204, and process the first key information in an editing processing manner corresponding to the second input.
In the speech recognition device provided by the embodiments of the application, when voice information is acquired and a first input is received, the first input can be responded to and first key information in the voice information can be displayed through the target application program, where the first key information is associated with the type of the target application program. Because the first key information associated with the target application program is extracted from the voice information and displayed directly, the efficiency of extracting key information is improved, pointless recognition of the rest of the voice information is avoided, and the human-computer interaction performance of the electronic equipment is improved.
The speech recognition device in the embodiment of the present application may be a device, or may be a component, an integrated circuit, or a chip in a terminal. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.
The speech recognition device in the embodiment of the present application may be a device having an operating system. The operating system may be an Android (Android) operating system, an IOS operating system, or other possible operating systems, which is not specifically limited in the embodiments of the present application.
The speech recognition device provided by the embodiment of the present application can implement each process implemented by the above method embodiment, and is not described here again to avoid repetition.
The beneficial effects of the various implementation manners in this embodiment may specifically refer to the beneficial effects of the corresponding implementation manners in the above method embodiments, and are not described herein again to avoid repetition.
Optionally, as shown in fig. 14, an electronic device 1400 is further provided in the embodiment of the present application, and includes a processor 1401, a memory 1402, and a program or an instruction stored in the memory 1402 and executable on the processor 1401, where the program or the instruction is executed by the processor 1401 to implement each process of the foregoing speech recognition method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic device and the non-mobile electronic device described above.
Fig. 15 is a schematic hardware structure diagram of an electronic device implementing an embodiment of the present application.
The electronic device 1500 includes, but is not limited to: a radio frequency unit 1501, a network module 1502, an audio output unit 1503, an input unit 1504, a sensor 1505, a display unit 1506, a user input unit 1507, an interface unit 1508, a memory 1509, and a processor 1510.
Those skilled in the art will appreciate that the electronic device 1500 may also include a power supply (e.g., a battery) for powering the various components, which may be logically coupled to the processor 1510 via a power management system to perform functions such as managing charging, discharging, and power consumption via the power management system. The electronic device structure shown in fig. 15 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than those shown, or combine some components, or arrange different components, and thus, the description is omitted here.
The user input unit 1507 is configured to receive a first input of a user when the voice information is acquired;
the processor 1510 is configured to display, by a target application program, first key information in the voice information in response to the first input, where the first key information is associated with a type of the target application program.
Optionally, the processor 1510 is further configured to perform voice recording on the call content of the voice call in the process of performing the voice call, so as to obtain the voice information;
optionally, the user input unit 1507 is further configured to receive a first input from a user after the voice call is ended or during the voice call.
Optionally, the processor 1510 is further configured to determine, when a voice call is detected to be in progress, the application program associated with the first parameter of the first input as the target application program; and, when the end of the voice call is detected, to determine the application program associated with the second parameter of the first input as the target application program.
Optionally, the processor 1510 is further configured to start a target application program in response to the first input; acquiring first key information corresponding to the target application program in the voice information; and displaying the first key information in the target application program.
Optionally, the processor 1510 is further configured to convert the voice information into target text information, and extract first text information corresponding to the target application program from the target text information, where the first text information includes the first key information; acquiring at least one type information, wherein the at least one type information is used for indicating the type of the information contained in the first text information; and deleting the text information with the type matched with the preset type in the first text information according to the at least one type information to obtain the first key information.
Optionally, the user input unit 1507 is further configured to receive a second input from the user, where the second input is an editing input of the first key information by the user;
Optionally, the processor 1510 is further configured to respond to the second input and process the first key information in an editing processing manner corresponding to the second input.
In the electronic device provided by the embodiments of the application, when voice information is acquired and a first input is received, the first input can be responded to and first key information in the voice information can be displayed through a target application program, where the first key information is associated with the type of the target application program. Because the first key information associated with the target application program is extracted from the voice information and displayed directly, the efficiency of extracting key information is improved, pointless recognition of the rest of the voice information is avoided, and the human-computer interaction performance of the electronic equipment is improved.
The beneficial effects of the various implementation manners in this embodiment may specifically refer to the beneficial effects of the corresponding implementation manners in the above method embodiments, and are not described herein again to avoid repetition.
It should be understood that in the embodiment of the present application, the input unit 1504 may include a Graphics Processing Unit (GPU) 15041 and a microphone 15042, and the graphics processor 15041 processes image data of still pictures or videos obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 1506 may include a display panel 15061, and the display panel 15061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 1507 includes a touch panel 15071 and other input devices 15072. A touch panel 15071, also referred to as a touch screen. The touch panel 15071 may include two parts of a touch detection device and a touch controller. Other input devices 15072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein. The memory 1509 may be used to store software programs as well as various data including, but not limited to, application programs and an operating system. The processor 1510 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1510.
The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the embodiment of the speech recognition method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device of the above embodiment. The readable storage medium includes a computer-readable storage medium such as a computer read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to execute a program or an instruction to implement each process of the voice recognition method embodiment, and can achieve the same technical effect, and in order to avoid repetition, the description is omitted here.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-chip, system-on-chip or system-on-chip, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved, e.g., the methods described may be performed in an order different than that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, although in many cases the former is the better implementation. Based on such an understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and includes several instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the methods of the embodiments of the present application.
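To make the preceding paragraph concrete, the following is a minimal, illustrative Python sketch of the overall software flow described in the embodiments: the voice information is converted to text, and only the key information associated with the type of the target application program is kept for display. The function names, the regular-expression rules, and the hard-coded recognizer output are assumptions introduced here for illustration only; they are not part of the disclosed implementation.

```python
import re

# Hypothetical extraction rules keyed by application type; the patterns below
# are illustrative only and are not taken from the disclosure.
EXTRACTION_RULES = {
    "contacts": {
        "name": re.compile(r"my name is (\w+)", re.IGNORECASE),
        "phone": re.compile(r"\b\d{3}[-\s]?\d{4}[-\s]?\d{4}\b"),
    },
    "calendar": {
        "time": re.compile(r"\b\d{1,2}:\d{2}\s?(?:am|pm)?", re.IGNORECASE),
        "place": re.compile(r"meet at ([\w\s]+?)[.,]", re.IGNORECASE),
    },
}


def speech_to_text(voice_information: bytes) -> str:
    """Stub for a speech recognizer; a real device would invoke an ASR engine."""
    # Hard-coded so the sketch runs without any audio input.
    return ("Hi, my name is Alice, my number is 138-1234-5678, "
            "let's meet at the office, 3:00 pm.")


def first_key_information(voice_information: bytes, target_app_type: str) -> dict:
    """Return only the key information associated with the target application type."""
    text = speech_to_text(voice_information)
    rules = EXTRACTION_RULES.get(target_app_type, {})
    found = {}
    for info_type, pattern in rules.items():
        match = pattern.search(text)
        if match:
            # Use the first capture group if the pattern has one, else the whole match.
            found[info_type] = match.group(1) if pattern.groups else match.group(0)
    return found


if __name__ == "__main__":
    # The "first input" is simulated here simply by naming the application the user opens.
    for app_type in ("contacts", "calendar"):
        print(app_type, "->", first_key_information(b"", app_type))
```

On a real terminal, the stubbed recognizer would be replaced by the platform's speech-to-text service and the rule table by whatever extraction model the manufacturer ships; only the shape of the flow is meant to mirror the embodiments.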
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (14)

1. A method of speech recognition, the method comprising:
receiving a first input of a user in a case where voice information is acquired;
displaying, by a target application, first key information in the voice information in response to the first input, the first key information being associated with a type of the target application.
2. The method according to claim 1, wherein the receiving a first input of a user in a case where voice information is acquired comprises:
during a voice call, recording call content of the voice call to obtain the voice information;
and receiving the first input of the user during the voice call or after the voice call ends.
3. The method of claim 1 or 2, wherein before the displaying, by the target application, the first key information in the voice information in response to the first input, the method further comprises:
in a case where it is detected that the voice call is in progress, determining an application program associated with a first parameter of the first input as the target application program;
and in a case where it is detected that the voice call has ended, determining an application program associated with a second parameter of the first input as the target application program.
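Purely as an illustration of the selection logic in claim 3, and not as the claimed implementation, the Python sketch below assumes the first input carries two hypothetical parameters, each associated with a candidate application, and picks between them according to whether the voice call is still in progress:

```python
from dataclasses import dataclass


@dataclass
class FirstInput:
    # Hypothetical parameters of the first input; each is assumed to be
    # associated with a candidate application program.
    first_parameter_app: str
    second_parameter_app: str


def select_target_application(first_input: FirstInput, call_in_progress: bool) -> str:
    """Choose the target application program according to the call state."""
    if call_in_progress:
        # Voice call detected as in progress: use the first-parameter application.
        return first_input.first_parameter_app
    # Voice call detected as ended: use the second-parameter application.
    return first_input.second_parameter_app


if __name__ == "__main__":
    user_input = FirstInput(first_parameter_app="notes",
                            second_parameter_app="calendar")
    print(select_target_application(user_input, call_in_progress=True))   # notes
    print(select_target_application(user_input, call_in_progress=False))  # calendar
```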
4. The method of claim 1, wherein the displaying, by a target application, first key information in the voice information in response to the first input comprises:
in response to the first input, starting a target application program or switching the target application program to a foreground;
acquiring first key information corresponding to the target application program in the voice information;
and displaying the first key information in the target application program.
5. The method according to claim 4, wherein the acquiring first key information corresponding to the target application program in the voice information comprises:
converting the voice information into target text information, and extracting first text information corresponding to the target application program from the target text information, wherein the first text information comprises the first key information;
acquiring at least one piece of type information, wherein the at least one piece of type information is used for indicating a type of information contained in the first text information;
and deleting, according to the at least one piece of type information, text information whose type matches a preset type from the first text information, to obtain the first key information.
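The last step of claim 5, removing the text whose type matches a preset type, can be pictured with the short Python sketch below. The segmentation of the first text information, the type labels, and the choice of "filler" as the preset type are assumptions made for the example only:

```python
def delete_preset_type(first_text_segments, segment_types, preset_type):
    """Keep only the segments whose type does not match the preset type;
    what remains serves as the first key information."""
    return [segment
            for segment, segment_type in zip(first_text_segments, segment_types)
            if segment_type != preset_type]


if __name__ == "__main__":
    # Assumed segments of the first text information and their type labels.
    segments = ["call me at 138-1234-5678", "uh, you know", "meet at the office at 3 pm"]
    types = ["phone", "filler", "schedule"]
    print(delete_preset_type(segments, types, preset_type="filler"))
    # ['call me at 138-1234-5678', 'meet at the office at 3 pm']
```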
6. The method of claim 1, wherein after displaying the first key information in the voice information by the target application, the method further comprises:
receiving a second input of the user, wherein the second input is an editing input of the first key information by the user;
and in response to the second input, processing the first key information in an editing processing mode corresponding to the second input.
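As a rough illustration of claim 6, the sketch below applies an editing processing mode derived from the second input to the displayed key information. The two modes shown ("modify" and "delete") are assumed examples; the claim does not enumerate specific modes:

```python
def apply_edit(key_information: dict, edit_mode: str, field: str, new_value=None) -> dict:
    """Process the first key information in the editing mode carried by the second input."""
    edited = dict(key_information)  # leave the originally displayed information untouched
    if edit_mode == "modify":
        edited[field] = new_value
    elif edit_mode == "delete":
        edited.pop(field, None)
    return edited


if __name__ == "__main__":
    info = {"name": "Alice", "phone": "138-1234-5678"}
    print(apply_edit(info, "modify", "name", "Alice Wang"))
    print(apply_edit(info, "delete", "phone"))
```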
7. An apparatus for speech recognition, the apparatus comprising: a first receiving module and a first display module;
the first receiving module is configured to receive a first input of a user in a case where voice information is acquired;
the first display module is configured to, in response to the first input received by the first receiving module, display first key information in the voice information through a target application program, wherein the first key information is associated with a type of the target application program.
8. The apparatus of claim 7, wherein the first receiving module is configured to:
during a voice call, record call content of the voice call to obtain the voice information;
and receive the first input of the user during the voice call or after the voice call ends.
9. The apparatus of claim 7 or 8, further comprising: a determining module;
the determining module is configured to, before the first display module displays the first key information in the voice information through the target application program in response to the first input, determine an application program associated with a first parameter of the first input as the target application program in a case where it is detected that a voice call is in progress;
the determining module is further configured to, before the first display module displays the first key information in the voice information through the target application program in response to the first input, determine an application program associated with a second parameter of the first input as the target application program in a case where it is detected that the voice call has ended.
10. The apparatus of claim 7, wherein the first display module is configured to:
in response to the first input, launch a target application program;
acquire first key information corresponding to the target application program in the voice information;
and display the first key information in the target application program.
11. The apparatus of claim 10, wherein the first display module is specifically configured to:
convert the voice information into target text information, and extract first text information corresponding to the target application program from the target text information, wherein the first text information comprises the first key information;
acquire at least one piece of type information, wherein the at least one piece of type information is used for indicating a type of information contained in the first text information;
and delete, according to the at least one piece of type information, text information whose type matches a preset type from the first text information, to obtain the first key information.
12. The apparatus of claim 7, further comprising: a second receiving module and a first processing module;
the second receiving module is configured to receive a second input of the user after the first display module displays the first key information in the voice information through the target application program, wherein the second input is an editing input of the first key information by the user;
and the first processing module is configured to, in response to the second input received by the second receiving module, process the first key information in an editing processing mode corresponding to the second input.
13. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions when executed by the processor implementing the steps of the method of speech recognition according to any one of claims 1 to 6.
14. A readable storage medium, characterized in that it stores thereon a program or instructions which, when executed by a processor, implement the steps of the method of speech recognition according to any one of claims 1 to 6.
CN202110369099.2A 2021-04-06 2021-04-06 Method and device for speech recognition, electronic equipment and readable storage medium Pending CN113299290A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110369099.2A CN113299290A (en) 2021-04-06 2021-04-06 Method and device for speech recognition, electronic equipment and readable storage medium
PCT/CN2022/085338 WO2022213986A1 (en) 2021-04-06 2022-04-06 Voice recognition method and apparatus, electronic device, and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110369099.2A CN113299290A (en) 2021-04-06 2021-04-06 Method and device for speech recognition, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113299290A true CN113299290A (en) 2021-08-24

Family

ID=77319546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110369099.2A Pending CN113299290A (en) 2021-04-06 2021-04-06 Method and device for speech recognition, electronic equipment and readable storage medium

Country Status (2)

Country Link
CN (1) CN113299290A (en)
WO (1) WO2022213986A1 (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541603B (en) * 2011-12-28 2015-12-02 华为终端有限公司 A kind of application program launching method, system and terminal device
US8606576B1 (en) * 2012-11-02 2013-12-10 Google Inc. Communication log with extracted keywords from speech-to-text processing
CN103440866B (en) * 2013-07-30 2016-03-09 广东明创软件科技有限公司 The method of executing the task according to call-information and mobile terminal
CN104184870A (en) * 2014-07-29 2014-12-03 小米科技有限责任公司 Call log marking method and device and electronic equipment
CN108287815A (en) * 2017-12-29 2018-07-17 重庆小雨点小额贷款有限公司 Information input method, device, terminal and computer readable storage medium
CN109462697A (en) * 2018-11-27 2019-03-12 努比亚技术有限公司 Method of speech processing, device, mobile terminal and storage medium
CN112182197A (en) * 2020-11-09 2021-01-05 北京明略软件系统有限公司 Method, device and equipment for recommending dialect and computer readable medium
CN113299290A (en) * 2021-04-06 2021-08-24 维沃移动通信有限公司 Method and device for speech recognition, electronic equipment and readable storage medium

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365671A (en) * 2012-03-28 2013-10-23 宇龙计算机通信科技(深圳)有限公司 Method for displaying application icons at time interval and mobile terminal
CN105100360A (en) * 2015-08-26 2015-11-25 百度在线网络技术(北京)有限公司 Communication auxiliary method and device for voice communication
CN105260393A (en) * 2015-09-15 2016-01-20 北京金山安全软件有限公司 Information pushing method and device and electronic equipment
CN106325867A (en) * 2016-08-24 2017-01-11 努比亚技术有限公司 Mobile terminal and interface display method thereof
CN107580143A (en) * 2017-09-30 2018-01-12 维沃移动通信有限公司 A kind of display methods and mobile terminal
CN108073437A (en) * 2017-12-20 2018-05-25 维沃移动通信有限公司 Method and mobile terminal are recommended in a kind of application
CN108920471A (en) * 2018-06-29 2018-11-30 联想(北京)有限公司 A kind of voice translation method and electronic equipment
CN111046680A (en) * 2018-10-15 2020-04-21 华为技术有限公司 Translation method and electronic equipment
CN109167884A (en) * 2018-10-31 2019-01-08 维沃移动通信有限公司 A kind of method of servicing and device based on user speech
CN109286728A (en) * 2018-11-29 2019-01-29 维沃移动通信有限公司 A kind of dialog context processing method and terminal device
CN110933225A (en) * 2019-11-04 2020-03-27 Oppo(重庆)智能科技有限公司 Call information acquisition method and device, storage medium and electronic equipment
CN111596818A (en) * 2020-04-24 2020-08-28 维沃移动通信有限公司 Message display method and electronic equipment

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022213986A1 (en) * 2021-04-06 2022-10-13 维沃移动通信有限公司 Voice recognition method and apparatus, electronic device, and readable storage medium

Also Published As

Publication number Publication date
WO2022213986A1 (en) 2022-10-13

Similar Documents

Publication Publication Date Title
CN101998107B (en) Information processing apparatus, conference system and information processing method
CN102984050A (en) Method, client and system for searching voices in instant messaging
CN104866275B (en) Method and device for acquiring image information
CN109782997B (en) Data processing method, device and storage medium
JP2019053566A (en) Display control device, display control method, and program
CN112311658A (en) Voice information processing method and device and electronic equipment
CN114020197A (en) Cross-application message processing method, electronic device and readable storage medium
CN113067983A (en) Video processing method and device, electronic equipment and storage medium
CN115840841A (en) Multi-modal dialog method, device, equipment and storage medium
CN112181253A (en) Information display method and device and electronic equipment
CN114827068A (en) Message sending method and device, electronic equipment and readable storage medium
WO2022213986A1 (en) Voice recognition method and apparatus, electronic device, and readable storage medium
CN112866469A (en) Method and device for recording call content
WO2023186097A1 (en) Message output method and apparatus, and electronic device
CN112306450A (en) Information processing method and device
CN113055529B (en) Recording control method and recording control device
CN114374761A (en) Information interaction method and device, electronic equipment and medium
CN115412634A (en) Message display method and device
CN114490754A (en) Document auditing method and device and electronic equipment
CN115309487A (en) Display method, display device, electronic equipment and readable storage medium
CN114491087A (en) Text processing method and device, electronic equipment and storage medium
CN114024929A (en) Voice message processing method and device, electronic equipment and medium
CN112578965A (en) Processing method and device and electronic equipment
CN113593614A (en) Image processing method and device
CN113362802A (en) Voice generation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination