WO2022213986A1

WO2022213986A1 - Voice recognition method and apparatus, electronic device, and readable storage medium

Info

Publication number: WO2022213986A1
Application number: PCT/CN2022/085338
Authority: WO
Inventors: 梁浩
Original assignee: 维沃移动通信有限公司
Priority date: 2021-04-06
Filing date: 2022-04-06
Publication date: 2022-10-13
Also published as: CN113299290A

Abstract

A method and apparatus for collecting key information during a voice call, an electronic device, and a readable storage medium, the method comprising: receiving a first input by a user when voice information has been acquired, the first input being an input that triggers and starts a target application program (101); and in response to the first input, displaying first key information in the voice information by means of the target application program, the first key information being associated with the type of the target application program (102).

Description

Method, apparatus, electronic device and readable storage medium for speech recognition

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Chinese Patent Application No. 202110369099.2 filed in China on April 6, 2021, the entire contents of which are hereby incorporated by reference.

TECHNICAL FIELD

The present application belongs to the field of communication technologies, and in particular relates to a method, an apparatus, an electronic device and a readable storage medium for speech recognition.

BACKGROUND

With the development of electronic technology, voice chat has gradually become one of the main forms of remote communication. Voice chat mainly includes phone calls, voice calls and voice messages. During a voice chat, key information such as names, mobile phone numbers, addresses, meeting times and meeting locations usually has to be recorded manually with pen and paper, or entered manually through text editing software.

In the related art, after the end of the voice chat, text editing software can be used to perform speech recognition on the stored voice information to generate text information, and the user can then manually filter the key information in the text information.

However, the proportion of key information in the voice information corresponding to the voice chat is small, resulting in the above-mentioned text information containing a lot of redundant information, which reduces the efficiency of obtaining key information.

SUMMARY OF THE INVENTION

The purpose of the embodiments of the present application is to provide a speech recognition method, apparatus, electronic device, and readable storage medium, which can solve the problem of low efficiency in acquiring key information during speech recognition.

In a first aspect, an embodiment of the present application provides a method for speech recognition. The method includes: in the case of acquiring voice information, receiving a first input from a user; and in response to the first input, displaying first key information in the voice information through a target application program, the first key information being associated with the type of the target application program.

In a second aspect, an embodiment of the present application provides a device for speech recognition. The device includes a first receiving module and a first display module. The first receiving module is used to receive the first input of the user when the voice information is acquired. The first display module is used to display, in response to the first input, the first key information in the voice information through the target application program, where the first key information is associated with the type of the target application program.

In a third aspect, an embodiment of the present application provides an electronic device. The electronic device includes a processor, a memory, and a program or instruction stored in the memory and executable on the processor; when the program or instruction is executed by the processor, the steps of the method provided in the first aspect are implemented.

In a fourth aspect, an embodiment of the present application provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or instruction is executed by a processor, the steps of the method provided in the first aspect are implemented.

In a fifth aspect, an embodiment of the present application provides a chip, the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement the method provided in the first aspect.

In a sixth aspect, an embodiment of the present application provides a computer program product, where the program product is stored in a non-volatile storage medium, and the program product is executed by at least one processor to implement the method provided in the first aspect.

In the embodiment of the present application, when the voice information is acquired and the first input is received, the first key information in the voice information can be displayed through the target application in response to the first input, where the first key information is associated with the type of the target application. In this way, the first key information associated with the target application is extracted from the voice information and displayed directly, which not only improves the extraction efficiency of the key information, but also avoids invalid recognition of the voice information, and also improves the human-computer interaction performance of the electronic device.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is one of the schematic diagrams of a method for speech recognition provided by an embodiment of the present application;

FIG. 2 is a second schematic diagram of a method for speech recognition provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a chat interface provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of receiving a screen recognition gesture on a chat interface provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of displaying an application identifier according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a method for receiving a first input provided by an embodiment of the present application;

FIG. 7 is a schematic diagram of a method for displaying first key information provided by an embodiment of the present application;

FIG. 8 is a schematic diagram of a display interface of the first key information provided by an embodiment of the present application;

FIG. 9 is a schematic diagram of a method for acquiring first key information provided by an embodiment of the present application;

FIG. 10 is a third schematic diagram of a method for speech recognition provided by an embodiment of the present application;

FIG. 11 is a schematic diagram of editing first key information according to an embodiment of the present application;

FIG. 12 is one of the schematic structural diagrams of the apparatus for speech recognition provided by an embodiment of the present application;

FIG. 13 is the second schematic structural diagram of the apparatus for speech recognition provided by the embodiment of the application;

FIG. 14 is one of the hardware schematic diagrams of the electronic device provided by the embodiment of the application;

FIG. 15 is the second schematic diagram of the hardware of the electronic device provided by the embodiment of the present application.

DETAILED DESCRIPTION

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, not all of the embodiments.

The terms "first", "second" and the like in the description and claims of the present application are used to distinguish similar objects, and are not used to describe a specific order or sequence. It is to be understood that the terms so used are interchangeable under appropriate circumstances, so that the embodiments of the present application can be practiced in sequences other than those illustrated or described herein. The objects distinguished by "first", "second", etc. are usually of one type, and the number of objects is not limited. For example, the first object may be one or more than one. In addition, "and/or" in the description and claims indicates at least one of the connected objects, and the character "/" generally indicates that the associated objects are in an "or" relationship.

The speech recognition method provided by the embodiments of the present application will be described in detail below through specific embodiments and application scenarios with reference to the accompanying drawings.

Consider a telephone shopping scenario: user A calls customer B through an electronic device and asks whether customer B wants to buy a product. If customer B wants to buy the product, attribute information such as the quantity, model, color, and delivery time of the product needs to be recorded. In the related art, the recording function is activated during the call between user A and customer B, and the call content is recorded to generate voice information. After the call, if customer B wants to purchase the product, user A starts a text editing application, selects the voice information in the text editing application, has the voice information recognized as text information, and saves the text information. During speech recognition, customer B's colloquial speech, accent, sentence segmentation and similar factors may make the recognized text inaccurate, and the key information that user A actually needs to record is mixed in with the rest of the text information, resulting in low efficiency in obtaining the key information.

Combining the above scenario, in the embodiment of the present application, if customer B needs to purchase a product, user A can provide the first input once the electronic device has obtained the voice information of the call between user A and customer B, so that the electronic device displays the first key information in the voice information through a text editing application. In this way, the first key information associated with the text editing application is extracted from the voice information and displayed directly, which not only improves the extraction efficiency of the key information, but also avoids invalid recognition of the voice information, and also improves the human-computer interaction performance of the electronic device. In addition, because the electronic device displays the first key information during the call between user A and customer B, user A can immediately confirm with customer B whether the key information has been recorded correctly, which helps ensure that the recorded attribute information of the purchased products is consistent with what the customer ordered.

As shown in FIG. 1, an embodiment of the present application provides a method for speech recognition. The method may include steps 101 and 102 described below. The method is exemplarily described below by taking a speech recognition apparatus as the executing entity.

Step 101: The apparatus for speech recognition receives the first input of the user in the case of acquiring the speech information.

In this embodiment of the present application, the above-mentioned first input is an input that triggers the startup of the target application. Exemplarily, the above-mentioned first input may include at least one of the following: clicking an icon corresponding to the target application, a screen gesture, clicking a mechanical button corresponding to the target application, clicking a virtual button corresponding to the target application, or other feasible inputs.

Optionally, in this embodiment of the present application, if the above-mentioned first input is an input by the user on a target application identifier, and the target application identifier is used to indicate the target application program, then as shown in FIG. 2, before the first input from the user is received in step 101, the method for speech recognition provided in this embodiment of the present application may further include step 201 and step 202.

Step 201: The apparatus for speech recognition receives a third input from the user.

Step 202: The apparatus for speech recognition displays at least one application identifier in response to the third input.

In this embodiment of the present application, each application identifier is used to respectively indicate an application, and the at least one application identifier includes a target application identifier.

It should be noted that the above-mentioned third input may include at least one of the following: clicking an icon corresponding to the target application, screen recognition gesture, clicking a mechanical button corresponding to the target application, or other feasible inputs.

Exemplarily, as shown in FIG. 3, when the apparatus for voice recognition detects that the electronic device is in a normal voice call, it displays a voice chat interface corresponding to the voice call. Next, as shown in FIG. 4 and FIG. 5, when the electronic device detects the user's screen recognition gesture on the voice chat interface (that is, the above-mentioned third input), it displays, in response to the screen recognition gesture, an application identification interface showing the application identifiers corresponding to five candidate application programs, from which the user can select the application program that he wants to start (that is, the above-mentioned target application program).

In this way, displaying the application identifiers of multiple application programs presents the user with multiple optional application programs, which not only makes it easier for the user to select a target application program suitable for recording the text information corresponding to the voice information, but also improves the human-computer interaction performance of the electronic device.

Optionally, in this embodiment of the present application, as shown in FIG. 6 , the foregoing step 101 may be implemented through steps 601 and 602:

Step 601: During the voice call, the voice recognition device performs voice recording on the call content of the voice call, and acquires voice information.

Wherein, the above-mentioned voice information is voice information recorded by a voice recognition device during a voice call.

Step 602: After the voice call ends or during the voice call, the device for voice recognition receives the first input from the user.

In this way, because the voice call content is recorded in real time, the electronic device can respond to the user's first input at any time to extract key information related to the target application, without waiting for the voice call to end, which not only improves the voice recognition efficiency, but also improves the human-computer interaction performance of the electronic device.

Further optionally, in the embodiment of the present application, when the voice recognition device records the call content of the voice call, it can select any one of the following voice recording and storage modes: automatically recording and storing all voice calls, recording and storing voice calls in response to user input, or automatically recording and caching all voice calls.

It should be noted that, since a voice call includes multiple voice call modes such as phone calls, voice chats, or voice messages, the embodiments of the present application may select different voice recording and storage modes according to different implementation mechanisms of the call modes.

Exemplarily, when a voice call is implemented as a phone call, the voice signal is usually transmitted through a base station. Since the voice signal usually disappears immediately after it is transmitted to the voice recognition device and played, the call content can be stored by recording and storing the voice call in response to user input.

Exemplarily, when a voice call is implemented as a voice chat, it depends on the recording and forwarding of the voice by the network cloud. Because network transmission is unstable, part of the voice call content is usually cached in the voice recognition device. Therefore, building on this partial caching, all voice calls can be automatically recorded and cached.

Exemplarily, voice calls are implemented in the form of voice messages. Since the data volume is small and the possibility of repeated playback is high, all voice calls can be automatically recorded and stored in order to facilitate the user's repeated playback.
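The pairing of call modes and storage modes in the three examples above can be sketched as a simple lookup. This is only an illustrative sketch: the enum names, the dictionary, and the function `select_storage_mode` are hypothetical and do not appear in the application.

```python
from enum import Enum


class CallMode(Enum):
    PHONE_CALL = "phone call"
    VOICE_CHAT = "voice chat"
    VOICE_MESSAGE = "voice message"


class StorageMode(Enum):
    RECORD_AND_STORE_ALL = "automatically record and store all voice calls"
    RECORD_ON_USER_INPUT = "record and store voice calls in response to user input"
    RECORD_AND_CACHE_ALL = "automatically record and cache all voice calls"


# Pairing taken from the three examples in the text; other pairings are possible.
STORAGE_MODE_BY_CALL_MODE = {
    CallMode.PHONE_CALL: StorageMode.RECORD_ON_USER_INPUT,
    CallMode.VOICE_CHAT: StorageMode.RECORD_AND_CACHE_ALL,
    CallMode.VOICE_MESSAGE: StorageMode.RECORD_AND_STORE_ALL,
}


def select_storage_mode(call_mode: CallMode) -> StorageMode:
    """Return the recording and storage mode used for the given call mode."""
    return STORAGE_MODE_BY_CALL_MODE[call_mode]
```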

Further optionally, in the embodiment of the present application, the voice information stored in the cache mode is temporarily stored in the cache space of the voice recognition device. Since the cache space is limited, the cache space occupied by voice information needs to be cleared periodically or from time to time so that other processes in the voice recognition device can run normally. Generally, the voice information is cleared on any one of the following conditions: the voice call ends, the cache time of the voice information reaches a preset duration, the first input corresponding to the voice information is received, or an input for clearing the cached data is received.
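A minimal sketch of these clearing conditions follows, assuming the device is configured with exactly one of them. The class and field names and the 600-second default are illustrative assumptions, not values from the application.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ClearTrigger(Enum):
    CALL_ENDED = auto()            # clear after the voice call ends
    CACHE_TIMEOUT = auto()         # clear once the cache time reaches a preset duration
    FIRST_INPUT_RECEIVED = auto()  # clear once the first input has been received
    CLEAR_INPUT_RECEIVED = auto()  # clear when the user asks to clear cached data


@dataclass
class CachedVoice:
    cached_at_s: float                  # time at which caching started, in seconds
    call_ended: bool = False
    first_input_received: bool = False
    clear_input_received: bool = False


def should_clear(entry: CachedVoice, trigger: ClearTrigger,
                 now_s: float, max_age_s: float = 600.0) -> bool:
    """Decide whether a cached recording should be cleared under one trigger."""
    if trigger is ClearTrigger.CALL_ENDED:
        return entry.call_ended
    if trigger is ClearTrigger.CACHE_TIMEOUT:
        return now_s - entry.cached_at_s >= max_age_s
    if trigger is ClearTrigger.FIRST_INPUT_RECEIVED:
        return entry.first_input_received
    return entry.clear_input_received
```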

Step 102: In response to the first input, the apparatus for speech recognition displays the first key information in the speech information through the target application program.

In this embodiment of the present application, the first key information is associated with the type of the target application, where the type of the target application refers to the function that the target application can implement. Exemplarily, if the type of the target application is the address book type, the name and phone number in the voice information are the first key information associated with the address book type; if the type of the target application is the purchase type, the buyer, brand, product, product model, and product quantity in the voice information are the first key information associated with the purchase type.
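The association between application type and key information can be illustrated with a small lookup. This is a sketch under assumptions: the field names mirror the two examples in the text, while the dictionary, the function `first_key_information`, and the idea that recognized values arrive as a flat dictionary are hypothetical.

```python
# Key fields per application type, following the two examples in the text.
KEY_FIELDS_BY_APP_TYPE = {
    "address_book": ("name", "phone_number"),
    "purchase": ("buyer", "brand", "product", "product_model", "product_quantity"),
}


def first_key_information(app_type: str, recognized: dict) -> dict:
    """Keep only the recognized values that match the target application's type."""
    fields = KEY_FIELDS_BY_APP_TYPE[app_type]
    return {field: recognized[field] for field in fields if field in recognized}
```

For instance, if the recognized values contain a name, a phone number, and a brand, an address-book target application would receive only the name and the phone number.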

In an example, if the front-end interface of the electronic device is the display interface of the application identifier when the first input is received, the application program corresponding to the application identifier targeted by the first input is directly determined as the target application program.

In another example, if the front-end interface of the electronic device is a voice chat interface when the first input is received, then before step 102, the voice recognition method provided by this embodiment of the present application may further include step 102a or step 102b for determining the target application program.

Step 102a: When a voice call is detected to be in progress, the apparatus for voice recognition determines the application associated with a first parameter of the first input as the target application.

Step 102b: When the end of the voice call is detected, the apparatus for voice recognition determines the application associated with a second parameter of the first input as the target application.

Exemplarily, at different moments of the voice call, the above-mentioned first input corresponds to different target applications. For example, during a voice call, the above-mentioned target application is the first application corresponding to the first input, and after the voice call ends, the above-mentioned target application is the second application corresponding to the first input.

It should be noted that the same screen gesture may trigger different operations depending on the interface that the electronic device is displaying. For example, on the regular initial interface of the electronic device, a three-finger swipe gesture starts the camera application, whereas on the display interface associated with the speech recognition method, a three-finger swipe gesture starts the memo application. To distinguish the regular initial interface of the electronic device from the initial interface shown after a voice call ends, the voice recognition device may be configured so that, when the end of the voice call has been detected and the time difference between the end time of the voice call and the current time falls within a preset time period, the device determines the application associated with the second parameter of the first input as the target application.

Exemplarily, taking a two-finger lateral swipe as the first input: if a voice call is in progress at the moment the first input is made, the target application is determined to be a translation program, which facilitates rapid translation of the voice information; if the voice call has ended at the moment the first input is made, the target application is determined to be a memo program, so that the voice information can be recorded and saved.

In this way, the same first input determines different target applications depending on when it is made; that is, one first input corresponds to multiple response results, so that more response results can be achieved with fewer types of first input.
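The time-dependent choice of target application in steps 102a and 102b can be sketched as follows. The gesture name, the binding table, the 30-second window, and the function name are all illustrative assumptions; the application only states that a preset time period is used.

```python
from typing import Optional

# gesture -> (application for the first parameter, application for the second parameter)
GESTURE_BINDINGS = {
    "two_finger_lateral_swipe": ("translation", "memo"),
}


def determine_target_app(gesture: str, call_active: bool,
                         seconds_since_call_end: Optional[float] = None,
                         preset_window_s: float = 30.0) -> Optional[str]:
    """Pick the target application from the call state at the moment of input."""
    during_call_app, after_call_app = GESTURE_BINDINGS[gesture]
    if call_active:
        # Step 102a: a voice call is in progress.
        return during_call_app
    if seconds_since_call_end is not None and 0.0 <= seconds_since_call_end <= preset_window_s:
        # Step 102b: the call ended within the preset time period.
        return after_call_app
    # Outside the window the gesture keeps its regular meaning (not modelled here).
    return None
```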

Optionally, after determining the target application corresponding to the first input, the voice recognition apparatus starts the target application in response to the first input. If a voice call is in progress, the voice call program is suspended and the target application is switched to the foreground. If the voice call has ended, the voice call program is closed and the target application is switched to the foreground.

Optionally, after receiving the first input, the voice recognition device first determines the target application program corresponding to the first input and then determines the specific voice content contained in the voice information. A new interface may then be opened at the same time as the target application program is started, or opened in response to user input after the target application program has been started.

Further optionally, the voice recognition apparatus may display the first key information in the voice information on the newly created interface of the target application.

Optionally, after the target application program displays the first key information, the voice recognition device performs at least one of the following operations on the first key information: saving, jumping to the dialing page, jumping to the short message sending page, and editing again. It should be noted that the operation on the first key information can be realized by the target application.

In the voice recognition method provided by the embodiment of the present application, when the voice information is acquired and the first input is received, the first key information in the voice information can be displayed through the target application program in response to the first input, where the first key information is associated with the type of the target application. In this way, the first key information associated with the target application is extracted from the voice information and displayed directly, which not only improves the extraction efficiency of the key information, but also avoids invalid recognition of the voice information, and also improves the human-computer interaction performance of the electronic device.

Optionally, as shown in FIG. 7 , in this embodiment of the present application, step 102 may be implemented through steps 701 to 703 .

Step 701: The apparatus for speech recognition starts the target application in response to the first input.

In the embodiment of the present application, according to whether the front-end interface of the electronic device is the display interface of the application identification or the voice chat interface when the first input is received, the target application is determined, and then the target application is started. It should be noted that, in order to reduce the operation steps, when the target application program is started, the newly created information interface can be directly started, so as to conveniently display the first key information.

Step 702: The apparatus for speech recognition acquires the first key information corresponding to the target application in the speech information.

In this embodiment of the present application, step 702 may be implemented by step 702a and step 702b.

Step 702a: The apparatus for speech recognition extracts the key fields included in the newly created information interface in the target application program.

Step 702b: The apparatus for speech recognition determines the first key information corresponding to each key field in the first text information according to the above key field and the rule matching method of the key field.

In the embodiment of the present application, after the key fields included in the newly created information interface of the target application program are extracted, the first key information in the first text information can be extracted according to the key fields by using templates, vocabularies, and rule matching.

Exemplarily, as shown in FIG. 8, it is assumed that the target application is an address book, and the key fields of the address book include name, phone number, and remarks, where the name is Li Si, the phone number is 135xxxxxxxx, and the address is No. 1 Shuyuan Street. A rule matching method can be set based on the fact that a phone number is either a 7- to 8-digit fixed-line number or an 11-digit mobile phone number, and the first key information corresponding to the key field "phone number" is extracted accordingly.
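The phone number rule described above (7 to 8 digits for a fixed-line number, 11 digits for a mobile number) can be expressed as a regular expression. This is a minimal sketch: the pattern, the function name, and the sample digits in the usage note are illustrative, and a real rule would likely also handle separators and area codes.

```python
import re

# An isolated run of exactly 11 digits, or of 7 to 8 digits.
# The lookarounds stop the rule from matching inside a longer digit string.
PHONE_RULE = re.compile(r"(?<!\d)(?:\d{11}|\d{7,8})(?!\d)")


def extract_phone_numbers(text: str) -> list:
    """Return every substring of the text that satisfies the phone number rule."""
    return PHONE_RULE.findall(text)
```

Calling `extract_phone_numbers("Li Si, mobile 13512345678, office 5551234")` picks out both numbers, while a 9-digit run such as an order number is rejected because it fits neither length.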

It should be noted that, in this embodiment of the present application, the display method used to display the above-mentioned first key information in the interface of the above-mentioned target application program includes but is not limited to a bold display method, an oblique display method, and a highlight display method.

Step 703: The device for speech recognition displays the first key information in the target application.

In the embodiment of the present application, the first key information corresponding to the key field in the voice information is identified according to the key field in the newly created information interface. In this embodiment of the present application, the first key information is displayed on the newly created information interface of the target application.

In this way, by directly starting the target application program interface, the first key information corresponding to the key fields in the newly created information interface of the target application program is obtained, so as to realize the purpose of real-time speech recognition.

Further optionally, in this embodiment of the present application, acquiring the first key information specifically includes acquiring the voice information and identifying the first key information corresponding to the target program in the voice information. Each of these steps is explained separately below.

Example 1: Get voice information

Further optionally, in this embodiment of the present application, the amount of voice information that needs to be recognized can be reduced, which lowers the proportion of redundant information and improves speech recognition efficiency. To this end, acquiring the voice information in step 702 can be achieved through step 702c or step 702d.

Step 702c: When detecting that a voice call is in progress, the voice recognition device determines that the voice information is the first voice information, where the first voice information is the voice information corresponding to a preset time period before the input time of the first input.

Step 702d: When detecting that the voice call ends, the voice recognition device determines that the voice information is the second voice information; wherein the second voice information is all the voice information recorded during the voice call.

Exemplarily, while a voice call is in progress, if the user hears the other party mention some key content (such as a phone number, an address, or an appointment time), the user provides the first input. Since the voice information corresponding to the preset time period before the input time of the first input usually includes the information that the user needs to record, taking the input time of the first input as the reference time node makes it possible to obtain voice information with a smaller amount of data and a lower proportion of redundant information, which improves the efficiency of voice recognition.

Exemplarily, when the end of the voice call is detected, either most of the content in the voice information is first key information for the user, or the first key information is scattered throughout the voice call. Therefore, the entire voice information recorded during the voice call is used.
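The choice between the first voice information (step 702c) and the second voice information (step 702d) can be sketched as a slice over the recorded samples. The function name, the representation of the recording as a flat list of samples, and the 30-second default are assumptions made for illustration only.

```python
def select_voice_information(samples, sample_rate, call_active,
                             input_time_s, preset_window_s=30.0):
    """Return the part of the recording that should be sent to recognition.

    samples      -- recorded audio samples from the start of the call
    sample_rate  -- samples per second
    input_time_s -- time of the first input, in seconds from the start of the call
    """
    if not call_active:
        # Step 702d: the call has ended, so use everything that was recorded.
        return samples
    # Step 702c: the call is in progress, so use only the preset period before the input.
    start = max(0, int((input_time_s - preset_window_s) * sample_rate))
    end = int(input_time_s * sample_rate)
    return samples[start:end]
```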

Further optionally, in this embodiment of the present application, after the target application program is started and before it is closed, the device for speech recognition can update the voice information in real time to avoid missing content in the voice call that needs to be recorded. Similar to step 702c, acquiring voice information in step 702 may further include step 702e.

Step 702e: While a voice call is in progress, the voice recognition device extracts updated voice information at preset intervals.

Wherein, the above-mentioned updated voice information includes: voice information of the voice call recording corresponding to the preset interval.

Exemplarily, after extracting the updated voice information, the voice recognition device extracts the second key information from the updated voice information based on the target application, and then updates the first key information displayed in the interface of the target application to the second key information.

For example, A and B have a conversation about purchasing computers. A informs B that he needs to buy 10 X-brand computers with model number 1566, and the key call content is displayed on the interface, for example, "B needs to buy 10 X-brand computers with model number 1566" (that is, the above-mentioned first key information). Then, B reconfirms the order with A during the call, and A tells B that he needs to buy 5 Y-brand computers with model number 1588. At this point, the electronic device recognizes the call content again and obtains new key call content, such as "5 Y-brand computers with model number 1588" (that is, the above-mentioned second key information), and updates the key call content displayed in the application interface of the target application based on the new key call content.

In this way, as the voice call progresses, updated voice information is continuously generated, so that it can be extracted and the first key information and the second key information can be displayed in real time by the target application program.
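A rough sketch of the interval-based update follows, assuming that key information is held as field-value pairs and that newer values overwrite older ones on the display. Both functions and the field names are hypothetical; the application does not specify how the displayed content is merged.

```python
def interval_bounds(start_s, now_s, interval_s):
    """List the (begin, end) bounds of every complete preset interval so far."""
    bounds = []
    begin = start_s
    while begin + interval_s <= now_s:
        bounds.append((begin, begin + interval_s))
        begin += interval_s
    return bounds


def refresh_display(displayed, new_key_info):
    """Return the displayed key information after applying newly extracted values."""
    updated = dict(displayed)      # leave the caller's dictionary untouched
    updated.update(new_key_info)   # newer values replace older ones field by field
    return updated
```

In the computer purchase example, the display would first hold brand X, model 1566, quantity 10, and after the next interval is recognized it would hold brand Y, model 1588, quantity 5.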

Further optionally, in this embodiment of the present application, acquiring the voice information in step 702 may further include: the voice recognition device filters out interference information in the voice information according to a preset voiceprint recognition algorithm, and regenerates the voice information. It should be noted that, during a voice chat, the surrounding environment may contain sounds such as whistles, animal roars, rain, and wind. Filtering out such interference information can therefore improve the accuracy of speech recognition.

Example 2: Identify the first key information corresponding to the target program in the voice information

Further optionally, as shown in FIG. 9 , in this embodiment of the present application, the text information included in the voice information is recognized in step 702 , which specifically includes steps 901 to 903 .

Step 901: The apparatus for speech recognition converts the above-mentioned speech information into target text information, and extracts first text information corresponding to the target application program from the target text information.

Step 902: The apparatus for speech recognition acquires at least one type of information.

Step 903: The apparatus for speech recognition deletes the text information whose type matches the preset type in the first text information according to the above at least one type of information, so as to obtain the above-mentioned first key information.

In this embodiment of the present application, the above-mentioned first text information includes first key information.

In the embodiment of the present application, the voice information is converted into target text information by performing audio feature extraction on the voice information, and then converting the audio features into text information through scoring by an acoustic model and a language model.
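The combined scoring by an acoustic model and a language model is commonly written as the decision rule below. The present application does not state a formula, so this is offered only as the standard textbook formulation; the symbols are assumptions introduced here, with X denoting the extracted audio feature sequence and W a candidate text.

```latex
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)
```

Here P(X | W) is the score assigned by the acoustic model and P(W) is the score assigned by the language model; the candidate text with the highest combined score is taken as the target text information.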

In this embodiment of the present application, the above-mentioned at least one type of information is used to indicate the type of information included in the first text information. Exemplarily, the text information types may include abnormal reduplicated words, spoken words, time words, and place words. Take the sentence "An annual meeting will be held at a certain hotel on the south side side of a certain bank on Yellow River Street, this this, starting at 16:00 on March 2 and ending at 21:00. Do you think such a time arrangement is okay?" as an example: "south side side" is an abnormal reduplication, "this this" is a spoken word, "March 2, 16:00, 21:00" are time words, and "Yellow River Street, a certain bank, a certain hotel" are place words.

It should be noted that, in the above example, abnormal reduplicated words, spoken words and local dialect words hinder obtaining the first key information. Therefore, the types of reduplicated words, spoken words and local dialect words can be determined as preset types. The device for speech recognition deletes the text information whose type matches a preset type from the first text information, which means deleting the redundant "side" and "this this" in the above example, and obtains the first key information: "An annual meeting will be held at a certain hotel on the south side of a certain bank on Yellow River Street, starting at 16:00 on March 2 and ending at 21:00. Do you think such a time arrangement is okay?"
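Steps 902 and 903 can be sketched as a filter over typed text spans. This is a minimal sketch under assumptions: the type labels (`"reduplication"`, `"spoken"`, `"dialect"`, and so on), the function name `delete_preset_types`, and the hand-labelled spans are all hypothetical; the present application does not specify how the types are represented or assigned.

```python
from typing import List, Set, Tuple

# (text, type) pairs; the type labels are hypothetical names for the
# "at least one type of information" obtained in step 902.
TypedSpan = Tuple[str, str]

PRESET_TYPES: Set[str] = {"reduplication", "spoken", "dialect"}


def delete_preset_types(spans: List[TypedSpan], preset: Set[str]) -> str:
    """Step 903: drop spans whose type matches a preset type, then rejoin."""
    kept = [text for text, kind in spans if kind not in preset]
    return " ".join(kept)


first_text = [
    ("on the south side", "place"),
    ("side", "reduplication"),
    ("this this", "spoken"),
    ("starting at 16:00 on March 2", "time"),
]
first_key_information = delete_preset_types(first_text, PRESET_TYPES)
```

Keeping the preset types in a set makes it straightforward to add or remove a type without touching the filtering logic.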

Exemplarily, the spoken word optimization process in the above text optimization processing method includes:

Mode 1: Perform spoken-text analysis on the first text information by using a preset spoken word list. The preset spoken word list may be obtained as follows: the user records a passage of spoken speech through voice input; text recognition is performed on the speech to obtain spoken text information; the spoken text information is displayed; the user edits the spoken text information so as to retain the spoken words corresponding to the user's own verbal habits; and finally the retained spoken words are combined into the spoken word list.

Mode 2: Perform oral text analysis on the first text information by adding a language model (specially training a language model for commonly used spoken words).

Specifically, for simple spoken words, such as a consecutively repeated "this this" or "this (followed by a long pause)", the spoken words that match the preset type are identified. The recognized spoken words can then be displayed to the user in a highlighted manner, so that the user can choose whether to delete them from the final output of the voice input. Alternatively, depending on the settings of the voice input method, the recognized spoken words can be deleted with one click or deleted automatically.

In this way, by matching the type contained in the first text information with the preset type, and deleting the colloquial vocabulary of the first text information, the readability of the first key information can be improved.
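Mode 1 and the subsequent deletion can be sketched with a regular expression built from the preset spoken word list. This is only an illustration under assumptions: the function names `find_spoken_words` and `delete_spans` and the example word list are hypothetical, and plain text matching stands in for the recognition described above.

```python
import re
from typing import Iterable, List, Tuple


def find_spoken_words(text: str, spoken_words: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of runs of preset spoken words in text."""
    alternatives = "|".join(re.escape(w) for w in spoken_words)
    # One or more consecutive list entries are reported as a single run.
    pattern = re.compile(
        rf"\b(?:{alternatives})(?:\s+(?:{alternatives}))*\b", re.IGNORECASE
    )
    return [m.span() for m in pattern.finditer(text)]


def delete_spans(text: str, spans: List[Tuple[int, int]]) -> str:
    """One-click deletion: remove all spans, then normalize whitespace."""
    for start, end in reversed(spans):  # work backwards to keep offsets valid
        text = text[:start] + text[end:]
    return re.sub(r"\s+", " ", text).strip()


sentence = "the meeting this this starts at 16:00 um on March 2 is this ok"
spans = find_spoken_words(sentence, ["this this", "um"])
cleaned = delete_spans(sentence, spans)
```

The returned offsets could equally drive highlighting in the interface, leaving the decision to delete with the user; a single "this" is left untouched because only the listed phrase "this this" counts as a spoken word here.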

Optionally, as shown in FIG. 10, in this embodiment of the present application, after step 102, the speech recognition method provided in this embodiment of the present application may further include steps 1001 and 1002.

Step 1001: The apparatus for speech recognition receives a second input from the user.

Step 1002: In response to the second input, the apparatus for speech recognition uses an editing processing method corresponding to the second input to process the first key information.

In the embodiment of the present application, the above-mentioned second input is an editing input of the above-mentioned first key information by the user. It should be noted that, as shown in FIG. 11, assuming that the target application is the address book, the editing processing method corresponding to the above second input is: after the user clicks to select the information to be modified, the information to be modified appears in the editing column, and the user deletes, splices, or re-enters the information to be modified. The original content of the first key information may be updated in real time as the user modifies it, or may be replaced after the user's modification is completed.

Exemplarily, if the information to be modified occurs repeatedly, then after the user deletes, splices or re-enters the information to be modified, all of the repeated occurrences are corrected uniformly, so as to achieve rapid information integration.
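The uniform correction of repeated information can be sketched as a single replacement applied to every occurrence. The function name `correct_uniformly` and the sample contact text are hypothetical and serve only to illustrate the behaviour described above.

```python
def correct_uniformly(key_info: str, to_modify: str, corrected: str) -> str:
    """Apply one user edit to every occurrence of the repeated text."""
    if not to_modify:
        return key_info  # nothing selected, so nothing to change
    return key_info.replace(to_modify, corrected)


original = "Contact: Wang Wei. Meeting with Wang Wei at 16:00."
updated = correct_uniformly(original, "Wang Wei", "Wang Wei (sales)")
```

The guard on an empty selection matters because replacing an empty string in Python would insert the corrected text between every character.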

Optionally, in the embodiment of the present application, after step 102, the speech recognition method provided in the embodiment of the present application may further include: generating a temporary cache control, where the temporary cache control is used to cache the first key information.

Exemplarily, if the target application is a memo and the first key information is a phone number, the phone number can be dialed directly through the temporary cache control. The displayed first key information can thus be applied directly, sparing the user the tedious process of selecting, copying and pasting the phone number in order to make the call.
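The temporary cache control in this example can be sketched as an object that caches the first key information and exposes a dialable link. This is a sketch under assumptions: the class name `TemporaryCacheControl`, the 11-digit number pattern, and the use of a `tel:` URI to trigger dialing are illustrative choices that the present application does not specify.

```python
import re
from typing import Optional

# Assumption: phone numbers are runs of exactly 11 digits.
PHONE_PATTERN = re.compile(r"(?<!\d)\d{11}(?!\d)")


class TemporaryCacheControl:
    """Hypothetical control that caches the first key information."""

    def __init__(self, key_info: str) -> None:
        self.key_info = key_info

    def dial_uri(self) -> Optional[str]:
        """Return a tel: URI for the first cached phone number, if any."""
        match = PHONE_PATTERN.search(self.key_info)
        return f"tel:{match.group()}" if match else None


control = TemporaryCacheControl("Call back on 13800138000 before 16:00")
uri = control.dial_uri()
```

Returning `None` when no number is cached lets the caller decide whether to show the dial action at all.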

It should be noted that, in the speech recognition method provided by the embodiments of the present application, the execution subject may be a speech recognition apparatus, or a control module in the speech recognition apparatus for performing the speech recognition method. In the embodiments of the present application, a method for performing speech recognition by a speech recognition device is used as an example to describe the speech recognition device provided by the embodiments of the present application. However, in practical applications, the execution subject of the above speech recognition method may also be other devices or apparatuses that can perform the speech recognition method, which is not limited in this embodiment of the present application.

As shown in FIG. 12 , an embodiment of the present application provides a device for speech recognition. The device for speech recognition includes: a first receiving module 1201 and a first display module 1202;

The above-mentioned first receiving module 1201 is configured to receive the first input of the user when the voice information is obtained;

The above-mentioned first display module 1202 is used to display the first key information in the above-mentioned voice information through the target application program in response to the first input received by the above-mentioned first receiving module 1201, the above-mentioned first key information being associated with the type of the above-mentioned target application program.

Optionally, the above-mentioned first receiving module 1201 is configured to: in the process of conducting a voice call, perform voice recording on the call content of the above-mentioned voice call to obtain the above-mentioned voice information; and receive the user's first input after the above-mentioned voice call ends or during the above-mentioned voice call.

Optionally, as shown in FIG. 13 , the above apparatus further includes: a determining module 1203;

The above-mentioned determining module 1203 is configured to, before the above-mentioned first display module 1202 displays, in response to the above-mentioned first input, the first key information in the above-mentioned voice information through the target application, determine the application program associated with the first parameter of the above-mentioned first input as the above-mentioned target application program in the case that a voice call is detected to be in progress;

The above-mentioned determining module 1203 is further configured to, before the above-mentioned first display module 1202 displays, in response to the above-mentioned first input, the first key information in the above-mentioned voice information through the target application, determine the application program associated with the second parameter of the above-mentioned first input as the above-mentioned target application program in the case that the end of the voice call is detected.
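The selection performed by the determining module can be sketched as a branch on the call state. This is a minimal sketch under assumptions: the enum `CallState`, the function `determine_target_app`, and the representation of the first input as a dictionary that maps each parameter directly to an application name are all hypothetical.

```python
from enum import Enum
from typing import Dict


class CallState(Enum):
    IN_CALL = "in_call"
    ENDED = "ended"


def determine_target_app(call_state: CallState, first_input: Dict[str, str]) -> str:
    """Pick the target application from the parameter that applies to the call state."""
    if call_state is CallState.IN_CALL:
        # Voice call in progress: use the first parameter of the first input.
        return first_input["first_parameter"]
    # Voice call ended: use the second parameter of the first input.
    return first_input["second_parameter"]


first_input = {"first_parameter": "memo", "second_parameter": "address_book"}
during_call = determine_target_app(CallState.IN_CALL, first_input)
after_call = determine_target_app(CallState.ENDED, first_input)
```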

Optionally, the above-mentioned first display module 1202 is configured to: start the target application in response to the above-mentioned first input; obtain the first key information corresponding to the above-mentioned target application in the above-mentioned voice information; and display the above-mentioned first key information in the above-mentioned target application.

Optionally, the above-mentioned first display module 1202 is specifically configured to: convert the above-mentioned voice information into target text information, and extract the first text information corresponding to the above-mentioned target application program from the above-mentioned target text information, the above-mentioned first text information including the above-mentioned first key information; obtain at least one type of information, the above-mentioned at least one type of information being used to indicate the type of information contained in the above-mentioned first text information; and, according to the above-mentioned at least one type of information, delete the text information whose type matches the preset type from the above-mentioned first text information to obtain the above-mentioned first key information.

Optionally, as shown in FIG. 13 , the above apparatus further includes: a second receiving module 1204 and a first processing module 1205;

The above-mentioned second receiving module 1204 is used for receiving the user's second input after the above-mentioned first display module 1202 displays the first key information in the above-mentioned voice information through the target application program, the above-mentioned second input being the user's editing input of the above-mentioned first key information;

The above-mentioned first processing module 1205 is configured to, in response to the second input received by the above-mentioned second receiving module 1204, use an editing processing method corresponding to the above-mentioned second input to process the above-mentioned first key information.

In the device for speech recognition provided by the embodiment of the present application, in the case that the voice information has been acquired, after the first input is received, the first key information in the voice information can be displayed through the target application program in response to the first input, wherein the first key information is associated with the type of the target application. In this way, the first key information associated with the target application is extracted from the voice information and displayed directly, which not only improves the extraction efficiency of the key information, but also avoids invalid recognition of the voice information and improves the human-machine interaction performance of the electronic device.

The apparatus for speech recognition in this embodiment of the present application may be an apparatus, or may be a component, an integrated circuit, or a chip in a terminal. The apparatus may be a mobile electronic device or a non-mobile electronic device. Exemplarily, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, an in-vehicle electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA), etc., and the non-mobile electronic device may be a server, a network attached storage (NAS), a personal computer (PC), a television (TV), a teller machine or a self-service machine, etc., which are not specifically limited in the embodiments of the present application.

The apparatus for speech recognition in this embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of the present application.

The apparatus for speech recognition provided in this embodiment of the present application can implement each process implemented by the foregoing method embodiment, and to avoid repetition, details are not described herein again.

For the beneficial effects of the various implementations in this embodiment, reference may be made to the beneficial effects of the corresponding implementations in the foregoing method embodiments, which are not repeated here to avoid repetition.

Optionally, as shown in FIG. 14, an embodiment of the present application further provides an electronic device 1400, including a processor 1401, a memory 1402, and a program or instruction stored in the memory 1402 and executable on the processor 1401. When the program or instruction is executed by the processor 1401, each process of the above-mentioned speech recognition method embodiment can be realized, and the same technical effect can be achieved. In order to avoid repetition, details are not repeated here.

It should be noted that the electronic devices in the embodiments of the present application include the above-mentioned mobile electronic devices and non-mobile electronic devices.

FIG. 15 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.

The electronic device 1500 includes but is not limited to: a radio frequency unit 1501, a network module 1502, an audio output unit 1503, an input unit 1504, a sensor 1505, a display unit 1506, a user input unit 1507, an interface unit 1508, a memory 1509, a processor 1510, and other components.

Those skilled in the art can understand that the electronic device 1500 may also include a power supply (such as a battery) for supplying power to various components, and the power supply may be logically connected to the processor 1510 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system. The structure of the electronic device shown in FIG. 15 does not constitute a limitation on the electronic device. The electronic device may include more or fewer components than shown, or combine some components, or have a different arrangement of components, which will not be repeated here.

Wherein, the above-mentioned user input unit 1507 is configured to receive the first input of the user when the voice information is acquired;

The processor 1510 is configured to display, through the target application, the first key information in the voice information in response to the first input, where the first key information is associated with the type of the target application.

Optionally, the above-mentioned processor 1510 is further configured to perform voice recording on the call content of the above-mentioned voice call in the process of conducting a voice call, and obtain the above-mentioned voice information;

Optionally, the above-mentioned user input unit 1507 is further configured to receive the first input of the user after the above-mentioned voice call ends or during the above-mentioned voice call.

Optionally, the processor 1510 is further configured to determine the application associated with the first parameter of the first input as the target application when a voice call is detected to be in progress, and to determine the application associated with the second parameter of the first input as the target application when the end of the voice call is detected.

Optionally, the above-mentioned processor 1510 is further configured to start the target application in response to the above-mentioned first input; obtain the first key information corresponding to the above-mentioned target application in the above-mentioned voice information; and display the above-mentioned first key information in the above-mentioned target application.

Optionally, the above-mentioned processor 1510 is further configured to convert the above-mentioned voice information into target text information, and extract the first text information corresponding to the above-mentioned target application program from the above-mentioned target text information, the above-mentioned first text information including the above-mentioned first key information; obtain at least one type of information, the above-mentioned at least one type of information being used to indicate the type of information contained in the above-mentioned first text information; and, according to the above-mentioned at least one type of information, delete the text information whose type matches the preset type from the above-mentioned first text information to obtain the above-mentioned first key information.

Optionally, the above-mentioned user input unit 1507 is further configured to receive the second input of the user, and the above-mentioned second input is the editing input of the above-mentioned first key information by the user;

Optionally, the above-mentioned processor 1510 is further configured to, in response to the above-mentioned second input, use an editing processing manner corresponding to the above-mentioned second input to process the above-mentioned first key information.

In the electronic device provided by the embodiment of the present application, in the case that the voice information has been acquired, after the first input is received, the first key information in the voice information can be displayed through the target application program in response to the first input, wherein the first key information is associated with the type of the target application. In this way, the first key information associated with the target application is extracted from the voice information and displayed directly, which not only improves the extraction efficiency of the key information, but also avoids invalid recognition of the voice information and improves the human-machine interaction performance of the electronic device.

It should be understood that, in this embodiment of the present application, the input unit 1504 may include a graphics processing unit (GPU) 15041 and a microphone 15042. The graphics processing unit 15041 processes image data of still pictures or video obtained by an image capture device (such as a camera). The display unit 1506 may include a display panel 15061, which may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 1507 includes a touch panel 15071 and other input devices 15072. The touch panel 15071 is also called a touch screen. The touch panel 15071 may include two parts, a touch detection device and a touch controller. Other input devices 15072 may include but are not limited to physical keyboards, function keys (such as volume control keys, switch keys, etc.), trackballs, mice, and joysticks, which will not be repeated here. The memory 1509 may be used to store software programs as well as various data, including but not limited to application programs and operating systems. The processor 1510 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the above-mentioned modem processor may also not be integrated into the processor 1510.

Embodiments of the present application further provide a readable storage medium, where a program or an instruction is stored on the readable storage medium. When the program or instruction is executed by a processor, each process of the above-mentioned speech recognition method embodiment can be implemented, and the same technical effect can be achieved. In order to avoid repetition, details are not repeated here.

The processor is the processor in the electronic device in the above embodiment. A readable storage medium includes a computer-readable storage medium, such as a computer read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and the like.

An embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement each process of the above-mentioned speech recognition method embodiment, and can achieve the same technical effect. In order to avoid repetition, details are not repeated here.

It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip, or the like.

It should be noted that, herein, the terms "comprising", "including" or any other variation thereof are intended to cover non-exclusive inclusion, such that a process, method, article or device comprising a series of elements includes not only those elements, but also other elements not expressly listed, or elements inherent to such a process, method, article or apparatus. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in a process, method, article or apparatus that includes the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in the reverse order, depending on the functions involved. For example, the described methods may be performed in an order different from that described, and various steps may also be added, omitted, or combined. Additionally, features described with reference to some examples may be combined in other examples.

From the description of the above embodiments, those skilled in the art can clearly understand that the method of the above embodiment can be implemented by means of software plus a necessary general hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, CD-ROM) and includes several instructions to enable a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods of the various embodiments of the present application.

The embodiments of the present application have been described above in conjunction with the accompanying drawings, but the present application is not limited to the above-mentioned specific embodiments, which are merely illustrative rather than restrictive. Inspired by this application, those of ordinary skill in the art can devise many further forms without departing from the purpose of this application and the scope of protection of the claims, all of which fall within the protection of this application.

Claims

A method for speech recognition, the method comprising:

In the case of acquiring the voice information, receiving the first input of the user;

In response to the first input, the target application displays first key information in the voice information, the first key information being associated with the type of the target application.
The method according to claim 1, wherein, when the voice information is acquired, receiving the first input of the user comprises:

In the process of making a voice call, voice recording is performed on the call content of the voice call, and the voice information is obtained;

After the voice call ends or during the voice call, a first input from the user is received.
The method according to claim 1 or 2, wherein, before the first key information in the voice information is displayed by the target application in response to the first input, the method further comprises:

In the case of detecting a voice call, the application program associated with the first parameter of the first input is determined as the target application program;

In the case that the end of the voice call is detected, the application program associated with the second parameter of the first input is determined as the target application program.
The method according to claim 1, wherein, in response to the first input, displaying the first key information in the voice information through a target application program comprises:

In response to the first input, start the target application, or switch the target application to the foreground;

Obtain the first key information corresponding to the target application in the voice information;

In the target application, the first key information is displayed.
The method according to claim 4, wherein the acquiring the first key information corresponding to the target application in the voice information comprises:

Converting the voice information into target text information, and extracting first text information corresponding to the target application program from the target text information, where the first text information includes the first key information;

acquiring at least one type of information, where the at least one type of information is used to indicate the type of information contained in the first text information;

According to the at least one type of information, the text information whose type matches the preset type in the first text information is deleted to obtain the first key information.
The method according to claim 1, wherein after displaying the first key information in the voice information through the target application, the method further comprises:

receiving a second input from the user, where the second input is an editing input of the first key information by the user;

In response to the second input, the first key information is processed using an editing processing manner corresponding to the second input.
A device for speech recognition, the device comprising: a first receiving module and a first display module;

The first receiving module is configured to receive the first input of the user when the voice information is acquired;

The first display module is configured to display the first key information in the voice information through a target application program in response to the first input received by the first receiving module, the first key information being associated with the type of the target application program.
The apparatus according to claim 7, wherein the first receiving module is configured to:

In the process of making a voice call, voice recording is performed on the call content of the voice call, and the voice information is obtained;

After the voice call ends or during the voice call, a first input from the user is received.
The apparatus according to claim 7 or 8, wherein the apparatus further comprises: a determining module;

The determining module is configured to, before the first display module displays, in response to the first input, the first key information in the voice information through the target application, determine the application program associated with the first parameter of the first input as the target application program in the case that a voice call is detected to be in progress;

The determining module is further configured to, before the first display module displays, in response to the first input, the first key information in the voice information through the target application, determine the application program associated with the second parameter of the first input as the target application program in the case that the end of the voice call is detected.
The device according to claim 7, wherein the first display module is used for:

In response to the first input, start the target application;

Obtain the first key information corresponding to the target application in the voice information;

In the target application, the first key information is displayed.
The device according to claim 10, wherein the first display module is specifically used for:

Converting the voice information into target text information, and extracting first text information corresponding to the target application program from the target text information, where the first text information includes the first key information;

acquiring at least one type of information, where the at least one type of information is used to indicate the type of information contained in the first text information;

According to the at least one type of information, the text information whose type matches the preset type in the first text information is deleted to obtain the first key information.
The apparatus according to claim 7, wherein the apparatus further comprises: a second receiving module and a first processing module;

The second receiving module is configured to receive the second input from the user after the first display module displays the first key information in the voice information through the target application, the second input being an editing input of the first key information by the user;

The first processing module is configured to, in response to the second input received by the second receiving module, use an editing processing manner corresponding to the second input to process the first key information.
An electronic device, comprising a processor, a memory, and a program or instruction stored on the memory and executable on the processor, the program or instruction, when executed by the processor, implementing the steps of the method for speech recognition according to any one of claims 1 to 6.
A readable storage medium on which programs or instructions are stored, and when the programs or instructions are executed by a processor, implement the steps of the method for speech recognition according to any one of claims 1 to 6.
A computer program product executed by at least one processor to implement the method of speech recognition as claimed in any one of claims 1 to 6.
A chip, the chip including a processor and a communication interface, the communication interface being coupled with the processor, and the processor being used for running a program or an instruction to implement the method for speech recognition according to any one of claims 1 to 6.
An electronic device, wherein the electronic device is configured to perform the method for speech recognition according to any one of claims 1 to 6.