CN114155855A - Voice recognition method, server and electronic equipment - Google Patents

Voice recognition method, server and electronic equipment

Info

Publication number
CN114155855A
CN114155855A (application CN202111553176.6A)
Authority
CN
China
Prior art keywords
voice
keyword
data
voice data
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111553176.6A
Other languages
Chinese (zh)
Inventor
戴磊
吴聪聪
李霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd filed Critical Hisense Visual Technology Co Ltd
Priority to CN202111553176.6A priority Critical patent/CN114155855A/en
Publication of CN114155855A publication Critical patent/CN114155855A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment provides a voice recognition method, a server, and an electronic device, belonging to voice recognition technology. The voice recognition method includes: receiving voice data to be recognized sent by an electronic device; if no voice instruction corresponding to the voice data is recognized, extracting a voice keyword from the voice data; determining voice instructions related to the voice keyword according to historical voice data; and sending at least one piece of alternative user intention information corresponding to the voice instructions related to the voice keyword to the electronic device. Compared with the prior art, when no voice instruction corresponding to the voice data is recognized, at least one piece of alternative user intention information corresponding to the related voice instructions is sent to the electronic device so that the electronic device can display it, which improves the intelligence level of voice recognition.

Description

Voice recognition method, server and electronic equipment
Technical Field
Embodiments of the present application relate to voice recognition technology, and more particularly, to a voice recognition method, a server, and an electronic device.
Background
Intelligent voice interaction technology is gradually becoming standard in smart home products such as smart televisions and smart refrigerators. In an intelligent voice interaction scenario, a user controls a smart television, smart refrigerator, and the like by voice to perform operations such as watching videos, listening to music, checking the weather, and controlling the television.
In the related art, during intelligent voice interaction, voice data input by a user is recognized as text through a voice recognition module, and the text is then subjected to lexical, syntactic, and semantic analysis by a semantic analysis module so as to understand the voice instruction corresponding to the user's intention. The voice instruction is issued to the terminal device, which executes or displays it.
However, when the voice data input by the user does not express a complete user intention, no corresponding voice instruction can be recognized from the voice data, so the input receives no response, which lowers the intelligence level of voice recognition and degrades the user experience.
Disclosure of Invention
Exemplary embodiments of the present application provide a voice recognition method, a server, and an electronic device, so as to improve the intelligence level of voice recognition.
In a first aspect, an embodiment of the present application provides a speech recognition method, including:
receiving voice data to be recognized sent by electronic equipment;
if the voice instruction corresponding to the voice data is not recognized, extracting a voice keyword in the voice data;
determining a voice instruction related to the voice keyword according to historical voice data;
and sending at least one alternative user intention information corresponding to the voice instruction related to the voice keyword to the electronic equipment.
In some embodiments of the present application, the determining the voice instruction related to the voice keyword according to the historical voice data includes:
and retrieving the voice instruction related to the voice keyword from historical voice data of all users in the database.
In some embodiments of the present application, the determining, according to the historical speech data, the speech instruction related to the speech keyword further includes:
determining a current user inputting the voice data;
and retrieving the voice instruction related to the voice keyword from the historical voice data of the current user in a database.
In some embodiments of the present application, if the voice keyword is a verb, the retrieving the voice instruction related to the voice keyword includes:
and searching the voice instruction with the voice keyword as a prefix from the historical voice data through a prefix searching algorithm.
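Purely as an illustration of the prefix search described above, one common way to retrieve commands by prefix is a trie (prefix tree) built over historical commands. The class names and command strings below are hypothetical, not part of the patent.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.command = None  # full command stored at the node that ends it

class CommandTrie:
    """Index of historical voice commands supporting prefix retrieval."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, command):
        node = self.root
        for ch in command:
            node = node.children.setdefault(ch, TrieNode())
        node.command = command

    def search_prefix(self, prefix):
        """Return every stored command that begins with `prefix`."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results, stack = [], [node]
        while stack:  # depth-first collection below the prefix node
            n = stack.pop()
            if n.command is not None:
                results.append(n.command)
            stack.extend(n.children.values())
        return results

# Hypothetical historical commands sharing the verb "open".
trie = CommandTrie()
for cmd in ["open the music app", "open the settings", "pause playback"]:
    trie.insert(cmd)
```

Querying the trie with the extracted verb then returns every historical command starting with it, which could serve as the related voice instructions.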
In some embodiments of the present application, the determining the voice instruction related to the voice keyword according to the historical voice data includes:
and screening out the voice instruction related to the voice keyword from the historical voice data according to the equipment connection information.
In some embodiments of the present application, the sending at least one alternative user intention information corresponding to the voice instruction related to the voice keyword to the electronic device further includes:
determining the historical request count of each voice instruction related to the voice keyword;
sorting the voice instructions related to the voice keyword in descending order of historical request count;
selecting a preset number of the voice instructions related to the voice keyword according to the sorted order, and determining at least one piece of alternative user intention information corresponding to them;
and sending the at least one piece of alternative user intention information to the electronic device.
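The counting, descending sort, and preset-number selection described in the steps above can be sketched as follows. This is only an illustration: the command strings, the function name, and the `top_n` parameter are invented for demonstration and are not part of the patent.

```python
from collections import Counter

def rank_candidate_intents(matched_commands, history_log, top_n=3):
    """Sort the matched commands by historical request count (descending)
    and keep at most `top_n` of them as candidate user intents."""
    counts = Counter(history_log)
    ranked = sorted(matched_commands, key=lambda c: counts[c], reverse=True)
    return ranked[:top_n]

# Hypothetical history: "open the music app" was requested most often.
history = ["open the music app"] * 5 + ["open the settings"] * 2 + ["open the gallery"]
matched = ["open the gallery", "open the settings", "open the music app"]
candidates = rank_candidate_intents(matched, history, top_n=2)
```

With a preset number of two, only the two most frequently requested commands survive as candidate intentions.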
In a second aspect, an embodiment of the present application provides a speech recognition method, including:
receiving voice data input by a user, wherein the voice data lacks a necessary keyword for forming a complete user intention;
sending the voice data to a server;
receiving at least one piece of alternative user intention information sent by the server, wherein the user intention information corresponds to voice instructions related to a voice keyword in the voice data, and the voice instructions related to the voice keyword are determined from historical voice data.
In some embodiments of the present application, after the receiving the user intention information sent by the server, the method further includes:
displaying the at least one alternative user intent information on a display interface.
In a third aspect, an embodiment of the present application provides a server, where the server includes:
a memory and a processor;
the memory for storing an executable program, the processor configured to:
receiving voice data to be recognized sent by electronic equipment;
if the voice instruction corresponding to the voice data is not recognized, extracting a voice keyword in the voice data;
determining a voice instruction related to the voice keyword according to historical voice data;
and sending at least one alternative user intention information corresponding to the voice instruction related to the voice keyword to the electronic equipment.
In some embodiments of the present application, the processor is specifically configured to:
and retrieving the voice instruction related to the voice keyword from historical voice data of all users in the database.
In some embodiments of the present application, the processor is specifically configured to:
determining a current user inputting the voice data;
and retrieving the voice instruction related to the voice keyword from the historical voice data of the current user in a database.
In some embodiments of the present application, if the speech keyword is a verb, the processor is configured to:
and searching the voice instruction with the voice keyword as a prefix from the historical voice data through a prefix searching algorithm.
In some embodiments of the present application, the processor is specifically configured to:
and screening out the voice instruction related to the voice keyword from the historical voice data according to the equipment connection information.
In some embodiments of the present application, the processor is specifically configured to:
determining the historical request count of each voice instruction related to the voice keyword;
sorting the voice instructions related to the voice keyword in descending order of historical request count;
selecting a preset number of the voice instructions related to the voice keyword according to the sorted order, and determining at least one piece of alternative user intention information corresponding to them;
and sending the at least one alternative user intention information to the electronic equipment.
In a fourth aspect, an embodiment of the present application provides an electronic device, including:
a communicator configured to perform data communication with the server, receiving data sent by the server and sending data to the server;
a controller connected with the communicator, the controller configured to:
receiving voice data input by a user, wherein the voice data lacks a necessary keyword for forming a complete user intention;
sending the voice data to a server;
receiving at least one piece of alternative user intention information sent by the server, wherein the user intention information corresponds to voice instructions related to a voice keyword in the voice data, and the voice instructions related to the voice keyword are determined from historical voice data.
In some embodiments of the present application, the controller is further configured to:
displaying the at least one alternative user intent information on a display interface.
In a fifth aspect, the present application further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any one of the possible methods of the first aspect.
In a sixth aspect, the present application further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any one of the possible methods of the second aspect.
Embodiments of the present application provide a voice recognition method, a server, and an electronic device. The voice recognition method includes: receiving voice data to be recognized sent by an electronic device; if no voice instruction corresponding to the voice data is recognized, extracting a voice keyword from the voice data; determining voice instructions related to the voice keyword according to historical voice data; and sending at least one piece of alternative user intention information corresponding to the voice instructions related to the voice keyword to the electronic device. Compared with the prior art, when no voice instruction corresponding to the voice data is recognized, at least one piece of alternative user intention information corresponding to the related voice instructions is sent to the electronic device so that the electronic device can display it, which improves the intelligence level of voice recognition.
Drawings
In order to more clearly illustrate the embodiments of the present application or the implementations in the related art, the drawings needed for describing the embodiments or the related art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and other drawings can be derived from them by those skilled in the art without inventive effort.
Fig. 1 is a schematic diagram illustrating an operational scenario between an electronic device and a control apparatus according to some embodiments;
FIG. 2 illustrates a flow diagram of a method of speech recognition, according to some embodiments;
FIG. 3 illustrates a schematic diagram of determining user intent, according to some embodiments;
FIG. 4 illustrates a signaling interaction diagram of a speech recognition method according to some embodiments;
FIG. 5 illustrates a flow diagram of another speech recognition method according to some embodiments;
FIG. 6 illustrates a flow diagram of yet another speech recognition method according to some embodiments;
FIG. 7 illustrates an interface diagram of an electronic device, according to some embodiments;
FIG. 8 illustrates an interface diagram of another electronic device, according to some embodiments;
FIG. 9 illustrates a structural diagram of a server, according to some embodiments;
FIG. 10 illustrates a schematic structural diagram of an electronic device, according to some embodiments.
Detailed Description
To make the objects, embodiments, and advantages of the present application clearer, exemplary embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It should be understood that the described exemplary embodiments are only a part, not all, of the embodiments of the present application.
All other embodiments that a person skilled in the art can derive from the exemplary embodiments described herein without inventive effort fall within the scope of the appended claims. In addition, while the disclosure herein is presented in terms of one or more exemplary examples, it should be appreciated that individual aspects of the disclosure may also be implemented on their own as a complete embodiment.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description, the claims, and the drawings of this application are used to distinguish between similar or analogous objects or entities and do not necessarily imply a particular order or sequence, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances, such that the embodiments described herein can, for example, operate in sequences other than those illustrated or otherwise described herein.
Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or device that comprises a list of elements is not necessarily limited to those elements explicitly listed, but may include other elements not expressly listed or inherent to such product or device.
The term "module," as used herein, refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the functionality associated with that element.
The term "remote control" as used herein refers to a component of an electronic device (such as the electronic device disclosed herein) that can typically control the device wirelessly over a relatively short distance. The connection with the electronic device typically uses infrared and/or radio frequency (RF) signals and/or Bluetooth, and may also include WiFi, wireless USB, Bluetooth, motion sensors, and the like. For example, a handheld touch remote control replaces most of the physical built-in hard keys of a conventional remote control with a user interface on a touch screen.
The term "gesture" as used in this application refers to a user's behavior through a change in hand shape or an action such as hand motion to convey a desired idea, action, purpose, or result.
Fig. 1 is a schematic diagram illustrating an operation scenario between an electronic device and a control apparatus according to an embodiment. As shown in fig. 1, a user may operate the electronic device 200 through the mobile terminal 300 and the control apparatus 100.
In some embodiments, the control apparatus 100 may be a remote control, and communication between the remote control and the electronic device includes infrared protocol communication, Bluetooth protocol communication, or other short-range communication methods; the electronic device 200 is controlled wirelessly or by wire. The user may input user commands through keys on the remote control, voice input, control panel input, and the like to control the electronic device 200. For example, the user may input corresponding control commands through volume up/down keys, channel control keys, up/down/left/right movement keys, a voice input key, a menu key, a power on/off key, and so on, to control the functions of the electronic device 200.
In some embodiments, mobile terminals, tablets, computers, laptops, and other smart devices may also be used to control the electronic device 200. For example, the electronic device 200 is controlled using an application running on the smart device. The application, through configuration, may provide the user with various controls in an intuitive User Interface (UI) on a screen associated with the smart device.
In some embodiments, the mobile terminal 300 may install a software application that communicates with the electronic device 200 through a network communication protocol, enabling one-to-one control operation and data communication. For example, a control instruction protocol can be established between the mobile terminal 300 and the electronic device 200, the remote control keyboard can be synchronized to the mobile terminal 300, and the functions of the electronic device 200 can be controlled through the user interface on the mobile terminal 300. The audio and video content displayed on the mobile terminal 300 can also be transmitted to the electronic device 200 to achieve a synchronous display function.
As also shown in fig. 1, the electronic device 200 is in data communication with the server 400 through a variety of communication methods. The electronic device 200 may connect through a local area network (LAN), a wireless local area network (WLAN), or other networks. The server 400 may provide various content and interactions to the electronic device 200. Illustratively, by sending and receiving information and performing Electronic Program Guide (EPG) interactions, the electronic device 200 may receive software program updates or access a remotely stored digital media library. The server 400 may be one cluster or multiple clusters, and may include one or more types of servers. Other network service content, such as video on demand and advertising services, is provided through the server 400.
The electronic device 200 may be a liquid crystal display, an OLED display, or a projection device. The specific type, size, and resolution of the electronic device are not limited; those skilled in the art will appreciate that the performance and configuration of the electronic device 200 may be varied as desired.
In addition to the broadcast-receiving television function, the electronic device 200 may additionally provide smart network television functions supported by a computer, including but not limited to network TV, smart TV, Internet Protocol TV (IPTV), and the like.
The following describes a voice interaction process related to an embodiment of the present application.
In the voice interaction process, after receiving voice data sent by the terminal equipment, the server firstly obtains text data corresponding to the voice data through voice recognition, then understands the intention of a user based on the text data, and finally determines a corresponding voice instruction according to the intention of the user.
Voice recognition mainly comprises two parts: training and recognition. Training is usually completed offline: signal processing and knowledge mining are performed on a massive speech and language database collected in advance to obtain the acoustic model and language model required by the voice recognition system. The recognition process is usually completed online, automatically recognizing the user's real-time speech. The recognition process can generally be divided into two major modules, a front-end module and a back-end module. The front-end module mainly performs endpoint detection (removing redundant silence and non-speech sounds), noise reduction, feature extraction, and the like. The back-end module uses the trained acoustic model and language model to perform statistical pattern recognition (also called "decoding") on the feature vectors of the user's speech to obtain the text information they contain. In addition, the back end includes an adaptive feedback module that self-learns from the user's speech, making necessary corrections to the acoustic model and the language model and thereby further improving recognition accuracy.
Understanding the user intention may be achieved through a user intention classification model and a target transition probability matrix. The user intention classification model can be trained to classify the user intention of text data. The target transition probability matrix gives the probability that the text data converted from the voice data belongs to each intention category. That is, the target transition probability matrix does not depend on which intention category the current text data belongs to; it only needs to know which intention category the previous text data belonged to, and from that category it predicts the probability that the next text data belongs to each intention category. After the user intention is understood, the corresponding voice instruction may be determined based on the mapping relationship between user intentions and voice instructions.
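The role of the target transition probability matrix described above can be illustrated with a toy sketch: given only the previous utterance's intention category, it yields a distribution over the next utterance's category. The category names and probability values below are invented for demonstration and are not taken from the patent.

```python
# Invented intention categories and a toy transition matrix: row i gives
# the probability distribution over the NEXT utterance's category when
# the PREVIOUS utterance belonged to category i.
CATEGORIES = ["play_video", "play_music", "check_weather"]
TRANSITION = [
    [0.6, 0.3, 0.1],  # previous: play_video
    [0.2, 0.7, 0.1],  # previous: play_music
    [0.3, 0.3, 0.4],  # previous: check_weather
]

def predict_next_intent(previous_category):
    """Predict the next utterance's category distribution from only the
    previous utterance's category, as the matrix description implies."""
    row = TRANSITION[CATEGORIES.index(previous_category)]
    return dict(zip(CATEGORIES, row))

distribution = predict_next_intent("play_music")
```

In this sketch, after a "play_music" utterance the most likely next category is again "play_music", mirroring the idea that the matrix predicts the next intention from the previous one alone.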
Intelligent voice interaction technology is gradually becoming standard in smart home products such as smart televisions and smart refrigerators. In an intelligent voice interaction scenario, a user controls a smart television, smart refrigerator, and the like by voice to perform operations such as watching videos, listening to music, checking the weather, and controlling the television. In the related art, during intelligent voice interaction, voice data input by a user is recognized as text through a voice recognition module, and the text is then subjected to lexical, syntactic, and semantic analysis by a semantic analysis module so as to understand the voice instruction corresponding to the user's intention; the voice instruction is issued to the terminal device, which executes or displays it. However, when the voice data input by the user does not express a complete user intention, no corresponding voice instruction can be recognized from the voice data, so the input receives no response, which lowers the intelligence level of voice recognition and degrades the user experience.
To solve the above problem, embodiments of the present application provide a voice recognition method and a server: when no voice instruction corresponding to the voice data is recognized, alternative user intentions are determined by finding voice instructions related to a voice keyword in the voice data, so that the electronic device can display the alternative user intentions, thereby improving the intelligence level of voice recognition.
It can be understood that the above speech recognition method can be implemented by the server provided in the embodiments of the present application. The following describes the technical solution of the embodiment of the present application in detail by taking a server integrated or installed with relevant execution codes as an example. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 2 illustrates a flow diagram of a speech recognition method according to some embodiments, concerning the specific process of how speech recognition is performed. The execution subject of this embodiment is a server. As shown in fig. 2, the method includes:
s301, receiving voice data to be recognized sent by the electronic equipment.
In this embodiment of the application, when a user needs voice recognition, the user inputs voice data to the electronic device, and the electronic device sends the voice data to be recognized to the server so that the server performs voice recognition.
It should be understood that the embodiments of the present application do not limit the type of the electronic device; the electronic device may be, for example, a display device, a smart speaker, or the like.
S302, if the voice instruction corresponding to the voice data is not recognized, extracting the voice keywords in the voice data.
In this step, after the server receives the voice data to be recognized sent by the electronic device, it may attempt to recognize the voice data; if no voice instruction corresponding to the voice data is recognized, the voice keyword in the voice data is extracted.
It should be noted that the embodiments of the present application do not limit how the server recognizes and determines the user intention corresponding to the voice data. The user intention of the voice data to be recognized may be determined through Chinese word segmentation, part-of-speech tagging, and dependency parsing in sequence, so as to determine the corresponding voice instruction.
Chinese word segmentation is the core of speech recognition here. Since words are the smallest meaningful language units that can be used independently, word segmentation is the first step in natural language processing. Unlike English, where words are separated by spaces or punctuation marks, Chinese has no explicit word boundaries. In the related art, Chinese word segmentation can be performed based on rules, statistics, or understanding. In some embodiments, Chinese word segmentation is rule-based, segmenting against a lexicon with a forward maximum matching algorithm. For example, "单田芳的隋唐演义" (Shan Tianfang's Romance of the Sui and Tang Dynasties) can be segmented into "单田芳", "的", and "隋唐演义".
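The forward maximum matching segmentation just described can be sketched as follows. This is a minimal illustration only: the lexicon contents and the `max_len` window are invented assumptions, not the patent's actual word bank.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    lexicon entry that matches, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Toy lexicon; "语音识别" (speech recognition) is preferred over the
# shorter entries "语音" and "识别" because matching is longest-first.
lexicon = {"语音", "识别", "语音识别", "技术"}
segments = forward_max_match("语音识别技术", lexicon)
```

Because matching scans longest-first, "语音识别技术" is split into "语音识别" and "技术" rather than four shorter pieces.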
Part-of-speech tagging classifies words according to their grammatical features and is one of the preprocessing steps of natural language processing. In the related art, part-of-speech tagging is performed based on rules, statistics, or understanding. In some embodiments, part-of-speech tagging may use a lexicon- and rule-based tagging method. For example, after part-of-speech tagging, "单田芳的隋唐演义" may become: {单田芳 [actor]}, {的 [structural auxiliary function word]}, {隋唐演义 [title]}.
As for dependency parsing, in dependency grammar the syntactic structure essentially consists of relationships between words and word pairs. Each dependency relationship connects two words: a core word (head) and a modifier (dependent). One representative representation of the dependency parsing result is a dependency syntax tree, in which the labeled relationships include: subject-verb (SBV), verb-object (VOB), indirect object (IOB), fronted object (FOB), double object (DBL), attribute (ATT), adverbial (ADV), verb-complement (CMP), coordination (COO), preposition-object (POB), left adjunct (LAD), right adjunct (RAD), independent structure (IS), and head (HED).
In some embodiments, a dependency syntax tree corresponding to an intention may be generated in a statistics- and rule-based manner to describe the relationships between words and parse the grammatical elements in the recognized sentence. A one-to-many mapping, with corresponding weights, is established between user intentions and syntax patterns; the input word list is matched against the patterns line by line, the successfully matched entries are analyzed by weight, and the most suitable user intention is extracted as the result, so that the corresponding voice instruction can be determined based on the extracted user intention.
It should be noted that, in some embodiments, due to incomplete user expression or a failure of the voice capture device, the necessary keywords forming the user intention may be absent from the voice data, so that the voice data input by the user is incomplete. In this case, when the server recognizes the incomplete voice data, it may fail to understand the user intention and thus cannot determine the corresponding voice instruction; the keywords remaining in the incomplete voice data may then be extracted.
It should be understood that the incomplete voice data referred to in the embodiments of the present application may be understood as voice data that includes only an action to be performed but not an object to be performed, or only an object to be performed but not an action to be performed, and thus cannot be combined into a voice instruction. For example, the voice data input by the user is recognized as "open", which only contains the action to be executed but does not contain the object to be executed, so that the voice instruction corresponding to the voice data cannot be recognized, and the voice keyword in the voice data needs to be extracted.
It should be understood that the type of the speech keyword in the speech data is not limited in the embodiments of the present application, and in some embodiments, the speech keyword may be a verb, such as "open", "close", "pause", and the like. In other embodiments, the speech keyword may be a name of a device, such as "air conditioner", "bedroom light", "television", etc., or may also be a name of an application, such as "XXX game", "XX music", etc., or may also be a name of a channel, such as "center station", "movie channel", etc.
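The completeness check and keyword extraction described above can be sketched as follows. The action and object word lists are illustrative stand-ins for a real lexicon, and the function name is assumed for the example.

```python
# Hypothetical sketch: a voice instruction needs both an action to be
# performed (verb) and an object to be performed on; if only one is
# present, the remaining word is extracted as the voice keyword.

ACTIONS = {"open", "close", "pause"}
OBJECTS = {"air conditioner", "bedroom light", "television", "movie channel"}

def extract_keyword(words):
    """Return (instruction_complete, keyword_or_none) for recognized words."""
    actions = [w for w in words if w in ACTIONS]
    objects = [w for w in words if w in OBJECTS]
    if actions and objects:
        return True, None              # complete: no fallback needed
    leftover = actions or objects      # keywords remaining in the data
    return False, (leftover[0] if leftover else None)

print(extract_keyword(["open"]))                     # incomplete input
print(extract_keyword(["open", "air conditioner"]))  # complete instruction
```

The first call mirrors the "open" example above: the action is present but the object is missing, so "open" is extracted as the voice keyword for the later retrieval step.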
S303, determining a voice instruction related to the voice keyword according to the historical voice data.
In this step, after the server extracts the voice keyword from the voice data, the voice instruction related to the voice keyword may be determined according to the historical voice data.
It should be understood that the embodiment of the present application is not limited to how to determine the voice command related to the voice keyword according to the historical voice data, and in some embodiments, the server may retrieve the voice command related to the voice keyword from the historical voice data of all users in the database. In other embodiments, the server may determine the current user who input the speech data, and then retrieve the speech instruction associated with the speech keyword from the historical speech data of the current user in the database.
It should be noted that, in this embodiment of the present application, how to retrieve the voice instruction related to the voice keyword is not limited, and in some embodiments, if the voice keyword is a verb, the server may retrieve, from the historical voice data, the voice instruction with the voice keyword as a prefix through a prefix retrieval algorithm.
The prefix search algorithm retrieves, through an ES (Elasticsearch, a distributed full-text search framework) primary key, all voice data in the database beginning with the voice keyword. The database may be a database corresponding to the electronic device, storing the historical voice data sent by the electronic device to the server, and the historical voice data may be stored in the database classified by user.
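A minimal in-memory sketch of this retrieval step, assuming a plain dictionary of per-user histories in place of the database and `str.startswith` in place of the Elasticsearch prefix query (the user names and history entries are illustrative):

```python
# Hypothetical stand-in for the database of historical voice data,
# classified by user as described above.
HISTORY = {
    "user_1": ["open air conditioner", "open movie channel", "pause video"],
    "user_2": ["open bedroom lamp", "close television"],
}

def prefix_retrieve(keyword, user=None):
    """Return historical voice instructions beginning with the keyword.

    If user is None, search the historical data of all users; otherwise
    search only the history of the given current user.
    """
    users = [user] if user else list(HISTORY)
    return [inst for u in users for inst in HISTORY.get(u, [])
            if inst.startswith(keyword)]

print(prefix_retrieve("open"))             # matches from all users
print(prefix_retrieve("open", "user_2"))   # matches from the current user
```

The `user` parameter reflects the two retrieval modes above: searching the historical data of all users versus only that of the current user who input the voice data.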
In some embodiments, the server may further screen out voice instructions related to the voice keywords from the historical voice data according to the device connection information.
The device connection information includes information about the devices connected to the electronic device. For example, the device connection information may include the connection relationship between the electronic device and an air conditioner, the connection relationship between the electronic device and a bedroom lamp, and the like.
For example, if the device connection information includes a "bedroom lamp" connected to the electronic device, when the historical voice data determines a voice instruction related to the voice keyword, the screening may be performed to determine only a voice instruction related to the "bedroom lamp", for example, the voice instruction "turn on the bedroom lamp".
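The screening by device connection information can be sketched as follows; the connected-device set and candidate instructions are illustrative assumptions:

```python
# Hypothetical sketch: only candidate voice instructions whose target
# appears among the devices actually connected to the electronic device
# are kept, per the device connection information described above.

CONNECTED_DEVICES = {"bedroom lamp", "air conditioner"}

def screen_by_connection(candidates):
    """Keep only instructions that mention a connected device."""
    return [c for c in candidates
            if any(dev in c for dev in CONNECTED_DEVICES)]

candidates = [
    "turn on the bedroom lamp",
    "turn on the television",       # not connected, so dropped
    "turn on the air conditioner",
]
print(screen_by_connection(candidates))
```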
S304, at least one alternative user intention information corresponding to the voice instruction related to the voice keyword is sent to the electronic equipment.
In this step, after the server determines the voice instruction related to the voice keyword according to the historical voice data, at least one alternative user intention information corresponding to the voice instruction related to the voice keyword may be sent to the electronic device.
It should be understood that the embodiment of the present application does not limit how the user intention information corresponding to the voice instructions related to the voice keyword is sent to the electronic device. In some embodiments, the server may first determine the historical request times of the voice instructions related to the voice keyword; then sort those voice instructions in descending order of historical request times; next, determine, according to the sorted order, the alternative user intention information corresponding to a preset number of the voice instructions; and finally send the user intention information to the electronic device.
It should be understood that the above historical request times reflect the usage frequency of the voice instructions related to the voice keyword, so that the more frequently used voice instructions are displayed to the user, thereby improving the success rate of completing the voice keyword into a full instruction.
It should be noted that, in the embodiment of the present application, the voice commands related to the voice keywords may also be sorted respectively based on the determination manner of the voice commands related to the voice keywords. That is, the voice commands related to the voice keywords retrieved from the historical voice data of all users and the voice commands related to the voice keywords retrieved from the historical voice data of the current user may be sorted respectively.
Correspondingly, the preset number is not limited in the embodiment of the application, and the preset number can be specifically set according to actual conditions. In some embodiments, the corresponding preset number may also be set based on the determination manner of the voice instruction related to the voice keyword.
Illustratively, FIG. 3 illustrates a schematic diagram of determining user intent, according to some embodiments. As shown in fig. 3, for the voice keyword "open", based on the voice instructions retrieved from the historical voice data of all users, the user intention information corresponding to the three voice instructions with the highest historical request times, "open center one", "open XXX game", and "open history", may be selected. Based on the voice instructions retrieved from the historical voice data of the current user, the user intention information corresponding to the three voice instructions with the highest historical request times, "turn on air conditioner", "turn on movie channel", and "turn on bedroom lamp", is also selected. In this way, alternative user intention information can be inferred from both sides, and the six alternative user intention information items determined in the two manners can all be transmitted to the electronic device.
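The ranking step above can be sketched as follows. The request counts and instruction names are illustrative assumptions, with the preset number set to 3 for each source as in the example:

```python
# Hypothetical sketch: instructions retrieved from each source (all
# users vs. the current user) are sorted in descending order of their
# historical request counts, a preset number is taken from each sorted
# list, and the candidates from both sides are sent together.

def top_candidates(request_counts, preset_number=3):
    """Sort instructions by historical request count, keep the top N."""
    ranked = sorted(request_counts, key=request_counts.get, reverse=True)
    return ranked[:preset_number]

all_users = {"open center one": 50, "open XXX game": 40,
             "open history": 30, "open settings": 5}
current_user = {"turn on air conditioner": 12,
                "turn on movie channel": 9, "turn on bedroom lamp": 7}

# 3 + 3 = 6 alternative user intentions sent to the electronic device
candidates = top_candidates(all_users) + top_candidates(current_user)
print(candidates)
```

Keeping the two sorted lists separate, rather than merging their counts, matches the per-source sorting and per-source preset numbers described above.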
According to the voice recognition method provided by the embodiment of the application, the server first receives the voice data to be recognized sent by the electronic device. Second, if the voice instruction corresponding to the voice data is not recognized, the server extracts the voice keyword in the voice data. Third, the server determines the voice instructions related to the voice keyword according to historical voice data. Finally, the server sends at least one alternative user intention information corresponding to the voice instructions related to the voice keyword to the electronic device. Compared with the prior art, when the voice instruction corresponding to the voice data is not recognized, at least one alternative user intention information corresponding to the voice instructions related to the voice keyword is sent to the electronic device, so that the electronic device can display the alternative user intention information, thereby improving the intelligence level of voice recognition.
On the basis of the above embodiments, the following describes an interaction process between the electronic device and the server. Fig. 4 illustrates a signaling interaction diagram of a speech recognition method according to some embodiments, as shown in fig. 4, the method comprising:
S401, the electronic device receives voice data to be recognized input by a user, wherein necessary keywords forming the user intention are absent from the voice data;
S402, the electronic device sends the voice data to be recognized to the server.
And S403, if the voice command corresponding to the voice data is not recognized, the server extracts the voice keywords in the voice data.
S404, the server determines a voice instruction related to the voice keyword according to the historical voice data.
S405, the server sends at least one alternative user intention information corresponding to the voice instruction related to the voice keyword to the electronic device.
S406, the electronic device displays the at least one alternative user intention information on the display interface.
The technical terms, technical effects, technical features, and alternative embodiments of S401 to S406 can be understood with reference to S301 to S304 shown in fig. 2, and repeated descriptions are omitted here.
On the basis of the above embodiment, the following describes how the server determines the voice command associated with the voice keyword. FIG. 5 illustrates a flow diagram of another speech recognition method according to some embodiments, as shown in FIG. 5, the method comprising:
S501, receiving voice data to be recognized sent by the electronic equipment.
S502, if the voice instruction corresponding to the voice data is not recognized, extracting the voice keywords in the voice data.
S503, retrieving the voice instructions related to the voice keyword from the historical voice data of all users in the database.
S504, determining the current user who input the voice data.
S505, retrieving the voice instructions related to the voice keyword from the historical voice data of the current user in the database.
S506, determining the historical request times of the voice instructions related to the voice keyword.
S507, sorting the voice instructions related to the voice keyword in descending order of the historical request times.
S508, determining, according to the sorted order of the voice instructions related to the voice keyword, at least one alternative user intention information corresponding to a preset number of the voice instructions.
S509, sending the at least one alternative user intention information to the electronic device.
The technical terms, technical effects, technical features, and alternative embodiments of S501 to S509 can be understood with reference to S301 to S304 shown in fig. 2, and repeated descriptions are omitted here.
On the basis of the above embodiments, how the electronic device interacts with the server is described below. Fig. 6 schematically illustrates a flow chart of another speech recognition method according to some embodiments, where an execution subject of the present embodiment is an electronic device, and as shown in fig. 6, the method includes:
S601, receiving voice data input by a user, wherein necessary keywords forming the intention of the user are absent in the voice data.
S602, sending voice data to a server.
S603, receiving at least one alternative user intention information sent by the server, wherein the user intention information corresponds to a voice instruction related to the voice keyword in the voice data, and the voice instruction related to the voice keyword is determined from historical voice data.
And S604, displaying at least one alternative user intention information on the display interface.
In some embodiments, after the electronic device receives at least one alternative user intention information corresponding to the voice data sent by the server, the alternative user intention information may be further displayed on the electronic device to prompt the user.
It should be noted that, in this embodiment of the application, when the electronic device displays the alternative user intention information, the electronic device may enter a wake-up-free state and prompt the user, so as to receive the user intention information selected by the user in time and generate the corresponding voice instruction.
Fig. 7 illustrates an interface diagram of an electronic device, according to some embodiments. FIG. 8 illustrates an interface diagram of another electronic device, according to some embodiments. As shown in fig. 7, when the user inputs "turn on", the server cannot determine the voice instruction corresponding to the voice data. At this time, on the interface shown in fig. 7, a prompt message "what do you want to open" may be displayed, together with the related alternative user intention information "turn on air conditioner", "turn on movie channel", "turn on bedroom lamp", "turn on center one", "turn on XXX game", and "turn on history". When the user performs voice input again based on the displayed user intention information, the electronic device jumps to the interface shown in fig. 8, displays the voice data input again by the user, "turn on the movie channel/the second one", and the reply of the electronic device, "OK, the movie channel has been turned on for you", and executes the corresponding voice instruction.
FIG. 9 illustrates a structural diagram of a server according to some embodiments. The server may be implemented by software, hardware or a combination of both to perform the speech recognition method on the server side in the above embodiments. As shown in fig. 9, the server 700 includes: a memory 701 and a processor 702.
The memory 701 is used to store executable programs, and the processor 702 is configured to:
receiving voice data to be recognized sent by electronic equipment;
if the voice instruction corresponding to the voice data is not recognized, extracting the voice keywords in the voice data;
determining a voice instruction related to the voice keyword according to historical voice data;
and sending at least one alternative user intention information corresponding to the voice instruction related to the voice keyword to the electronic equipment.
In some embodiments of the present application, the processor 702 is specifically configured to:
and retrieving voice instructions related to the voice keywords from historical voice data of all users in the database.
In some embodiments of the present application, the processor 702 is specifically configured to:
determining a current user inputting voice data;
and retrieving voice instructions related to the voice keywords from historical voice data of the current user in the database.
In some embodiments of the present application, if the speech keyword is a verb, the processor 702 is specifically configured to:
and searching the voice instruction with the voice keyword as the prefix from the historical voice data through a prefix searching algorithm.
In some embodiments of the present application, the processor 702 is specifically configured to:
and screening out voice instructions related to the voice keywords from the historical voice data according to the equipment connection information.
In some embodiments of the present application, the processor 702 is specifically configured to:
determining the historical request times of voice instructions related to the voice keywords;
sorting the voice instructions related to the voice keywords in descending order of the historical request times;
determining at least one alternative user intention information corresponding to the voice instructions related to the voice keywords in a preset number according to the sequence of the voice instructions related to the voice keywords;
at least one alternative user intent information is sent to the electronic device.
The server provided in the embodiment of the present application may perform the voice recognition action on the server side in the above method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
FIG. 10 illustrates a schematic structural diagram of an electronic device, according to some embodiments. The electronic device may be implemented by software, hardware or a combination of both to perform the speech recognition method on the electronic device side in the above embodiments. As shown in fig. 10, the electronic device 800 includes: a communicator 801 and a controller 802.
A communicator 801 for performing data communication with a server, receiving data transmitted by the server, and transmitting data to the server;
the controller 802 is connected with the communicator 801, the controller 802 is configured to:
receiving voice data input by a user, wherein necessary keywords forming the intention of the user are absent from the voice data;
sending voice data to a server;
and receiving at least one alternative user intention information sent by the server, wherein the user intention information corresponds to a voice instruction related to the voice keyword in the voice data, and the voice instruction related to the voice keyword is determined from the historical voice data.
In some embodiments of the present application, the controller 802 is further configured to:
at least one alternative user intent information is displayed on the display interface.
The electronic device provided in the embodiment of the present application may perform the voice recognition action on the electronic device side in the above method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
Embodiments of the present application also provide a program, which when executed by a processor, is configured to perform the speech recognition method provided by the above method embodiments.
Embodiments of the present application further provide a program product, such as a computer-readable storage medium, having instructions stored therein, which when run on a computer, cause the computer to perform the speech recognition method provided by the above method embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the invention are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A speech recognition method, comprising:
receiving voice data to be recognized sent by electronic equipment;
if the voice instruction corresponding to the voice data is not recognized, extracting a voice keyword in the voice data;
determining a voice instruction related to the voice keyword according to historical voice data;
and sending at least one alternative user intention information corresponding to the voice instruction related to the voice keyword to the electronic equipment.
2. The method of claim 1, wherein the determining the voice instruction related to the voice keyword according to historical voice data comprises:
and retrieving the voice instruction related to the voice keyword from historical voice data of all users in the database.
3. The method of claim 1, wherein the determining the voice instruction related to the voice keyword according to historical voice data further comprises:
determining a current user inputting the voice data;
and retrieving the voice instruction related to the voice keyword from the historical voice data of the current user in a database.
4. The method according to claim 2 or 3, wherein if the voice keyword is a verb, the retrieving the voice instruction related to the voice keyword comprises:
and searching the voice instruction with the voice keyword as a prefix from the historical voice data through a prefix searching algorithm.
5. The method of claim 1, wherein the determining the voice instruction related to the voice keyword according to historical voice data comprises:
and screening out the voice instruction related to the voice keyword from the historical voice data according to the equipment connection information.
6. The method of claim 1, wherein the sending the at least one alternative user intention information corresponding to the voice instruction related to the voice keyword to the electronic device comprises:
determining the historical request times of the voice instruction related to the voice keyword;
sorting the voice instructions related to the voice keywords in descending order of the historical request times;
determining at least one alternative user intention information corresponding to the voice instructions related to the voice keywords in a preset number according to the sequence of the voice instructions related to the voice keywords;
and sending the at least one alternative user intention information to the electronic equipment.
7. A speech recognition method, comprising:
receiving voice data input by a user, wherein necessary keywords forming the user intention are absent from the voice data;
sending the voice data to a server;
receiving at least one alternative user intention information sent by the server, wherein the user intention information corresponds to a voice instruction related to a voice keyword in the voice data, and the voice instruction related to the voice keyword is determined from historical voice data.
8. The method of claim 7, wherein after said receiving at least one alternative user intent information sent by the server, the method further comprises:
displaying the at least one alternative user intent information on a display interface.
9. A server, comprising:
a memory and a processor;
the memory for storing an executable program, the processor configured to:
receiving voice data to be recognized sent by electronic equipment;
if the voice instruction corresponding to the voice data is not recognized, extracting a voice keyword in the voice data;
determining a voice instruction related to the voice keyword according to historical voice data;
and sending at least one alternative user intention information corresponding to the voice instruction related to the voice keyword to the electronic equipment.
10. An electronic device, comprising:
the communicator is used for carrying out data communication with the server, receiving data sent by the server and sending the data to the server;
a controller connected with the communicator, the controller configured to:
receiving voice data input by a user, wherein necessary keywords forming the user intention are absent from the voice data;
sending the voice data to a server;
receiving at least one alternative user intention information sent by the server, wherein the user intention information corresponds to a voice instruction related to a voice keyword in the voice data, and the voice instruction related to the voice keyword is determined from historical voice data.
CN202111553176.6A 2021-12-17 2021-12-17 Voice recognition method, server and electronic equipment Pending CN114155855A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111553176.6A CN114155855A (en) 2021-12-17 2021-12-17 Voice recognition method, server and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111553176.6A CN114155855A (en) 2021-12-17 2021-12-17 Voice recognition method, server and electronic equipment

Publications (1)

Publication Number Publication Date
CN114155855A true CN114155855A (en) 2022-03-08

Family

ID=80451423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111553176.6A Pending CN114155855A (en) 2021-12-17 2021-12-17 Voice recognition method, server and electronic equipment

Country Status (1)

Country Link
CN (1) CN114155855A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115047778A (en) * 2022-06-20 2022-09-13 青岛海尔科技有限公司 Control method and device for intelligent equipment, storage medium and electronic device
WO2024051611A1 (en) * 2022-09-05 2024-03-14 华为技术有限公司 Human-machine interaction method and related apparatus
CN116303697A (en) * 2023-05-18 2023-06-23 深圳鹏锐信息技术股份有限公司 Model display system based on artificial intelligence
CN116303697B (en) * 2023-05-18 2023-08-08 深圳鹏锐信息技术股份有限公司 Model display system based on artificial intelligence

Similar Documents

Publication Publication Date Title
US11854570B2 (en) Electronic device providing response to voice input, and method and computer readable medium thereof
CN114155855A (en) Voice recognition method, server and electronic equipment
KR20190120353A (en) Speech recognition methods, devices, devices, and storage media
CN109979465B (en) Electronic device, server and control method thereof
CN111880645A (en) Server for determining and controlling target device based on voice input of user and operating method thereof
JP7130194B2 (en) USER INTENTION RECOGNITION METHOD, APPARATUS, ELECTRONIC DEVICE, COMPUTER-READABLE STORAGE MEDIUM AND COMPUTER PROGRAM
US9037471B2 (en) Image processing apparatus and control method thereof and image processing system
JP2014132756A (en) Display apparatus and control method therefor
CN103984772A (en) Method and device for generating text retrieval subtitle library and video retrieval method and device
JP2014002737A (en) Server and control method of server
CN103533391A (en) Two-way interaction digital television box system with acoustic control type interaction and implementation method
CN112182196A (en) Service equipment applied to multi-turn conversation and multi-turn conversation method
CN106454463B (en) Television-based control method and device
JP7071514B2 (en) Audio information processing methods, devices, storage media and electronic devices
CN105632500A (en) Voice recognition apparatus and method of controlling the same
CN111552794B (en) Prompt generation method, device, equipment and storage medium
CN111159467B (en) Method and equipment for processing information interaction
CN114627864A (en) Display device and voice interaction method
CN112242140A (en) Intelligent device control method and device, electronic device and storage medium
CN112256232A (en) Display device and natural language generation post-processing method
US20240143349A1 (en) Generating compound action links in a multi-modal networked environment
US20230119195A1 (en) Display apparatus
MX2015003890A (en) Image processing apparatus and control method thereof and image processing system.
CN117809633A (en) Display device and intention recognition method
CN113763955A (en) Cross-screen voice interaction implementation method based on NLP natural language processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination