CN108932945B - Voice instruction processing method and device - Google Patents


Info

Publication number
CN108932945B
Authority
CN
China
Prior art keywords
intention, voice instruction, voice, text information, error sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810233853.8A
Other languages
Chinese (zh)
Other versions
CN108932945A (en)
Inventor
钱希
杨琛
Current Assignee
Beijing Orion Star Technology Co Ltd
Original Assignee
Beijing Orion Star Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Orion Star Technology Co Ltd filed Critical Beijing Orion Star Technology Co Ltd
Priority to CN201810233853.8A
Publication of CN108932945A
Application granted
Publication of CN108932945B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method and a device for processing a voice instruction, wherein the method comprises the following steps: receiving, from a terminal, a voice instruction containing a user's original intention; performing voice recognition on the voice instruction to generate text information of the voice instruction; parsing the text information and determining an analysis intention corresponding to the text information; retrieving, according to the analysis intention, the resources required for executing the voice instruction and sending them to the terminal; and, upon determining that the original intention is not satisfied, labeling the voice instruction, the text information and the analysis intention as an error sample and storing it in an error sample library. Because the method decides whether an error sample has been obtained from the user's reaction during human-computer interaction, the user does not need to perform manual error labeling, which increases the likelihood of obtaining error samples.

Description

Voice instruction processing method and device
Technical Field
The present application relates to the field of voice-interaction intelligent devices, and in particular to a method and an apparatus for processing a voice instruction.
Background
With the development of artificial intelligence technology, a wide variety of intelligent devices have appeared on the market; common examples include smart phones, smart speakers, smart televisions and smart robots. To enhance the user experience, many intelligent devices provide voice input and voice output functions. The voice interaction systems of these devices determine the user's intention from voice instructions input by the user in order to provide various services.
In a common voice interaction system, the processing of an instruction input by the user is generally divided into three stages. First, an automatic speech recognition (ASR) system converts the voice instruction input by the user into text; then, a natural language processing (NLP) system parses the intention represented by the text; finally, the task implied by that intention is executed by requesting various resources.
Both speech recognition systems and natural language processing systems require large amounts of labeled data for training. Even after an error sample detection model goes online, user inputs must continually be labeled to improve its accuracy. In the prior art, error samples are mostly labeled by users themselves through an active human-computer interaction process. Because voice instructions input by users have no standard form, the same original intention can give rise to many different voice instructions, so collecting a large amount of labeled data, especially misrecognized data, can greatly improve the performance of an error sample detection model. However, in the prior art a user must switch to another interactive system to actively label an error sample, and most users abandon the labeling task because the operation is cumbersome. As a result, it is difficult to actually obtain data actively submitted by users, and a large amount of manpower and material resources must instead be spent collecting error samples for the error sample library; consequently, the error sample detection models trained in existing voice interaction systems cannot rapidly improve the user experience.
Disclosure of Invention
In order to solve the above problems in the prior art, embodiments of the present application provide a method and an apparatus for processing a voice instruction, an intelligent device, and a computer-readable storage medium, which automatically obtain labeled error samples from human-computer interaction and thereby quickly improve the user experience of a voice interaction system.
One aspect of the embodiments of the present application provides a method for processing a voice instruction, where the method comprises:
receiving, from a terminal, a voice instruction containing a user's original intention;
performing voice recognition on the voice instruction to generate text information of the voice instruction;
parsing the text information and determining an analysis intention corresponding to the text information;
retrieving resources required for executing the voice instruction according to the analysis intention, and sending the resources to the terminal;
and, upon determining that the original intention is not satisfied, labeling the voice instruction, the text information and the analysis intention as an error sample and storing it in an error sample library.
Optionally, the determining that the original intent is not satisfied comprises:
repeatedly receiving, within a preset time period, voice instructions from the same user with the same analysis intention; or
receiving, from the terminal, information labeling the voice instruction, the text information and the analysis intention as an error sample.
Optionally, whether voice instructions from the same user with the same analysis intention are repeatedly received within a preset time period is determined by means of a decision tree.
Optionally, the method further comprises:
saving a record of the resource retrieval corresponding to the analysis intention;
and, if it is determined based on the retrieval record and according to a preset rule that the analysis intention was not satisfied because of the resource retrieval, removing the error sample from the error sample library.
Optionally, the method further comprises:
saving a record of resource retrieval corresponding to the analysis intention;
calculating the degree of matching between the resource retrieval and the analysis intention, and storing the calculated matching value in the retrieval record;
and, if the matching value is smaller than a preset threshold, removing the error sample from the error sample library.
Another aspect of the embodiments of the present application further provides a method for processing a voice instruction, where the method includes:
collecting a voice instruction containing a user's original intention and sending the voice instruction to a server;
acquiring, from the server, the resources required for executing the voice instruction;
and, upon determining that the original intention is not satisfied, sending to the server information that labels, as an error sample, the voice instruction, the text information obtained by performing voice recognition on the voice instruction, and the analysis intention obtained by parsing the text information.
Optionally, after acquiring the resources required for executing the voice instruction, the method further includes:
providing information of an execution action corresponding to the voice instruction based on the acquired resource;
the determining that the original intent is not satisfied comprises:
capturing an indication to abandon execution of an execution action corresponding to the voice instruction.
Optionally, after acquiring resources required for executing the voice instruction, the method further includes:
executing an execution action corresponding to the voice instruction based on the acquired resource;
the determining that the original intent is not satisfied comprises:
capturing an indication that the execution action corresponding to the voice instruction is terminated within a predetermined time threshold.
Another aspect of the embodiments of the present application further provides a device for processing a voice instruction, where the device includes: the system comprises a receiving module, a voice recognition module, an analysis module, a resource retrieval module, a first error sample detection module and an error sample library; wherein the receiving module is configured to receive a voice instruction containing the original intention of the user from the terminal; the voice recognition module is configured to perform voice recognition on the voice instruction, and generate text information of the voice instruction; the analysis module is configured to analyze the text information and determine an analysis intention corresponding to the text information; the resource retrieval module is configured to retrieve resources required for executing the voice instruction according to the analysis intention and send the resources to the terminal; the first error sample detection module is configured to determine that the original intention is not satisfied, label the voice instruction, the text information and the analysis intention as error samples, and store the error samples in an error sample library; the error sample repository is configured to store the error samples.
Optionally, the first error sample detection module determining that the original intent is not satisfied comprises:
repeatedly receiving, within a preset time period, voice instructions from the same user with the same analysis intention; or
receiving, from the terminal, information labeling the voice instruction, the text information and the analysis intention as an error sample.
Optionally, the first error sample detection module is configured to determine, by means of a decision tree, whether voice instructions from the same user with the same analysis intention are repeatedly received within a preset time period.
Optionally, the first error sample detection module is further configured to:
saving a record of the resource retrieval corresponding to the analysis intention;
and, if it is determined based on the retrieval record and according to a preset rule that the analysis intention was not satisfied because of the resource retrieval, removing the error sample from the error sample library.
Optionally, the first error sample detection module is further configured to:
saving a record of resource retrieval corresponding to the analysis intention;
calculating the degree of matching between the resource retrieval and the analysis intention, and storing the calculated matching value in the retrieval record;
and, if the matching value is smaller than a preset threshold, removing the error sample from the error sample library.
Another aspect of the embodiments of the present application further provides a device for processing a voice instruction, where the device comprises: an acquisition module, an execution module and a second error sample detection module. The acquisition module is configured to collect a voice instruction containing a user's original intention and send the voice instruction to a server; the execution module is configured to acquire, from the server, the resources required for executing the voice instruction; and the second error sample detection module is configured to, upon determining that the original intention is not satisfied, send to the server information that labels, as an error sample, the voice instruction, the text information obtained by performing voice recognition on the voice instruction, and the analysis intention obtained by parsing the text information.
Optionally, the execution module is further configured to, after acquiring the resources required for executing the voice instruction, provide information on the execution action corresponding to the voice instruction based on the acquired resources;
the determining that the original intent is not satisfied comprises:
capturing an indication to abandon execution of an execution action corresponding to the voice instruction.
Optionally, the execution module is further configured to, after obtaining resources required for executing the voice instruction, execute an execution action corresponding to the voice instruction based on the obtained resources;
the determining that the original intent is not satisfied comprises:
capturing an indication that the execution action corresponding to the voice instruction is terminated within a predetermined time threshold.
In another aspect, an embodiment of the present application further provides an intelligent device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor, when executing the instructions, implements the voice instruction processing method described above.
In another aspect, embodiments of the present application further provide a computer-readable storage medium on which computer instructions are stored, the instructions, when executed by a processor, implementing the voice instruction processing method described above.
The voice instruction processing method and device of the present application can automatically acquire misrecognized data from human-computer interaction, label it as error samples and store them in an error sample library. This greatly reduces the labor cost of labeling error samples, significantly improves the optimization efficiency of the error sample detection model, and effectively improves the user experience of the voice interaction system.
Drawings
FIG. 1 is a flowchart illustrating a method for processing a voice instruction at a server according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for processing a voice instruction at a server according to another embodiment of the present application;
FIG. 3 is a schematic diagram of a decision tree according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating a method for processing a voice instruction at a server according to another embodiment of the present application;
FIG. 5 is a flowchart illustrating a method for processing a voice instruction at a server according to another embodiment of the present application;
FIG. 6 is a flowchart illustrating a method for processing a voice instruction at a client according to an embodiment of the present application;
FIG. 7 is a flowchart illustrating a method for processing a voice instruction at a client according to another embodiment of the present application;
FIG. 8 is a flowchart illustrating a method for processing a voice instruction at a client according to another embodiment of the present application;
FIG. 9 is a schematic structural diagram of a device for processing a voice instruction at a server according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a device for processing a voice instruction according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a smart device according to an embodiment of the present application.
Detailed Description
While the present application admits of many embodiments and details, it should be understood that the application is not limited to the particular embodiments disclosed, and that many modifications and variations are possible, as will be apparent to those of ordinary skill in the art, without departing from the spirit of the application.
In the present application, the terms "first", "second", "third", "fourth", and the like are used only to distinguish one element from another; they do not indicate importance, order, or mutual dependence.
In the present application, a method, an apparatus, an intelligent device and a storage medium for processing a voice instruction are provided, and detailed descriptions are individually provided in the following embodiments.
In an embodiment of the present application, a method for processing a voice instruction at a server is disclosed, and with reference to fig. 1, the method includes:
step 101: receiving, from a terminal, a voice instruction containing a user's original intention;
step 102: performing voice recognition on the voice instruction to generate text information of the voice instruction;
step 103: parsing the text information and determining an analysis intention corresponding to the text information;
step 104: retrieving resources required for executing the voice instruction according to the analysis intention, and sending the resources to the terminal;
step 105: upon determining that the original intention is not satisfied, labeling the voice instruction, the text information and the analysis intention as an error sample and storing it in an error sample library.
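The flow of steps 101 to 105 can be sketched as follows. This is a minimal illustration, not code from the patent: every function and class name here is hypothetical, and the recognition, parsing and retrieval components are stand-in stubs.

```python
# Hypothetical sketch of the server-side flow of steps 101-105.
class ErrorSampleStore:
    """Stands in for the error sample library."""
    def __init__(self):
        self.samples = []

    def save(self, audio, text, intent):
        self.samples.append({"audio": audio, "text": text, "intent": intent})


def handle_voice_instruction(audio, recognize, parse, retrieve,
                             send_to_terminal, intent_satisfied, store):
    text = recognize(audio)        # step 102: voice recognition -> text information
    intent = parse(text)           # step 103: parse the text -> analysis intention
    resources = retrieve(intent)   # step 104: retrieve resources per the intention
    send_to_terminal(resources)    # step 104: send them to the terminal
    if not intent_satisfied():     # step 105: e.g. the same instruction was repeated
        store.save(audio, text, intent)   # label and keep as an error sample
    return resources


store = ErrorSampleStore()
handle_voice_instruction(
    audio=b"<pcm bytes>",
    recognize=lambda a: "mom one trash can Pizza",     # misrecognized text
    parse=lambda t: "find a garbage removal company",  # hence a wrong analysis intention
    retrieve=lambda i: ["Acme Garbage Removal Co."],
    send_to_terminal=lambda r: None,
    intent_satisfied=lambda: False,                    # the user repeats the instruction
    store=store,
)
print(len(store.samples))  # 1
```

The key point is that the error sample is recorded as a side effect of the normal request flow, without any explicit labeling action by the user.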
Because this method decides whether an error sample has been obtained from the user's reaction during human-computer interaction, the user does not need to perform manual error labeling, which increases the likelihood of obtaining error samples.
In one embodiment according to the present application, the determining in step 105 that the original intent is not satisfied comprises:
repeatedly receiving, within a preset time period, voice instructions from the same user with the same analysis intention; or
receiving, from the terminal, information labeling the voice instruction, the text information and the analysis intention as an error sample.
Taking the first case as an example, suppose a user wants to order a pizza named "trash can" from a well-known pizza shop and inputs the voice instruction "take out one trash can Pizza" through a mobile phone. After receiving the voice instruction, the server makes an error during voice recognition and converts it into the text information "mom one trash can Pizza"; parsing this erroneous text yields an erroneous intention, and the server searches for a matching garbage removal service company. Since the parsed intention does not match the user's intention and cannot satisfy the request, the user tries inputting the voice instruction "take out a trash can Pizza" again. This time the server correctly converts the voice instruction into the text information "take out a trash can Pizza", parses the correct intention from its content, retrieves a list of restaurants selling pizza, and provides map routes to the shops. The user finds the address and telephone number of the well-known pizza shop in the provided restaurant list and successfully orders the pizza named "trash can". In this process, the misrecognized voice instruction "take out one trash can Pizza", the text information "mom one trash can Pizza" and the analysis intention "search for a matching garbage removal service company" are labeled as an error sample and stored in the error sample library, to be used to improve the accuracy of the error sample detection model and reduce the likelihood of subsequent misrecognition or misparsing.
This embodiment provides a way for the server to determine that the original intention is not satisfied: by detecting whether the same user repeatedly uploads, within a short time, voice instructions expressing the same analysis intention, the server decides whether to collect an error sample. This omits the tedious step of manual error-labeling feedback by the user and thus greatly increases the likelihood of obtaining error samples.
In the second case, if the server receives information from the terminal labeling the voice instruction, the text information and the analysis intention as an error sample, it determines that the user's original intention is not satisfied, labels them as an error sample and stores it in the error sample library.
In another embodiment according to the present application, as shown in fig. 2, where steps 201 to 204 are identical to steps 101 to 104 in the method shown in fig. 1, a learnable detection model is constructed in step 205 in a decision tree manner for detecting whether a voice command with the same parsing intention of the same user is repeatedly received within a preset time period, so as to determine whether the original intention is satisfied, and in case that the original intention is not satisfied, the voice command, the text information and the parsing intention are labeled as an error sample and saved to an error sample library.
In machine learning, a decision tree is a predictive model representing a mapping between object attributes and object values. Each internal node of the decision tree represents a judgment condition on an object attribute, its branches represent the objects satisfying that condition, and its leaf nodes represent the predicted class of the object.
Fig. 3 is a schematic structural diagram of a decision tree in one embodiment, used to detect whether voice instructions from the same user with the same analysis intention are repeatedly received within a preset time period, and thereby to determine whether to label the current voice instruction, text information and analysis intention as an error sample and store it in the error sample library.
There are three attributes on which decisions are based in this decision tree:
first, whether instructions with the same intention have been received repeatedly;
second, whether the instructions with the same intention come from the same user;
third, whether the instruction has been repeated more than 2 times within 2 minutes.
Each node in the decision tree represents a judgment condition on an object attribute, and its branches represent objects meeting that condition. For example, suppose the server repeatedly receives voice instructions with the same intention from the same user, 3 times within 2 minutes. The root node judges that the repeated-reception condition is met, taking the right branch (YES); the next node judges that the instructions come from the same user, again taking the right branch (YES); the final node judges whether the instruction was repeated more than 2 times within 2 minutes, which is also met, taking the right branch (YES). The current situation therefore falls on the leaf node "original intention not satisfied", and the current voice instruction, text information and analysis intention are labeled as an error sample and stored in the error sample library.
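The decision walk above can be sketched in a few lines. This is an illustrative reconstruction, not code from the patent; the 2-minute window and the repeat count follow the example, and the data shape is hypothetical.

```python
# Hypothetical sketch of the three-node decision tree of Fig. 3.
from dataclasses import dataclass

@dataclass
class Interaction:
    user_id: str
    intent: str
    timestamp: float  # seconds

def is_error_sample(history, current, window_s=120.0, min_prior=2):
    """Walk the tree's three judgment conditions for the current voice instruction."""
    recent = [h for h in history if 0 <= current.timestamp - h.timestamp <= window_s]
    same_intent = [h for h in recent if h.intent == current.intent]
    if not same_intent:       # node 1: instruction with the same intention repeated? NO
        return False
    same_user = [h for h in same_intent if h.user_id == current.user_id]
    if not same_user:         # node 2: from the same user? NO
        return False
    # node 3: counting the current one, more than 2 receptions within 2 minutes?
    return len(same_user) >= min_prior

history = [Interaction("u1", "order_pizza", 0.0),
           Interaction("u1", "order_pizza", 30.0)]
print(is_error_sample(history, Interaction("u1", "order_pizza", 60.0)))  # True
print(is_error_sample(history, Interaction("u2", "order_pizza", 60.0)))  # False
```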
The decision tree can be constructed using the ID3 algorithm (Iterative Dichotomiser 3), the C4.5 algorithm, the CART algorithm, or the like.
The ID3 algorithm is a greedy algorithm for constructing decision trees. Originating from the Concept Learning System, it uses the rate of decrease of information entropy as the criterion for selecting test attributes: at each node it selects the as-yet-unused attribute with the highest information gain as the splitting criterion, and repeats this process until the generated decision tree classifies the training samples perfectly.
Like ID3, the C4.5 algorithm is a greedy algorithm; the difference lies in the basis of the splitting decision. Making splitting decisions by information gain is biased toward attributes with many values, whereas C4.5 decides by the information gain ratio. Thus C4.5 is identical to ID3 in structure and recursion, except that when selecting a decision feature it chooses the one with the largest information gain ratio.
The CART algorithm, also called the classification and regression tree algorithm, produces a binary tree; this dichotomy simplifies the scale of the decision tree and improves the efficiency of generating it.
The greatest advantage of decision tree algorithms is that the model can learn by itself: as long as the training examples are well labeled, an effective error sample detection model can be trained.
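As a sketch of the information-gain criterion that ID3 uses to select splitting attributes (illustrative only; the toy attributes and labels below are hypothetical, loosely echoing the decision tree above):

```python
# Information gain = entropy of the labels minus the weighted entropy after splitting.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain of splitting the labeled rows on attribute `attr`."""
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

rows = [{"same_user": 1, "repeated": 1},
        {"same_user": 1, "repeated": 0},
        {"same_user": 0, "repeated": 1},
        {"same_user": 0, "repeated": 0}]
labels = ["error", "ok", "ok", "ok"]
print(round(information_gain(rows, labels, "same_user"), 3))  # 0.311
```

ID3 would compute this gain for every unused attribute at each node and split on the largest one.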
When a decision tree grows deep, it easily overfits, producing a hypothesis that fits the training data too strictly. Avoiding overfitting is one of the core tasks of classifier design. Besides pruning methods that limit the depth of the decision tree, overfitting can be prevented by constructing a large number of decision trees to form a random forest, which also compensates for the weak generalization ability of a single decision tree. In other words, a single decision tree may overfit, but increasing the breadth can eliminate the overfitting phenomenon. Random forests also handle high-dimensional data well, and training completes quickly even with many features. In addition, a random forest can detect interactions among features during training, which helps predict whether a sample is an error sample.
Therefore, the learnable detection model can be optimized as a random forest in order to filter the error samples.
A typical method of constructing a random forest comprises:
- randomly drawing n samples from the sample set with replacement;
- randomly selecting K attributes from all attributes and choosing the best splitting attribute as a node to build a decision tree;
- repeating the above two steps m times to build m decision trees;
- the m decision trees form a random forest, which votes on the result to determine which class the data belongs to. The voting mechanism may be one-vote veto, minority-obeys-majority, weighted majority, or the like.
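The steps above can be sketched with the standard library as follows. All helper names are hypothetical, and the "trees" are one-line stubs standing in for trained decision trees; their thresholds are loosely modeled on the tester/general-user example given later and are not from the patent.

```python
# Stdlib-only sketch of random-forest construction and minority-obeys-majority voting.
import random
from collections import Counter

def bootstrap(samples, n, rng):
    """Step 1: draw n samples from the sample set with replacement."""
    return [rng.choice(samples) for _ in range(n)]

def pick_attributes(attrs, k, rng):
    """Step 2: randomly select K attributes to consider for one tree."""
    return rng.sample(attrs, k)

def forest_predict(trees, x):
    """Steps 3-4: each of the m trees votes; the most common class wins."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Three stub "trees", one per feature, standing in for trained decision trees.
trees = [
    lambda u: "tester" if u["hours_per_day"] >= 5 else "normal",
    lambda u: "tester" if u["uploads_per_day"] >= 100 else "normal",
    lambda u: "tester" if u["errors_per_day"] >= 50 else "normal",
]
user = {"hours_per_day": 7, "uploads_per_day": 100, "errors_per_day": 50}
print(forest_predict(trees, user))  # tester
```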
Through decision tree or random forest techniques, and by modeling the habitual daily-use patterns of users, error samples generated by certain special users, such as testers, can be excluded from the error sample library.
The role of the random forest technique in this application is illustrated below with a specific application, in which:
the user categories are: testers and general users;
each decision tree in the random forest classifies on one feature, so with 3 features in total there are 3 corresponding decision trees, each of which is a classification and regression tree.
The parameters of the first decision tree, which classifies on the feature "average duration of daily use of voice instructions", are shown in Table 1:
(Table 1 appears only as an image in the original document and is not reproduced here.)
The parameters of the second decision tree, which classifies on the feature "average number of voice instructions uploaded daily", are shown in Table 2:
Voice instructions uploaded per day    Testers    General users
≥ 500                                  75%        1%
≥ 100                                  85%        8%
≤ 50                                   15%        75%
≤ 10                                   1%         35%
The parameters of the third decision tree, classified for the feature "average number of errors per day" are shown in table 1:
Labeling errors per day    Testers    General users
> 100                      80%        2%
≥ 50                       92%        15%
≤ 20                       30%        55%
≤ 10                       1%         30%
Based on the classification results of the three decision trees, a classification distribution can be established from a given user's information:
Feature                                                Value    Testers    General users
Average duration of daily use of voice instructions    7        95%        5%
Average number of voice instructions uploaded daily    100      85%        8%
Average number of labeling errors per day              50       92%        15%
Finally, it is concluded that the user is a tester with a probability of about 91% and a general user with a probability of about 9%, so the user is determined to be a tester, and the error samples generated by this user's operations are removed from the error sample library.
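The 91%/9% conclusion can be reproduced by simply averaging the three trees' per-class probabilities. The patent does not state the exact aggregation rule; plain averaging is one voting scheme that matches the stated figures and is used here only as an illustration.

```python
# Averaging the per-tree probabilities from the table above (hypothetical aggregation).
tester = [0.95, 0.85, 0.92]   # P(tester) from each of the three decision trees
normal = [0.05, 0.08, 0.15]   # P(general user) from each tree
p_tester = sum(tester) / len(tester)
p_normal = sum(normal) / len(normal)
print(round(p_tester * 100))  # 91
print(round(p_normal * 100))  # 9
```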
In another embodiment according to the present application, the method comprises:
step 401: receiving a voice instruction containing the original intention of a user from a terminal;
step 402: carrying out voice recognition on the voice instruction to generate text information of the voice instruction;
step 403: analyzing the text information, and determining an analysis intention corresponding to the text information;
step 404: searching resources required by executing the voice command according to the analysis intention, sending the resources to the terminal, and storing a record of resource searching corresponding to the analysis intention;
step 405: determining that the original intention is not met, marking the voice instruction, the text information and the analysis intention as error samples, and storing the error samples in an error sample library;
step 406: and if the reason that the analysis intention is not met is determined to be caused by resource retrieval according to a preset rule based on the retrieval record, removing the error sample from the error sample library.
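The flow of steps 401–406 can be sketched as a single server-side function. This is a minimal illustration, not the patent's implementation; every callable passed in is a hypothetical placeholder for the corresponding component:

```python
def handle_voice_instruction(audio, recognize, parse, retrieve,
                             intent_satisfied, retrieval_failed, error_samples):
    """Sketch of steps 401-406; all callables are hypothetical placeholders."""
    text = recognize(audio)               # step 402: speech -> text information
    intent = parse(text)                  # step 403: text -> analysis intention
    resources, record = retrieve(intent)  # step 404: retrieval + saved record
    if not intent_satisfied(intent, resources):          # step 405
        sample = (audio, text, intent, record)
        error_samples.append(sample)
        # step 406: remove the sample again when the retrieval record shows
        # the failure was caused by resource retrieval, not recognition/parsing
        if retrieval_failed(record):
            error_samples.remove(sample)
    return resources
```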
Cases in which the analysis intention is not satisfied due to resource retrieval include:
a resource retrieval result cannot be obtained because of a network connection error;
a wrong retrieval method leads to a wrong resource retrieval result; or
limitations of the search library prevent the required resource retrieval result from being obtained.
Because a failure to satisfy the user's intention can arise from a speech recognition or semantic parsing error, but can also be caused by a resource retrieval failure or error, the record of resource retrieval is saved together with the error sample. Interference samples, in which speech recognition and semantic parsing were correct and only a retrieval failure or error left the user's intention unsatisfied, can then be removed from the error sample library through manual review or similar means, improving the accuracy of the error sample detection model.
In a specific application, the server receives a voice instruction "play movie AABBCC" input by a user and obtains its text information "play movie AABBCC" through speech recognition. After the text information is parsed and resource retrieval is performed, no movie resource named "AABBCC" is found. The user then repeatedly inputs the same voice instruction, but because no matching movie resource can be retrieved, the user's intention is never satisfied. The voice instruction "play movie AABBCC", its corresponding text information, the analysis intention, and the record of resource retrieval corresponding to the analysis intention are therefore marked as an error sample and stored in the error sample library. Clearly, no error occurred in the speech recognition or semantic parsing process, so this sample can be removed from the error sample library through manual screening.
In another embodiment according to the present application, to avoid cumbersome manual screening, the matching degree between the retrieved resource and the request is preserved in the retrieval record so that interfering error samples can be screened out automatically. The method comprises:
step 501: receiving a voice instruction containing the original intention of a user from a terminal;
step 502: carrying out voice recognition on the voice instruction to generate text information of the voice instruction;
step 503: analyzing the text information, and determining an analysis intention corresponding to the text information;
step 504: searching resources required by executing the voice command according to the analysis intention, sending the resources to the terminal, and storing a record of resource searching corresponding to the analysis intention;
step 505: calculating the matching degree of the resource retrieval and the analysis intention, and storing the calculated value of the matching degree in a retrieval record;
step 506: determining that the original intention is not met, marking the voice instruction, the text information and the analysis intention as error samples, and storing the error samples in an error sample library;
step 507: and if the value of the matching degree is smaller than a preset threshold value, removing the error sample from the error sample library.
Through the above steps, error samples that were marked only because resource retrieval failed to satisfy the user's intention can be screened out and removed from the error sample library, further improving the accuracy of the error sample detection model.
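Steps 505–507 reduce to a single threshold test on the matching degree stored in each retrieval record. A minimal sketch, where the record layout (a dict with a `matching_degree` key) and the default threshold are assumptions:

```python
def prune_retrieval_failures(error_samples, threshold=0.70):
    """Steps 505-507: an error sample whose stored matching degree is below
    the preset threshold is attributed to a resource retrieval failure and
    removed from the error sample library. The record layout is an
    assumption for illustration."""
    return [s for s in error_samples if s["matching_degree"] >= threshold]
```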
In an alternative embodiment, the error sample labeled as failing to satisfy the user's intention due to resource retrieval may be filtered out before being saved to the error sample library.
There are many ways to compute the matching degree between resource retrieval and the analysis intention; a specific application is given below as an example. Suppose a user wants to go to a KTV named "lollipop" to sing and wants to retrieve the relevant information. The user inputs the voice instruction "lollipop KTV song"; speech recognition yields the corresponding text information "lollipop KTV song", which is split into three keywords: "lollipop", "KTV", and "song". The analysis intention derived from these keywords is to search for a KTV whose name contains the keyword "lollipop". However, since no KTV named "lollipop" exists, only the following four kinds of retrieval results are provided as feedback:
1. providing information of a place where a song can be played, wherein the name of the place does not contain the keyword 'lollipop';
2. calling the contact person containing the keywords 'lollipop' and/or 'KTV' and/or 'singing';
3. adding a schedule of 'singing a lollipop KTV' into a calendar;
4. a song is played that includes the keywords "lollipop" and/or "KTV".
Clearly, none of the four kinds of feedback above can satisfy the user's original intention, because the matching degree between the resource retrieval and the analysis intention is not high. Suppose the computed matching degree is 50% in the first case, 40% in the second, 30% in the third, and 15% in the fourth, all smaller than the preset threshold of 70%. In that case, even if the server repeatedly receives the same voice instruction from the same user many times, the corresponding error sample is ultimately removed from the error sample library because the matching degree between the resource retrieval and the analysis intention is below the preset threshold.
The calculation of the matching degree depends on the type of resource retrieval; take song retrieval as an example. The voice instruction is to play a song whose name is the string S1. However, an error occurs in speech recognition or semantic parsing, so the song name contained in the final analysis intention is the string S2. S1 and S2 correspond to the pinyin strings P1 and P2 respectively, and the matching degree of using S2 as the retrieval condition is computed by the formula M = 1 − d / max(len(P1), len(P2)), where M is the matching degree, d is the edit distance between P1 and P2, len(P1) and len(P2) are the lengths of the pinyin strings P1 and P2, and max(len(P1), len(P2)) takes the larger of the two lengths.
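The formula above can be implemented directly with the classic Levenshtein edit distance. A minimal sketch, assuming the conversion from the strings S1/S2 to their pinyin strings P1/P2 has already been done elsewhere:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via single-row dynamic programming."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

def matching_degree(p1: str, p2: str) -> float:
    """M = 1 - d / max(len(P1), len(P2)), over pinyin strings."""
    longest = max(len(p1), len(p2))
    if longest == 0:
        return 1.0  # two empty strings match trivially
    return 1 - edit_distance(p1, p2) / longest
```

A sample below the preset threshold (70% in the example above) would then be removed from the error sample library.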
In an embodiment of the present application, a method for processing a voice instruction of a client is disclosed, and referring to fig. 6, the method includes:
step 601: collecting a voice instruction containing an original intention of a user and sending the voice instruction to a server;
step 602: acquiring resources required by executing the voice instruction from a server side;
step 603: and determining that the original intention is not met, and sending information which marks the voice instruction, text information obtained by performing voice recognition based on the voice instruction and analysis intention obtained by analyzing the text information as an error sample to the server.
With this method, whether an error sample is obtained is determined from the user's reaction in human-computer interaction; the user does not need to perform manual error-labeling operations, which increases the likelihood of obtaining error samples.
Optionally, in another embodiment according to the present application, another method for processing a voice instruction of a client is provided, and referring to fig. 7, the method includes:
step 701: collecting a voice instruction containing an original intention of a user and sending the voice instruction to a server;
step 702: acquiring resources required by executing the voice instruction from a server side;
step 703: providing information of an execution action corresponding to the voice instruction based on the acquired resource;
step 704: capturing an instruction for abandoning execution of an execution action corresponding to the voice instruction, determining that the original intention is not satisfied, and sending information for marking the voice instruction, text information obtained by voice recognition based on the voice instruction and analysis intention obtained by analyzing the text information as an error sample to the server.
With this method, feedback information is provided to the user based on the acquired resources, informing the user of the follow-up action the client is about to perform. The feedback can be text, or speech produced by TTS (Text To Speech) technology; conversion from text information to audio can be carried out by a common text-to-speech unit. By seeing or hearing the feedback, the user can judge whether the request can be satisfied. In general, a user voluntarily abandons a follow-up action only when the intention cannot be met. Therefore, if the client receives an instruction from the user to abandon the follow-up action, it can be inferred that the resources acquired for executing the voice instruction cannot satisfy the user's actual request. The voice instruction, the text information obtained by speech recognition of the voice instruction, and the analysis intention obtained by parsing the text information are then marked as an error sample and stored in the error sample library. This removes the tedious process of manual error labeling by the user, collects error samples automatically while the user follows entirely familiar operating habits, and greatly improves the probability of obtaining valid error samples.
In another embodiment according to the present application, a method for processing a voice instruction of a client is provided, and referring to fig. 8, the method includes:
step 801: collecting a voice instruction containing an original intention of a user and sending the voice instruction to a server;
step 802: acquiring resources required by executing the voice instruction from a server side;
step 803: executing an execution action corresponding to the voice instruction based on the acquired resource;
step 804: capturing an instruction for stopping the execution action corresponding to the voice instruction within a preset time threshold, determining that the original intention is not satisfied, and sending information for marking the voice instruction, text information obtained by voice recognition based on the voice instruction and analysis intention obtained by analyzing the text information as an error sample to the server.
For example, the client collects a voice instruction "play video ABCC" input by the user and uploads it to the server. An error occurs during speech recognition at the server, and the instruction is recognized as the wrong text information "play video ADCC". The analysis intention parsed from this wrong text is to play a video named "ADCC", and the video resource "ADCC" required to execute that intention is retrieved. The client starts playback after acquiring the video resource "ADCC". The user finds that the video being played is not the requested "ABCC" and issues a stop instruction within 5 seconds of playback. The client captures this instruction, determines that the user's original intention has not been satisfied, and marks the voice instruction "play video ABCC", the text information "play video ADCC" obtained by speech recognition, and the analysis intention of playing a video named "ADCC" obtained by parsing as an error sample to be stored in the error sample library.
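The early-stop judgment in this example can be sketched as a timestamp comparison on the client side. The class name and the threshold value are assumptions for illustration:

```python
import time

STOP_THRESHOLD_SECONDS = 30  # preset time threshold (assumed value)

class PlaybackSession:
    """Tracks when playback of the retrieved resource began, so that an
    early 'stop' can be interpreted as the original intention not being
    satisfied (step 804)."""

    def __init__(self, now=time.monotonic):
        self._now = now          # injectable clock for testing
        self._started_at = None

    def start(self):
        """Record the moment the execution action begins (step 803)."""
        self._started_at = self._now()

    def stop_indicates_error(self) -> bool:
        """True if the user stopped playback within the preset threshold,
        i.e. the played resource likely did not match the original intent."""
        if self._started_at is None:
            return False
        return (self._now() - self._started_at) < STOP_THRESHOLD_SECONDS
```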
This method chooses an approach consistent with the user's habitual operations to collect error samples automatically, removing the tedious process of manual error labeling and improving the probability of obtaining valid error samples.
An embodiment of the present application discloses a device for processing a voice instruction at a server, referring to fig. 9, the device includes: the system comprises a receiving module 901, a voice recognition module 902, an analysis module 903, a resource retrieval module 904, a first error sample detection module 905 and an error sample library 906; wherein, the receiving module 901 is configured to receive a voice instruction containing the original intention of the user from the terminal; the voice recognition module 902 is configured to perform voice recognition on the voice command, and generate text information of the voice command; the analysis module 903 is configured to analyze the text information and determine an analysis intention corresponding to the text information; the resource retrieving module 904 is configured to retrieve the resource required for executing the voice instruction according to the parsing intention and transmit the resource to the terminal; the first error sample detection module 905 is configured to determine that the original intent is not satisfied, label the voice instruction, the text information, and the parsing intent as an error sample, and save the error sample to an error sample library 906; the error sample repository 906 is configured to store the error samples.
With this device, whether an error sample is obtained can be determined from the user's reaction in human-computer interaction; the user does not need to perform manual error-labeling operations, which increases the likelihood of obtaining error samples.
In another embodiment according to the present application, the first erroneous sample detection module 905 determining that the original intent is not satisfied comprises:
repeatedly receiving voice instructions with the same user and the same analytic intention within a preset time period; or
And receiving information from the terminal for marking the voice instruction, the text information and the analysis intention as error samples.
The detection module 905 of the device determines when to collect an error sample by capturing voice instructions representing the same analysis intention uploaded repeatedly by the same user within a short time, which removes the tedious step of manual error-labeling feedback by the user and greatly increases the likelihood of obtaining error samples. Another approach is to save to the error sample library those error samples that meet the client's criterion for determining that the original intention is not satisfied.
In another embodiment according to the present application, the first error sample detection module is configured to build a learnable detection model in a decision tree manner. The model detects whether voice instructions with the same analysis intention are repeatedly received from the same user within a preset time period, thereby determining whether the original intention is satisfied; if not, the voice instruction, the text information, and the analysis intention are labeled as an error sample and saved to the error sample library.
The greatest advantage of the decision tree algorithm is that the model can learn on its own: given well-labeled training examples, the learnable detection model can be optimized to filter error samples, for instance by using a random forest. The random forest algorithm helps prevent overfitting and compensates for the weak generalization ability of a single decision tree.
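As an illustration of the voting idea behind a random forest, the sketch below aggregates several single-threshold "trees" over the user features discussed earlier; the thresholds and labels are invented for the example, not taken from the patent:

```python
from collections import Counter

# Each "tree" here is a single threshold rule on one feature, standing in
# for a full decision tree. Features mirror the ones used above (daily
# usage duration, daily uploads, daily errors); thresholds are illustrative.
def make_stump(feature_index, threshold):
    return lambda x: "tester" if x[feature_index] >= threshold else "general"

forest = [
    make_stump(0, 5),    # duration of voice instruction use per day
    make_stump(1, 100),  # voice instructions uploaded per day
    make_stump(2, 50),   # errors reported per day
]

def classify(features):
    """Majority vote over the trees' individual labels."""
    votes = [tree(features) for tree in forest]
    return Counter(votes).most_common(1)[0][0]
```

Majority voting over trees trained on different bootstrap samples is what gives a random forest its resistance to the overfitting of any single tree.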
In another embodiment according to the present application, the first erroneous sample detection module 905 is further configured to:
saving a record of resource retrieval corresponding to the analysis intention;
and if the reason that the analysis intention is not met is determined to be caused by resource retrieval according to a preset rule based on the retrieval record, removing the error sample from the error sample library.
Since a failure to satisfy the user's intention may also be caused by a resource retrieval failure or error, the device configured with the first error sample detection module 905 described above saves the record of resource retrieval together with the error sample. Interference samples, in which speech recognition and semantic parsing were correct and only a retrieval failure or error left the user's intention unsatisfied, can then be removed from the error sample library through manual review or similar means, improving the accuracy of the error sample detection model.
According to another embodiment of the present application, the first erroneous sample detection module 905 is further configured to:
saving a record of resource retrieval corresponding to the analysis intention;
calculating the matching degree of the resource retrieval and the analysis intention, and storing the calculated value of the matching degree in a retrieval record;
and if the value of the matching degree is smaller than a preset threshold value, removing the error sample from the error sample library.
The device configured with the first error sample detection module 905 described above can screen out error samples that were labeled only because resource retrieval failed to satisfy the user's intention and remove them from the error sample library, further improving the accuracy of the error sample detection model.
In an embodiment according to the present application, a device for processing a voice instruction of a client is disclosed, as shown in fig. 10, the device includes: a collecting module 1001, an executing module 1002 and a second error sample detecting module 1003; the collection module 1001 is configured to collect a voice instruction containing an original intention of a user and transmit the voice instruction to a server; the execution module 1002 is configured to obtain resources required for executing the voice instruction from a server side; the second error sample detection module 1003 is configured to determine that the original intention is not satisfied, and send, to the server, information that labels, as an error sample, the voice instruction, text information obtained by performing voice recognition based on the voice instruction, and an analysis intention obtained by analyzing the text information.
For convenience in describing how the client and server cooperate, fig. 10 also shows the structure of the server-side voice instruction processing device that cooperates with the client-side device. The receiving module 1004 is configured to receive the voice instruction containing the user's original intention sent by the collecting module 1001 of the terminal. The voice instruction is passed to the voice recognition module 1005 for speech recognition, generating the text information of the voice instruction; the text information is passed to the parsing module 1006 to determine the corresponding analysis intention. The resource retrieval module 1007 then retrieves the resources required to execute the voice instruction according to the analysis intention and sends them to the execution module 1002 of the terminal. After the execution module 1002 obtains the resources, the second error sample detection module 1003 determines whether the user's original intention is satisfied; if not, it sends to the server the information that labels the voice instruction, the text information obtained by speech recognition, and the analysis intention obtained by parsing the text information as an error sample. The second error sample detection module 1003 can also save the record of resource retrieval corresponding to the analysis intention; if, based on that record and a preset rule, the failure to satisfy the analysis intention is determined to be caused by resource retrieval, the error sample is removed from the error sample library 1009.
The error sample library 1009 is generally disposed at the server side, and certainly does not exclude the possibility of setting a dedicated error sample library for a specific user at the client side.
The device can determine whether to collect an error sample from the user's reaction in human-computer interaction; the user does not need to perform manual error-labeling operations, which increases the likelihood of obtaining error samples.
According to an embodiment of the present application, the execution module 1002 is further configured to, after acquiring a resource required for executing the voice instruction, provide information of an execution action corresponding to the voice instruction based on the acquired resource;
the determining that the original intent is not satisfied comprises:
capturing an indication to abandon execution of an execution action corresponding to the voice instruction.
A device configured with the execution module 1002 not only removes the tedious process of manual error labeling but also collects error samples automatically while the user follows entirely familiar operating habits, greatly improving the probability of obtaining valid error samples.
According to another embodiment of the present application, the execution module 1002 is further configured to, after acquiring a resource required for executing the voice instruction, execute an execution action corresponding to the voice instruction based on the acquired resource;
the determining that the original intent is not satisfied comprises:
capturing an indication that the execution action corresponding to the voice instruction is terminated within a predetermined time threshold.
A device configured with the execution module 1002 can likewise collect error samples automatically in a manner consistent with the user's habitual operations, removing the tedious process of manual error labeling and improving the probability of obtaining valid error samples.
In one embodiment according to the present application, there is provided a smart device 1100 as shown in fig. 11, including but not limited to a memory 1101, a processor 1102 and computer instructions stored on the memory 1101 and executable on the processor 1102, wherein the processor 1102 when executing the instructions implements the method for processing voice instructions as described above.
The foregoing is a schematic scheme of an intelligent device according to this embodiment. It should be noted that the technical solution of the intelligent device belongs to the same concept as the aforementioned processing method of the voice command, and details that are not described in detail in the technical solution of the intelligent device can be referred to the description of the technical solution of the processing method of the voice command.
In one embodiment according to the present application, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement a method of processing voice instructions as previously described.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content covered by the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the aforementioned processing method of the voice command, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the processing method of the voice command.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the application to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and the practical application, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (18)

1. A method for processing a voice command, the method comprising:
receiving a voice instruction containing the original intention of a user from a terminal;
carrying out voice recognition on the voice instruction to generate text information of the voice instruction;
analyzing the text information, and determining an analysis intention corresponding to the text information;
searching resources required by executing the voice command according to the analysis intention, and sending the resources to the terminal;
and according to a reaction mode of man-machine interaction, determining that the original intention is not met, and marking the voice instruction, the text information and the analysis intention as error samples to be stored in an error sample library.
2. The method of claim 1, wherein the determining that the original intent is not satisfied comprises:
repeatedly receiving voice instructions with the same user and the same analytic intention within a preset time period; or
And receiving information from the terminal for marking the voice instruction, the text information and the analysis intention as error samples.
3. The method of claim 2, wherein the determination of whether the voice commands with the same user's parsing intention are repeatedly received within a preset time period is performed in a decision tree manner.
4. The method according to claim 1 or 2, characterized in that the method further comprises:
saving a record of resource retrieval corresponding to the analysis intention;
and if the reason that the analysis intention is not met is determined to be caused by resource retrieval according to a preset rule based on the retrieval record, removing the error sample from the error sample library.
5. The method according to claim 1 or 2, characterized in that the method further comprises:
saving a record of resource retrieval corresponding to the analysis intention;
calculating the matching degree of the resource retrieval and the analysis intention, and storing the calculated value of the matching degree in a retrieval record;
and if the value of the matching degree is smaller than a preset threshold value, removing the error sample from the error sample library.
6. A method for processing a voice command, the method comprising:
collecting a voice instruction containing an original intention of a user and sending the voice instruction to a server;
acquiring resources required by executing the voice instruction from a server side;
and determining that the original intention is not met, and sending information which marks the voice instruction, text information obtained by performing voice recognition based on the voice instruction and analysis intention obtained by analyzing the text information as an error sample to the server.
7. The method of claim 6, wherein after obtaining the resources needed to execute the voice instruction, the method further comprises:
providing information of an execution action corresponding to the voice instruction based on the acquired resource;
the determining that the original intent is not satisfied comprises:
capturing an indication to abandon execution of an execution action corresponding to the voice instruction.
8. The method of claim 6, wherein after acquiring resources required for executing the voice instruction, the method further comprises:
executing an execution action corresponding to the voice instruction based on the acquired resource;
the determining that the original intent is not satisfied comprises:
capturing an indication that the execution action corresponding to the voice instruction is terminated within a predetermined time threshold.
9. An apparatus for processing a voice command, the apparatus comprising: a receiving module, a voice recognition module, an analysis module, a resource retrieval module, a first error sample detection module and an error sample library; wherein:
the receiving module is configured to receive a voice instruction containing the original intention of the user from the terminal;
the voice recognition module is configured to perform voice recognition on the voice instruction, and generate text information of the voice instruction;
the analysis module is configured to analyze the text information and determine an analysis intention corresponding to the text information;
the resource retrieval module is configured to retrieve resources required for executing the voice instruction according to the analysis intention and send the resources to the terminal;
the first error sample detection module is configured to determine, according to a human-computer interaction reaction pattern, that the original intention is not satisfied, mark the voice instruction, the text information, and the analysis intention as an error sample, and store the error sample in the error sample library;
the error sample library is configured to store the error samples.
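The server-side modules of claim 9 form a simple pipeline: receive, recognize, parse, retrieve, with the error sample library fed by the error sample detection step. The sketch below is an assumption-laden illustration; the function names, data shapes, and in-memory list standing in for the library are not specified by the patent.

```python
# Hypothetical wiring of the claim-9 modules; names are illustrative.
class VoiceInstructionServer:
    def __init__(self, recognize, parse, retrieve):
        self.recognize = recognize   # voice recognition module
        self.parse = parse           # analysis (intent parsing) module
        self.retrieve = retrieve     # resource retrieval module
        self.error_samples = []      # error sample library

    def handle(self, voice_instruction):
        """Receive a voice instruction and return the recognized text,
        the parsed intention, and the retrieved resources."""
        text = self.recognize(voice_instruction)
        intention = self.parse(text)
        resources = self.retrieve(intention)
        return text, intention, resources

    def mark_error(self, voice_instruction, text, intention):
        """First error sample detection module: store the labelled
        triple in the error sample library."""
        self.error_samples.append((voice_instruction, text, intention))
```

The point of keeping the full triple, audio, text, and intention, is that a later analysis can tell whether the failure came from recognition, from parsing, or (per claims 12-13) from resource retrieval.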
10. The apparatus of claim 9, wherein the first error sample detection module determining that the original intention is not satisfied comprises:
repeatedly receiving, within a preset time period, voice instructions from the same user having the same analysis intention; or
receiving, from the terminal, information that marks the voice instruction, the text information, and the analysis intention as an error sample.
11. The apparatus according to claim 9 or 10, wherein the first error sample detection module is configured to determine, in a decision tree manner, whether voice instructions from the same user having the same analysis intention are repeatedly received within a preset time period.
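The repetition check in claims 10-11 can be approximated as follows. Note one substitution: claim 11 specifies a decision tree, but this sketch uses a plain sliding-window lookup for brevity, and the 60-second window and data structures are illustrative assumptions, not values from the patent.

```python
from collections import deque
import time

PRESET_WINDOW_S = 60.0  # assumed "preset time period"

class RepetitionDetector:
    """Flag a voice instruction when the same user has already issued
    an instruction with the same analysis intention inside the window."""
    def __init__(self, window=PRESET_WINDOW_S):
        self.window = window
        self.history = {}  # (user_id, intention) -> deque of timestamps

    def is_repeated(self, user_id, intention, now=None):
        now = time.monotonic() if now is None else now
        stamps = self.history.setdefault((user_id, intention), deque())
        # Drop timestamps that have fallen out of the window.
        while stamps and now - stamps[0] > self.window:
            stamps.popleft()
        repeated = len(stamps) > 0
        stamps.append(now)
        return repeated
```

The intuition behind the claim: a user who immediately repeats a request with the same parsed intention was probably not satisfied by the first response, so the first interaction is a candidate error sample.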
12. The apparatus of claim 9 or 10, wherein the first erroneous sample detection module is further configured to:
save a retrieval record of the resource retrieval corresponding to the analysis intention; and
if it is determined, according to a preset rule based on the retrieval record, that the analysis intention is not satisfied because of the resource retrieval, remove the error sample from the error sample library.
13. The apparatus of claim 9 or 10, wherein the first erroneous sample detection module is further configured to:
save a retrieval record of the resource retrieval corresponding to the analysis intention;
calculate a matching degree between the resource retrieval and the analysis intention, and store the calculated value of the matching degree in the retrieval record; and
if the value of the matching degree is smaller than a preset threshold, remove the error sample from the error sample library.
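The pruning rule in claim 13 could look like the sketch below: when the matching degree between the retrieved resources and the analysis intention falls below a preset threshold, the failure is attributed to retrieval rather than intent parsing, so the sample is removed from the error sample library. The patent does not fix a formula for "matching degree"; the Jaccard-style term overlap and the 0.5 threshold here are assumptions for illustration.

```python
MATCH_THRESHOLD = 0.5  # assumed "preset threshold"

def matching_degree(intention_terms, resource_terms):
    """Jaccard overlap between intention terms and resource terms
    (one possible matching-degree measure, not the patent's)."""
    a, b = set(intention_terms), set(resource_terms)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def prune_error_samples(error_samples, retrieval_records,
                        threshold=MATCH_THRESHOLD):
    """Keep only samples whose retrieval matched the intention well;
    low-match samples are retrieval failures, not parsing errors."""
    kept = []
    for sample in error_samples:
        record = retrieval_records.get(sample["intention"])
        degree = (matching_degree(sample["intention"].split(), record)
                  if record else 0.0)
        if degree >= threshold:
            kept.append(sample)
    return kept
```

This keeps the error sample library focused on genuine recognition or parsing mistakes, which is what downstream model retraining would consume.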
14. An apparatus for processing a voice instruction, the apparatus comprising: an acquisition module, an execution module, and a second error sample detection module; wherein:
the acquisition module is configured to acquire a voice instruction containing an original intention of a user and send the voice instruction to a server;
the execution module is configured to acquire, from the server, resources required for executing the voice instruction;
the second error sample detection module is configured to determine that the original intention is not satisfied, and send to the server information that marks the voice instruction, the text information obtained by performing voice recognition on the voice instruction, and the analysis intention obtained by parsing the text information as an error sample.
15. The apparatus of claim 14, wherein:
the execution module is further configured to provide information of an execution action corresponding to the voice instruction based on the acquired resource after acquiring the resource required for executing the voice instruction;
the determining that the original intention is not satisfied comprises:
capturing an indication to abandon execution of an execution action corresponding to the voice instruction.
16. The apparatus of claim 14, wherein:
the execution module is further configured to execute an execution action corresponding to the voice instruction based on the acquired resource after acquiring the resource required for executing the voice instruction;
the determining that the original intention is not satisfied comprises:
capturing an indication that the execution action corresponding to the voice instruction is terminated within a predetermined time threshold.
17. A smart device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, wherein the processor implements the method of processing voice instructions of any one of claims 1 to 5 or 6 to 8 when executing the instructions.
18. A computer-readable storage medium having stored thereon computer instructions, which when executed by a processor, implement the method of processing voice instructions of any one of claims 1 to 5 or 6 to 8.
CN201810233853.8A 2018-03-21 2018-03-21 Voice instruction processing method and device Active CN108932945B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810233853.8A CN108932945B (en) 2018-03-21 2018-03-21 Voice instruction processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810233853.8A CN108932945B (en) 2018-03-21 2018-03-21 Voice instruction processing method and device

Publications (2)

Publication Number Publication Date
CN108932945A CN108932945A (en) 2018-12-04
CN108932945B true CN108932945B (en) 2021-08-31

Family

ID=64449316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810233853.8A Active CN108932945B (en) 2018-03-21 2018-03-21 Voice instruction processing method and device

Country Status (1)

Country Link
CN (1) CN108932945B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109979453A * 2019-03-29 2019-07-05 客如云科技(成都)有限责任公司 Intelligent intention recognition human-computer interaction method for an ordering robot
CN110010128A * 2019-04-09 2019-07-12 天津松下汽车电子开发有限公司 High-accuracy voice control method and system
CN110706707B (en) * 2019-11-13 2020-09-18 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer-readable storage medium for voice interaction
CN110838284B (en) * 2019-11-19 2022-06-14 大众问问(北京)信息科技有限公司 Method and device for processing voice recognition result and computer equipment
CN110942772B (en) * 2019-11-21 2022-11-25 新华三大数据技术有限公司 Voice sample collection method and device
CN111586469B (en) * 2020-05-12 2021-10-26 腾讯科技(深圳)有限公司 Bullet screen display method and device and electronic equipment
CN111737972A (en) * 2020-05-20 2020-10-02 华为技术有限公司 Method and device for realizing natural language understanding in human-computer interaction system
CN111741354B (en) * 2020-06-01 2022-10-04 深圳康佳电子科技有限公司 Method, system and storage medium for assisting voice interaction based on interface elements
CN113791545A (en) * 2020-07-10 2021-12-14 北京沃东天骏信息技术有限公司 Smart home equipment testing method and device, electronic equipment and readable storage medium
CN111914078A (en) * 2020-08-13 2020-11-10 北京捷通华声科技股份有限公司 Data processing method and device
CN113889108A (en) * 2021-10-19 2022-01-04 清华大学 Error correction method and system for voice operation
CN116994559A (en) * 2022-04-26 2023-11-03 青岛海尔洗衣机有限公司 Method, device, equipment and storage medium for detecting voice recognition accuracy
CN115762513A (en) * 2022-11-03 2023-03-07 深圳市品声科技有限公司 Voice-controlled split wireless audio communication method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198828A (en) * 2013-04-03 2013-07-10 中金数据系统有限公司 Method and system of construction of voice corpus
CN104252864A (en) * 2013-06-28 2014-12-31 国际商业机器公司 Real-time speech analysis method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1471502A1 (en) * 2003-04-25 2004-10-27 Sony International (Europe) GmbH Method for correcting a text produced by speech recognition
JP6232282B2 (en) * 2013-12-26 2017-11-15 日本放送協会 Speech recognition error correction device
CN107146607B (en) * 2017-04-10 2021-06-18 北京猎户星空科技有限公司 Method, device and system for correcting interaction information of intelligent equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198828A (en) * 2013-04-03 2013-07-10 中金数据系统有限公司 Method and system of construction of voice corpus
CN104252864A (en) * 2013-06-28 2014-12-31 国际商业机器公司 Real-time speech analysis method and system

Also Published As

Publication number Publication date
CN108932945A (en) 2018-12-04

Similar Documents

Publication Publication Date Title
CN108932945B (en) Voice instruction processing method and device
CN108804521B (en) Knowledge graph-based question-answering method and agricultural encyclopedia question-answering system
CN109344237B (en) Information processing method and device for man-machine interaction
CN109582793B (en) Model training method, customer service system, data labeling system and readable storage medium
CN111145737B (en) Voice test method and device and electronic equipment
CN111210842B (en) Voice quality inspection method, device, terminal and computer readable storage medium
CN111182162B (en) Telephone quality inspection method, device, equipment and storage medium based on artificial intelligence
CN105979376A (en) Recommendation method and device
WO2020253064A1 (en) Speech recognition method and apparatus, and computer device and storage medium
CN113590850A (en) Multimedia data searching method, device, equipment and storage medium
CN110888990A (en) Text recommendation method, device, equipment and medium
CN111178081B (en) Semantic recognition method, server, electronic device and computer storage medium
US20220058213A1 (en) Systems and methods for identifying dynamic types in voice queries
CN113094512B (en) Fault analysis system and method in industrial production and manufacturing
CN108710653B (en) On-demand method, device and system for reading book
CN112149386A (en) Event extraction method, storage medium and server
CN112364622A (en) Dialog text analysis method, dialog text analysis device, electronic device and storage medium
CN110858234A (en) Method and device for pushing information according to human emotion
CN114121038A (en) Sound voice testing method, device, equipment and storage medium
CN110955760B (en) Evaluation method of judgment result and related device
CN111898034A (en) News content pushing method and device, storage medium and computer equipment
CN104978403B (en) A kind of generation method and device of video album title
CN117453895B (en) Intelligent customer service response method, device, equipment and readable storage medium
CN113705186B (en) Automatic reply method and device under message semantic analysis
CN116451678B (en) Data relation recognition and data table integration method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant