CN105869640B - Method and device for recognizing voice control instruction aiming at entity in current page - Google Patents

Method and device for recognizing voice control instruction aiming at entity in current page

Info

Publication number
CN105869640B
Authority
CN
China
Prior art keywords
entity
extracted
candidate
word
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510031182.3A
Other languages
Chinese (zh)
Other versions
CN105869640A (en)
Inventor
雷欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Volkswagen China Investment Co Ltd
Mobvoi Innovation Technology Co Ltd
Original Assignee
Shanghai Mobvoi Information Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Mobvoi Information Technology Co., Ltd.
Priority to CN201510031182.3A
Publication of CN105869640A
Application granted
Publication of CN105869640B
Legal status: Active
Anticipated expiration

Abstract

The invention provides a method and a device for recognizing a voice control instruction for an entity in a current page. The method comprises: extracting entities from the current page; constructing a candidate instruction set based on the extracted entities and corresponding construction templates; and identifying a voice control instruction for an entity in the current page from speech spoken by the user, based on matching that speech against the candidate instructions in the candidate instruction set. The invention improves the flexibility of voice instruction recognition.

Description

Method and device for recognizing voice control instruction aiming at entity in current page
Technical Field
The present invention relates to voice recognition technology, and in particular to a method and an apparatus for recognizing a voice control instruction for an entity in a current page.
Background
In the prior art, voice instruction recognition can determine whether a user's speech is a voice instruction only by matching it against a fixed set of voice instructions. For example, if the fixed set contains the instruction "I want to buy a train ticket to Beijing", the user is considered to have issued that instruction only if the speech content is identical to it, and only then is the related operation executed. If the user expresses the same request with the sentence order changed, for example "To Beijing, I want to buy a train ticket", the speech is not recognized as a voice instruction, the related operation is not executed, and the flexibility of voice instruction recognition is therefore poor.
Disclosure of Invention
One of the technical problems solved by the invention is improving the flexibility of voice instruction recognition.
According to an embodiment of one aspect of the present invention, there is provided a method of recognizing a voice control instruction for an entity in a current page, comprising: extracting entities from the current page; constructing a candidate instruction set based on the extracted entities and corresponding construction templates; and identifying a voice control instruction for an entity in the current page from speech spoken by the user, based on matching that speech against the candidate instructions in the candidate instruction set.
Optionally, the step of extracting entities from the current page comprises: segmenting the text in the current page into words; determining the part of speech of each segmented word; inputting each segmented word having a specific part of speech into a classifier to determine whether the word is a word constituting an entity and whether it constitutes the beginning, middle, or end of the entity, the classifier having been trained in advance on a set of word samples from entities and non-entities; and determining whether the words with the specific parts of speech form an entity according to the classifier's decision on each of them.
Alternatively, the construction templates are formed in advance as follows: an entity is extracted from each voice control instruction in the current user's set of historical voice control instructions, and the language pattern around the extracted entity is extracted as the construction template corresponding to that entity.
Alternatively, the construction templates are formed in advance as follows: an entity is extracted from each voice control instruction in the set of historical voice control instructions of all users, and the language pattern around the extracted entity is extracted as the construction template corresponding to that entity.
Optionally, the step of constructing a candidate instruction set based on the extracted entities and corresponding construction templates comprises: obtaining synonyms of each extracted entity; and applying the extracted entity and the obtained synonyms to the entity's corresponding construction templates to obtain candidate instructions, which are placed into the candidate instruction set.
Optionally, the step of identifying the voice control instruction for an entity in the current page from the speech spoken by the user, based on matching that speech against the candidate instructions in the candidate instruction set, comprises: in response to the user's speech matching one candidate instruction in the candidate instruction set, identifying the extracted entity corresponding to that candidate instruction, thereby identifying from the user's speech the voice control instruction for that entity in the current page.
According to an embodiment of another aspect of the present invention, there is provided an apparatus for recognizing a voice control instruction for an entity in a current page, comprising: an extraction unit configured to extract entities from the current page; a construction unit configured to construct a candidate instruction set based on the extracted entities and corresponding construction templates; and a recognition unit configured to recognize a voice control instruction for an entity in the current page from speech spoken by the user, based on matching that speech against the candidate instructions in the candidate instruction set.
Optionally, the extraction unit is configured to: segment the text in the current page into words; determine the part of speech of each segmented word; input each segmented word having a specific part of speech into a classifier to determine whether the word is a word constituting an entity and whether it constitutes the beginning, middle, or end of the entity, the classifier having been trained in advance on a set of word samples from entities and non-entities; and determine whether the words with the specific parts of speech form an entity according to the classifier's decision on each of them.
Alternatively, the construction templates are formed in advance as follows: an entity is extracted from each voice control instruction in the current user's set of historical voice control instructions, and the language pattern around the extracted entity is extracted as the construction template corresponding to that entity.
Alternatively, the construction templates are formed in advance as follows: an entity is extracted from each voice control instruction in the set of historical voice control instructions of all users, and the language pattern around the extracted entity is extracted as the construction template corresponding to that entity.
Optionally, the construction unit is configured to: obtain synonyms of each extracted entity; and apply the extracted entity and the obtained synonyms to the entity's corresponding construction templates to obtain candidate instructions, which are placed into the candidate instruction set.
Optionally, the recognition unit is configured to, in response to the user's speech matching one candidate instruction in the candidate instruction set, identify the extracted entity corresponding to that candidate instruction, thereby recognizing from the user's speech the voice control instruction for that entity in the current page.
The candidate instruction set of embodiments of the invention is not fixed; it is constructed in real time from the entities present on the current page and their corresponding construction templates, and therefore changes with the page, allowing the user to issue instructions flexibly.
It will be appreciated by those of ordinary skill in the art that although the following detailed description will proceed with reference being made to illustrative embodiments, the present invention is not intended to be limited to these embodiments. Rather, the scope of the invention is broad and is intended to be defined only by the claims appended hereto.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is a flow diagram of a method of identifying voice control instructions for entities in a current page in accordance with one embodiment of the present invention;
FIG. 2 is a flowchart illustrating the process of extracting entities from a current page in a method according to an embodiment of the present invention;
FIG. 3 is a detailed flow diagram of a process for constructing a set of candidate instructions based on extracted entities and corresponding construction templates in a method according to one embodiment of the invention;
FIG. 4 is a block diagram of an apparatus for identifying voice control commands for entities in a current page according to one embodiment of the present invention.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present invention is described in further detail below with reference to the attached drawing figures.
FIG. 1 is a flow diagram of a method 1 of identifying voice control instructions for entities in a current page according to one embodiment of the invention. The method can be used on vehicle-mounted devices, mobile terminals, fixed devices (such as desktop computers), and the like. The current page is the page currently displayed on such a device. It may be a page that is displayed without any action by the current user (the user operating the vehicle-mounted device, mobile terminal, fixed device, or the like), or a page displayed in response to the current user's action. An entity is a word or serial number displayed on the page that represents a possible target of a user action. For example, if the page displays dishes such as "spicy attraction", "spicy pot", and "special grilled fish", each dish name is considered an entity, and the serial numbers displayed beside these items (such as 1, 2, 3) are also considered entities, because the voice instructions the user is likely to issue next (such as "I want to eat spicy attraction" or "I select 3") are likely to be directed at them.
One application scenario in which the current page is not displayed in response to the current user's action: for example, an application on a vehicle-mounted device is opened by default on the device's desktop when the device is turned on, and navigation, food, shopping, and the like are displayed on the desktop. After the current user utters the speech "I want to go shopping", the method 1 of one embodiment of the present invention recognizes it as the voice control instruction for the entity "shopping" in the current page and performs further actions, such as displaying nearby malls for the current user. Such an application may equally be on a mobile terminal or a fixed device that displays some options on its desktop by default when turned on.
An application scenario in which the current page is displayed in response to the current user's action: for example, in a vehicle-mounted application, the current user first activates the application and then says, for example, "please show me restaurants near me", whereupon "spicy enticement", "boiling fish village", "full focus", and the like are displayed on the screen of the vehicle-mounted device. When the current user says "I want to go to full focus", the method 1 of identifying a voice control instruction for an entity in the current page of an embodiment of the present invention identifies this as a voice control instruction for the entity "full focus" in the current page and performs further actions, such as calling the restaurant or displaying a specific route to it. Such an application may equally be on a mobile terminal or a fixed device: after the current user's preceding operations have brought some options onto the screen, the method 1 can be used to recognize whether the speech the current user utters next is a voice control instruction for an entity in the current page, and for which entity.
In step 110, entities are extracted from the current page.
In one case, analysis of the composition of the current page shows that it mainly comprises several frames, each containing a single word (whether the text in a frame is a single word or a phrase or sentence composed of several words can be determined with existing word segmentation techniques), and the word in each frame can be taken as an entity.
In another case, the analysis shows that the several frames each contain a phrase or sentence, or that the current page is an article, or a page with a complex structure mixing various text and various frames; entities then need to be extracted, for example by the method of FIG. 2.
In sub-step 1101, the text in the current page is segmented into words.
Typically all the text identified on the current page is segmented. For example, if the current page mainly comprises several frames each containing a phrase or sentence, each phrase or sentence is segmented; if the current page is an article, the article is segmented. Any existing word segmentation method can be used.
In sub-step 1102, the part of speech of each segmented word is determined.
Mature techniques exist for this kind of semantic analysis, and any existing part-of-speech tagging method can be used. Generally, only content words such as nouns, verbs, and adjectives, together with ordinal words, can become entities; a function word is unlikely to become one.
In sub-step 1103, each of the separated words having a particular part of speech is input to a classifier trained in advance using a set of word samples of entities and non-entities to determine whether the word is a word constituting an entity and whether the word constitutes the beginning, middle, or end of the entity.
Words of a specific part of speech are, for example, content words and ordinal words. In some cases, the specific parts of speech may be restricted to nouns and ordinal words.
Machine learning offers mature techniques here. A model, i.e. a classifier, can be trained on a set containing a large number of samples of entity words and non-entity words. Specifically, each sample word is input to the classifier together with a label indicating whether it comes from an entity or a non-entity and whether it constitutes the beginning, middle, or end of an entity; from these labels the classifier learns the respective regularities of words drawn from entities, words drawn from non-entities, and words forming the beginning, middle, and end of entities. When a new word is subsequently input, the classifier can then judge whether it is a word constituting an entity and whether it lies at the beginning, middle, or end of that entity.
In sub-step 1104, whether the words with the specific parts of speech form an entity is determined according to the classifier's decision on each of them.
For example, for "boiling fish village", the classifier determines that its first word is often the beginning of an entity, its middle words often the middle of an entity, and its last word often the end of an entity, and "boiling fish village" is therefore judged to be an entity.
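Putting sub-steps 1101 to 1104 together, the extraction loop can be pictured with a minimal sketch. This is an illustration only, not the patent's implementation: segment, pos_tag, and classifier stand for an existing word segmenter, an existing part-of-speech tagger, and the pre-trained classifier described above.

```python
# Illustrative sketch of sub-steps 1101-1104 (hypothetical, simplified).
CONTENT_POS = {"noun", "verb", "adjective", "ordinal"}  # "specific parts of speech"

def extract_entities(page_text, segment, pos_tag, classifier):
    words = segment(page_text)                    # sub-step 1101: word segmentation
    entities, current = [], []
    for word in words:
        if pos_tag(word) not in CONTENT_POS:      # sub-step 1102: keep content words
            current = []
            continue
        label = classifier.predict(word)          # sub-step 1103: "B"/"M"/"E"/"O"
        if label == "B":                          # word begins an entity
            current = [word]
        elif label == "M" and current:            # word continues an entity
            current.append(word)
        elif label == "E" and current:            # word ends an entity
            current.append(word)
            # Joining without spaces suits the Chinese setting of the examples.
            entities.append("".join(current))     # sub-step 1104: stitch B-M-E runs
            current = []
        else:
            current = []
    return entities
```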
In step 120, a set of candidate instructions is constructed based on the extracted entities and corresponding construction templates.
Candidate instructions are instructions that the user may issue for the entities of the current page. A construction template is a language pattern the user may use when issuing an instruction for an entity of the current page. For example, for "2. boiling fish village" on the current page, the user may issue the instructions "I want to go to boiling fish village", "select boiling fish village", "boiling fish village", "2", "select 2", and so on; these are candidate instructions, while "I want to go to xx", "select xx", "xx", "No. xx", and "select No. xx" are construction templates.
One way of forming construction templates is for a person to define various construction templates for various entities in advance and store them in a database.
Another way of forming construction templates is: an entity is extracted from each voice control instruction in the set of historical voice control instructions of all users, and the language pattern around the extracted entity is extracted as the construction template corresponding to that entity.
For example, from "I want to go to Tiananmen, how much does the subway cost?", the entity "Tiananmen" is extracted; the independent clause containing it is "I want to go to Tiananmen", and the language pattern is "I want to go to xx".
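As an illustration of this pattern extraction (a hypothetical sketch, assuming clauses are split on punctuation; derive_template and its details are not the patent's implementation):

```python
import re

# Hypothetical sketch: given a historical voice control instruction and the
# entity extracted from it, keep the independent clause containing the entity
# and replace the entity itself with the placeholder "xx".
def derive_template(command, entity):
    # Split the instruction into clauses on sentence punctuation
    # (both ASCII and full-width Chinese punctuation).
    clauses = re.split(r"[,.?!;，。？！；]", command)
    for clause in clauses:
        if entity in clause:
            return clause.strip().replace(entity, "xx")
    return None

print(derive_template("I want to go to Tiananmen, how much does the subway cost?",
                      "Tiananmen"))
# -> "I want to go to xx"
```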
For example, for the users of an application, entities can be extracted from every voice control instruction those users have historically issued while using the application, and the language patterns around the extracted entities can be extracted as the corresponding construction templates.
The advantage of forming construction templates this way is that they are collected from users' actual usage rather than conceived by people in advance, which improves the objectivity of the templates and thus the accuracy of recognizing voice control instructions for entities in the current page.
Another way of forming construction templates is: an entity is extracted from each voice control instruction in the current user's set of historical voice control instructions, and the language pattern around the extracted entity is extracted as the construction template corresponding to that entity.
For example, for the current user of an application, entities can be extracted from each voice control instruction that user has historically issued while using the application, and the language patterns around the extracted entities can be extracted as the corresponding construction templates.
The advantage of forming construction templates this way is that they are extracted from the current user's own historical voice control instructions and therefore reflect that user's language habits. For example, if, when pages have historically shown "boiling fish village", the current user has tended to say "want to eat boiling fish village" rather than "I want to go to boiling fish village", then "want to eat xx" may be the more common template for this user. Forming templates this way thus adapts to the user's individual habits and improves the accuracy of recognizing voice control instructions for entities in the current page.
One way to construct the candidate instruction set based on the extracted entities and corresponding construction templates is to apply each extracted entity directly to its corresponding construction templates, and put the resulting candidate instructions into the candidate instruction set.
For example, if the extracted entity is "boiling fish village" and the corresponding construction templates are "I want to go to xx", "xx", and "select xx", applying the entity to these templates yields the candidate instructions "I want to go to boiling fish village", "boiling fish village", and "select boiling fish village", which are put into the candidate instruction set. In one approach, they are stored in the set in correspondence with "boiling fish village".
In another embodiment, as shown in FIG. 3, step 120 of constructing the candidate instruction set based on the extracted entities and corresponding construction templates includes sub-steps 1201 and 1202.
In sub-step 1201, a synonym of the extracted entity is obtained based on the extracted entity.
A synonym database is constructed in advance. For example, an expert finds synonyms for the entities extracted from each voice control instruction in the historical instruction sets of all users or of the current user and places them in the synonym database; or an expert classifies all the words in a dictionary, groups words of similar meaning into synonym sets, and lets all the synonym sets form the database. The synonym database may also be constructed in other ways.
Once the synonym database is constructed, the synonyms of an extracted entity can be obtained by looking the entity up in the database.
In sub-step 1202, the extracted entity and the obtained synonyms are each applied to the entity's corresponding construction templates to obtain candidate instructions, which are put into the candidate instruction set.
For example, if the extracted entity is "Beijing University", the obtained synonym is "Beida", and the corresponding construction templates are "navigate to xx", "go to xx", "I want to go to xx", and "call xx", then the resulting candidate instructions are:
navigate to Beijing University
go to Beijing University
I want to go to Beijing University
call Beijing University
navigate to Beida
go to Beida
I want to go to Beida
call Beida
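A minimal sketch of this construction, covering both the direct application of the entity and the synonym expansion of sub-steps 1201 and 1202. The synonym table and template store below are illustrative stand-ins for the pre-built synonym database and construction templates described above, not the patent's implementation:

```python
# Illustrative sketch of sub-steps 1201-1202 (hypothetical data and names).
SYNONYMS = {"Beijing University": ["Beida"]}          # stand-in synonym database
TEMPLATES = {"Beijing University": [
    "navigate to xx", "go to xx", "I want to go to xx", "call xx",
]}                                                    # stand-in template store

def build_candidate_set(entities):
    # Map each candidate instruction back to the entity it was built from,
    # so the targeted entity can be identified after a match (see step 130).
    candidates = {}
    for entity in entities:
        surface_forms = [entity] + SYNONYMS.get(entity, [])   # sub-step 1201
        for template in TEMPLATES.get(entity, []):
            for form in surface_forms:                        # sub-step 1202
                candidates[template.replace("xx", form)] = entity
    return candidates

candidates = build_candidate_set(["Beijing University"])
# yields the eight candidate instructions listed above, each mapped back to
# the entity "Beijing University".
```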
In step 130, a voice control instruction for an entity in the current page is identified from the speech spoken by the user, based on matching that speech against the candidate instructions in the candidate instruction set.
For example, pauses in the current user's speech are detected, and the speech between two pauses is taken as one clause. The clause is converted to text using a speech recognition method known in the art and compared one by one with the candidate instructions in the set constructed in step 120. When the recognized text is found to be identical to one candidate instruction in that set, or to contain one, a match between the user's speech and that candidate instruction is considered found, and the matched candidate instruction is the voice control instruction for an entity in the current page.
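A sketch of this matching (illustrative only; clause_text stands for the recognized text of one clause, and candidates is a candidate-to-entity mapping as in the sketch above):

```python
# Hypothetical candidate-to-entity mapping, as built in the previous sketch.
candidates = {
    "I want to go to Beida": "Beijing University",
    "navigate to Beijing University": "Beijing University",
}

def match_instruction(clause_text, candidates):
    # A match is an exact hit, or a candidate instruction fully
    # contained in the recognized clause.
    for instruction, entity in candidates.items():
        if clause_text == instruction or instruction in clause_text:
            return instruction, entity
    return None

print(match_instruction("well, I want to go to Beida", candidates))
# -> ('I want to go to Beida', 'Beijing University')
```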
It can then be further determined at which entity of the current page the recognized voice control instruction is directed. As described above, when the candidate instructions derived from an extracted entity are put into the candidate instruction set, they may be stored in correspondence with that entity; in response to the user's speech matching one candidate instruction, the extracted entity corresponding to that instruction can therefore be identified, which determines the entity of the current page at which the recognized voice control instruction is directed.
After the voice control instruction for an entity in the current page has been recognized from the user's speech, it can be executed. For example, the executable program code corresponding to each candidate instruction in the candidate instruction set is stored in another database; when a candidate instruction is matched (i.e. the voice control instruction is recognized), the instruction is executed by running the corresponding program code from that database.
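For example, the lookup-and-execute step might be sketched as follows (the handler and its mapping are hypothetical stand-ins for the program code stored in the other database):

```python
def navigate_to(entity):
    # Hypothetical handler standing in for the stored program code.
    print(f"starting navigation to {entity}")

# Stand-in for the "another database" mapping candidate instructions to code.
HANDLERS = {
    "I want to go to Beida": navigate_to,
    "navigate to Beijing University": navigate_to,
}

instruction, entity = "I want to go to Beida", "Beijing University"  # from matching
HANDLERS[instruction](entity)   # executes the corresponding program code
```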
As shown in FIG. 4, the apparatus 2 for recognizing a voice control instruction for an entity in a current page according to one embodiment of the present invention comprises: an extraction unit 210 configured to extract entities from the current page; a construction unit 220 configured to construct a candidate instruction set based on the extracted entities and corresponding construction templates; and a recognition unit 230 configured to recognize a voice control instruction for an entity in the current page from speech spoken by the user, based on matching that speech against the candidate instructions in the candidate instruction set. These units can be implemented in software, in hardware (FPGA, integrated circuit, etc.), or in a combination of the two.
Optionally, the extraction unit 210 is configured to: segment the text in the current page into words; determine the part of speech of each segmented word; input each segmented word having a specific part of speech into a classifier to determine whether the word is a word constituting an entity and whether it constitutes the beginning, middle, or end of the entity, the classifier having been trained in advance on a set of word samples from entities and non-entities; and determine whether the words with the specific parts of speech form an entity according to the classifier's decision on each of them.
Alternatively, the construction templates are formed in advance as follows: an entity is extracted from each voice control instruction in the current user's set of historical voice control instructions, and the language pattern around the extracted entity is extracted as the construction template corresponding to that entity.
Alternatively, the construction templates are formed in advance as follows: an entity is extracted from each voice control instruction in the set of historical voice control instructions of all users, and the language pattern around the extracted entity is extracted as the construction template corresponding to that entity.
Optionally, the construction unit 220 is configured to: obtain synonyms of each extracted entity; and apply the extracted entity and the obtained synonyms to the entity's corresponding construction templates to obtain candidate instructions, which are placed into the candidate instruction set.
Optionally, the recognition unit 230 is configured to, in response to the user's speech matching one candidate instruction in the candidate instruction set, identify the extracted entity corresponding to that candidate instruction, thereby recognizing from the user's speech the voice control instruction for that entity in the current page.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (8)

1. A method (1) of recognizing speech control instructions for entities in a current page, comprising:
extracting an entity (110) from a current page, wherein the entity comprises a real word or a serial number displayed by the page;
constructing a set of candidate instructions (120) based on the extracted entities and corresponding construction templates;
identifying voice control instructions (130) for entities in the current page from the voice spoken by the user based on a match of the voice spoken by the user with candidate instructions in the set of candidate instructions;
wherein the construction template is formed in advance as follows: extracting an entity from each voice control command in a historical voice control command set of a current user, and extracting language modes around the extracted entity to serve as a construction template corresponding to the extracted entity;
alternatively, the construction template is formed beforehand as follows: an entity is extracted from each voice control command in a set of historical voice control commands of all users, and language patterns around the extracted entity are extracted to be used as a construction template corresponding to the extracted entity.
2. The method of claim 1, wherein the step of extracting the entity (110) from the current page comprises:
segmenting words in a current page (1101);
judging the part of speech of the separated words (1102);
inputting each word of the separated words with specific part of speech into a classifier to determine whether the word is a word constituting an entity and whether the word constitutes the beginning, middle or end of the entity (1103), the classifier being trained in advance using a set of word samples of entities and non-entities;
and judging whether the word with the specific part of speech is an entity or not according to the judgment result of the classifier on each word in the separated words with the specific part of speech (1104).
3. The method of claim 1, wherein the step of constructing a set of candidate instructions (120) based on the extracted entities and corresponding construction templates comprises:
acquiring synonyms of the extracted entities based on the extracted entities (1201);
and respectively applying the extracted entity and the obtained synonym to the corresponding construction template of the extracted entity to respectively obtain corresponding candidate instructions, and putting the corresponding candidate instructions into a candidate instruction set (1202).
4. The method of claim 1, wherein the step of identifying speech control instructions (130) for entities in the current page from the user's spoken speech based on a match of the user's spoken speech with candidate instructions in the set of candidate instructions comprises:
and in response to the voice spoken by the user being matched with one candidate instruction in the candidate instruction set, identifying the extracted entity corresponding to the candidate instruction, so as to identify the voice control instruction aiming at the extracted entity in the current page from the voice spoken by the user.
5. An apparatus (2) for recognizing speech control instructions for entities in a current page, comprising:
the extraction unit (210) is configured to extract an entity from the current page, wherein the entity comprises a real word or a serial number displayed by the page;
a construction unit (220) configured to construct a set of candidate instructions based on the extracted entities and corresponding construction templates;
a recognition unit (230) configured to recognize a voice control instruction for an entity in the current page from the speech spoken by the user, based on matching the speech spoken by the user against the candidate instructions in the candidate instruction set;
Wherein the construction template is formed in advance as follows: extracting an entity from each voice control command in a historical voice control command set of a current user, and extracting language modes around the extracted entity to serve as a construction template corresponding to the extracted entity;
alternatively, the construction template is formed beforehand as follows: an entity is extracted from each voice control command in a set of historical voice control commands of all users, and language patterns around the extracted entity are extracted to be used as a construction template corresponding to the extracted entity.
6. The apparatus according to claim 5, wherein the extraction unit (210) is configured to:
segmenting words in the current page;
judging the part of speech of the divided words;
inputting each word of the separated words with specific part of speech into a classifier to judge whether the word is a word forming an entity and whether the word forms the beginning, the middle or the end of the entity, wherein the classifier is trained in advance by a set of word samples of the entity and non-entity;
and judging whether the word with the specific part of speech is an entity or not according to the judgment result of the classifier on each word in the separated words with the specific part of speech.
7. The apparatus according to claim 5, wherein the construction unit (220) is configured to:
acquiring synonyms of the extracted entities based on the extracted entities;
and respectively applying the extracted entity and the obtained synonym to the corresponding construction template of the extracted entity to respectively obtain corresponding candidate instructions, and putting the corresponding candidate instructions into a candidate instruction set.
8. Apparatus according to claim 5, wherein the recognition unit (230) is configured to recognize the extracted entity corresponding to a candidate instruction in response to the user uttering speech matching the candidate instruction of the set of candidate instructions, thereby to recognize from the user uttered speech the speech control instruction for the extracted entity in the current page.
CN201510031182.3A 2015-01-21 2015-01-21 Method and device for recognizing voice control instruction aiming at entity in current page Active CN105869640B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510031182.3A CN105869640B (en) 2015-01-21 2015-01-21 Method and device for recognizing voice control instruction aiming at entity in current page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510031182.3A CN105869640B (en) 2015-01-21 2015-01-21 Method and device for recognizing voice control instruction aiming at entity in current page

Publications (2)

Publication Number Publication Date
CN105869640A CN105869640A (en) 2016-08-17
CN105869640B (en) 2019-12-31

Family

ID=56623123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510031182.3A Active CN105869640B (en) 2015-01-21 2015-01-21 Method and device for recognizing voice control instruction aiming at entity in current page

Country Status (1)

Country Link
CN (1) CN105869640B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108074565A (en) * 2016-11-11 2018-05-25 上海诺悦智能科技有限公司 Phonetic order redirects the method and system performed with detailed instructions
CN109215644B (en) * 2017-07-07 2021-10-15 佛山市顺德区美的电热电器制造有限公司 Control method and device
CN107678309B (en) * 2017-09-01 2021-07-06 科大讯飞股份有限公司 Control sentence pattern generation and application control method and device and storage medium
CN107919129A (en) * 2017-11-15 2018-04-17 百度在线网络技术(北京)有限公司 Method and apparatus for controlling the page
CN108470566B (en) * 2018-03-08 2020-09-15 腾讯科技(深圳)有限公司 Application operation method and device
CN110176227B (en) * 2018-03-26 2023-07-14 腾讯科技(深圳)有限公司 Voice recognition method and related device
CN111742539B (en) 2018-08-07 2022-05-06 华为技术有限公司 Voice control command generation method and terminal
CN111383631B (en) * 2018-12-11 2024-01-23 阿里巴巴集团控股有限公司 Voice interaction method, device and system
CN110400576B (en) * 2019-07-29 2021-10-15 北京声智科技有限公司 Voice request processing method and device
CN110782897B (en) * 2019-11-18 2021-11-23 成都启英泰伦科技有限公司 Voice terminal communication method and system based on natural semantic coding
CN112331207A (en) * 2020-09-30 2021-02-05 音数汇元(上海)智能科技有限公司 Service content monitoring method and device, electronic equipment and storage medium
CN112509573A (en) * 2020-11-19 2021-03-16 北京蓦然认知科技有限公司 Voice recognition method and device
CN112668337B (en) * 2020-12-23 2022-08-19 广州橙行智动汽车科技有限公司 Voice instruction classification method and device
TWI805008B (en) * 2021-10-04 2023-06-11 中華電信股份有限公司 Customized intent evaluation system, method and computer-readable medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003296333A (en) * 2002-04-04 2003-10-17 Canon Inc Image display system, its control method and program for realizing the control method
KR101056511B1 (en) * 2008-05-28 2011-08-11 (주)파워보이스 Speech Segment Detection and Continuous Speech Recognition System in Noisy Environment Using Real-Time Call Command Recognition
CN101645064B (en) * 2008-12-16 2011-04-06 中国科学院声学研究所 Superficial natural spoken language understanding system and method thereof
CN101901235B (en) * 2009-05-27 2013-03-27 国际商业机器公司 Method and system for document processing
CN103455507B (en) * 2012-05-31 2017-03-29 国际商业机器公司 Search engine recommends method and device
CN103020098A (en) * 2012-07-11 2013-04-03 腾讯科技(深圳)有限公司 Navigation service searching method with speech recognition function
CN102833610B (en) * 2012-09-24 2015-05-13 北京多看科技有限公司 Program selection method, apparatus and digital television terminal
CN103219005B (en) * 2013-04-28 2016-01-20 北京云知声信息技术有限公司 A kind of audio recognition method and device
CN103678281B (en) * 2013-12-31 2016-10-19 北京百度网讯科技有限公司 The method and apparatus that text is carried out automatic marking

Also Published As

Publication number Publication date
CN105869640A (en) 2016-08-17


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20211117

Address after: 210034 floor 8, building D11, Hongfeng Science Park, Nanjing Economic and Technological Development Zone, Jiangsu Province

Patentee after: Mobvoi Innovation Technology Co., Ltd.

Patentee after: Volkswagen (China) Investment Co., Ltd

Address before: Room 307, Building 489 Songtao Road, Zhangjiang High-tech Park, Pudong New Area, Shanghai, 201203

Patentee before: SHANGHAI MOBVOI INFORMATION TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right