CN110718225A

CN110718225A - Voice control method, terminal and storage medium

Info

Publication number: CN110718225A
Application number: CN201911177576.4A
Authority: CN
Inventors: 同超
Original assignee: Shenzhen Konka Electronic Technology Co Ltd
Current assignee: Shenzhen Konka Electronic Technology Co Ltd
Priority date: 2019-11-25
Filing date: 2019-11-25
Publication date: 2020-01-21

Abstract

The invention discloses a voice control method, a terminal and a storage medium, wherein the voice control method comprises the following steps: acquiring a voice instruction and an image in a preset range in front of a terminal, and determining whether the voice instruction is an effective instruction according to the characteristics of the voice instruction and the image; and when the voice command is determined to be an effective command, controlling the terminal according to the voice command. The voice control method determines whether the voice instruction sent by the user is an effective instruction or not through the voice instruction sent by the user and the image of the user, corresponding voice control is carried out on the terminal only when the voice instruction sent by the user is determined to be the effective instruction, the user does not need to execute complex operation, the convenience of using the voice control by the user is improved, and meanwhile the false response triggering probability is reduced.

Description

Voice control method, terminal and storage medium

Technical Field

The present invention relates to the field of voice control technologies, and in particular, to a voice control method, a terminal, and a storage medium.

Background

With the development of voice recognition technology, voice has been widely used for controlling various terminals, and in the process of performing voice control, how to determine whether the voice uttered by the user is used for controlling the terminal becomes a difficult problem to be solved.

In the prior art, a voice collected after a certain operation is performed by a user, for example, the user presses a button or the distance from the user to the terminal is within a preset distance, is determined as a voice for controlling the terminal. However, this voice control method requires the user to perform corresponding operations, and the operations such as pressing a button and approaching the terminal cause a user to have trouble in use, and the user preferably performs voice control without performing complicated operations. Meanwhile, after the user presses the button carelessly or the distance close to the terminal is a preset distance, all voices sent by the user can be received by the terminal and judged as the control instruction, and the triggering probability of false response is high.

Thus, there is a need for improvements and enhancements in the art.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a voice control method, a terminal and a storage medium for solving the above-mentioned drawbacks of the prior art, and to solve the problem that the voice control method in the prior art is inconvenient for a user to operate.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows:

a method of voice control, wherein the method comprises:

acquiring a voice instruction and an image in a preset range in front of a terminal, and determining whether the voice instruction is an effective instruction according to the characteristics of the voice instruction and the image;

and when the voice command is determined to be an effective command, controlling the terminal according to the voice command.

The voice control method, wherein the determining whether the voice command is an effective command according to the feature of the voice command and the image specifically includes:

acquiring a first face feature in the image;

determining whether the user is a registered user or not according to a first voiceprint feature and/or the first face feature corresponding to the voice instruction;

when the user is a registered user, determining the face orientation of the user according to the first facial features;

and when the face faces to the direction of the terminal, converting the voice instruction into a text, and determining whether the voice instruction is an effective instruction according to the text.

The voice control method, wherein the determining whether the user is a registered user according to the first voiceprint feature and/or the first face feature corresponding to the voice instruction specifically includes:

determining whether the first voiceprint feature matches a second prestored voiceprint feature;

and/or determining whether the first facial features are matched with second facial features stored in advance.

The voice control method, wherein the determining whether the voice command is an effective command according to the text specifically includes:

and when the voice control system of the terminal is in a sleep state, determining whether the text is consistent with a pre-stored awakening word.

and when the terminal is in an awakening state, determining whether the voice instruction is an effective instruction according to a pre-trained instruction tag model.

The voice control method, wherein the determining whether the voice command is an effective command according to a pre-trained command classification model specifically includes:

inputting the text into the instruction label model, and determining whether the instruction label model outputs an instruction label corresponding to the text;

the instruction label model is trained according to a data set with a plurality of texts, the data set is provided with a plurality of groups of training samples, and each group of training samples comprises texts and instruction labels corresponding to the texts.

The voice control method, wherein the controlling the terminal according to the voice instruction specifically includes:

and when the terminal is in the sleep state, controlling the terminal to be switched from the sleep state to the awakening state.

The voice control method, wherein the controlling the terminal according to the effective instruction specifically includes:

and when the terminal is in an awakening state, controlling the terminal to execute the voice instruction according to the instruction tag corresponding to the text.

A terminal, wherein the terminal comprises: the voice control device comprises a processor and a storage medium which is in communication connection with the processor, wherein the storage medium is suitable for storing a plurality of instructions, and the processor is suitable for calling the instructions in the storage medium to execute the steps of realizing the voice control method.

A storage medium, wherein the storage medium stores one or more programs, which are executable by one or more processors to implement the steps of the voice control method of any one of the above.

Has the advantages that: compared with the prior art, the voice control method, the terminal and the storage medium are provided, whether the voice instruction sent by the user is an effective instruction is determined through the voice instruction sent by the user and the image of the user, the terminal is subjected to corresponding voice control only when the voice instruction sent by the user is determined to be the effective instruction, the user does not need to execute complex operation, the convenience of using voice control by the user is improved, and meanwhile the false response triggering probability is reduced.

Drawings

FIG. 1 is a flowchart illustrating a first embodiment of a voice control method according to the present invention;

FIG. 2 is a flowchart illustrating sub-steps S100 of an embodiment of a voice control method provided in the present invention;

fig. 3 is a schematic structural diagram of a preferred embodiment of the terminal provided in the present invention.

Detailed Description

In order to make the objects, technical solutions and effects of the present invention clearer and clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Referring to fig. 1, fig. 1 is a flowchart illustrating a voice control method according to a preferred embodiment of the present invention. The method comprises the following steps:

s100, acquiring a voice instruction and an image in a preset range in front of a terminal, and determining whether the voice instruction is an effective instruction or not according to the voice instruction and the image information.

The voice instruction is a voice instruction issued by a user, and the voice instruction can be issued when the user wants to control the terminal by voice. The voice instruction can be obtained through a preset microphone, and the microphone can be installed on the terminal or can be arranged independently.

When a user controls a terminal, the user is often in front of the terminal, and therefore, in this embodiment, the image in a preset range in front of the terminal is obtained, the preset range may be specifically set according to an actual situation, for example, when the terminal is a television, the preset range may be set according to a size and an applicable space of the television, for example, the television is a large-sized television, and then the television can be applicable to a larger living room, and then the preset range may be set larger, conversely, if the television is smaller in size, then the range in which the content displayed by the television can be viewed may be smaller, and then the preset range may be set smaller, and a person skilled in the art may specifically set the preset range according to an actual situation.

The image can be acquired by a preset camera, and the camera can be installed on the terminal or can be independently set.

The microphone and the camera can be in a working state all the time, or the microphone can be in a working state all the time, and after the microphone acquires the voice command, the camera is controlled to start to collect images, so that energy consumption is reduced.

After the voice instruction and the image are obtained, determining whether the voice instruction is an effective instruction according to the feature of the voice instruction and the image, as shown in fig. 2, specifically including:

and S110, acquiring a first face feature in the image.

After the image is acquired, face recognition can be performed on the image to acquire a first face feature in the image.

And S120, determining whether the user is a registered user according to the first voiceprint feature and/or the first face feature corresponding to the voice instruction.

In this embodiment, in order to prevent a security risk brought by other people who control the terminal arbitrarily, it is set that only a registered user can control the terminal, and specifically, it is determined whether the user is a registered user according to the first voiceprint and/or the first facial feature corresponding to the voice instruction.

Specifically, the determining whether the user is a registered user according to the first voiceprint feature and/or the first facial feature corresponding to the voice instruction includes:

s121, determining whether the first voiceprint feature is matched with a second voiceprint feature stored in advance; and/or the presence of a gas in the gas,

and S122, determining whether the first face features are matched with second face features stored in advance.

The voiceprint is a sound wave frequency spectrum which is displayed by an electro-acoustic instrument and carries speech information, and the voiceprint not only has specificity, but also has the characteristic of relative stability, namely, the voiceprint characteristics of different voices emitted by the same person are consistent. In this embodiment, after the voice instruction is obtained, the voice instruction may be analyzed to obtain the first voiceprint feature corresponding to the voice instruction, and whether the user is a registered user is determined according to whether the first voiceprint feature is consistent with a voiceprint feature of the registered user. Specifically, the terminal may collect and store a second voiceprint feature of a registered user in advance, match the first voiceprint feature with the second voiceprint feature after the terminal acquires the first voiceprint feature, and determine that the user is a registered user if the first voiceprint feature can be matched with the second voiceprint feature. Of course, it may be understood that one terminal may have more than one registered user, that is, there may be a plurality of second voiceprint features, and after the first voiceprint feature is obtained, as long as the first voiceprint feature can be matched with one of the second voiceprint features, the user is determined to be a registered user.

The first facial feature in the image may also be used to determine whether the user is a registered user, and specifically, in this embodiment, after the image is obtained, the image is analyzed to extract the first facial feature in the image, and it is easy to see that the first facial feature is a facial feature of the user who is using the terminal. Determining whether the user is a registered user by judging whether the first facial features are consistent with the facial features of the registered user. Specifically, the terminal may collect and store a second face feature of a registered user in advance, match the first face feature with the second face feature after the terminal acquires the first face feature, and determine that the user is a registered user if the first face feature and the second face feature can be matched. Similarly, there may be a plurality of second facial features, and as described above for the second voiceprint feature, details are not repeated here.

It should be noted that, in the step S111, determining whether the first voiceprint feature matches with the second prestored voiceprint feature and the step S112, determining whether the first face feature matches with the second prestored face feature may be performed simultaneously, or may be performed separately, that is, a relationship between the two is and/or. When the first voiceprint feature and the first face feature are executed simultaneously, namely, whether the user is a registered user is determined simultaneously according to the first voiceprint feature and the first face feature, the identification accuracy of the registered user can be improved.

If the user is determined to be the registered user, the next operation can be executed, and if the user is determined not to be the registered user, the current process is ended, and the next voice command and image are waited to be obtained.

Referring to fig. 2 again, in the present embodiment, after determining whether the user is a registered user, the method includes:

and S130, when the user is a registered user, determining the face orientation of the user according to the first face features.

In the embodiment, in order to facilitate the user to use the voice control terminal, whether the voice command sent by the user is sent to the terminal is determined by the face orientation of the user. Specifically, the determining of the face orientation of the user according to the first face feature may be determined by judging whether the first face feature is a front face, or may be determined by extracting an eye part in the first face feature and determining whether the implementation is looking at the terminal. Of course, the present invention is not limited to the above-mentioned exemplary methods, and those skilled in the art may select different methods for determining the face orientation according to actual situations.

S140, when the face faces to the direction of the terminal, the voice instruction is converted into a text, and whether the voice instruction is an effective instruction or not is determined according to the text.

Specifically, in the prior art, a plurality of terminals are provided with a plurality of turns of conversation functions, once the plurality of turns of conversation functions are started, all voices sent by a user are collected and judged as instructions, at this time, the voices sent by the user and not directed at the terminal cause wrong responses of the terminal, the triggering rate of the wrong responses is high, the user cannot perform other language behaviors during the use of the terminal, and experience is poor. In this embodiment, if the face is oriented toward the terminal, it is determined that the voice instruction sent by the user is sent by the terminal, at this time, the voice instruction is converted into a text, and the semantic instruction is converted into the text by using an existing voice recognition technology, such as an asr (automatic speech recognition) technology, and the like, which is not described herein any more, after the text is obtained, it is determined whether the voice instruction is an effective instruction according to the text, the terminal performs a corresponding operation when the voice instruction is an effective instruction, and the user can perform voice control on the terminal without a complicated operation, and meanwhile, triggering false response is also effectively avoided. And if the face direction is not the direction towards the terminal, ending the current process and waiting for acquiring a new voice command and an image.

Specifically, the determining whether the voice instruction is a valid instruction according to the text includes:

when the terminal is in a sleep state, determining whether the text is consistent with a pre-stored awakening word;

Specifically, in order to save energy consumption, most terminals are provided with a sleep mode, and when the terminals are not operated for a period of time, the terminals enter a sleep state, where the sleep state refers to a state in which the terminals are in a standby state with low energy consumption, at this time, most functions of the terminals are turned off, only some systems such as a voice receiving system and the like are in operation, and are ready to receive a voice instruction, at this time, the terminals need to be awakened by a specific awakening word, and when the terminals are in the awakening state, corresponding operations can be executed according to the voice instruction.

In this embodiment, when the terminal is in a sleep state, only the determination of whether the voice instruction is the wake-up instruction is performed, at this time, the text is matched with a pre-stored wake-up word, if the text is consistent with the wake-up word, the voice instruction is an effective instruction, a wake-up operation can be performed on the terminal, and if the text is inconsistent with the wake-up word, the terminal is maintained in the sleep state. It can be seen that, with the voice control method provided in this embodiment, only when a registered user looks at the terminal and speaks a preset wake-up word, the terminal can be woken up, so that the terminal is prevented from being woken up by mistake when the user speaks the wake-up word when the user does not wake up the terminal by accident.

And when the terminal is in an awakening state, determining whether the voice instruction is an effective instruction according to a pre-trained instruction label model. Specifically, the instruction label model is trained according to a data set with a plurality of texts, the data set has a plurality of groups of training samples, and each group of training samples includes a text and an instruction label corresponding to the text. The instruction tag corresponds to a specific category of the voice instruction corresponding to the text, for example: the terminal comprises a local instruction (increasing volume, reducing brightness and the like) for local operation of the terminal, a video type instruction (i want to watch XX movie, open XX TV play and the like) for the terminal to operate through cloud service, a music type (i want to listen to XXX song, search XX album) and the like. That is, the instruction tag model actually has a function of classifying the voice instruction, and the instruction tag model can be realized by a Natural Language Processing (NLP) technique.

Specifically, the determining whether the voice command is a valid command according to a pre-trained command label model includes: and inputting the text into the instruction label model, and determining whether the instruction label model outputs an instruction label corresponding to the text.

After the text corresponding to the voice command is acquired, the text is input into the command classification model, and as can be seen from the foregoing, the command label model has a function of classifying the text, and when the command label model can output the command label corresponding to the text, it indicates that the voice command can be recognized as a specific category, and the terminal can execute a corresponding operation according to the voice command. And when the instruction tag model cannot output the instruction tag corresponding to the text, the voice instruction cannot be identified as a specific category, the terminal cannot execute corresponding operation according to the voice instruction, and the voice instruction is an invalid instruction.

And when the voice command is an invalid command, the terminal does not respond to the voice command, ends the current flow and waits for acquiring a new voice command and a new head portrait.

Referring to fig. 1 again, the voice control method further includes:

and S200, controlling the terminal according to the voice command when the voice command is determined to be an effective command.

When the terminal is in a sleep state, the controlling the terminal according to the voice instruction specifically includes: and controlling the terminal to be switched from a sleep state to an awakening state.

As already explained above, when the terminal is in the sleep state and the text of the voice command is consistent with the wakeup word, it is determined that the voice command is an effective command, and at this time, the terminal is woken up to be switched from the sleep state to the wakeup state.

When the terminal is in the wake-up state, the controlling the terminal according to the voice instruction specifically includes: and controlling the terminal to execute the voice instruction according to the instruction label corresponding to the text.

Specifically, when the voice control system of the terminal is in an awake state, if the instruction tag model can output the instruction tag corresponding to the text according to the text corresponding to the voice instruction, it is determined that the voice instruction is an effective instruction. And after the instruction tag corresponding to the text is obtained, controlling the terminal to execute the voice instruction according to the instruction tag. For example, when the voice command is: and when the instruction label is 'video type', controlling the terminal to search the XX in the video type resource.

In summary, the present embodiment provides a voice control method, which determines whether a user is a registered user through voiceprint matching and/or face feature matching, determines whether a face of the user is towards a terminal through face features, determines whether a received voice instruction is an effective instruction through steps of determining an instruction tag through a natural language processing model, and the like, controls the terminal to execute a corresponding action only when the received voice instruction is the effective instruction, and the user only needs to look at the terminal to speak the voice instruction without other operations, thereby not only facilitating the user to use a voice control system, but also reducing a false response rate of the terminal to voice.

It should be understood that, although the steps in the flowcharts shown in the figures of the present specification are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least a portion of the steps in the flowchart may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of performing the sub-steps or stages is not necessarily sequential, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or stages of other steps.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).

Example two

Based on the above embodiments, the present invention further provides a terminal, and a schematic block diagram thereof may be as shown in fig. 3. The terminal comprises a processor, a memory, a network interface, a display screen and a temperature sensor which are connected through a system bus. Wherein the processor of the terminal is configured to provide computing and control capabilities. The memory of the terminal comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the terminal is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a speech control method. The display screen of the terminal can be a liquid crystal display screen or an electronic ink display screen, and the temperature sensor of the terminal is arranged in the terminal in advance and used for detecting the current operating temperature of internal equipment.

It will be understood by those skilled in the art that the block diagram shown in fig. 3 is a block diagram of only a portion of the structure associated with the inventive arrangements and is not intended to limit the terminals to which the inventive arrangements may be applied, and that a particular terminal may include more or less components than those shown, or may have some components combined, or may have a different arrangement of components.

In one embodiment, a terminal is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor when executing the computer program implementing at least the following steps:

Wherein, the determining whether the voice instruction is an effective instruction according to the feature of the voice instruction and the image specifically includes:

acquiring a first face feature in the image;

Wherein the determining whether the user is a registered user according to the first voiceprint feature and/or the first facial feature corresponding to the voice instruction specifically includes:

Wherein the determining whether the voice instruction is an effective instruction according to the text specifically includes:

and when the terminal is in a sleep state, determining whether the text is consistent with a pre-stored awakening word.

Wherein the determining whether the voice instruction is an effective instruction according to the pre-trained instruction classification model specifically comprises:

Wherein the controlling the terminal according to the voice instruction specifically includes:

EXAMPLE III

The present invention also provides a storage medium storing one or more programs executable by one or more processors to implement the steps of the voice control method described in the above embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A voice control method, characterized in that the voice control method comprises:

2. The voice control method according to claim 1, wherein the determining whether the voice command is a valid command according to the feature of the voice command and the image specifically comprises:

acquiring a first face feature in the image;

3. The voice control method according to claim 2, wherein the determining whether the user is a registered user according to the first voiceprint feature and/or the first facial feature corresponding to the voice instruction specifically includes:

4. The voice control method according to claim 2, wherein the determining whether the voice command is a valid command according to the text specifically includes:

5. The voice control method according to claim 2, wherein the determining whether the voice command is a valid command according to the text specifically includes:

6. The method according to claim 5, wherein the determining whether the voice command is a valid command according to a pre-trained command classification model specifically comprises:

7. The voice control method according to claim 4, wherein the controlling the terminal according to the voice instruction specifically comprises:

8. The voice control method according to claim 6, wherein the controlling the terminal according to the voice instruction specifically comprises:

9. A terminal, characterized in that the terminal comprises: a processor, a storage medium communicatively coupled to the processor, the storage medium adapted to store a plurality of instructions, the processor adapted to invoke the instructions in the storage medium to perform the steps of implementing the voice control method of any of claims 1-8.

10. A storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the voice control method of any one of claims 1-8.