CN111627442A - Speech recognition method, processor, system, computer equipment and readable storage medium - Google Patents


Info

Publication number
CN111627442A
Authority
CN
China
Prior art keywords: action, voice signal, scene, user, determining
Prior art date
Legal status: Pending
Application number
CN202010462534.1A
Other languages
Chinese (zh)
Inventor
葛友杰
Current Assignee: Xingluo Intelligent Technology Co Ltd
Original Assignee
Xingluo Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Xingluo Intelligent Technology Co Ltd filed Critical Xingluo Intelligent Technology Co Ltd
Priority claimed from CN202010462534.1A
Publication of CN111627442A

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/08 — Speech classification or search
    • G10L 15/18 — Speech classification or search using natural language modelling
    • G10L 15/1815 — Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L 2015/223 — Execution procedure of a spoken command

Abstract

The invention provides a speech recognition method, a processor, a system, a computer device and a computer-readable storage medium. The method comprises the following steps: receiving and parsing a first voice signal input by a user, and determining the first action the first voice signal expects to be executed, or the first object that the first action operates on; acquiring a second voice signal input by the user within a set time before the first voice signal was received, and determining, from the second voice signal, the first scene in which the user is currently located; and determining, from the first voice signal and the first scene, both the first action expected to be executed in the first voice signal and the first object of the first action operation, and sending a control instruction that controls the first action to be executed on the first object. The method can recognize the user's speech even when the user's semantics are incomplete.

Description

Speech recognition method, processor, system, computer equipment and readable storage medium
Technical Field
The present invention relates to the field of speech recognition technology, and in particular, to a speech recognition method, processor, system, computer device, and readable storage medium.
Background
Existing speech recognition and analysis can only handle instructions whose intent is fully determined by the speech, such as "turn on the light in the living room"; instructions with an unclear intent are often difficult to recognize. For example, short, semantically ambiguous commands, such as a user saying only "open" or "brighten", cannot be processed. A semantic analysis system is therefore urgently needed to remedy these defects in the prior art.
Disclosure of Invention
The present invention aims to overcome the drawbacks of the prior art by providing a speech recognition method, processor, system, computer device and readable storage medium, so as to solve the prior-art problem that speech with an ambiguous intent is difficult to recognize.
To achieve this purpose, the following technical solution is adopted:
a first aspect of the present invention provides a speech recognition method, including:
receiving and analyzing a first voice signal input by a user, and determining a first action expected to be executed by the first voice signal or a first object of the first action operation;
acquiring a second voice signal input by a user within a set time before receiving the first voice signal, and determining a first scene where the user is currently located according to the second voice signal;
and determining a first action expected to be executed in the first voice signal and a first object of the first action operation according to the first voice signal and the first scene, and sending a control instruction, wherein the control instruction is used for controlling the first action to be executed on the first object.
In a specific embodiment, the obtaining a second voice signal input by a user within a preset time before receiving the first voice signal and determining, according to the second voice signal, a first scene where the user is currently located specifically includes:
acquiring a second voice signal input by a user within a set time before the first voice signal is received, wherein the second voice signal comprises a second action expected to be executed and a second object of the second action operation;
determining a second scene in which the user is located when inputting the second voice signal according to the second action and the second object;
determining the second scene as the first scene.
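As a hedged illustration of this embodiment, the second scene might be inferred from the parsed second action and second object via a lookup table; the table contents, scene labels and function name below are illustrative assumptions, not the patent's actual implementation.

```python
# Illustrative scene lookup: map a (second action, second object) pair to a
# scene label. The entries are invented examples.
SCENE_TABLE = {
    ("turn on", "light"): "smart home",
    ("play", "song"): "music playback",
}

def determine_first_scene(second_action, second_object):
    # The second scene is taken as the first scene, because the two signals
    # arrive within the set time window.
    second_scene = SCENE_TABLE.get((second_action, second_object), "unknown")
    return second_scene
```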
In a specific embodiment, the obtaining a second voice signal input by a user within a preset time before receiving the first voice signal and determining, according to the second voice signal, a first scene where the user is currently located specifically includes:
acquiring a second voice signal input by a user within a set time before the first voice signal is received, wherein the second voice signal only comprises a second action expected to be executed or only comprises a second object of a second action operation expected to be executed;
determining a second scene where the user is currently located according to the second action, or determining the second scene where the user is currently located according to the second object;
determining the second scene as the first scene.
In a specific embodiment, the determining, according to the first speech signal and the current first scene, a first action and a first object of the first action operation expected to be performed in the first speech signal specifically includes:
if the first voice signal only comprises the first action, acquiring an operation object of the first action in the first scene and a first probability of the operation object being executed from an established user scene data stack;
determining a priority of the operation object according to the first probability;
and executing the first action on the operation object with the highest priority.
In a specific embodiment, the determining, according to the first speech signal and the current first scene, a first action and a first object of the first action operation expected to be performed in the first speech signal specifically includes:
if the first voice signal only comprises the first object, acquiring an operation action matched with the first object in the first scene and a second probability of executing the operation action from the established user scene data stack;
determining the priority of the operation action matched with the first object according to the second probability;
performing the operation action with the highest priority on the first object.
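The object-only branch above can be sketched as follows. The layout of the user scene data stack, the probabilities, and the function name are illustrative assumptions made for the sketch, not the patent's implementation.

```python
# Hypothetical stack layout for the object-only case:
# (scene, object) -> {operation action: second probability}.
OBJECT_ACTION_STACK = {
    ("smart home", "light"): {"turn on": 0.7, "turn off": 0.2},
}

def resolve_action(first_scene, first_object):
    """Return the matching operation actions ranked by probability,
    highest priority first; the first entry is the one to execute."""
    candidates = OBJECT_ACTION_STACK.get((first_scene, first_object), {})
    return sorted(candidates, key=candidates.get, reverse=True)
```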
In a specific embodiment, the establishing a user context data stack specifically includes:
acquiring historical voice information input by a user, analyzing the historical voice information, and acquiring a third scene where the historical voice information input by the user is located, a third action expected to be executed by the historical voice information and a third object of the third action operation;
and storing the third scene, the third action, the third object and the correspondence among them to form the user scene data stack.
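One plausible way to build such a user scene data stack from parsed historical records is sketched below; the record format (scene, action, object triples) and the conversion of counts into execution probabilities are assumptions for illustration only.

```python
from collections import defaultdict

def build_scene_stack(history):
    """history: iterable of (third_scene, third_action, third_object)
    tuples parsed from the user's historical voice information."""
    counts = defaultdict(lambda: defaultdict(int))
    for scene, action, obj in history:
        counts[(scene, action)][obj] += 1
    # Convert raw counts to per-(scene, action) execution probabilities.
    stack = {}
    for key, objs in counts.items():
        total = sum(objs.values())
        stack[key] = {o: c / total for o, c in objs.items()}
    return stack
```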
A second aspect of the present invention provides a speech recognition processor, the processor comprising:
a receiving recognition unit for receiving and recognizing a first voice signal input by a user, the first voice signal including a first action or a first object of the first action operation expected to be performed;
a first scene determining unit, configured to acquire a second voice signal input by a user within a set time before receiving the first voice signal, and determine a first scene where the user is currently located according to the second voice signal;
a first action and first object determination unit for determining a first action to be performed in the first voice signal and a first object of the first action operation according to the first voice signal and the first scene, and sending a control instruction which controls the first action to be performed on the first object.
In a specific embodiment, the first scenario determination unit is specifically configured to:
acquiring a second voice signal input by a user within a set time before the first voice signal is received, wherein the second voice signal comprises a second action expected to be executed and a second object of the second action operation;
determining a second scene in which the user is located when inputting the second voice signal according to the second action and the second object;
determining the second scene as the first scene.
In a specific embodiment, the first scenario determination unit is specifically configured to:
acquiring a second voice signal input by a user within a set time before the first voice signal is received, wherein the second voice signal only comprises a second action expected to be executed or only comprises a second object of a second action operation expected to be executed;
determining a second scene where the user is currently located according to the second action, or determining the second scene where the user is currently located according to the second object;
determining the second scene as the first scene.
In a particular embodiment, the first action and first object determination unit is particularly configured to:
if the first voice signal only comprises the first action, acquiring an operation object of the first action in the first scene and a first probability of the operation object being executed from an established user scene data stack;
determining a priority of the operation object according to the first probability;
and executing the first action on the operation object with the highest priority.
In a particular embodiment, the first action and first object determination unit is particularly configured to:
if the first voice signal only comprises the first object, acquiring an operation action matched with the first object in the first scene and a second probability of executing the operation action from the established user scene data stack;
determining the priority of the operation action matched with the first object according to the second probability;
performing the operation action with the highest priority on the first object.
In a specific embodiment, the processor further comprises a user scene data stack establishing unit, configured for:
acquiring historical voice information input by a user, analyzing the historical voice information, and acquiring a third scene where the historical voice information input by the user is located, a third action expected to be executed by the historical voice information and a third object of the third action operation;
and storing the third scene, the third action, the third object and the correspondence among them to form the user scene data stack.
A third aspect of the present invention provides a speech recognition processing system comprising a sound pickup apparatus, an execution apparatus, and the aforementioned processor, wherein,
the pickup equipment is used for collecting a first voice signal input by a user and sending the first voice signal to the processor;
the execution device is used for receiving the control instruction and executing the first action on the first object.
A fourth aspect of the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a computer device, carries out the aforementioned method steps.
A fifth aspect of the invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the computer program to cause the computer device to perform the steps of the method.
The beneficial effects of the invention are as follows: the speech recognition method receives and parses a first voice signal input by a user, determines a first scene from a second voice signal input by the user within a set time before the first voice signal was received, determines, from the first voice signal and the first scene, the first action expected to be executed in the first voice signal and the first object it operates on, and executes the first action on the first object. With the method provided by the embodiments of the invention, even when the first voice signal input by the user is semantically incomplete, the first voice signal can be analyzed and the operation the user expects can be executed. This overcomes the prior-art defect that semantically incomplete voice input from a user cannot be acted upon.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are required to be used in the embodiments will be briefly described below, and it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope of the present invention.
Fig. 1 is a schematic flowchart of a speech signal recognition method according to a first embodiment of the present invention;
fig. 2 is a schematic structural diagram of a speech signal recognition processor according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a speech recognition system according to a third embodiment of the present invention.
Detailed Description
Hereinafter, various embodiments of the present invention are described more fully. The invention admits of various embodiments, modifications and variations. However, it should be understood that there is no intention to limit the various embodiments of the invention to the specific embodiments disclosed herein; on the contrary, the intention is to cover all modifications, equivalents and/or alternatives falling within the spirit and scope of the various embodiments of the invention.
Hereinafter, the terms "includes" or "may include" as used in various embodiments of the present invention indicate the presence of the disclosed functions, operations or elements, and do not preclude the addition of one or more further functions, operations or elements. Furthermore, as used in various embodiments of the present invention, the terms "comprises", "comprising", "includes", "including", "has", "having" and their derivatives indicate only that the specified features, numbers, steps, operations, elements, components or combinations thereof are present, and are not to be understood as excluding the existence, or the possible addition, of one or more other features, numbers, steps, operations, elements, components or combinations thereof.
In various embodiments of the invention, the expression "a or/and B" includes any or all combinations of the words listed simultaneously, e.g., may include a, may include B, or may include both a and B.
Expressions (such as "first", "second", and the like) used in various embodiments of the present invention may modify various constituent elements in various embodiments, but may not limit the respective constituent elements. For example, the above description does not limit the order and/or importance of the elements described. The foregoing description is for the purpose of distinguishing one element from another. For example, the first user device and the second user device indicate different user devices, although both are user devices. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of various embodiments of the present invention.
It should be noted that: in the present invention, unless otherwise explicitly stated or defined, the terms "mounted," "connected," "fixed," and the like are to be construed broadly, e.g., as being fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium; there may be communication between the interiors of the two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, it should be understood by those skilled in the art that the terms indicating an orientation or a positional relationship herein are based on the orientations and the positional relationships shown in the drawings and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the device or the element referred to must have a specific orientation, be constructed in a specific orientation and operate, and thus, should not be construed as limiting the present invention.
The terminology used in the various embodiments of the present invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the various embodiments of the present invention. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the various embodiments of the present invention belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
An embodiment of the present invention provides a speech recognition method, as shown in fig. 1, the method includes the following steps:
s1, receiving and analyzing a first voice signal input by a user, and determining a first action or a first object of the first action operation which is expected to be executed by the first voice signal.
Specifically, a first voice signal input by a user is received, the first voice signal is recognized, and a first action or a first object of the first action operation expected to be executed by the first voice signal is determined.
In a specific embodiment, the first speech signal is semantically incomplete, i.e. the first speech signal comprises only the first action expected to be performed or only the first object of the first action operation.
For example, take "turn on the light" as a semantically complete signal: the corresponding first voice signal may include only the first action expected to be performed ("turn on"), or only the first object of the first action operation ("the light").
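Step S1 might be sketched as a simple classifier that extracts whichever of the action and object is present. The vocabularies, substring matching, and function name below are illustrative assumptions, not the patent's recognition method.

```python
# Minimal sketch of step S1: classify an utterance as action-only,
# object-only, or semantically complete. Vocabularies are invented.
ACTIONS = ("turn on", "turn off", "brighten", "dim")
OBJECTS = ("light", "air conditioner", "curtain")

def parse_first_signal(text):
    """Return (first_action, first_object); either may be None when
    the first voice signal is semantically incomplete."""
    action = next((a for a in ACTIONS if a in text), None)
    obj = next((o for o in OBJECTS if o in text), None)
    return action, obj
```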
S2, acquiring a second voice signal input by the user within the set time before the first voice signal is received, and determining the first scene where the user is currently located according to the second voice signal.
After the first voice signal is received, a second voice signal input by the user within a set time before the first voice signal was received is acquired. The set time may be, for example, 3 minutes; that is, after receiving the first voice signal input by the user, the second voice signal input by the user within the 3 minutes before the first voice signal was received is examined. The second voice signal may be a semantically complete voice signal, i.e. it includes both the second action expected to be performed and the object of the second action operation.
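The look-back in this step can be sketched as follows; the history structure, function name, and the 3-minute default (taken from the example above) are illustrative assumptions.

```python
from datetime import datetime, timedelta

SET_TIME = timedelta(minutes=3)  # the description's example window

def find_second_signal(history, first_time, window=SET_TIME):
    """history: list of (timestamp, utterance) pairs, oldest first.
    Return the most recent utterance within the set time before the
    first voice signal, or None if there is none."""
    in_window = [u for t, u in history
                 if first_time - window <= t < first_time]
    return in_window[-1] if in_window else None
```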
Alternatively, the second voice signal may itself be semantically incomplete, i.e. an action-only voice signal or an object-only voice signal as described above.
When the second voice signal is a semantically complete voice signal, the scene in which the user input the second voice signal, the second action the second voice signal expects to be executed, and the second object of the second action operation can all be determined from the second voice signal.
When the second voice signal is an action-only voice signal, the final second scene, second action and second object are determined from the second voice signal.
When the second voice signal is an object-only voice signal, the scene finally determined from the second voice signal and the action finally executed are acquired, thereby determining the second scene, the second action and the second object.
After the second scene is determined, because the time difference between receiving the first voice signal and receiving the second voice signal is short, the first scene in which the user is currently located can be taken to be the same as the scene in which the user input the second voice signal; that is, the first scene and the second scene are the same scene.
S3, determining a first action expected to be executed in the first voice signal and a first object of the first action operation according to the first voice signal and the current first scene, and executing the first action on the first object.
Specifically, if the first voice signal is an action-only voice signal, the operation objects of the first action in the first scene are determined, the probability corresponding to each operation object is acquired, the operation objects are ranked by priority according to these probabilities, and the first action is executed on the operation objects in priority order.
For example, assume the first voice signal is an action-only voice signal whose first action is "turn on", and assume the second scene in which the user input the second voice signal is known to be a smart-home scene; the scene in which the user input the first voice signal is then also taken to be the smart-home scene. In the smart-home scene, all operation objects corresponding to the "turn on" action and their probabilities are acquired. Assuming the probability of turning on the light is 0.6 and the probability of turning on the air conditioner is 0.3, the operation objects are ranked in descending order of probability: the higher the probability, the higher the priority, so turning on the light has a higher priority than turning on the air conditioner. The turn-on action is then executed on the operation objects according to this priority.
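The worked example above can be sketched as code. The data-stack layout ((scene, action) mapped to object probabilities) and the function name are illustrative assumptions; the probabilities are the ones from the example.

```python
# Hypothetical user scene data stack for the action-only case:
# (scene, action) -> {operation object: first probability}.
SCENE_STACK = {
    ("smart home", "turn on"): {"light": 0.6, "air conditioner": 0.3},
}

def rank_objects(first_scene, first_action):
    """Rank candidate operation objects in descending probability:
    the higher the probability, the higher the priority."""
    candidates = SCENE_STACK.get((first_scene, first_action), {})
    return sorted(candidates, key=candidates.get, reverse=True)
```

With the example probabilities, the light outranks the air conditioner, so the turn-on action would be executed on the light first.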
The speech recognition method receives and parses a first voice signal input by a user, determines a first scene from a second voice signal input by the user within a set time before the first voice signal was received, determines, from the first voice signal and the first scene, the first action expected to be executed in the first voice signal and the first object it operates on, and executes the first action on the first object. With the method provided by this embodiment of the invention, even when the first voice signal input by the user is semantically incomplete, the first voice signal can be analyzed and the operation the user expects can be executed. This overcomes the prior-art defect that semantically incomplete voice input from a user cannot be acted upon.
Based on the first embodiment of the present invention, the second embodiment provides a speech recognition processor. As shown in fig. 2, the processor 1 includes a reception recognition unit 10, a first scene determination unit 11, and a first action and first object determination unit 12. The reception recognition unit 10 is configured to receive a first voice signal input by a user, the first voice signal including a first action expected to be executed or a first object of the first action operation. The first scene determination unit 11 is configured to acquire a second voice signal input by the user within a set time before the first voice signal was received, and to determine, from the second voice signal, the first scene in which the user is currently located. The first action and first object determination unit 12 is configured to determine, from the first voice signal and the first scene, the first action expected to be performed in the first voice signal and the first object of the first action operation, and to send a control instruction that controls the first action to be executed on the first object.
The first scene determining unit 11 is specifically configured to acquire a second voice signal input by a user within a set time before the first voice signal is received, where the second voice signal includes a second action expected to be performed and a second object of the second action operation, determine, according to the second action and the second object, a second scene where the user is located when the second voice signal is input, and determine, according to the second scene, the first scene.
The first scene determining unit 11 is specifically configured to acquire a second voice signal input by a user within a set time before the first voice signal is received, where the second voice signal only includes a second action expected to be performed or only includes a second object of a second action operation expected to be performed, determine a second scene where the user is currently located according to the second action operation, or determine the second scene where the user is currently located according to the second object, and determine the first scene according to the second scene.
The first action and first object determining unit 12 is specifically configured to, if only the first action is included in the first speech signal, obtain, from an established user scene data stack, the operation objects of the first action in the first scene and the first probability corresponding to each operation object, determine the priority of each operation object according to the first probability, and execute the first action on the operation object with the highest priority.
The first action and first object determining unit 12 is specifically configured to, if only the first object is included in the first speech signal, obtain, from an established user scene data stack, an operation action in the first scene that matches the first object and a second probability that the operation action is performed, determine, according to the second probability, a priority of the operation action that matches the first object, and perform, on the first object, the operation action with a highest priority.
The processor 1 further includes a user scene data stack establishing unit, which is configured to obtain historical voice information input by the user, analyze the historical voice information, obtain the third scene in which the user input the historical voice information, the third action the historical voice information expects to be executed, and the third object of the third action operation, and store the third scene, the third action, the third object and the correspondence among them to form the user scene data stack.
Based on the second embodiment of the present invention, the third embodiment of the present invention provides a voice recognition system, as shown in fig. 3, the voice recognition system 100 includes a sound pickup apparatus 2, an execution apparatus 3, and the aforementioned processor 1, where the sound pickup apparatus 2 is configured to collect a first voice signal input by a user and send the first voice signal to the processor 1, and the execution apparatus 3 is configured to receive the control instruction and execute the first action on the first object.
Based on the first embodiment of the present invention, a computer device is provided in the fourth embodiment of the present invention, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to make the computer device execute the steps of the foregoing method.
Based on the first embodiment of the present invention, a fifth embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a computer device, implements the foregoing method steps.
It will be understood by those skilled in the art that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The above-described embodiments merely illustrate several implementations of the present invention, and while they are described in detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make various further changes and modifications based on the above technical solutions and concepts, and all such changes and modifications fall within the protection scope of the present invention.

Claims (15)

1. A speech recognition method, comprising:
receiving and analyzing a first voice signal input by a user, and determining a first action expected to be executed by the first voice signal or a first object of the first action operation;
acquiring a second voice signal input by a user within a set time before receiving the first voice signal, and determining a first scene where the user is currently located according to the second voice signal;
and determining a first action expected to be executed in the first voice signal and a first object of the first action operation according to the first voice signal and the first scene, and sending a control instruction, wherein the control instruction is used for controlling the first action to be executed on the first object.
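As an illustration only (not part of the claims), the three steps of claim 1 can be sketched in Python. The toy parser, the vocabulary, the scene rule, and all names below are assumptions made for demonstration, not the patented implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ParsedSignal:
    action: Optional[str]  # first action expected to be executed, if recognized
    target: Optional[str]  # first object of the action operation, if recognized

def parse_voice_signal(text: str) -> ParsedSignal:
    """Toy stand-in for speech recognition: extract an action and/or an object."""
    actions = {"turn_on", "turn_off", "play"}
    objects = {"light", "tv", "music"}
    words = set(text.split())
    action = next((a for a in actions if a in words), None)
    target = next((o for o in objects if o in words), None)
    return ParsedSignal(action, target)

def infer_scene(second_signal: str) -> str:
    """Determine the first scene from a second signal received within the set time."""
    parsed = parse_voice_signal(second_signal)
    if parsed.target == "tv" or parsed.action == "play":  # illustrative rule
        return "living_room"
    return "unknown"

def handle_first_signal(first_signal: str, second_signal: str) -> dict:
    """Claim 1 pipeline: parse, determine scene, resolve, send a control instruction."""
    parsed = parse_voice_signal(first_signal)
    scene = infer_scene(second_signal)
    # In the full method the scene disambiguates a partial first signal;
    # here everything is simply packaged as a control instruction.
    return {"scene": scene, "action": parsed.action, "object": parsed.target}
```

For example, `handle_first_signal("turn_on light", "play music")` yields an instruction whose scene was inferred from the earlier second signal.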
2. The method according to claim 1, wherein the acquiring a second voice signal input by the user within a set time before receiving the first voice signal, and the determining a first scene in which the user is currently located according to the second voice signal specifically comprises:
acquiring a second voice signal input by a user within a set time before the first voice signal is received, wherein the second voice signal comprises a second action expected to be executed and a second object of the second action operation;
determining a second scene in which the user is located when inputting the second voice signal according to the second action and the second object;
determining the second scene as the first scene.
3. The method according to claim 1, wherein the acquiring a second voice signal input by the user within a set time before receiving the first voice signal, and the determining a first scene in which the user is currently located according to the second voice signal specifically comprises:
acquiring a second voice signal input by a user within a set time before the first voice signal is received, wherein the second voice signal only comprises a second action expected to be executed or only comprises a second object of a second action operation expected to be executed;
determining a second scene where the user is currently located according to the second action, or determining the second scene where the user is currently located according to the second object;
determining the second scene as the first scene.
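As an illustrative sketch only, claims 2 and 3 reduce to inferring a scene from whichever of the second action and second object the second voice signal contains. The scene mappings below are made-up assumptions, not part of the patent:

```python
# Illustrative mappings; the real method would learn or configure these.
ACTION_TO_SCENE = {"cook": "kitchen", "watch": "living_room"}
OBJECT_TO_SCENE = {"oven": "kitchen", "tv": "living_room"}

def determine_scene(second_action=None, second_object=None) -> str:
    # Claim 2: both the second action and the second object are present.
    if second_action and second_object:
        scene = ACTION_TO_SCENE.get(second_action) or OBJECT_TO_SCENE.get(second_object)
        return scene or "unknown"
    # Claim 3: only one of the two clues is present.
    if second_action:
        return ACTION_TO_SCENE.get(second_action, "unknown")
    if second_object:
        return OBJECT_TO_SCENE.get(second_object, "unknown")
    return "unknown"
```

The second scene returned here is then taken as the first scene, as both claims conclude.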
4. The method according to claim 2 or 3, wherein the determining, from the first speech signal and the current first scene, a first action expected to be performed in the first speech signal and a first object of the first action operation specifically comprises:
if the first voice signal only comprises the first action, acquiring an operation object of the first action in the first scene and a first probability of the operation object being executed from an established user scene data stack;
determining a priority of the operation object according to the first probability;
and executing the first action on the operation object with the highest priority.
5. The method according to claim 2 or 3, wherein the determining, from the first speech signal and the current first scene, a first action expected to be performed in the first speech signal and a first object of the first action operation specifically comprises:
if the first voice signal only comprises the first object, acquiring an operation action matched with the first object in the first scene and a second probability of executing the operation action from the established user scene data stack;
determining the priority of the operation action matched with the first object according to the second probability;
performing the operation action with the highest priority on the first object.
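As an illustration only, claims 4 and 5 are symmetric look-ups into the user scene data stack: given a scene plus the one element the first voice signal did contain, retrieve the candidates with their execution probabilities and pick the highest-priority one. The stack contents below are made-up example data, not taken from the patent:

```python
# Hypothetical user scene data stack: (scene, known element) -> {candidate: probability}.
SCENE_STACK = {
    ("living_room", "turn_on"): {"tv": 0.7, "light": 0.3},  # claim 4: action -> objects
    ("living_room", "tv"): {"turn_on": 0.6, "mute": 0.4},   # claim 5: object -> actions
}

def resolve_by_priority(scene: str, known: str) -> str:
    """Return the candidate with the highest probability (i.e. highest priority)."""
    candidates = SCENE_STACK.get((scene, known), {})
    if not candidates:
        raise KeyError(f"no entry for {known!r} in scene {scene!r}")
    return max(candidates, key=candidates.get)
```

So in this toy stack, a bare "turn on" in the living room resolves to the TV, and a bare "TV" resolves to turning it on.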
6. The method according to claim 4 or 5, wherein the establishing a user context data stack specifically comprises:
acquiring historical voice information input by a user, analyzing the historical voice information, and acquiring a third scene where the historical voice information input by the user is located, a third action expected to be executed by the historical voice information and a third object of the third action operation;
and storing the third scene, the third action, the third object, and the correspondence among the three to form the user scene data stack.
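As a hedged sketch of claim 6, the user scene data stack can be built by counting, over historical voice records, how often each third object was operated in each (third scene, third action) pair, then normalising the counts into probabilities. The record format and all names here are assumptions for illustration:

```python
from collections import defaultdict

def build_scene_stack(history):
    """history: iterable of (third_scene, third_action, third_object) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    for scene, action, obj in history:
        # Store the correspondence among scene, action, and object as a count.
        counts[(scene, action)][obj] += 1
    stack = {}
    for key, objs in counts.items():
        total = sum(objs.values())
        # Normalise counts into the probabilities used for priority ranking.
        stack[key] = {o: n / total for o, n in objs.items()}
    return stack
```

With two "turn on oven" records and one "turn on light" record in the kitchen, the stack would rank the oven above the light for a bare "turn on" in that scene.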
7. A speech recognition processor, the processor comprising:
the voice recognition device comprises a receiving and recognizing unit, a processing unit and a processing unit, wherein the receiving and recognizing unit is used for receiving and recognizing a first voice signal input by a user, and determining a first action expected to be executed by the first voice signal or a first object of the first action operation;
a first scene determining unit, configured to acquire a second voice signal input by a user within a set time before receiving the first voice signal, and determine a first scene where the user is currently located according to the second voice signal;
a first action and first object determination unit, configured to determine a first action expected to be performed in the first voice signal and a first object of the first action operation according to the first voice signal and the first scene, and to send a control instruction, wherein the control instruction is used for controlling the first action to be performed on the first object.
8. The processor of claim 7, wherein the first scenario determination unit is specifically configured to:
acquiring a second voice signal input by a user within a set time before the first voice signal is received, wherein the second voice signal comprises a second action expected to be executed and a second object of the second action operation;
determining a second scene in which the user is located when inputting the second voice signal according to the second action and the second object;
determining the second scene as the first scene.
9. The processor according to claim 7, wherein the first scene determining unit is specifically configured to:
acquiring a second voice signal input by a user within a set time before the first voice signal is received, wherein the second voice signal only comprises a second action expected to be executed or only comprises a second object of a second action operation expected to be executed;
determine a second scene where the user is currently located according to the second action, or determine the second scene where the user is currently located according to the second object;
determining the second scene as the first scene.
10. The processor according to claim 8 or 9, wherein the first action and first object determination unit is specifically configured to:
if the first voice signal only comprises the first action, acquiring an operation object of the first action in the first scene and a first probability of the operation object being executed from an established user scene data stack;
determining a priority of the operation object according to the first probability;
and execute the first action on the operation object with the highest priority.
11. The processor according to claim 8 or 9, wherein the first action and first object determination unit is specifically configured to:
if the first voice signal only comprises the first object, acquiring an operation action matched with the first object in the first scene and a second probability of executing the operation action from the established user scene data stack;
determining the priority of the operation action matched with the first object according to the second probability;
performing the operation action with the highest priority on the first object.
12. The processor according to claim 10 or 11, characterized in that the processor further comprises a user scene data stack establishing unit configured to:
acquiring historical voice information input by a user, analyzing the historical voice information, and acquiring a third scene where the historical voice information input by the user is located, a third action expected to be executed by the historical voice information and a third object of the third action operation;
store the third scene, the third action, the third object, and the correspondence among the three to form the user scene data stack.
13. A speech recognition processing system, comprising a sound pick-up device, an execution device, and a processor according to any one of claims 7 to 12, wherein
the pickup equipment is used for collecting a first voice signal input by a user and sending the first voice signal to the processor;
the execution device is used for receiving the control instruction and executing the first action on the first object.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a computer device, carries out the method steps of any one of the preceding claims 1 to 6.
15. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor executes the computer program to cause the computer device to perform the steps of the method of any of claims 1 to 6.
CN202010462534.1A 2020-05-27 2020-05-27 Speech recognition method, processor, system, computer equipment and readable storage medium Pending CN111627442A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010462534.1A CN111627442A (en) 2020-05-27 2020-05-27 Speech recognition method, processor, system, computer equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111627442A true CN111627442A (en) 2020-09-04

Family

ID=72272554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010462534.1A Pending CN111627442A (en) 2020-05-27 2020-05-27 Speech recognition method, processor, system, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111627442A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107533542A (en) * 2015-01-23 2018-01-02 微软技术许可有限责任公司 Method for understanding incomplete natural language querying
CN107908116A (en) * 2017-10-20 2018-04-13 深圳市艾特智能科技有限公司 Sound control method, intelligent domestic system, storage medium and computer equipment
CN108197213A (en) * 2017-12-28 2018-06-22 中兴通讯股份有限公司 Action performs method, apparatus, storage medium and electronic device
CN108306797A (en) * 2018-01-30 2018-07-20 百度在线网络技术(北京)有限公司 Sound control intelligent household device, method, system, terminal and storage medium
CN109690672A (en) * 2016-07-15 2019-04-26 搜诺思公司 Contextualization of voice inputs
CN110109596A (en) * 2019-05-08 2019-08-09 芋头科技(杭州)有限公司 Recommended method, device and the controller and medium of interactive mode
CN111128168A (en) * 2019-12-30 2020-05-08 斑马网络技术有限公司 Voice control method, device and storage medium
CN111177338A (en) * 2019-12-03 2020-05-19 北京博瑞彤芸科技股份有限公司 Context-based multi-turn dialogue method

Similar Documents

Publication Publication Date Title
WO2020014899A1 (en) Voice control method, central control device, and storage medium
US10643605B2 (en) Automatic multi-performance evaluation system for hybrid speech recognition
US11823658B2 (en) Trial-based calibration for audio-based identification, recognition, and detection system
US11869487B1 (en) Allocation of local and remote resources for speech processing
KR101828273B1 (en) Apparatus and method for voice command recognition based on combination of dialog models
CN108470568B (en) Intelligent device control method and device, storage medium and electronic device
CN110782885B (en) Voice text correction method and device, computer equipment and computer storage medium
WO2020006878A1 (en) Voice recognition test method and apparatus, computer device and storage medium
CN112446218A (en) Long and short sentence text semantic matching method and device, computer equipment and storage medium
CN110570867A (en) Voice processing method and system for locally added corpus
CN113571096B (en) Speech emotion classification model training method and device, computer equipment and medium
CN111627442A (en) Speech recognition method, processor, system, computer equipment and readable storage medium
CN111028841B (en) Method and device for awakening system to adjust parameters, computer equipment and storage medium
CN115083412B (en) Voice interaction method and related device, electronic equipment and storage medium
CN111883109B (en) Voice information processing and verification model training method, device, equipment and medium
CN109410928B (en) Denoising method and chip based on voice recognition
CN111652083B (en) Weak supervision time sequence action detection method and system based on self-adaptive sampling
WO2022266825A1 (en) Speech processing method and apparatus, and system
CN111599377A (en) Equipment state detection method and system based on audio recognition and mobile terminal
KR20210130465A (en) Dialogue system and method for controlling the same
CN114203178B (en) Intelligent voice system rejection method and device and computer equipment
CN114783419B (en) Text recognition method and device combined with priori knowledge and computer equipment
CN117301074B (en) Control method and chip of intelligent robot
CN113095110B (en) Method, device, medium and electronic equipment for dynamically warehousing face data
US20230300578A1 (en) V2x communication method and apparatus using human language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200904