CN114550719A

CN114550719A - Method and device for recognizing voice control instruction and storage medium

Info

Publication number: CN114550719A
Application number: CN202210158548.3A
Authority: CN
Inventors: 王伟龙
Original assignee: Qingdao Haier Technology Co Ltd; Haier Smart Home Co Ltd
Current assignee: Qingdao Haier Technology Co Ltd; Haier Smart Home Co Ltd
Priority date: 2022-02-21
Filing date: 2022-02-21
Publication date: 2022-05-27

Abstract

The invention discloses a method and a device for recognizing a voice control instruction, a storage medium and an electronic device. The method comprises the following steps: acquiring a voice control instruction received by a target device; in response to the voice control instruction, recognizing the voice control instruction separately by a target server corresponding to the target device and by the target device; determining, according to the recognition speed of the target server for the voice control instruction, a target recognition result corresponding to the voice control instruction from a first recognition result of the target server and a second recognition result of the target device; and controlling the target device according to the target recognition result. This technical scheme solves the problem in the related art that recognition of voice control instructions depends heavily on the network.

Description

Method and device for recognizing voice control instruction and storage medium

Technical Field

The invention relates to the field of computers, in particular to a method and a device for recognizing a voice control instruction, a storage medium and an electronic device.

Background

With the development of science and technology, intelligent devices can be controlled in more and more ways, and voice control instruction recognition plays a very important role in device control. The prior art generally uses online recognition: a recognition module is deployed on a server corresponding to the device, and the data related to the voice control instruction and the recognition result are transmitted over a network.

For problems in the related art such as the heavy dependence of voice control instruction recognition on the network, no effective solution has yet been proposed.

Disclosure of Invention

Embodiments of the invention provide a method and a device for recognizing a voice control instruction, a storage medium and an electronic device, so as to at least solve the problem in the related art that recognition of voice control instructions depends heavily on the network.

According to an embodiment of the present invention, there is provided a method for recognizing a voice control instruction, including: acquiring a voice control instruction received by a target device;

in response to the voice control instruction, recognizing the voice control instruction separately by a target server corresponding to the target device and by the target device;

determining, according to the recognition speed of the target server for the voice control instruction, a target recognition result corresponding to the voice control instruction from a first recognition result of the target server and a second recognition result of the target device;

and controlling the target device according to the target recognition result.

Optionally, the determining, according to the recognition speed of the target server for the voice control instruction, a target recognition result corresponding to the voice control instruction from the first recognition result of the target server and the second recognition result of the target device includes:

detecting the recognition speed of the target server for the voice control instruction;

determining the first recognition result as the target recognition result if the recognition speed is greater than or equal to a target speed;

determining the second recognition result as the target recognition result in a case where the recognition speed is less than the target speed.

Optionally, the detecting of the recognition speed of the target server for the voice control instruction includes:

detecting whether the first recognition result is received within a target time after the voice control instruction is sent to the target server for recognition;

determining that the recognition speed is greater than or equal to a target speed in the case of detecting that the first recognition result is received within the target time;

determining that the recognition speed is less than the target speed if it is detected that the first recognition result is not received within the target time.
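The following Python sketch illustrates the target-time check described above. It is a minimal illustration and not the patented implementation. It makes these assumptions: `send_to_server` is a hypothetical callable standing in for the network round trip to the target server, the target time is given in seconds, and a server or network error counts the same as a timeout, because in both cases no first recognition result is received within the target time.

```python
import concurrent.futures
import time
from typing import Callable, Optional, Tuple

# Worker pool for server requests; the size is arbitrary for this sketch.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def detect_recognition_speed(
    send_to_server: Callable[[bytes], str],
    instruction: bytes,
    target_time: float,
) -> Tuple[bool, Optional[str]]:
    """Return (fast_enough, first_result).

    fast_enough is True when the server's first recognition result arrives
    within target_time seconds, i.e. recognition speed >= target speed.
    """
    future = _POOL.submit(send_to_server, instruction)
    try:
        return True, future.result(timeout=target_time)
    except Exception:
        # Timeout, or the server/network raised an error: no first result
        # arrived within target_time, so the speed counts as too slow.
        return False, None


# Hypothetical stand-ins for the target server's first recognition model.
def fast_server(instruction: bytes) -> str:
    return "query_weather"


def slow_server(instruction: bytes) -> str:
    time.sleep(0.5)  # simulates network congestion
    return "query_weather"


if __name__ == "__main__":
    print(detect_recognition_speed(fast_server, b"audio", 1.0))   # (True, 'query_weather')
    print(detect_recognition_speed(slow_server, b"audio", 0.05))  # (False, None)
```

A `True` flag means the first recognition result becomes the target recognition result. A `False` flag means the device falls back to its own second recognition result. The late server reply still completes in its worker thread, but it is ignored.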

Optionally, the recognizing of the voice control instruction separately and simultaneously by the target server corresponding to the target device and by the target device includes:

sending the voice control instruction to the target server, wherein the target server is used for recognizing the voice control instruction by using a first recognition model;

and recognizing the voice control instruction by using a second recognition model deployed on the target device.

Optionally, the recognizing the voice control instruction by using a second recognition model deployed on the target device includes:

inputting the voice control instruction into the second recognition model to obtain a target control operation output by the second recognition model, wherein the target control operation is used for indicating the control intention of the voice control instruction on the target device;

matching the target control operation with a target operation set stored on the target device, wherein the target operation set is used for recording device control operations which are allowed to be executed by the target device under the condition of no network connection;

determining the target control operation as the second recognition result if it is determined that the target control operation is included in the target operation set;

and in the case that the target control operation is determined not to be included in the target operation set, determining a target prompt operation as the second recognition result, wherein the target prompt operation is used for prompting that the target device does not allow the voice control instruction to be executed without a network connection.

Optionally, before the inputting the voice control instruction into the second recognition model, the method further includes:

acquiring a voice control instruction sample labeled with a device control operation;

and training an initial recognition model by using the voice control instruction sample labeled with the device control operation to obtain the second recognition model.

Optionally, the training an initial recognition model by using the voice control instruction sample labeled with the device control operation to obtain the second recognition model includes:

inputting the voice control instruction sample into the initial recognition model to obtain control operation data output by the initial recognition model;

substituting the control operation data and the device control operation into a loss function corresponding to the initial recognition model to obtain a loss value;

and adjusting the model parameters of the initial recognition model according to the loss value until the loss value converges, to obtain the second recognition model.

According to another embodiment of the present invention, there is also provided an apparatus for recognizing a voice control instruction, including: acquiring a voice control instruction received by a target device;

According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to execute the above recognition method of the voice control instruction when running.

According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the method for recognizing the voice control instruction through the computer program.

In the embodiments of the invention, a voice control instruction received by a target device is acquired. In response to the voice control instruction, the instruction is recognized both by a target server corresponding to the target device and by the target device. A target recognition result corresponding to the voice control instruction is determined from the first recognition result of the target server and the second recognition result of the target device according to the recognition speed of the target server for the instruction. The target device is then controlled according to the target recognition result. In other words, after the target device receives the voice control instruction, the target server recognizes it and outputs a first recognition result, while the target device recognizes it and outputs a second recognition result. The target recognition result is then chosen according to how fast the target server recognizes the instruction, and the target device is controlled with that result. Consequently, the user's voice control instruction can still be recognized and responded to in time even if the network fluctuates or the server fails. This technical scheme solves problems in the related art such as the heavy dependence of voice control instruction recognition on the network, and achieves the technical effect of reducing that dependence.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a block diagram of the hardware structure of a computer terminal for a method for recognizing a voice control instruction according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for recognizing a voice control instruction according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a first recognition model recognizing a voice control instruction according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a second recognition model recognizing a voice control instruction according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a target operation set of a second recognition model according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a target prompt operation of a second recognition model according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a training process of a second recognition model according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of recognition speed according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of a recognition process of a voice control instruction according to an embodiment of the present invention;

FIG. 10 is a block diagram of a device for recognizing a voice control instruction according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The method provided by the embodiments of the invention can be executed in a computer terminal or a similar computing device. Taking execution on a computer terminal as an example, FIG. 1 is a block diagram of the hardware structure of a computer terminal for the method for recognizing a voice control instruction according to an embodiment of the present invention. As shown in FIG. 1, the computer terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microcontroller unit (MCU) or a field-programmable gate array (FPGA)) and a memory 104 for storing data. In an exemplary embodiment, it may further include a transmission device 106 for communication and an input/output device 108. Those skilled in the art will understand that the structure shown in FIG. 1 is merely illustrative and does not limit the structure of the computer terminal. For example, the computer terminal may include more or fewer components than shown in FIG. 1, or have a configuration different from that shown in FIG. 1 with equivalent or greater functionality.

The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as a computer program corresponding to the recognition method of the voice control instruction in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the above-mentioned method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to a computer terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network may include a wireless network provided by the communication provider of the computer terminal. In one example, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the internet. In another example, the transmission device 106 may be a radio frequency (RF) module, which communicates with the internet wirelessly.

In this embodiment, a method for recognizing a voice control instruction is provided and is applied to the above computer terminal. FIG. 2 is a flowchart of the method for recognizing a voice control instruction according to an embodiment of the present invention. The flow includes the following steps:

step S202, acquiring a voice control instruction received by a target device;

step S204, in response to the voice control instruction, recognizing the voice control instruction separately by a target server corresponding to the target device and by the target device;

step S206, determining, according to the recognition speed of the target server for the voice control instruction, a target recognition result corresponding to the voice control instruction from a first recognition result of the target server and a second recognition result of the target device;

step S208, controlling the target device according to the target recognition result.
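The sketch below wires steps S202 to S208 together using asyncio. It is a hedged illustration under stated assumptions and is not the patented implementation. The assumptions are: `server_recognize` and `local_recognize` are hypothetical coroutines standing in for the first and second recognition models, recognition results are plain operation names, and `CONTROLS` is a hypothetical table that maps each operation to a device action. Both recognitions start at the same time. The server result is used only if it arrives within `target_time` seconds.

```python
import asyncio
from typing import Awaitable, Callable, Dict

PROMPT_OP = "prompt_offline"  # hypothetical name for the target prompt operation

# Hypothetical device actions keyed by recognized operation (step S208).
CONTROLS: Dict[str, Callable[[], str]] = {
    "query_weather": lambda: "reporting today's weather",
    "volume_up": lambda: "volume raised",
    PROMPT_OP: lambda: "network state does not support this instruction",
}


async def handle_voice_instruction(
    audio: bytes,  # S202: instruction captured by the target device
    server_recognize: Callable[[bytes], Awaitable[str]],
    local_recognize: Callable[[bytes], Awaitable[str]],
    target_time: float,
) -> str:
    # S204: start online (first) and on-device (second) recognition together.
    server_task = asyncio.create_task(server_recognize(audio))
    local_task = asyncio.create_task(local_recognize(audio))
    # S206: prefer the server result if it arrives within target_time.
    try:
        result = await asyncio.wait_for(server_task, target_time)
        local_task.cancel()
    except Exception:
        # Timed out or server/network error: fall back to the local result.
        result = await local_task
    # S208: control the device; unknown operations fall back to the prompt.
    action = CONTROLS.get(result, CONTROLS[PROMPT_OP])
    return action()


# Hypothetical recognizers used to exercise both branches.
async def fast_server(audio: bytes) -> str:
    await asyncio.sleep(0.01)
    return "query_weather"


async def stalled_server(audio: bytes) -> str:
    await asyncio.sleep(10)  # simulates a stalled network
    return "query_weather"


async def local_model(audio: bytes) -> str:
    # "Query weather" is not an offline operation, so the on-device path
    # yields the prompt operation as the second recognition result.
    return PROMPT_OP


if __name__ == "__main__":
    print(asyncio.run(handle_voice_instruction(b"audio", fast_server, local_model, 0.5)))
    print(asyncio.run(handle_voice_instruction(b"audio", stalled_server, local_model, 0.05)))
```

`asyncio.wait_for` cancels the server task when it times out. As a result, a late online result is discarded and cannot override the offline result that the device has already acted on.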

Through the above steps, after the target device receives the voice control instruction, the target server corresponding to the target device recognizes the instruction and outputs a first recognition result, and the target device itself recognizes the instruction and outputs a second recognition result. The target recognition result corresponding to the voice control instruction is then determined according to how fast the target server recognizes the instruction, and the target device is controlled with that result. Thus the user's voice control instruction can be recognized, and responded to in time, even if the network fluctuates or the server fails. This technical scheme solves problems in the related art such as the heavy dependence of voice control instruction recognition on the network, and achieves the technical effect of reducing that dependence.

In the technical solution provided in step S202, the voice control instruction received by the target device may be, but is not limited to being, received by a receiving module deployed on the target device. For example, a voice control instruction for controlling a speaker can be captured by the speaker's audio acquisition module.

Optionally, in this embodiment, the voice control instruction may be used to control operations of the target device. This may include, but is not limited to, control of the device itself and control of other devices, for example, turning the speaker volume up or down, having the speaker look up the weather or encyclopedia knowledge, and using the speaker to switch an air conditioner on and off.

Optionally, in this embodiment, the voice control instruction may take forms including, but not limited to, voice control or gesture control. For example, the speaker volume may be turned up or down directly by a voice command, or by the speaker recognizing an upward or downward sliding gesture.

In the technical solution provided in step S204, in response to the voice control instruction, the target server corresponding to the target device and the target device each recognize the voice control instruction at the same time. That is, recognition of the voice control instruction may run in multiple threads simultaneously.

Optionally, in this embodiment, the target server may include, but is not limited to, any server that can both recognize voice control instructions and perform the operations they request. For example, when the voice control instruction asks about the weather, the target server can recognize that the control intention is to query the weather and then look up the specified weather conditions.

In an exemplary embodiment, the voice control instruction may be recognized by the target server and the target device simultaneously in, but not limited to, the following manner: sending the voice control instruction to the target server, where the target server recognizes the voice control instruction with a first recognition model; and recognizing the voice control instruction with a second recognition model deployed on the target device.

That is, the voice control instruction may be transmitted to the target server and recognized by the first recognition model deployed there, which means it is recognized online. FIG. 3 is a schematic diagram of the first recognition model recognizing a voice control instruction according to an embodiment of the present invention. As shown in FIG. 3, when the speaker receives a voice instruction asking about the weather, it transmits the instruction to its corresponding target server, where the first recognition model recognizes the semantics of the instruction as "query weather". Meanwhile, the speaker measures how long the first recognition model takes to recognize the weather query.

Besides being recognized online, the voice control instruction can also be recognized offline by the second recognition model deployed on the target device.

It should be noted that the order in which the target server and the target device recognize the voice control instruction is not limited. The two steps may be executed simultaneously or sequentially. That is, the voice control instruction may be transmitted to the target server while it is recognized by the second recognition model deployed on the target device. Alternatively, it may first be sent to the target server and then to the second recognition model, or it may first be recognized by the second recognition model and then sent to the target server. This embodiment does not limit the order.

Optionally, in this embodiment, the first recognition model and the second recognition model may be, but are not limited to, any model capable of recognizing voice control instructions, such as a convolutional neural network model or a recurrent neural network model. The first and second recognition models may be of the same or different model types.

In an exemplary embodiment, the voice control instruction may be recognized with the second recognition model in, but not limited to, the following manner:
- inputting the voice control instruction into the second recognition model to obtain a target control operation output by the second recognition model, where the target control operation indicates the control intention of the voice control instruction for the target device;
- matching the target control operation against a target operation set stored on the target device, where the target operation set records the device control operations that the target device is allowed to perform without a network connection;
- if the target control operation is included in the target operation set, determining the target control operation as the second recognition result;
- if the target control operation is not included in the target operation set, determining a target prompt operation as the second recognition result, where the target prompt operation prompts that the target device does not allow the voice control instruction to be executed without a network connection.

FIG. 4 is a schematic diagram of the second recognition model recognizing a voice control instruction according to an embodiment of the present invention. As shown in FIG. 4, after the second recognition model recognizes the target control operation corresponding to the voice control instruction, the target control operation is matched against the target operation set stored on the target device. If the target operation set includes the target control operation, the operation can be executed while the target device is not connected to a network, so the target control operation is determined as the second recognition result. If the target operation set does not include it, the operation cannot be executed while the target device is offline, so the target prompt operation is determined as the second recognition result.

Optionally, in this embodiment, the target operation set records the device control operations that the target device is allowed to perform without a network connection, that is, the offline operations the device supports. FIG. 5 is a schematic diagram of the target operation set of the second recognition model according to an embodiment of the present invention. As shown in FIG. 5, the speaker can adjust its volume and switch itself on and off without a network connection, so the volume adjustment and on/off operations belong to the target operation set. However, the speaker cannot query the weather without a network connection, so the weather query operation does not belong to the target operation set.

Optionally, in this embodiment, the target prompt operation may be, but is not limited to, any operation that can prompt that the target device does not allow the voice control instruction to be executed without a network connection. For example, if the second recognition result is not in the local instruction set, the prompt "the network state does not support this instruction" is given. In other words, without a network connection the recognition capability may be limited, and only local controls such as "turn up the volume" and "turn on the washing machine" may be supported. FIG. 6 is a schematic diagram of the target prompt operation of the second recognition model according to an embodiment of the present invention. As shown in FIG. 6, after the speaker receives the voice control instruction "query weather", the second recognition model deployed in the speaker recognizes its intent as "query weather". This intent is then compared with the target operation set. Since the set does not include the "query weather" operation, the target prompt operation "the network state does not support this instruction" is output. Different devices support different local instruction sets. The recognition result is matched against the local instruction set: if it is a local instruction, it is executed; otherwise, the user is prompted that the network state does not support the instruction.
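The matching step above can be illustrated with a small set lookup. This is a sketch under assumptions. The per-device operation sets, operation names, and prompt text below are hypothetical examples based on the speaker and washing-machine cases in the text. A real device would load its own local instruction set.

```python
from typing import Dict, FrozenSet

# Hypothetical local (offline) instruction sets; each device type has its own.
TARGET_OPERATION_SETS: Dict[str, FrozenSet[str]] = {
    "speaker": frozenset({"volume_up", "volume_down", "power_on", "power_off"}),
    "washing_machine": frozenset({"power_on", "power_off", "start_wash"}),
}

# Hypothetical target prompt operation.
PROMPT_OPERATION = "prompt: the network state does not support this instruction"


def second_recognition_result(device_type: str, target_control_operation: str) -> str:
    """Map the on-device model's output to the second recognition result."""
    allowed = TARGET_OPERATION_SETS.get(device_type, frozenset())
    if target_control_operation in allowed:
        return target_control_operation  # executable without a network
    return PROMPT_OPERATION  # e.g. "query_weather" needs the server


if __name__ == "__main__":
    print(second_recognition_result("speaker", "volume_up"))      # volume_up
    print(second_recognition_result("speaker", "query_weather"))  # prompt: the network state does not support this instruction
```

Each lookup in a hash set takes constant time on average. This fits the stated aim of keeping the local instruction set small and matching fast.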

In an exemplary embodiment, before the voice control instruction is input into the second recognition model, the second recognition model may be trained in, but not limited to, the following manner: acquiring voice control instruction samples labeled with device control operations; and training an initial recognition model with these labeled samples to obtain the second recognition model.

Optionally, in this embodiment, the second recognition model may be generated by, but is not limited to, training the initial recognition model on offline voice control instruction samples labeled with device control operations. The offline functions are first sorted out, and the model is then trained specifically for that functional domain, that is, on corpora for local device control such as "turn up the volume" and "turn on the washing machine". This yields a dedicated, small recognition model and language model. Queries such as "what's the weather like today" are not optimized, because current weather information may be unavailable without the network. The resulting small, dedicated recognition and language models lower the hardware requirements and meet cost-reduction needs.

In an exemplary embodiment, FIG. 7 is a schematic diagram of the training process of the second recognition model according to an embodiment of the present invention. As shown in FIG. 7, the initial recognition model may be trained with voice control instruction samples labeled with device control operations to obtain the second recognition model in, but not limited to, the following manner:
- inputting a voice control instruction sample into the initial recognition model to obtain the control operation data output by the initial recognition model;
- substituting the control operation data and the labeled device control operation into the loss function corresponding to the initial recognition model to obtain a loss value;
- adjusting the model parameters of the initial recognition model according to the loss value until the loss value converges, at which point the second recognition model is obtained.
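The training loop described above can be sketched with a deliberately tiny model. All of the following are hypothetical assumptions and not taken from the text:
- each voice sample is already reduced to a short feature vector;
- the model is a linear softmax classifier rather than the convolutional or recurrent networks mentioned earlier;
- the loss is cross-entropy (the text does not name a loss function);
- parameters are updated by plain gradient descent;
- "converged" means the loss changes by less than `tol` between epochs.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def softmax(z: Sequence[float]) -> List[float]:
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(W: Matrix, b: List[float], x: Sequence[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(wk, x)) + bk for wk, bk in zip(W, b)]


def train_second_model(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],  # index of the labeled device control operation
    n_ops: int,
    lr: float = 0.5,
    tol: float = 1e-6,
    max_epochs: int = 20000,
) -> Tuple[Matrix, List[float], float]:
    dim, n = len(samples[0]), len(samples)
    W = [[0.0] * dim for _ in range(n_ops)]  # initial recognition model
    b = [0.0] * n_ops
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_epochs):
        loss = 0.0
        gW = [[0.0] * dim for _ in range(n_ops)]
        gb = [0.0] * n_ops
        for x, y in zip(samples, labels):
            p = softmax(logits(W, b, x))  # "control operation data"
            loss -= math.log(p[y])  # cross-entropy against the label
            for k in range(n_ops):
                d = p[k] - (1.0 if k == y else 0.0)  # dLoss/dlogit_k
                gb[k] += d
                for j in range(dim):
                    gW[k][j] += d * x[j]
        loss /= n
        if abs(prev_loss - loss) < tol:  # loss has converged: stop
            break
        prev_loss = loss
        for k in range(n_ops):  # adjust parameters by gradient descent
            b[k] -= lr * gb[k] / n
            for j in range(dim):
                W[k][j] -= lr * gW[k][j] / n
    return W, b, loss


def predict(W: Matrix, b: List[float], x: Sequence[float]) -> int:
    z = logits(W, b, x)
    return max(range(len(z)), key=z.__getitem__)


# Hypothetical toy data: 2-D features standing in for audio features.
OPS = ["volume_up", "volume_down", "power_off"]
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0], [-0.9, -1.1]]
Y = [0, 0, 1, 1, 2, 2]

if __name__ == "__main__":
    W, b, final_loss = train_second_model(X, Y, len(OPS))
    print(final_loss, [OPS[predict(W, b, x)] for x in X])
```

The stopping test compares successive epoch losses. This is one simple reading of "until the loss value converges". A production model would more likely monitor a held-out validation loss.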

Optionally, in this embodiment, the voice control instruction samples may be, but are not limited to, an offline instruction set matched to the offline functions of the device. This reduces the size of the instruction set, reduces hardware resource usage, and speeds up matching. The specially trained recognition and language models are smaller while maintaining recognition accuracy, which makes low-cost hardware feasible. The approach is also relatively easy to implement, has a short development cycle, and performs stably.

In the technical solution provided in step S206, the target recognition result is selected according to the recognition speed of the target server for the voice control instruction. It is chosen from the recognition result returned by the target server and the recognition result given by the target device, whichever is better suited to controlling the device in response to the instruction. That is, depending on the server's recognition speed, either the first recognition result or the second recognition result is used as the target recognition result for subsequent operations.

Optionally, in this embodiment, the recognition speed may be represented by, but is not limited to, any parameter or data that reflects how quickly the recognition result of the target server can be obtained. One example is the network state: when the network is poor, the first recognition result arrives slowly, and when the network is good, it arrives quickly. The recognition speed of the target server can therefore be represented by the network state. Alternatively, a time parameter may be used to express the recognition speed of the target server for the voice control instruction.

In an exemplary embodiment, the target recognition result corresponding to the voice control instruction may be determined according to the recognition speed of the target server for the voice control instruction as follows, without being limited thereto. First, detect the recognition speed of the target server for the voice control instruction. If the recognition speed is greater than or equal to a target speed, determine the first recognition result as the target recognition result. If the recognition speed is less than the target speed, determine the second recognition result as the target recognition result.

That is, the speed at which the target server recognizes the voice control instruction is detected. If it is greater than or equal to the target speed, the target server is considered to recognize the instruction quickly enough, and its first recognition result may be used as the target recognition result. If it is lower than the target speed, the target server is considered too slow, and the second recognition result of the target device may be used as the target recognition result instead.

Optionally, in this embodiment, determining the target recognition result according to the recognition speed of the target server may be implemented, without being limited thereto, by dynamically switching between the offline and online recognition modes. The networking state of the device is monitored in real time to decide whether to use the online recognition model of the server or the local recognition model of the device, and if the online recognition result returns too slowly, the offline recognition result is used. For example, when there is no network, the offline recognition model is used directly. When there is a network, the online and offline recognition models run in parallel on separate threads; the online result is used if it is received within the specified time, and the offline result is used otherwise. This avoids long waits caused by a poor network, reduces the dependence of the voice device on the network to a certain extent, and improves the experience of the device when no network is available.
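The switching logic of this paragraph can be sketched with Python's `concurrent.futures`. With no network, the offline model is used directly. Otherwise, both models run in parallel threads and the online result is used only if it arrives within the target time. `online_recognize` and `offline_recognize` stand in for the server call and the on-device model and are hypothetical, as is `make_stub`, which only simulates them. Treating an `OSError` (such as a dropped connection) like a timeout is an added assumption that reflects the server-failure case.

```python
import concurrent.futures
import time
from typing import Callable, Tuple

# A recognizer takes raw audio and returns a recognition result string.
Recognizer = Callable[[bytes], str]


def recognize_with_fallback(audio: bytes, network_available: bool,
                            online_recognize: Recognizer, offline_recognize: Recognizer,
                            target_time_s: float = 2.0) -> Tuple[str, str]:
    """Return (target_result, source), where source is "online" or "offline"."""
    if not network_available:
        # No network: use the on-device (offline) model directly.
        return offline_recognize(audio), "offline"
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    try:
        # Run online and offline recognition in parallel threads.
        online = pool.submit(online_recognize, audio)
        offline = pool.submit(offline_recognize, audio)
        try:
            # Online result received within the target time: use it.
            return online.result(timeout=target_time_s), "online"
        except (concurrent.futures.TimeoutError, OSError):
            # Too slow, or a network-level error: fall back to the offline result.
            return offline.result(), "offline"
    finally:
        # Do not block on a slow server call; it finishes in the background.
        pool.shutdown(wait=False)


def make_stub(result: str, delay_s: float = 0.0) -> Recognizer:
    """Hypothetical stand-in recognizer that answers after delay_s seconds."""
    def recognize(audio: bytes) -> str:
        time.sleep(delay_s)
        return result
    return recognize
```

`pool.shutdown(wait=False)` is used instead of a `with` block. Leaving a `with ThreadPoolExecutor()` block waits for every submitted call, which would make the device wait for a slow server anyway.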

In an exemplary embodiment, the recognition speed of the target server for the voice control instruction may be detected as follows, without being limited thereto. Detect whether the first recognition result is received within a target time after the voice control instruction is sent to the target server for recognition. If it is received within the target time, determine that the recognition speed is greater than or equal to the target speed. If it is not, determine that the recognition speed is less than the target speed.

That is, timing may start when the voice control instruction is sent to the target server for recognition, and the elapsed time is compared with the target time. If the first recognition result returned by the target server is received within the target time, the online recognition process of the target server meets the speed requirement, that is, the recognition speed is greater than or equal to the target speed. If it is not received within the target time, the online recognition process does not meet the speed requirement, that is, the recognition speed is lower than the target speed.

Optionally, in this embodiment, the target time may be, but is not limited to, a preset duration that does not affect the user experience. For example, if a user who issues a voice control instruction receives the device's response within 2 seconds, the user does not perceive any network delay and the experience is unaffected, so 2 seconds may be used as the target time.

Fig. 8 is a diagram illustrating a recognition speed according to an embodiment of the present invention. As shown in fig. 8, with a target time of 2 seconds, if the first recognition result for a user's weather query is returned within 2 seconds, the recognition speed may be determined to be greater than or equal to the target speed. If it is returned after 2 seconds, the recognition speed may be determined to be less than the target speed.
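The detection and selection rule of the preceding paragraphs reduces to a comparison of elapsed time with the target time. If the recognition speed is read as the reciprocal of the elapsed time, and the target speed as the reciprocal of the target time, then speed ≥ target speed holds exactly when elapsed time ≤ target time. The sketch below is minimal. It assumes the 2-second target time from the example above and treats arrival at exactly 2 seconds as "within" the target time, since the document does not specify the boundary. The function names are hypothetical.

```python
from typing import Optional, TypeVar

T = TypeVar("T")

TARGET_TIME_S = 2.0  # example target time taken from the 2-second illustration


def speed_meets_target(elapsed_s: Optional[float], target_time_s: float = TARGET_TIME_S) -> bool:
    """Recognition speed >= target speed  <=>  the first result arrived within
    the target time. elapsed_s is None when no first result has arrived."""
    return elapsed_s is not None and elapsed_s <= target_time_s


def select_target_result(first: Optional[T], second: T, elapsed_s: Optional[float],
                         target_time_s: float = TARGET_TIME_S) -> T:
    """Pick the server's first result if it was fast enough, else the device's second."""
    if first is not None and speed_meets_target(elapsed_s, target_time_s):
        return first
    return second
```

With these definitions, a weather-query result that arrives after 1.2 s is taken from the server. One that arrives after 2.7 s, or not at all, is replaced by the device's own result.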

In the technical solution provided in step S208, the target device is controlled according to the target recognition result. For example, suppose a smart speaker obtains a weather query instruction. If the target recognition result was output by the first recognition model, the speaker queries the weather. If the target recognition result for the weather query was output by the second recognition model, the speaker prompts the user that the instruction is not supported in offline mode.

To better illustrate the method for recognizing a voice control instruction, its flow is described below with reference to an optional embodiment. The embodiment does not limit the technical solution of the embodiments of the present invention.

In this embodiment, a method for recognizing a voice control instruction is provided. Fig. 9 is a schematic diagram of a recognition process of a voice control instruction according to an embodiment of the present invention; as shown in fig. 9, the following steps are performed:

step S901: wake up the device and check whether it has a network connection; if it is disconnected, use the offline recognition mode, so that the user does not face a long wait caused by a poor network;

step S902: if the device has a network connection, perform both offline recognition and online recognition on the instruction;

step S903: determine whether the online recognition result is returned within a specified time;

step S904: if the online recognition result is returned within the specified time, use it as the target recognition result;

step S905: if the online recognition result is not returned within the specified time, use the offline recognition result as the target recognition result;

step S906: check whether the offline recognition result matches a local instruction;

step S907: if the offline recognition result is not a local instruction, prompt the user that the instruction is not supported in offline mode;

step S908: if the offline recognition result is a local instruction, execute it, determine whether the execution succeeded, and prompt the user accordingly.
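Steps S906-S908 apply when the offline recognition result is used, and can be sketched as a lookup in a small local instruction table. The instruction names, handlers, and prompt texts below are hypothetical placeholders, since the document does not specify them. Treating an exception raised by a handler as a failed execution is likewise an assumption.

```python
from typing import Callable, Dict

# Hypothetical local (offline) instruction set: each instruction name maps to a
# handler that executes it on the device and reports whether it succeeded.
LocalHandler = Callable[[], bool]

PROMPT_UNSUPPORTED = "This instruction is not supported in offline mode."
PROMPT_DONE = "Done."
PROMPT_FAILED = "Execution failed, please try again."


def handle_offline_result(result: str, local_instructions: Dict[str, LocalHandler]) -> str:
    """Apply steps S906-S908 to an offline recognition result and return the
    prompt to give the user."""
    handler = local_instructions.get(result)  # S906: match against local instructions
    if handler is None:
        return PROMPT_UNSUPPORTED              # S907: not a local instruction
    try:
        succeeded = handler()                  # S908: execute the local instruction
    except Exception:
        succeeded = False                      # treat an error as a failed execution
    return PROMPT_DONE if succeeded else PROMPT_FAILED  # S908: prompt accordingly
```

For example, take a table that contains only `turn_on_light` and `volume_up`. An offline result of `query_weather` then produces the offline-mode prompt, matching the weather query example given for step S208.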

Through this implementation, the dependence of the voice device on the network is reduced to a certain extent, and the experience of the device without a network is improved. Second, a purpose-trained recognition model and language model are used together with a specific offline instruction set. This reduces the size of the instruction set, the hardware resource usage, and the model size while preserving recognition accuracy, meets low-cost hardware requirements, and speeds up matching. In addition, the device switches dynamically and automatically between the offline and online recognition modes, so the user always gets the best available experience. The technique is also easy to implement, has a short development cycle, and performs stably.

Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method according to the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Fig. 10 is a block diagram of an apparatus for recognizing a voice control instruction according to an embodiment of the present invention; as shown in fig. 10, the apparatus includes:

a first obtaining module 1002, configured to obtain a voice control instruction received by a target device;

a recognition module 1004, configured to, in response to the voice control instruction, recognize the voice control instruction simultaneously through a target server corresponding to the target device and through the target device;

a determining module 1006, configured to determine, according to a recognition speed of the target server for the voice control instruction, a target recognition result corresponding to the voice control instruction from a first recognition result of the target server and a second recognition result of the target device;

and a control module 1008, configured to control the target device according to the target recognition result.

Through this embodiment, after the target device receives a voice control instruction, the target server corresponding to the target device recognizes the instruction and outputs a first recognition result. At the same time, the target device also recognizes the instruction and outputs a second recognition result. The target recognition result is then determined according to the recognition speed of the target server, that is, how quickly the recognition result of the target server can be obtained, and the target device is controlled using the corresponding result. As a result, even if the network fluctuates or the server fails, the voice control instruction of the user can still be recognized and responded to in time. This technical solution addresses the problem in the related art that recognition of voice control instructions depends heavily on the network, and achieves the technical effect of reducing that dependence.

In an exemplary embodiment, the determining module includes:

a detection unit, configured to detect the recognition speed of the target server for the voice control instruction;

a first determination unit configured to determine the first recognition result as the target recognition result when the recognition speed is greater than or equal to a target speed;

a second determination unit configured to determine the second recognition result as the target recognition result in a case where the recognition speed is less than the target speed.

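In an exemplary embodiment, the detection unit is configured to:

detect whether the first recognition result is received within a target time after the voice control instruction is sent to the target server for recognition;

determine that the recognition speed is greater than or equal to the target speed if the first recognition result is received within the target time;

and determine that the recognition speed is less than the target speed if the first recognition result is not received within the target time.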

In an exemplary embodiment, the recognition module includes:

a sending unit, configured to send the voice control instruction to the target server, where the target server recognizes the voice control instruction using a first recognition model;

a recognition unit, configured to recognize the voice control instruction using a second recognition model deployed on the target device.

In an exemplary embodiment, the recognition unit is configured to:

In one exemplary embodiment, the apparatus further comprises:

a second obtaining module, configured to obtain a voice control instruction sample labeled with the device control operation before the voice control instruction is input into the second recognition model;

and a training module, configured to train an initial recognition model using the voice control instruction samples labeled with device control operations to obtain the second recognition model.

In an exemplary embodiment, the training module includes:

an input unit, configured to input the voice control instruction sample into the initial recognition model to obtain control operation data output by the initial recognition model;

a substitution unit, configured to substitute the control operation data and the device control operation into a loss function corresponding to the initial recognition model to obtain a loss value;

and an adjusting unit, configured to adjust the model parameters of the initial recognition model according to the loss value until the loss value converges, to obtain the second recognition model.

An embodiment of the present invention further provides a storage medium including a stored program, wherein the program, when run, performs any one of the methods described above.

Optionally, in this embodiment, the storage medium may be configured to store program code for performing the following steps:

S1, acquiring a voice control instruction received by a target device;

S2, in response to the voice control instruction, recognizing the voice control instruction through a target server corresponding to the target device and through the target device, respectively;

S3, determining, according to the recognition speed of the target server for the voice control instruction, a target recognition result corresponding to the voice control instruction from a first recognition result of the target server and a second recognition result of the target device;

S4, controlling the target device according to the target recognition result.

Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.

Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:

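S1, acquiring a voice control instruction received by a target device;

S2, in response to the voice control instruction, recognizing the voice control instruction through a target server corresponding to the target device and through the target device, respectively;

S3, determining, according to the recognition speed of the target server for the voice control instruction, a target recognition result corresponding to the voice control instruction from a first recognition result of the target server and a second recognition result of the target device;

S4, controlling the target device according to the target recognition result.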

Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations; details are not repeated here.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for recognizing a voice control command, characterized by comprising the following steps:

acquiring a voice control instruction received by a target device;

2. The method for recognizing the voice control command according to claim 1, wherein the determining a target recognition result corresponding to the voice control command from a first recognition result of the target server and a second recognition result of the target device according to the recognition speed of the target server for the voice control command comprises:

3. The method for recognizing the voice control command according to claim 2, wherein the detecting the recognition speed of the voice control command by the target server comprises:

4. The method for recognizing the voice control command according to claim 1, wherein the simultaneously recognizing the voice control command by the target server corresponding to the target device and the target device respectively comprises:

5. The method according to claim 4, wherein the recognizing the voice control command by using the second recognition model deployed on the target device comprises:

6. The method of recognizing a voice control command according to claim 5, wherein before the inputting the voice control command into the second recognition model, the method further comprises:

7. The method for recognizing the voice control command according to claim 6, wherein the training an initial recognition model by using the voice control command sample labeled with the device control operation to obtain the second recognition model comprises:

8. An apparatus for recognizing a voice control command, comprising:

the first acquisition module is used for acquiring a voice control instruction received by the target equipment;

the recognition module is used for responding to the voice control instruction and simultaneously recognizing the voice control instruction through a target server corresponding to the target equipment and the target equipment respectively;

the determining module is used for determining a target recognition result corresponding to the voice control instruction from a first recognition result of the target server and a second recognition result of the target device according to the recognition speed of the target server to the voice control instruction;

and the control module is used for controlling the target equipment according to the target identification result.

9. A computer-readable storage medium, comprising a stored program, wherein the program is operable to perform the method of any one of claims 1 to 7.

10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 7 by means of the computer program.