WO2022198365A1

WO2022198365A1 - Voice control method and apparatus

Info

Publication number: WO2022198365A1
Application number: PCT/CN2021/082019
Authority: WO
Inventors: 高益; 聂为然; 李宏言
Original assignee: 华为技术有限公司
Priority date: 2021-03-22
Filing date: 2021-03-22
Publication date: 2022-09-29
Also published as: CN113228167B; CN113228167A

Abstract

A voice control method and apparatus, used for solving the problem of long delay in the existing voice control process. In the present application, first text information having complete semantics is determined according to a first voice signal; and according to the first text information and second text information, a target device is controlled to switch between specified operating states, wherein the second text information is acquired before the first text information, the second text information is used for controlling the target device to enter the first operating state in the specified operating states, and the second text information has a contextual relationship with the first text information.

Description

Voice control method and apparatus

Technical field

The present application relates to the field of automatic driving, and in particular, to a voice control method and device.

Background

Voice interaction products have been widely used in people's daily life. For example, smart phones, smart home devices, and smart vehicle-mounted devices all have voice interaction functions. Especially in the in-vehicle environment, voice interaction can free hands, and has the characteristics of fast command control and safe driving.

During the driving process, due to the change of the driving environment, the user can usually adjust the opening of the in-vehicle equipment such as the windows and the sunroof through voice interaction with the in-vehicle voice control device.

However, in the current voice interaction process, the voice control device must first determine that the user has finished sending the voice signal. Only then can it perform voice recognition and semantic analysis on the entire captured utterance to obtain a control instruction, and adjust the corresponding in-vehicle device, such as the window opening, according to that instruction. Because the user's voice is recognized and analyzed only after the entire segment has been acquired, the time delay of the whole control process is relatively long.

Summary of the invention

The present application provides a voice control method and device, which are used to reduce control delay and improve user experience during the voice control process.

The voice control method provided in this application can be implemented by a terminal device, for example, a vehicle or a vehicle-mounted device. The method can also be implemented by a component of the terminal device, such as a processing device, circuit, or chip in the terminal device, for example, a chip supporting wireless communication functions, such as a system chip or a communication chip. The system chip is also called a system-on-chip (SoC) or an SoC chip. The communication chip may include a radio frequency processing chip and a baseband processing chip; the baseband processing chip is sometimes also called a modem. In a physical implementation, the communication chip may or may not be integrated in the SoC chip. For example, the baseband processing chip is integrated in the SoC chip, while the radio frequency processing chip is not.

In a first aspect, the present application provides a voice control method. The method includes: determining first text information with complete semantics according to a first voice signal; and controlling, according to the first text information and second text information, a target device to switch between specified operating states. Exemplarily, the specified operating states corresponding to the target device may include at least a first operating state and a second operating state. The second text information is acquired before the first text information, and there is a contextual relationship between the second text information and the first text information. The second text information is used to control the target device to enter the first operating state among the specified operating states, and the first text information is used to control the target device to switch from the first operating state to the second operating state among the specified operating states.

Exemplarily, the target device is a car window, and the specified operating states corresponding to the car window may include moving down (i.e., the first operating state) and stopping moving down (i.e., the second operating state). The second text information is used to control the window to move down, and the first text information is used to control the window to stop moving down. According to the first text information and the second text information, the window can be controlled to switch from the state of moving down to the state of having stopped moving down.

It should be understood that, in the above technical solution, there is no need to wait for the user to finish sending the voice signal. Instead, when the first text information has complete semantics and there is second text information that has a contextual relationship with the first text information, it is determined that the user has finished sending the voice signal, and a control instruction is generated according to the first text information and the second text information. This helps to reduce the control delay in the voice control process. Because the control instruction is generated in combination with the preceding text information, it can effectively control the target device that is in the specified operating state.
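The window example above can be sketched in Python as follows. This is only a minimal illustration: the class names, the phrases "lower the window" and "stop", and the function `apply_text` are assumptions made for the sketch and do not appear in the application.

```python
from enum import Enum
from typing import Optional


class WindowState(Enum):
    IDLE = "idle"
    MOVING_DOWN = "moving down"                  # first operating state
    STOPPED_MOVING_DOWN = "stopped moving down"  # second operating state


class Window:
    """Hypothetical target device."""

    def __init__(self) -> None:
        self.state = WindowState.IDLE


def apply_text(window: Window, text: str, previous_text: Optional[str]) -> WindowState:
    """Switch the window state using the current text and the preceding text."""
    if text == "lower the window":
        window.state = WindowState.MOVING_DOWN
    elif (text == "stop"
          and previous_text == "lower the window"
          and window.state is WindowState.MOVING_DOWN):
        # "stop" alone names no device; the preceding text supplies the context.
        window.state = WindowState.STOPPED_MOVING_DOWN
    return window.state
```

In this sketch, "stop" without a preceding "lower the window" leaves the window unchanged, which mirrors the requirement that the first text information be combined with the second text information.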

In an optional implementation manner, the contextual relationship between the second text information and the first text information includes at least one of the following: the second text information and the first text information correspond to (or act on) the same target device; the execution action corresponding to the second text information and the execution action corresponding to the first text information are of the same type.

It should be understood that, in the above technical solution, when determining whether the second text information and the first text information have a contextual relationship, it can specifically be determined whether the second text information and the first text information correspond to the same target device, and/or whether the execution action corresponding to the second text information and the execution action corresponding to the first text information are of the same type. This helps to improve the accuracy of determining that the second text information is the preceding context of the first text information.
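The two criteria above can be expressed as a small rule-based check. The following Python sketch assumes that each piece of text information has already been reduced to a target device and an action type; the `StructuredInfo` class, its field names, and the `require_both` switch are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuredInfo:
    device: str       # target device the text acts on, e.g. "window"
    action_type: str  # type of the execution action, e.g. "move"


def has_context(second: StructuredInfo, first: StructuredInfo,
                require_both: bool = False) -> bool:
    """Return True when the two pieces of information are contextually related."""
    same_device = second.device == first.device
    same_action = second.action_type == first.action_type
    if require_both:
        return same_device and same_action
    return same_device or same_action
```

Whether one criterion is enough or both are required is a design choice that the text leaves open ("and/or"), so the sketch exposes it as a parameter.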

In an optional implementation manner, before determining the first text information with complete semantics according to the first voice signal, the method further includes: determining the second text information with complete semantics according to a second voice signal; performing natural language understanding on the second text information to obtain second structured information; and controlling, according to the second structured information, the target device to enter the first operating state.

It should be understood that, in the above technical solution, the second voice signal is obtained first, and the target device is controlled to enter the first operating state according to the second voice signal. The first voice signal is then obtained, and the target device is controlled to switch from the first operating state to the second operating state according to the first voice signal. In this way, the target device that entered the first operating state according to the second voice signal can be controlled further.

In an optional implementation manner, controlling the target device to switch between the specified operating states according to the first text information and the second text information includes: determining, according to the second structured information corresponding to the second text information, a preset set corresponding to the second structured information, where the preset set includes correspondences between one or more pieces of preset text information and preset instruction identifiers; and when the one or more pieces of preset text information include the first text information, determining a control instruction according to the preset instruction identifier corresponding to the first text information, where the control instruction is used to control the target device to switch from the first operating state to the second operating state among the specified operating states.

It should be understood that, in the above technical solution, a preset set corresponding to the second structured information is provided. When the first text information is included in the preset set, the preset instruction identifier corresponding to the first text information can be determined directly, and the control instruction is generated according to that identifier. Natural language understanding and dialogue management need not be performed on the first text information, which further reduces the time delay in the control process.

In an optional implementation manner, controlling the target device to switch between the specified operating states according to the first text information and the second text information further includes: when the first text information differs from every piece of the one or more pieces of preset text information, performing natural language understanding on the first text information to obtain first structured information; and determining the control instruction according to the first structured information and the second structured information.

It should be understood that, in the above technical solution, when the first text information is not included in the preset set, natural language understanding can be performed on the first text information to obtain the first structured information, and dialogue management can then be performed according to the first structured information and the second structured information to obtain the control instruction, which helps to ensure the normal operation of the system.
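The fast path through the preset set and the fallback through natural language understanding and dialogue management might be sketched as follows. Everything named here is an assumption for illustration: the `(device, action)` key, the contents of `PRESET_SETS`, the instruction identifiers, and the toy `toy_nlu` and `toy_dm` functions standing in for real NLU and DM modules.

```python
from typing import Callable, Dict, Optional, Tuple

InfoKey = Tuple[str, str]  # hypothetical structured information: (device, action)

# Hypothetical preset sets, one per piece of second structured information.
PRESET_SETS: Dict[InfoKey, Dict[str, str]] = {
    ("window", "move_down"): {"stop": "WINDOW_STOP", "that is enough": "WINDOW_STOP"},
}


def determine_instruction(
    first_text: str,
    second_info: InfoKey,
    nlu: Callable[[str], InfoKey],
    dm: Callable[[InfoKey, InfoKey], Optional[str]],
) -> Tuple[Optional[str], str]:
    """Return (instruction identifier, path taken)."""
    preset = PRESET_SETS.get(second_info, {})
    if first_text in preset:
        # Fast path: no NLU or DM is needed for the first text information.
        return preset[first_text], "preset"
    # Fallback: understand the text, then combine it with the stored context.
    first_info = nlu(first_text)
    return dm(first_info, second_info), "nlu+dm"


def toy_nlu(text: str) -> InfoKey:
    """Stand-in NLU: recognizes only a request to raise the window."""
    return ("window", "move_up") if "up" in text else ("unknown", "unknown")


def toy_dm(first_info: InfoKey, second_info: InfoKey) -> Optional[str]:
    """Stand-in DM: valid only when both act on the same device."""
    if first_info[0] == second_info[0] and first_info[1] == "move_up":
        return "WINDOW_UP"
    return None  # None marks an invalid control instruction in this sketch
```

A dictionary lookup keeps the fast path to constant time, which is the point of the preset set: the common follow-up phrases never reach the slower NLU and DM stages.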

In an optional implementation manner, after the control instruction is determined according to the first structured information and the second structured information, the method further includes: if the control instruction is invalid, updating the second structured information according to the first structured information.

It should be understood that, in the above technical solution, when the control instruction is invalid, the stored second structured information can be updated according to the first structured information (that is, the stored historical structured information is updated), so that the currently stored historical structured information is always the latest structured information. This ensures the correct operation of the system and helps to make correct judgments when new voice signals are received.
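The update of the stored history on an invalid control instruction can be sketched as below. The sketch assumes that an invalid instruction is represented by `None` and that the history is a plain dictionary with a `second_info` entry; both are illustrative choices, not details given in the application.

```python
from typing import Callable, Dict, Optional

Info = Dict[str, str]  # hypothetical structured information


def decide_and_update(
    first_info: Info,
    history: Dict[str, Info],
    dm: Callable[[Info, Info], Optional[str]],
) -> Optional[str]:
    """Run dialogue management; refresh the stored history if the result is invalid."""
    instruction = dm(first_info, history["second_info"])
    if instruction is None:
        # The new utterance did not continue the old context, so it becomes
        # the context against which the next voice signal is judged.
        history["second_info"] = first_info
    return instruction
```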

In an optional implementation manner, determining the first text information with complete semantics according to the first voice signal includes: determining, according to the first voice signal, M characters corresponding to the first voice signal, where M is a positive integer; inputting the text information composed of the M characters into a first preset model to obtain an output result of the first preset model, where the first preset model is used to judge whether the text information composed of the input characters has complete semantics; and generating the first text information according to the text information composed of the M characters and the output result of the first preset model.

In an optional implementation manner, the first preset model is determined by the following steps: obtaining a first training set, where the first training set includes a plurality of pieces of first training data, and each piece of first training data includes first training text information and a first label; the first training text information is composed of one or more words, and the first label indicates whether the first training text information has complete semantics; performing first model training one or more times according to the plurality of pieces of first training data and a first training model, until a first output result of the first training model meets a first preset condition; and determining the first training model whose first output result meets the first preset condition as the first preset model. The first model training includes: inputting the plurality of pieces of first training data into the first training model to obtain the first output result; and updating the model parameters of the first training model according to the first output result to obtain the first training model with updated model parameters.

It should be understood that, in the above technical solution, a first preset model is provided in advance, where the first preset model is a relatively accurate classification model trained on a plurality of pieces of historical training data. Once the M characters corresponding to the first voice signal have been determined, the text information composed of the M characters can be input into the first preset model to determine whether the M characters corresponding to the current first voice signal have complete semantics. This helps to obtain a more accurate judgment result, and thus more accurate first text information with complete semantics.
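The training procedure above (produce an output result, check it against a preset condition, update the parameters, repeat) can be illustrated with a deliberately small model. The application does not specify the model type or the preset condition; the sketch below substitutes a bag-of-words perceptron and assumes that the first preset condition is a training accuracy threshold. The training texts and all names are hypothetical.

```python
from typing import Dict, List, Tuple

TrainingData = Tuple[str, int]  # (first training text information, first label)


class CompletenessModel:
    """Bag-of-words perceptron, used here as a stand-in first training model."""

    def __init__(self) -> None:
        self.weights: Dict[str, float] = {}
        self.bias = 0.0

    def predict(self, text: str) -> int:
        score = self.bias + sum(self.weights.get(w, 0.0) for w in text.split())
        return 1 if score > 0 else 0  # 1 means "has complete semantics"

    def update(self, text: str, label: int) -> None:
        error = label - self.predict(text)  # 0 when the prediction is correct
        if error:
            self.bias += error
            for w in text.split():
                self.weights[w] = self.weights.get(w, 0.0) + error


def train(data: List[TrainingData], target_accuracy: float = 1.0,
          max_rounds: int = 100) -> CompletenessModel:
    model = CompletenessModel()
    for _ in range(max_rounds):
        # Output result of this round: one prediction per piece of training data.
        outputs = [model.predict(text) for text, _ in data]
        accuracy = sum(o == y for o, (_, y) in zip(outputs, data)) / len(data)
        if accuracy >= target_accuracy:  # assumed form of the preset condition
            break
        for text, label in data:  # update the model parameters
            model.update(text, label)
    return model


# Hypothetical training set: label 1 = complete semantics, 0 = incomplete.
TRAINING_SET: List[TrainingData] = [
    ("open the", 0),
    ("open the window", 1),
    ("turn on the", 0),
    ("turn on the radio", 1),
    ("stop", 1),
]
```

Calling `train(TRAINING_SET)` returns a model that separates the complete from the incomplete phrases in this toy set; a production system would use a far larger training set and a stronger classifier.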

In an optional implementation manner, before controlling the target device to switch between the specified operating states according to the first text information and the second text information, the method further includes: inputting the first text information and historical text information into a second preset model to obtain an output result of the second preset model, where the second preset model is used to judge whether two pieces of input text information have a contextual relationship; and determining the historical text information as the second text information according to the output result of the second preset model.

In an optional implementation manner, the second preset model is determined by the following steps: obtaining a second training set, where the second training set includes a plurality of pieces of second training data, and each piece of second training data includes two pieces of second training text information and a second label; the second label indicates whether the two pieces of second training text information have a contextual relationship; performing second model training one or more times according to the plurality of pieces of second training data and a second training model, until a second output result of the second training model meets a second preset condition; and determining the second training model whose second output result meets the second preset condition as the second preset model. The second model training includes: inputting the plurality of pieces of second training data into the second training model to obtain the second output result; and updating the model parameters of the second training model according to the second output result to obtain the second training model with updated model parameters.

It should be understood that, in the above technical solution, a second preset model is provided in advance, where the second preset model is a relatively accurate classification model trained on a plurality of pieces of historical training data. When the M characters corresponding to the first voice signal have complete semantics, that is, when the M characters form the first text information, the first text information and the currently stored historical text information can be input into the second preset model, and whether the historical text information is the preceding context of the first text information is determined according to the output result of the second preset model, which helps to obtain a more accurate determination result.

In a second aspect, the present application provides a voice control device. The device includes: a processing module, configured to determine first text information with complete semantics according to a first voice signal; and a control module, configured to control, according to the first text information and second text information, a target device to switch between specified operating states. Exemplarily, the specified operating states corresponding to the target device may include at least a first operating state and a second operating state. The second text information is acquired before the first text information, and there is a contextual relationship between the second text information and the first text information. The second text information is used to control the target device to enter the first operating state among the specified operating states, and the first text information is used to control the target device to switch from the first operating state to the second operating state among the specified operating states.

In an optional implementation manner, the contextual relationship between the second text information and the first text information includes at least one of the following: the second text information and the first text information correspond to the same target device; the execution action corresponding to the second text information and the execution action corresponding to the first text information are of the same type.

In an optional implementation manner, before the processing module determines the first text information with complete semantics according to the first voice signal, the processing module is further configured to: determine the second text information with complete semantics according to a second voice signal; and perform natural language understanding on the second text information to obtain second structured information. The control module is further configured to control the target device to enter the first operating state according to the second structured information.

In an optional implementation manner, the control module is specifically configured to: determine, according to the second structured information corresponding to the second text information, a preset set corresponding to the second structured information, where the preset set includes correspondences between one or more pieces of preset text information and preset instruction identifiers; and when the one or more pieces of preset text information include the first text information, determine a control instruction according to the preset instruction identifier corresponding to the first text information, where the control instruction is used to control the target device to switch from the first operating state to the second operating state among the specified operating states.

In an optional implementation manner, the control module is further configured to: when the first text information differs from every piece of the one or more pieces of preset text information, perform natural language understanding on the first text information to obtain first structured information; and determine the control instruction according to the first structured information and the second structured information.

In an optional implementation manner, the control module is further configured to: after determining the control instruction according to the first structured information and the second structured information, update the second structured information according to the first structured information if the control instruction is invalid.

In an optional implementation manner, the processing module is specifically configured to: determine, according to the first voice signal, M characters corresponding to the first voice signal, where M is a positive integer; input the text information composed of the M characters into a first preset model to obtain an output result of the first preset model, where the first preset model is used to judge whether the text information composed of the input characters has complete semantics; and generate the first text information according to the text information composed of the M characters and the output result of the first preset model.

In an optional implementation manner, the processing module is specifically configured to: obtain a first training set, where the first training set includes a plurality of pieces of first training data, and each piece of first training data includes first training text information and a first label; the first training text information is composed of one or more words, and the first label indicates whether the first training text information has complete semantics; perform first model training one or more times according to the plurality of pieces of first training data and a first training model, until a first output result of the first training model meets a first preset condition; and determine the first training model whose first output result meets the first preset condition as the first preset model. The first model training includes: inputting the plurality of pieces of first training data into the first training model to obtain the first output result; and updating the model parameters of the first training model according to the first output result to obtain the first training model with updated model parameters.

In an optional implementation manner, before the control module controls the target device to switch between the specified operating states according to the first text information and the second text information, the processing module is further configured to: input the first text information and historical text information into a second preset model to obtain an output result of the second preset model, where the second preset model is used to judge whether two pieces of input text information have a contextual relationship; and determine the historical text information as the second text information according to the output result of the second preset model.

In an optional implementation manner, the processing module is specifically configured to: obtain a second training set, where the second training set includes a plurality of pieces of second training data, and each piece of second training data includes two pieces of second training text information and a second label; the second label indicates whether the two pieces of second training text information have a contextual relationship; perform second model training one or more times according to the plurality of pieces of second training data and a second training model, until a second output result of the second training model meets a second preset condition; and determine the second training model whose second output result meets the second preset condition as the second preset model. The second model training includes: inputting the plurality of pieces of second training data into the second training model to obtain the second output result; and updating the model parameters of the second training model according to the second output result to obtain the second training model with updated model parameters.

In a third aspect, the present application provides a computing device comprising a processor. The processor is connected to a memory, the memory stores a computer program, and the processor is configured to execute the computer program stored in the memory, so that the computing device executes the method in the first aspect or in any possible implementation manner of the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium on which a computer program or instructions are stored. When the computer program or instructions are executed, a computer is enabled to execute the method in the first aspect or in any possible implementation manner of the first aspect.

In a fifth aspect, the present application provides a computer program product, which, when the computer reads and executes the computer program product, causes the computer to execute the first aspect or the method in any possible implementation manner of the first aspect.

In a sixth aspect, the present application provides a chip, which is connected to a memory and is used to read and execute a software program stored in the memory, so as to implement the method in the first aspect or in any possible implementation manner of the first aspect.

It should be understood that, in the technical solutions of the first aspect to the sixth aspect, the voice control device may acquire the second voice signal, determine the second text information according to the second voice signal, perform natural language understanding on the second text information to obtain the second structured information, and then control the target device to enter the first operating state among the specified operating states according to the second structured information. While the target device is in the first operating state, the voice control device may perform streaming speech recognition on the first voice signal as it is continuously being obtained, so as to obtain the corresponding M characters. When the text information composed of the M characters is determined to have complete semantics, the M characters form the first text information with complete semantics. There is no need to wait for the silent period that follows the end of the user's voice signal; once text information with complete semantics is obtained, it is inferred that the user has finished sending the voice signal, which effectively reduces the control delay.
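The streaming behavior described above can be sketched as a loop that re-evaluates the recognized text after every new character and stops at the first prefix with complete semantics. The function name and the `has_complete_semantics` callback (which stands in for the first preset model) are assumptions for this sketch.

```python
from typing import Callable, Iterable, Optional


def first_complete_text(
    characters: Iterable[str],
    has_complete_semantics: Callable[[str], bool],
) -> Optional[str]:
    """Consume streamed characters and return the first prefix judged complete."""
    text = ""
    for ch in characters:
        text += ch
        if has_complete_semantics(text):
            # Stop here: no waiting for trailing silence or later characters.
            return text
    return None  # the stream ended before the text became complete
```

Because the function returns as soon as the prefix is complete, any characters still arriving from the recognizer are left unread, which corresponds to not waiting for the end of the utterance.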

Further, whether the first text information and the currently stored historical text information have a contextual relationship is determined. When they do, it can be determined that the currently obtained first voice signal is a further instruction from the user following the previous second voice signal, and the target device in the first operating state can be controlled according to the first text information and the second text information. Specifically, the operating state of the target device is switched from the first operating state to the second operating state. In this way, the target device indicated by the first voice signal sent by the user can be effectively determined and controlled.

Moreover, when controlling the target device, it can be determined whether the first text information is in a preset set corresponding to the second text information (ie, the second structured information), and when the first text information is in the preset set, The preset instruction identifier corresponding to the first text information can be directly determined according to the preset set without performing natural language understanding and dialogue management on the first text information, thereby helping to further reduce the delay in the control process.

In this way, in this application, by using streaming speech recognition technology, complete semantic determination, context determination, and setting a preset set corresponding to the second text information (ie, the second structured information), the time delay in the control process can be effectively reduced, and the user The target device can be controlled more intuitively and effectively by sending a voice signal, which helps to improve the user experience.

Description of drawings

FIG. 1 is a schematic diagram of functional modules included in a voice control device provided by the present application;

FIG. 2 is a schematic diagram of functional modules included in a data processing module provided by the present application;

FIG. 3 shows a specific scene to which the voice control device provided by the present application is applicable;

FIG. 4 is a schematic diagram of a process in which a group of vehicle windows slowly moves down, provided by the present application;

FIG. 5 is a schematic diagram of the time delay produced when a first type of voice control device processes a voice signal, provided by the present application;

FIG. 6 is a schematic diagram of functional modules included in yet another data processing module provided by the present application;

FIG. 7 is a schematic flowchart of a voice control method provided by the present application;

FIG. 8 is a schematic flowchart of yet another voice control method provided by the present application;

FIG. 9 is a schematic flowchart of the input and output of two preset models in a flow control module provided by the present application;

FIG. 10 shows a voice control process in a vehicle-mounted scene provided by the present application;

FIG. 11 shows a voice control process in yet another vehicle-mounted scene provided by the present application;

FIG. 12 shows a voice control process in another vehicle-mounted scene provided by the present application;

FIG. 13 is a schematic diagram of the time delay produced when a second type of voice control device processes a voice signal, provided by the present application;

FIG. 14 is a schematic diagram of the time delay produced when a third type of voice control device processes a voice signal, provided by the present application;

FIG. 15 is a schematic structural diagram of a voice control device provided by the present application;

FIG. 16 is a schematic structural diagram of another voice control device provided by the present application.

Detailed description

The embodiments of the present application will be described in detail below with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of functional modules included in a voice control device provided by the present application. The voice control device includes a voice acquisition module, a data processing module, and a decision-making module. The voice acquisition module is used to acquire the voice signal and transmit it to the data processing module. The data processing module is used to perform speech analysis, semantic analysis, dialogue management, and so on, on the voice signal to obtain a data processing result. The data processing module sends the data processing result to the decision-making module, and the decision-making module generates a control instruction according to the data processing result and sends it to the corresponding target device.

FIG. 2 is a schematic diagram of functional modules included in a data processing module provided by the present application. Exemplarily, the data processing module includes an automatic speech recognition (ASR) function module, a natural language understanding (NLU) function module, and a dialog management (DM) function module. For convenience of description, these are referred to below as the ASR module, the NLU module, and the DM module, respectively.
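The chaining of the three function modules might be pictured with the following Python sketch. The stage signatures and the stub implementations are hypothetical; in particular, the stub ASR simply decodes bytes as text and does not perform real recognition.

```python
from typing import Callable, Dict, Optional

Structured = Dict[str, str]  # hypothetical structured information


def data_processing(
    voice_signal: bytes,
    asr: Callable[[bytes], str],
    nlu: Callable[[str], Structured],
    dm: Callable[[Structured], Optional[str]],
) -> Optional[str]:
    """Pass the voice signal through ASR, then NLU, then DM."""
    text = asr(voice_signal)     # voice signal -> text information
    structured = nlu(text)       # text information -> structured information
    return dm(structured)        # structured information -> processing result


# Stub stages, for illustration only.
def stub_asr(signal: bytes) -> str:
    return signal.decode("utf-8")


def stub_nlu(text: str) -> Structured:
    device = "window" if "window" in text else "unknown"
    action = "move_down" if "lower" in text else "unknown"
    return {"device": device, "action": action}


def stub_dm(info: Structured) -> Optional[str]:
    if info == {"device": "window", "action": "move_down"}:
        return "WINDOW_DOWN"
    return None
```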

The three components are described below.

1. The ASR module can be used to perform speech analysis, that is, to convert the speech signal input by the user into natural language text (which can be called text information), which is equivalent to the human ear.

The basic flow of speech recognition is: "speech input - encoding (feature extraction) - decoding - text output". Exemplarily, speech input means inputting the acquired voice signal into the ASR module. The voice signal is actually a sound wave, and the ASR module can encode it (feature extraction). Specifically, the sound wave is split into frames (at the millisecond level), giving a small piece of waveform for each frame, and each small piece of waveform is converted into multi-dimensional vector information according to the characteristics of the human ear. The ASR module then decodes the multi-dimensional vector information to obtain a plurality of phonemes (phones), composes the phonemes into words, and concatenates the words into sentences (i.e., text information). Finally, the ASR module outputs the generated text information.
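The framing step of the encoding stage can be sketched as follows. The application only states that frames are at the millisecond level; the 25 ms frame length and 10 ms hop used as defaults here are common choices assumed for illustration, and the function name is hypothetical.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], sample_rate: int,
                      frame_ms: int = 25, hop_ms: int = 10) -> List[Sequence[float]]:
    """Split a waveform into overlapping fixed-length frames."""
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    hop_len = sample_rate * hop_ms // 1000      # samples between frame starts
    last_start = len(samples) - frame_len
    return [samples[start:start + frame_len]
            for start in range(0, last_start + 1, hop_len)]
```

For one second of audio sampled at 16 kHz, these defaults give frames of 400 samples that start every 160 samples. Each frame would then be converted into a feature vector, a step not shown here.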

Technologies related to speech recognition include:

1) Voice activity detection (VAD)

Voice activity detection may also be referred to as voice activation detection or silence detection, etc.

In a far-field recognition scenario, the user cannot touch the device by hand. Noise is then relatively strong and the signal-to-noise ratio drops sharply, which can be simply understood as the signal being unclear, so VAD technology can be used. Its function is to judge when a voice signal is being input and when it is not (i.e., silence), so that subsequent voice signal processing or voice recognition can be performed on the valid voice segments cut out by the VAD. That is, the VAD is mainly used to detect whether the user has completed the input of the voice signal.

VAD mainly includes voice VAD and semantic VAD. Voice VAD means that reception of the voice signal is stopped (also referred to as stopping sound pickup) when no voice signal input is detected within a set period of time. Semantic VAD means that reception of the voice signal is stopped when it is determined that the text information converted from the voice signal input so far has complete semantics.
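Exemplarily, voice VAD can be sketched as follows. The sketch assumes that one energy value is available per frame and treats a frame as silent when its energy is below a threshold; the function name find_end_of_speech, the energy-based criterion, and all parameter values are assumptions for illustration only. Semantic VAD is not covered by this sketch.

```python
from typing import List, Optional


def find_end_of_speech(frame_energies: List[float], energy_threshold: float,
                       silence_frames: int) -> Optional[int]:
    """Return the frame index at which sound pickup would stop, or None.

    A frame is treated as silent when its energy is below energy_threshold
    (an assumed criterion). Pickup stops once silence_frames consecutive
    silent frames have followed some speech.
    """
    heard_speech = False
    silent_run = 0
    for index, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_run = 0  # speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= silence_frames:
                return index
    return None


energies = [0.0, 0.9, 0.8, 0.1, 0.0, 0.0, 0.7]
print(find_end_of_speech(energies, energy_threshold=0.5, silence_frames=3))
```

With a longer set period (for example, silence_frames=4), the same input does not trigger a stop, because speech resumes before the set period elapses.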

2) Voice wake-up (voice trigger, VT)

In the far-field recognition scenario, voice wake-up needs to be performed after the VAD detects a human voice, which is equivalent to sending a wake-up command to the device to trigger subsequent voice recognition.

3) Microphone array

This is a system used to sample and process the spatial characteristics of the sound field, and consists of a certain number of acoustic sensors (usually microphones). It serves several purposes: speech enhancement, that is, the process of extracting clean speech from a noisy speech signal; sound source localization, which uses the microphone array to calculate the angle and distance of the target speaker so as to track the target speaker and subsequently pick up the voice directionally; de-reverberation, which reduces the influence of reflected sound; and sound source signal extraction/separation, which extracts the individual sounds from a mixture. It is mainly suitable for complex environments with considerable noise and echo, such as vehicles, outdoors and supermarkets.

2. The NLU module can be used to perform natural language understanding or semantic analysis, that is, to convert natural language text into structured information that can be understood by machines. Exemplarily, for natural language text such as "open car window", the structured information obtained through natural language understanding is, for example, "control-window.adjust".

3. The DM module can be used to perform dialogue management, that is, to provide corresponding services according to the semantic information and based on the state of the dialogue. Dialogue management controls the process of the human-machine dialogue and decides how to respond to the user based on the historical information of the dialogue. The most common application is task-driven multi-round dialogue, in which the user has a clear purpose, such as querying an order; the user's needs are relatively complex and subject to many restrictions, and may need to be stated over multiple rounds. In essence, task-driven dialogue management is a decision-making process: during the dialogue, the system continuously decides the optimal action to take next according to the current state (such as providing results, asking for specific constraints, or clarifying or confirming requirements), so as to most effectively assist the user in completing the task of obtaining information or services.

In addition, the data processing module may also include a natural language generation (NLG) function module and a text-to-speech (TTS) speech synthesis function module. For convenience of description, the NLG function module and the TTS function module are referred to below as the NLG module and the TTS module, respectively.

The NLG module can be used to generate natural language texts based on business information.

The TTS module can be used to turn natural language text into an output speech signal. Contrary to the ASR module, the TTS module converts natural language text into speech for the machine to read aloud, which is equivalent to a human mouth.

FIG. 3 shows a specific scenario to which the voice control device provided by this application can be applied. The scenario may be a vehicle-mounted scenario, in which the user can use the voice control device to issue control commands to a vehicle-mounted device (which can be called a target device, such as a car window, car speakers, seats, or an air conditioner). For example, in FIG. 3, if the user says "open the car window" (equivalent to the user sending a voice signal, the voice signal being "open the car window"), then after receiving the voice signal, the voice control device can process it through the ASR module, NLU module and DM module shown in FIG. 2 to obtain a control command for the car window, and then control the car window to move down slowly according to the control command.

In addition, the user can also issue control commands to other in-vehicle devices through the voice control device. For example, if the user says "raise the seat", the voice control device will control the seat to rise slowly in response to the voice signal. For another example, if the user says "turn down the air conditioner wind", the voice control device will control the wind power of the air conditioner to decrease slowly in response to the voice signal.

In addition, the voice control device provided in this application can also be applied to other scenarios. For example, in a home scenario, a user can use the voice control device to issue control commands to a home device (which can be called a target device, such as a robot vacuum cleaner, a desk lamp, or curtains). Exemplarily, when the user says "open the curtains", the voice control device controls the curtains to open slowly in response to the voice signal; when the user says "turn up the desk lamp", the voice control device gradually increases the brightness of the desk lamp in response to the voice signal.

It should be pointed out that the above-mentioned target device may be in a corresponding operating state within a preset period of time based on the control instruction. For example, in the case of opening a car window, the process of moving the car window from a fully closed state to a fully open state takes about 3-4 seconds. FIG. 4 is a schematic diagram of a group of vehicle windows slowly moving down in the present application, wherein the thick solid line represents the vehicle door, and the thin dashed line represents the vehicle window. In Fig. 4(a), the window is in a fully closed state, that is, the window has not been opened. In Fig. 4(b), the window is in a half-open state, specifically in a 40% open state. In Fig. 4(c), the window is still in a half-open state, specifically in a 60% open state. In Fig. 4(d), the window is in a fully open state, that is, in a 100% open state. It takes about 3-4 seconds to move the window from the state of (a) in Figure 4 to the state of (d) in Figure 4.

Based on this, within 3-4 seconds after the window starts to move down based on the control command, the window is in an operating state of slowly moving down. In this operating state, the user can intuitively perceive the current opening of the window and, according to personal needs, issue a further control command to the window through the voice control device, such as a window stop command, so that the window stays at the position desired by the user.

For example, when the car window has moved down to the position shown in (c) in FIG. 4, the user intuitively feels that the current window position is appropriate, so the user can issue a stop command to the car window through the voice control device, for example by saying "stop". After receiving the voice signal, the voice control device can process it through the ASR module, NLU module, and DM module shown in FIG. 2 to obtain a stop command for the car window, and then control the car window to stop moving down.

In this application, the user's control of the target device while it is in an operating state may be referred to as process control. For example, the above-mentioned control of the car window while it is moving down may be referred to as process control of the car window. The above description also applies to the case where the target device is another vehicle-mounted device, such as a seat or an air conditioner, and of course also to the case where the target device is a device in another scenario, such as a robot vacuum cleaner, curtains, or a desk lamp.

It should be added that the target device may also be considered to have designated operating states, and the designated operating states include at least two operating states, referred to as a first operating state and a second operating state. The first operating state may be the operating state that the target device is in based on the voice signal (or control command) issued by the user for the first time, such as the operating state in which the window moves down or the operating state in which the seat is slowly raised. The second operating state may be the operating state that the target device is in based on the voice signal (or control instruction) sent by the user for the second time, such as the operating state in which the window stops moving down or the operating state in which the seat stops rising.

In the process in which the user sends the voice signal to the voice control device, the voice control device needs to determine that the user has finished sending the voice signal (also called user voice or voice command); only then can the voice control device perform recognition and semantic analysis on the entire acquired voice signal to obtain a control instruction.

Exemplarily, a trailing silence duration may be set, and the voice control apparatus determines that the user has finished sending the voice signal when the duration for which no voice signal has been received reaches the silence duration. Subsequently, the voice control device obtains a control instruction after processing the entire acquired voice signal through the ASR module, the NLU module, and the DM module shown in FIG. 2.

FIG. 5 is a schematic diagram of the time delay for processing a voice signal by the first kind of voice control device exemplarily provided by this application. The time delay specifically includes the silence duration, the processing duration of the ASR module, the processing duration of the NLU module and the processing duration of the DM module. It can be seen from the figure that the time delay from when the voice control device receives the voice signal to when the voice control device generates the control command is relatively long.
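Exemplarily, the composition of this time delay can be written as a simple sum. In the following sketch, the function name control_delay_ms is hypothetical, and the four durations are placeholder values chosen only for illustration; they are not measurements from the present application.

```python
def control_delay_ms(silence_ms: int, asr_ms: int, nlu_ms: int, dm_ms: int) -> int:
    """Total delay: silence duration plus ASR, NLU and DM processing durations."""
    return silence_ms + asr_ms + nlu_ms + dm_ms


# Placeholder durations in milliseconds (assumed, for illustration only).
total = control_delay_ms(silence_ms=600, asr_ms=150, nlu_ms=150, dm_ms=100)
print(total)
```

Under these assumed values, the silence duration alone accounts for more than half of the total, which is why removing the wait for silence is an effective way to shorten the delay.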

A long time delay prevents the target device from being controlled in time. Especially in process control, the user then cannot control the target device intuitively and effectively through the voice control device. For example, when the window has moved down to 60%, the user intuitively feels that the current window position is suitable and says "stop". There may be a delay, such as 1 second (s), between the user saying "stop" and the window actually stopping, by which time the window may have moved down to 80%, so the final position of the window is not what the user wants.

Based on this, the present application provides a voice control method for reducing control delay in a voice control process.

In order to better explain the voice control method in this application, the data processing module in this application is further described as follows.

FIG. 6 exemplarily provides a data processing module in the present application. Compared with the specific structure of the data processing module in FIG. 2, a flow control module and a fast matching module are newly added. The flow control module receives the text information from the ASR module and determines whether to send the text information to the fast matching module. When the flow control module sends the text information to the fast matching module, the fast matching module can determine a preset instruction identifier from a preset set, and determine the control instruction to send to the target device according to the preset instruction identifier. In the case that the fast matching module cannot determine a preset instruction identifier from the preset set, the corresponding control instruction can be further generated through the NLU module and the DM module, and sent to the target device. For specific implementation, reference may be made to the descriptions in the following method embodiments.

In this embodiment of the present application, the voice signal sent by the user for the first time is referred to as the second voice signal. The text information obtained by the voice control device according to the second voice signal is called the second text information, the control instruction generated according to the second text information is called the second control instruction, and the second control instruction is used to control the target device to enter the first operating state.
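Exemplarily, the cooperation between the flow control module, the fast matching module, and the NLU and DM modules can be sketched as follows. The contents of the preset set, the instruction identifiers, and the function names are all hypothetical; the decision of the flow control module is represented simply by a boolean argument, and the NLU and DM modules are replaced by a stand-in function.

```python
from typing import Dict, Optional

# Hypothetical preset set: text information -> preset instruction identifier.
PRESET_SET: Dict[str, str] = {
    "stop": "CMD_STOP",
    "just tune here": "CMD_STOP",
}


def fast_match(text: str) -> Optional[str]:
    """Look up the preset instruction identifier, or return None if absent."""
    return PRESET_SET.get(text.strip().lower())


def nlu_and_dm(text: str) -> str:
    """Stand-in for the slower NLU module + DM module path."""
    return "NLU_DM:" + text


def handle_text(text: str, send_to_fast_match: bool) -> str:
    """Route text information from the ASR module to one of the two paths."""
    if send_to_fast_match:
        identifier = fast_match(text)
        if identifier is not None:
            return identifier
    # Not routed to fast matching, or no identifier found: fall back.
    return nlu_and_dm(text)


print(handle_text("Stop", send_to_fast_match=True))
print(handle_text("play music", send_to_fast_match=True))
```

A dictionary lookup is used here only because it makes the fast path visibly cheaper than full semantic analysis; the present application does not restrict how the preset set is stored.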

The voice signal sent by the user for the second time is called the first voice signal, and the first voice signal is the voice signal in the process control performed by the user on the target device. The text information obtained by the voice control device according to the first voice signal is called the first text information, and the control instruction generated according to the first text information is called the first control instruction, and the first control instruction is used to control the target device from the first operating state. Switch to the second operating state.
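Because the second control instruction is issued earlier in time than the first control instruction, the naming can be made concrete with a small sketch. The class and method names are hypothetical, and the IDLE state before any instruction is received is an assumption added for illustration.

```python
from enum import Enum


class OperatingState(Enum):
    IDLE = "idle"                      # assumed state before any instruction
    FIRST = "first operating state"    # e.g. the window is moving down
    SECOND = "second operating state"  # e.g. the window has stopped


class TargetDevice:
    def __init__(self) -> None:
        self.state = OperatingState.IDLE

    def apply_second_control_instruction(self) -> None:
        """Issued first in time: enter the first operating state."""
        self.state = OperatingState.FIRST

    def apply_first_control_instruction(self) -> None:
        """Issued second in time: switch from the first to the second state."""
        if self.state is not OperatingState.FIRST:
            raise RuntimeError("process control requires the first operating state")
        self.state = OperatingState.SECOND


window = TargetDevice()
window.apply_second_control_instruction()  # e.g. "open the car window"
window.apply_first_control_instruction()   # e.g. "stop"
print(window.state)
```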

FIG. 7 is a schematic flowchart of a voice control method exemplarily provided by the application. The process includes:

Step 701: The voice control apparatus determines, according to the first voice signal, first text information with complete semantics.

The voice control device can recognize the received voice signal through the streaming voice recognition technology. In this way, the voice control device does not need to wait for a silent period, but starts to perform voice recognition after receiving the user's voice signal.

In case 1, the first voice signal sent by the user is a single character.

For example, the first voice signal sent by the user is "stop", and the user needs a period of time, such as 0.5 s, to finish saying the character "stop". The voice control device can perform the following operations: receive the voice signal "stop", and convert the voice signal "stop" into the text information "stop".

In case 2, the first voice signal sent by the user is a plurality of characters.

For example, the first voice signal sent by the user is "just tune here", and the user needs a period of time, such as 2 s, to finish saying the four characters of "just tune here". The voice control device can perform the following operations:

Time T1: The voice signal "just" (就) is received and converted into the character "just", that is, the text information "just" is generated.

Time T2: The voice signal "tune" (调) is received and converted into the character "tune", which is combined with the text information "just" generated at time T1 to generate the text information "just tune".

Time T3: The voice signal "to" (到) is received and converted into the character "to", which is combined with the text information "just tune" generated at time T2 to generate the text information "just tune to".

Time T4: The voice signal "here" (这) is received and converted into the character "here", which is combined with the text information "just tune to" generated at time T3 to generate the text information "just tune here".

In the above case 1, the text information recognized by the voice control device has complete semantics. In the above case 2, from time T1 to time T3, although the voice control device performs speech recognition, the recognized text information does not have complete semantics, whereas the text information obtained at time T4 has complete semantics. The voice control device therefore needs to determine whether the recognized text information has complete semantics. Here, text information having complete semantics can be understood to mean that the voice control device can determine corresponding structured information or a corresponding control instruction according to the text information.
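Exemplarily, the character-by-character accumulation in case 1 and case 2 can be sketched as follows. The completeness judgment is replaced here by a lookup in a small, hypothetical set of complete character sequences, and the four characters of case 2 are represented by their English glosses "just", "tune", "to" and "here"; both choices are assumptions for illustration only.

```python
from typing import Iterable, List, Optional, Set, Tuple

# Stand-in for the completeness judgment: sequences assumed to be complete.
COMPLETE_SEQUENCES: Set[Tuple[str, ...]] = {
    ("stop",),
    ("just", "tune", "to", "here"),
}


def has_complete_semantics(characters: List[str]) -> bool:
    return tuple(characters) in COMPLETE_SEQUENCES


def first_complete_text(stream: Iterable[str]) -> Optional[List[str]]:
    """Accumulate streamed characters and return as soon as they are complete."""
    recognized: List[str] = []
    for character in stream:       # one character per time T1, T2, ...
        recognized.append(character)
        if has_complete_semantics(recognized):
            return recognized      # no need to wait for trailing silence
    return None                    # stream ended without complete semantics


print(first_complete_text(["just", "tune", "to", "here"]))
print(first_complete_text(["just", "tune"]))
```

The point of the sketch is that the function returns at time T4 (or immediately for "stop") instead of waiting for a silence duration to elapse.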

In an optional implementation, a classification model can be preset to identify whether text information has complete semantics. The classification model can be called a first preset model. The input of the first preset model is the text information (or one or more characters contained in the text information) obtained by the voice control device through streaming speech recognition, and the output of the first preset model is first indication information, which is used to indicate whether the text information has complete semantics.

Exemplarily, the first indication information may be a preset bit. For example, when the preset bit is 1, it indicates that the input text information has complete semantics; when the preset bit is 0, it indicates that the input text information does not have complete semantics.

In an optional implementation manner, the first preset model may be obtained by training in the following manner:

Prepare a first training set in advance. The first training set includes a plurality of first training data, and each first training data includes first training text information and a first label, wherein the first training text information includes one or more characters, and the first label is used to indicate whether the first training text information has complete semantics.

Exemplarily, the first label may be manually pre-labeled, or may be automatically labeled during the machine learning process. The first label can use a preset bit to indicate whether the corresponding first training text information has complete semantics. For example, when the preset bit is 1, it indicates that the corresponding first training text information has complete semantics; when the preset bit is 0, it indicates that the corresponding first training text information does not have complete semantics.

Table 1 exemplarily provides a plurality of first training data in the first training set for this application.

Exemplarily, the first training data includes first training text information "just" (就) and a first label "0", and the first label "0" is used to indicate that the first training text information "just" does not have complete semantics.

For another example, the first training data includes first training text information "stop" and a first label "1", and the first label "1" is used to indicate that the first training text information "stop" has complete semantics.

Table 1

First training text information	First label	First training text information	First label
就 (just)	0	播 (play)	0
就、调 (just, tune)	0	播、放 (play)	0
就、调、到 (just, tune, to)	0	播、放、音 (play, sound)	0
就、调、到、这 (just, tune, to, here)	1	播、放、音、乐 (play music)	1
停 (stop)	1	stop	1

Further, one or more rounds of model training (which may be referred to as first model training) can be performed on a first training model according to the plurality of first training data in the first training set, and the trained model is used as the first preset model.

Exemplarily, in each round of first model training, the plurality of first training data in the first training set may be input into the first training model to obtain the output result of the first training model (referred to as the first output result). The first output result is, for example, a determination of whether the first training text information in each first training data has complete semantics. According to the first output result and the first label in each first training data, a model update parameter is determined, where the model update parameter is, for example, a gradient parameter. The current first training model is updated according to the model update parameter.

The next first model training is performed based on the updated first training model, and the above operations are repeated until the determined first output result meets the first preset condition.

Exemplarily, the output accuracy rate of the first training model can be determined according to the first output result. For example, if there are 1000 first training data in total and the output results corresponding to 900 of them in the first output result are correct, then the output accuracy rate is 90%. Correspondingly, the first preset condition may be that the output accuracy rate is greater than a preset accuracy rate. When the output accuracy rate of the first output result of the first training model is greater than the preset accuracy rate, it can be determined that the first training model has been trained, and the trained first training model can be used as the first preset model.
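Exemplarily, the loop of "train, compute the output accuracy rate, stop when the first preset condition is met" can be sketched as follows. The sketch substitutes a very simple character-level perceptron for the first training model, which is an assumption made only to keep the example self-contained and is not the model of the present application; the training data are the Chinese character sequences of Table 1, and the preset accuracy rate of 0.95 and the function names are likewise assumed.

```python
from typing import Dict, List, Tuple

# (first training text information, first label), taken from Table 1.
TRAINING_SET: List[Tuple[str, int]] = [
    ("就", 0), ("就调", 0), ("就调到", 0), ("就调到这", 1), ("停", 1),
    ("播", 0), ("播放", 0), ("播放音", 0), ("播放音乐", 1),
]


def predict(weights: Dict[str, float], bias: float, text: str) -> int:
    """Output 1 (complete semantics) if the summed character weights are positive."""
    score = bias + sum(weights.get(ch, 0.0) for ch in set(text))
    return 1 if score > 0 else 0


def accuracy(weights: Dict[str, float], bias: float,
             data: List[Tuple[str, int]]) -> float:
    correct = sum(predict(weights, bias, text) == label for text, label in data)
    return correct / len(data)


def train(data: List[Tuple[str, int]], preset_accuracy: float = 0.95,
          max_rounds: int = 100) -> Tuple[Dict[str, float], float]:
    weights: Dict[str, float] = {}
    bias = 0.0
    for _ in range(max_rounds):                 # one round of model training
        for text, label in data:
            error = label - predict(weights, bias, text)  # -1, 0 or +1
            if error:                           # perceptron update on a mistake
                for ch in set(text):
                    weights[ch] = weights.get(ch, 0.0) + error
                bias += error
        # First preset condition: output accuracy rate > preset accuracy rate.
        if accuracy(weights, bias, data) > preset_accuracy:
            break
    return weights, bias


weights, bias = train(TRAINING_SET)
print(accuracy(weights, bias, TRAINING_SET))
```

In practice the accuracy would normally be measured on data held out from training; the sketch follows the description above and evaluates on the training set itself.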

It should be noted that the voice control device can further update the model parameters of the first preset model according to the data obtained during the working process, so as to improve the accuracy of the model.

The voice control device processes the first voice signal through streaming voice recognition technology to obtain text information corresponding to the first voice signal (hereinafter referred to as third text information), where the third text information includes M characters, where M is a positive integer.

The voice control device inputs the third text information into the first preset model, and generates the first text information according to the output result of the first preset model and the third text information.

In one example, the output result of the first preset model indicates that the third text information has complete semantics, and the voice control apparatus may use the third text information as the first text information. For example, the third text information "just tune here", composed of the four characters "just", "tune", "to" and "here", is input into the first preset model, the output of the first preset model is "1", and the voice control device can take the third text information "just tune here" as the first text information.

In another example, the output result of the first preset model indicates that the third text information does not have complete semantics. In that case, after recognizing a new character through the streaming voice recognition technology, the voice control apparatus may input new third text information, composed of the new character and the M characters, into the first preset model, and repeat this until the output result of the first preset model indicates that the input third text information has complete semantics; that third text information is then used as the first text information.

Step 702: The voice control apparatus controls, according to the first text information and the second text information, the target device to switch from the first operating state to the second operating state among the designated operating states.

Wherein, before acquiring the first text information, the voice control apparatus first acquires the second text information. The second text information is used to control the target device to enter the first running state in the specified running state, and there is a contextual relationship between the second text information and the first text information.

It should be noted first that the voice control device stores a session state, and the session state may include text information and/or structured information determined by the voice control device according to the most recently received and processed voice signal. The voice control device can determine whether to generate a corresponding control instruction according to the currently received voice signal in combination with the stored session state.

In this application, the text information in the session state may be referred to as historical text information, and the structured information in the session state may be referred to as historical structured information. When the session state satisfies the third preset condition, the historical text information and the historical structured information may also be referred to as second text information and second structured information, respectively.

In an optional manner, there is a contextual relationship between the historical text information and the first text information, which can also be understood as the historical text information being the preceding text of the first text information and/or the first text information being the following text of the historical text information. The contextual relationship may satisfy any one or more of the following conditions:

Condition 1, the historical text information and the first text information both correspond to the same target device. For example, both the historical text information and the first text information correspond to car windows. For another example, both the historical text information and the first text information correspond to seats.

Condition 2, the execution action corresponding to the historical text information is of the same type as the execution action corresponding to the first text information. For example, the historical text information is used to instruct the car window to move down, and the first text information is used to instruct the car window to stop moving down, both of which correspond to the action type of moving down. For another example, the historical text information is used to instruct the seat to be raised, and the first text information is used to instruct the seat to stop being raised, both of which correspond to the action type of raising.

The following example illustrates the situation where there is a contextual relationship between the historical text information and the first text information:

(1) The historical text information is "open the car window", and the first text information is "just tune here".

(2) The historical text information is "open car window", and the first text information is "stop".

(3) The historical text information is "open car window", and the first text information is "right rear car window".

(4) The historical text information is "Turn down the wind power of the air conditioner", and the first text information is "OK".

(5) The historical text information is "window down", and the first text information is "window down stop".

A contextual relationship between the historical text information and the first text information may be used as the third preset condition. The historical text information may instruct the target device to enter a certain operating state, and the first text information may instruct the target device to switch from the operating state to another operating state. Equivalently, the second text information instructs the target device to enter the first operating state, and the first text information instructs the target device to switch from the first operating state to the second operating state.

In another optional manner, there is no contextual relationship between the historical text information and the first text information: the historical text information may instruct a certain device to enter a certain operating state, and the first text information may instruct another device to enter another operating state. The following examples illustrate the situation where there is no contextual relationship between the historical text information and the first text information:

(a) The historical text information is "open car window", and the first text information is "play music".

(b) The historical text information is "open car window", and the first text information is "open bluetooth".

(c) The historical text information is "open the car window", and the first text information is "turn off the air conditioner".

The above is only an exemplary example, and does not constitute a limitation to the method of the present application.

After determining the first text information, the voice control device may determine whether there is a contextual relationship between the historical text information and the first text information. In one example, whether there is a contextual relationship between the historical text information and the first text information may be determined through the above-mentioned condition 1 and/or condition 2.
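Exemplarily, a judgment based on condition 1 and/or condition 2 can be sketched as follows. The sketch assumes that the target device and the action type corresponding to each piece of text information have already been determined; the StructuredInfo type, its field values, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuredInfo:
    device: str       # target device, e.g. "window"
    action_type: str  # action type, e.g. "move_down"


def has_contextual_relationship(history: StructuredInfo, current: StructuredInfo,
                                use_condition_1: bool = True,
                                use_condition_2: bool = True) -> bool:
    """Check the selected conditions; every selected condition must hold."""
    checks = []
    if use_condition_1:  # condition 1: same target device
        checks.append(history.device == current.device)
    if use_condition_2:  # condition 2: same action type
        checks.append(history.action_type == current.action_type)
    return bool(checks) and all(checks)


open_window = StructuredInfo(device="window", action_type="move_down")
stop_window = StructuredInfo(device="window", action_type="move_down")
play_music = StructuredInfo(device="speaker", action_type="play")

print(has_contextual_relationship(open_window, stop_window))
print(has_contextual_relationship(open_window, play_music))
```

Here "stop" is assigned the action type "move_down" because, as in the example for condition 2, stopping the downward movement corresponds to the action type of moving down.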

In another example, a classification model may be preset to determine whether there is a contextual relationship between two pieces of text information. The classification model may be called a second preset model. The input of the second preset model is two pieces of text information, specifically the historical text information and the first text information, and the output of the second preset model is second indication information, which is used to indicate whether there is a contextual relationship between the historical text information and the first text information.

Exemplarily, the second indication information may be a preset bit. For example, when the preset bit is 1, it indicates that there is a contextual relationship between the input historical text information and the first text information; when the preset bit is 0, it indicates that there is no contextual relationship between the input historical text information and the first text information.

In an optional implementation manner, the second preset model may be obtained by training in the following manner:

Prepare a second training set in advance. The second training set includes a plurality of second training data, and each second training data includes two pieces of text information and a second label, wherein the second label is used to indicate whether there is a contextual relationship between the two pieces of text information. Exemplarily, the two pieces of text information have a sequential order.

Exemplarily, the second label may be manually pre-labeled, or may be automatically labeled during the machine learning process. The second label can use a preset bit to indicate whether there is a contextual relationship between the two corresponding pieces of text information. For example, when the preset bit is 1, it indicates that there is a contextual relationship between the two corresponding pieces of text information; when the preset bit is 0, it indicates that there is no contextual relationship between them.

Table 2 exemplarily provides a plurality of second training data in the second training set for this application.

Exemplarily, the second training data includes two pieces of text information, "open the car window" and "turn off the air conditioner", and a second label "0", where the second label "0" is used to indicate that there is no contextual relationship between "open the car window" and "turn off the air conditioner".

For another example, the second training data includes two pieces of text information, "open the car window" and "just tune here", and a second label "1", where the second label "1" is used to indicate that there is a contextual relationship between "open the car window" and "just tune here".

Table 2

Previous text information	Next text information	Second label
打开车窗 (open the car window)	就调到这 (just tune here)	1
打开车窗 (open the car window)	停 (stop)	1
打开车窗 (open the car window)	右后车窗 (right rear car window)	1
调小空调风力 (turn down the wind power of the air conditioner)	好了 (OK)	1
打开车窗 (open the car window)	播放音乐 (play music)	0
打开车窗 (open the car window)	打开蓝牙 (turn on Bluetooth)	0
打开车窗 (open the car window)	关闭空调 (turn off the air conditioner)	0

Further, one or more rounds of model training (which may be referred to as second model training) can be performed on the second training model according to the plurality of second training data in the second training set, and the trained model is used as the second preset model.

Exemplarily, in each second model training, the plurality of second training data in the second training set may be input into the second training model to obtain the output result of the second training model (referred to as the second output result). The second output result indicates, for example, whether there is a contextual relationship between the two pieces of text information in each second training data. According to the second output result and the second label in each second training data, a model update parameter is determined, where the model update parameter is, for example, a gradient parameter. The current second training model is updated according to the model update parameter.

Execute the next second model training based on the updated second training model, and repeat the above operations until the determined second output result meets the second preset condition.

Exemplarily, the output accuracy rate of the second training model may be determined according to the second output result. For example, if there are 1000 second training data in total and the output results corresponding to 900 of them are correct, the output accuracy rate is 90%. Correspondingly, the second preset condition may be that the output accuracy rate is greater than a preset accuracy rate. When the output accuracy rate of the second output result of the second training model is greater than the preset accuracy rate, it can be determined that the second training model has been trained, and the trained second training model can be used as the second preset model.
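The stopping rule described above can be sketched as follows. This is only a minimal illustration, assuming that outputs and labels are 0/1 integers; the names `output_accuracy` and `train_until_accurate`, the `predict` and `update` callables, and the `max_rounds` safety cap are hypothetical and are not defined in this application.

```python
from typing import Callable, List, Sequence


def output_accuracy(outputs: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of second output results that match the second labels."""
    if len(outputs) != len(labels) or len(labels) == 0:
        raise ValueError("outputs and labels must be non-empty and equal in length")
    correct = sum(1 for out, lab in zip(outputs, labels) if out == lab)
    return correct / len(labels)


def train_until_accurate(
    predict: Callable[[], List[int]],      # hypothetical: runs the model on the training set
    update: Callable[[List[int]], None],   # hypothetical: applies the model update parameter
    labels: Sequence[int],
    preset_accuracy: float,
    max_rounds: int = 100,                 # assumption: safety cap, not part of the source
) -> int:
    """Repeat second model training until the accuracy exceeds the preset accuracy.

    Returns the number of update rounds that were performed.
    """
    for rounds_done in range(max_rounds + 1):
        outputs = predict()
        if output_accuracy(outputs, labels) > preset_accuracy:
            return rounds_done
        if rounds_done < max_rounds:
            update(outputs)
    raise RuntimeError("second preset condition not met within max_rounds")
```

With 900 correct results out of 1000, `output_accuracy` yields 0.9, which matches the 90% example; whether training stops then depends on the chosen preset accuracy rate.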

It should be noted that the voice control device can further update the model parameters of the second preset model according to the data obtained during the working process, so as to improve the accuracy of the model.

In an optional implementation, the voice control device inputs the historical text information and the first text information into the second preset model, and determines, according to the output result of the second preset model, whether there is a contextual relationship between the historical text information and the first text information, that is, whether the second text information exists. The two cases are explained as follows:

In case 1, when the second text information exists, the voice control device determines a control instruction according to the second text information and the first text information, and the control instruction is used to control the target device to switch from the first operating state to the second operating state.

The following first explains how the target device enters the first operating state.

In an optional specific implementation, the voice control apparatus acquires the second text information in the same manner as described above for acquiring the first text information. Exemplarily, the voice control apparatus acquires the second voice signal sent by the user, and obtains N characters corresponding to the second voice signal through voice recognition, where N is a positive integer. When it is determined that the N characters have complete semantics, the voice control device uses the text information composed of the N characters as the second text information, performs natural language understanding on the second text information to obtain second structured information, and then controls the target device to enter the first operating state according to the second structured information.

In addition, the second voice signal sent by the user is used to instruct the target device to enter the first operating state, for example, to instruct the window to enter the operating state of moving downward, or to instruct the seat to enter the operating state of slowly rising. That is to say, the time delay requirement of the execution process corresponding to the second voice signal is lower than the time delay requirement of the execution process (ie, process control) corresponding to the first voice signal. Therefore, the voice control device may also control the target device to enter the first operating state based on an existing method, which is not limited in this application.

One or more second preset structured information is included in the voice control device. Each of the one or more second preset structured information corresponds to a preset set, and the preset set includes one or more preset text information.

In an optional implementation manner, in the preset set corresponding to the second preset structured information, one or more preset text information may correspond to one or more preset instruction identifiers.

Exemplarily, Table 3 provides a correspondence between the second preset structured information and the preset set provided by the present application.

For example, in the preset set corresponding to the second preset structured information "control-window.adjust", the preset instruction identifier corresponding to the preset text information "stop, stop, ok, ok" is "window stop".

For another example, in the preset set corresponding to the second preset structured information "control-chair.adjust", the preset instruction identifier corresponding to the preset text information "stop, stop, ok, ok" is "seat stop".

Table 3

In an optional specific implementation, the voice control device determines, according to the second structured information, the preset set corresponding to the second structured information from the correspondence between the second preset structured information and the preset set, and then determines whether the first text information is included in the preset set corresponding to the second structured information. When the first text information is included in the preset set, the voice control apparatus may determine a control instruction for controlling the target device according to the preset instruction identifier corresponding to the first text information in the preset set.

Taking Table 3 as an example, the second structured information is "control-window.adjust". The voice control device determines that the first text information "stop" is in the preset set corresponding to "control-window.adjust", and further determines that the preset instruction identifier corresponding to "stop" is "window stop". According to the preset instruction identifier "window stop", the voice control device determines to issue a window stop instruction to the car window.

Further, the first preset structured information corresponding to the preset text information may also be set in the preset set corresponding to each second preset structured information. For example, in Table 4, in the preset set corresponding to "control-window.adjust", the preset text information "stop, stop, ok, ok" corresponds to the preset instruction identifier "window stop", and further corresponds to the first preset structured information "control-window.stop".
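The lookup described for Table 3 can be sketched as a two-level mapping. This is only an illustrative sketch: the dictionary contents are modeled on the two examples given above, and the names `PRESET_SETS` and `quick_match` are hypothetical rather than part of this application.

```python
from typing import Dict, Optional

# Hypothetical contents, modeled on the examples given for Table 3:
# second preset structured information -> {preset text -> preset instruction identifier}
PRESET_SETS: Dict[str, Dict[str, str]] = {
    "control-window.adjust": {"stop": "window stop", "ok": "window stop"},
    "control-chair.adjust": {"stop": "seat stop", "ok": "seat stop"},
}


def quick_match(second_structured_info: str, first_text: str) -> Optional[str]:
    """Return the preset instruction identifier, or None when there is no match."""
    preset_set = PRESET_SETS.get(second_structured_info)
    if preset_set is None:
        # No preset set exists for this second structured information.
        return None
    return preset_set.get(first_text)
```

A result of `None` corresponds to the situation in which the first text information is not found and further processing is needed; a string result can be mapped directly to a control instruction.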

Table 4

If the control instruction determined by the voice control device from the preset set corresponding to the second structured information according to the first text information is invalid, dialogue management can be performed according to the first preset structured information for subsequent instruction issuance. An invalid instruction may mean that the voice control device does not issue the control instruction, or that the target device does not execute the control instruction after the control instruction is issued to it.

Exemplarily, if the voice control device determines that the control instruction is a "window deceleration instruction" and the current speed of the downward movement of the car window has already reached the minimum speed, the voice control device may determine that the control instruction is invalid. Further, the voice control device can initiate a dialogue according to the first preset structured information "control-window.slower" corresponding to the "window deceleration instruction", such as reminding the user that the minimum descent speed has been reached, or asking the user whether it is necessary to stop moving the window down.

In another optional implementation manner, the preset set corresponding to the second preset structured information may include one or more preset text information and one or more first preset structured information.

Exemplarily, Table 5 shows the correspondence between the second preset structured information and the preset set provided by the present application.

For example, in the preset set corresponding to the second preset structured information "control-window.adjust", the first preset structured information corresponding to the preset text information "stop, stop, ok, ok" is "stop".

Table 5

In an optional specific implementation, the voice control device determines, according to the second structured information, the preset set corresponding to the second structured information from the correspondence between the second preset structured information and the preset set, and then determines whether the first text information is included in the preset set corresponding to the second structured information. In the case that the first text information is included in the preset set, the voice control device may generate third structured information according to the first preset structured information corresponding to the first text information in the preset set, in combination with the second structured information, and determine a control instruction for controlling the target device according to the third structured information.

Taking Table 5 as an example, the second structured information is "control-window.adjust". The voice control device determines that the first text information "stop" is in the preset set corresponding to "control-window.adjust", and further determines that the first preset structured information corresponding to "stop" is "stop". The voice control device generates third structured information such as "control-window.stop" according to the first preset structured information "stop" and the second structured information "control-window.adjust", and then issues a window stop instruction to the car window according to the third structured information "control-window.stop".
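The generation of the third structured information in this example can be sketched as a simple string operation. The application does not define the combination rule, so the sketch below assumes that structured information has the form "target.action" and that the action part is replaced; the function name `combine_structured_info` is hypothetical.

```python
def combine_structured_info(second_structured_info: str, first_preset_info: str) -> str:
    """Replace the action part (after the last '.') of the earlier structured information.

    Assumption: structured information looks like 'control-window.adjust', where the
    text before the last '.' names the target and the text after it names the action.
    """
    target, sep, _old_action = second_structured_info.rpartition(".")
    if not sep:
        raise ValueError("expected structured information of the form 'target.action'")
    return f"{target}.{first_preset_info}"
```

Under this assumption, "control-window.adjust" combined with "stop" gives "control-window.stop". Cases in which the first structured information changes the target rather than the action are not covered by this sketch.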

If the voice control device traverses all the preset text information in the preset set corresponding to the second structured information and determines that the preset set does not contain the first text information, the voice control device may perform natural language understanding on the first text information to obtain the first structured information, then generate the third structured information according to the first structured information and the second structured information, and determine the control instruction for controlling the target device according to the third structured information.

Taking Table 3 as an example, the second structured information is "control-window.adjust", and the voice control device determines that the first text information "just adjust to here" is not in the preset set corresponding to "control-window.adjust". The voice control device performs natural language understanding on the first text information "just adjust to here" and obtains the first structured information such as "stop". The voice control device then generates third structured information such as "control-window.stop" according to the second structured information "control-window.adjust" and the first structured information "stop", and issues a window stop instruction to the car window according to the third structured information "control-window.stop".

In this embodiment of the present application, the voice control device may not include a preset set corresponding to the second structured information, that is, the one or more second preset structured information in the voice control device does not include the second structured information. In this case, the voice control device can perform natural language understanding on the first text information to obtain the first structured information, then generate the third structured information according to the first structured information and the second structured information, and determine the control instruction for controlling the target device according to the third structured information.

For example, the second structured information is "media-set.adjust" (where "media-set.adjust" is used to control the car speaker to play music), and the second structured information is not included in the plurality of second preset structured information. If the first structured information is still "stop", the voice control device can generate the third structured information "media-set.stop" according to the second structured information "media-set.adjust" and the first structured information "stop", and then, according to the third structured information "media-set.stop", generate a stop instruction for controlling the car speaker to stop playing music.

In addition, in this application, the control instruction determined by the voice control device according to the third structured information may be invalid. For example, if the second structured information is "control-window.adjust" and the first structured information is "top", third structured information such as "control-top-window.adjust" is generated, and the corresponding control instruction is, for example, adjusting the sunroof. Based on the previous control instruction for adjusting the window, the voice control device can determine that the generated control instruction is an invalid instruction.
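The order of the two paths (quick matching first, natural language understanding as a fallback) can be sketched as follows. This is a minimal sketch under stated assumptions: `resolve_control`, `toy_understand`, and the "target.action" string format are hypothetical, and the toy understanding function merely stands in for a real NLU module.

```python
from typing import Callable, Dict, Optional, Tuple

PresetSets = Dict[str, Dict[str, str]]


def toy_understand(text: str) -> str:
    # Toy stand-in for natural language understanding: a real module would
    # analyse the text; here every input is mapped to "stop".
    return "stop"


def resolve_control(
    first_text: str,
    second_structured_info: str,
    preset_sets: PresetSets,
    understand: Callable[[str], str] = toy_understand,
) -> Tuple[str, str]:
    """Return ('quick', identifier) or ('nlu', third structured information)."""
    identifier: Optional[str] = preset_sets.get(second_structured_info, {}).get(first_text)
    if identifier is not None:
        return "quick", identifier
    # Fallback: understand the text, then merge with the earlier structured information.
    # Assumption: structured information has the form 'target.action'.
    first_structured_info = understand(first_text)
    target = second_structured_info.rpartition(".")[0]
    return "nlu", f"{target}.{first_structured_info}"
```

For instance, with a preset set that contains only "stop" for "control-window.adjust", the text "just adjust to here" takes the fallback path and produces "control-window.stop".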

The voice control device can update the session state according to the newly generated third structured information, and when the voice control device receives a new voice signal again, it can determine whether to generate a valid control command according to the new voice signal and the session state. Exemplarily, the voice control device receives a voice signal such as "stop", then the voice control device can determine to stop the sunroof according to "stop" and "control-top-window.adjust" in the session state.

In still other possible manners, the voice control device may also initiate an inquiry, and communicate with the user through dialogue to generate effective control instructions. For example, when it is determined that the control instruction corresponding to the third structured information is invalid, a query sentence is generated, such as "Do you need to adjust the sunroof?" or "How do you adjust the sunroof?", and when it is determined that the user needs to stop adjusting the sunroof, a sunroof stop instruction is issued.

It should be noted that, in the above example, the target device may not be indicated in the first text information (or the first voice signal). Because the second text information and the first text information have a contextual relationship, the voice control device can determine that they correspond to the same target device, that is, the target device corresponding to the first text information is the same as the target device corresponding to the second text information. For example, the second text information is "open car window", the target device is the car window, and the first text information is "stop". Although the first text information does not contain the target device, it can be determined from the contextually related second text information that the target device of the first text information is also the car window.

In addition, the present application does not exclude the situation in which the target device is indicated in the first text information (or the first voice signal). For example, the second text information is "open the window" and the first text information is "stop the window", both of which indicate that the target device is the car window.

In case 2, when the second text information does not exist, the voice control device performs natural language understanding on the first text information, obtains the first structured information, and updates the conversation state according to the first structured information.

In the first possible manner, a conversation state is stored in the voice control device, which is equivalent to historical text information and historical structured information being stored in the voice control device. Since there is no contextual relationship between the historical text information and the first text information, the voice control device may perform natural language understanding on the first text information to obtain the first structured information, and then update the conversation state according to the first text information and the first structured information.

In the second possible manner, the conversation state in the voice control device is empty, which is equivalent to no historical text information or historical structured information being stored in the voice control device. The voice control device can perform natural language understanding on the first text information to obtain the first structured information, and then use the first text information and the first structured information as the current conversation state.

When the voice control device receives the voice signal again, it can generate a control instruction according to the new voice signal and the updated session state, or update the session state again.
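The two manners of case 2 can be sketched with a small state object. This is only a sketch: `SessionState` and `update_session` are hypothetical names, and replacing (rather than appending to) the stored history is an assumption, since the application only says that the conversation state is updated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionState:
    text: Optional[str] = None             # historical text information
    structured_info: Optional[str] = None  # historical structured information

    def is_empty(self) -> bool:
        return self.text is None and self.structured_info is None


def update_session(state: SessionState, first_text: str, first_structured_info: str) -> str:
    """Store the new utterance as the conversation state (case 2).

    Returns 'initialized' if the state was empty (second manner) or
    'replaced' if earlier history was overwritten (first manner).
    """
    outcome = "initialized" if state.is_empty() else "replaced"
    state.text = first_text
    state.structured_info = first_structured_info
    return outcome
```

A later voice signal would then be checked against the stored `text` for a contextual relationship, and against `structured_info` when a control instruction is generated.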

With reference to the data processing modules in FIG. 6, the flow control module can be provided with the first preset model and the second preset model. In other words, the flow control module is used to determine the first text information with complete semantics according to the first voice signal, and to determine whether there is a contextual relationship between the first text information and the historical text information. A preset database may be set in the quick matching module, and the preset database includes the one or more second preset structured information. In other words, the quick matching module is used to determine whether the first text information corresponds to a preset instruction identifier.

Based on the modules in FIG. 6 , another voice control method is provided, and the flow of the method can be referred to as shown in FIG. 8 .

Step 801, the ASR module determines third text information according to the first voice signal, wherein the third text information includes M characters, and M is a positive integer.

Step 802, the ASR module sends the third text information to the flow control module. Correspondingly, the flow control module receives the third text information from the ASR module.

Step 803, the flow control module inputs the third text information into the first preset model, and determines whether the third text information has complete semantics. If yes, go to step 804, otherwise go back to step 801.

Step 804, the flow control module determines whether the historical text information has a contextual relationship with the first text information (ie, the third text information obtained in the above step 803). If yes, execute step 805, otherwise, the first text information is processed by the NLU module and the DM module.

Step 805, the flow control module sends the first text information to the quick matching module. Correspondingly, the quick matching module receives the first text information from the flow control module.

Step 806, the quick matching module determines whether there is a preset instruction identifier corresponding to the first text information in the preset set corresponding to the second structured information. If yes, execute step 807, otherwise, the first text information is processed by the NLU module and the DM module.

Step 807, the quick matching module sends the preset instruction identifier corresponding to the first text information to the decision module.

Step 808, the decision-making module generates a control instruction according to the preset instruction identifier corresponding to the first text information.

Step 809, the decision module sends a control instruction to the target device.

For the content not described in detail in the above steps 801 to 809, please refer to the description in the related embodiment of FIG. 7 .
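Steps 801 to 809 can be summarized as a single routing function. The sketch below is illustrative only: `run_flow` and the toy predicates are hypothetical stand-ins for the ASR module, the first preset model, and the second preset model, and the phrases accepted by the toy predicates are assumptions chosen for demonstration.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def toy_complete(text: str) -> bool:
    # Toy stand-in for the first preset model (complete-semantics check).
    return text in {"stop", "just adjust to here", "play music"}


def toy_context(history_text: str, first_text: str) -> bool:
    # Toy stand-in for the second preset model (contextual-relationship check).
    return history_text == "open the window" and first_text in {"stop", "just adjust to here"}


def run_flow(
    partial_texts: Iterable[str],           # step 801: growing ASR output
    history_text: Optional[str],
    preset_set: Dict[str, str],             # preset set for the second structured information
    has_complete_semantics: Callable[[str], bool] = toy_complete,
    has_context: Callable[[str, str], bool] = toy_context,
) -> Tuple[str, Optional[str]]:
    """Return ('quick', identifier), ('nlu_dm', first_text) or ('incomplete', None)."""
    for third_text in partial_texts:        # steps 801 to 803
        if not has_complete_semantics(third_text):
            continue                        # back to step 801
        first_text = third_text
        # Step 804: check the contextual relationship with the historical text.
        if history_text is None or not has_context(history_text, first_text):
            return "nlu_dm", first_text
        # Steps 805 and 806: quick matching against the preset set.
        identifier = preset_set.get(first_text)
        if identifier is None:
            return "nlu_dm", first_text
        # Steps 807 to 809 would turn the identifier into a control instruction.
        return "quick", identifier
    return "incomplete", None
```

The function returns as soon as a prefix with complete semantics is seen, which reflects the point of the flow: no wait for a silence period is modeled anywhere in the loop.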

FIG. 9 exemplarily provides a schematic flowchart of the input and output of the two preset models in the flow control module. The input of the first preset model is the third text information, and in this example the output of the first preset model indicates that the third text information has complete semantics. The flow control module uses the third text information as the first text information, and inputs the historical text information and the first text information into the second preset model. For example, the historical text information is "open the car window", and the output of the second preset model indicates that the historical text information and the first text information have a contextual relationship. Exemplarily, the first preset model and the second preset model may be obtained through self-supervised learning.

In order to better explain the embodiments of the present application, descriptions are given below in conjunction with specific scenarios.

In the vehicle-mounted scene in FIG. 3 , when the user sends a voice signal (ie, the second voice signal) for the first time, the user says "open the car window", and the voice control device controls the car window to move down slowly in response to the voice signal. The historical text information is "open car window", and the historical structured information is "control-window.adjust".

When the user sends a voice signal (ie, the first voice signal) for the second time, there may be the following examples:

Example 1, the voice signal (that is, the first voice signal) sent by the user for the second time is "stop", referring to the voice control flow exemplarily shown in FIG. 10 , including the following steps:

Step 1, the voice control device determines that the text message "stop" has complete semantics;

Step 2, the voice control device determines that the text information "stop" and the text information "open car window" have a contextual relationship.

Step 3, the voice control device determines that the preset set corresponding to the structured information "control-window.adjust" includes the text information "stop", determines that the preset instruction identifier corresponding to the text information "stop" is "window stop", and determines the window stop instruction according to the preset instruction identifier "window stop".

Example 2, the voice signal (that is, the first voice signal) sent by the user for the second time is "just adjust to here", referring to the voice control flow exemplarily shown in FIG. 11, including the following steps:

Step 1, the voice control device determines that the text information "just" does not have complete semantics;

Step 2, the voice control device determines that the text information "just adjust" does not have complete semantics;

Step 3, the voice control device determines that the text information "just adjust to" does not have complete semantics;

Step 4, the voice control device determines that the text information "just adjust to here" has complete semantics;

Step 5, the voice control device determines that the text information "just adjust to here" has a contextual relationship with "open the car window".

Step 6, the voice control device determines that the preset set corresponding to the structured information "control-window.adjust" does not include the text information "just adjust to here".

Step 7, the voice control device performs semantic analysis processing on the text information "just adjust to here" to obtain structured information "stop".

Step 8: After the voice control device performs dialogue management on the structured information "stop" and the structured information "control-window.adjust", the structured information "control-window.stop" is obtained.

Step 9, the voice control device generates a window stop command according to the structured information "control-window.stop".

Example 3, the voice signal (that is, the first voice signal) issued by the user for the second time is "playing music", referring to the voice control flow exemplarily shown in Figure 12, including the following steps:

Step 1, the voice control device determines that the text information "play" does not have complete semantics;

Step 2, the voice control device determines that the text information "playing" does not have complete semantics;

Step 3, the voice control device determines that the text information "playing sound" does not have complete semantics;

Step 4, the voice control device determines that the text information "playing music" has complete semantics;

Step 5, the voice control device determines that the text information "playing music" and "opening the car window" do not have a contextual relationship.

The voice control device performs semantic analysis, dialogue management, etc. on "playing music", and updates the conversation state.

For the content not described in detail in the above examples 1 to 3, please refer to the description in the related embodiment of FIG. 7 .

In the in-vehicle scene of FIG. 3 , in a specific optional manner, the car window may be controlled by a motor in the in-vehicle circuit. The voice control device may send a control instruction to the in-vehicle circuit, and the in-vehicle circuit controls the power supply of the motor according to the control instruction so as to control the car window. Exemplarily, in the above examples 1 and 2, when the voice control device controls the window to move down in response to the second voice signal, the voice control device may send a window down instruction to the in-vehicle circuit, and the in-vehicle circuit connects the motor power supply according to the window down instruction, so that the motor works and the window moves down slowly. When the voice control device controls the window to stop moving downward in response to the first voice signal, the voice control device may send a window stop instruction to the in-vehicle circuit, and the in-vehicle circuit disconnects the motor power supply according to the window stop instruction, so that the motor stops working and the window stops moving.

In another specific optional manner, the car window may be controlled by a stepping motor in a stepping circuit. The voice control device may send a stepping signal to the stepping motor, and the stepping motor controls the car window according to the stepping signal. Exemplarily, in the above examples 1 and 2, when the voice control device controls the window to move downward in response to the second voice signal, the voice control device may send a start stepping signal to the stepping motor according to the window down instruction to control the stepping circuit to work, so that the window moves down slowly. When the voice control device controls the car window to stop moving downward in response to the first voice signal, the voice control device may send a stop stepping signal to the stepping motor to control the stepping circuit to stop working, so that the car window stops moving.

In the above technical solution, the voice control device can obtain the second voice signal, determine the second text information according to the second voice signal, perform natural language understanding on the second text information to obtain the second structured information, and then control the target device, according to the second structured information, to enter the first operating state of the specified operating states. While the target device is in the first operating state, the voice control device may perform streaming speech recognition on the first voice signal as it is continuously obtained, so as to obtain the corresponding M characters. When it is determined that the text information composed of the M characters has complete semantics, that text information is used as the first text information. The voice control device does not need to wait for the silence period after the user finishes sending the voice signal; instead, once text information with complete semantics is obtained, it infers that the user has finished sending the voice signal, thereby effectively reducing the control delay.

Based on the voice control method in the present application, the time delay generated by the voice control device for processing the voice signal can be reduced.

FIG. 13 is a second schematic diagram, provided by the present application, of the time delay generated when a voice control device processes a voice signal. The voice control device performs voice recognition from the moment of receiving the voice signal. When the preset instruction identifier corresponding to the first text information cannot be determined according to the preset set corresponding to the second structured information, the corresponding control instruction can be obtained after processing by the NLU module and the DM module, and sent to the target device. Compared with the time delay diagram shown in FIG. 5 , the method of the present application can at least avoid the time delay caused by the voice control apparatus waiting for the silent duration.

FIG. 14 is a third schematic diagram, provided by the present application, of the time delay generated when a voice control device processes a voice signal. The voice control device performs voice recognition from the moment of receiving the voice signal, and determines the preset instruction identifier corresponding to the first text information according to the preset set corresponding to the second structured information, so as to obtain the corresponding control instruction and deliver it to the target device. Compared with the time delay diagram shown in FIG. 5 , the method of the present application can not only avoid the delay caused by the voice control device waiting for the silent duration, but also avoid the delay caused by the processing of the NLU module and the DM module.

The various embodiments described herein may be independent solutions, or may be combined according to internal logic, and these solutions all fall within the protection scope of the present application.

It can be understood that, in the foregoing method embodiments, the methods and operations implemented by the voice control device may also be implemented by components (eg, chips or circuits) that can be used in the voice control device.

The division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation. In addition, each functional module in each embodiment of the present application may be integrated into one processor, or may exist physically alone, or two or more modules may be integrated into one module. The above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules.

Based on the above content and the same concept, FIG. 15 and FIG. 16 are schematic structural diagrams of possible voice control apparatuses provided by the present application. These voice control apparatuses can be used to implement the functions of the voice control apparatuses in the above method embodiments, and thus can also achieve the beneficial effects of the above method embodiments.

As shown in FIG. 15, the voice control apparatus includes a processing module 1501 and a control module 1502. In an optional implementation manner, the processing module 1501 may be configured to execute step 701 in the method embodiment exemplarily shown in FIG. 7, and the control module 1502 may be configured to execute step 702 in the method embodiment exemplarily shown in FIG. 7. In another optional implementation manner, the processing module 1501 may be configured to execute steps 801 to 805 in the method embodiment exemplarily shown in FIG. 8, and the control module 1502 may be configured to execute steps 806 to 809 in the method embodiment exemplarily shown in FIG. 8.

In an optional implementation manner, the processing module 1501 is used to determine the first text information with complete semantics according to the first voice signal; the control module 1502 is used to control the target device to switch in the specified operating state according to the first text information and the second text information, wherein the second text information is acquired before the first text information, the second text information is used to control the target device to enter the first operating state in the specified operating state, and the second text information has a contextual relationship with the first text information.

In an optional implementation manner, before the processing module 1501 determines the first text information with complete semantics according to the first voice signal, the processing module 1501 is further configured to: determine the second text information with complete semantics according to a second voice signal; and perform natural language understanding on the second text information to obtain second structured information; the control module 1502 is further configured to: control the target device to enter the first operating state according to the second structured information.

In an optional implementation manner, the control module 1502 is specifically configured to: determine a preset set corresponding to the second structured information according to the second structured information corresponding to the second text information, where the preset set includes a correspondence between one or more pieces of preset text information and preset instruction identifiers; and when the one or more pieces of preset text information include the first text information, determine the control instruction according to the preset instruction identifier corresponding to the first text information, wherein the control instruction is used to control the target device to switch from the first operating state in the specified operating state to a second operating state in the specified operating state.

In an optional implementation manner, the control module 1502 is further configured to: when the first text information is different from every one of the one or more pieces of preset text information, perform natural language understanding on the first text information to obtain first structured information; and determine the control instruction according to the first structured information and the second structured information.

In an optional implementation manner, the control module 1502 is further configured to: after determining the control instruction according to the first structured information and the second structured information, in the case that the control instruction is invalid, update the second structured information according to the first structured information.
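As a non-limiting illustration, the three behaviors of the control module described above (preset set lookup, fallback to natural language understanding, and update of the second structured information when the control instruction is invalid) may be sketched in Python as follows. All names here (StructuredInfo, PRESET_SETS, decide, toy_nlu, toy_dm, the instruction identifiers) are hypothetical and are not taken from the application. The sketch also assumes that an invalid control instruction is represented by None and that the update simply replaces the second structured information with the first.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class StructuredInfo:
    """Hypothetical structured information produced by NLU."""
    device: str
    action: str


# Hypothetical preset sets, keyed by the structured information of the
# previous command; each maps preset text to a preset instruction identifier.
PRESET_SETS: Dict[StructuredInfo, Dict[str, str]] = {
    StructuredInfo("window", "open"): {
        "a bit more": "WINDOW_OPEN_STEP",
        "stop": "WINDOW_STOP",
    },
}


def decide(
    first_text: str,
    second_info: StructuredInfo,
    nlu: Callable[[str], StructuredInfo],
    dm: Callable[[StructuredInfo, StructuredInfo], Optional[str]],
) -> Tuple[Optional[str], StructuredInfo, str]:
    """Return (control instruction, second structured information, path)."""
    preset_set = PRESET_SETS.get(second_info, {})
    if first_text in preset_set:
        # Preset hit: NLU and DM are skipped entirely.
        return preset_set[first_text], second_info, "preset"
    # No preset text matches: fall back to NLU plus dialog management.
    first_info = nlu(first_text)
    instruction = dm(first_info, second_info)
    if instruction is None:
        # Assumed meaning of "invalid": replace the stored context.
        return None, first_info, "invalid"
    return instruction, second_info, "nlu"


def toy_nlu(text: str) -> StructuredInfo:
    """Stand-in for a real NLU module (lookup table, illustration only)."""
    table = {
        "close the window": StructuredInfo("window", "close"),
        "open the sunroof": StructuredInfo("sunroof", "open"),
    }
    return table.get(text, StructuredInfo("unknown", "unknown"))


def toy_dm(first: StructuredInfo, second: StructuredInfo) -> Optional[str]:
    """Stand-in for a DM module: valid only when both target the same device."""
    if first.device != second.device:
        return None
    return f"{first.device}_{first.action}".upper()
```

With the context StructuredInfo("window", "open"), the text "a bit more" is resolved directly from the preset set, "close the window" goes through the toy NLU and DM stand-ins, and "open the sunroof" yields no valid instruction in this toy setup, so the stored context is replaced.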

In an optional implementation manner, the processing module 1501 is specifically configured to: determine, according to the first voice signal, M characters corresponding to the first voice signal, where M is a positive integer; input the text information composed of the M characters into a first preset model to obtain an output result of the first preset model, where the first preset model is used to judge whether the text information composed of the input characters has complete semantics; and generate the first text information according to the text information composed of the M characters and the output result of the first preset model.
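As a non-limiting illustration, the completeness check on the M recognized characters may be sketched in Python as follows. The function name first_complete_text, the rule-based toy_model and the COMPLETE_COMMANDS table are hypothetical stand-ins. In the application the first preset model is a trained model, and the character-by-character loop is an assumption about how the check might be applied to streaming recognition output.

```python
from typing import Callable, Iterable, Optional


def first_complete_text(
    chars: Iterable[str],
    is_complete: Callable[[str], bool],
) -> Optional[str]:
    """Feed the growing prefix to the model; return it once it is complete."""
    prefix = ""
    for ch in chars:
        prefix += ch
        if is_complete(prefix):
            # The prefix of M characters has complete semantics, so it can
            # be emitted without waiting for the end of the utterance.
            return prefix
    return None  # the stream ended without a semantically complete prefix


# Hypothetical stand-in for the first preset model (illustration only).
COMPLETE_COMMANDS = {"open the window", "a bit more"}


def toy_model(text: str) -> bool:
    return text in COMPLETE_COMMANDS
```

For example, calling first_complete_text("open the window please", toy_model) stops at the prefix "open the window" under this toy model, while an input such as "open the" produces no first text information.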

In an optional implementation manner, the processing module 1501 is specifically configured to: obtain a first training set, where the first training set includes a plurality of first training data, each first training data in the plurality of first training data includes first training text information and a first label, the first training text information consists of one or more words, and the first label is used to indicate whether the first training text information has complete semantics; according to the plurality of first training data and a first training model, perform first model training one or more times until a first output result of the first training model meets a first preset condition, and determine the first training model whose first output result meets the first preset condition as the first preset model; wherein the first model training includes: inputting the plurality of first training data into the first training model to obtain the first output result; and updating the model parameters in the first training model according to the first output result to obtain the first training model with updated model parameters.
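As a non-limiting illustration, the repeated training described above may be sketched in Python as follows. The application does not specify the model type or the first preset condition. This sketch assumes a bag-of-words perceptron as the first training model and a training-accuracy threshold as the first preset condition, and the training texts are invented for illustration.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, int]  # (training text, label: 1 = complete semantics)


def predict(weights: Dict[str, float], bias: float, text: str) -> int:
    """Output of the toy model: 1 if the text is judged complete, else 0."""
    score = bias + sum(weights.get(word, 0.0) for word in text.split())
    return 1 if score > 0 else 0


def train(data: List[Example], threshold: float = 1.0, max_rounds: int = 200):
    """Repeat model training until training accuracy reaches the threshold."""
    weights: Dict[str, float] = {}
    bias = 0.0
    for _ in range(max_rounds):
        # One round of model training: run each example through the model
        # and update the parameters whenever the output is wrong.
        for text, label in data:
            if predict(weights, bias, text) != label:
                step = 1.0 if label == 1 else -1.0
                for word in text.split():
                    weights[word] = weights.get(word, 0.0) + step
                bias += step
        # Assumed preset condition: accuracy on the training set.
        correct = sum(predict(weights, bias, t) == y for t, y in data)
        if correct / len(data) >= threshold:
            return weights, bias
    raise RuntimeError("preset condition not met within max_rounds")


# Invented training set (illustration only).
TRAINING_SET: List[Example] = [
    ("open the window", 1),
    ("open the", 0),
    ("turn on the radio", 1),
    ("turn on the", 0),
    ("a bit more", 1),
    ("a bit", 0),
]
```

A call such as weights, bias = train(TRAINING_SET) returns parameters that classify every example in this small set correctly. A practical implementation would likely use a neural model and a held-out validation set instead.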

In an optional implementation manner, before the control module 1502 controls the target device to switch in the specified operating state according to the first text information and the second text information, the processing module 1501 is further configured to: input the first text information and historical text information into a second preset model to obtain an output result of the second preset model, where the second preset model is used to judge whether the two pieces of input text information have a contextual relationship; and determine the historical text information as the second text information according to the output result of the second preset model.
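As a non-limiting illustration, selecting the second text information from historical text information may be sketched in Python as follows. The function name select_second_text, the rule-based toy_has_context and the FOLLOW_UPS table are hypothetical stand-ins for the second preset model, which in the application is a trained model. Checking the most recent historical text first is an assumption of this sketch.

```python
from typing import Callable, Dict, List, Optional, Set


def select_second_text(
    first_text: str,
    history: List[str],
    has_context: Callable[[str, str], bool],
) -> Optional[str]:
    """Return the most recent historical text related to first_text."""
    for past_text in reversed(history):
        if has_context(past_text, first_text):
            return past_text
    return None  # no historical text has a contextual relationship


# Hypothetical rule table: a follow-up phrase and the action words of the
# earlier commands it can refine (illustration only).
FOLLOW_UPS: Dict[str, Set[str]] = {
    "a bit more": {"open", "close"},
}


def toy_has_context(past_text: str, current_text: str) -> bool:
    """Stand-in for the second preset model."""
    actions = FOLLOW_UPS.get(current_text, set())
    return any(word in actions for word in past_text.split())
```

For example, with the history ["open the window", "play some music"], the follow-up "a bit more" is paired with "open the window" under these toy rules, because the more recent "play some music" is judged unrelated.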

In an optional implementation manner, the processing module 1501 is specifically configured to: obtain a second training set, where the second training set includes a plurality of second training data, each second training data in the plurality of second training data includes two pieces of second training text information and a second label, and the second label is used to indicate whether the two pieces of second training text information have a contextual relationship; according to the plurality of second training data and a second training model, perform second model training one or more times until a second output result of the second training model meets a second preset condition, and determine the second training model whose second output result meets the second preset condition as the second preset model; wherein the second model training includes: inputting the plurality of second training data into the second training model to obtain the second output result; and updating the model parameters in the second training model according to the second output result to obtain the second training model with updated model parameters.

FIG. 16 shows the apparatus provided in this embodiment of the present application, and the apparatus shown in FIG. 16 may be a hardware circuit implementation of the apparatus shown in FIG. 15 . The apparatus can be applied to the flow chart shown above to perform the functions of the voice control apparatus in the above method embodiments.

For ease of explanation, FIG. 16 shows only the main components of the device.

The voice control apparatus includes: a processor 1610 and an interface 1630 , and optionally, the voice control apparatus further includes a memory 1620 . The interface 1630 is used to enable communication with other devices.

The method performed by the voice control apparatus in the above embodiments may be implemented by the processor 1610 calling a program stored in a memory (which may be the memory 1620 in the voice control apparatus, or an external memory). That is, the voice control apparatus may include a processor 1610, and the processor 1610 executes the method performed by the voice control apparatus in the above method embodiments by calling the program in the memory. The processor here may be an integrated circuit with signal processing capability, such as a CPU. The voice control apparatus may also be implemented by one or more integrated circuits configured to implement the above methods, for example: one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), one or more field-programmable gate arrays (FPGAs), or a combination of at least two of these integrated circuit forms. Alternatively, the above implementations may be combined.

Specifically, the function/implementation process of the processing module 1501 and the control module 1502 in FIG. 15 can be implemented by the processor 1610 in the voice control device shown in FIG. 16 calling the computer execution instructions stored in the memory 1620 .

Based on the above content and the same concept, the present application provides a computing device, including a processor, where the processor is connected to a memory, the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so that the computing device executes the methods in the above method embodiments.

Based on the above content and the same concept, the present application provides a computer-readable storage medium on which a computer program or instruction is stored. When the computer program or instruction is executed, the computing device executes the method in the above method embodiment.

Based on the above content and the same concept, the present application provides a computer program product. When a computer reads and executes the computer program product, a computing device is caused to execute the methods in the above method embodiments.

Based on the above content and the same concept, the present application provides a chip connected to a memory for reading and executing a software program stored in the memory, so that a computing device executes the methods in the above method embodiments.

Based on the above content and the same concept, an embodiment of the present application provides an apparatus, where the apparatus includes a processor and an interface circuit, the interface circuit is configured to receive a program or instruction code and transmit it to the processor, and the processor executes the program or instruction code to perform the methods in the above method embodiments.

It can be understood that the various numerals involved in the embodiments of the present application are only for the convenience of description, and are not used to limit the scope of the embodiments of the present application. The sequence numbers of the above processes do not imply an order of execution; the execution order of each process should be determined by its function and internal logic.

Obviously, those skilled in the art can make various changes and modifications to the present application without departing from the protection scope of the present application. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to include these modifications and variations.

Claims

A voice control method, comprising:

According to the first speech signal, determine the first text information with complete semantics;

According to the first text information and the second text information, the target device is controlled to switch in a specified operating state, wherein the second text information is acquired before the first text information, the second text information is used to control the target device to enter a first operating state in the specified operating state, and the second text information has a contextual relationship with the first text information.
The method of claim 1, wherein the second text information having a contextual relationship with the first text information includes at least one of the following:

The second text information and the first text information correspond to the same target device;

The execution action corresponding to the second text information and the execution action corresponding to the first text information belong to the same type.
The method according to claim 1 or 2, wherein, before determining the first text information with complete semantics according to the first voice signal, the method further comprises:

According to the second speech signal, determine the second text information with complete semantics;

performing natural language understanding on the second text information to obtain second structured information;

According to the second structured information, the target device is controlled to enter the first operating state.
The method according to any one of claims 1 to 3, wherein the controlling the target device to switch in a specified operating state according to the first text information and the second text information, comprises:

Determine a preset set corresponding to the second structured information according to the second structured information corresponding to the second text information, where the preset set includes a correspondence between one or more pieces of preset text information and preset instruction identifiers;

When the one or more pieces of preset text information include the first text information, a control instruction is determined according to the preset instruction identifier corresponding to the first text information, wherein the control instruction is used to control the target device to switch from the first operating state in the specified operating state to a second operating state in the specified operating state.
The method according to claim 4, wherein the controlling the target device to switch in a specified operating state according to the first text information and the second text information, further comprises:

When the first text information is different from any one of the one or more preset text information, performing natural language understanding on the first text information to obtain first structured information;

The control instruction is determined according to the first structured information and the second structured information.
The method according to claim 5, wherein after determining the control instruction according to the first structured information and the second structured information, the method further comprises:

When the control instruction is invalid, the second structured information is updated according to the first structured information.
The method according to any one of claims 1 to 6, wherein the determining, according to the first voice signal, the first text information with complete semantics comprises:

According to the first voice signal, determine M characters corresponding to the first voice signal, where M is a positive integer;

Inputting the text information composed of the M characters into a first preset model to obtain an output result of the first preset model, where the first preset model is used to determine whether the text information composed of the input characters has complete semantics;

The first text information is generated according to the text information composed of the M characters and the output result of the first preset model.
The method of claim 7, wherein the first preset model is determined by the steps of:

Acquire a first training set, where the first training set includes a plurality of first training data, each first training data in the plurality of first training data includes first training text information and a first label, the first training text information consists of one or more words, and the first label is used to indicate whether the first training text information has complete semantics;

According to the plurality of first training data and a first training model, first model training is performed one or more times until a first output result of the first training model meets a first preset condition, and the first training model whose first output result meets the first preset condition is determined to be the first preset model;

Wherein, the first model training includes: inputting the plurality of first training data into the first training model to obtain the first output result; and updating the model parameters in the first training model according to the first output result to obtain the first training model after the model parameters are updated.
The method according to any one of claims 1 to 8, wherein, before controlling the target device to switch in a specified operating state according to the first text information and the second text information, the method further comprises:

The first text information and historical text information are input into a second preset model to obtain an output result of the second preset model, where the second preset model is used to judge whether the two pieces of input text information have a contextual relationship;

According to the output result of the second preset model, the historical text information is determined as the second text information.
The method of claim 9, wherein the second preset model is determined by the steps of:

Acquire a second training set, where the second training set includes a plurality of second training data, each second training data in the plurality of second training data includes two pieces of second training text information and a second label, and the second label is used to indicate whether the two pieces of second training text information have a contextual relationship;

According to the plurality of second training data and a second training model, second model training is performed one or more times until a second output result of the second training model meets a second preset condition, and the second training model whose second output result meets the second preset condition is determined to be the second preset model;

Wherein, the second model training includes: inputting the plurality of second training data into the second training model to obtain the second output result; and updating the model parameters in the second training model according to the second output result to obtain the second training model after the model parameters are updated.
A voice control device, comprising:

a processing module, configured to determine the first text information with complete semantics according to the first speech signal;

A control module, configured to control the target device to switch in a specified operating state according to the first text information and the second text information, wherein the second text information is acquired before the first text information, the second text information is used to control the target device to enter a first operating state in the specified operating state, and the second text information has a contextual relationship with the first text information.
The apparatus of claim 11, wherein the second text information having a contextual relationship with the first text information includes at least one of the following:

The second text information and the first text information correspond to the same target device;

The execution action corresponding to the second text information and the execution action corresponding to the first text information belong to the same type.
The device according to claim 11 or 12, wherein before the processing module determines the first text information with complete semantics according to the first voice signal, the processing module is further configured to:

Determine second text information with complete semantics according to the second voice signal; perform natural language understanding on the second text information to obtain second structured information;

The control module is also used for:

According to the second structured information, the target device is controlled to enter the first operating state.
The device according to any one of claims 11 to 13, wherein the control module is specifically used for:

Determine a preset set corresponding to the second structured information according to the second structured information corresponding to the second text information, where the preset set includes a correspondence between one or more pieces of preset text information and preset instruction identifiers;

When the one or more pieces of preset text information include the first text information, a control instruction is determined according to the preset instruction identifier corresponding to the first text information, wherein the control instruction is used to control the target device to switch from the first operating state in the specified operating state to a second operating state in the specified operating state.
The apparatus of claim 14, wherein the control module is further configured to:

When the first text information is different from any one of the one or more preset text information, performing natural language understanding on the first text information to obtain first structured information;

The control instruction is determined according to the first structured information and the second structured information.
The apparatus of claim 15, wherein the control module is further configured to update the second structured information according to the first structured information when the control instruction is invalid.
The device according to any one of claims 11 to 16, wherein the processing module is specifically configured to:

According to the first voice signal, determine M characters corresponding to the first voice signal, where M is a positive integer;

Inputting the text information composed of the M characters into a first preset model to obtain an output result of the first preset model, where the first preset model is used to determine whether the text information composed of the input characters has complete semantics;

The first text information is generated according to the text information composed of the M characters and the output result of the first preset model.
The apparatus of claim 17, wherein the processing module is specifically configured to:

Acquire a first training set, where the first training set includes a plurality of first training data, each first training data in the plurality of first training data includes first training text information and a first label, the first training text information consists of one or more words, and the first label is used to indicate whether the first training text information has complete semantics;

According to the plurality of first training data and a first training model, first model training is performed one or more times until a first output result of the first training model meets a first preset condition, and the first training model whose first output result meets the first preset condition is determined to be the first preset model;

Wherein, the first model training includes: inputting the plurality of first training data into the first training model to obtain the first output result; and updating the model parameters in the first training model according to the first output result to obtain the first training model after the model parameters are updated.
The apparatus according to any one of claims 11 to 18, characterized in that, before the control module controls the target device to switch in a specified operating state according to the first text information and the second text information, the processing module is further configured to:

Input the first text information and historical text information into a second preset model to obtain an output result of the second preset model, where the second preset model is used to judge whether the two pieces of input text information have a contextual relationship;

According to the output result of the second preset model, the historical text information is determined as the second text information.
The apparatus of claim 19, wherein the processing module is specifically configured to:

Acquire a second training set, where the second training set includes a plurality of second training data, each second training data in the plurality of second training data includes two pieces of second training text information and a second label, and the second label is used to indicate whether the two pieces of second training text information have a contextual relationship;

According to the plurality of second training data and a second training model, second model training is performed one or more times until a second output result of the second training model meets a second preset condition, and the second training model whose second output result meets the second preset condition is determined to be the second preset model;

Wherein, the second model training includes: inputting the plurality of second training data into the second training model to obtain the second output result; and updating the model parameters in the second training model according to the second output result to obtain the second training model after the model parameters are updated.
A computing device, characterized in that it includes a processor, the processor is connected to a memory, the memory stores a computer program, and the processor is configured to execute the computer program stored in the memory, so that the computing device executes the method as claimed in any one of claims 1 to 10.
A computer-readable storage medium, characterized in that a computer program or instruction is stored in the computer-readable storage medium, and when the computer program or instruction is executed by a computing device, the computing device is caused to perform the method as claimed in any one of claims 1 to 10.
A chip, characterized in that it includes at least one processor and an interface;

the interface for providing program instructions or data for the at least one processor;

The at least one processor is adapted to execute the program instructions such that the method of any one of claims 1 to 10 is performed.