WO2022198365A1 - Voice control method and apparatus - Google Patents

Voice control method and apparatus

Info

Publication number
WO2022198365A1
Authority
WO
WIPO (PCT)
Prior art keywords
text information
training
model
preset
information
Prior art date
Application number
PCT/CN2021/082019
Other languages
French (fr)
Chinese (zh)
Inventor
高益
聂为然
李宏言
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to CN202180001481.6A (patent CN113228167B)
Priority to PCT/CN2021/082019 (patent WO2022198365A1)
Publication of WO2022198365A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Definitions

  • The present application relates to the field of automatic driving, and in particular to a voice control method and device.
  • Voice interaction products are widely used in daily life.
  • Smartphones, smart home devices, and smart vehicle-mounted devices all have voice interaction functions.
  • Voice interaction frees the user's hands, allows fast command control, and supports safe driving.
  • The user can usually adjust the opening of in-vehicle equipment such as the windows and the sunroof through voice interaction with the in-vehicle voice control device.
  • The voice control device must first determine that the user has finished sending the voice signal. Only then can it perform speech recognition and semantic analysis on the entire user utterance to obtain a control instruction, and adjust the corresponding in-vehicle equipment, such as the window opening, according to that instruction. Because recognition and analysis start only after the entire utterance has been acquired, the delay of the whole control process is relatively long.
  • The present application provides a voice control method and device for reducing control delay and improving user experience during voice control.
  • The voice control method provided in this application can be implemented by a terminal device, for example, a vehicle or a vehicle-mounted device.
  • The voice control method can also be implemented by a component of the terminal device, such as a processing apparatus, circuit, or chip in the terminal device, for example, a chip in the terminal device that supports wireless communication functions, such as a system chip or a communication chip.
  • The system chip is also called a system-on-chip (SoC) or an SoC chip.
  • The communication chip may include a radio frequency processing chip and a baseband processing chip.
  • The baseband processing chip is also sometimes called a modem.
  • The communication chip may be integrated in the SoC chip, or may not be integrated with the SoC chip.
  • For example, the baseband processing chip is integrated in the SoC chip, and the radio frequency processing chip is not integrated with the SoC chip.
  • The present application provides a voice control method. The method includes: determining first text information with complete semantics according to a first voice signal; and controlling, according to the first text information and second text information, a target device to switch within a specified operating state.
  • The specified operating state corresponding to the target device may include at least a first operating state and a second operating state.
  • The second text information is acquired before the first text information, and there is a contextual relationship between the second text information and the first text information. The second text information is used to control the target device to enter the first operating state in the specified operating state, and the first text information is used to control the target device to switch from the first operating state to the second operating state in the specified operating state.
  • For example, when the target device is a car window, the specified operating state corresponding to the car window may include moving down (i.e., the first operating state) and stopping moving down (i.e., the second operating state), where the second text information is used to control the window to move down and the first text information is used to control the window to stop moving down. In this way, the window can be controlled to switch from moving down to no longer moving down.
  • The contextual relationship between the second text information and the first text information includes one or more of the following: the second text information and the first text information correspond to (or act on) the same target device; and the execution action corresponding to the second text information and the execution action corresponding to the first text information are of the same type.
  • Before determining the first text information with complete semantics according to the first voice signal, the method further includes: determining the second text information with complete semantics according to a second voice signal; performing natural language understanding on the second text information to obtain second structured information; and controlling the target device to enter the first operating state according to the second structured information.
  • In this way, the second voice signal is obtained first and the target device is controlled to enter the first operating state according to the second voice signal; the first voice signal is then obtained, and the target device is controlled to switch from the first operating state to the second operating state according to the first voice signal, which realizes control of the target device while it is in the first operating state.
  • Controlling the target device to switch within the specified operating state according to the first text information and the second text information includes: determining, according to the second structured information corresponding to the second text information, a preset set corresponding to the second structured information, where the preset set includes the correspondence between one or more pieces of preset text information and preset instruction identifiers; and when the one or more pieces of preset text information include the first text information, determining a control instruction according to the preset instruction identifier corresponding to the first text information, where the control instruction is used to control the target device to switch from the first operating state to the second operating state in the specified operating state.
  • In this way, a preset set corresponding to the second structured information is provided. When the first text information is included in the preset set, the preset instruction identifier corresponding to the first text information can be determined directly and the control instruction generated from it, without performing natural language understanding and dialogue management on the first text information, which further reduces the time delay in the control process.
  • Controlling the target device to switch within the specified operating state according to the first text information and the second text information further includes: when the first text information differs from every one of the one or more pieces of preset text information, performing natural language understanding on the first text information to obtain first structured information; and determining the control instruction according to the first structured information and the second structured information.
  • In this way, when the first text information is not included in the preset set, natural language understanding can be performed on the first text information to obtain the first structured information, and dialogue management can then be performed on the first structured information and the second structured information to obtain the control instruction, which helps to ensure the normal operation of the system.
  • The method further includes: if the control instruction is invalid, updating the second structured information according to the first structured information.
  • In this way, the stored second structured information can be updated according to the first structured information (that is, the stored historical structured information is updated), so that the currently stored historical structured information is the latest structured information. This ensures the correct operation of the system and helps it make correct judgments when new voice signals are received.
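The control flow described above (preset-set lookup first, natural language understanding and dialogue management as a fallback, and a history update when the resulting instruction is invalid) can be sketched as follows. This is a minimal illustration, not the implementation of the application: `PRESET_SETS`, `Dispatcher`, `toy_nlu`, `toy_dm`, the instruction identifiers, and every structured-information string are hypothetical, and the rule that an instruction is "invalid" when the two pieces of structured information act on different devices is an assumption made only for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical preset sets, keyed by the stored (second) structured information.
# Each set maps preset text information to a preset instruction identifier.
PRESET_SETS = {
    "control-window.move_down": {"stop": "WINDOW_STOP", "that's enough": "WINDOW_STOP"},
}


@dataclass
class Dispatcher:
    nlu: Callable[[str], str]                          # text -> structured information
    dm: Callable[[str, Optional[str]], Optional[str]]  # returns None for an invalid instruction
    history: Optional[str] = None                      # stored second structured information

    def handle(self, first_text: str) -> Optional[str]:
        preset = PRESET_SETS.get(self.history, {})
        if first_text in preset:
            # Fast path: the identifier is read from the preset set, no NLU or DM.
            return preset[first_text]
        # Slow path: NLU on the first text, then DM with the stored history.
        first_struct = self.nlu(first_text)
        instruction = self.dm(first_struct, self.history)
        if instruction is None:
            # Invalid instruction: the history is replaced by the newest structured information.
            self.history = first_struct
        return instruction


def toy_nlu(text: str) -> str:
    """Toy stand-in for the NLU module (hypothetical mappings)."""
    table = {
        "open the sunroof": "control-sunroof.open",
        "raise it a little": "control-window.move_up",
    }
    return table.get(text, "unknown")


def toy_dm(first_struct: str, second_struct: Optional[str]) -> Optional[str]:
    """Toy stand-in for the DM module: valid only if both act on the same device."""
    if second_struct and first_struct.split(".")[0] == second_struct.split(".")[0]:
        return "EXEC:" + first_struct
    return None


dispatcher = Dispatcher(toy_nlu, toy_dm, history="control-window.move_down")
print(dispatcher.handle("stop"))               # fast path
print(dispatcher.handle("raise it a little"))  # slow path, valid instruction
print(dispatcher.handle("open the sunroof"))   # slow path, invalid, history updated
```

Keying the preset sets by the stored structured information keeps the fast path to two dictionary lookups, which is the point of skipping natural language understanding and dialogue management for short follow-up commands.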
  • Determining the first text information with complete semantics according to the first voice signal includes: determining, according to the first voice signal, M characters corresponding to the first voice signal, where M is a positive integer; inputting the text information composed of the M characters into a first preset model to obtain an output result of the first preset model, where the first preset model is used to judge whether text information composed of multiple input characters has complete semantics; and generating the first text information according to the text information composed of the M characters and the output result of the first preset model.
  • The first preset model is determined by the following steps: obtaining a first training set, where the first training set includes a plurality of pieces of first training data, each piece of first training data includes first training text information and a first label, the first training text information is composed of one or more words, and the first label indicates whether the first training text information has complete semantics; and performing first model training one or more times according to the plurality of pieces of first training data and a first training model until the first output result of the first training model meets a first preset condition, and determining the first training model whose first output result meets the first preset condition as the first preset model. The first model training includes: inputting the plurality of pieces of first training data into the first training model to obtain the first output result; and updating the model parameters in the first training model according to the first output result to obtain the first training model with updated model parameters.
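The train-until-a-preset-condition-is-met procedure above can be illustrated with a small sketch. The application does not specify the model type or the preset condition; the example below assumes a perceptron over bag-of-words features and takes "training accuracy reaches `target_accuracy`" as the first preset condition. The names `featurize`, `predict`, `train`, the vocabulary, and the training texts are all hypothetical.

```python
from typing import List, Tuple


def featurize(text: str, vocab: List[str]) -> List[float]:
    """Bag-of-words features plus a constant bias term."""
    words = text.split()
    return [1.0 if w in words else 0.0 for w in vocab] + [1.0]


def predict(weights: List[float], features: List[float]) -> int:
    """Return 1 (complete semantics) or 0 (incomplete)."""
    return 1 if sum(w * x for w, x in zip(weights, features)) > 0 else 0


def train(training_data: List[Tuple[str, int]], vocab: List[str],
          target_accuracy: float = 1.0, max_rounds: int = 100) -> List[float]:
    weights = [0.0] * (len(vocab) + 1)
    for _ in range(max_rounds):
        # One round of "model training": feed every training sample to the model,
        # compare the output with the label, and update the parameters on errors.
        correct = 0
        for text, label in training_data:
            x = featurize(text, vocab)
            output = predict(weights, x)
            if output == label:
                correct += 1
            else:
                step = label - output
                weights = [w + step * xi for w, xi in zip(weights, x)]
        # Assumed preset condition: accuracy over this round reaches the target.
        if correct / len(training_data) >= target_accuracy:
            break
    return weights


# Hypothetical first training set: (training text, label for complete semantics).
VOCAB = ["open", "the", "window", "close", "stop"]
DATA = [
    ("open", 0), ("open the", 0), ("open the window", 1),
    ("close", 0), ("close the", 0), ("close the window", 1),
    ("stop", 1),
]

model = train(DATA, VOCAB)
print(predict(model, featurize("close the window", VOCAB)))
print(predict(model, featurize("open the", VOCAB)))
```

The same loop shape applies to the second preset model; only the training data (pairs of texts with a contextual-relationship label) and the features would change.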
  • In this way, a first preset model is preset, where the first preset model is a relatively accurate classification model trained on a plurality of pieces of historical training data.
  • After the M characters corresponding to the current first voice signal are obtained, the text information composed of the M characters can be input into the first preset model to determine whether that text has complete semantics. This helps to obtain a more accurate judgment result and thereby more accurate first text information with complete semantics.
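How the first preset model might be applied to a growing character sequence can be sketched as below. The classifier is replaced by a toy lookup (`toy_model` and `COMPLETE_TEXTS` are hypothetical), and the idea of re-checking after every new character is an assumption about the granularity; the application only states that the text composed of M characters is judged by the model.

```python
from typing import Callable, Iterable, Optional


def first_complete_text(chars: Iterable[str],
                        is_complete: Callable[[str], bool]) -> Optional[str]:
    """Accumulate recognized characters and return the text as soon as the
    model judges it to have complete semantics; return None otherwise."""
    text = ""
    for ch in chars:           # characters arrive incrementally from recognition
        text += ch
        if is_complete(text):  # stand-in for the first preset model
            return text
    return None


# Toy stand-in for the first preset model (hypothetical).
COMPLETE_TEXTS = {"open the window", "stop"}


def toy_model(text: str) -> bool:
    return text.strip() in COMPLETE_TEXTS


print(first_complete_text("open the window please", toy_model))
print(first_complete_text("open the", toy_model))
```

Returning at the first complete prefix is what lets the rest of the pipeline start before the user's trailing silence has elapsed.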
  • Before controlling the target device to switch within the specified operating state, the method further includes: inputting the first text information and historical text information into a second preset model to obtain an output result of the second preset model, where the second preset model is used to judge whether two pieces of input text information have a contextual relationship; and determining the historical text information as the second text information according to the output result of the second preset model.
  • The second preset model is determined by the following steps: obtaining a second training set, where the second training set includes a plurality of pieces of second training data, each piece of second training data includes two pieces of second training text information and a second label, and the second label indicates whether the two pieces of second training text information have a contextual relationship; and performing second model training one or more times according to the plurality of pieces of second training data and a second training model until the second output result of the second training model meets a second preset condition, and determining the second training model whose second output result meets the second preset condition as the second preset model. The second model training includes: inputting the plurality of pieces of second training data into the second training model to obtain the second output result; and updating the model parameters in the second training model according to the second output result to obtain the second training model with updated model parameters.
  • In this way, a second preset model is preset, where the second preset model is a relatively accurate classification model trained on a plurality of pieces of historical training data.
  • When the M characters corresponding to the first voice signal have complete semantics, that is, when the M characters form the first text information, the first text information and the currently stored historical text information can be input into the second preset model, and the output result of the second preset model is used to determine whether the historical text information is the preceding context of the first text information, which helps to obtain a more accurate determination result.
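Selecting the second text information with a pairwise model can be sketched as follows. The second preset model is replaced by a toy lookup of related pairs (`toy_context_model` and `RELATED_PAIRS` are hypothetical), and keeping several historical texts and scanning them newest first is an assumption; the application only refers to the currently stored historical text information.

```python
from typing import Callable, List, Optional


def pick_second_text(first_text: str, history: List[str],
                     has_context: Callable[[str, str], bool]) -> Optional[str]:
    """Return the most recent historical text that the pairwise model judges to
    be the preceding context of first_text, or None if there is none."""
    for past in reversed(history):        # newest entry first (assumption)
        if has_context(past, first_text):  # stand-in for the second preset model
            return past
    return None


# Toy stand-in for the second preset model (hypothetical pairs).
RELATED_PAIRS = {
    ("lower the window", "stop"),
    ("lower the window", "a bit more"),
}


def toy_context_model(previous: str, current: str) -> bool:
    return (previous, current) in RELATED_PAIRS


print(pick_second_text("stop", ["play music", "lower the window"], toy_context_model))
print(pick_second_text("stop", ["play music"], toy_context_model))
```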
  • The present application provides a voice control device. The device includes: a processing module, configured to determine first text information with complete semantics according to a first voice signal; and a control module, configured to control, according to the first text information and second text information, a target device to switch within a specified operating state.
  • The specified operating state corresponding to the target device may include at least a first operating state and a second operating state.
  • The second text information is acquired before the first text information, and there is a contextual relationship between the second text information and the first text information. The second text information is used to control the target device to enter the first operating state in the specified operating state, and the first text information is used to control the target device to switch from the first operating state to the second operating state in the specified operating state.
  • The contextual relationship between the second text information and the first text information includes one or more of the following: the second text information and the first text information correspond to the same target device; and the execution action corresponding to the second text information and the execution action corresponding to the first text information are of the same type.
  • Before determining the first text information with complete semantics according to the first voice signal, the processing module is further configured to: determine the second text information with complete semantics according to a second voice signal; and perform natural language understanding on the second text information to obtain second structured information. The control module is further configured to control the target device to enter the first operating state according to the second structured information.
  • The control module is specifically configured to: determine, according to the second structured information corresponding to the second text information, a preset set corresponding to the second structured information, where the preset set includes the correspondence between one or more pieces of preset text information and preset instruction identifiers; and when the one or more pieces of preset text information include the first text information, determine a control instruction according to the preset instruction identifier corresponding to the first text information, where the control instruction is used to control the target device to switch from the first operating state to the second operating state in the specified operating state.
  • The control module is further configured to: when the first text information differs from every one of the one or more pieces of preset text information, perform natural language understanding on the first text information to obtain first structured information; and determine the control instruction according to the first structured information and the second structured information.
  • The control module is further configured to: after determining the control instruction according to the first structured information and the second structured information, update the second structured information according to the first structured information if the control instruction is invalid.
  • The processing module is specifically configured to: determine, according to the first voice signal, M characters corresponding to the first voice signal, where M is a positive integer; input the text information composed of the M characters into a first preset model to obtain an output result of the first preset model, where the first preset model is used to judge whether text information composed of multiple input characters has complete semantics; and generate the first text information according to the text information composed of the M characters and the output result of the first preset model.
  • The processing module is specifically configured to: obtain a first training set, where the first training set includes a plurality of pieces of first training data, each piece of first training data includes first training text information and a first label, the first training text information is composed of one or more words, and the first label indicates whether the first training text information has complete semantics; and perform first model training one or more times according to the plurality of pieces of first training data and a first training model until the first output result of the first training model meets a first preset condition, and determine the first training model whose first output result meets the first preset condition as the first preset model. The first model training includes: inputting the plurality of pieces of first training data into the first training model to obtain the first output result; and updating the model parameters in the first training model according to the first output result to obtain the first training model with updated model parameters.
  • The processing module is further configured to: input the first text information and historical text information into a second preset model to obtain an output result of the second preset model, where the second preset model is used to judge whether two pieces of input text information have a contextual relationship; and determine the historical text information as the second text information according to the output result of the second preset model.
  • The processing module is specifically configured to: obtain a second training set, where the second training set includes a plurality of pieces of second training data, each piece of second training data includes two pieces of second training text information and a second label, and the second label indicates whether the two pieces of second training text information have a contextual relationship; and perform second model training one or more times according to the plurality of pieces of second training data and a second training model until the second output result of the second training model meets a second preset condition, and determine the second training model whose second output result meets the second preset condition as the second preset model. The second model training includes: inputting the plurality of pieces of second training data into the second training model to obtain the second output result; and updating the model parameters in the second training model according to the second output result to obtain the second training model with updated model parameters.
  • The present application provides a computing device, including a processor connected to a memory, where the memory stores a computer program and the processor is configured to execute the computer program stored in the memory, so that the computing device executes the method in the first aspect or in any possible implementation of the first aspect.
  • The present application provides a computer-readable storage medium on which a computer program or instructions are stored. When the computer program or instructions are executed, a computer is enabled to execute the method in the first aspect or in any possible implementation of the first aspect.
  • The present application provides a computer program product which, when read and executed by a computer, causes the computer to execute the method in the first aspect or in any possible implementation of the first aspect.
  • The present application provides a chip, which is connected to a memory and is used to read and execute a software program stored in the memory, so as to implement the method in the first aspect or in any possible implementation of the first aspect.
  • The voice control device may acquire the second voice signal, determine the second text information according to the second voice signal, perform natural language understanding on the second text information to obtain the second structured information, and then control the target device to enter the first operating state of the specified operating state according to the second structured information.
  • While continuously obtaining the first voice signal, the voice control device may also perform streaming speech recognition on the signal obtained so far to obtain the corresponding M characters, and determine the first text information, composed of the M characters, that has complete semantics. The device does not need to wait for the silence period after the user finishes sending the voice signal: once text information with complete semantics is obtained, it infers that the user has finished sending the voice signal, which effectively reduces the control delay.
  • The target device in the first operating state can then be controlled according to the first text information and the second text information. Specifically, the operating state of the target device is switched from the first operating state to the second operating state. In this way, the target device indicated by the first voice signal sent by the user can be effectively determined and controlled.
  • When controlling the target device, it can be determined whether the first text information is in the preset set corresponding to the second text information (i.e., to the second structured information). When the first text information is in the preset set, the preset instruction identifier corresponding to the first text information can be determined directly from the preset set, without performing natural language understanding and dialogue management on the first text information, which helps to further reduce the delay in the control process.
  • In this way, the time delay in the control process can be effectively reduced, and the user can control the target device more intuitively and effectively by sending a voice signal, which helps to improve the user experience.
  • FIG. 1 is a schematic diagram of functional modules included in a voice control device provided by the application.
  • FIG. 2 is a schematic diagram of functional modules included in a data processing module provided by the application.
  • FIG. 3 is a schematic diagram of a specific scene to which the voice control device provided by the present application is applicable.
  • FIG. 4 is a schematic diagram of a process of slowly moving down a group of vehicle windows provided by the present application.
  • FIG. 5 is a schematic diagram of the time delay generated when a first type of voice control device processes a voice signal, provided by the application.
  • FIG. 6 is a schematic diagram of functional modules included in yet another data processing module provided by the present application.
  • FIG. 7 is a schematic flowchart of a voice control method provided by the present application.
  • FIG. 8 is a schematic flowchart of yet another voice control method provided by the application.
  • FIG. 9 is a schematic flowchart of the input and output of two preset models in a flow control module provided by the present application.
  • FIG. 11 is a voice control process in yet another vehicle-mounted scene provided by the application.
  • FIG. 13 is a schematic diagram of the time delay generated when a second type of voice control device processes a voice signal, provided by the application.
  • FIG. 14 is a schematic diagram of the time delay generated when a third type of voice control device processes a voice signal, provided by the application.
  • FIG. 15 is a schematic structural diagram of a voice control device provided by the application.
  • FIG. 16 is a schematic structural diagram of another voice control apparatus provided by the present application.
  • The voice control device includes a voice acquisition module, a data processing module and a decision-making module.
  • The voice acquisition module is used to acquire the voice signal and transmit the acquired voice signal to the data processing module.
  • The data processing module is used to perform speech analysis, semantic analysis, dialogue management, and so on, on the voice signal to obtain a data processing result.
  • The data processing module sends the data processing result to the decision-making module, and the decision-making module generates a control instruction according to the data processing result and sends it to the corresponding target device.
  • The data processing module includes an automatic speech recognition (ASR) function module, a natural language understanding (NLU) function module, and a dialog management (DM) function module.
  • The ASR module can be used to perform speech analysis, that is, to convert the voice signal input by the user into natural language text (which can be called text information); it plays the role of the human ear.
  • Voice input means inputting the acquired voice signal into the ASR module.
  • The voice signal is actually a sound wave, and the ASR module can encode the voice signal (feature extraction).
  • The sound wave can be split into frames (at the millisecond level) to obtain a small segment of waveform for each frame.
  • Each small segment of waveform is converted into multi-dimensional vector information according to the characteristics of the human ear.
  • The ASR module decodes the multi-dimensional vector information to obtain the corresponding phonemes (phones), composes the phonemes into words, and concatenates the words into sentences (i.e., text information).
  • The ASR module outputs the generated text information.
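The framing step of the encoding stage can be sketched as below. The 25 ms frame length and 10 ms hop are common values assumed for illustration and do not come from the application. The per-frame vector (mean energy and zero-crossing count) is only a simple stand-in: it is not the perceptual feature extraction "according to the characteristics of the human ear" that the text refers to. The function names are hypothetical.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], sample_rate: int,
                      frame_ms: int = 25, hop_ms: int = 10) -> List[Sequence[float]]:
    """Split a waveform into overlapping millisecond-level frames."""
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames


def frame_features(frame: Sequence[float]) -> List[float]:
    """Toy two-dimensional feature vector: mean energy and zero-crossing count."""
    energy = sum(s * s for s in frame) / len(frame)
    zero_crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [energy, float(zero_crossings)]


# One second of a synthetic ramp signal sampled at 16 kHz.
signal = list(range(16000))
frames = split_into_frames(signal, 16000)
vectors = [frame_features(f) for f in frames]
print(len(frames), len(frames[0]), len(vectors[0]))
```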
  • Voice activity detection (VAD) may also be referred to as voice activation detection or silence detection.
  • VAD mainly includes voice VAD and semantic VAD.
  • Voice VAD means that reception of the voice signal stops (also referred to as stopping sound pickup) when no voice signal input is detected within a set period of time.
  • Semantic VAD means that reception of the voice signal stops when the text information converted so far from the input voice signal is determined to have complete semantics.
  • Voice wake-up is performed after the VAD detects a human voice; it is equivalent to sending a wake-up command to the device to trigger subsequent voice recognition.
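The difference between the two endpointing strategies can be made concrete with a small sketch. Both functions, the energy threshold, the frame counts, and the completeness check are hypothetical; the sketch assumes that frame energies and partial recognition results are available per frame.

```python
from typing import Callable, Optional, Sequence


def voice_vad_end(frame_energies: Sequence[float], threshold: float,
                  max_silent_frames: int) -> Optional[int]:
    """Voice VAD: stop at the frame where silence has lasted the set period."""
    silent = 0
    for i, energy in enumerate(frame_energies):
        silent = silent + 1 if energy < threshold else 0
        if silent >= max_silent_frames:
            return i
    return None


def semantic_vad_end(partial_texts: Sequence[str],
                     is_complete: Callable[[str], bool]) -> Optional[int]:
    """Semantic VAD: stop at the first frame whose partial text is complete."""
    for i, text in enumerate(partial_texts):
        if is_complete(text):
            return i
    return None


# Per-frame energies and the text recognized up to each frame (toy data).
energies = [5.0, 6.0, 5.0, 0.0, 0.0, 0.0, 0.0]
partials = ["open", "open the", "open the window",
            "open the window", "open the window",
            "open the window", "open the window"]

voice_end = voice_vad_end(energies, threshold=1.0, max_silent_frames=3)
semantic_end = semantic_vad_end(partials, lambda t: t == "open the window")
print(voice_end, semantic_end)
```

In this toy data the semantic check fires three frames before the silence timer does, which is the source of the delay reduction described for semantic VAD.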
  • A microphone array is a system used to sample and process the spatial characteristics of the sound field, and consists of a certain number of acoustic sensors (usually microphones).
  • Speech enhancement: the process of extracting pure speech from a noisy speech signal.
  • Sound source localization: using a microphone array to calculate the angle and distance of the target speaker, so as to track the target speaker and subsequently pick up the voice directionally.
  • De-reverberation: reducing the influence of reflected sound.
  • Sound source signal extraction/separation: extracting all of the mixed sounds. The microphone array is mainly suitable for complex environments with heavy noise and echo, such as vehicles, outdoors and supermarkets.
  • The NLU module can be used to perform natural language understanding, or semantic analysis, that is, to convert natural language text into structured information that can be understood by machines.
  • An example of natural language text is "open car window"; an example of structured information obtained through natural language understanding is "control-window.adjust".
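A rule-based stand-in for the text-to-structured-information step is sketched below. A real NLU module would typically use trained intent and slot models; the keyword rule and the `understand` function are hypothetical, and the pairing of "open car window" with "control-window.adjust" is read from the two examples in the text rather than stated as a mapping there.

```python
from typing import List, Optional, Tuple

# Each rule: keywords that must all appear, and the structured information to emit.
RULES: List[Tuple[Tuple[str, ...], str]] = [
    (("open", "window"), "control-window.adjust"),
]


def understand(text: str) -> Optional[str]:
    """Map natural language text to structured information, or None if no rule matches."""
    words = set(text.lower().split())
    for keywords, structured in RULES:
        if all(keyword in words for keyword in keywords):
            return structured
    return None


print(understand("open car window"))
print(understand("play music"))
```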
  • the DM module can be used to perform dialogue management, that is, based on the state of the dialogue, according to the semantic information, provide corresponding services.
  • Dialogue management controls the process of human-machine dialogue, and it will decide what kind of response to the user based on the historical information of the dialogue.
  • the most common application is the task-driven multi-round dialogue.
  • the user has a clear purpose such as order query, etc., the user needs are more complex, there are many restrictions, and it may be necessary to state in multiple rounds.
  • task-driven dialogue management is actually a decision-making process.
  • the system continuously decides the optimal next action according to the current state (such as providing results, asking for specific constraints, or clarifying or confirming requirements), so as to most effectively assist the user in completing the task of obtaining information or services.
  • the data processing module may also include: a natural language generation (NLG) function module and a speech synthesis (text to speech, TTS) function module.
  • the NLG module can be used to generate natural language texts based on business information.
  • the TTS module can be used to turn natural language text into an output speech signal. In contrast to the ASR module, the TTS module converts natural language text into speech for the machine to read aloud, which is equivalent to a human mouth.
  • in one specific scenario to which the voice control device provided by this application can be applied, the scenario is an in-vehicle scenario, and the user can use the voice control device to issue control commands to an in-vehicle device (which may be called a target device, such as a car window, car speakers, a seat, or an air conditioner).
  • the voice control device can pass the voice signal through the ASR module, NLU module, and DM module shown in FIG. 2 to obtain the control command for the car window, and then control the car window to move down slowly according to the control command.
  • the user can also issue control commands to other in-vehicle devices through the voice control device. For example, if the user says "raise the seat", the voice control device controls the seat to rise slowly in response to the voice signal. As another example, if the user says "turn down the air conditioner wind", the voice control device controls the air conditioner's airflow to decrease slowly in response to the voice signal.
  • the voice control device provided in this application can also be applied to other scenarios. For example, in a home scenario, a user can use the voice control device to issue control commands to a home device (which may be called a target device, such as a robot vacuum cleaner, a desk lamp, or a curtain).
  • the voice control device gradually increases the brightness of the console lights in response to the voice signal, etc.
  • FIG. 4 is a schematic diagram of a group of vehicle windows slowly moving down in the present application, wherein the thick solid line represents the vehicle door, and the thin dashed line represents the vehicle window.
  • In FIG. 4(a), the window is in a fully closed state, that is, the window has not been opened.
  • In FIG. 4(b), the window is in a half-open state, specifically in a 40% open state.
  • In FIG. 4(c), the window is still in a half-open state, specifically in a 60% open state.
  • In FIG. 4(d), the window is in a fully open state, that is, in a 100% open state. It takes about 3-4 seconds to move the window from the state of FIG. 4(a) to the state of FIG. 4(d).
  • the window is in a running state of slowly moving downward.
  • the user can intuitively perceive the current open state of the window and, according to personal needs, issue another control command to the window through the voice control device, such as a window stop command, so that the window stays at the position the user desires.
  • the voice control device can send a stop command to the car window again, for example, the user says "stop"
  • the voice control device can process the voice signal through the ASR module, NLU module, and DM module shown in FIG. 2 to obtain the window stop command, and then control the window to stop moving down.
  • the user's control of the target device in the running state may be referred to as process control.
  • the above-mentioned control of the vehicle window in the process of moving down may be referred to as the process control of the vehicle window.
  • the target device is other equipment in the vehicle equipment, such as seats, air conditioners, etc.
  • the target device is equipment in other scenarios, such as robot vacuum cleaners, curtains, table lamps, etc.
  • the target device may also be considered to have a designated operating state, and the designated operating state includes at least two operating states, referred to as a first operating state and a second operating state.
  • the first operating state may be the operating state that the target device is in based on the voice signal (or control instruction) issued by the user for the first time, such as the operating state in which the window moves downward or the operating state in which the seat slowly rises.
  • the second operating state may be the operating state that the target device is in based on the voice signal (or control instruction) sent by the user for the second time, such as the operating state in which the window stops moving downward or the operating state in which the seat stops rising.
  • while the user is sending a voice signal to the voice control device, the voice control device needs to determine that the user has finished sending the voice signal (or user voice, voice command); it can then perform recognition and semantic analysis on the entire acquired voice signal to obtain a control instruction.
  • a trailing silence may be set, and the voice control apparatus determines that the user has finished delivering the voice signal when the voice control apparatus determines that the duration of not receiving the voice signal reaches the silence period. Subsequently, the voice control device obtains a control instruction after processing the obtained entire voice signal through the ASR module, the NLU module, and the DM module shown in FIG. 2 .
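  • The trailing-silence rule above can be sketched as follows. This is a minimal illustration, not the implementation described in this application: the per-frame voiced/unvoiced flags, the frame length, and the function name `end_of_speech_frame` are all assumptions introduced for the example.

```python
from typing import Optional, Sequence


def end_of_speech_frame(
    voiced: Sequence[bool], frame_ms: int, trailing_silence_ms: int
) -> Optional[int]:
    """Return the index of the frame at which the trailing silence is reached.

    `voiced[i]` is True when frame i contains a voice signal (hypothetical input;
    a real system would derive it from the audio).
    Returns None if speech never started or the silence period was never reached.
    """
    silent_ms = 0
    heard_voice = False
    for index, is_voiced in enumerate(voiced):
        if is_voiced:
            heard_voice = True
            silent_ms = 0  # any voiced frame restarts the silence timer
        elif heard_voice:
            silent_ms += frame_ms
            if silent_ms >= trailing_silence_ms:
                return index
    return None


# Two voiced frames, a short pause, one more voiced frame, then silence.
frames = [True, True, False, False, True] + [False] * 6
endpoint = end_of_speech_frame(frames, frame_ms=100, trailing_silence_ms=500)
```

  • Only after this endpoint is reached does the whole signal go to the ASR, NLU, and DM modules, which is why the silence period adds directly to the control delay.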
  • FIG. 5 is a schematic diagram of the time delay for processing a voice signal by the first kind of voice control device exemplarily provided by this application.
  • the time delay specifically includes the silence duration, the processing duration of the ASR module, the processing duration of the NLU module, and the processing duration of the DM module. It can be seen that the time delay from when the voice control device receives the voice signal to when it generates the control command is relatively long.
  • a long time delay means the target device cannot be controlled in time. In process control especially, the user cannot control the target device intuitively and effectively through the voice control device. For example, when the window has moved down to 60%, the user feels that the current window position is suitable and says "stop". There may be a delay, such as 1 second (s), between the user saying "stop" and the window actually stopping, by which time the window may have moved down to 80%, so the final position of the window is not what the user wants.
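  • The effect of the delay can be estimated with a short calculation. The sketch below assumes the window moves at constant speed and uses hypothetical per-stage durations chosen only so that they sum to the 1 s delay in the example; neither assumption comes from this application.

```python
def total_delay(silence_s: float, asr_s: float, nlu_s: float, dm_s: float) -> float:
    """Delay from end of speech to control command: the sum of the four stages."""
    return silence_s + asr_s + nlu_s + dm_s


def overshoot_percent(delay_s: float, full_travel_s: float) -> float:
    """Extra window travel, in percent of full opening, during the delay.

    Assumes constant window speed (an assumption made for this sketch).
    """
    return 100.0 * delay_s / full_travel_s


# Hypothetical stage durations that add up to 1 s.
delay = total_delay(silence_s=0.5, asr_s=0.25, nlu_s=0.125, dm_s=0.125)

# Full travel of 4 s is taken from the "about 3-4 seconds" figure for FIG. 4.
extra = overshoot_percent(delay, full_travel_s=4.0)
final_position = min(100.0, 60.0 + extra)
```

  • Under these assumptions the window drifts by roughly a quarter of its travel after the user says "stop", the same order of magnitude as the 60% to 80% drift in the example above.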
  • the present application provides a voice control method for reducing control delay in a voice control process.
  • FIG. 6 exemplarily provides a data processing module in the present application.
  • the flow control module receives the text information from the ASR module and determines whether to send the text information to the quick matching module.
  • the quick matching module can determine the preset instruction identifier from the preset set, and determine the control instruction to send to the target device according to the preset instruction identifier.
  • the quick matching module cannot determine the preset instruction identifier from the preset set
  • the corresponding control instruction can be further generated through the NLU module and the DM module, and sent to the target device.
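  • The routing between the quick matching module and the NLU/DM path can be sketched as below. The preset set contents, the `in_process_control` flag used by the flow control step, and the stub `nlu_dm_stub` are hypothetical stand-ins introduced for illustration; the actual criteria used by the flow control module are described elsewhere in this application.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical preset set: preset text information -> preset instruction identifier.
PRESET_SET: Dict[str, str] = {"stop": "window stop", "ok": "window stop"}


def quick_match(text: str, preset_set: Dict[str, str]) -> Optional[str]:
    """Return the preset instruction identifier for `text`, or None if not found."""
    return preset_set.get(text.strip().lower())


def nlu_dm_stub(text: str) -> str:
    """Placeholder for the slower NLU + DM path (not a real implementation)."""
    return f"nlu_dm({text})"


def route(
    text: str,
    in_process_control: bool,
    preset_set: Dict[str, str],
    slow_path: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the quick matching path first; fall back to NLU + DM otherwise."""
    if in_process_control:  # flow control decision (simplified to a flag here)
        instruction_id = quick_match(text, preset_set)
        if instruction_id is not None:
            return ("quick", instruction_id)
    return ("nlu+dm", slow_path(text))


fast = route("stop", True, PRESET_SET, nlu_dm_stub)
slow = route("open bluetooth", True, PRESET_SET, nlu_dm_stub)
```

  • The fallback keeps the quick path purely an optimization: any text that is not in the preset set is still handled by the NLU and DM modules.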
  • the voice signal sent by the user for the first time is referred to as the second voice signal as follows.
  • the text information obtained by the voice control device according to the second voice signal is called the second text information
  • the control instruction generated according to the second text information is called the second control instruction
  • the second control instruction is used to control the target device to enter the first operating state.
  • the voice signal sent by the user for the second time is called the first voice signal
  • the first voice signal is the voice signal in the process control performed by the user on the target device.
  • the text information obtained by the voice control device according to the first voice signal is called the first text information
  • the control instruction generated according to the first text information is called the first control instruction
  • the first control instruction is used to control the target device to switch from the first operating state to the second operating state.
  • FIG. 7 is a schematic flowchart of a voice control method exemplarily provided by the application, in the process:
  • Step 701 The voice control apparatus determines, according to the first voice signal, first text information with complete semantics.
  • the voice control device can recognize the received voice signal through the streaming voice recognition technology. In this way, the voice control device does not need to wait for a silent period, but starts to perform voice recognition after receiving the user's voice signal.
  • the first voice signal delivered by the user is a single character.
  • the first voice signal sent by the user is "stop", and the user needs a period of time, such as 0.5s, to finish saying the word "stop".
  • the voice control device can perform the following operations: receive the voice signal "stop" and convert it into the text information "stop".
  • the first voice signal delivered by the user is a plurality of characters.
  • the first voice signal sent by the user is "just tune here", and the user needs a period of time, such as 2s, to finish saying the four characters of "just tune here".
  • Time T2: Receive the voice signal "tune", convert it into the character "tune", and combine it with the text information "just" generated at time T1 to generate the text information "just tune".
  • Time T3: Receive the voice signal "to", convert it into the character "to", and combine it with the text information "just tune" from time T2 to generate the text information "just tune to".
  • Time T4: Receive the voice signal "this", convert it into the character "this", and combine it with the text information "just tune to" from time T3 to generate the text information "just tune here".
  • the text information recognized by the voice control device has complete semantics.
  • at the earlier times, the text information recognized by the voice control device does not yet have complete semantics, whereas the text information obtained at the above-mentioned time T4 has complete semantics.
  • the voice control device needs to determine whether the recognized text information has complete semantics. Here, text information having complete semantics can be understood to mean that the voice control device can determine corresponding structured information or control instructions according to the text information.
  • a classification model can be preset, and the classification model is used to identify whether the text information has complete semantics.
  • the classification model can be called a first preset model. The input of the first preset model is the text information obtained by the voice control device through streaming speech recognition (or one or more characters contained in the text information), and the output of the first preset model is first indication information, which indicates whether the text information has complete semantics.
  • the first indication information may be a preset bit. For example, when the preset bit is 1, it indicates that the input text information has complete semantics; when the preset bit is 0, it indicates that the input text information does not have complete semantics.
  • the first preset model may be obtained by training in the following manner:
  • the first training set includes a plurality of first training data, and each first training data in the plurality of first training data includes first training text information and a first label, wherein the first training text information includes one or more words, and the first label is used to indicate whether the first training text information has complete semantics.
  • the first label may be manually pre-labeled, or may be automatically labeled during the machine learning process.
  • the first label can use a preset bit to indicate whether the corresponding first training text information has complete semantics. For example, when the preset bit is 1, it indicates that the corresponding first training text information has complete semantics; when the preset bit is 0, it indicates that the corresponding first training text information does not have complete semantics.
  • Table 1 exemplarily provides a plurality of first training data in the first training set for this application.
  • the first training data includes first training text information "Ji” and a first label "0", and the first label "0" is used to indicate that the first training text information "Ji" does not have complete semantics.
  • the first training data includes first training text information "stop” and a first label "1", and the first label "1" is used to indicate that the first training text information "stop” has complete semantics.
  • Table 1
    First training text information | First label | First training text information | First label
    At once | 0 | broadcast | 0
    just, adjust | 0 | play, play | 0
    to, to, to | 0 | play, play, sound | 0
    just, tune, to, this | 1 | play music | 1
    stop | 1 | stop | 1
  • one or more rounds of model training (which may be referred to as first model training) can be performed on the first training model according to the plurality of first training data in the first training set, and the trained model is used as the first preset model.
  • a plurality of first training data in the first training set may be input into the first training model to obtain the output result of the first training model (referred to as the first output result).
  • the first output result is, for example, determining whether the first training text information in each first training data has complete semantics.
  • a model update parameter, such as a gradient parameter, is determined.
  • the current first training model is updated according to the model update parameter.
  • the next first model training is performed based on the updated first training model, and the above operations are repeated until the determined first output result meets the first preset condition.
  • the output accuracy rate of the first training model can be determined according to the first output result. For example, if there are 1000 first training data in total and the output results corresponding to 900 of them are correct, the output accuracy rate is 90%.
  • the first preset condition may be set such that the output accuracy rate is greater than the preset accuracy rate.
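  • The train, evaluate, and update cycle described above can be sketched with a toy classifier. This sketch swaps in a simple perceptron-style update in place of the gradient-based update mentioned above, uses a small hand-written training set loosely modeled on Table 1, and treats each word as one feature; the names `TRAINING_SET`, `predict`, `accuracy`, and `train` are all hypothetical.

```python
from typing import Dict, List, Tuple

Example = Tuple[List[str], int]  # (tokens of the training text, first label)

# Toy first training set: prefixes are labeled 0, complete phrases are labeled 1.
TRAINING_SET: List[Example] = [
    (["just"], 0),
    (["just", "tune"], 0),
    (["just", "tune", "to"], 0),
    (["just", "tune", "to", "this"], 1),
    (["play"], 0),
    (["play", "music"], 1),
    (["stop"], 1),
]


def predict(weights: Dict[str, float], bias: float, tokens: List[str]) -> int:
    """Output 1 (complete semantics) when the linear score is positive, else 0."""
    score = bias + sum(weights.get(token, 0.0) for token in tokens)
    return 1 if score > 0 else 0


def accuracy(weights: Dict[str, float], bias: float, data: List[Example]) -> float:
    """Fraction of training data whose predicted label matches the first label."""
    correct = sum(predict(weights, bias, tokens) == label for tokens, label in data)
    return correct / len(data)


def train(
    data: List[Example], preset_accuracy: float = 0.99, max_rounds: int = 100
) -> Tuple[Dict[str, float], float]:
    """Repeat training rounds until the accuracy exceeds the preset accuracy."""
    weights: Dict[str, float] = {}
    bias = 0.0
    for _ in range(max_rounds):
        for tokens, label in data:
            error = label - predict(weights, bias, tokens)  # -1, 0, or +1
            if error:
                bias += error
                for token in tokens:
                    weights[token] = weights.get(token, 0.0) + error
        if accuracy(weights, bias, data) > preset_accuracy:
            break  # first preset condition met
    return weights, bias


weights, bias = train(TRAINING_SET)
```

  • The stopping test plays the role of the first preset condition: training ends once the output accuracy rate on the training set exceeds the preset accuracy rate.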
  • the voice control device can further update the model parameters of the first preset model according to the data obtained during the working process, so as to improve the accuracy of the model.
  • the voice control device processes the first voice signal through streaming voice recognition technology to obtain text information corresponding to the first voice signal (hereinafter referred to as third text information), where the third text information includes M characters, where M is a positive integer.
  • the voice control device inputs the third text information into the first preset model, and generates the first text information according to the output result of the first preset model and the third text information.
  • the output result of the first preset model indicates that the third text information has complete semantics
  • the voice control apparatus may use the third text information as the first text information. For example, the third text information "just tune here", composed of the four characters "just", "tune", "to", and "this", is input into the first preset model; if the output of the first preset model is "1", the voice control device can take the third text information "just tune here" as the first text information.
  • the voice control apparatus may, after recognizing a new character through streaming speech recognition, input new third text information composed of the new character and the M characters into the first preset model, and repeat this until the output result of the first preset model indicates that the input third text information has complete semantics; that third text information is then used as the first text information.
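  • The loop of extending the third text information one character at a time can be sketched as follows. The function `toy_first_preset_model` is a lookup-table stand-in for the trained first preset model, and the token spellings follow the four-character example above; both are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Optional

# Stand-in for the first preset model: a fixed set of phrases with complete semantics.
COMPLETE_PHRASES = {("just", "tune", "to", "this"), ("stop",)}


def toy_first_preset_model(chars: List[str]) -> int:
    """Return 1 if the text has complete semantics, else 0 (toy lookup, not a model)."""
    return 1 if tuple(chars) in COMPLETE_PHRASES else 0


def first_text_information(
    recognized_stream: Iterable[str],
    model: Callable[[List[str]], int],
) -> Optional[List[str]]:
    """Accumulate streamed characters until the model reports complete semantics."""
    third_text: List[str] = []
    for new_char in recognized_stream:
        third_text.append(new_char)  # M characters plus the newly recognized one
        if model(third_text) == 1:
            return third_text  # used as the first text information
    return None  # the stream ended before complete semantics was reached


result = first_text_information(["just", "tune", "to", "this"], toy_first_preset_model)
```

  • Because the check runs after every new character, the device can act as soon as the semantics are complete instead of waiting for a silence period.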
  • Step 702 the voice control apparatus controls the target device to switch from the first operation state to the second operation state in the designated operation state according to the first text information and the second text information.
  • the voice control apparatus first acquires the second text information.
  • the second text information is used to control the target device to enter the first running state in the specified running state, and there is a contextual relationship between the second text information and the first text information.
  • the voice control device will store a session state, and the session state may include text information and/or structured information determined by the voice control device according to the last received and processed voice signal.
  • the voice control device can determine whether to generate a corresponding control instruction according to the currently received voice signal and in combination with the stored session state.
  • the text information in the session state may be referred to as historical text information
  • the structured information in the session state may be referred to as historical structured information.
  • the historical text information and the historical structured information may also be referred to as second text information and second structured information, respectively.
  • the historical text information is the preceding text of the first text information, and/or the first text information is the following text of the historical text information.
  • the historical text information and the first text information both correspond to the same target device.
  • both the historical text information and the first text information correspond to car windows.
  • both the historical text information and the first text information correspond to seats.
  • the execution action corresponding to the historical text information is of the same type as the execution action corresponding to the first text information.
  • the historical text information is used to instruct the car window to move down
  • the first text information is used to instruct the car window to stop moving down, both of which correspond to the action type of down move.
  • the historical text information is used to instruct the seat to be raised
  • the first text information is used to instruct the seat to stop being raised, both of which correspond to the action type of raising.
  • the historical text information is "open the car window", and the first text information is "just tune here".
  • the historical text information is "Turn down the wind power of the air conditioner", and the first text information is "OK”.
  • a contextual relationship between the historical text information and the first text information may be used as the third preset condition.
  • the historical text information may instruct the target device to enter a certain operating state
  • the first text information may instruct the target device to switch from the operating state to another operating state.
  • the second text information instructs the target device to enter the first operating state
  • the first text information instructs the target device to switch from the first operating state to the second operating state.
  • the historical text information may instruct a certain device to enter a certain operating state
  • the first text information may instruct other devices to enter other operating states.
  • the following example illustrates the situation where there is no contextual relationship between the historical text information and the first text information:
  • the historical text information is "open car window", and the first text information is "open bluetooth”.
  • the voice control device may determine whether there is a contextual relationship between the historical text information and the first text information. In one example, whether there is a contextual relationship between the historical text information and the first text information may be determined through the above-mentioned condition 1 and/or condition 2.
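  • A rule-based version of this check can be sketched as below, taking the two conditions to be the same-target-device test and the same-action-type test described above. The `Intent` record and its field values are hypothetical; in practice they would have to be derived from the text information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """Hypothetical summary of a piece of text information."""

    device: str       # target device, e.g. "window"
    action_type: str  # action type, e.g. "move_down"


def has_context(historical: Intent, first: Intent, require_both: bool = True) -> bool:
    """Check the same-device and same-action-type conditions.

    `require_both=True` applies both conditions ("and");
    `require_both=False` accepts either one ("or").
    """
    same_device = historical.device == first.device
    same_action = historical.action_type == first.action_type
    if require_both:
        return same_device and same_action
    return same_device or same_action


open_window = Intent(device="window", action_type="move_down")
stop_window = Intent(device="window", action_type="move_down")
open_bluetooth = Intent(device="bluetooth", action_type="open")

related = has_context(open_window, stop_window)
unrelated = has_context(open_window, open_bluetooth)
```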
  • a classification model may be preset, and the classification model is used to determine whether there is a contextual relationship between two pieces of text information.
  • the classification model may be called a second preset model. The input of the second preset model is two pieces of text information, specifically the historical text information and the first text information; the output of the second preset model is second indication information, which indicates whether there is a contextual relationship between the historical text information and the first text information.
  • the second indication information may be a preset bit.
  • when the preset bit takes a value of 1, it indicates that there is a contextual relationship between the input historical text information and the first text information; when the preset bit takes a value of 0, it indicates that there is no contextual relationship between the input historical text information and the first text information.
  • the second preset model may be obtained by training in the following manner:
  • the second training set includes a plurality of second training data, and each second training data in the plurality of second training data includes two pieces of text information and a second label, wherein the second label is used to indicate whether there is a contextual relationship between the two pieces of text information.
  • the two pieces of text information have a sequential order.
  • the second label may be manually pre-labeled, or may be automatically labeled during the machine learning process.
  • the second label can use a preset bit to indicate whether there is a contextual relationship between the two corresponding pieces of text information. For example, when the preset bit is 1, it indicates that there is a contextual relationship between the two corresponding pieces of text information; when the preset bit is 0, it indicates that there is no contextual relationship between them.
  • Table 2 exemplarily provides a plurality of second training data in the second training set for this application.
  • the second training data includes two pieces of text information, "open the window" and "turn off the air conditioner", and a second label "0", which indicates that there is no contextual relationship between "open the window" and "turn off the air conditioner".
  • the second training data includes two pieces of text information, "open the car window" and "just tune here", and a second label "1", which indicates that there is a contextual relationship between "open the car window" and "just tune here".
  • one or more rounds of model training (which may be referred to as second model training) can be performed on the second training model according to the plurality of second training data in the second training set, and the trained model is used as the second preset model.
  • a plurality of second training data in the second training set may be input into the second training model to obtain the output result of the second training model (referred to as the second output result).
  • the second output result is, for example, determining whether there is a contextual relationship between two pieces of text information in each second training data.
  • a model update parameter, such as a gradient parameter, is determined.
  • the current second training model is updated according to the model update parameter.
  • the output accuracy rate of the second training model may be determined according to the second output result. For example, if there are 1000 second training data in total and the output results corresponding to 900 of them are correct, the output accuracy rate is 90%.
  • the second preset condition may be set such that the output accuracy rate is greater than the preset accuracy rate.
  • the voice control device can further update the model parameters of the second preset model according to the data obtained during the working process, so as to improve the accuracy of the model.
  • the voice control device inputs the historical text information and the first text information into the second preset model, and determines, according to the output result of the second preset model, whether there is a contextual relationship between the historical text information and the first text information, that is, whether second text information exists. The cases are explained as follows:
  • the voice control device determines a control instruction according to the second text information and the first text information, and the control instruction is used to control the target device to switch from the first operating state to the second operating state.
  • the following first explains that the target device enters the first operating state.
  • the voice control apparatus acquires the second text information based on the above implementation manner of acquiring the first text information.
  • the voice control apparatus acquires the second voice signal sent by the user, and obtains N characters corresponding to the second voice signal through voice recognition, where N is a positive integer.
  • the voice control device performs natural language understanding on the second text information to obtain second structured information, and then controls the target device to enter the first operating state according to the second structured information.
  • the second voice signal sent by the user is used to instruct the target device to enter the first operating state, for example, to instruct the window to enter the operating state of moving downward or to instruct the seat to enter the operating state of slowly rising. That is, the time delay requirement of the execution process corresponding to the second voice signal is lower than that of the execution process (i.e., process control) corresponding to the first voice signal, so the voice control device may also control the target device to enter the first operating state using an existing method, which is not limited in this application.
  • One or more second preset structured information is included in the voice control device.
  • the second preset structured information corresponds to a preset set, and the preset set includes one or more pieces of preset text information.
  • one or more preset text information may correspond to one or more preset instruction identifiers.
  • Table 3 provides a correspondence between the second preset structured information and the preset set provided by the present application.
  • the preset instruction identifier corresponding to the preset text information "stop, stop, ok, ok" is "window stop".
  • the preset instruction identifier corresponding to the preset text information "stop, stop, ok, ok" is "seat stop".
  • the voice control device determines, according to the second structured information, the preset set corresponding to the second structured information from the correspondence between the second preset structured information and the preset set, and then determines whether the first text information is included in that preset set.
  • the voice control apparatus may determine a control instruction for controlling the target device according to the preset instruction identifier corresponding to the first text information in the preset set.
  • the second structured information is "control-window.adjust”
  • the voice control device determines that the first text information "stop" is in the preset set corresponding to "control-window.adjust", and further determines that the preset instruction identifier corresponding to "stop" is "window stop".
  • the voice control device sends a window stop command to the window according to the preset instruction identifier "window stop".
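  • The lookup described above amounts to a two-level mapping, sketched below. The key "control-seat.adjust" and the exact preset texts are hypothetical placeholders; only "control-window.adjust", "window stop", and "seat stop" appear in the text above.

```python
from typing import Dict, Optional

# Second preset structured information -> preset set
# (preset text information -> preset instruction identifier).
PRESET_SETS: Dict[str, Dict[str, str]] = {
    "control-window.adjust": {"stop": "window stop", "ok": "window stop"},
    "control-seat.adjust": {"stop": "seat stop", "ok": "seat stop"},  # hypothetical key
}


def preset_instruction_id(second_structured: str, first_text: str) -> Optional[str]:
    """Find the preset instruction identifier for the first text information.

    Returns None when there is no preset set for the second structured information
    or when the first text information is not in that preset set.
    """
    preset_set = PRESET_SETS.get(second_structured, {})
    return preset_set.get(first_text)


window_id = preset_instruction_id("control-window.adjust", "stop")
seat_id = preset_instruction_id("control-seat.adjust", "stop")
```

  • The same word "stop" resolves to different instructions depending on the second structured information, which is how the earlier command supplies the context for the later one.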
  • the first preset structured information corresponding to the preset text information may also be set in the preset set corresponding to each second preset structured information.
  • the preset text information "stop, stop, ok, ok" corresponds to the preset instruction identifier "window stop", and further corresponds to the first preset structured information "control-window.stop".
  • the dialog management can be performed according to the first preset structured information for subsequent instruction issuance.
  • an invalid instruction may mean that the voice control device does not issue the control instruction, or that the target device does not execute the control instruction after it is issued to the target device.
  • the voice control device may determine that the control command is invalid. Further, the voice control device can initiate a dialogue according to the first preset structured information "control-window.slower" corresponding to the "window deceleration command", such as reminding the user that the current minimum descent speed has been reached, or asking the user whether to stop moving the window down.
  • the preset set corresponding to the second preset structured information may include one or more preset text information and one or more first preset structured information.
  • Table 5 shows the correspondence between the second preset structured information and the preset set provided by the present application.
  • the first preset structured information corresponding to the preset text information "stop, stop, ok, ok” is "stop”.
  • the voice control device determines, according to the second structured information, the preset set corresponding to the second structured information from the correspondence between the second preset structured information and the preset sets, and then determines whether the first text information is included in that preset set. In the case that the first text information is included in the preset set, the voice control device may generate third structured information according to the first preset structured information corresponding to the first text information in the preset set, in combination with the second structured information, and determine a control instruction for controlling the target device according to the third structured information.
  • the voice control device determines that the first text information "stop" is in the preset set corresponding to "control-window.adjust", and further determines that the first preset structured information corresponding to "stop" is "stop".
  • the voice control device generates third structured information such as "control-window.stop" according to the first preset structured information "stop" and the second structured information "control-window.adjust", and then sends a window stop instruction to the car window according to the third structured information "control-window.stop".
  • the voice control device may perform natural language understanding on the first text information to obtain the first structured information, then generate the third structured information according to the first structured information and the second structured information, and determine the control instruction for controlling the target device according to the third structured information.
  • the second structured information is "control-window.adjust”
  • the voice control device determines that the first text information "just adjust here" is not in the preset set corresponding to "control-window.adjust".
  • the voice control device performs natural language understanding on the first text information "just adjust here" and obtains the first structured information such as "stop". The voice control device then generates third structured information such as "control-window.stop" according to the second structured information "control-window.adjust" and the first structured information "stop", and issues a window stop instruction to the car window according to the third structured information "control-window.stop".
  • the voice control device may not include a preset set corresponding to the second structured information, that is, the one or more second preset structured information in the voice control device does not include the second structured information.
  • the voice control device can perform natural language understanding on the first text information to obtain the first structured information, then generate the third structured information according to the first structured information and the second structured information, and determine the control instruction for controlling the target device according to the third structured information.
  • the second structured information is "media-set.adjust" (where "media-set.adjust" is used to control the car speaker to play music), and the second structured information is not included in the plurality of second preset structured information.
  • the first structured information is still "stop"
  • the voice control device can generate the third structured information "media-set.stop" according to the second structured information "media-set.adjust" and the first structured information "stop", and then generate, according to the third structured information "media-set.stop", a stop instruction for controlling the car speaker to stop playing music.
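One way to picture how the third structured information is formed in the "stop" examples is to treat structured information as a string of the form "target.action" and swap in the new action. That string layout is only an assumption read off the examples "control-window.adjust" and "media-set.adjust"; the function name is hypothetical, and the sketch does not attempt to cover cases where the first structured information changes the target rather than the action.

```python
def merge_structured_info(second_structured_info: str,
                          first_structured_info: str) -> str:
    """Replace the action part of the second structured information.

    Assumes the hypothetical layout "<target>.<action>", for example
    "control-window.adjust".
    """
    target, separator, _old_action = second_structured_info.rpartition(".")
    if not separator:
        raise ValueError("expected structured information like 'target.action'")
    return f"{target}.{first_structured_info}"
```

With this sketch, merging "control-window.adjust" with "stop" gives "control-window.stop", and merging "media-set.adjust" with "stop" gives "media-set.stop", matching the examples.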
  • the control instruction determined by the voice control device according to the third structured information may be invalid.
  • the second structured information is "control-window.adjust" and the first structured information is "top"
  • the third structured information generated is, for example, "control-top-window.adjust", and the corresponding control instruction is, for example, adjusting the sunroof.
  • the voice control device can determine that the generated control instruction is an invalid instruction.
  • the voice control device can update the session state according to the newly generated third structured information, and when the voice control device receives a new voice signal again, it can determine whether to generate a valid control command according to the new voice signal and the session state.
  • if the voice control device then receives a voice signal such as "stop", it can determine to stop the sunroof according to "stop" and "control-top-window.adjust" in the session state.
  • the voice control device may also initiate an inquiry, and communicate with the user through dialogue to generate effective control instructions. For example, when it is determined that the control instruction corresponding to the third structured information is invalid, a query sentence is generated, such as "Do you need to adjust the sunroof?" or "How do you adjust the sunroof?", and when it is determined that the user needs to stop adjusting the sunroof, a sunroof stop instruction is issued.
  • the target device may not be indicated in the first text information (or the first voice signal), and the voice control device can determine the target device corresponding to the first text information according to the second text information that has a contextual relationship with the first text information.
  • the second text information and the first text information correspond to the same target device, that is, the target device corresponding to the first text information is the same as the target device corresponding to the second text information.
  • the second text information is "open car window", the target device is the car window, and the first text information is "stop".
  • although the first text information does not contain the target device, it can be determined from the second text information having the contextual relationship that the target device of the first text information is also the car window.
  • the present application does not exclude the situation that the target device is indicated in the first text information (or the first voice signal).
  • the second text information is "open the window" and the first text information is "the window is stopped", both of which indicate that the target device is a car window.
  • the voice control device performs voice understanding according to the first text information, obtains the first structured information, and updates the conversation state according to the first structured information.
  • the conversation state is stored in the voice control device, which is equivalent to storing historical text information and historical structured information in the voice control device.
  • the voice control device may perform speech understanding according to the first text information to obtain the first structured information, and then update the conversation state according to the first text information and the first structured information.
  • the conversation state in the voice control device is empty, which is equivalent to the historical text information and historical structured information not being stored in the voice control device; the voice control device can perform voice understanding according to the first text information to obtain the first structured information, and then use the first text information and the first structured information as the current session state.
  • when the voice control device receives a voice signal again, it can generate a control instruction according to the new voice signal and the updated session state, or update the session state again.
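The session state described above can be sketched as a small container of historical text information and historical structured information. This is an illustration only: the class name, the list-of-turns layout, and the method names are assumptions and do not come from the application.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SessionState:
    """Hypothetical store of (text information, structured information) turns."""
    turns: List[Tuple[str, str]] = field(default_factory=list)

    def is_empty(self) -> bool:
        return not self.turns

    def update(self, text: str, structured_info: str) -> None:
        # For an empty state this makes the pair the current session state;
        # otherwise it records the pair as the newest turn.
        self.turns.append((text, structured_info))

    def latest(self) -> Optional[Tuple[str, str]]:
        return self.turns[-1] if self.turns else None
```

Keeping the whole history is a design choice of this sketch; a store that keeps only the most recent turn would serve the examples in the text equally well.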
  • the flow control module can be provided with a first preset model and a second preset model, that is, the flow control module is used to determine the first text information with complete semantics according to the first voice signal, and to determine whether the first text information and the historical text information have a contextual relationship.
  • a preset database may be set in the quick matching module, and the preset database includes one or more second preset structured information, that is, the quick matching module is used to determine whether the first text information corresponds to a preset instruction identifier.
  • based on the modules in FIG. 6, another voice control method is provided; for the flow of the method, refer to FIG. 8.
  • Step 801: the ASR module determines third text information according to the first voice signal, wherein the third text information includes M characters, and M is a positive integer.
  • Step 802: the ASR module sends the third text information to the flow control module.
  • the flow control module receives the third text information from the ASR module.
  • Step 803: the flow control module inputs the third text information into the first preset model, and determines whether the third text information has complete semantics. If yes, go to step 804; otherwise, go back to step 801.
  • Step 804: the flow control module determines whether the historical text information has a contextual relationship with the first text information (i.e., the third text information determined in step 803 to have complete semantics). If yes, execute step 805; otherwise, the first text information is processed by the NLU module and the DM module.
  • Step 805: the flow control module sends the first text information to the quick matching module.
  • the quick matching module receives the first text information from the flow control module.
  • Step 806: the quick matching module determines whether there is a preset instruction identifier corresponding to the first text information in the preset set corresponding to the second structured information. If yes, execute step 807; otherwise, the first text information is processed by the NLU module and the DM module.
  • Step 807: the quick matching module sends the preset instruction identifier corresponding to the first text information to the decision module.
  • Step 808: the decision module generates a control instruction according to the preset instruction identifier corresponding to the first text information.
  • Step 809: the decision module sends the control instruction to the target device.
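The branching in steps 803 to 808 can be sketched as a single routing function. This is a simplified illustration, not the claimed implementation: the two preset models and the NLU/DM path are passed in as plain callables, and all names (`handle_text`, `PRESET_SETS`, the returned path labels) are hypothetical.

```python
from typing import Callable, Optional, Tuple

# Hypothetical preset database held by the quick matching module.
PRESET_SETS = {"control-window.adjust": {"stop": "window stop"}}


def handle_text(third_text: str,
                history_text: Optional[str],
                second_structured_info: Optional[str],
                is_complete: Callable[[str], bool],
                has_context: Callable[[str, str], bool],
                nlu_dm: Callable[[str], str]) -> Tuple[str, Optional[str]]:
    """Return (path taken, result) for one piece of recognized text."""
    # Step 803: the first preset model decides whether semantics are complete.
    if not is_complete(third_text):
        return ("wait", None)  # back to step 801: keep recognizing
    first_text = third_text

    # Step 804: the second preset model decides whether there is context.
    if history_text is None or not has_context(history_text, first_text):
        return ("nlu_dm", nlu_dm(first_text))

    # Steps 805-806: quick matching against the preset set.
    identifier = PRESET_SETS.get(second_structured_info, {}).get(first_text)
    if identifier is None:
        return ("nlu_dm", nlu_dm(first_text))

    # Steps 807-808: the identifier goes to the decision module.
    return ("quick_match", identifier)
```

Sending the resulting control instruction to the target device (step 809) is left out of the sketch, since it depends on the vehicle interface.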
  • FIG. 9 exemplarily provides a schematic flowchart of the input and output of the two preset models in the flow control module, wherein the input of the first preset model is the third text information, and the output of the first preset model indicates that the third text information has complete semantics.
  • the flow control module uses the third text information as the first text information, and inputs the historical text information and the first text information into the second preset model; for example, the historical text information is "open the car window", and the output of the second preset model indicates that the historical text information and the first text information have a contextual relationship.
  • the first preset model and the second preset model may be obtained through self-supervised learning.
  • a voice signal ie, the first voice signal
  • Example 1: the voice signal (that is, the first voice signal) sent by the user the second time is "stop". Referring to the voice control flow exemplarily shown in FIG. 10, the following steps are included:
  • Step 1: the voice control device determines that the text information "stop" has complete semantics.
  • Step 2: the voice control device determines that the text information "stop" and the text information "open car window" have a contextual relationship.
  • Step 3: the voice control device determines that the preset set corresponding to the structured information "control-window.adjust" includes the text information "stop", determines that the preset instruction identifier corresponding to the text information "stop" is "window stop", and determines the window stop instruction according to the preset instruction identifier "window stop".
  • Example 2: the voice signal (that is, the first voice signal) sent by the user the second time is "just adjust here". Referring to the voice control flow exemplarily shown in FIG. 11, the following steps are included:
  • Step 1: the voice control device determines that the text information "just" does not have complete semantics.
  • Step 2: the voice control device determines that the text information "just adjust" does not have complete semantics.
  • Step 3: the voice control device determines that the text information "just adjust to" does not have complete semantics.
  • Step 4: the voice control device determines that the text information "just adjust here" has complete semantics.
  • Step 5: the voice control device determines that the text information "just adjust here" has a contextual relationship with the text information "open the car window".
  • Step 6: the voice control device determines that the preset set corresponding to the structured information "control-window.adjust" does not include the text information "just adjust here".
  • Step 7: the voice control device performs semantic analysis on the text information "just adjust here" to obtain the structured information "stop".
  • Step 8: after the voice control device performs dialogue management on the structured information "stop" and the structured information "control-window.adjust", the structured information "control-window.stop" is obtained.
  • Step 9: the voice control device generates a window stop instruction according to the structured information "control-window.stop".
  • Example 3: the voice signal (that is, the first voice signal) sent by the user the second time is "playing music". Referring to the voice control flow exemplarily shown in FIG. 12, the following steps are included:
  • Step 1: the voice control device determines that the text information "play" does not have complete semantics.
  • Step 2: the voice control device determines that the text information "playing" does not have complete semantics.
  • Step 3: the voice control device determines that the text information "playing sound" does not have complete semantics.
  • Step 4: the voice control device determines that the text information "playing music" has complete semantics.
  • Step 5: the voice control device determines that the text information "playing music" and the text information "opening the car window" do not have a contextual relationship.
  • the voice control device performs semantic analysis, dialogue management, and so on according to "playing music", and updates the conversation state.
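The repeated completeness checks in Examples 2 and 3 amount to testing each growing partial recognition result and stopping at the first one judged complete. A minimal sketch follows, assuming the partial results arrive as an iterable of strings and the first preset model is available as a callable; the function name and the set-based stand-in for the model are hypothetical.

```python
from typing import Callable, Iterable, Optional


def first_complete_text(partials: Iterable[str],
                        is_complete: Callable[[str], bool]) -> Optional[str]:
    """Return the first partial result judged to have complete semantics."""
    for text in partials:
        if is_complete(text):
            # Stop here without waiting for a trailing silence period.
            return text
    return None


# Stand-in for the first preset model: a fixed set of complete phrases.
COMPLETE_PHRASES = {"stop", "playing music"}
example_3_partials = ["play", "playing", "playing sound", "playing music"]
result = first_complete_text(example_3_partials, COMPLETE_PHRASES.__contains__)
```

With the Example 3 partial results and this stand-in model, `result` is the fourth partial result, "playing music".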
  • the car window may be controlled by the motor in the in-vehicle circuit: the voice control device may send a control instruction to the in-vehicle circuit, and the in-vehicle circuit controls the power supply of the motor according to the control instruction so as to control the window.
  • when controlling the window to move down according to the second voice signal, the voice control device may send the window down instruction to the in-vehicle circuit; according to the window down instruction, the in-vehicle circuit connects the motor power supply and the motor works, making the window move down slowly.
  • when the voice control device controls the window to stop moving downward in response to the first voice signal, the voice control device sends a window stop instruction to the in-vehicle circuit; according to the window stop instruction, the in-vehicle circuit disconnects the motor power supply and the motor stops working, so that the window stops moving.
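The behaviour of the in-vehicle circuit described above can be pictured as a two-state switch on the motor power supply. The class below is a toy model for illustration only; the class name, the instruction strings as plain text, and the boolean power flag are all assumptions rather than details of any real vehicle interface.

```python
class InVehicleCircuit:
    """Toy model of a circuit that connects or disconnects the motor power."""

    def __init__(self) -> None:
        self.motor_powered = False

    def handle(self, instruction: str) -> bool:
        """Apply a control instruction and return the new power state."""
        if instruction == "window down":
            self.motor_powered = True   # power connected, window moves down
        elif instruction == "window stop":
            self.motor_powered = False  # power disconnected, window stops
        else:
            raise ValueError(f"unsupported instruction: {instruction}")
        return self.motor_powered
```

Sending "window down" followed by "window stop" leaves the motor unpowered, which corresponds to the window stopping partway.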
  • the car window may be controlled by a stepping motor in the stepping circuit; the voice control device may send a stepping signal to the stepping motor, and the stepping motor controls the car window according to the stepping signal.
  • the voice control device may send a start stepping signal to the stepping motor according to the window down instruction to control the stepping circuit to work, so that the window moves down slowly.
  • the voice control device may send a stop stepping signal to the stepping motor to control the stepping circuit to stop working, so that the car window stops moving.
  • the voice control device can obtain the second voice signal, determine the second text information according to the second voice signal, perform natural language understanding on the second text information to obtain the second structured information, and then control, according to the second structured information, the target device to enter the first operating state in the designated operating states.
  • the voice control device may also perform streaming speech recognition on the first voice signal while continuously obtaining it, so as to obtain the corresponding M characters and the first text information, with complete semantics, composed of the M characters. There is no need to wait for a silence period after the user finishes issuing the voice signal; once text information with complete semantics is obtained, it is inferred that the user has finished issuing the voice signal, thereby effectively reducing the control delay.
  • the target device in the first running state can be controlled according to the first text information and the second text information; specifically, the running state of the target device is switched from the first running state to the second running state. In this way, the target device indicated by the first voice signal sent by the user can be effectively determined and controlled.
  • when controlling the target device, it can be determined whether the first text information is in the preset set corresponding to the second text information (i.e., the second structured information); when the first text information is in the preset set, the preset instruction identifier corresponding to the first text information can be determined directly from the preset set, without performing natural language understanding and dialogue management on the first text information, which helps to further reduce the delay in the control process.
  • the time delay in the control process can be effectively reduced, and the user can control the target device more intuitively and effectively by sending a voice signal, which helps to improve the user experience.
  • the time delay generated by the voice control device for processing the voice signal can be reduced.
  • FIG. 13 is a schematic diagram of the time delay generated when the second voice control device provided by the present application processes a voice signal.
  • the voice control device performs voice recognition from the moment of receiving the voice signal; when the preset instruction identifier corresponding to the first text information cannot be determined according to the preset set corresponding to the second structured information, the corresponding control instruction can be obtained after processing by the NLU module and the DM module, and sent to the target device.
  • the method of the present application can at least avoid the time delay caused by the voice control apparatus waiting for the silent duration.
  • FIG. 14 is a schematic diagram of the time delay generated when the third voice control device provided by the present application processes a voice signal.
  • the voice control device performs voice recognition from the moment of receiving the voice signal, and determines the preset instruction identifier corresponding to the first text information according to the preset set corresponding to the second structured information, so as to obtain the corresponding control instruction and deliver it to the target device.
  • the method of the present application can not only avoid the delay caused by the voice control device waiting for the silent duration, but also avoid the delay caused by the processing of the NLU module and the DM module.
  • the methods and operations implemented by the voice control device may also be implemented by components (eg, chips or circuits) that can be used in the voice control device.
  • each functional module in each embodiment of the present application may be integrated into one processor, or may exist physically alone, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules.
  • FIG. 15 and FIG. 16 are schematic structural diagrams of possible voice control apparatuses provided by the present application. These voice control apparatuses can be used to implement the functions of the voice control apparatuses in the above method embodiments, and thus can also achieve the beneficial effects of the above method embodiments.
  • the voice control apparatus includes a processing module 1501 and a control module 1502 .
  • the processing module 1501 may be configured to execute step 701 in the method embodiment exemplarily shown in FIG. 7, and the control module 1502 may be configured to execute step 702 in the method embodiment exemplarily shown in FIG. 7.
  • the processing module 1501 may be used to perform steps 801 to 805 in the method embodiment exemplarily shown in FIG. 8, and the control module 1502 may be used to perform steps 806 to 809 in the method embodiment exemplarily shown in FIG. 8.
  • the processing module 1501 is used to determine the first text information with complete semantics according to the first voice signal; the control module 1502 is used to control, according to the first text information and the second text information, the target device to switch between specified running states, wherein the second text information is acquired before the first text information, the second text information is used to control the target device to enter the first running state in the specified running states, and the second text information has a contextual relationship with the first text information.
  • the second text information and the first text information having a contextual relationship includes at least one of the following: the second text information and the first text information correspond to the same target device; the execution action corresponding to the second text information and the execution action corresponding to the first text information are of the same type.
  • before the processing module 1501 determines the first text information with complete semantics according to the first voice signal, the processing module 1501 is further configured to: determine the second text information with complete semantics according to the second voice signal; and perform natural language understanding on the second text information to obtain the second structured information; the control module 1502 is further configured to: control the target device to enter the first operating state according to the second structured information.
  • the control module 1502 is specifically configured to: determine a preset set corresponding to the second structured information according to the second structured information corresponding to the second text information, where the preset set includes the correspondence between one or more preset text information and preset instruction identifiers; and when the one or more preset text information includes the first text information, determine the control instruction according to the preset instruction identifier corresponding to the first text information, wherein the control instruction is used to control the target device to switch from the first operating state in the designated operating states to the second operating state in the designated operating states.
  • the control module 1502 is further configured to: when the first text information is different from every one of the one or more preset text information, perform natural language understanding on the first text information to obtain the first structured information; and determine the control instruction according to the first structured information and the second structured information.
  • the control module 1502 is further configured to: after determining the control instruction according to the first structured information and the second structured information, in the case that the control instruction is invalid, update the second structured information according to the first structured information.
  • the processing module 1501 is specifically configured to: determine, according to the first voice signal, M characters corresponding to the first voice signal, where M is a positive integer; input the text information composed of the M characters into the first preset model to obtain the output result of the first preset model, where the first preset model is used to judge whether text information composed of the input characters has complete semantics; and generate the first text information according to the text information composed of the M characters and the output result of the first preset model.
  • the processing module 1501 is specifically configured to: obtain a first training set, where the first training set includes a plurality of first training data, and for each first training data in the plurality of first training data, the first training data includes first training text information and a first label, the first training text information consists of one or more words, and the first label is used to indicate whether the first training text information has complete semantics; according to the plurality of first training data and a first training model, perform first model training one or more times until the first output result of the first training model meets the first preset condition, and determine the first training model whose first output result meets the first preset condition as the first preset model; wherein the first model training includes: inputting the plurality of first training data into the first training model to obtain the first output result; and updating the model parameters in the first training model according to the first output result, to obtain the first training model after the model parameters are updated.
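The train-until-a-preset-condition loop described above can be illustrated with a deliberately tiny classifier. The application does not say what kind of model is used, so the word-level perceptron below is purely a stand-in; the function names, the accuracy threshold used as the "first preset condition", and the toy training set (built from phrases in Examples 1 to 3) are all assumptions.

```python
def features(text):
    """Represent text information as the set of its words."""
    return set(text.split())


def predict(model, text):
    """Return True when the model judges the text to have complete semantics."""
    score = model["bias"] + sum(model["weights"].get(tok, 0.0)
                                for tok in features(text))
    return score > 0


def train_round(model, training_data):
    """One round of model training: predict, then update parameters on errors."""
    for text, label in training_data:
        error = int(label) - int(predict(model, text))
        if error:
            model["bias"] += error
            for tok in features(text):
                model["weights"][tok] = model["weights"].get(tok, 0.0) + error


def accuracy(model, training_data):
    correct = sum(predict(model, text) == label for text, label in training_data)
    return correct / len(training_data)


def train_first_preset_model(training_data, threshold=1.0, max_rounds=50):
    """Repeat training rounds until accuracy meets the assumed preset condition."""
    model = {"bias": 0.0, "weights": {}}
    for _ in range(max_rounds):
        train_round(model, training_data)
        if accuracy(model, training_data) >= threshold:
            return model
    raise RuntimeError("preset condition not met within max_rounds")


# Toy first training set: (first training text information, first label).
FIRST_TRAINING_SET = [
    ("stop", True), ("just", False), ("play", False),
    ("playing", False), ("playing sound", False), ("playing music", True),
]
```

A practical system would use a far more capable model and a much larger training set; the sketch only shows the shape of the loop: input the training data, obtain the output, update the parameters, and stop once the condition is met.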
  • the processing module 1501 is further configured to: input the first text information and the historical text information into the second preset model to obtain the output result of the second preset model, where the second preset model is used to judge whether two input pieces of text information have a contextual relationship; and determine, according to the output result of the second preset model, the historical text information to be the second text information.
  • the processing module 1501 is specifically configured to: obtain a second training set, where the second training set includes a plurality of second training data, and for each second training data in the plurality of second training data, the second training data includes two pieces of second training text information and a second label, and the second label is used to indicate whether the two pieces of second training text information have a contextual relationship; according to the plurality of second training data and a second training model, perform second model training one or more times until the second output result of the second training model meets the second preset condition, and determine the second training model whose second output result meets the second preset condition as the second preset model; wherein the second model training includes: inputting the plurality of second training data into the second training model to obtain the second output result; and updating the model parameters in the second training model according to the second output result, to obtain the second training model after the model parameters are updated.
  • FIG. 16 shows the apparatus provided in this embodiment of the present application, and the apparatus shown in FIG. 16 may be a hardware circuit implementation of the apparatus shown in FIG. 15 .
  • the apparatus can be applied to the flow chart shown above to perform the functions of the voice control apparatus in the above method embodiments.
  • FIG. 16 shows only the main components of the device.
  • the voice control apparatus includes: a processor 1610 and an interface 1630 , and optionally, the voice control apparatus further includes a memory 1620 .
  • the interface 1630 is used to enable communication with other devices.
  • the method performed by the voice control apparatus in the above embodiments may be implemented by the processor 1610 calling a program stored in a memory (which may be the memory 1620 in the voice control apparatus, or an external memory). That is, the voice control apparatus may include a processor 1610, and the processor 1610 executes the method performed by the voice control apparatus in the above method embodiments by calling the program in the memory.
  • the processor here may be an integrated circuit with signal processing capability, such as a CPU.
  • the voice control device may be implemented by one or more integrated circuits configured to implement the above methods. For example: one or more ASICs, or, one or more microprocessor DSPs, or, one or more FPGAs, etc., or a combination of at least two of these integrated circuit forms. Alternatively, the above implementations may be combined.
  • the function/implementation process of the processing module 1501 and the control module 1502 in FIG. 15 can be implemented by the processor 1610 in the voice control device shown in FIG. 16 calling the computer execution instructions stored in the memory 1620 .
  • the present application provides a computing device, including a processor, the processor is connected to a memory, the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so that the computing device executes the above method methods in the examples.
  • the present application provides a computer-readable storage medium on which a computer program or instruction is stored.
  • the computing device executes the method in the above method embodiment.
  • the present application provides a computer program product, when a computer reads and executes the computer program product, so that a computing device executes the methods in the above method embodiments.
  • the present application provides a chip connected to a memory for reading and executing a software program stored in the memory, so that a computing device executes the methods in the above method embodiments.
  • an embodiment of the present application provides an apparatus that includes a processor and an interface circuit; the interface circuit is configured to receive a program or instruction code and transmit it to the processor, and the processor executes the program or instruction code to perform the methods in the above method embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A voice control method and apparatus, used for solving the problem of long delay in the existing voice control process. In the present application, first text information having complete semantics is determined according to a first voice signal; and according to the first text information and second text information, a target device is controlled to switch between specified operating states, wherein the second text information is acquired before the first text information, the second text information is used for controlling the target device to enter the first operating state in the specified operating states, and the second text information has a contextual relationship with the first text information.

Description

Voice control method and apparatus
Technical field
The present application relates to the field of autonomous driving, and in particular to a voice control method and apparatus.
Background
Voice interaction products have become a common part of daily life: smartphones, smart home devices, smart in-vehicle devices, and similar products all offer voice interaction. In the in-vehicle environment in particular, voice interaction frees the user's hands, allows commands to be issued quickly, and helps keep driving safe.
While driving, as the driving environment changes, the user can typically adjust in-vehicle equipment, such as the opening degree of a window or sunroof, through voice interaction with the in-vehicle voice control apparatus.
In the current voice interaction process, however, the voice control apparatus must first determine that the user has finished issuing the voice signal. Only then can it perform speech recognition and semantic analysis on the entire captured utterance to obtain a control instruction, and adjust the corresponding in-vehicle equipment, for example the opening degree of a window, according to that instruction. Because the user's speech can be recognized and analyzed only after the entire utterance has been captured, the overall control process has a long delay.
Summary of the invention
The present application provides a voice control method and apparatus for reducing control delay in the voice control process and improving user experience.
The voice control method provided in the present application may be implemented by a terminal device, for example a vehicle or an in-vehicle device. It may also be implemented by a component of the terminal device, such as a processing apparatus, circuit, or chip in the terminal device, for example a chip that supports wireless communication functions, such as a system chip or a communication chip. The system chip is also called a system on chip (SoC) chip. The communication chip may include a radio frequency processing chip and a baseband processing chip; the baseband processing chip is sometimes also called a modem. In a physical implementation, the communication chip may or may not be integrated into the SoC chip. For example, the baseband processing chip is integrated into the SoC chip while the radio frequency processing chip is not.
In a first aspect, the present application provides a voice control method. The method includes: determining, according to a first voice signal, first text information having complete semantics; and controlling, according to the first text information and second text information, a target device to switch among specified operating states. For example, the specified operating states of the target device may include at least a first operating state and a second operating state. The second text information is acquired before the first text information and has a contextual relationship with the first text information. The second text information is used to control the target device to enter the first operating state among the specified operating states, and the first text information is used to control the target device to switch from the first operating state to the second operating state.
For example, the target device is a vehicle window, and the specified operating states of the window may include moving downward (the first operating state) and stopping the downward movement (the second operating state). The second text information is used to control the window to move downward, and the first text information is used to control the window to stop moving downward. The window can thus be switched from the moving-downward state to the stopped state according to the first text information and the second text information.
It should be understood that, in the above technical solution, there is no need to wait for the user to finish issuing the voice signal. Instead, when the first text information has complete semantics and second text information having a contextual relationship with the first text information exists, it is determined that the user has finished issuing the voice signal, and a control instruction is generated according to the first text information and the second text information. This helps reduce control delay in the voice control process. Moreover, because the generated control instruction takes the previous text information into account, it can effectively control a target device that is in a specified operating state.
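The window example can be sketched as a small state machine. This is only an illustration under stated assumptions: the `OperatingState` enum, the `Window` class, and the two phrase sets are hypothetical names and values, not taken from the application.

```python
from dataclasses import dataclass
from enum import Enum


class OperatingState(Enum):
    IDLE = "idle"
    MOVING_DOWN = "moving_down"   # first operating state in the example
    STOPPED = "stopped"           # second operating state in the example


@dataclass
class Window:
    state: OperatingState = OperatingState.IDLE


# Hypothetical phrases; a real system would obtain text from speech recognition.
ENTER_FIRST_STATE = {"open the window"}
SWITCH_TO_SECOND_STATE = {"stop", "that is enough"}


def apply_second_text(window: Window, second_text: str) -> None:
    """The earlier utterance puts the device into the first operating state."""
    if second_text in ENTER_FIRST_STATE:
        window.state = OperatingState.MOVING_DOWN


def apply_first_text(window: Window, first_text: str, second_text: str) -> bool:
    """The later utterance switches states only when the earlier one set up the context."""
    in_context = (second_text in ENTER_FIRST_STATE
                  and window.state is OperatingState.MOVING_DOWN)
    if in_context and first_text in SWITCH_TO_SECOND_STATE:
        window.state = OperatingState.STOPPED
        return True
    return False
```

In this sketch a bare "stop" has no effect unless the window is already moving downward because of the earlier utterance, which mirrors the dependence of the first text information on the second.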
In an optional implementation, the contextual relationship between the second text information and the first text information includes at least one or more of the following: the second text information and the first text information correspond to (or act on) the same target device; the execution action corresponding to the second text information and the execution action corresponding to the first text information are of the same type.
It should be understood that, in the above technical solution, when determining whether the second text information has a contextual relationship with the first text information, it may specifically be determined whether the two correspond to the same target device, and/or whether the execution actions corresponding to them are of the same type. This helps improve the accuracy of determining that the second text information is the preceding context of the first text information.
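A minimal sketch of the two criteria follows, assuming each piece of text information has already been resolved to a target device and an action type. The `StructuredInfo` fields and the `require_both` switch are hypothetical; the application only states that one or both criteria may be used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuredInfo:
    target_device: str   # e.g. "window"
    action_type: str     # e.g. "position_adjustment"


def has_contextual_relationship(second: StructuredInfo,
                                first: StructuredInfo,
                                require_both: bool = False) -> bool:
    """Check the 'same target device' and 'same action type' criteria."""
    same_device = second.target_device == first.target_device
    same_action_type = second.action_type == first.action_type
    if require_both:
        return same_device and same_action_type
    return same_device or same_action_type
```

Requiring both criteria is the stricter choice and would reject a follow-up aimed at the same device with a different kind of action; which combination suits a given product is a design decision the text leaves open.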
In an optional implementation, before the first text information having complete semantics is determined according to the first voice signal, the method further includes: determining, according to a second voice signal, second text information having complete semantics; performing natural language understanding on the second text information to obtain second structured information; and controlling, according to the second structured information, the target device to enter the first operating state.
It should be understood that, in the above technical solution, the second voice signal is acquired first and the target device is controlled to enter the first operating state according to the second voice signal; the first voice signal is then acquired and the target device is controlled to switch from the first operating state to the second operating state according to the first voice signal, so that the target device in the first operating state is controlled according to the second voice signal.
In an optional implementation, controlling the target device to switch among the specified operating states according to the first text information and the second text information includes: determining, according to the second structured information corresponding to the second text information, a preset set corresponding to the second structured information, where the preset set includes correspondences between one or more pieces of preset text information and preset instruction identifiers; and when the one or more pieces of preset text information include the first text information, determining a control instruction according to the preset instruction identifier corresponding to the first text information, where the control instruction is used to control the target device to switch from the first operating state to the second operating state among the specified operating states.
It should be understood that, in the above technical solution, a preset set corresponding to the second structured information is provided. When the first text information is contained in the preset set, the preset instruction identifier corresponding to the first text information can be determined directly, and the control instruction is then generated according to that identifier. Natural language understanding and dialogue management need not be performed on the first text information, which further reduces delay in the control process.
In an optional implementation, controlling the target device to switch among the specified operating states according to the first text information and the second text information further includes: when the first text information differs from every one of the one or more pieces of preset text information, performing natural language understanding on the first text information to obtain first structured information; and determining the control instruction according to the first structured information and the second structured information.
It should be understood that, in the above technical solution, when the first text information is not contained in the preset set, natural language understanding can first be performed on the first text information to obtain the first structured information, and dialogue management is then performed according to the first structured information and the second structured information to obtain the control instruction. This helps ensure normal operation of the system.
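The fast path through the preset set and the fallback through natural language understanding can be sketched as follows. Everything named here is an assumption made for illustration: the `PRESET_SETS` table, its keys, the instruction identifiers, and the `nlu` and `dialogue_manager` callables, which stand in for components the application does not specify at this point.

```python
from typing import Callable, Dict, Tuple

# Hypothetical preset sets, keyed by a summary of the second structured information.
PRESET_SETS: Dict[Tuple[str, str], Dict[str, str]] = {
    ("window", "move_down"): {
        "stop": "CMD_WINDOW_STOP",
        "that is enough": "CMD_WINDOW_STOP",
    },
}


def resolve_instruction(first_text: str,
                        second_structured: Tuple[str, str],
                        nlu: Callable[[str], dict],
                        dialogue_manager: Callable[[dict, Tuple[str, str]], str],
                        ) -> Tuple[str, str]:
    """Return (instruction identifier, path taken)."""
    preset = PRESET_SETS.get(second_structured, {})
    instruction_id = preset.get(first_text)
    if instruction_id is not None:
        # Fast path: no natural language understanding or dialogue management needed.
        return instruction_id, "preset"
    # Fallback: understand the first text, then combine it with the stored context.
    first_structured = nlu(first_text)
    return dialogue_manager(first_structured, second_structured), "nlu"
```

The lookup is a plain dictionary access, so the latency saving in this sketch comes entirely from skipping the two heavier stages when the utterance is one of the expected follow-ups.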
In an optional implementation, after the control instruction is determined according to the first structured information and the second structured information, the method further includes: when the control instruction is invalid, updating the second structured information according to the first structured information.
It should be understood that, in the above technical solution, when the control instruction is invalid, the stored second structured information (that is, the stored historical structured information) can be updated according to the first structured information. This ensures that the currently stored historical structured information is the most recent, supports correct operation of the system, and helps the system make a correct judgment when a new voice signal is received.
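A minimal sketch of this bookkeeping, assuming the validity check happens elsewhere and arrives as a boolean. The `DialogueHistory` class is hypothetical, and modelling "update" as outright replacement is an assumption; the text does not say how the stored record is merged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DialogueHistory:
    structured: Optional[dict] = None   # the stored (second) structured information


def finish_turn(history: DialogueHistory,
                first_structured: dict,
                instruction_valid: bool) -> None:
    """Refresh the stored history when the combined instruction is invalid."""
    if not instruction_valid:
        # Assumption: the new structured information replaces the old record.
        history.structured = dict(first_structured)
```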
In an optional implementation, determining the first text information having complete semantics according to the first voice signal includes: determining, according to the first voice signal, M characters corresponding to the first voice signal, where M is a positive integer; inputting text information composed of the M characters into a first preset model to obtain an output result of the first preset model, where the first preset model is used to judge whether text information composed of a plurality of input characters has complete semantics; and generating the first text information according to the text information composed of the M characters and the output result of the first preset model.
In an optional implementation, the first preset model is determined by the following steps: acquiring a first training set that includes a plurality of pieces of first training data, where each piece of first training data includes first training text information and a first label, the first training text information consists of one or more characters, and the first label indicates whether the first training text information has complete semantics; and performing first model training one or more times according to the plurality of pieces of first training data and a first training model until a first output result of the first training model meets a first preset condition, and determining the first training model whose first output result meets the first preset condition as the first preset model. The first model training includes: inputting the plurality of pieces of first training data into the first training model to obtain the first output result; and updating model parameters in the first training model according to the first output result to obtain the first training model with updated model parameters.
It should be understood that, in the above technical solution, the first preset model is set in advance, where the first preset model is a relatively accurate classification model trained on a plurality of pieces of historical training data. When the M characters corresponding to the first voice signal are determined according to the first voice signal, the text information composed of the M characters can be input into the first preset model to determine whether the M characters currently corresponding to the first voice signal have complete semantics. This helps obtain a more accurate judgment result and, in turn, more accurate first text information having complete semantics.
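The train-until-a-condition-is-met loop can be sketched with a deliberately simple stand-in. A perceptron over whitespace-separated English tokens is used here only because it fits in a few lines; the application does not name a model type, and the accuracy threshold playing the role of the "first preset condition", the toy data, and all function names are assumptions.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

TrainingData = List[Tuple[str, int]]   # (training text, label: 1 = complete, 0 = incomplete)
Weights = DefaultDict[str, float]


def is_complete(weights: Weights, text: str) -> bool:
    """Model output: does the text look semantically complete?"""
    score = weights["<bias>"] + sum(weights[token] for token in text.split())
    return score > 0


def train_round(weights: Weights, data: TrainingData) -> float:
    """One round of model training: feed the data, update parameters, report accuracy."""
    for text, label in data:
        delta = label - (1 if is_complete(weights, text) else 0)
        if delta:
            weights["<bias>"] += delta
            for token in text.split():
                weights[token] += delta
    correct = sum((1 if is_complete(weights, t) else 0) == y for t, y in data)
    return correct / len(data)


def train(data: TrainingData, required_accuracy: float = 1.0, max_rounds: int = 50) -> Weights:
    """Repeat training rounds until the (assumed) preset condition on accuracy is met."""
    weights: Weights = defaultdict(float)
    for _ in range(max_rounds):
        if train_round(weights, data) >= required_accuracy:
            break
    return weights


# Toy labelled data: partial utterances are incomplete, full commands are complete.
DATA: TrainingData = [
    ("open", 0), ("open the", 0), ("open the window", 1),
    ("stop", 1), ("close the", 0), ("close the window", 1),
]
```

A production classifier would be trained on far more data and evaluated on held-out examples; the `max_rounds` cap is there so that the loop terminates even when the condition cannot be met.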
In an optional implementation, before the target device is controlled to switch among the specified operating states according to the first text information and the second text information, the method further includes: inputting the first text information and historical text information into a second preset model to obtain an output result of the second preset model, where the second preset model is used to judge whether two pieces of input text information have a contextual relationship; and determining the historical text information as the second text information according to the output result of the second preset model.
In an optional implementation, the second preset model is determined by the following steps: acquiring a second training set that includes a plurality of pieces of second training data, where each piece of second training data includes two pieces of second training text information and a second label, and the second label indicates whether the two pieces of second training text information have a contextual relationship; and performing second model training one or more times according to the plurality of pieces of second training data and a second training model until a second output result of the second training model meets a second preset condition, and determining the second training model whose second output result meets the second preset condition as the second preset model. The second model training includes: inputting the plurality of pieces of second training data into the second training model to obtain the second output result; and updating model parameters in the second training model according to the second output result to obtain the second training model with updated model parameters.
It should be understood that, in the above technical solution, the second preset model is set in advance, where the second preset model is a relatively accurate classification model trained on a plurality of pieces of historical training data. When the M characters corresponding to the first voice signal have complete semantics, that is, when the M characters form the first text information, the first text information and the currently stored historical text information can be input into the second preset model, and whether the historical text information is the preceding context of the first text information is determined according to the output result of the second preset model. This helps obtain a more accurate judgment result.
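How the output of the second preset model might gate the choice of second text information is sketched below. The classifier is represented by any callable that returns a score; `toy_context_model` is a hand-written rule standing in for a trained model, and the 0.5 threshold is an assumption.

```python
from typing import Callable, Optional


def select_second_text(first_text: str,
                       history_text: Optional[str],
                       context_model: Callable[[str, str], float],
                       threshold: float = 0.5) -> Optional[str]:
    """Return the stored historical text if the model judges it to be the preceding context."""
    if history_text is None:
        return None
    if context_model(history_text, first_text) >= threshold:
        return history_text
    return None


def toy_context_model(earlier: str, later: str) -> float:
    """Stand-in for the trained classifier: a fixed rule, not a learned model."""
    follow_ups = {"stop", "a bit more", "that is enough"}
    return 1.0 if "window" in earlier and later in follow_ups else 0.0
```

Returning `None` when no context is found lets the caller fall back to treating the first text information as a standalone utterance.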
In a second aspect, the present application provides a voice control apparatus. The apparatus includes: a processing module, configured to determine, according to a first voice signal, first text information having complete semantics; and a control module, configured to control, according to the first text information and second text information, a target device to switch among specified operating states. For example, the specified operating states of the target device may include at least a first operating state and a second operating state. The second text information is acquired before the first text information and has a contextual relationship with the first text information. The second text information is used to control the target device to enter the first operating state among the specified operating states, and the first text information is used to control the target device to switch from the first operating state to the second operating state.
In an optional implementation, the contextual relationship between the second text information and the first text information includes at least one or more of the following: the second text information and the first text information correspond to the same target device; the execution action corresponding to the second text information and the execution action corresponding to the first text information are of the same type.
In an optional implementation, before the processing module determines the first text information having complete semantics according to the first voice signal, the processing module is further configured to: determine, according to a second voice signal, second text information having complete semantics; and perform natural language understanding on the second text information to obtain second structured information. The control module is further configured to control, according to the second structured information, the target device to enter the first operating state.
In an optional implementation, the control module is specifically configured to: determine, according to the second structured information corresponding to the second text information, a preset set corresponding to the second structured information, where the preset set includes correspondences between one or more pieces of preset text information and preset instruction identifiers; and when the one or more pieces of preset text information include the first text information, determine a control instruction according to the preset instruction identifier corresponding to the first text information, where the control instruction is used to control the target device to switch from the first operating state to the second operating state among the specified operating states.
In an optional implementation, the control module is further configured to: when the first text information differs from every one of the one or more pieces of preset text information, perform natural language understanding on the first text information to obtain first structured information; and determine the control instruction according to the first structured information and the second structured information.
In an optional implementation, the control module is further configured to: after the control instruction is determined according to the first structured information and the second structured information, update the second structured information according to the first structured information when the control instruction is invalid.
In an optional implementation, the processing module is specifically configured to: determine, according to the first voice signal, M characters corresponding to the first voice signal, where M is a positive integer; input text information composed of the M characters into a first preset model to obtain an output result of the first preset model, where the first preset model is used to judge whether text information composed of a plurality of input characters has complete semantics; and generate the first text information according to the text information composed of the M characters and the output result of the first preset model.
In an optional implementation, the processing module is specifically configured to: acquire a first training set that includes a plurality of pieces of first training data, where each piece of first training data includes first training text information and a first label, the first training text information consists of one or more characters, and the first label indicates whether the first training text information has complete semantics; and perform first model training one or more times according to the plurality of pieces of first training data and a first training model until a first output result of the first training model meets a first preset condition, and determine the first training model whose first output result meets the first preset condition as the first preset model. The first model training includes: inputting the plurality of pieces of first training data into the first training model to obtain the first output result; and updating model parameters in the first training model according to the first output result to obtain the first training model with updated model parameters.
In an optional implementation, before the control module controls the target device to switch among the specified operating states according to the first text information and the second text information, the processing module is further configured to: input the first text information and historical text information into a second preset model to obtain an output result of the second preset model, where the second preset model is used to judge whether two pieces of input text information have a contextual relationship; and determine the historical text information as the second text information according to the output result of the second preset model.
In an optional implementation, the processing module is specifically configured to: acquire a second training set that includes a plurality of pieces of second training data, where each piece of second training data includes two pieces of second training text information and a second label, and the second label indicates whether the two pieces of second training text information have a contextual relationship; and perform second model training one or more times according to the plurality of pieces of second training data and a second training model until a second output result of the second training model meets a second preset condition, and determine the second training model whose second output result meets the second preset condition as the second preset model. The second model training includes: inputting the plurality of pieces of second training data into the second training model to obtain the second output result; and updating model parameters in the second training model according to the second output result to obtain the second training model with updated model parameters.
In a third aspect, the present application provides a computing device including a processor. The processor is connected to a memory, the memory stores a computer program, and the processor is configured to execute the computer program stored in the memory, so that the computing device performs the method in the first aspect or in any possible implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium on which a computer program or instruction is stored. When the computer program or instruction is executed, a computer is caused to perform the method in the first aspect or in any possible implementation of the first aspect.
In a fifth aspect, the present application provides a computer program product. When a computer reads and executes the computer program product, the computer is caused to perform the method in the first aspect or in any possible implementation of the first aspect.
In a sixth aspect, the present application provides a chip. The chip is connected to a memory and is configured to read and execute a software program stored in the memory, so as to implement the method in the first aspect or in any possible implementation of the first aspect.
It should be understood that, in the technical solutions of the first to sixth aspects, the voice control apparatus can acquire the second voice signal, determine the second text information according to the second voice signal, perform natural language understanding on the second text information to obtain the second structured information, and then control the target device to enter the first operating state among the specified operating states according to the second structured information. While the target device is in the first operating state, the voice control apparatus can also perform streaming speech recognition on the first voice signal as it is continuously acquired, obtaining the corresponding M characters. When the text information composed of the M characters is determined to have complete semantics, those M characters form the first text information. There is no need to wait for a silence period after the user finishes issuing the voice signal; instead, once text information having complete semantics has been obtained, it is inferred that the user has finished issuing the voice signal, which effectively reduces control delay.
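The early end-of-utterance decision can be sketched as a loop over partial transcripts. The sketch assumes a streaming recognizer that yields growing partial texts and an `is_complete` predicate standing in for the first preset model; both are hypothetical interfaces, and English words are used in place of characters.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_complete_text(partials: Iterable[str],
                        is_complete: Callable[[str], bool],
                        ) -> Tuple[Optional[str], int]:
    """Return the first partial transcript judged complete and how many partials were read."""
    consumed = 0
    for text in partials:
        consumed += 1
        if is_complete(text):
            # Stop here instead of waiting for trailing silence.
            return text, consumed
    return None, consumed


# Hypothetical stand-in for the completeness model.
COMPLETE_COMMANDS = {"open the window", "stop"}


def toy_is_complete(text: str) -> bool:
    return text in COMPLETE_COMMANDS
```

With the partial transcripts "open", "open the", "open the window", "open the window please", this sketch stops at the third item, so the trailing word and any following silence never delay the decision.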
进一步的,根据第一文本信息和当前存储的历史文本信息确定二者之间是否具有上下文关系,在确定二者具有上下文关系的情况下,则可以确定出当前获取到的第一语音信号是用户针对于上一次的第二语音信号的进一步指示,于是可以根据第一文本信息和第二文本信息,控制处于第一运行状态的目标设备,具体的,将该目标设备的运行状态由第一运行状态切换至第二运行状态,如此,可以有效确定出用户下发的第一语音信号所指示的目标设备,并对该目标设备进行控制。Further, it is determined whether there is a contextual relationship between the two according to the first text information and the currently stored historical textual information, and when it is determined that the two have a contextual relationship, it can be determined that the currently obtained first voice signal is the user. According to the further instruction of the last second voice signal, the target device in the first running state can be controlled according to the first text information and the second text information. Specifically, the running state of the target device is changed from the first running state The state is switched to the second running state, in this way, the target device indicated by the first voice signal sent by the user can be effectively determined, and the target device can be controlled.
而且,在对目标设备进行控制时,可以确定第一文本信息是否在第二文本信息(即第二结构化信息)对应的预设集合中,当第一文本信息在该预设集合中时,可以无需对第一文本信息执行自然语言理解和对话管理,而直接根据该预设集合确定出第一文本信息对应的预设指令标识,从而有助于进一步减少控制过程中的时延。Moreover, when controlling the target device, it can be determined whether the first text information is in a preset set corresponding to the second text information (ie, the second structured information), and when the first text information is in the preset set, The preset instruction identifier corresponding to the first text information can be directly determined according to the preset set without performing natural language understanding and dialogue management on the first text information, thereby helping to further reduce the delay in the control process.
如此,本申请中通过流式语音识别技术、完整语义判定、上下文判定以及设定第二文本信息(即第二结构化信息)对应的预设集合,可以有效减少控制过程中的时延,用户可以通过下发语音信号实现更直观有效地控制目标设备,有助于提高用户体验。In this way, in this application, by using streaming speech recognition technology, complete semantic determination, context determination, and setting a preset set corresponding to the second text information (ie, the second structured information), the time delay in the control process can be effectively reduced, and the user The target device can be controlled more intuitively and effectively by sending a voice signal, which helps to improve the user experience.
Description of Drawings
FIG. 1 is a schematic diagram of the functional modules included in a voice control apparatus provided in this application;
FIG. 2 is a schematic diagram of the functional modules included in a data processing module provided in this application;
FIG. 3 shows a specific scenario to which the voice control apparatus provided in this application is applicable;
FIG. 4 is a set of schematic diagrams of a vehicle window slowly moving down, provided in this application;
FIG. 5 is a schematic diagram of the latency incurred when a first voice control apparatus provided in this application processes a voice signal;
FIG. 6 is a schematic diagram of the functional modules included in another data processing module provided in this application;
FIG. 7 is a schematic flowchart of a voice control method provided in this application;
FIG. 8 is a schematic flowchart of another voice control method provided in this application;
FIG. 9 is a schematic flowchart of the inputs and outputs of two preset models in a flow control module provided in this application;
FIG. 10 shows a voice control procedure in an in-vehicle scenario provided in this application;
FIG. 11 shows a further voice control procedure in an in-vehicle scenario provided in this application;
FIG. 12 shows another voice control procedure in an in-vehicle scenario provided in this application;
FIG. 13 is a schematic diagram of the latency incurred when a second voice control apparatus provided in this application processes a voice signal;
FIG. 14 is a schematic diagram of the latency incurred when a third voice control apparatus provided in this application processes a voice signal;
FIG. 15 is a schematic structural diagram of a voice control apparatus provided in this application;
FIG. 16 is a schematic structural diagram of another voice control apparatus provided in this application.
Detailed Description
The embodiments of this application are described in detail below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of the functional modules included in a voice control apparatus provided in this application. The voice control apparatus includes a voice acquisition module, a data processing module, and a decision module. The voice acquisition module acquires a voice signal and transmits it to the data processing module. The data processing module performs speech analysis, semantic analysis, dialog management, and the like on the voice signal to obtain a data processing result, and sends that result to the decision module. The decision module generates a control instruction from the data processing result and delivers it to the corresponding target device.
FIG. 2 is a schematic diagram of the functional modules included in a data processing module provided in this application. For example, the data processing module includes an automatic speech recognition (ASR) functional module, a natural language understanding (NLU) functional module, and a dialog management (DM) functional module. For ease of description, these are referred to below as the ASR module, the NLU module, and the DM module.
The three modules are described below.
1. The ASR module can perform speech analysis, that is, convert the voice signal input by the user into natural-language text (which may be called text information). It plays the role of the human ear.
The speech recognition flow is: “voice input → encoding (feature extraction) → decoding → text output”. For example, voice input means feeding the acquired voice signal into the ASR module. The voice signal is in fact a sound wave, and the ASR module can encode it (feature extraction). Specifically, the sound wave is split into frames (at millisecond granularity), giving a short waveform segment for each frame. Each such segment is converted into multi-dimensional vector information according to the characteristics of the human ear. From the multi-dimensional vector information, the ASR module decodes the corresponding phonemes (phones), assembles the phonemes into words, and concatenates the words into a sentence (that is, text information). The ASR module then outputs the generated text information.
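The framing step described above can be sketched as follows. This is a minimal illustration, not the source's implementation: the function name `split_into_frames` is hypothetical, and the 25 ms frame length and 10 ms hop are common speech-processing defaults that the source does not specify (it only says frames are at millisecond granularity).

```python
def split_into_frames(samples, sample_rate, frame_ms=25, hop_ms=10):
    """Split a mono waveform into overlapping fixed-length frames.

    frame_ms and hop_ms are assumed values; the source only states that
    frames are millisecond-scale.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    frames = []
    start = 0
    # Keep only frames that fit entirely inside the signal.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


# One second of silence sampled at 16 kHz.
frames = split_into_frames([0.0] * 16000, 16000)
```

Each frame would then be passed to a feature extractor to obtain the multi-dimensional vector that the decoder consumes.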
Technologies related to speech recognition mainly include:
1) Voice activity detection (VAD)
Voice activity detection may also be called voice activation detection or silence detection.
In far-field recognition scenarios the user cannot touch the device by hand, noise is relatively high, and the signal-to-noise ratio drops sharply (put simply, the signal is unclear), so VAD can be used. Its role is to determine when a voice signal is being input and when it is not (silence). Subsequent voice signal processing or speech recognition can then be performed on the valid speech segments extracted by VAD. In other words, VAD is mainly used to detect whether the user has finished inputting the voice signal.
VAD mainly includes voice VAD and semantic VAD. Voice VAD stops receiving the voice signal (also called stopping audio capture) when no voice signal input is detected within a set duration. Semantic VAD stops receiving the voice signal when the text information currently converted from the input voice signal is determined to have complete semantics.
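Voice VAD as described above can be sketched with a simple energy threshold. This is an illustrative assumption: the source does not say how speech presence is detected per frame, and the names `voice_vad_endpoint`, `energy_threshold`, and `silence_frames` are hypothetical.

```python
def voice_vad_endpoint(frame_energies, energy_threshold, silence_frames):
    """Return the index of the frame at which capture stops, or None.

    Capture stops once speech has been heard and then `silence_frames`
    consecutive frames fall below `energy_threshold` (the set duration).
    """
    quiet_run = 0
    heard_speech = False
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            quiet_run = 0
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= silence_frames:
                return i
    return None


endpoint = voice_vad_endpoint([0, 0, 5, 6, 0, 0, 0, 0], 1, 3)
```

The set duration is paid in full after every utterance, which is the latency that semantic VAD avoids by stopping as soon as the recognized text is semantically complete.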
2) Voice wake-up (voice trigger, VT)
In far-field recognition scenarios, voice wake-up is performed after VAD detects a human voice. This is equivalent to issuing a wake-up instruction to the device, which triggers subsequent speech recognition.
3) Microphone array
This is a system for sampling and processing the spatial characteristics of a sound field, composed of a certain number of acoustic sensors (usually microphones). It serves several purposes: speech enhancement, which extracts clean speech from a noisy voice signal; sound source localization, which uses the microphone array to calculate the angle and distance of the target speaker so that the speaker can be tracked and speech can subsequently be picked up directionally; dereverberation, which reduces the effect of reflected sound; and source signal extraction/separation, which extracts each of several mixed sound sources. It is mainly suited to complex environments with substantial clutter, noise, and echo, such as vehicles, outdoor settings, and supermarkets.
2. The NLU module can perform natural language understanding, or semantic analysis, that is, convert natural-language text into structured information that a machine can understand. For example, for the natural-language text “open the window”, the structured information obtained through natural language understanding may be “control-window.adjust”.
3. The DM module can perform dialog management, that is, provide the corresponding service based on the dialog state and the semantic information. Dialog management controls the course of the human-machine dialog and decides, from the dialog history, how to respond to the user. The most common application is task-driven multi-turn dialog: the user has a clear goal, such as an order query, but the requirement is complex and has many constraints, so it may need to be stated over several turns. In essence, task-driven dialog management is a decision process. During the dialog, the system repeatedly decides, from the current state, the best next action (for example, providing a result, asking about a specific constraint, or clarifying or confirming the requirement), so as to help the user obtain the information or service as effectively as possible.
In addition, the data processing module may further include a natural language generation (NLG) functional module and a text-to-speech (TTS) functional module. For ease of description, these are referred to below as the NLG module and the TTS module.
The NLG module can generate natural-language text from service information.
The TTS module can turn natural-language text into an output voice signal. In contrast to the ASR module, the TTS module converts natural-language text into speech for the machine to read aloud. It plays the role of the human mouth.
FIG. 3 shows a specific scenario to which the voice control apparatus provided in this application is applicable. The scenario may be an in-vehicle scenario, in which the user can issue control instructions through the voice control apparatus to an in-vehicle device (which may be called the target device, for example a window, an in-vehicle speaker, a seat, or an air conditioner). For example, in FIG. 3 the user says “open the window” (that is, the user issues the voice signal “open the window”). After receiving the voice signal, the voice control apparatus processes it through the ASR module, NLU module, DM module, and so on shown in FIG. 2 to obtain a control instruction for the window, and then controls the window to move down slowly according to that instruction.
The user can also issue control instructions to other in-vehicle devices through the voice control apparatus. For example, if the user says “raise the seat”, the voice control apparatus responds to the voice signal by controlling the seat to rise slowly. If the user says “turn down the air conditioner fan”, the voice control apparatus responds by controlling the fan strength to decrease gradually.
The voice control apparatus provided in this application is also applicable to other scenarios. In a home scenario, for example, the user can issue control instructions through the voice control apparatus to a home device (which may be called the target device, for example a robot vacuum cleaner, a desk lamp, or a curtain). For example, if the user says “open the curtain”, the voice control apparatus responds to the voice signal by controlling the curtain to open slowly; if the user says “brighten the desk lamp”, the voice control apparatus responds by controlling the desk lamp to increase its brightness gradually.
Note that, based on a control instruction, the target device may remain in the corresponding operating state for a preset period. In the window-opening example, moving the window down from fully closed to fully open takes roughly 3 to 4 seconds. FIG. 4 is a set of schematic diagrams of a vehicle window slowly moving down, provided as an example in this application, in which the thick solid line represents the door and the thin dashed line represents the window. In FIG. 4(a), the window is fully closed, that is, it has not yet been opened. In FIG. 4(b), the window is partly open, specifically 40% open. In FIG. 4(c), the window is still partly open, specifically 60% open. In FIG. 4(d), the window is fully open, that is, 100% open. Moving the window down from the state in FIG. 4(a) to the state in FIG. 4(d) takes roughly 3 to 4 seconds.
Accordingly, during the 3 to 4 seconds after the window starts moving down in response to the control instruction, the window is in the operating state of slowly moving down. In this operating state, the user can see directly how far the window is open and, as needed, issue another control instruction to the window through the voice control apparatus, such as a window stop instruction, so that the window stays at the position the user wants.
For example, when the window has moved down to the position shown in FIG. 4(c), the user judges that the current position is suitable and issues a stop instruction to the window through the voice control apparatus, for instance by saying “stop”. After receiving this voice signal, the voice control apparatus processes it through the ASR module, NLU module, and DM module shown in FIG. 2 to obtain a control instruction for the window, such as “window stop”, and controls the window to stop moving down according to that instruction.
In this application, the user's control of a target device that is in an operating state may be called process control. For example, the control of the window while it is moving down, described above, may be called process control of the window. The description above also applies when the target device is another in-vehicle device, such as a seat or an air conditioner, and when the target device belongs to another scenario, such as a robot vacuum cleaner, a curtain, or a desk lamp in a home scenario.
It should be added that the target device can also be regarded as having designated operating states, which include at least two operating states, called the first operating state and the second operating state. The first operating state may be the operating state that the target device is in as a result of the voice signal (or control instruction) that the user issues the first time, for example the state in which the window is moving down or the state in which the seat is slowly rising. The second operating state may be the operating state that the target device is in as a result of the voice signal (or control instruction) that the user issues the second time, for example the state in which the window has stopped moving down or the state in which the seat has stopped rising.
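The two designated operating states can be pictured as a small state machine. The sketch below uses the window example; the enum members, the instruction names `"open_window"` and `"stop"`, and the `TRANSITIONS` table are hypothetical illustrations, not identifiers from the source.

```python
from enum import Enum


class WindowState(Enum):
    CLOSED = "closed"
    MOVING_DOWN = "moving_down"  # first operating state
    STOPPED = "stopped"          # second operating state


# Hypothetical transition table: (current state, instruction) -> next state.
TRANSITIONS = {
    (WindowState.CLOSED, "open_window"): WindowState.MOVING_DOWN,
    (WindowState.MOVING_DOWN, "stop"): WindowState.STOPPED,
}


def apply_instruction(state, instruction):
    """Return the next state, or the unchanged state if no transition applies."""
    return TRANSITIONS.get((state, instruction), state)


state = apply_instruction(WindowState.CLOSED, "open_window")  # first instruction
state = apply_instruction(state, "stop")                      # process control
```

Process control corresponds to the second transition: it is only meaningful while the device is still in the first operating state.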
When the user issues a voice signal to the voice control apparatus, the apparatus must first determine that the user has finished issuing the voice signal (also called the user speech or voice instruction). Only then can it perform speech recognition and semantic analysis on the entire acquired voice signal to obtain a control instruction.
For example, a silence duration (trailing silence) may be set. When the voice control apparatus determines that the time during which no voice signal has been received has reached the silence duration, it determines that the user has finished issuing the voice signal. The voice control apparatus then processes the entire acquired voice signal through the ASR module, NLU module, and DM module shown in FIG. 2 to obtain the control instruction.
FIG. 5 is a schematic diagram, provided as an example in this application, of the latency incurred when a first voice control apparatus processes a voice signal. The latency comprises the silence duration, the ASR module processing time, the NLU module processing time, and the DM module processing time. The latency from the moment the voice control apparatus receives the voice signal to the moment it generates the control instruction is therefore relatively long.
A long latency means the target device cannot be controlled in time. In process control in particular, the user cannot control the target device intuitively and effectively through the voice control apparatus. For example, when the window has moved down to 60%, the user judges that the current position is suitable and says “stop”. There may be a delay, for example 1 second (s), between the user saying “stop” and the window actually stopping, by which time the window may already have moved down to 80%. The window then ends up at a position the user did not want.
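The effect of latency on the final window position can be worked through numerically. The sketch below assumes the window moves at constant speed, which the source does not state; the function names and the per-stage latency values in the usage line are hypothetical.

```python
def control_delay_s(silence_s, asr_s, nlu_s, dm_s):
    """Total latency as decomposed in FIG. 5: silence + ASR + NLU + DM."""
    return silence_s + asr_s + nlu_s + dm_s


def final_position(requested_pct, full_travel_s, delay_s):
    """Window opening (in percent) when it actually stops.

    Assumes constant speed: the window covers 100% in full_travel_s seconds.
    """
    overshoot = 100.0 * delay_s / full_travel_s
    return min(100.0, requested_pct + overshoot)


# Hypothetical per-stage latencies, in seconds.
delay = control_delay_s(0.5, 0.2, 0.2, 0.1)
position = final_position(60, 4, delay)
```

Under this linear assumption, a 1 s delay adds 25 percentage points for a 4 s full travel and about 33 for a 3 s full travel; the source's illustrative figure of 80% would correspond to a full travel of 5 s.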
Based on this, this application provides a voice control method for reducing control latency in the voice control process.
To better explain the voice control method in this application, the data processing module in this application is first described further below.
FIG. 6 shows a data processing module provided as an example in this application. Compared with the data processing module in FIG. 2, a flow control module and a fast matching module are added. The flow control module receives text information from the ASR module and determines whether to send that text information to the fast matching module. When the flow control module sends the text information to the fast matching module, the fast matching module can determine a preset instruction identifier from the preset set and, from that identifier, determine the control instruction to send to the target device. If the fast matching module cannot determine a preset instruction identifier from the preset set, the corresponding control instruction can instead be generated by the NLU module and the DM module and sent to the target device. For the specific implementation, see the description in the method embodiments below.
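The fast matching path with its fallback can be sketched as a dictionary lookup. This is a minimal sketch under stated assumptions: the source does not specify how a preset set is stored, so the dictionary layout, the identifier `"WINDOW_STOP"`, and the functions `route_text` and `nlu_dm_stub` are hypothetical; only the structured information string `control-window.adjust` comes from the source.

```python
# Hypothetical layout: one preset set per second structured information,
# mapping first text information to a preset instruction identifier.
PRESET_SETS = {
    "control-window.adjust": {"停": "WINDOW_STOP", "就调到这": "WINDOW_STOP"},
}


def nlu_dm_stub(text):
    """Stand-in for the slower NLU + DM path."""
    return "NLU_DM_RESULT_FOR:" + text


def route_text(first_text, second_structured_info, slow_path):
    """Try the preset set first; fall back to NLU/DM when there is no match."""
    preset_set = PRESET_SETS.get(second_structured_info, {})
    instruction_id = preset_set.get(first_text)
    if instruction_id is not None:
        return "fast", instruction_id
    return "slow", slow_path(first_text)


hit = route_text("停", "control-window.adjust", nlu_dm_stub)
miss = route_text("再开一点", "control-window.adjust", nlu_dm_stub)
```

A match skips the NLU and DM processing times entirely, which is where the latency saving comes from.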
In the embodiments of this application, the voice signal that the user issues the first time is called the second voice signal. The text information that the voice control apparatus obtains from the second voice signal is called the second text information, and the control instruction generated from the second text information is called the second control instruction. The second control instruction controls the target device to enter the first operating state.
The voice signal that the user issues the second time is called the first voice signal; it is the voice signal by which the user performs process control on the target device. The text information that the voice control apparatus obtains from the first voice signal is called the first text information, and the control instruction generated from the first text information is called the first control instruction. The first control instruction controls the target device to switch from the first operating state to the second operating state.
FIG. 7 is a schematic flowchart of a voice control method provided as an example in this application. In this procedure:
Step 701: The voice control apparatus determines, from the first voice signal, first text information that has complete semantics.
The voice control apparatus can recognize the received voice signal by streaming speech recognition. In this approach the voice control apparatus does not wait for the silence duration; it starts speech recognition as soon as it begins receiving the user's voice signal.
Case 1: the first voice signal issued by the user is a single character.
For example, the first voice signal issued by the user is “停” (“stop”), and the user needs some time, for example 0.5 s, to say it. The voice control apparatus can do the following: receive the voice signal “停” and convert it into the text information “停”.
Case 2: the first voice signal issued by the user consists of multiple characters.
For example, the first voice signal issued by the user is “就调到这” (“set it right here”), and the user needs some time, for example 2 s, to say all four characters. The voice control apparatus can do the following:
Time T1: receive the voice signal “就”, convert it into the character “就”, and thereby generate the text information “就”.
Time T2: receive the voice signal “调”, convert it into the character “调”, and combine it with the text information “就” generated at T1 to generate the text information “就调”.
Time T3: receive the voice signal “到”, convert it into the character “到”, and combine it with the text information “就调” from T2 to generate the text information “就调到”.
Time T4: receive the voice signal “这”, convert it into the character “这”, and combine it with the text information “就调到” from T3 to generate the text information “就调到这”.
In Case 1, the text information recognized by the voice control apparatus has complete semantics. In Case 2, the voice control apparatus performs speech recognition from T1 to T3, but the text information recognized at those times does not have complete semantics; only the text information obtained at T4 does. The voice control apparatus therefore needs to determine whether the recognized text information has complete semantics. Here, text information having complete semantics means that the voice control apparatus can determine the corresponding structured information or control instruction from that text information.
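The T1 to T4 accumulation and the completeness check can be sketched as a loop over recognized characters. This is an illustrative sketch: `stream_until_complete` is a hypothetical name, and the set lookup used as `is_complete` merely stands in for whatever completeness test the apparatus applies.

```python
def stream_until_complete(characters, is_complete):
    """Accumulate recognized characters; return the text once it is complete.

    Returns None if the stream ends before any prefix is judged complete.
    """
    text = ""
    for ch in characters:
        text += ch
        if is_complete(text):
            # No trailing silence is awaited: stop as soon as the text is complete.
            return text
    return None


# Stand-in completeness test: membership in a small hypothetical set.
COMPLETE = {"停", "就调到这"}
result = stream_until_complete("就调到这", lambda t: t in COMPLETE)
```

In Case 2 the check fails at T1, T2, and T3 and succeeds at T4, at which point the accumulated text is returned as the first text information.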
In an optional implementation, a classification model may be preset to identify whether text information has complete semantics. This classification model may be called the first preset model. The input of the first preset model is the text information that the voice control apparatus obtains by streaming speech recognition (that is, the one or more characters contained in the text information), and its output is first indication information, which indicates whether the text information has complete semantics.
For example, the first indication information may be a preset bit. When the preset bit is 1, the input text information has complete semantics; when the preset bit is 0, the input text information does not have complete semantics.
In an optional implementation, the first preset model may be obtained by training as follows:
A first training set is prepared in advance. The first training set includes a plurality of first training data, each of which includes first training text information and a first label. The first training text information includes one or more characters, and the first label indicates whether the first training text information has complete semantics.
For example, the first label may be marked manually in advance or marked automatically during machine learning. The first label may use a preset bit to indicate whether the corresponding first training text information has complete semantics. When the preset bit is 1, the corresponding first training text information has complete semantics; when the preset bit is 0, it does not.
Table 1 shows examples of first training data in the first training set provided in this application.
For example, one first training data includes the first training text information “就” and the first label “0”, where the first label “0” indicates that the first training text information “就” does not have complete semantics.
As another example, one first training data includes the first training text information “停” and the first label “1”, where the first label “1” indicates that the first training text information “停” has complete semantics.
Table 1
First training text information | First label | First training text information | First label
就 ("just") | 0 | 播 ("play") | 0
就、调 ("just adjust") | 0 | 播、放 ("play") | 0
就、调、到 ("just adjust to") | 0 | 播、放、音 ("play mu-") | 0
就、调、到、这 ("just adjust to here") | 1 | 播、放、音、乐 ("play music") | 1
停 ("stop") | 1 | stop | 1
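The rows of Table 1 look like the character prefixes of a few complete utterances. As a minimal sketch of how such first training data could be generated automatically, the helper below labels every proper prefix 0 and the full utterance 1. The function name and the labelling rule are assumptions made for illustration; the application only states that labels may be manual or automatic.

```python
def prefix_training_data(complete_utterances):
    """Return (text, label) pairs for every character prefix of each utterance.

    Assumption: a prefix is labelled 1 only when it equals the full utterance.
    This reproduces the pattern of Table 1 but would mislabel a prefix that
    happens to be a complete command on its own.
    """
    data = []
    for utterance in complete_utterances:
        for end in range(1, len(utterance) + 1):
            label = 1 if end == len(utterance) else 0
            data.append((utterance[:end], label))
    return data


first_training_set = prefix_training_data(["就调到这", "播放音乐", "停"])
```

A real training set would likely also need manually checked labels for prefixes such as "播放" that may be complete in some products.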
Further, a first training model may be trained one or more times (each round may be referred to as first model training) on the plurality of first training data in the first training set, and the resulting trained model is used as the first preset model.
For example, in each round of first model training, the plurality of first training data in the first training set may be input into the first training model to obtain its output (referred to as the first output result). The first output result is, for example, a determination of whether the first training text information in each first training data has complete semantics. A model update parameter, such as a gradient parameter, is determined according to the first output result and the first label in each first training data, and the current first training model is updated according to this model update parameter.
The next round of first model training is performed on the updated first training model, and these operations are repeated until the first output result satisfies a first preset condition.
For example, the output accuracy of the first training model may be determined from the first output result. If there are 1000 first training data in total and the first output result is correct for 900 of them, the output accuracy is 90%. Accordingly, the first preset condition may be that the output accuracy is greater than a preset accuracy. When the output accuracy of the first output result is greater than the preset accuracy, the first training model may be considered fully trained and may be used as the first preset model.
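The training procedure above can be sketched as a loop that updates the model and stops once the output accuracy exceeds the preset accuracy. The sketch below is only an illustration: it swaps in a simple perceptron over character counts as the "first training model", uses the rows of Table 1 as data, and all function names, the 0.9 threshold and the round limit are assumptions rather than anything specified by the application.

```python
from collections import defaultdict

# Rows of Table 1 as (first training text information, first label).
FIRST_TRAINING_SET = [
    ("就", 0), ("就调", 0), ("就调到", 0), ("就调到这", 1), ("停", 1),
    ("播", 0), ("播放", 0), ("播放音", 0), ("播放音乐", 1), ("stop", 1),
]


def predict(weights, bias, text):
    """Return 1 (complete semantics) or 0 (incomplete) for a piece of text."""
    score = bias + sum(weights[ch] for ch in text)
    return 1 if score > 0 else 0


def output_accuracy(weights, bias, data):
    """Fraction of training data whose output matches the first label."""
    correct = sum(predict(weights, bias, text) == label for text, label in data)
    return correct / len(data)


def train_first_model(data, preset_accuracy=0.9, max_rounds=200):
    """Repeat training rounds until accuracy > preset_accuracy (first preset condition)."""
    weights, bias = defaultdict(float), 0.0
    accuracy = output_accuracy(weights, bias, data)
    rounds = 0
    while accuracy <= preset_accuracy and rounds < max_rounds:
        for text, label in data:
            error = label - predict(weights, bias, text)  # -1, 0 or +1
            if error:
                # Perceptron update, standing in for a gradient-based update.
                for ch in text:
                    weights[ch] += error
                bias += error
        accuracy = output_accuracy(weights, bias, data)
        rounds += 1
    return weights, bias, accuracy


weights, bias, accuracy = train_first_model(FIRST_TRAINING_SET)
```

A production model would more plausibly be a neural text classifier evaluated on held-out data; the stopping rule is the part this sketch is meant to show.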
It should be noted that the voice control apparatus may further update the model parameters of the first preset model according to data obtained during operation, to improve the accuracy of the model.
The voice control apparatus processes the first voice signal using streaming speech recognition to obtain the text information corresponding to the first voice signal (hereinafter referred to as third text information). The third text information includes M characters, where M is a positive integer.
The voice control apparatus inputs the third text information into the first preset model, and generates the first text information according to the output of the first preset model and the third text information.
In one example, the output of the first preset model indicates that the third text information has complete semantics, and the voice control apparatus may use the third text information as the first text information. For example, the third text information "就调到这" ("adjust to here"), formed by the four characters "就", "调", "到" and "这", is input into the first preset model, and the output of the first preset model is "1". The voice control apparatus may then use the third text information "就调到这" as the first text information.
In another example, the output of the first preset model indicates that the third text information does not have complete semantics. In that case, after recognizing a new character through streaming speech recognition, the voice control apparatus may combine the new character with the M characters into new third text information and input it into the first preset model. This is repeated until the output of the first preset model indicates that the input third text information has complete semantics, and that input third text information is then used as the first text information.
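The two examples above amount to a loop that grows the third text information one character at a time and queries the first preset model after each character. A minimal sketch follows; the function names are hypothetical, and the lookup-based `toy_first_preset_model` is only a stand-in for a trained classifier.

```python
def first_text_from_stream(characters, first_preset_model):
    """Accumulate streamed characters until the model reports complete semantics.

    Returns the first text information, or None if the stream ends before
    any accumulated text is judged complete.
    """
    third_text = ""
    for ch in characters:
        third_text += ch  # new third text information = previous characters + new one
        if first_preset_model(third_text) == 1:
            return third_text
    return None


# Stand-in for the first preset model (assumption): a fixed set of complete phrases.
COMPLETE_PHRASES = {"就调到这", "播放音乐", "停"}


def toy_first_preset_model(text):
    return 1 if text in COMPLETE_PHRASES else 0


first_text = first_text_from_stream(iter("就调到这"), toy_first_preset_model)
```

Because the check runs after every recognized character, the first text information becomes available as soon as the utterance is complete, without waiting for an end-of-speech timeout.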
Step 702: The voice control apparatus controls, according to the first text information and the second text information, the target device to switch from a first operating state to a second operating state among the specified operating states.
The voice control apparatus obtains the second text information before it obtains the first text information. The second text information is used to control the target device to enter the first operating state among the specified operating states, and there is a contextual relationship between the second text information and the first text information.
It should be explained in advance that the voice control apparatus stores a session state. The session state may include the text information and/or structured information that the voice control apparatus determined from the most recently received and processed voice signal. The voice control apparatus may determine whether to generate a corresponding control instruction according to the currently received voice signal in combination with the stored session state.
In this application, the text information in the session state may be referred to as historical text information, and the structured information in the session state may be referred to as historical structured information. When the session state satisfies a third preset condition, the historical text information and the historical structured information may also be referred to as the second text information and the second structured information, respectively.
In one optional manner, there is a contextual relationship between the historical text information and the first text information. This can also be understood as the historical text information being the preceding context of the first text information, and/or the first text information being the following context of the historical text information. The relationship may satisfy any one or more of the following conditions:
Condition 1: the historical text information and the first text information both correspond to the same target device. For example, both correspond to the car window, or both correspond to the seat.
Condition 2: the action corresponding to the historical text information and the action corresponding to the first text information are of the same type. For example, if the historical text information instructs the window to move down and the first text information instructs the window to stop moving down, both correspond to the "move down" action type. As another example, if the historical text information instructs the seat to rise and the first text information instructs the seat to stop rising, both correspond to the "raise" action type.
The following examples illustrate cases in which there is a contextual relationship between the historical text information and the first text information:
(1) The historical text information is "open the car window", and the first text information is "adjust to here".
(2) The historical text information is "open the car window", and the first text information is "stop".
(3) The historical text information is "open the car window", and the first text information is "right rear window".
(4) The historical text information is "turn down the air conditioner fan speed", and the first text information is "all right".
(5) The historical text information is "move the window down", and the first text information is "stop moving the window down".
The existence of a contextual relationship between the historical text information and the first text information may be used as the third preset condition. The historical text information may instruct the target device to enter a certain operating state, and the first text information may instruct the target device to switch from that operating state to another operating state. Equivalently, the second text information instructs the target device to enter the first operating state, and the first text information instructs the target device to switch from the first operating state to the second operating state.
In another optional manner, there is no contextual relationship between the historical text information and the first text information. The historical text information may instruct one device to enter a certain operating state, while the first text information instructs another device to enter another operating state. The following examples illustrate cases in which there is no contextual relationship between the historical text information and the first text information:
(a) The historical text information is "open the car window", and the first text information is "play music".
(b) The historical text information is "open the car window", and the first text information is "turn on Bluetooth".
(c) The historical text information is "open the car window", and the first text information is "turn off the air conditioner".
The above are merely examples and do not limit the method of this application.
After determining the first text information, the voice control apparatus may determine whether there is a contextual relationship between the historical text information and the first text information. In one example, this may be determined using condition 1 and/or condition 2 above.
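A rule-based check using condition 1 and/or condition 2 could look like the sketch below. It assumes that each piece of text information has already been resolved to a target device and an action type; the `Intent` type, its field values and the `require_both` switch are hypothetical and are not defined in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    device: str       # target device, e.g. "window" or "seat"
    action_type: str  # action type, e.g. "move_down" or "raise"


def has_context_relation(history, first, require_both=False):
    """Apply condition 1 (same device) and/or condition 2 (same action type)."""
    same_device = history.device == first.device             # condition 1
    same_action = history.action_type == first.action_type   # condition 2
    if require_both:
        return same_device and same_action
    return same_device or same_action


window_down = Intent("window", "move_down")
window_down_stop = Intent("window", "move_down")  # "stop moving the window down"
play_music = Intent("speaker", "play")

related = has_context_relation(window_down, window_down_stop)
unrelated = has_context_relation(window_down, play_music)
```

When the check succeeds, the historical text information would play the role of the second text information in step 702.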
In another example, a classification model may be preset to determine whether there is a contextual relationship between two pieces of text information. This classification model may be referred to as the second preset model. The input of the second preset model is two pieces of text information, specifically the historical text information and the first text information, and its output is second indication information, which indicates whether there is a contextual relationship between the historical text information and the first text information.
For example, the second indication information may be a preset bit: a value of 1 indicates that there is a contextual relationship between the input historical text information and the first text information, and a value of 0 indicates that there is not.
In an optional implementation, the second preset model may be obtained by training in the following manner:
A second training set is prepared in advance. The second training set includes a plurality of second training data, and each second training data includes two pieces of text information and a second label, where the second label indicates whether there is a contextual relationship between the two pieces of text information. For example, the two pieces of text information are ordered.
For example, the second label may be annotated manually in advance, or generated automatically during machine learning. The second label may use a preset bit to indicate whether there is a contextual relationship between the corresponding two pieces of text information: a value of 1 indicates that there is, and a value of 0 indicates that there is not.
Table 2 shows examples of second training data in the second training set provided by this application.
For example, one second training data includes the two pieces of text information "open the car window" and "turn off the air conditioner" and the second label "0", where the second label "0" indicates that there is no contextual relationship between "open the car window" and "turn off the air conditioner".
As another example, one second training data includes the two pieces of text information "open the car window" and "adjust to here" and the second label "1", where the second label "1" indicates that there is a contextual relationship between "open the car window" and "adjust to here".
Table 2
Preceding text information | Following text information | Second label
Open the car window | Adjust to here | 1
Open the car window | Stop | 1
Open the car window | Right rear window | 1
Turn down the air conditioner fan speed | All right | 1
Open the car window | Play music | 0
Open the car window | Turn on Bluetooth | 0
Open the car window | Turn off the air conditioner | 0
Further, a second training model may be trained one or more times (each round may be referred to as second model training) on the plurality of second training data in the second training set, and the resulting trained model is used as the second preset model.
For example, in each round of second model training, the plurality of second training data in the second training set may be input into the second training model to obtain its output (referred to as the second output result). The second output result is, for example, a determination of whether there is a contextual relationship between the two pieces of text information in each second training data. A model update parameter, such as a gradient parameter, is determined according to the second output result and the second label in each second training data, and the current second training model is updated according to this model update parameter.
The next round of second model training is performed on the updated second training model, and these operations are repeated until the second output result satisfies a second preset condition.
For example, the output accuracy of the second training model may be determined from the second output result. If there are 1000 second training data in total and the second output result is correct for 900 of them, the output accuracy is 90%. Accordingly, the second preset condition may be that the output accuracy is greater than a preset accuracy. When the output accuracy of the second output result is greater than the preset accuracy, the second training model may be considered fully trained and may be used as the second preset model.
It should be noted that the voice control apparatus may further update the model parameters of the second preset model according to data obtained during operation, to improve the accuracy of the model.
In an optional implementation, the voice control apparatus inputs the historical text information and the first text information into the second preset model and determines, according to the output of the second preset model, whether there is a contextual relationship between the historical text information and the first text information, that is, whether second text information exists. The cases are described below:
Case 1: When second text information exists, the voice control apparatus determines a control instruction according to the second text information and the first text information. The control instruction is used to control the target device to switch from the first operating state to the second operating state.
How the target device enters the first operating state is explained first.
In an optional specific implementation, the voice control apparatus obtains the second text information in the same way as described above for obtaining the first text information. For example, the voice control apparatus obtains a second voice signal issued by the user and obtains, through speech recognition, the N characters corresponding to the second voice signal, where N is a positive integer. When it determines that the N characters have complete semantics, the voice control apparatus performs natural language understanding on the second text information to obtain second structured information, and then controls the target device to enter the first operating state according to the second structured information.
In addition, the second voice signal issued by the user instructs the target device to enter the first operating state, for example instructing the window to start moving down, or instructing the seat to start rising slowly. In other words, the latency requirement of the execution process corresponding to the second voice signal is less strict than that of the execution process corresponding to the first voice signal (that is, process control). The voice control apparatus may therefore also control the target device to enter the first operating state using an existing procedure, which is not limited in this application.
The voice control apparatus includes one or more pieces of second preset structured information. Each piece of second preset structured information corresponds to a preset set, and the preset set includes one or more pieces of preset text information.
In one optional implementation, in the preset set corresponding to a piece of second preset structured information, one or more pieces of preset text information may correspond to one or more preset instruction identifiers.
For example, Table 3 shows a correspondence between second preset structured information and preset sets provided by this application.
For example, in the preset set corresponding to the second preset structured information "control-window.adjust", the preset text information "停 (stop), stop, 好了 (all right), ok" corresponds to the preset instruction identifier "车窗停止" ("window stop").
As another example, in the preset set corresponding to the second preset structured information "control-chair.adjust", the preset text information "停 (stop), stop, 好了 (all right), ok" corresponds to the preset instruction identifier "座椅停止" ("seat stop").
Table 3
[Table 3 appears as an image in the original publication: Figure PCTCN2021082019-appb-000001]
In an optional specific implementation, the voice control apparatus determines, according to the second structured information and the correspondence between second preset structured information and preset sets, the preset set corresponding to the second structured information, and then determines whether the first text information is included in that preset set. If the first text information is included in the preset set, the voice control apparatus may determine the control instruction for controlling the target device according to the preset instruction identifier that corresponds to the first text information in the preset set.
Taking Table 3 as an example, the second structured information is "control-window.adjust". The voice control apparatus determines that the first text information "停" ("stop") is in the preset set corresponding to "control-window.adjust", and further determines that the preset instruction identifier corresponding to "停" is "车窗停止" ("window stop"). According to this preset instruction identifier, the voice control apparatus determines to issue a window stop instruction to the car window.
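The lookup described above maps naturally onto a nested dictionary. The sketch below encodes only the two entries quoted in the text (the full Table 3 is an image in the original); the names `PRESET_SETS` and `instruction_identifier` are hypothetical.

```python
STOP_PHRASES = ("停", "stop", "好了", "ok")

# second preset structured information -> {preset text -> preset instruction identifier}
PRESET_SETS = {
    "control-window.adjust": dict.fromkeys(STOP_PHRASES, "车窗停止"),  # "window stop"
    "control-chair.adjust": dict.fromkeys(STOP_PHRASES, "座椅停止"),   # "seat stop"
}


def instruction_identifier(second_structured_info, first_text):
    """Return the preset instruction identifier, or None if there is no match."""
    preset_set = PRESET_SETS.get(second_structured_info)
    if preset_set is None:
        return None  # no preset set for this second structured information
    return preset_set.get(first_text)


identifier = instruction_identifier("control-window.adjust", "停")
```

Because the match is a direct lookup, it avoids running natural language understanding on short follow-up commands such as "停".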
Further, in the preset set corresponding to each piece of second preset structured information, first preset structured information corresponding to the preset text information may also be set. For example, in Table 4, in the preset set corresponding to "control-window.adjust", the preset text information "停 (stop), stop, 好了 (all right), ok" corresponds to the preset instruction identifier "车窗停止" ("window stop"), and further corresponds to the first preset structured information "control-window.stop".
Table 4
[Table 4 appears as an image in the original publication: Figure PCTCN2021082019-appb-000002]
If the control instruction that the voice control apparatus determines, according to the first text information, from the preset set corresponding to the second structured information is invalid, dialog management may be performed according to the first preset structured information, for use in subsequent instruction delivery. An instruction is invalid when, for example, the voice control apparatus does not issue the control instruction, or the target device does not execute the control instruction after it has been issued.
For example, if the voice control apparatus determines that the control instruction is a "window deceleration instruction" while the window is already moving down at the minimum speed, the voice control apparatus may determine that the control instruction is invalid. Further, according to the first preset structured information "control-window.slower" corresponding to the "window deceleration instruction", the voice control apparatus may initiate a dialog, for example reminding the user that the minimum lowering speed has been reached, or asking the user whether to stop lowering the window.
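The deceleration example can be sketched as a validity check that either returns the instruction or hands the first preset structured information to dialog management. The function name, the speed parameters and the prompt wording are all assumptions for illustration; only "control-window.slower" comes from the text.

```python
def handle_window_slower(current_speed, min_speed):
    """Return (instruction, dialog) for a window deceleration request.

    If the window is already at the minimum speed, the instruction is treated
    as invalid: nothing is issued, and a dialog is started from the first
    preset structured information instead.
    """
    first_preset_structured_info = "control-window.slower"
    if current_speed <= min_speed:
        dialog = {
            "structured_info": first_preset_structured_info,
            "prompt": "The window is already at its minimum speed. Stop lowering it?",
        }
        return None, dialog
    return "window deceleration instruction", None


instruction, dialog = handle_window_slower(current_speed=1, min_speed=1)
```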
In another optional implementation, the preset set corresponding to a piece of second preset structured information may include one or more pieces of preset text information and one or more pieces of first preset structured information.
For example, Table 5 shows another correspondence between second preset structured information and preset sets provided by this application.
For example, in the preset set corresponding to the second preset structured information "control-window.adjust", the preset text information "停 (stop), stop, 好了 (all right), ok" corresponds to the first preset structured information "stop".
Table 5
[Table 5 appears as images in the original publication: Figures PCTCN2021082019-appb-000003 and PCTCN2021082019-appb-000004]
In an optional specific implementation, the voice control apparatus determines, according to the second structured information and the correspondence between second preset structured information and preset sets, the preset set corresponding to the second structured information, and then determines whether the first text information is included in that preset set. If the first text information is included in the preset set, the voice control apparatus may generate third structured information by combining the first preset structured information that corresponds to the first text information in the preset set with the second structured information, and determine the control instruction for controlling the target device according to the third structured information.
Taking Table 5 as an example, the second structured information is "control-window.adjust". The voice control apparatus determines that the first text information "停" ("stop") is in the preset set corresponding to "control-window.adjust", and further determines that the first preset structured information corresponding to "停" is "stop". According to the first preset structured information "stop" and the second structured information "control-window.adjust", the voice control apparatus generates third structured information, for example "control-window.stop", and then issues a window stop instruction to the car window according to the third structured information "control-window.stop".
If the voice control apparatus traverses all preset text information in the preset set corresponding to the second structured information and determines that the preset set does not contain the first text information, it may perform natural language understanding on the first text information to obtain first structured information, generate third structured information according to the first structured information and the second structured information, and determine the control instruction for controlling the target device according to the third structured information.
Taking Table 3 as an example, the second structured information is "control-window.adjust". The voice control apparatus determines that the first text information "就调到这" ("adjust to here") is not in the preset set corresponding to "control-window.adjust", and performs natural language understanding on it to obtain first structured information, for example "stop". The voice control apparatus then generates third structured information, for example "control-window.stop", according to the second structured information "control-window.adjust" and the first structured information "stop", and issues a window stop instruction to the car window according to the third structured information "control-window.stop".
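The two paths above (a preset-set hit, or a fallback to natural language understanding) can be sketched together. The merge rule used here, replacing the part after the last "." of the second structured information, is an assumption inferred from the "control-window.adjust" + "stop" → "control-window.stop" example; `toy_nlu` is a hypothetical stand-in for a real understanding component.

```python
# second preset structured information -> {preset text -> first preset structured information}
PRESET_ACTIONS = {
    "control-window.adjust": dict.fromkeys(("停", "stop", "好了", "ok"), "stop"),
}


def merge(second_structured_info, first_structured_info):
    """Keep the target part of the second info and swap in the new action."""
    target, _, _ = second_structured_info.rpartition(".")
    return f"{target}.{first_structured_info}"


def third_structured_info(second_structured_info, first_text, understand):
    """Use the preset set when it contains the text, otherwise fall back to NLU."""
    preset_set = PRESET_ACTIONS.get(second_structured_info, {})
    first_info = preset_set.get(first_text)
    if first_info is None:
        first_info = understand(first_text)  # natural language understanding
    return merge(second_structured_info, first_info)


def toy_nlu(text):
    """Stand-in NLU (assumption): maps a few known phrases to an action."""
    return {"就调到这": "stop", "停": "stop"}[text]


from_preset = third_structured_info("control-window.adjust", "停", toy_nlu)
from_nlu = third_structured_info("control-window.adjust", "就调到这", toy_nlu)
```

The same fallback path would apply when no preset set exists at all for the second structured information, since the lookup then simply returns nothing.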
In this embodiment of the present application, the voice control apparatus may also not include a preset set corresponding to the second structured information; that is, the second structured information is not among the one or more pieces of second preset structured information in the voice control apparatus. In this case, the voice control apparatus may perform speech understanding on the first text information to obtain first structured information, generate third structured information according to the first structured information and the second structured information, and determine, according to the third structured information, a control instruction for controlling the target device.
For example, the second structured information is "media-set.adjust" (where "media-set.adjust" is used to control the in-vehicle speaker to play music), and this second structured information is not among the multiple pieces of second preset structured information. If the first structured information is again "stop", the voice control apparatus may generate third structured information "media-set.stop" according to the second structured information "media-set.adjust" and the first structured information "stop", and then generate, according to the third structured information "media-set.stop", a stop instruction for controlling the in-vehicle speaker to stop playing music.
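The preset-set lookup and the fallback to language understanding described above can be sketched as follows. This is a minimal illustration, not the implementation of the present application: the contents of `PRESET_SETS`, the function names `merge` and `resolve`, and the rule that the third structured information keeps the intent and replaces the action after the last dot are all assumptions made for this sketch.

```python
from typing import Callable, Dict, Optional

# Hypothetical preset sets, keyed by the second structured information.
# Each one maps preset text information to first preset structured information.
PRESET_SETS: Dict[str, Dict[str, str]] = {
    "control-window.adjust": {"stop": "stop"},
}


def merge(second_structured: str, first_structured: str) -> str:
    """Assumed merge rule: keep the intent, replace the action after the last dot."""
    intent, _, _old_action = second_structured.rpartition(".")
    return f"{intent}.{first_structured}"


def resolve(first_text: str, second_structured: str,
            nlu: Callable[[str], str]) -> str:
    """Return the third structured information for a follow-up utterance."""
    preset_set: Optional[Dict[str, str]] = PRESET_SETS.get(second_structured)
    if preset_set is not None and first_text in preset_set:
        # Fast path: the text is in the preset set, so NLU is skipped.
        first_structured = preset_set[first_text]
    else:
        # Fallback: no preset set, or the text is not in it.
        first_structured = nlu(first_text)
    return merge(second_structured, first_structured)


def toy_nlu(text: str) -> str:
    """Stand-in for the NLU module; always answers "stop" in this sketch."""
    return "stop"
```

With these assumptions, `resolve("stop", "control-window.adjust", toy_nlu)` takes the fast path, while `resolve("just adjust to here", "control-window.adjust", toy_nlu)` and `resolve("stop", "media-set.adjust", toy_nlu)` both fall back to the stand-in NLU function.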
In addition, in the present application, the control instruction that the voice control apparatus determines according to the third structured information may be invalid. For example, if the second structured information is "control-window.adjust" and the first structured information is "top", the generated third structured information is, for example, "control-top-window.adjust", and the corresponding control instruction is, for example, to adjust the sunroof. Given that the preceding control instruction was to adjust the vehicle window, the voice control apparatus may determine that the generated control instruction is an invalid instruction.
The voice control apparatus may update the session state according to the newly generated third structured information. When the voice control apparatus receives a new voice signal, it determines, according to the new voice signal and the session state, whether to generate a valid control instruction. For example, if the received voice signal is "stop", the voice control apparatus may determine, according to "stop" and "control-top-window.adjust" in the session state, that the sunroof is to be stopped.
In some other possible manners, the voice control apparatus may also initiate an inquiry and interact with the user through dialogue to generate a valid control instruction. For example, when determining that the control instruction corresponding to the third structured information is invalid, it generates a query such as "Do you need to adjust the sunroof?" or "How would you like to adjust the sunroof?", and issues a sunroof stop instruction when it determines that the user wants to stop adjusting the sunroof.
It should be noted that, in the above examples, the first text information (or the first voice signal) may not indicate the target device. Based on the second text information and the first text information, which have a contextual relationship, the voice control apparatus can determine that both correspond to the same target device; that is, the target device corresponding to the first text information is the same as the target device corresponding to the second text information. For example, the second text information is "open the window", whose target device is the vehicle window, and the first text information is "stop". Although the first text information does not contain the target device, it can be determined from the second text information, which has a contextual relationship with the first text information, that the target device of the first text information is likewise the vehicle window.
In addition, the present application does not exclude the case in which the first text information (or the first voice signal) indicates the target device. For example, the second text information is "open the window" and the first text information is "stop the window"; both indicate that the target device is the vehicle window.
Case 2: When no second text information exists, the voice control apparatus performs speech understanding on the first text information to obtain first structured information, and updates the session state according to the first structured information.
In a first possible manner, a session state is stored in the voice control apparatus, meaning that historical text information and historical structured information are stored in the voice control apparatus, but the historical text information has no contextual relationship with the first text information. The voice control apparatus may perform speech understanding on the first text information to obtain first structured information, and then update the session state according to the first text information and the first structured information.
In a second possible manner, the session state in the voice control apparatus is empty, meaning that no historical text information or historical structured information is stored in the voice control apparatus. The voice control apparatus may perform speech understanding on the first text information to obtain first structured information, and then take the first text information and the first structured information as the current session state.
When the voice control apparatus receives a voice signal again, it may generate a control instruction according to the new voice signal in combination with the updated session state, or update the session state again.
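The two manners of Case 2 can be sketched as follows. The `SessionState` class and the `handle_without_context` function are hypothetical names introduced for illustration, and the sketch assumes that "updating" the session state means replacing the stored text information and structured information with those of the new utterance.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SessionState:
    """Hypothetical session state: the latest text and its structured information."""
    text: Optional[str] = None
    structured: Optional[str] = None

    def is_empty(self) -> bool:
        return self.text is None and self.structured is None


def handle_without_context(state: SessionState, first_text: str,
                           nlu: Callable[[str], str]) -> SessionState:
    """Case 2: no stored text has a contextual relationship with first_text."""
    first_structured = nlu(first_text)
    # Assumption: whether the old state was empty (second manner) or held
    # unrelated history (first manner), the new utterance becomes the state.
    return SessionState(text=first_text, structured=first_structured)
```

For instance, starting from `SessionState()` and an NLU stand-in that returns "media-set.adjust" for "play music", the returned state holds that text and that structured information, and the next voice signal would be interpreted against it.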
With reference to the data processing module in FIG. 6, the flow control module may be provided with a first preset model and a second preset model. In other words, the flow control module is configured to determine, according to the first voice signal, first text information with complete semantics, and to determine, according to the first text information and the historical text information, whether the two have a contextual relationship. The fast matching module may be provided with a preset database that includes one or more pieces of second preset structured information. In other words, the fast matching module is configured to determine whether the first text information corresponds to a preset instruction identifier.
Based on the modules in FIG. 6, another voice control method is provided; its flow is shown in FIG. 8.
Step 801: The ASR module determines third text information according to the first voice signal, where the third text information includes M characters and M is a positive integer.
Step 802: The ASR module sends the third text information to the flow control module. Correspondingly, the flow control module receives the third text information from the ASR module.
Step 803: The flow control module inputs the third text information into the first preset model and determines whether the third text information has complete semantics. If so, step 804 is performed; otherwise, the flow returns to step 801.
Step 804: The flow control module determines whether the historical text information has a contextual relationship with the first text information (that is, the third text information determined in step 803 to have complete semantics). If so, step 805 is performed; otherwise, the first text information is processed by the NLU module and the DM module.
Step 805: The flow control module sends the first text information to the fast matching module. Correspondingly, the fast matching module receives the first text information from the flow control module.
Step 806: The fast matching module determines whether the preset set corresponding to the second structured information contains a preset instruction identifier corresponding to the first text information. If so, step 807 is performed; otherwise, the first text information is processed by the NLU module and the DM module.
Step 807: The fast matching module sends the preset instruction identifier corresponding to the first text information to the decision module.
Step 808: The decision module generates a control instruction according to the preset instruction identifier corresponding to the first text information.
Step 809: The decision module sends the control instruction to the target device.
For content not described in detail in steps 801 to 809, refer to the description of the embodiment related to FIG. 7.
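The branching in steps 803, 804, and 806 can be summarized in a short routing sketch. The function `route`, its return strings, and the two predicate parameters (stand-ins for the first and second preset models) are assumptions for illustration only; the sketch reports which branch of the flow would handle a given text and does not generate a real control instruction.

```python
from typing import Callable, Dict, Optional


def route(third_text: str,
          history_text: Optional[str],
          preset_set: Dict[str, str],
          is_complete: Callable[[str], bool],
          has_context: Callable[[str, str], bool]) -> str:
    """Return a label for the branch of the FIG. 8 flow that handles the text."""
    # Step 803: incomplete text goes back to the ASR module for more characters.
    if not is_complete(third_text):
        return "back to step 801"
    first_text = third_text
    # Step 804: without related history, use the NLU and DM modules.
    if history_text is None or not has_context(history_text, first_text):
        return "NLU+DM"
    # Step 806: look the text up in the preset set of the second structured information.
    if first_text in preset_set:
        # Steps 807 to 809: the identifier is passed on to the decision module.
        return "preset:" + preset_set[first_text]
    return "NLU+DM"
```

A caller would supply `preset_set` as the mapping from preset text information to preset instruction identifiers for the current second structured information, for example `{"stop": "window stop"}`.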
For example, FIG. 9 is a schematic flowchart of the inputs and outputs of the two preset models in the flow control module. The input of the first preset model is the third text information, for example "just adjust to here", and the output of the first preset model indicates that the third text information has complete semantics. The flow control module takes the third text information as the first text information and inputs the historical text information and the first text information into the second preset model. For example, the historical text information is "open the window", and the output of the second preset model indicates that the historical text information and the first text information have a contextual relationship. For example, the first preset model and the second preset model may be obtained through self-supervised learning.
To better explain the embodiments of the present application, specific scenarios are described below.
In the in-vehicle scenario of FIG. 3, when the user issues a voice signal for the first time (that is, the second voice signal) and says "open the window", the voice control apparatus controls the window to move down slowly in response to the voice signal. Here the historical text information is "open the window" and the historical structured information is "control-window.adjust".
When the user issues a voice signal for the second time (that is, the first voice signal), there may be the following examples:
Example 1: The voice signal issued by the user the second time (that is, the first voice signal) is "stop". The voice control flow shown by way of example in FIG. 10 includes the following steps:
Step 1: The voice control apparatus determines that the text information "stop" has complete semantics.
Step 2: The voice control apparatus determines that the text information "stop" has a contextual relationship with the text information "open the window".
Step 3: The voice control apparatus determines that the preset set corresponding to the structured information "control-window.adjust" includes the text information "stop", determines that the preset instruction identifier corresponding to the text information "stop" is "window stop", and determines the window stop instruction according to the preset instruction identifier "window stop".
Example 2: The voice signal issued by the user the second time (that is, the first voice signal) is "就调到这" ("just adjust to here"). The voice control flow shown by way of example in FIG. 11 includes the following steps:
Step 1: The voice control apparatus determines that the text information "就" ("just") does not have complete semantics.
Step 2: The voice control apparatus determines that the text information "就调" ("just adjust") does not have complete semantics.
Step 3: The voice control apparatus determines that the text information "就调到" ("just adjust to") does not have complete semantics.
Step 4: The voice control apparatus determines that the text information "就调到这" ("just adjust to here") has complete semantics.
Step 5: The voice control apparatus determines that the text information "just adjust to here" has a contextual relationship with "open the window".
Step 6: The voice control apparatus determines that the preset set corresponding to the structured information "control-window.adjust" does not include the text information "just adjust to here".
Step 7: The voice control apparatus performs semantic analysis on the text information "just adjust to here" and obtains the structured information "stop".
Step 8: The voice control apparatus performs dialogue management on the structured information "stop" and the structured information "control-window.adjust" and obtains the structured information "control-window.stop".
Step 9: The voice control apparatus generates a window stop instruction according to the structured information "control-window.stop".
Example 3: The voice signal issued by the user the second time (that is, the first voice signal) is "播放音乐" ("play music"). The voice control flow shown by way of example in FIG. 12 includes the following steps:
Step 1: The voice control apparatus determines that the text information "播" (the first character of "播放音乐") does not have complete semantics.
Step 2: The voice control apparatus determines that the text information "播放" ("play") does not have complete semantics.
Step 3: The voice control apparatus determines that the text information "播放音" (the first three characters of "播放音乐") does not have complete semantics.
Step 4: The voice control apparatus determines that the text information "播放音乐" ("play music") has complete semantics.
Step 5: The voice control apparatus determines that the text information "play music" does not have a contextual relationship with "open the window".
The voice control apparatus then performs semantic analysis, dialogue management, and so on according to "play music", and updates the session state.
For content not described in detail in Examples 1 to 3, refer to the description of the embodiment related to FIG. 7.
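The character-by-character checks in Examples 2 and 3 can be sketched as a loop over growing prefixes. The set `COMPLETE_TEXTS` and the function `is_complete` are toy stand-ins for the first preset model, introduced only for this sketch; a real model would be trained rather than listed.

```python
from typing import Callable, Iterable, Optional

# Toy stand-in for the first preset model (an assumption for this sketch):
# these texts are treated as having complete semantics.
# Glosses: "stop", "just adjust to here", "play music".
COMPLETE_TEXTS = {"停", "就调到这", "播放音乐"}


def is_complete(text: str) -> bool:
    """Pretend judgment of whether the text has complete semantics."""
    return text in COMPLETE_TEXTS


def first_complete_text(characters: Iterable[str],
                        judge: Callable[[str], bool] = is_complete) -> Optional[str]:
    """Feed recognized characters one at a time and return the first
    prefix judged complete, without waiting for a silence duration."""
    prefix = ""
    for ch in characters:
        prefix += ch
        if judge(prefix):
            return prefix
    return None  # no complete text yet; keep listening
```

Under this toy judgment, feeding the characters of "就调到这" returns only at the fourth character, which mirrors steps 1 to 4 of Example 2.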
In the in-vehicle scenario of FIG. 3, in one specific optional manner, the vehicle window may be controlled by a motor in an in-vehicle circuit. The voice control apparatus may send a control instruction to the in-vehicle circuit, and the in-vehicle circuit switches the motor's power supply on or off according to the control instruction, thereby controlling the window. For example, in Examples 1 and 2 above, when the voice control apparatus controls the window to move down in response to the second voice signal, it may send a window-down instruction to the in-vehicle circuit; the in-vehicle circuit connects the motor's power supply according to the window-down instruction, and the motor runs so that the window moves down slowly. When the voice control apparatus controls the window to stop moving down in response to the first voice signal, it may send a window stop instruction to the in-vehicle circuit; the in-vehicle circuit disconnects the motor's power supply according to the window stop instruction, and the motor stops so that the window stops moving.
In another specific optional manner, the vehicle window may be controlled by a stepper motor in a stepping circuit. The voice control apparatus may send a stepping signal to the stepper motor, and the stepper motor controls the window according to the stepping signal. For example, in Examples 1 and 2 above, when the voice control apparatus controls the window to move down in response to the second voice signal, it may send a start-stepping signal to the stepper motor according to the window-down instruction, so that the stepping circuit operates and the window moves down slowly. When the voice control apparatus controls the window to stop moving down in response to the first voice signal, it may send a stop-stepping signal to the stepper motor, so that the stepping circuit stops operating and the window stops moving.
In the above technical solution, the voice control apparatus may obtain the second voice signal, determine the second text information according to the second voice signal, perform natural language understanding on the second text information to obtain the second structured information, and then control, according to the second structured information, the target device to enter the first operating state among the specified operating states. While the target device is in the first operating state, the voice control apparatus may further, in the course of continuously acquiring the first voice signal, perform streaming speech recognition on the acquired signal to obtain the corresponding M characters. When it determines that the text information composed of the M characters has complete semantics, it takes that text information as the first text information. It does not need to wait for a silence duration after the user finishes issuing the voice signal; instead, once it determines that text information with complete semantics has been acquired, it infers that the user has finished issuing the voice signal, which effectively reduces the control latency.
Further, whether the first text information and the currently stored historical text information have a contextual relationship is determined. If they do, the currently acquired first voice signal can be determined to be the user's further instruction following the previous second voice signal. The target device in the first operating state can then be controlled according to the first text information and the second text information; specifically, the operating state of the target device is switched from the first operating state to the second operating state. In this way, the target device indicated by the first voice signal issued by the user can be effectively determined and controlled.
Moreover, when the target device is controlled, it may be determined whether the first text information is in the preset set corresponding to the second text information (that is, the second structured information). When the first text information is in the preset set, the preset instruction identifier corresponding to the first text information can be determined directly from the preset set, without performing natural language understanding and dialogue management on the first text information, which helps further reduce the latency in the control process.
In this way, through streaming speech recognition, complete-semantics determination, context determination, and a preset set configured for the second text information (that is, the second structured information), the present application can effectively reduce the latency in the control process. The user can control the target device more intuitively and effectively by issuing voice signals, which helps improve the user experience.
Based on the voice control method in the present application, the latency incurred when the voice control apparatus processes a voice signal can be reduced.
FIG. 13 is a schematic diagram, provided by way of example in the present application, of a second case of the latency incurred when the voice control apparatus processes a voice signal. From the moment it receives the voice signal, the voice control apparatus performs speech recognition. When the preset instruction identifier corresponding to the first text information cannot be determined from the preset set corresponding to the second structured information, the corresponding control instruction is obtained after processing by the NLU module and the DM module and is delivered to the target device. Compared with the latency diagram shown in FIG. 5, the method of the present application at least avoids the latency caused by the voice control apparatus waiting for the silence duration.
FIG. 14 is a schematic diagram, provided by way of example in the present application, of a third case of the latency incurred when the voice control apparatus processes a voice signal. From the moment it receives the voice signal, the voice control apparatus performs speech recognition, determines the preset instruction identifier corresponding to the first text information from the preset set corresponding to the second structured information, obtains the corresponding control instruction, and delivers it to the target device. Compared with the latency diagram shown in FIG. 5, the method of the present application avoids not only the latency caused by the voice control apparatus waiting for the silence duration but also the latency caused by processing in the NLU module and the DM module.
The embodiments described herein may be independent solutions or may be combined according to their internal logic; all of these solutions fall within the protection scope of the present application.
It can be understood that, in the foregoing method embodiments, the methods and operations implemented by the voice control apparatus may also be implemented by a component (for example, a chip or a circuit) that can be used in the voice control apparatus.
The division into modules in the embodiments of the present application is illustrative and is merely a division by logical function; other divisions are possible in actual implementation. In addition, the functional modules in the embodiments of the present application may be integrated into one processor, may exist physically on their own, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
Based on the above content and the same concept, FIG. 15 and FIG. 16 are schematic structural diagrams of possible voice control apparatuses provided in the present application. These voice control apparatuses can be used to implement the functions of the voice control apparatus in the foregoing method embodiments, and can therefore also achieve the beneficial effects of those embodiments.
As shown in FIG. 15, the voice control apparatus includes a processing module 1501 and a control module 1502. In one optional implementation, the processing module 1501 may be configured to perform step 701 in the method embodiment shown by way of example in FIG. 7, and the control module 1502 may be configured to perform step 702 in that method embodiment. In another optional implementation, the processing module 1501 may be configured to perform steps 801 to 805 in the method embodiment shown by way of example in FIG. 8, and the control module 1502 may be configured to perform steps 806 to 809 in that method embodiment.
In one optional implementation, the processing module 1501 is configured to determine, according to the first voice signal, first text information with complete semantics. The control module 1502 is configured to control, according to the first text information and second text information, the target device to switch among specified operating states, where the second text information is acquired before the first text information, is used to control the target device to enter a first operating state among the specified operating states, and has a contextual relationship with the first text information.
In one optional implementation, the contextual relationship between the second text information and the first text information includes at least one or more of the following: the second text information and the first text information correspond to the same target device; the action corresponding to the second text information and the action corresponding to the first text information are of the same type.
In one optional implementation, before the processing module 1501 determines the first text information with complete semantics according to the first voice signal, the processing module 1501 is further configured to: determine, according to a second voice signal, the second text information with complete semantics; and perform natural language understanding on the second text information to obtain second structured information. The control module 1502 is further configured to control, according to the second structured information, the target device to enter the first operating state.
一种可选实现方式中,控制模块1502具体用于:根据第二文本信息对应的第二结构化信息,确定与第二结构化信息相对应的预设集合,预设集合中包括一个或多个预设文本信息和预设指令标识的对应关系;在一个或多个预设文本信息中包括第一文本信息时,根据第一文本信息对应的预设指令标识确定控制指令,其中,控制指令用于控制目标设备由指定运行状态中的第一运行状态切换至指定运行状态中的第二运行状态。In an optional implementation manner, the control module 1502 is specifically configured to: determine a preset set corresponding to the second structured information according to the second structured information corresponding to the second text information, and the preset set includes one or more The corresponding relationship between the preset text information and the preset instruction identifier; when the one or more preset text information includes the first text information, the control instruction is determined according to the preset instruction identifier corresponding to the first text information, wherein the control instruction It is used to control the target device to switch from the first operation state in the designated operation state to the second operation state in the designated operation state.
一种可选实现方式中,控制模块1502还用于:在第一文本信息与一个或多个预设文本信息中任一个预设文本信息不同时,对第一文本信息执行自然语言理解得到第一结构化信息;根据第一结构化信息和第二结构化信息确定控制指令。In an optional implementation manner, the control module 1502 is further configured to: when the first text information is different from any one of the one or more preset text information, perform natural language understanding on the first text information to obtain the first text information. a structured information; the control instruction is determined according to the first structured information and the second structured information.
In an optional implementation, the control module 1502 is further configured to: after determining the control instruction according to the first structured information and the second structured information, update the second structured information according to the first structured information if the control instruction is invalid.
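The control-module behaviour described in the preceding implementations (preset-set lookup, fallback to natural language understanding, and context update on an invalid instruction) can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `StructuredInfo` fields, the `PRESET_SETS` and `VALID_INSTRUCTIONS` tables, the instruction identifiers, and the slot-filling rule used to combine the two pieces of structured information are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class StructuredInfo:
    """Hypothetical NLU result: which device, which action."""
    device: Optional[str]
    action: Optional[str]


# Hypothetical preset sets, keyed by the second structured information.
PRESET_SETS: Dict[StructuredInfo, Dict[str, str]] = {
    StructuredInfo("window", "open"): {
        "a bit more": "WINDOW_OPEN_STEP",
        "stop": "WINDOW_STOP",
    },
}

# Hypothetical table of instructions the target device accepts.
VALID_INSTRUCTIONS = {
    ("window", "open"): "WINDOW_OPEN",
    ("window", "close"): "WINDOW_CLOSE",
}


def determine_instruction(
    first_text: str,
    second_info: StructuredInfo,
    nlu: Callable[[str], StructuredInfo],
) -> Tuple[Optional[str], StructuredInfo]:
    """Return (control instruction or None, second structured info to keep)."""
    preset = PRESET_SETS.get(second_info, {})
    if first_text in preset:
        # Fast path: exact match against a preset text, no NLU needed.
        return preset[first_text], second_info

    # Fallback: run NLU, then fill missing slots from the earlier context
    # (an assumed way of combining the two pieces of structured information).
    first_info = nlu(first_text)
    device = first_info.device or second_info.device
    action = first_info.action or second_info.action
    instruction = VALID_INSTRUCTIONS.get((device, action))
    if instruction is None:
        # Invalid instruction: the new utterance replaces the stored context.
        return None, first_info
    return instruction, second_info
```

The fast path avoids a full natural language understanding pass for short follow-up phrases, which is the apparent motivation for keeping a preset set per context.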
In an optional implementation, the processing module 1501 is specifically configured to: determine, according to the first voice signal, M characters corresponding to the first voice signal, where M is a positive integer; input the text information composed of the M characters into a first preset model to obtain an output result of the first preset model, where the first preset model is used to judge whether text information composed of multiple input characters has complete semantics; and generate the first text information according to the text information composed of the M characters and the output result of the first preset model.
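One way to picture this step is a buffer that accumulates recognized characters and emits text once a completeness judge accepts it. The sketch below assumes a streaming recognizer and a callable `is_complete` standing in for the first preset model; both names, and the choice to clear the buffer after each emitted text, are illustrative assumptions rather than details taken from the application.

```python
from typing import Callable, Iterable, Iterator


def complete_utterances(
    chars: Iterable[str],
    is_complete: Callable[[str], bool],
) -> Iterator[str]:
    """Yield each buffered text as soon as the judge reports complete semantics."""
    buffer = ""
    for ch in chars:
        buffer += ch
        # Ask the (hypothetical) first preset model about the M characters so far.
        if is_complete(buffer):
            yield buffer
            buffer = ""  # start collecting the next utterance
```

For example, with a toy judge that accepts only the strings `"open window"` and `"stop"`, feeding the character stream `"open windowstop"` yields those two texts in order.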
In an optional implementation, the processing module 1501 is specifically configured to: obtain a first training set, where the first training set includes multiple pieces of first training data, each piece of first training data includes first training text information and a first label, the first training text information consists of one or more characters, and the first label indicates whether the first training text information has complete semantics; and perform first model training one or more times according to the multiple pieces of first training data and a first training model until a first output result of the first training model meets a first preset condition, and determine the first training model whose first output result meets the first preset condition as the first preset model. The first model training includes: inputting the multiple pieces of first training data into the first training model to obtain the first output result; and updating model parameters in the first training model according to the first output result to obtain the first training model with updated model parameters.
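The train-until-condition loop described above can be illustrated with a deliberately small model. The sketch below swaps in a bag-of-words perceptron for the unspecified first training model and treats "training accuracy reaches a target" as the first preset condition; the training texts, the model type, and the condition are all assumptions chosen to keep the example self-contained.

```python
from typing import Dict, List, Tuple

# Hypothetical first training set: (training text, label), label 1 = complete semantics.
TRAINING_SET: List[Tuple[str, int]] = [
    ("open the window", 1),
    ("open the", 0),
    ("close the window", 1),
    ("close", 0),
    ("turn on the", 0),
    ("turn on the light", 1),
]


def predict(weights: Dict[str, float], bias: float, text: str) -> int:
    """Return 1 if the text is judged semantically complete, else 0."""
    score = bias + sum(weights.get(tok, 0.0) for tok in set(text.split()))
    return 1 if score > 0 else 0


def accuracy(weights: Dict[str, float], bias: float, data) -> float:
    """Fraction of training items whose prediction matches the label."""
    hits = sum(predict(weights, bias, text) == label for text, label in data)
    return hits / len(data)


def train(data, target_accuracy: float = 1.0, max_rounds: int = 100):
    """Repeat training rounds until the assumed preset condition is met."""
    weights: Dict[str, float] = {}
    bias = 0.0
    for _ in range(max_rounds):
        # One round of model training: run the data through the model and
        # update parameters wherever the output disagrees with the label.
        for text, label in data:
            error = label - predict(weights, bias, text)
            if error:
                for tok in set(text.split()):
                    weights[tok] = weights.get(tok, 0.0) + error
                bias += error
        # Preset condition (assumed here): training accuracy reaches the target.
        if accuracy(weights, bias, data) >= target_accuracy:
            break
    return weights, bias
```

Calling `train(TRAINING_SET)` returns parameters that `predict` can then use as a stand-in for the first preset model. A production system would more likely use a neural classifier and a held-out validation set for the stopping condition.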
In an optional implementation, before the control module 1502 controls the target device to switch among the specified operating states according to the first text information and the second text information, the processing module 1501 is further configured to: input the first text information and historical text information into a second preset model to obtain an output result of the second preset model, where the second preset model is used to judge whether two pieces of input text information have a contextual relationship; and determine the historical text information as the second text information according to the output result of the second preset model.
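As a rough illustration of this selection step, the sketch below scans stored historical texts and returns the one that a context judge pairs with the new text. Scanning from newest to oldest is an assumption, and `same_device` is only a keyword-based stand-in for the second preset model, loosely following the "same target device" criterion given in the claims; the keyword table is hypothetical.

```python
from typing import Callable, Optional, Sequence


def select_second_text(
    first_text: str,
    history: Sequence[str],
    has_context: Callable[[str, str], bool],
) -> Optional[str]:
    """Return the most recent historical text that has context with first_text."""
    for past_text in reversed(history):
        if has_context(past_text, first_text):
            return past_text
    return None


# Hypothetical keyword table standing in for a trained second preset model.
DEVICE_KEYWORDS = {
    "window": "window",
    "higher": "window",
    "louder": "speaker",
    "music": "speaker",
}


def same_device(past_text: str, first_text: str) -> bool:
    """Rule-based stand-in: both texts mention keywords of the same device."""
    past = {DEVICE_KEYWORDS[w] for w in past_text.split() if w in DEVICE_KEYWORDS}
    new = {DEVICE_KEYWORDS[w] for w in first_text.split() if w in DEVICE_KEYWORDS}
    return bool(past & new)
```

Passing the judge in as a callable keeps the selection logic independent of how the contextual relationship is actually decided.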
In an optional implementation, the processing module 1501 is specifically configured to: obtain a second training set, where the second training set includes multiple pieces of second training data, each piece of second training data includes two pieces of second training text information and a second label, and the second label indicates whether the two pieces of second training text information have a contextual relationship; and perform second model training one or more times according to the multiple pieces of second training data and a second training model until a second output result of the second training model meets a second preset condition, and determine the second training model whose second output result meets the second preset condition as the second preset model. The second model training includes: inputting the multiple pieces of second training data into the second training model to obtain the second output result; and updating model parameters in the second training model according to the second output result to obtain the second training model with updated model parameters.
FIG. 16 shows an apparatus provided in an embodiment of this application. The apparatus shown in FIG. 16 may be a hardware circuit implementation of the apparatus shown in FIG. 15. The apparatus is applicable to the flowcharts shown above and performs the functions of the voice control apparatus in the foregoing method embodiments.
For ease of description, FIG. 16 shows only the main components of the apparatus.
The voice control apparatus includes a processor 1610 and an interface 1630. Optionally, the voice control apparatus further includes a memory 1620. The interface 1630 is used to communicate with other devices.
The method performed by the voice control apparatus in the foregoing embodiments may be implemented by the processor 1610 calling a program stored in a memory (which may be the memory 1620 in the voice control apparatus or an external memory). That is, the voice control apparatus may include the processor 1610, and the processor 1610 calls the program in the memory to perform the method performed by the voice control apparatus in the foregoing method embodiments. The processor here may be an integrated circuit with signal processing capability, such as a CPU. The voice control apparatus may also be implemented by one or more integrated circuits configured to implement the foregoing method, for example, one or more ASICs, one or more microprocessors (DSPs), one or more FPGAs, or a combination of at least two of these integrated circuit forms. Alternatively, the foregoing implementations may be combined.
Specifically, the functions and implementation processes of the processing module 1501 and the control module 1502 in FIG. 15 may be implemented by the processor 1610 in the voice control apparatus shown in FIG. 16 calling computer-executable instructions stored in the memory 1620.
Based on the foregoing content and the same concept, this application provides a computing device, including a processor connected to a memory, where the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the computing device performs the method in the foregoing method embodiments.
Based on the foregoing content and the same concept, this application provides a computer-readable storage medium storing a computer program or instructions that, when executed, cause a computing device to perform the method in the foregoing method embodiments.
Based on the foregoing content and the same concept, this application provides a computer program product that, when read and executed by a computer, causes a computing device to perform the method in the foregoing method embodiments.
Based on the foregoing content and the same concept, this application provides a chip connected to a memory and used to read and execute a software program stored in the memory, so that a computing device performs the method in the foregoing method embodiments.
Based on the foregoing content and the same concept, an embodiment of this application provides an apparatus including a processor and an interface circuit, where the interface circuit is used to receive a program or instruction code and transmit it to the processor, and the processor runs the program or instruction code to perform the method in the foregoing method embodiments.
It can be understood that the various numerals used in the embodiments of this application are only for ease of description and distinction, and are not intended to limit the scope of the embodiments of this application. The sequence numbers of the foregoing processes do not imply an order of execution; the execution order of each process should be determined by its function and internal logic.
Obviously, those skilled in the art can make various changes and modifications to this application without departing from its protection scope. If these modifications and variations fall within the scope of the claims of this application and their technical equivalents, this application is intended to cover them as well.

Claims (23)

  1. A voice control method, comprising:
    determining, according to a first voice signal, first text information with complete semantics; and
    controlling, according to the first text information and second text information, a target device to switch among specified operating states, wherein the second text information is acquired before the first text information, the second text information is used to control the target device to enter a first operating state among the specified operating states, and the second text information has a contextual relationship with the first text information.
  2. The method according to claim 1, wherein the second text information having a contextual relationship with the first text information includes at least one or more of the following:
    the second text information and the first text information correspond to the same target device;
    the action corresponding to the second text information and the action corresponding to the first text information are of the same type.
  3. The method according to claim 1 or 2, wherein before the determining, according to the first voice signal, the first text information with complete semantics, the method further comprises:
    determining, according to a second voice signal, the second text information with complete semantics;
    performing natural language understanding on the second text information to obtain second structured information; and
    controlling, according to the second structured information, the target device to enter the first operating state.
  4. The method according to any one of claims 1 to 3, wherein the controlling, according to the first text information and the second text information, the target device to switch among the specified operating states comprises:
    determining, according to second structured information corresponding to the second text information, a preset set corresponding to the second structured information, wherein the preset set includes correspondences between one or more pieces of preset text information and preset instruction identifiers; and
    when the one or more pieces of preset text information include the first text information, determining a control instruction according to the preset instruction identifier corresponding to the first text information, wherein the control instruction is used to control the target device to switch from the first operating state among the specified operating states to a second operating state among the specified operating states.
  5. The method according to claim 4, wherein the controlling, according to the first text information and the second text information, the target device to switch among the specified operating states further comprises:
    when the first text information matches none of the one or more pieces of preset text information, performing natural language understanding on the first text information to obtain first structured information; and
    determining the control instruction according to the first structured information and the second structured information.
  6. The method according to claim 5, wherein after the determining the control instruction according to the first structured information and the second structured information, the method further comprises:
    when the control instruction is invalid, updating the second structured information according to the first structured information.
  7. The method according to any one of claims 1 to 6, wherein the determining, according to the first voice signal, the first text information with complete semantics comprises:
    determining, according to the first voice signal, M characters corresponding to the first voice signal, wherein M is a positive integer;
    inputting text information composed of the M characters into a first preset model to obtain an output result of the first preset model, wherein the first preset model is used to judge whether text information composed of multiple input characters has complete semantics; and
    generating the first text information according to the text information composed of the M characters and the output result of the first preset model.
  8. The method according to claim 7, wherein the first preset model is determined by the following steps:
    obtaining a first training set, wherein the first training set includes multiple pieces of first training data, each piece of first training data includes first training text information and a first label, the first training text information consists of one or more characters, and the first label indicates whether the first training text information has complete semantics;
    performing first model training one or more times according to the multiple pieces of first training data and a first training model until a first output result of the first training model meets a first preset condition, and determining the first training model whose first output result meets the first preset condition as the first preset model;
    wherein the first model training comprises: inputting the multiple pieces of first training data into the first training model to obtain the first output result; and updating model parameters in the first training model according to the first output result to obtain the first training model with updated model parameters.
  9. The method according to any one of claims 1 to 8, wherein before the controlling, according to the first text information and the second text information, the target device to switch among the specified operating states, the method further comprises:
    inputting the first text information and historical text information into a second preset model to obtain an output result of the second preset model, wherein the second preset model is used to judge whether two pieces of input text information have a contextual relationship; and
    determining the historical text information as the second text information according to the output result of the second preset model.
  10. The method according to claim 9, wherein the second preset model is determined by the following steps:
    obtaining a second training set, wherein the second training set includes multiple pieces of second training data, each piece of second training data includes two pieces of second training text information and a second label, and the second label indicates whether the two pieces of second training text information have a contextual relationship;
    performing second model training one or more times according to the multiple pieces of second training data and a second training model until a second output result of the second training model meets a second preset condition, and determining the second training model whose second output result meets the second preset condition as the second preset model;
    wherein the second model training comprises: inputting the multiple pieces of second training data into the second training model to obtain the second output result; and updating model parameters in the second training model according to the second output result to obtain the second training model with updated model parameters.
  11. A voice control apparatus, comprising:
    a processing module, configured to determine, according to a first voice signal, first text information with complete semantics; and
    a control module, configured to control, according to the first text information and second text information, a target device to switch among specified operating states, wherein the second text information is acquired before the first text information, the second text information is used to control the target device to enter a first operating state among the specified operating states, and the second text information has a contextual relationship with the first text information.
  12. The apparatus according to claim 11, wherein the second text information having a contextual relationship with the first text information includes at least one or more of the following:
    the second text information and the first text information correspond to the same target device;
    the action corresponding to the second text information and the action corresponding to the first text information are of the same type.
  13. The apparatus according to claim 11 or 12, wherein before the processing module determines, according to the first voice signal, the first text information with complete semantics, the processing module is further configured to:
    determine, according to a second voice signal, the second text information with complete semantics; and perform natural language understanding on the second text information to obtain second structured information;
    and the control module is further configured to:
    control, according to the second structured information, the target device to enter the first operating state.
  14. The apparatus according to any one of claims 11 to 13, wherein the control module is specifically configured to:
    determine, according to second structured information corresponding to the second text information, a preset set corresponding to the second structured information, wherein the preset set includes correspondences between one or more pieces of preset text information and preset instruction identifiers; and
    when the one or more pieces of preset text information include the first text information, determine a control instruction according to the preset instruction identifier corresponding to the first text information, wherein the control instruction is used to control the target device to switch from the first operating state among the specified operating states to a second operating state among the specified operating states.
  15. The apparatus according to claim 14, wherein the control module is further configured to:
    when the first text information matches none of the one or more pieces of preset text information, perform natural language understanding on the first text information to obtain first structured information; and
    determine the control instruction according to the first structured information and the second structured information.
  16. The apparatus according to claim 15, wherein the control module is further configured to update the second structured information according to the first structured information when the control instruction is invalid.
  17. The apparatus according to any one of claims 11 to 16, wherein the processing module is specifically configured to:
    determine, according to the first voice signal, M characters corresponding to the first voice signal, wherein M is a positive integer;
    input text information composed of the M characters into a first preset model to obtain an output result of the first preset model, wherein the first preset model is used to judge whether text information composed of multiple input characters has complete semantics; and
    generate the first text information according to the text information composed of the M characters and the output result of the first preset model.
  18. The apparatus according to claim 17, wherein the processing module is specifically configured to:
    obtain a first training set, wherein the first training set includes multiple pieces of first training data, each piece of first training data includes first training text information and a first label, the first training text information consists of one or more characters, and the first label indicates whether the first training text information has complete semantics;
    perform first model training one or more times according to the multiple pieces of first training data and a first training model until a first output result of the first training model meets a first preset condition, and determine the first training model whose first output result meets the first preset condition as the first preset model;
    wherein the first model training comprises: inputting the multiple pieces of first training data into the first training model to obtain the first output result; and updating model parameters in the first training model according to the first output result to obtain the first training model with updated model parameters.
  19. The apparatus according to any one of claims 11 to 18, wherein before the control module controls, according to the first text information and the second text information, the target device to switch among the specified operating states, the processing module is further configured to:
    input the first text information and historical text information into a second preset model to obtain an output result of the second preset model, wherein the second preset model is used to judge whether two pieces of input text information have a contextual relationship; and
    determine the historical text information as the second text information according to the output result of the second preset model.
  20. The apparatus according to claim 19, wherein the processing module is specifically configured to:
    obtain a second training set, wherein the second training set includes multiple pieces of second training data, each piece of second training data includes two pieces of second training text information and a second label, and the second label indicates whether the two pieces of second training text information have a contextual relationship;
    perform second model training one or more times according to the multiple pieces of second training data and a second training model until a second output result of the second training model meets a second preset condition, and determine the second training model whose second output result meets the second preset condition as the second preset model;
    wherein the second model training comprises: inputting the multiple pieces of second training data into the second training model to obtain the second output result; and updating model parameters in the second training model according to the second output result to obtain the second training model with updated model parameters.
  21. A computing device, comprising a processor connected to a memory, wherein the memory stores a computer program, and the processor is configured to execute the computer program stored in the memory, so that the computing device performs the method according to any one of claims 1 to 10.
  22. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program or instructions that, when executed by a computing device, cause the computing device to perform the method according to any one of claims 1 to 10.
  23. A chip, comprising at least one processor and an interface;
    the interface is configured to provide program instructions or data for the at least one processor;
    the at least one processor is configured to execute the program instructions, so that the method according to any one of claims 1 to 10 is performed.
PCT/CN2021/082019 2021-03-22 2021-03-22 Voice control method and apparatus WO2022198365A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180001481.6A CN113228167B (en) 2021-03-22 2021-03-22 Voice control method and device
PCT/CN2021/082019 WO2022198365A1 (en) 2021-03-22 2021-03-22 Voice control method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/082019 WO2022198365A1 (en) 2021-03-22 2021-03-22 Voice control method and apparatus

Publications (1)

Publication Number Publication Date
WO2022198365A1 true WO2022198365A1 (en) 2022-09-29

Family

ID=77081313

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/082019 WO2022198365A1 (en) 2021-03-22 2021-03-22 Voice control method and apparatus

Country Status (2)

Country Link
CN (1) CN113228167B (en)
WO (1) WO2022198365A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114327041B (en) * 2021-11-26 2022-09-27 北京百度网讯科技有限公司 Multi-mode interaction method and system for intelligent cabin and intelligent cabin with multi-mode interaction method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190318759A1 (en) * 2018-04-12 2019-10-17 Qualcomm Incorporated Context-based detection of end-point of utterance
CN111210824A (en) * 2018-11-21 2020-05-29 深圳绿米联创科技有限公司 Voice information processing method and device, electronic equipment and storage medium
US20210027788A1 (en) * 2019-07-23 2021-01-28 Baidu Online Network Technology (Beijing) Co., Ltd. Conversation interaction method, apparatus and computer readable storage medium
CN112382279A (en) * 2020-11-24 2021-02-19 北京百度网讯科技有限公司 Voice recognition method and device, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019101754A (en) * 2017-12-01 2019-06-24 キヤノン株式会社 Summarization device and method for controlling the same, summarization system, and program
CN108286386B (en) * 2018-01-22 2019-10-11 奇瑞汽车股份有限公司 The method and apparatus of vehicle window control
CN110219544A (en) * 2018-03-02 2019-09-10 上海博泰悦臻网络技术服务有限公司 Intelligent vehicle and its Intelligent control method for car window
CN111660773B (en) * 2020-05-29 2023-02-03 奇瑞汽车股份有限公司 Sound control window method and system applied to automobile

Also Published As

Publication number Publication date
CN113228167B (en) 2022-09-09
CN113228167A (en) 2021-08-06

Similar Documents

Publication Title
US11756563B1 (en) Multi-path calculations for device energy levels
US11720326B2 (en) Audio output control
US11669300B1 (en) Wake word detection configuration
EP4083998A1 (en) End of query detection
US20170256264A1 (en) System and Method for Performing Dual Mode Speech Recognition
US9293134B1 (en) Source-specific speech interactions
US20210134278A1 (en) Information processing device and information processing method
US20140324429A1 (en) Computer-implemented method for automatic training of a dialogue system, and dialogue system for generating semantic annotations
JPH0962293A (en) Speech recognition dialogue device and speech recognition dialogue processing method
CN112201246B (en) Intelligent control method and device based on voice, electronic equipment and storage medium
CN105793923A (en) Local and remote speech processing
US20230360650A1 (en) Response orchestrator for natural language interface
US20230298575A1 (en) Freeze Words
JP2020095121A (en) Speech recognition system, generation method for learned model, control method for speech recognition system, program, and moving body
WO2022198365A1 (en) Voice control method and apparatus
US10923122B1 (en) Pausing automatic speech recognition
KR20210042520A (en) An electronic apparatus and Method for controlling the electronic apparatus thereof
CN114495981A (en) Method, device, equipment, storage medium and product for judging voice endpoint
KR20210098250A (en) Electronic device and Method for controlling the electronic device thereof
JP3846500B2 (en) Speech recognition dialogue apparatus and speech recognition dialogue processing method
US11735172B2 (en) Flexible-format voice command
US11893996B1 (en) Supplemental content output
JP2017201348A (en) Voice interactive device, method for controlling voice interactive device, and control program
CN116959436A (en) Voice interaction method and electronic equipment
EP2760019A9 (en) Dynamic audio processing parameters with automatic speech recognition

Legal Events

Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21932000; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21932000; Country of ref document: EP; Kind code of ref document: A1)