CN115862620A - Voice instruction processing method and device, vehicle and storage medium - Google Patents
- Publication number
- CN115862620A (publication number); CN202211485669.5A (application number)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a voice instruction processing method and device, a vehicle, and a storage medium. The method comprises the following steps: collecting a voice signal; if the voice signal carries at least one current voice instruction, determining the type of the environment in which the target vehicle is located according to the voice signal; if the environment type is a multi-person voice environment, determining the voice volume of the at least one current voice instruction according to the voice signal, and determining a current voice instruction to be processed among the at least one current voice instruction based on the voice volume, wherein the voice volume of the current voice instruction to be processed is greater than or equal to a preset volume threshold; and processing the current voice instruction to be processed. By adopting this technical scheme, the embodiments of the invention enrich the processing modes available for voice instructions.
Description
Technical Field
The invention relates to the technical field of automobiles, in particular to a method and a device for processing a voice instruction, a vehicle and a storage medium.
Background
Currently, the trend toward automobile intelligence is increasingly evident. Intelligent voice interaction not only provides the driver with a more convenient mode of interaction and improves driving safety, but also frees the driver's hands and eyes; it features a low barrier to use, a low learning cost, and friendly interaction, greatly improving the user experience. Specifically, speech is captured and recognized, and the vehicle is controlled in accordance with the speaker's voice command.
However, in the prior art, the processing mode for voice commands is singular, so misrecognition of voice commands and erroneous control of the vehicle easily occur.
Disclosure of Invention
The invention provides a method and a device for processing a voice instruction, a vehicle and a storage medium, which are used for enriching the processing mode of the voice instruction.
According to an aspect of the present invention, there is provided a method for processing a voice instruction, including:
collecting voice signals;
if the voice signal carries at least one current voice instruction, determining the type of the environment where the target vehicle is located according to the voice signal;
if the environment type is a multi-person voice environment, determining the voice volume of the at least one current voice instruction according to the voice signal, and determining the current voice instruction to be processed in the at least one current voice instruction based on the voice volume, wherein the voice volume of the current voice instruction to be processed is greater than or equal to a preset volume threshold;
and processing the current voice instruction to be processed.
According to another aspect of the present invention, there is provided a processing apparatus of a voice instruction, including:
the signal acquisition module is used for acquiring voice signals;
the environment determining module is used for responding to the condition that the voice signal carries at least one current voice instruction and determining the type of the environment where the target vehicle is located according to the voice signal;
the volume determining module is used for responding to the condition that the environment type is a multi-person voice environment and determining the voice volume of the at least one current voice instruction according to the voice signal;
the instruction determining module is used for determining a current voice instruction to be processed in the at least one current voice instruction based on the voice volume, wherein the voice volume of the current voice instruction to be processed is greater than or equal to a preset volume threshold;
and the first processing module is used for processing the current voice instruction to be processed.
According to another aspect of the present invention, there is provided a vehicle including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform a method of processing voice instructions according to any of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement a method for processing voice instructions according to any one of the embodiments of the present invention when the computer instructions are executed.
The embodiment of the invention provides a method and a device for processing a voice instruction, a vehicle and a storage medium, which are used for collecting voice signals; if the voice signal carries at least one current voice instruction, determining the type of the environment where the target vehicle is located according to the voice signal; if the environment type is a multi-person voice environment, determining the voice volume of the at least one current voice instruction according to the voice signal, and determining a current voice instruction to be processed in the at least one current voice instruction based on the voice volume, wherein the voice volume of the current voice instruction to be processed is greater than or equal to a preset volume threshold; and processing the current voice instruction to be processed. By adopting the technical scheme, when the vehicle is in a multi-person voice environment, the voice commands carried in the voice signal are processed in different modes according to their voice volume, so that the processing modes of voice commands can be enriched, the efficiency of voice command recognition and the timeliness of voice command feedback are improved, and the probability of erroneously controlling the vehicle based on a voice command is reduced.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart illustrating a method for processing a voice command according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating a method for processing a voice command according to a second embodiment of the present invention;
fig. 3 is a block diagram of a voice command processing apparatus according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a vehicle according to a fourth embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1 is a flowchart of a method for processing a voice command according to an embodiment of the present invention, where the embodiment is applicable to a case where a voice command is processed, and the method may be executed by a processing apparatus of the voice command, where the processing apparatus of the voice command may be implemented in a form of hardware and/or software, and the processing apparatus of the voice command may be configured in a voice control system, and the voice control system may be configured in a vehicle. As shown in fig. 1, the method includes:
and S110, collecting voice signals.
In this embodiment, the voice control system may have an awake state and a sleep state. For example, a user may wake up the voice control system of the vehicle through a specific voice or by triggering a specific touch control or physical button, switching the voice control system from the sleep state to the awake state. The user can also switch the voice control system from the awake state to the sleep state through another specific voice instruction or by triggering the corresponding touch control or physical key. When the voice control system is in the awake state, it can acquire voice signals in real time and control the vehicle based on the voice signals. In addition, the voice control system can automatically switch from the awake state to the sleep state when no voice signal or voice instruction has been collected for a long time (such as 10 s or 20 s).
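The wake/sleep behavior described above can be sketched as a small state machine. This is an illustrative sketch, not the patent's implementation: the 20 s auto-sleep timeout and all method names are assumptions.

```python
# Hypothetical sketch of the wake/sleep state machine described above.
# The timeout value and method names are illustrative assumptions.
class VoiceControlSystem:
    SLEEP, AWAKE = "sleep", "awake"
    AUTO_SLEEP_TIMEOUT = 20.0  # seconds without any voice signal

    def __init__(self):
        self.state = self.SLEEP
        self.last_signal_time = None

    def wake(self, now):
        """Triggered by a wake word, touch control, or physical button."""
        self.state = self.AWAKE
        self.last_signal_time = now

    def on_voice_signal(self, now):
        """Called whenever a voice signal is collected while awake."""
        if self.state == self.AWAKE:
            self.last_signal_time = now

    def tick(self, now):
        """Periodic check: auto-sleep after a long silence."""
        if (self.state == self.AWAKE
                and now - self.last_signal_time > self.AUTO_SLEEP_TIMEOUT):
            self.state = self.SLEEP
```

As long as voice signals keep arriving, each one resets the silence timer; the system drops back to the sleep state only after a full timeout of silence.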
Specifically, the voice control system may collect sound signals in real time and extract the voice signal contained in them. The voice signal is a speech signal, that is, a signal of a human voice.
In this embodiment, after the voice signal is collected, it may be further determined whether the collected voice signal carries a voice instruction, so that when the collected voice signal carries the voice instruction, the voice instruction carried in the voice signal is processed. At this time, after the acquiring the voice signal, the method may further include: extracting a target voice feature vector of the voice signal; and if a target standard voice feature vector matched with the target voice feature vector exists in the voice library, taking a standard voice instruction corresponding to the target standard voice feature vector as a current voice instruction carried in the voice signal.
The target speech feature vector may be a speech feature vector of a speech signal currently acquired by the speech control system. The target standard speech feature vector may be a standard speech feature vector that matches the target speech feature vector. The standard speech feature vector may be a speech feature vector of a standard speech instruction. The speech feature vector may be understood as a vector for characterizing speech features. The standard voice command can be a standard voice command obtained by broadcasting a preset standard broadcast statement. The current voice instruction can be understood as a voice instruction carried in the voice signal, that is, a voice instruction carried in the currently acquired voice signal. The voice command can be understood as a command for controlling the vehicle by voice.
In this embodiment, standard broadcast statements corresponding to different voice commands may be preset; aiming at each standard broadcast statement, respectively adopting languages of different countries and/or regions to broadcast the standard broadcast statement to obtain different standard voice instructions corresponding to the standard broadcast statement; and further extracting the voice feature vector of each standard voice instruction to serve as the standard voice feature vector of the corresponding standard voice instruction, and correspondingly storing each standard voice instruction and the standard voice feature vector of each standard voice instruction in a voice library.
Therefore, after the voice signal is acquired, the interference signal in the voice signal can be filtered. After the filtering is completed, the filtered speech signal is subjected to standardization processing to extract speech features of the speech signal, and vector calculation is performed on the extracted speech features by using a Hidden Markov Model (HMM) obtained by pre-training to obtain a target speech feature vector of the speech signal. And matching the target voice feature vector of the voice signal with each standard voice feature vector stored in a voice library, judging whether a target standard voice feature vector matched with the target voice feature vector exists in the voice library, if so, judging that the voice signal carries a voice instruction, and taking the standard voice instruction corresponding to the target standard voice feature vector stored in the voice library as the current voice instruction carried in the voice signal.
In this embodiment, the standard voice obtained by broadcasting the standard broadcast statement by using the voices of different countries and/or regions is set in the voice library, so that the voice control system can adapt to the languages of different countries and/or regions, that is, a user can control a vehicle no matter what language is used to speak the voice command, and therefore, the recognition efficiency and the recognition accuracy of the voice command can be improved.
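The library-matching step described above can be sketched as follows. The patent does not specify the matching metric, so cosine similarity, the 0.9 threshold, the toy three-dimensional vectors, and all names here are assumptions for illustration only.

```python
import math

# Illustrative voice library: standard instruction -> standard feature vector.
# Real feature vectors would come from HMM-based feature extraction.
VOICE_LIBRARY = {
    "open_window":  [0.9, 0.1, 0.0],
    "close_window": [0.1, 0.9, 0.0],
    "play_music":   [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def match_instruction(target_vector, threshold=0.9):
    """Return the standard instruction whose vector best matches the target
    feature vector, or None if no match clears the threshold."""
    best_name, best_score = None, 0.0
    for name, standard_vector in VOICE_LIBRARY.items():
        score = cosine_similarity(target_vector, standard_vector)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

When a target standard voice feature vector is found, the corresponding standard instruction becomes the current voice instruction; when nothing clears the threshold, the signal is treated as carrying no instruction.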
And S120, if the voice signal carries at least one current voice instruction, determining the type of the environment where the target vehicle is located according to the voice signal.
The target vehicle may be understood as the vehicle that executes the voice instruction processing method provided in this embodiment, that is, the vehicle in which the voice control system is disposed. The environment in which the target vehicle is located includes the in-vehicle environment and/or the out-of-vehicle environment of the target vehicle. The environment types may include, for example, a multi-person voice environment and a single-person voice environment.
Specifically, when it is determined that the current voice command is carried in the voice signal, the type of the environment in which the target vehicle is located may be further determined according to the voice signal, for example, the number of speakers may be determined according to the voice signal, and whether the environment in which the target vehicle is located is a multi-person voice environment or a single-person voice environment may be determined according to the number of speakers.
S130, if the environment type is a multi-person voice environment, determining the voice volume of the at least one current voice instruction according to the voice signal, and determining the current voice instruction to be processed in the at least one current voice instruction based on the voice volume, wherein the voice volume of the current voice instruction to be processed is larger than or equal to a preset volume threshold.
In this embodiment, different environment types and/or processing manners corresponding to different voice volumes may be preset, and after the environment type of the target vehicle and/or the voice volume of the current voice instruction carried in the voice signal are determined, the current voice instruction carried in the voice signal is processed by using the processing manner corresponding to the environment type of the target vehicle and/or the voice volume of the current voice instruction carried in the voice signal. The processing modes corresponding to different environment types and/or different voice volumes are limited.
In this embodiment, when in a noisy environment, such as when multiple people are chatting, a user generally utters a voice command at a higher volume so that the voice control system can correctly receive it. Therefore, when the vehicle is in a multi-person voice environment, the vehicle may respond only to the one or more voice instructions with the larger voice volume carried in the voice signal, thereby avoiding mistaking the users' chat for a voice instruction and reducing the probability of erroneously controlling the vehicle based on a voice instruction.
A multi-person speech environment is understood to be an environment in which a plurality of speakers are present, i.e. an environment in which the speech signal contains the speech of a plurality of persons. The current voice instruction to be processed may be one or more current voice instructions with a voice volume greater than or equal to a preset volume threshold, and preferably, the number of the current voice instructions to be processed may be one, that is, a certain current voice instruction with a voice volume greater than or equal to a preset volume threshold may be used as the current voice instruction to be processed, for example, the current voice instruction with a maximum voice volume and a voice volume greater than or equal to a preset volume threshold may be used as the current voice instruction to be processed. At this time, optionally, the current voice instruction to be processed is a voice instruction with the largest voice volume in the at least one voice instruction, and this case is taken as an example for description below. The preset volume threshold may be set as needed, for example, the preset volume threshold may be set to 70 dB.
For example, when the target vehicle is in a multi-user voice environment, the voice volume of each current voice instruction carried in the voice signal may be further determined according to the voice signal, the current voice instruction with the largest voice volume carried in the voice signal is determined according to the voice volume, whether the voice volume of the current voice instruction is greater than or equal to a preset volume threshold is determined, and if the voice volume is greater than or equal to the preset volume threshold, the current voice instruction with the largest voice volume is determined as the current voice instruction to be processed.
Correspondingly, if the voice volume is smaller than the preset volume threshold, the response to each current voice instruction carried in the voice signal can be avoided, so as to avoid the situation of false response. At this time, optionally, the method for processing the voice instruction provided in this embodiment further includes: and if the at least one voice instruction is determined not to exist in the current voice instruction to be processed, not processing the at least one voice instruction.
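The selection rule in the paragraphs above can be sketched as follows. The 70 dB threshold mirrors the example value given earlier; the function shape and input format are assumptions for illustration.

```python
# Minimal sketch of the multi-person selection rule described above: among the
# current voice instructions carried in the signal, keep only the loudest one,
# and only if its volume reaches the preset threshold.
PRESET_VOLUME_THRESHOLD_DB = 70.0  # example value from the text

def select_instruction_to_process(instructions):
    """instructions: list of (instruction, volume_db) pairs.
    Returns the current voice instruction to be processed, or None if no
    instruction reaches the preset volume threshold."""
    if not instructions:
        return None
    loudest, volume = max(instructions, key=lambda pair: pair[1])
    return loudest if volume >= PRESET_VOLUME_THRESHOLD_DB else None
```

Returning None corresponds to the false-response avoidance above: when even the loudest instruction stays below the threshold, none of the carried instructions is processed.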
And S140, processing the current voice instruction to be processed.
In this embodiment, after determining the current voice instruction to be processed, the determined current voice instruction to be processed may be processed.
Specifically, the current voice command to be processed may be converted into a control command, and the target vehicle is controlled by the control command, for example, the body hardware of the target vehicle is controlled, or software installed on the vehicle is controlled by the central system of the target vehicle.
In one embodiment, the processing the current voice instruction to be processed includes: and generating a control instruction corresponding to the current voice instruction to be processed, and sending the control instruction to a control module of an object to be controlled in the target vehicle so as to control the object to be controlled through the control module, wherein the object to be controlled is hardware or software corresponding to the current voice instruction to be processed.
Specifically, a control instruction may be generated according to the current voice instruction to be processed, the object to be controlled corresponding to the control instruction is determined, and the control instruction is sent to the control module of the object to be controlled. Correspondingly, after receiving the control instruction sent by the voice control system, the control module of the object to be controlled in the target vehicle can control the object to be controlled according to the control instruction.
The manner of generating the control instruction corresponding to the current voice instruction to be processed may be flexibly set; for example, a control instruction generation method in the prior art may be used, which is not limited in this embodiment. The object to be controlled may be hardware of the target vehicle, such as a window, sunroof, seat, air conditioner, lights, or trunk lid of the target vehicle; it may also be software installed on the target vehicle, so that channel selection, series watching, live/replay viewing, music playback, translation, volume, games, and the like can be controlled through control instructions, further enriching the control range of the voice control system. The control module of the object to be controlled may be a processor, a controller, or the like for controlling the object to be controlled.
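The dispatch path described above can be sketched as a lookup from instruction to target object and action. All instruction names, object names, and the table itself are hypothetical; the patent leaves the generation method open.

```python
# Hypothetical dispatch sketch: turn a to-be-processed voice instruction into
# a control instruction and route it to the control module of the matching
# object (body hardware or installed software). All names are illustrative.
CONTROL_TABLE = {
    "open_window":   ("window", "open"),
    "close_sunroof": ("sunroof", "close"),
    "play_music":    ("media_app", "play"),
}

class ControlModule:
    """Stand-in for the processor/controller of an object to be controlled."""
    def __init__(self, name):
        self.name = name
        self.log = []

    def execute(self, action):
        self.log.append(action)
        return f"{self.name}:{action}"

def dispatch(instruction, modules):
    """Generate the control instruction and send it to the right module."""
    if instruction not in CONTROL_TABLE:
        return None
    target, action = CONTROL_TABLE[instruction]
    return modules[target].execute(action)
```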
In one embodiment, the processing the current voice instruction to be processed includes: if the time interval between the receiving time of the current voice instruction to be processed and the time of receiving the awakening instruction for the last time is less than or equal to the preset time length, processing the current voice instruction to be processed while processing the historical voice instruction to be processed; if the time interval between the receiving time of the current voice instruction to be processed and the time of receiving the awakening instruction for the last time is greater than the preset time length, processing the current voice instruction to be processed after the historical voice instruction to be processed is processed; and the historical voice instruction to be processed is a voice instruction to be processed received before the current voice instruction to be processed.
The wake-up instruction may be understood as an instruction for waking up the voice control system, that is, an instruction for switching the voice control system from the sleep state to the wake-up state, which may be generated based on voice of a user or a trigger operation of the user on a corresponding touch control or physical key. Correspondingly, the time when the wake-up command is received last time can be understood as the time when the voice control system is switched from the dormant state to the wake-up state this time, that is, the starting time when the voice control system is in the wake-up state this time. The historical voice command to be processed can be understood as a voice command which is received before the voice signal is collected at this time and has not been responded to completely. The preset time length can be set according to needs, for example, the preset time length can be set to 10s or 15 s.
In this embodiment, when the voice control system is in the wake-up state, different processing manners may be adopted to respond to the received current voice instruction to be processed according to different receiving times, so as to further enrich the response manner of the voice instruction.
Specifically, if the time interval between the receiving time of the current to-be-processed voice instruction and the starting time of the voice control system in the wake-up state at this time is less than or equal to the preset time length, the one or more current to-be-processed voice instructions may be processed in parallel, and further, when there is a history to-be-processed voice instruction that has not been processed yet, the one or more current to-be-processed voice instructions and the history to-be-processed voice instruction that has not been processed yet may be responded in parallel. If the time interval between the receiving time of the current voice instruction to be processed and the starting time of the voice control system in the awakening state is greater than the preset time length, the one or more current voice instructions to be processed and the historical voice instructions to be processed which are not processed can be processed in sequence according to the sequence of the receiving time of each voice instruction.
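The scheduling rule above reduces to a single comparison. A minimal sketch, with the 10 s preset duration taken from the example values in the text:

```python
# Sketch of the scheduling rule above: an instruction received within the
# preset duration of the most recent wake-up is handled in parallel with any
# unfinished historical to-be-processed instructions; a later one waits its
# turn in receiving order.
PRESET_DURATION_S = 10.0  # example value from the text

def processing_mode(received_at, last_wake_at):
    """Return 'parallel' or 'sequential' for the current instruction."""
    interval = received_at - last_wake_at
    return "parallel" if interval <= PRESET_DURATION_S else "sequential"
```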
In one embodiment, after the acquiring the voice signal, the method further includes: determining emotion information of a speaker according to the voice signal; and outputting prompt information corresponding to the emotion information, wherein the prompt information is used for prompting the execution of target operation corresponding to the emotion information.
In the above embodiment, the emotion of the speaker can be further determined from the collected voice signal, and the vehicle can be prompted or controlled to perform an operation corresponding to that emotion, such as conducting voice interaction with the speaker using a mood and tone matching the speaker's emotion, or, when the speaker shows emotional fluctuations such as sadness, anger, or happiness, taking corresponding measures to soothe the speaker. This not only gives the speaker an immersive communication experience but also helps the speaker drive with a stable mood, reducing the probability of traffic accidents.
Specifically, after the voice signal is collected, the emotion information of the speaker can be identified according to the voice signal through an intelligent chip or an emotion identification model obtained through pre-training, for example, the emotion information of the speaker can be identified according to the tone, intonation and/or exclamation words contained in the voice signal; and determining target operation and prompt information corresponding to the emotion information, and outputting the prompt information, such as broadcasting the prompt information and/or displaying the prompt information, so as to prompt the target operation to be executed through the prompt information.
It can be understood that the voice control system can prompt the user that the target vehicle is performing or is about to perform the target operation through the prompt message, and control the target vehicle to perform the target operation; the user can also be prompted to execute the target operation through the prompt message; the user may also be prompted by the prompt message to control the target vehicle to execute the target operation, or the user is queried by the prompt message to determine whether the target vehicle is allowed to execute the target operation, and when a corresponding control instruction of the user is received, the target vehicle is controlled to execute the target operation, which may be specifically set as required, and this embodiment is not limited thereto.
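The emotion-to-prompt mapping described above can be sketched as a lookup table. The emotion labels, target operations, and prompt wording are all assumptions; the patent leaves the concrete mapping to the implementer.

```python
# Hedged sketch of mapping recognized emotion information to a target
# operation and a prompt message, as described above. All entries are
# illustrative; an emotion recognition model would supply the label.
EMOTION_ACTIONS = {
    "angry": ("play_soothing_music",
              "You seem upset. Shall I play some calming music?"),
    "sad":   ("soften_cabin_lighting",
              "Would you like softer cabin lighting?"),
    "happy": ("match_cheerful_tone",
              "Great mood! I'll keep our chat upbeat."),
}

def prompt_for_emotion(emotion):
    """Return (target_operation, prompt_text), or None for other emotions."""
    return EMOTION_ACTIONS.get(emotion)
```

Depending on the configuration discussed above, the prompt can announce an operation already underway, ask the user to perform it, or ask for permission before the vehicle executes it.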
The processing method of the voice command provided by the embodiment of the invention collects a voice signal; if the voice signal carries at least one current voice instruction, determines the type of the environment where the target vehicle is located according to the voice signal; if the environment type is a multi-person voice environment, determines the voice volume of the at least one current voice instruction according to the voice signal, and determines a current voice instruction to be processed in the at least one current voice instruction based on the voice volume, wherein the voice volume of the current voice instruction to be processed is greater than or equal to a preset volume threshold; and processes the current voice instruction to be processed. According to this technical scheme, when the vehicle is in a multi-person voice environment, the voice commands carried in the voice signal are processed in different modes according to their voice volume, which can enrich the processing modes of voice commands, improve the efficiency of voice command recognition and the timeliness of voice command feedback, and reduce the probability of erroneously controlling the vehicle based on a voice command.
Example two
Fig. 2 is a flowchart illustrating a method for processing a voice command according to a second embodiment of the present invention. The present embodiment is optimized based on the above embodiments. Optionally, before the processing the current voice instruction to be processed, the method further includes: and if the environment type is a single voice environment, taking the at least one voice instruction as a current voice instruction to be processed.
Correspondingly, as shown in fig. 2, the method for processing a voice instruction according to the second embodiment of the present invention may include:
and S210, collecting voice signals.
S220, if the voice signal carries at least one current voice instruction, determining the type of the environment where the target vehicle is located according to the voice signal, and executing S230 or S240.
In this embodiment, the environment in which the target vehicle is located may be a single-person voice environment or a multi-person voice environment. The manner of determining the environment type of the target vehicle from the voice signal can be set flexibly. For example, it may be judged whether the voice signal contains segments in which multiple speakers speak simultaneously, i.e., segments in which the voices of multiple speakers overlap in time; if so, the target vehicle is judged to be in a multi-person voice environment, and if not, in a single-person voice environment. Alternatively, it may be judged whether the voice signal contains the voices of multiple speakers at all, i.e., whether multiple speakers are present during the acquisition period of the voice signal; if so, the target vehicle is judged to be in a multi-person voice environment, and if not, in a single-person voice environment.
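The two environment-type checks described above can be sketched as follows. This is a minimal illustration, assuming an upstream speaker-diarization step has already produced `(speaker_id, start, end)` segments; the function names and segment representation are not from the patent.

```python
# Sketch of the environment-type check: the second ("alternatively") variant
# only counts distinct speakers; the first ("overlap") variant additionally
# requires two speakers' segments to overlap in time.

def classify_environment(segments):
    """Return 'multi' if more than one speaker appears anywhere in the
    acquisition window, else 'single'."""
    speakers = {spk for spk, _, _ in segments}
    return "multi" if len(speakers) > 1 else "single"

def has_overlapping_speech(segments):
    """Stricter variant: True only if segments from two different speakers
    overlap in time (simultaneous speech)."""
    ordered = sorted(segments, key=lambda s: s[1])  # sort by start time
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(ordered, ordered[1:]):
        if spk_b != spk_a and start_b < end_a:
            return True
    return False
```

Which variant is appropriate depends on the diarization backend; the overlap test is stricter and will classify turn-taking conversation as single-person speech at any instant.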
And S230, if the environment type is a multi-person voice environment, determining the voice volume of the at least one current voice instruction according to the voice signal, determining the current voice instruction to be processed in the at least one current voice instruction based on the voice volume, and executing S250, wherein the voice volume of the current voice instruction to be processed is greater than or equal to a preset volume threshold.
S240, if the environment type is a single voice environment, taking the at least one voice instruction as a current voice instruction to be processed.
Specifically, if the target vehicle is in a single-person voice environment, that is, if only one speaker exists, all voice commands carried in the voice signal may be used as the current voice commands to be processed, so as to respond to each current voice command to be processed.
And S250, processing the current voice instruction to be processed.
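The selection logic of S230/S240 above can be sketched as follows. The command representation and the threshold value are illustrative assumptions, not values specified by this embodiment.

```python
# Hypothetical sketch of selecting the current voice instructions to be
# processed: in a multi-person environment, keep only instructions whose
# volume clears the preset threshold (S230); in a single-person environment,
# keep all instructions (S240).

VOLUME_THRESHOLD = 60.0  # assumed preset volume threshold (e.g. dB SPL)

def select_pending_commands(commands, environment):
    """Return the subset of detected commands to actually process.

    `commands` is a list of dicts with at least a "volume" field;
    `environment` is "single" or "multi"."""
    if environment == "single":
        return list(commands)
    return [c for c in commands if c["volume"] >= VOLUME_THRESHOLD]
```

For the claim-3 variant (only the loudest instruction is processed), the filter would instead reduce to `max(commands, key=lambda c: c["volume"])`.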
According to the method for processing a voice instruction provided by the second embodiment of the invention, when the vehicle is in different environments and/or the voice volume of a voice instruction differs, the voice instructions carried in the voice signal are processed in different modes. This improves the efficiency of voice instruction recognition and the timeliness of voice instruction feedback, and reduces the probability of erroneously controlling the vehicle based on a voice instruction.
EXAMPLE III
Fig. 3 is a block diagram of an apparatus for processing voice instructions according to a third embodiment of the present invention. The embodiment is applicable to processing voice instructions; the apparatus may be implemented in the form of hardware and/or software and may be configured in a voice control system, which in turn may be configured in a vehicle. As shown in fig. 3, the apparatus includes: a signal acquisition module 301, a type determination module 302, an instruction determination module 303, and a first processing module 304, wherein,
a signal acquisition module 301, configured to acquire a voice signal;
a type determination module 302, configured to determine, in response to the voice signal carrying at least one current voice instruction, the type of the environment where the target vehicle is located according to the voice signal;
an instruction determination module 303, configured to determine, in response to the environment type being a multi-person voice environment, the voice volume of the at least one current voice instruction according to the voice signal, and to determine, based on the voice volume, a current voice instruction to be processed in the at least one current voice instruction, where the voice volume of the current voice instruction to be processed is greater than or equal to a preset volume threshold;
the first processing module 304 is configured to process the current voice instruction to be processed.
According to the apparatus for processing a voice instruction provided by the third embodiment of the invention, a voice signal is collected by the signal acquisition module; in response to the voice signal carrying at least one current voice instruction, the type determination module determines the type of the environment where the target vehicle is located according to the voice signal; in response to the environment type being a multi-person voice environment, the instruction determination module determines the voice volume of the at least one current voice instruction according to the voice signal and determines a current voice instruction to be processed in the at least one current voice instruction based on the voice volume, where the voice volume of the current voice instruction to be processed is greater than or equal to a preset volume threshold; and the current voice instruction to be processed is processed by the first processing module. In this technical scheme, when the vehicle is in a multi-person voice environment, the voice instructions carried in the voice signal are processed in different modes according to voice volume. This enriches the processing modes of voice instructions, improves the efficiency of voice instruction recognition and the timeliness of voice instruction feedback, and reduces the probability of erroneously controlling the vehicle based on a voice instruction.
Further, the apparatus for processing a voice instruction provided in this embodiment may further include: and the second processing module is used for responding to the condition that the current voice instruction to be processed does not exist in the at least one voice instruction and not processing the at least one voice instruction.
In the above scheme, the current voice instruction to be processed may be a voice instruction with the largest voice volume in the at least one voice instruction.
In the above solution, the instruction determining module 303 may be configured to: and in response to the condition that the environment type is the single voice environment, taking the at least one voice instruction as a current voice instruction to be processed, and responding to the current voice instruction to be processed.
In the foregoing solution, the first processing module 304 may be configured to: in response to the time interval between the receiving time of the current voice instruction to be processed and the time at which a wake-up instruction was last received being less than or equal to a preset duration, process the current voice instruction to be processed while processing the historical voice instruction to be processed; in response to that time interval being greater than the preset duration, process the current voice instruction to be processed after the historical voice instruction to be processed has been processed; where the historical voice instruction to be processed is a voice instruction to be processed that was received before the current voice instruction to be processed.
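The timing rule above amounts to a two-way scheduling decision, sketched below; the window length and function names are assumptions for illustration only.

```python
# Hedged sketch of the wake-window scheduling rule: a command arriving within
# a preset duration after the last wake-up instruction is handled concurrently
# with any in-flight historical command; otherwise it is queued behind it.

PRESET_WINDOW_S = 5.0  # assumed preset duration in seconds

def schedule_mode(received_at, last_wake_at, window=PRESET_WINDOW_S):
    """Return 'concurrent' or 'queued' for the current pending command,
    given timestamps in seconds."""
    if received_at - last_wake_at <= window:
        return "concurrent"
    return "queued"
```

In a real system the "concurrent" branch would typically submit the command to a worker pool while the "queued" branch appends it to a FIFO behind the historical command.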
In the foregoing solution, the first processing module 304 may be configured to: and generating a control instruction corresponding to the current voice instruction to be processed, and sending the control instruction to a control module of an object to be controlled in the target vehicle so as to control the object to be controlled through the control module, wherein the object to be controlled is hardware or software corresponding to the current voice instruction to be processed.
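Generating a control instruction and routing it to the control module of the object to be controlled can be sketched as a lookup-and-dispatch step. The command-to-target mapping and controller interfaces below are invented for illustration; the patent does not specify them.

```python
# Hypothetical sketch of dispatching a processed voice instruction to the
# control module of the corresponding object (hardware or software) in the
# target vehicle. The mapping table is an assumption.

COMMAND_TARGETS = {
    "open the window": ("window_controller", "open"),
    "turn on the air conditioner": ("hvac_controller", "power_on"),
}

def dispatch(command_text, controllers):
    """Look up the object to be controlled and the action for this command,
    then invoke that object's control module with the action."""
    target, action = COMMAND_TARGETS[command_text]
    return controllers[target](action)
```

Here `controllers` would map module names to callables exposed by the vehicle's actual control modules.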
Further, the processing apparatus of the voice instruction provided in this embodiment may further include: the emotion determining module is used for determining emotion information of a speaker according to the voice signal after the voice signal is collected; and the prompting module is used for outputting prompting information corresponding to the emotion information, and the prompting information is used for prompting to execute target operation corresponding to the emotion information.
Further, the processing apparatus of the voice instruction provided in this embodiment may further include: the feature extraction module is used for extracting a target voice feature vector of the voice signal after the voice signal is collected; and the instruction determining module is used for taking a standard voice instruction corresponding to the target standard voice feature vector as a current voice instruction carried in the voice signal when the target standard voice feature vector matched with the target voice feature vector exists in a voice library.
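Matching the extracted target voice feature vector against the voice library can be sketched as a nearest-neighbor search; cosine similarity and the match threshold below are illustrative assumptions, since the patent does not fix a similarity measure.

```python
# Sketch of matching a target feature vector against standard vectors in a
# voice library. Cosine similarity and the threshold value are assumptions.
import math

MATCH_THRESHOLD = 0.9  # assumed minimum similarity for a match

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_command(target_vec, voice_library):
    """Return the standard voice instruction whose stored feature vector best
    matches the target vector, or None if no entry clears the threshold."""
    best_cmd, best_sim = None, MATCH_THRESHOLD
    for cmd, vec in voice_library.items():
        sim = cosine(target_vec, vec)
        if sim >= best_sim:
            best_cmd, best_sim = cmd, sim
    return best_cmd
```

The matched standard instruction then serves as the current voice instruction carried in the voice signal, as described above.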
The processing device of the voice instruction provided by the embodiment of the invention can execute the processing method of the voice instruction provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example four
FIG. 4 illustrates a schematic block diagram of a vehicle 10 that may be used to implement an embodiment of the present invention. As shown in fig. 4, the vehicle 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the vehicle 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the vehicle 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the vehicle 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as the processing of voice instructions.
In some embodiments, the method of processing the voice instructions may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed on the vehicle 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the processing method of the voice instructions described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured by any other suitable means (e.g., by means of firmware) to execute the processing method of the voice instructions.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program may execute entirely on a machine; partly on a machine; as a stand-alone software package, partly on a machine and partly on a remote machine; or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described herein may be implemented on a vehicle having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the vehicle. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server (also called a cloud computing server or cloud host), a host product in a cloud computing service system, which overcomes the defects of high management difficulty and weak service expansibility of traditional physical hosts and VPS services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired result of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (11)
1. A method for processing a voice command, comprising:
collecting voice signals;
if the voice signal carries at least one current voice instruction, determining the type of the environment where the target vehicle is located according to the voice signal;
if the environment type is a multi-person voice environment, determining the voice volume of the at least one current voice instruction according to the voice signal, and determining the current voice instruction to be processed in the at least one current voice instruction based on the voice volume, wherein the voice volume of the current voice instruction to be processed is greater than or equal to a preset volume threshold;
and processing the current voice instruction to be processed.
2. The method of claim 1, further comprising:
and if it is determined that a current voice instruction to be processed does not exist in the at least one voice instruction, not processing the at least one voice instruction.
3. The method according to claim 1, wherein the current voice instruction to be processed is a voice instruction with the largest voice volume in the at least one voice instruction.
4. The method according to claim 1, further comprising, before said processing said current pending voice instruction:
and if the environment type is a single voice environment, taking the at least one voice instruction as a current voice instruction to be processed.
5. The method according to claim 1 or 4, wherein the processing the current voice instruction to be processed comprises:
if the time interval between the receiving time of the current voice instruction to be processed and the time of receiving the awakening instruction for the last time is less than or equal to the preset time length, processing the current voice instruction to be processed while processing the historical voice instruction to be processed;
if the time interval between the receiving time of the current voice instruction to be processed and the time of receiving the awakening instruction for the last time is longer than the preset time, processing the current voice instruction to be processed after the historical voice instruction to be processed is processed;
and the historical voice instruction to be processed is a voice instruction to be processed received before the current voice instruction to be processed.
6. The method according to claim 1 or 4, wherein the processing the current voice instruction to be processed comprises:
and generating a control instruction corresponding to the current voice instruction to be processed, and sending the control instruction to a control module of an object to be controlled in the target vehicle so as to control the object to be controlled through the control module, wherein the object to be controlled is hardware or software corresponding to the current voice instruction to be processed.
7. The method of claim 1, further comprising, after said acquiring the speech signal:
determining emotion information of a speaker according to the voice signal;
and outputting prompt information corresponding to the emotion information, wherein the prompt information is used for prompting the execution of target operation corresponding to the emotion information.
8. The method of claim 1, further comprising, after said acquiring the speech signal:
extracting a target voice feature vector of the voice signal;
and if a target standard voice feature vector matched with the target voice feature vector exists in the voice library, taking a standard voice instruction corresponding to the target standard voice feature vector as a current voice instruction carried in the voice signal.
9. An apparatus for processing a voice command, comprising:
the signal acquisition module is used for acquiring voice signals;
the type determining module is used for responding to the condition that at least one current voice instruction is carried in the voice signal and determining the type of the environment where the target vehicle is located according to the voice signal;
the instruction determining module is used for responding to the situation that the environment type is a multi-person voice environment, determining the voice volume of the at least one current voice instruction according to the voice signal, and determining the current voice instruction to be processed in the at least one current voice instruction based on the voice volume, wherein the voice volume of the current voice instruction to be processed is larger than or equal to a preset volume threshold;
and the first processing module is used for processing the current voice instruction to be processed.
10. A vehicle, characterized by comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform a method of processing voice instructions as claimed in any one of claims 1 to 8.
11. A computer-readable storage medium storing computer instructions for causing a processor to perform a method of processing voice instructions according to any one of claims 1 to 8 when the computer instructions are executed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211485669.5A CN115862620A (en) | 2022-11-24 | 2022-11-24 | Voice instruction processing method and device, vehicle and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115862620A true CN115862620A (en) | 2023-03-28 |
Family
ID=85666133
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211485669.5A Pending CN115862620A (en) | 2022-11-24 | 2022-11-24 | Voice instruction processing method and device, vehicle and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115862620A (en) |
2022-11-24: priority to CN202211485669.5A; patent CN115862620A/en, status Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||