GB2578766A - Apparatus and method for controlling vehicle system operation - Google Patents

Apparatus and method for controlling vehicle system operation

Info

Publication number
GB2578766A
GB2578766A GB1818151.1A GB201818151A GB2578766A GB 2578766 A GB2578766 A GB 2578766A GB 201818151 A GB201818151 A GB 201818151A GB 2578766 A GB2578766 A GB 2578766A
Authority
GB
United Kingdom
Prior art keywords
user
spoken command
supplemented
control system
spoken
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1818151.1A
Other versions
GB201818151D0 (en)
GB2578766B (en)
Inventor
Thompson Simon
Frederick Brown Edward
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jaguar Land Rover Ltd
Original Assignee
Jaguar Land Rover Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jaguar Land Rover Ltd filed Critical Jaguar Land Rover Ltd
Priority to GB1818151.1A priority Critical patent/GB2578766B/en
Publication of GB201818151D0 publication Critical patent/GB201818151D0/en
Publication of GB2578766A publication Critical patent/GB2578766A/en
Application granted granted Critical
Publication of GB2578766B publication Critical patent/GB2578766B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W50/08Interaction between the driver and the control system
    • B60W50/087Interaction between the driver and the control system where the control system corrects or modifies a request from the driver
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/59Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • G10L15/25Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L2015/228Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of application context

Abstract

The invention provides a control system 300 for controlling operation of one or more systems of a vehicle. The control system comprises one or more controllers configured to: receive 304 image data representative of a user or driver's face; receive 302 audio data from an audio input means within the vehicle; analyse 306 the received audio data to identify a spoken command from the driver; analyse 308 the received image data to determine a contextual meaning (e.g. emotional state) associated with the spoken command; determine 310 a supplemented spoken command in dependence on the spoken command and the contextual meaning; and output 312 an indication of the supplemented spoken command to one or more vehicle control systems, which are controlled in accordance with the supplemented spoken command. A Facial Action Coding System (FACS) may be used to interpret the facial image. The contextual meaning may include sincerity or insincerity, so that the system can judge whether a driver is being sarcastic or jocular when asking certain questions (212).

Description

APPARATUS AND METHOD FOR CONTROLLING VEHICLE SYSTEM OPERATION
TECHNICAL FIELD
The present disclosure relates to controlling vehicle system operation and particularly, but not exclusively, to controlling operation of a system of a vehicle in accordance with a spoken command supplemented with a contextual meaning derived from image data of the speaker. Aspects of the invention relate to a control system, a system, a vehicle, a method, and computer software.
BACKGROUND
It is known for a computer to process human speech, for example using voice recognition. An end goal, arguably, is for a human to be able to hold an intelligent, multi-faceted, articulate and fluent conversation with a computer, such as a computer within a vehicle.
Despite current limitations in Artificial Intelligence, a more rudimentary issue stymies voice interaction: reliable recognition of words and phrases, especially given the complexity of lexicon and the range of accents among humans. Voice interaction with machines is notoriously unreliable, and the issues are ubiquitous across formats. The computer can make voice recognition errors because of the range of different accents and possible phrasings. Moreover, it remains a challenge not only to understand the text being spoken, but to understand the context in which it is spoken.
In an automotive context accurate voice interaction could free the user from distractions and lessen the workload for the ever-mounting number of tasks that come with increased vehicle functionality.
It is an object of embodiments of the invention to at least mitigate one or more of the problems of the prior art.
SUMMARY OF THE INVENTION
Aspects and embodiments of the invention provide a control system for a vehicle, a system, a vehicle, a method, and computer software as claimed in the appended claims.
According to an aspect of the invention, there is provided a control system for controlling operation of one or more systems of a vehicle, the control system comprising one or more controllers, configured to: receive audio data from an audio input means within the vehicle; receive image data representative of at least one image of at least a portion of a user's face; analyse the received audio data to identify a spoken command from the user; analyse the received image data to determine a contextual meaning associated with the spoken command; determine a supplemented spoken command in dependence on the identified spoken command and the determined contextual meaning; and output an indication of the determined supplemented spoken command to one or more vehicle systems for controlling the one or more vehicle systems in accordance with the supplemented spoken command.
The one or more controllers may collectively comprise: at least one electronic processor having an electrical input for receiving the image data; and at least one memory device coupled to the at least one electronic processor and having instructions stored therein; wherein the at least one electronic processor is configured to access the at least one memory device and execute the instructions stored therein so as to output the indication of the determined supplemented spoken command.
The control system may be configured to analyse the received image data to determine a contextual meaning associated with the spoken command by identifying one or more Facial Action Units of the user.
The control system may be configured to use a Facial Action Coding System, FACS, to identify one or more Facial Action Units. A "Facial Action Unit" may be defined as a change, e.g. a contraction or relaxation, of one or more muscles of the face. The one or more muscles may be a predetermined facial muscle group.
The contextual meaning may comprise one or more user emotions. The control system may be configured to process the received image data to identify the one or more user emotions of the user associated with the spoken command.
The control system may be configured to identify the one or more user emotions by identifying one or more Facial Action Units of the user.
The control system may be configured to determine whether the one or more user emotions comprise an insincere emotion or a sincere emotion.
The control system may be configured to identify that the supplemented spoken command may have a different meaning to the identified spoken command in dependence on the one or more user emotions comprising an insincere emotion; and control output of the indication of the determined supplemented spoken command in dependence thereon.
The output may be controlled in some examples so that the command is not sent to the vehicle system at all. The output may be controlled in some examples so that the command is output to a different system such as a human-machine interface, HMI, e.g. to provide a visual, audio or tactile warning that the command has not been acted upon.
The control system may be configured to identify that the supplemented spoken command has the same meaning as the spoken command in dependence on the one or more user emotions comprising a sincere emotion; and control output of the indication of the determined supplemented spoken command in dependence thereon.
The output may, in some examples, be sent directly to a vehicle system for controlling that system in response to the spoken command. The output may, in some examples, be provided to an HMI for confirmation from a user. The control of the output may, in some examples, be dependent on a determined confidence of correct identification of the original spoken command.
The control system may be configured to: identify a user mouth shape associated with a speech sound from the received image data representative of at least one image of at least a portion of a user's face; identify, based on the identified user mouth shape, one or more speech sounds provided by the user when providing the voice command; and analyse the received audio data, in dependence on the one or more identified speech sounds, to identify the spoken command from the user.
The control system may be configured to determine the supplemented spoken content of the received voice signal using image data received one or more of: during receipt of the audio data; within a predetermined period before receipt of the audio data; and within a predetermined period after receipt of the audio data.
The image data may be captured by one or more in-vehicle cameras configured to transmit images of the at least part of the user's face to the controller.
The indication of the determined supplemented spoken content may be output as a control signal provided to a vehicle system in communication with the control system, the control signal configured to control the operation of the vehicle system.
The indication of the determined supplemented spoken content may be output as one or more of: an audio signal provided to an audio apparatus in communication with the control system; a visual signal provided to a display apparatus in communication with the control system; and a tactile signal provided to a tactile apparatus in communication with the control system.
According to an aspect of the invention, there is provided a system for controlling operation of one or more systems of a vehicle, comprising: the control system of any preceding claim; and one or more imaging devices configured to provide the image data to the control system; and one or more audio input devices configured to provide the audio data to the control system.
The system may comprise a vehicle system configured to receive the indication of the output supplemented spoken content.
The vehicle system may comprise one or more of: a vehicle system output apparatus configured to be controlled by the indication of the output supplemented spoken content; a display apparatus configured to display content based on the indication of the output supplemented spoken content; and an audio apparatus configured to output audio content based on the indication of the output supplemented spoken content; and a tactile apparatus configured to provide a tactile output based on the indication of the output supplemented spoken content.
According to an aspect of the invention, there is provided a vehicle comprising any controller described herein, or any system described herein.
According to an aspect of the invention, there is provided a method of controlling operation of one or more systems of a vehicle, the method comprising: receiving audio data from an audio input means within the vehicle; receiving image data representative of at least one image of at least a portion of a user's face; analysing the received audio data to identify a spoken command from the user; analysing the received image data to determine a contextual meaning associated with the spoken command; determining a supplemented spoken command in dependence on the identified spoken command and the determined contextual meaning; and outputting an indication of the determined supplemented spoken command to one or more vehicle systems for controlling the one or more vehicle systems in accordance with the supplemented spoken command.
Analysing the received image data to determine a contextual meaning associated with the spoken command may comprise identifying one or more Facial Action Units of the user.
The contextual meaning may comprise one or more user emotions. Analysing the received image data to determine a contextual meaning may comprise processing the received image data to identify the one or more user emotions of the user associated with the spoken command.
Identifying the one or more user emotions may comprise identifying one or more Facial Action Units of the user.
The method may comprise identifying that the supplemented spoken command may have a different meaning to the identified spoken command in dependence on the one or more user emotions comprising an insincere emotion; and controlling output of the indication of the determined supplemented spoken command in dependence thereon.
The method may comprise identifying that the supplemented spoken command has the same meaning as the spoken command in dependence on the one or more user emotions comprising a sincere emotion; and controlling output of the indication of the determined supplemented spoken command in dependence thereon.
According to an aspect of the invention, there is provided computer software which, when executed, is arranged to perform a method described herein.
The computer software may be stored on a computer-readable medium. Optionally, the computer-readable medium comprises a non-transitory computer-readable medium.
Any controller or controllers described herein may suitably comprise a control unit or computational device having one or more electronic processors. Thus the system may comprise a single control unit or electronic controller, or alternatively different functions of the controller may be embodied in, or hosted in, different control units or controllers. As used herein the terms "controller" or "control unit" will be understood to include both a single control unit or controller and a plurality of control units or controllers collectively operating to provide any stated control functionality. To configure a controller, a suitable set of instructions may be provided which, when executed, cause said control unit or computational device to implement the control techniques specified herein. The set of instructions may suitably be embedded in said one or more electronic processors. Alternatively, the set of instructions may be provided as software saved on one or more memories associated with said controller to be executed on said computational device. A first controller may be implemented in software run on one or more processors. One or more other controllers may be implemented in software run on one or more processors, optionally the same one or more processors as the first controller. Other suitable arrangements may also be used.
Within the scope of this application it is expressly intended that the various aspects, embodiments, examples and alternatives set out in the preceding paragraphs, in the claims and/or in the following description and drawings, and in particular the individual features thereof, may be taken independently or in any combination. That is, all embodiments and/or features of any embodiment can be combined in any way and/or combination, unless such features are incompatible. The applicant reserves the right to change any originally filed claim or file any new claim accordingly, including the right to amend any originally filed claim to depend from and/or incorporate any feature of any other claim although not originally claimed in that manner.
BRIEF DESCRIPTION OF THE DRAWINGS
One or more embodiments of the invention will now be described by way of example only, with reference to the accompanying drawings, in which:
Figure 1 shows an illustration of a control system and a system according to an embodiment of the invention;
Figure 2 shows an illustration of a controller according to an embodiment of the invention;
Figure 3 shows an example of a vehicle occupant's face captured by image capture means according to an embodiment of the invention;
Figure 4 illustrates a method according to an embodiment of the invention; and
Figure 5 illustrates a vehicle according to an embodiment of the invention.
DETAILED DESCRIPTION
Facial Action Coding System (FACS) is a method of detecting a person's exhibited emotions by denoting muscular movements or "Action Units" (which may also be termed "Facial Action Units") that together make up an expression. A "Facial Action Unit" may be defined as a change, e.g. a contraction or relaxation, of one or more muscles of the face. The one or more muscles may be a predetermined facial muscle group.
FACS encodes the movements of individual facial muscles and associates those movements with a user's emotion. FACS may be used to describe many emotions that a human can express, including "false" emotions such as a sarcastic smile or frustrated laugh, in which the primary expressed emotion (e.g. happiness in the case of a smile, or relief/relaxation in the case of a laugh) is altered by a secondary indicator which may, at least partially, cancel out the primary expressed emotion (e.g. a sarcastic smile would indicate a lack of happiness, a frustrated laugh would indicate tension or stress).
In a vehicle environment in which a vehicle occupant can interact with a vehicle computer using spoken commands and instructions, it would be useful to be able to use such technology in a way which would benefit the operation of the vehicle. In particular, if FACS can be used to determine the emotional state of the speaker, then a context may be applied to the spoken commands to more accurately provide an appropriate response to the spoken command. This may be particularly true in the case of "false" emotions, which may indicate that the speaker has a different intention from that which would be understood by considering the spoken text alone, with no applied context.
In the present case, the FACS approach to determining emotion from facial muscle movements may be extended to provide a context to associate with a spoken command. One way to categorise Facial Action Units in terms of context is for video footage of users to be analysed and annotated by FACS experts to create a taxonomy of Facial Action Units for, for example, sarcasm, disbelief, insincerity, surprise or anger. Such video data may be used as training data for a model, such as a machine learning model, which later analyses image data to obtain an indication of the user's emotion.
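By way of illustration only, such a taxonomy could be represented as a simple lookup from detected Facial Action Units to a contextual label. The sketch below is a minimal example, assuming the conventional FACS AU numbering; the groupings shown are invented for illustration and are not the annotated taxonomy described above.

```python
# A minimal, illustrative sketch (not the patented implementation) of mapping
# detected Facial Action Units to a coarse contextual label. The AU numbers
# follow common FACS conventions (e.g. AU4 brow lowerer, AU12 lip corner
# puller), but the groupings below are invented for illustration only.

CONTEXT_TAXONOMY = {
    frozenset({6, 12}): "sincere_happiness",  # cheek raiser + lip corner puller
    frozenset({12, 14}): "sarcasm",           # smile combined with a dimpler
    frozenset({4, 5, 7}): "anger",            # lowered brows, raised upper lid, tightened lids
    frozenset({1, 2, 26}): "surprise",        # raised brows, dropped jaw
}

def classify_context(detected_aus: set) -> str:
    """Return the best-matching context label for a set of detected AUs."""
    best_label, best_overlap = "neutral", 0
    for aus, label in CONTEXT_TAXONOMY.items():
        overlap = len(aus & detected_aus)
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label

print(classify_context({4, 5, 7, 23}))  # -> "anger"
```

In practice a trained model would replace the hand-written lookup, but the output is the same kind of coarse contextual label used in the examples that follow.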
Figure 1 shows an illustration of a control system 106 and a system 100 according to an embodiment of the invention. The system 100 in Figure 1 comprises image capture means 102 configured to provide image data to the control system 106. The image data may comprise data representative of at least one image of at least a portion of a user's face. The at least a portion of the vehicle occupant's face may be, for example, the head and upper torso of the vehicle occupant, the head of the occupant, or a partial portion of the user's head, such as the mouth or eye/forehead region. It may be desired for the image data to relate to a face or facial area of the occupant in order to determine a context relating to a user's spoken command, such as an insincere emotion or another emotion which adds context to a spoken command. In some examples, more than the at least a portion of the vehicle occupant's face may be captured in the image, and only a portion of the captured image data is provided to the control system 106.
The image data may be captured by image capture means 102, such as an in-vehicle camera, e.g. a red-green-blue, RGB, camera or an infra-red, IR, camera, or other imaging device, for example comprising a charge-coupled device, CCD, sensor or the like. In some examples the image data may be captured by one or more in-vehicle cameras or other image capture means configured to transmit images of at least part of the user's face to the control system. In some examples the system 100 may comprise an in-vehicle camera or other image capture means. In other examples the system 100 may be configured to receive image data from a separate in-vehicle camera or other image capture means in communication with the system.
The system 100 also comprises audio input means 104 configured to provide audio data to the control system 106. The audio data may comprise spoken words provided by the driver or other occupant of the vehicle. In some examples the system 100 may comprise audio input means. In other examples the system 100 may be configured to receive audio input data from a separate audio input device in communication with the system.
The control system 106 is configured to analyse the received audio data to identify a spoken command from the user, and analyse the received image data to determine a contextual meaning associated with the spoken command. For example, the received audio data may be determined to be a command such as "how long will it take to reach York?", "open the driver window" or "answer call" to take a telephone call on a hands-free system, for example.
The contextual meaning may be, for example, that the user's face is indicative of someone speaking sarcastically (thus that user may not really want to know how long it will take to reach York, or may not want the driver window open, for example). The control system 106 may be configured to use a Facial Action Coding System, FACS, to identify one or more Facial Action Units. A "Facial Action Unit" may be defined as a change, e.g. a contraction or relaxation, of one or more muscles of the face. The one or more muscles may be a predetermined facial muscle group, for example a group of muscles associated with frowning, smiling, being puzzled, or being surprised. The control system 106 may be configured to analyse the received image data to determine a contextual meaning associated with the spoken command by identifying one or more Facial Action Units of the user.
The contextual meaning may comprise one or more user emotions. The control system 106 may be configured to process the received image data to identify the one or more user emotions of the user associated with the spoken command. For example, the user may be happy, sad, confused, angry, or cynical. The control system 106 may be configured to identify the one or more user emotions by identifying one or more Facial Action Units of the user as described above.
The control system 106 is also configured to determine a supplemented spoken command in dependence on the identified spoken command and the determined contextual meaning. From the examples above, if the user's face, as captured by the image capture means 102, indicates that the user is speaking sarcastically, then from the spoken content of "open the driver window", the supplemented spoken content may be "do not open the driver window". In other words, the control system 106 may be used to detect a user's emotions and/or expressions in relation to the user's speech, which provides context/depth to better understand the user's intent when speaking. It may be used to infer false-negative information in voice-recognised speech, such as a sarcastic response to information. The user's detected expressions may act as arbiters for the voice interaction algorithm which determines what the user is saying.
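A minimal sketch of this supplementation step is given below, assuming a simple rule that suppresses and flags the command when the detected context is insincere; the names and the negation rule are illustrative assumptions, not the claimed logic.

```python
# A minimal sketch of the supplementation step, assuming a simple rule that
# suppresses (and flags) the command when the detected context is insincere.
# The names and the negation rule are illustrative, not the claimed logic.
from dataclasses import dataclass

@dataclass
class SupplementedCommand:
    text: str      # command text after applying the facial context
    act: bool      # whether a vehicle system should act on the command
    context: str   # contextual meaning derived from the image data

def supplement_command(spoken: str, context: str) -> SupplementedCommand:
    """Combine the recognised spoken command with the facial context."""
    insincere = context in {"sarcasm", "disbelief", "insincerity"}
    if insincere:
        # e.g. a sarcastic "open the driver window" becomes a negative
        # control action: do not open the window, and tell the user why.
        return SupplementedCommand(f"do not {spoken}", act=False, context=context)
    return SupplementedCommand(spoken, act=True, context=context)

print(supplement_command("open the driver window", "sarcasm"))
```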
The control system 106 outputs an indication of the determined supplemented spoken command to one or more vehicle systems 108 for controlling the one or more vehicle systems in accordance with the supplemented spoken command. The indication of the determined supplemented spoken content may be output as a control signal provided to a vehicle system 108 in communication with the control system. The control signal may be configured to control the operation of the vehicle system 108.
In general, the indication of the determined supplemented spoken content may be output as: an audio signal provided to an audio apparatus in communication with the control system 106 (e.g. a beep, tone, or spoken message); a visual signal provided to a display apparatus in communication with the control system (e.g. an indicator light on a dashboard or other part of the vehicle, or a graphical display (words and/or images) on a display screen); and/or a tactile signal provided to a tactile apparatus in communication with the control system 106 (e.g. a vibration or change of shape of a tactile indicator). In some examples, an audio, visual and/or tactile output may accompany a negative control action. For example, if a user says "open the driver window" with a sarcastic expression, the driver window may not be opened, and an audio signal may recite "I don't think you meant that - window not opened".
From the example above, no active control signal needs to be provided to the driver window to change its position. Thus in this example, the control signal may be provided to an output system which can give an indication that the driver window is not going to be opened, such as a red indicator light on a dashboard display. Leaving a vehicle system setting unchanged, even though the spoken command related to controlling that vehicle system (e.g. the driver window), may be labelled a negative control action.
In other examples a positive control action may be taken, for example controlling the steering and/or braking of the vehicle to change the latitudinal and/or longitudinal positioning of the vehicle if the user's command is to change the vehicle positioning.
In some examples, the control system may be configured to determine the supplemented spoken content of the received voice signal using image data received during receipt of the audio data. Such a system may be termed a "real-time" system which analyses the user's facial expressions on-the-fly. Such a system may consider the user's facial expression at the time of the user speaking the command. In some examples, the control system 106 may be configured to determine the supplemented spoken content of the received voice signal using image data received within a predetermined period before receipt of the audio data, and/or within a predetermined period after receipt of the audio data. For example, for a period (e.g. five seconds, 30 seconds) before and/or after the user speaks a command, images of the user's face may be analysed to determine the context applicable to the spoken command. For example, if the user sees a traffic jam ahead and rolls his eyes, then five seconds later asks "what is the speed limit on this road?", the control system 106 may analyse the user's facial expression of eye-rolling which took place five seconds before the user asked the question, and associate this "insincere" expression with the spoken command. Thus, rather than reciting the speed limit to the user, which is of no help since the road is blocked with traffic, the control system 106 may provide an output signal to an audio system, such as "do you want me to tell you when the traffic has cleared?" or "don't worry about that now because of the traffic jam".
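A rough sketch of how image frames might be selected around the spoken command is shown below; it assumes timestamped frames sharing a clock with the audio, and the window lengths are only example values rather than fixed parameters of the system.

```python
# Illustrative sketch of selecting which image frames to analyse, assuming
# each frame is a (timestamp, image) pair on the same clock as the audio.
# The window lengths are example values, not fixed parameters of the system.

def frames_for_command(frames, audio_start, audio_end,
                       pre_window=5.0, post_window=5.0):
    """Return frames captured shortly before, during or after the command."""
    lo, hi = audio_start - pre_window, audio_end + post_window
    return [image for (timestamp, image) in frames if lo <= timestamp <= hi]

# Example: frames at 1 s intervals, command spoken between t=7 s and t=9 s.
frames = [(t, f"frame_{t}") for t in range(0, 20)]
print(frames_for_command(frames, audio_start=7.0, audio_end=9.0))
```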
In some examples as shown in Figure 1, the system 100 may include one or more vehicle systems 108 configured to receive the indication of the output supplemented spoken content.
In some examples, the system 100 may be configured to communicate with one or more separate vehicle systems 108.
Such a vehicle system 108 may comprise one or more of: a vehicle system output apparatus configured to be controlled by the indication of the output supplemented spoken content; a display apparatus configured to display content based on the indication of the output supplemented spoken content; and an audio apparatus configured to output audio content based on the indication of the output supplemented spoken content; and a tactile apparatus configured to provide a tactile output based on the indication of the output supplemented spoken content.
Figure 2 shows an illustration of a control system 106 according to an embodiment of the invention. The control system 106 of Figure 2 comprises a controller 120, which comprises at least one electronic processor 109 having at least one electrical input 112 for receiving the image data and the audio data, and at least one memory device 110 coupled to the at least one electronic processor 109 and having instructions stored therein. The at least one electronic processor 109 is configured to access the at least one memory device 110 and execute the instructions stored therein so as to control operation of one or more systems of a vehicle. The controller 120 further includes an electrical output 114 for outputting a control signal to one or more vehicle systems as described herein. In some examples the control system 106 may comprise a plurality of controllers 120.
The electrical input(s) 112 and output(s) 114 of the controller 120 may be provided to/from a communication bus or network of the vehicle, such as a CANBus or other communication network which may, for example, be implemented by an Internet Protocol (IP) based network such as Ethernet, or FlexRay.
Figure 3 illustrates a user providing speech input 212 to a voice recognition system. The user's facial expressions are captured using a user-facing camera 202. In the example of the user being a vehicle driver, the camera 202 may be a dashboard-mounted camera behind the steering wheel. In this example, the user is asking the question "how far away is this place?" but it may be determined that the user is being insincere. For example, the user may feel that the target destination is further away than expected and so feel angry. Accordingly the user's question may be rhetorical. The control system 106 of the present invention may determine that the facial expression of the user is insincere, through recognition of the user's furrowed eyebrows 204, glaring eyes 206, high top lip 208 and protruding bottom lip 210. These different facial features may be determined by the control system 106, for example using FACS, and determined to fall into the categories of the emotions "anger" and "frustration".
In this example, the control system 106 is configured to identify that the supplemented spoken command may have a different meaning to the identified spoken command in dependence on the one or more user emotions comprising an insincere emotion. The spoken command is the question "how far away is this place?". The user's facial expression is insincere, thus the user does not really wish to know the answer to his question. The supplemented spoken command may therefore be "I am annoyed that it seems to be taking a long time to reach my destination". The control system 106 then is configured to control output of the indication of determined supplemented spoken command in dependence on the insincere emotion.
Thus the control system 106 will not respond with the distance or time to the destination, but may provide a more relevant response, taking into account the user's emotions, such as "I am sorry it seems to be taking a long time. Do you want to play some music until we arrive?". The control system 106 may not calculate the distance or time to the destination in response to the question, because this is not what the user wishes to know, as determined from the user's facial expression. A further indication may be provided to the user to indicate that the literal response to his spoken command has not been provided, such as a warning light or tactile feedback.
In some examples, the control system 106 may be configured to identify that the supplemented spoken command has the same meaning as the spoken command in dependence on the one or more user emotions comprising a sincere emotion; and control output of the indication of the determined supplemented spoken command in dependence thereon. For example, the user may say "I am hungry, where are the nearest services?" The user's expression may be sincere (e.g. the user's facial expression is calm, serious or relaxed) and/or may match the spoken command (e.g. the user's facial expression indicates hunger and his spoken question relates to being hungry). The user's facial expression is recorded using the image capture means 102 and determined by the control system 106 to show sincerity. As a result, the control system 106 acts upon the spoken command, and the determined user's expression while speaking has provided an increased confidence that the user really wishes to know the answer to the question. For example, an output signal may be provided to a navigation system to re-route the planned route to the nearest services, and an audio response may be provided to the user such as "services are two kilometres ahead as shown on your map". In this way, a user's spoken command may be acted upon with increased confidence compared with no user facial input being considered.
In some embodiments, the control system 106 may be configured to: identify a user mouth shape associated with a speech sound from the received image data representative of at least one image of at least a portion of a user's face; identify, based on the identified user mouth shape, one or more speech sounds provided by the user when providing the voice command; and analyse the received audio data, in dependence on the one or more identified speech sounds, to identify the spoken command from the user. For example, by identifying a user mouth shape associated with the sound "ooh", the control system may determine that the user is likely to have made an "ooh" sound when speaking, and this determination may be used to increase the confidence in converting the user's speech to text accurately, by checking that the voice-recognised text includes an "ooh" phoneme in the expected place in the spoken command. The control system 106 may be thought of as using images of the user's mouth to "lip-read" the user and increase the confidence with which the user's voice command has been understood and converted to text.
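The following sketch illustrates one way such a lip-reading cross-check could work, assuming a purely illustrative mapping from visually detected mouth shapes to the speech sounds they typically produce; the recognised command is trusted more when the sounds expected from the mouth shapes also appear in the recognised phoneme sequence.

```python
# A rough sketch of the lip-reading cross-check: mouth shapes seen in the
# image data are mapped to the speech sounds they typically produce, and the
# recognised text is trusted more when those sounds appear in the recognised
# phoneme sequence. The viseme-to-phoneme mapping is invented for illustration.

VISEME_TO_PHONEMES = {
    "rounded_lips": {"ooh", "w"},
    "open_jaw": {"ah", "aa"},
    "closed_lips": {"m", "b", "p"},
}

def visual_agreement(detected_visemes, recognised_phonemes) -> float:
    """Fraction of visually expected sounds also present in the ASR output."""
    if not detected_visemes:
        return 1.0  # nothing to cross-check against
    expected = set().union(*(VISEME_TO_PHONEMES.get(v, set()) for v in detected_visemes))
    matched = expected & set(recognised_phonemes)
    return len(matched) / max(len(expected), 1)

# A higher score gives more confidence that the command was recognised correctly.
print(visual_agreement(["rounded_lips"], ["h", "ooh", "z"]))  # 0.5
```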
If a low confidence level is determined, then in some examples the control system 106 may perform further operations to better determine what the user has said. For example, if the user's mouth images are not determined to represent sounds which match the sounds determined from the audio input from the user's voice, the control system 106 may provide audio feedback reciting "please repeat that for me" or providing a series of options which the control system has determined may apply to the user's spoken input.
Figure 4 illustrates a method 300 according to an embodiment of the invention. The method 300 is for controlling operation of one or more systems of a vehicle. The method 300 comprises receiving audio data from an audio input means within the vehicle 302, receiving image data representative of at least one image of at least a portion of a user's face 304, analysing the received audio data to identify a spoken command from the user 306, analysing the received image data to determine a contextual meaning associated with the spoken command 308, determining a supplemented spoken command in dependence on the identified spoken command and the determined contextual meaning 310, and outputting an indication of the determined supplemented spoken command to one or more vehicle systems for controlling the one or more vehicle systems in accordance with the supplemented spoken command 312.
In some embodiments of the method 300, analysing the received image data to determine a contextual meaning associated with the spoken command 308 comprises identifying one or more Facial Action Units of the user. In some embodiments the contextual meaning comprises one or more user emotions, wherein analysing the received image data to determine a contextual meaning 308 comprises processing the received image data to identify the one or more user emotions of the user associated with the spoken command. For example, identifying the one or more user emotions comprises identifying one or more Facial Action Units of the user.
In some embodiments the method 300 comprises identifying that the supplemented spoken command may have a different meaning to the identified spoken command in dependence on the one or more user emotions comprising an insincere emotion; and controlling output of the indication of determined supplemented spoken command in dependence thereon. For example, if the user is determined to have a sarcastic expression from the received facial image data, then the user's spoken command may not be acted on directly (e.g. a question may not be answered). In some embodiments the method comprises identifying that the supplemented spoken command has the same meaning as the spoken command in dependence on the one or more user emotions comprising a sincere emotion; and controlling output of the indication of the determined supplemented spoken command in dependence thereon. For example, if a user has a sincere expression, and says "I need to go home now" then this command may be considered sincere, and the satellite navigation vehicle system may re-calculate a route from the current planned route to the user's home.
Figure 5 illustrates a vehicle 500 according to an embodiment of the invention. The vehicle 500 is a wheeled vehicle. The vehicle 500 may comprise a control system or a system as described above. In some embodiments the vehicle 500 may be arranged to perform a method according to an embodiment, such as that illustrated in Figure 4.
Embodiments as described above may allow for better accuracy and execution of occupant-initiated commands. The received image data may be used to improve the accuracy of understanding a user's spoken command, for example by having an increased confidence level if a user appears sincere, and a reduced confidence level if a user appears insincere. The confidence level for a received supplemented spoken command may, in some embodiments, at least partly determine the action taken as a result of the voice command being received. For example, a received voice command with a low associated confidence (e.g. below a predetermined threshold confidence) may require a secondary input (such as a voice confirmation) before the command is acted upon, whereas a received voice command with a high associated confidence (e.g. above a predetermined threshold confidence) may be acted upon without requiring any secondary confirmation.
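A minimal sketch of this confidence-gated behaviour is shown below; the threshold value, function names and interfaces are assumptions for illustration, not the described system.

```python
# A minimal sketch of confidence-gated command handling; the threshold value,
# function names and interfaces are assumptions for illustration only.

CONFIRM_THRESHOLD = 0.75  # hypothetical predetermined confidence threshold

def dispatch(command, confidence, vehicle_system, hmi):
    """Act directly on high-confidence commands; ask for confirmation otherwise."""
    if confidence >= CONFIRM_THRESHOLD:
        vehicle_system.execute(command)  # act without secondary confirmation
    else:
        # Low confidence: request a secondary input (e.g. a voice confirmation)
        # via the human-machine interface before acting on the command.
        hmi.ask(f"Did you mean: '{command}'?")
```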
Embodiments as described above may reduce the workload of a user compared with a control system which receives, analyses and acts upon spoken commands without any facial image data supplement. For example, by providing more accurate recognition of a user's intention when reciting a command, the user may be happier to use voice interaction to provide input to a vehicle, having increased confidence that their spoken commands will be more accurately acted upon. Improved voice interaction may allow the user to drive with fewer distractions, and lessen the workload for the ever-mounting number of tasks that come with increased vehicle functionality.
Embodiments as described above may allow for the user to converse with a vehicle. For example, by receiving and analysing both a user's voice input and a user's facial expression, more meaningful and/or natural responses may be provided to a user than if only the user's voice input is considered without the user's facial expression. Such a system may be used as a conversational tool in a vehicle, and may be used to engage the user, for example to help keep them alert on a long journey or during night-driving.
Embodiments as described above may allow for improved understanding of a user's particular way of speaking (e.g. the phrases and words that person commonly uses, their idiolect and/or regional or learned accent). For example, by receiving and analysing both a user's voice input and a user's facial expression, the user's intention when speaking a command may be better understood and thus more accurately acted upon according to the user's actual wishes, rather than if only the user's voice input were considered without the user's facial expression. Such a system may "anthropomorphise" an autonomous vehicle, which may in turn build the user's empathy and trust with the vehicle.
It will be appreciated that the terms "user" and "vehicle occupant" may be used interchangeably to mean the person interacting with the control system and systems described herein.
It will also be appreciated that embodiments of the present invention can be realised in the form of hardware, software or a combination of hardware and software. Any such software may be stored in the form of volatile or non-volatile storage such as, for example, a storage device like a ROM, whether erasable or rewritable or not, or in the form of memory such as, for example, RAM, memory chips, device or integrated circuits or on an optically or magnetically readable medium such as, for example, a CD, DVD, magnetic disk or magnetic tape. It will be appreciated that the storage devices and storage media are embodiments of machine-readable storage that are suitable for storing a program or programs that, when executed, implement embodiments of the present invention. Accordingly, embodiments provide a computer program comprising code for implementing a system or method as claimed, and a machine-readable storage storing such a program (e.g. a non-transitory computer readable medium). Still further, embodiments of the present invention may be conveyed electronically via any medium such as a communication signal carried over a wired or wireless connection and embodiments suitably encompass the same.
All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.
Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
The invention is not restricted to the details of any foregoing embodiments. The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed. The claims should not be construed to cover merely the foregoing embodiments, but also any embodiments which fall within the scope of the claims.

Claims (25)

  1. A control system for controlling operation of one or more systems of a vehicle, the control system comprising one or more controllers, configured to: receive audio data from an audio input means within the vehicle; receive image data representative of at least one image of at least a portion of a user's face; analyse the received audio data to identify a spoken command from the user; analyse the received image data to determine a contextual meaning associated with the spoken command; determine a supplemented spoken command in dependence on the identified spoken command and the determined contextual meaning; and output an indication of the determined supplemented spoken command to one or more vehicle systems for controlling the one or more vehicle systems in accordance with the supplemented spoken command.
  2. A control system of claim 1, wherein the one or more controllers collectively comprise: at least one electronic processor having an electrical input for receiving the image data; and at least one memory device coupled to the at least one electronic processor and having instructions stored therein; wherein the at least one electronic processor is configured to access the at least one memory device and execute the instructions stored therein so as to output the indication of the determined supplemented spoken command.
  3. The control system of any preceding claim, configured to analyse the received image data to determine a contextual meaning associated with the spoken command by identifying one or more Facial Action Units of the user.
  4. The control system of any preceding claim, configured to use a Facial Action Coding System, FACS, to identify one or more Facial Action Units.
  5. The control system of any preceding claim, wherein the contextual meaning comprises one or more user emotions, and the control system is configured to process the received image data to identify the one or more user emotions of the user associated with the spoken command.
  6. The control system of claim 5, configured to identify the one or more user emotions by identifying one or more Facial Action Units of the user.
  7. The control system of claim 5 or claim 6, configured to determine whether the one or more user emotions comprise an insincere emotion or a sincere emotion.
  8. The control system of claim 7, configured to: identify that the supplemented spoken command may have a different meaning to the identified spoken command in dependence on the one or more user emotions comprising an insincere emotion; and control output of the indication of determined supplemented spoken command in dependence thereon.
  9. The control system of claim 7, configured to: identify that the supplemented spoken command has the same meaning as the spoken command in dependence on the one or more user emotions comprising a sincere emotion; and control output of the indication of the determined supplemented spoken command in dependence thereon.
  10. The control system of any preceding claim, configured to: identify a user mouth shape associated with a speech sound from the received image data representative of at least one image of at least a portion of a user's face; identify, based on the identified user mouth shape, one or more speech sounds provided by the user when providing the voice command; and analyse the received audio data, in dependence on the one or more identified speech sounds, to identify the spoken command from the user.
  11. The control system of any preceding claim, configured to determine the supplemented spoken content of the received voice signal using image data received one or more of: during receipt of the audio data; within a predetermined period before receipt of the audio data; and within a predetermined period after receipt of the audio data.
  12. The control system of any preceding claim, wherein the indication of the determined supplemented spoken content is output as a control signal provided to a vehicle system in communication with the control system, the control signal configured to control the operation of the vehicle system.
  13. The control system of any preceding claim, wherein the indication of the determined supplemented spoken content is output as one or more of: an audio signal provided to an audio apparatus in communication with the control system; a visual signal provided to a display apparatus in communication with the control system; and a tactile signal provided to a tactile apparatus in communication with the control system.
  14. A system for controlling operation of one or more systems of a vehicle, comprising: the control system of any preceding claim; and one or more imaging devices configured to provide the image data to the control system; and one or more audio input devices configured to provide the audio data to the control system.
  15. The system of claim 14, comprising a vehicle system configured to receive the indication of the output supplemented spoken content.
  16. The system of claim 15, wherein the vehicle system comprises one or more of: a vehicle system output apparatus configured to be controlled by the indication of the output supplemented spoken content; a display apparatus configured to display content based on the indication of the output supplemented spoken content; and an audio apparatus configured to output audio content based on the indication of the output supplemented spoken content; and a tactile apparatus configured to provide a tactile output based on the indication of the output supplemented spoken content.
  17. A vehicle comprising the controller of any of claims 1 to 13, or the system of any of claims 14 to 16.
  18. A method of controlling operation of one or more systems of a vehicle, the method comprising: receiving audio data from an audio input means within the vehicle; receiving image data representative of at least one image of at least a portion of a user's face; analysing the received audio data to identify a spoken command from the user; analysing the received image data to determine a contextual meaning associated with the spoken command; determining a supplemented spoken command in dependence on the identified spoken command and the determined contextual meaning; and outputting an indication of the determined supplemented spoken command to one or more vehicle systems for controlling the one or more vehicle systems in accordance with the supplemented spoken command.
  19. The method of claim 18, wherein analysing the received image data to determine a contextual meaning associated with the spoken command comprises identifying one or more Facial Action Units of the user.
  20. The method of claim 18 or claim 19, wherein the contextual meaning comprises one or more user emotions, and wherein analysing the received image data to determine a contextual meaning comprises processing the received image data to identify the one or more user emotions of the user associated with the spoken command.
  21. The method of claim 20, wherein identifying the one or more user emotions comprises identifying one or more Facial Action Units of the user.
  22. The method of any of claims 18 to 21, comprising identifying that the supplemented spoken command may have a different meaning to the identified spoken command in dependence on the one or more user emotions comprising an insincere emotion; and controlling output of the indication of determined supplemented spoken command in dependence thereon.
  23. The method of any of claims 18 to 22, comprising identifying that the supplemented spoken command has the same meaning as the spoken command in dependence on the one or more user emotions comprising a sincere emotion; and controlling output of the indication of the determined supplemented spoken command in dependence thereon.
  24. Computer software which, when executed, is arranged to perform a method according to any of claims 18 to 23.
  25. The computer software of claim 24 stored on a computer-readable medium.
GB1818151.1A 2018-11-07 2018-11-07 Apparatus and method for controlling vehicle system operation Active GB2578766B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1818151.1A GB2578766B (en) 2018-11-07 2018-11-07 Apparatus and method for controlling vehicle system operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1818151.1A GB2578766B (en) 2018-11-07 2018-11-07 Apparatus and method for controlling vehicle system operation

Publications (3)

Publication Number Publication Date
GB201818151D0 GB201818151D0 (en) 2018-12-19
GB2578766A true GB2578766A (en) 2020-05-27
GB2578766B GB2578766B (en) 2021-03-03

Family

ID=64655536

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1818151.1A Active GB2578766B (en) 2018-11-07 2018-11-07 Apparatus and method for controlling vehicle system operation

Country Status (1)

Country Link
GB (1) GB2578766B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111583918A (en) * 2019-02-19 2020-08-25 上海博泰悦臻电子设备制造有限公司 Voice control method, vehicle-mounted terminal and vehicle
CN113347243A (en) * 2021-05-31 2021-09-03 重庆工程职业技术学院 Driving information recording system and method based on block chain
CN114043987A (en) * 2021-10-13 2022-02-15 集度科技有限公司 Instruction processing method, device, terminal and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170011258A1 (en) * 2010-06-07 2017-01-12 Affectiva, Inc. Image analysis in support of robotic manipulation
US20180158450A1 (en) * 2016-12-01 2018-06-07 Olympus Corporation Speech recognition apparatus and speech recognition method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170011258A1 (en) * 2010-06-07 2017-01-12 Affectiva, Inc. Image analysis in support of robotic manipulation
US20180158450A1 (en) * 2016-12-01 2018-06-07 Olympus Corporation Speech recognition apparatus and speech recognition method

Also Published As

Publication number Publication date
GB201818151D0 (en) 2018-12-19
GB2578766B (en) 2021-03-03

Similar Documents

Publication Publication Date Title
US11282522B2 (en) Artificial intelligence apparatus and method for recognizing speech of user
KR102281515B1 (en) Artificial intelligence apparatus for recognizing speech of user using personalized language model and method for the same
KR101423258B1 (en) Method for supplying consulting communication and apparatus using the method
US11176948B2 (en) Agent device, agent presentation method, and storage medium
KR102281600B1 (en) An artificial intelligence apparatus for compensating of speech synthesis and method for the same
GB2578766A (en) Apparatus and method for controlling vehicle system operation
US20200114925A1 (en) Interaction device, interaction method, and program
JP6964044B2 (en) Learning device, learning method, program, trained model and lip reading device
US11355101B2 (en) Artificial intelligence apparatus for training acoustic model
US20200108720A1 (en) Agent system, agent control method, and storage medium
CN106503786B (en) Multi-modal interaction method and device for intelligent robot
KR20190113693A (en) Artificial intelligence apparatus and method for recognizing speech of user in consideration of word usage frequency
US20200111489A1 (en) Agent device, agent presenting method, and storage medium
US20210407517A1 (en) Artificial intelligence robot for providing voice recognition function and method of operating the same
KR102458343B1 (en) Device and method for transreceiving audio data
KR20210066328A (en) An artificial intelligence apparatus for learning natural language understanding models
JP2020060861A (en) Agent system, agent method, and program
JPH09269889A (en) Interactive device
EP4131130A1 (en) Method and device for providing interpretation situation information
KR102147835B1 (en) Apparatus for determining speech properties and motion properties of interactive robot and method thereof
JP2020060623A (en) Agent system, agent method, and program
KR102460553B1 (en) Method, device and computer program for providing sign language response using neural network in avehicle
US20230395078A1 (en) Emotion-aware voice assistant
US20230419965A1 (en) Emotion detection in barge-in analysis
US20230290342A1 (en) Dialogue system and control method thereof