WO2024071006A1 - Information processing device, information processing method, and program - Google Patents
- Publication number
- WO2024071006A1 (PCT/JP2023/034642)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- person
- captured image
- scene graph
- input instruction
- information processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C11/00—Photogrammetry or videogrammetry, e.g. stereogrammetry; Photographic surveying
- G01C11/04—Interpretation of pictures
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/26—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
- G01C21/34—Route searching; Route guidance
- G01C21/36—Input/output arrangements for on-board computers
- G01C21/3626—Details of the output of route guidance instructions
- G01C21/3647—Guidance involving output of stored or live camera images or video streams
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/29—Graphical models, e.g. Bayesian networks
- G06F18/295—Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
- G06F3/013—Eye tracking input arrangements
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/84—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using probabilistic graphical models from image or video features, e.g. Markov models or Bayesian networks
- G06V10/85—Markov-related models; Markov random fields
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/09—Arrangements for giving variable traffic instructions
- G08G1/0962—Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
- G08G1/0968—Systems involving transmission of navigation instructions to the vehicle
- G08G1/096833—Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route
- G08G1/096838—Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route where the user preferences are taken into account or the user selects one route out of a plurality
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/09—Arrangements for giving variable traffic instructions
- G08G1/0962—Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
- G08G1/0968—Systems involving transmission of navigation instructions to the vehicle
- G08G1/096833—Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route
- G08G1/096844—Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route where the complete route is dynamically recomputed based on new data
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/09—Arrangements for giving variable traffic instructions
- G08G1/0962—Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
- G08G1/0968—Systems involving transmission of navigation instructions to the vehicle
- G08G1/096833—Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route
- G08G1/09685—Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route where the complete route is computed only once and not updated
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/09—Arrangements for giving variable traffic instructions
- G08G1/0962—Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
- G08G1/0968—Systems involving transmission of navigation instructions to the vehicle
- G08G1/096855—Systems involving transmission of navigation instructions to the vehicle where the output is provided in a suitable form to the driver
- G08G1/096872—Systems involving transmission of navigation instructions to the vehicle where the output is provided in a suitable form to the driver where instructions are given per voice
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/09—Arrangements for giving variable traffic instructions
- G08G1/0962—Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
- G08G1/0968—Systems involving transmission of navigation instructions to the vehicle
- G08G1/096877—Systems involving transmission of navigation instructions to the vehicle where the input to the navigation device is provided by a suitable I/O arrangement
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/09—Arrangements for giving variable traffic instructions
- G08G1/0962—Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
- G08G1/0968—Systems involving transmission of navigation instructions to the vehicle
- G08G1/096877—Systems involving transmission of navigation instructions to the vehicle where the input to the navigation device is provided by a suitable I/O arrangement
- G08G1/096888—Systems involving transmission of navigation instructions to the vehicle where the input to the navigation device is provided by a suitable I/O arrangement where input information is obtained using learning systems, e.g. history databases
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/20—Monitoring the location of vehicles belonging to a group, e.g. fleet of vehicles, countable or determined number of vehicles
- G08G1/202—Dispatching vehicles on the basis of a location, e.g. taxi dispatching
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2203/00—Indexing scheme relating to G06F3/00 - G06F3/048
- G06F2203/038—Indexing scheme relating to G06F3/038
- G06F2203/0381—Multimodal input, i.e. interface arrangements enabling the user to issue commands by simultaneous use of input devices of different nature, e.g. voice plus gesture on digitizer
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20072—Graph-based image processing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20076—Probabilistic image processing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30248—Vehicle exterior or interior
- G06T2207/30252—Vehicle exterior; Vicinity of vehicle
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/09—Arrangements for giving variable traffic instructions
- G08G1/0962—Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
- G08G1/0968—Systems involving transmission of navigation instructions to the vehicle
- G08G1/096805—Systems involving transmission of navigation instructions to the vehicle where the transmitted instructions are used to compute a route
- G08G1/096827—Systems involving transmission of navigation instructions to the vehicle where the transmitted instructions are used to compute a route where the route is computed onboard
Definitions
- The present invention relates to an information processing device, an information processing method, and a program.
- Patent Document 1 describes a technique for detecting emergency vehicles contained in image data by inputting audio data and image data into an artificial neural network such as a deep neural network (DNN).
- Non-Patent Document 1 describes a neural state machine that converts multimodal raw information obtained by sensors into conceptual representations in a common abstract space and performs inference on a graph (scene graph) in which these conceptual representations are structured. Neural state machines are superior to the technology described in Patent Document 1 in that the processing content on the graph is made transparent and does not require large amounts of data for processing. However, the technology described in Non-Patent Document 1 does not utilize modalities such as a person's gaze or gestures, and does not provide a means to resolve ambiguity that occurs in inference.
- The present invention has been made in consideration of these circumstances, and one of its objectives is to provide an information processing device, an information processing method, and a program that utilize modalities such as a person's gaze and gestures and can resolve ambiguity that arises during the inference process.
- An information processing device includes: an acquisition unit that acquires a captured image of the surroundings of a moving body captured by a camera mounted on the moving body, an input instruction sentence input by a person associated with the moving body, and gesture information related to a gesture made by the person; an extraction unit that extracts one or more instructions included in the input instruction sentence by applying a first predetermined process to the input instruction sentence; a first generation unit that generates an estimated distribution related to a position pointed to by the person by applying a second predetermined process to the gesture information; a second generation unit that generates, from the captured image, a probabilistic scene graph in which a probability is assigned to each object included in the captured image; and an identification unit that identifies the object pointed to by the person in the captured image based on the one or more instructions, the estimated distribution, and the probabilistic scene graph.
- The first predetermined process is a process of performing at least dependency parsing and attribute classification on the input instruction sentence.
- The second predetermined process is a process of generating the estimated distribution based on key points of the person included in the gesture information.
- The identification unit identifies the object by sequentially updating the probability of each object included in the probabilistic scene graph using the one or more extracted instructions.
- The second generation unit sets an initial value of the probability to be assigned to each object included in the probabilistic scene graph based on the estimated distribution.
- When multiple objects are identified by the update, the identification unit generates a question for identifying one of the multiple objects.
- In the information processing method, a computer acquires a captured image of the surroundings of a moving body captured by a camera mounted on the moving body, an input instruction sentence input by a person associated with the moving body, and gesture information related to a gesture made by the person; performs a first predetermined process on the input instruction sentence to extract one or more instructions included in the input instruction sentence; performs a second predetermined process on the gesture information to generate an estimated distribution related to the position pointed to by the person; generates, from the captured image, a probabilistic scene graph in which each object included in the captured image is assigned a probability; and identifies the object pointed to by the person in the captured image based on the one or more instructions, the estimated distribution, and the probabilistic scene graph.
- A program causes a computer to: acquire a captured image of the surroundings of a moving body captured by a camera mounted on the moving body, an input instruction sentence input by a person associated with the moving body, and gesture information related to a gesture made by the person; extract one or more instructions included in the input instruction sentence by performing a first predetermined process on the input instruction sentence; generate an estimated distribution related to the position pointed to by the person by performing a second predetermined process on the gesture information; generate, from the captured image, a probabilistic scene graph in which each object included in the captured image is assigned a probability; and identify the object pointed to by the person in the captured image based on the one or more instructions, the estimated distribution, and the probabilistic scene graph.
- Aspects (1) to (8) make it possible to utilize modalities such as a person's gaze and gestures, and to resolve ambiguities that arise during the inference process.
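The sequential probability update and the question generation described in the aspects above can be illustrated with a minimal Python sketch. It is not the method of the application: the `SceneObject` class, the multiplicative down-weighting by `mismatch_weight`, the ambiguity threshold `margin`, and the wording of the question are all assumptions made for illustration. Only unary attribute instructions are modeled; relational instructions such as "behind that truck" are left out.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str           # hypothetical identifier, e.g. "V1"
    attributes: dict    # e.g. {"category": "vending machine", "color": "red"}
    probability: float  # initial value, e.g. taken from the estimated pointing distribution


def normalize(objects):
    """Rescale the probabilities so that they sum to 1."""
    total = sum(o.probability for o in objects)
    if total == 0:
        raise ValueError("all candidates were ruled out")
    for o in objects:
        o.probability /= total


def apply_instruction(objects, key, value, mismatch_weight=0.05):
    """Down-weight every object whose attribute `key` differs from `value` (assumed rule)."""
    for o in objects:
        if o.attributes.get(key) != value:
            o.probability *= mismatch_weight
    normalize(objects)


def identify(objects, instructions, margin=0.2):
    """Apply the instructions one by one, then return (object, None) or (None, question)."""
    for key, value in instructions:
        apply_instruction(objects, key, value)
    ranked = sorted(objects, key=lambda o: o.probability, reverse=True)
    best = ranked[0]
    # Objects whose probability is within `margin` of the best one count as ambiguous.
    rivals = [o for o in ranked[1:] if best.probability - o.probability < margin]
    if rivals:
        names = ", ".join(o.name for o in [best] + rivals)
        return None, f"Which one do you mean: {names}?"
    return best, None
```

With one red and two blue vending machines among the candidates, the instructions `("category", "vending machine")` followed by `("color", "red")` leave the red machine as the single dominant candidate, whereas `("color", "blue")` with equal initial probabilities on the two blue machines returns a clarifying question instead of an object.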
- FIG. 1 is a diagram illustrating an example of the configuration of a moving object 1 and a control device 100 according to an embodiment.
- FIG. 2 is a perspective view of the moving body 1 seen from above.
- FIG. 3 is a diagram showing an example of a captured image IM captured by an external camera.
- FIG. 4 is a diagram for explaining an overview of a first predetermined process executed by an extraction unit 120.
- FIG. 5 is a diagram for explaining an overview of a second predetermined process executed by a generation unit 130.
- FIG. 6 is a diagram for explaining a probabilistic scene graph update process executed by the identification unit 140.
- FIG. 7 is a diagram for explaining a question generation process executed by the identification unit 140.
- FIG. 8 is a flowchart showing an example of a flow of processing executed by the control device 100.
- The information processing device is mounted on a moving body.
- The moving body moves both on roadways and in a predetermined area different from the roadway.
- The moving body may be called micromobility.
- An electric kick scooter is a type of micromobility.
- The predetermined area is, for example, a sidewalk.
- The predetermined area may be a part or all of a sidewalk, a bicycle lane, a public open space, etc., or may include all of a sidewalk, a bicycle lane, a public open space, etc.
- The information processing device identifies an object indicated by a person based on a captured image of the surroundings of the moving body, an input instruction sentence input by a person associated with the moving body, and a gesture made by the person.
- The person associated with the moving body is described as an occupant of the moving body, but the present invention is not limited to such a configuration, and the person may be someone who inputs an instruction from outside the moving body (for example, to indicate a waiting location for the moving body after getting off).
- FIG. 1 is a diagram showing an example of the configuration of a moving body 1 and a control device 100 according to an embodiment.
- The moving body 1 is equipped with, for example, an external environment detection device 10, a moving body sensor 12, an operator 14, an internal camera 16, a positioning device 18, an HMI 20, a mode changeover switch 22, a moving mechanism 30, a drive device 40, an external notification device 50, a storage device 70, and a control device 100.
- The moving body 1 is not limited to a vehicle, and may include a small mobility device that runs alongside a walking user to carry luggage or lead a person, and may also include other moving bodies capable of autonomous movement (e.g., a walking robot).
- The external environment detection device 10 comprises various devices whose detection range is the traveling direction of the moving body 1.
- The external environment detection device 10 includes an external camera, a radar device, a LIDAR (Light Detection and Ranging) device, a sensor fusion device, etc.
- The external environment detection device 10 outputs information indicating the detection results (images, object positions, etc.) to the control device 100.
- The external environment detection device 10 outputs captured images of the surroundings of the moving body 1 captured by the external camera to the control device 100.
- The moving body sensor 12 includes, for example, a speed sensor, an acceleration sensor, a yaw rate (angular velocity) sensor, a direction sensor, and an operation amount detection sensor attached to the operator 14.
- The operator 14 includes, for example, an operator for instructing acceleration/deceleration (for example, an accelerator pedal or a brake pedal) and an operator for instructing steering (for example, a steering wheel).
- The moving body sensor 12 may include an accelerator opening sensor, a brake depression amount sensor, a steering torque sensor, etc.
- The moving body 1 may also be provided with an operator 14 of a type other than those described above (for example, a non-annular rotary operator, a joystick, or a button).
- The internal camera 16 captures an image of at least the head of an occupant of the moving body 1 from the front.
- The internal camera 16 is a digital camera that uses an imaging element such as a CCD (Charge Coupled Device) or a CMOS (Complementary Metal Oxide Semiconductor).
- The internal camera 16 outputs the captured image to the control device 100.
- The positioning device 18 is a device that measures the position of the moving body 1.
- The positioning device 18 is, for example, a GNSS (Global Navigation Satellite System) receiver, and identifies the position of the moving body 1 based on signals received from GNSS satellites and outputs it as position information.
- The position information of the moving body 1 may instead be estimated from the position of a Wi-Fi base station to which a communication device (described later) is connected.
- The HMI 20 includes a display device, a speaker, a touch panel, keys, etc.
- The occupant of the moving body 1 sets the destination of the moving body 1, for example, via the HMI 20, and the control unit 150 described later drives the moving body 1 to the set destination.
- The HMI 20 includes a voice input device such as a microphone, and the occupant of the moving body 1 inputs an instruction sentence by speaking, to the voice input device, an instruction indicating where the moving body 1 should stop.
- The HMI 20 analyzes the voice of the input instruction sentence, converts it to text, and outputs the text to the control device 100.
- The HMI 20 may also accept an instruction sentence input as text by the occupant, for example, via the touch panel, and output the accepted instruction sentence to the control device 100.
- The mode changeover switch 22 is a switch operated by the occupant.
- The mode changeover switch 22 may be a mechanical switch or a GUI (Graphical User Interface) switch set on the touch panel of the HMI 20.
- The mode changeover switch 22 accepts an operation to switch the driving mode to, for example, one of the following modes. Mode A: an assist mode in which one of steering and acceleration/deceleration is operated by the occupant and the other is controlled automatically (Mode A may be divided into Mode A-1, in which the occupant steers and acceleration/deceleration is controlled automatically, and Mode A-2, in which the occupant operates acceleration/deceleration and steering is controlled automatically). Mode B: a manual driving mode in which both steering and acceleration/deceleration are operated by the occupant. Mode C: an automatic driving mode in which both steering and acceleration/deceleration are controlled automatically.
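The driving modes listed above can be summarized as a small lookup table. The following Python sketch is only an illustration; the names `DrivingMode`, `automated_controls`, and `occupant_controls`, and the labels "steering" and "speed", are hypothetical and do not come from the application.

```python
from enum import Enum


class DrivingMode(Enum):
    ASSIST_A1 = "A-1"   # occupant steers; acceleration/deceleration is automatic
    ASSIST_A2 = "A-2"   # occupant accelerates/decelerates; steering is automatic
    MANUAL = "B"        # occupant performs both operations
    AUTOMATIC = "C"     # both operations are controlled automatically


ALL_CONTROLS = frozenset({"steering", "speed"})

# Which controls are performed automatically in each mode.
_AUTOMATED = {
    DrivingMode.ASSIST_A1: frozenset({"speed"}),
    DrivingMode.ASSIST_A2: frozenset({"steering"}),
    DrivingMode.MANUAL: frozenset(),
    DrivingMode.AUTOMATIC: ALL_CONTROLS,
}


def automated_controls(mode: DrivingMode) -> frozenset:
    """Return the controls that the moving body performs automatically."""
    return _AUTOMATED[mode]


def occupant_controls(mode: DrivingMode) -> frozenset:
    """Return the controls that remain with the occupant."""
    return ALL_CONTROLS - automated_controls(mode)
```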
- The moving mechanism 30 is a mechanism for moving the moving body 1 on a road.
- The moving mechanism 30 is, for example, a group of wheels including steered wheels and drive wheels.
- The moving mechanism 30 may also be legs for multi-legged walking.
- The drive device 40 outputs a force to the moving mechanism 30 to move the moving body 1.
- The drive device 40 includes a motor that drives the drive wheels, a battery that stores the power to be supplied to the motor, and a steering device that adjusts the steering angle of the steered wheels.
- The drive device 40 may also include an internal combustion engine or a fuel cell as a driving force output means or a power generation means.
- The drive device 40 may further include a brake device that utilizes frictional force or air resistance.
- The external notification device 50 is, for example, a lamp, a display device, or a speaker provided on the outer panel of the moving body 1 for notifying the outside of the moving body 1 of information.
- The external notification device 50 operates differently depending on whether the moving body 1 is moving on a sidewalk or on a roadway.
- For example, the external notification device 50 is controlled so that the lamp emits light when the moving body 1 is moving on a sidewalk and does not emit light when the moving body 1 is moving on a roadway. The light color of this lamp is preferably a color specified by law.
- The external notification device 50 may be controlled so that the lamp emits green light when the moving body 1 is moving on a sidewalk and emits blue light when the moving body 1 is moving on a roadway. If the external notification device 50 is a display device, it displays the message "traveling on the sidewalk" in text or graphics when the moving body 1 is traveling on the sidewalk.
- FW is the steered wheel
- RW is the drive wheel
- SD is the steering device
- MT is the motor
- BT is the battery.
- The steering device SD, the motor MT, and the battery BT are included in the drive device 40.
- AP is the accelerator pedal
- BP is the brake pedal
- WH is the steering wheel
- SP is the speaker
- MC is the microphone.
- The moving body 1 shown in the figure is a one-seater moving body, and an occupant P is seated in the driver's seat DS and fastened with a seat belt SB.
- Arrow D1 is the traveling direction (velocity vector) of the moving body 1.
- The external environment detection device 10 is provided near the front end of the moving body 1, the internal camera 16 is provided in a position where it can capture an image of the head of the occupant P from in front of the occupant P, and the mode changeover switch 22 is provided in the boss part of the steering wheel WH.
- An external notification device 50 as a display device is provided near the front end of the moving body 1.
- the storage device 70 is a non-transitory storage device such as a hard disk drive (HDD), flash memory, or random access memory (RAM). Navigation map information 72 and the like are stored in the storage device 70. Although the storage device 70 is shown outside the frame of the control device 100 in the figure, the storage device 70 may be included in the control device 100. The storage device 70 may also be provided on a server (not shown).
- Navigation map information 72 is stored in advance in storage device 70, and is map information that includes, for example, information on the center of roads, including roadways and sidewalks, or information on road boundaries. Navigation map information 72 further includes information (such as names, addresses, and areas) on facilities and buildings adjacent to road boundaries.
- The control device 100 includes, for example, an acquisition unit 110, an extraction unit 120, a generation unit 130, an identification unit 140, and a control unit 150.
- The acquisition unit 110, the extraction unit 120, the generation unit 130, the identification unit 140, and the control unit 150 are realized by, for example, a hardware processor such as a CPU (Central Processing Unit) executing a program (software) 74.
- Some or all of these components may be realized by hardware (including circuitry) such as an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a GPU (Graphics Processing Unit), or may be realized by cooperation between software and hardware.
- The program may be stored in the storage device 70 in advance, or may be stored in a removable storage medium (non-transitory storage medium) such as a DVD or CD-ROM and installed in the storage device 70 by mounting the storage medium in a drive device.
- The combination of the acquisition unit 110, the extraction unit 120, the generation unit 130, and the identification unit 140 is an example of an "information processing device."
- The acquisition unit 110 acquires a captured image IM obtained when the external camera of the external environment detection device 10 captures the surroundings of the moving body 1.
- FIG. 3 is a diagram showing an example of an image IM captured by an external camera. As an example, FIG. 3 shows a situation in which the image IM captured by the external camera includes vehicles M1 and M2, vending machines V1, V2, and V3, and a postbox P.
- Vending machine V1 is a red vending machine.
- Vending machines V2 and V3 are blue vending machines.
- the acquisition unit 110 acquires an instruction statement input by the occupant of the moving body 1 via the voice input device, which is the HMI 20, indicating the target position to be reached by the moving body 1.
- the occupant inputs an instruction statement such as "stop at the red vending machine behind that truck" to indicate that the moving body 1 should reach the target position TP in front of the vending machine V1.
- the acquisition unit 110 acquires an image representing a gesture made by the occupant captured by the internal camera 16 as gesture information.
- the acquired gesture information is used for processing by the generation unit 130, which will be described later.
- the extraction unit 120 performs a first predetermined process on the input instruction sentence to extract one or more instructions (reasoning instructions) included in the input instruction sentence.
- FIG. 4 is a diagram for explaining an outline of the first predetermined process executed by the extraction unit 120. More specifically, as the first predetermined process, the extraction unit 120 applies a dependency parser and an entity classifier to the input instruction sentence. For example, as shown in the left part of FIG.
- the extraction unit 120 analyzes, as dependency analysis, the instruction sentence "Stop at the red vending machine behind that truck” to determine that "that” is a determiner (det) that modifies “truck", "of” is a case marker (case) of "truck", "behind” is a noun modifier (nmod) that modifies "truck”, and "red” is a clause modifier (acl) that modifies "vending machine”.
- This dependency analysis may be performed using a known method.
- the extraction unit 120 classifies the attributes of each morpheme in the instruction sentence as an entity classification. For example, as shown in the right part of FIG. 4, the extraction unit 120 classifies "that" in the instruction sentence "stop at the red vending machine behind that truck” as a demonstrative, "truck” as an object, “behind” as a relation, "red” as a color, and "vending machine” as an object.
- the extraction unit 120 links and stores the dependency relationships between morphemes as a result of the dependency analysis and the attributes of each morpheme as a result of the entity classification.
- each morpheme (truck, that, behind, vending machine, red) is stored as one or more instructions (reasoning instructions) by linking its dependency relationship and attribute.
- these one or more instructions may be derived and stored using the method described in Non-Patent Document 1.
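The linking of dependency relationships and entity attributes described above can be sketched as follows. This is a minimal illustration, not the method of Non-Patent Document 1: the parse results are copied by hand from the FIG. 4 example rather than produced by a real parser, and the names `ReasoningInstruction` and `link` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ReasoningInstruction:
    morpheme: str             # surface form, e.g. "red"
    attribute: str            # entity class, e.g. "color"
    relation: Optional[str]   # dependency label, e.g. "acl" (None if not given)
    head: Optional[str]       # morpheme that this one modifies (None if not given)


# Hand-copied results for "stop at the red vending machine behind that truck".
DEPENDENCIES: Dict[str, Tuple[str, str]] = {
    "that": ("det", "truck"),
    "behind": ("nmod", "truck"),
    "red": ("acl", "vending machine"),
}
ENTITIES: Dict[str, str] = {
    "that": "demonstrative",
    "truck": "object",
    "behind": "relation",
    "red": "color",
    "vending machine": "object",
}


def link(dependencies: Dict[str, Tuple[str, str]],
         entities: Dict[str, str]) -> List[ReasoningInstruction]:
    """Join each morpheme's dependency relation with its entity attribute."""
    linked = []
    for morpheme, attribute in entities.items():
        relation, head = dependencies.get(morpheme, (None, None))
        linked.append(ReasoningInstruction(morpheme, attribute, relation, head))
    return linked


instructions = link(DEPENDENCIES, ENTITIES)
```

Keeping both results in one record lets a later stage decide, per instruction, whether to filter by class or attribute or to follow a relation.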
- the generating unit 130 generates an estimated distribution of the position indicated by the gesture of the occupant by performing a second predetermined process on the gesture information acquired by the acquiring unit 110.
- FIG. 5 is a diagram for explaining an outline of the second predetermined process executed by the generating unit 130.
- the generating unit 130 sets key points on two parts of the body of the occupant P.
- FIG. 5 shows, as an example, a situation in which the eyes and wrist of the occupant P are set as key points KP1 and KP2, respectively.
- the generating unit 130 estimates the intersection IS where the indication line L, which connects the eye KP1 and the wrist KP2 and is extended toward the wrist, intersects with the ground surface as the position indicated by the gesture of the occupant P, and generates an estimated distribution of the gesture position as a probability distribution with the intersection IS as the maximum value.
- any type of distribution, such as a normal distribution, may be assumed for the probability distribution to be generated.
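The geometry of the second predetermined process can be sketched as below. The sketch assumes, beyond what the text states, that key points are available as 3D coordinates with the z axis pointing up and the ground at z = 0, and that the estimated distribution is an isotropic normal distribution with a hypothetical spread `sigma`; the function names are illustrative only.

```python
import math
from typing import Optional, Tuple

Point3 = Tuple[float, float, float]
Point2 = Tuple[float, float]


def ground_intersection(eye: Point3, wrist: Point3) -> Optional[Point2]:
    """Extend the line from the eye (KP1) through the wrist (KP2) to z = 0."""
    ex, ey, ez = eye
    wx, wy, wz = wrist
    dz = wz - ez
    if dz >= 0:
        return None  # the line is level or points upward and never meets the ground
    t = -ez / dz     # parameter at which the height becomes zero
    return (ex + t * (wx - ex), ey + t * (wy - ey))


def gesture_density(point: Point2, intersection: Point2, sigma: float = 1.0) -> float:
    """2D normal density that peaks at the intersection IS."""
    dx = point[0] - intersection[0]
    dy = point[1] - intersection[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)


IS = ground_intersection(eye=(0.0, 0.0, 1.5), wrist=(0.5, 0.25, 1.0))
```

With these inputs the line drops 0.5 in height between eye and wrist, so it reaches the ground at three times the eye-to-wrist offset.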
- the generation unit 130 may select any key point, but it is desirable to select one of the points as the eye, since the occupant P can specify the position accurately through the line of sight. It is also desirable for the other point to be a part that is easy to identify from the image, and it may be, for example, the wrist, fingertips, the tip or center of a fist, etc. Furthermore, when the operator is indicating a destination, it is possible that the face is facing in the direction of the destination and therefore the eyes cannot be photographed by the internal camera 16. In such cases, the position of the eyes may be estimated and identified. If the direction of the face can be identified, the position of the eyes can be estimated. Note that this estimation of the eye position may also be performed using a machine learning model.
- the generation unit 130 semantically extracts objects contained in the captured image IM captured by the external camera from the captured image IM, and generates a probabilistic scene graph in which each extracted object is assigned a probability that the occupant P pointed to the object.
- the generation unit 130 extracts vehicles M1 and M2, vending machines V1, V2, and V3, and postbox P. This extraction process can reduce the load associated with subsequent processing compared to methods that process raw data, such as deep neural networks (DNNs). More specifically, the generation of the probabilistic scene graph may be performed using the method described in Non-Patent Document 1.
- the initial probability value assigned to each object included in the generated probabilistic scene graph may be uniform (i.e., 1/(the number of objects included in the probabilistic scene graph)), or the generation unit 130 may set a different initial value for each object.
- the generation unit 130 may use an estimated distribution regarding the gesture position to set a different initial value according to the position of each object. More specifically, the generation unit 130 may set a higher initial value the closer the object is to the detected intersection point IS, and set a lower initial value the farther the object is from the intersection point IS. For example, in the case of FIG.
- the generation unit 130 is an example of a "first generation unit” and a "second generation unit”.
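The two ways of setting initial probability values described above (uniform, or higher for objects nearer the intersection IS) can be sketched as follows. The ground-plane coordinates of the objects, the Gaussian weighting, and the `sigma` value are assumptions made for illustration; the text only requires that nearer objects receive higher initial values.

```python
import math
from typing import Dict, Optional, Tuple

Point2 = Tuple[float, float]


def initial_probabilities(positions: Dict[str, Point2],
                          intersection: Optional[Point2],
                          sigma: float = 5.0) -> Dict[str, float]:
    """Return a normalized initial probability for every object."""
    uniform = {name: 1.0 / len(positions) for name in positions}
    if intersection is None:
        return uniform  # no usable gesture: 1 / (number of objects)
    ix, iy = intersection
    weights = {
        name: math.exp(-((x - ix) ** 2 + (y - iy) ** 2) / (2 * sigma ** 2))
        for name, (x, y) in positions.items()
    }
    total = sum(weights.values())
    if total == 0.0:
        return uniform  # every object is too far away for the weights to register
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical object positions on the ground plane.
positions = {"V1": (2.0, 10.0), "V2": (8.0, 10.0), "P": (20.0, 10.0)}
prior = initial_probabilities(positions, intersection=(3.0, 9.0))
```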
- the identification unit 140 identifies the object indicated by the occupant by sequentially updating the probability of each object included in the probabilistic scene graph using one or more instructions extracted by the extraction unit 120.
- FIG. 6 is a diagram for explaining the update process of the probabilistic scene graph executed by the identification unit 140.
- the identification unit 140 sequentially takes each of the one or more instructions extracted by the extraction unit 120 and updates the probabilistic scene graph so that the probability of the object corresponding to that instruction becomes higher. For example, in the case of FIG. 6, the identification unit 140 first takes "truck" and "that" and updates the probabilistic scene graph so that the probability values of the vehicles M1 and M2 become higher.
- the identification unit 140 extracts "behind” and transitions from the vehicle M1 to the vending machine V1 and from the vehicle M2 to the vending machine V2 in the probabilistic scene graph.
- the identification unit 140 extracts "vending machine” and “red” to identify the vending machine V1 as a vending machine having the attribute "red”, and updates the probabilistic scene graph so that the probability value of the vending machine V1 becomes higher.
- the probabilities assigned to the objects in the probabilistic scene graph are successively updated, and the object with the highest probability value is ultimately identified as the object indicated by the occupant P.
- the identification unit 140 can identify the vending machine V1 with the highest probability value ultimately as the object indicated by the occupant P. More specifically, these updates to the probabilistic scene graph may be performed using the method described in Non-Patent Document 1.
- the identification unit 140 uses the probabilistic scene graph to identify the object with the maximum probability value as the object indicated by the occupant P.
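The sequence of updates in the FIG. 6 example can be sketched as below. This is a deliberately simplified stand-in for the method of Non-Patent Document 1, not a reproduction of it: the scene graph is a plain dictionary, the `match` weight of 0.9 is an arbitrary assumption, and the functions `attend` and `transition` are hypothetical names.

```python
from typing import Dict, List, Tuple


def normalize(p: Dict[str, float]) -> Dict[str, float]:
    total = sum(p.values())
    if total == 0:
        return {name: 1.0 / len(p) for name in p}
    return {name: value / total for name, value in p.items()}


def attend(p, nodes, key, value, match=0.9):
    """Raise the probability of nodes whose attribute `key` equals `value`."""
    weighted = {
        name: prob * (match if nodes[name].get(key) == value else 1.0 - match)
        for name, prob in p.items()
    }
    return normalize(weighted)


def transition(p, edges: List[Tuple[str, str, str]], relation: str):
    """Move probability mass along every edge labeled with `relation`."""
    moved = {name: 0.0 for name in p}
    for source, label, target in edges:
        if label == relation:
            moved[target] += p[source]
    return normalize(moved)


NODES = {
    "M1": {"class": "truck"},
    "M2": {"class": "truck"},
    "V1": {"class": "vending machine", "color": "red"},
    "V2": {"class": "vending machine", "color": "blue"},
    "V3": {"class": "vending machine", "color": "blue"},
    "P": {"class": "postbox"},
}
# Following "behind" from a truck leads to the vending machine behind it.
EDGES = [("M1", "behind", "V1"), ("M2", "behind", "V2")]

p = {name: 1.0 / len(NODES) for name in NODES}   # uniform initial values
p = attend(p, NODES, "class", "truck")            # "that truck"
p = transition(p, EDGES, "behind")                # "behind"
p = attend(p, NODES, "class", "vending machine")  # "vending machine"
p = attend(p, NODES, "color", "red")              # "red"
target = max(p, key=p.get)
```

After the transition step V1 and V2 are tied; only the final color step separates them, which is why an instruction without "red" leaves the ambiguity discussed next.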
- the identification unit 140 calculates the entropy of the probability distribution calculated for each object in the probabilistic scene graph, and if the calculated entropy is large (above a threshold), it can determine that the object indicated by the occupant P cannot be identified as a single object. In such a case, the conventional technology was unable to ultimately identify the object indicated by the occupant P.
- a question for identifying one object from the multiple candidates is generated, the question is asked of the occupant, and a response is received from the occupant, thereby ultimately identifying the object indicated by the occupant P.
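The entropy check that decides whether a question is needed can be sketched as follows. Shannon entropy in bits and the threshold of 0.8 are assumptions for illustration; the text does not state the base of the logarithm or a threshold value.

```python
import math
from typing import Dict


def entropy(p: Dict[str, float]) -> float:
    """Shannon entropy, in bits, of the probabilities over objects."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def needs_question(p: Dict[str, float], threshold: float = 0.8) -> bool:
    """True when the distribution is too flat to single out one object."""
    return entropy(p) > threshold


ambiguous = {"V1": 0.5, "V2": 0.5}   # two equally likely vending machines
resolved = {"V1": 0.9, "V2": 0.1}    # one clear winner
```

Two equally likely candidates give exactly 1 bit of entropy, so `needs_question(ambiguous)` holds while `needs_question(resolved)` does not.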
- FIG. 7 is a diagram for explaining the process of generating a question executed by the identification unit 140.
- FIG. 7 shows a case where the occupant inputs the instruction "Stop at the vending machine behind the truck" and the identification unit 140 performs the process of updating the probabilistic scene graph, resulting in vending machine V1 behind vehicle M1 and vending machine V2 behind vehicle M2 being identified as objects with the same probability value (or objects whose difference in probability value is within a threshold value).
- the identification unit 140 generates a question to identify one of the identified objects. For example, a question sentence may be generated to directly identify the multiple objects, such as "Which vending machine?", or a question sentence may be generated to indirectly identify the multiple objects, such as "Which truck?” (i.e., if one truck has been identified, the vending machine can be identified by combining it with the noun modifier "behind").
- the identification unit 140 may compare attributes (e.g., color) of multiple candidate objects and generate a question related to the attribute having different values.
- vending machine V1 has a color attribute of "red”
- vending machine V2 does not have a specific color attribute
- the identification unit 140 may generate a question such as "Is the color of this vending machine red?” based on the difference in the color attribute.
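Comparing candidate attributes and turning a difference into a question can be sketched as below. The attribute dictionaries and the sentence templates are assumptions modeled on the examples in the text; `differing_attributes` and `make_question` are hypothetical names.

```python
from typing import Dict, List, Optional

Attributes = Dict[str, Optional[str]]


def differing_attributes(candidates: Dict[str, Attributes]) -> List[str]:
    """Attribute names whose values are not identical across all candidates."""
    keys = set().union(*(attrs.keys() for attrs in candidates.values()))
    return sorted(
        key for key in keys
        if len({attrs.get(key) for attrs in candidates.values()}) > 1
    )


def make_question(candidates: Dict[str, Attributes], noun: str) -> str:
    """Ask about the first differing attribute, or fall back to a direct question."""
    for key in differing_attributes(candidates):
        for attrs in candidates.values():
            value = attrs.get(key)
            if value is not None:
                return f"Is the {key} of this {noun} {value}?"
    return f"Which {noun}?"


# V1 has the color attribute "red"; V2 has no specific color attribute.
question = make_question({"V1": {"color": "red"}, "V2": {}}, "vending machine")
```

When no attribute separates the candidates, the sketch falls back to the direct form "Which vending machine?".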
- the identification unit 140 transmits the generated question to the HMI 20, receives an answer entered by the occupant on the HMI 20, and ultimately identifies the object pointed to by the occupant P based on the received answer. For example, the identification unit 140 may also accept a gesture by the occupant P again, and ultimately identify the object closest to the direction of the accepted gesture as the object pointed to by the occupant P. In this way, even if there are multiple candidates for the pointed to object due to the update process of the probabilistic scene graph, the pointed to object can be uniquely identified by generating a question for the occupant P.
- the control unit 150 drives the drive device 40 of the moving body 1 to travel to the target position, which is the object identified by the identification unit 140.
- FIG. 8 is a flowchart showing an example of the flow of processing executed by the control device 100.
- the processing according to this flowchart is executed in response to input of an instruction sentence and a gesture by the occupant while the moving body 1 is traveling.
- the acquisition unit 110 acquires the captured image IM, the input instruction sentence, and the gesture information (step S100).
- the extraction unit 120 extracts one or more instructions from the input instruction sentence acquired by the acquisition unit 110, and the generation unit 130 generates an estimated distribution of the position indicated by the occupant P from the gesture information (step S102).
- the generation unit 130 generates a probabilistic scene graph from the captured image IM and sets the initial probability of the probabilistic scene graph based on the estimated distribution (step S104).
- the identification unit 140 updates the probability of the probabilistic scene graph based on one or more instructions (step S106).
- the identification unit 140 determines whether or not a single object has been identified as a result of updating the probabilistic scene graph (step S108). If it is determined that a single object has been identified, the control unit 150 causes the moving body 1 to travel with the identified object as the target position (step S110). On the other hand, if it is determined that a single object has not been identified, the identification unit 140 generates a question sentence for identifying the single object, makes an inquiry, and identifies the single object (step S112). Thereafter, the identification unit 140 transitions the process to step S110. This ends the process related to this flowchart.
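The branch in steps S108 to S112 can be sketched as follows. The sketch uses the "difference in probability value within a threshold" criterion mentioned for FIG. 7; the `margin` value and the `ask_occupant` callback, which stands in for the exchange over the HMI 20, are assumptions.

```python
from typing import Callable, Dict, List


def select_target(probabilities: Dict[str, float],
                  ask_occupant: Callable[[List[str]], str],
                  margin: float = 0.05) -> str:
    """Return the object to use as the target position (input to step S110)."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    best = ranked[0]
    # Objects whose probability is within `margin` of the best are candidates.
    candidates = [
        name for name in ranked
        if probabilities[best] - probabilities[name] <= margin
    ]
    if len(candidates) == 1:         # S108: a single object was identified
        return best
    return ask_occupant(candidates)  # S112: ask, then continue to S110


asked: List[List[str]] = []


def fake_occupant(candidates: List[str]) -> str:
    """Stand-in for the occupant's answer; always picks V2."""
    asked.append(candidates)
    return "V2"


clear = select_target({"V1": 0.9, "V2": 0.1}, fake_occupant)
tied = select_target({"V1": 0.5, "V2": 0.5, "V3": 0.0}, fake_occupant)
```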
- the information processing device may at least identify a pointed object based on a captured image, an input instruction sentence, and gesture information, and if there are multiple candidates for the pointed object, may generate an additional question sentence and make an inquiry, ultimately identifying a single object.
- the information processing device according to the present invention may also be used to identify an object pointed to by a user in a VR (virtual reality) space.
- a captured image, an input instruction sentence, and gesture information are acquired, and the object indicated by the occupant is identified based on a probabilistic scene graph generated from the captured image, instructions extracted from the input instruction sentence, and an estimated distribution generated from the gesture information. If there are multiple candidates for the object indicated by the occupant, a question sentence for identifying a single object is generated and put to the occupant. This makes it possible to utilize modalities such as a person's gaze and gestures, and to resolve ambiguity that arises during the inference process.
- a storage medium for storing computer-readable instructions
- a processor coupled to the storage medium
- the processor executes the computer-readable instructions to: acquire a captured image of the periphery of the moving body captured by a camera mounted on the moving body, an input instruction sentence input by an occupant of the moving body, and gesture information regarding a gesture performed by the occupant; extract one or more instructions included in the input instruction sentence by performing a first predetermined process on the input instruction sentence; generate an estimated distribution regarding a position indicated by the occupant by performing a second predetermined process on the gesture information; generate, from the captured image, a probabilistic scene graph in which a probability is assigned to each object included in the captured image; identify an object indicated by the occupant in the captured image based on the one or more instructions, the estimated distribution, and the probabilistic scene graph;
- the information processing device is configured as follows.
- 10: External environment detection device; 12: Mobile sensor; 14: Operator; 16: Internal camera; 18: Positioning device; 20: HMI; 22: Mode changeover switch; 30: Moving mechanism; 40: Driving device; 50: External notification device; 70: Storage device; 72: Navigation map information; 100: Control device; 110: Acquisition unit; 120: Extraction unit; 130: Generation unit; 140: Identification unit; 150: Control unit
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Automation & Control Theory (AREA)
- Social Psychology (AREA)
- Probability & Statistics with Applications (AREA)
- Psychiatry (AREA)
- Biophysics (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Analysis (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Mathematics (AREA)
- Algebra (AREA)
- Molecular Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Biomedical Technology (AREA)
- Image Analysis (AREA)
- User Interface Of Digital Computer (AREA)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23872231.8A EP4563941A4 (en) | 2022-09-27 | 2023-09-25 | INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING METHOD AND PROGRAM |
| CN202380058940.3A CN119731511A (zh) | 2022-09-27 | 2023-09-25 | Information processing device, information processing method, and program |
| JP2024549350A JPWO2024071006A1 (ja) | 2022-09-27 | 2023-09-25 | |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2022-153456 | 2022-09-27 | ||
| JP2022153456 | 2022-09-27 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024071006A1 (ja) | 2024-04-04 |
Family
ID=90477805
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2023/034642 Ceased WO2024071006A1 (ja) | 2022-09-27 | 2023-09-25 | Information processing device, information processing method, and program |
Country Status (4)
| Country | Link |
|---|---|
| EP (1) | EP4563941A4 (en) |
| JP (1) | JPWO2024071006A1 (ja) |
| CN (1) | CN119731511A (zh) |
| WO (1) | WO2024071006A1 (ja) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2013250747A (ja) * | 2012-05-31 | 2013-12-12 | Sharp Corp | Self-propelled electronic device |
| JP2017228080A (ja) * | 2016-06-22 | 2017-12-28 | Sony Corporation | Information processing device, information processing method, and program |
| JP2021522564A (ja) * | 2018-04-17 | 2021-08-30 | Toyota Research Institute, Inc. | System and method for detecting human gaze and gestures in an unconstrained environment |
| JP2022096601A (ja) | 2020-12-17 | 2022-06-29 | Intel Corporation | Audio-visual and cooperative recognition of vehicles |
2023
- 2023-09-25 JP JP2024549350A patent/JPWO2024071006A1/ja active Pending
- 2023-09-25 CN CN202380058940.3A patent/CN119731511A/zh active Pending
- 2023-09-25 EP EP23872231.8A patent/EP4563941A4/en active Pending
- 2023-09-25 WO PCT/JP2023/034642 patent/WO2024071006A1/ja not_active Ceased
Non-Patent Citations (2)
| Title |
|---|
| DREW A. HUDSON, CHRISTOPHER D. MANNING: "Learning by Abstraction: The Neural State Machine", NEURIPS, pages 5901 - 5914 |
| See also references of EP4563941A4 |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4563941A1 (en) | 2025-06-04 |
| CN119731511A (zh) | 2025-03-28 |
| EP4563941A4 (en) | 2025-10-29 |
| JPWO2024071006A1 (ja) | 2024-04-04 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10489686B2 (en) | Object detection for an autonomous vehicle | |
| CN111836747B (zh) | 用于车辆驾驶辅助的电子装置和方法 | |
| US12361717B2 (en) | Mobile object control device, mobile object control method, training device, training method, generation device, and storage medium | |
| JP2018190217A (ja) | 運転者監視装置、及び運転者監視方法 | |
| CN110390831A (zh) | 行进路线决定装置 | |
| WO2023230740A1 (zh) | 一种异常驾驶行为识别的方法、装置和交通工具 | |
| US20250013683A1 (en) | Context-based searching systems and methods for vehicles | |
| CN115214713B (zh) | 移动体的控制装置、移动体的控制方法及存储介质 | |
| CN117063217A (zh) | 移动体的控制装置、移动体的控制方法及存储介质 | |
| CN116710971A (zh) | 物体识别方法和飞行时间物体识别电路 | |
| EP4563941A1 (en) | Information processing apparatus, information processing method, and program | |
| JP7614261B2 (ja) | 画像認識装置、画像認識方法、およびプログラム | |
| JP7714122B2 (ja) | 移動体の制御装置、移動体の制御方法、および記憶媒体 | |
| JP7802194B2 (ja) | 情報処理装置、情報処理方法、およびプログラム | |
| JP7770224B2 (ja) | 移動体の制御装置、移動体の制御方法、および記憶媒体 | |
| US12394208B2 (en) | Mobile object control device, mobile object control method, and storage medium | |
| JP7738510B2 (ja) | 白線認識装置、移動体の制御システム、白線認識方法、およびプログラム | |
| EP4361961A1 (en) | Method of determining information related to road user | |
| JP2026068170A (ja) | 画像処理装置、画像処理方法、およびプログラム | |
| CN120548553A (zh) | 信息处理装置、信息处理方法及程序 | |
| WO2023188090A1 (ja) | 移動体の制御装置、移動体の制御方法、および記憶媒体 | |
| JP2022154109A (ja) | 移動体の制御装置、移動体の制御方法、およびプログラム | |
| CN120702461A (zh) | 一种路径推荐方法、装置以及车辆 | |
| CN118843574A (zh) | 移动体的控制装置、移动体的控制方法以及存储介质 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23872231; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 202380058940.3; Country of ref document: CN |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023872231; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2023872231; Country of ref document: EP; Effective date: 20250225 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2024549350; Country of ref document: JP |
| | WWP | Wipo information: published in national office | Ref document number: 202380058940.3; Country of ref document: CN |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWP | Wipo information: published in national office | Ref document number: 2023872231; Country of ref document: EP |