WO2024071006A1 - Information processing device, information processing method, and program - Google Patents

Information processing device, information processing method, and program

Info

Publication number
WO2024071006A1
WO2024071006A1 PCT/JP2023/034642 JP2023034642W WO2024071006A1 WO 2024071006 A1 WO2024071006 A1 WO 2024071006A1 JP 2023034642 W JP2023034642 W JP 2023034642W WO 2024071006 A1 WO2024071006 A1 WO 2024071006A1
Authority
WO
WIPO (PCT)
Prior art keywords
person
captured image
scene graph
input instruction
information processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2023/034642
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
アマン ジェイン
アニルドレッディ コンダパッレィ
健太郎 山田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honda Motor Co Ltd
Original Assignee
Honda Motor Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honda Motor Co Ltd
Priority to EP23872231.8A (EP4563941A4)
Priority to CN202380058940.3A (CN119731511A)
Priority to JP2024549350A (JPWO2024071006A1)
Publication of WO2024071006A1

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C11/00Photogrammetry or videogrammetry, e.g. stereogrammetry; Photographic surveying
    • G01C11/04Interpretation of pictures
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/34Route searching; Route guidance
    • G01C21/36Input/output arrangements for on-board computers
    • G01C21/3626Details of the output of route guidance instructions
    • G01C21/3647Guidance involving output of stored or live camera images or video streams
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • G06F18/295Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/84Arrangements for image or video recognition or understanding using pattern recognition or machine learning using probabilistic graphical models from image or video features, e.g. Markov models or Bayesian networks
    • G06V10/85Markov-related models; Markov random fields
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/09Arrangements for giving variable traffic instructions
    • G08G1/0962Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0968Systems involving transmission of navigation instructions to the vehicle
    • G08G1/096833Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route
    • G08G1/096838Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route where the user preferences are taken into account or the user selects one route out of a plurality
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/09Arrangements for giving variable traffic instructions
    • G08G1/0962Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0968Systems involving transmission of navigation instructions to the vehicle
    • G08G1/096833Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route
    • G08G1/096844Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route where the complete route is dynamically recomputed based on new data
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/09Arrangements for giving variable traffic instructions
    • G08G1/0962Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0968Systems involving transmission of navigation instructions to the vehicle
    • G08G1/096833Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route
    • G08G1/09685Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route where the complete route is computed only once and not updated
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/09Arrangements for giving variable traffic instructions
    • G08G1/0962Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0968Systems involving transmission of navigation instructions to the vehicle
    • G08G1/096855Systems involving transmission of navigation instructions to the vehicle where the output is provided in a suitable form to the driver
    • G08G1/096872Systems involving transmission of navigation instructions to the vehicle where the output is provided in a suitable form to the driver where instructions are given per voice
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/09Arrangements for giving variable traffic instructions
    • G08G1/0962Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0968Systems involving transmission of navigation instructions to the vehicle
    • G08G1/096877Systems involving transmission of navigation instructions to the vehicle where the input to the navigation device is provided by a suitable I/O arrangement
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/09Arrangements for giving variable traffic instructions
    • G08G1/0962Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0968Systems involving transmission of navigation instructions to the vehicle
    • G08G1/096877Systems involving transmission of navigation instructions to the vehicle where the input to the navigation device is provided by a suitable I/O arrangement
    • G08G1/096888Systems involving transmission of navigation instructions to the vehicle where the input to the navigation device is provided by a suitable I/O arrangement where input information is obtained using learning systems, e.g. history databases
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/20Monitoring the location of vehicles belonging to a group, e.g. fleet of vehicles, countable or determined number of vehicles
    • G08G1/202Dispatching vehicles on the basis of a location, e.g. taxi dispatching
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2203/00Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F2203/038Indexing scheme relating to G06F3/038
    • G06F2203/0381Multimodal input, i.e. interface arrangements enabling the user to issue commands by simultaneous use of input devices of different nature, e.g. voice plus gesture on digitizer
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20072Graph-based image processing
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20076Probabilistic image processing
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/09Arrangements for giving variable traffic instructions
    • G08G1/0962Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0968Systems involving transmission of navigation instructions to the vehicle
    • G08G1/096805Systems involving transmission of navigation instructions to the vehicle where the transmitted instructions are used to compute a route
    • G08G1/096827Systems involving transmission of navigation instructions to the vehicle where the transmitted instructions are used to compute a route where the route is computed onboard

Definitions

  • the present invention relates to an information processing device, an information processing method, and a program.
  • Patent Document 1 describes a technique for detecting emergency vehicles contained in image data by inputting audio data and image data into an artificial neural network such as a deep neural network (DNN).
  • Non-Patent Document 1 describes a neural state machine that converts multimodal raw information obtained by sensors into conceptual representations in a common abstract space and performs inference on a graph (scene graph) in which these conceptual representations are structured. Neural state machines are superior to the technology described in Patent Document 1 in that the processing content on the graph is made transparent and does not require large amounts of data for processing. However, the technology described in Non-Patent Document 1 does not utilize modalities such as a person's gaze or gestures, and does not provide a means to resolve ambiguity that occurs in inference.
  • The present invention has been made in consideration of these circumstances, and one of its objectives is to provide an information processing device, an information processing method, and a program that utilize modalities such as a person's gaze and gestures and can resolve ambiguity that arises during the inference process.
  • An information processing device includes: an acquisition unit that acquires a captured image of the surroundings of a moving body taken by a camera mounted on the moving body, an input instruction sentence input by a person associated with the moving body, and gesture information related to a gesture made by the person; an extraction unit that extracts one or more instructions included in the input instruction sentence by applying a first predetermined process to the input instruction sentence; a first generation unit that generates an estimated distribution related to a position pointed to by the person by applying a second predetermined process to the gesture information; a second generation unit that generates, from the captured image, a probabilistic scene graph in which a probability is assigned to each object included in the captured image; and an identification unit that identifies the object pointed to by the person in the captured image based on the one or more instructions, the estimated distribution, and the probabilistic scene graph.
  • The first predetermined process is a process of performing at least dependency parsing and attribute classification on the input instruction sentence.
  • the second predetermined process is a process of generating the estimated distribution based on key points of the person included in the gesture information.
  • the identification unit identifies the object by sequentially updating the probability of each object included in the probabilistic scene graph using the one or more extracted instructions.
  • the second generation unit sets an initial value of the probability to be assigned to each object included in the probabilistic scene graph based on the estimated distribution.
  • When the update identifies multiple objects, the identification unit generates a question for identifying one of the multiple objects.
  • a computer acquires an image of the periphery of a moving object captured by a camera mounted on the moving object, an input instruction input by a person associated with the moving object, and gesture information related to a gesture made by the person, performs a first predetermined process on the input instruction to extract one or more instructions included in the input instruction, performs a second predetermined process on the gesture information to generate an estimated distribution related to the position pointed to by the person, generates from the captured image a probabilistic scene graph in which each object included in the captured image is assigned a probability, and identifies the object pointed to by the person in the captured image based on the one or more instructions, the estimated distribution, and the probabilistic scene graph.
  • a program causes a computer to acquire an image of the surroundings of a moving object captured by a camera mounted on the moving object, an input instruction input by a person associated with the moving object, and gesture information related to a gesture made by the person, extract one or more instructions included in the input instruction by performing a first predetermined process on the input instruction, generate an estimated distribution related to the position indicated by the person by performing a second predetermined process on the gesture information, generate a probabilistic scene graph from the captured image in which each object included in the captured image is assigned a probability, and identify the object indicated by the person in the captured image based on the one or more instructions, the estimated distribution, and the probabilistic scene graph.
  • aspects (1) to (8) make it possible to utilize modalities such as a person's gaze and gestures, and to resolve ambiguities that arise during the inference process.
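The identification flow summarized in the aspects above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the `Node` structure, the Gaussian used as the estimated distribution, the `mismatch` penalty, the thresholds, and the object names are all hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    """One object in a toy probabilistic scene graph (hypothetical structure)."""
    name: str
    category: str
    color: str
    x: float           # horizontal image position, normalized to [0, 1]
    prob: float = 0.0  # probability that this is the object being pointed to


def normalize(nodes):
    """Rescale the probabilities so that they sum to 1."""
    total = sum(n.prob for n in nodes)
    if total == 0:
        raise ValueError("no object is compatible with the evidence")
    for n in nodes:
        n.prob /= total


def init_from_gesture(nodes, pointed_x, sigma=0.2):
    """Set initial probabilities from an assumed Gaussian around the pointed position."""
    for n in nodes:
        n.prob = math.exp(-((n.x - pointed_x) ** 2) / (2 * sigma ** 2))
    normalize(nodes)


def apply_instruction(nodes, attribute, value, mismatch=0.05):
    """Sequential update: down-weight objects that do not match one instruction."""
    for n in nodes:
        if getattr(n, attribute) != value:
            n.prob *= mismatch
    normalize(nodes)


def identify(nodes, threshold=0.8, min_candidate=0.1):
    """Return (node, None) if one object dominates, otherwise (None, question)."""
    best = max(nodes, key=lambda n: n.prob)
    if best.prob >= threshold:
        return best, None
    candidates = [n.name for n in nodes if n.prob >= min_candidate]
    return None, "Which one do you mean: " + " or ".join(candidates) + "?"


def demo_scene():
    """Build a small hypothetical scene for experimentation."""
    return [
        Node("A", "vending machine", "red", 0.30),
        Node("B", "vending machine", "blue", 0.45),
        Node("C", "vending machine", "blue", 0.70),
        Node("D", "truck", "white", 0.20),
    ]
```

In this sketch each instruction is applied in turn, so an instruction that matches several objects leaves more than one candidate above `min_candidate`, and `identify` then falls back to a clarifying question rather than guessing.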
  • FIG. 1 is a diagram illustrating an example of the configuration of a moving object 1 and a control device 100 according to an embodiment.
  • FIG. 2 is a perspective view of the moving body 1 seen from above.
  • FIG. 3 is a diagram showing an example of a captured image IM captured by an external camera.
  • 10 is a diagram for explaining an overview of a first predetermined process executed by an extraction unit 120.
  • FIG. 11 is a diagram for explaining an overview of a second predetermined process executed by a generating unit 130.
  • FIG. 11 is a diagram for explaining a probabilistic scene graph update process executed by the identification unit 140.
  • FIG. 11 is a diagram for explaining a question generation process executed by the identification unit 140.
  • FIG. 4 is a flowchart showing an example of a flow of processing executed by the control device 100.
  • the information processing device is mounted on a moving object.
  • the moving object moves on both roadways and a predetermined area different from the roadway.
  • the moving object may be called micromobility.
  • An electric kick scooter is a type of micromobility.
  • the predetermined area is, for example, a sidewalk.
  • The predetermined area may be part or all of a sidewalk, a bicycle lane, a public open space, etc., or may include all of them.
  • the information processing device identifies an object indicated by a person based on an image captured of the periphery of the moving object, an input instruction input by a person related to the moving object, and a gesture made by the person.
  • the person related to the moving object is described as a passenger on the moving object, but the present invention is not limited to such a configuration, and may be a person who inputs an instruction outside the moving object (for example, to indicate a waiting location for the moving object after getting off).
  • FIG. 1 is a diagram showing an example of the configuration of a moving body 1 and a control device 100 according to an embodiment.
  • The moving body 1 is equipped with, for example, an external environment detection device 10, a moving body sensor 12, an operator 14, an internal camera 16, a positioning device 18, an HMI 20, a mode changeover switch 22, a moving mechanism 30, a driving device 40, an external notification device 50, a storage device 70, and a control device 100.
  • the moving body is not limited to a vehicle, and may include a small mobility that runs alongside a walking user to carry luggage or lead a person, and may also include other moving bodies capable of autonomous movement (e.g., a walking robot, etc.).
  • The external environment detection device 10 comprises various devices whose detection range covers the traveling direction of the moving body 1.
  • The external environment detection device 10 includes an external camera, a radar device, a LIDAR (Light Detection and Ranging), a sensor fusion device, etc.
  • The external environment detection device 10 outputs information indicating the detection results (images, object positions, etc.) to the control device 100.
  • The external environment detection device 10 outputs captured images of the surroundings of the moving body 1, taken by the external camera, to the control device 100.
  • The moving body sensor 12 includes, for example, a speed sensor, an acceleration sensor, a yaw rate (angular velocity) sensor, a direction sensor, and an operation amount detection sensor attached to the operator 14.
  • The operator 14 includes, for example, an operator for instructing acceleration/deceleration (for example, an accelerator pedal or a brake pedal) and an operator for instructing steering (for example, a steering wheel).
  • The moving body sensor 12 may include an accelerator opening sensor, a brake depression amount sensor, a steering torque sensor, etc.
  • The moving body 1 may also be provided with an operator 14 of a type other than those described above (for example, a non-annular rotary operator, a joystick, a button, etc.).
  • The internal camera 16 captures an image of at least the head of an occupant of the moving body 1 from the front.
  • the internal camera 16 is a digital camera that uses an imaging element such as a CCD (Charge Coupled Device) or a CMOS (Complementary Metal Oxide Semiconductor).
  • the internal camera 16 outputs the captured image to the control device 100.
  • The positioning device 18 is a device that measures the position of the moving body 1.
  • The positioning device 18 is, for example, a GNSS (Global Navigation Satellite System) receiver, and identifies the position of the moving body 1 based on signals received from GNSS satellites and outputs it as position information.
  • The position information of the moving body 1 may be estimated from the position of a Wi-Fi base station to which a communication device (described later) is connected.
  • the HMI 20 includes a display device, a speaker, a touch panel, keys, etc.
  • the occupant of the moving body 1 sets the destination of the moving body 1, for example, via the HMI 20, and the control unit 150 described later drives the moving body 1 to the set destination.
  • the HMI 20 includes a voice input device such as a microphone, and the occupant of the moving body 1 inputs instructions to the voice input device by speaking instructions indicating the stopping position of the moving body 1.
  • the HMI 20 analyzes the voice of the input instructions, converts them to text, and outputs them to the control device 100.
  • the HMI 20 may accept instructions input as text by the occupant, for example, via a touch panel, and output the accepted instructions to the control device 100.
  • the mode changeover switch 22 is a switch operated by the occupant.
  • the mode changeover switch 22 may be a mechanical switch or a GUI (Graphical User Interface) switch set on the touch panel of the HMI 20.
  • The mode changeover switch 22 accepts an operation to switch the driving mode to, for example, one of the following: Mode A, an assist mode in which one of the steering operation and acceleration/deceleration control is performed by the occupant and the other is performed automatically (there may be Mode A-1, in which the steering operation is performed by the occupant and acceleration/deceleration control is performed automatically, and Mode A-2, in which the acceleration/deceleration operation is performed by the occupant and steering control is performed automatically); Mode B, a manual driving mode in which the steering operation and acceleration/deceleration operation are performed by the occupant; or Mode C, an automatic driving mode in which steering control and acceleration/deceleration control are performed automatically.
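The driving modes described above can be summarized as a small lookup structure. The sketch below is only an illustration: the `DrivingMode` enum, the `automated_controls` function, and the string labels are hypothetical names, not identifiers from the publication.

```python
from enum import Enum


class DrivingMode(Enum):
    """Driving modes selectable with the mode changeover switch (illustrative)."""
    ASSIST_A1 = "A-1"  # occupant steers; acceleration/deceleration is automatic
    ASSIST_A2 = "A-2"  # occupant accelerates/decelerates; steering is automatic
    MANUAL = "B"       # occupant performs both operations
    AUTOMATIC = "C"    # both controls are performed automatically


def automated_controls(mode):
    """Return the set of controls that are performed automatically in a mode."""
    table = {
        DrivingMode.ASSIST_A1: {"acceleration/deceleration"},
        DrivingMode.ASSIST_A2: {"steering"},
        DrivingMode.MANUAL: set(),
        DrivingMode.AUTOMATIC: {"steering", "acceleration/deceleration"},
    }
    return table[mode]
```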
  • The moving mechanism 30 is a mechanism for moving the moving body 1 on a road.
  • The moving mechanism 30 is, for example, a group of wheels including steered wheels and driving wheels.
  • the moving mechanism 30 may also be legs for multi-legged walking.
  • the driving device 40 outputs a force to the moving mechanism 30 to move the moving body 1.
  • The driving device 40 includes a motor that drives the driving wheels, a battery that stores the power to be supplied to the motor, and a steering device that adjusts the steering angle of the steered wheels.
  • the driving device 40 may also include an internal combustion engine or a fuel cell as a driving force output means or a power generation means.
  • the driving device 40 may also further include a brake device that utilizes frictional force or air resistance.
  • The external notification device 50 is, for example, a lamp, a display device, or a speaker provided on the outer panel of the moving body 1 for notifying the outside of the moving body 1 of information.
  • The external notification device 50 operates differently depending on whether the moving body 1 is moving on a sidewalk or on a roadway.
  • The external notification device 50 is controlled so that the lamp emits light when the moving body 1 is moving on a sidewalk and does not emit light when the moving body 1 is moving on a roadway. The color of the emitted light is preferably a color specified by law.
  • The external notification device 50 may be controlled so that the lamp emits green light when the moving body 1 is moving on a sidewalk and blue light when the moving body 1 is moving on a roadway. If the external notification device 50 is a display device, it displays the message "traveling on the sidewalk" in text or graphics when the moving body 1 is traveling on the sidewalk.
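The two lamp behaviors described above amount to a simple mapping from the current travel area to a lamp command. The following sketch assumes hypothetical names (`lamp_command`, the `scheme` values, and the string commands) purely for illustration.

```python
def lamp_command(area, scheme="on_off"):
    """Return an illustrative lamp command for the area the moving body is in.

    scheme="on_off":    lit on the sidewalk, unlit on the roadway.
    scheme="two_color": green on the sidewalk, blue on the roadway.
    """
    if area not in ("sidewalk", "roadway"):
        raise ValueError(f"unknown area: {area!r}")
    if scheme == "on_off":
        return "on" if area == "sidewalk" else "off"
    if scheme == "two_color":
        return "green" if area == "sidewalk" else "blue"
    raise ValueError(f"unknown scheme: {scheme!r}")
```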
  • In the figure, FW is a steered wheel, RW is a driving wheel, SD is the steering device, MT is the motor, and BT is the battery.
  • The steering device SD, the motor MT, and the battery BT are included in the driving device 40.
  • AP is the accelerator pedal, BP is the brake pedal, WH is the steering wheel, SP is the speaker, and MC is the microphone.
  • the moving body 1 shown in the figure is a one-seater moving body, and an occupant P is seated in the driver's seat DS and fastened with a seat belt SB.
  • Arrow D1 is the traveling direction (velocity vector) of the moving body 1.
  • the external environment detection device 10 is provided near the front end of the moving body 1, the internal camera 16 is provided in a position where it can capture an image of the head of the occupant P from in front of the occupant P, and the mode changeover switch 22 is provided in the boss part of the steering wheel WH.
  • An external notification device 50 as a display device is provided near the front end of the moving body 1.
  • the storage device 70 is a non-transitory storage device such as a hard disk drive (HDD), flash memory, or random access memory (RAM). Navigation map information 72 and the like are stored in the storage device 70. Although the storage device 70 is shown outside the frame of the control device 100 in the figure, the storage device 70 may be included in the control device 100. The storage device 70 may also be provided on a server (not shown).
  • Navigation map information 72 is stored in advance in storage device 70, and is map information that includes, for example, information on the center of roads, including roadways and sidewalks, or information on road boundaries. Navigation map information 72 further includes information (such as names, addresses, and areas) on facilities and buildings adjacent to road boundaries.
  • The control device 100 includes, for example, an acquisition unit 110, an extraction unit 120, a generation unit 130, an identification unit 140, and a control unit 150.
  • The acquisition unit 110, the extraction unit 120, the generation unit 130, the identification unit 140, and the control unit 150 are realized by, for example, a hardware processor such as a CPU (Central Processing Unit) executing a program (software) 74.
  • Some or all of these components may be realized by hardware (including circuitry) such as an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a GPU (Graphics Processing Unit), or may be realized by cooperation between software and hardware.
  • The program may be stored in the storage device 70 in advance, or may be stored in a removable storage medium (non-transitory storage medium) such as a DVD or CD-ROM and installed in the storage device 70 by mounting the storage medium in a drive device.
  • The combination of the acquisition unit 110, the extraction unit 120, the generation unit 130, and the identification unit 140 is an example of an "information processing device."
  • The acquisition unit 110 acquires a captured image IM obtained when an external camera, which is one form of the external environment detection device 10, captures the surroundings of the moving body 1.
  • FIG. 3 is a diagram showing an example of an image IM captured by an external camera. As an example, FIG. 3 shows a situation in which the image IM captured by the external camera includes vehicles M1 and M2, vending machines V1, V2, and V3, and a postbox P.
  • In FIG. 3, vending machine V1 is a red vending machine, and vending machines V2 and V3 are blue vending machines.
  • The acquisition unit 110 acquires an instruction sentence that is input by the occupant of the moving body 1 via the voice input device, which is the HMI 20, and that indicates the target position to be reached by the moving body 1.
  • The occupant inputs an instruction sentence such as "stop at the red vending machine behind that truck" to indicate that the moving body 1 should reach the target position TP in front of the vending machine V1.
  • The acquisition unit 110 also acquires, as gesture information, an image captured by the internal camera 16 that shows a gesture made by the occupant.
  • The acquired gesture information is used for processing by the generation unit 130, which will be described later.
  • The extraction unit 120 performs a first predetermined process on the input instruction sentence to extract one or more instructions (reasoning instructions) included in the input instruction sentence.
  • FIG. 4 is a diagram for explaining an outline of the first predetermined process executed by the extraction unit 120. More specifically, as the first predetermined process, the extraction unit 120 applies a dependency parser and an entity classifier to the input instruction sentence.
  • For example, as shown in the left part of FIG. 4, the extraction unit 120 performs dependency analysis on the instruction sentence "stop at the red vending machine behind that truck" and determines that "that" is a determiner (det) that modifies "truck", "of" is a case marker (case) of "truck", "behind" is a noun modifier (nmod) that modifies "truck", and "red" is a clause modifier (acl) that modifies "vending machine".
  • This dependency analysis may be performed using a known method.
  • The extraction unit 120 also classifies the attribute of each morpheme in the instruction sentence as entity classification. For example, as shown in the right part of FIG. 4, the extraction unit 120 classifies "that" in the instruction sentence "stop at the red vending machine behind that truck" as a demonstrative, "truck" as an object, "behind" as a relation, "red" as a color, and "vending machine" as an object.
  • The extraction unit 120 links and stores the dependency relationships between morphemes obtained from the dependency analysis and the attribute of each morpheme obtained from the entity classification.
  • That is, each morpheme (truck, that, behind, vending machine, red) is stored, together with its linked dependency relationship and attribute, as one of the one or more instructions (reasoning instructions).
  • These one or more instructions may be derived and stored using the method described in Non-Patent Document 1.
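The linking step described above can be illustrated with a minimal Python sketch. It is not the method of Non-Patent Document 1: the parser and classifier outputs are hand-written stand-ins based on the FIG. 4 example, and the names `ReasoningInstruction`, `link_results`, `entities`, and `arcs` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReasoningInstruction:
    token: str                  # morpheme from the instruction sentence
    attribute: str              # entity class: "object", "relation", "color", ...
    dep: Optional[str] = None   # dependency label such as "det" or "nmod"
    head: Optional[str] = None  # token that this morpheme modifies


def link_results(entities, arcs):
    """Join entity classes and dependency arcs on the shared token."""
    return [
        ReasoningInstruction(token, attribute, *arcs.get(token, (None, None)))
        for token, attribute in entities.items()
    ]


# Hand-written stand-ins (assumptions) for the classifier and parser output.
entities = {
    "truck": "object",
    "that": "demonstrative",
    "behind": "relation",
    "vending machine": "object",
    "red": "color",
}
arcs = {
    "that": ("det", "truck"),
    "behind": ("nmod", "truck"),
    "red": ("acl", "vending machine"),
}

instructions = link_results(entities, arcs)
for item in instructions:
    print(item)
```

Keeping the dependency label and the attribute in one record lets later stages look up, for example, every token whose attribute is "relation" without re-parsing the sentence.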
  • The generation unit 130 generates an estimated distribution of the position indicated by the gesture of the occupant by performing a second predetermined process on the gesture information acquired by the acquisition unit 110.
  • FIG. 5 is a diagram for explaining an outline of the second predetermined process executed by the generation unit 130.
  • The generation unit 130 sets key points on two parts of the body of the occupant P.
  • FIG. 5 shows, as an example, a situation in which the eye and the wrist of the occupant P are set as key points KP1 and KP2, respectively.
  • The generation unit 130 estimates, as the position indicated by the gesture of the occupant P, the intersection IS at which the indication line L, which connects the eye KP1 and the wrist KP2 and is extended beyond the wrist, meets the ground surface, and generates the estimated distribution of the gesture position as a probability distribution that peaks at the intersection IS.
  • Any type of distribution, such as a normal distribution, may be assumed for the generated probability distribution.
  • The generation unit 130 may select any key points, but it is desirable for one of them to be the eye, since the occupant P can specify a position accurately through the line of sight. It is also desirable for the other point to be a part that is easy to identify in the image, for example the wrist, a fingertip, or the tip or center of a fist. Furthermore, when the occupant P is indicating a destination, the face may be turned toward the destination, so that the eyes cannot be captured by the internal camera 16. In such cases, the position of the eyes may be estimated; if the direction of the face can be identified, the position of the eyes can be estimated from it. This estimation of the eye position may also be performed using a machine learning model.
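The geometry of the second predetermined process can be sketched in Python as follows. This is only an illustration under stated assumptions: key points are given as 3-D coordinates in metres with the z axis pointing up and the ground at z = 0, the distribution is an isotropic normal distribution, and the function names and the `sigma` parameter are hypothetical.

```python
import math


def pointing_intersection(eye, wrist, ground_z=0.0):
    """Extend the eye-to-wrist line beyond the wrist until it meets z = ground_z."""
    ex, ey, ez = eye
    wx, wy, wz = wrist
    dz = wz - ez
    if dz >= 0:
        # The line is level or points upward, so it never reaches the ground.
        return None
    t = (ground_z - ez) / dz
    return (ex + t * (wx - ex), ey + t * (wy - ey))


def gesture_density(point, intersection, sigma=1.0):
    """Isotropic 2-D normal density whose maximum lies at the intersection IS."""
    dx = point[0] - intersection[0]
    dy = point[1] - intersection[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)


# Eye KP1 at a height of 1.2 m, wrist KP2 slightly lower and 0.4 m ahead.
intersection = pointing_intersection((0.0, 0.0, 1.2), (0.4, 0.0, 1.0))
print(intersection)
print(gesture_density(intersection, intersection), gesture_density((5.0, 0.0), intersection))
```

A wider `sigma` expresses less trust in the pointing direction, which may be appropriate when the eye position had to be estimated rather than observed.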
  • The generation unit 130 semantically extracts the objects contained in the captured image IM captured by the external camera, and generates a probabilistic scene graph in which each extracted object is assigned a probability that the occupant P pointed to that object.
  • The generation unit 130 extracts vehicles M1 and M2, vending machines V1, V2, and V3, and postbox P. This extraction process can reduce the load associated with subsequent processing compared to methods that process raw data, such as deep neural networks (DNNs). More specifically, the generation of the probabilistic scene graph may be performed using the method described in Non-Patent Document 1.
  • The initial probability value assigned to each object included in the generated probabilistic scene graph may be uniform (i.e., 1/(the number of objects included in the probabilistic scene graph)), or the generation unit 130 may set a different initial value for each object.
  • The generation unit 130 may use the estimated distribution of the gesture position to set a different initial value according to the position of each object. More specifically, the generation unit 130 may set a higher initial value the closer the object is to the detected intersection point IS, and a lower initial value the farther the object is from the intersection point IS. For example, in the case of FIG.
  • The generation unit 130 is an example of a "first generation unit" and a "second generation unit".
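The two ways of setting initial values described above (uniform, or weighted by distance from the intersection point IS) might look like the following Python sketch. The object positions, the Gaussian weighting, and the `sigma` value are assumptions for illustration and are not taken from the source.

```python
import math


def initial_probabilities(positions, intersection=None, sigma=2.0):
    """Return initial probabilities for the objects in a probabilistic scene graph.

    positions: dict mapping an object name to its (x, y) ground position.
    intersection: estimated pointing position IS, or None for a uniform prior.
    """
    n = len(positions)
    if intersection is None:
        return {name: 1.0 / n for name in positions}
    ix, iy = intersection
    weights = {
        name: math.exp(-((x - ix) ** 2 + (y - iy) ** 2) / (2 * sigma ** 2))
        for name, (x, y) in positions.items()
    }
    total = sum(weights.values())
    if total == 0.0:
        # Every object is too far from IS to matter; fall back to uniform.
        return {name: 1.0 / n for name in positions}
    return {name: w / total for name, w in weights.items()}


# Hypothetical ground positions of three vending machines, in metres.
positions = {"V1": (2.0, 1.0), "V2": (6.0, 1.0), "V3": (10.0, 1.0)}
uniform = initial_probabilities(positions)
weighted = initial_probabilities(positions, intersection=(2.5, 0.5))
print(uniform)
print(weighted)
```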
  • The identification unit 140 identifies the object indicated by the occupant by sequentially updating the probability of each object included in the probabilistic scene graph using the one or more instructions extracted by the extraction unit 120.
  • FIG. 6 is a diagram for explaining the update process of the probabilistic scene graph executed by the identification unit 140.
  • The identification unit 140 takes the one or more instructions extracted by the extraction unit 120 one at a time, and updates the probabilistic scene graph so that the probability of the object corresponding to each instruction becomes higher. For example, in the case of FIG. 6, the identification unit 140 takes "truck" and "that" and updates the probabilistic scene graph so that the probability values of the vehicles M1 and M2 become higher.
  • Next, the identification unit 140 takes "behind" and transitions from the vehicle M1 to the vending machine V1 and from the vehicle M2 to the vending machine V2 in the probabilistic scene graph.
  • The identification unit 140 then takes "vending machine" and "red", identifies the vending machine V1 as a vending machine having the attribute "red", and updates the probabilistic scene graph so that the probability value of the vending machine V1 becomes higher.
  • In this way, the probabilities assigned to the objects in the probabilistic scene graph are successively updated, and the object with the highest probability value is ultimately identified as the object indicated by the occupant P.
  • In this example, the identification unit 140 can ultimately identify the vending machine V1, which has the highest probability value, as the object indicated by the occupant P. More specifically, these updates to the probabilistic scene graph may be performed using the method described in Non-Patent Document 1.
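The sequence of updates in the FIG. 6 example can be traced with a small Python sketch. This is a hand-written simplification, not the Neural State Machine of Non-Patent Document 1: the scene graph contents, the class label "truck" for vehicles M1 and M2, the `boost` factor, and all function names are assumptions.

```python
def normalize(probs):
    total = sum(probs.values())
    return {name: p / total for name, p in probs.items()}


def reweight(probs, objects, matches, boost=10.0):
    """Raise the probability of every object for which matches(obj) is true."""
    return normalize({
        name: p * (boost if matches(objects[name]) else 1.0)
        for name, p in probs.items()
    })


def transition(probs, relations, relation):
    """Move probability mass along edges of the form (subject, relation, reference)."""
    moved = {name: 0.0 for name in probs}
    for subject, rel, reference in relations:
        if rel == relation:
            moved[subject] += probs[reference]
    return normalize(moved) if sum(moved.values()) > 0 else probs


# Hypothetical scene graph for the captured image IM of FIG. 3.
objects = {
    "M1": {"class": "truck", "color": None},
    "M2": {"class": "truck", "color": None},
    "V1": {"class": "vending machine", "color": "red"},
    "V2": {"class": "vending machine", "color": "blue"},
    "V3": {"class": "vending machine", "color": "blue"},
    "P": {"class": "postbox", "color": None},
}
relations = [("V1", "behind", "M1"), ("V2", "behind", "M2")]

probs = {name: 1.0 / len(objects) for name in objects}                     # uniform start
probs = reweight(probs, objects, lambda o: o["class"] == "truck")          # "that truck"
probs = transition(probs, relations, "behind")                             # "behind"
probs = reweight(probs, objects, lambda o: o["class"] == "vending machine")
probs = reweight(probs, objects, lambda o: o["color"] == "red")            # "red"
target = max(probs, key=probs.get)
print(target, probs)
```

Multiplying and renormalizing keeps every intermediate result a valid probability distribution, so the same values can be passed directly to an ambiguity check afterwards.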
  • As described above, the identification unit 140 uses the probabilistic scene graph to identify the object with the maximum probability value as the object indicated by the occupant P.
  • The identification unit 140 also calculates the entropy of the probability distribution over the objects in the probabilistic scene graph, and if the calculated entropy is large (above a threshold), it can determine that the object indicated by the occupant P cannot be narrowed down to a single object. In such a case, the conventional technology was unable to ultimately identify the object indicated by the occupant P.
  • To address this, the identification unit 140 generates a question for identifying one object from the multiple candidates, asks the occupant the question, and receives a response from the occupant, thereby ultimately identifying the object indicated by the occupant P.
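The entropy check can be written in a few lines of Python. The sketch below uses Shannon entropy in bits; the threshold of 0.9 bits and the function names are assumptions chosen for illustration, since the source does not give a specific value.

```python
import math


def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def is_ambiguous(probs, threshold=0.9):
    """True when the distribution is too flat to single out one object."""
    return entropy(probs) > threshold


tied = {"V1": 0.5, "V2": 0.5}      # two equally likely vending machines
clear = {"V1": 0.91, "V2": 0.09}   # one dominant candidate
print(entropy(tied.values()), is_ambiguous(tied.values()))
print(entropy(clear.values()), is_ambiguous(clear.values()))
```

Two equally likely candidates give exactly 1 bit of entropy, so any threshold below 1 bit treats a two-way tie as ambiguous.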
  • FIG. 7 is a diagram for explaining the process of generating a question executed by the identification unit 140.
  • FIG. 7 shows a case where the occupant inputs the instruction "Stop at the vending machine behind the truck" and the identification unit 140 performs the process of updating the probabilistic scene graph, resulting in vending machine V1 behind vehicle M1 and vending machine V2 behind vehicle M2 being identified as objects with the same probability value (or objects whose difference in probability value is within a threshold value).
  • The identification unit 140 generates a question to narrow the identified objects down to one. For example, it may generate a question that identifies the object directly, such as "Which vending machine?", or a question that identifies it indirectly, such as "Which truck?" (i.e., once one truck has been identified, the vending machine can be identified by combining it with the noun modifier "behind").
  • The identification unit 140 may also compare attributes (e.g., color) of the multiple candidate objects and generate a question about an attribute whose values differ.
  • For example, if vending machine V1 has a color attribute of "red" and vending machine V2 does not have a specific color attribute, the identification unit 140 may generate a question such as "Is the color of this vending machine red?" based on the difference in the color attribute.
  • The identification unit 140 transmits the generated question to the HMI 20, receives the answer entered by the occupant on the HMI 20, and ultimately identifies the object pointed to by the occupant P based on the received answer. Alternatively, the identification unit 140 may accept a gesture by the occupant P again, and ultimately identify the object closest to the direction of the accepted gesture as the object pointed to by the occupant P. In this way, even if multiple candidates for the pointed-to object remain after the update process of the probabilistic scene graph, the pointed-to object can be uniquely identified by generating a question for the occupant P.
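The attribute-comparison strategy for question generation can be sketched in Python as follows. The candidate data structure, the question wording, and the function name `generate_question` are assumptions; a real system would also need to parse the occupant's answer, which is omitted here.

```python
def generate_question(candidates):
    """Build a question that separates candidates by an attribute whose values differ.

    candidates: dict mapping an object name to a dict with a "class" entry
    and any number of attribute entries (None means "not specified").
    """
    names = sorted(candidates)
    cls = candidates[names[0]]["class"]
    keys = sorted({k for c in candidates.values() for k in c if k != "class"})
    for key in keys:
        values = [candidates[name].get(key) for name in names]
        if len(set(values)) > 1:
            # At least one value is specified, because the values differ.
            known = next(v for v in values if v is not None)
            return f"Is the {key} of the {cls} {known}?"
    # No distinguishing attribute found: fall back to a direct question.
    return f"Which {cls}?"


candidates = {
    "V1": {"class": "vending machine", "color": "red"},
    "V2": {"class": "vending machine", "color": None},
}
print(generate_question(candidates))
```

Asking about a differing attribute gives a yes/no question, which is usually easier to answer by voice than an open "Which one?" question.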
  • The control unit 150 drives the driving device 40 so that the moving body 1 travels to the target position, which is set at the object identified by the identification unit 140.
  • FIG. 8 is a flowchart showing an example of the flow of processing executed by the control device 100.
  • The processing according to this flowchart is executed in response to the input of an instruction sentence and a gesture by the occupant while the moving body 1 is traveling.
  • The acquisition unit 110 acquires the captured image IM, the input instruction sentence, and the gesture information (step S100).
  • The extraction unit 120 extracts one or more instructions from the input instruction sentence acquired by the acquisition unit 110, and the generation unit 130 generates an estimated distribution of the position indicated by the occupant P from the gesture information (step S102).
  • The generation unit 130 generates a probabilistic scene graph from the captured image IM and sets the initial probabilities of the probabilistic scene graph based on the estimated distribution (step S104).
  • The identification unit 140 updates the probabilities of the probabilistic scene graph based on the one or more instructions (step S106).
  • The identification unit 140 determines whether or not a single object has been identified as a result of updating the probabilistic scene graph (step S108). If it is determined that a single object has been identified, the control unit 150 causes the moving body 1 to travel with the identified object as the target position (step S110). On the other hand, if it is determined that a single object has not been identified, the identification unit 140 generates a question sentence for identifying the single object, makes an inquiry, and identifies the single object (step S112). Thereafter, the identification unit 140 transitions the process to step S110. This ends the process related to this flowchart.
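The control flow of steps S100 to S112 can be summarized in a short Python sketch. Each stage is passed in as a callable so that the flow itself stays visible; the function name `run_flow`, its parameters, and the stub values in the usage example are all hypothetical.

```python
def run_flow(acquire, extract, estimate, build_graph, update, is_single, ask, drive):
    """Follow the order of the flowchart; every stage is an injected callable."""
    image, sentence, gesture = acquire()        # S100: image, sentence, gesture
    instructions = extract(sentence)            # S102: reasoning instructions
    distribution = estimate(gesture)            # S102: estimated distribution
    graph = build_graph(image, distribution)    # S104: graph with initial values
    graph = update(graph, instructions)         # S106: sequential updates
    if is_single(graph):                        # S108: single object identified?
        target = max(graph, key=graph.get)
    else:
        target = ask(graph)                     # S112: question and answer
    drive(target)                               # S110: travel to the target
    return target


# Usage with stub stages (assumed values) that end in a two-way tie.
log = []
target = run_flow(
    acquire=lambda: ("image IM", "stop at the vending machine behind the truck", "gesture"),
    extract=lambda sentence: ["truck", "behind", "vending machine"],
    estimate=lambda gesture: None,
    build_graph=lambda image, distribution: {"V1": 0.5, "V2": 0.5},
    update=lambda graph, instructions: graph,
    is_single=lambda graph: max(graph.values()) > 0.8,
    ask=lambda graph: "V1",
    drive=log.append,
)
print(target, log)
```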
  • The information processing device need only identify a pointed-to object based on at least a captured image, an input instruction sentence, and gesture information and, if there are multiple candidates for the pointed-to object, generate an additional question sentence and make an inquiry so as to ultimately identify a single object.
  • The information processing device according to the present invention may also be used to identify an object pointed to by a user in a VR (virtual reality) space.
  • As described above, a captured image, an input instruction sentence, and gesture information are acquired, and the object indicated by the occupant is identified based on a probabilistic scene graph generated from the captured image, instructions extracted from the input instruction sentence, and an estimated distribution generated from the gesture information. If there are multiple candidates for the object indicated by the occupant, a question sentence for identifying a single object is generated and the occupant is asked the question. This makes it possible to utilize modalities such as a person's gaze and gestures, and to resolve ambiguity that arises during the inference process.
  • The information processing device is configured as follows:
  • a storage medium for storing computer-readable instructions; and
  • a processor coupled to the storage medium,
  • wherein the processor executes the computer-readable instructions to: acquire a captured image of the periphery of the moving body captured by a camera mounted on the moving body, an input instruction sentence input by an occupant of the moving body, and gesture information regarding a gesture performed by the occupant; extract one or more instructions included in the input instruction sentence by performing a first predetermined process on the input instruction sentence; generate an estimated distribution regarding a position indicated by the occupant by performing a second predetermined process on the gesture information; generate, from the captured image, a probabilistic scene graph in which a probability is assigned to each object included in the captured image; identify an object indicated by the occupant in the captured image based on the one or more instructions, the estimated distribution, and the probabilistic scene graph;
  • 10: External environment detection device; 12: Mobile sensor; 14: Operator; 16: Internal camera; 18: Positioning device; 20: HMI; 22: Mode changeover switch; 30: Moving mechanism; 40: Driving device; 50: External notification device; 70: Storage device; 72: Navigation map information; 100: Control device; 110: Acquisition unit; 120: Extraction unit; 130: Generation unit; 140: Identification unit; 150: Control unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Social Psychology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Psychiatry (AREA)
  • Biophysics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)
PCT/JP2023/034642 2022-09-27 2023-09-25 Information processing device, information processing method, and program Ceased WO2024071006A1 (ja)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP23872231.8A EP4563941A4 (en) 2022-09-27 2023-09-25 INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING METHOD AND PROGRAM
CN202380058940.3A CN119731511A (zh) 2022-09-27 2023-09-25 Information processing device, information processing method, and program
JP2024549350A JPWO2024071006A1 2022-09-27 2023-09-25

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022-153456 2022-09-27
JP2022153456 2022-09-27

Publications (1)

Publication Number Publication Date
WO2024071006A1 true WO2024071006A1 (ja) 2024-04-04

Family

ID=90477805

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/034642 Ceased WO2024071006A1 (ja) 2022-09-27 2023-09-25 Information processing device, information processing method, and program

Country Status (4)

Country Link
EP (1) EP4563941A4
JP (1) JPWO2024071006A1
CN (1) CN119731511A
WO (1) WO2024071006A1

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
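JP2013250747A (ja) * 2012-05-31 2013-12-12 Sharp Corp Self-propelled electronic device
JP2017228080A (ja) * 2016-06-22 2017-12-28 Sony Corporation Information processing device, information processing method, and program
JP2021522564A (ja) * 2018-04-17 2021-08-30 Toyota Research Institute, Inc. System and method for detecting human gaze and gestures in an unconstrained environment
JP2022096601A (ja) 2020-12-17 2022-06-29 Intel Corporation Audio-visual and cooperative recognition of vehicles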

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DREW A. HUDSON, CHRISTOPHER D. MANNING: "Learning by Abstraction: The Neural State Machine", NEURIPS, pages 5901 - 5914
See also references of EP4563941A4

Also Published As

Publication number Publication date
EP4563941A1 (en) 2025-06-04
CN119731511A (zh) 2025-03-28
EP4563941A4 (en) 2025-10-29
JPWO2024071006A1 2024-04-04

Similar Documents

Publication Publication Date Title
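US10489686B2 (en) Object detection for an autonomous vehicle
CN111836747B (zh) Electronic device and method for vehicle driving assistance
US12361717B2 (en) Mobile object control device, mobile object control method, training device, training method, generation device, and storage medium
JP2018190217A (ja) Driver monitoring device and driver monitoring method
CN110390831A (zh) Travel route determination device
WO2023230740A1 (zh) Method and device for identifying abnormal driving behavior, and vehicle
US20250013683A1 (en) Context-based searching systems and methods for vehicles
CN115214713B (zh) Mobile object control device, mobile object control method, and storage medium
CN117063217A (zh) Mobile object control device, mobile object control method, and storage medium
CN116710971A (zh) Object recognition method and time-of-flight object recognition circuit
EP4563941A1 (en) Information processing apparatus, information processing method, and program
JP7614261B2 (ja) Image recognition device, image recognition method, and program
JP7714122B2 (ja) Mobile object control device, mobile object control method, and storage medium
JP7802194B2 (ja) Information processing device, information processing method, and program
JP7770224B2 (ja) Mobile object control device, mobile object control method, and storage medium
US12394208B2 (en) Mobile object control device, mobile object control method, and storage medium
JP7738510B2 (ja) White line recognition device, mobile object control system, white line recognition method, and program
EP4361961A1 (en) Method of determining information related to road user
JP2026068170A (ja) Image processing device, image processing method, and program
CN120548553A (zh) Information processing device, information processing method, and program
WO2023188090A1 (ja) Mobile object control device, mobile object control method, and storage medium
JP2022154109A (ja) Mobile object control device, mobile object control method, and program
CN120702461A (zh) Route recommendation method and device, and vehicle
CN118843574A (zh) Mobile object control device, mobile object control method, and storage medium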

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 23872231; Country of ref document: EP; Kind code of ref document: A1)
WWE WIPO information: entry into national phase (Ref document number: 202380058940.3; Country of ref document: CN)
WWE WIPO information: entry into national phase (Ref document number: 2023872231; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2023872231; Country of ref document: EP; Effective date: 20250225)
WWE WIPO information: entry into national phase (Ref document number: 2024549350; Country of ref document: JP)
WWP WIPO information: published in national office (Ref document number: 202380058940.3; Country of ref document: CN)
NENP Non-entry into the national phase (Ref country code: DE)
WWP WIPO information: published in national office (Ref document number: 2023872231; Country of ref document: EP)