WO2025004590A1 - Learning device, learning method, and program - Google Patents

Publication number
WO2025004590A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
input
error
target position
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2024/018221
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
直希 細見
輝久 翠
駿平 畑中
巍 楊
孔明 杉浦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honda Motor Co Ltd
Keio University
Original Assignee
Honda Motor Co Ltd
Keio University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honda Motor Co Ltd, Keio University
Priority to JP2025529506A (JPWO2025004590A1)
Priority to CN202480042444.3A (CN121605423A)
Priority to EP24831462.7A (EP4730218A1)
Publication of WO2025004590A1
Anticipated expiration
Legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads

Definitions

  • the present invention relates to a learning device, a learning method, and a program.
  • Patent Document 1 describes training a neural network using sensor data acquired by a vehicle.
  • Non-Patent Document 1 describes a model that uses images and language as input.
  • One aspect of the present invention aims to generate a model that can infer the target location with high accuracy.
  • a learning device for performing machine learning, comprising: an acquisition means for acquiring teacher data including input data and correct answer data, the input data including an input image including a reference object and input text that relatively specifies a target position by referring to the reference object; a generation means for generating output data for identifying the target position, a reference position that is the position of the reference object, and a positional relationship of the target position with respect to the reference position by inputting the input data into a model; and an update means for updating parameters of the model so that a loss obtained by inputting the output data and the correct answer data into a loss function is reduced, the loss function being based on at least two of a first error between the target position identified by the output data and the target position identified by the correct answer data, a second error between the reference position identified by the output data and the reference position identified by the correct answer data, and a third error between the positional relationship identified by the output data and the positional relationship identified by the correct answer data.
  • a model can be generated that can accurately infer the target position.
  • FIG. 1 is a block diagram illustrating an example of the hardware configuration of a computer according to some embodiments.
  • FIG. 2 is a schematic diagram illustrating an example of input data according to some embodiments.
  • FIG. 3 is a schematic diagram illustrating an example of correct answer data according to some embodiments.
  • FIG. 4 is a schematic diagram illustrating an example of the configuration of a model according to some embodiments.
  • FIG. 5 is a schematic diagram illustrating an example configuration of a self-attention layer according to some embodiments.
  • FIG. 6 is a schematic diagram illustrating an example configuration of a cross-attention layer according to some embodiments.
  • FIG. 7 is a flow diagram illustrating an example of a training method according to some embodiments.
  • the computer 100 is used to train a model by machine learning. Therefore, the computer 100 may be called a learning device.
  • the computer 100 may be, for example, a server computer or a personal computer (e.g., a desktop or laptop type).
  • the computer 100 may be a computer resource located in a cloud environment.
  • Computer 100 may have the hardware devices shown in FIG. 1.
  • Processor 101 controls the overall operation of computer 100.
  • Processor 101 may be, for example, a central processing unit (CPU), a graphics processing unit (GPU), or a combination of these.
  • Processor 101 may be a single processor, or a collection of multiple processors connected to each other so that they can communicate with each other.
  • Memory 102 stores programs and data used in the processing of computer 100.
  • Memory 102 may be configured, for example, by a combination of random access memory (RAM) and read-only memory (ROM).
  • the input device 103 is a device for obtaining instructions from a user of the computer 100.
  • the input device 103 may be configured, for example, by a combination of one or more of a keyboard, a button, a touchpad, and a microphone.
  • the display device 104 is a device for visually presenting information to the user of the computer 100.
  • the display device 104 may be, for example, a dot-matrix display such as a liquid crystal display.
  • the computer 100 may have a device (for example, a touch screen) in which the input device 103 and the display device 104 are integrated.
  • the input device 103 and the display device 104 may be external to the computer 100. In this case, the computer 100 may have an interface for communicating with the external input device 103 and display device 104.
  • the communication device 105 is a device for communicating with devices external to the computer 100.
  • the communication device 105 may be a network interface card (NIC) having a connector for connecting a cable.
  • the communication device 105 may be a wireless communication module including an antenna and a baseband processing circuit.
  • the secondary storage device 106 is a device for non-volatilely storing programs and data used in the processing of the computer 100.
  • the secondary storage device 106 is configured, for example, by a hard disk drive (HDD) or a solid state drive (SSD).
  • the computer 100 may be capable of communicating with an external database 110.
  • the database 110 may store teacher data 111 used for machine learning by the computer 100.
  • the computer 100 may acquire the teacher data 111 from the database 110.
  • the teacher data 111 may be stored in the secondary storage device 106 of the computer 100.
  • different teacher data 111 are used.
  • two pieces of teacher data 111 being different may mean that the input data 112 they contain differ (for example, at least one of the input text 201 and the input image 202 described below differs).
  • a part of the teacher data 111 may be used as verification data and test data.
  • the teacher data 111 includes input data 112 and correct answer data 113.
  • the input data 112 may be data that is input to a model (e.g., model 400 in FIG. 4) to train the model.
  • the correct answer data 113 may be data that is to be output by the model.
  • the input data 112 may include a pair of an input image 202 including a reference object 203 and input text 201 that relatively specifies a target position by referencing the reference object 203.
  • the input image 202 may be any image including an object.
  • the input image 202 may be an image taken by the camera 211 of the vehicle 210.
  • the input image 202 may be an image taken by the camera 211 attached to the vehicle so as to capture an image in front of the vehicle 210.
  • the input image 202 may be an image taken by a camera attached to the vehicle so as to capture an image in another direction (e.g., rearward) of the vehicle 210.
  • the camera 211 of the vehicle 210 may be a camera 211 attached to the vehicle 210, or may be a camera brought into the vehicle (e.g., a smartphone of a vehicle occupant).
  • the input image 202 may be an image not related to the vehicle.
  • the reference object 203 may be any object included in the input image 202.
  • a vehicle is used as the reference object 203.
  • the reference object 203 may be a traffic participant other than a vehicle, a road sign, a traffic light, a guardrail, an intersection, a pedestrian crossing, etc.
  • the input text 201 may be expressed in natural language, for example, "Park in front of the black car on the right."
  • the "black car on the right" in the input text 201 specifies the reference object 203.
  • the "in front of" in the input text 201 specifies the target position relative to the reference object 203.
  • the input text 201 may be expressed in other formats.
  • the input text 201 may be selected from multiple candidates for combinations of reference objects and positional relationships that are set in advance.
  • the correct answer data 113 may be data for identifying a target position specified by the input text 201, a reference position which is the position of the reference object 203, and a positional relationship of the target position with respect to the reference object 203.
  • the correct answer data 113 may be set manually for the input data 112, or may be set by a computer.
  • the correct answer data 113 includes a point 301 in the input image 202 and an area 302 in the input image 202.
  • the target position specified by the input text 201 is identified by the point 301.
  • the point 301 may be the target position.
  • the area centered on the point 301 may be the target position.
  • the point 301 may be represented by components in each direction in a two-dimensional coordinate system set in the input image 202 (hereinafter simply referred to as the "coordinate system of the input image 202").
  • the reference position, which is the position of the reference object 203, is identified by the region 302.
  • the region 302 may be the reference position.
  • the region 302 may be a rectangle having an outer edge that circumscribes the reference object 203.
  • the region 302 may be represented by a center, a width, and a height.
  • the center of the region 302 may be represented by components in each direction in the coordinate system of the input image 202.
  • the region 302 may be represented by the coordinates of the upper left corner and the coordinates of the lower right corner.
  • the region 302 may be other than rectangular, and may be, for example, circular.
  • the shape of the region 302 may differ depending on the shape of the reference object 203.
  • the positional relationship of the target position with respect to the reference object 203 is specified by a point 301 and an area 302.
  • the positional relationship may be represented by a two-dimensional vector from the center of the area 302 to the point 301. This vector may be represented by components in each direction in the coordinate system of the input image 202.
  • the correct answer data 113 includes a point 301 and an area 302.
  • the correct answer data 113 may include an area 302 and a two-dimensional vector.
  • the target position specified by the input text 201 may be specified by a point that is moved from the center of the area 302 by the amount of the two-dimensional vector.
  • the correct answer data 113 may include a point 301 and two two-dimensional vectors.
  • the upper left corner and the lower right corner of the area 302 may be specified by points that are moved from the point 301 by the amount of each of the two two-dimensional vectors.
  • the correct answer data 113 explicitly represents two of the following three pieces of information, and the remaining piece is determined from the other two: the target position specified by the input text 201; the reference position, which is the position of the reference object 203; and the positional relationship of the target position with respect to the reference object 203.
  • the correct answer data 113 may explicitly represent these three pieces of information.
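  • The relationship between the three pieces of information can be sketched as follows. This is a minimal illustration, not the applicants' implementation: the names `Region`, `relation_vector`, and `target_point` are hypothetical, and the coordinate values are made up. It shows that, given the region and either the point or the two-dimensional vector, the remaining piece follows.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Rectangular region (area 302) as center, width, and height in image coordinates."""
    cx: float
    cy: float
    w: float
    h: float

    @classmethod
    def from_corners(cls, x1, y1, x2, y2):
        """Build a region from its upper-left (x1, y1) and lower-right (x2, y2) corners."""
        return cls((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def relation_vector(point, region):
    """Two-dimensional vector from the center of the region to the target point."""
    return (point[0] - region.cx, point[1] - region.cy)


def target_point(region, vector):
    """Target point obtained by moving from the region center by the vector."""
    return (region.cx + vector[0], region.cy + vector[1])


# Hypothetical values: a bounding box around the reference object and a target point.
region = Region.from_corners(400, 200, 600, 300)
point = (450.0, 380.0)
vec = relation_vector(point, region)      # derived from the point and the region
recovered = target_point(region, vec)     # derived from the region and the vector
```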
  • a model 400 that is machine-learned by the computer 100 will be described. Based on the input data 112, the model 400 generates output data for identifying a target position specified by the input text 201, a reference position that is the position of the reference object 203, and the positional relationship of the target position with respect to the reference object 203.
  • the model 400 may have any structure that affects the output data of the model by processing both the input text 201 and the input image 202 included in the input data 112 with the parameters of the model.
  • the model 400 in FIG. 4 is an example of such a model.
  • the model 400 includes an image input layer 410, a text input layer 420, an image coding layer 430, a text coding layer 440, and an output layer 450.
  • the image input layer 410 converts the input image 202 into a format that is input to the image coding layer 430.
  • the image input layer 410 converts the input image 202 into a number of vectors.
  • the image input layer 410 may divide the input image 202 into a number of patch images and rearrange the pixel values of each patch image into a one-dimensional vector.
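  • The patch-based conversion can be sketched in NumPy as follows. The function name `image_to_patch_vectors`, the patch size, and the (height, width, channels) layout are assumptions made for illustration; the source does not specify them.

```python
import numpy as np


def image_to_patch_vectors(image, patch_size):
    """Split an (H, W, C) image into non-overlapping square patches and flatten each one."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    # Separate the patch grid axes from the within-patch axes.
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (rows, cols, patch, patch, C)
    # One row vector per patch, pixel values rearranged into one dimension.
    return patches.reshape(rows * cols, patch_size * patch_size * c)


image = np.arange(4 * 6 * 3, dtype=np.float32).reshape(4, 6, 3)
vectors = image_to_patch_vectors(image, patch_size=2)   # 6 patches, 12 values each
```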
  • the image coding layer 430 encodes the input image 202 (specifically, the input image 202 expressed as a plurality of vectors) input from the image input layer 410.
  • the specific configuration of the image coding layer 430 will be described later.
  • the output layer 450 generates output data for identifying the target position specified by the input text 201, the reference position which is the position of the reference object 203, and the positional relationship of the target position with respect to the reference object 203, based on the data encoded by the image coding layer 430.
  • the output data of the model 400 may have the same configuration as the above-mentioned correct answer data 113.
  • the output data of the model 400 may be the point 301 and the area 302.
  • a matrix in which a plurality of row vectors are combined is output from the image coding layer 430.
  • the output layer 450 may calculate a one-dimensional column vector by multiplying the output matrix by a weight matrix from the right. This weight matrix is one of the parameters determined by machine learning.
  • the multiple components of the calculated column vector become the coordinate values of point 301 and values for identifying area 302.
  • the image coding layer 430 may include one or more independent coding layers 460 (two in the example of FIG. 4) and one or more collaborative coding layers 470 (two in the example of FIG. 4). When the image coding layer 430 includes multiple independent coding layers 460, these may be connected in series. When the image coding layer 430 includes multiple collaborative coding layers 470, these may be connected in series. One or more independent coding layers 460 may be arranged together in the first half of the image coding layer 430, and one or more collaborative coding layers 470 may be arranged together in the second half of the image coding layer 430. Alternatively, the independent coding layers 460 and the collaborative coding layers 470 may be arranged in a mixed manner.
  • the independent coding layer 460 included in the image coding layer 430 does not use the features determined by the text coding layer 440 as input, but rather codes each of the multiple vectors input from the previous layer in the image coding layer 430.
  • the independent coding layer 460 may include a self-attention layer 461 and a fully connected layer 462.
  • the multiple vectors input to the independent coding layer 460 are converted into multiple different vectors by the self-attention layer 461.
  • the multiple vectors output from the self-attention layer 461 are converted into multiple different vectors by the fully connected layer 462.
  • the multiple vectors output from the fully connected layer 462 are output from the independent coding layer 460.
  • Each of the multiple output vectors of the self-attention layer 461 represents the relationship of each input vector to other input vectors in the multiple input vectors of the self-attention layer 461.
  • a specific configuration of the self-attention layer 461 will be described with reference to FIG. 5.
  • the self-attention layer 461 combines multiple input row vectors into one two-dimensional input matrix X.
  • the self-attention layer 461 calculates a query Q, a key K, and a value V by multiplying the input matrix X from the right by a weight matrix W_Q, a weight matrix W_K, and a weight matrix W_V, respectively.
  • the weight matrices W_Q, W_K, and W_V are parameters determined by machine learning.
  • the self-attention layer 461 includes a score calculation unit 501.
  • the score calculation unit 501 calculates a score S based on the query Q and the key K. Specifically, the score calculation unit 501 calculates an intermediate matrix by multiplying the query Q from the right by the transpose of the key K and dividing each component by a predetermined value (e.g., the square root of the number of columns of the key K). Then, the score calculation unit 501 calculates the score S by applying a softmax function to each row of the intermediate matrix. The self-attention layer 461 then calculates the matrix Y by multiplying the score S from the right by the value V, and outputs the matrix Y calculated in this manner. The multiple rows of the matrix Y correspond to the multiple row vectors output from the self-attention layer 461.
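  • The computation described for the self-attention layer 461 can be sketched in NumPy as follows. The function names, matrix sizes, and random weights are assumptions for illustration only; in the described device the weight matrices are learned.

```python
import numpy as np


def softmax_rows(m):
    """Row-wise softmax, shifted by each row's maximum for numerical stability."""
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def attention_scores(q, k):
    """Score S: softmax over each row of Q K^T / sqrt(number of columns of K)."""
    return softmax_rows(q @ k.T / np.sqrt(k.shape[1]))


def self_attention(x, w_q, w_k, w_v):
    """Y = S V, where Q = X W_Q, K = X W_K, and V = X W_V."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return attention_scores(q, k) @ v


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                                  # 5 input row vectors
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))  # stand-ins for learned weights
y = self_attention(x, w_q, w_k, w_v)                         # one output row per input row
```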
  • the fully connected layer 462 outputs multiple different vectors by combining all of the multiple input vectors. For example, the fully connected layer 462 multiplies the matrix Y output from the self-attention layer 461 by a weight matrix from the right, and adds a bias vector to each row of the resulting matrix.
  • the weight matrix and bias vector are parameters determined by machine learning.
  • the fully connected layer 462 then outputs a matrix obtained by applying an activation function to each element of the matrix calculated in this way.
  • the weight matrix of the fully connected layer 462 has a size such that the matrix output from the fully connected layer 462 (i.e., the matrix output from the independent coding layer 460) is the same size as the input matrix of the next independent coding layer 460.
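  • A minimal sketch of the fully connected layer 462 follows. The source does not name the activation function, so ReLU is used here purely as an assumption, and the numeric values are made up. The weight matrix is square so that the output has the same size as the input.

```python
import numpy as np


def fully_connected(y, weight, bias):
    """Compute Y W + b (bias added to every row), then an element-wise activation."""
    return np.maximum(y @ weight + bias, 0.0)   # ReLU is an assumed activation


y = np.array([[1.0, -2.0], [0.5, 3.0]])          # matrix output by the self-attention layer
weight = np.array([[1.0, 0.0], [-1.0, 1.0]])     # stand-in for a learned weight matrix
bias = np.array([0.0, 1.0])                      # stand-in for a learned bias vector
out = fully_connected(y, weight, bias)
```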
  • the collaborative coding layer 470 included in the image coding layer 430 uses the features determined by the text coding layer 440 as additional inputs to encode each of the multiple vectors input from the previous layer in the image coding layer 430.
  • the collaborative coding layer 470 may further include a cross-attention layer 471.
  • the multiple vectors input to the collaborative coding layer 470 are converted into multiple different vectors by the self-attention layer 461. Some of the features determined by the self-attention layer 461 are input to the cross-attention layer 471. Some of the features determined by the collaborative coding layer 470 (specifically, the self-attention layer 461) included in the text coding layer 440 are also input to the cross-attention layer 471.
  • the cross-attention layer 471 generates and outputs multiple vectors based on these inputs.
  • the multiple vectors output from the self-attention layer 461 and the multiple vectors output from the cross-attention layer 471 are added together and input to the fully connected layer 462.
  • the fully connected layer 462 converts the multiple input vectors into multiple different vectors.
  • the multiple vectors output from the fully connected layer 462 are output from the collaborative coding layer 470.
  • Each of the multiple output vectors of the cross-attention layer 471 represents the relationship between the output vectors of the self-attention layer 461 included in the image coding layer 430 and each of the output vectors of the self-attention layer 461 included in the text coding layer 440.
  • the cross-attention layer 471 receives a query Q from the self-attention layer 461a included in the image coding layer 430, and a key K and a value V from the self-attention layer 461b included in the text coding layer 440.
  • the query Q is part of the features determined by the self-attention layer 461a.
  • the key K and the value V are part of the features determined by the self-attention layer 461b.
  • the score calculation unit 501 calculates the score S based on the query Q and the key K in the same manner as described above.
  • the cross-attention layer 471 then calculates the matrix Z by multiplying the score S from the right by the value V.
  • the cross-attention layer 471 outputs the matrix Z calculated in this manner.
  • the multiple rows of the matrix Z correspond to the multiple row vectors output from the cross-attention layer 471.
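  • The cross-attention computation on the image side can be sketched as follows, assuming the query comes from the image coding layer and the key and value come from the text coding layer as described. The function names and the matrix sizes are illustrative assumptions.

```python
import numpy as np


def softmax_rows(m):
    """Row-wise softmax, shifted by each row's maximum for numerical stability."""
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def cross_attention(q_image, k_text, v_text):
    """Z = S V, with the score S computed from an image-side Q and a text-side K."""
    s = softmax_rows(q_image @ k_text.T / np.sqrt(k_text.shape[1]))
    return s @ v_text


rng = np.random.default_rng(0)
q_image = rng.normal(size=(6, 4))   # query: one row per image patch vector
k_text = rng.normal(size=(3, 4))    # key: one row per word vector
v_text = rng.normal(size=(3, 4))    # value: one row per word vector
z = cross_attention(q_image, k_text, v_text)   # one output row per image patch vector
```

  • Because Z has one row per query row, it can be added to the self-attention output on the image side when the column counts match.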
  • the text input layer 420 converts the input text 201 into a format that is input to the text coding layer 440.
  • the text input layer 420 segments the input text 201 into multiple words and converts each word into a vector.
  • An existing technique such as word2vec may be used to vectorize the words.
  • the text coding layer 440 encodes the input text 201 (specifically, the input text 201 expressed as multiple vectors) input from the text input layer 420.
  • the text coding layer 440 may have the same layer structure as the image coding layer 430. Alternatively, the text coding layer 440 may have a layer structure different from that of the image coding layer 430.
  • the text coding layer 440 may include one or more independent coding layers 460 (two in the example of FIG. 4) and one or more collaborative coding layers 470 (two in the example of FIG. 4).
  • the independent coding layers 460 included in the text coding layer 440 each encode a plurality of vectors input from a previous layer in the text coding layer 440 without using the features determined by the image coding layer 430 as input.
  • each collaborative coding layer 470 included in the text coding layer 440 encodes a plurality of vectors input from a previous layer in the text coding layer 440 using the features determined by the image coding layer 430 as additional input.
  • the output data output from the model 400 is input to a loss function 480 when the model 400 is trained.
  • the correct answer data 113 corresponding to the input data 112 is also input to the loss function 480.
  • the loss function 480 outputs a loss based on the error between the output data and the correct answer data 113.
  • each step of the method in FIG. 7 may be processed, for example, by the processor 101 of the computer 100 executing a program read into the memory 102. Alternatively, some or all of the steps of the method in FIG. 7 may be executed by a dedicated circuit such as an application specific integrated circuit (ASIC).
  • the parameters of the model 400 may be set to random values.
  • In S701, the computer 100 acquires one piece of teacher data 111.
  • the teacher data 111 may be read from the database 110 at this point, or may be stored in the secondary storage device 106 in advance. Instead of using each piece of teacher data 111 one by one, multiple pieces of teacher data 111 may be used together as a batch.
  • In S702, the computer 100 generates output data by inputting the input data 112 contained in the teacher data 111 acquired in S701 to the model 400.
  • the output data is data for identifying the target position (e.g., point 301), the reference position (e.g., area 302) which is the position of the reference object 203, and the positional relationship of the target position with respect to the reference position.
  • In S703, the computer 100 updates the parameters of the model 400 so as to reduce the loss obtained by inputting the output data generated in S702 and the correct answer data 113 included in the teacher data 111 acquired in S701 into the loss function 480.
  • the parameter update may be performed using an existing method such as Adam.
  • the loss function 480 used in some embodiments will be described in detail.
  • the loss function 480 is based on at least two of the following: (1) the error between the target position identified by the output data of the model 400 and the target position identified by the correct answer data 113 (hereinafter, the target error); (2) the error between the reference position identified by the output data of the model 400 and the reference position identified by the correct answer data 113 (hereinafter, the reference error); and (3) the error between the positional relationship identified by the output data of the model 400 and the positional relationship identified by the correct answer data 113 (hereinafter, the relationship error).
  • the loss function 480 may be based on the target error and the reference error, or may be based on the target error and the relationship error, or may be based on the reference error and the relationship error. Furthermore, the loss function 480 may be based on all of the target error, the reference error, and the relationship error. In this way, the loss function 480 is based on at least two of the target error, the reference error, and the relationship error, so that the model 400 capable of accurately inferring the target position can be generated.
  • the loss function 480 used in some other embodiments will be described in detail.
  • the loss function 480 is based on (3) the error (the above-mentioned relationship error) between the positional relationship identified by the output data of the model 400 and the positional relationship identified by the correct answer data 113.
  • the relationship error is an error in the positional relationship of the target position with respect to the reference position, and is therefore based on both the target position and the reference position. Therefore, even if the loss function 480 is based only on the relationship error, it is possible to generate a model 400 that can accurately infer the target position.
  • the target error may be the difference between each coordinate value of point 301 identified by the output data of model 400 and each coordinate value of point 301 identified by correct answer data 113.
  • when the reference position is represented by area 302 as described above, the reference error may be the difference between the center coordinates, width, and height of area 302 identified by the output data of model 400 and the corresponding values of area 302 identified by correct answer data 113.
  • the relationship error may be the difference between each component of the two-dimensional vector identified by the output data of model 400 and each component of the two-dimensional vector identified by correct answer data 113.
  • the loss calculated by the loss function 480 may be a sum of at least two of the loss based on the target error, the loss based on the reference error, and the loss based on the relationship error. This sum may be a weighted sum, and the coefficient of the weighted sum may be determined as a hyperparameter.
  • the loss based on the target error may be, for example, an L1 loss of the target error.
  • the loss based on the reference error may be, for example, a sum of an L1 loss of the reference error and a generalized intersection over union (GIoU) loss of the reference error. This sum may be a weighted sum, and the coefficient of the weighted sum may be determined as a hyperparameter.
  • the loss based on the relationship error may be, for example, an L1 loss of the relationship error.
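  • The weighted-sum loss described above can be sketched as follows. This is an illustration under stated assumptions: the dictionary keys, the function names, the default weights, and the use of an unaveraged L1 sum are all hypothetical choices, and boxes are assumed to be in (center x, center y, width, height) form with positive width and height.

```python
import numpy as np


def l1(a, b):
    """Sum of absolute differences (not averaging is an assumption)."""
    return float(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)).sum())


def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def giou_loss(pred, true):
    """1 - GIoU for two boxes given as (cx, cy, w, h)."""
    px1, py1, px2, py2 = to_corners(pred)
    tx1, ty1, tx2, ty2 = to_corners(true)
    inter = max(0.0, min(px2, tx2) - max(px1, tx1)) * max(0.0, min(py2, ty2) - max(py1, ty1))
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    hull = (max(px2, tx2) - min(px1, tx1)) * (max(py2, ty2) - min(py1, ty1))  # smallest enclosing box
    return 1.0 - (inter / union - (hull - union) / hull)


def total_loss(pred, true, w_target=1.0, w_ref=1.0, w_rel=1.0, w_giou=1.0):
    """Weighted sum of target, reference, and relationship losses; a zero weight drops a term."""
    target = l1(pred["point"], true["point"])
    reference = l1(pred["region"], true["region"]) + w_giou * giou_loss(pred["region"], true["region"])
    relation = l1(pred["vector"], true["vector"])
    return w_target * target + w_ref * reference + w_rel * relation


true = {"point": (450, 380), "region": (500, 250, 200, 100), "vector": (-50, 130)}
pred = {"point": (460, 380), "region": (500, 250, 200, 100), "vector": (-40, 130)}
loss = total_loss(pred, true)
```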
  • In S704, the computer 100 determines whether a condition for terminating the iteration of parameter updates (hereinafter, the termination condition) is satisfied. If the computer 100 determines that the termination condition is satisfied ("YES" at S704), it terminates the process; otherwise ("NO" at S704), it returns the process to S701.
  • the termination condition may be that the parameters have been updated a predetermined number of times (i.e., the parameter update step has been executed that many times).
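  • The iteration of acquiring teacher data, generating output, updating parameters, and checking the termination condition can be sketched as follows. This is a toy illustration only: a small linear map stands in for the model 400, the loss is an L1 loss on a two-dimensional target alone, and the Adam hyperparameters are commonly used defaults chosen here as assumptions. The names `train` and `mean_l1` are hypothetical.

```python
import numpy as np


def train(dataset, num_updates, lr=0.05, seed=0):
    """Run a fixed number of Adam updates on a toy linear model and return its parameters."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=(dataset[0][0].shape[0], 2))   # random initial parameters
    m, v = np.zeros_like(w), np.zeros_like(w)                     # Adam moment estimates
    b1, b2, eps = 0.9, 0.999, 1e-8
    for step in range(1, num_updates + 1):                # stop after a predetermined number of updates
        x, target = dataset[(step - 1) % len(dataset)]    # acquire one piece of teacher data
        pred = x @ w                                      # generate output data
        grad = np.outer(x, np.sign(pred - target))        # (sub)gradient of the L1 loss w.r.t. w
        m = b1 * m + (1 - b1) * grad                      # Adam parameter update
        v = b2 * v + (1 - b2) * grad ** 2
        w -= lr * (m / (1 - b1 ** step)) / (np.sqrt(v / (1 - b2 ** step)) + eps)
    return w


def mean_l1(dataset, w):
    """Average L1 loss of the toy model over the dataset."""
    return float(np.mean([np.abs(x @ w - t).sum() for x, t in dataset]))


rng = np.random.default_rng(1)
w_true = np.array([[1.0, -1.0], [0.5, 2.0], [-1.0, 0.3]])        # made-up ground-truth mapping
dataset = [(x, x @ w_true) for x in rng.normal(size=(32, 3))]
w_init = train(dataset, num_updates=0)      # parameters before any update
w_trained = train(dataset, num_updates=500)
```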
  • the computer 100 may store the learned model 400 in the secondary storage device 106 for future processing, or may transmit it to another device (e.g., database 110).
  • the parameters of model 400 at the start of learning may be set randomly.
  • the parameters of model 400 at the start of learning may be parameters of image coding layer 430 determined by another machine learning process that uses input image 202 as input data and the position of reference object 203 as correct answer data.
  • the parameters of model 400 may be determined by fine tuning that uses parameters determined by another machine learning process.
  • A vehicle acquires voice input from an occupant through a microphone and converts the voice input into text.
  • The vehicle acquires an image by photographing the scenery in front of the vehicle.
  • The vehicle generates output data by inputting the text and image thus acquired into the model 400, and identifies a target position using the output data.
  • The vehicle executes a process specified by the voice input with respect to the target position. For example, when an instruction to "park in front of the black car on the right" is given by voice input, the vehicle identifies the position in front of the black car on the right as the target position, and controls the traveling of the vehicle so as to stop at this target position. In this way, by using the model 400, it is possible to simultaneously predict the target position and the reference object specified in the input text.
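The in-vehicle flow above can be sketched as a small pipeline. Everything here is a hypothetical stand-in: `ModelOutput`, `handle_voice_instruction`, and the callables `transcribe`, `capture_image`, `model`, and `drive_to` are names introduced for the example, and the tuple layouts are assumed rather than taken from the source.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class ModelOutput:
    target_position: Tuple[float, float]             # assumed image coordinates
    reference_box: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)
    relation: Tuple[float, float]                     # assumed offset from reference


def handle_voice_instruction(
    audio: Any,
    transcribe: Callable[[Any], str],
    capture_image: Callable[[], Any],
    model: Callable[[str, Any], ModelOutput],
    drive_to: Callable[[Tuple[float, float]], None],
) -> ModelOutput:
    """Voice -> text, camera -> image, model -> target position, then control."""
    text = transcribe(audio)            # convert the voice input into text
    image = capture_image()             # photograph the scenery ahead
    output = model(text, image)         # generate output data with the model
    drive_to(output.target_position)    # act on the identified target position
    return output
```

Passing the components in as callables keeps the sketch independent of any particular speech recognizer, camera interface, or vehicle controller.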
  • [Item 1] A learning device (100) for performing machine learning, comprising: an acquisition means for acquiring teacher data (111) including input data (112) and correct answer data (113), the input data including an input image (202) including a reference object (203) and an input text (201) that relatively specifies a target position (301) by referring to the reference object; a generating means for generating, by inputting the input data into a model (400), output data for identifying the target position, a reference position (302) which is the position of the reference object, and a positional relationship of the target position with respect to the reference position; and an update means for updating parameters of the model so that a loss obtained by inputting the output data and the correct answer data into a loss function (480) is reduced;
  • the loss function is a first error between the target position specified by the output data and the target position specified by the correct answer data; a second error between the reference position identified by the output data and the reference position identified by the correct answer data; and a third error between the positional relationship specified by the output data and the positional relationship specified
  • A model capable of inferring the target position with high accuracy can be generated.
  • [Item 2] The learning device according to item 1, wherein the input image includes an image captured by a camera (211) of a vehicle (210). According to this item, a model can be generated that can accurately infer the target position used for controlling the vehicle.
  • [Item 3] The learning device according to item 1 or 2, wherein the input text is expressed in a natural language. According to this item, a model can be generated that can accurately infer a target position specified in natural language.
  • [Item 4] The learning device according to any one of items 1 to 3, wherein the loss function is based on at least the first error.
  • According to this item, since the loss function is based on the first error, it is possible to generate a model that can infer the target position with even greater accuracy.
  • [Item 5] The learning device according to any one of items 1 to 4, wherein the loss function is based on all of the first error, the second error, and the third error. According to this item, a model can be generated that can infer the target position with even greater accuracy.
  • [Item 6] The learning device according to any one of items 1 to 5, wherein the model includes an image encoding layer (430) for encoding the input image and a text encoding layer (440) for encoding the input text, a portion of the features determined by the text encoding layer is input to the image encoding layer, and a portion of the features determined by the image encoding layer is input to the text encoding layer. According to this item, both the input image and the input text can be reflected in the output data.
  • [Item 7] The learning device according to item 6, wherein the image encoding layer and the text encoding layer have the same layer structure. According to this item, the model becomes easier to implement.
  • [Item 8] The learning device according to item 6 or 7, wherein the machine learning of the model is a first machine learning, and the parameters of the model at the start of learning are parameters of the image encoding layer determined by a second machine learning process using the input image as input data and the position of the reference object as correct answer data. According to this item, the model training time can be shortened.
  • A learning device (100) for performing machine learning, comprising: an acquisition means for acquiring teacher data (111) including input data (112) and correct answer data (113), the input data including an input image (202) including a reference object (203) and an input text (201) that relatively specifies a target position (301) by referring to the reference object; a generating means for generating, by inputting the input data into a model (400), output data for identifying the target position, a reference position (302) which is the position of the reference object, and a positional relationship of the target position with respect to the reference position; and an update means for updating parameters of the model so that a loss obtained by inputting the output data and the correct answer data into a loss function (480) is reduced, wherein the loss function is based on an error between the positional relationship identified by the output data and the positional relationship identified by the correct answer data.
  • A learning method for machine learning, comprising: an acquisition step of acquiring teacher data (111) including input data (112) and correct answer data (113), the input data including an input image (202) including a reference object (203) and an input text (201) that relatively specifies a target position (301) by referring to the reference object; a generation step of inputting the input data into a model (400) to generate output data for identifying the target position, a reference position (302) which is the position of the reference object, and a positional relationship of the target position with respect to the reference position; and an update step of updating parameters of the model so that a loss obtained by inputting the output data and the correct answer data into a loss function (480) is reduced;
  • the loss function is a first error between the target position specified by the output data and the target position specified by the correct answer data; a second error between the reference position identified by the output data and the reference position identified by the correct answer data; and a third error between the positional relationship specified by the output data and the positional relationship specified by the correct answer data; or the

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
PCT/JP2024/018221 2023-06-26 2024-05-16 Learning device, learning method, and program Ceased WO2025004590A1 (ja)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2025529506A JPWO2025004590A1 2023-06-26 2024-05-16
CN202480042444.3A CN121605423A (zh) 2023-06-26 2024-05-16 Learning device, learning method, and program
EP24831462.7A EP4730218A1 (en) 2023-06-26 2024-05-16 Learning device, learning method, and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US18/213,980 US12579820B2 (en) 2023-06-26 2023-06-26 Learning apparatus and learning method
US18/213,980 2023-06-26

Publications (1)

Publication Number Publication Date
WO2025004590A1 (ja) 2025-01-02

Family

ID=93929730

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2024/018221 Ceased WO2025004590A1 (ja) 2023-06-26 2024-05-16 Learning device, learning method, and program

Country Status (5)

Country Link
US (1) US12579820B2
EP (1) EP4730218A1
JP (1) JPWO2025004590A1
CN (1) CN121605423A
WO (1) WO2025004590A1

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009193097A (ja) * 2008-02-12 2009-08-27 Yaskawa Electric Corp Control device for mobile robot and mobile robot system
JP2015149013A (ja) * 2014-02-07 2015-08-20 Toyota Motor Corp Method for setting target position of moving body
JP2022513866A (ja) 2018-12-21 2022-02-09 Waymo LLC Object classification using extra-regional context
WO2023101679A1 (en) * 2021-12-02 2023-06-08 Innopeak Technology, Inc. Text-image cross-modal retrieval based on virtual word expansion
CN116310920A (zh) * 2023-03-20 2023-06-23 Chongqing University of Posts and Telecommunications Image privacy prediction method based on scene context awareness

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11574142B2 (en) * 2020-07-30 2023-02-07 Adobe Inc. Semantic image manipulation using visual-semantic joint embeddings
US11468688B2 (en) * 2020-07-31 2022-10-11 Toyota Motor Engineering & Manufacturing North America, Inc. Vehicle sensor data sharing
US12394085B2 (en) * 2020-11-16 2025-08-19 Waymo Llc Long range distance estimation using reference objects
US11663294B2 (en) * 2021-03-16 2023-05-30 Toyota Research Institute, Inc. System and method for training a model using localized textual supervision
US12271792B2 (en) * 2021-05-26 2025-04-08 Salesforce, Inc. Systems and methods for vision-and-language representation learning
US12254707B2 (en) * 2022-09-28 2025-03-18 Lemon Inc. Pre-training for scene text detection
US20240257536A1 (en) * 2023-01-30 2024-08-01 Argo AI, LLC System, Method, and Computer Program Product for Streaming Data Mining with Text-Image Joint Embeddings
US20250095393A1 (en) * 2023-09-20 2025-03-20 Adobe Inc. Text-augmented object centric relationship detection
US11978271B1 (en) * 2023-10-27 2024-05-07 Google Llc Instance level scene recognition with a vision language model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HATANAKA, SHUMPEI; WEI, YANG; HOSOMI, NAOKI; MISU, TERUHISA; YAMADA, KENTAROU; SUGIURA, KOMEI: "Target Position Prediction Using UNITER Regressor for Understanding Navigation Instructions in Urban Areas", JSAI TECHNICAL REPORT, SIG-KBS, JAPANESE SOCIETY FOR ARTIFICIAL INTELLIGENCE, JAPAN, vol. 126, 19 August 2022 (2022-08-19), Japan, pages 34 - 39, XP009560840, ISSN: 2436-4592, DOI: 10.11517/jsaikbs.126.0_34 *
ZI-YI DOU ET AL.: "Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone", ARXIV, 18 November 2022 (2022-11-18), Retrieved from the Internet <URL:https://arxiv.org/pdf/2206.07643.pdf>

Also Published As

Publication number Publication date
JPWO2025004590A1 2025-01-02
EP4730218A1 (en) 2026-04-22
US20240428597A1 (en) 2024-12-26
CN121605423A (zh) 2026-03-03
US12579820B2 (en) 2026-03-17

Similar Documents

Publication Publication Date Title
US12499363B2 (en) Learning to generate synthetic datasets for training neural networks
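JP6441980B2 (ja) Method, computer, and program for generating teacher images
EP3542319B1 (en) Training neural networks using a clustering loss
JP6728487B2 (ja) Electronic device and control method thereof
CN117597703A (zh) Multi-scale transformer for image analysis
JP2018200531A (ja) Teacher data generation device, teacher data generation method, teacher data generation program, and object detection system
CN112241784A (zh) Training a generative model and a discriminative model
JP6612486B1 (ja) Learning device, classification device, learning method, classification method, learning program, and classification program
CN116663650A (zh) Training method for deep learning model, target object detection method, and device
CN118228743B (zh) Multimodal machine translation method and device based on text-image attention mechanism
WO2025004590A1 (ja) Learning device, learning method, and program
JP2023009344A (ja) Generation method, information processing device, program, and information processing system
CN111899284A (zh) Planar target tracking method based on parameterized ESM network
Fujita et al. Perceptual font manifold from generative model
CN116994142B (zh) Method and device for detecting image difference information, and non-volatile storage medium
US20250315971A1 (en) Learning apparatus, estimation apparatus, learning method, estimation method, and storage medium
CN117574314B (zh) Sensor information fusion method, apparatus, device, and storage medium
KR102724395B1 (ko) Apparatus and method for optimizing parameter set
CN115239930B (zh) Scene generation method, apparatus, computer device, and storage medium
JP7666099B2 (ja) City block design device and program
JP4691659B2 (ja) Image recognition device, image recognition method, and program
WO2025210865A1 (ja) Learning device, learning method, and program
JP2025152537A (ja) Learning device, estimation device, learning method, estimation method, and program
CN121214323A (zh) Lightweight monocular 3D object detection method for robot platforms
CN121336211A (zh) Training neural networks using contrastive knowledge distillation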

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24831462

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2025529506

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2025529506

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 2024831462

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2024831462

Country of ref document: EP

Effective date: 20260119

NENP Non-entry into the national phase

Ref country code: DE
