WO2025018220A1 - Search device, relationship information generation device, search method, relationship information generation method, and program - Google Patents

Search device, relationship information generation device, search method, relationship information generation method, and program

Info

Publication number
WO2025018220A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
relationship
relation
feature
objects
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/JP2024/024735
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
智史 山崎
健全 劉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp
Priority to JP2025533995A (JPWO2025018220A1)
Publication of WO2025018220A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53: Querying
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/60: Analysis of geometric attributes

Definitions

  • This disclosure relates to a search device, a relationship information generation device, a search method, a relationship information generation method, and a program.
  • Patent Document 1 discloses a technology for managing multiple images by associating each of the images with features of the posture of the person contained in the image.
  • the search device of Patent Document 1 calculates features for the posture of the person contained in a query image, and searches an image database for images associated with features that match the calculated features. As a result, it is possible to search for images that include a person in a desired posture from the image database.
  • Patent Document 1 does not anticipate searching for information based on anything other than a person's posture. This disclosure has been made in light of this issue, and one of its objectives is to provide a new technology for searching for information obtained from images.
  • the search device disclosed herein includes an acquisition means for acquiring query data representing a relationship between objects for a pair of objects, a calculation means for calculating a relationship feature representing a characteristic of the relationship between the objects from the query data, and a detection means for detecting a relationship feature matching the calculated relationship feature from object relationship information in which the relationship feature is associated with each of a plurality of object pairs.
  • the object relationship information is generated using one or more image data.
  • the relationship information generating device disclosed herein includes an acquisition means for acquiring source image data, a detection means for detecting one or more pairs of objects from the source image data, a calculation means for calculating a relationship feature quantity that represents a characteristic of the relationship between the objects for the pair, and a generation means for generating object relationship information that indicates the pair and the calculated relationship feature quantity for the pair in association with each other.
  • the search method disclosed herein includes an acquisition step of acquiring query data representing a relationship between objects for a pair of objects, a calculation step of calculating a relationship feature representing a characteristic of the relationship between the objects from the query data, and a detection step of detecting a relationship feature matching the calculated relationship feature from object relationship information in which the relationship feature is associated with each of a plurality of object pairs.
  • the object relationship information is generated using one or more image data.
  • the relationship information generating method disclosed herein includes an acquisition step of acquiring source image data, a detection step of detecting one or more object pairs from the source image data, a calculation step of calculating a relationship feature quantity representing characteristics of the relationship between the objects for the pair, and a generation step of generating object relationship information in which the pair and the calculated relationship feature quantity for the pair are associated with each other.
  • the first program of the present disclosure causes a computer to execute an acquisition step of acquiring query data representing a relationship between objects for a pair of objects, a calculation step of calculating a relationship feature representing a characteristic of the relationship between the objects from the query data, and a detection step of detecting a relationship feature that matches the calculated relationship feature from object relationship information in which the relationship feature is associated with each of a plurality of object pairs.
  • the object relationship information is generated using one or more image data.
  • the second program of the present disclosure causes a computer to execute an acquisition step of acquiring source image data, a detection step of detecting one or more pairs of objects from the source image data, a calculation step of calculating, for the pairs, a relation feature that represents a characteristic of the relationship between the objects, and a generation step of generating object relation information that indicates the pairs in correspondence with the relation feature calculated for the pairs.
  • This disclosure provides a new technique for searching information obtained from images.
  • FIG. 1 is a diagram illustrating an example of an outline of the operation of the search device.
  • FIG. 2 is a block diagram illustrating a functional configuration of the search device.
  • FIG. 3 is a block diagram illustrating a hardware configuration of a computer that realizes the search device.
  • FIG. 4 is a flowchart illustrating a flow of a process executed by the search device.
  • FIG. 5 is a diagram illustrating an example of a configuration of object relationship information.
  • FIG. 6 is a diagram illustrating an example of the configuration of object relationship information in which an identifier of source image data is indicated.
  • FIG. 7 is a diagram illustrating an example of a configuration of object information.
  • FIG. 8 is a diagram illustrating the structure of the vision-and-language model.
  • A diagram illustrating an example of an outline of the operation of the relationship information generating device.
  • A diagram illustrating a functional configuration of the relationship information generating device.
  • A flowchart illustrating a process executed by the relationship information generating device.
  • A diagram conceptually illustrating a process performed by the calculation unit.
  • A diagram illustrating an example of an outline of the operation of the training device.
  • A diagram illustrating an example of a functional configuration of the training device.
  • A flowchart illustrating a process executed by the training device.
  • Predetermined values, such as threshold values, are stored in advance in a storage device accessible from the device that uses them.
  • the storage unit is composed of one or any number of storage devices.
  • Fig. 1 is a diagram illustrating an example of an outline of the operation of a search device 2000.
  • Fig. 1 is a diagram for facilitating understanding of the outline of the search device 2000, and the operation of the search device 2000 is not limited to the operation shown in Fig. 1.
  • the search device 2000 is used to search for desired information from the object relationship information 20.
  • the object relationship information 20 indicates a relationship feature for each of a plurality of pairs of objects.
  • the relationship feature of an object pair is a feature that represents the characteristics of the relationship between two objects. One of the two objects is the subject and the other is the object.
  • the relationship between a subject and an object represents an action being performed by the subject towards the object.
  • For example, in the situation "a person is lifting a ball", the subject (the person) performs the action of "lifting" on the object (the ball). There is therefore a relationship of "lifting" between the two objects, the person and the ball. Accordingly, the object relationship information 20 generated for this situation shows the object pair of the person and the ball in association with a relationship feature that represents the relationship "lifting".
  • Hereinafter, a pair of objects is also referred to as an "object pair".
  • The object relationship information 20 is generated, for example, by analyzing one or more source image data.
  • The source image data may be still image data generated by a still camera, or video frames obtained from video data generated by a video camera.
  • For example, suppose that the source image data includes a person sitting on a chair and a dog holding a ball in its mouth.
  • This source image data shows the situation "a person is sitting on a chair" and the situation "a dog is holding a ball in its mouth".
  • The object relationship information 20 generated from this source image data shows a relationship feature representing the relationship "sitting" in association with the object pair of the person and the chair.
  • It also shows a relationship feature representing the relationship "holding" in association with the object pair of the dog and the ball.
  • The relationship feature may include information about the subject, information about the object, or both.
  • When the relationship feature includes information about the subject, the relationship feature generated for the situation "a person is sitting on a chair" represents the information that "a person is sitting on something."
  • When it includes information about the object, the relationship feature generated for the same situation represents the information that "something is sitting on the chair."
  • When it includes both, the relationship feature generated for the same situation represents the information that "a person is sitting on a chair."
  • a search by the search device 2000 is performed, for example, as follows.
  • the search device 2000 acquires query data 10.
  • the query data 10 is information representing one or more object pairs and the relationship between the objects in each object pair.
  • For example, the query data 10 is image data that includes one or more object pairs.
  • Alternatively, the query data 10 is text data in which the relationship between the objects is expressed in words for one or more object pairs.
  • Such text data is, for example, data indicating text such as "a person is sitting on a dog."
  • the search device 2000 uses the query data 10 to calculate a relation feature for each object pair indicated in the query data 10. Furthermore, the search device 2000 uses the relation feature calculated from the query data 10 to detect, from the object relation information 20, a relation between objects that matches the relation between objects represented by the query data 10.
  • the search device 2000 outputs information based on the detection result.
  • the information output by the search device 2000 indicates, for example, information on object pairs associated with relational features detected from the object relation information 20, and source image data used to generate records of the object relation information 20 that indicate the relational features. Details of the information output by the search device 2000 will be described later.
  • the search device 2000 can search for information based on the relationship between objects. Specifically, it is possible to search for a relationship between objects that matches the relationship between objects in the object pair represented by the query data 10 from among the relationships between objects in each object pair extracted from the source image data. In this way, the search device 2000 provides a new technique for searching for information obtained from images, which is a search using the relationship between objects as a key.
  • the relationships between objects are expressed by relationship features.
  • the search device 2000 then performs a search using the relationship features.
  • One conceivable alternative is to use labels such as "sitting" or "holding", instead of relationship features, as the data representing relationships between objects.
  • That approach, however, requires a set of usable labels (a label group) to be defined in advance.
  • In the search device 2000, by contrast, the relationships between objects are expressed using relationship features, so there is no need to predefine the types of relationships between objects that the search device 2000 can handle. The search device 2000 can therefore handle various relationships between objects.
  • the search device 2000 of this embodiment will be described in more detail below.
  • <Example of functional configuration> FIG. 2 is a block diagram illustrating a functional configuration of the search device 2000.
  • the search device 2000 includes an acquisition unit 2020, a calculation unit 2040, and a detection unit 2060.
  • the acquisition unit 2020 acquires query data 10.
  • the calculation unit 2040 uses the query data 10 to calculate a relation feature amount for an object pair indicated in the query data 10.
  • the detection unit 2060 uses the relation feature amount calculated from the query data 10 to detect, from the object relation information 20, a relation between objects that matches the relation between objects indicated by the query data 10.
  • Each functional component of the search device 2000 may be realized by hardware that realizes each functional component (e.g., a hardwired electronic circuit, etc.), or may be realized by a combination of hardware and software (e.g., a combination of an electronic circuit and a program that controls it, etc.).
  • a further explanation will be given of the case where each functional component of the search device 2000 is realized by a combination of hardware and software.
  • FIG. 3 is a block diagram illustrating an example of the hardware configuration of a computer 1000 that realizes the search device 2000.
  • The computer 1000 is any computer.
  • For example, the computer 1000 is a stationary computer such as a PC (Personal Computer) or a server machine.
  • Alternatively, the computer 1000 is a portable computer such as a smartphone or a tablet terminal.
  • The computer 1000 may be a dedicated computer designed to realize the search device 2000, or may be a general-purpose computer.
  • For example, each function of the search device 2000 is realized on the computer 1000 by installing a predetermined application on the computer 1000.
  • The application is composed of a program for realizing each functional component of the search device 2000.
  • the method of acquiring the program is arbitrary.
  • the program can be acquired from a storage medium (such as a DVD disc or USB memory) on which the program is stored.
  • the program can be acquired by downloading the program from a server device that manages the storage device on which the program is stored.
  • Computer 1000 has bus 1020, processor 1040, memory 1060, storage device 1080, input/output interface 1100, and network interface 1120.
  • Bus 1020 is a data transmission path for processor 1040, memory 1060, storage device 1080, input/output interface 1100, and network interface 1120 to transmit and receive data to and from each other.
  • the method of connecting processor 1040 and the like to each other is not limited to bus connection.
  • The processor 1040 is any of various processors, such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a DSP (Digital Signal Processor), or an FPGA (Field-Programmable Gate Array).
  • the memory 1060 is a primary storage device realized using RAM (Random Access Memory) or the like.
  • the storage device 1080 is an auxiliary storage device realized using a hard disk, SSD (Solid State Drive), memory card, or ROM (Read Only Memory) or the like.
  • the input/output interface 1100 is an interface for connecting the computer 1000 to an input/output device.
  • an input device such as a keyboard and an output device such as a display device are connected to the input/output interface 1100.
  • the network interface 1120 is an interface for connecting the computer 1000 to a network.
  • This network may be a LAN (Local Area Network) or a WAN (Wide Area Network).
  • the storage device 1080 stores a program that realizes each functional component of the search device 2000 (a program that realizes the application described above).
  • the processor 1040 reads this program into the memory 1060 and executes it to realize each functional component of the search device 2000.
  • the search device 2000 may be realized by one computer 1000, or may be realized by multiple computers 1000. In the latter case, the configuration of each computer 1000 does not need to be the same, and can be different from each other.
  • <Processing flow> FIG. 4 is a flowchart illustrating a process executed by the search device 2000.
  • the acquisition unit 2020 acquires the query data 10 (S102).
  • the calculation unit 2040 calculates a relation feature amount for an object pair indicated in the query data 10 (S104).
  • The detection unit 2060 uses the relationship feature calculated from the query data 10 to detect, from the object relationship information 20, a relationship between objects that matches the relationship between objects indicated by the query data 10 (S106).
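The three steps S102 to S106 can be sketched as a small Python function. This is only an illustrative sketch: the names `search`, `calc_feature`, and `similarity`, the record layout, the dot-product similarity, and the toy feature table standing in for a trained model are all assumptions made for this example and do not come from the disclosure.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical record layout: (subject identifier, object identifier, relationship feature).
Record = Tuple[str, str, Vector]


def search(
    query_data: str,
    calc_feature: Callable[[str], Vector],
    records: List[Record],
    similarity: Callable[[Vector, Vector], float],
    threshold: float,
) -> List[Record]:
    """Return the records whose relationship feature matches the query."""
    # S102 is assumed to be done already: query_data is the acquired query.
    query_feature = calc_feature(query_data)  # S104
    # S106: keep records whose similarity to the query feature reaches the threshold.
    return [r for r in records if similarity(query_feature, r[2]) >= threshold]


def dot(a: Vector, b: Vector) -> float:
    """Dot product, used here only as a stand-in similarity measure."""
    return sum(x * y for x, y in zip(a, b))


# Toy stand-in for a trained relationship feature calculation model.
TOY_FEATURES = {"sit": (1.0, 0.0), "hold": (0.0, 1.0)}

records: List[Record] = [
    ("person1", "chair1", (0.9, 0.1)),
    ("dog1", "ball1", (0.1, 0.9)),
]
hits = search("sit", TOY_FEATURES.__getitem__, records, dot, threshold=0.5)
print([(s, o) for s, o, _ in hits])
```

Passing the feature model and the similarity measure as parameters keeps the flow independent of any particular model or metric.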
  • The acquisition unit 2020 acquires the query data 10 (S102).
  • For example, the query data 10 is image data generated by capturing an image of one or more object pairs.
  • For instance, image data generated by capturing an image of a person sitting on a chair represents a relationship of "sitting" for the object pair of the person and the chair.
  • the query data 10 may represent distance measurement data generated by sensing one or more object pairs with a distance measurement sensor.
  • Various sensors such as a LiDAR (Light Detection And Ranging) sensor or a depth camera may be used as the distance measurement sensor.
  • the distance measurement data may be, for example, a distance image or point cloud data.
  • Alternatively, the query data 10 is text data representing the relationship between the objects of one or more object pairs.
  • Such text data is, for example, text data representing the sentence "A person is sitting in a chair."
  • the text data representing the relationship between objects may be a tuple that lists a subject, an object, and the relationship between them.
  • An example of a tuple that lists a subject, an object, and the relationship between them is (person, chair, sit).
  • the query data 10 includes, for example, two or more of the above-mentioned sentences or tuples.
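A tuple-style query such as (person, chair, sit) could be parsed as in the following sketch. The class name `RelationTuple`, the function `parse_relation_tuple`, and the exact textual format (parentheses, comma separators) are assumptions for illustration; the disclosure only states that a tuple lists a subject, an object, and the relationship between them.

```python
from typing import List, NamedTuple


class RelationTuple(NamedTuple):
    subject: str   # the object acting as the subject
    obj: str       # the object acting as the object
    relation: str  # the relationship between them


def parse_relation_tuple(text: str) -> RelationTuple:
    """Parse text of the assumed form "(subject, object, relation)"."""
    parts = [p.strip() for p in text.strip().strip("()").split(",")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected (subject, object, relation), got {text!r}")
    return RelationTuple(*parts)


# Query data containing two tuples, one per object pair.
query_data: List[RelationTuple] = [
    parse_relation_tuple(s) for s in ("(person, chair, sit)", "(dog, ball, hold)")
]
print(query_data[0])
```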
  • the query data 10 may also represent voice data that represents the relationship between objects in an object pair.
  • the voice data may be, for example, a voice representation of the text data described above.
  • the acquisition unit 2020 provides a screen for inputting the query data 10 to the user of the search device 2000.
  • the acquisition unit 2020 acquires the data input on this screen as the query data 10.
  • the acquisition unit 2020 acquires the query data 10 by receiving the query data 10 transmitted from another device such as a terminal used by the user (e.g., a PC or a smartphone).
  • The acquisition unit 2020 may also acquire the query data 10 by reading it from a storage unit accessible from the search device 2000.
  • the object relation information 20 indicates an association between an object pair and a relation feature.
  • the relation feature corresponding to a certain object pair represents the characteristics of the relation between the objects in the object pair.
  • FIG. 5 is a diagram illustrating an example of the configuration of object relation information 20.
  • object relation information 20 includes columns named object pairs 22 and relation features 23.
  • the object pair 22 indicates an object pair. More specifically, the object pair 22 indicates a correspondence between a subject identifier 24 and an object identifier 25.
  • the subject identifier 24 is an identifier of the object that is the subject.
  • the object identifier 25 is an identifier of the object that is the object.
  • the relation feature 23 indicates a relation feature that represents the characteristics of the subject-object relationship for the corresponding object pair.
  • the relation feature is represented, for example, as a numeric vector of a predetermined length (a vector in which each element indicates a numeric value).
  • When the object relationship information 20 is generated using multiple source image data, it is preferable for each record of the object relationship information 20 to indicate the identifier of the source image data used to generate that record.
  • FIG. 6 is a diagram illustrating an example of the configuration of object relationship information 20 in which the identifier of the source image data is indicated.
  • In FIG. 6, the object relationship information 20 further includes a column named source identifier 21.
  • the source identifier 21 indicates an identifier of the source image data.
  • the identifier of the source image data is represented, for example, by the file name or path of the file representing the source image data.
  • When the source image data is a video frame, its identifier may be represented by a frame number.
  • In that case, the identifier of the source image data is represented, for example, by a combination of the identifier of the video data (such as the file name or path) and the frame number.
  • When the object relationship information 20 is generated using video data, the relationship for a certain object pair may continue across multiple video frames.
  • In this case, the source identifier 21 may indicate multiple identifiers of the source image data.
  • the object relationship information 20 may indicate time information together with the source identifier 21 or instead of the source identifier 21.
  • each record of the object relationship information 20 indicates the date and time of generation of the source image data used to generate that record.
  • the source identifier 21 may indicate a combination of the start and end points of the relationship indicated by that record.
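One possible in-memory shape for a record of the object relationship information 20, including source identifiers, is sketched below. The class and field names (`SourceId`, `RelationRecord`, `frame_span`) and the file name `cam1.mp4` are hypothetical; the sketch assumes video-frame sources identified by a video identifier plus a frame number.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SourceId:
    video: str  # identifier of the video data, e.g. a file name or path
    frame: int  # frame number within that video


@dataclass
class RelationRecord:
    subject_id: str                 # subject identifier 24
    object_id: str                  # object identifier 25
    feature: Tuple[float, ...]      # relationship feature 23 (numeric vector)
    sources: List[SourceId] = field(default_factory=list)  # source identifier 21

    def frame_span(self) -> Optional[Tuple[int, int]]:
        """First and last frame in which the relationship was observed."""
        frames = [s.frame for s in self.sources]
        return (min(frames), max(frames)) if frames else None


# A relationship that continues across three consecutive video frames.
record = RelationRecord(
    "o1", "o2", (0.2, 0.7),
    [SourceId("cam1.mp4", f) for f in range(10, 13)],
)
print(record.frame_span())
```

Keeping a list of source identifiers per record covers both the single-image case (one entry) and a relationship that spans several frames.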
  • In the object relationship information 20 described above, each object is represented by an identifier.
  • In some cases, further information such as the type of each object is required.
  • Information regarding each object may be included in the object relationship information 20, or may be managed separately from the object relationship information 20.
  • Hereinafter, information in which an object identifier is associated with information regarding the object is referred to as object information.
  • FIG. 7 is a diagram illustrating the configuration of object information.
  • Object information 30 has columns named object identifier 31, type 32, and area 33.
  • Object identifier 31 indicates the identifier of the object.
  • Type 32 indicates the type of object.
  • Area 33 indicates the position of the image area (hereinafter, object area) that indicates the corresponding object in the source image data. For example, if the circumscribing rectangle of an object is treated as the object area, area 33 indicates information that can identify the circumscribing rectangle (for example, the coordinates of the upper left corner of the circumscribing rectangle and the coordinates of the lower right corner of the circumscribing rectangle).
  • When the multiple source image data used to generate the object relationship information 20 are time-series data such as video data, the same object may be detected from multiple source image data.
  • In that case, the area 33 indicates multiple pairs of a time point and the object area at that time point.
  • The time point may be represented by an absolute value or a relative value. In the former case, the time point is represented by, for example, a date and time. In the latter case, the time point is represented by, for example, a frame number.
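The object information 30 with a time-indexed area 33 could be modeled as in this sketch. It assumes relative time points (frame numbers) and bounding rectangles given as (left, top, right, bottom) pixel coordinates; the names `ObjectInfo`, `areas`, and `area_at`, and the sample values, are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Assumed rectangle encoding: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class ObjectInfo:
    object_id: str  # object identifier 31
    type: str       # type 32
    # area 33: one object area per time point, keyed by frame number.
    areas: Dict[int, Box] = field(default_factory=dict)

    def area_at(self, frame: int) -> Optional[Box]:
        """Object area at the given frame, or None if the object was not detected."""
        return self.areas.get(frame)


dog = ObjectInfo("o3", "dog", {10: (40, 60, 120, 140), 11: (44, 60, 124, 140)})
print(dog.area_at(11))
```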
  • a scene graph is an example of a data structure that represents the relationships between objects.
  • each object is represented by a node.
  • the relationship between two objects is represented by an edge connecting those objects.
  • Each record of the object relationship information 20 described above can also be represented as a scene graph in which the subject and object are represented by nodes, and relationship features are attached to the edges. From this, the process performed by the search device 2000 can also be considered as a process of detecting a desired node pair from a scene graph in which relationship features are attached to edges that represent relationships between objects.
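The scene-graph view of the records can be sketched as follows: objects become nodes and each record becomes a directed edge from subject to object that carries the relationship feature. The function name `build_scene_graph` and the tuple-based record layout are assumptions for this example.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple

Vector = Sequence[float]
# Hypothetical record layout: (subject identifier, object identifier, relationship feature).
Record = Tuple[str, str, Vector]


def build_scene_graph(
    records: Iterable[Record],
) -> Tuple[Set[str], Dict[Tuple[str, str], Vector]]:
    """Build (nodes, edges), where edges maps (subject, object) to a feature."""
    nodes: Set[str] = set()
    edges: Dict[Tuple[str, str], Vector] = {}
    for subject_id, object_id, feature in records:
        nodes.add(subject_id)
        nodes.add(object_id)
        # Simplification: one edge per ordered pair; a later record for the
        # same pair overwrites an earlier one.
        edges[(subject_id, object_id)] = feature
    return nodes, edges


nodes, edges = build_scene_graph([
    ("person1", "chair1", (0.9, 0.1)),
    ("dog1", "ball1", (0.1, 0.9)),
])
print(sorted(nodes))
```

Searching for a relationship then amounts to scanning the edge features and returning the node pairs whose feature matches the query.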
  • The calculation unit 2040 calculates a relationship feature by using the query data 10 (S104).
  • For example, a trained model (hereinafter, a relationship feature calculation model) is used to calculate the relationship feature.
  • A machine learning-based model such as a neural network is used as the relationship feature calculation model.
  • the calculation unit 2040 inputs the query data 10 into the relation feature calculation model.
  • the calculation unit 2040 acquires the relation feature output from the relation feature calculation model.
  • a feature calculation model included in the vision-and-language model can be used as the relational feature calculation model.
  • the vision-and-language model is a model that can execute processes that handle both visual data and linguistic data. Examples of visual data include image data and point cloud data. Examples of linguistic data include text data.
  • An example of a process that handles both visual data and linguistic data is a process that compares the contents of image data with the contents of text data.
  • a more specific example would be a process that determines whether the relationships between objects contained in image data match the relationships between objects represented by text data.
  • Figure 8 is a diagram illustrating the configuration of a vision-and-language model.
  • the vision-and-language model 40 acquires visual data 50 and linguistic data 60.
  • the vision-and-language model 40 determines whether the contents of the visual data 50 and the linguistic data 60 match.
  • the vision-and-language model 40 calculates mutually comparable features from the visual data 50 and the linguistic data 60 in order to compare the contents of the visual data 50 and the linguistic data 60.
  • the vision-and-language model 40 has a first feature calculation model 42 and a second feature calculation model 44.
  • the first feature calculation model 42 calculates the features of the visual data 50.
  • the second feature calculation model 44 calculates the features of the linguistic data 60.
  • the vision-and-language model 40 further includes a judgment model 46.
  • the judgment model 46 compares the features obtained from the visual data 50 with the features obtained from the linguistic data 60 to judge whether the contents of the visual data 50 and the linguistic data 60 match.
  • the vision-and-language model 40 determines whether the relationship between objects represented by the visual data 50 matches the relationship between objects represented by the linguistic data 60.
  • the first feature calculation model 42 is trained to calculate features of the relationship between objects represented by the visual data 50.
  • the second feature calculation model 44 is trained to calculate features of the relationship between objects represented by the linguistic data 60.
  • the relational feature calculation model to be included in the calculation unit 2040 can be obtained from a vision-and-language model 40 that handles the same type of data as the type of query data 10.
  • For example, suppose that the query data 10 is image data.
  • In this case, the first feature calculation model 42 of a trained vision-and-language model 40 in which image data is treated as the visual data 50 can be used as the relationship feature calculation model of the calculation unit 2040.
  • Alternatively, suppose that the query data 10 is text data. In this case, the second feature calculation model 44 of a trained vision-and-language model 40 in which text data expressing the relationship between objects in sentences is treated as the linguistic data 60 can be used as the relationship feature calculation model of the calculation unit 2040.
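The structure of the vision-and-language model 40, and the way one of its feature calculation models can be reused for query data, might be sketched as follows. Everything here is a stand-in: the class and function names are hypothetical, the "models" are lookup tables rather than trained networks, and the judgment model is reduced to an equality check.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

Vector = Sequence[float]


@dataclass
class VisionLanguageModel:
    first_feature_model: Callable[[Any], Vector]       # model 42: visual data
    second_feature_model: Callable[[str], Vector]      # model 44: linguistic data
    judgment_model: Callable[[Vector, Vector], bool]   # model 46: compares features

    def matches(self, visual_data: Any, linguistic_data: str) -> bool:
        """Judge whether the two inputs represent the same relationship."""
        return self.judgment_model(
            self.first_feature_model(visual_data),
            self.second_feature_model(linguistic_data),
        )


def relationship_feature_model_for(vlm: VisionLanguageModel, query_kind: str):
    """Pick the sub-model that handles the same type of data as the query."""
    if query_kind == "image":
        return vlm.first_feature_model
    if query_kind == "text":
        return vlm.second_feature_model
    raise ValueError(f"unsupported query kind: {query_kind!r}")


# Toy lookup tables standing in for trained feature calculation models.
VISUAL = {"photo_of_sitting": (1.0, 0.0), "photo_of_holding": (0.0, 1.0)}
TEXT = {"a person is sitting in a chair": (1.0, 0.0)}
vlm = VisionLanguageModel(VISUAL.__getitem__, TEXT.__getitem__, lambda a, b: a == b)

calc_feature = relationship_feature_model_for(vlm, "text")
print(calc_feature("a person is sitting in a chair"))
```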
  • Suppose that the query data 10 is visual data such as image data and also includes information other than the object pair.
  • In this case, the calculation unit 2040 detects an area representing the object pair from the visual data, and inputs the detected area into the relationship feature calculation model.
  • For example, the calculation unit 2040 detects, from the query data 10, the object region of the subject of the object pair, the object region of the object of the object pair, and a region that includes both the subject and the object (e.g., a region circumscribing the subject and the object). The calculation unit 2040 then generates combined data of these three regions (e.g., a concatenation of data representing these three regions) and inputs the combined data into the relationship extraction model.
  • the relationship extraction model will be described in detail later.
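The three regions described above can be derived as in the following sketch. It assumes rectangles encoded as (left, top, right, bottom) with exclusive right and bottom edges, and an image held as a nested list of pixel values; the helper names are hypothetical. A real pipeline would typically also resize the crops to a common size before concatenating them, which is omitted here.

```python
from typing import List, Tuple

# Assumed rectangle encoding: (left, top, right, bottom), right/bottom exclusive.
Box = Tuple[int, int, int, int]
Image = List[List[int]]  # rows of pixel values


def union_box(a: Box, b: Box) -> Box:
    """Smallest rectangle circumscribing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def crop(image: Image, box: Box) -> Image:
    """Cut the given rectangle out of the image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def combined_regions(image: Image, subject_box: Box, object_box: Box) -> List[Image]:
    """Subject region, object region, and the region containing both."""
    boxes = (subject_box, object_box, union_box(subject_box, object_box))
    return [crop(image, b) for b in boxes]


# A 4x6 toy image whose pixel value encodes its position as row * 10 + column.
image = [[r * 10 + c for c in range(6)] for r in range(4)]
subject_region, object_region, pair_region = combined_regions(
    image, (1, 1, 3, 3), (2, 0, 5, 2)
)
print(subject_region)
```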
  • The detection unit 2060 uses the relationship feature obtained from the query data 10 to detect, from the object relationship information 20, a relationship between objects that matches the relationship between objects represented by the query data 10 (S106). For example, the detection unit 2060 calculates a similarity between each relationship feature 23 indicated in the object relationship information 20 and the relationship feature obtained from the query data 10, and determines whether the calculated similarity is equal to or greater than a predetermined threshold. If the similarity calculated for a certain relationship feature 23 is equal to or greater than the threshold, the detection unit 2060 detects the relationship between objects represented by that relationship feature 23 as matching the relationship between objects represented by the query data 10.
  • the detection unit 2060 treats the relationship between the objects represented by the relationship feature 23 with the highest similarity as matching the relationship between the objects represented by the query data 10.
  • the detection unit 2060 may treat each of the relationships between the objects represented by the top N relationship features 23 (N is a predetermined natural number) in descending order of similarity as matching the relationship between the objects represented by the query data 10.
  • the detection unit 2060 calculates the similarity between the two features using the distance between the two features.
  • the similarity between the two features can be expressed as the inverse of the distance between those features.
  • the detection unit 2060 calculates the cosine similarity between the two features as the similarity between those features.
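The matching strategies described above (threshold, top N, cosine similarity, inverse distance) can be sketched as follows. This is a minimal illustration, not the implementation of the detection unit 2060: the function names, the `eps` guard against division by zero, and the toy two-dimensional features are all assumptions introduced here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def inverse_distance_similarity(a, b, eps=1e-12):
    """Similarity as the inverse of the Euclidean distance (eps is an assumed guard)."""
    d = np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return 1.0 / (d + eps)

def match_by_threshold(query, features, threshold, sim=cosine_similarity):
    """Indices of stored relation features whose similarity is >= threshold."""
    return [i for i, f in enumerate(features) if sim(query, f) >= threshold]

def match_top_n(query, features, n, sim=cosine_similarity):
    """Indices of the N stored relation features with the highest similarity."""
    scores = [sim(query, f) for f in features]
    order = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    return order[:n]

# Toy data: three stored relation features and one query feature (hypothetical values).
stored = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.0]
threshold_hits = match_by_threshold(query, stored, threshold=0.9)
top_two = match_top_n(query, stored, n=2)
```

Taking the single best match corresponds to calling `match_top_n` with `n=1`.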
  • the search device 2000 outputs information based on the result of detection by the detection unit 2060.
  • the information output based on the result of the search is called output information.
  • the detection unit 2060 extracts from the object relation information 20 relations between objects that match the relations between objects represented by the query data 10, and includes the extracted information in the output information. More specifically, for a relation feature 23 that is determined to match the relation feature calculated from the query data 10, the detection unit 2060 includes in the output information the record of the object relation information 20 in which that relation feature 23 is indicated. For example, assume that the relation feature calculated from the query data 10 matches Fr2 in FIG. 5. In this case, the output information includes the information indicated in the second record of the object relation information 20 in FIG. 5.
  • the user of the search device 2000 can learn the subjects, objects, and subject-object relationships for the object relationships obtained from the source image data that match the object relationships represented by the query data 10.
  • it is preferable for the search device 2000 to use object information 30 in addition to object relationship information 20, and to include information regarding the type of object, etc., for each of the subject and object in the output information.
  • the detection unit 2060 may output source image data for a relationship between objects that matches the relationship between objects represented by the query data 10, in addition to or instead of the records of the object relationship information 20.
  • a relationship feature 23 that matches the relationship feature calculated from the query data 10 is indicated in the second record of the object relationship information 20 in FIG. 6.
  • the search device 2000 includes in the output information the source image data with identifier 002 identified by the source identifier 21 of the second record. This allows the user of the search device 2000 to obtain source image data including the relationship between objects represented by the query data 10.
  • the query data 10 is text data expressing the relationship "a person is sitting in a chair."
  • a user of the search device 2000 can obtain source image data including a scene in which a person is sitting in a chair by searching using the query data 10.
  • the output information may include image data that includes only the object of interest.
  • the query data 10 is text data such as "a person is sitting in a chair."
  • a certain source image data includes both a scene of a person sitting in a chair and a scene of a dog holding a ball in its mouth.
  • the output information may include image data from this source image data that represents the areas of the person and the chair.
  • the query data 10 may represent the relationship between two or more object pairs.
  • the query data 10 may be text data such as "a person is sitting in a chair, and a dog is holding a ball in its mouth."
  • the calculation unit 2040 calculates a relationship feature for each object pair. Furthermore, the detection unit 2060 searches the object relationship information 20 for a relationship feature 23 that matches each of the calculated relationship features.
  • the detection unit 2060 1) detects from the object relation information 20 a relation feature 23 that matches a relation feature obtained from the text data "A person is sitting in a chair," and 2) detects from the object relation information 20 a relation feature 23 that matches a relation feature obtained from the text data "A dog is holding a ball in its mouth."
  • the calculation unit 2040 has a parser that interprets the sentence indicated in the query data 10.
  • the parser divides the query data 10 into sentences. Furthermore, the parser interprets each sentence to identify the subject, object, and relationship between the subject and object indicated in each sentence.
  • the parser interprets the logical relationship between the sentences to identify the search criteria. For example, assume that the query data 10 indicates the sentence "A person is sitting in a chair, and a dog is holding a ball in its mouth." In this case, the parser divides the query data 10 into two sentences, "A person is sitting in a chair" and "A dog is holding a ball," and identifies that these are connected with "and." The parser further interprets each of the two sentences to identify the subject, object, and relationship between the subject and object, namely, "subject: person, object: chair, relationship: sit" and "subject: dog, object: ball, relationship: hold." Then, based on the interpretation of the two sentences and the logical relationship between them, the parser identifies the search criteria "(subject: person, object: chair, relationship: sit) and (subject: dog, object: ball, relationship: hold)."
  • the calculation unit 2040 assumes that a predetermined logical relationship (for example, and) exists between these sentences. For example, assume that the query data 10 indicates "1) a person is sitting in a chair, 2) a dog is holding a ball in its mouth.” Also assume that the predetermined logical relationship is and. In this case, as in the previous example, the search criteria "(subject: person, object: chair, relationship: sitting) and (subject: dog, object: ball, relationship: holding)" are identified.
  • the detection unit 2060 may determine whether the relation feature 23 detected from the object relation information 20 for each object pair corresponds to the same source image data. Then, the detection unit 2060 may determine that the object relation information 20 indicates a relation between objects that matches the relation between objects represented in the query data 10 only when the relation feature 23 detected from the object relation information 20 for each object pair corresponds to the same source image data.
  • the query data 10 is text data stating "A person is sitting in a chair and a dog is holding a ball in its mouth."
  • the search criteria "(subject: person, object: chair, relationship: sitting) and (subject: dog, object: ball, relationship: holding)" are specified. Therefore, only when a single source image data contains both a scene in which a person is sitting in a chair and a scene in which a dog is holding a ball in its mouth, is the desired relationship treated as having been detected from the object relationship information 20.
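The "and" condition over source image data can be sketched as a set intersection. This is an illustrative sketch only: the record layout (a dict with a `source_id` key standing in for the source identifier 21) and the function name are assumptions.

```python
def sources_matching_all(matches_per_pair):
    """Return the source identifiers that appear in the matches of every object pair.

    matches_per_pair: one list of matching records per object pair in the query;
    each record is assumed to be a dict with a 'source_id' key.
    """
    source_sets = [{rec["source_id"] for rec in recs} for recs in matches_per_pair]
    if not source_sets:
        return set()
    return set.intersection(*source_sets)

# Hypothetical matches for "(person, chair, sit)" and "(dog, ball, hold)".
sitting = [{"source_id": "001"}, {"source_id": "002"}]
holding = [{"source_id": "002"}, {"source_id": "003"}]
common = sources_matching_all([sitting, holding])
```

Only source image data that satisfies every object pair in the query survives the intersection; here that is the image with identifier 002.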
  • the output information may be output in a variety of ways.
  • the detection unit 2060 stores the output information in an arbitrary storage unit.
  • the detection unit 2060 displays the output information on a display device.
  • the detection unit 2060 transmits the output information to another device. For example, when query data 10 is transmitted from another device, the detection unit 2060 transmits the output information to the device that transmitted the query data 10.
  • the search device 2000 can be used to search for a desired event or collect information about an accident from a video of a drive recorder.
  • the object relationship information 20 is generated using the video of the drive recorder.
  • the source image data is each video frame constituting the video of the drive recorder.
  • the search device 2000 can be used to monitor a person using video. Specifically, the search device 2000 can be used to detect a desired behavior of a person being watched from video obtained by shooting the person being watched (hereinafter, "watching video"). In this case, the object relationship information 20 is generated using the watching video.
  • the source image data are each video frame that makes up the watching video.
  • the search device 2000 can also be used to detect criminal activity from surveillance footage.
  • the object-relation information 20 is generated using the surveillance footage.
  • the source image data are the individual video frames that make up the surveillance footage.
  • the search device 2000 can also be used to analyze customer purchasing behavior, for example.
  • the object-relationship information 20 is also generated using surveillance video.
  • a method for generating the object relationship information 20 will be exemplified.
  • a device for generating the object relationship information 20 is called a relationship information generation device.
  • the relationship information generation device may be provided integrally with the search device 2000, or may be provided separately from the search device 2000. In the latter case, the hardware configuration of the relationship information generation device is represented in FIG. 3, for example, similar to the hardware configuration of the search device 2000.
  • FIG. 9 is a diagram illustrating an example of an overview of the operation of the relationship information generating device. Note that FIG. 9 is a diagram for facilitating understanding of the overview of the relationship information generating device 3000, and the operation of the relationship information generating device 3000 is not limited to the operation shown in FIG. 9.
  • the relationship information generating device 3000 acquires source image data 100.
  • the relationship information generating device 3000 detects object pairs from the source image data 100.
  • the relationship information generating device 3000 calculates relationship features for the detected object pairs.
  • the relationship information generating device 3000 generates object relationship information 20 by associating the object pairs with the relationship features calculated for the object pairs. Note that two or more source image data 100 may be used to generate the object relationship information 20.
  • the relationship information generating device 3000 can automatically generate object relationship information 20 that indicates relationship features that represent the characteristics of the relationship between objects for each of multiple object pairs included in one or more image data.
  • the relationship information generating device 3000 includes an acquiring unit 3020, a detecting unit 3040, a calculating unit 3060, and a generating unit 3080.
  • the acquiring unit 3020 acquires source image data 100.
  • the detecting unit 3040 detects an object pair from the source image data 100.
  • the calculating unit 3060 calculates a relation feature amount for the detected object pair.
  • the generating unit 3080 generates object relationship information 20 by associating the object pair with the relation feature amount calculated for the object pair.
  • <<Process flow>> FIG. 11 is a flowchart illustrating the flow of processing executed by the relationship information generating device 3000.
  • the acquiring unit 3020 acquires the source image data 100 (S202).
  • the detecting unit 3040 detects object pairs from the source image data 100 (S204).
  • the calculating unit 3060 calculates a relation feature amount for a plurality of pairs detected from the source image data 100 (S206).
  • the generating unit 3080 generates object relationship information 20 by associating the object pairs with the relation feature amount calculated for the object pairs (S208).
  • the acquisition unit 3020 acquires the source image data 100 (S202). There are various methods for the acquisition unit 3020 to acquire the source image data 100.
  • the source image data 100 is stored in advance in a storage unit accessible from the relationship information generating device 3000.
  • the acquisition unit 3020 acquires the source image data 100 by reading the source image data 100 from this storage unit.
  • the acquisition unit 3020 may acquire the source image data 100 by receiving the source image data 100 transmitted from another device.
  • the device that transmits the source image data 100 to the relationship information generating device 3000 is, for example, a device (such as a camera or a distance measuring device) that generated the source image data 100.
  • the detection unit 3040 detects object pairs from the source image data 100 (S204). To this end, the detection unit 3040 first detects objects from the source image data 100. More specifically, the detection unit 3040 detects objects from the source image data 100 and generates the object information 30 described above.
  • the detection unit 3040 detects objects by detecting object regions from the source image data 100. An identifier is assigned to each detected object. The detection unit 3040 also identifies the type of object represented by each object region. The detection unit 3040 generates object information 30 by associating the identifier, type, and object region for each object.
  • a trained model for example, is used for the process of detecting an object region from the source image data 100 and identifying the type of object represented by the object region.
  • a model may be, for example, a machine learning-based model such as a neural network.
  • When multiple time-series source image data 100 are used, such as video data, the detection unit 3040 generates object information 30 using information on objects detected from each of the multiple source image data 100. In this case, the same object may be detected from different source image data 100. The detection unit 3040 therefore identifies the same objects across the different source image data 100 by a process such as tracking, and assigns the same identifier to the same objects detected from the different source image data 100. As a result, information on the same object detected from different source image data 100 is stored in the same record in the object information 30.
  • the detection unit 3040 generates object pairs for multiple objects detected from the source image data 100.
  • the detection unit 3040 generates all possible pairs for multiple objects detected from the source image data 100. For example, if there are three detected objects A, B, and C, six object pairs are generated: (A,B), (B,A), (A,C), (C,A), (B,C), and (C,B).
  • an object listed first is the subject, and the object listed later is the object.
  • the notation of an object pair (A,B) represents an object pair in which the subject and object are object A and object B, respectively.
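Generating every ordered (subject, object) pair can be sketched with the standard library. The object labels are the ones used in the text; the variable names are illustrative.

```python
from itertools import permutations

# Detected objects, identified by the labels used in the example above.
objects = ["A", "B", "C"]

# Each ordered pair is (subject, object), so (A, B) and (B, A) are distinct pairs.
object_pairs = list(permutations(objects, 2))
```

For three objects this yields six ordered pairs, because each of the three unordered combinations appears once in each direction.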
  • the detection unit 3040 may generate only object pairs that are suspected to have a relationship between the objects. For example, the detection unit 3040 ensures that a specific type of object is always included in the object pair. In other words, an object pair that does not include a specific type of object is not treated as an object pair. Furthermore, when a specific type of object is always included in an object pair in this way, the detection unit 3040 treats the specific type of object as the subject.
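  • for example, when the specific type is a person, each record of the object relationship information 20 represents the relationship between a person and another object, in other words, the behavior of a person with respect to another object.
  • for instance, suppose that objects A and B are persons and objects C and D are not. In this case, the detection unit 3040 generates (A,B), (A,C), (A,D), (B,A), (B,C), and (B,D) as object pairs.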
  • (C,D) and (D,C) are not treated as object pairs because they do not include a person.
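Restricting pairs to those whose subject is of a specific type can be sketched as a filter over the ordered pairs. This is a sketch under assumptions: objects A and B are taken to be persons, and the types assigned to C and D ("chair", "ball") are hypothetical, as is the function name.

```python
from itertools import permutations

def pairs_with_required_subject(object_types, required_type="person"):
    """Ordered (subject, object) pairs whose subject has the required type.

    object_types: mapping from object identifier to object type.
    """
    return [
        (subj, obj)
        for subj, obj in permutations(object_types, 2)
        if object_types[subj] == required_type
    ]

# Hypothetical detection result: A and B are persons, C and D are other objects.
object_types = {"A": "person", "B": "person", "C": "chair", "D": "ball"}
person_pairs = pairs_with_required_subject(object_types)
```

Because the required type is always placed in the subject position, pairs such as (C,A) are excluded along with (C,D) and (D,C).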
  • the detection unit 3040 may also determine whether or not there is a relationship between two objects based on the positional relationship between the object regions of these two objects. More specifically, when the object regions of two objects overlap or touch each other, the detection unit 3040 determines that there is a relationship between these objects. Therefore, the detection unit 3040 generates an object pair for these two objects. On the other hand, when the object regions of the two objects do not overlap or touch each other, the detection unit 3040 determines that there is no relationship between these objects. Therefore, the detection unit 3040 does not generate an object pair for these two objects.
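The positional test can be sketched for axis-aligned rectangular object regions. The `(x1, y1, x2, y2)` box format and the function name are assumptions; the source does not specify how object regions are represented.

```python
def regions_overlap_or_touch(a, b):
    """True if two axis-aligned boxes (x1, y1, x2, y2) overlap or share an edge/corner."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Using <= (rather than <) counts boxes that merely touch as related.
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

# Hypothetical regions: the second box touches the first, the third is separated by a gap.
person_box = (0, 0, 10, 10)
chair_box = (10, 5, 20, 15)
ball_box = (11, 0, 20, 10)
touching = regions_overlap_or_touch(person_box, chair_box)
separated = regions_overlap_or_touch(person_box, ball_box)
```

An object pair would be generated for the person and the chair, but not for the person and the ball.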
  • the calculation unit 3060 calculates a relation feature amount for each object pair (S206).
  • the relation feature amount between two objects included in the object pair is calculated using, for example, features obtained from the image regions of the object pair.
  • FIG. 12 is a diagram conceptually illustrating the processing performed by the calculation unit 3060.
  • the calculation unit 3060 has a relationship extraction model 200.
  • the relationship extraction model 200 is composed of a first feature amount calculation model 210 and a second feature amount calculation model 220.
  • the first feature amount calculation model 210 and the second feature amount calculation model 220 are each realized by a machine learning-based model such as a neural network, for example.
  • the calculation unit 3060 generates input data 130 from each object pair.
  • the input data 130 is composed of image regions obtained from the object pair.
  • the calculation unit 3060 combines the object region 132, object region 134, and image region 136 obtained from the object pair to generate the input data 130 for the object pair.
  • the object region 132 is the object region of the subject included in the object pair.
  • the object region 134 is the object region of the object included in the object pair.
  • the image region 136 is an image region that represents the circumscribing rectangle of the object region 132 and the object region 134.
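Building the input data for one object pair can be sketched with NumPy. This is an illustration under several assumptions not stated in the source: boxes are `(x1, y1, x2, y2)` in pixel coordinates, the three regions are resized to a common size by nearest-neighbor sampling so that they can be concatenated, and concatenation is done along the channel axis.

```python
import numpy as np

def circumscribing_box(a, b):
    """Smallest axis-aligned box (x1, y1, x2, y2) containing both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def crop_and_resize(image, box, size):
    """Crop the box from an H x W x C image and resize to size x size (nearest neighbor)."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    rows = np.arange(size) * crop.shape[0] // size
    cols = np.arange(size) * crop.shape[1] // size
    return crop[rows][:, cols]

def build_input_data(image, subject_box, object_box, size=8):
    """Concatenate the subject region, object region, and circumscribing region."""
    union_box = circumscribing_box(subject_box, object_box)
    parts = [crop_and_resize(image, box, size)
             for box in (subject_box, object_box, union_box)]
    return np.concatenate(parts, axis=-1)

# Hypothetical 40 x 60 RGB image with a subject box and an object box.
image = np.zeros((40, 60, 3), dtype=np.uint8)
input_data = build_input_data(image, (5, 5, 20, 30), (15, 10, 50, 35))
```

With three-channel regions resized to 8 x 8, the combined data has shape (8, 8, 9).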
  • the relationship extraction model 200 receives data enumerating one or more input data 130 obtained from the source image data 100 (in the case of FIG. 12, data enumerating input data 130-1 to 130-M).
  • the first feature calculation model 210 is configured to output object pair features 140 in response to the input of the input data 130.
  • the calculation unit 3060 obtains object pair features 140 for each object pair by inputting each input data 130 into the first feature calculation model 210.
  • M object pairs are generated as a premise. From these M object pairs, M pieces of input data 130, namely, input data 130-1 to 130-M, are generated. Then, M pieces of object pair features 140, namely, object pair features 140-1 to 140-M, are calculated from the input data 130-1 to 130-M.
  • the second feature calculation model 220 is configured to output, in response to the input of a plurality of object pair features 140, the same number of relational features 150 as the input object pair features 140.
  • the calculation unit 3060 inputs the object pair features 140 obtained for each object pair together into the second feature calculation model 220, thereby obtaining a relational feature 150 for each object pair.
  • M relational features 150 are obtained.
  • the second feature calculation model 220 outputs data in which a plurality of relational features 150 are listed.
  • the second feature calculation model 220 outputs the relationship feature 150 in a manner that makes it possible to determine that there is no relationship between the objects. Therefore, for example, the second feature calculation model 220 is configured to output a predetermined reference vector (e.g., a zero vector) or a vector whose distance from the reference vector is equal to or less than a predetermined threshold value as the relationship feature 150 for object pairs in which the objects are not related to each other.
  • the first feature amount calculation model 210 and the second feature amount calculation model 220 are trained in advance to perform the above-mentioned operations.
  • the training method for the first feature amount calculation model 210 and the second feature amount calculation model 220 will be described later.
  • the second feature calculation model 220 shown in FIG. 12 is configured to acquire multiple object pair features 140 at once and calculate multiple relational features 150 from these multiple object pair features 140.
  • the second feature calculation model 220 can calculate the relational feature of an object pair by taking into account not only the features of that object pair, but also the features of other object pairs. Therefore, it is possible to calculate the relational feature of each object pair by taking into account the context of the entire scene recorded in the source image data 100.
  • the configuration of the second feature calculation model 220 is not limited to a configuration in which multiple object pair features 140 are acquired as input.
  • the second feature calculation model 220 may be configured to output one relation feature 150 in response to one object pair feature 140 being input.
  • the calculation unit 3060 calculates the relation feature 150 using the second feature calculation model 220 for each object pair individually.
  • the generation unit 3080 generates the object relation information 20 by using the relation feature 150 calculated by the calculation unit 3060 (S208). Specifically, the generation unit 3080 generates a record of the object relation information 20 by associating, for each object pair, an identifier of the subject included in the object pair, an identifier of the object included in the object pair, and the relation feature 150 calculated for the object pair.
  • the object pairs detected by the detection unit 3040 may include object pairs with no relationship between the objects. Therefore, for example, the generation unit 3080 determines whether or not there is a relationship between the objects for each object pair, using the relationship feature amount 150 calculated for that object pair. Then, the generation unit 3080 generates records of object relationship information 20 only for object pairs determined to have a relationship between the objects.
  • a reference vector is calculated as the relation feature 150 of an object pair that has no relation between the objects.
  • the generation unit 3080 determines, for each object pair, whether or not the relation feature 150 matches the reference vector. Then, the generation unit 3080 generates records of the object relation information 20 only for object pairs whose relation feature 150 does not match the reference vector.
  • alternatively, the generation unit 3080 determines, for each object pair, whether or not the distance between the relation feature 150 and the reference vector is equal to or less than the threshold. The generation unit 3080 then generates records of the object relation information 20 only for object pairs whose relation feature 150 is farther from the reference vector than the threshold.
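The record filtering can be sketched as follows. The Euclidean distance, the record layout, and the function names are assumptions; a threshold of 0.0 corresponds to the exact-match variant, in which only features identical to the reference vector are discarded.

```python
import numpy as np

def has_relationship(feature, reference, threshold=0.0):
    """True if the relation feature is farther from the reference vector than threshold."""
    diff = np.asarray(feature, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.linalg.norm(diff)) > threshold

def build_records(pairs, features, reference, threshold=0.0):
    """Create one record per object pair judged to have a relationship."""
    return [
        {"subject": subj, "object": obj, "relation_feature": feat}
        for (subj, obj), feat in zip(pairs, features)
        if has_relationship(feat, reference, threshold)
    ]

# Hypothetical output for three object pairs; the zero vector is the reference vector.
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
features = [[0.7, 0.2], [0.0, 0.0], [0.01, 0.0]]
reference = [0.0, 0.0]
exact_records = build_records(pairs, features, reference)
near_records = build_records(pairs, features, reference, threshold=0.05)
```

With the exact-match rule only (A,C) is dropped; with a threshold of 0.05, (B,C) is dropped as well.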
  • the relationship information generating device 3000 outputs the generated object relationship information 20.
  • the relationship information generating device 3000 stores the object relationship information 20 in a storage unit accessible from the search device 2000.
  • the relationship information generating device 3000 may transmit the object relationship information 20 to the search device 2000.
  • <<Training method for the relationship extraction model 200>> The first feature amount calculation model 210 and the second feature amount calculation model 220 constituting the relationship extraction model 200 are trained in advance. Here, a method for training the relationship extraction model 200 will be described. A device for training the relationship extraction model 200 is called a training device.
  • FIG. 13 is a diagram illustrating an example of an outline of the operation of the training device 4000. Note that FIG. 13 is a diagram for facilitating understanding of the outline of the training device 4000, and the operation of the training device 4000 is not limited to the operation shown in FIG. 13.
  • the training device 4000 acquires training data 170 and uses the training data 170 to train the relationship extraction model 200.
  • the training data 170 indicates a pair of input data 172 and ground truth data 174 for each of one or more object pairs.
  • the input data 172 is data having a structure similar to that of the input data 130. Therefore, for example, the input data 172 represents data in which the subject's object region, the object's object region, and the image region of the rectangle circumscribing these two object regions are combined (e.g., concatenated).
  • the ground truth data 174 represents relationship features to be output by the trained relationship extraction model 200 for the corresponding input data 172.
  • the training device 4000 inputs all of the input data 172 shown in the training data 170 to the relationship extraction model 200.
  • the input data 172 input to the relationship extraction model 200 is input to the first feature calculation model 210.
  • the first feature calculation model 210 outputs the same number of object pair features 140 as the number of input data 172.
  • the object pair features 140 output from the first feature calculation model 210 are input to the second feature calculation model 220.
  • the same number of relation features 150 as the number of object pair features 140 (i.e., the same number as the input data 172) are output.
  • the training device 4000 calculates the similarity between the relationship feature 150 and the ground truth data 174 for each object pair, and updates the relationship extraction model 200 based on the calculated similarity.
  • the relationship extraction model 200 is updated by updating trainable parameters of the relationship extraction model 200 (for example, the weights and biases assigned to each edge of the neural network).
  • the training device 4000 trains the relationship extraction model 200 by repeatedly updating the relationship extraction model 200 using multiple training data 170.
  • the training device 4000 can automatically generate the relationship extraction model 200 that outputs the relationship feature amount of each object pair in response to the input of one or more images of object pairs.
  • the training device 4000 includes an acquisition unit 4020, a calculation unit 4040, and an update unit 4060.
  • the acquisition unit 4020 acquires training data 170.
  • the calculation unit 4040 inputs input data 172 of one or more object pairs to the relationship extraction model 200 to calculate a relationship feature 150 for each object pair.
  • the update unit 4060 updates the relationship extraction model 200 based on the similarity between the relationship feature 150 and ground truth data 174.
  • <<Process flow>> FIG. 15 is a flowchart illustrating the flow of processing executed by the training device 4000.
  • the acquisition unit 4020 acquires the training data 170 (S302).
  • the calculation unit 4040 inputs the input data 172 of each object pair to the relationship extraction model 200 to calculate the relationship feature 150 for each object pair (S304).
  • the update unit 4060 calculates the similarity between the relationship feature 150 and the ground truth data 174 (S306).
  • the update unit 4060 updates the relationship extraction model 200 based on the calculated similarity (S308).
  • the process flow executed by the training device 4000 is not limited to that shown in FIG. 15.
  • the training device 4000 may perform batch learning using multiple training data 170.
  • the training device 4000 may use each of the multiple training data 170 to calculate a loss based on the similarity between the relation feature 150 and the ground truth data 174, and update the relation extraction model 200 using the calculated loss statistics.
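A batch loss based on similarity can be sketched as follows. The source does not specify the loss function or the statistic; this sketch assumes a loss of one minus the cosine similarity per object pair and the mean as the statistic. The `eps` guard and all names are also assumptions.

```python
import numpy as np

def cosine_losses(predicted, ground_truth, eps=1e-12):
    """Per-pair loss, assumed here to be 1 - cosine similarity (row-wise)."""
    p = np.asarray(predicted, dtype=float)
    g = np.asarray(ground_truth, dtype=float)
    dots = np.sum(p * g, axis=1)
    norms = np.linalg.norm(p, axis=1) * np.linalg.norm(g, axis=1)
    return 1.0 - dots / (norms + eps)

def batch_loss(predicted, ground_truth):
    """Loss statistic over the batch; the mean is an assumed choice."""
    return float(np.mean(cosine_losses(predicted, ground_truth)))

# Hypothetical relation features and ground truth for two object pairs:
# the first pair points in the same direction, the second is orthogonal.
predicted = [[1.0, 0.0], [0.0, 2.0]]
ground_truth = [[2.0, 0.0], [3.0, 0.0]]
loss = batch_loss(predicted, ground_truth)
```

In an actual training setup the loss would be computed in a framework with automatic differentiation so that the trainable parameters can be updated from it; NumPy is used here only to show the arithmetic.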
  • the acquiring unit 4020 acquires the training data 170 (S302). There are various methods for the acquiring unit 4020 to acquire the training data 170.
  • the training data 170 is stored in advance in a storage unit accessible from the training device 4000.
  • the acquiring unit 4020 acquires the training data 170 by reading the training data 170 from this storage unit.
  • the training device 4000 may acquire the training data 170 by receiving the training data 170 transmitted by another device.
  • the device that transmits the training data 170 is, for example, the device that generated the training data 170.
  • each input data 172 included in the training data 170 is generated from one image data.
  • the relationship information generating device 3000 detects one or more object pairs from the source image data 100 and generates input data 130 for each object pair.
  • the multiple input data 172 included in the training data 170 can also be generated in a similar manner.
  • the process of generating input data 172 from image data may be performed by the training device 4000 or by another device.
  • the training device 4000 acquires image data and generates input data 172 for each of one or more object pairs detected from the image data.
  • Each piece of ground truth data 174 included in the training data 170 is calculated, for example, using data in which the relationship between the objects in the object pair is expressed in text (hereinafter, relational text data).
  • the process of calculating the ground truth data 174 representing the relational features from the relational text data can be performed, for example, by using a model included in the vision-and-language model.
  • the second feature calculation model 44 shown in Figure 8 can be used.
  • the process of generating ground truth data 174 for each object pair may be performed by the training device 4000 or by another device.
  • the training device 4000 acquires related text data for each of the multiple object pairs for the source image data used to generate the input data 172.
  • the training device 4000 converts each related text data into a related feature by inputting the related text data into the second feature calculation model 44. In this way, the training device 4000 generates ground truth data 174 for each object pair detected from the source image data.
  • the method of associating the input data 172 with the ground truth data 174 is arbitrary. For example, information indicating which part of the image data the related text data is related to regarding the object pair is associated with each related text data used to generate the ground truth data 174 in advance. The device that generates the training data 170 uses this information to associate the input data 172 with the ground truth data 174.
  • the update unit 4060 calculates the similarity between the related feature 150 and the ground truth data 174 (S306).
  • the similarity between the feature amounts can be calculated using, for example, the cosine similarity or the distance between the feature amounts, as described above.
  • the similarity between the relation feature 150 and the ground truth data 174 may be calculated for each object pair, or may be calculated for multiple object pairs collectively. In the former case, the update unit 4060 calculates the similarity between the relation feature 150 and the ground truth data 174 for each object pair.
  • When the similarity between the relation feature 150 and the ground truth data 174 is calculated collectively for multiple object pairs, for example, the update unit 4060 generates combined data (e.g., a concatenation) of the relation features 150 by combining the multiple relation features 150. Similarly, the update unit 4060 generates combined data (e.g., a concatenation) of the ground truth data 174 by combining the multiple ground truth data 174 shown in the training data 170. Then, the update unit 4060 calculates the similarity between the combined data of the relation features 150 and the combined data of the ground truth data 174. The similarity between the combined data can also be calculated using the cosine similarity or the distance between the combined data.
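The collective calculation can be sketched by concatenating the per-pair vectors and taking a single cosine similarity. The function name and the toy values are illustrative assumptions, and the sketch assumes the combined vectors are nonzero.

```python
import numpy as np

def combined_cosine_similarity(relation_features, ground_truths):
    """Cosine similarity between the concatenated features and concatenated ground truth."""
    a = np.concatenate([np.asarray(f, dtype=float).ravel() for f in relation_features])
    b = np.concatenate([np.asarray(g, dtype=float).ravel() for g in ground_truths])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical features for two object pairs: the first pair matches its ground truth,
# the second pair is orthogonal to it.
relation_features = [[1.0, 0.0], [0.0, 1.0]]
ground_truths = [[1.0, 0.0], [1.0, 0.0]]
combined_similarity = combined_cosine_similarity(relation_features, ground_truths)
```

Here one of the two pairs agrees with its ground truth and the other is orthogonal, so the combined similarity comes out at 0.5.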
  • The relation feature does not have to include information on the subject or the object.
  • For example, suppose that the relation feature is not to include information on the object.
  • In this case, the ground truth data 174 is generated using multiple pieces of relation text data in which the subject-object relationship matches but the object differs. Specifically, a statistic of the relation features obtained from these pieces of relation text data is used as the ground truth data 174.
  • The relationship extraction model 200 is trained so that it can calculate a relation feature representing a relationship between objects such as "a person is holding something."
  • The relation text data include "a person is holding a ball," "a person is holding a cup," and "a person is holding a smartphone."
  • The ground truth data 174 generated from a plurality of pieces of relation text data in which the subject-object relationship matches but the object differs is associated with a plurality of pieces of input data 172 in which the subject-object relationship likewise matches but the object differs.
  • The ground truth data 174 is generated from relation text data such as "a person is holding a ball," "a person is holding a cup," and "a person is holding a smartphone."
  • A plurality of training data 170 is then generated: for example, training data 170 in which this ground truth data 174 is associated with input data 172 obtained from image data of a scene in which a person is holding a ball, and training data 170 in which this ground truth data 174 is associated with input data 172 obtained from image data of a scene in which a person is holding a cup.
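One way to picture the object-independent ground truth data is the sketch below. It assumes the statistic is the element-wise mean, which the text does not specify, and it uses a hypothetical bag-of-words function, `toy_text_encoder`, purely as a stand-in for the second feature calculation model.

```python
import numpy as np

# Tiny illustrative vocabulary; a real text encoder would not work this way.
VOCAB = ["person", "holding", "ball", "cup", "smartphone"]

def toy_text_encoder(text: str) -> np.ndarray:
    """Hypothetical stand-in for the text feature model: bag-of-words counts."""
    words = text.lower().replace(".", "").split()
    return np.array([float(words.count(w)) for w in VOCAB])

def object_agnostic_ground_truth(texts) -> np.ndarray:
    """Mean of the relation features of texts that share a relationship
    but differ in the object (the mean is an assumed choice of statistic)."""
    features = np.stack([toy_text_encoder(t) for t in texts])
    return features.mean(axis=0)

texts = [
    "a person is holding a ball",
    "a person is holding a cup",
    "a person is holding a smartphone",
]
ground_truth = object_agnostic_ground_truth(texts)

# The same ground truth is paired with inputs from different scenes.
training_data = [
    {"input": "features from image: person holding ball", "ground_truth": ground_truth},
    {"input": "features from image: person holding cup", "ground_truth": ground_truth},
]
```

Averaging keeps the components shared by all texts (subject and relationship) at full strength and dilutes the components specific to each object, which is the effect the passage describes.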
  • The training data 170 may indicate a pair of input data 172 and ground truth data 174 that match each other in terms of the relationship between the subject and the object, but do not match each other in terms of the object.
  • The relationship extraction model 200 is trained so that it can calculate a relation feature that represents the relationship "a person is holding something."
  • The relationship extraction model 200 is trained using multiple sets of training data 170 in which input data 172 that represents a scene in which a person is holding something is associated with ground truth data 174 obtained from text data that represents a person holding something.
  • Training data 170 associates input data 172 obtained from image data of a scene in which a person is holding a ball with ground truth data 174 obtained from the relation text data "a person is holding a cup." Training data 170 also associates input data 172 obtained from image data of a scene in which a person is holding a cup with ground truth data 174 obtained from the relation text data "a person is holding a smartphone."
  • Training the relationship extraction model 200 to calculate relation features that do not include information about the subject can be realized by the same method as training it to calculate relation features that do not include information about the object.
  • the update unit 4060 updates the relationship extraction model 200 based on the similarity between the relation feature 150 and the ground truth data 174 (S308). For example, the update unit 4060 calculates a loss by inputting the similarity between the relation feature 150 and the ground truth data 174 into a loss function.
  • the loss function is designed so that the greater the similarity between the relation feature 150 and the ground truth data 174, the smaller the loss.
  • the update unit 4060 updates the trainable parameters included in the relationship extraction model 200 (in other words, the trainable parameters included in the first feature quantity calculation model 210 and the first feature quantity calculation model 210) based on the calculated loss.
  • Various methods such as gradient descent can be used to update the model parameters based on the loss.
  • When a similarity is calculated for each object pair, the update unit 4060 calculates the loss using the multiple calculated similarities. For example, the update unit 4060 calculates the loss by inputting a statistic of the multiple calculated similarities into a loss function.
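The update step can be illustrated with a toy model. Everything here is an assumption made for illustration: the loss form `1 - mean(similarity)`, the 2x2 linear map `W` standing in for the trainable parameters of the relationship extraction model 200, the finite-difference gradient, and the learning rate.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def positive_loss(similarities):
    """Loss that shrinks as the mean similarity grows (assumed form: 1 - mean)."""
    return 1.0 - float(np.mean(similarities))

def loss_for_weights(W, inputs, ground_truths):
    """Loss of a toy linear model W over all object pairs."""
    sims = [cosine_similarity(W @ x, g) for x, g in zip(inputs, ground_truths)]
    return positive_loss(sims)

def numerical_gradient(W, inputs, ground_truths, eps=1e-5):
    """Central-difference estimate of d(loss)/dW, used here instead of autodiff."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        grad[idx] = (loss_for_weights(W + step, inputs, ground_truths)
                     - loss_for_weights(W - step, inputs, ground_truths)) / (2 * eps)
    return grad

# Two toy object pairs: input features and their ground truth features.
inputs = [np.array([1.0, 2.0]), np.array([2.0, 1.0])]
ground_truths = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]

W = np.eye(2)
losses = []
for _ in range(50):
    losses.append(loss_for_weights(W, inputs, ground_truths))
    W -= 0.1 * numerical_gradient(W, inputs, ground_truths)  # gradient descent step
losses.append(loss_for_weights(W, inputs, ground_truths))
```

A practical implementation would normally rely on a deep learning framework's automatic differentiation; the finite-difference gradient only keeps the sketch dependency-free.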
  • the training data 170 is training data of positive examples. That is, the input data 172 is associated with a relation feature that should be output when the input data 172 is input to the relation extraction model 200.
  • negative example training data may also be used to train the relationship extraction model 200.
  • The negative example training data associates the input data 172 with a relation feature (hereinafter, error data) that should not be output when the input data 172 is input to the relationship extraction model 200.
  • Error data can be generated, for example, using relation text data that represents a relationship different from the relationship between the objects of the pair. For example, assume that the input data 172 is obtained from an image region of a scene in which a person is sitting on a chair. In this case, error data is generated from relation text data such as "the person is holding a chair," "the person is looking at a smartphone," or "the dog is holding a ball."
  • the update unit 4060 calculates the similarity between the relation feature 150 obtained by inputting the input data 172 into the relation extraction model 200 and the error data corresponding to the input data 172.
  • It is preferable that the similarity between the relation feature 150 and the error data be small. Therefore, in training using negative example training data, the loss function is designed so that the smaller the similarity between the relation feature 150 and the error data, the smaller the loss.
  • the update unit 4060 calculates the loss by inputting the similarity between the relation feature 150 and the error data into this loss function. Then, the update unit 4060 updates the relationship extraction model 200 based on the calculated loss.
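A loss for negative examples could look like the sketch below. It assumes the training objective is to push the relation feature away from the error data, so the loss grows with the similarity; the hinge form and the `margin` parameter are illustrative choices, not taken from the text.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def negative_loss(similarity: float, margin: float = 0.0) -> float:
    """Hinge-style loss (assumed form): zero once the similarity is at or
    below the margin, and growing with the similarity above it."""
    return max(0.0, similarity - margin)

relation_feature = np.array([1.0, 0.0])

# Error data that is (undesirably) close to the relation feature.
error_close = np.array([1.0, 0.1])
# Error data that is already dissimilar to the relation feature.
error_far = np.array([0.0, 1.0])

loss_close = negative_loss(cosine_similarity(relation_feature, error_close))
loss_far = negative_loss(cosine_similarity(relation_feature, error_far))
```

With this design, only relation features that still resemble the error data contribute to the parameter update.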
  • The training data for positive examples and the training data for negative examples are configured so that they can be distinguished from each other.
  • For example, these training data are stored in different storage units.
  • Alternatively, a label indicating whether the training data is a positive example or a negative example is attached to the training data in advance.
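The label-based variant might be represented as follows. The `TrainingExample` class, its field names, and the two loss forms are hypothetical; the sketch only shows how a label lets one training loop apply the appropriate loss to each example.

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    input_id: str        # hypothetical identifier of the input data 172
    target_text: str     # text from which the target feature was generated
    is_positive: bool    # label: True for positive, False for negative

def example_loss(similarity: float, example: TrainingExample) -> float:
    """Pick the loss by label (both loss forms are assumptions)."""
    if example.is_positive:
        return 1.0 - similarity      # high similarity is desired
    return max(0.0, similarity)      # low similarity is desired

examples = [
    TrainingExample("img_001_pair_0", "a person is sitting on a chair", True),
    TrainingExample("img_001_pair_0", "the dog is holding a ball", False),
]
# Similarities as they might come out of the model (illustrative values).
similarities = [0.9, 0.2]
losses = [example_loss(s, e) for s, e in zip(similarities, examples)]
```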
  • the training device 4000 outputs the generated relationship extraction model 200.
  • the training device 4000 stores the relationship extraction model 200 in a storage unit accessible from the relationship information generation device 3000.
  • the training device 4000 may transmit the relationship extraction model 200 to the relationship information generation device 3000.
  • the training device 4000 may output only the set of trainable parameters included in the relationship extraction model 200, or may output the entire relationship extraction model 200. In the latter case, the program that realizes the relationship extraction model 200 and parameters that are not updated by training (such as hyperparameters) are also output.
  • (Appendix 1) A search device comprising: an acquisition means for acquiring query data representing, for a pair of objects, a relationship between the objects; a calculation means for calculating, from the query data, a relation feature representing a characteristic of the relationship between the objects; and a detection means for detecting a relation feature that matches the calculated relation feature from object relation information in which a relation feature is associated with each of a plurality of object pairs, wherein the object relation information is generated using one or more pieces of image data.
  • (Appendix 2) The search device according to Appendix 1, wherein the query data is text data that represents a relationship between the objects for the pair.
  • the calculation means calculates the relational feature for a pair of objects represented by the query data by inputting the query data into a relational feature calculation model; 3.
  • the relationship feature calculation model is generated using a model that calculates features of the input text data in a vision-and-language model to which image data including an object pair and text data indicating a relationship between the objects for the object pair are input.
  • the detection means detects the relation feature indicated in the object relation information, whose similarity with the calculated relation feature is equal to or greater than a threshold, as the relation feature that matches the calculated relation feature.
  • the query data represents, for each of a plurality of pairs of entities, a relationship between the entities;
  • the calculation means calculates the relation feature amount for each of a plurality of pairs indicated in the query data;
  • the search device according to any one of appendices 1 to 3, wherein the detection means detects the relational feature that matches each of the plurality of relational feature calculated for the plurality of pairs indicated in the query data from the relational feature that corresponds to the same image data in the object relation information.
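The threshold-based detection and the multi-pair query described in these appendices can be sketched as below. The tuple layout of the object relation information, the function names, the threshold value, and the toy features are all assumptions for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_matches(query_feature, relation_info, threshold):
    """Return (image_id, pair) entries whose similarity reaches the threshold."""
    return [(image_id, pair) for image_id, pair, feature in relation_info
            if cosine_similarity(query_feature, feature) >= threshold]

def find_images_matching_all(query_features, relation_info, threshold):
    """Return image IDs in which every query feature is matched by some pair
    detected from that same image."""
    matched = []
    for image_id in sorted({entry[0] for entry in relation_info}):
        features = [f for i, _, f in relation_info if i == image_id]
        if all(any(cosine_similarity(q, f) >= threshold for f in features)
               for q in query_features):
            matched.append(image_id)
    return matched

# Assumed layout of object relation information: (image ID, object pair, feature).
relation_info = [
    ("img_a", ("person", "ball"), np.array([1.0, 0.0, 0.0])),
    ("img_a", ("dog", "sofa"), np.array([0.0, 1.0, 0.0])),
    ("img_b", ("person", "cup"), np.array([0.9, 0.1, 0.0])),
]
query_hold = np.array([1.0, 0.0, 0.0])   # stands for "a person is holding something"
query_sit = np.array([0.0, 1.0, 0.0])    # stands for "a dog is sitting on something"

single = find_matches(query_hold, relation_info, threshold=0.8)
multi = find_images_matching_all([query_hold, query_sit], relation_info, threshold=0.8)
```

A linear scan is used for clarity; a large collection would more likely use an approximate nearest-neighbor index.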
  • (Appendix 6) A relationship information generation device comprising: an acquisition means for acquiring source image data; a detection means for detecting one or more object pairs from the source image data; a calculation means for calculating a relation feature representing a characteristic of the relationship between the objects of each pair; and a generating means for generating object relation information indicating the pair and the relation feature calculated for the pair in association with each other.
  • (Appendix 7) The relationship information generation device according to Appendix 6, wherein the calculation means uses the source image data to generate combined data by combining an image region of a first object included in the pair, an image region of a second object included in the pair, and an image region including both the first object and the second object, and calculates the relation feature using the generated combined data.
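A possible construction of the combined data is sketched below. The text does not say how the three regions are combined; this example assumes each region is resized to a common size by nearest-neighbor sampling and the results are concatenated along the channel axis. The box format `(x1, y1, x2, y2)` and all function names are also assumptions.

```python
import numpy as np

def union_box(a, b):
    """Smallest box (x1, y1, x2, y2) containing both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def crop_and_resize(image, box, size):
    """Crop the box and resize to size x size by nearest-neighbor sampling."""
    x1, y1, x2, y2 = box
    region = image[y1:y2, x1:x2]
    rows = np.arange(size) * region.shape[0] // size
    cols = np.arange(size) * region.shape[1] // size
    return region[rows][:, cols]

def build_combined_data(image, first_box, second_box, size=8):
    """Stack the first object, second object, and union regions channel-wise."""
    boxes = (first_box, second_box, union_box(first_box, second_box))
    regions = [crop_and_resize(image, box, size) for box in boxes]
    return np.concatenate(regions, axis=-1)

# Blank 40x60 RGB image and two illustrative bounding boxes.
image = np.zeros((40, 60, 3), dtype=np.uint8)
first_box = (5, 5, 20, 30)
second_box = (25, 10, 50, 35)
combined = build_combined_data(image, first_box, second_box)
```

The union region gives the model the spatial context around both objects, which the two individual crops alone do not carry.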
  • the calculation means outputs data in which the relationship features for each of the plurality of pairs are listed in response to input of data in which the combined data for each of the plurality of pairs is listed.
  • (Appendix 9) The relationship information generation device according to any one of appendix 6 to 8, wherein the calculation means calculates a predetermined reference vector or a vector whose distance from the reference vector is equal to or less than a predetermined threshold as the relationship feature of a pair of objects that are not related to each other.
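The reference-vector convention for unrelated pairs can be checked as in the sketch below. The zero reference vector, the Euclidean distance, and the threshold value are assumptions chosen for illustration.

```python
import numpy as np

# Assumed reference vector and threshold; the text only says "predetermined".
REFERENCE_VECTOR = np.zeros(4)
DISTANCE_THRESHOLD = 0.1

def is_unrelated(relation_feature, reference=REFERENCE_VECTOR,
                 threshold=DISTANCE_THRESHOLD) -> bool:
    """True when the feature lies within the threshold of the reference vector,
    which this sketch interprets as 'the objects are not related'."""
    distance = float(np.linalg.norm(relation_feature - reference))
    return distance <= threshold

related_feature = np.array([0.7, 0.1, 0.0, 0.2])
unrelated_feature = np.array([0.01, 0.0, 0.02, 0.0])

flags = [is_unrelated(related_feature), is_unrelated(unrelated_feature)]
```

Such a check would let a downstream step skip unrelated pairs when building or searching the object relation information.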
  • the relationship feature calculation model is generated by using a model that calculates features of the input text data in a vision-and-language model to which image data including an object pair and text data expressing a relationship between the objects for the object pair are input.
  • (Appendix 13) The search method according to any one of Appendices 10 to 12, wherein, in the detection step, a relation feature indicated in the object relation information whose similarity with the calculated relation feature is equal to or greater than a threshold is detected as the relation feature that matches the calculated relation feature.
  • the query data represents, for each of a plurality of pairs of entities, a relationship between the entities; In the calculation step, the relation feature is calculated for each of a plurality of pairs indicated in the query data; 13.
  • the relationship information generating method according to any one of appendices 15 to 17, wherein in the calculation step, a predetermined reference vector or a vector whose distance from the reference vector is equal to or less than a predetermined threshold is calculated as the relationship feature of a pair of objects that are not related to each other.
  • (Appendix 19) A program that causes a computer to execute: an acquisition step of acquiring query data representing, for a pair of objects, a relationship between the objects; a calculation step of calculating, from the query data, a relation feature representing a characteristic of the relationship between the objects; and a detection step of detecting a relation feature that matches the calculated relation feature from object relation information in which a relation feature is associated with each of a plurality of object pairs, wherein
  • the object relation information is generated using one or more pieces of image data.
  • (Appendix 20) The program according to Appendix 19, wherein the query data is text data that represents a relationship between the objects for the pair.
  • the query data is input to a relational feature calculation model to calculate the relational feature for the pair of objects represented by the query data;
  • the relationship feature calculation model is generated using a model that calculates features of the input text data in a vision-and-language model to which image data including an object pair and text data indicating a relationship between the objects for the object pair are input.
  • Appendix 22 22.
  • the relation feature indicated in the object relation information whose similarity with the calculated relation feature is equal to or greater than a threshold, is detected as the relation feature that matches the calculated relation feature.
  • the query data represents, for each of a plurality of pairs of objects, a relationship between the objects;
  • the relation feature is calculated for each of a plurality of pairs indicated in the query data; 22.
  • the relational feature that matches each of the plurality of relational feature calculated for the plurality of pairs indicated in the query data is detected from the relational feature that corresponds to the same image data in the object relation information.
  • (Appendix 24) A program that causes a computer to execute: an acquisition step of acquiring source image data; a detection step of detecting one or more pairs of objects from the source image data; a calculation step of calculating a relation feature representing a characteristic of the relationship between the objects of each pair; and a generation step of generating object relation information indicating the pair and the relation feature calculated for the pair in association with each other.
  • (Appendix 25) The program according to Appendix 24, wherein, in the calculation step, combined data is generated by combining, using the source image data, an image region of a first object included in the pair, an image region of a second object included in the pair, and an image region including both the first object and the second object, and the relation feature is calculated using the generated combined data.
  • Appendix 26 26.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
PCT/JP2024/024735 2023-07-20 2024-07-09 Search device, relationship information generation device, search method, relationship information generation method, and program Pending WO2025018220A1 (ja)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2025533995A JPWO2025018220A1 2023-07-20 2024-07-09

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2023118634 2023-07-20
JP2023-118634 2023-07-20

Publications (1)

Publication Number Publication Date
WO2025018220A1 (ja) 2025-01-23

Family

ID=94282016

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2024/024735 Pending WO2025018220A1 (ja) 2023-07-20 2024-07-09 Search device, relationship information generation device, search method, relationship information generation method, and program

Country Status (2)

Country Link
JP (1) JPWO2025018220A1
WO (1) WO2025018220A1

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
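JP2022180942A (ja) * 2021-05-25 2022-12-07 SoftBank Corp. Information processing device, information processing method, and information processing program
JP2023039656A (ja) * 2021-09-09 2023-03-22 Toshiba Corp. Case search device, method, and program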

Also Published As

Publication number Publication date
JPWO2025018220A1 2025-01-23

Similar Documents

Publication Publication Date Title
US11409791B2 (en) Joint heterogeneous language-vision embeddings for video tagging and search
US10909401B2 (en) Attention-based explanations for artificial intelligence behavior
CN105027162B (zh) Image analysis device, image analysis system, and image analysis method
US9104667B2 (en) Social media event detection and content-based retrieval
US10679054B2 (en) Object cognitive identification solution
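JP2018524678A (ja) Business discovery from images
CN112820071A (zh) Behavior recognition method and device
KR102365429B1 (ko) Online mobile survey platform using artificial intelligence to identify insincere respondents
CN106537387B (zh) Retrieving/storing images associated with events
CN114883005A (zh) Data classification and grading method and device, electronic equipment, and storage medium
CN117633613A (zh) Cross-modal video sentiment analysis method and apparatus, device, and storage medium
CN114417029B (zh) Model training method and device, electronic equipment, and storage medium
CN120356136B (zh) Video content detection method
JPH11250106A (ja) Automatic search method for registered trademarks using content-based video information
CN115186647A (zh) Text similarity detection method and device, electronic equipment, and storage medium
CN120472379A (zh) Video key information extraction method, video question answering method, device, and medium
JP7192888B2 (ja) Processing device, processing method, and program
CN105809488B (zh) Information processing method and electronic device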
Ma et al. Interpretable multimodal out-of-context detection with soft logic regularization
WO2025018220A1 (ja) Search device, relationship information generation device, search method, relationship information generation method, and program
Eyzaguirre et al. Streaming detection of queried event start
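CN116662589A (zh) Image matching method and device, electronic equipment, and storage medium
CN114783551A (zh) Pain level prediction method and device, electronic equipment, and readable storage medium
CN114238968A (zh) Application program detection method and device, storage medium, and electronic equipment
WO2025164534A1 (ja) Search device, search method, and program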

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24843012

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2025533995

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2025533995

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE