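WO2025224820A1 - Image recognition system, image recognition device, sentence feature extraction device, program, and image recognition method - Google Patents
Image recognition system, image recognition device, sentence feature extraction device, program, and image recognition method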
- Publication number
- WO2025224820A1 (PCT/JP2024/015829)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- sentence
- features
- unit
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
Definitions
- This disclosure relates to an image recognition system, image recognition device, text feature extraction device, program, and image recognition method that detect targets specified by natural language phrases from image data.
- Image recognition technology, which detects targets from image data, has become significantly more accurate with the advancement of artificial intelligence, and is expected to enable advanced security systems and the like.
- However, the trained models used in artificial intelligence require significantly more computation than conventional image recognition technology, making them difficult to implement in devices with limited computing power, such as surveillance cameras.
- Non-Patent Document 1 discloses a technology called Grounding DINO, where DINO stands for DETR with Improved deNoising anchOr boxes.
- Grounding DINO is a technology that can detect a wider range of object categories by training a deep neural network (DNN) model in advance on pairs of image data and natural language phrases, using a dataset 10 to 100 times larger than those used for conventional DNNs.
- With Grounding DINO, by specifying the object category to be detected using natural language phrases during inference, it is possible to change or add object categories to be detected without retraining the DNN model.
- Even with the technology disclosed in Non-Patent Document 1, the computational complexity of the inference process that uses a trained model to detect targets specified by natural language phrases from image data is still very large, making it difficult to implement when the computational power of the image recognition device is limited.
- the present disclosure has been made in light of the above, and aims to provide an image recognition system that can implement inference processing using a trained model to detect targets specified by phrases input in natural language from image data, even when the computational power of the image recognition device is limited.
- the image recognition system disclosed herein is an image recognition system that detects a detection target from image data of the recognition target, and includes a sentence feature extraction device and an image recognition device.
- the sentence feature extraction device includes a sentence acquisition unit that acquires natural language phrases for specifying the detection target, a sentence feature extraction unit that extracts sentence features, which are features comparable to image features, from the natural language phrases, and a sentence feature transmission unit that transmits the sentence features.
- the image recognition device includes a sentence feature reception unit that receives sentence features from the sentence feature extraction device, an image acquisition unit that acquires image data, an image analysis unit that extracts candidate areas, which are candidates for areas in the image where the detection target exists, and image features, which are features of the image within the candidate areas, from the image data, a feature comparison unit that compares the sentence features with the image features corresponding to the candidate areas and outputs the comparison results, and a detection result generation unit that generates a detection result for the detection target based on the comparison results.
- the image recognition system is characterized in that the processing of the sentence feature extraction unit is performed by the sentence feature extraction device, which is a device separate from the image recognition device.
- the present disclosure has the advantage of making it possible to obtain an image recognition system that can implement inference processing using a trained model to detect targets specified by phrases input in natural language from image data, even when the computational power of the image recognition device is limited.
- FIG. 1 is a block diagram of an image recognition system according to a first embodiment.
- FIG. 2 is a diagram showing an example of the configuration of a DNN model used by the image recognition system shown in FIG. 1.
- FIG. 3 is a flowchart illustrating an example of the operation of the image recognition system shown in FIG. 1.
- FIG. 4 is a block diagram of an image recognition system according to a second embodiment.
- FIG. 5 is a flowchart illustrating an example of the operation of the image recognition system shown in FIG. 4.
- FIG. 1 is a diagram showing dedicated hardware for realizing the functions of the image recognition system according to the first and second embodiments.
- FIG. 1 is a diagram showing the configuration of a control circuit for realizing the functions of the image recognition system according to the first and second embodiments.
- FIG. 1 is a configuration diagram of an image recognition system 10 according to a first embodiment.
- the image recognition system 10 includes an image recognition device 1000, a sentence feature extraction device 2000, and a result display unit 3000.
- the image recognition system 10 has a function of detecting a detection target specified by a phrase input in natural language from image data by performing an inference process using a trained model.
- the trained model may be generated using a technology such as Grounding DINO.
- The inference process is shared between the image recognition device 1000 and the sentence feature extraction device 2000.
- Specifically, the high-load part of the inference process, that is, the extraction of sentence features, which requires a large amount of computation, is executed by the sentence feature extraction device 2000, and the other processes are executed by the image recognition device 1000.
- The allocation of processes between the image recognition device 1000 and the sentence feature extraction device 2000 is determined in advance by a human, but other methods of determination may also be used.
- When the image recognition device 1000 is composed of multiple devices or has multiple arithmetic units, the multiple arithmetic units included therein are treated as a single entity that executes the above-mentioned inference processing.
- For example, when the image recognition device 1000 is a single imaging device with a single arithmetic unit and the sentence feature extraction device 2000 is one or more server devices with multiple arithmetic units, the image recognition device 1000 and the sentence feature extraction device 2000 are each treated as a single entity and their arithmetic processing capabilities are compared.
- In this case, the sentence feature extraction device 2000 is considered to have higher arithmetic processing capabilities than the image recognition device 1000.
- Image recognition device 1000 includes image acquisition unit 1100, image analysis unit 1200, sentence feature receiving unit 1300, feature comparison unit 1400, detection result generation unit 1500, and detection result transmission unit 1600. Image recognition device 1000 determines whether a specified object appears in the image acquired by image acquisition unit 1100, and if so, recognizes the area occupied by the object within the image, and transmits the recognition result to result display unit 3000.
- Image recognition device 1000 is, for example, a surveillance camera with a built-in microcomputer capable of executing a processing device or processing software for image recognition, a digital camera, or a computer connected to a surveillance camera. Image recognition device 1000 is not limited to the above example. Furthermore, for simplicity in this embodiment, image recognition system 10 includes one image recognition device 1000, but in practice, image recognition system 10 may include multiple image recognition devices 1000.
- the sentence feature extraction device 2000 has a sentence acquisition unit 2100, a sentence feature extraction unit 2200, and a sentence feature transmission unit 2300.
- the sentence feature extraction device 2000 is a different device from the image recognition device 1000.
- the sentence feature extraction device 2000 extracts sentence features from the input natural language phrase and transmits the extracted sentence features to the image recognition device 1000.
- the sentence features are features that can be compared with image features.
- the result display unit 3000 is a device for displaying the results of object detection from an image performed by the image recognition device 1000.
- the result display unit 3000 only needs to have a display function, and there are no particular restrictions on its specific configuration.
- the image acquisition unit 1100 acquires image data to be processed and outputs the acquired image data to the image analysis unit 1200.
- the image acquisition unit 1100 is, for example, an image sensor such as a CCD (Charge Coupled Device) or a CMOS (Complementary Metal Oxide Semiconductor), or an imaging device such as a digital camera or surveillance camera connected to a computer.
- the image analysis unit 1200 extracts multiple candidate areas, which are areas that are thought to contain objects, from the image data acquired by the image acquisition unit 1100, and extracts image features, which are characteristic amounts contained in each of the multiple extracted candidate areas.
- the image analysis unit 1200 outputs the multiple extracted candidate areas and the image features corresponding to each candidate area to the feature comparison unit 1400.
- the image analysis unit 1200 extracts the candidate areas and image features using a first partial model, which is part of a pre-trained DNN model and is a part whose output does not depend on natural language input but only on image data input.
- the image analysis unit 1200 can extract the same number of image features as the number of candidate regions. For example, if the image analysis unit 1200 extracts N candidate regions, it can extract N image features.
- the representation format of the information indicating the candidate area is not particularly limited, as long as it can represent the range of the candidate area within the image.
- the candidate area can be represented by specifying the position of a geometric shape.
- the geometric shape may be a polygon such as a rectangle, or a circle. If the candidate area is a polygon, the position of the candidate area can be represented using the coordinates of the polygon's vertices, side lengths, and center of gravity coordinates.
- the candidate area may be represented by the coordinates of the rectangle's four vertices, or by a combination of the coordinates of two of the rectangle's four vertices located diagonally, or by the coordinate of one of the rectangle's four vertices and the numerical value of the length of each side of the rectangle, or by the coordinate of the center of gravity of the rectangle and the numerical value of the length of each side of the rectangle.
- The numerical values of the vertex coordinates, center of gravity coordinates, and side lengths may be actual coordinates used together with the image size, or may be values relative to the image size.
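The rectangle representations listed above carry the same information and can be converted into one another. The following is a minimal sketch of such conversions, assuming axis-aligned rectangles; the `Box` class and all function names are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle stored as two diagonal vertices (illustrative type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def from_corner_and_sides(x, y, width, height):
    """Build a box from one vertex and the lengths of the sides."""
    return Box(x, y, x + width, y + height)


def from_center_and_sides(cx, cy, width, height):
    """Build a box from the center of gravity and the lengths of the sides."""
    return Box(cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


def four_vertices(box):
    """Return the four vertices, starting at (x0, y0) and going around the box."""
    return [(box.x0, box.y0), (box.x1, box.y0), (box.x1, box.y1), (box.x0, box.y1)]


def to_absolute(box, image_width, image_height):
    """Scale a box expressed relative to the image size into pixel coordinates."""
    return Box(box.x0 * image_width, box.y0 * image_height,
               box.x1 * image_width, box.y1 * image_height)
```

Storing two diagonal vertices internally is only one possible choice; any of the listed forms would serve equally well as the canonical one.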
- the sentence feature receiving unit 1300 receives sentence features sent from the sentence feature extraction device 2000 and outputs the received sentence features to the feature comparing unit 1400. There are as many sentence features as there are specified object categories, and if there are M specified object categories, M sentence features are also received. If the sentence features are Q-dimensional vectors, the sentence feature receiving unit 1300 will receive data for M Q-dimensional vectors.
- The feature comparison unit 1400 performs a similarity calculation by comparing the N image features corresponding to each candidate area output by the image analysis unit 1200 with the M sentence features output by the sentence feature receiving unit 1300, and determines which of the M sentence features each image feature is most similar to.
- The feature comparison unit 1400 outputs the similarity calculation results to the detection result generation unit 1500 as the comparison result.
- The feature comparison unit 1400 performs the above-mentioned processing by executing a third partial model, which is part of a pre-trained DNN model. The third partial model receives as input the output of the first partial model executed by the image analysis unit 1200 and the output of the second partial model executed by the sentence feature extraction unit 2200 (described below), and outputs the similarity as the comparison result.
- Based on the comparison results output by the feature comparison unit 1400, the detection result generation unit 1500 identifies the object category corresponding to the sentence feature to which each candidate area is most similar, and outputs the identified object category as the detection result to the detection result transmission unit 1600. Furthermore, if candidate areas with the same object category overlap, redundant results are removed from the detection results.
- the detection result transmission unit 1600 outputs the detection results output by the detection result generation unit 1500 to the result display unit 3000.
- the result display unit 3000 displays the detection results to the user of the image recognition system 10 in the desired manner based on the detection results received from the detection result transmission unit 1600. There are no particular restrictions on the method for displaying the detection results.
- The sentence acquisition unit 2100 accepts input of natural language phrases for object categories that the user of the image recognition system 10 wants the image recognition device 1000 to detect, and outputs the accepted phrases to the sentence feature extraction unit 2200.
- The sentence acquisition unit 2100 can accept input of natural language phrases by, for example, character input using an input means such as a keyboard or touch sensor, voice recognition using a microphone, or reading a pre-created phrase list from a storage device.
- multiple phrases may be input, with each phrase corresponding to an object category.
- the sentence feature extraction unit 2200 extracts sentence features from the natural language phrases acquired by the sentence acquisition unit 2100, and outputs the extracted sentence features to the sentence feature transmission unit 2300. If there are multiple input natural language phrases, the sentence feature extraction unit 2200 extracts sentence features for each phrase. If there are M phrases, the sentence feature extraction unit 2200 extracts M sentence features.
- the sentence feature extraction unit 2200 performs the above processing by executing a second partial model, which is part of a pre-trained DNN model and whose output does not depend on the input of image data but only on the input of natural language phrases.
- the sentence feature sending unit 2300 sends the sentence features output by the sentence feature extraction unit 2200 to the sentence feature receiving unit 1300 of the image recognition device 1000. If there are multiple sentence features, the sentence feature sending unit 2300 sends all of them.
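As an illustration of the data exchanged between the sentence feature sending unit 2300 and the sentence feature receiving unit 1300, the sketch below packs M phrases together with their M Q-dimensional sentence features and unpacks them on the receiving side. The use of JSON, the field names, and the function names are assumptions made for this example; the disclosure does not specify a transmission format.

```python
import json


def pack_sentence_features(phrases, features):
    """Serialize M phrases and their M Q-dimensional features (hypothetical format)."""
    if len(phrases) != len(features):
        raise ValueError("one sentence feature is required per phrase")
    if len({len(vector) for vector in features}) > 1:
        raise ValueError("all sentence features must have the same dimension Q")
    return json.dumps({
        "phrases": list(phrases),
        "features": [[float(value) for value in vector] for vector in features],
    })


def unpack_sentence_features(payload):
    """Recover the phrases and features on the image recognition device side."""
    data = json.loads(payload)
    return data["phrases"], data["features"]
```

Keeping each phrase next to its feature lets the receiving side report detection results using the original phrase as the object category label.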
- DNN model #0 is a trained model that receives image data and natural language phrases as input and is trained to detect objects corresponding to the natural language phrases from within an image.
- This DNN model #0 is composed of three partial models.
- DNN model #1 is a first partial model whose output depends only on the input of image data and is not dependent on the input of natural language phrases, and can operate independently.
- DNN model #1 receives image data as input and outputs candidate regions that are candidates for the region in the image where the detection target exists, and image features for each candidate region.
- DNN model #1 corresponds to the processing of the image analysis unit 1200.
- DNN model #2 is a second partial model whose output depends only on the input of natural language phrases and is not dependent on the input of image data, and can operate independently.
- DNN model #2 receives natural language phrases as input and outputs sentence features for each phrase.
- The natural language phrases indicate the object category to be detected by the image recognition device 1000, and in the example of FIG. 2, phrases for detecting people, such as "male," "female," "person with a cane," and "person on a bicycle," are exemplified.
- DNN model #2 corresponds to the processing of the sentence feature extraction unit 2200.
- DNN model #3 is the remainder of DNN model #0 after removing DNN model #1 and DNN model #2, and is the third partial model.
- DNN model #3 receives the outputs of DNN model #1 and DNN model #2 as inputs and calculates the similarity between sentence features and image features.
- DNN model #3 corresponds to the processing of the feature comparison unit 1400.
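The division of DNN model #0 into three partial models, and their assignment to two devices, can be sketched as follows. This is a toy illustration under stated assumptions: the class names, method names, and the `toy_model_*` stand-ins are hypothetical, and real partial models would be neural network components rather than the simple functions used here.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Region = Tuple[float, float, float, float]


class SentenceFeatureExtractionDevice:
    """Runs only the language branch (DNN model #2 in the text)."""

    def __init__(self, model_2: Callable[[str], Vector]) -> None:
        self.model_2 = model_2

    def extract(self, phrases: Sequence[str]) -> List[Vector]:
        return [self.model_2(phrase) for phrase in phrases]


class ImageRecognitionDevice:
    """Runs the image branch (model #1) and the comparison (model #3)."""

    def __init__(self, model_1, model_3) -> None:
        self.model_1 = model_1
        self.model_3 = model_3
        self.phrases: List[str] = []
        self.sentence_features: List[Vector] = []

    def receive(self, phrases, sentence_features) -> None:
        self.phrases = list(phrases)
        self.sentence_features = list(sentence_features)

    def detect(self, image):
        if not self.sentence_features:
            return []
        regions, image_features = self.model_1(image)
        results = []
        for region, image_feature in zip(regions, image_features):
            scores = [self.model_3(image_feature, s) for s in self.sentence_features]
            best = max(range(len(scores)), key=scores.__getitem__)
            results.append((region, self.phrases[best], scores[best]))
        return results


# Toy stand-ins for the partial models (assumptions for illustration only).
def toy_model_1(image):
    """Pretend the image is already a list of (region, feature) pairs."""
    return [region for region, _ in image], [feature for _, feature in image]


TOY_EMBEDDINGS = {"male": [1.0, 0.0], "person on a bicycle": [0.0, 1.0]}


def toy_model_2(phrase):
    return TOY_EMBEDDINGS[phrase]


def toy_model_3(image_feature, sentence_feature):
    return sum(x * y for x, y in zip(image_feature, sentence_feature))
```

In this sketch the image recognition device never calls the language branch; it only needs the sentence features it has received, which is the property the split relies on.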
- DNN model #0 is trained using a sufficiently large dataset.
- the representation format for regions within an image follows the one described above in the explanation of the image analysis unit 1200.
- FIG. 3 is a flowchart for explaining an example of the operation of the image recognition system 10 shown in FIG. 1.
- the image recognition device 1000 performs the following processing each time the image acquisition unit 1100 acquires image data.
- the image acquisition unit 1100 periodically acquires image data (step S2110).
- the acquisition period is determined in advance depending on the application of the image recognition system 10.
- the image analysis unit 1200 uses DNN model #1 to extract multiple candidate areas from the acquired image data and extract image features within each candidate area (step S2120). Let N be the number of candidate areas obtained here.
- If there are multiple candidate areas, the feature comparison unit 1400 selects one of them, begins the feature comparison processing for each candidate area (step S2130), and waits for the sentence feature receiving unit 1300 to receive the sentence features.
- the sentence acquisition unit 2100 performs sentence acquisition processing to acquire natural language phrases input by the user of the image recognition system 10 (step S2210), and outputs the acquired natural language phrases to the sentence feature extraction unit 2200.
- the natural language phrases indicate the object category that is to be detected by the image recognition device 1000.
- the sentence feature sending unit 2300 associates the sentence features with the natural language phrases and sends them to the sentence feature receiving unit 1300 of the image recognition device 1000 (step S2260).
- The feature comparison unit 1400 determines whether the features have been compared for all phrases for the target candidate region (step S2160). If there are phrases that have not yet been compared (step S2160: No), the feature comparison unit 1400 selects the next phrase and repeats the process from step S2150. If all phrases have been compared (step S2160: Yes), the feature comparison unit 1400 ends the per-phrase feature comparison process for the target candidate region (step S2170) and determines whether the features have been compared for all candidate regions (step S2180). If there are candidate regions that have not yet been compared (step S2180: No), the feature comparison unit 1400 selects the next candidate region and repeats the process from step S2140.
- If all candidate regions have been compared (step S2180: Yes), the feature comparison unit 1400 ends the feature comparison process for each candidate region (step S2190). Once the similarity calculations for all combinations of the sentence features of the M phrases and the image features of the N candidate areas have been completed, the feature comparison unit 1400 stores the phrase with the highest similarity for each candidate area.
- When the detection result generation unit 1500 obtains the comparison results from the feature comparison unit 1400, if the information for each candidate area is relative to the width and height of the image data, it converts that information to values in the original image coordinates. If any candidate areas overlap, it compares the size of the overlapping area and the associated sentence features. If the overlap is sufficiently large and multiple areas have the same associated sentence features, it determines that the candidate areas are duplicates, keeps one of them, and removes the unnecessary candidate areas (step S2310). It then generates a detection result and outputs it to the detection result transmission unit 1600.
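The duplicate removal in step S2310 could look like the sketch below. Several points are assumptions rather than details from the disclosure: overlap is measured here by intersection over union (IoU), the threshold of 0.5 is arbitrary, and the area that is kept is the one with the highest similarity score. The function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def remove_duplicates(detections, threshold=0.5):
    """Drop areas that overlap a kept area with the same label.

    detections: list of (box, label, score) tuples.
    threshold: assumed IoU value above which two areas count as duplicates.
    """
    kept = []
    # Visit detections from highest to lowest score so the best one survives.
    for box, label, score in sorted(detections, key=lambda d: d[2], reverse=True):
        is_duplicate = any(
            kept_label == label and iou(box, kept_box) >= threshold
            for kept_box, kept_label, _ in kept
        )
        if not is_duplicate:
            kept.append((box, label, score))
    return kept
```

Areas with different labels are never merged in this sketch, which matches the condition that only areas with the same associated sentence features are treated as duplicates.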
- the detection result transmission unit 1600 transmits the detection results output by the detection result generation unit 1500 to the result display unit 3000 (step S2320). After transmitting the detection results, the image recognition device 1000 waits to acquire the next image data and starts processing again from step S2110.
- DNN model #0 is a trained model that accepts as input image data and natural language phrases specifying the category of object to be detected, outputs detection results, and has been trained using a sufficiently large dataset. Therefore, a user of the image recognition system 10 can change the object category to be detected simply by changing the natural language phrases provided to the sentence acquisition unit 2100, without retraining the DNN model.
- In the above description, a loop in which one image feature is compared with multiple sentence features is repeated as many times as there are image features, but the order of the loops may be reversed. In other words, the same results can be obtained by repeating, as many times as there are sentence features, a loop in which one sentence feature is compared with multiple image features.
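The equivalence of the two loop orders can be checked with a small sketch. The function names are hypothetical, and a plain dot product stands in for the similarity calculation, which is an assumption made only for this example.

```python
def dot(a, b):
    """Stand-in similarity: dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def compare_region_major(image_features, sentence_features, similarity=dot):
    """Outer loop over candidate areas, inner loop over phrases."""
    table = {}
    for i, image_feature in enumerate(image_features):
        for j, sentence_feature in enumerate(sentence_features):
            table[(i, j)] = similarity(image_feature, sentence_feature)
    return table


def compare_phrase_major(image_features, sentence_features, similarity=dot):
    """Outer loop over phrases, inner loop over candidate areas."""
    table = {}
    for j, sentence_feature in enumerate(sentence_features):
        for i, image_feature in enumerate(image_features):
            table[(i, j)] = similarity(image_feature, sentence_feature)
    return table


def best_phrase_per_region(table, n_regions, n_phrases):
    """Index of the most similar phrase for each candidate area."""
    return [
        max(range(n_phrases), key=lambda j: table[(i, j)])
        for i in range(n_regions)
    ]
```

Both functions fill the same N x M table of similarities, so the phrase selected for each candidate area does not depend on which loop is outermost.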
- Although the processing of the feature comparison unit 1400 is described as using DNN model #3, depending on the structure of DNN model #0, this part does not necessarily have to be a DNN model.
- For example, the feature comparison unit 1400 may be a processing unit that calculates the cosine similarity between image features and sentence features, or a processing unit that calculates the Euclidean distance between image features and sentence features.
- This presupposes that DNN model #0 was sufficiently trained in advance with the above-mentioned processing unit in use.
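The two non-DNN alternatives mentioned above, cosine similarity and Euclidean distance, can be written in a few lines. This is a minimal sketch using only the standard library; the function names and the `use_cosine` flag are hypothetical. Note that a larger cosine similarity means a closer match, whereas a smaller Euclidean distance means a closer match.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(image_feature, sentence_features, use_cosine=True):
    """Index of the sentence feature closest to the given image feature."""
    if use_cosine:
        scores = [cosine_similarity(image_feature, s) for s in sentence_features]
        return max(range(len(scores)), key=scores.__getitem__)
    distances = [euclidean_distance(image_feature, s) for s in sentence_features]
    return min(range(len(distances)), key=distances.__getitem__)
```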
- embodiment 1 provides an image recognition system 10 that includes an image recognition device 1000 and a sentence feature extraction device 2000 and detects a detection target from image data of the recognition target.
- the sentence feature extraction device 2000 includes a sentence acquisition unit 2100 that acquires natural language phrases for specifying the detection target, a sentence feature extraction unit 2200 that extracts sentence features, which are features comparable to image features, from the natural language phrases, and a sentence feature transmission unit 2300 that transmits the sentence features.
- the image recognition device 1000 includes a sentence feature reception unit 1300 that receives sentence features from the sentence feature extraction device 2000, an image acquisition unit 1100 that acquires image data, an image analysis unit 1200 that extracts candidate areas, which are candidates for areas in the image where the detection target exists, and image features, which are feature quantities of the image within the candidate areas, from the image data, a feature comparison unit 1400 that compares the sentence features with the image features corresponding to the candidate areas and outputs the comparison results, and a detection result generation unit 1500 that generates a detection result for the detection target based on the comparison results.
- the image recognition system 10 is characterized in that the processing of the sentence feature extraction unit 2200 is executed by a sentence feature extraction device 2000, which is a device separate from the image recognition device 1000.
- the image analysis unit 1200 is realized by executing DNN model #1, a first partial model whose output does not depend on natural language input, which is part of DNN model #0, a trained model for detecting targets specified by natural language phrases from image data.
- DNN model #0 is, for example, generative AI (artificial intelligence).
- Generative AI that inputs multimodal data such as image data and natural language phrases requires a very large amount of calculation and requires very high computing power. Therefore, when running it on embedded devices such as surveillance cameras, the computing power may be insufficient, making implementation difficult.
- With the technology of this embodiment, even if embedded devices such as surveillance cameras lack the computing power to process multimodal data collectively, it becomes possible to execute some of the processing with a high processing load on an external device.
- the sentence feature extraction unit 2200 may be realized by executing DNN model #2, which is a part of DNN model #0, a trained model for detecting detection targets specified by natural language phrases from image data, and is a second partial model whose output does not depend on the input of image data.
- the feature comparison unit 1400 may be realized by executing DNN model #3, a third partial model that is part of DNN model #0, a trained model for detecting detection targets specified by natural language phrases from image data, and that accepts as input the output of DNN model #1, a first partial model that is independent of natural language input, and the output of DNN model #2, a second partial model that is independent of image data input, and outputs the results of comparing text features with image features.
- An image recognition device 1000 that detects a detection target from image data of a recognition target can also be provided.
- the image recognition device 1000 is characterized by comprising: a sentence feature receiving unit 1300 that receives, from outside the image recognition device 1000, sentence features that are comparable to image features and extracted from natural language phrases used to specify the detection target; an image acquisition unit 1100 that acquires image data; an image analysis unit 1200 that extracts, from the image data, candidate areas that are candidates for areas in the image where the detection target exists, and image features that are feature amounts of the image within the candidate areas; a feature comparison unit 1400 that compares the sentence features with the image features corresponding to the candidate areas and outputs the comparison results; and a detection result generation unit 1500 that generates a detection result for the detection target based on the comparison results.
- the image analysis unit 1200 may be realized by executing DNN model #1, which is a part of DNN model #0, a trained model for detecting a detection target specified by a natural language phrase from image data, and which is a first partial model whose output does not depend on natural language input.
- the feature comparison unit 1400 may be realized by executing DNN model #3, which is a third partial model that is part of the trained model DNN model #0 and which accepts as input the output of DNN model #1, which is a first partial model, and the output of DNN model #2, which is a part of the trained model DNN model #0 and is a second partial model that does not depend on image data input, and outputs a comparison result.
- a sentence feature extraction device 2000 can also be provided.
- the sentence feature extraction device 2000 is characterized by comprising a sentence acquisition unit 2100 that acquires natural language phrases for specifying a detection target to be detected from image data, a sentence feature extraction unit 2200 that extracts sentence features, which are features comparable to image features, from the natural language phrases, and a sentence feature transmission unit 2300 that transmits the sentence features to the image recognition device 1000, which detects the detection target from image data.
- the text feature extraction unit 2200 is part of DNN model #0, a trained model for detecting detection targets specified by natural language phrases from image data, and can be realized by executing DNN model #2, a second partial model whose output does not depend on the image data input.
- A program for causing a computer to function as the image recognition device 1000, which detects a detection target from image data of the recognition target, can also be provided.
- This program can cause a computer functioning as image recognition device 1000 to execute the following steps: receiving sentence features, which are features comparable to image features, extracted from natural language phrases for specifying the detection target in a computer different from the computer; acquiring image data; extracting from the image data candidate areas that are candidates for areas in the image where the detection target exists, and image features that are feature amounts of the image in the candidate areas; comparing the sentence features with the image features corresponding to the candidate areas and outputting the comparison results; and generating a detection result for the detection target based on the comparison results.
- this program can cause a computer functioning as the sentence feature extraction device 2000 to execute the following steps: acquiring natural language phrases for specifying a detection target to be detected from image data; extracting sentence features, which are features comparable to image features, from the natural language phrases; and transmitting the sentence features to a computer, different from the computer functioning as the sentence feature extraction device 2000, that functions as the image recognition device 1000 for detecting the detection target from image data.
- an image recognition method for detecting a detection target from image data of a recognition target includes the steps of: extracting sentence features, which are features comparable to image features, from natural language phrases used to specify the detection target; extracting candidate areas, which are candidates for areas in the image where the detection target exists, from the image data, and image features, which are features of the image within the candidate areas; comparing the sentence features with the image features corresponding to the candidate areas to generate a comparison result; and generating a detection result for the detection target based on the comparison result.
- This image recognition method is also characterized in that the processing for extracting sentence features and the processing for extracting image features are performed by different devices.
- The image recognition method can be performed by the image recognition system 10, which includes the image recognition device 1000 and the sentence feature extraction device 2000: the sentence feature extraction device 2000 performs the step of extracting sentence features, and the image recognition device 1000 performs the steps of extracting candidate areas and image features, generating a comparison result, and generating a detection result.
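The division of labor described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names, the toy feature vectors, the use of cosine similarity as the comparison, and the 0.8 threshold are all invented for the example. The disclosure itself only requires that sentence features be comparable to image features.

```python
import math

def cosine_similarity(a, b):
    """Assumed comparison measure; the source only says features are compared."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Runs on the sentence feature extraction device (2000).
def extract_sentence_features(phrase, toy_encoder):
    """Hypothetical stand-in for a trained text encoder (a lookup table here)."""
    return toy_encoder[phrase]

# Runs on the image recognition device (1000).
def detect(sentence_features, candidates, threshold=0.8):
    """candidates: list of (box, image_features) pairs from image analysis."""
    results = []
    for box, image_features in candidates:
        score = cosine_similarity(sentence_features, image_features)
        if score >= threshold:
            results.append({"box": box, "score": score})
    return results

toy_encoder = {"person with a red bag": [1.0, 0.0, 1.0]}
candidates = [
    ((10, 20, 50, 80), [0.9, 0.1, 1.1]),  # region resembling the phrase
    ((60, 10, 90, 40), [0.0, 1.0, 0.0]),  # unrelated region
]
features = extract_sentence_features("person with a red bag", toy_encoder)
detections = detect(features, candidates)
```

Only the first candidate area clears the threshold in this toy data, so `detections` holds a single entry for the box `(10, 20, 50, 80)`.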
- The image recognition method may further include the steps of acquiring natural language phrases, transmitting the extracted sentence features, and receiving the sentence features.
- In that case, the sentence feature extraction device 2000 executes the steps of acquiring natural language phrases and transmitting the extracted sentence features, and the image recognition device 1000 executes the steps of receiving the sentence features and generating a comparison result using the received sentence features.
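The transmitting and receiving steps can be illustrated with a small serialization sketch. The source does not specify a wire format or transport, so the JSON message layout and the function names below are assumptions chosen only to show that a feature vector, not the phrase-processing model, is what crosses between the two devices.

```python
import json

def encode_sentence_features(phrase, features):
    """Sender side (device 2000): pack features into bytes for transmission."""
    message = {"phrase": phrase, "features": list(features)}
    return json.dumps(message).encode("utf-8")

def decode_sentence_features(payload):
    """Receiver side (device 1000): unpack bytes into phrase and features."""
    message = json.loads(payload.decode("utf-8"))
    return message["phrase"], message["features"]

payload = encode_sentence_features("white van", [0.25, -0.5, 0.75])
phrase, features = decode_sentence_features(payload)
```

A compact binary encoding would likely be preferable for large feature vectors; JSON is used here only because it is easy to read.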
- The steps described above do not necessarily have to be performed in the order described.
- The image recognition method may include multiple steps performed in parallel.
- The steps of the image recognition method do not have to form a single series of consecutive processes; they may include processes performed at different times.
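As a rough illustration of steps running in parallel, the sketch below submits sentence feature extraction and image feature extraction concurrently and joins them only where the comparison step needs both results. In the described system these steps run on different devices; the thread pool and both placeholder functions are assumptions used purely to show the ordering constraint.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_sentence_features(phrase):
    # Hypothetical placeholder for the text encoder.
    return [float(len(word)) for word in phrase.split()]

def extract_image_features(image):
    # Hypothetical placeholder for candidate-area and image-feature extraction.
    return [(("box", i), [float(v)]) for i, v in enumerate(image)]

with ThreadPoolExecutor(max_workers=2) as pool:
    sentence_future = pool.submit(extract_sentence_features, "red car")
    image_future = pool.submit(extract_image_features, [3, 1, 4])
    # The comparison step needs both results, so it waits on both here.
    sentence_features = sentence_future.result()
    candidates = image_future.result()
```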
- Embodiment 2. FIG. 4 is a configuration diagram of an image recognition system 20 according to the second embodiment.
- The image recognition system 20 includes an image recognition device 1050, a sentence feature extraction device 2000, and a result display unit 3000.
- The image recognition system 20 includes the image recognition device 1050 instead of the image recognition device 1000 of the image recognition system 10. The following mainly describes the differences from the first embodiment; a detailed description of the parts shared with the first embodiment is omitted.
- The image recognition system 20 performs the sentence feature extraction process only when an object category to be detected is changed or added.
- The image recognition device 1050 has an image acquisition unit 1100, an image analysis unit 1200, a sentence feature receiving unit 1300, a feature comparison unit 1400, a detection result generation unit 1500, and a detection result transmission unit 1600. In addition to these components, which it shares with the image recognition device 1000, the image recognition device 1050 has a sentence feature storage unit 1350.
- The sentence feature storage unit 1350 stores the sentence features that the sentence feature receiving unit 1300 receives from the sentence feature extraction device 2000.
- The sentence feature storage unit 1350 can store multiple sentence features and can output the stored sentence features to the feature comparison unit 1400.
- FIG. 5 is a flowchart for explaining an example of the operation of the image recognition system 20 shown in FIG. 4.
- FIG. 5 adds step S2330 to the flowchart of FIG. 3. Since the steps other than step S2330 are the same as those in FIG. 3, detailed explanations of each step will be omitted, and only the operations that differ from those in FIG. 3 will be explained below.
- The series of steps from S2210 to S2250, which constitute the sentence feature extraction process, need not be executed in parallel with the image feature extraction process; they are basically executed independently, at any time before the image feature extraction process.
- After the sentence feature transmission unit 2300 sends the sentence features in step S2260, the sentence feature receiving unit 1300 stores the received sentence features in the sentence feature storage unit 1350 (step S2330).
- After the sentence features are stored in the sentence feature storage unit 1350, the feature comparison unit 1400 compares the sentence features with the image features in step S2150. The subsequent processing is the same as in FIG. 3.
- Whenever the sentence acquisition process of step S2210 occurs, the sentence feature extraction device 2000 executes steps S2220 to S2260 again, and the sentence feature receiving unit 1300 updates the contents of the sentence feature storage unit 1350.
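The store-then-compare flow of FIG. 5 can be sketched as below. The class and function names are hypothetical, the per-phrase dictionary layout is an assumption, and a plain dot product stands in for whatever comparison the feature comparison unit 1400 actually performs.

```python
class SentenceFeatureStore:
    """Toy stand-in for the sentence feature storage unit 1350."""

    def __init__(self):
        self._features = {}

    def update(self, phrase, features):
        # Corresponds to step S2330: keep the latest features for each phrase.
        self._features[phrase] = list(features)

    def remove(self, phrase):
        # An object category can also be deleted from the detection target.
        self._features.pop(phrase, None)

    def all(self):
        return dict(self._features)

def compare(store, image_features):
    """Step S2150 sketch: score image features against every stored phrase."""
    return {
        phrase: sum(s * i for s, i in zip(feats, image_features))
        for phrase, feats in store.all().items()
    }

store = SentenceFeatureStore()
store.update("person", [1.0, 0.0])   # received before any image arrives
store.update("bicycle", [0.0, 1.0])
scores_before = compare(store, [0.5, 2.0])
store.update("person", [0.0, 3.0])   # phrase re-acquired: contents updated
scores_after = compare(store, [0.5, 2.0])
```

Because the store is read at comparison time, an update takes effect on the next image without any change to the image-side processing.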
- The image recognition device 1050 further includes a sentence feature storage unit 1350 that stores the sentence features received by the sentence feature receiving unit 1300, and the feature comparison unit 1400 can compare the sentence features stored in the sentence feature storage unit 1350 with image features.
- This configuration makes it possible to extract sentence features and store them in the image recognition device 1050 before acquiring image data. This eliminates the need to extract sentence features when there is no change in the detection target, and the feature comparison unit 1400 can perform feature comparison processing using sentence features stored in advance in the sentence feature storage unit 1350 without waiting for the reception of sentence features. This improves the throughput of the image recognition device 1050 and reduces the calculation load on the image recognition system 20 as a whole.
- When a new natural language phrase is acquired, the sentence feature extraction device 2000 extracts sentence features from it using the sentence feature extraction unit 2200, and the sentence feature transmission unit 2300 transmits the extracted sentence features to the image recognition device 1050.
- A new natural language phrase does not need to be acquired each time image recognition processing is performed, for example.
- Image recognition is generally performed using the sentence features stored in the sentence feature storage unit 1350 until the detection target is changed, that is, until an object category is added to the detection target or an object category set for the detection target is deleted.
- When the detection target is changed, the sentence acquisition unit 2100 acquires a new natural language phrase, and the sentence feature extraction unit 2200 extracts new sentence features.
- In a conventional configuration, the entire DNN model must be operated each time an image is acquired.
- Unless the detection target is changed, the processing content and results of the sentence feature extraction unit 2200 do not change, so repeating that processing for every image is computationally inefficient.
- In the image recognition system 20, the sentence feature extraction process is omitted while the detection target is unchanged, which improves the throughput of the image recognition device 1050 and reduces the computational load on the image recognition system 20 as a whole.
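The saving can be made concrete by counting text-encoder invocations over a sequence of frames. The function below is a hypothetical bookkeeping sketch, not a measurement of any real model: it compares re-extracting sentence features for every image against extracting them only when the phrase changes.

```python
def count_text_encoder_runs(frame_phrases, cache_features):
    """Count text-encoder invocations over a sequence of frames.

    frame_phrases[i] is the detection phrase in effect for frame i.
    """
    runs = 0
    cached_phrase = None
    for phrase in frame_phrases:
        if not cache_features or phrase != cached_phrase:
            runs += 1  # sentence features must be (re)extracted
            cached_phrase = phrase
    return runs

frames = ["person"] * 4 + ["person with a bag"] * 3
every_frame = count_text_encoder_runs(frames, cache_features=False)
only_on_change = count_text_encoder_runs(frames, cache_features=True)
```

For these seven frames with one change of detection target, extraction runs seven times without stored features and twice with them.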
- The sentence acquisition unit 2100, sentence feature extraction unit 2200, image acquisition unit 1100, image analysis unit 1200, feature comparison unit 1400, and detection result generation unit 1500 are realized by processing circuits. These processing circuits may be realized by dedicated hardware or by a control circuit using a CPU (Central Processing Unit).
- FIG. 6 is a diagram showing dedicated hardware for implementing the functions of the image recognition systems 10 and 20 according to the first and second embodiments.
- The processing circuit 90 is a single circuit, a composite circuit, a programmed processor, a parallel programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or a combination of these.
- FIG. 7 is a diagram showing the configuration of a control circuit for realizing the functions of the image recognition systems 10 and 20 according to the first and second embodiments.
- The control circuit 91 includes a processor 92 and memory 93.
- The processor 92 is a CPU, and is also called a processing device, arithmetic unit, microprocessor, microcomputer, DSP (Digital Signal Processor), etc.
- The memory 93 is, for example, a non-volatile or volatile semiconductor memory such as RAM (Random Access Memory), ROM (Read Only Memory), flash memory, EPROM (Erasable Programmable ROM), or EEPROM (registered trademark) (Electrically EPROM), or a magnetic disk, flexible disk, optical disk, compact disk, minidisk, or DVD (Digital Versatile Disk).
- When the above processing circuit is realized by the control circuit 91, the processor 92 reads and executes the programs that are stored in the memory 93 and correspond to the processing of each component.
- The memory 93 is also used as temporary memory for each process executed by the processor 92.
- The programs executed by the processor 92 may be provided stored on a storage medium, or may be provided via a communication channel such as the Internet.
- The functions of the sentence feature receiving unit 1300, detection result transmission unit 1600, and sentence feature transmission unit 2300 can be realized using a communication device (not shown). The functions of the image acquisition unit 1100 can be realized using an image sensor or imaging device, as described above.
- The sentence feature storage unit 1350 can be realized using a storage device.
- FIGS. 1 and 4 show only one image recognition device 1000, 1050 and one sentence feature extraction device 2000, but the image recognition systems 10, 20 may be equipped with multiple image recognition devices 1000, 1050.
- The sentence feature extraction device 2000 can transmit sentence features to multiple image recognition devices 1000, 1050.
- The image recognition systems 10, 20 may be equipped with multiple sentence feature extraction devices 2000.
- The image recognition devices 1000, 1050 described above receive sentence features directly from the sentence feature extraction device 2000, but the configuration is not limited to this example. It is sufficient that the sentence feature extraction process is performed somewhere other than the image recognition devices 1000, 1050.
- The sentence features extracted by the sentence feature extraction device 2000 may be stored in a computer other than the sentence feature extraction device 2000, and the image recognition devices 1000, 1050 may receive the sentence features from the computer that stores them.
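The variation with an intermediate computer and several image recognition devices might look like the sketch below. `FeatureRelay`, `ImageRecognitionDevice`, and the camera names are hypothetical, and in-process method calls stand in for whatever communication the real devices would use.

```python
class FeatureRelay:
    """Hypothetical intermediate computer that stores sentence features."""

    def __init__(self):
        self._latest = None

    def publish(self, features):
        # Called by the sentence feature extraction device.
        self._latest = list(features)

    def fetch(self):
        # Called by any image recognition device.
        if self._latest is None:
            raise LookupError("no sentence features published yet")
        return list(self._latest)

class ImageRecognitionDevice:
    def __init__(self, name):
        self.name = name
        self.sentence_features = None

    def receive_from(self, relay):
        self.sentence_features = relay.fetch()

relay = FeatureRelay()
relay.publish([0.1, 0.9])  # extracted once, on a separate computer
devices = [ImageRecognitionDevice("camera-1"), ImageRecognitionDevice("camera-2")]
for device in devices:
    device.receive_from(relay)
```

One extraction serves every device that fetches from the relay, so adding cameras does not add text-encoder work.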
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
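| PCT/JP2024/015829 WO2025224820A1 (ja) | 2024-04-23 | 2024-04-23 | Image recognition system, image recognition device, sentence feature extraction device, program, and image recognition method |
| JP2024569411A JP7682408B1 (ja) | 2024-04-23 | 2024-04-23 | Image recognition system, image recognition device, sentence feature extraction device, program, and image recognition method |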
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2024/015829 WO2025224820A1 (ja) | 2024-04-23 | 2024-04-23 | Image recognition system, image recognition device, sentence feature extraction device, program, and image recognition method |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025224820A1 (ja) | 2025-10-30 |
Family
ID=95745040
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2024/015829 (Pending) WO2025224820A1 (ja) | 2024-04-23 | 2024-04-23 | Image recognition system, image recognition device, sentence feature extraction device, program, and image recognition method |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JP7682408B1 |
| WO (1) | WO2025224820A1 |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
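| JP7796291B1 (ja) * | 2025-10-06 | 2026-01-08 | SoftBank Corp. | Information processing device, control method for information processing device, and control program for information processing device |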
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2022106147A (ja) * | 2021-01-06 | 2022-07-19 | Fujitsu Limited | Determination model generation program, information processing device, and determination model generation method |
- 2024-04-23: JP application JP2024569411A, patent JP7682408B1 (ja), status: Active
- 2024-04-23: WO application PCT/JP2024/015829, publication WO2025224820A1 (ja), status: Pending
Non-Patent Citations (1)
| Title |
|---|
| RADFORD ALEC, KIM JONG WOOK, HALLACY CHRIS, RAMESH ADITYA, GOH GABRIEL, AGARWAL SANDHINI, SASTRY GIRISH, ASKELL AMANDA, MISHKIN PA: "Learning transferable visual models from natural language supervision", 26 February 2021 (2021-02-26), XP093067451, Retrieved from the Internet <URL:https://arxiv.org/pdf/2103.00020.pdf> DOI: 10.48550/arXiv.2103.00020 * |
Also Published As
| Publication number | Publication date |
|---|---|
| JP7682408B1 (ja) | 2025-05-23 |
| JPWO2025224820A1 | 2025-10-30 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | ENP | Entry into the national phase | Ref document number: 2024569411; Country of ref document: JP; Kind code of ref document: A |
| | WWE | Wipo information: entry into national phase | Ref document number: 2024569411; Country of ref document: JP |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24937113; Country of ref document: EP; Kind code of ref document: A1 |