WO2023159945A1 - Multimodal model training and image recognition method, apparatus, and electronic device - Google Patents

Multimodal model training and image recognition method, apparatus, and electronic device Download PDF

Info

Publication number
WO2023159945A1
WO2023159945A1 · PCT/CN2022/122303 · CN2022122303W
Authority
WO
WIPO (PCT)
Prior art keywords
generated
feature
features
target
image
Prior art date
Application number
PCT/CN2022/122303
Other languages
English (en)
French (fr)
Inventor
申冲
李峰
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司 (Suzhou Inspur Intelligent Technology Co., Ltd.)
Publication of WO2023159945A1 publication Critical patent/WO2023159945A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Definitions

  • the present application relates to the field of image recognition, in particular to a multi-modal model training and image recognition method, device, and electronic equipment.
  • transformer: a deep learning model based on the self-attention mechanism
  • true artificial intelligence should understand both images and text, not just one of them. Therefore, much related research on multimodal (text and image) understanding has emerged in recent years.
  • the main purpose of this part of the work is to convert the original image into text-like features and then concatenate the image features row by row, as shown in Figure 2. This yields a representation similar to text features, so the image features can be fed into the transformer structure just like text features, and the text corresponding to the original image can be output from the image features.
  • although the above method obtains the image feature vectors through an encoder, it concatenates them row by row and simply applies the autoregressive method designed for text (one-dimensional features). As a result, the local receptive field of the image is destroyed during generation, the information covered by each feature vector is not rich enough, and the relationships with other feature vectors are not considered, so the generated image feature vectors are inaccurate and image recognition accuracy is low.
  • the embodiment of the present application provides a multi-modal model training and image recognition method, device, and electronic equipment, aiming at solving the problem that the local receptive field is destroyed during the image generation process in the prior art.
  • the embodiment of the present application provides a multimodal model training method, the method comprising: acquiring a sample image and the text feature vector corresponding to the sample image; inputting the sample image into the feature extraction network of an initial multimodal model to generate the image feature vector corresponding to the sample image, where the feature extraction network is used to encode the sample image and to generate the image feature vector according to the associations between features to be generated and already-generated features; inputting the text feature vector and the image feature vector into the transformer structure of the initial multimodal model and outputting the candidate text corresponding to the sample image; and updating the parameters of the initial multimodal model according to the target text corresponding to the text feature vector and the candidate text, so as to determine the target multimodal model.
  • the multimodal model training method acquires a sample image and its corresponding text feature vector, and inputs the sample image into the feature extraction network of the initial multimodal model to generate the corresponding image feature vector. The feature extraction network is used to encode the sample image and to generate the image feature vector according to the association between each feature to be generated and the already-generated features, which guarantees that these associations are taken into account when generating the image feature vector, instead of the features being concatenated directly without considering how they relate. The method therefore guarantees the accuracy of the generated image feature vector, so the local receptive field is not destroyed during image generation.
  • the text feature vector and the image feature vector are then input into the transformer structure of the initial multimodal model, and the candidate text corresponding to the sample image is output, which guarantees the accuracy of the candidate text. The parameters of the initial multimodal model are updated according to the target text corresponding to the text feature vector and the candidate text, so the accuracy of the determined target multimodal model is guaranteed.
  • the sample image is input to the feature extraction network of the initial multimodal model to generate an image feature vector corresponding to the sample image, including:
  • the image feature vector is generated according to the association relationship between each feature to be generated and the generated feature and the sequence of feature generation.
  • the multimodal model training method provided by the embodiment of the present application acquires the feature generation order corresponding to the sample image. Since the feature generation order determines the order in which each feature of the image feature vector is generated, it determines the association between each feature to be generated and the already-generated features; acquiring the feature generation order corresponding to the sample image can therefore improve the accuracy of the generated image feature vector. Then, the image feature vector is generated according to those associations and the feature generation order, which ensures the accuracy of the generated image feature vector.
  • the dependency between each feature to be generated and the target generated features is determined according to the positional relationship between the feature to be generated and the corresponding target generated features
  • the image feature vector is generated according to the sequence of feature generation.
  • the multimodal model training method acquires the target generated features within a preset range of the feature to be generated, thereby ensuring the accuracy of the acquired target generated features. Then, according to the positional relationship between each feature to be generated and the corresponding target generated features, the dependency between them is determined, thereby ensuring the accuracy of the determined dependencies. The image feature vector is generated in the feature-generation order according to the dependency between each feature to be generated and the target generated features, which ensures the accuracy of each feature in the generated image feature vector and, in turn, the accuracy of the generated image feature vector.
  • the dependency between each feature to be generated and the target generated features is determined, including:
  • the dependency weight of each target generated feature relative to the to-be-generated feature is determined.
  • in the multimodal model training method provided in the embodiment of the present application, for each feature to be generated, the distance between the feature to be generated and the corresponding target generated features is determined according to the positional relationship between them, thus ensuring the accuracy of the determined distances.
  • then, the dependency weight of each target generated feature relative to the feature to be generated is determined according to those distances, thereby ensuring the accuracy of the determined weights and, in turn, the accuracy of the image feature vector determined from the dependency weights of each target generated feature relative to the features to be generated.
  • the image feature vector is generated according to the sequence of feature generation, including:
  • the image feature vector is generated according to the sequence of feature generation.
  • the multimodal model training method provided by the embodiment of the present application determines each feature to be generated according to the dependency weight of each target generated feature with respect to the to-be-generated feature, which ensures the accuracy of the generated to-be-generated features. Then, according to each feature to be generated, the image feature vector is generated according to the sequence of feature generation, which ensures the accuracy of the generated image feature vector.
  • the embodiment of the present application also provides an image recognition method, the method includes:
  • the target image is input into the target multimodal model, and the text corresponding to the target image is output; the target multimodal model is trained according to the multimodal model training method in any one of the above implementation manners.
  • the image recognition method provided by the embodiment of the present application acquires the target image to be recognized, inputs the target image into the target multimodal model, and outputs the text corresponding to the target image; the text corresponding to the target image can thus be generated from the target image, and the accuracy of the generated text is guaranteed.
  • the embodiment of the present application also provides a multimodal model training device, which includes:
  • the first acquisition module is used to acquire the sample image and the text feature vector corresponding to the sample image;
  • the generation module is used to input the sample image into the feature extraction network of the initial multimodal model to generate the image feature vector corresponding to the sample image; the feature extraction network is used to encode the sample image and to generate the image feature vector according to the associations between features to be generated and already-generated features;
  • the first output module is used to input the text feature vector and the image feature vector into the transformer structure of the initial multimodal model, and output the candidate text corresponding to the sample image;
  • the update module is configured to update the parameters of the initial multimodal model according to the target text and the candidate text corresponding to the text feature vector, so as to determine the target multimodal model.
  • the multimodal model training device acquires a sample image and its corresponding text feature vector, and inputs the sample image into the feature extraction network of the initial multimodal model to generate the corresponding image feature vector. The feature extraction network is used to encode the sample image and to generate the image feature vector according to the association between each feature to be generated and the already-generated features, which guarantees that these associations are taken into account when generating the image feature vector, instead of the features being concatenated directly without considering how they relate. The device therefore guarantees the accuracy of the generated image feature vector, so the local receptive field is not destroyed during image generation.
  • the text feature vector and the image feature vector are then input into the transformer structure of the initial multimodal model, and the candidate text corresponding to the sample image is output, which guarantees the accuracy of the candidate text. The parameters of the initial multimodal model are updated according to the target text corresponding to the text feature vector and the candidate text, so the accuracy of the determined target multimodal model is guaranteed.
  • the embodiment of the present application also provides an image recognition device, which includes:
  • the second acquisition module is used to acquire the target image to be identified
  • the second output module is configured to input the target image into the target multimodal model, and output the text corresponding to the target image; the target multimodal model is trained according to the multimodal model training method in any one of the above implementation manners.
  • the image recognition device provided in the embodiment of the present application acquires the target image to be recognized, inputs the target image into the target multimodal model, and outputs the text corresponding to the target image; the text corresponding to the target image can thus be generated from the target image, and the accuracy of the generated text is guaranteed.
  • An embodiment of the present application provides an electronic device, including a memory and a processor.
  • the memory and the processor are connected to each other in communication.
  • Computer instructions are stored in the memory.
  • the processor executes the computer instructions to perform the multimodal model training method and the image recognition method in any one of the above embodiments.
  • the embodiment of the present application provides a non-volatile readable storage medium that stores computer instructions, and the computer instructions are used to cause a computer to execute the multimodal model training method and the image recognition method in any one of the above implementations.
  • FIG. 1 is a flow chart of encoding and decoding of image features in the prior art provided by an embodiment of the present application;
  • FIG. 2 is a flow chart of the generation sequence of image features in the prior art provided by an embodiment of the application;
  • Fig. 3 is a flow chart of applying the multimodal model training method provided by the embodiment of the present application.
  • FIG. 4 is a schematic diagram of an autoregressive strategy for images and text in the target multimodal model provided by the embodiment of the application;
  • Fig. 5 is a flow chart of applying the multimodal model training method provided by the embodiment of the present application.
  • FIG. 6 is a schematic diagram of the sequence of feature generation in the multimodal model training method provided by the embodiment of the present application.
  • Fig. 7 is a flow chart of applying the multimodal model training method provided by the embodiment of the present application.
  • FIG. 8 is a schematic diagram of a feature-dependent frame in a multimodal model training method provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of the dependency between the feature to be generated P4 and the target generated features in the multimodal model training method provided by the embodiment of the present application;
  • FIG. 10 is a schematic diagram of the dependency between the feature to be generated P57 and the target generated features in the multimodal model training method provided by the embodiment of the present application;
  • Fig. 11 is a flow chart of applying the multimodal model training method provided by the embodiment of the present application.
  • FIG. 12 is a schematic diagram of the distance between the features to be generated and the corresponding target generated features in the multimodal model training method provided by the embodiment of the present application;
  • Fig. 13 is a flow chart of applying the multimodal model training method provided by the embodiment of the present application.
  • Fig. 14 is a flow chart of applying the image recognition method provided by the embodiment of the present application.
  • Fig. 15 is a functional block diagram of the multimodal model training device provided by the embodiment of the application.
  • Fig. 16 is a functional block diagram of the image recognition device provided by the embodiment of the application.
  • FIG. 17 is a schematic diagram of a hardware structure of an electronic device provided by an embodiment of the application.
  • the execution body of the multimodal model training method provided in the embodiment of the present application may be a multimodal model training device, and the multimodal model training device may be implemented through software, hardware, or a combination of software and hardware. It can be implemented as part or all of a computer device, where the computer device can be a server or a terminal, where the server in this embodiment of the application can be a single server, or a server cluster composed of multiple servers.
  • the terminals in the embodiments may be smart phones, personal computers, tablet computers, wearable devices, intelligent robots and other intelligent hardware devices.
  • in the following method embodiments, the execution subject is an electronic device by way of example.
  • a multi-modal model training method is provided.
  • the application of the method to electronic equipment is used as an example for illustration, including the following steps:
  • the electronic device can receive, through its connections with other devices, the sample image and the corresponding text feature vector sent by those devices; the electronic device can also receive the sample image and the corresponding text feature vector input by the user.
  • the electronic device can also receive, through its connections with other devices, the sample image and the corresponding target text sent by those devices; the electronic device can also receive the sample image and the corresponding target text input by the user.
  • after receiving the sample image and the corresponding target text, the electronic device performs feature extraction on the target text, thereby obtaining the text feature vector corresponding to the target text.
  • the embodiment of the present application does not specifically limit the manner in which the electronic device acquires the sample image and the corresponding text feature vector.
  • the feature extraction network is used to encode the sample image, and generate image feature vectors according to the association relationship between the features to be generated and the generated features.
  • the electronic device inputs the sample image to the feature extraction network of the initial multimodal model, and the feature extraction network encodes the sample image, and generates image feature vectors according to the association between the features to be generated and the generated features.
  • the electronic device can input the text feature vector and the image feature vector into the transformer structure of the initial multimodal model; the transformer structure of the initial multimodal model transforms the text feature vector and the image feature vector, thereby outputting the candidate text corresponding to the sample image.
  • the text feature vector and the image feature vector are input into the transformer structure of the initial multimodal model, and the candidate text corresponding to the sample image is output.
  • the electronic device may calculate the loss value according to the target text and the candidate text corresponding to the text feature vector, and then update the parameters of the initial multimodal model according to the calculated loss value to determine the target multimodal model.
  • the multimodal model training method acquires a sample image and its corresponding text feature vector, and inputs the sample image into the feature extraction network of the initial multimodal model to generate the corresponding image feature vector. The feature extraction network is used to encode the sample image and to generate the image feature vector according to the association between each feature to be generated and the already-generated features, which guarantees that these associations are taken into account when generating the image feature vector, instead of the features being concatenated directly without considering how they relate. The method therefore guarantees the accuracy of the generated image feature vector, so the local receptive field is not destroyed during image generation.
  • the text feature vector and the image feature vector are then input into the transformer structure of the initial multimodal model, and the candidate text corresponding to the sample image is output, which guarantees the accuracy of the candidate text. The parameters of the initial multimodal model are updated according to the target text corresponding to the text feature vector and the candidate text, so the accuracy of the determined target multimodal model is guaranteed.
  • "input the sample image to the feature extraction network of the initial multimodal model in the above S12, and generate the image feature vector corresponding to the sample image” may include the following step:
  • the electronic device may receive a feature generation sequence corresponding to a sample image input by a user.
  • the feature generation order corresponding to the sample image may be in a clockwise order from outside to inside.
  • P1 is the first feature to be generated; features are then generated clockwise, i.e., P1 first, then P2, and finally P8 and P9, after which generation continues clockwise inward, producing P10-P17, P18-P25, P26-P32, P33-P39, and so on.
  • the sequence of feature generation corresponding to the sample image may be from outside to inside in a counterclockwise direction, wherein the first feature may start from the feature corresponding to any one of the four corners of the sample image.
  • when generating the image feature vector corresponding to the sample image, features are not generated row by row; instead, the surrounding features are generated first, and then all features are generated from the outside in according to the associations between each feature to be generated and the surrounding already-generated features.
  • after acquiring the feature generation sequence, the electronic device generates the image feature vector according to the association between each feature to be generated and the already-generated features and the feature generation sequence.
  • the multimodal model training method provided by the embodiment of the present application acquires the feature generation sequence corresponding to the sample image. Since the feature generation sequence determines the order in which each feature of the image feature vector is generated, it determines the association between each feature to be generated and the already-generated features; acquiring the feature generation sequence corresponding to the sample image can therefore improve the accuracy of the generated image feature vector. Then, the image feature vector is generated according to those associations and the feature generation sequence, thereby ensuring the accuracy of the generated image feature vector.
  • the "generate image feature vectors according to the association relationship between each feature to be generated and the generated features and the sequence of feature generation" in the above S22 may include the following step:
  • the electronic device may determine, according to an instruction from the user, the target already-generated features within a preset range of the feature to be generated.
  • the electronic device may receive a target generated feature input by a user and within a preset range of the feature to be generated.
  • the electronic device can generate an N*N feature dependency frame according to the user instruction, where N is an odd number greater than 3; the electronic device then determines the preset range according to the feature dependency frame, thereby obtaining the target generated features within the preset range of the feature to be generated.
  • the feature dependency frame is as shown in FIG. 8 .
  • the feature to be generated P_i is located in the middle of the feature dependency frame window, and it should depend on all surrounding target generated features as much as possible.
  • the electronic device may determine the dependency relationship between each feature to be generated and the target generated feature according to the positional relationship between the to-be-generated feature and the corresponding target generated feature.
  • the electronic device can fuse the target generated features that have a dependency relationship with the feature to be generated, thereby generating the feature to be generated. Then, the electronic device generates the image feature vector from the features to be generated in the feature-generation order.
  • as shown in FIG. 9, when generating the feature to be generated P4, it is determined, according to the 5*5 feature dependency frame mentioned in the above embodiment and the feature generation order shown in FIG. 6, that P4 can depend only on the target generated features P2 and P3. Therefore, the electronic device can fuse P2 and P3 according to the dependency between P4 and them, so as to generate P4. Then, the electronic device generates the image feature vector from the feature P4 in the feature-generation order.
  • as shown in FIG. 10, when generating the feature to be generated P57, the already-generated features on which it depends include not only row features but also column features. The black area in FIG. 10 marks all generated features; according to the 5*5 feature dependency frame mentioned in the above embodiment and the feature generation order shown in FIG. 6, the target generated features on which P57 depends are determined to be P1-P5, P39-P36, and P54-P56. Then, the electronic device generates the image feature vector from the feature P57 in the feature-generation order.
  • the multimodal model training method acquires the target generated features within a preset range of the feature to be generated, thereby ensuring the accuracy of the acquired target generated features. Then, according to the positional relationship between each feature to be generated and the corresponding target generated features, the dependency between them is determined, thereby ensuring the accuracy of the determined dependencies. The image feature vector is generated in the feature-generation order according to those dependencies, which ensures the accuracy of each feature in the generated image feature vector and, in turn, the accuracy of the generated image feature vector.
  • the electronic device can obtain the positions of the feature to be generated and the corresponding target generated features, and then determine the positional relationship between the feature to be generated and the corresponding target generated features according to those positions.
  • the electronic device determines the distance between the feature to be generated and the corresponding generated feature of the target according to the positional relationship between the feature to be generated and the corresponding target generated feature.
  • as shown in FIG. 12, the distances between the feature to be generated P_ij and the corresponding target generated features are illustrated. In the dependency-weight formula (1) (reproduced only as an image in the original: PCTCN2022122303-appb-000001), S_xy is the distance from the target generated feature at row x, column y to the feature to be generated P_ij; S_ij ∈ {distances from all target generated features to the feature to be generated}; i ∈ [0, N_w], j ∈ [0, N_h]; N_w is the number of rows of features in the sample image, and N_h is the number of columns of features in the sample image.
  • in the multimodal model training method provided in the embodiment of the present application, for each feature to be generated, the distance between the feature to be generated and the corresponding target generated features is determined according to the positional relationship between them, thus ensuring the accuracy of the determined distances. Then, the dependency weight of each target generated feature relative to the feature to be generated is determined according to those distances, thereby ensuring the accuracy of the determined weights and, in turn, the accuracy of the image feature vector determined from those dependency weights.
  • after computing the dependency weight of each target generated feature relative to the feature to be generated, the electronic device can use the following formula to determine each feature to be generated T_ij:

    T_ij = Σ W_xy · A(x, y)    (2)

    where W_xy denotes the dependency weight of each target generated feature relative to the feature to be generated, and A(x, y) is the target generated feature at row x, column y corresponding to the feature to be generated; i ∈ [0, N_w], j ∈ [0, N_h], N_w is the number of rows of features in the sample image, and N_h is the number of columns of features in the sample image.
  • after generating the features to be generated, the electronic device generates the image feature vector from each feature to be generated in the feature-generation order.
  • the multimodal model training method provided by the embodiment of the present application determines each feature to be generated according to the dependency weight of each target generated feature with respect to the to-be-generated feature, which ensures the accuracy of the generated to-be-generated features. Then, according to each feature to be generated, the image feature vector is generated according to the sequence of feature generation, which ensures the accuracy of the generated image feature vector.
  • an embodiment of the present application provides an image recognition method.
  • the image recognition method provided by the embodiment of the present application may be executed by an image recognition device, and the image recognition device may be implemented as part or all of computer equipment through software, hardware, or a combination of software and hardware.
  • the computer device may be a server or a terminal
  • the server in the embodiment of the present application may be a single server, or may be a server cluster composed of multiple servers
  • the terminal in the embodiment of the present application may be a smart phone, a personal computer, a tablet computer, a wearable device, an intelligent robot, or another intelligent hardware device.
  • in the following method embodiments, the execution subject is an electronic device by way of example.
  • an image recognition method is provided.
  • the application of the method to electronic equipment is used as an example for illustration, including the following steps:
  • the electronic device may receive target images sent by other devices, and may also receive target images input by a user.
  • the target multimodal model is obtained according to any of the multimodal model training methods in the above-mentioned embodiments.
  • the electronic device may train a target multimodal model based on the sample image and the target text corresponding to the sample image. Then, the electronic device inputs the target image into the target multimodal model, and the feature extraction network in the target multimodal model encodes the target image, and generates image feature vectors based on the correlation between the features to be generated and the generated features.
  • the target image feature vector is input into the transformer structure in the target multimodal model, and the text corresponding to the target image is output.
  • the image recognition method provided by the embodiment of the present application acquires the target image to be recognized, inputs the target image into the target multimodal model, and outputs the text corresponding to the target image; the text corresponding to the target image can thus be generated from the target image, and the accuracy of the generated text is guaranteed.
  • this embodiment provides a multi-modal model training device, the device includes:
  • the first obtaining module 71 is used to obtain the sample image and the character feature vector corresponding to the sample image;
  • the generation module 72 is used to input the sample image to the feature extraction network of the initial multimodal model to generate the image feature vector corresponding to the sample image.
  • the feature extraction network is used to encode the sample image and to generate the image feature vector according to the associations between features to be generated and already-generated features;
  • the first output module 73 is used to input the text feature vector and the image feature vector into the transformer structure of the initial multimodal model, and output the candidate text corresponding to the sample image;
  • the update module 74 is configured to update the parameters of the initial multimodal model according to the target text and the candidate text corresponding to the text feature vector, so as to determine the target multimodal model.
  • the above-mentioned generation module 72 is specifically used to acquire the target generated features within a preset range of the feature to be generated; determine the dependency between each feature to be generated and the target generated features according to the positional relationship between the feature to be generated and the corresponding target generated features; and generate the image feature vector in the feature-generation order according to those dependencies.
  • the above-mentioned generation module 72 is specifically used to determine each feature to be generated according to the dependency weight of each target generated feature relative to the feature to be generated, and to generate the image feature vector from each feature to be generated in the feature-generation order.
  • Each module in the above-mentioned multi-modal model training device and image recognition device can be fully or partially realized by software, hardware and a combination thereof.
  • the above-mentioned modules can be embedded in or independent of the processor in the electronic device in the form of hardware, and can also be stored in the memory of the electronic device in the form of software, so that the processor can invoke and execute the corresponding operations of the above-mentioned modules.
  • the embodiment of the present application also provides an electronic device, which has the multimodal model training device shown in FIG. 15 and the image recognition device shown in FIG. 16 .
  • the memory 94 can be a high-speed RAM (Random Access Memory) or a non-volatile memory, such as at least one disk memory.
  • the memory 94 may also be at least one storage device located remotely from the aforementioned processor 91.
  • the processor 91 may be combined with the device described in FIG. 15 or FIG. 16 , the memory 94 stores an application program, and the processor 91 invokes the program code stored in the memory 94 to execute any of the above method steps.
  • the communication bus 92 may be a peripheral component interconnect (PCI for short) bus or an extended industry standard architecture (EISA for short) bus or the like.
  • the communication bus 92 can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is used in FIG. 17 , but it does not mean that there is only one bus or one type of bus.
  • the memory 94 may include a volatile memory, such as a random-access memory (RAM); the memory may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 94 may also include a combination of the above types of memory.
  • the processor 91 may be a central processing unit (English: central processing unit, abbreviated: CPU), a network processor (English: network processor, abbreviated: NP) or a combination of CPU and NP.
  • the processor 91 may further include a hardware chip.
  • the aforementioned hardware chip may be an application-specific integrated circuit (English: application-specific integrated circuit, abbreviation: ASIC), a programmable logic device (English: programmable logic device, abbreviation: PLD) or a combination thereof.
  • the above PLD can be a complex programmable logic device (English: complex programmable logic device, abbreviated: CPLD), field programmable logic gate array (English: field-programmable gate array, abbreviated: FPGA), general array logic (English: generic array logic, abbreviation: GAL) or any combination thereof.
  • the embodiment of the present application also provides a non-volatile readable storage medium storing computer-executable instructions, and the computer-executable instructions can execute the multimodal model training method and the image recognition method in any of the above method embodiments. The storage medium can be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the storage medium may also include a combination of the above types of memory.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present application discloses a multimodal model training and image recognition method, apparatus, and electronic device, relating to the field of image recognition. The method includes: acquiring a sample image and the text feature vector corresponding to the sample image; inputting the sample image into the feature extraction network of an initial multimodal model to generate the image feature vector corresponding to the sample image, where the feature extraction network is used to encode the sample image and to generate the image feature vector according to the associations between features to be generated and already-generated features; inputting the text feature vector and the image feature vector into the transformer structure of the initial multimodal model, and outputting the candidate text corresponding to the sample image; and updating the parameters of the initial multimodal model according to the target text corresponding to the text feature vector and the candidate text, so as to determine the target multimodal model. This method guarantees the accuracy of the generated image feature vector, so that the local receptive field is not destroyed during image generation.

Description

Multimodal model training and image recognition method, apparatus, and electronic device
Cross-reference to related applications
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 25, 2022, with application number 202210174577.9 and title "Multimodal model training and image recognition method, apparatus, and electronic device", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of image recognition, and in particular to a multimodal model training and image recognition method, apparatus, and electronic device.
Background
Since the advent of the transformer (a deep learning model based on the self-attention mechanism), it has achieved great success on both images and text. However, true artificial intelligence should be able to understand both images and text, rather than targeting only one of them. Hence, much related research on multimodal (text and image) understanding has emerged in recent years.
Existing multimodal autoregressive models still follow an approach similar to the autoregressive models of natural language processing; the key question is how to convert an image into text-like features. As shown in Figure 1, the mainstream method uses the feature extraction part of a variational autoencoder to apply convolution, pooling, and similar operations to the image, finally obtaining a vector matrix of size V_size*N_h*N_w, where V_size is the feature size, N_h is the number of feature columns, and N_w is the number of feature rows.
The main purpose of this work is to convert the original image into a text-like feature representation and then concatenate the image features row by row, as shown in Figure 2. This yields a representation similar to text features, so the image features can be fed into the transformer structure just like text features, and the text corresponding to the original image can be output from the image features.
Although the above method obtains the image feature vectors through an encoder, it concatenates them row by row and simply applies the autoregressive method designed for text (one-dimensional features). As a result, the local receptive field of the image is destroyed during generation, the information covered by each feature vector is not rich enough, and the relationships with other feature vectors are not considered, so the generated image feature vectors are inaccurate and image recognition accuracy is low.
Summary
In view of this, the embodiments of the present application provide a multimodal model training and image recognition method, apparatus, and electronic device, aiming to solve the prior-art problem that the local receptive field of an image is destroyed during generation.
An embodiment of the present application provides a multimodal model training method, the method including:
acquiring a sample image and the text feature vector corresponding to the sample image;
inputting the sample image into the feature extraction network of an initial multimodal model to generate the image feature vector corresponding to the sample image, where the feature extraction network is used to encode the sample image and to generate the image feature vector according to the associations between features to be generated and already-generated features;
inputting the text feature vector and the image feature vector into the transformer structure of the initial multimodal model, and outputting the candidate text corresponding to the sample image;
updating the parameters of the initial multimodal model according to the target text corresponding to the text feature vector and the candidate text, so as to determine the target multimodal model.
The multimodal model training method provided by the embodiments of the present application acquires a sample image and its corresponding text feature vector, and inputs the sample image into the feature extraction network of the initial multimodal model to generate the corresponding image feature vector. The feature extraction network is used to encode the sample image and to generate the image feature vector according to the association between each feature to be generated and the already-generated features, which guarantees that these associations are taken into account when generating the image feature vector, instead of the features being concatenated directly without considering how they relate. The method therefore guarantees the accuracy of the generated image feature vector, so the local receptive field is not destroyed during image generation. The text feature vector and the image feature vector are then input into the transformer structure of the initial multimodal model, and the candidate text corresponding to the sample image is output, which guarantees the accuracy of the candidate text. The parameters of the initial multimodal model are updated according to the target text corresponding to the text feature vector and the candidate text to determine the target multimodal model, which guarantees the accuracy of the determined target multimodal model.
Optionally, inputting the sample image into the feature extraction network of the initial multimodal model to generate the image feature vector corresponding to the sample image includes:
acquiring the feature generation order corresponding to the sample image;
generating the image feature vector according to the association between each feature to be generated and the already-generated features and the feature generation order.
In the multimodal model training method provided by the embodiments of the present application, the feature generation order corresponding to the sample image is acquired. Since the feature generation order determines the order in which each feature of the image feature vector is generated, it determines the association between each feature to be generated and the already-generated features; acquiring the feature generation order corresponding to the sample image can therefore improve the accuracy of the generated image feature vector. The image feature vector is then generated according to those associations and the feature generation order, which guarantees its accuracy.
Optionally, generating the image feature vector according to the association between each feature to be generated and the already-generated features and the feature generation order includes:
acquiring the target already-generated features within a preset range of the feature to be generated;
determining the dependency between each feature to be generated and the target generated features according to the positional relationship between the feature to be generated and the corresponding target generated features;
generating the image feature vector in the feature generation order according to the dependency between each feature to be generated and the target generated features.
In the multimodal model training method provided by the embodiments of the present application, the target generated features within a preset range of the feature to be generated are acquired, which guarantees the accuracy of the acquired target generated features. The dependency between each feature to be generated and the target generated features is then determined from their positional relationship, which guarantees the accuracy of the determined dependencies. The image feature vector is generated in the feature generation order according to those dependencies, which guarantees the accuracy of each feature in the generated image feature vector and hence the accuracy of the vector itself.
Optionally, determining the dependency between each feature to be generated and the target generated features according to the positional relationship between the feature to be generated and the corresponding target generated features includes:
for each feature to be generated, determining the distance between the feature to be generated and the corresponding target generated features according to the positional relationship between them;
determining the dependency weight of each target generated feature relative to the feature to be generated according to the distance between the feature to be generated and the corresponding target generated features.
In the multimodal model training method provided by the embodiments of the present application, for each feature to be generated, the distance between the feature to be generated and the corresponding target generated features is determined from their positional relationship, which guarantees the accuracy of the determined distances. The dependency weight of each target generated feature relative to the feature to be generated is then determined from those distances, which guarantees the accuracy of the determined weights and, in turn, the accuracy of the image feature vector determined from those dependency weights.
Optionally, generating the image feature vector in the feature generation order according to the dependency between each feature to be generated and the target generated features includes:
determining each feature to be generated according to the dependency weight of each target generated feature relative to the feature to be generated;
generating the image feature vector from each feature to be generated in the feature generation order.
In the multimodal model training method provided by the embodiments of the present application, each feature to be generated is determined according to the dependency weight of each target generated feature relative to it, which guarantees the accuracy of the generated features to be generated; the image feature vector is then generated from those features in the feature generation order, which guarantees its accuracy.
An embodiment of the present application further provides an image recognition method, the method including:
acquiring a target image to be recognized;
inputting the target image into a target multimodal model and outputting the text corresponding to the target image, where the target multimodal model is trained with the multimodal model training method of any one of the above embodiments.
The image recognition method provided by the embodiments of the present application acquires the target image to be recognized, inputs it into the target multimodal model, and outputs the text corresponding to the target image; the text corresponding to the target image can thus be generated from the target image, and the accuracy of the generated text is guaranteed.
An embodiment of the present application further provides a multimodal model training apparatus, the apparatus including:
a first acquisition module, configured to acquire a sample image and the text feature vector corresponding to the sample image;
a generation module, configured to input the sample image into the feature extraction network of an initial multimodal model to generate the image feature vector corresponding to the sample image, where the feature extraction network is used to encode the sample image and to generate the image feature vector according to the associations between features to be generated and already-generated features;
a first output module, configured to input the text feature vector and the image feature vector into the transformer structure of the initial multimodal model and output the candidate text corresponding to the sample image;
an update module, configured to update the parameters of the initial multimodal model according to the target text corresponding to the text feature vector and the candidate text, so as to determine the target multimodal model.
The multimodal model training apparatus provided by the embodiments of the present application acquires a sample image and its corresponding text feature vector, and inputs the sample image into the feature extraction network of the initial multimodal model to generate the corresponding image feature vector. The feature extraction network is used to encode the sample image and to generate the image feature vector according to the association between each feature to be generated and the already-generated features, which guarantees that these associations are taken into account when generating the image feature vector, instead of the features being concatenated directly without considering how they relate. The apparatus therefore guarantees the accuracy of the generated image feature vector, so the local receptive field is not destroyed during image generation. The text feature vector and the image feature vector are then input into the transformer structure of the initial multimodal model, and the candidate text corresponding to the sample image is output, which guarantees the accuracy of the candidate text. The parameters of the initial multimodal model are updated according to the target text corresponding to the text feature vector and the candidate text to determine the target multimodal model, which guarantees the accuracy of the determined target multimodal model.
An embodiment of the present application further provides an image recognition apparatus, the apparatus including:
a second acquisition module, configured to acquire a target image to be recognized;
a second output module, configured to input the target image into a target multimodal model and output the text corresponding to the target image, where the target multimodal model is trained with the multimodal model training method of any one of the above embodiments.
The image recognition apparatus provided by the embodiments of the present application acquires the target image to be recognized, inputs it into the target multimodal model, and outputs the text corresponding to the target image; the text corresponding to the target image can thus be generated from the target image, and the accuracy of the generated text is guaranteed.
An embodiment of the present application provides an electronic device including a memory and a processor connected in communication with each other; the memory stores computer instructions, and the processor executes the computer instructions to perform the multimodal model training method and the image recognition method of any one of the above embodiments.
An embodiment of the present application provides a non-volatile readable storage medium storing computer instructions, and the computer instructions are used to cause a computer to execute the multimodal model training method and the image recognition method of any one of the above embodiments.
Brief description of the drawings
To describe the technical solutions of the specific embodiments of the present application or the prior art more clearly, the drawings needed for the description of the specific embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Figure 1 is a flowchart of the encoding and decoding of image features in the prior art;
Figure 2 is a flowchart of the generation order of image features in the prior art;
Figure 3 is a flowchart of the multimodal model training method provided by an embodiment of the present application;
Figure 4 is a schematic diagram of the autoregressive strategy for images and text in the target multimodal model provided by an embodiment of the present application;
Figure 5 is a flowchart of the multimodal model training method provided by an embodiment of the present application;
Figure 6 is a schematic diagram of the feature generation order in the multimodal model training method provided by an embodiment of the present application;
Figure 7 is a flowchart of the multimodal model training method provided by an embodiment of the present application;
Figure 8 is a schematic diagram of the feature dependency frame in the multimodal model training method provided by an embodiment of the present application;
Figure 9 is a schematic diagram of the dependency between the feature to be generated P4 and the target generated features in the multimodal model training method provided by an embodiment of the present application;
Figure 10 is a schematic diagram of the dependency between the feature to be generated P57 and the target generated features in the multimodal model training method provided by an embodiment of the present application;
Figure 11 is a flowchart of the multimodal model training method provided by an embodiment of the present application;
Figure 12 is a schematic diagram of the distances between a feature to be generated and the corresponding target generated features in the multimodal model training method provided by an embodiment of the present application;
Figure 13 is a flowchart of the multimodal model training method provided by an embodiment of the present application;
Figure 14 is a flowchart of the image recognition method provided by an embodiment of the present application;
Figure 15 is a functional block diagram of the multimodal model training apparatus provided by an embodiment of the present application;
Figure 16 is a functional block diagram of the image recognition apparatus provided by an embodiment of the present application;
Figure 17 is a schematic diagram of the hardware structure of the electronic device provided by an embodiment of the present application.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments of the present application, all other embodiments obtained by those skilled in the art without creative effort fall within the scope of protection of the present application.
It should be noted that the multimodal model training method provided by the embodiments of the present application may be executed by a multimodal model training apparatus, which may be implemented as part or all of a computer device through software, hardware, or a combination thereof. The computer device may be a server or a terminal; the server in the embodiments of the present application may be a single server or a server cluster composed of multiple servers, and the terminal in the embodiments of the present application may be a smart phone, a personal computer, a tablet computer, a wearable device, an intelligent robot, or another intelligent hardware device. In the following method embodiments, the execution subject is an electronic device by way of example.
In one embodiment of the present application, as shown in Figure 3, a multimodal model training method is provided. Taking the application of the method to an electronic device as an example, it includes the following steps:
S11. Acquire a sample image and the text feature vector corresponding to the sample image.
In an optional implementation of the present application, the electronic device may receive, through its connections with other devices, a sample image and the corresponding text feature vector sent by those devices; the electronic device may also receive a sample image and the corresponding text feature vector input by a user.
In another optional implementation of the present application, the electronic device may also receive, through its connections with other devices, a sample image and the corresponding target text sent by those devices; the electronic device may also receive a sample image and the corresponding target text input by a user.
After receiving the sample image and the corresponding target text, the electronic device performs feature extraction on the target text to obtain the text feature vector corresponding to the target text; a sketch of this step follows.
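As an illustration of this text feature extraction step, the sketch below tokenizes the target text and looks up learned embeddings to form the text feature vector. This is a minimal sketch under assumptions of our own: the whitespace tokenizer, the toy vocabulary, and the embedding size are illustrative, since the patent does not specify how the text features are extracted.

```python
import torch
import torch.nn as nn

# Hypothetical text feature extractor: the patent only says the target text
# is feature-extracted; the whitespace tokenizer, vocabulary, and embedding
# size below are illustrative assumptions, not the patented design.
class TextFeatureExtractor(nn.Module):
    def __init__(self, vocab: dict, embed_dim: int = 512):
        super().__init__()
        self.vocab = vocab
        self.embedding = nn.Embedding(len(vocab), embed_dim)

    def forward(self, target_text: str) -> torch.Tensor:
        # Map each token to its vocabulary id (0 is reserved for <unk>).
        ids = [self.vocab.get(tok, 0) for tok in target_text.split()]
        return self.embedding(torch.tensor(ids))  # (seq_len, embed_dim)

vocab = {"<unk>": 0, "a": 1, "cat": 2, "on": 3, "the": 4, "mat": 5}
extractor = TextFeatureExtractor(vocab)
text_features = extractor("a cat on the mat")  # torch.Size([5, 512])
```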
The embodiments of the present application do not specifically limit the manner in which the electronic device acquires the sample image and the corresponding text feature vector.
S12. Input the sample image into the feature extraction network of the initial multimodal model to generate the image feature vector corresponding to the sample image.
The feature extraction network is used to encode the sample image and to generate the image feature vector according to the associations between features to be generated and already-generated features.
Specifically, the electronic device inputs the sample image into the feature extraction network of the initial multimodal model; the feature extraction network encodes the sample image and generates the image feature vector according to the associations between features to be generated and already-generated features.
This step is described in detail below.
S13. Input the text feature vector and the image feature vector into the transformer structure of the initial multimodal model, and output the candidate text corresponding to the sample image.
Specifically, the electronic device may input the text feature vector and the image feature vector into the transformer structure of the initial multimodal model; the transformer structure transforms the text feature vector and the image feature vector, thereby outputting the candidate text corresponding to the sample image.
For example, as shown in Figure 4, the text feature vector and the image feature vector are input into the transformer structure of the initial multimodal model, and the candidate text corresponding to the sample image is output; a sketch of this fusion follows.
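A minimal sketch of this step, assuming the image feature sequence and the text feature sequence are simply concatenated before a standard transformer encoder, with only the text positions predicting candidate-text logits. The layer sizes, head count, and vocabulary size are assumptions; the patent's actual transformer structure and the autoregressive masking of Figure 4 are not fully specified here.

```python
import torch
import torch.nn as nn

# Illustrative fusion of image and text features in a transformer; the
# concatenation order, layer sizes, and output head are assumptions.
class MultimodalTransformer(nn.Module):
    def __init__(self, dim: int = 512, vocab_size: int = 30000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.to_vocab = nn.Linear(dim, vocab_size)  # candidate-text logits

    def forward(self, image_feats: torch.Tensor, text_feats: torch.Tensor):
        # Image features first, then text features, as in an image-to-text
        # autoregressive setup (cf. Figure 4).
        fused = torch.cat([image_feats, text_feats], dim=1)
        hidden = self.encoder(fused)
        # Only the text positions predict candidate tokens.
        return self.to_vocab(hidden[:, image_feats.size(1):, :])

model = MultimodalTransformer()
logits = model(torch.randn(1, 64, 512), torch.randn(1, 5, 512))
print(logits.shape)  # torch.Size([1, 5, 30000])
```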
S14. Update the parameters of the initial multimodal model according to the target text corresponding to the text feature vector and the candidate text, so as to determine the target multimodal model.
Specifically, the electronic device may compute a loss value from the target text corresponding to the text feature vector and the candidate text, and then update the parameters of the initial multimodal model according to the computed loss value to determine the target multimodal model; one illustrative update step is sketched below.
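The patent only states that a loss value is computed from the target text and the candidate text and that the parameters are updated from it; the sketch below, reusing the MultimodalTransformer sketch above, assumes a token-level cross-entropy loss and an Adam optimizer.

```python
import torch
import torch.nn as nn

# One illustrative parameter update for S14; cross-entropy and Adam are
# assumptions, since the patent does not name the loss or the optimizer.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

target_ids = torch.randint(0, 30000, (1, 5))   # target text token ids
logits = model(torch.randn(1, 64, 512), torch.randn(1, 5, 512))

loss = criterion(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
optimizer.zero_grad()
loss.backward()   # backpropagate the computed loss value
optimizer.step()  # update the initial multimodal model's parameters
```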
The multimodal model training method provided by the embodiments of the present application acquires a sample image and its corresponding text feature vector, and inputs the sample image into the feature extraction network of the initial multimodal model to generate the corresponding image feature vector. The feature extraction network is used to encode the sample image and to generate the image feature vector according to the association between each feature to be generated and the already-generated features, which guarantees that these associations are taken into account when generating the image feature vector, instead of the features being concatenated directly without considering how they relate. The method therefore guarantees the accuracy of the generated image feature vector, so the local receptive field is not destroyed during image generation. The text feature vector and the image feature vector are then input into the transformer structure of the initial multimodal model, and the candidate text corresponding to the sample image is output, which guarantees the accuracy of the candidate text. The parameters of the initial multimodal model are updated according to the target text corresponding to the text feature vector and the candidate text to determine the target multimodal model, which guarantees the accuracy of the determined target multimodal model.
In an optional embodiment of the present application, as shown in Figure 5, "inputting the sample image into the feature extraction network of the initial multimodal model to generate the image feature vector corresponding to the sample image" in the above S12 may include the following steps:
S21. Acquire the feature generation order corresponding to the sample image.
Specifically, the electronic device may receive the feature generation order corresponding to the sample image input by a user.
Optionally, since features closer to the centre of the sample image depend more on the surrounding features, while features closer to the edge depend less on them, the feature generation order corresponding to the sample image may be a clockwise, outside-to-inside order. For example, as shown in Figure 6, P1 is the first feature to be generated; features are then generated clockwise, i.e., P1 first, then P2, and finally P8 and P9, after which generation continues clockwise inward, producing P10-P17, P18-P25, P26-P32, P33-P39, and so on.
Optionally, the feature generation order corresponding to the sample image may be a counterclockwise, outside-to-inside order, where the first feature may start from the feature corresponding to any one of the four corners of the sample image.
Based on the above, in the embodiments of the present application, when generating the image feature vector corresponding to the sample image, features are not generated row by row; instead, the surrounding features are generated first, and then all features are generated from the outside in according to the associations between each feature to be generated and the surrounding already-generated features, as illustrated by the sketch below.
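A small sketch of the clockwise, outside-to-inside order described above: it returns, for a feature grid with n_rows rows and n_cols columns, the grid coordinates in generation order. Starting at the top-left corner is an assumption; as noted above, the first feature may start from any of the four corners.

```python
def spiral_order(n_rows: int, n_cols: int) -> list[tuple[int, int]]:
    """Clockwise, outside-to-inside traversal of an n_rows x n_cols grid.

    order[k] is the (row, col) of the (k+1)-th generated feature P_{k+1}.
    """
    top, bottom, left, right = 0, n_rows - 1, 0, n_cols - 1
    order = []
    while top <= bottom and left <= right:
        for c in range(left, right + 1):              # top edge, left -> right
            order.append((top, c))
        for r in range(top + 1, bottom + 1):          # right edge, top -> bottom
            order.append((r, right))
        if top < bottom:
            for c in range(right - 1, left - 1, -1):  # bottom edge, right -> left
                order.append((bottom, c))
        if left < right:
            for r in range(bottom - 1, top, -1):      # left edge, bottom -> top
                order.append((r, left))
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order

# For a 3x3 grid: P1..P8 run clockwise around the border, P9 is the centre.
print(spiral_order(3, 3))
```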
S22. Generate the image feature vector according to the association between each feature to be generated and the already-generated features and the feature generation order.
Specifically, after acquiring the feature generation order, the electronic device generates the image feature vector according to the associations between each feature to be generated and the already-generated features and the feature generation order.
This step is described in detail below.
In the multimodal model training method provided by the embodiments of the present application, the feature generation order corresponding to the sample image is acquired. Since the feature generation order determines the order in which each feature of the image feature vector is generated, it determines the association between each feature to be generated and the already-generated features; acquiring the feature generation order corresponding to the sample image can therefore improve the accuracy of the generated image feature vector. The image feature vector is then generated according to those associations and the feature generation order, which guarantees the accuracy of the generated image feature vector.
In an optional embodiment of the present application, as shown in Figure 7, "generating the image feature vector according to the association between each feature to be generated and the already-generated features and the feature generation order" in the above S22 may include the following steps:
S31. Acquire the target already-generated features within a preset range of the feature to be generated.
In an optional implementation, the electronic device may determine the target generated features within a preset range of the feature to be generated according to an instruction from the user.
In an optional implementation, the electronic device may receive target generated features, within a preset range of the feature to be generated, input by a user.
In another optional implementation, the electronic device may generate an N*N feature dependency frame according to a user instruction, where N is an odd number greater than 3; the electronic device then determines the preset range from the feature dependency frame, thereby obtaining the target generated features within the preset range of the feature to be generated.
For example, when N=5, the feature dependency frame is as shown in Figure 8. The feature to be generated P_i is located at the centre of the feature dependency frame window and should depend on all surrounding target generated features as far as possible; a sketch of this window follows.
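The sketch below illustrates the N*N feature dependency frame: given the generation order produced by spiral_order above and the index of the feature to be generated, it collects the target already-generated features that fall inside the window centred on that feature. Reading the window test as a Chebyshev-distance check of at most N//2 is our interpretation of Figure 8, so treat it as an assumption.

```python
def target_generated_features(order, k, n=5):
    """Already-generated features within the n x n dependency frame.

    order : list of (row, col) from spiral_order(); order[k] is the feature
            to be generated, order[:k] are the already-generated features.
    Returns the (row, col) positions the k-th feature may depend on.
    """
    half = n // 2
    row, col = order[k]
    generated = set(order[:k])
    return [(r, c) for (r, c) in generated
            if abs(r - row) <= half and abs(c - col) <= half]

order = spiral_order(8, 8)
# P4 (index 3) with a 5x5 frame depends only on P2 at (0, 1) and P3 at
# (0, 2) here, matching the Figure 9 example: P1 at (0, 0) lies outside
# the window centred on P4 at (0, 3).
print(target_generated_features(order, 3, n=5))
```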
S32. Determine the dependency between each feature to be generated and the target generated features according to the positional relationship between the feature to be generated and the corresponding target generated features.
Specifically, the electronic device may determine the dependency between each feature to be generated and the target generated features according to the positional relationship between them.
This step is described in detail below.
S33. Generate the image feature vector in the feature generation order according to the dependency between each feature to be generated and the target generated features.
In an optional implementation, after obtaining the dependencies between each feature to be generated and the target generated features, the electronic device may fuse the target generated features on which a feature to be generated depends, thereby generating that feature. The electronic device then generates the image feature vector from the features to be generated in the feature generation order.
For example, as shown in Figure 9, when generating the feature to be generated P4, it is determined, from the 5*5 feature dependency frame mentioned above and the feature generation order shown in Figure 6, that P4 can depend only on the target generated features P2 and P3. The electronic device can therefore fuse P2 and P3 according to the dependency between P4 and them to generate P4, and then generate the image feature vector from P4 in the feature generation order.
For example, as shown in Figure 10, when generating the feature to be generated P57, the already-generated features it depends on include not only row features but also column features. The black area in Figure 10 marks all generated features; from the 5*5 feature dependency frame mentioned above and the feature generation order shown in Figure 6, the target generated features on which P57 depends are determined to be P1-P5, P39-P36, and P54-P56. The electronic device then generates the image feature vector from P57 in the feature generation order.
In the multimodal model training method provided by the embodiments of the present application, the target generated features within a preset range of the feature to be generated are acquired, which guarantees the accuracy of the acquired target generated features. The dependency between each feature to be generated and the target generated features is then determined according to the positional relationship between them, which guarantees the accuracy of the determined dependencies. The image feature vector is generated in the feature generation order according to those dependencies, which guarantees the accuracy of each feature in the generated image feature vector and hence the accuracy of the vector itself.
In an optional embodiment of the present application, as shown in Figure 11, "determining the dependency between each feature to be generated and the target generated features according to the positional relationship between them" in the above S32 may include the following steps:
S41. For each feature to be generated, determine the distance between the feature to be generated and the corresponding target generated features according to the positional relationship between them.
Specifically, for each feature to be generated, the electronic device may obtain the positions of the feature to be generated and the corresponding target generated features, determine the positional relationship between them from those positions, and then determine the distance between the feature to be generated and each corresponding target generated feature from that positional relationship.
For example, Figure 12 shows the distances between the feature to be generated P_ij and the corresponding target generated features.
S42. Determine the dependency weight of each target generated feature relative to the feature to be generated according to the distance between them.
Specifically, text usually has long-range context dependencies. Unlike text, a sample image depends more strongly on features within a certain spatial range and more weakly on remote ones; therefore, a dependency-weight intervention is added when computing the self-attention weights. To compute the dependency weight of the feature to be generated (at row i, column j) on each target generated feature, the formula is designed as follows:
[Formula (1), reproduced only as an image in the original: PCTCN2022122303-appb-000001]
where S_xy is the distance from the target generated feature at row x, column y to the feature to be generated P_ij; S_ij ∈ {distances from all target generated features to the feature to be generated}; i ∈ [0, N_w], j ∈ [0, N_h]; N_w is the number of rows of features in the sample image, and N_h is the number of columns of features in the sample image. A hedged sketch of such a weighting follows.
In the multimodal model training method provided by the embodiments of the present application, for each feature to be generated, the distance between the feature to be generated and the corresponding target generated features is determined according to the positional relationship between them, which guarantees the accuracy of the determined distances. The dependency weight of each target generated feature relative to the feature to be generated is then determined from those distances, which guarantees the accuracy of the determined weights and, in turn, the accuracy of the image feature vector determined from those dependency weights.
In an optional embodiment of the present application, as shown in Figure 13, "generating the image feature vector in the feature generation order according to the dependency between each feature to be generated and the target generated features" in the above S33 may include the following steps:
S51. Determine each feature to be generated according to the dependency weight of each target generated feature relative to the feature to be generated.
Specifically, after computing the dependency weight of each target generated feature relative to the feature to be generated, the electronic device may use the following formula to determine each feature to be generated T_ij:
T_ij = Σ W_xy · A(x, y)    (2)
where W_xy denotes the dependency weight of each target generated feature relative to the feature to be generated, and A(x, y) is the target generated feature, at row x and column y, corresponding to the feature to be generated; i ∈ [0, N_w], j ∈ [0, N_h], N_w is the number of rows of features in the sample image, and N_h is the number of columns of features in the sample image. A sketch of this fusion follows.
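Formula (2) itself is explicit, so the sketch below implements it directly: the feature to be generated T_ij is the weighted sum of its target generated feature vectors. The dictionary-based feature store and the 512-dimensional vectors are illustrative, and the weights w are reused from the previous sketch.

```python
import torch

def fuse_feature(weights, feature_map):
    """Formula (2): T_ij = sum over (x, y) of W_xy * A(x, y).

    weights     : {(x, y): W_xy} from the dependency-weight step.
    feature_map : {(x, y): tensor} holding already-generated feature vectors.
    """
    return sum(w_xy * feature_map[pos] for pos, w_xy in weights.items())

feature_map = {(0, 1): torch.randn(512), (0, 2): torch.randn(512)}
t_ij = fuse_feature(w, feature_map)  # the newly generated feature vector
print(t_ij.shape)                    # torch.Size([512])
```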
S52. Generate the image feature vector from each feature to be generated in the feature generation order.
Specifically, after generating the features to be generated, the electronic device generates the image feature vector from each feature to be generated in the feature generation order.
In the multimodal model training method provided by the embodiments of the present application, each feature to be generated is determined according to the dependency weight of each target generated feature relative to it, which guarantees the accuracy of the generated features to be generated; the image feature vector is then generated from those features in the feature generation order, which guarantees its accuracy.
To better introduce the target multimodal model trained with the above multimodal model training method, an embodiment of the present application provides an image recognition method. It should be noted that the image recognition method provided by the embodiments of the present application may be executed by an image recognition apparatus, which may be implemented as part or all of a computer device through software, hardware, or a combination thereof. The computer device may be a server or a terminal; the server in the embodiments of the present application may be a single server or a server cluster composed of multiple servers, and the terminal may be a smart phone, a personal computer, a tablet computer, a wearable device, an intelligent robot, or another intelligent hardware device. In the following method embodiments, the execution subject is an electronic device by way of example.
In one embodiment of the present application, as shown in Figure 14, an image recognition method is provided. Taking the application of the method to an electronic device as an example, it includes the following steps:
S61. Acquire a target image to be recognized.
Specifically, the electronic device may receive a target image sent by another device, or a target image input by a user.
S62. Input the target image into the target multimodal model, and output the text corresponding to the target image.
The target multimodal model is obtained with the multimodal model training method of any one of the above embodiments.
Specifically, the electronic device may train the target multimodal model on sample images and the corresponding target texts. The electronic device then inputs the target image into the target multimodal model; the feature extraction network in the target multimodal model encodes the target image and generates the image feature vector according to the associations between features to be generated and already-generated features.
The image feature vector of the target image is then input into the transformer structure in the target multimodal model, and the text corresponding to the target image is output.
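Putting S61 and S62 together under the assumptions of the earlier sketches, a minimal inference loop: encode the target image with the trained feature extraction network, then decode the text autoregressively through the transformer structure. Greedy decoding and the helper names (feature_extractor, model.embed_tokens, the bos/eos ids) are hypothetical, not part of the patented design.

```python
import torch

@torch.no_grad()
def recognize(image, feature_extractor, model, bos_id=1, eos_id=2, max_len=32):
    """Illustrative S61/S62 pipeline: image -> feature vector -> text ids.

    feature_extractor and model stand for the trained feature extraction
    network and transformer structure; greedy decoding is an assumption.
    """
    image_feats = feature_extractor(image)  # (1, n_feats, dim)
    ids = [bos_id]
    for _ in range(max_len):
        # model.embed_tokens is a hypothetical token-embedding helper.
        text_feats = model.embed_tokens(torch.tensor([ids]))
        logits = model(image_feats, text_feats)
        next_id = int(logits[0, -1].argmax())
        if next_id == eos_id:
            break
        ids.append(next_id)
    return ids  # token ids of the text corresponding to the target image
```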
The image recognition method provided by the embodiments of the present application acquires the target image to be recognized, inputs it into the target multimodal model, and outputs the text corresponding to the target image; the text corresponding to the target image can thus be generated from the target image, and the accuracy of the generated text is guaranteed.
It should be understood that although the steps in the flowcharts of Figures 3, 5, 7, 11, and 13-14 are shown in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on their execution, and they may be executed in other orders. Moreover, at least some of the steps in Figures 3, 5, 7, 11, and 13-14 may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be executed at different times; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
As shown in Figure 15, this embodiment provides a multimodal model training apparatus, the apparatus including:
a first acquisition module 71, configured to acquire a sample image and the text feature vector corresponding to the sample image;
a generation module 72, configured to input the sample image into the feature extraction network of an initial multimodal model to generate the image feature vector corresponding to the sample image, where the feature extraction network is used to encode the sample image and to generate the image feature vector according to the associations between features to be generated and already-generated features;
a first output module 73, configured to input the text feature vector and the image feature vector into the transformer structure of the initial multimodal model and output the candidate text corresponding to the sample image;
an update module 74, configured to update the parameters of the initial multimodal model according to the target text corresponding to the text feature vector and the candidate text, so as to determine the target multimodal model.
In one embodiment of the present application, the generation module 72 is specifically configured to acquire the feature generation order corresponding to the sample image, and to generate the image feature vector according to the association between each feature to be generated and the already-generated features and the feature generation order.
In one embodiment of the present application, the generation module 72 is specifically configured to acquire the target generated features within a preset range of the feature to be generated; determine the dependency between each feature to be generated and the target generated features according to the positional relationship between the feature to be generated and the corresponding target generated features; and generate the image feature vector in the feature generation order according to those dependencies.
In one embodiment of the present application, the generation module 72 is specifically configured to, for each feature to be generated, determine the distance between the feature to be generated and the corresponding target generated features according to the positional relationship between them, and determine the dependency weight of each target generated feature relative to the feature to be generated according to those distances.
In one embodiment of the present application, the generation module 72 is specifically configured to determine each feature to be generated according to the dependency weight of each target generated feature relative to the feature to be generated, and to generate the image feature vector from each feature to be generated in the feature generation order.
As shown in Figure 16, this embodiment provides an image recognition apparatus, the apparatus including:
a second acquisition module 81, configured to acquire a target image to be recognized;
a second output module 82, configured to input the target image into a target multimodal model and output the text corresponding to the target image, where the target multimodal model is obtained with the multimodal model training method of any one of the above embodiments.
For the specific limitations and beneficial effects of the multimodal model training apparatus and the image recognition apparatus, reference may be made to the limitations of the multimodal model training method and the image recognition method above, which are not repeated here. Each module in the above apparatuses may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of the processor of the electronic device in hardware form, or stored in the memory of the electronic device in software form, so that the processor can invoke and execute the operations corresponding to the above modules.
An embodiment of the present application further provides an electronic device having the multimodal model training apparatus shown in Figure 15 and the image recognition apparatus shown in Figure 16.
As shown in Figure 17, which is a schematic structural diagram of an electronic device provided by an optional embodiment of the present application, the electronic device may include: at least one processor 91, such as a CPU (Central Processing Unit); at least one communication interface 93; a memory 94; and at least one communication bus 92. The communication bus 92 is used to realize connection and communication among these components. The communication interface 93 may include a display and a keyboard, and optionally may also include a standard wired interface and a wireless interface. The memory 94 may be a high-speed RAM (Random Access Memory) or a non-volatile memory, such as at least one disk memory; optionally, the memory 94 may also be at least one storage device located remotely from the processor 91. The processor 91 may be combined with the apparatus described in Figure 15 or Figure 16; the memory 94 stores an application program, and the processor 91 invokes the program code stored in the memory 94 to perform any of the above method steps.
The communication bus 92 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The communication bus 92 may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is drawn in Figure 17, but this does not mean that there is only one bus or one type of bus.
The memory 94 may include a volatile memory, such as a random-access memory (RAM); it may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 94 may also include a combination of the above types of memory.
The processor 91 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
The processor 91 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
Optionally, the memory 94 is also used to store program instructions. The processor 91 may invoke the program instructions to implement the multimodal model training method shown in the embodiments of Figures 3, 5, 7, 11, and 13 of the present application and the image recognition method shown in the embodiment of Figure 14.
An embodiment of the present application further provides a non-volatile readable storage medium storing computer-executable instructions, which can execute the multimodal model training method and the image recognition method in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also include a combination of the above types of memory.
Although the embodiments of the present application have been described with reference to the drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the present application, and such modifications and variations all fall within the scope defined by the appended claims.

Claims (20)

  1. A multimodal model training method, characterized in that the method comprises:
    acquiring a sample image and a text feature vector corresponding to the sample image;
    inputting the sample image into a feature extraction network of an initial multimodal model to generate an image feature vector corresponding to the sample image, wherein the feature extraction network is used to encode the sample image and to generate the image feature vector according to associations between features to be generated and already-generated features;
    inputting the text feature vector and the image feature vector into a transformer structure of the initial multimodal model, and outputting candidate text corresponding to the sample image;
    updating parameters of the initial multimodal model according to target text corresponding to the text feature vector and the candidate text, so as to determine a target multimodal model.
  2. The method according to claim 1, characterized in that acquiring the sample image and the text feature vector corresponding to the sample image comprises:
    acquiring the sample image and the target text corresponding to the sample image;
    performing feature extraction on the target text corresponding to the sample image to acquire the corresponding text feature vector.
  3. The method according to claim 1, characterized in that inputting the sample image into the feature extraction network of the initial multimodal model to generate the image feature vector corresponding to the sample image comprises:
    acquiring a feature generation order corresponding to the sample image;
    generating the image feature vector according to the associations between each feature to be generated and the already-generated features and the feature generation order.
  4. The method according to claim 3, characterized in that generating the image feature vector according to the associations between each feature to be generated and the already-generated features and the feature generation order comprises:
    determining the already-generated features surrounding the feature to be generated;
    generating all of the image feature vectors from the outside in, according to the associations between each feature to be generated and the surrounding already-generated features and the feature generation order.
  5. The method according to claim 3, characterized in that generating the image feature vector according to the associations between each feature to be generated and the already-generated features and the feature generation order comprises:
    acquiring target already-generated features within a preset range of the feature to be generated;
    determining a dependency between each feature to be generated and the target generated features according to a positional relationship between the feature to be generated and the corresponding target generated features;
    generating the image feature vector in the feature generation order according to the dependency between each feature to be generated and the target generated features.
  6. The method according to claim 5, characterized in that, before acquiring the target already-generated features within the preset range of the feature to be generated, the method further comprises:
    acquiring a feature dependency frame;
    determining the preset range according to the feature dependency frame.
  7. The method according to claim 5, characterized in that determining the dependency between each feature to be generated and the target generated features according to the positional relationship between the feature to be generated and the corresponding target generated features comprises:
    for each feature to be generated, determining a distance between the feature to be generated and the corresponding target generated features according to the positional relationship between them;
    determining a dependency weight of each target generated feature relative to the feature to be generated according to the distance between the feature to be generated and the corresponding target generated features.
  8. The method according to claim 7, characterized in that determining the dependency weight of each target generated feature relative to the feature to be generated according to the distance between them comprises:
    determining the dependency weight of each target generated feature relative to the feature to be generated with a preset dependency-weight formula, according to the distance between the feature to be generated and the corresponding target generated features;
    wherein the preset dependency-weight formula is:
    [Formula, reproduced only as an image in the original: PCTCN2022122303-appb-100001]
    where S_xy is the distance from the target generated feature at row x, column y to the feature to be generated P_ij; P_ij is the feature to be generated at row i, column j; S_ij ∈ {distances from all target generated features to the feature to be generated}; i ∈ [0, N_w], j ∈ [0, N_h], x ∈ [0, N_w], y ∈ [0, N_h]; N_w is the number of rows of features in the sample image, and N_h is the number of columns of features in the sample image; W_xy is the dependency weight of the feature to be generated P_ij on the target generated feature at row x, column y.
  9. The method according to claim 7, characterized in that generating the image feature vector in the feature generation order according to the dependency between each feature to be generated and the target generated features comprises:
    determining each feature to be generated according to the dependency weight of each target generated feature relative to the feature to be generated;
    generating the image feature vector from each feature to be generated in the feature generation order.
  10. The method according to claim 9, characterized in that determining each feature to be generated according to the dependency weight of each target generated feature relative to the feature to be generated comprises:
    determining each feature to be generated with a preset feature generation formula, according to the dependency weight of each target generated feature relative to the feature to be generated;
    wherein the preset feature generation formula is:
    T_ij = Σ W_xy · A(x, y),
    where T_ij is the feature to be generated at row i, column j; W_xy denotes the dependency weight of each target generated feature relative to the feature to be generated T_ij; A(x, y) is the target generated feature, at row x and column y, corresponding to the feature to be generated T_ij; i ∈ [0, N_w], j ∈ [0, N_h], x ∈ [0, N_w], y ∈ [0, N_h]; N_w is the number of rows of features in the sample image, and N_h is the number of columns of features in the sample image.
  11. The method according to claim 5, characterized in that generating the image feature vector in the feature generation order according to the dependency between each feature to be generated and the target generated features comprises:
    for each feature to be generated, fusing the target generated features on which the feature to be generated depends to generate the corresponding feature to be generated;
    generating the image feature vector from each feature to be generated in the feature generation order.
  12. The method according to claim 1, characterized in that updating the parameters of the initial multimodal model according to the target text corresponding to the text feature vector and the candidate text to determine the target multimodal model comprises:
    computing a loss value according to the target text and the candidate text;
    updating the parameters of the initial multimodal model according to the computed loss value to determine the target multimodal model.
  13. The method according to any one of claims 3-12, characterized in that the feature generation order is a clockwise outside-to-inside order or a counterclockwise outside-to-inside order.
  14. The method according to any one of claims 3-12, characterized in that the feature generation order determines the associations between each feature to be generated and the already-generated features.
  15. The method according to any one of claims 1-12, characterized by further comprising:
    in the process of generating the image feature vector, the local receptive field is not destroyed.
  16. An image recognition method, characterized in that the method comprises:
    acquiring a target image to be recognized;
    inputting the target image into a target multimodal model, and outputting text corresponding to the target image; the target multimodal model is obtained with the multimodal model training method according to any one of claims 1-15.
  17. A multimodal model training apparatus, characterized in that the apparatus comprises:
    a first acquisition module, configured to acquire a sample image and a text feature vector corresponding to the sample image;
    a generation module, configured to input the sample image into a feature extraction network of an initial multimodal model to generate an image feature vector corresponding to the sample image, wherein the feature extraction network is used to encode the sample image and to generate the image feature vector according to associations between features to be generated and already-generated features;
    a first output module, configured to input the text feature vector and the image feature vector into a transformer structure of the initial multimodal model and output candidate text corresponding to the sample image;
    an update module, configured to update parameters of the initial multimodal model according to target text corresponding to the text feature vector and the candidate text, so as to determine a target multimodal model.
  18. An image recognition apparatus, characterized in that the apparatus comprises:
    a second acquisition module, configured to acquire a target image to be recognized;
    a second output module, configured to input the target image into a target multimodal model and output text corresponding to the target image; the target multimodal model is obtained with the multimodal model training method according to any one of claims 1-15.
  19. An electronic device, characterized by comprising a memory and a processor, wherein computer instructions are stored in the memory, and the processor executes the computer instructions to perform the multimodal model training method according to any one of claims 1-15 and the image recognition method according to claim 16.
  20. A non-volatile readable storage medium, characterized in that the non-volatile readable storage medium stores computer instructions, and the computer instructions are used to cause a computer to execute the multimodal model training method according to any one of claims 1-15 and the image recognition method according to claim 16.
PCT/CN2022/122303 2022-02-25 2022-09-28 Multimodal model training and image recognition method, apparatus, and electronic device WO2023159945A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210174577.9 2022-02-25
CN202210174577.9A CN114239760B (zh) 2022-02-25 2022-02-25 Multimodal model training and image recognition method, apparatus, and electronic device

Publications (1)

Publication Number Publication Date
WO2023159945A1

Family

ID=80748161

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/122303 WO2023159945A1 (zh) Multimodal model training and image recognition method, apparatus, and electronic device

Country Status (2)

Country Link
CN (1) CN114239760B (zh)
WO (1) WO2023159945A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114239760B (zh) 2022-02-25 2022-05-20 苏州浪潮智能科技有限公司 Multimodal model training and image recognition method, apparatus, and electronic device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464993A * 2020-11-05 2021-03-09 苏州浪潮智能科技有限公司 Multimodal model training method, apparatus, device, and storage medium
US20210303921A1 (en) * 2020-03-30 2021-09-30 Beijing Baidu Netcom Science And Technology Co., Ltd. Cross-modality processing method and apparatus, and computer storage medium
CN113920293A * 2021-10-18 2022-01-11 北京达佳互联信息技术有限公司 Information recognition method and apparatus, electronic device, and storage medium
CN114005012A * 2021-11-05 2022-02-01 北京市商汤科技开发有限公司 Training method, apparatus, and device for a multimodal pre-trained model, and storage medium
CN114239760A * 2022-02-25 2022-03-25 苏州浪潮智能科技有限公司 Multimodal model training and image recognition method, apparatus, and electronic device


Also Published As

Publication number Publication date
CN114239760A (zh) 2022-03-25
CN114239760B (zh) 2022-05-20

Similar Documents

Publication Publication Date Title
US20220383078A1 (en) Data processing method and related device
WO2022007823A1 (zh) Text data processing method and apparatus
US11030522B2 (en) Reducing the size of a neural network through reduction of the weight matrices
US20230229898A1 (en) Data processing method and related device
US11037031B2 (en) Image recognition method, electronic apparatus and readable storage medium
GB2571825A (en) Semantic class localization digital environment
US20240004703A1 (en) Method, apparatus, and system for multi-modal multi-task processing
EP4006909B1 (en) Method, apparatus and device for quality control and storage medium
WO2022001724A1 (zh) Data processing method and apparatus
WO2023005386A1 (zh) Model training method and apparatus
WO2020151175A1 (zh) Text generation method and apparatus, computer device, and storage medium
CN110598210B (zh) Entity recognition model training and entity recognition method, apparatus, device, and medium
EP4379603A1 (en) Model distillation method and related device
CN113095129A (zh) Pose estimation model training method, pose estimation method, apparatus, and electronic device
US20240152770A1 (en) Neural network search method and related device
WO2023159945A1 (zh) Multimodal model training and image recognition method, apparatus, and electronic device
CN116843901A (zh) Medical image segmentation model training method and medical image segmentation method
CN113408507B (zh) Named entity recognition method and apparatus based on resume files, and electronic device
WO2024046144A1 (zh) Video processing method and related device
WO2021082518A1 (zh) Machine translation method, machine translation model training method, apparatus, and storage medium
WO2023197910A1 (zh) User behavior prediction method and related device
CN111475635A (zh) Semantic completion method and apparatus, and electronic device
WO2023045949A1 (zh) Model training method and related device
WO2020237215A1 (en) Object discovery in images through categorizing object parts
WO2023236900A1 (zh) Item recommendation method and related device