WO2023138314A1 - Object attribute recognition method and apparatus, readable storage medium, and electronic device - Google Patents

Object attribute recognition method and apparatus, readable storage medium, and electronic device

Info

Publication number
WO2023138314A1
WO2023138314A1 PCT/CN2022/141994 CN2022141994W WO2023138314A1 WO 2023138314 A1 WO2023138314 A1 WO 2023138314A1 CN 2022141994 W CN2022141994 W CN 2022141994W WO 2023138314 A1 WO2023138314 A1 WO 2023138314A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
feature sequence
module
sequence
attribute
Prior art date
Application number
PCT/CN2022/141994
Other languages
English (en)
French (fr)
Inventor
毛晓飞
黄灿
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023138314A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Definitions

  • the present disclosure relates to the technical field of image processing, and in particular, to an object attribute recognition method, device, readable storage medium and electronic equipment.
  • Image structuring is a technology that extracts key target objects (such as vehicles, pedestrians, etc.) based on image content information; using processing means such as spatio-temporal segmentation, feature extraction and object recognition, it organizes image content, according to its semantic relationships, into structured information that both computers and humans can understand. Recognizing the attributes of the objects in an image is an important functional module of image structuring: it can predict the attribute labels of objects from the image, such as the age, gender and clothing style of a pedestrian, or the license plate number and age of a vehicle, and it can be used for intelligent applications that perceive the world through images. How to improve the accuracy and richness of image object attribute recognition has therefore become key to enhancing image understanding.
  • the present disclosure provides a method for identifying object attributes, including:
  • acquiring a target image, wherein the target image includes a target object and object description information of the target object;
  • extracting, from the target image, a key information feature sequence of the target object and a multimodal feature sequence corresponding to a target attribute of the target object, wherein the multimodal feature sequence includes a visual feature sequence and a semantic feature sequence of the target attribute;
  • determining multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence, wherein the multiple object attributes include the target attribute.
  • an object attribute recognition device including:
  • An acquisition module configured to acquire a target image, wherein the target image includes a target object and object description information of the target object;
  • a first extraction module configured to extract the key information feature sequence of the target object and the multimodal feature sequence corresponding to the target attribute of the target object from the target image obtained by the acquisition module, wherein the multimodal feature sequence includes a visual feature sequence and a semantic feature sequence of the target attribute;
  • a determining module configured to determine multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence extracted by the first extraction module, wherein the multiple object attributes include the target attribute.
  • the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the method provided in the first aspect of the present disclosure are implemented.
  • an electronic device including:
  • a storage device on which a computer program is stored;
  • a processing device configured to execute the computer program in the storage device to implement the steps of the method provided in the first aspect of the present disclosure.
  • the target image is obtained, and the target image contains the target object and the object description information of the target object; then, the key information feature sequence of the target object and the multimodal feature sequence corresponding to the target attribute of the target object are extracted from the target image, and the multimodal feature sequence includes the visual feature sequence and the semantic feature sequence of the target attribute; finally, multiple object attributes of the target object are determined according to the key information feature sequence and the multimodal feature sequence. In this way, when attribute recognition is performed on the target object in the target image, not only the key information features of the target object but also the visual and semantic features of the target attribute are referenced, so that the feature dimensions of the target object are richer and the information is more comprehensive, thereby improving the accuracy of object attribute recognition and the richness of the recognized object attributes.
  • Fig. 1 is a flowchart of a method for identifying object attributes according to an exemplary embodiment.
  • Fig. 2 is a schematic structural diagram of a multimodal feature extraction model according to an exemplary embodiment.
  • Fig. 3 is a schematic structural diagram of a multimodal fusion model according to an exemplary embodiment.
  • Fig. 4 is a flowchart of a method for identifying object attributes according to another exemplary embodiment.
  • Fig. 5 is a schematic structural diagram of an appearance feature extraction model according to an exemplary embodiment.
  • Fig. 6 is a flowchart of a method for identifying object attributes according to another exemplary embodiment.
  • Fig. 7 is a schematic structural diagram of a global visual feature extraction model according to an exemplary embodiment.
  • Fig. 8 is a block diagram of an object attribute recognition device according to an exemplary embodiment.
  • Fig. 9 is a block diagram of an electronic device according to an exemplary embodiment.
  • the term “comprise” and its variations are open-ended, i.e., “including but not limited to”.
  • the term “based on” is “based at least in part on”.
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one further embodiment”; the term “some embodiments” means “at least some embodiments.” Relevant definitions of other terms will be given in the description below.
  • Fig. 1 is a flowchart of a method for identifying object attributes according to an exemplary embodiment. As shown in FIG. 1 , the method may include the following S101-S103.
  • in S101, a target image is acquired.
  • the target image includes a target object (specifically, an image of the target object) and object description information of the target object, wherein the target object can be, for example, a vehicle, a pedestrian, a bookcase, a television, etc., and the object description information is text used to describe the target object.
  • in S102, the key information feature sequence of the target object and the multimodal feature sequence corresponding to the target attribute of the target object are extracted from the target image.
  • the multimodal feature sequence includes a visual feature sequence and a semantic feature sequence of the target attribute, wherein the target attribute can be any attribute of the target object that the user focuses on, for example, the target object is a person, and the target attribute is age.
  • in S103, multiple object attributes of the target object are determined according to the key information feature sequence and the multimodal feature sequence.
  • the plurality of object attributes include the above-mentioned target attribute.
  • for example, if the target object is a pedestrian, the multiple object attributes may include age, height, gender, clothing style, hairstyle, and the like.
  • as another example, if the target object is an article (e.g., a bookcase, a vehicle), the multiple object attributes may include category, brand, name, basic parameters, capacity/volume, usage range, and the like.
  • the target image is obtained, and the target image contains the target object and the object description information of the target object; then, the key information feature sequence of the target object and the multimodal feature sequence corresponding to the target attribute of the target object are extracted from the target image, and the multimodal feature sequence includes the visual feature sequence and the semantic feature sequence of the target attribute; finally, multiple object attributes of the target object are determined according to the key information feature sequence and the multimodal feature sequence. In this way, the attribute recognition references both the key information features of the target object and the visual and semantic features of the target attribute, which improves the accuracy of object attribute recognition and the richness of the recognized object attributes.
  • the recognition text may be a multilingual text or a monolingual text, which is not specifically limited in the present disclosure.
  • the recognized text can be obtained by inputting the target image into a pre-trained text recognition model, where the text recognition model can be, for example, a convolutional recurrent neural network, an attention-based encoder-decoder network, or the like.
  • the multilingual language model is used to extract key object features from the recognized text corresponding to the target image.
  • the multilingual language model may be composed of multiple (for example, 12) encoding networks connected in series in sequence and multiple (for example, 6) decoding networks connected in series in sequence, wherein the last encoding network in the plurality of encoding networks connected in series is connected in series with the first decoding network in the decoding networks connected in series in sequence.
  • the embodiment of the present disclosure does not limit the above encoding network; any existing or future encoding network (for example, the Encoder module in a transformer model or in a conformer model) can be used.
  • the embodiment of the present disclosure does not limit the above decoding network; any existing or future decoding network (for example, the Decoder module in a transformer model or in a conformer model) can be used.
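To make the series connection concrete, the following is a minimal PyTorch sketch of such a 12-encoder/6-decoder stack. It is an illustration rather than the disclosed model: the vocabulary size, hidden width, head count, and the 128-dimensional output projection are assumed values.

```python
import torch
import torch.nn as nn

class MultilingualLanguageModelSketch(nn.Module):
    """Sketch: 12 serially connected encoder layers followed by 6 serially
    connected decoder layers; the last encoder feeds the first decoder."""
    def __init__(self, vocab_size=32000, d_model=256, nhead=8, out_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=12)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers=6)
        self.proj = nn.Linear(d_model, out_dim)  # key information feature sequence

    def forward(self, token_ids):
        x = self.embed(token_ids)        # (batch, seq_len, d_model)
        memory = self.encoder(x)         # output of the 12th encoder
        y = self.decoder(x, memory)      # consumed by the 6-decoder stack
        return self.proj(y)              # (batch, seq_len, out_dim)

# Usage: token ids of the recognized text -> key information feature sequence.
tokens = torch.randint(0, 32000, (1, 24))
features = MultilingualLanguageModelSketch()(tokens)   # shape (1, 24, 128)
```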
  • the target image can be input into a pre-trained multimodal feature extraction model to obtain a multimodal feature sequence corresponding to the target attribute of the target object.
  • the above multimodal feature extraction model may include: a first target detection module, a first preprocessing module, a first fully connected module, a text recognition module, a multilingual language sub-model, a splicing module, a first encoding module and a second fully connected module.
  • the first target detection module is used to extract the first region where the mark of the target attribute of the target object is located from the target image, wherein the mark can be, for example, a license plate number, a brand logo, etc.;
  • the first preprocessing module is connected to the first target detection module, and is used to normalize the first region to an image of a first preset size (for example, 32*32) and straighten the normalized image into a one-dimensional row vector of a first preset length (for example, 1024);
  • the first fully connected module is connected to the first preprocessing module, and is used to generate the visual feature sequence of the target attribute from the one-dimensional row vector of the first preset length;
  • the text recognition module is connected with the first target detection module for text recognition of the first region to obtain the attribute description text of the target attribute, such as brand words;
  • the multilingual language sub-model is connected with the text recognition module for extracting the semantic feature sequence of the target attribute from the attribute description text;
  • the splicing module is connected to the first fully connected module and the multilingual language sub-model respectively, and is used to splice the visual feature sequence and the semantic feature sequence of the target attribute to obtain a spliced sequence; the first encoding module is connected to the splicing module and encodes the spliced sequence to obtain a first encoding sequence; the second fully connected module is connected to the first encoding module and performs dimensionality reduction on the first encoding sequence to obtain the multimodal feature sequence of the preset dimension corresponding to the target attribute.
  • the first target detection module may be, for example, a YOLO (You Only Look Once) network, a Single Shot MultiBox Detector (SSD), or the like.
  • the above text recognition module may be, for example, a convolutional recurrent neural network, an attention-based encoder-decoder network, or the like.
  • the structure of the multilingual language sub-model may be the same as that of the above-mentioned multilingual language model.
  • for example, the first fully connected module may include 2 fully connected layers connected in series, the second fully connected module may include 2 fully connected layers, and the first encoding module may include 4 encoding networks connected in series, where the embodiment of the present disclosure does not limit these 4 serially connected encoding networks, and any existing or future encoding network (for example, the Encoder module in a transformer model or in a conformer model) can be used.
  • in addition, the visual feature sequence and the semantic feature sequence of the target attribute each have a length equal to the preset dimension, so the dimension of the concatenated sequence is twice the preset dimension; for example, the preset dimension is 128.
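The data flow of Fig. 2 can be sketched as follows, assuming the first target detection module (e.g., YOLO/SSD) and the text recognition path have already produced the normalized 32*32 crop and the 128-dimensional semantic feature sequence. How the two sequences are aligned before splicing is not specified, so the sketch pools the semantic sequence to a single vector; the stated sizes (1024-length row vector, two fully connected layers, 4 encoder layers, 128-dimensional output) are kept, everything else is illustrative.

```python
import torch
import torch.nn as nn

class MultimodalFeatureExtractorSketch(nn.Module):
    """Sketch of the splice -> encode -> reduce path of Fig. 2. The detection
    and text recognition stages are assumed to have run already; their
    outputs (the normalized crop and the semantic sequence) are the inputs."""
    def __init__(self, dim=128):
        super().__init__()
        # first fully connected module: flattened 32*32 crop (length 1024) -> dim
        self.visual_fc = nn.Sequential(nn.Linear(32 * 32, 512), nn.ReLU(),
                                       nn.Linear(512, dim))
        # first encoding module: 4 encoder layers over the spliced 2*dim features
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(2 * dim, nhead=8, batch_first=True), 4)
        # second fully connected module: reduce back to the preset dimension
        self.reduce = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                    nn.Linear(dim, dim))

    def forward(self, crop_32x32, semantic_seq):
        visual = self.visual_fc(crop_32x32.flatten(1))    # (batch, dim)
        semantic = semantic_seq.mean(dim=1)               # (batch, dim), pooled
        spliced = torch.cat([visual, semantic], dim=-1)   # (batch, 2*dim)
        encoded = self.encoder(spliced.unsqueeze(1))      # (batch, 1, 2*dim)
        return self.reduce(encoded).squeeze(1)            # (batch, dim)

extractor = MultimodalFeatureExtractorSketch()
multimodal_seq = extractor(torch.rand(1, 32, 32), torch.rand(1, 24, 128))  # (1, 128)
```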
  • key information feature sequences and multimodal feature sequences can be input into a pre-trained multimodal fusion model to obtain multiple object attributes of the target object.
  • the above multimodal fusion model may include a second encoding module and a plurality of first decoding modules in one-to-one correspondence with the attribute categories of the multiple object attributes (Fig. 3 shows N decoding modules as an example, N greater than 1), each first decoding module being connected to the second encoding module.
  • the second encoding module is used to encode the first feature matrix formed by the key information feature sequence and the multimodal feature sequence to obtain a second encoding sequence, wherein the dimensions of the key information feature sequence and the multimodal feature sequence are preset dimensions;
  • the first decoding module is configured to generate object attributes under the corresponding attribute category according to the second encoding sequence, wherein the attribute categories of each object attribute are different, that is, the number of the first decoding modules is equal to the number of the above-mentioned multiple object attributes.
  • as shown in Fig. 3, the first decoding module 1 corresponds to the attribute category of object attribute a1 and is used to generate object attribute a1, the first decoding module 2 corresponds to the attribute category of object attribute a2 and is used to generate object attribute a2, ..., and the first decoding module N corresponds to the attribute category of object attribute aN and is used to generate object attribute aN.
  • for example, the second encoding module may include 12 encoding networks connected in series, and each first decoding module may include one decoding network.
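A minimal sketch of this encoder-plus-per-category-decoders layout follows. The learned per-category query and the classification heads are assumptions added so the sketch produces concrete predictions; the disclosure itself only specifies one decoding module per attribute category.

```python
import torch
import torch.nn as nn

class MultimodalFusionModelSketch(nn.Module):
    """Sketch: a shared 12-layer encoder over the stacked feature sequences,
    plus one single-layer decoder head per attribute category (N heads)."""
    def __init__(self, dim=128, classes_per_attribute=(10, 5, 2)):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=12)
        # one learned query and one decoding network per attribute category
        self.queries = nn.Parameter(torch.randn(len(classes_per_attribute), dim))
        self.decoders = nn.ModuleList(
            nn.TransformerDecoder(
                nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), 1)
            for _ in classes_per_attribute)
        self.heads = nn.ModuleList(nn.Linear(dim, c) for c in classes_per_attribute)

    def forward(self, feature_matrix):
        # feature_matrix: (batch, num_sequences, dim); rows are the key
        # information, multimodal, appearance and global visual sequences
        memory = self.encoder(feature_matrix)              # second encoding sequence
        logits = []
        for query, decoder, head in zip(self.queries, self.decoders, self.heads):
            tgt = query.expand(feature_matrix.size(0), 1, -1)  # (batch, 1, dim)
            logits.append(head(decoder(tgt, memory)).squeeze(1))
        return logits                                      # one prediction per category

fusion = MultimodalFusionModelSketch()
predictions = fusion(torch.rand(2, 4, 128))                # list of (2, c) logits
```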
  • the above method may further include the following S104.
  • in S104, an appearance feature sequence of the target object is extracted from the target image.
  • the above S103 may determine multiple object attributes of the target object according to the key information feature sequence, multimodal feature sequence, and appearance feature sequence.
  • the key information feature sequence, multimodal feature sequence, and appearance feature sequence can be input into a pre-trained multimodal fusion model to obtain multiple object attributes of the target object, wherein the second encoding module in the above multimodal fusion model is used to encode the feature matrix composed of the key information feature sequence, multimodal feature sequence, and appearance feature sequence.
  • the target image can be input into a pre-trained appearance feature extraction model to obtain a sequence of appearance features of the target object.
  • the above-mentioned appearance feature extraction model may include a second object detection module, a second preprocessing module, a third encoding module, and a third fully connected module connected in sequence.
  • the second target detection module is used to extract the second region where the appearance of the target object is located from the target image, wherein the appearance may include the exterior of the target object, the packaging of the target object, etc.;
  • the second preprocessing module is used to normalize the second area to an image of a second preset size (for example, 16*16), and straighten the image obtained after normalization into a one-dimensional row vector of a second preset length (for example, 256);
  • the third encoding module is used to encode the one-dimensional row vector of the second preset length to obtain a third encoding sequence; the third fully connected module is used to perform dimensionality reduction on the third encoding sequence to obtain the appearance feature sequence of the target object with the preset dimension.
  • the second target detection module may be, for example, a YOLO network, an SSD, or the like.
  • the third encoding module may include 2 serially connected encoding networks, where the embodiment of the present disclosure does not limit these 2 serially connected encoding networks, and any existing or future encoding network (for example, the Encoder module in a transformer model or in a conformer model) may be used; the third fully connected module may include 2 serially connected fully connected layers.
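For symmetry, here is a matching sketch of the appearance branch under the sizes quoted above; the detector that produces the 16*16 crop is again assumed to have run already, and the 128-dimensional output width is illustrative.

```python
import torch
import torch.nn as nn

class AppearanceFeatureExtractorSketch(nn.Module):
    """Sketch of the appearance branch of Fig. 5: the flattened 16*16 crop
    (length-256 row vector) passes through 2 encoder layers and two fully
    connected layers down to the preset dimension."""
    def __init__(self, dim=128):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(16 * 16, nhead=4, batch_first=True), 2)
        self.reduce = nn.Sequential(nn.Linear(16 * 16, dim), nn.ReLU(),
                                    nn.Linear(dim, dim))

    def forward(self, crop_16x16):                        # (batch, 16, 16)
        vec = crop_16x16.flatten(1).unsqueeze(1)          # (batch, 1, 256)
        return self.reduce(self.encoder(vec)).squeeze(1)  # (batch, dim)

appearance_seq = AppearanceFeatureExtractorSketch()(torch.rand(1, 16, 16))  # (1, 128)
```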
  • the global visual feature sequence of the target image can also be referred to when performing object attribute recognition.
  • the above method may further include the following S105.
  • in S105, a global visual feature sequence of the target image is extracted from the target image.
  • the above S103 may determine multiple object attributes of the target object according to the key information feature sequence, multimodal feature sequence, appearance feature sequence, and global visual feature sequence.
  • the key information feature sequence, multimodal feature sequence, appearance feature sequence, and global visual feature sequence can be input into a pre-trained multimodal fusion model to obtain multiple object attributes of the target object.
  • the second encoding module in the above multimodal fusion model is used to encode the feature matrix composed of the key information feature sequence, multimodal feature sequence, appearance feature sequence, and global visual feature sequence.
  • the lengths of the key information feature sequence, the multimodal feature sequence, the appearance feature sequence and the global visual feature sequence are all preset lengths.
  • the target image can be input into a pre-trained global visual feature extraction model to obtain the global visual feature sequence of the target image.
  • the above global visual feature extraction model may include a third preprocessing module, a fourth fully connected module, a fourth encoding module and a second decoding module connected in sequence.
  • the third preprocessing module is used to resize the target image to a third preset size (for example, 256*256), divide the resized target image into a plurality of image blocks according to a fourth preset size (for example, 16*16), then straighten each image block into a one-dimensional feature vector of a third preset length (for example, 256), and form the one-dimensional feature vectors of the third preset length into a second feature matrix; the fourth fully connected module is used to generate the original feature sequence corresponding to the target image according to the second feature matrix;
  • the fourth encoding module is used to encode the original feature sequence to obtain a fourth encoding sequence;
  • the second decoding module is used to decode the fourth encoding sequence to obtain the global visual feature sequence of the target image.
  • for example, the fourth fully connected module may include 2 fully connected layers connected in series, the fourth encoding module may include 6 encoding networks connected in series, and the second decoding module may include one decoding network.
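This pipeline reads like a small vision transformer and can be sketched under the quoted sizes as follows; the single learned decoder query and the 128-dimensional width are assumptions, not values fixed by the disclosure.

```python
import torch
import torch.nn as nn

class GlobalVisualFeatureExtractorSketch(nn.Module):
    """Sketch of the pipeline of Fig. 7: a 256*256 image is cut into 16*16
    blocks (a 16x16 grid, so 256 blocks each flattened to length 256),
    projected by two fully connected layers, encoded by 6 layers and decoded
    by one layer against a learned query."""
    def __init__(self, dim=128):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(256, 256), nn.ReLU(),
                                nn.Linear(256, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), 6)
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # assumed decoder input
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), 1)

    def forward(self, image):                               # (batch, 256, 256), resized
        b = image.size(0)
        blocks = image.unfold(1, 16, 16).unfold(2, 16, 16)  # (b, 16, 16, 16, 16)
        blocks = blocks.reshape(b, 256, 256)                # second feature matrix
        encoded = self.encoder(self.fc(blocks))             # fourth encoding sequence
        return self.decoder(self.query.expand(b, -1, -1), encoded)  # (b, 1, dim)

global_seq = GlobalVisualFeatureExtractorSketch()(torch.rand(1, 256, 256))
```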
  • Fig. 8 is a block diagram of an object attribute recognition device according to an exemplary embodiment. As shown in Figure 8, the device 800 includes:
  • An acquisition module 801 configured to acquire a target image, wherein the target image includes a target object and object description information of the target object;
  • the first extraction module 802 is configured to extract the key information feature sequence of the target object and the multimodal feature sequence corresponding to the target attribute of the target object from the target image acquired by the acquisition module 801, wherein the multimodal feature sequence includes a visual feature sequence and a semantic feature sequence of the target attribute;
  • the determination module 803 is configured to determine multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence extracted by the first extraction module 802, wherein the multiple object attributes include the target attribute.
  • the target image is obtained, and the target image contains the target object and the object description information of the target object; then, the key information feature sequence of the target object and the multimodal feature sequence corresponding to the target attribute of the target object are extracted from the target image, and the multimodal feature sequence includes the visual feature sequence and the semantic feature sequence of the target attribute; finally, multiple object attributes of the target object are determined according to the key information feature sequence and the multimodal feature sequence. In this way, the attribute recognition references both the key information features of the target object and the visual and semantic features of the target attribute, which improves the accuracy of object attribute recognition and the richness of the recognized object attributes.
  • the first extraction module 802 is configured to input the target image into a pre-trained multimodal feature extraction model to obtain a multimodal feature sequence corresponding to the target attribute of the target object.
  • the multimodal feature extraction model includes:
  • a first target detection module configured to extract from the target image a first region where the target attribute identifier of the target object is located
  • a first preprocessing module connected to the first target detection module, for normalizing the first region into an image of a first preset size, and straightening the normalized image into a one-dimensional row vector of a first preset length;
  • a first fully connected module, connected to the first preprocessing module, configured to generate the visual feature sequence of the target attribute according to the one-dimensional row vector of the first preset length;
  • a text recognition module connected to the first target detection module, for performing text recognition on the first region to obtain the attribute description text of the target attribute;
  • a multilingual language sub-model connected to the text recognition module, for extracting the semantic feature sequence of the target attribute from the attribute description text;
  • the splicing module is respectively connected with the first fully connected module and the multilingual language sub-model, and is used to splice the visual feature sequence and the semantic feature sequence of the target attribute to obtain a spliced sequence;
  • a first coding module connected to the splicing module, for coding the spliced sequence to obtain a first coding sequence
  • the second fully connected module is connected to the first encoding module, and is configured to perform dimensionality reduction processing on the first encoding sequence to obtain a multimodal feature sequence of preset dimensions corresponding to the target attribute.
  • the first extraction module 802 includes:
  • a recognition submodule configured to perform text recognition on the target image to obtain a recognized text, wherein the recognized text is a multilingual text or a monolingual text;
  • the input sub-module is used to input the recognized text into the pre-trained multilingual language model to obtain the key information feature sequence of the target object.
  • the determining module 803 is configured to input the key information feature sequence and the multimodal feature sequence into a pre-trained multimodal fusion model to obtain multiple object attributes of the target object;
  • the multimodal fusion model includes:
  • the second encoding module is configured to encode the first feature matrix formed by the key information feature sequence and the multimodal feature sequence to obtain a second encoding sequence, wherein the dimensions of the key information feature sequence and the multimodal feature sequence are preset dimensions;
  • a plurality of first decoding modules corresponding to the attribute categories of the plurality of object attributes are respectively connected to the second encoding module, and are used to generate object attributes under the corresponding attribute categories according to the second encoding sequence, wherein the attribute categories of each of the object attributes are different.
  • the device 800 also includes:
  • the second extraction module is used to extract the appearance feature sequence of the target object from the target image
  • the determining module 803 is configured to determine multiple object attributes of the target object according to the key information feature sequence, the multimodal feature sequence, and the appearance feature sequence.
  • the second extraction module is configured to input the target image into a pre-trained appearance feature extraction model to obtain an appearance feature sequence of the target object
  • the appearance feature extraction model includes a second target detection module, a second preprocessing module, a third encoding module, and a third fully connected module connected in sequence;
  • the second target detection module is configured to extract the second area where the appearance of the target object is located from the target image
  • the second preprocessing module is configured to normalize the second region into an image of a second preset size, and straighten the normalized image into a one-dimensional row vector of a second preset length;
  • the third encoding module is configured to encode the one-dimensional row vector of the second preset length to obtain a third encoding sequence
  • the third fully connected module is configured to perform dimensionality reduction processing on the third coding sequence to obtain the appearance feature sequence of the target object with the preset dimension.
  • the device 800 also includes:
  • the third extraction module is used to extract the global visual feature sequence of the target image from the target image
  • the determining module 803 is configured to determine multiple object attributes of the target object according to the key information feature sequence, the multimodal feature sequence, the appearance feature sequence, and the global visual feature sequence.
  • the third extraction module is configured to input the target image into a pre-trained global visual feature extraction model to obtain a global visual feature sequence of the target image, and the global visual feature extraction model includes a third preprocessing module, a fourth fully connected module, a fourth encoding module, and a second decoding module connected in sequence;
  • the third preprocessing module is configured to adjust the target image to a third preset size, divide the resized target image into a plurality of image blocks according to a fourth preset size, and then straighten each of the image blocks into a one-dimensional feature vector of a third preset length, and form each one-dimensional feature vector of the third preset length into a second feature matrix;
  • the fourth fully connected module is configured to generate an original feature sequence corresponding to the target image according to the second feature matrix
  • the fourth encoding module is configured to encode the original feature sequence to obtain a fourth encoding sequence
  • the second decoding module is configured to decode the fourth coding sequence to obtain the global visual feature sequence of the target image.
  • the present disclosure also provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the above-mentioned method for identifying object attributes provided by the present disclosure are implemented.
  • FIG. 9 shows a schematic structural diagram of an electronic device (terminal device or server) 600 suitable for implementing an embodiment of the present disclosure.
  • the terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (Personal Digital Assistants), PADs (Tablet Computers), PMPs (Portable Multimedia Players), vehicle-mounted terminals (such as vehicle-mounted navigation terminals), etc., and fixed terminals such as digital TVs, desktop computers, etc.
  • the electronic device shown in FIG. 9 is only an example, and should not limit the functions and application scope of the embodiments of the present disclosure.
  • an electronic device 600 may include a processing device (such as a central processing unit, a graphics processing unit, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic device 600 are also stored.
  • the processing device 601, ROM 602, and RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to the bus 604.
  • the following devices can be connected to the I/O interface 605: an input device 606 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 608 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 609.
  • the communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While FIG. 9 shows electronic device 600 having various means, it is to be understood that implementing or having all of the means shown is not a requirement. More or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602.
  • when the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two.
  • a computer readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer readable storage media may include, but are not limited to, electrical connections having one or more wires, portable computer diskettes, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and the server can communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with any form or medium of digital data communication (for example, a communication network).
  • Examples of communication networks include local area networks ("LANs"), wide area networks ("WANs"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
  • the computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: acquire a target image, wherein the target image contains a target object and object description information of the target object; extract from the target image a key information feature sequence of the target object and a multimodal feature sequence corresponding to the target attribute of the target object, wherein the multimodal feature sequence includes a visual feature sequence and a semantic feature sequence of the target attribute; and determine, according to the key information feature sequence and the multimodal feature sequence, a plurality of object attributes of the target object, wherein the plurality of object attributes includes the target attribute.
  • Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in the flowchart or block diagram may represent a module, program segment, or portion of code that includes one or more executable instructions for implementing specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • the modules involved in the embodiments described in the present disclosure may be implemented by software or by hardware. Wherein, the name of the module does not constitute a limitation of the module itself under certain circumstances, for example, the obtaining module may also be described as "a module for obtaining the target image".
  • the functions described herein above may be performed, at least in part, by one or more hardware logic components; for example, without limitation, exemplary types of hardware logic components that can be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • a machine-readable storage medium would include one or more wire-based electrical connections, a portable computer disk, a hard disk, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • Example 1 provides an object attribute recognition method, including: acquiring a target image, wherein the target image contains a target object and object description information of the target object; extracting a key information feature sequence of the target object and a multimodal feature sequence corresponding to a target attribute of the target object from the target image, wherein the multimodal feature sequence includes a visual feature sequence and a semantic feature sequence of the target attribute; and determining multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence, wherein the multiple object attributes include the target attribute.
  • Example 2 provides the method of Example 1, the extracting the multimodal feature sequence corresponding to the target attribute of the target object from the target image includes:
  • the target image is input into a pre-trained multimodal feature extraction model to obtain a multimodal feature sequence corresponding to the target attribute of the target object.
  • Example 3 provides the method of Example 2, wherein the multimodal feature extraction model includes: a first target detection module, configured to extract from the target image a first region where the identifier of the target attribute of the target object is located; a first preprocessing module, connected to the first target detection module, configured to normalize the first region into an image of a first preset size and straighten the normalized image into a one-dimensional row vector of a first preset length; a first fully connected module, connected to the first preprocessing module, configured to generate the visual feature sequence of the target attribute according to the one-dimensional row vector of the first preset length; a text recognition module, connected to the first target detection module, configured to perform text recognition on the first region to obtain the attribute description text of the target attribute; a multilingual language sub-model, connected to the text recognition module, configured to extract the semantic feature sequence of the target attribute from the attribute description text; a splicing module, connected to the first fully connected module and the multilingual language sub-model respectively, configured to splice the visual feature sequence and the semantic feature sequence of the target attribute to obtain a spliced sequence; a first encoding module, connected to the splicing module, configured to encode the spliced sequence to obtain a first encoding sequence; and a second fully connected module, connected to the first encoding module, configured to perform dimensionality reduction on the first encoding sequence to obtain the multimodal feature sequence of the preset dimension corresponding to the target attribute.
  • Example 4 provides the method of Example 1, wherein the extracting the key information feature sequence of the target object from the target image includes: performing text recognition on the target image to obtain a recognized text, wherein the recognized text is a multilingual text or a monolingual text; and inputting the recognized text into a pre-trained multilingual language model to obtain the key information feature sequence of the target object.
  • Example 6 provides the method described in any one of Examples 1-5, the method further comprising: extracting an appearance feature sequence of the target object from the target image; determining a plurality of object attributes of the target object according to the key information feature sequence and the multimodal feature sequence includes: determining a plurality of object attributes of the target object according to the key information feature sequence, the multimodal feature sequence, and the appearance feature sequence.
  • Example 7 provides the method of Example 6, wherein the extracting the appearance feature sequence of the target object from the target image includes: inputting the target image into a pre-trained appearance feature extraction model to obtain the appearance feature sequence of the target object, wherein the appearance feature extraction model includes a second target detection module, a second preprocessing module, a third encoding module and a third fully connected module connected in sequence; the second target detection module is configured to extract from the target image a second region where the appearance of the target object is located; the second preprocessing module is configured to normalize the second region into an image of a second preset size and straighten the normalized image into a one-dimensional row vector of a second preset length; the third encoding module is configured to encode the one-dimensional row vector of the second preset length to obtain a third encoding sequence; and the third fully connected module is configured to perform dimensionality reduction on the third encoding sequence to obtain the appearance feature sequence of the target object with the preset dimension.
  • Example 8 provides the method of Example 6, the method further comprising: extracting a global visual feature sequence of the target image from the target image; determining multiple object attributes of the target object according to the key information feature sequence, the multimodal feature sequence, and the appearance feature sequence, including: determining multiple object attributes of the target object according to the key information feature sequence, the multimodal feature sequence, the appearance feature sequence, and the global visual feature sequence.
  • Example 9 provides the method of Example 8, wherein the extracting the global visual feature sequence of the target image includes: inputting the target image into a pre-trained global visual feature extraction model to obtain the global visual feature sequence of the target image, wherein the global visual feature extraction model includes a third preprocessing module, a fourth fully connected module, a fourth encoding module, and a second decoding module connected in sequence; the third preprocessing module is configured to resize the target image to a third preset size, divide the resized target image into a plurality of image blocks according to a fourth preset size, then straighten each of the image blocks into a one-dimensional feature vector of a third preset length, and form the one-dimensional feature vectors of the third preset length into a second feature matrix; the fourth fully connected module is configured to generate an original feature sequence corresponding to the target image according to the second feature matrix; the fourth encoding module is configured to encode the original feature sequence to obtain a fourth encoding sequence; and the second decoding module is configured to decode the fourth encoding sequence to obtain the global visual feature sequence of the target image.
  • Example 10 provides an object attribute recognition device, including: an acquisition module, configured to acquire a target image, wherein the target image includes a target object and object description information of the target object; a first extraction module, configured to extract the key information feature sequence of the target object and a multimodal feature sequence corresponding to the target attribute of the target object from the target image acquired by the acquisition module, wherein the multimodal feature sequence includes a visual feature sequence and a semantic feature sequence of the target attribute; and a determination module, configured to determine multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence extracted by the first extraction module, wherein the multiple object attributes include the target attribute.
  • Example 11 provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of any one of the methods described in Examples 1-9 are implemented.
  • Example 12 provides an electronic device, including: a storage device, on which a computer program is stored; a processing device, configured to execute the computer program in the storage device, so as to implement the steps of any one of the methods in Examples 1-9.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to an object attribute recognition method and apparatus, a readable storage medium, and an electronic device. The method includes: acquiring a target image, where the target image contains a target object and object description information of the target object; extracting, from the target image, a key information feature sequence of the target object and a multimodal feature sequence corresponding to a target attribute of the target object, where the multimodal feature sequence includes a visual feature sequence and a semantic feature sequence of the target attribute; and determining multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence. In this way, when attribute recognition is performed on the target object in the target image, not only the key information features of the target object but also the visual and semantic features of the target attribute are referenced, so that the feature dimensions of the target object are richer and the information is more comprehensive, thereby improving the accuracy of object attribute recognition and the richness of the recognized object attributes.

Description

Object attribute recognition method and apparatus, readable storage medium, and electronic device
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Patent Application No. 202210074401.6, filed on January 21, 2022 and entitled "Object attribute recognition method and apparatus, readable storage medium, and electronic device", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the technical field of image processing, and in particular to an object attribute recognition method and apparatus, a readable storage medium, and an electronic device.
BACKGROUND
In recent years, with the rapid development of information technology, image structuring has become a standard component of image understanding. Image structuring is a technology that extracts key target objects (for example, vehicles, pedestrians, etc.) based on image content information; using processing means such as spatio-temporal segmentation, feature extraction and object recognition, it organizes image content, according to its semantic relationships, into structured information that both computers and humans can understand. Recognizing the attributes of the objects in an image is an important functional module of image structuring: it can predict the attribute labels of objects from the image, such as the age, gender and clothing style of a pedestrian, or the license plate number and age of a vehicle, and it can be used for intelligent applications that perceive the world through images. How to improve the accuracy and richness of image object attribute recognition has therefore become key to enhancing image understanding.
SUMMARY
This Summary is provided to introduce concepts in a brief form; these concepts will be described in detail in the Detailed Description below. This Summary is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution.
In a first aspect, the present disclosure provides an object attribute recognition method, including:
acquiring a target image, wherein the target image contains a target object and object description information of the target object;
extracting, from the target image, a key information feature sequence of the target object and a multimodal feature sequence corresponding to a target attribute of the target object, wherein the multimodal feature sequence includes a visual feature sequence and a semantic feature sequence of the target attribute;
determining multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence, wherein the multiple object attributes include the target attribute.
In a second aspect, the present disclosure provides an object attribute recognition apparatus, including:
an acquisition module, configured to acquire a target image, wherein the target image contains a target object and object description information of the target object;
a first extraction module, configured to extract, from the target image acquired by the acquisition module, a key information feature sequence of the target object and a multimodal feature sequence corresponding to a target attribute of the target object, wherein the multimodal feature sequence includes a visual feature sequence and a semantic feature sequence of the target attribute;
a determination module, configured to determine multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence extracted by the first extraction module, wherein the multiple object attributes include the target attribute.
In a third aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the method provided in the first aspect of the present disclosure are implemented.
In a fourth aspect, the present disclosure provides an electronic device, including:
a storage device on which a computer program is stored;
a processing device, configured to execute the computer program in the storage device to implement the steps of the method provided in the first aspect of the present disclosure.
In the above technical solution, first, a target image is acquired, and the target image contains a target object and object description information of the target object; then, a key information feature sequence of the target object and a multimodal feature sequence corresponding to a target attribute of the target object are extracted from the target image, the multimodal feature sequence including a visual feature sequence and a semantic feature sequence of the target attribute; finally, multiple object attributes of the target object are determined according to the key information feature sequence and the multimodal feature sequence. In this way, when attribute recognition is performed on the target object in the target image, not only the key information features of the target object but also the visual and semantic features of the target attribute are referenced, so that the feature dimensions of the target object are richer and the information is more comprehensive, thereby improving the accuracy of object attribute recognition and the richness of the recognized object attributes.
Other features and advantages of the present disclosure will be described in detail in the Detailed Description below.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale. In the drawings:
Fig. 1 is a flowchart of an object attribute recognition method according to an exemplary embodiment.
Fig. 2 is a schematic structural diagram of a multimodal feature extraction model according to an exemplary embodiment.
Fig. 3 is a schematic structural diagram of a multimodal fusion model according to an exemplary embodiment.
Fig. 4 is a flowchart of an object attribute recognition method according to another exemplary embodiment.
Fig. 5 is a schematic structural diagram of an appearance feature extraction model according to an exemplary embodiment.
Fig. 6 is a flowchart of an object attribute recognition method according to another exemplary embodiment.
Fig. 7 is a schematic structural diagram of a global visual feature extraction model according to an exemplary embodiment.
Fig. 8 is a block diagram of an object attribute recognition apparatus according to an exemplary embodiment.
Fig. 9 is a block diagram of an electronic device according to an exemplary embodiment.
DETAILED DESCRIPTION
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as being limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the protection scope of the present disclosure.
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. Furthermore, the method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
As used herein, the term "include" and its variations are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules or units, and are not used to limit the order or interdependence of the functions performed by these devices, modules or units.
It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only, and are not used to limit the scope of these messages or information.
Fig. 1 is a flowchart of an object attribute recognition method according to an exemplary embodiment. As shown in Fig. 1, the method may include the following S101 to S103.
In S101, a target image is acquired.
The target image contains a target object (specifically, an image of the target object) and object description information of the target object, where the target object can be, for example, a vehicle, a pedestrian, a bookcase, a television, etc., and the object description information is text used to describe the target object.
In S102, a key information feature sequence of the target object and a multimodal feature sequence corresponding to a target attribute of the target object are extracted from the target image.
In the present disclosure, the multimodal feature sequence includes a visual feature sequence and a semantic feature sequence of the target attribute, where the target attribute can be any attribute of the target object that the user focuses on; for example, the target object is a person and the target attribute is age.
In S103, multiple object attributes of the target object are determined according to the key information feature sequence and the multimodal feature sequence.
In the present disclosure, the multiple object attributes include the above target attribute.
For example, if the target object is a pedestrian, the multiple object attributes may include age, height, gender, clothing style, hairstyle, and the like.
As another example, if the target object is an article (for example, a bookcase, a vehicle), the multiple object attributes may include category, brand, name, basic parameters, capacity/volume, usage range, and the like.
In the above technical solution, first, a target image is acquired, and the target image contains a target object and object description information of the target object; then, a key information feature sequence of the target object and a multimodal feature sequence corresponding to a target attribute of the target object are extracted from the target image, the multimodal feature sequence including a visual feature sequence and a semantic feature sequence of the target attribute; finally, multiple object attributes of the target object are determined according to the key information feature sequence and the multimodal feature sequence. In this way, the attribute recognition references both the key information features of the target object and the visual and semantic features of the target attribute, so that the feature dimensions of the target object are richer and the information is more comprehensive, thereby improving the accuracy of object attribute recognition and the richness of the recognized object attributes.
A specific implementation of extracting the key information feature sequence of the target object from the target image in S102 is described in detail below. Specifically, it can be implemented through the following steps (1) and (2):
(1) Perform text recognition on the target image to obtain a recognized text.
In the present disclosure, the recognized text may be a multilingual text or a monolingual text, which is not specifically limited in the present disclosure.
In addition, the recognized text can be obtained by inputting the target image into a pre-trained text recognition model, where the text recognition model can be, for example, a convolutional recurrent neural network, an attention-based encoder-decoder network, or the like.
(2) Input the recognized text into a pre-trained multilingual language model to obtain the key information feature sequence of the target object.
In the present disclosure, the multilingual language model is used to extract key object features from the recognized text corresponding to the target image. For example, the multilingual language model may be composed of multiple (for example, 12) serially connected encoding networks and multiple (for example, 6) serially connected decoding networks, where the last encoding network among the serially connected encoding networks is connected in series with the first decoding network among the serially connected decoding networks.
The embodiment of the present disclosure does not limit the above encoding network; any existing or future encoding network (for example, the Encoder module in a transformer model or in a conformer model) can be used.
The embodiment of the present disclosure does not limit the above decoding network; any existing or future decoding network (for example, the Decoder module in a transformer model or in a conformer model) can be used.
A specific implementation of extracting the multimodal feature sequence corresponding to the target attribute of the target object from the target image in S102 is described in detail below. Specifically, the target image can be input into a pre-trained multimodal feature extraction model to obtain the multimodal feature sequence corresponding to the target attribute of the target object.
As shown in Fig. 2, the above multimodal feature extraction model may include: a first target detection module, a first preprocessing module, a first fully connected module, a text recognition module, a multilingual language sub-model, a splicing module, a first encoding module, and a second fully connected module.
The first target detection module is used to extract, from the target image, a first region where the identifier of the target attribute of the target object is located, where the identifier can be, for example, a license plate number, a brand logo, etc.; the first preprocessing module is connected to the first target detection module, and is used to normalize the first region to an image of a first preset size (for example, 32*32) and straighten the normalized image into a one-dimensional row vector of a first preset length (for example, 1024); the first fully connected module is connected to the first preprocessing module, and is used to generate the visual feature sequence of the target attribute from the one-dimensional row vector of the first preset length; the text recognition module is connected to the first target detection module, and is used to perform text recognition on the first region to obtain the attribute description text of the target attribute, for example, a brand word; the multilingual language sub-model is connected to the text recognition module, and is used to extract the semantic feature sequence of the target attribute from the attribute description text; the splicing module is connected to the first fully connected module and the multilingual language sub-model respectively, and is used to splice the visual feature sequence and the semantic feature sequence of the target attribute to obtain a spliced sequence; the first encoding module is connected to the splicing module, and is used to encode the spliced sequence to obtain a first encoding sequence; the second fully connected module is connected to the first encoding module, and is used to perform dimensionality reduction on the first encoding sequence to obtain the multimodal feature sequence of the preset dimension corresponding to the target attribute.
In the present disclosure, the first target detection module can be, for example, a YOLO (You Only Look Once) network, a Single Shot MultiBox Detector (SSD), or the like. The above text recognition module can be, for example, a convolutional recurrent neural network, an attention-based encoder-decoder network, or the like. The structure of the multilingual language sub-model may be the same as that of the above multilingual language model.
For example, the first fully connected module may include 2 serially connected fully connected layers, the second fully connected module may include 2 fully connected layers, and the first encoding module may include 4 serially connected encoding networks, where the embodiment of the present disclosure does not limit these 4 serially connected encoding networks, and any existing or future encoding network (for example, the Encoder module in a transformer model or in a conformer model) can be used.
In addition, the visual feature sequence and the semantic feature sequence of the target attribute each have a length equal to the preset dimension, and the dimension of the spliced sequence is twice the preset dimension. For example, the preset dimension is 128.
A specific implementation of determining the multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence in S103 is described in detail below. Specifically, the key information feature sequence and the multimodal feature sequence can be input into a pre-trained multimodal fusion model to obtain the multiple object attributes of the target object.
As shown in Fig. 3, the above multimodal fusion model may include a second encoding module and a plurality of first decoding modules in one-to-one correspondence with the attribute categories of the multiple object attributes (Fig. 3 shows N decoding modules as an example, N greater than 1), each first decoding module being connected to the second encoding module.
The second encoding module is used to encode a first feature matrix formed by the key information feature sequence and the multimodal feature sequence to obtain a second encoding sequence, where the dimensions of the key information feature sequence and the multimodal feature sequence are both the preset dimension.
The first decoding module is used to generate, according to the second encoding sequence, the object attribute under the corresponding attribute category, where the attribute category of each object attribute is different; that is, the number of first decoding modules is equal to the number of the above multiple object attributes.
As shown in Fig. 3, the first decoding module 1 corresponds to the attribute category of object attribute a1 and is used to generate object attribute a1, the first decoding module 2 corresponds to the attribute category of object attribute a2 and is used to generate object attribute a2, ..., and the first decoding module N corresponds to the attribute category of object attribute aN and is used to generate object attribute aN.
For example, the second encoding module may include 12 serially connected encoding networks, and each first decoding module may include one decoding network.
In order to further improve the accuracy of object attribute recognition and the richness of the object attributes, in addition to the above key information feature sequence and multimodal feature sequence, the appearance features of the target object may also be referenced when performing object attribute recognition. Specifically, as shown in Fig. 4, the above method may further include the following S104.
In S104, an appearance feature sequence of the target object is extracted from the target image.
In this case, the above S103 may determine the multiple object attributes of the target object according to the key information feature sequence, the multimodal feature sequence, and the appearance feature sequence. Specifically, the key information feature sequence, the multimodal feature sequence, and the appearance feature sequence can be input into the pre-trained multimodal fusion model to obtain the multiple object attributes of the target object, where the second encoding module in the above multimodal fusion model is used to encode the feature matrix formed by the key information feature sequence, the multimodal feature sequence, and the appearance feature sequence.
A specific implementation of extracting the appearance feature sequence of the target object from the target image in S104 is described in detail below. Specifically, the target image can be input into a pre-trained appearance feature extraction model to obtain the appearance feature sequence of the target object.
As shown in Fig. 5, the above appearance feature extraction model may include a second target detection module, a second preprocessing module, a third encoding module, and a third fully connected module connected in sequence.
The second target detection module is used to extract, from the target image, a second region where the appearance of the target object is located, where the appearance may include the exterior of the target object, the packaging of the target object, etc.; the second preprocessing module is used to normalize the second region to an image of a second preset size (for example, 16*16) and straighten the normalized image into a one-dimensional row vector of a second preset length (for example, 256); the third encoding module is used to encode the one-dimensional row vector of the second preset length to obtain a third encoding sequence; the third fully connected module is used to perform dimensionality reduction on the third encoding sequence to obtain the appearance feature sequence of the target object with the preset dimension.
In the present disclosure, the second target detection module can be, for example, a YOLO network, an SSD, or the like.
For example, the third encoding module may include 2 serially connected encoding networks, where the embodiment of the present disclosure does not limit these 2 serially connected encoding networks, and any existing or future encoding network (for example, the Encoder module in a transformer model or in a conformer model) can be used; the third fully connected module may include 2 serially connected fully connected layers.
In order to further improve the accuracy of object attribute recognition and the richness of the object attributes, in addition to the above key information feature sequence, multimodal feature sequence, and appearance features of the target object, the global visual feature sequence of the target image may also be referenced when performing object attribute recognition. Specifically, as shown in Fig. 6, the above method may further include the following S105.
In S105, a global visual feature sequence of the target image is extracted from the target image.
In this case, the above S103 may determine the multiple object attributes of the target object according to the key information feature sequence, the multimodal feature sequence, the appearance feature sequence, and the global visual feature sequence. Specifically, the key information feature sequence, the multimodal feature sequence, the appearance feature sequence, and the global visual feature sequence can be input into the pre-trained multimodal fusion model to obtain the multiple object attributes of the target object, where the second encoding module in the above multimodal fusion model is used to encode the feature matrix formed by the key information feature sequence, the multimodal feature sequence, the appearance feature sequence, and the global visual feature sequence.
In addition, the lengths of the key information feature sequence, the multimodal feature sequence, the appearance feature sequence, and the global visual feature sequence are all the preset length.
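As a pure shape check under the illustrative 128-dimensional setting used in the sketches earlier in this document, the four sequences, each treated as a single vector of the preset dimension (an assumption, since the disclosure does not fix how each sequence is laid out), stack row-wise into the feature matrix consumed by the second encoding module:

```python
import torch

# Four feature sequences of the preset dimension (128 here, illustrative):
key_info   = torch.rand(1, 128)   # from the multilingual language model
multimodal = torch.rand(1, 128)   # from the multimodal feature extraction model
appearance = torch.rand(1, 128)   # from the appearance feature extraction model
global_vis = torch.rand(1, 128)   # from the global visual feature extraction model

# Stack row-wise into the feature matrix encoded by the second encoding module.
feature_matrix = torch.stack([key_info, multimodal, appearance, global_vis], dim=1)
print(feature_matrix.shape)       # torch.Size([1, 4, 128])
```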
A specific implementation of extracting the global visual feature sequence of the target image in S105 is described in detail below. Specifically, the target image can be input into a pre-trained global visual feature extraction model to obtain the global visual feature sequence of the target image.
As shown in Fig. 7, the above global visual feature extraction model may include a third preprocessing module, a fourth fully connected module, a fourth encoding module, and a second decoding module connected in sequence.
The third preprocessing module is used to resize the target image to a third preset size (for example, 256*256), divide the resized target image into a plurality of image blocks according to a fourth preset size (for example, 16*16), then straighten each image block into a one-dimensional feature vector of a third preset length (for example, 256), and form the one-dimensional feature vectors of the third preset length into a second feature matrix; the fourth fully connected module is used to generate an original feature sequence corresponding to the target image according to the second feature matrix; the fourth encoding module is used to encode the original feature sequence to obtain a fourth encoding sequence; the second decoding module is used to decode the fourth encoding sequence to obtain the global visual feature sequence of the target image.
For example, the fourth fully connected module may include 2 serially connected fully connected layers, the fourth encoding module may include 6 serially connected encoding networks, and the second decoding module may include one decoding network.
Fig. 8 is a block diagram of an object attribute recognition apparatus according to an exemplary embodiment. As shown in Fig. 8, the apparatus 800 includes:
an acquisition module 801, configured to acquire a target image, wherein the target image contains a target object and object description information of the target object;
a first extraction module 802, configured to extract, from the target image acquired by the acquisition module 801, a key information feature sequence of the target object and a multimodal feature sequence corresponding to a target attribute of the target object, wherein the multimodal feature sequence includes a visual feature sequence and a semantic feature sequence of the target attribute;
a determination module 803, configured to determine multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence extracted by the first extraction module 802, wherein the multiple object attributes include the target attribute.
In the above technical solution, first, a target image is acquired, and the target image contains a target object and object description information of the target object; then, a key information feature sequence of the target object and a multimodal feature sequence corresponding to a target attribute of the target object are extracted from the target image, the multimodal feature sequence including a visual feature sequence and a semantic feature sequence of the target attribute; finally, multiple object attributes of the target object are determined according to the key information feature sequence and the multimodal feature sequence. In this way, the attribute recognition references both the key information features of the target object and the visual and semantic features of the target attribute, so that the feature dimensions of the target object are richer and the information is more comprehensive, thereby improving the accuracy of object attribute recognition and the richness of the recognized object attributes.
Optionally, the first extraction module 802 is configured to input the target image into a pre-trained multimodal feature extraction model to obtain the multimodal feature sequence corresponding to the target attribute of the target object.
Optionally, the multimodal feature extraction model includes:
a first target detection module, configured to extract, from the target image, a first region where the identifier of the target attribute of the target object is located;
a first preprocessing module, connected to the first target detection module, configured to normalize the first region to an image of a first preset size and straighten the normalized image into a one-dimensional row vector of a first preset length;
a first fully connected module, connected to the first preprocessing module, configured to generate the visual feature sequence of the target attribute according to the one-dimensional row vector of the first preset length;
a text recognition module, connected to the first target detection module, configured to perform text recognition on the first region to obtain the attribute description text of the target attribute;
a multilingual language sub-model, connected to the text recognition module, configured to extract the semantic feature sequence of the target attribute from the attribute description text;
a splicing module, connected to the first fully connected module and the multilingual language sub-model respectively, configured to splice the visual feature sequence and the semantic feature sequence of the target attribute to obtain a spliced sequence;
a first encoding module, connected to the splicing module, configured to encode the spliced sequence to obtain a first encoding sequence;
a second fully connected module, connected to the first encoding module, configured to perform dimensionality reduction on the first encoding sequence to obtain the multimodal feature sequence of the preset dimension corresponding to the target attribute.
Optionally, the first extraction module 802 includes:
a recognition sub-module, configured to perform text recognition on the target image to obtain a recognized text, wherein the recognized text is a multilingual text or a monolingual text;
an input sub-module, configured to input the recognized text into a pre-trained multilingual language model to obtain the key information feature sequence of the target object.
Optionally, the determination module 803 is configured to input the key information feature sequence and the multimodal feature sequence into a pre-trained multimodal fusion model to obtain the multiple object attributes of the target object;
wherein the multimodal fusion model includes:
a second encoding module, configured to encode a first feature matrix formed by the key information feature sequence and the multimodal feature sequence to obtain a second encoding sequence, wherein the dimensions of the key information feature sequence and the multimodal feature sequence are both the preset dimension;
a plurality of first decoding modules in one-to-one correspondence with the attribute categories of the multiple object attributes, each connected to the second encoding module and configured to generate, according to the second encoding sequence, the object attribute under the corresponding attribute category, wherein the attribute category of each object attribute is different.
Optionally, the apparatus 800 further includes:
a second extraction module, configured to extract an appearance feature sequence of the target object from the target image;
the determination module 803 is configured to determine the multiple object attributes of the target object according to the key information feature sequence, the multimodal feature sequence, and the appearance feature sequence.
Optionally, the second extraction module is configured to input the target image into a pre-trained appearance feature extraction model to obtain the appearance feature sequence of the target object, the appearance feature extraction model including a second target detection module, a second preprocessing module, a third encoding module, and a third fully connected module connected in sequence;
wherein the second target detection module is configured to extract, from the target image, a second region where the appearance of the target object is located;
the second preprocessing module is configured to normalize the second region to an image of a second preset size and straighten the normalized image into a one-dimensional row vector of a second preset length;
the third encoding module is configured to encode the one-dimensional row vector of the second preset length to obtain a third encoding sequence;
the third fully connected module is configured to perform dimensionality reduction on the third encoding sequence to obtain the appearance feature sequence of the target object with the preset dimension.
Optionally, the apparatus 800 further includes:
a third extraction module, configured to extract a global visual feature sequence of the target image from the target image;
the determination module 803 is configured to determine the multiple object attributes of the target object according to the key information feature sequence, the multimodal feature sequence, the appearance feature sequence, and the global visual feature sequence.
Optionally, the third extraction module is configured to input the target image into a pre-trained global visual feature extraction model to obtain the global visual feature sequence of the target image, the global visual feature extraction model including a third preprocessing module, a fourth fully connected module, a fourth encoding module, and a second decoding module connected in sequence;
wherein the third preprocessing module is configured to resize the target image to a third preset size, divide the resized target image into a plurality of image blocks according to a fourth preset size, then straighten each of the image blocks into a one-dimensional feature vector of a third preset length, and form the one-dimensional feature vectors of the third preset length into a second feature matrix;
the fourth fully connected module is configured to generate an original feature sequence corresponding to the target image according to the second feature matrix;
the fourth encoding module is configured to encode the original feature sequence to obtain a fourth encoding sequence;
the second decoding module is configured to decode the fourth encoding sequence to obtain the global visual feature sequence of the target image.
The present disclosure also provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the above object attribute recognition method provided by the present disclosure are implemented.
Referring now to Fig. 9, it shows a schematic structural diagram of an electronic device (terminal device or server) 600 suitable for implementing an embodiment of the present disclosure. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in Fig. 9 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in Fig. 9, the electronic device 600 may include a processing device (for example, a central processing unit, a graphics processing unit, etc.) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. Various programs and data necessary for the operation of the electronic device 600 are also stored in the RAM 603. The processing device 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices can be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 608 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. Although Fig. 9 shows the electronic device 600 having various devices, it should be understood that it is not required to implement or have all of the devices shown; more or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 609, or installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the above computer-readable medium of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted using any appropriate medium, including but not limited to an electric wire, an optical cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
在一些实施方式中,客户端、服务器可以利用诸如HTTP(HyperText Transfer Protocol,超文本传输协议)之类的任何当前已知或未来研发的网络协议进行通信,并且可以与任意形式或介质的数字数据通信(例如,通信网络)互连。通信网络的示例包括局域网(“LAN”),广域网(“WAN”),网际网(例如,互联网)以及端对端网络(例如,ad hoc端对端网络),以及任何当前已知或未来研发的网络。
上述计算机可读介质可以是上述电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。
上述计算机可读介质承载有一个或者多个程序,当上述一个或者多个程序被该电子设备执行时,使得该电子设备:获取目标图像,其中,所述目标图像中包含目标对象和所述目标对象的对象描述信息;从所述目标图像中提取所述目标对象的关键信息特征序列和所述目标对象的目标属性对应的多模态特征序列,其中,所述多模态特征序列包括所述目标属性的视觉特征序列和语义特征序列;根据所述关键信息特征序列和所述多模态特征序列,确定所述目标对象的多个对象属性,其中,所述多个对象属性包括所述目标属性。
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, the above programming languages including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagram may represent a module, a program segment, or a portion of code, where the module, program segment, or portion of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The modules involved in the embodiments described in the present disclosure may be implemented by software or by hardware. The name of a module does not, in some cases, constitute a limitation on the module itself; for example, the acquisition module may also be described as "a module for acquiring a target image".

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of the machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
According to one or more embodiments of the present disclosure, Example 1 provides an object attribute recognition method, including: acquiring a target image, wherein the target image includes a target object and object description information of the target object; extracting, from the target image, a key information feature sequence of the target object and a multimodal feature sequence corresponding to a target attribute of the target object, wherein the multimodal feature sequence includes a visual feature sequence and a semantic feature sequence of the target attribute; and determining multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence, wherein the multiple object attributes include the target attribute.

According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, wherein the extracting, from the target image, the multimodal feature sequence corresponding to the target attribute of the target object includes:

inputting the target image into a pre-trained multimodal feature extraction model to obtain the multimodal feature sequence corresponding to the target attribute of the target object.

According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 2, wherein the multimodal feature extraction model includes: a first object detection module, configured to extract, from the target image, a first region where an identifier of the target attribute of the target object is located; a first preprocessing module, connected to the first object detection module and configured to normalize the first region into an image of a first preset size and flatten the normalized image into a one-dimensional row vector of a first preset length; a first fully connected module, connected to the first preprocessing module and configured to generate the visual feature sequence of the target attribute according to the one-dimensional row vector of the first preset length; a text recognition module, connected to the first object detection module and configured to perform text recognition on the first region to obtain an attribute description text of the target attribute; a multilingual language sub-model, connected to the text recognition module and configured to extract the semantic feature sequence of the target attribute from the attribute description text; a concatenation module, connected to the first fully connected module and the multilingual language sub-model respectively, and configured to concatenate the visual feature sequence and the semantic feature sequence of the target attribute to obtain a concatenated sequence; a first encoding module, connected to the concatenation module and configured to encode the concatenated sequence to obtain a first encoded sequence; and a second fully connected module, connected to the first encoding module and configured to perform dimensionality reduction on the first encoded sequence to obtain the multimodal feature sequence of a preset dimension corresponding to the target attribute.

According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 1, wherein the extracting the key information feature sequence of the target object from the target image includes: performing text recognition on the target image to obtain recognized text, wherein the recognized text is multilingual text or monolingual text; and inputting the recognized text into a pre-trained multilingual language model to obtain the key information feature sequence of the target object.

According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 1, wherein the determining multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence includes: inputting the key information feature sequence and the multimodal feature sequence into a pre-trained multimodal fusion model to obtain the multiple object attributes of the target object; wherein the multimodal fusion model includes: a second encoding module, configured to encode a first feature matrix formed by the key information feature sequence and the multimodal feature sequence to obtain a second encoded sequence, wherein the dimensions of the key information feature sequence and the multimodal feature sequence are both a preset dimension; and multiple first decoding modules in one-to-one correspondence with the attribute categories of the multiple object attributes, each connected to the second encoding module and configured to generate, according to the second encoded sequence, an object attribute under the corresponding attribute category, wherein the attribute category of each object attribute is different.

According to one or more embodiments of the present disclosure, Example 6 provides the method of any one of Examples 1-5, the method further including: extracting an appearance feature sequence of the target object from the target image; wherein the determining multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence includes: determining the multiple object attributes of the target object according to the key information feature sequence, the multimodal feature sequence, and the appearance feature sequence.

According to one or more embodiments of the present disclosure, Example 7 provides the method of Example 6, wherein the extracting the appearance feature sequence of the target object from the target image includes: inputting the target image into a pre-trained appearance feature extraction model to obtain the appearance feature sequence of the target object, the appearance feature extraction model including a second object detection module, a second preprocessing module, a third encoding module, and a third fully connected module connected in sequence; wherein the second object detection module is configured to extract, from the target image, a second region where the appearance of the target object is located; the second preprocessing module is configured to normalize the second region into an image of a second preset size and flatten the normalized image into a one-dimensional row vector of a second preset length; the third encoding module is configured to encode the one-dimensional row vector of the second preset length to obtain a third encoded sequence; and the third fully connected module is configured to perform dimensionality reduction on the third encoded sequence to obtain the appearance feature sequence of a preset dimension of the target object.

According to one or more embodiments of the present disclosure, Example 8 provides the method of Example 6, the method further including: extracting a global visual feature sequence of the target image from the target image; wherein the determining the multiple object attributes of the target object according to the key information feature sequence, the multimodal feature sequence, and the appearance feature sequence includes: determining the multiple object attributes of the target object according to the key information feature sequence, the multimodal feature sequence, the appearance feature sequence, and the global visual feature sequence.

According to one or more embodiments of the present disclosure, Example 9 provides the method of Example 8, wherein the extracting the global visual feature sequence of the target image from the target image includes: inputting the target image into a pre-trained global visual feature extraction model to obtain the global visual feature sequence of the target image, the global visual feature extraction model including a third preprocessing module, a fourth fully connected module, a fourth encoding module, and a second decoding module connected in sequence; wherein the third preprocessing module is configured to resize the target image to a third preset size, segment the resized target image into multiple image patches according to a fourth preset size, then flatten each image patch into a one-dimensional feature vector of a third preset length, and assemble the one-dimensional feature vectors of the third preset length into a second feature matrix; the fourth fully connected module is configured to generate an original feature sequence corresponding to the target image according to the second feature matrix; the fourth encoding module is configured to encode the original feature sequence to obtain a fourth encoded sequence; and the second decoding module is configured to decode the fourth encoded sequence to obtain the global visual feature sequence of the target image.

According to one or more embodiments of the present disclosure, Example 10 provides an object attribute recognition apparatus, including: an acquisition module, configured to acquire a target image, wherein the target image includes a target object and object description information of the target object; a first extraction module, configured to extract, from the target image acquired by the acquisition module, a key information feature sequence of the target object and a multimodal feature sequence corresponding to a target attribute of the target object, wherein the multimodal feature sequence includes a visual feature sequence and a semantic feature sequence of the target attribute; and a determining module, configured to determine multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence extracted by the first extraction module, wherein the multiple object attributes include the target attribute.

According to one or more embodiments of the present disclosure, Example 11 provides a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processing apparatus, implements the steps of the method of any one of Examples 1-9.

According to one or more embodiments of the present disclosure, Example 12 provides an electronic device, including: a storage apparatus on which a computer program is stored; and a processing apparatus, configured to execute the computer program in the storage apparatus to implement the steps of the method of any one of Examples 1-9.
The above description is merely a preferred embodiment of the present disclosure and an illustration of the applied technical principles. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.

In addition, although the operations are depicted in a specific order, this should not be understood as requiring that these operations be performed in the specific order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented separately or in any suitable sub-combination in multiple embodiments.

Although the subject matter has been described in language specific to structural features and/or methodological logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely example forms of implementing the claims. With regard to the apparatus in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments related to the method, and will not be elaborated here.

Claims (12)

  1. An object attribute recognition method, characterized by comprising:
    acquiring a target image, wherein the target image includes a target object and object description information of the target object;
    extracting, from the target image, a key information feature sequence of the target object and a multimodal feature sequence corresponding to a target attribute of the target object, wherein the multimodal feature sequence includes a visual feature sequence and a semantic feature sequence of the target attribute;
    determining multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence, wherein the multiple object attributes include the target attribute.
  2. The method according to claim 1, characterized in that the extracting, from the target image, the multimodal feature sequence corresponding to the target attribute of the target object comprises:
    inputting the target image into a pre-trained multimodal feature extraction model to obtain the multimodal feature sequence corresponding to the target attribute of the target object.
  3. The method according to claim 2, characterized in that the multimodal feature extraction model comprises:
    a first object detection module, configured to extract, from the target image, a first region where an identifier of the target attribute of the target object is located;
    a first preprocessing module, connected to the first object detection module and configured to normalize the first region into an image of a first preset size, and flatten the normalized image into a one-dimensional row vector of a first preset length;
    a first fully connected module, connected to the first preprocessing module and configured to generate the visual feature sequence of the target attribute according to the one-dimensional row vector of the first preset length;
    a text recognition module, connected to the first object detection module and configured to perform text recognition on the first region to obtain an attribute description text of the target attribute;
    a multilingual language sub-model, connected to the text recognition module and configured to extract the semantic feature sequence of the target attribute from the attribute description text;
    a concatenation module, connected to the first fully connected module and the multilingual language sub-model respectively, and configured to concatenate the visual feature sequence and the semantic feature sequence of the target attribute to obtain a concatenated sequence;
    a first encoding module, connected to the concatenation module and configured to encode the concatenated sequence to obtain a first encoded sequence;
    a second fully connected module, connected to the first encoding module and configured to perform dimensionality reduction on the first encoded sequence to obtain the multimodal feature sequence of a preset dimension corresponding to the target attribute.
  4. The method according to claim 1, characterized in that the extracting the key information feature sequence of the target object from the target image comprises:
    performing text recognition on the target image to obtain recognized text, wherein the recognized text is multilingual text or monolingual text;
    inputting the recognized text into a pre-trained multilingual language model to obtain the key information feature sequence of the target object.
  5. The method according to claim 1, characterized in that the determining multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence comprises:
    inputting the key information feature sequence and the multimodal feature sequence into a pre-trained multimodal fusion model to obtain the multiple object attributes of the target object;
    wherein the multimodal fusion model comprises:
    a second encoding module, configured to encode a first feature matrix formed by the key information feature sequence and the multimodal feature sequence to obtain a second encoded sequence, wherein the dimensions of the key information feature sequence and the multimodal feature sequence are both a preset dimension;
    multiple first decoding modules in one-to-one correspondence with the attribute categories of the multiple object attributes, each connected to the second encoding module and configured to generate, according to the second encoded sequence, an object attribute under the corresponding attribute category, wherein the attribute category of each object attribute is different.
  6. The method according to any one of claims 1-5, characterized in that the method further comprises:
    extracting an appearance feature sequence of the target object from the target image;
    the determining multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence comprises:
    determining the multiple object attributes of the target object according to the key information feature sequence, the multimodal feature sequence, and the appearance feature sequence.
  7. The method according to claim 6, characterized in that the extracting the appearance feature sequence of the target object from the target image comprises:
    inputting the target image into a pre-trained appearance feature extraction model to obtain the appearance feature sequence of the target object, wherein the appearance feature extraction model includes a second object detection module, a second preprocessing module, a third encoding module, and a third fully connected module connected in sequence;
    wherein the second object detection module is configured to extract, from the target image, a second region where the appearance of the target object is located;
    the second preprocessing module is configured to normalize the second region into an image of a second preset size, and flatten the normalized image into a one-dimensional row vector of a second preset length;
    the third encoding module is configured to encode the one-dimensional row vector of the second preset length to obtain a third encoded sequence;
    the third fully connected module is configured to perform dimensionality reduction on the third encoded sequence to obtain the appearance feature sequence of a preset dimension of the target object.
  8. The method according to claim 6, characterized in that the method further comprises:
    extracting a global visual feature sequence of the target image from the target image;
    the determining the multiple object attributes of the target object according to the key information feature sequence, the multimodal feature sequence, and the appearance feature sequence comprises:
    determining the multiple object attributes of the target object according to the key information feature sequence, the multimodal feature sequence, the appearance feature sequence, and the global visual feature sequence.
  9. The method according to claim 8, characterized in that the extracting the global visual feature sequence of the target image from the target image comprises:
    inputting the target image into a pre-trained global visual feature extraction model to obtain the global visual feature sequence of the target image, wherein the global visual feature extraction model includes a third preprocessing module, a fourth fully connected module, a fourth encoding module, and a second decoding module connected in sequence;
    wherein the third preprocessing module is configured to resize the target image to a third preset size, segment the resized target image into multiple image patches according to a fourth preset size, then flatten each image patch into a one-dimensional feature vector of a third preset length, and assemble the one-dimensional feature vectors of the third preset length into a second feature matrix;
    the fourth fully connected module is configured to generate an original feature sequence corresponding to the target image according to the second feature matrix;
    the fourth encoding module is configured to encode the original feature sequence to obtain a fourth encoded sequence;
    the second decoding module is configured to decode the fourth encoded sequence to obtain the global visual feature sequence of the target image.
  10. An object attribute recognition apparatus, characterized by comprising:
    an acquisition module, configured to acquire a target image, wherein the target image includes a target object and object description information of the target object;
    a first extraction module, configured to extract, from the target image acquired by the acquisition module, a key information feature sequence of the target object and a multimodal feature sequence corresponding to a target attribute of the target object, wherein the multimodal feature sequence includes a visual feature sequence and a semantic feature sequence of the target attribute;
    a determining module, configured to determine multiple object attributes of the target object according to the key information feature sequence and the multimodal feature sequence extracted by the first extraction module, wherein the multiple object attributes include the target attribute.
  11. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processing apparatus, implements the steps of the method according to any one of claims 1-9.
  12. An electronic device, characterized by comprising:
    a storage apparatus on which a computer program is stored;
    a processing apparatus, configured to execute the computer program in the storage apparatus to implement the steps of the method according to any one of claims 1-9.
PCT/CN2022/141994 2022-01-21 2022-12-26 Object attribute recognition method and apparatus, readable storage medium, and electronic device WO2023138314A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210074401.6A 2022-01-21 2022-01-21 Object attribute recognition method and apparatus, readable storage medium, and electronic device
CN202210074401.6 2022-01-21

Publications (1)

Publication Number Publication Date
WO2023138314A1 (zh)

Family

ID=81313847

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/141994 WO2023138314A1 (zh) 2022-01-21 2022-12-26 Object attribute recognition method and apparatus, readable storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN114429552A (zh)
WO (1) WO2023138314A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114429552A (zh) 2022-01-21 2022-05-03 北京有竹居网络技术有限公司 Object attribute recognition method and apparatus, readable storage medium, and electronic device
CN117333868A (zh) 2022-06-24 2024-01-02 华为云计算技术有限公司 Method and apparatus for recognizing an object, and storage medium
CN117274953A (zh) 2023-09-28 2023-12-22 深圳市厚朴科技开发有限公司 Vehicle and pedestrian attribute recognition method and system, device, and medium
CN117351205A (zh) 2023-10-23 2024-01-05 中国人民解放军陆军工程大学 Image structured information extraction method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428593A (zh) * 2020-03-12 2020-07-17 北京三快在线科技有限公司 Character recognition method and apparatus, electronic device, and storage medium
CN111461203A (zh) * 2020-03-30 2020-07-28 北京百度网讯科技有限公司 Cross-modal processing method and apparatus, electronic device, and computer storage medium
US20220019847A1 (en) * 2020-07-20 2022-01-20 Beijing Baidu Netcom Science And Technology Co., Ltd. Active interaction method, electronic device and readable storage medium
CN114429552A (zh) * 2022-01-21 2022-05-03 北京有竹居网络技术有限公司 Object attribute recognition method and apparatus, readable storage medium, and electronic device

Also Published As

Publication number Publication date
CN114429552A (zh) 2022-05-03


Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22921735

Country of ref document: EP

Kind code of ref document: A1