WO2023179308A1 - Image description generation method, apparatus, device, medium, and product - Google Patents

Image description generation method, apparatus, device, medium, and product

Info

Publication number
WO2023179308A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
features
target object
visual
label
Prior art date
Application number
PCT/CN2023/078335
Other languages
English (en)
French (fr)
Inventor
毛晓飞
黄灿
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司 filed Critical 北京有竹居网络技术有限公司
Publication of WO2023179308A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present disclosure belongs to the field of image processing technology, and specifically relates to an image description generation method, apparatus, device, computer-readable storage medium, and computer program product.
  • Images contain rich visual information, and natural language descriptions can currently be generated based on this visual information.
  • However, natural language descriptions generated solely from visual information match the items in the image poorly. Therefore, an image description generation method is urgently needed to improve the matching between natural language descriptions and products.
  • The purpose of the present disclosure is to provide an image description generation method, apparatus, device, computer-readable storage medium, and computer program product capable of improving the degree of matching between the obtained natural language description and the target object in the image.
  • In a first aspect, the present disclosure provides an image description generation method, the method including:
  • acquiring an image including a target object; extracting, from the image, label features of the target object, position features of the target object in the image, text features in the image, and visual features of the target object; and generating a natural language description for the image based on the label features, the position features, the text features, the visual features, and a visual language model.
  • In a second aspect, the present disclosure provides an image description generation apparatus, including:
  • an acquisition module configured to acquire an image including a target object;
  • an extraction module configured to extract, from the image, label features of the target object, position features of the target object in the image, text features in the image, and visual features of the target object;
  • a generation module configured to generate a natural language description for the image according to the label features, the position features, the text features, the visual features, and a visual language model.
  • In a third aspect, the present disclosure provides a computer-readable medium having a computer program stored thereon which, when executed by a processing device, implements the steps of the method according to any one of the first aspect of the present disclosure.
  • In a fourth aspect, the present disclosure provides an electronic device, including: a storage device having a computer program stored thereon;
  • a processing device configured to execute the computer program in the storage device to implement the steps of the method according to any one of the first aspect of the present disclosure.
  • In a fifth aspect, the present disclosure provides a computer program product containing instructions which, when run on a device, cause the device to execute the method described in any of the above implementations of the first aspect.
  • The present disclosure provides an image description generation method. The method first acquires an image including a target object, such as an image of a product. Then, the label features of the target object, the position features of the target object in the image, the text features in the image, and the visual features of the target object are extracted from the image; the method thus extracts more effective information from the image. A natural language description for the image is then generated based on the label features, position features, text features, visual features, and a visual language model. A natural language description based on more effective information is more accurate, which in turn makes it match the target object in the image more closely.
  • Figure 1 is an architectural diagram of a recommendation system provided by an embodiment of the present disclosure
  • Figure 2 is a flow chart of an image description generation method provided by an embodiment of the present disclosure
  • Figure 3A is a schematic diagram of an image acquisition interface provided by an embodiment of the present disclosure.
  • Figure 3B is a schematic diagram of an image upload interface provided by an embodiment of the present disclosure.
  • Figure 4 is a schematic diagram of a coding and decoding structure provided by an embodiment of the present disclosure.
  • Figure 5 is a schematic diagram of a feature fusion process provided by an embodiment of the present disclosure.
  • Figure 6 is a schematic diagram of an image description generating device provided by an embodiment of the present disclosure.
  • FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • The terms "first" and "second" in the embodiments of the present disclosure are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, features defined as "first" or "second" may explicitly or implicitly include one or more of such features.
  • Image processing refers to the technology of using computers to analyze images to achieve the desired results.
  • Image processing technology generally includes image compression, enhancement and restoration, matching, description and recognition, etc.
  • Natural language refers to the language used by humans daily. It is an important way of human communication and is also the essential feature that distinguishes humans from other animals.
  • Generally, an image includes a target object. For example, the image can include products, such as bracelets and headphones.
  • Through the natural language description of the image, the user can directly understand the content in the image, such as the bracelets or headphones it includes.
  • In e-commerce scenarios, the recommendation system needs to mine the natural language description of the product in the image from the product image, so as to guarantee the recommendation effect of downstream recommendation tasks. It is therefore necessary to mine natural language descriptions of images accurately.
  • At present, the visual information included in the image is used to automatically generate a natural language description of the image.
  • However, a natural language description generated based only on visual information matches the goods in the image poorly, relies on little image information, and can hardly meet business needs.
  • Embodiments of the present disclosure provide an image description generation method, which is applied to electronic devices.
  • An electronic device is a device with data processing capabilities, such as a server or a terminal.
  • Terminals include, but are not limited to, smartphones, tablets, laptops, personal digital assistants (PDA), and smart wearable devices.
  • the server may be a cloud server, such as a central server in a central cloud computing cluster, or an edge server in an edge cloud computing cluster.
  • the server can also be a server in a local data center.
  • a local data center refers to a data center directly controlled by the user.
  • The method includes: the electronic device acquires an image of the target object; then extracts, from the image, the label features of the target object, the position features of the target object in the image, the text features in the image, and the visual features of the target object; and then generates a natural language description for the image based on the label features, the position features, the text features, the visual features, and a visual linguistic model.
  • this method can be executed by the server alone, or it can be executed by the terminal and the server in cooperation.
  • the terminal and the server may constitute a recommendation system (for example, a product recommendation system).
  • this method is specifically implemented in the form of a computer program.
  • the computer program may be independent, for example, an independent application with corresponding functions.
  • the computer program may be a functional module or plug-in, which is attached to an existing application and runs.
  • the recommendation system can obtain an image that includes a bracelet and then generate a natural language description of the bracelet. Then based on the natural language description of the bracelet, a promotion strategy for the bracelet is determined. Then based on the determined promotion strategy, the bracelet is promoted. Since the generated natural language description matches the bracelet more closely, the recommendation system can achieve better promotion results when promoting the bracelet based on the more accurate natural language description.
  • the recommendation system 100 includes a terminal 110 , a terminal 120 and a server 130 .
  • the terminal 110, the terminal 120 and the server 130 are connected through a network.
  • the terminal 110 can be a merchant terminal, and the merchant publishes products based on the merchant terminal.
  • the terminal 120 can be a user terminal, and the user can browse the products published by the merchant based on the user terminal.
  • the merchant may send images including merchandise to the server 130 through the terminal 110 .
  • the server 130 can process the image of the product to obtain a natural language description of the image, and then determine a promotion strategy for the product based on the natural language description of the image, and promote the product based on the promotion strategy. For example, the server 130 promotes the product to users of the terminal 120 by pushing advertisements to the terminal 120 .
  • the promotion strategy determined based on the accurate natural language description will be more accurate, which can better match the user with the product and improve the promotion effect and conversion rate.
  • the image description generation method provided by the embodiment of the present disclosure is introduced below from the perspectives of the terminal 110, the terminal 120 and the server 130.
  • this figure is a flow chart of an image description generation method provided by an embodiment of the present disclosure.
  • the method includes:
  • the terminal 110 sends an image including the target object to the server 130.
  • this figure is a schematic diagram of an image acquisition interface provided by an embodiment of the present disclosure.
  • the terminal 110 can present an image acquisition interface to the merchant.
  • the image acquisition interface includes a shooting control 301 , an album control 302 and a preview screen 303 .
  • the preview screen 303 may be a video stream collected by a camera of the terminal.
  • Merchants can upload images that include their target objects in a variety of ways.
  • the merchant can click the photographing control 301 to photograph a target object (such as a commodity), and thereby obtain an image including the target object.
  • the merchant can click the album control 302 to select a pre-stored image including the target object from the album.
  • As shown in FIG. 3B, this figure is a schematic diagram of an image upload interface provided by an embodiment of the present disclosure.
  • the image upload interface includes a reselection control 304, an upload control 305 and a preview image 306.
  • the interface presented by the terminal 110 is switched from the image acquisition interface shown in Figure 3A to the image uploading interface shown in Figure 3B .
  • the preview image 306 can be displayed on the image upload interface so that the merchant can know the image he has selected.
  • If the merchant is not satisfied with the preview image, the merchant can re-select an image from the album or re-shoot the image by clicking the re-select control 304; the merchant can also directly upload the image of the target object (e.g., the image corresponding to preview image 306) to the server 130 by clicking the upload control 305.
  • the above is just an example of the terminal 110 sending the image including the target object to the server 130.
  • Those skilled in the art can select an appropriate method according to actual needs so that the server 130 obtains the image including the target object.
  • the server 130 respectively extracts the label features of the target object, the position features of the target object in the image, the text features in the image, and the visual features of the target object from the image.
  • this figure is a schematic diagram of a codec structure provided by an embodiment of the present disclosure.
  • After acquiring the image including the target object, the server 130 passes the image sequentially through a convolutional neural network, an encoding structure, and a decoding structure to extract the position of the target object in the image and the label of the target object. Then, the position features in the image are obtained according to the position coordinates of the target object in the image, and the label features of the target object are obtained according to the label of the target object.
  • the encoding and decoding structure may include a 4-layer convolutional neural network, an N-layer conformer (encoder) structure, and an N-layer decoder structure.
  • The position coordinates can be the coordinates of the upper-left corner and the lower-right corner of the target object in the image.
  • Based on these two corner coordinates, the rectangular area where the target object is located can be determined, i.e., the region image corresponding to the target object.
  • The visual features of the target object are then obtained based on the region image corresponding to the target object, rather than based on the entire image. In this way, the visual features are more representative of the target object.
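  • As a purely illustrative sketch (not part of the patent text; the helper and file names are hypothetical), the region image could be cropped from the two corner coordinates like this, assuming Pillow:

      from PIL import Image

      def crop_region(image, top_left, bottom_right):
          """Crop the rectangular region delimited by the two corners.

          top_left and bottom_right are (x, y) pixel coordinates of the
          detected target object, as described above.
          """
          x1, y1 = top_left
          x2, y2 = bottom_right
          return image.crop((x1, y1, x2, y2))

      # Example: a detector placed the target object at (40, 60)-(220, 300).
      image = Image.open("product.jpg")
      region = crop_region(image, (40, 60), (220, 300))  # region image of the target object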
  • The label of the target object may be one word or multiple words.
  • For example, when the label is one word, the word can be "black"; when the label is multiple words, the words can be "watch", "screen", and so on. That is to say, in the embodiments of the present disclosure, the label is not limited to a single word.
  • After obtaining the position coordinates and label of the target object, the server 130 performs vectorization processing on them respectively. For example, the position coordinates are converted into a 768-length sequence through position embedding, and the label is converted into a 768-length sequence through token embedding.
  • The 768-length sequence is only an example; in other examples, sequences of other lengths may also be used.
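  • A minimal sketch of this vectorization step, assuming PyTorch and a toy vocabulary (the 768 dimension follows the text above; the module choices and concrete numbers are illustrative assumptions):

      import torch
      import torch.nn as nn

      EMBED_DIM = 768  # the example sequence length used in the text

      # Token embedding: maps each label word to a 768-dimensional vector.
      vocab = {"black": 0, "watch": 1, "screen": 2}
      token_embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=EMBED_DIM)

      # Position embedding: maps the four corner coordinates (x1, y1, x2, y2),
      # normalized to [0, 1], to a 768-dimensional vector.
      position_embedding = nn.Linear(4, EMBED_DIM)

      label_ids = torch.tensor([vocab["watch"], vocab["screen"]])
      label_features = token_embedding(label_ids)     # shape: (2, 768)

      coords = torch.tensor([[0.1, 0.2, 0.8, 0.9]])   # upper-left and lower-right corners
      position_features = position_embedding(coords)  # shape: (1, 768)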
  • the server 130 can determine the region image corresponding to the target object from the image based on the position coordinates of the target object in the image; and then obtain the visual characteristics of the target object based on the region image corresponding to the target object.
  • In some examples, the server 130 may obtain the region corresponding to the target object from the feature map produced by passing the image through the convolutional neural network, and then convert it into a 768-length sequence as the visual features of the target object.
  • The server 130 can process the image through optical character recognition (OCR) technology to extract the text in the image from the image, and then obtain the text features in the image based on that text. After acquiring the text in the image, the server 130 performs vectorization processing on it, for example, converting the text into a 768-length sequence through segment embedding.
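  • The OCR step could be sketched as follows; pytesseract is one common OCR library (the patent only requires some OCR technique), and the character-level stand-in for the segment embedding is an assumption for illustration:

      import pytesseract
      import torch
      import torch.nn as nn
      from PIL import Image

      image = Image.open("product.jpg")
      text = pytesseract.image_to_string(image)  # extract the text in the image

      # Vectorize the extracted text. A real system would use a proper
      # tokenizer; here each character is mapped to a 768-dimensional
      # vector as a stand-in for the segment embedding described above.
      char_ids = torch.tensor([ord(c) % 5000 for c in text if not c.isspace()])
      segment_embedding = nn.Embedding(5000, 768)
      text_features = segment_embedding(char_ids)  # shape: (num_chars, 768)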
  • The server 130 generates a natural language description for the image based on the label features, position features, text features, visual features, and the visual language model.
  • The server may first fuse the above-mentioned label features, position features, text features, and visual features through a sum operation to obtain fused features. Then, the fused features are input into the visual language model to generate a natural language description of the image.
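  • A sketch of the sum-based fusion, assuming the four feature sets have already been padded or aligned to a common shape (the alignment scheme is not specified in the text):

      import torch

      def fuse_features(label_feats, position_feats, text_feats, visual_feats):
          """Fuse the four feature tensors by element-wise summation.

          All tensors are assumed to share the shape (seq_len, 768); how
          the inputs are aligned to one sequence is an implementation choice.
          """
          return label_feats + position_feats + text_feats + visual_feats

      seq_len, dim = 16, 768
      label_feats, position_feats, text_feats, visual_feats = [
          torch.randn(seq_len, dim) for _ in range(4)
      ]
      fused = fuse_features(label_feats, position_feats, text_feats, visual_feats)
      # 'fused' is then fed to the visual language model to decode the description.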
  • this figure is a schematic diagram of a feature fusion process provided by an embodiment of the present disclosure.
  • As shown in Figure 5, the target object may be a smartwatch, and the image may be an image including the smartwatch.
  • As can be seen from the figure, the text features can be marked by two kinds of vectors: when a text feature is marked by A, the corresponding label feature can be a word; when a text feature is marked by C, the corresponding label feature can be an image.
  • "1" in the position feature can represent two coordinates, namely the coordinates of the upper left corner and the coordinates of the lower right corner of the target object.
  • “1” can correspond to the abscissa and ordinate of the upper left corner, and the abscissa and ordinate of the lower right corner.
  • "CLS" is placed first, as the start token;
  • "END" is placed at the end, as the end token;
  • "SEP" is used to separate two kinds of input, as the delimiter, and can be used to distinguish text from images;
  • "MASK" is used to mask some words in the text. After covering words with "MASK", the words at the "MASK" positions are predicted, thereby generating more samples to facilitate model training, so that only a small number of original samples are needed to train the visual language model.
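  • The masking trick can be sketched as a simplified BERT-style routine (the special token names follow the figure; the masking ratio and everything else are illustrative assumptions):

      import random

      def mask_tokens(tokens, mask_ratio=0.15, mask_token="MASK"):
          """Randomly replace words with MASK; return (masked tokens, targets).

          The model is trained to predict the original word at each MASK
          position, so every pass over a sentence yields a new training
          sample from the same original sample.
          """
          masked, targets = [], {}
          for i, tok in enumerate(tokens):
              if tok not in ("CLS", "SEP", "END") and random.random() < mask_ratio:
                  masked.append(mask_token)
                  targets[i] = tok  # the word to predict at this position
              else:
                  masked.append(tok)
          return masked, targets

      tokens = ["CLS", "smart", "watch", "with", "large", "screen", "END"]
      masked, targets = mask_tokens(tokens)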
  • The visual language model can be composed of a multi-layer encoding structure (conformer) and a multi-layer decoding structure (decoder).
  • The number of layers of the encoding structure can be 6, and the number of layers of the decoding structure can also be 6. After the label features, position features, text features, and visual features are summed, the result is input into the visual language model and passed through a fully connected (FC) layer to obtain the natural language description of the image, which can be text such as "a smartwatch with a large screen and a color display".
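  • A skeleton of such a model, using standard Transformer layers as a stand-in for the conformer layers named in the text (the layer counts and the final fully connected head follow the description; every other module choice is an assumption):

      import torch
      import torch.nn as nn

      class VisualLanguageModel(nn.Module):
          """6-layer encoder + 6-layer decoder + FC head over the vocabulary."""

          def __init__(self, dim=768, vocab_size=30000, num_layers=6, nhead=8):
              super().__init__()
              enc = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)
              dec = nn.TransformerDecoderLayer(d_model=dim, nhead=nhead, batch_first=True)
              self.encoder = nn.TransformerEncoder(enc, num_layers=num_layers)
              self.decoder = nn.TransformerDecoder(dec, num_layers=num_layers)
              self.fc = nn.Linear(dim, vocab_size)  # fully connected output layer

          def forward(self, fused_features, description_embeddings):
              memory = self.encoder(fused_features)  # encode the fused image features
              hidden = self.decoder(description_embeddings, memory)
              return self.fc(hidden)                 # per-position word logits

      model = VisualLanguageModel()
      fused = torch.randn(1, 16, 768)  # fused label/position/text/visual features
      prefix = torch.randn(1, 8, 768)  # embedded, already-generated description prefix
      logits = model(fused, prefix)    # shape: (1, 8, 30000)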
  • the server 130 determines a promotion strategy for the target object based on the natural language description of the image.
  • the server 130 pushes the advertisement targeted at the target object to the terminal 120 according to the promotion strategy targeted at the target object.
  • the server 130 can determine a more accurate promotion strategy based on the more accurate natural language description. Then, based on the more accurate promotion strategy, advertisements targeted at the target object are pushed to the terminal 120 . In this way, not only can the resource waste of the server 130 be reduced, but also the user on the terminal 120 side can more directly understand the target object, thereby improving the conversion rate.
  • embodiments of the present disclosure provide an image description generation method.
  • In this method, not only are visual features mined from the image, but also the label features of the target object, the position features of the target object in the image, and the text features in the image, so that the image provides more effective information, which in turn enables the generated natural language description to better match the target object in the image.
  • the recommendation system can achieve better recommendation results based on the more accurate natural language description.
  • FIG. 6 is a schematic diagram of an image description generating device according to an exemplary disclosed embodiment. As shown in Figure 6, the image description generating device 600 includes:
  • an acquisition module 601 configured to acquire an image including a target object;
  • an extraction module 602 configured to extract, from the image, the label features of the target object, the position features of the target object in the image, the text features in the image, and the visual features of the target object;
  • a generation module 603 configured to generate a natural language description for the image according to the label features, the position features, the text features, the visual features, and the visual language model.
  • the generation module 603 is also configured to determine a promotion strategy for the target object based on the natural language description of the image, and the promotion strategy is used to promote the target object.
  • Optionally, the extraction module 602 is specifically configured to pass the image sequentially through a convolutional neural network, an encoding structure, and a decoding structure to extract the position coordinates of the target object in the image and the label of the target object; obtain the position features in the image according to the position coordinates of the target object in the image; and obtain the label features of the target object according to the label of the target object.
  • the label of the target object includes at least one word.
  • the extraction module 602 is specifically configured to perform optical character recognition on the image, extract text in the image, and obtain text features in the image based on the text in the image.
  • Optionally, the extraction module 602 is specifically configured to determine, from the image, the region image corresponding to the target object according to the position coordinates of the target object in the image, and to obtain the visual features of the target object according to the region image corresponding to the target object.
  • Optionally, the generation module 603 is specifically configured to fuse the label features, the position features, the text features, and the visual features through a sum operation to obtain fused features, and to input the fused features into the visual language model to generate a natural language description for the image.
  • Referring now to FIG. 7, a schematic structural diagram of an electronic device 700 suitable for implementing embodiments of the present disclosure is shown.
  • the electronic device shown in FIG. 7 is only an example and should not impose any limitations on the functions and scope of use of the embodiments of the present disclosure.
  • The electronic device 700 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 701, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703.
  • various programs and data required for the operation of the electronic device 700 are also stored.
  • the processing device 701, ROM 702 and RAM 703 are connected to each other via a bus 704.
  • An input/output (I/O) interface 705 is also connected to bus 704.
  • The following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 707 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 708 including, for example, a magnetic tape and a hard disk; and a communication device 709. The communication device 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data.
  • Although FIG. 7 illustrates an electronic device 700 having various devices, it should be understood that it is not required to implement or provide all of the devices shown. More or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product including a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via communication device 709, or from storage device 708, or from ROM 702.
  • When the computer program is executed by the processing device 701, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.
  • In some embodiments, terminals and servers may communicate using any currently known or future developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; it may also exist independently without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs.
  • When the one or more programs are executed by the electronic device, the electronic device: acquires an image including a target object; extracts, from the image, the label features of the target object, the position features of the target object in the image, the text features in the image, and the visual features of the target object; and generates a natural language description for the image based on the label features, the position features, the text features, the visual features, and a visual language model.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • The remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • Each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown one after another may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by combinations of special-purpose hardware and computer instructions.
  • the modules involved in the embodiments of the present disclosure can be implemented in software or hardware.
  • the name of the module does not constitute a limitation on the module itself under certain circumstances.
  • the first acquisition module can also be described as "a module that acquires at least two Internet Protocol addresses.”
  • Exemplary types of hardware logic components that may be used include, without limitation: field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), application specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), and so forth.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices or devices, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • According to one or more embodiments of the present disclosure, Example 1 provides an image description generation method.
  • The method includes: acquiring an image including a target object; extracting, from the image, the label features of the target object, the position features of the target object in the image, the text features in the image, and the visual features of the target object; and generating a natural language description for the image based on the label features, the position features, the text features, the visual features, and a visual language model.
  • Example 2 provides the method of Example 1.
  • The method further includes: determining, according to the natural language description of the image, a promotion strategy for the target object, where the promotion strategy is used to promote the target object.
  • Example 3 provides the method of Example 1, where extracting, from the image, the label features of the target object and the position features of the target object in the image includes: passing the image sequentially through a convolutional neural network, an encoding structure, and a decoding structure to extract the position coordinates of the target object in the image and the label of the target object; obtaining the position features in the image according to the position coordinates of the target object in the image; and obtaining the label features of the target object according to the label of the target object.
  • Example 4 provides the method of Example 3, and the label of the target object includes at least one word.
  • Example 5 provides the method of Example 1.
  • The process of extracting the text features in the image includes: performing optical character recognition on the image to extract the text in the image; and obtaining the text features in the image according to the text in the image.
  • Example 6 provides the method of Example 3.
  • The process of extracting the visual features of the target object includes: determining, from the image, the region image corresponding to the target object according to the position coordinates of the target object in the image; and obtaining the visual features of the target object according to the region image corresponding to the target object.
  • Example 7 provides the method of any one of Examples 1-6, where generating a natural language description for the image according to the label features, the position features, the text features, the visual features, and a visual language model includes: fusing the label features, the position features, the text features, and the visual features through a sum operation to obtain fused features; and inputting the fused features into the visual language model to generate a natural language description for the image.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Character Input (AREA)

Abstract

The present disclosure provides an image description generation method, apparatus, device, medium, and product, relating to the field of image processing technology. The method includes: acquiring an image including a target object; extracting, from the image, label features of the target object, position features of the target object in the image, text features in the image, and visual features of the target object; and generating a natural language description for the image according to the label features, the position features, the text features, the visual features, and a visual language model. It can be seen that the method extracts more effective information from the image, enabling the model to better understand the image and thereby improving the degree of matching between the obtained natural language description and the target object in the image.

Description

Image description generation method, apparatus, device, medium, and product
The present disclosure claims priority to the Chinese patent application No. 202210278138.2, titled "Image description generation method, apparatus, device, medium, and product" and filed with the China National Intellectual Property Administration on March 21, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure belongs to the field of image processing technology, and specifically relates to an image description generation method, apparatus, device, computer-readable storage medium, and computer program product.
Background
With the development of computer technology, especially image processing technology, images carry an ever-increasing share of information transfer. In some scenarios, such as e-commerce, the degree of matching between a product's image and the product's natural language description affects how effectively the product can be promoted. When the natural language description matches the product closely, users can understand the product more directly, improving the promotion effect.
Images contain rich visual information, and natural language descriptions can currently be generated based on this visual information. However, a natural language description generated based on visual information alone matches the product in the image poorly. Therefore, an image description generation method is urgently needed to improve the degree of matching between natural language descriptions and products.
Summary
The purpose of the present disclosure is to provide an image description generation method, apparatus, device, computer-readable storage medium, and computer program product capable of improving the degree of matching between the obtained natural language description and the target object in the image.
In a first aspect, the present disclosure provides an image description generation method, the method including:
acquiring an image including a target object;
extracting, from the image, label features of the target object, position features of the target object in the image, text features in the image, and visual features of the target object;
generating a natural language description for the image according to the label features, the position features, the text features, the visual features, and a visual language model.
In a second aspect, the present disclosure provides an image description generation apparatus, including:
an acquisition module configured to acquire an image including a target object;
an extraction module configured to extract, from the image, label features of the target object, position features of the target object in the image, text features in the image, and visual features of the target object;
a generation module configured to generate a natural language description for the image according to the label features, the position features, the text features, the visual features, and a visual language model.
In a third aspect, the present disclosure provides a computer-readable medium having a computer program stored thereon which, when executed by a processing device, implements the steps of the method according to any one of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device, including:
a storage device having a computer program stored thereon;
a processing device configured to execute the computer program in the storage device to implement the steps of the method according to any one of the first aspect of the present disclosure.
In a fifth aspect, the present disclosure provides a computer program product containing instructions which, when run on a device, cause the device to execute the method described in any implementation of the first aspect above.
It can be seen from the above technical solutions that the present disclosure has the following advantages:
The present disclosure provides an image description generation method. The method first acquires an image including a target object, such as an image of a product. Then, label features of the target object, position features of the target object in the image, text features in the image, and visual features of the target object are extracted from the image; the method thus extracts more effective information from the image. Next, a natural language description for the image is generated based on the label features, position features, text features, visual features, and a visual language model. A natural language description based on more effective information is more accurate, which in turn makes the description match the target object in the image more closely.
Other features and advantages of the present disclosure will be described in detail in the detailed description that follows.
Brief Description of the Drawings
The accompanying drawings are provided for a further understanding of the present invention and constitute a part of the specification. Together with the embodiments of the present invention, they serve to explain the present invention and do not limit it. In the drawings:
Figure 1 is an architecture diagram of a recommendation system provided by an embodiment of the present disclosure;
Figure 2 is a flowchart of an image description generation method provided by an embodiment of the present disclosure;
Figure 3A is a schematic diagram of an image acquisition interface provided by an embodiment of the present disclosure;
Figure 3B is a schematic diagram of an image upload interface provided by an embodiment of the present disclosure;
Figure 4 is a schematic diagram of an encoding-decoding structure provided by an embodiment of the present disclosure;
Figure 5 is a schematic diagram of a feature fusion process provided by an embodiment of the present disclosure;
Figure 6 is a schematic diagram of an image description generation apparatus provided by an embodiment of the present disclosure;
Figure 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
The terms "first" and "second" in the embodiments of the present disclosure are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, features defined as "first" or "second" may explicitly or implicitly include one or more of such features.
First, some technical terms involved in the embodiments of the present disclosure are introduced.
Image processing refers to the technology of analyzing images with a computer to achieve desired results. Image processing technology generally includes image compression, enhancement and restoration, as well as matching, description, and recognition.
Natural language refers to the language used by humans daily. It is an important means of human communication and is also the essential feature that distinguishes humans from other animals.
Generally, an image includes a target object. For example, the image may include a product, such as a bracelet or headphones. Through the natural language description of the image, a user can directly understand the content in the image, such as the bracelet or headphones it includes.
In e-commerce scenarios, a recommendation system needs to mine the natural language description of the product in an image based on the product image, so as to guarantee the recommendation effect of downstream recommendation tasks. It is therefore necessary to mine the natural language description of an image accurately. At present, the visual information included in the image is used to automatically generate the image's natural language description. However, a natural language description generated based on visual information alone matches the product in the image poorly, relies on little image information, and can hardly meet business needs.
In view of this, embodiments of the present disclosure provide an image description generation method applied to an electronic device. An electronic device is a device with data processing capabilities, such as a server or a terminal. Terminals include, but are not limited to, smartphones, tablets, laptops, personal digital assistants (PDA), and smart wearable devices. The server may be a cloud server, such as a central server in a central cloud computing cluster or an edge server in an edge cloud computing cluster. Of course, the server may also be a server in a local data center, i.e., a data center directly controlled by the user.
Specifically, the method includes: the electronic device acquires an image of a target object; then extracts, from the image, label features of the target object, position features of the target object in the image, text features in the image, and visual features of the target object; and then generates a natural language description for the image according to the label features, the position features, the text features, the visual features, and a visual linguistic model.
It can be seen that in this method, not only visual features are mined from the image, but also the label features of the target object, the position features of the target object in the image, and the text features in the image, so that the image provides more effective information, which in turn makes the generated natural language description match the target object in the image more closely. Further, when the natural language description of a target object matches it well, the recommendation system can achieve a better recommendation effect based on the more accurate description.
As noted above, the method may be executed by a server alone, or jointly by a terminal and a server. For ease of understanding, joint execution by a terminal and a server is taken as an example, where the terminal and the server may constitute a recommendation system (for example, a product recommendation system). When applied to a recommendation system, the method is implemented in the form of a computer program. In some embodiments, the computer program may be independent, for example, an independent application with corresponding functions. In other embodiments, the computer program may be a functional module, a plug-in, or the like that runs attached to an existing application.
For example, the recommendation system may acquire an image including a bracelet and then generate a natural language description of the bracelet. Based on this description, a promotion strategy for the bracelet is determined, and the bracelet is then promoted according to the determined strategy. Since the generated natural language description matches the bracelet more closely, the recommendation system can achieve a better promotion effect when promoting the bracelet based on the more accurate description.
To make the technical solution of the present disclosure clearer and easier to understand, the architecture of the recommendation system provided by the embodiments of the present disclosure is introduced below with reference to the drawings.
Referring to the system architecture diagram of the recommendation system 100 shown in Figure 1, the recommendation system 100 includes a terminal 110, a terminal 120, and a server 130, which are connected through a network. The terminal 110 may be a merchant terminal through which a merchant publishes products, and the terminal 120 may be a user terminal through which a user can browse the products published by the merchant.
In some examples, the merchant may send an image including a product to the server 130 through the terminal 110. The server 130 may process the product image to obtain a natural language description of the image, then determine a promotion strategy for the product according to the description, and promote the product based on that strategy. For example, the server 130 pushes advertisements to the terminal 120 to promote the product to the user of the terminal 120.
It can be seen that when the natural language description of the product generated by the server 130 matches the product well, the promotion strategy determined based on the accurate description is also more accurate, which can better match users with the product and improve the promotion effect and conversion rate.
To make the technical solution of the present disclosure clearer and easier to understand, the image description generation method provided by the embodiments of the present disclosure is introduced below from the perspectives of the terminal 110, the terminal 120, and the server 130.
As shown in Figure 2, which is a flowchart of an image description generation method provided by an embodiment of the present disclosure, the method includes:
S201: The terminal 110 sends an image including a target object to the server 130.
In some embodiments, as shown in Figure 3A, which is a schematic diagram of an image acquisition interface provided by an embodiment of the present disclosure, the terminal 110 may present an image acquisition interface to the merchant. The image acquisition interface includes a shooting control 301, an album control 302, and a preview screen 303, where the preview screen 303 may be the video stream collected by the terminal's camera.
The merchant can upload an image including the target object in multiple ways. In some examples, the merchant can tap the shooting control 301 to photograph the target object (for example, a product), thereby obtaining an image including the target object. In other examples, the merchant can tap the album control 302 to select a pre-stored image including the target object from the album.
As shown in Figure 3B, which is a schematic diagram of an image upload interface provided by an embodiment of the present disclosure, the image upload interface includes a reselect control 304, an upload control 305, and a preview image 306. After the merchant photographs an image including the target object through the shooting control 301 or selects one through the album control 302, the interface presented by the terminal 110 switches from the image acquisition interface shown in Figure 3A to the image upload interface shown in Figure 3B. The preview image 306 can be displayed on the image upload interface so that the merchant knows which image has been selected. If the merchant is not satisfied with the preview image 306, the merchant can tap the reselect control 304 to choose another image from the album or retake the photo; the merchant can also tap the upload control 305 to upload the image of the target object (e.g., the image corresponding to the preview image 306) to the server 130.
It should be noted that the above is merely one example of the terminal 110 sending an image including the target object to the server 130; those skilled in the art can choose an appropriate method according to actual needs so that the server 130 obtains the image including the target object.
S202: The server 130 extracts, from the image, label features of the target object, position features of the target object in the image, text features in the image, and visual features of the target object.
In some embodiments, as shown in Figure 4, which is a schematic diagram of an encoding-decoding structure provided by an embodiment of the present disclosure, after acquiring the image including the target object, the server 130 passes the image sequentially through a convolutional neural network, an encoding structure, and a decoding structure to extract the position of the target object in the image and the label of the target object. Then, the position features in the image are obtained according to the position coordinates of the target object in the image, and the label features of the target object are obtained according to its label.
The encoding-decoding structure may include a 4-layer convolutional neural network, an N-layer conformer (encoder) structure, and an N-layer decoder structure. The position coordinates may be the coordinates of the upper-left and lower-right corners of the target object in the image. Based on these two corner coordinates, the rectangular region where the target object is located, i.e., the region image corresponding to the target object, can be determined. The visual features of the target object are then obtained based on this region image rather than on the whole image, which makes the visual features more representative of the target object.
The label of the target object may be one word or multiple words. For example, when the label is one word, the word may be "black"; when the label is multiple words, the words may be "watch", "screen", and so on. That is, in the embodiments of the present disclosure, the label is not limited to a single word.
It should be noted that the specific contents of the above labels are merely illustrative.
After obtaining the position coordinates and label of the target object, the server 130 vectorizes them separately. For example, the position coordinates are converted into a sequence of length 768 through position embedding, and the label is converted into a sequence of length 768 through token embedding.
It should be noted that the above sequence of length 768 is merely an example; in other examples, sequences of other lengths may also be used.
Next, the server 130 may determine, from the image, the region image corresponding to the target object based on the position coordinates of the target object in the image, and then obtain the visual features of the target object based on that region image. In some examples, the server 130 may obtain, from the feature map produced by passing the image through the convolutional neural network, the region corresponding to the target object, and then convert it into a sequence of length 768 as the visual features of the target object.
The server 130 may process the image through optical character recognition (OCR) technology to extract the text in the image, and then obtain the text features in the image based on that text. After obtaining the text in the image, the server 130 vectorizes it, for example, converting the text into a sequence of length 768 through segment embedding.
In the embodiments of the present disclosure, not only visual features are used, but also other features in the image (such as text features, position features, and label features), enabling the model to fully understand the image, which makes the generated natural language description more accurate and better matched to the image.
S203: The server 130 generates a natural language description for the image according to the label features, position features, text features, visual features, and a visual language model.
In some embodiments, the server may first fuse the label features, position features, text features, and visual features through a summation operation to obtain fused features, and then input the fused features into the visual language model to generate the natural language description of the image.
As shown in Figure 5, which is a schematic diagram of a feature fusion process provided by an embodiment of the present disclosure, the target object may be a smartwatch, and the image may be an image including the smartwatch. As can be seen from the figure, the text features can be marked with two kinds of vectors: when a text feature is marked with A, the corresponding label feature may be a word; when it is marked with C, the corresponding label feature may be an image. In the position features, "1" may represent two coordinates, namely the coordinates of the upper-left and lower-right corners of the target object; for example, "1" may correspond to the abscissa and ordinate of the upper-left corner and the abscissa and ordinate of the lower-right corner. Among the label features, "CLS" is placed first as the start token; "END" is placed last as the end token; "SEP" separates two kinds of input as the delimiter and can be used to distinguish text from images; "MASK" is used to mask some words in the text. After masking words with "MASK", the words at the "MASK" positions are predicted, thereby generating more samples for model training, so that only a small number of original samples are needed to train the visual language model.
The visual language model may consist of a multi-layer encoding structure (conformer) and a multi-layer decoding structure (decoder), where the number of encoder layers may be 6 and the number of decoder layers may also be 6. After the label features, position features, text features, and visual features are summed, the result is input into the visual language model and passed through a fully connected (FC) layer to obtain the natural language description of the image. The description may be text; for example, the output may be "a smartwatch with a large screen and a color display".
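Putting steps S202 and S203 together, the server-side flow might look like the following sketch; every function name here is a hypothetical placeholder for a component described above, not an API defined by the patent:

    def generate_description(image):
        """End-to-end sketch of S202-S203 built from hypothetical components."""
        # S202: extract the four kinds of features from the image.
        coords, label = detect_target_object(image)   # CNN + encoding/decoding structure
        position_features = embed_position(coords)    # position embedding -> length-768 sequence
        label_features = embed_tokens(label)          # token embedding -> length-768 sequence
        region = crop_region(image, coords)
        visual_features = embed_region(region)        # taken from the CNN feature map
        text = run_ocr(image)
        text_features = embed_segments(text)          # segment embedding -> length-768 sequence

        # S203: fuse by summation and decode with the visual language model.
        fused = label_features + position_features + text_features + visual_features
        return visual_language_model.generate(fused)  # e.g. "a smartwatch with a large screen ..."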
S204: The server 130 determines a promotion strategy for the target object according to the natural language description of the image.
S205: The server 130 pushes advertisements for the target object to the terminal 120 according to the promotion strategy for the target object.
After obtaining a more accurate natural language description, the server 130 can determine a more accurate promotion strategy based on it, and then push advertisements for the target object to the terminal 120 based on that strategy. This not only reduces wasted resources on the server 130, but also lets the user on the terminal 120 side understand the target object more directly, helping to improve the conversion rate.
Based on the above description, embodiments of the present disclosure provide an image description generation method. In this method, not only visual features are mined from the image, but also the label features of the target object, the position features of the target object in the image, and the text features in the image, so that the image provides more effective information, which in turn makes the generated natural language description match the target object in the image more closely. Further, when the natural language description of the target object matches it well, the recommendation system can achieve a better recommendation effect based on the more accurate description.
Figure 6 is a schematic diagram of an image description generation apparatus according to an exemplary disclosed embodiment. As shown in Figure 6, the image description generation apparatus 600 includes:
an acquisition module 601 configured to acquire an image including a target object;
an extraction module 602 configured to extract, from the image, label features of the target object, position features of the target object in the image, text features in the image, and visual features of the target object;
a generation module 603 configured to generate a natural language description for the image according to the label features, the position features, the text features, the visual features, and a visual language model.
Optionally, the generation module 603 is further configured to determine, according to the natural language description of the image, a promotion strategy for the target object, the promotion strategy being used to promote the target object.
Optionally, the extraction module 602 is specifically configured to pass the image sequentially through a convolutional neural network, an encoding structure, and a decoding structure to extract the position coordinates of the target object in the image and the label of the target object; obtain the position features in the image according to the position coordinates of the target object in the image; and obtain the label features of the target object according to its label.
Optionally, the label of the target object includes at least one word.
Optionally, the extraction module 602 is specifically configured to perform optical character recognition on the image to extract the text in the image, and obtain the text features in the image according to the text in the image.
Optionally, the extraction module 602 is specifically configured to determine, from the image, the region image corresponding to the target object according to the position coordinates of the target object in the image, and obtain the visual features of the target object according to the region image corresponding to the target object.
Optionally, the generation module 603 is specifically configured to fuse the label features, the position features, the text features, and the visual features through a summation operation to obtain fused features, and input the fused features into the visual language model to generate a natural language description for the image.
The functions of the above modules have been described in detail in the method steps of the previous embodiment and are not repeated here.
Referring now to Figure 7, which shows a schematic structural diagram of an electronic device 700 suitable for implementing embodiments of the present disclosure. The electronic device shown in Figure 7 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in Figure 7, the electronic device 700 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 701, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the electronic device 700. The processing device 701, the ROM 702, and the RAM 703 are connected to one another via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 707 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 708 including, for example, a magnetic tape and a hard disk; and a communication device 709. The communication device 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. Although Figure 7 shows an electronic device 700 with various devices, it should be understood that it is not required to implement or have all of the devices shown; more or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product that includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network through the communication device 709, or installed from the storage device 708, or installed from the ROM 702. When the computer program is executed by the processing device 701, the above-described functions defined in the methods of the embodiments of the present disclosure are executed.
It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on a computer-readable medium may be transmitted by any appropriate medium, including but not limited to: a wire, an optical cable, RF (radio frequency), etc., or any suitable combination of the above.
In some embodiments, the terminal and the server may communicate using any currently known or future developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The above computer-readable medium may be included in the above electronic device, or may exist separately without being assembled into the electronic device.
The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire an image including a target object; extract, from the image, label features of the target object, position features of the target object in the image, text features in the image, and visual features of the target object; and generate a natural language description for the image according to the label features, the position features, the text features, the visual features, and a visual language model.
Computer program code for executing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a part of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented with dedicated hardware-based systems that perform the specified functions or operations, or with combinations of dedicated hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a module does not, in some cases, limit the module itself; for example, the first acquisition module may also be described as "a module that acquires at least two Internet Protocol addresses".
The functions described above herein may be executed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), application specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
According to one or more embodiments of the present disclosure, Example 1 provides an image description generation method, the method including: acquiring an image including a target object; extracting, from the image, label features of the target object, position features of the target object in the image, text features in the image, and visual features of the target object; and generating a natural language description for the image according to the label features, the position features, the text features, the visual features, and a visual language model.
According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, the method further including: determining, according to the natural language description of the image, a promotion strategy for the target object, the promotion strategy being used to promote the target object.
According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 1, where extracting, from the image, the label features of the target object and the position features of the target object in the image includes: passing the image sequentially through a convolutional neural network, an encoding structure, and a decoding structure to extract the position coordinates of the target object in the image and the label of the target object; obtaining the position features in the image according to the position coordinates of the target object in the image; and obtaining the label features of the target object according to the label of the target object.
According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 3, where the label of the target object includes at least one word.
According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 1, where the process of extracting the text features in the image includes: performing optical character recognition on the image to extract the text in the image; and obtaining the text features in the image according to the text in the image.
According to one or more embodiments of the present disclosure, Example 6 provides the method of Example 3, where the process of extracting the visual features of the target object includes: determining, from the image, the region image corresponding to the target object according to the position coordinates of the target object in the image; and obtaining the visual features of the target object according to the region image corresponding to the target object.
According to one or more embodiments of the present disclosure, Example 7 provides the method of any one of Examples 1 to 6, where generating a natural language description for the image according to the label features, the position features, the text features, the visual features, and a visual language model includes: fusing the label features, the position features, the text features, and the visual features through a summation operation to obtain fused features; and inputting the fused features into the visual language model to generate a natural language description for the image.
The above description is merely a preferred embodiment of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by specific combinations of the above technical features, but should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example, technical solutions formed by substituting the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Furthermore, although the operations are depicted in a specific order, this should not be understood as requiring that these operations be executed in the specific order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be interpreted as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments, individually or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or method logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. On the contrary, the specific features and actions described above are merely example forms of implementing the claims. Regarding the apparatus in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments of the method, and will not be elaborated here.

Claims (11)

  1. An image description generation method, characterized in that the method comprises:
    acquiring an image including a target object;
    extracting, from the image, label features of the target object, position features of the target object in the image, text features in the image, and visual features of the target object;
    generating a natural language description for the image according to the label features, the position features, the text features, the visual features, and a visual language model.
  2. The method according to claim 1, characterized in that the method further comprises:
    determining, according to the natural language description of the image, a promotion strategy for the target object, the promotion strategy being used to promote the target object.
  3. The method according to claim 1, characterized in that extracting, from the image, the label features of the target object and the position features of the target object in the image comprises:
    passing the image sequentially through a convolutional neural network, an encoding structure, and a decoding structure to extract the position coordinates of the target object in the image and the label of the target object;
    obtaining the position features in the image according to the position coordinates of the target object in the image, and obtaining the label features of the target object according to the label of the target object.
  4. The method according to claim 3, characterized in that the label of the target object includes at least one word.
  5. The method according to claim 1, characterized in that the process of extracting the text features in the image comprises:
    performing optical character recognition on the image to extract the text in the image;
    obtaining the text features in the image according to the text in the image.
  6. The method according to claim 3, characterized in that the process of extracting the visual features of the target object comprises:
    determining, from the image, the region image corresponding to the target object according to the position coordinates of the target object in the image;
    obtaining the visual features of the target object according to the region image corresponding to the target object.
  7. The method according to any one of claims 1 to 6, characterized in that generating a natural language description for the image according to the label features, the position features, the text features, the visual features, and a visual language model comprises:
    fusing the label features, the position features, the text features, and the visual features through a summation operation to obtain fused features;
    inputting the fused features into the visual language model to generate a natural language description for the image.
  8. An image description generation apparatus, characterized by comprising:
    an acquisition module configured to acquire an image including a target object;
    an extraction module configured to extract, from the image, label features of the target object, position features of the target object in the image, text features in the image, and visual features of the target object;
    a generation module configured to generate a natural language description for the image according to the label features, the position features, the text features, the visual features, and a visual language model.
  9. An electronic device, characterized by comprising:
    a storage device having a computer program stored thereon;
    a processing device configured to execute the computer program in the storage device to implement the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processing device, implements the method according to any one of claims 1 to 7.
  11. A computer program product, characterized in that, when the computer program product runs on a computer, it causes the computer to execute the method according to any one of claims 1 to 7.
PCT/CN2023/078335 2022-03-21 2023-02-27 Image description generation method, apparatus, device, medium, and product WO2023179308A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210278138.2A CN114627353B (zh) 2022-03-21 2022-03-21 Image description generation method, apparatus, device, medium, and product
CN202210278138.2 2022-03-21

Publications (1)

Publication Number Publication Date
WO2023179308A1 true WO2023179308A1 (zh) 2023-09-28

Family

ID=81904313

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/078335 WO2023179308A1 (zh) 2022-03-21 2023-02-27 Image description generation method, apparatus, device, medium, and product

Country Status (2)

Country Link
CN (1) CN114627353B (zh)
WO (1) WO2023179308A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114627353B (zh) * 2022-03-21 2023-12-12 北京有竹居网络技术有限公司 一种图像描述生成方法、装置、设备、介质及产品

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472688A (zh) * 2019-08-16 2019-11-19 北京金山数字娱乐科技有限公司 Image captioning method and apparatus, and image captioning model training method and apparatus
CN110825901A (zh) * 2019-11-11 2020-02-21 腾讯科技(北京)有限公司 Artificial-intelligence-based image-text matching method, apparatus, device, and storage medium
CN111368118B (zh) * 2020-02-13 2023-04-18 中山大学 Image description generation method, system, apparatus, and storage medium
CN113792113A (zh) * 2020-07-31 2021-12-14 北京京东尚科信息技术有限公司 Visual language model acquisition and task processing method, apparatus, device, and medium
CN113569892A (zh) * 2021-01-29 2021-10-29 腾讯科技(深圳)有限公司 Image description information generation method, apparatus, computer device, and storage medium
AU2021106049A4 (en) * 2021-08-19 2021-12-09 Abrar Alvi An enhancing visual data caption generation using machine learning
CN114387430B (zh) * 2022-01-11 2024-05-28 平安科技(深圳)有限公司 Artificial-intelligence-based image description generation method, apparatus, device, and medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472642A (zh) * 2019-08-19 2019-11-19 齐鲁工业大学 Fine-grained image captioning method and system based on multi-level attention
US20210390700A1 (en) * 2020-06-12 2021-12-16 Adobe Inc. Referring image segmentation
CN113792112A (zh) * 2020-07-31 2021-12-14 北京京东尚科信息技术有限公司 Visual language task processing system, training method, apparatus, device, and medium
CN112200268A (zh) * 2020-11-04 2021-01-08 福州大学 Image captioning method based on an encoder-decoder framework
CN113569068A (zh) * 2021-01-19 2021-10-29 腾讯科技(深圳)有限公司 Description content generation method, and visual content encoding and decoding method and apparatus
CN114627353A (zh) * 2022-03-21 2022-06-14 北京有竹居网络技术有限公司 Image description generation method, apparatus, device, medium, and product

Also Published As

Publication number Publication date
CN114627353A (zh) 2022-06-14
CN114627353B (zh) 2023-12-12

Similar Documents

Publication Publication Date Title
CN112184738B (zh) Image segmentation method, apparatus, device, and storage medium
WO2021196903A1 (zh) Video processing method and apparatus, readable medium, and electronic device
US11758088B2 (en) Method and apparatus for aligning paragraph and video
CN112153422B (zh) Video fusion method and device
US20230239546A1 (en) Theme video generation method and apparatus, electronic device, and readable storage medium
CN109934142B (zh) Method and apparatus for generating feature vectors of a video
WO2023179308A1 (zh) Image description generation method, apparatus, device, medium, and product
WO2023179310A1 (zh) Image inpainting method, apparatus, device, medium, and product
WO2024012251A1 (zh) Semantic segmentation model training method and apparatus, electronic device, and storage medium
CN114580425B (zh) Named entity recognition method and apparatus, electronic device, and storage medium
US20230367972A1 (en) Method and apparatus for processing model data, electronic device, and computer readable medium
CN110148024B (zh) Method and apparatus for providing a comment input mode
CN115294501A (zh) Video recognition method, video recognition model training method, medium, and electronic device
CN113610034B (zh) Method, apparatus, storage medium, and electronic device for recognizing person entities in videos
CN109816023B (zh) Method and apparatus for generating an image tag model
CN110673886B (zh) Method and apparatus for generating heat maps
CN113628097A (zh) Image special-effect configuration method, image recognition method, apparatus, and electronic device
WO2023098576A1 (zh) Image processing method, apparatus, device, and medium
WO2022213801A1 (zh) Video processing method, apparatus, and device
CN112084441A (zh) Information retrieval method, apparatus, and electronic device
CN115757933A (zh) Recommendation information generation method, apparatus, device, medium, and program product
CN114239501A (zh) Contract generation method, apparatus, device, and medium
CN112488204A (zh) Training sample generation method, image segmentation method, apparatus, device, and medium
WO2024074118A1 (zh) Image processing method, apparatus, device, and storage medium
CN112306976A (zh) Information processing method, apparatus, and electronic device

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23773559

Country of ref document: EP

Kind code of ref document: A1