WO2021170230A1 - Devices and methods for providing images and image capturing based on a text and providing images as a text - Google Patents

Devices and methods for providing images and image capturing based on a text and providing images as a text Download PDF

Info

Publication number
WO2021170230A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
textual description
auxiliary
textual
text
Application number
PCT/EP2020/055030
Other languages
French (fr)
Inventor
Radu Ciprian Bilcu
Vitali Samurov
Hong Zhou
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2020/055030
Publication of WO2021170230A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/76 Television signal recording
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/002 Image coding using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/35 Categorising the entire scene, e.g. birthday party or wedding scene
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/162 User input
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/61 Control of cameras or camera modules based on recognised objects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/63 Control of cameras or camera modules by using electronic viewfinders
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/66 Remote control of cameras or camera parts, e.g. by remote control devices
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N9/00 Details of colour television systems
    • H04N9/79 Processing of colour television signals in connection with recording
    • H04N9/80 Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback
    • H04N9/804 Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback involving pulse code modulation of the colour picture signal components
    • H04N9/8042 Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback involving pulse code modulation of the colour picture signal components involving data reduction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Definitions

  • the present disclosure relates to the field of computer technology, and more particularly to compression, decompression and capturing of images based on a text.
  • a device and a method to determine an image in a textual form are provided for storing images in a compressed form such that the required file size may be considerably decreased. Further, the images may be decompressed in a high-quality form with an embodied device and a method to provide an image from a text.
  • a device to determine an image in a textual form is configured to determine a textual description of the image; receive at least one user input for modifying the textual description of the image; determine a modified textual description of the image based on the textual description of the image and the at least one user input, the modified description of the image having a limited number of characters; and store the modified textual description of the image as a text file, wherein a size of the text file is lower than a file size of the image.
  • the user may act as a programmer such that the user may easily improve the compression process with the user inputs, for example, to enhance accuracy of a used image-to-text algorithm. Further, the user may be able to modify the image by modifying the textual description of the image before storing the image in the textual form.
  • the device is further configured to cause at least one of visual or audio output of the textual description of the image.
  • the user may easily inspect and modify the textual description when storing the image in the compressed form. This enables the user to participate in generating the final textual description and thereby to supplement the operations of the device.
  • the at least one user input comprises at least one instruction to modify at least one portion of the textual description of the image associated with at least one portion of the image and/or at least one parameter of the image. This enables the user to edit only a part of the image, or the whole image, before storing the image in the compressed form.
  • the device is further configured to obtain an auxiliary image, wherein the auxiliary image comprises a simplified representation of the image; and store the auxiliary image in association with the modified textual description of the image.
  • the auxiliary image may be used as a starting point for reconstructing the image based on the modified textual description.
  • the auxiliary image may be used to assist in the reconstruction based on the textual description by, for example, obtaining additional information from the auxiliary image.
  • the additional information may relate to locations, colors, sizes or shapes of objects, among others.
  • a resolution of the auxiliary image is lower than a resolution of the image; and/or a number of pixels of the auxiliary image is smaller than in the image.
  • the auxiliary image and the modified textual description stored together may require less space on a memory than storing the image compressed using conventional compressing techniques.
  • the device is further configured to obtain the auxiliary image based on at least one of object segmentation or color quantization of the image. This enables the auxiliary image to be stored in considerably less space than the image.
  • the device is further configured to obtain at least one second image as a reference for modifying the textual description of the image; convert the second image into a second textual description of the second image; obtain a fused textual description based on merging the second textual description with the textual description, the fused textual description of the images having a limited number of characters; and store the fused textual description of the images as a text file, wherein a size of the text file is lower than a file size of any of the images.
  • the device is further configured to store the fused textual description in association with the auxiliary image.
  • the image may be reconstructed later based on the fused textual description by using the auxiliary image as a starting point or to obtain additional information for the reconstruction.
  • the reconstructed image may be an edited version of the image based on one or more features of the second image.
  • At least one illumination parameter, at least one color parameter, or at least one capture parameter of the second image is different from that of the image.
  • the image may be edited based on the features of the second image when merging the textual descriptions of the images. For example, an HDR image may be created.
  • a device to provide an image from a text is provided.
  • the device is configured to obtain a textual description of at least one image; obtain an auxiliary image associated with the textual description; and reconstruct the image based on the textual description and the auxiliary image.
  • This enables the image to be decompressed such that a realistic, high-quality image is formed.
  • Accuracy of the reconstruction is increased by using the auxiliary image as a starting point for the text-to-image conversion and by utilizing image information in the textual description to fill in details lacking in the auxiliary image, for example.
  • the auxiliary image comprises a compressed version of the image.
  • the textual description may be used to reconstruct the image on basis of the auxiliary image. This may increase accuracy of the reconstructed image.
  • At least one of the following may hold: a resolution of the auxiliary image is lower than a resolution of the image, or a number of pixels of the auxiliary image is smaller than that of the image. Hence, the image may be reconstructed at a higher resolution than the auxiliary image.
  • the auxiliary image comprises an object segmented and/or color quantized version of the image.
  • the textual description of the image may be used to reconstruct the original version of the image based on the highly modified auxiliary image.
  • the highly modified auxiliary image requires considerably less memory space than the original image.
  • a device to provide an image as a text is configured to determine a textual description of the image; obtain an auxiliary image, wherein the auxiliary image comprises a simplified representation of the image; and store the auxiliary image in association with the textual description of the image.
  • This enables images to be compressed into considerably less space than with conventional compression methods. Further, the image may be later decompressed based on the textual description and the auxiliary image.
  • the auxiliary image may be used as a starting point for reconstructing the image based on the modified textual description. Alternatively, the auxiliary image may be used to assist in the reconstruction based on the textual description by, for example, obtaining additional information from the auxiliary image.
  • the additional information may relate to locations, colors, sizes or shapes of objects, among others.
  • a device to capture an image based on a text is configured to receive a textual description of the image; obtain at least one viewfinder image of a camera; compare the textual description of the image with the viewfinder image; and capture a second image with the camera in response to determining that the viewfinder image corresponds to the textual description of the image.
  • the textual description of the image is received based on a textual user input or a verbal user input. This enables the user to initiate image capture simply by writing or verbally describing the scene, and the image capturing device may capture the image as soon as an image matching the description is detected in the viewfinder.
  • the device is configured to convert the viewfinder image into a second textual description; compare the second textual description to the textual description of the image; and capture the second image with the camera in response to determining that the second textual description corresponds to the textual description.
  • the device is configured to reconstruct the image based on the textual description; compare the reconstructed image to the viewfinder image; and capture the at least one second image with the camera in response to determining that the reconstructed image corresponds to the viewfinder image.
  • the device is configured to obtain an auxiliary image, wherein the auxiliary image comprises a simplified representation of the second image in a more compact form; and store the auxiliary image in association with the textual description of the image.
  • the image may be compressed by storing the auxiliary image and the textual description of the image instead of the image.
  • the image may be stored in a considerably smaller space than if the image was compressed using conventional compressing techniques. Further, the image may be later reconstructed accurately.
  • the device is further configured to delete the second image.
  • only the auxiliary image and the text description are saved, and a considerable amount of memory space may be saved for storing the image.
  • a method for determining an image in a textual form comprises determining a textual description of the image; receiving at least one user input for modifying the textual description of the image; determining a modified textual description of the image based on the textual description of the image and the at least one user input, the modified description of the image having a limited number of characters; and storing the modified textual description of the image as a text file, wherein a size of the text file is lower than a file size of the image.
  • the method is executed in the device of the first aspect.
  • a method for providing an image from a text comprises obtaining a textual description of at least one image; obtaining an auxiliary image associated with the textual description; and reconstructing the image based on the textual description and the auxiliary image. This enables the image to be decompressed by reconstructing it accurately and in high quality based on the auxiliary image and the textual description of the image.
  • the method is executed in the device of the second aspect.
  • a method to provide an image as a text comprises determining a textual description of the image; obtaining an auxiliary image, wherein the auxiliary image comprises a simplified representation of the image; and storing the auxiliary image in association with the textual description of the image.
  • This enables images to be compressed into considerably less space than with conventional compression methods. Further, the image may be later decompressed based on the textual description and the auxiliary image.
  • the auxiliary image may be used as a starting point for reconstructing the image based on the modified textual description. Alternatively, the auxiliary image may be used to assist in the reconstruction based on the textual description by, for example, obtaining additional information from the auxiliary image.
  • the additional information may relate to locations, colors, sizes or shapes of objects, among others.
  • the method is executed in the device of the third aspect.
  • a method for image capturing based on a text comprises receiving a textual description of an image; obtaining at least one viewfinder image of a camera; comparing the textual description with the viewfinder image; and capturing a second image with the camera in response to determining that the viewfinder image corresponds to the textual description of the image.
  • the method is executed in the device of the fourth aspect.
  • a computer program comprising a program code configured to cause performance of the method according to any of the fifth, sixth, seventh or eighth aspect, when the computer program is executed on a computer.
  • a computer program product comprising a computer readable storage medium storing program code thereon, the program code comprising instructions for executing the method according to any of the fifth, sixth, seventh or eighth aspect.
  • FIG. 1 illustrates a schematic representation of a block diagram of an image compression system according to an embodiment.
  • FIG. 2 illustrates a schematic representation of a block diagram of an apparatus configured to perform a functionality according to an embodiment.
  • FIG. 3 illustrates a schematic representation of a flowchart of a method for determining an image in a textual form for storing the image in a compact form according to an embodiment.
  • FIG. 4 illustrates a schematic representation of a flowchart of a method for providing an image from a text by reconstructing the image based on a textual description according to an embodiment.
  • FIG. 5 illustrates a schematic representation of a flowchart of a method for image capturing based on text and comparison of the text to a viewfinder image according to an embodiment.
  • FIG. 6 illustrates a schematic representation of providing an image in a textual form using neural networks according to an example embodiment.
  • FIG. 7 illustrates a schematic representation of determining an image in a textual form and further reconstructing the image based on the text according to an embodiment.
  • FIG. 8 illustrates a schematic representation of capturing an image based on a text and comparison of the text with viewfinder images according to an example embodiment.
  • a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa.
  • a corresponding device may include a unit or other means to perform the described method step, even if such unit is not explicitly described or illustrated in the figures.
  • a corresponding method may include a step performing the described functionality, even if such step is not explicitly described or illustrated in the figures.
  • An embodiment provides an apparatus and a method for achieving a very high compression rate for an image without actually compressing the image data, by using, for example, image-to-text conversion technologies.
  • Text-to-image and image-to-text algorithms may be used in a new way to compress high-resolution images as a text describing the image. This may considerably reduce the required memory space.
  • a user may be able to edit the image by modifying the text describing the image.
  • a solution may enable creating and storing images in the compressed form without using a camera, for example, based on a textual or a verbal description of a visualized image.
  • image compression may be performed by converting the image into a text description of the image with a device.
  • a space-saving compressed text file may be saved instead of a compressed image file.
  • the image is either saved as a single text file using an image-to-text function, or a low-resolution version of the image is saved together with the text file to increase the accuracy of the decompressed image.
  • the accuracy of the decompressed image may be also increased by enabling a user to modify the text description.
  • an image may be reconstructed based on an auxiliary image and a textual description of the image.
  • the image may have been saved in considerably less space compared to conventional image compression methods, while the auxiliary image and the textual description of the image may enable reconstructing the image such that the quality of the decompressed image substantially matches the quality of the image before compression.
  • an image may be captured by a device based on a textual description of the image.
  • the textual description may be compared to viewfinder images of a camera, and the image is captured by the device when the viewfinder image corresponds to the image description.
  • the device may use the captured image, for example, to obtain a low-resolution version of the image as an auxiliary image which may be used for image compression by storing the auxiliary image in association with the image description, similarly as described in the above section.
  • the device and a method for image capturing based on a text may enable capturing images without user input as soon as the viewfinder image matches a scene described by a user.
  • FIG. 1 illustrates a schematic representation of a block diagram of a system 100 for compressing and decompressing images according to an aspect.
  • the system 100 comprises a device 300 to determine an image in a textual form coupled with a device 400 to provide an image from a text.
  • the devices 300, 400 may operate independently from each other.
  • the devices 300 and 400 may be coupled via a memory 101.
  • the devices 300, 400 may be operationally coupled via a network 102.
  • the image decompression device 400 may retrieve data stored on the image compression device 300 over the network 102.
  • the image compression device 300 may store data on a remote memory storage, such as in a cloud, and the image decompression device 400 may retrieve the data from the cloud over the network 102.
  • the network 102 may comprise a wireless local area network (WLAN) connection, a short-range wireless network connection such as for example a Bluetooth, NFC (near-field communication), or RFID connection, a local wired connection such as for example a local area network (LAN) connection or a universal serial bus (USB) connection, or a wired Internet connection.
  • the system 100 may further comprise a device 500 configured for capturing images. The device 500 may assist the device 300 by providing the images for compression.
  • the device 300, the device 500 and the device 400 may be embodied as separate devices, or one or more of the devices 300, 400, 500 may be comprised in a single device, for example as dedicated software and/or hardware components.
  • the apparatus 200 may comprise a computing device such as for example a mobile phone, a tablet computer, a laptop, or the like. Although the apparatus 200 is illustrated as a single device it is appreciated that, wherever applicable, functions of the apparatus 200 may be distributed to a plurality of devices, for example to implement example embodiments as a cloud computing service.
  • the apparatus 200 may be the device 300 to determine an image in a textual form, the device 400 to provide an image from a text, the device 500 to capture an image based on a text, or a combination thereof.
  • the apparatus 200 may comprise at least one processor 201 and a memory 202.
  • the memory 202 may comprise a program code 203 which, when executed on the processor 201, causes the apparatus 200 to perform embodiments of the operations and functionality described.
  • the at least one processor 201 may comprise one or more of various processing devices, such as for example a co-processor, a microprocessor, a controller, a digital signal processor (DSP), a processing circuitry with or without an accompanying DSP, or various other processing devices.
  • the functionality described herein can be performed, at least in part, by one or more hardware logic components.
  • illustrative types of hardware logic components include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), and Graphics Processing Units (GPUs).
  • the apparatus 200 may comprise a user interface 205 comprising an input device and/or an output device.
  • the input device may be any device configured to receive user inputs, such as a keyboard, a touch screen, or a microphone.
  • the output device may for example comprise a display and/or a speaker.
  • the apparatus may further comprise a camera 204.
  • the apparatus 200 may be the device 500 for image capture based on a text and the camera 204 may be configured to capture the images.
  • the apparatus 200 may be the device 300 to determine an image in a textual form, and the camera 204 may be used by the device 300 to obtain the image.
  • the apparatus 200 may comprise all or only some of the components or devices 201, 202, 203, 204, 205, depending on the functionality of the apparatus 200.
  • the memory 202 may be any medium, including non-transitory storage media, on which the program code 203 is stored such as a Blu-Ray disc, DVD, CD, USB (flash) drive, hard disc, server storage available via a network, a ROM, a PROM, an EPROM, an EEPROM or a Flash memory having electronically readable control signals stored thereon which cooperate or are capable of cooperating with a programmable computer system such that an embodiment of at least one of the methods described is performed.
  • the functionality described herein can be performed, at least in part, by one or more computer program product components such as software components.
  • An embodiment comprises or is a computer program comprising program code for performing any of the methods described herein, when executed on a computer.
  • Another example comprises or is a computer readable medium comprising a program code that, when executed by the processor, causes a computer system to perform any of the methods described herein.
  • the program code may comprise instructions which, when executed, cause the processor, computer or the like, to perform at least one of the methods described herein.
  • the apparatus 200 is configured for performing at least one method described herein.
  • this includes the at least one processor and the at least one memory including program code configured to, when executed by the at least one processor, cause the apparatus to perform the method.
  • the apparatus 200 may be embodied as a separate device, or one or more of apparatuses 200 may be comprised in a single device, for example as dedicated software and/or hardware components.
  • FIG. 3 illustrates a schematic representation of a flowchart of a method for determining an image in a textual form according to an aspect.
  • the method may enable storing the image in a compressed form as a text.
  • the device 300 may be configured to execute the method.
  • An image may be obtained for storage.
  • the image may be obtained by capturing a scene with a camera coupled with the device 300.
  • the image may be obtained by retrieving the image from a memory or over a network.
  • the image may also be obtained by other means, such as by receiving from a user a verbal description of the image the user visualizes.
  • the camera may be comprised in the device 300, or it may be a camera of a device 500 for image capturing coupled with the device 300.
  • the image may be converted into a textual description of the image. At 301, a textual description of the image may be determined. The conversion to the textual form may be implemented using an image-to-text algorithm.
  • a model combining machine vision and machine translation may be used to generate natural sentences describing the image.
  • the model may be trained with images and their respective descriptions to increase the accuracy of the generated textual description.
  • the objects and the scene of the image may be detected using, for example, deep convolutional neural networks (CNN).
  • the textual description based on the detected objects and scene may be formed using, for example, recurrent neural network (RNN).
  • the CNN may encode the image into a compact representation, followed by the RNN, which may generate a corresponding description of the image.
  • CNN 601 may be configured to take image data 600 as an input.
  • CNN 601 may be further configured to output objects and a scene in the image, as well as how the objects relate to each other and attributes associated with the objects.
  • CNN 601 may transform the image data 600 into a rich representation by embedding the image data into a fixed-length vector, which may be provided as an input to the decoder RNN 602, which in turn generates the output description 603.
  • the image-to-text algorithm may be trained using stochastic gradient descent.
  • the CNN 601 may be pre-trained for image classification on large corpora from a variety of different fields to increase the accuracy and quality of the generated image descriptions.
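  • The encoder-decoder arrangement described above may be sketched, for example, as follows; this is a minimal illustration assuming PyTorch, with the ResNet-18 backbone, layer sizes, and names chosen for the example rather than prescribed by the embodiment:

```python
# Minimal sketch of a CNN encoder / RNN (LSTM) decoder captioning model.
# Vocabulary handling and the training loop (e.g. stochastic gradient descent)
# are omitted; dimensions and the backbone are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # CNN encoder: embed the image into a fixed-length vector
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.encoder = backbone
        # RNN (LSTM) decoder: generate the description word by word
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        img_vec = self.encoder(images).unsqueeze(1)   # (B, 1, E): image fed once
        word_vecs = self.embed(captions)              # (B, T, E)
        inputs = torch.cat([img_vec, word_vecs], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc_out(hidden)                    # per-step word scores
```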
  • the generated textual description 603 may be provided for the user via a user interface 205.
  • the textual description of the image 603 may be modified.
  • the modified textual description of the image 604 may again be provided to the user via the user interface 205 so that the user may determine whether he is satisfied with the image description.
  • the device 300 may request an approval of the user via the user interface 205, and the modified textual description 604 may be stored on a memory 202 in response to a user input indicating the approval of the user.
  • the modified textual description of the image 604 may be automatically stored on the memory 202.
  • the modified textual description of the image 604 may be retrieved from the memory 202 for constructing the image 606 based on the modified textual description 604.
  • a non-linear function, such as a Long Short-Term Memory (LSTM) network, may be used by the algorithm to predict each word of the description.
  • the LSTM may comprise a memory cell encoding knowledge at every time step of what inputs have been observed up to this step.
  • the behavior of the cell may be controlled by gate layers which are applied multiplicatively.
  • the gate layers may either keep a value from the gated layer if the gate is 1 or zero the value if the gate is 0.
  • Three gates may be used to control whether to forget the current memory cell value, whether the memory cell should read its input, and whether to output the new memory cell value.
  • the multiplicative gates may enable training the LSTM robustly, as the gates may deal well with exploding and vanishing gradients. When training, a copy of the LSTM memory may be created for the image and for each word such that all LSTMs share the same parameters, and the output of the LSTM of a previous time instance is fed to the LSTM at the next time instance.
  • recurrent connections may be transformed to feed-forward connections.
  • the image and the words may be mapped to the same space, the image by using a vision CNN, and the words by using word embedding.
  • Each image may be input only once to inform the LSTM about contents of the image and to avoid noise caused by additional inputs of the same image.
  • the description may be generated by sampling: the first word is sampled, its corresponding embedding is provided as an input, the outcome is sampled, and sampling continues until a predetermined number of characters is met.
  • a beam search decoder may be used, where a set of best sentences may be iteratively considered as candidates for the final description, and only the resulting best sentences may be kept to form the description.
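  • As a sketch of the sampling-based generation with a character budget described above (assuming the CaptionModel from the previous sketch and an illustrative idx2word vocabulary mapping):

```python
# Greedy sampling until an end token or a predetermined character limit is met.
# start_idx/end_idx/idx2word are illustrative assumptions; a beam search
# decoder could be substituted for the greedy argmax below.
import torch

def generate_description(model, image, idx2word, start_idx, end_idx, max_chars=200):
    model.eval()
    with torch.no_grad():
        img_vec = model.encoder(image.unsqueeze(0)).unsqueeze(1)
        _, state = model.lstm(img_vec)                # image informs the LSTM once
        token = torch.tensor([[start_idx]])
        words = []
        while True:
            out, state = model.lstm(model.embed(token), state)
            token = model.fc_out(out).argmax(dim=-1)  # greedy word choice
            idx = int(token)
            if idx == end_idx:
                break
            word = idx2word[idx]
            if len(" ".join(words + [word])) > max_chars:
                break                                 # respect the character limit
            words.append(word)
    return " ".join(words)
```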
  • At 302, at least one user input for modifying the textual description of the image is received.
  • the textual description may be displayed as a visual output to the user via a user interface of the device 300.
  • the textual description may be provided to the user as an audio output.
  • the device 300 may receive the one or more user inputs for modifying the textual description of the image via the user interface.
  • the user inputs may be, for example, textual or verbal user inputs.
  • the user may modify the text by typing additional text in the textual description and/or deleting one or more parts of the textual description.
  • the modification of the textual description in response to the user inputs may be related to a single object of the image, to many objects of the image, or to the whole image area.
  • the device 300 may have converted the image into a textual description, which says: “a forest landscape with a red ball in the foreground”.
  • the device 300 may then display or play back the textual description to a user via the user interface as a visual or an audio output. For instance, if the conversion failed to include the tree type in the image, the user may then edit the textual description with speech, saying: “the forest comprises birches”.
  • a modified textual description of the image is determined based on the textual description of the image and the at least one user input.
  • the verbal input may be first converted into text using a speech-to-text model, and the text may be fused with the textual description.
  • the modified textual description may be, for example: “a birch forest landscape with a red ball in the foreground”. Hence, accuracy of the image reconstruction may be increased.
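  • For illustration, a minimal sketch of handling such a verbal correction; speech_to_text is a hypothetical stand-in for any speech-to-text model, and the real fusion would merge the correction semantically (e.g. into “a birch forest landscape …”), whereas here it is simply appended:

```python
# Append a spoken correction to the generated description.
def apply_verbal_correction(description: str, audio, speech_to_text) -> str:
    correction = speech_to_text(audio)       # e.g. "the forest comprises birches"
    return description.rstrip(". ") + ". " + correction
```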
  • the user may modify the textual description in order to edit the image.
  • the user input may comprise at least one instruction to modify at least one portion of the textual description of the image associated with at least one portion of the image and/or at least one parameter of the image.
  • the user may remove the red ball from the image or change brightness of the image by modifying the textual description.
  • the modifications performed by the user may be perceived in the image when the image is decompressed using a text-to-image conversion.
  • the modified textual description of the image may have a limited number of characters.
  • the modified textual description of the image is stored as a text file, wherein a size of the text file is lower than a file size of the image.
  • the image may be configured to be decompressed based on the modified textual description. Hence, the image may be stored in a reduced size and the compression may be reversible.
  • the device 300 executing the method may achieve a considerably higher compression rate than conventional compression methods.
  • the user may be able to ensure accuracy of the textual description of the image by modifying the textual description.
  • the user may edit the image by modifying the textual description, and the image may be reconstructed in the edited form when decompressed.
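  • A minimal sketch of the storage step, assuming plain files on disk (paths and names are illustrative):

```python
# Save the modified textual description as a text file and verify that it is
# smaller than the original image file.
import os

def store_as_text(image_path: str, modified_description: str, text_path: str) -> bool:
    with open(text_path, "w", encoding="utf-8") as f:
        f.write(modified_description)
    # the text file is expected to be far smaller than the image file
    return os.path.getsize(text_path) < os.path.getsize(image_path)
```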
  • the method may further comprise obtaining an auxiliary image, wherein the auxiliary image comprises a simplified representation of the image converted into the textual description; and storing the auxiliary image in association with the modified textual description of the image.
  • the auxiliary image may be used with the modified textual description by a device 400 to reconstruct the image when the image is being decompressed.
  • the device 300 may be configured to determine a textual description of the image, obtain an auxiliary image, wherein the auxiliary image comprises a simplified representation of the image, and store the textual description of the image in association with the auxiliary image.
  • the auxiliary image may be a low-quality version of the image with a lower resolution. In an embodiment, a number of pixels of the auxiliary image is smaller than in the image.
  • the auxiliary image may be a highly modified version of the image which can be efficiently compressed, for example, using JPEG.
  • the auxiliary image stored with the modified textual description may take approximately half of the required space on a memory compared to compressing the original image. The required space may depend on the modification method of the auxiliary image.
  • the auxiliary image may be obtained based on at least one of object segmentation or color quantization of the image.
  • a very low-resolution image may be obtained from the original image.
  • the image can be resized to a VGA resolution.
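  • One possible sketch of obtaining such an auxiliary image, assuming Pillow; the VGA size, 16-color quantization, and JPEG quality are illustrative parameters, not values mandated by the embodiment:

```python
# Produce a low-resolution, color-quantized auxiliary image and store it
# alongside the textual description.
from PIL import Image

def make_auxiliary_image(image_path: str, description: str, out_stem: str = "compressed"):
    img = Image.open(image_path)
    aux = img.resize((640, 480))                   # crude VGA-sized version
    aux = aux.quantize(colors=16).convert("RGB")   # simple color quantization
    aux.save(f"{out_stem}_aux.jpg", quality=60)    # large flat areas compress well
    with open(f"{out_stem}.txt", "w", encoding="utf-8") as f:
        f.write(description)                       # textual description of the image
```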
  • the method may further comprise obtaining at least one second image as a reference for modifying the textual description of the image; converting the second image into a second textual description of the second image; obtaining a fused textual description based on merging the second textual description with the textual description, the fused textual description of the images having a limited number of characters; and storing the fused textual description of the images as a text file, wherein a size of the text file is lower than a file size of any of the images.
  • the image may be configured to be decompressed based on the fused textual description.
  • the fused textual description may be stored in association with the auxiliary image.
  • the second image may be obtained by capturing with the camera.
  • the second image may be retrieved by the image compression device 300 from a memory or over a network.
  • the second image may or may not be related to the image, e.g. the second image may portray the same scene as the image but at a different time of the day, or it may portray a completely different scene.
  • At least one illumination parameter, at least one color parameter, or at least one capture parameter of the second image is different from that of the image.
  • the textual description of the second image may comprise references for reconstruction of the image in an edited form.
  • Fusing the images may include generating a fused image by selecting portions (e.g. regions, pixels) either from the image or the second image.
  • the style of the second image may be transferred to the image to change the visual appearance of the image by fusing the images. Changing the visual appearance may be implemented, for example, by changing the illumination of the image to match a user preference.
  • the second image may be a demonstration of the user preference.
  • the user may dictate objects of the second image having a different color, a different illumination, or different capture parameters such as exposure as the references, and fuse the desired parameters with the original image to get an HDR (high dynamic range) image.
  • This may be implemented, for example, by obtaining a textual description of the second image, combining the textual description of the second image with the textual description of the image, and using a text-to-image algorithm to generate a modified image based on the combined textual description.
  • the text-to-image algorithm may utilize domain information of the reference, while preserving the content information from the image.
  • the content from the image may be preserved (such as scene, objects, details), while a different style from the second image may be taken from the reference (different color scheme, illumination, etc.) for modification of the image.
  • the user may modify the fused textual description by deleting other parts related to the second image, leaving only the illumination instructions for reconstruction purposes.
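  • As a simple illustration of merging two descriptions under a character budget (the keyword-based selection of style or illumination fragments is an assumption made for the sketch, not the fusion method of the embodiment):

```python
# Keep the content description and append style-related fragments (e.g.
# illumination instructions) taken from the second image's description.
def fuse_descriptions(content_desc: str, reference_desc: str,
                      style_keywords=("illumination", "exposure", "color"),
                      max_chars=500) -> str:
    style_parts = [s.strip() for s in reference_desc.split(".")
                   if any(k in s.lower() for k in style_keywords)]
    fused = content_desc.rstrip(". ") + ". " + ". ".join(style_parts)
    return fused[:max_chars]                       # limited number of characters
```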
  • Any of operations 301-304 for determining the image in a textual form for compression may be performed in response to a user input. Alternatively, the operations 301-304 may be performed automatically, for example, in response to obtaining the image.
  • FIG. 4 illustrates a schematic representation of a flowchart of a method to provide an image from a text according to an aspect.
  • the method enables reconstructing an image that has been compressed as a textual description of the image.
  • the device 400 may be configured to execute the method.
  • a textual description of at least one image is obtained.
  • the textual description of the image may be retrieved from a memory or over a network.
  • the textual description may be retrieved in response to a user input.
  • the user may be able to modify the obtained textual description before the image is reconstructed.
  • the device 400 may display the textual description to the user and receive user inputs to add and/or remove one or more parts of the textual description.
  • modification of the textual description may be related to a single object of the image, to many objects of the image, or to the whole image area.
  • the user inputs may be received in a text or a speech format. Hence, the user may be able to improve the reconstruction quality. Further, the user may be able to edit the image before reconstruction.
  • an auxiliary image associated with the textual description is obtained.
  • the auxiliary image may be a low-quality version of the image.
  • a resolution of the auxiliary image may be lower than a resolution of the image.
  • the auxiliary image may be obtained from a memory or over a network.
  • the image is reconstructed based on the textual description and the auxiliary image.
  • the auxiliary image may be used as a drawing canvas, and the textual description may be used to reconstruct the image on the auxiliary image. This may enable the compressed image to be decompressed at a higher resolution based on the textual description of the image and the significantly lower-quality version of the image.
  • a high-resolution realistic image of the scene may be generated based on the saved textual description.
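  • A minimal sketch of the decompression entry point; reconstruct_with_model stands in for any image-conditioned text-to-image model and is an assumed interface:

```python
# Load the stored textual description and auxiliary image, then hand both to a
# text-to-image model that uses the auxiliary image as a drawing canvas.
from PIL import Image

def decompress(text_path: str, aux_path: str, reconstruct_with_model):
    with open(text_path, encoding="utf-8") as f:
        description = f.read()
    aux = Image.open(aux_path)                 # low-quality starting point
    # the model fills in the details described in the text on top of the canvas
    return reconstruct_with_model(description, aux)
```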
  • the reconstruction may be performed using a text-to-image algorithm.
  • the text-to-image algorithm may be trained by randomly sampling sentences and pairing them with images. During conversion, the algorithm may provide the textual description with a random noise vector as an input for producing the image.
  • the algorithm may be trained with a dataset of paired samples of sources (descriptions) and targets (images). During training, samples may be taken also from the target distribution as references. When training on the paired data, the reference may be a ground-truth target.
  • a reference encoder may compress domain information into a fixed-length vector for a reference embedding.
  • the embedding may be used as a query vector in a token layer, which may comprise a bank of token embeddings and an attention layer, where the token embeddings are randomly initialized.
  • the tokens may represent a variety of domain information in the training data.
  • a universal attention module may be further used to encode diverse domain information from the target distribution into a latent space.
  • the attention layer is used to learn the similarity between the reference embedding and each of the tokens, which may produce a set of weights representing the contribution of each token.
  • a domain embedding may comprise a weighted sum of the token embeddings, which may be used as an encoded latent code for image generation.
  • the bank of token embeddings may be shared across all training sequences. The presented process may allow the algorithm to learn a highly structured latent space.
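  • For illustration, a hedged sketch of such a token layer (assuming PyTorch; the number of tokens and the dimensions are arbitrary choices):

```python
# Attention over a bank of randomly initialized token embeddings: the reference
# embedding acts as the query, and the weighted sum is the domain embedding.
import torch
import torch.nn as nn

class TokenLayer(nn.Module):
    def __init__(self, num_tokens=10, dim=128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))  # token bank
        self.scale = dim ** -0.5

    def forward(self, reference_embedding):
        # reference_embedding: (B, dim) fixed-length vector from the reference encoder
        weights = torch.softmax(reference_embedding @ self.tokens.t() * self.scale, dim=-1)
        return weights @ self.tokens               # domain embedding, (B, dim)
```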
  • FIG. 5 illustrates a schematic representation of a flowchart of a method for image capturing according to an embodiment.
  • the device 500 may be configured to execute the method.
  • a textual description of an image is received.
  • the textual description of the image may be received, for example, from a memory or over a network.
  • the textual description of the image may be received based on a textual user input or a verbal user input. That is, the user may describe, by text or voice, the scene he visualizes or is looking at.
  • At 502, at least one viewfinder image of a camera is obtained.
  • the textual description is compared with the viewfinder image.
  • a second image is captured with the camera in response to determining that the viewfinder image corresponds to the textual description of the image.
  • FIG. 8 illustrates an example of an operation of the device 500 for image capturing based on a text.
  • a user may have provided a textual description 801 of the image he wants to capture. For example, when the user starts the camera and points the camera to a scene, the similarity between the scene, captured in the viewfinder of the camera, and the earlier stored textual description 801 of the image may be calculated. The user may move the camera along a landscape such that different viewfinder images A, B, C are captured in the viewfinder.
  • Similarity of the images A, B, C, 800 may be determined, at 802, by comparing the similarity in the image domain, or by converting the viewfinder images A, B, C into a second textual description and comparing the second textual description, for example by analyzing whether similar words describing similar objects exist in the textual descriptions. If the viewfinder scene is substantially similar to the one described earlier by the user, an image is captured in response to a determination performed by the device 500 at 803. For example, the determined similarity of the textual description of the image 801 and the viewfinder image A may be 20%, 98% for the viewfinder image B, and 50% for the viewfinder image C. Based on the comparisons, the device 500 may capture, at 804, the image corresponding to the viewfinder image B.
  • the device 500 may capture the image, for example, when a predetermined similarity threshold is met. If the threshold is not met at 803 after the comparison, the operation may return to 802. Hence, the device 500 may enable capturing images automatically based on an image description received from a user.
  • the viewfinder image is converted into a second textual description; the second textual description is compared to the textual description of the image; and the second image is captured with the camera in response to determining that the second textual description corresponds to the textual description.
  • the camera may be automated to capture the second image when the textual descriptions of the images comprise a preset degree of similarity.
  • the device 500 may be configured to determine when the level of similarity is sufficient for the correspondence, for example when the textual descriptions comprise at least 90 percent matching words or sentences.
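  • A minimal sketch of such a text-based correspondence check and automatic capture; camera and describe_image are assumed interfaces, and the 0.9 threshold corresponds to the illustrative 90-percent criterion mentioned above:

```python
# Caption each viewfinder frame, compare word overlap against the user's
# description, and capture once the similarity threshold is met.
def word_overlap(a: str, b: str) -> float:
    a_words, b_words = set(a.lower().split()), set(b.lower().split())
    return len(a_words & b_words) / max(len(a_words), 1)

def capture_when_matching(camera, user_description, describe_image, threshold=0.9):
    while True:
        frame = camera.viewfinder_frame()          # assumed camera API
        frame_text = describe_image(frame)         # image-to-text conversion
        if word_overlap(user_description, frame_text) >= threshold:
            return camera.capture()                # the second image
```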
  • the method may further comprise reconstructing the image based on the textual description; comparing the reconstructed image to the viewfinder image; and capturing the at least one second image with the camera in response to determining that the reconstructed image corresponds to the viewfinder image.
  • the device 500 may analyze the contents of the images, such as colors, shapes, textures, or any other image content information, by using an image distance measure. That is, dimensions of the contents of the images may be measured and compared as distance measures. For example, a color histogram may be computed for the images, identifying the proportion of pixels within the image holding specific values. As another example, visual patterns in the images and how they are spatially defined may be computed and compared.
  • the device 500 may apply segmentation or edge detection, use shape filters, or other shape descriptor method for comparing shapes in the images. If the determined distance measure is approximately zero, the second image may be captured.
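  • As an illustration of an image-domain distance measure of the kind described above (NumPy only; the bin count and the L1 distance are arbitrary choices for the sketch):

```python
# Per-channel color histograms compared with an L1 distance; a value near zero
# indicates that the viewfinder image matches the reconstructed image.
import numpy as np

def color_histogram(img: np.ndarray, bins: int = 8) -> np.ndarray:
    # img: HxWx3 uint8 array; concatenated, normalized per-channel histograms
    hists = [np.histogram(img[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def histogram_distance(img_a: np.ndarray, img_b: np.ndarray) -> float:
    return float(np.abs(color_histogram(img_a) - color_histogram(img_b)).sum())
```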
  • the method may further comprise obtaining an auxiliary image, wherein the auxiliary image comprises a compressed version of the second image; and storing the auxiliary image in association with the textual description of the image.
  • the device 500 may be configured to delete the second image. Therefore, memory space may be saved. This is because the second image becomes unnecessary since the auxiliary image and the textual description may be sufficient for reconstructing the image as well as the corresponding second image.
  • the auxiliary image stored with the textual description may be used for image compression and decompression.
  • the auxiliary image may be a simplified representation of the second image.
  • the auxiliary image may be a highly modified or a lower resolution version of the second image which may be efficiently compressed.
  • the second image may be captured based on a textual description of an image of a scene visualized by the user.
  • the second image matching the scene visualized by the user may be used by the device 500 to obtain the auxiliary image for compression by the device 300 and/or decompression by the device 400. Therefore, when the user wants to view the image he visualized, it may be reconstructed with increased accuracy using the auxiliary image, which is based on an actual image.
  • FIG. 7 illustrates a schematic representation of determining an image in a textual form and converting it back to the image form from the text, according to an embodiment.
  • the operations may be performed by the devices 300, 400 configured to implement the respective methods described in conjunction with FIGS. 3 and 4.
  • an image 700 is obtained by the device 300.
  • the image 700 may be a high-resolution image.
  • the high-resolution may be 300 dpi (dots per inch) or more.
  • the device 300 may convert the image 700 to a textual description 701 of the image 700.
  • the textual description 701 in FIG. 7 illustrates only one possible description of the image 700, and it may comprise more information, such as additional details of objects and scene, color codes, illumination, and the like.
  • the device 300 may display the textual description 701 for a user. The user may then modify it, for example, by typing in additional information. In the embodiment of FIG. 7, the user may have added the underlined section (“on a table”) to ensure that the objects' locations are reconstructed correctly when the image is decompressed.
  • the compression device determines a modified textual description 703.
  • the device 300 may further obtain an auxiliary image 702.
  • the auxiliary image 702 may be a low-resolution version of the image 700.
  • the low-resolution may be, for example, 72 dpi.
  • the auxiliary image 702 may be obtained from the original image 700 by flattening, as illustrated in FIG. 7. When flattened, objects look like flat areas.
  • the image 700 may be flattened using, for example, object segmentation and color quantization. This highly modified auxiliary image 702 may be more efficiently compressed using JPEG since it contains large contiguous flat areas.
  • the image 700 may be compressed to approximately half of the originally required space.
  • the image 700 illustrated in FIG. 7 may take 1.6MB on a disk, and the auxiliary image 702 may take only 0.8MB.
  • the textual description 701 may take only 161 bytes, even when in uncompressed form.
  • the device 400 may retrieve the auxiliary image 702 and the modified textual description 703 from the memory.
  • the auxiliary image 702 may be used as a drawing canvas where the image 700 is reconstructed from the modified textual description 703.
  • the modified textual description 703 may make references to the auxiliary image 702, such as: “the grey toy mouse is situated on the black area of the auxiliary image”.
  • boundaries of the desired areas for particular objects may be textually described in the modified textual description 703. For example: “a glasses box is having a center at location x, y and a diameter r”.
  • using the reference, e.g. the grey toy mouse on the black area, together with the auxiliary image 702 may help the text-to-image algorithm to more reliably reconstruct the right kind of toy mouse, for example, based on the shape on the black area.
  • the referenced shape on the black area may be further used to help the decompression device to place the object (toy mouse) in the right place in the image 700.
  • Although the exemplary textual descriptions are presented as fluent sentences, it may be sufficient that the textual descriptions comprise the image information in any text form, such as a list of properties, objects, attributes, and the like.

Abstract

According to aspects, there are provided a device for determining an image in a textual form, a device for providing an image from a text and a device for capturing an image based on a text. The device for determining an image in a textual form is configured to determine a textual description of the image; receive at least one user input for modifying the textual description of the image; determine a modified textual description of the image based on the textual description of the image and the at least one user input, the modified description of the image having a limited number of characters; and store the modified textual description of the image as a text file, wherein a size of the text file is lower than a file size of the image.

Description

DEVICES AND METHODS FOR PROVIDING IMAGES AND IMAGE CAPTURING BASED ON A TEXT AND PROVIDING IMAGES AS A TEXT
TECHNICAL FIELD
The present disclosure relates to the field of computer technology, and more particularly to compression, decompression and capturing of images based on a text.
BACKGROUND
Saving high-resolution images occupies a lot of memory. Especially with the development and widespread use of smartphones equipped with cameras, the need for memory space for images has increased rapidly. Typical image compression techniques reduce the size of the image file without degrading its quality significantly. However, if a conventional decompression method is used at a high compression ratio, it may not be possible to restore the image without degrading its quality.
SUMMARY
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
It is an object to provide a solution for image compression and decompression using image-to-text and text-to-image algorithms. A device and a method to determine an image in a textual form are provided for storing images in a compressed form such that the required file size may be considerably decreased. Further, the images may be decompressed in a high-quality form with an embodiment of the device and the method to provide an image from a text.
This objective is achieved by the features of the independent claims. Further embodiments and examples are apparent from the dependent claims, the description and the figures.
According to a first aspect, there is provided a device to determine an image in a textual form. The device is configured to determine a textual description of the image; receive at least one user input for modifying the textual description of the image; determine a modified textual description of the image based on the textual description of the image and the at least one user input, the modified description of the image having a limited number of characters; and store the modified textual description of the image as a text file, wherein a size of the text file is lower than a file size of the image. This enables images to be compressed into considerably less space than with conventional compression methods. The user may be able to improve the quality and accuracy of the determined textual description. The user may act as a programmer, easily improving the compression process with the user inputs, for example to enhance the accuracy of the image-to-text algorithm used. Further, the user may be able to modify the image by modifying the textual description of the image before storing the image in the textual form.
In an implementation form of the first aspect, the device is further configured to cause at least one of visual or audio output of the textual description of the image. Hence, the user may easily inspect and modify the textual description when storing the image in the compressed form. This enables the user to participate in generating the final textual description and thereby supplement the operations of the device.
In an implementation form of the first aspect, the at least one user input comprises at least one instruction to modify at least one portion of the textual description of the image associated with at least one portion of the image and/or at least one parameter of the image. This enables the user to edit only a part of the image, or the whole image, before storing the image in the compressed form.
In an implementation form of the first aspect, the device is further configured to obtain an auxiliary image, wherein the auxiliary image comprises a simplified representation of the image; and store the auxiliary image in association with the modified textual description of the image. This enables the image to be decompressed later based on the modified textual description and the auxiliary image. The auxiliary image may be used as a starting point for reconstructing the image based on the modified textual description. Alternatively, the auxiliary image may be used to assist in the reconstruction based on the textual description by, for example, obtaining additional information from the auxiliary image. The additional information may relate to locations, colors, sizes or shapes of objects, among others.
In an implementation form of the first aspect, a resolution of the auxiliary image is lower than a resolution of the image; and/or a number of pixels of the auxiliary image is smaller than in the image. Hence, the auxiliary image and the modified textual description stored together may require less space in memory than storing the image compressed using conventional compression techniques.
In an implementation form of the first aspect, the device is further configured to obtain the auxiliary image based on at least one of object segmentation or color quantization of the image. This enables the auxiliary image to be stored in considerably less space than the image.
In an implementation form of the first aspect, the device is further configured to obtain at least one second image for a reference for modifying the textual description of the image; convert the second image into a second textual description of the second image; obtain a fused textual description based on merging the second textual description with the textual description, the fused textual description of the images having a limited number of characters; and store the fused textual description of the images as a text file, wherein a size of the text file is lower than a file size of any of the images. This enables the image to be modified, upon compression, based on features of a second image.
In an implementation form of the first aspect, the device is further configured to store the fused textual description in association with the auxiliary image. Hence, the image may be reconstructed later based on the fused textual description by using the auxiliary image as a starting point or to obtain additional information for the reconstruction. This further enables the reconstructed image to be an edited version of the image based on one or more features of the second image.
In an implementation form of the first aspect, at least one illumination parameter, at least one color parameter, or at least one capture parameter of the second image is different from the image. Hence, the image may be edited based on the features of the second image when merging the textual descriptions of the images. For example, an HDR image may be created.
According to a second aspect, a device to provide an image from a text is provided. The device is configured to obtain a textual description of at least one image; obtain an auxiliary image associated with the textual description; and reconstruct the image based on the textual description and the auxiliary image. This enables the image to be decompressed such that a realistic high-quality image is formed. Accuracy of the reconstruction is increased by using the auxiliary image as a starting point for the text-to-image conversion and by utilizing image information in the textual description to fill in details lacking in the auxiliary image, for example.
In an implementation form of the second aspect, the auxiliary image comprises a compressed version of the image. Hence, the textual description may be used to reconstruct the image on the basis of the auxiliary image. This may increase accuracy of the reconstructed image.
In an implementation form of the second aspect, at least one of a resolution of the auxiliary image is lower than a resolution of the image or a number of pixels of the auxiliary image is smaller than in the image. Hence, the image may be reconstructed in a higher resolution than the auxiliary image.
In an implementation form of the second aspect, the auxiliary image comprises an object segmented and/or color quantized version of the image. The textual description of the image may be used to reconstruct the original version of the image based on the highly modified auxiliary image. The highly modified auxiliary image requires considerably less space in memory compared to the original image.
According to a third aspect, there is provided a device to provide an image as a text. The device is configured to determine a textual description of the image; obtain an auxiliary image, wherein the auxiliary image comprises a simplified representation of the image; and store the auxiliary image in association with the textual description of the image. This enables images to be compressed into considerably less space than with conventional compression methods. Further, the image may be later decompressed based on the textual description and the auxiliary image. The auxiliary image may be used as a starting point for reconstructing the image based on the textual description. Alternatively, the auxiliary image may be used to assist in the reconstruction based on the textual description by, for example, obtaining additional information from the auxiliary image. The additional information may relate to locations, colors, sizes or shapes of objects, among others.
According to a fourth aspect, there is provided a device to capture an image based on a text. The image capture device is configured to receive a textual description of the image; obtain at least one viewfinder image of a camera; compare the textual description of the image with the viewfinder image; and capture a second image with the camera in response to determining that the viewfinder image corresponds to the textual description of the image. This enables a user to capture an image based on a textual description of the image. Hence, the image capturing device may be able to capture an image at the right time, without a user input, when the viewfinder image matches the textual description of the image.
In an implementation form of the fourth aspect, the textual description of the image is received based on a textual user input or a verbal user input. This enables the user to initiate image capture simply by writing or verbally describing the scene, and the image capturing device may capture the image as soon as an image matching the description is detected in the viewfinder.
In an implementation form of the fourth aspect, the device is configured to convert the viewfinder image into a second textual description; compare the second textual description to the textual description of the image; and capture the second image with the camera in response to determining that the second textual description corresponds to the textual description. This enables the matching of the textual description of the image and the viewfinder image to be determined by comparing the textual description of the viewfinder image to the received textual description.
In an implementation form of the fourth aspect, the device is configured to reconstruct the image based on the textual description; compare the reconstructed image to the viewfinder image; and capture the at least one second image with the camera in response to determining that the reconstructed image corresponds to the viewfinder image. This enables the comparison of the textual description of the image and the viewfinder image to be performed by first reconstructing the image based on the textual description for image comparison.
In an implementation form of the fourth aspect, the device is configured to obtain an auxiliary image, wherein the auxiliary image comprises a simplified representation of the second image in a more compact form; and store the auxiliary image in association with the textual description of the image. Hence, the image may be compressed by storing the auxiliary image and the textual description of the image instead of the image. As a result, the image may be stored in considerably less space than if the image was compressed using conventional compression techniques. Further, the image may be later reconstructed accurately.
In an implementation form of the fourth aspect, the device is further configured to delete the second image. Hence, only the auxiliary image and the text description are saved, and a considerable amount of memory space may be saved when storing the image.
According to a fifth aspect, there is provided a method for determining an image in a textual form. The method comprises determining a textual description of the image; receiving at least one user input for modifying the textual description of the image; determining a modified textual description of the image based on the textual description of the image and the at least one user input, the modified description of the image having a limited number of characters; and storing the modified textual description of the image as a text file, wherein a size of the text file is lower than a file size of the image. This enables images to be compressed into considerably less space than with conventional compression methods. Further, the user may be able to modify the image by modifying the textual description of the image before storing the image in the textual form.
In an implementation form of the fifth aspect, the method is executed in the device of the first aspect.
According to a sixth aspect, there is provided a method for providing an image from a text. The method comprises obtaining a textual description of at least one image; obtaining an auxiliary image associated with the textual description; and reconstructing the image based on the textual description and the auxiliary image. This enables the image to be decompressed by accurately reconstructing it in high quality based on the auxiliary image and the textual description of the image.
In an implementation form of the sixth aspect, the method is executed in the device of the second aspect.
According to a seventh aspect, there is provided a method to provide an image as a text. The method comprises determining a textual description of the image; obtaining an auxiliary image, wherein the auxiliary image comprises a simplified representation of the image; and storing the auxiliary image in association with the textual description of the image. This enables images to be compressed into considerably less space than with conventional compression methods. Further, the image may be later decompressed based on the textual description and the auxiliary image. The auxiliary image may be used as a starting point for reconstructing the image based on the textual description. Alternatively, the auxiliary image may be used to assist in the reconstruction based on the textual description by, for example, obtaining additional information from the auxiliary image. The additional information may relate to locations, colors, sizes or shapes of objects, among others.
In an implementation form of the seventh aspect, the method is executed in the device of the third aspect.
According to an eighth aspect, there is provided a method for image capturing based on a text. The method comprises receiving a textual description of an image; obtaining at least one viewfinder image of a camera; comparing the textual description with the viewfinder image; and capturing a second image with the camera in response to determining that the viewfinder image corresponds to the textual description of the image. This enables a user to capture an image based on a textual description of the image. Hence, the image capturing device may be able to capture an image at the right time, without a user input, when the viewfinder image matches the textual description of the image.
In an implementation form of the eighth aspect, the method is executed in the device of the fourth aspect.
According to a ninth aspect, there is provided a computer program comprising a program code configured to cause performance of the method according to any of the fifth, sixth, seventh or eighth aspect, when the computer program is executed on a computer.
According to a tenth aspect, there is provided a computer program product comprising a computer readable storage medium storing program code thereon, the program code comprising instructions for executing the method according to any of the fifth, sixth, seventh or eighth aspect.
BRIEF DESCRIPTION OF THE DRAWINGS
In the following, examples are described in more detail with reference to the attached figures and drawings, in which:
FIG. 1 illustrates a schematic representation of a block diagram of an image compression system according to an embodiment.
FIG. 2 illustrates a schematic representation of a block diagram of an apparatus configured to perform a functionality according to an embodiment.
FIG. 3 illustrates a schematic representation of a flowchart of a method for determining an image in a textual form for storing the image in a compact form according to an embodiment.
FIG. 4 illustrates a schematic representation of a flowchart of a method for providing an image from a text by reconstructing the image based on a textual description according to an embodiment.
FIG. 5 illustrates a schematic representation of a flowchart of a method for image capturing based on text and comparison of the text to a viewfinder image according to an embodiment.
FIG. 6 illustrates a schematic representation of providing an image in a textual form using neural networks according to an example embodiment.
FIG. 7 illustrates a schematic representation of determining an image in a textual form and further reconstructing the image based on the text according to an embodiment.
FIG. 8 illustrates a schematic representation of capturing an image based on a text and comparison of the text with viewfinder images according to an example embodiment.
In the following, identical reference signs refer to identical or at least functionally equivalent features.
DETAILED DESCRIPTION
In the following description, reference is made to the accompanying drawings, which form part of the disclosure, and in which are shown, by way of illustration, specific aspects and examples in which the present subject-matter may be placed. It is understood that other aspects may be utilized, and structural or logical changes may be made without departing from the scope of the present subject-matter. The following detailed description, therefore, is not to be taken in a limiting sense, as the scope of the present subject-matter is defined in the appended claims.
For instance, it is understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if a specific method operation is described, a corresponding device may include a unit or other means to perform the described method step, even if such unit is not explicitly described or illustrated in the figures. On the other hand, for example, if a specific apparatus is described based on functional units, a corresponding method may include a step performing the described functionality, even if such step is not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise.
An embodiment provides an apparatus and a method for achieving a very high compression rate for an image without actually compressing the image, but instead using, for example, image-to-text conversion technologies. Text-to-image and image-to-text algorithms may be used in a new way to compress high-resolution images as a text describing the image. This may enable that the required space in memory is considerably reduced. Further, a user may be able to edit the image by modifying the text describing the image. Still further, a solution may enable creating and storing images in the compressed form without using a camera, for example, based on a textual or a verbal description of a visualized image.
In an embodiment, image compression may be performed by converting the image into a text description of the image with a device. Hence, a space-saving compressed text file may be saved instead of a compressed image file. The image is either saved as a single text file using an image-to-text function, or a low-resolution version of the image is saved together with the text file to increase the accuracy of the image when decompressed. The accuracy of the decompressed image may also be increased by enabling a user to modify the text description. In an embodiment, an image may be reconstructed based on an auxiliary image and a textual description of the image. The image may have been saved in considerably less space compared to conventional image compression methods, while the auxiliary image and the textual description of the image may enable reconstruction of the image such that the quality of the decompressed image substantially matches the quality of the image before the compression.
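The compression flow described above may be sketched, at a high level, as follows. This is a minimal illustration in Python; image_to_text() and make_auxiliary_image() are hypothetical placeholders for an image-to-text model and for the low-resolution/flattening step, not functions defined by this disclosure.

from pathlib import Path

def compress_to_text(image_path: str, out_dir: str, save_auxiliary: bool = True) -> None:
    # Convert the image into a textual description and store it as a small text file.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    description = image_to_text(image_path)             # hypothetical image-to-text model
    (out / "description.txt").write_text(description)
    if save_auxiliary:
        # Optionally store a low-resolution auxiliary copy to aid later reconstruction.
        aux = make_auxiliary_image(image_path)           # hypothetical downscale/flatten step
        aux.save(out / "auxiliary.jpg", quality=60)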
In an embodiment, an image may be captured by a device based on a textual description of the image. The textual description may be compared to viewfinder images of a camera, and the image is captured by the device when the viewfinder image corresponds to the image description. The device may use the captured image, for example, to obtain a low-resolution version of the image as an auxiliary image, which may be used for image compression by storing the auxiliary image in association with the image description, similarly as described in the above section. Furthermore, the device and a method for image capturing based on a text may enable capturing images without user input as soon as the viewfinder image matches a scene described by a user.
FIG. 1 illustrates a schematic representation of a block diagram of a system 100 for compressing and decompressing images according to an aspect. The system 100 comprises a device 300 to determine an image in a textual form coupled with a device 400 to provide an image from a text. The devices 300, 400 may operate independently from each other. In addition, the devices 300 and 400 may be coupled via a memory 101. The devices 300, 400 may be operationally coupled via a network 102. In an embodiment, the image decompression device 400 may retrieve data stored on the image compression device 300 over the network 102. In an embodiment, the image compression device 300 may store data on a remote memory storage, such as in a cloud, and the image decompression device 400 may retrieve the data from the cloud over the network 102. The network 102 may comprise a wireless local area network (WLAN) connection, a short-range wireless network connection such as for example a Bluetooth, NFC (near-field communication), or RFID connection, a local wired connection such as for example a local area network (LAN) connection or a universal serial bus (USB) connection, or a wired Internet connection. The system 100 may further comprise a device 500 configured for capturing images. The device 500 may assist the device 300 by providing the images for compression. The device 300, the device 500 and the device 400 may be embodied as separate devices, or one or more of the devices 300, 400, 500 may be comprised in a single device, for example as dedicated software and/or hardware components.
FIG. 2 illustrates a schematic representation of a block diagram of an apparatus 200 configured to perform a functionality according to an embodiment. The apparatus 200 may comprise a computing device such as for example a mobile phone, a tablet computer, a laptop, or the like. Although the apparatus 200 is illustrated as a single device, it is appreciated that, wherever applicable, functions of the apparatus 200 may be distributed to a plurality of devices, for example to implement example embodiments as a cloud computing service. In an embodiment, the apparatus 200 may be the device 300 to determine an image in a textual form, the device 400 to provide an image from a text, the device 500 to capture an image based on a text, or a combination thereof.
The apparatus 200 may comprise at least one processor 201 and a memory 202. The memory 202 may comprise a program code 203 which, when executed on the processor 201, causes the apparatus 200 to perform embodiments of the operations and functionality described. For example, the at least one processor 201 may comprise one or more of various processing devices, such as for example a co-processor, a microprocessor, a controller, a digital signal processor (DSP), a processing circuitry with or without an accompanying DSP, or various other processing devices. Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), and Graphics Processing Units (GPUs).
The apparatus 200 may comprise a user interface 205 comprising an input device and/or an output device. The input device may be any device configured to receive user inputs, such as a keyboard, a touch screen, or a microphone. The output device may for example comprise a display and/or a speaker. The apparatus may further comprise a camera 204. In an embodiment, the apparatus 200 may be the device 500 for image capture based on a text and the camera 204 may be configured to capture the images. In another embodiment, the apparatus 200 may be the device 300 to determine an image in a textual form, and the camera 204 may be used by the device 300 to obtain the image. The apparatus 200 may comprise all or only some of the components or devices 201, 202, 203, 204, 205 depending on the functionality of the apparatus 200.
The memory 202 may be any medium, including non-transitory storage media, on which the program code 203 is stored, such as a Blu-Ray disc, DVD, CD, USB (flash) drive, hard disc, server storage available via a network, a ROM, a PROM, an EPROM, an EEPROM or a Flash memory having electronically readable control signals stored thereon which cooperate or are capable of cooperating with a programmable computer system such that an embodiment of at least one of the methods described is performed. The functionality described herein can be performed, at least in part, by one or more computer program product components such as software components. An embodiment comprises or is a computer program comprising program code for performing any of the methods described herein, when executed on a computer. Another example comprises or is a computer readable medium comprising a program code that, when executed by the processor, causes a computer system to perform any of the methods described herein. The program code may comprise instructions which, when executed, cause the processor, computer or the like, to perform at least one of the methods described herein.
Consequently, the apparatus 200 is configured for performing at least one method described herein. In one embodiment, this includes the at least one processor, the at least one memory including program code configured to, when executed by the at least one processor, cause the apparatus to perform the method. The apparatus 200 may be embodied as a separate device, or one or more of apparatuses 200 may be comprised in a single device, for example as dedicated software and/or hardware components.
FIG. 3 illustrates a schematic representation of a flowchart of a method for determining an image in a textual form according to an aspect. The method may enable storing the image in a compressed form as a text. In an embodiment, the device 300 may be configured to execute the method.
An image may be obtained for storage. The image may be obtained by capturing a scene with a camera coupled with the device 300. In an embodiment, the image may be obtained by retrieving the image from a memory or over a network. The image may also be obtained by other means, such as by receiving from a user a verbal description of an image the user visualizes. The camera may be comprised in the device 300, or it may be a camera of a device 500 for image capturing coupled with the device 300. The image may be converted into a textual description of the image. At 301, a textual description of the image may be determined. The conversion to the textual form may be implemented using an image-to-text algorithm. For example, a model combining machine vision and machine translation may be used to generate natural sentences describing the image. The model may be trained with images and their respective descriptions to increase the accuracy of the generated textual description. For example, deep convolutional neural networks (CNN) may be used for object recognition and detection, and scene classification. The textual description based on the detected objects and scene may be formed using, for example, a recurrent neural network (RNN). When the image is obtained in a verbal format, the image may be converted using a speech-to-text algorithm.
In the CNN/RNN model, CNN may encode the image into a compact representation, followed by RNN that may generate a corresponding description of the image. As illustrated in FIG. 6, CNN 601 may be configured to take image data 600 as an input. CNN 601 may be further configured to output objects and a scene in the image, as well as how the objects relate to each other and attributes associated with the objects. CNN 601 may transform the image data 600 into a rich representation by embedding the image data into a fixed-length vector which may be provided as an input to the decoder RNN 602, which in turn generates the output description 603. The image-to-text algorithm may be trained using stochastic gradient descent. The CNN 601 may be pre-trained for image classification on large corpora from a variety of different fields to increase the accuracy and quality of the generation of image descriptions. In case the CNN did not recognize an object or an attribute correctly, or a user wants to edit the image 600, for example, the generated textual description 603 may be provided for the user via a user interface 205. In response to one or more user inputs 605, the textual description of the image 603 may be modified. The modified textual description of the image 604 may be again provided for the user via the user interface 205 so that the user may determine if he is satisfied with the image description. The device 300 may request an approval of the user via the user interface 205, and the modified textual description 604 may be stored on a memory 202 in response to a user input indicating the approval of the user. Alternatively, the modified textual description of the image 604 may be automatically stored on the memory 202. The modified textual description of the image 604 may be retrieved from the memory 202 for constructing the image 606 based on the modified textual description 604. In response to receiving a new input image, a non-linear function, such as a Long Short-Term Memory (LSTM) network, may be used by the algorithm to predict each word of the description. The LSTM may comprise a memory cell encoding knowledge at every time step of what inputs have been observed up to this step. The behavior of the cell may be controlled by gate layers which are applied multiplicatively. Hence, the gate layers may either keep a value from the gated layer if the gate is 1 or zero the value if the gate is 0. Three gates may be used to control whether to forget the current memory cell value, whether the memory cell should read its input, and whether to output the new memory cell value. The multiplicative gates may enable the LSTM to be trained robustly, as the gates deal well with exploding and vanishing gradients. When training, a copy of the LSTM memory may be created for the image and each word such that all LSTMs share the same parameters, and the output of the LSTM of a previous time instance is fed to the LSTM at the next time instance. Hence, recurrent connections may be transformed to feed-forward connections. The image and the words may be mapped to the same space, the image by using a vision CNN, and the words by using word embedding. Each image may be input only once to inform the LSTM about contents of the image and to avoid noise caused by additional inputs of the same image.
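As one possible illustration of the encoder-decoder structure outlined above, a minimal PyTorch sketch is shown below. The ResNet backbone, the layer sizes and the teacher-forced training interface are assumptions made for illustration only; the embodiments do not prescribe a particular network architecture.

import torch
import torch.nn as nn
import torchvision.models as models

class Captioner(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        backbone = models.resnet18(weights=None)              # CNN encoder
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.encoder = backbone
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # The image embedding is fed only once, as the first input step of the LSTM.
        img_vec = self.encoder(images).unsqueeze(1)            # (B, 1, E)
        words = self.word_embed(captions[:, :-1])              # teacher forcing with shifted words
        hidden, _ = self.lstm(torch.cat([img_vec, words], dim=1))
        return self.out(hidden)                                # per-step word logits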
In an embodiment, the description may be generated by sampling, where the first word is sampled, the corresponding embedding is provided as an input, the outcome is sampled, and the sampling continues until a predetermined number of characters is met. Instead of sampling, a beam search decoder may be used, where a set of best sentences may be iteratively considered as candidates for the final description, and only the resulting best sentences may be kept to form the description.
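A compact sketch of such a beam search is given below. The callback next_word_log_probs() is a hypothetical stand-in for one decoding step of the model (for example, the LSTM decoder), returning a mapping from candidate words to log-probabilities; the beam width and length limit are illustrative choices.

def beam_search(next_word_log_probs, beam_width: int = 5, max_len: int = 30, eos: str = "<eos>") -> str:
    beams = [([], 0.0)]                                   # (words so far, accumulated log-prob)
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            if words and words[-1] == eos:
                candidates.append((words, score))         # finished sentence is kept as-is
                continue
            for word, logp in next_word_log_probs(words).items():
                candidates.append((words + [word], score + logp))
        # Keep only the best sentences as candidates for the final description.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best_words, _ = beams[0]
    return " ".join(w for w in best_words if w != eos)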
Even though the image-to-text algorithm may be highly accurate, modifications may be needed in order to provide a comprehensive description of the image. At 302, at least one user input for modifying the textual description of the image is received. The textual description may be displayed as a visual output to the user via a user interface of the device 300. Alternatively, or in addition, the textual description may be provided to the user as an audio output. Further, the device 300 may receive the one or more user inputs for modifying the textual description of the image via the user interface. The user inputs may be, for example, textual or verbal user inputs. For example, the user may modify the text by typing additional text in the textual description and/or deleting one or more parts of the textual description. The modification of the textual description in response to the user inputs may be related to a single object of the image, to many objects of the image or to the whole image area. For a simplified example, the device 300 may have converted the image into a textual description which says: “a forest landscape with a red ball in the foreground”. The device 300 may then display or playback the textual description to a user via the user interface as a visual or an audio output. For instance, if the conversion failed to capture the tree type in the image, the user may then edit the textual description with speech, saying: “the forest comprises birches”. Thereafter, at 303, a modified textual description of the image is determined based on the textual description of the image and the at least one user input. For example, when the user modified the textual description using the verbal input, the verbal input may first be converted into text using a speech-to-text model, and the text may be fused with the textual description. The modified textual description may be, for example: “a birch forest landscape with a red ball in the foreground”. Hence, accuracy of the image reconstruction may be increased. In addition to providing image details, the user may modify the textual description in order to edit the image. In an embodiment, the user input may comprise at least one instruction to modify at least one portion of the textual description of the image associated with at least one portion of the image and/or at least one parameter of the image. For example, the user may remove the red ball from the image or change the brightness of the image by modifying the textual description. The modifications performed by the user may be perceived in the image when the image is decompressed using a text-to-image conversion.
Once the user is satisfied with the text, only the modified textual description is saved instead of the obtained image. The modified textual description of the image may have a limited number of characters. At 304, the modified textual description of the image is stored as a text file, wherein a size of the text file is lower than a file size of the image. The image may be configured to be decompressed based on the modified textual description. Hence, the image may be stored in a reduced size and the compression may be reversible. The device 300 executing the method may achieve a considerably higher compression rate than conventional compression methods. The user may be able to ensure accuracy of the textual description of the image by modifying the textual description. In addition, or alternatively, the user may edit the image by modifying the textual description, and the image may be reconstructed in the edited form when decompressed.
In an embodiment, the method may further comprise obtaining an auxiliary image, wherein the auxiliary image comprises a simplified representation of the image converted into the textual description; and storing the auxiliary image in association with the modified textual description of the image. The auxiliary image may be used with the modified textual description by the device 400 to reconstruct the image when the image is being decompressed.
In an embodiment, the device 300 may be configured to determine a textual description of the image, obtain an auxiliary image, wherein the auxiliary image comprises a simplified representation of the image, and store the textual description of the image in association with the auxiliary image.
The auxiliary image may be a low-quality version of the image with a lower resolution. In an embodiment, a number of pixels of the auxiliary image is smaller than in the image. The auxiliary image may be a highly modified version of the image which can be efficiently compressed, for example, using JPEG. The auxiliary image stored with the modified textual description may take approximately half of the required space in memory compared to compressing the original image. The required space may depend on the modification method of the auxiliary image. For example, the auxiliary image may be obtained based on at least one of object segmentation or color quantization of the image. Alternatively, a very low-resolution image may be obtained from the original image. For example, the image can be resized to a VGA resolution. In the very low-resolution image, not all the details may be kept. These small objects (details) may be retained in the textual description. Hence, when the compressed image is reconstructed into full resolution on the basis of the auxiliary image, the details may be added using the textual description.
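One way to obtain such an auxiliary image is sketched below using Pillow: the image is resized towards VGA scale and color-quantized so that it contains large flat areas and compresses well as JPEG. The target size and palette size are illustrative assumptions; this is one possible implementation of the make_auxiliary_image() helper assumed earlier.

from PIL import Image

def make_auxiliary_image(path: str, max_side: int = 640, colors: int = 16) -> Image.Image:
    img = Image.open(path).convert("RGB")
    scale = max_side / max(img.size)
    if scale < 1.0:
        img = img.resize((int(img.width * scale), int(img.height * scale)))
    # Color quantization produces large contiguous flat areas that JPEG encodes compactly.
    return img.quantize(colors=colors).convert("RGB")

# Example: make_auxiliary_image("image_700.jpg").save("auxiliary_702.jpg", quality=60)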
In an embodiment, the method may further comprise obtaining at least one second image for a reference for modifying the textual description of the image; converting the second image into a second textual description of the second image; obtaining a fused textual description based on merging the second textual description with the textual description, the fused textual description of the images having a limited number of characters; and storing the fused textual description of the images as a text file, wherein a size of the text file is lower than a file size of any of the images. The image may be configured to be decompressed based on the fused textual description. The fused textual description may be stored in association with the auxiliary image. The second image may be obtained by capturing with the camera. Alternatively, the second image may be retrieved by the image compression device 300 from a memory or over a network. The second image may or may not be related to the image, e.g. the second image may portray the same scene as the image but at a different time of the day, or it may portray a completely different scene.
In an embodiment, at least one illumination parameter, at least one color parameter, or at least one capture parameter of the second image is different from the image. The textual description of the second image may comprise references for reconstruction of the image in an edited form. Fusing the images may include generating a fused image by selecting portions (e.g. regions, pixels) either from the image or the second image. Alternatively, the style of the second image may be transferred to the image to change the visual appearance of the image by fusing the images. Changing the visual appearance may be implemented, for example, by changing the illumination of the image to match a user preference. The second image may be a demonstration of the user preference. For instance, the user may dictate objects of the second image having a different color, a different illumination, or different capture parameters such as exposure as the references, and fuse the desired parameters with the original image to get an HDR (high dynamic range) image. This may be implemented, for example, by obtaining a textual description of the second image, combining the textual description of the second image with the textual description of the image, and using a text-to-image algorithm to generate a modified image based on the combined textual description. The text-to-image algorithm may utilize domain information of the reference, while preserving the content information from the image. That is, during the reconstruction phase, the content from the image may be preserved (such as scene, objects, details), while a different style from the second image may be taken from the reference (different color scheme, illumination, etc.) for modification of the image. As another example, if the user wishes to use the illumination parameters of the second image in the image, the user may modify the fused textual description by deleting other parts related to the second image leaving only the illumination instructions for the reconstruction purposes.
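A very rough sketch of producing such a fused textual description is shown below: content-related sentences are kept from the description of the image, while style-related sentences (illumination, color, exposure) are taken from the description of the second image. The keyword list is purely an assumption for illustration; a practical system would rely on the text-to-image model itself to separate content from style.

STYLE_KEYWORDS = ("illumination", "light", "color", "exposure", "contrast")

def fuse_descriptions(content_text: str, reference_text: str) -> str:
    def sentences(text: str):
        return [s.strip() for s in text.split(".") if s.strip()]
    # Keep content sentences from the image, style sentences from the reference.
    content = [s for s in sentences(content_text)
               if not any(k in s.lower() for k in STYLE_KEYWORDS)]
    style = [s for s in sentences(reference_text)
             if any(k in s.lower() for k in STYLE_KEYWORDS)]
    return ". ".join(content + style) + "."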
Any of operations 301-304 for determining the image in a textual form for compression may be performed in response to a user input. Alternatively, the operations 301-304 may be performed automatically, for example, in response to obtaining the image.
FIG. 4 illustrates a schematic representation of a flowchart of a method to provide an image from a text according to an aspect. The method enables reconstructing an image which may be compressed as a textual description of the image. The device 400 may be configured to execute the method. At 401, a textual description of at least one image is obtained. The textual description of the image may be retrieved from a memory or over a network. The textual description may be retrieved in response to a user input. The user may be able to modify the obtained textual description before the image is reconstructed. For example, the device 400 may display the textual description to the user and receive user inputs to add and/or remove one or more parts of the textual description. This may enable the modifications to be applied to only part of the image or to the entire image. For example, modification of the textual description may be related to a single object of the image, to many objects of the image or to the whole image area. The user inputs may be received in a text or a speech format. Hence, the user may be able to improve the reconstruction quality. Further, the user may be able to edit the image before reconstruction.
At 402, an auxiliary image associated with the textual description is obtained. The auxiliary image may be a low-quality version of the image. For example, a resolution of the auxiliary image may be lower than a resolution of the image. The auxiliary image may be obtained from a memory or over a network. At 403, the image is reconstructed based on the textual description and the auxiliary image. For example, the auxiliary image may be used as a drawing canvas, and the textual description may be used to reconstruct the image on the auxiliary image. This may enable the compressed image to be decompressed in a higher resolution based on the textual description of the image and the significantly lower quality version of the image. Further, when the textual description is based on an image of a scene visualized by the user, a high-resolution realistic image of the scene may be generated based on the saved textual description. The reconstruction may be performed using a text-to-image algorithm. For example, the text-to-image algorithm may be trained by randomly sampling sentences and pairing them with images. During conversion, the algorithm may provide the textual description with a random noise vector as an input for producing the image. The algorithm may be trained with a dataset of paired samples of sources (descriptions) and targets (images). During training, samples may also be taken from the target distribution as references. When training on the paired data, the reference may be a ground-truth target. With the reference, a reference encoder may compress domain information into a fixed-length vector for a reference embedding. The embedding may be used as a query vector in a token layer, which may comprise a bank of token embeddings and an attention layer, where the token embeddings are randomly initialized. The tokens may represent a variety of domain information in the training data. A universal attention module may further be used to encode diverse domain information from the target distribution into a latent space. The attention layer is used to learn the similarity between the reference embedding and each of the tokens, which may produce a set of weights representing the contribution of each token. A domain embedding may comprise a weighted sum of the token embeddings, which may be used as an encoded latent code for image generation. The bank of token embeddings may be shared across all training sequences. The presented process may allow the algorithm to learn a highly structured latent space.
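The token layer described above can be sketched as follows in PyTorch: a reference embedding attends over a bank of randomly initialized tokens, and the domain embedding is the attention-weighted sum of those tokens. The number of tokens and the embedding dimension are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenLayer(nn.Module):
    def __init__(self, num_tokens: int = 10, dim: int = 256):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))    # randomly initialized token bank
        self.query = nn.Linear(dim, dim)

    def forward(self, reference_embedding: torch.Tensor) -> torch.Tensor:
        # reference_embedding: (B, dim), the fixed-length vector from the reference encoder.
        q = self.query(reference_embedding)                          # (B, dim)
        scores = q @ self.tokens.t() / self.tokens.shape[1] ** 0.5   # (B, num_tokens)
        weights = F.softmax(scores, dim=-1)                          # contribution of each token
        return weights @ self.tokens                                 # (B, dim) domain embedding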
FIG. 5 illustrates a schematic representation of a flowchart of a method for image capturing according to an embodiment. The device 500 may be configured to execute the method.
At 501, a textual description of an image is received. The textual description of the image may be received, for example, from a memory or over a network. As another example, the textual description of the image may be received based on a textual user input or a verbal user input. That is, the user may describe, by text or voice, the scene he visualizes or is looking at.
At 502, at least one viewfinder image of a camera is obtained. At 503, the textual description is compared with the viewfinder image. At 504, a second image is captured with the camera in response to determining that the viewfinder image corresponds to the textual description of the image.
FIG. 8 illustrates an example of an operation of the device 500 for image capturing based on a text. A user may have provided a textual description 801 of the image he wants to capture. For example, when the user starts the camera and points the camera to a scene, the similarity between the scene, captured in the viewfinder of the camera, and the earlier stored textual description 801 of the image may be calculated. The user may move the camera along a landscape such that different viewfinder images A, B, C are captured in the viewfinder. Similarity of the images A, B, C, 800 may be determined, at 802, by comparing the similarity in the image domain, or by converting the viewfinder images A, B, C to a second textual description and comparing the second textual description, for example, by analyzing whether similar words describing similar objects exist in the textual descriptions. If the viewfinder scene is substantially similar to the one described earlier by the user, an image is captured in response to a determination performed by the device 500 at 803. For example, the determined similarity of the textual description of the image 801 and the viewfinder image A may be 20%, 98% for the viewfinder image B, and 50% for the viewfinder image C. Based on the comparisons, the device 500 may capture, at 804, the image of the viewfinder image B. The device 500 may capture the image, for example, when a predetermined similarity threshold is met. If the threshold is not met at 803 after the comparison, the operation may return to 802. Hence, the device 500 may enable capturing images automatically based on an image description received from a user.
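The capture loop of FIG. 8 may be sketched as below. grab_viewfinder_frame(), describe() and capture_photo() are hypothetical stand-ins for the camera pipeline and the image-to-text model, and text_similarity() could be, for example, the word-overlap measure sketched after the next paragraph; the 0.9 threshold mirrors the example similarities above.

def capture_when_matching(target_description: str, threshold: float = 0.9):
    while True:
        frame = grab_viewfinder_frame()                  # e.g. viewfinder images A, B, C (800)
        candidate = describe(frame)                      # viewfinder image converted to text (802)
        if text_similarity(candidate, target_description) >= threshold:   # determination at 803
            return capture_photo()                       # capture at 804, e.g. viewfinder image B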
In an embodiment of the method, the viewfinder image is converted into a second textual description; the second textual description is compared to the textual description of the image; and the second image is captured with the camera in response to determining that the second textual description corresponds to the textual description. Hence, the camera may be automated to capture the second image when the textual descriptions of the images comprise a preset degree of similarity. The device 500 may be configured to determine when the level of similarity is sufficient for the correspondence, for example, if the textual descriptions comprise at least 90 percent of matching words or sentences.
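A simple word-overlap measure that could back the "at least 90 percent of matching words" check is sketched below; it is only an illustration, and a practical device would likely use embeddings or another more robust text similarity.

def text_similarity(a: str, b: str) -> float:
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    # Jaccard overlap of the word sets of the two textual descriptions.
    return len(words_a & words_b) / len(words_a | words_b)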
In an embodiment, the method may further comprise reconstructing the image based on the textual description; comparing the reconstructed image to the viewfinder image; and capturing the at least one second image with the camera in response to determining that the reconstructed image corresponds to the viewfinder image. For example, the device 500 may analyze contents of the images, such as colors, shapes, textures, or any other image content information by using an image distance measure. That is, dimensions of the contents of the images may be measured and compared as distance measures. For example, a color histogram may be computed for the images identifying the proportion of pixels within the image holding specific values. For another example, visual patterns in the images and how they are spatially defined may be computed and compared. Alternatively, or in addition, the device 500 may apply segmentation or edge detection, use shape filters, or other shape descriptor method for comparing shapes in the images. If the determined distance measure is approximately zero, the second image may be captured.
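The color-histogram comparison mentioned above may, for example, be sketched with NumPy as follows: per-channel histograms are normalized and compared, and a distance near zero indicates that the reconstructed image matches the viewfinder image. The bin count is an illustrative choice.

import numpy as np

def histogram_distance(img_a: np.ndarray, img_b: np.ndarray, bins: int = 32) -> float:
    dist = 0.0
    for ch in range(3):                                   # R, G, B channels
        ha, _ = np.histogram(img_a[..., ch], bins=bins, range=(0, 256))
        hb, _ = np.histogram(img_b[..., ch], bins=bins, range=(0, 256))
        ha = ha / max(ha.sum(), 1)                        # proportion of pixels per value range
        hb = hb / max(hb.sum(), 1)
        dist += 0.5 * np.abs(ha - hb).sum()               # L1 distance per channel
    return dist / 3.0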
In an embodiment, the method may further comprise obtaining an auxiliary image, wherein the auxiliary image comprises a compressed version of the second image; and storing the auxiliary image in association with the textual description of the image. In response to obtaining the auxiliary image, the device 500 may be configured to delete the second image. Therefore, memory space may be saved. This is because the second image becomes unnecessary, since the auxiliary image and the textual description may be sufficient for reconstructing the image as well as the corresponding second image. Hence, the auxiliary image stored with the textual description may be used for image compression and decompression. The auxiliary image may be a simplified representation of the second image. The auxiliary image may be a highly modified or a lower resolution version of the second image which may be efficiently compressed. The second image may be captured based on a textual description of an image of a scene visualized by the user. In an embodiment, the second image matching the scene visualized by the user may be used by the device 500 to obtain the auxiliary image for compression by the device 300 and/or decompression by the device 400. Therefore, when the user wants to view the image he visualized, it may be reconstructed with increased accuracy using the auxiliary image, which is based on an actual image.
FIG. 7 illustrates a schematic representation of determining an image in a textual form and reconstructing the image from the text according to an embodiment. The operations may be performed by the devices 300, 400 configured to implement the respective methods described in conjunction with FIGS. 3 and 4.
First, an image 700 is obtained by the device 300. The image 700 may be a high-resolution image. The high resolution may be 300 dpi (dots per inch) or more. The device 300 may convert the image 700 to a textual description 701 of the image 700. The textual description 701 in FIG. 7 illustrates only one possible description of the image 700, and it may comprise more information, such as additional details of objects and scene, color codes, illumination, and the like. The device 300 may display the textual description 701 for a user. The user may then modify it, for example, by typing in additional information. In the embodiment of FIG. 7, the user may have added the underlined section (“on a table”) to ensure that the object locations are reconstructed correctly when the image is decompressed. In response to the at least one user input, the compression device determines a modified textual description 703. The device 300 may further obtain an auxiliary image 702. The auxiliary image 702 may be a low-resolution version of the image 700. The low resolution may be, for example, 72 dpi. Alternatively, or in addition, the auxiliary image 702 may be obtained from the original image 700 by flattening, as illustrated in FIG. 7. When flattened, objects look like flat areas. The image 700 may be flattened using, for example, object segmentation and color quantization. This highly modified auxiliary image 702 may be more efficiently compressed using JPEG since it contains large contiguous flat areas. Hence, when the auxiliary image 702 is stored together with the modified textual description 703 of the image 700, the image 700 may be compressed to approximately half of the originally required space. For example, the image 700 illustrated in FIG. 7 may take 1.6 MB on a disk, and the auxiliary image 702 may take only 0.8 MB. Further, the textual description 701 may take only 161 bytes, even when in uncompressed form.
When the user wants to view the image 700, the device 400 may retrieve the auxiliary image 702 and the modified textual description 703 from the memory. For example, the auxiliary image 702 may be used as a drawing canvas where the image 700 is reconstructed from the modified textual description 703. For instance, the modified textual description 703 may make references to the auxiliary image 702, such as: “the grey toy mouse is situated on the black area of the auxiliary image”. Also, boundaries of the desired areas for particular objects may be textually described in the modified textual description 703, for example: “a glasses box has a center at location x, y and a diameter r”. As another example, using the reference (e.g. “the grey toy mouse on the black area”) with the auxiliary image 702 may help the text-to-image algorithm to more reliably reconstruct the right kind of toy mouse, for example, based on the shape on the black area. The referenced shape on the black area may further be used to help the decompression device to place the object (toy mouse) in the right place in the image 700. Even though the exemplary textual descriptions are presented as fluent sentences, it may be sufficient that the textual descriptions comprise the image information in any text form, such as a list of properties, objects, attributes, and the like.
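Placement hints such as the quoted “a glasses box has a center at location x, y and a diameter r” could be extracted from the modified textual description 703 with a simple parser, so that the decompression device knows where to draw each object on the auxiliary-image canvas. The sentence pattern below is an assumption for illustration only.

import re

PLACEMENT = re.compile(
    r"(?P<obj>[\w\s]+?) (?:is having|has) a center at location "
    r"(?P<x>\d+),\s*(?P<y>\d+) and a diameter (?P<r>\d+)"
)

def parse_placements(description: str):
    # Returns one placement record per matching sentence fragment in the description.
    return [
        {"object": m["obj"].strip(), "x": int(m["x"]), "y": int(m["y"]), "r": int(m["r"])}
        for m in PLACEMENT.finditer(description)
    ]

# parse_placements("a glasses box has a center at location 120, 80 and a diameter 40")
# -> [{'object': 'a glasses box', 'x': 120, 'y': 80, 'r': 40}]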
Although the subject matter has been described in language specific to structural features and/or acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as examples of implementing the claims and other equivalent features and acts are intended to be within the scope of the claims.
It will be understood that the benefits and advantages described above may relate to one example or may relate to several examples. The examples are not limited to those that solve any or all of the stated problems or those that have any or all of the stated benefits and advantages. It will further be understood that reference to 'an' item may refer to one or more of those items. The steps of the methods described herein may be carried out in any suitable order, or simultaneously where appropriate. Additionally, individual blocks may be deleted from any of the methods without departing from the spirit and scope of the subject matter described herein. Aspects of any of the examples described above may be combined with aspects of any of the other examples described to form further examples without losing the effect sought.
The term 'comprising' is used herein to mean including the method, blocks or elements identified, but that such blocks or elements do not comprise an exclusive list and a method or apparatus may contain additional blocks or elements.
Although the solution and its advantages have been described in detail with reference to specific features and embodiments thereof, it is evident that various changes, modifications, substitutions, combinations and alterations can be made thereto without departing from the spirit and scope as defined by the appended claims. The specification and drawings are, accordingly, to be regarded simply as an illustration of the subject-matter as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations or equivalents that fall within the scope of the present subject-matter.

Claims

1. A device to determine an image in a textual form, configured to: determine a textual description of the image; receive at least one user input for modifying the textual description of the image; determine a modified textual description of the image based on the textual description of the image and the at least one user input, the modified textual description of the image having a limited number of characters; and store the modified textual description of the image as a text file, wherein a size of the text file is lower than a file size of the image.
2. The device according to claim 1, wherein the at least one user input comprises at least one instruction to modify at least one portion of the textual description of the image associated with at least one portion of the image and/or at least one parameter of the image.
3. The device according to any preceding claim, wherein the device is further configured to: obtain an auxiliary image, wherein the auxiliary image comprises a simplified representation of the image; and store the auxiliary image in association with the modified textual description of the image.
4. The device according to claim 3, wherein a resolution of the auxiliary image is lower than a resolution of the image; and/or a number of pixels of the auxiliary image is smaller than in the image.
5. The device according to claim 3 or claim 4, wherein the device is further configured to: obtain the auxiliary image based on at least one of object segmentation or color quantization of the image.
6. The device according to any preceding claim, wherein the device is further configured to: obtain at least one second image as a reference for modifying the textual description of the image; convert the second image into a second textual description of the second image; obtain a fused textual description based on merging the second textual description with the textual description, the fused textual description of the images having a limited number of characters; and store the fused textual description of the images as a text file, wherein a size of the text file is lower than a file size of any of the images.
7. The device according to claim 6, wherein the device is configured to store the fused textual description in association with the auxiliary image.
8. The device according to claim 7, wherein at least one illumination parameter, at least one color parameter, or at least one capture parameter of the second image is different from the image.
9. A device to provide an image from a text, configured to: obtain a textual description of at least one image; obtain an auxiliary image associated with the textual description; and reconstruct the image based on the textual description and the auxiliary image.
10. The device according to claim 9, wherein at least one of a resolution of the auxiliary image is lower than a resolution of the image or a number of pixels of the auxiliary image is smaller than in the image.
11. The device according to any of claims 9 to 10, wherein the auxiliary image comprises an object segmented and/or color quantized version of the image.
12. A device to provide an image as a text, configured to: determine a textual description of the image; obtain an auxiliary image, wherein the auxiliary image comprises a simplified representation of the image; and store the auxiliary image in association with the textual description of the image.
13. A device to capture an image based on a text, configured to: receive a textual description of the image; obtain at least one viewfinder image of a camera; compare the textual description of the image with the viewfinder image; and capture a second image with the camera in response to determining that the viewfinder image corresponds to the textual description of the image.
14. The device according to claim 13, configured to: convert the viewfinder image into a second textual description; compare the second textual description to the textual description of the image; and capture the second image with the camera in response to determining that the second textual description corresponds to the textual description.
15. The device according to claim 13 or claim 14, configured to: reconstruct the image based on the textual description; compare the reconstructed image to the viewfinder image; and capture the second image with the camera in response to determining that the reconstructed image corresponds to the viewfinder image.
16. The device according to any of claims 13 to 15, further configured to: obtain an auxiliary image, wherein the auxiliary image comprises a simplified representation of the second image in a more compact form; and store the auxiliary image in association with the textual description of the image.
17. A method for determining an image in a textual form, comprising: determining a textual description of the image; receiving at least one user input for modifying the textual description of the image; determining a modified textual description of the image based on the textual description of the image and the at least one user input, the modified textual description of the image having a limited number of characters; and storing the modified textual description of the image as a text file, wherein a size of the text file is lower than a file size of the image.
18. A method for providing an image from a text, comprising: obtaining a textual description of at least one image; obtaining an auxiliary image associated with the textual description; and reconstructing the image based on the textual description and the auxiliary image.
19. A method for image capturing based on a text, comprising: receiving a textual description of an image; obtaining at least one viewfinder image of a camera; comparing the textual description with the viewfinder image; and capturing a second image with the camera in response to determining that the viewfinder image corresponds to the textual description of the image.
20. A computer program product comprising a computer readable storage medium storing program code thereon, the program code comprising instructions for executing the method according to any of claims 17 to 19.
PCT/EP2020/055030 2020-02-26 2020-02-26 Devices and methods for providing images and image capturing based on a text and providing images as a text WO2021170230A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/055030 WO2021170230A1 (en) 2020-02-26 2020-02-26 Devices and methods for providing images and image capturing based on a text and providing images as a text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/055030 WO2021170230A1 (en) 2020-02-26 2020-02-26 Devices and methods for providing images and image capturing based on a text and providing images as a text

Publications (1)

Publication Number Publication Date
WO2021170230A1

Family

ID=69726567

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2020/055030 WO2021170230A1 (en) 2020-02-26 2020-02-26 Devices and methods for providing images and image capturing based on a text and providing images as a text

Country Status (1)

Country Link
WO (1) WO2021170230A1 (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100231730A1 (en) * 2009-03-13 2010-09-16 Yuka Ichikawa Image sensing device and camera
US7814040B1 (en) * 2006-01-31 2010-10-12 The Research Foundation Of State University Of New York System and method for image annotation and multi-modal image retrieval using probabilistic semantic models
EP2402867A1 (en) * 2010-07-02 2012-01-04 Accenture Global Services Limited A computer-implemented method, a computer program product and a computer system for image processing
US20160028878A1 (en) * 2009-12-31 2016-01-28 Digimarc Corporation Methods and arrangements employing sensor-equipped smart phones
GB2544853A (en) * 2015-11-11 2017-05-31 Adobe Systems Inc Structured knowledge modeling and extraction from images
US20180007250A1 (en) * 2015-08-07 2018-01-04 Google Inc. Speech and Computer Vision-Based Control
US20180137119A1 (en) * 2016-11-16 2018-05-17 Samsung Electronics Co., Ltd. Image management method and apparatus thereof
WO2018097889A1 (en) * 2016-11-22 2018-05-31 Google Llc. Camera operable using natural language commands
US10142520B1 (en) * 2017-06-28 2018-11-27 Xerox Corporation Single color background mixed raster content (MRC)
US10163227B1 (en) * 2016-12-28 2018-12-25 Shutterstock, Inc. Image file compression using dummy data for non-salient portions of images
US20190139218A1 (en) * 2017-11-06 2019-05-09 Beijing Curacloud Technology Co., Ltd. System and method for generating and editing diagnosis reports based on medical images
US20190369825A1 (en) * 2018-06-05 2019-12-05 Samsung Electronics Co., Ltd. Electronic device and method for providing information related to image to application through input unit

Similar Documents

Publication Publication Date Title
US20200134456A1 (en) Video data processing method and apparatus, and readable storage medium
JP2019008778A (en) Captioning region of image
WO2024051445A1 (en) Image generation method and related device
US11810326B2 (en) Determining camera parameters from a single digital image
US11887217B2 (en) Text editing of digital images
WO2020081239A1 (en) Speaking classification using audio-visual data
WO2023072067A1 (en) Face attribute editing model training and face attribute editing methods
US11481563B2 (en) Translating texts for videos based on video context
CN113901894A (en) Video generation method, device, server and storage medium
US20230230198A1 (en) Utilizing a generative neural network to interactively create and modify digital images based on natural language feedback
US20230262189A1 (en) Generating stylized images on mobile devices
CN113591530A (en) Video detection method and device, electronic equipment and storage medium
JP2023526899A (en) Methods, devices, media and program products for generating image inpainting models
CN112966676B (en) Document key information extraction method based on zero sample learning
US20230153965A1 (en) Image processing method and related device
CN116883737A (en) Classification method, computer device, and storage medium
US11915513B2 (en) Apparatus for leveling person image and operating method thereof
WO2021170230A1 (en) Devices and methods for providing images and image capturing based on a text and providing images as a text
WO2023096572A2 (en) Agilegan-based refinement method and framework for consistent texture generation
CN111860212B (en) Super-division method, device, equipment and storage medium for face image
KR102279772B1 (en) Method and Apparatus for Generating Videos with The Arrow of Time
US20230135978A1 (en) Generating alpha mattes for digital images utilizing a transformer-based encoder-decoder
CN110969187A (en) Semantic analysis method for map migration
CN116993996B (en) Method and device for detecting object in image
CN114037674B (en) Industrial defect image segmentation detection method and device based on semantic context

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20708062

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20708062

Country of ref document: EP

Kind code of ref document: A1