CN112634174B - Image representation learning method and system - Google Patents

Image representation learning method and system

Info

Publication number
CN112634174B
CN112634174B
Authority
CN
China
Prior art keywords
image
frame
enhanced
encoder
representation learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011632703.8A
Other languages
Chinese (zh)
Other versions
CN112634174A (en)
Inventor
胡郡郡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Minglue Artificial Intelligence Group Co Ltd
Original Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority to CN202011632703.8A
Publication of CN112634174A
Application granted
Publication of CN112634174B


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G06N 3/088 - Non-supervised learning, e.g. competitive learning
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 - Image enhancement or restoration
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/181 - Segmentation; Edge detection involving edge growing; involving edge linking

Abstract

The application discloses an image representation learning method and system. The image representation learning method comprises the following steps: an enhanced image acquisition step: acquiring enhanced images of an original image; a feature map acquisition step: acquiring feature maps of the enhanced images with an encoder; a prediction step: predicting the bounding box of each enhanced image with a box regression network to obtain predicted boxes; and a calculation step: calculating the final loss between the ground-truth box and the predicted boxes, and updating the box regression network and the encoder according to the final loss. Two enhancement modes are applied to the original picture, which strengthens the robustness of the model, suppresses noise well, and at the same time improves the accuracy of detection and segmentation tasks.

Description

Image representation learning method and system
Technical Field
The present application relates to the field of image representation technologies, and in particular, to a method and a system for learning image representations.
Background
In deep learning, whatever the task being optimized, whether classification, detection, or segmentation, a model pre-trained for classification on ImageNet is typically loaded before migrating to the downstream task; such a pre-training model, however, obscures position information. Besides ImageNet-pretrained classification models, unsupervised methods are now also widely studied, improving the expressive power of pre-trained models through contrastive learning. Another way to improve expressive power is to learn on an image data set using bounding-box regression: whereas a classification task obscures position information, box regression can learn more of it, which benefits downstream tasks that are sensitive to position.
Therefore, the present application provides an image representation learning method and system that train with box regression rather than classification, improving the expressive power of the model, in particular its sensitivity to position and detail, so that the model acquires more position information. Two enhancement modes are applied to the original picture, a box is regressed for each, and the two loss functions are summed; this strengthens the robustness of the model, suppresses noise well, and at the same time improves the accuracy of detection and segmentation tasks.
Disclosure of Invention
The embodiments of the application provide an image representation learning method and system, so as to at least solve the problems of the related art described above.
The application provides an image representation learning method, which comprises the following steps:
an enhanced image acquisition step: acquiring enhanced images of an original image;
a feature map acquisition step: acquiring feature maps of the enhanced images with an encoder;
a prediction step: predicting the bounding box of each enhanced image with a box regression network to obtain predicted boxes;
a calculation step: calculating the final loss between the ground-truth box and the predicted boxes, and updating the box regression network and the encoder according to the final loss.
In the above image representation learning method, the enhanced image acquisition step includes acquiring, for each original image, at least two enhanced images of the original image using a data enhancement method.
In the above image representation learning method, the feature map acquisition step includes forming the encoder from a deep-learning feature-extraction backbone network together with a multi-layer perceptron, and acquiring the feature maps with the encoder.
In the above image representation learning method, the prediction step includes predicting the bounding box of each enhanced image using the box regression network and obtaining the corresponding predicted box.
In the above image representation learning method, the calculation step includes calculating, with an intersection-over-union (IoU) loss, the loss between the ground-truth box of the original image and the predicted box of each enhanced image, adding the at least two losses to obtain the final loss, and back-propagating the final loss to update the encoder and the box regression network.
The application further provides an image representation learning system, which is suitable for the above image representation learning method and comprises:
an enhanced image acquisition unit: acquiring enhanced images of an original image;
a feature map acquisition unit: acquiring feature maps of the enhanced images with an encoder;
a prediction unit: predicting the bounding box of each enhanced image with a box regression network to obtain predicted boxes;
a calculation unit: calculating the final loss between the ground-truth box and the predicted boxes, and updating the box regression network and the encoder according to the final loss.
In the above image representation learning system, the enhanced image acquisition unit acquires, for each original image, at least two enhanced images of the original image using a data enhancement method.
In the above image representation learning system, the feature map acquisition unit forms the encoder from a deep-learning feature-extraction backbone network together with a multi-layer perceptron, and acquires the feature maps with the encoder.
In the above image representation learning system, the prediction unit predicts the bounding box of each enhanced image using the box regression network and obtains the corresponding predicted box.
In the above image representation learning system, the calculation unit calculates, with the intersection-over-union (IoU) loss, the loss between the ground-truth box of the original image and the predicted box of each enhanced image, adds the at least two losses to obtain the final loss, and back-propagates the final loss to update the encoder and the box regression network.
Compared with the related art, the application provides an image representation learning method and system that train with box regression rather than classification, which improves the expressive power of the model, in particular its sensitivity to position and detail, so that the model acquires more position information. Two enhancement modes are applied to the original picture, a box is regressed for each, and the two loss functions are summed, strengthening the robustness of the model, suppressing noise well, and at the same time improving the accuracy of detection and segmentation tasks.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below, so that the other features, objects, and advantages of the application will be more readily understood.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flowchart of an image representation learning method according to an embodiment of the present application;
FIG. 2 is an application flow diagram according to an embodiment of the application;
FIG. 3 is a schematic diagram of an image representation learning system according to the present application;
FIG. 4 is a block diagram of an electronic device according to an embodiment of the application.
Wherein, the reference numerals are as follows:
51: enhanced image acquisition unit;
52: feature map acquisition unit;
53: prediction unit;
54: calculation unit;
81: processor;
82: memory;
83: communication interface;
80: bus.
Detailed Description
The present application will be described and illustrated with reference to the accompanying drawings and examples in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. All other embodiments, which can be made by a person of ordinary skill in the art based on the embodiments provided by the present application without making any inventive effort, are intended to fall within the scope of the present application.
It is apparent that the drawings in the following description are only some examples or embodiments of the application, and that those of ordinary skill in the art can apply the application to other similar situations according to these drawings without inventive effort. Moreover, while such a development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure, and should not be construed as going beyond the scope of this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the described embodiments of the application can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," and similar referents in the context of the application are not to be construed as limiting the quantity, but rather as singular or plural. The terms "comprising," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in connection with the present application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. The terms "first," "second," "third," and the like, as used herein, are merely distinguishing between similar objects and not representing a particular ordering of objects.
The application is based on image representation learning via bounding-box regression, which is briefly introduced below.
Deep learning is a class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from the raw input. For example, in image processing, lower layers may identify edges, while higher layers may identify human-meaningful parts such as digits, letters, or faces. Most modern deep learning models are based on artificial neural networks, particularly convolutional neural networks (CNNs), although they may also include propositional formulas or latent variables organized layer by layer in deep generative models, such as the nodes in deep belief networks and deep Boltzmann machines. In deep learning, each level learns to convert its input data into a somewhat more abstract and composite representation. In an image recognition application, the raw input may be a matrix of pixels; the first layer may abstract the pixels and encode edges; the second layer may compose and encode arrangements of edges; the third layer may encode a nose and eyes; and the fourth layer may recognize that the image contains a face. Importantly, a deep learning process can learn by itself which features to place at which level. (This does not completely avoid the need for hand-tuning; for example, different numbers and sizes of layers can provide different degrees of abstraction.) The "deep" in "deep learning" refers to the number of layers through which the data is transformed. More precisely, deep learning systems have a substantial credit assignment path (CAP) depth. The CAP is the chain of transformations from input to output, and describes potentially causal connections between input and output. For a feed-forward neural network, the depth of the CAPs is that of the network, equal to the number of hidden layers plus one (because the output layer is also parameterized). For recurrent neural networks, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited. There is no universally agreed threshold of depth dividing shallow learning from deep learning, but most researchers agree that deep learning involves CAP depth greater than 2. A CAP of depth 2 has been shown to be a universal approximator in the sense that it can emulate any function; beyond that, more layers do not add to the function-approximation capability of the network. Deep models (CAP > 2), however, are able to extract better features than shallow models, and hence the extra layers help in learning features. Deep learning architectures are often constructed with a greedy layer-by-layer method. Deep learning helps to disentangle these abstractions and pick out which features improve performance. For supervised learning tasks, deep learning methods avoid feature engineering by translating the data into compact intermediate representations akin to principal components, and deriving layered structures that remove redundancy from the representation. Deep learning algorithms can also be applied to unsupervised learning tasks. This is an important benefit because unlabeled data are more abundant than labeled data. Examples of deep structures that can be trained in an unsupervised manner are neural history compressors and deep belief networks.
Bounding-box regression is defined as the process, within target detection, of taking the annotated ground-truth box as the target and making a generated candidate box approximate it. Since a box on an image can be uniquely determined by its center coordinates (Xc, Yc), width W, and height H, this approximation can be modeled as a regression problem. By applying box regression to the candidate boxes, the finally detected target localization comes closer to the true value, improving localization accuracy.

An image, a general term covering both images and pictures, is a reproduction of what a person perceives visually. Images can be acquired by optical devices such as cameras, mirrors, telescopes, and microscopes, or created by hand, such as drawings and paintings. An image can be recorded and stored on paper, film, or other media sensitive to optical signals. Specially designed images support a visual language for human-to-human communication, as seen in the world's large body of painting, sculpture, and architecture. In remote sensing, "image" also refers to imaging by non-photographic sensors, essentially an extension of photographic film: photographic imaging usually means optical photography recorded on photographic film, a form of passive remote sensing, whereas an image in the broader sense can capture visible-light, infrared, thermal-infrared, and microwave information from ground objects through optical-mechanical, opto-electronic, or antenna scanning, recorded on magnetic tape or, after electro-optical conversion, on photosensitive film. Compared with a photograph it is broader in both content and form; as aerial photographic reconnaissance developed into remote sensing, and photography developed into imaging, "image" came to include, rather than replace, "photograph".

A model is an object that, through subjective consciousness and by means of a physical or virtual representation, objectively describes the form and structure of the represented object; the represented object need not be a physical thing, and the model is limited neither to physical versus virtual form nor to planar versus three-dimensional form. A model is not a commodity: any object in the development process before becoming a commodity is by definition a model, and once its form, specification, and corresponding price are fixed, it is presented in commodity form. In a broad sense, if one thing can change as another thing changes, then it is a model of that other thing. Models are used to express the properties of different concepts; one concept can be expressed, to different degrees, by many models, while only a few models can express the properties of a given concept, so a concept can change the expression of its properties by referring to different models. When a model is associated with a thing, a framework arises whose properties determine how the model changes with the thing. By constitution, models divide into solid models (physical objects with volume and weight that embody a concept) and virtual models (forms constituted digitally from electronic data and other practical effects).
Solid models divide into static models (physically at rest, with no energy-converting power system inside), power-assisted models (based on a static model; under an external acting force or external kinetic energy they exhibit the connection relations of the object's structure through physical motion, without an internal power-conversion system), and movable models (able to generate kinetic energy through an internal power-conversion system, exhibiting relatively continuous physical motion during energy conversion). Virtual models divide into virtual static models, virtual dynamic models, and virtual phantom models. A mathematical model is a model described in mathematical language. It may be one equation or a set of algebraic, differential, integral, or statistical equations, or some suitable combination of these, through which the interrelations or causal relations between the variables of a system are described quantitatively or qualitatively. Besides mathematical models described by equations, there are models described with other mathematical tools, such as algebra, geometry, topology, and mathematical logic. Note that a mathematical model describes the behavior and characteristics of a system rather than its actual structure. A physical model, also called a material model, divides into the physical model proper and the analog model. A physical model proper is a real object manufactured according to similarity theory and scaled down from (or enlarged from, or equal in size to) the original system, such as an airplane model in a wind-tunnel experiment, a hydraulic-system experimental model, a building model, or a ship model. An analog model exploits the fact that variables in different physical domains (mechanical, electrical, thermal, hydrodynamic, and so on) sometimes obey the same laws, so that a model of very different physical meaning can be built by analogy. For example, under certain conditions the pressure response of a pneumatic system consisting of a throttle valve and a gas capacitor follows a law similar to the output-voltage characteristic of a circuit consisting of a resistor and a capacitor, so the pneumatic system can be simulated by the circuit, on which experiments are easier to perform.
An image representation is the way image information is represented and stored in a computer. Image representations and image operations together form an image model, an important component of pattern analysis. An image can be represented at different levels of image information. The most basic, physical image is obtained by sampling a continuous image field on a rectangular grid, yielding a two-dimensional gray-scale array (matrix). The two-dimensional gray matrix can also be represented as one long vector, formed by scanning the matrix column by column (or row by row) and joining the head of each column (or row) to the tail of the previous one. An image representation is also used for the textual or graphical symbology that expresses a model.
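As a concrete illustration of the box parameterization used in the box regression described above, the following minimal sketch converts between the corner form (x1, y1, x2, y2) and the center form (Xc, Yc, W, H); the helper names are illustrative and not part of the application.

    import torch

    def corners_to_center(box):
        # (x1, y1, x2, y2) -> (Xc, Yc, W, H)
        x1, y1, x2, y2 = box.unbind(-1)
        return torch.stack([(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1], dim=-1)

    def center_to_corners(box):
        # (Xc, Yc, W, H) -> (x1, y1, x2, y2)
        xc, yc, w, h = box.unbind(-1)
        return torch.stack([xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2], dim=-1)

For example, corners_to_center(torch.tensor([0., 0., 4., 2.])) yields (2, 1, 4, 2): a box with center (2, 1), width 4, and height 2.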
The application provides an image representation learning method and system that train with box regression rather than classification, improving the expressive power of the model, in particular its sensitivity to position and detail, so that the model acquires more position information. Two enhancement modes are applied to the original picture, a box is regressed for each, and the two loss functions are summed, strengthening the robustness of the model, suppressing noise well, and at the same time improving the accuracy of detection and segmentation tasks.
Embodiments of the present application will be described below with reference to learning an image representation as an example.
Example 1
The present embodiment provides an image representation learning method. Referring to FIGS. 1-2, FIG. 1 is a flowchart of an image representation learning method according to an embodiment of the application. As shown in FIG. 1, the image representation learning method includes the following steps:
enhanced image acquisition step S1: acquiring enhanced images of an original image;
feature map acquisition step S2: acquiring feature maps of the enhanced images with an encoder;
prediction step S3: predicting the bounding box of each enhanced image with a box regression network to obtain predicted boxes;
calculation step S4: calculating the final loss between the ground-truth box and the predicted boxes, and updating the box regression network and the encoder according to the final loss.
In an embodiment, the enhanced image acquisition step S1 includes acquiring, for each original image, at least two enhanced images of the original image using a data enhancement method.
In an embodiment, the feature map acquisition step S2 includes forming the encoder from a deep-learning feature-extraction backbone network together with a multi-layer perceptron, and acquiring the feature maps with the encoder.
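A minimal sketch of such an encoder follows, assuming a ResNet-18 backbone and a two-layer perceptron; the backbone choice and the layer sizes are illustrative assumptions of this sketch, since the application does not fix them.

    import torch.nn as nn
    import torchvision

    class Encoder(nn.Module):
        # f: deep-learning feature-extraction backbone + multi-layer perceptron
        def __init__(self, feat_dim=512, out_dim=512):
            super().__init__()
            backbone = torchvision.models.resnet18()
            # keep everything up to and including global pooling, drop the fc head
            self.backbone = nn.Sequential(*list(backbone.children())[:-1])
            self.mlp = nn.Sequential(
                nn.Linear(feat_dim, out_dim),
                nn.ReLU(inplace=True),
                nn.Linear(out_dim, out_dim),
            )

        def forward(self, x):                   # x: (n, 3, H, W) image batch
            feat = self.backbone(x).flatten(1)  # pooled backbone features, (n, 512)
            return self.mlp(feat)               # encoded representation z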
In an embodiment, the prediction step S3 includes predicting the bounding box of each enhanced image using the box regression network and obtaining the corresponding predicted box.
In an embodiment, the calculation step S4 includes calculating, with an intersection-over-union (IoU) loss, the loss between the ground-truth box of the original image and the predicted box of each enhanced image, adding the at least two losses to obtain the final loss, and back-propagating the final loss to update the encoder and the box regression network.
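The IoU loss of the calculation step can be written directly from its definition, loss = 1 - intersection / union. The sketch below assumes boxes in (x1, y1, x2, y2) form and one ground-truth box per image; it is one possible implementation, not one mandated by the application.

    import torch

    def iou_loss(pred, target, eps=1e-7):
        # pred, target: (n, 4) boxes as (x1, y1, x2, y2)
        ix1 = torch.max(pred[:, 0], target[:, 0])
        iy1 = torch.max(pred[:, 1], target[:, 1])
        ix2 = torch.min(pred[:, 2], target[:, 2])
        iy2 = torch.min(pred[:, 3], target[:, 3])
        inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
        area_p = (pred[:, 2] - pred[:, 0]).clamp(min=0) * (pred[:, 3] - pred[:, 1]).clamp(min=0)
        area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
        union = area_p + area_t - inter
        return (1.0 - inter / (union + eps)).mean()  # IoU loss averaged over the batch

The final loss of the calculation step is then simply the sum of this loss over the two enhanced images, iou_loss(p1, y) + iou_loss(p2, y).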
Referring to FIG. 2, FIG. 2 is an application flow diagram according to an embodiment of the application. The image representation learning method of the present application is described below in a specific embodiment with reference to FIG. 2.
Step 1: for each image, acquire two enhanced images x1 and x2 using a data enhancement method.
Step 2: acquire the feature maps of x1 and x2 using an encoder f composed of a deep-learning feature-extraction backbone network and a multi-layer perceptron (mlp).
Step 3: predict the boxes p1 and p2 using a box regression network h.
Step 4: calculate the IoU losses of p1 and p2 against the ground truth respectively, and finally add the two losses.
Step 5: back-propagate to update the encoder f and the box regression network h.
In a specific implementation, the method can be expressed by the following pseudo code:
# f: backbone + projection mlp
# p: prediction (output of the box regression network h)
for x in loader:  # load a minibatch x with n samples
    x1, x2 = aug(x), aug(x)  # random augmentation
    z1, z2 = f(x1), f(x2)    # encoder
    p1, p2 = h(z1), h(z2)    # box regression
    L = IoULoss(p1, y) + IoULoss(p2, y)  # loss against ground-truth box y
    L.backward()  # update f & h
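To make the pseudo code concrete, a minimal runnable PyTorch sketch follows. The ResNet-18 backbone, the layer sizes, the purely photometric augmentations (chosen so that the ground-truth box y stays valid for both enhanced images), the optimizer settings, the dummy loader, and the assumption of one ground-truth box per image are all illustrative choices of this sketch, not details fixed by the application.

    import torch
    import torch.nn as nn
    import torchvision
    import torchvision.transforms as T
    from torchvision.ops import box_iou

    # appearance-only augmentations, so the ground-truth box is unchanged
    # (applied batch-wise here for simplicity)
    aug = T.Compose([T.ColorJitter(0.4, 0.4, 0.4, 0.1), T.GaussianBlur(kernel_size=5)])

    # f: backbone + projection mlp (encoder)
    resnet = torchvision.models.resnet18()
    f = nn.Sequential(nn.Sequential(*list(resnet.children())[:-1]), nn.Flatten(1),
                      nn.Linear(512, 512), nn.ReLU(inplace=True), nn.Linear(512, 512))

    # h: box regression network predicting (x1, y1, x2, y2); in practice the raw
    # output is usually mapped to a valid box, e.g. via a center/size parameterization
    h = nn.Sequential(nn.Linear(512, 256), nn.ReLU(inplace=True), nn.Linear(256, 4))

    def iou_loss(p, y):
        # 1 - IoU of matched (prediction, ground truth) pairs; equivalent to the
        # manual implementation sketched in the embodiment above
        return (1.0 - box_iou(p, y).diagonal()).mean()

    opt = torch.optim.SGD(list(f.parameters()) + list(h.parameters()), lr=0.05, momentum=0.9)

    # dummy stand-in data: one batch of 8 images with one ground-truth box each
    loader = [(torch.rand(8, 3, 224, 224),
               torch.tensor([[40., 40., 180., 160.]]).repeat(8, 1))]

    for x, y in loader:
        x1, x2 = aug(x), aug(x)                # two random enhanced images
        p1, p2 = h(f(x1)), h(f(x2))            # predicted boxes for each view
        L = iou_loss(p1, y) + iou_loss(p2, y)  # summed IoU losses
        opt.zero_grad()
        L.backward()                           # back-propagate
        opt.step()                             # update encoder f and box network h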
Therefore, the application provides an image representation learning method and system that train with box regression rather than classification, improving the expressive power of the model, in particular its sensitivity to position and detail, so that the model acquires more position information. Two enhancement modes are applied to the original picture, a box is regressed for each, and the two loss functions are summed, strengthening the robustness of the model, suppressing noise well, and at the same time improving the accuracy of detection and segmentation tasks.
Example 2
Referring to FIG. 3, FIG. 3 is a schematic diagram of an image representation learning system according to the present application. As shown in FIG. 3, the image representation learning system is suitable for the image representation learning method described above and includes:
enhanced image acquisition unit 51: acquiring enhanced images of an original image;
feature map acquisition unit 52: acquiring feature maps of the enhanced images with an encoder;
prediction unit 53: predicting the bounding box of each enhanced image with a box regression network to obtain predicted boxes;
calculation unit 54: calculating the final loss between the ground-truth box and the predicted boxes, and updating the box regression network and the encoder according to the final loss.
In this embodiment, the enhanced image acquisition unit 51 acquires, for each original image, at least two enhanced images of the original image using a data enhancement method.
In this embodiment, the feature map acquisition unit 52 forms the encoder from a deep-learning feature-extraction backbone network together with a multi-layer perceptron, and acquires the feature maps with the encoder.
In this embodiment, the prediction unit 53 predicts the bounding box of each enhanced image using the box regression network and obtains the corresponding predicted box.
In this embodiment, the calculation unit 54 calculates, with the intersection-over-union (IoU) loss, the loss between the ground-truth box of the original image and the predicted box of each enhanced image, adds the at least two losses to obtain the final loss, and back-propagates the final loss to update the encoder and the box regression network.
Example 3
With reference to FIG. 4, this embodiment discloses a specific implementation of an electronic device. The electronic device may include a processor 81 and a memory 82 storing computer program instructions.
In particular, the processor 81 may comprise a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits implementing embodiments of the present application.
Memory 82 may include mass storage for data or instructions. By way of example and not limitation, memory 82 may comprise a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, or a universal serial bus (USB) drive, or a combination of two or more of these. The memory 82 may include removable or non-removable (or fixed) media, where appropriate, and may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 82 is non-volatile memory. In a particular embodiment, the memory 82 includes read-only memory (ROM) and random access memory (RAM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these. Where appropriate, the RAM may be static random-access memory (SRAM) or dynamic random-access memory (DRAM), and the DRAM may be fast page mode DRAM (FPM DRAM), extended data out DRAM (EDO DRAM), synchronous DRAM (SDRAM), or the like.
Memory 82 may be used to store or cache various data files that need to be processed and/or communicated, as well as possible computer program instructions for execution by processor 81.
The processor 81 reads and executes the computer program instructions stored in the memory 82 to implement any of the image representation learning methods in the above-described embodiments.
In some of these embodiments, the electronic device may also include a communication interface 83 and a bus 80. As shown in FIG. 4, the processor 81, the memory 82, and the communication interface 83 are connected to one another via the bus 80 and communicate with each other.
The communication interface 83 is used to enable communication between the modules, apparatuses, and units in the embodiments of the application. The communication interface 83 can also carry out data communication with other components, such as external devices, image/data acquisition devices, databases, external storage, and image/data processing workstations.
Bus 80 comprises hardware, software, or both, coupling the components of the electronic device to one another. Bus 80 includes, but is not limited to, at least one of: a data bus, an address bus, a control bus, an expansion bus, or a local bus. By way of example and not limitation, bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a front side bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a low pin count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local bus (VLB), or another suitable bus, or a combination of two or more of these. Bus 80 may include one or more buses, where appropriate. Although particular buses are described and illustrated in the embodiments of the application, the application contemplates any suitable bus or interconnect.
The electronic device may be connected to an image representation learning system to implement the method described in connection with fig. 1-2.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above examples express only a few embodiments of the application, which are described specifically and in detail, but are not thereby to be construed as limiting the scope of the application. It should be noted that several variations and improvements can be made by those skilled in the art without departing from the concept of the application, and these all fall within the protection scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (6)

1. An image representation learning method, comprising:
an enhanced image acquisition step: acquiring enhanced images of an original image;
a feature map acquisition step: acquiring feature maps of the enhanced images with an encoder;
a prediction step: predicting the bounding box of each enhanced image with a box regression network to obtain predicted boxes;
a calculation step: calculating the final loss between the ground-truth box and the predicted boxes, and updating the box regression network and the encoder according to the final loss;
wherein the enhanced image acquisition step includes acquiring, for each original image, at least two enhanced images of the original image using a data enhancement method; and the calculation step includes calculating, with an intersection-over-union (IoU) loss, the loss between the ground-truth box of the original image and the predicted box of each enhanced image, adding the at least two losses to obtain the final loss, and back-propagating the final loss to update the encoder and the box regression network.
2. The image representation learning method according to claim 1, wherein the feature map acquisition step includes forming the encoder from a deep-learning feature-extraction backbone network together with a multi-layer perceptron, and acquiring the feature maps with the encoder.
3. The image representation learning method according to claim 1, wherein the prediction step includes predicting the bounding box of each enhanced image using the box regression network and obtaining the corresponding predicted box.
4. An image representation learning system adapted for the image representation learning method of any one of claims 1 to 3, the image representation learning system comprising:
an enhanced image acquisition unit: acquiring enhanced images of an original image, wherein for each original image the enhanced image acquisition unit acquires at least two enhanced images of the original image using a data enhancement method;
a feature map acquisition unit: acquiring feature maps of the enhanced images with an encoder;
a prediction unit: predicting the bounding box of each enhanced image with a box regression network to obtain predicted boxes;
a calculation unit: calculating the final loss between the ground-truth box and the predicted boxes and updating the box regression network and the encoder according to the final loss, wherein the calculation unit calculates, with the intersection-over-union (IoU) loss, the loss between the ground-truth box of the original image and the predicted box of each enhanced image, adds the at least two losses to obtain the final loss, and back-propagates the final loss to update the encoder and the box regression network.
5. The image representation learning system according to claim 4, wherein the feature map acquisition unit forms the encoder from a deep-learning feature-extraction backbone network together with a multi-layer perceptron, and acquires the feature maps with the encoder.
6. The image representation learning system according to claim 4, wherein the prediction unit predicts the bounding box of each enhanced image using the box regression network and obtains the corresponding predicted box.
CN202011632703.8A 2020-12-31 2020-12-31 Image representation learning method and system Active CN112634174B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011632703.8A CN112634174B (en) 2020-12-31 2020-12-31 Image representation learning method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011632703.8A CN112634174B (en) 2020-12-31 2020-12-31 Image representation learning method and system

Publications (2)

Publication Number Publication Date
CN112634174A (en) 2021-04-09
CN112634174B (en) 2023-12-12

Family

ID=75289953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011632703.8A Active CN112634174B (en) 2020-12-31 2020-12-31 Image representation learning method and system

Country Status (1)

Country Link
CN (1) CN112634174B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569934B * 2021-07-20 2024-01-23 Shanghai Minglue Artificial Intelligence (Group) Co., Ltd. LOGO classification model construction method and system, electronic device, and storage medium
WO2024080699A1 (en) * 2022-10-10 2024-04-18 Samsung Electronics Co., Ltd. Electronic device and method of low latency speech enhancement using autoregressive conditioning-based neural network model


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287927A * 2019-07-01 2019-09-27 Xidian University Remote sensing image object detection method based on deep multi-scale and context learning
CN110503112A * 2019-08-27 2019-11-26 University of Electronic Science and Technology of China Small target detection and recognition method with enhanced feature learning
CN110782430A * 2019-09-29 2020-02-11 Zhengzhou Jinhui Computer System Engineering Co., Ltd. Small target detection method and device, electronic equipment and storage medium
CN111062953A * 2019-12-17 2020-04-24 Beijing University of Chemical Technology Method for identifying parathyroid hyperplasia in ultrasonic images
CN111652140A * 2020-06-03 2020-09-11 Guangdong Genius Technology Co., Ltd. Method, device, equipment and medium for accurately segmenting questions based on deep learning
CN111612017A * 2020-07-07 2020-09-01 National University of Defense Technology Target detection method based on information enhancement
CN111931617A * 2020-07-29 2020-11-13 Industrial and Commercial Bank of China Human eye image recognition method and device based on image processing and self-service terminal
CN111738231A * 2020-08-06 2020-10-02 Tencent Technology (Shenzhen) Co., Ltd. Target object detection method and device, computer equipment and storage medium
CN112132832A * 2020-08-21 2020-12-25 Suzhou Inspur Intelligent Technology Co., Ltd. Method, system, device and medium for enhancing image instance segmentation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An Intensive Study of Backbone and Architectures with Test Image Augmentation and Box Refinement for Object Detection and Segmentation; Keundong Lee et al.; ICTC; pp. 673-677 *
Moving target detection at a remote tower based on an improved YOLO algorithm; Xu Guobiao et al.; Science Technology and Engineering; Vol. 19, No. 14; pp. 377-383 *

Also Published As

Publication number Publication date
CN112634174A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
CN109886066B (en) Rapid target detection method based on multi-scale and multi-layer feature fusion
CN110135267B (en) Large-scene SAR image fine target detection method
EP3822910A1 (en) Depth image generation method and device
CN111860235B (en) Method and system for generating high-low-level feature fused attention remote sensing image description
CN111461212B (en) Compression method for point cloud target detection model
CN110929577A (en) Improved target identification method based on YOLOv3 lightweight framework
CN108734210B (en) Object detection method based on cross-modal multi-scale feature fusion
Yu et al. Capri-net: Learning compact cad shapes with adaptive primitive assembly
CN112801169B (en) Camouflage target detection method, system, device and storage medium based on improved YOLO algorithm
CN112634174B (en) Image representation learning method and system
CN112434618B (en) Video target detection method, storage medium and device based on sparse foreground priori
CN116310850B (en) Remote sensing image target detection method based on improved RetinaNet
Ouyang et al. Vehicle target detection in complex scenes based on YOLOv3 algorithm
CN115424017B (en) Building inner and outer contour segmentation method, device and storage medium
CN112861915A (en) Anchor-frame-free non-cooperative target detection method based on high-level semantic features
CN112561926A (en) Three-dimensional image segmentation method, system, storage medium and electronic device
CN115457492A (en) Target detection method and device, computer equipment and storage medium
Lowphansirikul et al. 3D Semantic segmentation of large-scale point-clouds in urban areas using deep learning
Sun et al. Two-stage deep regression enhanced depth estimation from a single RGB image
CN114511785A (en) Remote sensing image cloud detection method and system based on bottleneck attention module
CN114529552A (en) Remote sensing image building segmentation method based on geometric contour vertex prediction
Xiong et al. Marsformer: Martian rock semantic segmentation with transformer
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN117197632A (en) Transformer-based electron microscope pollen image target detection method
CN114708591A (en) Document image Chinese character detection method based on single character connection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant