CN115565042A - Commodity image feature representation method and device, equipment, medium and product thereof - Google Patents


Info

Publication number
CN115565042A
CN115565042A (application CN202211259755.4A)
Authority
CN
China
Prior art keywords
feature
image
commodity
information
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211259755.4A
Other languages
Chinese (zh)
Inventor
李保俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huanju Shidai Information Technology Co Ltd
Original Assignee
Guangzhou Huanju Shidai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huanju Shidai Information Technology Co Ltd filed Critical Guangzhou Huanju Shidai Information Technology Co Ltd
Priority to CN202211259755.4A priority Critical patent/CN115565042A/en
Publication of CN115565042A publication Critical patent/CN115565042A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0641 Shopping interfaces
    • G06Q30/0643 Graphical representation of items or shoppers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a commodity image feature representation method and a device, equipment, medium and product thereof. The method comprises the following steps: acquiring a commodity image; inputting the commodity image into an image encoder of a preset image feature extraction model to extract its deep semantic information and obtain a corresponding semantic feature map; adopting a feature optimization network in the image feature extraction model to segment the semantic feature map with a plurality of windows into a plurality of feature sub-maps and compress each feature sub-map into a feature vector, wherein some of the feature sub-maps contain the same feature region; and fusing all the feature vectors into the image feature information of the commodity image by adopting a fusion network in the image feature extraction model. With the method and device, the image feature information extracted from the commodity image gains stronger robustness, and the subtle differences between different commodity images are amplified in the semantic features, so that more accurate results can be obtained in application scenarios of commodity retrieval based on commodity images.

Description

Commodity image feature representation method and device, equipment, medium and product thereof
Technical Field
The application relates to an e-commerce information technology, in particular to a commodity image feature representation method and device, equipment, medium and product thereof.
Background
With the development of science and technology, there is an increasing demand for finding related goods by photographing physical objects: for example, users retrieve and purchase goods from a buyer's photo (matching a "buyer show" to a "seller show"), and merchants photograph popular goods to locate corresponding supply sources for stocking. Retrieval of commodity images is therefore a crucial step, and the understanding and recognition of commodity images is its central link.
Image understanding is often implemented with deep convolutional neural networks. A convolutional neural network has a strong capability for abstract semantic understanding, so it can map the image information of a commodity into a high-dimensional space to obtain a high-dimensional representation of the commodity while suppressing background noise well.
However, in conventional image feature extraction technologies based on convolutional neural networks, the semantic information extracted from the whole image is generally mechanical, and phenomena such as poor robustness and weak feature differentiation occur when representing the semantic features of different images with similar content.
When such technologies are applied to understanding and identifying the commodity images of an e-commerce platform, the discrimination between commodity images with similar features is weak, so commodity images of the same model but different styles are easily treated as the same image, causing retrieval errors and the like. The existing feature representation technology for commodity images therefore needs to be improved.
Disclosure of Invention
The present application aims to solve the above problems and provide a method for representing image features of an article, and a corresponding apparatus, device, non-volatile readable storage medium, and computer program product.
According to an aspect of the present application, there is provided a method for representing an image feature of a commodity, including the steps of:
acquiring a commodity image;
inputting the commodity image into an image encoder of a preset image feature extraction model to extract deep semantic information of the commodity image, and obtaining a corresponding semantic feature map;
adopting a feature optimization network in the image feature extraction model, segmenting the semantic feature graph by using a plurality of windows to obtain a plurality of feature sub-graphs, and respectively compressing the feature sub-graphs into feature vectors, wherein part of feature sub-graphs contain the same feature regions;
and fusing all the feature vectors into the image feature information of the commodity image by adopting a fusion network in the image feature extraction model.
Optionally, the segmenting the semantic feature map by using multiple windows to obtain multiple feature sub-maps, and respectively compressing the feature sub-maps into feature vectors, includes:
determining the size of a window according to a preset scale, and segmenting a plurality of feature sub-maps from the semantic feature map by applying the corresponding window at preset different positions, wherein one feature sub-map has an overlapping local feature region with every other feature sub-map;
performing pooling operation on each feature subgraph to realize feature compression and obtain initial feature vectors;
and respectively carrying out full connection on the initial characteristic vectors and then normalizing the initial characteristic vectors into final characteristic vectors.
Optionally, fusing all the feature vectors into the image feature information of the commodity image by using a fusion network in the image feature extraction model, including:
averaging the plurality of feature vectors to realize fusion, and obtaining image feature information of the commodity image;
alternatively,
and weighting and summing the plurality of feature vectors to realize fusion, and obtaining the image feature information of the commodity image.
Optionally, after fusing all the feature vectors into the image feature information of the commodity image by using a fusion network in the image feature extraction model, the method includes:
taking the commodity image as a commodity image to be checked, and calculating semantic similarity between image feature information of the commodity image to be checked and image feature information of each commodity image in a commodity database, wherein the image feature information of each commodity image in the commodity database is generated by adopting the image feature extraction model in advance;
screening partial commodity images with relatively high semantic similarity in the commodity database, acquiring commodity information of source commodities corresponding to the partial commodity images, and constructing the commodity information as a commodity information list;
and outputting the commodity information list.
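The retrieval steps above can be sketched as a nearest-neighbor lookup over precomputed feature vectors. The following is a minimal illustrative sketch, not the patent's implementation; the function name, the use of cosine similarity as the "semantic similarity", and the `top_k` parameter are all assumptions for illustration.

```python
import numpy as np

def retrieve_similar(query_vec, db_vecs, top_k=5):
    """Rank database commodity images by cosine similarity to the query.

    query_vec: (D,) image feature information of the commodity image to be checked.
    db_vecs:   (N, D) feature vectors generated in advance for database images.
    Returns the indices of the top_k most similar database entries.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = db @ q                       # one similarity score per database image
    return np.argsort(-sims)[:top_k]    # highest similarity first
```

The returned indices can then be mapped to the source commodities to construct the commodity information list.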
Optionally, the commodity image is input into an image encoder of a preset image feature extraction model to extract deep semantic information of the commodity image, and before a corresponding semantic feature map is obtained, a training process of the image feature extraction model is started, including:
connecting the image feature extraction model with a classifier to form a model training framework;
based on the model training framework, a preset training data set is adopted to carry out classification task training on the image feature extraction model, and the image feature extraction model is trained to be in a convergence state; the training data set comprises a plurality of training samples and supervision labels thereof, and the training samples are commodity images.
Optionally, based on the model training architecture, a preset training data set is used to perform classification task training on the image feature extraction model, and the training is performed to a convergence state, including:
calling a single training sample in the training data set to input the image feature extraction model to determine the image feature information of the training sample;
obtaining classification probability of each category mapped to a preset classification space by a classifier in the model training framework according to the image characteristic information as a classification result;
and calculating the total loss value of the classification result by adopting the supervision label corresponding to the training sample, when the model training architecture does not reach the convergence state, performing gradient updating on the model training architecture according to the total loss value, and calling the next training sample to perform iterative training until the convergence state.
Optionally, calculating an overall loss value of the classification result by using the supervised labels corresponding to the training samples, including:
calculating a first loss value corresponding to the image characteristic information by adopting the supervision label;
calculating the data distance between every two feature vectors which are generated by the image feature extraction model and used for constructing the image feature information, and solving the average value of all the data distances as a second loss value;
and fusing the first loss value and the second loss value to obtain the total loss value.
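The two loss values above can be sketched numerically as follows. This is an illustrative numpy sketch, not the patent's implementation: cross-entropy is assumed for the first loss value, Euclidean distance for the pairwise data distance, and the weighting factor `alpha` used to fuse the two values is an assumption not specified in the text.

```python
import numpy as np
from itertools import combinations

def total_loss(class_probs, label, feature_vecs, alpha=1.0):
    """Fuse a classification loss with a mean pairwise feature distance.

    class_probs:  (C,) softmax classification probabilities for one sample.
    label:        index of the ground-truth class (the supervision label).
    feature_vecs: (K, D) the K feature vectors constructing the sample's
                  image feature information.
    """
    # First loss value: cross-entropy of the classification result.
    first = -np.log(class_probs[label] + 1e-12)
    # Second loss value: mean of the data distances between every two vectors.
    dists = [np.linalg.norm(a - b) for a, b in combinations(feature_vecs, 2)]
    second = float(np.mean(dists))
    # Fuse the two loss values into the total loss.
    return first + alpha * second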
According to another aspect of the present application, there is provided an article image feature representation apparatus including:
an image acquisition module configured to acquire a commodity image;
the semantic extraction module is used for inputting the commodity image into an image encoder of a preset image feature extraction model to extract deep semantic information of the commodity image, and a corresponding semantic feature map is obtained;
the feature optimization module is arranged for adopting a feature optimization network in the image feature extraction model, segmenting the semantic feature graph by using a plurality of windows to obtain a plurality of feature sub-graphs and respectively compressing the feature sub-graphs into feature vectors, wherein part of the feature sub-graphs contain the same feature regions;
and the feature fusion module is configured to fuse all feature vectors into the image feature information of the commodity image by using a fusion network in the image feature extraction model.
According to another aspect of the present application, there is provided an article image feature representation apparatus, comprising a central processing unit and a memory, wherein the central processing unit is used for invoking and running a computer program stored in the memory to execute the steps of the article image feature representation method described in the present application.
According to another aspect of the present application, a non-transitory readable storage medium is provided, which stores a computer program implemented according to the method for representing the image characteristics of an article, in the form of computer readable instructions, and when the computer program is called by a computer, the computer program executes the steps included in the method.
According to another aspect of the present application, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method described in any one of the embodiments of the present application.
Compared with the prior art, the present application has advantages in several respects. After the image encoder in the image feature extraction model extracts the semantic feature map of the commodity image, the feature optimization network further segments and compresses the semantic feature map to obtain a plurality of feature sub-maps; because the windows adopted in the segmentation have overlapping regions, the feature sub-maps intersect. Finally all the feature sub-maps are fused, so that the obtained image feature information effectively amplifies the features of key regions in the commodity image. The image feature information thereby gains stronger robustness and, in particular, can effectively distinguish slight differences between similar commodity images, making it better suited for representing the semantic features of commodity images on an e-commerce platform. Even commodity images of the same model but different styles can be clearly distinguished through the image feature information obtained by this method, so more accurate results can be obtained in application scenarios of commodity retrieval based on commodity images.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic network architecture diagram of an application environment according to the present application;
FIG. 2 is a schematic diagram of a model architecture of an image feature extraction model for exemplary use in the present application;
FIG. 3 is an exemplary functional block diagram illustrating the internal structure of a feature optimization network in an exemplary image feature extraction model of the present application;
FIG. 4 is a diagram of a model training architecture for an exemplary implementation of the present application;
FIG. 5 is a flowchart illustrating an embodiment of a method for characterizing an image of an article of merchandise according to the present application;
FIG. 6 is a schematic flow chart illustrating a process of extracting a plurality of feature vectors from a semantic feature map according to an embodiment of the present application;
fig. 7 is a schematic flowchart of matching a product information list according to a product image to be checked in an exemplary application scenario of the application;
FIG. 8 is a schematic flowchart illustrating an iterative training process performed on an image feature extraction model based on a model training architecture in an embodiment of the present application;
FIG. 9 is a schematic flow chart illustrating a process of calculating an overall loss value during a training process of a model training architecture according to an embodiment of the present application;
fig. 10 is a functional block diagram of the product image feature representation apparatus of the present application;
fig. 11 is a schematic structural diagram of a commodity image feature representation apparatus used in the present application.
Detailed Description
The models cited or possibly cited in the application comprise a traditional machine learning model or a deep learning model, and unless specified in clear text, the models can be deployed in a remote server and remotely called at a client, and can also be deployed in a client with qualified equipment capability to be directly called.
Referring to fig. 1, a network architecture adopted by an exemplary application scenario in the present application includes a terminal device 80, an independent station server 81, and an application server 82. The application server 82 may be configured to deploy a commodity image feature representation service, which provides service by running a computer program product implemented according to the commodity image feature representation method of the present application and opening a corresponding interface. The independent station server 81 may be used to deploy and open an online shop for e-commerce services. A user of the terminal device 80 can submit a commodity image to the independent station server 81 in a page of the online shop; the independent station server 81 stores the commodity image in its commodity database and further calls the corresponding interface provided by the commodity image feature representation service of the application server 82 to submit the commodity image thereto, obtaining the image feature information of the commodity image and thus realizing the feature representation of the commodity image.
It should be noted that the application scenario in fig. 1 is only an example of a platform deployment, and in other exemplary embodiments, the computer program product implemented according to the commodity image feature representation method of the present application may also be run in any computer device with sufficient computing power to execute the steps of the method to implement the commodity image feature representation service. For example, the product image feature representation service may be provided running in the terminal device 80 or a stand-alone server.
Referring to fig. 2, the present application is provided with an exemplary image feature extraction model for generating image feature information for a commodity image, where the image feature extraction model includes an image encoder, a feature optimization network, and a fusion network. The image encoder is responsible for extracting deep semantic information from the commodity image to obtain a corresponding semantic feature map, so that preliminary feature representation of the commodity image is realized. The feature optimization network is responsible for carrying out feature segmentation on the semantic feature graph to obtain a plurality of feature sub-graphs, allowing each feature sub-graph to have a feature region with local overlap, and compressing each feature sub-graph for use so as to finally obtain a feature vector corresponding to each feature sub-graph. The fusion network is used for fusing each feature vector into a single high-dimensional vector to be used as the image feature information of the commodity image.
The image encoder may employ any deep learning model suitable for characterizing image data, such as a Convolutional Neural Network (CNN), a Residual Network (ResNet), a Vision Transformer, a Swin Transformer, and the like. The image encoder may employ a mature pre-trained model.
Referring to fig. 3, the feature optimization network sets a plurality of branches corresponding to the number of feature sub-maps. Each branch is sequentially provided with a pooling layer (average pooling, AvgPooling), a fully connected layer (FC), and a normalization layer (Norm). The pooling layer performs a pooling operation, preferably mean pooling, on the corresponding feature sub-map to obtain an initial feature vector; the fully connected layer maps the initial feature vector into a preset dimension specification so as to meet specific service requirements; the normalization layer standardizes the fully connected feature vector to obtain the final feature vector.
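One branch of the feature optimization network (pooling, full connection, normalization) can be sketched as below. This is a minimal numpy illustration, not the patent's implementation; `W_fc` and `b_fc` stand in for the learned parameters of the fully connected layer, and L2 normalization is assumed for the normalization layer.

```python
import numpy as np

def branch_forward(feature_submap, W_fc, b_fc):
    """Forward pass of one feature-optimization branch.

    feature_submap: (h, w, C) one feature sub-map cut from the semantic feature map.
    W_fc:           (C, D) assumed FC weights mapping to target dimension D.
    b_fc:           (D,) assumed FC bias.
    """
    # Pooling layer: mean-pool the sub-map into an initial feature vector.
    initial = feature_submap.mean(axis=(0, 1))        # shape (C,)
    # Fully connected layer: map to the preset dimension specification.
    fc = initial @ W_fc + b_fc                        # shape (D,)
    # Normalization layer: standardize to obtain the final feature vector.
    return fc / (np.linalg.norm(fc) + 1e-12)
```

Each of the parallel branches would run this computation on its own feature sub-map.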
The fusion network is used for fusing a plurality of final feature vectors output by the feature optimization network, and synthesizing a plurality of feature vectors into the same high-dimensional vector to be used as image feature information. The fusion network can realize the fusion of a plurality of feature vectors by adopting the modes of weighted summation or mean value calculation and the like.
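The two fusion modes mentioned (mean-value calculation and weighted summation) can be sketched as follows; this is an illustrative numpy sketch, and the function name and optional `weights` parameter are assumptions for illustration.

```python
import numpy as np

def fuse(feature_vecs, weights=None):
    """Fuse the K final feature vectors into a single high-dimensional vector.

    feature_vecs: (K, D) final feature vectors output by the branches.
    weights:      optional (K,) weights; if omitted, mean-value fusion is used.
    """
    if weights is None:
        return feature_vecs.mean(axis=0)                # mean-value fusion
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * feature_vecs).sum(axis=0)      # weighted-sum fusion
```

The fused vector serves as the image feature information of the commodity image.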
Before the image feature extraction model is used in the present application to determine the image feature information of a commodity image, it is trained to a convergence state with a sufficient number of training samples and corresponding supervision labels. During training, a classifier is appended to the image feature extraction model to construct the model training architecture shown in fig. 4, and classification task training is performed. For each iteration, the total loss value of the training sample may be determined by combining the model loss value produced after classifying the sample with the relative loss value between the sample's feature vectors; performing gradient updating of the image feature extraction model with this total loss value makes the model converge more easily and quickly learn to accurately represent the image features of commodity images.
Referring to fig. 5, based on the above disclosed principle, an embodiment of a method for representing image features of a product according to the present application includes the following steps:
step S1100, acquiring a commodity image;
the product image is an image for displaying the shape, structure, usage state, and other contents of the product, and may be a detailed image of a local area of a single product or a full-view image of the entire product. In an e-commerce platform application scenario, the item image is typically a display image of an item on the shelf of an online store, including but not limited to a main map, a detail map, etc. displayed on an item detail page of the item on the shelf. As such, the content in the figure in the product image may be regarded as the product image of the present application as long as the product is a product.
In one embodiment, the commodity image may be submitted by a user on a terminal device, and the user may take a commodity image through a camera device in the terminal device, or obtain a commodity image from a video file or a picture file, and submit the commodity image to an independent station server for further processing, so as to determine image feature information of the commodity image.
In another embodiment, the commodity image may be obtained from a commodity picture file belonging to a commodity on the shelf, which is already stored in a commodity database, so as to determine corresponding image characteristic information for the commodity picture file of the commodity on the shelf.
In another embodiment, the commodity image may be stored in any storage space available for calling, including but not limited to a distributed storage system, a local hard disk, a local memory, and the like, and when the commodity image needs to be called, corresponding binary data may be directly read from the corresponding storage space.
In some embodiments, the commodity image may be preprocessed into a standard image format adapted to the input requirements of the image encoder of the image feature extraction model of the present application; for example, the commodity image may be resized to a target size, or cropped to highlight the commodity. Those skilled in the art can implement this flexibly.
Step S1200, inputting the commodity image into an image encoder of a preset image feature extraction model to extract deep semantic information of the commodity image, and obtaining a corresponding semantic feature map;
after the commodity image is obtained, the commodity image can be input into an image feature extraction model which is prepared in the application and is trained to a convergence state, and the image encoder performs convolution operation on the commodity image, so that deep semantic information of the commodity image is extracted, preliminary feature characterization of the commodity image is completed, and a corresponding semantic feature map is obtained.
In one embodiment, the image encoder is a residual network, and a multi-channel semantic feature map can be obtained by performing multi-layer convolution on the commodity image through the residual network.
In another embodiment, the image encoder is implemented with a Swin Transformer model. The model divides the commodity image into a plurality of image blocks, inputs them into the model to apply window shifting and window attention, extracts deep semantic information layer by layer, and finally obtains a multi-channel semantic feature map.
No matter what kind of image encoder is adopted, as long as the image encoder is trained in advance to reach a convergence state, the preliminary feature representation can be effectively carried out on the commodity image so as to reflect the deep semantics of the commodity image.
Step S1300, adopting a feature optimization network in the image feature extraction model to segment the semantic feature map with a plurality of windows into a plurality of feature sub-maps and respectively compress them into feature vectors, wherein some of the feature sub-maps contain the same feature region;
With reference to the schematic block diagram shown in fig. 3, after the image encoder extracts the semantic feature map of the commodity image, the semantic feature map enters the feature optimization network of the image feature extraction model, which performs feature segmentation to divide the semantic feature map into a plurality of feature sub-maps. Some of these feature sub-maps are allowed to contain the same feature region; that is, the same local feature region of the semantic feature map may exist in two or more feature sub-maps at the same time. This effect can be achieved by setting the window size used in feature segmentation. For example, assume the semantic feature map has specification H × W × C, where H is the height, W the width, and C the number of channels, and set the window width to 3W/5. Two feature sub-maps are then segmented, aligned respectively to the two sides of the semantic feature map in the width direction, so that a middle region of width W/5 in each of the two feature sub-maps overlaps, i.e., is identical.
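The two-window example above can be made concrete with a short numpy sketch. This is illustrative only; the function name is an assumption, and W is assumed divisible by 5 so the 3W/5 window width is an integer.

```python
import numpy as np

def split_two_windows(feature_map):
    """Cut two windows of width 3W/5 from a (H, W, C) semantic feature map,
    aligned to its left and right edges, so their middle W/5 columns coincide.
    """
    H, W, C = feature_map.shape
    win = (3 * W) // 5                      # window width 3W/5
    left = feature_map[:, :win, :]          # sub-map aligned to the left edge
    right = feature_map[:, W - win:, :]     # sub-map aligned to the right edge
    overlap = win - (W - win)               # = W/5 shared columns
    return left, right, overlap
```

For W = 10, each window spans 6 columns and the two windows share the middle 2 columns, i.e. the overlapping feature region is the same in both sub-maps.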
It is understood that the obtained feature subgraph is multi-channel data, so that the feature subgraph can be further subjected to feature compression and normalization if necessary by using a feature optimization network to form a high-dimensional vector as a corresponding feature vector. Therefore, through feature compression, the same number of feature vectors can be obtained by corresponding to a plurality of feature subgraphs obtained by feature segmentation. Each feature vector is a high-dimensional representation of its corresponding feature sub-graph.
In one embodiment, the specification of the window used in the feature segmentation may be uniformly set, so as to obtain multiple feature subgraphs of the same scale, thereby avoiding the need to adjust the specification of the feature subgraphs subsequently.
In another embodiment, feature segmentation may be performed on the semantic feature map according to a grid layout, so that different layouts such as four grids, six grids, nine grids and the like can be obtained by setting the window size, while ensuring that adjacent grids contain an overlapping feature region. An efficient feature segmentation effect can thereby be expected.
In yet another embodiment, more than four feature subgraphs may be used, so as to amplify the nuances of semantic features in the commodity image through the respective optimization of a rich number of feature subgraphs.
In another embodiment, all the different feature regions of one target feature subgraph are also contained in the other feature subgraphs, so that semantic associations are established between the target feature subgraph and the others, and the semantic features of the corresponding image region are amplified through the target feature subgraph to highlight that local part of the image.
Step S1400, fusing all the feature vectors into the image feature information of the commodity image by adopting a fusion network in the image feature extraction model.
Referring to the schematic block diagram shown in fig. 3, each of the feature vectors finally obtained by the feature optimization network abstracts a part of the semantic feature map of the commodity image. To represent the deep semantic information of the commodity image more comprehensively and concisely, the plurality of feature vectors are input into the fusion network of the image feature extraction model, which feature-fuses all of them into a single high-dimensional vector that can be used as the image feature information of the commodity image.
When fusing the plurality of feature vectors into one high-dimensional vector, either mean fusion or weighted-summation fusion may be adopted, and this can be set flexibly: any fusion mode is acceptable as long as the high-dimensional vector obtained after fusion does not lose the overall features of the individual feature vectors.
Because some of the feature subgraphs corresponding to the feature vectors contain the features of the same region, the feature subgraphs mutually reinforce that shared feature region. After the feature vectors of the feature subgraphs are fused into the image feature information, the dimension of the obtained image feature information is greatly reduced, saving data volume; moreover, since the shared feature regions are reinforced, the image feature information actually highlights the features of the overlapping parts between different feature subgraphs, achieving a microscope-like effect that helps amplify the discriminative power over the semantic features of the commodity image. In actual measurement, only slight differences in color, local details and the like originally exist between the commodity images of different commodity models (Stock Keeping Units, SKU) of the same commodity object (Standard Product Unit, SPU) in an e-commerce platform; after the image feature information of the commodity image of each commodity model is represented by this method, the degree of difference between their semantic features becomes larger, and they are easier to distinguish clearly.
According to the above embodiments, after the image encoder in the image feature extraction model extracts the semantic feature map of the commodity image, the feature optimization network further segments and compresses the semantic feature map to obtain a plurality of feature subgraphs, and the windows adopted for segmentation are allowed to overlap, so that features in the feature subgraphs intersect. After all the feature subgraphs are fused, the obtained image feature information effectively amplifies the features of key regions in the commodity image and therefore gains stronger robustness, especially towards slight differences between similar commodity images. Such image feature information is well suited to expressing the semantic features of commodity images in an e-commerce platform: even commodity images of the same model that differ only in style can be clearly distinguished through it, so that in application scenarios such as commodity retrieval based on commodity images, more accurate retrieval results can be obtained.
On the basis of any embodiment of the present application, please refer to fig. 6 in combination with fig. 3, where a plurality of windows are applied to segment the semantic feature map to obtain a plurality of feature subgraphs, and the feature subgraphs are respectively compressed into feature vectors, including:
step S1310, determining the size of a window according to a preset scale, and segmenting a plurality of feature subgraphs from the semantic feature map by applying corresponding windows at preset different positions, wherein each feature subgraph overlaps every other feature subgraph in a local feature region;
in this embodiment, referring to the schematic block diagram illustrated in fig. 3, when performing feature segmentation on the semantic feature map obtained by the image encoder, the window size may be set to N × N. Assuming the scale of the semantic feature map is H × W × C, let N be greater than half of H and greater than half of W, so that the area of the window exceeds 1/4 of the plane formed by H × W. Five N × N feature subgraphs (F1, F2, F3, F4, F5) can then be segmented from the plane formed by the height and width of the semantic feature map according to preset position information, for example at five positions: upper left, upper right, lower left, lower right, and center. Among the five feature subgraphs obtained this way, the central feature subgraph overlaps each of the other four in a local feature region, and in fact any one feature subgraph shares a local feature region with each of the other four, that is, every pair of feature subgraphs has a common local feature region. According to this segmentation scheme, a local association relation on feature regions is therefore established among all feature subgraphs.
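A hedged NumPy sketch of this five-window segmentation (function and variable names are illustrative, not from the patent):

```python
import numpy as np

def five_crops(feat, n):
    """Crop five n x n subgraphs from an H x W x C feature map at the
    four corners and the center; n > H/2 and n > W/2 guarantees that
    every pair of crops shares an overlapping region."""
    H, W, C = feat.shape
    tl = feat[:n, :n]                  # upper left
    tr = feat[:n, W - n:]              # upper right
    bl = feat[H - n:, :n]              # lower left
    br = feat[H - n:, W - n:]          # lower right
    cy, cx = (H - n) // 2, (W - n) // 2
    center = feat[cy:cy + n, cx:cx + n]
    return [tl, tr, bl, br, center]

feat = np.random.rand(8, 8, 4)
subs = five_crops(feat, 5)  # n=5 > 8/2, so every pair of crops overlaps
```

Because n exceeds half of each spatial dimension, even diagonally opposite corner crops share a central region, matching the property that every two feature subgraphs contain the same local feature region.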
In other alternative embodiments, by adjusting the center position, the number of windows, or appropriately adjusting the height or width of the windows, each of the segmented feature subgraphs can be given a local association on the feature region with one or more of the other feature subgraphs; those skilled in the art can adjust these settings flexibly according to the principles disclosed in this application.
Step S1320, performing pooling operation on each feature subgraph to realize feature compression, and obtaining an initial feature vector;
after each feature subgraph is obtained, it further passes through a corresponding branch in the feature optimization network, where a pooling layer performs a pooling operation to compress the corresponding feature subgraph into an initial feature vector.
When compressing a feature subgraph, average pooling (Average Pooling) may be adopted to compress it along the width and height directions, obtaining a 1 × C high-dimensional vector that serves as the initial feature vector of the corresponding feature subgraph.
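The average-pooling step can be sketched as follows (NumPy, illustrative names):

```python
import numpy as np

def avg_pool_to_vector(subgraph):
    """Compress an n x n x C feature subgraph into a 1 x C vector by
    average pooling over the height and width directions."""
    return subgraph.mean(axis=(0, 1))

sub = np.random.rand(5, 5, 16)       # one feature subgraph, C = 16
vec = avg_pool_to_vector(sub)        # initial feature vector, shape (16,)
```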
Step S1330, fully connecting the initial feature vectors respectively, and then normalizing the initial feature vectors into final feature vectors.
As shown in fig. 3, the dimension of the initial feature vector is determined by the output specification of the image encoder, but various service scenarios may require a specific feature vector dimension. In that case, further feature compression can be achieved by setting a fully connected layer in the feature optimization network to map the initial feature vector to the specified dimension, for example mapping the feature vector from 1024 dimensions to 128 dimensions.
The feature vectors obtained through the mapping of the fully connected layer are further input into a normalization layer to be normalized, so that the feature vectors are expressed in a standardized form, yielding the final feature vectors.
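A minimal sketch of the fully connected mapping plus normalization, with a randomly initialized weight matrix standing in for the learned parameters (illustrative only, not the trained model; L2 normalization is assumed here as the normalization scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
W_fc = rng.standard_normal((1024, 128))   # hypothetical learned FC weights

def project_and_normalize(vec, weights):
    """Map an initial feature vector to the target dimension with a
    fully connected layer, then L2-normalize it."""
    out = vec @ weights
    return out / np.linalg.norm(out)

v0 = rng.standard_normal(1024)            # 1024-d initial feature vector
v = project_and_normalize(v0, W_fc)       # 128-d unit-length feature vector
```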
According to the above embodiment, by setting the segmentation windows, the semantic feature map is feature-segmented in the feature optimization network into a plurality of feature subgraphs with overlapping feature associations, and feature compression is performed on each feature subgraph to obtain the corresponding feature vectors. Although each feature subgraph undergoes its feature processing separately, the feature associations between different feature subgraphs mean that the necessary association information still exists among the resulting feature vectors, and this association information helps obtain an accurate feature representation when the image feature information of the commodity image is generated from the feature vectors.
On the basis of any embodiment of the present application, a fusion network in the image feature extraction model is adopted to fuse all feature vectors into image feature information of the commodity image, and in one embodiment, the method includes:
step S1410, taking the mean values of the plurality of feature vectors to realize fusion, and obtaining the image feature information of the commodity image:
specifically, the operation of averaging all the feature vectors obtained by the feature optimization network can be realized by using one pooling layer, so that feature fusion of the feature vectors is realized, and a high-dimensional vector is obtained as the image feature information of the commodity image.
Alternatively, in another embodiment, the method comprises:
step S1420, weighting and summing the plurality of feature vectors to realize fusion, and obtaining image feature information of the commodity image:
specifically, different normalized weights can be assigned to the respective feature vectors, which are then summed to obtain a high-dimensional vector; since the weights are normalized, the resulting high-dimensional vector is suitable for use as the corresponding image feature information. Corresponding to this embodiment, the fusion network can thus be implemented as a linear layer responsible for the weighted summation of the feature vectors.
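Both fusion modes can be sketched in NumPy as follows (illustrative names; the weights here are arbitrary examples, not learned values):

```python
import numpy as np

def fuse(vectors, weights=None):
    """Fuse feature vectors into one high-dimensional vector: mean fusion
    when weights is None, otherwise weighted summation with the weights
    normalized to sum to 1."""
    vectors = np.stack(vectors)
    if weights is None:
        return vectors.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalize the weights
    return (w[:, None] * vectors).sum(axis=0)

vs = [np.ones(4) * i for i in range(1, 6)]   # five toy 4-d feature vectors
mean_info = fuse(vs)                          # mean fusion
weighted_info = fuse(vs, [2, 1, 1, 1, 1])     # heavier weight on one vector
```

Giving one vector (e.g. the one from the central feature subgraph) a larger weight shifts the fused representation toward its features, which is the effect the weighted variant aims for.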
In some embodiments, considering that certain feature subgraphs, for example the one corresponding to the central position of the semantic feature map, have each of their local feature regions overlapping with other feature subgraphs, such a central feature subgraph may be given a normalized weight higher than the others, so as to further enlarge the discriminative contribution of its features and amplify, in the semantic representation, the image details of the central region of the commodity image.
In some embodiments, the fusion network may be implemented by using a linear layer, and matching each feature vector with a corresponding normalized weight, and making each normalized weight participate in gradient update of the training stage of the image feature extraction model, so as to determine the normalized weight of each feature vector through training.
According to the embodiments, there are various ways to fuse all the feature vectors obtained by the feature optimization network to obtain corresponding image feature information, so as to realize feature representation of corresponding commodity images.
On the basis of any embodiment of the present application, please refer to fig. 7, where after all feature vectors are fused into image feature information of the commodity image by using a fusion network in the image feature extraction model, the method includes:
step S2100, taking the commodity image as a commodity image to be checked, and calculating semantic similarity between image feature information of the commodity image to be checked and image feature information of each commodity image in a commodity database, wherein the image feature information of each commodity image in the commodity database is generated by adopting the image feature extraction model in advance;
in an exemplary application scenario, the image feature extraction model of the present application may be adopted to extract image feature information corresponding to each commodity image of each commodity in a commodity database of an independent station in advance, and the image feature information, the corresponding commodity image and the source commodity are constructed as mapping relationship data, and may also be stored in the commodity database or a mapping view thereof. Therefore, semantic retrieval can be carried out through the image characteristic information in the commodity database, and the purpose of image retrieval is achieved.
In some more specific service scenarios, for example a search-by-photo recommendation scenario, the independent station obtains a commodity image as the commodity image to be checked. The commodity image to be checked may be taken by the user with the camera of the terminal device, may be a picture file belonging to the user, or may be a picture file specified by the user that the independent station obtains from a corresponding storage space. For the commodity image to be checked, the independent station further calls the image feature extraction model: the commodity image to be checked is input into the model, its deep semantic information is extracted by the image encoder, and the corresponding image feature information is obtained through the processing of the feature optimization network and the fusion network. The image feature extraction model may be deployed directly in the independent station's server, or deployed in an application server of the e-commerce platform that opens a service interface for the commodity image feature representation service, which the independent station calls.
Further, an image vector comparison service is used to calculate the data distance between the image feature information of the commodity image to be checked and the image feature information of each commodity image in the commodity database, and the data distance is converted into a representation of semantic similarity, so that each commodity image in the commodity database obtains a corresponding semantic similarity with respect to the commodity image to be checked. Semantic similarity is expressed such that the larger the numerical value, the more similar the images. The image vector comparison service can be deployed in the current independent station's server, or deployed in an application server of the e-commerce platform for the current independent station to call.
In calculating the data distance between two image feature information, any feasible data distance algorithm may be employed, including but not limited to: cosine similarity calculation, euclidean distance algorithm, pearson correlation coefficient algorithm, jaccard coefficient algorithm, etc., which can be flexibly selected by those skilled in the art.
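As one of the listed options, cosine similarity can be sketched as follows (illustrative NumPy code, not the patent's implementation):

```python
import numpy as np

def cosine_similarity(a, b):
    """Semantic similarity between two image feature vectors; a larger
    value means more similar image content."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 0.0, 1.0])     # feature of the image to be checked
db = np.array([2.0, 0.0, 2.0])    # feature of a database commodity image
sim = cosine_similarity(q, db)    # parallel vectors give similarity 1.0
```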
It is easy to understand that the semantic similarity obtained for each commodity image in the commodity database represents how semantically similar it is to the commodity image to be checked: the higher the semantic similarity, the more similar the two images are in content, and vice versa.
Step S2200, screening out partial commodity images with relatively high semantic similarity in the commodity database, acquiring commodity information of source commodities corresponding to the partial commodity images, and constructing the commodity information as a commodity information list;
Each commodity image in the commodity database is then further screened based on the semantic similarities calculated for the commodity image to be checked.
In one embodiment, a preset number to be preferred is set, each commodity image in the commodity database is sorted from high to low according to semantic similarity, and then one or more commodity images in the top of the sorting are selected according to the preset number, wherein the commodity images are similar commodity images matched with the commodity image to be checked.
In another embodiment, a similarity threshold is set, then, the semantic similarity obtained from the image feature information of each commodity image in the commodity database is compared with the threshold, and when the semantic similarity is higher than the threshold, the corresponding commodity image can be confirmed to be a similar commodity image matched with the commodity image to be checked, so that one or more similar commodity images are screened out.
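Both screening strategies described above can be sketched as follows (plain Python, illustrative names and example values):

```python
def screen(similarities, top_k=None, threshold=None):
    """Return indices of candidate commodity images: either the top_k
    highest-similarity entries, or all entries above threshold."""
    if top_k is not None:
        order = sorted(range(len(similarities)),
                       key=lambda i: similarities[i], reverse=True)
        return order[:top_k]
    return [i for i, s in enumerate(similarities) if s > threshold]

sims = [0.91, 0.42, 0.88, 0.15, 0.77]
best = screen(sims, top_k=2)          # preset-number screening
hits = screen(sims, threshold=0.8)    # threshold screening
```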
For the screened one or more similar commodity images, according to the correspondence between each similar commodity image and the source commodity to which it belongs, the commodity information of each source commodity can be further obtained and used to construct a commodity information list. The commodity information generally includes the corresponding similar commodity image of the source commodity, the commodity title, the price, the commodity link, and the like; which commodity information to select is set flexibly according to business needs and is not limited by the above examples.
Step S2300, outputting the commodity information list.
After the commodity information list is obtained, it can be pushed to the user's terminal device, where it is parsed and different commodity display controls are constructed for the source commodities in the list and displayed in the graphical user interface of the terminal device. Each commodity display control can display the corresponding commodity information, including the commodity title, commodity image, commodity price and the like, and the corresponding commodity link can be attached, so that touching a commodity display control jumps through the corresponding commodity link directly to the commodity detail page of the corresponding source commodity, achieving the business effect of search-by-image.
According to the above embodiments, after the image feature extraction model generates image feature information for the commodity images in the commodity database and for the commodity image to be checked, the semantic similarity between the image feature information can be used to realize a search-by-image service. Because the obtained image feature information clearly expresses detail features, even matches between commodity images of different commodity models of the same commodity object are reflected in the semantic similarity, so that similar commodity images in the commodity database can be accurately retrieved for the commodity image to be checked, improving both retrieval accuracy and retrieval discrimination.
On the basis of any embodiment of the application, the commodity image is input into an image encoder of a preset image feature extraction model to extract deep semantic information of the commodity image, and before a corresponding semantic feature map is obtained, a training process of the image feature extraction model is started, wherein the training process comprises the following steps:
step S3100, connecting the image feature extraction model with a classifier to form a model training framework;
in order to train the image feature extraction model, a classification task may be applied to train the image feature extraction model, and for this purpose, as shown in fig. 4, a classifier is connected to the image feature extraction model, so as to form a model training architecture, and train the model training architecture.
The classifier can be a two-classifier or a multi-classifier. A classification corresponding to a positive sample is marked in a classifier, the other classifications correspond to negative samples, and in the training process by adopting the training sample, when the training sample is the positive sample, the corresponding supervision label of the positive sample is adopted for supervision; and when the training sample is a negative sample, adopting a corresponding supervision label of the negative sample to supervise.
Step S3200, based on the model training framework, adopting a preset training data set to carry out classification task training on the image feature extraction model and train the image feature extraction model to a convergence state; the training data set comprises a plurality of training samples and supervision labels thereof, and the training samples are commodity images.
To perform training, a training data set is prepared that prestores a large number of training samples sufficient to train the model training architecture to a converged state. The training samples can be distinguished into positive samples and negative samples, wherein, for example, the positive samples can be commodity images, and the negative samples can be non-commodity images. Accordingly, in the training dataset, whether the sample type of the training sample is a positive sample or a negative sample is marked for use as a supervised label of the corresponding training sample in the training phase.
On the basis of obtaining the training data set, starting classification task training on the model training architecture, and training the whole model training architecture to a convergence state by calling training samples for multiple times to implement training, so that the image feature extraction model in the model training architecture reaches the convergence state, and the capability of performing feature representation on the commodity image to obtain corresponding image feature information is learned for use.
During each training iteration, a single training sample in the training data set is used as input: the image feature extraction model performs feature representation to obtain the corresponding image feature information, the classifier predicts whether the training sample is a positive sample, and then the corresponding supervision label is used to calculate its loss value. Whether the whole model training architecture has converged is decided according to the loss value; when it has not converged, a gradient update is applied to the model training architecture according to the loss value, and the next training sample is called for another iteration, and so on until the whole model training architecture reaches the convergence state.
According to the embodiment, the image feature extraction model can be trained by means of classification tasks, training efficiency is high, sample cost is low, and the image feature extraction model has obvious economic advantages.
On the basis of any embodiment of the present application, referring to fig. 8, based on the model training architecture, a preset training data set is adopted to perform classification task training on the image feature extraction model, and the training is performed to a convergence state, including:
step S3210, transferring a single training sample in the training data set to input the image feature extraction model to determine image feature information of the training sample;
when each iterative training is performed, a single training sample is called from the training data set, as described above, the training sample is essentially image data, and accordingly, the training sample is input into the image feature extraction model in the model training architecture, specifically, into the image encoder of the image feature extraction model, so that the semantic feature map of the training sample is extracted by the image encoder.
According to the network architecture of the image feature extraction model, the semantic feature map is further processed by the feature optimization network: it is first segmented into a plurality of feature subgraphs, each feature subgraph is compressed into an initial feature vector by the pooling layer in its corresponding branch, and the initial feature vectors are then converted into the final feature vectors through the fully connected layer and the normalization layer.
Further, the fusion network in the image feature extraction model fuses each feature vector input by each branch into the image feature information of the commodity image.
Step S3220, obtaining classification probabilities of the classes mapped to a preset classification space by a classifier in the model training framework according to the image characteristic information as a classification result;
The image feature information obtained for the training sample is input into the classifier of the model training architecture, classification-mapped through a fully connected layer into a preset classification space, and the classification probabilities corresponding to the various classes of the classification space are obtained as the classification result.
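A hedged sketch of such a classifier head, with a randomly initialized weight matrix standing in for the learned parameters and an assumed 10-class classification space (both illustrative):

```python
import numpy as np

def classify(feature, weights):
    """Map image feature information into the classification space with a
    fully connected layer, then softmax into per-class probabilities."""
    logits = feature @ weights
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(1)
W_cls = rng.standard_normal((128, 10))   # hypothetical 10-class head
probs = classify(rng.standard_normal(128), W_cls)
```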
Step S3230, calculating the total loss value of the classification result by adopting the supervision labels corresponding to the training samples, when the model training architecture does not reach the convergence state, performing gradient updating on the model training architecture according to the total loss value, and calling the next training sample to perform iterative training until the model training architecture reaches the convergence state.
It is understood that the overall loss value of the classification result can be calculated based on the corresponding supervised labels of the training samples. In one embodiment, as the training is performed by using the classification task, the cross entropy between the supervision label and the classification result can be directly calculated by using a cross entropy function as an overall loss value. In another embodiment, loss values represented by data distances between every two feature sub-graphs can be further fused on the basis of the loss values corresponding to the cross entropy, so as to form the total loss value. In still another embodiment, based on the above embodiment, a cross-entropy loss function is replaced by a smoothing function, i.e., a Smooth-AP function, to calculate a loss value between the supervision tag and the classification result, so as to determine the overall loss value.
After the total loss value of the model training architecture is determined, whether the model has reached the convergence state can be judged from it. To pursue minimization of the loss value, a threshold can be preset, and the total loss value obtained from the training sample of each iteration is compared against it: when the threshold is reached, the model training architecture has reached the convergence state and training can be terminated; when it is not reached, a gradient update is applied to the whole model training architecture according to the total loss value, correcting it through back propagation, in particular the weight parameters of the image feature extraction model, so that the architecture moves closer to convergence; then the next training sample is called from step S3210 for the next iteration, and so on until the whole model training architecture reaches the convergence state.
According to the above embodiment, the model training architecture is trained with a classification task, at low training cost and high training efficiency. In training the whole architecture, the image feature extraction model is naturally trained to the convergence state and acquires the ability to accurately represent the semantic features of commodity images as corresponding image feature information. Because the feature optimization network of the image feature extraction model segments the semantic feature map of the commodity image into a plurality of feature subgraphs associated over local feature regions, and feature compression is performed on each feature subgraph before the image feature information is generated, the image feature information gains the ability to represent features with amplified detail.
On the basis of any embodiment of the present application, please refer to fig. 9, the calculating the total loss value of the classification result by using the supervised labels corresponding to the training samples includes:
step S4100, calculating a first loss value corresponding to the image characteristic information by adopting the supervision label;
in this embodiment, when the classification loss of the image feature information F_avg is calculated, a Smooth-AP function is adopted. Smooth-AP is a plug-and-play objective function that allows end-to-end training of deep networks and is simple and elegant to implement. This objective function is particularly suitable for optimizing ranking-based metrics and is therefore well suited to this application, making the image feature information obtained by the supervised image feature extraction model more suitable for semantic matching. Accordingly, using Smooth-AP, the first loss value of the image feature information obtained by the model training architecture relative to the supervision label of the corresponding training sample is calculated and denoted L_SAP.
Step S4200, calculating the data distance between every two of the feature vectors generated by the image feature extraction model for constructing the image feature information, and taking the mean of all the data distances as a second loss value;
As described above, the image feature extraction model outputs a plurality of feature vectors in the feature optimization network; in this example there are 5 feature vectors, denoted f_1, f_2, f_3, f_4, f_5. Finally, the 5 feature vectors are fused through the fusion network to obtain the final image feature information F_avg.
Considering that the 5 feature vectors are obtained by sampling different feature regions of the same image, they should be similar to one another, and the distances between them should be reduced. Therefore, the distance between every pair of feature vectors is calculated, yielding 10 data distances, which are finally averaged to obtain the second loss value, denoted L_COS.
The specific calculation formula of the second loss value is as follows:

L_COS = (1/10) * Σ_{i=1}^{4} Σ_{k=i+1}^{5} cos(f_i, f_k)

wherein:

cos(f_i, f_k) = 1 - f_i · f_k^T
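This averaging of pairwise distances can be sketched directly, using the definition cos(f_i, f_k) = 1 - f_i · f_k^T; with 5 vectors this produces 10 pairwise terms:

```python
import numpy as np

def pairwise_distance_loss(vectors):
    """Mean cosine-style distance over all pairs of feature vectors.

    vectors: list of L2-normalized 1-D arrays (5 vectors yield 10 pairs).
    """
    n = len(vectors)
    dists = [1.0 - float(np.dot(vectors[i], vectors[k]))
             for i in range(n) for k in range(i + 1, n)]
    return sum(dists) / len(dists)   # the second loss value L_COS
```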
Step S4300, fusing the first loss value and the second loss value to obtain the total loss value.
After determining the first loss value and the second loss value for each training sample, the first loss value and the second loss value can be fused into an overall loss value for deciding whether the model converges.
In one embodiment, the total loss value L is obtained by directly adding the first loss value and the second loss value using the following formula:

L = L_SAP + L_COS
In another embodiment, smoothing weights may further be set for the first loss value and the second loss value, and a weighted summation of the two is taken to obtain the total loss value L.
According to the above process for calculating the total loss value from the corresponding training sample during each iteration of training, the total loss value integrates both the similarity loss between every pair of feature vectors corresponding to the feature sub-maps and the smooth loss of the finally obtained image feature information, and can therefore represent the model loss more effectively. Whether the model has converged is decided through the total loss value, and gradient updates are applied to the model training architecture accordingly, so that the whole architecture can be trained to a convergence state quickly. This improves training efficiency, ensures that the image feature extraction model can accurately represent the image features of a commodity image, and gives the image feature information of the commodity image higher discriminative power.
Referring to fig. 10, a commodity image feature representation apparatus according to an aspect of the present application includes an image obtaining module 1100, a semantic extracting module 1200, a feature optimizing module 1300, and a feature fusing module 1400, where: the image acquisition module 1100 is configured to acquire an image of a commodity; the semantic extraction module 1200 is configured to input the commodity image into an image encoder of a preset image feature extraction model to extract deep semantic information of the commodity image, and obtain a corresponding semantic feature map; the feature optimization module 1300 is configured to use a feature optimization network in the image feature extraction model, segment the semantic feature map by using multiple windows to obtain multiple feature sub-maps, and compress the feature sub-maps into feature vectors respectively, where part of the feature sub-maps contain the same feature region; the feature fusion module 1400 is configured to fuse all feature vectors into image feature information of the commodity image using a fusion network in the image feature extraction model.
On the basis of any embodiment of the present application, the feature optimization module 1300 includes: the data segmentation unit is used for determining the size of a window according to a preset scale, and segmenting a plurality of feature subgraphs from the semantic feature graph by applying corresponding windows according to preset different position information, wherein a local feature region between one feature subgraph and any other feature subgraph is overlapped; the compression processing unit is used for respectively implementing pooling operation on each characteristic subgraph to realize characteristic compression and obtain an initial characteristic vector; and the standardization processing unit is used for respectively carrying out full connection on the initial characteristic vectors and then normalizing the initial characteristic vectors into final characteristic vectors.
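The standardization processing unit's "full connection then normalization" step can be sketched as follows; the output dimension and the random weight initialization are illustrative assumptions, since in the trained model the full-connection weights are learned parameters:

```python
import numpy as np

def standardize(initial_vector, out_dim=128, seed=0):
    """Project an initial (pooled) feature vector through a full
    connection, then normalize it into the final feature vector.
    The weight matrix is randomly initialized here for illustration."""
    rng = np.random.default_rng(seed)
    in_dim = initial_vector.shape[0]
    W = rng.standard_normal((out_dim, in_dim)) / np.sqrt(in_dim)
    v = W @ initial_vector               # full connection (bias omitted for brevity)
    return v / np.linalg.norm(v)         # L2 normalization to a unit vector
```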
On the basis of any embodiment of the present application, the feature fusion module 1400 includes: the mean value fusion unit is used for taking the mean values of the plurality of feature vectors to realize fusion and obtaining the image feature information of the commodity image; alternatively, it comprises: and the weighted fusion unit is used for weighting and summing the plurality of characteristic vectors to realize fusion so as to obtain the image characteristic information of the commodity image.
On the basis of any embodiment of the present application, after the feature fusion module 1400, the apparatus comprises: a query matching module configured to take the commodity image as a commodity image to be queried and calculate the semantic similarity between its image feature information and the image feature information of each commodity image in a commodity database, where the image feature information of each commodity image in the commodity database is generated in advance by the image feature extraction model; a screening processing module configured to screen out the commodity images with relatively high semantic similarity in the commodity database, acquire the commodity information of the source commodities corresponding to those images, and construct the commodity information as a commodity information list; and a result output module configured to output the commodity information list.
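A hedged sketch of this query-matching and screening flow: `db_info` is a hypothetical list of commodity records aligned row-by-row with the database feature matrix, and cosine similarity via dot products assumes L2-normalized vectors; neither detail is mandated by the text.

```python
import numpy as np

def retrieve(query_vec, db_vecs, db_info, top_k=3):
    """Match a query commodity image's feature vector against a commodity
    database and return the commodity information list for the most
    similar items."""
    sims = db_vecs @ query_vec             # semantic similarity per database item
    order = np.argsort(-sims)[:top_k]      # screen items with highest similarity
    return [db_info[i] for i in order]     # construct the commodity info list
```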
On the basis of any embodiment of the present application, the apparatus further comprises modules that run a training process for the image feature extraction model before the semantic extraction module 1200 operates, including: an architecture generation module configured to connect the image feature extraction model with a classifier to form a model training architecture; and a training implementation module configured to carry out classification task training on the image feature extraction model with a preset training data set, based on the model training architecture, until the model reaches a convergence state. The training data set comprises a plurality of training samples and their supervision labels, the training samples being commodity images.
On the basis of any embodiment of the present application, the training implementation module includes: the sample representing unit is used for calling a single training sample in the training data set to input the image feature extraction model to determine the image feature information of the training sample; the classification prediction unit is set to adopt a classifier in the model training framework to obtain classification probability of each class mapped to a preset classification space according to the image characteristic information as a classification result; and the iteration updating unit is set to calculate the total loss value of the classification result by adopting the supervision labels corresponding to the training samples, when the model training framework does not reach the convergence state, the model training framework is subjected to gradient updating according to the total loss value, and the next training sample is called to carry out iteration training until the convergence state is reached.
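The classification prediction unit's mapping from image feature information to per-class probabilities can be sketched as a linear classifier followed by softmax; the weight matrix is an assumed learned parameter of the classifier, shown here only to make the mapping concrete:

```python
import numpy as np

def classify(feature, weight):
    """Map image feature information to classification probabilities over
    a preset classification space.

    weight: (num_classes, dim) classifier matrix (a learned parameter).
    """
    logits = weight @ feature
    logits = logits - logits.max()       # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()               # classification probability per class
```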
On the basis of any embodiment of the present application, the iterative update unit includes: the first calculation unit is used for calculating a first loss value corresponding to the image characteristic information by adopting the supervision label; the second calculation unit is used for calculating the data distance between every two feature vectors which are generated by the image feature extraction model and used for constructing the image feature information, and solving the average value of all the data distances as a second loss value; and the fusion calculation unit is arranged to fuse the first loss value and the second loss value to obtain the total loss value.
Another embodiment of the present application also provides a commodity image feature representation device. Fig. 11 schematically illustrates the internal structure of the commodity image feature representation device, which comprises a processor, a computer-readable storage medium, a memory and a network interface connected through a system bus. The non-transitory computer-readable storage medium of the device stores an operating system, a database and computer-readable instructions; the database can store information sequences, and the computer-readable instructions, when executed by the processor, cause the processor to implement a commodity image feature representation method.
The processor of the commodity image feature representation device provides computing and control capability and supports the operation of the whole device. The memory of the device may store computer-readable instructions which, when executed by the processor, may cause the processor to perform the commodity image feature representation method of the present application. The network interface of the device is used for connecting and communicating with terminals.
It will be understood by those skilled in the art that the structure shown in fig. 11 is a block diagram of only a portion of the structure relevant to the present application, and does not constitute a limitation on the article image feature representation apparatus to which the present application is applied, and that a particular article image feature representation apparatus may include more or fewer components than shown in the figures, or may combine certain components, or have a different arrangement of components.
In this embodiment, the processor is configured to execute the specific functions of the modules in fig. 10, and the memory stores the program codes and various data required for executing those modules or sub-modules. The network interface is used for data transmission between user terminals or servers. In this embodiment, the non-volatile readable storage medium stores the program codes and data necessary for executing all modules of the commodity image feature representation device of the present application, and the server can call these program codes and data to execute the functions of all modules.
The present application further provides a non-transitory readable storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the method for image feature representation of an article of manufacture of any of the embodiments of the present application.
The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the method as described in any of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by instructing the relevant hardware through a computer program, which may be stored in a non-volatile readable storage medium and, when executed, may include the processes of the method embodiments described above. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM) or a Random Access Memory (RAM).
In summary, the image feature information extracted for a commodity image according to the present application is more robust and, in particular, can effectively distinguish subtle differences among similar commodity images, making it better suited to representing the semantic features of commodity images on an e-commerce platform. Even for commodity images of the same model but different styles, the image feature information obtained according to the present application yields a clear distinguishing effect, and thus more accurate retrieval results in application scenarios of commodity retrieval based on commodity images.

Claims (10)

1. A method for representing an image feature of a commodity, comprising:
acquiring a commodity image;
inputting the commodity image into an image encoder of a preset image feature extraction model to extract deep semantic information of the commodity image, and obtaining a corresponding semantic feature map;
adopting a feature optimization network in the image feature extraction model, segmenting the semantic feature graph by using a plurality of windows to obtain a plurality of feature sub-graphs, and respectively compressing the feature sub-graphs into feature vectors, wherein part of feature sub-graphs contain the same feature region;
and fusing all the feature vectors into the image feature information of the commodity image by adopting a fusion network in the image feature extraction model.
2. The commodity image feature representation method according to claim 1, wherein the step of segmenting the semantic feature map by applying a plurality of windows to obtain a plurality of feature sub-maps and compressing the feature sub-maps into feature vectors respectively comprises the steps of:
determining the size of a window according to a preset scale, and segmenting a plurality of characteristic subgraphs from the semantic characteristic graph by applying corresponding windows according to preset different position information, wherein a local characteristic region between one characteristic subgraph and any other characteristic subgraph is overlapped;
performing pooling operation on each feature subgraph to realize feature compression and obtain initial feature vectors;
and respectively carrying out full connection on the initial characteristic vectors and then normalizing the initial characteristic vectors into final characteristic vectors.
3. The commodity image feature representation method according to claim 1, wherein fusing all feature vectors into image feature information of the commodity image using a fusion network in the image feature extraction model includes:
averaging the plurality of feature vectors to realize fusion, and obtaining image feature information of the commodity image;
alternatively,
and weighting and summing the plurality of feature vectors to realize fusion, and obtaining the image feature information of the commodity image.
4. The commodity image feature representation method according to claim 1, wherein after fusing all feature vectors into image feature information of the commodity image using a fusion network in the image feature extraction model, the method comprises:
taking the commodity image as a commodity image to be checked, and calculating semantic similarity between image feature information of the commodity image to be checked and image feature information of each commodity image in a commodity database, wherein the image feature information of each commodity image in the commodity database is generated by adopting the image feature extraction model in advance;
screening partial commodity images with relatively high semantic similarity in the commodity database, acquiring commodity information of source commodities corresponding to the partial commodity images, and constructing the commodity information as a commodity information list;
and outputting the commodity information list.
5. The commodity image feature representation method according to any one of claims 1 to 4, wherein before the commodity image is input into an image encoder of a preset image feature extraction model to extract deep semantic information of the commodity image, and a corresponding semantic feature map is obtained, a training process of the image feature extraction model is started, and the method comprises the following steps:
connecting the image feature extraction model with a classifier to form a model training framework;
based on the model training framework, a preset training data set is adopted to carry out classification task training on the image feature extraction model, and the image feature extraction model is trained to a convergence state; the training data set comprises a plurality of training samples and supervision labels thereof, and the training samples are commodity images.
6. The commodity image feature representation method according to claim 5, wherein the training of the classification task to the image feature extraction model to a convergence state is performed by using a preset training data set based on the model training architecture, and includes:
calling a single training sample in the training data set to input the image feature extraction model to determine the image feature information of the training sample;
obtaining classification probability of each category mapped to a preset classification space by a classifier in the model training framework according to the image characteristic information as a classification result;
and calculating the total loss value of the classification result by adopting the supervision labels corresponding to the training samples, performing gradient updating on the model training framework according to the total loss value when the model training framework does not reach the convergence state, and calling the next training sample to perform iterative training until the convergence state.
7. The commodity image feature representation method according to claim 6, wherein calculating the overall loss value of the classification result by using the supervised labels corresponding to the training samples comprises:
calculating a first loss value corresponding to the image characteristic information by adopting the supervision label;
calculating the data distance between every two feature vectors which are generated by the image feature extraction model and used for constructing the image feature information, and solving the average value of all the data distances as a second loss value;
and fusing the first loss value and the second loss value to obtain the total loss value.
8. An article image feature representation apparatus, comprising:
an image acquisition module configured to acquire a commodity image;
the semantic extraction module is used for inputting the commodity image into an image encoder of a preset image feature extraction model to extract deep semantic information of the commodity image so as to obtain a corresponding semantic feature map;
the feature optimization module is arranged for adopting a feature optimization network in the image feature extraction model, segmenting the semantic feature graph by using a plurality of windows to obtain a plurality of feature sub-graphs and respectively compressing the feature sub-graphs into feature vectors, wherein part of the feature sub-graphs contain the same feature regions;
and the feature fusion module is configured to fuse all feature vectors into the image feature information of the commodity image by using a fusion network in the image feature extraction model.
9. An article image feature representation apparatus comprising a central processor and a memory, wherein the central processor is configured to invoke execution of a computer program stored in the memory to perform the steps of the method of any one of claims 1 to 7.
10. A non-transitory readable storage medium storing a computer program implemented according to the method of any one of claims 1 to 7 in the form of computer readable instructions, the computer program, when invoked by a computer, performing the steps included in the corresponding method.
CN202211259755.4A 2022-10-14 2022-10-14 Commodity image feature representation method and device, equipment, medium and product thereof Pending CN115565042A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211259755.4A CN115565042A (en) 2022-10-14 2022-10-14 Commodity image feature representation method and device, equipment, medium and product thereof


Publications (1)

Publication Number Publication Date
CN115565042A true CN115565042A (en) 2023-01-03

Family

ID=84745723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211259755.4A Pending CN115565042A (en) 2022-10-14 2022-10-14 Commodity image feature representation method and device, equipment, medium and product thereof

Country Status (1)

Country Link
CN (1) CN115565042A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524282A (en) * 2023-06-26 2023-08-01 贵州大学 Discrete similarity matching classification method based on feature vectors
CN116524282B (en) * 2023-06-26 2023-09-05 贵州大学 Discrete similarity matching classification method based on feature vectors

Similar Documents

Publication Publication Date Title
JP7058669B2 (en) Vehicle appearance feature identification and vehicle search methods, devices, storage media, electronic devices
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
US20240040074A1 (en) Systems and methods for presenting image classification results
CN108229341B (en) Classification method and device, electronic equipment and computer storage medium
CN110909182B (en) Multimedia resource searching method, device, computer equipment and storage medium
US20220351006A1 (en) Systems and methods for generating graphical user interfaces
CN112364204B (en) Video searching method, device, computer equipment and storage medium
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
CN110362663B (en) Adaptive multi-perceptual similarity detection and analysis
CN114896437B (en) Remote sensing image recommendation method based on available domain
CN111291765A (en) Method and device for determining similar pictures
CN115443490A (en) Image auditing method and device, equipment and storage medium
CN111522979B (en) Picture sorting recommendation method and device, electronic equipment and storage medium
CN112818162A (en) Image retrieval method, image retrieval device, storage medium and electronic equipment
CN115565042A (en) Commodity image feature representation method and device, equipment, medium and product thereof
CN115600109A (en) Sample set optimization method and device, equipment, medium and product thereof
CN113657087B (en) Information matching method and device
CN113989476A (en) Object identification method and electronic equipment
CN116883740A (en) Similar picture identification method, device, electronic equipment and storage medium
CN114821062A (en) Commodity identification method and device based on image segmentation
CN113465251B (en) Intelligent refrigerator and food material identification method
de Oliveira Werneck et al. Kuaa: A unified framework for design, deployment, execution, and recommendation of machine learning experiments
CN111611981A (en) Information identification method and device and information identification neural network training method and device
CN110688850A (en) Catering type determination method and device
US20230260006A1 (en) System and method for searching image of goods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination