CN115082702A - Image and e-commerce image processing method, device and storage medium - Google Patents

Image and e-commerce image processing method, device and storage medium

Info

Publication number
CN115082702A
Authority
CN
China
Prior art keywords
image
commerce
significance
saliency
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210646454.0A
Other languages
Chinese (zh)
Inventor
孟子皓
王文强
黄勃
虞旭林
陈清付
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210646454.0A priority Critical patent/CN115082702A/en
Publication of CN115082702A publication Critical patent/CN115082702A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides an image and e-commerce image processing method, device and storage medium. When detecting the saliency of an e-commerce image, the e-commerce image processing method provided by the embodiment of the application takes into account two characteristics of e-commerce images: compared with semantic objects, user attention is more easily attracted by the text in the image, and regions outside the text area still receive a considerable number of fixation points. Saliency detection and text detection are therefore performed on the e-commerce image separately, and a second saliency map of the e-commerce image is generated based on the first saliency map obtained by saliency detection and the text probability map obtained by text detection, so that saliency detection tailored to the characteristics of e-commerce images is achieved and the accuracy of e-commerce image saliency detection is improved.

Description

Image and e-commerce image processing method, device and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image and e-commerce image processing method, device, and storage medium.
Background
For online shopping, e-commerce images play a strong inspiring and guiding role for consumers and are central to the whole shopping activity: they introduce products, assist visual search, attract consumers and influence their final decisions. Due to the inherent nature of online shopping, the core goal of an e-commerce image is to attract the attention of customers. E-commerce image content is typically a combination of pictures and text so as to effectively attract and inform customers. Therefore, saliency prediction for e-commerce images is of great significance for providing enhanced guidance information and a better shopping experience for consumers.
Existing saliency prediction work has focused almost exclusively on natural images. However, since the design goals of e-commerce images and natural images are fundamentally different, existing saliency prediction methods perform poorly when predicting the saliency of e-commerce images.
Disclosure of Invention
Aspects of the present application provide an image and e-commerce image processing and model training method, device and storage medium, so as to implement significance prediction for e-commerce images.
The embodiment of the application provides an e-commerce image processing method, which comprises the following steps:
acquiring an e-commerce image;
carrying out saliency feature extraction on the e-commerce image to obtain saliency features of the e-commerce image;
according to the saliency features, saliency detection is carried out on the e-commerce image to obtain a first saliency map of the e-commerce image;
according to the saliency characteristics, carrying out character detection on the e-commerce image to obtain a character probability chart of the e-commerce image;
and generating a second saliency map of the E-commerce image according to the first saliency map and the character probability map.
An embodiment of the present application further provides an image processing method, including:
acquiring an image to be processed; the image to be processed includes: text information and images of other objects;
extracting the salient features of the image to be processed to obtain the salient features of the image to be processed;
according to the saliency features, saliency detection is carried out on the image to be processed to obtain a first saliency map of the image to be processed;
according to the saliency characteristics, carrying out character detection on the image to be processed to obtain a character prediction probability map of the image to be processed;
and generating a second saliency map of the image to be processed according to the first saliency map and the character prediction probability map.
The embodiment of the present application further provides a model training method, including:
acquiring an e-commerce image sample;
performing model training on an initial model of the significance detection model by using an electronic commerce image sample with a loss function minimization as a training target;
in the model training process, carrying out significance feature extraction on an e-commerce image sample by utilizing a plurality of network layers cascaded in a backbone network to obtain prediction significance features corresponding to the network layers;
inputting the predicted significance characteristics corresponding to the plurality of network layers into a character detection head for character detection to obtain a character prediction probability graph;
inputting the predicted significance characteristics corresponding to the last layer of the plurality of network layers into a significance detection head for significance detection to obtain a predicted significance map;
the loss function is determined according to the difference between the text prediction probability map and the text truth map, the difference between the predicted saliency map and the truth saliency map, and the differences between the predicted saliency features and the truth saliency features corresponding to the plurality of network layers; the truth saliency feature corresponding to each network layer has the same size as the predicted saliency feature corresponding to that network layer.
An embodiment of the present application further provides a computing device, including: a memory and a processor; wherein the memory is used for storing a computer program;
the processor is coupled to the memory for executing the computer program for performing the above-mentioned e-commerce image processing method, and/or steps in the model training method.
Embodiments of the present application also provide a computer-readable storage medium storing computer instructions, which, when executed by one or more processors, cause the one or more processors to perform the above-mentioned e-commerce image processing method, and/or steps in the model training method.
According to the e-commerce image processing method provided by the embodiment of the application, when detecting the saliency of an e-commerce image, two characteristics of e-commerce images are taken into account: compared with semantic objects, user attention is more easily attracted by the text in the image, and regions outside the text area still receive a considerable number of fixation points. Saliency detection and text detection are therefore performed on the e-commerce image separately, and a second saliency map of the e-commerce image is generated based on the first saliency map obtained by saliency detection and the text probability map obtained by text detection, so that saliency detection tailored to the characteristics of e-commerce images is achieved and the accuracy of e-commerce image saliency detection is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1a is a schematic diagram of an e-commerce image;
FIG. 1b is a point-of-interest density map of an e-commerce image provided by an embodiment of the present application;
fig. 2a and fig. 2b show the distribution of the proportion of fixation points falling outside the text area in e-commerce images provided by the embodiment of the present application;
FIG. 3a is a schematic view illustrating a process for calculating a viewing angle according to an embodiment of the present disclosure;
FIG. 3b shows viewing-angle measurement results obtained with the viewing-angle calculation process of FIG. 3a;
fig. 4 is a schematic flowchart of an e-commerce image processing method provided in the embodiment of the present application;
fig. 5 is a schematic process diagram of an online saliency detection of an e-commerce image by a saliency detection model provided in the embodiment of the present application;
fig. 6 is a schematic diagram of a significance detection model architecture and a schematic diagram of a model training process provided in an embodiment of the present application;
fig. 7 is a schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Conventional saliency prediction methods aim to predict human attention at the pixel level, relying primarily on low-level features including contrast, color, brightness, texture, and so forth. However, none of the conventional saliency prediction methods relates to saliency prediction of e-commerce (e-commerce for short) images. In the embodiment of the present application, the e-commerce image refers to an image produced for displaying commodity information in e-commerce, and generally includes commodity pictures and characters. For example, pictures made for displaying commodities and commodity price information on online shopping platforms (such as websites, APPs, etc.), and the like.
Due to the inherent nature of online shopping, the core goal of e-commerce images is to attract the attention of customers, which mainly involves two aspects: (1) attracting consumers to pay attention to a certain product while browsing a website, and at the same time advertising a certain brand; (2) attracting the consumer to a specific area in the image, such as the goods or the information the seller wants to convey (for example, a price-reduction promotion slogan). Thus, e-commerce image content is usually a combination of pictures and text so as to effectively attract and inform customers. For example, as shown in fig. 1a, the e-commerce image includes a semantic object (an XX thin notebook), text information (a price-reduction promotion slogan such as a limited-time discount of 80 off for every 300 spent, with a final price as low as xxxx), and so on. Therefore, saliency prediction for e-commerce images is of great significance for providing enhanced guidance information and a better shopping experience for consumers. Existing saliency prediction work has focused almost exclusively on natural images, using bottom-up or top-down approaches. However, since the design goals of e-commerce images and natural images differ in nature, existing methods fall short when predicting the saliency of e-commerce images. For example, existing methods treat object regions in natural images as one of the most important high-level cues, so they predict the key feature regions of the product in an e-commerce image to be as salient as the brand text regions.
Therefore, in order to address the priority that text receives in e-commerce images, a new method for predicting the saliency of e-commerce images needs to be studied. Meanwhile, the lack of e-commerce image data sets also hinders research on saliency prediction models. Therefore, the embodiment of the application provides a method for predicting the saliency of e-commerce images.
To address the shortage of e-commerce image data sets, the embodiment of the application establishes an eye-tracking data set of e-commerce images, which contains diverse e-commerce images collected from an online shopping platform and provides fixation-point maps and ground-truth saliency maps obtained through eye-tracking. To study human perception behavior on e-commerce images, the inventors built this e-commerce image data set to include a large number of e-commerce images with collected fixation points and annotated text boundaries. In summary, the eye-tracking data set of e-commerce images established in this application contains hundreds of thousands of fixation points from eye-tracking experiments with multiple subjects, and tens of thousands of text boxes annotated by multiple volunteers. By analyzing this eye-tracking data set, the inventors found that e-commerce images have the following characteristics:
(1) Characteristic 1: for e-commerce images, user attention is more easily attracted by the text in the image than by semantic objects. In the embodiment of the application, the semantic object in an e-commerce image mainly refers to the commodity picture in the e-commerce image.
Previous studies show that, for images of general natural scenes, visual attention is more likely to be attracted by semantic objects. The inventors first analyzed the ground-truth saliency maps of e-commerce images and found that the text areas of e-commerce images attract a great deal of attention. To further evaluate this characteristic, semantic object boxes and text boxes in the database were detected by YOLOv5 and CRAFT respectively, and the fixation densities (per 1000 pixels) falling on the text and object regions were then calculated. As shown in fig. 1b, for e-commerce images, the fixation density in text areas is much higher than that in object areas, and the density in object areas is only slightly higher than that in random areas. This is mainly because the core design goal of an e-commerce image is to display the commodity, so the commodity object tends to occupy most of the image area, but only a small portion of the object attracts visual attention. The above results indicate that, for e-commerce images, the user's visual attention is more easily attracted by text.
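For illustration only (not part of the original disclosure), the density comparison can be sketched as follows, assuming fixations and detected boxes are given as pixel coordinates; all function and variable names are illustrative.

```python
import numpy as np

def fixation_density(points, boxes, per=1000):
    """Fixations falling inside `boxes`, per `per` pixels of total box area.

    points: (N, 2) array of (x, y) fixation coordinates.
    boxes:  (M, 4) array of (x0, y0, x1, y1) text or object boxes.
    """
    area = np.sum((boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]))
    inside = 0
    for x, y in points:
        if np.any((boxes[:, 0] <= x) & (x < boxes[:, 2]) &
                  (boxes[:, 1] <= y) & (y < boxes[:, 3])):
            inside += 1
    return inside / max(area, 1) * per

# density in text regions vs. object regions (boxes e.g. from CRAFT / YOLOv5):
# text_density = fixation_density(fixations, text_boxes)
# object_density = fixation_density(fixations, object_boxes)
```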
(2) Characteristic 2: although the text in the e-commerce image can greatly attract the visual attention, a large amount of visual attention still exists outside the text area.
As can be seen from characteristic 1, text strongly attracts the user's visual attention. The inventors further calculated the number of fixation points outside the text area. Fig. 2a and 2b show the proportion of fixation points outside the text area. Fig. 2a shows, for e-commerce images whose text areas contain fixation points, the proportion of fixation points falling outside the text area; fig. 2b shows the same proportion against the area ratio of text regions that contain no fixation points. In fig. 2a and 2b, each point represents an e-commerce image in the e-commerce image data set, the abscissa represents the area ratio of the text region in the image, and the ordinate represents the proportion of the image's fixation points that fall outside the text area. As shown in fig. 2a, for most images in the e-commerce image data set, about 40% to 70% of the fixation points fall outside the text area. This means that, besides the text area, visual attention is also drawn to other areas with bottom-up or top-down saliency.
In fig. 2b, the inventors further calculated the relationship between the area ratio of text regions that contain no fixation points in each e-commerce image and the proportion of fixation points falling outside the text area relative to all fixation points of that image. From the results shown in fig. 2b, it can be seen that these text regions show no clustering, i.e., there is no particular range of text-area proportions that attracts a large number of fixation points. It is also worth noting that similar trends occur across different categories of e-commerce images.
The results shown in fig. 2a and 2b indicate that users' fixation points on e-commerce images are widely dispersed; therefore, saliency prediction for e-commerce images is complex and cannot be solved simply by a text detection method.
Characteristic 3: in e-commerce images, the visual attention between different user subjects is consistent, especially within the text area.
In order to study the consistency characteristics of e-commerce images, the inventors measured visual consistency in the e-commerce image data set by calculating the Linear Correlation Coefficient (LCC) between the saliency of a single subject and that of the other subjects, i.e. the correlation between one person's saliency map and the saliency map of the remaining subjects. As shown in Table 1, in order to understand in detail how consistency differs between regions when subjects view an image, the inventors calculated the LCC of each person against the others in different regions, including the whole image, the text region and the object region. Table 1 also lists the LCC values of two other eye-tracking data sets (LEDOV and Hollywood) as a reference. In addition, the visual consistency of positional bias in the e-commerce image data set is reflected by measuring the LCC between the saliency maps of 2 randomly selected e-commerce images in Table 1. From the results in Table 1 it can be concluded that visual consistency in the e-commerce image data set is similar to that in other eye-tracking data sets; it is also demonstrated again that users are more inclined to focus on text when viewing e-commerce images.
Table 1. Visual consistency (LCC) across different data sets
[Table 1 is provided as an image in the original publication and is not reproduced here.]
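For illustration only, a minimal sketch of the consistency measure: the linear correlation coefficient of one subject's saliency map against the map accumulated from the remaining subjects. Names and the masking mechanism (used e.g. to restrict the comparison to the text region) are assumptions, not from the patent.

```python
import numpy as np

def lcc(a, b):
    """Linear correlation coefficient between two saliency maps."""
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

def one_vs_others(per_subject_maps, mask=None):
    """Average LCC of each subject against the mean map of all other subjects."""
    scores = []
    for i, m in enumerate(per_subject_maps):
        others = np.mean([x for j, x in enumerate(per_subject_maps) if j != i], axis=0)
        if mask is not None:            # e.g. boolean mask of the text region
            m, others = m[mask], others[mask]
        scores.append(lcc(m, others))
    return float(np.mean(scores))
```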
Characteristic 4: gaze aversion in e-commerce images is typically much larger than foveal area (Fovea Region), indicating that non-local content in e-commerce images may attract visual attention trends. Among them, the fovea is the most acute region of vision (color discrimination, resolution) in the retina.
Fig. 3a is a schematic view of the viewing-angle calculation process provided by the present application. As shown in fig. 3a, the inventors evaluated gaze shifts in the e-commerce image data set by calculating the viewing angle θ between two consecutive fixations of each subject. In the eye-tracking experiment of the present application, the size of the screen and the distance between the subject (user) and the screen are fixed, and therefore the viewing angle can be calculated by trigonometric functions. In fig. 3a, point O represents the gaze position of the fovea of the eyeball; points A and B represent the gaze positions of two consecutive fixations. The viewing angle θ between the two consecutive fixations can be obtained by trigonometric calculation. Performing the measurement with the calculation process shown in fig. 3a yields the viewing-angle measurement results shown in fig. 3b. According to research on the human visual system, human visual attention is focused only within a small range of viewing angles not exceeding 2 degrees (deg). However, according to the measurement results shown in fig. 3b, only 25.8% of the subjects' gaze-shift viewing angles on e-commerce images fall within the foveal range; therefore, users' gaze-shift viewing angles on e-commerce images are comparatively large. The larger gaze-shift viewing angle indicates that human attention is more likely to be attracted by non-local content in e-commerce images. This may be because e-commerce images are designed to show all semantic objects and text on the image, rather than only a portion of the image.
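For illustration only, the viewing angle between two consecutive fixations can be approximated as below, assuming fixations are given in pixels and that the pixel pitch (cm per pixel) and viewing distance are known; this is a simplified approximation, not the patent's exact computation.

```python
import math

def viewing_angle(p1, p2, cm_per_px, view_dist_cm):
    """Approximate viewing angle (degrees) between two consecutive fixations p1, p2 (pixel coords)."""
    dx = (p2[0] - p1[0]) * cm_per_px
    dy = (p2[1] - p1[1]) * cm_per_px
    shift_cm = math.hypot(dx, dy)                    # on-screen distance between A and B
    # small-angle approximation: treat the first fixation as straight ahead of the eye at O
    return math.degrees(math.atan2(shift_cm, view_dist_cm))

# gaze shifts above roughly 2 degrees fall outside the foveal region
# theta = viewing_angle((812, 430), (1304, 655), cm_per_px=0.027, view_dist_cm=60)
```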
Based on the study of subjects' gaze characteristics on e-commerce images, the e-commerce image processing method provided by the embodiment of the application takes into account, when detecting the saliency of an e-commerce image, that compared with semantic objects, user attention is more easily attracted by the text in the image, and that regions outside the text area still receive a considerable number of fixation points. Saliency detection and text detection are therefore performed on the e-commerce image separately, and a second saliency map of the e-commerce image is generated based on the first saliency map obtained by saliency detection and the text probability map obtained by text detection, so that saliency detection tailored to the characteristics of e-commerce images is achieved and the accuracy of e-commerce image saliency detection is improved.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be noted that: like reference numerals refer to like objects in the following figures and embodiments, and thus, once an object is defined in one figure or embodiment, further discussion thereof is not required in subsequent figures and embodiments.
Fig. 4 is a schematic flow chart of an e-commerce image processing method provided in the embodiment of the present application. As shown in fig. 4, the e-commerce image processing method mainly includes:
401. and acquiring an e-commerce image.
402. And carrying out saliency feature extraction on the E-commerce image to obtain saliency features of the E-commerce image.
403. According to the saliency features, saliency detection is carried out on the E-commerce image to obtain a first saliency map of the E-commerce image.
404. And according to the saliency characteristics, carrying out character detection on the E-commerce image to obtain a character probability chart of the E-commerce image.
405. And generating a second saliency map of the E-commerce image according to the first saliency map and the character probability map.
In the embodiment of the present application, in order to perform saliency detection on the e-commerce image, in step 401, the e-commerce image may be acquired. The e-commerce image is any e-commerce image to be processed, and can be a two-dimensional image or a three-dimensional image. Of course, the e-commerce image can be a separate frame image, and can also be any video frame in the video. Optionally, for the device performing e-commerce image processing, an access request for an online shopping platform may be acquired; acquiring the identifier of the image to be accessed from the access request; and then, acquiring the e-commerce image corresponding to the identifier of the image to be accessed from the server of the online shopping platform as the e-commerce image in the step 401. For the description of the characteristics of the e-commerce image, reference is made to the above contents, and the description is omitted here.
Further, in step 402, salient feature extraction may be performed on the e-commerce image to obtain salient features of the e-commerce image. The image saliency is an important visual feature in an image, and represents the degree of importance of human eyes to certain areas of the image. The salient features may be in the form of a salient Feature Map (Feature Map) as a multi-dimensional Feature vector.
Based on the characteristic analysis of e-commerce images, it can be seen that, for an e-commerce image, user attention is more easily attracted by the text in the image than by semantic objects, and regions outside the text area still receive considerable attention. Therefore, in the embodiment of the present application, when detecting the saliency of the e-commerce image, saliency detection may be performed on the e-commerce image according to the saliency features in step 403 to obtain a saliency map of the e-commerce image; and in step 404, text detection is performed on the e-commerce image according to the saliency features to obtain a text probability map of the e-commerce image. Each probability value in the text probability map is the probability that the corresponding pixel in the e-commerce image belongs to text. Text detection may also be referred to as character detection.
In the embodiment of the present application, the significance detection in step 403 and the text detection in step 404 may be two non-intersecting branched networks, or two branched networks sharing a backbone network. In the embodiment of the present application, in order to improve the accuracy of the subsequent saliency detection, steps 403 and 404 may be two branch networks sharing a backbone network. Accordingly, step 402 may be implemented as: and carrying out significance feature extraction on the E-commerce image by utilizing a backbone network of the significance detection model to obtain significance features of the E-commerce image.
In some embodiments, the backbone network may be implemented as a neural network model. The neural network model may be a machine learning model of any structure, including but not limited to: CNN, DNN, RNN, FCN, and Transformer models.
Further, based on characteristics 2 and 4 of e-commerce images, the fixation-point information of e-commerce images is non-local. The attention mechanism of the Transformer model enables the model to attend to global perceptual cues when making predictions, so the Transformer model can be selected as the backbone network for extracting the saliency features of e-commerce images. However, the computational complexity of the Transformer model for extracting saliency features is high: this complexity problem has not been fully solved for vision tasks, because the self-attention mechanism needs to compute a correlation matrix of size N^2 over all N input tokens. Considering that visual information is inherently two-dimensional (images) or even three-dimensional (videos), even a slightly higher resolution greatly increases the computation of the Transformer model.
In order to reduce the computational load of saliency detection, in some embodiments of the present application, a Swin-Transformer model may be used as the backbone network. The Swin-Transformer model addresses the high computational cost of the Transformer model through its hierarchical architecture and shifted-window design, and has been verified to reduce computational complexity in a variety of vision tasks. The Swin-Transformer model improves efficiency through multi-scale layers and by learning self-attention maps within shifted windows.
The model architecture of the significance detection model is shown in fig. 5 and 6, and the layered architecture of the backbone network mainly refers to a plurality of cascaded network layers. Plural means 2 or more. Fig. 6 illustrates only the number of network layers being 4, but this is not limitative. As shown in fig. 5 and 6, the network layer may include: a Window Multi-head Self-Attention (W-MSA) module and a shift-Window Multi-head Self-Attention (SW-MSA) module.
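For reference, the core of a W-MSA / SW-MSA layer is window partitioning plus a cyclic shift of the feature map. The sketch below is a generic Swin-style illustration in PyTorch, not the patent's exact implementation.

```python
import torch

def window_partition(x, ws):
    """(B, H, W, C) -> (B * num_windows, ws * ws, C); self-attention is computed per window."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def shift_windows(x, ws):
    """Cyclic shift used by SW-MSA so that neighbouring windows exchange information."""
    return torch.roll(x, shifts=(-(ws // 2), -(ws // 2)), dims=(1, 2))

# W-MSA:  attention over window_partition(x, ws)
# SW-MSA: attention over window_partition(shift_windows(x, ws), ws), then roll back
```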
Based on the above saliency detection model, one implementation of step 403 is: using the saliency detection head (Saliency Head) of the saliency detection model, performing saliency detection on the e-commerce image according to the saliency features to obtain a saliency map of the e-commerce image. Accordingly, one implementation of step 404 is: using the text detection head (Text Head) of the saliency detection model, performing text detection on the e-commerce image according to the saliency features to obtain a text probability map of the e-commerce image.
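The structure just described — a shared backbone whose saliency features feed a saliency detection head and a text detection head — can be sketched as follows. This sketch is illustrative only and not part of the original disclosure; the class and method names are assumptions.

```python
import torch.nn as nn

class SaliencyDetectionModel(nn.Module):
    def __init__(self, backbone, saliency_head, text_head):
        super().__init__()
        self.backbone = backbone              # e.g. Swin-Transformer stages, multi-scale features
        self.saliency_head = saliency_head    # Saliency Head
        self.text_head = text_head            # Text Head

    def forward(self, image):
        feats = self.backbone(image)                        # one feature map per network layer
        sal_map = self.saliency_head(feats[-1])             # first saliency map (step 403)
        region_map, affinity_map = self.text_head(feats)    # text probability maps (step 404)
        return sal_map, region_map, affinity_map
```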
In the embodiment of the application, the saliency detection head and the text detection head can adopt lightweight and effective models to predict the saliency map and the text probability. In some embodiments, as shown in fig. 5, the saliency detection head may comprise: a plurality of Dense Blocks, a multi-scale information extraction module, and deconvolution blocks (DeConv). Plural means 2 or more; for example, there may be 3 or 4 dense blocks. Fig. 5 illustrates 3 dense blocks, but this is not a limitation. Optionally, the multi-scale information extraction module may perform multi-scale information extraction using an atrous spatial pyramid pooling (ASPP) structure, employing dilated convolutions at different sampling rates. For example, the saliency detection head of fig. 5 uses dilated convolutions at four different sampling rates; the dilated convolution shown in fig. 5 may be embodied as Dilated Conv. Further, deconvolution blocks can be used to recover the multi-scale information and obtain the saliency map. The number of deconvolution blocks is also plural (2 or more); fig. 5 illustrates 3 deconvolution blocks, but this is not a limitation.
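A minimal sketch of a saliency head along these lines (dense blocks omitted for brevity): parallel dilated convolutions at several sampling rates followed by deconvolution blocks. Channel widths and dilation rates are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class ASPPSaliencyHead(nn.Module):
    def __init__(self, in_ch, mid_ch=64, rates=(1, 6, 12, 18)):
        super().__init__()
        # parallel dilated convolutions at different sampling rates (ASPP-style)
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, mid_ch, 3, padding=r, dilation=r) for r in rates])
        # deconvolution blocks recover spatial resolution
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(mid_ch * len(rates), mid_ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(mid_ch, mid_ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(mid_ch, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):
        x = torch.cat([b(x) for b in self.branches], dim=1)
        return self.deconv(x)   # single-channel saliency map
```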
In the embodiment of the present application, the specific structure of the text detection head is not limited. In some embodiments, the text detection head may adopt the basic structure of CRAFT to enable character-level text detection based on the idea of segmentation. Optionally, as shown in fig. 5, the text detection head may include an upsampling block and a convolution block. After the saliency features output by the backbone network are upsampled and convolved, a character region score map (SOC) and an affinity score map (AFF) of the e-commerce image are output through a convolution module. Each score in the character region score map represents the probability that the corresponding pixel of the e-commerce image is the center of a character; each affinity score represents the probability of being the center of the region between adjacent characters. Based on this CRAFT-style architecture, in this embodiment, the text detection head of the saliency detection model may be used to perform character detection on the e-commerce image according to the saliency features to obtain the character region score map, and to detect the connection relations between characters in the e-commerce image according to the saliency features to obtain the affinity score map; the character region score map and the affinity score map together constitute the text probability map of the e-commerce image.
Based on the characteristic analysis of e-commerce images, it can be seen that, for an e-commerce image, user attention is more easily attracted by the text in the image than by semantic objects, and the region outside the text area still receives considerable attention. Therefore, in step 405, a saliency map of the e-commerce image can be generated from the saliency map obtained in step 403 and the text probability map obtained in step 404. In the embodiment of the present application, for convenience of description and distinction, the saliency map obtained in step 403 is defined as the first saliency map, and the saliency map obtained in step 405 is defined as the second saliency map.
Specifically, the pixel values of the same pixel coordinate in the text probability map and the first saliency map may be added to obtain a second saliency map of the e-commerce image.
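As an illustration (not part of the original disclosure), the pixel-wise fusion of step 405 can be sketched as follows; the final rescaling to [0, 1] is an added assumption.

```python
import numpy as np

def fuse_saliency(first_saliency, text_probability):
    """Second saliency map = first saliency map + text probability map (same resolution)."""
    second = first_saliency + text_probability
    return second / (second.max() + 1e-8)   # rescale to [0, 1]; this normalisation is an assumption
```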
According to the e-commerce image processing method provided by the embodiment of the application, when detecting the saliency of an e-commerce image, two characteristics of e-commerce images are taken into account: compared with semantic objects, user attention is more easily attracted by the text in the image, and regions outside the text area still receive a considerable number of fixation points. Saliency detection and text detection are therefore performed on the e-commerce image separately, and a second saliency map of the e-commerce image is generated based on the first saliency map obtained by saliency detection and the text probability map obtained by text detection, so that saliency detection tailored to the characteristics of e-commerce images is achieved and the accuracy of e-commerce image saliency detection is improved.
The embodiments of steps 402-404 described above may be implemented using a saliency detection model. As shown in fig. 5 and 6, the saliency detection model may include: a backbone network, a saliency detection head and a text detection head. The backbone network is used for extracting the saliency features of the e-commerce image; the saliency detection head and the text detection head respectively perform saliency detection and text detection on the e-commerce image according to the saliency features. Before the saliency detection model is used online, it needs to be trained. The training process of the saliency detection model is exemplarily explained below, taking a saliency detection model comprising a backbone network, a saliency detection head and a text detection head as an example.
In the embodiment of the application, in order to improve the network prediction performance, saliency information can be added to each Swin Transformer block. That is, the saliency prediction features output by each network layer in fig. 6 can be fed into the detection heads (the up-convolution blocks in the text detection head of fig. 6) so as to add saliency information. In the embodiment of the present application, for the last base layer of each stage's network layer, an attention loss L_a based on the saliency map is proposed to supervise the salient features learned in the backbone network. For the l-th network layer, the loss L_a is expressed as:
[Formula (1), given as an image in the original publication: the attention loss L_a for the l-th network layer, which measures the discrepancy between cor(A_{l,m}) and cor(S_l) averaged over the M channels of that layer.]
In formula (1), M represents the total number of channels in the l-th network layer; m denotes the m-th channel of the l-th layer, m = 1, 2, …, M. A_{l,m} is the self-attention map of the m-th channel output by the l-th layer, and S_l is a saliency map having the same size as the self-attention maps of the l-th layer. The saliency map S_l corresponding to each network layer is obtained by resizing the saliency truth map of the e-commerce image sample, so that its size equals that of the self-attention maps output by that network layer.
cor(·) in the above formula (1) can be expressed as:
cor(X) = softmax(vec(X) · vec(X)^T)    (2)
In formula (2), vec(X) vectorizes the h_l × w_l matrix X into a vector of length h_l · w_l, and softmax(·) denotes the softmax operation. In the embodiment of the present application, X in formula (2) is A_{l,m} or S_l.
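For illustration only, formula (2) can be written directly in code. The attention-loss sketch below additionally assumes a squared-error comparison between cor(A_{l,m}) and cor(S_l), since formula (1) itself is not reproduced in this text; the row-wise softmax is likewise an assumption.

```python
import torch
import torch.nn.functional as F

def cor(x):
    """cor(X) = softmax(vec(X) · vec(X)^T) for an (h_l, w_l) map X."""
    v = x.reshape(-1, 1)                       # vec(X) as a column vector
    return F.softmax(v @ v.t(), dim=-1)        # softmax applied row-wise (assumed)

def attention_loss(attn_maps, sal_map):
    """Assumed form of L_a: mean discrepancy between per-channel self-attention
    correlations and the correlation of the resized saliency map S_l."""
    target = cor(sal_map)
    losses = [F.mse_loss(cor(a), target) for a in attn_maps]   # a: (h_l, w_l) per channel
    return torch.stack(losses).mean()
```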
The main objective of the proposed attention loss is to guide the proposed backbone network (e.g., the Swin-Transformer backbone) to learn the non-local cues of e-commerce images in accordance with human perception, while maintaining the diversity brought by the multi-head self-attention maps. On the one hand, the supervised backbone network focuses on the non-local areas that humans emphasize most, which is crucial for tasks such as saliency prediction and text detection on e-commerce images, because people pay close attention to text areas when viewing e-commerce images. On the other hand, a global and uniform prior is also imposed on the network, which is very important in multi-task learning. Meanwhile, the output features of the backbone network are enhanced by this global cue of multi-task learning, which benefits the subsequent saliency detection and text detection.
In the embodiment of the present application, the loss function of the saliency detection head can be expressed by the difference between the saliency map predicted by the saliency detection head (i.e., the saliency prediction map) and the saliency truth map. For example, it can be computed as the Kullback-Leibler (KL) divergence between the two. Accordingly, the loss function L_s of the saliency detection head can be expressed as:
L_s = KL(S_p || S_gt)    (3)
In formula (3), S_p denotes the saliency prediction map of the e-commerce image sample output by the saliency detection head, and S_gt denotes the saliency truth map of the e-commerce image sample.
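A sketch of formula (3), treating both maps as probability distributions over pixel locations; the normalisation step is an assumption, not stated in the patent.

```python
import torch

def saliency_kl_loss(s_pred, s_gt, eps=1e-8):
    """L_s = KL(S_p || S_gt), computed over pixel locations."""
    p = s_pred.flatten() / (s_pred.sum() + eps)   # predicted saliency as a distribution
    q = s_gt.flatten() / (s_gt.sum() + eps)       # ground-truth saliency as a distribution
    return torch.sum(p * (torch.log(p + eps) - torch.log(q + eps)))
```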
In the embodiment of the present application, the text detection head may adopt the basic structure of CRAFT and, based on the idea of segmentation, perform character-level text detection. As shown in fig. 6, the saliency prediction features of different resolutions from different network layers of the backbone network undergo upsampling, convolution and concatenation operations to enhance information aggregation across resolutions, and a text probability prediction map is then output through several up-convolution modules. In fig. 6, the text probability prediction map is represented as a character region score map and an affinity score map. In the embodiment of the present application, the loss function of the text detection head can be expressed by the difference between the text probability prediction map and the text probability truth map; for example, it can be evaluated by the Mean Square Error (MSE) between them.
In the embodiment of the present application, since the character region score map and the affinity score map are generally sparse, directly summing their MSEs may cause a sample-imbalance problem, resulting in the network outputting values close to zero almost everywhere. To overcome this problem, a balanced MSE (BMSE) may be used to mitigate the tendency of the network to output all zeros. Accordingly, when computing the mean square error between the text probability prediction map and the text probability truth map, N_pos positive samples and N_neg negative samples can be randomly selected. The positive samples are e-commerce image samples labeled with correct saliency truth maps and text probability truth maps; the negative samples are e-commerce image samples whose saliency truth maps and text probability truth maps are labeled incorrectly. Accordingly, the calculation of BMSE is expressed as:
[Formula (4), given as an image in the original publication: the balanced MSE, i.e., the mean of the squared errors (X(i,j) − Y(i,j))^2 over the sampled positive and negative samples P ∪ N, normalized by N_pos + N_neg.]
In formula (4), P ∪ N represents the set of positive and negative samples, and (i, j) represents the pixel with coordinates (i, j) in the e-commerce image sample. X(i, j) represents the predicted text probability of pixel (i, j), and Y(i, j) represents the true text probability of pixel (i, j). N_pos and N_neg represent the numbers of positive and negative samples, respectively. Accordingly, the loss function L_t of the text detection head is expressed as:
L_t = BMSE(R_p, R_gt) + BMSE(A_p, A_gt)    (5)
In formula (5), R_p and R_gt respectively denote the character region score prediction map of the e-commerce image sample output by the text detection head and the character region score truth map of the e-commerce image sample; A_p and A_gt respectively denote the character affinity score prediction map and the character affinity score truth map of the e-commerce image sample output by the text detection head.
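For illustration, a sketch of the balanced MSE and the loss of formula (5); here the positive/negative sampling is done at pixel level via a boolean mask, which is an assumption about the granularity of the sampling.

```python
import torch

def bmse(pred, target, pos_mask, n_pos, n_neg):
    """Balanced MSE: squared error averaged over randomly chosen positive and negative samples."""
    pos_idx = torch.nonzero(pos_mask.flatten()).flatten()
    neg_idx = torch.nonzero(~pos_mask.flatten()).flatten()
    pos_idx = pos_idx[torch.randperm(len(pos_idx))[:n_pos]]
    neg_idx = neg_idx[torch.randperm(len(neg_idx))[:n_neg]]
    idx = torch.cat([pos_idx, neg_idx])
    return ((pred.flatten()[idx] - target.flatten()[idx]) ** 2).mean()

def text_loss(region_pred, region_gt, aff_pred, aff_gt, pos_mask, n_pos=512, n_neg=512):
    """L_t = BMSE over the region score maps + BMSE over the affinity score maps."""
    return (bmse(region_pred, region_gt, pos_mask, n_pos, n_neg) +
            bmse(aff_pred, aff_gt, pos_mask, n_pos, n_neg))
```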
Based on the above analysis, the loss function of the significance detection model can be expressed as:
L = λ_a · L_a + λ_s · L_s + λ_t · L_t    (6)
In formula (6), λ_a, λ_s and λ_t represent the loss weights of the backbone network, the saliency detection head and the text detection head, used to adjust the ratio between the losses of the different modules; L_a, L_s and L_t respectively represent the losses of the backbone network, the saliency detection head and the text detection head.
Based on the architecture of the saliency detection model shown in fig. 6, the model training process of the saliency detection model mainly includes the following steps:
and S1, acquiring an e-commerce image sample.
S2, performing model training on the backbone network, the significance detection head and the character detection head by using a loss function minimization as a training target and utilizing an electronic commerce image sample;
and S3, in the model training process, extracting the significance characteristics of the E-commerce image samples by utilizing a plurality of network layers cascaded in the backbone network to obtain the significance prediction characteristics corresponding to the network layers.
And S4, inputting the predicted significance characteristics corresponding to the plurality of network layers into a character detection head for character detection to obtain a character probability prediction graph of the e-commerce image sample.
And S5, inputting the significance prediction characteristics corresponding to the last layer of the plurality of network layers into a significance detection head for significance detection to obtain a significance prediction graph.
The loss function is determined according to the difference between the text probability prediction graph and the text probability true value graph, the difference between the significance prediction graph and the significance true value graph, and the difference between the significance prediction characteristic and the significance true value characteristic corresponding to the network layers; the significance truth characteristic corresponding to each network layer is the same as the significance prediction characteristic corresponding to the network layer in size. For a specific representation of the loss function, reference may be made to the above-mentioned relations of equation (1) to equation (6), which are not described herein again.
In the training process of the significance prediction model, significance prediction features with different resolutions generated by different network layers are added in the character detection head, so that the prediction performance of the character detection head can be improved, the accuracy of character detection by using the significance prediction model subsequently can be improved, and the accuracy of the significance map of the e-commerce image generated subsequently according to the character probability map obtained by character detection and the significance map obtained by significance detection can be improved.
In other embodiments, the saliency prediction features corresponding to a plurality of network layers can also be input into the saliency detection head for saliency detection. As can be seen from characteristic 1 of e-commerce images, the text of an e-commerce image attracts more attention than other areas, so text detection has a higher priority than saliency detection. Based on this, the saliency prediction features corresponding to only the first K of the network layers may be input into the saliency detection head for saliency detection, where K is less than the number of network layers.
In the embodiment of the present application, as can be seen from the above characteristics 1 and 2 of the e-commerce image, in the e-commerce image, the character detection and the saliency detection are mutually facilitated. Based on this, as shown in fig. 6, in the embodiment of the present application, in the stage of training the saliency detection model, information output from the saliency head may be fed back to the input end of the text detection head, and output information from the text detection head may be fed back to the input end of the saliency head. Such a flow of interaction information between the saliency detection head and the text detection head may improve learning of saliency prediction and text detection. Based on this, the forward process of the above significance detection head becomes:
[Formula (7), given as an image in the original publication.]
In formula (7), SalHead(·) denotes the saliency detection head; its input is the saliency prediction feature F output by the backbone network combined, via the element-wise product ⊙, with the character region score prediction map and the character affinity score prediction map output by the text detection head, each resized so that its size equals that of F. ⊙ denotes the element-wise product (dot product).
f(·) in formula (7) can be expressed as:
f(x) = ρ(x − 0.5) + 1    (8)
where 0 ≤ x ≤ 1 and ρ is a scale factor. For formula (7), x in formula (8) is the resized character region score prediction map or the resized character affinity score prediction map output by the text detection head.
in this way, the detected text region having a positive value can appropriately increase the importance of the feature at the corresponding position compared to the background region whose output is zero, so that the performance of the saliency detection head can be further improved by this additional information.
Similarly, the text probability prediction map output by the text detection head can be expressed as:
[Formula (9), given as an image in the original publication.]
In formula (9), TextHead(·) denotes the text detection head; its input is the saliency prediction feature F output by the backbone network combined, via the element-wise product, with the saliency prediction map output by the saliency detection head, resized to the same size as F. For formula (9), x in formula (8) is this resized saliency prediction map.
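For illustration only, the mutual feedback of formulas (7)-(9) might look as follows; the exact way the resized maps are combined with F (here, element-wise multiplication after applying f(·)) is an assumption, since the formulas are only given as images in the original publication, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def f(x, rho=1.0):
    return rho * (x - 0.5) + 1.0            # formula (8), with 0 <= x <= 1

def feedback_round(feat, sal_head, text_head, sal_prev, reg_prev, aff_prev, rho=1.0):
    """One round of information exchange; all maps are (N, 1, H, W) tensors."""
    size = feat.shape[-2:]
    resize = lambda m: F.interpolate(m, size=size, mode='bilinear', align_corners=False)
    # formula (7): saliency head sees features modulated by the previous text predictions
    sal = sal_head(feat * f(resize(reg_prev), rho) * f(resize(aff_prev), rho))
    # formula (9): text head sees features modulated by the previous saliency prediction
    reg, aff = text_head(feat * f(resize(sal_prev), rho))
    return sal, reg, aff

# first round: sal_prev, reg_prev, aff_prev are all-ones maps of the feature size
```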
based on an information feedback mechanism between the character detection head and the significance detection head, the training process of the significance prediction model comprises a plurality of training rounds, and for any current training round, significance prediction characteristics corresponding to a plurality of network layers output by a backbone network in the current training round and a significance prediction image output by the significance detection head in the previous training round can be input into the character detection head for character detection so as to obtain a character probability prediction image output by the character detection head in the current training round; inputting the significance prediction characteristics corresponding to the last layer of the plurality of network layers and a character probability prediction graph output by a character detection head in the previous training turn into the significance detection head for significance detection so as to obtain a significance prediction graph output by the significance detection head in the current training turn; wherein the current training round is any training round except the first training round.
Specifically, a first dot product between the significance characteristics corresponding to the multiple network layers output by the backbone network in the current training round and the significance prediction graph output by the significance detection head in the previous training round can be calculated, and the first dot product is input into the character detection head for character detection, so that the character probability prediction graph output by the character detection head in the current training round is obtained. And calculating a second dot product between the significance prediction features corresponding to the last layer of the plurality of network layers and the character probability prediction graph output by the character detection head in the previous training turn, inputting the significance detection head for significance detection, and obtaining the significance prediction graph output by the significance detection head in the current training turn.
It should be noted that, in the first training round, an all-ones matrix with the same size as the saliency prediction feature output by the backbone network may be used in place of the output of the previous round. A rough saliency prediction S_p is first obtained and fed back to the text detection head to obtain the character region score prediction map and the character affinity score prediction map output by the text detection head in the first training round; these text detection results are then fed back to the saliency detection head again to obtain a more refined saliency prediction map. Although iterating this process can further improve the prediction, the computational complexity increases and the gain decreases as the number of iterations grows. In order to limit the computational complexity, in the embodiment of the present application, information may be fed back to the text detection head and the saliency detection head only a set number of times, where the set number of times may be less than the number of training rounds; for example, the set number of times may be 1, 2 or 3.
Aiming at the information feedback mechanism between the character detection head and the significance detection head, because significance prediction characteristics with different resolutions generated by different network layers are added in the character detection head in the training process of the significance prediction model, the prediction performance of the character detection head can be improved; on the other hand, because the output information of the character detection head is fed back to the significance detection head, significance prediction characteristics with different resolutions generated by different network layers are also added into the significance detection head, and the significance detection method is also beneficial to improving the accuracy of significance detection by using a significance prediction model subsequently.
The above embodiment exemplarily provides a training process of the saliency detection model. After the saliency detection model is trained, it can be used to perform saliency detection and text detection on an e-commerce image to obtain a first saliency map and a text probability map of the e-commerce image; further, a second saliency map of the e-commerce image can be generated according to the first saliency map and the text probability map output by the saliency detection model. Since the second saliency map fuses the two characteristics of e-commerce images — that user attention is more easily attracted by the text than by semantic objects, and that regions outside the text area still receive considerable attention — the e-commerce image saliency detection method provided by the embodiment of the application matches the characteristics of e-commerce images and, compared with conventional saliency detection methods, can improve the accuracy of e-commerce image saliency detection.
In the embodiment of the present application, after the second saliency map of an e-commerce image is obtained, code rate allocation can be performed on the e-commerce image according to the second saliency map to obtain a code rate allocation result of the e-commerce image; the e-commerce image is then encoded according to the code rate allocation result, realizing perceptual optimization of the image coding. Further, the encoded e-commerce image can be transmitted to an image requesting end. Because the second saliency map reflects the attention paid by viewers to different regions of the e-commerce image, code rates can be allocated differently to regions of high and low attention, with regions of high attention receiving higher code rates, thereby achieving perceptual optimization of the image coding.
In the embodiment of the present application, the specific implementation of code rate allocation according to the second saliency map of the e-commerce image is not limited. In some embodiments, the e-commerce image may be segmented into a plurality of image blocks, where a plurality means two or more. An image block may be referred to as a Coding Tree Unit (CTU); one CTU is an N x N block of pixels. In some embodiments, N is 128 (for current 4K/8K ultra-high-definition images). The specific manner of segmenting the e-commerce image is likewise not limited; in some embodiments it may include one or more of a threshold-based segmentation method, an edge-detection-based segmentation method, a region-based segmentation method, and a deep-learning-based segmentation method, but is not limited thereto. A simple fixed-grid split into CTUs is sketched below.
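For illustration, a minimal fixed-grid split of an image into N x N coding tree units is sketched below in Python; this is only one possible segmentation and is not part of the original disclosure. Border blocks are simply left smaller when the resolution is not a multiple of N, which is an assumption.

```python
import numpy as np

def split_into_ctus(image: np.ndarray, n: int = 128):
    """Return the list of N x N blocks (CTUs) covering the image."""
    h, w = image.shape[:2]
    blocks = []
    for y in range(0, h, n):
        for x in range(0, w, n):
            blocks.append(image[y:y + n, x:x + n])
    return blocks
```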
Further, according to the second significance map of the e-commerce image obtained in step 405, code rate allocation may be performed on the plurality of image blocks to obtain target code rates of the plurality of image blocks, which are used as code rate allocation results of the e-commerce image. Further, the e-commerce image can be subjected to image coding according to the target code rates of the plurality of image blocks.
In some embodiments, a rate-distortion optimization method can be used to calculate the initial code rates of the plurality of image blocks, taking minimization of the coding distortion of the e-commerce image as the objective and the constraint that the sum of the code rates of the plurality of image blocks equals a set target bit number, so that the overall perceptual distortion of the e-commerce image is minimized. The calculation can be expressed as:
min Σ_{i=1}^{M} d_i    s.t.    Σ_{i=1}^{M} r_i = R        (10)
where d_i and r_i respectively denote the distortion and the initial code rate of the i-th image block, and M denotes the total number of image blocks in the current e-commerce image. R denotes the target bit number of the e-commerce image coding and can be set flexibly according to actual requirements. "s.t." indicates the constraint that the total number of bits of the e-commerce image is made equal to R.
In some embodiments, the distortion and the code rate of the image satisfy a certain curve relationship. For example, distortion and code rate of an image satisfy a hyperbolic function relationship. The curve relationship between the distortion and the code rate of the image can be expressed as:
D(r_i) = C · r_i^(-K)        (11)
In equation (11), C and K are constants. Jointly solving equations (10) and (11) yields the initial code rate r_i of each image block; one way to carry out this joint solution is sketched below.
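For illustration, the following Python sketch (not part of the original disclosure) shows one possible joint solution of equations (10) and (11). Under the hyperbolic model D(r_i) = C_i * r_i^(-K_i), the optimality condition C_i * K_i * r_i^(-(K_i+1)) = lambda gives r_i = (C_i * K_i / lambda)^(1/(K_i+1)); the Lagrange multiplier lambda is then found by bisection so that the rates sum to the target bit number R. The bracket and iteration count are assumptions.

```python
import numpy as np

def initial_rates(C: np.ndarray, K: np.ndarray, R: float, iters: int = 100) -> np.ndarray:
    """Initial code rates minimizing total distortion under sum(r_i) = R (sketch)."""
    def rates(lam):
        return (C * K / lam) ** (1.0 / (K + 1.0))

    lo, hi = 1e-12, 1e12                 # bracket for the Lagrange multiplier
    for _ in range(iters):
        mid = np.sqrt(lo * hi)           # bisection in the log domain
        if rates(mid).sum() > R:
            lo = mid                     # rates too large -> increase lambda
        else:
            hi = mid
    return rates(np.sqrt(lo * hi))
```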
Further, the initial code rates of the plurality of image blocks can be adjusted according to the second significance map of the e-commerce image, so that the target code rates of the plurality of image blocks can be obtained.
Specifically, saliency sub-maps corresponding to the plurality of image blocks can be determined from the second saliency map, and the saliency weights of the plurality of image blocks can then be calculated from these sub-maps. Optionally, the sum of the saliency values in the sub-map of each image block is calculated; the target image block with the largest sum is selected from the plurality of image blocks; and for any image block a, the ratio of the sum of the saliency values of image block a to the sum of the saliency values of the target image block is taken as the saliency weight of image block a, as sketched below.
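A minimal NumPy sketch of this weight computation follows (not part of the original disclosure): the saliency values inside each block's sub-map are summed and divided by the largest block sum, so the target block receives weight 1. The block size parameter is an assumption.

```python
import numpy as np

def block_saliency_weights(saliency_map: np.ndarray, n: int = 128) -> np.ndarray:
    """Per-block saliency weights from a single-channel saliency map (sketch)."""
    h, w = saliency_map.shape
    rows, cols = (h + n - 1) // n, (w + n - 1) // n     # ceiling division over the grid
    sums = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            sums[i, j] = saliency_map[i * n:(i + 1) * n, j * n:(j + 1) * n].sum()
    return sums / max(sums.max(), 1e-12)                # weight of the target block is 1
```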
In some embodiments, to keep the perceptual distortion of the image as small as possible, the code rate allocated to non-key coding tree units can be reduced on the basis of the R-lambda rate control algorithm, guided by the visual perception model, thereby reducing the perceptual distortion of the e-commerce image. Specifically, the saliency map of the current image is first output by the visual perception model, and a saliency weight value is computed per CTU block of the current image as the subjective weight w_i of the perceptual coding. When the encoder computes the code rate of each image block internally, the initial code rate of the image block is adjusted according to the saliency weight of that block to obtain its target code rate. The overall code rate is as shown in equation (12):
R = Σ_{i=1}^{M} r_i · (w_i + w_base)        (12)
In equation (12), r_i · (w_i + w_base) represents the target code rate of image block i, w_i is the saliency weight of image block i, and w_base is a preset correction weight. Because the saliency weight w_i of image block i is the ratio of the sum of the saliency values of image block i to the sum of the saliency values of the target image block, the sum of the saliency values of image block i may be 0. To prevent a zero saliency sum from affecting the subsequent coding, w_base is used to correct the weight of image block i. For example, w_base may be 0.5.
The size of the saliency weight map corresponding to the e-commerce image is obtained by dividing the resolution of the e-commerce image by the size of the image block. Accordingly, the dimensions of the saliency weight map can be expressed as:
h_w = (h + size - 1) / size        (13)
w_w = (w + size - 1) / size        (14)
In equations (13) and (14), w_w and h_w are respectively the width and height of the saliency weight map, h and w are the height and width of the e-commerce image, and size is the size of the image block, for example 128 in versatile video coding (VVC).
After the saliency weight of each image block has been determined, the initial code rates of the plurality of image blocks are adjusted using these saliency weights, under the constraint that the sum of the code rates of the image blocks equals the set target bit number, to obtain the target code rates of the plurality of image blocks. The specific adjustment follows equation (12) above; a sketch is given below.
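For illustration, the adjustment step can be sketched as follows in Python (not part of the original disclosure): each initial rate is scaled by (w_i + w_base) as in equation (12), and a final renormalization keeps the total equal to the target bit number R; the renormalization is an assumption made here to satisfy the sum constraint.

```python
import numpy as np

def adjust_rates(initial_rates: np.ndarray, weights: np.ndarray,
                 R: float, w_base: float = 0.5) -> np.ndarray:
    """Saliency-weighted target code rates with total budget R (sketch)."""
    weighted = initial_rates * (weights + w_base)   # equation (12)
    return weighted * (R / weighted.sum())          # renormalize to the bit budget
```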
Then, the e-commerce image is encoded according to the target code rates of the plurality of image blocks. Specifically, the image coding parameters of the plurality of image blocks can be calculated from their target code rates and the curve relationship between code rate and image coding parameters. The image coding parameter may be the quantization parameter (QP). Accordingly, the quantization parameter QP used to encode each CTU block can be calculated from the target code rate of that image block and the curve relationship between code rate and coding parameters, as shown in equations (15) and (16):
λ_i = α_i · bpp_i^(β_i)        (15)
QP_i = 4.2005 · ln λ_i + 13.7122        (16)
In equation (15), λ_i denotes the curve coefficient relating the distortion of image block i to its target code rate, α_i and β_i are coefficients related to the content of image block i, and bpp_i denotes the pixel depth (bits per pixel) of image block i, which can be calculated from the target code rate of image block i by the following formula:
bpp_i = R_i / (f · w_i · h_i)        (17)
In equation (17), R_i denotes the target code rate of image block i, f denotes the frame rate, and w_i and h_i respectively denote the width and height of image block i.
Based on the above equations (15)-(17), the image coding parameter, such as the quantization parameter QP, corresponding to each image block can be obtained. The plurality of image blocks can then be encoded using their respective image coding parameters, thereby encoding the e-commerce image. A sketch of this parameter computation is given below.
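For illustration, equations (15)-(17) can be chained as in the following Python sketch (not part of the original disclosure): the target rate of a block is converted to bits per pixel, mapped to lambda by the R-lambda model, and then to a quantization parameter. The content coefficients alpha and beta are assumed to be given, and clamping the QP to [0, 51] is an assumption.

```python
import numpy as np

def block_qp(target_rate: float, width: int, height: int, frame_rate: float,
             alpha: float, beta: float) -> int:
    bpp = target_rate / (frame_rate * width * height)   # equation (17)
    lam = alpha * (bpp ** beta)                          # equation (15)
    qp = 4.2005 * np.log(lam) + 13.7122                  # equation (16)
    return int(np.clip(round(qp), 0, 51))                # clamping is an assumption
```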
During the encoding of the e-commerce image, the code rate is redistributed over the regions of the image according to the attention paid by viewers to different regions, so that regions of high attention are allocated higher code rates and regions of low attention are allocated lower code rates. This preserves the coding quality of the high-attention regions while reducing the code rate spent on the low-attention regions, so that the overall coding rate of the e-commerce image can be reduced compared with a conventional encoder. The inventor of the present application verified this perceptual coding on versatile video coding (VVC); compared with the standard encoder, the perceptually coded image saves more than 20% in code rate.
The image processing method provided by the embodiment of the application can be applied to saliency detection and coding of the e-commerce image, and can also be applied to saliency detection and coding of other images comprising characters and semantic objects. Such as video images with subtitles, human-machine interaction interfaces (e.g., UI interfaces), electronic posters, etc. The following provides an exemplary description of the application of the image processing method provided in the embodiments of the present application to other application scenarios.
Fig. 7 is a schematic flowchart of an image processing method according to an embodiment of the present application. As shown in fig. 7, the method mainly includes:
701. Acquiring an image to be processed; the image to be processed includes: text information and images of other objects.
702. Extracting the saliency features of the image to be processed to obtain the saliency features of the image to be processed.
703. Performing saliency detection on the image to be processed according to the saliency features to obtain a first saliency map of the image to be processed.
704. Performing character detection on the image to be processed according to the saliency features to obtain a character probability map of the image to be processed.
705. Generating a second saliency map of the image to be processed according to the first saliency map and the character probability map.
In this embodiment, the image to be processed may be any image including text information and other objects. Such as video frame images with subtitles, UI interfaces, pop-ups, electronic posters, etc.
In the present embodiment, the image to be processed has characteristics similar to those of the e-commerce image: compared with semantic objects, user attention is more easily drawn to the characters in the image to be processed, and character regions still attract a large share of attention. Therefore, when performing saliency detection on the image to be processed, saliency detection and character detection are performed separately, and a second saliency map of the image to be processed is generated from the first saliency map obtained by saliency detection and the character probability map obtained by character detection. This realizes saliency detection tailored to the characteristics of the image to be processed and helps to improve the accuracy of its saliency detection. For specific implementations of the saliency detection and character detection on the image to be processed, reference may be made to the related description of saliency detection and character detection on the e-commerce image above, which is not repeated here.
It should be noted that, the executing subjects of the steps of the method provided in the foregoing embodiments may be the same device, or different devices may also be used as the executing subjects of the method. For example, the execution subject of steps 401 and 402 may be device a; for another example, the execution subject of step 401 may be device a, and the execution subject of step 402 may be device B; and so on.
In addition, some of the flows described in the above embodiments and drawings include a plurality of operations in a specific order, but it should be clearly understood that these operations may be executed out of the order in which they appear in the present application or in parallel. Serial numbers such as 403 and 404 are merely used to distinguish different operations and do not by themselves represent any execution order. Additionally, these flows may include more or fewer operations, and these operations may be executed sequentially or in parallel.
Accordingly, embodiments of the present application also provide a computer-readable storage medium storing computer instructions, which, when executed by one or more processors, cause the one or more processors to perform the above-mentioned e-commerce image processing method, and/or steps in the model training method.
Fig. 8 is a schematic structural diagram of a computing device according to an embodiment of the present application. In the embodiment of the present application, the implementation form of the computing device is not limited. Alternatively, the computing device may be implemented as a single server, a cloud-based server array, or the like; of course, the computing device may also be implemented as a terminal device such as a mobile phone or a computer. As shown in fig. 8, the computing device includes: a memory 80a and a processor 80b; the memory 80a is used for storing a computer program;
the processor 80b is coupled to the memory 80a for executing computer programs for: acquiring an e-commerce image; carrying out saliency feature extraction on the E-commerce image to obtain saliency features of the E-commerce image; according to the saliency characteristics, saliency detection is carried out on the E-commerce image to obtain a first saliency map of the E-commerce image; according to the saliency characteristics, carrying out character detection on the e-commerce image to obtain a character probability graph of the e-commerce image; and generating a second saliency map of the E-commerce image according to the first saliency map and the character probability map.
In some embodiments, the processor 80b, when performing the salient feature extraction on the e-commerce image, is specifically configured to: and carrying out significance feature extraction on the E-commerce image by utilizing a backbone network of the significance detection model to obtain significance features of the E-commerce image.
Correspondingly, when the processor 80b performs saliency detection on the e-commerce image to obtain the first saliency map of the e-commerce image, the processor is specifically configured to: and carrying out saliency detection on the E-commerce image according to the saliency features by utilizing a saliency detection head of the saliency detection model to obtain a first saliency map of the E-commerce image.
Optionally, when the processor 80b performs text detection on the e-commerce image, the processor is specifically configured to: and performing character detection on the E-commerce image by using a character detection head of the significance detection model according to the significance characteristics to obtain a character probability chart of the E-commerce image.
Further, when the processor 80b performs text detection on the e-commerce image according to the saliency characteristics by using the text detection head of the saliency detection model, the processor is specifically configured to: character detection is carried out on the electronic commerce image by utilizing a character detection head of the significance detection model according to the significance characteristics so as to obtain a character area score map of the electronic commerce image; each score in the character region score map represents the probability that a pixel corresponding to the E-commerce image is a character center; detecting the connection relation among the characters in the e-commerce image according to the saliency features to obtain a character affinity score map of the e-commerce image; the character affinity score represents the probability of the center of the adjacent character region; and determining a character region score map and a character affinity score map which are character probability maps.
In some embodiments, the processor 80b is further configured to: acquiring a commodity image sample; taking the minimization of the loss function as a training target, and carrying out model training on the backbone network, the significance detection head and the character detection head by utilizing the electronic commerce image sample; in the model training process, extracting significance characteristics of an e-commerce image sample by utilizing a plurality of network layers cascaded in a backbone network to obtain significance prediction characteristics corresponding to the network layers; inputting the significance prediction characteristics corresponding to the plurality of network layers into a character detection head for character detection to obtain a character probability prediction graph of the e-commerce image sample; inputting the significance prediction characteristics corresponding to the last layer of the plurality of network layers into a significance detection head for significance detection to obtain a significance prediction graph; the loss function is determined according to the difference between the text probability prediction graph and the text probability true value graph, the difference between the significance prediction graph and the significance true value graph, and the difference between the significance prediction characteristic and the significance true value characteristic corresponding to the network layers; the significance truth characteristic corresponding to each network layer is the same as the significance prediction characteristic corresponding to the network layer in size.
Wherein, the difference between the significance prediction diagram and the significance truth diagram is represented by the divergence between the significance prediction diagram and the significance truth diagram; the difference between the significance prediction features and the significance truth-value features corresponding to the plurality of network layers is represented by divergence between the significance prediction features and the significance truth-value features corresponding to the plurality of network layers; the significance truth-value feature corresponding to each network layer is obtained by adjusting the size of the significance truth-value diagram.
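For illustration only, a loss of the shape described above can be sketched in PyTorch as follows; it is not part of the original disclosure. Using a divergence for the saliency and multi-scale feature terms follows the description above, while treating the character term as binary cross-entropy, collapsing multi-channel features to one channel, and the normalization inside the divergence are assumptions made here.

```python
import torch
import torch.nn.functional as F

def kl_map(pred, target, eps=1e-8):
    # Divergence between two non-negative maps normalized to distributions.
    p = pred.flatten(1) / (pred.flatten(1).sum(dim=1, keepdim=True) + eps)
    q = target.flatten(1) / (target.flatten(1).sum(dim=1, keepdim=True) + eps)
    return (q * ((q + eps).log() - (p + eps).log())).sum(dim=1).mean()

def training_loss(text_pred, text_gt, sal_pred, sal_gt, layer_feats):
    """text_*: character probability maps; sal_*: saliency maps (B, 1, H, W);
    layer_feats: per-layer saliency prediction features (assumed non-negative)."""
    loss = F.binary_cross_entropy(text_pred, text_gt)        # character term (assumed BCE)
    loss = loss + kl_map(sal_pred, sal_gt)                   # saliency term
    for feat in layer_feats:                                 # multi-scale feature terms
        if feat.shape[1] != 1:
            feat = feat.mean(dim=1, keepdim=True)            # collapse channels (assumption)
        gt = F.interpolate(sal_gt, size=feat.shape[-2:],
                           mode="bilinear", align_corners=False)
        loss = loss + kl_map(feat, gt)                       # truth features = resized truth map
    return loss
```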
The above-mentioned e-commerce image sample includes a positive sample image and a negative sample image. The positive sample image is an e-commerce image sample labeled with a correct significance true value map and a correct character probability true value map; the negative sample image is an e-commerce image sample labeled with an incorrect significance true value map and an incorrect character probability true value map.
Optionally, the model training process comprises a plurality of training rounds. The processor 80b is further configured to: input the saliency prediction features corresponding to the plurality of network layers output by the backbone network in the current training round and the saliency prediction map output by the saliency detection head in the previous training round into the character detection head for character detection, so as to obtain the character probability prediction map output by the character detection head in the current training round; and input the saliency prediction feature corresponding to the last layer of the plurality of network layers and the character probability prediction map output by the character detection head in the previous training round into the saliency detection head for saliency detection, so as to obtain the saliency prediction map output by the saliency detection head in the current training round; the current training round is any training round other than the first training round.
Optionally, when inputting the saliency prediction features corresponding to the plurality of network layers output by the backbone network in the current training round and the saliency prediction map output by the saliency detection head in the previous training round into the character detection head, the processor 80b is specifically configured to: calculate a first dot product between the saliency prediction features corresponding to the plurality of network layers output by the backbone network in the current training round and the saliency prediction map output by the saliency detection head in the previous training round, and input the first dot product into the character detection head for character detection.
Correspondingly, when inputting the saliency prediction feature corresponding to the last layer of the plurality of network layers and the character probability prediction map output by the character detection head in the previous training round into the saliency detection head, the processor 80b is specifically configured to: calculate a second dot product between the saliency prediction feature corresponding to the last layer of the plurality of network layers and the character probability prediction map output by the character detection head in the previous training round, and input the second dot product into the saliency detection head for saliency detection.
In some embodiments, the processor 80b, when generating the second saliency map of the e-commerce image from the first saliency map and the text probability map, is specifically configured to: and adding the pixel values of the same pixel coordinate in the character probability graph and the first saliency graph to obtain a second saliency graph of the E-commerce image.
In the embodiment of the present application, the processor 80b is further configured to: carrying out image segmentation on the e-commerce image to obtain a plurality of image blocks; according to the second significance map, code rate distribution is carried out on the plurality of image blocks to obtain target code rates of the plurality of image blocks; and carrying out image coding on the E-commerce image according to the target code rates of the plurality of image blocks.
Optionally, the processor 80b is further configured to: calculate the initial code rates of the plurality of image blocks using a rate-distortion optimization method, with minimization of the coding distortion of the e-commerce image as the objective and the constraint that the sum of the code rates of the plurality of image blocks equals the set target bit number. Accordingly, when performing code rate allocation on the plurality of image blocks according to the second saliency map, the processor 80b is specifically configured to: adjust the initial code rates of the plurality of image blocks according to the second saliency map to obtain the target code rates of the plurality of image blocks.
Further, when the processor 80b adjusts the initial code rates of the plurality of image blocks according to the second saliency map, it is specifically configured to: determining a significance subgraph corresponding to a plurality of image blocks from the second significance graph; calculating significance weights of the image blocks according to significance subgraphs corresponding to the image blocks; and regulating the initial code rates of the plurality of image blocks by using the significance weight of the plurality of image blocks and taking the sum of the code rates of the plurality of image blocks equal to the target bit number as a constraint so as to obtain the target code rates of the plurality of image blocks.
Optionally, when the processor 80b calculates the significance weights of the plurality of image blocks according to the significance subgraph corresponding to the plurality of image blocks, it is specifically configured to: calculating the sum of the significance values in the significance subgraph corresponding to each image block; selecting a target image block with the largest sum of corresponding significance values from a plurality of image blocks; and calculating the ratio of the sum of the saliency values corresponding to any image block to the sum of the saliency values corresponding to the target image block as the saliency weight of any image block.
In some embodiments, the processor 80b is specifically configured to, when performing image coding on the e-commerce image according to the target code rates of the plurality of image blocks: calculating image coding parameters of the image blocks according to the target code rates of the image blocks and the curve relation between the code rates and the image coding parameters; and carrying out image coding on the plurality of image blocks by utilizing the image coding parameters of the plurality of image blocks so as to carry out image coding on the e-commerce image.
When performing saliency detection on an e-commerce image, the computing device provided by this embodiment takes into account that, compared with semantic objects, user attention is more easily drawn to the characters in the e-commerce image, and that character regions still attract a large share of attention. Saliency detection and character detection are therefore performed on the e-commerce image separately, and a second saliency map of the e-commerce image is generated from the first saliency map obtained by saliency detection and the character probability map obtained by character detection, realizing saliency detection tailored to the characteristics of e-commerce images and improving the accuracy of e-commerce image saliency detection.
In the embodiment of the present application, the processor 80b is further configured to: acquiring an image to be processed; the image to be processed includes: text messages and images of other objects; extracting the saliency characteristics of the image to be processed to obtain the saliency characteristics of the image to be processed; according to the saliency characteristics, saliency detection is carried out on the image to be processed to obtain a first saliency map of the image to be processed; according to the saliency characteristics, carrying out character detection on the image to be processed to obtain a character prediction probability map of the image to be processed; and generating a second saliency map of the image to be processed according to the first saliency map and the character prediction probability map.
For a specific implementation of performing text detection and saliency detection on an image to be processed, reference may be made to the above-mentioned related contents of text detection and saliency detection on an e-commerce image, which are not described herein again.
In some optional implementations, as shown in fig. 8, the computing device may further include: a communication component 80c, a power component 80d, and the like. In some embodiments, the computing device may be implemented as a terminal device such as a mobile phone or a computer, and accordingly, the computing device may further include: a display component 80e and an audio component 80f. Fig. 8 only schematically shows some of the components, which does not mean that the computing device must include all of the components shown in fig. 8, nor that the computing device can include only the components shown in fig. 8.
In embodiments of the present application, the memory is used to store computer programs and may be configured to store other various data to support operations on the device on which it is located. Wherein the processor may execute a computer program stored in the memory to implement the corresponding control logic. The memory may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
In the embodiments of the present application, the processor may be any hardware processing device that can execute the above described method logic. Alternatively, the processor may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or a Micro Controller Unit (MCU); programmable devices such as Field-Programmable Gate arrays (FPGAs), Programmable Array Logic devices (PALs), General Array Logic devices (GAL), Complex Programmable Logic Devices (CPLDs), etc. may also be used; or Advanced Reduced Instruction Set (RISC) processors (ARM), or System On Chips (SOC), etc., but is not limited thereto.
In embodiments of the present application, the communication component is configured to facilitate wired or wireless communication between the device in which it is located and other devices. The device in which the communication component is located can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, 4G, 5G or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component may also be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.
In the embodiment of the present application, the display assembly may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display assembly includes a touch panel, the display assembly may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
In embodiments of the present application, a power supply component is configured to provide power to various components of the device in which it is located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
In embodiments of the present application, the audio component may be configured to output and/or input audio signals. For example, the audio component includes a Microphone (MIC) configured to receive an external audio signal when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals. For example, for devices with language interaction functionality, voice interaction with a user may be enabled through an audio component, and so forth.
It should be noted that the descriptions of "first", "second", and the like in the present application are used to distinguish different messages, devices, modules, and so on; they do not represent a sequential order, nor do they limit "first" and "second" to be of different types.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
The storage medium of the computer is a readable storage medium, which may also be referred to as a readable medium. Readable storage media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (14)

1. An e-commerce image processing method is characterized by comprising the following steps:
acquiring an e-commerce image;
carrying out saliency feature extraction on the e-commerce image to obtain saliency features of the e-commerce image;
according to the saliency features, saliency detection is carried out on the e-commerce image to obtain a first saliency map of the e-commerce image;
according to the saliency characteristics, carrying out character detection on the e-commerce image to obtain a character probability chart of the e-commerce image;
and generating a second saliency map of the E-commerce image according to the first saliency map and the character probability map.
2. The method according to claim 1, wherein the saliency feature extraction of the e-commerce image to obtain the saliency features of the e-commerce image comprises:
carrying out saliency feature extraction on the e-commerce image by using a backbone network of a saliency detection model to obtain saliency features of the e-commerce image;
the detecting the saliency of the e-commerce image according to the saliency characteristic to obtain a first saliency map of the e-commerce image comprises:
carrying out saliency detection on the e-commerce image according to the saliency features by utilizing a saliency detection head of the saliency detection model to obtain a first saliency map of the e-commerce image;
the character detection is carried out on the e-commerce image according to the saliency characteristics to obtain a character probability graph of the e-commerce image, and the character probability graph comprises the following steps:
and performing character detection on the e-commerce image by using a character detection head of the significance detection model according to the significance characteristics to obtain a character probability chart of the e-commerce image.
3. The method according to claim 2, wherein the text detection head using the saliency detection model performs text detection on the e-commerce image according to the saliency features to obtain a text probability map of the e-commerce image, and the method comprises:
carrying out character detection on the e-commerce image by using a character detection head of the significance detection model according to the significance characteristics to obtain a character area score map of the e-commerce image; each score in the character region score map represents the probability that the corresponding pixel of the e-commerce image is a character center;
detecting the connection relation among the characters in the e-commerce image according to the saliency features to obtain a character affinity score map of the e-commerce image; the character affinity score represents a probability of a center of an adjacent character region;
and determining the character region score map and the character affinity score map as the character probability map.
4. The method of claim 2, further comprising:
acquiring a commodity image sample;
performing model training on the backbone network, the significance detection head and the character detection head by using an electronic commerce image sample with a loss function minimization as a training target;
in the model training process, extracting significance characteristics of an e-commerce image sample by utilizing a plurality of network layers cascaded in a backbone network to obtain significance prediction characteristics corresponding to the network layers;
inputting the significance prediction characteristics corresponding to the network layers into a character detection head for character detection so as to obtain a character probability prediction graph of the e-commerce image sample;
inputting the significance prediction characteristics corresponding to the last layer of the plurality of network layers into a significance detection head for significance detection to obtain a significance prediction graph;
the loss function is determined according to the difference between the text probability prediction graph and the text probability true value graph, the difference between the significance prediction graph and the significance true value graph, and the difference between the significance prediction characteristic and the significance true value characteristic corresponding to the plurality of network layers; the significance truth characteristic corresponding to each network layer is the same as the significance prediction characteristic corresponding to the network layer in size.
5. The method of claim 4, wherein the model training process comprises a plurality of training rounds, the method further comprising:
inputting the significance prediction characteristics corresponding to the plurality of network layers output by the backbone network in the current training round and the significance prediction map output by the significance detection head in the previous training round into the character detection head for character detection, so as to obtain the character probability prediction map output by the character detection head in the current training round;
inputting the significance prediction characteristics corresponding to the last layer of the plurality of network layers and the character probability prediction map output by the character detection head in the previous training round into the significance detection head for significance detection, so as to obtain the significance prediction map output by the significance detection head in the current training round;
the current training round is any training round except the first training round.
6. The method of claim 4, wherein the e-commerce image sample comprises: a positive sample image and a negative sample image; the positive sample image is an e-commerce image sample labeled with a correct significance true value map and a correct character probability true value map; the negative sample image is an e-commerce image sample labeled with an incorrect significance true value map and an incorrect character probability true value map.
7. The method according to any one of claims 1-6, wherein the generating a second saliency map of the e-commerce image from the first saliency map and the text probability map comprises:
and adding the pixel values of the same pixel coordinate in the character probability map and the first saliency map to obtain a second saliency map of the e-commerce image.
8. The method of any one of claims 1-6, further comprising:
carrying out image segmentation on the e-commerce image to obtain a plurality of image blocks;
according to the second significance map, code rate distribution is carried out on the plurality of image blocks to obtain target code rates of the plurality of image blocks;
and carrying out image coding on the E-commerce image according to the target code rates of the plurality of image blocks.
9. The method of claim 8, further comprising:
calculating initial code rates of the plurality of image blocks by adopting a rate distortion optimization method with the minimization of the coding distortion of the E-commerce image as a target and the sum of the code rates of the plurality of image blocks equal to a set target bit number as a constraint;
the allocating the code rates to the plurality of image blocks according to the second saliency map to obtain the target code rates of the plurality of image blocks includes:
and adjusting the initial code rates of the plurality of image blocks according to the second significance map to obtain the target code rates of the plurality of image blocks.
10. The method of claim 9, wherein the adjusting the initial coding rates of the plurality of image blocks according to the second saliency map comprises:
determining saliency subgraphs corresponding to the image blocks from the second saliency map;
calculating significance weights of the image blocks according to significance subgraphs corresponding to the image blocks;
and adjusting the initial code rates of the plurality of image blocks by using the significance weight of the plurality of image blocks and taking the sum of the code rates of the plurality of image blocks equal to the target bit number as a constraint so as to obtain the target code rates of the plurality of image blocks.
11. The method according to claim 8, wherein the image-coding the e-commerce image according to the target code rates of the plurality of image blocks comprises:
calculating image coding parameters of the image blocks according to the target code rates of the image blocks and the curve relation between the code rates and the image coding parameters;
and performing image coding on the plurality of image blocks by using the image coding parameters of the plurality of image blocks so as to perform image coding on the e-commerce image.
12. An image processing method, comprising:
acquiring an image to be processed; the image to be processed includes: text messages and images of other objects;
extracting the salient features of the image to be processed to obtain the salient features of the image to be processed;
according to the saliency features, saliency detection is carried out on the image to be processed to obtain a first saliency map of the image to be processed;
according to the saliency characteristics, carrying out character detection on the image to be processed to obtain a character probability graph of the image to be processed;
and generating a second saliency map of the image to be processed according to the first saliency map and the character probability map.
13. A computing device, comprising: a memory and a processor; wherein the memory is used for storing a computer program;
the processor is coupled to the memory for executing the computer program for performing the steps of the method of any of claims 1-12.
14. A computer-readable storage medium having stored thereon computer instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the method of any one of claims 1-12.
CN202210646454.0A 2022-06-08 2022-06-08 Image and e-commerce image processing method, device and storage medium Pending CN115082702A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210646454.0A CN115082702A (en) 2022-06-08 2022-06-08 Image and e-commerce image processing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210646454.0A CN115082702A (en) 2022-06-08 2022-06-08 Image and e-commerce image processing method, device and storage medium

Publications (1)

Publication Number Publication Date
CN115082702A true CN115082702A (en) 2022-09-20

Family

ID=83251908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210646454.0A Pending CN115082702A (en) 2022-06-08 2022-06-08 Image and e-commerce image processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN115082702A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination