CN108694471A

CN108694471A - A user preference prediction method based on a personalized attention network

Info

Publication number: CN108694471A
Application number: CN201810619393.2A
Authority: CN
Inventors: 夏春秋
Original assignee: Shenzhen Vision Technology Co Ltd
Current assignee: Shenzhen Vision Technology Co Ltd
Priority date: 2018-06-11
Filing date: 2018-06-11
Publication date: 2018-10-23

Abstract

The present invention proposes a user preference prediction method based on a personalized attention network. Its main components are a saliency prediction stream, a preference fitting stream, and the merging of the two streams. Given an input image, the personalized attention network (PANet) extracts deep features at multiple scales and passes them to the two streams: the saliency prediction stream generates a saliency map that is not influenced by user preference, while the preference fitting stream uses an object detection model architecture to generate a preference map according to the input preference. After the results of the two streams are combined, post-processing is applied (including the addition of a center prior), and the prediction is output as a pixel-level saliency map suited to the specific user. The proposed personalized attention network adapts to different user preferences and predicts the attention of different users more accurately and quickly, which favors practical application.

Description

A user preference prediction method based on a personalized attention network

Technical field

The present invention relates to the field of preference prediction, and more particularly to a user preference prediction method based on a personalized attention network.

Background art

Attention is a personal experience: even when different people face the same scene, their attention may concentrate on different regions or targets. Correctly predicting the attention of each user is crucial for human-computer interaction (HCI) applications. With recent progress in deep learning and increases in computing power, visual tasks such as object detection and saliency prediction have become both more accurate and faster. Accurately analyzing where users direct their attention helps merchants market to the people who are interested in a product, or adjust the placement of popular goods, so that sales tactics can be adapted in real time. Analyzing the attention of children reveals their points of interest, so that their interests and hobbies can be cultivated at the right time. Traditional prediction methods are computationally complex, slow and insufficiently accurate, which makes them difficult to put into practical use.

The present invention proposes a user preference prediction method based on a personalized attention network. Given an input image, the personalized attention network (PANet) extracts deep features at multiple scales and passes them to two streams: the saliency prediction stream generates a saliency map that is not influenced by user preference, while the preference fitting stream uses an object detection model architecture to generate a preference map according to the input preference. After the results of the two streams are combined, post-processing is applied (including the addition of a center prior), and the prediction is output as a pixel-level saliency map suited to the specific user. The proposed personalized attention network adapts to different user preferences and predicts the attention of different users more accurately and quickly, which favors practical application.

Summary of the invention

In view of the problems of traditional prediction methods, namely computational complexity, low speed and limited accuracy, the object of the present invention is to provide a user preference prediction method based on a personalized attention network. Given an input image, the personalized attention network (PANet) extracts deep features at multiple scales and passes them to two streams: the saliency prediction stream generates a saliency map that is not influenced by user preference, while the preference fitting stream uses an object detection model architecture to generate a preference map according to the input preference. After the results of the two streams are combined, post-processing is applied (including the addition of a center prior), and the prediction is output as a pixel-level saliency map suited to the specific user.

To solve the above problems, the present invention provides a user preference prediction method based on a personalized attention network, which mainly comprises:

(1) a saliency prediction stream;

(2) a preference fitting stream;

(3) merging the two streams.

Wherein, regarding the personalized attention network (PANet): PANet consists of two convolutional neural networks (CNNs) that share common feature extraction layers. The model requires three inputs: the original image to be processed, a user-defined mapping from detailed categories to super-categories, and a user preference vector over the super-categories. Given an input image, PANet extracts deep features at multiple scales and passes them to two streams: the saliency prediction stream generates a saliency map that is not influenced by user preference, while the preference fitting stream uses an object detection model architecture to generate a preference map according to the input preference. After the results of the two streams are combined, post-processing is applied (including the addition of a center prior), and the prediction is output as a pixel-level saliency map suited to the specific user. To train the PANet model, pixel-level regression is performed against ground-truth annotations, which are generated dynamically in the training generator for the given input preference.

Wherein, regarding the saliency prediction stream: to combine multi-scale features of the input image for saliency prediction, the model uses features extracted from different layers of VGG-16 and of the custom layers of the single-shot detector (SSD). Features at three different scales are used, of sizes 38 × 38, 19 × 19 and 10 × 10, and they are resampled to the size of the first. Combining the features of the second and third scales significantly improves the accuracy of saliency prediction. After rescaling, the feature maps are combined into a three-dimensional tensor of size 38 × 38 × 3, with 512 channels in total. The combined tensor is then passed through four convolutional layers with spatial kernels, having 64, 128, 4 and 1 feature channels respectively. The result is then reshaped into a three-channel two-dimensional tensor of size 38 × 38 and passed through further 1 × 1 convolutional layers, after which the network outputs the final result of the saliency prediction stream, which has a single feature channel.
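The resampling of the three feature scales to a common 38 × 38 grid can be illustrated with a minimal NumPy sketch. The function names `resize_nearest` and `stack_scales`, the use of nearest-neighbour interpolation, the concatenation along the channel axis and the per-scale channel count of 4 are all assumptions made for illustration; the text does not specify the interpolation method or the exact tensor layout.

```python
import numpy as np

def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) feature map to (out_h, out_w, C)."""
    in_h, in_w = fmap.shape[:2]
    rows = np.arange(out_h) * in_h // out_h   # source row for each output row
    cols = np.arange(out_w) * in_w // out_w   # source column for each output column
    return fmap[rows[:, None], cols[None, :]]

def stack_scales(fmaps, size=38):
    """Resize every feature map to size x size and concatenate along channels."""
    resized = [resize_nearest(f, size, size) for f in fmaps]
    return np.concatenate(resized, axis=-1)

# Hypothetical feature maps at the three scales named in the text.
rng = np.random.default_rng(0)
features = [rng.random((s, s, 4)) for s in (38, 19, 10)]
combined = stack_scales(features)   # shape (38, 38, 12)
```

A learned or bilinear upsampling could equally be used; nearest-neighbour is chosen here only because it keeps the sketch dependency-free.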

Wherein, regarding the preference fitting stream: this stream is identical to the custom layers of the SSD model, including the selected anchor generation layers, and generates feature maps at multiple scales. The output of this part is a concatenation of object category, confidence and coordinate information, which must be converted back into an image tensor by the non-maximum suppression (NMS) and mapping layers for further processing.

Further, regarding non-maximum suppression: in the NMS layer, the network decides whether to keep each detection according to its confidence. The appropriate confidence threshold differs between datasets; here the threshold is set to 0.5, which detects most small objects while keeping a reasonable false-alarm rate. The NMS layer converts the retained high-confidence predictions into a two-dimensional tensor in image space.
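The confidence test described above can be sketched as a simple filter. This covers only the thresholding step, not overlap-based suppression; the `Detection` record, the function name `keep_confident` and the use of `>=` rather than `>` at the threshold are assumptions for illustration.

```python
from typing import List, NamedTuple, Tuple

class Detection(NamedTuple):
    """Hypothetical record for one detector output."""
    category: int                      # index of the detailed category
    confidence: float                  # prediction confidence in [0, 1]
    box: Tuple[int, int, int, int]     # (x_min, y_min, x_max, y_max)

def keep_confident(detections: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Keep only detections whose confidence reaches the threshold."""
    return [d for d in detections if d.confidence >= threshold]

detections = [
    Detection(0, 0.9, (0, 0, 5, 5)),
    Detection(1, 0.3, (1, 1, 2, 2)),
    Detection(2, 0.5, (3, 3, 4, 4)),
]
kept = keep_confident(detections)   # drops the 0.3 detection
```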

Further, regarding the tensor: so that it can later be merged with the saliency prediction stream, the tensor created as the output of the NMS layer has spatial size 38 × 38 and N channels, where N equals the number of detailed categories. Each channel represents the predictions for one particular category.

Further, regarding the prediction information: for an input image, if an object of category Cat_i is predicted, the i-th channel of the tensor has non-zero pixels determined by the predicted position and the prediction confidence. The value at each pixel (x, y) is (Conf₁, …, Conf_k), where Conf₁, …, Conf_k are the confidences of the predictions whose bounding boxes enclose pixel (x, y).
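A minimal sketch of how retained detections might be written into the 38 × 38 × N tensor follows. The text gives only the tuple of confidences at each pixel, so reducing overlapping confidences with a maximum is an assumption, as are the function name `rasterize` and the convention that boxes are given in grid coordinates with exclusive upper bounds.

```python
import numpy as np

def rasterize(detections, num_categories, size=38):
    """Write detections into a (size, size, num_categories) confidence tensor.

    detections: iterable of (category, confidence, (x0, y0, x1, y1)), with the
    box in grid coordinates and x1, y1 exclusive (an assumed convention).
    """
    out = np.zeros((size, size, num_categories), dtype=np.float32)
    for cat, conf, (x0, y0, x1, y1) in detections:
        region = out[y0:y1, x0:x1, cat]        # view into channel `cat`
        # ASSUMPTION: overlapping boxes are combined by taking the maximum.
        np.maximum(region, conf, out=region)
    return out

# Two overlapping predictions of the same category.
tensor = rasterize([(1, 0.6, (0, 0, 10, 10)), (1, 0.9, (5, 5, 15, 15))],
                   num_categories=3)
```

Pixels covered by no bounding box stay at zero, so a channel is entirely zero when its category is not predicted in the image.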

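Further, regarding the output: the output channels of the NMS layer correspond to detailed categories, and the mapping layer combines them into channels representing super-categories. The mapping layer requires two additional inputs: the user preference vector and the defined mapping between super-categories and detailed categories. Consider such a mapping:

[Equation: mapping between a super-category SCat_i and its detailed categories; not reproduced in this text]

It indicates that the corresponding tensor channels are merged into a single channel SCat_i. The pixel-wise value of the new channel representing SCat_i is:

[Equation: pixel-wise value of the channel SCat_i; not reproduced in this text]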
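The formula for the merged pixel value is not reproduced in this text, so the following is only a sketch of what a mapping layer of this kind could look like. It assumes that the detailed-category channels belonging to one super-category are merged by a pixel-wise maximum and then scaled by that super-category's preference weight; both the merge rule and the names `map_to_super_categories`, `mapping` and `preference` are assumptions, not the method of the invention.

```python
import numpy as np

def map_to_super_categories(cat_tensor, mapping, preference):
    """Merge detailed-category channels into super-category channels.

    cat_tensor: (H, W, N) tensor with one channel per detailed category.
    mapping:    dict {super-category index: list of detailed-category indices}.
    preference: (S,) array of user preference weights, one per super-category.
    """
    h, w, _ = cat_tensor.shape
    out = np.zeros((h, w, len(preference)), dtype=cat_tensor.dtype)
    for s, members in mapping.items():
        if members:
            # ASSUMPTION: pixel-wise max over member channels, scaled by preference.
            out[..., s] = preference[s] * cat_tensor[..., members].max(axis=-1)
    return out

cat_tensor = np.zeros((2, 2, 3))
cat_tensor[0, 0, 0] = 0.6
cat_tensor[0, 0, 1] = 0.8
cat_tensor[1, 1, 2] = 0.5
merged = map_to_super_categories(cat_tensor, {0: [0, 1], 1: [2]},
                                 np.array([1.0, 0.5]))
```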

Wherein, regarding the merging of the two streams: the model merges the two streams by tensor concatenation and adds two 1 × 1 convolutional layers, with 8 and 1 channels respectively, to obtain more non-linearity. In addition, since attention generally concentrates at the center of the field of view, a center prior is added to the model before the final activation layer; it is obtained by summing all ground-truth saliency data SAL_gt in the dataset and then normalizing the sum to [0, 1].

Further, regarding the ground-truth saliency data: the prior map is generated from the saliency labels in the dataset:

$$\mathrm{Prior} = \sum \mathrm{SAL}_{gt} \qquad (2)$$

Finally, a Softmax activation layer is added, so that the final prediction is output as a probability map.
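The merging step as a whole (concatenation, two 1 × 1 convolutions with 8 and 1 channels, center prior, Softmax) can be sketched in NumPy as follows. The ReLU between the two convolutions, the additive use of the prior, the Softmax taken over all spatial positions and the randomly initialized weights are assumptions for illustration; the text states only that a prior is added before the final activation.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1 x 1 convolution: x is (H, W, C_in), weight is (C_in, C_out)."""
    return x @ weight + bias

def softmax2d(x):
    """Softmax over all spatial positions of an (H, W) map."""
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

def merge_streams(saliency, preference, prior, w1, b1, w2, b2):
    """saliency: (H, W, 1); preference: (H, W, S); prior: (H, W)."""
    x = np.concatenate([saliency, preference], axis=-1)   # (H, W, 1 + S)
    x = np.maximum(conv1x1(x, w1, b1), 0.0)               # 8 channels, ReLU assumed
    x = conv1x1(x, w2, b2)[..., 0]                        # single channel, (H, W)
    return softmax2d(x + prior)                           # additive prior assumed

rng = np.random.default_rng(0)
saliency = rng.random((38, 38, 1))
preference = rng.random((38, 38, 3))      # 3 hypothetical super-categories
prior = rng.random((38, 38))
w1, b1 = rng.standard_normal((4, 8)), np.zeros(8)
w2, b2 = rng.standard_normal((8, 1)), np.zeros(1)
prob_map = merge_streams(saliency, preference, prior, w1, b1, w2, b2)
```

Because the Softmax normalizes over the whole map, the output sums to one and can be read as a distribution over pixel positions.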

Brief description of the drawings

Fig. 1 is a system flow chart of the user preference prediction method based on a personalized attention network according to the present invention.

Fig. 2 shows the non-maximum suppression layer of the user preference prediction method based on a personalized attention network according to the present invention.

Fig. 3 shows the mapping layer of the user preference prediction method based on a personalized attention network according to the present invention.

Detailed description

It should be noted that, provided there is no conflict, the embodiments of the present application and the features of those embodiments may be combined with one another. The invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a system flow chart of the user preference prediction method based on a personalized attention network according to the present invention. The method mainly comprises a saliency prediction stream, a preference fitting stream, and the merging of the two streams.

The personalized attention network (PANet) consists of two convolutional neural networks (CNNs) that share common feature extraction layers. The model requires three inputs: the original image to be processed, a user-defined mapping from detailed categories to super-categories, and a user preference vector over the super-categories. Given an input image, PANet extracts deep features at multiple scales and passes them to two streams: the saliency prediction stream generates a saliency map that is not influenced by user preference, while the preference fitting stream uses an object detection model architecture to generate a preference map according to the input preference. After the results of the two streams are combined, post-processing is applied (including the addition of a center prior), and the prediction is output as a pixel-level saliency map suited to the specific user. To train the PANet model, pixel-level regression is performed against ground-truth annotations, which are generated dynamically in the training generator for the given input preference.

In the saliency prediction stream, to combine multi-scale features of the input image for saliency prediction, the model uses features extracted from different layers of VGG-16 and of the custom layers of the single-shot detector (SSD). Features at three different scales are used, of sizes 38 × 38, 19 × 19 and 10 × 10, and they are resampled to the size of the first. Combining the features of the second and third scales significantly improves the accuracy of saliency prediction. After rescaling, the feature maps are combined into a three-dimensional tensor of size 38 × 38 × 3, with 512 channels in total. The combined tensor is then passed through four convolutional layers with spatial kernels, having 64, 128, 4 and 1 feature channels respectively. The result is then reshaped into a three-channel two-dimensional tensor of size 38 × 38 and passed through further 1 × 1 convolutional layers, after which the network outputs the final result of the saliency prediction stream, which has a single feature channel.

The preference fitting stream is identical to the custom layers of the SSD model, including the selected anchor generation layers, and generates feature maps at multiple scales. The output of this part is a concatenation of object category, confidence and coordinate information, which must be converted back into an image tensor by the non-maximum suppression (NMS) and mapping layers for further processing.

To merge the two streams, the model uses tensor concatenation and adds two 1 × 1 convolutional layers, with 8 and 1 channels respectively, to obtain more non-linearity. In addition, since attention generally concentrates at the center of the field of view, a center prior is added to the model before the final activation layer; it is obtained by summing all ground-truth saliency data SAL_gt in the dataset and then normalizing the sum to [0, 1].

The prior map is generated from the saliency labels in the dataset:

$$\mathrm{Prior} = \sum \mathrm{SAL}_{gt} \qquad (1)$$
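The computation of the prior can be sketched as a sum over the ground-truth saliency maps followed by normalization. The text says only that the sum is normalized to [0, 1]; the min-max rescaling used here, the function name `center_prior` and the handling of a constant sum are assumptions.

```python
import numpy as np

def center_prior(gt_maps):
    """Sum ground-truth saliency maps of shape (M, H, W) and rescale to [0, 1]."""
    total = np.asarray(gt_maps, dtype=np.float64).sum(axis=0)   # Prior = sum of SAL_gt
    lo, hi = total.min(), total.max()
    if hi == lo:
        # Degenerate case (constant sum): return an all-zero prior.
        return np.zeros_like(total)
    # ASSUMPTION: min-max normalization is used to reach the [0, 1] range.
    return (total - lo) / (hi - lo)

# Three hypothetical 4 x 4 ground-truth maps with a salient central region.
gt = np.zeros((3, 4, 4))
gt[:, 1:3, 1:3] = 1.0
gt[0, 0, 0] = 1.0
prior = center_prior(gt)
```

The prior depends only on the dataset labels, so it can be computed once before training and reused for every image.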

Fig. 2 shows the non-maximum suppression layer of the user preference prediction method based on a personalized attention network according to the present invention. The network decides whether to keep each detection according to its confidence. The appropriate confidence threshold differs between datasets; here the threshold is set to 0.5, which detects most small objects while keeping a reasonable false-alarm rate. The NMS layer converts the retained high-confidence predictions into a two-dimensional tensor in image space.

So that it can later be merged with the saliency prediction stream, the tensor created as the output of the NMS layer has spatial size 38 × 38 and N channels, where N equals the number of detailed categories. Each channel represents the predictions for one particular category.

For an input image, if an object of category Cat_i is predicted, the i-th channel of the tensor has non-zero pixels determined by the predicted position and the prediction confidence. The value at each pixel (x, y) is (Conf₁, …, Conf_k), where Conf₁, …, Conf_k are the confidences of the predictions whose bounding boxes enclose pixel (x, y).

Fig. 3 shows the mapping layer of the user preference prediction method based on a personalized attention network according to the present invention. The output channels of the NMS layer correspond to detailed categories, and the mapping layer combines them into channels representing super-categories. The mapping layer requires two additional inputs: the user preference vector and the defined mapping between super-categories and detailed categories. Consider such a mapping:

For those skilled in the art, the present invention is not limited to the details of the above embodiments, and it may be realized in other specific forms without departing from its spirit and scope. In addition, those skilled in the art may make various modifications and variations to the present invention without departing from its spirit and scope, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Claims

1. A user preference prediction method based on a personalized attention network, characterized by mainly comprising: a saliency prediction stream (1); a preference fitting stream (2); and merging the two streams (3).

2. The personalized attention network (PANet) according to claim 1, characterized in that PANet consists of two convolutional neural networks (CNNs) that share common feature extraction layers; the model requires three inputs: the original image to be processed, a user-defined mapping from detailed categories to super-categories, and a user preference vector over the super-categories; given an input image, PANet extracts deep features at multiple scales and passes them to two streams: the saliency prediction stream generates a saliency map that is not influenced by user preference, while the preference fitting stream uses an object detection model architecture to generate a preference map according to the input preference; after the results of the two streams are combined, post-processing is applied (including the addition of a center prior), and the prediction is output as a pixel-level saliency map suited to the specific user; to train the PANet model, pixel-level regression is performed against ground-truth annotations, which are generated dynamically in the training generator for the given input preference.

3. The saliency prediction stream (1) according to claim 1, characterized in that, to combine multi-scale features of the input image for saliency prediction, the model uses features extracted from different layers of VGG-16 and of the custom layers of the single-shot detector (SSD); features at three different scales are used, of sizes 38 × 38, 19 × 19 and 10 × 10, and they are resampled to the size of the first; combining the features of the second and third scales significantly improves the accuracy of saliency prediction; after rescaling, the feature maps are combined into a three-dimensional tensor of size 38 × 38 × 3, with 512 channels in total; the combined tensor is then passed through four convolutional layers with spatial kernels, having 64, 128, 4 and 1 feature channels respectively; the result is then reshaped into a three-channel two-dimensional tensor of size 38 × 38 and passed through further 1 × 1 convolutional layers, after which the network outputs the final result of the saliency prediction stream, which has a single feature channel.

4. The preference fitting stream (2) according to claim 1, characterized in that the stream is identical to the custom layers of the SSD model, including the selected anchor generation layers, and generates feature maps at multiple scales; the output of this part is a concatenation of object category, confidence and coordinate information, which must be converted back into an image tensor by the non-maximum suppression (NMS) and mapping layers for further processing.

5. The non-maximum suppression according to claim 4, characterized in that, in the NMS layer, the network decides whether to keep each detection according to its confidence; the appropriate confidence threshold differs between datasets; the threshold is set to 0.5, which detects most small objects while keeping a reasonable false-alarm rate; the NMS layer converts the retained high-confidence predictions into a two-dimensional tensor in image space.

6. The tensor according to claim 5, characterized in that, so that it can later be merged with the saliency prediction stream, the tensor created as the output of the NMS layer has spatial size 38 × 38 and N channels, where N equals the number of detailed categories; each channel represents the predictions for one particular category.

7. The prediction information according to claim 5, characterized in that, for an input image, if an object of category Cat_i is predicted, the i-th channel of the tensor has non-zero pixels determined by the predicted position and the prediction confidence; the value at each pixel (x, y) is (Conf₁, …, Conf_k), where Conf₁, …, Conf_k are the confidences of the predictions whose bounding boxes enclose pixel (x, y).

8. The output according to claim 6, characterized in that the output channels of the NMS layer correspond to detailed categories, and the mapping layer combines them into channels representing super-categories; the mapping layer requires two additional inputs: the user preference vector and the defined mapping between super-categories and detailed categories; consider such a mapping:

[Equation: mapping between a super-category SCat_i and its detailed categories; not reproduced in this text]

It indicates that the corresponding tensor channels are merged into a single channel SCat_i; the pixel-wise value of the new channel representing SCat_i is:

[Equation: pixel-wise value of the channel SCat_i; not reproduced in this text]

9. The merging of the two streams (3) according to claim 1, characterized in that the model merges the two streams by tensor concatenation and adds two 1 × 1 convolutional layers, with 8 and 1 channels respectively, to obtain more non-linearity; in addition, since attention generally concentrates at the center of the field of view, a center prior is added to the model before the final activation layer, obtained by summing all ground-truth saliency data SAL_gt in the dataset and then normalizing the sum to [0, 1].

10. The ground-truth saliency data according to claim 9, characterized in that the prior map is generated from the saliency labels in the dataset:

$$\mathrm{Prior} = \sum \mathrm{SAL}_{gt} \qquad (2)$$