CN116645244A - Dining guiding system and method based on computer vision - Google Patents

Dining guiding system and method based on computer vision Download PDF

Info

Publication number
CN116645244A
CN116645244A
Authority
CN
China
Prior art keywords
restaurant
user
visual
representing
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310613188.6A
Other languages
Chinese (zh)
Inventor
左智彬
宋添光
胡棵
左凌风
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Bozz Technology Co ltd
Original Assignee
Shenzhen Bozz Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Bozz Technology Co ltd filed Critical Shenzhen Bozz Technology Co ltd
Priority to CN202310613188.6A priority Critical patent/CN116645244A/en
Publication of CN116645244A publication Critical patent/CN116645244A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/12 Hotels or restaurants
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/10 Image acquisition
    • G06V10/16 Image acquisition using multiple overlapping images; Image stitching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778 Active pattern-learning, e.g. online learning of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/35 Categorising the entire scene, e.g. birthday party or wedding scene
    • G06V20/36 Indoor scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Tourism & Hospitality (AREA)
  • Human Resources & Organizations (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Primary Health Care (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a dining guidance system and method based on computer vision. Restaurant information is acquired and preprocessed into a data set. Visual features of multiple images from each view of a restaurant are extracted with a pre-trained deep learning model, and the visual feature vectors corresponding to the restaurant's beverages, food, interior views and exterior views are spliced to construct a visual feature matrix. User-related weights are projected onto the visual feature vector, which is reduced in dimension through an embedding matrix to obtain a low-dimensional feature vector. The user's predicted score for the restaurant is obtained, the restaurant's visual features are modeled to obtain restaurant visual information, the user's preference for the restaurant is predicted based on the restaurant visual information and the visual feature matrix, and dining guidance is completed according to that preference, so that personalized dining guidance is effectively provided to the user.

Description

Dining guiding system and method based on computer vision
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a dining guiding system and method based on computer vision.
Background
Vision is one of the most important channels through which humans perceive information; it plays an irreplaceable role in daily life, and visual information often influences the choices people make. Restaurant guidance algorithms typically rely on user-side information, restaurant-side information and the like. Existing guidance systems that do incorporate visual features generally represent an item's visual appearance with the feature vector of a single picture, or with the average- or max-pooled feature vectors of all pictures. In real life, however, most items carry multi-view image information. For restaurant guidance, using only a restaurant's menu pictures captures its food information but lacks visual information about its interior and exterior environments and its beverages, and pooling all of a restaurant's images with an average or maximum cannot fully exploit the differences between image categories, which lowers the accuracy of the dining guidance offered to the user.
Disclosure of Invention
In view of the above, the invention provides a dining guidance system and method based on computer vision that can effectively provide personalized dining guidance to the user, improving both the accuracy of the guidance and the user's dining experience. To solve the technical problems above, the invention is realized by the following technical scheme.
In a first aspect, the present invention provides a computer vision based dining guidance system comprising:
the data acquisition unit is used for acquiring restaurant information, carrying out data preprocessing on the restaurant information to obtain a data set, and marking the image categories in the data set as beverages, foods, internal scenes and external scenes, wherein the restaurant information comprises user ratings, user comments, restaurant images, restaurant positions and restaurant business information;
the feature extraction unit is used for extracting visual features of a plurality of images in each view angle of the restaurant through a pre-trained deep learning model, and aggregating the visual features of all the images in each category into a visual feature vector through average pooling;
the characteristic splicing unit is used for splicing the visual characteristic vectors corresponding to the beverage, food, internal scene and external scene of the restaurant to construct a visual characteristic matrix, wherein the visual characteristic matrix represents the multi-view visual characteristics of the restaurant;
the model construction unit is used for projecting the user-related weights of the visual feature matrix onto the visual feature vector, reducing the dimension of the restaurant visual feature vector through an embedding matrix to obtain a low-dimensional feature vector, and combining the low-dimensional feature vector with the restaurant's implicit factor to construct a prediction model;
the dining guiding unit is used for obtaining the prediction score of the user on the restaurant, introducing the hierarchical attention to model the visual characteristics of the restaurant to obtain the visual information of the restaurant, predicting the preference of the user on the restaurant based on the visual information of the restaurant and the visual characteristic matrix, and guiding according to the preference of the user on the restaurant to complete dining guiding.
As a further preferred aspect of the above-described technical solution, the hierarchical attention comprises an upper category attention layer and a lower image attention layer. Each image category in the image attention layer takes several image visual features as input; by computing image-level attention, all visual features under each image category are integrated into one vector representing that category's visual features, which expresses the restaurant's visual information under the category. The category attention layer then integrates the visual feature vectors of all images into the restaurant's visual information expression.
As a further preferable mode of the technical scheme, obtaining the user's predicted score for the restaurant and introducing hierarchical attention to model the restaurant's visual features to obtain restaurant visual information comprises the following steps:
modeling user preferences among different images under the same image category by adding an image-level attention mechanism, whose expression is

$$u_{ict} = \mathrm{ReLU}(W_{\alpha} f_{ict} + b_{\alpha}), \qquad \alpha_{ict} = \frac{\exp(u_{ict}^{\top} u_{\alpha})}{\sum_{t'} \exp(u_{ict'}^{\top} u_{\alpha})}, \qquad f_{ic} = \sum_{t} \alpha_{ict} f_{ict}$$

where $W_{\alpha}$ denotes a transformation matrix and $b_{\alpha}$ a bias term. For any restaurant $i$, $f_{ict}$ denotes the visual feature of the $t$-th image belonging to category $c$; a fully connected layer with ReLU as the activation function maps $f_{ict}$ to the implicit representation $u_{ict}$. A jointly learned context vector $u_{\alpha}$ measures the importance of each image's visual feature: the inner product of $u_{ict}$ and $u_{\alpha}$ gives the score of $u_{ict}$, i.e. its importance under the current category, and $f_{ic}$ denotes the attention-weighted sum of all visual features of the category.
As a further preferable mode of the technical scheme, the expression of the category-level attention mechanism is

$$u_{ic} = \mathrm{ReLU}(W_{\beta} f_{ic} + b_{\beta}), \qquad \beta_{ic} = \frac{\exp(u_{ic}^{\top} u_{\beta})}{\sum_{c'} \exp(u_{ic'}^{\top} u_{\beta})}, \qquad f_{i} = \sum_{c} \beta_{ic} f_{ic}$$

where $W_{\beta}$ denotes a transformation matrix, $b_{\beta}$ a bias term, $f_{i}$ the restaurant visual feature obtained through the hierarchical attention layers, and $u_{\beta}$ the category-level context vector. The obtained restaurant visual feature is used as part of the restaurant's latent factors to perform collaborative filtering with the user's latent factors.
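The two attention levels described above can be sketched in NumPy as follows. The dimensions, random initialization, and helper names (`attention_pool`, `softmax`) are illustrative assumptions, not the patent's actual implementation:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of scores.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(feats, W, b, u_ctx):
    """Attend over a set of feature vectors and return their weighted sum.

    feats: (n, d) visual features; W: (d, d) transform; b: (d,) bias;
    u_ctx: (d,) learned context vector. All shapes are toy assumptions.
    """
    hidden = np.maximum(0.0, feats @ W.T + b)  # u = ReLU(W f + b)
    weights = softmax(hidden @ u_ctx)          # attention over the set
    return weights @ feats                     # weighted sum of features

rng = np.random.default_rng(0)
d = 8
# Toy restaurant: 4 categories (drinks, food, interior, exterior), 3 images each.
categories = [rng.normal(size=(3, d)) for _ in range(4)]

W_a, b_a, u_a = rng.normal(size=(d, d)), rng.normal(size=d), rng.normal(size=d)
W_b, b_b, u_b = rng.normal(size=(d, d)), rng.normal(size=d), rng.normal(size=d)

# Image-level attention: one vector f_ic per category.
f_ic = np.stack([attention_pool(f, W_a, b_a, u_a) for f in categories])
# Category-level attention: one visual vector f_i for the restaurant.
f_i = attention_pool(f_ic, W_b, b_b, u_b)
print(f_i.shape)  # (8,)
```

The same pooling function serves both levels because each level is the same operation over a different set of inputs (images within a category, then the category vectors themselves).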
As a further preferred aspect of the above technical solution, projecting the user-related weights of the visual feature matrix onto the visual feature vector, and reducing the dimension of the restaurant visual feature vector through the embedding matrix to obtain a low-dimensional feature vector, comprises:
mapping the real scoring vectors of the user and the restaurant to their respective implicit factors, learned in a low-dimensional space through multi-layer fully connected networks, and enhancing the restaurant-side information with visual features by adding the visual factor $f_i$ to the restaurant implicit factor $p_i$;
cosine similarity between the user-side features and the restaurant-side features is used as the model's prediction function, whose expression is

$$\hat{y}_{u,i} = \cos(q_u,\; p_i + f_i) = \frac{q_u^{\top}(p_i + f_i)}{\lVert q_u \rVert\,\lVert p_i + f_i \rVert}$$

where $\hat{y}_{u,i}$ denotes the model's predicted score of user $u$ for restaurant $i$, $p_i$ and $q_u$ denote the implicit scoring factors of restaurant $i$ and user $u$ respectively, and $f_i$ denotes the restaurant visual factor learned by the attention module. The restaurant side is represented by the sum of $p_i$ and $f_i$, and the user's preference for the restaurant is predicted by the cosine similarity between the restaurant factor $(p_i + f_i)$ and the user scoring factor $q_u$. The implicit scoring vectors $p_i$ and $q_u$ of the restaurant and the user are mapped to the low-dimensional space by two multi-layer fully connected networks, $FC_{it}$ and $FC_{us}$, whose inputs are the scoring vectors $r_i$ and $r_u$ extracted from the user-restaurant interaction matrix $R$: $r_i$ denotes the vector of all users' scores for the $i$-th restaurant, and $r_u$ denotes user $u$'s scoring vector over all restaurants. A multi-layer fully connected network is likewise used to reduce the dimension of the restaurant visual factor $f_i$.
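As a rough sketch of the prediction function cos(q_u, p_i + f_i) described above, with toy 3-dimensional factors (the values and dimension are assumptions for illustration):

```python
import numpy as np

def predict_score(q_u, p_i, f_i):
    """Predicted score = cosine similarity between the user factor q_u
    and the restaurant factor (p_i + f_i)."""
    r = p_i + f_i
    return float(q_u @ r / (np.linalg.norm(q_u) * np.linalg.norm(r)))

q_u = np.array([1.0, 0.0, 1.0])   # user implicit scoring factor (toy values)
p_i = np.array([0.5, 0.2, 0.5])   # restaurant implicit factor
f_i = np.array([0.5, -0.2, 0.5])  # restaurant visual factor from attention

print(round(predict_score(q_u, p_i, f_i), 4))  # → 1.0
```

Here the visual factor cancels the implicit factor's second component, so the restaurant factor aligns perfectly with the user factor and the score reaches the cosine maximum of 1.0.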
As a further preferable aspect of the above-described aspect, the model predicting the user's preference for a restaurant by calculating the cosine similarity between the restaurant factor $(p_i+f_i)$ and the user scoring factor $q_u$ comprises:
presetting that one vector in an $n$-dimensional space represents one user; for two selected users $u$ and $v$, the cosine similarity between the two user vectors is

$$\mathrm{sim}(u,v) = \cos(\mathbf{u},\mathbf{v}) = \frac{\mathbf{u}^{\top}\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}$$

The rating-scale difference between different users is balanced by subtracting each user's mean score, giving

$$\mathrm{sim}(u,v) = \frac{\sum_{c \in L_{u,v}} (R_{u,c}-\bar{R}_u)(R_{v,c}-\bar{R}_v)}{\sqrt{\sum_{c \in L_u}(R_{u,c}-\bar{R}_u)^2}\;\sqrt{\sum_{c \in L_v}(R_{v,c}-\bar{R}_v)^2}}$$

where $L_{u,v}$ denotes the set of restaurants rated by both user $u$ and user $v$, $L_u$ and $L_v$ denote the sets of restaurants rated individually by users $u$ and $v$, $R_{u,c}$ and $R_{v,c}$ denote the scores of users $u$ and $v$ for restaurant $c$, and $\bar{R}_u$, $\bar{R}_v$ denote the mean scores of the restaurants rated by users $u$ and $v$ respectively.
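A minimal, pure-Python sketch of the mean-centered similarity above; the dict-based data layout is an assumption for illustration:

```python
import math

def mean_centered_sim(ratings_u, ratings_v):
    """Similarity of two users with each user's mean rating subtracted.

    ratings_u / ratings_v: dicts restaurant -> score. The numerator runs
    over co-rated restaurants L_{u,v}; each norm runs over that user's
    own rated set, matching the formula in the text.
    """
    mu = sum(ratings_u.values()) / len(ratings_u)
    mv = sum(ratings_v.values()) / len(ratings_v)
    common = ratings_u.keys() & ratings_v.keys()
    num = sum((ratings_u[c] - mu) * (ratings_v[c] - mv) for c in common)
    den_u = math.sqrt(sum((s - mu) ** 2 for s in ratings_u.values()))
    den_v = math.sqrt(sum((s - mv) ** 2 for s in ratings_v.values()))
    return num / (den_u * den_v) if den_u and den_v else 0.0

u = {"a": 5, "b": 3, "c": 4}  # user u's ratings (toy data)
v = {"a": 4, "b": 2, "d": 3}  # user v: same taste, shifted one star down
print(round(mean_centered_sim(u, v), 3))  # → 1.0
```

Because v's ratings are u's shifted down by one star, the raw scores differ everywhere, yet mean-centering reveals the identical preference pattern and the similarity is 1.0.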
As a further preferable aspect of the above technical solution, the restaurant score prediction model has the expression

$$\hat{y}_{u,i} = F(u, i \mid \theta)$$

where $\hat{y}_{u,i}$ denotes the predicted score of user $u$ for restaurant $i$, $\theta$ denotes the model parameters, and $F$ denotes the function mapping parameters to predicted scores. Two latent factors $q_u$ and $p_i$ represent user $u$ and restaurant $i$, and are expressed as

$$q_u = FC_{us}(r_u), \qquad p_i = FC_{it}(r_i)$$

where $FC_{us}$ and $FC_{it}$ denote the fully connected networks of user $u$ and restaurant $i$ respectively, and $\sigma$ denotes a single-layer fully connected network used in place of $r_i$ to keep it consistent with $f_i$. The cosine similarity between $q_u$ and $p_i$ is used to calculate the score that user $u$ is predicted to give restaurant $i$.
As a further preferred aspect of the above technical solution, splicing the visual feature vectors corresponding to the beverage, the food, the internal view and the external view of the restaurant to construct the visual feature matrix includes:
the expression of the restaurant guidance model with multi-view visual information is

$$\hat{y}_{u,i} = \alpha + \beta_u + \beta_i + \gamma_u^{\top}\gamma_i + \theta_u^{\top} E\, F_i\, \omega_u$$

where $\alpha$ denotes the global offset, $\beta_u$ and $\beta_i$ denote the bias terms of the user and the restaurant, $\gamma_u$ and $\gamma_i$ denote the latent factor vectors of the user and the restaurant respectively, $\theta_u$ denotes the visual factor of user $u$, $E$ denotes the embedding matrix, the matrix $F_i$ represents the visual features of restaurant $i$, and $\omega_u$ denotes a $4 \times 1$ weight vector corresponding to the four visual views of the restaurant. $A$ denotes the overall bias weight of the multi-view visual feature and $W$ denotes the category visual preference matrix of all users; the product of $A$ and $W$ represents the users' overall preference for the preset restaurant's multi-view visual appearance;
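Under the assumption that the model combines bias terms, latent factors, and an embedded, per-user-weighted fusion of the four visual views (a VBPR-style form consistent with the symbols defined above), a toy scoring sketch is:

```python
import numpy as np

def multiview_score(alpha, beta_u, beta_i, gamma_u, gamma_i, theta_u, E, F_i, omega_u):
    """Score = global offset + user/restaurant biases + latent interaction
    + user visual factor applied to the embedded, view-weighted features."""
    visual = theta_u @ (E @ (F_i @ omega_u))  # fuse 4 views, embed, project
    return alpha + beta_u + beta_i + gamma_u @ gamma_i + visual

rng = np.random.default_rng(1)
d_vis, d_emb, d_lat = 16, 6, 5                 # toy dimensions (assumed)
F_i = rng.normal(size=(d_vis, 4))              # columns: drinks, food, interior, exterior
omega_u = np.array([0.1, 0.5, 0.3, 0.1])       # this user weights food most
E = rng.normal(size=(d_emb, d_vis))            # shared embedding matrix
theta_u = rng.normal(size=d_emb)               # user visual factor
gamma_u, gamma_i = rng.normal(size=d_lat), rng.normal(size=d_lat)

s = multiview_score(0.1, 0.05, -0.02, gamma_u, gamma_i, theta_u, E, F_i, omega_u)
print(np.isfinite(s))  # a single real-valued score
```

The per-user weight vector `omega_u` is what lets the same four-view feature matrix produce different visual contributions for different users.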
training the model with the maximum a posteriori estimation of Bayesian analysis, presetting a training set $D_S$ consisting of triplets $(u,i,j)$ with the expression

$$D_S = \{(u,i,j) \mid u \in U,\; i \in I_u^{+},\; j \in I \setminus I_u^{+}\}$$

where $u$ denotes the user, $i$ a restaurant the user has visited, $j$ an unknown restaurant, and $I_u^{+}$ the set of all positive-sample restaurants of user $u$. Matrix factorization is used to predict user preferences; the predicted value is denoted $\hat{y}_{uij}$, and the prediction model's expression is

$$\hat{y}_{uij} = \hat{y}_{ui} - \hat{y}_{uj} = (\beta_i - \beta_j) + \gamma_u^{\top}\gamma_{ij} + \theta_u^{\top} E\, F_{ij}\, \omega_u$$

where $\gamma_{ij}$ denotes the difference between $\gamma_i$ and $\gamma_j$, and $F_{ij}$ denotes the difference between $F_i$ and $F_j$. The expression of the personalized ranking optimization criterion $C$ is

$$C = \sum_{(u,i,j)\in D_S} \ln \sigma(\hat{y}_{uij}) - \lambda_{\theta}\lVert\theta\rVert^{2}$$

where $\sigma$ denotes the sigmoid function and $\lambda_{\theta}$ denotes the regularization hyper-parameter to be tuned.
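The pairwise ranking criterion C can be illustrated with a small sketch; the toy scores and the dict-based data layout are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_criterion(scores, triplets, lam, theta_sq_norm):
    """Personalized ranking criterion C = sum ln sigma(y_ui - y_uj) - lam*||theta||^2.

    scores: dict (user, restaurant) -> predicted score y_ui;
    triplets: iterable of (u, i, j) with i a visited and j an unknown restaurant.
    """
    c = sum(np.log(sigmoid(scores[(u, i)] - scores[(u, j)])) for u, i, j in triplets)
    return c - lam * theta_sq_norm

# Toy scores: each visited restaurant already outranks the unknown one.
scores = {("u1", "a"): 2.0, ("u1", "z"): 0.5,
          ("u2", "b"): 1.0, ("u2", "z"): -1.0}
triplets = [("u1", "a", "z"), ("u2", "b", "z")]
c = bpr_criterion(scores, triplets, lam=0.01, theta_sq_norm=3.0)
print(c < 0.0)  # → True: ln sigma(.) is always negative
```

Maximizing C pushes each visited restaurant's score above the sampled unknown restaurant's score while the regularizer keeps the parameters small.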
As a further preferred aspect of the above technical solution, acquiring restaurant information and performing data preprocessing on the restaurant information to obtain a data set, includes:
expressing restaurants and their attributes as semantic vectors, and calculating the user's preference vector by computing the weights of the different attribute values of the restaurants the user has historically liked and integrating over those restaurants. Presetting $D_{ul}$ to represent the set of restaurants the user historically liked, and for a restaurant $d \in D_{ul}$, letting $h$ denote the restaurant entity corresponding to restaurant $d$ in the knowledge graph, the user dining attribute triplet expression is

$$B_u = \{(h,r,t) \mid h \in D_{ul}\}$$

where $(h, r, t)$ denotes a triplet with restaurant entity $h$ as the head entity, $r$ denotes an attribute of the restaurant entity, and $t$ denotes an attribute value;
calculating the weights of the different attribute features of the restaurants in the user's liked history as the user's attention to those attributes; the weight of attribute $r_i$ of restaurant $d$ is calculated as

$$W_{r_i} = \frac{\exp\!\left(h^{\top} M_h\, r_i\right)}{\sum_{(h,r,t)\in B_u} \exp\!\left(h^{\top} M_h\, r\right)}$$

where $h$ and $r_i$ respectively denote the entity vector and attribute-value vector corresponding to restaurant $d$ obtained by training, and $M_h$ denotes the projection matrix generated by the entity $h$ and its relation;
calculating the user's dining interest vector based on the attribute weights and attribute-value vectors: after computing the weight of each attribute in the user's restaurant attribute triplets, all attribute-value vectors of the user's liked restaurants are weighted and summed to represent the user's dining preference vector, whose expression is

$$p_u = \sum_{i} W_{r_i}\, t_i$$

where $t_i$ denotes the restaurant attribute-value vector generated by training and $W_{r_i}$ denotes a certain attribute weight of the restaurant.
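A small NumPy sketch of the attribute attention and weighted sum above; the projection-matrix name `M` and all dimensions are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dining_preference(h, M, attrs, values):
    """Weight each attribute by attention exp(h^T M r) normalized over all
    attributes, then return the weighted sum of the value vectors t_i."""
    logits = np.array([h @ (M @ r) for r in attrs])
    w = softmax(logits)            # W_{r_i}: attention to each attribute
    return w @ np.stack(values)    # p_u = sum_i W_{r_i} * t_i

rng = np.random.default_rng(2)
d = 6
h = rng.normal(size=d)                          # restaurant entity embedding
M = rng.normal(size=(d, d))                     # projection matrix (assumed name)
attrs = [rng.normal(size=d) for _ in range(3)]  # attribute embeddings r_i
values = [rng.normal(size=d) for _ in range(3)] # attribute-value vectors t_i

p_u = dining_preference(h, M, attrs, values)
print(p_u.shape)  # (6,)
```

The softmax normalization guarantees the attribute weights sum to one, so the preference vector stays in the convex hull of the attribute-value vectors.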
In a second aspect, the invention also provides a dining guiding method based on computer vision, which comprises the following specific steps:
acquiring restaurant information, carrying out data preprocessing on the restaurant information to obtain a data set, and marking image categories in the data set as beverages, foods, internal scenes and external scenes, wherein the restaurant information comprises user ratings, user comments, restaurant images, restaurant positions and restaurant business information;
extracting visual features of a plurality of images in each view angle of a restaurant through a pre-trained deep learning model, and aggregating the visual features of all images in each category into a visual feature vector through average pooling;
splicing visual feature vectors corresponding to beverages, foods, internal scenes and external scenes of the restaurant to construct a visual feature matrix, wherein the visual feature matrix represents multi-view visual features of the restaurant;
projecting the user-related weights of the visual feature matrix onto the visual feature vector, reducing the dimension of the restaurant visual feature vector through an embedding matrix to obtain a low-dimensional feature vector, and combining the low-dimensional feature vector with the restaurant's implicit factor to construct a prediction model;
the method comprises the steps of obtaining a prediction score of a user on a restaurant, introducing hierarchical attention to model visual features of the restaurant to obtain restaurant visual information, predicting preference of the user on the restaurant based on the restaurant visual features and a visual feature matrix, and guiding according to the preference of the user on the restaurant to complete dining guiding.
The invention provides a dining guidance system and method based on computer vision. Restaurant information is acquired and preprocessed into a data set, and the image categories in the data set are labeled as beverages, food, interior views and exterior views. Visual features of multiple images from each view of a restaurant are extracted with a pre-trained deep learning model, and the visual features of all images in each category are aggregated into one visual feature vector by average pooling. The visual feature vectors corresponding to the restaurant's beverages, food, interior views and exterior views are spliced to construct a visual feature matrix. User-related weights are projected onto the visual feature vector, the restaurant visual feature vector is reduced in dimension through an embedding matrix to obtain a low-dimensional feature vector, and the low-dimensional feature vector is combined with the restaurant's implicit factor to construct a prediction model. The user's predicted score for the restaurant is obtained, hierarchical attention is introduced to model the restaurant's visual features and obtain restaurant visual information, the user's preference for the restaurant is predicted based on the restaurant visual information and the visual feature matrix, and dining guidance is completed according to that preference, improving both the accuracy of the guidance and the user's dining experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a block diagram of a dining guidance system based on computer vision provided by the invention;
fig. 2 is a flowchart of a dining guiding method based on computer vision provided by the invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
Referring to fig. 1, the present invention provides a dining guiding system based on computer vision, comprising:
the data acquisition unit is used for acquiring restaurant information, carrying out data preprocessing on the restaurant information to obtain a data set, and marking the image categories in the data set as beverages, foods, internal scenes and external scenes, wherein the restaurant information comprises user ratings, user comments, restaurant images, restaurant positions and restaurant business information;
the feature extraction unit is used for extracting visual features of a plurality of images in each view angle of the restaurant through a pre-trained deep learning model, and aggregating the visual features of all the images in each category into a visual feature vector through average pooling;
the characteristic splicing unit is used for splicing the visual characteristic vectors corresponding to the beverage, food, internal scene and external scene of the restaurant to construct a visual characteristic matrix, wherein the visual characteristic matrix represents the multi-view visual characteristics of the restaurant;
the model construction unit is used for projecting the user-related weights of the visual feature matrix onto the visual feature vector, reducing the dimension of the restaurant visual feature vector through an embedding matrix to obtain a low-dimensional feature vector, and combining the low-dimensional feature vector with the restaurant's implicit factor to construct a prediction model;
the dining guiding unit is used for obtaining the prediction score of the user on the restaurant, introducing the hierarchical attention to model the visual characteristics of the restaurant to obtain the visual information of the restaurant, predicting the preference of the user on the restaurant based on the visual information of the restaurant and the visual characteristic matrix, and guiding according to the preference of the user on the restaurant to complete dining guiding.
In this embodiment, the hierarchical attention comprises an upper category attention layer and a lower image attention layer. Each image category in the image attention layer takes several image visual features as input; by computing image-level attention, all visual features under each image category are integrated into one vector representing that category's visual features, which expresses the restaurant's visual information under the category, and the category attention layer then integrates the visual feature vectors of all images into the restaurant's visual information expression. Obtaining the user's predicted score for the restaurant and introducing hierarchical attention to model the restaurant's visual features to obtain restaurant visual information comprises: modeling user preferences among different images under the same image category by adding an image-level attention mechanism, whose expression is

$$u_{ict} = \mathrm{ReLU}(W_{\alpha} f_{ict} + b_{\alpha}), \qquad \alpha_{ict} = \frac{\exp(u_{ict}^{\top} u_{\alpha})}{\sum_{t'} \exp(u_{ict'}^{\top} u_{\alpha})}, \qquad f_{ic} = \sum_{t} \alpha_{ict} f_{ict}$$

where $W_{\alpha}$ denotes a transformation matrix and $b_{\alpha}$ a bias term. For any restaurant $i$, $f_{ict}$ denotes the visual feature of the $t$-th image belonging to category $c$; a fully connected layer with ReLU as the activation function maps $f_{ict}$ to the implicit representation $u_{ict}$, the jointly learned context vector $u_{\alpha}$ measures the importance of each image's visual feature through its inner product with $u_{ict}$, which gives the score of $u_{ict}$, i.e. its importance under the current category, and $f_{ic}$ denotes the attention-weighted sum of all visual features of the category.
It should be noted that a scoring matrix $R \in \mathbb{R}^{M \times N}$ is preset to represent the user-restaurant scoring matrix, where $M$ and $N$ denote the numbers of users and restaurants respectively, and $u$ and $i$ index users and restaurants. $R_{u,i}$ takes the numerical rating of restaurant $i$ by user $u$; if user $u$ has not interacted with restaurant $i$, then $R_{u,i} = 0$. The column vector and row vector $r_u$ and $r_i$ are extracted from the scoring matrix: $r_u$ represents user $u$'s scores for all restaurants, $r_i$ represents all users' scores for restaurant $i$, and $f_i$ denotes the visual feature of restaurant $i$ learned by the hierarchical attention module. The model's input is the user-restaurant scoring matrix and the multi-view visual features corresponding to each restaurant; its output is the user's predicted score for the restaurant, i.e. the cosine similarity between the user-side and restaurant-side hidden vectors. In the image attention layer, each image category, such as food, drinks, restaurant interior views and restaurant exterior views, has several image visual features as input; by computing image-level attention, the model integrates all visual features under each image category into one vector representing that category's visual information, which expresses the restaurant's visual information under the category. At the category attention layer, the model calculates the importance of the different categories, integrates the visual feature vectors of all images into the restaurant's visual feature expression, combines the obtained restaurant visual information with the deep matrix factorization network, and predicts the user's preference for the restaurant.
It should be understood that if a restaurant has $t$ image categories, each with a different number of images, then after the images' visual features are extracted, the feature vectors are clustered into $k$ clusters by computing Euclidean distances and the visual feature vector closest to each cluster center is retained, so that $k$ visual feature vectors remain after each image category is sampled. With this strategy, every image category contains a fixed number of images for all restaurants, redundant information is excluded, and the visual feature of each restaurant has a fixed size. The images' visual features are extracted with a deep convolutional network and integrated into a collaborative filtering framework, and the multi-view visual features are fused through user weights: the user-related weights reflect each user's personalized restaurant visual preferences and are distinct and independent across users. The model is applied to two real restaurant review data sets to provide personalized guidance for users. Images of different categories serve as auxiliary information for guidance, and a self-attention mechanism can analyze, layer by layer, the user's visual attention distribution within a single image, between images, and between image categories, thereby improving the accuracy of the dining guidance offered to the user.
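The clustering-based sampling described above can be sketched as plain k-means over Euclidean distance, keeping the actual feature closest to each center. The passage does not name a specific clustering algorithm, so the k-means details here are an assumption:

```python
import numpy as np

def sample_k_representatives(feats, k, iters=10, seed=0):
    """Cluster feature vectors into k clusters (Euclidean k-means) and
    keep the real feature vector nearest to each cluster center."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # Assign each feature to its nearest center, then recompute means.
        d = np.linalg.norm(feats[:, None] - centers[None, :], axis=2)
        assign = d.argmin(axis=1)
        for c in range(k):
            if (assign == c).any():
                centers[c] = feats[assign == c].mean(axis=0)
    # Keep the original feature closest to each final center.
    d = np.linalg.norm(feats[:, None] - centers[None, :], axis=2)
    return np.stack([feats[d[:, c].argmin()] for c in range(k)])

rng = np.random.default_rng(3)
# A category with 12 images of 8-dim features, downsampled to k = 3.
feats = rng.normal(size=(12, 8))
kept = sample_k_representatives(feats, k=3)
print(kept.shape)  # (3, 8)
```

Returning nearest-to-center features rather than the centers themselves keeps every retained vector an actual extracted image feature, which matches the retention step described in the text.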
Optionally, the expression of the category-level attention mechanism is

$$u_{ic} = \mathrm{ReLU}(W_{\beta} f_{ic} + b_{\beta}), \qquad \beta_{ic} = \frac{\exp(u_{ic}^{\top} u_{\beta})}{\sum_{c'} \exp(u_{ic'}^{\top} u_{\beta})}, \qquad f_{i} = \sum_{c} \beta_{ic} f_{ic}$$

where $W_{\beta}$ denotes a transformation matrix, $b_{\beta}$ a bias term, $f_{i}$ the restaurant visual feature obtained through the hierarchical attention layers, and $u_{\beta}$ the category-level context vector. The obtained restaurant visual feature is used as part of the restaurant's latent factors to perform collaborative filtering with the user's latent factors.
In this embodiment, projecting the user-related weights of the visual feature matrix onto the visual feature vector and reducing the dimension of the restaurant visual feature vector through the embedding matrix to obtain a low-dimensional feature vector includes: mapping the real scoring vectors of the user and the restaurant, through multi-layer fully connected networks, to implicit factors of the user and the restaurant learned in a low-dimensional space, and enhancing the restaurant-side information with visual features by adding the visual factor f_i to the restaurant implicit factor p_i; the cosine similarity between the user-side features and the restaurant-side features is used as the prediction function of the model, whose expression is

R̂_u,i = cos(p_i + f_i, q_u) = ((p_i + f_i) · q_u) / (‖p_i + f_i‖ · ‖q_u‖),

wherein R̂_u,i represents the model's predicted score of user u for restaurant i, p_i and q_u respectively represent the implicit scoring factors of restaurant i and user u, and f_i represents the restaurant visual factor learned by the attention module. The restaurant side is expressed as the sum of p_i and f_i, and the model predicts the user's preference for the restaurant by calculating the cosine similarity between the restaurant factor (p_i + f_i) and the user scoring factor q_u. The implicit scoring vectors p_i and q_u of the restaurant and the user are mapped to the low-dimensional space by two multi-layer fully connected networks FC_it and FC_us respectively, whose inputs are the scoring vectors r_i and r_u extracted from the user-restaurant interaction matrix R: r_i represents the vector composed of all users' scores for the i-th restaurant, and r_u represents the scoring vector of user u for all restaurants. A multi-layer fully connected network is also used to reduce the dimension of the restaurant visual factor f_i.
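The prediction function itself is compact enough to show directly; the following numpy sketch (function name assumed) computes the cosine similarity between the restaurant-side vector p_i + f_i and the user-side vector q_u:

```python
import numpy as np

def predict_score(p_i, f_i, q_u):
    """Prediction function of the model: cosine similarity between the
    restaurant-side vector (p_i + f_i) and the user-side vector q_u.
    All three arguments are 1-D arrays of the same dimension."""
    r = p_i + f_i  # restaurant implicit factor enhanced by the visual factor
    return float(r @ q_u / (np.linalg.norm(r) * np.linalg.norm(q_u)))
```

In the full model, p_i, q_u, and f_i would come from the fully connected networks described above; here they are plain vectors so the scoring step can be seen in isolation.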
The model calculating the cosine similarity between the restaurant factor (p_i + f_i) and the user scoring factor q_u to predict the user's preference for the restaurant includes: presetting one vector in an n-dimensional space to represent one user, selecting two users u and v, and calculating the similarity between the two user vectors with the cosine expression

sim(u, v) = Σ_{c∈L_u,v} R_u,c · R_v,c / (√(Σ_{c∈L_u} R_u,c²) · √(Σ_{c∈L_v} R_v,c²)),

and balancing the rating differences between different users by subtracting each user's mean score, giving

sim(u, v) = Σ_{c∈L_u,v} (R_u,c − R̄_u)(R_v,c − R̄_v) / (√(Σ_{c∈L_u} (R_u,c − R̄_u)²) · √(Σ_{c∈L_v} (R_v,c − R̄_v)²)),

wherein L_u,v represents the set of restaurants scored by both user u and user v, L_u and L_v represent the sets of restaurants rated individually by users u and v, R_u,c and R_v,c represent the scores of user u and user v for restaurant c, and R̄_u and R̄_v respectively represent the average scores given by user u and user v to the restaurants they have rated, which enhances the timeliness and effectiveness of the dining guidance system.
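The mean-centered similarity above can be sketched in a few lines of numpy; this is an illustrative implementation (function name and the 0-means-unrated convention are assumptions) over a user-restaurant rating matrix:

```python
import numpy as np

def adjusted_cosine(R, u, v):
    """Similarity between users u and v over the restaurants both have
    rated, after subtracting each user's own mean rating. R is a
    (users, restaurants) array where 0 marks an unrated restaurant."""
    ru, rv = R[u], R[v]
    both = (ru > 0) & (rv > 0)          # restaurants rated by both users
    if not both.any():
        return 0.0
    mu = ru[ru > 0].mean()              # mean over restaurants u rated
    mv = rv[rv > 0].mean()
    du, dv = ru[both] - mu, rv[both] - mv
    denom = np.linalg.norm(du) * np.linalg.norm(dv)
    return float(du @ dv / denom) if denom else 0.0
```

Subtracting each user's own mean keeps a generous rater and a strict rater comparable, which is the "balancing the rating differences" step in the text.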
Alternatively, the expression of the restaurant score prediction model is

R̂_u,i = F(u, i | θ),

wherein R̂_u,i represents the predicted score of user u for restaurant i, θ represents the model parameters, and F represents the function mapping the parameters to predicted scores. Two latent factors q_u and p_i represent user u and restaurant i, expressed as

q_u = FC_us(r_u), p_i = FC_it(σ(r_i)),

wherein FC_us and FC_it respectively represent the fully connected layers of user u and restaurant i, and σ represents a single-layer fully connected network that transforms r_i so that its dimension is consistent with f_i; the cosine similarity between q_u and p_i is used to calculate the predicted score of user u for restaurant i.
In this embodiment, splicing the visual feature vectors corresponding to the beverage, food, internal view and external view of the restaurant to construct the visual feature matrix includes: the expression of the restaurant guidance model for multi-view visual information is

R̂_u,i = α + β_u + β_i + γ_uᵀ γ_i + θ_uᵀ E (F_iᵀ ω_u),

wherein α represents the global offset, β_u and β_i represent the deviation terms of the user and the restaurant, γ_u and γ_i represent the latent factor vectors of the user and the restaurant respectively, θ_u represents the visual factor of user u, E represents the embedding matrix, the matrix F_i is used to represent the visual features of restaurant i, and ω_u represents a 4 × 1 weight vector corresponding to the four visual views of the restaurant; A represents the overall bias weight of the multi-view visual features, W represents the category visual preference matrix of all users, and the product of A and W represents the users' overall preference for the multi-view visual appearance of the preset restaurants. The model is trained by using maximum a posteriori estimation of Bayesian analysis, wherein the preset training set D_s composed of triplets (u, i, j) has the expression

D_s = {(u, i, j) | u ∈ U, i ∈ I_u⁺, j ∈ I \ I_u⁺},

wherein u represents the user, i represents a restaurant the user has visited, j represents an unknown restaurant, and I_u⁺ represents the set of all positive-sample restaurants of user u. Matrix factorization is used to predict the user's preference, denoted R̂_u,i,j, and the expression of the prediction model is

R̂_u,i,j = R̂_u,i − R̂_u,j = β_i − β_j + γ_uᵀ γ_ij + θ_uᵀ E (F_ijᵀ ω_u),

wherein γ_ij represents the difference between γ_i and γ_j, and F_ij represents the difference between F_i and F_j. The expression of the personalized ranking optimization criterion C is

C = Σ_{(u,i,j)∈D_s} ln σ(R̂_u,i,j) − λ_θ ‖θ‖²,

wherein σ represents the sigmoid function and λ_θ represents a regularization hyper-parameter to be adjusted.
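The personalized ranking criterion is the standard BPR objective, which can be sketched in numpy; this is an illustrative fragment (function names assumed) computing the criterion from the score differences of (u, i, j) triplets:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_criterion(scores_pos, scores_neg, params, lam=0.01):
    """BPR-style personalized ranking criterion: the sum of
    ln sigma(x_ui - x_uj) over (u, i, j) triplets, minus an L2
    regularizer over the model parameter arrays in `params`."""
    diff = scores_pos - scores_neg          # x_uij for each sampled triplet
    reg = lam * sum(np.sum(p ** 2) for p in params)
    return float(np.sum(np.log(sigmoid(diff))) - reg)
```

Training maximizes this criterion, so the model is pushed to score a visited restaurant i above an unknown restaurant j for the same user, while the λ term keeps the parameters small.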
It should be noted that obtaining restaurant information and performing data preprocessing on the restaurant information to obtain a data set includes: expressing restaurants and their attributes as semantic vectors, and calculating the user's preference vector by computing the weights of the different attribute values of the restaurants and integrating over the restaurants the user has historically liked. Let D_ul represent the set of restaurants the user has historically liked; for a restaurant d ∈ D_ul, h represents the restaurant entity corresponding to restaurant d in the knowledge graph, and the user dining-attribute triplet expression is

B_u = {(h, r, t) | h ∈ D_ul},

wherein (h, r, t) represents a triplet with restaurant entity h as the head entity, r represents an attribute of the restaurant entity, and t represents the attribute value. The weights of the different attribute features of the restaurants the user has historically liked are calculated as the user's attention to the different attributes; the weight of attribute r_i of restaurant d is calculated as

W_ri = exp((W_rh · h)ᵀ r_i) / Σ_j exp((W_rh · h)ᵀ r_j),

wherein h and r_i respectively represent the entity vector and attribute-value vector corresponding to restaurant d obtained by training, and W_rh represents a projection matrix generated by the entity h and its relation. A user dining-interest vector is then calculated based on the attribute weights and attribute-value vectors: after the weight of each attribute in the user's restaurant-attribute triplets is calculated, all attribute-value vectors of the user's liked restaurants are weighted and summed to represent the user's dining preference vector, expressed as

u = Σ_i W_ri · t_i,

wherein t_i represents the restaurant attribute-value vector generated by training and W_ri represents the weight of a certain attribute of the restaurant. Model training is performed with the user's implicit feedback data and the semantic information of the restaurant attribute knowledge layer, so as to obtain more accurate user and restaurant feature representations and improve the personalized restaurant guidance effect.
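The attribute-weighted preference vector can be sketched as follows; this numpy fragment is illustrative (the function name and the softmax normalization of the attention scores are assumptions consistent with the weight formula above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def user_preference(h, attr_vecs, W_rh):
    """Weight each attribute-value vector of a liked restaurant by its
    attention score against the projected restaurant entity embedding,
    then sum. h: (d,) entity vector; attr_vecs: (m, d) attribute-value
    vectors; W_rh: (d, d) projection matrix."""
    scores = attr_vecs @ (W_rh @ h)   # one attention score per attribute
    w = softmax(scores)               # attribute weights W_ri
    return w @ attr_vecs              # (d,) dining preference vector
```

Summing this over all restaurants in D_ul would give the user's overall dining-interest vector described in the text.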
Referring to fig. 2, the invention also provides a dining guiding method based on computer vision, which comprises the following specific steps:
s1: acquiring restaurant information, carrying out data preprocessing on the restaurant information to obtain a data set, and marking image categories in the data set as beverages, foods, internal scenes and external scenes, wherein the restaurant information comprises user ratings, user comments, restaurant images, restaurant positions and restaurant business information;
s2: extracting visual features of a plurality of images in each view angle of a restaurant through a pre-trained deep learning model, and aggregating the visual features of all images in each category into a visual feature vector through average pooling;
s3: splicing visual feature vectors corresponding to beverages, foods, internal scenes and external scenes of the restaurant to construct a visual feature matrix, wherein the visual feature matrix represents multi-view visual features of the restaurant;
s4: projecting the weight related to the visual feature matrix and the user to the visual feature vector, reducing the dimension of the restaurant visual feature vector through the embedded matrix to obtain a low-dimension feature vector, and combining the low-dimension feature vector with an implicit factor of the restaurant to construct a prediction model;
s5: obtaining the prediction score of the user for the restaurant, introducing hierarchical attention to model the visual features of the restaurant to obtain restaurant visual information, predicting the preference of the user for the restaurant based on the restaurant visual information and the visual feature matrix, and guiding according to the preference of the user for the restaurant to complete dining guiding.
In this embodiment, a pre-trained neural network model is used to extract the visual features of each image. For example, if a restaurant has three interior-view images (i_1, i_2, i_3), then for each image a visual feature of a specific dimension, such as 4096, is extracted by the pre-trained neural network model, and the visual feature vectors of the three images represent the visual information of the restaurant's interior view. The visual feature vectors of the restaurant's food, beverage, interior view and exterior view are then spliced to form a visual feature matrix representing the multi-view visual features of the restaurant. The visual feature matrix is projected onto a feature vector through the user-related weights ω_u = {ω_1, ω_2, ω_3, ω_4}, which reflect user u's personalized visual preferences for the restaurant; the feature vector is reduced in dimension through the embedding matrix to obtain a low-dimensional feature vector, which is combined with the restaurant's implicit factor and can serve as the input of a matrix factorization model for personalized prediction, thereby improving the robustness of model training and the effectiveness of personalized dining guidance for the user.
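The splice-weight-project chain in this embodiment can be sketched end to end; the numpy fragment below is illustrative (function name and small dimensions assumed; a real pipeline would use 4096-dimensional CNN features):

```python
import numpy as np

def restaurant_visual_factor(cat_features, omega_u, E):
    """cat_features: list of 4 arrays, each (n_c, d), holding per-image
    features for beverage, food, interior and exterior views. Mean-pool
    each category, stack into a (4, d) visual feature matrix, weight the
    rows by the user's 4-dim preference omega_u, then project to a low
    dimension with the embedding matrix E of shape (d, k)."""
    F_i = np.stack([f.mean(axis=0) for f in cat_features])  # (4, d) matrix
    fused = omega_u @ F_i                                   # (d,) weighted fusion
    return fused @ E                                        # (k,) low-dim factor
```

The returned low-dimensional vector is what gets added to the restaurant's implicit factor before the cosine-similarity prediction step.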
Any particular values in the examples shown and described herein are to be construed as merely illustrative and not as limitations; other exemplary embodiments may therefore have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The above examples merely represent a few embodiments of the present invention; although they are described in detail, they are not to be construed as limiting the scope of the invention. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the invention.

Claims (10)

1. A computer vision-based dining guidance system, comprising:
the data acquisition unit is used for acquiring restaurant information, carrying out data preprocessing on the restaurant information to obtain a data set, and marking the image categories in the data set as beverages, foods, internal scenes and external scenes, wherein the restaurant information comprises user ratings, user comments, restaurant images, restaurant positions and restaurant business information;
the feature extraction unit is used for extracting visual features of a plurality of images in each view angle of the restaurant through a pre-trained deep learning model, and aggregating the visual features of all the images in each category into a visual feature vector through average pooling;
the feature splicing unit is used for splicing the visual feature vectors corresponding to the beverage, food, internal scene and external scene of the restaurant to construct a visual feature matrix, wherein the visual feature matrix represents the multi-view visual features of the restaurant;
the model construction unit is used for projecting the weight related to the visual feature matrix and the user to the visual feature vector, reducing the dimension of the restaurant visual feature vector through the embedded matrix to obtain a low-dimension feature vector, and combining the low-dimension feature vector with the implicit factor of the restaurant to construct a prediction model;
the dining guiding unit is used for obtaining the prediction score of the user on the restaurant, introducing the hierarchical attention to model the visual characteristics of the restaurant to obtain the visual information of the restaurant, predicting the preference of the user on the restaurant based on the visual information of the restaurant and the visual characteristic matrix, and guiding according to the preference of the user on the restaurant to complete dining guiding.
2. The computer vision-based dining guidance system of claim 1, wherein the hierarchical attention comprises an upper category attention layer and a lower image attention layer; each image category in the image attention layer receives a plurality of image visual features as input; by calculating the attention of the image attention layer, all visual features under each image category are integrated, in attention form, into one vector representing the visual features of that image category, the vector constituting the visual information representation of the restaurant under the image category; and the visual feature vectors of all image categories are integrated by the category attention layer into the visual information representation of the restaurant.
3. The computer vision-based dining guidance system of claim 1, wherein obtaining a predicted score of a restaurant by a user, introducing hierarchical attention to model visual features of the restaurant to obtain restaurant visual information, comprises:
modeling user preferences among different images under the same image category, and adding an image-level attention mechanism whose expression is

u_ict = ReLU(W_α · f_ict + b_α), α_ict = exp(u_ictᵀ u_α) / Σ_t exp(u_ictᵀ u_α), f_ic = Σ_t α_ict · f_ict,

wherein W_α represents a transformation matrix and b_α represents a bias term; for any restaurant i, f_ict represents the visual feature of the t-th image belonging to category c; a fully connected layer maps f_ict to the implicit expression u_ict, using ReLU as the activation function; a jointly learned context vector u_α is used to measure the importance of the visual features of each image, the inner product of u_ict and u_α giving the score of u_ict, which represents its importance under the current category; and f_ic represents the weighted-sum representation of all visual features of the category.
4. The computer vision based dining guidance system of claim 3, wherein the expression of the category-level attention mechanism is

u_ic = ReLU(W_β · f_ic + b_β), β_ic = exp(u_icᵀ u_β) / Σ_c exp(u_icᵀ u_β), f_i = Σ_c β_ic · f_ic,

wherein W_β represents a transformation matrix, b_β represents the bias term, f_i represents the restaurant visual feature obtained through the hierarchical attention layers, and u_β represents the category-level context vector; the obtained restaurant visual feature is used as part of the restaurant latent factors to perform collaborative filtering with the latent factors of the user.
5. The computer vision-based dining guidance system of claim 1, wherein projecting the weights of the visual feature matrix associated with the user onto the visual feature vector and dimension-reducing the restaurant visual feature vector by the embedding matrix to obtain a low-dimensional feature vector, comprises:
mapping the respective real scoring vectors of the user and the restaurant, through multi-layer fully connected networks, to implicit factors of the user and the restaurant learned in a low-dimensional space, and enhancing the restaurant-side information with visual features by adding the visual factor f_i to the restaurant implicit factor p_i;
using the cosine similarity between the user-side features and the restaurant-side features as the prediction function of the model, whose expression is

R̂_u,i = cos(p_i + f_i, q_u) = ((p_i + f_i) · q_u) / (‖p_i + f_i‖ · ‖q_u‖),

wherein R̂_u,i represents the model's predicted score of user u for restaurant i, p_i and q_u respectively represent the implicit scoring factors of restaurant i and user u, and f_i represents the restaurant visual factor learned by the attention module; the restaurant side is expressed as the sum of p_i and f_i, and the model predicts the user's preference for the restaurant by calculating the cosine similarity between the restaurant factor (p_i + f_i) and the user scoring factor q_u; the implicit scoring vectors p_i and q_u of the restaurant and the user are mapped to the low-dimensional space by two multi-layer fully connected networks FC_it and FC_us respectively, whose inputs are the scoring vectors r_i and r_u extracted from the user-restaurant interaction matrix R, r_i representing the vector composed of all users' scores for the i-th restaurant and r_u representing the scoring vector of user u for all restaurants; a multi-layer fully connected network is also used to reduce the dimension of the restaurant visual factor f_i.
6. The computer vision based dining guidance system of claim 5, wherein the model calculating the cosine similarity between the restaurant factor (p_i + f_i) and the user scoring factor q_u to predict the user's preference for the restaurant comprises:
presetting one vector in an n-dimensional space to represent one user, selecting two users u and v, and calculating the similarity between the two user vectors with the cosine expression

sim(u, v) = Σ_{c∈L_u,v} R_u,c · R_v,c / (√(Σ_{c∈L_u} R_u,c²) · √(Σ_{c∈L_v} R_v,c²)),

and balancing the rating differences between different users by subtracting each user's mean score, giving

sim(u, v) = Σ_{c∈L_u,v} (R_u,c − R̄_u)(R_v,c − R̄_v) / (√(Σ_{c∈L_u} (R_u,c − R̄_u)²) · √(Σ_{c∈L_v} (R_v,c − R̄_v)²)),

wherein L_u,v represents the set of restaurants scored by both user u and user v, L_u and L_v represent the sets of restaurants rated individually by users u and v, R_u,c and R_v,c represent the scores of user u and user v for restaurant c, and R̄_u and R̄_v respectively represent the average scores of the restaurants rated by user u and user v.
7. The computer vision based dining guidance system of claim 6, further comprising:
the restaurant score prediction model has the expression of Wherein (1)>Representing the predicted score of user u for restaurant i, < >>Representing model parameters, F representing a function mapping parameters to prediction scores, two potential factors q u 、p i Representing user u and restaurant i, the two potential factors are expressed as +.>Wherein (1)>And->Representing the fully connected layers of user u and restaurant i, respectively, σ represents a single layer fully connected network for displacing r i To make it and f i Keep consistent, q u And p i The cosine similarity between them is used to calculate the score that user u predicts for restaurant i.
8. The computer vision-based dining guide system of claim 1, wherein stitching the visual feature vectors corresponding to the beverage, food, interior and exterior views of the restaurant to construct a visual feature matrix comprises:
the expression of the restaurant guidance model for multi-view visual information is

R̂_u,i = α + β_u + β_i + γ_uᵀ γ_i + θ_uᵀ E (F_iᵀ ω_u),

wherein α represents the global offset, β_u and β_i represent the deviation terms of the user and the restaurant, γ_u and γ_i represent the latent factor vectors of the user and the restaurant respectively, θ_u represents the visual factor of user u, E represents the embedding matrix, the matrix F_i is used to represent the visual features of restaurant i, and ω_u represents a 4 × 1 weight vector corresponding to the four visual views of the restaurant; A represents the overall bias weight of the multi-view visual features, W represents the category visual preference matrix of all users, and the product of A and W represents the users' overall preference for the multi-view visual appearance of the preset restaurants;
training the model by using maximum a posteriori estimation of Bayesian analysis, wherein the preset training set D_s composed of triplets (u, i, j) has the expression

D_s = {(u, i, j) | u ∈ U, i ∈ I_u⁺, j ∈ I \ I_u⁺},

wherein u represents the user, i represents a restaurant the user has visited, j represents an unknown restaurant, and I_u⁺ represents the set of all positive-sample restaurants of user u; matrix factorization is used to predict the user's preference, denoted R̂_u,i,j, and the expression of the prediction model is

R̂_u,i,j = R̂_u,i − R̂_u,j = β_i − β_j + γ_uᵀ γ_ij + θ_uᵀ E (F_ijᵀ ω_u),

wherein γ_ij represents the difference between γ_i and γ_j, and F_ij represents the difference between F_i and F_j; the expression of the personalized ranking optimization criterion C is

C = Σ_{(u,i,j)∈D_s} ln σ(R̂_u,i,j) − λ_θ ‖θ‖²,

wherein σ represents the sigmoid function and λ_θ represents a regularization hyper-parameter to be adjusted.
9. The computer vision-based dining guidance system of claim 1, wherein obtaining restaurant information and data preprocessing the restaurant information to obtain a data set comprises:
expressing restaurants and their attributes as semantic vectors, and calculating the user's preference vector by computing the weights of the different attribute values of the restaurants and integrating over the restaurants the user has historically liked; presetting D_ul to represent the set of restaurants the user has historically liked, wherein for a restaurant d ∈ D_ul, h represents the restaurant entity corresponding to restaurant d in the knowledge graph, and the user dining-attribute triplet expression is

B_u = {(h, r, t) | h ∈ D_ul},

wherein (h, r, t) represents a triplet with restaurant entity h as the head entity, r represents an attribute of the restaurant entity, and t represents the attribute value;
calculating the weights of the different attribute features of the restaurants the user has historically liked as the user's attention to the different attributes, the weight of attribute r_i of restaurant d being calculated as

W_ri = exp((W_rh · h)ᵀ r_i) / Σ_j exp((W_rh · h)ᵀ r_j),

wherein h and r_i respectively represent the entity vector and attribute-value vector corresponding to restaurant d obtained by training, and W_rh represents a projection matrix generated by the entity h and its relation;
calculating a user dining-interest vector based on the attribute weights and attribute-value vectors: after the weight of each attribute in the user's restaurant-attribute triplets is calculated, all attribute-value vectors of the user's liked restaurants are weighted and summed to represent the user's dining preference vector, expressed as

u = Σ_i W_ri · t_i,

wherein t_i represents the restaurant attribute-value vector generated by training and W_ri represents the weight of a certain attribute of the restaurant.
10. A computer vision based dining guiding method using the computer vision based dining guiding system according to any one of claims 1-9, characterized by the following specific steps:
acquiring restaurant information, carrying out data preprocessing on the restaurant information to obtain a data set, and marking image categories in the data set as beverages, foods, internal scenes and external scenes, wherein the restaurant information comprises user ratings, user comments, restaurant images, restaurant positions and restaurant business information;
extracting visual features of a plurality of images in each view angle of a restaurant through a pre-trained deep learning model, and aggregating the visual features of all images in each category into a visual feature vector through average pooling;
splicing visual feature vectors corresponding to beverages, foods, internal scenes and external scenes of the restaurant to construct a visual feature matrix, wherein the visual feature matrix represents multi-view visual features of the restaurant;
projecting the weight related to the visual feature matrix and the user to the visual feature vector, reducing the dimension of the restaurant visual feature vector through the embedded matrix to obtain a low-dimension feature vector, and combining the low-dimension feature vector with an implicit factor of the restaurant to construct a prediction model;
obtaining a prediction score of a user for a restaurant, introducing hierarchical attention to model the visual features of the restaurant to obtain restaurant visual information, predicting the preference of the user for the restaurant based on the restaurant visual information and the visual feature matrix, and guiding according to the preference of the user for the restaurant to complete dining guiding.
CN202310613188.6A 2023-05-29 2023-05-29 Dining guiding system and method based on computer vision Pending CN116645244A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310613188.6A CN116645244A (en) 2023-05-29 2023-05-29 Dining guiding system and method based on computer vision


Publications (1)

Publication Number Publication Date
CN116645244A true CN116645244A (en) 2023-08-25

Family

ID=87614848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310613188.6A Pending CN116645244A (en) 2023-05-29 2023-05-29 Dining guiding system and method based on computer vision

Country Status (1)

Country Link
CN (1) CN116645244A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050256756A1 (en) * 2004-05-17 2005-11-17 Lam Chuck P System and method for utilizing social networks for collaborative filtering
US20150187024A1 (en) * 2013-12-27 2015-07-02 Telefonica Digital España, S.L.U. System and Method for Socially Aware Recommendations Based on Implicit User Feedback
CN108171535A (en) * 2017-12-13 2018-06-15 天津科技大学 A kind of personalized dining room proposed algorithm based on multiple features
CN110119479A (en) * 2019-05-16 2019-08-13 苏州大学 A kind of restaurant recommendation method, apparatus, equipment and readable storage medium storing program for executing
US20210110306A1 (en) * 2019-10-14 2021-04-15 Visa International Service Association Meta-transfer learning via contextual invariants for cross-domain recommendation
KR102371787B1 (en) * 2021-01-25 2022-03-10 강병우 System for providing customized dietary management service
WO2022143482A1 (en) * 2020-12-31 2022-07-07 华为技术有限公司 Recommendation method, recommendation network, and related device
CN115221390A (en) * 2021-04-15 2022-10-21 天津科技大学 Mixed group restaurant recommendation fusing user preferences and trust relationships


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHU, Peipei: "Recommendation Algorithm Based on Users' Indirect Trust and Gaussian Filling", Computer Science, vol. 46, no. 11, pages 178-184 *
LI, Chao: "Research on a Personalized Food Recommendation Method Based on Knowledge Graphs", China Master's Theses Full-text Database, Engineering Science and Technology, no. 1, pages 024-806 *
LUO, Haihua: "Research on Restaurant Recommendation System Algorithms Based on Multi-view Visual Information", China Master's Theses Full-text Database, Information Science and Technology, no. 10 *

Similar Documents

Publication Publication Date Title
US20220222920A1 (en) Content processing method and apparatus, computer device, and storage medium
CN111061946B (en) Method, device, electronic equipment and storage medium for recommending scenerized content
US7860347B2 (en) Image-based face search
CN110795571B (en) Cultural travel resource recommendation method based on deep learning and knowledge graph
CN112926396A (en) Action identification method based on double-current convolution attention
CN111209475B (en) Interest point recommendation method and device based on space-time sequence and social embedded ranking
Caicedo et al. Collaborative personalization of image enhancement
CN113590965B (en) Video recommendation method integrating knowledge graph and emotion analysis
CN112948625B (en) Film recommendation method based on attribute heterogeneous information network embedding
CN114662015A (en) Interest point recommendation method and system based on deep reinforcement learning
CN112712127A (en) Image emotion polarity classification method combined with graph convolution neural network
CN109886281A (en) One kind is transfinited learning machine color image recognition method based on quaternary number
CN115712780A (en) Information pushing method and device based on cloud computing and big data
CN107506419B (en) Recommendation method based on heterogeneous context sensing
CN114357307B (en) News recommendation method based on multidimensional features
CN111159543B (en) Personalized tourist place recommendation method based on multi-level visual similarity
CN116645501A (en) Unbiased scene graph generation method based on candidate predicate relation deviation
CN116645244A (en) Dining guiding system and method based on computer vision
CN114417166B (en) Continuous interest point recommendation method based on behavior sequence and dynamic social influence
CN116958740A (en) Zero sample target detection method based on semantic perception and self-adaptive contrast learning
CN115205768B (en) Video classification method based on resolution self-adaptive network
CN113536109B (en) Interest point recommendation method based on neural network and mobile context
CN113095084B (en) Semantic service matching method and device in Internet of things and storage medium
Rawat et al. Photography and exploration of tourist locations based on optimal foraging theory
CN114357306A (en) Course recommendation method based on meta-relation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination