Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a real-time destination recommendation method, based on context and user interest, for use in an open network. The method is easy to model and improves recommendation efficiency and accuracy.
The technical problem addressed by the invention is solved by the following technical scheme:
A real-time destination recommendation method based on context and user interest in an open network comprises a destination recommendation client processing step, a destination recommendation server offline processing step, and a destination recommendation server online processing step, wherein:
the destination recommendation client processing step is: defining a profile and a context for the user and sending them to the server port;
the destination recommendation server offline processing step is: collecting destination-related data, then normalizing and vectorizing the data;
the destination recommendation server online processing step is: intercepting a user request, parsing the user profile and context information, calculating similarity based on linear interpolation, and sorting by similarity to obtain the recommendation result.
The profile includes the destinations the user has visited, together with a score and a label for each destination; the context includes the season, the geographic location, the number of people traveling, and the purpose of the trip.
The destination recommendation server offline processing step comprises the following steps:
Step 1: specifying a plurality of websites from which destination data is crawled;
Step 2: aligning the destination data, that is, disambiguating merchants that refer to the same place on different review websites;
Step 3: vectorizing each piece of destination data and storing the vectorized data in a database;
Step 4: defining a keyword lexicon corresponding to each context and storing the keyword lexicon in the database.
In step 1, the merchant name, merchant profile, rating information, positive review text, and negative review text are crawled from review websites, and the merchant name, merchant profile, merchant location, positive review text, and negative review text are crawled from merchant websites.
In step 3, each vectorized merchant is represented by the following eight vectors, which capture its semantic features: the merchant name TFIDF semantic vector, the merchant name LSI semantic vector, the merchant profile TFIDF semantic vector, the merchant profile LSI semantic vector, the positive review text TFIDF semantic vector, the positive review text LSI semantic vector, the negative review text TFIDF semantic vector, and the negative review text LSI semantic vector.
The destination recommendation server online processing step comprises the following steps:
Step 1: the destination recommendation server intercepts the user's request and parses the information;
Step 2: the destination recommendation server vectorizes the user's profile, mapping the user's interest profile and the destination information in the database into the same descriptive coordinate system;
Step 3: the destination recommendation server vectorizes the context, mapping the context and the positive reviews of the destinations into the same descriptive coordinate system;
Step 4: using the user interest preference vectors from step 2 and the context vector from step 3, the destination recommendation server computes, for every destination vector in the database, the similarity between the destination and the (user, context) pair by linear interpolation;
Step 5: the similarities obtained in step 4 are sorted, the top-ranked destinations are recommended, and the result is returned.
In step 4, the similarity s is calculated by the following formula:
where p is a destination, u is a user, c is a context, and simfunc(·) is the cosine similarity of two corresponding vectors. The operations over the set V calculate the similarity between the user interest preference and the destination; V comprises: { merchant name TFIDF semantic vector, merchant name LSI semantic vector, merchant profile TFIDF semantic vector, merchant profile LSI semantic vector, positive review text TFIDF semantic vector, positive review text LSI semantic vector, negative review text TFIDF semantic vector, negative review text LSI semantic vector }. simcontext_lsi(p, c) is the similarity between the destination and the context vector under the latent semantic (LSI) mapping, computed with the corresponding vector of the destination's positive reviews; simcontext_tfidf(p, c) is the lexical TFIDF similarity between the destination and the context vector, likewise computed with the corresponding vector of the destination's positive reviews; w is the coefficient corresponding to each term.
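The formula itself does not appear in this text. As a hedged reconstruction only, a linear interpolation consistent with the term definitions above could be written as follows. The subscripted weight names w_v, w_lsi, and w_tfidf and the per-vector symbols p_v and u_v are notation assumed here for illustration; they are not taken from the original.

```latex
s \;=\; \sum_{v \in V} w_v \,\mathrm{simfunc}\!\left(\mathbf{p}_v, \mathbf{u}_v\right)
  \;+\; w_{\mathrm{lsi}} \,\mathrm{simcontext\_lsi}(p, c)
  \;+\; w_{\mathrm{tfidf}} \,\mathrm{simcontext\_tfidf}(p, c)
```

Here p_v and u_v denote the destination's and the user's vectors of type v, so the sum contributes one cosine similarity for each of the eight vector types in V.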
The advantages and positive effects of the invention are:
1. The invention obtains destination web page data from the open network and preprocesses it by disambiguation. It produces a personalized ranking of destinations by combining the user profile with the context, and it can use the context effectively to judge relevance even when training data is limited. The algorithm responds in a very short time, has low time complexity, and offers high accuracy and stability, so it can be widely applied to personalized, context-based, real-time destination recommendation.
2. The invention uses the season, the geographic location, the number of travelers, and the travel purpose of the user as the context. Destination recommendation based on context and user interest can help generate commercial benefit and bring convenience to users.
3. The invention is reasonably designed and easy to model. It makes full use of two characteristics: the user's historical behavior preferences are consistent with the user's current behavior preferences, and the user's current behavior preferences are related to the context (such as geographic location, travel type, and season). By exploiting human prior knowledge, it can significantly improve recommendation efficiency and performance, and it improves the accuracy of context-based real-time destination recommendation.
Detailed Description
The embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The invention is implemented mainly with statistical machine learning theory and web crawler technology. To ensure normal operation of the system, the computer platform used in this embodiment has at least 8 GB of memory, at least 4 CPU cores with a clock frequency of at least 2.6 GHz, and at least 1 GB of video memory; it runs a 64-bit Linux operating system, version 14.04 or later, with the necessary software environment, including Python 2.7 or later, installed.
The real-time destination recommendation method based on context and user interest in an open network comprises a destination recommendation client processing step, a destination recommendation server offline processing step, and a destination recommendation server online processing step, each explained below:
1. Destination recommendation client processing step: the user's profile and context are defined and sent to the server port. The profile comprises the destinations the user has visited, together with the score and label of each; the context comprises the season, the geographic location, the number of travelers, and the travel purpose.
2. Destination recommendation server offline processing step: destination-related data is collected, normalized, and vectorized, specifically as follows:
Step 1: Specify a plurality of websites from which destination data is crawled.
Because of the aggregation effect among users, users tend to gather on the same destinations to comment and to cross-reference one another. The invention therefore selects two types of pages as the description files of each destination. The first type is a merchant page on a review website, such as the merchant page http://www.dianping.com/shop/18570510 on the Dianping review website; the second type is the merchant's own home page, such as the McDonald's home page https://www.mcdonalds.com.cn/.
For the first type of page, some review websites, such as Yelp, Foursquare, and TripAdvisor, provide developer-friendly APIs through which a merchant's content, ratings, and reviews can be accessed directly. The merchant name, merchant profile, rating information, positive review text, and negative review text are obtained by processing the page.
For the second type of page, the title of the home page is used as a search query on Yelp, and the first returned result is taken as a supplement to the merchant's home page. The merchant name, profile, and location are obtained from the home page, while the rating information, positive review text, and negative review text are obtained from the merchant's entry on Yelp. Positive review text consists of the reviews whose user score is above the average, negative review text consists of the reviews whose user score is below the average, and reviews whose score equals the average are not considered.
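The positive/negative split described above can be sketched as follows. This is a minimal illustration that assumes the "average" is the mean score over one merchant's own reviews (the text does not say which average is meant); the function name `split_reviews` and the `(score, text)` tuple layout are hypothetical.

```python
from statistics import mean

def split_reviews(reviews):
    """Split (score, text) pairs into positive and negative review texts.

    Assumption: the threshold is the mean score of this merchant's reviews.
    Reviews scoring exactly the mean are discarded, as the text describes.
    """
    if not reviews:
        return [], []
    avg = mean(score for score, _ in reviews)
    positive = [text for score, text in reviews if score > avg]
    negative = [text for score, text in reviews if score < avg]
    return positive, negative

reviews = [(5, "great burgers"), (1, "slow service"), (3, "it was fine")]
positive, negative = split_reviews(reviews)
```

In this example the mean score is 3, so the three-star review falls on the threshold and is left out of both sets.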
The crawled data is stored in JSON format.
Step 2: Align the crawled destination data, that is, disambiguate merchants that refer to the same place on different review websites.
Data disambiguation relies mainly on the geographic location information and the merchant name displayed on the page. Both are compared by string matching: if the geographic location and the merchant name are both identical, the records are considered to refer to the same merchant.
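A minimal sketch of this alignment rule, assuming each crawled record is a dictionary with hypothetical `name` and `location` string fields. Matching is exact, as the text states; no normalization of the strings is attempted here.

```python
from collections import defaultdict

def align_merchants(records):
    """Group crawled records whose name and location strings match exactly."""
    groups = defaultdict(list)
    for record in records:
        key = (record["name"], record["location"])
        groups[key].append(record)
    return dict(groups)

# Illustrative records; site names and field names are made up for this sketch.
records = [
    {"site": "review_site_a", "name": "Cafe Luna", "location": "12 Main St"},
    {"site": "review_site_b", "name": "Cafe Luna", "location": "12 Main St"},
    {"site": "review_site_a", "name": "Cafe Luna", "location": "98 Oak Ave"},
]
aligned = align_merchants(records)
```

The first two records share both strings and are merged into one merchant; the third has the same name but a different location and stays separate.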
Step 3: Vectorize each piece of destination data and store the vectorized data in the database.
Vectorization converts a text string into a vector. This step operates on the merchant name, the merchant profile, the positive review text, and the negative review text. The merchant name is only segmented into words and receives no further processing. For the remaining three items (merchant profile, positive review text, and negative review text), the following preprocessing operations are performed in order: word segmentation, removal of stop words, removal of words whose frequency within the item is less than 20, removal of the 30 most frequent words, and removal of punctuation marks.
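The preprocessing sequence can be sketched as below. This is an illustration under stated assumptions: word segmentation is taken as already done (each document is a list of tokens), word frequency is counted over all documents of one item type, and the function name `preprocess` and its parameters are hypothetical. The defaults mirror the thresholds in the text.

```python
import string
from collections import Counter

def preprocess(docs, stop_words, min_freq=20, top_n=30):
    """Filter tokenized documents belonging to one item type.

    docs: list of token lists (segmentation is assumed to be done already).
    """
    # Remove stop words first.
    docs = [[t for t in doc if t not in stop_words] for doc in docs]
    # Count word frequency over the whole collection for this item type.
    freq = Counter(t for doc in docs for t in doc)
    # Words rarer than min_freq, the top_n most frequent words, and
    # punctuation tokens are all dropped.
    rare = {t for t, n in freq.items() if n < min_freq}
    common = {t for t, _ in freq.most_common(top_n)}
    drop = rare | common | set(string.punctuation)
    return [[t for t in doc if t not in drop] for doc in docs]

# Tiny thresholds so the effect is visible on a toy corpus.
docs = [["the", "pizza", "is", "good", "!"],
        ["the", "pizza", "is", "cheap", "!"],
        ["pizza", "good"]]
cleaned = preprocess(docs, stop_words={"the", "is"}, min_freq=2, top_n=1)
```

With these toy thresholds, "cheap" is removed as rare, "pizza" as the single most frequent word, and "!" as punctuation.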
The text strings are then vectorized by word frequency: each string becomes a vector over the dictionary, and each dimension of the vector holds the number of times the corresponding dictionary word occurs in the current string. For example, if the dictionary contains the words "you, me, he, up, down" and the preprocessed text string is "me, me, up", the string is vectorized as (0, 2, 0, 1, 0).
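As a small sketch of word-frequency vectorization, the following counts each dictionary word in a token list. The five-word dictionary and the helper name `count_vector` are illustrative assumptions.

```python
from collections import Counter

def count_vector(tokens, dictionary):
    """Return one count per dictionary word, in dictionary order."""
    counts = Counter(tokens)
    return [counts[word] for word in dictionary]

dictionary = ["you", "me", "he", "up", "down"]
vector = count_vector(["me", "me", "up"], dictionary)
```

The resulting vector has one dimension per dictionary word, so its length always equals the dictionary size.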
After vectorization, semantic vectors of the text strings are modeled. The resulting text vectors are mapped into a semantic space using the TFIDF (Term Frequency-Inverse Document Frequency) weighting method and the LSA (Latent Semantic Analysis) method, also known as LSI (Latent Semantic Indexing), which is the name used below.
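The TFIDF weighting can be sketched as follows. The text does not specify which TFIDF variant is used, so this sketch assumes the plain form, raw term count multiplied by log(N / df); the helper names `idf_table` and `tfidf` are hypothetical. The LSI step, which would project these weighted vectors into a lower-dimensional space (typically with a truncated singular value decomposition), is not shown.

```python
import math

def idf_table(count_vectors):
    """Inverse document frequency for each dictionary dimension."""
    n_docs = len(count_vectors)
    n_dims = len(count_vectors[0])
    table = []
    for j in range(n_dims):
        df = sum(1 for vec in count_vectors if vec[j] > 0)
        # A word that appears in no document gets weight 0.
        table.append(math.log(n_docs / df) if df else 0.0)
    return table

def tfidf(count_vector, idf):
    """Weight a word-frequency vector by the given IDF table."""
    return [tf * w for tf, w in zip(count_vector, idf)]

corpus = [[0, 2, 0, 1, 0], [1, 0, 0, 1, 0], [0, 0, 3, 1, 1]]
idf = idf_table(corpus)
weighted = tfidf(corpus[0], idf)
```

The fourth word occurs in every document of the toy corpus, so its IDF is log(1) = 0 and it contributes nothing to the weighted vector.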
Finally, each merchant has eight vectors that represent its semantic features: the merchant name TFIDF semantic vector, the merchant name LSI semantic vector, the merchant profile TFIDF semantic vector, the merchant profile LSI semantic vector, the positive review text TFIDF semantic vector, the positive review text LSI semantic vector, the negative review text TFIDF semantic vector, and the negative review text LSI semantic vector.
Step 4: Define a keyword lexicon corresponding to each context and store the keyword lexicon in the database.
The invention defines the context as the state of certain specific dimensions of the user's current trip, including the season (spring, summer, autumn, winter), the purpose of travel (such as a business trip, a leisure trip, or a family visit), and the accompanying persons (alone, family, group); other enumerable items also belong to the context. Relevant keywords are selected manually for each context; for example, for "summer", keywords such as "hot", "cold", "barbecue", and "picnic" are selected. The invention selects 100 relevant keywords per context.
3. Destination recommendation server online processing step: a user request is intercepted, the user profile and context information are parsed, similarity is calculated by linear interpolation, and the similarities are sorted to obtain the recommendation result, specifically as follows:
Step 1: The destination recommendation server intercepts the user's request and parses the information.
The user request contains a user ID, the user's profile, the user's current location, and context information.
The user profile holds the user's ratings of previously visited destinations; note that these destinations must be retrievable in the database. The user's current location and the geographic locations of the previously visited destinations must use the same measure, so the information must be parsed into a unified form, for example latitude and longitude. The context information consists of the values of the specific dimensions defined in step 4 of the server offline processing.
The request format is JSON.
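For illustration, a request of this kind might look like the following. Every field name and value here is a hypothetical choice made for the sketch; the text only states which pieces of information the request carries and that it is JSON.

```python
import json

# Hypothetical request layout: user ID, profile, current location, context.
request = {
    "user_id": "u-001",
    "profile": [
        {"destination_id": "d-17", "stars": 5, "tags": ["coffee", "quiet"]},
        {"destination_id": "d-42", "stars": 2, "tags": ["crowded"]},
    ],
    # Location unified to latitude and longitude.
    "location": {"lat": 39.9042, "lon": 116.4074},
    "context": {"season": "summer", "purpose": "business", "companions": "alone"},
}

payload = json.dumps(request)   # what the client would send
parsed = json.loads(payload)    # what the server would parse
```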
Step 2: The destination recommendation server vectorizes the user profile, mapping the user interest profile and the destination information in the database into the same feature description coordinate system.
In this step, a feature consisting of eight vectors is constructed for each user to represent the semantics of the user's interests. Each user's profile contains multiple entries of the form "destination identifier, star rating, custom tags". The invention selects the destinations whose star rating is above the average as the data expressing the user's positive interests. Each destination has eight vectors, namely the name TFIDF semantic vector, the name LSI semantic vector, the profile TFIDF semantic vector, the profile LSI semantic vector, the positive review text TFIDF semantic vector, the positive review text LSI semantic vector, the negative review text TFIDF semantic vector, and the negative review text LSI semantic vector. For each of the eight vector types, the vocabulary belonging to that vector across the selected destinations is merged to form a new vector, and the value of the new vector is then computed as the user's feature representation for that vector type. This yields the user interest preference vectors.
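One possible reading of the merging step is sketched below: the tokens of each text field are concatenated across the destinations the user rated above average, after which the merged tokens would be vectorized with the same TFIDF and LSI models as the destinations. This sketch assumes the "average" is the user's own mean star rating, and the field names, the `FIELDS` tuple, and the function name `merge_liked_tokens` are all hypothetical.

```python
from statistics import mean

FIELDS = ("name", "profile", "positive", "negative")

def merge_liked_tokens(profile, destinations):
    """Merge, per field, the tokens of destinations rated above the user's mean."""
    avg = mean(entry["stars"] for entry in profile)
    liked = [e["destination_id"] for e in profile if e["stars"] > avg]
    merged = {field: [] for field in FIELDS}
    for dest_id in liked:
        for field in FIELDS:
            merged[field].extend(destinations[dest_id][field])
    return merged

destinations = {
    "d-17": {"name": ["cafe", "luna"], "profile": ["coffee"],
             "positive": ["quiet"], "negative": ["pricey"]},
    "d-42": {"name": ["burger", "bar"], "profile": ["burgers"],
             "positive": ["fast"], "negative": ["noisy"]},
    "d-99": {"name": ["tea", "house"], "profile": ["tea"],
             "positive": ["calm"], "negative": ["small"]},
}
profile = [
    {"destination_id": "d-17", "stars": 5},
    {"destination_id": "d-42", "stars": 2},
    {"destination_id": "d-99", "stars": 4},
]
merged = merge_liked_tokens(profile, destinations)
```

Each of the four merged token lists would then yield one TFIDF vector and one LSI vector, giving the eight user vectors.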
Meanwhile, the user's custom tags are set aside temporarily as a separate feature component.
Step 3: The destination recommendation server vectorizes the context, mapping the context and the positive reviews of the destinations into the same descriptive coordinate system.
Based on the context portion of the user request, the corresponding keywords are looked up in the keyword lexicon for that context. All keywords corresponding to the context are combined with the user's custom tags from step 2 to generate a "user feature description text" (in essence an unordered list of words). The user feature description text is then vectorized: the IDF used in the TFIDF representation is taken from the IDF statistics of the destinations' positive review text collection, and the LSI representation is likewise inferred with the LSI parameters of that collection. This yields the context vector.
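The TFIDF half of this step can be sketched as follows. The lexicon keyed by `(dimension, value)` pairs, the `positive_idf` mapping from word to IDF, and the function name `context_tfidf` are assumptions made for illustration; the only point taken from the text is that the IDF values come from the positive review text collection rather than from the description text itself.

```python
from collections import Counter

def context_tfidf(context, tags, lexicon, dictionary, positive_idf):
    """Vectorize context keywords plus custom tags with positive-review IDF."""
    # Build the unordered "user feature description text".
    words = list(tags)
    for dimension, value in context.items():
        words.extend(lexicon.get((dimension, value), []))
    counts = Counter(words)
    # Term count times the IDF learned from the positive review collection.
    return [counts[w] * positive_idf.get(w, 0.0) for w in dictionary]

lexicon = {("season", "summer"): ["hot", "picnic"],
           ("purpose", "business"): ["meeting"]}
dictionary = ["hot", "picnic", "meeting", "quiet", "pizza"]
positive_idf = {"hot": 1.0, "picnic": 2.0, "meeting": 1.5,
                "quiet": 0.5, "pizza": 0.2}

context_vector = context_tfidf(
    {"season": "summer", "purpose": "business"},
    ["quiet", "picnic"], lexicon, dictionary, positive_idf)
```

Because the vector is built over the same dictionary and IDF table as the positive review vectors, the two can be compared directly by cosine similarity.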
Step 4: Using the user interest preference vectors from step 2 and the context vector from step 3, the destination recommendation server computes, for every destination in the database, the similarity between the destination and the (user, context) pair by linear interpolation. The linear interpolation formula for calculating the similarity s(p, u) is as follows:
where p is a destination, u is a user, c is a context, and simfunc(·) is the cosine similarity of two corresponding vectors. The operations over the set V mainly calculate the similarity between the user interest preference and the destination; V comprises: { name TFIDF semantic vector, name LSI semantic vector, profile TFIDF semantic vector, profile LSI semantic vector, positive review text TFIDF semantic vector, positive review text LSI semantic vector, negative review text TFIDF semantic vector, negative review text LSI semantic vector }. simcontext_lsi(p, c) computes the similarity between the destination and the context vector under the latent semantic (LSI) mapping, using the corresponding vector of the destination's positive reviews. simcontext_tfidf(p, c) computes the lexical TFIDF similarity between the destination and the context vector, likewise using the corresponding vector of the destination's positive reviews. w is the coefficient corresponding to each term; the larger w is, the greater that term's influence on how relevant the destination is judged to be for the user.
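A minimal sketch of this weighted combination is given below, assuming the score is a weighted sum of one cosine similarity per vector type in V plus the two context terms. The dictionary keys, the weight names, and the function names `cosine` and `score` are hypothetical, and the example uses only two vector types instead of eight to stay short.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def score(place, user, context, weights):
    """Weighted sum of user-interest similarities and two context similarities."""
    s = sum(weights[v] * cosine(place[v], user[v]) for v in user)
    # Both context terms use the destination's positive review vectors.
    s += weights["context_lsi"] * cosine(place["positive_lsi"], context["lsi"])
    s += weights["context_tfidf"] * cosine(place["positive_tfidf"], context["tfidf"])
    return s

place = {"name_tfidf": [1, 0], "positive_tfidf": [0, 1], "positive_lsi": [1, 1]}
user = {"name_tfidf": [1, 0], "positive_tfidf": [1, 0]}
context = {"lsi": [1, 1], "tfidf": [0, 2]}
weights = {"name_tfidf": 0.5, "positive_tfidf": 0.25,
           "context_lsi": 0.125, "context_tfidf": 0.125}

s = score(place, user, context, weights)
```

In this toy case the name vectors match exactly, the positive review vectors are orthogonal, and both context terms match, so the score is approximately 0.5 + 0 + 0.125 + 0.125 = 0.75.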
Step 5: Sort the similarities obtained in step 4, recommend the top-ranked destinations, and return the result.
Because this step is only a numerical sort, any standard sorting method, such as quicksort or heapsort, can be used.
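As one way to realize this step, a heap-based selection of the k best scores avoids sorting the full list. The parameter `k` and the function name `top_k` are assumptions; the text does not say how many destinations are returned.

```python
import heapq

def top_k(scores, k):
    """Return the k (destination, similarity) pairs with the highest similarity."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

scores = {"d-17": 0.82, "d-42": 0.31, "d-99": 0.67, "d-05": 0.74}
ranking = top_k(scores, 2)
```

`heapq.nlargest` returns the results already ordered from highest to lowest similarity, so the list can be returned to the client as is.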
It should be emphasized that the embodiments described herein are illustrative rather than restrictive, and thus the present invention is not limited to the embodiments described in the detailed description, but also includes other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art.