CN116308465B

CN116308465B - Big data analysis system based on mobile payment

Info

Publication number: CN116308465B
Application number: CN202310543846.9A
Authority: CN
Inventors: 刘丹丹; 王秋容; 王立宝; 苟延; 程小焱
Original assignee: Shenzhen Yipai Payment Technology Co ltd
Current assignee: Shenzhen Yipai Payment Technology Co ltd
Priority date: 2023-05-15
Filing date: 2023-05-15
Publication date: 2023-09-01
Anticipated expiration: 2043-05-15
Also published as: CN116308465A

Abstract

The invention discloses a big data analysis system based on mobile payment. The system comprises a server, at least one network device and at least one payment terminal, wherein the server comprises: a data acquisition and integration module, a data preprocessing module, a user characteristic calculation module, a store characteristic calculation module, a similarity calculation and sales amount prediction module, and a visualization and reporting module. The system improves data processing efficiency, analysis accuracy and practicability, provides accurate marketing strategy suggestions for merchants, and helps merchants optimize operation decisions.

Description

Big data analysis system based on mobile payment

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a big data analysis system based on mobile payment.

Background

With the popularity and development of mobile payment technology, the marketplace and retail industries increasingly rely on big data analytics to improve marketing results, optimize customer experience, and raise business efficiency. Mobile payment systems provide businesses and retailers with rich data on user consumption behavior, which can be mined and analyzed to give merchants valuable insights and predictions. However, the big data analysis methods and systems currently on the market suffer from problems such as low data processing efficiency, non-uniform data formats, inaccurate feature extraction, and inaccurate similarity calculation and sales prediction. These problems limit the application of big data analysis in the marketplace and retail industries.

When an existing big data analysis system processes data from different payment systems, manual intervention is often needed to resolve inconsistent data formats. This approach is time-consuming and labor-intensive, and is prone to errors that affect the accuracy of the analysis results. In addition, user characteristics and store characteristics are often calculated with simple statistical methods, such as counting the number of purchases of the user in different commodity categories and price intervals. This approach ignores user preferences in other respects, such as geographic location and time of purchase, so that feature extraction is not accurate enough and the performance of the recommendation system suffers. When calculating the similarity between user features and store features, a simple similarity measure is often adopted, such as Euclidean distance or the Pearson correlation coefficient. These measures perform poorly on high-dimensional sparse data and easily lead to inaccurate similarity calculations. Meanwhile, existing systems are limited in sales prediction, and have difficulty fully accounting for the diversity and complexity of users' consumption behaviors in different stores.

Therefore, the existing big data analysis method and system based on mobile payment have certain limitations and defects in the aspects of data processing, feature extraction, similarity calculation, sales prediction and the like. These problems have resulted in the accuracy and practicality of the analysis results being compromised, limiting the application of big data analysis in the marketplace and retail industries.

In order to solve these problems, there is a need for an improved mobile payment-based big data analysis method and system that can automatically extract mobile payment data in a mall, identify the data formats of different payment systems, sort and merge the data, and at the same time clean, deduplicate and format-convert it. The system needs to adopt an advanced feature extraction method that fully considers user preferences in multiple dimensions, such as commodity category, price interval and geographic position, so that the accuracy of feature extraction is improved.

Disclosure of Invention

In view of the foregoing drawbacks in the prior art, the present invention provides a mobile payment-based big data analysis system, the system including a server, at least one network device, at least one payment terminal, wherein the server includes:

the data acquisition and integration module is used for regularly extracting mobile payment data from at least one payment terminal in a market, identifying data formats of different payment systems, merging and arranging the extracted data according to user IDs, and converting the mobile payment data into a uniform format;

the data preprocessing module is used for cleaning, de-duplicating and format converting the extracted data;

The user characteristic calculation module is used for calculating user characteristics according to the tidied mobile payment data;

the store feature calculation module is used for calculating store features according to commodity information of stores;

the similarity calculation and sales amount prediction module is used for calculating the similarity between the user characteristics and the store characteristics, and predicting sales amount according to the similarity, wherein the similarity calculation uses a cosine similarity measurement method;

and the visualization and reporting module is used for presenting the analysis result in the form of a chart and a report.

Wherein, the mobile payment data at least comprises: payment ID, user ID, store ID, payment amount, commodity category, payment time, commodity price section.

Wherein, the calculating the user characteristic according to the sorted mobile payment data includes:

for each user, at least the following features are included:

the total payment amount, obtained by traversing all payment records of the user and accumulating the amount of each payment;

the payment frequency, being the number of payments made by the user within a specified time range;

the market access frequency, being the number of times the user visits the market within a specified time range;

the number of high-price commodity purchases, obtained by traversing all payment records of the user and counting each payment whose purchased commodity belongs to the high-price interval;

the number of low-price commodity purchases, obtained by traversing all payment records of the user and counting each payment whose purchased commodity belongs to the low-price interval;

the purchase price distribution, obtained by counting the number of purchases of the user in different price intervals, dividing each count by the total number of payments to obtain the purchase proportion of each price interval, and reducing the data of the user's purchases in the different price intervals to a low-dimensional vector using the t-SNE algorithm;

the commodity category preference, obtained by counting the number of purchases of the user under different commodity categories and reducing the data of the user's purchases under the different commodity categories to a low-dimensional vector using the t-SNE algorithm;

the geographic position preference, in which the market area where each store is located is represented by one-hot encoding; specifically, the N stores most frequently visited by the user are identified, and the one-hot encodings corresponding to those N stores are determined.

Wherein the store characteristics include at least: commodity category characteristics of store, price interval characteristics of store;

and after the data of the commodities on sale in the store are counted, t-SNE dimension reduction is performed on the high-dimensional commodity category characteristics and price interval characteristics of the store to obtain low-dimensional commodity category characteristics and price interval characteristics of the store.

The commodity category preference after t-SNE dimension reduction is corrected by using the ratio of the payment frequency to the market access frequency;

correcting the low-dimensional commodity category preference by using the high-price commodity purchase times and the low-price commodity purchase times;

the corrected user feature vector includes: the corrected commodity category preference and the corrected purchase price distribution;

the payment frequency is equal to the sum of the purchase times of the high-price commodity and the purchase times of the low-price commodity.

Wherein, calculating the similarity between the user characteristic and the store characteristic comprises:

calculating the similarity S between the user and a target store according to the purchase price distribution and commodity category preference of the user;

and determining the access index of the user to the target store according to the geographic position of the target store and the geographic positions of N most frequently visited stores in the geographic position information of the user, and correcting the similarity S according to the access index to obtain the corrected similarity S' of the user to the target store.

The similarity calculation uses a cosine similarity measurement method, which comprises the following steps:

the cosine similarity is used for calculating the similarity between the corrected commodity category preference and commodity category characteristics of the store, the cosine similarity is used for calculating the similarity between the corrected purchase price distribution and price interval characteristics of the store, and the similarity S between the user and the store is obtained by weighting the two.

Similarity S = α × CosineSimilarity(ModifiedCategoryPreference, commodity category feature) + β × CosineSimilarity(ModifiedPriceDistribution, price interval feature)

Wherein α and β are weight parameters with α + β = 1, used to adjust the relative importance of commodity category preference and purchase price distribution in calculating the similarity;

the similarity S represents the similarity between the user and the store, ModifiedCategoryPreference represents the corrected commodity category preference of the user, ModifiedPriceDistribution represents the corrected purchase price distribution of the user, and the commodity category feature and the price interval feature respectively represent the commodity category characteristic and the price characteristic of the store;

CosineSimilarity represents cosine similarity, which measures the cosine of the angle between two vectors:

CosineSimilarity(A, B) = (A • B) / (||A|| * ||B||)

Wherein A and B are two vectors, "•" represents the vector dot product, and "|| ||" represents the norm (length) of a vector.
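The weighted cosine similarity defined above can be illustrated with a minimal pure-Python sketch. The function names, the default α = 0.5, and the choice to return 0 when either vector has zero length are assumptions made for this example; the text does not specify them.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: (a . b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Assumption: a zero-length vector is treated as having similarity 0.
    return dot / norm if norm else 0.0

def similarity(mod_category_pref, store_category_feat,
               mod_price_dist, store_price_feat, alpha=0.5):
    """S = alpha * cos(category pair) + beta * cos(price pair), with alpha + beta = 1."""
    beta = 1.0 - alpha
    return (alpha * cosine_similarity(mod_category_pref, store_category_feat)
            + beta * cosine_similarity(mod_price_dist, store_price_feat))

# Example: category vectors aligned, price vectors orthogonal, alpha = 0.7.
s = similarity([1.0, 0.0], [2.0, 0.0], [1.0, 0.0], [0.0, 3.0], alpha=0.7)
```

Because each cosine term lies between -1 and 1 and the weights sum to 1, S computed this way also lies between -1 and 1.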

Wherein, sales prediction based on similarity includes:

determining the access index of a user to a target store according to the geographic position of the target store and the geographic positions of N stores most frequently visited in geographic position information of the user, and correcting the similarity S according to the access index to obtain corrected similarity S' of the user to the target store;

predicting an expected sales amount for each store based on the corrected similarity S';

the geographic position of the store is encoded using one-hot encoding; specifically, the store code is designed as F-(X, Y), wherein F is the floor on which the store is located, and X and Y are the coordinates of the store on that floor.

Determining an access index of a user to a store according to the geographic position of a target store and the geographic position of the most frequently visited store in geographic position information of the user, and correcting the similarity S according to the access index, wherein the method comprises the following steps:

calculating floor gap and floor store distance, calculating geographic position correlation and calculating access index;

And the floor store distance comprises the Euclidean distance between stores on the same floor or the comprehensive distance between the target store and all entrances and exits of the floor;

the corrected similarity S' = α' × S + (1 − α') × access index, where α' is an adjustment coefficient between 0 and 1, used to adjust the weight between the similarity S and the weighted access index.
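The geometric quantities and the correction step above can be sketched as follows. The sketch reads the correction formula as S' = α' × S + (1 − α') × access index, and takes the access index as an input, since how it is derived from the floor gap and store distance is not given in this passage. The `StoreLocation` type, the function names, and the default α' = 0.8 are hypothetical.

```python
import math
from typing import NamedTuple

class StoreLocation(NamedTuple):
    """Store position in the F-(X, Y) scheme: floor plus coordinates on that floor."""
    floor: int
    x: float
    y: float

def floor_gap(a, b):
    """Number of floors separating two stores."""
    return abs(a.floor - b.floor)

def same_floor_distance(a, b):
    """Euclidean distance between two stores on the same floor."""
    if a.floor != b.floor:
        raise ValueError("stores are on different floors")
    return math.hypot(a.x - b.x, a.y - b.y)

def corrected_similarity(s, access_index, alpha_prime=0.8):
    """S' = alpha' * S + (1 - alpha') * access_index, with alpha' in [0, 1]."""
    if not 0.0 <= alpha_prime <= 1.0:
        raise ValueError("alpha_prime must lie between 0 and 1")
    return alpha_prime * s + (1.0 - alpha_prime) * access_index

target = StoreLocation(floor=2, x=0.0, y=0.0)
frequent = StoreLocation(floor=2, x=3.0, y=4.0)
distance = same_floor_distance(target, frequent)
s_prime = corrected_similarity(0.5, access_index=1.0, alpha_prime=0.5)
```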

Wherein predicting the expected sales of each store based on the corrected similarity S' comprises:

calculating the average payment amount AvgPaymentAmount_i of each user:

AvgPaymentAmount_i = TotalAmount_i / PaymentFrequency_i

wherein TotalAmount_i represents the total amount paid by user i and PaymentFrequency_i represents the number of payments made by user i;

multiplying the corrected similarity S' by the average payment amount AvgPaymentAmount to obtain a predicted consumption amount matrix P_ij:

P_ij = S'_ij * AvgPaymentAmount_i;

in the predicted consumption amount matrix P_ij, i represents a user and j represents a store;

where S'_ij is the corrected similarity between the i-th user and the j-th store, and P_ij represents the predicted consumption amount of the i-th user at the j-th store;

The predicted sales of each store are obtained by summing each column of the predicted consumption amount matrix P_ij:

PredictedSales_j = Σ_i P_ij, where i represents a user, j represents a store, and the summation runs over all users;

the summation results in a one-dimensional vector in which each element represents a predicted sales of the corresponding store.
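The prediction steps above amount to scaling each row of the corrected similarity matrix by that user's average payment and then summing the columns. The following is a minimal sketch with plain lists; the function name and the choice to use an average of 0 for a user with no payments are assumptions for this example.

```python
def predict_sales(corrected_similarity, total_amount, payment_frequency):
    """Return (P, predicted_sales).

    corrected_similarity[i][j] is S'_ij for user i and store j.
    total_amount[i] and payment_frequency[i] belong to user i.
    """
    # AvgPaymentAmount_i = TotalAmount_i / PaymentFrequency_i
    avg_payment = [t / f if f else 0.0  # assumption: no payments means average 0
                   for t, f in zip(total_amount, payment_frequency)]
    # P_ij = S'_ij * AvgPaymentAmount_i
    P = [[s * avg for s in row]
         for row, avg in zip(corrected_similarity, avg_payment)]
    # PredictedSales_j = sum over users i of P_ij (column sums)
    n_stores = len(corrected_similarity[0]) if corrected_similarity else 0
    predicted_sales = [sum(row[j] for row in P) for j in range(n_stores)]
    return P, predicted_sales

# Two users, two stores; both users average 100 per payment.
P, sales = predict_sales([[0.5, 1.0], [0.25, 0.0]], [200.0, 400.0], [2, 4])
```

In this example the first store's predicted sales are 50 + 25 = 75 and the second store's are 100 + 0 = 100.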

The big data analysis system based on mobile payment improves data processing efficiency, analysis accuracy and practicability. It provides accurate and targeted marketing strategy suggestions for merchants and helps them optimize operation decisions, and its visualization and reporting module lets merchants understand the analysis results intuitively and adjust their strategies quickly.

Drawings

The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:

FIG. 1 is a schematic diagram illustrating a mobile payment based big data analysis system in accordance with an embodiment of the present invention;

Fig. 2 is a flowchart illustrating a big data analysis method based on mobile terminal payment according to an embodiment of the present invention.

Description of the embodiments

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; "plurality" generally means at least two.

It should be understood that although the terms first, second, third, etc. may be used to describe … … in embodiments of the present invention, these … … should not be limited to these terms. These terms are only used to distinguish … …. For example, the first … … may also be referred to as the second … …, and similarly the second … … may also be referred to as the first … …, without departing from the scope of embodiments of the present invention.

It should be understood that the term "and/or" as used herein is merely one relationship describing the association of the associated objects, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.

The word "if", as used herein, may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrase "if determined" or "if detected (stated condition or event)" may be interpreted as "when determined" or "in response to determination" or "when detected (stated condition or event)" or "in response to detection (stated condition or event)", depending on the context.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such product or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a commodity or device comprising such element.

The invention provides a big data analysis method based on mobile terminal payment, which comprises the following steps:

Step S1, data collection and preprocessing: mobile payment data of users in the mall is collected and preprocessed, including data cleaning, deduplication and format conversion. By processing the raw data, user characteristics, store characteristics and historical data of users' visits to stores can be obtained.

Step S2, matching user characteristics with store characteristics: the similarity S between the user and each store is calculated according to the purchase price distribution and commodity category preference of the user. The access index of the user to each store is determined according to the geographic position of the store and the geographic positions of the N stores most frequently visited by the user, and the similarity S is corrected according to the access index to obtain the corrected similarity S' for each store.

Step S3, predicting store sales: based on the corrected similarity S' and the user characteristics generated from the user's mobile payment data, the expected sales of each store are predicted.

Step S4, supply chain management: based on the predicted sales, each store is provided with product selection suggestions according to the preferences of the users expected to visit it.

Step S5, periodic updating: the user characteristics and store characteristics are updated periodically to keep the predictions accurate, including periodically collecting new user payment data and store commodity information, adjusting the user characteristics and store characteristics, and updating the matching degree and sales prediction accordingly.

The invention can realize continuous improvement of the supply chain management strategy, improve the overall operation efficiency of the market and improve the sales performance and customer satisfaction of the market.

In one embodiment, the mobile payment data is the data collected and saved over the most recent time period (e.g., 1 month or 14 days). The mobile payment data may contain the following 7 items, which together describe a single payment by a user: (1) payment ID (PaymentID): a unique identifier of the payment record; (2) user ID (UserID): a unique identifier of the user making the payment; (3) store ID (StoreID): a unique identifier of the store where the payment occurred; (4) payment amount (PaymentAmount): the amount of this payment; (5) commodity category (ProductCategory): the category to which the commodity purchased in this payment belongs; (6) payment time (PaymentTime): the time at which the payment occurred, which may include the date and the time of day; (7) commodity price interval (PriceRange): the price interval of the commodity purchased in this payment, chosen from a set of predefined price intervals (such as low price, medium price and high price, or within 30 yuan, within 100 yuan, 100 to 500 yuan, 500 to 3000 yuan, and more than 3000 yuan).
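As an illustration of the record layout, together with the cleaning and deduplication mentioned for the preprocessing step, the following minimal sketch assumes raw rows arrive as dictionaries keyed by the seven field names. The `Payment` class, the `preprocess` function, the ISO-style time strings, and the rule of dropping non-positive amounts are assumptions made for this example, not details given in the embodiment.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Payment:
    payment_id: str
    user_id: str
    store_id: str
    amount: float
    category: str
    time: datetime
    price_range: str

def preprocess(raw_rows):
    """Clean, deduplicate and convert raw dict rows into Payment records."""
    seen_ids = set()
    records = []
    for row in raw_rows:
        try:
            record = Payment(
                payment_id=str(row["PaymentID"]),
                user_id=str(row["UserID"]),
                store_id=str(row["StoreID"]),
                amount=float(row["PaymentAmount"]),
                category=str(row["ProductCategory"]),
                time=datetime.fromisoformat(row["PaymentTime"]),
                price_range=str(row["PriceRange"]),
            )
        except (KeyError, TypeError, ValueError):
            continue  # cleaning: skip rows with missing or malformed fields
        if record.amount <= 0 or record.payment_id in seen_ids:
            continue  # assumed cleaning rule, plus deduplication by PaymentID
        seen_ids.add(record.payment_id)
        records.append(record)
    return records
```

Keeping the first occurrence of each PaymentID is one possible deduplication policy; a real deployment might instead keep the most recent copy.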

Then, characteristics of the individual user are determined by the mobile payment data, the characteristics of the individual user including the following 8 items of concrete contents:

1. Total payment amount (TotalAmount): all payment records of the user are traversed and the amount of each payment is accumulated. TotalAmount = Σ(PaymentAmount_i), where i indexes the payment records.

2. Payment frequency (PaymentFrequency): the number of payments made by the user within the specified time range. PaymentFrequency = count(PaymentAmount_i), where i indexes the payment records within the specified time range.

3. Market access frequency (VisitFrequency): the number of times the user visits the mall within the specified time range. This can be derived from the payment times in the payment records: all payments made within the same calendar day are treated as one visit. To do so, the payment records may be sorted by payment time and the payment dates of adjacent records compared; whenever the payment dates of adjacent records differ, a new visit is counted.

4. Number of high-price commodity purchases (HighPricePurchase): all payment records of the user are traversed, and each payment whose purchased commodity belongs to the high-price interval is counted. The high-price interval is set according to the scenario (for example, above 1000; the threshold can be set or adjusted manually).

5. Number of low-price commodity purchases (LowPricePurchase): all payment records of the user are traversed, and each payment whose purchased commodity belongs to the low-price interval is counted. The low-price interval is set according to the scenario (for example, below 1000; the threshold can be set or adjusted manually).

6. Purchase price distribution (PriceDistribution): first, the number of purchases of the user in each price interval is counted, that is, the number of the user's payment records falling in price interval p. Then each count is divided by the total number of payments to obtain the purchase proportion of that price interval. Finally, the data of the user's purchases in the different price intervals is reduced to a low-dimensional vector using the t-SNE algorithm.

7. Commodity category preference (CategoryPreference): first, the number of purchases of the user under each commodity category is counted, that is, the number of the user's payment records under commodity category c. Then the data of the user's purchases under the different commodity categories is reduced to a low-dimensional vector using the t-SNE algorithm to facilitate subsequent processing.

8. Geographic location preference (LocationPreference): the mall area where each store is located is represented by one-hot encoding; the N stores most frequently visited by the user are identified, and the one-hot encodings of those N stores are determined.

By the above method, a feature set can be generated for each user, including their consumption behavior and preferences.
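The counting-based features above can be sketched for a single user as follows. The dictionary keys of each payment, the `user_features` name, and the use of the payment amount with a threshold of 1000 to split high-price from low-price purchases are assumptions for this example; the embodiment decides high or low by the commodity's price interval and leaves the threshold configurable. The t-SNE reduction and the one-hot encoding are left out here.

```python
from collections import Counter
from datetime import datetime

HIGH_PRICE_THRESHOLD = 1000  # example value from the text; adjustable

def user_features(payments, top_n=3):
    """Compute counting-based features from one user's payment records."""
    total_amount = sum(p["amount"] for p in payments)
    payment_frequency = len(payments)
    # Payments on the same calendar day count as one visit.
    visit_frequency = len({p["time"].date() for p in payments})
    # Assumption: amounts at or above the threshold count as high-price.
    high = sum(1 for p in payments if p["amount"] >= HIGH_PRICE_THRESHOLD)
    low = payment_frequency - high
    price_counts = Counter(p["price_range"] for p in payments)
    price_share = {k: v / payment_frequency for k, v in price_counts.items()}
    category_counts = Counter(p["category"] for p in payments)
    store_counts = Counter(p["store_id"] for p in payments)
    top_stores = [store for store, _ in store_counts.most_common(top_n)]
    return {
        "TotalAmount": total_amount,
        "PaymentFrequency": payment_frequency,
        "VisitFrequency": visit_frequency,
        "HighPricePurchase": high,
        "LowPricePurchase": low,
        "PriceShare": price_share,          # input to the t-SNE step
        "CategoryCounts": category_counts,  # input to the t-SNE step
        "TopStores": top_stores,            # stores to be one-hot encoded
    }

payments = [
    {"store_id": "S1", "amount": 1200.0, "category": "shoes",
     "time": datetime(2023, 5, 1, 10, 0), "price_range": "high"},
    {"store_id": "S1", "amount": 50.0, "category": "food",
     "time": datetime(2023, 5, 1, 12, 0), "price_range": "low"},
    {"store_id": "S2", "amount": 80.0, "category": "food",
     "time": datetime(2023, 5, 3, 9, 0), "price_range": "low"},
]
features = user_features(payments, top_n=1)
```

Counting distinct payment dates gives the same result as sorting the records and counting date changes between adjacent records.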

In one embodiment, the step of reducing the data of the number of purchases of the user under different commodity categories into a low-dimensional vector by using the t-SNE algorithm comprises the following steps:

t-SNE is a nonlinear dimension reduction method capable of maintaining a local structure of high-dimensional data in a low-dimensional space (such as two-dimensional or three-dimensional). The specific operation is as follows:

Step one, preparing data: the raw data is organized into a matrix in which each row represents a user and each column represents a commodity category. Assuming that there are N users and 50 commodity categories in the mobile payment database, the data matrix has shape N×50.

Step two, data preprocessing: the data is standardized or normalized to improve the dimension reduction effect.

Step three, reducing the dimension with t-SNE: the high-dimensional data is reduced to the target dimension using an existing t-SNE implementation (e.g., the scikit-learn library in Python). The reduction is performed by constructing tsne = TSNE(n_components=2, random_state=42) and calling reduced_data = tsne.fit_transform(data_matrix); reduced_data represents the user characteristics after dimension reduction, the 50-dimensional data having been reduced to 2 or 3 dimensions. t-SNE is a stochastic algorithm, so the result may be affected by the random seed. Different random seeds or t-SNE parameters may need to be tried to obtain the best dimension reduction effect.
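The three steps above can be put together in a short scikit-learn sketch. The synthetic Poisson purchase counts, the 40-user matrix size, and the perplexity of 10 are assumptions chosen only so the example runs on its own; they are not values from the embodiment.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Step one: an N x 50 matrix of purchase counts (synthetic data for illustration).
rng = np.random.default_rng(0)
n_users, n_categories = 40, 50
data_matrix = rng.poisson(lam=1.0, size=(n_users, n_categories)).astype(float)

# Step two: standardize each column.
scaled = StandardScaler().fit_transform(data_matrix)

# Step three: reduce 50 dimensions to 2. Perplexity is set explicitly because
# it has to stay below the number of rows in recent scikit-learn versions.
tsne = TSNE(n_components=2, perplexity=10, random_state=42)
reduced_data = tsne.fit_transform(scaled)

print(reduced_data.shape)  # (40, 2)
```

scikit-learn's `TSNE` offers `fit_transform` but no separate `transform` method, so embedding additional rows later means refitting on the combined data.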

In one embodiment, the method for reducing the data dimension of the purchase times of the user in different price intervals into a low-dimension vector by using a t-SNE algorithm comprises the following steps:

Assume that there are 4 price intervals: low price (0-100), medium price (101-500), medium-high price (501-3000) and high price (3001 or more). The number of purchases of each user in these price intervals is analyzed, and the data is reduced to a low-dimensional vector using the t-SNE algorithm. The specific steps are described below with a simplified example:

Step one, preparing data: the raw data is organized into a matrix in which each row represents a user and each column represents a price interval. Assuming that there are N users and 4 price intervals, the data matrix has shape N×4.

Suppose that the mobile payment data contains the following payment records:

User	Price interval 0-100	Price interval 101-500	Price interval 501-3000	Price interval 3001 or more
A	3	2	1	1
B	5	1	0	0
C	2	3	2	1

Step two, calculating the purchase proportion of each price interval: for each user, the number of purchases in each price interval is divided by that user's total number of payments to obtain the purchase proportion of the price interval.

The results are shown in the following table:

User	Price interval 0-100	Price interval 101-500	Price interval 501-3000	Price interval 3001 or more
A	0.43	0.29	0.14	0.14
B	0.83	0.17	0	0
C	0.25	0.38	0.25	0.13

Step three, reducing the dimension with t-SNE: the 4-dimensional data is reduced to the target dimension using an existing t-SNE implementation (e.g., the scikit-learn library in Python). The reduction is performed by constructing tsne = TSNE(n_components=2, random_state=42) and calling reduced_data = tsne.fit_transform(data_matrix); reduced_data is an N×2 matrix representing the user characteristics after dimension reduction, the 4-dimensional data having been reduced to 2 dimensions. t-SNE is a stochastic algorithm, so the result may be affected by the random seed. Different random seeds or t-SNE parameters may be tried to obtain the best dimension reduction effect.
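The proportions in step two of this example can be checked with a few lines of plain Python, using the purchase counts of users A, B and C from the first table. The function name and the handling of a user with no payments are assumptions for this sketch.

```python
# Purchase counts per price interval: 0-100, 101-500, 501-3000, 3001 or more.
counts = {
    "A": [3, 2, 1, 1],
    "B": [5, 1, 0, 0],
    "C": [2, 3, 2, 1],
}

def purchase_proportions(row):
    """Divide each interval count by the user's total number of payments."""
    total = sum(row)
    # Assumption: a user with no payments gets all-zero proportions.
    return [count / total if total else 0.0 for count in row]

proportions = {user: purchase_proportions(row) for user, row in counts.items()}
```

User A has 7 payments in total, so the first interval's share is 3/7, which rounds to the 0.43 shown in the table; each user's proportions sum to 1.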

In one embodiment, the store features include: commodity category characteristics of the store and price interval characteristics of the store. The data of the commodities on sale in the store is first counted: for example, the commodities are counted over the same 50 categories used for the user characteristics, and over the same price intervals used for the user characteristics. The resulting store vectors are then reduced with t-SNE in a manner similar to the preceding embodiments. The random seed and t-SNE parameters already obtained in the user dimension reduction of the preceding embodiments may be reused.

In one embodiment, items 1-7 of the characteristics of individual users can be clustered (for example with the K-means algorithm), so that all users are grouped into different categories, each of which is treated as a distinct user portrait. The median or mean of the characteristics within each category is then used as items 1-7 for every user in that category in the subsequent calculation of the similarity S. This simplifies the calculation and can be used when the processing power of the system is limited.

In an embodiment, the calculating the similarity S according to the purchase price distribution and the commodity category preference of the user, and correcting the similarity S based on the geographic position data in the user characteristics to obtain a corrected similarity S', includes the following steps:

and correcting the T-SNE two-dimensional vector of 6-7 by using part of user characteristics to obtain corrected commodity category preference and purchase price distribution. And calculating the similarity S between the corrected commodity category preference and the commodity category characteristics of the store and the price interval characteristics of the store by using the cosine similarity. Through the above steps, the similarity S between the user feature vector and the store feature vector can be calculated. The following specifically describes the above steps: step one, correcting the T-SNE two-dimensional vector of 6-7 items by using the items of the user characteristics:

the ratio PurchaseRatio of the payment frequency (Paymentfrequency) and the mall access frequency (VisitFrequency) is used to correct the T-SNE reduced commodity category preference (CategoryPreference). And correcting the low-dimensional commodity category preference (CategoryPreference) using the high price commodity purchase number (HighPricePurchase) and the low price commodity purchase number (LowPricePurchase). The corrected user feature vector includes: modified merchandise category preference (modified distribution), modified purchase price distribution (modified price distribution).

In order to obtain the revised commodity category preference (modifiedcategorypference) and the revised purchase price distribution (ModifiedPriceDistribution), the following method may be used:

for the revised merchandise category preference (modifiedcategorypeference), the calculation process includes: first, the ratio of the payment frequency (PaymentFrequency) to the mall access frequency (VisitFrequency) is calculated:

PurchaseRatio = PaymentFrequency / VisitFrequency；

the payment frequency (PaymentFrequency) is equal to the sum of the number of high price commodity purchases (HighPricePurchase) and the number of low price commodity purchases (LowPricePurchase).

PurchaseRatio is then combined with the commodity category preference (CategoryPreference) after T-SNE dimension reduction. This can be done by multiplying each component of the two-dimensional vector by PurchaseRatio:

ModifiedCategoryPreference = CategoryPreference * PurchaseRatio

For the modified purchase price distribution (ModifiedPriceDistribution), the calculation process includes: first, the payment frequency (PaymentFrequency) is calculated as the sum of the number of high price commodity purchases (HighPricePurchase) and the number of low price commodity purchases (LowPricePurchase):

PaymentFrequency = HighPricePurchase + LowPricePurchase

Next, the ratio of the number of high price commodity purchases to the total number of purchases and the ratio of the number of low price commodity purchases to the total number of purchases are calculated:

HighPriceRatio = HighPricePurchase / PaymentFrequency;

LowPriceRatio = LowPricePurchase / PaymentFrequency;

These two ratios are combined with the two-dimensional vector of the purchase price distribution (PriceDistribution) after T-SNE dimension reduction. This can be done by multiplying each component of the two-dimensional vector by the corresponding price interval ratio:

ModifiedPriceDistribution = PriceDistribution * [HighPriceRatio, LowPriceRatio]

The corrected feature vectors can more accurately reflect the shopping behaviors and preferences of users, thereby improving the effect of the recommendation system.
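The two corrections in step one can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function name `correct_user_features` and the zero-frequency check are assumptions, and the two input vectors are taken to be the two-dimensional T-SNE outputs described above.

```python
def correct_user_features(category_pref, price_dist,
                          high_price_purchases, low_price_purchases,
                          visit_frequency):
    """Return (ModifiedCategoryPreference, ModifiedPriceDistribution) for one user.

    category_pref and price_dist are the 2-D vectors after T-SNE reduction.
    """
    # PaymentFrequency = HighPricePurchase + LowPricePurchase
    payment_frequency = high_price_purchases + low_price_purchases
    if payment_frequency == 0 or visit_frequency == 0:
        # Assumption: users without payments or visits are rejected here.
        raise ValueError("payment and visit frequencies must be non-zero")

    # PurchaseRatio = PaymentFrequency / VisitFrequency
    purchase_ratio = payment_frequency / visit_frequency

    # ModifiedCategoryPreference = CategoryPreference * PurchaseRatio
    modified_category_pref = [c * purchase_ratio for c in category_pref]

    # HighPriceRatio and LowPriceRatio scale the two components separately.
    high_ratio = high_price_purchases / payment_frequency
    low_ratio = low_price_purchases / payment_frequency
    modified_price_dist = [price_dist[0] * high_ratio,
                           price_dist[1] * low_ratio]
    return modified_category_pref, modified_price_dist


# Example: 3 high price and 1 low price purchases over 8 mall visits.
cat, price = correct_user_features([0.8, -0.4], [1.0, 2.0], 3, 1, 8)
```

With these inputs PurchaseRatio is 0.5, so the category vector is halved, while the price vector is scaled component-wise by 0.75 and 0.25.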

Step two, calculating the similarity S between items 6-7 of the user characteristics and features a and b of the store characteristics:

Cosine similarity is used to calculate the similarity between the modified commodity category preference (ModifiedCategoryPreference) and the commodity category feature (a) of the store. Cosine similarity is also used to calculate the similarity between the modified purchase price distribution (ModifiedPriceDistribution) and the price interval feature (b) of the store. The two results are combined into the similarity S:

S = α * CosineSimilarity(ModifiedCategoryPreference, commodity category feature a) + β * CosineSimilarity(ModifiedPriceDistribution, price interval feature b)

Where α and β are weight parameters, α + β = 1, used to adjust the importance of the commodity category preference and the purchase price distribution in the similarity calculation. The values of these two parameters can be adjusted according to actual requirements. The similarity S indicates the similarity between the user and the store, ModifiedCategoryPreference indicates the modified commodity category preference of the user, ModifiedPriceDistribution indicates the modified purchase price distribution of the user, and the commodity category feature and the price interval feature indicate the commodity categories and price characteristics of the store, respectively.

CosineSimilarity denotes the cosine similarity, which measures the cosine of the angle between two vectors. When the two vectors point in the same direction, the cosine similarity is 1, indicating complete similarity; when they point in exactly opposite directions, the cosine similarity is -1, indicating complete dissimilarity; when they are orthogonal, the cosine similarity is 0, indicating no correlation. The specific formula is as follows:

CosineSimilarity(A, B) = (A • B) / (||A|| * ||B||)

wherein A and B are two vectors, "•" represents the vector dot product, and "|| ||" denotes the modulus length of a vector.

In one embodiment, the geographic position of the store is encoded using one-hot encoding (One-hot Encoding); specifically, the encoding is designed as F-(X, Y), where F is the floor and X and Y are the floor coordinates. One-hot encoding is a method of converting a categorical variable into a binary vector. In one-hot encoding, each category is represented as a binary vector with only one non-zero element. This encoding can be used to convert non-numeric data into numeric data for ease of calculation and analysis.

For different stores within a mall, a variant of one-hot encoding may be used to represent their geographic locations. The degree of association between stores on the same floor is higher than between stores on different floors, and the closer two stores are to each other, the higher their degree of association; the F-(X, Y) encoding reflects both points at once. In an embodiment, the access index of the user to the store is determined according to the geographic position of the store and the geographic positions of the N most frequently visited stores in the geographic position information of the user, and the similarity S is then corrected according to the access index to obtain the corrected similarity S' of each store.
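The cosine similarity and the weighted similarity S described above can be sketched as follows. Reading S as α times the category similarity plus β times the price similarity is an interpretation of the description of the weights α and β; the function names and the handling of zero vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """CosineSimilarity(A, B) = (A . B) / (||A|| * ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a zero vector is treated as uncorrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_s(mod_category_pref, mod_price_dist,
                 store_category_feature, store_price_feature, alpha=0.5):
    """Weighted similarity S between one user and one store.

    alpha weights the category term; beta = 1 - alpha weights the price term,
    so that alpha + beta = 1 as required in the text.
    """
    beta = 1.0 - alpha
    category_sim = cosine_similarity(mod_category_pref, store_category_feature)
    price_sim = cosine_similarity(mod_price_dist, store_price_feature)
    return alpha * category_sim + beta * price_sim
```

Because each cosine term lies in [-1, 1] and the weights sum to 1, S also lies in [-1, 1].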

First, the access index of the user to the store must be determined. Taking N = 1 as an example, the access index is calculated in the following steps: calculating the floor gap (A, B); calculating the floor store distance (A, B) (either the Euclidean distance between stores on the same floor, or the comprehensive distance between the store and all entrances and exits of the floor); calculating the geographic location correlation (A, B); and calculating the access index (A, B).

In one embodiment, the method of calculating the geographic location correlation may be optimized (taking N = 1 as an example) to better account for the effects of the floor gap and the distance between stores. The specific steps are as follows. Step one, calculating the floor gap between the store the user visits most frequently and the target store:

Floor gap (A, B) = |A_floor - B_floor|, which is 0 if A and B are on the same floor.

Step two, calculating the floor store distance (A, B). If A and B are on the same floor, the floor store distance is the Euclidean distance between the two stores: floor store distance (A, B) = sqrt((A_x - B_x)^2 + (A_y - B_y)^2). Otherwise, the comprehensive distance between the target store A and all the entrances and exits of its floor is calculated. Assuming that the floor has three entrances and exits, denoted E_1, E_2, and E_3, a weighted sum of the distances between store A and the three entrances and exits can be calculated. In this case, floor store distance (A, B) = comprehensive distance (A) = w_1 × d(A, E_1) + w_2 × d(A, E_2) + w_3 × d(A, E_3), where d(A, E_k) is the distance between store A and entrance or exit E_k, and w_1, w_2, and w_3 are the weights of the respective entrances and exits, which can be set according to the actual situation. The greater the weight, the greater the influence of that entrance or exit.

Step three, processing the floor gap to reflect the influence of higher and lower floors. An exponential parameter β (0 < β < 1) is introduced to adjust the effect of the floor gap on the geographic location correlation: floor gap adjustment coefficient = β^(floor gap).

Step four, calculating the geographic location correlation between the stores from the floor store distance (A, B) and the floor gap adjustment coefficient:

Geographic location correlation (A, B) = floor gap adjustment coefficient × floor store distance (A, B).

Step five, converting the calculated geographic location correlation into an access index:

Access index (A, B) = 1 - geographic location correlation (A, B).

The access index is between 0 and 1, and the smaller the geographic location correlation, the higher the access index. In one embodiment, the exponential parameter β (0 < β < 1) introduced to adjust the effect of the floor gap on the geographic location correlation is determined as follows:

When the floor the user most frequently visits is higher than the target store floor: β(diff) = 1 - α1 × diff^2;

When the floor the user most frequently visits is lower than or equal to the target store floor: β(diff) = 1 - α2 × diff;

wherein diff represents the difference between the floor of the store most frequently visited by the user and the floor of the target store, and α1 and α2 are parameters that adjust the degree of influence of the floor gap on the access index in the two cases, with 0 < α1 < 1 and 0 < α2 < 1. α2 is smaller than α1, indicating that the access index decreases relatively slowly when the floor frequently visited by the user is lower than the target store floor. When the floor frequently visited by the user is higher than the target store floor, the access index decays rapidly, and the use of diff^2 (the square of diff) strengthens this effect. Squaring diff when the most frequently visited floor is above the target store floor lets the access index decline relatively slowly while the two floors differ only slightly; as the difference increases, the decay rate of the access index gradually increases.

When the floor frequently visited by the user is higher than the target store floor, the β value decreases as the floor gap increases, so that the access index gradually decreases; when the floor frequently visited by the user is lower than or equal to the target store floor, the β value also decreases as the floor gap increases, but at a relatively slow rate. When diff is 0, β takes its maximum value of 1.
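Steps one to five for N = 1, together with the piecewise β(diff), can be sketched as below. Several points are assumptions rather than statements of the embodiment: stores are represented as (floor, x, y) tuples following the F-(X, Y) encoding, coordinates are assumed to be scaled so that distances fall within [0, 1], the floor gap adjustment coefficient is read as β raised to the power of the floor gap, and diff is taken as an absolute value. How β(diff) feeds into the coefficient is not spelled out, so `beta_piecewise` is shown on its own.

```python
import math


def floor_store_distance(target, visited, entrances, weights):
    """Step two: distance term between target store A and visited store B."""
    if target[0] == visited[0]:
        # Same floor: Euclidean distance between the two stores.
        return math.hypot(target[1] - visited[1], target[2] - visited[2])
    # Different floors: weighted sum of the distances from the target store
    # to each entrance/exit (x, y) of its floor.
    return sum(w * math.hypot(target[1] - ex, target[2] - ey)
               for w, (ex, ey) in zip(weights, entrances))


def access_index(target, visited, entrances, weights, beta=0.5):
    """Access index (A, B) = 1 - coefficient * floor store distance."""
    gap = abs(target[0] - visited[0])                      # step one
    distance = floor_store_distance(target, visited,
                                    entrances, weights)    # step two
    coefficient = beta ** gap                              # step three
    correlation = coefficient * distance                   # step four
    return 1 - correlation                                 # step five


def beta_piecewise(visited_floor, target_floor, alpha1=0.2, alpha2=0.1):
    """Piecewise beta(diff); alpha1 and alpha2 defaults are placeholders."""
    diff = abs(visited_floor - target_floor)
    if visited_floor > target_floor:
        return 1 - alpha1 * diff ** 2   # most visited floor is higher
    return 1 - alpha2 * diff            # lower than or equal to target floor
```

Note that the formulas as stated do not clamp β(diff), so large floor differences with large α1 can yield negative values; a production version would need to decide how to handle that case.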

In an embodiment, the access index of the user to the store is determined according to the geographic position of the store and the geographic positions of the N most frequently visited stores in the geographic position information of the user, and the similarity S is then corrected according to the access index to obtain the corrected similarity S' of each store. When N > 1, the share of each frequently visited store (between 0 and 1) is determined from its number of visits and used as the weight coefficient of that store; the access indexes are calculated by the method of the above embodiment, and the N access indexes are then combined in a weighted sum that serves as the access index (weighted access index) of the user to the target store.
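For N > 1, the weighted summation described above can be sketched as follows. The function name is hypothetical, and the weights are normalized over the N frequently visited stores only, which is an assumption made so that they sum to 1.

```python
def weighted_access_index(access_indexes, visit_counts):
    """Combine N access indexes into one weighted access index.

    access_indexes[i]: access index between frequently visited store i
                       and the target store.
    visit_counts[i]:   number of visits the user made to store i.
    """
    total = sum(visit_counts)
    if total == 0:
        raise ValueError("visit counts must not all be zero")
    # Each store's share of visits (between 0 and 1) is its weight.
    weights = [count / total for count in visit_counts]
    return sum(a * w for a, w in zip(access_indexes, weights))
```

With a single store (N = 1) the weight is 1 and the function returns that store's access index unchanged.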

In one embodiment, the specific steps for correcting the similarity S based on the access index are as follows:

For each store within the mall, a weighted average of the access indexes between that store and the N stores the user visits most frequently is calculated:

Step one, weighted access index = Σ (access index (store_i, target store) × weight_i), where i ranges over the N stores the user visits most frequently, and weight_i may be set as the ratio of the number of the user's visits to store_i to the total number of visits.

Step two, correcting the similarity S according to the weighted access index by taking a weighted average of the weighted access index and the similarity S to obtain the corrected similarity:

Corrected similarity S' = α' × S + (1 - α') × weighted access index, where α' is an adjustment coefficient between 0 and 1 used to adjust the weight between the similarity S and the weighted access index. When N = 1, the weighted access index is simply the access index (only the single most frequently visited store is used for the correction), that is, corrected similarity S' = α' × S + (1 - α') × access index.

In one embodiment, the expected sales of each store are predicted based on the corrected similarity S' and the user characteristics generated from the user's mobile payment data, and product selection adjustment suggestions are given to stores with low predicted sales. The expected sales of each store are predicted as follows:

Step one, calculating the average payment amount (AvgPaymentAmount). For each user, it may be calculated by:

AvgPaymentAmount_i = TotalAmount_i / PaymentFrequency_i

where TotalAmount_i represents the total amount paid by user i and PaymentFrequency_i represents the number of payments by user i.

Step two, multiplying the corrected similarity S' by the average payment amount (AvgPaymentAmount) to obtain the predicted consumption amount matrix P_ij:

P_ij = S'_ij * AvgPaymentAmount_i；

In the predicted consumption amount matrix P_ij, i denotes a user and j denotes a store.

P_ij represents the predicted consumption amount of the i-th user at the j-th store. It is obtained by multiplying the corrected similarity S'_ij by the average payment amount AvgPaymentAmount_i. In this way, a matrix is obtained in which each element represents the predicted consumption amount of the corresponding user at the corresponding store.

S'_ij is the corrected similarity between the i-th user and the j-th store. It is obtained by correcting the original similarity S with the access index of the user to each area of the mall. The corrected similarity S' better reflects the preference of the user for different stores, thereby improving the accuracy of the recommendation system.

The original similarity S is obtained by calculating the similarity between the user characteristics and the store characteristics. The corrected similarity S' incorporates the user's access index for each area of the mall so as to better measure the actual preference of the user for different stores.

Because each element of P_ij scales the average payment amount AvgPaymentAmount_i by the corrected similarity S'_ij, the user's consumption amount at each store is predicted more accurately, providing valuable consumer behavior analysis and recommendations to the mall.

Step three, the total predicted sales of each store are obtained by summing each column of the predicted consumption amount matrix P_ij: predicted sales_j = Σ_i (P_ij), where i represents a user, j represents a store, and the summation runs over all users (from 1 to N). This results in a vector in which each element represents the predicted sales of the corresponding store.
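The correction of S and the three prediction steps can be sketched together as follows. The correction is read here as S' = α' × S + (1 - α') × weighted access index, which is an interpretation of the weighted-average description; the function names and the nested-list matrix layout are assumptions for illustration.

```python
def corrected_similarity(s, weighted_access_index, alpha_prime=0.7):
    """S' as a weighted average of S and the (weighted) access index."""
    return alpha_prime * s + (1 - alpha_prime) * weighted_access_index


def predict_store_sales(s_prime, total_amounts, payment_frequencies):
    """Return (P, sales) from the corrected similarity matrix.

    s_prime[i][j]:          corrected similarity S'_ij of user i and store j.
    total_amounts[i]:       TotalAmount_i.
    payment_frequencies[i]: PaymentFrequency_i (assumed non-zero).
    """
    # Step one: AvgPaymentAmount_i = TotalAmount_i / PaymentFrequency_i
    avg_payment = [total / freq
                   for total, freq in zip(total_amounts, payment_frequencies)]
    # Step two: P_ij = S'_ij * AvgPaymentAmount_i
    p = [[s_ij * avg_i for s_ij in row]
         for row, avg_i in zip(s_prime, avg_payment)]
    # Step three: predicted sales_j = sum over users i of P_ij (column sums)
    sales = [sum(column) for column in zip(*p)]
    return p, sales


# Two users, two stores.
p, sales = predict_store_sales([[0.5, 1.0], [0.25, 0.0]],
                               [200, 400], [4, 8])
```

Here both users have an average payment of 50, so the predicted sales vector is the column sum of the similarity matrix scaled by 50.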

In one embodiment, the product selection adjustment suggestions given include, but are not limited to: for stores with different sales, adjusting the commodity prices and commodity categories of the store according to the purchase price distribution and commodity category preference of users so as to meet user demand. For stores whose predicted sales are too low, the selection may be adjusted as follows:

i. Price adjustment: the prices of commodities are adjusted according to the purchase price distribution of users, for example by offering more appropriately priced commodities or holding sales promotions to attract consumers.

ii. Commodity category advice: the commodity categories are adjusted according to the commodity category preference of users, for example by adding best-selling commodities or commodities that meet the needs of particular consumers.

In one embodiment, to achieve periodic updates of user features and store features to ensure accuracy of data predictions, the following specific implementations and techniques may be employed:

Periodically collecting and updating data: new user payment data and store commodity information are collected periodically (e.g., weekly or monthly) from the mobile payment platform or other data providers. This may be achieved by means of API calls or periodic data exports.

Data preprocessing: the collected new data is preprocessed, including data cleaning, deduplication, format conversion, and the like. This may be achieved by writing a data processing script or using a data processing tool such as the Pandas library (Python).

Recalculating user features and store features: the user features (such as payment amount, payment frequency, mall access frequency, and the like) and the store features (such as commodity category features, price interval features, and the like) are recalculated from the updated data. This can be achieved by writing corresponding data processing and analysis code, for example using the NumPy and SciPy libraries of Python.

Recalculating the matching degree and sales prediction: the matching degree is recalculated using the updated user features and store features, and the sales prediction is updated according to the recalculated matching degree. This may be accomplished by invoking the previously implemented similarity calculation and prediction model code.

Automation and scheduling: the whole updating process is automated, and the updating task is executed regularly by writing a script or using a scheduling tool (such as Apache Airflow), so that the data is ensured to be updated in time.

Monitoring and alarming: to ensure the stability and accuracy of the update process, monitoring and alarm mechanisms may be provided, covering for example data quality checks and the execution status of update tasks. When an anomaly or problem is found, the relevant personnel may be notified by mail or other means. This may be accomplished by integrating existing monitoring and alarm tools (e.g., Grafana, Prometheus, etc.).

In one embodiment, a close fit of hardware and software is required to achieve mobile payment big data analysis.

In hardware, the following devices are needed to jointly implement all the functions of the scheme:

Server: used for storing the data of each payment database within the mall and for providing the computing resources required to process and analyze the data. The server may be hosted with a cloud service provider (e.g., Amazon Web Services, Google Cloud Platform, or Microsoft Azure) or in a private data center.

Network equipment: the data transmission between the payment system and the server in the market is ensured to be efficient, stable and safe. This includes network devices such as switches, routers, firewalls, etc.

Payment terminal: payment terminals in the mall (such as POS devices, self-checkout devices, etc.) need to support various mobile payment modes and synchronize data with the server in real time. The mobile payment data may be stored in the payment terminal, or may be periodically reported by the payment terminal to the network device, temporarily stored there, and periodically reported by the network device to the server.

In the server, a complete set of data acquisition and big data analysis software needs to be developed to implement the mobile payment data processing and analysis functions of the scheme. The software part of the server comprises the following modules:

Data acquisition and integration module: acquisition software is developed for periodically extracting payment data from the various payment databases within the mall. The acquisition software needs to be able to identify the data format of each payment system and convert the data into a unified format. In addition, the software needs to support merging and sorting the extracted data by a unique identifier (e.g., user ID).

Data preprocessing module: a preprocessing module is developed for cleaning, deduplicating, and format-converting the extracted data to ensure its quality and consistency.

User characteristic calculation module: a user characteristic calculation module is developed for calculating various user characteristics, such as payment amount, payment frequency, and commodity category preference, from the sorted data. The module may be developed using programming languages such as Python or R and corresponding data processing libraries (e.g., Pandas, NumPy, etc.).

Store feature calculation module: a store feature calculation module is developed for calculating various store features, such as commodity category features and price interval features, from the commodity information of the store.

Similarity calculation and sales amount prediction module: a similarity calculation module is developed for calculating the similarity between the user characteristics and the store characteristics and predicting sales according to the similarity. The similarity calculation may use a similarity measure such as cosine similarity.

Visualization and reporting module: a visualization and reporting module is developed for presenting the analysis results to the store manager in the form of charts and reports. This may be developed using business intelligence tools such as Tableau or Power BI, or using Python libraries such as Matplotlib and Seaborn.
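The acquisition, integration, and preprocessing behavior described for the first two modules can be sketched as below. The two source formats (`system_a`, `system_b`) and their field names are hypothetical, since the text does not specify any payment system's schema, and payment IDs are assumed to be unique across systems. Only the Python standard library is used to keep the sketch self-contained; the Pandas library mentioned in the text would serve equally well.

```python
from collections import defaultdict
from datetime import datetime


def normalize(record, source):
    """Convert one raw record from a hypothetical source into a unified format."""
    if source == "system_a":
        return {
            "payment_id": record["pid"],
            "user_id": record["uid"],
            "store_id": record["sid"],
            "amount": float(record["amt"]),
            "time": datetime.strptime(record["ts"], "%Y-%m-%d %H:%M:%S"),
        }
    if source == "system_b":
        return {
            "payment_id": record["PaymentID"],
            "user_id": record["UserID"],
            "store_id": record["StoreID"],
            "amount": int(record["AmountCents"]) / 100,  # cents to currency units
            "time": datetime.strptime(record["Time"], "%d/%m/%Y %H:%M"),
        }
    raise ValueError(f"unknown source: {source}")


def integrate(batches):
    """Merge batches of (source_name, raw_records) into per-user payment lists.

    Duplicated payment IDs and non-positive amounts are dropped (cleaning and
    deduplication); records are grouped by user ID and sorted by payment time.
    """
    seen_payment_ids = set()
    by_user = defaultdict(list)
    for source, records in batches:
        for raw in records:
            rec = normalize(raw, source)
            if rec["payment_id"] in seen_payment_ids or rec["amount"] <= 0:
                continue
            seen_payment_ids.add(rec["payment_id"])
            by_user[rec["user_id"]].append(rec)
    for recs in by_user.values():
        recs.sort(key=lambda r: r["time"])
    return dict(by_user)
```

Keeping format conversion in `normalize` and cleaning in `integrate` mirrors the split between the acquisition module and the preprocessing module, so a new payment system only requires one more branch in `normalize`.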

In one embodiment, the appropriate server type is selected based on the evaluated computing requirements. A physical server or a virtual server may be selected. Physical servers are suitable for large-scale, high-performance computing demands, while virtual servers are more suitable for flexible, scalable computing demands. Cloud service providers (e.g., Amazon Web Services, Microsoft Azure, Alibaba Cloud, or the china cloud, etc.) offer many types of virtual servers to choose from.

In one embodiment, after a server is selected, an appropriate number of CPU cores and threads is configured to meet the computing needs. For computationally intensive tasks, a high performance CPU (e.g., Intel Xeon or AMD EPYC, etc.) may be selected. Meanwhile, considering the requirements of parallel computing, an appropriate number of CPU cores and threads is allocated to tasks that support multiple threads or multiple processes.

In one embodiment, load balancing and optimization strategies may be set to ensure efficient resource usage by the various modules during the computation process. For example, compute-intensive tasks may be assigned to servers with more CPU cores and threads, while I/O-intensive tasks may be assigned to servers with higher disk and network performance.

In one embodiment, the CPU resources of the server are dynamically adjusted according to the actual requirements of the computing task. For example, more CPU resources may be required to process large amounts of data during the data preprocessing stage, while less resources may be required during the subsequent similarity calculation and prediction stage. The server configuration can be flexibly adjusted by monitoring the resource use condition of the task so as to improve the resource utilization rate.

In one embodiment, the CPU utilization, temperature, power consumption, etc. of the server are continuously monitored throughout the calculation process to ensure stable operation of the server. If performance bottleneck or fault is encountered, diagnosis and maintenance are performed in time so as to ensure smooth performance of calculation tasks.

It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.

The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.

The foregoing description of the preferred embodiments of the present invention has been presented for purposes of clarity and understanding, and is not intended to limit the invention to the particular embodiments disclosed, but is intended to cover all modifications, alternatives, and improvements within the spirit and scope of the invention as outlined by the appended claims.

Claims

1. A mobile payment based big data analysis system, the system comprising a server, at least one network device, at least one payment terminal, wherein the server comprises:

a visualization and reporting module for presenting the analysis results in the form of graphs and reports;

wherein, the mobile payment data at least comprises: payment ID, user ID, store ID, payment amount, commodity category, payment time, commodity price interval;

wherein the calculating the similarity between the user characteristic and the store characteristic comprises:

determining the access index of the user to the target store according to the geographic position of the target store and the geographic positions of N most frequently visited stores in the geographic position information of the user, and correcting the similarity S according to the access index to obtain the corrected similarity S' of the user to the target store;

CosineSimilarity(A, B) = (A • B) / (||A|| * ||B||)；

wherein A and B are two vectors, "•" represents the vector dot product, and "|| ||" denotes the modulus length of a vector;

the commodity category preference CategoryPreference after T-SNE dimension reduction is modified by using the ratio PurchaseRatio of the payment frequency PaymentFrequency and the market access frequency VisitFrequency;

and correcting the low-dimensional commodity category preference CategoryPreference by using the high price commodity purchase times HighPricePurchase and the low price commodity purchase times LowPricePurchase;

the corrected user feature vector includes: the modified commodity category preference ModifiedCategoryPreference and the modified purchase price distribution ModifiedPriceDistribution;

in order to obtain the modified commodity category preference ModifiedCategoryPreference and the modified purchase price distribution ModifiedPriceDistribution, the method comprises the following steps:

for the modified commodity category preference ModifiedCategoryPreference, the calculation process includes:

first, the ratio of the payment frequency PaymentFrequency to the mall access frequency VisitFrequency is calculated:

PurchaseRatio = PaymentFrequency / VisitFrequency；

the payment frequency PaymentFrequency is equal to the sum of the high price commodity purchase times HighPricePurchase and the low price commodity purchase times LowPricePurchase;

then, PurchaseRatio is combined with the commodity category preference CategoryPreference after T-SNE dimension reduction by multiplying each component of the two-dimensional vector by PurchaseRatio:

ModifiedCategoryPreference = CategoryPreference * PurchaseRatio

for the modified purchase price distribution, the calculation process comprises the following steps:

first, the payment frequency PaymentFrequency is calculated as the sum of the number of high price commodity purchases HighPricePurchase and the number of low price commodity purchases LowPricePurchase:

PaymentFrequency = HighPricePurchase + LowPricePurchase；

next, the ratio of the number of purchases of the high-price commodity to the total number of purchases and the ratio of the number of purchases of the low-price commodity to the total number of purchases are calculated.

HighPriceRatio = HighPricePurchase / PaymentFrequency ；

LowPriceRatio = LowPricePurchase / PaymentFrequency；

combining these two ratios with the two-dimensional vector of the purchase price distribution PriceDistribution after T-SNE dimension reduction, which is accomplished by multiplying each component of the two-dimensional vector by the corresponding price interval ratio:

ModifiedPriceDistribution = PriceDistribution * [HighPriceRatio, LowPriceRatio].

2. The mobile payment based big data analysis system of claim 1, wherein the computing the user characteristic from the consolidated mobile payment data comprises:

for each user, at least the following features are included:

3. A mobile payment based big data analysis system as in claim 1, wherein the store characteristics include at least: commodity category characteristics of store, price interval characteristics of store;

4. The mobile payment based big data analysis system of claim 1, wherein the sales prediction based on the similarity comprises:

the geographic position of the store is encoded using one-hot encoding (One-hot Encoding); specifically, the encoding of the store is designed as F-(X, Y), wherein F is the floor of the store, and X and Y are the floor coordinates of the store.

5. The mobile payment based big data analysis system of claim 4,

6. The mobile payment-based big data analysis system of claim 5, wherein the predicting the expected sales of each store based on the revised similarity S' comprises:

calculating the average payment amount AvgPaymentAmount:

AvgPaymentAmount_i = TotalAmount_i / PaymentFrequency_i

P_ij = S'_ij * AvgPaymentAmount_i；