CN114372090A

CN114372090A - User reading behavior analysis and prediction system under big data environment

Info

Publication number: CN114372090A
Application number: CN202111662826.0A
Authority: CN
Inventors: 李丹丹; 段娟; 肖创柏
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2021-12-31
Filing date: 2021-12-31
Publication date: 2022-04-19

Abstract

The invention discloses a system for analyzing and predicting the reading behavior of users in a big data environment. The system comprises a text data correlation analysis unit, a user data correlation analysis unit, a data anomaly analysis unit and a user behavior prediction unit. It is divided into a user data storage layer, a user data processing layer, a user data analysis and modeling layer, a service layer and a display layer. The user data processing layer covers source data acquisition, source data cleaning, data storage, and data management and maintenance. The user data analysis and modeling layer includes code for text data correlation analysis, user data correlation analysis, data anomaly analysis, and user behavior prediction. The service layer comprises a data service, a behavior service, a user service, a portrait service and a forecast service. The display layer is mainly responsible for displaying the results of the statistical analysis on the interface. The system design improves code maintainability, readability and flexibility, and facilitates system management and maintenance.

Description

User reading behavior analysis and prediction system under big data environment

Technical Field

The invention belongs to the technical field of computer and big data application analysis, and particularly relates to a system for analyzing and predicting reading behaviors of a user.

Background

In the context of big data, analyzing user behavior is of great significance. User portraits, user behavior anomaly detection and user behavior prediction are three important parts of user behavior analysis. Analyzing and predicting from such data brings out the value of the data, supports the rapid development of enterprises, and provides enterprises with more valuable information. The technical subject of the invention is the construction of a user behavior analysis and prediction system by collecting and analyzing the behavior data of users in a search application. The system can quickly and efficiently discover the relationships among users, behaviors and data, so that portraits of users, keywords and data can be constructed. A user portrait is a label model of the user, obtained by analyzing user behavior data, that covers basic attributes, behavior characteristics, social networks, psychological characteristics, interests and hobbies, and the like. Based on the characteristics of user behavior, a profile of the user's normal behavior is established, and the degree to which the user's actual activity deviates from this normal profile is measured to judge whether the behavior is abnormal. The user behavior data and portrait data are then used to predict user behavior, optimize the user experience and provide better personalized search services.

Disclosure of Invention

The invention aims to analyze and predict user behavior. A functional diagram of the invention is shown in FIG. 1.

The technical scheme adopted by the invention is a system for analyzing and predicting user reading behavior in a big data environment. The system comprises a text data correlation analysis unit, a user data correlation analysis unit, a data anomaly analysis unit and a user behavior prediction unit, wherein:

The user reading behavior analysis and prediction system in the big data environment can be divided into a user data storage layer, a user data processing layer, a user data analysis and modeling layer, a service layer and a display layer. The user data storage layer holds the information stored in MySQL. The user data processing layer covers source data acquisition, source data cleaning, data storage, and data management and maintenance. The user data analysis and modeling layer includes code for text data correlation analysis, user data correlation analysis, data anomaly analysis, and user behavior prediction. The service layer comprises a data service, a behavior service, a user service, a portrait service and a forecast service. The display layer is mainly responsible for displaying the results of the statistical analysis on the interface.

The text data correlation analysis unit performs multi-dimensional mining and study of the large amount of text data in a website, so that users can be served better. The text data analysis includes text basic information, text portraits, and text statistical information.

The text basic information comprises a title, an author, a year, a brief introduction, keywords, a price, a label, adding time and article classification.

The text portrait comprises search quantity, click quantity, reading quantity, comment quantity, praise quantity, collection quantity and exposure quantity.

The text statistical information comprises text search quantity ranking distribution, text search conversion rate distribution, text click quantity distribution, text reading quantity ranking distribution, text comment quantity ranking distribution, text praise quantity ranking distribution, text collection quantity ranking distribution, text exposure quantity ranking distribution, text reading user quantity distribution, text reading time distribution, text related keywords distribution, text label distribution, text classification distribution, keyword search quantity distribution, keyword search conversion rate distribution, keyword click quantity distribution, keyword affiliated classification distribution, keyword hit article distribution, search user ranking distribution and article classification distribution.

The user data correlation analysis unit first performs a preliminary statistical analysis of the log information generated when users browse the website. It then studies user behavior in depth using data mining, guided by the actual needs of the project, to find the usage preferences and behavior patterns of users visiting the website. These patterns are combined with the website's marketing strategy to address problems with the website.

The user data analysis includes user basic information, user portrayal and user statistical information.

The user basic information comprises a user name, a name, an age, a gender, a contact address, a registered IP, a login place, an operator, adding time and latest operation time.

The user portrait includes successful search volume, failed search volume, unchecked search volume, total click volume, total reading volume, comment volume, approval volume, and collection volume.

The user statistical information comprises user search volume ranking distribution, user search conversion rate statistics, user click volume ranking distribution, user reading time period distribution, user comment volume ranking distribution, user approval volume ranking distribution, user collection volume ranking distribution, user registration time distribution, user access time distribution, user affiliated region distribution, user use operator distribution, user use time interval time distribution, user browsing conversion rate statistics, search click rate statistics and user label distribution.

The data anomaly analysis unit is based on the user's normal behavior profile: user behavior shows a certain regularity overall, but also local contingency. The data reflecting this contingency deviates from the user's usual behavior and is therefore treated as anomalous data.

The data anomaly analysis comprises data anomaly basic information and data anomaly statistical information.

The basic information of the data exception comprises a serial number, a name, a content introduction, a keyword, a type, an exception time, a user, a place and a search IP.

The data anomaly statistical information comprises distribution of violation hit keywords, distribution of user ip anomalies, distribution of comment content violations, distribution of user search vocabulary anomalies, distribution of user search quantity anomalies, distribution of user click quantity anomalies, distribution of user reading time period anomalies, distribution of user comment quantity anomalies, distribution of user approval quantity anomalies, distribution of user collection quantities anomalies and distribution of user access time period anomalies.

The user behavior prediction unit performs statistical analysis of the various factors that influence the user and carries out modeling based on the characteristics found. User behavior features are then selected to construct a user behavior prediction model. The main prediction indexes are user search word prediction, user search word anomaly prediction, user search behavior frequency anomaly prediction, user search article anomaly prediction, user article click anomaly prediction, user article reading prediction and user article reading anomaly prediction.

Most existing systems are written in Java, whereas the present invention uses PHP with the Laravel framework. Laravel's simple and elegant style simplifies the implementation of the system. Its good support for RESTful interfaces greatly helps to separate the front end from the back end of the system. Laravel's design ideas, such as the IoC container and dependency injection, are among the most advanced of the current mainstream PHP frameworks and suit a wide range of development modes. Its good support for Composer makes the management of project dependencies simpler and more convenient, which plays a vital role throughout the development of the system. The system adopts the Model-View-Controller (MVC) architecture pattern, which is divided into three components: the Model, the View and the Controller. The Model layer is responsible for modeling the data. The View layer is responsible for generating the user interface, that is, for presenting data obtained from the Model layer to the terminal and providing interaction. The Controller layer connects the Model layer and the View layer: on one side it requests the Model to process the data source, and on the other side it passes the processing result to the View; the process in between is the Controller's responsibility. This design pattern decouples the three components so that they do not depend directly on one another, which improves code maintainability, readability and flexibility and facilitates system management and maintenance.

Drawings

FIG. 1 is an overall functional diagram of the present invention.

FIG. 2 is a system architecture diagram of the present invention.

FIG. 3 is a flow chart of the operation of the present invention.

FIG. 4 is a flow chart of the K-Means algorithm of the present invention.

FIG. 5 is a diagram of the prediction mechanism of the collaborative filtering technique of the present invention.

Detailed Description

The system architecture of the invention is shown in FIG. 2. When a user operates the system, the behavior logs generated by the user are collected and analyzed, and the results are displayed on the interface. The system can be divided into a user data storage layer, a user data processing layer, a user data analysis and modeling layer, a service layer and a display layer.

The user data storage layer stores the text information, the user information, the users' behavior logs and the results of the statistical analysis of texts and users in MySQL and Redis.

The user data processing layer covers source data acquisition, source data cleaning, data storage, and data management and maintenance. The source data comes mainly from user operation logs. Data is collected through tracking points embedded in the code: each user request is processed and supplemented with additional information to form a user log entry, which is then stored in MySQL and Redis by PHP code. Alternatively, users' historical behavior data in JSON format can be imported in batches, parsed, and written into the user log.
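The batch import path can be illustrated with a short sketch. It is written in Python for brevity and is not the PHP code of the invention; the record fields (`user`, `action`, `keyword`, `time`) and the cleaning rule (drop records that lack a required field) are assumptions made only for this example, and an in-memory list stands in for the user log stored in MySQL and Redis.

```python
import json

# Hypothetical schema: the source does not specify the log fields.
REQUIRED_FIELDS = ("user", "action", "time")


def import_history(json_text, user_log):
    """Parse a JSON array of behavior records and append the valid ones to user_log.

    Returns the number of records skipped during cleaning.
    """
    skipped = 0
    for record in json.loads(json_text):
        # Cleaning step (assumed rule): drop records missing a required field.
        if not isinstance(record, dict) or any(not record.get(f) for f in REQUIRED_FIELDS):
            skipped += 1
            continue
        # Information supplement: normalize the record into a full log entry.
        user_log.append({
            "user": record["user"],
            "action": record["action"],
            "keyword": record.get("keyword", ""),
            "time": record["time"],
        })
    return skipped
```

In a real deployment the `append` call would be replaced by a database write; the list is used here only to keep the sketch self-contained.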

The user data analysis and modeling layer includes code for text data correlation analysis, user data correlation analysis, data anomaly analysis, and user behavior prediction. It is mainly implemented by PHP code in controllers under the Laravel framework (Laravel 5.5). The request life cycle of Laravel is shown in FIG. 3. After receiving a user request (Request), Laravel assigns it to a route (Route); the route's URI and method determine which controller (Controller) will handle the request. Before reaching the controller, the request data passes through middleware (Middleware). After the controller receives the request data, it confirms the correctness of the data through a validator (Validator), dispatches background work as a job (Job) to a queue (Queue) backed by Redis, obtains data from the database (MySQL) through a model (Eloquent Model), and outputs the data interface to the user through a template (Blade).

Abnormal data is determined by the K-Means clustering algorithm. The flow chart of the K-Means clustering algorithm is shown in FIG. 4, and the algorithm steps are as follows:

(1) selecting an initial cluster center for each cluster;

(2) distributing the sample set to the nearest cluster according to the minimum distance principle;

(3) updating the cluster center using the sample mean of each cluster;

(4) repeating steps (2) and (3) until the cluster centers no longer change;

(5) outputting the final clustering center and k cluster partitions;

Values that lie far from their cluster center are taken as abnormal data.
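The five steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the invention: the initial centers are simply the first k samples, distance is Euclidean, and "far away" is expressed through a caller-supplied `threshold`, which the source does not define.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain K-Means on a list of equal-length numeric tuples."""
    if k < 1 or len(points) < k:
        raise ValueError("need at least k points")
    # Step (1): choose initial centers (assumption: the first k samples).
    centers = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(max_iter):
        # Step (2): assign every sample to its nearest center.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step (3): recompute each center as the mean of its cluster.
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        # Step (4): stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    # Step (5): return the final centers and the cluster index of each sample.
    return centers, assignment


def find_anomalies(points, k, threshold):
    """Return the samples whose distance to their own cluster center exceeds threshold."""
    centers, assignment = kmeans(points, k)
    return [p for p, a in zip(points, assignment) if math.dist(p, centers[a]) > threshold]
```

For example, with hypothetical daily search counts `[(1.0,), (2.0,), (3.0,), (50.0,), (51.0,), (52.0,), (70.0,)]`, `k = 2` and `threshold = 10.0`, the sketch flags `(70.0,)`. In practice the threshold and the number of clusters would have to be tuned to the data.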

Prediction is performed using a collaborative filtering algorithm. The algorithm relies mainly on the degree of similarity between abnormal behaviors: when an abnormal behavior is highly similar to an abnormal behavior that the user already exhibits, it can be predicted as a possible abnormal behavior of that user. The prediction mechanism of the algorithm is shown in FIG. 5. User A has abnormal behaviors A and C, user B has abnormal behaviors A, B and C, and user C has only abnormal behavior A. Across these users, the similarity between abnormal behaviors A and C is therefore relatively high, while the similarity between abnormal behavior B and abnormal behaviors A and C is relatively low. Since user C has abnormal behavior A, and the similarity between abnormal behaviors A and C is high, it can be predicted that user C may also have abnormal behavior C. The main flow of the algorithm is generally consistent with content-based collaborative filtering.

The process of collecting user information is the same as in user-based collaborative filtering. The nearest-neighbor search of the algorithm targets the user's abnormal behaviors: a correlation calculation is used to find the behaviors closest to the abnormal behaviors of the user. The prediction list is then generated mainly from the resulting set of most similar behaviors.
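The example with users A, B and C can be worked through in code. The sketch below is a hedged Python illustration: the source speaks only of a "correlation calculation", so cosine similarity over the sets of users exhibiting each behavior is an assumption, as are the function names and the `user_A`/`user_B`/`user_C` labels.

```python
import math
from itertools import combinations


def behavior_similarity(user_behaviors):
    """Cosine similarity between behaviors, based on which users exhibit them."""
    users_by_behavior = {}
    for user, behaviors in user_behaviors.items():
        for b in behaviors:
            users_by_behavior.setdefault(b, set()).add(user)
    sim = {}
    for a, b in combinations(sorted(users_by_behavior), 2):
        shared = len(users_by_behavior[a] & users_by_behavior[b])
        score = shared / math.sqrt(len(users_by_behavior[a]) * len(users_by_behavior[b]))
        sim[(a, b)] = sim[(b, a)] = score
    return sim


def predict_behaviors(user, user_behaviors, top_n=1):
    """Rank behaviors the user has not shown by similarity to those the user has shown."""
    sim = behavior_similarity(user_behaviors)
    seen = user_behaviors[user]
    all_behaviors = set().union(*user_behaviors.values())
    scores = {
        candidate: sum(sim.get((candidate, s), 0.0) for s in seen)
        for candidate in all_behaviors - seen
    }
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores, key=lambda b: (-scores[b], b))
    return ranked[:top_n]


# The situation described in the text above.
observed = {
    "user_A": {"A", "C"},
    "user_B": {"A", "B", "C"},
    "user_C": {"A"},
}
print(predict_behaviors("user_C", observed))  # prints ['C']
```

Under these assumptions the similarity between behaviors A and C is 2/√6 ≈ 0.82, while the similarity between A and B is 1/√3 ≈ 0.58, so behavior C is predicted for user C ahead of behavior B, matching the conclusion in the text.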

The service layer comprises a data service, a behavior service, a user service, a portrait service and a forecast service. The main function of the service layer is to provide services to the outside world, and every request must have a route defined before it can be served. The handler defined for a route is reached through the specified URI, HTTP request method and route parameters. For example, when a client requests a URI with an HTTP GET, Laravel dispatches the request to the index method of the corresponding class, and the index method returns a response to the client.
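The dispatch idea can be shown with a small route table. The following Python sketch is only an analogy and does not reproduce Laravel's routing API; the controller class, the `/articles` URI, and the simplification of answering 404 for any undefined route are all assumptions for illustration.

```python
class ArticleController:
    """Hypothetical controller; `index` plays the role of the index method in the text."""

    def index(self, params):
        return {"status": 200, "body": "article list"}


# Route table: (HTTP method, URI) -> (controller instance, method name).
ROUTES = {
    ("GET", "/articles"): (ArticleController(), "index"),
}


def dispatch(method, uri, params=None):
    """Look up the route and call the matching controller method."""
    route = ROUTES.get((method.upper(), uri))
    if route is None:
        # No route defined: the request cannot be served (simplified to 404 here).
        return {"status": 404, "body": "route not defined"}
    controller, action = route
    return getattr(controller, action)(params or {})
```

A call such as `dispatch("GET", "/articles")` reaches `ArticleController.index`, while a request for a URI and method pair that is not in the table is rejected.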

The display layer is mainly responsible for displaying the results of the statistical analysis on the interface. It is the bridge for communication between the user and the system: on the one hand it provides the user with interactive tools, and on the other hand it implements some of the logic for displaying and submitting data, so as to coordinate the user's operations with the system. The front-end interface is developed with HTML, CSS and JavaScript, and jQuery Ajax is used to communicate with the controller (Controller) to complete the reading and writing of data.

Claims

1. A user reading behavior analysis and prediction system in a big data environment, characterized in that the system comprises: a text data correlation analysis unit, a user data correlation analysis unit, a data anomaly analysis unit and a user behavior prediction unit; wherein:

the user reading behavior analysis and prediction system under the big data environment can be divided into a user data storage layer, a user data processing layer, a user data analysis and modeling layer, a service layer and a display layer; the user data storage layer is used for storing information in MySql; the user data processing layer comprises source data acquisition, source data cleaning, data storage, data management and maintenance; the user data analysis and modeling layer comprises codes for text data correlation analysis, user data correlation analysis, data anomaly analysis and user behavior prediction; the service layer comprises a data service, a behavior service, a user service, a portrait service and a forecast service; the display layer is mainly responsible for displaying the result of the statistical analysis on an interface;

the text data correlation analysis unit performs multi-dimensional mining and study of the large amount of text data in a website, so that users can be served better; the text data analysis comprises text basic information, text portraits and text statistical information;

the basic information of the text comprises a title, an author, a year, a brief introduction, keywords, a price, a label, adding time and article classification;

the text portrait comprises search quantity, click quantity, reading quantity, comment quantity, praise quantity, collection quantity and exposure quantity;

the text statistical information comprises text search quantity ranking distribution, text search conversion rate distribution, text click quantity distribution, text reading quantity ranking distribution, text comment quantity ranking distribution, text praise quantity ranking distribution, text collection quantity ranking distribution, text exposure quantity ranking distribution, text reading user quantity distribution, text reading time distribution, text related keywords distribution, text label distribution, text classification distribution, keyword search quantity distribution, keyword search conversion rate distribution, keyword click quantity distribution, keyword affiliated classification distribution, keyword hit article distribution, search user ranking distribution and article classification distribution.

2. The big data environment user reading behavior analysis and prediction system of claim 1, wherein: the user data analysis includes user basic information, user portrayal and user statistical information.

3. The big data environment user reading behavior analysis and prediction system of claim 1, wherein: the user basic information comprises a user name, a name, an age, a gender, a contact address, a registered IP, a login place, an operator, adding time and latest operation time.

4. The big data environment user reading behavior analysis and prediction system of claim 1, wherein: the user portrait includes successful search volume, failed search volume, unchecked search volume, total click volume, total reading volume, comment volume, approval volume, and collection volume.

5. The big data environment user reading behavior analysis and prediction system of claim 1, wherein: the user statistical information comprises user search volume ranking distribution, user search conversion rate statistics, user click volume ranking distribution, user reading time period distribution, user comment volume ranking distribution, user approval volume ranking distribution, user collection volume ranking distribution, user registration time distribution, user access time distribution, user affiliated region distribution, user use operator distribution, user use time interval time distribution, user browsing conversion rate statistics, search click rate statistics and user label distribution.

6. The big data environment user reading behavior analysis and prediction system of claim 1, wherein: the data anomaly analysis unit is based on the user's normal behavior profile, in which user behavior shows a certain regularity overall together with local contingency; the data reflecting this contingency deviates from the user's usual behavior and is treated as anomalous data.

7. The big data environment user reading behavior analysis and prediction system of claim 1, wherein: the data anomaly analysis comprises data anomaly basic information and data anomaly statistical information.

8. The big data environment user reading behavior analysis and prediction system of claim 1, wherein: the basic information of the data exception comprises a serial number, a name, a content introduction, a keyword, a type, an exception time, a user, a place and a search IP.

9. The big data environment user reading behavior analysis and prediction system of claim 1, wherein: the data anomaly statistical information comprises distribution of violation hit keywords, distribution of user ip anomalies, distribution of comment content violations, distribution of user search vocabulary anomalies, distribution of user search quantity anomalies, distribution of user click quantity anomalies, distribution of user reading time period anomalies, distribution of user comment quantity anomalies, distribution of user approval quantity anomalies, distribution of user collection quantities anomalies and distribution of user access time period anomalies.

10. The big data environment user reading behavior analysis and prediction system of claim 1, wherein: the user behavior prediction unit performs statistical analysis of the various factors that influence the user and carries out modeling based on the characteristics found; user behavior features are then selected to construct a user behavior prediction model; the main prediction indexes are user search word prediction, user search word anomaly prediction, user search behavior frequency anomaly prediction, user search article anomaly prediction, user article click anomaly prediction, user article reading prediction and user article reading anomaly prediction.