CN116506300A - Website traffic data statistics method and system - Google Patents

Website traffic data statistics method and system

Info

Publication number
CN116506300A
CN116506300A (application CN202310264141.3A)
Authority
CN
China
Prior art keywords
data
log
information
processing
website
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310264141.3A
Other languages
Chinese (zh)
Inventor
徐黎
沈程
孙婉琪
郭伟杰
王天放
刘敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202310264141.3A priority Critical patent/CN116506300A/en
Publication of CN116506300A publication Critical patent/CN116506300A/en
Pending legal-status Critical Current


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/22Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks comprising specially adapted graphical user interfaces [GUI]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0876Network utilisation, e.g. volume of load or congestion level
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The website traffic data statistics method comprises the following steps: adding data statistics code and a data transmission address to the pages of a website using JavaScript embedded-point code, and creating, recording and collecting user access behavior information; processing the information received by the embedded-point server, recording it in log form, and sending the collected log data, together with the recorded header information and browser information, to a downstream Flume log framework; performing offline batch processing and real-time stream processing on the log stream respectively; storing the offline batch logs in Hive after MapReduce data cleaning; computing according to business logic with computation tools such as ETL and Storm, and outputting the processed log data; splitting the log data into several tables as actually needed and persisting them to a database; and visualizing the data, displaying the traffic data through chart components. A website traffic data statistics system is also provided. The method combines the offline and real-time data processing modes, ensuring both the accuracy and the timeliness of website traffic data.

Description

Website traffic data statistics method and system
Technical Field
The invention relates to a website traffic data statistics method and system.
Background
Website traffic statistics is one of the important means of improving website operation. By capturing a user's behavior path through a website, the preferences of user groups, the popularity of website content and problems with website pages can be analyzed, so that the pages can be improved, user experience enhanced, and conversion targeted more precisely. Common website traffic statistics approaches fall into two categories: 1. embedding points, monitoring and analyzing on the website's own server; 2. adopting a website traffic data statistics service provided by a third party. Approach 1 obtains user data more flexibly, but the figures are less persuasive when presented to third parties; for data security reasons the monitoring party cannot be matched against an external data sample, which reduces the accuracy of traffic attribute analysis, and the approach is generally used for offline data processing and lacks immediate access to and monitoring of real-time data. Approach 2 usually requires paid service, and its functionality is limited by the feature planning of the third-party website, which constrains enterprises that need to customize scenarios individually.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a website traffic data statistics method and system. The website traffic data statistics method builds a clickstream model from the user's behavior on the website, so that the website can be improved in a more targeted way, and comprises the following steps:
(1) Adding data statistics code and a data transmission address to the pages of a website using JavaScript embedded-point code, and creating, recording and collecting user access behavior information;
(2) Processing the information received by the embedded-point server, recording it in log form, and sending the collected log data, together with the recorded header information and browser information, to a downstream Flume log framework;
(3) Performing offline batch processing and real-time stream processing on the log stream respectively;
(4) Storing the offline batch logs in Hive after MapReduce data cleaning;
(5) Computing according to business logic with computation tools such as ETL and Storm, and outputting the processed log data;
(6) Splitting the log data into several tables as actually needed and persisting them to a database;
(7) Visualizing the data, displaying the traffic data through chart components.
Step (1) specifically comprises:
first, the embedded points are implemented in js code: a dedicated js script is written to record user characteristic information such as page UV, cookie and session, and the code is embedded into the web pages for which log analysis is required.
Step (2) specifically comprises: since the basic user information recorded by the embedded-point code may contain redundant or missing information, a log-collecting server is written to preprocess the logs, including URL transcoding, unifying the log format and obtaining the IP; a LogServlet class inheriting from HttpServlet is defined, the doGet or doPost method is overridden, a unified log format is defined, the user information is completed, and the traffic indicator information is sent to the back-end server. In addition, to enrich data acquisition, the server adds a logging component: by defining an @AopLog annotation, the component creates a unique LogData object per thread; the LogData object standardizes a fixed log printing format within the application and can record not only interface parameters but also intermediate parameters in the service method, performing log processing by weaving an aspect into the interface. The log server is then connected to Flume; through the log4j properties configuration, the Flume port number, address and other information are configured so that the logs are printed both to Flume and to the console.
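A minimal sketch of such a log-collecting servlet is given below. It is illustrative only: the '|'-separated log line layout and the request parameter names uv, ss and url are assumptions, not part of this description.

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.log4j.Logger;

// Sketch of a log-collecting servlet: completes user information and writes
// one unified log line per request. Field layout is an illustrative assumption.
public class LogServlet extends HttpServlet {
    private static final Logger logger = Logger.getLogger(LogServlet.class);

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // Complete the user information: the client IP may sit behind a proxy header.
        String ip = req.getHeader("X-Forwarded-For");
        if (ip == null || ip.isEmpty()) {
            ip = req.getRemoteAddr();
        }
        // Unify the log format: fixed field order, one record per line.
        String record = String.join("|",
                String.valueOf(System.currentTimeMillis()),
                ip,
                req.getParameter("uv"),   // visitor id set by the embedded-point js
                req.getParameter("ss"),   // session id set by the embedded-point js
                req.getParameter("url")); // visited page, url-encoded by the js
        // log4j is configured with a Flume appender, so this line reaches Flume and the console.
        logger.info(record);
        resp.setStatus(HttpServletResponse.SC_NO_CONTENT);
    }

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        doGet(req, resp);
    }
}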
Step (3) comprises: creating weblog.conf under the Flume data directory, and configuring the corresponding acquisition source, the sink target, namely the HDFS file system, and the channel between them in file-transfer mode; when new files appear, the offline data coming from log4j is stored into HDFS grouped by time, while the real-time statistical data is delivered to a Kafka producer.
In step (4): offline data cleaning is performed by writing a MapReduce program; a task scheduling module is customized so that the data is processed at fixed times, and the data is partitioned, sorted, combined and grouped into data segments. Because the processing must be repeated on a fixed time schedule, the task scheduling module manages and schedules the MapReduce processing in a unified way. Finally the program is packaged into a jar, uploaded to Linux and run to execute the data cleaning flow; the cleaned data is imported into Hive through an import command and partitioned by time;
for the data that needs real-time processing, stream computation is performed with Storm. Storm processes continuously generated data streams very quickly, but because the streams are not generated uniformly, Kafka is introduced so that the data is delivered evenly to the topic subscribed by Storm before subsequent processing.
Step (5) specifically comprises: analyzing the Hive data with ETL; after extraction, cleaning and transformation, the ETL loads the business system data into the data warehouse; the log data is integrated through ETL and, according to the business logic, provides the basis for customer decision analysis; for example, the users' home regions are queried and the region with the highest total traffic is computed.
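As a minimal sketch of the "region with the highest total traffic" computation of step (5), the aggregation could be issued against Hive over JDBC as below. The table name traffic_log, its columns and the connection details are assumptions for illustration; they are not given by this description.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: query Hive for the region with the highest total traffic.
// Table name, column names and connection details are illustrative assumptions.
public class TopRegionQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, COUNT(*) AS pv "
                   + "FROM traffic_log "
                   + "GROUP BY region "
                   + "ORDER BY pv DESC "
                   + "LIMIT 1")) {
            if (rs.next()) {
                System.out.println("Top region: " + rs.getString("region")
                        + ", pv = " + rs.getLong("pv"));
            }
        }
    }
}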
In step (6): the calculated indicators are persisted to a database; this persistence step is implemented with the data export tool Sqoop.
In step (7): the data visualization page is implemented with ECharts, and a corresponding web program is custom-developed.
The system for implementing the above website traffic data statistics method comprises:
the website page marking module, used for adding data statistics code and a data transmission address to the pages of a website with JavaScript embedded-point code, and for creating, recording and collecting user access behavior information;
the log recording and sending module, used for processing the information received by the embedded-point server, recording it in log form, and sending the collected log data, together with the recorded header information and browser information, to the downstream Flume log framework;
the log stream processing module, used for performing offline batch processing and real-time stream processing on the log stream respectively;
the offline batch log data cleaning module, used for cleaning the offline batch logs with MapReduce and storing them into Hive;
the log data processing module, used for computing according to business logic with the ETL and Storm computation tools and outputting the processed log data;
the log data table-splitting and persistence module, used for splitting the log data into several tables as actually needed and persisting them to the database;
and the data visualization module, used for data visualization, displaying the traffic data through chart components.
The invention also includes a computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and the processor, when executing the executable code, implements the method of the invention.
The working principle of the invention is as follows:
the flow acquisition technology is an important source for monitoring network flow, and in order to effectively analyze the page flow in a complex enterprise website page, the invention mainly comprises three parts, wherein the first part is to record the whole user access process, such as user page browsing and clicking data of user IP addresses, browser information, operating system information, stay time, session records and the like; recording user data in the running process of the server through a log component of the code; the second part is a recorded data processing part which comprises the parallel cleaning, filtering and log file sorting of the offline flow data and the real-time flow data respectively; the third part is an analysis part for the effective data, comprising definition indexes, and acquiring service indexes such as online number of people, weekly browsing amount and the like from a database. The invention processes data based on a Flume log collection framework and a MapReduce parallel program design framework, and utilizes a Java Spring development framework to combine with Hive and SQLLite2 databases so as to complete the statistics of the whole flow data of website data recording, monitoring and analysis.
Addressing the defects of the prior art, the invention builds a clickstream model from the user's behavior on the website and performs traffic statistics for the different data types, so that the website can be improved in a more targeted way. The method mainly involves four modules: the data acquisition module, the data cleaning module, the data processing module and the data display module. It consumes little hardware, processes data promptly and reliably, supports big data processing and scales well.
The data acquisition module is implemented by combining the website's Nginx log files with the embedded-point JavaScript statistics code added to the website pages; the data is processed uniformly and sent to the log server for handling. Logs are collected by a Flume-configured agent and sent through the Sink component to the message middleware Kafka and to HDFS respectively, so that real-time log data and offline log data are handled separately; according to data type they are divided into offline data and real-time data and processed by different data processing modules. Data cleaning unifies the data into a regular, formatted structure; the MapReduce key-value pairs are used to partition, group and sort the data; the offline data is stored in the Hive data warehouse, and business indicators and similar data are finally stored into the database through the ETL computation tool. The real-time data is likewise stored into the database after Storm real-time computation. The results are finally displayed through the data visualization interface. Combining the offline and real-time data processing modes ensures both the accuracy and the timeliness of the website traffic data.
The system can be applied to big-data website traffic statistics scenarios. It integrates the logging component with the data, decouples the big data modules from the front-end application server processing module, relieves the access pressure on the application server, and ensures the accuracy and timeliness of website traffic analysis.
The advantages of the invention are: functionally, it integrates the Storm big data processing framework with the mature MapReduce computation framework, ensuring the feasibility of analyzing offline data and real-time data in parallel and achieving high system availability; at the same time, effective website data can reach operators immediately so that the corresponding operation strategies can be adjusted promptly; development and maintenance costs are low, and data reliability is ensured to a certain extent.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the modules of the system of the present invention.
FIG. 3 is a schematic flow chart of the system of the present invention.
FIG. 4 is a framework diagram of the computing device of the system of the present invention.
FIG. 5 is a diagram of the application server system architecture of the present invention.
Detailed Description
The examples below are illustrative and are not to be construed as limiting the invention, whose scope is defined by the claims and includes, but is not limited to, the specific exemplary embodiments described herein. Any variation, replacement or modification that a person skilled in the art makes to the website traffic data statistics method and system of the present invention for different business scenarios or business indicators, such as the selection of pre-trained word vectors or of the word segmentation algorithm, falls within the scope of the present invention.
Example 1
Referring to FIGS. 1, 3, 4 and 5, a website traffic data statistics method comprises the following specific steps:
(1) For a website page, embedded points are implemented in js code, and data statistics code and data transmission code are added to judge, create, record and send page traffic data and user data. The js embedded-point code is as follows:
/* main function */
function a1_main(){
    var dest_path = "<data reception address>";      // where the statistics are sent
    var expire_time = "<session timeout duration>";
    // handle cookie
    // handle session
    // handle domain name
}
(a) Different cookies are set according to an operation flag: 0 indicates a session-level cookie with no timeout set; the cookie information is kept in the browser's memory and disappears when the browser is closed; 1 indicates that the timeout is set to 10 years from now, and the cookie is kept in the browser's temporary folder until the timeout arrives or the user clears it manually; 2 indicates that the timeout is set to 1 hour, and the cookie is kept in the browser's temporary folder until the timeout arrives or the user clears it manually.
(b) The uv_id value: if uv_id is null, an id is assigned to this new uv as a random number of length 20, uv_id=get_random(20), and the cookie is set with a save time of 10 years, set_cookie("uv", uv_id, 1); if the cookie value uv is not null, uv_id is read from it;
(c) The session is obtained: if no ss value exists in the cookie, this is a new session and ss_id=get_random(10), of length 10, is generated randomly; the value is spliced in the format 'session number_number of visits in the session_client time_website'; if an ss value does exist, the session is checked for timeout: if it has timed out, the session id is regenerated and the number of page visits in the session is reset to 0; otherwise the number of page visits is incremented by 1.
(d) Besides handling cookies, the js embedded point also collects other information such as browser details, and the embedded-point information is passed into the log framework. The following code is called when the page loads:
window.onload = function(){
    a1_main();
};
(2) The log framework receives and processes the information from the embedded-point server: the recorded embedded-point information, such as user information, page information and browser information, is sent to the downstream Flume log framework.
(a) Because the embedded data may contain redundant or missing information, a log-collecting server is written to preprocess the logs, including URL transcoding, unifying the log format and obtaining the IP; a LogServlet class inheriting from HttpServlet is defined, the doGet or doPost method is overridden, a unified log format is defined, and the user information is completed;
(b) In addition, to track process information while pages are browsed, the server adds a logging component. By defining an @AopLog annotation, the component creates a unique LogData object per thread; the LogData object specifies a fixed log printing format within the application and can record not only the interface parameters but also intermediate parameters in the service method. Log processing is performed by weaving an aspect into the interface and writing an aspect processor. The log server is then connected to Flume, and through the log4j properties configuration the logs are printed to Flume and to the console respectively:
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=<Flume address>
log4j.appender.flume.Port=<Flume port number>
log4j.appender.flume.UnsafeMode=true
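A minimal Spring AOP sketch of the logging component described in (b) follows. The @AopLog annotation name comes from the description, while the aspect class, the pointcut binding and the content recorded in LogData are assumptions for illustration.

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import org.apache.log4j.Logger;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

// Marker annotation placed on service methods whose process parameters should be logged.
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
@interface AopLog {}

// Aspect woven into annotated methods; keeps one log buffer ("LogData") per thread.
@Aspect
@Component
class AopLogAspect {
    private static final Logger logger = Logger.getLogger(AopLogAspect.class);
    // One buffer per thread, standing in for the per-thread LogData object.
    private static final ThreadLocal<StringBuilder> LOG_DATA =
            ThreadLocal.withInitial(StringBuilder::new);

    @Around("@annotation(aopLog)")
    public Object around(ProceedingJoinPoint pjp, AopLog aopLog) throws Throwable {
        StringBuilder logData = LOG_DATA.get();
        logData.setLength(0);
        // Record the interface parameters in a fixed format.
        logData.append(pjp.getSignature().toShortString())
               .append(" args=").append(java.util.Arrays.toString(pjp.getArgs()));
        try {
            Object result = pjp.proceed();
            logData.append(" result=").append(result);
            return result;
        } finally {
            // Printed through log4j, so the Flume appender above also receives it.
            logger.info(logData.toString());
            LOG_DATA.remove();
        }
    }
}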
(3) Performing offline batch processing and real-time stream processing on the log stream respectively;
(a) Offline batch processing: weblog.conf is created under the Flume data directory, the corresponding source (data source), channel (connecting source and sink) and sink (data target) are configured, and the offline data arriving from log4j is stored into HDFS grouped by time:
a1.sources=r1
a1.channels=c1
a1.sinks=k1
a1.sources.r1.type=avro
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=44444
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs address
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.rollInterval=40
a1.sinks.k1.hdfs.rollSize=0
a1.sinks.k1.hdfs.rollCount=2000
a1.channels.c1.type=memory
a1.channels.c1.capacity=2000
a1.channels.c1.transactionCapacity=200
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
(b) Data cleaning: MapReduce processes the data line by line, and a custom MapReduce program preprocesses the collected raw log data, performing cleaning, format normalization and dirty data filtering. The file data is first read, records that do not meet the requirements are filtered out, and the remaining data is combed into clickstream model data. The preprocessed data is then imported into Hive. Because the processing must be repeated on a fixed time schedule, for example processing each day's data at a scheduled time, a task scheduling module is added to manage the scheduling of the task units.
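A minimal sketch of such a cleaning mapper is given below; the '|'-separated input layout and the field positions are assumptions used only for illustration, not the program of the invention.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the cleaning step: drop malformed records, re-emit well-formed ones
// in a fixed field order so later stages can partition, sort and group them.
public class WeblogCleanMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\\|");
        // Expected (assumed) layout: timestamp | ip | uv_id | session_id | url
        if (fields.length < 5) {
            return;                      // dirty record: missing fields
        }
        String ip = fields[1].trim();
        String url = fields[4].trim();
        if (ip.isEmpty() || url.isEmpty()) {
            return;                      // dirty record: empty key fields
        }
        // Emit the cleaned record; a reducer (not shown) groups it into clickstream data.
        context.write(new Text(String.join("\t", fields)), NullWritable.get());
    }
}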
(c) When the data in HDFS is loaded into the Hive tables, it is partitioned by time. Various statistical results are then obtained through the ETL analysis tools, persisted into the MySQL database, and finally visualized with ECharts, so that operation decision makers can obtain the data conveniently and understand it more simply and quickly.
(4) Real-time stream processing:
(a) A Flume agent is added on the node where the website is deployed to collect the real-time records, and the delivery target address is set in the Flume Sink component. The real-time records are delivered to the Kafka cluster by filling in the information pointing to the Kafka cluster in the Sink component; Flume is configured as follows:
agent.sources=r1
agent.sinks=k1
agent.channels=c1
#Describe/configure the source
agent.sources.r1.type=netcat
agent.sources.r1.bind=192.168.223.128
agent.sources.r1.port=8888
#Describe the sink
agent.sinks.k1.type=org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.kafka.bootstrap.servers=192.168.223.128:9092
agent.sinks.k1.kafka.topic=log4j-flume-kafka
agent.sinks.k1.serializer.class=kafka.serializer.StringEncoder
agent.sinks.k1.kafka.producer.acks=1
agent.sinks.k1.custom.encoding=UTF-8
agent.channels.c1.type=memory
agent.channels.c1.capacity=1000
agent.channels.c1.transactionCapacity=100
# bind source and sink to channel
# configuration transmission channel
agent.sources.r1.channels=c1
agent.sinks.k1.channel=c1
The collected real-time records are stored on the Kafka producer side.
(b) The data to be consumed in the Kafka cluster is transported to the Storm cluster through the KafkaSpout. Storm is a real-time, distributed computing system with high fault tolerance, and like Hadoop it can also process large batches of data. After the data enters the Storm cluster, the streams are grouped through Storm's real-time computation model; if a stream is field-grouped on a field named "ip", all tuples containing the same "ip" are dispatched to the same task, which ensures consistency of message processing. The corresponding computations are performed according to the business indicators, and the computed results are persisted into a DB; MySQL or Redis can generally be used for the data persistence.
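A minimal Storm topology sketch under assumptions is shown below; the Kafka address and topic match the Flume configuration above, while the log line layout, the bolt logic and the parallelism values are illustrative only, not the topology of the invention.

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Sketch of the real-time path: KafkaSpout -> parse bolt -> per-ip counting bolt.
public class TrafficTopology {

    // Extracts the ip field from the raw '|'-separated log line (layout is an assumption).
    public static class ParseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String line = input.getStringByField("value");
            String[] fields = line.split("\\|");
            if (fields.length >= 2) {
                collector.emit(new Values(fields[1]));   // the ip field
            }
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("ip"));
        }
    }

    // Counts hits per ip; fieldsGrouping("ip") routes the same ip to the same task.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<>();
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String ip = input.getStringByField("ip");
            counts.merge(ip, 1L, Long::sum);
            // In the real system the counts would be persisted to MySQL or Redis here.
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout<>(
                KafkaSpoutConfig.builder("192.168.223.128:9092", "log4j-flume-kafka").build()), 1);
        builder.setBolt("parse", new ParseBolt(), 2).shuffleGrouping("kafka-spout");
        builder.setBolt("count", new CountBolt(), 2).fieldsGrouping("parse", new Fields("ip"));

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("traffic-topology", new Config(), builder.createTopology());
        Thread.sleep(60_000);   // let the local topology run for a minute, then stop
        cluster.shutdown();
    }
}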
Example 2
Referring to FIGS. 2 to 5, this embodiment relates to a system implementing the website traffic data statistics method of embodiment 1, comprising:
the website page marking module, used for adding data statistics code and a data transmission address to the pages of a website with JavaScript embedded-point code, and for creating, recording and collecting user access behavior information;
the log recording and sending module, used for processing the information received by the embedded-point server, recording it in log form, and sending the collected log data, together with the recorded header information and browser information, to the downstream Flume log framework;
the log stream processing module, used for performing offline batch processing and real-time stream processing on the log stream respectively;
the offline batch log data cleaning module, used for cleaning the offline batch logs with MapReduce and storing them into Hive;
the log data processing module, used for computing according to business logic with the ETL and Storm computation tools and outputting the processed log data;
the log data table-splitting and persistence module, used for splitting the log data into several tables as actually needed and persisting them to the database;
and the data visualization module, used for data visualization, displaying the traffic data through chart components.
Example 3
The invention also includes a computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and the processor, when executing the executable code, implements the website traffic data statistics method of embodiment 1.

Claims (10)

1. A website traffic data statistics method, characterized by building a clickstream model from the user's behavior on the website so that the website can be improved in a more targeted way, and comprising the following steps:
(1) Adding data statistics code and a data transmission address to the pages of a website using JavaScript embedded-point code, and creating, recording and collecting user access behavior information;
(2) Processing the information received by the embedded-point server, recording it in log form, and sending the collected log data, together with the recorded header information and browser information, to a downstream Flume log framework;
(3) Performing offline batch processing and real-time stream processing on the log stream respectively;
(4) Storing the offline batch logs in Hive after MapReduce data cleaning;
(5) Computing according to business logic with computation tools such as ETL and Storm, and outputting the processed log data;
(6) Splitting the log data into several tables as actually needed and persisting them to a database;
(7) Visualizing the data, displaying the traffic data through chart components.
2. The website traffic data statistics method according to claim 1, wherein step (1) specifically comprises:
first, the embedded points are implemented in js code: a dedicated js script is written to record user characteristic information such as page UV, cookie and session, and the code is embedded into the web pages for which log analysis is required.
3. The website traffic data statistics method according to claim 1, wherein step (2) specifically comprises: since the basic user information recorded by the embedded-point code may contain redundant or missing information, a log-collecting server is written to preprocess the logs, including URL transcoding, unifying the log format and obtaining the IP; a LogServlet class inheriting from HttpServlet is defined, the doGet or doPost method is overridden, a unified log format is defined, the user information is completed, and the traffic indicator information is sent to the back-end server; in addition, to enrich data acquisition, the server adds a logging component, which creates a unique LogData object per thread by defining an @AopLog annotation; the LogData object standardizes a fixed log printing format within the application and can record not only interface parameters but also intermediate parameters in the service method, performing log processing by weaving an aspect into the interface; the log server is connected with Flume, and through the log4j properties configuration the Flume port number, address and other information are configured so that the logs are printed to Flume and to the console respectively.
4. The website traffic data statistics method according to claim 1, wherein step (3) comprises: creating weblog.conf under the Flume data directory, and configuring the corresponding acquisition source, the sink target, namely the HDFS file system, and the channel between them in file-transfer mode; when new files appear, the offline data coming from log4j is stored into HDFS grouped by time, while the real-time statistical data is delivered to a Kafka producer.
5. The website traffic data statistics method according to claim 1, wherein in step (4): offline data cleaning is performed by writing a MapReduce program; a task scheduling module is customized so that the data is processed at fixed times, and the data is partitioned, sorted, combined and grouped into data segments; because the processing must be repeated on a fixed time schedule, the task scheduling module manages and schedules the MapReduce processing in a unified way; finally the program is packaged into a jar, uploaded to Linux and run to execute the data cleaning flow; the cleaned data is imported into Hive through an import command and partitioned by time;
for the data that needs real-time processing, stream computation is performed with Storm; Storm processes continuously generated data streams very quickly, but because the streams are not generated uniformly, Kafka is introduced so that the data is delivered evenly to the topic subscribed by Storm before subsequent processing.
6. The website traffic data statistics method according to claim 1, wherein step (5) specifically comprises: analyzing the Hive data with ETL; after extraction, cleaning and transformation, the ETL loads the business system data into the data warehouse; the log data is integrated through ETL and, according to the business logic, provides the basis for customer decision analysis; the users' home regions are queried and the region with the highest total traffic is computed.
7. The website traffic data statistics method according to claim 1, wherein in step (6): the calculated indicators are persisted to a database, and the persistence step is implemented with the data export tool Sqoop.
8. The website traffic data statistics method according to claim 1, wherein in step (7): the data visualization page is implemented with ECharts, and a corresponding web program is custom-developed.
9. A system for implementing the website traffic data statistics method as recited in claim 1, comprising:
the website page marking module, used for adding data statistics code and a data transmission address to the pages of a website with JavaScript embedded-point code, and for creating, recording and collecting user access behavior information;
the log recording and sending module, used for processing the information received by the embedded-point server, recording it in log form, and sending the collected log data, together with the recorded header information and browser information, to the downstream Flume log framework;
the log stream processing module, used for performing offline batch processing and real-time stream processing on the log stream respectively;
the offline batch log data cleaning module, used for cleaning the offline batch logs with MapReduce and storing them into Hive;
the log data processing module, used for computing according to business logic with the ETL and Storm computation tools and outputting the processed log data;
the log data table-splitting and persistence module, used for splitting the log data into several tables as actually needed and persisting them to the database;
and the data visualization module, used for data visualization, displaying the traffic data through chart components.
10. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-8.
CN202310264141.3A 2023-03-10 2023-03-10 Website traffic data statistics method and system Pending CN116506300A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310264141.3A CN116506300A (en) 2023-03-10 2023-03-10 Website traffic data statistics method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310264141.3A CN116506300A (en) 2023-03-10 2023-03-10 Website traffic data statistics method and system

Publications (1)

Publication Number Publication Date
CN116506300A true CN116506300A (en) 2023-07-28

Family

ID=87329289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310264141.3A Pending CN116506300A (en) 2023-03-10 2023-03-10 Website traffic data statistics method and system

Country Status (1)

Country Link
CN (1) CN116506300A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117290256A (en) * 2023-11-24 2023-12-26 北京中指实证数据信息技术有限公司 Code burying method for counting user behaviors

Similar Documents

Publication Publication Date Title
CN107577805B (en) Business service system for log big data analysis
US9817867B2 (en) Dynamically processing an event using an extensible data model
US8938534B2 (en) Automatic provisioning of new users of interest for capture on a communication network
US9082127B2 (en) Collecting and aggregating datasets for analysis
US11924240B2 (en) Mechanism for identifying differences between network snapshots
US10191962B2 (en) System for continuous monitoring of data quality in a dynamic feed environment
CN105824744A (en) Real-time log collection and analysis method on basis of B2B (Business to Business) platform
US9058323B2 (en) System for accessing a set of communication and transaction data associated with a user of interest sourced from multiple different network carriers and for enabling multiple analysts to independently and confidentially access the set of communication and transaction data
US10826803B2 (en) Mechanism for facilitating efficient policy updates
CN104426713A (en) Method and device for monitoring network site access effect data
US10044820B2 (en) Method and system for automated transaction analysis
CN108052679A (en) A kind of Log Analysis System based on HADOOP
CN108228322B (en) Distributed link tracking and analyzing method, server and global scheduler
CN105515836A (en) Log processing method, device and server
CN112181931A (en) Big data system link tracking method and electronic equipment
CA3119167A1 (en) Approach for a controllable trade-off between cost and availability of indexed data in a cloud log aggregation solution such as splunk or sumo
CN113360554A (en) Method and equipment for extracting, converting and loading ETL (extract transform load) data
CN111970195A (en) Data transmission method and streaming data transmission system
CN116506300A (en) Website traffic data statistics method and system
CN106559498A (en) Air control data collection platform and its collection method
CN102055620B (en) Method and system for monitoring user experience
CN105468502A (en) Log collection method, device and system
CN115391429A (en) Time sequence data processing method and device based on big data cloud computing
CN114971714A (en) Accurate customer operation method based on big data label and computer equipment
CN114417796A (en) Dynamic report statistical method and system based on equipment sampling points

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination