CN116506300A - Website traffic data statistics method and system - Google Patents

Website traffic data statistics method and system

Info

Publication number
CN116506300A
CN116506300A (application CN202310264141.3A)
Authority
CN
China
Prior art keywords
data
log
information
processing
website
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310264141.3A
Other languages
Chinese (zh)
Inventor
徐黎
沈程
孙婉琪
郭伟杰
王天放
刘敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202310264141.3A priority Critical patent/CN116506300A/en
Publication of CN116506300A publication Critical patent/CN116506300A/en
Pending legal-status Critical Current


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/22Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks comprising specially adapted graphical user interfaces [GUI]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0876Network utilisation, e.g. volume of load or congestion level
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The website traffic data statistics method comprises the following steps: adding data statistics code and a data transmission address to the pages of a website using JavaScript embedded-point code, and creating, recording and collecting user access behavior information; processing the information received by the embedded-point server, recording it in log form, and sending the collected log data, together with the recorded header information and browser information, to a downstream Flume log framework; performing offline batch processing and real-time stream processing on the log stream respectively; storing the offline batch logs in Hive after MapReduce data cleaning; computing according to business logic with computation tools such as ETL and Storm, and outputting the processed log data; splitting the log data into several tables as actually needed and persisting them to a database; and visualizing the data, displaying the traffic data through chart components. A website traffic data statistics system is also provided. The method combines the offline and real-time data processing modes, ensuring both the accuracy and the timeliness of website traffic data.

Description

Website traffic data statistics method and system
Technical Field
The invention relates to a website traffic data statistics method and system.
Background
Website traffic statistics is one of the important means of improving website operation. By capturing a user's behavior path through a website, the preferences of user groups, the popularity of website content and problems with website pages can be analyzed, so that the pages can be improved, user experience enhanced, and conversion targeted more precisely. Common website traffic statistics approaches fall into two categories: 1. embedding points, monitoring and analyzing on the website's own server; 2. adopting a website traffic data statistics service provided by a third party. Approach 1 obtains user data more flexibly, but the figures are less persuasive when presented to third parties; for data security reasons the monitoring party cannot be matched against an external data sample, which reduces the accuracy of traffic attribute analysis, and the approach is generally used for offline data processing and lacks immediate access to and monitoring of real-time data. Approach 2 usually requires paid service, and its functionality is limited by the feature planning of the third-party website, which constrains enterprises that need to customize scenarios individually.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a website traffic data statistics method and system. The website traffic data statistics method builds a clickstream model from the user's behavior on the website, so that the website can be improved in a more targeted way, and comprises the following steps:
(1) Adding data statistics code and a data transmission address to the pages of a website using JavaScript embedded-point code, and creating, recording and collecting user access behavior information;
(2) Processing the information received by the embedded-point server, recording it in log form, and sending the collected log data, together with the recorded header information and browser information, to a downstream Flume log framework;
(3) Performing offline batch processing and real-time stream processing on the log stream respectively;
(4) Storing the offline batch logs in Hive after MapReduce data cleaning;
(5) Computing according to business logic with computation tools such as ETL and Storm, and outputting the processed log data;
(6) Splitting the log data into several tables as actually needed and persisting them to a database;
(7) Visualizing the data, displaying the traffic data through chart components.
Step (1) specifically comprises:
first, the embedded points are implemented in js code: a dedicated js script is written to record user characteristic information such as page UV, cookie and session, and the code is embedded into the web pages for which log analysis is required.
Step (2) specifically comprises: since the basic user information recorded by the embedded-point code may contain redundant or missing information, a log-collecting server is written to preprocess the logs, including URL transcoding, unifying the log format and obtaining the IP; a LogServlet class inheriting from HttpServlet is defined, the doGet or doPost method is overridden, a unified log format is defined, the user information is completed, and the traffic indicator information is sent to the back-end server. In addition, to enrich data acquisition, the server adds a logging component: by defining an @AopLog annotation, the component creates a unique LogData object per thread; the LogData object standardizes a fixed log printing format within the application and can record not only interface parameters but also intermediate parameters in the service method, performing log processing by weaving an aspect into the interface. The log server is then connected to Flume; through the log4j properties configuration, the Flume port number, address and other information are configured so that the logs are printed both to Flume and to the console.
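A minimal sketch of such a log-collecting servlet is given below. It is illustrative only: the '|'-separated log line layout and the request parameter names uv, ss and url are assumptions, not part of this description.

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.log4j.Logger;

// Sketch of a log-collecting servlet: completes user information and writes
// one unified log line per request. Field layout is an illustrative assumption.
public class LogServlet extends HttpServlet {
    private static final Logger logger = Logger.getLogger(LogServlet.class);

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // Complete the user information: the client IP may sit behind a proxy header.
        String ip = req.getHeader("X-Forwarded-For");
        if (ip == null || ip.isEmpty()) {
            ip = req.getRemoteAddr();
        }
        // Unify the log format: fixed field order, one record per line.
        String record = String.join("|",
                String.valueOf(System.currentTimeMillis()),
                ip,
                req.getParameter("uv"),   // visitor id set by the embedded-point js
                req.getParameter("ss"),   // session id set by the embedded-point js
                req.getParameter("url")); // visited page, url-encoded by the js
        // log4j is configured with a Flume appender, so this line reaches Flume and the console.
        logger.info(record);
        resp.setStatus(HttpServletResponse.SC_NO_CONTENT);
    }

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        doGet(req, resp);
    }
}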
Step (3) comprises: creating weblog.conf under the Flume data directory, and configuring the corresponding acquisition source, the sink target, namely the HDFS file system, and the channel between them in file-transfer mode; when new files appear, the offline data coming from log4j is stored into HDFS grouped by time, while the real-time statistical data is delivered to a Kafka producer.
In step (4): offline data cleaning is performed by writing a MapReduce program; a task scheduling module is customized so that the data is processed at fixed times, and the data is partitioned, sorted, combined and grouped into data segments. Because the processing must be repeated on a fixed time schedule, the task scheduling module manages and schedules the MapReduce processing in a unified way. Finally the program is packaged into a jar, uploaded to Linux and run to execute the data cleaning flow; the cleaned data is imported into Hive through an import command and partitioned by time;
for the data that needs real-time processing, stream computation is performed with Storm. Storm processes continuously generated data streams very quickly, but because the streams are not generated uniformly, Kafka is introduced so that the data is delivered evenly to the topic subscribed by Storm before subsequent processing.
Step (5) specifically comprises: analyzing the Hive data with ETL; after extraction, cleaning and transformation, the ETL loads the business system data into the data warehouse; the log data is integrated through ETL and, according to the business logic, provides the basis for customer decision analysis; for example, the users' home regions are queried and the region with the highest total traffic is computed.
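As a minimal sketch of the "region with the highest total traffic" computation of step (5), the aggregation could be issued against Hive over JDBC as below. The table name traffic_log, its columns and the connection details are assumptions for illustration; they are not given by this description.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: query Hive for the region with the highest total traffic.
// Table name, column names and connection details are illustrative assumptions.
public class TopRegionQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, COUNT(*) AS pv "
                   + "FROM traffic_log "
                   + "GROUP BY region "
                   + "ORDER BY pv DESC "
                   + "LIMIT 1")) {
            if (rs.next()) {
                System.out.println("Top region: " + rs.getString("region")
                        + ", pv = " + rs.getLong("pv"));
            }
        }
    }
}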
In step (6): the calculated indicators are persisted to a database; this persistence step is implemented with the data export tool Sqoop.
In step (7): the data visualization page is implemented with ECharts, and a corresponding web program is custom-developed.
The system for implementing the above website traffic data statistics method comprises:
the website page marking module, used for adding data statistics code and a data transmission address to the pages of a website with JavaScript embedded-point code, and for creating, recording and collecting user access behavior information;
the log recording and sending module, used for processing the information received by the embedded-point server, recording it in log form, and sending the collected log data, together with the recorded header information and browser information, to the downstream Flume log framework;
the log stream processing module, used for performing offline batch processing and real-time stream processing on the log stream respectively;
the offline batch log data cleaning module, used for cleaning the offline batch logs with MapReduce and storing them into Hive;
the log data processing module, used for computing according to business logic with the ETL and Storm computation tools and outputting the processed log data;
the log data table-splitting and persistence module, used for splitting the log data into several tables as actually needed and persisting them to the database;
and the data visualization module, used for data visualization, displaying the traffic data through chart components.
The invention also includes a computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and the processor, when executing the executable code, implements the method of the invention.
The working principle of the invention is as follows:
the flow acquisition technology is an important source for monitoring network flow, and in order to effectively analyze the page flow in a complex enterprise website page, the invention mainly comprises three parts, wherein the first part is to record the whole user access process, such as user page browsing and clicking data of user IP addresses, browser information, operating system information, stay time, session records and the like; recording user data in the running process of the server through a log component of the code; the second part is a recorded data processing part which comprises the parallel cleaning, filtering and log file sorting of the offline flow data and the real-time flow data respectively; the third part is an analysis part for the effective data, comprising definition indexes, and acquiring service indexes such as online number of people, weekly browsing amount and the like from a database. The invention processes data based on a Flume log collection framework and a MapReduce parallel program design framework, and utilizes a Java Spring development framework to combine with Hive and SQLLite2 databases so as to complete the statistics of the whole flow data of website data recording, monitoring and analysis.
Addressing the defects of the prior art, the invention builds a clickstream model from the user's behavior on the website and performs traffic statistics for the different data types, so that the website can be improved in a more targeted way. The method mainly involves four modules: the data acquisition module, the data cleaning module, the data processing module and the data display module. It consumes little hardware, processes data promptly and reliably, supports big data processing and scales well.
The data acquisition module is implemented by combining the website's Nginx log files with the embedded-point JavaScript statistics code added to the website pages; the data is processed uniformly and sent to the log server for handling. Logs are collected by a Flume-configured agent and sent through the Sink component to the message middleware Kafka and to HDFS respectively, so that real-time log data and offline log data are handled separately; according to data type they are divided into offline data and real-time data and processed by different data processing modules. Data cleaning unifies the data into a regular, formatted structure; the MapReduce key-value pairs are used to partition, group and sort the data; the offline data is stored in the Hive data warehouse, and business indicators and similar data are finally stored into the database through the ETL computation tool. The real-time data is likewise stored into the database after Storm real-time computation. The results are finally displayed through the data visualization interface. Combining the offline and real-time data processing modes ensures both the accuracy and the timeliness of the website traffic data.
The system can be applied to big-data website traffic statistics scenarios. It integrates the logging component with the data, decouples the big data modules from the front-end application server processing module, relieves the access pressure on the application server, and ensures the accuracy and timeliness of website traffic analysis.
The advantages of the invention are: functionally, it integrates the Storm big data processing framework with the mature MapReduce computation framework, ensuring the feasibility of analyzing offline data and real-time data in parallel and achieving high system availability; at the same time, effective website data can reach operators immediately so that the corresponding operation strategies can be adjusted promptly; development and maintenance costs are low, and data reliability is ensured to a certain extent.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the modules of the system of the present invention.
FIG. 3 is a schematic flow chart of the system of the present invention.
FIG. 4 is a framework diagram of the computing device of the system of the present invention.
FIG. 5 is a diagram of the application server system architecture of the present invention.
Detailed Description
The examples below are illustrative and are not to be construed as limiting the invention, whose scope is defined by the claims and includes, but is not limited to, the specific exemplary embodiments described herein. Any variation, replacement or modification that a person skilled in the art makes to the website traffic data statistics method and system of the present invention for different business scenarios or business indicators, such as the selection of pre-trained word vectors or of the word segmentation algorithm, falls within the scope of the present invention.
Example 1
Referring to FIGS. 1, 3, 4 and 5, a website traffic data statistics method comprises the following specific steps:
(1) For a website page, embedded points are implemented in js code, and data statistics code and data transmission code are added to judge, create, record and send page traffic data and user data. The js embedded-point code is as follows:
/* main function */
function a1_main(){
    var dest_path = "<data reception address>";      // where the statistics are sent
    var expire_time = "<session timeout duration>";
    // handle cookie
    // handle session
    // handle domain name
}
(a) Different cookies are set according to an operation flag: 0 indicates a session-level cookie with no timeout set; the cookie information is kept in the browser's memory and disappears when the browser is closed; 1 indicates that the timeout is set to 10 years from now, and the cookie is kept in the browser's temporary folder until the timeout arrives or the user clears it manually; 2 indicates that the timeout is set to 1 hour, and the cookie is kept in the browser's temporary folder until the timeout arrives or the user clears it manually.
(b) The uv_id value: if uv_id is null, an id is assigned to this new uv as a random number of length 20, uv_id=get_random(20), and the cookie is set with a save time of 10 years, set_cookie("uv", uv_id, 1); if the cookie value uv is not null, uv_id is read from it;
(c) The session is obtained: if no ss value exists in the cookie, this is a new session and ss_id=get_random(10), of length 10, is generated randomly; the value is spliced in the format 'session number_number of visits in the session_client time_website'; if an ss value does exist, the session is checked for timeout: if it has timed out, the session id is regenerated and the number of page visits in the session is reset to 0; otherwise the number of page visits is incremented by 1.
(d) Besides handling cookies, the js embedded point also collects other information such as browser details, and the embedded-point information is passed into the log framework. The following code is called when the page loads:
window.onload = function(){
    a1_main();
};
(2) The log framework receives and processes the information from the embedded-point server: the recorded embedded-point information, such as user information, page information and browser information, is sent to the downstream Flume log framework.
(a) Because the embedded data may contain redundant or missing information, a log-collecting server is written to preprocess the logs, including URL transcoding, unifying the log format and obtaining the IP; a LogServlet class inheriting from HttpServlet is defined, the doGet or doPost method is overridden, a unified log format is defined, and the user information is completed;
(b) In addition, to track process information while pages are browsed, the server adds a logging component. By defining an @AopLog annotation, the component creates a unique LogData object per thread; the LogData object specifies a fixed log printing format within the application and can record not only the interface parameters but also intermediate parameters in the service method. Log processing is performed by weaving an aspect into the interface and writing an aspect processor. The log server is then connected to Flume, and through the log4j properties configuration the logs are printed to Flume and to the console respectively:
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=<Flume address>
log4j.appender.flume.Port=<Flume port number>
log4j.appender.flume.UnsafeMode=true
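A minimal Spring AOP sketch of the logging component described in (b) follows. The @AopLog annotation name comes from the description, while the aspect class, the pointcut binding and the content recorded in LogData are assumptions for illustration.

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import org.apache.log4j.Logger;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

// Marker annotation placed on service methods whose process parameters should be logged.
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
@interface AopLog {}

// Aspect woven into annotated methods; keeps one log buffer ("LogData") per thread.
@Aspect
@Component
class AopLogAspect {
    private static final Logger logger = Logger.getLogger(AopLogAspect.class);
    // One buffer per thread, standing in for the per-thread LogData object.
    private static final ThreadLocal<StringBuilder> LOG_DATA =
            ThreadLocal.withInitial(StringBuilder::new);

    @Around("@annotation(aopLog)")
    public Object around(ProceedingJoinPoint pjp, AopLog aopLog) throws Throwable {
        StringBuilder logData = LOG_DATA.get();
        logData.setLength(0);
        // Record the interface parameters in a fixed format.
        logData.append(pjp.getSignature().toShortString())
               .append(" args=").append(java.util.Arrays.toString(pjp.getArgs()));
        try {
            Object result = pjp.proceed();
            logData.append(" result=").append(result);
            return result;
        } finally {
            // Printed through log4j, so the Flume appender above also receives it.
            logger.info(logData.toString());
            LOG_DATA.remove();
        }
    }
}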
(3) Performing offline batch processing and real-time stream processing on the log stream respectively;
(a) Offline batch processing: weblog.conf is created under the Flume data directory, the corresponding source (data source), channel (connecting source and sink) and sink (data target) are configured, and the offline data arriving from log4j is stored into HDFS grouped by time:
a1.sources=r1
a1.channels=c1
a1.sinks=k1
a1.sources.r1.type=avro
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=44444
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs address
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.rollInterval=40
a1.sinks.k1.hdfs.rollSize=0
a1.sinks.k1.hdfs.rollCount=2000
a1.channels.c1.type=memory
a1.channels.c1.capacity=2000
a1.channels.c1.transactionCapacity=200
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
(b) Data cleaning: MapReduce processes the data line by line, and a custom MapReduce program preprocesses the collected raw log data, performing cleaning, format normalization and dirty data filtering. The file data is first read, records that do not meet the requirements are filtered out, and the remaining data is combed into clickstream model data. The preprocessed data is then imported into Hive. Because the processing must be repeated on a fixed time schedule, for example processing each day's data at a scheduled time, a task scheduling module is added to manage the scheduling of the task units.
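A minimal sketch of such a cleaning mapper is given below; the '|'-separated input layout and the field positions are assumptions used only for illustration, not the program of the invention.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the cleaning step: drop malformed records, re-emit well-formed ones
// in a fixed field order so later stages can partition, sort and group them.
public class WeblogCleanMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\\|");
        // Expected (assumed) layout: timestamp | ip | uv_id | session_id | url
        if (fields.length < 5) {
            return;                      // dirty record: missing fields
        }
        String ip = fields[1].trim();
        String url = fields[4].trim();
        if (ip.isEmpty() || url.isEmpty()) {
            return;                      // dirty record: empty key fields
        }
        // Emit the cleaned record; a reducer (not shown) groups it into clickstream data.
        context.write(new Text(String.join("\t", fields)), NullWritable.get());
    }
}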
(c) When the data in HDFS is loaded into the Hive tables, it is partitioned by time. Various statistical results are then obtained through the ETL analysis tools, persisted into the MySQL database, and finally visualized with ECharts, so that operation decision makers can obtain the data conveniently and understand it more simply and quickly.
(4) Real-time stream processing:
(a) A Flume agent is added on the node where the website is deployed to collect the real-time records, and the delivery target address is set in the Flume Sink component. The real-time records are delivered to the Kafka cluster by filling in the information pointing to the Kafka cluster in the Sink component; Flume is configured as follows:
agent.sources=r1
agent.sinks=k1
agent.channels=c1
#Describe/configure the source
agent.sources.r1.type=netcat
agent.sources.r1.bind=192.168.223.128
agent.sources.r1.port=8888
#Describe the sink
agent.sinks.k1.type=org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.kafka.bootstrap.servers=192.168.223.128:9092
agent.sinks.k1.kafka.topic=log4j-flume-kafka
agent.sinks.k1.serializer.class=kafka.serializer.StringEncoder
agent.sinks.k1.kafka.producer.acks=1
agent.sinks.k1.custom.encoding=UTF-8
agent.channels.c1.type=memory
agent.channels.c1.capacity=1000
agent.channels.c1.transactionCapacity=100
# bind source and sink to channel
# configuration transmission channel
agent.sources.r1.channels=c1
agent.sinks.k1.channel=c1
The collected real-time records are stored on the Kafka producer side.
(b) The data to be consumed in the Kafka cluster is transported to the Storm cluster through the KafkaSpout. Storm is a real-time, distributed computing system with high fault tolerance, and like Hadoop it can also process large batches of data. After the data enters the Storm cluster, the streams are grouped through Storm's real-time computation model; if a stream is field-grouped on a field named "ip", all tuples containing the same "ip" are dispatched to the same task, which ensures consistency of message processing. The corresponding computations are performed according to the business indicators, and the computed results are persisted into a DB; MySQL or Redis can generally be used for the data persistence.
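A minimal Storm topology sketch under assumptions is shown below; the Kafka address and topic match the Flume configuration above, while the log line layout, the bolt logic and the parallelism values are illustrative only, not the topology of the invention.

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Sketch of the real-time path: KafkaSpout -> parse bolt -> per-ip counting bolt.
public class TrafficTopology {

    // Extracts the ip field from the raw '|'-separated log line (layout is an assumption).
    public static class ParseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String line = input.getStringByField("value");
            String[] fields = line.split("\\|");
            if (fields.length >= 2) {
                collector.emit(new Values(fields[1]));   // the ip field
            }
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("ip"));
        }
    }

    // Counts hits per ip; fieldsGrouping("ip") routes the same ip to the same task.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<>();
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String ip = input.getStringByField("ip");
            counts.merge(ip, 1L, Long::sum);
            // In the real system the counts would be persisted to MySQL or Redis here.
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout<>(
                KafkaSpoutConfig.builder("192.168.223.128:9092", "log4j-flume-kafka").build()), 1);
        builder.setBolt("parse", new ParseBolt(), 2).shuffleGrouping("kafka-spout");
        builder.setBolt("count", new CountBolt(), 2).fieldsGrouping("parse", new Fields("ip"));

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("traffic-topology", new Config(), builder.createTopology());
        Thread.sleep(60_000);   // let the local topology run for a minute, then stop
        cluster.shutdown();
    }
}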
Example 2
Referring to FIGS. 2 to 5, this embodiment relates to a system implementing the website traffic data statistics method of embodiment 1, comprising:
the website page marking module, used for adding data statistics code and a data transmission address to the pages of a website with JavaScript embedded-point code, and for creating, recording and collecting user access behavior information;
the log recording and sending module, used for processing the information received by the embedded-point server, recording it in log form, and sending the collected log data, together with the recorded header information and browser information, to the downstream Flume log framework;
the log stream processing module, used for performing offline batch processing and real-time stream processing on the log stream respectively;
the offline batch log data cleaning module, used for cleaning the offline batch logs with MapReduce and storing them into Hive;
the log data processing module, used for computing according to business logic with the ETL and Storm computation tools and outputting the processed log data;
the log data table-splitting and persistence module, used for splitting the log data into several tables as actually needed and persisting them to the database;
and the data visualization module, used for data visualization, displaying the traffic data through chart components.
Example 3
The invention also includes a computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and the processor, when executing the executable code, implements the website traffic data statistics method of embodiment 1.

Claims (10)

1. A website traffic data statistics method, characterized by building a clickstream model from the user's behavior on the website so that the website can be improved in a more targeted way, and comprising the following steps:
(1) Adding data statistics code and a data transmission address to the pages of a website using JavaScript embedded-point code, and creating, recording and collecting user access behavior information;
(2) Processing the information received by the embedded-point server, recording it in log form, and sending the collected log data, together with the recorded header information and browser information, to a downstream Flume log framework;
(3) Performing offline batch processing and real-time stream processing on the log stream respectively;
(4) Storing the offline batch logs in Hive after MapReduce data cleaning;
(5) Computing according to business logic with computation tools such as ETL and Storm, and outputting the processed log data;
(6) Splitting the log data into several tables as actually needed and persisting them to a database;
(7) Visualizing the data, displaying the traffic data through chart components.
2. The website traffic data statistics method according to claim 1, wherein step (1) specifically comprises:
first, the embedded points are implemented in js code: a dedicated js script is written to record user characteristic information such as page UV, cookie and session, and the code is embedded into the web pages for which log analysis is required.
3. The website traffic data statistics method according to claim 1, wherein step (2) specifically comprises: since the basic user information recorded by the embedded-point code may contain redundant or missing information, a log-collecting server is written to preprocess the logs, including URL transcoding, unifying the log format and obtaining the IP; a LogServlet class inheriting from HttpServlet is defined, the doGet or doPost method is overridden, a unified log format is defined, the user information is completed, and the traffic indicator information is sent to the back-end server; in addition, to enrich data acquisition, the server adds a logging component, which creates a unique LogData object per thread by defining an @AopLog annotation; the LogData object standardizes a fixed log printing format within the application and can record not only interface parameters but also intermediate parameters in the service method, performing log processing by weaving an aspect into the interface; the log server is connected with Flume, and through the log4j properties configuration the Flume port number, address and other information are configured so that the logs are printed to Flume and to the console respectively.
4. The website traffic data statistics method according to claim 1, wherein step (3) comprises: creating weblog.conf under the Flume data directory, and configuring the corresponding acquisition source, the sink target, namely the HDFS file system, and the channel between them in file-transfer mode; when new files appear, the offline data coming from log4j is stored into HDFS grouped by time, while the real-time statistical data is delivered to a Kafka producer.
5. The website traffic data statistics method according to claim 1, wherein in step (4): offline data cleaning is performed by writing a MapReduce program; a task scheduling module is customized so that the data is processed at fixed times, and the data is partitioned, sorted, combined and grouped into data segments; because the processing must be repeated on a fixed time schedule, the task scheduling module manages and schedules the MapReduce processing in a unified way; finally the program is packaged into a jar, uploaded to Linux and run to execute the data cleaning flow; the cleaned data is imported into Hive through an import command and partitioned by time;
for the data that needs real-time processing, stream computation is performed with Storm; Storm processes continuously generated data streams very quickly, but because the streams are not generated uniformly, Kafka is introduced so that the data is delivered evenly to the topic subscribed by Storm before subsequent processing.
6. The website traffic data statistics method according to claim 1, wherein step (5) specifically comprises: analyzing the Hive data with ETL; after extraction, cleaning and transformation, the ETL loads the business system data into the data warehouse; the log data is integrated through ETL and, according to the business logic, provides the basis for customer decision analysis; the users' home regions are queried and the region with the highest total traffic is computed.
7. The website traffic data statistics method according to claim 1, wherein in step (6): the calculated indicators are persisted to a database, and the persistence step is implemented with the data export tool Sqoop.
8. The website traffic data statistics method according to claim 1, wherein in step (7): the data visualization page is implemented with ECharts, and a corresponding web program is custom-developed.
9. A system for implementing the website traffic data statistics method as recited in claim 1, comprising:
the website page marking module, used for adding data statistics code and a data transmission address to the pages of a website with JavaScript embedded-point code, and for creating, recording and collecting user access behavior information;
the log recording and sending module, used for processing the information received by the embedded-point server, recording it in log form, and sending the collected log data, together with the recorded header information and browser information, to the downstream Flume log framework;
the log stream processing module, used for performing offline batch processing and real-time stream processing on the log stream respectively;
the offline batch log data cleaning module, used for cleaning the offline batch logs with MapReduce and storing them into Hive;
the log data processing module, used for computing according to business logic with the ETL and Storm computation tools and outputting the processed log data;
the log data table-splitting and persistence module, used for splitting the log data into several tables as actually needed and persisting them to the database;
and the data visualization module, used for data visualization, displaying the traffic data through chart components.
10. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-8.
CN202310264141.3A 2023-03-10 2023-03-10 Website traffic data statistics method and system Pending CN116506300A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310264141.3A CN116506300A (en) 2023-03-10 2023-03-10 Website traffic data statistics method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310264141.3A CN116506300A (en) 2023-03-10 2023-03-10 Website traffic data statistics method and system

Publications (1)

Publication Number Publication Date
CN116506300A true CN116506300A (en) 2023-07-28

Family

ID=87329289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310264141.3A Pending CN116506300A (en) 2023-03-10 2023-03-10 Website traffic data statistics method and system

Country Status (1)

Country Link
CN (1) CN116506300A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117290256A (en) * 2023-11-24 2023-12-26 北京中指实证数据信息技术有限公司 Code burying method for counting user behaviors

Similar Documents

Publication Publication Date Title
CN107577805B (en) Business service system for log big data analysis
US9817867B2 (en) Dynamically processing an event using an extensible data model
US8938534B2 (en) Automatic provisioning of new users of interest for capture on a communication network
US9082127B2 (en) Collecting and aggregating datasets for analysis
US11924240B2 (en) Mechanism for identifying differences between network snapshots
US10191962B2 (en) System for continuous monitoring of data quality in a dynamic feed environment
CN105824744A (en) Real-time log collection and analysis method on basis of B2B (Business to Business) platform
US9058323B2 (en) System for accessing a set of communication and transaction data associated with a user of interest sourced from multiple different network carriers and for enabling multiple analysts to independently and confidentially access the set of communication and transaction data
US10826803B2 (en) Mechanism for facilitating efficient policy updates
CN104426713A (en) Method and device for monitoring network site access effect data
US10044820B2 (en) Method and system for automated transaction analysis
CN108052679A (en) A kind of Log Analysis System based on HADOOP
CN108228322B (en) Distributed link tracking and analyzing method, server and global scheduler
CN105515836A (en) Log processing method, device and server
CN112181931A (en) Big data system link tracking method and electronic equipment
CA3119167A1 (en) Approach for a controllable trade-off between cost and availability of indexed data in a cloud log aggregation solution such as splunk or sumo
CN113360554A (en) Method and equipment for extracting, converting and loading ETL (extract transform load) data
CN111970195A (en) Data transmission method and streaming data transmission system
CN116506300A (en) Website traffic data statistics method and system
CN106559498A (en) Air control data collection platform and its collection method
CN102055620B (en) Method and system for monitoring user experience
CN105468502A (en) Log collection method, device and system
CN115391429A (en) Time sequence data processing method and device based on big data cloud computing
CN114971714A (en) Accurate customer operation method based on big data label and computer equipment
CN114417796A (en) Dynamic report statistical method and system based on equipment sampling points

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination