CN111460333A - Real-time search data analysis system - Google Patents

Info

Publication number
CN111460333A
Authority
CN
China
Prior art keywords
data
search
analysis
user
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010233677.5A
Other languages
Chinese (zh)
Other versions
CN111460333B (en)
Inventor
朱哲哲
段娟
肖创柏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Langzhao Technology Beijing Co ltd
Original Assignee
Langzhao Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Langzhao Technology Beijing Co ltd
Priority to CN202010233677.5A
Publication of CN111460333A
Application granted
Publication of CN111460333B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a real-time search data analysis system that uses stream-computing big data technology to analyze and aggregate the search behavior data generated by a vertical search engine and presents the results visually. In the data collection stage, a purpose-built event-tracking (buried-point) technique collects the clicking, browsing, and other behaviors that users leave on the search interface, and the data used for searching is collected at the same time. For display, besides visualizing the analysis results, the system uses search engine technology so that users can find analysis results more conveniently and mine the potential value of the data: a user only needs to enter the desired analysis dimension in a search box and click the analysis result of interest to see the corresponding chart.

Description

Real-time search data analysis system
Technical Field
The invention relates to a system for analyzing search data in real time, and belongs to the field of big data.
Background
Real-time search data analysis is a data processing mode that followed offline batch analysis in the field of big data. As the internet has evolved, more and more data needs to be processed and analyzed to support decision making. In the traditional big data approach, data is accumulated for a period of time and stored on disk or in a distributed file system (HDFS, etc.), then analyzed offline, and the results are compiled into reports for operators to consult. The advantages are that the data is complete, all historical data can be used for analysis, and offline analysis does not occupy server resources. The disadvantage is that services with real-time requirements, such as advertising, cannot be met. Most data generated in the big data era is time-sensitive and loses its value after a while, so real-time computation is urgently needed to meet business and service requirements.
Another problem is that, although the general search technology represented by conventional search engines is maturing and more and more vertical search engines serve millions of enterprise users, who rely on them to simplify data acquisition and improve decision making, the behavior of users on these search engines remains unknown, so search-based data analysis cannot be carried out effectively. By combining real-time big data analysis technology, the invention statistically analyzes user behavior in the vertical search engine in real time along different time dimensions and provides visual display, which effectively addresses the timeliness and black-box problems of search data.
Disclosure of Invention
The invention provides a solution for analyzing vertical search engine data in real time, and solves the problems that search data is difficult to obtain, the real-time performance of data analysis cannot be guaranteed, the search data cannot be effectively utilized and the like.
Specifically, the invention comprises a search data acquisition module, a search data transmission module, a search data analysis module, an analysis result storage module and an analysis result display module.
Search data acquisition module: acquires the data generated by searching from different sources and sends it onward in a uniform format. The sources of search data are the search interface and the search API business system.
Search data generated by the search API service is counted by a background service interface. Each search link is classified by HTTP method and written to a log; the log directory is monitored in real time using Flume, an interceptor modifies the data format to ensure uniformity, and the data is then sent to the data stream pipeline.
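As an illustration of the format-unifying step, the following is a minimal Python sketch of what an interceptor might do to one log line. The log layout (`timestamp METHOD url`), the query parameter `q`, and the output field names are assumptions made for this example; the patent does not specify them, and a real Flume interceptor would be written against Flume's own Java interface.

```python
import json
from urllib.parse import urlsplit, parse_qs


def normalize_api_log(line: str) -> str:
    """Turn one raw API log line into a uniform JSON record.

    Assumed (hypothetical) input layout: "<timestamp> <HTTP method> <url>".
    """
    timestamp, method, url = line.split(maxsplit=2)
    parts = urlsplit(url.strip())
    query = parse_qs(parts.query)
    record = {
        "source": "api",                         # distinguishes API data from interface data
        "timestamp": timestamp,
        "http_method": method.upper(),           # the classification key named in the text
        "path": parts.path,
        "keyword": query.get("q", [""])[0],      # "q" is an assumed parameter name
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    print(normalize_api_log("2020-03-30T10:15:00 GET /search?q=kafka"))
```

Emitting one JSON object per line keeps the downstream pipeline independent of the original log layout.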
Data generated by the search interface is captured by a front-end event-tracking (buried-point) technique that monitors user behavior. The tracking is implemented as the glapi technology: the client includes a glapi.js script in the page and assigns an id attribute to each page tag to be tracked so that the tag can be identified, and the collected user behavior data is sent back to the server in real time. The collected data includes the current link address, the browser type, the current geographic location and so on; the user-related data includes the words searched in the search engine, the articles the user browsed, and the user's likes and collection (bookmarking) of articles. After collection, the data flows into the data stream pipeline.
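The server side of such tracking can be sketched as a small validation step for incoming events. The sketch below is in Python and is only an assumption about how the payload might look: the field names, the action vocabulary, and the `parse_event` helper are hypothetical and are not taken from glapi.js.

```python
from dataclasses import dataclass

# Hypothetical action names for the behaviors listed in the text.
ALLOWED_ACTIONS = {"search", "browse", "like", "collect"}


@dataclass
class TrackingEvent:
    element_id: str      # id attribute of the tracked page tag
    action: str          # one of ALLOWED_ACTIONS
    url: str = ""        # current link address
    browser: str = ""    # browser type
    location: str = ""   # current geographic location
    value: str = ""      # e.g. the searched words or the article id


def parse_event(payload: dict) -> TrackingEvent:
    """Validate a raw payload sent by the page and build a typed event."""
    if payload.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {payload.get('action')!r}")
    if not payload.get("element_id"):
        raise ValueError("element_id is required")
    return TrackingEvent(
        element_id=payload["element_id"],
        action=payload["action"],
        url=payload.get("url", ""),
        browser=payload.get("browser", ""),
        location=payload.get("location", ""),
        value=payload.get("value", ""),
    )
```

Rejecting events without a tag id or with an unknown action keeps malformed records out of the data stream pipeline.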
Search data transmission module: the data stream is transmitted in real time using a Kafka message queue. After an interceptor has unified the data format, the data in Kafka is divided into different topics according to whether it is search data or user data, and is delivered to the stream-computing analysis platform in publish-subscribe mode.
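The topic split can be illustrated without a running broker. The Python sketch below only models the routing decision; the topic names and the `kind` field are assumptions for this example, and in a real deployment the returned topic name would be handed to a Kafka producer.

```python
from collections import defaultdict

# Hypothetical topic names, one per data category named in the text.
TOPICS = {"search": "search-data", "user": "user-data"}


def topic_for(record: dict) -> str:
    """Pick the topic for a uniform-format record based on its 'kind' field."""
    kind = record.get("kind")
    if kind not in TOPICS:
        raise ValueError(f"cannot route record of kind {kind!r}")
    return TOPICS[kind]


def route(records):
    """Group records by topic, standing in for publishing to a message queue."""
    queues = defaultdict(list)
    for record in records:
        queues[topic_for(record)].append(record)
    return dict(queues)
```

Keeping search data and user data in separate topics lets each analysis job subscribe only to the stream it needs.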
Search data analysis module: after the search data has been preliminarily cleaned, different analysis schemes and analysis dimensions are customized according to business requirements. The analysis comprises statistical analysis of search data, statistical analysis of user data, and in-depth analysis.
The statistical analysis uses Spark and Spark Streaming. The Spark side pulls the data stream from Kafka, analysis operators are defined for the different business dimensions to form a DAG, the data is split and aggregated in real-time micro-batches, and the analysis results are sent to the database system for storage.
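The micro-batch idea can be shown in plain Python. This is a stand-in for the technique, not Spark code: the event shape `(timestamp_seconds, keyword)` and the function names are assumptions for illustration, and a real job would express the same grouping with Spark Streaming operators.

```python
from collections import Counter
from itertools import groupby


def micro_batches(events, interval_s):
    """Cut (timestamp_s, keyword) events into fixed-length time windows.

    Yields (window_start_s, [keywords]) in time order.
    """
    ordered = sorted(events, key=lambda e: e[0])
    for window, group in groupby(ordered, key=lambda e: e[0] // interval_s):
        yield window * interval_s, [keyword for _, keyword in group]


def count_keywords(events, interval_s=5):
    """Apply one analysis operator (keyword counting) to every batch."""
    return {
        start: Counter(batch)
        for start, batch in micro_batches(events, interval_s)
    }
```

Each window is processed as a small batch, which is how a stream is cut and aggregated while results stay close to real time.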
The statistical analysis of search data covers dimensions such as total number of searches, total number of users, total data volume, geographic distribution of users, hourly search volume, 24-hour search volume, articles searched in real time and keywords searched in real time. Search indexes, user indexes and data indexes are derived with a heat (popularity) algorithm to judge search quality.
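Several of these dimensions reduce to simple counting. The Python sketch below computes total searches, total users and the hourly search volume from uniform records; the record fields `user_id` and `timestamp` (ISO 8601) are assumptions, and the heat algorithm is deliberately not shown because the source does not define it.

```python
from collections import Counter
from datetime import datetime


def hourly_search_volume(timestamps):
    """Return a 24-element list with the number of searches per hour of day."""
    counts = Counter(datetime.fromisoformat(t).hour for t in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def summarize(records):
    """Compute a few of the listed dimensions from search records."""
    return {
        "total_searches": len(records),
        "total_users": len({r["user_id"] for r in records}),
        "hourly": hourly_search_volume(r["timestamp"] for r in records),
    }
```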
The statistical analysis of user data covers users' search behavior in the search engine, such as the most frequently used search words, the busiest search periods, the total number of articles liked and the total number of articles collected, and statistics for a single user are also provided. The in-depth analysis covers whether user searches hit results, a search result funnel chart, and search recommendations based on the user's search results.
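A funnel chart needs, for each stage, how many users reached it and what share of the previous stage that is. The Python sketch below shows one way to compute this; the stage names and their order are assumptions for illustration, since the source does not list the funnel stages.

```python
# Hypothetical funnel stages, from broadest to narrowest.
FUNNEL_STAGES = ["search", "hit", "click", "collect"]


def funnel(events):
    """Build funnel rows from (user_id, stage) events.

    Returns a list of (stage, distinct_users, rate_vs_previous_stage).
    Users are counted per stage independently; the sketch does not check
    that a user actually passed through the earlier stages.
    """
    users_by_stage = {stage: set() for stage in FUNNEL_STAGES}
    for user_id, stage in events:
        if stage in users_by_stage:
            users_by_stage[stage].add(user_id)

    rows = []
    previous = None
    for stage in FUNNEL_STAGES:
        count = len(users_by_stage[stage])
        if previous is None:
            rate = 1.0               # first stage is the baseline
        elif previous == 0:
            rate = 0.0               # avoid dividing by zero
        else:
            rate = count / previous
        rows.append((stage, count, rate))
        previous = count
    return rows
```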
Analysis result storage module: stores the analyzed and aggregated data in a MySQL database to provide data support for the subsequent visual display.
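A possible shape for the result table is sketched below in Python. To keep the example self-contained it uses the standard-library `sqlite3` module as a stand-in for MySQL; the table name, columns, and `save_metrics` helper are assumptions, and the upsert statement is SQLite syntax (MySQL would use `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE`).

```python
import sqlite3


def save_metrics(conn, rows):
    """Store (metric, window_start, value) rows, overwriting earlier values."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS analysis_result (
            metric       TEXT NOT NULL,
            window_start TEXT NOT NULL,
            value        REAL NOT NULL,
            PRIMARY KEY (metric, window_start)
        )
        """
    )
    # SQLite-specific upsert; one row per metric and time window.
    conn.executemany(
        "INSERT OR REPLACE INTO analysis_result VALUES (?, ?, ?)", rows
    )
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    save_metrics(connection, [("total_searches", "2020-03-30T10:00", 42.0)])
    print(connection.execute("SELECT * FROM analysis_result").fetchall())
```

Keying on metric and window means a recomputed batch replaces its earlier value, so the display layer always reads one row per point.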
Analysis result display module: provides a visual display interface for the search data analyzed with Spark; in addition, to ensure that the search data is delivered effectively, the analyzed data is also made available to a search engine.
For visualization, the data is pulled from the database and rendered with ECharts. User login and registration are provided, different data sets can be selected for display, and the display is divided into an overall view and a single-item view.
The analyzed data is stored in the search engine through the search engine's data transmission interface; the user can search for analysis results through the search bar and sees the display interface after clicking a result.
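Making analysis results searchable amounts to indexing their titles. The Python sketch below uses a tiny in-memory inverted index as a stand-in for the search engine; the result ids, titles, and function names are hypothetical, and whitespace tokenization is assumed, which suits English titles only.

```python
from collections import defaultdict


def build_index(results):
    """Map each lowercase title token to the ids of results containing it.

    `results` is a dict of {result_id: title}.
    """
    index = defaultdict(set)
    for result_id, title in results.items():
        for token in title.lower().split():
            index[token].add(result_id)
    return index


def search(index, query):
    """Return ids of results whose titles contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return hits
```

A query such as "hourly search" then returns the ids of matching charts, which the page can open when the user clicks a result.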
Drawings
To illustrate the technical solutions of the present invention and the overall implementation more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a diagram of a real-time search data analysis architecture provided by an embodiment of the present invention.
Fig. 2 is a flowchart of a user implementation provided in the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
To implement the system, the whole system is first built according to the architecture shown in Fig. 1, in the following steps:
Step 1: configure the data collection file. Specify the log file directory in the collection configuration, define the data output destinations and data pipelines, and start the data collection service. At the same time, include the event-tracking (buried-point) collection file in the page.
Step 2: build the message queue cluster. Specify the names and number of topics in the configuration file, specify the number of partitions for each topic, define the listening address, and then start the message queue service.
Step 3: set up the data analysis service. Specify whether the data needs to be stored offline, specify the address of the database that stores the analysis results, and finally start the data analysis service.
Step 4: start the visualization server and test whether the page displays normally.
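Before starting the services, the settings named in these steps can be checked in one place. The Python sketch below is a hypothetical pre-flight check; the configuration keys (`log_dir`, `topics`, `listen_address`, `result_db_url`) are invented for this example and do not correspond to any particular tool's configuration format.

```python
def validate_pipeline_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    if not cfg.get("log_dir"):
        errors.append("log_dir is required (step 1)")

    topics = cfg.get("topics") or {}   # {topic name: partition count}
    if not topics:
        errors.append("at least one topic is required (step 2)")
    for name, partitions in topics.items():
        if not isinstance(partitions, int) or partitions < 1:
            errors.append(f"topic {name!r} needs a positive partition count")

    if not cfg.get("listen_address"):
        errors.append("listen_address is required (step 2)")
    if not cfg.get("result_db_url"):
        errors.append("result_db_url is required (step 3)")
    return errors
```

Collecting all problems instead of stopping at the first one lets an operator fix the configuration in a single pass.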
Fig. 2 shows the usage flow after the system has been built successfully:
Step 1: the user opens the background address and logs in to the system after entering a user name and password.
Step 2: after logging in, the user specifies the data collection configuration and the analysis dimension configuration, including the data collection format and whether a given data dimension should be displayed, or selects the default configuration.
Step 3: the user selects a search data set, and the system loads it.
Step 4: after the data set is selected, the analysis results can be viewed directly; alternatively, the user can use the search bar to search for analysis indexes and click the corresponding result to see the visual interface.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that the above examples are only used to illustrate the technical solutions of the present invention, but not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. A real-time search data analysis system, characterized by: the system comprises a search data acquisition module, a search data transmission module, a search data analysis module, an analysis result storage module and an analysis result display module;
a search data acquisition module: the system is used for acquiring data generated by searching from different sources and sending the data in a uniform format; the source of the search data comprises data generated by a search interface and data generated by a search API service system;
the search data transmission module: the data stream is transmitted in real time by utilizing a Kafka message queue technology, the data in the Kafka is classified into different topics by utilizing an interceptor to unify the data format according to the search data and the user data, and the data is transmitted to a stream type calculation analysis platform by utilizing a publishing subscription mode;
a search data analysis module: after the search data is preliminarily cleaned, customizing different analysis schemes and analysis dimensions according to business requirements, wherein the analysis schemes and the analysis dimensions comprise search data statistical analysis, user data statistical analysis and deepening analysis;
the analysis result storage module stores the data after analysis and statistics into a MySQL database to provide data support for subsequent visual display;
an analysis result display module: providing a visual display interface for the search data analyzed by the Spark technology, and providing a search engine support for the analyzed data in order to ensure the effective transmission of the search data;
data visualization is pulled from a database by using an Echarts technology, user login and registration are provided, different data set display functions are selected, and a display part is divided into overall display and single display;
the analyzed data is stored in the search through a search engine data transmission interface, the user searches the analyzed result through a search bar, and the user can see a display interface after clicking.
2. The real-time search data analysis system of claim 1, wherein: in the search data acquisition module, search data generated by the search API service is counted through a background service interface, each search link is classified according to its HTTP method and stored in logs, the log directory is monitored in real time using Flume, an interceptor is added to modify the data format to ensure uniformity, and the data is then sent to a data stream pipeline.
3. The real-time search data analysis system of claim 1, wherein: monitoring user behaviors through data generated by a search interface by a front-end point burying technology; the embedded point technology is realized as a glapi technology, a client side introduces a glapi.js script into a page, and assigns an id attribute to a page tag to be acquired so as to identify the tag to be acquired, and acquired user behavior data can be sent back to a server in real time; the collected data comprises a current link address, a browser type and a current geographic position, the data related to the user comprises words searched by a search engine, articles browsed by the user, praise of the user on a certain article and collection behaviors, and the data flows to a data flow pipeline after being collected.
4. The real-time search data analysis system of claim 1, wherein: the statistical analysis uses Spark and Spark Streaming, the Spark side pulls the data stream from Kafka, analysis operators are defined according to different service dimensions to form a DAG, the data is split and aggregated in real-time micro-batches, and the analysis result is then sent to the database system for storage.
5. The real-time search data analysis system of claim 1, wherein: the search data statistical analysis comprises total search number, total user number, total data volume, user geographic distribution, hourly search volume, 24-hour search volume, real-time search articles and real-time search keyword dimension, and search indexes, user indexes and data indexes are provided according to a heat algorithm to judge search quality.
6. The real-time search data analysis system of claim 1, wherein: the user data statistical analysis comprises the search behaviors of the user under a search engine, such as the most common search words, the most search time periods, the total number of articles approved and the total number of collections, and meanwhile, the statistical analysis of a single user is also provided; the deepening analysis comprises the steps that a user searches to hit results, a search result funnel graph is obtained, and user search recommendation is provided according to the user search results.
7. The method for implementing the paperless examination anti-cheating system by using the system of claim 1 is characterized by comprising the following steps:
step 1: a user opens a background address and inputs a user name and a password and then logs in the system;
step 2: after logging in, data collection configuration and analysis dimension configuration need to be specified, wherein the data collection format is specified, whether a certain data dimension needs to be displayed or not is judged, or default configuration is selected;
step 3: selecting a search data set, and loading the data set by the system;
step 4: after the data set is selected, the analysis result can be directly seen, or the search bar is selected to search the analysis index, and after the result is found, the corresponding result is clicked, so that the visual interface can be seen.
CN202010233677.5A 2020-03-30 2020-03-30 Real-time search data analysis system Active CN111460333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010233677.5A CN111460333B (en) 2020-03-30 2020-03-30 Real-time search data analysis system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010233677.5A CN111460333B (en) 2020-03-30 2020-03-30 Real-time search data analysis system

Publications (2)

Publication Number Publication Date
CN111460333A (en) 2020-07-28
CN111460333B (en) 2024-02-23

Family

ID=71681525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010233677.5A Active CN111460333B (en) 2020-03-30 2020-03-30 Real-time search data analysis system

Country Status (1)

Country Link
CN (1) CN111460333B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224445A (en) * 2015-10-28 2016-01-06 北京汇商融通信息技术有限公司 Distributed tracking system
CN105786683A (en) * 2016-03-03 2016-07-20 四川长虹电器股份有限公司 Customized log collecting system and method
CN106547882A (en) * 2016-11-03 2017-03-29 国网重庆市电力公司电力科学研究院 A kind of real-time processing method and system of big data of marketing in intelligent grid
US20190334789A1 (en) * 2018-04-26 2019-10-31 EMC IP Holding Company LLC Generating Specifications for Microservices Implementations of an Application
CN110297746A (en) * 2019-07-05 2019-10-01 北京慧眼智行科技有限公司 A kind of data processing method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GAUTAM PAL, ET AL.: "Big Data Real-Time Clickstream Data Ingestion Paradigm for E-Commerce Analytics" *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112486708A (en) * 2020-12-16 2021-03-12 中国联合网络通信集团有限公司 Processing method and processing system of page operation data
CN112486708B (en) * 2020-12-16 2023-11-07 中国联合网络通信集团有限公司 Page operation data processing method and processing system
CN114372090A (en) * 2021-12-31 2022-04-19 北京工业大学 User reading behavior analysis and prediction system under big data environment
CN114372090B (en) * 2021-12-31 2024-05-24 北京工业大学 User reading behavior analysis and prediction system in big data environment
CN115001793A (en) * 2022-05-27 2022-09-02 北京双湃智安科技有限公司 Data fusion method for information security multi-source heterogeneous data

Also Published As

Publication number Publication date
CN111460333B (en) 2024-02-23

Similar Documents

Publication Publication Date Title
US7165069B1 (en) Analysis of search activities of users to identify related network sites
CN111460333A (en) Real-time search data analysis system
WO2021098648A1 (en) Text recommendation method, apparatus and device, and medium
CN107451149B (en) Monitoring method and device for flow data query task
CN109101658B (en) Information searching method and device, and equipment/terminal/server
CN106339394B (en) Information processing method and device
CN111475728A (en) Cloud resource information searching method, device, equipment and storage medium
US20140201203A1 (en) System, method and device for providing an automated electronic researcher
US11423096B2 (en) Method and apparatus for outputting information
KR20100112512A (en) Apparatus for searching contents and method for searching contents
JP5705114B2 (en) Information processing apparatus, information processing method, program, and web system
CN112749266B (en) Industrial question and answer method, device, system, equipment and storage medium
CN112181931A (en) Big data system link tracking method and electronic equipment
US9323833B2 (en) Relevant online search for long queries
CN114265981A (en) Recommendation word determining method, device, equipment and storage medium
CN107330076B (en) Network public opinion information display system and method
CN111159559A (en) Method for constructing recommendation engine according to user requirements and user behaviors
Knap Towards Odalic, a Semantic Table Interpretation Tool in the ADEQUATe Project.
KR100557874B1 (en) Method of scientific information analysis and media that can record computer program thereof
CN116226494B (en) Crawler system and method for information search
Wu et al. RIVA: A Real-Time Information Visualization and analysis platform for social media sentiment trend
CN114417179A (en) Meta-search engine processing method and device for large-scale knowledge base group
CN112883143A (en) Elasticissearch-based digital exhibition searching method and system
Xu et al. The application of web crawler in city image research
KR20210045172A (en) Big Data Management and System for Livestock Disease Outbreak Analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant