CN111460333A - Real-time search data analysis system - Google Patents
- Publication number
- CN111460333A
- Authority
- CN
- China
- Prior art keywords
- data
- search
- analysis
- user
- real
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9538—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9577—Optimising the visualization of content, e.g. distillation of HTML documents
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a real-time search data analysis system that uses stream-computing big data technology to analyze and tally the search behavior data generated by a vertical search engine and presents the results visually. In the data collection stage, a purpose-built event-tracking (buried point) technique collects the clicks, page views and other behaviors a user leaves on the search interface, and the data used for searching is collected at the same time. For display, in addition to visualizing the analysis results, the system uses search engine technology to make those results easier to find and to mine the potential value of the data: a user enters the desired analysis dimension in a search box and clicks the result of interest to see the corresponding chart.
Description
Technical Field
The invention relates to a system for analyzing search data in real time, and belongs to the field of big data.
Background
Real-time search data analysis is a data processing mode that emerged in the big data field after offline batch analysis. As the internet has evolved, more and more data must be processed and analyzed to support decision making. In the traditional approach, data is accumulated for a period of time, stored on disk or in a distributed file system (HDFS, etc.), analyzed offline, and turned into reports for operators. The advantages are high data integrity, the ability to use all historical data, and the fact that offline analysis does not take up server resources. The disadvantage is that services with real-time requirements, such as advertising, cannot be served. Most data generated in the big data era is time-sensitive and loses its value after a while, so real-time computation is urgently needed to meet business and service requirements.
A second problem is that, while general-purpose search as represented by conventional search engines has matured, more and more vertical search engines now serve millions of enterprise users. Users inside an enterprise or a vertical domain rely on these engines to simplify data acquisition and improve decision making, yet their behavior on the search engine remains unknown, so search-based data analysis cannot be carried out effectively. By incorporating real-time big data analysis, the invention statistically analyzes user behavior in a vertical search engine in real time along different time dimensions and presents the results visually, which addresses the timeliness and black-box problems of search data.
Disclosure of Invention
The invention provides a solution for analyzing vertical search engine data in real time. It addresses the problems that search data is difficult to obtain, that real-time analysis cannot be guaranteed, and that search data cannot be used effectively.
Specifically, the invention comprises a search data acquisition module, a search data transmission module, a search data analysis module, an analysis result storage module and an analysis result display module.
Search data acquisition module: acquires the data generated by searches from different sources and sends it onward in a uniform format. The sources are the search interface and the search API business system.
Search data generated by the search API service is counted by a back-end service interface. Each search request is classified by HTTP method and written to a log, the log directory is monitored in real time with Flume, an interceptor rewrites the records into a uniform format, and the data is then sent to the data stream pipeline.
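The interceptor step described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the collector's actual plug-in code: the raw log layout (`timestamp method path user_id`), the output field names and the `intercept` function are all assumptions made for the example.

```python
import json

def intercept(raw_line: str, source: str = "search_api") -> str:
    """Rewrite one raw log line into a uniform JSON record.

    Assumed (hypothetical) input layout: "<timestamp> <HTTP method> <path> <user id>".
    """
    timestamp, method, path, user_id = raw_line.strip().split(" ", 3)
    record = {
        "source": source,          # which producer emitted the record
        "ts": timestamp,           # kept as received
        "method": method.upper(),  # used to classify requests by HTTP method
        "path": path,
        "user_id": user_id,
    }
    # sort_keys gives every record the same key order downstream
    return json.dumps(record, sort_keys=True)

if __name__ == "__main__":
    print(intercept("2020-03-30T08:15:00 get /search?q=kafka u42"))
```

Emitting one JSON object per line keeps the records easy to parse for whatever consumes the pipeline next.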
Data generated by the search interface is captured by monitoring user behavior with a front-end event-tracking (buried point) technique. The tracking is implemented as glapi: the client includes the glapi.js script in the page and assigns an id attribute to each page tag to be tracked so that the tag can be identified, and the collected user behavior data is sent back to the server in real time. The collected data includes the current link address, the browser type, the current geographic position and the like; the user-related data includes the words searched in the search engine, the articles the user browsed, and the user's likes and favorites on articles. After collection the data flows to the data stream pipeline.
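On the receiving side, a tracking payload has to be checked before it enters the pipeline. The sketch below shows one way this might look; the JSON payload shape, the field names in `REQUIRED_FIELDS` and the function `parse_tracking_event` are hypothetical and are not taken from glapi.js.

```python
import json

# Hypothetical field names for a tracking event sent by the page script.
REQUIRED_FIELDS = ("url", "browser", "element_id", "action")

def parse_tracking_event(payload: str) -> dict:
    """Parse one tracking payload and reject it if a required field is missing."""
    event = json.loads(payload)
    missing = [field for field in REQUIRED_FIELDS if field not in event]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    # Geographic position is optional because the browser may not share it.
    event.setdefault("geo", None)
    return event

if __name__ == "__main__":
    sample = '{"url": "/search", "browser": "Firefox", "element_id": "like-btn", "action": "like"}'
    print(parse_tracking_event(sample))
```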
Search data transmission module: transmits the data stream in real time using a Kafka message queue. An interceptor unifies the data format, the data in Kafka is divided into different topics according to whether it is search data or user data, and it is delivered to the stream-computing analysis platform in publish-subscribe mode.
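The split into topics can be illustrated without a running broker. In this sketch the topic names, the `event` field and the set of user event types are assumptions; a real deployment would hand each record to a Kafka producer rather than collect it in a dictionary.

```python
from collections import defaultdict

SEARCH_TOPIC = "search-data"  # hypothetical topic name
USER_TOPIC = "user-data"      # hypothetical topic name

# Hypothetical event types that count as user data rather than search data.
USER_EVENT_TYPES = {"browse", "like", "favorite"}

def topic_for(record: dict) -> str:
    """Choose the destination topic for one unified record."""
    return USER_TOPIC if record.get("event") in USER_EVENT_TYPES else SEARCH_TOPIC

def route(records):
    """Group records by destination topic, preserving arrival order."""
    topics = defaultdict(list)
    for record in records:
        topics[topic_for(record)].append(record)
    return dict(topics)

if __name__ == "__main__":
    print(route([{"event": "search", "q": "kafka"}, {"event": "like", "article": "a1"}]))
```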
Search data analysis module: after preliminary cleaning of the search data, different analysis schemes and analysis dimensions are customized according to business requirements. The analysis comprises search data statistical analysis, user data statistical analysis and in-depth (deepening) analysis.
Statistical analysis uses Spark and Spark Streaming. The Spark side pulls the data stream from Kafka, analysis operators are defined for the different business dimensions to form a DAG, the stream is cut into small batches and aggregated in real-time batch-processing fashion, and the analysis results are sent to the database system for storage.
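The idea of cutting a stream into small batches and applying an operator to each batch can be shown with a plain-Python stand-in. It deliberately does not use the Spark API; the batch length, the `ts` and `keyword` field names and all function names are assumptions chosen for the example.

```python
from collections import Counter
from itertools import groupby

BATCH_SECONDS = 5  # hypothetical batch interval

def micro_batches(events, batch_seconds=BATCH_SECONDS):
    """Yield (batch_start, events) pairs; events must be ordered by 'ts' (epoch seconds)."""
    def batch_start(event):
        return event["ts"] - event["ts"] % batch_seconds
    for start, group in groupby(events, key=batch_start):
        yield start, list(group)

def keyword_counts(batch):
    """One analysis operator: searches per keyword within a single batch."""
    return Counter(event["keyword"] for event in batch)

def run(events):
    """Apply the operator batch by batch and keep running totals."""
    totals = Counter()
    per_batch = []
    for start, batch in micro_batches(events):
        counts = keyword_counts(batch)
        totals.update(counts)
        per_batch.append((start, dict(counts)))
    return per_batch, dict(totals)

if __name__ == "__main__":
    stream = [{"ts": 0, "keyword": "kafka"}, {"ts": 3, "keyword": "spark"}, {"ts": 6, "keyword": "kafka"}]
    print(run(stream))
```

Each per-batch result is what would be written to the database, while the running totals correspond to dimensions such as the total number of searches.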
Search data statistical analysis covers dimensions such as total number of searches, total number of users, total data volume, geographic distribution of users, hourly search volume, 24-hour search volume, articles being searched in real time and keywords being searched in real time. Search, user and data indexes computed with a heat algorithm are provided for judging search quality.
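Two of these dimensions can be made concrete with a short example. The source does not define its heat algorithm, so the decay-weighted count below is an assumed stand-in, and the half-life, the `ts_hour` and `keyword` field names and both function names are hypothetical.

```python
from collections import Counter

HALF_LIFE_HOURS = 6.0  # hypothetical: a search loses half its weight every 6 hours

def hourly_volume(events):
    """Searches per hour of day (0-23); 'ts_hour' counts whole hours from an arbitrary start."""
    return Counter(event["ts_hour"] % 24 for event in events)

def heat(events, now_hour, half_life=HALF_LIFE_HOURS):
    """Assumed heat score: each search contributes 0.5 ** (age / half_life)."""
    scores = {}
    for event in events:
        age = now_hour - event["ts_hour"]
        scores[event["keyword"]] = scores.get(event["keyword"], 0.0) + 0.5 ** (age / half_life)
    return scores

if __name__ == "__main__":
    log = [{"ts_hour": 0, "keyword": "a"}, {"ts_hour": 6, "keyword": "a"}, {"ts_hour": 6, "keyword": "b"}]
    print(hourly_volume(log), heat(log, now_hour=6))
```

With this weighting, recent searches dominate the score, which is one common way to rank trending keywords.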
User data statistical analysis covers users' search behavior in the search engine, such as the most frequently used search words, the busiest search time periods, the total number of articles clicked and the total number of favorites; statistics for a single user are also provided. In-depth analysis covers the results hit by user searches, a funnel chart of search results, and search recommendations derived from the user's search results.
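A search result funnel counts how many users survive each successive stage. The sketch below is one possible way to compute it; the stage names and the `user` and `stage` fields are assumptions, since the source does not list the funnel stages.

```python
# Hypothetical funnel stages, in order.
FUNNEL_STAGES = ["search", "hit", "click", "favorite"]

def funnel(events):
    """Return (stage, users) pairs, counting users who reached the stage and every earlier one."""
    reached = {stage: set() for stage in FUNNEL_STAGES}
    for event in events:
        if event["stage"] in reached:
            reached[event["stage"]].add(event["user"])
    result = []
    survivors = None
    for stage in FUNNEL_STAGES:
        # A user stays in the funnel only if they also appeared in all previous stages.
        survivors = reached[stage] if survivors is None else survivors & reached[stage]
        result.append((stage, len(survivors)))
    return result

if __name__ == "__main__":
    log = [
        {"user": "u1", "stage": "search"}, {"user": "u1", "stage": "hit"}, {"user": "u1", "stage": "click"},
        {"user": "u2", "stage": "search"}, {"user": "u2", "stage": "hit"},
        {"user": "u3", "stage": "search"},
    ]
    print(funnel(log))
```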
Analysis result storage module: stores the analyzed and aggregated data in a MySQL database to provide data support for the subsequent visual display.
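The storage step amounts to upserting per-batch results into a table that the display layer can query. To keep the example self-contained it uses Python's built-in `sqlite3` as a stand-in for MySQL, so the `INSERT OR REPLACE` statement is SQLite syntax; the table and column names are hypothetical.

```python
import sqlite3

def store_results(conn, batch_start, counts):
    """Upsert per-keyword counts for one batch (SQLite stand-in for the MySQL store)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS keyword_stats ("
        "batch_start INTEGER, keyword TEXT, searches INTEGER, "
        "PRIMARY KEY (batch_start, keyword))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO keyword_stats VALUES (?, ?, ?)",
        [(batch_start, keyword, n) for keyword, n in counts.items()],
    )
    conn.commit()

def top_keywords(conn, limit=10):
    """Query used by the display layer: most searched keywords over all batches."""
    return conn.execute(
        "SELECT keyword, SUM(searches) AS total FROM keyword_stats "
        "GROUP BY keyword ORDER BY total DESC, keyword LIMIT ?",
        (limit,),
    ).fetchall()

if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    store_results(connection, 0, {"kafka": 2, "spark": 1})
    store_results(connection, 5, {"kafka": 1})
    print(top_keywords(connection))
```

Keying rows on the batch start and keyword means a replayed batch overwrites its earlier rows instead of double counting.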
Analysis result display module: provides a visual display interface for the search data analyzed with Spark. To make sure the search data reaches users effectively, the analyzed data is also made available to the search engine.
For visualization, data is pulled from the database and rendered with ECharts. User login and registration are provided, the data set to display can be selected, and the display is divided into an overall view and a single-item view.
The analyzed data is written into the search engine through its data transmission interface, so a user can search the analysis results from the search bar and click a result to open the display interface.
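On the server side, preparing a chart mostly means turning query rows into the option object that the chart library consumes. The sketch below builds a basic bar-chart option as a Python dictionary and serializes it to JSON; the function name and the sample rows are assumptions, and only commonly used option keys (`title`, `tooltip`, `xAxis`, `yAxis`, `series`) appear.

```python
import json

def bar_chart_option(title, rows):
    """Build a bar-chart option dict from (label, value) rows pulled from the database."""
    labels = [label for label, _ in rows]
    values = [value for _, value in rows]
    return {
        "title": {"text": title},
        "tooltip": {},
        "xAxis": {"type": "category", "data": labels},
        "yAxis": {"type": "value"},
        "series": [{"name": title, "type": "bar", "data": values}],
    }

if __name__ == "__main__":
    option = bar_chart_option("Top search keywords", [("kafka", 3), ("spark", 1)])
    # The JSON string would be sent to the page and passed to the chart's setOption call.
    print(json.dumps(option, indent=2))
```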
Drawings
In order to illustrate the technical solution of the invention and its implementation more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described here show only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a diagram of a real-time search data analysis architecture provided by an embodiment of the present invention.
Fig. 2 is a flowchart of user operation provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
To implement the system, the whole system is first built according to the architecture shown in Fig. 1, in the following steps:
Step 1: configure data collection. Specify the log file directory in the collection configuration, define the data output destination and data pipe, and start the data collection service. At the same time, include the event-tracking (buried point) collection script in the page.
Step 2: build the message queue cluster. Specify the names and number of topics in the configuration file, specify the number of partitions for each topic, define the listening address, and then start the message queue service.
Step 3: set up the data analysis service. Specify whether the data also needs to be stored offline, specify the address of the database that stores the analysis results, and then start the data analysis service.
Step 4: start the visualization server and test whether the page displays normally.
Fig. 2 shows how the system is used once it has been built:
Step 1: the user opens the back-end address, enters a user name and password, and logs in to the system.
Step 2: after logging in, the user specifies the data collection configuration and the analysis dimension configuration, including the data collection format and whether a given data dimension should be displayed, or selects the default configuration.
Step 3: the user selects a search data set, and the system loads it.
Step 4: after selecting the data set, the user can view the analysis results directly, or use the search bar to search the analysis indexes and click a result to open the visual interface.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that the above examples are only used to illustrate the technical solutions of the present invention, but not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (7)
1. A real-time search data analysis system, characterized by: the system comprises a search data acquisition module, a search data transmission module, a search data analysis module, an analysis result storage module and an analysis result display module;
a search data acquisition module: the system is used for acquiring data generated by searching from different sources and sending the data in a uniform format; the source of the search data comprises data generated by a search interface and data generated by a search API service system;
the search data transmission module: the data stream is transmitted in real time by utilizing a Kafka message queue technology, the data in the Kafka is classified into different topics by utilizing an interceptor to unify the data format according to the search data and the user data, and the data is transmitted to a stream type calculation analysis platform by utilizing a publishing subscription mode;
a search data analysis module: after the search data is preliminarily cleaned, customizing different analysis schemes and analysis dimensions according to business requirements, wherein the analysis schemes and the analysis dimensions comprise search data statistical analysis, user data statistical analysis and deepening analysis;
the analysis result storage module stores the data after analysis and statistics into a MySQL database to provide data support for subsequent visual display;
an analysis result display module: providing a visual display interface for the search data analyzed by the Spark technology, and providing a search engine support for the analyzed data in order to ensure the effective transmission of the search data;
data visualization is pulled from a database by using an Echarts technology, user login and registration are provided, different data set display functions are selected, and a display part is divided into overall display and single display;
the analyzed data is stored in the search through a search engine data transmission interface, the user searches the analyzed result through a search bar, and the user can see a display interface after clicking.
2. The real-time search data analysis system of claim 1, wherein: in the search data acquisition module, search data generated by the search API service is counted through a back-end service interface, each search request is classified by HTTP method and stored in logs, the log directory is monitored in real time through Flume, an interceptor is added to modify the data format to ensure uniformity, and the data is then sent to a data stream pipeline.
3. The real-time search data analysis system of claim 1, wherein: monitoring user behaviors through data generated by a search interface by a front-end point burying technology; the embedded point technology is realized as a glapi technology, a client side introduces a glapi.js script into a page, and assigns an id attribute to a page tag to be acquired so as to identify the tag to be acquired, and acquired user behavior data can be sent back to a server in real time; the collected data comprises a current link address, a browser type and a current geographic position, the data related to the user comprises words searched by a search engine, articles browsed by the user, praise of the user on a certain article and collection behaviors, and the data flows to a data flow pipeline after being collected.
4. The real-time search data analysis system of claim 1, wherein: statistical analysis utilizes Spark technology and Spark Streaming technology, the Spark end pulls the data stream from Kafka, analysis operators are defined according to different service dimensions to form a DAG, data is cut and integrated in a real-time batch processing mode, and the analysis result is obtained and then sent to the database system for storage.
5. The real-time search data analysis system of claim 1, wherein: the search data statistical analysis comprises total search number, total user number, total data volume, user geographic distribution, hourly search volume, 24-hour search volume, real-time search articles and real-time search keyword dimension, and search indexes, user indexes and data indexes are provided according to a heat algorithm to judge search quality.
6. The real-time search data analysis system of claim 1, wherein: the user data statistical analysis comprises the search behaviors of the user under a search engine, such as the most common search words, the most search time periods, the total number of articles approved and the total number of collections, and meanwhile, the statistical analysis of a single user is also provided; the deepening analysis comprises the steps that a user searches to hit results, a search result funnel graph is obtained, and user search recommendation is provided according to the user search results.
7. A method for implementing real-time search data analysis by using the system of claim 1, characterized by comprising the following steps:
Step 1: a user opens a background address and inputs a user name and a password and then logs in the system;
Step 2: after logging in, data collection configuration and analysis dimension configuration need to be specified, wherein the data collection format is specified, whether a certain data dimension needs to be displayed or not is judged, or default configuration is selected;
Step 3: selecting a search data set, and loading the data set by a system;
Step 4: after the data set is selected, the analysis result can be directly seen, or the search column is selected to search the analysis index, and after the result is searched, the corresponding result is clicked, so that the visual interface can be seen.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010233677.5A CN111460333B (en) | 2020-03-30 | 2020-03-30 | Real-time search data analysis system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010233677.5A CN111460333B (en) | 2020-03-30 | 2020-03-30 | Real-time search data analysis system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111460333A (en) | 2020-07-28 |
CN111460333B (en) | 2024-02-23 |
Family
ID=71681525
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010233677.5A Active CN111460333B (en) | 2020-03-30 | 2020-03-30 | Real-time search data analysis system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111460333B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112486708A (en) * | 2020-12-16 | 2021-03-12 | 中国联合网络通信集团有限公司 | Processing method and processing system of page operation data |
CN114372090A (en) * | 2021-12-31 | 2022-04-19 | 北京工业大学 | User reading behavior analysis and prediction system under big data environment |
CN115001793A (en) * | 2022-05-27 | 2022-09-02 | 北京双湃智安科技有限公司 | Data fusion method for information security multi-source heterogeneous data |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105224445A (en) * | 2015-10-28 | 2016-01-06 | 北京汇商融通信息技术有限公司 | Distributed tracking system |
CN105786683A (en) * | 2016-03-03 | 2016-07-20 | 四川长虹电器股份有限公司 | Customized log collecting system and method |
CN106547882A (en) * | 2016-11-03 | 2017-03-29 | 国网重庆市电力公司电力科学研究院 | A kind of real-time processing method and system of big data of marketing in intelligent grid |
CN110297746A (en) * | 2019-07-05 | 2019-10-01 | 北京慧眼智行科技有限公司 | A kind of data processing method and system |
US20190334789A1 (en) * | 2018-04-26 | 2019-10-31 | EMC IP Holding Company LLC | Generating Specifications for Microservices Implementations of an Application |
Non-Patent Citations (1)
Title |
---|
GAUTAM PAL, ET AL.: "Big Data Real-Time Clickstream Data Ingestion Paradigm for E-Commerce Analytics" * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112486708A (en) * | 2020-12-16 | 2021-03-12 | 中国联合网络通信集团有限公司 | Processing method and processing system of page operation data |
CN112486708B (en) * | 2020-12-16 | 2023-11-07 | 中国联合网络通信集团有限公司 | Page operation data processing method and processing system |
CN114372090A (en) * | 2021-12-31 | 2022-04-19 | 北京工业大学 | User reading behavior analysis and prediction system under big data environment |
CN114372090B (en) * | 2021-12-31 | 2024-05-24 | 北京工业大学 | User reading behavior analysis and prediction system in big data environment |
CN115001793A (en) * | 2022-05-27 | 2022-09-02 | 北京双湃智安科技有限公司 | Data fusion method for information security multi-source heterogeneous data |
Also Published As
Publication number | Publication date |
---|---|
CN111460333B (en) | 2024-02-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7165069B1 (en) | Analysis of search activities of users to identify related network sites | |
CN111460333A (en) | Real-time search data analysis system | |
WO2021098648A1 (en) | Text recommendation method, apparatus and device, and medium | |
CN107451149B (en) | Monitoring method and device for flow data query task | |
CN109101658B (en) | Information searching method and device, and equipment/terminal/server | |
CN106339394B (en) | Information processing method and device | |
CN111475728A (en) | Cloud resource information searching method, device, equipment and storage medium | |
US20140201203A1 (en) | System, method and device for providing an automated electronic researcher | |
US11423096B2 (en) | Method and apparatus for outputting information | |
KR20100112512A (en) | Apparatus for searching contents and method for searching contents | |
JP5705114B2 (en) | Information processing apparatus, information processing method, program, and web system | |
CN112749266B (en) | Industrial question and answer method, device, system, equipment and storage medium | |
CN112181931A (en) | Big data system link tracking method and electronic equipment | |
US9323833B2 (en) | Relevant online search for long queries | |
CN114265981A (en) | Recommendation word determining method, device, equipment and storage medium | |
CN107330076B (en) | Network public opinion information display system and method | |
CN111159559A (en) | Method for constructing recommendation engine according to user requirements and user behaviors | |
Knap | Towards Odalic, a Semantic Table Interpretation Tool in the ADEQUATe Project. | |
KR100557874B1 (en) | Method of scientific information analysis and media that can record computer program thereof | |
CN116226494B (en) | Crawler system and method for information search | |
Wu et al. | RIVA: A Real-Time Information Visualization and analysis platform for social media sentiment trend | |
CN114417179A (en) | Meta-search engine processing method and device for large-scale knowledge base group | |
CN112883143A (en) | Elasticissearch-based digital exhibition searching method and system | |
Xu et al. | The application of web crawler in city image research | |
KR20210045172A (en) | Big Data Management and System for Livestock Disease Outbreak Analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||