CN104182389B

CN104182389B - Semantics-based big data analysis business intelligence service system

Info

Publication number: CN104182389B
Application number: CN201410348407.3A
Authority: CN
Inventors: 贾岩
Original assignee: ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Current assignee: ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority date: 2014-07-21
Filing date: 2014-07-21
Publication date: 2018-01-19
Anticipated expiration: 2034-07-21
Also published as: CN104182389A

Abstract

The present invention proposes a semantics-based big data analysis business intelligence service system that accurately analyzes the business information abundant on the Internet and conveniently provides business intelligence services to small and medium-sized enterprises. It comprises a data acquisition and storage subsystem, a real-time data stream processing subsystem, a storage subsystem, a bottom layer support subsystem and a service output subsystem. The data acquisition and storage subsystem includes a distributed crawler module and a data source adapter that are independent of each other and are each connected to the real-time data stream processing subsystem; the distributed crawler module is responsible for data source detection, Internet data collection and HTML preprocessing, and the data source adapter is used to connect third-party data resources. The real-time data stream processing subsystem is connected to the storage subsystem and includes a temporary storage module and a data stream hook that are connected to each other; the temporary storage module temporarily stores the data collected in real time.

Description

Semantics-based big data analysis business intelligence service system

Technical Field

The invention relates to the technical field of business intelligence, and in particular to a semantics-based big data analysis business intelligence service system.

Background

In the current period of China's social development, small and medium-sized enterprises have emerged as a prominent new force, making the Chinese market increasingly dynamic. These enterprises are eager to grow and need information services, but they lack the capital and resources that large, well-funded group companies can devote to building their own information departments. Information is one of an enterprise's most important resources; developing it is both the starting point and the ultimate goal of enterprise informatization.

As informatization deepens, enterprises increasingly want big data analysis services. The ever-growing information resources of the Internet contain a huge amount of commercially valuable information and have become an important source for business intelligence services. However, the industry has not fully developed or exploited this value, because the data volume is huge, collection is difficult, the value per unit of data is relatively low, and almost all of the data is unstructured, such as text.

For an enterprise, "efficiency is life and time is money". Only by actively offering information services, using modern technical equipment to share resources, and collecting and processing information in an organized, planned and purposeful way can the Internet provide small and medium-sized enterprises with more convenient, rapid and comprehensive reference and consulting services, thereby speeding up management decision-making and winning opportunities for enterprises in the market economy.

Disclosure of Invention

To address the problems described in the background, the invention provides a semantics-based big data analysis business intelligence service system, which accurately analyzes the business information abundant on the Internet and can conveniently and quickly provide business intelligence services to small and medium-sized enterprises.

The invention provides a semantics-based big data analysis business intelligence service system, which comprises a data acquisition and storage subsystem, a real-time data stream processing subsystem, a storage subsystem, a bottom layer support subsystem and a service output subsystem; wherein,

the data acquisition and storage subsystem comprises a distributed crawler module and a data source adapter which are independent of each other and are each connected to the real-time data stream processing subsystem; the distributed crawler module is responsible for data source detection, Internet data collection and HTML preprocessing, and the data source adapter is used to connect third-party data resources;

the real-time data stream processing subsystem is connected to the storage subsystem and comprises a temporary storage module and a data stream hook which are connected to each other; the temporary storage module uses the memory of the cluster as a cache and temporarily stores the data collected in real time so that modules with real-time requirements can read it; the data stream hook provides a hook on which modules can mount, and when data arrives, the hook publishes a basic description of the data for the mounted modules to read; a cache threshold is set in the real-time data stream processing subsystem, and data is cleared when the cache threshold is exceeded;

the storage subsystem is connected to the service output subsystem and comprises a Hadoop cluster and a mysql cluster which are connected to each other; the Hadoop cluster stores large volumes of web page data and analysis results that do not require random reads and writes; the mysql cluster stores data that is small in volume and requires random reads and writes;

the bottom layer support subsystem comprises a semantic information extraction module and a semantic search engine which are connected to each other; the semantic information extraction module is responsible for extracting semantic information from text and for supporting other modules that need semantic extraction and semantic analysis, and it is connected to both the real-time data stream processing subsystem and the service output subsystem; the semantic search engine integrates all tools and API modules related to semantic search and text processing, and is connected to both the Hadoop cluster and the service output subsystem;

the service output subsystem executes, schedules and presents specific services and comprises a precision marketing module, a data service module, a report generation module, a business information analysis module and a public opinion analysis module which are connected in parallel; the precision marketing module provides technical support for precision marketing in the form of data collection, analysis and marketing tools; the data service module performs data collection and semantic analysis to meet the specific data requirements of customers; the report generation module generates short, condensed information briefs combining text and images for clients, and supports periodic automatic generation as well as report summarization and writing; the business information analysis module is used for business opportunity analysis, competitor analysis, industry trends and data analysis; the public opinion analysis module is used for topic tracking analysis, tracking analysis of events and people, and the collection and integrated analysis of online public opinion data.

In the distributed crawler module, credibility weights are set for different information sources.

The distributed crawler module adopts a fixed-point stakeout and/or heuristic and/or broad collection strategy.

The cache threshold of the real-time data stream processing subsystem is 0.1 to 100 minutes.

The Hadoop cluster provides persistent storage.

Operation data, data mining results and semantic analysis results are stored in the mysql cluster.

The semantic information extraction module adopts a natural-language-like semantic information extraction technique, which describes and tags the semantic information in natural language text in a form very close to natural language.

The semantic information extraction module uses a semantic clustering technique to record the volume of information on each topic and to alert the user to important events.
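As a rough illustration of recording per-topic information volume, the sketch below groups texts with a greedy token-overlap (Jaccard) clustering and flags clusters that reach a minimum size. The Jaccard measure, the `threshold` and `min_size` values, and all function names are assumptions chosen for illustration; the patent does not specify which semantic clustering technique is used.

```python
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lower-cased whitespace tokens; a stand-in for real semantic features."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_topics(texts: List[str], threshold: float = 0.4) -> List[Dict]:
    """Greedy single pass: join the first cluster whose seed is similar enough."""
    clusters: List[Dict] = []
    for text in texts:
        tok = tokens(text)
        for c in clusters:
            if jaccard(tok, c["seed"]) >= threshold:
                c["members"].append(text)
                break
        else:
            # No similar cluster found, so this text starts a new topic.
            clusters.append({"seed": tok, "members": [text]})
    return clusters


def important_topics(clusters: List[Dict], min_size: int = 2) -> List[Dict]:
    """Topics whose information volume (member count) reaches min_size."""
    return [c for c in clusters if len(c["members"]) >= min_size]
```

Here the member count of a cluster plays the role of the topic's information volume, and `important_topics` stands in for the reminder to the user.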

The invention effectively solves the problem of web-based big data analysis. It offers high precision, rich semantic information, high practicability and suitability for industrial deployment, and its output can serve as input to technologies such as data mining, fully unlocking the value of text information. By analyzing the commercial behavior of Internet users, it enables precision marketing of enterprise products. It also helps enterprises gain insight into trends in their own industry and in upstream and downstream industries, seize business opportunities, avoid risks and make sound decisions quickly, among other business intelligence services. The invention has broad prospects for industrial application.

Drawings

Fig. 1 is a structural diagram of a semantic-based big data analysis business intelligence service system according to the present invention.

Detailed Description

Referring to fig. 1, the semantics-based big data analysis business intelligence service system provided by the invention comprises a data acquisition and storage subsystem, a real-time data stream processing subsystem, a storage subsystem, a bottom layer support subsystem and a service output subsystem.

The data acquisition and storage subsystem comprises a distributed crawler module and a data source adapter which are independent of each other and are each connected to the real-time data stream processing subsystem. The distributed crawler module is responsible for data source detection, Internet data collection and HTML (hypertext markup language) preprocessing. The data source adapter is used to connect third-party data resources, such as data that a client specifies for analysis; the adapter also makes it possible to intervene in the processing flow of the system.
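A minimal sketch of how a data source adapter might feed client-specified data into the same pipeline as the crawler is given below. The `Record` shape, the `DataSourceAdapter` protocol, the CSV column names and the `ingest` helper are all hypothetical; the patent describes the adapter's role but not its interface.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Protocol


@dataclass(frozen=True)
class Record:
    """Common record shape handed to stream processing (hypothetical)."""
    source: str  # name of the originating data source
    topic: str   # topic label used downstream
    text: str    # raw or preprocessed text


class DataSourceAdapter(Protocol):
    """Anything that can turn a third-party resource into Records."""

    def records(self) -> Iterator[Record]: ...


class CsvClientAdapter:
    """Adapter for client-supplied CSV text with 'topic' and 'text' columns (assumed layout)."""

    def __init__(self, name: str, csv_text: str) -> None:
        self.name = name
        self.csv_text = csv_text

    def records(self) -> Iterator[Record]:
        for row in csv.DictReader(io.StringIO(self.csv_text)):
            yield Record(source=self.name, topic=row["topic"], text=row["text"])


def ingest(adapters: Iterable[DataSourceAdapter]) -> List[Record]:
    """Collect records from every adapter, independently of any crawler."""
    return [rec for adapter in adapters for rec in adapter.records()]


client_csv = "topic,text\npricing,Competitor cut prices by 5%\n"
batch = ingest([CsvClientAdapter("client-upload", client_csv)])
```

Keeping the adapter behind a small interface mirrors the stated independence of the crawler and the adapter: either can be replaced without touching the other.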

In the distributed crawler module, credibility weights are set for different information sources, so that a user can judge the value of information and extraction time is saved. For example, in this embodiment, a toolkit of common data mining algorithms is abstracted and combined with tools and algorithm packages from open-source communities to form a relatively mature set of data mining algorithms and tools. Data is collected in real time from websites, forums, blogs and similar sources; using ranking data from a Chinese website-ranking site, a credibility weight is set for the information from each website, and different source types such as news, blogs and forums also carry their own weights. The distributed crawler module collects data by topic. In this embodiment, the main data blocks of a web page are identified by analyzing the structure of similar pages, and an executable template is generated automatically to extract the page content. In addition, network data is collected using several strategies, including fixed-point stakeout, heuristic collection and broad collection. The resulting data collection is wide in scope, well targeted and efficient, with few omissions.
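The weighting idea can be sketched as follows, assuming that a per-site weight and a per-source-type weight are simply multiplied. The weight values, the example host names and the multiplication itself are illustrative assumptions; the patent only states that both kinds of weight exist.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Hypothetical weights; the patent gives no concrete values.
SITE_WEIGHT: Dict[str, float] = {"news.example.com": 0.9, "forum.example.org": 0.5}
TYPE_WEIGHT: Dict[str, float] = {"news": 1.0, "blog": 0.7, "forum": 0.5}
DEFAULT_WEIGHT = 0.3  # used for unknown sites or source types


def credibility(url: str, source_type: str) -> float:
    """Combine the site weight and the source-type weight (assumed: product)."""
    host = urlparse(url).hostname or ""
    site = SITE_WEIGHT.get(host, DEFAULT_WEIGHT)
    kind = TYPE_WEIGHT.get(source_type, DEFAULT_WEIGHT)
    return site * kind


def rank_by_credibility(items: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Order collected items so that the most credible sources come first."""
    return sorted(items, key=lambda it: credibility(it["url"], it["type"]), reverse=True)


items = [
    {"url": "http://forum.example.org/t/1", "type": "forum"},
    {"url": "http://news.example.com/a/2", "type": "news"},
    {"url": "http://unknown.example.net/x", "type": "blog"},
]
ranked = rank_by_credibility(items)
```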

The real-time data stream processing subsystem is connected to the storage subsystem and comprises a temporary storage module and a data stream hook which are connected to each other. The temporary storage module uses the memory of the cluster as a cache, temporarily stores the data collected in real time by the data acquisition and storage subsystem, and makes it available to modules with real-time requirements. The data stream hook provides a hook on which modules can mount; its basic mechanism is a subscribe-consume model, in which the hook publishes a basic description of each piece of arriving data for the mounted modules to read. Through the hook mechanism, the real-time data stream processing subsystem lets various analysis requirements plug in between the data acquisition and storage subsystem and the storage subsystem. This keeps processing real-time, allows data to be stored in a distributed manner, and avoids processing congestion through a scalable architecture. A cache threshold is set in the real-time data stream processing subsystem, and data is cleared when the threshold is exceeded. In this embodiment the cache threshold is 5 minutes; in a specific implementation it may be set as needed, for example to any value from 0.1 to 100 minutes.
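The hook and cache mechanism can be sketched as a small in-memory buffer with topic subscriptions. This is one possible reading, in which the threshold is treated as a maximum age and entries older than it are dropped; the class name, method names and the fields of the data description are assumptions, and a real deployment would hold the cache in cluster memory rather than in a single process.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Optional, Tuple


class StreamBuffer:
    """In-memory temporary store plus per-topic hooks (subscribe-consume)."""

    def __init__(self, max_age_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_age = max_age_seconds  # 300 s matches the 5-minute embodiment
        self.clock = clock
        # Each entry is (arrival time, topic, payload), oldest first.
        self.items: Deque[Tuple[float, str, str]] = deque()
        self.hooks: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def mount(self, topic: str, callback: Callable[[dict], None]) -> None:
        """Mount a module on the hook for one topic."""
        self.hooks[topic].append(callback)

    def publish(self, topic: str, payload: str) -> None:
        """Store arriving data and hand its basic description to mounted modules."""
        now = self.clock()
        self.evict(now)
        self.items.append((now, topic, payload))
        description = {"topic": topic, "size": len(payload), "arrived": now}
        for callback in self.hooks[topic]:
            callback(description)

    def evict(self, now: Optional[float] = None) -> None:
        """Drop entries that have been cached longer than the threshold."""
        now = self.clock() if now is None else now
        while self.items and now - self.items[0][0] > self.max_age:
            self.items.popleft()

    def read(self, topic: str) -> List[str]:
        """Return the payloads still cached for a topic."""
        self.evict()
        return [payload for _, t, payload in self.items if t == topic]
```

A module with real-time requirements would call `mount` once and then use `read` when a description arrives; persistence to the storage subsystem is outside this sketch.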

The storage subsystem is connected to the service output subsystem and comprises a Hadoop cluster (a distributed system infrastructure) and a mysql cluster (a relational database) which are connected to each other. The Hadoop cluster stores large volumes of web page data and analysis results that do not require random reads and writes; storage in the Hadoop cluster is persistent and high-capacity, which lays the foundation for the data stream hook technique of the real-time data stream processing subsystem. The mysql cluster stores data that is small in volume and requires random reads and writes, such as operation data, data mining results and semantic analysis results. Using the Hadoop cluster and the mysql cluster together improves the efficiency of data access.
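The division of labour between the two clusters can be expressed as a simple routing rule. In the sketch below, the `Dataset` fields, the 1 GiB size limit and the choice to send large random-access data to Hadoop are assumptions; the patent only distinguishes large data without random read-write needs from small data that needs them.

```python
from dataclasses import dataclass
from enum import Enum


class Store(Enum):
    HADOOP = "hadoop"
    MYSQL = "mysql"


@dataclass(frozen=True)
class Dataset:
    name: str
    size_bytes: int
    needs_random_access: bool


SMALL_LIMIT_BYTES = 1024 ** 3  # hypothetical cut-off for "small" data (1 GiB)


def choose_store(dataset: Dataset) -> Store:
    """Small data needing random reads and writes goes to mysql, the rest to Hadoop."""
    if dataset.needs_random_access and dataset.size_bytes <= SMALL_LIMIT_BYTES:
        return Store.MYSQL
    # Assumption: anything large is kept in Hadoop even if random access is wanted.
    return Store.HADOOP


raw_pages = Dataset("raw_web_pages", 50 * 1024 ** 3, needs_random_access=False)
semantic_results = Dataset("semantic_analysis_results", 10 * 1024 ** 2, needs_random_access=True)
```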

The bottom layer support subsystem comprises a semantic information extraction module and a semantic search engine which are connected to each other. The semantic information extraction module is responsible for extracting semantic information from text and for supporting other modules that need semantic extraction and semantic analysis; it is connected to both the real-time data stream processing subsystem and the service output subsystem in order to transmit semantic analysis results. The semantic search engine integrates all tools related to semantic search and text processing together with an Application Programming Interface (API) module. It is connected to both the Hadoop cluster and the service output subsystem, so it can search data in the Hadoop cluster and pass the results to the service output subsystem.

In this embodiment, the semantic information extraction module adopts a semantic analysis technique that takes paragraphs as the unit of analysis and targets the attributes of people, events and things, extracting all common facets and attributes related to them. A semantic clustering technique is also used to record the volume of information on each topic and to alert the user to important events. In this embodiment, semantic information in natural language text is described and tagged in a form very close to natural language. No attempt is made to construct strict rules. Instead, starting from individual concrete sentences that express similar meanings or contain similar semantic information, the semantic elements of interest are tagged manually; the untagged parts of each sentence are analyzed with a built-in semantic dictionary to generate induced rules; the rules are then organized so that they conform to natural language expression habits (also called "intuitive compliance"); and a new iteration is carried out on the sentences not yet covered by the rules. The result is a set of rules that humans can understand and that can be used for semantic matching and text information extraction. This text semantic processing method effectively solves the problem of web-based big data analysis, offers high precision, rich semantic information, high practicability and suitability for industrial deployment, and can fully unlock the value of text information when its output is used as input to technologies such as data mining. In addition, this form of semantic representation performs very well at deduplication: the same piece of information is not stored multiple times, which saves storage space.
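To make the notion of rules written in a natural-language-like form concrete, the sketch below compiles rules with `<slot>` placeholders into regular expressions and uses them to pull semantic elements out of sentences. The placeholder syntax, the regex compilation and the sample rules are assumptions made for illustration; the patent does not disclose its rule format, and the dictionary-based rule induction step is not modelled here.

```python
import re
from typing import Dict, List, Optional


def compile_rule(rule: str) -> re.Pattern:
    """Compile a rule such as '<buyer> acquired <target>' into a regex.

    Text outside '<...>' is matched literally; each '<name>' becomes a named group.
    """
    # re.split with a capture group alternates literal text and slot names.
    parts = re.split(r"<(\w+)>", rule)
    pattern = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+?)"
        for i, part in enumerate(parts)
    )
    return re.compile(pattern)


def extract(sentence: str, rules: List[re.Pattern]) -> Optional[Dict[str, str]]:
    """Return the slots of the first rule that matches the whole sentence."""
    for rule in rules:
        match = rule.fullmatch(sentence)
        if match:
            return match.groupdict()
    return None


# Hypothetical rules, readable by a non-programmer.
RULES = [
    compile_rule("<buyer> acquired <target> for <amount>"),
    compile_rule("<company> launched <product>"),
]
```

Sentences that no rule covers return `None`; in the iterative process described above, those are the sentences that would be examined in the next round of rule writing.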

The service output subsystem executes, schedules and presents specific services and comprises a precision marketing module, a data service module, a report generation module, a business information analysis module and a public opinion analysis module which are connected in parallel. The precision marketing module provides technical support for precision marketing in the form of data collection, analysis and marketing tools; the data service module performs data collection and semantic analysis to meet the specific data requirements of customers; the report generation module generates short, condensed information briefs combining text and images for clients, and supports periodic automatic generation as well as report summarization and writing; the business information analysis module is used for business opportunity analysis, competitor analysis, industry trends and data analysis; the public opinion analysis module is used for topic tracking analysis, tracking analysis of events and people, and the collection and integrated analysis of online public opinion data. By analyzing the commercial behavior of Internet users, the service output subsystem enables precision marketing of enterprise products. It also helps enterprises gain insight into trends in their own industry and in upstream and downstream industries, seize business opportunities, avoid risks and make sound decisions quickly, among other business intelligence services. It has broad prospects for industrial application.

By monitoring the Internet and semantically analyzing text information, the system analyzes the commercial behavior of Internet users and recommends to them products that match the identified business opportunities, thereby achieving precision marketing. In addition, by monitoring an enterprise's external business environment, it provides business intelligence services covering the market environment, industry trends, product and brand monitoring, and monitoring of the enterprise's upstream and downstream environment.

The above is only a preferred embodiment of the present invention, and the scope of protection of the invention is not limited to it. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the invention and in accordance with its technical solutions and inventive concept, shall fall within the scope of protection of the invention.

Claims

1. A semantics-based big data analysis business intelligence service system, comprising: a data acquisition and storage subsystem, a real-time data stream processing subsystem, a storage subsystem, a bottom layer support subsystem and a service output subsystem; wherein,

the data acquisition and storage subsystem comprises a distributed crawler module and a data source adapter which are independent of each other and are each connected to the real-time data stream processing subsystem; the distributed crawler module is responsible for data source detection, Internet data collection and HTML preprocessing, and the data source adapter is used to connect third-party data resources; in the distributed crawler module, credibility weights are set for different information sources, and the distributed crawler module adopts a fixed-point stakeout and/or heuristic and/or broad collection strategy;

2. The semantics-based big data analysis business intelligence service system of claim 1, wherein the cache threshold of the real-time data stream processing subsystem is 0.1 to 100 minutes.

3. The semantics-based big data analysis business intelligence service system of claim 1, wherein the Hadoop cluster provides persistent storage.

4. The semantics-based big data analysis business intelligence service system of claim 1, wherein operation data, data mining results and semantic analysis results are stored in the mysql cluster.

5. The semantics-based big data analysis business intelligence service system of claim 1, wherein the semantic information extraction module employs a natural-language-like semantic information extraction technique to describe and tag semantic information in natural language text in the form of natural language.

6. The semantics-based big data analysis business intelligence service system of claim 1, wherein the semantic information extraction module employs a semantic clustering technique to record the volume of information on each topic and to alert the user to important events.