CN109272155B

CN109272155B - Enterprise behavior analysis system based on big data

Info

Publication number: CN109272155B
Application number: CN201811058169.7A
Authority: CN
Inventors: 石国鹏; 张国增; 杨景伟
Original assignee: Zhengzhou Centfor Communication Technology Co ltd
Current assignee: Zhengzhou Centfor Communication Technology Co ltd
Priority date: 2018-09-11
Filing date: 2018-09-11
Publication date: 2021-07-06
Anticipated expiration: 2038-09-11
Also published as: CN109272155A

Abstract

The invention discloses an enterprise behavior analysis system based on big data, which comprises a data acquisition and processing platform, a data warehouse, a data management and control platform, a data analysis and mining platform, an operation analysis platform and a data visualization display platform, wherein the data acquisition and processing platform is used for acquiring and processing data; various source data of an enterprise are processed through mining, analysis and the like of each platform, and finally visualized display is carried out; the effect is as follows: the system associates big data generated by each business system in an enterprise by processing various source data of the enterprise, and provides data support service and operation and decision support for the enterprise by mining and analyzing the data and finally forming effective visual presentation, thereby enhancing the operation efficiency of the enterprise, improving the management level of the enterprise and promoting the core competitiveness and creativity of the enterprise.

Description

Enterprise behavior analysis system based on big data

Technical Field

The invention relates to the technical field of big data processing, in particular to an enterprise behavior analysis system based on big data.

Background

Good enterprise management not only enhances the operating efficiency of an enterprise but also gives the enterprise a clear direction of development. In the prior art, however, the data sources of the departments inside an enterprise are often independent of one another, data cannot be shared across different systems, and problems such as information fragmentation and data duplication exist. That is, existing enterprise data management cannot integrate the various data, cannot realize the associated application of big data, cannot provide data support services for the enterprise, and cannot perform predictive analysis for the enterprise, and therefore cannot enhance the operating efficiency of the enterprise.

Disclosure of Invention

The invention aims to provide an enterprise behavior analysis system based on big data, which solves the problems in the prior art that the associated application of big data cannot be realized, so that data support services cannot be provided for enterprises and predictive analysis cannot be performed for them.

Therefore, the technical scheme adopted by the invention is as follows: the enterprise behavior analysis system based on the big data comprises a data acquisition and processing platform, a data warehouse, a data management and control platform, a data analysis and mining platform, an operation analysis platform and a data visualization display platform;

the data acquisition and processing platform is used for acquiring various source data of an enterprise and processing the various source data to obtain target data;

the data warehouse is used for storing the target data;

the data management and control platform is used for providing metadata management, main data management, data quality management, data standard management and data security management services;

the data analysis and mining platform is used for providing an algorithm model library and a data analysis mining tool;

the operation analysis platform is used for processing the target data through the algorithm model library and the data analysis mining tool to obtain a processing result, and the processing result can provide operation analysis and decision support for enterprises;

the data visualization display platform is used for performing various visualization displays on the processing result.

Preferably, the data warehouse comprises a distributed columnar storage database and a distributed file system.

Preferably, the enterprise behavior analysis system further comprises an SQL engine component, a stream processing engine component, a joint query engine component, a parallelization R algorithm execution engine component, a full-text retrieval engine component, a distributed computation engine component, and a task scheduling and monitoring component.

Preferably, the various source data for the enterprise includes one or more of structured data, semi/unstructured data, and real-time data.

Preferably, the sources of the various source data of the enterprise specifically include data of existing business systems of the enterprise, real-time data collected through a distributed message queue, and internet data collected through a web crawler technology.

Preferably, the sources of the various source data of the enterprise further include data uploaded by an online filling and report file.

Preferably, the source data processing method includes: data cleaning, data deduplication and data processing;

the data cleaning refers to deleting irrelevant data from the source data, smoothing noisy data, screening out data irrelevant to the theme, and processing missing values and abnormal values, wherein the missing values are processed by a deletion method, a replacement method and an interpolation method;

the data deduplication refers to removing repeated data from the source data;

the data processing refers to the splitting and merging of data in the source data.

Preferably, the algorithm model library comprises a naive Bayes model, and the target data is classified by adopting the naive Bayes model; wherein, the expression of the naive Bayes model is as follows:

P(B|A) = P(B) × P(A|B) / P(A)

wherein P(B|A) represents the probability of B given that A holds, i.e., the posterior probability; P(A) represents the prior probability of the observed training data A; P(B) represents the prior probability of B, i.e., the initial probability that hypothesis B holds before any training data is available; P(A|B) represents the probability of data A given that B holds.

Preferably, the processing of the target data by the data analysis mining tool specifically comprises probability description, association analysis, classification, cluster analysis, prediction analysis and deviation detection analysis;

wherein the probability description comprises a characteristic description and a distinctive description, and the characteristic description is used for representing common characteristics of a certain type of target data; the distinctive description is used for representing the distinction between different types of target data;

the association analysis is used for discovering frequent patterns, association rules, correlations and causal structures among item sets from a large amount of target data; an association rule describes the degree of mutual influence between attributes and is measured by confidence and support, wherein the confidence measures the accuracy of the association rule, and the support measures the importance of the association rule;

the classification is used for finding a reasonable model for each type of target data on the premise of knowing the characteristics and classification of the training data, and then classifying new data by using the model;

the cluster analysis is used for carrying out information aggregation according to an information similarity principle under the condition of unknown classification in advance;

the predictive analysis is used to predict continuous or ordered values;

the deviation detection analysis is used to analyze significant changes and deviations between the current state of the data and historical records or standards, to find abnormal records, and to take corrective action.

Preferably, the data visualization presentation platform comprises a J2EE platform and a visualization presentation component, wherein the visualization presentation component comprises an instant query component, a report and instrument panel component, an OLAP multidimensional analysis component and a map presentation component.

By adopting the technical scheme, the method has the following advantages: the enterprise behavior analysis system based on the big data, provided by the invention, associates the big data generated by each business system in the enterprise by processing various source data of the enterprise, and provides data support service and operation and decision support for the enterprise through data mining and analysis and finally effective visual presentation, thereby enhancing the operation efficiency of the enterprise, improving the enterprise management level and improving the core competitiveness and creativity of the enterprise.

Drawings

FIG. 1 is a schematic diagram of a system configuration according to an embodiment of the present invention;

FIG. 2 is a technical framework diagram of an embodiment of the present invention;

FIG. 3 is a schematic diagram of a logic structure according to an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments, which illustrate the present invention and are not intended to limit its scope. It should be noted that some English abbreviations appearing in this document are professional terms in the industry that those skilled in the art will understand; some of them are explained below.

J2EE: Java 2 Platform, Enterprise Edition; the J2EE platform is essentially a distributed server application programming environment, i.e., a Java environment.
Spring: an open-source framework created to address the complexity of enterprise application development; one of its main advantages is its layered architecture, which allows the user to choose which components to use while providing integration for J2EE application development.
ESB: Enterprise Service Bus.
ETL: Extract-Transform-Load; describes the process of extracting data from a source, transforming it, and loading it into a destination.
SQL: Structured Query Language.
Hyperbase: a distributed columnar storage database.
Inceptor: an interactive analysis engine, essentially an SQL translator.
HDFS: Hadoop Distributed File System.
JDBC: Java Database Connectivity.
ODBC: Open Database Connectivity.
FUSE: Filesystem in Userspace.
Spark: a fast, general-purpose computing engine designed for large-scale data processing.
PL/SQL: Procedural Language/SQL, a procedural extension of SQL.
Kafka: a distributed messaging system.
JMS: Java Message Service.
Cube: a data cube, used here to denote the distributed memory.
OLAP: Online Analytical Processing.
JSON: JavaScript Object Notation, a lightweight data exchange format.
SparkR: an R language package.
CRM: Customer Relationship Management.
ERP: Enterprise Resource Planning.
SNS: Social Networking Services.
HTTP: HyperText Transfer Protocol.
CA: Certificate Authority.
SSO: Single Sign-On.
Agent: an agent.
RESTful: conforming to REST (Representational State Transfer).
SOA: Service-Oriented Architecture.

Referring to FIG. 1 to FIG. 3, an embodiment of the present invention provides an enterprise behavior analysis system based on big data, which includes a data acquisition and processing platform, a data warehouse, a data management and control platform, a data analysis and mining platform, an operation analysis platform, and a data visualization display platform.

The data acquisition and processing platform is used for acquiring various source data of an enterprise and processing the various source data to obtain target data.

The data warehouse is used for storing the target data.

The data management and control platform is used for providing metadata management, main data management, data quality management, data standard management and data security management services.

The data analysis and mining platform is used for providing an algorithm model library and a data analysis mining tool.

The operation analysis platform is used for processing the target data through the algorithm model library and the data analysis mining tool to obtain a processing result, and the processing result can provide operation analysis and decision support for enterprises.

The data visualization display platform is used for performing various visualization displays on the processing result, wherein the visualization displays comprise chart displays, mobile displays, map displays, large-screen displays and the like.

It should be noted that the data warehouse is a Hadoop-based data warehouse; Hadoop is a distributed system infrastructure developed by the Apache Foundation. By processing the various source data of an enterprise, the system associates the big data generated by each business system in the enterprise and, through data mining and analysis followed by effective visual presentation, provides data support services and operation and decision support for the enterprise, thereby enhancing the operating efficiency of the enterprise, improving its management level, and promoting its core competitiveness and creativity. Specifically, the various source data include one or more of structured data, semi-structured/unstructured data, and real-time data. The source data are obtained mainly through the following channels: (1) data of each existing business system of the enterprise, taken as first source data; (2) real-time data collected through a distributed message queue, taken as second source data, including but not limited to website click-stream data and real-time event stream data; (3) internet data collected by web crawler technology, taken as third source data; (4) data obtained through online form filling and report file uploading. Acquiring source data through these four channels effectively reduces information islands and information fragmentation among enterprise departments.

Further, in channel (1), the data acquisition and processing platform collects source data from the existing business systems of the enterprise (including CRM, ERP, other platforms, financial big data platforms, and the like) as the first source data through the data integration and ETL platform, processes the first source data, and then loads it into the data warehouse in batches. In channel (3), internet data (for example, from websites and SNS) can be collected by internet data collection software, processed, and then imported into the data warehouse.

Preferably, in this embodiment, the data warehouse further comprises a distributed columnar storage database and a distributed file system. The distributed columnar storage database (Hyperbase) is used for storing structured data, including source data collected from existing business system databases, integrated and processed multi-theme associated data sets, application-oriented data marts and the like. The system can access Hyperbase through the SQL engine component and the standard JDBC/ODBC interfaces. The distributed file system (HDFS) is used to store semi-structured/unstructured data, including Office files, XML data, email data, scanned voucher documents, video images, Web pages, etc. Data related to file attributes are mainly stored in the distributed columnar storage database Hyperbase; the index data generated for text data are mainly stored in the full-text index library. The system can access HDFS through the Java API, and can also mount HDFS through FUSE and access it as a remote disk.

Furthermore, the processing of the various source data mainly comprises data cleaning, data deduplication and data processing.

(1) Data cleansing

Data cleaning mainly comprises deleting irrelevant data from the source data, smoothing noisy data, screening out data irrelevant to the theme, and processing missing values and abnormal values, wherein the missing values are processed by a deletion method, a replacement method and an interpolation method.

The deletion method is the simplest way of handling missing values and, depending on the angle from which the data is processed, takes two forms: deleting observation samples and deleting variables. All rows containing missing data can be removed with the R function na.omit(); this trades sample size for completeness of information and is suitable when the proportion of missing values is small. Deleting a variable means removing the whole variable, and is suitable when the variable has a large proportion of missing values and little influence on the research target.
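As a minimal sketch of the two forms of the deletion method, the following Python code (used here for illustration instead of R) drops observations that contain a missing value and drops variables whose share of missing values is too high. The record layout, the use of `None` as the missing-value marker, the field names and the 0.5 threshold are all assumptions made for the example.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def drop_incomplete_rows(rows: List[Row]) -> List[Row]:
    """Delete every observation that contains at least one missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def drop_sparse_variables(rows: List[Row], max_missing_ratio: float = 0.5) -> List[Row]:
    """Delete whole variables whose share of missing values exceeds the threshold."""
    if not rows:
        return []
    keep = [
        name
        for name in rows[0]
        if sum(row[name] is None for row in rows) / len(rows) <= max_missing_ratio
    ]
    return [{name: row[name] for name in keep} for row in rows]


# Hypothetical sample: "rebate" is missing in two of three rows.
sample = [
    {"revenue": 10.0, "cost": 4.0, "rebate": None},
    {"revenue": 12.0, "cost": None, "rebate": None},
    {"revenue": 11.0, "cost": 5.0, "rebate": 1.0},
]
complete_rows = drop_incomplete_rows(sample)   # keeps only the third row
dense_columns = drop_sparse_variables(sample)  # removes the "rebate" variable
```

Deleting rows keeps every variable but shrinks the sample, whereas deleting a variable keeps every observation but loses that attribute, which is why the text ties each form to a different missing-value proportion.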

The replacement method divides variables into numerical and non-numerical types according to their attributes and treats the two differently: if the variable containing the missing value is numerical, the missing value is generally replaced by the mean of that variable over all other objects; if the variable is non-numerical, the median or mode of all other valid observations of the variable is used for replacement.
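The replacement method can be sketched in a few lines of Python. This is an illustrative sketch only: it uses the mean for numerical variables and the mode for non-numerical variables (the median option mentioned above is not shown), and `None` as the missing-value marker is an assumption.

```python
from collections import Counter
from statistics import mean
from typing import List, Optional, Union

Value = Union[float, str]


def fill_missing(values: List[Optional[Value]]) -> List[Value]:
    """Replace None with the mean (numerical) or the mode (non-numerical)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to derive a replacement from")
    if all(isinstance(v, (int, float)) for v in observed):
        replacement: Value = mean(observed)
    else:
        # Most frequent observed category.
        replacement = Counter(observed).most_common(1)[0][0]
    return [replacement if v is None else v for v in values]


filled_numeric = fill_missing([2.0, None, 4.0])
filled_category = fill_missing(["north", "south", None, "north"])
```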

The interpolation method was proposed to address the problems that the deletion method and the replacement method waste information and change the data structure, which ultimately yields biased statistical results. Commonly used interpolation methods for missing values include regression interpolation and multiple interpolation. The regression interpolation method uses a regression model: the variable to be filled is taken as the dependent variable and other related variables as independent variables, and the value of the dependent variable is predicted with the regression function lm() to fill in the missing value. The principle of multiple interpolation is to generate a complete data set from a data set containing missing values and to repeat this many times, thereby producing a random sample of the missing values; in the R language, multiple interpolation can be performed with the mice() function of the mice package.
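Regression interpolation can be illustrated with a single independent variable. The sketch below is written in Python rather than R and fits an ordinary least-squares line on the complete observations, then predicts the missing values of the dependent variable. The function names and the sample figures are hypothetical, and a real data set would normally use several independent variables.

```python
from typing import List, Optional, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the observed x values are not all identical
    return mean_y - slope * mean_x, slope


def regression_fill(xs: List[float], ys: List[Optional[float]]) -> List[float]:
    """Fill missing y values with predictions from the fitted line."""
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in known], [y for _, y in known])
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]


# The observed pairs lie on y = 1 + 2x, so the gap at x = 3 is filled with about 7.
filled = regression_fill([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, None, 9.0])
```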

(2) Data deduplication

Data deduplication is used for removing repeated data from the source data.
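A simple way to picture deduplication is to keep the first record seen for each key. In the Python sketch below, the choice of key fields that define a "repeated" record, and the sample customer records, are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def deduplicate(records: Iterable[Dict[str, str]],
                key_fields: Tuple[str, ...]) -> List[Dict[str, str]]:
    """Keep the first record seen for each combination of key field values."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical records: the same customer arrives from two business systems.
customers = [
    {"id": "C001", "name": "Acme", "source": "CRM"},
    {"id": "C002", "name": "Globex", "source": "ERP"},
    {"id": "C001", "name": "Acme", "source": "ERP"},
]
unique_customers = deduplicate(customers, key_fields=("id", "name"))
```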

(3) Data processing

Data processing is used for splitting and merging data in the source data.
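Splitting and merging might look as follows in Python. This sketch assumes that splitting means breaking one delimited field into several named fields and that merging means joining records from two sources on a shared key; the field names and values are hypothetical.

```python
from typing import Dict, List


def split_field(record: Dict[str, str], field: str,
                parts: List[str], sep: str = "-") -> Dict[str, str]:
    """Split one delimited field into several named fields."""
    pieces = record[field].split(sep)
    if len(pieces) != len(parts):
        raise ValueError(f"expected {len(parts)} parts in {record[field]!r}")
    result = {k: v for k, v in record.items() if k != field}
    result.update(zip(parts, pieces))
    return result


def merge_on_key(left: List[Dict[str, str]], right: List[Dict[str, str]],
                 key: str) -> List[Dict[str, str]]:
    """Merge records from two sources that share the same key value (inner join)."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


order = split_field({"order_id": "1", "period": "2018-09"}, "period", ["year", "month"])
merged = merge_on_key(
    [{"customer": "C001", "name": "Acme"}],
    [{"customer": "C001", "region": "Henan"}],
    key="customer",
)
```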

It should be noted that after data cleaning, data deduplication and data processing are performed on source data, accuracy of data mining and analysis can be improved, and decision service can be better provided for enterprises.

Further, the data management and control platform can use the metadata collection engine of the ETL platform to uniformly collect and process the metadata of the distributed file system HDFS, the distributed columnar storage database Hyperbase, the ETL processing flows and rules, the existing business system databases, and Teradata and Oracle databases, store the metadata uniformly in the database of the data management and control platform, and establish the metadata association among source base tables, interface tables, ETL processes and target base tables, thereby laying a firm foundation for subsequent data standard management, main data management, data quality management and data security management. Where data must be exchanged with the enterprise's existing metadata management and main data management systems, an ESB platform and message transmission middleware can be used to exchange metadata and main data change records with the existing systems in real time through a JMS interface. (It should be noted that different names in this text may refer to the same component, as those skilled in the art will understand, and they are not explained one by one: for example, the data acquisition and processing platform and the data integration and ETL platform are the same platform, the ESB platform and the ESB service bus platform are the same platform, and the distributed column database, the distributed column-type storage database and the distributed columnar storage database are the same database.)
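The metadata association among source base table, interface table, ETL process and target base table can be pictured as a lineage record. The Python sketch below is purely illustrative: the `LineageLink` structure, the `upstream_of` helper and all table and job names are hypothetical and are not taken from the described platform.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LineageLink:
    """One association: source base table -> interface table -> ETL process -> target base table."""
    source_table: str
    interface_table: str
    etl_job: str
    target_table: str


def upstream_of(links: List[LineageLink], target_table: str) -> List[LineageLink]:
    """Return the lineage links that feed the given target base table."""
    return [link for link in links if link.target_table == target_table]


# Hypothetical metadata collected from two business systems.
links = [
    LineageLink("crm.customer", "stg.customer", "etl_load_customer", "dw.dim_customer"),
    LineageLink("erp.orders", "stg.orders", "etl_load_orders", "dw.fact_orders"),
]
customer_lineage = upstream_of(links, "dw.dim_customer")
```

With associations stored this way, a data quality problem found in a target table can be traced back to the ETL process and source table that produced it.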

The ESB platform is used for providing functions such as message queuing, message subscription and publication, Web Service orchestration and combined invocation, and service monitoring.

Based on the ESB platform and the JMS message interface, real-time data exchange (including operation and maintenance management data, metadata/main data and the like) with the existing business systems can be realized, and the result data sets analyzed and mined by the system can be pushed in real time to application service systems such as CRM, ERP, the enterprise portal and the APP.

The ESB platform supports JDBC/ODBC and HTTP/JSON interfaces and can interface with the SQL engine and the joint query engine in the system, so that database queries and joint queries over unstructured and structured data can be packaged as Web Services to be called by related application systems.

Further, the algorithm model library comprises a naive Bayes model, and the target data is classified by adopting the naive Bayes model; wherein, the expression of the naive Bayes model is as follows:

P(B|A) = P(B) × P(A|B) / P(A)

That is, a classifier is constructed from the collected big data to classify the target data. In application, the posterior probability is calculated for each class, and a data item is assigned to the class with the largest posterior probability; in this way, the behavior of the enterprise can be predicted and analyzed from the collected big data.
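The classification rule described here can be sketched as a small categorical naive Bayes classifier in Python. This is a sketch under stated assumptions rather than the implementation in the algorithm model library: add-one (Laplace) smoothing is added so that unseen values do not produce a zero probability, log probabilities are used for numerical stability, and the feature values and class labels are invented for the example. Because P(A) is the same for every class, it is omitted when the posteriors are compared.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Features = Tuple[str, ...]


class NaiveBayes:
    """Categorical naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples: List[Features], labels: List[str]) -> "NaiveBayes":
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # feature_counts[label][position][value] -> number of occurrences
        self.feature_counts: Dict[str, List[Counter]] = defaultdict(
            lambda: [Counter() for _ in samples[0]]
        )
        self.values = [set() for _ in samples[0]]
        for features, label in zip(samples, labels):
            for i, value in enumerate(features):
                self.feature_counts[label][i][value] += 1
                self.values[i].add(value)
        return self

    def log_posterior(self, features: Features, label: str) -> float:
        """log P(B) + sum of log P(A_i | B); the shared P(A) term is omitted."""
        score = math.log(self.class_counts[label] / self.total)
        for i, value in enumerate(features):
            count = self.feature_counts[label][i][value] + 1
            score += math.log(count / (self.class_counts[label] + len(self.values[i])))
        return score

    def predict(self, features: Features) -> str:
        """Return the class with the largest posterior probability."""
        return max(self.class_counts, key=lambda label: self.log_posterior(features, label))


# Hypothetical training data: (payment behaviour, debt level) -> enterprise class.
training_samples = [("late", "high"), ("late", "high"), ("on_time", "low"),
                    ("on_time", "low"), ("on_time", "high")]
training_labels = ["risky", "risky", "stable", "stable", "stable"]
model = NaiveBayes().fit(training_samples, training_labels)
prediction = model.predict(("late", "high"))
```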

Further, the processing of the target data by the data analysis mining tool specifically includes:

(1) Probability description

The probability description comprises a characteristic description and a distinctive description, and the characteristic description is used for representing common characteristics of a certain type of target data; the distinctive description is used to represent the distinction between different classes of target data.

(2) Association analysis

The association analysis is used for discovering frequent patterns, association rules, correlations or causal structures among item sets from a large amount of target data. An association rule describes the degree of mutual influence between attributes and is measured by confidence and support: the confidence measures the accuracy of the association rule, and the support measures its importance.
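Support and confidence have standard definitions that are easy to compute directly. In the Python sketch below, the support of an item set is the share of transactions containing it, and the confidence of a rule X -> Y is support(X and Y) divided by support(X); the transactions are hypothetical.

```python
from typing import FrozenSet, List, Set


def support(transactions: List[Set[str]], items: FrozenSet[str]) -> float:
    """Share of transactions that contain every item in `items`."""
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions: List[Set[str]], antecedent: FrozenSet[str],
               consequent: FrozenSet[str]) -> float:
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)


# Hypothetical purchase records.
baskets = [
    {"router", "cable"},
    {"router", "cable", "switch"},
    {"router"},
    {"switch"},
]
rule_support = support(baskets, frozenset({"router", "cable"}))
rule_confidence = confidence(baskets, frozenset({"router"}), frozenset({"cable"}))
```

Here the item set {router, cable} appears in two of four transactions (support 0.5), and two of the three transactions containing a router also contain a cable (confidence 2/3).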

(3) Classification

The classification is used for finding a reasonable model for each type of target data on the premise of knowing the characteristics and classification of the training data, and then classifying new data by using the model; the classification comprises the steps of model establishment and classification by using the model; here, the classification is performed with reference to the naive bayes model.

(4) Cluster analysis

The cluster analysis is a method for information aggregation according to the principle of information similarity without knowing the classification category in advance.

Specifically, the clustering algorithm may be the FCM (fuzzy C-means) clustering algorithm, which is an improvement on the conventional hard clustering algorithm. The algorithm steps include:

standardizing the data matrix;

establishing a fuzzy similarity matrix and initializing the membership matrix;

iterating until the objective function converges to a minimum;

determining the class to which each data item belongs from the iteration result and the final membership matrix, and displaying the final clustering result.
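The FCM steps can be sketched in plain Python as follows. The sketch follows the common textbook formulation of fuzzy C-means, which works directly on the membership matrix, so the fuzzy similarity matrix step is not modelled. The fuzzifier m = 2, the convergence tolerance, the random initialisation and the sample points are all assumptions.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def standardize(data: Matrix) -> Matrix:
    """Z-score each column so that no single attribute dominates the distance."""
    cols = list(zip(*data))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in data]


def sq_dist(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fcm(data: Matrix, c: int, m: float = 2.0, tol: float = 1e-6,
        max_iter: int = 200, seed: int = 0) -> Tuple[Matrix, Matrix]:
    """Return (membership matrix, cluster centres) for fuzzy C-means."""
    rng = random.Random(seed)
    # Initialise memberships randomly; each row is normalised to sum to 1.
    u = []
    for _ in data:
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    previous = float("inf")
    for _ in range(max_iter):
        # Centres are membership-weighted means of the data points.
        centres = []
        for j in range(c):
            weights = [u[i][j] ** m for i in range(len(data))]
            wsum = sum(weights)
            centres.append([sum(w * row[k] for w, row in zip(weights, data)) / wsum
                            for k in range(len(data[0]))])
        # Memberships are updated from the relative distances to the centres.
        for i, row in enumerate(data):
            d = [max(sq_dist(row, centre), 1e-12) for centre in centres]
            inv = [dj ** (-1.0 / (m - 1.0)) for dj in d]
            total = sum(inv)
            u[i] = [v / total for v in inv]
        # Stop when the objective function no longer changes noticeably.
        objective = sum(u[i][j] ** m * sq_dist(data[i], centres[j])
                        for i in range(len(data)) for j in range(c))
        if abs(previous - objective) < tol:
            break
        previous = objective
    return u, centres


# Two hypothetical, well-separated groups of points.
points = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]]
memberships, centres = fcm(standardize(points), c=2)
cluster_labels = [max(range(2), key=lambda j: row[j]) for row in memberships]
```

Unlike hard clustering, each point receives a degree of membership in every cluster; the final class is taken as the cluster with the largest membership.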

(5) Predictive analysis

The predictive analysis is used to predict continuous or ordered values.

(6) Deviation detection analysis

The deviation detection analysis is used for analyzing significant changes and deviations between the current state of the data and historical records or standards, finding abnormal records, and taking corrective measures in time.
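One simple way to illustrate deviation detection is to compare current values with the historical mean and flag those that lie more than a chosen number of standard deviations away. The text does not prescribe a particular statistic, so the z-score rule, the threshold of 3 and the sample figures in this Python sketch are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def find_deviations(history: List[float], current: List[float],
                    threshold: float = 3.0) -> List[Tuple[int, float]]:
    """Return (index, value) for current values far from the historical mean."""
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # A constant history makes any different value a deviation.
        return [(i, v) for i, v in enumerate(current) if v != baseline]
    return [(i, v) for i, v in enumerate(current)
            if abs(v - baseline) / spread > threshold]


# Hypothetical daily order counts: the history is stable around 100.
daily_orders_history = [100.0, 102.0, 98.0, 101.0, 99.0]
abnormal = find_deviations(daily_orders_history, [100.0, 97.0, 140.0])
```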

Further, the data visualization display platform comprises a J2EE platform and a visualization display component, wherein the visualization display component comprises an instant query component, a report form and instrument panel component, an OLAP multidimensional analysis component and a map display component, and the distributed column storage database and the distributed memory are accessed through an SQL engine component and a JDBC/ODBC interface.

Specifically, the system can realize the joint query of unstructured data (such as text data and XML data stored in HDFS) and structured data (including database data such as Oracle, MySQL, Teradata and Hyperbase) through a joint query engine and an HTTP/JSON interface. The system can also be connected with a full-text retrieval engine through an HTTP/JSON interface to realize full-text retrieval query.

Further, the enterprise behavior analysis system also comprises an SQL engine component, a stream processing engine component, a joint query engine component, a parallelization R algorithm execution engine component, a full-text retrieval engine component, a distributed computing engine component and a task scheduling and monitoring component.

The SQL engine component (Inceptor SQL) is a high-performance and high-compatibility SQL engine (SQL2016 standard) realized based on Spark, and provides a JDBC/ODBC standard interface for a system to access a Hyperbase database. The SQL engine supports PL/SQL, and is convenient for developers to realize multi-table association, summary processing and other applications.

The stream processing engine component is implemented on Spark Streaming and can interface with a distributed messaging system to receive and process stream data in real time; it can also interface with the enterprise's ESB platform through the JMS API to receive and process business data streams in real time and send information about abnormal events detected in real time to the ESB platform. The stream processing engine component can import real-time stream data into the distributed columnar storage database and the distributed memory in real time through the SQL engine. Business reference data, rule data and the like used while the stream processing engine is running can be placed in the distributed memory, which greatly reduces the time spent accessing the database.

The joint query engine provides the system with a joint query service over unstructured and structured data. The system and the joint query engine exchange query requests and responses through an HTTP/JSON interface. The joint query engine supports accessing databases (Oracle, Teradata, MySQL, etc.) through JDBC/ODBC interfaces, accessing the distributed database Hyperbase and the distributed memory through the Inceptor SQL engine, accessing the distributed file system HDFS through the Java API, and accessing JSON and XML data through an HTTP interface.

The parallelization R algorithm execution engine component is a parallelized R algorithm engine implemented on SparkR, and currently supports nearly 60 parallelized R algorithms. Developers can load an application package into the algorithm engine for execution through the visual programming environment. The parallelization R algorithm engine can extract the required data from Hyperbase through the JDBC interface and the SQL engine and store the analysis results in the distributed columnar storage database. It can also read file data on HDFS directly.

The full-text retrieval engine component (Elasticsearch) is used to extract text data from Hyperbase and HDFS and create a full-text index library. The full-text index library data may be stored in the distributed file system HDFS. Elasticsearch provides an HTTP/JSON access interface for full-text retrieval query applications.

The distributed computing engine component is used for providing a JAVA API framework for distributed batch processing computing; the Spark engine realizes fast distributed processing by fully utilizing a memory computing technology and supports languages such as Java, Scala, Python and the like.

The task scheduling and monitoring component is used for dynamically loading the user component, managing the user component and executing task monitoring, and a user can check the task execution condition according to the user state information.

Furthermore, to support login by multiple users across multiple systems, the system also comprises an identity authentication and access control component, which uniformly provides identity authentication and access control services for users accessing the enterprise portal, business analysis and other applications. User credentials and authorization information may be stored in a relational database (Oracle or MySQL) or a lightweight directory repository. User certificate information may be exchanged with a Certificate Authority (CA) through a proprietary interface or the JMS interface of the ESB platform. The component also provides an SSO Agent plug-in, which enables single sign-on integration of the various application systems and management systems.

From the above description, it can be seen that the enterprise behavior analysis system based on big data provided by the embodiment of the present invention has the following advantages: by processing the various source data of an enterprise, the system associates the big data generated by each business system in the enterprise and, through data mining and analysis followed by effective visual presentation, provides data support services and operation and decision support for the enterprise, thereby enhancing the operating efficiency of the enterprise, improving its management level, and promoting its core competitiveness and creativity.

To better understand the big data based enterprise behavior analysis system provided by the embodiment of the present invention, the system may be described from another perspective, as shown in fig. 2, and the system is divided into the following layers:

hardware device layer

The hardware device layer comprises hardware devices such as server devices, network devices, storage devices, load balancers, storage devices, VPN/firewalls and the like.

Virtualized resource layer

The virtualized resource layer is a server virtualization resource pool built on a distributed container cluster management system. It provides resource management services for the various application, distributed computing and storage service components, including multi-tenant container resource allocation and scheduling management, application packaging, deployment and operation, service registration and discovery, dynamic scaling, and load balancing and disaster tolerance.

Application platform layer

The application platform layer provides platform support for the development, testing and operation of big data analysis applications, and mainly comprises: a J2EE application service platform and Spring framework, a report and analysis tool display platform, a parallelization algorithm model library, a relational database, an ESB service bus and ETL data integration platform, an identity authentication and access control component, a full-text retrieval component, big data distributed computing and storage platform components, and the like.

The big data distributed computing and storage platform components mainly comprise: a distributed column database (namely, a distributed column storage database), a distributed file system, an SQL engine component, a stream processing engine component, a joint query engine, a parallelization R algorithm execution engine, a full-text retrieval engine, a distributed computing engine, a task scheduling and monitoring component, and the like.

Application service layer

The application service layer is used for the custom development of various application services, and mainly comprises: management, data management, application management, content management, data analysis, metadata management, decision support, risk management and control, process optimization, service support, cross marketing, product innovation and the like.

Communication network layer

Through the communication network layer, external users access the authorized application services via the Internet (including the mobile Internet), and internal personnel access the application services on the intranet (WIFI wireless local area network) through the integrated network.

Terminal access layer

A system user can access the related application services through a PC Web browser or a mobile terminal (a smart phone, a tablet computer and the like). The platform supports interaction by e-mail, mobile phone APP, WeChat, short message and the like, and the terminals that can be connected also include large touch screens and professional terminals. The overall technical framework of the system also comprises: a big data management standards and specifications system, unified system security operation and maintenance management, and the like.

In order to give the system good compatibility, the system also provides various data interfaces, including API (application programming interface) interfaces fully compatible with all components of the open-source Hadoop ecosystem, and REST (representational state transfer) access interfaces, including a WebHDFS interface and a StarGate/Hyperbase REST interface. Meanwhile, a JDBC/ODBC interface is provided with support for the SQL2016 standard and PL/SQL, so that traditional business scenarios can be smoothly migrated to the big data platform. In addition, the big data platform provides a Java API and an R language interface for data mining; through these interfaces, users can directly use the R language and SQL for interactive data mining exploration and can carry out secondary development through the APIs opened by the platform, while upper-layer applications run SQL queries through the JDBC/ODBC interface. Furthermore, the Inceptor also includes a Java API for a basic parallel statistical mining algorithm library, through which users can carry out secondary development of data mining.
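As an illustration of the WebHDFS REST interface mentioned above, the following minimal sketch builds request URLs in the standard WebHDFS form `http://<host>:<port>/webhdfs/v1/<path>?op=<OP>`. It only constructs the URL and does not contact a cluster; the helper name `webhdfs_url` is hypothetical, and the host, port and file path used with it are assumptions for illustration.

```python
from typing import Dict, Optional
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str,
                params: Optional[Dict[str, str]] = None) -> str:
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation
    such as OPEN or LISTSTATUS."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # The operation goes in the 'op' query parameter; extras such as
    # 'user.name' are appended after it.
    query = urlencode({"op": op, **(params or {})})
    # quote() keeps '/' separators and percent-encodes other unsafe characters.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"
```

An upper-layer application could issue an HTTP GET against the URL returned for `op="OPEN"` to read a file, or for `op="LISTSTATUS"` to list a directory, without linking any Hadoop client library.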

The system adopts a Service Oriented Architecture (SOA) design and uses the J2EE/Spring and Apache CXF frameworks to implement a built-in service registration function; it can register and call existing external Web Services, and can expose defined services to be called by other applications. Using the ESB platform, which connects to the SQL engine through the JDBC/ODBC interface, query access to the distributed database is encapsulated into Web Services to be called by related application systems. The ESB platform can also connect to the joint query engine through an HTTP/JSON interface, so that joint query access to unstructured and structured data is encapsulated into Web Services to be called by related application systems. Analysis and mining results generated by the report/analysis platform can likewise be encapsulated as RESTful services based on the ESB platform, to be called by related application systems.
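The idea of registering a service and encapsulating a database query behind it can be sketched in a few lines. This is a simplified illustration, not the ESB or Apache CXF implementation: an in-memory SQLite connection stands in for the JDBC/ODBC connection to the SQL engine, and `SERVICE_REGISTRY`, `register_service`, `call_service` and `make_query_service` are hypothetical names.

```python
import json
import sqlite3
from typing import Any, Callable, Dict

# Hypothetical service registry: service name -> callable returning JSON text.
SERVICE_REGISTRY: Dict[str, Callable[..., str]] = {}


def register_service(name: str, handler: Callable[..., str]) -> None:
    SERVICE_REGISTRY[name] = handler


def call_service(name: str, **params: Any) -> str:
    # Look up a registered service by name and invoke it.
    return SERVICE_REGISTRY[name](**params)


def make_query_service(conn: sqlite3.Connection, sql: str) -> Callable[..., str]:
    """Wrap a parameterized SQL query as a service that returns JSON rows."""
    def handler(**params: Any) -> str:
        cursor = conn.execute(sql, params)  # named placeholders such as :region
        columns = [col[0] for col in cursor.description]
        rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
        return json.dumps(rows)
    return handler
```

A caller only needs the service name and its parameters, for example `call_service("orders_by_region", region="north")` after a query has been registered under that hypothetical name; the SQL text and the connection stay hidden behind the registry.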

The system has the following advantages:

1. The system can improve the transformation and upgrading capability of traditional industries. Under the new situation, fully releasing the transformative role of big data in industrial development accelerates the transformation of traditional industry operation and management modes, the innovation of service and business models, and the reconstruction of the industrial value chain system, and promotes the networked sharing, intensive integration, collaborative development and efficient utilization of social production factors. In this way, traditional production modes can be changed, the collaborative development of traditional industries with new business forms and new models is promoted, the pace of economic transformation is accelerated, and economic operation efficiency is improved. Taking industry as an example: in the design stage, the big data enterprise behavior analysis system can raise the level of personalization of industrial design; in the production stage, big data can be used to monitor and optimize assembly line operation, strengthen fault prediction and health management, optimize product quality and reduce energy consumption.

2. The system can optimize commodity supply and improve the economic benefit of enterprises. In the past, production enterprises organized production according to market expectation and judgment, and those expectations were not always consistent with real market demand, which led to product overstock, excess inventory and increased enterprise cost. In the internet era, big data can be used to connect production and marketing: production is determined by sales based on the real demand of consumers, which improves the accuracy of industrial product sales, avoids unnecessary raw material and labor costs, minimizes enterprise inventory, reduces cost and improves the economic benefit of enterprises. For example, a production enterprise can intuitively and conveniently obtain the real demand of consumers through big data collected by an e-commerce platform: accurate data on JD.com (Jingdong) users across the whole process of browsing, comparing, selecting, purchasing and commenting is mined and fed back to the production enterprises on JD.com, so that the enterprises have data support from the moment they begin to calculate manufacturing cost, and products are designed and produced according to the big data, with higher efficiency and lower cost.

3. The system can foster new economic growth points. Using big data to transform traditional growth drivers and cultivate new ones meets the objective requirements of economic development in China, and has great significance and broad prospects for realizing innovation-driven transformation of the real economy. In the new era, the potential value of the big data economy will inevitably drive the vigorous development of the related information industry, including big data resource construction, big data technology and big data application.

Finally, it should be noted that the above description is only a specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are also included in the scope of the present invention.

Claims

1. An enterprise behavior analysis system based on big data is characterized by comprising a data acquisition and processing platform, a data warehouse, a data management and control platform, a data analysis and mining platform, an operation analysis platform and a data visualization display platform;

the data warehouse is used for storing the target data;

the data visualization display platform is used for performing various visualization displays on the processing result;

the data analysis mining tool specifically processes target data and comprises probability description, association analysis, classification, cluster analysis, prediction analysis and deviation detection analysis;

the predictive analysis is used to predict continuous or ordered values;

the deviation detection analysis is used for analyzing significant changes and deviations between the current data and historical records or standards, finding existing abnormal records and taking corrective measures;

the enterprise behavior analysis system further comprises a stream processing engine component; the stream processing engine component is implemented based on Spark Streaming, can interface with a distributed message system, and receives and processes stream data in real time; it can interface with the ESB platform of the enterprise through a JMS API interface, receive and process business data streams in real time, and send information on abnormal events detected in real time to the ESB platform;

the enterprise behavior analysis system also comprises a joint query engine component, a parallelization R algorithm execution engine component, a full-text retrieval engine component, a distributed computation engine component and a task scheduling and monitoring component;

the joint query engine is used for providing a joint query service of unstructured data and structured data for the system;

the parallelization R algorithm execution engine component extracts required data from Hyperbase through a JDBC interface and an SQL engine, and stores the analysis result into the distributed column storage database;

the full-text retrieval engine component is used for extracting text data from Hyperbase and HDFS and creating a full-text index library;

the distributed computing engine component is used for providing a JAVA API framework for distributed batch processing computing;

2. The big-data based enterprise behavior analysis system of claim 1, wherein the data warehouse comprises a distributed columnar storage database and a distributed file system.

3. The big-data based enterprise behavior analysis system of claim 1, wherein the source data comprises one or more of structured data, semi/unstructured data, and real-time data.

4. The big-data based enterprise behavior analysis system according to claim 3, wherein the sources of the various source data of the enterprise specifically include data of existing business systems of the enterprise, real-time data collected through a distributed message queue, and internet data collected through web crawler technology.

5. The big-data based enterprise behavior analysis system of claim 4, wherein the source of the various source data of the enterprise further comprises data uploaded from online filing and reporting files.

6. The big-data based enterprise behavior analysis system of claim 1, wherein the source data is processed in a manner comprising: data cleaning, data de-duplication and data processing;

the data de-duplication refers to removing repeated data in the source data.

7. The big-data-based enterprise behavior analysis system according to claim 1, wherein the algorithm model library comprises a naive Bayes model, and the target data is classified by using the naive Bayes model; wherein, the expression of the naive Bayes model is as follows:

P(B|A) = P(B) × P(A|B) / P(A)

8. The big-data based enterprise behavior analysis system of claim 1, wherein the data visualization presentation platform comprises a J2EE platform and visualization presentation components comprising an instant query component, a report and dashboard component, an OLAP multidimensional analysis component, and a map presentation component.
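For illustration, the Bayes expression recited in claim 7, P(B|A) = P(B) × P(A|B) / P(A), and its use for classification can be sketched as follows. This is a minimal sketch under stated assumptions rather than the patented algorithm model library: the categorical features, the add-one (Laplace) smoothing and the names `bayes_posterior` and `NaiveBayes` are hypothetical choices made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def bayes_posterior(prior_b: float, likelihood_a_given_b: float, evidence_a: float) -> float:
    # Direct evaluation of P(B|A) = P(B) * P(A|B) / P(A).
    return prior_b * likelihood_a_given_b / evidence_a


class NaiveBayes:
    """Categorical naive Bayes with add-one smoothing (an assumption)."""

    def fit(self, rows: List[Dict[str, str]], labels: List[str]) -> "NaiveBayes":
        self.class_counts = Counter(labels)
        self.total = len(labels)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.feature_values = defaultdict(set)    # feature -> distinct values seen
        for row, label in zip(rows, labels):
            for feature, value in row.items():
                self.value_counts[(label, feature)][value] += 1
                self.feature_values[feature].add(value)
        return self

    def predict(self, row: Dict[str, str]) -> str:
        scores = {}
        for label, count in self.class_counts.items():
            # Start from the prior P(B), then multiply by each P(A_i|B),
            # treating the features as conditionally independent.
            score = count / self.total
            for feature, value in row.items():
                seen = self.value_counts[(label, feature)][value]
                distinct = len(self.feature_values[feature])
                score *= (seen + 1) / (count + distinct)
            scores[label] = score
        # P(A) is the same for every class, so it can be omitted when comparing.
        return max(scores, key=scores.get)
```

The classifier compares the numerators P(B) × P(A|B) across classes; dividing by P(A) would rescale all scores equally and would not change which class is selected.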