CN112148810B - User portrait analysis system supporting custom labels

Info

Publication number: CN112148810B
Application number: CN202011243959.XA
Authority: CN
Inventors: 卢宪政; 左赋斌
Original assignee: Nanjing Zhishuyun Information Technology Co ltd
Current assignee: Nanjing Zhishuyun Information Technology Co ltd
Priority date: 2020-11-10
Filing date: 2020-11-10
Publication date: 2023-11-28
Anticipated expiration: 2040-11-10
Also published as: CN112148810A

Abstract

The invention discloses a user portrait analysis system supporting custom labels, comprising: a source data layer, which is a storage layer for storing original business data; a data analysis layer, which analyzes and processes the original business data according to labels/indexes custom-configured by the user and loads the analysis results into a data warehouse and a data application module for data display and application; and a data product end, which is built on the data analysis layer and developed by the user to implement data statistics and display analysis. The invention deeply integrates existing big data technical frameworks and exposes a standard integration API, which reduces framework selection and integration work during system development and makes the system easy to integrate into existing systems. It also provides a variety of data processing components and a flexibly configurable label-based data analysis scheme, so it can be applied quickly to related big data analysis scenarios and can respond quickly to continuously changing data analysis indexes.

Description

User portrait analysis system supporting custom labels

Technical Field

The invention relates to a user portrait analysis system supporting custom labels, and belongs to the technical field of data analysis.

Background

As the internet has entered the big data era, user behavior has changed and reshaped the products and services of enterprises. The biggest change is that every user behavior is now traceable and analyzable by the enterprise. Enterprises store large amounts of raw data and business data of many kinds, which form a true record of their business operations; how to use this data more effectively for analysis and evaluation is a problem enterprises face as data volumes grow. As big data technology is studied and applied more deeply, enterprises increasingly focus on how to use big data for refined operation and precision marketing. To carry out refined operation, an enterprise must first build portraits of its users.

A user portrait, i.e. the labeling of user information, describes a user through data in dimensions such as social attributes, consumption habits and preferences. These characteristics are analyzed and aggregated, potentially valuable information is mined from them, and an overall picture of the user is abstracted. The user portrait can be regarded as the foundation of refined enterprise operation: it is a precondition for targeted advertising and personalized recommendation, and it lays the groundwork for data-driven operation. Compared with traditional enterprise reports, user portraits provide more flexible user behavior analysis and more accurate personalized service, and are an important direction for the practical application of big data.

At present, technologies for big data storage, processing and analysis emerge one after another, and frameworks within the same field are diversifying, each with its own advantages and characteristics. When an enterprise needs to develop services such as user portrait analysis, system developers often face the following problems:

1. many similar frameworks or technologies exist for the same function, and developers must spend time on research, comparison and trial and error;

2. how to integrate multiple frameworks organically into an efficient and accurate system-level solution;

3. when other business departments in the enterprise need to build similar platforms, how to quickly reuse and integrate the existing platform.

Disclosure of Invention

To address the problems in the prior art, the invention provides a user portrait analysis system supporting custom tags. Tailored to the business requirements of user portrait applications, it deeply integrates existing big data technical frameworks and exposes a standard integration API, which reduces framework selection and integration work during system development and makes the system easy to integrate into existing systems. It also provides a variety of data processing components and a flexibly configurable tag-based data analysis scheme, so it can be applied quickly to related big data analysis scenarios and can respond quickly to continuously changing data analysis indexes.

In order to achieve the above purpose, the present invention adopts the following technical scheme: a user portrait analysis system supporting custom labels, the whole system is divided into three layers, including:

the source data layer, which is a storage layer for storing original business data;

the data analysis layer, which analyzes and processes the original business data according to the labels/indexes custom-configured by the user and loads the analysis results into the data warehouse and the data application module for data display and application;

the data product end, which is built on the data analysis layer and independently developed by the user to implement data statistics and display analysis;

the data analysis layer comprises an ODS storage layer, a data processing module, a data warehouse and a data application module, wherein the ODS storage layer is an isolation layer formed between a service system and the data warehouse and is used for accessing and storing the original data of a plurality of service systems and providing a data foundation and support for a data analysis engine of an upper layer;

the data processing module comprises a tag metadata management module, a task scheduling engine module and an operation state monitoring module, wherein the tag metadata management module is used for managing definition data describing each tag in a user portrait; the task scheduling engine module coordinates and schedules each task according to a time plan, executes a deployed data analysis program and provides a Web monitoring and management page;

the data warehouse is realized based on Hive, is used for storing processed data and classifies the processed data according to topics;

the data application module is used for providing data application support for the data product end;

the data of the data warehouse are periodically synchronized to the data application module.

The ODS storage layer is implemented with Hive external partitioned tables, partitioned along different dimensions according to specific business requirements.

The tag metadata is a data index system established according to actual business requirements, comprising statistics-class, rule-class, algorithm-class and machine-learning-mining-class indexes; the tag metadata is stored in MySQL, and basic query, insert, update and delete interfaces are exposed externally.

The task scheduling engine module comprises script management, workflow, a scheduler, a script plug-in, a UI and an API, and the specific execution flow comprises the following steps:

step one, uploading the program package/file to a designated server directory: the user uploads the packaged Jar package, sh file or SQL file;

step two, entering an execution script: the user enters script information through the UI, including name, type, version, package path, resource settings, execution environment parameters and dynamic parameters;

step three, creating a workflow: the user draws a DAG workflow through the UI and configures the basic information and execution order of each node;

step four, creating a scheduling task: the system automatically generates scheduling task information from the workflow, including Job, Trigger and Scheduler;

step five, scheduling and executing tasks: the system schedules and executes tasks according to the generated scheduling information;

step six, storing the result: the task scheduling record and the task execution status are saved.

The data warehouse comprises a user attribute topic library, a user behavior topic library, a user consumption topic library, a user preference topic library and a user value topic library. The user attribute topic library comprises user gender, age, education level, income level, marital status and family member status; the user behavior topic library comprises recent travel frequency and recent shopping frequency; the user consumption topic library comprises the recent number of purchases, the recent consumption amount and the consumption capacity; the user preference topic library comprises frequently purchased commodity categories and eating habits; the user value topic library comprises user value information calculated according to an RFM model.
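The user value topic library relies on an RFM model, which rates a user by recency, frequency and monetary value of purchases. The following is a minimal sketch of how such a value label might be derived. The `Order` record, the thresholds and the label names are illustrative assumptions and are not taken from the patent.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass
class Order:
    # Hypothetical order record; field names are assumptions.
    user_id: str
    order_date: date
    amount: float


def rfm_metrics(orders: List[Order], today: date) -> Dict[str, Dict[str, float]]:
    """Aggregate recency (days since last order), frequency and monetary value per user."""
    metrics: Dict[str, Dict[str, float]] = {}
    for order in orders:
        days_ago = (today - order.order_date).days
        m = metrics.setdefault(
            order.user_id, {"recency": days_ago, "frequency": 0, "monetary": 0.0}
        )
        m["recency"] = min(m["recency"], days_ago)
        m["frequency"] += 1
        m["monetary"] += order.amount
    return metrics


def value_label(m: Dict[str, float], max_recency_days: int = 30,
                min_frequency: int = 3, min_monetary: float = 500.0) -> str:
    """Map RFM metrics to a coarse value label using assumed thresholds."""
    score = sum([
        m["recency"] <= max_recency_days,   # bought recently
        m["frequency"] >= min_frequency,    # buys often
        m["monetary"] >= min_monetary,      # spends a lot
    ])
    return {3: "high_value", 2: "medium_value"}.get(score, "low_value")
```

In a deployment of the kind described here, the same aggregation would more likely run as a Hive or Spark job over the consumption topic library; the Python version only illustrates the calculation.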

The data application layer uses Elasticsearch, Redis, HBase and relational databases as data stores.

Compared with the prior art, the invention adopts a highly encapsulated, integrated data analysis layer that implements the necessary functions of data acquisition, metadata definition, data processing and data application, and supports extension. The data processing layer uses an independently developed execution engine that supports mainstream data processing frameworks and scripts, such as shell, Spark, MR, Hive, SQL, Java and Python scripts, and allows rapid deployment and go-live through online editing and configuration. Finally, by deeply integrating existing mainstream big data processing technologies and exposing a standard integration API, the invention greatly reduces framework selection and integration work during system development and is easy to integrate into existing systems. The system as a whole provides a variety of data processing components and a flexibly configurable tag-based data analysis scheme, can be applied quickly to related big data analysis scenarios, and can respond quickly to continuously changing data analysis indexes.

Drawings

FIG. 1 is a general architecture diagram of the present invention;

FIG. 2 is a workflow of the task scheduling engine module of the present invention;

FIG. 3 is a schematic representation of the workflow of the present invention;

FIG. 4 is a schematic diagram of a scheduler of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the invention.

As shown in FIG. 1 to FIG. 4, the user portrait analysis system supporting custom labels provided by the present invention is divided into three layers, comprising:

the source data layer, which is a storage layer for storing original business data and is managed and maintained by the business systems; examples include member information, order information, commodity information, member access logs, business operation logs and event-tracking data;

the data analysis layer, namely the general data analysis component to which this patent relates, which analyzes and processes the original business data according to the labels/indexes custom-configured by the user and loads the analysis results into a data warehouse and a data application module for data display and application;

the data product end, which is built on the data analysis layer and independently developed by the user to implement data statistics and display analysis, such as crowd analysis, user portraits, marketing delivery, statistics, and early warning and alarming;

the data analysis layer comprises an ODS storage layer, a data processing module, a data warehouse and a data application module, wherein the ODS storage layer is an isolation layer between the business systems and the data warehouse, used for ingesting and storing the original data of multiple business systems, such as user information, order details and commodity information, and for providing the data foundation and support for the upper-layer data analysis engine;

the ODS storage layer can be implemented with Hive external partitioned tables, partitioned along different dimensions according to specific business requirements, such as date, data source and application; system developers need to create the Hive tables corresponding to the source data according to the table structure and data characteristics of the original data;

the data processing module is the core module of the data analysis layer and comprises a tag metadata management module, a task scheduling engine module and an operation state monitoring module, wherein the tag metadata management module manages the definition data describing each tag in a user portrait; the task scheduling engine module coordinates and schedules tasks according to a time plan, executes the deployed data analysis programs and provides a Web monitoring and management page;

Using Hive as the ODS storage layer has the following advantages:

1) Metadata is managed in a unified way and can be passed directly to frameworks such as Spark/Impala for processing;

2) It provides HQL, an SQL-like query language, which is easy to learn and use;

3) It is based on Hadoop HDFS storage and therefore has strong scalability and computing capability;

The user can load the original data into the Hive tables by existing mature means, such as ETL tools, Sqoop or the Hive LOAD command.
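To make the ODS design more concrete, the sketch below generates the HiveQL for an external table partitioned by date and data source, plus a statement that registers one partition. The table name, columns, storage format and HDFS paths are assumptions chosen for illustration; the patent does not specify them.

```python
from typing import Dict, List, Tuple


def build_ods_ddl(table: str, columns: List[Tuple[str, str]],
                  partitions: List[Tuple[str, str]], location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement for a partitioned ODS table."""
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    parts = ", ".join(f"`{name}` {hive_type}" for name, hive_type in partitions)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


def build_add_partition(table: str, spec: Dict[str, str], location: str) -> str:
    """Build an ALTER TABLE statement that registers one partition directory."""
    pairs = ", ".join(f"{key}='{value}'" for key, value in spec.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({pairs}) LOCATION '{location}'"
    )


# Example: an order table partitioned by load date and source system (all names assumed).
ddl = build_ods_ddl(
    "ods.ods_order",
    [("order_id", "STRING"), ("user_id", "STRING"), ("amount", "DOUBLE")],
    [("dt", "STRING"), ("source", "STRING")],
    "/warehouse/ods/order",
)
add_part = build_add_partition(
    "ods.ods_order",
    {"dt": "2020-11-10", "source": "shop"},
    "/warehouse/ods/order/dt=2020-11-10/source=shop",
)
```

Because the table is external, dropping it in Hive leaves the files under the location untouched, which suits an isolation layer that holds raw business data.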

The tag metadata is a data index system established according to actual business requirements, comprising statistics-class, rule-class, algorithm-class and machine-learning-mining-class indexes; this index system is collectively referred to as tag metadata. By index composition and application scenario, it can be divided into dimensions such as user attributes, user behavior, user consumption and user social relations. The tag metadata is stored in MySQL, and basic query, insert, update and delete interfaces are exposed externally so that developers can build the corresponding management interfaces.
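As a rough illustration of the query, insert, update and delete interfaces, the sketch below keeps tag definitions in a small relational table. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server, and the table name, column names and tag type codes are hypothetical; they are not the table structure defined in the patent.

```python
import sqlite3
from typing import Optional, Tuple

# The four index classes named in the text, under assumed codes.
TAG_TYPES = ("statistics", "rule", "algorithm", "ml_mining")


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with a hypothetical tag metadata table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tag_meta ("
        " tag_code TEXT PRIMARY KEY,"
        " tag_name TEXT NOT NULL,"
        " tag_type TEXT NOT NULL,"
        " dimension TEXT NOT NULL)"
    )
    return conn


def add_tag(conn: sqlite3.Connection, code: str, name: str,
            tag_type: str, dimension: str) -> None:
    if tag_type not in TAG_TYPES:
        raise ValueError(f"unknown tag type: {tag_type}")
    conn.execute("INSERT INTO tag_meta VALUES (?, ?, ?, ?)",
                 (code, name, tag_type, dimension))


def get_tag(conn: sqlite3.Connection, code: str) -> Optional[Tuple[str, str, str, str]]:
    return conn.execute(
        "SELECT tag_code, tag_name, tag_type, dimension FROM tag_meta WHERE tag_code = ?",
        (code,),
    ).fetchone()


def rename_tag(conn: sqlite3.Connection, code: str, new_name: str) -> None:
    conn.execute("UPDATE tag_meta SET tag_name = ? WHERE tag_code = ?", (new_name, code))


def delete_tag(conn: sqlite3.Connection, code: str) -> None:
    conn.execute("DELETE FROM tag_meta WHERE tag_code = ?", (code,))
```

A management interface of the kind mentioned above would call these four operations; against MySQL the SQL would be similar, with the driver's own placeholder syntax.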

The tag metadata definition table structure is designed as follows:

As shown in FIG. 2, the task scheduling engine module includes script management, workflow, a scheduler, script plug-ins, a UI and an API, and the specific execution flow includes the following steps:

Script management comprises script entry management, a parser and an executor. Script attributes include name, type, version, package path, resource settings, execution environment parameters, dynamic parameters and so on. Different types of scripts take different execution parameters; after the user selects a type, the page automatically loads the corresponding parameter-setting window. The script parser parses the entered script according to its type, merges it, replaces its dynamic parameters, and generates executable data analysis code. The script executor is invoked by the workflow and the scheduler, and calls the underlying operating system shell to run the executable code produced by the parser.
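The parser and executor roles can be sketched as follows. This is an assumption-laden illustration: the `${name}` placeholder syntax, the function names and the two supported script types are invented for the example, and the patent's engine may handle dynamic parameters differently. The `hive -e` form is the Hive CLI option for running an inline query.

```python
import shlex
import subprocess
from string import Template
from typing import Dict, Tuple


def parse_script(script_type: str, body: str, params: Dict[str, str]) -> str:
    """Replace ${name} dynamic parameters, then wrap the script for its type."""
    rendered = Template(body).substitute(params)
    if script_type == "shell":
        return rendered
    if script_type == "hive":
        # Quote the query so the shell passes it to the Hive CLI as one argument.
        return "hive -e " + shlex.quote(rendered)
    raise ValueError(f"unsupported script type: {script_type}")


def execute_script(command: str) -> Tuple[int, str]:
    """Run the generated command through the operating system shell."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.returncode, result.stdout.strip()


# Example: a shell script with one dynamic parameter (values assumed).
command = parse_script("shell", "echo ${dt}", {"dt": "2020-11-10"})
```

Keeping parsing separate from execution means the workflow and the scheduler can both hand a parsed command to the same executor, which matches the division of roles described above.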

The workflow organizes and controls the execution order of tasks and provides a retry mechanism for failed tasks. It is assembled as a DAG (Directed Acyclic Graph): the tasks in the workflow form a directed acyclic graph, which is traversed in topological order starting from the nodes with zero in-degree until no successor nodes remain. The running state of tasks can be monitored in real time, and operations such as retry, failure recovery from a designated node, pause and killing a task are supported.
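The traversal described above corresponds to a standard topological sort that starts from zero in-degree nodes (Kahn's algorithm). The sketch below combines it with a simple retry loop; the function name, the `max_retries` parameter and the sequential execution are assumptions made for brevity, not details from the patent.

```python
from collections import deque
from typing import Callable, Dict, List


def run_workflow(tasks: Dict[str, Callable[[], None]],
                 edges: Dict[str, List[str]],
                 max_retries: int = 2) -> List[str]:
    """Run tasks in topological order; edges maps a node to its successors."""
    indegree = {name: 0 for name in tasks}
    for successors in edges.values():
        for node in successors:
            indegree[node] += 1

    # Start from the nodes that have no upstream dependency.
    ready = deque(sorted(name for name, deg in indegree.items() if deg == 0))
    finished: List[str] = []
    while ready:
        node = ready.popleft()
        for attempt in range(max_retries + 1):
            try:
                tasks[node]()
                break
            except Exception:
                if attempt == max_retries:
                    raise RuntimeError(
                        f"task {node!r} failed after {attempt + 1} attempts"
                    )
        finished.append(node)
        # A successor becomes ready once all of its predecessors have finished.
        for successor in edges.get(node, []):
            indegree[successor] -= 1
            if indegree[successor] == 0:
                ready.append(successor)

    if len(finished) != len(tasks):
        raise ValueError("workflow contains a cycle")
    return finished
```

A production engine would typically run independent ready nodes in parallel and persist state so that execution can resume from a designated node; the sketch only shows the ordering and retry logic.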

An execution node is the basic component of a workflow. As the carrier of task execution, it provides an abstract definition over a family of executable programs/scripts. Each execution node corresponds to one execution script, and nodes are distinguished by program type, mainly: Linux shell, Spark, Hive, SQL, Java, Python and so on.

The node controller determines the execution order and priority of nodes and provides operations such as executing, suspending, stopping and retrying a node.

The monitor provides basic monitoring data for each execution node, such as execution time, end time, elapsed time, execution state and error log.
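The monitoring data listed above can be captured by wrapping each node execution, as in the following sketch. The `NodeRun` record, its field names and the status strings are hypothetical and only illustrate the kind of data the monitor might collect.

```python
import time
import traceback
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NodeRun:
    # One monitoring record per node execution; field names are assumptions.
    node: str
    start_time: float
    end_time: float
    status: str
    error_log: Optional[str]

    @property
    def elapsed(self) -> float:
        """Time consumed by the node, in seconds."""
        return self.end_time - self.start_time


def run_monitored(node: str, action: Callable[[], None]) -> NodeRun:
    """Execute a node's action and record timing, state and any error log."""
    start = time.time()
    try:
        action()
        status, error = "SUCCESS", None
    except Exception:
        # Keep the full traceback so it can be shown on a monitoring page.
        status, error = "FAILED", traceback.format_exc()
    return NodeRun(node, start, time.time(), status, error)
```

Catching the exception here, rather than letting it propagate, lets the node controller decide whether to retry, suspend or stop based on the recorded status.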

A complete workflow layout diagram is shown in FIG. 3.

After the workflow has been laid out, the system automatically generates the corresponding task instance, and each task instance corresponds to a set of scheduling configuration information. The scheduler consists mainly of the following components: Scheduler, Job and Trigger; the relationship between them is shown in FIG. 4. The Scheduler is the task scheduling controller; it receives and stores Job and Trigger information and is responsible for firing the Triggers and running the Jobs they reference. The Scheduler contains two important components, ThreadPool and JobStore: ThreadPool provides multiple threads for executing Job programs, and JobStore stores Job and Trigger information. A Job is a scheduled task: it is the abstract definition of a task and carries the task's execution logic. One Job may be fired by multiple Triggers. A Trigger fires its corresponding task program according to a time rule, with the firing time and period specified by a Cron expression. For example: cronscheduled ("0 0/3 9-15.
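The relationship between Scheduler, Job and Trigger can be sketched as below. This is a simplified illustration under stated assumptions: the class layout is invented for the example, the trigger matches only the second, minute and hour fields, and the field matcher supports just `*`, a single number, `a-b` and `start/step`, which is a small subset of real Cron syntax. The example trigger reuses the three fields visible in the expression quoted above.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List


def field_matches(expr: str, value: int) -> bool:
    """Match one Cron-style field: '*', 'n', 'a-b' or 'start/step' only."""
    if expr == "*":
        return True
    if "/" in expr:
        start, step = expr.split("/")
        base = 0 if start == "*" else int(start)
        return value >= base and (value - base) % int(step) == 0
    if "-" in expr:
        low, high = expr.split("-")
        return int(low) <= value <= int(high)
    return int(expr) == value


@dataclass
class Job:
    name: str
    run: Callable[[], str]  # the task's execution logic


@dataclass
class Trigger:
    job_name: str
    second: str
    minute: str
    hour: str

    def should_fire(self, now: datetime) -> bool:
        return (field_matches(self.second, now.second)
                and field_matches(self.minute, now.minute)
                and field_matches(self.hour, now.hour))


class Scheduler:
    def __init__(self, workers: int = 4) -> None:
        self.jobs: Dict[str, Job] = {}      # JobStore: Job definitions
        self.triggers: List[Trigger] = []   # JobStore: Trigger definitions
        self.pool = ThreadPoolExecutor(max_workers=workers)  # ThreadPool

    def add_job(self, job: Job) -> None:
        self.jobs[job.name] = job

    def add_trigger(self, trigger: Trigger) -> None:
        self.triggers.append(trigger)

    def tick(self, now: datetime) -> List[str]:
        """Fire every Trigger that matches `now` and run its Job on the pool."""
        futures = [self.pool.submit(self.jobs[t.job_name].run)
                   for t in self.triggers if t.should_fire(now)]
        return [f.result() for f in futures]

    def shutdown(self) -> None:
        self.pool.shutdown()
```

For instance, a `Trigger("refresh_tags", "0", "0/3", "9-15")` fires at second 0 of every third minute between hours 9 and 15. Adding a second Trigger that points at the same Job shows that one Job may be fired by several Triggers.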

The script plug-ins cover script types such as shell, Spark, MR, Hive, SQL, Java and Python. Each type is integrated into the engine as a plug-in, the engine adapts each supported script type, and new script types can be added later as extensions.

The UI provides end users with visual management functions for the data processing engine, mainly including: tag metadata management, script management, workflow configuration, task scheduling management, running state monitoring and display of data processing results.

The API interface layer exposes request services externally through a unified RESTful API. The interfaces cover creating, defining, querying, modifying, publishing, taking offline, manually starting, stopping, pausing and resuming a workflow, executing from a designated node, and so on.

Operation state monitoring tracks the execution state of each component and the results and efficiency of data processing while processing is under way, mainly including: task scheduling monitoring, data processing monitoring and monitoring of the data processing engine's execution state.

The data warehouse already stores the topic library data by category, but it cannot provide convenient and efficient query and analysis capabilities, so a data application layer is provided on top of the data warehouse to give the product end data application support. The data application layer uses Elasticsearch, Redis, HBase, relational databases and the like as data stores, and provides the product end with convenient query, analysis and display functions.

In summary, the invention adopts a highly encapsulated, integrated data analysis layer that implements the necessary functions of data acquisition, metadata definition, data processing and data application, and supports extension. The data processing layer uses an independently developed execution engine that supports mainstream data processing frameworks and scripts, such as shell, Spark, MR, Hive, SQL, Java and Python scripts, and allows rapid deployment and go-live through online editing and configuration. Finally, by deeply integrating existing mainstream big data processing technologies and exposing a standard integration API, the invention greatly reduces framework selection and integration work during system development and is easy to integrate into existing systems. The system as a whole provides a variety of data processing components and a flexibly configurable tag-based data analysis scheme, can be applied quickly to related big data analysis scenarios, and can respond quickly to continuously changing data analysis indexes.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments and may be embodied in other specific forms without departing from its spirit or essential characteristics. Furthermore, although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; the specification is written this way for clarity only. Those skilled in the art should treat the specification as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments understandable to those skilled in the art.

Claims

1. A user portrait analysis system supporting custom labels, characterized in that the whole system is divided into three layers, comprising:

the data analysis layer comprises an ODS storage layer, a data processing module, a data warehouse and a data application module, wherein the ODS storage layer is an isolation layer formed between a service system and the data warehouse and is used for accessing and storing the original data of a plurality of service systems and providing a data foundation and support for a data analysis engine of an upper layer; the data application layer uses Elasticsearch, Redis, HBase and a relational database as data storage;

the workflow is used for organizing and controlling the execution order of each task and provides a retry mechanism for failed tasks; the workflow is a DAG flow in which the tasks are assembled in the form of a directed acyclic graph and traversed topologically starting from the nodes with zero in-degree until no successor node exists;

the execution nodes are basic components of the workflow, serve as bearing facilities for task execution, provide abstract definitions of a series of executable programs/scripts, correspond to one execution script, and distinguish the execution scripts according to program types;

the node controller provides the sequence and the priority of executing the nodes and provides the operations of executing, suspending, stopping and retrying the nodes;

step six, storing the result: storing task scheduling records and task execution conditions;

2. The user portrait analysis system supporting custom labels according to claim 1, wherein the ODS storage layer is implemented with Hive external partitioned tables, partitioned along different dimensions according to specific business requirements.

3. The user portrait analysis system supporting custom labels according to claim 1, wherein the tag metadata is a data index system established according to actual business requirements, comprising statistics-class indexes, rule-class indexes, algorithm-class indexes and machine-learning-mining-class indexes, and the tag metadata is stored in MySQL and provides basic query, insert, update and delete interfaces externally.

4. The user portrait analysis system supporting custom labels according to claim 1, wherein the data warehouse comprises a user attribute topic library, a user behavior topic library, a user consumption topic library, a user preference topic library and a user value topic library; the user attribute topic library comprises user gender, age, education level, income level, marital status and family member status; the user behavior topic library comprises recent travel frequency and recent shopping frequency; the user consumption topic library comprises the recent number of purchases, the recent consumption amount and the consumption capacity; the user preference topic library comprises frequently purchased commodity categories and eating habits; the user value topic library comprises user value information calculated according to an RFM model.