CN114549052A

CN114549052A - Data-based accurate marketing method, device, equipment and storage medium

Info

Publication number: CN114549052A
Application number: CN202210071646.3A
Authority: CN
Inventors: 钟通; 罗平
Original assignee: Shenzhen Bessky Technology Co ltd
Current assignee: Shenzhen Bessky Technology Co ltd
Priority date: 2022-01-20
Filing date: 2022-01-20
Publication date: 2022-05-27

Abstract

The invention discloses a data-based accurate marketing method, a data-based accurate marketing device, data-based accurate marketing equipment and a data-based accurate marketing storage medium, and belongs to the technical field of data processing. According to the method, the initial data are acquired, the initial data are preprocessed to obtain the target data, the target data are processed according to marketing analysis requirements to obtain the analysis data, the analysis data are further processed in a visual mode, a user can check the analysis data and product marketing strategies are formulated, and therefore accurate marketing of products is achieved.

Description

Data-based accurate marketing method, device, equipment and storage medium

Technical Field

The invention relates to the technical field of data processing, in particular to a data-based accurate marketing method, a data-based accurate marketing device, data-based accurate marketing equipment and a data-based accurate marketing storage medium.

Background

In product marketing, formulating marketing strategies according to data from various aspects is a basic means of improving marketing effects. With the rapid expansion of services, various types of data are increasingly accumulated. In precision marketing, however, the data source is often restricted to limited information about a single aspect, and many new types of data are not integrated with traditional data, so that the information of consumers, products and markets cannot be understood comprehensively.

The existing marketing system does not always perform effective management on data, meanwhile, due to the inaccuracy of a part of data and the loss of information, the error of a calculation result is large, and the marketing requirement cannot be met.

Disclosure of Invention

The invention mainly aims to provide a data-based accurate marketing method, a system, a device and a readable storage medium, and aims to realize data-based accurate marketing.

In order to achieve the above object, the present invention provides a data-based precision marketing method, which includes the following steps:

collecting initial data;

preprocessing the initial data to obtain target data;

and processing the target data according to marketing analysis requirements to obtain analysis data, wherein the analysis data is used for making a product marketing strategy to carry out product marketing.

Optionally, if the initial data is database data, the preprocessing includes: missing value cleaning, the step of preprocessing the initial data comprising:

acquiring the missing proportion of fields in the database data, and acquiring the importance degree index of the fields;

and determining a cleaning strategy according to the missing proportion and the importance degree index, and cleaning the missing content according to the cleaning strategy.

Optionally, if the initial data is database data, the preprocessing includes: format content cleaning, wherein the step of preprocessing the initial data comprises the following steps:

according to a preset content rule, acquiring data which does not meet the preset content rule in the database data and carrying out content cleaning;

and acquiring data which does not meet the preset format rule in the database data according to the preset format rule, and carrying out format cleaning.

Optionally, if the initial data is database data, the preprocessing includes: cleaning non-demand data, wherein the step of preprocessing the initial data comprises:

screening non-demand data from the initial data according to a preset non-demand data rule;

and deleting the non-required data.

Optionally, if the initial data is database data, the preprocessing includes: logic error cleaning, wherein the step of preprocessing the initial data comprises:

acquiring repeated data in the database data, and performing deduplication processing on the repeated data;

and acquiring unreasonable values and contradictory contents in the database data, and cleaning the unreasonable values and the contradictory contents according to a preset logic error cleaning method.

Optionally, the step of processing the target data according to marketing analysis requirements to obtain analysis data includes:

setting a timing task according to marketing analysis requirements;

and acquiring corresponding data in the target data based on the timing task, constructing a data model, and acquiring the analysis data through the data model.

Optionally, after the step of processing the target data according to marketing analysis requirements to obtain analysis data, the method further includes:

and carrying out visualization processing on the analysis data.

In addition, to achieve the above object, the present invention further provides a data-based precision marketing device, including:

the acquisition module is used for acquiring initial data;

the preprocessing module is used for preprocessing the initial data to obtain target data;

and the analysis module is used for processing the target data according to marketing analysis requirements to obtain analysis data, and the analysis data is used for making a product marketing strategy to carry out product marketing.

Optionally, the preprocessing module is further configured to:

according to a preset content rule, acquiring data which does not meet the preset content rule in the database data and carrying out content cleaning;

and acquiring data which does not meet the preset format rule in the database data according to the preset format rule, and carrying out format cleaning.

Optionally, the preprocessing module is further configured to:

screening non-demand data from the initial data according to a preset non-demand data rule;

and deleting the non-required data.

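Optionally, the preprocessing module is further configured to:

acquiring repeated data in the database data, and performing deduplication processing on the repeated data;

and acquiring unreasonable values and contradictory contents in the database data, and cleaning the unreasonable values and the contradictory contents according to a preset logic error cleaning method.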

Optionally, the analysis module is further configured to:

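setting a timing task according to marketing analysis requirements;

and acquiring corresponding data in the target data based on the timing task, constructing a data model, and acquiring the analysis data through the data model.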

Optionally, the analysis module is further configured to:

and carrying out visualization processing on the analysis data.

In addition, to achieve the above object, the present invention further provides a data-based precision marketing apparatus, including: the system comprises a memory, a processor and a data-based precision marketing program stored on the memory and capable of running on the processor, wherein the data-based precision marketing program realizes the steps of the data-based precision marketing method when being executed by the processor.

In addition, to achieve the above object, the present invention further provides a storage medium, wherein the storage medium stores a data-based precision marketing program, and the data-based precision marketing program, when executed by a processor, implements the steps of the data-based precision marketing method as described above.

According to the data-based accurate marketing method, the data-based accurate marketing device, the data-based accurate marketing equipment and the storage medium, initial data are obtained, then the initial data are preprocessed to obtain target data, the target data are processed according to marketing analysis requirements to obtain analysis data, the analysis data are further processed in a visualized mode and displayed in an application system, a user can check the analysis data, accurate marketing strategies based on the data are made, and accurate marketing is achieved.

Drawings

FIG. 1 is a schematic structural diagram of a data-based precision marketing device of a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a first embodiment of a data-based precision marketing method according to the present invention;

FIG. 3 is a flowchart illustrating a crawler data warehousing process according to an embodiment of the data-based precision marketing method of the present invention;

FIG. 4 is a diagram of the overall architecture of the system according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a data model according to an embodiment of the present invention;

FIG. 6 is a functional block diagram of a data-based precision marketing device according to a first embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a data-based precision marketing device of a hardware operating environment according to an embodiment of the present invention.

As shown in fig. 1, the data-based precision marketing apparatus may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of a data-based precision marketing apparatus, and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components.

As shown in fig. 1, a memory 1005, which is a storage medium, may include therein an operating system, a data storage module, a network communication module, a user interface module, and a data-based precision marketing program.

In the data-based precision marketing apparatus shown in fig. 1, the network interface 1004 is mainly used for data communication with other apparatuses; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 of the data-based precision marketing device of the present invention may be disposed in the data-based precision marketing device, and the data-based precision marketing device invokes the data-based precision marketing program stored in the memory 1005 through the processor 1001 and executes the data-based precision marketing method provided by the embodiment of the present invention.

An embodiment of the present invention provides a data-based accurate marketing method, and referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of the data-based accurate marketing method according to the present invention.

In this embodiment, the accurate marketing method based on data includes the following steps:

step S10, collecting initial data;

step S20, preprocessing the initial data to obtain target data;

and step S30, processing the target data according to marketing analysis requirements to obtain analysis data, wherein the analysis data is used for making a product marketing strategy to carry out product marketing.

In this embodiment, a data-based precision marketing method is provided, which is used in a data-based precision marketing system for analyzing a consumer market. Product marketing usually needs a large amount of data to support analysis, study and judgment. This embodiment aims to collect and process data effectively, so as to solve the problem that data quality is affected by the diversity of data types and the differences among data sources. The purpose of the analysis is determined according to the strategic objectives of the enterprise, and suitable data content is selected according to that purpose, so that too much irrelevant data is not selected and does not interfere with the analysis. The analysis result is displayed visually, a product marketing strategy is formulated, and the product is recommended and marketed according to the formulated strategy. The respective steps will be described in detail below:

step S10, collecting initial data;

In one embodiment, initial data is first collected. By data source, the collection of big data can be divided into four categories: Web data (including Web pages, video, audio, animation, pictures, etc.), log data, database data, and other data (sensor device data, etc.). The methods and techniques used for data acquisition differ for different data sources. The details are as follows:

Web data acquisition: Shell programming, crawler tools, development of crawler programs (Java, Python, and the like), the HTTP protocol, TCP/IP fundamentals and the Socket programming interface, programming languages, data format conversion, commands and interfaces of distributed storage systems (HDFS, HBase, and the like), and distributed application development.

Log data acquisition: collection tools (Flume, Fluentd, Logstash, and the like), access tools (Kafka), development of log collection programs (Java, Python, and the like), Shell programming, TCP/IP fundamentals and network programming interfaces, programming languages, data format conversion, commands and interfaces of distributed storage systems (HDFS, HBase, and the like), and distributed application development.

Database data acquisition: Shell programming, collection tools (Sqoop, Kettle, and the like), access tools (Kafka), development of database collection programs (Java, Python, and the like), the SQL query language and programming interfaces, relational database connections such as JDBC, TCP/IP fundamentals and the Socket programming interface, programming languages, data format conversion, commands and interfaces of distributed storage systems (HDFS, HBase, and the like), and distributed application development. Database-type data sources may come from major e-commerce platforms such as eBay, Amazon, AliExpress, Shopee, Joom, Wish, and Lazada, and the relational database data is imported into Hadoop and its related systems through Sqoop.

Other data acquisition: Shell programming, collection tools, access tools, development of collection programs (Java, Python, and the like), use of the interfaces of specific data sources, TCP/IP fundamentals and the Socket programming interface, programming languages, data format conversion, commands and interfaces of distributed storage systems (HDFS, HBase, and the like), and distributed application development.

After the initial data is collected, the data is stored in the following manners, for example: MySQL, ES, File, FTP, Doris, HBase, RestApi, and Kafka. MySQL is an open-source relational database management system that uses Structured Query Language (SQL), the most common database management language, for database management. ES is short for Elasticsearch, a highly scalable, open-source full-text search and analytics engine that can store, search and analyze massive amounts of data in near real time. FTP is the File Transfer Protocol, used for bidirectional file transfer over the Internet: files can be downloaded from a server to the local computer or uploaded from the local computer to space on a server. Doris is an MPP (Massively Parallel Processing) database for fast analysis of massive data. RestApi refers to REST (Representational State Transfer) interfaces, with which more concise interfaces can be designed to transfer information and make calls between systems. Kafka is an open-source stream processing platform developed by the Apache Software Foundation and written in Scala and Java.

Step S20, preprocessing the initial data to obtain target data;

In one embodiment, after the initial data is obtained, the initial data is preprocessed to obtain the target data. As can be appreciated, data preprocessing is a very critical step. In the real world, data is often incomplete (lacking some attribute values of interest), inconsistent (containing differences in codes or names), and very vulnerable to noise (erroneous values or outliers). Because the database is very large and the data set often comes from multiple heterogeneous data sources, low-quality data will lead to low-quality mining results. Therefore, in order to improve the quality of the data and of the data analysis, and to ensure the marketing effect of products, the initial data is preprocessed to obtain the preprocessed target data.

Step S30, processing the target data according to marketing analysis requirements to obtain analysis data;

In one embodiment, the target data is processed according to marketing analysis requirements to obtain analysis data. It can be understood that the amount of target data is huge and not all of it is useful, so this embodiment focuses on collecting and processing data effectively, solving the problem that data quality is affected by the diversity of data types, the differences among data sources, and the like. Furthermore, the purpose of the analysis is determined according to the strategic objectives of the enterprise, and suitable data content is selected according to that purpose, so that selecting too much irrelevant data does not interfere with the analysis. The marketing analysis requirements are set according to specific conditions, and different analysis requirements, such as market analysis, user analysis and commodity analysis, need different data, so the corresponding data is obtained and processed according to the marketing analysis requirements to obtain the analysis data. After the analysis data is obtained, product marketing strategies, product recommendations, etc. can be formulated from the analysis data, for example: adjusting the commodity price according to the relation between product price and purchase quantity; and matching recommended products according to user purchase records.
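As an illustration of matching recommended products according to user purchase records, the following is a minimal sketch that assumes a simple co-purchase count. The function names, the sample records and the scoring rule are hypothetical and are not specified by this embodiment.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_purchase_counts(purchases):
    """Count how often each pair of products appears in the same user's records."""
    pair_counts = defaultdict(Counter)
    for products in purchases.values():
        for a, b in combinations(sorted(set(products)), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    return pair_counts


def recommend(user, purchases, top_n=3):
    """Recommend products co-purchased with the user's items but not yet bought."""
    owned = set(purchases.get(user, []))
    pair_counts = co_purchase_counts(purchases)
    scores = Counter()
    for item in owned:
        for other, count in pair_counts[item].items():
            if other not in owned:
                scores[other] += count
    return [product for product, _ in scores.most_common(top_n)]


# Hypothetical purchase records keyed by user id (illustrative only).
purchases = {
    "u1": ["phone", "case"],
    "u2": ["phone", "case", "charger"],
    "u3": ["phone", "charger"],
    "u4": ["case"],
}
print(recommend("u4", purchases))
```

A production system would more likely compute such scores in the data warehouse rather than in application memory; the sketch only shows the matching idea.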

Further, in an embodiment, after the step of processing the target data according to marketing analysis requirements to obtain analysis data, the method further includes:

step S40, performing visualization processing on the analysis data.

In one embodiment, through visualization processing, the analysis data is presented in an application system, and product marketing strategies are formulated based on the data so as to carry out product marketing. It can be understood that visual display of the data makes it convenient for business personnel to analyze flexibly, discover consumption behavior characteristics and market trends, and formulate the product strategies, sales strategies and marketing strategies of the enterprise, thereby achieving precision marketing; this is more convenient and faster than analyzing directly from the data in a database. Data visualization is mainly achieved with two types of tools: programming and non-programming. Non-programming data analysis tools include Excel, Power BI, Tableau, FineBI, etc. Power BI is business intelligence (BI) software, Tableau is a visual analysis platform, and FineBI is a business intelligence system from FanRuan. The programming approach usually needs a complete system framework, and the related technologies are as follows:

Operating system: UNIX, Windows

Content Delivery Network (CDN): unpkg, jsDelivr

Web server: Apache, Tengine

Programming language: Java, PHP, Python

JavaScript framework: Vue.js, Element UI, Handlebars

JavaScript library: React, jQuery, Axios, Select2, moment

User Interface (UI) framework: cs, Bootstrap, animate

Other: webpack, Babel

The programming language adopted is PHP, whose advantages are: (1) it is open source; (2) it is free of charge; (3) it has strong cross-platform capability; (4) it is efficient; (5) it runs fast and allows fast program development; (6) it is simple to edit and highly practical; (7) it is object-oriented; (8) it is primarily a scripting language, and so on. Vue.js has the following advantages: (1) modular development is possible; (2) data can be bound bidirectionally; (3) the interface is responsive; (4) Vue uses routing and does not refresh the page, in contrast to conventional pages, which implement page switching and jumping through hyperlinks.

Referring to fig. 3, fig. 3 is a crawler data warehousing flowchart of an embodiment of the data-based precision marketing method of the present invention. In this embodiment, the product marketing data sources are mainly the major e-commerce platforms. Website data of the platforms is crawled by a crawler program, or obtained through a data interface or other data acquisition modes; collection is scheduled daily, and the collected data is first stored locally on the server and then stored into a data directory in MySQL. The data directory is the db data directory, and the databases are separate databases under the data directory: one is used for storing test data, and one is used for storing all data.

The data is then stored into a data warehouse through an ETL timing task. The data warehouse is divided into four layers. The first layer is the ODS (Operational Data Store) layer, namely the interface layer, whose function is to receive the original data without any processing; its working content mainly includes the standard design of the table structure, the conversion of MySQL field types to Hive field types, and the arrangement of field mapping relations. The second layer is the DWD (Data Warehouse Detail) layer, also called the theme layer, whose function is to clean, convert, integrate and de-duplicate on the basis of the original data; its working content includes the standard design of the table fields, the source table of each field, and the design of the processing logic. The third layer is the DWS (Data Warehouse Service) layer, namely the mart layer, whose function is to store detailed data, intermediate calculation result tables, summary tables and the like. The fourth layer is the ADS (Application Data Store) layer, namely the application layer, whose function is to carry out commodity recommendation and multidimensional analysis.

A CDH big data platform is constructed based on the four layers, and the required data is then extracted through Elasticsearch, a Lucene-based search server that is mainly used here for storing and reading data.
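The flow of data through the four layers can be sketched in plain Python as follows. This is only an in-memory illustration under assumed field names (item_id, category, price) and an assumed summary (average price per category); the embodiment itself implements the layers as Hive tables fed by ETL timing tasks.

```python
def ods_layer(raw_rows):
    """ODS (interface layer): receive the original rows without any processing."""
    return list(raw_rows)


def dwd_layer(ods_rows):
    """DWD: clean, convert and de-duplicate on the basis of the original data."""
    seen, cleaned = set(), []
    for row in ods_rows:
        item_id = str(row["item_id"]).strip()
        if not item_id or item_id in seen:
            continue  # drop rows without an id and repeated ids
        seen.add(item_id)
        cleaned.append({
            "item_id": item_id,
            "category": row["category"].strip().lower(),
            "price": float(row["price"]),
        })
    return cleaned


def dws_layer(dwd_rows):
    """DWS: build a summary table, here item count and average price per category."""
    totals = {}
    for row in dwd_rows:
        count, total = totals.get(row["category"], (0, 0.0))
        totals[row["category"]] = (count + 1, total + row["price"])
    return {cat: {"items": n, "avg_price": total / n}
            for cat, (n, total) in totals.items()}


def ads_layer(dws_summary):
    """ADS: application-facing result, here categories ordered by item count."""
    return sorted(dws_summary, key=lambda cat: dws_summary[cat]["items"], reverse=True)


# Hypothetical crawled rows (illustrative only).
raw = [
    {"item_id": " 1001 ", "category": "Shoes ", "price": "20.0"},
    {"item_id": "1001", "category": "shoes", "price": "20.0"},  # duplicate after trimming
    {"item_id": "1002", "category": "shoes", "price": "30.0"},
    {"item_id": "1003", "category": "bags", "price": "50.0"},
]
summary = dws_layer(dwd_layer(ods_layer(raw)))
ranking = ads_layer(summary)
print(summary, ranking)
```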

With reference to fig. 4, fig. 4 is an overall technical architecture diagram of a system according to an embodiment of the data-based precision marketing method of the present invention, which is described in detail as follows. The data sources include real-time data sources and offline data sources, specifically: MySQL, ES, files, FTP, Doris, HBase, RestApi and Kafka. Data is acquired from the data sources through offline or real-time transmission, and the acquired data is persisted to HDFS; in order to improve query speed, part of the data is also stored in Elasticsearch so that it can be queried there. HDFS is managed using Cloudera Manager and ZooKeeper. Data analysis mainly adopts Hive + Flink, where the Flink analysis mainly adopts Flink CDC; data is queried and analyzed when triggered by an ETL job timing task, the result data is persisted to MySQL, and the application system presents the data to the user by querying MySQL and Elasticsearch. The data-based precision marketing system integrates, summarizes, analyzes and processes the data, can store both traditional data and new types of data, and reduces development and maintenance costs. The first, second, third and fourth application systems are shown for illustration; various application systems can be used for visualization, and they can be built with Vue, Element, H5, PHP, Axios, jQuery, ECharts, webpack and the like.

In addition, the English terms appearing in the figure are explained as follows. Azkaban is a batch workflow task scheduler, introduced by LinkedIn, for running a set of jobs and processes in a particular order within a workflow. K8S, in full Kubernetes, is an open-source system for managing containerized applications on multiple hosts in a cloud platform. HDFS is short for Hadoop Distributed File System, and is an implementation of the Hadoop abstract file system. YARN is a resource scheduling platform whose purpose is to improve resource utilization in a cluster environment. ETL, an abbreviation of Extract-Transform-Load, describes the process of extracting (Extract), converting (Transform) and loading (Load) data from a source end to a destination end. Elasticsearch is a Lucene-based search server. Flink, also known as Apache Flink, is a distributed system that requires computing resources to execute applications; Flink integrates with all common cluster resource managers, such as Hadoop YARN, Apache Mesos and Kubernetes, but can also be configured to run as a stand-alone cluster. CDC is short for Change Data Capture, with which committed changes can be retrieved from a database and sent downstream for use. Hive is a data warehouse infrastructure built on top of Hadoop for processing structured data; it provides a series of tools for data extraction, transformation, analysis and loading.

According to this embodiment, initial data are collected and preprocessed to obtain target data, the target data are processed according to marketing analysis requirements to obtain analysis data, the analysis data are subjected to visualization processing, and the data analysis is presented to users for formulating marketing strategies, so that product marketing is achieved. In addition, the original data and the analysis data are stored in the data-based precision marketing system, so that the data are integrated and stored, and an analyst can comprehensively understand the information of consumers, products and markets.

Furthermore, based on the first embodiment of the data-based precision marketing method of the present invention, a second embodiment of the data-based precision marketing method of the present invention is provided.

The second embodiment of the data-based precision marketing method differs from the first embodiment of the data-based precision marketing method in that the step of preprocessing the initial data comprises:

step S21, obtaining the missing proportion of the fields in the database data and obtaining the importance degree index of the fields;

and step S22, determining a cleaning strategy according to the missing proportion and the importance degree index, and cleaning the missing content according to the cleaning strategy.

In one embodiment, the missing proportion of the fields in the database is acquired together with the importance degree index of the fields, different cleaning strategies are formulated according to the importance degree index and the missing proportion, and the missing content is cleaned using the corresponding cleaning strategy. Missing data values are one of the problems often encountered in data analysis. When the missing proportion is small, the missing records can be directly discarded or processed manually. However, actual data often has a considerable proportion of missing values. In that case, manual processing is very inefficient, and discarding the missing records loses a large amount of information, so that a systematic difference arises between incompletely observed data and completely observed data; analyzing such data is likely to lead to wrong conclusions. The missing values therefore need to be cleaned after the initial data is obtained. All fields in the database data are acquired, the missing values in the database are acquired through a rule statement or other methods, and the missing proportion is counted; for example, the data is acquired multiple times, the amounts of data acquired each time are compared, the missing values are counted, and the missing proportion is calculated.
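Counting the missing proportion per field can be sketched as follows. The sketch assumes that a value counts as missing when it is null or a blank string, which is one possible rule and not the only method contemplated above; the field names and rows are hypothetical.

```python
def is_missing(value):
    """Treat None and blank strings as missing values (an assumed rule)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def missing_proportions(rows, fields):
    """Return, for each field, the share of rows in which the field is missing."""
    total = len(rows)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(is_missing(row.get(field)) for row in rows) / total
        for field in fields
    }


# Hypothetical database rows (illustrative only).
rows = [
    {"item_id": "1001", "price": "19.9", "category": "shoes"},
    {"item_id": "1002", "price": "", "category": None},
    {"item_id": "1003", "price": "25.0", "category": "  "},
    {"item_id": "1004", "price": "30.0", "category": "bags"},
]
proportions = missing_proportions(rows, ["item_id", "price", "category"])
print(proportions)
```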

The importance degree index can be represented by a score. The index is determined by presenting the list of fields to business personnel or according to rules given by the business, so that an importance score can be set for each field according to business requirements. The importance degree is then divided into high and low according to a preset score threshold: a score greater than the preset score threshold is a high importance degree and a score less than the preset score threshold is a low importance degree, and the result is marked in the importance degree label of the field. The importance degree indexes can also be ranked. For example, if the business is not very concerned with commodity price but is concerned with commodity category, the importance score of the price field is set lower; the fields are then ranked and, for example, the fields whose importance scores are in the top fifty percent are selected as fields with a high importance degree. A strategy is then formulated according to the importance degree index and the missing proportion, and the specific strategies can be of the following four types:

(I) Strategy for fields with a high importance degree index and a low missing rate: 1. fill by calculation; 2. estimate through experience or business knowledge.

(II) Strategy for fields with a high importance degree index and a high missing rate: 1. attempt to obtain the data from other channels; 2. calculate the value from other fields; 3. remove the field and indicate this in the results.

(III) Strategy for fields with a low importance degree index and a low missing rate: no processing, or simple filling.

(IV) Strategy for fields with a low importance degree index and a high missing rate: remove the field.
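The selection among the four strategies above can be sketched as a small decision function. The 0.3 cut-off between a low and a high missing rate, the sample scores and the function names are assumptions made for illustration; the embodiment does not fix these values.

```python
def high_importance_fields(scores, top_share=0.5):
    """Rank fields by importance score and treat the top share as high importance."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, int(len(ranked) * top_share))
    return set(ranked[:keep])


def cleaning_strategy(is_important, missing_ratio, high_missing_threshold=0.3):
    """Map (importance, missing rate) to one of the four strategies.

    high_missing_threshold is an assumed cut-off, not a value from the embodiment.
    """
    high_missing = missing_ratio > high_missing_threshold
    if is_important and not high_missing:
        return "fill by calculation or estimate from business knowledge"
    if is_important and high_missing:
        return "re-acquire from other channels, derive from other fields, or remove and note in results"
    if not high_missing:
        return "leave unprocessed or fill simply"
    return "remove field"


# Hypothetical importance scores and missing proportions (illustrative only).
scores = {"category": 90, "title": 70, "price": 40, "remark": 10}
missing = {"category": 0.05, "title": 0.60, "price": 0.10, "remark": 0.80}
important = high_importance_fields(scores)
plan = {field: cleaning_strategy(field in important, missing[field]) for field in scores}
for field, strategy in plan.items():
    print(field, "->", strategy)
```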

Specifically, removing unnecessary fields: unimportant fields with a high null rate are omitted and do not enter the next processing link. Filling missing content: some missing values can be filled in, and a commonly used method is to supplement the data from other data sources, or to calculate the value of a missing field from the values of other fields. For example, if the gender and age fields of a person are empty but the identification number is not, the gender and age can be determined from the identification number. Re-fetching: if some indexes are very important and their missing rate is high, the personnel responsible for data acquisition or the business personnel need to be consulted as to whether the relevant data can be obtained through other channels.
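Deriving gender and age from an identification number can be sketched as follows, assuming the 18-character resident identity number layout in which characters 7 to 14 hold the birth date (YYYYMMDD) and the 17th character is odd for male and even for female. The sample number is made up for illustration, the check character is not validated, and the reference date is passed in explicitly.

```python
from datetime import date


def gender_and_age_from_id(id_number, today):
    """Derive (gender, age) from an 18-character identity number, or None if unusable."""
    if len(id_number) != 18 or not id_number[:17].isdigit():
        return None
    try:
        birth = date(int(id_number[6:10]), int(id_number[10:12]), int(id_number[12:14]))
    except ValueError:
        return None  # the embedded birth date is not a valid calendar date
    gender = "male" if int(id_number[16]) % 2 == 1 else "female"
    # Subtract one year if the birthday has not yet occurred in the reference year.
    age = today.year - birth.year - ((today.month, today.day) < (birth.month, birth.day))
    return gender, age


# Made-up sample number, used only to exercise the parsing logic.
print(gender_and_age_from_id("11010119900307001X", date(2022, 1, 20)))
```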

Further, in an embodiment, if the initial data is database data, the preprocessing includes: format content cleaning, wherein the step of preprocessing the initial data comprises the following steps:

step S23, according to a preset content rule, acquiring data which does not meet the preset content rule in the database data and cleaning the content;

and step S24, acquiring data which does not meet the preset format rule in the database data according to the preset format rule, and performing format cleaning.

If the data was filled in manually or obtained by calling an external interface, problems in format and content are quite likely, so this embodiment cleans data that has format or content problems.

The respective steps will be described in detail below:

In an embodiment, data in the database data that does not meet the preset content rule is identified according to the preset content rule, and its content is cleaned. The preset content rules describe common content errors. Such content problems can adversely affect subsequent data analysis and may even cause misjudgment; for example, a wrong or missing unit on a numerical value may affect an analyst's judgment on pricing. Content cleaning is therefore required, as illustrated below:

(1) The content contains special characters that should not be present: for example, an identification number may contain only digits and letters, and the most typical problem is a space at the beginning, at the end, or in the middle. In such cases the problems need to be located in a semi-automatic, semi-manual way, and the unwanted characters are then removed. A common cleaning method is to use regular-expression functions: like, rlike, regexp_replace, regexp_extract.

(2) The content does not match what the field should contain: this is a more subtle problem and one of the important causes of analysis errors, such as failed cross-table joins (extra spaces cause "Zhang San" and "Zhang  San" to be treated as two different people), incomplete statistics (letters mixed into numbers produce incorrect sums), and failed or poor model output (line breaks misalign the data, or price and date are mixed in the same column).
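As an illustration of item (1), the following sketch removes characters that should not appear in an identification number. It uses Python's `re` module rather than the Hive functions named above, and the helper names are hypothetical.

```python
import re

# Anything other than an ASCII digit or letter is treated as illegal here.
ILLEGAL = re.compile(r"[^0-9A-Za-z]")


def has_illegal_chars(raw):
    """Return True if the value contains spaces or other unexpected characters."""
    return ILLEGAL.search(raw) is not None


def clean_id_number(raw):
    """Drop leading, trailing and embedded characters that are not digits or letters."""
    return ILLEGAL.sub("", raw)
```

Flagging first with `has_illegal_chars` and cleaning afterwards matches the semi-automatic, semi-manual workflow described in the text, since the flagged rows can be reviewed before they are rewritten.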

Case scenario one: when Lazada crawler commodity data and sea eagle interface commodity data are merged, the commodity id formats are inconsistent. The commodity id format of the interface is 329126338, while the commodity id format of the crawler is 404185593_MY-584204464.

Cleaning method: the number before the "_" symbol in the itemid field is extracted by using regexp_extract (itemid,' (_ is) (______________, 1). If the result is not empty, it can be used directly; if the result is empty, the commodity id needs to be extracted from the commodity link field product_url, with the following syntax:

CASE WHEN t1.itemid_new = '' THEN

regexp_replace(regexp_extract(regexp_extract(t1.product_url,'(-i)(?=(.(?!(-i)))*$)(.*)',0),'(?<=(-i))(.*?)(?=(-s))',2),'-i','') ELSE t1.itemid_new END;
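The same two-stage logic can be sketched in Python for checking the rule outside Hive. This is a minimal sketch, not the embodiment's code: the function name is hypothetical, and the URL shape (an id sitting between the last "-i" and the following "-s") is inferred from the regular expressions above.

```python
import re


def normalize_item_id(itemid, product_url=""):
    """Return the numeric commodity id, falling back to the product URL."""
    # Stage 1: digits before the first "_" (crawler format),
    # or the whole value when it is already purely numeric (interface format).
    m = re.match(r"^(\d+)(?:_|$)", itemid or "")
    if m:
        return m.group(1)
    # Stage 2: text between the last "-i" and the next "-s" in the URL.
    parts = product_url.rsplit("-i", 1)
    if len(parts) == 2:
        m = re.match(r"(.*?)-s", parts[1])
        if m:
            return m.group(1)
    return None
```

Returning `None` when neither source yields an id keeps unresolved rows visible for manual review instead of silently producing an empty string.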

In one embodiment, data in the database data that does not meet the preset format rule is identified according to the preset format rule, and its format is cleaned. Besides content problems, format problems may also arise from non-standard data entry or from extraction errors. Several common format problems and their cleaning methods are listed below:

(1) Time and date format problems: these are usually related to how the data was entered and may also arise when integrating multi-source data; the values can be converted to a consistent format. The two functions FROM_UNIXTIME(cast(insert_time/1000 as int),'yyyy-MM-dd HH:mm:ss') and UNIX_TIMESTAMP(date(insert_time))*1000 convert between a millisecond timestamp and the year-month-day time format.
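For comparison, the same round trip between a millisecond timestamp and a formatted time string can be sketched in Python. The function names are hypothetical, and the sketch fixes the time zone to UTC as an assumption, whereas the Hive functions use the session time zone.

```python
from datetime import datetime, timezone

FORMAT = "%Y-%m-%d %H:%M:%S"


def millis_to_text(ms):
    """Convert a millisecond epoch timestamp to 'YYYY-MM-DD HH:MM:SS' (UTC)."""
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc).strftime(FORMAT)


def text_to_millis(text):
    """Convert a 'YYYY-MM-DD HH:MM:SS' string (taken as UTC) to milliseconds."""
    parsed = datetime.strptime(text, FORMAT).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp()) * 1000
```

Normalizing every source to one representation before merging avoids mixing second-based and millisecond-based timestamps in the same column.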

(2) JSON format data parsing:

Single json field parsing: use the get_json_object function.

Case scenario two: the Amazon interface data commodity table has a json field top_cates, whose format is shown in the following table; the value 79903031 of cate_id needs to be extracted from the json data.

Cleaning method: first remove the "[" and "]" characters with regexp_replace, and then use the get_json_object function, with the following syntax:

get_json_object(regexp_replace(top_cates,'\\[|\\]',''),'$.cate_id');
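A rough Python counterpart of this extraction is shown below. Because the table showing the real field format is not reproduced here, the sample value and its `cate_name` key are assumptions. Python's `json` module parses the surrounding array directly, so the brackets do not need to be stripped first.

```python
import json


def extract_cate_id(top_cates):
    """Return cate_id from a json object, or from the first element of a json array."""
    parsed = json.loads(top_cates)
    if isinstance(parsed, list):
        parsed = parsed[0] if parsed else {}
    return parsed.get("cate_id")


# Hypothetical sample value; the real column layout may differ.
sample = '[{"cate_id": "79903031", "cate_name": "Video Games"}]'
cate_id = extract_cate_id(sample)
```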

Json array field parsing: this relies on the explode function built into Hive. The explode() function takes data of array or map type as input and outputs each element of the array or map as a separate row; used together with LATERAL VIEW, it makes it possible to parse multiple json objects.

Case scenario three: the Amazon interface data commodity table has a field sub_cates containing a json array, whose format is shown in the following table; the values 16150720031 and 508387031 of p_l4_id need to be extracted from the json data.

Cleaning method: split and explode are used together with LATERAL VIEW to turn the data in the sub_cates column from one row (a json array) into multiple rows (single json objects); regexp_replace is used to replace the "[", "]" and "},{" characters in the sub_cates field. After the rows have been expanded, the p_l4_id field can be extracted with get_json_object, as in single json field parsing.
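The row expansion described above can be illustrated in Python as follows. This is a sketch under assumptions: the `asin` value and the exact shape of the array elements are hypothetical, and the helper name does not come from the embodiment.

```python
import json


def explode_json_array(rows, array_field):
    """Turn each row holding a json array into one row per array element."""
    exploded = []
    for row in rows:
        base = {k: v for k, v in row.items() if k != array_field}
        for element in json.loads(row[array_field]):
            exploded.append({**base, **element})
    return exploded


# Hypothetical input row with a json array in the sub_cates column.
rows = [{
    "asin": "B000000001",
    "sub_cates": '[{"p_l4_id": "16150720031"}, {"p_l4_id": "508387031"}]',
}]
expanded = explode_json_array(rows, "sub_cates")
p_l4_ids = [r["p_l4_id"] for r in expanded]
```

Each output row repeats the non-array columns of its source row, which is the behavior LATERAL VIEW provides when combined with explode.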

The complete statement:

Multiple nested json parsing: first extract the array to be parsed with get_json_object, then use regexp_replace to replace "},{" with "}||{", then split on "||" with split; once the array has been obtained, expand it into multiple rows with LATERAL VIEW explode.

Case scenario four: clean the multiply nested json shown below:

It is to be parsed into the following table format:

The parsing method is as follows:

Step one: first, the three columns code, name and list are extracted using get_json_object.

Step two: the data in the list column is then parsed, and the contents of ACode, AName, BCode and BName are extracted.

Since list is a multiply nested json array, one row can be turned into multiple rows by using split and explode together with LATERAL VIEW.

First, a LATERAL VIEW explode is used:

lateral view explode(split(regexp_replace(regexp_extract(list,'^\\[(.+)\\]$',1),'\\}\\]\\}\\,\\{','\\}\\]\\}\\|\\|\\{'),'\\|\\|')) alist as a

This performs the first-layer parsing of list and assigns the result to a.

Next, the json_tuple function is used:

lateral view json_tuple(a,'ACode','AName','BList') ai as ACode,AName,BList

This performs the second-layer parsing of a to obtain the values of the ACode, AName and BList nodes.

Then a second LATERAL VIEW explode is used:

lateral view explode(split(regexp_replace(regexp_extract(BList,'^\\[(.+)\\]$', 1),'\\}\\,\\{','\\}\\|\\|\\{'),'\\|\\|'))blist as b

This performs the third-layer parsing of the BList node value obtained in the previous step and assigns the result to b.

Finally, a second json_tuple function is used:

lateral view json_tuple(b,'BCode','BName')bi as BCode,BName

This parses b to obtain the values of the BCode and BName nodes.

Step three: the values of ACode, AName, BCode and BName are finally obtained through layer-by-layer analysis.
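The layer-by-layer parsing can also be pictured as two nested loops. The sketch below is an illustration in Python rather than the HiveQL of the embodiment. Only the key names code, name, list, ACode, AName, BList, BCode and BName come from the text; the sample values are hypothetical.

```python
import json


def flatten_rules(record_json):
    """Flatten code/name/list[ACode, AName, BList[BCode, BName]] into flat rows."""
    record = json.loads(record_json)
    rows = []
    for a in record.get("list", []):        # first and second layer
        for b in a.get("BList", []):        # third layer and below
            rows.append({
                "code": record.get("code"),
                "name": record.get("name"),
                "ACode": a.get("ACode"),
                "AName": a.get("AName"),
                "BCode": b.get("BCode"),
                "BName": b.get("BName"),
            })
    return rows


# Hypothetical nested record.
sample = json.dumps({
    "code": "R1", "name": "rule-1",
    "list": [
        {"ACode": "A1", "AName": "a-one",
         "BList": [{"BCode": "B1", "BName": "b-one"},
                   {"BCode": "B2", "BName": "b-two"}]},
        {"ACode": "A2", "AName": "a-two",
         "BList": [{"BCode": "B3", "BName": "b-three"}]},
    ],
})
flat = flatten_rules(sample)
```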

The complete statement:

(3) XML (extensible markup language) format data parsing: there are several ways to parse XML data into a Hive table. One is to add the hivexmlserde jar file and then use the SerDe properties in ROW FORMAT (the properties are set when the table is created, and the data is parsed automatically when it is loaded). Another applies when the XML data is already stored in a temporary table as a single string; the data of each tag can then be obtained with the XPATH functions. Syntax examples are as follows:

xpath('<a><b>b1</b><b>b2</b><c>c1</c></a>','a/b/text()') returns the result: ["b1","b2"]

xpath_string('<a><b>b1</b><b>b2</b></a>','//b') returns the result: b1

xpath_boolean('<a><b>b</b></a>','a/b') returns the result: true
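To experiment with these three kinds of query outside Hive, Python's standard `xml.etree.ElementTree` module supports a limited XPath subset. The sketch below is only an analogy to the Hive functions, and the sample document is an assumption. Paths are written relative to the root element `a`, so `a/b` becomes `b`.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<a><b>b1</b><b>b2</b><c>c1</c></a>")

# Text of every b child, comparable to xpath(..., 'a/b/text()').
texts = [node.text for node in doc.findall("b")]

# Text of the first b element, comparable to xpath_string(..., '//b').
first = doc.findtext(".//b")

# Whether a/b exists at all, comparable to xpath_boolean(..., 'a/b').
exists = doc.find("b") is not None
```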

When an XML data file contains an empty tag and a default value needs to be assigned to it, a custom Hive UDF can be written that modifies the XML data before it is passed to XPath and supplies a value for the empty tag. The following UDF Java code implements this conversion:

The Maven project is exported as a jar file and added to Hive, after which it can be used as a function in HQL statements.
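The idea of giving empty tags a default value before querying can be illustrated in Python. This sketch is not the Java UDF of the embodiment: the function name and the default value "NULL" are assumptions.

```python
import xml.etree.ElementTree as ET


def fill_empty_tags(xml_text, default="NULL"):
    """Give every childless element with no text a default text value."""
    root = ET.fromstring(xml_text)
    for node in root.iter():
        if len(node) == 0 and not (node.text or "").strip():
            node.text = default
    return ET.tostring(root, encoding="unicode")


filled = fill_empty_tags("<a><b/><c>c1</c></a>")
```

Elements that already have child elements are left untouched, so only leaf tags that are genuinely empty receive the default.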

Further, in an embodiment, if the initial data is database data, the preprocessing includes: cleaning non-demand data, wherein the step of preprocessing the initial data comprises:

step S25, screening non-demand data from the initial data according to a preset non-demand data rule;

and step S26, deleting the non-demand data.

In an embodiment, non-demand data in the initial data is deleted according to a preset non-demand data rule. The non-demand data rule is set according to business requirements; similarly to the importance index, data can be ranked or labeled by how much it is needed, and data that does not contribute to the analysis can be deleted. In a specific implementation, the rule can first be set in code, which then performs the screening and deletion.

It should be noted that this step appears very simple: unnecessary fields are deleted. In practice, however, many problems arise, for example: a field that appears unnecessary but is actually important to the business is deleted; a field seems useful, but it is not yet clear how to use it, so it is unclear whether it should be deleted; or the wrong field is deleted in a moment of inattention. For the first two cases, the recommendation is: unless the data volume is so large that it cannot be processed without deleting fields, keep every field that does not have to be deleted. For the third case, backing up the data is recommended.
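The backup recommendation can be made concrete with a small helper that always returns an untouched copy alongside the cleaned rows. This is a minimal sketch: the function name and the sample field names are hypothetical.

```python
import copy


def drop_fields(rows, unwanted):
    """Return (cleaned_rows, backup); the backup is an independent deep copy."""
    backup = copy.deepcopy(rows)
    cleaned = [{k: v for k, v in row.items() if k not in unwanted} for row in rows]
    return cleaned, backup


# Hypothetical rows with one field the business does not need.
rows = [{"asin": "B000000001", "price": 16.99, "internal_note": "tmp"}]
cleaned, backup = drop_fields(rows, unwanted={"internal_note"})
```

If a field turns out to have been removed by mistake, it can be restored from `backup` without re-collecting the data.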

Further, in an embodiment, if the initial data is database data, the preprocessing further includes logic error cleaning, wherein the step of preprocessing the initial data comprises the following steps:

step S27, acquiring first duplicate data in the database data, and performing deduplication processing on the first duplicate data;

In an embodiment, the first duplicate data in the database data is obtained and deduplicated. To prevent bias in the analysis results, problem data that can be found by simple logical reasoning is filtered or cleaned. The first duplicate data is found by counting the fields and by grouping and summarizing them, and is then deduplicated. It is strongly recommended that deduplication be performed after format and content cleaning; otherwise deduplication may be ineffective. For example, extra spaces cause "Zhang San" and "Zhang  San" to be treated as two different people, so deduplication fails.
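The ordering advice (clean first, then deduplicate) can be seen in a short sketch. The helper names and the sample rows are assumptions, and whitespace normalization stands in here for the full format and content cleaning step.

```python
import re


def normalize_text(value):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", value).strip()


def deduplicate(rows, key_fields):
    """Keep the first row for each normalized key; later duplicates are dropped."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(normalize_text(str(row[f])) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows: the first two differ only by extra spaces.
people = [
    {"name": "Zhang San", "city": "Shenzhen"},
    {"name": "Zhang  San ", "city": "Shenzhen"},
    {"name": "Li Si", "city": "Shenzhen"},
]
unique_people = deduplicate(people, key_fields=["name", "city"])
```

Comparing the raw strings instead of the normalized keys would keep all three rows, which is the failure described in the text.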

And step S28, acquiring unreasonable values and contradictory contents in the database data, and respectively cleaning.

In one embodiment, unreasonable values and contradictory contents in the database data are obtained and cleaned respectively. Specifically, removing unreasonable values: for example, if the listing time of a commodity lies in the future, such data is either filtered out, or the unreasonable value is set to null or to a fixed value. Correcting contradictory contents: some fields can be verified against each other. For example, if the identification number is 110103198008060542 but the age is given as 18, it is necessary to determine, according to the data sources of the fields, which field provides the more reliable information, and to remove or reconstruct the unreliable field.
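Both checks can be sketched in a few lines. The sketch assumes an 18-digit resident identification number in which characters 7 to 14 hold the birth date and the 17th character encodes gender (odd for male, even for female); the function names and the reference date are hypothetical.

```python
from datetime import date


def parse_id_number(id_number):
    """Derive birth date and gender from an 18-character identification number."""
    birth = date(int(id_number[6:10]), int(id_number[10:12]), int(id_number[12:14]))
    gender = "male" if int(id_number[16]) % 2 == 1 else "female"
    return birth, gender


def age_on(birth, today):
    """Age in completed years on the given day."""
    before_birthday = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - int(before_birthday)


def age_is_consistent(id_number, stated_age, today):
    """Check the stated age against the birth date encoded in the id number."""
    birth, _ = parse_id_number(id_number)
    return age_on(birth, today) == stated_age


def is_future_listing(listed_on, today):
    """A listing date after today is an unreasonable value."""
    return listed_on > today
```

The consistency check only detects the contradiction; deciding which field to trust still depends on the data sources, as the text explains.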

The above are some examples, and many cases not listed are to be handled as appropriate in actual operation.

According to this embodiment, missing value cleaning, format content cleaning and non-demand data cleaning are performed on the database data before analysis, which reduces inaccurate conclusions caused by data errors, format problems and other defects in the data itself, improves the quality of the data, and thereby improves the accuracy of the analysis results.

The third embodiment of the data-based precision marketing method of the present invention is different from the first and second embodiments in that, if the initial data is file data, the preprocessing includes: logic error cleaning, wherein the step of preprocessing the initial data comprises:

step a, acquiring repeated data in the file data, and performing deduplication processing on the repeated data;

and b, acquiring unreasonable values and contradictory contents in the file data, and cleaning the unreasonable values and the contradictory contents according to a preset logic error cleaning method.

In an embodiment, duplicate data in the file data is obtained and deduplicated. Excel files sometimes need to be analyzed during data analysis; such files do not enter the database through an interface but exist as files, and file data is preprocessed differently from database data. Unreasonable values and contradictory contents are also obtained and cleaned according to a preset logic error cleaning method. Specifically, the preset logic error cleaning method is set as required for the content to be cleaned, and manual judgment can be added to make the cleaning more accurate. For example, when the real-time price of commodity A is obtained, the price obtained from the web page commodity list may be inconsistent with the price set in the background, so a contradiction exists. In this case the inconsistency is caused by the website displaying a wrong value, and the price captured from the web page can be given priority as the commodity price, because the price the customer sees is the price on the web page, and it influences the customer's purchase decision. In a specific implementation, the preset logic error cleaning method can be implemented in code. Of course, the above processing applies only when the data actually has problems; if no duplicate data, unreasonable values or contradictory contents are found, no processing is needed.

In one embodiment, the data to be cleaned is not stored in a relational database but in a file such as a CSV (comma-separated values) file; it is still raw data and needs to be preprocessed. File data does not necessarily have only duplication and logic error problems; it may also have missing values, format and content errors and so on, and the file data preprocessing steps can be added or removed according to the actual situation. For example, the following data has several problems: the column names contain spaces, and there are duplicate rows and missing values. After cleaning, the data is saved again as a CSV file.

asin, grade, type, price, ranking

B09FC1ZX0X,new_2.PlayStation 3.1.99,1

B09BCZ28QJ,new_2,,16.99,2

B09CR4DBD0,PlayStation 3,1399,

B096KDZ886,new 2PlayStation 3.25.09,1

B093FJGYSR,new 236.99,2

B095SNN2TD,,PlayStation 3,29.99,3

B093RW5BWL,new 2,PlayStation 3,22.99,4

B0945FDFBB,,,209.95,5

B098PV8RBZ,new_2,PlayStation 3,25.45,7

Cleaning method: the specific Python code is as follows:

import pandas as pd

df = pd.read_csv("ResourceFile.csv")

# Get the list of column names

ClName = df.columns.values

# Strip spaces from the column names using a list comprehension

df.columns = [x.strip() for x in ClName]

# Delete duplicate rows, modifying the source data in place

df.drop_duplicates(inplace=True)

# Reset the index

df.index = range(df.shape[0])

# Fill missing values in the "grade" column with "no grade"

df.loc[df['grade'].isnull(), 'grade'] = "no grade"

# Fill missing values in the "type" column with "unknown"

df.loc[df['type'].isnull(), 'type'] = "unknown"

# Save the file

df.to_csv("ResourceFile.csv")

Further, in an embodiment, the step of processing the target data according to marketing analysis demand to obtain analysis data includes:

step S31, setting a timing task according to marketing analysis requirements;

step S32, acquiring corresponding data in the target data based on the timing task, constructing a data model, and acquiring the analysis data through the data model.

In one embodiment, a timing task is set according to the marketing analysis requirements; that is, for each analysis dimension, the required data is obtained from the target data and a data model is built to obtain the analysis data. Referring to fig. 5, fig. 5 is a schematic diagram of a data model according to an embodiment of the data-based precision marketing method of the present invention, in which the first row gives the names of the data models: category analysis, commodity analysis and keyword analysis. For category analysis, the following need to be obtained and calculated: total number of commodities, total number of stores, average commodity price, commodity flow value, purchase rate, PPC bidding and the like. For commodity analysis, the following need to be obtained: rank change rate, price interval, number of new comments, number of merchants, delivery mode, sales trend and the like. For keyword analysis, the following need to be obtained and calculated: rank rise, search volume growth rate, market period, supply-demand ratio, click concentration, purchase rate and the like.
Taking category analysis as an example, when the marketing analysis requirement is category analysis, a timing task is set that extracts the commodity price, store number, commodity number and the like from the data and calculates the total number of commodities, the average commodity price and the total number of stores. As can be seen from the figure, different data models require different data, so the timing task is set according to the marketing analysis requirement to obtain the data and build the data model. The market landscape can then be outlined by examining the data model, for example the change in the total number of commodities, the change in the average commodity price, their relationship to other data, and the label information and historical change track of the articles and commodities, so as to obtain the analysis data.
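The category analysis calculation can be sketched as a plain aggregation that a scheduled job would call on each run. Everything specific here is an assumption: the function name, the input keys `category`, `store_id` and `price`, and the sample rows. The scheduling itself is not shown.

```python
from collections import defaultdict


def category_metrics(items):
    """Compute total commodities, total stores and average price per category."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["category"]].append(item)
    metrics = {}
    for category, group in grouped.items():
        prices = [g["price"] for g in group]
        metrics[category] = {
            "total_commodities": len(group),
            "total_stores": len({g["store_id"] for g in group}),
            "average_price": round(sum(prices) / len(prices), 2),
        }
    return metrics


# Hypothetical cleaned target data.
items = [
    {"category": "games", "store_id": "s1", "price": 10.0},
    {"category": "games", "store_id": "s2", "price": 20.0},
    {"category": "games", "store_id": "s1", "price": 30.0},
    {"category": "toys", "store_id": "s3", "price": 5.0},
]
metrics = category_metrics(items)
```

Storing the result of each scheduled run with its run date would give the historical change track that the text uses to outline the market.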

In the embodiment, the file data is preprocessed, specifically, the repeated data in the file data is obtained, the repeated data is subjected to deduplication processing, the unreasonable value and the contradictory content in the file data are obtained, the file data is cleaned according to the preset logic error cleaning method, and the data which can affect the data analysis accuracy in the file data is cleaned, so that the accuracy and the reliability of subsequent data analysis are improved. And when the data is analyzed, the data model is constructed to obtain the analysis data aiming at different marketing requirements, so that the product marketing analysis can be more accurately carried out, the product marketing accuracy is improved, and the data is effectively utilized.

Referring to fig. 6, fig. 6 is a schematic functional module diagram of a first embodiment of the data-based precision marketing device according to the present invention. The data-based precision marketing device includes:

the acquisition module 10 is used for acquiring initial data;

a preprocessing module 20, configured to preprocess the initial data to obtain target data;

and the analysis module 30 is used for processing the target data according to marketing analysis requirements to obtain analysis data, and the analysis data is used for making a product marketing strategy to carry out product marketing.

Optionally, the preprocessing module is further configured to:

and deleting the non-required data.

Optionally, the preprocessing module is further configured to:

Optionally, the analysis module is further configured to:

setting a timing task according to marketing analysis requirements;

Optionally, the analysis module is further configured to:

and carrying out visualization processing on the analysis data.

In addition, an embodiment of the present invention further provides a storage medium, where the storage medium stores a data-based precision marketing program, and the data-based precision marketing program, when executed by a processor, implements the steps of the data-based precision marketing method as described above.

For the method implemented when the data-based precision marketing program is executed on the processor, reference may be made to the embodiments of the data-based precision marketing method of the present invention; details are not repeated here.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A data-based precision marketing method is characterized by comprising the following steps:

collecting initial data;

preprocessing the initial data to obtain target data;

2. The data-based precision marketing method of claim 1, wherein if the initial data is database data, the preprocessing comprises: missing value cleaning, the step of preprocessing the initial data comprising:

3. The data-based precision marketing method of claim 1, wherein if the initial data is database data, the preprocessing comprises: format content cleaning, wherein the step of preprocessing the initial data comprises the following steps:

4. The data-based precision marketing method of claim 1, wherein if the initial data is database data, the preprocessing comprises: cleaning non-demand data, wherein the step of preprocessing the initial data comprises:

and deleting the non-required data.

5. The data-based precision marketing method of claim 1, wherein if the initial data is file data, the preprocessing comprises: logic error cleaning, wherein the step of preprocessing the initial data comprises:

and acquiring unreasonable values and contradictory contents in the file data, and cleaning the unreasonable values and the contradictory contents according to a preset logic error cleaning method.

6. The data-based precision marketing method of claim 1, wherein the step of processing the target data according to marketing analysis requirements to obtain analysis data comprises:

setting a timing task according to marketing analysis requirements;

7. The data-based precision marketing method of claim 1, wherein after the step of processing the target data according to marketing analysis requirements to obtain analysis data, the method further comprises:

and performing visualization processing on the analysis data.

8. A data-based precision marketing device, characterized in that the data-based precision marketing device comprises:

the acquisition module is used for acquiring initial data;

and the analysis module is used for processing the target data according to marketing analysis requirements to obtain analysis data, and the analysis data is used for formulating a product marketing strategy to carry out product marketing.

9. A data-based precision marketing device, comprising: a memory, a processor, and a data-based precision marketing program stored on the memory and executable on the processor, the data-based precision marketing program when executed by the processor implementing the steps of the data-based precision marketing method of any one of claims 1 to 7.

10. A storage medium having stored thereon a data-based precision marketing program, which when executed by a processor implements the steps of the data-based precision marketing method according to any one of claims 1 to 7.