CN105787058A

CN105787058A - User label system and data pushing system based on same

Info

Publication number: CN105787058A
Application number: CN201610110693.9A
Authority: CN
Inventors: 黄永标; 申志刚; 林海棠; 钟威; 文斌; 郭泽波
Original assignee: Guangzhou Pinwei Software Co Ltd
Current assignee: Vipshop Guangzhou Software Co Ltd
Priority date: 2016-02-26
Filing date: 2016-02-26
Publication date: 2016-07-20
Anticipated expiration: 2036-02-26
Also published as: CN105787058B

Abstract

An embodiment of the invention discloses a user tag system and a data push system based on it. It addresses the technical problem that, although many data mining methods exist at present, the mined data cannot be pushed outward precisely, which makes data mining inefficient. The user tag system comprises a unified query engine, a user tag management backend, a Redis cluster and an Index/Solr distributed cluster. The unified query engine is communicatively connected to the user tag management backend. Both the unified query engine and the user tag management backend are communicatively connected to the Index/Solr distributed cluster, and the unified query engine is communicatively connected to the Redis cluster. The Index/Solr distributed cluster scans the HIVE data platform for tag data according to the user tag rules preset in the user tag management backend, and sends the tag data to the Redis cluster for caching.

Description

A user tag system and a data push system based on the user tag system

Technical field

The present invention relates to the field of big data technology, and in particular to a user tag system and a data push system based on the user tag system.

Background art

Big data is a manifestation, or characteristic, of the current stage of Internet development. There is no need to mythologize it or to hold it in awe. Against the backdrop of technological innovation represented by cloud computing, data that was once difficult to collect and use is becoming easy to exploit, and through continuous innovation across all industries big data will gradually create more value for humanity.

Data mining (English: data mining), also translated as data exploration or data excavation, is one step of knowledge discovery in databases (English: Knowledge-Discovery in Databases, abbreviated KDD). Data mining generally refers to the process of algorithmically searching large amounts of data for the information hidden in it. It is usually associated with computer science and achieves this goal through many methods, such as statistics, online analytical processing, information retrieval, machine learning, expert systems (which rely on past rules of thumb) and pattern recognition.

Many data mining methods exist at present, but none of them can push the mined data outward precisely, which leads to the technical problem of inefficient data mining.

Summary of the invention

Embodiments of the present invention provide a user tag system and a data push system based on the user tag system, which solve the technical problem that, although many data mining methods exist at present, the mined data cannot be pushed outward precisely, making data mining inefficient.

A user tag system provided by an embodiment of the present invention comprises:

a unified query engine, a user tag management backend, a Redis cluster and an Index/Solr distributed cluster;

the unified query engine is communicatively connected to the user tag management backend, both the unified query engine and the user tag management backend are communicatively connected to the Index/Solr distributed cluster, and the unified query engine is communicatively connected to the Redis cluster;

wherein the Index/Solr distributed cluster is configured to scan the HIVE data platform for tag data according to the user tag rules preset in the user tag management backend, and to send the tag data to the Redis cluster for caching.

Preferably, the user tag system further comprises:

a business application platform, communicatively connected to the unified query engine and configured to send business requirement function instructions to the unified query engine.

Preferably, the unified query engine specifically comprises:

a tag query unit and a tag rule query unit;

the tag query unit is configured to extract the cached tag data from the Redis cluster and send it to the business application platform for pushing;

the tag rule query unit is configured to query tag rules through the user tag management backend.

Preferably, the Index/Solr distributed cluster specifically comprises:

a user data search unit, a user data storage unit, an index unit and an incremental index update unit;

the incremental index update unit is configured to apply incremental additions and incremental deletions to the tag data of the HIVE data platform according to a preset tag task, tag computation or user group computation.

Preferably, the user tag management backend specifically comprises:

a tag rule definition unit, a tag life cycle unit, a permission management unit and a scheduled task management unit;

the tag rule definition unit, the tag life cycle unit and the permission management unit are all communicatively connected to the scheduled task management unit.

Preferably, the tag task is: to create a tag computation task table, scan the tag computation task table at a first preset scan interval, write changed tag data to the tag computation task table, and generate the corresponding tag task SQL statements to be executed.

Preferably, the tag computation is: to scan the tag computation task table at a second preset scan interval, connect to the HIVE data platform, determine the changes in the tag task SQL statements of the tag computation task table, and perform the corresponding incremental additions and incremental deletions.

Preferably, the user group computation is: to generate the user data corresponding to a user group from the processed tag computation task table.

A data push system based on a user tag system, provided by an embodiment of the present invention, comprises:

a HIVE data platform, and any one of the user tag systems described in the embodiments of the present invention;

the HIVE data platform is communicatively connected to the user tag system;

the HIVE data platform comprises: a data collection module, a data modeling unit and a data cleaning unit;

the HIVE data platform is configured to collect data through the data collection module and model it through the data modeling unit, and to provide tag data to the user tag system in response to the tag data extraction instructions sent by the user tag system.

Preferably, the HIVE data platform is further configured to synchronize user basic information to the Redis cluster of the user tag system for caching.

As can be seen from the above technical solutions, the embodiments of the present invention have the following advantages:

Embodiments of the present invention provide a user tag system and a data push system based on the user tag system. The user tag system comprises a unified query engine, a user tag management backend, a Redis cluster and an Index/Solr distributed cluster. The unified query engine is communicatively connected to the user tag management backend, both the unified query engine and the user tag management backend are communicatively connected to the Index/Solr distributed cluster, and the unified query engine is communicatively connected to the Redis cluster. The Index/Solr distributed cluster scans the HIVE data platform for tag data according to the user tag rules preset in the user tag management backend and sends the tag data to the Redis cluster for caching. In this way the embodiments solve the technical problem that, although many data mining methods exist at present, the mined data cannot be pushed outward precisely, making data mining inefficient.

Brief description of the drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic structural diagram of an embodiment of a user tag system provided by an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of an embodiment of a data push system based on a user tag system provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of the overall architecture of the embodiment of Fig. 2;

Fig. 4 is a schematic diagram of the data flow;

Fig. 5 is a schematic diagram of business operation.

Detailed description of the embodiments

To make the objects, features and advantages of the present invention more apparent and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Referring to Fig. 1, an embodiment of a user tag system provided by an embodiment of the present invention comprises:

a unified query engine 11, a user tag management backend 12, a Redis cluster 13 and an Index/Solr distributed cluster 14;

the unified query engine 11 is communicatively connected to the user tag management backend 12, both the unified query engine 11 and the user tag management backend 12 are communicatively connected to the Index/Solr distributed cluster 14, and the unified query engine 11 is communicatively connected to the Redis cluster 13;

wherein the Index/Solr distributed cluster 14 is configured to scan the HIVE data platform for tag data according to the user tag rules preset in the user tag management backend 12, and to send the tag data to the Redis cluster 13 for caching.

Further, the user tag system also comprises:

a business application platform 15, communicatively connected to the unified query engine 11 and configured to send business requirement function instructions to the unified query engine 11.

Further, the unified query engine 11 specifically comprises:

a tag query unit 111 and a tag rule query unit 112;

the tag query unit 111 is configured to extract the cached tag data from the Redis cluster 13 and send it to the business application platform 15 for pushing;

the tag rule query unit 112 is configured to query tag rules through the user tag management backend 12.

Further, the Index/Solr distributed cluster 14 specifically comprises:

a user data search unit 141, a user data storage unit 142, an index unit 143 and an incremental index update unit 144;

the incremental index update unit 144 is configured to apply incremental additions and incremental deletions to the tag data of the HIVE data platform according to a preset tag task, tag computation or user group computation.

Further, the user tag management backend 12 specifically comprises:

a tag rule definition unit 121, a tag life cycle unit 122, a permission management unit 123 and a scheduled task management unit 124;

the tag rule definition unit 121, the tag life cycle unit 122 and the permission management unit 123 are all communicatively connected to the scheduled task management unit 124.

Further, the tag task is: to create a tag computation task table, scan the tag computation task table at a first preset scan interval, write changed tag data to the tag computation task table, and generate the corresponding tag task SQL statements to be executed.

Further, the tag computation is: to scan the tag computation task table at a second preset scan interval, connect to the HIVE data platform, determine the changes in the tag task SQL statements of the tag computation task table, and perform the corresponding incremental additions and incremental deletions.

Further, the user group computation is: to generate the user data corresponding to a user group from the processed tag computation task table.

In this embodiment, the Index/Solr distributed cluster 14 scans the HIVE data platform for tag data according to the user tag rules preset in the user tag management backend 12 and sends the tag data to the Redis cluster 13 for caching. This solves the technical problem that, although many data mining methods exist at present, the mined data cannot be pushed outward precisely, making data mining inefficient.

Referring to Fig. 2, an embodiment of a data push system based on a user tag system provided by an embodiment of the present invention comprises:

a HIVE data platform 21, and the user tag system 22 described in the embodiment of Fig. 1;

the HIVE data platform 21 is communicatively connected to the user tag system 22;

the HIVE data platform 21 comprises: a data collection module 211, a data modeling unit 212 and a data cleaning unit 213;

the HIVE data platform 21 is configured to collect data through the data collection module 211 and model it through the data modeling unit 212, and to provide tag data to the user tag system 22 in response to the tag data extraction instructions sent by the user tag system 22.

Preferably, the HIVE data platform 21 is further configured to synchronize user basic information to the Redis cluster of the user tag system 22 for caching.

For ease of understanding, the interaction between the HIVE data platform and the user tag system of the embodiments of Fig. 1 and Fig. 2 is described below with a concrete application scenario. Referring to Fig. 3 and Fig. 4, the application example includes:

1. User basic information synchronization

Basic information needs to be synchronized from Hive to Redis. User basic information is divided into two classes, by registered account and by device information, and is stored in Redis cluster 1.

1.1 Processing flow

(1) utr_basic_sync: user basic information synchronization table. The status field is defined as:

0: data to be updated

1: processing (the 0 -> 1 transition requires a table lock)

2: written to Redis successfully

-10: user basic information update failed

data_ver: data version identifier, e.g. 20150413

(2) Create two records in utr_basic_sync, marking account synchronization and device information synchronization respectively.

(3) Create task one, which scans the utr_basic_sync table once every minute. When status is 0 or last_sync_time is not the current day, it sets status=1 and runs the logic that exports user basic information from Hive. (The Hive database needs to flag that its data has been updated.)

(4) Depending on the volume of user and device basic information in Hive, start multiple threads to pull the basic information from Hive into Redis.

(5) Account and device information are both updated incrementally.

(6) Task one runs concurrently.
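The scan-and-claim step of task one can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `SyncRecord` class, the in-memory records and the function names are hypothetical, and the guard that skips records already in processing is an assumption standing in for the table lock the text mentions.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

# Status codes taken from the utr_basic_sync definition above.
PENDING, PROCESSING, WRITTEN, FAILED = 0, 1, 2, -10


@dataclass
class SyncRecord:
    name: str                            # hypothetical label: "account" or "device"
    status: int
    last_sync_time: Optional[datetime]


def needs_sync(rec: SyncRecord, today: date) -> bool:
    """A record is due when it is pending or was not last synced today."""
    stale = rec.last_sync_time is None or rec.last_sync_time.date() != today
    return rec.status == PENDING or stale


def claim(rec: SyncRecord, today: date) -> bool:
    """Move a due record to PROCESSING (status 1).

    Assumption: a record already being processed is skipped, which is what
    the table lock on the 0 -> 1 transition would guarantee in a database.
    """
    if rec.status == PROCESSING or not needs_sync(rec, today):
        return False
    rec.status = PROCESSING
    return True
```

A scheduler would call `claim` on both records once per minute and start the Hive export only for the records it managed to claim.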

1.2 Incremental update

SQL is used to compare the changed fields between two Hive tables.
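The text performs this comparison in SQL inside Hive. The same idea can be illustrated in Python on two in-memory snapshots keyed by uid. The snapshot layout and the function name are assumptions made for the sketch, and reporting removed rows is an addition the text does not spell out for basic information.

```python
def diff_snapshots(previous, current):
    """Compare two {uid: row} snapshots.

    Returns (upserts, deletes): rows that are new or whose fields changed,
    and the uids that disappeared from the current snapshot.
    """
    upserts = {uid: row for uid, row in current.items()
               if previous.get(uid) != row}
    deletes = sorted(uid for uid in previous if uid not in current)
    return upserts, deletes


yesterday = {"1": {"phone": "a"}, "2": {"phone": "b"}, "4": {"phone": "x"}}
today = {"1": {"phone": "a"}, "2": {"phone": "c"}, "3": {"phone": "d"}}
upserts, deletes = diff_snapshots(yesterday, today)
# Only uid 2 (changed) and uid 3 (new) need to be written; uid 4 was removed.
```

Writing only `upserts` to Redis keeps the daily synchronization proportional to the amount of change rather than to the size of the full table.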

1.3 User basic information format in Redis

"uid":"",

"user_id":"",

"phone":"",

"mail":"",

"tokens":[mid_deviceToken_appName,mid_deviceToken_appName]

Device information

mid_deviceToken_appName,mid_deviceToken_appName

To make it easy for the API to specify which fields to return, accounts are stored as a map. In addition, the tokens field uses the agreed mid_deviceToken_appName format. When device information is updated, the update must also be applied to the tokens field of the corresponding account.
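The record layout above can be sketched with plain Python dictionaries standing in for Redis maps. The helper names are hypothetical, and serializing the tokens list as JSON inside the map is an assumption made here because map values are flat strings; the text does not say how the list is encoded.

```python
import json


def make_token(mid, device_token, app_name):
    """Build a token in the agreed mid_deviceToken_appName format."""
    return f"{mid}_{device_token}_{app_name}"


def new_account(uid, user_id, phone="", mail=""):
    """Account map with the fields listed above; tokens starts empty."""
    return {"uid": uid, "user_id": user_id, "phone": phone,
            "mail": mail, "tokens": json.dumps([])}


def attach_device(account, token):
    """Mirror a device update into the account's tokens field."""
    tokens = json.loads(account["tokens"])
    if token not in tokens:
        tokens.append(token)
    account["tokens"] = json.dumps(tokens)
    return account


def pick_fields(account, fields):
    """Return only the requested fields, as an API caller might ask for."""
    return {f: account[f] for f in fields if f in account}
```

Storing the account as a map is what makes `pick_fields` cheap: a caller that needs only `phone` does not have to fetch and parse the whole record.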

1.4 Synchronization failure handling

(1) Setting status=0 in the utr_basic_sync table causes the task merge operation of utr_tag_task to be re-executed.

(2) Detailed error information is printed in the log, and the failure is recorded in utr_task_log.

2. Tag task generation

2.1 Processing flow

(1) utr_tag_task: tag computation task table. Each day, task scheduling merges the tags that involve the same tag table into a single task and writes it to the utr_tag_task table. The status field is defined as:

0: pending

1: querying

2: performing incremental add/update operation

3: performing incremental delete operation

-10: Hive query failed

-11: incremental add/update operation failed

-12: incremental delete operation failed

data_ver: data version identifier, e.g. 20150413

(2) Create task two, which each day scans the utr_basic_sync table once per minute. When status is 0 or last_sync_time is not the current day, it sets status=1, fetches all records of the utr_tag table, merges the tags that belong to the same table into one task, and writes the task to the utr_tag_task table. Each SQL statement to be executed must be generated in advance.

(A field is added to the utr_tag table to indicate the data table an attribute belongs to. log_id is renamed to task_id.)

(3) Task two does not need to run concurrently.
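The merge performed by task two, grouping tags by the table they read from and pre-generating one statement per group, can be sketched as below. The tag dictionary keys (`code`, `table`, `condition`) and the shape of the generated SQL are illustrative assumptions; the text only states that tags of the same table become one task with pre-generated SQL.

```python
from collections import defaultdict


def merge_tags_into_tasks(tags, data_ver):
    """Group tag definitions by source table and emit one task per table."""
    by_table = defaultdict(list)
    for tag in tags:
        by_table[tag["table"]].append(tag)

    tasks = []
    for table in sorted(by_table):
        group = by_table[table]
        # Hypothetical SQL shape: one output column per tag code.
        columns = ", ".join(f"{t['condition']} AS {t['code']}" for t in group)
        tasks.append({
            "table": table,
            "tag_codes": [t["code"] for t in group],
            "status": 0,              # 0 = pending, as defined above
            "data_ver": data_ver,
            "create_sql": f"SELECT uid, {columns} FROM {table}",
        })
    return tasks


tags = [
    {"code": "t1", "table": "dw.brand", "condition": "cnt > 3"},
    {"code": "t2", "table": "dw.cat", "condition": "x = 1"},
    {"code": "t3", "table": "dw.brand", "condition": "amt > 100"},
]
tasks = merge_tags_into_tasks(tags, "20150413")
```

Merging means each source table is scanned once per day regardless of how many tags are defined on it.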

2.2 Synchronization failure handling

(1) Detailed error information is printed in the log, and the failure is recorded in utr_task_log.

(2) Setting status=0 in utr_tag makes it possible to export the file of this tag again.

3. Tag computation (hive2solr)

3.1 Processing flow

(1) utr_tag_task: tag task table. The status field is defined as:

0: pending

1: querying

2: performing incremental add/update operation

3: performing incremental delete operation

-10: Hive query failed

-11: incremental add/update operation failed

-12: incremental delete operation failed

data_ver: data version identifier, e.g. 20150413

(2) Create task three, which scans the utr_tag_task table every minute. If status=0 or data_ver is earlier than the current day, it sets status=1. (Hive needs to provide a way to query whether the data is fully prepared.)

(3) The processing logic of each task is as follows:

Connect to Hive and execute the create_sql in the utr_tag_task table; according to the combination of tag conditions, query the results and save them in Hive. The result format is (uid, tagCode1, tagCode2, tagCode3, ...).

Execute the SQL for incremental additions and updates.

Execute the SQL for incremental deletions.

Wait for the SQL to finish, then update status and data_ver in the utr_tag_task and utr_tag tables.

(4) Task three may run concurrently.

3.2 Synchronization failure handling

(2) If any one of the threads fails, the whole tag is marked as failed to update.

(3) Updating the status field in utr_tag makes it possible to import the file of this tag again. Duplicate codes within a tag have no effect.

3.3 User group computation

All user groups are preprocessed to generate the corresponding user data; a business system can use a user group only after its data has been generated.

3.3.1 Processing flow

(1) utr_group table: user group data table, which also serves as the user group task table. status (the external state; the API interface checks only the status field) is defined as:

0: created (user group creation complete)

1: ready

-10: data preparation failed

data_ver: data version identifier, e.g. 20150413

sync_status (synchronization state) is defined as:

0: creating (state of a newly created user group)

1: processing

2: updating cached data

3: complete

-10: tag data not ready

-11: tag data preparation failed

-12: user basic information preparation failed (cached data exists in Redis, but data_ver has not been updated)

(2) Create task four, which scans the utr_group table every minute for records that meet the following condition:

sync_status=0, or last_sync_time earlier than the current day, or (sync_status=-10 and last_sync_time at least 5 minutes ago). It takes the top 1 record each time, sets sync_status=1, and runs the following logic:

Check whether the tags are ready (data_ver is the current day). If not, set sync_status=-10 and update last_sync_time, without writing to the utr_task_log table;

If any tag data failed to prepare, set sync_status=-11 directly, update last_sync_time, and write a log entry to the utr_task_log table;

If the tag data is ready, take the user group conditions, assemble the Solr query statement, and obtain the total record count and the maximum uid from Solr;

If the total record count exceeds 5 million, process with multiple threads according to the performance optimization scheme;

Collect the results of the multithreaded operation; on failure, remove the records already stored in Redis;

If the total record count is below 5 million, process directly in a single thread; on failure, remove the records already stored in Redis;

On success, check whether user basic information was updated successfully. If so, update the data_ver field directly; if not, set sync_status=-12 and set the value of next_data_ver.

(3) Task four may run concurrently. The number of threads available to task four in quartz is set to 6, so at most 60 threads write to Redis at the same time.
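The selection condition used by task four can be written out as a small predicate. The sketch below is illustrative only: records are plain dictionaries, the function names are hypothetical, and "earlier than the current day" is read as a comparison of calendar dates.

```python
from datetime import datetime, timedelta

RETRY_DELAY = timedelta(minutes=5)   # retry window for sync_status = -10


def is_due(sync_status, last_sync_time, now):
    """True when a utr_group record should be picked up by task four."""
    if sync_status == 0:
        return True
    if last_sync_time is None or last_sync_time.date() < now.date():
        return True
    return sync_status == -10 and now - last_sync_time >= RETRY_DELAY


def pick_next(records, now):
    """Take the first due record (the 'top 1') and mark it as processing."""
    for rec in records:
        if is_due(rec["sync_status"], rec["last_sync_time"], now):
            rec["sync_status"] = 1
            return rec
    return None
```

Because each run claims a single record, six scheduler threads can work on six different user groups at once, which together with at most 10 cache writers per group matches the stated ceiling of 60 threads writing to Redis.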

3.3.2 Cache update performance optimization

If the total number of users returned by the Solr query exceeds 5 million, the number of threads is calculated on the basis of 3 million records per thread (at most 10 threads are started), and that number of threads is started from the cache update thread pool (initial size 50). A single list is used in Redis to store the user group result; the key rule is: group_code_data_ver;
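The thread-count rule can be worked through in a few lines of Python. The constants come from the text; the even partitioning of the result set and the reading of the key rule as the group code and data_ver joined by an underscore are assumptions made for this sketch.

```python
MULTI_THREAD_THRESHOLD = 5_000_000   # above this, split the work
RECORDS_PER_THREAD = 3_000_000       # sizing basis per thread
MAX_THREADS = 10


def plan_threads(total):
    """Number of cache-update threads for a Solr result of `total` users.

    The text does not say what happens at exactly 5 million records;
    this sketch treats that case as single-threaded.
    """
    if total <= MULTI_THREAD_THRESHOLD:
        return 1
    needed = -(-total // RECORDS_PER_THREAD)   # integer ceiling division
    return min(MAX_THREADS, needed)


def partition(total, threads):
    """Split [0, total) into contiguous half-open ranges, one per thread."""
    if total <= 0:
        return []
    chunk = -(-total // threads)
    return [(start, min(start + chunk, total))
            for start in range(0, total, chunk)]


def cache_key(group_code, data_ver):
    """Assumed reading of the key rule group_code_data_ver."""
    return f"{group_code}_{data_ver}"
```

With the cap of 10 threads, a result set larger than 30 million users gives each thread more than 3 million records, so the per-thread figure is a sizing basis rather than a hard limit.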

3.3.3 Synchronization failure handling

(1) sync_status=-10: use the retry function provided by the interface to set sync_status=0 and re-run the user group cache update.

(2) sync_status=-11: first re-run the user basic information synchronization; once it succeeds, update data_ver through the data version update function provided by the interface.

(3) Detailed error information is printed in the log, and the failure is recorded in utr_task_log.

4. Data cleanup

4.1 Redis data cleanup

(1) User basic information: user basic information is not versioned. Each update sets the expiry time to 5 days later, so cleanup need not be considered.

(2) User groups: when the data_ver of a user group is updated, the expiry time of the previous version is set to 24 hours later.

4.2 Solr data cleanup

Solr retains 2 days of data: before creating a new collection, it first deletes the collection from two days earlier. For example, when synchronizing on 20150415, the data of 20150413 is deleted and the data of 20150414 is retained.
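The retention rule in the example can be computed directly from the data version string. The function name and the returned dictionary are hypothetical; only the date arithmetic follows from the text.

```python
from datetime import datetime, timedelta

VERSION_FORMAT = "%Y%m%d"   # data_ver format, e.g. 20150413


def versions_after_sync(sync_ver):
    """Which collection versions to delete and keep when syncing `sync_ver`."""
    day = datetime.strptime(sync_ver, VERSION_FORMAT)

    def fmt(d):
        return d.strftime(VERSION_FORMAT)

    return {
        "delete": fmt(day - timedelta(days=2)),
        "keep": [fmt(day - timedelta(days=1)), sync_ver],
    }


plan = versions_after_sync("20150415")
# plan["delete"] is "20150413"; plan["keep"] is ["20150414", "20150415"]
```

Doing the subtraction on dates rather than on the integer form of the version keeps the rule correct across month boundaries.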

5. External interface (HTTP)

The callers of the API interface are internal systems and concurrency is low, so Tomcat is used to provide the HTTP interface service.

5.1 Interface processing flow

(1) Check that the user group has status=1; if not, return an error message.

(2) Take data_ver and assemble the key under which the data is stored in Redis.

(3) From pageSize and pageNo, locate the list to read from and the cursor start position start.
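The three interface steps can be sketched as follows, with a dictionary standing in for the Redis list store. Page numbers are assumed to start at 1, the key format reuses the assumed group-code-plus-data_ver reading of the key rule, and all names are hypothetical.

```python
def page_range(page_no, page_size):
    """Inclusive (start, stop) indexes for a 1-based page number."""
    if page_no < 1 or page_size < 1:
        raise ValueError("page_no and page_size must be positive")
    start = (page_no - 1) * page_size
    return start, start + page_size - 1


def fetch_page(group, store, page_no, page_size):
    """Status check, key assembly, then a ranged read of the group list."""
    if group["status"] != 1:                     # step (1): group must be ready
        return {"error": "user group is not ready"}
    key = f"{group['group_code']}_{group['data_ver']}"   # step (2)
    start, stop = page_range(page_no, page_size)         # step (3)
    return {"data": store.get(key, [])[start:stop + 1]}


store = {"g1_20150413": [f"u{i}" for i in range(5)]}
group = {"group_code": "g1", "data_ver": "20150413", "status": 1}
second_page = fetch_page(group, store, page_no=2, page_size=2)
```

The inclusive `(start, stop)` pair is chosen to match how a ranged list read such as Redis `LRANGE` treats its end index.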

Further, the data push system based on the user tag system has an alarm monitoring function: alarms are aggregated and sent every 30 minutes, and then reset.

The data push system based on the user tag system can operate its business as shown in Fig. 5. The user tag system uses one unified set of user tags (a 360-degree panoramic user profile) and provides the ability to segment specific users by tag. This enables precision marketing and personalized recommendation, unifies the marketing user groups, connects the marketing, advertising, promotion and personalized recommendation data flows, and manages the life cycle of tagged user groups.

It should be noted that, in this embodiment, user data distinguishes between user accounts and device information. Tags relating to the brand and category tables must be applied to the device and to the user at the same time (when a tag is defined, it is decided which tags need to be applied in both Solr indexes simultaneously).

In this embodiment, the performance test of the Solr distributed cluster is shown in Table 1 below:

The Solr cluster was built on 3 cloud hosts using the default Solr configuration;

Table 1

Write performance is shown in Table 2:

Table 2

Query performance is shown in Table 3 and Table 4:

The query performance bottleneck lies mainly in deep paging;

Table 3

Data is fetched page by page; adjusting the number of records returned per page can also improve performance;

Table 4

Redis paging performance is shown in Table 5:

Data volume	Total time	Time per page (20,000 records per page)
6180000	188841 ms	600 ms

Table 5
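As a quick consistency check on the figures in Table 5, the page count and the average time per page can be recomputed from the reported totals. This is plain arithmetic on the published numbers, not an additional measurement.

```python
import math

total_records = 6_180_000   # data volume from Table 5
page_size = 20_000          # records per page from Table 5
total_ms = 188_841          # total time from Table 5

pages = math.ceil(total_records / page_size)
avg_ms_per_page = total_ms / pages
```

The result is 309 pages at roughly 611 ms each, in line with the approximately 600 ms per page reported in the table.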

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function; in an actual implementation there may be other ways of dividing them, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical, mechanical or of other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a portable hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a magnetic disk or an optical disc.

The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A user tag system, characterized by comprising:

2. The user tag system according to claim 1, characterized in that the user tag system further comprises:

3. The user tag system according to claim 2, characterized in that the unified query engine specifically comprises:

a tag query unit and a tag rule query unit;

4. The user tag system according to claim 3, characterized in that the Index/Solr distributed cluster specifically comprises:

5. The user tag system according to claim 4, characterized in that the user tag management backend specifically comprises:

6. The user tag system according to claim 5, characterized in that the tag task is: to create a tag computation task table, scan the tag computation task table at a first preset scan interval, write changed tag data to the tag computation task table, and generate the corresponding tag task SQL statements to be executed.

7. The user tag system according to claim 4 or 6, characterized in that the tag computation is: to scan the tag computation task table at a second preset scan interval, connect to the HIVE data platform, determine the changes in the tag task SQL statements of the tag computation task table, and perform the corresponding incremental additions and incremental deletions.

8. The user tag system according to claim 7, characterized in that the user group computation is: to generate the user data corresponding to a user group from the processed tag computation task table.

9. A data push system based on a user tag system, characterized by comprising:

a HIVE data platform, and the user tag system according to any one of claims 1 to 8;

10. The data push system based on a user tag system according to claim 9, characterized in that the HIVE data platform is further configured to synchronize user basic information to the Redis cluster of the user tag system for caching.