CN111190938B

CN111190938B - Data analysis method, device, storage medium and processor

Info

Publication number: CN111190938B
Application number: CN201911368912.3A
Authority: CN
Inventors: 文诗奇; 谭国苹; 李刚毅
Original assignee: Beyondsoft Corp
Current assignee: Beyondsoft Corp
Priority date: 2019-12-26
Filing date: 2019-12-26
Publication date: 2023-08-01
Anticipated expiration: 2039-12-26
Also published as: CN111190938A

Abstract

The invention discloses a data analysis method, a data analysis device, a storage medium and a processor. The method comprises: acquiring a sequence data set of user services, wherein each sequence in the sequence data set corresponds to the usage data of the services used by one user, each sequence comprises at least one element, and each element comprises at least one event, the event representing the addition or loss of a user service; determining a target event among the events of the sequence data set, and determining each sequence that includes the target event as a target sequence; and performing sequence pattern mining on the target sequence to determine a related sequence of the target event and the degree of association between the related sequence and the target event, wherein the related sequence represents at least one event associated with the target event in the sequences that include the target event. The method solves the technical problem that analysis models based on classical frequent pattern mining algorithms have a limited application range.

Description

Data analysis method, device, storage medium and processor

Technical Field

The present invention relates to the field of data processing, and in particular, to a data analysis method, apparatus, storage medium, and processor.

Background

With the development of information technology and the popularization of cloud services, the services provided by cloud service providers are increasingly diversified, covering storage, computing, management, optimization, artificial intelligence and other areas. Users' use of this large and diverse set of services implies certain latent behavior patterns or rules. These rules either reflect the common consumption habits of the user population or reveal inherent correlations between service products. They can help product suppliers better position products, find potential user groups, establish user churn early warning, and better meet the needs of existing users. For example, assume that the following behavior pattern exists in a user population: < { add service A, add service B } { add service C } { cancel service B } >.

Then we can guess that service C may have some dependency on services A and B, and that users' demand for service B is more flexible than their demand for service A. Therefore, we can recommend service C to users with the behavior pattern < { add service A, add service B } >, and on the other hand establish a churn early warning for service B for users with the behavior pattern < { add service A, add service B } >.

Classical frequent pattern mining methods such as Apriori and FP-Growth provide powerful analysis tools for us, but all have the problems of limited application range, long running time, large memory use and the like, and are not convenient to directly apply to commercial analysis models. Therefore, the invention uses practicality and high efficiency as a core, and applies an improved sequence pattern mining algorithm around different analysis angles to provide a product positioning and user analysis model.

Aiming at the problem that the application range of the analysis model based on the classical frequent pattern mining algorithm is limited, no effective solution is proposed at present.

Disclosure of Invention

The embodiment of the invention provides a data analysis method, a device, a storage medium and a processor, which are used for at least solving the technical problem that the application range of an analysis model based on a classical frequent pattern mining algorithm is limited.

According to an aspect of an embodiment of the present invention, there is provided a data analysis method including: acquiring a sequence data set of user services, wherein each sequence in the sequence data set corresponds to service use data of a user, each sequence comprises at least one element, and each element comprises at least one event which is used for representing the addition or loss of the user services; determining a target event in at least one event of the sequence data set, and determining a sequence comprising the target event as a target sequence; sequence pattern mining is performed on the target sequence, a related sequence of the target event is determined, and the association degree of the related sequence and the target event is determined, wherein the related sequence represents at least one event associated with the target event in the sequence comprising the target event.

Further, acquiring the sequence data set of the user service includes: collecting usage data of user services from a database, wherein the usage data comprises user numbers, time, events of the used services and usage amount; generating a start schedule and a churn schedule of the user service based on the usage data, wherein the start schedule represents the time when each user starts using the user service and the churn schedule represents the time when each user stops using the user service; and generating the sequence data set according to the start schedule and the churn schedule.

Further, performing sequence pattern mining on the target sequence, and determining a related sequence of the target event includes: determining an event occurrence frequency table of the sequence data set, wherein the event occurrence frequency table is used for representing the frequency of all events in the sequence data set; setting a support threshold of the target event based on the event occurrence frequency table; and carrying out sequence pattern mining on the target sequence based on the support threshold value, and determining the target event related sequence.

Further, performing sequence pattern mining on the target sequence, and determining a related sequence of the target event includes: determining a frequent sequence of the target sequence in a sequence data set, wherein the frequent sequence comprises frequent adjacent events of the target event; determining the number of sequences of the frequent sequences; removing invalid sequences in the frequent sequences to obtain valid frequent sequences under the condition that the number of the sequences of the frequent sequences is higher than the number of the preset sequences; and calculating the support degree of the effective frequent sequence, and determining the effective frequent sequence with the support degree higher than a support degree threshold value as the related sequence.

Further, after determining a target event in at least one event of the sequence data set, the method further comprises: determining the adjacent frequent event sets of the target event, wherein the adjacent frequent event sets comprise a pre-adjacency event set, a concurrent event set and a post-adjacency event set, and the events in each adjacent frequent event set are arranged in descending order of their degree of association with the target event; and generating a target subsequence according to the degree of association between the events in the adjacent frequent event sets and the target event, wherein the target subsequence represents the usage path of the user service.

According to another aspect of the embodiment of the present invention, there is also provided a data analysis apparatus including: an acquisition unit, configured to acquire a sequence data set of a user service, where each sequence in the sequence data set corresponds to usage data of a service used by a user, each sequence includes at least one element, and each element includes at least one event, where the event is used to represent addition or loss of the user service; a first determining unit, configured to determine a target event in at least one event of the sequence data set, and determine a sequence including the target event as a target sequence; the second determining unit is used for performing sequence pattern mining on the target sequence, determining a related sequence of the target event and the association degree of the related sequence and the target event, wherein the related sequence represents at least one event associated with the target event in the sequence comprising the target event.

Further, the acquisition unit includes: an acquisition module, configured to collect usage data of user services from a database, wherein the usage data comprises user numbers, time, events of the used services and usage amount; a first generation module, configured to generate a start schedule and a churn schedule of the user service based on the usage data, where the start schedule represents the time when each user starts using the user service and the churn schedule represents the time when each user stops using the user service; and a second generation module, configured to generate the sequence data set according to the start schedule and the churn schedule.

Further, the second determination unit includes: the first determining module is used for determining an event occurrence frequency table of the sequence data set, wherein the event occurrence frequency table is used for representing the frequency of all events in the sequence data set; the setting module is used for setting a support threshold value of the target event based on the event occurrence frequency table; and the second determining module is used for carrying out sequence pattern mining on the target sequence based on the support threshold value and determining the related sequence of the target event.

According to another aspect of the embodiment of the present invention, there is also provided a storage medium, where the storage medium includes a stored program, and when the program runs, the device where the storage medium is located is controlled to execute the data analysis method described above.

According to still another aspect of the embodiment of the present invention, there is further provided a processor, where the processor is configured to execute a program, and the program executes the data analysis method described above.

In the embodiment of the invention, a sequence data set of user services is obtained, wherein each sequence in the sequence data set corresponds to the use data of the service used by one user, each sequence comprises at least one element, and each element comprises at least one event which is used for representing the addition or loss of the user services; then determining a target event in at least one event of the sequence data set, and determining a sequence comprising the target event as a target sequence; and then sequence pattern mining is carried out on the target sequence, a related sequence of the target event and the association degree of the related sequence and the target event are determined, wherein the related sequence represents at least one event associated with the target event in the sequence comprising the target event, the purpose of analyzing a sequence data set of user service is achieved, the technical effect of improving the application range of the sequence data set is achieved, and the technical problem that the application range of an analysis model based on a classical frequent pattern mining algorithm is limited is solved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:

FIG. 1 is a flow chart of a data analysis method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a data analysis system according to an embodiment of the invention;

FIG. 3 is a schematic diagram of a sequence pattern generation apparatus according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a frequent user behavior event stream analysis device, in accordance with an embodiment of the present invention;

FIG. 5 is a schematic diagram of the basic structure of an Apriori-like algorithm for sequence pattern discovery, according to an embodiment of the invention;

FIG. 6 is a schematic diagram of a user behavior pattern sequence pattern mining apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a data visualization interface apparatus, according to an embodiment of the invention;

FIG. 8 is a schematic diagram of a data analysis device according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

According to an embodiment of the present invention, there is provided a data analysis method embodiment, it being noted that the steps shown in the flowcharts of the drawings may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that herein.

Fig. 1 is a flowchart of a data analysis method according to an embodiment of the present invention, as shown in fig. 1, the method includes the steps of:

step S102, a sequence data set of user services is obtained, wherein each sequence in the sequence data set corresponds to the use data of the service used by one user, each sequence comprises at least one element, and each element comprises at least one event which is used for representing the addition or loss of the user services;

step S104, determining a target event in at least one event of the sequence data set, and determining a sequence comprising the target event as a target sequence;

step S106, sequence pattern mining is carried out on the target sequence, a related sequence of the target event and the association degree of the related sequence and the target event are determined, wherein the related sequence represents at least one event associated with the target event in the sequence comprising the target event.

It should be noted that, the user service refers to service contents that each user enjoys, such as service/product names, consumption (such as use time or consumption amount, etc.), for example, the user purchases a certain commodity, subscribes to which service, etc.

Optionally, the degree of association between the related sequence and the target event includes confidence, lift and the like.

Alternatively, the related sequence may be a sequence that includes the target event.

Alternatively, the related sequence may be a predecessor sequence of the target event, that is, a part of a sequence comprising the target event whose events occur before the target event.

As an alternative embodiment, acquiring the sequence data set of the user service comprises: collecting usage data of user services from a database, wherein the usage data comprises user numbers, time, events of the used services and usage amount; generating a start schedule and a churn schedule of the user service based on the usage data, wherein the start schedule represents the time when each user starts using the user service and the churn schedule represents the time when each user stops using the user service; and generating a sequence data set according to the start schedule and the churn schedule.

According to this embodiment of the invention, usage data may be stored in the database whenever a user uses a user service, with the database recording the user number, the time, the event of the service used and the usage amount. From the usage data, the time at which each user starts using a user service and the time at which the user stops using it can be determined, yielding the start schedule and the churn schedule, from which the sequence data set is then determined.

Optionally, generating the sequence data set according to the start schedule and the churn schedule includes: and eliminating the use data with the use time lower than a preset time threshold or the use amount lower than a preset use amount threshold based on the starting time table and the loss time table, thereby determining a sequence data set and completing denoising of the use data of the user service.

As an alternative embodiment, performing sequence pattern mining on the target sequence, and determining the related sequence of the target event includes: determining an event occurrence frequency table of the sequence data set, wherein the event occurrence frequency table is used for representing the frequency of all events in the sequence data set; setting a support threshold of a target event based on an event occurrence frequency table; and performing sequence pattern mining on the target sequence based on the support threshold value, and determining the related sequence of the target event.

According to this embodiment of the invention, the support between each event in the sequence data set and the target event can be determined based on the event occurrence frequency table of the sequence data set, and sequence pattern mining can then be carried out on the target sequence containing the target event based on the support threshold of the target event, so that the related sequence of the target event is determined.

As an alternative embodiment, performing sequence pattern mining on the target sequence, and determining the related sequence of the target event includes: determining a frequent sequence of the target sequence in the sequence data set, wherein the frequent sequence comprises frequent adjacent events of the target event; determining the number of sequences of the frequent sequences; under the condition that the number of sequences of the frequent sequences is higher than the number of preset sequences, invalid sequences in the frequent sequences are removed, and effective frequent sequences are obtained; and calculating the support degree of the effective frequent sequence, and determining the effective frequent sequence with the support degree higher than the support degree threshold value as the related sequence.

In the above embodiment of the present invention, the number of sequences in the sequence data set may be huge during sequence pattern mining of the target sequence, and the distribution of events (i.e., service additions and service losses) in the sequence data set may be highly unbalanced. To improve mining efficiency, the number of frequent sequences needed during mining may therefore be counted and the running cost of the mining process estimated, so that the number of frequent sequences can be reduced further: when the running cost of the mining process is high, invalid sequences among the frequent sequences may be removed by means of a sequence scan.

Note that, the invalid sequence may be a sequence with low support in the frequent sequence.
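The support filtering described above can be sketched as follows. This is a minimal illustration rather than the patented algorithm: the function names (contains, support, related_sequences) and the toy data are hypothetical, and the containment test uses the usual sequential-pattern definition (each pattern element is a subset of some sequence element, in order), which the text does not spell out.

```python
def contains(sequence, pattern):
    """True if the pattern's elements appear in order as subsets of the
    sequence's elements (assumed containment rule, not from the source)."""
    i = 0
    for element in sequence:
        if i < len(pattern) and set(pattern[i]) <= set(element):
            i += 1
    return i == len(pattern)


def support(sequences, pattern):
    """Fraction of sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


def related_sequences(target_sequences, candidates, min_support):
    """Keep candidates whose support is higher than the threshold."""
    scored = [(p, support(target_sequences, p)) for p in candidates]
    return [(p, s) for p, s in scored if s > min_support]


# Hypothetical target sequences: each is a list of elements (lists of events).
target_sequences = [
    [["+A"], ["+B"], ["+C", "-A"]],
    [["+A"], ["+B", "+D"], ["+C"]],
    [["+B", "+C"], ["-C"]],
]
candidates = [[["+B"], ["+C"]], [["+B"], ["-C"]]]
print(related_sequences(target_sequences, candidates, min_support=0.5))
```

Candidates that fall at or below the threshold play the role of the "invalid sequences" here; a real implementation would generate the candidates by mining rather than take them as input.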

As an alternative embodiment, after determining the target event in at least one event of the sequence data set, the method further comprises: determining the adjacent frequent event sets of the target event, wherein the adjacent frequent event sets comprise a pre-adjacency event set, a concurrent event set and a post-adjacency event set, and the events in each adjacent frequent event set are arranged in descending order of their degree of association with the target event; and generating a target subsequence according to the degree of association between the events in the adjacent frequent event sets and the target event, wherein the target subsequence represents the usage path of the user service.

In the above embodiment of the present invention, the pre-adjacency event set, the concurrent event set, and the post-adjacency event set of the target event may be determined based on the sequence data set, and elements in the pre-adjacency event set, the concurrent event set, and the post-adjacency event set, and the events included in the elements and the target event are arranged in descending order according to the association degree, so as to generate a target sub-sequence, where the target sub-sequence represents a path of the user service through which the target event needs to occur.

The present invention also provides a preferred embodiment that provides a user behavior analysis and product location model based on time series patterns.

The technical scheme provided by the invention aims at: by analyzing the data sequence of a large number of users using the service, the user behavior sequence mode is mined, so that the product positioning and the user loss early warning of each service in the service set and the service recommendation reference for a specific user group are realized. Therefore, the invention establishes various models and realizes the mining of the occurrence rules and the like of frequent subsequences and adjacent elements of specific interested elements in the sequence containing a small amount or a large amount of elements and events.

Fig. 2 is a schematic diagram of a data analysis system according to an embodiment of the present invention, as shown in fig. 2, after raw data is input into the data analysis system, analysis of the raw data may be implemented by a sequence pattern generating device 202, a frequent user behavior event stream analyzing device 204, a user behavior pattern sequence pattern mining device 206, and a data visualization interface device 206 in the system.

Alternatively, the sequence pattern generating means may acquire a data set satisfying the condition from the user history service use database, and further perform data conversion to generate a sequence pattern set (i.e., a sequence data set) of the user use service.

Alternatively, the frequent user behavior Event stream (Event stream) analysis means may generate three contiguous frequent Event sets of an input start Event (start Event) as a Target Event (Target Event), including: a pre-adjacency event set (Preceding events set), a concurrent event set (Concurrent events set), and a post-adjacency event set (Following events set), the elements in the three sets of events being arranged in descending order of relevance to the target event.

Further, the events in the three sets of events are used as input options, the selected events are added to the target events to form a new target subsequence, and the device generates three new rule sets according to the existing target subsequence and works in a logic loop.

Optionally, the user behavior sequence pattern mining device may generate a sequence pattern and its support according to the input sequence data set and the input minimum support value (minimal support value), and if the input sequence data set is a subset screened according to a certain target event, calculate the correlation degree between the generated sequence pattern and the target event at the same time.

Optionally, the data visualization user interface device may generate an interactive user interface based on the sequence pattern set (i.e. the sequence data set) generated by the sequence pattern generating device, in combination with the frequent user behavior event stream analyzing device and the user behavior pattern sequence pattern mining device, and perform visual presentation based on the analysis results of the frequent user behavior event stream analyzing device 204 and the user behavior pattern sequence pattern mining device.

Optionally, the sequence pattern mining algorithm part and the user interface can interact through flash to better realize the event stream presentation of the frequent user behavior event stream analysis device.

Alternatively, visualization of the analysis results may be achieved by d3.js.

Fig. 3 is a schematic diagram of a sequence pattern generating apparatus according to an embodiment of the present invention, as shown in fig. 3, in which input raw data is cleaned and filtered, a user/service usage start schedule and a user/service usage end schedule are generated based on the cleaned and filtered raw data, a user service addition data set may be determined based on the user/service usage start schedule, a service loss determination may be performed based on the user/service usage end schedule, and then a user service loss data set may be determined, and then a sequence data set of a user service may be determined based on the user service addition data set and the user service loss data set.

As an alternative embodiment, the historical usage data of all users may be used as raw data, which should include at least user number, service/product name, consumption (such as usage time or consumption amount, etc.), and the sequence pattern generating means includes the steps of:

Step 1: After organizing and transforming the raw data, obtain the first service use time S and the last service use time T of each user for each service.

Step 2: user/service items with too short a use time T-S < m (or too small a use amount) are filtered out according to a predetermined use time threshold m.

Step 3: Generate the user service addition and loss sub-data sets according to the first and last use times. The first service use time S is regarded as the service addition time, and the last service use time T is regarded as the service loss time. For each user, the maximum of all of that user's T values is taken as the user disappearance time F, i.e., F = max(T).

Step 4: Filter the user service loss data set according to a preset user service loss window threshold. More specifically, the service loss time T of each item in the data set is compared with the threshold k (k > 0) and the user disappearance time F; if F - T < k, the service is considered to have stopped naturally as the user disappeared rather than to have been lost, so the item is removed. Optionally, if k = 0, T is compared with all service addition times S of the corresponding user; if T > max(S), i.e., the user added no new service after the service was lost, the service is considered to have stopped naturally as the user disappeared rather than to have been lost, and the item is removed.

Step 5: Merge the user service addition sub-data set and the user service loss sub-data set, adding a prefix to each service name, for example "+" for an added service and "-" for a lost service, and obtain the user service usage sequence data set through an integration operation. Each user corresponds to one sequence (Sequence); each sequence is an ordered list of one or more elements (Element); and each element is a set of one or more events (Event). For example, the sequence <{+Service A}{+Service B}{+Service C, -Service A}> contains 3 elements and 4 events. It should be noted that the method examines the addition and loss of services, so the same event occurs at most once in a sequence. Optionally, the sequences may be stored in arrays.
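Steps 1 to 5 can be sketched as below. This is an illustrative reading of the steps, not the device's actual implementation: the function name build_sequences, the (user, service, time) record layout and the numeric time values are assumptions, and F and max(S) are computed over the items that survive the Step 2 filter, which the text leaves open.

```python
from collections import defaultdict


def build_sequences(records, m=1, k=1):
    """records: iterable of (user, service, time) tuples with numeric time."""
    # Step 1: first use time S and last use time T per (user, service).
    span = {}
    for user, service, t in records:
        s, e = span.get((user, service), (t, t))
        span[(user, service)] = (min(s, t), max(e, t))

    # Step 2: drop items whose use time T - S is below the threshold m.
    span = {key: (s, e) for key, (s, e) in span.items() if e - s >= m}

    # Step 3: per user, disappearance time F = max(T) and latest addition max(S).
    vanish, last_add = {}, {}
    for (user, _), (s, e) in span.items():
        vanish[user] = max(vanish.get(user, e), e)
        last_add[user] = max(last_add.get(user, s), s)

    # Steps 3-5: emit "+service" at S, and "-service" at T unless Step 4
    # treats the stop as a natural consequence of the user disappearing.
    timeline = defaultdict(lambda: defaultdict(set))
    for (user, service), (s, e) in span.items():
        timeline[user][s].add("+" + service)
        if k > 0:
            natural_stop = vanish[user] - e < k
        else:
            natural_stop = e > last_add[user]
        if not natural_stop:
            timeline[user][e].add("-" + service)

    # Step 5: one time-ordered list of elements (event lists) per user.
    return {
        user: [sorted(by_time[t]) for t in sorted(by_time)]
        for user, by_time in timeline.items()
    }


# Hypothetical usage records; service D is used only once and is filtered out.
records = [
    ("u1", "A", 1), ("u1", "A", 5),
    ("u1", "B", 2), ("u1", "B", 20),
    ("u1", "C", 5), ("u1", "C", 20),
    ("u1", "D", 7),
]
print(build_sequences(records, m=1, k=3))
# {'u1': [['+A'], ['+B'], ['+C', '-A']]}
```

With these records the result corresponds to the example sequence <{+Service A}{+Service B}{+Service C, -Service A}>: services B and C stop at the user's disappearance time and are therefore not counted as lost.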

Alternatively, the sequence data set may be represented by an event occurrence frequency table, as shown in the event occurrence frequency table of table 1:

Event	Frequency
+Service_0	168646
+Service_1	152921
+Service_2	107664
+Service_3	103984
+Service_4	67737
+Service_5	44709
-Service_1	44110
-Service_0	40388
+Service_6	38934

TABLE 1
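A table like Table 1 could be computed along the following lines. The function name event_frequency_table and the toy sequences are hypothetical, and the sketch counts each event once per sequence, relying on the statement that the same event occurs at most once in a sequence.

```python
from collections import Counter


def event_frequency_table(sequences):
    """Return (event, count) pairs sorted by descending count."""
    counts = Counter()
    for sequence in sequences:
        # Flatten the sequence's elements into its set of distinct events.
        counts.update({event for element in sequence for event in element})
    return counts.most_common()


# Hypothetical sequence data set: lists of elements, each a list of events.
sequences = [
    [["+A"], ["+B"], ["+C", "-A"]],
    [["+A"], ["+B", "+D"], ["+C"]],
    [["+B", "+C"], ["-C"]],
]
for event, count in event_frequency_table(sequences):
    print(event, count, sep="\t")
```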

Fig. 4 is a schematic diagram of a frequent user behavior event stream analysis device according to an embodiment of the present invention. As shown in Fig. 4, a start event input by a user may be used as the target event, the sequences including the target event are selected from the current sequence data set, and the frequent adjacency event sets of the target event are determined.

As an alternative embodiment, the output data of the sequence pattern generating device is taken as the current sequence data set U, which contains n data sequences, each containing m events on average. The workflow of the frequent user behavior event stream analysis device is as follows:

Step 1: Generate the occurrence frequency table E of all events.

Step 2: Select a target start event as the target t₀. The event may be an addition or a churn (+t or -t) of the service of interest t.

Step 3: Traverse the current sequence data set U, and save all sequences containing the target t₀ as U₀, together with the adjacency event sets of t₀ in these sequences: the pre-adjacency event set S₋₁, the concurrent event set S₀, and the post-adjacency event set S₊₁. Here S₋₁ is the counted set of all events in U₀ that occur in the element immediately preceding the element containing t₀; S₀ is the counted set of all events in U₀ that occur in the same element as t₀; and S₊₁ is the counted set of all events in U₀ that occur in the element immediately following the element containing t₀. The time complexity of this step is O(mn).

Step 4: Update the current sequence data set to U₀.

Step 5: Using the number of occurrences of each event in S₋₁, S₀ and S₊₁ as its support (Support Count), filter out the elements whose support is smaller than the threshold to obtain the frequent adjacent event sets. Combined with t₀, calculate relevance parameters such as Confidence and Lift for the rules corresponding to the remaining elements.

For example, the rule corresponding to an event e₋₁ in S₋₁ is: if t₀ appears in the nth element of any sequence, then e₋₁ appears in the (n-1)th element of that sequence. The rule corresponding to an event e₀ in S₀ is: if t₀ appears in the nth element of any sequence, then e₀ appears in the nth element of that sequence. The rule corresponding to an event e₊₁ in S₊₁ is: if t₀ appears in the nth element of any sequence, then e₊₁ appears in the (n+1)th element of that sequence.

Step 6: Optionally, select an event from S₋₁, S₀ or S₊₁ and combine it with the current target event to generate a new target subsequence, e.g. <{e₋₁}{t₀}>, <{e₀, t₀}> or <{t₀}{e₊₁}>.

Step 7: optionally, steps 3-6 are repeated.
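The filtering and scoring in Step 5 can be illustrated with a minimal sketch. The text names confidence and lift but does not give formulas, so the conventional association-rule definitions are assumed here: confidence is the adjacency count divided by the number of sequences containing t₀, and lift is that confidence divided by the overall relative frequency of the event. The function name `score_adjacent_events` and all parameter names are hypothetical.

```python
def score_adjacent_events(adj_counts, n_target, event_freq, n_total, min_support):
    """Filter one adjacency event set by support and score the remaining rules.

    adj_counts  -- {event: count} for one of S-1, S0, S+1
    n_target    -- number of sequences containing the target t0 (|U0|)
    event_freq  -- {event: number of sequences containing the event} (table E)
    n_total     -- total number of sequences n
    min_support -- support-count threshold
    The confidence and lift formulas are assumed, not taken from the source.
    """
    rules = []
    for event, count in adj_counts.items():
        if count < min_support:
            continue  # drop infrequent adjacent events
        confidence = count / n_target
        lift = confidence / (event_freq[event] / n_total)
        rules.append((event, count, confidence, lift))
    # Highest confidence first; event name breaks ties deterministically.
    return sorted(rules, key=lambda r: (-r[2], r[0]))


rules = score_adjacent_events(
    adj_counts={"+A": 2, "+Z": 1},
    n_target=2,
    event_freq={"+A": 2, "+Z": 1},
    n_total=4,
    min_support=2,
)
```

A lift above 1 under these assumed definitions suggests the event appears next to t₀ more often than its overall frequency alone would predict.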

It should be noted that, when the target subsequence T contains more than one element, the rule corresponding to an event e₋₁ in S₋₁ is: if, in any sequence, all elements of the target subsequence T occur adjacently in order and the first element of T occurs in the nth element of the sequence, then e₋₁ appears in the (n-1)th element of that sequence. The rule corresponding to an event e₀ in S₀ is: if, in any sequence, all elements of T occur adjacently in order and the last element of T occurs in the nth element of the sequence, then e₀ also appears in the nth element of that sequence. The rule corresponding to an event e₊₁ in S₊₁ is: if, in any sequence, all elements of T occur adjacently in order and the last element of T occurs in the nth element of the sequence, then e₊₁ appears in the (n+1)th element of that sequence.

For example, a sequence dataset contains 4 sequences as follows:

1. <{+Service A}{+Service B}{+Service C, -Service A}>

2. <{+Service A}{+Service B, +Service D}{+Service C, +Service G}{+Service E}{+Service J}>

3. <{+Service B, +Service C, +Service G}{+Service F}{-Service C}>

4. <{+Service B, +Service J, +Service K}{-Service K}{+Service C}{+Service E}>

When the target subsequence T is <{+Service B}{+Service C}>, sequences 1 and 2 are the sequences that contain T. Accordingly, the events in S₋₁ (with their occurrence counts) are: +Service A (2 times); the events in S₀ are: -Service A (1 time) and +Service G (1 time); the events in S₊₁ are: +Service E (1 time).
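The example above can be reproduced with a short sketch. It assumes each sequence is a list of Python sets (one set per element), that an element of T matches a sequence element when it is a subset of it, and that every matching position is counted. The function name `adjacency_event_sets` is hypothetical and the code is an illustration, not the patented implementation.

```python
from collections import Counter


def adjacency_event_sets(sequences, target):
    """Return (matched sequence numbers, S-1, S0, S+1) for target subsequence T.

    sequences -- list of sequences, each a list of sets of events
    target    -- list of sets; its elements must occur adjacently and in order
    """
    pre, con, post = Counter(), Counter(), Counter()
    matched = []
    k = len(target)
    for number, seq in enumerate(sequences, start=1):
        hit = False
        for i in range(len(seq) - k + 1):
            if all(target[j] <= seq[i + j] for j in range(k)):
                hit = True
                last = i + k - 1
                if i > 0:
                    pre.update(seq[i - 1])          # element before T's first element
                con.update(seq[last] - target[-1])  # same element as T's last element
                if last + 1 < len(seq):
                    post.update(seq[last + 1])      # element after T's last element
        if hit:
            matched.append(number)
    return matched, pre, con, post


sequences = [
    [{"+Service A"}, {"+Service B"}, {"+Service C", "-Service A"}],
    [{"+Service A"}, {"+Service B", "+Service D"}, {"+Service C", "+Service G"},
     {"+Service E"}, {"+Service J"}],
    [{"+Service B", "+Service C", "+Service G"}, {"+Service F"}, {"-Service C"}],
    [{"+Service B", "+Service J", "+Service K"}, {"-Service K"}, {"+Service C"},
     {"+Service E"}],
]
target = [{"+Service B"}, {"+Service C"}]
matched, pre, con, post = adjacency_event_sets(sequences, target)
```

Sequences 3 and 4 are not matched because +Service B and +Service C do not appear in adjacent elements there.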

As an alternative embodiment, possible application scenarios of the frequent user behavior event stream analysis device are as follows:

For example, to explore a typical path along which users use services, take the addition of service A as the target, select the service B with a higher degree of association in the generated post-adjacency event set, then select the service C with a higher degree of association in the new post-adjacency event set, and repeat until the support of all items in the post-adjacency event set is below a predetermined threshold. The resulting sequence <{+A}{+B}{+C}…> is a typical user usage path. It can be inferred that this path may correspond to a solution favored by many users, and the service provider can make recommendations or promotions around it.
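The path exploration described here can be sketched as a greedy loop. As simplifying assumptions, the sketch ranks candidates by support count rather than by a degree-of-association measure, restricts each path element to a single event, counts each sequence at most once, and never revisits an event already on the path. The name `typical_path` is hypothetical.

```python
from collections import Counter


def typical_path(sequences, start_event, min_support):
    """Greedily extend a path using the post-adjacency event set.

    sequences -- list of sequences, each a list of sets of events
    Stops when no post-adjacent event reaches min_support.
    """
    path = [start_event]
    while True:
        k = len(path)
        post = Counter()
        for seq in sequences:
            for i in range(len(seq) - k):
                # The path must occupy k adjacent elements starting at i.
                if all(path[j] in seq[i + j] for j in range(k)):
                    post.update(seq[i + k])
                    break  # count each sequence once
        frequent = {e: c for e, c in post.items()
                    if c >= min_support and e not in path}
        if not frequent:
            return path
        # Pick the best-supported event; the name breaks ties deterministically.
        path.append(max(frequent, key=lambda e: (frequent[e], e)))


usage = [
    [{"+A"}, {"+B"}, {"+C"}],
    [{"+A"}, {"+B"}, {"+C"}],
    [{"+A"}, {"+B"}, {"+D"}],
    [{"+A"}, {"+E"}],
]
path = typical_path(usage, "+A", min_support=2)
```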

As another example, if service A is found to have significant recent user churn, and analysis of the event stream of -A shows that most users add service D in the same step as, or in the step before, the churn of service A, it can be inferred that services A and D have some substitution or conflict relationship, and the service provider can improve its policy accordingly.

As an alternative embodiment, the output data of the sequence pattern generating device is taken as the current sequence data set. The sequence data set contains n data sequences, each data sequence contains m events on average, and the total number of events in the sequence data set is l. The workflow of the user behavior pattern sequence pattern mining device is as follows:

Step 1: Generate the occurrence frequency table E of all events.

Step 2: Taking the data sequences U of all users as input, mine the service usage sequence patterns of all users.

Step 3: For each event e in table E, perform sequence pattern mining with all data sequences U_e containing that event as input; the resulting sequence patterns are regarded as the related sequence patterns of e. Optionally, the subsequence of each sequence in U_e from its first element up to the element where e appears is used as input, so as to mine the "incentive" sequence patterns of the specific event e.
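Selecting the input for Step 3 can be sketched as follows. The sketch assumes that the optional "incentive" prefix ends at, and includes, the element in which e first appears; the source does not state whether that element is included. The name `sequences_for_event` is hypothetical.

```python
def sequences_for_event(sequences, event, prefix_only=False):
    """Return U_e, the sequences containing `event`.

    With prefix_only=True each sequence is cut after the first element
    containing `event` (inclusive cut is an assumption).
    """
    selected = []
    for seq in sequences:
        idx = next((i for i, element in enumerate(seq) if event in element), None)
        if idx is None:
            continue  # the sequence does not contain the event
        selected.append(seq[: idx + 1] if prefix_only else seq)
    return selected


data = [
    [{"+A"}, {"+B"}, {"-A"}, {"+C"}],
    [{"+B"}],
]
u_e = sequences_for_event(data, "-A")
u_e_prefix = sequences_for_event(data, "-A", prefix_only=True)
```

Either result can then be passed to whatever sequence pattern miner is used for the full data set.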

Sequence pattern mining is implemented with an improved Apriori-like algorithm for sequence pattern discovery, so as to improve mining efficiency as much as possible.

FIG. 5 is a schematic diagram of the basic structure of an Apriori-like algorithm for sequence pattern discovery according to an embodiment of the present invention. As shown in FIG. 5, all original frequent sequences (i.e. frequent 1-sequences) are found, each original frequent sequence is traversed to generate candidate frequent sequences (i.e. candidate k-sequences), and support counting is performed on the candidate frequent sequences to determine the frequent sequences of the target sequence.
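The loop in FIG. 5 can be illustrated with a deliberately simplified sketch. It assumes that every pattern element holds a single event, that a pattern is supported by a sequence when its events occur in strictly later elements in order (not necessarily adjacent), and that candidates are formed by joining frequent (k-1)-sequences that overlap on k-2 events. These simplifications and the function names are assumptions, not the patented algorithm.

```python
def is_subsequence(pattern, seq):
    """True if the pattern's events occur in strictly increasing elements of seq."""
    pos = 0
    for event in pattern:
        while pos < len(seq) and event not in seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def apriori_sequences(sequences, min_support):
    """Return {pattern tuple: support count} for all frequent patterns."""
    if min_support < 1:
        raise ValueError("min_support must be at least 1")
    events = sorted({e for seq in sequences for element in seq for e in element})
    frequent = {}
    level = [(e,) for e in events]  # candidate 1-sequences
    while level:
        # Support counting: one pass over the data per candidate.
        counted = {p: sum(is_subsequence(p, s) for s in sequences) for p in level}
        kept = {p: c for p, c in counted.items() if c >= min_support}
        frequent.update(kept)
        prev = sorted(kept)
        # Join step: a and b overlap on all but a's first and b's last event.
        level = [a + (b[-1],) for a in prev for b in prev if a[1:] == b[:-1]]
    return frequent


sample = [
    [{"A"}, {"B"}, {"C"}],
    [{"A"}, {"B"}],
    [{"A"}, {"C"}],
]
patterns = apriori_sequences(sample, min_support=2)
```

In this sample, nine candidate 2-sequences are generated and counted although only three of them occur in the data at all.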

However, this kind of Apriori-like algorithm has the following problem: if the total number of events l is large and the support of the events in the event frequency table E is highly unbalanced, a low support threshold must be set during sequence pattern discovery so that the resulting sequence patterns cover enough events rather than only the few events with the highest frequency. In that case, however, a large number of invalid candidate sequences, i.e. candidate sequences whose support count is 0, are generated when candidate k-sequences with small k are generated, particularly candidate 2-sequences and candidate 3-sequences, and support counting is performed on all of these invalid sequences, which increases the computational burden.

Fig. 6 is a schematic diagram of a user behavior pattern sequence pattern mining device according to an embodiment of the present invention. As shown in Fig. 6, the device is characterized in that the computation cost is estimated once after the candidate k-sequences for low k are generated. If the number of candidate sequences is too large, the number of potentially invalid sequences is large, and the support counting cost is high, a sequence scanning method is used to generate the candidate k-sequences so that they contain no useless candidates, after which the Apriori loop logic is entered.

It should be noted that the time complexity of counting the support of each candidate sequence is O(mn), and the time complexity of generating the candidate k-sequences from the candidate (k-1)-sequences is O(L_CAND²), where L_CAND is the number of candidate (k-1)-sequences. Assuming that the number of generated candidate k-sequences is N_CAND, the time complexity C1 of the step from candidate (k-1)-sequences to frequent k-sequences is O(L_CAND²) + N_CAND·O(mn). On the other hand, the time complexity C2 of generating the frequent k-sequences with the sequence scanning algorithm is O(m^k·n·l) + N_VALID·O(mn), where N_VALID is the number of valid candidate sequences, i.e. the number of candidate sequences whose support is greater than zero. Neglecting the candidate generation term, C1 = N_CAND·O(mn) and C2 = (m^(k-1)·l + N_VALID)·O(mn), so when m^(k-1)·l + N_VALID << N_CAND, it is more efficient to generate the frequent k-sequences with the sequence scanning method.

Specifically, the approximate ratio of N_VALID to N_CAND can be estimated from the total number of events l in the event frequency table E and the skewness or standard deviation of the event frequency distribution.

In practice, if the total number of events is large (l > 100), the event frequency distribution has a large positive skew, and the support threshold is low, then for smaller k, such as 2 or 3, m^(k-1)·l + N_VALID is much smaller than N_CAND.
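The trade-off above can be sketched in code. The source does not detail the sequence scanning method, so `scan_candidates_2` is only one possible reading for k = 2: it collects the ordered event pairs that actually occur in the data, so every candidate has support greater than zero. The helper `prefer_scanning` applies the stated inequality, with "much smaller" interpreted through a hypothetical `factor` parameter.

```python
from itertools import combinations


def scan_candidates_2(sequences, frequent_events):
    """Collect 2-sequences <{a}{b}> that occur at least once in the data."""
    found = set()
    for seq in sequences:
        for i, j in combinations(range(len(seq)), 2):  # i < j: a precedes b
            for a in seq[i] & frequent_events:
                for b in seq[j] & frequent_events:
                    found.add((a, b))
    return found


def prefer_scanning(m, l, k, n_valid, n_cand, factor=10):
    """True when m^(k-1) * l + N_VALID is much smaller than N_CAND.

    `factor` is an assumed margin standing in for "much smaller".
    """
    return (m ** (k - 1) * l + n_valid) * factor < n_cand


sample = [[{"A"}, {"B"}], [{"A"}, {"C"}]]
valid = scan_candidates_2(sample, {"A", "B", "C"})
small_case = prefer_scanning(m=2, l=3, k=2, n_valid=len(valid), n_cand=9)
large_case = prefer_scanning(m=2, l=200, k=2, n_valid=500, n_cand=40000)
```

With three events the full join would produce 9 candidate 2-sequences, of which only 2 occur here; at that scale scanning brings no benefit, whereas with a few hundred events the inequality can favor it.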

Fig. 7 is a schematic diagram of a data visualization interface device according to an embodiment of the present invention. As shown in Fig. 7, the frequent user behavior event stream analysis device and the user behavior pattern sequence pattern mining device perform association pattern rule mining and determine the analysis results of the sequence data set; the complex analysis results are exchanged interactively with the user interface and then passed through the user interface to a visualization tool.

Optionally, in the user interface and visualization device provided in the data visualization interface device, the mining of sequence patterns common to all users and the mining and rule generation of event-related user sequence patterns may be implemented as non-real-time functions. That is, all rules are run and mined in advance, the rule data is saved, and the saved rule data is used as input to the user interface (UI), because the number of events in the data sequence set is limited and predictable. The event stream analysis of characteristic user behavior is performed in real time, that is, the result is recomputed each time the target changes, because the possible combinations of event streams produced by user input in the UI are numerous and unpredictable, and because the time complexity of this module is linear in the number of sequences, so its efficiency is acceptable.

The technical solution provided by the invention uses the time sequence in which users use services to mine the hidden associations between the addition and churn of different services, and the output covers two different levels or scopes. At the micro level, path adjacency analysis is performed on the addition or churn of a particular service, or on a set or sequence of such behaviors, that is, what generally happens in the step before the user performs the behavior, what generally happens at the same time as the behavior, and what happens in the next step, so as to mine the potential causes or direct effects of the behavior. At the macro level, similar-behavior mining is performed on the usage sequences of all users to explore common characteristics of users or potential links between services, and characteristic behaviors are extracted from the sequences related to the addition or churn of each service, so as to provide a direct and effective tool for service recommendation or churn early warning.

According to still another embodiment of the present invention, there is also provided a storage medium including a stored program, wherein the program, when run, performs the data analysis method of any one of the above.

According to yet another embodiment of the present invention, there is also provided a processor for running a program, wherein the program, when run, performs the data analysis method of any one of the above.

According to an embodiment of the present invention, there is further provided an embodiment of a data analysis apparatus, where it should be noted that the data analysis apparatus may be used to perform a data analysis method in an embodiment of the present invention, and the data analysis method in an embodiment of the present invention may be performed in the data analysis apparatus.

Fig. 8 is a schematic diagram of a data analysis device according to an embodiment of the present invention, as shown in fig. 8, the device may include: an obtaining unit 82, configured to obtain a sequence data set of a user service, where each sequence in the sequence data set corresponds to usage data of a service used by a user, each sequence includes at least one element, and each element includes at least one event, where an event is used to represent addition or loss of the user service; a first determining unit 84 for determining a target event among at least one event of the sequence data set, and determining a sequence including the target event as a target sequence; a second determining unit 86, configured to perform sequence pattern mining on the target sequence, determine a correlation sequence of the target event, and a degree of association between the correlation sequence and the target event, where the correlation sequence represents at least one event associated with the target event in the sequence including the target event.

It should be noted that the acquiring unit 82 in this embodiment may be used to perform step S102 in the embodiment of the present application, the first determining unit 84 in this embodiment may be used to perform step S104 in the embodiment of the present application, and the second determining unit 86 in this embodiment may be used to perform step S106 in the embodiment of the present application. The examples and application scenarios implemented by the above units are the same as those of the corresponding steps, but are not limited to what is disclosed in the above embodiments.

As an alternative embodiment, the acquisition unit comprises: an acquisition module, configured to collect usage data of the user service from a database, wherein the usage data comprises a user number, a time, an event of the used service and a usage amount; a generation module, configured to generate a start schedule and a churn schedule of the user service based on the usage data, wherein the start schedule represents the time when each user starts using the user service, and the churn schedule represents the time when each user stops using the user service; and a second generation module, configured to generate the sequence data set according to the start schedule and the churn schedule.
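The three modules can be illustrated with a minimal sketch. This passage does not specify how the start and churn times are derived, so the sketch makes several assumptions: time is an integer period (for example a month index), a service starts in the first period with usage, and a service is treated as churned in the period after its last usage if that happens before the end of the observation window. The record layout, the `min_usage` filter and the name `build_sequences` are hypothetical.

```python
from collections import defaultdict


def build_sequences(records, window_end, min_usage=0.0):
    """Build {user: sequence} where a sequence is a list of sets of events.

    records    -- iterable of (user, period, service, usage) tuples
    window_end -- last observed period; later usage is unknown
    """
    first, last = {}, {}  # start schedule and last-usage table
    for user, period, service, usage in records:
        if usage < min_usage:
            continue  # ignore negligible usage (assumed filter)
        key = (user, service)
        first[key] = min(first.get(key, period), period)
        last[key] = max(last.get(key, period), period)

    events = defaultdict(lambda: defaultdict(set))  # user -> period -> events
    for (user, service), period in first.items():
        events[user][period].add("+" + service)
    for (user, service), period in last.items():
        if period < window_end:  # assumed churn rule
            events[user][period + 1].add("-" + service)

    # Events in the same period form one element; elements are time-ordered.
    return {user: [by_period[p] for p in sorted(by_period)]
            for user, by_period in events.items()}


records = [
    ("u1", 1, "A", 5.0), ("u1", 2, "A", 5.0),
    ("u1", 2, "B", 3.0), ("u1", 3, "B", 3.0), ("u1", 4, "B", 1.0),
]
dataset = build_sequences(records, window_end=4)
```

Here user u1 adds A in period 1 and B in period 2, and A is counted as churned in period 3 because its usage stops before the window ends.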

As an alternative embodiment, the second determining unit comprises: the first determining module is used for determining an event occurrence frequency table of the sequence data set, wherein the event occurrence frequency table is used for representing the frequency of all events in the sequence data set; the setting module is used for setting a support threshold value of the target event based on the event occurrence frequency table; and the second determining module is used for carrying out sequence pattern mining on the target sequence based on the support threshold value and determining the related sequence of the target event.

As an alternative embodiment, the second determining unit comprises: a third determining module, configured to determine a frequent sequence of the target sequence in the sequence dataset, where the frequent sequence includes frequent adjacency events of the target event; a fourth determining module, configured to determine the number of sequences of the frequent sequences; the screening module is used for eliminating invalid sequences in the frequent sequences to obtain valid frequent sequences under the condition that the number of the sequences of the frequent sequences is higher than the number of the preset sequences; and the calculating module is used for calculating the support degree of the effective frequent sequences and determining the effective frequent sequences with the support degree higher than the support degree threshold value as related sequences.

As an alternative embodiment, the apparatus further comprises: a third determining unit, configured to determine, after the target event is determined in at least one event of the sequence data set, an adjacent frequent event set of the target event, wherein the adjacent frequent event set comprises a pre-adjacency event set, a concurrent event set and a post-adjacency event set, and the events in the adjacent frequent event set are arranged in descending order of their degree of association with the target event; and a generating unit, configured to generate a target subsequence according to the degree of association between the events in the adjacent frequent event set and the target event, wherein the target subsequence represents a path of the user service.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

In the foregoing embodiments of the present invention, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed technology content may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and the division of the units, for example, may be a logic function division, and may be implemented in another manner, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.

The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and these modifications and adaptations are intended to fall within the scope of protection of the present invention.

Claims

1. A method of data analysis, comprising:

acquiring a sequence data set of user services, wherein each sequence in the sequence data set corresponds to service use data of a user, each sequence comprises at least one element, and each element comprises at least one event which is used for representing the addition or loss of the user services;

determining a target event in at least one event of the sequence data set, and determining a sequence comprising the target event as a target sequence;

sequence pattern mining is carried out on the target sequence, a related sequence of the target event and the association degree of the related sequence and the target event are determined, wherein the related sequence represents at least one event associated with the target event in the sequence comprising the target event;

wherein obtaining the sequence data set of the user service comprises:

collecting usage data of user services from a database, wherein the usage data comprises user numbers, time, events of the used services and usage amount;

generating a start schedule and a churn schedule of the user service based on the usage data, wherein the start schedule is used for representing a time when each user starts to use the user service and the churn schedule is used for representing a time when each user stops using the user service;

generating the sequence data set according to the start schedule and the churn schedule;

wherein generating the sequence data set according to the start schedule and the churn schedule comprises: based on the start schedule and the churn schedule, rejecting the usage data whose usage time is lower than a preset time threshold or whose usage amount is lower than a preset usage amount threshold;

wherein after determining a target event in at least one event of the sequence data set, the method further comprises:

determining a set of contiguous frequent events of the target event, wherein the set of contiguous frequent events comprises: a pre-adjacency event set, a concurrent event set and a post-adjacency event set, wherein the events in the frequent adjacency event set are arranged in descending order according to the degree of association with the target event; the event set before adjacency refers to a counting set of all events occurring in an element adjacent to and before the element where the target event is located in the sequence data set, the concurrent event set refers to a counting set of all events in the same element as the target event in the sequence data set, and the event set after adjacency refers to a counting set of all events occurring in an element adjacent to and after the element where the target event is located in the sequence data set;

and generating a target subsequence according to the degree of association between the events in the set of contiguous frequent events and the target event, wherein the target subsequence represents the path of the user service.

2. The method of claim 1, wherein sequence pattern mining the target sequence, determining the relevant sequence of the target event comprises:

determining an event occurrence frequency table of the sequence data set, wherein the event occurrence frequency table is used for representing the frequency of all events in the sequence data set;

setting a support threshold of the target event based on the event occurrence frequency table;

and performing sequence pattern mining on the target sequence based on the support threshold value, and determining a related sequence of the target event.

3. The method of claim 1, wherein sequence pattern mining the target sequence, determining the relevant sequence of the target event comprises:

determining a frequent sequence of the target sequence in a sequence data set, wherein the frequent sequence comprises frequent adjacent events of the target event;

determining the number of sequences of the frequent sequences;

removing invalid sequences in the frequent sequences to obtain valid frequent sequences under the condition that the number of the sequences of the frequent sequences is higher than the number of the preset sequences;

and calculating the support degree of the effective frequent sequence, and determining the effective frequent sequence with the support degree higher than a support degree threshold value as the related sequence.

4. A data analysis device, comprising:

an acquisition unit, configured to acquire a sequence data set of a user service, where each sequence in the sequence data set corresponds to usage data of a service used by a user, each sequence includes at least one element, and each element includes at least one event, where the event is used to represent addition or loss of the user service;

a first determining unit, configured to determine a target event in at least one event of the sequence data set, and determine a sequence including the target event as a target sequence;

a second determining unit, configured to perform sequence pattern mining on the target sequence, determine a correlation sequence of the target event, and a degree of association of the correlation sequence with the target event, where the correlation sequence represents at least one event associated with the target event in a sequence including the target event;

wherein the acquisition unit includes:

an acquisition module, configured to collect usage data of user services from a database, wherein the usage data comprises user numbers, time, events of the used services and usage amount;

a generation module, configured to generate a start schedule and a churn schedule of the user service based on the usage data, where the start schedule is used to represent a time when each user starts using the user service and the churn schedule is used to represent a time when each user stops using the user service;

a second generation module for generating the sequence data set according to the start schedule and the churn schedule;

wherein the apparatus further comprises:

a third determining unit, configured to determine, after determining a target event in at least one event of the sequence data set, a set of contiguous frequent events of the target event, where the set of contiguous frequent events includes: a pre-adjacency event set, a concurrent event set and a post-adjacency event set, wherein the events in the frequent adjacency event set are arranged in descending order according to the degree of association with the target event; the event set before adjacency refers to a counting set of all events occurring in an element adjacent to and before the element where the target event is located in the sequence data set, the concurrent event set refers to a counting set of all events in the same element as the target event in the sequence data set, and the event set after adjacency refers to a counting set of all events occurring in an element adjacent to and after the element where the target event is located in the sequence data set;

and a generating unit, configured to generate a target subsequence according to the degree of association between the events in the set of contiguous frequent events and the target event, wherein the target subsequence represents the path of the user service.

5. The apparatus according to claim 4, wherein the second determining unit includes:

the first determining module is used for determining an event occurrence frequency table of the sequence data set, wherein the event occurrence frequency table is used for representing the frequency of all events in the sequence data set;

the setting module is used for setting a support threshold value of the target event based on the event occurrence frequency table;

and the second determining module is used for carrying out sequence pattern mining on the target sequence based on the support threshold value and determining the related sequence of the target event.

6. A storage medium comprising a stored program, wherein the program performs the data analysis method of any one of claims 1 to 3.

7. A processor for running a program, wherein the program when run performs the data analysis method of any one of claims 1 to 3.