CN111190938A

CN111190938A - Data analysis method and device, storage medium and processor

Info

Publication number: CN111190938A
Application number: CN201911368912.3A
Authority: CN
Inventors: 文诗奇; 谭国苹; 李刚毅
Original assignee: Beyondsoft Corp
Current assignee: Beyondsoft Corp
Priority date: 2019-12-26
Filing date: 2019-12-26
Publication date: 2020-05-22
Anticipated expiration: 2039-12-26
Also published as: CN111190938B

Abstract

The invention discloses a data analysis method, a data analysis device, a storage medium and a processor. The method comprises: acquiring a sequence data set of user services, wherein each sequence in the sequence data set corresponds to the usage data of the services used by a user, each sequence comprises at least one element, each element comprises at least one event, and each event indicates the addition or loss of a user service; determining a target event among the events of the sequence data set, and determining a sequence comprising the target event as a target sequence; and performing sequence pattern mining on the target sequence to determine a related sequence of the target event and the degree of association between the related sequence and the target event, wherein the related sequence represents at least one event associated with the target event in the sequence comprising the target event. The method solves the technical problem that analysis models based on classical frequent pattern mining algorithms have a limited application range.

Description

Data analysis method and device, storage medium and processor

Technical Field

The present invention relates to the field of data processing, and in particular, to a data analysis method, apparatus, storage medium, and processor.

Background

With the development of information technology and the popularization of cloud services, the services offered by cloud service providers have become increasingly diverse, covering storage, computing, management, optimization, artificial intelligence and more. When large numbers of users consume many different services, their usage implies latent behavior patterns. These patterns either reflect the common consumption habits of the user population or reveal inherent relationships among the service products. Exploiting them can help product suppliers position products better, discover potential user groups, establish early warning of user churn, and better meet the needs of existing users. For example, assume that the following behavior pattern exists in a user population: <{add service A, add service B}, {add service C}, {cancel service B}>.

We can infer that service C may depend in some way on services A and B, and that users' demand for service B is more elastic than their demand for service A. Therefore, we can recommend service C to users whose behavior pattern is <{add service A, add service B}>, and establish an early warning of churn from service B for users whose behavior pattern is <{add service A, add service B}, {add service C}>.

Classical frequent pattern mining methods such as Apriori and FP-Growth provide powerful analysis tools, but they suffer from a limited application range, long running times and high memory use, and are not convenient to apply directly in business analysis models. Therefore, the invention provides a product positioning and user analysis model that takes practicality and efficiency as its core and applies an improved sequence pattern mining algorithm from several analysis angles.

At present, no effective solution has been proposed for the problem that analysis models based on classical frequent pattern mining algorithms have a limited application range.

Disclosure of Invention

The embodiment of the invention provides a data analysis method, a data analysis device, a storage medium and a processor, which are used for at least solving the technical problem that the application range of an analysis model based on a classical frequent pattern mining algorithm is limited.

According to an aspect of an embodiment of the present invention, there is provided a data analysis method including: acquiring a sequence data set of user services, wherein each sequence in the sequence data set corresponds to use data of a service used by a user, each sequence comprises at least one element, each element comprises at least one event, and the event is used for indicating addition or loss of the user service; determining a target event among at least one event of the sequence dataset and determining a sequence comprising the target event as a target sequence; and performing sequence pattern mining on the target sequence, determining a related sequence of the target event and the association degree of the related sequence and the target event, wherein the related sequence represents at least one event associated with the target event in the sequence comprising the target event.

Further, acquiring the sequence data set of the user services includes: collecting the usage data of the user services from a database, wherein the usage data comprises the user number, the time, the event and the usage amount of each service used; generating a start time table and an attrition time table for the user services based on the usage data, wherein the start time table represents the time at which each user starts using a user service and the attrition time table represents the time at which each user stops using a user service; and generating the sequence data set according to the start time table and the attrition time table.

Further, performing sequence pattern mining on the target sequence, and determining a related sequence of the target event includes: determining an event occurrence frequency table of the sequence data set, wherein the event occurrence frequency table is used for indicating the frequency of all events in the sequence data set; setting a support threshold of the target event based on the event occurrence frequency table; and performing sequence pattern mining on the target sequence based on the support degree threshold value, and determining the target event related sequence.

Further, performing sequence pattern mining on the target sequence, and determining a related sequence of the target event includes: determining a frequent sequence of the target sequence in a sequence data set, wherein the frequent sequence comprises frequent adjacent events of the target event; determining a sequence number of the frequent sequences; under the condition that the number of the sequences of the frequent sequences is higher than the number of preset sequences, eliminating invalid sequences in the frequent sequences to obtain effective frequent sequences; and calculating the support degree of the effective frequent sequences, and determining the effective frequent sequences with the support degree higher than a support degree threshold value as the related sequences.

Further, after determining the target event among the events of the sequence data set, the method further comprises: determining adjacent frequent event sets of the target event, wherein the adjacent frequent event sets comprise a preceding adjacent event set, a concurrent event set and a following adjacent event set, and the events in each adjacent frequent event set are arranged in descending order of their degree of association with the target event; and generating a target subsequence according to the degree of association between the events in the adjacent frequent event sets and the target event, wherein the target subsequence represents a usage path of the user services.

According to another aspect of the embodiments of the present invention, there is also provided a data analysis apparatus, including: an acquisition unit configured to acquire a sequence data set of user services, wherein each sequence in the sequence data set corresponds to the usage data of the services used by a user, each sequence comprises at least one element, each element comprises at least one event, and each event indicates the addition or loss of a user service; a first determining unit configured to determine a target event among the events of the sequence data set, and to determine a sequence including the target event as a target sequence; and a second determining unit configured to perform sequence pattern mining on the target sequence and to determine a related sequence of the target event and the degree of association between the related sequence and the target event, wherein the related sequence represents at least one event associated with the target event in a sequence including the target event.

Further, the acquisition unit includes: a collection module configured to collect the usage data of the user services from a database, wherein the usage data comprises the user number, the time, the event and the usage amount of each service used; a first generation module configured to generate a start time table and an attrition time table of the user services based on the usage data, wherein the start time table indicates the time at which each user starts using a user service and the attrition time table indicates the time at which each user stops using a user service; and a second generation module configured to generate the sequence data set according to the start time table and the attrition time table.

Further, the second determining unit includes: a first determining module configured to determine an event occurrence frequency table of the sequence data set, wherein the event occurrence frequency table indicates the frequency of all events in the sequence data set; a setting module configured to set a support threshold of the target event based on the event occurrence frequency table; and a second determining module configured to perform sequence pattern mining on the target sequence based on the support threshold and to determine the related sequence of the target event.

According to another aspect of the embodiments of the present invention, there is also provided a storage medium, where the storage medium includes a stored program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the data analysis method described above.

According to another aspect of the embodiments of the present invention, there is also provided a processor, configured to execute a program, where the program executes the data analysis method described above.

In the embodiment of the invention, a sequence data set of user services is acquired, wherein each sequence in the sequence data set corresponds to the usage data of the services used by a user, each sequence comprises at least one element, each element comprises at least one event, and each event indicates the addition or loss of a user service; a target event is then determined among the events of the sequence data set, and a sequence comprising the target event is determined as a target sequence; sequence pattern mining is then performed on the target sequence to determine a related sequence of the target event and the degree of association between the related sequence and the target event, wherein the related sequence represents at least one event associated with the target event in the sequence comprising the target event. The purpose of analyzing the sequence data set of user services is thereby achieved, which widens the range of application of the sequence data set and solves the technical problem that analysis models based on classical frequent pattern mining algorithms have a limited application range.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow chart of a method of data analysis according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a data analysis system according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a sequence pattern generation apparatus according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a frequent user behavior event stream analysis apparatus according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating the basic structure of an Apriori-like algorithm for sequence pattern discovery according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a user behavior pattern sequence pattern mining device according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a data visualization interface apparatus, according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a data analysis apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In accordance with an embodiment of the present invention, there is provided a data analysis method embodiment, it being noted that the steps illustrated in the flowchart of the figure may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order different than here.

Fig. 1 is a flow chart of a data analysis method according to an embodiment of the present invention, as shown in fig. 1, the method including the steps of:

Step S102: acquiring a sequence data set of user services, wherein each sequence in the sequence data set corresponds to the usage data of the services used by a user, each sequence comprises at least one element, each element comprises at least one event, and each event indicates the addition or loss of a user service;

Step S104: determining a target event among the events of the sequence data set, and determining a sequence comprising the target event as a target sequence;

Step S106: performing sequence pattern mining on the target sequence, and determining a related sequence of the target event and the degree of association between the related sequence and the target event, wherein the related sequence represents at least one event associated with the target event in the sequence comprising the target event.

It should be noted that a user service refers to the service content used by each user, described for example by a service/product name and a consumption quantity (such as usage time or consumption amount): which goods a user has purchased, which services the user has subscribed to, and the like.

Optionally, the degree of association between the related sequence and the target event includes confidence, lift, and the like.

Alternatively, the related sequence may be a sequence including the target event.

Alternatively, the related sequence may be a causal sequence of the target event, that is, a sequence that includes the target event and in which the other events precede the target event in time.
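To make the selection of target sequences in step S104 concrete, the following is a minimal Python sketch rather than the patented implementation. It assumes that a sequence is a list of elements and that each element is a frozenset of event strings such as "+A" (service A added) or "-B" (service B lost). The function names target_sequences and causal_sequence are hypothetical, and treating the causal sequence as the prefix that ends at the element holding the target event is only one possible reading of the definition above.

```python
from typing import FrozenSet, List

Element = FrozenSet[str]
Sequence = List[Element]


def target_sequences(dataset: List[Sequence], target: str) -> List[Sequence]:
    """Keep only the sequences in which the target event occurs (step S104)."""
    return [seq for seq in dataset if any(target in element for element in seq)]


def causal_sequence(seq: Sequence, target: str) -> Sequence:
    """Return the elements up to and including the one holding the target.

    Assumed reading of "causal sequence"; returns [] if the target is absent.
    """
    for index, element in enumerate(seq):
        if target in element:
            return seq[: index + 1]
    return []


# Toy data set: two users, events prefixed with "+" (added) or "-" (lost).
dataset = [
    [frozenset({"+A", "+B"}), frozenset({"+C"}), frozenset({"-B"})],
    [frozenset({"+A"}), frozenset({"+D"})],
]
selected = target_sequences(dataset, "+C")
prefix = causal_sequence(selected[0], "+C")
```

Only the first toy sequence contains "+C", so it is the only target sequence, and its assumed causal sequence stops at the element in which "+C" is added.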

As an alternative embodiment, acquiring the sequence data set of the user services comprises: collecting the usage data of the user services from a database, wherein the usage data comprises the user number, the time, the event and the usage amount of each service used; generating a start time table and an attrition time table of the user services based on the usage data, wherein the start time table indicates the time at which each user starts using a user service and the attrition time table indicates the time at which each user stops using a user service; and generating the sequence data set based on the start time table and the attrition time table.

According to this embodiment of the invention, usage data can be stored in the database while each user uses the user services; the database records the user number, the time, the event and the usage amount of each service used. Based on the usage data, the time at which each user starts using a service and the time at which the user stops using it can be determined, which yields the start time table and the attrition time table, and from them the sequence data set.

Optionally, generating the sequence data set according to the start time table and the attrition time table comprises: based on the start time table and the attrition time table, removing usage data whose usage time is below a preset time threshold or whose usage amount is below a preset usage amount threshold, thereby determining the sequence data set and denoising the usage data of the user services.

As an alternative embodiment, the sequence pattern mining is performed on the target sequence, and determining the related sequence of the target event includes: determining an event occurrence frequency table of the sequence data set, wherein the event occurrence frequency table is used for indicating the frequency of all events in the sequence data set; setting a support threshold of a target event based on the event occurrence frequency table; and performing sequence pattern mining on the target sequence based on the support degree threshold value, and determining the related sequence of the target event.

According to this embodiment of the invention, the support between each event in the sequence data set and the target event can be determined from the event occurrence frequency table of the sequence data set; then, based on the support threshold of the target event, sequence pattern mining can be performed on the target sequence containing the target event to determine the related sequence of the target event.

As an alternative embodiment, the sequence pattern mining is performed on the target sequence, and determining the related sequence of the target event includes: determining a frequent sequence of the target sequence in the sequence data set, wherein the frequent sequence comprises frequent adjacent events of the target event; determining the number of sequences of frequent sequences; under the condition that the number of the sequences of the frequent sequences is higher than the number of the preset sequences, eliminating invalid sequences in the frequent sequences to obtain effective frequent sequences; and calculating the support degree of the effective frequent sequences, and determining the effective frequent sequences with the support degree higher than the support degree threshold value as the related sequences.

In the above embodiment of the present invention, the number of sequences in the sequence data set may be huge, and the events in the sequence data set (i.e., service additions and service losses) may be highly unbalanced in frequency. To improve the efficiency of sequence pattern mining, the number of frequent sequences that the mining process needs to handle can be counted during mining and used to estimate the running cost of the process. When that cost is high, that is, when the number of frequent sequences exceeds the preset number, invalid sequences among the frequent sequences are removed by scanning the sequences.

It should be noted that the invalid sequence may be a sequence with a lower support degree in the frequent sequences.
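The pruning and support steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the algorithm claimed by the patent: sequences and candidate patterns are tuples of frozensets, the names contains, support and related_sequences are hypothetical, and "invalid" is approximated here by a cheap bound, namely that a pattern cannot be supported by more sequences than any single event it contains. The parameter max_candidates stands in for the preset number of sequences.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple

Pattern = Tuple[FrozenSet[str], ...]


def contains(seq: Pattern, pattern: Pattern) -> bool:
    """True if each pattern element is a subset of a later and later element of seq."""
    pos = 0
    for wanted in pattern:
        while pos < len(seq) and not wanted <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def support(dataset: List[Pattern], pattern: Pattern) -> int:
    """Number of sequences in the data set that contain the pattern."""
    return sum(1 for seq in dataset if contains(seq, pattern))


def related_sequences(dataset: List[Pattern], candidates: List[Pattern],
                      min_support: int, max_candidates: int) -> Dict[Pattern, int]:
    if len(candidates) > max_candidates:
        # Assumed pre-filter: drop candidates holding an event that is too rare
        # for the whole pattern to exceed the support threshold.
        counts = Counter(e for seq in dataset for el in seq for e in el)
        candidates = [p for p in candidates
                      if all(counts[e] > min_support for el in p for e in el)]
    scored = {p: support(dataset, p) for p in candidates}
    return {p: s for p, s in scored.items() if s > min_support}


dataset = [
    (frozenset({"+A", "+B"}), frozenset({"+C"}), frozenset({"-B"})),
    (frozenset({"+A"}), frozenset({"+C"})),
    (frozenset({"+B"}), frozenset({"+D"})),
]
candidates = [
    (frozenset({"+A"}), frozenset({"+C"})),
    (frozenset({"+B"}), frozenset({"-B"})),
    (frozenset({"+D"}),),
]
result = related_sequences(dataset, candidates, min_support=1, max_candidates=2)
```

In this toy run the candidate list exceeds max_candidates, so the two candidates that include the rare events "-B" and "+D" are discarded before any full support scan, and only the pattern <{+A}{+C}> survives the threshold.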

As an alternative embodiment, after determining the target event among the events of the sequence data set, the method further comprises: determining adjacent frequent event sets of the target event, wherein the adjacent frequent event sets comprise a preceding adjacent event set, a concurrent event set and a following adjacent event set, and the events in each adjacent frequent event set are arranged in descending order of their degree of association with the target event; and generating a target subsequence according to the degree of association between the events in the adjacent frequent event sets and the target event, wherein the target subsequence represents a usage path of the user services.

In the above embodiment of the present invention, the preceding adjacent event set, the concurrent event set and the following adjacent event set of the target event can be determined from the sequence data set, and the events in these sets are arranged in descending order of their degree of association with the target event. A target subsequence is then generated from them, representing the path of user services through which the target event comes to occur.

The present invention also provides a preferred embodiment: a user behavior analysis and product positioning model based on time-series patterns.

The purpose of the technical solution provided by the invention is as follows: by analyzing the sequences of data on how a large number of users use services and mining user behavior sequence patterns, to achieve product positioning and user churn early warning for each service in a service set, and to provide a reference for recommending services to specific user groups. To this end, the invention establishes several models that mine, in sequences containing few or many elements and events, frequent subsequences and the occurrence rules of the elements adjacent to specific elements of interest.

Fig. 2 is a schematic diagram of a data analysis system according to an embodiment of the present invention, and as shown in fig. 2, after raw data is input into the data analysis system, analysis of the raw data may be implemented by a sequence pattern generation device 202, a frequent user behavior event stream analysis device 204, a user behavior pattern sequence pattern mining device 206, and a data visualization interface device 206 in the system.

Alternatively, the sequence pattern generating device may obtain a data set satisfying the condition according to the user historical service usage database, and then perform data transformation to generate a sequence pattern set (i.e., a sequence data set) of the user usage service.

Alternatively, the frequent user behavior event stream (Event stream) analysis device may take an input starting event (Starting Event) as the target event (Target Event) and generate three adjacent frequent event sets for that event: a preceding adjacent event set (Preceding events set), a concurrent event set (Concurrent events set) and a following adjacent event set (Following events set). The elements in the three event sets are arranged in descending order of their degree of association with the target event.

Further, the events in the three event sets serve as selectable inputs: a selected event is added to the target event to form a new target subsequence, the device then generates three new rule sets from the current target subsequence, and the cycle repeats according to this logic.

Alternatively, the user behavior sequence pattern mining device may generate a sequence pattern and a support thereof according to an input sequence data set and an input minimum support value (minimum support value), and if the input sequence data set is a subset filtered according to a certain target event, simultaneously calculate a correlation degree between the generated sequence pattern and the target event.

Alternatively, the data visualization user interface device may generate an interactive user interface based on the sequence pattern set (i.e., the sequence data set) generated by the sequence pattern generating device, by combining the frequent user behavior event stream analyzing device and the user behavior pattern sequence pattern mining device, and visually display the analysis results based on the frequent user behavior event stream analyzing device 204 and the user behavior pattern sequence pattern mining device.

Optionally, the sequence pattern mining algorithm part and the user interface can interact through flash to better realize the event stream display of the frequent user behavior event stream analysis device.

Alternatively, visualization of the analysis results may be achieved by d3.js.

Fig. 3 is a schematic diagram of a sequence pattern generating apparatus according to an embodiment of the present invention, as shown in fig. 3, the input raw data is cleaned and filtered, a user/service usage start time table and a user/service usage end time table are generated based on the cleaned and filtered raw data, a user service addition data set may be determined based on the user/service usage start time table, a service churn determination may be performed based on the user/service usage end time table, a user service churn data set may be further determined, and then a sequence data set of a user service may be determined based on the user service addition data set and the user service churn data set.

As an alternative embodiment, the historical usage data of all users can be used as raw data, which includes at least the user number, the service/product name and the consumption quantity (such as usage time or consumption amount). The sequence pattern generating device works through the following steps:

Step 1: After sorting and transformation, obtain the initial use time S and the final use time T of each service for each user.

Step 2: According to a preset usage time threshold m, filter out the user/service items whose usage time is too short, i.e. T - S < m (or whose usage amount is too small).

Step 3: From the initial and final use times in the resulting data set, generate the user service addition sub data set and the user service loss sub data set. The initial use time S of a service is regarded as the service addition time, and the final use time T is regarded as the service loss time. For each user, the maximum of all T is taken as the user disappearance time F of that user, i.e. F = max(T).

Step 4: Filter the user service loss sub data set according to a preset user service loss window threshold k. More specifically, the service loss time T of each item in the data set is compared with the threshold k (k > 0) and the user disappearance time F; if F - T < k, the service is considered to have stopped naturally as the user disappeared rather than to have been lost, and the item is removed. Alternatively, if k = 0, T is compared with all service addition times S of the corresponding user; if T > max(S), that is, the user added no new service after the service was lost, the service is likewise considered to have stopped naturally as the user disappeared, and the item is removed.

Step 5: Combine the user service addition sub data set and the user service loss sub data set, adding a prefix to each service name, for example "+" for addition and "-" for loss, and integrate them to obtain the user service usage sequence data set. Each user corresponds to one sequence (Sequence), each sequence is an ordered list of one or more elements (Element), and each element is a set of one or more events (Event). For example, the sequence <{+Service A} {+Service B} {+Service C, -Service A}> contains 3 elements and 4 events. It should be noted that, because the method examines the addition and loss of services, the same event occurs at most once in a sequence. Optionally, a sequence may be stored in an array.
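The five steps above can be sketched in Python as follows. This is a minimal sketch under assumptions that the text does not state: usage records are (user, service, time) tuples with integer times (for example day indexes), the usage amount is ignored, events that share a timestamp are grouped into one element, and max(S) for the k = 0 case is taken over the services that survive step 2. The function name build_sequences is hypothetical.

```python
from collections import defaultdict


def build_sequences(records, m, k):
    """Turn (user, service, time) usage rows into per-user event sequences."""
    first, last = {}, {}
    for user, service, t in records:          # Step 1: S and T per user/service
        key = (user, service)
        first[key] = min(t, first.get(key, t))
        last[key] = max(t, last.get(key, t))
    # Step 2: drop user/service pairs used for too short a time (T - S < m).
    kept = [key for key in first if last[key] - first[key] >= m]
    # Step 3: per user, F = max(T) and the latest addition time max(S).
    vanish, latest_add = {}, {}
    for key in kept:
        user = key[0]
        vanish[user] = max(last[key], vanish.get(user, last[key]))
        latest_add[user] = max(first[key], latest_add.get(user, first[key]))
    events = defaultdict(lambda: defaultdict(set))   # user -> time -> events
    for user, service in kept:
        s, t = first[(user, service)], last[(user, service)]
        events[user][s].add("+" + service)
        # Step 4: keep a loss only if it is not explained by the user vanishing.
        if k > 0:
            churned = vanish[user] - t >= k
        else:
            churned = t <= latest_add[user]
        if churned:
            events[user][t].add("-" + service)
    # Step 5: order each user's elements by time.
    return {user: [frozenset(by_time[t]) for t in sorted(by_time)]
            for user, by_time in events.items()}


records = [
    ("u1", "A", 0), ("u1", "A", 10),
    ("u1", "B", 0), ("u1", "B", 5),
    ("u1", "C", 6), ("u1", "C", 20),
    ("u2", "D", 0), ("u2", "D", 1),   # used too briefly, filtered in step 2
]
with_window = build_sequences(records, m=3, k=5)
no_window = build_sequences(records, m=3, k=0)
```

With k = 5 the loss of service C is dropped because it coincides with the user's disappearance; with k = 0 the loss of service A is also dropped, because user u1 added nothing after it.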

Alternatively, the sequence data set may represent the frequency of each event by an event occurrence frequency table, as shown in Table 1:

Event	Frequency of occurrence
+Service_0	168646
+Service_1	152921
+Service_2	107664
+Service_3	103984
+Service_4	67737
+Service_5	44709
-Service_1	44110
-Service_0	40388
+Service_6	38934

TABLE 1
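A table like Table 1 can be produced with a simple count over the sequence data set, as in the sketch below. The data here is a toy example, not the data behind Table 1. The helper support_threshold is a hypothetical rule (a fixed fraction of the target event's frequency, with a floor of 1); the text only says that the threshold is set based on the frequency table and does not give a formula.

```python
from collections import Counter


def event_frequency_table(dataset):
    """Count how often each event occurs across all sequences."""
    return Counter(event for seq in dataset for element in seq for event in element)


def support_threshold(table, target, ratio):
    """Hypothetical rule: a fraction of the target event's frequency, at least 1."""
    return max(1, int(table[target] * ratio))


dataset = [
    [frozenset({"+Service_0"}), frozenset({"+Service_1"})],
    [frozenset({"+Service_0", "+Service_1"}), frozenset({"-Service_1"})],
    [frozenset({"+Service_0"})],
]
table = event_frequency_table(dataset)
ranked = table.most_common()   # events in descending order of frequency
threshold = support_threshold(table, "+Service_0", ratio=0.5)
```

Because each event occurs at most once per sequence, an event's frequency also equals the number of sequences that contain it, which is why it is a natural basis for a support threshold.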

Fig. 4 is a schematic diagram of a frequent user behavior event stream analysis apparatus according to an embodiment of the present invention. As shown in fig. 4, a starting event input by the user may be used as the target event, the sequences including the target event are selected from the current sequence data set, and the frequent adjacent event sets of the target event are determined.

As an alternative embodiment, the output data of the sequence pattern generating device is used as the current sequence data set U, which contains n data sequences, each containing m events on average. The workflow of the frequent user behavior event stream analysis device is as follows:

Step 1: Generate the occurrence frequency table E of all events.

Step 2: Select a target starting event t₀. The event may be the addition or the loss of a service of interest t (+t or -t).

Step 3: Traverse the current sequence data set U, store all sequences containing the target t₀ as U₀, and store the adjacent event sets of t₀ in these sequences: the preceding adjacent event set S₋₁, the concurrent event set S₀ and the following adjacent event set S₊₁. Here S₋₁ is the set of counts of all events in U₀ that appear in the element immediately preceding the element containing t₀; S₀ is the set of counts of all events in U₀ that are located in the same element as t₀; and S₊₁ is the set of counts of all events in U₀ that appear in the element immediately following the element containing t₀. The time complexity of this step is O(mn).

Step 4: Update the current sequence data set to U₀.

Step 5: Taking the occurrence counts of the events in S₋₁, S₀ and S₊₁ as their support (Support Count), filter out the elements whose support is below the threshold to obtain the frequent adjacent event sets. Then, in combination with t₀, calculate association parameters such as confidence (Confidence) and lift (Lift) for the rules corresponding to the remaining elements.

For example, the rule corresponding to an event e₋₁ in S₋₁ is: if t₀ is at the n-th element of any sequence, then e₋₁ appears at the (n-1)-th element of that sequence. The rule corresponding to an event e₀ in S₀ is: if t₀ is at the n-th element of any sequence, then e₀ appears at the n-th element of that sequence. The rule corresponding to an event e₊₁ in S₊₁ is: if t₀ is at the n-th element of any sequence, then e₊₁ appears at the (n+1)-th element of that sequence.

Step 6: Optionally, select from S₋₁, S₀ and S₊₁ items of interest or items with a high degree of association with the target, and combine them with the current target event to generate a new target subsequence, e.g. <{e₋₁}{t₀}>, <{e₀, t₀}> or <{t₀}{e₊₁}>.

Step 7: Optionally, repeat steps 3 to 6.

It should be noted that when the target subsequence T contains more than one element, the rules are as follows. The rule corresponding to an event e₋₁ in S₋₁ is: if, in any sequence, all elements of the target subsequence T appear adjacently in order and the first element of T appears in the n-th element of the sequence, then e₋₁ appears in the (n-1)-th element of the sequence. The rule corresponding to an event e₀ in S₀ is: if, in any sequence, all elements of T appear adjacently in order and the last element of T appears in the n-th element of the sequence, then e₀ also appears in the n-th element of the sequence. The rule corresponding to an event e₊₁ in S₊₁ is: if, in any sequence, all elements of T appear adjacently in order and the last element of T appears in the n-th element of the sequence, then e₊₁ appears in the (n+1)-th element of the sequence.
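The workflow for a single target event can be sketched as follows. This is only an illustrative sketch under stated assumptions: sequences are lists of sets of event strings, only the first occurrence of the target in each sequence is considered, and confidence is taken as the support of an adjacent event divided by the number of sequences containing the target. The function names are hypothetical. Lift is omitted because it additionally needs a baseline frequency of the adjacent event, which the description does not define precisely.

```python
from collections import Counter


def adjacent_event_counts(sequences, target):
    """Return (U0, S_prev, S_same, S_next) for a single target event."""
    u0 = []
    s_prev, s_same, s_next = Counter(), Counter(), Counter()
    for seq in sequences:
        positions = [i for i, element in enumerate(seq) if target in element]
        if not positions:
            continue
        u0.append(seq)
        i = positions[0]  # assumption: only the first occurrence is used
        if i > 0:
            s_prev.update(seq[i - 1])      # element immediately before
        s_same.update(seq[i] - {target})   # same element, target excluded
        if i + 1 < len(seq):
            s_next.update(seq[i + 1])      # element immediately after
    return u0, s_prev, s_same, s_next


def frequent_with_confidence(counts, n_target_sequences, min_support):
    """Keep events with support >= min_support and attach a confidence."""
    return {
        event: (support, support / n_target_sequences)
        for event, support in counts.items()
        if support >= min_support
    }


# Hypothetical toy data.
toy = [
    [{"+A"}, {"+B"}, {"+C"}],
    [{"+A", "+D"}, {"+B", "+E"}],
    [{"+C"}],
]
u0, s_prev, s_same, s_next = adjacent_event_counts(toy, "+B")
frequent_prev = frequent_with_confidence(s_prev, len(u0), min_support=2)
print(frequent_prev)
```

Each sequence is visited once and each element is inspected once, which is consistent with the O(mn) cost stated for step 3.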

For example, a sequence dataset contains 4 sequences as follows:

1. <{+Service A}{+Service B}{+Service C, -Service A}>

2. <{+Service A}{+Service B, +Service D}{+Service C, +Service G}{+Service E}{+Service J}>

3. <{+Service B, +Service C, +Service G}{+Service F}{-Service C}>

4. <{+Service B, +Service J, +Service K}{-Service K}{+Service C}{+Service E}>

When the target subsequence T is <{+Service B}{+Service C}>, sequences 1 and 2 are the sequences that match T. Sequence 3 does not match because +Service B and +Service C occur in the same element, and sequence 4 does not match because the two events are not in adjacent elements. Thus the events (and numbers of occurrences) in S₋₁ are: +Service A (2); the events in S₀ are: -Service A (1), +Service G (1); and the events in S₊₁ are: +Service E (1).
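The worked example above can be reproduced with a short sketch. It assumes that sequences and the target subsequence are lists of sets of event strings, that only the first match in each sequence is counted, and that S₀ is collected from the element matched by the last element of T; the function names are hypothetical.

```python
from collections import Counter


def first_match(seq, target):
    """Index where the elements of target first appear in adjacent elements."""
    k = len(target)
    for start in range(len(seq) - k + 1):
        if all(target[j] <= seq[start + j] for j in range(k)):
            return start
    return None


def adjacency_sets(sequences, target):
    """Return (S_prev, S_same, S_next) for a multi-element target subsequence."""
    s_prev, s_same, s_next = Counter(), Counter(), Counter()
    for seq in sequences:
        start = first_match(seq, target)
        if start is None:
            continue
        last = start + len(target) - 1
        if start > 0:
            s_prev.update(seq[start - 1])
        s_same.update(seq[last] - target[-1])
        if last + 1 < len(seq):
            s_next.update(seq[last + 1])
    return s_prev, s_same, s_next


sequences = [
    [{"+Service A"}, {"+Service B"}, {"+Service C", "-Service A"}],
    [{"+Service A"}, {"+Service B", "+Service D"},
     {"+Service C", "+Service G"}, {"+Service E"}, {"+Service J"}],
    [{"+Service B", "+Service C", "+Service G"}, {"+Service F"}, {"-Service C"}],
    [{"+Service B", "+Service J", "+Service K"}, {"-Service K"},
     {"+Service C"}, {"+Service E"}],
]
target = [{"+Service B"}, {"+Service C"}]

s_prev, s_same, s_next = adjacency_sets(sequences, target)
print(dict(s_prev))  # {'+Service A': 2}
print(dict(s_same))
print(dict(s_next))  # {'+Service E': 1}
```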

As an alternative embodiment, the possible application scenarios of the frequent user behavior event stream analysis apparatus are as follows:

For example, to explore a typical path along which users adopt services, a service A with a large number of users is selected first. A service B with a high degree of association is then selected from the generated post-adjacent event set, a service C with a high degree of association is selected from the new post-adjacent event set, and so on, until the support of every item in the post-adjacent event set falls below a predetermined threshold. The resulting sequence <{+A}{+B}{+C}…> is a typical user usage path. Such a path can be inferred to correspond to a solution that many users prefer, and the service provider can make recommendations or run promotions around that solution.

As another example, suppose service A has recently lost many users. Event stream analysis of -A shows that most users add service D in the same step as, or in the step before, the loss of service A. It can then be inferred that some substitution or conflict relationship exists between services A and D, and the service provider can adjust its policy accordingly.
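The path exploration scenario described above (the first of the two) can be sketched as a greedy loop. This is an illustration under assumptions rather than the embodiment's procedure: sequences are lists of sets of event strings, "higher relevance" is approximated by the highest support in the post-adjacent event set, ties are broken by event name, and all names and data are hypothetical.

```python
from collections import Counter


def next_event_counts(sequences, path):
    """Count events in the element right after an adjacent match of path."""
    counts = Counter()
    k = len(path)
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            if all(path[j] in seq[start + j] for j in range(k)):
                if start + k < len(seq):
                    counts.update(seq[start + k])
                break  # count each sequence at most once
    return counts


def typical_path(sequences, start_event, min_support):
    """Extend the path greedily until no successor reaches min_support."""
    path = [start_event]
    while True:
        counts = next_event_counts(sequences, path)
        if not counts:
            return path
        event, support = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        if support < min_support:
            return path
        path.append(event)


# Hypothetical toy data.
toy = [
    [{"+A"}, {"+B"}, {"+C"}],
    [{"+A"}, {"+B"}, {"+C"}, {"+D"}],
    [{"+A"}, {"+B"}, {"+E"}],
    [{"+A"}, {"+F"}],
]
path = typical_path(toy, "+A", min_support=2)
print(path)  # ['+A', '+B', '+C']
```

In practice the selection could equally be driven by confidence or lift, or made interactively by an analyst, as step 6 of the workflow allows.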

As an alternative embodiment, the output data of the sequence pattern generating device is used as the current sequence data set, which contains n data sequences, each containing m events on average, with l events in total. The workflow of the user behavior pattern sequence pattern mining device is as follows:

Step 1: Generate the occurrence frequency table E of all events.

Step 2: Mine the service usage sequence patterns of all users, taking the data sequences U of all users as input.

Step 3: For each event e in table E, perform sequence pattern mining with all data sequences U_e that contain e as input, and regard the resulting sequence patterns as the related sequence patterns of e. Optionally, use as input the subsequences of all sequences in U_e that run from the first element to the element before the occurrence of e, in order to mine the "incentive" sequence patterns of the specific event e.
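The input selection in step 3 can be sketched as follows. The sketch assumes that sequences are lists of sets of event strings, that the first occurrence of e delimits the "incentive" prefix, and that sequences in which e occurs in the very first element contribute no prefix; the function names are hypothetical.

```python
def sequences_containing(sequences, event):
    """U_e: all sequences in which the event occurs in some element."""
    return [seq for seq in sequences if any(event in el for el in seq)]


def incentive_prefixes(sequences, event):
    """Elements before the first occurrence of the event, per sequence."""
    prefixes = []
    for seq in sequences:
        for i, element in enumerate(seq):
            if event in element:
                if i > 0:  # assumption: skip empty prefixes
                    prefixes.append(seq[:i])
                break
    return prefixes


# Hypothetical toy data.
toy = [
    [{"+A"}, {"+B"}, {"-A"}],
    [{"-A"}, {"+C"}],
    [{"+B"}, {"+C"}],
]
u_e = sequences_containing(toy, "-A")
prefixes = incentive_prefixes(toy, "-A")
print(len(u_e), prefixes)
```

Either list can then be passed to whichever sequence pattern mining routine is in use.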

Sequence pattern mining is implemented with an improved Apriori-like sequence pattern discovery algorithm, so as to maximize mining efficiency.

Fig. 5 is a schematic diagram of the basic structure of an Apriori-like sequence pattern discovery algorithm according to an embodiment of the present invention. As shown in Fig. 5, all original frequent sequences (i.e., frequent 1-sequences) are found first; then, in each iteration, candidate frequent sequences (i.e., candidate k-sequences) are generated from the frequent sequences found so far, and support counting is performed on the candidates to determine the frequent sequences of the target sequence.

However, the Apriori algorithm has the following problem. If the total number of events l is large and the supports of the events in the event frequency table E are highly unbalanced, a low support threshold must be set during sequence pattern discovery so that the resulting sequence patterns cover a sufficient number of events rather than only the few events with the highest frequency. In that case, generating candidate k-sequences for small k, particularly candidate 2-sequences and candidate 3-sequences, produces a large number of invalid candidate sequences, that is, candidates whose support count is 0. Support counting is still performed on all of these invalid sequences, which increases the computational burden.

Fig. 6 is a schematic diagram of a user behavior pattern sequence pattern mining device according to an embodiment of the present invention. As shown in Fig. 6, its core is to estimate the computation cost after the candidate k-sequences for small k have been generated. If the number of candidate sequences is too large, so that there are many potentially invalid sequences and support counting is expensive, the candidate k-sequences are instead generated by sequence scanning, which guarantees that they contain no useless candidates; they then re-enter the Apriori loop logic.

It should be noted that the time complexity of the support count for each candidate sequence is O(mn), and the time complexity of generating the candidate k-sequences from the candidate (k-1)-sequences is O(L_CAND²), where L_CAND is the number of candidate (k-1)-sequences. Assuming that N_CAND candidate k-sequences are generated, the time complexity C1 of the step from the candidate (k-1)-sequences to the frequent k-sequences is

C1 = O(L_CAND²) + N_CAND × O(mn).

On the other hand, the time complexity C2 of generating the frequent k-sequences with the sequence scanning algorithm is

C2 = O(m^k · n · l) + N_VALID × O(mn),

where N_VALID is the number of valid candidate sequences, i.e. the number of candidate sequences whose support is greater than zero. These can be regarded approximately as C1 = N_CAND × O(mn) and C2 = (m^(k-1) · l + N_VALID) × O(mn). Therefore, when m^(k-1) · l + N_VALID << N_CAND, it is more efficient to generate the frequent k-sequences by sequence scanning.

Specifically, the relative magnitude of N_VALID and N_CAND can be estimated from the total number of events l in the event frequency table E and from the skewness or standard deviation of the event frequency distribution.

In practice, if the total number of events is large (l > 100), the event frequency distribution has a large positive skew, and the support threshold is low, then for small k, e.g. 2 or 3, m^(k-1) · l + N_VALID is much smaller than N_CAND.
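As an illustration of the skewness estimate, the sketch below computes the moment-based (population) skewness of a list of event frequencies. The choice of this particular skewness formula is an assumption, since the description does not name one, and the nine values are only the excerpt shown in Table 1, not a full event frequency table.

```python
def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


# Frequencies taken from the excerpt in Table 1.
table_1_frequencies = [
    168646, 152921, 107664, 103984, 67737, 44709, 44110, 40388, 38934,
]

# A positive value indicates a long right tail: a few very frequent events
# and many comparatively rare ones.
print(skewness(table_1_frequencies))
```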

Fig. 7 is a schematic diagram of a data visualization interface device according to an embodiment of the present invention. As shown in Fig. 7, association pattern rule mining is performed by the frequent user behavior event stream analysis device and the user behavior pattern sequence pattern mining device to determine the analysis results of the sequence data set; the analysis results are exchanged interactively with the user interface and are then passed through the user interface to a visualization tool.

Optionally, in the user interface and visualization device of the data visualization interface device, the mining of the users' common sequence patterns and the mining of the sequence patterns and generated rules related to every event may run as non-real-time functions: all rules are mined in advance and stored, and the stored rule data then serves as the input of the user interface (UI). This is feasible because the number of events in the data sequence set is limited and predictable. The event stream analysis of user characteristic behavior, by contrast, runs in real time: the result is recomputed each time the target changes, because the event stream combinations that user input in the UI can produce are numerous and unpredictable. Since the time complexity of this module is linear in the number of sequences, its efficiency is acceptable.

According to the technical solution provided by the invention, the hidden associations between the addition and the loss of different services are mined from the time sequence in which users use services, and the output covers two different levels or ranges. At the micro level, on-path proximity analysis is performed on the addition or loss behavior, behavior set, or behavior sequence of a specific service: what commonly happens in the step before a user takes the behavior, what commonly happens at the same time as the behavior, and what is likely to happen in the next step, so that the potential causes or direct effects of the behavior are mined. At the macro level, on the one hand, similar behavior mining is performed on the usage sequences of all users to explore the common characteristics of users or the potential relationships among services; on the other hand, characteristic behaviors are extracted from the sequences related to the addition or loss of each service, which provides a direct and effective tool for service recommendation or loss early warning.

According to still another embodiment of the present invention, there is also provided a storage medium including a stored program, wherein the program, when run, performs any one of the data analysis methods described above.

According to yet another embodiment of the present invention, there is also provided a processor configured to run a program, wherein the program, when run, performs any one of the data analysis methods described above.

According to an embodiment of the present invention, there is also provided an embodiment of a data analysis apparatus, where the data analysis apparatus may be configured to execute a data analysis method in the embodiment of the present invention, and the data analysis method in the embodiment of the present invention may be executed in the data analysis apparatus.

Fig. 8 is a schematic diagram of a data analysis apparatus according to an embodiment of the present invention. As shown in Fig. 8, the apparatus may include: an acquiring unit 82, configured to acquire a sequence data set of user services, where each sequence in the sequence data set corresponds to usage data of a service used by a user, each sequence includes at least one element, each element includes at least one event, and an event indicates the addition or loss of a user service; a first determining unit 84, configured to determine a target event among at least one event of the sequence data set and to determine a sequence including the target event as a target sequence; and a second determining unit 86, configured to perform sequence pattern mining on the target sequence and to determine a related sequence of the target event and the degree of association between the related sequence and the target event, where the related sequence represents at least one event associated with the target event in the sequence including the target event.

It should be noted that the acquiring unit 82 in this embodiment may be configured to execute step S102 in this embodiment, the first determining unit 84 in this embodiment may be configured to execute step S104 in this embodiment, and the second determining unit 86 in this embodiment may be configured to execute step S106 in this embodiment. The examples and application scenarios realized by the above units are the same as those of the corresponding steps, but are not limited to the disclosure of the above embodiments.

As an alternative embodiment, the acquiring unit includes: an acquisition module, configured to collect usage data of user services from a database, where the usage data includes the user number, time, event, and usage amount of the service used; a first generation module, configured to generate a start time table and an attrition time table of the user services based on the usage data, where the start time table represents the time at which each user starts using a user service and the attrition time table represents the time at which each user stops using a user service; and a second generation module, configured to generate the sequence data set according to the start time table and the attrition time table.
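One way the second generation module could turn the two tables into a sequence data set is sketched below. The table layout (rows of user number, service, and time period), the grouping of events from the same period into one element, and all names are assumptions made for illustration.

```python
from collections import defaultdict


def build_sequences(start_table, attrition_table):
    """Build one time-ordered sequence of event sets per user.

    Both tables are iterables of (user_id, service, period) rows, where
    period is any sortable value such as "2019-01".
    """
    per_user = defaultdict(lambda: defaultdict(set))
    for user, service, period in start_table:
        per_user[user][period].add(f"+{service}")   # service added
    for user, service, period in attrition_table:
        per_user[user][period].add(f"-{service}")   # service lost
    return {
        user: [periods[p] for p in sorted(periods)]
        for user, periods in per_user.items()
    }


# Hypothetical toy tables.
starts = [("u1", "A", "2019-01"), ("u1", "B", "2019-02"), ("u2", "A", "2019-01")]
attritions = [("u1", "A", "2019-02")]

dataset = build_sequences(starts, attritions)
print(dataset)
```

Events that fall into the same period end up in the same element, which is what allows concurrent events to be analyzed alongside preceding and following ones.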

As an alternative embodiment, the second determining unit includes: a first determining module, configured to determine an event occurrence frequency table of the sequence data set, where the event occurrence frequency table indicates the frequency of all events in the sequence data set; a setting module, configured to set a support threshold of the target event based on the event occurrence frequency table; and a second determining module, configured to perform sequence pattern mining on the target sequence based on the support threshold and to determine the related sequence of the target event.

As an alternative embodiment, the second determining unit includes: a third determining module, configured to determine the frequent sequences of the target sequence in the sequence data set, where the frequent sequences include frequent adjacent events of the target event; a fourth determining module, configured to determine the number of the frequent sequences; a screening module, configured to eliminate invalid sequences from the frequent sequences to obtain valid frequent sequences when the number of frequent sequences is higher than a preset number; and a calculating module, configured to calculate the support of the valid frequent sequences and to determine the valid frequent sequences whose support is higher than the support threshold as the related sequences.

As an alternative embodiment, the apparatus further includes: a third determining unit, configured to determine an adjacent frequent event set of the target event after the target event is determined among at least one event of the sequence data set, where the adjacent frequent event set includes a pre-adjacent event set, a concurrent event set, and a post-adjacent event set, and the events in the adjacent frequent event set are arranged in descending order of their degree of association with the target event; and a generating unit, configured to generate a target subsequence according to the degree of association between the events in the adjacent frequent event set and the target event, where the target subsequence represents a path of user services.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also fall within the protection scope of the present invention.

Claims

1. A method of data analysis, comprising:

acquiring a sequence data set of user services, wherein each sequence in the sequence data set corresponds to use data of a service used by a user, each sequence comprises at least one element, each element comprises at least one event, and the event is used for indicating addition or loss of the user service;

determining a target event among at least one event of the sequence dataset and determining a sequence comprising the target event as a target sequence;

and performing sequence pattern mining on the target sequence, determining a related sequence of the target event and the association degree of the related sequence and the target event, wherein the related sequence represents at least one event associated with the target event in the sequence comprising the target event.

2. The method of claim 1, wherein obtaining a sequence data set for a user service comprises:

collecting the use data of user service from a database, wherein the use data comprises user number, time, event and use amount of the used service;

generating a start time table and an attrition time table for the user service based on the usage data, wherein the start time table is used for representing the time when each user starts using the user service and the attrition time table is used for representing the time when each user stops using the user service;

generating the sequence dataset according to the start schedule and the attrition schedule.

3. The method of claim 1, wherein performing sequence pattern mining on the target sequence and determining the related sequence of the target event comprises:

determining an event occurrence frequency table of the sequence data set, wherein the event occurrence frequency table is used for indicating the frequency of all events in the sequence data set;

setting a support threshold of the target event based on the event occurrence frequency table;

and performing sequence pattern mining on the target sequence based on the support degree threshold value, and determining a related sequence of the target event.

4. The method of claim 1, wherein performing sequence pattern mining on the target sequence and determining the related sequence of the target event comprises:

determining a frequent sequence of the target sequence in a sequence data set, wherein the frequent sequence comprises frequent adjacent events of the target event;

determining the number of the frequent sequences;

when the number of the frequent sequences is higher than a preset number, eliminating invalid sequences from the frequent sequences to obtain valid frequent sequences;

and calculating the support degree of the effective frequent sequences, and determining the effective frequent sequences with the support degree higher than a support degree threshold value as the related sequences.

5. The method of claim 1, wherein after determining a target event in at least one event of the sequence dataset, the method further comprises:

determining an adjacent frequent event set of the target event, wherein the adjacent frequent event set comprises a pre-adjacent event set, a concurrent event set, and a post-adjacent event set, and the events in the adjacent frequent event set are arranged in descending order of their degree of association with the target event;

and generating a target subsequence according to the association degree of the events in the adjacent frequent event set and the target events, wherein the target subsequence represents the path of the user service.

6. A data analysis apparatus, comprising:

an acquisition unit, configured to acquire a sequence data set of user services, wherein each sequence in the sequence data set corresponds to the usage data of a service used by a user, each sequence comprises at least one element, each element comprises at least one event, and the event is used for indicating the addition or the loss of the user service;

a first determining unit configured to determine a target event among at least one event of the sequence dataset, and determine a sequence including the target event as a target sequence;

and a second determining unit, configured to perform sequence pattern mining on the target sequence, determine a related sequence of the target event and a degree of association between the related sequence and the target event, where the related sequence represents at least one event associated with the target event in a sequence including the target event.

7. The apparatus of claim 6, wherein the obtaining unit comprises:

an acquisition module, configured to collect the usage data of user services from a database, wherein the usage data comprises the user number, time, event, and usage amount of the service used;

a first generation module, configured to generate a start time table and an attrition time table of the user service based on the usage data, wherein the start time table is used for indicating the time when each user starts using the user service and the attrition time table is used for indicating the time when each user stops using the user service;

a second generation module, configured to generate the sequence data set according to the start time table and the attrition time table.

8. The apparatus of claim 6, wherein the second determining unit comprises:

the first determining module is used for determining an event occurrence frequency table of the sequence data set, wherein the event occurrence frequency table is used for indicating the frequency of all events in the sequence data set;

the setting module is used for setting a support threshold of the target event based on the event occurrence frequency table;

and the second determination module is used for carrying out sequence pattern mining on the target sequence based on the support degree threshold value and determining the related sequence of the target event.

9. A storage medium characterized by comprising a stored program, wherein the program executes the data analysis method of any one of claims 1 to 5.

10. A processor, characterized in that the processor is configured to run a program, wherein the program is configured to execute the data analysis method according to any one of claims 1 to 5 when running.