CN117371876B

CN117371876B - Index data analysis method and system based on keywords

Info

Publication number: CN117371876B
Application number: CN202311671020.7A
Authority: CN
Inventors: 刘宏; 曲松; 倪燕
Original assignee: Shenzhen Pinkuo Information Technology Co ltd
Current assignee: Shenzhen Pinkuo Information Technology Co ltd
Priority date: 2023-12-07
Filing date: 2023-12-07
Publication date: 2024-04-02
Anticipated expiration: 2043-12-07
Also published as: CN117371876A

Abstract

The application relates to the technical field of data analysis and discloses an index data analysis method and system based on keywords. The method comprises the following steps: acquiring initial user learning index data of a plurality of target users based on a cloud platform, and performing keyword index analysis to acquire the target user learning index data; classifying users through an EM algorithm to obtain a plurality of target user groups; carrying out mixed Weibull distribution parameter operation through a mixed Weibull distribution model to obtain a distribution scale parameter and a distribution shape parameter; constructing a variable set to obtain a target variable set; performing variable relation analysis through an initial Bayesian network to obtain target influence factors; and performing model parameter optimization through a particle swarm optimization algorithm to obtain an optimal network parameter combination, and performing network parameter updating on the initial Bayesian network through the optimal network parameter combination to obtain a target Bayesian network, thereby improving the analysis accuracy of index data.

Description

Index data analysis method and system based on keywords

Technical Field

The present disclosure relates to the field of data analysis technologies, and in particular, to a method and a system for analyzing index data based on keywords.

Background

With the popularization of online education platforms, a large amount of learning data is recorded, including learning behaviors, interaction patterns, learning progress, and the like of students. These data contain important insights into education strategies, course designs, personalized learning paths, etc. However, how to extract valuable information from these huge and complex data sets, thereby improving teaching methods and improving learning efficiency is a challenge facing the current education technical field.

By analyzing the learning index data of users, the learning habits, preferences, and potential difficulties of students can be revealed, which supports educators and helps them better understand and meet the needs of students. For example, by analyzing the attention and participation of students in a particular course, the educator can adjust the educational program to follow the interests and learning patterns of the students more closely. However, the analysis accuracy of existing solutions is low.

Disclosure of Invention

The present application provides a keyword-based index data analysis method and system, which improve the analysis accuracy of index data.

The first aspect of the present application provides a keyword-based index data analysis method, which includes:

Acquiring initial user learning index data of a plurality of target users based on a preset cloud platform, and performing keyword index analysis on the initial user learning index data to obtain target user learning index data;

user classification is carried out on the plurality of target users according to the target user learning index data through a preset EM algorithm, so that a plurality of target user groups are obtained;

inputting the target user learning index data into a preset mixed Weibull distribution model to perform mixed Weibull distribution parameter operation to obtain a distribution scale parameter and a distribution shape parameter;

according to the multiple target user groups, carrying out variable set construction on the distribution scale parameters and the distribution shape parameters to obtain a target variable set of each target user group;

inputting a target variable set of each target user group into a preset initial Bayesian network for variable relation analysis to obtain target influence factors of each target user group;

and carrying out model parameter optimization on the initial Bayesian network according to target influence factors of each target user group through a preset particle swarm optimization algorithm to obtain an optimal network parameter combination, and carrying out network parameter updating on the initial Bayesian network through the optimal network parameter combination to obtain a target Bayesian network.

A second aspect of the present application provides a keyword-based index data analysis system, the keyword-based index data analysis system comprising:

the acquisition module is used for acquiring initial user learning index data of a plurality of target users based on a preset cloud platform, and carrying out keyword index analysis on the initial user learning index data to obtain target user learning index data;

the classification module is used for carrying out user classification on the plurality of target users according to the target user learning index data through a preset EM algorithm to obtain a plurality of target user groups;

the operation module is used for inputting the target user learning index data into a preset mixed Weibull distribution model to carry out mixed Weibull distribution parameter operation so as to obtain a distribution scale parameter and a distribution shape parameter;

the construction module is used for constructing variable sets of the distribution scale parameters and the distribution shape parameters according to the plurality of target user groups to obtain target variable sets of each target user group;

the analysis module is used for inputting the target variable set of each target user group into a preset initial Bayesian network to perform variable relation analysis so as to obtain target influence factors of each target user group;

The updating module is used for carrying out model parameter optimization on the initial Bayesian network according to target influence factors of each target user group through a preset particle swarm optimization algorithm to obtain an optimal network parameter combination, and carrying out network parameter updating on the initial Bayesian network through the optimal network parameter combination to obtain a target Bayesian network.

According to the technical scheme, keyword index analysis is performed on the user learning index data so that user behavior can be identified and quantified accurately. This improves the accuracy of the data analysis and makes the understanding of user behavior more thorough and specific. Applying the EM algorithm divides users efficiently into different groups, each with its own behavioral characteristics. Such classification helps identify different types of users and can also be used to formulate targeted strategies, such as customized teaching methods. The mixed Weibull distribution model can effectively model complex patterns of user behavior, especially when the behavior data is subject to multiple influencing factors. Such a model can reveal deeper features of user behavior, such as the duration and frequency of learning activities. Inputting the variable set into the Bayesian network for analysis identifies the relationships and interactions between the variables. Such analysis helps in understanding the various factors that affect user behavior and supports the development of more effective strategies. Optimizing the parameters of the Bayesian network with a particle swarm optimization algorithm can significantly improve the performance of the model. The optimization improves the accuracy and reliability of the model on complex data sets, so that the prediction and analysis results are more accurate and the analysis accuracy of index data is further improved.

Drawings

FIG. 1 is a schematic diagram of one embodiment of a keyword-based index data analysis method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of an embodiment of a keyword-based index data analysis system according to an embodiment of the present application.

Detailed Description

The embodiment of the application provides an index data analysis method and system based on keywords, so that the analysis accuracy of index data is improved.

The terms "first," "second," "third," "fourth" and the like in the description and in the claims of this application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.

For ease of understanding, a specific flow of an embodiment of the present application is described below. Referring to FIG. 1, one embodiment of the keyword-based index data analysis method in the embodiment of the present application includes:

step 101, acquiring initial user learning index data of a plurality of target users based on a preset cloud platform, and performing keyword index analysis on the initial user learning index data to obtain target user learning index data;

it may be understood that the execution body of the present application may be a keyword-based index data analysis system, or may be a terminal or a server, which is not limited herein. The embodiment of the present application will be described by taking a server as an execution body.

Specifically, learning index monitoring is first performed on a plurality of target users based on a preset cloud platform to obtain initial user learning index data. These data cover each user's online learning duration, activity frequency, selection of learning content, and so on, and reflect the user's learning habits and preferences. After the data are acquired, they are processed with a predefined keyword analysis function to calculate keyword frequency data. The keyword analysis function is based on the TF-IDF (term frequency-inverse document frequency) algorithm, in which the term frequency (TF) reflects how often a keyword occurs in a given user's learning index data, and the inverse document frequency (IDF) measures how distinctive the keyword is across the whole data set. Specifically, the frequency of occurrence of each keyword in an individual user's learning index data is calculated and then multiplied by the inverse document frequency, which decreases as the keyword appears in more of the user learning index data, to obtain the weight of the keyword. In this way, words that are frequent in a single user's data and also distinctive across the whole data set receive high weights, so that the learning characteristics and preferences of each user can be grasped more accurately. Then, based on the keyword frequency data, the system performs association rule learning and analyzes the relationships between different keywords to reveal user behavior patterns. For example, if certain keywords are found to occur together frequently, this indicates that the corresponding learning content or activities are related to each other or are of similar importance to the user.
Finally, according to the user behavior mode obtained through association rule learning, the system further analyzes and screens the initial user learning index data to extract key index data reflecting learning habits and preferences of the target user. These data will provide key information for subsequent user analysis, content recommendation, or personalized educational path design.
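The association rule learning described above can be illustrated with a minimal sketch. The source does not name a specific rule-mining algorithm, so the pairwise support and confidence computation below is an assumption chosen for brevity, and the function name, thresholds, and sample keyword sets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for keyword pairs.

    transactions: one list of keywords per user record (hypothetical input shape).
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for keywords in transactions:
        unique = sorted(set(keywords))
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of records containing both keywords
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical keyword sets extracted from four user records.
sessions = [
    ["programming", "algorithms", "forum"],
    ["programming", "algorithms"],
    ["programming", "forum"],
    ["video", "quiz"],
]
rules = pairwise_rules(sessions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Rules that pass both thresholds correspond to keywords that occur together frequently, which is the kind of user behavior pattern the embodiment uses for subsequent screening.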

Step 102, performing user classification on the plurality of target users according to the target user learning index data through a preset EM algorithm to obtain a plurality of target user groups;

Specifically, the EM algorithm is an iterative algorithm for maximum likelihood estimation of the parameters of probability models that contain hidden (latent) variables. In this embodiment, users are classified according to the target user learning index data to obtain different user groups. First, category probabilities are calculated for the plurality of target users using the EM algorithm. The learning index data of each user is regarded as observation data, and the probability that each user belongs to each category is calculated from these data. In the E step (expectation step) of the EM algorithm, the algorithm estimates the probability that each user belongs to each category under the current parameters. These probabilities can be understood as the posterior probability distribution of the hidden variables, reflecting how likely each user is to belong to each category under the current model parameters. The users are then classified based on the calculated category probabilities, resulting in a plurality of initial user groups. In this process, each user is assigned to the category to which the user most probably belongs, forming a preliminary division into user groups. Next, a parameter update calculation is performed on the target user learning index data through a preset parameter update function to optimize the classification performance of the model. This is the M step (maximization step) of the EM algorithm, in which the algorithm updates the model parameters using the posterior probability distribution obtained in the E step. The purpose of the parameter update function is to maximize the expected log-likelihood function of the observed data, which weights each observed data point by its posterior probability under the current parameters.
In this process, θ^(new) represents the updated parameters and θ^(old) represents the parameters before the update; each iteration computes new parameter values from the previous parameters and the data. The classification performance of the model is then gradually refined by iteratively updating the parameters. In each iteration, the model parameters are adjusted according to the newly calculated probability distribution and the observed data. This process is repeated until the parameters converge to stable values, at which point the server obtains the target update parameters. Finally, based on the target update parameters, group optimization is performed on the initial user groups to obtain the final target user groups. In this way, the EM algorithm can handle complex data sets containing hidden variables, providing a user classification tool for keyword-based index data analysis and a basis for subsequent data analysis and decision-making.

Step 103, inputting target user learning index data into a preset mixed Weibull distribution model to carry out mixed Weibull distribution parameter operation, so as to obtain a distribution scale parameter and a distribution shape parameter;

First, mixture component analysis is performed on the target user learning index data to identify the different behavior patterns or user groups present in the data. The goal is to decompose the overall data into subsets, each representing a particular pattern or feature of user behavior. For example, different mixture components may represent different learning time preferences, course selection habits, or interaction styles. Then, user behavior time distribution modeling is performed on the target user learning index data through a preset mixed Weibull distribution model. The Weibull distribution is a probability distribution widely used in survival analysis, reliability analysis, and risk assessment, and the mixed Weibull distribution combines multiple Weibull distributions to better fit diverse data sets. The mixed Weibull distribution model is used to capture different characteristics of the user behavior time distribution, such as the distribution of learning durations and the distribution of active time periods. Next, a distribution parameter operation function is applied to calculate the distribution scale parameter and the distribution shape parameter, which help in understanding and describing user behavior patterns. The distribution scale parameter (λ) describes the spread of the distribution, while the distribution shape parameter (β) describes its shape, such as its skewness. A specially designed function L is used, which combines the weights of the mixture components (π_k) with the scale parameter and shape parameter of each mixture component to describe the time values (t_i). From this function, the mixed Weibull distribution parameters describing the entire data set can be calculated. Finally, a mixed Weibull distribution that accurately reflects the learning behavior characteristics of the target users is obtained.
These distribution parameters not only help to understand the learning behavior of the user, but also provide powerful support for subsequent data analysis, user grouping, personalized recommendation, etc.
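As an illustration of how the mixture weights π_k, scale parameters λ and shape parameters β combine over the time values t_i, the sketch below evaluates a mixed Weibull density and its log-likelihood. It assumes that the function L is the standard mixture log-likelihood, which this passage does not confirm; the function names and the sample durations are hypothetical.

```python
import math


def weibull_pdf(t, scale, shape):
    """Density of a two-parameter Weibull distribution at time t."""
    if t <= 0:
        return 0.0
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


def mixture_pdf(t, weights, scales, shapes):
    """Weighted sum of Weibull densities; weights are assumed to sum to 1."""
    return sum(
        w * weibull_pdf(t, lam, beta)
        for w, lam, beta in zip(weights, scales, shapes)
    )


def log_likelihood(times, weights, scales, shapes):
    """Assumed form of L: sum over i of log(sum over k of pi_k * f(t_i; lambda_k, beta_k))."""
    return sum(math.log(mixture_pdf(t, weights, scales, shapes)) for t in times)


# Hypothetical learning durations (hours) with a short and a long pattern.
durations = [0.5, 0.8, 1.1, 3.9, 4.2, 4.6]
two_pattern_fit = log_likelihood(durations, [0.5, 0.5], [1.0, 4.5], [2.0, 8.0])
single_pattern_fit = log_likelihood(durations, [1.0], [2.5], [1.0])
print(two_pattern_fit, single_pattern_fit)
```

Under this assumption, the scale and shape parameters reported by the step would be the values that maximize the log-likelihood, for example as found by an EM-style or numerical optimizer.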

Step 104, performing variable set construction on the distribution scale parameters and the distribution shape parameters according to the plurality of target user groups to obtain a target variable set of each target user group;

Specifically, initial feature data is first constructed for each target user group from the obtained distribution scale parameters and distribution shape parameters. The characteristics of each user group are defined using the parameters calculated from the mixed Weibull distribution model. The distribution scale parameters and shape parameters reflect specific aspects of the users' learning behavior, such as learning frequency and duration, which helps in understanding the behavior patterns of different user groups. Next, the initial feature data is normalized to ensure consistency and comparability. Normalization typically involves scaling the data to a standard range, such as 0 to 1 or -1 to 1, or applying Z-score normalization, to eliminate the effects of different units and ranges. This step ensures that comparisons and combinations of different features are fair and valid. Then, feature clustering is performed on the normalized target feature data using a preset feature extraction function to obtain a target variable set for each target user group. Feature clustering is a data mining technique that identifies groups of similar or related features in a large amount of data. The feature extraction function is based on a fuzzy clustering algorithm, whose core idea is to consider the degree of membership of each data point in the different groups. This differs from traditional hard clustering algorithms (e.g., k-means), which assign each data point to a single group. Fuzzy clustering allows data points to belong to multiple groups to varying degrees, providing a more flexible and fine-grained view for data analysis.
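The normalization and fuzzy clustering described above can be sketched as follows. The source does not give the feature extraction function at this point, so the standard fuzzy c-means update with fuzzifier m = 2 is used here as an assumption; the function names, the initial centers, and the sample (scale, shape) rows are hypothetical.

```python
def min_max_normalize(rows):
    """Scale every feature column to the range 0 to 1."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def memberships(points, centers, m=2.0):
    """u[i][k] is the degree to which point i belongs to cluster k; rows sum to 1."""
    result = []
    for p in points:
        d = [sq_dist(p, c) for c in centers]
        if any(v == 0.0 for v in d):
            # A point lying exactly on a center belongs fully to that center.
            hits = [1.0 if v == 0.0 else 0.0 for v in d]
            row = [h / sum(hits) for h in hits]
        else:
            inv = [v ** (-1.0 / (m - 1.0)) for v in d]
            row = [v / sum(inv) for v in inv]
        result.append(row)
    return result


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    for _ in range(iterations):
        u = memberships(points, centers, m)
        new_centers = []
        for k in range(len(centers)):
            w = [row[k] ** m for row in u]
            total = sum(w)  # assumed positive, which holds for the example data
            new_centers.append([
                sum(wi * p[j] for wi, p in zip(w, points)) / total
                for j in range(len(points[0]))
            ])
        centers = new_centers
    return centers, memberships(points, centers, m)


# Hypothetical (scale, shape) feature rows.
features = [[1.0, 0.8], [1.2, 0.9], [5.0, 2.0], [5.5, 2.2]]
normalized = min_max_normalize(features)
centers, u = fuzzy_c_means(normalized, [normalized[0], normalized[-1]])
```

Unlike a hard assignment, each row of `u` keeps a degree of membership for every cluster, which is the property the passage contrasts with k-means.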

Step 105, inputting a target variable set of each target user group into a preset initial Bayesian network for variable relation analysis to obtain target influence factors of each target user group;

Specifically, the target variable set of each target user group is first input into a preset initial Bayesian network. A Bayesian network is a graphical model that represents probabilistic relationships between variables through nodes and directed edges. In this network, each node represents a variable, and each edge represents a probabilistic dependency between variables. The target conditional probabilities for each target variable set are calculated by a probability inference function in the initial Bayesian network. This probability inference function is based on Bayes' theorem and infers the conditional probability of each variable in the target variable set by taking into account the conditional dependencies between the variables. This process involves calculating the conditional probability of each variable given the values of its parent nodes, which represent the other variables that directly affect it. Doing this for the whole target variable set yields an overall target conditional probability distribution that reflects the interrelationships and influences between the variables. Next, probability inference is performed for each target user group based on the resulting target conditional probabilities. The behavior patterns and features of the individual user groups are inferred from the conditional probability distribution of the Bayesian network. Probability inference helps the server understand how the probabilities of other variables change given the values of some variables. This helps identify the main features of each user group and also reveals the interrelationships between these features. Finally, influence factors are identified for each target user group according to the result of the probability inference, so that the key influencing factors of each user group are mined in greater depth on the basis of the preceding analysis.
By identifying these factors, the server better understands the behavior and preferences of different user groups, thereby providing an important basis for formulating more accurate and efficient user service policies.
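A small worked example may clarify what probability inference over a Bayesian network involves. The three-node network below, its variable names, and all probability values are hypothetical and are not taken from the source; inference is done by plain enumeration, which is only practical for very small networks.

```python
# Hypothetical network: scale -> engagement <- shape.
p_scale = {"high": 0.4, "low": 0.6}  # prior of the scale variable
p_shape = {"high": 0.5, "low": 0.5}  # prior of the shape variable
p_engaged = {  # conditional probability table: P(engaged | scale, shape)
    ("high", "high"): 0.9,
    ("high", "low"): 0.7,
    ("low", "high"): 0.4,
    ("low", "low"): 0.1,
}


def joint(scale, shape, engaged):
    """Joint probability as the product of each node given its parents."""
    p_yes = p_engaged[(scale, shape)]
    return p_scale[scale] * p_shape[shape] * (p_yes if engaged else 1.0 - p_yes)


def posterior_scale(engaged):
    """P(scale | engaged) by summing the joint over shape and normalizing."""
    unnormalized = {
        s: sum(joint(s, b, engaged) for b in p_shape)
        for s in p_scale
    }
    z = sum(unnormalized.values())
    return {s: v / z for s, v in unnormalized.items()}


post = posterior_scale(True)
print(post)
```

Here observing an engaged user raises the probability of the "high" scale value from its prior of 0.4 to about 0.68, which is the kind of shift that would mark a variable as an influence factor.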

Step 106, performing model parameter optimization on the initial Bayesian network according to target influence factors of each target user group through a preset particle swarm optimization algorithm to obtain an optimal network parameter combination, and performing network parameter update on the initial Bayesian network through the optimal network parameter combination to obtain the target Bayesian network.

Specifically, a preset particle swarm optimization (PSO) algorithm is used. Particle swarm optimization is a population-based optimization method that finds optimal solutions in a complex search space by simulating the movement of a swarm of particles through the solution space. Based on the target influence factors of each target user group, the algorithm generates model parameter particles for the initial Bayesian network, yielding a plurality of initial network parameter particles. These particles represent parameter configurations of the Bayesian network, and the position of each particle corresponds to a particular combination of network parameters. The goal of the particles is to find the combination of network parameters that best explains the target user group data. Next, the particles are initialized by determining the position and velocity of each initial network parameter particle. This step sets the initial state of each particle in the search space, where the position represents one potential parameter combination of the Bayesian network and the velocity determines the direction and speed of the particle's search. Then, the position and velocity of each particle are updated by a preset particle update function to guide the particles toward the optimal solution. The update consists of two parts: the velocity update and the position update. The velocity update is affected by three factors: the current velocity of the particle (inertia), the best position the particle has found so far (individual learning component), and the best position found by the whole swarm (social learning component). This update mechanism lets the particles balance individual experience against swarm experience and thus explore the solution space effectively.
Finally, the optimal combination of network parameters found in this way is used to update the network nodes and conditional probability tables of the initial Bayesian network, resulting in the target Bayesian network. The optimized Bayesian network reflects the behavior patterns and characteristics of the target user groups more accurately, thereby providing more accurate and effective support for keyword-based index data analysis.
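The velocity and position updates described above can be sketched with a minimal particle swarm optimizer. The source does not give its particle update function in this passage, so the standard form with an inertia weight w and learning coefficients c1 and c2 is assumed; the objective, the bounds, and all parameter values are hypothetical stand-ins for a real network-fitting loss.

```python
import random


def pso(objective, dim, n_particles=20, iterations=100,
        w=0.7, c1=1.5, c2=1.5, bounds=(0.0, 1.0), seed=0):
    """Minimize objective over a box using a basic global-best particle swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]            # best position found by each particle
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]  # best position found by the swarm

    for _ in range(iterations):
        for i in range(n_particles):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][j] = (w * vel[i][j]                                # inertia
                             + c1 * r1 * (best_pos[i][j] - pos[i][j])     # individual
                             + c2 * r2 * (g_pos[j] - pos[i][j]))          # social
                pos[i][j] = min(hi, max(lo, pos[i][j] + vel[i][j]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


# Hypothetical objective: squared error against two target probabilities.
target = [0.9, 0.2]


def loss(params):
    return sum((p - t) ** 2 for p, t in zip(params, target))


best, best_loss = pso(loss, dim=2)
print(best, best_loss)
```

In the embodiment, each particle position would encode Bayesian network parameters such as conditional probability table entries, and the objective would measure how well the network explains the target user group data.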

According to the method, keyword index analysis is performed on the user learning index data so that user behavior can be identified and quantified accurately. This improves the accuracy of the data analysis and makes the understanding of user behavior more thorough and specific. Applying the EM algorithm divides users efficiently into different groups, each with its own behavioral characteristics. Such classification helps identify different types of users and can also be used to formulate targeted strategies, such as customized teaching methods. The mixed Weibull distribution model can effectively model complex patterns of user behavior, especially when the behavior data is subject to multiple influencing factors. Such a model can reveal deeper features of user behavior, such as the duration and frequency of learning activities. Inputting the variable set into the Bayesian network for analysis identifies the relationships and interactions between the variables. Such analysis helps in understanding the various factors that affect user behavior and supports the development of more effective strategies. Optimizing the parameters of the Bayesian network with a particle swarm optimization algorithm can significantly improve the performance of the model. The optimization improves the accuracy and reliability of the model on complex data sets, so that the prediction and analysis results are more accurate and the analysis accuracy of index data is further improved.

In a specific embodiment, the process of executing step 101 may specifically include the following steps:

(1) Based on a preset cloud platform, carrying out learning index monitoring on a plurality of target users to obtain corresponding initial user learning index data;

(2) Performing keyword frequency calculation on the initial user learning index data through a preset keyword analysis function to obtain keyword frequency data, wherein the keyword analysis function is as follows:

$$T_{t,d} = \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \times \log\frac{N}{n_t}$$

where $t$ is a keyword, $d$ is the initial user learning index data, $T_{t,d}$ is the keyword frequency data of the keyword $t$ in the initial user learning index data $d$, $f_{t,d}$ is the frequency of occurrence of the keyword $t$ in the initial user learning index data $d$, $\sum_{t'} f_{t',d}$ is the total frequency of occurrence of all keywords in the initial user learning index data $d$, $N$ is the total number of initial user learning index data in the initial user learning index data set, and $n_t$ is the number of initial user learning index data containing the keyword $t$;

(3) Performing association rule learning on the initial user learning index data according to the keyword frequency data to obtain a user behavior mode;

(4) And extracting key word indexes from the initial user learning index data according to the user behavior mode to obtain target user learning index data.

Specifically, learning index monitoring is first performed on a plurality of target users through a preset cloud platform to obtain initial user learning index data. Such data typically include records of the user's activities on the online learning platform, such as the time spent watching course videos, the frequency of participation in discussions, and the completion of assignments. These data provide the basis for subsequent analysis. Then, keyword frequency calculation is performed on the initial user learning index data through a preset keyword analysis function to obtain keyword frequency data. This function is based on the TF-IDF (term frequency-inverse document frequency) algorithm, which evaluates the importance of a word in a data set. Specifically, the term frequency (TF) part calculates how often a word occurs in an individual user's learning index data, while the inverse document frequency (IDF) part considers how widely the word occurs across the entire data set, which helps distinguish common words from distinctive ones. For example, if most users on a learning platform frequently watch a certain type of course video, that type of course becomes a high-frequency keyword. Next, association rule learning is performed on the obtained keyword frequency data to identify user behavior patterns. Association rule learning is a method of finding relationships between variables in a large data set. In this process, the algorithm analyzes the relevance between different keywords, for example finding that users who watch a particular type of course also tend to participate in the related online discussions. In this way, the inherent patterns and trends in user behavior can be revealed, for example that users who frequently participate in programming course discussions are often also active in courses on algorithms and data structures.
Finally, keyword index extraction is performed on the initial user learning index data according to the identified user behavior patterns to obtain the target user learning index data. In this step, the most representative and predictive information is extracted from the data. For example, if the analysis results indicate that users participating in a particular discussion group generally achieve better results in the final exam, the frequency and activity level of participation in that discussion may be extracted as an important learning index. Likewise, if the viewing duration of certain course videos is found to be highly correlated with the course completion rate, the viewing data of these videos becomes a key learning index.
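The keyword frequency calculation can be sketched as below. The sketch assumes the standard TF-IDF form, in which the term frequency is multiplied by the logarithm of N divided by the number of records containing the keyword; the logarithm, the function name, and the sample records are assumptions for illustration.

```python
import math
from collections import Counter


def keyword_weights(records):
    """Return one dict per record mapping keyword t to its weight T_{t,d}.

    records: one list of keywords per user learning record d (hypothetical shape).
    """
    n_records = len(records)
    doc_freq = Counter()  # number of records that contain each keyword
    for keywords in records:
        doc_freq.update(set(keywords))

    weights = []
    for keywords in records:
        counts = Counter(keywords)
        total = sum(counts.values())  # total keyword occurrences in this record
        weights.append({
            t: (f / total) * math.log(n_records / doc_freq[t])
            for t, f in counts.items()
        })
    return weights


# Hypothetical keyword lists for three user records.
records = [
    ["python", "python", "video"],
    ["video", "quiz"],
    ["video", "forum", "forum"],
]
weights = keyword_weights(records)
print(weights[0])
```

A keyword that appears in every record, such as "video" here, receives zero weight under this form, while a keyword concentrated in one user's record is weighted up.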

In a specific embodiment, the process of executing step 102 may specifically include the following steps:

(1) Carrying out class probability calculation on a plurality of target users through a preset EM algorithm to obtain class probability of each target user;

(2) User classification is carried out on a plurality of target users according to the category probability, and a plurality of initial user groups are obtained;

(3) Updating parameter calculation is carried out on the target user learning index data through a preset parameter updating function, updated parameters are obtained, and the parameter updating function is as follows:

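$$\theta^{(new)} = \arg\max_{\theta} \sum_{i} \sum_{z} P\left(z \mid x_i, \theta^{(old)}\right) \log P\left(x_i, z \mid \theta\right)$$

where $\theta^{(new)}$ represents the updated parameters, $\theta^{(old)}$ represents the parameters before the update, $x_i$ is the i-th observed data point in the target user learning index data, $z$ is a hidden variable, $P(z \mid x_i, \theta^{(old)})$ represents the posterior probability of the hidden variable $z$ given the observed data point $x_i$ under the parameters before the update, and $P(x_i, z \mid \theta)$ represents the joint probability of the observed data point $x_i$ and the hidden variable $z$ under the updated parameters;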

(4) Carrying out iterative updating on the updated parameters to obtain target updating parameters, and carrying out group optimization on a plurality of initial user groups according to the target updating parameters to obtain a plurality of target user groups.

Specifically, the EM algorithm, i.e., the expectation-maximization algorithm, is a method for estimating the parameters of statistical models that contain hidden variables. In user classification, the EM algorithm helps the server estimate the category to which each user belongs from the user's behavior data, even if the categories are not explicitly defined in advance. First, category probabilities are calculated for the plurality of target users through the EM algorithm. In this process, the algorithm randomly initializes the probability that each user belongs to each potential category. For example, in an online learning platform, the potential categories may be "beginner", "intermediate user" and "advanced user". The preliminary category probability assignment is based on the learning activities of the user, such as course viewing time, quiz scores, and forum participation. Next, the algorithm enters an iterative process and continually updates these category probabilities so that they more accurately reflect each user's true category. Each iteration of the EM algorithm consists of two steps: the E-step (expectation step) and the M-step (maximization step). In the E-step, the algorithm calculates the posterior probability that each user belongs to each category, based on the probability of the user's data under the current parameters. In the M-step, the algorithm updates the model parameters, i.e., the characteristics of each category, to maximize the likelihood of the observed data. This process is repeated until the category probabilities converge. Then, parameter update calculation is performed on the target user learning index data using the preset parameter updating function. The core of this process is to optimize the model parameters so that they more accurately reflect the users' behaviors and categories. The objective of the parameter updating function is to find the parameter values that maximize the likelihood of the model. Likelihood here refers to the probability of the observed data given the model parameters. In this way, the model describes the user data as accurately as possible. Finally, the server obtains a more accurate user classification by iteratively updating the updated parameters. This process is dynamic: as user behavior changes and new data is added, the model is adjusted continually to better fit the users' actual behavior. After these steps are completed, the server may perform group optimization on the initial user groups according to the target updating parameters, thereby obtaining a plurality of target user groups. The optimization process considers not only the current behavior of the users but also the trend of change in that behavior, and provides the platform with a basis for more accurate user grouping and personalized recommendation.
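By way of a non-limiting illustration, the E-step and M-step described above can be sketched in Python for a one-dimensional mixture with two categories. The Gaussian form assumed for each category, the function names `gaussian_pdf` and `em_gmm_1d`, the sample data `weekly_hours`, and the initial parameter values are all assumptions introduced for this example; the present application does not specify them.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution (assumed category model)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """Fit a 1-D Gaussian mixture by EM.

    Returns (means, variances, weights, resp), where resp[i][j] is the
    posterior probability that data point i belongs to category j.
    """
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    resp = []
    for _ in range(n_iter):
        # E-step: posterior P(z = j | x_i, theta_old) for every data point.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: parameters that maximize the expected complete-data
        # log-likelihood under the posteriors computed above.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a degenerate category
            weights[j] = nj / len(data)
    return means, variances, weights, resp

# Hypothetical weekly learning hours of ten users.
weekly_hours = [1.0, 1.5, 2.0, 1.2, 0.8, 9.0, 10.5, 11.0, 9.5, 10.0]
means, variances, weights, resp = em_gmm_1d(
    weekly_hours, means=[0.0, 5.0], variances=[4.0, 4.0], weights=[0.5, 0.5])

# Assign each user to the category with the highest posterior probability.
groups = [max(range(2), key=lambda j: r[j]) for r in resp]
```

In this sketch the category probabilities (`resp`) correspond to step (1), the hard assignment (`groups`) to step (2), and the repeated M-step to steps (3) and (4); a fixed iteration count is used in place of a convergence test for brevity.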

In a specific embodiment, the process of executing step 103 may specifically include the following steps:

(1) Performing mixed component analysis on the target user learning index data to obtain a plurality of different mixed components;

(2) Modeling the user behavior time distribution of the target user learning index data through a preset mixed Weibull distribution model to obtain corresponding target mixed Weibull distribution;

(3) Carrying out mixed Weibull distribution parameter operation on the target mixed Weibull distribution according to the plurality of different mixed components through a preset distribution parameter operation function to obtain a distribution scale parameter and a distribution shape parameter, wherein the distribution parameter operation function is as follows: $\lambda$ is the scale parameter, $\beta$ is the shape parameter, $\lambda_k$ represents the distribution scale parameters of the different mixed components, $\beta_k$ represents the distribution shape parameters of the different mixed components, $\pi_k$ is the weight of the $k$-th mixed component, $t_i$ is the time value of the $i$-th data point in the target user learning index data, $K$ is the number of mixed components, and $w$ is the mixed distribution coefficient.

Specifically, first, mixed component analysis is performed on the target user learning index data to obtain a plurality of different mixed components. The purpose of the mixed component analysis is to decompose the user learning index data into a plurality of sub-components, each of which represents a particular pattern of user behavior. For example, in an online learning platform, these mixed components may represent different learning activities, course completion speeds, or interaction patterns. A clustering algorithm or another statistical analysis method may be applied to identify the potential patterns in the data. Users can be divided into several different groups by analyzing behavior data such as their learning time, course selection, and forum participation. Each group exhibits different learning characteristics and habits and thus constitutes one mixed component of the data. The learning index data are then modeled through a preset mixed Weibull distribution model. The Weibull distribution is a flexible probability distribution that can be used in user behavior analysis to model the time distribution of user behavior, such as learning duration and course completion time. For example, some users tend to complete courses in a concentrated manner over a short period (one form of the Weibull distribution), while others tend to spread completion over a longer period (another form of the Weibull distribution). By combining these different Weibull forms, the mixed Weibull distribution describes the time distribution characteristics of overall user behavior more fully. Then, the preset distribution parameter operation function is used to calculate the parameters of the mixed Weibull distribution, including the scale parameter ($\lambda$) and the shape parameter ($\beta$) of the distribution. The scale parameter determines the extent of the distribution, while the shape parameter determines its shape. For example, in a learning-time distribution scenario, the scale parameter reflects the typical learning time, and the shape parameter reflects how the learning times vary. The distribution parameter operation function calculates the overall mixed distribution by taking into account the weight ($\pi_k$) of each mixed component together with the scale and shape parameters of each component. Such a calculation considers not only the characteristics of each sub-component but also its relative importance in the overall data.
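As a non-limiting sketch of how weighted Weibull components combine into one mixed distribution, the following Python example assumes the standard two-parameter Weibull density $f(t) = (\beta/\lambda)(t/\lambda)^{\beta-1}\exp(-(t/\lambda)^{\beta})$ for each component and sums the components with their weights $\pi_k$. This standard form, the function names, and the example values in `components` are assumptions made for illustration; they are not the distribution parameter operation function of the present application.

```python
import math

def weibull_pdf(t, scale, shape):
    """Two-parameter Weibull density; returns 0 for t <= 0."""
    if t <= 0:
        return 0.0
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))

def mixed_weibull_pdf(t, components):
    """Density of a mixture; components is a list of (pi_k, lambda_k, beta_k)."""
    return sum(pi * weibull_pdf(t, lam, beta) for pi, lam, beta in components)

def mixed_weibull_cdf(t, components):
    """Probability that the behavior time is at most t."""
    if t <= 0:
        return 0.0
    return sum(pi * (1.0 - math.exp(-((t / lam) ** beta)))
               for pi, lam, beta in components)

def mixed_weibull_mean(components):
    """Mean of the mixture, using E[T_k] = lambda_k * Gamma(1 + 1/beta_k)."""
    return sum(pi * lam * math.gamma(1.0 + 1.0 / beta)
               for pi, lam, beta in components)

# Hypothetical mixture: 60% of users finish quickly (scale 2, shape 1.5),
# 40% spread their learning over a longer period (scale 20, shape 3).
components = [(0.6, 2.0, 1.5), (0.4, 20.0, 3.0)]

share_done_by_10 = mixed_weibull_cdf(10.0, components)
mean_time = mixed_weibull_mean(components)
```

The weights must sum to 1 for the mixture to be a valid probability distribution; in practice they and the per-component parameters would be estimated from the data, for example with an EM procedure.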

In a specific embodiment, the process of executing step 104 may specifically include the following steps:

(1) Respectively constructing initial characteristic data of each target user group according to the distribution scale parameters and the distribution shape parameters;

(2) Normalizing the initial characteristic data to obtain target characteristic data;

(3) Carrying out feature clustering on the target feature data through a preset feature extraction function to obtain a target variable set of each target user group, wherein the feature extraction function is as follows:

wherein $V_{group}$ is the target variable set, $v_{group}$ represents a variable, $u_{ij}$ represents the membership degree of the $i$-th data point to the $j$-th group in the target feature data, $p$ represents the weighting index in fuzzy clustering, and $x_{ij}$ represents the $j$-th feature value of the $i$-th data point in the target feature data.

Specifically, first, initial feature data of each target user group are respectively constructed according to the distribution scale parameters and the distribution shape parameters. These parameters are obtained from the preceding mixed Weibull distribution model and reflect the characteristics of the different user groups in the time distribution of their learning behavior. Next, the initial feature data are normalized. The purpose of normalization is to eliminate the influence of differing dimensions among features, so that the data are more uniform and standardized and the subsequent analysis is easier. For example, if the scale parameter ranges from 0 to 100 and the shape parameter ranges from 0 to 10, directly comparing the two parameters would not be meaningful. Through normalization, these parameters can be converted to the same scale, e.g., both in the range of 0 to 1, so that each feature contributes equally to the subsequent analysis. Then, feature clustering is performed on the target feature data using a preset feature extraction function. The purpose of feature clustering is to place users with similar features in the same group. The feature extraction function used herein is based on a fuzzy clustering algorithm. Unlike conventional hard clustering algorithms (e.g., K-means), fuzzy clustering allows one data point to belong to multiple clusters with different membership degrees. In this function, the membership degree of each data point to each group is calculated based on its distance from the center of the group. For example, if a user is very close to the center of the "frequent learner" group in terms of learning time, but is closer to the center of the "diverse learner" group in terms of learning content diversity, then the user belongs to both groups at the same time, with different membership degrees. 
The advantage of this approach is that it provides a more flexible and detailed way to understand and characterize the user behavior, which means that the learning patterns and preferences of the user can be more accurately identified. For example, by this method, the server may find that some users, although not frequently logged into the learning platform, concentrate on learning for a longer period of time each time they log in, which behavior pattern is quite different from other users who are more frequent but have shorter learning times each time.
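The normalization and fuzzy clustering described above can be illustrated, without limitation, by the following Python sketch. It assumes min-max normalization and the standard fuzzy c-means updates, in which memberships are computed from distances to the group centers and each center is the membership-weighted mean of the data points with weighting index `p`. The function names, the sample rows of (scale, shape) values, and the initial centers are hypothetical and are not taken from the present application.

```python
import math

def min_max_normalize(rows):
    """Rescale every feature column to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def memberships(points, centers, p=2.0):
    """Fuzzy membership of every point to every center (rows sum to 1)."""
    exponent = 2.0 / (p - 1.0)
    result = []
    for x in points:
        d = [math.dist(x, c) for c in centers]
        if any(di == 0.0 for di in d):
            # A point lying exactly on a center belongs fully to it.
            hits = [di == 0.0 for di in d]
            result.append([1.0 / sum(hits) if hit else 0.0 for hit in hits])
            continue
        result.append([1.0 / sum((d[j] / d[k]) ** exponent
                                 for k in range(len(centers)))
                       for j in range(len(centers))])
    return result

def update_centers(points, u, p=2.0):
    """Each center is the mean of the points weighted by membership ** p."""
    centers = []
    for j in range(len(u[0])):
        wts = [row[j] ** p for row in u]
        total = sum(wts)
        centers.append([sum(w * x[f] for w, x in zip(wts, points)) / total
                        for f in range(len(points[0]))])
    return centers

def fuzzy_c_means(points, centers, p=2.0, n_iter=50):
    for _ in range(n_iter):
        u = memberships(points, centers, p)
        centers = update_centers(points, u, p)
    return centers, memberships(points, centers, p)

# Hypothetical (scale, shape) features of six users.
raw = [[10, 1.0], [12, 1.2], [11, 0.9], [80, 4.0], [90, 4.5], [85, 5.0]]
normalized = min_max_normalize(raw)
centers, u = fuzzy_c_means(normalized, centers=[[0.0, 0.0], [1.0, 1.0]])
```

Each row of `u` gives one user's membership degrees in the two groups, so a user between the groups would receive two intermediate values rather than a single hard label.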

In a specific embodiment, the process of executing step 105 may specifically include the following steps:

(1) Inputting the target variable sets of each target user group into a preset initial Bayesian network, and respectively calculating the target conditional probability of each target variable set through a probability inference function in the initial Bayesian network, wherein the probability inference function is as follows:

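$$P\left(V_{group}\right) = \prod_{v_{group} \in V_{group}} P\left(v_{group} \mid Parents\left(v_{group}\right)\right)$$

wherein $v_{group}$ represents a variable, $V_{group}$ is the target variable set, $Parents(v_{group})$ represents the parent node set of each variable in the target variable set, $P(v_{group} \mid Parents(v_{group}))$ represents the conditional probability of the variable $v_{group}$ given the parent node values, and $P(V_{group})$ represents the target conditional probability;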

(2) According to the target conditional probability, probability inference is respectively carried out on each target user group, and a probability inference result of each target user group is obtained;

(3) And respectively carrying out influence factor identification on each target user group according to the probability inference result to obtain target influence factors of each target user group.

Specifically, first, the target variable set of each target user group is input into a preset initial Bayesian network. A Bayesian network is a probabilistic graphical model used to represent the dependencies between variables and to make complex probabilistic inferences. Such a network represents variables by nodes, and the directed edges between nodes represent the probabilistic relationships between the variables. In this process, the structure of each variable set, i.e., which variables are parent nodes of which other variables, is determined. Next, the target conditional probability of each target variable set is calculated through the probability inference function in the Bayesian network. The probability inference function used herein is based on Bayes' theorem and calculates the conditional probability of each variable given the values of its parent nodes. Further, probability inference is performed for each target user group based on the calculated target conditional probabilities; that is, the Bayesian network is queried to predict or infer the behavior exhibited by a user group. Finally, influence factor identification can be performed for each target user group according to the probability inference results. The objective is to extract, from the analysis of the Bayesian network, the factors that most influence each user group. For example, the server may find that for user groups seeking professional development, learning time is affected more by work pressure and available time, while for user groups driven mainly by interest, learning time is affected more by personal interests and learning resources.
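The calculation of a joint probability from per-variable conditional probabilities, and a query against the network, can be sketched as follows in Python. This is a minimal illustration that assumes a hypothetical three-variable network (`difficulty`, `interaction`, `outcome`) with made-up conditional probability tables and uses inference by enumeration; none of these names or values come from the present application.

```python
from itertools import product

# Hypothetical structure: each variable maps to its list of parent variables.
parents = {
    "difficulty": [],
    "interaction": ["difficulty"],
    "outcome": ["difficulty", "interaction"],
}
states = {
    "difficulty": ["high", "low"],
    "interaction": ["frequent", "rare"],
    "outcome": ["pass", "fail"],
}
# Conditional probability tables keyed by the tuple of parent values.
cpt = {
    "difficulty": {(): {"high": 0.3, "low": 0.7}},
    "interaction": {
        ("high",): {"frequent": 0.4, "rare": 0.6},
        ("low",): {"frequent": 0.7, "rare": 0.3},
    },
    "outcome": {
        ("high", "frequent"): {"pass": 0.6, "fail": 0.4},
        ("high", "rare"): {"pass": 0.3, "fail": 0.7},
        ("low", "frequent"): {"pass": 0.9, "fail": 0.1},
        ("low", "rare"): {"pass": 0.7, "fail": 0.3},
    },
}

def joint_probability(assignment):
    """Product over all variables of P(variable | its parents)."""
    prob = 1.0
    for var, pars in parents.items():
        key = tuple(assignment[p] for p in pars)
        prob *= cpt[var][key][assignment[var]]
    return prob

def conditional(query, evidence):
    """P(query | evidence), obtained by summing the joint over all states."""
    variables = list(parents)
    numerator = denominator = 0.0
    for values in product(*(states[v] for v in variables)):
        a = dict(zip(variables, values))
        if any(a[k] != v for k, v in evidence.items()):
            continue
        p = joint_probability(a)
        denominator += p
        if all(a[k] == v for k, v in query.items()):
            numerator += p
    return numerator / denominator

p_frequent = conditional({"outcome": "pass"}, {"interaction": "frequent"})
p_rare = conditional({"outcome": "pass"}, {"interaction": "rare"})
```

Comparing `p_frequent` with `p_rare` is one simple way to gauge how strongly a candidate factor influences an outcome; enumeration is practical only for small networks, and larger networks would call for a dedicated inference algorithm.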

In a specific embodiment, the process of executing step 106 may specifically include the following steps:

(1) Generating model parameter particles for the initial Bayesian network according to target influence factors of each target user group by a preset particle swarm optimization algorithm to obtain a plurality of initial network parameter particles;

(2) Initializing a plurality of initial network parameter particles to obtain the particle position and the particle speed of each initial network parameter particle;

(3) Updating the particle positions and particle speeds of the plurality of initial network parameter particles in a searching direction through a preset particle updating function to obtain an optimal network parameter combination, wherein the particle updating function comprises:

$$v_i^{t+1} = w\,v_i^{t} + c_1 r_1 \left(pbest_i - x_i^{t}\right) + c_2 r_2 \left(gbest_i - x_i^{t}\right);$$

$$x_i^{t+1} = x_i^{t} + v_i^{t+1};$$

wherein $v_i^{t+1}$ indicates the particle velocity of particle $i$ at time $t+1$, $v_i^{t}$ represents the particle velocity of particle $i$ at time $t$, $x_i^{t}$ and $x_i^{t+1}$ represent the particle position of particle $i$ at times $t$ and $t+1$, $w$ represents the inertial weight, $c_1$ and $c_2$ represent the acceleration constants, $r_1$ and $r_2$ represent random numbers, $pbest_i$ represents the historical best position of the initial network parameter particle, and $gbest_i$ represents the global optimum position;

(4) And updating the network nodes and the conditional probability table of the initial Bayesian network through the optimal network parameter combination to obtain the target Bayesian network.

Specifically, first, model parameter particles are generated for the initial Bayesian network through a preset particle swarm optimization algorithm according to the target influence factors of each target user group. Each particle represents one set of parameter settings of the Bayesian network. For example, when analyzing user behavior on an online learning platform, these parameters include various factors that affect users' learning duration and activity, such as course difficulty and the users' level of basic knowledge. The particles are placed randomly in the search space, and the position of each particle represents one combination of network parameters. Next, particle initialization is performed on the initial network parameter particles, which includes determining the position and velocity of each particle. The position of a particle represents the current parameter setting of the Bayesian network, while the velocity represents the direction and magnitude of the change in that setting. The initialization phase provides the starting point from which each particle begins searching. For example, if the initial position of a particle corresponds to network parameters associated with high user activity, the particle starts searching for a better parameter combination from that point. The position and velocity of the particles are then updated through a preset particle updating function. The update function in the particle swarm optimization algorithm is based mainly on three components: the current velocity of the particle (inertia), the best position the particle has found so far (the individual optimal solution), and the best position found by the whole particle swarm (the global optimal solution). The velocity update of a particle depends on the balance of these three components, which allows the particle to balance exploration (global search) against exploitation (local search). For example, if the individual best position of a particle corresponds to one particularly effective set of Bayesian network parameters and the global best position corresponds to another set, the next position of the particle is pulled toward both best positions according to their weights. By continuously updating the position and velocity of each particle, the particle swarm algorithm searches the whole parameter space for the optimal network parameter combination. In this process, each particle adjusts its search path based on its own experience and the experience of the swarm. As the iteration proceeds, the whole swarm typically approaches the optimal solution gradually. Finally, after the algorithm converges and finds the optimal network parameter combination, these parameters are used to update the network nodes and the conditional probability table of the initial Bayesian network, thereby obtaining the target Bayesian network. The optimized network reflects the behavior patterns of the target user groups more accurately. For example, in the context of an online learning platform, the final Bayesian network can reveal the complex relationships among course difficulty, user interaction frequency, and learning outcome.
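The velocity and position updates described above can be illustrated by the following minimal Python sketch of particle swarm optimization. It assumes that each particle position is a vector of probabilities in [0, 1], such as entries of a conditional probability table, and that a lower fitness value is better. The fitness function `example_fitness`, which measures the squared distance to a hypothetical target vector, merely stands in for a real score of how well the Bayesian network fits the data; it and all parameter values are assumptions made for illustration.

```python
import random

def pso(fitness, dim, n_particles=20, n_iter=100,
        w=0.7, c1=1.5, c2=1.5, bounds=(0.0, 1.0), seed=0):
    """Minimize fitness over a box; returns (best_position, best_value)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward own best + pull toward swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and keep it inside the allowed range.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def example_fitness(params):
    """Hypothetical stand-in: squared distance to target values (0.6, 0.9)."""
    target = (0.6, 0.9)
    return sum((p - t) ** 2 for p, t in zip(params, target))

best, best_val = pso(example_fitness, dim=2)
```

A larger inertial weight `w` favors exploration of the search space, while larger acceleration constants `c1` and `c2` pull particles more strongly toward known good positions; the values used here are common defaults rather than values prescribed by the present application.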

The keyword-based index data analysis method in the embodiments of the present application has been described above; the keyword-based index data analysis system in the embodiments of the present application is described below. Referring to FIG. 2, an embodiment of the keyword-based index data analysis system in the embodiments of the present application includes:

the acquisition module 201 is configured to acquire initial user learning index data of a plurality of target users based on a preset cloud platform, and perform keyword index analysis on the initial user learning index data to obtain target user learning index data;

the classification module 202 is configured to classify the plurality of target users according to the target user learning index data by using a preset EM algorithm, so as to obtain a plurality of target user groups;

the operation module 203 is configured to input the target user learning index data into a preset mixed Weibull distribution model to perform mixed Weibull distribution parameter operation, so as to obtain a distribution scale parameter and a distribution shape parameter;

a construction module 204, configured to construct a variable set of the distribution scale parameter and the distribution shape parameter according to the multiple target user groups, so as to obtain a target variable set of each target user group;

the analysis module 205 is configured to input a target variable set of each target user group into a preset initial Bayesian network to perform variable relationship analysis, so as to obtain a target influence factor of each target user group;

the updating module 206 is configured to perform model parameter optimization on the initial Bayesian network according to the target influencing factors of each target user group through a preset particle swarm optimization algorithm to obtain an optimal network parameter combination, and perform network parameter updating on the initial Bayesian network through the optimal network parameter combination to obtain a target Bayesian network.

Through the cooperation of the above components, user behaviors can be accurately identified and quantified by performing keyword index analysis on the user learning index data. This not only improves the accuracy of the data analysis but also makes the understanding of user behavior more thorough and specific. By applying the EM algorithm, users can be divided efficiently into different groups, each with its own behavioral characteristics. Such classification not only helps identify different types of users but can also be used to formulate targeted strategies, such as customized teaching methods. By using the mixed Weibull distribution model, complex patterns of user behavior can be modeled effectively, especially when behavior data with multiple influencing factors are processed. Such a model can reveal deeper features of user behavior, such as the duration and frequency of learning activities. By inputting the variable sets into the Bayesian network for analysis, the relationships and interactions between the variables can be identified comprehensively. Such analysis helps in understanding the various factors that affect user behavior and supports the development of more effective strategies. Optimizing the parameters of the Bayesian network with the particle swarm optimization algorithm improves the performance of the model. This optimization improves the accuracy and reliability of the model on complex data sets, so that the prediction and analysis results are more accurate and the analysis accuracy of the index data is further improved.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

The above embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims

1. The index data analysis method based on the keywords is characterized by comprising the following steps of:

2. The keyword-based index data analysis method of claim 1, wherein the acquiring initial user learning index data of a plurality of target users based on a preset cloud platform and performing keyword index analysis on the initial user learning index data to obtain target user learning index data comprises:

based on a preset cloud platform, carrying out learning index monitoring on a plurality of target users to obtain corresponding initial user learning index data;

Performing keyword frequency calculation on the initial user learning index data through a preset keyword analysis function to obtain keyword frequency data, wherein the keyword analysis function is as follows:

$$T_{t,d} = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \times \log\frac{N}{n_t}$$

wherein $t$ is a keyword, $d$ is the initial user learning index data, $T_{t,d}$ is the keyword frequency data of the keyword $t$ in the initial user learning index data $d$, $f_{t,d}$ is the occurrence frequency of the keyword $t$ in the initial user learning index data $d$, $\sum_{t' \in d} f_{t',d}$ is the total occurrence frequency of all keywords in the initial user learning index data $d$, $N$ is the total number of initial user learning index data in the initial user learning index data set, and $n_t$ is the number of initial user learning index data containing the keyword $t$;

performing association rule learning on the initial user learning index data according to the keyword frequency data to obtain a user behavior mode;

and extracting key word indexes from the initial user learning index data according to the user behavior mode to obtain target user learning index data.

3. The keyword-based index data analysis method of claim 1, wherein the classifying the plurality of target users according to the target user learning index data by a preset EM algorithm to obtain a plurality of target user groups comprises:

Carrying out category probability calculation on the plurality of target users through a preset EM algorithm to obtain the category probability of each target user;

according to the category probability, classifying the plurality of target users to obtain a plurality of initial user groups;

updating parameter calculation is carried out on the target user learning index data through a preset parameter updating function, updated parameters are obtained, and the parameter updating function is as follows:

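$$\theta^{(new)} = \arg\max_{\theta} \sum_{i} \sum_{z} P\left(z \mid x_i, \theta^{(old)}\right) \log P\left(x_i, z \mid \theta\right)$$

wherein $\theta^{(new)}$ represents the updated parameters, $\theta^{(old)}$ represents the parameters before the update, $x_i$ is the $i$-th observed data point in the target user learning index data, $z$ is a hidden variable, $P(z \mid x_i, \theta^{(old)})$ represents the posterior probability of the hidden variable $z$ given the observed data point $x_i$ under the pre-update parameters, and $P(x_i, z \mid \theta)$ represents the joint probability of the observed data point $x_i$ and the hidden variable $z$ under the updated parameters;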

and carrying out iterative updating on the updated parameters to obtain target updated parameters, and carrying out group optimization on the plurality of initial user groups according to the target updated parameters to obtain a plurality of target user groups.

4. The keyword-based index data analysis method of claim 1, wherein the step of inputting the target user learning index data into a preset mixed weibull distribution model to perform mixed weibull distribution parameter operation to obtain a distribution scale parameter and a distribution shape parameter includes:

Performing mixed component analysis on the target user learning index data to obtain a plurality of different mixed components;

modeling the user behavior time distribution of the target user learning index data through a preset mixed Weibull distribution model to obtain corresponding target mixed Weibull distribution;

carrying out mixed Weibull distribution parameter operation on the target mixed Weibull distribution according to the plurality of different mixed components through a preset distribution parameter operation function to obtain a distribution scale parameter and a distribution shape parameter, wherein the distribution parameter operation function is as follows:

$\lambda$ is the scale parameter, $\beta$ is the shape parameter, $\lambda_k$ represents the distribution scale parameters of the different mixed components, $\beta_k$ represents the distribution shape parameters of the different mixed components, $\pi_k$ is the weight of the $k$-th mixed component, $t_i$ is the time value of the $i$-th data point in the target user learning index data, $K$ is the number of mixed components, and $w$ is the mixed distribution coefficient.

5. The keyword-based index data analysis method of claim 1, wherein the performing variable set construction on the distribution scale parameter and the distribution shape parameter according to the plurality of target user groups to obtain a target variable set of each target user group includes:

Respectively constructing initial characteristic data of each target user group according to the distribution scale parameters and the distribution shape parameters;

normalizing the initial characteristic data to obtain target characteristic data;

performing feature clustering on the target feature data through a preset feature extraction function to obtain a target variable set of each target user group, wherein the feature extraction function is as follows:

wherein $V_{group}$ is the target variable set, $v_{group}$ represents a variable, $u_{ij}$ represents the membership degree of the $i$-th data point to the $j$-th group in the target feature data, $p$ represents the weighting index in fuzzy clustering, and $x_{ij}$ represents the $j$-th feature value of the $i$-th data point in the target feature data.

6. The keyword-based index data analysis method of claim 1, wherein the inputting the target variable set of each target user group into a preset initial bayesian network for variable relation analysis to obtain the target influencing factor of each target user group comprises:

inputting a target variable set of each target user group into a preset initial Bayesian network, and respectively calculating target conditional probability of each target variable set through a probability inference function in the initial Bayesian network, wherein the probability inference function is as follows:

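$$P\left(V_{group}\right) = \prod_{v_{group} \in V_{group}} P\left(v_{group} \mid Parents\left(v_{group}\right)\right)$$

wherein $v_{group}$ represents a variable, $V_{group}$ is the target variable set, $Parents(v_{group})$ represents the parent node set of each variable in the target variable set, $P(v_{group} \mid Parents(v_{group}))$ represents the conditional probability of the variable $v_{group}$ given the parent node values, and $P(V_{group})$ represents the target conditional probability;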

according to the target conditional probability, probability inference is respectively carried out on each target user group, and a probability inference result of each target user group is obtained;

and respectively carrying out influence factor identification on each target user group according to the probability inference result to obtain target influence factors of each target user group.

7. The keyword-based index data analysis method of claim 1, wherein the performing model parameter optimization on the initial bayesian network according to the target influencing factors of each target user group through a preset particle swarm optimization algorithm to obtain an optimal network parameter combination, and performing network parameter update on the initial bayesian network through the optimal network parameter combination to obtain a target bayesian network comprises:

generating model parameter particles of the initial Bayesian network according to target influence factors of each target user group by a preset particle swarm optimization algorithm to obtain a plurality of initial network parameter particles;

Carrying out particle initialization on the plurality of initial network parameter particles to obtain the particle position and the particle speed of each initial network parameter particle;

and updating the particle positions and the particle speeds of the plurality of initial network parameter particles in a searching direction through a preset particle updating function to obtain an optimal network parameter combination, wherein the particle updating function comprises the following steps:

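$$v_i^{t+1} = w\,v_i^{t} + c_1 r_1 \left(pbest_i - x_i^{t}\right) + c_2 r_2 \left(gbest_i - x_i^{t}\right);$$

$$x_i^{t+1} = x_i^{t} + v_i^{t+1};$$

wherein $v_i^{t+1}$ indicates the particle velocity of particle $i$ at time $t+1$, $v_i^{t}$ represents the particle velocity of particle $i$ at time $t$, $x_i^{t}$ and $x_i^{t+1}$ represent the particle position of particle $i$ at times $t$ and $t+1$, $w$ represents the inertial weight, $c_1$ and $c_2$ represent the acceleration constants, $r_1$ and $r_2$ represent random numbers, $pbest_i$ represents the historical best position of the initial network parameter particle, and $gbest_i$ represents the global optimum position;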

and updating the network nodes and the conditional probability table of the initial Bayesian network through the optimal network parameter combination to obtain a target Bayesian network.

8. A keyword-based index data analysis system, the keyword-based index data analysis system comprising: