CN117349512B

CN117349512B - User tag classification method and system based on big data

Info

Publication number: CN117349512B
Application number: CN202311136004.8A
Authority: CN
Inventors: 朱峻修; 林景
Original assignee: Guangzhou Interest Island Information Technology Co ltd
Current assignee: Guangzhou Interest Island Information Technology Co ltd
Priority date: 2023-09-04
Filing date: 2023-09-04
Publication date: 2024-03-12
Anticipated expiration: 2043-09-04
Also published as: CN117349512A

Abstract

The invention discloses a user tag classification method and system based on big data. The method comprises: obtaining behavior feature data of a user and extracting keywords from the behavior feature data through a preset keyword extraction algorithm to generate a plurality of keywords; performing association mining on the plurality of keywords according to a preset association analysis algorithm to obtain corresponding association rules; performing node characterization aggregation on the association rules and the plurality of keywords according to a preset graph neural network to obtain association data; extracting spatial features and time features of the association data according to a preset deep neural network and fusing them to obtain space-time features corresponding to the association data; and performing tag classification on the space-time features through a preset tag classification model to obtain the tags corresponding to the user, thereby improving the efficiency and accuracy of user tag classification.

Description

User tag classification method and system based on big data

Technical Field

The invention relates to the technical field of data processing, in particular to a user tag classification method and system based on big data.

Background

With the rapid development of information technology, enterprise online learning platforms have become an important channel for education and knowledge sharing. User behavior data are one of the main bases for guiding the operation of an education platform, and how to analyze these data effectively is a main problem faced by platform operators. To maintain user stickiness, a platform often needs to classify different user groups and recommend different courses according to different user needs, so as to provide better course pushing services.

In order to accurately classify users, big data analysis is required to be performed based on basic information, search data, behavior characteristics and the like of the users, interest keywords of the users are generated according to the big data analysis result, interest tags of the users are generated according to the classification of the interest keywords of the users, and therefore course pushing is performed on the users according to the interest tags.

Existing techniques for classifying users' interest tags mainly rely on manual classification, keyword extraction, or a simple classification model. Manual classification is too costly and inefficient because the volume of user information is extremely large, and it does not suit the development trend of the big data era. Keyword extraction ignores the relevance among keywords, which reduces the accuracy of interest classification. For example, if a user wants a travel guide and searches for the keywords "delicacies" and "scenic spots", a technique based purely on keyword extraction pushes cooking courses or geographical magazines; because the association between delicacies and scenic spots is not considered, the interest classification is wrong. A simple classification model often discards part of the user's interest information during tag classification, so that information cannot be fully utilized, resource utilization is low, the user's interests cannot be accurately analyzed, and the generated tags are inaccurate.

Disclosure of Invention

In order to solve the technical problems, the invention discloses a user tag classification method and a system based on big data, which improve the efficiency and the accuracy of user tag classification.

In order to achieve the above object, the present invention discloses a user tag classification method based on big data, comprising:

acquiring behavior feature data of a user, extracting keywords from the behavior feature data through a preset keyword extraction algorithm, and generating a plurality of keywords corresponding to the user;

performing association mining on a plurality of keywords corresponding to the user according to a preset association analysis algorithm to obtain association rules corresponding to the keywords;

performing node characterization aggregation on the association rule and the plurality of keywords according to a preset graph neural network to obtain association data corresponding to the plurality of keywords;

extracting spatial features and time features of the associated data according to a preset deep neural network, and fusing the spatial features and the time features corresponding to the associated data to obtain the space-time features corresponding to the associated data;

and carrying out label classification according to the space-time characteristics through a preset label classification model to obtain labels corresponding to the users.

The invention discloses a user tag classification method based on big data. Behavior feature data of a user are obtained, and keywords are extracted from them through a preset keyword extraction algorithm to obtain a plurality of keywords corresponding to the user; extracting keywords reduces the amount of data to be processed and improves processing efficiency. After the keywords are obtained, association mining is performed on them through a preset association analysis algorithm, which takes the association relations among the keywords into account and generates the corresponding association rules. Node characterization aggregation is then performed on the association rules and the keywords through a preset graph neural network to generate association data; because the association data contain both the keywords and the association relations among them, they better reflect the behavior features of the user and improve the accuracy of user classification. After the association data are obtained, the time features and spatial features corresponding to the association data are extracted by a preset deep neural network, so that both the order in which the user searched for the keywords and the layout features of the keywords are considered, and the tags of the user are generated from the space-time features by a preset tag classification model. In the invention, the preset keyword extraction algorithm reduces the amount of data to be processed and improves the efficiency of tag classification; the preset association analysis algorithm and the graph neural network deeply extract the association relations among the keywords and provide accurate association data for the subsequent space-time feature analysis; and the deep neural network then extracts the space-time features corresponding to the association data, so that tag classification is performed according to the space-time features and the accuracy of tag classification is improved.

As a preferred example, the acquiring the behavior characteristic data of the user includes:

acquiring a user ID according to the login state of the user, and retrieving from a preset database according to the user ID to acquire behavior characteristic data corresponding to the user; the behavior characteristic data comprise user attribute information, a user behavior sequence and user operation; the user operation includes a login operation, a transaction operation, and a browsing operation.

The invention obtains the user ID by utilizing the current login state of the user, can know the actual identity of the current user according to the user ID, and further obtains the behavior characteristic data corresponding to the current user by continuously searching from the preset database according to the user ID, so that the user is classified according to the previous behavior characteristic data of the user, and the classification accuracy is improved.

As a preferred example, the generating a plurality of keywords corresponding to the user by extracting keywords from the behavioral characteristic data through a preset keyword extraction algorithm includes:

performing part-of-speech tagging on the behavior feature data through a preset part-of-speech tagging algorithm, and extracting candidate words from the behavior feature data by combining with a preset word rule to obtain a candidate word set corresponding to the behavior feature data;

and extracting keywords from the behavior feature data through a preset MDERank algorithm according to the candidate word set to obtain a plurality of keywords corresponding to the behavior feature data.

Because the invention extracts keywords from the behavior feature data with the MDERank algorithm, once the behavior feature data are obtained they are first part-of-speech tagged by the preset part-of-speech tagging algorithm, and candidate words are then extracted from them according to the preset word rule.

As a preferred example, performing association mining on a plurality of keywords corresponding to the user according to a preset association analysis algorithm to obtain association rules corresponding to the plurality of keywords, where the association rules include:

scanning each keyword in the plurality of keywords, calculating the support degree and the confidence degree of each keyword, comparing the support degree and the confidence degree of each keyword, deleting the keywords with the support degree smaller than the confidence degree, and obtaining frequent item sets corresponding to the plurality of keywords;

according to the frequent item set, traversing the grid diagram corresponding to the frequent item set from top to bottom and from bottom to top through a preset association analysis algorithm, and continuously reducing the search space during the traversal to obtain the association rules corresponding to the keywords.

According to the invention, the keywords are scanned to obtain the support and confidence corresponding to each keyword, and keywords with low relevance to the other keywords are deleted on that basis, which improves data processing efficiency and therefore tag classification efficiency. After the frequent item set is generated, the preset association analysis algorithm traverses the corresponding grid diagram from both ends simultaneously, mines the association relations among different keywords, and generates the association rules corresponding to the keywords, which further improves the accuracy of user tag classification.

As a preferred example, performing node characterization aggregation on the association rule and the plurality of keywords according to the preset graph neural network to obtain association data corresponding to the plurality of keywords, where the node characterization aggregation includes:

determining a primary aggregation vector corresponding to each keyword in the plurality of keywords according to the attribute information of the user and the primary aggregation layer of the graph neural network;

according to the primary aggregation vector, sequentially performing k-level vector aggregation on the association rule and the keywords through k aggregation layers preset in the graph neural network, obtaining a previous-level aggregation vector corresponding to each keyword in the keywords and a previous-level aggregation vector corresponding to each neighbor keyword of the keyword, and determining a weight value of each neighbor keyword relative to the keyword;

determining the current level aggregate vector of each keyword according to the previous level aggregate vector of the keyword, the previous level aggregate vector of each neighbor keyword and the weight value relative to the keyword;

and carrying out node characterization aggregation on the keywords according to the current level aggregation vector corresponding to each keyword and k aggregation layers in the graph neural network to obtain associated data corresponding to the keywords.

According to the invention, the association rules and the keywords are associated and matched through the graph neural network, and each of the k aggregation layers preset in the graph neural network mines the degree of association between each keyword and its surrounding keywords. The keywords and the association rules are then aggregated according to this degree of association, so that the association among the keywords is determined in depth and the accuracy of tag classification is improved.
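The layer-by-layer aggregation described above can be illustrated with a minimal sketch. It assumes that each keyword already has a primary vector and that each neighbor keyword carries a scalar weight (for example, one derived from an association rule); the update rule used here, a weighted average of a keyword's previous-level vector and its neighbors' previous-level vectors, and the names `aggregate_layer` and `aggregate` are illustrative assumptions, not the specific network of the invention.

```python
from typing import Dict, List, Tuple

Vec = List[float]
# Hypothetical adjacency: keyword -> list of (neighbor keyword, weight).
Neighbors = Dict[str, List[Tuple[str, float]]]

def aggregate_layer(h: Dict[str, Vec], neighbors: Neighbors) -> Dict[str, Vec]:
    """One aggregation layer: weighted average of a keyword's previous-level
    vector and the previous-level vectors of its neighbor keywords."""
    out: Dict[str, Vec] = {}
    for node, vec in h.items():
        acc = list(vec)   # the keyword's own previous-level vector (weight 1)
        total = 1.0
        for nb, w in neighbors.get(node, []):
            acc = [a + w * x for a, x in zip(acc, h[nb])]
            total += w
        out[node] = [a / total for a in acc]
    return out

def aggregate(h0: Dict[str, Vec], neighbors: Neighbors, k: int) -> Dict[str, Vec]:
    """Apply k aggregation layers starting from the primary vectors h0."""
    h = h0
    for _ in range(k):
        h = aggregate_layer(h, neighbors)
    return h

h0 = {"food": [1.0, 0.0], "scenic": [0.0, 1.0]}
neighbors = {"food": [("scenic", 1.0)], "scenic": [("food", 1.0)]}
print(aggregate(h0, neighbors, k=1))
```

With equal weights, one layer moves both keywords to the midpoint of their primary vectors, which is the intended effect of mixing in neighbor information.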

As a preferred example, extracting the spatial feature and the temporal feature of the associated data according to the preset deep neural network includes:

performing space separation and time separation on the associated data through a first convolution layer preset in the deep neural network model to obtain space information data and time information data corresponding to the associated data;

and respectively carrying out convolution processing on the space information data and the time information data through a second convolution layer preset in the deep neural network model to obtain space characteristics corresponding to the space information data and time characteristics corresponding to the time information data.

To improve the accuracy of tag classification, the invention separates the associated data into spatial information data and time information data through the preset deep neural network, so that space-time features can subsequently be extracted. After the spatial information data and time information data are obtained, features are extracted from them by the second convolution layer of the deep neural network. This further improves the accuracy of keyword feature extraction, so that the extracted features better match the user's actual operations and the accuracy of tag classification is improved.

As a preferred example, the fusing the spatial feature and the temporal feature corresponding to the associated data to obtain the space-time features corresponding to the associated data includes:

performing dimension lifting processing and feature fusion on the spatial features and the time features according to a fully connected layer preset in the deep neural network to obtain initial space-time features corresponding to the associated data;

and carrying out average processing on the initial space-time characteristics according to a preset average function to obtain space-time characteristics corresponding to the associated data.

According to the invention, the spatial features and the time features are subjected to dimension lifting and feature fusion through the fully connected layer of the deep neural network, so that the space-time features of the associated data are better expressed. The initial space-time features are then averaged by an average function so that the resulting features are more general, which improves the efficiency and accuracy of feature extraction.
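One possible reading of this fusion step is sketched below. It assumes that, for each item of associated data, the spatial and time feature vectors are concatenated and passed through a single fully connected layer that raises the dimension, and that the "average function" is a mean over items. The weights, the bias, and the names `dense` and `fuse` are hypothetical and serve only as an illustration.

```python
from typing import List

Vec = List[float]

def dense(x: Vec, weights: List[Vec], bias: Vec) -> Vec:
    """Fully connected layer without activation: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def fuse(spatial: List[Vec], temporal: List[Vec],
         weights: List[Vec], bias: Vec) -> Vec:
    """Concatenate spatial and time features per item, lift the dimension
    with a dense layer, then average the initial space-time features."""
    initial = [dense(s + t, weights, bias) for s, t in zip(spatial, temporal)]
    n = len(initial)
    return [sum(col) / n for col in zip(*initial)]

# Two items, each with 1 spatial and 1 time feature, lifted from 2 to 3 dimensions.
spatial = [[1.0], [3.0]]
temporal = [[2.0], [4.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 3x2 weight matrix
bias = [0.0, 0.0, 0.0]
print(fuse(spatial, temporal, weights, bias))
```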

As a preferred example, performing label classification according to the space-time feature by using a preset label classification model to obtain a label corresponding to the user, where the method includes:

inputting the space-time characteristics into a preset label classification model, and respectively carrying out label prediction on the space-time characteristics through a plurality of different decision trees preset in the label classification model to obtain a plurality of different first labels;

and processing the plurality of different first labels through a preset regression calculation function or classification function to obtain labels corresponding to the users.

According to the invention, the different decision trees preset in the label classification model are used to predict labels from the space-time features, yielding a plurality of different first labels; regression or classification processing is then performed on the first labels to obtain the labels corresponding to the users, which improves the accuracy of label classification.
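The combination of several decision trees can be sketched as a simple vote. The example below assumes that the classification function is a majority vote over the first labels; the three stand-in "trees" are hypothetical threshold rules rather than trained models, and the name `classify` is illustrative.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A "tree" here is any callable mapping a feature vector to a first label.
Tree = Callable[[Sequence[float]], str]

def classify(features: Sequence[float], trees: List[Tree]) -> str:
    """Collect one first label per tree and return the most frequent one."""
    first_labels = [tree(features) for tree in trees]
    return Counter(first_labels).most_common(1)[0][0]

# Hypothetical stand-ins for trained decision trees.
trees: List[Tree] = [
    lambda f: "travel" if f[0] > 0.5 else "cooking",
    lambda f: "travel" if f[1] > 0.5 else "geography",
    lambda f: "cooking",
]
print(classify([0.9, 0.8], trees))
```

In practice the trees would be trained on labeled space-time features, for example as a random forest; the vote shown here is only the final aggregation step.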

On the other hand, the invention discloses a user tag classification system based on big data, which comprises a keyword extraction module, an association module, an aggregation module, a feature extraction module and a tag classification module;

the keyword extraction module is used for acquiring behavior feature data of a user, extracting keywords from the behavior feature data through a preset keyword extraction algorithm, and generating a plurality of keywords corresponding to the user;

the association module is used for carrying out association mining on a plurality of keywords corresponding to the user according to a preset association analysis algorithm to obtain association rules corresponding to the keywords;

the aggregation module is used for carrying out node characterization aggregation on the association rule and the keywords according to a preset graph neural network to obtain association data corresponding to the keywords;

the feature extraction module is used for extracting spatial features and time features of the associated data according to a preset deep neural network, and fusing the spatial features and the time features corresponding to the associated data to obtain the space-time features corresponding to the associated data;

and the label classification module is used for carrying out label classification through a preset label classification model according to the space-time characteristics to obtain labels corresponding to the users.

The invention discloses a user tag classification system based on big data. Behavior feature data of a user are obtained, and keywords are extracted from them through a preset keyword extraction algorithm to obtain a plurality of keywords corresponding to the user; extracting keywords reduces the amount of data to be processed and improves processing efficiency. After the keywords are obtained, association mining is performed on them through a preset association analysis algorithm, which takes the association relations among the keywords into account and generates the corresponding association rules. Node characterization aggregation is then performed on the association rules and the keywords through a preset graph neural network to generate association data; because the association data contain both the keywords and the association relations among them, they better reflect the behavior features of the user and improve the accuracy of user classification. After the association data are obtained, the time features and spatial features corresponding to the association data are extracted by a preset deep neural network, so that both the order in which the user searched for the keywords and the layout features of the keywords are considered, and the tags of the user are generated from the space-time features by a preset tag classification model. In the invention, the preset keyword extraction algorithm reduces the amount of data to be processed and improves the efficiency of tag classification; the preset association analysis algorithm and the graph neural network deeply extract the association relations among the keywords and provide accurate association data for the subsequent space-time feature analysis; and the deep neural network then extracts the space-time features corresponding to the association data, so that tag classification is performed according to the space-time features and the accuracy of tag classification is improved.

As a preferable example, the keyword extraction module includes a data acquisition unit, a part-of-speech tagging unit, and an extraction unit;

the data acquisition unit is used for acquiring a user ID according to the login state of the user, and searching from a preset database according to the user ID to acquire behavior characteristic data corresponding to the user; the behavior characteristic data comprise user attribute information, a user behavior sequence and user operation; the user operation comprises login operation, transaction operation and browsing operation;

the part-of-speech tagging unit is used for performing part-of-speech tagging on the behavior feature data through a preset part-of-speech tagging algorithm, and extracting candidate words from the behavior feature data by combining with a preset word rule to obtain a candidate word set corresponding to the behavior feature data;

the extraction unit is used for extracting keywords from the behavior feature data through a preset MDERank algorithm according to the candidate word set, and obtaining a plurality of keywords corresponding to the behavior feature data.

According to the invention, the user ID is obtained from the current login state of the user, so the actual identity of the current user is known. The behavior feature data corresponding to the current user are then retrieved from a preset database according to the user ID, so that the user is classified according to his or her previous behavior feature data and the accuracy of classification is improved. Because keywords are extracted from the behavior feature data with the MDERank algorithm, once the behavior feature data are obtained they are part-of-speech tagged by the preset part-of-speech tagging algorithm, and candidate words are extracted from them according to the preset word rule.

Drawings

Fig. 1 is a schematic flow diagram of the user tag classification method based on big data provided by an embodiment of the invention;

Fig. 2 is a schematic structural diagram of the user tag classification system based on big data provided by an embodiment of the invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

The embodiment of the invention discloses a user tag classification method based on big data, and the specific implementation flow of the classification method is shown in fig. 1, and mainly comprises steps 101 to 105, wherein the steps comprise:

Step 101: acquiring behavior feature data of the user, extracting keywords from the behavior feature data through a preset keyword extraction algorithm, and generating a plurality of keywords corresponding to the user.

In this embodiment, this step mainly includes: acquiring a user ID according to the login state of the user, and retrieving from a preset database according to the user ID to acquire behavior characteristic data corresponding to the user, where the behavior characteristic data comprise user attribute information, a user behavior sequence and user operation, and the user operation comprises login operation, transaction operation and browsing operation; performing part-of-speech tagging on the behavior feature data through a preset part-of-speech tagging algorithm, and extracting candidate words from the behavior feature data by combining with a preset word rule to obtain a candidate word set corresponding to the behavior feature data; and extracting keywords from the behavior feature data through a preset MDERank algorithm according to the candidate word set to obtain a plurality of keywords corresponding to the behavior feature data.

In this embodiment, a user logs in to an application platform by entering an account number and password on the platform's login page. After the account number and password are obtained, data retrieval is performed in a preset database according to the user's account ID to obtain the user's behavior feature data on the application platform, including user attribute information, a user behavior sequence, and user operations. The user attribute information may include the category of the user, for example whether the user is a student, a child, or a practitioner in another industry. The user behavior sequence may include behavior operations that indicate the user's preferences, such as favoriting, liking, or blocking. The user operations include login operations, transaction operations, and browsing operations. A basic picture of the user's needs can thus be obtained from the behavior feature data.

After the basic requirement is obtained, a preset keyword extraction algorithm may be used to extract keywords from the behavior feature data in order to improve data processing efficiency, and hence the efficiency of user tag classification, so that tag classification can later be performed according to the keywords. In this embodiment, the keyword extraction algorithm may be any algorithm capable of keyword extraction, such as the TF-IDF or TextRank keyword extraction algorithm. To ensure the accuracy of the extracted keywords, this embodiment preferably uses the MDERank algorithm, which focuses on context association and therefore better reflects the behavior features of the user. Before keywords can be extracted with the MDERank algorithm, the behavior feature data must be part-of-speech tagged and candidate words must be generated from them based on preset word rules. In this embodiment, the behavior feature data are tagged using jieba part-of-speech tagging, and the candidate words corresponding to the behavior feature data are generated according to a preset word combination rule, which includes a verb part-of-speech rule.
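The candidate-word step can be illustrated with a minimal sketch. It assumes the text has already been segmented and part-of-speech tagged (for example by a tagger such as jieba) and is supplied as (word, tag) pairs. The tag set, the rule "keep maximal runs of adjective and noun tokens", and the name `extract_candidates` are hypothetical and are not the preset word rule of the embodiment.

```python
from typing import Iterable, List, Tuple

# Hypothetical word rule: a candidate is a maximal run of tokens whose
# tag is in ALLOWED_TAGS ("n" = noun, "a" = adjective are assumed tags).
ALLOWED_TAGS = {"n", "a"}

def extract_candidates(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Build candidate phrases from consecutive tokens with allowed tags."""
    candidates: List[str] = []
    current: List[str] = []
    for word, tag in tagged:
        if tag in ALLOWED_TAGS:
            current.append(word)
        else:
            if current:
                candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    # Remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(candidates))

tagged_text = [("user", "n"), ("searched", "v"), ("local", "a"), ("food", "n"),
               ("and", "c"), ("scenic", "a"), ("spots", "n")]
print(extract_candidates(tagged_text))
```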

The behavior feature data are then traversed, and every occurrence of each candidate word in the behavior feature data is replaced with masks. Note that after tokenization a candidate word may consist of several tokens; every token in the span of the replaced candidate word is masked, which yields the masked text d_M^{c_i} corresponding to each candidate word c_i.

The cosine similarity used for ranking is defined as the similarity score f(c_i). The text is encoded with BERT, and max pooling is used to obtain the text representation E(d) and the masked text representation E(d_M^{c_i}), from which f(c_i) is calculated. By the definition of MDERank, the higher f(c_i) is, the lower c_i ranks: the more information the masked text loses, the more important the masked candidate word. This is the opposite of the PD method, in which the ranking of a candidate word is positively correlated with f(c_i).
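The masking-and-ranking idea can be sketched in a few lines. To keep the example self-contained, the BERT encoder with max pooling is replaced by a toy bag-of-words embedding (`toy_embed`), which is an assumption made purely for illustration; the function names and the `[MASK]` token string are likewise hypothetical. Only the ranking logic (lower similarity after masking means a more important candidate) follows the description above.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]

def toy_embed(text: str) -> Vector:
    """Stand-in for the BERT + max-pooling encoder: word-count vector."""
    return dict(Counter(text.lower().split()))

def cosine(u: Vector, v: Vector) -> float:
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def mderank(doc: str, candidates: List[str],
            embed: Callable[[str], Vector] = toy_embed,
            mask: str = "[MASK]") -> List[str]:
    """Rank candidates by ascending similarity between the original text
    and the text with the candidate masked (most important first)."""
    e_doc = embed(doc)
    scores = {}
    for c in candidates:
        # Mask every token of the candidate at every occurrence.
        # Simplification: plain substring replacement, no tokenizer.
        masked = doc.replace(c, " ".join([mask] * len(c.split())))
        scores[c] = cosine(e_doc, embed(masked))
    return sorted(candidates, key=lambda c: scores[c])

doc = "travel guide travel food travel scenery"
print(mderank(doc, ["food", "travel"]))
```

Masking "travel" removes three of the six words, so the masked text drifts further from the original than when "food" is masked, and "travel" ranks first.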

This embodiment further provides a self-supervised learning method that yields a higher-quality representation and thereby improves the ability of MDERank to rank candidate words. The text in which a pseudo keyword is masked is defined as a positive example (pseudo keywords are generated by an existing unsupervised keyword extraction method), the text in which a pseudo non-keyword is masked is defined as a negative example, and the original text is used as the anchor. A triplet loss is then applied to the distance between the positive example and the original text and the distance between the negative example and the original text, as follows:

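$$l_{CL} = \max\left(\mathrm{sim}(H_d, H_{d^+}) - \mathrm{sim}(H_d, H_{d^-}) + m,\ 0\right)$$

where l_CL is the loss computed on the processed behavior feature data; H_d, H_{d^+} and H_{d^-} are the representations of the original text, the positive example and the negative example, respectively; sim(H_x, H_y) is the similarity between the representations H_x and H_y of behavior feature data x and y; and m is the margin coefficient.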
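A direct numerical reading of this loss is sketched below. It assumes that sim is cosine similarity and uses a margin of 0.2; both choices, and the names `cosine_sim` and `contrastive_loss`, are assumptions for illustration, and the argument order follows the formula as written in this document.

```python
import math
from typing import Sequence

def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def contrastive_loss(h_d: Sequence[float], h_pos: Sequence[float],
                     h_neg: Sequence[float], margin: float = 0.2) -> float:
    """l_CL = max(sim(H_d, H_d+) - sim(H_d, H_d-) + m, 0)."""
    return max(cosine_sim(h_d, h_pos) - cosine_sim(h_d, h_neg) + margin, 0.0)

anchor = [1.0, 0.0]
# Zero loss: the positive example is orthogonal to the anchor, the negative is identical.
print(contrastive_loss(anchor, [0.0, 1.0], [1.0, 0.0]))
# Positive loss: the roles are swapped.
print(contrastive_loss(anchor, [1.0, 0.0], [0.0, 1.0]))
```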

In this embodiment, the user ID is obtained from the current login state of the user, so the actual identity of the current user is known. The behavior feature data corresponding to the current user are then retrieved from a preset database according to the user ID, so that the user is classified according to his or her previous behavior feature data and the accuracy of classification is improved. Because keywords are extracted from the behavior feature data with the MDERank algorithm, once the behavior feature data are obtained they are part-of-speech tagged by a preset part-of-speech tagging algorithm, and candidate words are extracted from them according to a preset word rule. By combining the MDERank algorithm with the context information of the behavior feature data, the accuracy of keyword extraction is improved, the amount of data to be processed is reduced, and classification efficiency is improved.

Step 102: performing association mining on a plurality of keywords corresponding to the user according to a preset association analysis algorithm to obtain association rules corresponding to the keywords.

In this embodiment, this step mainly includes: scanning each keyword in the plurality of keywords, calculating the support degree and the confidence degree of each keyword, comparing the support degree and the confidence degree of each keyword, deleting the keywords with the support degree smaller than the confidence degree, and obtaining frequent item sets corresponding to the plurality of keywords; and, according to the frequent item set, traversing the grid diagram corresponding to the frequent item set from top to bottom and from bottom to top through a preset association analysis algorithm, and continuously reducing the search space during the traversal to obtain the association rules corresponding to the keywords.

For example, in this embodiment the association rules between the plurality of keywords are extracted by a preset association analysis algorithm. Available association analysis algorithms include the Apriori algorithm, the Galois Closure Based Approach, and the like; this embodiment preferably uses the Apriori algorithm. Using the Apriori algorithm first requires the notion of a frequent item set (frequent itemset): a collection of items that often appear together. A support threshold, for example 3/5, is preset, and the support degree of every frequent item set must be greater than this preset value. The Apriori algorithm extracts all frequent item sets and their support degrees from the data and generates all valid association rules, that is, rules whose confidence is greater than the minimum confidence (minconf). After the plurality of keywords is obtained, the occurrence frequency of each keyword and the occurrence frequency of the neighbor keywords before and after each keyword are counted, and a frequency matrix corresponding to the plurality of keywords is constructed from these frequencies.

The frequency matrix is scanned, and the support degree and the confidence degree of each keyword are calculated and compared; keywords whose support degree is smaller than the confidence degree are deleted, yielding the frequent item sets corresponding to the plurality of keywords. A corresponding grid graph is constructed from the frequent item sets and traversed from top to bottom and from bottom to top, with the search space continuously reduced during the traversal: frequent item set candidates are generated from bottom to top, and the largest frequent item set candidates are generated from top to bottom. These steps are repeated until all keywords have been traversed, yielding the association rules corresponding to the plurality of keywords.

In this step, the keywords are scanned to obtain the support degree and the confidence degree of each keyword, and keywords with a low degree of association are deleted accordingly, which improves data processing efficiency and therefore label classification efficiency. After the frequent item sets are generated, the grid graph corresponding to them is traversed from both ends simultaneously using the preset association analysis algorithm, so that the association relations between different keywords are mined and the association rules corresponding to the plurality of keywords are generated, which further improves the accuracy of user label classification.
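The notions of support, confidence and frequent item set used in this step can be made concrete with a small sketch. It is a textbook level-wise (bottom-up only) Apriori, not the bidirectional grid-graph traversal of the embodiment, and it applies the conventional thresholds: support is compared against a minimum support and confidence against a minimum confidence. The sample keyword sets, the minimum confidence of 0.7 and the function names are assumptions; the 3/5 support threshold is the example value given in the text.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori: larger candidates are built only from frequent sets."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        kept = {}
        for s in level:
            sup = support(s, transactions)
            if sup >= min_support:
                kept[s] = sup
        frequent.update(kept)
        # Join step: union two frequent k-sets when the result has k+1 items.
        level = list({a | b for a, b in combinations(kept, 2)
                      if len(a | b) == len(a) + 1})
    return frequent

def association_rules(frequent, min_conf):
    """Rules lhs -> rhs with confidence = support(lhs and rhs) / support(lhs)."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                conf = sup / frequent[lhs]  # every subset of a frequent set is frequent
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

# Hypothetical keyword sets, one per user session.
transactions = [
    {"python", "video", "quiz"},
    {"python", "video"},
    {"python", "quiz"},
    {"python", "video", "quiz"},
    {"excel"},
]
frequent = frequent_itemsets(transactions, min_support=3 / 5)
rules = association_rules(frequent, min_conf=0.7)
```

With this data, {"video", "quiz"} appears in only 2 of 5 sessions and is pruned, while the rule video -> python holds with confidence 1.0 because every session containing "video" also contains "python".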

Step 103: performing node characterization aggregation on the association rules and the plurality of keywords according to a preset graph neural network to obtain association data corresponding to the plurality of keywords.

In this embodiment, this step mainly includes: determining a primary aggregation vector corresponding to each keyword in the plurality of keywords according to the attribute information of the user and the primary aggregation layer of the graph neural network; according to the primary aggregation vectors, sequentially performing k-level vector aggregation on the association rules and the plurality of keywords through k aggregation layers preset in the graph neural network, to obtain the previous-level aggregation vector of each keyword and the previous-level aggregation vector of each neighbor keyword of that keyword, and determining the weight value of each neighbor keyword relative to the keyword; determining the current-level aggregation vector of each keyword according to the previous-level aggregation vector of the keyword, the previous-level aggregation vector of each neighbor keyword, and the weight value of each neighbor keyword relative to the keyword; and performing node characterization aggregation on the plurality of keywords according to the current-level aggregation vector of each keyword and the k aggregation layers in the graph neural network, to obtain the associated data corresponding to the plurality of keywords.

In this embodiment, the graph neural network includes k aggregation layers specifically configured to perform k-level vector aggregation on the target keyword sequentially at the k aggregation layers, where each level of vector aggregation includes determining a current level of aggregation vector of the target keyword based at least on a previous level of aggregation vector of each neighboring keyword of the target keyword, where a primary aggregation vector of each keyword is determined according to attribute features of a corresponding user.

To perform k-level vector aggregation on the target keyword v sequentially through the k aggregation layers, a primary aggregation vector $h_v^{0}$ of the target keyword v is first determined according to the attribute characteristics of the target keyword v, and a primary aggregation vector $h_u^{0}$ of each neighbor keyword u of the target keyword v is determined according to the attribute characteristics of that neighbor keyword u. Then, based at least on these primary aggregation vectors, k-level vector aggregation is performed on the target keyword v to obtain an aggregation vector $h_v^{k}$ and the user characterization of the user corresponding to the target keyword v.

In one embodiment, for a target node v, determining its i-th level vector aggregation (i.e., the vector aggregation of the i-th aggregation layer) may include first using the aggregation function $AGG_i$ of the i-th aggregation layer to obtain a neighbor aggregation vector $h_{N(v)}^{i}$, i.e., the neighbor aggregation result, from the previous-level (level i-1) aggregation vectors $h_u^{i-1}$ of the neighbor nodes u of the target node v, where N(v) denotes the set of neighbor nodes of the target node v, i.e.:

$$h_{N(v)}^{i} = AGG_i\left(\{h_u^{i-1},\ \forall u \in N(v)\}\right)$$

Then, according to the neighbor aggregation vector $h_{N(v)}^{i}$ and the previous-level (level i-1) aggregation vector $h_v^{i-1}$ of the target node v, the current-level (level i) aggregation vector $h_v^{i}$ of the target node v is determined, namely:

$$h_v^{i} = f\left(W_i,\ h_{N(v)}^{i},\ h_v^{i-1}\right)$$

where f denotes a synthesis function applied to the neighbor aggregation vector $h_{N(v)}^{i}$ and the previous-level aggregation vector $h_v^{i-1}$ of the target node v, and $W_i$ is the parameter of the i-th level aggregation.

In this embodiment, the calculation mode of determining the weight value of each neighbor keyword u with respect to the target keyword v is as follows:

wherein $\alpha_{uv}^{i}$ denotes the weight value of the neighbor node u relative to the target node v in the i-th aggregation layer, $W_1^{i}$ and the remaining parameters of the formula are parameters to be trained in the i-th aggregation layer, and $\tanh_1(\cdot)$ denotes a first activation function preset in the i-th aggregation layer.

After the weight value of each neighbor node u relative to the target node v is determined, the current-level (level i) aggregation vector of the target node v is determined according to the previous-level (level i-1) aggregation vector of the target node v, the previous-level (level i-1) aggregation vector of each neighbor node u, and the weight value of each neighbor node u relative to the target node v.
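The level-by-level aggregation described above can be sketched in a few lines. Because the weight formula and the synthesis function are not recoverable from the text, the sketch makes its own choices, all of which are assumptions: the neighbor weight is a softmax over tanh of the dot product between the two previous-level vectors, the synthesis function is an element-wise tanh of the sum, and no trainable parameters are included. The graph, the primary vectors and the function names are hypothetical.

```python
import math

def attention_weights(h_v, neighbours):
    """Weight of each neighbour relative to v (assumed form: softmax of tanh(dot))."""
    scores = [math.tanh(sum(a * b for a, b in zip(h_u, h_v))) for h_u in neighbours]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_level(h, graph):
    """One aggregation level: previous-level vector of v plus weighted neighbour vectors."""
    new_h = {}
    for v, nbrs in graph.items():
        vecs = [h[u] for u in nbrs]
        alphas = attention_weights(h[v], vecs)
        # Neighbour aggregation vector: attention-weighted sum of neighbour vectors.
        agg = [sum(a * vec[d] for a, vec in zip(alphas, vecs))
               for d in range(len(h[v]))]
        # Synthesis function (assumed): element-wise tanh of the sum.
        new_h[v] = [math.tanh(x + y) for x, y in zip(h[v], agg)]
    return new_h

def aggregate(h0, graph, k):
    """Apply k aggregation levels, starting from the primary vectors h0."""
    h = h0
    for _ in range(k):
        h = aggregate_level(h, graph)
    return h

# Hypothetical keyword graph: edges taken from mined association rules.
graph = {"python": ["video", "quiz"], "video": ["python"], "quiz": ["python"]}
h0 = {"python": [1.0, 0.0], "video": [0.5, 0.5], "quiz": [0.0, 1.0]}
h2 = aggregate(h0, graph, k=2)
```

After k levels, each keyword vector carries information from keywords up to k hops away in the rule graph, which is the sense in which the associated data contains both the keywords and their relations.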

Step 104: extracting the spatial features and the temporal features of the associated data according to a preset deep neural network, and fusing the spatial features and the temporal features corresponding to the associated data to obtain the space-time features corresponding to the associated data.

In this embodiment, this step mainly includes: performing spatial separation and temporal separation on the associated data through a first convolution layer preset in the deep neural network model to obtain the spatial information data and the temporal information data corresponding to the associated data; performing convolution processing on the spatial information data and the temporal information data, respectively, through a second convolution layer preset in the deep neural network model to obtain the spatial features corresponding to the spatial information data and the temporal features corresponding to the temporal information data; performing dimension lifting processing and feature fusion on the spatial features and the temporal features according to a fully connected layer preset in the deep neural network to obtain the initial space-time features corresponding to the associated data; and averaging the initial space-time features according to a preset average function to obtain the space-time features corresponding to the associated data.

In this embodiment, the network structure of the preset deep neural network model includes a first convolution layer, a second convolution layer and a fully connected layer. The associated data is input into the first convolution layer, and spatial separation and temporal separation are performed on it by a space-time separation technique preset in the first convolution layer, yielding the spatial information data and the temporal information data corresponding to the associated data. The first convolution layer consists of 13 convolution layers and 3 fully connected layers; the 13 convolution layers have a kernel size of 3×3 and a stride of 1 and are stacked into 5 blocks, and a max pooling layer with a kernel size of 2×2 and a stride of 2 is connected after each convolution block. The spatial information data and the temporal information data are then convolved by the second convolution layer preset in the deep neural network model, which extracts their feature information: a plurality of convolution kernels convolve the data of the previous layer and output the results in two-dimensional form, so that the convolution operation yields a plurality of two-dimensional outputs, namely the spatial features and the temporal features. The fully connected layer performs dimension lifting processing and feature fusion on the spatial features and the temporal features to obtain the initial space-time features corresponding to the associated data, and a mean function is applied to the initial space-time features, so that the resulting features are not affected by the order of the keywords and are more general.
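The effect of the stated kernel sizes and strides on the feature-map size can be checked with a short calculation. The sketch assumes a padding of 1 for the 3×3 convolutions, a split of the 13 convolution layers into blocks of 2, 2, 3, 3 and 3 layers, and an input side length of 224; none of these three values is given in the text.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output side length of a convolution (padding of 1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Output side length of a max pooling layer."""
    return (size - kernel) // stride + 1

def trace(size, convs_per_block=(2, 2, 3, 3, 3)):
    """Side length after each of the 5 blocks (13 convolutions in total, assumed split)."""
    sizes = []
    for n_convs in convs_per_block:
        for _ in range(n_convs):
            size = conv_out(size)
        size = pool_out(size)
        sizes.append(size)
    return sizes

sizes = trace(224)  # hypothetical input side length
```

Under these assumptions the 3×3 stride-1 convolutions keep the side length unchanged and each 2×2 stride-2 pooling halves it, so five blocks reduce a 224-wide input to 7.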

To improve the accuracy of label classification, this step separates the associated data into spatial information data and temporal information data through the preset deep neural network, so that space-time feature extraction can be performed subsequently. After the spatial information data and the temporal information data are obtained, features are extracted from them by the second convolution layer of the deep neural network, which improves the accuracy of keyword feature extraction, makes the extracted features better match the actual operations of the user, and thus improves the accuracy of label classification.
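The statement that averaging makes the result independent of keyword order can be verified directly. In the sketch below, fusion is reduced to concatenating the spatial and temporal feature vectors of each keyword, which is only a stand-in for the fully connected layer; the feature values and function names are hypothetical.

```python
def fuse(spatial, temporal):
    """Concatenate per-keyword spatial and temporal features (stand-in for the fusion layer)."""
    return [s + t for s, t in zip(spatial, temporal)]

def mean_pool(rows):
    """Average over keywords, column by column."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

# Hypothetical features for three keywords.
spatial = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
temporal = [[0.5], [1.5], [2.5]]

fused = fuse(spatial, temporal)
pooled = mean_pool(fused)
pooled_reversed = mean_pool(list(reversed(fused)))
```

Both `pooled` and `pooled_reversed` equal `[3.0, 4.0, 1.5]`: reordering the keywords changes the rows being averaged but not the average.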

Step 105: performing label classification according to the space-time features through a preset label classification model to obtain the labels corresponding to the user.

In this embodiment, this step mainly includes: inputting the space-time features into the preset label classification model, and performing label prediction on the space-time features through a plurality of different decision trees preset in the label classification model, respectively, to obtain a plurality of different first labels; and processing the plurality of different first labels through a preset regression calculation function or classification function to obtain the labels corresponding to the user.

This embodiment provides a random forest prediction algorithm that incorporates the attention mechanism of deep learning, that is, an attention-based random forest model, on which the label classification model is constructed. The attention-based random forest model can adaptively adjust the contribution of each decision tree, so that features that contribute most to the target prediction receive more attention during prediction. In this embodiment, the training process of the label classification model is as follows:

The behavior feature data obtained in step 101 is randomly sampled to generate n different sampling data sets; for each sampling data set, k decision trees are generated using a random forest algorithm; and, for each decision tree, the weight of each feature in the decision tree is adjusted by an attention mechanism. Let the weight of the i-th feature in the j-th decision tree be $w_{ij}$. The attention score for this feature is then:

To achieve adaptive adjustment of the importance of each feature, an attention mechanism is introduced that adjusts the importance of each feature in the prediction by learning the attention score of each feature in each decision tree.

For the j-th decision tree, let the attention score vector be $a_j = [a_1, a_2, \ldots, a_m]$. The attention weight $w_{ij}$ of the i-th feature in the decision tree is:

where m is the total number of features in the decision tree.

The attention score $a_j$ is calculated as follows:

$$a_j = \sigma(U v_j + b_a)$$

where U is a weight matrix, $v_j$ is a weighted average of the embedding vectors of all features in the j-th decision tree, $b_a$ is a bias term, and $\sigma$ is an activation function (e.g., the sigmoid function).
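The attention score computation can be sketched as follows. The score follows the stated form, a sigmoid applied to $U v_j + b_a$; the subsequent conversion of the m scores into weights that sum to one uses a softmax, which is an assumption, since the normalisation formula is not recoverable from the text. The matrix, vector and bias values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attention_scores(U, v_j, b_a):
    """a_j = sigmoid(U v_j + b_a): one score per feature (one row of U per feature)."""
    return [sigmoid(sum(u * v for u, v in zip(row, v_j)) + b)
            for row, b in zip(U, b_a)]

def normalise(scores):
    """Turn m scores into weights summing to 1 (softmax; assumed form)."""
    exps = [math.exp(a) for a in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameters for a tree with m = 3 features and 2-dimensional embeddings.
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
v_j = [2.0, -2.0]
b_a = [0.0, 0.0, 0.0]

scores = attention_scores(U, v_j, b_a)
weights = normalise(scores)
```

Here the first feature receives the highest score and the second the lowest, and the normalised weights preserve that ordering.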

After the plurality of different decision trees are generated in the label classification model, the extracted space-time features are input into each decision tree for label classification to obtain M prediction results $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_M$, and the final prediction result is then obtained by voting or averaging. In this embodiment, a specific value can be predicted by regression, in which case the final prediction result $\hat{y}$ is the average of the M prediction results:

$$\hat{y} = \frac{1}{M}\sum_{t=1}^{M} \hat{y}_t$$

In this step, label classification is performed on the space-time features by the different decision trees preset in the label classification model to obtain a plurality of different first labels, and regression processing or classification processing is then performed on the first labels to obtain the labels corresponding to the user, which improves the accuracy of label classification.
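Combining the first labels produced by the individual trees can be illustrated with a short sketch covering the two cases mentioned above, voting for classification and averaging for regression. It also includes a weighted vote to show how per-tree weights could change the outcome; the per-tree weights, the label names and the function names are hypothetical and are not taken from the embodiment.

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    """Classification: the label predicted by the most trees."""
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(labels, weights):
    """Classification with per-tree weights (hypothetical attention weights)."""
    totals = defaultdict(float)
    for label, w in zip(labels, weights):
        totals[label] += w
    return max(totals, key=totals.get)

def average(values):
    """Regression: mean of the per-tree predictions."""
    return sum(values) / len(values)

# Hypothetical first labels from M = 5 decision trees.
first_labels = ["beginner", "advanced", "beginner", "beginner", "advanced"]
tree_weights = [0.1, 0.4, 0.1, 0.1, 0.3]

plain = majority_vote(first_labels)
weighted = weighted_vote(first_labels, tree_weights)
```

In this example the plain vote returns "beginner" (3 of 5 trees), whereas the weighted vote returns "advanced" because the two trees predicting it carry a combined weight of 0.7.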

In another aspect, an embodiment of the present invention provides a user tag classification system based on big data, the specific structure of which is shown in Fig. 2. The system includes a keyword extraction module 201, an association module 202, an aggregation module 203, a feature extraction module 204, and a tag classification module 205.

The keyword extraction module 201 is configured to obtain behavior feature data of a user, and perform keyword extraction on the behavior feature data through a preset keyword extraction algorithm, so as to generate a plurality of keywords corresponding to the user.

The association module 202 is configured to perform association mining on a plurality of keywords corresponding to the user according to a preset association analysis algorithm, so as to obtain association rules corresponding to the plurality of keywords.

The aggregation module 203 is configured to aggregate node representation of the association rule and the plurality of keywords according to a preset graph neural network, so as to obtain association data corresponding to the plurality of keywords.

The feature extraction module 204 is configured to extract spatial features and temporal features of the associated data according to a preset deep neural network, and fuse the spatial features and the temporal features corresponding to the associated data to obtain the space-time features corresponding to the associated data.

The tag classification module 205 is configured to perform tag classification according to the space-time feature through a preset tag classification model, so as to obtain a tag corresponding to the user.

In this embodiment, the keyword extraction module 201 includes a data acquisition unit, a part-of-speech tagging unit, and an extraction unit.

The data acquisition unit is used for acquiring a user ID according to the login state of the user, and searching from a preset database according to the user ID to acquire behavior characteristic data corresponding to the user; the behavior characteristic data comprise user attribute information, a user behavior sequence and user operation; the user operation includes a login operation, a transaction operation, and a browsing operation.

The part-of-speech tagging unit is used for performing part-of-speech tagging on the behavior feature data through a preset part-of-speech tagging algorithm, and extracting candidate words from the behavior feature data by combining with a preset word rule to obtain a candidate word set corresponding to the behavior feature data.

In this embodiment, the association module 202 further includes a screening unit and a rule unit.

The screening unit is used for scanning each keyword in the plurality of keywords, calculating the support degree and the confidence degree of each keyword, comparing the support degree and the confidence degree of each keyword, deleting the keywords with the support degree smaller than the confidence degree, and obtaining frequent item sets corresponding to the plurality of keywords.

The rule unit is used for traversing the grid graph corresponding to the frequent item set from top to bottom and from bottom to top simultaneously through a preset association analysis algorithm according to the frequent item set, and continuously reducing search space in the traversing process to obtain association rules corresponding to the keywords.

In this embodiment, the aggregation module 203 includes a vector unit and an aggregation unit.

The vector unit is used for determining a primary aggregation vector corresponding to each keyword in the plurality of keywords according to the attribute information of the user and the primary aggregation layer of the graph neural network; according to the primary aggregation vector, k-level vector aggregation is sequentially carried out on the association rule and the keywords through k aggregation layers preset in the graph neural network, a superior aggregation vector corresponding to each keyword in the keywords and a superior aggregation vector corresponding to each neighbor keyword in the keywords are obtained, and a weight value of each neighbor keyword relative to the keyword is determined.

The aggregation unit is used for determining the current level aggregation vector of each keyword according to the previous level aggregation vector of the keyword, the previous level aggregation vector of each neighbor keyword and the weight value relative to the keyword; and carrying out node characterization aggregation on the keywords according to the current level aggregation vector corresponding to each keyword and k aggregation layers in the graph neural network to obtain associated data corresponding to the keywords.

In this embodiment, the feature extraction module 204 includes an extraction unit and a fusion unit.

The extraction unit is used for carrying out space separation and time separation on the associated data through a first convolution layer preset in the deep neural network model to obtain space information data and time information data corresponding to the associated data; and respectively carrying out convolution processing on the space information data and the time information data through a second convolution layer preset in the deep neural network model to obtain space characteristics corresponding to the space information data and time characteristics corresponding to the time information data.

The fusion unit is used for carrying out dimension lifting processing and feature fusion on the spatial features and the time features according to a full-connection layer preset in the deep neural network to obtain initial space-time features corresponding to the associated data; and carrying out average processing on the initial space-time characteristics according to a preset average function to obtain space-time characteristics corresponding to the associated data.

In summary, the user tag classification method and system based on big data provided by the embodiments of the present invention obtain the behavior feature data of a user and extract keywords from it by a preset keyword extraction algorithm to obtain a plurality of keywords corresponding to the user; extracting keywords from the behavior feature data reduces the amount of data to be processed and improves data processing efficiency. After the plurality of keywords is obtained, association mining is performed on them by a preset association analysis algorithm, which takes the association relations between the keywords into account and generates the association rules corresponding to the plurality of keywords. Node characterization aggregation is then performed on the association rules and the plurality of keywords by a preset graph neural network to generate the associated data corresponding to the plurality of keywords; because the associated data contains not only the keywords but also the association relations between them, it better reflects the behavior characteristics of the user and improves the accuracy of user classification. After the associated data is obtained, its temporal features and spatial features are extracted by a preset deep neural network and fused into space-time features, and label classification is performed according to the space-time features, which improves the accuracy of label classification.

The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention, and are not to be construed as limiting the scope of the invention. It should be noted that any modifications, equivalent substitutions, improvements, etc. made by those skilled in the art without departing from the spirit and principles of the present invention are intended to be included in the scope of the present invention.

Claims

1. A big data based user tag classification method, comprising:

According to the space-time characteristics, carrying out label classification through a preset label classification model to obtain labels corresponding to the users;

performing association mining on a plurality of keywords corresponding to the user according to a preset association analysis algorithm to obtain association rules corresponding to the plurality of keywords, wherein the association rules comprise:

according to the frequent item set, traversing a grid diagram corresponding to the frequent item set from top to bottom and from bottom to top through a preset association analysis algorithm, and continuously reducing search space in the traversing process to obtain association rules corresponding to the keywords;

the node characterization aggregation is performed on the association rule and the plurality of keywords according to a preset graph neural network, and association data corresponding to the plurality of keywords is obtained, including:

2. The method for classifying user tags based on big data as set forth in claim 1, wherein said obtaining behavior feature data of the user comprises:

3. The method for classifying user labels based on big data according to claim 1, wherein the step of extracting keywords from the behavior feature data by a preset keyword extraction algorithm to generate a plurality of keywords corresponding to the user comprises the steps of:

4. The method for classifying user tags based on big data as set forth in claim 1, wherein the extracting spatial features and temporal features of the associated data according to a predetermined deep neural network comprises:

5. The method for classifying user labels based on big data as claimed in claim 4, wherein the step of fusing the spatial features and the temporal features corresponding to the associated data to obtain the space-time features corresponding to the associated data comprises the steps of:

6. The method for classifying user labels based on big data according to claim 1, wherein the step of classifying labels according to the space-time characteristics by a preset label classification model to obtain labels corresponding to the users comprises the steps of:

7. The user tag classification system based on big data is characterized by comprising a keyword extraction module, an association module, an aggregation module, a feature extraction module and a tag classification module;

The label classification module is used for carrying out label classification through a preset label classification model according to the space-time characteristics to obtain labels corresponding to the users;

8. The big data-based user tag classification system of claim 7, wherein the keyword extraction module comprises a data acquisition unit, a part-of-speech tagging unit and an extraction unit;