CN110096499B - User object identification method and system based on behavior time series big data - Google Patents

Publication number
CN110096499B
Authority
CN
China
Prior art keywords
data
user
feature
characteristic
behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910284112.7A
Other languages
Chinese (zh)
Other versions
CN110096499A (en
Inventor
杨灿
袁启虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201910284112.7A priority Critical patent/CN110096499B/en
Publication of CN110096499A publication Critical patent/CN110096499A/en
Application granted granted Critical
Publication of CN110096499B publication Critical patent/CN110096499B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2477Temporal data queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a user object identification method based on behavior time series big data, comprising the following steps: acquiring historical data and data to be identified, and cleaning both according to a cleaning rule; generating records with a uniform structure from the cleaned data; selecting features according to the data feature types and constructing a feature set; constructing a feature vector or a feature vector group according to the feature set; generating a similarity discrimination matrix from the feature vectors of the historical data, or a machine learning discrimination model from the feature vector groups of the historical data; and performing user identification on the feature vector generated from the data to be identified, using the similarity discrimination matrix or the machine learning discrimination model, to obtain an identification result. The invention enables accurate identity recognition on data in which the user identity information is hidden or polluted.

Description

User object identification method and system based on behavior time series big data
Technical Field
The invention relates to the technical field of computer-based behavior recognition, and in particular to a user object identification method and system based on behavior time series big data.
Background
With the development of Internet communication technology and the continuous change of social forms, the achievements and thinking of "Internet Plus" have penetrated people's daily lives. More and more traditional living habits have turned into online virtual behaviors, which Internet operators collect and store as behavior data of various kinds, such as online shopping, web browsing, and audio and video playback. In addition, with the rise of data mining and machine learning, identity recognition technologies based on the big data of network users have also developed, for example the recognition of identity types such as gender, occupation and shopping preference from the user profiles built by the major e-commerce platforms.
In the prior art, identity recognition technologies based on non-image data such as Web data are mostly limited to recognizing the type of a user's identity, so a large amount of non-image user behavior data cannot be used to identify individual users accurately. Now that data analysis and machine learning technologies are mature, deeply mining user behavior data to pinpoint a user's identity can greatly extend identity recognition, which is currently limited almost entirely to image data.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a user object identification method based on behavior time series big data. The invention enables accurate identity recognition on data in which the user identity information is hidden or polluted.
The purpose of the invention can be realized by the following technical scheme:
A user object identification method based on behavior time series big data comprises the following steps:
acquiring historical data and data to be identified, and cleaning both according to a cleaning rule;
generating records with a uniform structure from the cleaned data;
selecting features according to the data feature types, and constructing a feature set;
constructing a feature vector or a feature vector group according to the feature set;
generating a similarity discrimination matrix from the feature vectors of the historical data, or a machine learning discrimination model from the feature vector groups of the historical data;
and performing user identification on the feature vector generated from the data to be identified, according to the similarity discrimination matrix or the machine learning discrimination model, to obtain an identification result.
Specifically, the historical data comes from a static database and is used to generate the discrimination model. The data to be identified is newly generated data that serves as the target data; its source may be a static database or a dynamic data stream.
The data cleaning rules include user filtering, field filtering and time filtering. User filtering removes invalid users according to data quantity and data source; field filtering removes data attributes that are unrelated to behavior; time filtering keeps only the data within a specified time interval, preserves the chronological order of the data, and removes records whose time order is inconsistent.
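The three cleaning rules can be sketched as follows. This is a minimal illustration, not the patented implementation: the row layout (dicts with `user` and `time` keys), the `min_rows` threshold and the function name `clean_records` are assumptions made for the example. It filters users by data quantity only, and it restores time order by sorting rather than by discarding out-of-order rows.

```python
from collections import Counter

def clean_records(rows, keep_fields, start, end, min_rows=2):
    """Apply time, user and field filtering to raw rows (a list of dicts)."""
    # Time filtering: keep rows inside [start, end) and restore time order.
    in_range = [r for r in rows if start <= r["time"] < end]
    in_range.sort(key=lambda r: r["time"])
    # User filtering: drop users with too few rows (assumed validity rule).
    counts = Counter(r["user"] for r in in_range)
    valid = [r for r in in_range if counts[r["user"]] >= min_rows]
    # Field filtering: keep only user, time and the behavior-related fields.
    wanted = ("user", "time") + tuple(keep_fields)
    return [{k: r[k] for k in wanted} for r in valid]

raw = [
    {"user": "a", "time": 30, "action": "play", "ip": "x"},
    {"user": "a", "time": 10, "action": "share", "ip": "x"},
    {"user": "b", "time": 20, "action": "play", "ip": "y"},
    {"user": "a", "time": 99, "action": "play", "ip": "x"},
]
cleaned = clean_records(raw, keep_fields=["action"], start=0, end=50)
```

Here user `b` is dropped for having a single row, the row at time 99 falls outside the interval, and the `ip` field is removed as unrelated to behavior.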
Specifically, the step of generating a record with a unified structure according to the cleaned data includes:
digitizing the behavior fields of the data, and adding a timestamp (Timestamp) and a user tag (UserId) to each piece of data. Digitizing means mapping each behavior type to an integer value within a specified range; the timestamp is an integer giving the number of seconds elapsed from a specified date to the time at which the record was acquired; the user tag is an integer number assigned to the user. The unified record generated from each piece of data is expressed as:
Record = <UserId, Timestamp, Operation_1, Operation_2, ..., Operation_n>
where each Operation represents an operation behavior recorded for the user.
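A minimal sketch of building such unified records is shown below. The reference date `EPOCH`, the input keys (`user`, `when`) and the helper names `make_encoder` and `to_record` are assumptions for illustration; the source only requires that behavior types and users map to integers and that the timestamp counts seconds from a specified date.

```python
from datetime import datetime

EPOCH = datetime(2019, 1, 1)  # assumed reference date for the timestamp

def make_encoder():
    """Return a function mapping each distinct label to a stable integer."""
    table = {}
    def encode(label):
        return table.setdefault(label, len(table))
    return encode

def to_record(row, user_enc, op_encs, op_fields):
    """Build <UserId, Timestamp, Operation_1, ..., Operation_n> as a tuple."""
    timestamp = int((row["when"] - EPOCH).total_seconds())
    ops = tuple(enc(row[f]) for enc, f in zip(op_encs, op_fields))
    return (user_enc(row["user"]), timestamp) + ops

fields = ["action", "category"]
user_enc = make_encoder()
op_encs = [make_encoder() for _ in fields]  # one integer range per field
rows = [
    {"user": "alice", "when": datetime(2019, 1, 1, 0, 1), "action": "view", "category": "book"},
    {"user": "bob", "when": datetime(2019, 1, 1, 0, 2), "action": "buy", "category": "book"},
    {"user": "alice", "when": datetime(2019, 1, 2), "action": "buy", "category": "music"},
]
records = [to_record(r, user_enc, op_encs, fields) for r in rows]
```

Each behavior field gets its own encoder, so the integer codes of different fields do not have to share one range.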
Specifically, the data feature types are divided into behavior features and time series features of behaviors. The i-th feature is denoted f_i.
A behavior feature takes the specific type of a behavior itself as the feature. The i-th behavior feature is denoted g_i, where g_i = Type(Operation_a) means that the specific type of Operation_a is used as a feature.
A time series feature of behaviors takes the switching between different behavior types as the feature. The i-th time series feature is denoted h_i, where h_i = (Type(Operation_a) → Type(Operation_b)) represents the switching behavior between the specific types of two operations Operation_a and Operation_b.
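The two feature types can be illustrated with a short sketch. It assumes that a user's operation types are given as a list in timestamp order and that a switch is taken between consecutive records. Both are illustrative choices; the embodiment later pairs two fields of the same record instead. The tuple encodings `("g", type)` and `("h", a, b)` are hypothetical.

```python
def behavior_features(ops):
    """Each operation type is itself a feature: g = Type(Operation_a)."""
    return [("g", op) for op in ops]

def transition_features(ops):
    """Each switch between different consecutive types is a feature:
    h = (Type(Operation_a) -> Type(Operation_b))."""
    return [("h", a, b) for a, b in zip(ops, ops[1:]) if a != b]

ops = [3, 3, 7, 3]  # one user's operation types in timestamp order
g = behavior_features(ops)
h = transition_features(ops)
```

For this sequence, `h` contains the switches 3 → 7 and 7 → 3, while the repeated 3 → 3 step is not counted as a switch.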
Specifically, features are selected according to the frequency heat and the TF-IDF heat of each feature within the data interval.
The frequency heat of a feature is the normalized number of occurrences of the feature within a specified time interval. The frequency heat of feature f_i is calculated as
$$Ph_{f_i} = \frac{n_{f_i}}{\sum_{k} n_{f_k}}$$
where $n_{f_i}$ denotes the number of occurrences of f_i and $\sum_{k} n_{f_k}$ denotes the number of occurrences of all features.
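The frequency heat is a plain relative frequency, which the following sketch computes for one user's feature occurrences. The function name `frequency_heat` and the string feature labels are assumptions for the example.

```python
from collections import Counter

def frequency_heat(features):
    """Map each feature to n_f / (total occurrences of all features)."""
    counts = Counter(features)
    total = sum(counts.values())
    return {f: n / total for f, n in counts.items()}

heat = frequency_heat(["play", "play", "share", "play"])
```

With three `play` events out of four, `play` receives a heat of 0.75 and `share` a heat of 0.25.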
The TF-IDF heat of a feature takes the TF-IDF value of the feature within the data interval as its heat. The TF-IDF heat of feature f_i is calculated as
$$Ph_{f_i} = \text{TF-IDF}_{f_i} = TF_{f_i} \times IDF_{f_i}$$
where
$$TF_{f_i} = \frac{n_{f_i}}{\sum_{k} n_{f_k}}$$
represents the frequency of the feature within the user's data interval, and
$$IDF_{f_i} = \log \frac{D}{D_j}$$
is the logarithm of the ratio of the total number of users D to the number of users D_j whose data contain the feature; it measures the general importance of the feature within the data interval.
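A minimal sketch of the TF-IDF heat follows, treating each user's data as one "document". The natural logarithm is an assumption, since the text only says "logarithm", and the input layout `{user: [feature, ...]}` and the name `tfidf_heat` are illustrative.

```python
import math
from collections import Counter

def tfidf_heat(user_features):
    """user_features: {user: [feature, ...]} -> {user: {feature: heat}}."""
    n_users = len(user_features)            # D: total number of users
    users_with = Counter()                  # D_j: users containing a feature
    for feats in user_features.values():
        users_with.update(set(feats))
    heat = {}
    for user, feats in user_features.items():
        counts = Counter(feats)
        total = sum(counts.values())
        heat[user] = {
            f: (n / total) * math.log(n_users / users_with[f])  # TF * IDF
            for f, n in counts.items()
        }
    return heat

heat = tfidf_heat({"u1": ["a", "a", "b"], "u2": ["b", "c"]})
```

Feature `b` appears for every user, so its IDF is log(1) = 0 and it contributes no heat, whereas `a` and `c` are each specific to one user and score higher.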
The heat of every feature f is calculated for each user, and the Top-K features with the highest heat are selected as that user's feature set. The feature set of the i-th user is expressed as featureU_i = {f_{i,1}, f_{i,2}, ..., f_{i,K}}.
The feature sets of the u users are combined to form a feature set of consistent scale:
Feature = featureU_1 ∪ featureU_2 ∪ ... ∪ featureU_u
The combined set contains m feature elements in total, i.e. Feature = {f_1, f_2, ..., f_m}.
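Top-K selection and the union over users can be sketched as below. Breaking ties by feature name and returning the union in sorted order are assumptions added so that the result is deterministic; the function names are hypothetical.

```python
def top_k(heat, k):
    """Return the k hottest features of one user (ties broken by name)."""
    ranked = sorted(heat.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f for f, _ in ranked[:k]]

def global_feature_set(user_heat, k):
    """Union of every user's Top-K set, returned in a fixed sorted order."""
    merged = set()
    for heat in user_heat.values():
        merged.update(top_k(heat, k))
    return sorted(merged)

user_heat = {
    "u1": {"a": 0.6, "b": 0.3, "c": 0.1},
    "u2": {"b": 0.5, "d": 0.4, "a": 0.1},
}
feature = global_feature_set(user_heat, k=2)
```

With K = 2 the per-user sets are {a, b} and {b, d}, so the combined set has m = 3 elements. A fixed ordering matters because the same ordering must be used for every user's vector components.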
Specifically, constructing a user feature vector or a feature vector group according to the feature set Feature comprises:
using the m extracted overall features as components to form a user feature vector, where each component is the heat value of the corresponding feature for user u, namely
UserVector = <Ph_{u,1}, Ph_{u,2}, ..., Ph_{u,m}>
where Ph denotes a heat value.
When the data volume is sufficient, the data of each user is divided into several sections by time period, a feature vector is generated for each section, and together these feature vectors form a feature vector group.
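The following sketch builds a single user vector and a vector group under stated assumptions: heat is the frequency heat, sections are equal-width time windows, all events lie inside [start, end), and the helper names are illustrative.

```python
from collections import Counter

def user_vector(features, feature_set):
    """Frequency heat of each global feature; 0.0 if the user never shows it."""
    counts = Counter(features)
    total = sum(counts.values())
    return [counts[f] / total if total else 0.0 for f in feature_set]

def vector_group(timed_features, feature_set, start, end, parts):
    """Split [start, end) into equal time sections, one vector per section."""
    width = (end - start) / parts
    buckets = [[] for _ in range(parts)]
    for t, f in timed_features:
        idx = min(int((t - start) / width), parts - 1)
        buckets[idx].append(f)
    return [user_vector(b, feature_set) for b in buckets]

feature_set = ["a", "b", "d"]
events = [(0, "a"), (10, "a"), (20, "b"), (60, "d"), (90, "b")]
single = user_vector([f for _, f in events], feature_set)
group = vector_group(events, feature_set, start=0, end=100, parts=2)
```

Every vector has the same m components regardless of which features a given user or section actually contains, which is what makes the vectors comparable across users.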
Specifically, in the step of generating the similarity discrimination matrix or the machine learning discrimination model according to the feature vectors or the feature vector groups of the historical data:
if a single feature vector is generated for each user, the feature vectors of all users form a u × m discrimination matrix:
SimMatrix = <UserVector_1; UserVector_2; ...; UserVector_u>
if feature vector groups are generated, each feature component of each feature vector is used as an attribute input and the corresponding user number UserId is used as the label for machine learning training, yielding a machine learning discrimination model Π. The machine learning method may be KNN, decision tree, random forest, Naïve Bayes or GBDT.
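Both branches can be sketched without external libraries. The first function stacks one vector per user into a matrix; the second is a small KNN classifier (one of the listed methods) over labelled vector groups, whose vote shares stand in for the hit probability. The function names, the Euclidean metric and k = 3 are assumptions for the example, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def sim_matrix(user_vectors):
    """Stack one vector per user into a u x m matrix plus the row labels."""
    ids = sorted(user_vectors)
    return ids, [user_vectors[i] for i in ids]

def knn_predict(samples, labels, query, k=3):
    """Vote among the k nearest training vectors; returns {UserId: share}."""
    order = sorted(range(len(samples)),
                   key=lambda i: math.dist(samples[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return {user: n / k for user, n in votes.items()}

ids, matrix = sim_matrix({2: [0.1, 0.9], 1: [0.8, 0.2]})

# Vector groups: several labelled vectors per user (label = UserId).
samples = [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.2, 0.8]]
labels = [1, 1, 2, 2]
probs = knn_predict(samples, labels, [0.75, 0.25], k=3)
```

Two of the three nearest neighbours of the query belong to user 1, so user 1 receives the larger share. Any of the other listed learners could replace the KNN step as long as it outputs a score per UserId.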
Specifically, in the step of performing user identification on the feature vector generated from the data to be identified, according to the similarity discrimination matrix or the machine learning discrimination model, to obtain the identification result:
if the similarity discrimination matrix is used, the similarity between the feature vector to be identified, IdentifyVector, and each feature vector in the similarity discrimination matrix SimMatrix is measured, and the Top-N most similar users are selected as candidate identification results. The similarity measure may be the Euclidean distance, the cosine similarity or the Pearson correlation coefficient.
If the machine learning discrimination model Π is used, each feature component of IdentifyVector is used as an attribute input to the model Π, and the user numbers corresponding to the Top-N values of the model's output (hit probability) are taken as candidate identification results.
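The matrix-based branch can be sketched with the three named measures. Negating the Euclidean distance so that a larger score always means "more similar", and breaking ties by user number, are assumptions of this example; the function names are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def pearson(a, b):
    # Pearson correlation equals the cosine of the mean-centered vectors.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - ma for x in a], [y - mb for y in b])

def euclidean_similarity(a, b):
    # Negated distance, so "larger is more similar" holds for all measures.
    return -math.dist(a, b)

def top_n(identify_vector, ids, matrix, n, measure=cosine):
    """Rank the rows of SimMatrix against IdentifyVector; return n UserIds."""
    scored = [(measure(identify_vector, row), uid)
              for uid, row in zip(ids, matrix)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [uid for _, uid in scored[:n]]

ids = [1, 2, 3]
matrix = [[0.6, 0.4, 0.0], [0.0, 0.5, 0.5], [0.1, 0.0, 0.9]]
candidates = top_n([0.5, 0.5, 0.0], ids, matrix, n=2)
```

For this toy matrix all three measures rank user 1 first and user 2 second, although in general the measures can disagree.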
Another object of the present invention is to provide a user object recognition system based on behavior time series big data.
The other purpose of the invention can be realized by the following technical scheme:
A user object identification system based on behavior time series big data comprises a data acquisition module, a feature construction module, a model generation module and a user identification module;
the data acquisition module acquires user data from a data source, cleans the data according to specific filtering rules, retains the meaningful data, and forms user behavior records arranged in ascending order of timestamp;
the feature construction module calculates the distribution features of user behavior from the collected data and constructs feature vectors of consistent scale, where each element of a feature vector is the heat of the corresponding feature behavior;
the model generation module either constructs a feature matrix from the users' historical feature behavior data or trains on that data to obtain a machine learning discrimination model;
the user identification module either measures the similarity between the data to be identified and the feature matrix of the users' historical behavior data, or feeds the data to be identified into the machine learning discrimination model, to obtain the identification result.
The working steps of the user object identification system are as follows:
the data acquisition module acquires user historical data from a historical data source, cleans the data according to specific rules, retains only the meaningful data, and generates records with a unified structure;
the feature construction module calculates the heat of the behaviors or behavior time series in the generated data, selects the behaviors or behavior sequences with the highest heat as the feature set, and constructs a feature vector for each user according to the feature set;
the model generation module generates a user similarity discrimination matrix from the user feature vectors, or trains a machine learning discrimination model Π on the feature vector groups;
the data acquisition module acquires data to be identified from the data source to be identified and generates a feature vector to be identified according to the feature set extracted by the feature construction module; the user identification module then identifies the user by applying the corresponding discrimination matrix or model to the feature vector to be identified, and selects the Top-N users as the identification result.
Compared with the prior art, the invention has the following beneficial effects:
The invention can flexibly choose between the similarity discrimination matrix and the machine learning discrimination model according to the volume of user data, and introduces the time series features of user behavior into the user identification process, which can effectively improve the accuracy of user identification.
Drawings
Fig. 1 is a flowchart of a method for identifying a user object based on behavior time series big data.
FIG. 2 is a flow chart of selecting features according to feature types and constructing feature sets according to embodiments of the present invention.
FIG. 3 is a flow chart of the identification based on the similarity metric matrix in the embodiment of the present invention.
FIG. 4 is a flow chart of machine learning model based recognition in an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
FIG. 1 is a flowchart of a user object identification method based on behavior time series big data, which includes the following steps:
S1, acquiring historical data and data to be identified, and cleaning both according to the cleaning rules;
In the data acquisition module of this embodiment, both the historical data and the data to be identified are acquired from a static database, using a user music playback record data set.
A user historical data set is obtained from the static database. Each piece of source data is recorded as Source = <Date, Time, user, curSong, Artist, sTag, rTag, nextSong>, where Date, Time and user represent the date and time of the record and the corresponding user; curSong and nextSong represent the music the user is currently listening to and the music listened to next; Artist and sTag represent the singer and the type of the music; and rTag represents the user's operations on the music, such as collecting, sharing and commenting. Of these, curSong, sTag, rTag and nextSong contain behavior information.
S2, generating records with uniform structures according to the cleaned data;
The data is formatted according to the rules: the behavior information in curSong, sTag, rTag and nextSong is retained and mapped to numerical types, denoted Operation_1, Operation_2, Operation_3 and Operation_4 respectively; a timestamp is added and the users are numbered to obtain regularized data, in which each record is data = <UserId, Operation_1, Operation_2, Operation_3, Operation_4, Timestamp>.
S3, selecting features according to the data feature types, and constructing feature sets, wherein the specific flow is shown in FIG. 2;
In this embodiment, the behavior analysis module uses the user behaviors directly as features during feature extraction, and uses the frequency heat of the features as the selection criterion.
For each user, all types contained in Operation_1, Operation_2, Operation_3 and Operation_4 are taken as features, denoted f, and the frequency heat of each type for that user is calculated as
$$Ph_{f_i} = \frac{n_{f_i}}{\sum_{k} n_{f_k}}$$
For each user, the Top-K features f are selected by heat, and the feature set of the i-th user is expressed as featureU_i = {f_{i,1}, f_{i,2}, ..., f_{i,K}}. The feature sets of the u users are combined to form a feature set of consistent dimension: Feature = featureU_1 ∪ featureU_2 ∪ ... ∪ featureU_u. The combined set contains m feature elements in total, i.e. Feature = {f_1, f_2, ..., f_m}.
S4, constructing a feature vector or a feature vector group according to the feature set;
In this embodiment, the frequency heat values of the m features are calculated for each user to form the user's feature vector:
UserVector = <Ph_{u,1}, Ph_{u,2}, ..., Ph_{u,m}>
S5, respectively generating a similar discrimination matrix or a machine learning discrimination model according to the feature vector or the feature vector group of the historical data;
The UserVectors of all users form a u × m similarity discrimination matrix:
SimMatrix = <UserVector_1; UserVector_2; ...; UserVector_u>
and S6, carrying out user identification on the feature vector generated by the data to be identified according to the similarity discrimination matrix or the machine learning discrimination model to obtain an identification result.
The flowcharts of the identification methods based on the similarity discrimination matrix and on the machine learning model are shown in FIG. 3 and FIG. 4, respectively. In this embodiment, cosine similarity is used as the similarity measure. Given the feature vector to be identified, IdentifyVector, its cosine similarity to each feature vector UserVector in SimMatrix is calculated as
$$\cos\theta = \frac{\mathrm{IdentifyVector} \cdot \mathrm{UserVector}}{\lVert \mathrm{IdentifyVector} \rVert \, \lVert \mathrm{UserVector} \rVert}$$
The Top-N users with the largest cos θ are taken as the identification result.
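As a small worked illustration of this cosine step, consider the following hypothetical vectors with m = 3; the numbers are invented for the example and do not come from the embodiment's data set.

```latex
\begin{align*}
\mathrm{IdentifyVector} &= (0.5,\ 0.5,\ 0), \quad
\mathrm{UserVector}_1 = (0.6,\ 0.4,\ 0), \quad
\mathrm{UserVector}_2 = (0,\ 0.5,\ 0.5) \\[4pt]
\cos\theta_1 &= \frac{0.5 \cdot 0.6 + 0.5 \cdot 0.4 + 0 \cdot 0}
                     {\sqrt{0.5}\,\sqrt{0.52}}
              = \frac{0.5}{\sqrt{0.26}} \approx 0.981 \\[4pt]
\cos\theta_2 &= \frac{0.5 \cdot 0 + 0.5 \cdot 0.5 + 0 \cdot 0.5}
                     {\sqrt{0.5}\,\sqrt{0.5}}
              = \frac{0.25}{0.5} = 0.5
\end{align*}
```

With N = 1, user 1 would be returned, since its heat distribution over the shared features is closest in direction to the vector to be identified.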
A user object identification system based on behavior time series big data comprises a data acquisition module, a feature construction module, a model generation module and a user identification module;
the data acquisition module acquires user data from a data source, cleans the data according to specific filtering rules, retains the meaningful data, and forms user behavior records arranged in ascending order of timestamp;
the feature construction module calculates the distribution features of user behavior from the collected data and constructs feature vectors of consistent scale, where each element of a feature vector is the heat of the corresponding feature behavior;
the model generation module either constructs a feature matrix from the users' historical feature behavior data or trains on that data to obtain a machine learning discrimination model;
the user identification module either measures the similarity between the data to be identified and the feature matrix of the users' historical behavior data, or feeds the data to be identified into the machine learning discrimination model, to obtain the identification result.
In the data acquisition module of this embodiment, both the historical data and the data to be identified are acquired from a static database, using a user music playback record data set;
the behavior analysis module uses the users' time series behavior sequences as features during feature extraction, and uses the TF-IDF heat of the features as the selection criterion;
the user analysis module divides the data into sections and calculates the feature vector group of each user.
The model generation module establishes a machine learning model;
and the user identification module identifies the user by using the established Decision Tree.
In this embodiment, the specific workflow of the system is as follows:
A user historical data set is obtained from the static database. Each piece of source data is recorded as Source = <Date, Time, user, curSong, Artist, sTag, rTag, nextSong>, where Date, Time and user represent the date and time of the record and the corresponding user; curSong and nextSong represent the music the user is currently listening to and the music listened to next; Artist and sTag represent the singer and the type of the music; and rTag represents the user's operations on the music, such as collecting, sharing and commenting. Of these, curSong, sTag, rTag and nextSong contain behavior information.
The data is formatted according to the rules: the behavior information in curSong, sTag, rTag and nextSong is retained and mapped to numerical types, denoted Operation_1, Operation_2, Operation_3 and Operation_4 respectively; a timestamp is added and the users are numbered to obtain regularized data, in which each record is data = <UserId, Operation_1, Operation_2, Operation_3, Operation_4, Timestamp>.
For each user, all <Operation_1, Operation_4> switching pairs between Operation_1 types and Operation_4 types are taken as features, denoted f, and the TF-IDF heat of each f for that user is calculated as
$$Ph_{f_i} = \text{TF-IDF}_{f_i} = TF_{f_i} \times IDF_{f_i}$$
where
$$TF_{f_i} = \frac{n_{f_i}}{\sum_{k} n_{f_k}}$$
represents the frequency of the feature within the user's data interval, and
$$IDF_{f_i} = \log \frac{D}{D_j}$$
is the logarithm of the ratio of the total number of users D to the number of users D_j whose data contain the feature.
For each user, the Top-K features f are selected by heat, and the feature set of the i-th user is expressed as featureU_i = {f_{i,1}, f_{i,2}, ..., f_{i,K}}. The feature sets of the u users are combined to form a feature set of consistent dimension: Feature = featureU_1 ∪ featureU_2 ∪ ... ∪ featureU_u. The combined set contains m feature elements in total, i.e. Feature = {f_1, f_2, ..., f_m}.
And dividing the data of each user into P sections according to the time sections.
The frequency heat values of the m features are calculated for each section of each user to form a feature vector; UserVector_p = <Ph_{u,1}, Ph_{u,2}, ..., Ph_{u,m}> denotes the feature vector of the p-th section of the user.
Each feature vector of all users is used as the attribute input and the user number UserId as the label for Decision Tree training, with suitable training parameters selected. A Decision Tree discrimination model is obtained after training.
The feature vector to be identified, IdentifyVector, is then used as the attribute input of the Decision Tree, and the user numbers UserId corresponding to the Top-N values of the computed output (hit probability) are taken as candidate identification results.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (7)

1. A user object identification method based on behavior time series big data is characterized by comprising the following steps:
acquiring historical data and data to be identified, and cleaning the historical data and the data to be identified according to a cleaning rule;
generating a record with a uniform structure according to the cleaned data;
selecting features according to the data feature types, wherein the data feature types are divided into behavior features and time series features of behaviors; the i-th feature is denoted f_i;
a behavior feature takes the specific type of a behavior itself as the feature; the i-th behavior feature is denoted g_i, where g_i = Type(Operation_a) means that the specific type of Operation_a is used as a feature;
a time series feature of behaviors takes the switching between different behavior types as the feature; the i-th time series feature is denoted h_i, where h_i = (Type(Operation_a) → Type(Operation_b)) represents the switching behavior between the specific types of two operations Operation_a and Operation_b;
constructing a feature set:
the heat of every feature f is calculated for each user, and the Top-K features with the highest heat are selected as that user's feature set, wherein the feature set of the i-th user is expressed as featureU_i = {f_{i,1}, f_{i,2}, ..., f_{i,K}};
the feature sets of the u users are combined to form a feature set of consistent scale:
Feature = featureU_1 ∪ featureU_2 ∪ ... ∪ featureU_u
the combined set contains m feature elements in total, i.e. Feature = {f_1, f_2, ..., f_m};
Constructing a user Feature vector or a Feature vector group according to the Feature set Feature, comprising:
using the m extracted overall features as components to form a user feature vector, wherein each component is the heat value of the corresponding feature for user u, namely
UserVector = <Ph_{u,1}, Ph_{u,2}, ..., Ph_{u,m}>
wherein Ph denotes a heat value;
when the data volume is enough, dividing the data of each user into a plurality of sections according to time sections, respectively generating a characteristic vector for each section, and forming a characteristic vector group by all the characteristic vectors;
generating a similarity discrimination matrix from the feature vectors of the historical data, or a machine learning discrimination model from the feature vector groups of the historical data;
and carrying out user identification on the feature vector generated by the data to be identified according to the similarity discrimination matrix or the machine learning discrimination model to obtain an identification result.
2. The method for identifying the user object based on the behavioral time series big data according to claim 1, wherein the data cleansing rule comprises user filtering, field filtering and time filtering.
3. The method for identifying the user object based on the behavior time series big data as claimed in claim 1, wherein the step of generating the record with a uniform structure according to the cleaned data comprises:
digitizing the behavior field of the data, adding a timestamp and a user label to each piece of data, and expressing the obtained unified record generated by each piece of data as follows:
Record = <UserId, Timestamp, Operation_1, Operation_2, ..., Operation_n>
wherein, Operation represents the Operation behavior recorded by the user, UserId represents the user tag, and Timestamp represents the Timestamp.
4. The method for identifying the user object based on the behavioral time series big data according to claim 1, wherein the selected characteristics are frequency heat and TF-IDF heat in the data interval according to the characteristics;
the frequency heat of a feature is the normalized number of occurrences of the feature within the specified time interval; the frequency heat of feature $f_i$ is calculated as
$$Ph_{f_i} = \frac{n_{f_i}}{\sum_{k} n_{f_k}}$$
wherein $n_{f_i}$ denotes the number of occurrences of $f_i$, and $\sum_{k} n_{f_k}$ denotes the number of occurrences of all features;
the TF-IDF heat of a feature takes the TF-IDF value of the specified feature within the data interval as its heat; the TF-IDF heat of feature $f_i$ is calculated as
$$Ph_{f_i} = tf_{f_i} \times idf_{f_i}$$
wherein $tf_{f_i}$ denotes the frequency of the feature within the user's data interval, and $idf_{f_i} = \log \frac{D}{D_j}$ denotes the logarithm of the ratio of the number $D$ of all users to the number $D_j$ of users whose data contain the feature, which measures the general importance of the feature within the data interval.
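The two heat definitions of claim 4 can be worked through on a toy data set. This is a sketch under stated assumptions: the natural logarithm is used for the IDF term (the claim does not name a base), the data layout (a dict from user to the operations in the interval) is hypothetical, and a feature that no user exhibits is given heat 0 to avoid division by zero.

```python
import math
from collections import Counter

def frequency_heat(feature, operations):
    """Occurrences of `feature` divided by occurrences of all features."""
    counts = Counter(operations)
    total = sum(counts.values())
    return counts[feature] / total if total else 0.0

def tfidf_heat(feature, user, data):
    """TF within the user's interval times log(D / D_j), where D is the number
    of users and D_j the number of users whose data contain the feature."""
    tf = frequency_heat(feature, data[user])
    containing = sum(1 for ops in data.values() if feature in ops)
    if containing == 0:
        return 0.0
    return tf * math.log(len(data) / containing)

# Hypothetical interval data for three users.
data = {
    "u1": ["play", "play", "search"],
    "u2": ["play", "pause"],
    "u3": ["play"],
}
play_freq = frequency_heat("play", data["u1"])
play_tfidf = tfidf_heat("play", "u1", data)
search_tfidf = tfidf_heat("search", "u1", data)
```

Because every user performs "play", its TF-IDF heat is zero even though its frequency heat is high, while "search", which only u1 performs, keeps a positive heat. This is the sense in which the IDF term measures the general importance of a feature.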
5. The method according to claim 1, wherein in the step of generating a similarity discrimination matrix or a machine learning discrimination model from the feature vectors or feature vector groups of the historical data,
if one feature vector is generated for each user, the feature vectors of all users form a u × m discrimination matrix:
$$SimMatrix = \langle UserVector_1; UserVector_2; \ldots; UserVector_u \rangle$$
if feature vector groups are generated, each feature component of the feature vectors is used as an attribute input and the corresponding user number is used as the label for machine learning training, to obtain a machine learning discrimination model II; the machine learning method may be selected from KNN, decision tree, random forest, Naïve Bayes, and GBDT.
6. The method according to claim 1, wherein in the step of performing user identification on the feature vector generated from the data to be identified according to the similarity discrimination matrix or the machine learning discrimination model to obtain the identification result,
if the similarity discrimination matrix is used, a similarity measure is computed between the feature vector to be identified and each feature vector in the similarity discrimination matrix, and the Top-N users with the most similar features are selected as candidate identification results; the similarity measure is the Euclidean distance, the cosine similarity or the Pearson correlation coefficient;
and if the machine learning discrimination model II is used, each feature component of the feature vector to be identified is used as an attribute input of model II, and the user numbers corresponding to the Top-N output values computed by model II are taken as candidate identification results.
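The matrix branch of claim 6 can be illustrated with the three named measures. This is a sketch: the function names and sample vectors are hypothetical, and Euclidean distance is negated so that, like the other two measures, a larger score means a closer match.

```python
import math

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])

def neg_euclidean(a, b):
    """Negated Euclidean distance, so that higher still means more similar."""
    return -math.dist(a, b)

def top_n(query, user_ids, sim_matrix, n, score=cosine):
    """User ids of the n rows of sim_matrix most similar to `query`."""
    ranked = sorted(zip(user_ids, sim_matrix),
                    key=lambda pair: score(query, pair[1]), reverse=True)
    return [user_id for user_id, _ in ranked[:n]]

user_ids = ["u1", "u2", "u3"]
sim_matrix = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
query = [0.6, 0.3, 0.1]
candidates = top_n(query, user_ids, sim_matrix, n=2)
```

The measures need not agree on the full ranking, so the choice of measure can change which users appear among the Top-N candidates.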
7. A system for implementing the behavior time series big data-based user object identification method of any one of claims 1-6, wherein the system comprises a data acquisition module, a feature construction module, a model generation module and a user identification module;
the data acquisition module acquires user data from a data source, cleans the data according to specific filtering rules, retains the meaningful data, and forms user behavior records arranged in ascending order of timestamp;
the feature construction module calculates the distribution characteristics of user behavior from the collected data and constructs feature vectors of consistent scale, wherein each element value of a feature vector is the heat of the corresponding feature behavior;
the model generation module constructs a feature matrix from the users' historical feature behavior data, or trains on that data to obtain a machine learning discrimination model;
the user identification module either performs similarity measurement between the data to be identified and the feature matrix of the users' historical behavior data to obtain an identification result, or feeds the data to be identified into the machine learning discrimination model to obtain the identification result.
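The division of work among the four modules can be sketched as four small functions chained together. Everything concrete here is an assumption for illustration: the `(user, timestamp, operation)` row shape, the two-feature set, frequency heat as the element value, and Euclidean distance over the feature matrix as the identification rule.

```python
import math
from collections import Counter, defaultdict

FEATURES = ["play", "search"]  # hypothetical feature set

def acquire(raw_rows):
    """Data acquisition: keep rows for known features, ascending by timestamp."""
    kept = [row for row in raw_rows if row[2] in FEATURES]
    return sorted(kept, key=lambda row: row[1])

def construct_vector(rows):
    """Feature construction: frequency heat of each feature, fixed length."""
    counts = Counter(operation for _, _, operation in rows)
    total = sum(counts.values())
    return [counts[f] / total if total else 0.0 for f in FEATURES]

def generate_matrix(history_rows):
    """Model generation: one historical feature vector per user."""
    by_user = defaultdict(list)
    for row in acquire(history_rows):
        by_user[row[0]].append(row)
    return {user: construct_vector(rows) for user, rows in by_user.items()}

def identify(matrix, unknown_rows, n=1):
    """User identification: the n users whose vectors are nearest the query."""
    query = construct_vector(acquire(unknown_rows))
    return sorted(matrix, key=lambda user: math.dist(matrix[user], query))[:n]

history = [("u1", 3, "play"), ("u1", 1, "play"), ("u1", 2, "noise"),
           ("u2", 1, "search"), ("u2", 2, "search"), ("u2", 3, "play")]
matrix = generate_matrix(history)
result = identify(matrix, [("?", 1, "search"), ("?", 2, "play"), ("?", 3, "search")])
```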
CN201910284112.7A 2019-04-10 2019-04-10 User object identification method and system based on behavior time series big data Active CN110096499B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910284112.7A CN110096499B (en) 2019-04-10 2019-04-10 User object identification method and system based on behavior time series big data

Publications (2)

Publication Number Publication Date
CN110096499A CN110096499A (en) 2019-08-06
CN110096499B CN110096499B (en) 2021-08-10

Family

ID=67444601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910284112.7A Active CN110096499B (en) 2019-04-10 2019-04-10 User object identification method and system based on behavior time series big data

Country Status (1)

Country Link
CN (1) CN110096499B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110795570B (en) * 2019-10-11 2022-06-17 上海上湖信息技术有限公司 Method and device for extracting user time sequence behavior characteristics
CN111461180A (en) * 2020-03-12 2020-07-28 平安科技(深圳)有限公司 Sample classification method and device, computer equipment and storage medium
WO2021243534A1 (en) * 2020-06-02 2021-12-09 深圳市欢太科技有限公司 Behavior control method and apparatus and storage medium
CN112381112B (en) * 2020-10-16 2023-11-07 华南理工大学 User identity recognition method and system based on multi-mode item set of user data
CN113743103A (en) * 2021-08-20 2021-12-03 南京星云数字技术有限公司 Comment user identity identification method and device, computer equipment and storage medium
CN116578910B (en) * 2023-07-13 2023-09-15 成都航空职业技术学院 Training action recognition method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103646197A (en) * 2013-12-12 2014-03-19 中国石油大学(华东) User credibility authentication system and method based on user behaviors
CN104102819A (en) * 2014-06-27 2014-10-15 北京奇艺世纪科技有限公司 Determining method and device for user natural attributes
CN105577431A (en) * 2015-12-11 2016-05-11 青岛云成互动网络有限公司 User information identification and classification method based on internet application and system thereof
CN108197190A (en) * 2017-12-26 2018-06-22 北京秒针信息咨询有限公司 A kind of method and apparatus of user's identification
CN109583472A (en) * 2018-10-30 2019-04-05 中国科学院计算技术研究所 A kind of web log user identification method and system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744957A (en) * 2014-01-06 2014-04-23 同济大学 Sequence mode mining method based on Web user time attributes
US11164089B2 (en) * 2015-10-12 2021-11-02 International Business Machines Corporation Transaction data analysis
CN105306495B (en) * 2015-11-30 2018-06-19 百度在线网络技术(北京)有限公司 user identification method and device
US9983859B2 (en) * 2016-04-29 2018-05-29 Intuit Inc. Method and system for developing and deploying data science transformations from a development computing environment into a production computing environment
CN106911668B (en) * 2017-01-10 2020-07-14 同济大学 Identity authentication method and system based on user behavior model
CN107515915B (en) * 2017-08-18 2020-02-18 晶赞广告(上海)有限公司 User identification association method based on user behavior data
CN108280482B (en) * 2018-01-30 2020-10-16 广州小鹏汽车科技有限公司 Driver identification method, device and system based on user behaviors
CN108388969A (en) * 2018-03-21 2018-08-10 北京理工大学 Inside threat personage's Risk Forecast Method based on personal behavior temporal aspect

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
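Research on User Identity Authentication Technology for Mobile Intelligent Terminals Based on Behavior Sequences; 徐启寒 (Xu Qihan); China Master's Theses Full-text Database, Information Science and Technology; 20190115 (No. 12); I138-103 *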

Also Published As

Publication number Publication date
CN110096499A (en) 2019-08-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant