CN112905783A - Group user portrait acquisition method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112905783A
Authority
CN
China
Prior art keywords
user
log data
web log
web
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110192229.XA
Other languages
Chinese (zh)
Inventor
李涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Original Assignee
Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuo Erzhi Lian Wuhan Research Institute Co Ltd filed Critical Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Priority to CN202110192229.XA priority Critical patent/CN112905783A/en
Publication of CN112905783A publication Critical patent/CN112905783A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a group user portrait acquisition method and device, electronic equipment and a storage medium. The group user portrait acquisition method comprises the following steps: extracting corresponding user characteristics based on WEB log data of a user; and clustering the at least one user based on the extracted user characteristics corresponding to each user in the at least one user to obtain a group user portrait corresponding to each category.

Description

Group user portrait acquisition method and device, electronic equipment and storage medium
Technical Field
The application belongs to the technical field of data processing, and particularly relates to a group user portrait acquisition method and device, electronic equipment and a storage medium.
Background
A user portrait labels user information, refining and summarizing user characteristics. Because a user portrait is semantic and takes the form of short text, it is easy for people to understand quickly and for computers to process. With the progress of society and the development of technology, user portraits are widely applied in various recommendation systems. A user portrait can be used not only to analyze the characteristics of a single user but also to analyze the characteristics shared among users, i.e. group user portrait analysis. In the related art, group user portraits are usually obtained by clustering the user portraits of sample users. Because the number of sample users is limited, a group user portrait obtained from the user portraits of a limited number of sample users may not accurately represent the real features of the user group.
Disclosure of Invention
In view of the above, embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a storage medium for obtaining a group user portrait, so as to solve the technical problem in the related art that the accuracy of a group user portrait obtained based on an existing user portrait is low.
In order to achieve the above purpose, the technical solution of the embodiment of the present application is implemented as follows:
the embodiment of the application provides a group user portrait acquisition method, which comprises the following steps:
extracting corresponding user characteristics based on World Wide Web (WEB) log data of a user;
and clustering the at least one user based on the extracted user characteristics corresponding to each user in the at least one user to obtain a group user portrait corresponding to each category.
In the above scheme, the extracting corresponding user characteristics based on the WEB log data of the user includes:
determining webpage texts of webpages accessed by users based on WEB log data of the users;
and extracting corresponding user characteristics based on the webpage text of the webpage accessed by the user.
In the foregoing solution, when extracting corresponding user features based on the WEB log data of the user, the method includes:
and determining the user corresponding to the extracted user characteristics based on the user identification information in the WEB log data.
In the foregoing solution, the determining a user corresponding to the WEB log data based on the user identification information in the WEB log data includes:
determining at least one user corresponding to the WEB log data based on Internet Protocol (IP) address information in the WEB log data;
and/or,
and determining at least one user corresponding to the WEB log data based on the operating system information in the WEB log data.
In the foregoing solution, before the extracting the corresponding user feature based on the WEB log data of the user, the method further includes:
performing data cleaning on the WEB log data by at least one of the following modes:
converting the WEB log data in which the suffix does not meet the set condition in the WEB log data into WEB log data in which the suffix meets the set condition;
deleting the WEB log data of which the state codes do not accord with set conditions in the WEB log data;
deleting the WEB log data with missing content in the WEB log data;
and deleting the WEB log data which is repeated with the contents of other WEB log data in the WEB log data.
In the above scheme, the extracting corresponding user characteristics based on the WEB log data of the user includes:
determining a term frequency-inverse document frequency (TF-IDF) value of each word in each webpage text based on the square of the inverse document frequency (IDF) of each word in each corresponding webpage text in the WEB log data of a user;
and extracting corresponding user characteristics based on the TF-IDF value of each word in each webpage text.
In the above solution, the determining the TF-IDF value of each word in each webpage text includes:
setting the first weight to be greater than the second weight; wherein,
the first weight represents the weight of a word positioned at the position of the title and/or the head section of each webpage text; the second weight characterizes the weight of the other words than the word corresponding to the first weight.
The embodiment of the present application further provides a group user portrait acquisition device, the device includes:
the extraction unit is used for extracting corresponding user characteristics based on WEB log data of a user;
and the clustering unit is used for clustering at least one user based on the extracted user characteristics corresponding to each user in the at least one user to obtain a group user portrait corresponding to each category.
An embodiment of the present application further provides an electronic device, including: a processor and a memory for storing a computer program capable of running on the processor, wherein,
the processor is adapted to perform the steps of any of the above methods when running the computer program.
Embodiments of the present application further provide a storage medium on which a computer program is stored, where the computer program is executed by a processor to implement the steps of any one of the above methods.
In the embodiment of the application, user characteristics of the corresponding users are extracted based on the WEB log data of the users, and at least one user is clustered based on the extracted user characteristics corresponding to each user to obtain a group user portrait corresponding to each category. In other words, WEB log data of users on the Internet is downloaded, the user characteristics of the corresponding users are extracted from the WEB log data, and a group user portrait is obtained based on those user characteristics.
Drawings
FIG. 1 is a schematic flow chart illustrating an implementation of a group user portrait acquisition method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of personality and subject categories provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of clustering by a K-Means algorithm according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating functional module division of a group user portrait acquisition method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a group user representation acquisition apparatus according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a hardware component structure of an electronic device according to an embodiment of the present application.
Detailed Description
In the related art, group user portraits are usually obtained by clustering the user portraits of sample users. Because the number of sample users is limited, a group user portrait obtained from the user portraits of a limited number of sample users may not accurately represent the real features of the user group.
Based on the above, the embodiment of the application provides a group user portrait acquisition method, a group user portrait acquisition device, an electronic device and a storage medium. User characteristics of the corresponding users are extracted based on the WEB log data of the users, and at least one user is clustered based on the extracted user characteristics corresponding to each user to obtain a group user portrait corresponding to each category. In other words, WEB log data of users on the Internet is downloaded, the user characteristics of the corresponding users are extracted from the WEB log data, and a group user portrait is obtained based on those user characteristics.
The present application will be described in further detail with reference to the following drawings and examples.
Fig. 1 is a schematic flow chart illustrating an implementation of a group user portrait acquisition method according to an embodiment of the present application. As shown in fig. 1, the method includes:
Step 101: extracting corresponding user characteristics based on the WEB log data of the user.
Here, a logging function is configured on the server corresponding to the browser. After the user accesses the server through the browser, the server opens a log file and records the WEB log data of the user. An original database is established based on the acquired WEB log data of a plurality of users, and the corresponding user characteristics are extracted based on the original database. In the embodiment of the application, the user characteristics represent the character attribute characteristics and the interest attribute characteristics of the user.
Step 102: clustering the at least one user based on the extracted user characteristics corresponding to each user in the at least one user to obtain a group user portrait corresponding to each category.
After the user characteristics corresponding to each user in at least one user are extracted, clustering is carried out on the users with the same or similar user characteristics to obtain a group user portrait corresponding to each category, so that the portrait of the group with the same or similar characteristics can be drawn.
In an embodiment, the extracting corresponding user features based on the WEB log data of the user includes:
determining webpage texts of webpages accessed by users based on WEB log data of the users;
and extracting corresponding user characteristics based on the webpage text of the webpage accessed by the user.
Here, a piece of WEB log data mainly consists of: access host (remote), identifier (ident), authorized user (authuser), date and time (date), request (request), status code (status), number of bytes transferred (bytes), source page (referrer), and user agent (agent). The request is an important item in the WEB log data and mainly comprises the request type, the requested resource, and the protocol version number, where the requested resource gives the Uniform Resource Locator (URL) of the corresponding resource. A WEB crawler is used to obtain the webpage text corresponding to each URL in the WEB log data. After preprocessing operations such as word segmentation and stop-word removal are performed on the webpage texts, the corresponding user characteristics are extracted based on the preprocessed webpage texts.
Generally, a user chooses to access webpages that interest them. User characteristics extracted from the text of the webpages a user has accessed therefore closely reflect the personal preferences of that user, which improves the accuracy and comprehensiveness of user characteristic acquisition.
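The nine fields listed above match the layout of the widely used combined log format, so one way to read them is with a regular expression. The following is a minimal sketch under that assumption; the pattern, the function name `parse_log_line`, and the sample line (including its timestamp, URL and user agent) are illustrative and are not taken from the application.

```python
import re

# Assumed layout (combined log format):
# remote ident authuser [date] "request" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<remote>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of the nine fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # The request field holds "<request type> <requested resource> <protocol version>".
    parts = record["request"].split()
    if len(parts) == 3:
        record["method"], record["resource"], record["protocol"] = parts
    else:
        record["method"] = record["resource"] = record["protocol"] = None
    return record


# Hypothetical sample line used only for illustration.
sample = ('111.192.165.229 - - [10/Feb/2021:13:55:36 +0800] '
          '"GET /news/tech.html HTTP/1.1" 200 2326 '
          '"http://example.com/index.html" "Mozilla/5.0 (Windows NT 10.0)"')
record = parse_log_line(sample)
```

The `resource` value is what a crawler would resolve to a full URL in order to fetch the webpage text; fetching and word segmentation are outside this sketch.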
In an embodiment, when extracting corresponding user features based on the user WEB log data, the method includes:
and determining the user corresponding to the extracted user characteristics based on the user identification information in the WEB log data.
User identification is important when cluster analysis is performed on user characteristics: because the group user portrait depends on the user characteristics of each individual user, it can be obtained accurately only if the user characteristics of each user are obtained.
Here, the WEB log data further includes user identification information, each user is distinguished from other users by a unique user identification, and a user corresponding to the extracted user feature is determined based on the user identification information included in the WEB log data.
Here, the user corresponding to the extracted user feature is determined through the user identification information in the WEB log data, and the user feature is closely associated with the corresponding user, so that the user feature corresponding to each user can be accurately extracted, and the accuracy of the obtained user feature is improved.
In an embodiment, the determining, based on the user identification information in the WEB log data, a user corresponding to the WEB log data includes:
determining at least one user corresponding to the WEB log data based on the IP address information in the WEB log data;
and/or,
and determining at least one user corresponding to the WEB log data based on the operating system information in the WEB log data.
Here, the user identification information may be IP address information, each host in each network in the internet has a unique IP address, and if the IP addresses are different, the users corresponding to the IP addresses are also different. Therefore, based on the IP address information in the WEB log data, at least one user corresponding to the WEB log data can be determined.
The user identification information can also be operating system information, the operating system information of the equipment used by different users when accessing the webpage is different, the operating system information comprises user names of the operating systems, the user names are different, and the information of the operating systems is also different, so that the users can be distinguished through the operating system information. And if the IP address information in the two pieces of WEB log data is the same and the user names of the operating systems are different, the two pieces of WEB log data are considered to correspond to different users respectively.
By identifying the corresponding at least one user in the WEB log data based on the IP address information and/or the operating system information, the corresponding at least one user in the WEB log data can be accurately identified.
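The identification rule described above can be sketched as keying each log record on the pair (IP address, operating system information). In this minimal sketch the field name `os_info` and the function `assign_user_ids` are hypothetical; how the operating system information is obtained from a log record is not specified here and is assumed to have been done beforehand.

```python
def assign_user_ids(records):
    """Label each record with a user id derived from (IP address, OS information).

    Records with the same IP address but different OS information are
    treated as different users, as described in the text above.
    """
    user_ids = {}
    labelled = []
    for rec in records:
        key = (rec["remote"], rec.get("os_info"))  # "os_info" is an assumed field name
        if key not in user_ids:
            user_ids[key] = len(user_ids) + 1
        labelled.append({**rec, "user_id": user_ids[key]})
    return labelled


# Hypothetical records used only for illustration.
records = [
    {"remote": "192.0.2.10", "os_info": "alice"},
    {"remote": "192.0.2.10", "os_info": "bob"},
    {"remote": "192.0.2.10", "os_info": "alice"},
    {"remote": "192.0.2.20", "os_info": "alice"},
]
labelled = assign_user_ids(records)
```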
In an embodiment, before extracting the corresponding user feature based on the user WEB log data, the method further includes:
performing data cleaning on the WEB log data by at least one of the following modes:
converting the WEB log data in which the suffix does not meet the set condition in the WEB log data into WEB log data in which the suffix meets the set condition;
deleting the WEB log data of which the state codes do not accord with set conditions in the WEB log data;
deleting the WEB log data with missing content in the WEB log data;
and deleting the WEB log data which is repeated with the contents of other WEB log data in the WEB log data.
Here, the acquired user WEB log data may include a large amount of unsatisfactory WEB log data, and in order to effectively extract user features from the WEB log data, before extracting the user features based on the WEB log data, the WEB log data is subjected to data cleansing to obtain the satisfactory WEB log data. The specific data cleaning mode comprises the following steps:
and converting the WEB log data in which the suffix does not meet the set condition in the WEB log data into WEB log data in which the suffix meets the set condition. The WEB log data comprises log files in various suffix formats, such as log file (log) suffix, common log format (clf) suffix and text format (txt) suffix, and if the WEB log data is stored in a database, the WEB log data in the format other than the txt suffix needs to be converted into WEB log data in the txt suffix format because the database supports the txt suffix format.
And deleting the WEB log data whose status codes do not meet the set condition. Here, the status code is a 3-digit code indicating the Hypertext Transfer Protocol (HTTP) response status of the web server, for example 200 OK, indicating that the request has succeeded; 404 Not Found, indicating that the request has failed; and 505 HTTP Version Not Supported, indicating that the server does not support, or refuses to support, the HTTP version used in the request. In this embodiment of the application, the set condition may be that the request was successfully received and responded to by the server, and WEB log data whose status code does not indicate this is deleted.
And deleting the WEB log data with missing content. WEB log data with missing content may not contain complete and valid user information, so valid user features may not be extractable from it. Therefore a matching rule is set, and WEB log data that lacks the content the rule is matched against is deleted. A matching rule can also be set to find, as required, the WEB log data that meets given conditions.
And deleting the WEB log data which is repeated with the contents of other WEB log data in the WEB log data. The acquired WEB log data may have a plurality of pieces of WEB log data with repeated contents, and in order to ensure the validity of the WEB log data, the WEB log data with repeated contents with other WEB log data contents is deleted.
By cleaning the acquired WEB log data, some unnecessary interference factors in the WEB log data are eliminated, the effectiveness of the WEB log data is improved, and the accuracy of the extracted user characteristics is also improved.
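Three of the four cleaning modes above (status code filtering, removal of records with missing content, and removal of duplicates) can be sketched as a single pass over parsed records. This is a sketch under stated assumptions: the set condition is taken to be a 2xx status code, the "matching rule" is reduced to a list of required fields, and the function name `clean_records` is hypothetical. Suffix conversion of log files is not covered.

```python
def clean_records(records, required=("remote", "request", "status")):
    """Drop records with missing content, non-2xx status codes, or duplicate content."""
    seen = set()
    cleaned = []
    for rec in records:
        # Missing content: any required field absent or empty.
        if any(not rec.get(field) for field in required):
            continue
        # Status code: keep only codes assumed to mean success (2xx).
        if not str(rec["status"]).startswith("2"):
            continue
        # Duplicates: identical field/value pairs as an earlier record.
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


# Hypothetical records used only for illustration.
raw = [
    {"remote": "192.0.2.10", "request": "GET /a HTTP/1.1", "status": "200"},
    {"remote": "192.0.2.10", "request": "GET /a HTTP/1.1", "status": "200"},
    {"remote": "192.0.2.10", "request": "GET /b HTTP/1.1", "status": "404"},
    {"remote": "192.0.2.20", "request": "", "status": "200"},
]
cleaned = clean_records(raw)
```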
In an embodiment, the extracting corresponding user features based on the WEB log data of the user includes:
determining a TF-IDF value of each word in each webpage text based on the square of the IDF of each word in each corresponding webpage text in WEB log data of a user;
and extracting corresponding user characteristics based on the TF-IDF value of each word in each webpage text.
Here, the TF-IDF algorithm is a weighting method used in information retrieval and text mining to evaluate the importance of a word to one of a set of web page texts. The importance of a word increases in direct proportion to the number of times it appears in a web page text, but at the same time decreases in inverse proportion to the frequency with which it appears in the collection of web page text. The higher the TF-IDF value of a word, the higher the class discrimination capability of the word. The IDF is a measure of the general importance of a word, and in a web page text set, if the web page text containing a word is fewer, the IDF value of the word is larger, which indicates that the word has a good category distinguishing capability. In the embodiment of the present application, the TF-IDF value of each word in the text of the web page is calculated using the square of IDF.
The TF-IDF value of the word is calculated by using the square of the IDF of the word, so that the excessive dependence of the TF-IDF algorithm on the word frequency can be reduced, and a more accurate TF-IDF value is obtained, so that the extracted user characteristics are more accurate.
In one example, the determining the TF-IDF value of each word in the text of each web page includes:
setting the first weight to be greater than the second weight; wherein,
the first weight represents the weight of a word positioned at the position of the title and/or the head section of each webpage text; the second weight characterizes the weight of the other words than the word corresponding to the first weight.
Here, the words in the title and the first segment of a WEB page text may represent the central topic of the WEB page text to a large extent, and in order to improve the accuracy of mining the TF-IDF values of the words in the corresponding WEB page text in the WEB log data, the weight of the words located at the title and/or the first segment of each WEB page text is increased.
By increasing the weight of the word positioned at the title and/or the head of the webpage text, the accuracy of the TF-IDF value of the word in the acquired webpage text is improved.
In practical application, extracting corresponding user features based on the TF-IDF value of each word in each webpage text comprises the following steps:
After the corresponding webpage text in the WEB log data is obtained, preprocessing such as word segmentation and stop-word removal is performed on the webpage text, and a database is established based on the preprocessed webpage text. A webpage text set $D = \{d_1, d_2, \ldots, d_n\}$ is taken from the database. When the keywords of webpage texts of different lengths are extracted, a dynamic weight $\alpha$ is used to adapt to the influence of text length on the keywords. The calculation formula is as follows:
$$tf_j'(w_i) = tf_j(w_i) + (3+\alpha)\cdot tf_{jh}(w_i) + (1.5+\alpha)\cdot tf_{jf}(w_i) \qquad \text{(Equation 1)}$$
where $tf_j'(w_i)$ denotes the weighted frequency of the current word $w_i$ in webpage text $d_j$, i.e. the term frequency of the current word $w_i$; $tf_{jh}(w_i)$ denotes the frequency with which the current word $w_i$ occurs in the title of webpage text $d_j$; and $tf_{jf}(w_i)$ denotes the frequency with which the current word $w_i$ occurs in the first paragraph of webpage text $d_j$.
The TF-IDF value of the current word is calculated using the square of the IDF value of the current word $w_i$, which is calculated as follows:
$$idf'(w_i) = \log\frac{N}{df(w_i)} \cdot \log\frac{N}{df(w_i)} \qquad \text{(Equation 2)}$$
where $N$ denotes the total number of webpage texts in the webpage text set, and $df(w_i)$ denotes the number of webpage texts in the set in which the current word $w_i$ appears.
The TF-IDF value of each word in each webpage text is calculated as follows:
$$TF\text{-}IDF'(w_i) = tf_j'(w_i)\cdot idf'(w_i) \qquad \text{(Equation 3)}$$
This yields the TF-IDF value of each word in each webpage text, i.e. the feature value of each word. The feature values of all words in each webpage text are sorted from high to low, and the word with the highest feature value is taken as the keyword of the corresponding webpage text.
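Equations 1 to 3 can be illustrated with a short sketch. Several choices here are assumptions rather than part of the application: frequencies are raw counts, the logarithm is the natural logarithm, the dynamic weight α is passed in as a plain parameter (its dependence on text length is not given), and the document layout (`tokens`, `title`, `first`) and the function name `weighted_tfidf` are hypothetical.

```python
import math
from collections import Counter


def weighted_tfidf(docs, alpha=0.0):
    """Compute tf'(w) * idf(w)^2 for every word of every document.

    Each document is a dict with token lists under "tokens" (whole text),
    "title" (title only) and "first" (first paragraph only).
    """
    n_docs = len(docs)
    # df(w): number of documents in which w appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc["tokens"]))

    all_scores = []
    for doc in docs:
        tf = Counter(doc["tokens"])
        tf_title = Counter(doc["title"])
        tf_first = Counter(doc["first"])
        scores = {}
        for word, count in tf.items():
            # Equation 1: extra weight for title and first-paragraph occurrences.
            tf_weighted = (count
                           + (3 + alpha) * tf_title[word]
                           + (1.5 + alpha) * tf_first[word])
            # Equation 2: squared IDF.
            idf = math.log(n_docs / df[word])
            # Equation 3.
            scores[word] = tf_weighted * idf * idf
        all_scores.append(scores)
    return all_scores


# Hypothetical, already segmented documents used only for illustration.
docs = [
    {"title": ["solar"], "first": ["solar", "energy"],
     "tokens": ["solar", "solar", "energy", "news"]},
    {"title": ["football"], "first": ["football", "match"],
     "tokens": ["football", "match", "news"]},
]
scores = weighted_tfidf(docs)
keywords = [max(s, key=s.get) for s in scores]
```

A word that appears in every document (here "news") gets an IDF of zero and therefore a feature value of zero, so it can never be chosen as a keyword.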
And after obtaining the keywords corresponding to each webpage text, storing the keywords into a database. Meanwhile, the characteristic values of other words except the keywords in each webpage text are stored in a database.
After obtaining the keywords corresponding to each webpage text, the method further includes:
and performing topic classification on the corresponding webpage text in the WEB log data through a K-Nearest Neighbor (KNN) algorithm.
Fig. 2 is a schematic diagram of character and topic categories provided in an embodiment of the present application. As shown in Fig. 2, 20 topics that cover the categories of webpage text to a certain extent are set, and each topic category corresponds to a specific character category. For example, the four topics of helping others, environment, public welfare and food all correspond to the agreeable character category.
In the training set, each topic is provided with a plurality of texts which can prominently reflect the current topic and serve as a standard topic text set, and the feature value of each word in each text is extracted from the plurality of texts through the TF-IDF algorithm to obtain a standard topic vector set.
In order to remove noise words, the feature space of each webpage text is subjected to dimension reduction. The feature space is the entirety of all feature values, and the number of feature values is the dimension of the feature space; if a webpage text contains 100 feature values, the feature space corresponding to that webpage text has 100 dimensions. In the embodiment of the application, the 20 highest-ranked feature values of each webpage text are selected as the feature vector set of that webpage text. Similarity is then calculated between the feature vector set of each webpage text and the standard topic vector set through the KNN algorithm, so that each webpage text is classified by topic. The specific calculation process is as follows:
1) Preprocess each text in the standard topic text set $D_s = \{d_{s1}, d_{s2}, \ldots, d_{s20}\}$ corresponding to the 20 topics in the training set, and then calculate the feature value of each word in each text by the TF-IDF algorithm described above. Extract the 20 highest feature values in each topic to form the feature vector $v_s\{w_{s1}, w_{s2}, \ldots, w_{s20}\}$, which serves as the standard topic vector set of that topic.
2) Extract a webpage text set $K = \{\, v_i\{w_{i1}, w_{i2}, \ldots, w_{in}\} \mid 1 \le i \le n \,\}$ from the database, where $K$ is a set of vectors, each consisting of the $n$ top-ranked feature values of one of the $n$ webpage texts. Calculate the similarity between the webpage text vector $v$ to be classified and the standard classification text vector $v_s$ according to the following formula:
(Equation 4; the formula is provided only as an image, Figure BDA0002944924830000111, in the original publication.)
where $v_j$ denotes the feature vector set formed by the $n$ highest-ranked feature values of the webpage text $d_j$ to be classified.
3) Calculate $p(v, c_j)$ for the webpage text vector $v$ to be classified and the topic category $c_j$. The calculation formula is as follows:
(Equation 5; the formula is provided only as an image, Figure BDA0002944924830000112, in the original publication.)
where $y(v_i, c_j)$ denotes the class attribute function: when $v_i \in c_j$, i.e. the webpage text belongs to the current topic category, $y(v_i, c_j) = 1$; otherwise, $y(v_i, c_j) = 0$.
4) Determine the topic category of the webpage text to be classified through a category decision function. The calculation formula is as follows:
$$f = \arg\max_{c_j}\, p(v, c_j) \qquad \text{(Equation 6)}$$
Through the steps, the topic category to which each webpage text belongs can be known.
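The classification steps above can be sketched as follows. Because the similarity formula and the formula for p(v, c_j) are not reproduced in the text, this sketch substitutes common choices and labels them as assumptions: cosine similarity between sparse feature vectors, and a category score equal to the sum of the similarities of the k nearest labelled vectors that belong to that category. The function names, the value of k and the sample vectors are hypothetical.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors given as {word: feature value}.

    Assumed similarity measure; the source formula is not available.
    """
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_topic(vector, labelled_vectors, k=3):
    """Return the topic with the highest summed similarity among the k nearest neighbours.

    labelled_vectors is a list of (topic, vector) pairs.
    """
    neighbours = sorted(
        ((cosine_similarity(vector, ref), topic) for topic, ref in labelled_vectors),
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for similarity, topic in neighbours:
        # y(v_i, c_j) = 1 only for the neighbour's own topic.
        scores[topic] += similarity
    # Category decision: argmax over topics.
    return max(scores, key=scores.get)


# Hypothetical standard topic vectors used only for illustration.
training = [
    ("technology", {"chip": 0.9, "software": 0.8}),
    ("technology", {"software": 0.7, "network": 0.6}),
    ("food", {"recipe": 0.9, "restaurant": 0.5}),
    ("food", {"recipe": 0.4, "dessert": 0.8}),
]
predicted = classify_topic({"chip": 1.0, "software": 0.5}, training)
```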
User characteristics corresponding to each user are determined based on the topic category of each webpage text accessed by that user in the WEB log data, and a user portrait model is constructed. The user characteristics include an interest attribute characteristic and a character attribute characteristic. The interest attribute characteristic is the topic category corresponding to the webpage texts accessed by the user. Illustratively, when the WEB log data of a user is acquired, the user portrait model can determine, based on the WEB log data, that the IP address of the corresponding user is 111.192.165.229, that the interest attribute characteristic is science and technology, and that the character attribute characteristic is the open type.
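A minimal sketch of deriving the two user characteristics from classified pages is given below. It assumes that the interest attribute is the most frequent topic among the pages a user visited, which the text does not state, and it uses a hypothetical partial topic-to-character mapping: only the "science and technology" entry follows the example above, and the second entry and its label are illustrative.

```python
from collections import Counter

# Hypothetical partial mapping; the full mapping of 20 topics is given in Fig. 2.
TOPIC_TO_CHARACTER = {
    "science and technology": "open type",
    "food": "agreeable type",  # assumed label, for illustration only
}


def build_user_portrait(ip_address, page_topics):
    """Build a single-user portrait from the topics of the pages the user visited."""
    topic_counts = Counter(page_topics)
    # Assumption: the most frequent topic is taken as the interest attribute.
    interest, _ = topic_counts.most_common(1)[0]
    return {
        "ip": ip_address,
        "interest": interest,
        "character": TOPIC_TO_CHARACTER.get(interest, "unknown"),
    }


portrait = build_user_portrait(
    "111.192.165.229",
    ["science and technology", "food", "science and technology"],
)
```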
In an embodiment, after the user features of all the users are extracted, the method further includes:
and clustering the at least one user through a K-Means (K-Means) algorithm to obtain a group user portrait corresponding to each category.
K-means is an iterative clustering analysis algorithm. K objects are randomly selected as the initial cluster centers; the distance between each object and each cluster center is then calculated, and each object is assigned to its nearest cluster center. A cluster center together with the objects assigned to it represents a cluster. Each time samples are assigned, the cluster center of each cluster is recalculated based on the objects currently in the cluster. This process is repeated until a termination condition is met. The termination condition may be that no object is reassigned to a different cluster, that no cluster center changes any more, or that the sum of squared errors reaches a local minimum. Although the K-means method depends on the selection of the initial cluster centers, a good clustering effect can be achieved by selecting center data points with distinct user characteristics. Therefore, the method adopts the K-means clustering algorithm to calculate the similarity of the users.
Fig. 3 is a schematic diagram of clustering performed by a K-Means algorithm according to an embodiment of the present application. As shown in fig. 3:
A user feature vector set is constructed based on the user interest feature vectors and the user character feature vectors.
K initial cluster center points are set. In the embodiment of the present application, the personality characteristics of users are mainly classified into 5 personality types; therefore, the number of clusters is set to 5, that is, K is set to 5.
After the initial cluster center points are set, the users are clustered by the K-means algorithm. Specifically, the user clustering process is as follows:
5 feature vectors are randomly selected from the user feature vector set as cluster centers; for each feature vector in the user feature vector set, the distance between the feature vector and each cluster center is calculated (for example, the Euclidean distance), and the feature vector is assigned to the set of the cluster center closest to it; after all the feature vectors in the user feature vector set have been assigned, 5 sets are obtained;
the cluster center of each set is recalculated.
If the distance between the recalculated cluster center of a set and its original cluster center is large, the set has not reached a convergence state, and the K-means iteration needs to be carried out again.
If the distance between the recalculated cluster center of a set and its original cluster center is smaller than a set threshold, the position of the cluster center has changed little and the set tends to be stable, that is, a convergence state is reached. At this point the clustering can be considered to have reached the expected result, the algorithm stops, and the attribute characteristics of the cluster centers are analyzed.
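The clustering loop described above can be sketched in plain Python. This is a minimal sketch under stated assumptions rather than the implementation of the application: the function name `kmeans`, the default `threshold`, the `max_iter` safety cap, the fixed `seed`, and the handling of empty sets (the old center is kept) are all choices made for the example. Centers are recalculated once per pass over all vectors, as in the step-by-step procedure.

```python
import math
import random


def kmeans(vectors, k=5, threshold=1e-4, max_iter=100, seed=0):
    """Cluster feature vectors; stop once every center moves less than `threshold`."""
    rng = random.Random(seed)
    # Randomly pick k feature vectors as the initial cluster centers.
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(max_iter):
        # Assign each vector to the set of its nearest center (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: math.dist(v, centers[i]))
            clusters[nearest].append(v)
        # Recalculate each center as the mean of its set (assumption: an empty
        # set keeps its previous center).
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        # Largest distance between a recalculated center and its original center.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < threshold:
            break  # convergence state reached
    return centers, clusters
```

With K set to 5 as in the embodiment, `kmeans(user_vectors)` would return five centers and five sets of user feature vectors; the returned centers are what the text refers to when it speaks of analyzing the attribute characteristics of the cluster centers.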
After the user clustering is completed, the user identifications in each cluster and the number of users it contains are recorded in the database. By analyzing the attribute characteristics of the cluster centers, the characteristics of the user group having a specified attribute can be obtained. Here, a cluster is a set of data objects, and the objects within a cluster are similar to one another. In the embodiment of the application, given the interest attribute characteristics and the personality attribute characteristics of the users, the characteristics of user groups with the same personality attribute characteristics are obtained through clustering.
In the embodiment of the application, user characteristics of the corresponding users are extracted based on the WEB log data of the users, and the at least one user is clustered based on the extracted user characteristics corresponding to each user to obtain a group user portrait corresponding to each category. That is, WEB log data of users is downloaded from the Internet, the user characteristics of the corresponding users are extracted from the WEB log data, and the group user portrait is obtained based on the user characteristics.
FIG. 4 is a schematic diagram illustrating functional module division of a group user portrait acquisition method according to an embodiment of the present application. As shown in fig. 4:
The group user portrait acquisition method can be realized through the following five functional modules: a data acquisition module, a data preprocessing module, a data mining module, a user portrait module and a group user portrait module.
The data acquisition module acquires the WEB logs.
The data preprocessing module performs suffix processing, request-method filtering, status-code filtering and redundancy processing on the WEB log; identifies the corresponding users in the WEB log; and crawls webpage texts, preprocesses them and stores them in a database.
The data mining module calculates the feature values of words, extracts keywords of the webpage texts, and carries out topic classification on the webpage texts.
The user portrait module constructs a user character portrait model.
The group user portrait module clusters the users to construct a group user portrait.
In order to implement the method of the embodiment of the present application, an embodiment of the present application further provides a group user portrait acquisition device. Fig. 5 is a schematic diagram of the group user portrait acquisition device provided in the embodiment of the present application. Referring to fig. 5, the device includes:
an extracting unit 501, configured to extract corresponding user features based on WEB log data of a user;
a clustering unit 502, configured to cluster the at least one user based on the extracted user characteristics corresponding to each user of the at least one user, so as to obtain a group user portrait corresponding to each category.
In an embodiment, the extracting unit 501 is further configured to determine, based on WEB log data of a user, a WEB page text of a WEB page visited by the user;
and extracting corresponding user characteristics based on the webpage text of the webpage accessed by the user.
In one embodiment, the apparatus further comprises: a determining unit, configured to determine the user corresponding to the extracted user characteristic based on the user identification information in the WEB log data.
In an embodiment, the determining unit is further configured to determine, based on IP address information in the WEB log data, at least one user corresponding to the WEB log data;
and/or
determine at least one user corresponding to the WEB log data based on the operating system information in the WEB log data.
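Identifying users from the IP address and/or operating system information can be sketched as a grouping step. This is only an illustration under assumptions: the record fields `ip`, `os` and `url`, the function name `group_records_by_user`, and the use of the (IP, OS) pair as the user key are hypothetical and are not specified by the text.

```python
from collections import defaultdict


def group_records_by_user(records, use_ip=True, use_os=True):
    """Group log records under a user key built from the IP address and/or the OS."""
    users = defaultdict(list)
    for rec in records:
        # Build the key from whichever identification fields are enabled.
        key = tuple(
            rec[field]
            for field, enabled in (("ip", use_ip), ("os", use_os))
            if enabled
        )
        users[key].append(rec)
    return dict(users)
```

Using both fields separates two devices that share one IP address but run different operating systems, while `use_os=False` falls back to treating each IP address as one user.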
In one embodiment, the apparatus further comprises: the data cleaning unit is used for performing data cleaning on the WEB log data in at least one of the following modes:
converting the WEB log data in which the suffix does not meet the set condition in the WEB log data into WEB log data in which the suffix meets the set condition;
deleting the WEB log data whose status codes do not meet set conditions in the WEB log data;
deleting the WEB log data with missing content in the WEB log data;
and deleting the WEB log data which is repeated with the contents of other WEB log data in the WEB log data.
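The three deletion rules above can be sketched as a single filtering pass. This is a hedged example, not the cleaning procedure of the application: the record fields, the assumption that only status code 200 meets the set condition, and the function name `clean_log_records` are hypothetical. The suffix conversion is left out because this passage does not state which suffixes meet the set condition.

```python
def clean_log_records(records, ok_status=frozenset({200}), required=("ip", "url", "status")):
    """Drop records with a disallowed status code, missing content, or duplicate content."""
    seen = set()
    cleaned = []
    for rec in records:
        # Missing content: a required field is absent or empty.
        if any(rec.get(field) in (None, "") for field in required):
            continue
        # Status code does not meet the (assumed) set condition.
        if rec["status"] not in ok_status:
            continue
        # Duplicate: an earlier record had exactly the same content.
        fingerprint = tuple(sorted(rec.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(rec)
    return cleaned
```

The first occurrence of a repeated record is kept and later copies are dropped, so the order of the surviving records matches the order of the log.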
In an embodiment, the extracting unit 501 is further configured to determine a TF-IDF value of each word in each WEB page text based on a square of an IDF of each word in each corresponding WEB page text in WEB log data of the user;
and extracting corresponding user characteristics based on the TF-IDF value of each word in each webpage text.
In one embodiment, the apparatus further comprises: a setting unit, configured to set a first weight to be larger than a second weight, wherein
the first weight represents the weight of a word located in the title and/or the first paragraph of each webpage text; the second weight represents the weight of words other than the words corresponding to the first weight.
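A TF-IDF value based on the square of the IDF, combined with position weights, can be sketched as below. The exact formula is not given in this passage, so everything specific here is an assumption: the score `weight * TF * IDF**2`, the IDF definition `log(N / df)`, the multiplicative use of the weights, the default weight values, and the input shape (a list of `(head_words, body_words)` pairs, where `head_words` are the words of the title and/or first paragraph).

```python
import math
from collections import Counter


def weighted_tfidf(docs, first_weight=2.0, second_weight=1.0):
    """Score each word of each document as weight * TF * IDF**2 (assumed formula)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter()
    for head, body in docs:
        df.update(set(head) | set(body))
    scores = []
    for head, body in docs:
        words = list(head) + list(body)
        counts = Counter(words)
        head_set = set(head)
        doc_scores = {}
        for word, count in counts.items():
            tf = count / len(words)
            idf = math.log(n_docs / df[word])
            # Words in the title/first paragraph get the larger first weight.
            weight = first_weight if word in head_set else second_weight
            doc_scores[word] = weight * tf * idf ** 2
        scores.append(doc_scores)
    return scores
```

Squaring the IDF pushes words that occur in many webpage texts further toward zero, and the first weight lifts title and first-paragraph words above body words with otherwise equal counts.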
In practical applications, the extracting unit 501, the clustering unit 502, the determining unit, the data cleaning unit, and the setting unit may be implemented by a processor in a terminal, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Micro Control Unit (MCU), or a Field-Programmable Gate Array (FPGA).
It should be noted that: when the group user portrait acquisition device provided in the above embodiment acquires a group user portrait, the above division of program modules is merely an example. In practical applications, the above processing may be distributed to different program modules as needed, that is, the internal structure of the device may be divided into different program modules to complete all or part of the above processing. In addition, the group user portrait acquisition device provided by the above embodiment and the group user portrait acquisition method embodiment belong to the same concept; the specific implementation process of the device is detailed in the method embodiment and is not described herein again.
Based on the hardware implementation of the program module, in order to implement the method of the embodiment of the present application, an embodiment of the present application further provides an electronic device. Fig. 6 is a schematic diagram of a hardware component structure of an electronic device according to an embodiment of the present application, and as shown in fig. 6, the electronic device includes:
a communication interface 601, which can perform information interaction with other devices such as network devices;
a processor 602, connected with the communication interface 601 to implement information interaction with other devices, and configured to execute the method provided by one or more of the technical solutions on the terminal side when running a computer program, the computer program being stored in the memory 603.
Specifically, the processor 602 is configured to extract, based on WEB log data of a user, a corresponding user feature;
and clustering the at least one user based on the extracted user characteristics corresponding to each user in the at least one user to obtain a group user portrait corresponding to each category.
In one embodiment, the processor 602 is further configured to determine, based on WEB log data of the user, WEB page text of a WEB page visited by the user;
and extracting corresponding user characteristics based on the webpage text of the webpage accessed by the user.
In an embodiment, when extracting the corresponding user feature based on the WEB log data of the user, the processor 602 is further configured to determine the user corresponding to the extracted user feature based on the user identification information in the WEB log data.
In an embodiment, the processor 602 is further configured to determine at least one user corresponding to the WEB log data based on IP address information in the WEB log data;
and/or
and determining at least one user corresponding to the WEB log data based on the operating system information in the WEB log data.
In an embodiment, before extracting the corresponding user feature based on the WEB log data of the user, the processor 602 is further configured to perform data cleaning on the WEB log data in at least one of the following ways:
converting the WEB log data in which the suffix does not meet the set condition in the WEB log data into WEB log data in which the suffix meets the set condition;
deleting the WEB log data whose status codes do not meet set conditions in the WEB log data;
deleting the WEB log data with missing content in the WEB log data;
and deleting the WEB log data which is repeated with the contents of other WEB log data in the WEB log data.
In an embodiment, the processor 602 is further configured to determine a TF-IDF value of each word in each WEB page text based on a square of an IDF of each word in each corresponding WEB page text in WEB log data of the user;
and extracting corresponding user characteristics based on the TF-IDF value of each word in each webpage text.
In an embodiment, the processor 602 is further configured to set a first weight to be greater than a second weight, wherein
the first weight represents the weight of a word located in the title and/or the first paragraph of each webpage text; the second weight represents the weight of words other than the words corresponding to the first weight.
Of course, in practice, the various components in the electronic device are coupled together by the bus system 604. It is understood that the bus system 604 is used to enable communications among the components. The bus system 604 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 604 in fig. 6.
The memory 603 in the embodiments of the present application is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on an electronic device.
It will be appreciated that the memory 603 can be volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory can be Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 603 described in the embodiments of the present application is intended to include, without being limited to, these and any other suitable types of memory.
The methods disclosed in the embodiments of the present application may be applied to the processor 602, or implemented by the processor 602. The processor 602 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 602. The processor 602 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 602 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 603, and the processor 602 reads the program in the memory 603 and performs the steps of the foregoing method in conjunction with its hardware.
The processor 602 executes the program to implement corresponding processes in the methods according to the embodiments of the present application.
In an exemplary embodiment, the present application further provides a storage medium, i.e. a computer storage medium, specifically a computer readable storage medium, for example, including a memory 603 storing a computer program, which can be executed by the processor 602 to implement the steps of the foregoing method. The computer readable storage medium may be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, terminal and method may be implemented in other manners. The above-described device embodiments are only illustrative, for example, the division of the unit is only one logical function division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A group user portrait acquisition method, the method comprising:
extracting corresponding user characteristics based on world wide WEB (WEB) log data of a user;
and clustering the at least one user based on the extracted user characteristics corresponding to each user in the at least one user to obtain a group user portrait corresponding to each category.
2. The group user portrait acquisition method according to claim 1, wherein the extracting corresponding user features based on the WEB log data of the user comprises:
determining webpage texts of webpages accessed by users based on WEB log data of the users;
and extracting corresponding user characteristics based on the webpage text of the webpage accessed by the user.
3. The group user portrait acquisition method according to claim 1, wherein the extracting corresponding user features based on the WEB log data of the user comprises:
and determining the user corresponding to the extracted user characteristics based on the user identification information in the WEB log data.
4. The group user portrait acquisition method according to claim 3, wherein the determining the user corresponding to the WEB log data based on the user identification information in the WEB log data comprises:
determining at least one user corresponding to the WEB log data based on the Internet protocol IP address information in the WEB log data;
and/or
and determining at least one user corresponding to the WEB log data based on the operating system information in the WEB log data.
5. The method of claim 1, wherein before extracting corresponding user features based on the WEB log data of the user, the method further comprises:
performing data cleaning on the WEB log data by at least one of the following modes:
converting the WEB log data in which the suffix does not meet the set condition in the WEB log data into WEB log data in which the suffix meets the set condition;
deleting the WEB log data whose status codes do not meet set conditions in the WEB log data;
deleting the WEB log data with missing content in the WEB log data;
and deleting the WEB log data which is repeated with the contents of other WEB log data in the WEB log data.
6. The group user portrait acquisition method according to claim 1, wherein the extracting corresponding user features based on the WEB log data of the user comprises:
determining a word frequency-inverse text frequency TF-IDF value of each word in each webpage text based on the square of the inverse text frequency IDF of each word in each corresponding webpage text in WEB log data of a user;
and extracting corresponding user characteristics based on the TF-IDF value of each word in each webpage text.
7. The method of claim 6, wherein the determining a TF-IDF value of each word in each webpage text comprises:
setting a first weight to be greater than a second weight, wherein
the first weight represents the weight of a word located in the title and/or the first paragraph of each webpage text; the second weight represents the weight of words other than the words corresponding to the first weight.
8. A group user portrait acquisition device, the device comprising:
the extraction unit is used for extracting corresponding user characteristics based on WEB log data of a user;
and the clustering unit is used for clustering at least one user based on the extracted user characteristics corresponding to each user in the at least one user to obtain a group user portrait corresponding to each category.
9. An electronic device, comprising: a processor and a memory for storing a computer program capable of running on the processor, wherein,
the processor is adapted to perform the steps of the method of any one of claims 1 to 7 when running the computer program.
10. A storage medium on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202110192229.XA 2021-02-19 2021-02-19 Group user portrait acquisition method and device, electronic equipment and storage medium Pending CN112905783A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110192229.XA CN112905783A (en) 2021-02-19 2021-02-19 Group user portrait acquisition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112905783A true CN112905783A (en) 2021-06-04

Family

ID=76124077

Country Status (1)

Country Link
CN (1) CN112905783A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140289006A1 (en) * 2013-03-20 2014-09-25 Kaptivating Hospitality LLC Method and System For Social Media Sales
CN105893406A (en) * 2015-11-12 2016-08-24 乐视云计算有限公司 Group user profiling method and system
CN108108451A (en) * 2017-12-27 2018-06-01 合肥美的智能科技有限公司 The group of subscribers portrait acquisition methods and device of group
CN111597330A (en) * 2019-02-21 2020-08-28 中国科学院信息工程研究所 Intelligent expert recommendation-oriented user image drawing method based on support vector machine
CN111967914A (en) * 2020-08-26 2020-11-20 珠海格力电器股份有限公司 User portrait based recommendation method and device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
卓佳怡等: "基于TF-idf算法的公文用户画像", 应用交流, pages 218 - 223 *
郭跃等: "基于高等教育四个回归下的大学生角色构建调研分析", 浙江万里学院学报, pages 91 - 95 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210604
