CN116127074B - Anchor image classification method based on LDA theme model and kmeans clustering algorithm - Google Patents

Anchor image classification method based on LDA theme model and kmeans clustering algorithm

Info

Publication number
CN116127074B
CN116127074B CN202310157141.3A
Authority
CN
China
Prior art keywords
anchor
topic
data
text data
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310157141.3A
Other languages
Chinese (zh)
Other versions
CN116127074A (en)
Inventor
吴少辉
王洪珑
谢晓东
李国鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202310157141.3A priority Critical patent/CN116127074B/en
Publication of CN116127074A publication Critical patent/CN116127074A/en
Application granted granted Critical
Publication of CN116127074B publication Critical patent/CN116127074B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An anchor image classification method based on an LDA topic model and the k-means clustering algorithm, belonging to the technical field of data analysis. The steps are as follows: S1, acquiring anchor information from the indicated terminal device to obtain an original data set, and preprocessing the acquired anchor information to obtain an initial data set; S2, constructing an LDA topic model from the initial data set, and mining topic words and the topic probability distribution of each piece of anchor text information; S3, data conversion: applying logarithmic processing and standardization to each anchor's numerical data; S4, determining the number of cluster categories according to the silhouette coefficient and the within-cluster sum of squared errors; S5, clustering the anchors' numerical data with the k-means clustering algorithm to obtain the category each anchor belongs to, and analyzing anchor characteristics from the results to establish anchor portraits. The invention can simultaneously cluster anchors' text data and structured data, establishing anchor portraits that support refined marketing.

Description

Anchor image classification method based on LDA theme model and kmeans clustering algorithm
Technical Field
The invention belongs to the technical field of data analysis, and particularly relates to an anchor image classification method based on an LDA topic model and the k-means clustering algorithm.
Background
With the development of mobile networks, more and more people watch live streams, and live-streaming platforms such as Douyin (TikTok) and Kuaishou have developed rapidly on this basis. Because of the strong traffic that live streaming brings, ever more enterprises seek out anchors to cooperate with and promote their own products and services. However, facing a massive population of anchors, enterprises do not know how to select suitable anchors to cooperate with, what characteristics different types of anchors have, or what different marketing effects they can produce. On this basis we propose portraying anchors based on the LDA topic model and the k-means clustering algorithm. Meanwhile, existing research cannot incorporate an anchor's text data (such as the live-stream introduction) into the anchor portrait in quantized form, even though this important data type strongly shapes how audiences and enterprises perceive an anchor in real life.
The invention patent with grant publication number CN110689040B, granted on 18 October 2022, discloses a sound classification method based on anchor portraits, which classifies anchor portraits. That patent's classification method requires the audio content to be defined and categorized in advance, so it cannot analyze big data broadly and efficiently, and it does not automatically analyze and mine the text data of the audio.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provides an anchor image classification method based on an LDA topic model and the k-means clustering algorithm.
According to the invention, anchors are classified automatically: their text data can be analyzed with the LDA topic model and converted into numerical data. On that basis, the anchors' numerical data are cluster-analyzed with the k-means clustering algorithm; because the approach is based on machine learning and needs no manual labeling, the objectivity and convenience of the classification results are ensured while occupied human resources are freed.
The technical scheme adopted by the invention is as follows:
the method uses an LDA topic model to convert anchors' text data into numerical data, clusters the anchors' numerical data with the k-means clustering algorithm, and establishes anchor portraits; the method comprises the following steps:
s1, acquiring anchor information in indication terminal equipment to obtain an original data set, and carrying out data preprocessing on the information to obtain an initial data set;
s2, constructing an LDA topic model according to the initial data set, and mining different topic probability distributions of topic words and each anchor text data from the initial data set;
s3, data conversion, namely carrying out logarithmic processing and standardization on the numerical data of each anchor;
s4, determining the number of the categories of the clusters, and determining the number of the categories of the clusters according to the contour coefficient and the square sum of errors in the clusters;
S5, clustering the anchors' numerical data with the k-means clustering algorithm to obtain the category each anchor belongs to, analyzing anchor characteristics from the results, and building anchor portraits.
Further, in step S1, acquiring the anchor information from the indicated terminal device to obtain an original data set and preprocessing the information to obtain the initial data set comprises the following specific steps:
s11, acquiring text data and numerical data of a host, and screening live broadcast containing missing values to obtain an original data set;
s12, on the basis of the step S11, performing text word segmentation on the original data set to obtain word segmentation word sets;
s13, collecting stop words according to the stop word list, constructing a related dictionary, removing the stop words of the word segmentation vocabulary, and obtaining an initial data set.
Further, in step S2, the specific steps of constructing the LDA topic model are as follows:
s21, determining the topic number K of the LDA topic model according to the initial data set, and obtaining the optimal topic number K by adopting a confusion degree evaluation method, wherein a confusion degree calculation formula is as follows:
wherein M is the number of anchor text data; n (N) i The total number of words appearing in text data of the ith anchor; w (w) i Words constituting the i-th anchor related text data; p (w) i ) Is w i The probability of generation;
to ensure the clustering effect, the perplexity is computed for all topic numbers K within 10, and the inflection point of the perplexity curve is selected as the optimal topic number K according to the elbow method;
s22, in the dirichlet distribution with the prior parameters of alpha and beta, sampling to generate a subject distribution theta of text data of each anchor under the condition of the number K of subjects and a subject word distribution of all anchor text data
α denotes the Dirichlet prior parameter of each anchor text document's distribution over topics;
beta is specifically expressed as a dirichlet prior parameter of the subject word distribution of all the anchor text data;
s23, sampling and generating a theme Z of each anchor text data from the theme distribution theta of each anchor text data, wherein the LDA theme model assumes that each anchor text data is composed of word combinations with different proportions, reflects a unique theme of each anchor text data, and the combination proportion obeys polynomial distribution and is expressed as follows:
Z|θ=Multinomial(θ)
From the topic-word distribution φ of all anchor text data, the topic words W are generated by sampling; each topic is composed of words in the anchor text data, and the combination proportions likewise obey a multinomial distribution:

W|φ, Z = Multinomial(φ)

where w_i are the words constituting the i-th anchor's text data; the probability distribution is calculated as:

P(w_i) = Σ_{s=1}^{k} P(w_i | z = s) · P(z = s | i)
where P(w_i|z=s) is the probability that word w_i belongs to the s-th topic; P(z=s|i) is the probability of the s-th topic in the i-th anchor's text data; and k is the optimal topic number;
s24, the LDA topic model result contains high-frequency words under each topic K and topic distribution of each anchor text data, the first 20 high-frequency words of each topic K under the optimal topic number K are analyzed, and definition and explanation are carried out on each topic K at the same time;
s25, the LDA topic model result also contains probability distribution of each topic in each anchor text data, and the probability distribution is taken as a data variable of the anchor text data and is included in cluster analysis.
Further, the specific steps of data conversion in the step S3 are as follows:
s31, standardizing numerical data of the anchor needing to be clustered, and expressing the numerical data as follows by a formula:
z=(x-μ)/σ
where x is a specific value of the numerical data; μ is the mean of the numerical data; and σ is its standard deviation. The Z value represents the distance, measured in standard deviations, between the original value and the population mean: Z is negative when the original value is below the mean and positive otherwise.
Further, the specific steps of step S4 are as follows:
s41, determining the category number of clusters according to the outline coefficients and the square sum of errors in clusters, wherein the outline coefficients are calculated according to the following formula:
wherein a is i Representing the average distance between the ith sample and all other data in the same cluster, namely the aggregation degree in the quantized cluster; b i Representing the average distance of the ith sample from the last cluster for quantifying the degree of separation within the cluster; n represents the total number of anchorThe number of the text messages is equal to the number M of the anchor text messages; f is the contour coefficient of all samples; it is not difficult to find that if f is smaller than 0, the average distance between f and the element in the cluster is larger than that of other nearest clusters, so that the clustering effect is poor; if ai tends to 0, or bi is greater than ai, then f tends to 1, indicating that clustering is best;
the sum of squared errors is calculated as:

SSE = Σ_{q=1}^{l} Σ_{p∈C_q} |p − m_q|²

where C_q is the q-th cluster; m_q is the cluster centroid of C_q; p is a sample point in C_q; and SSE is the clustering error over all samples, characterizing the quality of the clustering effect. As the cluster number l increases, the cohesion of each cluster gradually increases and SSE gradually decreases; while l is below the optimal cluster number, each increase of l reduces SSE sharply; when l reaches the optimal cluster number L, the drop in SSE abruptly shrinks, and SSE then flattens slowly as l continues to grow;
the optimal cluster number L is determined from the three points with the largest silhouette coefficients among cluster numbers 1 to 9, combined with the inflection point of the SSE curve;
s42, randomly selecting L index vectors from the numerical data standardized in the step S31 as initial center points, wherein L is more than 1;
s43, after an initial center point is selected, calculating the distance from each index vector to L initial center points, and dividing the index vector into classifications corresponding to the initial center points if the distance from the index vector to which initial center point is the smallest;
s44, dividing the index vector into L classifications, and calculating a center point of each classification;
S45, iterating the calculations of steps S43 and S44 until the L center points are equal to those of the previous iteration, or the distance between them is smaller than a specified threshold; the iteration then ends, and the final L center points of the index vectors are the feature vectors of the L classifications.
Compared with the prior art, the invention has the following beneficial effects. The invention provides an anchor image classification method based on an LDA topic model and the k-means clustering algorithm. It first uses the LDA topic model to mine anchors' text information (such as anchor introductions and announcements), extracts the topic distribution of each anchor's text, and determines the content of the different topics from their topic words; the topic distributions and the anchors' other numerical data are then standardized, and k-means cluster analysis is performed. The optimal cluster number is determined from the silhouette coefficient and the sum of squared errors (SSE), cluster analysis is carried out on that basis, the center points of the different categories are obtained, and anchor portraits are established. Within the live-streaming field, the LDA topic model in this method can also analyze large amounts of text data, overcoming the inability of prior studies to analyze and cluster text data, and building complete anchor portraits. Aimed at Internet e-commerce platforms, the method takes anchors' various behaviors and effects on the platform as the classification basis and classifies anchors automatically with the k-means clustering algorithm, ensuring the objectivity of the classification results while freeing occupied human resources. The invention can also convert audio data into text data, thereby automatically analyzing and mining the text data of audio.
Drawings
FIG. 1 is a flow chart of the anchor image classification method based on an LDA topic model and the k-means clustering algorithm of the invention;
fig. 2 is a simplified schematic diagram of an LDA topic model.
Detailed Description
The technical solutions of the present invention will be described clearly and completely below with reference to the accompanying drawings of the embodiments. The described embodiments are only some, not all, embodiments of the invention; all other embodiments obtained by those skilled in the art without creative effort on the basis of these embodiments fall within the protection scope of the invention.
The first embodiment: as shown in FIG. 1, this embodiment discloses an anchor image classification method based on an LDA topic model and the k-means clustering algorithm. It converts anchors' text data into numerical data with the LDA topic model (i.e. obtains the topic probability distributions in the text data), clusters the anchors' numerical data with the k-means clustering algorithm, and establishes anchor portraits (helping enterprises find suitable partner anchors and refine marketing); the method comprises the following steps:
s1, acquiring anchor information in an indication terminal device to obtain an original data set, and carrying out data preprocessing on information (the information comprises text data and non-text data, the text data comprises anchor introduction, bulletin, numerical data such as vermicelli quantity and the like) to obtain an initial data set;
s2, constructing an LDA topic model according to the initial data set, and mining different topic probability distributions of topic words and each anchor text data from the initial data set;
s3, data conversion, namely carrying out logarithmic processing and standardization on numerical data (which refers to various information related to the anchor and expressed by numbers, such as vermicelli numbers and the like) of each anchor;
s4, determining the number of the categories of the clusters, and determining the number of the categories of the clusters according to the contour coefficient and the Sum of Squares (SSE) of errors in the clusters (determining the number of the categories of the numerical data clusters in the step S3 and determining the optimal number of the clusters);
S5, clustering the anchors' numerical data with the k-means clustering algorithm to obtain the category each anchor belongs to, analyzing anchor characteristics from the results, and building anchor portraits.
Further, in step S1, acquiring the anchor information from the indicated terminal device to obtain an original data set and preprocessing the information to obtain the initial data set comprises the following specific steps:
s11, acquiring text data (such as anchor introduction) and numerical data (such as vermicelli quantity, live broadcast duration and the like) of an anchor, and screening live broadcast containing missing values to obtain an original data set;
s12, on the basis of the step S11, performing text word segmentation on the original data set to obtain word segmentation word sets;
s13, collecting stop words according to the stop word list, constructing a related dictionary, removing stop words of the word segmentation vocabulary (the word segmentation vocabulary set is a plurality of word sets, and the stop words are words which are not adopted in the word segmentation vocabulary set), and obtaining an initial data set.
Further, in step S2, the specific steps of constructing the LDA topic model are as follows:
s21, determining the topic number K of an LDA topic model (which is the prior art) according to an initial data set, adopting a confusion degree evaluation method to obtain an optimal topic number K (the confusion degree obtained by calculating different topic numbers K is different, the lower the confusion degree is, the stronger the generalization capability of the topic model under the corresponding K value is), and the confusion degree calculation formula is as follows:
wherein M is the number of anchor text data (such as anchor introduction); n (N) i The total number of words appearing in text data (such as anchor introduction) of the ith anchor; w (w) i Words constituting the text data related to the ith anchor (such as anchor introduction); p (w) i ) Is w i The probability of generation;
to ensure the clustering effect, the perplexity is computed for all topic numbers K within 10, and the inflection point of the perplexity curve is selected as the optimal topic number K according to the elbow method;
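The perplexity defined above can be computed directly from per-word generation probabilities; a minimal stdlib sketch (the toy probabilities below are illustrative, not real model output):

```python
import math

def perplexity(doc_word_probs):
    """Perplexity over M documents.

    doc_word_probs[i] holds the model's generation probability P(w) for
    each word occurrence in document i, so:
      perplexity = exp( -sum_i sum_w log P(w) / sum_i N_i )
    """
    log_likelihood = sum(math.log(p) for doc in doc_word_probs for p in doc)
    total_words = sum(len(doc) for doc in doc_word_probs)
    return math.exp(-log_likelihood / total_words)

# Two toy "documents" with hypothetical word probabilities.
docs = [[0.1, 0.2, 0.05], [0.3, 0.1]]
ppl = perplexity(docs)
```

In practice one would evaluate this for K = 2 … 10 and pick the elbow of the curve, as S21 describes.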
s22, in Dirichlet distribution with a priori parameters of alpha and beta, sampling to generate topic distribution theta of text data (such as anchor introduction) under the condition of topic number K (the optimal topic number is adopted in the operation of the text) of each anchor and topic word distribution of all anchor text data (such as anchor introduction)
α denotes the Dirichlet prior parameter of each anchor text document's (such as the anchor introduction's) distribution over topics;
beta is specifically expressed as a dirichlet a priori parameter of the subject word distribution of all the anchor text data (such as anchor introduction);
s23, sampling and generating a theme Z of each anchor text data (such as anchor introduction) from a theme distribution theta of each anchor text data (such as anchor introduction), wherein the LDA theme model assumes that each anchor text data (such as anchor introduction) is composed of word combinations with different proportions, reflects a unique theme of each anchor text data (such as anchor introduction), and the combination proportions are distributed according to a polynomial (Multinomial), and are expressed as:
Z|θ=Multinomial(θ)
From the topic-word distribution φ of all anchor text data (such as anchor introductions), the topic words W are generated by sampling; each topic is composed of words in the anchor text data (such as the anchor introduction), and the combination proportions likewise obey a multinomial distribution:

W|φ, Z = Multinomial(φ)

where w_i are the words constituting the i-th anchor's text data (such as the anchor introduction); the probability distribution is calculated as:

P(w_i) = Σ_{s=1}^{k} P(w_i | z = s) · P(z = s | i)
where P(w_i|z=s) is the probability that word w_i belongs to the s-th topic; P(z=s|i) is the probability of the s-th topic in the i-th anchor's text data (such as the anchor introduction); and k is the optimal topic number;
s24, the LDA topic model result contains high-frequency words under each topic K and topic distribution of each anchor text data (such as anchor introduction), the first 20 high-frequency words of each topic K under the optimal topic number K are analyzed, and meanwhile definition and explanation are carried out on each topic K;
s25, the LDA topic model result also contains probability distribution of each topic in each anchor text data, and the probability distribution is taken as a data variable of the anchor text data and is included in cluster analysis.
Further, the specific steps of data conversion in the step S3 are as follows:
s31, standardizing numerical data of the anchor needing to be clustered, and expressing the numerical data as follows by a formula:
z=(x-μ)/σ
where x is a specific value of the numerical data; μ is the mean of the numerical data; and σ is its standard deviation. The Z value represents the distance, in standard-deviation units, between the original value (the specific number) and the population mean (the mean of the numerical data); Z is negative when the original value is below the mean and positive otherwise.
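A stdlib sketch of the z = (x − μ)/σ standardization of S31 (the input values are hypothetical log-transformed follower counts, not real data):

```python
import statistics

def z_scores(values):
    """Standardize: z = (x - mu) / sigma, using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

# Hypothetical per-anchor values after the logarithmic processing of step S3.
z = z_scores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Values below the mean come out negative and values above it positive, matching the description above.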
Further, the specific steps of step S4 are as follows:
s41, determining the number of categories of clusters according to the contour coefficient and the Sum of Squares (SSE) of errors in clusters, wherein the contour coefficient has the following calculation formula:
wherein a is i Representing the average distance between the ith sample (i.e. the anchor) and all other data in the same cluster, namely quantifying the aggregation degree in the cluster; b i Representing the average distance of the ith sample (i.e., anchor) from the nearest cluster for quantifying the degree of separation within the cluster; n represents the total number of anchor, and the number is equal to the number M of anchor text information (such as anchor introduction); f is the contour coefficient of all samples; it is not difficult to find that if f is smaller than 0, the average distance between f and the element in the cluster is larger than that of other nearest clusters, so that the clustering effect is poor; if ai tends to 0, or bi is greater than ai, then f tends to 1, indicating that clustering is best;
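A sketch of the silhouette computation on toy data, using scikit-learn's silhouette_score, which implements the per-sample (b_i − a_i)/max(a_i, b_i) average described above:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Toy standardized feature vectors: two tight, well-separated groups.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Mean of (b_i - a_i) / max(a_i, b_i) over all samples.
f = silhouette_score(X, labels)
```

For this clearly separated toy data f is close to 1, consistent with the interpretation in S41; in the method it would be evaluated for each candidate cluster number.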
the sum of squared errors (SSE) is calculated as:

SSE = Σ_{q=1}^{l} Σ_{p∈C_q} |p − m_q|²

where C_q is the q-th cluster; m_q is the cluster centroid of C_q; p is a sample point in C_q; and SSE is the clustering error over all samples, characterizing the quality of the clustering effect. As the cluster number l increases, the cohesion of each cluster gradually increases and SSE gradually decreases; while l is below the optimal cluster number, each increase of l reduces SSE sharply; when l reaches the optimal cluster number L, the drop in SSE abruptly shrinks, and SSE then flattens slowly as l continues to grow;
the optimal cluster number L is determined from the three points with the largest silhouette coefficients among cluster numbers 1 to 9, combined with the inflection point of the SSE curve;
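The SSE elbow behaviour described above can be observed directly: scikit-learn's KMeans exposes the within-cluster sum of squared errors as `inertia_`. A sketch on toy one-dimensional data with three true clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three tight groups near 0, 5, and 10.
X = np.array([[0.0], [0.2], [0.1], [5.0], [5.2], [5.1], [10.0], [10.2], [10.1]])

# inertia_ is exactly the SSE = sum over clusters of |p - m_q|^2.
sse = {l: KMeans(n_clusters=l, n_init=10, random_state=0).fit(X).inertia_
       for l in (1, 2, 3, 4)}
```

SSE drops steeply up to the true cluster count (3) and then flattens, which is the inflection the elbow method looks for.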
s42, randomly selecting L index vectors (namely the optimal cluster number) from the numerical data standardized in the step S31 as initial center points, wherein L is more than 1;
s43, after an initial center point is selected, calculating the distance from each index vector (namely each sample) to L initial center points, and dividing the index vector into classifications corresponding to the initial center points if the distance from the index vector to which initial center point is the smallest;
s44, dividing the index vector (i.e. all samples) into L classifications (i.e. the optimal cluster number), and calculating a center point (mean value) of each classification (each cluster);
S45, iterating the calculations of steps S43 and S44 until the L center points are equal to those of the previous iteration, or the distance between them is smaller than a specified threshold; the iteration then ends, and the final L center points of the index vectors are the feature vectors of the L classifications.
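Steps S42 to S45 are plain k-means; a minimal NumPy sketch of exactly that loop (toy two-dimensional data; the variable names are illustrative):

```python
import numpy as np

def kmeans(X, L, tol=1e-6, max_iter=100, seed=0):
    """Plain k-means following S42-S45: random initial centers, nearest-center
    assignment, center recomputation, iterate to convergence."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=L, replace=False)]        # S42
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                             # S43
        new_centers = np.array([X[labels == l].mean(axis=0)       # S44
                                if np.any(labels == l) else centers[l]
                                for l in range(L)])
        if np.linalg.norm(new_centers - centers) < tol:           # S45
            break
        centers = new_centers
    return labels, new_centers

X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1]])
labels, centers = kmeans(X, L=2)
```

The returned centers are the per-category feature vectors from which the anchor portraits are read off.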
Example 1:
the embodiment discloses a method for classifying anchor images based on an LDA topic model and a kmeans clustering algorithm, which adopts the LDA topic model to mine topic words in text data and carries out classification extraction to obtain topic distribution introduced by each anchor, finally clusters the numerical data of each anchor through the kmeans clustering algorithm to obtain anchor images, guides enterprises to more effectively select cooperative anchors, and deepens the mining and understanding of anchor features by related parties.
1. Study data and methods
1. Study data
With the development of mobile internet technology, live streaming is increasingly favored by audiences, and large numbers of anchors with different characteristics and effects have appeared on live-streaming platforms. This embodiment selects 81,237 live streams from 2,069 anchors on the Douyin (TikTok) platform between May and October 2021, during which each anchor's introduction remained unchanged, i.e. each anchor has one anchor introduction. The anchors' numerical data (follower count, live time-slot distribution, average number of product categories carried, average live-stream duration, average number of works, and average product price) and text data (anchor introductions) are acquired, and anchor portraits are built on this data using the LDA topic model and the k-means clustering algorithm.
2. Research method
With the development of science and technology, live streaming has greatly enriched audiences' lives through its convenience and immersiveness, and audiences enjoy the pleasure it brings. Enterprises therefore compete to select and cooperate with anchors to promote their own products; live platforms select and cultivate anchors to bring themselves traffic; and newly joined anchors imitate the characteristics of existing anchors to obtain corresponding effects. However, facing massive numbers of anchors, how to classify them quickly, and how to take anchors' text data into account in anchor portraits, have rarely been addressed by existing research. The invention therefore provides an anchor image classification method based on an LDA topic model and the k-means clustering algorithm, which rapidly classifies anchors and mines their characteristics by data-mining real anchor data, including text data and numerical data. As shown in FIG. 1, the method of the invention comprises the following steps:
(1) Data acquisition and data preprocessing: acquiring the information of each anchor and each live broadcast (including barrage and sales volume information) from the indicated terminal device to obtain an original data set; the relevant data of each Douyin anchor and each live broadcast are obtained through a Python crawler program, and data preprocessing is carried out on the initial data set, mainly comprising data cleaning, jieba word segmentation and stop-word removal.
(2) Topic model analysis: adopting an LDA topic model to identify the different topics, and their distributions, in the anchor introductions.
(3) kmeans cluster analysis: converting the numerical data, determining the optimal cluster number according to the silhouette coefficient and the sum of squared errors (SSE), classifying the data with the kmeans clustering algorithm on this basis, and analyzing the anchor characteristics from the result to establish the anchor portrait.
2. Experiment and analysis
1. Data acquisition and preprocessing
Through a third-party platform, 81,237 live broadcasts by 2,069 anchors from May to October 2021 on the Douyin platform are selected, and the anchor numerical data (number of fans, distribution of live time periods, average number of commodity categories carried, average live duration, average number of works and average commodity price) and anchor text data (anchor introduction) are obtained, so that the actual live data of the Douyin anchor population can be analyzed.
After the original data are obtained, data preprocessing is usually needed to improve data reliability. The specific process is as follows:
(1) Screening out live broadcasts with missing related data through Excel;
(2) Performing text word segmentation in a Python program using the jieba word-segmentation package;
(3) Collecting a stop-word library, making a stop-word list, and removing the stop words with a Python program; a stop-word dictionary is also made according to the idiom dictionary.
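The preprocessing steps above can be sketched as follows. This is an illustrative sketch, not the patent's actual crawler pipeline: the `segment` function is a whitespace stand-in for jieba's `lcut` so the sketch runs without jieba installed, and the stop-word set is a toy placeholder for the collected stop-word library.

```python
# Hedged sketch of the preprocessing step: data cleaning (drop missing
# entries), word segmentation, and stop-word removal.

STOP_WORDS = {"the", "a", "of", "and"}  # placeholder; a real run loads a stop-word list

def segment(text):
    """Stand-in tokenizer; in the actual embodiment: jieba.lcut(text)."""
    return text.split()

def preprocess(raw_docs):
    cleaned = []
    for doc in raw_docs:
        if not doc or not doc.strip():  # data cleaning: skip missing introductions
            continue
        tokens = [t for t in segment(doc) if t not in STOP_WORDS]
        cleaned.append(tokens)
    return cleaned

docs = preprocess(["the anchor sells quality goods", "", "a fan of live shopping"])
print(docs)  # [['anchor', 'sells', 'quality', 'goods'], ['fan', 'live', 'shopping']]
```

In the actual embodiment the stop-word removal runs after jieba segmentation of the Chinese anchor introductions; the control flow is the same.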
2. Topic model analysis
During live broadcasts, different anchors publish their own experience, related products or live-stream information through their anchor introductions. An LDA topic model is used to convert the text of the anchor introductions into numerical information and cluster it, obtaining the different topic distributions contained in each anchor's introduction.
2.1, mining topics by using LDA topic model
LDA topic model
The invention adopts an LDA topic model to mine topics from the anchor introductions. LDA is a generative document-topic model comprising three layers: word, topic and document (i.e. a piece of anchor text data), as shown in fig. 2. The model processes text with a probabilistic inference algorithm and needs no manual annotation of the initial documents before modeling; it can identify the implicit topic information in a document, better preserves the internal relations of the document, and has achieved good practical effects in text semantic analysis, information retrieval and other fields.
In fig. 2, α and β are Dirichlet prior parameters, wherein:
α is the Dirichlet prior parameter of the distribution of each anchor's text data (such as the anchor introduction) over the topics;
β is the Dirichlet prior parameter of the topic-word distribution of all anchor text data (such as the anchor introductions);
θ is the topic distribution of each anchor's text data (such as the anchor introduction), generated by sampling under the condition of topic number K (the optimal topic number is adopted in this work);
φ is the topic-word distribution of all anchor text data (such as the anchor introductions);
z is the topic sampled to generate each anchor's text data (such as the anchor introduction);
w is the topic word generated by sampling;
m is the number of anchor text data (i.e. the number of anchor introductions);
n is the number of words in a document (i.e. a piece of anchor text data).
the LDA topic model generation process is as follows:
2.2, determining the number of topics: according to the initial data set, the topic number K of the LDA topic model is determined (this is prior art), and the optimal topic number K is obtained by a perplexity evaluation method (different topic numbers K yield different perplexities; the lower the perplexity, the stronger the generalization ability of the topic model under the corresponding K value). The perplexity is calculated as:
perplexity(D) = exp( − Σᵢ log P(wᵢ) / Σᵢ Nᵢ ) (i = 1, …, M)
wherein M is the number of anchor text data (such as anchor introductions), in this work the number of anchor introductions; Nᵢ is the total number of words appearing in the text data of the i-th anchor; wᵢ denotes the words constituting the i-th anchor's text data; and P(wᵢ) is the generation probability of wᵢ.
in order to ensure the clustering effect, obtaining the confusion degree of all the topic numbers K with the topic number K within 10, and selecting the inflection point of the confusion degree as the optimal topic number K=3 according to the elbow method;
2.3, constructing the LDA topic model;
in Dirichlet distributions with prior parameters α and β, the topic distribution θ of each anchor's text data (such as the anchor introduction) under the optimal topic number K and the topic-word distribution φ of all anchor text data (such as the anchor introductions) are generated by sampling.
α is the Dirichlet prior parameter of the distribution of each anchor's text data (such as the anchor introduction) over the topics;
β is the Dirichlet prior parameter of the topic-word distribution of all anchor text data (such as the anchor introductions).
From the topic distribution θ of each anchor's text data (such as the anchor introduction), the topic Z of that text data is generated by sampling. The LDA topic model assumes that each anchor's text data is composed of words combined in different proportions, reflecting its unique topics; the combination proportion obeys a multinomial distribution, expressed as:
Z|θ=Multinomial(θ)
From the topic-word distribution φ of all anchor text data (such as the anchor introductions), the topic words W are generated by sampling. Each topic consists of words in the anchor introductions, and the combination proportion also obeys a multinomial distribution, expressed as:
W|z,φ = Multinomial(φ)
Wherein, the probability distribution of the word wᵢ in the i-th anchor introduction is calculated by the following formula:
P(wᵢ) = Σₛ P(wᵢ|z=s)·P(z=s|i) (s = 1, …, K)
wherein P(wᵢ|z=s) is the probability that the word wᵢ belongs to the s-th topic; P(z=s|i) is the probability of the s-th topic in the i-th anchor introduction; and K is the optimal topic number.
2.4 LDA topic model results
The LDA topic model result contains the high-frequency words under each topic and the topic distribution of each anchor introduction; the first 20 high-frequency words of each topic under the optimal topic number K are analyzed, and each topic is defined and interpreted.
The LDA topic model result also contains the probability distribution of each topic in each anchor's text information, which is taken as a data variable of the anchor text information and then included in the cluster analysis.
The invention uses the sklearn package in a Python program for LDA topic modeling and the pyLDAvis visualization tool to present the result; with 3 topics, the topics are interpreted according to their high-frequency words. The first five high-frequency words under the different topics are shown in table 1 below.
TABLE 1
In topic 1, the main vocabulary includes brand, customer service, authority, factory, etc. These words all relate to reputation, so we call this anchor-introduction element reputation. Under this topic, the anchor introduction tends to highlight the anchor's own reputation and brands, and discusses more the safety and reputation of products and services. In contrast, the main vocabulary in topic 2 includes cooperation, business, after-sales, sharing, attention, and the like. The results indicate that this class of anchor-introduction elements focuses on interaction and tends toward interactive behavior between anchor and audience; relationships and emotions play an important role in this topic. Accordingly, the proportion of emotion words in topic 2 is higher, and the anchor introductions under topic 2 are classified as relational or interactive anchor introductions. Topic 3 focuses on the product: it contains a large number of product-specific words, such as commodity, women's clothing, etc., and this element of the anchor introduction tends to highlight the product information itself to prove that the product fits the customer's needs.
On this basis, the distribution of the different topics in the anchor introductions is obtained; some examples are shown in table 2. An anchor can thus know the topic distribution of each introduction and understand the styles and atmospheres of different anchor introductions. This lays a foundation for further exploring the influence of different anchor-introduction elements on live performance, i.e. for finding, from the topic distribution and each anchor's past live effects, the introduction style best suited to that anchor and the anchor's unique interaction preferences and points of interest.
TABLE 2
3. Kmeans cluster analysis
The numerical data to be clustered are first converted, to avoid overly large numerical differences from influencing the clustering result. Data conversion includes normalization and similar processing; in this embodiment only logarithmic transformation and standardization are performed. On this basis, the optimal cluster number is determined according to the silhouette coefficient and the intra-cluster sum of squared errors (SSE), the data are classified by the kmeans clustering algorithm with this cluster number, and the anchor characteristics are analyzed from the result.
3.1, data conversion
To avoid the influence of excessive differences in data values on the clustering, a logarithmic transformation is applied to data whose values may exceed 1000. Afterwards, all clustered data are standardized using statistical-analysis software, with the standardization given by:
z=(x-μ)/σ
wherein x is a specific value of the numerical data; μ is the mean of the numerical data; σ is the standard deviation of the numerical data. The magnitude of the z-value represents the distance, in units of standard deviation, between the original value and the mean; z is negative when the original value is below the mean, and positive otherwise.
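The data-conversion step can be sketched as follows; the fan counts are hypothetical example values, standing in for the anchor indicators whose values may exceed 1000.

```python
# Sketch of section 3.1: log-transform a large-valued indicator, then
# z-standardize with z = (x - mu) / sigma.
import numpy as np

fans = np.array([1200.0, 56000.0, 800.0, 230000.0])  # hypothetical fan counts
x = np.log(fans)                 # logarithmic transformation for large values
z = (x - x.mean()) / x.std()     # z = (x - mu) / sigma

print(np.round(z, 3))            # standardized scores: mean 0, std 1
```

After this transformation every indicator contributes on a comparable scale, so no single variable dominates the kmeans distance computation.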
3.2, determining the clustering number
The cluster number l is determined from the silhouette coefficient and the sum of squared errors (SSE). The silhouette coefficient is calculated as:
f = (1/n)·Σᵢ (bᵢ − aᵢ) / max(aᵢ, bᵢ) (i = 1, …, n)
wherein aᵢ is the average distance between the i-th sample (i.e. anchor) and all other data in the same cluster, quantifying the degree of cohesion within the cluster; bᵢ is the average distance between the i-th sample (i.e. anchor) and the nearest other cluster, quantifying the degree of separation between clusters; n is the number of anchors; and f is the silhouette coefficient over all samples. It is easy to see that if f is smaller than 0, the average distance to elements in the sample's own cluster is larger than that to the nearest other cluster, so the clustering effect is poor; if aᵢ tends to 0, or bᵢ is much greater than aᵢ, then f tends to 1, indicating that the clustering effect is best.
the Sum of Squares Error (SSE) is calculated as follows:
wherein Cq is the qth cluster; mq is the cluster centroid of Cq; p is the sample point in Cq; SSE is the clustering error of all samples and represents the quality of the clustering effect; with the increase of the cluster number l, the aggregation degree of each cluster is gradually increased, and SSE is gradually reduced; when the value of l is increased within a range smaller than the optimal clustering number, the decreasing amplitude of SSE is larger; when the value of L is increased to the optimal cluster number L, the descending amplitude of SSE is suddenly reduced, and then SSE slowly and gradually flattens as the value of L is continuously increased;
the optimal cluster number L is determined based on the largest three points of the contour coefficients 1-9 in combination with the inflection point of SSE.
Using the sklearn package in Python, the silhouette coefficient reaches its two largest values at cluster numbers l = 2 and l = 3, while SSE has its inflection point at l = 3; the optimal cluster number is therefore determined as L = 3.
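The cluster-number selection can be sketched as follows. The synthetic three-blob data set is an illustrative assumption standing in for the standardized anchor indicators; `silhouette_score` gives the silhouette coefficient f and `KMeans.inertia_` is exactly the intra-cluster SSE.

```python
# Sketch: for each candidate cluster number l, compute the silhouette
# coefficient and SSE; pick l by maximal silhouette plus the SSE elbow.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# three well-separated synthetic groups stand in for the anchor data
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0.0, 4.0, 8.0)])

scores = {}
for l in range(2, 6):
    km = KMeans(n_clusters=l, n_init=10, random_state=0).fit(X)
    scores[l] = (silhouette_score(X, km.labels_),  # cohesion vs separation
                 km.inertia_)                      # intra-cluster SSE

best = max(scores, key=lambda l: scores[l][0])
print("best l by silhouette:", best)
```

On this toy data the silhouette coefficient peaks at the true group count; on real anchor data one would, as the embodiment does, cross-check the top silhouette candidates against the SSE inflection point.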
3.3, kmeans clustering and result analysis
3.3.1, randomly selecting L index vectors from the numerical data standardized in step 3.1 as initial center points, wherein L is the optimal cluster number and L > 1.
3.3.2, calculating the distance between each index vector (i.e. each sample) and the L initial center points, and assigning each index vector to the classification of its nearest center point;
3.3.3, after all index vectors (i.e. all samples) are divided into L classifications, calculating the center point (mean) of each classification (each cluster);
3.3.4, iterating the calculations of steps 3.3.2 and 3.3.3 until the center points of the L classifications are equal to those calculated in the previous iteration, or the distance between them is smaller than a specified threshold, at which point the iteration ends.
The center points of the L classifications finally calculated, i.e. the center points of the index vectors, are the feature vectors of the L classifications.
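The iterative procedure of steps 3.3.1–3.3.4 can be sketched from scratch as follows (in practice the embodiment calls a kmeans implementation in Python); the data and function names are illustrative assumptions.

```python
# From-scratch sketch of steps 3.3.1-3.3.4; the returned centers are the
# feature vectors of the L classifications.
import numpy as np

def kmeans(X, L, tol=1e-6, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=L, replace=False)]  # 3.3.1: random initial centers
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                           # 3.3.2: nearest-center assignment
        new = np.array([X[labels == q].mean(axis=0)         # 3.3.3: per-class mean
                        for q in range(L)])
        if np.linalg.norm(new - centers) < tol:             # 3.3.4: stop criterion
            centers = new
            break
        centers = new
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (0.0, 5.0, 10.0)])
labels, centers = kmeans(X, L=3)
print("centers:\n", np.round(centers, 2))
```

Like any kmeans variant, this sketch can converge to a local optimum depending on the random initialization; library implementations mitigate this by restarting from several initializations.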
The kmeans package in Python is used for the calculation, finally obtaining the category to which each anchor belongs. The mean of each indicator of the anchors in each category is calculated, as shown in table 3 below:
as can be seen from Table 3, the first (category 1) anchor introduction focuses on topic 2, i.e., interaction, and the anchor introduction has more shots in the afternoon and evening, larger vermicelli amount and low average commodity price, and can be defined as low-price travel anchor. The anchor introduction of the second type (category 2) anchor focuses on topic 1, namely reputation, almost all-day live broadcast, long live broadcast duration, high average commodity price, and abundant live broadcast experience, but the number of works is small, and can be defined as an ultra-long live broadcast quality stream anchor. The third type (category 3) of the host is focused on the theme 3, namely, the product, the number of works is large, the live broadcast duration is long, and the host can also play in the early morning, but the commodity category is small, and the third type (category 3) of the host can be defined as multi-choice type host (namely, the host image classification).

Claims (1)

1. A method for classifying anchor portraits based on an LDA topic model and a kmeans clustering algorithm, characterized by comprising: converting the anchors' text data into numerical data with an LDA topic model, clustering the anchors' related numerical data with a kmeans clustering algorithm, and establishing the anchor portraits; the method comprises the following steps:
s1, acquiring anchor information from the indicated terminal device to obtain an original data set, and preprocessing the information to obtain an initial data set, wherein the anchor information comprises text data and non-text data; the anchor information comprises anchor numerical data and anchor text data, the anchor text data comprises the anchor introduction and the barrage information of each live broadcast, and the anchor numerical data comprises the number of fans, the distribution of live time periods, the average number of commodity categories carried, the average live duration, the average number of works and the average commodity price;
s2, constructing an LDA topic model according to the initial data set, and mining different topic probability distributions of topic words and each anchor text data from the initial data set;
s3, data conversion, namely carrying out logarithmic processing and standardization on the numerical data of each anchor;
s4, determining the number of cluster categories according to the silhouette coefficient and the intra-cluster sum of squared errors;
s5, clustering the anchors' related numerical data with the kmeans clustering algorithm to obtain the different categories to which the anchors belong, analyzing the anchor characteristics from the result, and building the anchor portraits;
the specific steps of data conversion in the step S3 are as follows:
s31, standardizing numerical data of the anchor needing to be clustered, and expressing the numerical data as follows by a formula:
z=(x-μ)/σ
wherein x is a specific value of the numerical data, μ is the mean of the numerical data, and σ is the standard deviation of the numerical data; the magnitude of the z-value represents the distance between the original value and the mean, measured in units of standard deviation; z is negative when the original value is below the mean, and positive otherwise;
the specific steps of step S4 are as follows:
s41, determining the number of cluster categories according to the silhouette coefficient and the intra-cluster sum of squared errors, wherein the silhouette coefficient is calculated as:
f = (1/n)·Σᵢ (bᵢ − aᵢ) / max(aᵢ, bᵢ) (i = 1, …, n)
wherein aᵢ is the average distance between the i-th sample and all other data in the same cluster, i.e. it quantifies the degree of cohesion within the cluster; bᵢ is the average distance between the i-th sample and the nearest other cluster, used to quantify the degree of separation between clusters; n is the total number of anchors, equal to the number M of anchor text information; f is the silhouette coefficient of all samples; if f is smaller than 0, the average distance to elements in the sample's own cluster is larger than that to the nearest other cluster, and the clustering effect is poor; if aᵢ tends to 0, or bᵢ is much greater than aᵢ, then f tends to 1, indicating that the clustering effect is best;
the sum of squared errors SSE is calculated as:
SSE = Σ_q Σ_{p∈Cq} |p − mq|² (q = 1, …, l)
wherein Cq is the q-th cluster; mq is the cluster centroid of Cq; p is a sample point in Cq; SSE is the clustering error of all samples and reflects the quality of the clustering effect; as the cluster number l increases, the cohesion of each cluster gradually increases and SSE gradually decreases; while l is smaller than the optimal cluster number, increasing l reduces SSE sharply; once l reaches the optimal cluster number L, the decrease in SSE drops abruptly, and then SSE flattens slowly as l continues to increase;
determining the optimal cluster number L from the three cluster numbers with the largest silhouette coefficients among l = 1–9, combined with the inflection point of SSE;
s42, randomly selecting L index vectors from the numerical data standardized in the step S31 as initial center points, wherein L is more than 1;
s43, after the initial center points are selected, calculating the distance from each index vector to the L initial center points, and assigning each index vector to the classification of its nearest center point;
s44, after the index vectors are divided into L classifications, calculating the center point of each classification;
s45, iterating the calculations of steps S43 and S44 until the center points of the L classifications are equal to those calculated in the previous iteration, or the distance between them is smaller than a specified threshold, at which point the iteration ends; the finally obtained center points of the L classifications, i.e. the center points of the index vectors, are the feature vectors of the L classifications;
in step S1, the specific steps of acquiring the anchor information from the indicated terminal device to obtain the original data set and preprocessing the information to obtain the initial data set are as follows:
s11, acquiring the anchors' text data and numerical data, and screening out live broadcasts containing missing values to obtain the original data set;
s12, on the basis of the step S11, performing text word segmentation on the original data set to obtain word segmentation word sets;
s13, collecting stop words according to the stop word list, constructing a related dictionary, and removing the stop words of the word segmentation vocabulary to obtain an initial data set;
in the step S2, the specific steps of constructing the LDA theme model are as follows:
s21, determining the topic number K of the LDA topic model according to the initial data set, and obtaining the optimal topic number K by a perplexity evaluation method, wherein the perplexity is calculated as:
perplexity(D) = exp( − Σᵢ log P(wᵢ) / Σᵢ Nᵢ ) (i = 1, …, M)
wherein M is the number of anchor text data; Nᵢ is the total number of words appearing in the text data of the i-th anchor; wᵢ denotes the words constituting the i-th anchor's text data; P(wᵢ) is the generation probability of wᵢ; in order to ensure the clustering effect, the perplexity is obtained for all topic numbers K within 10, and the inflection point of the perplexity curve is selected by the elbow method as the optimal topic number K;
s22, in Dirichlet distribution with the prior parameters of alpha and beta, sampling to generate topic distribution theta of text data of each anchor under the condition of topic number K and topic word distribution phi of all anchor text data;
α is the Dirichlet prior parameter of the distribution of each anchor introduction over the topics; β is the Dirichlet prior parameter of the topic-word distribution of all anchor introductions;
s23, sampling the topic Z of each anchor's text data from its topic distribution θ, wherein the LDA topic model assumes that each anchor's text data is composed of words combined in different proportions, reflecting its unique topics, and the combination proportion obeys a multinomial distribution, expressed as:
Z|θ=Multinomial(θ)
from the topic-word distribution φ of all anchor text data, the topic words W are generated by sampling; each topic consists of words in the anchor text data, and the combination proportion also obeys a multinomial distribution, expressed as: W|z,φ = Multinomial(φ)
wherein wᵢ denotes the words constituting the i-th anchor's text data, and its probability distribution is calculated as:
P(wᵢ) = Σₛ P(wᵢ|z=s)·P(z=s|i) (s = 1, …, K)
wherein P(wᵢ|z=s) is the probability that the word wᵢ belongs to the s-th topic; P(z=s|i) is the probability of the s-th topic in the i-th anchor introduction; and K is the optimal topic number;
s24, the LDA topic model result contains the high-frequency words under each topic and the topic distribution of each anchor text data; the first 20 high-frequency words of each topic under the optimal topic number K are analyzed, and each topic is defined and interpreted;
s25, the LDA topic model result also contains probability distribution of each topic in each anchor text data, and the probability distribution is taken as a data variable of the anchor text data and is included in cluster analysis.
CN202310157141.3A 2023-02-23 2023-02-23 Anchor image classification method based on LDA theme model and kmeans clustering algorithm Active CN116127074B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310157141.3A CN116127074B (en) 2023-02-23 2023-02-23 Anchor image classification method based on LDA theme model and kmeans clustering algorithm


Publications (2)

Publication Number Publication Date
CN116127074A CN116127074A (en) 2023-05-16
CN116127074B true CN116127074B (en) 2024-03-01

Family

ID=86297371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310157141.3A Active CN116127074B (en) 2023-02-23 2023-02-23 Anchor image classification method based on LDA theme model and kmeans clustering algorithm

Country Status (1)

Country Link
CN (1) CN116127074B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106815310A (en) * 2016-12-20 2017-06-09 华南师范大学 A kind of hierarchy clustering method and system to magnanimity document sets
CN110689040A (en) * 2019-08-19 2020-01-14 广州荔支网络技术有限公司 Sound classification method based on anchor portrait
CN115630644A (en) * 2022-11-09 2023-01-20 哈尔滨工业大学 Topic mining method of live broadcast user barrage based on LDA topic model


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Ziping; Li Xueming. Topic sentence clustering based on improved LDA and K-means algorithms. Journal of Computer Applications, 2016 (S2), 244-246+255. *
Wang Lifang et al. Innovation of the Electric Energy User Management System under the Background of Clean Energy. China Machine Press, 2022, 169-175. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant