CN116303893A

CN116303893A - Method for classifying anchor image and analyzing key characteristics based on LDA topic model

Info

Publication number: CN116303893A
Application number: CN202310161332.7A
Authority: CN
Inventors: 吴少辉; 谢晓东; 王洪珑; 李子菲
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2023-02-23
Filing date: 2023-02-23
Publication date: 2023-06-23
Anticipated expiration: 2043-02-23
Also published as: CN116303893B

Abstract

An anchor image classification and key feature analysis method based on an LDA topic model, belonging to the technical field of data analysis. The method comprises the following steps: S1, acquiring, via a terminal device, the introduction text of each anchor to obtain an original data set; S2, preprocessing the introduction texts in the original data set to obtain an initial data set; S3, constructing an LDA topic model from the initial data set; S4, mining, through the LDA topic model, the high-frequency words of each topic and the topic distribution of each anchor introduction from the initial data set, determining the number of topics, and classifying each anchor by the topic with the highest value in its topic distribution, which is taken as the anchor image; S5, using analysis of variance to obtain the distinguishing characteristics of the different anchor groups and the differences in their live broadcast effects; S6, based on those characteristics and differences, using regression analysis to obtain the key characteristics that affect the live broadcast effect within each anchor group. The method is used for anchor image classification and key feature analysis.

Description

Method for classifying anchor image and analyzing key characteristics based on LDA topic model

Technical Field

The invention relates to the technical field of data analysis, in particular to a method for classifying anchor images and analyzing key characteristics based on an LDA topic model.

Background

The anchor introduction is the key text through which, in today's live shopping environment, an anchor uses the personal information page of a live shopping platform to present his or her characteristics to consumers and companies, state the live broadcast content, issue statements, and remind audiences and companies of live-broadcast-related information. With the rapid development of information technology and e-commerce, more and more viewers learn about an anchor and the anchor's live content through the anchor introduction, and then follow the anchor, make purchases, and so on. Anchors in turn use the introduction as an important way to present their style and brand characteristics, so as to stand out, promote themselves, and guide the audience to purchase. However, which anchor personas or anchor portraits exist in the current anchor population? How do these different types of anchors introduce themselves? Do the live broadcast effects of different types of anchors differ, which characteristics drive those differences, and what resources or behaviors are needed to improve the live broadcast effect of a given type of anchor? At present, the elements of an anchor introduction and their proportions cannot be clearly defined, and no guidance can be given on how to write one; as a result, the anchor's self-presentation and published content drift away from user preferences, and precise marketing and personal brand building cannot be achieved. Nor have anchor characteristics been compared in combination with live broadcast effects, so the direction in which anchors with different personas should direct their efforts is unknown. Existing research on this problem relies mostly on experimental and qualitative methods, which cannot analyze large volumes of text data in depth.
Meanwhile, existing anchor-portrait methods that can handle big data often require manual coding, and their information processing and mining depend on manual labels (for example, sound classification methods based on anchor portraits). Moreover, few studies apply natural language processing to personal introductions or examine live broadcast effects, the data samples collected are small, and the text content is insufficiently mined. Consequently, companies find it difficult to learn about an anchor and the anchor's personal image accurately and quickly, anchors find it difficult to introduce themselves accurately and effectively, research based on the characteristics of anchor introductions cannot be carried out in depth, and the live broadcast effects and key characteristics of anchors cannot be mined.

Through natural language processing and machine learning, the core content of a large amount of text data (anchor introductions) can be extracted quickly: the emphases and categories of the introductions are identified, the proportions of the different topics in each introduction are mined, each anchor is classified by the topic with the largest share in the anchor's introduction, and a portrait is drawn from the topic-word distribution (that is, which features each type of anchor has); at the same time, the live broadcast effects and distinctive characteristics of the different types of anchors are compared. This is of considerable significance for achieving accurate introduction and content presentation between anchors and audiences, mining and comparing key characteristics, and improving the communication efficiency of live broadcast participants and the immersive experience of live broadcasts.

Disclosure of Invention

The invention provides an anchor image classification and key feature analysis method based on an LDA topic model, which can analyze and classify the introduction of anchors (namely, can classify the anchor image and analyze the key feature).

The technical scheme adopted by the invention is as follows:

A method for classifying anchor images and analyzing key characteristics based on an LDA topic model, which uses the LDA topic model to obtain different topic groups, determines the differences in live broadcast effect among the anchor groups, and mines the key characteristics affecting the live broadcast effect of each group, the method comprising the following steps:

S1, acquiring, via a terminal device, the introduction text of each anchor to obtain an original data set;

S2, preprocessing the introduction texts in the original data set to obtain an initial data set;

S3, constructing an LDA topic model from the initial data set;

S4, mining, through the LDA topic model, the high-frequency words of each topic and the topic distribution of each anchor introduction from the initial data set, determining the topic number K, and classifying each anchor by the topic with the highest value in its topic distribution, which is taken as the anchor image;

S5, using analysis of variance to obtain the distinguishing characteristics of the different anchor groups and the differences in their live broadcast effects;

S6, based on the distinguishing characteristics and live broadcast effect differences obtained in step S5, using regression analysis to obtain the key characteristics affecting the live broadcast effect within each anchor group.

Further, in the step S2, the specific steps of performing data preprocessing on the introduction text in the original dataset are as follows:

S21, removing anchors whose anchor introduction is empty;

S22, on the basis of step S21, performing text word segmentation on the original data set to obtain a segmented vocabulary set;

S23, collecting stop words into a stop-word list and removing them from the segmented vocabulary set to obtain the initial data set.
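Steps S21 to S23 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `preprocess`, the sample texts and the stop-word set are hypothetical, and the default tokenizer is a plain whitespace split so that the sketch runs without extra packages. For Chinese text, a segmenter such as `jieba.lcut` could be passed as `tokenize` instead.

```python
from typing import Callable, Iterable, List, Set


def preprocess(
    intros: Iterable[str],
    stop_words: Set[str],
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[List[str]]:
    """Apply S21-S23: drop empty introductions, segment, remove stop words."""
    dataset: List[List[str]] = []
    for text in intros:
        if not text or not text.strip():
            continue  # S21: skip anchors whose introduction is empty
        tokens = tokenize(text)  # S22: text word segmentation
        # S23: remove stop words (and any whitespace-only tokens)
        kept = [t for t in tokens if t.strip() and t not in stop_words]
        dataset.append(kept)
    return dataset


# Hypothetical introductions; real data would be the crawled anchor texts.
raw = ["official brand store", "", "   ", "share the daily outfits"]
initial = preprocess(raw, stop_words={"the"})
print(initial)
```

Passing the tokenizer as a parameter keeps the cleaning logic independent of the segmentation tool.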

Further, in the step S3, the specific steps of constructing the LDA theme model are as follows:

S31, determining the topic number K of the LDA topic model from the initial data set, the optimal topic number K being obtained by perplexity (confusion degree) evaluation, where the perplexity is calculated as:

$$\mathrm{Perplexity} = \exp\left(-\frac{\sum_{i=1}^{M} \log p(w_i)}{\sum_{i=1}^{M} N_i}\right)$$

where $M$ is the number of anchor introductions, $N_i$ is the number of words appearing in the anchor introduction of the $i$-th anchor, $w_i$ denotes the words making up the anchor introduction of the $i$-th anchor, and $p(w_i)$ is the probability of generating $w_i$ under topic number K;

to ensure the clustering effect, the perplexity is obtained for every topic number K up to 10; following the elbow method, the inflection point of the perplexity curve is selected as the optimal topic number K;

S32, under the optimal topic number K, sampling from Dirichlet distributions with prior parameters α and β to generate the topic distribution θ of each anchor introduction and the topic-word distribution φ of all anchor introductions;

α represents the Dirichlet prior parameter of each anchor introduction's distribution over topics;

β represents the Dirichlet prior parameter of the topic-word distribution of all anchor introductions;

S33, sampling the topic Z of each anchor introduction from the topic distribution θ of that introduction, wherein the LDA topic model assumes that each anchor introduction is composed of word combinations in different proportions that reflect the topics particular to it, expressed as:

$$Z \mid \theta = \mathrm{Multinomial}(\theta)$$

from the topic-word distribution φ of all anchor introductions, the topic words W are generated by sampling; each topic k is composed of words in the anchor introductions, and the combination proportions likewise follow a multinomial distribution, expressed as:

$$W \mid Z, \varphi = \mathrm{Multinomial}(\varphi)$$

wherein the probability distribution of the words $w_i$ making up the anchor introduction of the $i$-th anchor is calculated as:

$$P(w_i \mid i) = \sum_{s=1}^{K} P(w_i \mid z = s)\, P(z = s \mid i)$$

wherein $P(w_i \mid z = s)$ represents the probability that word $w_i$ belongs to the $s$-th topic; $P(z = s \mid i)$ represents the probability of the $s$-th topic in the $i$-th anchor introduction; $K$ is the optimal topic number; and $P(w_i \mid i)$ represents the resulting probability distribution;
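The word probability in step S33 is a mixture over topics, which a short Python sketch can make concrete. The numbers below are hypothetical and chosen only for illustration; `word_probability`, `theta_i` and `phi` are assumed names, not identifiers from the invention.

```python
from typing import Dict, List


def word_probability(
    word: str, doc_topics: List[float], topic_words: List[Dict[str, float]]
) -> float:
    """P(w | i) = sum over topics s of P(w | z=s) * P(z=s | i)."""
    return sum(
        p_topic * topic_words[s].get(word, 0.0)
        for s, p_topic in enumerate(doc_topics)
    )


# Hypothetical values for K = 3 topics and a three-word vocabulary.
theta_i = [0.7, 0.2, 0.1]  # P(z=s | i) for one anchor introduction
phi = [                    # P(w | z=s) for each topic s
    {"brand": 0.5, "share": 0.1, "dress": 0.4},
    {"brand": 0.1, "share": 0.8, "dress": 0.1},
    {"brand": 0.2, "share": 0.1, "dress": 0.7},
]
p_brand = word_probability("brand", theta_i, phi)
print(round(p_brand, 2))  # 0.7*0.5 + 0.2*0.1 + 0.1*0.2 = 0.39
```

Because each row of `phi` and the vector `theta_i` sum to 1, the mixture probabilities over the whole vocabulary also sum to 1.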

further, in the step S4, a topic high-frequency word and a topic distribution which is self-introduced by each anchor are mined from an initial data set through an LDA topic model, a topic number K is determined, and the topic is classified according to the highest value of the topic distribution as the anchor image, and the specific steps are as follows:

s41, analyzing the top 20 high-frequency words of each topic K under the optimal topic number K, and defining and explaining each topic K at the same time, wherein the LDA topic model result contains the high-frequency words of each topic K and topic distribution theta introduced by each anchor;

S42, to prevent the same high-frequency word from appearing under different topics k and distorting the interpretation of the topics, the topic-word relevance is used to control which subordinate words of a given topic k are displayed:

$$r(w, k \mid \lambda) = \lambda \log \varphi_{kw} + (1 - \lambda) \log \frac{\varphi_{kw}}{p(w)}$$

wherein w represents a word in the corpus; k represents a topic; $\varphi_{kw}$ is the probability of word w under topic k in the topic-word distribution φ; p(w) is the marginal probability of word w in the topic-word distribution φ of all anchor introductions; and r(w, k | λ) represents the relevance of word w to topic k.

When λ=0, the distinctive, relatively independent subordinate words of topic k are displayed, that is, words that tend to appear only in this topic; when λ=1, the subordinate words with higher distribution probability are displayed, but such high-probability words often do not belong to this topic alone and also belong to other topics; the user adjusts the relevance of word w to topic k, i.e. r(w, k | λ), by setting the value of λ;

s43, classifying according to the highest value of the topic distribution as the image of the anchor, and explaining the classification of the anchor according to the relatively independent hyponyms and the hyponyms with high distribution probability in the result of the step S42.
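The effect of λ in step S42 and the classification rule of step S43 can be illustrated as follows. This is a sketch that assumes the usual topic-word relevance form, λ·log φ_kw + (1−λ)·log(φ_kw / p(w)); the probabilities and the function names `relevance`, `rank_terms` and `classify` are hypothetical.

```python
import math
from typing import Dict, List


def relevance(phi_kw: float, p_w: float, lam: float) -> float:
    """r(w, k | lam) = lam*log(phi_kw) + (1 - lam)*log(phi_kw / p_w)."""
    return lam * math.log(phi_kw) + (1.0 - lam) * math.log(phi_kw / p_w)


def rank_terms(topic: Dict[str, float], marginal: Dict[str, float], lam: float) -> List[str]:
    """Order the words of one topic from most to least relevant."""
    return sorted(topic, key=lambda w: relevance(topic[w], marginal[w], lam), reverse=True)


def classify(theta: List[float]) -> int:
    """S43: index of the topic with the highest value in the topic distribution."""
    return max(range(len(theta)), key=theta.__getitem__)


# Hypothetical probabilities: "live" is frequent everywhere, "factory" is rare
# overall but concentrated in this topic.
topic_k = {"live": 0.5, "brand": 0.3, "factory": 0.2}
p_w = {"live": 0.6, "brand": 0.1, "factory": 0.05}

print(rank_terms(topic_k, p_w, lam=1.0))  # ranks by probability within the topic
print(rank_terms(topic_k, p_w, lam=0.0))  # ranks by exclusivity to the topic
print(classify([0.2, 0.7, 0.1]))          # anchor assigned to topic index 1
```

With λ=1 the common word "live" comes first, while with λ=0 the topic-specific word "factory" does, which is the behaviour the step describes.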

Further, in the step S5, variance analysis is used to obtain the difference characteristics among different anchor groups, so as to know the difference of the live broadcast effects of the different anchor groups; the method comprises the following specific steps:

S51, taking the logarithm of the anchors' characteristic data and effect data to reduce the influence of extreme values and to bring skewed data closer to a normal distribution;

S52, using analysis of variance to analyze the differences in live broadcast characteristics and effects among the anchor groups, analysis of variance being used to test for differences between categorical data and quantitative data, where the categorical data is the anchor group and the quantitative data is the live broadcast effect;
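Steps S51 and S52 amount to a log transform followed by a one-way analysis of variance. The sketch below computes only the F statistic by hand so that it needs no extra packages; a library routine such as `scipy.stats.f_oneway` would also report the p-value. The sales figures are hypothetical, and the transform assumes every value is strictly positive.

```python
import math
from typing import Dict, List


def log_transform(values: List[float]) -> List[float]:
    """S51: natural logarithm of each value (assumes every value is > 0)."""
    return [math.log(v) for v in values]


def one_way_anova_f(groups: Dict[str, List[float]]) -> float:
    """S52: F = (between-group mean square) / (within-group mean square)."""
    values = [v for g in groups.values() for v in g]
    grand_mean = sum(values) / len(values)
    ss_between = 0.0
    ss_within = 0.0
    for g in groups.values():
        mean = sum(g) / len(g)
        ss_between += len(g) * (mean - grand_mean) ** 2
        ss_within += sum((v - mean) ** 2 for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical average sales per anchor, grouped by anchor image.
sales_by_group = {
    "reputation": [1200.0, 950.0, 15000.0],
    "interaction": [300.0, 420.0, 380.0],
    "product": [5000.0, 7200.0, 6100.0],
}
logged = {name: log_transform(v) for name, v in sales_by_group.items()}
f_stat = one_way_anova_f(logged)
print(f"F = {f_stat:.3f}")
```

A large F value relative to the F distribution with (groups − 1, observations − groups) degrees of freedom indicates that the group means differ.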

further, in the step S6, regression analysis is used to obtain key characteristics affecting the live broadcast effect in each anchor group; the method comprises the following specific steps:

s61, in each anchor group, establishing a regression equation by taking anchor characteristics as independent variables and live effect as dependent variables,

$$y_i = k_1 x_{i1} + k_2 x_{i2} + k_3 x_{i3} + \dots + k_n x_{in} + b + c$$

where $y_i$ represents the sales of the $i$-th anchor; $x_{i1}, \dots, x_{in}$ represent the n characteristic variables of the $i$-th anchor; b represents the intercept term; c represents the residual term; and $k_1, \dots, k_n$ represent the coefficients corresponding to the n characteristics;

s62, for each anchor, selecting the largest k value, namely the largest influence factor of the anchor, and further analyzing the importance of the variable according to the k values corresponding to different anchor characteristics.
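The regression of steps S61 and S62 can be sketched with ordinary least squares solved through the normal equations. Everything below is illustrative: the feature names and data are hypothetical, the response is constructed from known coefficients so that the fit can be checked, and selecting the key characteristic by the largest absolute coefficient is an assumption (it is only meaningful when the characteristics are on comparable scales, for example after the logarithmic processing of step S51).

```python
from typing import List, Tuple


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ols(xs: List[List[float]], ys: List[float]) -> Tuple[List[float], float]:
    """Return (k_1..k_n, b) minimising the squared residuals c."""
    rows = [x + [1.0] for x in xs]  # constant column carries the intercept b
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    beta = solve(xtx, xty)
    return beta[:-1], beta[-1]


# Hypothetical characteristics for five anchors in one group.
features = ["follower_count", "live_duration"]
xs = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]]
ys = [4.0, 5.5, 9.5, 10.5, 15.0]  # built as 2*x1 + 0.5*x2 + 1 for illustration

coeffs, intercept = fit_ols(xs, ys)
key = features[max(range(len(coeffs)), key=lambda j: abs(coeffs[j]))]
print(key, [round(k, 3) for k in coeffs], round(intercept, 3))
```

Running the fit separately for each anchor group gives one coefficient vector per group, which is what allows the influence of a characteristic to be compared across groups.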

Compared with the prior art, the invention has the following beneficial effects. The invention provides a method for classifying anchor images and analyzing key characteristics based on an LDA topic model: the LDA topic model is first used to mine the anchor introductions, which serve as the corpus for analysis, and the proportions of high-frequency feature words and of the different topics are extracted. The LDA topic model used is an unsupervised model that needs only the anchor introduction data (i.e. the introduction texts) as a corpus and a specified number of topics, and can be trained without labels, so the method is easy to implement. From the results, the method can identify the different dimensions of the anchor introduction content and their proportions (obtained by analyzing the topic distribution and topic-word distribution of each anchor's introduction), overcoming the shortcomings of existing analysis methods based on personal introductions and analyzing the anchor introduction content quickly, efficiently, and accurately. With the LDA topic model, the invention matches each anchor introduction to its most relevant topic through the probability distribution over topics in that introduction, giving a deeper understanding of how live e-commerce anchors interact with audiences and promote their brands, laying a foundation for further exploring how different introduction emphases affect anchors' live broadcast performance, and providing effective support services for anchors in live broadcast rooms.
The method is fast, accurate, and easy to implement; it supports semantic analysis of anchor introductions (i.e. analysis of anchors' text data), can be widely used for live broadcast effect analysis, and provides suggestions for anchors. It addresses the problem that existing text classification methods usually take a subjective, qualitative perspective: by classifying anchor introductions through machine learning, it improves classification accuracy and takes full account of the heterogeneity of individual anchors. The analysis method can be widely applied to anchor introductions and is suitable for all kinds of live broadcasts.

Drawings

FIG. 1 is a block flow diagram of an embodiment 1 of the method for classification of anchor images and analysis of key features based on an LDA topic model of the present invention;

fig. 2 is a simplified schematic diagram of an LDA topic model.

In Fig. 2, α and β are Dirichlet prior parameters, where: α represents the Dirichlet prior parameter of each anchor introduction's distribution over topics; β represents the Dirichlet prior parameter of the topic-word distribution of all anchor introductions; θ represents the sampled topic distribution of each anchor introduction; φ represents the topic-word distribution of all anchor introductions; M represents the number of anchor introductions (number of texts); $N_i$ represents the total number of words appearing in the anchor introduction of the $i$-th anchor; Z denotes the sampled topic of each anchor introduction; W denotes the sampled topic word.

Detailed Description

The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the invention, but not all embodiments, and all other embodiments obtained by those skilled in the art without making creative efforts based on the embodiments of the present invention are all within the protection scope of the present invention.

First embodiment: this embodiment provides an anchor image classification and key feature analysis method based on an LDA topic model, which uses the LDA topic model to obtain different topic groups, determines the differences in live broadcast effect among the anchor groups, and mines the key characteristics affecting the live broadcast effect of each group. The method comprises the following steps:

S3, constructing an LDA topic model from the initial data set;

s4, mining topic high-frequency words and topic distribution which is self-introduced by each anchor from an initial data set through an LDA topic model, determining a topic number K, and classifying the anchor image according to the highest value of the topic distribution (the step is to obtain different topic groups by utilizing the LDA topic model; namely classifying the anchor groups);

s5, obtaining the difference characteristics among different anchor groups by using analysis of variance (ANOVA), and knowing the difference of live broadcast effects of the different anchor groups (the difference characteristics among the different anchor groups are obtained in the anchor group classification of the step S4, namely, the anchor groups of different types are compared);

s21, screening out the anchor with empty anchor introduction content;

S31, determining the topic number K of the LDA topic model (which is prior art) from the initial data set, the optimal topic number K being obtained by perplexity (confusion degree) evaluation (different topic numbers K yield different perplexities; the lower the perplexity, the stronger the generalization ability of the topic model at the corresponding K), where the perplexity is calculated as:

$$\mathrm{Perplexity} = \exp\left(-\frac{\sum_{i=1}^{M} \log p(w_i)}{\sum_{i=1}^{M} N_i}\right)$$

where $M$ is the number of anchor introductions; $N_i$ is the total number of words appearing in the anchor introduction of the $i$-th anchor; $w_i$ denotes the words making up the anchor introduction of the $i$-th anchor; and $p(w_i)$ is the probability of generating $w_i$ under topic number K;

S32, under the optimal topic number K, sampling from Dirichlet distributions with prior parameters α and β to generate the topic distribution θ of each anchor introduction and the topic-word distribution φ of all anchor introductions;

S33, sampling the topic Z of each anchor introduction from the topic distribution θ of that introduction:

$$Z \mid \theta = \mathrm{Multinomial}(\theta)$$

from the topic-word distribution φ of all anchor introductions, the topic words W are generated by sampling; each topic k is composed of words in the anchor introductions, and the combination proportions likewise follow a multinomial distribution, expressed as:

$$W \mid Z, \varphi = \mathrm{Multinomial}(\varphi)$$

the probability distribution of the words $w_i$ making up the anchor introduction of the $i$-th anchor is calculated as:

$$P(w_i \mid i) = \sum_{s=1}^{K} P(w_i \mid z = s)\, P(z = s \mid i)$$

wherein $P(w_i \mid z = s)$ represents the probability that word $w_i$ belongs to the $s$-th topic; $P(z = s \mid i)$ represents the probability of the $s$-th topic in the $i$-th anchor introduction; $K$ is the optimal topic number; and $P(w_i \mid i)$ represents the resulting probability distribution;

S42, to prevent the same high-frequency word from appearing under different topics k and distorting the interpretation of the topics, the topic-word relevance is used to control which subordinate words of a given topic k are displayed:

$$r(w, k \mid \lambda) = \lambda \log \varphi_{kw} + (1 - \lambda) \log \frac{\varphi_{kw}}{p(w)}$$

where p(w) is the marginal probability of word w in the topic-word distribution φ of all anchor introductions, and r(w, k | λ) represents the relevance of word w to topic k. When λ=0, the distinctive, relatively independent subordinate words of topic k are displayed (the topic-word distribution φ is composed of the relevances $\varphi_{kw}$ between individual words and individual topics), that is, words that tend to appear only in this topic; when λ=1, the subordinate words with higher distribution probability are displayed, but such high-probability words often do not belong to this topic alone and also belong to other topics; the user adjusts the relevance of word w to topic k, i.e. r(w, k | λ), by setting the value of λ;

s43, classifying the image of the anchor according to the highest value of the topic distribution (namely the value of the largest topic distribution in the step S41), and explaining the classification of the anchor according to the relatively independent hyponyms and hyponyms with high distribution probability in the result of the step S42.

Further, in the step S5, variance analysis (ANOVA) is used to obtain the difference characteristics among different anchor groups, so as to know the difference of live broadcast effects of the different anchor groups; the method comprises the following specific steps:

S51, taking the logarithm of the anchors' characteristic data (such as follower count and live broadcast duration) and effect data (such as follower count and live broadcast sales volume) to reduce the influence of extreme values and to bring skewed data closer to a normal distribution;

S52, using analysis of variance (ANOVA, which is prior art) to analyze the differences in live broadcast characteristics and effects (such as differences in live broadcast sales) among the anchor groups, analysis of variance being used to test for differences between categorical data and quantitative data, where the categorical data is the anchor group and the quantitative data is the live broadcast effect (such as live broadcast sales, likes, and so on);

S61, in each anchor group, establishing a regression equation with the anchor characteristics (such as follower count, live broadcast duration, and so on) as independent variables and the live broadcast effect as the dependent variable,

$$y_i = k_1 x_{i1} + k_2 x_{i2} + k_3 x_{i3} + \dots + k_n x_{in} + b + c$$

s62, for each anchor, selecting the largest k value, namely the largest influence factor of the anchor, and further analyzing the importance of the variable according to the k values corresponding to different anchor characteristics (including live broadcast duration and the like).

Example 1:

This embodiment discloses a method for classifying anchor images and analyzing key characteristics based on an LDA topic model. The LDA topic model is used to mine the different topics and topic words in the anchor introductions and to classify and extract their content elements; on this basis the anchors are divided into groups and their images are derived, the average sales of each anchor are taken as the measure of live broadcast effect, and the live broadcast effects of anchors with different images are explored. Multiple regression analysis is then performed on the influence of anchor characteristics on the live broadcast effect within each group, and the differences in that influence across groups are compared, so as to guide anchors to better introduce and present themselves, and their live content, in live broadcasts.

1. Study data and methods

1. Study data

With the development of mobile internet technology, live broadcasting is increasingly favored by audiences, and anchors with all kinds of images have appeared on live broadcast platforms. This embodiment uses the anchor introductions of 2,067 anchors on the Douyin platform.

2. Research method

With the development of science and technology, live broadcasting has greatly enriched audiences' lives through its convenience and immersiveness, and audiences increasingly rely on live shopping. Each anchor's introduction is also an important stimulus for consumers watching a live broadcast and influences whether viewers are motivated to purchase. As an important communication channel between anchor and audience, the anchor introduction informs the audience of the live content and the anchor's characteristics, thereby building the anchor's personal brand and helping the audience know and trust it. This embodiment mines the anchor introductions, analyzes their different content sections, and finely classifies the proportions of image characteristics in each introduction (the distribution of different topics in each anchor introduction) in order to divide the anchors into groups; on this basis it explores the differences in live broadcast effect associated with the personal images of the different anchor groups and the differing influence of anchor characteristics on the live broadcast effect. The specific steps are as follows (as shown in Fig. 1):

(1) Data preprocessing: the raw anchor-introduction data from Douyin is obtained with a purpose-designed Python crawler program and then preprocessed; preprocessing mainly comprises data cleaning, Jieba word segmentation (i.e. text word segmentation), and stop-word removal.

(2) Topic model analysis: the LDA topic model is adopted to identify hidden different content elements (namely topic distribution) in the anchor introduction, the topics and the corresponding high-frequency words are mined, the anchor is divided into groups according to the corresponding maximum probability distribution in each anchor introduction, and the images of the anchor groups are summarized.

(3) Key characteristic analysis: the data are transformed with the natural logarithm (i.e. data conversion), and analysis of variance is performed to explore the differences in live broadcast effect among the anchor groups. Regression analysis is then performed to explore the differing influence of anchor characteristics on the live broadcast effect (i.e. the anchor characteristics are analyzed on the basis of the analysis results).

2. Experiment and analysis

1. Data source and preprocessing

Through the third-party platform Zhigua, all product-selling live broadcasts on the Douyin platform from May to October 2021 were selected and the information data of each live broadcast was collected; after removing live broadcasts whose anchor introduction was empty, a total of 2,067 anchor introductions were obtained.

Because the content of anchor introductions is fairly free-form, the raw data generally needs to be preprocessed to improve its reliability. The specific process is as follows:

(1) Removing special characters through Excel screening;

(2) Text word segmentation is carried out in a Python program by utilizing a Jieba word segmentation software package;

(3) Collecting a stop word library, manufacturing a stop word list, and removing the stop word by using a Python program;

2. topic model analysis

During live broadcasts, different anchors have different introduction styles and elements. Some anchors focus their introduction on the product; some share their own experiences in the hope of creating emotional resonance with the audience; others highlight their own reputation and services. The invention uses an LDA topic model to mine the topics of the anchor introductions, obtaining the feature words under each introduction topic and the proportions of the topic elements, and derives the different types of anchor groups from the maximum value of the topic distribution.

2.1, determining the optimal topic number of the LDA topic model; this embodiment uses the perplexity (confusion degree) algorithm to determine the range of the optimal topic number.

The perplexity algorithm is as follows:

The LDA topic model requires the topic number K of the texts to be set in advance, and only a well-chosen K yields a reasonable classification: if K is too large, the semantic information of some topics becomes indistinct; if K is too small, the topic granularity is too coarse. Choosing a suitable topic number K has always been an open question. This embodiment combines the perplexity algorithm with the interpretability of the topics in the LDA topic model results to determine the optimal topic number K. Perplexity represents the uncertainty about which topic a document (i.e. each anchor introduction) belongs to and is inversely related to the clustering effect: the lower the perplexity, the better the topic number. It is calculated as:

$$\mathrm{Perplexity} = \exp\left(-\frac{\sum_{i=1}^{M} \log p(w_i)}{\sum_{i=1}^{M} N_i}\right)$$

where $M$ is the number of anchor introductions, which is also the number of anchors; $N_i$ is the total number of words appearing in the anchor introduction of the $i$-th anchor; $w_i$ denotes the words making up the anchor introduction of the $i$-th anchor; and $p(w_i)$ is the probability of generating $w_i$ under topic number K;

A lower perplexity indicates that the trained topics mispredict the words in the test documents less often. At the same time, the best choice should not only have low perplexity but also be statistically meaningful.

To ensure the clustering effect, the perplexity was obtained for every topic number K up to 10;

We computed the perplexity in a Python program using the LDA implementation in the sklearn package. A larger number of topics may impair the clustering effect, while a smaller number may make the topic analysis less accurate. Following the elbow method, the inflection point of the perplexity curve, K=3, was selected as the optimal topic number K, and the LDA topic model was built.
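The perplexity calculation and the elbow selection described in this section can be sketched as follows. This is an illustrative sketch, not the embodiment's code: the perplexity values per K are hypothetical and are not the data behind the reported K=3, and locating the elbow by the largest second difference is one simple heuristic assumed here, since the text does not say how the inflection point was found. In practice the per-K perplexities could come from a fitted model, for example the `perplexity` method of scikit-learn's `LatentDirichletAllocation`.

```python
import math
from typing import Dict, List


def perplexity(doc_log_probs: List[float], doc_lengths: List[int]) -> float:
    """exp(-sum_i log p(w_i) / sum_i N_i) over M documents."""
    return math.exp(-sum(doc_log_probs) / sum(doc_lengths))


def elbow(perplexity_by_k: Dict[int, float]) -> int:
    """Return the interior K where the curve bends most sharply.

    Uses the largest second difference; needs at least three K values.
    """
    ks = sorted(perplexity_by_k)
    best_k, best_bend = ks[1], float("-inf")
    for prev, cur, nxt in zip(ks, ks[1:], ks[2:]):
        bend = perplexity_by_k[prev] - 2 * perplexity_by_k[cur] + perplexity_by_k[nxt]
        if bend > best_bend:
            best_k, best_bend = cur, bend
    return best_k


# Two toy documents of two words each, every word generated with p = 0.25.
toy = perplexity([2 * math.log(0.25), 2 * math.log(0.25)], [2, 2])
print(toy)

# Hypothetical perplexity curve for K = 2..6 (illustrative values only).
curve = {2: 900.0, 3: 600.0, 4: 560.0, 5: 540.0, 6: 530.0}
print(elbow(curve))
```

In the toy case every word is as hard to predict as a uniform choice among four options, so the perplexity is 4; for the hypothetical curve the sharpest bend is at K=3.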

2.2, LDA topic model

In the embodiment, an LDA topic model is adopted to carry out topic mining on the anchor introduction, which is a document topic generation model, and comprises a word, topic and document (i.e. anchor introduction) three-layer structure, as shown in FIG. 2; the model adopts a probability inference algorithm to process the text, does not need manual intervention to annotate an initial document before modeling, can identify the implicit subject information in the document, better reserves the internal relation of the document, and achieves good practical effects in the aspects of text semantic analysis, information retrieval and the like.

The LDA topic model generation process is as follows:

(1) From Dirichlet distributions with prior parameters α and β, the topic distribution θ of each anchor introduction and the topic-word distribution φ of all anchor introductions are generated by sampling under the optimal topic number K;

α is the Dirichlet prior parameter of each anchor introduction's distribution over topics, and β is the corresponding prior parameter of the topic-word distribution;

(2) From the topic distribution θ of each anchor introduction, the topic Z of each anchor introduction is generated by sampling. The LDA topic model assumes that each anchor introduction is a combination, in different proportions, of topics that reflect what is particular to that introduction, and that the combination proportions obey a multinomial distribution, expressed as:

Z|θ＝Multinomial(θ)

(3) From the topic-word distribution φ of all anchor introductions, the words w_i that make up the introduction of the i-th anchor are generated by sampling, and their probability distribution is calculated by the following formula:

$$P(w_i \mid i) = \sum_{s=1}^{K} P(w_i \mid z = s)\, P(z = s \mid i)$$

wherein P(w_i | z=s) represents the probability that the word w_i belongs to the s-th topic; P(z=s | i) represents the probability of the s-th topic in the i-th anchor introduction; K is the optimal topic number; and P(w_i | i) represents the probability distribution of the word w_i in the i-th anchor introduction;
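The word probability described here is a mixture over topics: each topic's word probability is weighted by that topic's share in the anchor introduction. A minimal sketch follows; the function name `word_probability` and all numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def word_probability(word, topic_word, doc_topic):
    """P(w | i) = sum over topics s of P(w | z=s) * P(z=s | i)."""
    return sum(
        p_topic * topic_word[s].get(word, 0.0)
        for s, p_topic in enumerate(doc_topic)
    )

# Hypothetical topic-word distributions for K = 3 topics (each sums to 1).
topic_word = [
    {"brand": 0.5, "factory": 0.3, "share": 0.2},
    {"brand": 0.1, "factory": 0.1, "share": 0.8},
    {"brand": 0.2, "factory": 0.6, "share": 0.2},
]
# Hypothetical topic distribution P(z=s | i) of one anchor introduction.
doc_topic = [0.6, 0.3, 0.1]

print(word_probability("brand", topic_word, doc_topic))
```

Since every topic-word distribution and the topic distribution each sum to 1, the resulting word probabilities over the vocabulary also sum to 1.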

2.3 topic result analysis

Part of the results for topic number K=3 is shown in Table 1, which lists the 20 most frequent words in each topic together with their probabilities and shows part of the word distribution of the three topics in the anchor introductions. In topic 1, the main words are brand, customer service, official, factory, etc. These words all relate to reputation, so this element of the anchor introduction is called reputation. Under this topic, anchors tend to highlight their own reputation and brand and talk more about the safety and credibility of their products and services. In contrast, the main vocabulary of the personal introductions in topic 2 includes cooperation, business, after-sales, sharing, following, and the like. The results indicate that this class of introduction elements focuses on interaction between anchor and audience, and that relationships and emotions play an important role in this topic. The proportion of emotion words in topic 2 is therefore higher, and the anchor introductions in topic 2 are classified as relational or interactive introductions. Topic 3 focuses on the product, and a large number of distinctive words appear under it, such as commodity, women's clothing, height and weight; such elements of the anchor introduction highlight the anchor's product information in order to show that the products fit the customers' needs. Note that the same high-frequency word may appear under different topics, which affects the definition and interpretation of the topics; for this reason, a further relevance analysis may be employed.

TABLE 1

To prevent the same high-frequency words appearing under different topics from distorting the interpretation of the topics, the topic-word relevance is adopted, which controls which terms of a given topic are displayed:

$$r(w, k \mid \lambda) = \lambda \log(\phi_{kw}) + (1 - \lambda) \log\left(\frac{\phi_{kw}}{p(w)}\right)$$

wherein w represents a word in the corpus; k represents a topic; p(w) represents the marginal probability of the word w in the topic-word distribution matrix φ; φ_kw represents the relevance of the word w to the topic k, and the topic-word distribution φ is the collective term made up of the relevance values φ_kw of every word and every topic.

When the parameter λ of the topic-word relevance is 0, the terms displayed under topic k are specific and relatively exclusive, i.e., these terms tend to appear only under that topic; when λ=1, the terms with the highest distribution probability are displayed, but such terms do not belong to that topic alone and belong to other topics at the same time. The user adjusts the relevance between the word w and the topic k, namely r(w, k|λ), by setting the value of λ;
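To see how λ changes the ranking, the relevance r(w, k|λ) can be evaluated directly. The sketch below is only an illustration: the function names `relevance` and `top_terms` and the probabilities are hypothetical, not values from the patent or from pyLDAvis.

```python
import math

def relevance(phi_kw, p_w, lam):
    """r(w, k | lam) = lam * log(phi_kw) + (1 - lam) * log(phi_kw / p_w)."""
    return lam * math.log(phi_kw) + (1 - lam) * math.log(phi_kw / p_w)

def top_terms(phi_k, p, lam, n=3):
    """Rank the words of one topic by relevance, highest first."""
    scored = {w: relevance(phi_k[w], p[w], lam) for w in phi_k}
    return sorted(scored, key=scored.get, reverse=True)[:n]

# Hypothetical probabilities within one topic (phi_k) and marginal
# probabilities over the whole corpus (p).
phi_k = {"live": 0.30, "dress": 0.20, "height": 0.05}
p = {"live": 0.30, "dress": 0.08, "height": 0.01}

print(top_terms(phi_k, p, lam=1.0))  # ranks by probability within the topic
print(top_terms(phi_k, p, lam=0.0))  # ranks by exclusivity to the topic
```

With λ=1 the frequent but widely shared word ranks first, while with λ=0 the rarer word that is concentrated in this topic ranks first, matching the behaviour described above.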

In this embodiment, the pyLDAvis toolkit for Python is used to draw a dynamic, interactive visualization of the LDA topic model and to analyze the associations between topics, thereby identifying the core and secondary topics. pyLDAvis controls the topic-word relevance, relevance(term w | topic t), through the adjustable parameter λ (0 ≤ λ ≤ 1). Taking topic 3 as an example:

From the relevance ranking, 6 of the top 10 relevant words under topic 3 relate to products, namely commodity, women's clothing, factory, height, apparel and clothes, which verifies that this topic mainly focuses on products.

On this basis, the distribution of the different topics in each anchor introduction is obtained, part of which is shown in Table 2. This reveals the style and tone of the different anchor introductions and lays a foundation for further exploring how the elements of an anchor introduction affect the anchor's live-streaming performance. That is, based on the live-streaming effect of each anchor and the probability of each topic in the introduction (i.e. the topic distribution), the introduction style best suited to the anchor and the particular interaction preferences and interests of the anchor's fans are identified.

Finally, this step yields three anchor groups. The first group (group 1) focuses on the reputation of its own products and brands. The second group (group 2) focuses on interaction with customers. The third group (group 3) focuses on its own products; its personal introductions contain a large amount of product information to satisfy customers' demand for the product.
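Assigning an anchor to a group by the highest value of its topic distribution can be sketched as follows. The function name `classify_anchor`, the group labels and the sample distribution are hypothetical; they only mirror the three groups described above.

```python
# Labels for the three groups described in the text (index 0 = topic 1).
GROUP_NAMES = {0: "reputation", 1: "interaction", 2: "product"}

def classify_anchor(topic_distribution):
    """Return (group number, group label) for the most probable topic."""
    best = max(range(len(topic_distribution)),
               key=topic_distribution.__getitem__)
    return best + 1, GROUP_NAMES[best]

# Hypothetical topic distribution of one anchor introduction.
print(classify_anchor([0.2, 0.7, 0.1]))
```

If two topics tie for the highest probability, this sketch keeps the lower-numbered one; the text does not say how ties are handled.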

TABLE 2

3. Critical trait analysis

After the anchors are grouped by the LDA topic model, the data are transformed by natural logarithm. Analysis of variance is then used to explore the differences in live-streaming effect between the anchor groups, and regression analysis is used to explore the differential influence of different anchor characteristics on the live-streaming effect.

3.1, data conversion

Log transformation: live sales (GMV), anchor fan count and similar variables are log-transformed to reduce the influence of extreme values and to convert skewed data into approximately normal data, on which the analysis of variance and regression analysis are then performed.
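A minimal sketch of the natural-log transformation is given below. The function name `log_transform` and the sample figures are hypothetical, and the sketch assumes every value is strictly positive, since the logarithm is undefined otherwise.

```python
import math

def log_transform(values):
    """Natural-log transform; every value must be strictly positive."""
    return [math.log(v) for v in values]

# Hypothetical, heavily skewed sales figures.
sales = [1.0, 100.0, 10000.0, 1000000.0]
print(log_transform(sales))
```

The raw figures span six orders of magnitude, while the transformed values are evenly spaced, which illustrates how the transformation dampens extreme values.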

3.2, analysis of variance test live effect

Analysis of variance (ANOVA, an existing technique) is used to analyze the differences in live-streaming effect (in this embodiment, differences in live sales) between the anchor groups. ANOVA analyzes the differences between categorical data and quantitative data; here the categorical data is the anchor group and the quantitative data is the live-streaming effect (e.g., live sales, likes, etc.).
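The F statistic of a one-way analysis of variance is the between-group mean square divided by the within-group mean square. The sketch below computes it in plain Python for illustration only; the embodiment itself uses SPSS, and the function name `one_way_anova_f` and the sample data are hypothetical. Obtaining the p value additionally requires the F distribution, for which a statistics library such as SciPy (`scipy.stats.f_oneway`) could be used.

```python
def one_way_anova_f(groups):
    """F = between-group mean square / within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical log-sales for three anchor groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
print(one_way_anova_f(groups))
```

A larger F means the group means differ more relative to the spread inside each group.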

In this embodiment, SPSS is used for the analysis of variance, and the results are shown in Table 3;

TABLE 3

Note p <0.01.

According to the analysis-of-variance results, the F value (the ratio of the between-group mean square to the within-group mean square) is 5.799 and the p value (used to judge the hypothesis test) is 0.003 < 0.05, so the differences between the groups are significant: the reputation-based anchor group has the highest sales, the interaction-based anchor group comes second, and the product-based anchor group has the worst live-streaming effect. It can be seen that an anchor image that conveys quality to the audience stimulates audience purchases the most.

3.3, regression analysis mining Key Properties

Through the LDA topic model and analysis of variance, the anchors are divided into three groups: group 1 is the reputation anchors (first type), group 2 is the interactive anchors (second type), and group 3 is the product anchors (third type); the reputation anchor group is found to have the highest sales. Regression analysis is then performed on each anchor group to explore the differential influence of different anchor characteristics on live sales within each group, so as to give better guidance to anchors of different groups.

In each anchor group, a regression analysis model is established with the following independent variables: log(average number of commodity categories per live stream), log(average product price), the frequency of live streaming in different periods of the day (i.e. the probabilities of streaming in the morning, afternoon and evening), log(live-stream duration) and the anchor's average fan count. The dependent variable is log(live sales), and the probability of streaming in the morning serves as the reference group. The results are shown in Table 4;

TABLE 4


The regression results show that, for all three types of anchor, the average product price is the most important factor among the current independent variables. For the third type of anchor, the number of commodity categories has no great influence on the live-streaming effect. The second type of anchor need not stream in the early morning, whereas for the first type of anchor the time period makes no difference and any period may be chosen. The first and third types of anchor should increase their live-stream duration beyond the current level. Meanwhile, the number of fans is critical for all three types of anchor, especially the first type.
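A per-group regression of this kind can be fitted by ordinary least squares. The sketch below solves the normal equations in plain Python and is an illustration only: the text does not name the estimation tool, the function names `solve` and `ols` are hypothetical, and the sample data are synthetic. In practice each row would hold one anchor's variables, with the reference time period left out of the columns.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least-squares coefficients; the first one is the intercept."""
    x = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic data generated from y = 1 + 2*x1 - 3*x2 (illustrative only).
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y = [1.0, 3.0, -2.0, 0.0, 2.0]
print(ols(rows, y))
```

Fitting one such model per anchor group allows the coefficients of the same variable to be compared across groups.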

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of description is adopted for clarity only, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments that will be apparent to those skilled in the art.

Claims

1. A method for classifying anchor image and analyzing key characteristics based on an LDA topic model is characterized in that: obtaining different topic groups by using an LDA topic model, knowing the difference of live broadcast effects of different anchor groups, and mining key characteristics affecting the live broadcast effects of the groups, wherein the method comprises the following steps:

S3, constructing an LDA topic model according to the initial data set;

2. The method for classifying anchor figures and analyzing key features based on the LDA topic model as claimed in claim 1, wherein: in the step S2, the specific steps of data preprocessing on the introduction text in the original dataset are as follows:

S21, screening out the anchors whose anchor introduction is empty;

3. The LDA topic model-based anchor image classification and key trait analysis method of claim 1 or 2, wherein: in the step S3, the specific steps of constructing the LDA theme model are as follows:

Z|θ＝Multinomial(θ)

from the topic-word distribution of all anchor introductions

wherein P(w_i | z=s) represents the probability that the word w_i belongs to the s-th topic; P(z=s | i) represents the probability of the s-th topic in the i-th anchor introduction; K is the optimal topic number; and P(w_i | i) represents a probability distribution.

4. The LDA topic model-based anchor image classification and key trait analysis method of claim 3, wherein: in the step S4, the topic high-frequency words and the topic distribution of each anchor introduction are mined from the initial data set through the LDA topic model, the topic number K is determined, and each anchor is classified by the highest value of its topic distribution as the anchor image; the specific steps are as follows:

wherein w represents a word in the corpus, k represents a topic, and p(w) represents the marginal probability of the word w in the topic-word distribution φ of all anchor introductions;

5. The method for classification and key trait analysis of anchor image based on LDA topic model as claimed in claim 4, wherein: in the step S5, variance analysis is used to obtain the difference characteristics among different anchor groups, and the difference of live broadcast effects of the different anchor groups is known; the method comprises the following specific steps:

S52, using analysis of variance to analyze the differences in live-streaming characteristics and effects between different anchor groups, wherein analysis of variance analyzes the differences between categorical data and quantitative data, the categorical data being the anchor group and the quantitative data being the live-streaming effect.

6. The method for classifying anchor figures and analyzing key features based on the LDA topic model as claimed in claim 5, wherein: in the step S6, regression analysis is used to obtain key characteristics affecting the live broadcast effect in each anchor group; the method comprises the following specific steps:

S61, in each anchor group, establishing a regression equation with the anchor characteristics as independent variables and the live-streaming effect as the dependent variable:

$$y_i = k_1 x_{i1} + k_2 x_{i2} + k_3 x_{i3} + \cdots + k_n x_{in} + b + c$$

wherein y_i represents the sales of the i-th anchor; x_i1, ..., x_in represent the n attribute-related variables of the i-th anchor; b represents the intercept term; c represents the residual term; and k_1, ..., k_n represent the coefficients corresponding to the n attributes;