WO2019242453A1

WO2019242453A1 - Information processing method and device, storage medium, and electronic device

Info

Publication number: WO2019242453A1
Application number: PCT/CN2019/088435
Authority: WO
Inventors: 陆平; 韦安军; 胡晓
Original assignee: 中兴通讯股份有限公司
Priority date: 2018-06-21
Filing date: 2019-05-24
Publication date: 2019-12-26
Also published as: CN110633410A

Abstract

Disclosed in the present invention are an information processing method and device, a storage medium, and an electronic device. The method comprises: obtaining topic data; preprocessing the topic data to obtain structured data; and inputting the structured data into a model file, to obtain popularity information of the topic data by calculation.

Description

Information processing method and device, storage medium, and electronic device

Cross-reference to related applications

This application is based on a Chinese patent application with an application number of 201810644005.6 and an application date of June 21, 2018, and claims the priority of the Chinese patent application. The entire contents of the Chinese patent application are incorporated herein by reference.

Technical field

The present invention relates to, but is not limited to, the field of communications, and in particular, to an information processing method and device, a storage medium, and an electronic device.

Background technique

Work and daily life are now closely tied to computers and the Internet. People obtain all kinds of information online, and entertainment, consumption, and communication between people have also moved onto the Internet. Social media platforms such as Weibo and WeChat circle of friends have made web-based social interaction widespread. In this Internet age, a large amount of topic data is generated anytime and anywhere. Like waves, topic data produces new peaks over time. On Weibo these peaks are trending topics or buzzwords in posts; in music, they make up the pop charts. At a finer granularity, the comments screened out for a given event may reflect the public's state of mind about that event. The process of obtaining popularity information is called popularity analysis. For an individual user, popularity analysis can play a decisive role as one dimension of a recommendation system. Applied to the data of all users as a whole, it can be used to predict how events will develop.

In related technologies, popularity analysis of topic data is inefficient. For example, when gathering public opinion, the relevant departments often rely on electronic suggestion boxes, mobile-app mailboxes to the governor, and similar channels to collect people's demands and to understand what people lack and what they urgently want changed. In such channels each message is likely to reflect an event or a state of mind, but passively collecting public opinion in this way is very inefficient.

Summary of the Invention

In view of this, embodiments of the present invention desire to provide an information processing method and device, a storage medium, and an electronic device.

An embodiment of the present invention provides an information processing method, including: obtaining topic data; preprocessing the topic data to obtain structured data; and inputting the structured data into a model file to calculate the popularity information of the topic data.

An embodiment of the present invention further provides an information processing apparatus including: an acquisition module configured to acquire topic data; a processing module configured to preprocess the topic data to obtain structured data; and a calculation module configured to input the structured data into a model file and calculate the popularity information of the topic data.

According to an embodiment of the present invention, a storage medium is also provided. The storage medium stores a computer program, and the computer program is configured to execute the information processing method provided by the embodiment of the present invention when running.

An embodiment of the present invention further provides an electronic device including a memory and a processor. The memory stores a computer program, and the processor is configured to run the computer program to perform the information processing method provided by the embodiment of the present invention.

By applying the embodiments of the present invention, structured data is obtained by preprocessing the topic data, and the popularity information of the topic data is then calculated according to the model file, thereby improving the efficiency of analyzing topic popularity.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of an information processing method according to an embodiment of the present invention;

FIG. 2 is a structural block diagram of an information processing apparatus according to an embodiment of the present invention;

FIG. 3 is a system structural diagram of an embodiment of the present invention;

FIG. 4 is a system module diagram of an embodiment of the present invention;

FIG. 5 is a schematic diagram of an elbow algorithm for determining a K value in an embodiment of the present invention;

FIG. 6 is a diagram illustrating an example of an initialization point determination process in an embodiment of the present invention;

FIG. 7 is an example diagram of an overall prediction process in an embodiment of the present invention;

FIG. 8 is an example diagram of a processing flow before training in an example of the present invention;

FIG. 9 is an overall flowchart of analyzing the popularity of microblog data in an example of the present invention;

FIG. 10 is a diagram illustrating an example of popularity analysis of music data in an example of the present invention;

FIG. 11 is a schematic diagram of converting an audio signal to a feature vector in an example of the present invention;

FIG. 12 is a diagram illustrating an example of popularity analysis of commodities in an example of the present invention;

FIG. 13 is a flowchart of preprocessing and commodity popularity prediction in an example of the present invention;

FIG. 14 is an overall flowchart of news data popularity analysis in an example of the present invention.

Detailed description

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings and embodiments. It should be noted that, in the case of no conflict, the embodiments in the present application and the features in the embodiments can be combined with each other.

It should be noted that the terms “first” and “second” in the description and claims of the present invention and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.

FIG. 1 is a flowchart of an information processing method provided by an embodiment of the present invention. As shown in FIG. 1, an information processing method provided by an embodiment of the present invention includes:

Step S102: obtain topic data;

Step S104: preprocess the topic data to obtain structured data;

Step S106: input the structured data into a model file, and calculate the popularity information of the topic data.

Through the above steps, structured data is obtained by preprocessing the topic data, and the popularity information of the topic data is then calculated according to the model file, which solves the technical problem of inefficient analysis of topic popularity in related technologies.

In some embodiments, the execution subject of the above steps may be a server, a terminal, etc., but is not limited thereto.

In some embodiments, after calculating the popularity information of the topic data, the method further includes: displaying the popularity information of the topic data on a front-end interface. The items can be sorted by popularity, and the popularity information can be a score.

In some embodiments, before inputting the structured data into the model file, the method further includes one of the following: training the model file; presetting the model file. A preset model file has already been trained and can be used directly; it can also be retrained during use.

In some embodiments, training the model file includes:

S11. Segment the sample text data into words and remove characters of the specified types to obtain the first data. Removing characters of the specified types includes removing punctuation and other symbols, numbers, spaces, and stop words.

S12. Perform word embedding on the first data to obtain the second data.

S13. Add the word vectors of the second data and take the average to obtain the third data.

S14. Perform Gaussian mixture model training on the original model with the third data, according to the categories, to obtain the model file.
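Steps S11 to S13 can be illustrated with a minimal sketch. It assumes English text, a regular expression in place of a real tokenizer, and a tiny hand-written embedding table; `STOP_WORDS`, `EMBEDDINGS`, `tokenize`, and `text_to_vector` are hypothetical names introduced only for this illustration and do not come from the embodiment.

```python
import re

# Hypothetical toy resources. A real system would use a proper word
# segmenter and pre-trained word vectors instead of these tables.
STOP_WORDS = {"the", "a", "is", "of"}
EMBEDDINGS = {
    "road": [1.0, 0.0],
    "repair": [0.0, 1.0],
    "noise": [1.0, 1.0],
}


def tokenize(text):
    """S11: split into words, dropping symbols, digits, spaces and stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def text_to_vector(text, dim=2):
    """S12-S13: embed each known word, then average the word vectors."""
    vectors = [EMBEDDINGS[w] for w in tokenize(text) if w in EMBEDDINGS]
    if not vectors:
        # No known word: fall back to a zero vector (an assumption).
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
```

Words missing from the embedding table are skipped here; how out-of-vocabulary words should be treated is a design choice that the description leaves open.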

In some embodiments, the structured data is input into a model file, and the popularity information of the topic data is calculated, including:

S21. Segment the structured data into words and remove characters of the specified types to obtain the first structured data. Removing characters of the specified types includes removing punctuation and other symbols, numbers, spaces, and the like.

S22. Perform word embedding processing on the first structured data to obtain the second structured data.

S23. Add the word vectors of the second structured data and take the average to obtain the third structured data.

S24. Input the third structured data into the model file to obtain the classification and category probability of each piece of data.

S25. Perform a calculation on the category probability to obtain the popularity information of the topic data.

In some embodiments, pre-processing the topic data to obtain structured data includes:

Splitting the topic data according to data type, so that the pictures, voice, expressions, and other such data it contains can be cleaned;

Deleting the specific types of data contained in the topic data to obtain candidate data, where the specific types include at least one of the following: pictures, voice, and expressions;

Structuring the candidate data into structured data.
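The preprocessing described above can be sketched as a small filter. The record layout (`id`, `topic`, `parts`, `type`, `value`) and the function name `preprocess` are assumptions made for this illustration only; the actual structured fields are those of Table 1.

```python
def preprocess(raw_item):
    """Drop non-text parts and return a flat, structured record."""
    removed_types = {"picture", "voice", "expression"}
    # Keep only the parts whose type is not one of the deleted types.
    text_parts = [
        part["value"].strip()
        for part in raw_item["parts"]
        if part["type"] not in removed_types
    ]
    return {
        "id": raw_item["id"],
        "topic": raw_item["topic"],
        "content": " ".join(p for p in text_parts if p),
    }
```

For example, a raw item with two text parts, one picture, and one expression would be reduced to a record whose `content` is the two text parts joined by a space.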

In some embodiments, obtaining the topic data includes: capturing topic data from the Internet, where the topic data includes at least one of the following: topic content and comment information. Topic data can be obtained from WeChat circle of friends, Weibo, Post Bar, website, application software, etc.

Through the description of the above embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general-purpose hardware platform. It can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes to the existing technology, can be embodied in the form of a software product. The software product is stored in a storage medium (such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.

An information processing apparatus is also provided in this embodiment. The apparatus is configured to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and conceivable.

FIG. 2 is a structural block diagram of an information processing apparatus according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes:

An obtaining module 22 configured to obtain topic data;

A processing module 24 configured to preprocess the topic data to obtain structured data;

The calculation module 26 is configured to input the structured data into the model file and calculate the popularity information of the topic data.

In some embodiments, the calculation module includes: a first processing unit configured to segment the structured data into words and remove characters of the specified types from the structured data to obtain first structured data; a second processing unit configured to perform word embedding on the first structured data to obtain second structured data; a first calculation unit configured to add and average the word vectors of the second structured data to obtain third structured data; a second calculation unit configured to input the third structured data into the model file and calculate the classification and category probability of each piece of data; and a third calculation unit configured to perform a calculation on the category probability to obtain the popularity information of the topic data.

It should be noted that each of the above modules can be implemented by software or hardware. For the latter, this can be done in, but is not limited to, the following ways: the above modules are all located in the same processor; or the above modules are located in different processors in any combination.

The following describes the embodiments of the present invention in detail with reference to specific scenarios:

Embodiments of the present invention provide a popularity analysis system and method based on a Gaussian mixture model (GMM). The proposed system covers crawling of corpus topic information, preprocessing of the corpus information, Gaussian mixture modeling, and popularity analysis and prediction, through to output of the popularity score results. The description focuses on the method for analyzing and predicting the popularity of public opinion topics based on the Gaussian mixture model, and on that basis extends to multiple fields to establish a popularity analysis system based on Gaussian mixture clustering.

The embodiment of the present invention discusses a popularity analysis method based on Gaussian mixture clustering (Mixture of Gaussians, MoG) and a method for predicting the popularity of public opinion topics based on a Gaussian mixture model, and on that basis extends to multiple fields to form a popularity analysis system based on Gaussian mixture clustering.

The structure of the "popularity analysis system for public opinion content based on a Gaussian mixture model" provided by the embodiment of the present invention is shown in FIG. 3, which describes the processing flow of the system: the distributed data capture module captures public opinion content data, the raw data is filtered by a preprocessing process, and the preprocessed data is stored in a distributed file system. A popularity analysis task is started at regular intervals; it loads the model file obtained by training the popularity analysis algorithm provided by the embodiment of the present invention, takes the sample data as input, obtains and ranks the popularity score of each sample text, and displays the result on the portal interface.

FIG. 4 is a system module diagram of an embodiment of the present invention. Referring to FIG. 4, a system provided by an embodiment of the present invention includes:

Distributed data capture module: responsible for capturing public opinion topic data from the Internet. Weibo topic data includes Weibo topics, text content contained in topics, and comment information under each text. The most important thing is to grab the content of the text itself.

Data pre-processing module: Responsible for pre-processing the captured raw data, cleaning the pictures, voice, expressions and other data contained in the data, and normalizing unstructured data into structured data.

The storage format of structured data is shown in Table 1 below. Table 1 is used to describe the structured data fields.

Table 1

Distributed file system module: responsible for storing data.

Algorithm training and analysis module: responsible for establishing the Gaussian mixture analysis algorithm model, training the model with training data (because this is a clustering operation, the training data does not need to be labeled), and saving the model for use in predictive analysis.

Prediction and scoring module: predicts, according to the Gaussian mixture model, the category of a test sample and the probability that it belongs to that category, and calculates the score from this probability, the total number of samples in the category, and the K value.

The algorithm of the training and analysis module is built as follows:

1) Perform word segmentation on all text data and remove punctuation and other symbols, numbers, spaces, and stop words.

2) Set the vector dimension and perform word embedding on the words obtained in 1).

3) Add the word vectors obtained in 2), take the average, and save it.

4) Determine the number of categories, that is, the number of Gaussian distributions.

5) For each Gaussian distribution, initialize the mean with the K-Means algorithm and initialize the variance randomly.

6) For each sample, calculate its probability under each Gaussian distribution.

7) For each Gaussian distribution, the contribution of each sample can be expressed by the sample's probability under that distribution: a high probability means a large contribution, and a low probability a small one. Using these contributions as weights, calculate the weighted mean and variance, and replace the original mean and variance with them.

8) Repeat 6) and 7) until the mean and variance of each Gaussian distribution converge.

For step 1), an open-source tokenizer such as ansj or HanLP can be used. Here, the HanLP tokenizer is used.

For step 2), a pre-trained word2vec or GloVe model is used directly to generate the word vectors.

Step 4) determines the number of categories, that is, the K value, using the elbow method. Some notation for the clustering algorithm is given first:

The m input samples of the clustering algorithm: $x_1, x_2, \ldots, x_m$

The cluster center to which $x_i$ belongs: $\mu_{c_i}$

During clustering, the algorithm assigns each sample to the cluster center nearest to it. The optimization objective of the clustering algorithm is therefore:

$$J(c_1, \ldots, c_m, \mu_1, \ldots, \mu_K) = \sum_{i=1}^{m} \lVert x_i - \mu_{c_i} \rVert^2$$

where $c_i$ is the index of the cluster center closest to $x_i$ and $\mu_k$ is the k-th cluster center.

The value of the objective J is the sum of the squared distances from each sample to its cluster center, so J measures the clustering error to some extent: the smaller J is, the smaller the error. Different values of K give different values of J.

The elbow method holds that K should take the value at the inflection point of the curve of J against K. FIG. 5 is a schematic diagram of the elbow method for determining the K value in the embodiment of the present invention; in it, a K of 3 or 6 is the more appropriate choice.
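The elbow selection can be illustrated with a minimal sketch that runs a plain K-Means on one-dimensional data for several values of K and records the clustering cost J for each. The function names `kmeans_1d` and `elbow_costs`, the deterministic initialization, and the use of one-dimensional data are assumptions made to keep the example short; they are not taken from the embodiment.

```python
def kmeans_1d(xs, k, iters=50):
    """Lloyd's algorithm on 1-D data; returns (centers, cost J)."""
    xs = sorted(xs)
    # Deterministic initialisation: spread the k centers over the sorted data.
    centers = [xs[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            # Assign each sample to its nearest center.
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # Move each center to the mean of its cluster (keep it if empty).
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # J: sum of squared distances from each sample to its nearest center.
    cost = sum(min((x - c) ** 2 for c in centers) for x in xs)
    return centers, cost


def elbow_costs(xs, max_k):
    """Cost J for K = 1..max_k; the elbow is where the drop in J flattens."""
    return {k: kmeans_1d(xs, k)[1] for k in range(1, max_k + 1)}
```

On data with three well-separated groups, J falls sharply up to K = 3 and only marginally afterwards, which is the bend that the elbow method looks for.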

Step 5) uses the K-Means algorithm to find the initialization points. Since the algorithm is used here only to find initialization points for Gaussian mixture clustering training, which improves the accuracy and convergence speed of MoG, its details are not discussed. Taking two-dimensional data as an example, the process by which K-Means finds the initialization points for Gaussian mixture clustering is shown in FIG. 6, which is an example diagram of the initialization point determination process in the embodiment of the present invention.

Steps 6), 7) and 8) involve Gaussian mixture clustering and the EM (Expectation-Maximization) algorithm. The key steps are described below:

For a random variable x, the Gaussian mixture model can be expressed by the following formula:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \qquad (1)$$

where $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is called the k-th component of the mixture model. In the example of FIG. 6 there are four clusters, which can be represented by four two-dimensional Gaussian distributions, so the number of components is K = 4. $\pi_k$ is the mixing coefficient and satisfies:

$$\sum_{k=1}^{K} \pi_k = 1$$

$$0 \le \pi_k \le 1$$

It can be seen that $\pi_k$ is equivalent to the weight of each component $\mathcal{N}(x \mid \mu_k, \Sigma_k)$.

Introduce a new K-dimensional random variable z, whose components $z_k$ ($1 \le k \le K$) can only take the value 0 or 1. $z_k = 1$ means that the k-th class is selected, with $p(z_k = 1) = \pi_k$; $z_k = 0$ means that the k-th class is not selected. More formally, $z_k$ must satisfy the following two conditions:

$$z_k \in \{0, 1\}, \qquad \sum_{k=1}^{K} z_k = 1$$

For example, in FIG. 6 there are four classes, so the dimension of z is 4. If a point is taken from the first class, z = (1, 0, 0, 0); if a point is taken from the second class, z = (0, 1, 0, 0). The probability of $z_k = 1$ is $\pi_k$, so the joint probability distribution of z can be written as:

$$p(z) = \prod_{k=1}^{K} \pi_k^{z_k} \qquad (2)$$

Because each $z_k$ can only be 0 or 1, and exactly one $z_k$ is 1 while all the others are 0, the equation holds.

The data in FIG. 6 can be divided into four classes, and the data in each class is assumed to follow a Gaussian distribution. This can be expressed as a conditional probability:

$$p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

That is, the data in the k-th class follows a Gaussian distribution. The above formula can also be written as:

$$p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k} \qquad (3)$$

Formulas (2) and (3) give the forms of p(z) and p(x | z), respectively. By the product rule and the sum rule of probability, the form of p(x) is obtained:

$$p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \qquad (4)$$

It can be seen that formulas (1) and (4) of the GMM have the same form; (4) introduces a new variable z, usually called a latent (hidden) variable. For the data in FIG. 6, "hidden" means the following: the data is known to fall into four classes, but when a data point is drawn at random, which class it belongs to is unknown and cannot be observed, so a latent variable z is introduced to describe this assignment.

In Bayesian terms, p(z) is the prior probability and p(x | z) is the likelihood, so it is natural to look for the posterior probability p(z | x):

$$\gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)} \qquad (5)$$

In the above formula, the symbol $\gamma(z_k)$ denotes the posterior probability of the k-th component. From the Bayesian point of view, $\pi_k$ can be regarded as the prior probability of $z_k = 1$.

The above rewrites the GMM in terms of the latent variable z and introduces the posterior probability $\gamma(z_k)$ given an observed x. This is done to make it convenient to estimate the GMM parameters with the EM algorithm.
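As a small worked illustration of the posterior probability, the sketch below applies Bayes' rule to a one-dimensional mixture. Restricting the example to one dimension and the names `gaussian_pdf` and `responsibilities` are assumptions made for brevity.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """Posterior probability of each component given x (Bayes' rule)."""
    # Prior times likelihood for every component.
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    # Normalising by the sum gives the posterior over components.
    return [j / total for j in joint]
```

With two equally weighted unit-variance components centered at 0 and 4, a point at x = 1 is assigned almost entirely to the first component, while a point at x = 2, midway between the means, is split evenly.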

Next, the EM algorithm is used to estimate the parameters. The EM algorithm has two steps: the first step obtains rough values of the quantities to be estimated, and the second step uses those values to maximize the likelihood function. The likelihood function of the GMM must therefore be obtained first.

Suppose $X = \{x_1, x_2, \ldots, x_N\}$; for FIG. 6, X is the set of all points in the figure (each point has two coordinates in the plane and is therefore a two-dimensional vector). The probability model of the GMM is given by formula (1). Three parameters of the GMM need to be estimated, namely π, μ, and Σ. Writing the likelihood of X as the product of (1) over all points:

$$p(X \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \qquad (6)$$

To estimate these three parameters, the likelihood must be maximized with respect to each of them. Start with $\mu_k$. Taking the logarithm of both sides of formula (6) gives the log-likelihood function:

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \qquad (7)$$

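Differentiating (7) with respect to $\mu_k$ and setting the derivative to 0 gives:

$$0 = \sum_{n=1}^{N} \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \, \Sigma_k^{-1} (x_n - \mu_k) \qquad (8)$$

Note that the fraction in the above formula is exactly the posterior probability of formula (5). Multiplying both sides by $\Sigma_k$ and rearranging gives:

$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n \qquad (9)$$

where:

$$N_k = \sum_{n=1}^{N} \gamma(z_{nk}) \qquad (10)$$

In formulas (9) and (10), N is the number of points and $\gamma(z_{nk})$ is the posterior probability that point $x_n$ belongs to cluster k. $N_k$ can then be interpreted as the effective number of points belonging to the k-th cluster, and $\mu_k$ is the weighted average of all points, where the weight of each point is $\gamma(z_{nk}) / N_k$, which depends on the k-th cluster.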

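Similarly, maximizing the likelihood with respect to $\Sigma_k$ gives:

$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^{\mathsf{T}} \qquad (11)$$

Finally, the likelihood is maximized with respect to $\pi_k$. Note that $\pi_k$ is subject to the constraint

$$\sum_{k=1}^{K} \pi_k = 1$$

so, following the method of Lagrange multipliers, a Lagrange multiplier λ is added:

$$\ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$$

Differentiating the above expression with respect to $\pi_k$ and setting the derivative to 0 gives:

$$0 = \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda$$

Multiplying both sides by $\pi_k$ gives $0 = N_k + \lambda \pi_k$; summing over k and using the constraint gives λ = -N, which yields a more concise expression for $\pi_k$:

$$\pi_k = \frac{N_k}{N} \qquad (12)$$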

At this point, we can use the EM algorithm to calculate the model parameters using (5) (7) (9) (10) (11) (12).

EM algorithm process:

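1. Define the number of components K (in this example K is 4), set initial values of $\pi_k$, $\mu_k$ and $\Sigma_k$ for each component k, and calculate the log-likelihood function (7) of formula (6).

2. E-step: calculate the posterior probability $\gamma(z_{nk})$ from the current $\pi_k$, $\mu_k$, and $\Sigma_k$:

$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

3. M-step: calculate the new $\pi_k$, $\mu_k$, $\Sigma_k$ from the $\gamma(z_{nk})$ obtained in the E-step:

$$\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n$$

$$\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^{\mathsf{T}}$$

$$\pi_k^{\text{new}} = \frac{N_k}{N}$$

where:

$$N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$

4. Calculate the log-likelihood function (7) of formula (6) with the new parameters, and check whether the parameters or the log-likelihood function have converged. If not, return to step 2.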
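The EM procedure can be sketched for a one-dimensional mixture as follows. This is only an illustration under simplifying assumptions: scalar variances replace covariance matrices, the initial means are supplied by the caller (for example from K-Means), and the initial variances are fixed at 1.0 rather than drawn randomly. The names `fit_gmm_1d` and `min_var` are hypothetical, and no log-space safeguards against numerical underflow are included.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(xs, means, max_iter=100, tol=1e-9, min_var=1e-6):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances)."""
    k, n = len(means), len(xs)
    means = list(means)
    weights = [1.0 / k] * k
    variances = [1.0] * k
    prev_ll = None
    for _ in range(max_iter):
        # E-step: responsibility of every component for every point.
        resp = []
        log_lik = 0.0
        for x in xs:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([j / total for j in joint])
        # Stop once the log-likelihood no longer changes.
        if prev_ll is not None and abs(log_lik - prev_ll) < tol:
            break
        prev_ll = log_lik
        # M-step: re-estimate the parameters from the responsibilities.
        for j in range(k):
            n_j = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / n_j
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / n_j
            variances[j] = max(var, min_var)  # floor avoids a collapsed component
            weights[j] = n_j / n
    return weights, means, variances
```

Called on two tight groups of points around 0 and 10 with initial means at the smallest and largest sample, the loop settles on means close to 0 and 10 with equal weights after a few iterations.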

For ease of understanding and visualization, the preceding examples are based on the two-dimensional data shown in FIG. 6. In practice, after the text has been segmented, embedded, and averaged (steps 1), 2) and 3)), the input training data has far more than two dimensions, but the principle of the algorithm is exactly the same. The only parameter the training module has to determine is the K value, which can be chosen with the elbow method; no other parameters need to be set. One characteristic of this system is therefore that training can be performed directly once the training data has been obtained. Table 2 describes the training corpus input.

Table 2

Gaussian mixture clustering does not require the data to be divided into a training set, a validation set, and a test set. Once training is complete, the parameters are obtained directly and the parameter set can be saved.

A new corpus (in the same format as the training corpus) is then input for prediction, which yields the category and the category probability. The format of the input corpus is shown in Table 3, which describes the format of the prediction input corpus.

Table 3

Scoring the popularity is also part of the solution of this embodiment. Scores make the data easy to sort and compare. To keep the scores accurate and uniform, this embodiment derives the score from the Gaussian clustering results so that the number of categories and the number of samples are both reflected in the popularity score, as follows:

To use a piece of text as a test sample and input it to the trained Gaussian mixture model for popularity prediction, the text must first be converted into a feature vector according to steps 1), 2) and 3) above; this feature vector is x. Predicting x with the Gaussian mixture model yields two values: the cluster k to which x belongs, and the probability proba(x) that x belongs to that cluster. The score is calculated from these two values. Assuming that the test sample x is classified into the i-th category, the score calculation is written as:

where amount(k = i) is the total number of training samples classified into the i-th category, amount(X) is the total number of all training samples, and proba(x) is the probability, predicted by the Gaussian mixture model, that sample x belongs to the i-th category.

By the nature of the Gaussian mixture model, a category k = i that contains a large share of the samples is bound to be highly popular, so the sample-size ratio

$$\frac{\mathrm{amount}(k = i)}{\mathrm{amount}(X)}$$

is calculated. This ratio roughly locates the score of the test sample within the model. The ratio is multiplied by proba(x), and the result is then balanced by the number of categories K. The balancing is needed because the larger K is, the smaller the ratio amount(k = i) / amount(X) becomes, which would make the scores given by different models differ too much and hinder comparison across models. With this calculation, a popularity score can be given to each sample, and new data can easily be added to the training data to optimize the model.
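One possible reading of this scoring description is sketched below. It assumes that the three factors are simply multiplied, that is, that "balancing by K" means multiplying by K; this interpretation, the function name `popularity_score`, and its parameter names are assumptions for illustration, not the formula of the embodiment.

```python
def popularity_score(proba, cluster_size, total_samples, k):
    """Assumed reading: K * proba(x) * amount(k=i) / amount(X)."""
    if total_samples <= 0 or k <= 0:
        raise ValueError("total_samples and k must be positive")
    # Share of training samples in the predicted category, weighted by the
    # category probability and scaled by the number of categories K.
    return k * proba * cluster_size / total_samples
```

Under this reading, a sample assigned with probability 0.9 to a category holding 50 of 200 training samples in a model with K = 4 scores 0.9, and the same sample assigned to a smaller category scores lower.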

For front-end display, the output can be adapted to specific needs: for example, displaying the popularity score of a particular complaint, displaying content in order of score, or feeding the score directly into the algorithm of a recommendation system as a parameter.

As shown in FIG. 3, this implementation of a topic-oriented sentiment analysis system includes a distributed data capture module, a data preprocessing module, a distributed storage module, an algorithm analysis module, a predictive scoring module, and an optional front-end display module. The algorithm analysis module contains a dedicated parameter training sub-module and a model loading sub-module.

The method provided by this implementation mainly includes the following steps:

Step 1: The distributed data capture module captures Internet data, such as public opinion topics and their content, WeChat public account and its response, etc .;

Step 2: The data preprocessing module processes the received data in a regular manner. According to the requirements of the technical solution, the structured data format is shown in Table 1.

Step 3: Store the structured data in a distributed storage system, such as a distributed file system (Hadoop, HDFS), MongoDB, etc .;

Step 4: The algorithm training module periodically loads training data to train algorithm parameters, and obtains an algorithm model file. The specific implementation mode is:

Step 4-1: tokenize all text data, remove special symbols such as symbols, numbers, spaces, and stop words;

Step 4-2: Perform word embedding on the words obtained in 4-1.

Step 4-3: Add the word vectors obtained in 4-2 and take the average value and save;

Step 4-4: Perform Gaussian mixture model training;

Step 4-5: Get the optimal parameters according to the algorithm training and save it as an algorithm model file;

Step 5: The system loads the algorithm model file, calculates the data to be predicted stored in the distributed file system, and obtains the popularity score of each piece of data;

Step 5-1: tokenize the predicted text data, remove special symbols, numbers, spaces, and other special symbols, and stop words;

Step 5-2: Perform word embedding on the words obtained in 5-1.

Step 5-3: Add the word vectors obtained in 5-2 and take the average value and save;

Step 5-4: Start the platform and load the algorithm model file trained in step 4.

Step 5-5: Calculate the prediction data to obtain the classification and category probability of each piece of data;

Step 5-6: Calculate the 5-5 data to get the popularity score of each data;

Step 6: Display the aggregation result in step 5 on the front-end interface.
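
Steps 4-1 to 4-3 (and the identical Steps 5-1 to 5-3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of this embodiment: the stop-word list, the vector dimension `EMBED_DIM`, and the `embed_word` function are hypothetical stand-ins. In particular, `embed_word` derives a deterministic pseudo-random vector from the word itself, where a real system would look up a trained word embedding, and the regular-expression tokenizer assumes space-delimited text, so Chinese text would need a dedicated word segmenter.

```python
import re
import zlib

import numpy as np

# Hypothetical stop-word list and embedding size, chosen only for illustration.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to"}
EMBED_DIM = 8


def tokenize(text):
    """Step 4-1 / 5-1: keep letter-only words; drop symbols, digits, spaces, stop words."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def embed_word(word):
    """Step 4-2 / 5-2 stand-in: a deterministic pseudo-embedding seeded by the word.

    Assumption: a real system would use a trained embedding table instead.
    """
    rng = np.random.default_rng(zlib.crc32(word.encode("utf-8")))
    return rng.standard_normal(EMBED_DIM)


def text_to_vector(text):
    """Step 4-3 / 5-3: sum the word vectors and divide by the number of words."""
    words = tokenize(text)
    if not words:
        # No usable words: fall back to the zero vector.
        return np.zeros(EMBED_DIM)
    return np.sum([embed_word(w) for w in words], axis=0) / len(words)


vector = text_to_vector("The topic is trending: 100 reposts!")
```

Each piece of text is thereby reduced to one fixed-length vector regardless of its word count, which is the form required for the Gaussian mixture model training in Step 4-4 and the prediction in Step 5-5.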

This embodiment also includes the following implementation scenarios:

Implementation scenario 1

Weibo, the WeChat circle of friends, and Internet sites used by the public in daily life generate rich Internet material, which makes real-time analysis and tracking of public concerns and public opinion trends necessary. In the popularity trend analysis of Weibo and WeChat public accounts, the popularity analysis system based on the Gaussian mixture model in this example can accurately calculate and analyze the popularity of a new Weibo post when it appears, in order to make the next recommendation decision.

Step 1: Use distributed crawlers to crawl Weibo posts and the comments under them, as well as WeChat circle of friends content and comments;

Step 2: Data preprocessing cleans the data and stores it in structured form in HDFS. The storage format is shown in Table 4.

Table 4

Step 3: After word segmentation is performed on the data, word embedding is performed, and a multi-dimensional vector is obtained by adding and averaging. An example of the process is shown in FIG. 8.

Step 4: Enter the training data into the algorithm platform for training to obtain the algorithm model. The training platform uses sklearn.

Step 5: Enter the corpus to be predicted. The overall prediction process is shown in Figure 7, which is an example of the overall prediction process in the embodiment of the present invention. The algorithm model obtained in step 4 is called to compute the popularity score, and the analysis results are pushed and presented. The process is shown in FIG. 9, which is a general flowchart of microblog data popularity analysis in the example of the present invention.
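
Since the training platform in this scenario is sklearn, Steps 4 and 5 might look like the sketch below, which uses scikit-learn's `GaussianMixture`. The number of mixture components, the synthetic training vectors (stand-ins for averaged word vectors), and the use of `pickle` for the model file are all assumptions made for illustration. The sketch stops at the category and category probability of each piece of data; the popularity score is then derived from these values by the calculation method of this embodiment, which is not reproduced here.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-ins for averaged word vectors: two well-separated clusters.
train_vectors = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 4)),
    rng.normal(loc=5.0, scale=0.5, size=(100, 4)),
])

# Step 4: train the Gaussian mixture model and write it to a model file.
# n_components=2 is an assumption that matches the synthetic data only.
gmm = GaussianMixture(n_components=2, random_state=0).fit(train_vectors)
model_path = Path(tempfile.mkdtemp()) / "gmm_model.pkl"  # hypothetical file name
with open(model_path, "wb") as f:
    pickle.dump(gmm, f)

# Step 5: load the model file and evaluate vectors not seen during training.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

new_vectors = np.array([[0.1, -0.2, 0.0, 0.3], [5.2, 4.9, 5.1, 4.8]])
categories = loaded.predict(new_vectors)           # most likely component per row
probabilities = loaded.predict_proba(new_vectors)  # per-component probability per row
```

Keeping training and prediction connected only through the model file allows the model to be retrained periodically without changing the prediction side.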

Implementation scenario 2

Music players are very common software, with clients on many platforms. Music recommendation is a common feature of such software, for example music rankings based on popularity or recommendations based on users' personal habits, and this system can support such features. Take the overall music leaderboard as an example. It can be seen from the example in FIG. 10 that popular recommendation content is directly linked to popularity. FIG. 10 is a diagram illustrating an example of popularity analysis of music data in the example of the present invention.

Step 1: Collect music files without labeling.

Step 2: Record music files into the system;

Step 3: The data preprocessing module preprocesses the data and converts it into feature vectors. Only the general idea is given here, without detailed analysis. See Figure 11, which is a schematic diagram of converting an audio signal to feature vectors in the example of the present invention. After the sound signal is input, a Mel-Frequency Cepstral Coefficient (MFCC) parameter vector is output.

Step 4: Start algorithm training to build a prediction model. Construct a Gaussian mixture model according to the technical solution in the present invention;

Step 5: Analyze and predict the music. The overall process is similar to Figure 9, except that the data acquisition and preprocessing are slightly different.
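
The conversion from sound signal to MFCC vector in Step 3 can be illustrated with a simplified NumPy sketch. It follows the usual MFCC outline (framing, windowing, power spectrum, mel filter bank, logarithm, discrete cosine transform) but omits refinements such as pre-emphasis. The frame length, hop size, filter count, and coefficient count are assumed values, and averaging the per-frame coefficients into a single vector per track is likewise an assumption, not something specified in this embodiment.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale, shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising edge of the triangle
            bank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            bank[i - 1, k] = (right - k) / (right - center)
    return bank


def mfcc(signal, sample_rate, n_fft=512, hop=256, n_filters=26, n_coeffs=13):
    """Return an (n_frames, n_coeffs) array of MFCCs for a 1-D signal of at least n_fft samples."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)  # small offset avoids log(0)
    # Unnormalized DCT-II over the filter axis decorrelates the log energies.
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), np.arange(n_filters) + 0.5) / n_filters)
    return log_energies @ basis.T


sample_rate = 16000
t = np.arange(sample_rate) / sample_rate   # one second of audio
tone = np.sin(2 * np.pi * 440.0 * t)       # synthetic 440 Hz tone as test input
features = mfcc(tone, sample_rate)
track_vector = features.mean(axis=0)       # assumption: one averaged vector per track
```

The resulting `track_vector` plays the same role for a music file that the averaged word vector plays for a piece of text: it is the fixed-length input to the Gaussian mixture model in Steps 4 and 5.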

Implementation scenario 3

Online shopping platforms need to analyze existing products, better understand market conditions and changes, and have relatively accurate estimates of the popularity of newly listed products. In this scenario, the analysis of product popularity becomes extremely important. Using the popularity analysis system in this example, it is possible to perform a popularity analysis on a specific product type and provide a popularity evaluation score for newly listed products. An example is shown in FIG. 12, which is a diagram illustrating an example of popularity analysis of a commodity in an example of the present invention.

Step 1: Collect basic information about a certain type of product, where name, brand, click volume (or sales volume), and click user (or purchase user) are required fields.

Step 2: Each click (purchase) of each user is regarded as a training sample to form a sample set (repeated clicks or purchases by the same user are not counted) and stored in a distributed persistence system such as HDFS.

Step 3: Preprocess the samples and convert them into feature vectors. The method is as shown by the dashed arrows in FIG. 13.

Step 3-1: Use the word2vec or GloVe algorithm to perform word embedding, converting text words into word vectors while keeping the attribute type unchanged;

Step 3-2: Use the one-hot method to label the brand;

Step 3-3: Combine the results of the previous two parts and the remaining numeric parameters into a vector;

Step 4: Enable the popularity analysis system in the present invention for training to obtain a GMM model.

Step 5: Make predictions on the predicted products. The process is shown in Figure 13.

Step 5-1: Use the word2vec algorithm to convert text words into word vectors, with the attribute type remaining unchanged;

Step 5-2: Use the one-hot method to label the brand;

Step 5-3: Combine the results of the previous two parts and the remaining numeric parameters into a vector;

Step 5-4: Use the prediction method in the present invention to obtain the category i and the category probability proba;

Step 5-5: Calculate the popularity score using the calculation method of the present invention;
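
The feature construction in Steps 3-1 to 3-3 (and Steps 5-1 to 5-3) can be sketched as below. The brand vocabulary `BRANDS`, the vector dimension, the `price` field, and the `embed_word` stand-in are hypothetical and serve only to make the example self-contained; a real system would use word vectors trained with word2vec or GloVe and whatever numeric fields the product records actually contain.

```python
import zlib

import numpy as np

EMBED_DIM = 4                             # hypothetical embedding size
BRANDS = ["BrandA", "BrandB", "BrandC"]   # hypothetical closed brand vocabulary


def embed_word(word):
    """Stand-in for a trained word vector: deterministic pseudo-embedding per word."""
    rng = np.random.default_rng(zlib.crc32(word.encode("utf-8")))
    return rng.standard_normal(EMBED_DIM)


def one_hot(brand):
    """Step 3-2: label the brand with a one-hot vector over the brand vocabulary."""
    vec = np.zeros(len(BRANDS))
    vec[BRANDS.index(brand)] = 1.0
    return vec


def product_to_vector(name, brand, click_volume, price):
    """Steps 3-1 to 3-3: embed the name, one-hot the brand, append numeric fields."""
    words = name.lower().split()
    name_vec = np.mean([embed_word(w) for w in words], axis=0)
    return np.concatenate([name_vec, one_hot(brand), [click_volume, price]])


sample = product_to_vector("wireless noise cancelling headphones", "BrandB", 1200, 59.9)
```

Because the numeric fields can differ in scale from the embedding and one-hot parts by orders of magnitude, it may be worth standardizing them before Gaussian mixture model training; whether to do so is a design choice outside the scope of this sketch.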

Implementation scenario 4

Reading news on the Internet is nothing new today, and Internet sites update their news constantly. To analyze and track popular concerns and public opinion trends in real time, the popularity analysis system based on the Gaussian mixture model in this example can be used to score the popularity of each news item. When a new news item appears, its popularity can be accurately calculated and analyzed in order to make the next decision.

Step 1: Use distributed crawlers to crawl news content;

Step 2: Data preprocessing cleans the data and stores it in structured form in HDFS. The storage format is shown in Table 5:

Table 5

Step 3: After word segmentation is performed on the data, word embedding is performed, and the word vectors are summed and averaged to obtain a multi-dimensional vector. The process example is similar to that shown in FIG. 7;

Step 5: Input the corpus to be predicted, call the algorithm model obtained in step 4 to perform popularity scoring, and push and present the analysis results. The overall process is shown in FIG. 14, which is an overall flowchart of news data popularity analysis in the example of the present invention.

Embodiments of the present invention provide a popularity analysis system and method based on a Gaussian mixture model. The proposed system covers crawling of corpus topic information, preprocessing of corpus information, Gaussian mixture modeling, and popularity analysis and prediction, through to output of the popularity score results, and is built on the popularity analysis method of the Gaussian mixture model. An embodiment of the present invention describes a method for analyzing and predicting the popularity of public opinion topics based on the Gaussian mixture model, extends it to multiple fields, and establishes a popularity analysis system based on Gaussian mixture clustering.

An embodiment of the present invention further provides a storage medium. A computer program is stored in the storage medium, and the computer program is configured to execute the information processing method provided by the embodiment of the present invention when running.

In some embodiments, the above-mentioned storage medium may be configured to store a computer program for performing the following steps:

S1. Obtain topic data.

S2. Preprocess the topic data to obtain structured data.

S3. The structured data is input to a model file, and the popularity information of the topic data is calculated.

In some embodiments, the above-mentioned storage medium may include, but is not limited to, a USB flash drive (U disk), a Read-Only Memory (ROM), a Random Access Memory (RAM), a mobile hard disk, a magnetic disk, an optical disk, or any other medium that can store computer programs.

An embodiment of the present invention further provides an electronic device including a memory and a processor. The memory stores a computer program, and the processor is configured to run the computer program to execute the information processing method.

In some embodiments, the electronic device may further include a transmission device and an input-output device, wherein the transmission device is connected to the processor, and the input-output device is connected to the processor.

In some embodiments, the processor may be configured to perform the following steps by a computer program:

S1. Obtain topic data.

S2. Preprocess the topic data to obtain structured data.

Obviously, those skilled in the art should understand that the above-mentioned modules or steps of the embodiments of the present invention may be implemented by a general-purpose computing device. They may be concentrated on a single computing device or distributed across a network composed of multiple computing devices. In some embodiments, they may be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from that described here, or they may be made into individual integrated circuit modules, or multiple modules or steps among them may be made into a single integrated circuit module. As such, the invention is not limited to any particular combination of hardware and software.

The above descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the principle of the present invention shall be included in the protection scope of the present invention.

Claims

1. An information processing method, comprising:

obtaining topic data;

preprocessing the topic data to obtain structured data; and

inputting the structured data into a model file, and calculating popularity information of the topic data.

2. The method according to claim 1, wherein after calculating the popularity information of the topic data, the method further comprises:

displaying the popularity information of the topic data on a front-end interface.

3. The method according to claim 1, wherein before the structured data is input into the model file, the method further comprises one of the following:

training the model file;

presetting the model file.

4. The method according to claim 3, wherein training the model file comprises:

performing word segmentation on sample text data and removing characters of a specified type from the sample text data to obtain first data;

performing word embedding processing on the first data to obtain second data;

adding and averaging word vectors of the second data to obtain third data; and

performing Gaussian mixture model training on an original model according to the category of the third data to obtain the model file.

5. The method according to claim 1, wherein inputting the structured data into the model file and calculating the popularity information of the topic data comprises:

performing word segmentation on the structured data and removing characters of a specified type from the structured data to obtain first structured data;

performing word embedding processing on the first structured data to obtain second structured data;

adding and averaging word vectors of the second structured data to obtain third structured data;

inputting the third structured data into the model file to obtain a classification and a category probability of each piece of data; and

performing a calculation on the category probability to obtain the popularity information of the topic data.

6. The method according to claim 1, wherein preprocessing the topic data to obtain structured data comprises:

splitting the topic data according to data type;

deleting data of specific types included in the topic data to obtain candidate data, wherein the specific types include at least one of the following: pictures, voices, and expressions; and

structuring the candidate data into the structured data.

7. The method according to claim 1, wherein obtaining topic data comprises:

capturing the topic data from the Internet, wherein the topic data includes at least one of the following: topic content and comment information.

8. An information processing device, comprising:

an acquisition module configured to acquire topic data;

a processing module configured to preprocess the topic data to obtain structured data; and

a calculation module configured to input the structured data into a model file and calculate popularity information of the topic data.

9. The device according to claim 8, wherein the calculation module comprises:

a first processing unit configured to perform word segmentation on the structured data and remove characters of a specified type from the structured data to obtain first structured data;

a second processing unit configured to perform word embedding processing on the first structured data to obtain second structured data;

a first calculation unit configured to add and average the word vectors of the second structured data to obtain third structured data;

a second calculation unit configured to input the third structured data into the model file and calculate a classification and a category probability of each piece of data; and

a third calculation unit configured to perform a calculation on the category probability to obtain the popularity information of the topic data.

10. A storage medium storing a computer program, wherein the computer program is configured to execute, when run, the method according to any one of claims 1 to 7.

11. An electronic device comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to perform the method according to any one of claims 1 to 7.