WO2019242453A1 - Information processing method and device, storage medium, and electronic device


Publication number
WO2019242453A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
topic
structured data
structured
model file
Application number
PCT/CN2019/088435
Other languages
French (fr)
Chinese (zh)
Inventor
陆平
韦安军
胡晓
Original Assignee
中兴通讯股份有限公司
Application filed by 中兴通讯股份有限公司
Publication of WO2019242453A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Definitions

  • the present invention relates to, but is not limited to, the field of communications, and in particular, to an information processing method and device, a storage medium, and an electronic device.
  • embodiments of the present invention aim to provide an information processing method and device, a storage medium, and an electronic device.
  • An embodiment of the present invention provides an information processing method, including: obtaining topic data; pre-processing the topic data to obtain structured data; and inputting the structured data into a model file and calculating the popularity information of the topic data.
  • An embodiment of the present invention further provides an information processing apparatus, including: an acquisition module configured to acquire topic data; a processing module configured to preprocess the topic data to obtain structured data; and a calculation module configured to input the structured data into a model file and calculate the popularity information of the topic data.
  • a storage medium stores a computer program, and the computer program is configured to execute the information processing method provided by the embodiment of the present invention when running.
  • An embodiment of the present invention further provides an electronic device including a memory and a processor.
  • the memory stores a computer program
  • the processor is configured to run the computer program to perform the information processing method provided by the embodiment of the present invention.
  • structured data is obtained by preprocessing the topic data, and the popularity information of the topic data is then calculated according to the model file, thereby improving the efficiency of analyzing topic popularity.
  • FIG. 1 is a flowchart of an information processing method according to an embodiment of the present invention
  • FIG. 2 is a structural block diagram of an information processing apparatus according to an embodiment of the present invention.
  • FIG. 3 is a system structural diagram of an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of an elbow algorithm for determining a K value in an embodiment of the present invention
  • FIG. 6 is a diagram illustrating an example of an initialization point determination process in an embodiment of the present invention.
  • FIG. 7 is an example diagram of an overall prediction process in an embodiment of the present invention.
  • FIG. 8 is an example diagram of a processing flow before training in an example of the present invention.
  • FIG. 9 is an overall flowchart of analyzing the popularity of microblog data in the example of the present invention.
  • FIG. 10 is a diagram illustrating an example of popularity analysis of music data in an example of the present invention.
  • FIG. 11 is a schematic diagram of an audio signal to feature vector in an example of the present invention.
  • FIG. 12 is a diagram illustrating an example of popularity analysis of commodities in an example of the present invention.
  • FIG. 14 is an overall flowchart of news data popularity analysis in an example of the present invention.
  • FIG. 1 is a flowchart of an information processing method provided by an embodiment of the present invention. As shown in FIG. 1, an information processing method provided by an embodiment of the present invention includes:
  • Step S102 obtaining topic data
  • Step S104 pre-process the topic data to obtain structured data
  • step S106 the structured data is input into a model file, and the popularity information of the topic data is calculated.
  • structured data is obtained by preprocessing the topic data, and the popularity information of the topic data is then calculated according to the model file, which solves the technical problem of inefficient analysis of topic popularity in related technologies.
  • the execution subject of the above steps may be a server, a terminal, etc., but is not limited thereto.
  • the method further includes: displaying the popularity information of the topic data on a front-end interface. The topics can be listed in descending order of popularity, and the popularity information can be a score.
  • before inputting the structured data into the model file, the method further includes one of the following: training the model file; presetting the model file.
  • When the model file is preset, it has already been trained and can be used directly. It can also be retrained during use.
  • training the model file includes:
  • Removing the characters of the specified type from the sample text data includes: removing special symbols such as punctuation, numbers, and spaces, as well as stop words;
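  • The character-removal step can be sketched as follows. This is a minimal illustration, assuming the text has already been segmented into tokens by a tokenizer; the function name `clean_tokens` and the tiny `STOP_WORDS` set are hypothetical and not taken from the source.

```python
# Hypothetical stop-word list; a real system would load a full list
# for the language being processed.
STOP_WORDS = {"the", "a", "of", "and", "is"}


def clean_tokens(tokens):
    """Drop tokens that are symbols, numbers, whitespace, or stop words."""
    kept = []
    for tok in tokens:
        tok = tok.strip()
        if not tok:
            continue  # empty or whitespace-only token
        if not any(ch.isalpha() for ch in tok):
            continue  # pure punctuation, symbols, or digits
        if tok.lower() in STOP_WORDS:
            continue  # stop word
        kept.append(tok)
    return kept


if __name__ == "__main__":
    print(clean_tokens(["The", "price", "of", "5", "!!", " ", "phones", "#"]))
```

  • `str.isalpha()` is also true for CJK characters, so the same filter keeps Chinese words while discarding digits and punctuation.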
  • inputting the structured data into a model file and calculating the popularity information of the topic data includes:
  • S21 Segment the structured data and remove the characters of the specified type from it to obtain the first structured data.
  • Removing the characters of the specified type from the structured data includes: removing special symbols such as punctuation, numbers, and spaces.
  • S25 Calculate the popularity information of the topic data from the category probability.
  • pre-processing the topic data to obtain structured data includes:
  • Structure candidate data into structured data
  • obtaining the topic data includes: capturing topic data from the Internet, where the topic data includes at least one of the following: topic content and comment information.
  • Topic data can be obtained from WeChat Moments, Weibo, Tieba (Post Bar), websites, application software, and so on.
  • the method according to the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform. It can also be implemented in hardware, but in many cases the former is the better implementation.
  • the technical solution of the present invention, in essence or in the part that contributes over the existing technology, can be embodied in the form of a software product. The software product is stored in a storage medium (such as a Read-Only Memory (ROM) / Random Access Memory (RAM), a magnetic disk, or a compact disc) and includes a number of instructions that cause a terminal device (which can be a mobile phone, computer, server, network device, etc.) to execute the methods described in the various embodiments of the present invention.
  • the term "module" may refer to a combination of software and/or hardware that implements a predetermined function.
  • although the devices described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and conceivable.
  • FIG. 2 is a structural block diagram of an information processing apparatus according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes:
  • An obtaining module 22 configured to obtain topic data
  • a processing module 24 configured to preprocess the topic data to obtain structured data
  • the calculation module 26 is configured to input the structured data into the model file and calculate the popularity information of the topic data.
  • the calculation module includes: a first processing unit configured to segment the structured data and remove characters of a specified type from the structured data to obtain first structured data;
  • a second processing unit configured to perform word embedding on the first structured data to obtain second structured data;
  • a first calculation unit configured to sum and average the word vectors of the second structured data to obtain third structured data;
  • a second calculation unit configured to input the third structured data into the model file and calculate the classification and category probability of each piece of data;
  • a third calculation unit configured to calculate the popularity information of the topic data from the category probability.
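  • The summing and averaging of word vectors can be sketched as below. This is an illustration only: the `embeddings` lookup table stands in for a trained word-embedding model, and the function name `average_vector` and the choice to skip unknown words are assumptions, not details given in the source.

```python
def average_vector(tokens, embeddings, dim):
    """Sum the word vectors of the known tokens and divide by their count.

    `embeddings` maps a word to a list of `dim` floats. Tokens without an
    embedding are skipped (an assumption made for this sketch).
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known word: fall back to the zero vector
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


if __name__ == "__main__":
    toy_embeddings = {"new": [3.0, 5.0], "phone": [1.0, 3.0]}
    print(average_vector(["new", "phone", "unknown"], toy_embeddings, 2))
```

  • The result is a single fixed-length vector per text, which is the form the clustering model takes as input regardless of how many words the text contains.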
  • each of the above modules can be implemented by software or hardware.
  • for the latter, this can be achieved in the following ways, but is not limited to them: the above modules are all located in the same processor; or the above modules are located in different processors in any combination.
  • Embodiments of the present invention provide a popularity analysis system and method based on a Gaussian mixture model.
  • A system is proposed that covers corpus topic information crawling, corpus information preprocessing, Gaussian mixture modeling, popularity analysis and prediction, and output of popularity score results.
  • The system is based on a popularity analysis method that uses the Gaussian Mixture Model (GMM).
  • the embodiment of the present invention discusses a popularity analysis method based on Gaussian mixture clustering (Mixture of Gaussian, MoG) and a method for predicting the popularity of public opinion topics based on a Gaussian mixture model, and on this basis extends to multiple fields to establish a popularity analysis system based on Gaussian mixture clustering.
  • The structure of the “popularity analysis system for public opinion content based on Gaussian mixture model” provided by the embodiment of the present invention is shown in FIG. 3.
  • the distributed data capture module captures public opinion content data, filters the original data through a preprocessing process, and stores the preprocessed data in a distributed file system.
  • The popularity analysis task is started at regular intervals: it loads the model file obtained by training the popularity analysis algorithm provided by the embodiment of the present invention, takes the sample data as input, obtains and ranks the popularity score of each sample text, and displays the result on the portal interface.
  • FIG. 4 is a system module diagram of an embodiment of the present invention.
  • a system provided by an embodiment of the present invention includes:
  • Distributed data capture module: responsible for capturing public opinion topic data from the Internet.
  • Weibo topic data includes Weibo topics, the text content contained in each topic, and the comments under each text. The most important part to capture is the text content itself.
  • Data pre-processing module: responsible for pre-processing the captured raw data, cleaning out the pictures, voice, emoticons, and other non-text data it contains, and normalizing unstructured data into structured data.
  • The storage format of structured data is shown in Table 1, which describes the structured data fields.
  • Distributed file system module: responsible for storing data.
  • Algorithm training and analysis module: responsible for establishing the Gaussian mixture analysis model, training it on the training data (because this is a clustering operation, the training data does not need to be labeled), and saving the model for use in predictive analysis.
  • Predictive scoring calculation module: predicts the class of each test sample and the probability that it belongs to that class according to the Gaussian mixture model, then calculates the score from this probability, the total number of samples in that class, and the K value.
  • the training analysis module algorithm establishment idea is as follows:
  • the contribution of each sample to a Gaussian distribution can be expressed by the sample's probability under that distribution: a large probability means a large contribution, and vice versa. The contribution is then used as a weight to calculate the weighted mean and variance, which replace the distribution's original mean and variance.
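  • The weighted update described here can be illustrated for one-dimensional samples. This is a minimal sketch; the function name `weighted_mean_variance` is hypothetical, and the weights are assumed to be the per-sample contributions already computed elsewhere.

```python
def weighted_mean_variance(samples, weights):
    """Return the weighted mean and weighted variance of 1-D samples.

    weights[i] is the contribution of samples[i] to the Gaussian being
    updated; samples with larger weights pull the estimates harder.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    mean = sum(w * x for x, w in zip(samples, weights)) / total
    variance = sum(w * (x - mean) ** 2 for x, w in zip(samples, weights)) / total
    return mean, variance


if __name__ == "__main__":
    # The third sample contributes nothing, so it does not affect the result.
    print(weighted_mean_variance([0.0, 2.0, 10.0], [0.5, 0.5, 0.0]))
```

  • A sample with zero contribution leaves the estimates untouched, which is why an outlier that clearly belongs to another component does not distort this component's mean.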
  • step 1) the open source tokenizer ansj, hanlp, etc. can be used.
  • the hanlp tokenizer is used.
  • in step 2), a word2vec or Glove model that has already been trained is used directly to generate the word vectors.
  • step 4) determines the number of categories, that is, the K value, using the elbow algorithm. Some notation for the clustering algorithm is given first:
  • the clustering algorithm looks for cluster centers that minimize the distance between each sample and its cluster center. So the optimization goal of the clustering algorithm is:
  • J represents the sum of the distances from each sample to the cluster center, so J represents the error to some extent, and the smallest J means the smallest cluster error.
  • when the value of the optimization objective J is plotted against K, the elbow method takes K at the inflection point of the curve, as shown in FIG. 5, which is a schematic diagram of the elbow algorithm determining the value of K in the embodiment of the present invention. In that figure, K = 3 or K = 6 is the more appropriate choice.
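  • The elbow idea can be tried on toy data. The sketch below is an assumption-laden illustration, not the patent's implementation: it runs a plain one-dimensional K-Means with a deterministic initialisation, takes J as the sum of squared distances from each point to its nearest center, and returns J for K = 1 .. max_k. The names `kmeans_1d` and `elbow_curve` are hypothetical.

```python
def kmeans_1d(points, k, iterations=50):
    """Plain Lloyd's algorithm on 1-D data; returns (centers, J)."""
    pts = sorted(points)
    # Deterministic initialisation: spread the k centers over the sorted data.
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in pts:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # An empty cluster keeps its previous center.
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    cost = sum(min((x - c) ** 2 for c in centers) for x in pts)
    return centers, cost


def elbow_curve(points, max_k):
    """Return [J(1), J(2), ..., J(max_k)] for inspection or plotting."""
    return [kmeans_1d(points, k)[1] for k in range(1, max_k + 1)]


if __name__ == "__main__":
    data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]  # three clear groups
    for k, j in enumerate(elbow_curve(data, 4), start=1):
        print(k, round(j, 4))
```

  • On data with three well-separated groups, J falls steeply up to K = 3 and only marginally afterwards, so the bend of the curve sits at K = 3.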
  • Step 5 Use the K-Means algorithm to find the initialization points: this algorithm is used only to find initialization points for Gaussian mixture clustering training, which improves the accuracy and convergence efficiency of MoG, so its details are not discussed here. Taking two-dimensional data as an example, the process by which the K-Means algorithm finds the initialization points for Gaussian mixture clustering is shown in FIG. 6.
  • FIG. 6 is an example diagram of the initialization point determination process in the embodiment of the present invention.
  • Steps 6), 7) and 8) involve Gaussian mixture clustering and the EM algorithm (Expectation-Maximization algorithm). Each key step is described below:
  • the Gaussian mixture model can be expressed by the following formula:
  • $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$ (1)
  • $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is called the k-th component of the mixture model.
  • K = 4. $\pi_k$ is the mixture coefficient and satisfies: $\sum_{k=1}^{K} \pi_k = 1$, $0 \le \pi_k \le 1$.
  • $\pi_k$ is equivalent to the weight of each component $\mathcal{N}(x \mid \mu_k, \Sigma_k)$.
  • the components $z_k$ of the latent variable $z$ must satisfy the following two conditions: $z_k \in \{0, 1\}$ and $\sum_{k=1}^{K} z_k = 1$.
  • each $z_k$ can only be 0 or 1.
  • exactly one $z_k$ is 1 and all the others are 0, so the equation holds.
  • the above content rewrites the form of GMM, and introduces the latent variable $z$ and the posterior probability $\gamma(z_k)$ given a known $x$. This is done to facilitate the use of the EM algorithm to estimate the parameters of the GMM.
  • the EM algorithm is used to calculate the parameters.
  • the EM algorithm has two steps. The first step is to find the rough value of the parameter to be estimated. The second step uses the value of the first step to maximize the likelihood function. Therefore, the likelihood function of GMM must be obtained first.
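  • $X = \{x_1, x_2, \ldots, x_N\}$; for FIG. 6, X is all points in the figure (each point has two coordinates on a two-dimensional plane and is a two-dimensional vector).
  • the probability model of GMM is shown in formula (1). There are three parameters in the GMM model that need to be estimated, namely $\pi$, $\mu$, and $\Sigma$. Write (1) as a continued product over all points: $p(X) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$
  • N represents the number of points.
  • $\gamma(z_{nk})$ represents the posterior probability that point $x_n$ belongs to cluster k. Then $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$ can represent the number of points belonging to the k-th cluster, and $\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n$ represents the weighted average of all points, where the weight of each point is $\gamma(z_{nk})$, which relates it to the k-th cluster.
  • E-step calculates the posterior probability $\gamma(z_{nk})$ based on the current $\pi_k$, $\mu_k$, and $\Sigma_k$: $\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$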
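  • The alternation between the E-step and the parameter update can be sketched for one-dimensional data. This is a simplified illustration under stated assumptions: scalar variances replace covariance matrices, the initial means are passed in by the caller (for example, centers found by K-Means), and a small variance floor `min_var` is added for numerical safety. The function names are hypothetical and none of this is taken from the source.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def e_step(points, weights, means, variances):
    """Return gamma[n][k]: posterior probability that point n belongs to component k."""
    gamma = []
    for x in points:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        gamma.append([j / total for j in joint])
    return gamma


def m_step(points, gamma, min_var=1e-6):
    """Re-estimate weights, means, and variances from the posteriors."""
    n, k = len(points), len(gamma[0])
    weights, means, variances = [], [], []
    for c in range(k):
        n_k = sum(gamma[i][c] for i in range(n))  # effective cluster size
        mean = sum(gamma[i][c] * points[i] for i in range(n)) / n_k
        var = sum(gamma[i][c] * (points[i] - mean) ** 2 for i in range(n)) / n_k
        weights.append(n_k / n)
        means.append(mean)
        variances.append(max(var, min_var))  # floor added for this sketch
    return weights, means, variances


def fit_gmm_1d(points, init_means, iterations=100):
    """Run a fixed number of EM iterations starting from the given means."""
    k = len(init_means)
    weights, means, variances = [1.0 / k] * k, list(init_means), [1.0] * k
    for _ in range(iterations):
        gamma = e_step(points, weights, means, variances)
        weights, means, variances = m_step(points, gamma)
    return weights, means, variances


if __name__ == "__main__":
    data = [0.8, 1.0, 1.2, 8.8, 9.0, 9.2]
    print(fit_gmm_1d(data, init_means=[0.0, 10.0]))
```

  • A production system would more likely rely on a library implementation that handles full covariance matrices and convergence checks; the sketch only shows how the posteriors act as weights in the updates.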
  • the previous examples are based on the two-dimensional data shown in Figure 6.
  • in practice the input training data has far more than two dimensions, but the algorithm principle is exactly the same.
  • the only parameter the training module needs to determine is the K value, which can be determined using the elbow algorithm; no other parameters need to be set. Therefore, one characteristic of this system is that training can be performed directly after the training data is obtained. Table 2 is used to explain the training corpus input.
  • Gaussian mixture clustering does not need the data to be divided into a training set, a validation set, and a test set. After training is completed, the parameters are obtained directly and the parameter set can be saved.
  • the input corpus format type is the same as the training corpus
  • the format of the input corpus is shown in Table 3. Table 3 is used to describe the format of the predicted input corpus.
  • the process of scoring the popularity also belongs to the solution of this embodiment.
  • once scored, the data can be easily sorted and compared.
  • this patent uses a method based on the results of Gaussian clustering, so that the popularity score reflects both the number of categories and the number of samples, as follows:
  • amount (X) is the total number of all training samples
  • proba (x) is the probability that the sample x belongs to class i, which is predicted by the Gaussian mixture model.
  • the method roughly locates the score of the test sample within the model, multiplies this score by proba (x), and then balances the result by K. The balancing is needed because a larger K makes the value smaller, which would make the ratings given by different models differ too much and hinder horizontal comparison. With this calculation, each sample can be given a popularity score, and new data can easily be added to the training data to optimize the model.
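  • The scoring formula itself is not present in this text, so the sketch below is only one plausible reading of the description, stated as an assumption: the share of training samples in the predicted class is multiplied by proba(x) and then rescaled by K. The function name `popularity_score` and its parameters are hypothetical.

```python
def popularity_score(proba, cluster_size, total_samples, k):
    """Hypothetical popularity score (an assumed reading, not the patent's formula).

    proba          probability that the sample belongs to its predicted class
    cluster_size   number of training samples in that class
    total_samples  total number of training samples, amount(X)
    k              number of classes K, used to balance scores across models
    """
    if total_samples <= 0 or k <= 0:
        raise ValueError("total_samples and k must be positive")
    share = cluster_size / total_samples  # rough position of the class in the model
    return proba * share * k


if __name__ == "__main__":
    # A sample that very likely belongs to a class holding 30% of the data, K = 3.
    print(popularity_score(0.9, 300, 1000, 3))
```

  • Multiplying by K keeps scores from models with different K on a similar scale, since the average class share is 1/K.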
  • this implementation of a topic-oriented sentiment analysis system includes a distributed data capture module, a data preprocessing module, a distributed storage module, an algorithm analysis module, a predictive scoring module, and an optional front-end display module.
  • the algorithm analysis module contains a special parameter training sub-module and a model loading sub-module.
  • the method provided by this implementation mainly includes the following steps:
  • Step 1 The distributed data capture module captures Internet data, such as public opinion topics and their content, WeChat official accounts and their replies, etc.;
  • Step 2 The data preprocessing module normalizes the received data into structured data.
  • the structured data format is shown in Table 1.
  • Step 3 Store the structured data in a distributed storage system, such as the Hadoop Distributed File System (HDFS), MongoDB, etc.;
  • Step 4 The algorithm training module periodically loads training data to train algorithm parameters, and obtains an algorithm model file.
  • the specific implementation mode is:
  • Step 4-1 tokenize all text data, remove special symbols such as symbols, numbers, spaces, and stop words;
  • Step 4-2 Perform word embedding on the words obtained in 4-1.
  • Step 4-3 Add the word vectors obtained in 4-2 and take the average value and save;
  • Step 4-4 Perform Gaussian mixture model training
  • Step 4-5 Get the optimal parameters according to the algorithm training and save it as an algorithm model file
  • Step 5 The system loads the algorithm model file, runs it on the to-be-predicted data stored in the distributed file system, and obtains the popularity score of each piece of data;
  • Step 5-1 Tokenize the text data to be predicted and remove special symbols such as punctuation, numbers, and spaces, as well as stop words;
  • Step 5-2 Perform word embedding on the words obtained in 5-1.
  • Step 5-3 Add the word vectors obtained in 5-2, take the average value, and save it;
  • Step 5-4 Start the platform and load the algorithm model file trained in step 4.
  • Step 5-5 Run the model on the prediction data to obtain the classification and category probability of each piece of data
  • Step 5-6 Use the results of 5-5 to calculate the popularity score of each piece of data
  • Step 6 Display the aggregated results of step 5 on the front-end interface.
  • Weibo, WeChat Moments, and Internet sites used by the public in daily life generate rich Internet material. Real-time analysis and tracking of public concerns and of public opinion trends is therefore necessary.
  • For Weibo and WeChat official account popularity trend analysis, the popularity analysis system based on the Gaussian mixture model in this example can, when a new Weibo post appears, accurately calculate and analyze its popularity in order to make the next recommendation decision.
  • Step 1 Use distributed crawlers to crawl Weibo posts and the comments under them, as well as WeChat Moments content and comments;
  • Step 2 Data preprocessing cleans the data and stores the structured data in HDFS.
  • the storage format is shown in Table 4.
  • Step 3 After word segmentation is performed on the data, word embedding is performed, and a multi-dimensional vector is obtained by adding and averaging. An example of the process is shown in FIG. 8.
  • Step 4 Enter the training data into the algorithm platform for training to obtain the algorithm model.
  • the training platform uses sklearn.
  • Step 5 Enter the corpus to be predicted.
  • the overall process is shown in Figure 7.
  • Figure 7 is an example of the overall prediction process in the embodiment of the present invention.
  • the algorithm model obtained in step 4 is called to score popularity, and the analysis results are pushed and presented.
  • the process is shown in FIG. 9, which is a general flowchart of microblog data popularity analysis in the example of the present invention.
  • FIG. 10 is a diagram illustrating an example of popularity analysis of music data in the example of the present invention.
  • Step 1 Collect music files without labeling.
  • Step 2 Record music files into the system
  • Step 3 The data preprocessing module preprocesses the data and converts the data into feature vectors.
  • MFCC Mel Frequency Cepstral Coefficient
  • Step 4 Start algorithm training to build a prediction model. Construct a Gaussian mixture model according to the technical solution in the present invention
  • Step 5 Analyze and predict the music.
  • the overall process is similar to Figure 9, except that the data acquisition and preprocessing are slightly different.
  • FIG. 12 is a diagram illustrating an example of popularity analysis of a commodity in an example of the present invention.
  • Step 1 Collect basic information about a certain type of product, where name, brand, click volume (or sales volume), and click user (or purchase user) are required fields.
  • Step 2 Each click (purchase) of each user is regarded as a training sample to form a sample set (repeated clicks or purchases by the same user are not counted) and stored in a distributed persistence system such as HDFS.
  • Step 3 Preprocess the samples and convert them into feature vectors. The method is as shown by the dashed arrows in FIG. 13.
  • Step 3-1 Use the word2vec or Glove algorithm to do word embedding, converting text words into word vectors while leaving the attribute type unchanged;
  • Step 3-2 Use the one-hot method to encode the brand
  • Step 3-3 Combine the results of the previous two parts and the remaining numeric parameters into a vector
  • Step 4 Enable the popularity analysis system in the present invention for training to obtain a GMM model.
  • Step 5 Make predictions for the products to be predicted. The process is shown in Figure 13.
  • Step 5-1 Use the word2vec algorithm to convert text words into word vectors, leaving the attribute type unchanged.
  • Step 5-2 Use the one-hot method to encode the brand
  • Step 5-3 Combine the results of the previous two parts and the remaining numeric parameters into a vector
  • Step 5-4 Use the prediction method in the present invention to obtain the category i and the category probability proba;
  • Step 5-5 Calculate the popularity score using the calculation method of the present invention.
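  • The construction of a product feature vector (word vector, one-hot brand, remaining numeric fields) can be sketched as follows. The field layout, the brand vocabulary, and the function names `one_hot` and `product_vector` are assumptions made for illustration; the name vector is assumed to come from an already trained embedding model.

```python
def one_hot(value, vocabulary):
    """Encode `value` as a 0/1 vector over a fixed, ordered vocabulary."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def product_vector(name_vector, brand, brands, numeric_fields):
    """Concatenate the name embedding, the one-hot brand, and numeric fields."""
    return (list(name_vector)
            + one_hot(brand, brands)
            + [float(v) for v in numeric_fields])


if __name__ == "__main__":
    brands = ["A", "B", "C"]      # hypothetical brand vocabulary
    name_vec = [0.5, -0.5]        # hypothetical averaged word vector of the name
    clicks_and_sales = [120, 7]   # hypothetical numeric fields
    print(product_vector(name_vec, "B", brands, clicks_and_sales))
```

  • The same function has to be used with the same brand vocabulary at training and prediction time, otherwise the vector positions no longer mean the same thing.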
  • Step 1 Use distributed crawlers to crawl news content
  • Step 2 Data pre-processing cleans the data and stores the structured data in HDFS; the storage format is shown in Table 5:
  • Step 3 After word segmentation is performed on the data, word embedding is performed, and the word vectors are added and averaged to obtain a multi-dimensional vector.
  • the process example is similar to that shown in FIG. 7;
  • Step 4 Enter the training data into the algorithm platform for training to obtain the algorithm model.
  • the training platform uses sklearn.
  • Step 5 Input the corpus to be predicted, call the algorithm model obtained in step 4 to perform popularity scoring, and push and present the analysis results.
  • the overall process is shown in FIG. 14, which is an overall flowchart of news data popularity analysis in the example of the present invention.
  • Embodiments of the present invention provide implementation of a popularity analysis system and method based on a Gaussian mixture model.
  • a system is proposed from corpus topic information crawling, corpus information preprocessing, Gaussian mixture modeling, popularity analysis and prediction, to output popularity score results.
  • the system is based on the popularity analysis method of the Gaussian mixture model.
  • An embodiment of the present invention discusses the method of analyzing and predicting the popularity of public opinion topics based on the Gaussian mixture model, extends it to multiple fields on this basis, and establishes a popularity analysis system based on Gaussian mixture clustering.
  • An embodiment of the present invention further provides a storage medium.
  • a computer program is stored in the storage medium, and the computer program is configured to execute the information processing method provided by the embodiment of the present invention when running.
  • the above-mentioned storage medium may be configured to store a computer program for performing the following steps:
  • the structured data is input to a model file, and the popularity information of the topic data is calculated.
  • the above-mentioned storage medium may include, but is not limited to, a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk, etc.
  • An embodiment of the present invention further provides an electronic device including a memory and a processor.
  • the memory stores a computer program
  • the processor is configured to run the computer program to execute the information processing method.
  • the electronic device may further include a transmission device and an input-output device, wherein the transmission device is connected to the processor, and the input-output device is connected to the processor.
  • the processor may be configured to perform the following steps by a computer program:
  • the structured data is input to a model file, and the popularity information of the topic data is calculated.
  • the modules or steps of the embodiments of the present invention may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network composed of multiple computing devices. In some embodiments, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in a different order from that described here, or they may be made into individual integrated circuit modules, or multiple modules or steps among them may be made into a single integrated circuit module.
  • the invention is not limited to any particular combination of hardware and software.

Abstract

Disclosed in the present invention are an information processing method and device, a storage medium, and an electronic device. The method comprises: obtaining topic data; preprocessing the topic data to obtain structured data; and inputting the structured data into a model file, to obtain popularity information of the topic data by calculation.

Description

信息处理方法及装置、存储介质、电子装置Information processing method and device, storage medium, and electronic device
相关申请的交叉引用Cross-reference to related applications
本申请基于申请号为201810644005.6、申请日为2018年06月21日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请作为参考。This application is based on a Chinese patent application with an application number of 201810644005.6 and an application date of June 21, 2018, and claims the priority of the Chinese patent application. The entire contents of the Chinese patent application are incorporated herein by reference.
技术领域Technical field
本发明涉及但不限于通信领域,尤其涉及一种信息处理方法及装置、存储介质、电子装置。The present invention relates to, but is not limited to, the field of communications, and in particular, to an information processing method and device, a storage medium, and an electronic device.
背景技术Background technique
每个人的工作与生活都与计算机、互联网息息相关,人们可以在互联网上获取各种各样的信息,甚至娱乐,消费,人与人之间的交流和沟通方式也已渗透到互联网中。以微博、微信朋友圈为代表的社会化媒体平台出现,更加使得基于网络的社交深入人心。在这个互联网时代,随时随地都会产生大量的话题数据,这些话题数据如浪潮一般,会随时间不停的产生新的内部巅峰值,在微博领域中即热点动态,在贴吧则可能是流行语,在音乐领域则构成流行音乐榜,更细节的说,在筛选出关于一件事的评论中,可能构成这件事的大众心理状态,获取流行信息的过程,称为流行度分析,针对个人用户分析流行度,可以作为推荐系统的一个维度起到举足轻重的作用,根据所有用户的数据进行总体分析流行度,则可以预判事务的发展趋势。Everyone's work and life are closely related to the computer and the Internet. People can obtain a variety of information on the Internet. Even entertainment, consumption, and communication methods between people have penetrated into the Internet. The emergence of social media platforms represented by Weibo and WeChat circle of friends has made web-based social interaction more popular. In this Internet age, a large amount of topic data is generated anytime, anywhere. These topic data are like waves, and they will generate new internal peaks over time. In the field of Weibo, they are hot trends, and they may be buzzwords in posts. In the field of music, it constitutes a pop music list. In more detail, in screening out comments about an event, it may constitute the public psychological state of the event. The process of obtaining popular information is called popularity analysis, which targets individuals. User analysis popularity can play a decisive role as a dimension of the recommendation system. According to the overall analysis of the popularity of all user data, you can predict the development trend of the transaction.
In the related art, popularity analysis of topic data is inefficient. For example, when gathering public opinion, the relevant departments often rely on electronic suggestion boxes, a governor's mailbox in a mobile app, and similar channels to collect people's requests and to learn about shortcomings and the changes people most urgently want. With this approach, each message is likely to reflect an event or a state of mind, but collecting public opinion through passive reception is very inefficient.
Summary of the Invention
In view of this, embodiments of the present invention are intended to provide an information processing method and device, a storage medium, and an electronic device.
An embodiment of the present invention provides an information processing method, including: obtaining topic data; preprocessing the topic data to obtain structured data; and inputting the structured data into a model file to calculate popularity information of the topic data.
An embodiment of the present invention further provides an information processing apparatus, including: an acquisition module configured to obtain topic data; a processing module configured to preprocess the topic data to obtain structured data; and a calculation module configured to input the structured data into a model file and calculate popularity information of the topic data.
An embodiment of the present invention further provides a storage medium storing a computer program, where the computer program is configured to execute, when run, the information processing method provided by the embodiments of the present invention.
An embodiment of the present invention further provides an electronic device including a memory and a processor, where the memory stores a computer program and the processor is configured to run the computer program to execute the information processing method provided by the embodiments of the present invention.
By applying the embodiments of the present invention, structured data is obtained by preprocessing the topic data, and the popularity information of the topic data is then calculated with the model file, which improves the efficiency of topic popularity analysis.
Brief Description of the Drawings
FIG. 1 is a flowchart of an information processing method according to an embodiment of the present invention;
FIG. 2 is a structural block diagram of an information processing apparatus according to an embodiment of the present invention;
FIG. 3 is a system structure diagram according to an embodiment of the present invention;
FIG. 4 is a system module diagram according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of determining the value of K with the elbow method in an embodiment of the present invention;
FIG. 6 is an example diagram of the initialization point determination process in an embodiment of the present invention;
FIG. 7 is an example diagram of the overall prediction flow in an embodiment of the present invention;
FIG. 8 is an example diagram of the processing flow before training in an example of the present invention;
FIG. 9 is an overall flowchart of Weibo data popularity analysis in an example of the present invention;
FIG. 10 is an example diagram of music data popularity analysis in an example of the present invention;
FIG. 11 is a schematic diagram of converting an audio signal into a feature vector in an example of the present invention;
FIG. 12 is an example diagram of commodity popularity analysis in an example of the present invention;
FIG. 13 is a flowchart of preprocessing and commodity popularity prediction in an example of the present invention;
FIG. 14 is an overall flowchart of news data popularity analysis in an example of the present invention.
Detailed Description
The present invention is described in detail below with reference to the accompanying drawings and in combination with the embodiments. It should be noted that, where no conflict arises, the embodiments of this application and the features in those embodiments may be combined with one another.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
FIG. 1 is a flowchart of the information processing method provided by an embodiment of the present invention. As shown in FIG. 1, the method includes:
Step S102: obtain topic data;
Step S104: preprocess the topic data to obtain structured data;
Step S106: input the structured data into a model file and calculate popularity information of the topic data.
Through the above steps, structured data is obtained by preprocessing the topic data, and the popularity information of the topic data is then calculated with the model file, which solves the technical problem in the related art that topic popularity analysis is inefficient.
In some embodiments, the above steps may be executed by a server, a terminal, or the like, but are not limited thereto.
In some embodiments, after the popularity information of the topic data is calculated, the method further includes displaying the popularity information of the topic data in a front-end interface. The topics may be listed in order of popularity, and the popularity information may be a score.
In some embodiments, before the structured data is input into the model file, the method further includes one of the following: training the model file; or presetting the model file. When the model file is preset, it has already been trained and can be used directly; it may also be retrained using feedback during use.
In some embodiments, training the model file includes:
S11: segment the sample text data into words and remove characters of specified types from the sample text data to obtain first data, where removing characters of specified types includes removing special symbols such as punctuation, digits, and spaces, and removing stop words;
S12: perform word embedding on the first data to obtain second data;
S13: sum the word vectors of the second data and take the average to obtain third data;
S14: perform Gaussian mixture model training on an original model with the third data, by category, to obtain the model file.
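Steps S11 to S13 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: a whitespace tokenizer on English text stands in for a Chinese word segmenter, and the `EMBEDDINGS` table, the `STOP_WORDS` set, and the function names are hypothetical placeholders for a pretrained embedding model.

```python
import re

# Toy 3-dimensional embedding table (hypothetical values); a real system
# would load pretrained word vectors instead.
EMBEDDINGS = {
    "concert": [0.9, 0.1, 0.0],
    "tickets": [0.7, 0.3, 0.0],
    "sold": [0.1, 0.8, 0.1],
    "out": [0.1, 0.6, 0.3],
}
STOP_WORDS = {"the", "are", "is", "a"}


def tokenize(text):
    """S11: lowercase, drop digits and punctuation, split, remove stop words."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return [word for word in cleaned.split() if word not in STOP_WORDS]


def sentence_vector(text, dim=3):
    """S12 and S13: look up each token's vector and average them.

    Tokens missing from the table are skipped; if no token is known,
    a zero vector is returned.
    """
    vectors = [EMBEDDINGS[w] for w in tokenize(text) if w in EMBEDDINGS]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


vector = sentence_vector("The concert tickets are sold out!! 2019")
```

Averaging gives every text a fixed-length feature vector regardless of its word count, which is what allows the vectors to be fed to a Gaussian mixture model in S14.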
In some embodiments, inputting the structured data into the model file and calculating the popularity information of the topic data includes:
S21: segment the structured data into words and remove characters of specified types from the structured data to obtain first structured data, where removing characters of specified types includes removing special symbols such as punctuation, digits, and spaces, and removing stop words;
S22: perform word embedding on the first structured data to obtain second structured data;
S23: sum the word vectors of the second structured data and take the average to obtain third structured data;
S24: input the third structured data into the model file to obtain the category and the category probability of each data item;
S25: calculate the popularity information of the topic data from the category probability.
In some embodiments, preprocessing the topic data to obtain structured data includes:
splitting the topic data by data type, and cleaning out pictures, voice, emoticons, and similar data contained in it;
deleting data of specific types contained in the topic data to obtain candidate data, where the specific types include at least one of the following: pictures, voice, emoticons;
organizing the candidate data into structured data.
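The preprocessing described above might look like the sketch below. The source does not say how pictures, voice clips, or emoticons appear in the raw data, so the three patterns (media URLs, bracketed emoticon codes, Unicode emoji) and the record fields `topic`, `text`, and `comments` are assumptions chosen only for illustration.

```python
import re

# Hypothetical markers for non-text content in a crawled post.
MEDIA_URL = re.compile(r"https?://\S+\.(?:jpg|jpeg|png|gif|mp3|amr|wav)\b", re.I)
BRACKET_TOKEN = re.compile(r"\[[^\[\]]{1,8}\]")  # emoticon codes such as "[smile]"
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def clean_text(text):
    """Remove media links, emoticon codes, and emoji, then collapse whitespace."""
    for pattern in (MEDIA_URL, BRACKET_TOKEN, EMOJI):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def to_structured(raw):
    """Turn a raw crawled post (a dict) into a record with fixed fields.

    Comments that become empty after cleaning are dropped.
    """
    cleaned_comments = (clean_text(c) for c in raw.get("comments", []))
    return {
        "topic": raw.get("topic", ""),
        "text": clean_text(raw.get("text", "")),
        "comments": [c for c in cleaned_comments if c],
    }


record = to_structured({
    "topic": "concert",
    "text": "Loved it [heart]",
    "comments": ["[laugh]", "me too"],
})
```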
In some embodiments, obtaining the topic data includes crawling the topic data from the Internet, where the topic data includes at least one of the following: topic content and comment information. Topic data may be obtained from WeChat Moments, Weibo, Tieba, websites, application software, and the like.
From the description of the above implementations, those skilled in the art can clearly understand that the method according to the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, and may of course also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
This embodiment further provides an information processing apparatus configured to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may refer to a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
FIG. 2 is a structural block diagram of an information processing apparatus according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes:
an acquisition module 22, configured to obtain topic data;
a processing module 24, configured to preprocess the topic data to obtain structured data;
a calculation module 26, configured to input the structured data into a model file and calculate popularity information of the topic data.
In some embodiments, the calculation module includes: a first processing unit, configured to segment the structured data into words and remove characters of specified types from the structured data to obtain first structured data; a second processing unit, configured to perform word embedding on the first structured data to obtain second structured data; a first calculation unit, configured to sum the word vectors of the second structured data and take the average to obtain third structured data; a second calculation unit, configured to input the third structured data into the model file and calculate the category and the category probability of each data item; and a third calculation unit, configured to calculate the popularity information of the topic data from the category probability.
It should be noted that each of the above modules may be implemented by software or hardware. In the latter case, this may be done in the following ways, without being limited to them: the modules are all located in the same processor; or the modules are distributed, in any combination, across different processors.
The embodiments of the present invention are described in detail below with reference to specific scenarios.
An embodiment of the present invention provides a popularity analysis system and method based on a Gaussian mixture model. It proposes a system that covers crawling corpus topic information, preprocessing the corpus, Gaussian mixture modeling, popularity analysis and prediction, and outputting popularity scores. The system uses a popularity analysis method based on the Gaussian mixture model (GMM). This document focuses on a GMM-based method for analyzing and predicting the popularity of public opinion topics, and on that basis extends to multiple fields to build a popularity analysis system based on Gaussian mixture clustering.
The embodiment discusses a popularity analysis method based on Gaussian mixture clustering (Mixture of Gaussians, MoG) and a GMM-based method for analyzing and predicting the popularity of public opinion topics, and on that basis extends to multiple fields to build a popularity analysis system based on Gaussian mixture clustering.
The structure of the "popularity analysis system for public opinion content based on a Gaussian mixture model" provided by the embodiment is shown in FIG. 3. The system structure diagram in FIG. 3 describes the processing flow of the system: the distributed data capture module crawls public opinion content data, the raw data is filtered by the preprocessing flow, and the preprocessed data is stored in a distributed file system. A popularity analysis task is started on a schedule. It loads the model file obtained by training the popularity analysis algorithm provided by the embodiment, takes the sample data as input, computes and ranks a popularity score for each sample text, and displays the result in a portal interface.
FIG. 4 is a system module diagram according to an embodiment of the present invention. Referring to FIG. 4, the system provided by the embodiment includes:
Distributed data capture module: responsible for crawling public opinion topic data from the Internet. Weibo topic data includes Weibo topics, the text posts under each topic, and the comments under each post. The most important item to capture is the content of the posts themselves.
Data preprocessing module: responsible for preprocessing the crawled raw data, cleaning out pictures, voice, emoticons, and similar data, and organizing the unstructured data into structured data.
The storage format of the structured data is shown in Table 1 below, which describes the structured data fields.
Table 1
Figure PCTCN2019088435-appb-000001
Figure PCTCN2019088435-appb-000002
Distributed file system module: responsible for storing data.
Algorithm training and analysis module: responsible for building a Gaussian mixture analysis model, training it on the training data (because this is a clustering operation, the training data does not need to be labeled), and saving the model for use in prediction and analysis.
Prediction scoring module: computes a score from the category of the test sample and its probability of belonging to that category, both predicted by the Gaussian mixture model, together with the total number of samples in that category and the value of K.
The algorithm of the training and analysis module is built as follows:
1) Segment all text data into words, remove special symbols such as punctuation, digits, and spaces, and remove stop words.
2) Set the dimension and perform word embedding on the words obtained in 1).
3) Sum the word vectors obtained in 2), take the average, and save the result.
4) Determine the number of categories, that is, the number of Gaussian distributions.
5) For each Gaussian distribution, initialize the mean with the k-means algorithm and initialize the variance randomly.
6) For each sample, compute its probability under each Gaussian distribution.
7) For each Gaussian distribution, the contribution of each sample to that distribution can be expressed by the sample's probability under it: a high probability means a large contribution, and a low probability a small one. Using these contributions as weights, compute the weighted mean and variance, and use them to replace the distribution's previous mean and variance.
8) Repeat 6) and 7) until the mean and variance of every Gaussian distribution converge.
Step 1) can be handled by an open-source word segmenter such as ansj or HanLP; HanLP is used here.
For step 2), the vectors are generated directly from an already trained word2vec or GloVe model.
In step 4), the number of categories, that is, the value of K, is determined with the elbow method. Some notation for the clustering algorithm is given first:
The m input samples of the clustering algorithm: $x_1, x_2, \ldots, x_m$
The cluster center to which $x_i$ belongs: $\mu_{c_i}$
During clustering, the algorithm assigns each sample to the cluster center nearest to it. The optimization objective of the clustering algorithm is therefore:
$$J(c_1,\ldots,c_m,\mu_1,\ldots,\mu_K)=\sum_{i=1}^{m}\lVert x_i-\mu_{c_i}\rVert^{2}$$
where $c_i$ is the index of the cluster center closest to $x_i$, and $\mu_k$ is the k-th cluster center.
The value of the objective J is the sum of the squared distances from each sample to its cluster center, so J measures the clustering error to some extent: the smaller J is, the smaller the clustering error. Different values of K yield different values of J.
The elbow method holds that K should be chosen at the inflection point (the "elbow") of the curve of J against K. As shown in FIG. 5, which illustrates how the elbow method determines K in this embodiment, K = 3 or K = 6 is a suitable choice.
Step 5) uses the K-Means algorithm to find the initialization points. Because K-Means is used here only to initialize Gaussian mixture clustering, which improves the accuracy and convergence speed of MoG, its details are not discussed further. Taking two-dimensional data as an example, FIG. 6 shows an example of how K-Means finds the initialization points for Gaussian mixture clustering.
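The elbow selection and the K-Means initialization can be illustrated together on one-dimensional data. The sketch below is an assumption-laden toy, not the embodiment's code: it uses Lloyd's algorithm with a deterministic initialization at evenly spaced order statistics (random initialization is equally common), and the function names and data are hypothetical.

```python
def kmeans_1d(points, k, iters=50):
    """Lloyd's algorithm on 1-D data; returns (centers, cost).

    The cost is the sum of squared distances from each point to its
    nearest center, i.e. the clustering objective J.
    """
    pts = sorted(points)
    # Deterministic start: k evenly spaced order statistics (an assumption).
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in pts:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # An empty cluster keeps its previous center.
        updated = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    cost = sum(min((x - c) ** 2 for c in centers) for x in pts)
    return centers, cost


def elbow_costs(points, max_k):
    """Objective J for K = 1 .. max_k; the elbow is where the drop flattens."""
    return {k: kmeans_1d(points, k)[1] for k in range(1, max_k + 1)}


data = [0.0, 0.5, 1.0, 10.0, 10.5, 11.0, 20.0, 20.5, 21.0]
costs = elbow_costs(data, 5)
initial_means, _ = kmeans_1d(data, 3)  # would seed the Gaussian means
```

On this data the cost falls sharply up to K = 3 and changes very little afterwards, so K = 3 is the elbow, and the three centers found for K = 3 are natural starting means for the Gaussian components.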
Steps 6), 7), and 8) involve Gaussian mixture clustering and the EM (Expectation-Maximization) algorithm. Each key step is described below.
Let X be a random variable. The Gaussian mixture model can then be expressed as:
$$p(x)=\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x\mid\mu_k,\Sigma_k)\tag{1}$$
where $\mathcal{N}(x\mid\mu_k,\Sigma_k)$ is called the k-th component of the mixture model. In the example of FIG. 6 there are four clusters, which can be represented by four two-dimensional Gaussian distributions, so the number of components is K = 4. $\pi_k$ is the mixture coefficient and satisfies:
$$\sum_{k=1}^{K}\pi_k=1$$
$$0\le\pi_k\le 1$$
It can be seen that $\pi_k$ is the weight of each component $\mathcal{N}(x\mid\mu_k,\Sigma_k)$.
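As a small numerical check of the mixture density, the sketch below evaluates a one-dimensional Gaussian mixture and integrates it with a Riemann sum. The weights, means, and variances are made-up values and the function names are hypothetical; the one-dimensional case is used only to keep the example free of matrix code.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, means, variances):
    """p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    return sum(w * normal_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


weights = [0.5, 0.3, 0.2]      # mixture coefficients, each in [0, 1], summing to 1
means = [0.0, 5.0, 10.0]
variances = [1.0, 0.5, 2.0]

# Riemann sum over [-10, 20]; the tails outside this interval are negligible.
step = 0.01
area = sum(mixture_pdf(-10.0 + i * step, weights, means, variances) * step
           for i in range(3001))
```

Because the coefficients sum to one and each component integrates to one, the mixture itself is a valid density, which is what the computed area reflects.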
Introduce a new K-dimensional random variable z whose components $z_k$ ($1\le k\le K$) can only take the values 0 or 1. $z_k=1$ means that the k-th class is selected, with $p(z_k=1)=\pi_k$; $z_k=0$ means that the k-th class is not selected. More formally, $z_k$ satisfies the following two conditions:
$$z_k\in\{0,1\}$$
$$\sum_{k=1}^{K}z_k=1$$
In the example of FIG. 6 there are four classes, so z has dimension 4. If a point is drawn from the first class, then z = (1, 0, 0, 0); if a point is drawn from the second class, then z = (0, 1, 0, 0). The probability that $z_k=1$ is $\pi_k$, so the joint distribution of z can be written as:
$$p(z)=\prod_{k=1}^{K}\pi_k^{z_k}\tag{2}$$
Because each $z_k$ can only take the value 0 or 1, and exactly one $z_k$ in z equals 1 while all the others equal 0, the above equation holds.
The data in FIG. 6 can be divided into four classes, and the data in each class is assumed to follow a Gaussian distribution. This statement can be expressed as a conditional probability:
$$p(x\mid z_k=1)=\mathcal{N}(x\mid\mu_k,\Sigma_k)$$
that is, the data in the k-th class follows a Gaussian distribution. The above can in turn be written in the following form:
$$p(x\mid z)=\prod_{k=1}^{K}\mathcal{N}(x\mid\mu_k,\Sigma_k)^{z_k}\tag{3}$$
Equations (2) and (3) give the forms of p(z) and p(x|z) respectively. Using the product rule and the sum rule of probability, the form of p(x) follows:
$$p(x)=\sum_{z}p(z)\,p(x\mid z)=\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x\mid\mu_k,\Sigma_k)\tag{4}$$
It can be seen that equation (1) of the GMM and equation (4) have the same form, and that equation (4) introduces a new variable z, usually called a latent variable. For the data in FIG. 6, "latent" means the following: we know that the data can be divided into four classes, but when a data point is drawn at random we do not know which class it belongs to. Its membership cannot be observed, so a latent variable z is introduced to describe it.
Note that from a Bayesian viewpoint, p(z) is the prior probability and p(x|z) is the likelihood, so it is natural to compute the posterior probability p(z|x):
$$\gamma(z_k)\equiv p(z_k=1\mid x)=\frac{p(z_k=1)\,p(x\mid z_k=1)}{p(x)}=\frac{\pi_k\,\mathcal{N}(x\mid\mu_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x\mid\mu_j,\Sigma_j)}\tag{5}$$
In the above equation, the symbol $\gamma(z_k)$ denotes the posterior probability of the k-th component. From the Bayesian viewpoint, $\pi_k$ can be regarded as the prior probability that $z_k=1$.
The above rewrites the form of the GMM and introduces the latent variable z and the posterior probability $\gamma(z_k)$ given x, so that the EM algorithm can be used conveniently to estimate the parameters of the GMM.
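The posterior probability of each component can be computed directly from Bayes' rule. The following sketch does so for a one-dimensional, two-component mixture; the parameter values and function names are illustrative assumptions, and a production implementation would normally work with log-densities to avoid underflow for points far from every component.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """gamma(z_k) = pi_k N(x | mu_k, var_k) / sum_j pi_j N(x | mu_j, var_j)."""
    joint = [w * normal_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    evidence = sum(joint)  # this is p(x), the mixture density at x
    return [j / evidence for j in joint]


# Two equally weighted unit-variance components centred at 0 and 4.
gamma_near_first = responsibilities(1.0, [0.5, 0.5], [0.0, 4.0], [1.0, 1.0])
gamma_midpoint = responsibilities(2.0, [0.5, 0.5], [0.0, 4.0], [1.0, 1.0])
```

A point close to the first mean is attributed almost entirely to the first component, while a point exactly halfway between two otherwise identical components is shared equally between them.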
Next, the EM algorithm is used to compute the parameters. The EM algorithm has two steps: the first step obtains rough values of the parameters to be estimated, and the second step uses the values from the first step to maximize the likelihood function. The likelihood function of the GMM must therefore be derived first.
Let $X=\{x_1,x_2,\ldots,x_N\}$. For FIG. 6, X is the set of all points in the figure (each point has two coordinates in the two-dimensional plane and is a two-dimensional vector). The probability model of the GMM is given by equation (1). Three parameters of the GMM need to be estimated: π, μ, and Σ. Writing equation (1) as a product over all points gives:
$$p(X\mid\pi,\mu,\Sigma)=\prod_{n=1}^{N}\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x_n\mid\mu_k,\Sigma_k)\tag{6}$$
To estimate these three parameters, the maximum likelihood estimate of each must be derived. Starting with $\mu_k$, taking the logarithm of both sides of equation (6) gives the log-likelihood function:
$$\ln p(X\mid\pi,\mu,\Sigma)=\sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x_n\mid\mu_k,\Sigma_k)\right\}\tag{7}$$
Differentiating with respect to $\mu_k$ and setting the derivative to 0 gives:
$$0=\sum_{n=1}^{N}\frac{\pi_k\,\mathcal{N}(x_n\mid\mu_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_n\mid\mu_j,\Sigma_j)}\,\Sigma_k^{-1}(x_n-\mu_k)\tag{8}$$
Note that the fraction in the above equation has exactly the form of the posterior probability in equation (5). Multiplying both sides by $\Sigma_k$ (assumed to be nonsingular) and rearranging gives:
$$\mu_k=\frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,x_n\tag{9}$$
where:
$$N_k=\sum_{n=1}^{N}\gamma(z_{nk})\tag{10}$$
In equations (9) and (10), N is the number of points and $\gamma(z_{nk})$ is the posterior probability that point $x_n$ belongs to cluster k. $N_k$ can therefore be interpreted as the number of points belonging to the k-th cluster, and $\mu_k$ is a weighted average of all points, in which the weight of point $x_n$ is $\gamma(z_{nk})/N_k$, which depends on the k-th cluster.
Similarly, the maximum likelihood estimate of $\Sigma_k$ is:
$$\Sigma_k=\frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,(x_n-\mu_k)(x_n-\mu_k)^{T}\tag{11}$$
Finally, consider the maximum likelihood estimate of $\pi_k$. Note that $\pi_k$ is subject to the constraint $\sum_{k=1}^{K}\pi_k=1$, so by the method of Lagrange multipliers a multiplier λ is introduced:
$$\ln p(X\mid\pi,\mu,\Sigma)+\lambda\left(\sum_{k=1}^{K}\pi_k-1\right)$$
Setting the derivative of the above expression with respect to $\pi_k$ to zero gives:
$$0=\sum_{n=1}^{N}\frac{\mathcal{N}(x_n\mid\mu_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_n\mid\mu_j,\Sigma_j)}+\lambda$$
Multiplying both sides by $\pi_k$ gives $0=N_k+\lambda\pi_k$, and summing over k gives λ = −N, which yields a more concise expression for $\pi_k$:
$$\pi_k=\frac{N_k}{N}\tag{12}$$
At this point, equations (5), (7), (9), (10), (11), and (12) can be used with the EM algorithm to compute the model parameters.
EM algorithm procedure:
1) Define the number of components K (K = 4 in this example), set initial values of $\pi_k$, $\mu_k$, and $\Sigma_k$ for each component k, and then compute the log-likelihood function (7) of equation (6).
2) E-step: compute the posterior probability $\gamma(z_{nk})$ from the current $\pi_k$, $\mu_k$, and $\Sigma_k$:
$$\gamma(z_{nk})=\frac{\pi_k\,\mathcal{N}(x_n\mid\mu_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_n\mid\mu_j,\Sigma_j)}$$
3) M-step: compute new values of $\pi_k$, $\mu_k$, and $\Sigma_k$ from the $\gamma(z_{nk})$ computed in the E-step:
$$\mu_k^{new}=\frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,x_n$$
$$\Sigma_k^{new}=\frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,(x_n-\mu_k^{new})(x_n-\mu_k^{new})^{T}$$
$$\pi_k^{new}=\frac{N_k}{N}$$
where:
$$N_k=\sum_{n=1}^{N}\gamma(z_{nk})$$
4) Compute the log-likelihood function of equation (6):
$$\ln p(X\mid\pi,\mu,\Sigma)=\sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x_n\mid\mu_k,\Sigma_k)\right\}$$
5) Check whether the parameters or the log-likelihood function have converged. If not, return to step 2).
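The EM procedure can be sketched end to end for one-dimensional data. This is a simplified illustration rather than the embodiment's implementation: variances are scalars instead of covariance matrices, the initial parameters are passed in by hand instead of coming from K-Means, and the variance floor of 1e-6, the tolerance, and all names are assumptions.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Sum over points of the log of the mixture density."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_gmm_1d(data, means, variances, weights, max_iter=200, tol=1e-9):
    """Fit a 1-D Gaussian mixture with EM, starting from the given parameters."""
    n, k = len(data), len(means)
    previous = current = log_likelihood(data, weights, means, variances)
    for _ in range(max_iter):
        # E-step: posterior probability of each component for each point.
        gamma = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            gamma.append([p / total for p in joint])
        # M-step: re-estimate means, variances, and weights.
        nk = [sum(g[j] for g in gamma) for j in range(k)]
        means = [sum(g[j] * x for g, x in zip(gamma, data)) / nk[j]
                 for j in range(k)]
        variances = [
            # Floor the variance so a component cannot collapse onto one point.
            max(sum(g[j] * (x - means[j]) ** 2
                    for g, x in zip(gamma, data)) / nk[j], 1e-6)
            for j in range(k)
        ]
        weights = [nk[j] / n for j in range(k)]
        # Convergence check on the log-likelihood.
        current = log_likelihood(data, weights, means, variances)
        if abs(current - previous) < tol:
            break
        previous = current
    return weights, means, variances, current


data = [0.0, 0.5, 1.0, 1.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5]
weights, means, variances, final_ll = em_gmm_1d(
    data, means=[1.0, 9.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```

With two well-separated groups of four and six points, the fitted weights approach 0.4 and 0.6 and the fitted means approach the group averages, since each point is attributed almost entirely to one component.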
For ease of understanding and visualization, the examples so far are based on the two-dimensional data shown in FIG. 6. In practice, after the text has been segmented into words, embedded, and averaged over its word vectors (steps 1), 2), and 3) above), the input training data has far more than two dimensions, but the principle of the algorithm is exactly the same. The only parameter the training module needs to determine is the value of K; no other parameters need to be set, and K can be determined with the elbow method. One characteristic of this system is therefore that training can start directly once the training data has been obtained. Table 2 describes the training corpus input.
表2Table 2
Figure PCTCN2019088435-appb-000031
Figure PCTCN2019088435-appb-000031
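The text names the elbow method for choosing K but does not spell it out. One common heuristic, shown here as a sketch under that assumption, fits the model for K = 1, 2, 3, ... and picks the K at which the cost curve bends most sharply, measured by the second difference. The function name and the choice of cost (for example the negative log-likelihood) are illustrative.

```python
def elbow_k(costs):
    """Return the K at the sharpest bend of a decreasing cost curve.

    costs[i] is the cost obtained with K = i + 1 components. The bend at K is
    measured as (drop from K-1 to K) minus (drop from K to K+1).
    """
    if len(costs) < 3:
        raise ValueError("need costs for at least three values of K")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(costs) - 1):
        bend = (costs[i - 1] - costs[i]) - (costs[i] - costs[i + 1])
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k
```

The first and last K cannot be selected because the bend needs a neighbour on both sides, so the candidate range should extend beyond the K values considered plausible.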
Gaussian mixture clustering does not require splitting the data into training, validation and test sets. Training yields the parameters directly, and the parameter set is then saved.
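The format of the saved parameter set is not specified. As a sketch, assuming a JSON file and a 1-D parameter layout, saving and reloading could look like this. The key names are hypothetical, and storing the per-cluster training counts alongside the parameters is an added assumption, made because the popularity score described later needs them.

```python
import json


def save_model(path, weights, means, variances, cluster_sizes):
    """Write the fitted mixture parameters to a JSON model file."""
    model = {
        "weights": list(weights),              # mixing coefficients pi_k
        "means": list(means),                  # component means mu_k
        "variances": list(variances),          # component variances
        "cluster_sizes": list(cluster_sizes),  # training samples per cluster
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(model, fh)


def load_model(path):
    """Read the parameters back as a plain dictionary."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```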
A new corpus, in the same format as the training corpus, is then fed in for prediction, which yields a class label and a class probability. Table 3 shows the format of the prediction input corpus.
Table 3
Figure PCTCN2019088435-appb-000032
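Prediction with a trained mixture amounts to evaluating the posterior probability of each component for the new sample and reporting the largest one. The following is a minimal 1-D sketch of that step, with illustrative names and made-up parameters; the system described here works on high-dimensional text vectors.

```python
import math


def predict(x, weights, means, variances):
    """Return (k, proba): the most likely component for x and its posterior."""
    joint = [
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    ]
    total = sum(joint)
    posterior = [j / total for j in joint]
    # Index of the largest posterior; ties resolve to the lowest index.
    k = max(range(len(posterior)), key=posterior.__getitem__)
    return k, posterior[k]
```

For example, with two equally weighted components centred at 0 and 10, a sample at 0.1 is assigned to the first component with a probability close to 1.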
Scoring popularity is also part of this embodiment; scores make the data easy to sort and compare. To keep the scores accurate and consistent, this patent builds on the Gaussian clustering result and reflects both the number of clusters and the number of samples per cluster in the popularity score, as follows:
To predict the popularity of a piece of text, the text is first converted into a feature vector by steps 1), 2) and 3) above and then fed into the trained Gaussian mixture model. Let this feature vector be x. Predicting on x with a Gaussian mixture model yields two values: the cluster k to which x belongs, and the probability proba(x) that x belongs to that cluster. The score is computed from these two values. Assuming the test sample x is assigned to the i-th cluster, the score is:
Figure PCTCN2019088435-appb-000033
where amount(k=i) is the number of training samples assigned to the i-th cluster, amount(X) is the total number of training samples, and proba(x) is the probability that sample x belongs to the i-th cluster, as predicted by the Gaussian mixture model.
By the nature of the Gaussian mixture model, a cluster k = i that contains a large share of the samples is treated as more popular, so the sample ratio
Figure PCTCN2019088435-appb-000034
roughly locates the test sample's score within the model. This ratio is multiplied by proba(x), and the result is then balanced by the number of clusters K. The balancing is needed because, when K is large, the value of
Figure PCTCN2019088435-appb-000035
becomes small, so the scores given by different models would differ too much to be compared with one another. In this way each sample receives a popularity score, and new data can easily be added to the training data to refine the model.
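The scoring formula is carried by a figure that is not reproduced in this text. One reading consistent with the description is the cluster's share of the training samples, multiplied by proba(x), multiplied by K. The multiplication by K is an assumption about how the score is "balanced"; the sketch below keeps it optional for that reason, and the function name is illustrative.

```python
def popularity_score(cluster_sizes, i, proba, scale_by_k=True):
    """Popularity score of a sample assigned to cluster i with probability proba.

    cluster_sizes[j] is the number of training samples in cluster j, so
    cluster_sizes[i] / sum(cluster_sizes) plays the role of
    amount(k=i) / amount(X) in the text.
    """
    total = sum(cluster_sizes)
    if total == 0:
        raise ValueError("no training samples")
    share = cluster_sizes[i] / total
    score = share * proba
    if scale_by_k:
        # Assumed balancing step: with K equally sized clusters the share is
        # 1/K, so multiplying by K brings the score back to proba.
        score *= len(cluster_sizes)
    return score
```

Under this reading, a sample that belongs with certainty to one of K equally sized clusters scores 1.0 regardless of K, which is what makes scores from models with different K comparable.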
For front-end display, the output can be adapted as needed: show the popularity score of a given complaint message, list content in order of score, or pass the score directly as a parameter to a recommendation algorithm.
As shown in Figure 3, this embodiment implements a topic-oriented sentiment analysis system comprising a distributed data crawling module, a data preprocessing module, a distributed storage module, an algorithm analysis module, a prediction scoring module and an optional front-end display module. The algorithm analysis module contains a dedicated parameter training submodule and a model loading submodule.
The method provided by this embodiment mainly comprises the following steps:
Step 1: The distributed data crawling module crawls Internet data, such as public-opinion topics and their content, or WeChat official accounts and their replies;
Step 2: The data preprocessing module cleans and normalizes the received data. As required by the technical solution, the structured data format is as shown in Table 1.
Step 3: Store the structured data in a distributed storage system, such as a distributed file system (Hadoop HDFS) or MongoDB;
Step 4: The algorithm training module periodically loads the training data, trains the algorithm parameters and produces an algorithm model file. Specifically:
Step 4-1: Segment all text data into words, remove special characters such as punctuation, digits and spaces, and remove stop words;
Step 4-2: Apply word embedding to the words obtained in step 4-1;
Step 4-3: Sum the word vectors obtained in step 4-2, take their average and save it;
Step 4-4: Train the Gaussian mixture model;
Step 4-5: Save the optimal parameters obtained from training as the algorithm model file;
Step 5: The system loads the algorithm model file and runs it on the data to be predicted, which is stored in the distributed file system, to obtain a popularity score for each item;
Step 5-1: Segment the text data to be predicted into words, remove special characters such as punctuation, digits and spaces, and remove stop words;
Step 5-2: Apply word embedding to the words obtained in step 5-1;
Step 5-3: Sum the word vectors obtained in step 5-2, take their average and save it;
Step 5-4: Start the platform and load the algorithm model file trained in step 4;
Step 5-5: Run the model on the data to be predicted to obtain the cluster assignment and class probability of each item;
Step 5-6: From the output of step 5-5, compute the popularity score of each item;
Step 6: Display the results of step 5 on the front-end interface.
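Steps 4-1 to 4-3 (and the identical steps 5-1 to 5-3) turn a piece of text into a single feature vector. The sketch below illustrates that flow on English text with a toy embedding table. The stop-word list, the regular-expression tokenizer and the embedding values are all placeholders; a real deployment would use a Chinese word segmenter and trained word vectors.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "and"}


def tokenize(text):
    """Split text into lowercase words, dropping punctuation, digits,
    whitespace and stop words."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def text_to_vector(text, embeddings, dim):
    """Average the embedding vectors of the known words in the text.

    embeddings maps a word to a list of `dim` floats. Words missing from the
    table are skipped; text with no known word maps to the zero vector.
    """
    vectors = [embeddings[w] for w in tokenize(text) if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

For instance, with a two-dimensional table that maps "network" to [1.0, 0.0] and "slow" to [0.0, 1.0], the text "The network is slow!!! 123" becomes [0.5, 0.5].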
This embodiment also covers the following implementation scenarios:
Implementation scenario 1
Weibo, WeChat Moments and the websites people use every day generate a wealth of Internet data. Analyzing and tracking public concerns and public-opinion trends in real time is therefore essential. For popularity trend analysis of Weibo posts and WeChat official accounts, the Gaussian-mixture-model-based popularity analysis system of this example can accurately compute the popularity of a new Weibo post when it appears, to support the subsequent recommendation decision.
Step 1: Use a distributed crawler to crawl Weibo posts and their comments, and WeChat Moments posts and their comments;
Step 2: Clean the data in preprocessing and store it in structured form in HDFS, in the format shown in Table 4.
Table 4
Figure PCTCN2019088435-appb-000036
Step 3: Segment the data into words, apply word embedding, then sum and average to obtain a multi-dimensional vector. An example of this flow is shown in Figure 8, which illustrates the pre-training processing flow in this example of the invention.
Step 4: Feed the training data into the algorithm platform to train the algorithm model; sklearn is used as the training platform.
Step 5: Input the corpus to be predicted (the overall prediction flow is shown in Figure 7, an example of the overall prediction flow in this embodiment of the invention), call the algorithm model obtained in step 4 to score popularity, and push and present the analysis results. The overall flow is shown in Figure 9, the overall flowchart of Weibo data popularity analysis in this example of the invention.
Implementation scenario 2
Music players are very common software, with a wide variety of clients on every platform. Music recommendation is very common in such software, for example charts ranked by popularity or recommendations based on a user's listening habits, and this system can address problems of this kind. Take the overall music chart as an example. As the example in Figure 10 shows, popular recommended content is directly tied to popularity. Figure 10 is an example of music data popularity analysis in this example of the invention.
Step 1: Collect music files; no labeling is needed.
Step 2: Load the music files into the system;
Step 3: The data preprocessing module preprocesses the data and converts it into feature vectors. One possible approach is outlined here without detailed analysis; see Figure 11, a schematic of converting an audio signal into a feature vector in this example of the invention: the sound signal is input and a Mel-frequency cepstral coefficient (MFCC) parameter vector is output.
Step 4: Start algorithm training to build the prediction model, constructing the Gaussian mixture model according to the technical solution of the present invention;
Step 5: Analyze and predict the music. The overall flow is similar to Figure 9, differing only slightly in data acquisition and preprocessing.
Implementation scenario 3
Online shopping platforms need to analyze their existing products to better understand market conditions and changes, and to estimate the popularity of newly listed products with reasonable accuracy. In this setting, product popularity analysis becomes extremely important. With the popularity analysis system of this example, popularity can be analyzed for a specific product type, and newly listed products can be given an estimated popularity score. An example is shown in Figure 12, an example of product popularity analysis in this example of the invention.
Step 1: Collect basic information about a given type of product. Name, brand, click count (or sales volume) and clicking user (or purchasing user) are required fields.
Step 2: Treat each click (or purchase) by each user as one training sample to form the sample set (repeated clicks or purchases by the same user are not counted), and store it in a distributed persistent store such as HDFS.
Step 3: Preprocess the samples and convert them into feature vectors, as indicated by the dashed arrows in Figure 13, which is a flowchart of preprocessing and product popularity prediction in this example of the invention.
Step 3-1: Use the word2vec or GloVe algorithm for word embedding to convert text words into word vectors; attributes that are already numeric parameters are left unchanged;
Step 3-2: Encode the brand with one-hot encoding;
Step 3-3: Combine the results of the previous two steps and the remaining numeric parameters into a single vector;
Step 4: Run the popularity analysis system of the present invention for training to obtain the GMM model.
Step 5: Predict the popularity of the products to be predicted; the flow is shown in Figure 13.
Step 5-1: Use the word2vec algorithm to convert text words into word vectors; attributes that are already numeric parameters are left unchanged;
Step 5-2: Encode the brand with one-hot encoding;
Step 5-3: Combine the results of the previous two steps and the remaining numeric parameters into a single vector;
Step 5-4: Use the prediction method of the present invention to obtain the class i and the class probability proba;
Step 5-5: Use the calculation method of the present invention to compute the popularity score;
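The sample construction in step 2 and the feature assembly in steps 3-1 to 3-3 can be sketched as follows. The sketch assumes that the product-name vector has already been produced by a word-embedding step and that a click is identified by a (user, product) pair. The function names, field layout and brand vocabulary are illustrative and not taken from the patent.

```python
def unique_samples(clicks):
    """Keep one sample per (user, product) pair, preserving first-seen order,
    so that repeated clicks or purchases by the same user are not counted."""
    seen = set()
    samples = []
    for user, product in clicks:
        if (user, product) not in seen:
            seen.add((user, product))
            samples.append((user, product))
    return samples


def one_hot(value, vocabulary):
    """One-hot encode value against an ordered vocabulary of known values."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def product_vector(name_vector, brand, brands, numeric_features):
    """Concatenate the name embedding, the one-hot brand and the numeric
    attributes into a single feature vector."""
    return (
        list(name_vector)
        + one_hot(brand, brands)
        + [float(v) for v in numeric_features]
    )
```

A brand that is missing from the vocabulary encodes as all zeros here. Numeric attributes such as click counts are on a very different scale from the embedding values, so rescaling them before clustering may be advisable; the text does not address this.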
Implementation scenario 4
Reading news on the Internet is nothing new today, and websites publish news continuously. To analyze and track public concerns and public-opinion trends in real time, the Gaussian-mixture-model-based popularity analysis system of this example can assign a popularity score to every news item; when a new item appears, its popularity can be computed accurately to support the next decision.
Step 1: Use a distributed crawler to crawl news content;
Step 2: Clean the data in preprocessing and store it in structured form in HDFS, in the format shown in Table 5:
Table 5
Figure PCTCN2019088435-appb-000037
Step 3: Segment the data into words, apply word embedding, then sum and average to obtain a multi-dimensional vector; an example of the flow is similar to that shown in Figure 7;
Step 4: Feed the training data into the algorithm platform to train the algorithm model; sklearn is used as the training platform.
Step 5: Input the corpus to be predicted, call the algorithm model obtained in step 4 to score popularity, and push and present the analysis results. The overall flow is shown in Figure 14, the overall flowchart of news data popularity analysis in this example of the invention.
Embodiments of the present invention provide an implementation of a popularity analysis system and method based on a Gaussian mixture model. The proposed system covers crawling corpus topic information, preprocessing the corpus, Gaussian mixture modeling, popularity analysis and prediction, and output of the popularity scores, and is built on a Gaussian-mixture-model popularity analysis method. The embodiments discuss a method for analyzing and predicting the popularity of public-opinion topics based on a Gaussian mixture model and, on that basis, extend it to multiple domains to build a popularity analysis system based on Gaussian mixture clustering.
An embodiment of the present invention further provides a storage medium storing a computer program, wherein the computer program is configured to execute, when run, the information processing method provided by the embodiments of the present invention.
In some embodiments, the storage medium may be configured to store a computer program for performing the following steps:
S1: acquire topic data;
S2: preprocess the topic data to obtain structured data;
S3: input the structured data into a model file and compute the popularity information of the topic data.
In some embodiments, the storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store a computer program.
An embodiment of the present invention further provides an electronic device comprising a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to execute the above information processing method.
In some embodiments, the electronic device may further include a transmission device and an input/output device, each connected to the processor.
In some embodiments, the processor may be configured to perform the following steps by means of the computer program:
S1: acquire topic data;
S2: preprocess the topic data to obtain structured data;
S3: input the structured data into a model file and compute the popularity information of the topic data.
Those skilled in the art will appreciate that the modules and steps of the embodiments of the present invention described above may be implemented on general-purpose computing devices, either concentrated on a single computing device or distributed over a network of multiple computing devices. In some embodiments they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases the steps shown or described may be performed in an order different from the one given here. Alternatively, they may be fabricated as individual integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module. The present invention is thus not limited to any specific combination of hardware and software.
The above are merely preferred embodiments of the present invention and are not intended to limit it; those skilled in the art may make various modifications and changes to the present invention. Any modification, equivalent replacement or improvement made within the principles of the present invention shall fall within its scope of protection.

Claims (11)

  1. An information processing method, comprising:
    acquiring topic data;
    preprocessing the topic data to obtain structured data;
    inputting the structured data into a model file and computing popularity information of the topic data.
  2. The method according to claim 1, wherein, after the popularity information of the topic data is computed, the method further comprises:
    displaying the popularity information of the topic data on a front-end interface.
  3. The method according to claim 1, wherein, before the structured data is input into the model file, the method further comprises one of the following:
    training the model file;
    presetting the model file.
  4. The method according to claim 3, wherein training the model file comprises:
    segmenting sample text data into words and removing characters of specified types from the sample text data to obtain first data;
    performing word embedding on the first data to obtain second data;
    summing and averaging the word vectors of the second data to obtain third data;
    performing Gaussian mixture model training on an original model by category using the third data to obtain the model file.
  5. The method according to claim 1, wherein inputting the structured data into a model file and computing the popularity information of the topic data comprises:
    segmenting the structured data into words and removing characters of specified types from the structured data to obtain first structured data;
    performing word embedding on the first structured data to obtain second structured data;
    summing and averaging the word vectors of the second structured data to obtain third structured data;
    inputting the third structured data into the model file to obtain the classification and class probability of each piece of data;
    computing the popularity information of the topic data from the class probability.
  6. The method according to claim 1, wherein preprocessing the topic data to obtain structured data comprises:
    splitting the topic data by data type;
    deleting data of specific types contained in the topic data to obtain candidate data, wherein the specific types include at least one of the following: pictures, voice, emoticons;
    normalizing the candidate data into structured data.
  7. The method according to claim 1, wherein acquiring topic data comprises:
    crawling the topic data from the Internet, wherein the topic data includes at least one of the following: topic content, comment information.
  8. An information processing device, comprising:
    an acquisition module configured to acquire topic data;
    a processing module configured to preprocess the topic data to obtain structured data;
    a calculation module configured to input the structured data into a model file and compute popularity information of the topic data.
  9. The device according to claim 8, wherein the calculation module comprises:
    a first processing unit configured to segment the structured data into words and remove characters of specified types from the structured data to obtain first structured data;
    a second processing unit configured to perform word embedding on the first structured data to obtain second structured data;
    a first calculation unit configured to sum and average the word vectors of the second structured data to obtain third structured data;
    a second calculation unit configured to input the third structured data into the model file and compute the classification and class probability of each piece of data;
    a third calculation unit configured to compute the popularity information of the topic data from the class probability.
  10. A storage medium storing a computer program, wherein the computer program is configured to execute, when run, the method according to any one of claims 1 to 7.
  11. An electronic device comprising a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to execute the method according to any one of claims 1 to 7.
PCT/CN2019/088435 2018-06-21 2019-05-24 Information processing method and device, storage medium, and electronic device WO2019242453A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810644005.6 2018-06-21
CN201810644005.6A CN110633410A (en) 2018-06-21 2018-06-21 Information processing method and device, storage medium, and electronic device

Publications (1)

Publication Number Publication Date
WO2019242453A1 (en) 2019-12-26

Family

ID=68966243

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/088435 WO2019242453A1 (en) 2018-06-21 2019-05-24 Information processing method and device, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN110633410A (en)
WO (1) WO2019242453A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515494B (en) * 2020-04-09 2024-03-22 中国移动通信集团广东有限公司 Database processing method based on distributed file system and electronic equipment
CN117078341A (en) * 2023-08-18 2023-11-17 时趣互动(北京)科技有限公司 Brand marketing activity analysis display method, system, terminal and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100305913A1 (en) * 2009-05-29 2010-12-02 Johnson Daniel P Method of modeling the socio-spatial dynamics of extreme urban heat events
US20100319031A1 (en) * 2009-06-12 2010-12-16 National Taiwan University Of Science & Technology Hot video prediction system based on user interests social network
CN104731857A (en) * 2015-01-27 2015-06-24 南京烽火星空通信发展有限公司 Fast public sentiment heat computing method
CN106257449A (en) * 2015-06-19 2016-12-28 阿里巴巴集团控股有限公司 A kind of information determines method and apparatus
CN107766360A (en) * 2016-08-17 2018-03-06 北京神州泰岳软件股份有限公司 A kind of video temperature Forecasting Methodology and device
CN107885793A (en) * 2017-10-20 2018-04-06 江苏大学 A kind of hot microblog topic analyzing and predicting method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699663B (en) * 2013-12-27 2017-02-08 中国科学院自动化研究所 Hot event mining method based on large-scale knowledge base
CN106649405A (en) * 2015-11-04 2017-05-10 陈包容 Method and device for acquiring reply prompt content of chat initiating sentence
CN105787049B (en) * 2016-02-26 2019-07-16 浙江大学 A kind of network video focus incident discovery method based on Multi-source Information Fusion analysis

Also Published As

Publication number Publication date
CN110633410A (en) 2019-12-31

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19823204

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 06/05/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19823204

Country of ref document: EP

Kind code of ref document: A1