CN114003803A - Method and system for discovering media account in specific region on social platform - Google Patents


Publication number
CN114003803A
Authority
CN
China
Prior art keywords
account
media
accounts
social platform
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110944831.4A
Other languages
Chinese (zh)
Inventor
王慧
徐小琳
李扬曦
王永庆
沈华伟
刘科栋
彭成维
王佩
陈苏
史铂深
程学旗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Priority to CN202110944831.4A
Publication of CN114003803A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Abstract

The invention provides a method and a system for discovering media accounts in a specific region on a social platform. In addressing this task, the inventors found that existing methods have difficulty quickly obtaining high-quality seed media account information, and therefore designed a method for rapidly labeling seed media accounts, comprising candidate account determination and account classification. The inventors also found that the feature designs of existing methods are weak, fail to reflect the characteristics of media accounts, and are difficult to apply at scale, and therefore designed a feature extraction method tailored to media accounts that is fast and yields strong features. Finally, the inventors found that existing methods have difficulty expanding from seed accounts to more media accounts efficiently and with high quality. Therefore, building on the seed account labeling and feature extraction methods, two complementary expansion methods are designed to obtain a large number of high-quality media accounts.

Description

Method and system for discovering media account in specific region on social platform
Technical Field
The invention belongs to the field of data mining, and particularly relates to a method and a system for discovering media accounts in a specific region on a social platform.
Background
With the rise of social platforms, more and more traditional media have begun to move their main information dissemination channels from newspapers, radio and television to social platforms. The digitization of information has also given rise to a large number of new generation digital media, which were born on the internet and use official websites, mobile applications and social platform accounts as their main dissemination channels. Meanwhile, the development of social platforms encourages personal expression, and more and more users share their knowledge of particular fields on social media and become self-media. Media accounts on social platforms are a high-quality channel for acquiring information. By observing the media accounts of a specific region, one can obtain high-value information about that region, such as the latest developments, the hottest topics, the state of public opinion, and public attitudes and preferences. Media accounts are therefore increasingly valuable on social platforms.
Data mining is a common means of obtaining media account information. Data mining is a technology for analyzing massive data and finding implicit, potentially valuable information in it, and mainly comprises three steps: data preparation, pattern mining and result representation. Data preparation selects the required data from relevant data sources and integrates it into a data set for mining; pattern mining uses some method to find the rules and patterns contained in the data set, the main methods being statistical machine learning and deep learning; result representation presents the discovered rules and patterns in a form that is as understandable to the user as possible.
Several data mining techniques currently exist for acquiring and analyzing media accounts. The first technology is a method, system and device for identifying social media accounts based on big data intelligent mailboxes (CN 110782222A). It treats a mailbox address as proof that a social media account exists, and identifies and collects media accounts by searching for mailboxes on mainstream social platforms. The second technology is a group user mining method and device (CN 10850934A). It clusters users into a number of groups using four kinds of trajectory features derived from the users' historical trajectory data. The third technology is an identity and motivation identification method and system for atypical media accounts (CN 112559845A). It places accounts in the quadrants of a two-dimensional coordinate system according to the number of original posts and the number of participations of each atypical media account for the same event, and thereby identifies the identity and motivation of each media account. The fourth technology is an English social media account classification method based on information gain (CN 107463703A). It selects keywords for media accounts according to information gain theory, designs media account features in combination with domain keywords, and classifies media accounts into domain categories using a support vector machine model. The fifth technology is a social media account identification method and system (CN 110688593A). It mines the interest features of media accounts by clustering their topics and applying the Apriori algorithm, and judges from these interest features whether two media accounts are the same account.
The first technology extracts mailbox features to discover media accounts, but accurate seed mailbox data is difficult to obtain, and mailboxes are closely associated with only a small number of large media accounts, so the approach lacks generality. The second technology extracts the historical behavior trajectories of accounts for group user discovery, but behavior trajectory data is available on only a few social platforms, and the account features are not designed specifically for media accounts. The third, fourth and fifth technologies all presuppose that media accounts have already been acquired and lack an effective means of discovering them. In addition, with regard to feature mining, the features proposed by the third technology are too weak and those proposed by the fourth technology are tied to several specific domains, so neither can be applied to media account discovery. The feature extraction method of the fifth technology is difficult to apply to the massive user base of a social platform and cannot solve the task of media account discovery.
Disclosure of Invention
The invention aims to overcome the deficiencies of the prior art in seed media account acquisition, feature extraction and account expansion, and provides a method for discovering media-type accounts in a specific region on a social platform, so that a user can quickly find media-type accounts among a large number of social platform users based on a small number of manually labeled accounts, and effectively integrate the basic information of those media accounts.
Specifically, the invention provides a method and a system for discovering a media account in a specific region on a social platform, wherein the method comprises the following steps:
step 1, acquiring all accounts of a specific region on a social platform, screening out from them the media accounts whose influence is greater than a threshold as original accounts, and labeling the media type of each original account;
step 2, extracting features of the original accounts to obtain a plurality of features of each original account;
step 3, training a machine-learning-based classification model with the original accounts and their corresponding features as training data and the media types of the original accounts as the training target, to obtain a classifier corresponding to each media type;
step 4, taking the accounts with which the original accounts actively interact on the social platform as candidate accounts, performing media account discrimination and region filtering on the candidate accounts using the classifier, and adding the candidate accounts that are located in the specific region and belong to media accounts into a media account set;
step 5, taking the accounts in the media account set as a new round of candidate accounts and repeating step 4 until the number of repetitions reaches a threshold or no new account is added to the media account set, then saving the current media account set together with all original accounts as a first set;
step 6, for the accounts that have not been labeled among all accounts of the specific region on the social platform, using the classifier to add the accounts that belong to media into a second set;
step 7, combining the first set and the second set as the media account discovery result for the specific region on the social platform.
In the above method for discovering media accounts in a specific region on a social platform, the influence in step 1 comprises: the number of fans and the number of original messages.
In the above method, step 2 comprises:
extracting user name features based on whether the user name contains media type keywords and region name qualifiers;
extracting user profile features based on the account's number of fans, the ratio of the number of accounts it follows to its number of fans, whether the profile field of the account contains an external link, whether the profile field contains media type keywords, and how long the account has been registered on the social platform;
calculating the proportions of the account's three behavior types as the basic behavior features of the user, the three proportions being the original posting ratio, the forwarding ratio and the commenting ratio;
and fusing the basic behavior features, the user profile features and the user name features as the feature information of the account.
In the above method, step 7 comprises:
removing duplicate accounts from the account set obtained by combining the first set and the second set, and removing the several accounts with the lowest prediction confidence.
The invention also provides a system for discovering media accounts in a specific region on a social platform, comprising:
module 1, configured to acquire all accounts of a specific region on a social platform, screen out from them the media accounts whose influence is greater than a threshold as original accounts, and label the media type of each original account;
module 2, configured to extract features of the original accounts to obtain a plurality of features of each original account;
module 3, configured to train a machine-learning-based classification model with the original accounts and their corresponding features as training data and the media types of the original accounts as the training target, to obtain a classifier corresponding to each media type;
module 4, configured to take the accounts with which the original accounts actively interact on the social platform as candidate accounts, perform media account discrimination and region filtering on the candidate accounts using the classifier, and add the candidate accounts that are located in the specific region and belong to media accounts into a media account set;
module 5, configured to take the accounts in the media account set as a new round of candidate accounts and repeatedly invoke module 4 until the number of rounds reaches a threshold or no new account is added to the media account set, then save the current media account set together with all original accounts as a first set;
module 6, configured to use the classifier, for the accounts that have not been labeled among all accounts of the specific region on the social platform, to add the accounts that belong to media into a second set;
module 7, configured to combine the first set and the second set as the media account discovery result for the specific region on the social platform.
In the above system for discovering media accounts in a specific region on a social platform, the influence in module 1 comprises: the number of fans and the number of original messages.
In the above system, module 2 comprises:
extracting user name features based on whether the user name contains media type keywords and region name qualifiers;
extracting user profile features based on the account's number of fans, the ratio of the number of accounts it follows to its number of fans, whether the profile field of the account contains an external link, whether the profile field contains media type keywords, and how long the account has been registered on the social platform;
calculating the proportions of the account's three behavior types as the basic behavior features of the user, the three proportions being the original posting ratio, the forwarding ratio and the commenting ratio;
and fusing the basic behavior features, the user profile features and the user name features as the feature information of the account.
In the above system, module 7 comprises:
removing duplicate accounts from the account set obtained by combining the first set and the second set, and removing the several accounts with the lowest prediction confidence.
The invention further provides a server for implementing the method for discovering the media account in the specific region on the social platform.
The invention further provides a client for use with the system for discovering media accounts in a specific region on a social platform, wherein the client is a mobile application (app) or computer application software.
As the above scheme shows, the invention has the following advantages:
1. The invention subdivides the media-type accounts on a social platform into traditional media, new generation digital media and self-media. This improves annotators' understanding of account types and thereby speeds up the annotation of seed media accounts; the subdivision of media types also improves the performance of the subsequent classifiers. In addition, users of the invention can quickly find the media accounts of a given subdivided type as required.
2. The invention designs a series of features for media accounts in a specific region. These features are strongly related to the properties of media accounts, are highly general, and can be applied to media account classification in any region with slight modification.
3. The invention combines interactive expansion with region-screening expansion, so it can discover both media accounts that are related to the already expanded accounts and those that are not, which ensures the quantity and quality of the expansion results while keeping expansion fast.
4. The invention is fast on its three main technical points, namely data annotation, feature extraction and account expansion, and can be applied to massive data.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
In addressing the task of discovering media accounts in a specific region on a social platform, the inventors found that existing methods have difficulty quickly obtaining high-quality seed media account information, and therefore designed a method for rapidly labeling seed media accounts, comprising candidate account determination and account classification.
Secondly, the inventors found that the feature designs of existing methods are weak, fail to reflect the characteristics of media accounts, and are difficult to apply at scale, and therefore designed a feature extraction method tailored to media accounts. This method is fast and yields strong features.
Finally, the inventors found that existing methods have difficulty expanding from seed accounts to more media accounts efficiently and with high quality. Therefore, building on the seed account labeling and feature extraction methods, two complementary expansion methods are designed to obtain a large number of high-quality media accounts.
Specifically, the method provided by the invention comprises the following steps:
S1 seed media account annotation: manually annotate a small number of accounts in the specific region that have a high fan count and a high number of original messages, judging whether each is a media-type account and, if so, which subdivided media type it belongs to.
S2 media account feature extraction: extract high-quality features for the media accounts manually annotated in step S1 to obtain a number of representative features of the media accounts.
S3 construction of a media account classification model: construct a training sample set from the media accounts manually annotated in step S1 and the features extracted in step S2, and train classifiers with machine learning techniques to obtain classifiers for distinguishing the different subdivided media types.
S4 interactive expansion of media accounts: take the accounts with which the manually annotated media accounts of step S1 actively interact on the social platform as candidates, and apply media account discrimination and region filtering to the candidate set using the classifiers of step S3; add the qualifying candidates to the media account set, use them as new candidates, and repeat this step, snowballing the expansion until no new account is found. This step ultimately yields a set of media accounts.
S5 region-screening expansion of media accounts: this step runs in parallel with step S4. Use the classifiers obtained in step S3 to judge whether the accounts in the specific region that have not been manually annotated are media-type accounts. This step ultimately yields a set of media accounts.
S6 merging of the expansion results: integrate the media account sets obtained in steps S4 and S5, remove duplicates and remove low-confidence accounts to obtain the final set of media accounts for the specific region.
Further, the specific implementation method of step S1 is as follows:
Data preparation: personal account information collected from the social platform by a crawler is stored in a database, including but not limited to user ID, user name, region, profile, number of fans, number of accounts followed, account creation time, number of original messages and number of likes. An original message, as distinct from a forwarded message or a comment message, is a content message generated by the account itself.
Manual annotation: media-type accounts are divided into three subdivided types: traditional media, new generation digital media, and self-media. Each candidate account is manually judged as to whether it is of the media type and, if so, annotated with its subdivided media type.
Further, the specific implementation method of step S2 is as follows:
User name feature extraction: user name features are extracted by checking whether preset media type keywords and region name qualifiers appear in the user name. In addition, the length of the user ID reflects when the account was registered on the social platform and can be used as one of the user name features.
Other user profile feature extraction: features are extracted such as the account's number of fans, the ratio of the number of accounts followed to the number of fans, whether the profile field contains an external link, whether the profile field contains media type keywords and region name qualifiers, and the account's registration time.
User behavior feature extraction: the proportions of the account's three behavior types are computed as basic behavior features: the original posting ratio, the forwarding ratio and the commenting ratio. In addition, the account's hourly activity is computed and aggregated into activity for four time periods: morning, afternoon, evening and early morning. Finally, the maximum and the average of the number of comments over all messages generated by the account are computed. In general, a larger number of comments indicates that a message has spread more widely and has greater influence.
Feature fusion: the three groups of features are combined into the feature information of each account.
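The three feature groups described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the account record layout, the field names, the keyword lists and the helper `extract_features` are all assumptions made for the example, and the time-based features are left out for brevity.

```python
import re

# Hypothetical keyword lists; the source does not enumerate them in full.
MEDIA_KEYWORDS = ("news", "media")
REGION_QUALIFIERS = ("XX",)


def contains_any(text, words):
    """Return 1 if any word occurs in text (case-insensitive), else 0."""
    lowered = text.lower()
    return int(any(w.lower() in lowered for w in words))


def extract_features(account):
    """Build user-name, profile and behavior features for one account dict."""
    fans = account["fans"]
    following = account["following"]
    original = account["original_count"]
    forwarded = account["forward_count"]
    commented = account["comment_count"]
    total = original + forwarded + commented
    profile = account["profile"]
    return {
        # user-name features
        "name_has_media_kw": contains_any(account["name"], MEDIA_KEYWORDS),
        "name_has_region": contains_any(account["name"], REGION_QUALIFIERS),
        "id_length": len(str(account["user_id"])),
        # profile features
        "fans": fans,
        "following_fans_ratio": following / fans if fans else 0.0,
        "profile_has_link": int(bool(re.search(r"https?://", profile))),
        "profile_has_media_kw": contains_any(profile, MEDIA_KEYWORDS),
        "profile_has_region": contains_any(profile, REGION_QUALIFIERS),
        # behavior features: share of each of the three behavior types
        "original_ratio": original / total if total else 0.0,
        "forward_ratio": forwarded / total if total else 0.0,
        "comment_ratio": commented / total if total else 0.0,
    }
```

Returning a flat dictionary keeps the feature names visible, which makes later steps such as normalization and feature screening easier to audit.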
Further, the implementation method of step S3 is:
Data preprocessing: this comprises normalization of numerical features, one-hot encoding of categorical features, feature screening, and so on. Feature normalization applies min-max normalization to the values of each numerical feature across accounts, scaling all numerical features to the [0,1] interval so that they are comparable. Categorical feature encoding applies one-hot encoding to each categorical feature. Feature screening compares feature variances, Pearson correlation coefficients and the like, and deletes the several features with the smallest variance and the smallest correlation coefficient.
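A small standard-library sketch of these preprocessing operations is given below. The function names and the column-dictionary layout are assumptions for illustration, and only the variance criterion of feature screening is shown; the Pearson-correlation criterion would be applied in the same way.

```python
from statistics import pvariance


def min_max_normalize(values):
    """Scale a list of numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """One-hot encode a categorical column; categories are sorted for stability."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]


def drop_low_variance(columns, n_drop):
    """Remove the n_drop columns with the smallest variance from a name->list dict."""
    ranked = sorted(columns, key=lambda name: pvariance(columns[name]))
    dropped = set(ranked[:n_drop])
    return {name: col for name, col in columns.items() if name not in dropped}
```

Variance should be compared after normalization, since otherwise features measured on large scales (such as fan counts) would dominate the ranking.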
Classifier training: because the different subdivided media types differ at the feature level, a single unified classifier may perform poorly, so a separate classifier is trained for each of the three subdivided media types: traditional media, new generation digital media, and self-media. The classifier for each subdivided type is an ensemble of three standard base classifiers: a random forest classifier, a gradient boosting classifier and a logistic regression classifier; the final result is obtained by majority voting over the outputs of the three base classifiers. Where a base classifier requires hyper-parameter tuning, grid search is used to find the hyper-parameters that give it the best performance.
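The voting step can be illustrated independently of any particular learning library. The sketch below assumes each base classifier has already produced a list of binary labels (1 for the subdivided media type, 0 otherwise); the function names are hypothetical, and the rule in `predict_is_media` that an account counts as media when any of the subdivided-type ensembles accepts it is an assumption, since the text does not spell out how the three per-type results are combined.

```python
from collections import Counter


def majority_vote(base_predictions):
    """Combine per-classifier label lists into one label per sample by majority."""
    combined = []
    for labels in zip(*base_predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


def predict_is_media(votes_by_type):
    """Assumed rule: an account is media if any subdivided-type ensemble says 1."""
    per_type = {t: majority_vote(preds) for t, preds in votes_by_type.items()}
    n = len(next(iter(per_type.values())))
    return [int(any(per_type[t][i] for t in per_type)) for i in range(n)]
```

With three binary base classifiers a tie cannot occur, which is one practical reason to use an odd number of voters.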
Further, the specific implementation method of step S4 is as follows:
The interactive expansion method expands the media-type accounts of a specific region in a snowball fashion. First, the number of iterations and the convergence condition are specified, and all manually annotated media accounts are taken as the iteration candidates. In each iteration, the database storing social platform messages is queried for all accounts that the iteration candidates have actively commented on and the accounts to which the messages they forwarded belong, and features are extracted by the method of step S2. These accounts and their features are fed to the three subdivided-type classifiers obtained in step S3 to obtain the accounts predicted to be of a media type, and region filtering is then applied to obtain the expanded accounts that qualify in this iteration. These accounts are used both for the next iteration and for integration with the manually annotated media accounts. Iteration stops when the convergence condition is met, yielding the integrated media account set.
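The snowball loop can be sketched as below, under simplifying assumptions: the interaction data is an in-memory dictionary rather than a message database, and `is_media` and `in_region` are caller-supplied predicates standing in for the step S3 classifiers and the region filter. All names are hypothetical.

```python
def snowball_expand(seeds, interactions, is_media, in_region, max_rounds=5):
    """Expand seed accounts via the accounts they actively interact with.

    interactions maps an account ID to the IDs it commented on or forwarded;
    is_media and in_region are caller-supplied predicates (hypothetical here).
    """
    found = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        candidates = set()
        for account in frontier:
            candidates.update(interactions.get(account, ()))
        candidates -= found  # skip accounts already accepted
        accepted = {c for c in candidates if is_media(c) and in_region(c)}
        if not accepted:  # convergence: no new account was added
            break
        found |= accepted
        frontier = accepted
    return found
```

Only newly accepted accounts form the next frontier, so each round queries interactions for accounts that have not been expanded before, and the returned set already includes the seeds.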
Further, the specific implementation method of step S5 is as follows:
In the database storing social platform personal account information, a candidate account set is screened out by requiring the region field to be the specific region and setting a lower threshold on the number of fans, and account features are extracted by the method of step S2; the three subdivided-type classifiers obtained in step S3 are then each applied to this set to obtain the set of accounts predicted to be of a media type.
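A minimal sketch of this screening step follows, assuming accounts are dictionaries with hypothetical `region`, `fans`, `labeled` and `user_id` fields and that `is_media` wraps the classifiers of step S3. Substring matching on the region field is one simple way to apply region qualifiers and is an assumption of the example.

```python
def region_screen(accounts, region_qualifiers, min_fans, is_media):
    """Return IDs of unlabeled accounts in the region that are predicted as media."""
    selected = []
    for acc in accounts:
        in_region = any(q in acc["region"] for q in region_qualifiers)
        if in_region and acc["fans"] >= min_fans and not acc.get("labeled", False):
            if is_media(acc):
                selected.append(acc["user_id"])
    return selected
```

Applying the cheap region and fan-count filters before the classifier keeps the number of accounts that need feature extraction and prediction small.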
Further, the specific implementation method of step S6 is as follows:
Account integration: the two account sets obtained in steps S4 and S5 are merged.
Account cleaning: duplicate accounts are removed from the merged set, as are the several accounts with the lowest prediction confidence. The resulting account set is the discovered set of media accounts for the specific region.
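The merge and cleaning steps can be sketched as follows, assuming each set is represented as a dictionary from account ID to prediction confidence. The representation, the function name and the choice to keep the higher confidence for a duplicate are assumptions; manually annotated seeds, which have no prediction confidence, could for example be assigned 1.0 so that they are never removed.

```python
def merge_and_clean(first_set, second_set, n_remove):
    """Merge two {account_id: confidence} dicts, dedupe, drop lowest-confidence."""
    merged = dict(second_set)
    for account, conf in first_set.items():
        # a duplicate keeps the higher of its two confidence scores (assumption)
        merged[account] = max(conf, merged.get(account, conf))
    ranked = sorted(merged, key=merged.get, reverse=True)
    keep = ranked[: max(len(ranked) - n_remove, 0)]
    return {account: merged[account] for account in keep}
```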
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
As shown in FIG. 1, S1 seed media account annotation:
Data preparation: personal account information collected from the social platform by a crawler is stored in a database, including but not limited to user ID, user name, region, profile, number of fans, number of accounts followed, account creation time, number of original messages and number of likes. An original message, as distinct from a forwarded message or a comment message, is a content message generated by the account itself. In this embodiment, the social platform is designated social media A and the specific region is designated region XX.
Seed account candidates: a media account needs to have three basic properties: high influence, frequent content output and content output that is not outdated. Media with these properties spread their views more effectively and also exhibit typical media characteristics, which speeds up manual annotation. Therefore, a small number of accounts in region XX with a high fan count and a high number of original messages are manually annotated. When screening by region, it is necessary to determine whether the region field of the personal profile matches a region XX qualifier, such as the various synonyms for region XX. The lower threshold on the number of fans is set to 500, and the number of original messages in the last year is required to be no fewer than 50 (about one message per week on average). Among the selected accounts, the top 200 by fan count are manually annotated.
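The candidate selection with the thresholds given above (at least 500 fans, at least 50 original messages in the last year, top 200 by fan count) might look like the sketch below. The field names, the substring match on the region field and the inclusive treatment of the fan threshold are assumptions of the example.

```python
def select_seed_candidates(accounts, region_qualifiers,
                           min_fans=500, min_original=50, top_n=200):
    """Pick high-influence accounts in the region for manual annotation."""
    eligible = [
        acc for acc in accounts
        if any(q in acc["region"] for q in region_qualifiers)
        and acc["fans"] >= min_fans
        and acc["original_last_year"] >= min_original
    ]
    # rank by fan count, highest first, and keep the top_n for annotation
    eligible.sort(key=lambda acc: acc["fans"], reverse=True)
    return eligible[:top_n]
```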
Seed account annotation: during annotation, it is first judged whether the account is of the media type, and then its subdivided media type is determined: traditional media, new generation digital media, or self-media. Traditional media are media that have gone through the print, radio or television era and have since opened official websites, social media accounts and the like on the internet. New generation digital media are media that did not go through the print, radio or television eras and have only official websites, social media accounts and the like on the internet, for example a digital media account opened on social media A.
The first two media types are more formal: the entity behind the account is usually an institution or company, and most remain active in current news. The entity behind a self-media account is usually an individual; the account has a certain number of fans, continuously produces content in certain fields, spreads its views and influences its audience. Dividing media into subdivided types helps improve the performance of the subsequent classifiers; treating them all as a single media type tends to degrade classifier performance.
S2 media account feature extraction:
Extracting user name features: user name features are extracted by checking whether preset media type keywords and region name qualifiers appear in the user name. In addition, the length of the user ID reflects when the account was registered on the social platform and can be used as one of the user name features.
Extracting other user profile features: these include the number of fans of the account, the ratio of the number of followed accounts to the number of fans, whether the profile field contains an external link, whether the profile field contains a media type keyword or a region name qualifier, the registration time of the account, and the like.
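A minimal sketch of the user name and profile features described above. The keyword lists, the function names, and the argument layout are assumptions made for illustration; a real system would use its own preset keyword and qualifier lists.

```python
# Hypothetical keyword lists; the text gives "news" and "media" as example keywords.
MEDIA_KEYWORDS = ("news", "media")
REGION_QUALIFIERS = ("xx",)

def has_any(text, keywords):
    """Return 1 if any keyword occurs in the text (case-insensitive), else 0."""
    text = text.lower()
    return int(any(k in text for k in keywords))

def profile_features(username, user_id, bio, fans, following):
    return [
        has_any(username, MEDIA_KEYWORDS),            # media keyword in user name
        has_any(username, REGION_QUALIFIERS),         # region qualifier in user name
        len(user_id),                                 # ID length as a registration-time proxy
        fans,                                         # number of fans
        following / fans if fans else 0.0,            # following-to-fan ratio
        int("http://" in bio or "https://" in bio),   # external link in profile
        has_any(bio, MEDIA_KEYWORDS),                 # media keyword in profile
        has_any(bio, REGION_QUALIFIERS),              # region qualifier in profile
    ]

features = profile_features(
    "XX Daily News", "1234567", "Official account. https://example.com", 20000, 100
)
print(features)
```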
Extracting user behavior features: the ratios of the account's three behavior types are calculated as basic behavior features: the original posting ratio, the forwarding ratio, and the comment ratio. In addition, the hourly activity of the account is calculated and aggregated into activity for four time periods: morning, afternoon, evening, and early morning. Finally, the maximum and the average number of comments over all messages produced by the account are calculated. Generally, a larger number of comments indicates a wider dissemination scope and greater influence. Whether the user's text content contains media type keywords and region name qualifiers is also used as a behavior feature.
Feature fusion: the vectors formed by the three feature groups are concatenated into one feature vector per account.
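The behavior features and the fusion step might look like the sketch below. The message record layout and the six-hour period boundaries (0-5 early morning, 6-11 morning, 12-17 afternoon, 18-23 evening) are assumptions; the text does not specify them.

```python
from collections import Counter

def behavior_features(messages):
    """messages: list of dicts with 'kind' ('original', 'forward' or 'comment'),
    'hour' (0-23) and 'comments' (number of comments the message received)."""
    n = len(messages)
    if n == 0:
        return [0.0] * 9
    kinds = Counter(m["kind"] for m in messages)
    ratios = [kinds[k] / n for k in ("original", "forward", "comment")]
    # Assumed period boundaries: early morning, morning, afternoon, evening.
    periods = [0, 0, 0, 0]
    for m in messages:
        periods[m["hour"] // 6] += 1
    activity = [c / n for c in periods]
    counts = [m["comments"] for m in messages]
    return ratios + activity + [max(counts), sum(counts) / n]

def fuse(*groups):
    """Concatenate feature groups into a single feature vector."""
    vector = []
    for group in groups:
        vector.extend(group)
    return vector

messages = [
    {"kind": "original", "hour": 8, "comments": 10},
    {"kind": "original", "hour": 9, "comments": 30},
    {"kind": "forward", "hour": 14, "comments": 0},
    {"kind": "comment", "hour": 22, "comments": 0},
]
feats = behavior_features(messages)
vector = fuse([1, 1], feats)  # e.g. user name features followed by behavior features
print(vector)
```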
On the one hand, the user names of media accounts in a specific region show two typical patterns: they contain media type keywords and they contain region name qualifiers. The media type keywords include: news, media, new, etc. The presence or absence of a keyword can be determined from the user's textual information, including the user name, profile, and message content. On the other hand, based on the properties of media accounts (high influence, continuous content output, up-to-date content, and so on), features such as the fan-to-following ratio, the average and maximum forwarding counts, the account registration time, and the per-period activity are designed from account data and account behavior.
Media accounts of different subdivided types differ in user data, user behavior, and other aspects. For example, the user names of traditional media and new generation digital media tend to be more regular and often include region qualifiers (e.g., HK) and media type qualifiers (e.g., xx News); the fan-to-following ratio of self media tends to be lower, because personal accounts have more freedom and tend to follow more social media A users.
S3 construction of a media account classification model:
A training sample set is constructed from the media accounts manually labeled in step S1 and the features extracted in step S2. Data preprocessing is performed first, including numerical feature normalization, one-hot encoding of categorical features, and feature screening. Numerical features are features whose values are continuous numbers, and categorical features are features whose values are discrete categories. Feature normalization means min-max scaling the values of each numerical feature across all accounts to the [0,1] interval so that the features are comparable. That is, given a feature set Feature = {X, Y}, where X is a numerical feature and Y is a categorical feature, and a sample set Sample = {s1, s2, …, sn}, the values of feature X over all samples are {X1, X2, …, Xn}. After min-max scaling, each value Xi is scaled to Xi':
X_i' = \frac{X_i - \min\{X_1, X_2, \ldots, X_n\}}{\max\{X_1, X_2, \ldots, X_n\} - \min\{X_1, X_2, \ldots, X_n\}} \in [0, 1]
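As a small worked example of min-max scaling, the sketch below scales one numerical feature across three accounts. The function name and the handling of a constant feature (all values mapped to 0.0) are assumptions; the text does not say what to do when the maximum equals the minimum.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention for a constant feature: map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

fan_counts = [500, 1500, 2500]
scaled = min_max_scale(fan_counts)
print(scaled)
```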
One-hot encoding of categorical features means expanding a feature with M categories into an M-dimensional one-hot feature. For example, if a feature Y in the feature set contains two categories, yes and no, then after one-hot encoding Y is expanded into the two-dimensional feature [Y_yes, Y_no]. Accordingly, the value [1,0] indicates yes and [0,1] indicates no. One-hot encoded categorical features can be used as sample features together with the numerical features. Without this encoding, the numbers used as category labels carry no meaningful magnitude and cannot be fed directly into a classifier as features.
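The yes/no example can be reproduced with a short sketch. The function name and the optional `categories` argument are assumptions introduced here to fix the column order.

```python
def one_hot(values, categories=None):
    """Expand a categorical feature into one column per category."""
    if categories is None:
        categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return rows

# Column order [Y_yes, Y_no], as in the example in the text.
encoded = one_hot(["yes", "no", "yes"], categories=["yes", "no"])
print(encoded)
```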
Feature screening means calculating the variance of each feature and its Pearson correlation coefficient with the class label, and removing the few features with the smallest variance and the smallest correlation coefficient. A small variance indicates that the feature has little discriminative power across classes, and a low correlation coefficient indicates that the feature is only weakly associated with the class. Because the dimension of the features extracted in step S2 is not high, removing 0 to 2 features is enough to achieve a good effect.
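One way to read the screening rule is sketched below: drop the feature with the lowest variance and the feature with the lowest absolute Pearson correlation with the label. The function names, the use of the absolute value, and the `n_drop` parameter are assumptions; the text only states that 0 to 2 features are removed.

```python
import math

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / (sx * sy)

def screen_features(columns, labels, n_drop=1):
    """columns: dict of feature name -> list of values (one per sample)."""
    by_var = sorted(columns, key=lambda k: variance(columns[k]))[:n_drop]
    by_corr = sorted(columns, key=lambda k: abs(pearson(columns[k], labels)))[:n_drop]
    dropped = set(by_var) | set(by_corr)
    kept = {k: v for k, v in columns.items() if k not in dropped}
    return kept, dropped

columns = {
    "fans_scaled": [0.9, 0.8, 0.1, 0.2],
    "constant": [0.5, 0.5, 0.5, 0.5],
    "noise": [0.2, 0.9, 0.8, 0.1],
}
labels = [1, 1, 0, 0]
kept, dropped = screen_features(columns, labels)
print(sorted(kept), dropped)
```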
After preprocessing, three types of base classifiers are trained and integrated for each subdivided media type: a random forest classifier, a gradient boosting classifier and a logistic regression classifier. During training, a certain subdivided media type is regarded as a positive sample, a non-media type is regarded as a negative sample, and the positive sample is up-sampled to keep the number of the positive and negative samples close. In order to reduce the influence of data noise on the prediction result to the maximum extent, the invention adopts a base classifier integration mode, and the final prediction result is obtained by majority voting.
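The up-sampling and majority-voting logic can be illustrated without any machine learning library. In the sketch below the three base classifiers are stand-in threshold rules, not trained random forest, gradient boosting, or logistic regression models; the function names and the fixed random seed are also assumptions.

```python
import random

def upsample_positives(samples, labels, seed=0):
    """Duplicate random positive samples until positives and negatives are balanced."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    if pos:
        extra = max(0, len(neg) - len(pos))
        pos = pos + [rng.choice(pos) for _ in range(extra)]
    return pos + neg, [1] * len(pos) + [0] * len(neg)

def majority_vote(classifiers, x):
    """Return 1 if more than half of the base classifiers predict 1."""
    votes = sum(clf(x) for clf in classifiers)
    return int(votes * 2 > len(classifiers))

samples = [[0.9], [0.1], [0.2], [0.3]]
labels = [1, 0, 0, 0]
balanced_x, balanced_y = upsample_positives(samples, labels)

# Stand-ins for the three trained base classifiers of one subdivided media type.
base_classifiers = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[0] > 0.8),
    lambda x: int(x[0] > 0.95),
]
print(majority_vote(base_classifiers, [0.9]), majority_vote(base_classifiers, [0.6]))
```

In practice, one such voting ensemble would be built per subdivided media type, each trained on its own balanced sample set.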
S4 interactive expansion of media account:
This step rests on the following observation: media accounts generally forward and comment on higher-value messages on their own initiative, and the accounts they interact with differ considerably from ordinary audience accounts and are more likely to be media accounts themselves. In addition, media accounts tend to cooperate through forwarding and commenting to strengthen each other's influence, so the invention uses this active interaction relationship to expand the set of media accounts in a specific region through a "snowball"-like method.
Before iteration, the number of iterations and the convergence condition need to be specified. Generally, the number of iterations is set to 50 to 100, and the convergence condition is that the maximum number of iterations is reached or that an iteration produces no new media accounts. First, all manually labeled media accounts are used as iteration candidates. In each iteration, all accounts that the iteration candidates have actively commented on, and the accounts that own the messages they have forwarded, are queried from the database that stores the social platform messages, and their features are extracted by the method of step S2. These accounts and their features are fed into the three subdivided-type classifiers obtained in step S3 to obtain the accounts predicted to be media accounts, and region filtering based on the XX-region qualifiers then yields the expanded accounts for this iteration. These accounts are used as the candidates of the next iteration and are also merged with the manually labeled media accounts. Iteration stops when the convergence condition is met, giving an integrated media account set.
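The iteration above can be sketched as a frontier expansion over an interaction graph. The `interactions` mapping stands in for the database query, and `is_media` and `in_region` stand in for the classifiers and the region filter; all three are hypothetical placeholders.

```python
def snowball_expand(seeds, interactions, is_media, in_region, max_iters=50):
    """interactions: dict of account -> set of accounts it actively
    commented on or whose messages it forwarded."""
    found = set(seeds)      # labeled seeds plus everything discovered so far
    frontier = set(seeds)   # candidates for the current iteration
    for _ in range(max_iters):
        candidates = set()
        for account in frontier:
            candidates |= interactions.get(account, set())
        candidates -= found
        new = {c for c in candidates if is_media(c) and in_region(c)}
        if not new:         # convergence: no new media accounts this round
            break
        found |= new
        frontier = new
    return found

interactions = {
    "seed": {"m1", "u1"},
    "m1": {"m2", "seed"},
    "m2": {"far"},
}
media = {"m1", "m2", "far"}
region = {"seed", "m1", "m2", "u1"}
result = snowball_expand(
    ["seed"], interactions,
    is_media=lambda a: a in media,
    in_region=lambda a: a in region,
)
print(sorted(result))
```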
S5 region-screening expansion of media accounts:
This step can run in parallel with the expansion method of step S4, which saves considerable computation time. First, the accounts in the XX region that have not been manually labeled are screened out; the three subdivided-type classifiers obtained in step S3 then determine whether each account belongs to a subdivided media type; finally, the prediction results for all subdivided types are integrated to obtain an expanded media account set.
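A rough sketch of this region-screening expansion follows. The per-type scoring callables, the rule of taking the highest-scoring type, and the 0.5 cutoff are assumptions; the text does not say how the per-type predictions are integrated.

```python
def region_screen_expand(accounts, labeled, in_region, type_classifiers, cutoff=0.5):
    """type_classifiers: dict of media type -> callable(account) -> score in [0, 1]."""
    result = {}
    for account in accounts:
        if account in labeled or not in_region(account):
            continue
        scores = {t: clf(account) for t, clf in type_classifiers.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= cutoff:   # assumed integration rule: best type above cutoff
            result[account] = best
    return result

type_classifiers = {
    "traditional media": lambda a: 0.9 if a == "b" else 0.1,
    "new generation digital media": lambda a: 0.2,
    "self media": lambda a: 0.3,
}
expanded = region_screen_expand(
    ["a", "b", "c", "d"],
    labeled={"a"},
    in_region=lambda a: a != "d",
    type_classifiers=type_classifiers,
)
print(expanded)
```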
S6 merging the expansion results of the media account:
The two media account sets obtained in steps S4 and S5 are merged, duplicate accounts are removed, and a number of accounts with low prediction confidence are removed. The result is the set of media accounts in the XX region on social media A. Each expanded media account carries an identification field for its subdivided media type (traditional media, new generation digital media, or self media).
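The merge step could be sketched as follows. The record layout (media type plus confidence), the choice to keep the higher-confidence entry for a duplicate account, and the `drop_lowest` parameter are assumptions made for illustration.

```python
def merge_results(interactive, regional, drop_lowest=1):
    """Each input maps account -> (media_type, confidence)."""
    merged = dict(regional)
    for account, (media_type, conf) in interactive.items():
        # Assumed de-duplication rule: keep the higher-confidence prediction.
        if account not in merged or conf > merged[account][1]:
            merged[account] = (media_type, conf)
    ranked = sorted(merged, key=lambda a: merged[a][1], reverse=True)
    keep = ranked[:max(0, len(ranked) - drop_lowest)]
    return {a: merged[a] for a in keep}

interactive = {"a": ("traditional media", 0.9), "b": ("self media", 0.55)}
regional = {"b": ("self media", 0.7), "c": ("new generation digital media", 0.51)}
final = merge_results(interactive, regional)
print(final)
```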
The interactive expansion and the region-screening expansion complement each other. On the one hand, interactive augmentation can find media type users that are closely related to the augmented account, since the method focuses on performing augmentation through user interaction; on the other hand, the database only stores a limited subset of all messages of the social platform, and interaction behaviors are difficult to generate among different fields and different types of media, so that a regional screening method is required to be used for expansion, and the quantity and quality of discovered media are enhanced.
The following are system examples corresponding to the above method examples, and this embodiment can be implemented in cooperation with the above embodiments. The related technical details mentioned in the above embodiments are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the above-described embodiments.
The invention also provides a system for discovering media accounts in a specific region on a social platform, which comprises:
module 1, configured to acquire all accounts of a specific region on the social platform, screen out media accounts whose influence is greater than a threshold from all the accounts as original accounts, and label the media type of every original account;
module 2, configured to extract features of the original accounts to obtain a plurality of features of each original account;
module 3, configured to train a machine-learning-based classification model with the original accounts and their corresponding features as training data and the media types of the original accounts as the training target, to obtain a classifier corresponding to each media type;
module 4, configured to take accounts with which the original accounts actively interact on the social platform as candidate accounts, perform media account discrimination and region filtering on the candidate accounts by using the classifiers, and add the candidate accounts that are located in the specific region and are media accounts to a media account set;
module 5, configured to take the accounts in the media account set as a new round of candidate accounts, repeatedly execute module 4 until the number of rounds reaches a threshold or no new account is added to the media account set, save the current media account set, and add all the original accounts to it to form a first set;
module 6, configured to, for the unlabeled accounts among all accounts of the specific region on the social platform, add those identified as media accounts by the classifiers to a second set;
module 7, configured to merge the first set and the second set as the media account discovery result for the specific region on the social platform.
In the system for discovering media accounts in a specific region on a social platform, the influence in module 1 includes: the number of fans and the number of original messages.
In the system for discovering media accounts in a specific region on a social platform, module 2 comprises:
extracting user name features based on whether the user name contains media type keywords and region name qualifiers;
extracting user profile features based on the number of fans of the account, the ratio of the number of followed accounts to the number of fans, whether the profile field of the account contains an external link, whether the profile field contains a media type keyword, and the registration duration of the account on the social platform;
calculating the ratios of the three behavior types of the account as the basic behavior features of the user, the three ratios being the original posting ratio, the forwarding ratio, and the comment ratio;
fusing the basic behavior features, the user profile features, and the user name features as the feature information of the account.
In the system for discovering media accounts in a specific region on a social platform, module 7 comprises:
removing duplicate accounts from the account set obtained by merging the first set and the second set, and removing a number of accounts with the lowest prediction confidence.
The invention further provides a server for implementing the method for discovering the media account in the specific region on the social platform.
The invention further provides a client used for the media account discovery system of the specific region on the social platform, wherein the client is a mobile phone application APP or computer application software.

Claims (10)

1. A method for discovering media accounts in a specific region on a social platform, characterized by comprising the following steps:
step 1, obtaining all accounts of a specific region on a social platform, screening out media accounts whose influence is greater than a threshold from all the accounts as original accounts, and labeling the media type of every original account;
step 2, extracting features of the original accounts to obtain a plurality of features of each original account;
step 3, training a machine-learning-based classification model with the original accounts and their corresponding features as training data and the media types of the original accounts as the training target, to obtain a classifier corresponding to each media type;
step 4, taking accounts with which the original accounts actively interact on the social platform as candidate accounts, performing media account discrimination and region filtering on the candidate accounts by using the classifiers, and adding the candidate accounts that are located in the specific region and are media accounts to a media account set;
step 5, taking the accounts in the media account set as a new round of candidate accounts, repeating step 4 until the number of repetitions reaches a threshold or no new account is added to the media account set, saving the current media account set, and adding all the original accounts to it to form a first set;
step 6, for the unlabeled accounts among all accounts of the specific region on the social platform, adding those identified as media accounts by the classifiers to a second set;
step 7, merging the first set and the second set as the media account discovery result for the specific region on the social platform.
2. The method for discovering media accounts in a specific region on a social platform of claim 1, wherein the influence in step 1 comprises: the number of fans and the number of original messages.
3. The method for discovering media accounts in a specific region on a social platform of claim 1, wherein step 2 comprises:
extracting user name features based on whether the user name contains media type keywords and region name qualifiers;
extracting user profile features based on the number of fans of the account, the ratio of the number of followed accounts to the number of fans, whether the profile field of the account contains an external link, whether the profile field contains a media type keyword, and the registration duration of the account on the social platform;
calculating the ratios of the three behavior types of the account as the basic behavior features of the user, the three ratios being the original posting ratio, the forwarding ratio, and the comment ratio;
fusing the basic behavior features, the user profile features, and the user name features as the feature information of the account.
4. The method for discovering media accounts in a specific region on a social platform of claim 1, wherein step 7 comprises:
removing duplicate accounts from the account set obtained by merging the first set and the second set, and removing a number of accounts with the lowest prediction confidence.
5. A system for discovering media accounts in a specific region on a social platform, comprising:
module 1, configured to acquire all accounts of a specific region on the social platform, screen out media accounts whose influence is greater than a threshold from all the accounts as original accounts, and label the media type of every original account;
module 2, configured to extract features of the original accounts to obtain a plurality of features of each original account;
module 3, configured to train a machine-learning-based classification model with the original accounts and their corresponding features as training data and the media types of the original accounts as the training target, to obtain a classifier corresponding to each media type;
module 4, configured to take accounts with which the original accounts actively interact on the social platform as candidate accounts, perform media account discrimination and region filtering on the candidate accounts by using the classifiers, and add the candidate accounts that are located in the specific region and are media accounts to a media account set;
module 5, configured to take the accounts in the media account set as a new round of candidate accounts, repeatedly execute module 4 until the number of rounds reaches a threshold or no new account is added to the media account set, save the current media account set, and add all the original accounts to it to form a first set;
module 6, configured to, for the unlabeled accounts among all accounts of the specific region on the social platform, add those identified as media accounts by the classifiers to a second set;
module 7, configured to merge the first set and the second set as the media account discovery result for the specific region on the social platform.
6. The system of claim 5, wherein the influence in module 1 comprises: the number of fans and the number of original messages.
7. The system of claim 5, wherein module 2 comprises:
extracting user name features based on whether the user name contains media type keywords and region name qualifiers;
extracting user profile features based on the number of fans of the account, the ratio of the number of followed accounts to the number of fans, whether the profile field of the account contains an external link, whether the profile field contains a media type keyword, and the registration duration of the account on the social platform;
calculating the ratios of the three behavior types of the account as the basic behavior features of the user, the three ratios being the original posting ratio, the forwarding ratio, and the comment ratio;
fusing the basic behavior features, the user profile features, and the user name features as the feature information of the account.
8. The system of claim 5, wherein module 7 comprises:
removing duplicate accounts from the account set obtained by merging the first set and the second set, and removing a number of accounts with the lowest prediction confidence.
9. A server, configured to implement the method for discovering a specific geographical media account on the social platform according to any one of claims 1 to 4.
10. A client, configured to be used in the system for discovering a media account in a specific region on the social platform according to any one of claims 6 to 8, where the client is a mobile APP or a computer APP.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110944831.4A CN114003803A (en) 2021-08-17 2021-08-17 Method and system for discovering media account in specific region on social platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110944831.4A CN114003803A (en) 2021-08-17 2021-08-17 Method and system for discovering media account in specific region on social platform

Publications (1)

Publication Number Publication Date
CN114003803A true CN114003803A (en) 2022-02-01

Family

ID=79921102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110944831.4A Pending CN114003803A (en) 2021-08-17 2021-08-17 Method and system for discovering media account in specific region on social platform

Country Status (1)

Country Link
CN (1) CN114003803A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115859988A (en) * 2023-02-08 2023-03-28 成都无糖信息技术有限公司 Entity account extraction method and system for social text
CN115859988B (en) * 2023-02-08 2023-10-03 成都无糖信息技术有限公司 Entity account extraction method and system for social text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination