CN111159399A - Automobile vertical website water army discrimination method - Google Patents

Info

Publication number
CN111159399A
Authority
CN
China
Prior art keywords
user
water army
automobile
characteristic
army
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911285641.5A
Other languages
Chinese (zh)
Inventor
娄子安
王磊
郭伟
陈晓帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201911285641.5A
Publication of CN111159399A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/018 Certifying business or products
    • G06Q30/0185 Product, service or business identity fraud

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for identifying water army accounts (paid posters) on an automobile vertical website, comprising the following steps: collecting user information from the automobile vertical website; analyzing in depth the differences between normal users and the water army, and constructing a six-tuple feature model consisting of a user name feature, a fan-to-following ratio, an essence post feature, an activity feature, a vehicle owner feature and a text content feature; and detecting and identifying the water army by applying logistic regression to the six-tuple feature model. The method addresses the problem of distinguishing real users from the water army in forum comments on particular vehicle models on the automobile vertical website: it separates the false from the true, removes water army users and the comments they post, and retains real users and their comments, thereby providing a reference for subsequent product improvement and design.

Description

Automobile vertical website water army discrimination method
Technical Field
The invention relates to the technical field of information processing of automobile vertical websites, in particular to a water army screening method for the automobile vertical websites.
Background
With the rapid development and popularization of the internet, more and more people choose to browse and purchase automobiles online, and prospective buyers often consult other buyers' evaluations of the models they have purchased. Research and development teams of automobile products can also mine users' experience from user comments. User comments on automobile vertical websites therefore serve as an important reference both for iterative improvement of automobile products and for customers' purchasing decisions. However, not all comment authors are normal users: many water army accounts are mixed in, posting large numbers of abnormal comments that mislead the public and interfere with buyers.
Liyiping et al. [1] analyzed the emergence, diffusion characteristics and influence of the network water army, but did not propose a specific identification method. As this group has gradually come into public view, research on specifically identifying the network water army has deepened. Fangxuizhen [2] was the first to state explicitly that prevention of water army influence events should start from the source, that is, the object of prevention is the "water army" rather than the "network". The water army deliberately steers public opinion through large volumes of comments and replies so that certain parties benefit; its comments lack objectivity and authenticity, are commercial behavior driven by profit, and mislead design research institutions. Moqian et al. [3] studied the characteristics and behavior patterns of the network water army, reviewed progress on water army identification features, and analyzed directions for identifying it. The method of Liu Jian Men [4] can reflect different types of invalid users; however, today's water army is no longer limited to professional teams and includes many part-time members, and the method's error rate, which grows as time accumulates, remains a major unsolved problem.
Likewise, based on the same naive Bayes principle, Zhang Yanmei et al. [5] analyzed the identification of invalid users on microblogs and combined several features, including the number of microblog fans and the number of microblog replies, to identify water army users. Although this ensures that water army accounts are identified, the results tend to over-classify users as water army. For cases where part of the sample cannot be standardized, Zhang comeger [6] studied the discovery of water army organizations based on a multi-feature scale space model. The method greatly reduces the manual labeling workload of false comment identification; however, unavoidable errors arise while optimizing each model and affect the final identification accuracy. In terms of design decisions, the method of Yang Cheng et al. [7] improves the objectivity and scientific rigor of design and evaluation, but the data volume is huge and places demands on server storage capacity, so the method has certain limitations. In recent years, as the behavior patterns and commenting habits of the water army have grown more complex, identification by supervised learning alone no longer achieves the expected effect. Wangmonghua [8] proposed a divergence-based semi-supervised learning method for detecting false comments and reported more accurate results. Doeruna [9] identified false comments by mining text and user behavior and built an identification model with the SVM (support vector machine) and XGBoost (extreme gradient boosting) classification algorithms; although the accuracy is high, the feature selection is not comprehensive, and the method adapts poorly to the massive influx of part-time water army members.
As the number of user comments on automobile websites grows, water army behavior is becoming more normalized and concealed and the proportion of water army accounts is expanding rapidly, so recording and analyzing ID and IP address features alone can no longer determine whether a comment is genuine. Identifying user comments urgently requires an automatic method that can cover big-data features and improve identification efficiency and accuracy, so that reasonable suggestions and measures for product improvement can be provided in time, adding vitality to the development of the automobile product industry.
Reference documents:
[1] Liyiping, Wupeng. Analysis of the propagation disorder of networks [J]. Network Propagation, 2011(9): 98-99.
[2] Fangxuizhen. The transmission mechanism and treatment strategy of the network navy [J]. Network Transmission, 2011(7): 56-57.
[3] Moqian, Poke. Network navy identification study [J]. Software Bulletin, 2014, 25(07): 1505-1526.
[4] Liu Jian Man. Advanced water army identification model based on machine learning [A]. China Computer Society. Proceedings of the 33rd National Computer Security Academic Conference [C]. China Computer Society, Computer Security Professional Committee, 2018: 4.
[5] Zhanmei, Huangying, Ganshijie, Dingxung, Martensilon. Research on a microblog network navy recognition algorithm based on a Bayesian model [J]. Communication, 2017, 38(01): 44-53.
[6] Zhang comee. Research on network navy organization discovery technology based on a multi-feature scale space model [D]. Zhejiang Gongshang University, 2015.
[7] Yang Cheng, Sun Saturn, Liu Zheng, Chaochun Lei. Product appearance design decision model based on principal component analysis [J]. Chinese Mechanical Engineering, 2011, 22(18): 2218-.
[8] Wangmonghua. Study of false comment identification based on semi-supervised learning [D]. Nanjing University of Finance and Economics, 2018.
[9] Doeruna. Study of false comment identification based on text and user behavior mining [D]. Inner Mongolia University, 2018.
Disclosure of Invention
The invention provides a method for identifying water army accounts on an automobile vertical website. It addresses the problem of distinguishing real users from the water army in forum comments on particular vehicle models on the website: it separates the false from the true, removes water army users and the comments they post, and retains real users and their comments, providing a reference for subsequent product improvement and design. The details are as follows:
A method for identifying water army accounts on an automobile vertical website, comprising the following steps:
collecting user information from the automobile vertical website;
analyzing in depth the differences between normal users and the water army, and constructing a six-tuple feature model consisting of a user name feature, a fan-to-following ratio, an essence post feature, an activity feature, a vehicle owner feature and a text content feature;
detecting and identifying the water army by applying logistic regression to the six-tuple feature model;
The user name feature is:
Figure BDA0002317905590000031
where len(number) is the number of digits in the user's nickname and len(name) is the total number of characters in the user's nickname.
The fan-to-following ratio is:
Figure BDA0002317905590000032
where num(fans) is the number of fans the user has, num(observer) is the number of accounts the user follows, and abs denotes the absolute value of the difference.
The essence post feature is:
Figure BDA0002317905590000033
where num(jinghuaite) is the number of essence posts published by the user and num(zhhutie) is the number of all posts published by the user.
The activity feature is:
Figure BDA0002317905590000034
where h_i is the number of replies to a single post made by others for the user, and N is the total number of posts made by the user to others.
The vehicle owner feature is:
Figure BDA0002317905590000035
where 1 indicates that the user carries the certified vehicle owner badge and 0 indicates that the user does not.
The text content feature is: T6 = count{ad, senw, puc}
where ad is a degree adverb, senw is an emotion word (both positive and negative), puc is an uncommonly used special punctuation mark, and count denotes counting.
The technical scheme provided by the invention has the beneficial effects that:
1. the method filters out the water army effectively and retains real users, ensuring the authenticity of information and better serving buyers;
2. the invention also helps automobile product research and development teams extract the most valuable usage experience and consumer preferences from real users' information, supporting product iteration and the creation of the automobile products that consumers favor most.
Drawings
FIG. 1 is a flow chart of a method for screening water armies of a vertical website of an automobile;
FIG. 2 is a partial screenshot of an experimental data set;
FIG. 3 is a screenshot of a recognition accuracy result of the present invention;
FIG. 4 is a screenshot of a verification comparison result for different text classification methods.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
The invention is implemented through the following steps:
1) Collect user information from the automobile vertical website, specifically: user name, posting time, post content, purchased vehicle model, rating scores for each aspect of the vehicle model, view count, support count, comment count, vehicle purchase count, following count, certified vehicle owner status, fan count, main post count, essence post count, reply count and so on, and then store the information in a local database.
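The collection step can be sketched as a record type plus a local table. This is only an illustrative sketch: the field names, the subset of fields shown, and the use of SQLite as a stand-in for the local database are assumptions, not details given in the patent.

```python
import sqlite3
from dataclasses import dataclass, astuple, fields


@dataclass
class UserRecord:
    """One crawled forum user. Field names are hypothetical, chosen for illustration."""
    user_name: str
    post_time: str            # timestamp of the post, stored as text
    post_content: str
    fan_count: int
    following_count: int
    main_post_count: int
    essence_post_count: int
    reply_count: int
    is_certified_owner: int   # 1 = certified owner badge present, 0 = absent


def create_table(conn: sqlite3.Connection) -> None:
    # Derive the column list from the dataclass so schema and record stay in sync.
    cols = ", ".join(
        f"{f.name} {'INTEGER' if f.type is int else 'TEXT'}"
        for f in fields(UserRecord)
    )
    conn.execute(f"CREATE TABLE IF NOT EXISTS users ({cols})")


def save_users(conn: sqlite3.Connection, records) -> None:
    placeholders = ", ".join("?" for _ in fields(UserRecord))
    conn.executemany(
        f"INSERT INTO users VALUES ({placeholders})",
        [astuple(r) for r in records],
    )
    conn.commit()


# In-memory database used here only so the sketch runs without any setup.
conn = sqlite3.connect(":memory:")
create_table(conn)
save_users(conn, [
    UserRecord("car_fan_2019", "2019-06-01T10:00:00", "Great mileage.",
               120, 80, 15, 2, 40, 1),
])
row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Keeping the raw counts in the table, rather than the derived features, lets the feature definitions be revised later without crawling the site again.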
2) Analyze in depth the differences between normal users and the water army, and construct a six-tuple feature model (T1, T2, T3, T4, T5, T6);
2.1) User name feature:
Figure BDA0002317905590000041
where len(number) is the number of digits in the user's nickname and len(name) is the total number of characters in the user's nickname.
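The formula image for T1 is not reproduced in the text. A minimal sketch, assuming T1 is the ratio of len(number) to len(name) (a natural reading of the two definitions, but an assumption nonetheless), could look like this; the function name is hypothetical.

```python
def username_digit_ratio(nickname: str) -> float:
    """Assumed T1: share of digit characters in the nickname.

    Returns 0.0 for an empty nickname to avoid dividing by zero.
    """
    if not nickname:
        return 0.0
    digit_count = sum(ch.isdigit() for ch in nickname)
    return digit_count / len(nickname)


ratio_mixed = username_digit_ratio("user1234")   # 4 digits out of 8 characters
ratio_plain = username_digit_ratio("roadster")   # no digits
```

Under this reading, auto-generated nicknames padded with long digit runs score close to 1, while hand-picked names score close to 0.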
2.2) Fan-to-following ratio:
Figure BDA0002317905590000042
where num(fans) is the number of fans the user has, num(observer) is the number of accounts the user follows, and abs denotes the absolute value of the difference.
2.3) Essence post feature:
Figure BDA0002317905590000051
where num(jinghuaite) is the number of essence posts published by the user and num(zhhutie) is the number of all posts published by the user.
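The formula image for T3 is likewise missing. The sketch below assumes T3 is the share of the user's posts that are essence posts, which is one plausible reading of the two counts defined above; the function and parameter names are hypothetical.

```python
def essence_post_ratio(essence_posts: int, all_posts: int) -> float:
    """Assumed T3: essence posts divided by all posts published by the user.

    A user with no posts gets 0.0 rather than raising a division error.
    """
    if all_posts <= 0:
        return 0.0
    return essence_posts / all_posts


t3_active_owner = essence_post_ratio(2, 8)   # 2 of 8 posts were marked as essence
t3_no_posts = essence_post_ratio(0, 0)
```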
2.4) Activity feature:
Figure BDA0002317905590000052
where h_i is the number of replies to a single post made by others for the user, and N is the total number of posts made by the user to others.
2.5) Vehicle owner feature:
Figure BDA0002317905590000053
where 1 indicates that the user carries the certified vehicle owner badge and 0 indicates that the user does not.
2.6) Text content feature: T6 = count{ad, senw, puc}
where ad is a degree adverb, senw is an emotion word (both positive and negative), puc is an uncommonly used special punctuation mark, and count denotes counting.
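The counting behind T6 can be illustrated as follows. The three tiny English lexicons are invented for the example (a real system would presumably use Chinese lexicons and word segmentation), and combining the three counts by summation is an assumption, since the text only states that the three kinds of token are counted.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
DEGREE_ADVERBS = {"very", "extremely", "absolutely", "super"}
EMOTION_WORDS = {"great", "perfect", "love", "terrible", "awful", "hate"}
SPECIAL_PUNCTUATION = set("!~*#^")


def text_content_counts(comment: str) -> dict:
    """Count degree adverbs (ad), emotion words (senw) and special punctuation (puc)."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    return {
        "ad": sum(t in DEGREE_ADVERBS for t in tokens),
        "senw": sum(t in EMOTION_WORDS for t in tokens),
        "puc": sum(ch in SPECIAL_PUNCTUATION for ch in comment),
    }


def text_content_feature(comment: str) -> int:
    """Assumed T6: the sum of the three counts."""
    return sum(text_content_counts(comment).values())


sample = "Absolutely perfect car!!! I love it, very very smooth~"
counts = text_content_counts(sample)
t6 = text_content_feature(sample)
```

Returning the three counts separately as well as the total makes it easy to try either form as model input.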
3) Because deciding whether a user belongs to the water army is a binary classification problem, the water army can be detected and identified by applying a logistic regression algorithm to the six-tuple feature model.
Logistic regression is a generalized linear model; despite the word "regression" in its name, it is a linear model used for classification rather than regression. The prepared data set is divided into a training set and a test set; the model is trained on the training set and then used to predict on the test set.
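The train-then-predict procedure can be sketched with a small logistic regression written from scratch. Everything below is illustrative: the data are synthetic six-feature vectors made up for the example, and the learning rate, epoch count and 80/20 split are arbitrary choices rather than values from the patent.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x) -> int:
    """Return 1 (water army) when the predicted probability is at least 0.5."""
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5)


# Synthetic six-feature users: label 1 = water army, label 0 = normal user.
rng = random.Random(0)


def make_user(label: int):
    centre = 0.75 if label else 0.25
    return [min(1.0, max(0.0, rng.gauss(centre, 0.1))) for _ in range(6)]


labels = [i % 2 for i in range(200)]
data = [(make_user(lab), lab) for lab in labels]
rng.shuffle(data)

split = int(0.8 * len(data))          # 80% training, 20% test
train, test = data[:split], data[split:]

w, b = train_logistic_regression([x for x, _ in train], [y for _, y in train])
accuracy = sum(predict(w, b, x) == y for x, y in test) / len(test)
```

Evaluating only on the held-out 20% is what makes the accuracy figure an estimate of performance on unseen users rather than a measure of fit to the training data.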
4) To verify the effectiveness of the water army identification, after identification is completed the comments of water army users and the comments of real users are screened, extracted and separated, and the comment texts are then classified with three different methods for comparison. The identification method is finalized once the accuracy of every method exceeds seventy-seven percent.
Example 2
The scheme of Example 1 is further validated below with specific experiments:
The experimental environment is as follows: Windows 7 operating system, 3.70 GHz 4-core processor, 4 GB memory, with Python 3.6 and MySQL 5.7.17.
The experimental data were crawled from an automobile vertical website with Python and stored in a MySQL database.
When identifying the water army with the logistic regression algorithm on the six-tuple feature model, the sklearn machine learning module in Python is used: LogisticRegressionCV in sklearn is called, cross-validation automatically searches for the regularization coefficient that gives the highest identification accuracy, and the final identification rate reaches 97.8%.
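A call of this kind might look like the sketch below. It assumes scikit-learn and NumPy are installed; the synthetic data, the candidate grid, the 5-fold setting and the 80/20 split are all assumptions made for illustration and do not reproduce the patent's data set or its 97.8% figure. Note that in scikit-learn the parameter C is the inverse of the regularization strength.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic six-feature data: label 1 = water army, label 0 = normal user.
rng = np.random.default_rng(0)
n_per_class = 150
normal_users = rng.normal(loc=0.3, scale=0.15, size=(n_per_class, 6))
water_army = rng.normal(loc=0.7, scale=0.15, size=(n_per_class, 6))
X = np.vstack([normal_users, water_army])
y = np.array([0] * n_per_class + [1] * n_per_class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Candidate values of C (inverse regularization strength) tried by cross-validation.
candidate_Cs = [0.01, 0.1, 1.0, 10.0, 100.0]
model = LogisticRegressionCV(Cs=candidate_Cs, cv=5, max_iter=1000)
model.fit(X_train, y_train)

best_C = float(np.ravel(model.C_)[0])        # value selected by cross-validation
test_accuracy = model.score(X_test, y_test)  # accuracy on the held-out test set
print(f"selected C = {best_C}, test accuracy = {test_accuracy:.3f}")
```

Selecting C by cross-validation on the training split only keeps the test split untouched until the final accuracy is measured.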
In the verification process, to ensure credibility, the invention selects three text classification methods for comparison, and the final result shows that each method achieves good accuracy. The three text classification methods are naive Bayes, support vector machines (SVM), and long short-term memory (LSTM) neural networks.
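Of the three comparison methods, naive Bayes is the simplest to illustrate. The sketch below is a minimal multinomial naive Bayes text classifier with Laplace smoothing; the English toy comments, the class labels and the whitespace tokenizer are assumptions for the example, and the 0.77 threshold is taken from the seventy-seven percent criterion stated in step 4.

```python
import math
from collections import Counter, defaultdict


def tokenize(text: str):
    # Whitespace tokenizer; Chinese comments would need word segmentation instead.
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)            # log prior
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue                                 # skip unseen words
                count = self.word_counts[label][word] + 1    # Laplace smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_texts = [
    "best car ever buy now amazing deal",
    "amazing amazing perfect buy now",
    "perfect deal best price buy",
    "fuel economy is fine but seats are stiff",
    "brakes feel soft after two years",
    "good handling but noisy cabin on highway",
]
train_labels = ["water_army"] * 3 + ["real"] * 3
clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)

test_texts = ["amazing deal buy now", "noisy cabin and stiff seats"]
test_labels = ["water_army", "real"]
predictions = [clf.predict(t) for t in test_texts]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
meets_threshold = accuracy > 0.77
```

If comments separated by the identification step can be told apart this easily by an independent text classifier, that is some evidence the separation reflects a real difference in the text rather than noise.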
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A method for identifying water army accounts on an automobile vertical website, characterized by comprising the following steps:
collecting user information from the automobile vertical website;
analyzing in depth the differences between normal users and the water army, and constructing a six-tuple feature model consisting of a user name feature, a fan-to-following ratio, an essence post feature, an activity feature, a vehicle owner feature and a text content feature;
and detecting and identifying the water army by applying logistic regression to the six-tuple feature model.
2. The method for identifying water army accounts on an automobile vertical website according to claim 1, wherein
the user name feature is:
Figure FDA0002317905580000011
where len(number) is the number of digits in the user's nickname and len(name) is the total number of characters in the user's nickname.
3. The method for identifying water army accounts on an automobile vertical website according to claim 1, wherein
the fan-to-following ratio is:
Figure FDA0002317905580000012
where num(fans) is the number of fans the user has, num(observer) is the number of accounts the user follows, and abs denotes the absolute value of the difference.
4. The method for identifying water army accounts on an automobile vertical website according to claim 1, wherein
the essence post feature is:
Figure FDA0002317905580000013
where num(jinghuaite) is the number of essence posts published by the user and num(zhhutie) is the number of all posts published by the user.
5. The method for identifying water army accounts on an automobile vertical website according to claim 1, wherein
the activity feature is:
Figure FDA0002317905580000014
where h_i is the number of replies to a single post made by others for the user, and N is the total number of posts made by the user to others.
6. The method for identifying water army accounts on an automobile vertical website according to claim 1, wherein the vehicle owner feature is:
Figure FDA0002317905580000021
where 1 indicates that the user carries the certified vehicle owner badge and 0 indicates that the user does not.
7. The method for identifying water army accounts on an automobile vertical website according to claim 1, wherein
the text content feature is: T6 = count{ad, senw, puc}
where ad is a degree adverb, senw is an emotion word, puc is a punctuation mark, and count denotes counting.
CN201911285641.5A 2019-12-13 2019-12-13 Automobile vertical website water army discrimination method Pending CN111159399A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911285641.5A CN111159399A (en) 2019-12-13 2019-12-13 Automobile vertical website water army discrimination method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911285641.5A CN111159399A (en) 2019-12-13 2019-12-13 Automobile vertical website water army discrimination method

Publications (1)

Publication Number Publication Date
CN111159399A true CN111159399A (en) 2020-05-15

Family

ID=70557106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911285641.5A Pending CN111159399A (en) 2019-12-13 2019-12-13 Automobile vertical website water army discrimination method

Country Status (1)

Country Link
CN (1) CN111159399A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111784492A (en) * 2020-07-10 2020-10-16 讯飞智元信息科技有限公司 Public opinion analysis and financial early warning method, device, electronic equipment and storage medium
CN112861128A (en) * 2021-01-21 2021-05-28 微梦创科网络科技(中国)有限公司 Method and system for identifying machine accounts in batches

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239539A (en) * 2013-09-22 2014-12-24 中科嘉速(北京)并行软件有限公司 Microblog information filtering method based on multi-information fusion
US20170200205A1 (en) * 2016-01-11 2017-07-13 Medallia, Inc. Method and system for analyzing user reviews
CN109241518A (en) * 2017-07-11 2019-01-18 北京交通大学 A kind of detection network navy method based on sentiment analysis
CN109558555A (en) * 2018-08-20 2019-04-02 湖北大学 Microblog water army detection method and detection system based on artificial immunity danger theory

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
程传鹏: "基于特定话题的网络水军识别研究" *
谢忠红: "基于逻辑回归算法的微博水军识别" *

Similar Documents

Publication Publication Date Title
CN109325165B (en) Network public opinion analysis method, device and storage medium
CN110704572B (en) Suspected illegal fundraising risk early warning method, device, equipment and storage medium
CN109087135B (en) Mining method and device for user intention, computer equipment and readable medium
CN107798571A (en) Identifying system, the method and device of malice address/malice order
CN108021651B (en) Network public opinion risk assessment method and device
CN108550054B (en) Content quality evaluation method, device, equipment and medium
CN108241867B (en) Classification method and device
CN110795568A (en) Risk assessment method and device based on user information knowledge graph and electronic equipment
CN107807941A (en) Information processing method and device
CN111046282B (en) Text label setting method, device, medium and electronic equipment
CN109933648B (en) Real user comment distinguishing method and device
CN112995414B (en) Behavior quality inspection method, device, equipment and storage medium based on voice call
CN104750791A (en) Image retrieval method and device
CN113743111A (en) Financial risk prediction method and device based on text pre-training and multi-task learning
CN111159399A (en) Automobile vertical website water army discrimination method
CN116150201A (en) Sensitive data identification method, device, equipment and computer storage medium
CN113282754A (en) Public opinion detection method, device, equipment and storage medium for news events
CN113360788A (en) Address recommendation method, device, equipment and storage medium
CN115018588A (en) Product recommendation method and device, electronic equipment and readable storage medium
CN115249007A (en) Method and device for detecting enclosing and bidding behavior based on electronic bidding document comparison
US10521727B2 (en) System, method, and storage medium for generating hypotheses in data sets
JP6511865B2 (en) INFORMATION PROCESSING APPARATUS AND INFORMATION PROCESSING PROGRAM
CN117235253A (en) Truck user implicit demand mining method based on natural language processing technology
CN113837836A (en) Model recommendation method, device, equipment and storage medium
CN113821596A (en) Information recommendation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200515