CN113094567A - Malicious complaint identification method and system based on text clustering - Google Patents

Publication number
CN113094567A
Authority
CN
China
Prior art keywords
complaint
word
information
matching
malicious
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110351440.1A
Other languages
Chinese (zh)
Inventor
王萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan XW Bank Co Ltd
Original Assignee
Sichuan XW Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan XW Bank Co Ltd filed Critical Sichuan XW Bank Co Ltd
Priority to CN202110351440.1A priority Critical patent/CN113094567A/en
Publication of CN113094567A publication Critical patent/CN113094567A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and software systems, and in particular to a malicious complaint identification method and system based on text clustering. The method comprises the following steps: step 1: crawling complaint information from complaint websites; step 2: custom text feature processing; step 3: bidirectional matching word segmentation; step 4: word frequency feature sets and visualization; step 5: density clustering based on the DBSCAN algorithm; step 6: determining the malicious complaint cluster. The invention has the following beneficial effects: it combines crawler technology, natural language processing technology and clustering algorithm technology; complaint content is crawled from online complaint channels, an LDA topic model is constructed from that content for natural language processing, all complaint information is classified on that basis, and malicious complaints are finally identified with the density clustering method DBSCAN.

Description

Malicious complaint identification method and system based on text clustering
Technical Field
The invention relates to the technical field of artificial intelligence and software systems, in particular to a malicious complaint identification method and system based on text clustering.
Background
In recent years, a "black industry" of agencies that file malicious complaints on behalf of clients has grown up around banks, payment companies, consumer finance companies, internet finance companies, insurance companies and similar institutions (hereinafter "financial institutions"). These agencies take orders, sign contracts, pursue "rights protection" claims and split the proceeds; they operate as gangs with a clear division of labor. Financial institutions are worn down by this: on the one hand, malicious complaints keep increasing; on the other hand, banks do face real internal pressure from regulatory assessment. The ultimate goal of a malicious complaint is malicious debt evasion. In recent years, debt evasion in the financial sector has pushed up the non-performing loan ratio of the financial industry and caused risk to accumulate at some small and medium-sized financial institutions.
At present the financial industry has relatively few methods for identifying malicious complaints. In-depth study of the behavior patterns of malicious complainants shows that they generally use complaint templates provided by black-market agencies and file complaints in concentrated bursts through online channels or regulators' offline channels. Online complaint channels include Black Cat Complaints (黑猫投诉), Ju Tousu (聚投诉) and others.
Based on the above, this document proposes a malicious complaint identification method and system based on text clustering. It combines crawler technology, natural language processing technology and clustering algorithm technology: complaint content is crawled from online complaint channels, an LDA topic model is constructed from the text content for natural language processing, all complaint information is classified with the density clustering method DBSCAN, and malicious complaints are finally identified.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a method and a system for identifying malicious complaints based on text clustering, solving the problem that malicious complaints currently cannot be identified quickly.
To solve the above problem, the invention discloses a malicious complaint identification method based on text clustering, comprising the following steps:
step 1: the complaint websites and topic parameters are specified through configuration in the interface, and the back end uses crawler technology to collect complaint content that meets the configured conditions;
step 2: the crawled complaint information is stored in a relational database and custom text feature processing is performed; suppose $n$ complaints are collected; the primary key field is named ID and is defined as an auto-increment primary key taking the values $1, 2, \ldots, n$; the complaint information set is denoted $C = \{C_1, C_2, \ldots, C_n\}$, where $C_i$ is the content of the $i$-th complaint, $i = 1, 2, \ldots, n$; the feature set produced by text processing is denoted $X_1$, and the custom text features comprise $m$ features, written $X_1 = \{x_1, x_2, \ldots, x_m\}$;
step 3: the complaint descriptions are assumed to be mainly in Chinese, and each complaint is segmented with a Chinese word segmentation method, which cuts each complaint description into individual words; the complaint descriptions are segmented with the bidirectional maximum matching method; bidirectional maximum matching is a dictionary-based word segmentation method, which matches the Chinese character string to be segmented against the entries of a dictionary according to a given strategy, a match succeeding when a candidate string is found in the dictionary; for the $n$ collected complaints, suppose the $i$-th complaint is segmented into $k_i$ words, and the vector formed by these words is denoted $W_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,k_i})$; for the complaint information set $C$, the complaint information segmentation set formed after word segmentation is denoted $W = \{W_1, W_2, \ldots, W_n\}$, where $i = 1, 2, \ldots, n$;
step 4: word frequency statistics are computed on the segmentation result of step 3, that is, the number of occurrences of each word across the full set of complaint information is counted and a word cloud is generated; for each piece of complaint information, the number of occurrences of each word and its proportion within that complaint's content are counted, forming the word frequency statistical feature set $F$ and the word frequency proportion feature set $G$;
step 5: the processed text feature set $X_1$, the word frequency statistical feature set $F$ and the word frequency proportion feature set $G$ are merged, giving $m + 2nt$ feature variables in total, where $t$ is the number of distinct words in the segmentation result; the merged variables are denoted as the clustering feature set $X = \{X_1, F, G\}$; $X$ is used as the input to the DBSCAN algorithm to cluster the complaints; the classification variable output by the model is denoted $Y$, and the clustering result of the $i$-th complaint is $y_i$, with $Y = \{y_1, y_2, \ldots, y_n\}$, $i = 1, 2, \ldots, n$;
step 6: stratified sampling is applied to the clustering result of the DBSCAN algorithm; the sampled complaints are manually labeled as malicious or not; the label of the malicious complaint cluster is then confirmed from the proportion of malicious complaints in each cluster, and the clustering result of the malicious complaint model is corrected accordingly.
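As a rough illustration of the merge described in step 5, the sketch below concatenates, for each complaint, its custom text features with its word-count row and its word-proportion row. The function name `merge_features` and the one-row-per-complaint list layout are assumptions made for this example; the source does not prescribe a data layout.

```python
def merge_features(text_features, counts, proportions):
    """Concatenate the three feature blocks complaint by complaint.

    Each argument holds one row per complaint: m custom text features,
    t word counts and t word proportions (layout assumed for this sketch).
    """
    if not (len(text_features) == len(counts) == len(proportions)):
        raise ValueError("all three blocks need one row per complaint")
    return [list(x) + list(f) + list(g)
            for x, f, g in zip(text_features, counts, proportions)]


rows = merge_features(
    [[120, 0.05], [80, 0.10]],       # m = 2 custom text features
    [[2, 1, 0], [0, 1, 1]],          # t = 3 word counts
    [[0.5, 0.25, 0.0], [0.0, 0.5, 0.5]],
)
```

With this layout every merged row has m + 2t entries, and the list of rows is what a clustering routine would consume.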
Preferably, step 1 specifically comprises the following steps:
step 1.1 topic parameter configuration: the topic parameter module sets the conditions for the crawler and comprises 4 parameters: monitoring topic, topic code, acquisition frequency and keyword configuration; the monitoring topic is the topic the crawler is to collect; the topic code is the unique primary key of the monitoring topic and consists of digits, letters and underscores; the acquisition frequency specifies when the crawler collects, and is set as the interval between collection runs; the keyword configuration specifies the filter conditions for crawled content, and only content that meets the conditions is collected;
step 1.2 acquisition website configuration: the acquisition website module sets the websites to be crawled through the interface, and the website name and website address must be specified; multiple acquisition sources can be added at the same time.
Preferably, step 2 specifically comprises the following steps:
step 2.1 data storage: the complaint information from the crawler is stored in the relational database MySQL, and a complaint information table is created with the complaint number as the primary key;
step 2.2 basic attributes: basic attributes are the basic attribute fields associated with the complaint information;
step 2.3 statistical features: statistical features are counts of the text items in the complaint content that meet given conditions;
step 2.4 proportional features: proportional features are the proportions of the text items in the complaint content that meet given conditions.
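The source leaves the "given conditions" of steps 2.3 and 2.4 unspecified. The following minimal sketch therefore assumes two hypothetical conditions, digit characters and hits from a small keyword list, purely to show how a statistical feature (a count) and a proportional feature (a count divided by the text length) might be derived from one complaint description. The function name, keyword list and feature names are illustrative, not taken from the source.

```python
def text_features(content, keywords=("refund", "regulator")):
    """Derive example statistical and proportional features from one complaint.

    The conditions used here (digits, keyword hits) are assumptions made
    for illustration only.
    """
    length = len(content)
    digit_count = sum(ch.isdigit() for ch in content)
    keyword_count = sum(content.count(k) for k in keywords)
    return {
        "length": length,                  # basic statistic
        "digit_count": digit_count,        # statistical feature
        "keyword_count": keyword_count,    # statistical feature
        # Proportional feature; empty text is mapped to 0.0.
        "digit_ratio": digit_count / length if length else 0.0,
    }


features = text_features("refund 100 now")
```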
Preferably, step 3 specifically comprises the following steps:
step 3.1 forward maximum matching segmentation: forward scanning scans from the left end of the character string, taking substrings and matching them against the dictionary; the forward maximum matching algorithm has three steps: first, for the description of the $i$-th complaint ($i = 1, 2, \ldots, n$), take max characters from left to right as the matching field, where max is the length of the longest entry in the dictionary; second, look up the matching field in the dictionary; if the match succeeds, the matching field is cut off as a word; if the match fails, remove the last character of the matching field, take the remaining string as the new matching field and match again; third, repeat the above process until all words have been segmented;
step 3.2 reverse maximum matching segmentation: this algorithm reverses the forward maximum matching approach, scanning from the right end of the character string and taking substrings to match against the dictionary; the reverse maximum matching algorithm has three steps: first, for the description of the $i$-th complaint ($i = 1, 2, \ldots, n$), take max characters from right to left as the matching field, where max is the length of the longest entry in the dictionary; second, look up the matching field in the dictionary; if the match succeeds, the matching field is cut off as a word; if the match fails, remove the first character of the matching field, take the remaining string as the new matching field and match again; third, repeat the above process until the length of the string to be segmented is 0, that is, until all words have been segmented;
step 3.3 forward and reverse result matching: bidirectional maximum matching compares the segmentation obtained by forward maximum matching with that obtained by reverse maximum matching to decide which segmentation to use; the comparison has 2 steps: first, if the forward and reverse segmentations contain different numbers of words, the one with fewer words is selected; second, if they contain the same number of words, then either may be selected when the two segmentations are identical, and the one containing fewer single-character words is selected when they differ.
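The three sub-steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the dictionary is a plain `set` of strings, a single character that is not in the dictionary is accepted as a word so that the scan always advances, and when both results have the same word count and the same number of single-character words the reverse result is returned. All function names are hypothetical.

```python
def forward_max_match(text, dictionary, max_len):
    """Forward maximum matching: scan from the left end of the string."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then drop the last character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A lone character is always accepted so the scan can advance.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len):
    """Reverse maximum matching: scan from the right end of the string."""
    words, j = [], len(text)
    while j > 0:
        # Try the longest candidate first, then drop the first character.
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def bidirectional_max_match(text, dictionary):
    """Pick between the forward and reverse segmentations (step 3.3)."""
    max_len = max(len(w) for w in dictionary)
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    if fwd == bwd:
        return fwd

    def singles(ws):
        return sum(1 for w in ws if len(w) == 1)

    return fwd if singles(fwd) < singles(bwd) else bwd


demo_dict = {"研究", "研究生", "生命", "的", "起源"}
result = bidirectional_max_match("研究生命的起源", demo_dict)
```

For this toy dictionary the forward pass yields 研究生 / 命 / 的 / 起源 and the reverse pass yields 研究 / 生命 / 的 / 起源; both contain four words, so the reverse result, which has fewer single-character words, is selected.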
Preferably, step 4 specifically comprises the following steps:
step 4.1 word frequency statistical feature set: the complaint information segmentation set $W = \{W_1, W_2, \ldots, W_n\}$ is deduplicated so that each word is kept only once and all retained words are pairwise distinct; the resulting unique vocabulary set is denoted $S$; suppose $S$ has $t$ elements, written $S = \{s_1, s_2, \ldots, s_t\}$; for each complaint's segmentation in $W$, count the number of occurrences of each element of $S$; for the segmentation $W_i$ of the $i$-th complaint, the corresponding word frequency statistical feature set is denoted $F_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,t})$, where $f_{i,j}$ is the number of times $W_i$ contains the word $s_j$, $j = 1, 2, \ldots, t$, $i = 1, 2, \ldots, n$; the word frequency statistical feature set corresponding to the segmentation set $W$ is then $F = \{F_1, F_2, \ldots, F_n\}$;
step 4.2 word frequency proportion feature set: for each complaint's segmentation in the segmentation set $W = \{W_1, W_2, \ldots, W_n\}$, compute the proportion of each element of the unique vocabulary set $S = \{s_1, s_2, \ldots, s_t\}$, defined as the number of occurrences of that element in the segmentation divided by the length of the segmentation, that is, by the number of words it contains; for the segmentation $W_i$ of the $i$-th complaint, which contains $k_i$ words, the corresponding word frequency proportion feature set is denoted $G_i = (g_{i,1}, g_{i,2}, \ldots, g_{i,t})$, where $g_{i,j} = f_{i,j} / k_i$ is the proportion of $W_i$ made up by the word $s_j$ and $f_{i,j}$ is the number of times $W_i$ contains $s_j$, $j = 1, 2, \ldots, t$, $i = 1, 2, \ldots, n$; the word frequency proportion feature set corresponding to the segmentation set $W$ is then $G = \{G_1, G_2, \ldots, G_n\}$;
step 4.3 word frequency display: a word cloud visually highlights the words that occur most frequently in the text, filtering out a large amount of low-frequency, low-quality text information so that a reader can grasp the gist of the text at a glance; across the full set of complaint information, count the number of occurrences of each element of the unique vocabulary set $S = \{s_1, s_2, \ldots, s_t\}$; let the $j$-th element $s_j$ occur $p_j$ times in the full complaint information, $j = 1, 2, \ldots, t$; the word frequency set of the full complaint information is denoted $P = \{p_1, p_2, \ldots, p_t\}$, where $p_j = \sum_{i=1}^{n} f_{i,j}$ and $f_{i,j}$ is the number of times the segmentation of the $i$-th complaint contains $s_j$; the word cloud of $P$ can then be built directly with the Python word cloud display package WORDCOUNT.
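The quantities of steps 4.1 to 4.3 can be computed together in one pass. The sketch below is a minimal illustration using only the standard library; the function name and the list-of-rows return layout are assumptions for this example. It returns the unique vocabulary (in order of first appearance), the per-complaint count rows, the per-complaint proportion rows, and the corpus-wide totals $p_j$.

```python
from collections import Counter


def word_frequency_features(segmented):
    """Build vocabulary, count rows, proportion rows and corpus totals.

    `segmented` holds one list of words per complaint.
    """
    # Unique vocabulary s_1..s_t, kept in order of first appearance.
    vocabulary = list(dict.fromkeys(w for words in segmented for w in words))
    counts, proportions = [], []
    totals = Counter()
    for words in segmented:
        tally = Counter(words)
        totals.update(tally)
        row = [tally[s] for s in vocabulary]
        counts.append(row)
        k = len(words)  # k_i, the number of words in this complaint
        # An empty complaint gets an all-zero proportion row.
        proportions.append([c / k if k else 0.0 for c in row])
    return vocabulary, counts, proportions, [totals[s] for s in vocabulary]


vocab, counts, props, totals = word_frequency_features(
    [["refund", "fee", "refund"], ["fee", "loan"]]
)
```

The totals, paired with the vocabulary as a word-to-count mapping, are the kind of input a word cloud library typically accepts; for instance, the `wordcloud` package offers `WordCloud.generate_from_frequencies`, assuming that package and a font covering Chinese glyphs are available.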
Preferably, step 5 specifically comprises the following steps:
step 5.1 initial core sample labeling: first set the two core parameters of the DBSCAN algorithm, eps and min_samples; eps is the distance threshold that defines a neighborhood, and its value is a floating-point number; min_samples is the threshold on the number of samples in the eps-neighborhood required for a sample point to become a core object, and its value is a positive integer; then select a point at random and find all points whose distance from it is less than or equal to eps;
step 5.2 cluster label assignment: if the number of data points within eps of the starting point is less than min_samples, the point is marked as noise; if the number of data points within eps is greater than or equal to min_samples, the point is marked as a core sample and assigned a new cluster label; all neighbors of the point within distance eps are then visited; if they have not yet been assigned to a cluster, the newly created cluster label is assigned to them; if they are core samples, their neighbors are visited in turn, and so on; the cluster grows until there are no more core samples within distance eps of the cluster;
step 5.3 convergence training: select another point that has not yet been visited and repeat the initial core sample labeling and cluster label assignment until all points have been labeled;
step 5.4 model result output: the DBSCAN model outputs the label of the cluster to which each point belongs; for the clustering result $y_i$ of the $i$-th complaint, $y_i = -1$ or a positive integer, where $-1$ denotes a noise point and a positive integer is the label of the cluster to which the point belongs, $i = 1, 2, \ldots, n$.
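The procedure of steps 5.1 to 5.4 can be sketched in plain Python as follows. This is an illustrative sketch rather than a production implementation: it uses Euclidean distance, counts a point as a member of its own neighborhood, visits points in index order instead of at random, and does a brute-force neighbor search. Those choices, and the function name `dbscan`, are assumptions of this example. Labels follow the convention of step 5.4: -1 for noise and 1, 2, ... for clusters.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster number (1, 2, ...) or -1 for noise."""

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)      # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # point i is a core sample: new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # noise reachable from a core sample
            if labels[j] is not None:
                continue               # already assigned; do not expand again
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_samples:
                queue.extend(reach)    # j is also a core sample: keep growing
    return labels


demo_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
demo_labels = dbscan(demo_points, eps=1.5, min_samples=3)
```

In the demo, the two tight groups of three points each form a cluster and the isolated point at (50, 50) is labeled as noise.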
Preferably, step 6 specifically comprises the following steps:
step 6.1 cluster sampling: compute the maximum over all elements of the output $Y$ of the DBSCAN algorithm and denote it $d$; the clustering result of the DBSCAN algorithm then has $d$ classes; let the samples labeled with the $r$-th class form the cluster $M_r$, containing $u_r$ elements, where $n = u_1 + u_2 + \cdots + u_d$, $r = 1, 2, \ldots, d$; after clustering, the complaint information set $C = \{C_1, C_2, \ldots, C_n\}$ yields the complaint information cluster set $M = \{M_1, M_2, \ldots, M_d\}$; specify a sampling proportion $h$, $0 < h < 1$, and apply stratified sampling to the cluster set $M$ to form the complaint information cluster sampling set $\hat{M} = \{\hat{M}_1, \hat{M}_2, \ldots, \hat{M}_d\}$, where $\hat{M}_r$ is the sample drawn from $M_r$ and $\hat{M}$ contains $nh$ elements in total;
step 6.2 manual labeling: each complaint in the complaint information cluster sampling set $\hat{M} = \{\hat{M}_1, \hat{M}_2, \ldots, \hat{M}_d\}$ is labeled manually, where 1 denotes a malicious complaint and 0 denotes an ordinary complaint; the labeled cluster sampling set is denoted $Z = \{Z_1, Z_2, \ldots, Z_d\}$ with $Z_r = (z_{r,1}, z_{r,2}, \ldots, z_{r,u_r h})$, where $z_{r,v} = 1$ or $0$, $r = 1, 2, \ldots, d$, $v = 1, 2, \ldots, u_r h$, $u_r$ is the number of elements of the cluster $M_r$ and $h$ is the sampling proportion; compute the proportion of malicious complaints in each cluster of $Z$: for the $r$-th cluster $M_r$, the number of malicious complaints is $a_r = \sum_{v=1}^{u_r h} z_{r,v}$ and the proportion of malicious complaints in the cluster is $b_r = a_r / (u_r h)$; the set of malicious complaint proportions is denoted $B = \{b_1, b_2, \ldots, b_d\}$;
step 6.3 marking the malicious complaint cluster: compute the maximum of the malicious complaint proportions $b_r$ of the clusters, $r = 1, 2, \ldots, d$, and denote it $b_{et} = \max_r b_r$; the cluster $M_{et}$ corresponding to this maximum is the malicious complaint cluster, and all complaints in it are treated as malicious complaints; the elements of the remaining clusters are merged into the normal complaint cluster, denoted $M_{ot}$; the classification result of the malicious complaint model for the complaint information set $C = \{C_1, C_2, \ldots, C_n\}$ is then $\{M_{et}, M_{ot}\}$, where $M_{et}$ contains $u_{et}$ complaints, $M_{ot}$ contains $u_{ot}$ complaints, and $u_{et} + u_{ot} = n$; at this point all malicious complaints have been identified.
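Steps 6.1 to 6.3 amount to stratified sampling per cluster, manual 0/1 labeling of the sample, and picking the cluster with the highest malicious share. A minimal sketch follows; the function names, the rounding of each stratum size up to at least one element, the fixed random seed, and the exclusion of noise points (label -1) from sampling are assumptions made for this example and are not specified in the source.

```python
import math
import random
from collections import defaultdict


def stratified_sample(labels, h, seed=0):
    """Sample a fraction h of the complaint indices from every cluster."""
    clusters = defaultdict(list)
    for idx, y in enumerate(labels):
        if y != -1:                    # noise points belong to no cluster
            clusters[y].append(idx)
    rng = random.Random(seed)
    return {r: rng.sample(members, max(1, math.ceil(len(members) * h)))
            for r, members in clusters.items()}


def malicious_cluster(samples, is_malicious):
    """Return (cluster with the highest malicious share, share per cluster).

    `is_malicious[i]` is the manual 0/1 mark of sampled complaint i.
    """
    shares = {r: sum(is_malicious[i] for i in idxs) / len(idxs)
              for r, idxs in samples.items()}
    return max(shares, key=shares.get), shares


demo_labels = [1] * 10 + [2] * 10 + [-1]
demo_marks = [1] * 10 + [0] * 11       # stand-in for manual labeling
demo_samples = stratified_sample(demo_labels, h=0.5)
best, shares = malicious_cluster(demo_samples, demo_marks)
```

In practice only the sampled complaints would carry manual marks; the full list of marks in the demo simply stands in for a reviewer's answers.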
To solve the above problem, the invention also discloses a malicious complaint identification system based on text clustering, comprising a complaint website complaint information crawler module, a custom text feature processing module, a bidirectional matching word segmentation module, a word frequency feature set and visualization module, a density clustering module based on the DBSCAN algorithm, and a malicious complaint cluster determination module;
the complaint website complaint information crawler module is used for specifying the complaint websites and topic parameters through configuration in the interface, with the back end using crawler technology to collect complaint content that meets the configured conditions;
the custom text feature processing module is used for storing the crawled complaint information in a relational database and performing custom text feature processing; suppose $n$ complaints are collected; the primary key field is named ID and is defined as an auto-increment primary key taking the values $1, 2, \ldots, n$; the complaint information set is denoted $C = \{C_1, C_2, \ldots, C_n\}$, where $C_i$ is the content of the $i$-th complaint, $i = 1, 2, \ldots, n$; the feature set produced by text processing is denoted $X_1$, and the custom text features comprise $m$ features, written $X_1 = \{x_1, x_2, \ldots, x_m\}$;
the bidirectional matching word segmentation module is used for segmenting each complaint with a Chinese word segmentation method, the complaint descriptions being assumed to be mainly in Chinese; Chinese word segmentation cuts each complaint description into individual words; the complaint descriptions are segmented with the bidirectional maximum matching method; bidirectional maximum matching is a dictionary-based word segmentation method, which matches the Chinese character string to be segmented against the entries of a dictionary according to a given strategy, a match succeeding when a candidate string is found in the dictionary; for the $n$ collected complaints, suppose the $i$-th complaint is segmented into $k_i$ words, and the vector formed by these words is denoted $W_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,k_i})$; for the complaint information set $C$, the complaint information segmentation set formed after word segmentation is denoted $W = \{W_1, W_2, \ldots, W_n\}$, where $i = 1, 2, \ldots, n$;
the word frequency feature set and visualization module is used for computing word frequency statistics on the segmentation result of step 3, that is, counting the number of occurrences of each word across the full set of complaint information and generating a word cloud; for each piece of complaint information, the number of occurrences of each word and its proportion within that complaint's content are counted, forming the word frequency statistical feature set $F$ and the word frequency proportion feature set $G$;
the density clustering module based on the DBSCAN algorithm is used for merging the processed text feature set $X_1$, the word frequency statistical feature set $F$ and the word frequency proportion feature set $G$, giving $m + 2nt$ feature variables in total, where $t$ is the number of distinct words in the segmentation result; the merged variables are denoted as the clustering feature set $X = \{X_1, F, G\}$; $X$ is used as the input to the DBSCAN algorithm to cluster the complaints; the classification variable output by the model is denoted $Y$, and the clustering result of the $i$-th complaint is $y_i$, with $Y = \{y_1, y_2, \ldots, y_n\}$, $i = 1, 2, \ldots, n$;
the malicious complaint cluster determination module is used for applying stratified sampling to the clustering result of the DBSCAN algorithm, manually labeling the sampled complaints as malicious or not, confirming the label of the malicious complaint cluster from the proportion of malicious complaints in each cluster, and correcting the clustering result of the malicious complaint model accordingly.
Preferably, the complaint website complaint information crawler module comprises 2 sub-modules: topic parameter configuration and acquisition website configuration;
the topic parameter configuration sub-module is used for setting the conditions for the crawler and comprises 4 parameters: monitoring topic, topic code, acquisition frequency and keyword configuration; the monitoring topic is the topic the crawler is to collect; the topic code is the unique primary key of the monitoring topic and consists of digits, letters and underscores; the acquisition frequency specifies when the crawler collects, and is set as the interval between collection runs; the keyword configuration specifies the filter conditions for crawled content, and only content that meets the conditions is collected;
the acquisition website configuration sub-module is used for setting the websites to be crawled through the interface, specifying the website name and website address; multiple acquisition sources can be added at the same time.
Preferably, the custom text feature processing module comprises 4 sub-modules: data storage, basic attributes, statistical features and proportional features;
the data storage sub-module is used for storing the complaint information from the crawler in the relational database MySQL and creating a complaint information table with the complaint number as the primary key;
the basic attribute sub-module is used for the basic attributes, which are the basic attribute fields associated with the complaint information;
the statistical feature sub-module is used for counting the text items in the complaint content that meet given conditions;
the proportional feature sub-module is used for computing the proportion of text items in the complaint content that meet given conditions.
Owing to the above technical solution, the invention has the following beneficial effects:
1. The system combines natural language processing technology and unsupervised learning technology, and builds a closed-loop flow of 6 modules: complaint website complaint information crawler, custom text feature processing, bidirectional matching word segmentation, word frequency feature set and visualization, density clustering based on the DBSCAN algorithm, and malicious complaint cluster determination, thereby realizing automatic identification of malicious complaint content.
2. The custom text feature set, the word frequency statistical feature set and the word frequency proportion feature set are combined as the feature input of the DBSCAN density clustering algorithm, so that the salient information in the text content is mined as fully as possible, improving the accuracy of the malicious complaint identification model.
3. The DBSCAN density clustering algorithm is used to cluster the complaint information, so the number of clusters does not need to be set in advance, clusters with complex shapes can be separated, and points that belong to no cluster can be identified, improving the clustering effect; in addition, labeling is done on stratified samples of the clusters, which calibrates the model result.
Drawings
FIG. 1 is a block diagram of a text clustering based malicious complaint identification system;
FIG. 2 is a crawler configuration system diagram.
Detailed Description
The embodiments of the invention will be described in detail below with reference to the drawings, but the invention can be implemented in many different ways as defined and covered by the claims.
To illustrate the approach more concretely, the following case identifies malicious complaint users of "new web bank" on the complaint website "poly complaints".
Step 1: Complaint information crawler for complaint websites. The complaint website and the topic parameters are specified through interface configuration, and the back end collects the complaint content that meets the configured conditions using crawler technology.
Step 1.1: Topic parameter configuration. The topic parameter module sets the crawl conditions and comprises four parameters: monitoring topic, topic code, acquisition frequency and keyword configuration. The monitoring topic is set to "identification of malicious complaints of new web bank"; the topic code is the unique primary key of the monitoring topic and is set to "XWBANK_JTS_EYTS"; the acquisition frequency specifies how often the crawler collects data and is configured as once every 5 minutes; the keyword configuration specifies the filter condition on the crawled content and is configured as "new web bank && complaint".
Step 1.2: Collection website configuration. The website collection module sets the websites to be crawled through the interface; here the website name is configured as "poly complaints" and the website address as "https://ts.21cn.com/".
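The configuration described in Steps 1.1 and 1.2 can be pictured with a small sketch. The class name `TopicConfig`, its field names, and the reading of "&&" as "all keywords must appear" are assumptions made for illustration; the patent does not specify a data structure or the exact filter semantics.

```python
from dataclasses import dataclass, field


@dataclass
class TopicConfig:
    """Hypothetical container for the four topic parameters of Step 1.1."""
    monitoring_topic: str
    topic_code: str          # unique primary key of the monitoring topic
    interval_minutes: int    # acquisition frequency
    keyword_expr: str        # keywords joined by "&&" (assumed: all must appear)
    sites: dict = field(default_factory=dict)  # site name -> URL (Step 1.2)

    def matches(self, text: str) -> bool:
        """Return True if every '&&'-joined keyword occurs in the text."""
        keywords = [k.strip() for k in self.keyword_expr.split("&&")]
        return all(k in text for k in keywords if k)


config = TopicConfig(
    monitoring_topic="identification of malicious complaints of new web bank",
    topic_code="XWBANK_JTS_EYTS",
    interval_minutes=5,
    keyword_expr="new web bank && complaint",
    sites={"poly complaints": "https://ts.21cn.com/"},
)
```

A crawler loop would call `config.matches(page_text)` on each fetched item and keep only the items for which it returns True.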
Step 2: Custom text feature processing. The complaint information collected by the crawler is stored in a relational database, and custom text feature processing is performed. Suppose that n complaints are collected; the primary key field is named ID and is an auto-increment primary key taking the values 1, 2, ..., n. The complaint information set is denoted as $C = \{C_1, C_2, \ldots, C_n\}$, where $C_i$ is the content of the i-th complaint, $i = 1, 2, \ldots, n$. Let the feature set produced by text processing be $X_1$; the custom text features comprise m features, recorded as
Figure BDA0003002480410000111
This module comprises four sub-modules: data storage, basic attributes, statistical features and proportion features.
Step 2.1: Data storage. The complaint information from the crawler is stored in the relational database MySQL. A complaint information table is created with the complaint number as the primary key; its fields include complaint time, complainant name, crawl time, complaint object, complaint issue, related party, complaint amount, complaint description and so on.
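As a rough illustration of the table in Step 2.1, the sketch below creates it in an in-memory SQLite database, used here only as a stand-in for MySQL so that the example runs without a server. The table name, column names and column types are assumptions derived from the field list above, not taken from the patent.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE complaint_info (
        complaint_no     TEXT PRIMARY KEY,  -- complaint number (primary key)
        complaint_time   TEXT,
        complainant_name TEXT,
        crawl_time       TEXT,
        complaint_object TEXT,
        complaint_issue  TEXT,
        related_party    TEXT,
        complaint_amount REAL,
        description      TEXT
    )
    """
)
conn.execute(
    "INSERT INTO complaint_info VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("T0001", "2021-03-01 10:00", "user_a", "2021-03-01 10:05",
     "new web bank", "fees", "platform_x", 120.0, "example complaint text"),
)
row = conn.execute(
    "SELECT complaint_no, complaint_amount FROM complaint_info"
).fetchone()
```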
Step 2.2: a base attribute. The basic information refers to basic attribute fields associated with the complaint information, such as complaint time points, complaint amounts, complaint complaints, complaint progress and the like.
Step 2.3: and (5) counting the characteristics. The statistical characteristics refer to the number of texts satisfying certain conditions in the content of the statistical complaint information. Such as: counting the total words of the complaint description, counting the words of the complaint description including the complaint, counting the words of the complaint description with the report, counting the number of complaint progress being the follow-up, counting the number of complaint reply bars, and the like.
Step 2.4: and (4) proportional characteristics. The proportional characteristic refers to the proportion of the number of texts meeting certain conditions in the content of the statistical complaint information. Such as: the proportion of the word number of the complaint in the statistical complaint description in the whole text, the proportion of the word number of the report in the statistical complaint description in the whole text and the like.
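The statistical and proportion features of Steps 2.3 and 2.4 can be sketched as follows. The function name `text_features`, the feature keys, and the choice to count the characters covered by non-overlapping keyword occurrences are assumptions; the patent only gives the features by example.

```python
def text_features(description: str, keywords: list[str]) -> dict[str, float]:
    """Count keyword characters and their share of the whole description."""
    total = len(description)
    features: dict[str, float] = {"total_chars": total}
    for kw in keywords:
        # Characters covered by non-overlapping occurrences of the keyword.
        covered = description.count(kw) * len(kw)
        features[f"count_{kw}"] = covered
        features[f"ratio_{kw}"] = covered / total if total else 0.0
    return features


# "投诉" = complaint, "举报" = report
feats = text_features("投诉银行,我要举报并投诉", ["投诉", "举报"])
```

Each complaint would contribute one such dictionary, and its values would form that complaint's row of the custom feature set.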
Step 3: Bidirectional matching word segmentation. Chinese word segmentation cuts each complaint description into individual words, a word being the smallest meaningful language unit that can be used independently. To mine the associations between complaints in depth, the complaint descriptions are segmented with the bidirectional matching word segmentation method. The bidirectional maximum matching method is a dictionary-based segmentation method: the Chinese string to be segmented is matched against the entries of a dictionary according to a given strategy, and a match succeeds when a string is found in the dictionary. For the n collected complaints, suppose the i-th complaint is segmented into $k_i$ words; the vector formed by these words is recorded as
Figure BDA0003002480410000121
Figure BDA0003002480410000122
For the complaint information set C, the complaint information segmentation set formed after word segmentation is recorded
Figure BDA0003002480410000123
Figure BDA0003002480410000124
Figure BDA0003002480410000125
where $i = 1, 2, \ldots, n$.
Step 3.1: forward maximum matching participles. The forward direction is to scan the left side of the character string in the forward direction, and to extract the sub-string to match with the dictionary. The forward maximum matching algorithm is mainly divided into three steps: first, for the ith complaint description (i ═ 1, 2.. said., n), max characters in the complaint are taken from left to right as matching fields, and max is the longest entry number in the dictionary base. Second, the segmented matching field is looked up in a dictionary base and matched. If the matching is successful, the matching field is segmented as a word. If the matching is unsuccessful, the last word of the matching field is removed, and the rest character string is used as a new matching field for matching again. Third, the above process is repeated until all words are segmented.
Step 3.2: Reverse maximum matching segmentation. This algorithm mirrors forward maximum matching: scanning starts from the right end of the string, and substrings are taken to match against the dictionary. The reverse maximum matching algorithm has three steps. First, for the i-th complaint description ($i = 1, 2, \ldots, n$), the rightmost max characters of the complaint are taken as the matching field, where max is the number of characters in the longest entry of the dictionary. Second, the matching field is looked up in the dictionary. If the match succeeds, the matching field is cut off as a word. If the match fails, the first character of the matching field is removed and the remaining string is matched again as the new matching field. Third, the above process is repeated until the length of the string still to be segmented is 0, that is, until the whole text has been segmented.
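A matching sketch for the reverse direction follows. As before, the function name, the toy dictionary and the single-character fallback are illustrative assumptions.

```python
def reverse_max_match(text: str, dictionary: set[str]) -> list[str]:
    """Segment text right to left, always trying the longest candidate first."""
    max_len = max((len(w) for w in dictionary), default=1)
    words, end = [], len(text)
    while end > 0:
        # Start with up to max_len characters; drop the first one on failure.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted so the scan can advance.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()  # words were collected from right to left
    return words


demo_dict = {"研究", "研究生", "生命", "命", "的", "起源"}
reverse_result = reverse_max_match("研究生命的起源", demo_dict)
```

On the same toy input the reverse scan returns `['研究', '生命', '的', '起源']`, which differs from the forward result even though both contain four words.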
Step 3.3: Matching of forward and reverse results. Bidirectional maximum matching compares the segmentation obtained by forward maximum matching with that obtained by reverse maximum matching in order to decide on the correct segmentation. The study by Sun M.S. and Benjamin K.T. (1995) showed that for about 90% of Chinese sentences the forward and reverse maximum matching results coincide completely and are correct, while the results differ for the remaining 10%. Result matching has two steps. First, if the forward and reverse results contain different numbers of words, the result with fewer words is selected. Second, if the two results contain the same number of words: when the results are identical, either may be selected; when they differ, the result with fewer single-character words is selected.
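The selection rule can be sketched as below. The tie-break on the number of single-character words is the heuristic commonly used with bidirectional maximum matching and is treated here as an assumption about the intended rule; the helper `max_match` and the toy dictionary are likewise illustrative.

```python
def max_match(text: str, dictionary: set[str], reverse: bool = False) -> list[str]:
    """Maximum matching from the left (default) or from the right."""
    max_len = max((len(w) for w in dictionary), default=1)
    words, rest = [], text
    while rest:
        for size in range(min(max_len, len(rest)), 0, -1):
            candidate = rest[-size:] if reverse else rest[:size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                rest = rest[:-size] if reverse else rest[size:]
                break
    return words[::-1] if reverse else words


def single_chars(words: list[str]) -> int:
    """Number of one-character words in a segmentation."""
    return sum(1 for w in words if len(w) == 1)


def bidirectional_max_match(text: str, dictionary: set[str]) -> list[str]:
    fwd = max_match(text, dictionary)
    bwd = max_match(text, dictionary, reverse=True)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd   # fewer words wins
    if fwd == bwd:
        return fwd                                    # identical: either one
    # Same word count but different cuts: fewer single characters wins.
    return fwd if single_chars(fwd) < single_chars(bwd) else bwd


demo_dict = {"研究", "研究生", "生命", "命", "的", "起源"}
best = bidirectional_max_match("研究生命的起源", demo_dict)
```

Here both directions give four words, the forward result has two single-character words and the reverse result has one, so the reverse segmentation is kept.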
Step 4: Word frequency feature sets and visualization. Word frequency statistics are computed on the segmentation result of step 3, that is, the number of occurrences of each word in the full set of complaint information is counted and a word cloud is generated so that the distribution of complaint content can be observed visually. For each complaint, the number of occurrences of each word and its proportion in the content of that complaint are counted, forming the word frequency statistical feature set
Figure BDA00030024804100001323
and the word frequency proportion feature set
Figure BDA0003002480410000131
Step 4.1: Word frequency statistical feature set. The complaint information segmentation set
Figure BDA0003002480410000132
is de-duplicated, that is, each word is retained only once so that the retained words are pairwise distinct, and the resulting unique vocabulary set is recorded as
Figure BDA0003002480410000133
Suppose that
Figure BDA0003002480410000134
In which there are t elements, denoted as
Figure BDA0003002480410000135
Segmented set of statistical complaint information
Figure BDA0003002480410000136
Wherein each of the complaining information segmentations corresponds to
Figure BDA0003002480410000137
For the ith complaint information split set
Figure BDA0003002480410000138
The corresponding word frequency statistical characteristic set is recorded as
Figure BDA0003002480410000139
Figure BDA00030024804100001310
Wherein
Figure BDA00030024804100001311
To represent
Figure BDA00030024804100001312
contains the word $s_t$, $i = 1, 2, \ldots, n$. For the complaint information segmentation set
Figure BDA00030024804100001324
the corresponding word frequency statistical feature set is
Figure BDA00030024804100001313
Figure BDA00030024804100001314
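A minimal sketch of Step 4.1 follows: the vocabulary is the de-duplicated list of words, and each complaint gets a count vector with one entry per vocabulary word. The function names and the English toy words are assumptions for illustration.

```python
def build_vocabulary(segmented: list[list[str]]) -> list[str]:
    """Unique words in first-seen order (the de-duplicated vocabulary)."""
    seen: dict[str, None] = {}
    for words in segmented:
        for w in words:
            seen.setdefault(w, None)
    return list(seen)


def count_features(segmented: list[list[str]], vocabulary: list[str]) -> list[list[int]]:
    """One count vector of length t (the vocabulary size) per complaint."""
    return [[words.count(s) for s in vocabulary] for words in segmented]


segmented = [["bank", "fee", "fee"], ["report", "bank"]]
vocab = build_vocabulary(segmented)
counts = count_features(segmented, vocab)
```

With these two toy complaints the vocabulary is `['bank', 'fee', 'report']` and the count vectors are `[[1, 2, 0], [1, 0, 1]]`.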
Step 4.2: the set of word frequency scale features. Segmented set of statistical complaint information
Figure BDA00030024804100001315
Wherein each of the complaining information segmentations corresponds to
Figure BDA00030024804100001316
the proportion of each element, defined as the number of occurrences of the corresponding element of
Figure BDA00030024804100001317
divided by the length of that complaint segmentation (that is, the number of words it contains). For the complaint information segmentation of the i-th complaint
Figure BDA00030024804100001318
The corresponding word frequency proportion characteristic set is recorded as
Figure BDA00030024804100001319
Figure BDA00030024804100001320
Wherein
Figure BDA00030024804100001321
To represent
Figure BDA00030024804100001322
of the word $s_t$, $i = 1, 2, \ldots, n$. For the complaint information segmentation set
Figure BDA0003002480410000141
the corresponding word frequency proportion feature set is
Figure BDA0003002480410000142
Figure BDA0003002480410000143
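The proportion features of Step 4.2 divide each word count by the number of words in the same complaint. The sketch below assumes a precomputed vocabulary and returns 0.0 for an empty complaint; both choices, and the function name, are illustrative.

```python
def ratio_features(segmented: list[list[str]], vocabulary: list[str]) -> list[list[float]]:
    """Per complaint: occurrences of each vocabulary word / words in the complaint."""
    rows = []
    for words in segmented:
        length = len(words)
        rows.append(
            [words.count(s) / length if length else 0.0 for s in vocabulary]
        )
    return rows


segmented = [["bank", "fee", "fee", "fee"], ["report", "bank"]]
vocabulary = ["bank", "fee", "report"]
ratios = ratio_features(segmented, vocabulary)
```

Each row sums to 1 when every word of the complaint is in the vocabulary, which is a quick sanity check on the feature matrix.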
Step 4.3: Word frequency display. A word cloud visually highlights the words that occur most frequently in the text, which filters out a large amount of low-frequency, low-quality text information and lets a reader grasp the gist of the text at a glance. Based on the full set of complaint information, count
Figure BDA0003002480410000144
Number of occurrences of each element in
Figure BDA0003002480410000145
Let the number of occurrences of its j-th element $s_j$ in the full complaint information be $p_j$, $j = 1, 2, \ldots, t$. The word frequency set of the full complaint information is recorded as
Figure BDA0003002480410000146
Then
Figure BDA0003002480410000147
Using the Python word cloud display package WORDCOUNT, the word cloud can be constructed directly from
Figure BDA0003002480410000148
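The corpus-wide frequencies $p_j$ that feed the word cloud can be computed with the standard library alone, as in the sketch below. The function name is an assumption, and the rendering step itself is left out because it depends on whichever word cloud package is used.

```python
from collections import Counter


def corpus_frequencies(segmented: list[list[str]]) -> Counter:
    """Total number of occurrences of every word over all complaints."""
    counter: Counter = Counter()
    for words in segmented:
        counter.update(words)
    return counter


segmented = [["bank", "fee", "fee"], ["report", "bank"], ["fee"]]
freq = corpus_frequencies(segmented)
top_word, top_count = freq.most_common(1)[0]
```

A word-to-frequency mapping of this kind is the usual input for a word cloud renderer, which draws high-frequency words larger.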
Step 5: Density clustering based on the DBSCAN algorithm. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based spatial clustering algorithm. It groups regions of sufficient density into clusters and can find clusters of arbitrary shape in a spatial database that contains noise. The processed text feature set $X_1$, the word frequency statistical feature set
Figure BDA0003002480410000149
and the word frequency proportion feature set
Figure BDA00030024804100001410
are merged, giving m + 2nt feature variables in total, which are recorded as the clustering feature set
Figure BDA00030024804100001411
Figure BDA00030024804100001412
Figure BDA00030024804100001413
Figure BDA00030024804100001414
Taking
Figure BDA00030024804100001415
as the input of the DBSCAN algorithm, each complaint is clustered. Let the output variable of the model be Y and the clustering result of the i-th complaint be $y_i$, with $Y = \{y_1, y_2, \ldots, y_n\}$, $i = 1, 2, \ldots, n$.
Step 5.1: initial core sample labeling. Firstly, setting two core parameters eps and min _ sample of a DBSCAN algorithm, wherein eps represents a distance threshold belonging to a neighborhood, and eps is set to be 0.5; min _ sample represents the sample number threshold of e-neighborhood required for the sample point to become the core object, and min _ sample is set to 5. First, a point is arbitrarily selected, and then all points which are less than or equal to eps in distance from the point are found.
Step 5.2: and allocating cluster labels. If the number of data points within eps from the starting point is less than min samples, then this point is marked as noise. If the number of data points within eps is greater than min samples, then this point is marked as the core sample and assigned a new cluster label. All neighbors of the point (within the distance eps) are then visited. If they have not already been assigned a cluster, then the new cluster label just created is assigned to them. If they are core samples, then their neighbors are visited in turn, and so on. The cluster is gradually increased until there are no more core samples within the eps distance of the cluster.
Step 5.3: and (5) carrying out convergence training. Another point is selected that has not been visited and the process of initial core sample labeling and cluster label assignment is repeated until all points are labeled complete.
Step 5.4: and outputting a model result. The DBSCAN model will output the label of the cluster to which each point belongs. Clustering result y for ith complainti,yi-1 or a positive integer, -1 represents a noise point, the remaining positive integers representing the label number of the cluster to which the current point belongs, wherein i-1, 2.
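Steps 5.1 to 5.4 can be followed in a short pure-Python sketch. It is an illustration rather than the patent's implementation: Euclidean distance is assumed, a point counts itself among its neighbors, and a point with exactly min_samples neighbors is treated as a core sample (a boundary case the text leaves open). Labels follow the convention above, with -1 for noise and 1, 2, ... for clusters.

```python
import math


def dbscan(points, eps=0.5, min_samples=5):
    """Return one label per point: -1 for noise, 1, 2, ... for clusters."""
    n = len(points)
    unvisited = 0
    labels = [unvisited] * n

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] != unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1           # provisional noise; may become a border point
            continue
        cluster += 1                 # point i is a core sample: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core sample: border point
            if labels[j] != unvisited:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core sample: keep expanding
    return labels


points = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1),
          (5, 5), (5, 5.1), (5.1, 5), (5.1, 5.1),
          (10, 10)]
labels = dbscan(points, eps=0.5, min_samples=3)
```

On the nine toy points the two tight groups become clusters 1 and 2 and the isolated point is labelled -1. In practice the rows of the merged feature set would take the place of `points`.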
Step 6: Determination of the malicious complaint cluster. Stratified sampling is applied to the clustering result of the DBSCAN algorithm, the sampled complaints are manually labelled as malicious or not, the label of the malicious complaint cluster is then confirmed from the proportion of malicious complaints in each cluster, and the clustering result of the malicious complaint model is corrected accordingly.
Step 6.1: Cluster sampling. For the output Y of the DBSCAN algorithm, the maximum of all vector elements is computed and denoted d. The clustering result of the DBSCAN algorithm then has d classes; let the cluster of the r-th class be $M_r$, containing $u_r$ elements, where $n = u_1 + u_2 + \cdots + u_d$ and $r = 1, 2, \ldots, d$. After clustering, the complaint information set $C = \{C_1, C_2, \ldots, C_n\}$ yields the complaint information cluster set
Figure BDA0003002480410000151
Figure BDA0003002480410000152
Given a sampling proportion h, 0 < h < 1, stratified sampling is performed on the complaint information cluster set
Figure BDA0003002480410000153
to form the complaint information cluster sampling set
Figure BDA0003002480410000154
Figure BDA0003002480410000161
where
Figure BDA0003002480410000162
contains nh elements.
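Stratified sampling over the clusters can be sketched as follows. Rounding each cluster's sample size up, keeping at least one complaint per cluster, leaving noise points out of the sample, and fixing the random seed are all assumptions made for the illustration.

```python
import math
import random
from collections import defaultdict


def stratified_sample(labels, h, seed=0):
    """Sample a fraction h of the complaint indices from every cluster."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        if label != -1:                      # noise points are left out (assumption)
            clusters[label].append(index)
    sample = {}
    for label, members in sorted(clusters.items()):
        size = max(1, math.ceil(h * len(members)))  # at least one per cluster
        sample[label] = sorted(rng.sample(members, size))
    return sample


labels = [1] * 10 + [2] * 4 + [-1]
sample = stratified_sample(labels, h=0.5)
```

With h = 0.5 the sketch draws five of the ten complaints in cluster 1 and two of the four in cluster 2, so every cluster is represented in the set handed to the human annotators.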
Step 6.2: Manual labelling. Each complaint in the complaint information cluster sampling set
Figure BDA0003002480410000163
is labelled manually, where 1 denotes a malicious complaint and 0 denotes a normal complaint. The labelled complaint information cluster sampling set is recorded as
Figure BDA0003002480410000164
Figure BDA0003002480410000165
Figure BDA0003002480410000166
Figure BDA0003002480410000167
Wherein
Figure BDA0003002480410000168
= 1 or 0, $r = 1, 2, \ldots, d$; $i = 1, 2, \ldots, nh$. The proportion of malicious complaints in each cluster of
Figure BDA0003002480410000169
is then computed: for the r-th cluster $M_r$, the number of malicious complaints is calculated as
Figure BDA00030024804100001610
and the proportion of malicious complaints in the cluster is recorded as
Figure BDA00030024804100001611
Figure BDA00030024804100001612
Recording the complaint information malicious complaint proportion set as
Figure BDA00030024804100001613
Figure BDA00030024804100001614
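The per-cluster proportion is simply the mean of the 0/1 marks of the sampled complaints in that cluster. The sketch below assumes the manual marks are available as a mapping from cluster label to a list of marks; the function name and data layout are illustrative.

```python
def malicious_ratios(marks_by_cluster: dict[int, list[int]]) -> dict[int, float]:
    """Map each cluster label to the share of sampled complaints marked 1."""
    return {
        label: sum(marks) / len(marks)
        for label, marks in marks_by_cluster.items()
        if marks  # skip clusters with no sampled complaints
    }


marks = {1: [0, 0, 1, 0], 2: [1, 1, 0, 1]}
ratios = malicious_ratios(marks)
```

For the toy marks above, cluster 1 has a malicious proportion of 0.25 and cluster 2 of 0.75.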
Step 6.3: Marking the malicious complaint cluster. In
Figure BDA00030024804100001615
the maximum value of
Figure BDA00030024804100001616
is computed and recorded as
Figure BDA00030024804100001617
The cluster $M_{et}$ corresponding to this maximum is the malicious complaint cluster, and all complaints in it are treated as malicious complaints; the elements of the remaining clusters are merged into a normal complaint cluster, denoted $M_{ot}$. The classification result of the malicious complaint model for the complaint information set $C = \{C_1, C_2, \ldots, C_n\}$ is then
Figure BDA00030024804100001618
Figure BDA00030024804100001619
where $u_{et} + u_{ot} = n$. At this point all malicious complaints have been identified.
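The final marking step can be sketched as picking the cluster with the highest malicious proportion and labelling every complaint in it as malicious. Treating noise points (label -1) as normal complaints is an assumption of this sketch, as are the function name and the toy inputs.

```python
def classify(labels: list[int], ratios: dict[int, float]) -> list[int]:
    """Return 1 for complaints in the most-malicious cluster, 0 otherwise."""
    malicious_cluster = max(ratios, key=ratios.get)
    # Noise points (-1) and all other clusters are merged into the normal class.
    return [1 if label == malicious_cluster else 0 for label in labels]


labels = [1, 1, 2, 2, -1, 2]
ratios = {1: 0.25, 2: 0.75}
result = classify(labels, ratios)
```

Here cluster 2 has the largest proportion, so its three complaints are flagged and the counts of flagged and unflagged complaints add up to n = 6.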
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A malicious complaint identification method based on text clustering is characterized by comprising the following steps:
step 1: a complaint website and subject parameters are designated through interface configuration, and complaint contents meeting certain conditions are collected by a background through a crawler technology;
step 2: storing the complaint information collected by the crawler in a relational database, and performing custom text feature processing; suppose that n complaints are collected; the primary key field is named ID, is defined as an auto-increment primary key and takes the values 1, 2, ..., n; the complaint information set is denoted as $C = \{C_1, C_2, \ldots, C_n\}$, where $C_i$ is the content of the i-th complaint, $i = 1, 2, \ldots, n$; let the feature set of text processing be $X_1$; the custom text features comprise m features, recorded as
Figure FDA0003002480400000011
step 3: the complaint descriptions are assumed to be mainly in Chinese, and each complaint is segmented with a Chinese word segmentation method; Chinese word segmentation cuts each complaint description into individual words; the complaint descriptions are segmented by the bidirectional matching word segmentation method; the bidirectional maximum matching method is a dictionary-based word segmentation method, in which the Chinese string to be segmented is matched against the entries of a dictionary according to a given strategy, and a match succeeds if a string is found in the dictionary; for the n collected complaints, suppose the i-th complaint is segmented into $k_i$ words, and the vector formed by these words is recorded as
Figure FDA0003002480400000012
Figure FDA0003002480400000013
For the complaint information set C, the complaint information segmentation set formed after word segmentation is recorded
Figure FDA0003002480400000014
Figure FDA0003002480400000015
Figure FDA0003002480400000016
Wherein i is 1, 2, … …, n;
step 4: performing word frequency statistics on the word segmentation result of step 3, namely counting the number of occurrences of each word in the full complaint information, and forming a word cloud; for each complaint, counting the number of occurrences of each word and its proportion in the content of that complaint, and forming the word frequency statistical feature set
Figure FDA0003002480400000017
and the word frequency proportion feature set
Figure FDA0003002480400000018
step 5: merging the processed text feature set
Figure FDA0003002480400000019
the word frequency statistical feature set
Figure FDA00030024804000000110
and the word frequency proportion feature set
Figure FDA00030024804000000111
giving m + 2nt feature variables in total, which are recorded as the clustering feature set
Figure FDA00030024804000000112
Figure FDA00030024804000000113
Figure FDA00030024804000000114
Figure FDA0003002480400000021
taking
Figure FDA0003002480400000022
as the input of the DBSCAN algorithm, clustering each complaint, letting the output variable of the model be Y and the clustering result of the i-th complaint be $y_i$, with $Y = \{y_1, y_2, \ldots, y_n\}$, $i = 1, 2, \ldots, n$;
Step 6: applying stratified sampling to the clustering result of the DBSCAN algorithm, manually labelling the sampled complaints as malicious or not, then confirming the label of the malicious complaint cluster from the proportion of malicious complaints in each cluster, and correcting the clustering result of the malicious complaint model.
2. The method for identifying malicious complaints based on text clustering according to claim 1, wherein step 1 specifically comprises the following steps:
step 1.1 theme parameter configuration: the topic parameter module mainly sets the condition content of the crawler, and comprises 4 parameters of monitoring topics, topic codes, acquisition frequency and keyword configuration; monitoring the theme which is the theme content required to be specified by the crawler; the theme code is the only main key of the monitoring theme and consists of numbers, characters and underlines; the acquisition frequency specifies the acquisition condition of the crawler, and the interval duration of information acquisition needs to be set; configuring a filtering condition of appointed crawler content by the keyword, and acquiring the content meeting the condition;
step 1.2, website configuration acquisition: the website acquisition module sets a website to be crawled in an interface mode, and the name and address of the website are required to be specified; multiple sources of acquisition information may be added at the same time.
3. The method for identifying malicious complaints based on text clustering according to claim 2, wherein step 2 specifically comprises the following steps:
step 2.1, data storage: storing the complaint information from the crawler in a relational database MYSQL, creating a complaint information data table, and taking a complaint number as a main key;
step 2.2 basic attributes: the basic attribute refers to a basic attribute field associated with the complaint information;
step 2.3 statistical characterization: the statistical characteristics refer to the number of texts meeting certain conditions in the content of the statistical complaint information;
step 2.4 proportional characteristics: the proportional characteristic refers to the proportion of the number of texts meeting certain conditions in the content of the statistical complaint information.
4. The method of claim 3, wherein the malicious complaint recognition based on text clustering is characterized in that: the step 3 specifically comprises the following steps:
step 3.1 forward maximum matching segmentation: forward scanning starts from the left end of the string and takes substrings to match against the dictionary; the forward maximum matching algorithm has three steps: first, for the i-th complaint description ($i = 1, 2, \ldots, n$), the leftmost max characters of the complaint are taken as the matching field, where max is the number of characters in the longest entry of the dictionary; second, the matching field is looked up in the dictionary; if the match succeeds, the matching field is cut off as a word; if the match fails, the last character of the matching field is removed and the remaining string is matched again as the new matching field; third, the above process is repeated until the whole text has been segmented;
step 3.2 reverse maximum matching segmentation: this algorithm mirrors forward maximum matching, namely scanning starts from the right end of the string and substrings are taken to match against the dictionary; the reverse maximum matching algorithm has three steps: first, for the i-th complaint description ($i = 1, 2, \ldots, n$), the rightmost max characters of the complaint are taken as the matching field, where max is the number of characters in the longest entry of the dictionary; second, the matching field is looked up in the dictionary; if the match succeeds, the matching field is cut off as a word; if the match fails, the first character of the matching field is removed and the remaining string is matched again as the new matching field; third, the above process is repeated until the length of the string still to be segmented is 0, namely until the whole text has been segmented;
step 3.3 forward and reverse result matching: bidirectional maximum matching compares the segmentation obtained by forward maximum matching with that obtained by reverse maximum matching to decide on the correct segmentation; result matching has 2 steps: first, if the forward and reverse results contain different numbers of words, the result with fewer words is selected; second, if the two results contain the same number of words: when the results are identical, either may be selected; when they differ, the result with fewer single-character words is selected.
5. The method of claim 4, wherein the malicious complaint recognition based on text clustering is characterized in that: the step 4 specifically comprises the following steps:
step 4.1 word frequency statistical feature set: segmenting sets of complaint information
Figure FDA0003002480400000041
Performing duplication elimination statistics, that is, each word is only retained once, every two retained words are different, and the formed unique vocabulary collection is recorded
Figure FDA0003002480400000042
Suppose that
Figure FDA0003002480400000043
In which there are t elements, denoted as
Figure FDA0003002480400000044
Segmented set of statistical complaint information
Figure FDA0003002480400000045
Wherein each of the complaining information segmentations corresponds to
Figure FDA0003002480400000046
For the ith complaint information split set
Figure FDA0003002480400000047
The corresponding word frequency statistical characteristic set is recorded as
Figure FDA0003002480400000048
Figure FDA0003002480400000049
Wherein
Figure FDA00030024804000000410
To represent
Figure FDA00030024804000000411
contains the word $s_t$, $i = 1, 2, \ldots, n$; for the complaint information segmentation set
Figure FDA00030024804000000412
the corresponding word frequency statistical feature set is
Figure FDA00030024804000000413
Figure FDA00030024804000000414
Step 4.2, a word frequency proportion characteristic set: segmented set of statistical complaint information
Figure FDA00030024804000000415
Wherein each of the complaining information segmentations corresponds to
Figure FDA00030024804000000416
The proportion of each element in (b) is defined to correspond to
Figure FDA00030024804000000417
the number of occurrences of each element divided by the length of the complaint information segmentation, namely the number of words contained in the segmentation; for the complaint information segmentation of the i-th complaint
Figure FDA00030024804000000418
The corresponding word frequency proportion characteristic set is recorded as
Figure FDA00030024804000000419
Figure FDA00030024804000000420
Wherein
Figure FDA00030024804000000421
To represent
Figure FDA00030024804000000422
of the word $s_t$, $i = 1, 2, \ldots, n$; for the complaint information segmentation set
Figure FDA00030024804000000423
the corresponding word frequency proportion feature set is
Figure FDA00030024804000000424
Figure FDA00030024804000000425
Step 4.3 word frequency display: a word cloud visually highlights the words that occur most frequently in the text, which filters out a large amount of low-frequency, low-quality text information and lets a reader grasp the subject matter of the text at a glance; based on the full set of complaint information, count
Figure FDA00030024804000000426
Number of occurrences of each element in
Figure FDA00030024804000000427
let the number of occurrences of its j-th element $s_j$ in the full complaint information be $p_j$, $j = 1, 2, \ldots, t$; the word frequency set of the full complaint information is recorded as
Figure FDA00030024804000000428
Then
Figure FDA00030024804000000429
The word cloud display package WORDCOUNT adopting python can be directly constructed
Figure FDA00030024804000000430
The word cloud.
6. The method of claim 5, wherein the method comprises: the step 5 specifically comprises the following steps:
step 5.1 initial core sample labelling: first, the two core parameters of the DBSCAN algorithm, eps and min_samples, are set, where eps is the distance threshold of the ε-neighborhood and is a floating-point value; min_samples is the threshold on the number of samples in the ε-neighborhood required for a sample point to become a core object and is a positive integer; a point is then selected arbitrarily, and all points whose distance from it is less than or equal to eps are found;
step 5.2 cluster label allocation: if the number of data points within eps from the starting point is less than min _ samples, then this point is marked as noise; if the number of data points within eps is greater than min _ samples, then this point is marked as a core sample and assigned a new cluster label; then accessing all neighbors of the point within the distance eps; if they have not already been assigned a cluster, then the new cluster label just created is assigned to them; if they are core samples, then their neighbors are visited in turn, and so on; the cluster is gradually increased until there are no more core samples within the eps distance of the cluster;
step 5.3 convergence training: selecting another point which is not visited yet, and repeating the processes of initial core sample marking and cluster label distribution until all the points are marked;
step 5.4 model output: the DBSCAN model outputs the label of the cluster to which each point belongs; the clustering result $y_i$ of the i-th complaint is either -1 or a positive integer, where -1 denotes a noise point and a positive integer is the label of the cluster to which the point belongs, $i = 1, 2, \ldots, n$.
7. The method of claim 6, wherein the method comprises: the step 6 specifically comprises the following steps:
step 6.1 cluster sampling: the maximum of all vector elements of the output Y of the DBSCAN algorithm is computed and denoted d; the clustering result of the DBSCAN algorithm then has d classes; let the cluster of the r-th class be $M_r$, containing $u_r$ elements, where $n = u_1 + u_2 + \cdots + u_d$ and $r = 1, 2, \ldots, d$; after clustering, the complaint information set $C = \{C_1, C_2, \ldots, C_n\}$ yields the complaint information cluster set
Figure FDA0003002480400000061
Figure FDA0003002480400000062
specifying a sampling ratio h, 0 < h < 1, and subjecting the complaint information cluster set
Figure FDA0003002480400000063
to stratified sampling to form the complaint information cluster sampling set
Figure FDA0003002480400000064
Figure FDA0003002480400000065
where
Figure FDA0003002480400000066
contains nh elements;
step 6.2 manual labeling: manually labeling the complaint information cluster sampling set
Figure FDA0003002480400000067
where 1 represents a malicious complaint and 0 represents a general complaint; after labeling, the complaint information cluster sampling label set is recorded as
Figure FDA0003002480400000068
Figure FDA0003002480400000069
Figure FDA00030024804000000610
where
Figure FDA00030024804000000611
or 0, r = 1, 2, …, d; i = 1, 2, …, nh; computing, for
Figure FDA00030024804000000612
the proportion of malicious complaints in each cluster, that is, for the r-th cluster M_r, the number of malicious complaints is recorded as
Figure FDA00030024804000000613
and the proportion of malicious complaints in the cluster is recorded as
Figure FDA00030024804000000614
Figure FDA00030024804000000615
the set of malicious complaint proportions of the complaint information is recorded as
Figure FDA00030024804000000616
Figure FDA00030024804000000617
step 6.3 marking the malicious complaint cluster: computing, within
Figure FDA00030024804000000618
the maximum value of
Figure FDA00030024804000000619
which is recorded as
Figure FDA00030024804000000620
the cluster M_et corresponding to the maximum value is the malicious complaint cluster, and all complaints in it are malicious complaints; the elements of the remaining clusters are merged into a normal complaint cluster, denoted M_ot; the classification result of the malicious complaint model for the complaint information set C = {C_1, C_2, …, C_n} is then
Figure FDA00030024804000000621
Figure FDA00030024804000000622
where u_et + u_ot = n; all malicious complaints have thereby been identified.
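Steps 6.1 to 6.3 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation. The function names are hypothetical, noise points (label -1) are excluded from sampling and end up classified as general complaints, at least one complaint is drawn from every cluster, and a tie between clusters is resolved in favor of the first cluster encountered.

```python
import math
import random
from collections import defaultdict


def stratified_sample(labels, h, seed=0):
    """Step 6.1: draw a fraction h of the complaint indices from every cluster."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:  # assumption: noise points are not sampled
            clusters[label].append(idx)
    rng = random.Random(seed)
    # At least one complaint per cluster, so every cluster receives a proportion.
    return {r: rng.sample(members, max(1, math.ceil(h * len(members))))
            for r, members in clusters.items()}


def malicious_proportions(sample, manual_labels):
    """Step 6.2: manual_labels maps complaint index -> 1 (malicious) or 0 (general)."""
    return {r: sum(manual_labels[i] for i in idxs) / len(idxs)
            for r, idxs in sample.items()}


def mark_malicious_cluster(labels, proportions):
    """Step 6.3: the cluster with the highest malicious proportion is marked malicious."""
    et = max(proportions, key=proportions.get)
    return et, [1 if label == et else 0 for label in labels]
```

In use, the output of `stratified_sample` would be handed to human reviewers, their 0/1 judgments collected into `manual_labels`, and the last two functions applied to obtain the per-complaint classification.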
8. The system according to claim 1, wherein the system comprises: a complaint-website complaint information crawler module, a custom text feature processing module, a bidirectional matching word segmentation module, a word frequency feature set and visualization module, a density clustering module based on the DBSCAN algorithm, and a malicious complaint cluster determination module;
the complaint-website complaint information crawler module is used for specifying the complaint website and topic parameters through interface configuration, the background collecting complaint content that meets given conditions by means of crawler technology;
the custom text feature processing module is used for storing the crawled complaint information in a relational database and performing custom text feature processing; suppose that n complaints are collected; the primary key field is named ID, is defined as an auto-increment primary key, and takes the values 1, 2, …, n; the complaint information set is denoted C = {C_1, C_2, …, C_n}, where C_i is the content of the i-th complaint, i = 1, 2, …, n; the feature set produced by text processing is denoted X_1; the custom text features number m in total and are recorded as
Figure FDA0003002480400000071
the bidirectional matching word segmentation module is used for segmenting each complaint content with a Chinese word segmentation method, the complaint descriptions being assumed to be mainly in Chinese; Chinese word segmentation cuts each complaint description into individual words; the complaint description content is segmented by the bidirectional matching word segmentation method; the bidirectional maximum matching method is a dictionary-based word segmentation method, which matches the Chinese character string to be segmented against the entries of a dictionary according to a given strategy, a match succeeding if the string is found in the dictionary; for the n collected complaints, suppose that the i-th complaint is segmented into k_i words, the resulting vector being recorded as
Figure FDA0003002480400000072
Figure FDA0003002480400000073
for the complaint information set C, the complaint information segmentation set formed by word segmentation is recorded as
Figure FDA0003002480400000074
Figure FDA0003002480400000075
Figure FDA0003002480400000076
where i = 1, 2, …, n;
the word frequency feature set and visualization module is used for performing word frequency statistics on the word segmentation result of step 3, that is, counting the number of occurrences of each word across all complaint information and forming a word cloud; and counting, for each piece of complaint information, the number of occurrences of each word and its proportion of the complaint content, to form the word frequency statistical feature set
Figure FDA0003002480400000077
and the word frequency proportion feature set
Figure FDA0003002480400000078
the density clustering module based on the DBSCAN algorithm is used for merging the processed text feature set X_1, the word frequency statistical feature set
Figure FDA0003002480400000081
and the word frequency proportion feature set
Figure FDA0003002480400000082
giving m + 2nt feature variables in total, recorded as the clustering feature set
Figure FDA0003002480400000083
Figure FDA0003002480400000084
Figure FDA0003002480400000085
Figure FDA0003002480400000086
taking
Figure FDA0003002480400000087
as the input of the DBSCAN algorithm, each complaint content is classified; the classification variable output by the model is denoted Y, and the clustering result of the i-th complaint content is y_i, Y = {y_1, y_2, …, y_n}, i = 1, 2, …, n;
and the malicious complaint cluster determination module is used for performing stratified sampling on the clustering result of the DBSCAN algorithm, manually labeling the sampled complaints as malicious or not, finally determining the label of the malicious complaint cluster according to the proportion of malicious complaints in each cluster, and correcting the clustering result of the malicious complaint model.
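The dictionary-based bidirectional maximum matching and the two word frequency feature sets described in claim 8 can be sketched as follows. The claim does not state how the forward and backward segmentations are reconciled. This sketch assumes a common heuristic: prefer the result with fewer words, then the one with fewer single-character words, then the backward result. All function names are hypothetical, and characters not covered by the dictionary are emitted as single-character words.

```python
from collections import Counter


def forward_max_match(text, dictionary, max_len):
    """Greedy longest-match segmentation scanning left to right."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:  # unknown single characters pass through
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len):
    """Greedy longest-match segmentation scanning right to left."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bidirectional_max_match(text, dictionary):
    """Run both directions and pick one result (tie-breaking rule is an assumption)."""
    max_len = max(map(len, dictionary), default=1)
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd

    def singles(words):
        return sum(len(w) == 1 for w in words)

    return fwd if singles(fwd) < singles(bwd) else bwd


def word_frequency_features(segmented):
    """Per-complaint word counts and word proportions over a shared vocabulary."""
    vocab = sorted({w for words in segmented for w in words})
    counts, ratios = [], []
    for words in segmented:
        c = Counter(words)
        counts.append([c[w] for w in vocab])
        ratios.append([c[w] / len(words) if words else 0.0 for w in vocab])
    return vocab, counts, ratios
```

With the dictionary {"研究", "研究生", "生命", "命", "的", "起源"}, the string "研究生命的起源" is segmented forward as 研究生 / 命 / 的 / 起源 and backward as 研究 / 生命 / 的 / 起源. Both have four words, and the backward result wins because it contains fewer single-character words.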
9. The malicious complaint identification system based on text clustering of claim 8, wherein the complaint-website complaint information crawler module comprises two submodules: topic parameter configuration and acquisition website configuration;
the topic parameter configuration submodule is used for setting the conditions of the crawl and comprises four parameters: monitored topic, topic code, acquisition frequency and keyword configuration; the monitored topic is the topic content that the crawler is required to target; the topic code is the unique primary key of the monitored topic and consists of digits, letters and underscores; the acquisition frequency specifies the crawler's acquisition schedule, for which the interval between acquisitions must be set; the keyword configuration specifies the filtering conditions on crawled content, only content meeting the conditions being collected;
the acquisition website configuration submodule is used for setting, through the interface, the websites to be crawled, specifying a website name and a website address for each; multiple acquisition sources may be added at the same time.
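The two configuration submodules of claim 9 could be modeled roughly as below. This is only a sketch: the class and field names are hypothetical, the topic code is assumed to be limited to ASCII letters, digits and underscores, the acquisition interval is assumed to be expressed in minutes, and a text is assumed to pass the keyword filter when it contains at least one configured keyword.

```python
import re
from dataclasses import dataclass, field

# Assumption: "digits, letters and underscores" means ASCII characters only.
TOPIC_CODE_PATTERN = re.compile(r"[0-9A-Za-z_]+")


@dataclass
class TopicConfig:
    topic: str             # monitored topic
    topic_code: str        # unique key of the monitored topic
    interval_minutes: int  # acquisition frequency, as an interval between crawls
    keywords: list = field(default_factory=list)  # keyword filter

    def __post_init__(self):
        if not TOPIC_CODE_PATTERN.fullmatch(self.topic_code):
            raise ValueError("topic_code may contain only digits, letters and underscores")
        if self.interval_minutes <= 0:
            raise ValueError("interval_minutes must be positive")

    def matches(self, text):
        """Keep a crawled text if it contains any keyword, or if no keyword is set."""
        return not self.keywords or any(k in text for k in self.keywords)


@dataclass
class SourceSite:
    name: str  # website name
    url: str   # website address
```

Several `SourceSite` instances can be kept in a list next to one `TopicConfig`, which mirrors the statement that multiple acquisition sources may be added at the same time.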
10. The malicious complaint identification system based on text clustering of claim 9, wherein the custom text feature processing module comprises four submodules: data storage, basic attributes, statistical features and proportion features;
the data storage submodule is used for storing the crawled complaint information in the relational database MySQL and creating a complaint information data table with the complaint number as the primary key;
the basic attribute submodule is used for recording the basic attribute fields related to the complaint information;
the statistical feature submodule is used for counting the number of texts meeting given conditions in the complaint information content;
and the proportion feature submodule is used for computing the proportion of texts meeting given conditions in the complaint information content.
CN202110351440.1A 2021-03-31 2021-03-31 Malicious complaint identification method and system based on text clustering Pending CN113094567A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110351440.1A CN113094567A (en) 2021-03-31 2021-03-31 Malicious complaint identification method and system based on text clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110351440.1A CN113094567A (en) 2021-03-31 2021-03-31 Malicious complaint identification method and system based on text clustering

Publications (1)

Publication Number Publication Date
CN113094567A true CN113094567A (en) 2021-07-09

Family

ID=76673191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110351440.1A Pending CN113094567A (en) 2021-03-31 2021-03-31 Malicious complaint identification method and system based on text clustering

Country Status (1)

Country Link
CN (1) CN113094567A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015044934A1 (en) * 2013-09-30 2015-04-02 ABIDIN, Indira Ratna Dewi A method for adaptively classifying sentiment of document snippets
CN105912625A (en) * 2016-04-07 2016-08-31 北京大学 Linked data oriented entity classification method and system
CN106296422A (en) * 2016-07-29 2017-01-04 重庆邮电大学 A kind of social networks junk user detection method merging many algorithms
CN107944460A (en) * 2016-10-12 2018-04-20 甘肃农业大学 One kind is applied to class imbalance sorting technique in bioinformatics
CN108470282A (en) * 2018-03-26 2018-08-31 国家电网公司客户服务中心 Work order intelligent method for classifying is complained by Customer Service Center
CN108573031A (en) * 2018-03-26 2018-09-25 上海万行信息科技有限公司 A kind of complaint sorting technique and system based on content
CN109492091A (en) * 2018-09-28 2019-03-19 科大国创软件股份有限公司 A kind of complaint work order intelligent method for classifying based on convolutional neural networks
CN109376226A (en) * 2018-11-08 2019-02-22 合肥工业大学 Complain disaggregated model, construction method, system, classification method and the system of text
CN111447574A (en) * 2018-12-27 2020-07-24 中国移动通信集团辽宁有限公司 Short message classification method, device, system and storage medium
CN111104466A (en) * 2019-12-25 2020-05-05 航天科工网络信息发展有限公司 Method for rapidly classifying massive database tables
CN111210057A (en) * 2019-12-25 2020-05-29 广东飞企互联科技股份有限公司 Method for predicting complaints of mobile phone internet users

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114676796A (en) * 2022-05-27 2022-06-28 浙江清大科技有限公司 Clustering acquisition and identification system based on big data
CN114676796B (en) * 2022-05-27 2022-09-06 浙江清大科技有限公司 Clustering acquisition and identification system based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210709