CN113094567A

CN113094567A - Malicious complaint identification method and system based on text clustering

Info

Publication number: CN113094567A
Application number: CN202110351440.1A
Authority: CN
Inventors: 王萍
Original assignee: Sichuan XW Bank Co Ltd
Current assignee: Sichuan XW Bank Co Ltd
Priority date: 2021-03-31
Filing date: 2021-03-31
Publication date: 2021-07-09

Abstract

The invention relates to the technical field of artificial intelligence and software systems, and in particular to a malicious complaint identification method and system based on text clustering. The method comprises the following steps: step 1: crawling complaint information from complaint websites; step 2: custom text feature processing; step 3: bidirectional matching word segmentation; step 4: word frequency feature sets and visualization; step 5: density clustering based on the DBSCAN algorithm; step 6: determining the malicious complaint cluster. The invention has the following beneficial effects: crawler technology, natural language processing technology and clustering algorithm technology are applied together; complaint information content is crawled from network complaint channels; an LDA topic model is constructed on that content for natural language processing; all complaint information is classified on that basis; and malicious complaints are finally identified using the density clustering method DBSCAN.

Description

Malicious complaint identification method and system based on text clustering

Technical Field

The invention relates to the technical field of artificial intelligence and software systems, in particular to a malicious complaint identification method and system based on text clustering.

Background

In recent years, a "black industry" of malicious complaint agencies has grown up around banks, payment companies, consumer finance companies, internet finance companies, insurance companies and similar institutions (financial institutions for short). Order taking, contract signing and rights protection are clearly separated, and the agencies operate as organized groups with a clear division of labor. Financial institutions are harassed beyond endurance: on the one hand malicious complaints keep increasing, and on the other hand banks do face internal regulatory and assessment pressure. The ultimate goal of a malicious complaint is malicious debt evasion. In recent years, debt evasion in the financial field has driven up the non-performing loan ratio of the financial industry and caused risk to accumulate in some small and medium-sized financial institutions.

At present the financial industry has relatively few methods for identifying malicious complaints. In-depth study of the behavior pattern of malicious complaint users shows that they generally adopt a complaint template provided by a black-market agency and concentrate their complaints on network channels or offline regulatory channels. The network complaint channels include Black Cat Complaints, Ju Tousu (聚投诉), and the like.

Based on the above, this document provides a malicious complaint identification method and system based on text clustering, which apply crawler technology, natural language processing technology and clustering algorithm technology together: complaint information content is crawled from network complaint channels, an LDA topic model is constructed on the text content for natural language processing, all complaint information is classified using the density clustering method DBSCAN, and malicious complaints are finally identified.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide a method and a system for identifying malicious complaints based on text clustering, and solves the problem that malicious complaints cannot be identified quickly at present.

In order to solve the problems, the invention discloses a malicious complaint identification method based on text clustering, which comprises the following steps:

step 1: a complaint website and topic parameters are specified through interface configuration, and the background collects complaint content meeting the configured conditions through crawler technology;

step 2: storing the crawled complaint information in a relational database and performing custom text feature processing; suppose that n complaints are collected; the primary key field is named ID and is defined as an auto-increment primary key taking the values 1, 2, ..., n; the complaint information set is denoted as $C = \{C^1, C^2, \ldots, C^n\}$, where $C^i$ denotes the content of the i-th complaint, $i = 1, 2, \ldots, n$; assume that the feature set produced by text processing is $X_1$ and that the custom text features comprise m features;

step 3: the complaint descriptions are assumed to be mainly in Chinese, and each complaint content is segmented using a Chinese word segmentation method; Chinese word segmentation cuts each complaint description into individual words; the complaint description content is segmented by the bidirectional matching word segmentation method; the bidirectional maximum matching method is a dictionary-based word segmentation method, which matches the Chinese character string to be segmented against the entries of a dictionary according to a certain strategy, and the match succeeds if a given character string is found in the dictionary; for the n collected complaints, suppose that the i-th complaint is segmented into $k^i$ words, which form a vector; for the complaint information set C, the set formed after word segmentation is recorded as the complaint information segmentation set, where $i = 1, 2, \ldots, n$;

step 4: performing word frequency statistics on the word segmentation result of step 3, that is, counting the number of occurrences of each word in the full complaint information and forming a word cloud; and counting, for each piece of complaint information, the number of occurrences of each word and its proportion in the content of that complaint information, forming a word frequency statistical feature set and a word frequency proportion feature set;

step 5: merging the processed text feature set $X_1$, the word frequency statistical feature set and the word frequency proportion feature set, totaling m + 2nt feature variables, which are recorded as the clustering feature set; the clustering feature set is used as the input of the DBSCAN algorithm to classify each complaint content; assume that the classification variable output by the model is Y and that the clustering result of the i-th complaint content is $y_i$, so that $Y = \{y_1, y_2, \ldots, y_n\}$, $i = 1, 2, \ldots, n$;

step 6: performing stratified sampling on the clustering result of the DBSCAN algorithm, manually labeling each sampled complaint as malicious or not, determining the label of the malicious complaint cluster according to the proportion of malicious complaints in each cluster, and correcting the clustering result of the malicious complaint model accordingly.

Preferably, step 1 specifically comprises the following steps:

step 1.1 topic parameter configuration: the topic parameter module mainly sets the crawl conditions and comprises four parameters: monitoring topic, topic code, acquisition frequency and keyword configuration; the monitoring topic is the topic content the crawler is required to target; the topic code is the unique primary key of the monitoring topic and consists of digits, letters and underscores; the acquisition frequency specifies when the crawler collects, and the interval between collections must be set; the keyword configuration specifies the filter condition for crawled content, and only content meeting the condition is collected;

step 1.2 acquisition website configuration: the acquisition website module sets the websites to be crawled through the interface, and the website name and website address must be specified; multiple acquisition sources may be added at the same time.

Preferably, step 2 specifically comprises the following steps:

step 2.1 data storage: the complaint information from the crawler is stored in the relational database MySQL, and a complaint information data table is created with the complaint number as the primary key;

step 2.2 basic attributes: basic attributes are the basic attribute fields associated with the complaint information;

step 2.3 statistical features: statistical features are counts of the text items in the complaint information content that meet certain conditions;

step 2.4 proportional features: proportional features are the proportions that the text items meeting certain conditions make up of the complaint information content.
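As an illustration of steps 2.3 and 2.4, the following minimal Python sketch computes a few statistical and proportional features for one complaint description. The function name `custom_text_features`, the feature names and the keyword list are hypothetical and chosen only for this example; the method itself leaves the concrete conditions open.

```python
def custom_text_features(description, keywords=("投诉", "举报")):
    """Return statistical and proportional features for one complaint description.

    The keywords (here "complaint" and "report") are illustrative assumptions.
    """
    total_chars = len(description)
    features = {"total_chars": total_chars}
    for kw in keywords:
        count = description.count(kw)
        # Statistical feature: how many times the keyword occurs.
        features["count_" + kw] = count
        # Proportional feature: share of the text taken up by the keyword.
        features["ratio_" + kw] = count * len(kw) / total_chars if total_chars else 0.0
    return features


example = custom_text_features("我要投诉并举报该机构，投诉无人处理")
print(example)
```

Basic attributes such as complaint time or amount would simply be read from the database row and placed alongside these values in the feature set.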

Preferably, step 3 specifically comprises the following steps:

step 3.1 forward maximum matching word segmentation: forward scanning starts from the left end of the character string and takes out substrings to match against the dictionary; the forward maximum matching algorithm has three main steps: first, for the description of the i-th complaint (i = 1, 2, ..., n), max characters are taken from left to right as the matching field, where max is the length of the longest entry in the dictionary; second, the matching field is looked up in the dictionary; if the match succeeds, the matching field is cut off as a word; if the match fails, the last character of the matching field is removed and the remaining string is used as the new matching field and matched again; third, the above process is repeated until all words have been segmented;

step 3.2 reverse maximum matching word segmentation: this algorithm is the mirror image of forward maximum matching, that is, it scans from the right end of the character string and takes out substrings to match against the dictionary; the reverse maximum matching algorithm has three main steps: first, for the description of the i-th complaint (i = 1, 2, ..., n), max characters are taken from right to left as the matching field, where max is the length of the longest entry in the dictionary; second, the matching field is looked up in the dictionary; if the match succeeds, the matching field is cut off as a word; if the match fails, the first (leftmost) character of the matching field is removed and the remaining string is used as the new matching field and matched again; third, the above process is repeated until the length of the string to be segmented is 0, that is, all words have been segmented;

step 3.3 forward and reverse result matching: bidirectional maximum matching compares the segmentation obtained by forward maximum matching with that obtained by reverse maximum matching to decide which segmentation to use; the comparison has two main steps: first, if the forward and reverse segmentations contain different numbers of words, the one with fewer words is selected; second, if they contain the same number of words, then if the two segmentations are identical either may be selected, and if they differ the one containing fewer single-character words is selected.
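The three sub-steps can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function names and the toy dictionary are hypothetical, a single character that is not in the dictionary is assumed to be emitted as a word of its own, and the final tie-break is assumed to prefer the segmentation with fewer single-character words.

```python
def forward_max_match(text, dictionary):
    """Scan left to right, always trying the longest candidate first."""
    max_len = max(map(len, dictionary))
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted even if it is not in the dictionary.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary):
    """Scan right to left, dropping the leftmost character on a failed match."""
    max_len = max(map(len, dictionary))
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def count_single_chars(words):
    return sum(1 for w in words if len(w) == 1)


def bidirectional_max_match(text, dictionary):
    fwd = forward_max_match(text, dictionary)
    bwd = backward_max_match(text, dictionary)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    if fwd == bwd:
        return fwd
    return fwd if count_single_chars(fwd) < count_single_chars(bwd) else bwd


toy_dictionary = {"研究", "研究生", "生命", "命", "的", "起源"}
print(bidirectional_max_match("研究生命的起源", toy_dictionary))
# ['研究', '生命', '的', '起源']
```

In this example the forward pass greedily takes "研究生" and is left with the single characters "命" and "的", while the reverse pass leaves only "的" as a single character, so the reverse result is chosen.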

Preferably, step 4 specifically comprises the following steps:

step 4.1 word frequency statistical feature set: the complaint information segmentation set is deduplicated, that is, each word is retained only once so that the retained words are pairwise distinct, and the result is recorded as the unique vocabulary set; suppose that the unique vocabulary set has t elements, denoted $s_1, s_2, \ldots, s_t$; for each complaint's segmentation in the complaint information segmentation set, the number of occurrences of each element of the unique vocabulary set is counted; for the segmentation of the i-th complaint, the corresponding word frequency statistical feature set consists of t counts, the j-th of which is the number of times the segmentation contains the word $s_j$, where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, t$; these per-complaint sets together form the word frequency statistical feature set of the complaint information segmentation set;

step 4.2 word frequency proportion feature set: for each complaint's segmentation in the complaint information segmentation set, the proportion of each element of the unique vocabulary set is computed; the proportion is defined as the number of occurrences of the element in the segmentation divided by the length of the segmentation, that is, the number of words the segmentation contains; for the segmentation of the i-th complaint, the corresponding word frequency proportion feature set consists of t proportions, the j-th of which is the proportion of the word $s_j$ in the segmentation, where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, t$; these per-complaint sets together form the word frequency proportion feature set of the complaint information segmentation set;
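A minimal Python sketch of the two feature sets of steps 4.1 and 4.2 follows. The function name `word_frequency_features` and the sample segmentations are hypothetical; the input is assumed to be one list of words per complaint, as produced by the word segmentation step.

```python
from collections import Counter


def word_frequency_features(segmented):
    """segmented: list of word lists, one per complaint.

    Returns the unique vocabulary, the per-complaint count rows and the
    per-complaint proportion rows (count divided by number of words).
    """
    # Deduplicate while keeping first-seen order, so columns are stable.
    vocab = list(dict.fromkeys(w for words in segmented for w in words))
    counts, ratios = [], []
    for words in segmented:
        tally = Counter(words)
        row = [tally[s] for s in vocab]
        counts.append(row)
        ratios.append([c / len(words) if words else 0.0 for c in row])
    return vocab, counts, ratios


segmented = [["恶意", "投诉", "投诉"], ["投诉", "退款"]]
vocab, counts, ratios = word_frequency_features(segmented)
# Corpus-wide totals per vocabulary word, the input a word cloud would need.
totals = [sum(column) for column in zip(*counts)]
print(vocab, counts, totals)
```

Each complaint contributes one row of t counts and one row of t proportions, which is where the 2t word frequency columns per complaint in the clustering feature set come from.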

step 4.3 word frequency display: a word cloud visually highlights the words that occur most frequently in a text, filtering out a large amount of low-frequency, low-quality text information so that a reader can grasp the gist of the text at a glance; based on the full complaint information, the number of occurrences of each element of the unique vocabulary set is counted; let the number of occurrences of the j-th element $s_j$ in the full complaint information be $p_j$, $j = 1, 2, \ldots, t$, and record these counts as the word frequency set of the full complaint information; the word cloud can then be constructed directly from this set using the Python word cloud display package WORDCOUNT.

Preferably, step 5 specifically comprises the following steps:

step 5.1 initial core sample labeling: first, the two core parameters eps and min_samples of the DBSCAN algorithm are set, where eps is the distance threshold that defines a neighborhood and takes a floating-point value, and min_samples is the threshold on the number of samples in the eps-neighborhood required for a sample point to become a core object and takes a positive integer value; then a point is selected at random and all points at a distance less than or equal to eps from it are found;

step 5.2 cluster label allocation: if the number of data points within distance eps of the starting point is less than min_samples, the point is marked as noise; if the number of data points within distance eps is greater than min_samples, the point is marked as a core sample and assigned a new cluster label; all neighbors of the point within distance eps are then visited; if a neighbor has not yet been assigned a cluster, it is assigned the cluster label just created; if a neighbor is a core sample, its neighbors are visited in turn, and so on; the cluster grows until there are no more core samples within distance eps of the cluster;

step 5.3 convergence training: another point that has not yet been visited is selected, and the initial core sample labeling and cluster label allocation are repeated until all points have been labeled;

step 5.4 model result output: the DBSCAN model outputs the label of the cluster to which each point belongs; for the clustering result $y_i$ of the i-th complaint, $y_i = -1$ or $y_i$ is a positive integer, where -1 denotes a noise point and a positive integer is the label of the cluster to which the point belongs, $i = 1, 2, \ldots, n$.
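The procedure of steps 5.1 to 5.4 can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the patented implementation: points are visited in index order rather than at random, Euclidean distance is assumed, a point counts as its own neighbor, and a point with exactly min_samples neighbors is treated as a core sample (the text above leaves the equality case open). The function name `dbscan` and the toy points are hypothetical.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Return one label per point: -1 for noise, 1, 2, ... for clusters."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # All points within distance eps of point i, including i itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = NOISE  # may later be claimed by a cluster as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                queue.extend(j_nbrs)  # j is a core sample, so keep growing
    return labels


toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(toy_points, eps=1.5, min_samples=3))
# [1, 1, 1, 2, 2, 2, -1]
```

In practice each point would be the clustering feature vector of one complaint, and a library implementation with a spatial index would normally replace the quadratic neighbor search used here.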

Preferably, step 6 specifically comprises the following steps:

step 6.1 cluster sampling: the maximum value over all elements of the output vector Y of the DBSCAN algorithm is computed and recorded as d; the clustering result of the DBSCAN algorithm then has d classes; assume that the cluster sample label of the r-th class is $M_r$ and that it contains $u^r$ elements, where $n = u^1 + u^2 + \ldots + u^d$ and $r = 1, 2, \ldots, d$; clustering the complaint information set $C = \{C^1, C^2, \ldots, C^n\}$ thus yields the complaint information cluster set; a sampling proportion h with 0 < h < 1 is specified, and stratified sampling is performed on the complaint information cluster set to form the complaint information cluster sampling set, which contains nh elements;

step 6.2 manual labeling: each complaint in the complaint information cluster sampling set is labeled manually, where 1 represents a malicious complaint and 0 represents a general complaint; the labeled set is recorded as the complaint information cluster sampling label set, each element of which takes the value 1 or 0 for every cluster $r = 1, 2, \ldots, d$ and every complaint sampled from it; the proportion of malicious complaints in each cluster of the label set is then computed, that is, for the r-th cluster $M_r$ the number of sampled complaints labeled malicious is counted and divided by the number of complaints sampled from that cluster to give the proportion of malicious complaints in the cluster; the resulting proportions are recorded as the complaint information malicious complaint proportion set;

step 6.3 marking the malicious complaint cluster: the maximum value in the complaint information malicious complaint proportion set is computed; the cluster corresponding to this maximum, denoted $M_{et}$, is the malicious complaint cluster, and all complaints in it are malicious complaints; the elements of the remaining clusters are merged into a normal complaint cluster, denoted $M_{ot}$; the classification result of the malicious complaint model for the complaint information set $C = \{C^1, C^2, \ldots, C^n\}$ therefore consists of $M_{et}$ and $M_{ot}$, where $u^{et} + u^{ot} = n$; at this point all malicious complaints have been identified.
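Steps 6.1 to 6.3 can be sketched in Python as follows. The function names `stratified_sample` and `malicious_cluster` are hypothetical, at least one complaint is drawn from every cluster as an added assumption, noise points (label -1) are skipped, and the manual labels are represented by a plain dictionary standing in for the human reviewer.

```python
import random
from collections import defaultdict


def stratified_sample(labels, h, seed=0):
    """Sample a fraction h of complaint indices from every cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:  # noise points belong to no cluster
            by_cluster[label].append(idx)
    return {
        r: rng.sample(members, max(1, round(h * len(members))))
        for r, members in by_cluster.items()
    }


def malicious_cluster(sample, marks):
    """marks maps a sampled complaint index to 1 (malicious) or 0 (general).

    Returns the cluster with the highest malicious proportion and all proportions.
    """
    ratios = {
        r: sum(marks[i] for i in idxs) / len(idxs) for r, idxs in sample.items()
    }
    return max(ratios, key=ratios.get), ratios


cluster_labels = [1, 1, 1, 1, 2, 2, 2, 2, -1]
sample = stratified_sample(cluster_labels, h=0.5)
# Stand-in for manual review: complaints 0 to 3 are malicious, 4 to 7 are not.
manual_marks = {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0}
best, ratios = malicious_cluster(sample, manual_marks)
print(best, ratios)
```

All complaints whose cluster label equals `best` would then be treated as malicious, and the remaining clusters merged into the normal complaint cluster.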

In order to solve the above problems, the invention also discloses a malicious complaint identification system based on text clustering, which comprises a complaint website complaint information crawler module, a custom text feature processing module, a bidirectional matching word segmentation module, a word frequency feature set and visualization module, a density clustering module based on the DBSCAN algorithm, and a malicious complaint cluster determination module;

the complaint website complaint information crawler module is used for specifying a complaint website and topic parameters through interface configuration, with the background collecting complaint content meeting the configured conditions through crawler technology;

the custom text feature processing module is used for storing the crawled complaint information in a relational database and performing custom text feature processing; suppose that n complaints are collected; the primary key field is named ID and is defined as an auto-increment primary key taking the values 1, 2, ..., n; the complaint information set is denoted as $C = \{C^1, C^2, \ldots, C^n\}$, where $C^i$ denotes the content of the i-th complaint, $i = 1, 2, \ldots, n$; assume that the feature set produced by text processing is $X_1$ and that the custom text features comprise m features;

the bidirectional matching word segmentation module is used for segmenting each complaint content using a Chinese word segmentation method, the complaint descriptions being assumed to be mainly in Chinese; Chinese word segmentation cuts each complaint description into individual words; the complaint description content is segmented by the bidirectional matching word segmentation method; the bidirectional maximum matching method is a dictionary-based word segmentation method, which matches the Chinese character string to be segmented against the entries of a dictionary according to a certain strategy, and the match succeeds if a given character string is found in the dictionary; for the n collected complaints, suppose that the i-th complaint is segmented into $k^i$ words, which form a vector, where $i = 1, 2, \ldots, n$;

the word frequency feature set and visualization module is used for performing word frequency statistics on the word segmentation result, that is, counting the number of occurrences of each word in the full complaint information and forming a word cloud, and for counting, for each piece of complaint information, the number of occurrences of each word and its proportion in the content of that complaint information, forming a word frequency statistical feature set and a word frequency proportion feature set;

the density clustering module based on the DBSCAN algorithm is used for merging the processed text feature set $X_1$, the word frequency statistical feature set and the word frequency proportion feature set, and using the merged feature set as the input of the DBSCAN algorithm;

the malicious complaint cluster determination module is used for performing stratified sampling on the clustering result of the DBSCAN algorithm, manually labeling each sampled complaint as malicious or not, determining the label of the malicious complaint cluster according to the proportion of malicious complaints in each cluster, and correcting the clustering result of the malicious complaint model accordingly.

Preferably, the complaint website complaint information crawler module comprises two submodules: topic parameter configuration and acquisition website configuration;

the topic parameter configuration submodule mainly sets the crawl conditions and comprises four parameters: monitoring topic, topic code, acquisition frequency and keyword configuration; the monitoring topic is the topic content the crawler is required to target; the topic code is the unique primary key of the monitoring topic and consists of digits, letters and underscores; the acquisition frequency specifies when the crawler collects, and the interval between collections must be set; the keyword configuration specifies the filter condition for crawled content, and only content meeting the condition is collected;

the acquisition website configuration submodule is used for setting the websites to be crawled through the interface, specifying a website name and a website address; multiple acquisition sources may be added at the same time.

Preferably, the custom text feature processing module comprises four submodules: data storage, basic attributes, statistical features and proportional features;

the data storage submodule is used for storing the complaint information from the crawler in the relational database MySQL and creating a complaint information data table with the complaint number as the primary key;

the basic attribute submodule is used for extracting the basic attribute fields associated with the complaint information;

the statistical feature submodule is used for counting the text items in the complaint information content that meet certain conditions;

the proportional feature submodule is used for computing the proportions that the text items meeting certain conditions make up of the complaint information content.

Due to the adoption of the technical scheme, the invention has the following beneficial effects:

1. The system applies natural language processing and unsupervised learning together and builds a six-stage closed-loop process consisting of complaint website complaint information crawling, custom text feature processing, bidirectional matching word segmentation, word frequency feature sets and visualization, density clustering based on the DBSCAN algorithm, and malicious complaint cluster determination, so as to identify malicious complaint content automatically.

2. The custom text feature set, the word frequency statistical feature set and the word frequency proportion feature set are combined as the feature input of the DBSCAN density clustering algorithm, so that the salient information in the text content is exploited as fully as possible, which improves the accuracy of the malicious complaint identification model.

3. The method uses the DBSCAN density clustering algorithm to cluster the complaint information, so that the number of clusters does not need to be set in advance, clusters of complex shape can be separated, and points that belong to no cluster can be found, which improves the clustering effect; in addition, labeling by stratified sampling of the clusters calibrates the model result.

Drawings

FIG. 1 is a block diagram of a text clustering based malicious complaint identification system;

FIG. 2 is a crawler configuration system diagram.

Detailed Description

The embodiments of the invention will be described in detail below with reference to the drawings, but the invention can be implemented in many different ways as defined and covered by the claims.

To illustrate the method more concretely, the following provides a case of identifying users who file malicious complaints against "XW Bank" on "Ju Tousu" (聚投诉).

Step 1: Complaint website complaint information crawler. The complaint website and topic parameters are specified through interface configuration, and the background collects complaint content meeting the configured conditions through crawler technology.

Step 1.1: Topic parameter configuration. The topic parameter module mainly sets the crawl conditions and comprises four parameters: monitoring topic, topic code, acquisition frequency and keyword configuration. The monitoring topic is set to "identification of malicious complaints against XW Bank"; the topic code, which is the unique primary key of the monitoring topic, is set to "XWBANK_JTS_EYTS"; the acquisition frequency specifies when the crawler collects and is configured as once every 5 minutes; the keyword configuration specifies the filter condition for crawled content and is configured as "XW Bank && complaint".

Step 1.2: Acquisition website configuration. The acquisition website module sets the websites to be crawled through the interface; here the website name is configured as "Ju Tousu" (聚投诉) and the website address as "https://ts.21cn.com/".
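The configuration of steps 1.1 and 1.2 could be represented as in the following Python sketch. The class `TopicConfig`, its field names and the `matches` helper are hypothetical, and reading "&&" as "all keywords must appear" is an assumption; the values are those of the example above, with the keyword string written in English.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TopicConfig:
    monitoring_topic: str
    topic_code: str          # unique key: digits, letters and underscores only
    interval_minutes: int    # acquisition frequency
    keywords: str            # keywords joined by "&&"
    sites: dict = field(default_factory=dict)  # website name -> website address

    def __post_init__(self):
        if not re.fullmatch(r"[A-Za-z0-9_]+", self.topic_code):
            raise ValueError("topic code must consist of digits, letters and underscores")
        if self.interval_minutes <= 0:
            raise ValueError("acquisition interval must be positive")

    def matches(self, text):
        # Assumed semantics: every keyword joined by "&&" must occur in the text.
        return all(k.strip() in text for k in self.keywords.split("&&"))


config = TopicConfig(
    monitoring_topic="identification of malicious complaints against XW Bank",
    topic_code="XWBANK_JTS_EYTS",
    interval_minutes=5,
    keywords="XW Bank&&complaint",
    sites={"Ju Tousu": "https://ts.21cn.com/"},
)
print(config.matches("A complaint about XW Bank loan fees"))
```

A crawler would poll each configured site at the given interval and keep only the items for which `matches` returns true.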

Step 2: Custom text feature processing. The crawled complaint information is stored in a relational database and custom text feature processing is performed. Suppose that n complaints are collected; the primary key field is named ID and is an auto-increment primary key taking the values 1, 2, ..., n. The complaint information set is denoted as $C = \{C^1, C^2, \ldots, C^n\}$, where $C^i$ denotes the content of the i-th complaint, $i = 1, 2, \ldots, n$. Assume that the feature set produced by text processing is $X_1$ and that the custom text features comprise m features.

The module comprises 4 sub-modules of data storage, basic attribute, statistical characteristic and proportion characteristic.

Step 2.1: Data storage. The complaint information from the crawler is stored in the relational database MySQL, and a complaint information data table is created with the complaint number as the primary key; its fields include complaint time, complainant name, crawl time, complaint object, complaint issue, related parties, amount involved, complaint description and the like.

Step 2.2: Basic attributes. Basic attributes are the basic attribute fields associated with the complaint information, such as complaint time, amount involved, complaint demands, complaint progress and the like.

Step 2.3: Statistical features. Statistical features are counts of the text items in the complaint information content that meet certain conditions, for example: the total number of characters in the complaint description, the number of occurrences of "complaint" in the complaint description, the number of occurrences of "report" in the complaint description, the number of complaints whose progress is "following up", the number of replies to the complaint, and the like.

Step 2.4: Proportional features. Proportional features are the proportions that the text items meeting certain conditions make up of the complaint information content, for example: the proportion of the full text of the complaint description taken up by "complaint", the proportion of the full text of the complaint description taken up by "report", and the like.

Step 3: Bidirectional matching word segmentation. Chinese word segmentation cuts each complaint description into individual words; a word is the smallest meaningful language unit that can be used independently. In order to mine the relationships between pieces of complaint information in depth, the complaint description content is segmented by the bidirectional matching word segmentation method. The bidirectional maximum matching method is a dictionary-based word segmentation method, which matches the Chinese character string to be segmented against the entries of a dictionary according to a certain strategy; the match succeeds if a given character string is found in the dictionary. For the n collected complaints, suppose that the i-th complaint is segmented into $k^i$ words, which form a vector, where $i = 1, 2, \ldots, n$.

Step 3.1: forward maximum matching participles. The forward direction is to scan the left side of the character string in the forward direction, and to extract the sub-string to match with the dictionary. The forward maximum matching algorithm is mainly divided into three steps: first, for the ith complaint description (i ═ 1, 2.. said., n), max characters in the complaint are taken from left to right as matching fields, and max is the longest entry number in the dictionary base. Second, the segmented matching field is looked up in a dictionary base and matched. If the matching is successful, the matching field is segmented as a word. If the matching is unsuccessful, the last word of the matching field is removed, and the rest character string is used as a new matching field for matching again. Third, the above process is repeated until all words are segmented.

Step 3.2: and (4) carrying out reverse maximum matching word segmentation. The algorithm is the reverse thinking of a forward maximum match. The reverse scanning is to scan the right side of the character string reversely, and then the sub-string is taken out to match with the dictionary. The reverse maximum matching algorithm is mainly divided into three steps: first, for the ith complaint description (i ═ 1, 2.... times.n), max characters in the complaint are taken from right to left as matching fields, and max is the longest entry number in the dictionary base. Second, the segmented matching field is looked up in a dictionary base and matched. If the matching is successful, the matching field is segmented as a word. If the matching is unsuccessful, the most previous word of the matching field is removed, and the rest character string is used as a new matching field for matching again. Thirdly, the above process is repeated until the length of the character string to be cut is 0, that is, all the words are cut.

Step 3.3: Matching forward and reverse results. Bidirectional maximum matching compares the segmentation obtained by forward maximum matching with the one obtained by reverse maximum matching in order to decide which segmentation to use. Studies by Sun M.S. and Benjamin K.T. (1995) showed that for about 90% of Chinese sentences the forward and reverse maximum matching results coincide completely and are correct, while for the remaining 10% the two results differ. The matching of forward and reverse results has two cases. First, if the forward and reverse segmentations contain different numbers of words, the one with fewer words is selected. Second, if they contain the same number of words: when the two segmentations are identical, either may be selected; when they differ, the one with fewer single-character words is selected.
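Steps 3.1 to 3.3 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names and the toy dictionary are hypothetical, a single character that is not in the dictionary is emitted as a one-character word (the text does not say how such characters are handled), and the reverse result is preferred when both candidates tie on single-character words.

```python
def forward_max_match(text, dictionary, max_len):
    """Step 3.1: scan left to right, trying the longest candidate first."""
    words, i = [], 0
    while i < len(text):
        # Start with up to max_len characters; drop the last character on each miss.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Assumption: an unmatched single character becomes its own word.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len):
    """Step 3.2: scan right to left; drop the first character on each miss."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the string
    return words


def bidirectional_max_match(text, dictionary):
    """Step 3.3: pick between the forward and reverse segmentations."""
    max_len = max(len(w) for w in dictionary)
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd  # fewer words wins
    if fwd == bwd:
        return fwd
    singles_fwd = sum(1 for w in fwd if len(w) == 1)
    singles_bwd = sum(1 for w in bwd if len(w) == 1)
    # Assumption: on a tie the reverse result is returned.
    return fwd if singles_fwd < singles_bwd else bwd


toy_dictionary = {"研究", "研究生", "生命", "的", "起源"}  # hypothetical entries
segmented = bidirectional_max_match("研究生命的起源", toy_dictionary)
print(segmented)  # ['研究', '生命', '的', '起源']
```

In this example the forward pass yields 研究生 / 命 / 的 / 起源 and the reverse pass yields 研究 / 生命 / 的 / 起源; both have four words, so the reverse result, which has fewer single-character words, is chosen.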

Step 4: Word frequency feature sets and visualization. Word frequency statistics are computed on the segmentation results of step 3: the number of occurrences of each word across the full set of complaint information is counted and a word cloud is drawn, so that the distribution of complaint content can be inspected visually. For each complaint, the number of occurrences of each word and the proportion of that word within the complaint content are also counted, forming the word frequency statistical feature set and the word frequency proportion feature set.

Step 4.1: Word frequency statistical feature set. The segmented complaint information sets are deduplicated so that each word is retained only once and the retained words are pairwise distinct; the result is the unique vocabulary set. Suppose the unique vocabulary set has t elements, denoted s_1, s_2, ..., s_t. For each segmented complaint, the number of occurrences of each element of the unique vocabulary set is counted. For the i-th segmented complaint, the corresponding word frequency statistical feature set has t entries, the j-th of which is the number of times that complaint contains the word s_j, where j = 1, 2, ..., t and i = 1, 2, ..., n. Together these entries form the word frequency statistical feature set corresponding to the segmented complaint information sets.
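The counting in step 4.1 can be illustrated with a short Python sketch. The function names and the sample words are hypothetical, and the vocabulary is kept in first-seen order, which the text does not prescribe.

```python
from collections import Counter


def build_vocabulary(segmented):
    """Deduplicate all words; returns s_1..s_t in first-seen order (an assumption)."""
    vocab, seen = [], set()
    for words in segmented:
        for w in words:
            if w not in seen:
                seen.add(w)
                vocab.append(w)
    return vocab


def count_features(segmented, vocab):
    """Row i, column j: how many times complaint i contains word s_j."""
    rows = []
    for words in segmented:
        counts = Counter(words)          # missing words count as 0
        rows.append([counts[w] for w in vocab])
    return rows


# Hypothetical segmented complaints (English tokens stand in for Chinese words).
segmented_complaints = [["loan", "fee", "fee"], ["fee", "refund"]]
vocabulary = build_vocabulary(segmented_complaints)
count_rows = count_features(segmented_complaints, vocabulary)
print(vocabulary)   # ['loan', 'fee', 'refund']
print(count_rows)   # [[1, 2, 0], [0, 1, 1]]
```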

Step 4.2: Word frequency proportion feature set. For each segmented complaint, the proportion of each element of the unique vocabulary set is computed. The proportion is defined as the number of times the element appears in the segmented complaint divided by the length of that segmented complaint (i.e., the number of words it contains). For the i-th segmented complaint, the corresponding word frequency proportion feature set has t entries, the j-th of which is the proportion of the word s_j in that complaint, where j = 1, 2, ..., t and i = 1, 2, ..., n. Together these entries form the word frequency proportion feature set corresponding to the segmented complaint information sets.
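As a rough sketch of step 4.2, the proportion features divide each count by the number of words in the complaint. The function name and sample data are hypothetical; returning 0.0 for an empty complaint is an added safeguard that the text does not discuss.

```python
from collections import Counter


def proportion_features(segmented, vocab):
    """Row i, column j: occurrences of s_j in complaint i divided by its word count."""
    rows = []
    for words in segmented:
        counts = Counter(words)
        length = len(words)
        # Assumption: an empty complaint gets all-zero proportions.
        rows.append([counts[w] / length if length else 0.0 for w in vocab])
    return rows


proportion_rows = proportion_features(
    [["loan", "fee", "fee", "fee"]],      # one hypothetical segmented complaint
    ["loan", "fee", "refund"],            # unique vocabulary s_1..s_t
)
print(proportion_rows)  # [[0.25, 0.75, 0.0]]
```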

Step 4.3: Word frequency display. A word cloud visually highlights the words that occur most frequently in a text, filtering out a large amount of low-frequency, low-quality text so that a reader can grasp the subject matter of the text at a glance. Based on the full set of complaint information, the number of occurrences of each element of the unique vocabulary set is counted: for the j-th element s_j, let its number of occurrences in the full complaint information be p_j, j = 1, 2, ..., t. The word frequency set of the full complaint information is then {p_1, p_2, ..., p_t}, and the word cloud is drawn from it.
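The corpus-wide counts p_j that feed the word cloud can be computed as in the sketch below. The function name and sample data are hypothetical, and the rendering of the word cloud image itself is left out because the text does not name a drawing tool.

```python
from collections import Counter
from itertools import chain


def corpus_word_frequency(segmented):
    """Count every word across all complaints; the value for s_j is p_j."""
    return Counter(chain.from_iterable(segmented))


word_freq = corpus_word_frequency([["loan", "fee", "fee"], ["fee", "refund"]])
print(word_freq["fee"])          # 3
print(word_freq.most_common(1))  # [('fee', 3)]
```

The resulting mapping from word to count is the usual input format for word cloud renderers, which scale each word by its frequency.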

Step 5: Density clustering based on the DBSCAN algorithm. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based spatial clustering algorithm. It divides regions of sufficient density into clusters and can find clusters of arbitrary shape in a spatial database with noise. The processed text feature set X₁, the word frequency statistical feature set, and the word frequency proportion feature set are merged, and the merged feature set is used as the input of the DBSCAN algorithm to classify each complaint. Let the classification variable output by the model be Y and the clustering result of the i-th complaint be y_i, so that Y = {y_1, y_2, ..., y_n}, i = 1, 2, ..., n.
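One simple reading of the merge is a per-complaint concatenation of the three feature rows, sketched below. The function name and the sample values are hypothetical, and the text does not state whether the features are rescaled before clustering, so no scaling is applied here.

```python
def merge_features(text_features, count_features, proportion_features):
    """Concatenate the three feature rows of each complaint into one input vector."""
    if not (len(text_features) == len(count_features) == len(proportion_features)):
        raise ValueError("all three feature sets must describe the same n complaints")
    return [
        list(a) + list(b) + list(c)
        for a, b, c in zip(text_features, count_features, proportion_features)
    ]


merged = merge_features(
    [[12, 1]],              # hypothetical custom text features (X1) for one complaint
    [[1, 2, 0]],            # word frequency statistical features
    [[0.25, 0.5, 0.0]],     # word frequency proportion features
)
print(merged)  # [[12, 1, 1, 2, 0, 0.25, 0.5, 0.0]]
```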

Step 5.1: Initial core sample labeling. First, the two core parameters of the DBSCAN algorithm, eps and min_samples, are set. eps is the distance threshold that defines a neighborhood and is set to 0.5; min_samples is the minimum number of samples required in the eps-neighborhood for a sample point to become a core object and is set to 5. Then a point is selected arbitrarily and all points at a distance less than or equal to eps from it are found.

Step 5.2: Cluster label assignment. If the number of data points within distance eps of the starting point is less than min_samples, the point is marked as noise. If the number is at least min_samples, the point is marked as a core sample and assigned a new cluster label. All neighbors of the point (within distance eps) are then visited. If a neighbor has not yet been assigned a cluster, it receives the new cluster label just created. If a neighbor is itself a core sample, its neighbors are visited in turn, and so on. The cluster grows until there are no more core samples within distance eps of the cluster.

Step 5.3: Convergence training. Another point that has not yet been visited is selected, and the initial core sample labeling and cluster label assignment are repeated until all points have been labeled.

Step 5.4: Model result output. The DBSCAN model outputs the label of the cluster to which each point belongs. For the clustering result y_i of the i-th complaint, y_i = -1 or a positive integer: -1 denotes a noise point, and a positive integer is the label number of the cluster to which the point belongs, where i = 1, 2, ..., n.
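The procedure of steps 5.1 to 5.4 can be sketched in plain Python as below. This is an illustrative re-implementation under stated assumptions, not the code used by the invention: the function names are hypothetical, Euclidean distance is assumed, each point counts as a member of its own neighborhood, and clusters are numbered from 1 so that the output matches the "-1 or a positive integer" convention of step 5.4.

```python
import math

NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points within distance eps of points[idx] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps=0.5, min_samples=5):
    labels = [None] * len(points)      # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1                   # new core sample starts a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # noise reached from a core point joins the cluster
                continue
            if labels[j] is not None:
                continue               # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)   # j is a core sample: keep expanding
    return labels


# Two dense groups of five points each plus one isolated point (made-up data).
demo_points = [
    (0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1), (0.05, 0.05),
    (5, 5), (5, 5.1), (5.1, 5), (5.1, 5.1), (5.05, 5.05),
    (10, 10),
]
demo_labels = dbscan(demo_points, eps=0.5, min_samples=5)
print(demo_labels)  # [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, -1]
```

In practice a library implementation would normally be used instead. Note that scikit-learn's `DBSCAN` numbers clusters from 0 rather than 1, so its labels would need to be shifted by one to follow the convention described here.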

Step 6: Determining the malicious complaint cluster. The clustering result of the DBSCAN algorithm is sampled by stratified sampling, the sampled complaints are manually marked as malicious or not, the label of the malicious complaint cluster is confirmed from the proportion of malicious complaints in each cluster, and the clustering result of the malicious complaint model is corrected accordingly.

Step 6.1: Cluster sampling. For the output Y of the DBSCAN algorithm, the maximum over all vector elements is computed and denoted d; the clustering result of the DBSCAN algorithm then has d classes. Let the cluster of the r-th class be M_r, containing u^r elements, where n = u¹ + u² + ... + u^d and r = 1, 2, ..., d. After clustering, the complaint information set C = {C¹, C², ..., Cⁿ} is thus divided into the clusters M_1, M_2, ..., M_d. A sampling proportion h with 0 < h < 1 is specified, and the clustered complaint information is sampled cluster by cluster at that proportion, so that the sampled set contains nh elements.

Step 6.2: Manual qualification. Each complaint in the sampled cluster sets is marked manually, where 1 denotes a malicious complaint and 0 denotes an ordinary complaint, so each mark equals 1 or 0, with r = 1, 2, ..., d and i = 1, 2, ..., nh. The proportion of malicious complaints in each sampled cluster is then computed: for the r-th cluster M_r, the number of sampled complaints marked as malicious is counted and divided by the number of sampled complaints in that cluster. The resulting proportions, one per cluster, form the malicious complaint proportion set.

Step 6.3: Marking the malicious complaint cluster. The maximum value in the malicious complaint proportion set is computed. The cluster corresponding to this maximum, denoted M_et, is the malicious complaint cluster, and all complaints in it are treated as malicious complaints; the elements of the remaining clusters are combined into the normal complaint cluster, denoted M_ot. The classification result of the malicious complaint model for the complaint information set C = {C¹, C², ..., Cⁿ} therefore consists of M_et and M_ot, where u^et + u^ot = n. At this point all malicious complaints have been identified.
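Steps 6.1 to 6.3 can be sketched as follows. The function names and data are hypothetical, the manual marks are supplied as a plain list, at least one complaint is drawn from every cluster, and noise points (label -1) are left out of the sampling and treated as ordinary complaints; the text itself does not say how noise points are handled.

```python
import random
from collections import defaultdict


def stratified_sample(labels, h, rng):
    """Step 6.1: draw a fraction h of the complaint indices from each cluster."""
    by_cluster = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:                      # assumption: noise is not sampled
            by_cluster[label].append(idx)
    return {
        label: rng.sample(members, max(1, round(h * len(members))))
        for label, members in by_cluster.items()
    }


def malicious_cluster(sample, marks):
    """Steps 6.2 and 6.3: per-cluster malicious ratio, then the cluster with the maximum."""
    ratios = {
        label: sum(marks[i] for i in idxs) / len(idxs)
        for label, idxs in sample.items()
    }
    best = max(ratios, key=ratios.get)
    return best, ratios


def classify(labels, best):
    """1 for every complaint in the malicious cluster, 0 for all others."""
    return [1 if label == best else 0 for label in labels]


cluster_labels = [1, 1, 1, 1, 2, 2, 2, 2, -1]   # made-up DBSCAN output
manual_marks = [1, 1, 1, 1, 0, 0, 0, 0, 0]       # made-up reviewer marks (1 = malicious)
sample = stratified_sample(cluster_labels, h=0.5, rng=random.Random(0))
best, ratios = malicious_cluster(sample, manual_marks)
result = classify(cluster_labels, best)
print(best, ratios)
print(result)
```

Because only the single cluster with the highest sampled proportion is marked malicious, the outcome depends on the sampling proportion h and on the quality of the manual marks.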

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A malicious complaint identification method based on text clustering is characterized by comprising the following steps:

step 2: storing the complaint information collected by the crawler in a relational database and performing custom text feature processing; suppose n complaints are collected; the primary key field is named ID, is defined as an auto-increment key, and takes the values 1, 2, ..., n; the complaint information set is denoted C = {C¹, C², ..., Cⁿ}, where Cⁱ is the content of the i-th complaint, i = 1, 2, ..., n; let the feature set of text processing be X₁, with m custom text features recorded for each complaint, where i = 1, 2, ..., n;

Sum word frequency scaling feature set

step 5: merging the processed text feature set, the word frequency statistical feature set, and the word frequency proportion feature set, and using the merged feature set as the input of the DBSCAN algorithm to classify each complaint; let the classification variable output by the model be Y and the clustering result of the i-th complaint be y_i, so that Y = {y_1, y_2, ..., y_n}, i = 1, 2, ..., n;

2. The method for identifying malicious complaints based on text clustering according to claim 1, wherein step 1 specifically comprises the following steps:

3. The method for identifying malicious complaints based on text clustering according to claim 2, wherein step 2 specifically comprises the following steps:

4. The method for identifying malicious complaints based on text clustering according to claim 3, wherein step 3 specifically comprises the following steps:

step 3.1, forward maximum matching word segmentation: forward matching scans the character string from its left side and extracts substrings to match against the dictionary; the forward maximum matching algorithm has three steps: first, for the i-th complaint description (i = 1, 2, ..., n), max characters are taken from left to right as the matching field, where max is the length of the longest entry in the dictionary; second, the matching field is looked up in the dictionary; if the match succeeds, the matching field is split off as a word; if the match fails, the last character of the matching field is removed and the remaining string is used as the new matching field for another lookup; third, the above process is repeated until all words have been segmented;

step 3.2, reverse maximum matching word segmentation: this algorithm is the mirror image of forward maximum matching, scanning the character string from its right side and extracting substrings to match against the dictionary; the reverse maximum matching algorithm has three steps: first, for the i-th complaint description (i = 1, 2, ..., n), max characters are taken from right to left as the matching field, where max is the length of the longest entry in the dictionary; second, the matching field is looked up in the dictionary; if the match succeeds, the matching field is split off as a word; if the match fails, the first (leftmost) character of the matching field is removed and the remaining string is used as the new matching field for another lookup; third, the above process is repeated until the length of the string still to be segmented is 0, that is, until all words have been segmented;

5. The method for identifying malicious complaints based on text clustering according to claim 4, wherein step 4 specifically comprises the following steps:

suppose the unique vocabulary set has t elements, denoted s_1, s_2, ..., s_t; for each segmented complaint, the number of occurrences of each element of the unique vocabulary set is counted; for the i-th segmented complaint, the corresponding word frequency statistical feature set has t entries, the j-th of which is the number of times that complaint contains the word s_j, j = 1, 2, ..., t, i = 1, 2, ..., n; together these entries form the word frequency statistical feature set corresponding to the segmented complaint information sets;

for each segmented complaint, the proportion of each element of the unique vocabulary set is defined as the number of times the element appears in that segmented complaint divided by the number of words the complaint contains; for the i-th segmented complaint, the corresponding word frequency proportion feature set has t entries, the j-th of which is the proportion of the word s_j in that complaint, j = 1, 2, ..., t, i = 1, 2, ..., n; together these entries form the word frequency proportion feature set corresponding to the segmented complaint information sets;

step 4.3, word frequency display: a word cloud visually highlights the words that occur most frequently in a text, filtering out a large amount of low-frequency, low-quality text so that a reader can grasp the subject matter of the text at a glance; based on the full set of complaint information, the number of occurrences of each element of the unique vocabulary set is counted; for the j-th element s_j, let its number of occurrences in the full complaint information be p_j, j = 1, 2, ..., t; the word frequency set of the full complaint information is then {p_1, p_2, ..., p_t}, and the word cloud is drawn from it.

6. The method for identifying malicious complaints based on text clustering according to claim 5, wherein step 5 specifically comprises the following steps:

step 5.4, model result output: the DBSCAN model outputs the label of the cluster to which each point belongs; for the clustering result y_i of the i-th complaint, y_i = -1 or a positive integer, where -1 denotes a noise point and a positive integer is the label number of the cluster to which the point belongs, i = 1, 2, ..., n.

7. The method for identifying malicious complaints based on text clustering according to claim 6, wherein step 6 specifically comprises the following steps:

step 6.1, cluster sampling: the maximum over all vector elements of the output Y of the DBSCAN algorithm is computed and denoted d; the clustering result of the DBSCAN algorithm then has d classes; let the cluster of the r-th class be M_r, containing u^r elements, where n = u¹ + u² + ... + u^d and r = 1, 2, ..., d; after clustering, the complaint information set C = {C¹, C², ..., Cⁿ} is divided into the clusters M_1, M_2, ..., M_d; a sampling proportion h with 0 < h < 1 is specified, and the clustered complaint information is sampled cluster by cluster at that proportion, so that the sampled set contains nh elements;

step 6.2, manual qualification: each complaint in the sampled cluster sets is marked manually, where 1 denotes a malicious complaint and 0 denotes an ordinary complaint, so each mark equals 1 or 0, r = 1, 2, ..., d; i = 1, 2, ..., nh; the proportion of malicious complaints in each sampled cluster is computed, and the resulting proportions, one per cluster, form the malicious complaint proportion set;

step 6.3, marking the malicious complaint cluster: the maximum value in the malicious complaint proportion set is computed; the cluster corresponding to this maximum, denoted M_et, is the malicious complaint cluster, and all complaints in it are treated as malicious complaints; the elements of the remaining clusters are combined into the normal complaint cluster, denoted M_ot; the classification result of the malicious complaint model for the complaint information set C = {C¹, C², ..., Cⁿ} therefore consists of M_et and M_ot, where u^et + u^ot = n; at this point all malicious complaints have been identified.

8. A system for the malicious complaint identification method based on text clustering according to claim 1, wherein the system comprises: a complaint-website complaint information crawler module, a custom text feature processing module, a bidirectional matching word segmentation module, a word frequency feature set and visualization module, a density clustering module based on the DBSCAN algorithm, and a malicious complaint cluster determination module; the complaint website and topic parameters are specified through interface configuration, and the background collects complaint content meeting the configured conditions by crawler technology;

the custom text feature processing module is used for storing the complaint information collected by the crawler in a relational database and performing custom text feature processing; suppose n complaints are collected; the primary key field is named ID, is defined as an auto-increment key, and takes the values 1, 2, ..., n; the complaint information set is denoted C = {C¹, C², ..., Cⁿ}, where Cⁱ is the content of the i-th complaint, i = 1, 2, ..., n; let the feature set of text processing be X₁, with m custom text features recorded for each complaint;

the complaint information set C is segmented into words, forming the segmented complaint information set, where i = 1, 2, ..., n;

Sum word frequency scaling feature set

Sum word frequency scaling feature set

Will be provided with

9. The malicious complaint identification system based on text clustering according to claim 8, wherein the complaint-website complaint information crawler module comprises 2 submodules: topic parameter configuration and collection website configuration;

10. The malicious complaint identification system based on text clustering according to claim 9, wherein the custom text feature processing module comprises 4 submodules: data storage, basic attributes, statistical features, and proportional features;