CN113094567A - Malicious complaint identification method and system based on text clustering - Google Patents
Malicious complaint identification method and system based on text clustering
- Publication number
- CN113094567A (application CN202110351440.1A)
- Authority
- CN
- China
- Prior art keywords
- complaint
- word
- information
- matching
- malicious
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of artificial intelligence and software systems, and in particular to a malicious complaint identification method and system based on text clustering. The method comprises the following steps: step 1: crawling complaint information from complaint websites; step 2: custom text feature processing; step 3: bidirectional matching word segmentation; step 4: building and visualizing the word frequency feature sets; step 5: density clustering based on the DBSCAN algorithm; step 6: determining the malicious complaint cluster. The invention has the following beneficial effects: it combines crawler technology, natural language processing and clustering; complaint content is crawled from online complaint channels, an LDA topic model is constructed on that content for natural language processing, all complaint information is classified on that basis, and malicious complaints are finally identified using the density clustering method DBSCAN.
Description
Technical Field
The invention relates to the technical field of artificial intelligence and software systems, in particular to a malicious complaint identification method and system based on text clustering.
Background
In recent years, a "black industry" of malicious complaint agencies has grown up around institutions such as banks, payment companies, cash companies, mutual companies and insurance companies (financial institutions for short). These agencies operate as organized groups with a clear division of labor covering order taking, contract signing, "rights protection" and profit sharing. Financial institutions are overwhelmed: on the one hand, malicious complaints keep increasing; on the other hand, banks do face internal supervision and assessment pressure. The ultimate goal of a malicious complaint is malicious debt evasion. In recent years, debt evasion in the financial field has pushed up the non-performing loan ratio of the financial industry and caused risk to accumulate at some small and medium-sized financial institutions.
Few methods currently exist in the financial industry for identifying malicious complaints. In-depth study of the behavior of malicious complainants shows that they generally use complaint templates provided by black-industry agencies and lodge their complaints in concentrated bursts through online channels or offline regulatory channels. Online complaint channels include Black Cat Complaints, Ju Tousu and the like.
Based on the above, this document proposes a malicious complaint identification method and system based on text clustering. It combines crawler technology, natural language processing and clustering: complaint content is crawled from online complaint channels, an LDA topic model is constructed on that text for natural language processing, all complaint information is classified with the density clustering method DBSCAN, and malicious complaints are finally identified.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a method and a system for identifying malicious complaints based on text clustering, and solves the problem that malicious complaints cannot be identified quickly at present.
In order to solve the problems, the invention discloses a malicious complaint identification method based on text clustering, which comprises the following steps:
step 1: a complaint website and topic parameters are specified through interface configuration, and the back end collects complaint content meeting the configured conditions by means of crawler technology;
step 2: the crawled complaint information is stored in a relational database and custom text feature processing is performed; suppose n complaints are collected; the primary key field is named ID and is defined as an auto-increment key taking the values 1, 2, ..., n; the complaint information set is denoted C = {C_1, C_2, ..., C_n}, where C_i denotes the content of the i-th complaint, i = 1, 2, ..., n; the feature set produced by text processing is denoted X_1 and contains m custom text features;
step 3: the complaint descriptions are assumed to be mainly in Chinese, and each complaint is segmented with a Chinese word segmentation method; Chinese word segmentation cuts each complaint description into individual words; the complaint descriptions are segmented with the bidirectional matching word segmentation method; the bidirectional maximum matching method is a dictionary-based word segmentation method, which matches the Chinese string to be segmented against the entries of a dictionary library according to a given strategy, and a match succeeds if the string is found in the dictionary; for the n collected complaints, suppose the i-th complaint is segmented into k_i words, which form its word vector; for the complaint information set C, the word vectors of all complaints, i = 1, 2, ..., n, form the complaint information segmentation set;
step 4: word frequency statistics are performed on the segmentation result of step 3, that is, the number of occurrences of each word across the full complaint information is counted and a word cloud is formed; the number of occurrences of each word in each complaint, and its proportion within that complaint's content, are counted to form a word frequency statistical feature set and a word frequency proportion feature set;
step 5: the processed text feature set X_1, the word frequency statistical feature set and the word frequency proportion feature set are merged, giving m + 2nt feature variables in total, recorded as the clustering feature set; the clustering feature set is used as the input of the DBSCAN algorithm to classify each complaint; the classification variable output by the model is denoted Y and the clustering result of the i-th complaint is y_i, so that Y = {y_1, y_2, ..., y_n}, i = 1, 2, ..., n;
step 6: stratified sampling is performed on the clustering result of the DBSCAN algorithm; the sampled complaints are manually labeled as malicious or not; the label of the malicious complaint cluster is then confirmed according to the proportion of malicious complaints in each cluster, and the clustering result of the malicious complaint model is corrected accordingly.
Preferably, step 1 specifically comprises the following steps:
step 1.1 topic parameter configuration: the topic parameter module sets the conditions for the crawler and comprises 4 parameters: monitoring topic, topic code, acquisition frequency and keyword configuration; the monitoring topic is the topic content the crawler is told to collect; the topic code is the unique primary key of the monitoring topic and consists of digits, letters and underscores; the acquisition frequency specifies when the crawler collects, and requires the interval between collections to be set; the keyword configuration specifies the filtering condition for crawled content, so that only content meeting the condition is collected;
step 1.2 acquisition website configuration: the acquisition website module sets the websites to be crawled through the interface, and requires a website name and a website address to be specified; multiple acquisition sources may be added at the same time.
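The two configuration steps above can be pictured as a small data model. The following is only a minimal sketch under stated assumptions: the class names `TopicConfig`, `SiteConfig` and `CrawlJob`, the field names, and the reading of the keyword configuration as an "all keywords must appear" filter are hypothetical and are not specified by the patent text.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class TopicConfig:
    """Hypothetical container for the 4 topic parameters of step 1.1."""
    monitoring_topic: str      # topic content the crawler should collect
    topic_code: str            # unique key: digits, letters, underscores only
    interval_minutes: int      # acquisition frequency as an interval
    keywords: List[str]        # assumed semantics: every keyword must appear

    def __post_init__(self) -> None:
        if not re.fullmatch(r"[A-Za-z0-9_]+", self.topic_code):
            raise ValueError("topic_code may contain only letters, digits and underscores")
        if self.interval_minutes <= 0:
            raise ValueError("interval_minutes must be positive")

    def matches(self, text: str) -> bool:
        """Return True when the text satisfies the keyword filter."""
        return all(keyword in text for keyword in self.keywords)


@dataclass
class SiteConfig:
    """Hypothetical container for one acquisition source of step 1.2."""
    name: str
    url: str


@dataclass
class CrawlJob:
    """A topic plus any number of acquisition sources."""
    topic: TopicConfig
    sites: List[SiteConfig] = field(default_factory=list)


job = CrawlJob(
    topic=TopicConfig("demo topic", "DEMO_TOPIC_01", 5, ["bank", "complaint"]),
    sites=[SiteConfig("example site", "https://example.com/")],
)
```

Validating the topic code at construction time keeps malformed keys out of the database, which matters if the code is later used as a primary key.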
Preferably, step 2 specifically comprises the following steps:
step 2.1 data storage: the crawled complaint information is stored in the relational database MySQL, and a complaint information data table is created with the complaint number as the primary key;
step 2.2 basic attributes: basic attributes are the basic attribute fields associated with the complaint information;
step 2.3 statistical features: statistical features count the number of pieces of text in the complaint content that meet given conditions;
step 2.4 proportional features: proportional features compute the proportion of the complaint content taken up by text that meets given conditions.
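As an illustration of the statistical and proportional features in steps 2.3 and 2.4, here is a minimal sketch. It assumes the "condition" is simply the occurrence of a given term in the complaint description, and that the proportion is the share of characters covered by that term; the function name `text_features` and the feature key names are hypothetical.

```python
from typing import Dict, List


def text_features(description: str, terms: List[str]) -> Dict[str, float]:
    """Count occurrences of each term and the share of the text they cover."""
    total_chars = len(description)
    features: Dict[str, float] = {"total_chars": total_chars}
    for term in terms:
        # Statistical feature: number of non-overlapping occurrences of the term.
        count = description.count(term)
        features[f"count_{term}"] = count
        # Proportional feature: characters covered by the term / total characters.
        covered = count * len(term)
        features[f"ratio_{term}"] = covered / total_chars if total_chars else 0.0
    return features


features = text_features("aXbXc", ["X"])
```

An empty description yields a ratio of 0.0 rather than a division error, which is one reasonable convention among several.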
Preferably, step 3 specifically comprises the following steps:
step 3.1 forward maximum matching segmentation: forward scanning starts from the left side of the character string and takes out substrings to match against the dictionary; the forward maximum matching algorithm has three steps: first, for the description of the i-th complaint (i = 1, 2, ..., n), max characters are taken from left to right as the matching field, where max is the length of the longest entry in the dictionary library; second, the matching field is looked up in the dictionary library; if the match succeeds, the matching field is cut off as a word; if the match fails, the last character of the matching field is removed, the remaining string becomes the new matching field, and matching is attempted again; third, the above process is repeated until all words have been segmented;
step 3.2 reverse maximum matching segmentation: this algorithm mirrors forward maximum matching, that is, it scans from the right side of the character string and takes out substrings to match against the dictionary; the reverse maximum matching algorithm has three steps: first, for the description of the i-th complaint (i = 1, 2, ..., n), max characters are taken from right to left as the matching field, where max is the length of the longest entry in the dictionary library; second, the matching field is looked up in the dictionary library; if the match succeeds, the matching field is cut off as a word; if the match fails, the first character of the matching field is removed, the remaining string becomes the new matching field, and matching is attempted again; third, the above process is repeated until the length of the string to be segmented is 0, that is, until all words have been segmented;
step 3.3 forward and reverse result matching: bidirectional maximum matching compares the segmentation obtained by forward maximum matching with the one obtained by reverse maximum matching in order to choose the segmentation to use; the comparison has 2 steps: first, if the two segmentations contain different numbers of words, the one with fewer words is selected; second, if they contain the same number of words, then if the two segmentations are identical either may be selected, and if they differ the one with fewer single-character words is selected.
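The three sub-steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the patent's implementation: the function names are hypothetical, a single character that is not in the dictionary is emitted as a word of its own, and when both segmentations have the same word count but differ, the sketch assumes the common convention of preferring the one with fewer single-character words (falling back to the reverse result on a tie).

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int) -> List[str]:
    """Scan left to right, always trying the longest candidate first."""
    words, start = [], 0
    while start < len(text):
        for size in range(min(max_len, len(text) - start), 0, -1):
            piece = text[start:start + size]
            # A lone character is accepted even if the dictionary lacks it.
            if size == 1 or piece in dictionary:
                words.append(piece)
                start += size
                break
    return words


def backward_max_match(text: str, dictionary: Set[str], max_len: int) -> List[str]:
    """Scan right to left, dropping the leftmost character on a failed match."""
    words, end = [], len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in dictionary:
                words.append(piece)
                end -= size
                break
    words.reverse()  # words were collected from the right end
    return words


def bidirectional_max_match(text: str, dictionary: Set[str]) -> List[str]:
    """Pick between the forward and reverse segmentations."""
    max_len = max((len(word) for word in dictionary), default=1)
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    if fwd == bwd:
        return fwd

    def singles(words: List[str]) -> int:
        return sum(1 for word in words if len(word) == 1)

    # Assumed tie-break: fewer single-character words wins.
    return fwd if singles(fwd) < singles(bwd) else bwd


demo_dictionary = {"研究", "研究生", "生命", "命", "的", "起源"}
segmented = bidirectional_max_match("研究生命的起源", demo_dictionary)
```

In this example the forward pass greedily takes "研究生" and is left with the isolated "命", while the reverse pass recovers "研究 / 生命 / 的 / 起源"; both have four words, so the single-character count decides.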
Preferably, step 4 specifically comprises the following steps:
step 4.1 word frequency statistical feature set: the complaint information segmentation set is de-duplicated, that is, each word is retained only once so that the retained words are pairwise distinct, and the result is recorded as the unique vocabulary set; suppose the vocabulary set contains t elements, denoted s_1, s_2, ..., s_t; for each complaint's segmentation result, the number of occurrences of each vocabulary element is counted; for the i-th complaint, the word frequency statistical feature vector has t components, the j-th of which is the number of times the word s_j occurs in that complaint's segmentation result, j = 1, 2, ..., t, i = 1, 2, ..., n; the vectors of all n complaints together form the word frequency statistical feature set corresponding to the complaint information segmentation set;
step 4.2 word frequency proportion feature set: for each complaint's segmentation result, the proportion of each vocabulary element is computed, defined as the number of occurrences of the element divided by the length of that complaint's segmentation result, that is, the number of words it contains; for the i-th complaint, the word frequency proportion feature vector has t components, the j-th of which is the proportion of the word s_j in that complaint's segmentation result, j = 1, 2, ..., t, i = 1, 2, ..., n; the vectors of all n complaints together form the word frequency proportion feature set corresponding to the complaint information segmentation set;
step 4.3 word frequency display: a word cloud visually highlights the words that occur most frequently in the text, filtering out large amounts of low-frequency, low-quality text so that a reader can grasp the gist of the text at a glance; based on the full complaint information, the number of occurrences of each element of the vocabulary set is counted; let the number of occurrences of the j-th element s_j in the full complaint information be p_j, j = 1, 2, ..., t; the values p_j form the word frequency set of the full complaint information, from which a word cloud can be built directly using the Python word cloud display package WORDCOUNT.
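The feature sets of step 4 amount to two n-by-t matrices plus one global count per vocabulary word. The sketch below shows one way to compute them with the standard library only; the function name `word_frequency_features` and the choice of first-seen order for the vocabulary are assumptions, and rendering the word cloud itself is left out.

```python
from collections import Counter
from typing import Dict, List, Tuple


def word_frequency_features(
    segmented: List[List[str]],
) -> Tuple[List[str], List[List[int]], List[List[float]], Dict[str, int]]:
    """Build the vocabulary, count matrix, proportion matrix and global counts."""
    # Unique vocabulary s_1..s_t, kept in first-seen order (an assumption).
    vocabulary = list(dict.fromkeys(word for words in segmented for word in words))
    counts: List[List[int]] = []
    ratios: List[List[float]] = []
    for words in segmented:
        per_complaint = Counter(words)
        length = len(words)
        # Row i, column j: occurrences of s_j in complaint i.
        counts.append([per_complaint[word] for word in vocabulary])
        # Row i, column j: occurrences of s_j divided by the word count of complaint i.
        ratios.append([per_complaint[word] / length if length else 0.0 for word in vocabulary])
    # p_j: occurrences of s_j across the full complaint information.
    totals = dict(Counter(word for words in segmented for word in words))
    return vocabulary, counts, ratios, totals


vocabulary, counts, ratios, totals = word_frequency_features(
    [["a", "b", "a"], ["b", "c"]]
)
```

The `totals` mapping is the kind of word-to-frequency input that word cloud libraries typically accept.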
Preferably, step 5 specifically comprises the following steps:
step 5.1 initial core sample labeling: the two core parameters of the DBSCAN algorithm, eps and min_samples, are set first; eps is the distance threshold that defines a neighborhood and is a floating-point value; min_samples is the threshold on the number of samples in the eps-neighborhood required for a sample point to become a core object and is a positive integer; a point is then selected at random, and all points whose distance from it is less than or equal to eps are found;
step 5.2 cluster label allocation: if the number of data points within eps of the starting point is less than min_samples, the point is marked as noise; if the number of data points within eps is at least min_samples, the point is marked as a core sample and assigned a new cluster label; all neighbors of the point within distance eps are then visited; if a neighbor has not yet been assigned a cluster, it is assigned the cluster label just created; if a neighbor is itself a core sample, its neighbors are visited in turn, and so on; the cluster grows until there are no more core samples within distance eps of the cluster;
step 5.3 convergence training: another point that has not yet been visited is selected, and the initial core sample labeling and cluster label allocation are repeated until all points have been labeled;
step 5.4 model result output: the DBSCAN model outputs the label of the cluster to which each point belongs; for the clustering result y_i of the i-th complaint, y_i = -1 or a positive integer, where -1 represents a noise point and a positive integer is the label number of the cluster to which the point belongs, i = 1, 2, ..., n.
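The labeling procedure in steps 5.1 to 5.4 can be written out directly. The following is a minimal, dependency-free sketch rather than a production implementation: it uses Euclidean distance, counts a point as part of its own neighborhood, treats "at least min_samples neighbors" as the core condition, and visits points in index order instead of at random. The function name `dbscan` is hypothetical. Labels follow the convention in the text: -1 for noise, positive integers for clusters.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1


def dbscan(points: Sequence[Sequence[float]], eps: float, min_samples: int) -> List[int]:
    """Return one label per point: -1 for noise, 1..d for clusters."""
    n = len(points)
    labels: List[Optional[int]] = [None] * n  # None means "not visited yet"

    def neighbors(i: int) -> List[int]:
        # All points within eps of point i, including i itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seed_neighbors = neighbors(i)
        if len(seed_neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # point i is a core sample: open a new cluster
        labels[i] = cluster
        queue = [j for j in seed_neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but does not expand
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core sample: keep growing
    return [label if label is not None else NOISE for label in labels]


demo_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
demo_labels = dbscan(demo_points, eps=1.5, min_samples=3)
```

The neighbor search here is quadratic in the number of points, which is fine for illustration; a real pipeline would more likely rely on a library implementation with a spatial index.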
Preferably, step 6 specifically comprises the following steps:
step 6.1 cluster sampling: the maximum over all elements of the output Y of the DBSCAN algorithm is computed and denoted d; the clustering result of the DBSCAN algorithm then has d classes; let the sample set of the r-th cluster be M_r, containing u_r elements, where n = u_1 + u_2 + ... + u_d and r = 1, 2, ..., d; after clustering, the complaint information set C = {C_1, C_2, ..., C_n} is thus divided into the complaint information cluster set formed by M_1, M_2, ..., M_d; a sampling proportion h, 0 < h < 1, is specified, and stratified sampling is performed on the complaint information cluster set to form the complaint information cluster sampling set, in which the number of elements is nh;
step 6.2 manual labeling: each complaint in the complaint information cluster sampling set is labeled manually, where 1 represents a malicious complaint and 0 represents a normal complaint; the labeled result is the complaint information cluster sampling label set, in which every label is 1 or 0; the proportion of malicious complaints in each sampled cluster is then computed, that is, for the r-th cluster M_r, r = 1, 2, ..., d, the number of sampled complaints labeled malicious is counted and divided by the number of complaints sampled from that cluster; the resulting proportions form the malicious complaint proportion set;
step 6.3 marking the malicious complaint cluster: the maximum of the malicious complaint proportion set is computed; the cluster attaining this maximum, denoted M_et, is the malicious complaint cluster, and all complaints in it are treated as malicious complaints; the elements of the remaining clusters are merged into a normal complaint cluster, denoted M_ot; the classification result of the malicious complaint model for the complaint information set C = {C_1, C_2, ..., C_n} then consists of M_et and M_ot, whose sizes u_et and u_ot satisfy u_et + u_ot = n; at this point all malicious complaints have been identified.
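Steps 6.1 to 6.3 can be sketched as stratified sampling followed by a per-cluster malicious ratio and an argmax. The sketch rests on assumptions not stated in the text: noise points (label -1) are left out of sampling, at least one complaint is sampled from every cluster, and the manual labels are supplied as a plain list of 0/1 values indexed by complaint. The function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_sample(labels: List[int], h: float, seed: int = 0) -> Dict[int, List[int]]:
    """Sample a fraction h of complaint indices from every cluster."""
    rng = random.Random(seed)
    by_cluster: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        if label != -1:  # assumption: noise points are not sampled
            by_cluster[label].append(index)
    return {
        label: rng.sample(indices, max(1, round(len(indices) * h)))
        for label, indices in by_cluster.items()
    }


def malicious_cluster(
    samples: Dict[int, List[int]], manual_labels: List[int]
) -> Tuple[int, Dict[int, float]]:
    """Return the cluster with the highest share of malicious samples."""
    ratios = {
        label: sum(manual_labels[i] for i in indices) / len(indices)
        for label, indices in samples.items()
    }
    best = max(ratios, key=ratios.get)
    return best, ratios


def classify(labels: List[int], best: int) -> List[int]:
    """1 for complaints in the malicious cluster, 0 for all others."""
    return [1 if label == best else 0 for label in labels]


cluster_labels = [1, 1, 1, 1, 2, 2, 2, 2]
manual = [1, 1, 1, 1, 0, 0, 0, 0]  # illustrative manual labels, not real data
samples = stratified_sample(cluster_labels, h=0.5)
best, ratios = malicious_cluster(samples, manual)
result = classify(cluster_labels, best)
```

Taking only the single highest-ratio cluster mirrors the text; if several clusters turned out to be dominated by malicious complaints, a ratio threshold would be a natural variation.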
In order to solve the above problems, the invention also discloses a malicious complaint identification system based on text clustering, which comprises a complaint website complaint information crawler module, a custom text feature processing module, a bidirectional matching word segmentation module, a word frequency feature set and visualization module, a density clustering module based on the DBSCAN algorithm, and a malicious complaint cluster determination module;
the complaint website complaint information crawler module is used for specifying a complaint website and topic parameters through interface configuration, with the back end collecting complaint content meeting the configured conditions by means of crawler technology;
the custom text feature processing module is used for storing the crawled complaint information in a relational database and performing custom text feature processing; suppose n complaints are collected; the primary key field is named ID and is defined as an auto-increment key taking the values 1, 2, ..., n; the complaint information set is denoted C = {C_1, C_2, ..., C_n}, where C_i denotes the content of the i-th complaint, i = 1, 2, ..., n; the feature set produced by text processing is denoted X_1 and contains m custom text features;
the bidirectional matching word segmentation module is used for segmenting each complaint with a Chinese word segmentation method, the complaint descriptions being assumed to be mainly in Chinese; Chinese word segmentation cuts each complaint description into individual words; the complaint descriptions are segmented with the bidirectional matching word segmentation method; the bidirectional maximum matching method is a dictionary-based word segmentation method, which matches the Chinese string to be segmented against the entries of a dictionary library according to a given strategy, and a match succeeds if the string is found in the dictionary; for the n collected complaints, suppose the i-th complaint is segmented into k_i words, which form its word vector; for the complaint information set C, the word vectors of all complaints, i = 1, 2, ..., n, form the complaint information segmentation set;
the word frequency feature set and visualization module is used for performing word frequency statistics on the segmentation result of step 3, that is, counting the number of occurrences of each word across the full complaint information and forming a word cloud; the number of occurrences of each word in each complaint, and its proportion within that complaint's content, are counted to form a word frequency statistical feature set and a word frequency proportion feature set;
the density clustering module based on the DBSCAN algorithm is used for merging the processed text feature set X_1, the word frequency statistical feature set and the word frequency proportion feature set, giving m + 2nt feature variables in total, recorded as the clustering feature set; the clustering feature set is used as the input of the DBSCAN algorithm to classify each complaint; the classification variable output by the model is denoted Y and the clustering result of the i-th complaint is y_i, so that Y = {y_1, y_2, ..., y_n}, i = 1, 2, ..., n;
the malicious complaint cluster determination module is used for performing stratified sampling on the clustering result of the DBSCAN algorithm, manually labeling the sampled complaints as malicious or not, confirming the label of the malicious complaint cluster according to the proportion of malicious complaints in each cluster, and correcting the clustering result of the malicious complaint model accordingly.
Preferably, the complaint website complaint information crawler module comprises 2 submodules: topic parameter configuration and acquisition website configuration;
the topic parameter configuration submodule is used for setting the conditions for the crawler and comprises 4 parameters: monitoring topic, topic code, acquisition frequency and keyword configuration; the monitoring topic is the topic content the crawler is told to collect; the topic code is the unique primary key of the monitoring topic and consists of digits, letters and underscores; the acquisition frequency specifies when the crawler collects, and requires the interval between collections to be set; the keyword configuration specifies the filtering condition for crawled content, so that only content meeting the condition is collected;
the acquisition website configuration submodule is used for setting the websites to be crawled through the interface, and requires a website name and a website address to be specified; multiple acquisition sources may be added at the same time.
Preferably, the custom text feature processing module comprises 4 submodules: data storage, basic attributes, statistical features and proportional features;
the data storage submodule is used for storing the crawled complaint information in the relational database MySQL and creating a complaint information data table with the complaint number as the primary key;
the basic attribute submodule is used for handling the basic attribute fields associated with the complaint information;
the statistical feature submodule is used for counting the number of pieces of text in the complaint content that meet given conditions;
the proportional feature submodule is used for computing the proportion of the complaint content taken up by text that meets given conditions.
Due to the adoption of the technical scheme, the invention has the following beneficial effects:
1. The system combines natural language processing and unsupervised learning, and builds a closed-loop flow of 6 modules (complaint website complaint information crawling, custom text feature processing, bidirectional matching word segmentation, word frequency feature sets and visualization, density clustering based on the DBSCAN algorithm, and malicious complaint cluster determination) to identify malicious complaint content automatically.
2. The custom text feature set, the word frequency statistical feature set and the word frequency proportion feature set are combined as the input features of the DBSCAN density clustering algorithm, so that the salient information in the text content is exploited as fully as possible and the accuracy of the malicious complaint identification model is improved.
3. The method uses the DBSCAN density clustering algorithm for cluster analysis of the complaint information, so the number of clusters does not need to be set in advance, clusters of complex shape can be separated, and points that belong to no cluster can be found, which improves the clustering result; in addition, labeling by stratified sampling of the clusters is used to calibrate the model result.
Drawings
FIG. 1 is a block diagram of a text clustering based malicious complaint identification system;
FIG. 2 is a crawler configuration system diagram.
Detailed Description
The embodiments of the invention will be described in detail below with reference to the drawings, but the invention can be implemented in many different ways as defined and covered by the claims.
To illustrate the approach more concretely, the following provides a case of identifying malicious complaint users of XW Bank ("new network bank") on the Ju Tousu complaint site.
Step 1: crawling complaint information from the complaint website. A complaint website and topic parameters are specified through interface configuration, and the back end collects complaint content meeting the configured conditions by means of crawler technology.
Step 1.1: topic parameter configuration. The topic parameter module sets the conditions for the crawler and comprises 4 parameters: monitoring topic, topic code, acquisition frequency and keyword configuration. The monitoring topic is set to "identification of malicious complaints against XW Bank"; the topic code, the unique primary key of the monitoring topic, is set to "XWBANK_JTS_EYTS"; the acquisition frequency specifies when the crawler collects and is configured as once every 5 minutes; the keyword configuration specifies the filtering condition for crawled content and is configured as "XW Bank && complaint".
Step 1.2: acquisition website configuration. The acquisition website module sets the website to be crawled through the interface; the website name is configured as "Ju Tousu" and the website address as "https://ts.21cn.com/".
Step 2: and (4) processing the user-defined text features. And storing the complaint information of the crawler in a relational database, and performing user-defined text characteristic processing. Suppose that n complains are collected; the primary key field name is ID, the value is an autoincrement primary key, and the value is 1, 2. The complaint information set is denoted as C ═ C1,C2,......,CnIn which C isiThe contents of the ith complaint, i 1, 2. Assume that the feature set of text processing is X1The custom text feature has m features, and is recorded asThe module comprises 4 sub-modules of data storage, basic attribute, statistical characteristic and proportion characteristic.
Step 2.1: Data storage. The complaint information from the crawler is stored in the relational database MySQL. A complaint information data table is created with the complaint number as the primary key; its fields include the complaint time, complainant name, crawl time, complaint object, complaint issue, related party, complaint amount, complaint description, and so on.
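A possible shape for the complaint information table is sketched below. The text names MySQL as the database; Python's built-in `sqlite3` is substituted here only so the example runs on its own, and the table name and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database named in the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE complaint_info (
        complaint_no     INTEGER PRIMARY KEY,  -- complaint number
        complaint_time   TEXT,
        complainant      TEXT,
        crawl_time       TEXT,
        complaint_object TEXT,
        complaint_issue  TEXT,
        related_party    TEXT,
        complaint_amount REAL,
        description      TEXT
    )
    """
)
conn.execute(
    "INSERT INTO complaint_info (complaint_no, description) VALUES (?, ?)",
    (1, "example complaint text"),
)
row = conn.execute(
    "SELECT description FROM complaint_info WHERE complaint_no = 1"
).fetchone()
columns = [info[1] for info in conn.execute("PRAGMA table_info(complaint_info)")]
```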
Step 2.2: Basic attributes. Basic attributes are the basic attribute fields associated with the complaint information, such as the complaint time, complaint amount, complaint demands, and complaint progress.
Step 2.3: Statistical features. Statistical features count how much of the text in the complaint information satisfies a given condition. Examples: the total number of characters in the complaint description, the number of characters in the description occupied by the word "complaint", the number of characters occupied by the word "report", the number of times the complaint progress is "follow-up", and the number of complaint replies.
Step 2.4: Proportional features. Proportional features measure the share of the text in the complaint information that satisfies a given condition. Examples: the proportion of the full complaint description occupied by the word "complaint", and the proportion occupied by the word "report".
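As a minimal sketch of the statistical and proportional features of steps 2.3 and 2.4, the function below counts the characters taken up by each keyword and divides by the length of the description. The function name, the dictionary keys, and the choice of the Chinese keywords "投诉" (complaint) and "举报" (report) are assumptions for illustration.

```python
def keyword_features(description, keywords=("投诉", "举报")):
    """Return character counts and proportions for each keyword."""
    total = len(description)
    features = {"total_chars": total}
    for keyword in keywords:
        # Characters occupied by non-overlapping occurrences of the keyword.
        chars = description.count(keyword) * len(keyword)
        features[f"chars_{keyword}"] = chars
        # Guard against an empty description.
        features[f"ratio_{keyword}"] = chars / total if total else 0.0
    return features


features = keyword_features("我要投诉，投诉无门")
```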
Step 3: Bidirectional matching word segmentation. Chinese word segmentation cuts each complaint description into individual words, a word being the smallest meaningful language unit that can be used independently. To mine the relevance between complaints in depth, the complaint descriptions are segmented with the bidirectional maximum matching method, which is a dictionary-based segmentation method. A dictionary-based method matches the Chinese character string to be segmented against the entries of a dictionary according to a given strategy; if a string is found in the dictionary, the match succeeds. For the n collected complaints, suppose the i-th complaint is cut into $k_i$ words; these words form the segmentation vector of the i-th complaint, and the segmentation vectors of all complaints in C form the complaint information segmentation set, where $i = 1, 2, \ldots, n$.
Step 3.1: Forward maximum matching segmentation. Forward matching scans the character string from its left end and extracts substrings to match against the dictionary. The forward maximum matching algorithm has three steps. First, for the i-th complaint description ($i = 1, 2, \ldots, n$), the first max characters are taken from left to right as the matching field, where max is the length of the longest entry in the dictionary. Second, the matching field is looked up in the dictionary. If the match succeeds, the matching field is cut off as a word. If the match fails, the last character of the matching field is removed and the remaining string is used as the new matching field for another lookup. Third, the above process is repeated until the whole description has been segmented.
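The forward procedure can be sketched as follows. The function name and the toy dictionary are illustrative, and emitting a single character that is not in the dictionary as its own word is an assumption, since the text does not say how unmatched characters are handled.

```python
def forward_max_match(text, dictionary):
    """Segment text left to right, always trying the longest field first."""
    max_len = max((len(word) for word in dictionary), default=1)
    words, pos = [], 0
    while pos < len(text):
        # Shrink the matching field from the right until it matches.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            field = text[pos:pos + size]
            # Assumption: an unmatched single character becomes a word.
            if field in dictionary or size == 1:
                words.append(field)
                pos += size
                break
    return words


toy_dictionary = {"研究", "研究生", "生命", "命", "的", "起源"}
forward_result = forward_max_match("研究生命的起源", toy_dictionary)
```

On this toy input the greedy left-to-right scan takes "研究生" first, which shows why a single direction can mis-segment a sentence.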
Step 3.2: Reverse maximum matching segmentation. This algorithm mirrors forward maximum matching: it scans the character string from its right end and extracts substrings to match against the dictionary. The reverse maximum matching algorithm has three steps. First, for the i-th complaint description ($i = 1, 2, \ldots, n$), the last max characters are taken from right to left as the matching field, where max is the length of the longest entry in the dictionary. Second, the matching field is looked up in the dictionary. If the match succeeds, the matching field is cut off as a word. If the match fails, the first character of the matching field is removed and the remaining string is used as the new matching field for another lookup. Third, the above process is repeated until the length of the string remaining to be cut is 0, that is, until the whole description has been segmented.
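A matching sketch for the reverse direction is given below, under the same assumptions as any dictionary-based example: the function name and toy dictionary are illustrative, and an unmatched single character is emitted as a word on its own.

```python
def backward_max_match(text, dictionary):
    """Segment text right to left, always trying the longest field first."""
    max_len = max((len(word) for word in dictionary), default=1)
    words, end = [], len(text)
    while end > 0:
        # Shrink the matching field from the left until it matches.
        for size in range(min(max_len, end), 0, -1):
            field = text[end - size:end]
            # Assumption: an unmatched single character becomes a word.
            if field in dictionary or size == 1:
                words.append(field)
                end -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


toy_dictionary = {"研究", "研究生", "生命", "命", "的", "起源"}
backward_result = backward_max_match("研究生命的起源", toy_dictionary)
```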
Step 3.3: Forward and reverse result matching. Bidirectional maximum matching compares the segmentation obtained by forward maximum matching with that obtained by reverse maximum matching to decide which segmentation to use. The study by Sun M.S. and Benjamin K.T. (1995) showed that for about 90% of Chinese sentences the forward and reverse maximum matching results coincide completely and are correct, while for the remaining roughly 10% the two results differ. The result matching has 2 steps. First, if the forward and reverse segmentations contain different numbers of words, the one with fewer words is selected. Second, if they contain the same number of words: if the two segmentations are identical, either may be selected; if they differ, the one with fewer single-character words is selected.
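The selection rule can be sketched as a function of the two candidate segmentations. Two points are assumptions rather than statements of the text: the second rule is read as "fewer single-character words", which is how bidirectional matching is commonly formulated, and a remaining tie is resolved in favor of the reverse result.

```python
def choose_segmentation(forward, backward):
    """Pick one of two segmentations of the same text."""
    # Rule 1: different word counts, so take the shorter segmentation.
    if len(forward) != len(backward):
        return forward if len(forward) < len(backward) else backward
    # Rule 2a: identical results, so either one will do.
    if forward == backward:
        return forward
    # Rule 2b (assumed reading): prefer fewer single-character words.
    singles_forward = sum(1 for word in forward if len(word) == 1)
    singles_backward = sum(1 for word in backward if len(word) == 1)
    # Assumption: on a tie the reverse result is returned.
    return forward if singles_forward < singles_backward else backward


chosen = choose_segmentation(
    ["研究生", "命", "的", "起源"],   # a forward result
    ["研究", "生命", "的", "起源"],   # a reverse result
)
```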
Step 4: Word frequency feature sets and visualization. Word frequency statistics are computed on the segmentation result of step 3: the number of occurrences of each word across the full set of complaint information is counted and rendered as a word cloud, so that the distribution of complaint content can be inspected visually. In addition, for each complaint, the number of occurrences of each word and its proportion of that complaint's content are counted, forming the word frequency statistical feature set and the word frequency proportion feature set.
Step 4.1: Word frequency statistical feature set. The complaint information segmentation set is deduplicated so that each word is retained only once and the retained words are pairwise distinct; the result is the unique vocabulary set. Suppose it has t elements, denoted $s_1, s_2, \ldots, s_t$. For each complaint's segmentation vector, the number of occurrences of each element of the unique vocabulary set is counted. The word frequency statistical feature set of the i-th complaint consists of these t counts, the j-th of which is the number of times the word $s_j$ occurs in the i-th segmentation vector, $j = 1, 2, \ldots, t$. Collecting these counts over $i = 1, 2, \ldots, n$ gives the word frequency statistical feature set of the whole complaint information segmentation set.
Step 4.2: Word frequency proportion feature set. For each complaint's segmentation vector, the proportion of each element of the unique vocabulary set is computed, defined as the number of occurrences of that element divided by the length of the segmentation vector (i.e., the number of words it contains, $k_i$). The word frequency proportion feature set of the i-th complaint consists of these t proportions, the j-th of which is the proportion of the word $s_j$ in the i-th segmentation vector, $j = 1, 2, \ldots, t$. Collecting these proportions over $i = 1, 2, \ldots, n$ gives the word frequency proportion feature set of the whole complaint information segmentation set.
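Steps 4.1 and 4.2 amount to building two n-by-t tables over a shared vocabulary. The sketch below assumes the segmented complaints are given as lists of words; the function name and the alphabetical ordering of the vocabulary are illustrative choices, not part of the method as described.

```python
from collections import Counter


def word_frequency_features(segmented):
    """Return (vocabulary, count rows, proportion rows)."""
    # Unique vocabulary: every word kept once (sorted only for stability).
    vocabulary = sorted({word for words in segmented for word in words})
    counts, proportions = [], []
    for words in segmented:
        occurrences = Counter(words)
        row = [occurrences[s] for s in vocabulary]
        counts.append(row)
        # Proportion = occurrences / number of words in this complaint.
        proportions.append([c / len(words) if words else 0.0 for c in row])
    return vocabulary, counts, proportions


vocabulary, counts, proportions = word_frequency_features(
    [["a", "b", "a"], ["b", "c"]]
)
```

Each row of `counts` and `proportions` corresponds to one complaint and each column to one vocabulary word, so the rows can be concatenated with the custom text features before clustering.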
Step 4.3: Word frequency display. A word cloud visually highlights the words that occur most frequently in a text; it filters out a large amount of low-frequency, low-quality text information and lets a reader grasp the gist of the text at a glance. Based on the full set of complaint information, the number of occurrences of each element of the unique vocabulary set is counted: let the j-th element $s_j$ occur $p_j$ times in the full complaint information, $j = 1, 2, \ldots, t$. These counts form the word frequency set of the full complaint information, from which the word cloud can be built directly using the Python word cloud display package WORDCOUNT.
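The corpus-wide counts $p_j$ that feed the word cloud can be sketched with the standard library alone. Rendering is deliberately left out, since the text does not describe the plotting package's interface; the function name is hypothetical.

```python
from collections import Counter


def corpus_word_counts(segmented):
    """Count how often each word occurs across all complaints."""
    totals = Counter()
    for words in segmented:
        totals.update(words)
    return totals


totals = corpus_word_counts([["a", "b", "a"], ["b", "c"]])
most_frequent = totals.most_common(2)  # candidates for the largest words
```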
Step 5: Density clustering based on the DBSCAN algorithm. DBSCAN (density-based spatial clustering of applications with noise) is a density-based spatial clustering algorithm. It partitions regions of sufficient density into clusters and can find clusters of arbitrary shape in a spatial database containing noise. The processed text feature set $X_1$, the word frequency statistical feature set, and the word frequency proportion feature set are merged, totaling m + 2nt feature variables, and recorded as the clustering feature set.
The clustering feature set is passed as the input to the DBSCAN algorithm, which classifies each complaint. Let the classification variable output by the model be Y and the clustering result of the i-th complaint be $y_i$, so that $Y = \{y_1, y_2, \ldots, y_n\}$, $i = 1, 2, \ldots, n$.
Step 5.1: Initial core sample labeling. First, the two core parameters of the DBSCAN algorithm, eps and min_samples, are set: eps is the distance threshold that defines a neighborhood and is set to 0.5; min_samples is the threshold on the number of samples in the ε-neighborhood required for a sample point to become a core object and is set to 5. Then an arbitrary point is selected and all points whose distance from it is less than or equal to eps are found.
Step 5.2: Cluster label assignment. If the number of data points within distance eps of the starting point is less than min_samples, the point is marked as noise. If the number is greater than or equal to min_samples, the point is marked as a core sample and assigned a new cluster label. All neighbors of the point (within distance eps) are then visited. If they have not yet been assigned a cluster, the new cluster label just created is assigned to them. If they are core samples, their neighbors are visited in turn, and so on. The cluster grows until there are no more core samples within distance eps of the cluster.
Step 5.3: Convergence training. Another point that has not yet been visited is selected, and the initial core sample labeling and cluster label assignment are repeated until every point has been labeled.
Step 5.4: Model output. The DBSCAN model outputs the label of the cluster to which each point belongs. For the clustering result $y_i$ of the i-th complaint, $y_i = -1$ or a positive integer: -1 denotes a noise point, and a positive integer is the label number of the cluster to which the point belongs, where $i = 1, 2, \ldots, n$.
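Steps 5.1 to 5.4 can be traced in a small pure-Python sketch. It is not a production implementation: it uses Euclidean distance with a quadratic neighbor search, and it treats a point with at least min_samples neighbors (itself included) as a core sample, which is one common convention and an assumption here. Cluster labels start at 1 and noise is -1, matching the output described in step 5.4.

```python
import math


def dbscan(points, eps=0.5, min_samples=5):
    """Return one label per point: -1 for noise, 1, 2, ... for clusters."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        found = neighbors(i)
        if len(found) < min_samples:
            labels[i] = -1  # noise, unless a later cluster absorbs it
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in found if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core sample: keep growing
    return labels


points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1), (10, 10)]
labels = dbscan(points, eps=0.5, min_samples=3)
```

In practice the feature columns would usually be scaled before clustering, because a fixed eps such as 0.5 is only meaningful relative to the scale of the features.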
Step 6: Determining the malicious complaint cluster. The clustering result of the DBSCAN algorithm is sampled by stratified sampling, the sampled complaints are manually labeled as malicious or not, the label of the malicious complaint cluster is confirmed according to the proportion of malicious complaints in each cluster, and the clustering result of the malicious complaint model is corrected accordingly.
Step 6.1: Cluster sampling. For the output Y of the DBSCAN algorithm, the maximum over all its elements is computed and denoted d. The clustering result then has d classes. Let the samples of the r-th class be $M_r$, containing $u_r$ elements, where $n = u_1 + u_2 + \cdots + u_d$ and $r = 1, 2, \ldots, d$. After clustering, the complaint information set $C = \{C_1, C_2, \ldots, C_n\}$ thus yields the complaint information cluster set $\{M_1, M_2, \ldots, M_d\}$. A sampling ratio h, $0 < h < 1$, is specified and stratified sampling is applied to the complaint information cluster set, forming the complaint information cluster sampling set, which contains nh elements.
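Stratified sampling by cluster can be sketched as below. The function name, the fixed random seed, the rounding of h times the cluster size, the minimum of one sample per cluster, and the exclusion of noise points (label -1) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(labels, h, seed=0):
    """Sample a share h of the complaint indices from every cluster."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        if label != -1:  # assumption: noise points are not sampled
            clusters[label].append(index)
    sample = {}
    for label, members in sorted(clusters.items()):
        # Assumption: round h * size and keep at least one complaint.
        size = max(1, round(h * len(members)))
        sample[label] = sorted(rng.sample(members, size))
    return sample


cluster_labels = [1] * 10 + [2] * 4 + [-1]
sample = stratified_sample(cluster_labels, h=0.5)
```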
Step 6.2: Manual labeling. Each complaint in the complaint information cluster sampling set is labeled manually, where 1 represents a malicious complaint and 0 represents a normal complaint; the labeled result is the complaint information cluster sampling label set.
Each label takes the value 1 or 0, with $r = 1, 2, \ldots, d$ indexing the clusters and $i = 1, 2, \ldots, nh$ indexing the sampled complaints. The proportion of malicious complaints in each cluster of the label set is then computed: for the r-th cluster $M_r$, the number of sampled complaints labeled malicious is counted and divided by the number of sampled complaints in that cluster, giving the malicious complaint proportion of the cluster. The proportions of all d clusters form the malicious complaint proportion set.
Step 6.3: Marking the malicious complaint cluster. The maximum of the malicious complaint proportion set is computed. The cluster corresponding to that maximum, denoted $M_{et}$, is the malicious complaint cluster, and all complaints in it are treated as malicious complaints; the elements of the remaining clusters are merged into a normal complaint cluster denoted $M_{ot}$. The classification result of the malicious complaint model for $C = \{C_1, C_2, \ldots, C_n\}$ is therefore $\{M_{et}, M_{ot}\}$, with $u_{et} + u_{ot} = n$, where $u_{et}$ and $u_{ot}$ are the numbers of complaints in the two clusters. At this point all malicious complaints have been identified.
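The proportion calculation and the choice of the malicious cluster can be sketched as follows, assuming the manual labels of the sampled complaints are available as a mapping from cluster label to a list of 0/1 values. The function name and the input layout are hypothetical.

```python
def malicious_cluster(sample_labels):
    """Return (cluster with the highest malicious share, all shares)."""
    # Share of sampled complaints labeled 1 (malicious) in each cluster.
    shares = {
        cluster: sum(marks) / len(marks)
        for cluster, marks in sample_labels.items()
        if marks  # skip clusters with no sampled complaints
    }
    worst = max(shares, key=shares.get)
    return worst, shares


worst, shares = malicious_cluster(
    {1: [0, 0, 1, 0], 2: [1, 1, 0, 1], 3: [0, 0]}
)
```

Every complaint in the returned cluster would then be treated as malicious and all other clusters merged into the normal group, as the step describes.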
The above is only a preferred embodiment of the present invention and is not intended to limit its scope; all equivalent structures or equivalent process modifications made using the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the present invention.
Claims (10)
1. A malicious complaint identification method based on text clustering is characterized by comprising the following steps:
step 1: specifying a complaint website and topic parameters through an interface configuration, the back end collecting complaint content that meets the configured conditions using crawler technology;
step 2: storing the crawled complaint information in a relational database and performing custom text feature processing; suppose that n complaints are collected; the primary key field is named ID, is defined as an auto-increment primary key, and takes the values 1, 2, ..., n; the complaint information set is denoted $C = \{C_1, C_2, \ldots, C_n\}$, where $C_i$ is the content of the i-th complaint, $i = 1, 2, \ldots, n$; let the feature set produced by text processing be $X_1$, containing the m custom text features;
step 3: assuming the complaint descriptions are mainly in Chinese, segmenting each complaint content with a Chinese word segmentation method; Chinese word segmentation cuts each complaint description into individual words; the complaint descriptions are segmented with the bidirectional maximum matching method, which is a dictionary-based segmentation method; a dictionary-based method matches the Chinese character string to be segmented against the entries of a dictionary according to a given strategy, and if a string is found in the dictionary the match succeeds; for the n collected complaints, suppose the i-th complaint is cut into $k_i$ words; these words form the segmentation vector of the i-th complaint, and the segmentation vectors of all complaints in C form the complaint information segmentation set, where $i = 1, 2, \ldots, n$;
step 4: performing word frequency statistics on the segmentation result of step 3, namely counting the number of occurrences of each word across the full set of complaint information and forming a word cloud; and counting, for each complaint, the number of occurrences of each word and its proportion of that complaint's content, forming the word frequency statistical feature set and the word frequency proportion feature set;
step 5: merging the processed text feature set $X_1$, the word frequency statistical feature set, and the word frequency proportion feature set, totaling m + 2nt feature variables, and recording the result as the clustering feature set;
passing the clustering feature set as the input to the DBSCAN algorithm to classify each complaint, where the classification variable output by the model is Y and the clustering result of the i-th complaint is $y_i$, with $Y = \{y_1, y_2, \ldots, y_n\}$, $i = 1, 2, \ldots, n$;
step 6: performing stratified sampling on the clustering result of the DBSCAN algorithm, manually labeling the sampled complaints as malicious or not, confirming the label of the malicious complaint cluster according to the proportion of malicious complaints in each cluster, and correcting the clustering result of the malicious complaint model.
2. The method for identifying malicious complaints based on text clustering according to claim 1, wherein step 1 specifically comprises the following steps:
step 1.1, topic parameter configuration: the topic parameter module sets the crawl conditions and comprises 4 parameters: monitoring topic, topic code, acquisition frequency, and keyword configuration; the monitoring topic is the topic content the crawler is required to cover; the topic code is the unique primary key of the monitoring topic and consists of digits, letters, and underscores; the acquisition frequency specifies the crawler's acquisition schedule, for which the interval between acquisitions must be set; the keyword configuration specifies the filtering condition for crawled content, and only content meeting the condition is collected;
step 1.2, acquisition website configuration: the acquisition website module sets the website to be crawled through the interface, for which the website name and website address must be specified; multiple acquisition sources may be added at the same time.
3. The method for identifying malicious complaints based on text clustering according to claim 2, wherein step 2 specifically comprises the following steps:
step 2.1, data storage: storing the complaint information from the crawler in the relational database MySQL and creating a complaint information data table with the complaint number as the primary key;
step 2.2, basic attributes: the basic attributes are the basic attribute fields associated with the complaint information;
step 2.3, statistical features: the statistical features count how much of the text in the complaint information content satisfies a given condition;
step 2.4, proportional features: the proportional features measure the proportion of the text in the complaint information content that satisfies a given condition.
4. The method for identifying malicious complaints based on text clustering according to claim 3, wherein step 3 specifically comprises the following steps:
step 3.1, forward maximum matching segmentation: forward matching scans the character string from its left end and extracts substrings to match against the dictionary; the forward maximum matching algorithm has three steps: first, for the i-th complaint description ($i = 1, 2, \ldots, n$), taking the first max characters from left to right as the matching field, where max is the length of the longest entry in the dictionary; second, looking up the matching field in the dictionary; if the match succeeds, the matching field is cut off as a word; if the match fails, the last character of the matching field is removed and the remaining string is used as the new matching field for another lookup; third, repeating the above process until the whole description has been segmented;
step 3.2, reverse maximum matching segmentation: this algorithm mirrors forward maximum matching, scanning the character string from its right end and extracting substrings to match against the dictionary; the reverse maximum matching algorithm has three steps: first, for the i-th complaint description ($i = 1, 2, \ldots, n$), taking the last max characters from right to left as the matching field, where max is the length of the longest entry in the dictionary; second, looking up the matching field in the dictionary; if the match succeeds, the matching field is cut off as a word; if the match fails, the first character of the matching field is removed and the remaining string is used as the new matching field for another lookup; third, repeating the above process until the length of the string remaining to be cut is 0, that is, until the whole description has been segmented;
step 3.3, forward and reverse result matching: bidirectional maximum matching compares the segmentation obtained by forward maximum matching with that obtained by reverse maximum matching to decide which segmentation to use; the result matching has 2 steps: first, if the forward and reverse segmentations contain different numbers of words, the one with fewer words is selected; second, if they contain the same number of words: if the two segmentations are identical, either may be selected; if they differ, the one with fewer single-character words is selected.
5. The method for identifying malicious complaints based on text clustering according to claim 4, wherein step 4 specifically comprises the following steps:
step 4.1, word frequency statistical feature set: deduplicating the complaint information segmentation set so that each word is retained only once and the retained words are pairwise distinct, the result being the unique vocabulary set; suppose it has t elements, denoted $s_1, s_2, \ldots, s_t$; for each complaint's segmentation vector, counting the number of occurrences of each element of the unique vocabulary set; the word frequency statistical feature set of the i-th complaint consists of these t counts, the j-th of which is the number of times the word $s_j$ occurs in the i-th segmentation vector, $j = 1, 2, \ldots, t$; collecting these counts over $i = 1, 2, \ldots, n$ gives the word frequency statistical feature set of the complaint information segmentation set;
step 4.2, word frequency proportion feature set: for each complaint's segmentation vector, computing the proportion of each element of the unique vocabulary set, defined as the number of occurrences of that element divided by the length of the segmentation vector, namely the number of words it contains; the word frequency proportion feature set of the i-th complaint consists of these t proportions, the j-th of which is the proportion of the word $s_j$ in the i-th segmentation vector, $j = 1, 2, \ldots, t$; collecting these proportions over $i = 1, 2, \ldots, n$ gives the word frequency proportion feature set of the complaint information segmentation set;
step 4.3, word frequency display: a word cloud visually highlights the words that occur most frequently in a text, filters out a large amount of low-frequency, low-quality text information, and lets a reader grasp the gist of the text at a glance; based on the full set of complaint information, counting the number of occurrences of each element of the unique vocabulary set, the j-th element $s_j$ occurring $p_j$ times in the full complaint information, $j = 1, 2, \ldots, t$; these counts form the word frequency set of the full complaint information, from which the word cloud can be built directly using the Python word cloud display package WORDCOUNT.
6. The method for identifying malicious complaints based on text clustering according to claim 5, wherein step 5 specifically comprises the following steps:
step 5.1, initial core sample labeling: first, setting the two core parameters of the DBSCAN algorithm, eps and min_samples, where eps is the distance threshold that defines a neighborhood and is a floating-point value, and min_samples is the threshold on the number of samples in the ε-neighborhood required for a sample point to become a core object and is a positive integer; then selecting an arbitrary point and finding all points whose distance from it is less than or equal to eps;
step 5.2, cluster label assignment: if the number of data points within distance eps of the starting point is less than min_samples, the point is marked as noise; if the number is greater than or equal to min_samples, the point is marked as a core sample and assigned a new cluster label; all neighbors of the point within distance eps are then visited; if they have not yet been assigned a cluster, the new cluster label just created is assigned to them; if they are core samples, their neighbors are visited in turn, and so on; the cluster grows until there are no more core samples within distance eps of the cluster;
step 5.3, convergence training: selecting another point that has not yet been visited and repeating the initial core sample labeling and cluster label assignment until every point has been labeled;
step 5.4, model output: the DBSCAN model outputs the label of the cluster to which each point belongs; for the clustering result $y_i$ of the i-th complaint, $y_i = -1$ or a positive integer, where -1 denotes a noise point and a positive integer is the label number of the cluster to which the point belongs, $i = 1, 2, \ldots, n$.
7. The method for identifying malicious complaints based on text clustering according to claim 6, wherein step 6 specifically comprises the following steps:
step 6.1, cluster sampling: computing the maximum over all elements of the output Y of the DBSCAN algorithm and denoting it d; the clustering result of the DBSCAN algorithm then has d classes; let the samples of the r-th class be $M_r$, containing $u_r$ elements, where $n = u_1 + u_2 + \cdots + u_d$ and $r = 1, 2, \ldots, d$; after clustering, the complaint information set $C = \{C_1, C_2, \ldots, C_n\}$ yields the complaint information cluster set $\{M_1, M_2, \ldots, M_d\}$; specifying a sampling ratio h, $0 < h < 1$, and applying stratified sampling to the complaint information cluster set to form the complaint information cluster sampling set, which contains nh elements;
step 6.2, manual labeling: manually labeling each complaint in the complaint information cluster sampling set, where 1 represents a malicious complaint and 0 represents a normal complaint, the labeled result being the complaint information cluster sampling label set; each label takes the value 1 or 0, with $r = 1, 2, \ldots, d$ and $i = 1, 2, \ldots, nh$; computing the proportion of malicious complaints in each cluster of the label set, namely, for the r-th cluster $M_r$, counting the number of sampled complaints labeled malicious and dividing it by the number of sampled complaints in that cluster; the proportions of all d clusters form the malicious complaint proportion set;
step 6.3, marking the malicious complaint cluster: computing the maximum of the malicious complaint proportion set; the cluster corresponding to that maximum, denoted $M_{et}$, is the malicious complaint cluster, and all complaints in it are treated as malicious complaints; the elements of the remaining clusters are merged into a normal complaint cluster denoted $M_{ot}$; the classification result of the malicious complaint model for $C = \{C_1, C_2, \ldots, C_n\}$ is therefore $\{M_{et}, M_{ot}\}$, with $u_{et} + u_{ot} = n$; at this point all malicious complaints have been identified.
8. A malicious complaint identification system based on text clustering, implementing the method according to claim 1, wherein the system comprises: a complaint website information crawler module, a custom text feature processing module, a bidirectional matching word segmentation module, a word frequency feature set and visualization module, a density clustering module based on the DBSCAN algorithm, and a malicious complaint cluster determination module;
the complaint website information crawler module is used for specifying a complaint website and topic parameters through an interface configuration, the back end collecting complaint content that meets the configured conditions using crawler technology;
the custom text feature processing module is used for storing the crawled complaint information in a relational database and performing custom text feature processing; suppose that n complaints are collected; the primary key field is named ID, is defined as an auto-increment primary key, and takes the values 1, 2, ..., n; the complaint information set is denoted $C = \{C_1, C_2, \ldots, C_n\}$, where $C_i$ is the content of the i-th complaint, $i = 1, 2, \ldots, n$; let the feature set produced by text processing be $X_1$, containing the m custom text features;
the bidirectional matching word segmentation module is used for segmenting each complaint content with a Chinese word segmentation method, assuming the complaint descriptions are mainly in Chinese; Chinese word segmentation cuts each complaint description into individual words; the complaint descriptions are segmented with the bidirectional maximum matching method, which is a dictionary-based segmentation method; a dictionary-based method matches the Chinese character string to be segmented against the entries of a dictionary according to a given strategy, and if a string is found in the dictionary the match succeeds; for the n collected complaints, suppose the i-th complaint is cut into $k_i$ words; these words form the segmentation vector of the i-th complaint, and the segmentation vectors of all complaints in C form the complaint information segmentation set, where $i = 1, 2, \ldots, n$;
the word frequency feature set and visualization module is used for performing word frequency statistics on the segmentation result of the bidirectional matching word segmentation module, namely counting the number of occurrences of each word across the full set of complaint information and forming a word cloud; and for counting, for each complaint, the number of occurrences of each word and its proportion of that complaint's content, forming the word frequency statistical feature set and the word frequency proportion feature set;
the density clustering module based on the DBSCAN algorithm is used for merging the processed text feature set $X_1$, the word frequency statistical feature set, and the word frequency proportion feature set, totaling m + 2nt feature variables, and recording the result as the clustering feature set;
the clustering feature set is passed as the input to the DBSCAN algorithm to classify each complaint, where the classification variable output by the model is Y and the clustering result of the i-th complaint is $y_i$, with $Y = \{y_1, y_2, \ldots, y_n\}$, $i = 1, 2, \ldots, n$;
the malicious complaint cluster determination module is used for performing stratified sampling on the clustering result of the DBSCAN algorithm, manually labeling the sampled complaints as malicious or not, confirming the label of the malicious complaint cluster according to the proportion of malicious complaints in each cluster, and correcting the clustering result of the malicious complaint model.
9. The malicious complaint identification system based on text clustering according to claim 8, wherein the complaint information crawler module for complaint websites comprises two submodules: topic parameter configuration and collection website configuration;
the topic parameter configuration submodule is used for setting the crawl conditions and comprises four parameters: monitoring topic, topic code, collection frequency, and keyword configuration; the monitoring topic is the topic content that the crawler is directed to collect; the topic code is the unique primary key of the monitoring topic and consists of digits, letters, and underscores; the collection frequency specifies when the crawler collects, and requires the interval between information collections to be set; the keyword configuration specifies the filtering conditions for crawled content, and only content meeting the conditions is collected;
the collection website configuration submodule is used for setting, through an interface, the websites to be crawled, specifying the website name and the website address; multiple collection sources may be added at the same time.
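The two configuration submodules of claim 9 can be pictured as a small data model. The sketch below is an assumption-laden illustration: the class and field names, the use of minutes for the collection interval, the example URLs, and the rule that a text passes the filter when it contains any configured keyword are all hypothetical. Only the four topic parameters, the character set of the topic code, and the name-plus-address site entries come from the claim.

```python
import re
from dataclasses import dataclass, field
from typing import List

# The topic code consists of digits, letters, and underscores (per the claim).
_TOPIC_CODE = re.compile(r"[A-Za-z0-9_]+")


@dataclass
class TopicConfig:
    monitoring_topic: str          # topic content the crawler should collect
    topic_code: str                # unique key of the monitoring topic
    interval_minutes: int          # assumed unit for the collection interval
    keywords: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not _TOPIC_CODE.fullmatch(self.topic_code):
            raise ValueError("topic_code must contain only digits, letters, and underscores")
        if self.interval_minutes <= 0:
            raise ValueError("interval_minutes must be positive")

    def accepts(self, text: str) -> bool:
        """Assumed filter rule: keep text containing any keyword (or all text if none set)."""
        return not self.keywords or any(k in text for k in self.keywords)


@dataclass
class SiteConfig:
    name: str   # website name
    url: str    # website address


@dataclass
class CrawlerConfig:
    topic: TopicConfig
    sites: List[SiteConfig] = field(default_factory=list)  # several sources may be added


# Hypothetical example values.
config = CrawlerConfig(
    topic=TopicConfig("refund complaints", "refund_2021", 30, ["退款", "refund"]),
    sites=[
        SiteConfig("Example Complaints", "https://complaints.example.com"),
        SiteConfig("Example Forum", "https://forum.example.org"),
    ],
)
print(config.topic.accepts("申请退款一直没有处理"))
```

Validating the topic code at construction time keeps an invalid key from ever reaching storage, which matters if the code is later used as a primary key.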
10. The malicious complaint identification system based on text clustering according to claim 9, wherein the custom text feature processing module comprises four submodules: data storage, basic attributes, statistical features, and proportion features;
the data storage submodule is used for storing the complaint information obtained by the crawler in the relational database MySQL and creating a complaint information data table with the complaint number as the primary key;
the basic attribute submodule is used for recording the basic attribute fields related to the complaint information;
the statistical feature submodule is used for counting the number of text items in the complaint information content that meet given conditions;
and the proportion feature submodule is used for computing the proportion of the complaint information content made up of text items that meet given conditions.
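The storage, statistical feature, and proportion feature submodules of claim 10 can be sketched as below. Several things here are assumptions made only for illustration: Python's built-in sqlite3 stands in for MySQL so the example is self-contained, the table and column names are hypothetical, and the two "conditions" (digit characters and exclamation marks) are invented examples, since the claim does not say which conditions are counted.

```python
import sqlite3


def create_complaint_table(conn):
    """Create the complaint table keyed by complaint number (sqlite3 stands in for MySQL)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS complaint ("
        "complaint_no TEXT PRIMARY KEY, content TEXT NOT NULL)"
    )


def statistical_features(content):
    """Count characters meeting each (hypothetical) condition."""
    return {
        "char_count": len(content),
        "digit_count": sum(ch.isdigit() for ch in content),
        "exclaim_count": sum(ch in "!！" for ch in content),
    }


def proportion_features(content):
    """Express each count as a share of the complaint's total length."""
    stats = statistical_features(content)
    total = stats["char_count"]
    if total == 0:
        return {"digit_ratio": 0.0, "exclaim_ratio": 0.0}
    return {
        "digit_ratio": stats["digit_count"] / total,
        "exclaim_ratio": stats["exclaim_count"] / total,
    }


# Hypothetical complaint record.
conn = sqlite3.connect(":memory:")
create_complaint_table(conn)
conn.execute(
    "INSERT INTO complaint VALUES (?, ?)",
    ("C0001", "退款！！订单12345至今未处理"),
)
(content,) = conn.execute(
    "SELECT content FROM complaint WHERE complaint_no = ?", ("C0001",)
).fetchone()
print(statistical_features(content))
print(proportion_features(content))
```

Because the complaint number is the primary key, inserting the same complaint twice is rejected by the database, which gives the crawler a simple way to avoid storing duplicates.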
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110351440.1A CN113094567A (en) | 2021-03-31 | 2021-03-31 | Malicious complaint identification method and system based on text clustering |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113094567A true CN113094567A (en) | 2021-07-09 |
Family
ID=76673191
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110351440.1A Pending CN113094567A (en) | 2021-03-31 | 2021-03-31 | Malicious complaint identification method and system based on text clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113094567A (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015044934A1 (en) * | 2013-09-30 | 2015-04-02 | ABIDIN, Indira Ratna Dewi | A method for adaptively classifying sentiment of document snippets |
CN105912625A (en) * | 2016-04-07 | 2016-08-31 | 北京大学 | Linked data oriented entity classification method and system |
CN106296422A (en) * | 2016-07-29 | 2017-01-04 | 重庆邮电大学 | A kind of social networks junk user detection method merging many algorithms |
CN107944460A (en) * | 2016-10-12 | 2018-04-20 | 甘肃农业大学 | One kind is applied to class imbalance sorting technique in bioinformatics |
CN108470282A (en) * | 2018-03-26 | 2018-08-31 | 国家电网公司客户服务中心 | Work order intelligent method for classifying is complained by Customer Service Center |
CN108573031A (en) * | 2018-03-26 | 2018-09-25 | 上海万行信息科技有限公司 | A kind of complaint sorting technique and system based on content |
CN109492091A (en) * | 2018-09-28 | 2019-03-19 | 科大国创软件股份有限公司 | A kind of complaint work order intelligent method for classifying based on convolutional neural networks |
CN109376226A (en) * | 2018-11-08 | 2019-02-22 | 合肥工业大学 | Complain disaggregated model, construction method, system, classification method and the system of text |
CN111447574A (en) * | 2018-12-27 | 2020-07-24 | 中国移动通信集团辽宁有限公司 | Short message classification method, device, system and storage medium |
CN111104466A (en) * | 2019-12-25 | 2020-05-05 | 航天科工网络信息发展有限公司 | Method for rapidly classifying massive database tables |
CN111210057A (en) * | 2019-12-25 | 2020-05-29 | 广东飞企互联科技股份有限公司 | Method for predicting complaints of mobile phone internet users |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114676796A (en) * | 2022-05-27 | 2022-06-28 | 浙江清大科技有限公司 | Clustering acquisition and identification system based on big data |
CN114676796B (en) * | 2022-05-27 | 2022-09-06 | 浙江清大科技有限公司 | Clustering acquisition and identification system based on big data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109189901B (en) | Method for automatically discovering new classification and corresponding corpus in intelligent customer service system | |
CN111126386B (en) | Sequence domain adaptation method based on countermeasure learning in scene text recognition | |
CN109783639B (en) | Mediated case intelligent dispatching method and system based on feature extraction | |
CN109189767B (en) | Data processing method and device, electronic equipment and storage medium | |
CN108710651A (en) | A kind of large scale customer complaint data automatic classification method | |
US10387805B2 (en) | System and method for ranking news feeds | |
CN111783394A (en) | Training method of event extraction model, event extraction method, system and equipment | |
CN111931505A (en) | Cross-language entity alignment method based on subgraph embedding | |
CN109446423B (en) | System and method for judging sentiment of news and texts | |
CN111191051B (en) | Method and system for constructing emergency knowledge map based on Chinese word segmentation technology | |
CN111078979A (en) | Method and system for identifying network credit website based on OCR and text processing technology | |
CN107679075A (en) | Method for monitoring network and equipment | |
CN112241458A (en) | Text knowledge structuring processing method, device, equipment and readable storage medium | |
CN113946657A (en) | Knowledge reasoning-based automatic identification method for power service intention | |
CN113486664A (en) | Text data visualization analysis method, device, equipment and storage medium | |
Chen et al. | Offline handwritten digits recognition using machine learning | |
CN111709225A (en) | Event cause and effect relationship judging method and device and computer readable storage medium | |
CN113094567A (en) | Malicious complaint identification method and system based on text clustering | |
CN112380346B (en) | Financial news emotion analysis method and device, computer equipment and storage medium | |
CN111984790B (en) | Entity relation extraction method | |
CN111597423B (en) | Performance evaluation method and device of interpretable method of text classification model | |
CN113569048A (en) | Method and system for automatically dividing affiliated industries based on enterprise operation range | |
CN109993381B (en) | Demand management application method, device, equipment and medium based on knowledge graph | |
CN113934833A (en) | Training data acquisition method, device and system and storage medium | |
CN111428033B (en) | Automatic threat information extraction method based on double-layer convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20210709 |