CN111400597A

CN111400597A - Information classification method based on k-means algorithm and related equipment

Info

Publication number: CN111400597A
Application number: CN202010183100.8A
Authority: CN
Inventors: 高越
Original assignee: Ping An Life Insurance Company of China Ltd
Current assignee: Ping An Life Insurance Company of China Ltd
Priority date: 2020-03-16
Filing date: 2020-03-16
Publication date: 2020-07-10

Abstract

The application relates to the technical field of data analysis, in particular to an information classification method based on a k-means algorithm and related equipment, which comprises the following steps: acquiring original information, and filtering the original information to obtain a client information set; extracting irregular data, and calculating the polymerization degree value between the irregular data and other regular data in the customer information set; preprocessing data in the client information set according to the polymerization degree value; randomly extracting a plurality of sample data in the preprocessed client information set as centroids, and calculating the distance between the residual sample data in the preprocessed client information set and each centroid; obtaining a centroid A and sample data A corresponding to the minimum distance in the distances, and classifying the sample data A and the centroid A into one class to obtain a classification result; and acquiring a preset contact strategy corresponding to the classification result, and contacting the corresponding client based on the contact strategy. The clustering accuracy is improved, and further the accuracy of automatically searching the client information by the computer is improved.

Description

Information classification method based on k-means algorithm and related equipment

Technical Field

The application relates to the technical field of data analysis, in particular to an information classification method based on a k-means algorithm and related equipment.

Background

With the gradual development of big data technology, information can be classified with a clustering algorithm when it is searched, and a targeted search can then be carried out according to the classification result. The most common clustering algorithm is the k-means algorithm. The k-means algorithm is an iterative clustering analysis algorithm: K objects are randomly selected as initial cluster centers, the distance between each object and each cluster center is calculated, and each object is assigned to its nearest cluster center. The method clusters regular, normally distributed data well.

However, data recorded on the internet is often irregular data which is not normally distributed, so that accuracy is reduced when a k-means algorithm is directly applied to classify information, a large amount of manual auxiliary operation is required, and full-automatic searching cannot be achieved.

Disclosure of Invention

Based on the above, the information classification method based on the k-means algorithm and the related equipment are provided for solving the problems that the accuracy is reduced when the k-means algorithm is applied to classify the information at present, a large amount of manual auxiliary operation is needed, and the full-automatic searching cannot be realized.

An information classification method based on a k-means algorithm comprises the following steps:

acquiring original information, and filtering the original information to obtain a client information set;

extracting data which is not normally distributed with other data in the customer information set as irregular data, and calculating a polymerization degree value between the irregular data and each other regular data in the customer information set;

performing data screening on the sample data in the client information set according to the comparison result of the polymerization degree value and a preset polymerization degree threshold value;

randomly extracting a plurality of sample data in the client information set after data screening as centroids, and calculating the distance between the residual sample data in the client information set after preprocessing and each centroid;

and obtaining a centroid A and a sample data A corresponding to the minimum distance in the distances, classifying the sample data A and the centroid A into one class, and repeating the steps until all data in the preprocessed customer information set are classified to obtain a classification result.

In one possible embodiment, the obtaining original information and filtering the original information to obtain a client information set includes:

capturing a webpage from a network, and extracting a plurality of original information from the webpage according to a preset screening rule;

loading element rules of client information, and judging whether elements in each original information accord with the element rules;

marking the original information which accords with the element rule as alternative information, or else, not marking;

and collecting all the alternative information to obtain the client information set.

In one possible embodiment, the extracting a plurality of original information from the web page according to preset filtering rules includes:

dividing the content in the webpage into structured data and unstructured data according to the webpage structure of the webpage;

extracting all rows corresponding to preset customer names from the structured data;

dividing the unstructured data into a plurality of subsegments according to a preset segmentation threshold value, and extracting all subsegments containing the preset client name;

and summarizing the lines corresponding to the preset client names and the subsections containing the preset client names to obtain the original information.

In one possible embodiment, before the extracting data in the customer information set that is not normally distributed with other data as irregular data and calculating a polymerization degree value between the irregular data and each other regular data in the customer information set, the method further includes:

classifying the client information in the client information set according to element categories, and respectively establishing corresponding information distribution coordinate graphs of the classified client information according to the attributes of the element categories;

determining irregular points according to the distance between each point in the information distribution coordinate graph;

and taking the customer information corresponding to the irregular point as the irregular data.

In one possible embodiment, the performing data filtering on the sample data in the client information set according to the comparison result between the aggregation level value and a preset aggregation level threshold includes:

taking the position of any regular data on a classified coordinate system as a circle center, drawing a circle A by taking the value of a corresponding preset category parameter as a radius, calculating the distance from the irregular data to the circle center, and taking the distance as a polymerization degree value;

if the polymerization degree value is larger than a preset polymerization degree threshold value, determining that the irregular data is high polymerization data, otherwise, determining that the irregular data is low polymerization data;

drawing a circle B by taking the position of the high aggregation data on the classification coordinate system as a circle center and the category parameter as a radius, and judging whether the low aggregation data is in the circle B;

and if the low aggregation data are in the circle B, packaging the low aggregation data and the high aggregation data into sample data, and otherwise, discarding the low aggregation data.

In one possible embodiment, the obtaining a centroid a and a sample data a corresponding to a minimum distance among the distances, and classifying the sample data a and the centroid a into one class, and so on until all data in the preprocessed customer information set are classified, and after a classification result is obtained, the method further includes:

acquiring a preset contact strategy corresponding to the classification result, and contacting a corresponding client based on the contact strategy;

judging whether the client corresponding to the original information is successfully contacted or not according to the feedback information of the client;

if the contact fails, judging whether the original information contains abnormal data according to a preset abnormal data rule;

if abnormal data is contained, removing the abnormal data and then reclassifying; otherwise, marking the client as a difficult client awaiting manual processing.

An information classification device based on a k-means algorithm comprises the following modules:

the information set establishing module is used for acquiring original information and filtering the original information to obtain a client information set;

the polymerization degree obtaining module is used for extracting data which is not normally distributed with other data in the customer information set as irregular data, and calculating a polymerization degree value between the irregular data and other regular data in the customer information set;

the preprocessing module is used for performing data screening on the sample data in the client information set according to the comparison result of the polymerization degree value and a preset polymerization degree threshold value;

the sample analysis module is used for randomly extracting a plurality of sample data in the client information set after data screening as centroids and calculating the distance between the residual sample data in the client information set after preprocessing and each centroid;

and the result generation module is used for acquiring the centroid A and the sample data A corresponding to the minimum distance in the distances, classifying the sample data A and the centroid A into one class, and repeating the steps until all data in the preprocessed customer information set are classified to obtain a classification result.

In one possible embodiment, the information set creating module is further configured to:

marking the original information which accords with the element rule as alternative information, or else, not marking; and collecting all the alternative information to obtain the client information set.

A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the above-described k-means algorithm based information classification method.

A storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above-described k-means algorithm-based information classification method.

Compared with the existing mechanism, the method and the system have the advantages that the original information is obtained and filtered to obtain the client information set; extracting data which is not normally distributed with other data in the customer information set as irregular data, and calculating a polymerization degree value between the irregular data and each other regular data in the customer information set; performing data screening on the sample data in the client information set according to the comparison result of the polymerization degree value and a preset polymerization degree threshold value; randomly extracting a plurality of sample data in the client information set after data screening as centroids, and calculating the distance between the residual sample data in the client information set after preprocessing and each centroid; and obtaining a centroid A and a sample data A corresponding to the minimum distance in the distances, classifying the sample data A and the centroid A into one class, and repeating the steps until all data in the preprocessed customer information set are classified to obtain a classification result. Therefore, the problem that when the k-means algorithm is applied to classify the information, the accuracy is reduced, a large amount of manual auxiliary operation is needed, and full-automatic searching cannot be achieved is solved.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application.

FIG. 1 is an overall flow chart of an information classification method based on a k-means algorithm according to an embodiment of the present application;

FIG. 2 is a schematic diagram illustrating an information set establishing process in an information classification method based on a k-means algorithm according to an embodiment of the present application;

FIG. 3 is a block diagram of an information classification apparatus based on a k-means algorithm according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Fig. 1 is an overall flowchart of an information classification method based on a k-means algorithm in one embodiment of the present application, and the information classification method based on the k-means algorithm includes the following steps:

S1, acquiring original information, and filtering the original information to obtain a client information set;

Specifically, the original information can be obtained in cooperation with operators such as mobile, network, and telecommunications operators, whose records are used to update the data in the system so as to obtain the latest contact telephone number of a target client. Alternatively, telephone numbers associated with the client can be mined from web data: for example, the contact details of relatives, colleagues, and classmates can be found through social platforms such as WeChat and QQ, and the contact details of the client's relationship network can be found starting from the client's initial telephone number. When the original information is filtered, a threshold condition can be established. For example, if the target client is 35 years old, 35 is used as the threshold: the clients who share the target client's name are screened, and only the contact details of those aged 35 are retained. This step is a coarse filter: only one or two thresholds are set as filtering conditions, so that the target client is not filtered out, and thus made unreachable, because of recording errors in the original information.
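The coarse filter described above can be sketched as follows. This is a minimal illustration, assuming each item of original information is a record with a name, an age, and a phone number; the `Record` type, the `coarse_filter` function, and the sample phone numbers are hypothetical and not part of the application.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    age: int
    phone: str


def coarse_filter(records, target_name, target_age):
    """Keep records that match the target's name and age (only two conditions)."""
    return [r for r in records if r.name == target_name and r.age == target_age]


# Hypothetical sample data: three people share the target's name.
records = [
    Record("Zhang San", 35, "138-0000-0001"),
    Record("Zhang San", 52, "138-0000-0002"),
    Record("Li Si", 35, "138-0000-0003"),
    Record("Zhang San", 35, "138-0000-0004"),
]
candidates = coarse_filter(records, "Zhang San", 35)
```

Using only two conditions keeps every same-name, same-age record, so a target whose other fields were recorded incorrectly still survives this stage.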

S2, extracting data which are not normally distributed with other data in the customer information set as irregular data, and calculating a polymerization degree value between the irregular data and each other regular data in the customer information set;

The irregular data refers to data that is not normally distributed with other data in the client information set, that is, isolated points. Such data may arise because, for some special reason, the client did not fully meet the conditions of the corresponding insurance type when signing the policy, or because the client met the conditions when signing the policy but, owing to a later change, no longer meets the conditions initially set.

When irregular data is checked, a coordinate system can be established, data in a client set is split according to different dimensions, each dimension corresponds to one coordinate system, and then normal distribution statistics is carried out on points in the coordinate system, so that irregular data of each dimension is obtained.
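One way to read "normal distribution statistics" per dimension is a z-score test, sketched below. The z-score criterion, the threshold of 1.5, the function names, and the `premium` dimension are assumptions made for illustration; the application does not specify the statistic used.

```python
from statistics import mean, pstdev


def irregular_indices(values, z_threshold=1.5):
    """Return indices whose z-score magnitude exceeds z_threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing is isolated
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold]


def irregular_by_dimension(customers, dimensions, z_threshold=1.5):
    """Split client records by dimension and flag irregular entries in each."""
    return {
        dim: irregular_indices([c[dim] for c in customers], z_threshold)
        for dim in dimensions
    }


# Hypothetical client set with two dimensions.
customers = [
    {"age": 33, "premium": 1200},
    {"age": 35, "premium": 1100},
    {"age": 28, "premium": 1300},
    {"age": 54, "premium": 1250},
    {"age": 40, "premium": 1150},
]
flags = irregular_by_dimension(customers, ["age", "premium"])
```

Here only the client at index 3 (age 54) is flagged in the age dimension, and no client is flagged in the premium dimension.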

S3, performing data screening on the sample data in the client information set according to the comparison result of the polymerization degree value and a preset polymerization degree threshold value;

Specifically, regular data is distributed in a spherical pattern when clustered, whereas irregular data does not lie on the sphere. If the irregular data and the regular data to be clustered with it are text, the text must first be converted into multi-dimensional word vectors, which are then reduced, for example by PCA, to vectors of the same dimension, usually three-dimensional or two-dimensional. The distance between the two vectors is then calculated in the three-dimensional or two-dimensional coordinate system, and this distance determines the polymerization degree. The higher the polymerization degree, the more similar the irregular data is to the regular data, and the two can be clustered as one class; otherwise, the irregular data is abnormal data and needs to be removed.

S4, randomly extracting a plurality of sample data in the client information set after data screening as centroids, and calculating the distance between the residual sample data in the client information set after preprocessing and each centroid;

Specifically, suppose, for example, that 10 items of telephone data from industry A and 5 items from industry B are captured from web pages, and 2 centroid vectors are randomly selected from these samples. The distance between sample data and a centroid can be calculated as a Euclidean distance or a Mahalanobis distance; the Euclidean distance is generally used in this embodiment. The Euclidean distance is a commonly used definition of distance: it is the true distance between two points in m-dimensional space, or the natural length of a vector (that is, the distance of the point from the origin). In two and three dimensions it is the actual distance between two points. The Euclidean distance can effectively measure the distance between other data and a centroid in multidimensional space, which facilitates effective clustering.
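Step S4 can be sketched as below, assuming each sample has already been turned into a tuple of numbers. The function names and the optional `seed` parameter are hypothetical; `math.dist` computes the Euclidean distance between two points.

```python
import math
import random


def pick_centroids(samples, k, seed=None):
    """Randomly draw k distinct samples to act as initial centroids."""
    rng = random.Random(seed)
    return rng.sample(samples, k)


def distances_to_centroids(samples, centroids):
    """Map each remaining sample to its Euclidean distance from every centroid."""
    rest = [s for s in samples if s not in centroids]
    return {s: [math.dist(s, c) for c in centroids] for s in rest}


# Small worked case: one fixed centroid at the origin.
table = distances_to_centroids([(0, 0), (3, 4), (6, 8)], [(0, 0)])
```

With the centroid at the origin, the samples (3, 4) and (6, 8) lie at distances 5.0 and 10.0 respectively.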

S5, obtaining a centroid A and a sample data A corresponding to the minimum distance in the distances, classifying the sample data A and the centroid A into one class, and repeating the steps until all data in the preprocessed customer information set are classified to obtain a classification result;

Specifically, if the minimum distance obtained for an item of other data corresponds to more than one centroid, a voting mechanism is introduced to vote on the result, and the category of the centroid with the most votes is taken as the category of that item.

In addition, classifying the other data together with centroid A is only an initial classification, that is, a subclass. For example, if centroid A is classified as "basketball", a further classification can assign it to the parent class of basketball, "ball". After the initial classification generates a plurality of initial data sets, a new centroid is determined for each, clustering is performed again according to the new centroids, and these steps are repeated to obtain K major classes.
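The assign-then-recompute loop of step S5 can be sketched as a standard k-means iteration. This is an illustration under assumptions: samples are numeric tuples, the new centroid is taken as the coordinate-wise mean of its cluster, and a tie between equally near centroids is resolved by taking the first one rather than by the voting mechanism mentioned above. All function names are hypothetical.

```python
import math


def assign(samples, centroids):
    """Group each sample with the centroid at minimum Euclidean distance."""
    clusters = [[] for _ in centroids]
    for s in samples:
        # min() returns the first index on a tie (no voting in this sketch).
        nearest = min(range(len(centroids)), key=lambda i: math.dist(s, centroids[i]))
        clusters[nearest].append(s)
    return clusters


def recompute(clusters, old_centroids):
    """New centroid = coordinate-wise mean of the cluster; keep the old one if empty."""
    new = []
    for cluster, old in zip(clusters, old_centroids):
        if cluster:
            new.append(tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
        else:
            new.append(old)
    return new


def k_means(samples, centroids, max_iter=100):
    """Alternate centroid update and assignment until the centroids stop moving."""
    clusters = assign(samples, centroids)
    for _ in range(max_iter):
        updated = recompute(clusters, centroids)
        if updated == centroids:
            break
        centroids = updated
        clusters = assign(samples, centroids)
    return centroids, clusters


samples = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
final_centroids, final_clusters = k_means(samples, [(0.0, 0.0), (10.0, 10.0)])
```

On this toy data the loop converges after one update, with centroids at (0.0, 0.5) and (10.0, 10.5).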

According to the embodiment, the irregular data in the client information are effectively processed, so that the clustering accuracy is improved, and the accuracy of automatically searching the client information by a computer is improved.

Fig. 2 is a schematic diagram illustrating an information set establishing process in an information classification method based on a k-means algorithm according to an embodiment of the present application, where as shown in the drawing, the S1 obtains original information, and filters the original information to obtain a client information set, including:

S11, capturing a webpage from the network, and extracting a plurality of original information from the webpage according to preset screening rules;

The web page is captured with a crawling tool, such as Heritrix, WebSPHINX, WebLech, or Arale. Once the screening rules are configured in such a tool, network information is captured automatically; both the current page and the pages linked to it can be captured, so as to obtain the required target client information.

Heritrix captures web page data in a multithreaded manner: a main thread distributes tasks to Toe threads (processing threads), and each Toe thread processes one URL at a time. Each URL passes through the following steps:

Prefetching: mainly preparatory work, such as delaying or re-queuing the processing, or vetoing subsequent operations.

Fetching: mainly downloading the web page from the target website, including DNS resolution and filling in request and response forms.

Extraction: once fetching is completed, the HTML and JavaScript of interest are extracted according to preset rules; these are usually the new URLs in the page that need to be captured.

Storage: the captured results are stored; Heritrix stores downloaded result data in the ARC format.

Submission: the newly extracted URLs are checked to see which fall within the capture scope, those URLs are submitted to the Frontier for scheduling, and the DNS cache information is updated.
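The thread model described above (a main thread handing one URL at a time to each worker, which runs the URL through successive stages) can be sketched in a few lines. This is not Heritrix's API: `process_url`, `crawl`, and the in-memory `FAKE_PAGES` stand-in for the network are all hypothetical, and the scope check and Frontier submission are omitted.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_PAGES = {
    "http://a.example": "contact page http://b.example",
    "http://b.example": "no links here",
}


def process_url(url, fetch):
    """Run one URL through pre-fetch, fetch, extract, and store stages."""
    if not url.startswith("http"):      # pre-fetch: veto URLs we will not process
        return url, None
    html = fetch(url)                   # fetch: download the page
    links = [word for word in html.split() if word.startswith("http")]  # extract
    return url, links                   # store: hand the result back to the caller


def crawl(urls, fetch, workers=4):
    """Main thread distributes URLs; each worker handles one URL at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda u: process_url(u, fetch), urls))


results = crawl(
    ["http://a.example", "http://b.example", "ftp://c.example"],
    FAKE_PAGES.get,
)
```

A real crawler would feed the extracted links back into the queue after checking that they fall within the capture scope.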

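In one embodiment, the extracting of the original information from the web page according to the preset filtering rule may be performed by the following steps:

dividing the content in the webpage into structured data and unstructured data according to the webpage structure of the webpage;

extracting all rows corresponding to preset client names from the structured data;

dividing the unstructured data into a plurality of subsegments according to a preset segmentation threshold value, and extracting all subsegments containing the preset client name;

and summarizing the rows corresponding to the preset client names and the subsegments containing the preset client names to obtain the original information.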

In this embodiment, the web page content may be divided by obtaining the source code of the web page through its URL and parsing the source code to obtain the web page structure. The web page may contain only text (unstructured data) or only tables (structured data), but more often mixes images and text. For a web page that mixes images and text, the image part is removed first, and the text part is then classified. When the web page is queried by client name, the extraction can be carried out by means of knowledge extraction.
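Extracting rows from structured data and subsegments from unstructured data by client name can be sketched as follows. The sketch assumes that the preset segmentation threshold is a maximum subsegment length in characters and that structured data is a list of rows of cell strings; both are assumptions, and the function names are hypothetical.

```python
def split_segments(text, max_len):
    """Cut unstructured text into consecutive subsegments of at most max_len characters."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


def extract_original_info(rows, text, client_name, max_len):
    """Collect table rows and text subsegments that mention client_name."""
    matched_rows = [row for row in rows if client_name in row]
    matched_segments = [s for s in split_segments(text, max_len) if client_name in s]
    return matched_rows + matched_segments


# Hypothetical page content.
rows = [["Zhang San", "35"], ["Li Si", "40"]]
text = "Zhang San works in sales. Li Si works in finance."
info = extract_original_info(rows, text, "Zhang San", 25)
```

A fixed-length split can cut a name in half at a subsegment boundary; splitting on sentence or paragraph boundaries would avoid that, at the cost of uneven subsegment sizes.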

S12, loading element rules of customer information, and judging whether elements in the original information conform to the element rules;

The element rules are the limiting conditions on the client information, such as an age of 30 to 35 or an occupation of officer.

S13, marking the original information which accords with the element rules as alternative information, and otherwise not marking it. Original information accords with the element rules only if it satisfies every element rule; if any one element rule is not satisfied, it is not marked.

And S14, collecting all the alternative information to obtain the client information set.
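Steps S12 to S14 amount to keeping only the items that satisfy every element rule. A minimal sketch follows, assuming each item of original information is a dictionary and each element rule is a predicate; the rule values mirror the age and occupation example above, and all names are hypothetical.

```python
def matches_all(info, rules):
    """True only if info satisfies every element rule."""
    return all(rule(info) for rule in rules)


def build_client_set(original_infos, rules):
    """Mark conforming items as alternative information and collect them."""
    return [info for info in original_infos if matches_all(info, rules)]


# Hypothetical element rules: age 30 to 35, occupation "officer".
rules = [
    lambda info: 30 <= info.get("age", -1) <= 35,
    lambda info: info.get("occupation") == "officer",
]
infos = [
    {"age": 33, "occupation": "officer"},
    {"age": 33, "occupation": "teacher"},
    {"age": 40, "occupation": "officer"},
]
client_set = build_client_set(infos, rules)
```

Only the first item passes both rules; the other two each fail one rule and are therefore not marked.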

According to the embodiment, the client information set is established by effectively analyzing the original information, so that the efficiency of analyzing the original information is improved.

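In one embodiment, before step S2 of extracting data in the customer information set that is not normally distributed with other data as irregular data, and calculating a polymerization degree value between the irregular data and each other regular data in the customer information set, the method further includes:

classifying the client information in the client information set according to element categories, and respectively establishing corresponding information distribution coordinate graphs of the classified client information according to the attributes of the element categories;

determining irregular points according to the distance between each point in the information distribution coordinate graph;

and taking the client information corresponding to the irregular points as the irregular data.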

The client information elements in this embodiment mainly refer to the client's age, occupation, insurance type, telephone number home location, and the like, which help identify the client. Taking an age coordinate system as an example, suppose 5 items of client information correspond to the ages 33, 35, 28, 54, and 40. The abscissa of the coordinate system is the age, arranged from small to large, and the ordinate is the client's name; the irregular point in this coordinate system is the client aged 54.
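The age example can be checked with a distance-based test: a point is treated as irregular when its nearest neighbour on the age axis is farther away than some gap. The nearest-neighbour criterion and the gap of 10 years are assumptions for illustration; the application only says that irregular points are determined from the distances between points.

```python
def nearest_gap(values, i):
    """Distance from values[i] to its closest other value."""
    return min(abs(values[i] - v) for j, v in enumerate(values) if j != i)


def irregular_points(values, gap_threshold):
    """Values whose nearest neighbour lies farther away than gap_threshold."""
    if len(values) < 2:
        return []  # a single point has no neighbours to compare against
    return [v for i, v in enumerate(values) if nearest_gap(values, i) > gap_threshold]


ages = [33, 35, 28, 54, 40]
outliers = irregular_points(ages, 10)
```

The nearest-neighbour gaps are 2, 2, 5, 14, and 5 years, so only the client aged 54 exceeds the 10-year gap.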

Irregular data can be effectively identified by utilizing the information distribution coordinate graph, so that the speed of automatically searching information by a computer is effectively improved.

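In an embodiment, step S3 of performing data screening on the sample data in the customer information set according to the comparison result between the polymerization degree value and a preset polymerization degree threshold includes:

taking the position of any regular data on a classification coordinate system as a circle center, drawing a circle A with the value of the corresponding preset category parameter as the radius, calculating the distance from the irregular data to the circle center, and taking the distance as the polymerization degree value;

if the polymerization degree value is larger than the preset polymerization degree threshold, determining that the irregular data is high polymerization data, and otherwise determining that the irregular data is low polymerization data;

drawing a circle B with the position of the high polymerization data on the classification coordinate system as the circle center and the category parameter as the radius, and judging whether the low polymerization data is in the circle B;

and if the low polymerization data is in the circle B, packaging the low polymerization data and the high polymerization data into sample data, and otherwise discarding the low polymerization data.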

In this embodiment, the category parameters are preset for each category. For example, if personal insurance is taken as a major category that includes health insurance, life insurance, and the like, the value of the category parameter for personal insurance is the sum of the values for health insurance and life insurance.

The polymerization degree threshold is obtained according to historical data statistics, and the polymerization degree reflects the similarity between two data, namely the larger the polymerization degree is, the larger the similarity between the data is, and then the two data can be aggregated into one class.
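The circle A and circle B screening steps of the method can be sketched as below. The sketch follows the wording of the steps literally: the polymerization degree value is the distance from an irregular point to the position of a regular datum, and a value above the threshold marks the point as high polymerization data. How that direction of comparison relates to "larger polymerization degree means greater similarity" is not resolved here. The function name, the choice of a single regular datum as the centre, and the sample coordinates are assumptions.

```python
import math


def screen_irregular(center, irregular, radius, threshold):
    """Screen irregular points with the circle A / circle B steps.

    center    -- position of one regular datum (centre of circle A)
    radius    -- category parameter, used as the radius of circle B
    threshold -- preset polymerization degree threshold
    """
    # Polymerization degree value = distance from the irregular point to the centre.
    high = [p for p in irregular if math.dist(p, center) > threshold]
    low = [p for p in irregular if math.dist(p, center) <= threshold]
    # Keep a low point only if it falls inside circle B drawn around some high point.
    kept_low = [p for p in low if any(math.dist(p, h) <= radius for h in high)]
    return high + kept_low


# Hypothetical coordinates on a two-dimensional classification coordinate system.
sample_data = screen_irregular(
    center=(0, 0),
    irregular=[(5, 0), (3, 0), (0, 1)],
    radius=2.5,
    threshold=4,
)
```

In this toy case (5, 0) is high polymerization data, (3, 0) lies inside its circle B and is kept, and (0, 1) is discarded.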

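In an embodiment, after step S5 of obtaining a centroid A and a sample data A corresponding to the minimum distance among the distances, classifying the sample data A and the centroid A into one class, and so on until all data in the preprocessed customer information set are classified to obtain a classification result, the method further includes:

acquiring a preset contact strategy corresponding to the classification result, and contacting the corresponding client based on the contact strategy;

judging whether the client corresponding to the original information is successfully contacted according to the feedback information of the client;

if the contact fails, judging whether the original information contains abnormal data according to a preset abnormal data rule;

if abnormal data is contained, removing the abnormal data and then reclassifying; otherwise, marking the client as a difficult client awaiting manual processing.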

Specifically, the contact strategies corresponding to the classification results can generally be divided into automatic outbound calls and manual outbound calls. In an automatic outbound call, the client's telephone number is dialed automatically; numbers that are unassigned, suspended, or cancelled can thereby be filtered out. In a manual outbound call, an operator dials the telephone numbers that remain after this machine filtering and uses a friendly phone script to determine whether the person reached is the target client.

A classification result of "the client in person" corresponds to the automatic outbound call, and a classification result of "relatives and friends" corresponds to the manual outbound call.

In this embodiment, the feedback information mainly indicates whether the person found by the above scheme is the target client. For example, if the person is asked "Are you Zhang San?" and answers "yes", the target client has been found; if the answer is "no", the target client has not been found. The reclassification method is the same as the method described above and is not repeated here.

This embodiment introduces a feedback mechanism to verify the clustering scheme obtained after processing the irregular data. If a client cannot be contacted, the polymerization degree threshold and other parameters are adjusted through manual intervention, so that the corresponding client can be contacted quickly and accurately.
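The feedback branch (success, reclassify after removing abnormal data, or hand over for manual processing) can be sketched as a small decision function. The record layout, the `is_abnormal` predicate standing in for the preset abnormal data rule, and the returned labels are all hypothetical.

```python
def handle_feedback(record, contacted, is_abnormal):
    """Decide the follow-up for one client record after a contact attempt.

    contacted   -- True if the client's feedback confirms a successful contact
    is_abnormal -- predicate applying the preset abnormal data rule to one field value
    """
    if contacted:
        return "done", record
    cleaned = {k: v for k, v in record.items() if not is_abnormal(v)}
    if len(cleaned) < len(record):
        return "reclassify", cleaned   # abnormal fields removed; cluster again
    return "manual", record            # nothing abnormal: difficult client


# Hypothetical abnormal data rule: empty or missing values.
def empty_value(value):
    return value is None or value == ""


outcome = handle_feedback({"name": "Zhang San", "phone": ""}, False, empty_value)
```

Here the empty phone field is treated as abnormal, so the record is cleaned and sent back for reclassification.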

The technical features mentioned in any of the above corresponding embodiments or implementations are also applicable to the embodiment corresponding to fig. 3 in the present application, and similar details are not repeated below.

The information classification method based on the k-means algorithm in the present application is described above, and the information classification device executing the k-means algorithm is described below.

Fig. 3 is a block diagram of an information classification apparatus based on a k-means algorithm, which is applicable to information classification based on a k-means algorithm. The information classification device based on the k-means algorithm in the embodiment of the present application can implement the steps corresponding to the information classification method based on the k-means algorithm executed in the embodiment corresponding to fig. 1. The function realized by the information classification device based on the k-means algorithm can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the above functions, which may be software and/or hardware.

In one embodiment, an information classification apparatus based on k-means algorithm is provided, as shown in fig. 3, including the following modules:

an information set establishing module 10, configured to obtain original information, and filter the original information to obtain a client information set;

the polymerization degree obtaining module 20 is configured to extract data that is not normally distributed with other data in the client information set as irregular data, and calculate a polymerization degree value between the irregular data and each other regular data in the client information set;

the preprocessing module 30 is configured to perform data screening on the sample data in the client information set according to a comparison result between the polymerization degree value and a preset polymerization degree threshold;

the sample analysis module 40 is configured to randomly extract a plurality of sample data in the customer information set after data screening as centroids, and calculate distances between the remaining sample data in the customer information set after preprocessing and the centroids;

and the result generating module 50 is configured to obtain a centroid a and sample data a corresponding to a minimum distance in the distances, classify the sample data a and the centroid a into one class, and so on until all data in the preprocessed customer information set are classified, so as to obtain a classification result.

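In one embodiment, the information set creating module is further configured to:

mark the original information which accords with the element rules as alternative information, and otherwise not mark it; and collect all the alternative information to obtain the client information set.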

In one embodiment, a computer device is provided, the computer device includes a memory and a processor, the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the information classification method based on k-means algorithm in the above embodiments.

In one embodiment, a storage medium storing computer readable instructions is provided, which when executed by one or more processors, cause the one or more processors to perform the steps of the information classification method based on k-means algorithm in the above embodiments. The storage medium may be a nonvolatile storage medium or a volatile storage medium, and the present application is not limited in particular.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, or the like.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-described embodiments merely illustrate some implementations of the present application. Although they are described specifically and in detail, they are not to be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. An information classification method based on a k-means algorithm is characterized by comprising the following steps:

2. The information classification method based on the k-means algorithm as claimed in claim 1, wherein the obtaining of the original information and the filtering of the original information to obtain the client information set comprises:

3. The information classification method based on the k-means algorithm as claimed in claim 2, wherein the extracting of the original information from the web page according to the preset filtering rule comprises:

4. The information classification method based on the k-means algorithm according to claim 2 or 3, characterized in that before the extracting data that is not normally distributed with other data in the customer information set as irregular data and calculating the aggregation degree value between the irregular data and each other regular data in the customer information set, the method further comprises:

5. The information classification method based on the k-means algorithm according to claim 1, wherein the data filtering of the sample data in the client information set according to the comparison result between the polymerization degree value and a preset polymerization degree threshold value comprises:

6. The information classification method based on the k-means algorithm according to claim 5, wherein after the obtaining of the centroid A and the sample data A corresponding to the minimum distance among the distances, and the grouping of the sample data A and the centroid A into one class, and so on until the classification of all data in the preprocessed customer information set is completed, and the classification result is obtained, the method further comprises:

7. An information classification device based on a k-means algorithm is characterized by comprising the following components:

8. The information classification apparatus based on the k-means algorithm according to claim 7, wherein the information set creating module is further configured to:

9. An information classifying device based on k-means algorithm, comprising a memory and a processor, wherein the memory stores computer readable instructions, characterized in that the computer readable instructions, when executed by the processor, cause the processor to execute the information classifying method based on k-means algorithm according to any one of claims 1 to 6.

10. A storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the method of information categorization based on k-means algorithm of any of claims 1 to 6.