CN111475652B - Data mining method and system - Google Patents

Data mining method and system

Info

Publication number
CN111475652B
CN111475652B (application number CN202010441154.XA)
Authority
CN
China
Prior art keywords
sample
data
target
sample data
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010441154.XA
Other languages
Chinese (zh)
Other versions
CN111475652A (en)
Inventor
谢杨易
崔恒斌
潘寅旭
杨圆圆
毛佩瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202010441154.XA priority Critical patent/CN111475652B/en
Publication of CN111475652A publication Critical patent/CN111475652A/en
Application granted granted Critical
Publication of CN111475652B publication Critical patent/CN111475652B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/332: Query formulation
    • G06F 16/3325: Reformulation based on results of preceding query
    • G06F 16/3326: Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, document sets, document terms or passages
    • G06F 16/3329: Natural language query formulation or dialogue systems
    • G06F 16/35: Clustering; Classification
    • G06F 16/355: Class or cluster creation or modification
    • G06F 2216/00: Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F 16/00 and subgroups
    • G06F 2216/03: Data mining
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The data mining method and system provided by this specification cluster massive sample data with a clustering algorithm to generate a plurality of sample clusters, where the sample data within each cluster correspond to similar topics; select, from among the sample clusters, at least one cluster containing the largest amount of sample data as a trending topic; and classify the sample data within the trending topic through an intention recognition model, identifying the valuable sample data among them. The method and system can cluster massive data so that common problems are solved in batches, greatly improving working efficiency; at the same time, they can mine information valuable to a product manager from the massive data, helping the product manager resolve user demands from the user's perspective, substantially improving the product experience and raising user satisfaction.

Description

Data mining method and system
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a method and system for data mining.
Background
With the rapid development of information and internet technologies, people's work, daily life, and study are closely tied to the network, and handling business online brings great convenience to life and work. Typically, a user transacts online through the corresponding application client or website. While doing so, the user may submit opinion feedback about problems encountered during use and about the experience of using the product. For the massive opinion feedback data submitted by users, solving each reported problem manually, one by one, is costly, time-consuming, and slow. Moreover, solutions devised from human experience alone do not necessarily satisfy the users' real demands: they rely excessively on the personal experience of the product manager and easily overlook information that users consider important. In fact, many user-submitted problems are similar and common. Mining these common problems from massive opinion feedback data and solving them in batches is an important means of improving efficiency through automatic mining of product problems. In addition, many users offer suggestions for solving a problem when reporting it, so mining suggestions that help solve problems from massive opinion feedback data is an important means of helping product managers resolve user demands appropriately.
Thus, there is a need for a more efficient method and system for data mining that mines commonalities and valuable information from a vast amount of data.
Disclosure of Invention
This specification provides a more efficient data mining method and system that mine common problems from massive data so they can be solved in batches, and that mine problem-solving suggestions from the massive data to help product managers resolve user demands in a suitable way.
According to the data mining method and system provided by this specification, massive text data are clustered by a clustering algorithm to generate a plurality of clusters, the text data within each cluster corresponding to similar topics, so that common problems are mined from the massive data; a product manager can review the text data under each cluster according to its title, and the cluster containing the largest number of texts is a trending topic. After the trending topics are selected, the method and system classify the text data within them through an intention recognition model and identify valuable text data, such as product-suggestion texts or competitor-comparison texts. The method and system can cluster massive data so that common problems are solved in batches, greatly improving working efficiency; at the same time, they can mine information valuable to the product manager from the massive data, helping the product manager resolve user demands from the user's perspective, substantially improving the product experience and raising user satisfaction.
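For illustration only, the following minimal sketch strings these stages together using the lighter TF-IDF and K-means options that this specification also names; the example texts, the cluster count, and the scikit-learn implementation are assumptions rather than the prescribed embodiment:

```python
# Minimal self-contained sketch of the disclosed pipeline using TF-IDF vectors
# and K-means clustering (both named in this specification as options).
# Example texts and n_clusters are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "Huabei credit limit too low, please raise it",
    "How can I raise my Huabei credit limit?",
    "Let me set the Huabei repayment date myself",
    "The app crashes when I open the bill page",
]
vectors = TfidfVectorizer().fit_transform(texts)               # sample vectors
labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)  # sample clusters

clusters = {}
for text, label in zip(texts, labels):
    clusters.setdefault(label, []).append(text)
trending = max(clusters.values(), key=len)   # cluster with the most texts
print(trending)  # these texts would next go to the intention recognition model
```

The detailed description below develops the same stages with BERT feature vectors, HDBSCAN clustering, and an intention recognition model.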
In a first aspect, the present specification provides a method of data mining, comprising: acquiring N sample data, wherein N is an integer greater than 1, the N sample data are N text data, and each sample datum includes at least one topic; generating N sample vectors, wherein each sample vector corresponds to one of the sample data and represents the at least one topic of the corresponding sample datum; clustering the N sample vectors to generate M sample clusters, wherein each sample cluster corresponds to a plurality of sample data, the plurality of sample vectors in each sample cluster correspond to similar topics, and M is a positive integer; selecting a target sample cluster from the M sample clusters; inputting a plurality of sample data corresponding to the target sample cluster into an intention recognition model for classification, and generating at least one target sample data set, wherein each target sample data set corresponds to one target intention category; and outputting at least a portion of the target sample data in the at least one target sample data set as a data mining result.
In some embodiments, the clustering the N sample vectors to generate M sample clusters includes: calculating the distance between the N sample vectors; generating a distance matrix, wherein the rows and the columns of the distance matrix respectively correspond to the N sample vectors, and the value of any element in the matrix is the distance between the sample vector corresponding to the row in which the element is positioned and the sample vector corresponding to the column in which the element is positioned; based on the distance matrix, dividing the N sample vectors into M sample clusters by adopting a clustering algorithm; and determining representative sample data for each of the M sample clusters, the representative sample data comprising a representative topic for the sample cluster.
In some embodiments, the determining representative sample data for each of the M sample clusters comprises: determining a center vector of a current sample cluster; determining a distance of each candidate sample vector in the current sample cluster from the center vector; and selecting the candidate sample data corresponding to the candidate sample vector nearest to the center vector as the representative sample data of the current sample cluster.
In some embodiments, the clustering algorithm comprises: at least one of an HDBSCAN algorithm, a DBSCAN algorithm, a K-means algorithm, and a spectral clustering algorithm.
In some embodiments, the selecting a target sample cluster from the M sample clusters includes: selecting, as the target sample clusters, the first m sample clusters with the largest numbers of sample vectors from the M sample clusters, where m is a positive integer no greater than M.
In some embodiments, the inputting of the plurality of sample data corresponding to the target sample cluster into the intention recognition model for classification to generate at least one target sample data set includes: inputting the sample data corresponding to each sample vector in the target sample cluster into the intention recognition model, classifying the sample data corresponding to the target sample cluster, and generating a plurality of sample data sets, wherein each sample data set corresponds to an intention category and a corresponding classification value, and the intention categories include the target intention category; and selecting the at least one target sample data set corresponding to the target intention category from the plurality of sample data sets.
In some embodiments, the intent recognition model is trained based on natural language processing techniques using historical sample data and corresponding classification labels as training samples, and is configured to analyze the input natural language text data to obtain intent information in the corresponding natural language text content.
In some embodiments, the outputting at least a portion of the target sample data in the at least one set of target sample data as a data mining result comprises: acquiring a target classification value corresponding to each target sample data in each target sample data set; and selecting at least one target sample data with the highest corresponding target classification value from each target sample data set as the data mining result to output.
In some embodiments, the generating N sample vectors comprises: and inputting the N sample data into a feature vector extraction model to generate the N sample vectors, wherein the feature vector extraction model comprises at least one of a BERT model, a BOW model, a TF-IDF model and an LSTM model.
In a second aspect, the present specification provides a system for data mining comprising at least one storage medium and at least one processor, the at least one storage medium comprising at least one set of instructions for data mining; the at least one processor is communicatively coupled to the at least one storage medium, wherein the at least one processor reads the at least one instruction set and performs the method of data mining described herein as directed by the at least one instruction set when the system is operating.
Additional functions of the data mining methods and systems provided herein will be set forth in part in the description that follows, and in part will be apparent to those of ordinary skill in the art from the description and examples presented. The inventive aspects of the data mining methods, systems, and storage media provided herein may be fully explained by practicing or using the methods, apparatuses, and combinations described in the detailed examples below.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present description, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates a system diagram of a data mining system provided in accordance with an embodiment of the present description;
FIG. 2 shows a schematic diagram of a server architecture for data mining provided in accordance with an embodiment of the present description;
FIG. 3 illustrates a flow chart of a method of data mining provided in accordance with an embodiment of the present description;
FIG. 4 illustrates a flowchart of a clustering method provided in accordance with an embodiment of the present description.
Detailed Description
The following description is presented to enable one of ordinary skill in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the disclosure. Thus, the present description is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.
The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting. For example, as used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. The terms "comprises," "comprising," or "includes" when used in this specification, are taken to specify the presence of stated integers, steps, operations, elements, or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, or groups thereof.
These and other features of the present specification, as well as the operation and function of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent from the following description, all of which forms a part of this specification, with reference to the accompanying drawings. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended to limit the scope of the specification. It should also be understood that the drawings are not drawn to scale.
The flowcharts used in this specification illustrate operations implemented by systems according to some embodiments in this specification. It should be clearly understood that the operations of the flow diagrams may be implemented out of order. Rather, operations may be performed in reverse order or concurrently. Further, one or more other operations may be added to the flowchart. One or more operations may be removed from the flowchart.
In the prior art, while transacting business through an application client or website, a user may submit opinion feedback on problems encountered during use and on the experience of using the product. The opinion feedback may concern a system problem met while using the application, such as the system freezing, crashing, or responding slowly; a functional problem, such as one met while using the Alipay™ APP, e.g., "the Huabei credit limit is insufficient, how can it be raised?" or "how do I change the Huabei repayment date?"; a suggestion for the application, such as "please open a function for setting the Huabei repayment date myself"; or a comparison with competing products, such as "the XX function of XXX software is more convenient." The amount of user opinion feedback data is enormous. In order to quickly discover common problems, and clues or hints for solving them, from the massive data, this specification provides a method and system for mining massive opinion feedback data. Data mining here refers to processing massive big data, mining key content, extracting common hot problems, discovering new hot spots, and so on, thereby refining high-value content and helping product managers resolve user demands in a suitable way.
In a first aspect, the present specification provides a method of data mining; in a second aspect, it provides a system for data mining. FIG. 1 shows a schematic diagram of a system 100 for data mining (hereinafter the system 100). The system 100 may include a server 200, a client 300, a network 120, and a database 150.
The server 200 may store data or instructions for performing the methods of data mining described herein and may execute or be used to execute the data or instructions.
As shown in FIG. 1, the user 110 is a user of the client 300, and the client 300 is the device through which the user 110 accesses the server 200. The client 300 is communicatively connected to the server 200. In some embodiments, the client 300 may be installed with one or more applications (APPs). The APPs can provide the user 110 with an interface and the ability to interact with the outside world via the network 120. The APPs include, but are not limited to: chat APPs, shopping APPs, video APPs, and financial APPs, for example, Alipay™, Taobao™, DingTalk™, JD™, or the APPs of financial service institutions such as banks and wealth management platforms. The client 300 is loaded with a target APP corresponding to the server 200. The user 110 may submit opinion feedback on the target APP through the client 300, and the opinion feedback may be transmitted to the server 200 via the network 120. In some embodiments, the client 300 may include a mobile device 300-1, a tablet computer 300-2, a notebook computer 300-3, a built-in device of a motor vehicle 300-4, or the like, or any combination thereof. In some embodiments, the mobile device 300-1 may include a smart home device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof. In some embodiments, the smart home device may include a smart television, a desktop computer, or the like, or any combination thereof. In some embodiments, the smart mobile device may include a smartphone, a personal digital assistant, a gaming device, a navigation device, etc., or any combination thereof. In some embodiments, the virtual reality device or augmented reality device may include a virtual reality helmet, virtual reality glasses, a virtual reality patch, an augmented reality helmet, augmented reality glasses, an augmented reality patch, or the like, or any combination thereof. For example, the virtual reality or augmented reality device may include Google Glass, a head-mounted display, a Gear VR, and the like. In some embodiments, the built-in devices in the motor vehicle 300-4 may include an on-board computer, an on-board television, and the like. In some embodiments, the client 300 may be a device with positioning technology for locating the position of the client 300.
The network 120 may facilitate the exchange of information or data. As shown in FIG. 1, the client 300, the server 200, and the database 150 may connect to the network 120 and transmit information or data to one another through it. For example, the server 200 may obtain opinion feedback data from the client 300 through the network 120. In some embodiments, the network 120 may be any type of wired or wireless network, or a combination thereof. For example, the network 120 may include a cable network, a wired network, a fiber-optic network, a telecommunications network, an intranet, the internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or the like. In some embodiments, the network 120 may include one or more network access points. For example, the network 120 may include wired or wireless network access points, such as base stations or internet exchange points 120-1, 120-2, ..., through which one or more components of the client 300, the server 200, and the database 150 may connect to the network 120 to exchange data or information.
Database 150 may store data or instructions. In some embodiments, database 150 may store data obtained from the server 200 or the client 300. In some embodiments, database 150 may store data or instructions that the server 200 may execute, or that are used, to perform the data mining method described in this specification. In some embodiments, database 150 may store the opinion feedback data of all users. The server 200 and the client 300 may have access to database 150 and may access the data or instructions stored in it via the network 120. In some embodiments, database 150 may be directly connected to the server 200 and the client 300. In some embodiments, database 150 may be part of the server 200. In some embodiments, database 150 may include mass storage, removable storage, volatile read-write memory, read-only memory (ROM), or the like, or any combination thereof. Exemplary mass storage may include non-transitory storage media such as magnetic disks, optical disks, and solid state drives. Removable storage may include flash drives, floppy disks, optical disks, memory cards, zip disks, tape, and the like. Exemplary volatile read-write memory may include random access memory (RAM); the RAM may include dynamic RAM (DRAM), double data rate synchronous dynamic RAM (DDR SDRAM), static RAM (SRAM), thyristor RAM (T-RAM), zero-capacitor RAM (Z-RAM), and the like. The ROM may include mask ROM (MROM), programmable ROM (PROM), programmable erasable ROM (PEROM), electrically erasable programmable ROM (EEPROM), compact disc ROM (CD-ROM), digital versatile disc ROM, and the like. In some embodiments, database 150 may be implemented on a cloud platform. By way of example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, or the like, or any combination of the above.
As shown in FIG. 1, a user 110 enters opinion feedback data in the target APP on a client 300, and the opinion feedback data are transmitted to the server 200 through the network 120; the server 200 executes the instructions of the data mining method stored in the server 200 or the database 150, mining common problems and high-value information from the data. For example, the target APP may be the Alipay™ APP: users 110 of the Alipay™ APP may comment on or make suggestions for the Alipay™ APP, or encounter problems while using it. The Alipay™ server 200 mines, from the massive data fed back by users 110, the common problems and the clues to solving them, thereby helping product managers resolve the users' common demands in the most appropriate way.
Fig. 2 shows a schematic diagram of a data mining server 200. The server 200 may perform the methods of data mining described herein. The method of data mining is described elsewhere in this specification. For example, the method of data mining P100 is presented in the description of fig. 3 and 4.
As shown in fig. 2, the server 200 includes at least one storage medium 230 and at least one processor 220. In some embodiments, server 200 may also include a communication port 250 and an internal communication bus 210. Also, server 200 may also include I/O component 260.
Internal communication bus 210 may connect the various system components including storage medium 230 and processor 220.
The I/O component 260 supports input/output between the server 200 and other components (e.g., the client 300).
Storage medium 230 may include a data storage device. The data storage device may be a non-transitory storage medium or a transitory storage medium. For example, the data storage devices may include one or more of magnetic disk 232, read Only Memory (ROM) 234, or Random Access Memory (RAM) 236. The storage medium 230 further includes at least one set of instructions stored in the data storage device. The instructions are computer program code that may include programs, routines, objects, components, data structures, procedures, modules, etc. that perform the methods of data mining provided herein.
The communication port 250 is used for data communication between the server 200 and the outside world. For example, the server 200 may connect to the network 120 through the communication port 250 to receive the opinion feedback data submitted by users 110 on the target APP (e.g., Alipay™ or Taobao™), and may further return the results of the data mining through the communication port 250.
The at least one processor 220 is communicatively coupled to the at least one storage medium 230 via the internal communication bus 210 and is configured to execute the at least one instruction set. When the system 100 runs, the at least one processor 220 reads the at least one instruction set and, as directed by it, performs the data mining method P100 provided herein; the processor 220 may perform all the steps involved in the method P100. The processor 220 may take the form of one or more processors. In some embodiments, the processor 220 may include one or more hardware processors, such as microcontrollers, microprocessors, reduced instruction set computers (RISC), application-specific integrated circuits (ASIC), application-specific instruction-set processors (ASIP), central processing units (CPU), graphics processing units (GPU), physics processing units (PPU), microcontroller units, digital signal processors (DSP), field-programmable gate arrays (FPGA), advanced RISC machines (ARM), programmable logic devices (PLD), any circuit or processor capable of executing one or more functions, or the like, or any combination thereof. For illustrative purposes only, a single processor 220 is depicted in the server 200 in this description. It should be noted, however, that the server 200 may also include multiple processors, and the operations or method steps disclosed in this specification may therefore be performed by one processor, as described, or jointly by multiple processors. For example, if the processor 220 of the server 200 performs steps A and B in this specification, it should be understood that steps A and B may also be performed jointly or separately by two different processors 220 (e.g., a first processor performs step A and a second processor performs step B, or the first and second processors perform steps A and B together).
Although the above structure describes the server 200, the structure is also applicable to the client 300.
Fig. 3 shows a flow chart of a method P100 of data mining. As previously described, the server 200 may perform the method P100 of data mining provided in the present specification. Specifically, the processor 220 in the server 200 may perform the method P100 of data mining provided in the present specification. The method P100 may include performing, by at least one processor 220:
s110: n sample data are acquired.
As described above, the method P100 is used for mining massive opinion feedback data. The N sample data may be the opinion feedback data of users of the target APP corresponding to the server 200, each piece of opinion feedback being one sample datum, where N is an integer greater than 1. The N sample data may be text data corresponding to opinion feedback expressed in N pieces of natural language. For example, the target APP may be the Alipay™ APP, and the sample data may be the text data corresponding to users' natural-language opinion feedback on the Alipay™ APP. A sample datum may be text describing a system problem fed back by a user, such as freezing, crashing, or slow response; natural-language text describing a problem encountered while using the Alipay™ APP, such as "the Huabei credit limit is insufficient, how can it be raised?" or "how do I change the Huabei repayment date?"; text carrying a user's suggestion for the Alipay™ APP, such as "please open a function for setting the Huabei repayment date myself"; or text comparing competing products, such as "the XX function of XXX software feels better to use." Accordingly, each sample datum includes at least one topic. For example, "the Huabei credit limit is insufficient, how can it be raised?" includes the topic Huabei credit limit, and "how do I change the Huabei repayment date?" includes the topic Huabei repayment date. Sometimes the user 110 raises several questions in one piece of opinion feedback: for example, "the Huabei credit limit is insufficient, how can it be raised, and how do I change the Huabei repayment date?" includes two topics, the Huabei credit limit and the Huabei repayment date.
S120: n sample vectors are generated.
After the N sample data are obtained, vectorization needs to be performed on the N sample data to obtain N sample vectors corresponding to them, where each sample vector corresponds to one of the sample data. As noted above, each sample datum includes at least one topic; therefore, each sample vector is a vector representation of the at least one topic of the corresponding sample datum. For example, the direction of a sample vector encodes the at least one topic, and the length of the sample vector measures the strength of the at least one topic in the corresponding sample datum. Specifically, the vectorization of the N sample data may be: inputting the N sample data into a feature vector extraction model to generate the N sample vectors. The feature vector extraction model can turn each sample datum into a vector representation. This disclosure does not limit the manner of vectorizing the sample data, and one of ordinary skill in the art may select a feature vector extraction model according to actual needs. For example, the feature vector extraction model may be at least one of a BERT model, a BOW model, a TF-IDF model, and an LSTM model. The BERT model is pre-trained on a large-scale unlabeled corpus and produces a semantic representation of the input text data, i.e., a feature vector that contains the contextual semantic information of the text. The BOW model constructs a vector representation of text data based on how frequently each word occurs in the current text and in the corpus. The TF-IDF model constructs a vector representation of text data based on how frequently each word occurs in the text data, weighted by its inverse document frequency. The LSTM model is a long short-term memory network, a kind of recurrent neural network: the text data are fed into the LSTM model in positional order, yielding a feature vector that contains the semantic information of the text. The BERT, BOW, TF-IDF, and LSTM models are all relatively mature techniques and are not detailed here.
The following description takes the BERT model as an example of the feature vector extraction model. The BERT model is composed of a stack of encoding layers, each of which can be understood as a black box that converts the semantic vector of each character or word in the input text into an enhanced semantic vector of the same length that incorporates the semantics of the entire context. For example, if the BERT model converts each character or word of the input text sample into a 512-dimensional semantic vector, then feeding it through an encoding layer outputs a 512-dimensional semantic vector of the same size enriched with context. The output of the BERT model therefore contains the contextual semantic information of the text.
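As an illustrative sketch of this step (the specific pretrained model, the Hugging Face transformers library, and mean pooling are assumptions; the specification prescribes only that some feature vector extraction model be used), sentence-level sample vectors can be derived from a BERT encoder as follows:

```python
# Illustrative sketch: one fixed-length sample vector per feedback text from a
# pretrained BERT encoder. Model name and mean pooling are assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (batch, seq_len, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)    # ignore padding positions
    return (hidden * mask).sum(1) / mask.sum(1)     # mean-pooled sample vectors

sample_vectors = embed(["花呗额度不足怎么提高", "如何更改花呗还款日"])
```

Mean pooling over token vectors is one common way to collapse BERT's per-token output into a single sentence vector; other pooling choices would serve equally well here.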
S130: and clustering the N sample vectors to generate M sample clusters.
The clustering means that sample vectors corresponding to sample data with similar text semantics are combined together to generate a plurality of sample clusters. Each sample cluster may include a plurality of sample vectors, wherein the plurality of sample vectors respectively correspond to the plurality of sample data. Thus, each sample cluster may be a collection of multiple sample data. The text semantic similarity of the sample data corresponding to the sample vectors inside each sample cluster generated after clustering is higher, and the text semantic similarity of the sample data corresponding to the sample vectors between different sample clusters after clustering is lower. As previously described, each sample data includes at least one topic. Thus, the plurality of sample vectors inside each sample cluster generated after clustering correspond to similar topics, and the topic similarity corresponding to the plurality of sample vectors between different sample clusters after clustering is low. FIG. 4 shows a flow chart of a method of clustering provided in accordance with an embodiment of the present description. As shown in fig. 4, step S130 may include, by at least one processor 220 of the server 200, performing:
S132: the distances between the N sample vectors are calculated.
As described above, when the N sample vectors corresponding to the N sample data are clustered, the text semantic similarity between the N sample data needs to be computed, which can be done by calculating the distances between the N sample vectors. The server 200 calculates the pairwise distances among the N sample vectors, thereby obtaining the distance between each sample datum and every other sample datum, and merges sample vectors whose distance is smaller than a threshold into one sample cluster. The smaller the distance between two sample vectors, the higher the text semantic similarity between the corresponding sample data, and the more similar the corresponding topics. The distance may be computed as at least one of a cosine distance, a Manhattan distance, a Mahalanobis distance, and a Euclidean distance; this specification does not restrict the choice, and a person skilled in the art can select a calculation method according to actual needs. We further describe the calculation of the distance between the N sample vectors taking the cosine distance as an example. The cosine distance calculates the distance (namely, the semantic similarity) between sample vectors according to the formula:
$$\cos(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}} \tag{1}$$

where $A_i$ is the $i$-th element of sample vector $A$ and $B_i$ is the $i$-th element of sample vector $B$. The cosine distance considers the included angle between the two vectors, taking the inner product of the two vectors (the sum of the products of corresponding elements) divided by the product of their moduli as the result. The similarity computed by the cosine distance reflects the difference in direction between the two vectors, i.e., their distance in direction.
Of course, other distance calculation methods may be selected to calculate the distance between the N sample vectors. For example, the euclidean distance may take into account the difference in value of the two vectors, i.e., the distance in value of the two vectors.
S134: a distance matrix is generated.
The distance matrix is generated from the distances between the N sample vectors. The rows and columns of the distance matrix correspond to the N sample vectors, respectively, and the value of any element of the matrix is the distance between the sample vector corresponding to the element's row and the sample vector corresponding to its column. The distance of each sample datum from every other sample datum can thus be read from the distance matrix. For example, the distance matrix may be:
$$D = \begin{pmatrix} d_{11} & d_{12} & \cdots & d_{1N} \\ d_{21} & d_{22} & \cdots & d_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ d_{N1} & d_{N2} & \cdots & d_{NN} \end{pmatrix} \tag{2}$$

where $d_{ij}$ represents the distance between the $i$-th sample vector and the $j$-th sample vector.
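A minimal sketch of steps S132 and S134, assuming NumPy and the cosine distance of formula (1), pairing every two sample vectors and assembling the distance matrix of formula (2):

```python
# Minimal sketch of S132-S134: pairwise cosine distances assembled into the
# distance matrix D of formula (2). NumPy is an implementation assumption.
import numpy as np

def cosine_distance_matrix(vectors: np.ndarray) -> np.ndarray:
    """vectors: an (N, d) array holding the N sample vectors row by row."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)   # guard against zero vectors
    return 1.0 - unit @ unit.T                     # d_ij = 1 - cos(v_i, v_j)

vectors = np.random.rand(1000, 512)   # stand-in for the N sample vectors
D = cosine_distance_matrix(vectors)   # the matrix of formula (2)
```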
S136: based on the distance matrix, a clustering algorithm is adopted to divide the N sample vectors into the M sample clusters, wherein each sample cluster corresponds to a plurality of sample data.
A clustering algorithm reveals the intrinsic properties and regularities of data by learning from unlabeled samples, dividing the samples into several classes such that sample points in the same class are very similar while sample points in different classes are dissimilar; that is, the clustering algorithm partitions the sample set into several mutually disjoint subsets, the sample clusters. The server 200 may cluster the N sample data corresponding to the N sample vectors using the clustering algorithm to generate the M sample clusters, where M is a positive integer greater than or equal to 1. Based on the distance matrix, the clustering algorithm places sample vectors separated by short distances into one sample cluster and sample vectors separated by long distances into different sample clusters. When the cosine distance is used, the difference between the directions of the sample vectors within each of the M generated sample clusters is smaller than a preset threshold; that is, the vectors point in substantially the same direction. Each sample cluster may include a plurality of sample vectors corresponding to a plurality of sample data. Because the sample vectors within each cluster point in generally consistent directions, the corresponding sample data correspond to similar topics, and the topic similarity between sample data from different clusters is lower than the topic similarity between sample data within the same cluster.
The clustering algorithm may include, but is not limited to, any one or more of the HDBSCAN algorithm, the DBSCAN algorithm, the K-means algorithm, and the spectral clustering algorithm. A suitable clustering algorithm can be selected according to actual requirements to perform cluster analysis on the sample data. For example, when the text semantics of the sample data within the same cluster must be highly similar, the texts of the N sample data corresponding to the N sample vectors may be cluster-analyzed with the HDBSCAN algorithm, an excellent density-based clustering algorithm. The HDBSCAN algorithm has the following advantages: 1) it is insensitive to outlier noise points; 2) it is insensitive to its parameters, and the clustering result is stable; 3) it can detect clusters of arbitrary shape; and 4) it is suitable for large amounts of data and in practice can handle more than 100,000 records.
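Continuing the sketch above, step S136 might be realized with the hdbscan Python library operating on the precomputed distance matrix D (the library choice and min_cluster_size value are assumptions):

```python
# Sketch of S136: density-based clustering of the N sample vectors with HDBSCAN
# on the precomputed distance matrix D from the previous sketch.
import hdbscan

clusterer = hdbscan.HDBSCAN(metric="precomputed", min_cluster_size=5)
labels = clusterer.fit_predict(D.astype("float64"))   # label -1 marks noise

clusters = {}                     # cluster label -> indices of member vectors
for idx, label in enumerate(labels):
    if label != -1:               # HDBSCAN is insensitive to outlier noise
        clusters.setdefault(int(label), []).append(idx)
M = len(clusters)                 # the M sample clusters
```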
S138: representative sample data for each of the M sample clusters is determined.
For each of the M sample clusters obtained by clustering above, it is necessary to determine representative sample data of each sample cluster. The representative sample data may be text data representative of the corresponding sample cluster, and may be a title of the corresponding sample cluster. The representative sample data includes a representative topic of the sample cluster. The product manager can view a plurality of sample data corresponding to the sample cluster according to the title of the sample cluster. Specifically, step S138 may include:
S138-2: a center vector of the current sample cluster is determined.
The server 200 may calculate a vector located at the center of the current sample cluster, that is, a center vector corresponding to the semantic center of the sample cluster, according to each candidate sample vector corresponding to each sample data in the current sample cluster.
S138-4: a distance of each candidate sample vector in the current sample cluster from the center vector is determined.
After determining the center vector of the current sample cluster, the server 200 may calculate the distance between each candidate sample vector corresponding to each sample data in the current sample cluster and the center vector. The distance calculation manner of the sample vector is identical to that described above, and will not be described in detail here.
S138-6: and selecting candidate sample data corresponding to the candidate sample vector nearest to the center vector as representative sample data of the current sample cluster.
The server 200 selects, from the current sample cluster, the sample datum whose candidate sample vector is closest to the center vector as the representative sample datum of the current sample cluster. The topic included in the representative sample datum is the representative topic of the current sample cluster and can be used as the title of the corresponding cluster. For example, the title of a sample cluster may be "Huabei credit limit", "Huabei repayment date", and so on.
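Continuing the same sketch, step S138 can be expressed as follows (using the arithmetic mean of the member vectors as the semantic center and the cosine distance of formula (1); both are assumptions consistent with the text):

```python
# Sketch of S138: choose each cluster's representative sample, i.e., the member
# whose vector is closest to the cluster's center vector (its semantic center).
import numpy as np

def representative(vectors: np.ndarray, member_idx: list) -> int:
    members = vectors[member_idx]
    center = members.mean(axis=0)                  # S138-2: center vector
    unit = lambda v: v / np.clip(
        np.linalg.norm(v, axis=-1, keepdims=True), 1e-12, None)
    dists = 1.0 - unit(members) @ unit(center)     # S138-4: distances to center
    return member_idx[int(np.argmin(dists))]       # S138-6: nearest member

# Index of each cluster's representative sample datum; the text at this index
# serves as the cluster's title.
representatives = {label: representative(vectors, idx)
                   for label, idx in clusters.items()}
```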
The data mining method P100 provided in this specification can divide a huge amount of sample data (i.e., opinion feedback text data) into a plurality of sample clusters (i.e., a plurality of topics). The sample data within each cluster have similar topics, and the representative sample datum of each cluster serves as its title. The product manager can view the sample data in a cluster according to its title and thereby review the problems users raise under the different topics.
S150: and selecting a target sample cluster from the M sample clusters.
As described above, the plurality of sample data corresponding to each of the M clustered sample clusters have similar topics, and different sample clusters may contain different amounts of sample data. The more sample data a cluster contains, the more users are concerned with the corresponding topic; that is, the hotter the topic that the cluster represents. To mine problems common to users from the massive sample data, the server 200 may first sort the M sample clusters by the number of vectors in each cluster: the more vectors a cluster has, the hotter its topic. The server 200 may then select, as the target sample clusters, the first m sample clusters with the largest numbers of sample vectors from the M sample clusters, where m is a positive integer greater than or equal to 1 and not greater than M. The product manager can view the trending topics common to users according to the titles of the target sample clusters. Specifically, the server 200 may sort the M sample clusters from most to fewest by the number of sample vectors each contains, and then select the first m sample clusters from the sorted clusters as the target sample clusters. The topics corresponding to the target sample clusters represent the common problems users care about, such as the Huabei credit limit or the Huabei repayment date. Through the method P100 described in this specification, the server 200 can mine the common problems of interest to users from massive sample data, helping the product manager solve users' problems in batches.
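Step S150 then reduces to sorting the clusters by size; the value of m below is an illustrative assumption:

```python
# Sketch of S150: keep the m largest sample clusters as the target sample
# clusters, i.e., the trending topics. m = 3 is an illustrative assumption.
m = 3
by_size = sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)
target_sample_clusters = by_size[:m]   # entries: (cluster label, member indices)
```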
S170: inputting a plurality of sample data corresponding to the target sample cluster into an intention recognition model for classification, and generating at least one target sample data set, wherein each target sample data set corresponds to one target intention category.
As previously described, each target sample cluster corresponds to a plurality of sample data with similar topics. The sample data may be suggestions for the target APP or the services it provides, praise for them, consultations, complaints, or abuse directed at them, or even comparisons of the target APP or its services with the similar products or services of competitors. To quickly extract, from the plurality of sample data corresponding to a target sample cluster, the high-value information that helps solve the problem, the sample data need to be classified so that the high-value information can be identified: for example, suggestions or improvement proposals that users make for the target APP or its services, or comparisons of the user experience of the target APP or its services against competing products. Both are beneficial to upgrading and improving the product, and they also help the product manager solve the common problems users encounter according to the suggestions the users themselves propose. Such information helps the product manager solve the users' problems from the users' own perspective, so the solutions fit the users' demands better and user satisfaction improves. Some of the sample data corresponding to the target sample cluster, by contrast, are of low value (abuse, for instance) for improving the product or for helping the product manager solve problems. Such low-value information needs to be screened out quickly during classification to improve working efficiency.
Specifically, the server 200 may classify the plurality of sample data corresponding to the target sample cluster through an intention recognition model to identify the high-value information. The intention recognition model uses natural language processing techniques to judge the user intent contained in the sample data (product suggestion, praise, complaint, consultation, and so on) and thereby classify the sample data; it is configured to analyze the input natural language text data to obtain the intent information in the corresponding natural language text content. The server 200 therefore needs to map each vector in the selected target sample cluster back to its text data and then feed those texts one by one into the intention recognition model for intent recognition. Specifically, the intention recognition model is trained on the basis of natural language processing techniques, with historical sample data and their corresponding classification labels as the training samples. The historical sample data may be a number of sample data extracted from the N sample data and manually annotated with classification labels. The server 200 trains the intention recognition model on the extracted sample data and their labels, obtaining the parameters of the model. Specifically, step S170 may include executing, by at least one processor 220 of the server 200:
Inputting sample data corresponding to each sample vector in the target sample cluster into an intention recognition model, classifying the sample data corresponding to the target sample cluster, and generating a plurality of sample data sets; selecting the at least one target sample data set corresponding to the target intent category from the plurality of sample data sets.
The server 200 classifies each of the plurality of sample data corresponding to the target sample cluster through the intention recognition model and outputs a classification result for each sample datum. The classification result comprises a plurality of sample data sets, where each sample data set corresponds to an intention category and the intention categories include the target intention category. For convenience of description, we denote the number of sample data sets by L, where L is a positive integer greater than or equal to 1. The intention categories may be product suggestion, praise, complaint, consultation, competing product, and so on; the target intention categories may be product suggestion, competing product, and so on. Each sample datum corresponds to an intention category and a corresponding classification value, the classification value being the probability of the intention category for that sample datum. It should be noted that the intention recognition model outputs a vector in which each element corresponds to an intention category: the intention category of the element with the largest value is the intention category of the input sample datum, and that largest value is its classification value. After classifying the plurality of sample data corresponding to the target sample cluster, the server 200 can present the sample data under each intention category, so that the sample data of a trending common topic can be viewed per intention category, and the common topic can be responded to and a solution formulated according to those sample data.
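A sketch of this classification step, assuming a BERT-based sequence classifier fine-tuned on the manually labeled historical samples (the label set, model name, and library are illustrative assumptions):

```python
# Sketch of S170: intent recognition for one text. The returned classification
# value is the largest element of the model's softmax output vector, and the
# intent category is the one that element corresponds to, as described above.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

INTENTS = ["product_suggestion", "praise", "complaint", "consultation", "competitor"]
tok = AutoTokenizer.from_pretrained("bert-base-chinese")
clf = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(INTENTS))   # assumes prior fine-tuning

def classify_intent(text: str):
    batch = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = clf(**batch).logits.softmax(-1).squeeze(0)
    value, index = probs.max(dim=0)              # max element and its position
    return INTENTS[int(index)], float(value)     # intent category, class value
```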
After classifying the plurality of sample data corresponding to the target sample cluster, the server 200 needs to select the information valuable for solving the users' common topics, i.e., the sample data corresponding to the target intention categories, such as the sample data under the product-suggestion and competitor categories. Accordingly, the server 200 may select, from the plurality of sample data sets, the sample data sets corresponding to the target intention categories as the at least one target sample data set. The number of target sample data sets may be P, where P is a positive integer greater than or equal to 1.
S190: and outputting a data mining result.
After extracting the high-value information, the server 200 may output all the extracted high-value information to the product manager as a result of data mining, or may output part of the high-value information to the product manager as a result of data mining. The product manager can refer to the data mining result to formulate a solution to the problem raised by the user so as to solve the user's appeal. The data mining result may include at least a portion of the target sample data in the at least one target sample data set, may include all of the target sample data in the at least one target sample data set, and may further include a title and a target intention category of a target sample cluster corresponding to the at least a portion of the target sample data or all of the target sample data. Specifically, step S190 may include executing, by at least one processor 220 of server 200:
Acquiring a target classification value corresponding to each target sample data in each target sample data set; and selecting at least one target sample data with the highest corresponding target classification value from each target sample data set as the data mining result to output.
Specifically, the server 200 may sort the target sample data in each of the P target sample data sets by target classification value from largest to smallest, and select the first p target sample data with the highest target classification values from the sorted data for output, where p is an integer greater than or equal to 1. It should be noted that different target sample data sets may contain different numbers of target sample data, and the number p of top-scoring target sample data selected may also differ between sets; that is, different target sample data sets may contribute different amounts of target sample data to the data mining result. The data mining result may include the first p target sample data with the highest target classification values in each target sample data set, and may further include the title of the corresponding target sample cluster and the target intention category, so that the product manager can view the corresponding sample data by searching for the title and the target intention category.
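A sketch of this selection, assuming each target sample data set has already been gathered as a list of (text, classification value) pairs keyed by cluster title and target intention category (this data layout is an assumption for illustration):

```python
# Sketch of S190: from each target sample data set, output the p texts with the
# highest target classification values, tagged with cluster title and intent.
p = 5   # illustrative choice; may differ per target sample data set
target_sets = {
    ("Huabei credit limit", "product_suggestion"):
        [("Please let users raise the limit from the bill page", 0.97),
         ("Raise limits automatically for on-time repayment", 0.91)],
}   # assumed layout: (title, intent) -> [(text, classification value), ...]

mining_result = []
for (title, intent), pairs in target_sets.items():
    top = sorted(pairs, key=lambda pair: pair[1], reverse=True)[:p]
    mining_result += [(title, intent, text, value) for text, value in top]
```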
The method P100 provided by this specification can help a product manager view, by the titles and target intent categories of the target sample clusters, the target sample data under different titles and different target intent categories, and formulate solutions based on the information contained in that data, so as to solve trending problems from the users' perspective, thereby improving user satisfaction, user experience, and user retention.
In summary, the data mining method P100 and system 100 provided in this disclosure cluster massive text sample data with a clustering algorithm to generate a plurality of sample clusters, where the plurality of sample data in each sample cluster correspond to similar topics, so that common problems can be mined from the massive data. A product manager can view the sample data under each sample cluster according to the cluster's title; the more sample data a sample cluster contains, the more users are concerned with the corresponding topic and the hotter the cluster is. After the trending topics are selected, the method P100 and the system 100 may classify the sample data under those topics through the intent recognition model and identify the valuable sample data, such as product-suggestion texts or competing-product texts. The method P100 and the system 100 can thus cluster massive data to mine common problems and solve them in batches, greatly improving working efficiency; at the same time, they can mine information valuable to the product manager from the massive data, helping the product manager address users' appeals from the users' perspective, greatly improving the product experience and user satisfaction.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In view of the foregoing, it will be evident to a person skilled in the art that the foregoing detailed disclosure is presented by way of example only and is not limiting. Although not explicitly stated herein, those skilled in the art will appreciate that this specification is intended to encompass various reasonable adaptations, improvements, and modifications of the embodiments. Such alterations, improvements, and modifications are suggested by this specification and are within the spirit and scope of its exemplary embodiments.
Furthermore, certain terms in this specification have been used to describe its embodiments. For example, "one embodiment," "an embodiment," or "some embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of this specification. Thus, it is emphasized and should be appreciated that two or more references to "an embodiment," "one embodiment," or "an alternative embodiment" in various portions of this specification do not necessarily all refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined as suitable in one or more embodiments of this specification.
It should be appreciated that, in the foregoing description of the embodiments of this specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding the understanding of one feature. This does not mean, however, that the combination of these features is required; it is entirely possible for a person skilled in the art, upon reading this specification, to extract some of them as separate embodiments. That is, the embodiments in this specification may also be understood as an integration of multiple secondary embodiments, where each secondary embodiment is satisfied by fewer than all of the features of a single foregoing disclosed embodiment.
Each patent, patent application, publication of a patent application, and other material, such as articles, books, specifications, publications, documents, and the like, cited herein is hereby incorporated by reference in its entirety for all purposes, except for any prosecution file history associated therewith, any such material that is inconsistent with or in conflict with this document, and any such material that may have a limiting effect on the broadest scope of the claims now or later associated with this document. For example, if there is any inconsistency or conflict between the description, definition, or use of a term associated with any of the incorporated materials and that of this document, the term in this document shall prevail.
Finally, it should be understood that the embodiments disclosed herein are illustrative of the principles of the embodiments of this specification, and that other modified embodiments are also within its scope. Accordingly, the embodiments disclosed herein are by way of example only and not of limitation. Those skilled in the art may adopt alternative configurations to implement the application in this specification in accordance with these embodiments. Therefore, the embodiments of this specification are not limited to those precisely described in the application.

Claims (10)

1. A method of data mining, comprising:
acquiring N sample data, wherein N is an integer greater than 1, the N sample data are N pieces of text data, and each sample data comprises at least one topic;
generating N sample vectors, wherein each sample vector corresponds to one of the N sample data and comprises the at least one topic of the corresponding sample data;
clustering the N sample vectors to generate M sample clusters, wherein the plurality of sample vectors in each sample cluster correspond to a plurality of sample data with similar topics, and M is a positive integer;
selecting, from the M sample clusters, the first m sample clusters whose corresponding topics are the hottest as target sample clusters, wherein m is a positive integer;
inputting the plurality of sample data corresponding to the target sample cluster into an intention recognition model for classification to generate at least one target sample data set, wherein each target sample data set contains information valuable for solving the corresponding users' common topic and corresponds to one target intent category; and
outputting at least part of the target sample data in the at least one target sample data set as a data mining result.
2. The method of data mining of claim 1, wherein the clustering the N sample vectors to generate M sample clusters comprises:
calculating the distances between the N sample vectors;
generating a distance matrix, wherein the rows and the columns of the distance matrix respectively correspond to the N sample vectors, and the value of any element in the matrix is the distance between the sample vector corresponding to the element's row and the sample vector corresponding to the element's column;
dividing, based on the distance matrix, the N sample vectors into the M sample clusters by a clustering algorithm; and
determining representative sample data for each of the M sample clusters, the representative sample data comprising a representative topic of the sample cluster.
3. The method of data mining of claim 2, wherein the determining representative sample data for each of the M sample clusters comprises:
determining a center vector of a current sample cluster;
determining a distance of each candidate sample vector in the current sample cluster from the center vector; and
selecting the candidate sample data corresponding to the candidate sample vector nearest to the center vector as the representative sample data of the current sample cluster.
4. The method of data mining of claim 2, wherein the clustering algorithm comprises: at least one of an HDBSCAN algorithm, a DBSCAN algorithm, a K-means algorithm, and a spectral clustering algorithm.
5. The method of data mining of claim 1, wherein the selecting the first m sample clusters whose corresponding topics are the hottest from the M sample clusters as target sample clusters comprises:
selecting, from the M sample clusters, the first m sample clusters containing the largest number of sample vectors as the target sample clusters.
6. The method of data mining of claim 1, wherein the inputting the plurality of sample data corresponding to the target sample cluster into the intention recognition model for classification to generate at least one target sample data set comprises:
inputting the sample data corresponding to each sample vector in the target sample cluster into the intention recognition model to classify the sample data corresponding to the target sample cluster and generate a plurality of sample data sets, wherein each sample data set corresponds to an intent category and a corresponding classification value, and the intent categories comprise the target intent category; and
selecting the at least one target sample data set corresponding to the target intent category from the plurality of sample data sets.
7. The method of data mining of claim 6, wherein the intent recognition model is trained based on natural language processing techniques with historical sample data and corresponding classification labels as training samples and is configured to analyze the input natural language text data to obtain intent information in corresponding natural language text content.
8. The method of data mining of claim 6, wherein the outputting at least part of the target sample data in the at least one target sample data set as a data mining result comprises:
acquiring a target classification value corresponding to each target sample data in each target sample data set; and
selecting at least one target sample data with the highest corresponding target classification value from each target sample data set and outputting it as the data mining result.
9. The method of data mining of claim 1, wherein the generating N sample vectors comprises: inputting the N sample data into a feature vector extraction model to generate the N sample vectors,
wherein the feature vector extraction model includes at least one of a BERT model, a BOW model, a TF-IDF model, and an LSTM model.
10. A system for data mining, comprising:
at least one storage medium comprising at least one set of instructions for data mining; and
at least one processor communicatively coupled to the at least one storage medium,
wherein the at least one processor reads the at least one instruction set and performs the method of data mining of any of claims 1-9 as directed by the at least one instruction set when the system is running.
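Taken together, the clustering and selection steps recited in claims 1 through 5 can be illustrated by the following sketch. DBSCAN over a precomputed distance matrix is one of the options listed in claim 4; the eps and min_samples values are illustrative assumptions, and X stands for the N sample vectors produced by a feature extractor such as those listed in claim 9:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import pairwise_distances

    def cluster_and_select(X, m, eps=0.5, min_samples=5):
        """Build the distance matrix over the N sample vectors, cluster it,
        pick the m clusters with the most members (the hottest topics), and
        return each hot cluster's representative sample index."""
        D = pairwise_distances(X)                      # distance matrix (claim 2)
        labels = DBSCAN(eps=eps, min_samples=min_samples,
                        metric="precomputed").fit_predict(D)
        cluster_ids = [c for c in np.unique(labels) if c != -1]  # -1 = DBSCAN noise
        hot = sorted(cluster_ids,
                     key=lambda c: int((labels == c).sum()), reverse=True)[:m]
        reps = {}
        for c in hot:
            idx = np.where(labels == c)[0]
            center = X[idx].mean(axis=0)               # center vector (claim 3)
            nearest = idx[np.argmin(np.linalg.norm(X[idx] - center, axis=1))]
            reps[int(c)] = int(nearest)                # nearest-to-center sample
        return hot, reps

A density-based algorithm is used here because it consumes the claim-2 distance matrix directly; K-means or spectral clustering, also listed in claim 4, would instead operate on the vectors themselves or on an affinity matrix.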
CN202010441154.XA 2020-05-22 2020-05-22 Data mining method and system Active CN111475652B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010441154.XA CN111475652B (en) 2020-05-22 2020-05-22 Data mining method and system

Publications (2)

Publication Number Publication Date
CN111475652A CN111475652A (en) 2020-07-31
CN111475652B true CN111475652B (en) 2023-09-22

Family

ID=71764752

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010441154.XA Active CN111475652B (en) 2020-05-22 2020-05-22 Data mining method and system

Country Status (1)

Country Link
CN (1) CN111475652B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113722486A (en) * 2021-08-31 2021-11-30 平安普惠企业管理有限公司 Intention classification method, device and equipment based on small samples and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107918640A (en) * 2017-10-20 2018-04-17 阿里巴巴集团控股有限公司 Sample determines method and device
CN109635104A (en) * 2018-10-25 2019-04-16 北京中关村科金技术有限公司 Data classification identification method, device, computer equipment and readable storage medium storing program for executing
CN110321472A (en) * 2019-06-12 2019-10-11 中国电子科技集团公司第二十八研究所 Public sentiment based on intelligent answer technology monitors system
CN110674287A (en) * 2018-06-07 2020-01-10 阿里巴巴集团控股有限公司 Method and device for establishing hierarchical intention system

Also Published As

Publication number Publication date
CN111475652A (en) 2020-07-31

Similar Documents

Publication Publication Date Title
US11423636B2 (en) Saliency-based object counting and localization
US20210224286A1 (en) Search result processing method and apparatus, and storage medium
CN109918673B (en) Semantic arbitration method and device, electronic equipment and computer-readable storage medium
CN107992596B (en) Text clustering method, text clustering device, server and storage medium
RU2678716C1 (en) Use of autoencoders for learning text classifiers in natural language
CN110020009B (en) Online question and answer method, device and system
US20160306800A1 (en) Reply recommendation apparatus and system and method for text construction
US20170277770A1 (en) Smart match autocomplete system
US20200372025A1 (en) Answer selection using a compare-aggregate model with language model and condensed similarity information from latent clustering
CN110325986A (en) Article processing method, device, server and storage medium
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
CN111125354A (en) Text classification method and device
US9483740B1 (en) Automated data classification
CN107832338B (en) Method and system for recognizing core product words
WO2023011382A1 (en) Recommendation method, recommendation model training method, and related product
US10915756B2 (en) Method and apparatus for determining (raw) video materials for news
CN111368059B (en) Method and system for autonomous response of group chat robot
CN112132238A (en) Method, device, equipment and readable medium for identifying private data
CN112307190A (en) Medical literature sorting method and device, electronic equipment and storage medium
Wei et al. Online education recommendation model based on user behavior data analysis
CN110363206B (en) Clustering of data objects, data processing and data identification method
CN111475652B (en) Data mining method and system
CN114792246A (en) Method and system for mining typical product characteristics based on topic integration clustering
CN114328798A (en) Processing method, device, equipment, storage medium and program product for searching text
CN111737607B (en) Data processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant