WO2022126971A1 - Density-based text clustering method and apparatus, device, and storage medium - Google Patents

Density-based text clustering method and apparatus, device, and storage medium

Info

Publication number
WO2022126971A1
WO2022126971A1 PCT/CN2021/090434 CN2021090434W WO2022126971A1 WO 2022126971 A1 WO2022126971 A1 WO 2022126971A1 CN 2021090434 W CN2021090434 W CN 2021090434W WO 2022126971 A1 WO2022126971 A1 WO 2022126971A1
Authority
WO
WIPO (PCT)
Prior art keywords
distance
data
point
local density
target
Prior art date
Application number
PCT/CN2021/090434
Other languages
English (en)
Chinese (zh)
Inventor
曾斌
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022126971A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes

Definitions

  • the present application relates to the technical field of artificial intelligence, in particular to a density-based text clustering method, apparatus, device and storage medium.
  • Clustering is a typical unsupervised learning method, which divides the samples in the dataset into several usually disjoint subsets (clusters) by learning from unlabeled training samples.
  • the goal of cluster analysis is to classify elements based on the similarity of elements. It has a wide range of applications in the fields of bioinformatics and pattern recognition. Commonly used clustering algorithms are: K-means, K-medoids, DBSCAN, etc.
  • Text clustering is a specific application of clustering algorithms in the field of natural language processing.
  • The usual practice is to create text feature vectors based on TF-IDF (term frequency-inverse document frequency), word2vec, etc., and then apply a clustering method to cluster the texts.
  • The technical problem to be solved by the embodiments of the present application is to provide a density-based text clustering method, apparatus, device and storage medium, which can reduce the number of operations and improve the clustering effect on non-spherical data.
  • the embodiments of the present application provide a density-based text clustering method, which adopts the following technical solutions:
  • a density-based text clustering method including:
  • receiving an input target data set, where the target data set includes several data points corresponding to several pieces of text data;
  • calling the target distance formula, calculating the distance between each data point in the target data set and each other data point according to the target distance formula, and generating a distance matrix for the entire target data set;
  • the data points in the target data set are classified based on the cluster centers, and each data point is divided into clusters in the clustering decision diagram.
  • the embodiments of the present application also provide a density-based text clustering device, which adopts the following technical solutions:
  • a density-based text clustering device comprising:
  • a data receiving module for receiving an input target data set, where the target data set includes several data points corresponding to several pieces of text data;
  • a distance formula confirmation module for identifying the type of the target data set and confirming the target distance formula
  • a distance matrix generation module used to call the target distance formula, calculate the distance between each data point in the target data set and each other data point according to the target distance formula, and generate a distance matrix for the entire target data set;
  • a local density calculation module configured to obtain a local density distance parameter, and calculate the local density of each data point according to the local density distance parameter and the distance matrix
  • a minimum point distance extraction module for identifying, for each data point in the target data set, the set of data points whose local density is higher than that of the data point, recorded as its sample point set, and extracting the minimum distance between each data point and the data points in its sample point set, recorded as the minimum point distance; wherein the minimum point distance of the data point with the highest local density is the maximum of the distances between that data point and the other data points in the target data set;
  • a clustering decision graph generation module used for establishing a clustering decision graph according to the local density and the minimum point distance
  • a cluster determination module for determining the number of clusters and cluster centers in the clustering decision diagram
  • the data classification module is configured to classify the data points in the target data set based on the cluster centers, and divide each data point into the clusters of the clustering decision diagram respectively.
  • an embodiment of the present application further provides a computer device, including at least one processor; and,
  • the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
  • receiving an input target data set, where the target data set includes several data points corresponding to several pieces of text data;
  • calling the target distance formula, calculating the distance between each data point in the target data set and each other data point according to the target distance formula, and generating a distance matrix for the entire target data set;
  • the data points in the target data set are classified based on the cluster centers, and each data point is divided into clusters in the clustering decision diagram.
  • an embodiment of the present application further provides a computer-readable storage medium, where computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the processor performs the following steps:
  • receiving an input target data set, where the target data set includes several data points corresponding to several pieces of text data;
  • calling the target distance formula, calculating the distance between each data point in the target data set and each other data point according to the target distance formula, and generating a distance matrix for the entire target data set;
  • the minimum value of the distance between the data points is denoted as the minimum point distance; wherein, the minimum point distance of the data point with the highest local density is the maximum value of the distance between the data point and other data points in the target data set;
  • the data points in the target data set are classified based on the cluster centers, and each data point is divided into clusters in the clustering decision diagram.
  • the embodiments of the present application disclose a density-based text clustering method, device, equipment, and storage medium.
  • In the density-based text clustering method described in the embodiments of the present application, an input target data set is received; the type of the target data set is identified and the target distance formula is confirmed; the target distance formula is then called to calculate the distance between each data point and every other data point in the target data set, generating a distance matrix for the entire target data set; a local density distance parameter is then obtained, and the local density of each data point is calculated from the local density distance parameter and the distance matrix; for each data point in the target data set, the set of data points with a higher local density than that data point is recorded as its sample point set, and the minimum distance between the data point and the data points in its sample point set is extracted and recorded as the minimum point distance.
  • The method uses the defined concept of local density, so that during the whole clustering process the distances between sample points only need to be calculated once, and non-spherical data can be clustered without iterative calculation, which greatly improves the time performance of the algorithm; the clustering decision diagram is used to select the number of clusters on a more principled basis, avoiding setting the number of clusters arbitrarily.
  • FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of the density-based text clustering method described in the embodiment of the application;
  • FIG. 3 is a clustering decision diagram in a specific implementation of Embodiment 1 of the present application.
  • FIG. 4 is a schematic structural diagram of an embodiment of the density-based text clustering apparatus described in the embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of an embodiment of a computer device in an embodiment of the present application.
  • the system architecture 100 may include a first terminal device 101 , a second terminal device 102 , a third terminal device 103 , a network 104 and a server 105 .
  • the network 104 is a medium for providing a communication link between the first terminal device 101 , the second terminal device 102 , the third terminal device 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can use the first terminal device 101 , the second terminal device 102 and the third terminal device 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications may be installed on the terminal devices, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, etc.
  • the first terminal device 101, the second terminal device 102 and the third terminal device 103 may be various electronic devices with display screens that support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, etc.
  • the server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the first terminal device 101 , the second terminal device 102 and the third terminal device 103 .
  • the density-based text clustering method provided by the embodiments of the present application is generally performed by a server/terminal device, and accordingly, the density-based text clustering apparatus is generally set in the server/terminal device.
  • terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • the density-based text clustering method includes the following steps:
  • Step 201 Receive an input target data set, where the target data set includes several data points corresponding to several pieces of text data.
  • the object implemented by the text clustering method is text information.
  • the received target data set includes several pieces of text data, wherein the feature vector corresponding to each piece of text data can be regarded as a data point, and the data points in the target data set are used as sample points to implement text clustering.
  • the density-based text clustering method further includes:
  • the corresponding text data is identified by the feature vector as the data point.
  • the feature words corresponding to each piece of text data are extracted, and then the feature words are transformed through a preset word vector model.
  • the feature vector is used as a data point carrying the corresponding coordinates to identify the corresponding text data, so as to realize the quantitative processing of the text data.
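The conversion from text data to data points can be illustrated with a minimal sketch. It assumes a plain TF-IDF weighting and whitespace tokenization, neither of which is specified by the application for this step (the application only refers to a preset word vector model); the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn each text into a TF-IDF feature vector (one data point per text).

    Assumptions of this sketch: whitespace tokenization and the basic
    tf * log(N / df) weighting; a real system would use its own tokenizer
    and word vector model.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many texts each word occurs.
    doc_freq = Counter(word for doc in tokenized for word in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc)
        vectors.append([
            (counts[word] / total) * math.log(n_docs / doc_freq[word]) if total else 0.0
            for word in vocabulary
        ])
    return vocabulary, vectors


docs = ["refund my order", "refund the payment", "reset my password"]
vocabulary, vectors = tfidf_vectors(docs)
```

Each row of `vectors` then plays the role of one data point in the target data set.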
  • the electronic device (for example, the server/terminal device shown in FIG. 1 ) on which the density-based text clustering method runs may receive the target data set sent to the server through a wired connection or a wireless connection.
  • the above wireless connection methods may include but are not limited to 3G/4G connection, WiFi connection, Bluetooth connection, WiMAX connection, Zigbee connection, UWB (ultra wideband) connection, and other wireless connection methods currently known or developed in the future.
  • Step 202 Identify the type of the target data set, and confirm the target distance formula.
  • Depending on the type of the data set, a different formula is used to calculate the distance between data points in the data set.
  • the distance formula includes: Euclidean distance, cosine similarity, Jaccard distance, edit distance, etc.
  • the distance formula to be selected is first confirmed according to the type of the dataset.
  • Euclidean distance, also known as the Euclidean metric, is the straight-line distance between two points;
  • cosine similarity is the cosine of the angle between two vectors in the vector space, and is used to measure the similarity between two texts;
  • Jaccard distance is used to calculate the similarity between two individuals whose characteristic attributes are measured by symbols or Boolean values;
  • edit distance is mainly used to calculate the similarity between two strings.
  • the type of the data set includes data type and data dimension, that is, the selection of the distance calculation formula needs to comprehensively consider the data type and data dimension corresponding to the data factor substituted into the calculation.
  • In one embodiment, a TF-IDF (term frequency-inverse document frequency) model is used to extract the feature words from the text data, and the feature words are used to construct the feature vector of the text data.
  • The feature vectors of the text data make up the data set, so the data type is vector data and the data dimension is two-dimensional.
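  • In this case, cosine similarity can be used as the target distance formula: $\cos\theta = \frac{A_{tfidf} \cdot B_{tfidf}}{\lVert A_{tfidf} \rVert \, \lVert B_{tfidf} \rVert}$, where $A_{tfidf}$ represents the feature vector of text A and $B_{tfidf}$ represents the feature vector of text B.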
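As a small illustration of the cosine similarity between two feature vectors, the sketch below computes the cosine of the angle between `A_tfidf` and `B_tfidf`. Turning the similarity into a distance as `1 - similarity`, and returning 0 for an all-zero vector, are assumptions of this sketch; the application does not state how the similarity is mapped to a distance.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors (assumption)
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Assumed conversion: larger similarity means smaller distance."""
    return 1.0 - cosine_similarity(a, b)


A_tfidf = [0.5, 0.0, 0.8]
B_tfidf = [0.5, 0.0, 0.8]
C_tfidf = [0.0, 1.0, 0.0]
print(cosine_similarity(A_tfidf, B_tfidf))  # identical direction, close to 1
print(cosine_distance(A_tfidf, C_tfidf))    # orthogonal vectors, distance 1
```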
  • In another embodiment, if the data type of the text data in the dataset is a string and the data dimension corresponding to the dataset is one-dimensional, the edit distance can be used as the target distance formula.
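For string data, one common form of edit distance is the Levenshtein distance (insertions, deletions and substitutions each cost 1). The application does not say which variant it uses, so the choice below is an assumption made for illustration.

```python
def edit_distance(s, t):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```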
  • the type of the data set may also include judging factors such as application scenarios.
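The selection logic of Step 202 could be sketched as a simple dispatch on the data type of a representative item. The mapping below (strings to edit distance, numeric vectors to cosine similarity, sets or Boolean vectors to Jaccard distance) follows the examples in the text, but the function name and the exact rules are hypothetical.

```python
def choose_distance_formula(sample):
    """Return the name of a distance formula for one representative data item.

    Hypothetical rule set, loosely following the examples in the text.
    """
    if isinstance(sample, str):
        return "edit_distance"
    if isinstance(sample, (set, frozenset)):
        return "jaccard_distance"
    if isinstance(sample, (list, tuple)) and sample:
        # Check bool first: in Python, bool is a subclass of int.
        if all(isinstance(v, bool) for v in sample):
            return "jaccard_distance"
        if all(isinstance(v, (int, float)) for v in sample):
            return "cosine_similarity"
    raise ValueError("no distance formula configured for this data type")


print(choose_distance_formula("refund my order"))
print(choose_distance_formula([0.12, 0.0, 0.57]))
```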
  • Step 203 Call the target distance formula, calculate the distance between each data point in the target data set and each other data point according to the target distance formula, and generate a distance matrix about the entire target data set.
  • A distance matrix for the entire data set is obtained by calculating the distance between every pair of data points in the data set; the distance matrix must cover the distance between any two points in the data set.
  • Pairs of data points in the target data set are substituted into the target distance formula in turn to calculate the distance between them, with a different pair substituted each time, until all pairs in the target data set have been traversed.
  • A corresponding distance matrix is generated from the distances obtained.
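A minimal sketch of Step 203 follows. It assumes the chosen distance is symmetric, so each pair is computed once and mirrored; the helper names and the use of Euclidean distance in the example are illustrative only.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def build_distance_matrix(points, distance):
    """Compute every pairwise distance once and store it in a square matrix."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair is visited exactly once
            d = distance(points[i], points[j])
            matrix[i][j] = d
            matrix[j][i] = d       # assumes a symmetric distance
    return matrix


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dist = build_distance_matrix(points, euclidean)
```

Because all later steps read from `dist`, the pairwise distances do not need to be recomputed.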
  • Step 204 Obtain a local density distance parameter, and calculate the local density of each data point according to the local density distance parameter and the distance matrix.
  • the local density of a data point is understood as the number of data points whose distance is smaller than the value represented by the local density distance parameter when the data point is the center.
  • the distance matrix and the local density distance parameter are used to calculate the local density of each data point according to the distance between each data point represented in the distance matrix.
  • the step 204 includes:
  • calling the preset local density calculation formula $\rho(x_i) = \sum_j \chi(d_{ij} - d_c)$ for the data point $x_i$, and obtaining the local density distance parameter $d_c$, where $\rho(x_i)$ represents the local density, $\chi(x)$ represents a discrete function, and $d_{ij}$ represents an element in the distance matrix;
  • the local density distance parameter is input into the local density calculation formula, and the local density $\rho(x_i)$ of each data point is calculated based on the value of each element in the distance matrix.
  • the local density adopts a discrete value calculation method.
  • the discrete function $\chi(x)$ is defined to be equal to 1 when x is less than 0, and equal to 0 otherwise. Under this definition, x less than 0 means that the distance between two data points is smaller than the local density distance parameter, so each of the two data points is counted toward the local density of the other.
  • $d_c$ is an adjustable parameter for calculating the local density. It should be set with regard to the amount of data and the value range of the distance formula used; typically it can be set to 10% of the maximum of that range. For data sets with a large amount of data, adjusting this parameter has a relatively small effect on the results.
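The local density count can be sketched directly from the distance matrix. Excluding the point itself from its own count, and deriving the default cutoff from the largest distance observed in the matrix, are assumptions of this sketch; the function names are hypothetical.

```python
def default_d_c(dist, fraction=0.1):
    """Heuristic cutoff: a fraction (10% by default) of the largest distance in the matrix."""
    return fraction * max(max(row) for row in dist)


def local_density(dist, d_c):
    """For each point i, count the other points j with dist[i][j] - d_c < 0."""
    n = len(dist)
    return [
        sum(1 for j in range(n) if j != i and dist[i][j] - d_c < 0)
        for i in range(n)
    ]


dist = [
    [0, 1, 2, 9],
    [1, 0, 1, 8],
    [2, 1, 0, 7],
    [9, 8, 7, 0],
]
rho = local_density(dist, d_c=1.5)
print(rho)  # [1, 2, 1, 0]
```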
  • Step 205 For each data point in the target data set, identify the set of data points whose local density is higher than that of the data point, recorded as its sample point set, and extract the minimum distance between the data point and the data points in its sample point set, recorded as the minimum point distance; wherein the minimum point distance of the data point with the highest local density is the maximum of the distances between that data point and the other data points in the target data set.
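Step 205 can be sketched as follows. The handling of ties at the highest density (every point with no higher-density neighbour receives its maximum distance) is an assumption of this sketch, and the function name is hypothetical.

```python
def minimum_point_distance(dist, rho):
    """For each point, the distance to the nearest point of strictly higher density.

    A point with no higher-density point (the density peak) instead gets the
    largest distance from it to any other point.
    """
    n = len(dist)
    delta = []
    for i in range(n):
        sample_set = [j for j in range(n) if rho[j] > rho[i]]
        if sample_set:
            delta.append(min(dist[i][j] for j in sample_set))
        else:
            delta.append(max(dist[i][j] for j in range(n) if j != i))
    return delta


dist = [
    [0, 1, 2, 9],
    [1, 0, 1, 8],
    [2, 1, 0, 7],
    [9, 8, 7, 0],
]
rho = [1, 2, 1, 0]
delta = minimum_point_distance(dist, rho)
print(delta)  # [1, 8, 1, 7]
```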
  • Step 206 Establish a clustering decision graph according to the local density and the minimum point distance.
  • The clustering decision diagram is a diagram that makes it easy to analyze the cluster centers. For some data sets in particular, the cluster centers can be determined clearly by directly observing the established clustering decision diagram.
  • The clustering decision diagram in this application is generated from the local density and the minimum point distance of the data points. The local density of each data point and the minimum point distance between the data point and the data points in its sample point set are calculated and displayed in the clustering decision diagram, which is used to determine the number of clusters and the cluster centers.
  • step 206 includes:
  • a plane coordinate system is established with the local density as the horizontal axis and the minimum point distance as the vertical axis, and each data point of the target data set is marked in the plane coordinate system according to its corresponding coordinates, forming the clustering decision graph.
  • Step 207 Determine the number of clusters and cluster centers in the clustering decision diagram.
  • A cluster is one of the groups into which objects are divided during grouping; objects in the same cluster are more similar to each other under a given feature than to objects in other clusters. A cluster center is the data point with the most central value in a cluster when computed under the given feature rule. The number of clusters in a clustering decision diagram equals the number of cluster centers.
  • A data point serving as a cluster center generally has the following characteristics: its own local density is large, that is, it is surrounded by data points whose local density does not exceed its own; and its distance to other data points with higher local density is larger than the distances between data points within its own cluster.
  • The number of clusters and the cluster centers in the clustering decision diagram can be judged simply from the local density and the minimum point distance of each data point: the larger the product of the two, the more likely the data point is a cluster center.
  • In one specific example, the target data set has about 28 data points. By simple observation, data point 1 and data point 10 both have large local density and minimum point distance values, so data points 1 and 10 are suitable as cluster centers, and the number of clusters is 2.
  • In some embodiments, the product values may be sorted in descending order, and then several data points may be taken from front to back as the cluster centers according to how smoothly the product value changes.
  • The product value changes relatively smoothly among non-cluster centers, whereas it drops sharply at the transition from cluster centers to non-cluster centers.
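One way to automate the selection described above is to rank the points by the product of local density and minimum point distance and cut the ranking at the largest drop. The "largest drop" rule is an assumption of this sketch (the text only speaks of the smoothness of the change), and the sample values are made up for illustration.

```python
def select_cluster_centers(rho, delta):
    """Return indices of likely cluster centers, ranked by rho * delta.

    Assumed cut rule: keep every point ranked before the single largest
    drop between consecutive product values. Requires at least two points.
    """
    gamma = [r * d for r, d in zip(rho, delta)]
    order = sorted(range(len(gamma)), key=lambda i: gamma[i], reverse=True)
    ranked = [gamma[i] for i in order]
    drops = [ranked[k] - ranked[k + 1] for k in range(len(ranked) - 1)]
    cut = max(range(len(drops)), key=lambda k: drops[k])
    return order[:cut + 1]


rho = [5, 4, 1, 6, 2, 1]
delta = [9.0, 0.5, 0.4, 8.0, 0.3, 0.2]
centers = select_cluster_centers(rho, delta)
print(centers)  # [3, 0]
```

In this made-up example, points 3 and 0 have products 48 and 45, far above the rest, so two clusters are selected.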
  • Step 208 Classify the data points in the target data set based on the cluster centers, and divide each data point into clusters in the clustering decision diagram.
  • After the number of clusters and the cluster center of each cluster are determined, each data point is classified, according to its correlation with the cluster centers, into the cluster of the cluster center with which it has the highest correlation. In this way, all data points in the target data set are classified into the divided clusters, completing the text clustering.
  • the calculation rules for the above correlation values can be adjusted according to the requirements of different scenarios.
  • the step 208 includes:
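  • the distances between each data point and the respective cluster centers are compared to confirm, for each data point, the target cluster center closest to it, and the data points are divided into the clusters to which their target cluster centers belong.
  • That is, each data point is assigned to the cluster of the cluster center closest to it, which completes the text clustering of the target dataset.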
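The nearest-center assignment of Step 208 can be sketched by reusing the distance matrix, so no new distances are computed. The function name and the label convention (each point is labelled with the index of its cluster center) are choices made for this sketch.

```python
def assign_to_clusters(dist, centers):
    """Label each point with the index of its nearest cluster center."""
    labels = []
    for i in range(len(dist)):
        nearest = min(centers, key=lambda c: dist[i][c])
        labels.append(nearest)
    return labels


dist = [
    [0, 1, 2, 9],
    [1, 0, 1, 8],
    [2, 1, 0, 7],
    [9, 8, 7, 0],
]
labels = assign_to_clusters(dist, centers=[1, 3])
print(labels)  # [1, 1, 1, 3]
```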
  • The density-based text clustering method described in the embodiment of the present application uses the defined local density concept, so that during the whole clustering process the distances between sample points only need to be calculated once, and non-spherical data is clustered without iterative calculation, which greatly improves the time performance of the algorithm; the clustering decision diagram is used to select the number of clusters on a more principled basis, avoiding setting the number of clusters arbitrarily.
  • the computer-readable instructions can be stored in a computer-readable storage medium.
  • the computer-readable instructions when executed, may include the processes of the above-mentioned method embodiments.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (ROM), or may be a random access memory (RAM) or the like.
  • FIG. 4 shows a schematic structural diagram of an embodiment of the density-based text clustering apparatus described in this embodiment of the present application.
  • the present application provides an embodiment of a density-based text clustering apparatus.
  • the apparatus embodiment corresponds to the method embodiment shown in FIG. 2 , and the apparatus can be specifically applied in various electronic devices.
  • the density-based text clustering apparatus described in this embodiment includes:
  • the data receiving module 301 is used for receiving an input target data set, where the target data set includes several data points corresponding to several pieces of text data.
  • Distance formula confirmation module 302 used for identifying the type of the target data set, and confirming the target distance formula.
  • Distance matrix generation module 303 used to call the target distance formula, calculate the distance between each data point in the target data set and each other data point according to the target distance formula, and generate a distance matrix for the entire target data set.
  • the local density calculation module 304 is configured to obtain a local density distance parameter, and calculate the local density of each data point according to the local density distance parameter and the distance matrix.
  • the minimum point distance extraction module 305 is used to identify, for each data point in the target data set, the set of data points whose local density is higher than that of the data point, recorded as its sample point set, and to extract the minimum distance between each data point and the data points in its sample point set, recorded as the minimum point distance; wherein the minimum point distance of the data point with the highest local density is the maximum of the distances between that data point and the other data points in the target data set.
  • the clustering decision graph generating module 306 is used for establishing a clustering decision graph according to the local density and the minimum point distance.
  • the cluster determination module 307 is used for determining the number of clusters and cluster centers in the clustering decision diagram.
  • the data classification module 308 is configured to classify the data points in the target data set based on the cluster centers, and divide each data point into a cluster of the clustering decision diagram.
  • the density-based text clustering apparatus further includes: a text data conversion module.
  • the text data conversion module is used for parsing the target data set and extracting the feature words of each piece of text data in the target data set; calling a preset word vector model and converting the feature words into feature vectors through the word vector model; and using the feature vector as the data point to identify the corresponding text data.
  • the clustering decision graph generation module 306 is configured to: establish a plane coordinate system with the local density of the data points in the target dataset as the horizontal axis and the minimum point distance as the vertical axis; and distribute each data point in the target data set into the plane coordinate system to generate the clustering decision diagram.
  • the data classification module 308 is used to: compare the distances between the data points and the respective cluster centers, to confirm the target cluster center corresponding to each data point with the closest distance to it;
  • the data points are divided into clusters to which the target cluster centers belong.
  • the density-based text clustering device described in the embodiment of the present application uses the defined local density concept, so that in the whole clustering process, the distance between sample points only needs to be calculated once without iterative calculation, and the non-spherical The data is clustered, which greatly improves the time performance of the algorithm, and the clustering decision diagram is used to select the number of clusters more scientifically, so as to avoid artificially setting the number of clusters without basis.
  • FIG. 5 is a block diagram of a basic structure of a computer device according to this embodiment.
  • the computer device 6 includes a memory 61 , a processor 62 , and a network interface 63 that communicate with each other through a system bus. It should be pointed out that only the computer device 6 with components 61-63 is shown in the figure, but it should be understood that it is not required to implement all of the shown components, and more or fewer components may be implemented instead.
  • The computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, etc.
  • the computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment.
  • the computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
  • the memory 61 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc.
  • the memory 61 may be an internal storage unit of the computer device 6 , such as a hard disk or a memory of the computer device 6 .
  • the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 61 may also include both the internal storage unit of the computer device 6 and its external storage device.
  • the memory 61 is generally used to store the operating system and various application software installed on the computer device 6, such as computer-readable instructions for a density-based text clustering method.
  • the memory 61 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 62 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chips. This processor 62 is typically used to control the overall operation of the computer device 6 . In this embodiment, the processor 62 is configured to execute computer-readable instructions stored in the memory 61 or process data, such as computer-readable instructions for executing the density-based text clustering method.
  • the network interface 63 may include a wireless network interface or a wired network interface, and the network interface 63 is generally used to establish a communication connection between the computer device 6 and other electronic devices.
  • with the computer device described in the embodiments of the present application, when the processor executes the computer-readable instructions stored in the memory to perform the function test of data push, tasks do not need to be created through front-end operations, and the requirements of large-volume density-based text clustering can be met. This reduces the time consumed by testing, improves the efficiency of functional testing, makes it easy to perform stress testing during the data push test, and, when the results of the data push are judged through the log, allows problems that arise during the test to be analyzed and located.
  • another embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions for density-based text clustering, and these computer-readable instructions are executable by at least one processor to cause the at least one processor to perform the steps of the density-based text clustering method as described above.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the above picture data can also be stored in a node of a blockchain.
  • the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform; hardware alone can of course also be used, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, CD-ROM) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods described in the various embodiments of this application.
  • the disclosed apparatus and method may be implemented in other manners.
  • the device embodiments described above are only illustrative.
  • the division of the modules is only a logical function division. In actual implementation, there may be other division methods.
  • multiple modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the modules or components may or may not be physically separated, and components shown as modules or components may or may not be physical modules, and may be located in one place or distributed over multiple network elements. Some or all of the modules or components may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • a blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
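
The idea of data blocks linked by cryptographic methods can be illustrated with a minimal hash chain. This is only a sketch under stated assumptions, not the storage scheme of the application: the helper names `make_block` and `chain_is_valid`, the use of SHA-256 over a JSON encoding, and the sample payload strings are all hypothetical, and no consensus mechanism or peer-to-peer transmission is modeled.

```python
import hashlib
import json


def make_block(transactions, prev_hash):
    """Build a block whose hash covers its transactions and the previous hash."""
    # Hypothetical encoding: JSON with sorted keys, hashed with SHA-256.
    body = json.dumps({"transactions": transactions, "prev_hash": prev_hash},
                      sort_keys=True)
    return {"transactions": transactions, "prev_hash": prev_hash,
            "hash": hashlib.sha256(body.encode("utf-8")).hexdigest()}


def chain_is_valid(chain):
    """Recompute every hash and check that each block points at its predecessor."""
    for index, block in enumerate(chain):
        expected = make_block(block["transactions"], block["prev_hash"])["hash"]
        if block["hash"] != expected:
            return False
        if index > 0 and block["prev_hash"] != chain[index - 1]["hash"]:
            return False
    return True


# Two linked blocks; the payload strings are placeholders.
genesis = make_block(["cluster result batch 0"], prev_hash="0" * 64)
second = make_block(["cluster result batch 1"], prev_hash=genesis["hash"])
chain = [genesis, second]
```

Because each hash covers the previous block's hash, altering the contents of an earlier block makes `chain_is_valid` fail, which is the anti-counterfeiting property the description refers to.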

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a density-based text clustering method and apparatus, a device, and a storage medium, which relate to the technical field of artificial intelligence and are applied specifically to cluster analysis. The method comprises: receiving a target data set; confirming a target distance formula; generating a distance matrix for the whole target data set; calculating a local density of each data point; for each data point, extracting the minimum value of the distance between that data point and each data point of a sampling point set, and recording the minimum value as the minimum inter-point distance; building a clustering decision graph from the local density and the minimum inter-point distance; determining the number of class clusters and the cluster centers in the clustering decision graph; and assigning each data point to the class clusters of the clustering decision graph. With this method, throughout the clustering process, non-spherical data can be clustered by calculating the distances between sample points only once, without iterative calculation, which markedly improves the performance of the algorithm; the clustering decision graph is used to select the number of class clusters in a principled way, which avoids setting the number of class clusters manually without any basis.
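
The steps listed in the abstract resemble density-peak clustering and can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `density_peak_cluster`, the `cutoff` radius, the Euclidean distance, the neighbor-count density, and the rule of ranking candidate centers by density times distance are all assumptions made for the example, and the number of clusters is passed in as `n_clusters` rather than read off the decision graph.

```python
import math


def density_peak_cluster(points, cutoff, n_clusters):
    """Cluster points by density peaks; returns (labels, centers)."""
    n = len(points)

    # Distance matrix, computed once. Euclidean distance is an assumption;
    # the source allows the distance formula to be chosen.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density: how many other points lie closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]

    # Visit points from densest to sparsest (ties keep index order).
    order = sorted(range(n), key=lambda i: -rho[i])

    # delta[i]: distance to the nearest point visited earlier (treated as
    # denser); parent[i]: that point. The densest point gets its largest
    # distance to any other point.
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i], parent[i] = dist[i][j], j

    # Decision graph: centers combine high rho with high delta.
    gamma = [rho[i] * delta[i] for i in range(n)]
    gamma[order[0]] = math.inf  # the global density peak is always a center
    centers = sorted(range(n), key=lambda i: -gamma[i])[:n_clusters]

    # Each remaining point inherits the label of its nearest denser neighbor.
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers


# Two well-separated groups of four points each (made-up sample data).
sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels, centers = density_peak_cluster(sample, cutoff=1.5, n_clusters=2)
```

The distance matrix is built once and the assignment pass is a single sweep in density order, which mirrors the abstract's point that no iterative recalculation is needed. In practice the decision graph (density against minimum distance) would be inspected to choose the centers instead of fixing `n_clusters` in advance.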
PCT/CN2021/090434 2020-12-16 2021-04-28 Procédé et appareil de groupement de textes selon la densité, dispositif et support de stockage WO2022126971A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011487463.7A CN112528025A (zh) 2020-12-16 2020-12-16 基于密度的文本聚类方法、装置、设备及存储介质
CN202011487463.7 2020-12-16

Publications (1)

Publication Number Publication Date
WO2022126971A1 true WO2022126971A1 (fr) 2022-06-23

Family

ID=75000703

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090434 WO2022126971A1 (fr) 2020-12-16 2021-04-28 Procédé et appareil de groupement de textes selon la densité, dispositif et support de stockage

Country Status (2)

Country Link
CN (1) CN112528025A (fr)
WO (1) WO2022126971A1 (fr)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116166960A (zh) * 2023-02-07 2023-05-26 河南大学 用于神经网络训练的大数据特征清洗方法及系统
CN116360956A (zh) * 2023-06-02 2023-06-30 济南大陆机电股份有限公司 用于大数据任务调度的数据智能处理方法及系统
CN116628128A (zh) * 2023-07-13 2023-08-22 湖南九立供应链有限公司 一种供应链数据标准化方法、装置、设备及其存储介质
CN116796214A (zh) * 2023-06-07 2023-09-22 南京北极光生物科技有限公司 一种基于差分特征的数据聚类方法
CN117217501A (zh) * 2023-11-09 2023-12-12 山东多科科技有限公司 一种数字化生产计划与调度方法
CN117216599A (zh) * 2023-09-27 2023-12-12 北京青丝科技有限公司 一种问卷数据分析方法及系统
CN118012876A (zh) * 2024-04-10 2024-05-10 山东硕杰医疗科技有限公司 一种残疾儿童康复信息平台数据的智慧存储方法
CN117933571B (zh) * 2024-03-20 2024-05-31 临沂恒泰新能源有限公司 一种垃圾发电数据综合管理系统及存储方法

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528025A (zh) * 2020-12-16 2021-03-19 平安科技(深圳)有限公司 基于密度的文本聚类方法、装置、设备及存储介质
CN112597313B (zh) * 2021-03-03 2021-06-29 北京沃丰时代数据科技有限公司 短文本聚类方法、装置、电子设备及存储介质
CN113255288B (zh) * 2021-07-15 2021-09-24 成都威频通讯技术有限公司 一种基于快速密度峰值聚类的电子元器件聚类方法
CN113869465A (zh) * 2021-12-06 2021-12-31 深圳大学 I-nice算法优化方法、装置、设备及计算机可读存储介质
CN114500200B (zh) * 2022-02-22 2023-01-17 苏州大学 数字信号处理方法、动态均衡方法、装置、介质以及设备
CN115563522B (zh) * 2022-12-02 2023-04-07 湖南工商大学 交通数据的聚类方法、装置、设备及介质
CN115580493B (zh) * 2022-12-07 2023-03-31 南方电网数字电网研究院有限公司 电力数据分类加密传输方法、装置和计算机设备
CN116541252B (zh) * 2023-07-06 2023-10-20 广州豪特节能环保科技股份有限公司 一种机房故障日志数据处理方法及装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170371886A1 (en) * 2016-06-22 2017-12-28 Agency For Science, Technology And Research Methods for identifying clusters in a dataset, methods of analyzing cytometry data with the aid of a computer and methods of detecting cell sub-populations in a plurality of cells
CN108647297A (zh) * 2018-05-08 2018-10-12 山东师范大学 一种共享近邻优化的密度峰值聚类中心选取方法和系统
CN109255384A (zh) * 2018-09-12 2019-01-22 湖州市特种设备检测研究院 一种基于密度峰值聚类算法的交通流模式识别方法
CN112528025A (zh) * 2020-12-16 2021-03-19 平安科技(深圳)有限公司 基于密度的文本聚类方法、装置、设备及存储介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI391837B (zh) * 2009-09-23 2013-04-01 Univ Nat Pingtung Sci & Tech 基於密度式之資料分群方法
WO2018137126A1 (fr) * 2017-01-24 2018-08-02 深圳大学 Procédé et dispositif permettant de générer un résumé vidéo statique
CN109446520B (zh) * 2018-10-17 2023-08-15 北京神州泰岳软件股份有限公司 用于构建知识库的数据聚类方法及装置

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116166960A (zh) * 2023-02-07 2023-05-26 河南大学 用于神经网络训练的大数据特征清洗方法及系统
CN116166960B (zh) * 2023-02-07 2023-09-29 山东经鼎智能科技有限公司 用于神经网络训练的大数据特征清洗方法及系统
CN116360956A (zh) * 2023-06-02 2023-06-30 济南大陆机电股份有限公司 用于大数据任务调度的数据智能处理方法及系统
CN116360956B (zh) * 2023-06-02 2023-08-08 济南大陆机电股份有限公司 用于大数据任务调度的数据智能处理方法及系统
CN116796214B (zh) * 2023-06-07 2024-01-30 南京北极光生物科技有限公司 一种基于差分特征的数据聚类方法
CN116796214A (zh) * 2023-06-07 2023-09-22 南京北极光生物科技有限公司 一种基于差分特征的数据聚类方法
CN116628128A (zh) * 2023-07-13 2023-08-22 湖南九立供应链有限公司 一种供应链数据标准化方法、装置、设备及其存储介质
CN116628128B (zh) * 2023-07-13 2023-10-03 湖南九立供应链有限公司 一种供应链数据标准化方法、装置、设备及其存储介质
CN117216599A (zh) * 2023-09-27 2023-12-12 北京青丝科技有限公司 一种问卷数据分析方法及系统
CN117216599B (zh) * 2023-09-27 2024-02-13 北京青丝科技有限公司 一种问卷数据分析方法及系统
CN117217501A (zh) * 2023-11-09 2023-12-12 山东多科科技有限公司 一种数字化生产计划与调度方法
CN117217501B (zh) * 2023-11-09 2024-02-20 山东多科科技有限公司 一种数字化生产计划与调度方法
CN117933571B (zh) * 2024-03-20 2024-05-31 临沂恒泰新能源有限公司 一种垃圾发电数据综合管理系统及存储方法
CN118012876A (zh) * 2024-04-10 2024-05-10 山东硕杰医疗科技有限公司 一种残疾儿童康复信息平台数据的智慧存储方法

Also Published As

Publication number Publication date
CN112528025A (zh) 2021-03-19

Similar Documents

Publication Publication Date Title
WO2022126971A1 (fr) Procédé et appareil de groupement de textes selon la densité, dispositif et support de stockage
US20230205610A1 (en) Systems and methods for removing identifiable information
WO2021174944A1 (fr) Procédé de distribution sélective de message basé sur l'activité de cible et dispositif associé
CN107436875B (zh) 文本分类方法及装置
WO2022095352A1 (fr) Procédé et appareil d'identification d'utilisateur anormal basés sur une décision intelligente, et dispositif informatique
US20200034419A1 (en) Text classification using automatically generated seed data
WO2022126963A1 (fr) Procédé de profilage de client basé sur un corpus de réponse client, et dispositif associé
WO2022048363A1 (fr) Procédé et appareil de classification de site web, dispositif informatique et support de stockage
CN110110225B (zh) 基于用户行为数据分析的在线教育推荐模型及构建方法
CN104077723B (zh) 一种社交网络推荐系统及方法
CN110569289B (zh) 基于大数据的列数据处理方法、设备及介质
WO2023029356A1 (fr) Procédé et appareil de génération d'incorporation de phrases basés sur un modèle d'incorporation de phrases, et dispositif informatique
WO2022105119A1 (fr) Procédé de génération de corpus d'apprentissage pour un modèle de reconnaissance d'intention, et dispositif associé
CN112668482B (zh) 人脸识别训练方法、装置、计算机设备及存储介质
WO2020147259A1 (fr) Procédé et appareil de portrait d'utilisateur, support d'enregistrement lisible et équipement terminal
CN115544257B (zh) 网盘文档快速分类方法、装置、网盘及存储介质
Lian Implementation of computer network user behavior forensic analysis system based on speech data system log
CN111222032A (zh) 舆情分析方法及相关设备
WO2022142032A1 (fr) Procédé et appareil de vérification de signature manuscrite, dispositif informatique et support de stockage
US11593740B1 (en) Computing system for automated evaluation of process workflows
CN115099875A (zh) 基于决策树模型的数据分类方法及相关设备
WO2021159668A1 (fr) Procédé et appareil de dialogue robotisé, dispositif informatique et support de stockage
CN114528378A (zh) 文本分类方法、装置、电子设备及存储介质
CN113408579A (zh) 一种基于用户画像的内部威胁预警方法
Huang et al. A Hybrid Clustering Approach for Bag‐of‐Words Image Categorization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 21904910; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: pct application non-entry in european phase. Ref document number: 21904910; Country of ref document: EP; Kind code of ref document: A1