CN112905792A - Text clustering method, device and equipment based on non-text scene and storage medium - Google Patents

Publication number
CN112905792A
Authority
CN
China
Prior art keywords
classified
text
clustering
vector
sequence
Prior art date
Legal status
Pending
Application number
CN202110195010.5A
Other languages
Chinese (zh)
Inventor
王开宏
陈婷
吴三平
庄伟亮
Current Assignee
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202110195010.5A
Publication of CN112905792A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

The invention relates to the technical field of financial technology (Fintech) and discloses a text clustering method, device, equipment and computer-readable storage medium based on a non-text scene. In the non-text scene, the internal logical relationship of the information to be classified is first extracted, and the information is then serialized according to that relationship. The information is thereby converted into a sequence that has the contextual structure of a text, which facilitates subsequent processing. The serialized information is then vectorized and clustered to obtain the category of the information to be classified in the non-text scene. The text clustering idea can thus be applied to non-text scenes, which removes the limitation on the application range of existing text clustering methods.

Description

Text clustering method, device and equipment based on non-text scene and storage medium
Technical Field
The invention relates to the technical field of financial technology (Fintech), in particular to a text clustering method, a text clustering device, text clustering equipment and a computer-readable storage medium based on a non-text scene.
Background
With the development of computer technology, more and more technologies (big data, distributed computing, blockchain, artificial intelligence, etc.) are applied in the financial field, and the traditional financial industry is gradually shifting to financial technology (Fintech). The security and real-time requirements of the financial industry, however, also place higher demands on these technologies. A text clustering method trains word vectors for the words in a text, takes the word vectors as input, and clusters them with a clustering algorithm. It is usually applied to text scenes and plays an important role in natural language processing. In application scenarios other than text, however, the idea of text clustering is rarely applied to solve practical problems, which reflects the technical problem that the application range of existing text clustering methods is relatively limited.
Disclosure of Invention
The invention mainly aims to provide a text clustering method and equipment based on a non-text scene and a computer readable storage medium, and aims to solve the technical problem that the application range of the existing text clustering method is limited.
In order to achieve the above object, the present invention provides a text clustering method based on non-text scenes, which includes:
acquiring information to be classified in a non-text scene, and serializing the information to be classified according to an internal logical relationship of the information to be classified to obtain an element sequence to be classified consisting of a plurality of elements to be classified;
vectorizing each element to be classified in the element sequence to be classified so as to convert the element sequence to be classified into a vector sequence to be classified consisting of a plurality of word vectors;
and clustering the vector sequences to be classified by utilizing a preset clustering algorithm so as to divide a plurality of word vectors into a plurality of vector sets belonging to different categories.
Optionally, the internal logical relationship comprises a temporal order,
the step of obtaining information to be classified in a non-text scene, and serializing the information to be classified according to the internal logical relationship of the information to be classified to obtain an element sequence to be classified consisting of a plurality of elements to be classified comprises the following steps:
receiving a classification instruction, and acquiring a plurality of words to be classified and time information corresponding to each word to be classified under a non-text scene based on the classification instruction to serve as the information to be classified;
and sequencing the words to be classified according to the time sequence determined based on the time information, and taking each sequenced word to be classified as each element to be classified to form the element sequence to be classified.
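The time-ordered serialization described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `(word, timestamp)` record layout, the function name `serialize_by_time` and the sample place names are all assumptions.

```python
from datetime import datetime


def serialize_by_time(records):
    """Sort (word, timestamp) records chronologically and return the words.

    The (word, datetime) record layout is an assumption for illustration.
    """
    ordered = sorted(records, key=lambda record: record[1])
    return [word for word, _ in ordered]


# Hypothetical words to be classified, each with its time information.
records = [
    ("office", datetime(2021, 2, 1, 9, 0)),
    ("home", datetime(2021, 2, 1, 7, 30)),
    ("restaurant", datetime(2021, 2, 1, 12, 15)),
]
sequence = serialize_by_time(records)
print(sequence)
```

The resulting list plays the role of the element sequence to be classified: each entry is treated like a word, and its neighbors in the list act as its context.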
Optionally, the step of using each sorted word to be classified as each element to be classified to form the element sequence to be classified includes:
using each sequenced word to be classified as each element to be classified to obtain an initial element sequence;
determining an interval duration threshold value by combining the non-text scene and the initial element sequence, and acquiring the interval duration between every two adjacent elements to be classified;
and determining an invalid element to be classified in the initial element sequence based on the interval duration threshold and the interval duration, and deleting the invalid element to be classified from the initial element sequence to form the element sequence to be classified.
Optionally, the step of determining an invalid element to be classified in the initial element sequence based on the interval duration threshold and the interval duration, and deleting the invalid element to be classified from the initial element sequence to form the element sequence to be classified includes:
judging whether each interval duration exceeds the interval duration threshold one by one;
if so, deleting the previous element to be classified in the two adjacent elements to be classified corresponding to the interval duration from the initial element sequence;
and obtaining the element sequence to be classified until all the interval durations are traversed.
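The interval-based filtering above can be sketched as follows. The sketch assumes that timestamps are given in seconds, that interval durations are measured once on the initial element sequence (the text does not say whether they are recomputed after a deletion), and that the threshold is supplied by the caller. The function name `drop_invalid_elements` is hypothetical.

```python
def drop_invalid_elements(elements, threshold_seconds):
    """Remove the earlier element of every adjacent pair whose gap is too long.

    `elements` is a time-sorted list of (word, timestamp_in_seconds) pairs.
    Gaps are measured on the initial sequence only (an assumption).
    """
    invalid = set()
    for i in range(len(elements) - 1):
        gap = elements[i + 1][1] - elements[i][1]
        if gap > threshold_seconds:
            invalid.add(i)  # the previous element of the pair is deleted
    return [el for i, el in enumerate(elements) if i not in invalid]


# Hypothetical initial element sequence with one long pause after "b".
initial = [("a", 0), ("b", 60), ("c", 4000), ("d", 4060)]
filtered = drop_invalid_elements(initial, threshold_seconds=600)
print([word for word, _ in filtered])
```

In this run only the gap between "b" and "c" exceeds the threshold, so "b" is removed and the remaining elements keep their order.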
Optionally, the clustering algorithm comprises a K-means clustering algorithm,
the step of clustering the vector sequence to be classified by using a preset clustering algorithm to divide a plurality of word vectors into a plurality of vector sets belonging to different categories comprises the following steps:
based on a K-means clustering algorithm, taking word vectors of a preset number in the vector sequence to be classified as an initial clustering center;
distributing the word vectors except the initial clustering center in the vector sequence to be classified to one initial clustering center one by one according to a shortest distance principle until the distribution is finished to generate a preset number of initial vector sets;
calculating a mean vector of each initial vector set, and taking each mean vector as each new clustering center to continue the distribution process;
and when detecting that each new clustering center meets a preset termination condition, taking each new clustering center as each target clustering center, and taking each target clustering center and the corresponding distributed word vector as a vector set.
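The K-means steps above can be sketched in plain Python as follows. Several details are assumptions, because the text leaves them open: the first `k` vectors are used as the initial clustering centers, every vector (including the initial centers) is assigned to its nearest center, and the termination condition is that no center moves by more than `tol`, with `max_iter` as a safety limit.

```python
import math


def kmeans(vectors, k, max_iter=100, tol=1e-9):
    """Cluster vectors into k sets; returns (centers, labels)."""
    # Assumption: take the first k vectors as the initial clustering centers.
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(max_iter):
        # Assign each vector to the nearest center (shortest distance principle).
        labels = [
            min(range(k), key=lambda c: math.dist(v, centers[c]))
            for v in vectors
        ]
        # Recompute each center as the mean vector of its assigned set.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:  # assumed termination condition: centers are stable
            break
    return centers, labels


# Hypothetical 2-D word vectors forming two well-separated groups.
vectors = [[0, 0], [10, 10], [0, 1], [1, 0], [10, 11], [11, 10]]
centers, labels = kmeans(vectors, k=2)
print(labels)
```

Each distinct label corresponds to one vector set; the vectors that share a label, together with their target clustering center, form that set.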
Optionally, after the step of clustering the vector sequence to be classified by using a preset clustering algorithm to divide the word vectors into a plurality of vector sets belonging to different categories, the method further includes:
dividing each vector set into a first vector subset with clear category meaning and a second vector subset with unclear category meaning;
and acquiring the category meaning of each first vector subset, and outputting the category meaning as the category shared by the first vector subset and the second vector subset in the same vector set.
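The category output step above can be sketched as follows. The text does not say how the category meaning is derived when the elements with a clear meaning disagree, so this sketch assumes a majority vote. The function name `propagate_categories` and all element names are hypothetical.

```python
from collections import Counter


def propagate_categories(clusters, known_categories):
    """Give every element of a cluster the category of its labeled members.

    `clusters` maps a cluster id to its elements; `known_categories` maps the
    elements with a clear category meaning to that category. Majority vote
    among labeled members is an assumption, not stated in the source.
    """
    result = {}
    for members in clusters.values():
        labeled = [known_categories[m] for m in members if m in known_categories]
        category = Counter(labeled).most_common(1)[0][0] if labeled else None
        for member in members:
            result[member] = category
    return result


clusters = {
    0: ["cafe_a", "cafe_b", "poi_1837"],  # poi_1837 has no clear meaning
    1: ["city_park", "poi_2201"],         # poi_2201 has no clear meaning
}
known_categories = {"cafe_a": "coffee shop", "cafe_b": "coffee shop", "city_park": "park"}
categories = propagate_categories(clusters, known_categories)
print(categories["poi_1837"], categories["poi_2201"])
```

A cluster with no labeled member is given `None` here, which leaves it for manual inspection rather than guessing a category.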
Optionally, the vectorizing each element to be classified in the element sequence to be classified to convert the element sequence to be classified into a vector sequence to be classified composed of a plurality of word vectors includes:
and taking the element sequence to be classified as input of Word2vec, and converting each element to be classified into a Word vector by using the Word2vec to obtain the vector sequence to be classified.
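To illustrate why serialization matters for Word2vec, the sketch below builds the (center, context) pairs on which the Skip-Gram variant of Word2vec is trained. It does not train any vectors; in practice a library implementation of Word2vec would be used for that. The window size and the place names are assumptions.

```python
def skipgram_pairs(sequence, window=2):
    """Return (center, context) pairs for every element within the window."""
    pairs = []
    for i, center in enumerate(sequence):
        start = max(0, i - window)
        stop = min(len(sequence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs


# Hypothetical element sequence to be classified, ordered by time.
sequence = ["home", "subway", "office", "restaurant"]
pairs = skipgram_pairs(sequence, window=1)
print(pairs)
```

Elements that tend to appear next to the same neighbors produce similar training pairs, which is what allows their learned vectors to end up close together.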
In addition, to achieve the above object, the present invention further provides a text clustering device based on non-text scenes, wherein the text clustering device based on non-text scenes comprises:
the element sequence acquisition module is used for acquiring information to be classified in a non-text scene and serializing the information to be classified according to the internal logic relationship of the information to be classified to obtain an element sequence to be classified consisting of a plurality of elements to be classified;
the vector sequence conversion module is used for vectorizing each element to be classified in the element sequence to be classified so as to convert the element sequence to be classified into a vector sequence to be classified consisting of a plurality of word vectors;
and the vector category division module is used for clustering the vector sequences to be classified by utilizing a preset clustering algorithm so as to divide a plurality of word vectors into a plurality of vector sets belonging to different categories.
Optionally, the element sequence obtaining module includes:
the instruction information acquisition unit is used for receiving a classification instruction and acquiring a plurality of words to be classified and time information corresponding to each word to be classified under a non-text scene based on the classification instruction to serve as the information to be classified;
and the time sequence ordering unit is used for ordering the words to be classified according to the time sequence determined based on the time information and taking each ordered word to be classified as each element to be classified so as to form the element sequence to be classified.
Optionally, the chronological ordering unit is further configured to:
using each sequenced word to be classified as each element to be classified to obtain an initial element sequence;
determining an interval duration threshold value by combining the non-text scene and the initial element sequence, and acquiring the interval duration between every two adjacent elements to be classified;
and determining an invalid element to be classified in the initial element sequence based on the interval duration threshold and the interval duration, and deleting the invalid element to be classified from the initial element sequence to form the element sequence to be classified.
Optionally, the chronological ordering unit is further configured to:
judging whether each interval duration exceeds the interval duration threshold one by one;
if so, deleting the previous element to be classified in the two adjacent elements to be classified corresponding to the interval duration from the initial element sequence;
and obtaining the element sequence to be classified until all the interval durations are traversed.
Optionally, the clustering algorithm comprises a K-means clustering algorithm,
the vector class classification module comprises:
the initial center determining unit is used for taking word vectors of a preset number in the vector sequence to be classified as an initial clustering center based on a K-means clustering algorithm;
the initial set generating unit is used for allocating word vectors except the initial clustering centers in the vector sequence to be classified to one initial clustering center one by one according to a shortest distance principle until allocation is finished so as to generate a preset number of initial vector sets;
a clustering center iteration unit, configured to calculate a mean vector of each initial vector set, and use each mean vector as each new clustering center to continue the allocation process;
and the vector set determining unit is used for taking each new clustering center as each target clustering center until each new clustering center is detected to meet a preset termination condition, and taking each target clustering center and the corresponding distributed word vector as a vector set.
Optionally, the non-text scene-based text clustering apparatus further includes:
the vector set dividing module is used for dividing each vector set into a first vector subset with definite category meaning and a second vector subset with unclear category meaning;
and the book category output module is used for acquiring the category meaning of each first vector subset, and outputting the category meaning as the category shared by the first vector subset and the second vector subset in the same vector set.
Optionally, the vector sequence conversion module includes:
and the vector sequence conversion unit is used for taking the element sequence to be classified as input of Word2vec, and converting each element to be classified into a Word vector by using the Word2vec to obtain the vector sequence to be classified.
In addition, to achieve the above object, the present invention further provides a text clustering device based on a non-text scene, including: a memory, a processor and a non-text scene based text clustering program stored on the memory and operable on the processor, the non-text scene based text clustering program when executed by the processor implementing the steps of the non-text scene based text clustering method as described above.
In addition, to achieve the above object, the present invention further provides a computer readable storage medium, wherein a non-text scene based text clustering program is stored on the computer readable storage medium, and when being executed by a processor, the non-text scene based text clustering program implements the steps of the non-text scene based text clustering method as described above.
The invention provides a text clustering method, device, equipment and computer-readable storage medium based on a non-text scene. In the non-text scene, the internal logical relationship of the information to be classified is first extracted, and the information is then serialized according to that relationship, so that it is converted into a sequence that has the contextual structure of a text, which facilitates subsequent processing. The serialized information is then vectorized and clustered to obtain the category of the information to be classified in the non-text scene. The text clustering idea can thus be applied to non-text scenes, which removes the limitation on the application range of existing text clustering methods and solves the technical problem that this range is limited.
Drawings
FIG. 1 is a schematic diagram of an apparatus architecture of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of a non-text scene-based text clustering method according to the present invention;
FIG. 3 is a schematic diagram of a specific example of the first embodiment of the non-text scene-based text clustering method according to the present invention;
FIG. 4 is a schematic diagram of functional modules of the non-text scene-based text clustering device according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
FIG. 1 is a schematic structural diagram of the device in the hardware operating environment according to an embodiment of the present invention.
As shown in FIG. 1, the non-text scene-based text clustering apparatus may include: a processor 1001, such as a CPU, a user interface 1003, a network interface 1004, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration of the apparatus shown in fig. 1 is not intended to be limiting of the apparatus and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a type of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a non-text scene based text clustering program.
In the device shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a client (programmer's end) and performing data communication with the client; and the processor 1001 may be configured to call the non-text scene-based text clustering program stored in the memory 1005, and perform the following operations in the non-text scene-based text clustering method:
acquiring information to be classified in a non-text scene, and serializing the information to be classified according to an internal logical relationship of the information to be classified to obtain an element sequence to be classified consisting of a plurality of elements to be classified;
vectorizing each element to be classified in the element sequence to be classified so as to convert the element sequence to be classified into a vector sequence to be classified consisting of a plurality of word vectors;
and clustering the vector sequences to be classified by utilizing a preset clustering algorithm so as to divide a plurality of word vectors into a plurality of vector sets belonging to different categories.
Further, the internal logical relationship includes a temporal order,
the step of obtaining information to be classified in a non-text scene, and serializing the information to be classified according to the internal logical relationship of the information to be classified to obtain an element sequence to be classified consisting of a plurality of elements to be classified comprises the following steps:
receiving a classification instruction, and acquiring a plurality of words to be classified and time information corresponding to each word to be classified under a non-text scene based on the classification instruction to serve as the information to be classified;
and sequencing the words to be classified according to the time sequence determined based on the time information, and taking each sequenced word to be classified as each element to be classified to form the element sequence to be classified.
Further, the step of using each sorted word to be classified as each element to be classified to form the element sequence to be classified includes:
using each sequenced word to be classified as each element to be classified to obtain an initial element sequence;
determining an interval duration threshold value by combining the non-text scene and the initial element sequence, and acquiring the interval duration between every two adjacent elements to be classified;
and determining an invalid element to be classified in the initial element sequence based on the interval duration threshold and the interval duration, and deleting the invalid element to be classified from the initial element sequence to form the element sequence to be classified.
Further, the step of determining an invalid element to be classified in the initial element sequence based on the interval duration threshold and the interval duration, and deleting the invalid element to be classified from the initial element sequence to form the element sequence to be classified includes:
judging whether each interval duration exceeds the interval duration threshold one by one;
if so, deleting the previous element to be classified in the two adjacent elements to be classified corresponding to the interval duration from the initial element sequence;
and obtaining the element sequence to be classified until all the interval durations are traversed.
Further, the clustering algorithm comprises a K-means clustering algorithm,
the step of clustering the vector sequence to be classified by using a preset clustering algorithm to divide a plurality of word vectors into a plurality of vector sets belonging to different categories comprises the following steps:
based on a K-means clustering algorithm, taking word vectors of a preset number in the vector sequence to be classified as an initial clustering center;
distributing the word vectors except the initial clustering center in the vector sequence to be classified to one initial clustering center one by one according to a shortest distance principle until the distribution is finished to generate a preset number of initial vector sets;
calculating a mean vector of each initial vector set, and taking each mean vector as each new clustering center to continue the distribution process;
and when detecting that each new clustering center meets a preset termination condition, taking each new clustering center as each target clustering center, and taking each target clustering center and the corresponding distributed word vector as a vector set.
Further, after the step of clustering the vector sequence to be classified by using a preset clustering algorithm to divide a plurality of word vectors into a plurality of vector sets belonging to different categories, the processor 1001 may be configured to invoke a text clustering program based on a non-text scene stored in the memory 1005, and perform the following operations in the text clustering method based on a non-text scene:
dividing each vector set into a first vector subset with clear category meaning and a second vector subset with unclear category meaning;
and acquiring the category meaning of each first vector subset, and outputting the category meaning as the category shared by the first vector subset and the second vector subset in the same vector set.
Further, the step of vectorizing each element to be classified in the element sequence to be classified to convert the element sequence to be classified into a vector sequence to be classified composed of a plurality of word vectors includes:
and taking the element sequence to be classified as input of Word2vec, and converting each element to be classified into a Word vector by using the Word2vec to obtain the vector sequence to be classified.
Based on the hardware structure, the embodiment of the text clustering method based on the non-text scene is provided.
In order to solve the above problems, the invention provides a text clustering method based on a non-text scene. In the non-text scene, the internal logical relationship of the information to be classified is first extracted, and the information is then serialized according to that relationship, so that it is converted into a sequence that has the contextual structure of a text, which facilitates subsequent processing. The serialized information is then vectorized and clustered to obtain the category of the information to be classified in the non-text scene. The text clustering idea can thus be applied to non-text scenes, which removes the limitation on the application range of existing text clustering methods and solves the technical problem that this range is limited.
Referring to FIG. 2, FIG. 2 is a schematic flow chart of a text clustering method based on non-text scenes according to a first embodiment of the present invention. The text clustering method based on the non-text scene comprises the following steps:
Step S10, obtaining information to be classified in a non-text scene, and serializing the information to be classified according to the internal logical relationship of the information to be classified to obtain an element sequence to be classified consisting of a plurality of elements to be classified;
In this embodiment, the method is applied to a terminal device. A non-text scene refers to an actual scene other than text, such as the different place names that occur in different periods of a day, or the apps that a user downloads at different points in time. The information to be classified refers to the information that needs to be classified in the non-text scene, and it can take various forms. For example, if the information to be classified is location information, it may be given as text, as longitude and latitude, and so on. The internal logical relationship is usually a chronological order, although other orders may be used in practice. An element to be classified refers to a sequence element obtained by serializing the information to be classified, and the element sequence to be classified generally consists of a plurality of such elements.
When the terminal acquires, in a non-text scene, a series of information to be classified for the current clustering task, it first extracts the logical relationship inherent in that information and then serializes the information according to this relationship, so that the information has the contextual structure of a text. The information is thereby converted into a plurality of elements to be classified that form an element sequence to be classified, and each element in the sequence can be regarded as a word in a text.
Step S20, vectorizing each element to be classified in the element sequence to be classified to convert the element sequence to be classified into a vector sequence to be classified consisting of a plurality of word vectors;
In this embodiment, a word vector refers to the vector obtained by vectorizing an element to be classified. Common ways of vectorizing sequence data include Word2vec, one-hot encoding, label encoding, etc., with Word2vec being preferred. Word2vec is a simple and efficient toolkit for obtaining word vectors, released by Google in 2013, and comprises two models, CBOW and Skip-Gram. One-hot encoding, also known as one-bit-effective encoding, uses an N-bit state register to encode N states; each state has its own register bit, and only one bit is active at any time.
The terminal vectorizes each element to be classified in the element sequence to be classified in a sequence data vectorization mode, so that each element to be classified is converted into a word vector, and the original element sequence to be classified is correspondingly converted into a vector sequence to be classified.
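As a small illustration of one of the vectorization options named above, the following sketch applies one-hot encoding to an element sequence. The vocabulary ordering (alphabetical) and the place names are assumptions; Word2vec, the preferred option, would instead produce dense vectors learned from context.

```python
def one_hot_encode(sequence):
    """Encode each element as an N-bit vector with exactly one active bit."""
    vocabulary = sorted(set(sequence))  # assumed ordering: alphabetical
    index = {element: i for i, element in enumerate(vocabulary)}
    vectors = []
    for element in sequence:
        vector = [0] * len(vocabulary)
        vector[index[element]] = 1
        vectors.append(vector)
    return vocabulary, vectors


# Hypothetical element sequence to be classified.
vocabulary, vectors = one_hot_encode(["home", "office", "home", "gym"])
print(vocabulary)
print(vectors)
```

One-hot vectors are all equally distant from one another, so they carry no notion of similarity between elements; this is one reason a context-based method such as Word2vec is better suited to the subsequent clustering step.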
And step S30, clustering the vector sequences to be classified by using a preset clustering algorithm so as to divide a plurality of word vectors into a plurality of vector sets belonging to different categories.
In this embodiment, the Clustering algorithm may specifically adopt a K-means Clustering algorithm, a mean shift Clustering algorithm, a Density-Based Clustering algorithm with Noise (DBSCAN, Density-Based Clustering of Applications with Noise), and the like, and preferably adopts a K-means Clustering algorithm. The K-means clustering algorithm is a clustering analysis algorithm for iterative solution, and comprises the steps of dividing data into K groups in advance, randomly selecting K objects as initial clustering centers, calculating the distance between each object and each seed clustering center, and allocating each object to the nearest clustering center. The cluster centers and the objects assigned to them represent a cluster. The cluster center of a cluster is recalculated for each sample assigned based on the objects existing in the cluster. This process will be repeated until some termination condition is met. The mean shift clustering algorithm first assumes that each cluster in the sample space obeys a certain known probability distribution rule, then fits a statistical histogram in the sample with different probability density functions, and continuously moves the position of the center (mean) of the density function until the best fitting effect is obtained. The peak point of the probability density functions is the center of the cluster, and then the category to which the nearest cluster center belongs is selected as the category of the sample according to the distance between each sample and each center. DBSCAN is a density-based spatial clustering algorithm. The algorithm divides the area with sufficient density into clusters and finds arbitrarily shaped clusters in a spatial database with noise, which defines clusters as the largest set of density-connected points.
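For comparison with K-means, the density-based alternative mentioned in this embodiment can be sketched in plain Python as follows. This is a textbook-style DBSCAN, not code from the patent: the parameter names `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a core point) follow common usage, and the sample points and parameter values are assumptions.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical 2-D vectors: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
dbscan_labels = dbscan(points, eps=1.5, min_pts=3)
print(dbscan_labels)
```

Unlike K-means, this approach does not need the number of clusters in advance and can mark isolated vectors as noise instead of forcing them into a category.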
The terminal clusters the current vector sequence to be classified using one of these clustering algorithms. For example, suppose the current vector sequence to be classified contains 10 word vectors and is divided into three vector sets of different categories: the first contains 3 word vectors, the second contains 2, and the third contains 5. Then those 3 word vectors belong to the first category, the 2 word vectors to the second category, and the 5 word vectors to the third category.
A specific example is shown in fig. 3. Take the different locations at which a client appears over time (e.g., in different time periods of a day): the idea of text clustering can be used to cluster the locations effectively (the locations can be in any form, such as text or longitude and latitude). After the terminal preprocesses a series of location information, it sorts the locations by their corresponding times to form a sequence. The terminal then vectorizes the sorted locations with the Word2vec tool to obtain a series of location vectors and clusters these vectors using the K-means clustering algorithm. Locations in the same cluster are regarded as belonging to the same category; the specific category of a cluster is determined from the locations in it that have a definite category meaning and is then output, so that the categories of all locations are identified quickly.
The invention provides a text clustering method based on a non-text scene. The method acquires information to be classified in a non-text scene and serializes it according to its internal logical relationship to obtain an element sequence to be classified consisting of a plurality of elements to be classified; vectorizes each element to be classified in the element sequence so as to convert the element sequence into a vector sequence to be classified consisting of a plurality of word vectors; and clusters the vector sequence to be classified using a preset clustering algorithm so as to divide the plurality of word vectors into a plurality of vector sets belonging to different categories. In this way, the internal logical relationship of the information to be classified is first extracted in the non-text scene, and the information is then serialized according to that relationship, so that the information to be classified in the non-text scene is converted into a sequence that has the contextual structure found in text and is convenient for subsequent processing. The serialized information is then vectorized and clustered to finally obtain the categories of the information to be classified in the non-text scene. The idea of text clustering can thus be applied to non-text scenes, which overcomes the limited application range of existing text clustering methods.
Further, based on the first embodiment shown in fig. 2, a second embodiment of the non-text scene-based text clustering method according to the present invention is provided. In this embodiment, the internal logical relationship includes a time order, and step S10 includes:
receiving a classification instruction, and acquiring a plurality of words to be classified and time information corresponding to each word to be classified under a non-text scene based on the classification instruction to serve as the information to be classified;
and sequencing the words to be classified according to the time sequence determined based on the time information, and taking each sequenced word to be classified as each element to be classified to form the element sequence to be classified.
In this embodiment, the information to be classified includes a plurality of words to be classified and the time information corresponding to each word. When the terminal receives the classification instruction, it can directly acquire, according to the instruction, the different places where a specified client appeared within a specified duration together with the time information for each place, or the APPs downloaded by the client within the specified duration together with the download time of each APP. The terminal can regard each piece of location information or APP information as a word, namely a word to be classified, and regards the location information or APP information together with the corresponding time information as the information to be classified. The terminal then sorts all the current words in the time order indicated by the time information corresponding to each word, thereby obtaining the element sequence to be classified.
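The serialization step described above can be sketched as a sort by timestamp. This is only an illustrative sketch: the function name `build_sequence`, the ISO 8601 timestamp format, and the sample visits are assumptions, not details given in the patent.

```python
from datetime import datetime

def build_sequence(records):
    """records: iterable of (word, iso_timestamp) pairs.

    Returns the words ordered by their timestamps, i.e. the element
    sequence to be classified.
    """
    ordered = sorted(records, key=lambda r: datetime.fromisoformat(r[1]))
    return [word for word, _ in ordered]

# Hypothetical places visited by one client during a day, in arbitrary order.
visits = [
    ("office", "2021-02-20T09:00:00"),
    ("home", "2021-02-20T07:30:00"),
    ("restaurant", "2021-02-20T12:15:00"),
]
sequence = build_sequence(visits)
```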
Further, the step of using each sorted word to be classified as each element to be classified to form the element sequence to be classified includes:
using each sequenced word to be classified as each element to be classified to obtain an initial element sequence;
determining an interval duration threshold value by combining the non-text scene and the initial element sequence, and acquiring the interval duration between every two adjacent elements to be classified;
and determining an invalid element to be classified in the initial element sequence based on the interval duration threshold and the interval duration, and deleting the invalid element to be classified from the initial element sequence to form the element sequence to be classified.
In this embodiment, the validity of the logical relationship between every two adjacent elements in the sequence needs to be considered. If the time interval between two adjacent elements is too long, the former element can hardly serve as a reference that influences the latter. The threshold for this time interval needs to be determined in combination with the current non-text scene and the actual data.
The terminal takes the sequence formed by sorting the words to be classified as the initial element sequence, and then determines an interval duration threshold in combination with the current actual scene and the data in the sequence. The terminal acquires the time interval between every two adjacent elements in the current sequence, determines the invalid elements by comparing each interval with the interval duration threshold, removes them, and takes the sequence remaining after all invalid elements are removed as the element sequence to be classified.
Further, the step of determining an invalid element to be classified in the initial element sequence based on the interval duration threshold and the interval duration, and deleting the invalid element to be classified from the initial element sequence to form the element sequence to be classified includes:
judging whether each interval duration exceeds the interval duration threshold one by one;
if so, deleting the previous element to be classified in the two adjacent elements to be classified corresponding to the interval duration from the initial element sequence;
and obtaining the element sequence to be classified until all the interval durations are traversed.
In this embodiment, specifically, the terminal determines one by one whether the time interval between every two adjacent elements exceeds the interval duration threshold. If so, the logical relationship between the two adjacent elements has probably attenuated greatly and no longer resembles the contextual relationship found in text, so the former element should be removed from the current sequence; if not, the terminal goes on to compare the next pair. Once the time intervals between all adjacent elements have been traversed, the final element sequence to be classified is obtained.
Further, by determining the interval duration threshold in combination with the actual non-text scene and the actual data, this embodiment screens out invalid elements whose logical relationship with the following element to be classified has attenuated excessively, thereby ensuring the validity of the resulting element sequence to be classified and, in turn, the accuracy of the final clustering result.
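The interval check of this embodiment can be sketched as follows. The sketch assumes that all interval durations are measured on the initial element sequence (they are not recomputed after a deletion), that times are given in seconds, and that the threshold is supplied by the caller; the name `drop_invalid_elements` and the sample data are hypothetical.

```python
def drop_invalid_elements(elements, threshold_seconds):
    """elements: list of (word, time_in_seconds) pairs already sorted by time.

    For every adjacent pair whose gap exceeds the threshold, the former
    element of the pair is treated as invalid and removed.
    """
    invalid = set()
    for i in range(len(elements) - 1):
        gap = elements[i + 1][1] - elements[i][1]
        if gap > threshold_seconds:
            invalid.add(i)  # index of the former element of the pair
    return [e for i, e in enumerate(elements) if i not in invalid]

# Hypothetical initial element sequence with one long gap (office -> gym).
initial = [("home", 0), ("cafe", 600), ("office", 1200), ("gym", 40000), ("bar", 40900)]
filtered = drop_invalid_elements(initial, threshold_seconds=3600)
filtered_words = [word for word, _ in filtered]
```

Here "office" is dropped because more than an hour passes before the next element, so it is unlikely to provide useful context for "gym".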
Further, based on the first embodiment shown in fig. 2, a third embodiment of the non-text scene-based text clustering method according to the present invention is provided. In this embodiment, the clustering algorithm includes a K-means clustering algorithm, and step S30 includes:
based on a K-means clustering algorithm, taking word vectors of a preset number in the vector sequence to be classified as an initial clustering center;
distributing the word vectors except the initial clustering center in the vector sequence to be classified to one initial clustering center one by one according to a shortest distance principle until the distribution is finished to generate a preset number of initial vector sets;
calculating a mean vector of each initial vector set, and taking each mean vector as each new clustering center to continue the distribution process;
and when detecting that each new clustering center meets a preset termination condition, taking each new clustering center as each target clustering center, and taking each target clustering center and the corresponding distributed word vector as a vector set.
In this embodiment, following the K-means clustering algorithm, the terminal selects K word vectors (K being a preset value) from the current sequence as initial clustering centers, and then assigns each of the remaining word vectors to the initial clustering center closest to it, obtaining K initial vector sets. The terminal then calculates the mean vector of each initial vector set, uses it as the new clustering center for the next iteration, and continues the assignment process. When the new clustering centers in some iteration are detected to meet a preset termination condition, for example that the distance between each new clustering center and the corresponding previous clustering center is smaller than a preset threshold, the iteration ends. The new clustering centers at that point are taken as the final target clustering centers, and each target clustering center together with the word vectors assigned to it is taken as a vector set belonging to one category.
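The iteration described in this embodiment can be sketched in plain Python as below. Several choices are assumptions made for the example rather than details from the patent: the first K vectors are used as the initial clustering centers, the termination condition is that no center moves by more than `tol` between iterations, and every vector (including the ones chosen as initial centers) takes part in the assignment.

```python
from math import dist

def kmeans(vectors, k, tol=1e-6, max_iter=100):
    """Return (centers, clusters) where clusters[c] lists the vectors of center c."""
    centers = [tuple(v) for v in vectors[:k]]  # assumption: first k vectors as initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: dist(v, centers[c]))
            clusters[nearest].append(v)
        # Update step: each center becomes the mean vector of its cluster.
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[c]
            for c, members in enumerate(clusters)
        ]
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # termination: centers have stopped moving
            break
    return centers, clusters

# Hypothetical 2-D "word vectors" forming two obvious groups.
vectors = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(vectors, k=2)
```

Real word vectors have many more dimensions, but the same code applies because `math.dist` works for points of any equal length.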
Further, after step S30, the method further includes:
dividing each vector set into a first vector subset with clear category meaning and a second vector subset with unclear category meaning;
and acquiring the category meaning of each first vector subset, and outputting the category meaning as the category shared by the first vector subset and the second vector subset in the same vector set.
In this embodiment, the first vector subset is the set of all word vectors with a definite category meaning in a vector set; the second vector subset is the set of all word vectors with an indefinite category meaning in that vector set. After clustering, the original elements to be classified are divided into several sets by category, and each set may contain both words with a definite category meaning and words with an indefinite one, so the categories of the latter can be judged quickly from the former. For example, suppose the words corresponding to a vector set are KTV, movie theater and AAA. The first two words have definite category meanings and are clearly entertainment venues, so "AAA", which belongs to the same set, should also be an entertainment venue. "Entertainment venue" can therefore be taken as the category of this vector set and displayed; other clusters are handled in the same way.
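The category assignment described above can be sketched as follows. The patent does not say how to proceed when the words with definite meanings in one cluster disagree, so this sketch assumes a simple majority vote; the function name `label_clusters` and the sample data are likewise hypothetical.

```python
from collections import Counter

def label_clusters(clusters, known_categories):
    """clusters: list of lists of words; known_categories: dict word -> category.

    Returns a dict mapping every word in a labelled cluster to the cluster's category.
    """
    result = {}
    for words in clusters:
        # First subset: words whose category meaning is already definite.
        known = [known_categories[w] for w in words if w in known_categories]
        if not known:
            continue  # nothing in this cluster has a definite category meaning
        category = Counter(known).most_common(1)[0][0]  # assumption: majority vote
        for w in words:
            result[w] = category  # shared by both subsets of the cluster
    return result

clusters_of_words = [["KTV", "movie theater", "AAA"], ["BBB"]]
known = {"KTV": "entertainment venue", "movie theater": "entertainment venue"}
categories = label_clusters(clusters_of_words, known)
```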
Further, step S20 includes:
and taking the element sequence to be classified as input of Word2vec, and converting each element to be classified into a Word vector by using the Word2vec to obtain the vector sequence to be classified.
In this embodiment, Word2vec is a toolkit for obtaining word vectors released by Google in 2013, and is a group of related models for generating word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. The network takes a word as input and predicts the words at adjacent positions; under the bag-of-words assumption in word2vec, the order of those words is unimportant. After training, the word2vec model can be used to map each word to a vector that represents word-to-word relationships; this vector is the hidden layer of the neural network. The specific procedure is common practice in the art and is not described here.
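To make the "predict the words at adjacent positions" idea concrete, the sketch below generates the (input word, neighbouring word) training pairs that a skip-gram style model would learn from. It does not train a network or reproduce the Word2vec toolkit; the function name `skipgram_pairs` and the window size are assumptions for illustration.

```python
def skipgram_pairs(sequence, window=2):
    """Return (center, context) pairs for every element and its neighbours.

    Each element of the sequence is paired with the elements at most
    `window` positions before or after it.
    """
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs

# Hypothetical element sequence obtained after time-ordering a client's places.
pairs = skipgram_pairs(["home", "office", "gym"], window=1)
```

Elements that tend to appear next to the same neighbours produce similar training pairs, which is why their learned vectors end up close together and can then be clustered.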
Further, in this embodiment, the category of a word with an indefinite category meaning is predicted from the category of the words with a definite category meaning in the same cluster and is output as the common category of the cluster, so that the category meanings of all words with indefinite category meanings in the current classification task can be determined quickly and accurately.
As shown in fig. 4, the present invention further provides a non-text scene-based text clustering apparatus, where the non-text scene-based text clustering apparatus includes:
the element sequence acquiring module 10 is configured to acquire information to be classified in a non-text scene, and serialize the information to be classified according to an internal logical relationship of the information to be classified to obtain an element sequence to be classified, which is composed of a plurality of elements to be classified;
a vector sequence conversion module 20, configured to perform vectorization on each element to be classified in the element sequence to be classified, so as to convert the element sequence to be classified into a vector sequence to be classified, where the vector sequence to be classified is composed of a plurality of word vectors;
the vector category dividing module 30 is configured to cluster the vector sequences to be classified by using a preset clustering algorithm, so as to divide the word vectors into a plurality of vector sets belonging to different categories.
For the method executed by each program module, refer to the embodiments of the non-text scene-based text clustering method of the present invention; details are not repeated here.
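One way to picture how the three modules fit together is as three pluggable functions chained in order. The sketch below is only an illustration of that structure: the class name `ClusteringApparatus`, its field names, and the toy stand-in functions are assumptions, and the stand-ins do not implement Word2vec or K-means.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = Sequence[float]

@dataclass
class ClusteringApparatus:
    acquire_sequence: Callable[[object], List[str]]    # role of module 10
    to_vectors: Callable[[List[str]], List[Vector]]    # role of module 20
    divide: Callable[[List[Vector]], List[List[int]]]  # role of module 30, returns index groups

    def run(self, raw):
        elements = self.acquire_sequence(raw)
        vectors = self.to_vectors(elements)
        groups = self.divide(vectors)
        # Map the grouped vector indices back to the original elements.
        return [[elements[i] for i in group] for group in groups]

# Toy stand-ins: sort by time, embed each word as its length, split on a fixed cut-off.
apparatus = ClusteringApparatus(
    acquire_sequence=lambda raw: [w for w, _ in sorted(raw, key=lambda r: r[1])],
    to_vectors=lambda elements: [(float(len(e)),) for e in elements],
    divide=lambda vs: [
        [i for i, v in enumerate(vs) if v[0] < 4],
        [i for i, v in enumerate(vs) if v[0] >= 4],
    ],
)
grouped = apparatus.run([("home", 2), ("gym", 1), ("office", 3)])
```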
The invention also provides text clustering equipment based on the non-text scene.
The non-text scene based text clustering device comprises a processor, a memory and a non-text scene based text clustering program which is stored on the memory and can run on the processor, wherein when the non-text scene based text clustering program is executed by the processor, the steps of the non-text scene based text clustering method are realized.
For the method implemented when the non-text scene-based text clustering program is executed, refer to the embodiments of the non-text scene-based text clustering method of the present invention; details are not repeated here.
The invention also provides a computer readable storage medium.
The computer-readable storage medium of the present invention stores thereon a non-text scene-based text clustering program, which when executed by a processor implements the steps of the non-text scene-based text clustering method as described above.
For the method implemented when the non-text scene-based text clustering program is executed, refer to the embodiments of the non-text scene-based text clustering method of the present invention; details are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A text clustering method based on non-text scenes is characterized by comprising the following steps:
acquiring information to be classified in a non-text scene, and serializing the information to be classified according to an internal logical relationship of the information to be classified to obtain an element sequence to be classified consisting of a plurality of elements to be classified;
vectorizing each element to be classified in the element sequence to be classified so as to convert the element sequence to be classified into a vector sequence to be classified consisting of a plurality of word vectors;
and clustering the vector sequences to be classified by utilizing a preset clustering algorithm so as to divide a plurality of word vectors into a plurality of vector sets belonging to different categories.
2. The non-text scene-based text clustering method according to claim 1, wherein the internal logical relationship includes a temporal order,
the step of obtaining information to be classified in a non-text scene, and serializing the information to be classified according to the internal logical relationship of the information to be classified to obtain an element sequence to be classified consisting of a plurality of elements to be classified comprises the following steps:
receiving a classification instruction, and acquiring a plurality of words to be classified and time information corresponding to each word to be classified under a non-text scene based on the classification instruction to serve as the information to be classified;
and sequencing the words to be classified according to the time sequence determined based on the time information, and taking each sequenced word to be classified as each element to be classified to form the element sequence to be classified.
3. The method for clustering texts based on non-text scenes according to claim 2, wherein the step of using each ordered word to be classified as each element to be classified to form the sequence of elements to be classified comprises:
using each sequenced word to be classified as each element to be classified to obtain an initial element sequence;
determining an interval duration threshold value by combining the non-text scene and the initial element sequence, and acquiring the interval duration between every two adjacent elements to be classified;
and determining an invalid element to be classified in the initial element sequence based on the interval duration threshold and the interval duration, and deleting the invalid element to be classified from the initial element sequence to form the element sequence to be classified.
4. The method for clustering text based on non-text scenes according to claim 3, wherein the step of determining the invalid elements to be classified in the initial element sequence based on the interval duration threshold and the interval duration, and deleting the invalid elements to be classified from the initial element sequence to form the element sequence to be classified comprises:
judging whether each interval duration exceeds the interval duration threshold one by one;
if so, deleting the previous element to be classified in the two adjacent elements to be classified corresponding to the interval duration from the initial element sequence;
and obtaining the element sequence to be classified until all the interval durations are traversed.
5. The non-text scene-based text clustering method according to claim 1, wherein the clustering algorithm comprises a K-means clustering algorithm,
the step of clustering the vector sequence to be classified by using a preset clustering algorithm to divide a plurality of word vectors into a plurality of vector sets belonging to different categories comprises the following steps:
based on a K-means clustering algorithm, taking word vectors of a preset number in the vector sequence to be classified as an initial clustering center;
distributing the word vectors except the initial clustering center in the vector sequence to be classified to one initial clustering center one by one according to a shortest distance principle until the distribution is finished to generate a preset number of initial vector sets;
calculating a mean vector of each initial vector set, and taking each mean vector as each new clustering center to continue the distribution process;
and when detecting that each new clustering center meets a preset termination condition, taking each new clustering center as each target clustering center, and taking each target clustering center and the corresponding distributed word vector as a vector set.
6. The method for clustering texts based on non-text scenes according to claim 1, wherein after the step of clustering the sequence of vectors to be classified by using a preset clustering algorithm to divide a plurality of word vectors into a plurality of vector sets belonging to different categories, the method further comprises:
dividing each vector set into a first vector subset with clear category meaning and a second vector subset with unclear category meaning;
and acquiring the category meaning of each first vector subset, and outputting the category meaning as the category shared by the first vector subset and the second vector subset in the same vector set.
7. The non-text scene-based text clustering method according to any one of claims 1 to 6, wherein the step of vectorizing each element to be classified in the sequence of elements to be classified to convert the sequence of elements to be classified into a sequence of vectors to be classified consisting of a plurality of word vectors comprises:
and taking the element sequence to be classified as input of Word2vec, and converting each element to be classified into a Word vector by using the Word2vec to obtain the vector sequence to be classified.
8. A non-text scene-based text clustering device is characterized in that the non-text scene-based text clustering device comprises:
the element sequence acquisition module is used for acquiring information to be classified in a non-text scene and serializing the information to be classified according to the internal logic relationship of the information to be classified to obtain an element sequence to be classified consisting of a plurality of elements to be classified;
the vector sequence conversion module is used for vectorizing each element to be classified in the element sequence to be classified so as to convert the element sequence to be classified into a vector sequence to be classified consisting of a plurality of word vectors;
and the vector category division module is used for clustering the vector sequences to be classified by utilizing a preset clustering algorithm so as to divide a plurality of word vectors into a plurality of vector sets belonging to different categories.
9. A non-text scene based text clustering device, the non-text scene based text clustering device comprising: memory, a processor and a non-text scene based text clustering program stored on the memory and executable on the processor, the non-text scene based text clustering program when executed by the processor implementing the steps of the non-text scene based text clustering method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a non-text scene-based text clustering program, which when executed by a processor implements the steps of the non-text scene-based text clustering method according to any one of claims 1 to 7.
CN202110195010.5A 2021-02-20 2021-02-20 Text clustering method, device and equipment based on non-text scene and storage medium Pending CN112905792A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110195010.5A CN112905792A (en) 2021-02-20 2021-02-20 Text clustering method, device and equipment based on non-text scene and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110195010.5A CN112905792A (en) 2021-02-20 2021-02-20 Text clustering method, device and equipment based on non-text scene and storage medium

Publications (1)

Publication Number Publication Date
CN112905792A true CN112905792A (en) 2021-06-04

Family

ID=76124260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110195010.5A Pending CN112905792A (en) 2021-02-20 2021-02-20 Text clustering method, device and equipment based on non-text scene and storage medium

Country Status (1)

Country Link
CN (1) CN112905792A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113537828A (en) * 2021-08-04 2021-10-22 拉扎斯网络科技(上海)有限公司 Virtual site mining method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105430032A (en) * 2014-09-17 2016-03-23 阿里巴巴集团控股有限公司 Method of pushing information by combining geographic position of terminal, and server
WO2016169192A1 (en) * 2015-04-24 2016-10-27 百度在线网络技术(北京)有限公司 Method and apparatus for determining user similarity
CN106339417A (en) * 2016-08-15 2017-01-18 浙江大学 Detection method for user group behavior rules based on stay places in mobile trajectory
CN106776930A (en) * 2016-12-01 2017-05-31 合肥工业大学 A kind of location recommendation method for incorporating time and geographical location information
WO2018027180A1 (en) * 2016-08-05 2018-02-08 The Regents Of The University Of California Phase identification in power distribution systems
CN110113368A (en) * 2019-06-27 2019-08-09 电子科技大学 A kind of network behavior method for detecting abnormality based on sub-trajectory mode
CN112084237A (en) * 2020-09-09 2020-12-15 广东电网有限责任公司中山供电局 Power system abnormity prediction method based on machine learning and big data analysis

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
丰江帆;熊雨虹;: "一种基于个人位置信息的重要地点识别方法", 小型微型计算机系统, no. 03, pages 503 - 507 *
毛郁欣;邱智学;: "基于Word2Vec模型和K-Means算法的信息技术文档聚类研究", 中国信息技术教育, no. 08, pages 99 - 101 *
陈婷: "基于移动时空轨迹的路网热点区域挖掘系统设计与实现", 中国优秀硕士学位论文全文数据库 (工程科技Ⅱ辑), pages 034 - 1270 *

Similar Documents

Publication Publication Date Title
CN111709533B (en) Distributed training method and device of machine learning model and computer equipment
CN114840327B (en) Multi-mode multi-task processing method, device and system
CN113408570A (en) Image category identification method and device based on model distillation, storage medium and terminal
CN108268936B (en) Method and apparatus for storing convolutional neural networks
CN114896067A (en) Automatic generation method and device of task request information, computer equipment and medium
CN112905792A (en) Text clustering method, device and equipment based on non-text scene and storage medium
CN114360027A (en) Training method and device for feature extraction network and electronic equipment
CN113515593A (en) Topic detection method and device based on clustering model and computer equipment
CN110659631A (en) License plate recognition method and terminal equipment
CN110532448B (en) Document classification method, device, equipment and storage medium based on neural network
CN108830302B (en) Image classification method, training method, classification prediction method and related device
CN116957006A (en) Training method, device, equipment, medium and program product of prediction model
CN113408571B (en) Image classification method and device based on model distillation, storage medium and terminal
CN112862073B (en) Compressed data analysis method and device, storage medium and terminal
CN113139751A (en) Method for determining micro-service user service type based on big data
CN114638308A (en) Method and device for acquiring object relationship, electronic equipment and storage medium
CN114898184A (en) Model training method, data processing method and device and electronic equipment
CN110472113B (en) Intelligent interaction engine optimization method, device and equipment
CN113780532A (en) Training method, device and equipment for semantic segmentation network and storage medium
CN112764923A (en) Computing resource allocation method and device, computer equipment and storage medium
CN111261165A (en) Station name identification method, device, equipment and storage medium
EP4300366A1 (en) Method, apparatus, and system for multi-modal multi-task processing
CN111526054B (en) Method and device for acquiring network
CN116974898A (en) Data processing method, device, equipment and computer readable storage medium
CN113988316A (en) Method and device for training machine learning model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination