CN112905792A - Text clustering method, device and equipment based on non-text scene and storage medium

Info

Publication number: CN112905792A
Application number: CN202110195010.5A
Authority: CN
Inventors: 王开宏; 陈婷; 吴三平; 庄伟亮
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2021-02-20
Filing date: 2021-02-20
Publication date: 2021-06-04

Abstract

The invention relates to the technical field of financial technology (Fintech) and discloses a text clustering method, device, and equipment based on a non-text scene, and a computer-readable storage medium. In the non-text scene, the internal logical relationship of the information to be classified is first extracted, and the information is then serialized according to that relationship, so that it takes a sequential form with the contextual structure of text, which facilitates subsequent processing. The serialized information is then vectorized and clustered to obtain the category of the information to be classified in the non-text scene. The idea of text clustering can thus be applied to non-text scenes, which removes the limited application range of existing text clustering methods.

Description

Text clustering method, device and equipment based on non-text scene and storage medium

Technical Field

The invention relates to the technical field of financial technology (Fintech), in particular to a text clustering method, a text clustering device, text clustering equipment and a computer-readable storage medium based on a non-text scene.

Background

With the development of computer technology, more and more technologies (big data, distributed computing, blockchain, artificial intelligence, etc.) are applied in the financial field, and the traditional financial industry is gradually shifting to financial technology (Fintech). The security and real-time requirements of the financial industry, however, place higher demands on these technologies. Text clustering is a method that trains word vectors for the words in a text and clusters those word vectors with a clustering algorithm; it is usually applied to text scenes and plays an important role in natural language processing. In application scenarios other than text, however, the idea of text clustering is rarely applied in practice to solve problems, which reflects the technical problem that the application range of existing text clustering methods is relatively limited.

Disclosure of Invention

The invention mainly aims to provide a text clustering method and equipment based on a non-text scene and a computer readable storage medium, and aims to solve the technical problem that the application range of the existing text clustering method is limited.

In order to achieve the above object, the present invention provides a text clustering method based on non-text scenes, which includes:

acquiring information to be classified in a non-text scene, and serializing the information to be classified according to an internal logical relationship of the information to be classified to obtain an element sequence to be classified consisting of a plurality of elements to be classified;

vectorizing each element to be classified in the element sequence to be classified so as to convert the element sequence to be classified into a vector sequence to be classified consisting of a plurality of word vectors;

and clustering the vector sequences to be classified by utilizing a preset clustering algorithm so as to divide a plurality of word vectors into a plurality of vector sets belonging to different categories.

Optionally, the internal logical relationship comprises a temporal order,

the step of obtaining information to be classified in a non-text scene, and serializing the information to be classified according to the internal logical relationship of the information to be classified to obtain an element sequence to be classified consisting of a plurality of elements to be classified comprises the following steps:

receiving a classification instruction, and acquiring a plurality of words to be classified and time information corresponding to each word to be classified under a non-text scene based on the classification instruction to serve as the information to be classified;

and sequencing the words to be classified according to the time sequence determined based on the time information, and taking each sequenced word to be classified as each element to be classified to form the element sequence to be classified.

Optionally, the step of using each sorted word to be classified as each element to be classified to form the element sequence to be classified includes:

using each sequenced word to be classified as each element to be classified to obtain an initial element sequence;

determining an interval duration threshold value by combining the non-text scene and the initial element sequence, and acquiring the interval duration between every two adjacent elements to be classified;

and determining an invalid element to be classified in the initial element sequence based on the interval duration threshold and the interval duration, and deleting the invalid element to be classified from the initial element sequence to form the element sequence to be classified.

Optionally, the step of determining an invalid element to be classified in the initial element sequence based on the interval duration threshold and the interval duration, and deleting the invalid element to be classified from the initial element sequence to form the element sequence to be classified includes:

judging whether each interval duration exceeds the interval duration threshold one by one;

if so, deleting the previous element to be classified in the two adjacent elements to be classified corresponding to the interval duration from the initial element sequence;

and obtaining the element sequence to be classified until all the interval durations are traversed.

Optionally, the clustering algorithm comprises a K-means clustering algorithm,

the step of clustering the vector sequence to be classified by using a preset clustering algorithm to divide a plurality of word vectors into a plurality of vector sets belonging to different categories comprises the following steps:

based on a K-means clustering algorithm, taking word vectors of a preset number in the vector sequence to be classified as an initial clustering center;

distributing the word vectors except the initial clustering center in the vector sequence to be classified to one initial clustering center one by one according to a shortest distance principle until the distribution is finished to generate a preset number of initial vector sets;

calculating a mean vector of each initial vector set, and taking each mean vector as each new clustering center to continue the distribution process;

and when detecting that each new clustering center meets a preset termination condition, taking each new clustering center as each target clustering center, and taking each target clustering center and the corresponding distributed word vector as a vector set.

Optionally, after the step of clustering the vector sequence to be classified by using a preset clustering algorithm to divide the word vectors into a plurality of vector sets belonging to different categories, the method further includes:

dividing each vector set into a first vector subset with clear category meaning and a second vector subset with unclear category meaning;

and acquiring the category meaning of each first vector subset, and outputting the category meaning as the category shared by the first vector subset and the second vector subset in the same vector set.

Optionally, the vectorizing each element to be classified in the element sequence to be classified to convert the element sequence to be classified into a vector sequence to be classified composed of a plurality of word vectors includes:

and taking the element sequence to be classified as input of Word2vec, and converting each element to be classified into a Word vector by using the Word2vec to obtain the vector sequence to be classified.

In addition, to achieve the above object, the present invention further provides a text clustering device based on non-text scenes, wherein the text clustering device based on non-text scenes comprises:

the element sequence acquisition module is used for acquiring information to be classified in a non-text scene and serializing the information to be classified according to the internal logic relationship of the information to be classified to obtain an element sequence to be classified consisting of a plurality of elements to be classified;

the vector sequence conversion module is used for vectorizing each element to be classified in the element sequence to be classified so as to convert the element sequence to be classified into a vector sequence to be classified consisting of a plurality of word vectors;

and the vector category division module is used for clustering the vector sequences to be classified by utilizing a preset clustering algorithm so as to divide a plurality of word vectors into a plurality of vector sets belonging to different categories.

Optionally, the element sequence obtaining module includes:

the instruction information acquisition unit is used for receiving a classification instruction and acquiring a plurality of words to be classified and time information corresponding to each word to be classified under a non-text scene based on the classification instruction to serve as the information to be classified;

and the time sequence ordering unit is used for ordering the words to be classified according to the time sequence determined based on the time information and taking each ordered word to be classified as each element to be classified so as to form the element sequence to be classified.

Optionally, the chronological ordering unit is further configured to:

Optionally, the clustering algorithm comprises a K-means clustering algorithm,

the vector class classification module comprises:

the initial center determining unit is used for taking word vectors of a preset number in the vector sequence to be classified as an initial clustering center based on a K-means clustering algorithm;

the initial set generating unit is used for allocating word vectors except the initial clustering centers in the vector sequence to be classified to one initial clustering center one by one according to a shortest distance principle until allocation is finished so as to generate a preset number of initial vector sets;

a clustering center iteration unit, configured to calculate a mean vector of each initial vector set, and use each mean vector as each new clustering center to continue the allocation process;

and the vector set determining unit is used for taking each new clustering center as each target clustering center until each new clustering center is detected to meet a preset termination condition, and taking each target clustering center and the corresponding distributed word vector as a vector set.

Optionally, the non-text scene-based text clustering apparatus further includes:

the vector set dividing module is used for dividing each vector set into a first vector subset with definite category meaning and a second vector subset with unclear category meaning;

and the category output module is used for acquiring the category meaning of each first vector subset, and outputting the category meaning as the category shared by the first vector subset and the second vector subset in the same vector set.

Optionally, the vector sequence conversion module includes:

and the vector sequence conversion unit is used for taking the element sequence to be classified as input of Word2vec, and converting each element to be classified into a Word vector by using the Word2vec to obtain the vector sequence to be classified.

In addition, to achieve the above object, the present invention further provides a text clustering device based on a non-text scene, including: a memory, a processor and a non-text scene based text clustering program stored on the memory and operable on the processor, the non-text scene based text clustering program when executed by the processor implementing the steps of the non-text scene based text clustering method as described above.

In addition, to achieve the above object, the present invention further provides a computer readable storage medium, wherein a non-text scene based text clustering program is stored on the computer readable storage medium, and when being executed by a processor, the non-text scene based text clustering program implements the steps of the non-text scene based text clustering method as described above.

The invention provides a text clustering method, a text clustering device, text clustering equipment and a computer readable storage medium based on a non-text scene. According to the information classification method, the internal logical relationship of the information to be classified is extracted firstly in the non-text scene, and then the information to be classified is serialized according to the logical relationship, so that the information to be classified in the non-text scene can be converted into a serial form, and the information classification method has the structural relationship of context in the text and is convenient for the subsequent processing process; the serialized information to be classified is subjected to vectorization and clustering operation, and the category of the information to be classified under the non-text scene is finally obtained, so that the text clustering idea can be applied to the non-text scene, the limitation of the application range of the existing text clustering method is broken, and the technical problem that the application range of the existing text clustering method is limited is solved.

Drawings

FIG. 1 is a schematic diagram of an apparatus architecture of a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a first embodiment of a non-text scene-based text clustering method according to the present invention;

FIG. 3 is a flowchart illustrating a first embodiment of a non-text scene-based text clustering method according to the present invention;

FIG. 4 is a schematic diagram of functional modules of the non-text scene-based text clustering device according to the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.

As shown in fig. 1, the non-text scene-based text clustering apparatus may include a processor 1001 (such as a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 enables connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the configuration of the apparatus shown in fig. 1 is not intended to be limiting of the apparatus and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.

As shown in fig. 1, a memory 1005, which is a type of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a non-text scene based text clustering program.

In the device shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a client (programmer's end) and performing data communication with the client; and the processor 1001 may be configured to call the non-text scene-based text clustering program stored in the memory 1005, and perform the following operations in the non-text scene-based text clustering method:

Further, the context logic relationship includes a temporal order,

Further, the step of using each sorted word to be classified as each element to be classified to form the element sequence to be classified includes:

Further, the step of determining an invalid element to be classified in the initial element sequence based on the interval duration threshold and the interval duration, and deleting the invalid element to be classified from the initial element sequence to form the element sequence to be classified includes:

Further, the clustering algorithm comprises a K-means clustering algorithm,

Further, after the step of clustering the vector sequence to be classified by using a preset clustering algorithm to divide a plurality of word vectors into a plurality of vector sets belonging to different categories, the processor 1001 may be configured to invoke a text clustering program based on a non-text scene stored in the memory 1005, and perform the following operations in the text clustering method based on a non-text scene:

Further, the step of vectorizing each element to be classified in the element sequence to be classified to convert the element sequence to be classified into a vector sequence to be classified composed of a plurality of word vectors includes:

Based on the hardware structure, the embodiment of the text clustering method based on the non-text scene is provided.

In order to solve the problems, the invention provides a text clustering method based on a non-text scene, namely, in the non-text scene, the internal logical relationship of the information to be classified is extracted firstly, and then the information to be classified in the non-text scene is serialized according to the logical relationship, so that the information to be classified in the non-text scene can be converted into a serial form, and the information has the structural relationship of the context in the text, thereby facilitating the subsequent processing process; the serialized information to be classified is subjected to vectorization and clustering operation, and the category of the information to be classified under the non-text scene is finally obtained, so that the text clustering idea can be applied to the non-text scene, the limitation of the application range of the existing text clustering method is broken, and the technical problem that the application range of the existing text clustering method is limited is solved.

Referring to fig. 2, fig. 2 is a schematic flow chart of a text clustering method based on non-text scenes according to a first embodiment of the present invention. The text clustering method based on the non-text scene comprises the following steps:

step S10, obtaining information to be classified in a non-text scene, and serializing the information to be classified according to the internal logic relationship of the information to be classified to obtain an element sequence to be classified consisting of a plurality of elements to be classified;

In this embodiment, the method is applied to a terminal device. A non-text scene is an actual scene other than text, such as the different places that appear in different periods of a day, or the APPs downloaded by a user at different points in time. The information to be classified is the information that needs to be classified in the non-text scene, and it may take various forms. For example, if the information to be classified is location information, it may take the form of text, longitude and latitude, and so on. The internal logical relationship is usually a chronological order, although other orders may be used in practice. An element to be classified is a sequence element obtained by serializing the information to be classified, and the element sequence to be classified is generally composed of a plurality of elements to be classified.

When the terminal acquires, in a non-text scene, a series of information that needs to be classified in the current clustering task, it first extracts the logical relationship inherent in that information and then serializes the information according to this internal logical relationship, so that the information has the contextual structure of a text. The information is thereby converted into a plurality of elements to be classified that form the element sequence to be classified, and each element in the sequence can be regarded as a word in a text.

Step S20, vectorizing each element to be classified in the element sequence to be classified to convert the element sequence to be classified into a vector sequence to be classified consisting of a plurality of word vectors;

In this embodiment, a word vector is the vector obtained by vectorizing an element to be classified. Common ways of vectorizing sequence data include Word2vec, one-hot encoding, label encoding, and the like, with Word2vec preferred. Word2vec is a simple and efficient toolkit for obtaining word vectors, released by Google in 2013, and comprises two models, CBOW and Skip-Gram. One-hot encoding, also known as one-bit-effective encoding, uses an N-bit state register to encode N states; each state has its own register bit, and only one bit is active at any time.
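As a minimal sketch of the one-hot encoding mentioned above, the following Python snippet gives each distinct element its own bit position. The function name one_hot_encode and the sample place names are illustrative assumptions and do not come from the original disclosure.

```python
def one_hot_encode(elements):
    """Map each distinct element to an N-bit vector with exactly one active bit."""
    # Keep first-seen order so that bit positions are deterministic.
    vocabulary = list(dict.fromkeys(elements))
    index = {element: i for i, element in enumerate(vocabulary)}
    n = len(vocabulary)
    return {
        element: [1 if i == index[element] else 0 for i in range(n)]
        for element in vocabulary
    }


# Hypothetical sequence of elements to be classified (place names).
codes = one_hot_encode(["home", "office", "cinema", "home"])
```

With N distinct elements every code has length N, so such vectors carry no notion of similarity between elements, which may be one reason the text prefers Word2vec.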

The terminal vectorizes each element to be classified in the element sequence to be classified in a sequence data vectorization mode, so that each element to be classified is converted into a word vector, and the original element sequence to be classified is correspondingly converted into a vector sequence to be classified.
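To illustrate why a serialized form matters for Word2vec, the sketch below generates the (center, context) pairs that a Skip-Gram model would typically learn from. It shows only the pair generation, not the training itself; the function name skip_gram_pairs, the window parameter, and the sample sequence are assumptions made for illustration.

```python
def skip_gram_pairs(sequence, window=1):
    """Return (center, context) pairs for elements at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sequence):
        start = max(0, i - window)
        stop = min(len(sequence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs


# Hypothetical element sequence to be classified, already ordered by time.
pairs = skip_gram_pairs(["home", "office", "gym"], window=1)
```

Elements that are adjacent in the sequence become training pairs, which is how the order produced in step S10 plays the role of context in a text.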

And step S30, clustering the vector sequences to be classified by using a preset clustering algorithm so as to divide a plurality of word vectors into a plurality of vector sets belonging to different categories.

In this embodiment, the clustering algorithm may be a K-means clustering algorithm, a mean shift clustering algorithm, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), or the like, and is preferably the K-means clustering algorithm. K-means is an iterative cluster analysis algorithm: the data is to be divided into K groups, K objects are randomly selected as initial clustering centers, the distance between each object and each clustering center is calculated, and each object is assigned to its nearest clustering center. A clustering center together with the objects assigned to it represents a cluster. Each time a sample is assigned, the center of the cluster is recalculated from the objects currently in the cluster. This process is repeated until a termination condition is met. The mean shift clustering algorithm first assumes that each cluster in the sample space obeys a known probability distribution, then fits the statistical histogram of the samples with different probability density functions, and keeps moving the center (mean) of each density function until the best fit is obtained. The peaks of the probability density functions are the cluster centers, and each sample is assigned the category of its nearest cluster center according to its distance from each center. DBSCAN is a density-based spatial clustering algorithm. It divides regions of sufficient density into clusters, can find clusters of arbitrary shape in a noisy spatial database, and defines a cluster as a maximal set of density-connected points.

The terminal adopts a certain clustering algorithm to cluster the current vector sequence to be classified, for example, the current vector sequence to be classified comprises 10 word vectors, the vector sequence to be classified is divided into three vector sets of different categories, the first vector set comprises 3 word vectors, the second vector set comprises 2 word vectors, the third vector set comprises 5 word vectors, and then the 3 word vectors belong to the first category, the 2 word vectors belong to the second category, and the 5 word vectors belong to the third category.

A specific example is shown in fig. 3. Taking the different places at which a client appears over time (for example, in different periods of a day) as an example, the idea of text clustering can be used to cluster the places effectively (the places may be in any form, such as text or longitude and latitude). After the terminal preprocesses a series of location information, it sorts the places by their corresponding times to form a sequence. The terminal then vectorizes the sorted places with the Word2vec tool to obtain a series of place vectors and clusters them with the K-means clustering algorithm. Places in the same cluster are regarded as belonging to the same category, and the specific category of a cluster is determined from the places in it that have a definite category meaning and is then output, so that the categories of all the places can be identified quickly.

The invention provides a text clustering method based on a non-text scene. The text clustering method based on the non-text scene obtains information to be classified under the non-text scene, and serializes the information to be classified according to the internal logic relation of the information to be classified to obtain an element sequence to be classified consisting of a plurality of elements to be classified; vectorizing each element to be classified in the element sequence to be classified so as to convert the element sequence to be classified into a vector sequence to be classified consisting of a plurality of word vectors; and clustering the vector sequences to be classified by utilizing a preset clustering algorithm so as to divide a plurality of word vectors into a plurality of vector sets belonging to different categories. According to the information classification method, the internal logical relationship of the information to be classified is extracted firstly in the non-text scene, and then the information to be classified is serialized according to the logical relationship, so that the information to be classified in the non-text scene can be converted into a serial form, and the information classification method has the structural relationship of context in the text and is convenient for the subsequent processing process; the serialized information to be classified is subjected to vectorization and clustering operation, and the category of the information to be classified under the non-text scene is finally obtained, so that the text clustering idea can be applied to the non-text scene, the limitation of the application range of the existing text clustering method is broken, and the technical problem that the application range of the existing text clustering method is limited is solved.

Further, based on the first embodiment shown in fig. 2, a second embodiment of the non-text scene-based text clustering method according to the present invention is provided. In this embodiment, the internal logical relationship includes a time sequence, and step S10 includes:

In this embodiment, the information to be classified includes a plurality of words to be classified and the time information corresponding to each word to be classified. When the terminal receives the classification instruction, it can directly acquire, according to the instruction, the different places at which a specified client appeared within a specified period together with the time information of each place, or the APPs downloaded by the client within the specified period together with the download time of each APP. The terminal regards each piece of location information or APP information as a word, that is, a word to be classified, and regards the location or APP information together with the corresponding time information as the information to be classified. The terminal then sorts all the current words in the time order indicated by their time information to obtain the element sequence to be classified.
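The time-based serialization described above can be sketched as follows. This is a minimal illustration that assumes each record is a (word, timestamp) pair; the function name build_initial_sequence and the sample records are hypothetical.

```python
from datetime import datetime


def build_initial_sequence(records):
    """Sort (word, timestamp) records by time and return the ordered records."""
    return sorted(records, key=lambda record: record[1])


# Hypothetical information to be classified: place names with visit times.
records = [
    ("office", datetime(2021, 2, 20, 9, 0)),
    ("home", datetime(2021, 2, 20, 7, 30)),
    ("cinema", datetime(2021, 2, 20, 20, 0)),
]
ordered = build_initial_sequence(records)
element_sequence = [word for word, _ in ordered]
```

The timestamps are kept alongside the words in the ordered list so that a later step can still inspect the gaps between adjacent elements.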

In this embodiment, the validity of the logical relationship between each two adjacent elements in the sequence needs to be considered. If the time interval between two adjacent elements is too long, the former element is difficult to be used as a reference for influencing the latter element. This time interval needs to be determined in combination with the current non-text scene and the specific data representation.

The terminal takes the sequence formed by sorting the words to be classified as the initial element sequence and then determines an interval duration threshold by combining the current actual scene with the data in the sequence. The terminal obtains the interval duration between every two adjacent elements in the current sequence, determines the invalid elements in the sequence from the interval duration threshold and these interval durations, removes them, and takes the sequence that remains after all invalid elements are removed as the element sequence to be classified.

In this embodiment, specifically, the terminal determines one by one whether the time interval between each two adjacent elements exceeds the interval duration threshold. If it does, the logical relationship between the two adjacent elements has probably weakened considerably and no longer resembles the contextual relationship in a text, so the former of the two elements should be removed from the current sequence. If it does not, the terminal goes on to compare the next pair of elements, until the time intervals between all adjacent elements have been traversed and the final element sequence to be classified is obtained.

Further, the embodiment screens out invalid elements in the sequence, which are excessively attenuated in the logical relationship with the element to be classified later, by determining the interval duration threshold value by combining the actual non-text scene and the actual data representation, thereby ensuring the validity of the finally obtained element sequence to be classified, and further ensuring the accuracy of the finally obtained clustering result.
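The screening of invalid elements can be sketched as below. The sketch assumes that interval durations are measured between adjacent elements of the initial sequence and that, whenever a duration exceeds the threshold, the earlier element of the pair is dropped. The function name drop_invalid_elements, the three-hour threshold, and the sample data are illustrative assumptions.

```python
from datetime import datetime, timedelta


def drop_invalid_elements(ordered, threshold):
    """Drop the earlier element of every adjacent pair whose gap exceeds `threshold`."""
    invalid = set()
    for i in range(len(ordered) - 1):
        gap = ordered[i + 1][1] - ordered[i][1]
        if gap > threshold:
            invalid.add(i)  # the earlier element of the pair is invalid
    return [word for i, (word, _) in enumerate(ordered) if i not in invalid]


# Hypothetical initial element sequence, already sorted by time.
ordered = [
    ("home", datetime(2021, 2, 20, 7, 30)),
    ("office", datetime(2021, 2, 20, 9, 0)),
    ("cinema", datetime(2021, 2, 20, 20, 0)),
    ("bar", datetime(2021, 2, 20, 20, 30)),
]
element_sequence = drop_invalid_elements(ordered, timedelta(hours=3))
```

Here only the gap between "office" and "cinema" exceeds the threshold, so "office" is removed. A suitable threshold depends on the scene and the data, as the text notes.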

Further, based on the first embodiment shown in fig. 2, a third embodiment of the non-text scene-based text clustering method according to the present invention is provided. In this embodiment, the clustering algorithm includes a K-means clustering algorithm, and step S30 includes:

In this embodiment, according to the K-means clustering algorithm, the terminal selects K word vectors (K is a preset value) from the current sequence as initial clustering centers, and then assigns each of the remaining word vectors to the initial clustering center closest to it, so as to obtain K initial vector sets. The terminal then calculates the mean vector of each initial vector set, takes it as the new clustering center for the next iteration, and repeats the assignment process. When the new clustering centers in some iteration are detected to meet a preset termination condition, for example that the distance between each new clustering center and the clustering center it replaces is smaller than a preset threshold, the iteration ends. The new clustering centers at that moment are used as the final target clustering centers, and each target clustering center together with the word vectors assigned to it forms a vector set belonging to the same category.
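The iteration described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the first K vectors are used as initial clustering centers (the embodiment only says that K word vectors are selected), the termination condition is that no center moves by more than a hypothetical tolerance tol between iterations, and max_iter is an added safety limit.

```python
import math


def kmeans(vectors, k, tol=1e-6, max_iter=100):
    """Cluster vectors into k sets; return (labels, centers).

    Assumptions: the first k vectors are the initial centers, and the
    iteration stops once every center moves less than tol.
    """
    centers = [list(v) for v in vectors[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centers[c]))
            for v in vectors
        ]
        # Update step: each new center is the mean vector of its set.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                new_centers.append(mean)
            else:
                # An empty set keeps its previous center.
                new_centers.append(centers[c])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return labels, centers


# Hypothetical two-dimensional "word vectors" forming two obvious groups.
vectors = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels, centers = kmeans(vectors, k=2)
print(labels)
```

Vectors that share a label form one vector set; the label numbers themselves carry no category meaning.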

Further, after step S30, the method further includes:

In this embodiment, the first vector subset represents a set containing all word vectors with definite category meanings in one vector set; the second vector subset represents a set containing all word vectors with indefinite category meanings in one vector set. After clustering, the original elements to be classified are divided into a plurality of sets according to category, and words with definite category meanings and words with indefinite category meanings may exist in each set at the same time, so that the categories of the words with indefinite category meanings can be quickly judged through the words with definite category meanings. For example, suppose the words corresponding to an existing vector set include "KTV", "movie theatre" and "AAA". Since the first two are words with definite category meanings and their category is obviously an entertainment place, the category of "AAA", which belongs to the same set, should also be an entertainment place. The entertainment place can therefore be used as the category of the vector set and displayed, and the same applies to the other clusters.
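The category inference for one vector set can be sketched as below. The function name label_vector_set and the dictionary of known categories are hypothetical, and the majority vote among the words with definite category meanings is an assumption: the embodiment does not say how to proceed when those words disagree.

```python
from collections import Counter


def label_vector_set(words, known_categories):
    """Infer the category of one clustered set of words.

    words:            words whose vectors fell into the same vector set.
    known_categories: mapping from words with definite category meanings
                      to their category (hypothetical input).
    Returns (category, inferred), where inferred maps each word with an
    indefinite category meaning to the category chosen for the set.
    Assumption: the most common known category wins.
    """
    known = [known_categories[w] for w in words if w in known_categories]
    if not known:
        # No word with a definite category meaning: nothing to infer from.
        return None, {}
    category, _ = Counter(known).most_common(1)[0]
    inferred = {w: category for w in words if w not in known_categories}
    return category, inferred


known = {"KTV": "entertainment place", "movie theatre": "entertainment place"}
category, inferred = label_vector_set(["KTV", "movie theatre", "AAA"], known)
print(category, inferred)
```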

Further, step S20 includes:

In this embodiment, Word2vec is a toolkit for obtaining word vectors released by Google in 2013, and is a group of related models for generating word vectors. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. The network takes a word as input and guesses the words in adjacent positions; under the bag-of-words assumption in word2vec, the order of those words is unimportant. After training is completed, the word2vec model can be used to map each word to a vector that represents word-to-word relationships; this vector is taken from the hidden layer of the neural network. The specific operation process is a common means in the art and is not described herein.
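To make "guessing the words in adjacent positions" concrete, the following sketch only builds the (input word, adjacent word) training pairs that a skip-gram style model would learn from; it does not train a network and is not the word2vec toolkit itself. The function name skipgram_pairs and the window size are hypothetical.

```python
def skipgram_pairs(sequence, window=2):
    """Build (input word, adjacent word) pairs from an element sequence.

    Each element is paired with every other element at most `window`
    positions away. Within the window, relative order is not recorded,
    which mirrors the bag-of-words assumption.
    """
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs


# Hypothetical serialized element sequence from a non-text scene.
sequence = ["KTV", "movie theatre", "AAA"]
for pair in skipgram_pairs(sequence, window=1):
    print(pair)
```

Elements that repeatedly share neighbours produce similar pairs, which is why their trained vectors end up close together and can then be clustered.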

Further, in this embodiment, the category of the words with indefinite category meanings is predicted from the category of the words with definite category meanings in the same cluster, and is output as the common category of the cluster, so that the category meanings of all the words with indefinite category meanings in the current classification task can be determined quickly and accurately.

As shown in fig. 4, the present invention further provides a non-text scene-based text clustering apparatus, where the non-text scene-based text clustering apparatus includes:

the element sequence acquiring module 10 is configured to acquire information to be classified in a non-text scene, and serialize the information to be classified according to an internal logical relationship of the information to be classified to obtain an element sequence to be classified, which is composed of a plurality of elements to be classified;

a vector sequence conversion module 20, configured to perform vectorization on each element to be classified in the element sequence to be classified, so as to convert the element sequence to be classified into a vector sequence to be classified, where the vector sequence to be classified is composed of a plurality of word vectors;

the vector category dividing module 30 is configured to cluster the vector sequences to be classified by using a preset clustering algorithm, so as to divide the word vectors into a plurality of vector sets belonging to different categories.

The method executed by each program module can refer to each embodiment of the non-text scene-based text clustering method of the present invention, and is not described herein again.

The invention also provides text clustering equipment based on the non-text scene.

The non-text scene based text clustering device comprises a processor, a memory and a non-text scene based text clustering program which is stored on the memory and can run on the processor, wherein when the non-text scene based text clustering program is executed by the processor, the steps of the non-text scene based text clustering method are realized.

For the method implemented when the non-text scene-based text clustering program is executed, reference can be made to each embodiment of the non-text scene-based text clustering method of the present invention, which is not described herein again.

The invention also provides a computer readable storage medium.

The computer-readable storage medium of the present invention stores thereon a non-text scene-based text clustering program, which when executed by a processor implements the steps of the non-text scene-based text clustering method as described above.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A text clustering method based on non-text scenes is characterized by comprising the following steps:

2. The non-text scene-based text clustering method according to claim 1, wherein the context logic relationship includes a temporal order,

3. The method for clustering texts based on non-text scenes according to claim 2, wherein the step of using each ordered word to be classified as each element to be classified to form the sequence of elements to be classified comprises:

4. The method for clustering text based on non-text scenes according to claim 3, wherein the step of determining the invalid elements to be classified in the initial element sequence based on the interval duration threshold and the interval duration, and deleting the invalid elements to be classified from the initial element sequence to form the element sequence to be classified comprises:

5. The non-text scene-based text clustering method according to claim 1, wherein the clustering algorithm comprises a K-means clustering algorithm,

6. The method for clustering texts based on non-text scenes according to claim 1, wherein after the step of clustering the sequence of vectors to be classified by using a preset clustering algorithm to divide a plurality of word vectors into a plurality of vector sets belonging to different categories, the method further comprises:

7. The non-text scene-based text clustering method according to any one of claims 1 to 6, wherein the step of vectorizing each element to be classified in the sequence of elements to be classified to convert the sequence of elements to be classified into a sequence of vectors to be classified consisting of a plurality of word vectors comprises:

8. A non-text scene-based text clustering device is characterized in that the non-text scene-based text clustering device comprises:

9. A non-text scene based text clustering device, the non-text scene based text clustering device comprising: memory, a processor and a non-text scene based text clustering program stored on the memory and executable on the processor, the non-text scene based text clustering program when executed by the processor implementing the steps of the non-text scene based text clustering method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a non-text scene-based text clustering program, which when executed by a processor implements the steps of the non-text scene-based text clustering method according to any one of claims 1 to 7.