CN116010607A - Text clustering method, device, computer system and storage medium

Publication number: CN116010607A
Application number: CN202310106081.2A
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Pending
Inventors: 王佳 (Wang Jia), 俞晓光 (Yu Xiaoguang), 宋双永 (Song Shuangyong)
Applicant and current assignee: Jingdong Technology Information Technology Co Ltd
Prior art keywords: text, clustering, short text, target, vectors
Classification: Information Retrieval; DB Structures and FS Structures Therefor

Abstract

The disclosure provides a text clustering method, a text clustering device, a computer system and a storage medium, and relates to the technical fields of artificial intelligence and big data. The method includes: performing topic feature extraction on a target short text to obtain a topic feature vector, wherein the number of words in the target short text satisfies a preset condition; performing semantic feature extraction on the target short text to obtain a semantic feature vector; generating a fusion feature vector from the topic feature vector and the semantic feature vector; and clustering the fusion feature vector to obtain a clustering result for the target short text.

Description

Text clustering method, device, computer system and storage medium
Technical Field
The disclosure relates to the technical fields of artificial intelligence and big data, in particular to the field of natural language processing, and more specifically to a text clustering method, a text clustering device, a computer system and a storage medium.
Background
With the rapid development of the Internet, the penetration of intelligent terminals rises year by year and the traffic of all kinds of websites grows exponentially, generating massive amounts of text data. It is therefore very necessary to mine topic information and text structure information from this text data, so that the data can be analyzed and the information of interest in many scenarios can be obtained.
Text clustering is one of the processing tasks in the field of natural language processing and is used in a large number of natural language processing applications. It groups the texts of a text set by their implicit semantic structure in an unsupervised learning manner.
In the process of implementing the disclosed concept, the inventors found at least the following problem in the related art: long texts are easy to cluster, because a long text contains many words and thus offers many features per text, which facilitates clustering. A short text, however, contains few words, so few text features can be extracted, and the clustering task is a great challenge.
Disclosure of Invention
In view of this, the present disclosure provides a text clustering method, apparatus, computer system, storage medium, and program product.
One aspect of the present disclosure provides a text clustering method, including: performing topic feature extraction on a target short text to obtain a topic feature vector, wherein the number of words in the target short text satisfies a preset condition; performing semantic feature extraction on the target short text to obtain a semantic feature vector; generating a fusion feature vector from the topic feature vector and the semantic feature vector; and clustering the fusion feature vector to obtain a clustering result for the target short text.
According to an embodiment of the present disclosure, performing topic feature extraction on the target short text to obtain the topic feature vector includes: performing word segmentation processing on the target short text to obtain a plurality of word segmentation vectors corresponding to the target short text; performing topic feature extraction on the plurality of word segmentation vectors to obtain topic feature vectors of the plurality of word segmentation vectors; and generating the topic feature vector of the target short text from the topic feature vectors of the plurality of word segmentation vectors.
According to an embodiment of the present disclosure, performing topic feature extraction on the plurality of word segmentation vectors to obtain the topic feature vectors of the plurality of word segmentation vectors includes: sampling the plurality of word segmentation vectors to obtain sampling results of the plurality of word segmentation vectors; and determining the topic feature vectors of the plurality of word segmentation vectors in a case where the sampling results satisfy a preset condition.
According to an embodiment of the present disclosure, performing semantic feature extraction on the target short text to obtain the semantic feature vector includes: preprocessing the target short text, and determining a plurality of target text units for the target short text; and performing semantic feature extraction on the target text unit data of the plurality of target text units to obtain the semantic feature vector.
According to an embodiment of the present disclosure, semantic feature extraction is performed on target text unit data of a plurality of target text units to obtain a semantic feature vector, including: extracting position features of the plurality of target text unit data to obtain position feature vectors; extracting word characteristics of the plurality of target text unit data to obtain word characteristic vectors; and generating semantic feature vectors of the target short text according to the position feature vectors and the word feature vectors.
According to an embodiment of the present disclosure, performing semantic feature extraction on the target short text to obtain the semantic feature vector includes: performing semantic feature extraction on the target short text by using a characterization model to obtain the semantic feature vector, wherein the characterization model is obtained by training a pre-training language model with a sample short text.
According to an embodiment of the present disclosure, the characterization model being obtained by training a pre-training language model with a sample short text includes: the characterization model is obtained by adjusting model parameters of the pre-training language model based on a mask prediction loss function value; the mask prediction loss function value is determined based on a prediction result of first sample short text data; and the prediction result of the first sample short text data is obtained by performing mask prediction on the first sample short text data of a first sample short text.
According to an embodiment of the present disclosure, the characterization model being obtained by training a pre-training language model with a sample short text includes: the characterization model is obtained by adjusting model parameters of the pre-training language model based on a clustering loss function value; the clustering loss function value is determined based on sample feature vectors and a clustering result; the clustering result is determined from a plurality of sample clusters of the sample feature vectors; the plurality of sample clusters are obtained by clustering the sample feature vectors; and the sample feature vectors are obtained by performing feature extraction on a second sample short text.
According to an embodiment of the present disclosure, clustering the fusion feature vector to obtain the clustering result of the target short text includes: inputting the fusion feature vector into a clustering model for clustering processing, and determining a plurality of cluster centers; clustering the fusion feature vectors according to the cluster centers and the fusion feature vectors to obtain a plurality of clusters; and obtaining the clustering result of the target short text from the plurality of clusters.
Another aspect of the present disclosure provides a text clustering device, including: a first extraction module for performing topic feature extraction on a target short text to obtain a topic feature vector, wherein the number of words in the target short text satisfies a preset condition; a second extraction module for performing semantic feature extraction on the target short text to obtain a semantic feature vector; a generation module for generating a fusion feature vector from the topic feature vector and the semantic feature vector; and a clustering module for clustering the fusion feature vector to obtain a clustering result of the target short text.
Another aspect of the present disclosure provides a computer system comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods of embodiments of the present disclosure.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions that, when executed, are configured to implement a method of an embodiment of the present disclosure.
Another aspect of the present disclosure provides a computer program product comprising computer executable instructions which, when executed, are for implementing the method of embodiments of the present disclosure.
According to the embodiments of the present disclosure, a topic feature vector is obtained by performing topic feature extraction on the target short text; a semantic feature vector is obtained by performing semantic feature extraction on the target short text; a fusion feature vector is generated from the topic feature vector and the semantic feature vector; and the fusion feature vector is clustered to obtain a clustering result of the target short text, thereby realizing the technical means of clustering short texts. Because the topic feature vector and the semantic feature vector complement each other in representing the semantic information of the target short text, the technical problems of conventional text clustering on short texts, namely sparse representation of semantic-dimension information and low clustering accuracy, are at least partially solved, and the technical effects of fully capturing the semantic information expressed by the short text and improving its clustering result are achieved.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments thereof with reference to the accompanying drawings in which:
FIG. 1 schematically illustrates an exemplary system architecture to which text clustering methods and apparatus may be applied, according to embodiments of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a text clustering method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a schematic diagram of a text clustering method according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a block diagram of a text clustering device according to an embodiment of the disclosure; and
FIG. 5 schematically illustrates a block diagram of an electronic device adapted to implement a text clustering method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression like "at least one of A, B and C, etc." is used, it should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
Text clustering places similar texts into one class cluster and dissimilar texts into different class clusters, thereby realizing effective organization of text information; it is very widely applied. A common text clustering approach is to first vectorize the text and then cluster it with a conventional clustering method, but such text vectorization captures mostly word-dimension representations and little semantic-dimension information. A topic model is a statistical model that clusters the implicit semantic structure of a text set in an unsupervised learning manner and is mainly used for semantic analysis and text mining of text sets.
In the process of implementing the disclosed concept, the inventors found that, at present, the clustering of long texts is easier because a long text contains a larger number of words and each text has more features, while a short text contains fewer words and each text has fewer features, so the clustering effect for short texts is poorer.
In view of this, in order to fully obtain the semantic information of short texts and improve their clustering results, an embodiment of the present disclosure provides a text clustering method. The method includes: performing topic feature extraction on a target short text to obtain a topic feature vector, wherein the number of words in the target short text satisfies a preset condition; performing semantic feature extraction on the target short text to obtain a semantic feature vector; generating a fusion feature vector from the topic feature vector and the semantic feature vector; and clustering the fusion feature vector to obtain a clustering result of the target short text.
Fig. 1 schematically illustrates an exemplary system architecture to which text clustering methods and apparatus may be applied, according to embodiments of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients and/or social platform software, to name a few.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the text clustering method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the text clustering means provided by the embodiments of the present disclosure may be generally provided in the server 105. The text clustering method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the text clustering means provided by the embodiments of the present disclosure may also be provided in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Alternatively, the text clustering method provided by the embodiment of the present disclosure may be performed by the terminal device 101, 102, or 103, or may be performed by another terminal device different from the terminal device 101, 102, or 103. Accordingly, the text clustering device provided by the embodiment of the present disclosure may also be provided in the terminal device 101, 102, or 103, or in another terminal device different from the terminal device 101, 102, or 103.
For example, the target short text may be originally stored in any one of the terminal devices 101, 102, or 103 (for example, but not limited to, the terminal device 101), or stored on an external storage device and imported into the terminal device 101. Then, the terminal device 101 may locally perform the text clustering method provided by the embodiment of the present disclosure, or transmit the target short text to other terminal devices, servers, or server clusters, and perform the text clustering method provided by the embodiment of the present disclosure by the other terminal devices, servers, or server clusters that receive the target short text.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically illustrates a flow chart of a text clustering method according to an embodiment of the present disclosure.
As shown in fig. 2, the method 200 may include operations S210 to S240.
In operation S210, topic feature extraction is performed on the target short text to obtain a topic feature vector, where the number of words in the target short text satisfies a preset condition.
According to embodiments of the present disclosure, the target short text may be the short text on which text clustering is to be performed. Before topic feature extraction, the input target text may be identified to determine that it is a short text, for example by checking whether the number of words contained in the target text sentence satisfies a preset condition. The preset condition may be a word-count threshold: if the number of words in the target text is less than or equal to the threshold, the target text is determined to be a short text. For example, the word-count threshold may be set to 20 to 30 words; it may be determined according to actual needs and is not limited here.
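As a minimal illustration of this word-count check, the Python sketch below treats any text at or below a configurable threshold as a short text; the 30-word default and the whitespace-based counting are assumptions, since the disclosure leaves the preset condition to be set as needed.

```python
def is_short_text(text: str, max_words: int = 30) -> bool:
    """Identify a target text as 'short' by a word-count threshold."""
    # Whitespace splitting suits English; for Chinese, a character count or
    # a segmenter would be used instead (an assumption, not fixed above).
    return len(text.split()) <= max_words

texts = ["cannot log into my account", "the delivery was late again"]
target_short_texts = [t for t in texts if is_short_text(t)]
```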
According to embodiments of the present disclosure, the target short text may include a plurality of pieces of short text data and may be text data or audio data, for example, text data of comments made by users in a social network, or audio data of customer service questions from users' voice consultations.
According to embodiments of the present disclosure, the topic features may characterize the subject matter of the target short text. Topic feature extraction on the target short text may use a topic model to extract the topic feature vector. For example, the Biterm Topic Model (BTM) extracts topic feature vectors from the target short text by extracting word pairs.
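To make the word-pair idea concrete, the sketch below extracts biterms the way BTM does, treating each short text as a single context window; the helper name is illustrative.

```python
from itertools import combinations

def extract_biterms(tokens: list) -> list:
    """Enumerate all unordered word pairs (biterms) in one short text.

    BTM models the corpus-wide co-occurrence of these pairs; treating the
    whole short text as one window is the usual choice for very short texts.
    """
    return list(combinations(tokens, 2))

print(extract_biterms(["refund", "not", "received"]))
# [('refund', 'not'), ('refund', 'received'), ('not', 'received')]
```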
In operation S220, semantic feature extraction is performed on the target short text, and a semantic feature vector is obtained.
According to embodiments of the present disclosure, semantic features may characterize sentence information of a target short text. Semantic feature extraction of the target short text can be performed by utilizing a pre-training model to extract semantic feature vectors of the target short text. The pre-trained model is a model that converts natural language into vectors.
According to the embodiment of the disclosure, in the process of converting the target short text into the semantic feature vector, the pre-training model can, when processing one word of the target short text, take into account the information of the words before and after it, so that contextual semantics are obtained. The pre-training model may be a BERT (Bidirectional Encoder Representations from Transformers), ERNIE (Enhanced Language Representation with Informative Entities) or RoBERTa (Robustly optimized BERT approach) model.
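A minimal sketch of such extraction with the Hugging Face transformers library is shown below; the checkpoint name and the mean-pooling step are assumptions, as the disclosure does not fix either.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = AutoModel.from_pretrained("bert-base-chinese")

def semantic_vector(text: str) -> torch.Tensor:
    """Encode one short text into a single semantic feature vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=32)
    with torch.no_grad():
        outputs = model(**inputs)  # contextualized vectors, one per token
    # Mean-pool token vectors into one sentence vector (pooling is assumed).
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vec = semantic_vector("无法登录账户")  # 768-dimensional for BERT-base
```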
In operation S230, a fusion feature vector is generated from the subject feature vector and the semantic feature vector.
According to embodiments of the present disclosure, the fusion feature vector may characterize the topic information and the semantic information of the target short text. The fusion feature vector may be a feature vector obtained by feature-fusing the topic feature vector and the semantic feature vector.
According to embodiments of the present disclosure, one feature vector, namely the fusion feature vector, may be obtained by concatenating the topic feature vector and the semantic feature vector, for example by joining them through a separator. Note that the present disclosure does not limit the method for generating the fusion feature vector.
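Concatenation, the simplest realization of this step, might look like the following sketch; the dimensionalities are illustrative.

```python
import numpy as np

def fuse(topic_vec: np.ndarray, semantic_vec: np.ndarray) -> np.ndarray:
    """Concatenate topic and semantic features into one fusion vector."""
    # Weighting or projecting either part first would equally fit the text.
    return np.concatenate([topic_vec, semantic_vec])

fused = fuse(np.random.rand(50), np.random.rand(768))  # e.g. 50 topics + 768-d BERT
assert fused.shape == (818,)
```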
In operation S240, the fusion feature vectors are clustered to obtain a clustering result of the target short text.
According to the embodiment of the disclosure, the fusion feature vectors can be clustered by a clustering algorithm to obtain the clustering result of the target short text. Specifically, a clustering model can be constructed by setting a neighborhood distance and the value range and step size of a neighborhood density threshold, and cluster analysis can be performed on the fusion feature vectors representing different topic and semantic information based on the constructed clustering model to obtain a plurality of clusters, where each cluster represents a clustering result of the target short text and the fusion feature vectors contained in one cluster are of the same or similar types.
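The neighborhood-distance and density-threshold wording suggests a DBSCAN-style model; the scikit-learn sketch below is one plausible reading, with the parameter range and step chosen purely for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

vectors = np.random.rand(100, 818)  # fusion feature vectors (placeholder data)

# Sweep the neighborhood density threshold over a value range with a fixed
# step, keeping the neighborhood distance (eps) constant.
for min_samples in range(3, 11, 2):
    labels = DBSCAN(eps=0.5, min_samples=min_samples).fit_predict(vectors)
    # labels gives each vector a cluster id; -1 marks noise outside clusters
```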
According to the embodiments of the present disclosure, a topic feature vector is obtained by performing topic feature extraction on the target short text; a semantic feature vector is obtained by performing semantic feature extraction on the target short text; a fusion feature vector is generated from the topic feature vector and the semantic feature vector; and the fusion feature vector is clustered to obtain a clustering result of the target short text, thereby realizing the technical means of clustering short texts. Because the topic feature vector and the semantic feature vector complement each other in representing the semantic information of the target short text, the technical problems of conventional text clustering on short texts, namely sparse representation of semantic-dimension information and low clustering accuracy, are at least partially solved, and the technical effects of fully capturing the semantic information expressed by the short text and improving its clustering result are achieved.
According to an embodiment of the present disclosure, performing topic feature extraction on the target short text to obtain the topic feature vector may include: performing word segmentation processing on the target short text to obtain a plurality of word segmentation vectors corresponding to the target short text; performing topic feature extraction on the plurality of word segmentation vectors to obtain topic feature vectors of the plurality of word segmentation vectors; and generating the topic feature vector of the target short text from the topic feature vectors of the plurality of word segmentation vectors.
According to the embodiment of the disclosure, topic feature extraction can be performed on the target short text by a topic model such as Latent Dirichlet Allocation (LDA) to obtain the topic feature vector. The target short text is document or sentence information, and its sentence information needs to undergo word segmentation. Word segmentation analyzes the words contained in each sentence of the target short text and finally outputs all the words composing the sentence.
According to the embodiment of the disclosure, the target short text can be preprocessed by removing stop words, removing low-frequency words, and the like, so as to remove redundant information from the short text and make the target short text concise, tidy and convenient to compute.
According to the embodiment of the disclosure, a word segmenter can be used to segment the target short text, so as to obtain the plurality of word segmentation vectors corresponding to the target short text.
According to an embodiment of the present disclosure, feature extraction on each word segmentation vector may include: performing topic feature calculation on each word segmentation vector to obtain a topic feature vector for each word segmentation vector, where the topic feature vector can characterize the topic of the corresponding word segmentation vector.
According to an embodiment of the present disclosure, generating the topic feature vector of the target short text from the topic feature vectors of the plurality of word segmentation vectors may include: counting the topics characterized by the topic feature vector of each word segmentation vector to obtain the topic feature vector of the target short text, namely, the topic distribution of the target short text.
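A compact sketch of this segmentation-then-topic-distribution flow, using the jieba segmenter and the gensim LDA implementation, is given below; both library choices and the topic count are assumptions.

```python
import jieba                      # Chinese word segmenter (assumed choice)
from gensim import corpora
from gensim.models import LdaModel

docs = ["账户无法登录", "快递迟迟未到", "登录密码错误"]  # placeholder short texts
tokenized = [jieba.lcut(d) for d in docs]               # word segmentation

dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(t) for t in tokenized]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=10)

# The per-document topic distribution serves as the topic feature vector.
topic_vector = [p for _, p in lda.get_document_topics(corpus[0],
                                                      minimum_probability=0.0)]
```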
According to an embodiment of the present disclosure, performing topic feature extraction on the plurality of word segmentation vectors to obtain the topic feature vectors of the plurality of word segmentation vectors may include: sampling the plurality of word segmentation vectors to obtain sampling results of the plurality of word segmentation vectors; and determining the topic feature vectors of the plurality of word segmentation vectors in a case where the sampling results satisfy a preset condition.
According to an embodiment of the present disclosure, before sampling a plurality of word segmentation vectors, each word segmentation vector may be randomly assigned an initial topic number, each topic number representing a topic feature vector, and each topic feature vector representing a topic.
According to an embodiment of the present disclosure, extracting the topic feature for each word segmentation vector may include: performing a Gibbs sampling calculation on each word segmentation vector to obtain a sampling calculation result; updating the initial topic number of each word segmentation vector according to the sampling calculation result; performing the Gibbs sampling calculation again on the word segmentation vectors with updated topic numbers to update their topic numbers; and repeating this Gibbs-sampling-based operation until the Gibbs sampling calculation result satisfies the preset condition, whereupon the topic number that finally satisfies the preset condition is taken as the topic of the corresponding word segmentation vector.
According to an embodiment of the present disclosure, the sampling result satisfying the preset condition may include: the number of sampling calculations reaching a preset number, or the sampling calculation value reaching a convergence condition.
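The loop described above corresponds to collapsed Gibbs sampling for LDA. The toy sketch below implements it over word-id documents, stopping after a preset number of iterations (one of the two end conditions just named); the hyperparameters are illustrative.

```python
import numpy as np

def gibbs_lda(docs, num_topics, vocab_size, iters=200, alpha=0.1, beta=0.01):
    """Collapsed Gibbs sampling: random initial topics, repeated resampling."""
    rng = np.random.default_rng(0)
    n_dk = np.zeros((len(docs), num_topics))   # document-topic counts
    n_kw = np.zeros((num_topics, vocab_size))  # topic-word counts
    n_k = np.zeros(num_topics)                 # words per topic
    z = [rng.integers(num_topics, size=len(d)) for d in docs]  # random init
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            n_dk[d, z[d][i]] += 1; n_kw[z[d][i], w] += 1; n_k[z[d][i]] += 1
    for _ in range(iters):                     # preset number of iterations
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                    # withdraw current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(num_topics, p=p / p.sum())   # resample topic
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_dk / n_dk.sum(axis=1, keepdims=True)  # per-doc topic distribution

theta = gibbs_lda([[0, 1, 2], [2, 3, 3]], num_topics=2, vocab_size=4)
```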
According to an embodiment of the present disclosure, performing semantic feature extraction on the target short text to obtain the semantic feature vector may include: preprocessing the target short text, and determining a plurality of target text units for the target short text; and performing semantic feature extraction on the target text unit data of the plurality of target text units to obtain the semantic feature vector.
According to an embodiment of the present disclosure, preprocessing the target short text may include: performing word segmentation processing on the obtained target short text by using the pre-training model, and segmenting the target short text into a plurality of target text units. A target text unit is the minimum unit processed by the pre-training model. For example, for a target short text in Chinese, the target text unit may be a character; for a target short text in English, the target text unit may be a word.
According to embodiments of the present disclosure, text is unstructured data information that cannot be computed on directly, but it can be expressed by a low-dimensional vector. Performing semantic feature extraction on the target text unit data of the plurality of target text units may include: converting the unstructured information of the target text unit data of the plurality of text units into structured low-dimensional vectors by using the pre-training model, and performing semantic feature extraction on the low-dimensional vector data to obtain the semantic feature vector.
According to an embodiment of the present disclosure, extracting semantic features from target text unit data of a plurality of target text units to obtain semantic feature vectors may include: extracting position features of the plurality of target text unit data to obtain position feature vectors; extracting word characteristics of the plurality of target text unit data to obtain word characteristic vectors; and generating semantic feature vectors of the target short text according to the position feature vectors and the word feature vectors.
According to the embodiment of the disclosure, a position encoder can be used to perform position feature extraction on the plurality of target text unit data to obtain a position feature vector for each target text unit data. The position feature vector may characterize the relative or absolute position at which each target text unit data appears in the target short text.
According to the embodiment of the disclosure, word feature extraction is performed on a plurality of target text unit data by using a word feature encoder, so as to obtain a word feature vector for each target text unit data. The word feature vector may characterize respective semantic information corresponding to each target text unit data.
According to the embodiment of the disclosure, the relative or absolute position of each target text unit data in the target short text can be determined from its position feature vector, and the semantic information of each target text unit data in the target text can be determined from its word feature vector. By integrating the relative or absolute position information and the semantic information of each target text unit data in the target short text, the semantic feature vector of the target short text is determined.
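As an illustration, the sketch below combines fixed sinusoidal position features with learned word features by element-wise addition; BERT actually learns its position embeddings, so the sinusoidal choice here is an assumption.

```python
import math
import torch

def position_features(seq_len: int, dim: int) -> torch.Tensor:
    """Sinusoidal position feature vectors, one per text-unit position."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float)
                    * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

dim, seq_len, vocab_size = 768, 16, 21128
word_encoder = torch.nn.Embedding(vocab_size, dim)   # word feature encoder
unit_ids = torch.randint(0, vocab_size, (seq_len,))  # target text units
# Summation integrates each unit's word features with its position features.
unit_vectors = word_encoder(unit_ids) + position_features(seq_len, dim)
```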
According to an embodiment of the present disclosure, performing semantic feature extraction on the target short text to obtain the semantic feature vector may include: performing semantic feature extraction on the target short text by using the characterization model to obtain the semantic feature vector, where the characterization model is obtained by training a pre-training language model with a sample short text.
According to embodiments of the present disclosure, the characterization model may be a model for learning language representations, obtained by training a pre-training language model with a sample short text. The pre-training language model may be a supervised model, a semi-supervised model or an unsupervised model. The model structure of the pre-training language model may be configured according to actual business requirements, which is not limited here.
According to the embodiment of the disclosure, the characterization model can perform word division processing on the input target text, and the word division result is used as a unit to obtain a feature extraction result corresponding to each word division result.
According to an embodiment of the present disclosure, the characterization model being obtained by training a pre-training language model with a sample short text may include: the characterization model is obtained by adjusting model parameters of the pre-training language model based on a mask prediction loss function value; the mask prediction loss function value is determined based on a prediction result of first sample short text data; and the prediction result of the first sample short text data is obtained by performing mask prediction on the first sample short text data of a first sample short text.
According to an embodiment of the present disclosure, the first sample short text data may be text data or audio data, and may be a vast training corpus within a particular domain, for example, common corpora in fields such as customer service, medicine and biology.
According to an embodiment of the present disclosure, performing mask prediction on the first sample short text data of the first sample short text may include: randomly masking the first sample short text data, that is, masking part of the data in the original first sample short text data to obtain masked first sample short text data.
According to the embodiment of the disclosure, the masked first sample short text data is input into the pre-training language model to obtain a prediction result for the first sample short text data. Based on the mask prediction loss function, a mask prediction loss function value is obtained from the prediction result. The model parameters of the pre-training language model are adjusted according to the mask prediction loss function value to obtain a trained pre-training language model, namely, the characterization model.
According to embodiments of the present disclosure, for example, the prediction result may be input into the mask prediction loss function to obtain the mask prediction loss function value. The model parameters of the pre-training language model are adjusted according to the mask prediction loss function value until a predetermined end condition is satisfied. The pre-training language model obtained when the predetermined end condition is satisfied is determined as the characterization model. The predetermined end condition may include at least one of: the loss function value converges, or the training rounds reach a predetermined number of training rounds. The predetermined number of training rounds may be determined based on business scenario requirements, which may refer to requirements related to the training task.
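One training step of this masked-prediction scheme, sketched with the transformers library's masked-LM utilities, could look as follows; the checkpoint, masking rate and sample data are assumptions.

```python
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

samples = ["账户无法登录", "快递迟迟未到"]  # first sample short texts (placeholder)
features = [tokenizer(t, truncation=True, max_length=32) for t in samples]
batch = collator(features)       # randomly masks tokens and sets up labels

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**batch).loss       # the mask prediction loss function value
loss.backward()                  # adjust model parameters from that value
optimizer.step()
```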
According to an embodiment of the present disclosure, the characterization model being obtained by training a pre-training language model with a sample short text includes: the characterization model is obtained by adjusting model parameters of the pre-training language model based on a clustering loss function value; the clustering loss function value is determined based on sample feature vectors and a clustering result; the clustering result is determined from a plurality of sample clusters of the sample feature vectors; the plurality of sample clusters are obtained by clustering the sample feature vectors; and the sample feature vectors are obtained by performing feature extraction on a second sample short text.
According to embodiments of the present disclosure, the pre-training language model may be a deep clustering model. Second sample short text data of the second sample short text is input into the pre-training language model, and feature extraction is performed on it by the Encoder module of the pre-training language model to obtain sample feature vectors. The sample feature vectors are input into a clustering module for cluster analysis to obtain a plurality of sample clusters for the sample feature vectors, and a clustering result is obtained. Based on a KL (Kullback-Leibler) divergence loss function, a KL divergence loss function value is obtained from the sample feature vectors and the clustering result. The parameters of the Encoder module in the pre-training language model are fine-tuned according to the KL divergence loss function value to obtain a trained pre-training language model, namely, the characterization model.
According to an embodiment of the present disclosure, for example, the sample feature vectors and the clustering result may be input into the KL divergence loss function to obtain the KL divergence loss function value. The parameters of the Encoder module in the pre-training language model are adjusted according to the KL divergence loss function value until a predetermined end condition is satisfied. The pre-training language model obtained when the predetermined end condition is satisfied is determined as the characterization model. The predetermined end condition may include at least one of: the loss function value converges, or the training rounds reach a predetermined number of training rounds. The predetermined number of training rounds may be determined based on business scenario requirements, which may refer to requirements related to the training task.
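The description matches DEC-style deep clustering, where soft assignments and a sharpened target distribution feed a KL divergence loss; the Student's-t assignment below is the standard DEC choice and an assumption here, since the disclosure only names the KL loss.

```python
import torch
import torch.nn.functional as F

def soft_assign(z: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    """Student's-t soft assignment of sample feature vectors to clusters."""
    q = 1.0 / (1.0 + torch.cdist(z, centers) ** 2)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q: torch.Tensor) -> torch.Tensor:
    """Sharpened clustering result used as the KL target."""
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

z = torch.randn(64, 32, requires_grad=True)        # encoder output (placeholder)
centers = torch.randn(10, 32, requires_grad=True)  # cluster centers
q = soft_assign(z, centers)
p = target_distribution(q).detach()
kl_loss = F.kl_div(q.log(), p, reduction="batchmean")  # clustering loss value
kl_loss.backward()  # gradients fine-tune the encoder parameters and centers
```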
According to embodiments of the present disclosure, a pre-training topic model may be trained with third sample short text data within a particular domain, so that words with domain features can be assigned different topics; the topic model may be obtained by training as follows.
According to the embodiment of the disclosure, the number of topics of the pre-training topic model is determined, the third sample short text data is input into the pre-training topic model, and word segmentation processing is performed on the third sample short text data to obtain an initial topic distribution for each word segment. Based on the sampling loss function, a sampling loss function value is obtained from the topic distribution assigned to each word. The topic distribution assigned to each word is updated according to the sampling loss function value until the sampling loss function reaches a preset end condition; the topic of each word in the corpus is then counted to obtain the topic distribution of the third sample short text, that is, a trained pre-training topic model, namely, the topic model.
According to an embodiment of the present disclosure, clustering the fusion feature vector to obtain the clustering result of the target short text includes: inputting the fusion feature vector into a clustering model for clustering processing, and determining a plurality of cluster centers; clustering the fusion feature vectors according to the cluster centers and the fusion feature vectors to obtain a plurality of clusters; and obtaining the clustering result of the target short text from the plurality of clusters.
According to an embodiment of the present disclosure, suppose the target short text is to be divided into k categories. First, k initial cluster centers are randomly selected from the fusion feature vectors; the distance from each fusion feature vector to the k initial cluster centers is computed, and the fusion feature vector is assigned to the category of the nearest initial cluster center, yielding a plurality of initial clusters.
According to the embodiment of the disclosure, the cluster center of each initial cluster is then iteratively updated by the mean method: the mean of the fusion feature vectors in each initial cluster is calculated and used as its updated cluster center, giving k new cluster centers; the distance from each fusion feature vector to the k new cluster centers is computed, and the fusion feature vector is assigned to the category of the nearest new cluster center, yielding a plurality of clusters.
According to the embodiment of the disclosure, the above steps continue to iteratively update the cluster center of each cluster. The iteration stops when the updated cluster centers satisfy a preset end condition, and the target cluster center of each cluster is determined. The distance from each fusion feature vector to the k target cluster centers is computed, the fusion feature vector is assigned to the category of the nearest target cluster center to obtain a plurality of target clusters, and the clustering result of the target short text is obtained from the target clusters.
According to an embodiment of the present disclosure, the preset end condition may be that the mean-updated cluster centers remain unchanged, or that a preset number of iterations is reached.
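The steps above are plain k-means. A self-contained sketch matching them, stopping on unchanged centers or a preset iteration count, is given below; the data shapes are illustrative.

```python
import numpy as np

def kmeans(vectors: np.ndarray, k: int, max_iters: int = 100, seed: int = 0):
    """Random initial centers, nearest-center assignment, mean updates."""
    rng = np.random.default_rng(seed)
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(max_iters):                 # preset iteration count
        dists = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)          # nearest-center category
        new_centers = np.array([vectors[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # centers unchanged: stop
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans(np.random.rand(200, 818), k=5)  # fusion vectors
```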
Fig. 3 schematically illustrates a schematic diagram of a text clustering method according to an embodiment of the present disclosure.
As shown in fig. 3, in a schematic diagram 300, a target short text 301 is input into a topic model 302 and a token model 303 respectively, topic feature extraction is performed on the target short text 301 by using the topic model 302 to obtain a topic feature vector 304, and semantic feature extraction is performed on the target short text 301 by using the token model 303 to obtain a semantic feature vector 305. The subject feature vector 304 and the semantic feature vector 305 are fused to generate a fused feature vector 306. And inputting the fusion feature vector 306 into a clustering model 307 for clustering to obtain a clustering result 308 of the target short text.
Fig. 4 schematically illustrates a block diagram of a text clustering device according to an embodiment of the present disclosure.
As shown in fig. 4, the apparatus 400 may include: a first extraction module 410, a second extraction module 420, a generation module 430, and a clustering module 440.
The first extraction module 410 is configured to perform topic feature extraction on the target short text to obtain a topic feature vector, where the number of words in the target short text meets a preset condition.
The second extraction module 420 is configured to perform semantic feature extraction on the target short text, so as to obtain a semantic feature vector.
The generating module 430 is configured to generate a fusion feature vector according to the topic feature vector and the semantic feature vector.
And the clustering module 440 is used for clustering the fusion feature vectors to obtain a clustering result of the target short text.
According to the embodiments of the present disclosure, a topic feature vector is obtained by performing topic feature extraction on the target short text; a semantic feature vector is obtained by performing semantic feature extraction on the target short text; a fusion feature vector is generated from the topic feature vector and the semantic feature vector; and the fusion feature vector is clustered to obtain a clustering result of the target short text, thereby realizing the technical means of clustering short texts. Because the topic feature vector and the semantic feature vector complement each other in representing the semantic information of the target short text, the technical problems of conventional text clustering on short texts, namely sparse representation of semantic-dimension information and low clustering accuracy, are at least partially solved, and the technical effects of fully capturing the semantic information expressed by the short text and improving its clustering result are achieved.
According to an embodiment of the present disclosure, the first extraction module 410 may include: the device comprises a processing sub-module, a first extraction sub-module and a generation sub-module.
And the processing sub-module is used for performing word segmentation processing on the target short text to obtain a plurality of word segmentation vectors corresponding to the target short text.
The first extraction sub-module is used for performing topic feature extraction on the plurality of word segmentation vectors to obtain topic feature vectors of the plurality of word segmentation vectors.
And the generating sub-module is used for generating the topic feature vector of the target short text from the topic feature vectors of the plurality of word segmentation vectors.
According to an embodiment of the present disclosure, the first extraction sub-module may include: a sampling unit and a determining unit.
The sampling unit is used for sampling the word segmentation vectors to obtain sampling results of the word segmentation vectors.
And the determining unit is used for determining the topic feature vectors of the plurality of word segmentation vectors in a case where the sampling results satisfy the preset condition.
According to an embodiment of the present disclosure, the second extraction module 420 may include: a determining sub-module and a second extraction sub-module.
And the determining sub-module is used for preprocessing the target short text and determining a plurality of target text units for the target short text.
And the second extraction sub-module is used for extracting semantic features of the target text unit data of the plurality of target text units to obtain semantic feature vectors.
According to an embodiment of the present disclosure, the second extraction sub-module may include: the device comprises a first extraction unit, a second extraction unit and a generation unit.
The first extraction unit is used for extracting the position features of the plurality of target text unit data to obtain position feature vectors.
And the second extraction unit is used for extracting word characteristics of the plurality of target text unit data to obtain word characteristic vectors.
And the generating unit is used for generating the semantic feature vector of the target short text according to the position feature vector and the word feature vector.
According to an embodiment of the present disclosure, the second extraction module 420 may include: and a third extraction sub-module.
The third extraction sub-module is used for performing semantic feature extraction on the target short text by using the characterization model to obtain the semantic feature vector, where the characterization model is obtained by training a pre-training language model with a sample short text.
According to an embodiment of the present disclosure, the characterization model being obtained by training a pre-training language model with a sample short text includes: the characterization model is obtained by adjusting model parameters of the pre-training language model based on a mask prediction loss function value; the mask prediction loss function value is determined based on a prediction result of first sample short text data; and the prediction result of the first sample short text data is obtained by performing mask prediction on the first sample short text data of a first sample short text.
According to an embodiment of the present disclosure, the characterization model being obtained by training a pre-training language model with a sample short text includes: the characterization model is obtained by adjusting model parameters of the pre-training language model based on a clustering loss function value; the clustering loss function value is determined based on sample feature vectors and a clustering result; the clustering result is determined from a plurality of sample clusters of the sample feature vectors; the plurality of sample clusters are obtained by clustering the sample feature vectors; and the sample feature vectors are obtained by performing feature extraction on a second sample short text.
According to an embodiment of the present disclosure, the clustering module 440 may include: the system comprises a first clustering sub-module, a second clustering sub-module and an acquisition sub-module.
And the first clustering sub-module is used for inputting the fusion feature vector into a clustering model for clustering processing and determining a plurality of clustering centers.
And the second clustering sub-module is used for clustering the fusion feature vectors according to the clustering center and the fusion feature vectors to obtain a plurality of clusters.
And the acquisition sub-module is used for acquiring a clustering result of the target short text according to the plurality of clusters.
It should be noted that, the embodiments of the apparatus portion of the present disclosure are the same as or similar to the embodiments of the method portion of the present disclosure, and are not described herein.
Any number of the modules, sub-modules, units and sub-units according to embodiments of the present disclosure, or at least part of the functionality of any number of them, may be implemented in one module. Any one or more of the modules, sub-modules, units and sub-units according to embodiments of the present disclosure may be split into multiple modules for implementation. Any one or more of the modules, sub-modules, units and sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on a chip, a system on a substrate, a system on a package or an application specific integrated circuit (ASIC), or in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in any one of, or a suitable combination of, software, hardware and firmware. Alternatively, one or more of the modules, sub-modules, units and sub-units according to embodiments of the present disclosure may be at least partially implemented as computer program modules which, when executed, may perform the corresponding functions.
For example, any of the first extraction module 410, the second extraction module 420, the generation module 430, and the clustering module 440 may be combined in one module/unit/sub-unit or any of the modules/units/sub-units may be split into a plurality of modules/units/sub-units. Alternatively, at least some of the functionality of one or more of these modules/units/sub-units may be combined with at least some of the functionality of other modules/units/sub-units and implemented in one module/unit/sub-unit. According to embodiments of the present disclosure, at least one of the first extraction module 410, the second extraction module 420, the generation module 430, and the clustering module 440 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging the circuitry, or in any one of or a suitable combination of three of software, hardware, and firmware. Alternatively, at least one of the first extraction module 410, the second extraction module 420, the generation module 430, and the clustering module 440 may be at least partially implemented as a computer program module, which when executed, may perform the respective functions.
Fig. 5 schematically illustrates a block diagram of an electronic device adapted to implement a text clustering method in accordance with an embodiment of the present disclosure. The electronic device shown in fig. 5 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 5, an electronic device 500 according to an embodiment of the present disclosure includes a processor 501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. The processor 501 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 501 may also include on-board memory for caching purposes. The processor 501 may comprise a single processing unit or a plurality of processing units for performing different actions of the method flows according to embodiments of the disclosure.
In the RAM 503, various programs and data required for the operation of the electronic apparatus 500 are stored. The processor 501, ROM 502, and RAM 503 are connected to each other by a bus 504. The processor 501 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 502 and/or the RAM 503. Note that the program may be stored in one or more memories other than the ROM 502 and the RAM 503. The processor 501 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 500 may also include an input/output (I/O) interface 505, which is also connected to the bus 504. The electronic device 500 may also include one or more of the following components connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card or a modem. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory is mounted on the drive 510 as needed, so that a computer program read therefrom is installed into the storage section 508 as needed.
According to embodiments of the present disclosure, the method flow described above may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 509, and/or installed from the removable medium 511. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 501. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 502 and/or RAM 503 and/or one or more memories other than ROM 502 and RAM 503 described above.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the present disclosure and/or in the claims may be combined and/or recombined in various ways, even if such combinations or recombinations are not explicitly recited in the present disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or in the claims may be combined and/or recombined in various ways without departing from the spirit and teachings of the present disclosure. All such combinations and/or recombinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (13)

1. A text clustering method, comprising:
extracting topic features of a target short text to obtain a topic feature vector, wherein the number of words in the target short text meets a preset condition;
extracting semantic features of the target short text to obtain a semantic feature vector;
generating a fusion feature vector according to the topic feature vector and the semantic feature vector;
and clustering the fusion feature vectors to obtain a clustering result of the target short text.
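To make the four steps of claim 1 concrete, the following is a minimal, non-normative sketch in Python. It assumes concatenation as the fusion operation and k-means as the clustering step (neither is fixed by the claim); `topic_extractor` and `semantic_extractor` are hypothetical placeholders for the extraction steps detailed in the dependent claims.

```python
# Minimal sketch of the claim-1 pipeline. Assumptions not fixed by the
# claim: texts are pre-filtered to "short" texts, fusion is vector
# concatenation, and clustering is k-means.
import numpy as np
from sklearn.cluster import KMeans

def cluster_short_texts(texts, topic_extractor, semantic_extractor, n_clusters=10):
    fused = []
    for text in texts:
        topic_vec = topic_extractor(text)        # topic feature vector
        semantic_vec = semantic_extractor(text)  # semantic feature vector
        fused.append(np.concatenate([topic_vec, semantic_vec]))
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.stack(fused))
    return labels  # cluster id per short text
```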
2. The method of claim 1, wherein the extracting topic features of the target short text to obtain the topic feature vector comprises:
performing word segmentation on the target short text to obtain a plurality of word segmentation vectors corresponding to the target short text;
extracting topic features of the plurality of word segmentation vectors to obtain topic feature vectors of the plurality of word segmentation vectors;
and generating the topic feature vector of the target short text according to the topic feature vectors of the plurality of word segmentation vectors.
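A minimal sketch of claim 2 follows, assuming the jieba tokenizer for word segmentation and mean pooling as the (unspecified) aggregation into a text-level vector; `token_topic_extractor` is a hypothetical placeholder.

```python
# Sketch of claim 2: segment the short text into words, obtain a topic
# vector per word, then aggregate into one text-level topic feature
# vector. Mean pooling is an assumed aggregation, not the claim's.
import numpy as np
import jieba

def topic_vector_for_text(text, token_topic_extractor):
    tokens = jieba.lcut(text)  # word segmentation of the target short text
    token_vecs = [token_topic_extractor(tok) for tok in tokens]
    return np.mean(token_vecs, axis=0)  # aggregate into one topic vector
```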
3. The method of claim 2, wherein the extracting topic features of the plurality of word segmentation vectors to obtain the topic feature vectors of the plurality of word segmentation vectors comprises:
sampling the plurality of word segmentation vectors to obtain sampling results of the plurality of word segmentation vectors;
and determining the topic feature vectors of the plurality of word segmentation vectors under the condition that the sampling results meet a preset condition.
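Claim 3 names neither the sampler nor the stopping rule; one plausible reading, sketched below, is an iterative sampling pass (for example, Gibbs-style topic assignment) repeated until a convergence criterion — the "preset condition" — holds.

```python
# Speculative sketch of claim 3: run a sampling pass over the word
# segmentation vectors until the sampling result satisfies a preset
# condition (here, a score-convergence test). The sampler itself is a
# hypothetical callable; the patent does not specify it.
def sample_until_converged(sampler_step, state, max_iters=1000, tol=1e-4):
    prev_score = float("-inf")
    for _ in range(max_iters):
        state, score = sampler_step(state)  # one sampling pass
        if abs(score - prev_score) < tol:   # the "preset condition"
            return state
        prev_score = score
    return state
```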
4. The method of claim 1, wherein the extracting semantic features of the target short text to obtain the semantic feature vector comprises:
preprocessing the target short text to determine a plurality of target text units for the target short text;
and extracting semantic features of target text unit data of the plurality of target text units to obtain the semantic feature vector.
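As a small illustration of claim 4's preprocessing step, the sketch below assumes simple whitespace cleanup and character-level text units; the claim does not fix the unit granularity (characters, sub-words, and so on would all qualify).

```python
# Sketch of claim 4, assuming characters as the target text units
# after whitespace cleanup; sub-word units would work equally well.
import re

def to_text_units(text):
    cleaned = re.sub(r"\s+", "", text)  # preprocessing: drop whitespace
    return list(cleaned)                # one text unit per character

units = to_text_units("请问 怎么退货")  # ['请', '问', '怎', '么', '退', '货']
```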
5. The method of claim 4, wherein the extracting semantic features of the target text unit data of the plurality of target text units to obtain the semantic feature vector comprises:
extracting position features of the plurality of target text unit data to obtain position feature vectors;
extracting word features of the plurality of target text unit data to obtain word feature vectors;
and generating the semantic feature vector of the target short text according to the position feature vectors and the word feature vectors.
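Claim 5 mirrors the standard transformer input convention of adding a position embedding to each word (token) embedding; the sketch below assumes that convention plus mean pooling, neither of which the claim itself mandates.

```python
# Sketch of claim 5 under the transformer convention: word feature
# vector plus position feature vector per text unit, then pooled into
# one semantic feature vector. Mean pooling is an assumption.
import numpy as np

def semantic_vector(token_ids, word_emb, pos_emb):
    # word_emb: (vocab_size, dim); pos_emb: (max_len, dim)
    word_vecs = word_emb[token_ids]             # word feature vectors
    pos_vecs = pos_emb[: len(token_ids)]        # position feature vectors
    return (word_vecs + pos_vecs).mean(axis=0)  # fuse and pool

# toy usage with random embedding tables
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(100, 16))
pos_emb = rng.normal(size=(32, 16))
vec = semantic_vector([3, 17, 42], word_emb, pos_emb)  # shape (16,)
```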
6. The method of claim 1, wherein the extracting semantic features of the target short text to obtain the semantic feature vector comprises:
extracting semantic features of the target short text by using a characterization model to obtain the semantic feature vector, wherein the characterization model is obtained by training a pre-trained language model with sample short text.
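A minimal sketch of claim 6 follows, using Hugging Face Transformers with mean pooling over the encoder's last hidden states; `bert-base-chinese` is only a stand-in for the characterization model, which per claims 7 and 8 would first be fine-tuned on sample short text.

```python
# Sketch of claim 6: embed a target short text with a (hypothetically
# fine-tuned) BERT-style characterization model and mean-pool the last
# hidden states into one semantic feature vector.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")  # stand-in model

inputs = tokenizer("怎么申请退款", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
semantic_vec = hidden.mean(dim=1).squeeze(0)    # semantic feature vector
```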
7. The method of claim 6, wherein training the pre-trained language model with the sample short text to obtain the characterization model comprises:
obtaining the characterization model by adjusting model parameters of the pre-trained language model based on a mask prediction loss function value,
wherein the mask prediction loss function value is determined based on a prediction result of first sample short text data,
and the prediction result of the first sample short text data is obtained based on mask prediction of the first sample short text data of a first sample short text.
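Claim 7 corresponds to masked-language-model fine-tuning; the sketch below uses Hugging Face Transformers, with the model name, sample texts, and hyperparameters all as placeholders rather than values from the patent.

```python
# Non-normative sketch of claim 7: continue masked-language-model
# training on sample short texts, so the mask prediction loss drives
# the parameter updates.
import torch
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

texts = ["怎么退货", "物流到哪了"]  # placeholder first sample short texts
batch = collator([tokenizer(t, truncation=True) for t in texts])

model.train()
optimizer.zero_grad()
loss = model(**batch).loss  # mask prediction loss function value
loss.backward()             # adjust model parameters based on the loss
optimizer.step()
```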
8. The method of claim 6, wherein training the pre-trained language model with the sample short text to obtain the characterization model comprises:
obtaining the characterization model by adjusting model parameters of the pre-trained language model based on a clustering loss function value,
wherein the clustering loss function value is determined based on a sample feature vector and a clustering result,
the clustering result is determined according to a plurality of sample clusters of the sample feature vector,
the plurality of sample clusters are obtained by clustering the sample feature vector,
and the sample feature vector is obtained by feature extraction of a second sample short text.
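Claim 8 leaves the clustering loss unspecified; one common instantiation, sketched below, is a deep-embedded-clustering-style objective that pulls each sample feature vector toward its assigned cluster center. This is an assumed loss, not necessarily the patent's.

```python
# Speculative sketch of claim 8's clustering loss: mean squared
# distance between each sample feature vector (from second sample
# short texts) and the center of the cluster it was assigned to.
import torch

def clustering_loss(sample_vecs, centers, assignments):
    # sample_vecs: (N, dim); centers: (K, dim); assignments: (N,) long
    assigned_centers = centers[assignments]  # center of each sample's cluster
    return ((sample_vecs - assigned_centers) ** 2).sum(dim=1).mean()
```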
9. The method of claim 1, wherein the clustering the fusion feature vectors to obtain the clustering result of the target short text comprises:
inputting the fusion feature vectors into a clustering model for clustering processing, and determining a plurality of clustering centers;
clustering the fusion feature vectors according to the plurality of clustering centers and the fusion feature vectors to obtain a plurality of clusters;
and obtaining the clustering result of the target short text according to the plurality of clusters.
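A minimal sketch of claim 9 follows, assuming k-means as the clustering model: fitting determines the clustering centers, and each fusion feature vector is assigned to its nearest center to form the clusters.

```python
# Sketch of claim 9 with k-means as an assumed clustering model.
import numpy as np
from sklearn.cluster import KMeans

fused_vectors = np.random.default_rng(0).normal(size=(200, 32))  # placeholder data
km = KMeans(n_clusters=8, n_init=10).fit(fused_vectors)
centers = km.cluster_centers_  # the plurality of clustering centers
labels = km.labels_            # cluster assignment per target short text
```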
10. A text clustering device, comprising:
a first extraction module configured to extract topic features of a target short text to obtain a topic feature vector, wherein the number of words in the target short text meets a preset condition;
a second extraction module configured to extract semantic features of the target short text to obtain a semantic feature vector;
a generation module configured to generate a fusion feature vector according to the topic feature vector and the semantic feature vector;
and a clustering module configured to cluster the fusion feature vectors to obtain a clustering result of the target short text.
11. A computer system, comprising:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 9.
12. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to implement the method of any of claims 1 to 9.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 9.
CN202310106081.2A 2023-01-31 2023-01-31 Text clustering method, device, computer system and storage medium Pending CN116010607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310106081.2A CN116010607A (en) 2023-01-31 2023-01-31 Text clustering method, device, computer system and storage medium


Publications (1)

Publication Number Publication Date
CN116010607A true CN116010607A (en) 2023-04-25

Family

ID=86035669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310106081.2A Pending CN116010607A (en) 2023-01-31 2023-01-31 Text clustering method, device, computer system and storage medium

Country Status (1)

Country Link
CN (1) CN116010607A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination