CN115186652A - Text topic classification method, device, equipment and storage medium - Google Patents
- Publication number
- CN115186652A (application CN202210820489.1A)
- Authority
- CN
- China
- Legal status: Pending
Classifications
- G06F40/211 — Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F16/3344 — Query execution using natural language analysis
- G06F16/35 — Clustering; Classification
- G06F40/258 — Heading extraction; Automatic titling; Numbering
- G06N3/08 — Neural networks; Learning methods
Abstract
The application discloses a text topic classification method, apparatus, device, and storage medium. The method comprises: acquiring an information text input by a user, the information text comprising at least one text topic; determining a primary label corresponding to the at least one text topic; determining anchor words of predefined topics and clustering the sentences under each text topic according to the anchor words to obtain a plurality of clustering results under each text topic; determining, according to the correlation between the information text and each text topic, the probability distribution of each clustering result under each text topic in any sentence segment of the information text; and taking the clustering result with the highest probability under each text topic as the secondary label under that topic. Because the correlation between the information text and the text topics is represented by converting it into a probability distribution, the clustering has better controllability and interpretability, and the clustering results better match actual results.
Description
Technical Field
The present application relates to the field of artificial intelligence, and more particularly, to a text topic classification method, apparatus, device, and storage medium.
Background
With the development of society and technology, platforms such as social media deeply affect our lives, and a large amount of high-value information is buried in the massive texts published and interacted with every day. For efficient and safe social supervision, public sentiment on social media generally needs to be monitored, which is typically realized through topic analysis tasks such as extracting popular or newly emerging discussion topics and quickly locating the speakers of specific topics.
In the prior art, a label clustering method is generally adopted to perform topic analysis on massive social media texts. Such clustering has low controllability: the final clustering results cannot be foreseen or regulated, so they often deviate far from the actual results, and this low controllability in turn leads to low interpretability.
Disclosure of Invention
In view of this, the present application provides a text topic classification method, apparatus, device, and storage medium that can improve the interpretability and controllability of clustering, so that the clustering results better conform to the actual results.
The specific scheme of the application is as follows:
a text topic classification method comprises the following steps:
acquiring an information text input by a user, wherein the information text comprises at least one text topic;
determining a primary label corresponding to the at least one text topic;
determining anchor words of predefined topics, and clustering sentences under each text topic according to the anchor words to obtain a plurality of clustering results under each text topic;
determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text according to the correlation between the information text and each text topic;
and according to the probability distribution of each clustering result under each text topic in any sentence segment of the information text, taking the clustering result with the highest probability under each text topic as a secondary label under each text topic.
Preferably, the determining, according to the correlation between the information text and each text topic, the probability distribution of each clustering result under each text topic in any sentence segment of the information text includes:
determining the multi-valued mutual information quantity of each clustering result under each text topic in the information text according to the correlation between the information text and each text topic;
and under the condition that the sum of the probability distributions of each clustering result under each text topic in the information text is 1, determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text according to the multi-valued mutual information quantity of each clustering result under each text topic in the information text.
Preferably, the determining, according to a multi-valued mutual information amount of each clustering result in the information text under each text topic, a probability distribution of each clustering result in each text topic in any sentence segment of the information text includes:
determining a Lagrange multiplier according to the sum of the multi-valued mutual information quantities of each clustering result under each text topic in the information text;
and determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text based on the Lagrange multiplier.
Preferably, the determining, based on the Lagrange multiplier, the probability distribution of each clustering result under each text topic in any sentence segment of the information text includes:
for each text topic, determining a first probability distribution of the text topic in the information text based on the Lagrangian multiplier;
determining a second probability distribution of one clustering result under the text topic in the information text according to the first probability distribution;
determining a fourth probability distribution of the clustering result under the text topic according to the second probability distribution and a pre-calculated third probability distribution of the word characteristics of the information text;
determining a sixth probability distribution of the clustering result under the text topic in each phrase of the information text based on the second probability distribution and a pre-calculated fifth probability distribution of each phrase in the corresponding information text;
determining the multi-valued mutual information quantity of each phrase of the information text and the clustering result under the text topic according to the third probability distribution, the fifth probability distribution and the sixth probability distribution;
determining feature selection parameters according to each phrase of the information text and the multi-valued mutual information quantity of the clustering result under the text topic;
determining a seventh probability distribution of each phrase of the information text in the clustering result under the text topic according to the second probability distribution, the third probability distribution and the fourth probability distribution;
determining an eighth probability distribution of each phrase of any sentence segment of the information text in the clustering result under the text topic according to the seventh probability distribution;
determining a ninth probability distribution of each phrase of any sentence segment in the information text based on the fifth probability distribution;
determining a tenth probability distribution of the clustering result in any sentence segment of the information text under the text topic according to the fourth probability distribution, the sixth probability distribution, the eighth probability distribution, the ninth probability distribution and the feature selection parameters;
determining whether the tenth probability distribution has not converged;
if so, reselecting a clustering result under the text topic, and returning to execute the step of determining a second probability distribution of one clustering result under the text topic in the information text according to the first probability distribution until a tenth probability distribution converges;
and if not, taking the tenth probability distribution obtained by the last updating as the probability distribution of one clustering result in any sentence segment of the information text under the text theme.
Preferably, the process of constructing word features of the information text includes:
constructing a vocabulary table corresponding to the information text based on the information text;
generating a text vector according to the vocabulary;
and taking the text vector as word features of the information text.
Preferably, the determining a primary label corresponding to the at least one text topic comprises:
determining a classification model, wherein the classification model comprises a bidirectional long-short term memory neural network, a word embedding layer and a classification layer;
inputting the information text into the classification model;
extracting features of the information text based on the word embedding layer to obtain text features corresponding to a plurality of text topics of the information text;
calculating the text features based on the bidirectional long-short term memory neural network to obtain a multi-dimensional text vector;
and classifying the multi-dimensional text vectors based on the classification layer to obtain a primary label corresponding to at least one text topic of the information text.
Preferably, the classification layer comprises a fully connected network and a classifier;
the step of classifying the multi-dimensional text vector based on the classification layer to obtain a primary label corresponding to at least one text topic of the information text comprises the following steps:
inputting the multi-dimensional text vector into the fully connected network to obtain at least one text topic with a characteristic label corresponding to the information text;
and inputting the plurality of text topics with the characteristic labels into the classifier to obtain a primary label corresponding to at least one text topic of the information text.
A text topic classification apparatus comprising:
the text acquisition unit is used for acquiring an information text input by a user, wherein the information text comprises at least one text topic;
the first label obtaining unit is used for determining a primary label corresponding to the at least one text topic;
the clustering result acquisition unit is used for determining anchor words of predefined topics, clustering sentences under each text topic according to the anchor words and obtaining a plurality of clustering results under each text topic;
the probability distribution determining unit is used for determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text according to the correlation between the information text and each text topic;
and the second label obtaining unit is used for taking the clustering result with the highest probability under each text topic as a secondary label under each text topic according to the probability distribution of each clustering result under each text topic in any sentence segment of the information text.
A text topic classification apparatus comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the text topic classification method.
A storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the text topic classification method as described above.
By means of the above technical scheme, the text topic classification method first obtains an information text, input by a user, that contains at least one text topic, and determines a primary label corresponding to the at least one text topic. It then determines anchor words of predefined topics and clusters the sentences under each text topic according to the anchor words, obtaining a plurality of clustering results under each text topic. Next, according to the correlation between the information text and each text topic, it determines the probability distribution of each clustering result under each text topic in any sentence segment of the information text. Finally, the clustering result with the highest probability under each text topic is taken as the secondary label under that topic. In this method, after the clustering results are determined according to the anchor words, the probability distribution of each clustering result under each text topic in any sentence segment of the information text is determined from the correlation between the information text and the text topics. That correlation measures how strongly the information text and the text topics are related, and it is represented concretely in the mathematical form of a probability distribution, so the clustering results can be interpreted and, in turn, controlled by adjusting or constraining that mathematical expression.
Drawings
Fig. 1 is a schematic flowchart of a text topic classification method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a classification model provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a text topic classification device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a text topic classification device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the prior art, a label clustering method is generally adopted to perform topic analysis on massive social media texts. Such clustering has low controllability: the final clustering results cannot be foreseen or regulated, so they often deviate far from the actual results, and this low controllability in turn leads to low interpretability.
In addition, for fine-grained topics, many categories lie in the tail of the long-tailed distribution of the overall data, so it is difficult to extract and accumulate enough samples for training. Meanwhile, the generalization and sampling representativeness of such samples are hard to ground in precise theory, making the corresponding measures difficult to evaluate.
Therefore, to solve the above problems, the applicant proposes a text topic classification method, apparatus, device, and storage medium that give the clustering results better controllability and interpretability.
Referring to fig. 1, fig. 1 is a schematic flow chart of a text topic classification method provided in an embodiment of the present application, and the method may include the following steps:
step S110, an information text input by a user is obtained, and the information text comprises at least one text theme.
Specifically, the information text input by the user may be all or most of the characters input by the user in a certain period of time on some social software or social platform, and the characters are grouped into a content set as the information text input by the user. At least one text topic can be included in the information text, and the text topic reflects the core expression meaning of the information text.
And step S120, determining a primary label corresponding to the at least one text theme.
Specifically, the primary tag is a coarse-grained theme determined based on a text theme contained in the information text, and is a specific type of text screened from a mass of social media texts according to the text information.
Furthermore, the text can be denoised through supervised learning of coarse-grained topics, and texts with irrelevant topics are filtered as far as possible. And carrying out supervised learning on the coarse-grained topics, and screening out the primary labels for subsequent operation.
And S130, determining anchor words of predefined topics, clustering sentences under each text topic according to the anchor words, and obtaining a plurality of clustering results under each text topic.
Specifically, the embodiment of the application can perform manual controllable intervention on the clustering result under a specific subject, further, common keywords or keywords with obvious marks of a certain type of subject can be obtained in advance and set as anchor words, and then each text topic is clustered purposefully according to the anchor words, so that the clustering result with a large relevance to the set anchor words can be obtained.
Step S140, determining a probability distribution of each clustering result in any sentence segment of the information text under each text topic according to the correlation between the information text and each text topic.
In particular, correlation here refers to total correlation, also known as multivariate mutual information, a measure in the information-theoretic framework: given a set of random variables, it is the difference between the sum of the entropies of each dimension and their joint entropy. This quantity equals the KL divergence between the joint distribution of the variable set {X_i} and the product of the marginal distributions p(x_i). The KL divergence (Kullback-Leibler divergence), also known as relative entropy or information divergence, is an asymmetric measure of the difference between two probability distributions; in information theory, the relative entropy corresponds to the difference between the information entropies of two probability distributions. Relative entropy serves as the loss function of some optimization algorithms, such as the expectation-maximization algorithm. In that case, one probability distribution in the calculation is the true distribution and the other is the theoretical (fitted) distribution, and the relative entropy represents the information loss incurred when the true distribution is fitted by the theoretical distribution.
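For reference, the textbook definitions underlying this measure can be written out as follows; this is a restatement of standard information-theoretic quantities, not a formula reproduced from the application:

$$TC(X_1,\ldots,X_n)=\sum_{i=1}^{n}H(X_i)-H(X_1,\ldots,X_n)=D_{\mathrm{KL}}\!\left(p(x_1,\ldots,x_n)\,\middle\|\,\prod_{i=1}^{n}p(x_i)\right)$$

$$D_{\mathrm{KL}}(p\,\|\,q)=\sum_{x}p(x)\log\frac{p(x)}{q(x)}$$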
Further, the total correlation is converted into corresponding marginal-probability and conditional-probability forms for expression, which facilitates the solution and represents the correlation mathematically, thereby achieving the purpose of controlling and explaining the clustering results.
And S150, according to the probability distribution of each clustering result under each text topic in any sentence segment of the information text, taking the clustering result with the highest probability under each text topic as the secondary label under that topic.
Specifically, the secondary labels are obtained by first classifying the text topics at a coarse granularity and then at a fine granularity. After the probability distribution of each clustering result under each text topic in any sentence segment of the information text has been determined, classifying the fine-grained topics of the text only requires sorting the different probability values, which yields secondary labels with different degrees of association with the information text (see the sketch below).
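As an illustration only, the sorting described above can be sketched as follows; the names and data layout are assumptions, not part of this application:

```python
# Illustrative sketch only: rank the per-cluster probabilities computed
# for one sentence segment and read off the secondary label(s).
def secondary_labels(cluster_probs, threshold=None):
    # cluster_probs: dict mapping clustering-result name -> probability
    # (hypothetical structure, not the application's data format)
    ranked = sorted(cluster_probs.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is None:
        return [ranked[0][0]]                       # single highest-probability label
    return [c for c, p in ranked if p > threshold]  # multi-topic screening

probs = {"graphics card": 0.61, "motherboard": 0.27, "cpu": 0.12}
print(secondary_labels(probs))        # ['graphics card']
print(secondary_labels(probs, 0.2))   # ['graphics card', 'motherboard']
```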
With the above technical solution, the text topic classification method of the embodiment of the present application first obtains an information text, input by a user, that contains at least one text topic, and determines a primary label corresponding to the at least one text topic. It then determines anchor words of predefined topics and clusters the sentences under each text topic according to the anchor words, obtaining a plurality of clustering results under each text topic. Next, it determines, according to the correlation between the information text and each text topic, the probability distribution of each clustering result under each text topic in any sentence segment of the information text; finally, the clustering result with the highest probability under each text topic is taken as the secondary label under that topic. Because the correlation between the information text and the text topics, which measures how strongly they are related, is represented concretely in the mathematical form of a probability distribution, the clustering results can be interpreted and, in turn, controlled by adjusting or constraining that mathematical expression.
The foregoing embodiments briefly introduce a text topic classification method of the present application. In some embodiments of the present application, the detailed description of the process of determining, in step S140, a probability distribution of each clustering result in any sentence segment of the information text in each text topic according to the correlation between the information text and each text topic may include the following steps:
and S200, determining the multi-value mutual information quantity of each clustering result in each text subject in the information text according to the correlation between the information text and each text subject.
Specifically, the total correlation is converted into an expression in terms of the corresponding mutual information, so that the corresponding marginal-probability and conditional-probability forms can conveniently be used when solving, achieving the purpose of controlling and explaining the clustering results.
Further, finer-grained topic analysis is performed on the coarse-grained primary labels obtained from the first-step coarse screening, yielding the secondary labels. The scheme converts the problem into an optimization problem under an information-theoretic framework:

$$\max_{Y}\; TC(X;\, Y) \qquad \text{s.t.}\; |Y| = k$$

where X represents the information text in the embodiment of the present application, Y represents a text topic of the information text, and the constraint s.t. |Y| = k requires the random variable Y to comprise k orthogonal vector groups. It will be appreciated that in this formula both X and Y are random variables. TC(X;Y) denotes the extent to which the random variable Y explains the total correlation of the random variable X, that is, the reduction of the total correlation of X given the condition Y, namely TC(X;Y) = TC(X) - TC(X|Y). Maximizing this measure fits the internal structure of X, meaning that the explanatory decomposition provided by the external variable Y is maximized. A simple example: a certain value of Y can be defined as a "computer accessories" topic, with X as the condition variable whose dimensions represent terms such as "graphics card" and "motherboard"; computing the total correlation of X explained by Y makes it possible to judge which term or terms have greater potential influence on Y, i.e., fit the internal structure better.
It is to be understood that, in the present embodiment, a bag-of-words model may be used to obtain the features of the information text. Thus, for each word of the vocabulary that appears in a sentence, the corresponding feature dimension is 1, and otherwise 0. It will be appreciated that in some other embodiments the features may instead be obtained by dividing each word count by the total number of words; features obtained this way are not in binary form but in fractional form. When the sample data size is large, the binary form can be used to obtain the features.
Further, in order to solve the above formula, the embodiment of the present application applies a conditional relaxation: it is assumed that the features of X can be divided into n non-overlapping groups, each group of features corresponding to a unique Y_j, so that the objective can be relaxed to

$$\max \sum_{j} TC\!\left(X_{G_j};\, Y_j\right) \qquad \text{s.t.}\; G_j \cap G_{j'} = \varnothing \;\; (j \neq j')$$

where X_{G_j} is a single vector group of the variable X, Y_j is a feature of a certain dimension of the variable Y, and G_j ∩ G_{j'} = ∅ expresses that no two vector groups of the feature variables coincide. G represents the set of feature groups in the variable X, and the relaxed constraint on the assumed variables is introduced in order to solve the optimization problem. The parameter α_{ij} takes the value 1 when the i-th dimension feature X_i of X corresponds to the j-th dimension Y_j of Y, and 0 otherwise; its purpose is to express the optimization problem in a form that can be solved with convex optimization. In addition, the total correlation is converted into an expression in terms of the corresponding mutual information, so that the corresponding marginal-probability and conditional-probability forms can conveniently be used in the solution.
It is to be understood that the random variable X is high-dimensional, since the topic model is generated based on the bag-of-words model. The embodiment of the present application learns the intrinsic structure of the data through total correlation, so dimensionality-reduction schemes such as PCA cannot be applied here.
Step S210, under the condition that the sum of the probability distributions of each clustering result under each text topic in the information text is 1, determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text according to the multivalued mutual information amount of each clustering result under each text topic in the information text.
Specifically, the above formula can be expanded into an equivalent mathematical form by using the indicator function α_{ij} together with the definition of total correlation. A lower-bound condition, i.e., a constraint, is set so that the problem is converted into a convex optimization problem for solving, after which the subsequent steps can be carried out.
In the foregoing embodiment, the process of determining the probability distribution of each clustering result in any sentence segment of the information text according to the correlation between the information text and each text topic was briefly introduced. In some embodiments of the application, for the step S210, the process of determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text according to the multi-valued mutual information quantity of each clustering result under each text topic in the information text is described in detail, and it may include the following steps:
and S300, determining a Lagrangian multiplier according to the sum of multi-value mutual information quantity of each clustering result under each text topic in the information text.
Specifically, a Lagrangian is constructed from the sum of the multi-valued mutual information quantities of all the clustering results under each text topic in the information text, together with the constraint condition, in the form:

$$L(x, \lambda) = F(x) + \lambda\, h(x)$$

where F(x) is the objective function, h(x) is the constraint function, and λ is the Lagrange multiplier.
Step S310, based on the Lagrange multiplier, determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text.
Specifically, an additional assumption is made for the above formula: the features of each dimension in the random variable Y do not overlap, so the maximum values can be solved for separately. After the Lagrangian is constructed, the partial derivative with respect to the variable p(y_j | x) is calculated and set to 0 for the subsequent operation.
In the foregoing embodiment, the process of determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text according to the multi-valued mutual information quantity was briefly introduced. In some embodiments of the present application, for the step S310, the process of determining the probability distribution of each clustering result in any sentence segment of the information text based on the Lagrange multiplier is described in detail, and it may include the following steps:
step S400, aiming at each text theme, determining a first probability distribution of the text theme in the information text based on the Lagrange multiplier.
In particular, after the Lagrangian is constructed, a first probability distribution p(y | x) is determined, which represents the probability distribution of the text topic in the information text. The first probability distribution may be calculated by a formula of the following form:

$$p(y \mid x) = \frac{1}{Z(x)}\; p(y) \prod_{i} \left( \frac{p(y \mid x_i)}{p(y)} \right)^{\alpha_i}$$

where Z(x) is a normalizing function introduced in order to solve the optimization equation, and α_i is a parameter subject to a uniform distribution on (1/2, 1).
Step S401, according to the first probability distribution, determining a second probability distribution of one clustering result in the information text under the text topic.
Specifically, the second probability distribution may be p(y_j | x), the probability distribution of one of the clustering results under the text topic in the information text; after the first probability distribution has been calculated, taking the j-th vector of the variable Y yields p(y_j | x) as the second probability distribution.
And S402, determining a fourth probability distribution of the clustering result under the text topic according to the second probability distribution and a pre-calculated third probability distribution of the word features of the information text.
Specifically, the third probability distribution may be p(x), the word features of the information text. The fourth probability distribution is p(y_j), the probability distribution of the clustering result under the text topic, which can be calculated by marginalization:

$$p(y_j) = \sum_{x} p(y_j \mid x)\, p(x)$$
further, the construction process of the word features of the information text comprises the following steps:
and S20, constructing a vocabulary corresponding to the information text based on the information text.
Specifically, in the embodiment of the present application, the vocabulary is constructed using a bag-of-words model. A bag of words is a representation of a text in terms of the words occurring in a document, from which text features can be extracted. The vocabulary refers to all words in the corpus that satisfy a certain frequency threshold, where the corpus in the embodiment of the present application may be represented by the bag-of-words model; statistics on the words appearing in the corpus can be gathered accordingly.
And S21, generating a text vector according to the vocabulary.
Specifically, the goal of this step is to convert each document into a vector that can serve as input or output of a machine learning model. For a sentence, word order is ignored: following the vocabulary order, a hit gives the corresponding dimension the value 1 and a miss gives 0, so the dimension of the whole sentence vector equals the size of the vocabulary. A simple example: with a total of 10 words in the lexicon, a vector of length 10 represents the document, where the value at each position is the score of the corresponding word. One of the simplest scoring methods takes the score as a Boolean value: 1 means the word corresponding to that position appears in the document and 0 means it does not. In this way, a binary vector can be generated.
And S22, taking the text vector as the word feature of the information text.
Specifically, after a text vector is generated from a vocabulary, the generated text vector is used as a word feature of the information text.
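As an illustration of steps S20 to S22, a minimal sketch follows, assuming whitespace tokenization, a frequency threshold of 1, and the Boolean scoring described above; the function names are hypothetical:

```python
from collections import Counter

def build_vocab(texts, min_count=1):
    # Step S20: build a vocabulary from words meeting a frequency threshold.
    counts = Counter(w for t in texts for w in t.split())
    return sorted(w for w, c in counts.items() if c >= min_count)

def to_binary_vector(sentence, vocab):
    # Steps S21-S22: order-independent Boolean bag-of-words vector;
    # its dimension equals the vocabulary size, 1 = present, 0 = absent.
    words = set(sentence.split())
    return [1 if w in words else 0 for w in vocab]

texts = ["the graphics card is fast", "the motherboard supports the card"]
vocab = build_vocab(texts)
print(to_binary_vector("a fast graphics card", vocab))
```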
After the construction process of the word features of the information text is described in detail, the embodiment of the present application continues to describe in detail the process of determining the probability distribution of each clustering result in any sentence segment of the information text based on the lagrangian multiplier.
Step S403, determining a sixth probability distribution of the clustering result in each phrase of the information text under the text topic based on the second probability distribution and a pre-calculated fifth probability distribution of each phrase in the corresponding information text.
Specifically, the sixth probability distribution is p(y_j | x_i), the probability distribution of the clustering result over each phrase of the information text, which can be understood as the cluster-label distribution of each phrase in the bag of words. The sixth probability may be calculated by a formula of the following form:

$$p(y_j \mid x_i) = \frac{\sum_{l} p\!\left(y_j \mid \bar{x}^{\,l}\right)\, \mathbb{1}\!\left[\bar{x}^{\,l}_i = x_i\right]}{\sum_{l} \mathbb{1}\!\left[\bar{x}^{\,l}_i = x_i\right]}$$

where x̄ denotes a free variable arithmetically equivalent to x, and the indicator 𝕀[x = x̄] is 1 when x equals x̄ and 0 otherwise.
Further, the fifth probability distribution is p(x_i), the probability distribution of each phrase in the corresponding information text; its calculation is similar to that of the word features of the information text and is not repeated here.
Step S404, determining a multi-valued mutual information amount of each phrase of the information text and the clustering result under the text topic according to the third probability distribution, the fifth probability distribution and the sixth probability distribution.
Specifically, the multi-valued mutual information quantity between each phrase of the information text and the clustering result under the text topic is calculated by the following formula:

$$I(X_i;\, Y_j) = \sum_{x_i,\, y_j} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)}, \qquad p(x_i, y_j) = p(y_j \mid x_i)\, p(x_i)$$
step S405, determining feature selection parameters according to each phrase of the information text and the multi-value mutual information quantity of the clustering result under the text theme.
In particular, the feature selection parameter α i,j The calculation is made by the following formula:
the variable with the bar represents that the variable is a free variable, and a specific dimension is not limited. And is
Specifically, γ is a hyper-parameter used to control the iteration rate, and the physical meaning is equivalent to the compromise factor between the compression rate and the accuracy rate in the information bottleneck framework in the information theory. The scheme can perform manual and controllable intervention on the clustering of the specific theme. The method is that common keywords or keywords with obvious marks of certain topics are obtained in advance through domain knowledge, the keywords are set as anchor words, and higher association strength is set, namely gamma is set to be a value far exceeding 1. Thus, when the features are ranked, these topics are ranked first and into the anchor word set. From the perspective of the parameter, gamma is a compromise between word relevance and the generalization degree of the subject characteristics.
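The anchoring idea can be illustrated as follows: each word receives an association strength, anchor words get a strength γ far above 1, and ranking features by strength-weighted mutual information pushes the anchors to the front. This is only an illustrative sketch; the array shapes and names are assumptions, not the application's code:

```python
import numpy as np

def rank_features(mutual_info, vocab, anchors, gamma_anchor=10.0):
    # mutual_info: (n_words,) mutual information of each word with one
    # clustering result; anchor words get strength gamma >> 1, ordinary
    # words strength 1, so anchors sort to the front of the ranking.
    strength = np.ones(len(vocab))
    for w in anchors:
        if w in vocab:
            strength[vocab.index(w)] = gamma_anchor
    weighted = strength * mutual_info
    return [vocab[i] for i in np.argsort(-weighted)]

vocab = ["card", "motherboard", "weather", "cpu"]
mi = np.array([0.30, 0.28, 0.05, 0.22])
print(rank_features(mi, vocab, anchors=["motherboard"]))
# ['motherboard', 'card', 'cpu', 'weather']
```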
Step S406, determining a seventh probability distribution of each phrase of the information text in the clustering result under the text topic according to the second probability distribution, the third probability distribution, and the fourth probability distribution.
Specifically, the seventh probability distribution is p(x_i | y_j), the probability distribution of each phrase of the information text within the clustering result under the text topic; it is obtained from the second, third, and fourth probability distributions by Bayes' rule.
Step S407, according to the seventh probability distribution, determining an eighth probability distribution of each phrase of any sentence segment of the information text in the clustering result under the text topic.
Specifically, the eighth probability distribution is the probability distribution of each phrase of any sentence segment of the information text in the clustering result under the text topic; put plainly, it is the conditional probability of a phrase in the corresponding sentence sample given the clustering result. The eighth probability distribution may be calculated by selecting a sentence of the information text from the seventh probability distribution.
Step S408, determining a ninth probability distribution of each phrase of any sentence segment in the information text based on the fifth probability distribution.
Specifically, the ninth probability distribution is the probability distribution of each phrase of any sentence segment in the information text; it may be derived from the word features of the selected sentence in the information text.
Step S409, determining a tenth probability distribution of the clustering result in any sentence segment of the information text under the text topic according to the fourth probability distribution, the sixth probability distribution, the eighth probability distribution, the ninth probability distribution and the feature selection parameter.
Specifically, the tenth probability distribution is p(y_j | x^l), the probability distribution of the clustering result under the text topic in any sentence segment of the information text; in the tenth probability distribution, x^l is understood through all the features (words) of the sentence in the information text, i.e., the probabilities of each word within the individual clustering results. The tenth probability distribution may be calculated by a formula of the following form:

$$p\!\left(y_j \mid x^{l}\right) = \frac{1}{Z\!\left(x^{l}\right)}\; p(y_j) \prod_{i} \left( \frac{p\!\left(x_i^{l} \mid y_j\right)}{p\!\left(x_i^{l}\right)} \right)^{\alpha_{i,j}}$$
step S410, determining whether the tenth probability distribution has not converged, if so, performing step S411; if not, go to step S412.
Specifically, this embodiment uses repeated computation and iterative approximation so that the final result tends to a fixed value with the required accuracy. Convergence here specifically means that, for the explanation of the total correlation of X by the optimization objective Y, the difference in TC(X;Y) between the average of 5 consecutive iterations and the average of the previous 5 iterations is less than $10^{-5}$.
Further, in the present embodiment, the iterative calculation may be performed using the EM algorithm. The EM algorithm, also known as the expectation-maximization algorithm, finds maximum-likelihood or maximum a posteriori estimates of the parameters of a probabilistic model that depends on unobservable hidden variables. It alternates between two steps: the expectation step (E) computes the expected likelihood using the current estimates of the hidden variables, and the maximization step (M) maximizes that expected likelihood to update the parameter values. The parameter estimates found in the M step are used in the next E step, and the two steps alternate.
The process of iterative computation using the EM algorithm, whose overall loop shape is sketched in code below, is as follows:
Input: an n_s × n observation matrix, formed by sampling n_s texts from the data set as observation samples and constructing a bag of words with n-dimensional features; this n_s × n matrix is the input of the algorithm.
Set the hidden-variable dimension parameters: m, Y_j, k.
Output: α_{i,j}, p(y | x^l), p(y_j | x_i), p(y_j | x^l).
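The loop below is a toy structural sketch of such an alternating iteration with the 5-versus-5 convergence test described above. The update equations are deliberately simplified naive-Bayes-style stand-ins, not the update formulas of this application; all names and shapes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit(X, k, max_iter=200, tol=1e-5, window=5):
    """Toy stand-in: q approximates p(y_j | x^l) for each sampled text."""
    n_s, n = X.shape
    q = rng.dirichlet(np.ones(k), size=n_s)        # p(y_j | x^l), random init
    history = []
    for _ in range(max_iter):
        p_y = q.mean(axis=0)                        # p(y_j): marginal over samples
        # p(y_j | x_i): label distribution per word, averaged over samples
        # that contain the word (simplified M-step, not the patent's update)
        p_y_given_xi = (X.T @ q) / (X.sum(axis=0)[:, None] + 1e-12)
        # simplified E-step: re-estimate p(y_j | x^l) from the word tables
        log_q = np.log(p_y + 1e-12) + X @ np.log(p_y_given_xi + 1e-12)
        q = np.exp(log_q - log_q.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
        # surrogate objective tracked only for the 5-vs-5 convergence test
        history.append(float(np.sum(q * np.log(q / (p_y + 1e-12) + 1e-12))))
        if len(history) >= 2 * window:
            recent = np.mean(history[-window:])
            previous = np.mean(history[-2 * window:-window])
            if abs(recent - previous) < tol:        # criterion from the text
                break
    return q

X = rng.integers(0, 2, size=(50, 30))               # n_s x n binary bag of words
q = fit(X, k=4)
print(q.shape)                                      # (50, 4): p(y_j | x^l)
```

Only the convergence bookkeeping is faithful to the text; in a real implementation the E and M updates would be replaced by the distribution updates of steps S400 to S412.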
Step S411, selecting a clustering result under the text topic again, and returning to execute the step of determining the second probability distribution of one clustering result under the text topic in the information text according to the first probability distribution until the tenth probability distribution converges.
Specifically, if the tenth probability distribution is not converged, the step of determining the second probability distribution of one of the clustering results in the information text according to the first probability distribution is returned to be executed, and the tenth probability distribution is converged after multiple iterative computations.
Step S412, using the tenth probability distribution obtained by the last update as the probability distribution of one of the clustering results in any sentence segment of the information text under the text topic.
Specifically, if the tenth probability distribution has converged, the tenth probability distribution obtained by the last update is used as the probability distribution of one of the clustering results in any sentence segment of the information text under the text topic.
After the above steps are completed, the analysis results are packaged through a series of encapsulations, a visual capability interface for the analysis results is provided, and a series of analysis capabilities for large-scale social media texts and fine-grained topic analysis in specific fields are offered. The multiple determined tenth probability distributions may be ranked, and the clustering result corresponding to the maximum probability is finally selected as the secondary label of the text topic. Further, if multi-topic screening is desired, all clustering results whose tenth probability distribution exceeds a set threshold may also be selected.
It will be appreciated that, for each topic to which a sentence may belong, the values p(y_j | x^l) may be sorted and a confidence threshold set. For a user, summing p(y_j | x^l) over each of the user's sentences and then sorting yields the topic portrait of that user (see the sketch below). Conversely, for the clustering result of a given text topic, i.e., given y_j, the supporting key features (the corresponding words) and the sentences scoring highest on those key features can be recovered in reverse from the iterated probabilities, and the obtained sentences serve as core sentences expressing the topic. In addition, the embodiment of the application can naturally extract both global words and sentences and the words and sentences of specific users, thereby meeting the requirements of fine-grained topic analysis.
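As an illustration of the user-portrait step, a small sketch follows; summing p(y_j | x^l) over a user's sentences and sorting yields the ranked portrait. Names and data layout are assumptions:

```python
import numpy as np

def topic_portrait(sentence_topic_probs, topic_names):
    # sentence_topic_probs: (n_sentences, n_topics) array of p(y_j | x^l)
    # for one user's sentences; summing over sentences and sorting yields
    # the user's ranked topic portrait.
    totals = np.asarray(sentence_topic_probs).sum(axis=0)
    order = np.argsort(-totals)
    return [(topic_names[j], float(totals[j])) for j in order]

probs = [[0.7, 0.2, 0.1],
         [0.6, 0.3, 0.1],
         [0.1, 0.8, 0.1]]
print(topic_portrait(probs, ["hardware", "sports", "weather"]))
# [('hardware', 1.4), ('sports', 1.3), ('weather', 0.3)]
```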
By means of the above technical scheme, on the one hand, the two stages of coarse-grained and fine-grained screening can effectively reduce the labeling cost and improve the precision of public opinion analysis. On the other hand, applying an information-theoretic perspective to the topic clustering problem of text provides good interpretability and controllability while remaining unsupervised. Finally, the fine-grained topic model is solved iteratively, and since the feature-group mappings corresponding to the topics are assumed to be independent, the method is inherently amenable to distributed computation and deployment and can cope with the analysis of massive social media.
The foregoing embodiments describe in detail the process of determining the probability distribution of each clustering result in any sentence segment of the information text based on the lagrangian multiplier in the present application. In some embodiments of the present application, regarding step S120, a detailed description is provided on a process of determining a primary tag corresponding to the at least one text topic, where the process may include the following steps:
and step S121, determining a classification model, wherein the classification model comprises a bidirectional long-short term memory neural network, a word embedding layer and a classification layer.
Specifically, FIG. 2 shows the classification model structure, which includes a word embedding layer, a bidirectional long-short term memory neural network, and a classification layer. The network architecture adopted in this part is a text classifier that innovatively performs coarse screening through a neural network structure before clustering.
And S122, inputting the information text into the classification model.
And S123, extracting the features of the information text based on the word embedding layer to obtain text features corresponding to a plurality of text topics of the information text.
Specifically, the word embedding layer is used for extracting input features, and extracting features required by corresponding words to be used as input features of unsupervised learning.
And S124, calculating the text features based on the bidirectional long-short term memory neural network to obtain a multi-dimensional text vector.
Specifically, a bidirectional long-short term memory neural network is used for calculating text features according to a preset mechanism to obtain a 200-dimensional text vector.
And step S125, classifying the multi-dimensional text vectors based on the classification layer to obtain a primary label corresponding to at least one text topic of the information text.
Specifically, the classification layer includes a fully connected network and a softmax classifier.
The step of classifying the multi-dimensional text vector based on the classification layer to obtain a primary label corresponding to at least one text topic of the information text comprises the following steps:
and S30, inputting the multi-dimensional text vector to the full-connection network to obtain at least one text theme with the feature tag corresponding to the information text.
Specifically, the spliced 200-dimensional text vector is fed into a fully connected network for classification. The fully connected layer acts as the "classifier" of the whole neural network: if operations such as convolutional layers, pooling layers, and activation functions map the raw data to a hidden-layer feature space, the fully connected layer maps the learned distributed feature representation to the sample label space.
Step S31, inputting the plurality of text topics with the characteristic labels into the classifier to obtain a primary label corresponding to at least one text topic of the information text.
Specifically, the probability distribution of the corresponding coarse-grained labels, that is, the primary labels corresponding to at least one text topic of the information text, is output through the softmax classifier.
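A minimal PyTorch sketch of the structure in FIG. 2 (word embedding layer, bidirectional LSTM, fully connected layer, softmax classifier) follows. The 200-dimensional text vector matches the text (100 hidden units per direction); the vocabulary size, embedding size, and the use of the final hidden states are assumptions:

```python
import torch
import torch.nn as nn

class CoarseTopicClassifier(nn.Module):
    # Word embedding -> bidirectional LSTM -> fully connected -> softmax,
    # mirroring the classification model described above (sketch only).
    def __init__(self, vocab_size, num_labels, embed_dim=128, hidden_dim=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # 100 hidden units per direction -> a 200-dimensional text vector
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):
        x = self.embedding(token_ids)                    # (batch, seq, embed_dim)
        _, (h_n, _) = self.bilstm(x)                     # h_n: (2, batch, hidden_dim)
        text_vec = torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 200)
        return torch.softmax(self.fc(text_vec), dim=-1)  # primary-label distribution

model = CoarseTopicClassifier(vocab_size=5000, num_labels=6)
tokens = torch.randint(1, 5000, (2, 20))   # two toy sentences of 20 tokens
print(model(tokens).shape)                 # torch.Size([2, 6])
```

In training one would typically keep the logits and use a cross-entropy loss; the explicit softmax here simply mirrors the softmax classifier described in the text.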
The text topic classification device provided in the embodiment of the present application is described below, and the text topic classification device described below and the text topic classification method described above may be referred to in correspondence with each other.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a text topic classification device disclosed in the embodiment of the present application.
As shown in fig. 3, the apparatus may include:
the text acquiring unit 11 is configured to acquire an information text input by a user, where the information text includes at least one text topic.
A first tag obtaining unit 12, configured to determine a primary tag corresponding to the at least one text topic.
And the clustering result obtaining unit 13 is configured to determine anchor words of predefined topics, and cluster sentences under each text topic according to the anchor words to obtain multiple clustering results under each text topic.
And a probability distribution determining unit 14, configured to determine, according to the correlation between the information text and each text topic, the probability distribution of each clustering result under each text topic in any sentence segment of the information text.
And the second label obtaining unit 15 is configured to, according to the probability distribution of each clustering result under each text topic in any sentence segment of the information text, take the clustering result with the highest probability under each text topic as a second-level label under each text topic.
The text topic classification device provided by the embodiment of the application can be applied to text topic classification equipment, such as a terminal: mobile phones, computers, etc. Optionally, fig. 4 shows a block diagram of a hardware structure of the text topic classification device, and referring to fig. 4, the hardware structure of the text topic classification device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit CPU, or an Application Specific Integrated Circuit ASIC (Application Specific Integrated Circuit), or one or more Integrated circuits configured to implement embodiments of the present invention, etc.;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory (non-volatile memory) or the like, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring an information text input by a user, wherein the information text comprises at least one text topic;
determining a primary label corresponding to the at least one text topic;
determining anchor words of predefined topics, and clustering sentences under each text topic according to the anchor words to obtain a plurality of clustering results under each text topic;
determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text according to the correlation between the information text and each text topic;
and according to the probability distribution of each clustering result under each text topic in any sentence segment of the information text, taking the clustering result with the highest probability under each text topic as a secondary label under each text topic.
Alternatively, the detailed function and the extended function of the program may be as described above.
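To make the anchor-word clustering step concrete, here is a deliberately small Python sketch. The topic names, anchor words, and the word-overlap assignment rule are all illustrative assumptions; the application claims clustering by anchor words but does not disclose this particular rule.

```python
from collections import defaultdict

# Hypothetical anchor words for two predefined topics (made up for the example).
ANCHOR_WORDS = {
    "finance": {"stock", "bond", "market"},
    "sports": {"match", "team", "score"},
}

def cluster_sentences(sentences, anchors):
    """Toy stand-in for anchor-word clustering: assign each sentence to the
    topic whose anchor words it overlaps most."""
    clusters = defaultdict(list)
    for sentence in sentences:
        words = set(sentence.lower().split())
        best = max(anchors, key=lambda topic: len(words & anchors[topic]))
        clusters[best].append(sentence)
    return dict(clusters)

text = "The team won the match. The stock market fell sharply."
sentences = [s.strip() for s in text.split(".") if s.strip()]
print(cluster_sentences(sentences, ANCHOR_WORDS))
# {'sports': ['The team won the match'], 'finance': ['The stock market fell sharply']}
```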
An embodiment of the present application further provides a storage medium, which may store a program executable by a processor, the program being configured to:
acquiring an information text input by a user, wherein the information text comprises at least one text topic;
determining a primary label corresponding to the at least one text topic;
determining anchor words of predefined topics, and clustering sentences under each text topic according to the anchor words to obtain a plurality of clustering results under each text topic;
determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text according to the correlation between the information text and each text topic;
and according to the probability distribution of each clustering result under each text topic in any sentence segment of the information text, taking the clustering result with the highest probability under each text topic as a secondary label under each text topic.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of another identical element in the process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, the embodiments may be combined as needed, and the same and similar parts may be referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A text topic classification method is characterized by comprising the following steps:
acquiring an information text input by a user, wherein the information text comprises at least one text topic;
determining a primary label corresponding to the at least one text topic;
determining anchor words of predefined topics, and clustering sentences under each text topic according to the anchor words to obtain a plurality of clustering results under each text topic;
determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text according to the correlation between the information text and each text topic;
and according to the probability distribution of each clustering result under each text topic in any sentence segment of the information text, taking the clustering result with the highest probability under each text topic as a secondary label under each text topic.
2. The method of claim 1, wherein determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text according to the correlation between the information text and each text topic comprises:
determining the multi-valued mutual information of each clustering result under each text topic in the information text according to the correlation between the information text and each text topic;
and, under the constraint that the probability distributions of all clustering results under each text topic in the information text sum to 1, determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text according to the multi-valued mutual information of each clustering result under each text topic in the information text.
3. The method according to claim 2, wherein determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text according to the multi-valued mutual information of each clustering result under each text topic in the information text comprises:
determining a Lagrange multiplier according to the sum of the multi-valued mutual information of each clustering result under each text topic in the information text;
and determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text based on the Lagrange multiplier.
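One plausible reading of claims 2-3 is sketched below: the per-cluster multi-valued mutual information values are turned into a probability distribution, and the Lagrange multiplier introduced for the sum-to-one constraint simply fixes the normalizer. The numeric values and the exact objective are assumptions; the claims do not disclose them.

```python
import numpy as np

# Hypothetical multi-valued mutual information of four clustering results
# with the information text (values made up for the example).
mi = np.array([0.12, 0.45, 0.08, 0.35])

# Maximizing sum_i p_i * log(mi_i / p_i) subject to sum_i p_i = 1 via a
# Lagrange multiplier gives p_i proportional to mi_i; the multiplier only
# determines the normalizing constant, here the sum of the scores.
lam = mi.sum()
p = mi / lam
assert abs(p.sum() - 1.0) < 1e-12  # the constraint holds by construction
```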
4. The method according to claim 3, wherein determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text based on the Lagrange multiplier comprises:
for each text topic, determining a first probability distribution of the text topic in the information text based on the Lagrange multiplier;
determining a second probability distribution of one clustering result under the text topic in the information text according to the first probability distribution;
determining a fourth probability distribution of the clustering result under the text topic according to the second probability distribution and a pre-calculated third probability distribution of the word features of the information text;
determining a sixth probability distribution of the clustering result under the text topic in each phrase of the information text based on the second probability distribution and a pre-calculated fifth probability distribution of each phrase in the information text;
determining the multi-valued mutual information between each phrase of the information text and the clustering result under the text topic according to the third probability distribution, the fifth probability distribution, and the sixth probability distribution;
determining a feature selection parameter according to the multi-valued mutual information between each phrase of the information text and the clustering result under the text topic;
determining a seventh probability distribution of each phrase of the information text in the clustering result under the text topic according to the second probability distribution, the third probability distribution, and the fourth probability distribution;
determining an eighth probability distribution of each phrase of any sentence segment of the information text in the clustering result under the text topic according to the seventh probability distribution;
determining a ninth probability distribution of each phrase of any sentence segment in the information text based on the fifth probability distribution;
determining a tenth probability distribution of the clustering result under the text topic in any sentence segment of the information text according to the fourth probability distribution, the sixth probability distribution, the eighth probability distribution, the ninth probability distribution, and the feature selection parameter;
determining whether the tenth probability distribution has converged;
if it has not converged, reselecting a clustering result under the text topic and returning to the step of determining a second probability distribution of one clustering result under the text topic in the information text according to the first probability distribution, until the tenth probability distribution converges;
and if it has converged, taking the tenth probability distribution obtained in the last update as the probability distribution of the clustering result under the text topic in any sentence segment of the information text.
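Claim 4 ultimately prescribes an iterate-until-convergence loop over the tenth probability distribution. The sketch below shows only that convergence-control pattern; the update rule here is a generic stand-in (mixing toward a fixed target), since the real update is specified only abstractly in terms of the first through ninth distributions.

```python
import numpy as np

def iterate_to_convergence(update, p0, tol=1e-8, max_iter=1000):
    """Repeat: recompute the distribution, test convergence; on convergence,
    keep the distribution from the last update (mirrors the claim's control flow)."""
    p = p0
    for _ in range(max_iter):
        p_next = update(p)
        if np.abs(p_next - p).max() < tol:
            return p_next          # converged: take the last update
        p = p_next                 # not converged: update and try again
    raise RuntimeError("did not converge within max_iter iterations")

TARGET = np.array([0.5, 0.3, 0.2])  # made-up stationary distribution

def toy_update(p):
    """Stand-in update: mix toward TARGET and renormalize."""
    mixed = 0.5 * p + 0.5 * TARGET
    return mixed / mixed.sum()

p = iterate_to_convergence(toy_update, np.ones(3) / 3)  # converges to TARGET
```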
5. The method of claim 4, wherein the process of constructing the word features of the information text comprises:
constructing a vocabulary table corresponding to the information text based on the information text;
generating a text vector according to the vocabulary;
and taking the text vector as word features of the information text.
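A bag-of-words count vector is one common way to realize claim 5's vocabulary-and-vector construction, sketched below; the claim does not fix the encoding, so this counting scheme is an assumption.

```python
from collections import Counter

def build_vocabulary(text: str) -> list[str]:
    """Vocabulary of the information text: unique words in first-seen order."""
    return list(dict.fromkeys(text.lower().split()))

def text_vector(text: str, vocab: list[str]) -> list[int]:
    """Bag-of-words count vector over the vocabulary, used as the word features."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

text = "the market rose and the market closed higher"
vocab = build_vocabulary(text)        # ['the', 'market', 'rose', 'and', 'closed', 'higher']
features = text_vector(text, vocab)   # [2, 2, 1, 1, 1, 1]
```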
6. The method of claim 1, wherein determining the primary label corresponding to the at least one text topic comprises:
determining a classification model, wherein the classification model comprises a bidirectional long-short term memory neural network, a word embedding layer and a classification layer;
inputting the information text into the classification model;
extracting features of the information text based on the word embedding layer to obtain text features corresponding to a plurality of text topics of the information text;
calculating the text features based on the bidirectional long-short term memory neural network to obtain a multi-dimensional text vector;
and classifying the multi-dimensional text vectors based on the classification layer to obtain a primary label corresponding to at least one text topic of the information text.
7. The method of claim 6, wherein the classification layer comprises a fully connected network and a classifier;
the step of classifying the multi-dimensional text vector based on the classification layer to obtain a primary label corresponding to at least one text topic of the information text comprises the following steps:
inputting the multi-dimensional text vector into the fully connected network to obtain at least one text topic carrying a feature label corresponding to the information text;
and inputting the plurality of text topics carrying the feature labels into the classifier to obtain a primary label corresponding to at least one text topic of the information text.
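Claims 6 and 7 together describe a standard pipeline: word embedding layer, bidirectional long short-term memory network, fully connected layer, then a softmax classifier. A PyTorch sketch under assumed sizes (10,000-word vocabulary, 128-dimensional embeddings, hidden size 64, 5 primary labels) follows; none of these dimensions, nor the mean-pooling step, is specified by the application.

```python
import torch
import torch.nn as nn

class TopicClassifier(nn.Module):
    """Word embedding -> BiLSTM -> fully connected -> softmax, per claims 6-7.
    All layer sizes are assumptions; the application does not specify them."""

    def __init__(self, vocab_size=10_000, embed_dim=128, hidden=64, num_labels=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # word embedding layer
        self.bilstm = nn.LSTM(embed_dim, hidden,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_labels)        # fully connected network

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)               # (batch, seq_len, embed_dim)
        out, _ = self.bilstm(x)                 # (batch, seq_len, 2 * hidden)
        feats = out.mean(dim=1)                 # pooled multi-dimensional text vector
        return torch.softmax(self.fc(feats), dim=-1)  # distribution over primary labels

model = TopicClassifier()
probs = model(torch.randint(0, 10_000, (2, 16)))  # two sample inputs of 16 tokens
primary_labels = probs.argmax(dim=-1)             # one primary label per input
```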
8. A text topic classification apparatus, comprising:
the text acquisition unit is used for acquiring an information text input by a user, wherein the information text comprises at least one text topic;
the first label obtaining unit is used for determining a primary label corresponding to the at least one text topic;
the clustering result acquisition unit is used for determining anchor words of predefined topics, clustering sentences under each text topic according to the anchor words and obtaining a plurality of clustering results under each text topic;
the probability distribution determining unit is used for determining the probability distribution of each clustering result under each text topic in any sentence segment of the information text according to the correlation between the information text and each text topic;
and the second label obtaining unit is used for taking the clustering result with the highest probability under each text topic as a secondary label under each text topic according to the probability distribution of each clustering result under each text topic in any sentence segment of the information text.
9. A text topic classification device, comprising: a memory and a processor;
the memory is used for storing programs;
the processor, configured to execute the program, and implement the steps of the text topic classification method according to any one of claims 1 to 7.
10. A storage medium having stored thereon a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the text topic classification method according to any one of the claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210820489.1A CN115186652A (en) | 2022-07-13 | 2022-07-13 | Text topic classification method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210820489.1A CN115186652A (en) | 2022-07-13 | 2022-07-13 | Text topic classification method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115186652A true CN115186652A (en) | 2022-10-14 |
Family
ID=83520321
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210820489.1A Pending CN115186652A (en) | 2022-07-13 | 2022-07-13 | Text topic classification method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115186652A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110309331B (en) | Cross-modal deep hash retrieval method based on self-supervision | |
CN110033281B (en) | Method and device for converting intelligent customer service into manual customer service | |
CN109948149B (en) | Text classification method and device | |
CN110162785B (en) | Data processing method and pronoun digestion neural network training method | |
CN109145245A (en) | Predict method, apparatus, computer equipment and the storage medium of clicking rate | |
CN113128671B (en) | Service demand dynamic prediction method and system based on multi-mode machine learning | |
CN103927550B (en) | A kind of Handwritten Numeral Recognition Method and system | |
WO2010120684A2 (en) | Method and apparatus for selecting clusterings to classify a predetermined data set | |
CN110598869B (en) | Classification method and device based on sequence model and electronic equipment | |
CN112836509A (en) | Expert system knowledge base construction method and system | |
CN111274790A (en) | Chapter-level event embedding method and device based on syntactic dependency graph | |
CN110968725B (en) | Image content description information generation method, electronic device and storage medium | |
CN111783873A (en) | Incremental naive Bayes model-based user portrait method and device | |
JP2022530447A (en) | Chinese word division method based on deep learning, equipment, storage media and computer equipment | |
CN114998602A (en) | Domain adaptive learning method and system based on low confidence sample contrast loss | |
CN112418320A (en) | Enterprise association relation identification method and device and storage medium | |
US20230368003A1 (en) | Adaptive sparse attention pattern | |
CN111694954A (en) | Image classification method and device and electronic equipment | |
CN113239697B (en) | Entity recognition model training method and device, computer equipment and storage medium | |
CN117235137B (en) | Professional information query method and device based on vector database | |
CN117034921B (en) | Prompt learning training method, device and medium based on user data | |
JP5552023B2 (en) | Clustering system, method and program | |
US20230259761A1 (en) | Transfer learning system and method for deep neural network | |
Montesinos López et al. | Reproducing Kernel Hilbert spaces regression and classification methods | |
CN112463964B (en) | Text classification and model training method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||