CN109902181B

CN109902181B - Text detection method and device

Info

Publication number: CN109902181B
Application number: CN201910163128.2A
Authority: CN
Inventors: 何泾沙; 穆鹏宇; 朱娜斐; 蔡方博; 侯自强; 李想; 韩松; 张胜凡; 葛加可
Original assignee: Beijing University of Technology
Current assignee: Beijing Yongbo Technology Co ltd
Priority date: 2019-03-04
Filing date: 2019-03-04
Publication date: 2021-04-23
Anticipated expiration: 2039-03-04
Also published as: CN109902181A

Abstract

The invention provides a text detection method and a text detection device in the technical field of text classification. After a plurality of feature attributes and a text topic of a text to be detected are obtained, a topic distribution matrix corresponding to each feature attribute and a word distribution matrix corresponding to the text topic are calculated; a joint distribution matrix is generated based on the plurality of topic distribution matrices; and category detection is performed on the text to be detected by a classification algorithm according to the joint distribution matrix and the word distribution matrix. Fusing a plurality of topic distribution matrices avoids the problem that a single document-topic matrix cannot truly reflect the variation and fuzziness of topic clusters, and thereby improves category detection of the document.

Description

Text detection method and device

Technical Field

The invention relates to the technical field of text classification, in particular to a text detection method and a text detection device.

Background

We live in an era of information explosion, in which the amount of information carried by text is growing extremely rapidly, including media reports, technical reports, books, e-mails, microblogs, comments and the like. How to mine useful topic information from such a large volume of text is a primary task at present.

In recent years, LDA (Latent Dirichlet Allocation) has been widely applied in fields such as topic mining, information analysis and knowledge services, mainly in research directions such as hot topic discovery, emerging topic detection and academic evaluation. However, because the existing three-layer LDA topic model relies on a single document-topic matrix, it cannot truly reflect the variation and fuzziness of topic clusters, so document representation performance is limited and the accuracy of text category detection is reduced.

Disclosure of Invention

In view of the above, an object of the present invention is to provide a text detection method and apparatus, so as to alleviate the technical problem that document representation performance is limited because the single document-topic matrix of the existing three-layer LDA topic model cannot truly reflect the variation and fuzziness of topic clusters.

In a first aspect, an embodiment of the present invention provides a text detection method, where the method includes: acquiring a plurality of feature attributes and a text topic of a text to be detected, wherein the feature attributes are feature attributes matched with the text topic; calculating topic distribution matrices corresponding to the plurality of feature attributes and a word distribution matrix corresponding to the text topic; generating a joint distribution matrix based on the plurality of topic distribution matrices; and performing category detection on the text to be detected by using a classification algorithm according to the joint distribution matrix and the word distribution matrix.

With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where the step of calculating topic distribution matrices corresponding to the plurality of feature attributes and a word distribution matrix corresponding to the text topic includes: calculating the number of topics corresponding to each feature attribute by using text perplexity, so as to acquire the hyper-parameter corresponding to the feature attribute; and respectively inputting the hyper-parameters corresponding to the feature attributes and a preset hyper-parameter corresponding to the text topic into a pre-established LDA model to obtain the topic distribution matrices corresponding to the feature attributes and the word distribution matrix corresponding to the text topic.

With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, where the step of calculating the number of topics corresponding to the feature attribute by using text perplexity includes: presetting a plurality of topic numbers corresponding to the feature attribute; respectively calculating a perplexity value corresponding to each topic number based on the text perplexity; and selecting the topic number corresponding to the minimum perplexity value as the number of topics corresponding to the feature attribute.

With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where after calculating a topic distribution matrix corresponding to a plurality of feature attributes and a word distribution matrix corresponding to a text topic, the method further includes: judging whether the topic distribution matrixes corresponding to the characteristic attributes and the word distribution matrixes corresponding to the text topics obey Dirichlet distribution or not; if not, recalculating the hyper-parameters corresponding to the characteristic attributes not complying with the Dirichlet distribution or reselecting the preset hyper-parameters corresponding to the text topics not complying with the Dirichlet distribution.

With reference to the third possible implementation manner of the first aspect, an embodiment of the present invention provides a fourth possible implementation manner of the first aspect, where the step of generating a joint distribution matrix based on the topic distribution matrices corresponding to the plurality of feature attributes includes: calculating, according to Gibbs sampling, the probability matrices on the topic of the topic distribution matrices corresponding to the respective feature attributes; and processing the probability matrices by using the Hadamard product and the projection residual joint vector, and fusing the processed probability matrices to generate the joint distribution matrix.

With reference to the fourth possible implementation manner of the first aspect, an embodiment of the present invention provides a fifth possible implementation manner of the first aspect, where the step of performing category detection on the text to be detected by using a classification algorithm according to the joint distribution matrix and the word distribution matrix includes: generating a feature word distribution matrix based on the joint distribution matrix and the word distribution matrix; judging whether the feature word distribution matrix obeys a multinomial distribution; and if so, inputting the feature word distribution matrix into a pre-trained text classifier so as to perform category detection on the text to be detected.

In a second aspect, an embodiment of the present invention further provides a text detection device, where the device includes: an acquisition module, configured to acquire a plurality of feature attributes and a text topic of a text to be detected, wherein the feature attributes are feature attributes matched with the text topic; a calculation module, configured to calculate topic distribution matrices corresponding to the plurality of feature attributes and a word distribution matrix corresponding to the text topic; a generation module, configured to generate a joint distribution matrix based on the plurality of topic distribution matrices; and a detection module, configured to perform category detection on the text to be detected by using a classification algorithm according to the joint distribution matrix and the word distribution matrix.

With reference to the second aspect, an embodiment of the present invention provides a first possible implementation manner of the second aspect, where the calculation module is further configured to: calculate the number of topics corresponding to each feature attribute by using text perplexity, so as to acquire the hyper-parameter corresponding to the feature attribute; and respectively input the hyper-parameters corresponding to the feature attributes and a preset hyper-parameter corresponding to the text topic into a pre-established LDA model to obtain the topic distribution matrices corresponding to the feature attributes and the word distribution matrix corresponding to the text topic.

With reference to the first possible implementation manner of the second aspect, an embodiment of the present invention provides a second possible implementation manner of the second aspect, where the calculation module is further configured to: preset a plurality of topic numbers corresponding to the feature attribute; respectively calculate a perplexity value corresponding to each topic number based on the text perplexity; and select the topic number corresponding to the minimum perplexity value as the number of topics corresponding to the feature attribute.

With reference to the first possible implementation manner of the second aspect, an embodiment of the present invention provides a third possible implementation manner of the second aspect, where after the calculating module, the apparatus further includes: the judging module is used for judging whether the topic distribution matrixes corresponding to the characteristic attributes and the word distribution matrixes corresponding to the text topics obey Dirichlet distribution or not; and the recalculating module is used for recalculating the hyper-parameters corresponding to the characteristic attributes not complying with the Dirichlet distribution or reselecting the preset hyper-parameters corresponding to the text topics not complying with the Dirichlet distribution if the judging module judges that the text topics are not complying with the Dirichlet distribution.

The embodiment of the invention has the following beneficial effects:

the text detection method and the text detection device provided by the embodiment of the invention can calculate the topic distribution matrix corresponding to a plurality of characteristic attributes and the word distribution matrix corresponding to the text topic when acquiring a plurality of characteristic attributes and text topics of a text to be detected; generating a joint distribution matrix based on the plurality of topic distribution matrices; and performing class detection on the text to be detected by using a classification algorithm according to the joint distribution matrix and the word distribution matrix. By fusing a plurality of topic distribution matrixes, the problem that the change and the fuzziness of the topic clusters cannot be truly reflected by a single document-topic matrix can be effectively avoided, and the class detection of the document is improved.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.

In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a flowchart of a text detection method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a process of detecting an online pyramid-selling text according to an embodiment of the present invention;

Fig. 3 is a flowchart of another text detection method according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of classification accuracy according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a text detection apparatus according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of another text detection apparatus according to an embodiment of the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

LDA is a document topic generation model with a three-layer structure of words, topics and documents. As a generative model, it regards each word of a document as being obtained by "selecting a topic with a certain probability and then selecting a word from that topic with a certain probability". The document-to-topic distribution is multinomial, and the topic-to-word distribution is also multinomial. LDA is an unsupervised machine learning technique that can be used to identify latent topic information in large-scale document sets or corpora. It adopts the bag-of-words approach, treating each document as a word frequency vector, so that text information is converted into numerical information that is easy to model.
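The generative process described above can be illustrated with a minimal sketch. The function name `generate_document`, the toy vocabulary and the probability values are hypothetical and chosen only for illustration; they are not taken from the embodiment.

```python
import random

def generate_document(doc_topic, topic_word, vocabulary, length, seed=0):
    """Sample a document word by word from given LDA-style parameters."""
    rng = random.Random(seed)
    topics = range(len(doc_topic))
    words = []
    for _ in range(length):
        # Select a topic according to the document's topic distribution.
        z = rng.choices(topics, weights=doc_topic, k=1)[0]
        # Select a word according to that topic's word distribution.
        w = rng.choices(vocabulary, weights=topic_word[z], k=1)[0]
        words.append(w)
    return words

# Hypothetical toy parameters: 2 topics over a 4-word vocabulary.
vocabulary = ["interest", "return", "level", "recruit"]
doc_topic = [0.7, 0.3]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 only emits "interest" and "return"
    [0.0, 0.0, 0.5, 0.5],  # topic 1 only emits "level" and "recruit"
]
document = generate_document(doc_topic, topic_word, vocabulary, length=8)
```

Because the order of the sampled words is discarded under the bag-of-words assumption, only the word counts of `document` would be used for modeling.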

At present, because the three-layer LDA topic model relies on a single document-topic matrix, it cannot truly reflect the variation and fuzziness of topic clusters, so document representation performance is limited. Based on this, the text detection method and apparatus provided by the embodiments of the present invention can alleviate the above technical problem.

For the convenience of understanding the embodiment, a detailed description will be given to a text detection method disclosed in the embodiment of the present invention.

The first embodiment is as follows:

An embodiment of the present invention provides a text detection method. As shown in the flowchart of Fig. 1, the method includes the following steps:

Step S102, acquiring a plurality of feature attributes and a text topic of a text to be detected, wherein the feature attributes are feature attributes matched with the text topic;

In a specific implementation, the text to be detected is an online pyramid-selling text. The topic content of such texts is varied and often accompanies current hot topics, such as virtual currency, charitable mutual aid, Internet-plus finance and major national strategic deployments. In the embodiment of the invention, "pyramid selling" is taken as the text topic for mining online pyramid-selling texts. By capturing and analyzing online pyramid-selling text data, two feature attributes related to the "pyramid selling" text topic are obtained, namely "high interest" and "hierarchy remuneration".

Step S104, calculating topic distribution matrices corresponding to the plurality of feature attributes and a word distribution matrix corresponding to the text topic;

Step S106, generating a joint distribution matrix based on the plurality of topic distribution matrices;

Step S108, performing category detection on the text to be detected by using a classification algorithm according to the joint distribution matrix and the word distribution matrix.

In actual use, the online pyramid-selling text needs to be preprocessed: each carriage return in the text is treated as the end of a paragraph to obtain a paragraph corpus, and the paragraph is taken as the minimum processing unit for detection. Fig. 2 shows a schematic diagram of the detection process for an online pyramid-selling text. First, Gibbs sampling is used to estimate the Dirichlet prior hyper-parameters α₁ and α₂ corresponding to the two feature attributes "high interest" and "hierarchy remuneration", and the Dirichlet prior hyper-parameter β corresponding to the "pyramid selling" text topic is set. Then, the Dirichlet prior hyper-parameters α₁, α₂ and β are respectively input into a pre-trained LDA model to obtain the topic distribution matrices δ and θ corresponding to the two feature attributes "high interest" and "hierarchy remuneration" and the word distribution matrix φ corresponding to the "pyramid selling" text topic. Next, the topic distribution matrices corresponding to the two feature attributes are fused to generate a joint distribution matrix z. Finally, a feature word distribution matrix w is generated according to the joint distribution matrix and the word distribution matrix, and the feature word distribution matrix is input into a pre-trained text classifier to detect whether the online pyramid-selling text belongs to the "pyramid selling" text topic.
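The preprocessing step mentioned above, in which each line break closes a paragraph, could be sketched as follows. The function name `split_paragraphs` and the sample string are hypothetical, and dropping empty paragraphs is an assumption not stated in the embodiment.

```python
def split_paragraphs(text):
    """Split raw text into non-empty paragraphs at line breaks."""
    # str.splitlines treats "\n", "\r\n" and "\r" as line boundaries.
    # Assumption: blank lines carry no content and are discarded.
    return [line.strip() for line in text.splitlines() if line.strip()]

# Hypothetical sample text with mixed line endings and one blank line.
raw = "High returns guaranteed.\r\nRecruit members to earn more.\n\nJoin today."
paragraphs = split_paragraphs(raw)
```

Each element of `paragraphs` would then serve as one minimum processing unit for the later steps.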

The text detection method provided by the embodiment of the invention can calculate the topic distribution matrix corresponding to a plurality of characteristic attributes and the word distribution matrix corresponding to the text topic when acquiring a plurality of characteristic attributes and text topics of the text to be detected; generating a joint distribution matrix based on the plurality of topic distribution matrices; and performing class detection on the text to be detected by using a classification algorithm according to the joint distribution matrix and the word distribution matrix. By fusing a plurality of topic distribution matrixes, the problem that the change and the fuzziness of the topic clusters cannot be truly reflected by a single document-topic matrix can be effectively avoided, and the classification detection of the documents is improved.

Further, in order to better perform category detection on the text to be detected, topic distribution matrices and a word distribution matrix of better quality need to be obtained. Therefore, step S104 of calculating the topic distribution matrices corresponding to the plurality of feature attributes and the word distribution matrix corresponding to the text topic can be implemented by step 11 and step 12:

Step 11, calculating the number of topics corresponding to each feature attribute by using text perplexity, so as to acquire the hyper-parameter corresponding to the feature attribute;

In the embodiment of the invention, text perplexity is used to determine the optimal number of topics for the two feature attributes "high interest" and "hierarchy remuneration". Determining the optimal number of topics for the "high interest" feature attribute is taken as an example. The number of topics corresponding to the "high interest" feature attribute is set to range from 10 to 100; with a step of 10, the ten topic numbers 10, 20, 30, …, 100 are each substituted into the text perplexity formula to calculate the corresponding perplexity value, and the topic number with the lowest of the ten perplexity values is selected as the optimal number of topics for the "high interest" feature attribute. The optimal number of topics for the "hierarchy remuneration" feature attribute is obtained in the same way and is not described again here.
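The selection procedure above can be sketched as follows. The embodiment does not reproduce its perplexity formula, so the sketch assumes the standard corpus perplexity, the exponential of the negative log-likelihood per token. The names `perplexity`, `select_topic_number` and `toy_perplexity` are hypothetical, and `toy_perplexity` is a made-up curve standing in for the perplexity of a trained model.

```python
import math

def perplexity(doc_log_likelihoods, doc_token_counts):
    """Standard corpus perplexity: exp(-sum(log p(doc)) / total tokens)."""
    total_tokens = sum(doc_token_counts)
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)

def select_topic_number(candidates, perplexity_of):
    """Return the candidate topic number with the lowest perplexity."""
    return min(candidates, key=perplexity_of)

def toy_perplexity(k):
    # Hypothetical stand-in for "train a model with k topics and
    # evaluate its perplexity"; the minimum is placed at k = 40.
    return 900.0 + (k - 40) ** 2

# Candidate topic numbers 10, 20, ..., 100, as in the text.
candidates = list(range(10, 101, 10))
best_k = select_topic_number(candidates, toy_perplexity)
```

In practice `toy_perplexity` would be replaced by a function that fits an LDA model with `k` topics on the paragraph corpus and evaluates `perplexity` on it.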

Preferably, the hyper-parameters corresponding to the two feature attributes "high interest" and "hierarchy remuneration" are calculated by the formula

$\alpha = 50 / k$

where α represents the hyper-parameter and k represents the number of topics. Table 1 shows the values of α for different topic numbers:

TABLE 1

k	10	20	50	100	200	300	400	500
α	5	2.5	1	0.5	0.25	0.17	0.13	0.1

It can be seen that the value of the hyper-parameter α is inversely proportional to the number of topics and varies with k, so the optimal hyper-parameter is obtained once the optimal number of topics of a feature attribute has been obtained.
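The values in Table 1 are consistent with the relation α = 50/k rounded to two decimal places. The following sketch reproduces the table under that assumption; the function name `alpha_for` is hypothetical, and half-up rounding is assumed because it matches the entry 0.13 for k = 400.

```python
from decimal import Decimal, ROUND_HALF_UP

def alpha_for(k):
    """Return 50 / k rounded half-up to two decimals (assumed relation)."""
    value = Decimal(50) / Decimal(k)
    # Decimal arithmetic avoids binary rounding of 0.125 down to 0.12.
    return float(value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))

topic_numbers = [10, 20, 50, 100, 200, 300, 400, 500]
alphas = [alpha_for(k) for k in topic_numbers]
```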

Step 12, respectively inputting the hyper-parameters corresponding to the feature attributes and a preset hyper-parameter corresponding to the text topic into a pre-established LDA model to obtain the topic distribution matrices corresponding to the feature attributes and the word distribution matrix corresponding to the text topic.

In a specific implementation, the Dirichlet prior hyper-parameter β corresponding to the "pyramid selling" text topic is essentially a fixed value and is uniformly set to β = 0.01. The Dirichlet prior hyper-parameters α₁ and α₂ corresponding to the two feature attributes "high interest" and "hierarchy remuneration" and the Dirichlet prior hyper-parameter β corresponding to the "pyramid selling" text topic are respectively input into a pre-established LDA model, yielding the topic distribution matrices corresponding to the two feature attributes and the word distribution matrix corresponding to the "pyramid selling" text topic.

Further, based on the above process of the text detection method, fig. 3 shows a flowchart of another text detection method, which includes the following steps:

Step S302, acquiring a plurality of feature attributes and a text topic of a text to be detected, wherein the feature attributes are feature attributes matched with the text topic;

Step S304, calculating topic distribution matrices corresponding to the plurality of feature attributes and a word distribution matrix corresponding to the text topic;

Step S306, judging whether the topic distribution matrices corresponding to the feature attributes and the word distribution matrix corresponding to the text topic obey a Dirichlet distribution;

Step S308, if not, recalculating the hyper-parameter corresponding to any feature attribute that does not obey the Dirichlet distribution, or reselecting the preset hyper-parameter corresponding to the text topic that does not obey the Dirichlet distribution.

In a specific implementation, after the topic distribution matrices and the word distribution matrix are obtained, it is necessary to judge whether these matrices obey a Dirichlet distribution. When they do, category detection is performed on the online pyramid-selling text. If the topic distribution matrix corresponding to "high interest" and/or "hierarchy remuneration" is judged not to obey a Dirichlet distribution, the hyper-parameter corresponding to the "high interest" and/or "hierarchy remuneration" feature attribute needs to be recalculated; if the word distribution matrix corresponding to "pyramid selling" is judged not to obey a Dirichlet distribution, the hyper-parameter corresponding to "pyramid selling" needs to be reset.

Step S310, generating a joint distribution matrix based on the plurality of topic distribution matrices;

Step S312, performing category detection on the text to be detected by using a classification algorithm according to the joint distribution matrix and the word distribution matrix.

In practical application, in order to overcome the technical problem that a single document-topic matrix cannot truly reflect the change and ambiguity of a topic cluster, so that the document representation performance is limited, in the embodiment of the present invention, topic distribution matrices corresponding to a plurality of feature attributes are fused to generate a joint distribution matrix, and therefore, step S106 generates the joint distribution matrix based on the topic distribution matrices corresponding to the plurality of feature attributes, which can be implemented by step 21 and step 22:

Step 21, calculating, according to Gibbs sampling, the probability matrices on the topic of the topic distribution matrices corresponding to the respective feature attributes;

Step 22, processing the probability matrices by using the Hadamard product and the projection residual joint vector, and fusing the processed probability matrices to generate the joint distribution matrix.

Specifically, the probability matrices $P(z \mid \delta_i)$ and $P(z \mid \theta_j)$ of the i-th "high interest" topic distribution matrix and the j-th "hierarchy remuneration" topic distribution matrix on the "pyramid selling" text topic z are calculated according to Gibbs sampling. The probability matrices are then generalized using their Hadamard product.

At the same time, a projection residual joint vector is introduced to measure the degree of difference between the two features: $1 - (P(z \mid \delta_i))^{1/2} \cdot (P(z \mid \theta_j))^{1/2}$. The purpose of processing the probability matrices with the Hadamard product and the projection residual joint vector is to eliminate feature words common to the topic distribution matrices of the two feature attributes, so that the joint distribution is more representative. For example, when the same feature word "income" appears under both "high interest" and "hierarchy remuneration", "income" cannot by itself represent either feature attribute, so such shared feature words are removed by this method.

After the probability matrices are processed, the topic distribution matrices can be fused to obtain a joint distribution matrix $Z_{ij}$, where $Z_{ij}$ denotes the joint distribution matrix generated from the i-th "high interest" topic distribution matrix and the j-th "hierarchy remuneration" topic distribution matrix.
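The two quantities used in this step, the Hadamard product and the projection residual, can be illustrated on toy probability vectors. This is a sketch under stated assumptions: the dot in the residual expression is read as an inner product over topics, which makes the residual one minus the Bhattacharyya coefficient, and the source may intend a different reading. The function names and the vectors are hypothetical, and the final fusion into the joint distribution matrix is not shown because its formula is not reproduced in the text.

```python
import math

def hadamard(p, q):
    """Element-wise (Hadamard) product of two equal-length vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return [a * b for a, b in zip(p, q)]

def projection_residual(p, q):
    """1 - sum_k sqrt(p_k) * sqrt(q_k); the inner-product reading is assumed."""
    return 1.0 - sum(math.sqrt(a) * math.sqrt(b) for a, b in zip(p, q))

# Hypothetical distributions of the two feature attributes over 3 topics.
p_delta = [0.5, 0.5, 0.0]  # "high interest"
p_theta = [0.0, 0.5, 0.5]  # "hierarchy remuneration"

shared = hadamard(p_delta, p_theta)               # non-zero only where both have mass
residual = projection_residual(p_delta, p_theta)  # 0 if identical, 1 if disjoint
```

Under this reading, the Hadamard product highlights what the two attributes share, while the residual grows as the two distributions diverge.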

Further, the step of performing category detection on the text to be detected by using a classification algorithm according to the joint distribution matrix and the word distribution matrix includes: generating a feature word distribution matrix based on the joint distribution matrix and the word distribution matrix; judging whether the feature word distribution matrix obeys a multinomial distribution; and if so, inputting the feature word distribution matrix into a pre-trained text classifier so as to perform category detection on the text to be detected. Specifically, to detect the category of online pyramid-selling texts accurately, a suitable text classifier needs to be selected. In the embodiment of the present invention, a cross experiment is used to compare the classification accuracy of different classifiers on the feature word distribution matrix. Fig. 4 shows a schematic diagram of classification accuracy. As shown in Fig. 4, four widely used classifiers are selected in this embodiment: k-Nearest Neighbor (kNN), Decision Tree (DT), Support Vector Machine (SVM) and FastText. The feature word distribution matrix is input into each of the pre-trained classifiers to obtain the classification accuracy of each classifier on online pyramid-selling texts. As shown in Fig. 4, FastText is superior to the other classifiers in classification accuracy: its accuracy reaches 86.25%, which is 1.61% higher than that of the SVM, and it is also superior to the SVM in time and space complexity. Therefore, FastText is adopted as the classifier, which improves the detection efficiency for online pyramid-selling texts in both time and space.
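The text does not state how the feature word distribution matrix is formed from the joint distribution matrix and the word distribution matrix. One common construction in LDA-style models, assumed here purely for illustration, is the matrix product of a document-topic matrix with a topic-word matrix. The sketch also checks that each resulting row is a valid probability vector, which is a necessary condition for treating it as the parameter of a multinomial distribution. All names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), given as nested lists."""
    return [
        [sum(row[t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for row in a
    ]

def rows_are_probabilities(matrix, tol=1e-9):
    """Check that every row is non-negative and sums to 1 within tol."""
    return all(min(row) >= 0 and abs(sum(row) - 1.0) <= tol for row in matrix)

# Hypothetical joint distribution matrix (2 paragraphs x 2 topics).
joint = [[0.75, 0.25], [0.5, 0.5]]
# Hypothetical word distribution matrix (2 topics x 3 words).
word_dist = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]

# Assumed construction: feature word matrix = joint x word distribution.
feature_words = matmul(joint, word_dist)
is_valid = rows_are_probabilities(feature_words)
```

Rows that pass the check could then be passed as feature vectors to whichever pre-trained classifier is chosen.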

The second embodiment is as follows:

On the basis of the foregoing embodiment, an embodiment of the present invention further provides a text detection apparatus. As shown in Fig. 5, the apparatus includes:

an obtaining module 502, configured to obtain a plurality of feature attributes and a text topic of a text to be detected, where the feature attributes are feature attributes matched with the text topic;

a calculating module 504, configured to calculate topic distribution matrices corresponding to the plurality of feature attributes and a word distribution matrix corresponding to the text topic;

a generating module 506, configured to generate a joint distribution matrix based on the plurality of topic distribution matrices;

and a detection module 508, configured to perform category detection on the text to be detected by using a classification algorithm according to the joint distribution matrix and the word distribution matrix.

Further, the calculating module 504 is further configured to: calculate the number of topics corresponding to each feature attribute by using text perplexity, so as to acquire the hyper-parameter corresponding to the feature attribute; and respectively input the hyper-parameters corresponding to the feature attributes and a preset hyper-parameter corresponding to the text topic into a pre-established LDA model to obtain the topic distribution matrices corresponding to the feature attributes and the word distribution matrix corresponding to the text topic.

In actual use, the calculating module 504 is further configured to: preset a plurality of topic numbers corresponding to the feature attribute; respectively calculate a perplexity value corresponding to each topic number based on the text perplexity; and select the topic number corresponding to the minimum perplexity value as the number of topics corresponding to the feature attribute.

On the basis of fig. 5, fig. 6 shows a schematic structural diagram of another text detection apparatus, which further includes, after the calculation module:

a determining module 602, configured to determine whether a topic distribution matrix corresponding to the multiple feature attributes and a word distribution matrix corresponding to the text topic obey dirichlet distribution;

and a recalculating module 604, configured to recalculate the hyper-parameter corresponding to the feature attribute not subject to the dirichlet distribution or reselect the preset hyper-parameter corresponding to the text topic not subject to the dirichlet distribution if the determining module determines that the text topic is not subject to the dirichlet distribution.

The text detection device provided by the embodiment of the invention has the same technical characteristics as the text detection method provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly: a connection may be, for example, fixed, removable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. Those skilled in the art can understand the specific meaning of the above terms in the present invention according to the specific case.

In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on those shown in the drawings. They are used only for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; thus, they should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

Finally, it should be noted that the foregoing embodiments are merely illustrative of the present invention and not restrictive, and the scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that, within the technical scope of the present disclosure, any person skilled in the art can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for detecting text, the method comprising:

acquiring a plurality of characteristic attributes and a text topic of a text to be detected, wherein the characteristic attributes are characteristic attributes matched with the text topic;

calculating a plurality of topic distribution matrices corresponding to the characteristic attributes and a word distribution matrix corresponding to the text topic;

generating a joint distribution matrix based on a plurality of the topic distribution matrices;

performing category detection on the text to be detected by using a classification algorithm according to the joint distribution matrix and the word distribution matrix;

wherein the step of generating a joint distribution matrix based on the topic distribution matrices corresponding to the plurality of feature attributes comprises:

calculating, according to Gibbs sampling, probability matrices of the plurality of topic distribution matrices corresponding to the characteristic attributes over the topics;

processing the probability matrix by utilizing a Hadamard sum and a projection residual joint vector, and fusing the processed probability matrix to generate the joint distribution matrix;

the step of utilizing a classification algorithm to perform class detection on the text to be detected according to the joint distribution matrix and the word distribution matrix comprises the following steps:

generating a feature word distribution matrix based on the joint distribution matrix and the word distribution matrix;

judging whether the feature word distribution matrix obeys a multinomial distribution;

and if so, inputting the feature word distribution matrix into a pre-trained text classifier so as to perform class detection on the text to be detected.

2. The method of claim 1, wherein the step of calculating a plurality of topic distribution matrices corresponding to the feature attributes and a plurality of word distribution matrices corresponding to the text topics comprises:

calculating the number of topics corresponding to the characteristic attribute by using text perplexity to acquire a hyper-parameter corresponding to the characteristic attribute;

and respectively inputting the hyper-parameters corresponding to the characteristic attributes and preset hyper-parameters corresponding to the text topics into a pre-established LDA model to obtain a topic distribution matrix corresponding to the characteristic attributes and a word distribution matrix corresponding to the text topics.

3. The method according to claim 2, wherein the step of calculating the number of topics corresponding to the characteristic attribute by using text perplexity comprises:

presetting a plurality of topic numbers corresponding to the characteristic attribute;

respectively calculating a perplexity value corresponding to each topic number based on the text perplexity;

and selecting the topic number corresponding to the minimum perplexity value as the topic number corresponding to the characteristic attribute.

4. The method of claim 2, wherein after calculating a plurality of topic distribution matrices corresponding to the feature attributes and a plurality of word distribution matrices corresponding to the text topics, the method further comprises:

judging whether the topic distribution matrixes corresponding to the characteristic attributes and the word distribution matrix corresponding to the text topic obey Dirichlet distribution or not;

if not, recalculating the hyper-parameter corresponding to the characteristic attribute that does not obey the Dirichlet distribution, or reselecting the preset hyper-parameter corresponding to the text topic that does not obey the Dirichlet distribution.

5. An apparatus for detecting text, the apparatus comprising:

the acquisition module is used for acquiring a plurality of characteristic attributes and text topics of a text to be detected, wherein the characteristic attributes are characteristic attributes matched with the text topics;

the calculation module is used for calculating a plurality of topic distribution matrices corresponding to the characteristic attributes and a word distribution matrix corresponding to the text topic;

a generating module for generating a joint distribution matrix based on a plurality of the topic distribution matrices;

the detection module is used for carrying out category detection on the text to be detected by utilizing a classification algorithm according to the joint distribution matrix and the word distribution matrix;

the generating module is further used for calculating, according to Gibbs sampling, probability matrices of the topic distribution matrices corresponding to the characteristic attributes over the topics;

the detection module is further used for generating a feature word distribution matrix based on the joint distribution matrix and the word distribution matrix;

6. The apparatus of claim 5, wherein the calculation module is further configured to:

7. The apparatus of claim 6, wherein the calculation module is further configured to:

8. The apparatus of claim 6, wherein after the calculation module, the apparatus further comprises:

the judging module is used for judging whether the topic distribution matrixes corresponding to the characteristic attributes and the word distribution matrix corresponding to the text topic obey Dirichlet distribution or not;

and the recalculating module is used for, if the judging module judges that the Dirichlet distribution is not obeyed, recalculating the hyper-parameter corresponding to the characteristic attribute that does not obey the Dirichlet distribution, or reselecting the preset hyper-parameter corresponding to the text topic that does not obey the Dirichlet distribution.