CN111651598A

CN111651598A - Spam text auditing device and method through center vector similarity matching

Info

Publication number: CN111651598A
Application number: CN202010469767.4A
Authority: CN
Inventors: 陈晓峰; 麻沁甜; 刘星辰
Original assignee: Shanghai Bochi Information Technology Co ltd
Current assignee: Shanghai Bochi Information Technology Co ltd
Priority date: 2020-05-28
Filing date: 2020-05-28
Publication date: 2020-09-11

Abstract

A device and a method for auditing junk texts through central vector similarity matching, comprising the following steps: establishing a garbage sample text set and a normal sample text set; calculating the central vector of each of the two sample sets; and preprocessing the text to be recognized for classification. The text to be detected, represented by its feature words, is then classified by a classifier trained on the sample texts. Whether the text is junk text is determined according to a preset standard, and preventive measures can then be taken against text judged to be junk, so that its adverse effects on people's daily lives are avoided.

Description

Spam text auditing device and method through center vector similarity matching

Technical Field

The invention relates to the technical field of text auditing, in particular to a junk text auditing device and method based on central vector similarity matching.

Background

With the development of Internet technology, information security has become increasingly important. Every line of business holds information that is critical to it; for example, in medical insurance auditing, the insured person's information is critical to the party providing the insurance service, and risk controls are needed to keep it from leaking.

Besides ordinary commercial advertisements, spam texts on the network also carry content such as reactionary propaganda and fraud. The spread of such content not only disrupts people's daily lives but also endangers the safety and stability of society. It is therefore necessary to identify spam texts so that they can be filtered or deleted.

Disclosure of Invention

To address the defects of the prior art, the invention provides a device and a method for auditing junk texts through central vector similarity matching, which determine whether a text on a network is junk text so that it can be filtered, deleted or otherwise handled.

In order to achieve the purpose, the invention is realized by the following technical scheme:

A device and a method for auditing junk texts through central vector similarity matching, comprising the following steps:

step S1: establishing a group of garbage sample text sets and a group of normal sample text sets;

step S2: calculating a central vector for the garbage sample text set and a central vector for the normal sample text set, each central vector being the arithmetic mean of the feature vectors of the texts in the corresponding set;

step S3: preprocessing each newly collected text to be detected (word segmentation, feature extraction and the like) and representing it as a feature vector;

step S4: calculating the similarity between the feature vector of the text to be detected and the central vector of the garbage sample text set, and the similarity between that feature vector and the central vector of the normal sample text set; comparing the two similarities and assigning the text to be detected to the category with the larger similarity;

step S5: calculating a feedback value for each text judged to be junk text, and adding the text to a feedback text set if the feedback value meets a given condition;

step S6: regenerating the category feature word list for the classifier according to the feedback information, updating the intermediate statistics of the sample sets, recalculating the central vector of the garbage sample text set, and revising the classifier.
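Steps S2 and S4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say which similarity measure is used, so cosine similarity is assumed here, and the function names `center_vector`, `cosine_similarity` and `classify` are hypothetical.

```python
import math


def center_vector(vectors):
    """Arithmetic mean of equal-length feature vectors (step S2)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine similarity; an ASSUMED measure, not named in the text."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def classify(x, garbage_center, normal_center):
    """Assign x to the category whose central vector is more similar (step S4)."""
    s_garbage = cosine_similarity(x, garbage_center)
    s_normal = cosine_similarity(x, normal_center)
    return "garbage" if s_garbage > s_normal else "normal"


# Toy feature vectors, invented for illustration only.
garbage_center = center_vector([[1, 0, 0], [3, 0, 2]])   # [2.0, 0.0, 1.0]
normal_center = center_vector([[0, 2, 0], [0, 4, 2]])    # [0.0, 3.0, 1.0]
label = classify([1, 0, 0.5], garbage_center, normal_center)
```

Ties are resolved in favor of the normal category here, which is a design choice of this sketch rather than something the text prescribes.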

Preferably, in step S3, the preprocessing (word segmentation, feature extraction and the like) of the newly collected text to be detected includes:

step S3.1: performing feature extraction on the text using a pre-constructed word segmentation dictionary, stop word list, keyword list and variant word list to obtain a plurality of text features;

step S3.2: splitting the character string to be segmented into substrings according to a word segmentation rule, matching each substring in turn against the entries of the given word segmentation dictionary, and treating a substring as a word if the match succeeds.

Preferably, the word segmentation rule in step S3.2 includes, but is not limited to, the forward maximum matching algorithm, the reverse maximum matching algorithm, the optimal matching method, and the segmentation-mark method.
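As an illustration of dictionary-based segmentation in step S3.2, the following is a minimal sketch of forward maximum matching. The function name `forward_max_match`, the `max_len` parameter and the toy dictionary are assumptions made for this example; unmatched single characters are emitted as one-character tokens, which is a common convention but is not stated in the text.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text left to right, always taking the longest dictionary match."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is accepted even if it is not in the dictionary,
            # so the loop always makes progress.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary.
dictionary = {"垃圾", "文本", "垃圾文本", "审核"}
tokens = forward_max_match("垃圾文本审核", dictionary)  # ['垃圾文本', '审核']
```

Reverse maximum matching follows the same idea but scans from the end of the string toward the beginning.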

Preferably, the step S2 specifically includes the following steps:

step S2.1: letting C₁ denote the normal sample text set and C₂ denote the garbage sample text set, performing word segmentation on all texts in C₁ and C₂ and removing stop words;

step S2.2: calculating the prior probability P(C₂) of the garbage sample text set and the information gain value of each segmented word W_t; sorting the segmented words W_t by information gain value in descending order and selecting the top n as the feature words t_i; and calculating the weight w_i of each feature word according to TFIDF;

step S2.3: for each feature word t_i in the garbage sample text set, calculating the arithmetic mean of its weights w_i over all text feature vectors in the garbage sample text set,

$$\bar{w}_i = \frac{1}{N_2}\sum_{k=1}^{N_2} w_{i,k}$$

where w_{i,k} is the weight of t_i in the k-th text and N₂ is the number of texts in C₂, and using this mean as the weight of the feature word in the category feature vector;

step S2.4: constructing the central vector of the garbage sample text set C₂ from these mean weights;

step S2.5: generating an initial classifier C from the prior probabilities P(C₂) and P(C₁) of the garbage and normal sample text sets and the conditional probabilities P(t_i|C₂) and P(t_i|C₁) of each feature word.
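The feature selection in step S2.2 can be sketched as follows. The exact information gain formula is not reproduced in this text, so the sketch assumes the standard document-frequency form IG(t) = H(C) − P(t)·H(C|t) − P(t̄)·H(C|t̄), with each text represented as a set of segmented words; `information_gain` and `select_features` are hypothetical names.

```python
import math


def information_gain(term, docs_by_class):
    """Standard information gain of a term (ASSUMED form).

    docs_by_class maps a class label to a list of texts, each text a set of words.
    """
    n_total = sum(len(docs) for docs in docs_by_class.values())
    n_t = sum(1 for docs in docs_by_class.values() for d in docs if term in d)
    p_t = n_t / n_total
    p_not_t = 1.0 - p_t

    def plogp(p):
        return p * math.log2(p) if p > 0 else 0.0

    ig = 0.0
    for docs in docs_by_class.values():
        n_j = len(docs)
        n_jt = sum(1 for d in docs if term in d)
        ig -= plogp(n_j / n_total)                      # class entropy H(C)
        if n_t > 0:
            ig += p_t * plogp(n_jt / n_t)               # texts containing the term
        if n_total - n_t > 0:
            ig += p_not_t * plogp((n_j - n_jt) / (n_total - n_t))  # texts without it
    return ig


def select_features(docs_by_class, n):
    """Return the n words with the largest information gain (ties broken alphabetically)."""
    vocab = set().union(*(d for docs in docs_by_class.values() for d in docs))
    ranked = sorted(vocab, key=lambda t: (-information_gain(t, docs_by_class), t))
    return ranked[:n]


# Toy corpus, invented for illustration only.
docs = {
    "normal": [{"meeting", "report"}, {"report", "lunch"}],
    "garbage": [{"prize", "click"}, {"prize", "report"}],
}
top = select_features(docs, 1)  # ['prize']
```

In this toy corpus the word "prize" appears in every garbage text and in no normal text, so it separates the two classes perfectly and receives the maximum gain of 1 bit.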

Preferably, the step S3 specifically includes:

performing word segmentation and feature extraction on the text to be detected to generate a feature vector X_u representing it, and calculating the conditional probability that the text to be detected belongs to each of the two sample classes according to the discriminant function of a naive Bayes classifier, as shown in the following formula:

wherein i is 1 or 2.
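The naive Bayes calculation can be illustrated with a short sketch. The discriminant formula itself is not reproduced in this text, so the likelihood below is an assumed form: P(X|C_i) is taken as the product of P(t_k|C_i) over the feature words that are present in the text, and the posterior follows from Bayes' theorem. The function name `posterior` and the numbers are hypothetical.

```python
def posterior(present, priors, cond_probs):
    """Posterior P(C_i | X) for each class.

    present    : list of 0/1 flags, 1 if the k-th feature word occurs in the text
    priors     : dict class -> P(C_i)
    cond_probs : dict class -> list of P(t_k | C_i), aligned with `present`
    """
    likelihood = {}
    for label, probs in cond_probs.items():
        p = 1.0
        for flag, p_tk in zip(present, probs):
            if flag:                     # ASSUMED: only present words contribute
                p *= p_tk
        likelihood[label] = p
    # Total probability: P(X) = sum over classes of P(X | C_i) * P(C_i)
    evidence = sum(likelihood[c] * priors[c] for c in priors)
    if evidence == 0.0:
        return dict(priors)              # no evidence at all: fall back to the priors
    return {c: likelihood[c] * priors[c] / evidence for c in priors}


# Invented numbers for illustration only.
priors = {"normal": 0.5, "garbage": 0.5}
cond_probs = {"normal": [0.1, 0.5], "garbage": [0.8, 0.5]}
post = posterior([1, 0], priors, cond_probs)
```

With these numbers the first feature word is eight times more likely in garbage texts than in normal ones, so the posterior for the garbage class is 8/9. A production classifier would normally work with log probabilities and smoothing to avoid underflow and zero counts.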

Preferably, the step S4 specifically includes:

calculating the similarity between the feature vector X_u of the text to be detected and the central vector X_c of the garbage sample text set, as shown in the following formula:

calculating the probability value P(C₂|X_u) that the text to be detected is junk text, that is, the spam attribute tendency value score_spam(X_u) of the text to be detected, as shown in the following formula:

Preferably, the step S5 specifically includes:

calculating the spam attribute tendency value of each text to be detected, and adding the text to be detected to the feedback text set if the following formula holds.

The invention provides a device and a method for auditing junk texts through central vector similarity matching, with the following beneficial effects: a new category feature word list is regenerated for the classifier according to the feedback information, the intermediate statistics of the sample sets are updated, the central vector of the garbage sample text set is recalculated, and the classifier is revised, so that the prior probabilities of the garbage and normal sample text sets and the conditional probability of each feature move closer to those of an ideal model, thereby improving system performance.

Drawings

To illustrate the technical solutions of the present invention more clearly, the drawings used in the description are briefly introduced below.

FIG. 1 is a flow chart of the steps of the present invention;

FIG. 2 is a block diagram of an embodiment of the present invention;

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings.

As shown in FIGS. 1-2, a junk text auditing device and method through central vector similarity matching includes the following steps:

step S1: establishing a garbage sample text set and a normal sample text set; in this step, the garbage sample texts and the normal sample texts may be screened from the sample texts by an editor and then labeled manually; C₁ denotes the normal sample text set and C₂ denotes the garbage sample text set;

step S2: calculating a central vector for the garbage sample text set and a central vector for the normal sample text set, each being the arithmetic mean of the feature vectors of the texts in the corresponding set; the specific steps are as follows:

calculating the prior probabilities of the garbage sample texts and the normal sample texts; the probability P(C_j) that a sample text is a garbage sample or a normal sample is calculated as follows:

$$P(C_j) = \frac{N_j}{N}$$

where j is 1 or 2, N_j is the number of sample texts belonging to C_j, and N is the total number of sample texts.

The probability that a sample text contains the segmented word t is calculated as follows:

$$P(t) = \frac{N_t}{N}$$

where N_t is the number of sample texts containing the segmented word t.

The probability that a sample text does not contain the segmented word t is calculated as follows:

$$P(\bar{t}) = 1 - \frac{N_t}{N}$$

The conditional probability that a sample text containing the segmented word t belongs to the sample text set C_j is calculated as follows:

$$P(C_j \mid t) = \frac{N(C_j, t)}{N_t}$$

where N(C_j, t) is the number of sample texts in C_j that contain the segmented word t.

The information gain value IG(t) of a segmented word W_t is calculated as follows:

The segmented words W_t are sorted by information gain value IG(t) in descending order, and the top n are selected as the feature words t_i.

The weight w_i of each feature word t_i is calculated according to TFIDF as follows:

$$w_i = P(t) \times N_t$$

For each feature word t_i in the sample text set, the arithmetic mean of its weights w_i over the text feature vectors in the sample text set is calculated and used as the weight of the feature word in the category feature vector:

$$\bar{w}_i = \frac{1}{N_j}\sum_{k=1}^{N_j} w_{i,k}$$

where w_{i,k} is the weight of t_i in the k-th text of C_j; the central vector of the sample text set C_j is then constructed from these mean weights;

step S3: preprocessing each newly collected text to be detected (word segmentation, feature extraction and the like) and representing it as a feature vector; specifically, the text to be detected is segmented using the forward maximum matching algorithm, the reverse maximum matching algorithm, the optimal matching method or the segmentation-mark method; each segmented word and its part of speech are recorded; nouns, verbs and adjectives are kept; and meaningless words are removed using the stop word list, yielding the preprocessed text to be detected;

step S4: calculating the similarity between the feature vector of the text to be detected and the central vector of the garbage sample text set and the central vector of the normal sample text set;

calculating the probability P(X|C_j) of the feature vector X given the sample text set C_j, as follows:

where m_i indicates whether the text to be detected contains the feature word t_i: m_i is 1 if the feature vector of the text to be detected contains t_i, and 0 otherwise;

calculating P(X) by the following formula:

$$P(X) = P(X \mid C_1)\,P(C_1) + P(X \mid C_2)\,P(C_2)$$

calculating P(C_j|X) as follows:

$$P(C_j \mid X) = \frac{P(X \mid C_j)\,P(C_j)}{P(X)}$$

step S5: comparison P (C)₁I X) and P (C)₂And | X), the text to be detected belongs to a sample text set with higher probability.

step S6: calculating a feedback value for each text judged to be junk text, and adding the text to a feedback text set if the feedback value meets a given condition;

calculating the probability value P(C₂|X_u) that the text to be detected is junk text, that is, the spam attribute tendency value score_spam(X_u) of the text to be detected, as follows:

step S7: regenerating the category feature word list for the classifier according to the feedback information, updating the intermediate statistics of the sample sets, recalculating the central vector of the garbage sample text set, and revising the classifier.
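The feedback loop of the last two steps can be sketched as follows. The feedback condition is not reproduced in this text, so the sketch assumes a simple rule: a text enters the feedback set when its spam attribute tendency value reaches a threshold. The threshold value 0.9 and the function name `update_with_feedback` are hypothetical, and only the recalculation of the central vector is shown, not the regeneration of the feature word list.

```python
def update_with_feedback(garbage_vectors, candidates, scores, threshold=0.9):
    """Add high-confidence junk texts to the sample set and recompute its center.

    garbage_vectors : feature vectors already in the garbage sample text set
    candidates      : feature vectors of texts judged to be junk
    scores          : spam attribute tendency value of each candidate
    threshold       : ASSUMED feedback condition (score >= threshold)
    """
    feedback = [v for v, s in zip(candidates, scores) if s >= threshold]
    updated = garbage_vectors + feedback
    n = len(updated)
    # The central vector is again the arithmetic mean of all feature vectors.
    center = [sum(column) / n for column in zip(*updated)]
    return feedback, center


# Invented vectors and scores for illustration only.
feedback, center = update_with_feedback(
    [[1.0, 0.0], [3.0, 0.0]],
    [[2.0, 3.0], [0.0, 9.0]],
    [0.95, 0.40],
)
# feedback == [[2.0, 3.0]], center == [2.0, 1.0]
```

Only the first candidate passes the assumed threshold, so the central vector shifts toward it while the low-confidence candidate is ignored.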

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A device and a method for auditing junk texts through central vector similarity matching are characterized by comprising the following steps:

2. The device and the method for auditing spam texts through central vector similarity matching according to claim 1, wherein: in step S3, the preprocessing of the newly acquired text to be detected, such as word segmentation and feature extraction, includes:

3. The device and the method for auditing spam texts through central vector similarity matching according to claim 2, wherein: the word segmentation rule in step S3.2 includes, but is not limited to, a forward maximum matching algorithm, a reverse maximum matching algorithm, an optimal matching method, and a method for setting segmentation flags.

4. The device and the method for auditing spam texts through central vector similarity matching according to claim 1, wherein: the step S2 specifically includes the following steps:

Step S2.3: for each feature word t_i in the garbage sample text set, calculating the arithmetic mean of its weights w_i over all text feature vectors in the garbage sample text set, and using this mean as the weight of the feature word in the category feature vector;

5. The device and the method for auditing spam texts through central vector similarity matching according to claim 4, wherein: the step S3 specifically includes:

performing word segmentation and feature extraction on the text to be detected to generate a feature vector X_u representing it, and calculating the conditional probability that the text to be detected belongs to each of the two sample classes according to the discriminant function of a naive Bayes classifier, as shown in formula 1:

Wherein i is 1 or 2.

6. The device and the method for auditing spam texts through center vector similarity matching according to claim 5, wherein: the step S4 specifically includes:

calculating the similarity between the feature vector X_u of the text to be detected and the central vector X_c of the garbage sample text set, as shown in formula 2:

calculating the probability value P(C₂|X_u) that the text to be detected is junk text, that is, the spam attribute tendency value score_spam(X_u) of the text to be detected, as shown in formula 3:

7. The device and the method for auditing spam texts through center vector similarity matching according to claim 6, wherein: the step S5 specifically includes:

calculating the spam attribute tendency value of each text to be detected, and adding the text to be detected to the feedback text set if formula 4 holds.