CN111651598A - Spam text auditing device and method through center vector similarity matching - Google Patents
- Publication number
- CN111651598A (application number CN202010469767.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- vector
- calculating
- garbage
- sample text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
Abstract
A device and method for auditing spam text through center vector similarity matching, comprising the following steps: establishing a spam sample text set and a normal sample text set; calculating the center vector of each of the two sample sets; and preprocessing the text to be recognized for classification. The text to be detected, represented by its feature words, is then classified using a classifier trained on the sample texts. Whether a text is spam can be determined according to a preset criterion, and preventive measures can then be taken against texts judged to be spam, so that the adverse effects of spam text on people's daily lives are avoided.
Description
Technical Field
The invention relates to the technical field of text auditing, in particular to a junk text auditing device and method based on central vector similarity matching.
Background
With the development of internet technology, information security has become increasingly important. Every line of business holds information that is critical to the business itself. For example, in medical insurance auditing, the information of the insured person is critical to the party providing the insurance service, and risk prevention and control are needed to keep it from leaking.
Spam texts circulating on networks contain not only common commercial advertisements but also reactionary and fraudulent information, among other things. The spread of such information not only disrupts people's daily lives but also endangers the safety and stability of society. It is therefore necessary to identify spam texts so that they can be filtered or deleted.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a device and method for auditing spam text through center vector similarity matching. It is used to determine whether texts on a network are spam so that such texts can be filtered, deleted, or otherwise handled.
To achieve this purpose, the invention adopts the following technical solution:
A device and method for auditing spam text through center vector similarity matching, comprising the following steps:
step S1: establishing a spam sample text set and a normal sample text set;
step S2: calculating a center vector for the spam sample text set and a center vector for the normal sample text set, each center vector being the arithmetic mean of the feature vectors of the texts in the respective set;
step S3: preprocessing the newly collected text to be detected (word segmentation, feature extraction, and the like) and representing it as a feature vector;
step S4: calculating the similarity between the feature vector of the text to be detected and each of the two center vectors, and comparing the two similarities; the text to be detected is assigned to the category with the larger similarity;
step S5: calculating a feedback value for each text judged to be spam, and adding the text to a feedback text set if the feedback value meets a given condition;
step S6: regenerating the category feature word list for the classifier according to the feedback information, updating the intermediate statistics of the sample set, recalculating the center vector of the spam sample text set, and revising the classifier.
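Steps S2 and S4 can be illustrated with a minimal Python sketch. The similarity formula itself is not reproduced in this text, so cosine similarity is used here purely as an assumption; the function names and the toy feature vectors are likewise hypothetical.

```python
import math

def center_vector(vectors):
    """Arithmetic mean of equal-length feature vectors (step S2)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity (an assumed measure); 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(x, spam_center, normal_center):
    """Assign x to the category whose center vector is more similar (step S4)."""
    if cosine(x, spam_center) > cosine(x, normal_center):
        return "spam"
    return "normal"

# Toy feature vectors: made-up weights over three feature words.
spam_vectors = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
normal_vectors = [[0.0, 0.2, 0.8], [0.1, 0.0, 0.9]]

spam_center = center_vector(spam_vectors)
normal_center = center_vector(normal_vectors)
label = classify([0.8, 0.2, 0.1], spam_center, normal_center)
```

In this sketch a tie between the two similarities falls to "normal"; the source does not say how ties are resolved.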
Preferably, in step S3, preprocessing the newly collected text to be detected (word segmentation, feature extraction, and the like) includes:
step S3.1: performing feature extraction on the text using a pre-constructed word segmentation dictionary, stop word list, keyword list, and variant word list to obtain a plurality of text features;
step S3.2: segmenting the character string to be segmented according to a word segmentation rule: substrings of the string are taken and matched in turn against the entries of a given word segmentation dictionary, and a substring that matches is determined to be a word.
Preferably, the word segmentation rule in step S3.2 includes, but is not limited to, the forward maximum matching algorithm, the reverse maximum matching algorithm, the optimal matching method, and the segmentation mark method.
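As an illustration of the dictionary matching described in step S3.2, the following is a minimal sketch of forward maximum matching. The function name, the `max_len` parameter, the single-character fallback, and the sample dictionary are assumptions made for this example and do not come from the source.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    substring found in the dictionary; if none matches, emit one character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical segmentation dictionary and input string.
dictionary = {"免费", "领取", "优惠券", "优惠"}
tokens = forward_max_match("免费领取优惠券", dictionary)
```

Reverse maximum matching follows the same idea but scans from the end of the string toward the beginning.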
Preferably, step S2 specifically includes the following steps:
step S2.1: letting $C_1$ denote the normal sample text set and $C_2$ the spam sample text set, performing word segmentation on all texts in $C_1$ and $C_2$ and removing stop words;
step S2.2: calculating the prior probability $P(C_2)$ of the spam sample texts and the information gain value of each segmented word $W_t$; sorting the segmented words $W_t$ by information gain value in descending order and selecting the top $n$ as feature words $t_i$; and calculating the weight $w_i$ of each feature word according to TF-IDF;
step S2.3: for each feature word $t_i$ in the spam sample text set, calculating the arithmetic mean of its weights $w_i$ over all text feature vectors in the spam sample text set, and taking this mean as the weight of the feature word in the category feature vector;
step S2.5: generating an initial classifier $C$ from the prior probability $P(C_2)$ of the spam sample text set, the prior probability $P(C_1)$ of the normal sample text set, and the conditional probabilities $P(t_i \mid C_2)$ and $P(t_i \mid C_1)$ of each feature word.
Preferably, step S3 specifically includes:
performing word segmentation and feature extraction on the text to be detected to generate a feature vector $X_u$ representing it, and calculating the conditional probability that the text to be detected belongs to each of the two sample classes according to the discriminant function of a naive Bayes classifier, as shown in the following formula:
wherein $i$ is 1 or 2.
Preferably, step S4 specifically includes:
calculating the similarity between the feature vector $X_u$ of the text to be detected and the center vector $X_c$ of the spam sample text set, as shown in the following formula:
calculating the probability $P(C_2 \mid X_u)$ that the text to be detected is spam, that is, the spam attribute tendency value $score_{spam}(X_u)$ of the text to be detected, as shown in the following formula:
Preferably, step S5 specifically includes:
calculating the spam attribute tendency value of each text to be detected, and adding the text to the feedback text set if the following formula holds.
The invention provides a device and method for auditing spam text through center vector similarity matching, with the following beneficial effects: the category feature word list of the classifier is regenerated according to the feedback information, the intermediate statistics of the sample set are updated, the center vector of the spam sample text set is recalculated, and the classifier is revised, so that the prior probabilities of the spam and normal sample text sets and the conditional probability of each feature move closer to those of an ideal model, thereby improving system performance.
Drawings
To illustrate the present invention and the prior art solutions more clearly, the drawings needed for the description are briefly introduced below.
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a block diagram of an embodiment of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings.
As shown in FIGS. 1 and 2, a spam text auditing device and method through center vector similarity matching includes the following steps:
step S1: establishing a spam sample text set and a normal sample text set; in this step, the spam sample texts and the normal sample texts can be screened from the sample texts by an editor and then manually labeled; $C_1$ denotes the normal sample text set and $C_2$ the spam sample text set;
step S2: calculating a center vector for the spam sample text set and a center vector for the normal sample text set, each center vector being the arithmetic mean of the feature vectors of the texts in the respective set; the specific steps are as follows:
calculating the prior probabilities of the spam sample texts and the normal sample texts; the probability $P(C_j)$ that a sample text is a spam sample or a normal sample is calculated as follows:
$$P(C_j) = \frac{N_j}{N}$$
where $j$ is 1 or 2, $N_j$ is the number of sample texts belonging to $C_j$, and $N$ is the total number of sample texts.
The probability that a sample text contains the segmented word $t$ is calculated as follows:
$$P(t) = \frac{N_t}{N}$$
where $N_t$ is the number of sample texts containing the segmented word $t$.
The probability that a sample text does not contain the segmented word $t$ is calculated as follows:
$$P(\bar{t}) = 1 - P(t) = \frac{N - N_t}{N}$$
The conditional probability that a sample text containing the segmented word $t$ belongs to the sample text set $C_j$ is calculated as follows:
$$P(C_j \mid t) = \frac{N(C_j, t)}{N_t}$$
where $N(C_j, t)$ is the number of sample texts in the sample text set $C_j$ that contain the segmented word $t$.
The information gain value $IG(t)$ of the segmented word $W_t$ is calculated as follows:
The segmented words $W_t$ are sorted by information gain value $IG(t)$ in descending order, and the top $n$ are selected as feature words $t_i$.
The weight $w_i$ of each feature word $t_i$ is calculated according to TF-IDF as follows:
$$w_i = P(t) \times N_t$$
For each feature word $t_i$ in the sample text set, the arithmetic mean of its weights $w_i$ over the text feature vectors in the sample text set is calculated and used as the weight of the feature word in the category feature vector; the calculation formula is as follows:
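The information gain formula is not reproduced in this text. The sketch below assumes the standard definition used for text feature selection, $IG(t) = H(C) - P(t)\,H(C \mid t) - P(\bar{t})\,H(C \mid \bar{t})$, built from the probabilities defined above. The function names and the toy data are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(docs, labels, term):
    """Assumed standard form: IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t).
    docs is a list of sets of segmented words; labels holds each text's class."""
    classes = sorted(set(labels))
    with_t = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_t = [lab for doc, lab in zip(docs, labels) if term not in doc]

    def class_dist(subset):
        if not subset:
            return []
        return [subset.count(c) / len(subset) for c in classes]

    p_t = len(with_t) / len(docs)
    return (entropy(class_dist(list(labels)))
            - p_t * entropy(class_dist(with_t))
            - (1 - p_t) * entropy(class_dist(without_t)))

def top_n_features(docs, labels, n):
    """Rank the vocabulary by information gain and keep the top n words."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:n]

# Toy corpus: label 2 = spam sample, label 1 = normal sample.
docs = [{"free", "prize"}, {"free", "offer"}, {"meeting", "notes"}, {"lunch", "notes"}]
labels = [2, 2, 1, 1]
features = top_n_features(docs, labels, 2)
```

In the toy corpus, "free" and "notes" each separate the two classes perfectly, so they receive the highest information gain and are selected.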
step S3: preprocessing the newly collected text to be detected (word segmentation, feature extraction, and the like) and representing it as a feature vector; specifically, the text to be detected is segmented using the forward maximum matching algorithm, the reverse maximum matching algorithm, the optimal matching method, or the segmentation mark method; the segmented words and the part of speech of each are recorded; nouns, verbs, and adjectives are kept; and meaningless words are removed using a stop word list, yielding the preprocessed text to be detected;
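The filtering part of step S3 can be sketched as follows, assuming the segmenter has already produced (word, part-of-speech) pairs. The tag names `n`, `v`, `a`, the function names, and the sample data are assumptions for illustration. The binary vector produced here records, for each feature word, whether it occurs in the text.

```python
KEPT_POS = {"n", "v", "a"}  # assumed tags for noun, verb, adjective

def filter_tokens(tagged_tokens, stop_words):
    """Keep nouns, verbs, and adjectives that are not in the stop word list."""
    return [word for word, pos in tagged_tokens
            if pos in KEPT_POS and word not in stop_words]

def presence_vector(words, feature_words):
    """Binary feature vector: 1 if the feature word occurs in the text, else 0."""
    present = set(words)
    return [1 if t in present else 0 for t in feature_words]

# Hypothetical segmenter output for one text.
tagged = [("win", "v"), ("the", "x"), ("free", "a"), ("prize", "n"), ("is", "v")]
kept = filter_tokens(tagged, stop_words={"is"})
vector = presence_vector(kept, feature_words=["free", "prize", "meeting"])
```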
step S4: calculating the similarity between the feature vector of the text to be detected and the center vector of the spam sample text set and the center vector of the normal sample text set;
calculating the probability $P(X \mid C_j)$ of the feature vector $X$ given the sample text set $C_j$; the calculation formula is as follows:
where $m_i$ indicates whether the text to be detected contains the feature word $t_i$: $m_i$ is 1 if the feature vector of the text to be detected contains the feature word $t_i$, and 0 otherwise;
calculating $P(X)$ by the following formula:
$$P(X) = P(X \mid C_1)\,P(C_1) + P(X \mid C_2)\,P(C_2)$$
calculating $P(C_j \mid X)$ by the following formula:
$$P(C_j \mid X) = \frac{P(X \mid C_j)\,P(C_j)}{P(X)}$$
step S5: comparing $P(C_1 \mid X)$ and $P(C_2 \mid X)$; the text to be detected is assigned to the sample text set with the higher probability.
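The posterior comparison in steps S4 and S5 can be sketched as below. The formula for $P(X \mid C_j)$ is not reproduced in this text, so the sketch assumes a Bernoulli naive Bayes likelihood, in which a feature word contributes $P(t_i \mid C_j)$ when $m_i = 1$ and $1 - P(t_i \mid C_j)$ when $m_i = 0$. The function names and all probability values are made up for the example.

```python
def likelihood(m, cond_probs):
    """P(X | C_j) under an assumed Bernoulli naive Bayes model:
    multiply p_i where m_i == 1 and (1 - p_i) where m_i == 0."""
    result = 1.0
    for m_i, p_i in zip(m, cond_probs):
        result *= p_i if m_i else (1.0 - p_i)
    return result

def posteriors(m, priors, cond_probs):
    """Return {j: P(C_j | X)}, with P(X) = sum over j of P(X | C_j) P(C_j)."""
    joint = {j: likelihood(m, cond_probs[j]) * priors[j] for j in priors}
    p_x = sum(joint.values())
    return {j: joint[j] / p_x for j in joint}

# Class 1 = normal, class 2 = spam; all numbers are illustrative only.
priors = {1: 0.5, 2: 0.5}
cond_probs = {1: [0.1, 0.2, 0.7], 2: [0.8, 0.6, 0.1]}
m = [1, 1, 0]  # the text contains feature words 1 and 2 but not 3

post = posteriors(m, priors, cond_probs)
predicted = max(post, key=post.get)
```

A production version would typically smooth the conditional probabilities so that $P(X)$ cannot be zero, and would work in log space to avoid underflow with many features; neither detail is stated in the source.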
step S6: calculating a feedback value for each text judged to be spam, and adding the text to a feedback text set if the feedback value meets a given condition;
calculating the probability $P(C_2 \mid X_u)$ that the text to be detected is spam, that is, the spam attribute tendency value $score_{spam}(X_u)$ of the text to be detected; the calculation formula is as follows:
calculating the spam attribute tendency value of each text to be detected, and adding the text to the feedback text set if the following formula holds.
step S7: regenerating the category feature word list for the classifier according to the feedback information, updating the intermediate statistics of the sample set, recalculating the center vector of the spam sample text set, and revising the classifier.
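The feedback loop of the last two steps can be sketched as follows. The admission condition for the feedback set is not reproduced in this text, so a simple threshold on the spam attribute tendency value is assumed; the threshold value 0.95, the function names, and the incremental mean update are all hypothetical choices for illustration.

```python
def collect_feedback(scored_texts, threshold=0.95):
    """Keep texts whose spam tendency score meets the assumed threshold."""
    return [text for text, score in scored_texts if score >= threshold]

def update_center(center, count, new_vectors):
    """Fold new feature vectors into an arithmetic-mean center vector.
    center is the current mean of `count` vectors; returns (new_center, new_count)."""
    total = count + len(new_vectors)
    if total == 0:
        return center, 0
    sums = [c * count for c in center]
    for vector in new_vectors:
        sums = [s + x for s, x in zip(sums, vector)]
    return [s / total for s in sums], total

scored = [("text A", 0.99), ("text B", 0.60), ("text C", 0.97)]
feedback_set = collect_feedback(scored)
new_center, new_count = update_center([1.0, 0.0], 1, [[0.0, 1.0]])
```

Recomputing the feature word list and the conditional probabilities from the enlarged sample set would follow the same procedure as the initial training.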
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (7)
1. A device and method for auditing spam text through center vector similarity matching, characterized by comprising the following steps:
step S1: establishing a spam sample text set and a normal sample text set;
step S2: calculating a center vector for the spam sample text set and a center vector for the normal sample text set, each center vector being the arithmetic mean of the feature vectors of the texts in the respective set;
step S3: preprocessing the newly collected text to be detected (word segmentation, feature extraction, and the like) and representing it as a feature vector;
step S4: calculating the similarity between the feature vector of the text to be detected and each of the two center vectors, and comparing the two similarities; the text to be detected is assigned to the category with the larger similarity;
step S5: calculating a feedback value for each text judged to be spam, and adding the text to a feedback text set if the feedback value meets a given condition;
step S6: regenerating the category feature word list for the classifier according to the feedback information, updating the intermediate statistics of the sample set, recalculating the center vector of the spam sample text set, and revising the classifier.
2. The device and method for auditing spam text through center vector similarity matching according to claim 1, wherein in step S3, preprocessing the newly collected text to be detected (word segmentation, feature extraction, and the like) includes:
step S3.1: performing feature extraction on the text using a pre-constructed word segmentation dictionary, stop word list, keyword list, and variant word list to obtain a plurality of text features;
step S3.2: segmenting the character string to be segmented according to a word segmentation rule: substrings of the string are taken and matched in turn against the entries of a given word segmentation dictionary, and a substring that matches is determined to be a word.
3. The device and method for auditing spam text through center vector similarity matching according to claim 2, wherein the word segmentation rule in step S3.2 includes, but is not limited to, the forward maximum matching algorithm, the reverse maximum matching algorithm, the optimal matching method, and the segmentation mark method.
4. The device and method for auditing spam text through center vector similarity matching according to claim 1, wherein step S2 specifically includes the following steps:
step S2.1: letting $C_1$ denote the normal sample text set and $C_2$ the spam sample text set, performing word segmentation on all texts in $C_1$ and $C_2$ and removing stop words;
step S2.2: calculating the prior probability $P(C_2)$ of the spam sample texts and the information gain value of each segmented word $W_t$; sorting the segmented words $W_t$ by information gain value in descending order and selecting the top $n$ as feature words $t_i$; and calculating the weight $w_i$ of each feature word according to TF-IDF;
step S2.3: for each feature word $t_i$ in the spam sample text set, calculating the arithmetic mean of its weights $w_i$ over all text feature vectors in the spam sample text set, and taking this mean as the weight of the feature word in the category feature vector;
step S2.5: generating an initial classifier $C$ from the prior probability $P(C_2)$ of the spam sample text set, the prior probability $P(C_1)$ of the normal sample text set, and the conditional probabilities $P(t_i \mid C_2)$ and $P(t_i \mid C_1)$ of each feature word.
5. The device and method for auditing spam text through center vector similarity matching according to claim 4, wherein step S3 specifically includes:
performing word segmentation and feature extraction on the text to be detected to generate a feature vector $X_u$ representing it, and calculating the conditional probability that the text to be detected belongs to each of the two sample classes according to the discriminant function of a naive Bayes classifier, as shown in formula 1:
wherein $i$ is 1 or 2.
6. The device and method for auditing spam text through center vector similarity matching according to claim 5, wherein step S4 specifically includes:
calculating the similarity between the feature vector $X_u$ of the text to be detected and the center vector $X_c$ of the spam sample text set, as shown in formula 2:
calculating the probability $P(C_2 \mid X_u)$ that the text to be detected is spam, that is, the spam attribute tendency value $score_{spam}(X_u)$ of the text to be detected, as shown in formula 3:
7. The device and method for auditing spam text through center vector similarity matching according to claim 6, wherein step S5 specifically includes:
calculating the spam attribute tendency value of each text to be detected, and adding the text to the feedback text set if formula 4 holds.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010469767.4A CN111651598A (en) | 2020-05-28 | 2020-05-28 | Spam text auditing device and method through center vector similarity matching |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111651598A true CN111651598A (en) | 2020-09-11 |
Family
ID=72343431
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202010469767.4A | Pending | CN111651598A (en) | 2020-05-28 | 2020-05-28 | Spam text auditing device and method through center vector similarity matching |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111651598A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114928498A (en) * | 2022-06-15 | 2022-08-19 | 中国联合网络通信集团有限公司 | Fraud information identification method and device and computer readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103336766A (en) * | 2013-07-04 | 2013-10-02 | 微梦创科网络科技(中国)有限公司 | Short text garbage identification and modeling method and device |
CN104462062A (en) * | 2014-12-11 | 2015-03-25 | 珠海金山网络游戏科技有限公司 | Text anti-spam method |
CN107086952A (en) * | 2017-04-19 | 2017-08-22 | 中国石油大学(华东) | A kind of Bayesian SPAM Filtering method based on TF IDF Chinese word segmentations |
CN107943941A (en) * | 2017-11-23 | 2018-04-20 | 珠海金山网络游戏科技有限公司 | It is a kind of can iteration renewal rubbish text recognition methods and system |
CN110309297A (en) * | 2018-03-16 | 2019-10-08 | 腾讯科技(深圳)有限公司 | Rubbish text detection method, readable storage medium storing program for executing and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||