CN111651598A - Spam text auditing device and method through center vector similarity matching - Google Patents

Spam text auditing device and method through center vector similarity matching

Info

Publication number
CN111651598A
Authority
CN
China
Prior art keywords
text
vector
calculating
garbage
sample text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010469767.4A
Other languages
Chinese (zh)
Inventor
陈晓峰
麻沁甜
刘星辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bochi Information Technology Co ltd
Original Assignee
Shanghai Bochi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bochi Information Technology Co ltd
Priority to CN202010469767.4A
Publication of CN111651598A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Abstract

A device and method for auditing spam text through center vector similarity matching. The method comprises: establishing a spam sample text set and a normal sample text set; calculating the center vectors of the two sample sets; and preprocessing the text to be recognized for classification. The text under test, represented by its feature words, is then classified by a classifier trained on the sample texts. Whether a text is spam can be determined against a preset standard, and preventive measures can then be taken for texts judged to be spam, so as to avoid their adverse effects on people's daily lives.

Description

Spam text auditing device and method through center vector similarity matching
Technical Field
The invention relates to the technical field of text auditing, and in particular to a spam text auditing device and method based on center vector similarity matching.
Background
With the development of internet technology, information security has become increasingly important. Every line of business holds information that is critical to it. In medical insurance auditing, for example, the insured person's information is critical to the party providing the insurance, and risk controls are needed to keep it from leaking.
Besides common commercial advertisements, spam texts also carry content such as reactionary propaganda and fraud. The spread of such content not only disrupts people's daily lives but also endangers social safety and stability. Spam texts therefore need to be identified so that they can be filtered or deleted.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a device and method for auditing spam text through center vector similarity matching, used to determine whether a text on a network is spam so that it can be filtered, deleted, or otherwise handled.
To achieve this purpose, the invention adopts the following technical scheme:
A device and method for auditing spam text through center vector similarity matching, comprising the following steps:
step S1: establishing a spam sample text set and a normal sample text set;
step S2: calculating a center vector for the spam sample text set and a center vector for the normal sample text set, each as the arithmetic mean of the feature vectors of the texts in that set;
step S3: preprocessing each newly collected text under test (word segmentation, feature extraction, and so on) and representing it as a feature vector;
step S4: calculating the similarity between the feature vector of the text under test and the center vector of the spam sample text set, and between it and the center vector of the normal sample text set; the two similarities are compared, and the text under test is assigned to the category with the larger similarity;
step S5: calculating a feedback value for each text under test judged to be spam and, if the feedback value meets a given condition, adding that text to a feedback text set;
step S6: regenerating the category feature word list for the classifier according to the feedback information, updating the intermediate statistics of the sample set, recalculating the center vector of the spam sample text set, and revising the classifier.
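Steps S2 and S4 can be illustrated with a short Python sketch. It is a minimal illustration rather than the patented implementation: cosine similarity is an assumed choice (the source gives its similarity formula only as an image), and the names `center_vector`, `cosine`, `classify` and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]  # feature word -> weight (sparse representation)


def center_vector(vectors: List[Vector]) -> Vector:
    """Step S2: arithmetic mean of the feature vectors of one sample set."""
    totals: Dict[str, float] = {}
    for vec in vectors:
        for word, weight in vec.items():
            totals[word] = totals.get(word, 0.0) + weight
    return {word: total / len(vectors) for word, total in totals.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; an assumption, not confirmed by the source."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(x: Vector, spam_center: Vector, normal_center: Vector) -> str:
    """Step S4: pick the category whose center vector is more similar."""
    if cosine(x, spam_center) > cosine(x, normal_center):
        return "spam"
    return "normal"


# Hypothetical toy sample sets (step S1).
spam_docs = [{"prize": 1.0, "click": 1.0}, {"prize": 1.0, "free": 1.0}]
normal_docs = [{"meeting": 1.0, "report": 1.0}, {"report": 1.0, "lunch": 1.0}]

spam_center = center_vector(spam_docs)
normal_center = center_vector(normal_docs)
label = classify({"free": 1.0, "prize": 1.0}, spam_center, normal_center)
```

A sparse dictionary is used here only to keep the example free of third-party dependencies; a dense array aligned to the feature word list would serve equally well.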
Preferably, in step S3, the preprocessing (word segmentation, feature extraction, and so on) of the newly collected text under test comprises:
step S3.1: extracting features from the text using a pre-built word segmentation dictionary, stop word list, keyword list, and variant word list, to obtain a plurality of text features;
step S3.2: segmenting the character string to be segmented according to a word segmentation rule: substrings of the string are taken and matched in turn against the entries of the given word segmentation dictionary, and a substring that matches is taken to be a word.
Preferably, the word segmentation rule in step S3.2 includes, but is not limited to, the forward maximum matching algorithm, the reverse maximum matching algorithm, the optimal matching method, and the segmentation mark method.
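Of the rules named above, forward maximum matching is the simplest to show. The sketch below is one common formulation, under stated assumptions: the maximum word length `max_len`, the single-character fallback for unmatched text, and the toy dictionary are all hypothetical choices that the source does not specify.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Scan left to right, always taking the longest dictionary entry."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until it matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is accepted even if it is not in the
            # dictionary, so the scan always advances (assumed fallback).
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical segmentation dictionary.
dictionary = {"免费", "领取", "大奖", "领取大奖"}
tokens = forward_max_match("免费领取大奖", dictionary)
```

Because the longest entry wins, "领取大奖" is kept whole instead of being split into "领取" and "大奖".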
Preferably, step S2 specifically comprises the following steps:
step S2.1: letting $C_1$ denote the normal sample text set and $C_2$ the spam sample text set, segmenting every text in $C_1$ and $C_2$ into words and removing stop words;
step S2.2: calculating the prior probability $P(C_2)$ of the spam sample text and the information gain value of each segmented word $W_t$; sorting the segmented words $W_t$ by information gain value in descending order and selecting the top $n$ as the feature words $t_i$; calculating the weight $w_i$ of each feature word according to TF-IDF;
step S2.3: for each feature word $t_i$ in the spam sample text set, calculating the arithmetic mean $\bar{w}_i$ of its weights $w_i$ over all text feature vectors in the spam sample text set, and using $\bar{w}_i$ as the weight of that feature word in the category feature vector;
step S2.4: constructing the center vector of the spam sample text set $C_2$ from these mean weights:
$$X_c = (\bar{w}_1, \bar{w}_2, \ldots, \bar{w}_n)$$
step S2.5: generating an initial classifier $C$ from the prior probability $P(C_2)$ of the spam sample text set, the prior probability $P(C_1)$ of the normal sample text set, and the conditional probabilities $P(t_i \mid C_2)$ and $P(t_i \mid C_1)$ of each feature word.
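The initial classifier of step S2.5 is fully described by the two prior probabilities and the per-class conditional probabilities of the feature words. A minimal sketch of estimating them from document counts follows. The add-one smoothing is an assumption introduced here so that no probability is exactly 0 or 1 (the source does not say how zero counts are handled), and `InitialClassifier` and `train` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class InitialClassifier:
    priors: Dict[int, float]           # P(C_j), keyed by class index j
    cond: Dict[int, Dict[str, float]]  # P(t_i | C_j), keyed by j then t_i


def train(classes: Dict[int, List[Set[str]]], features: List[str]) -> InitialClassifier:
    """Estimate priors and conditional probabilities from document counts."""
    total = sum(len(docs) for docs in classes.values())
    priors = {j: len(docs) / total for j, docs in classes.items()}
    cond: Dict[int, Dict[str, float]] = {}
    for j, docs in classes.items():
        cond[j] = {
            # Add-one smoothing: an assumption, not taken from the source.
            t: (sum(1 for doc in docs if t in doc) + 1) / (len(docs) + 2)
            for t in features
        }
    return InitialClassifier(priors, cond)


# Hypothetical toy data: class 1 is normal, class 2 is spam.
classes = {
    1: [{"report"}, {"report", "lunch"}],
    2: [{"prize"}, {"prize", "free"}],
}
clf = train(classes, ["prize", "report"])
```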
Preferably, step S3 specifically comprises:
performing word segmentation and feature extraction on the text under test to generate a feature vector $X_u$ representing it, and calculating the conditional probability that the text under test belongs to each of the two sample classes according to the discriminant function of the naive Bayes classifier:
$$P(C_i \mid X_u) = \frac{P(X_u \mid C_i)\,P(C_i)}{P(X_u)}$$
where $i$ is 1 or 2.
Preferably, step S4 specifically comprises:
calculating the similarity between the feature vector $X_u$ of the text under test and the center vector $X_c$ of the spam sample text set, as shown in the following formula:
[formula image BDA0002513917920000041, not reproduced in this text]
and calculating the probability $P(C_2 \mid X_u)$ that the text under test is spam, that is, the spam attribute tendency value $\mathrm{score}_{spam}(X_u)$ of the text under test, as shown in the following formula:
[formula image BDA0002513917920000042, not reproduced in this text]
Preferably, step S5 specifically comprises:
calculating the spam attribute tendency value of each text under test, and adding the text under test to the feedback text set if the following condition holds:
[formula image BDA0002513917920000043, not reproduced in this text]
The invention provides a device and method for auditing spam text through center vector similarity matching, with the following beneficial effects: the category feature word list for the classifier is regenerated according to the feedback information, the intermediate statistics of the sample set are updated, the center vector of the spam sample text set is recalculated, and the classifier is revised, so that the prior probabilities of the spam and normal sample text sets and the conditional probability of each feature move closer to those of an ideal model, which improves the performance of the system.
Drawings
To illustrate the present invention and the prior art solutions more clearly, the drawings needed in the description are briefly introduced below.
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a block diagram of an embodiment of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings.
As shown in FIGS. 1-2, a spam text auditing device and method through center vector similarity matching comprises the following steps:
step S1: establishing a spam sample text set and a normal sample text set; in this step, an editor can screen the spam sample texts and the normal sample texts out of the sample texts and label them manually; $C_1$ denotes the normal sample text set and $C_2$ the spam sample text set;
step S2: calculating a center vector for the spam sample text set and a center vector for the normal sample text set, each as the arithmetic mean of the feature vectors of the texts in that set; the specific steps are as follows:
The prior probabilities of the spam sample text and the normal sample text are calculated first. The probability $P(C_j)$ that a sample text is a spam sample or a normal sample is:
$$P(C_j) = \frac{N_j}{N}$$
where $j$ is 1 or 2, $N_j$ is the number of sample texts belonging to $C_j$, and $N$ is the total number of sample texts.
The probability that a sample text contains the segmented word $t$ is:
$$P(t) = \frac{N_t}{N}$$
where $N_t$ is the number of sample texts containing the segmented word $t$.
The probability that a sample text does not contain the segmented word $t$ is:
$$P(\bar{t}) = 1 - P(t) = \frac{N - N_t}{N}$$
The conditional probability that a sample text containing the segmented word $t$ belongs to the sample text set $C_j$ is:
$$P(C_j \mid t) = \frac{N(C_j, t)}{N_t}$$
where $N(C_j, t)$ is the number of sample texts in the sample text set $C_j$ that contain the segmented word $t$.
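All four probabilities above are ratios of document counts, so they can be checked on a toy corpus. The sketch below uses exact fractions to make the arithmetic easy to follow; the corpus and the function names are hypothetical.

```python
from fractions import Fraction
from typing import Dict, List, Set

# Hypothetical corpus: each sample text is the set of segmented words it contains.
samples: Dict[int, List[Set[str]]] = {
    1: [{"meeting", "report"}, {"report", "lunch"}, {"free", "lunch"}],  # C_1, normal
    2: [{"free", "prize"}, {"prize", "click"}],                          # C_2, spam
}

N = sum(len(docs) for docs in samples.values())  # total number of sample texts


def n_class_t(j: int, t: str) -> int:
    """N(C_j, t): number of sample texts in C_j that contain t."""
    return sum(1 for doc in samples[j] if t in doc)


def n_t(t: str) -> int:
    """N_t: number of sample texts, over all classes, that contain t."""
    return sum(n_class_t(j, t) for j in samples)


def prior(j: int) -> Fraction:
    """P(C_j) = N_j / N."""
    return Fraction(len(samples[j]), N)


def p_t(t: str) -> Fraction:
    """P(t) = N_t / N."""
    return Fraction(n_t(t), N)


def p_not_t(t: str) -> Fraction:
    """Probability that a sample text does not contain t: 1 - P(t)."""
    return 1 - p_t(t)


def p_class_given_t(j: int, t: str) -> Fraction:
    """P(C_j | t) = N(C_j, t) / N_t; raises ZeroDivisionError if t never occurs."""
    return Fraction(n_class_t(j, t), n_t(t))
```

For this corpus, "free" appears in one normal text and one spam text, so `p_t("free")` is 2/5 and `p_class_given_t(2, "free")` is 1/2, while "prize" appears only in spam texts and `p_class_given_t(2, "prize")` is 1.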
The information gain value $IG(t)$ of the segmented word $W_t$ is calculated by the following formula:
[formula image BDA0002513917920000061, not reproduced in this text]
The segmented words $W_t$ are sorted by information gain value $IG(t)$ in descending order, and the top $n$ are selected as the feature words $t_i$.
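The information gain formula survives in this text only as an image reference, so the sketch below assumes the textbook definition: the entropy of the class distribution minus the expected entropy after splitting the sample texts by whether they contain the word, that is, IG(t) = H(C) - P(t) H(C | t) - P(not t) H(C | not t). The function names and the toy corpus are hypothetical.

```python
import math
from typing import Dict, List, Set


def entropy(probs: List[float]) -> float:
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def information_gain(t: str, classes: Dict[int, List[Set[str]]]) -> float:
    """Assumed standard form: H(C) minus the entropy conditioned on t."""
    n = sum(len(docs) for docs in classes.values())
    n_t = sum(1 for docs in classes.values() for doc in docs if t in doc)
    gain = entropy([len(docs) / n for docs in classes.values()])
    # Split the corpus into texts that contain t and texts that do not.
    for present, size in ((True, n_t), (False, n - n_t)):
        if size == 0:
            continue
        cond = [
            sum(1 for doc in docs if (t in doc) == present) / size
            for docs in classes.values()
        ]
        gain -= (size / n) * entropy(cond)
    return gain


def top_n_features(classes: Dict[int, List[Set[str]]], n_features: int) -> List[str]:
    """Sort all segmented words by information gain, largest first."""
    vocab = sorted({w for docs in classes.values() for doc in docs for w in doc})
    ranked = sorted(vocab, key=lambda w: information_gain(w, classes), reverse=True)
    return ranked[:n_features]


# Hypothetical corpus: class 1 is normal, class 2 is spam.
corpus = {
    1: [{"meeting", "report"}, {"report", "lunch"}],
    2: [{"free", "prize"}, {"prize", "click"}],
}
features = top_n_features(corpus, 2)
```

In this corpus "prize" and "report" each separate the two classes perfectly, so both reach the maximum gain of 1 bit and are the two words selected.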
The weight $w_i$ of each feature word $t_i$ is calculated according to TF-IDF, by the following formula:
$$w_i = P(t) \times N_t$$
For each feature word $t_i$ in the sample text set, the arithmetic mean $\bar{w}_i$ of its weights $w_i$ over the text feature vectors in the sample text set is calculated and used as the weight of that feature word in the category feature vector:
$$\bar{w}_i = \frac{1}{N_j} \sum_{k=1}^{N_j} w_{i,k}$$
where $w_{i,k}$ denotes the weight $w_i$ in the feature vector of the $k$-th text of $C_j$.
The center vector of the sample text set $C_j$ is then constructed from these mean weights:
$$(\bar{w}_1, \bar{w}_2, \ldots, \bar{w}_n)$$
step S3: preprocessing each newly collected text under test (word segmentation, feature extraction, and so on) and representing it as a feature vector; specifically, the text under test is segmented into words using the forward maximum matching algorithm, the reverse maximum matching algorithm, the optimal matching method, or the segmentation mark method; each segmented word and its part of speech are recorded; nouns, verbs, and adjectives are kept; and meaningless words are removed using a stop word list, which yields the preprocessed text under test;
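As an illustration of this preprocessing step, the following sketch combines reverse maximum matching with stop word removal. It is a simplified example: the part-of-speech filter is omitted because it would require a tagger that the source does not name, and `max_len`, the single-character fallback, and the toy dictionary are assumptions.

```python
from typing import List, Set


def reverse_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Scan right to left, always taking the longest dictionary entry."""
    words: List[str] = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # Unmatched single characters are kept as words (assumed fallback).
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


def preprocess(text: str, dictionary: Set[str], stop_words: Set[str]) -> List[str]:
    """Segment the text, then drop words found in the stop word list."""
    return [w for w in reverse_max_match(text, dictionary) if w not in stop_words]


# Hypothetical dictionary and stop word list.
dictionary = {"研究", "研究生", "生命", "起源"}
stop_words = {"的"}
segmented = reverse_max_match("研究生命的起源", dictionary)
kept = preprocess("研究生命的起源", dictionary, stop_words)
```

Scanning from the right segments this sentence as 研究 / 生命 / 的 / 起源, whereas a left-to-right scan with the same dictionary would first match 研究生, which shows why the choice of rule can matter.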
step S4: calculating the similarity between the feature vector of the text under test and the center vector of the spam sample text set, and between it and the center vector of the normal sample text set;
the probability $P(X \mid C_j)$ of the feature vector $X$ given the sample text set $C_j$ is calculated by the following formula:
[formula image BDA0002513917920000065, not reproduced in this text]
where $m_i$ indicates whether the text under test contains the feature word $t_i$: $m_i$ is 1 if the feature vector of the text under test contains $t_i$, and 0 otherwise;
$P(X)$ is calculated by the following formula:
$$P(X) = P(X \mid C_1)\,P(C_1) + P(X \mid C_2)\,P(C_2)$$
$P(C_j \mid X)$ is calculated by the following formula:
$$P(C_j \mid X) = \frac{P(X \mid C_j)\,P(C_j)}{P(X)}$$
step S5: comparing $P(C_1 \mid X)$ and $P(C_2 \mid X)$; the text under test belongs to the sample text set with the higher probability.
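The posterior computation and the final comparison can be sketched as follows. The likelihood is assumed to be the product of the conditional probabilities of the feature words raised to the indicator m_i, which is one natural reading of the indicator defined in the text but is not confirmed by it, since the formula itself is an image. The total probability P(X) and Bayes' rule follow the text. All names and numbers are hypothetical.

```python
from typing import Dict, Set


def likelihood(present: Set[str], cond_j: Dict[str, float]) -> float:
    """Assumed form of P(X | C_j): product of P(t_i | C_j) ** m_i."""
    p = 1.0
    for t, prob in cond_j.items():
        m = 1 if t in present else 0  # m_i: 1 if the text contains t_i, else 0
        p *= prob ** m
    return p


def posteriors(present: Set[str],
               priors: Dict[int, float],
               cond: Dict[int, Dict[str, float]]) -> Dict[int, float]:
    """P(C_j | X) = P(X | C_j) P(C_j) / P(X) for every class j."""
    joint = {j: likelihood(present, cond[j]) * priors[j] for j in priors}
    p_x = sum(joint.values())  # P(X) = P(X|C_1)P(C_1) + P(X|C_2)P(C_2)
    return {j: value / p_x for j, value in joint.items()}


# Hypothetical classifier parameters: class 1 is normal, class 2 is spam.
priors = {1: 0.5, 2: 0.5}
cond = {
    1: {"prize": 0.25, "report": 0.75},
    2: {"prize": 0.75, "report": 0.25},
}
post = posteriors({"prize"}, priors, cond)
label = max(post, key=post.get)  # the class with the higher posterior wins
```

With these numbers a text containing only "prize" gets a posterior of 0.75 for the spam class and 0.25 for the normal class, so it is assigned to class 2.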
step S6: calculating a feedback value for each text under test judged to be spam and, if the feedback value meets a given condition, adding that text to a feedback text set;
the probability $P(C_2 \mid X_u)$ that the text under test is spam, that is, its spam attribute tendency value $\mathrm{score}_{spam}(X_u)$, is calculated by the following formula:
[formula image BDA0002513917920000072, not reproduced in this text]
the spam attribute tendency value of each text under test is calculated, and the text under test is added to the feedback text set if the following condition holds:
[formula image BDA0002513917920000073, not reproduced in this text]
step S7: regenerating the category feature word list for the classifier according to the feedback information, updating the intermediate statistics of the sample set, recalculating the center vector of the spam sample text set, and revising the classifier.
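The feedback loop of the last two steps can be sketched as below. The actual admission condition is given in the source only as a formula image, so a plain threshold on the spam attribute tendency value is assumed here; `FEEDBACK_THRESHOLD`, its value, and the function names are hypothetical. Only the recalculation of the spam center vector is shown, not the regeneration of the feature word list.

```python
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # feature word -> weight

# Hypothetical threshold: the source's real condition is not reproduced in the text.
FEEDBACK_THRESHOLD = 0.9


def collect_feedback(scored: List[Tuple[Vector, float]],
                     threshold: float = FEEDBACK_THRESHOLD) -> List[Vector]:
    """Keep the texts whose spam tendency value reaches the threshold."""
    return [vec for vec, score in scored if score >= threshold]


def center_vector(vectors: List[Vector]) -> Vector:
    """Arithmetic mean of sparse feature vectors."""
    totals: Dict[str, float] = {}
    for vec in vectors:
        for word, weight in vec.items():
            totals[word] = totals.get(word, 0.0) + weight
    return {word: total / len(vectors) for word, total in totals.items()}


# Hypothetical data: one existing spam sample and two scored texts under test.
spam_vectors = [{"prize": 1.0}]
scored = [({"prize": 1.0, "free": 1.0}, 0.95), ({"report": 1.0}, 0.60)]

feedback_set = collect_feedback(scored)
# Revise the classifier: recompute the spam center over samples plus feedback.
new_center = center_vector(spam_vectors + feedback_set)
```

Admitting only high-scoring texts is one way to limit the risk of feeding misclassified texts back into the spam sample set; how strict the condition should be is a tuning decision that this sketch leaves open.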
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. A device and method for auditing spam text through center vector similarity matching, characterized by comprising the following steps:
step S1: establishing a spam sample text set and a normal sample text set;
step S2: calculating a center vector for the spam sample text set and a center vector for the normal sample text set, each as the arithmetic mean of the feature vectors of the texts in that set;
step S3: preprocessing each newly collected text under test (word segmentation, feature extraction, and so on) and representing it as a feature vector;
step S4: calculating the similarity between the feature vector of the text under test and the center vector of the spam sample text set, and between it and the center vector of the normal sample text set; the two similarities are compared, and the text under test is assigned to the category with the larger similarity;
step S5: calculating a feedback value for each text under test judged to be spam and, if the feedback value meets a given condition, adding that text to a feedback text set;
step S6: regenerating the category feature word list for the classifier according to the feedback information, updating the intermediate statistics of the sample set, recalculating the center vector of the spam sample text set, and revising the classifier.
2. The device and method for auditing spam text through center vector similarity matching according to claim 1, wherein in step S3 the preprocessing (word segmentation, feature extraction, and so on) of the newly collected text under test comprises:
step S3.1: extracting features from the text using a pre-built word segmentation dictionary, stop word list, keyword list, and variant word list, to obtain a plurality of text features;
step S3.2: segmenting the character string to be segmented according to a word segmentation rule: substrings of the string are taken and matched in turn against the entries of the given word segmentation dictionary, and a substring that matches is taken to be a word.
3. The device and method for auditing spam text through center vector similarity matching according to claim 2, wherein the word segmentation rule in step S3.2 includes, but is not limited to, the forward maximum matching algorithm, the reverse maximum matching algorithm, the optimal matching method, and the segmentation mark method.
4. The device and method for auditing spam text through center vector similarity matching according to claim 1, wherein step S2 specifically comprises the following steps:
step S2.1: letting $C_1$ denote the normal sample text set and $C_2$ the spam sample text set, segmenting every text in $C_1$ and $C_2$ into words and removing stop words;
step S2.2: calculating the prior probability $P(C_2)$ of the spam sample text and the information gain value of each segmented word $W_t$; sorting the segmented words $W_t$ by information gain value in descending order and selecting the top $n$ as the feature words $t_i$; calculating the weight $w_i$ of each feature word according to TF-IDF;
step S2.3: for each feature word $t_i$ in the spam sample text set, calculating the arithmetic mean $\bar{w}_i$ of its weights $w_i$ over all text feature vectors in the spam sample text set, and using $\bar{w}_i$ as the weight of that feature word in the category feature vector;
step S2.4: constructing the center vector of the spam sample text set $C_2$ from these mean weights:
$$X_c = (\bar{w}_1, \bar{w}_2, \ldots, \bar{w}_n)$$
step S2.5: generating an initial classifier $C$ from the prior probability $P(C_2)$ of the spam sample text set, the prior probability $P(C_1)$ of the normal sample text set, and the conditional probabilities $P(t_i \mid C_2)$ and $P(t_i \mid C_1)$ of each feature word.
5. The device and method for auditing spam text through center vector similarity matching according to claim 4, wherein step S3 specifically comprises:
performing word segmentation and feature extraction on the text under test to generate a feature vector $X_u$ representing it, and calculating the conditional probability that the text under test belongs to each of the two sample classes according to the discriminant function of the naive Bayes classifier, as shown in formula 1:
$$P(C_i \mid X_u) = \frac{P(X_u \mid C_i)\,P(C_i)}{P(X_u)}$$
where $i$ is 1 or 2.
6. The device and method for auditing spam text through center vector similarity matching according to claim 5, wherein step S4 specifically comprises:
calculating the similarity between the feature vector $X_u$ of the text under test and the center vector $X_c$ of the spam sample text set, as shown in formula 2:
[formula image FDA0002513917910000032, not reproduced in this text]
and calculating the probability $P(C_2 \mid X_u)$ that the text under test is spam, that is, the spam attribute tendency value $\mathrm{score}_{spam}(X_u)$ of the text under test, as shown in formula 3:
[formula image FDA0002513917910000033, not reproduced in this text]
7. The device and method for auditing spam text through center vector similarity matching according to claim 6, wherein step S5 specifically comprises:
calculating the spam attribute tendency value of each text under test, and adding the text under test to the feedback text set if formula 4 holds:
[formula image FDA0002513917910000034, not reproduced in this text]
CN202010469767.4A 2020-05-28 2020-05-28 Spam text auditing device and method through center vector similarity matching Pending CN111651598A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010469767.4A CN111651598A (en) 2020-05-28 2020-05-28 Spam text auditing device and method through center vector similarity matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010469767.4A CN111651598A (en) 2020-05-28 2020-05-28 Spam text auditing device and method through center vector similarity matching

Publications (1)

Publication Number Publication Date
CN111651598A true CN111651598A (en) 2020-09-11

Family

ID=72343431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010469767.4A Pending CN111651598A (en) 2020-05-28 2020-05-28 Spam text auditing device and method through center vector similarity matching

Country Status (1)

Country Link
CN (1) CN111651598A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114928498A (en) * 2022-06-15 2022-08-19 中国联合网络通信集团有限公司 Fraud information identification method and device and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336766A (en) * 2013-07-04 2013-10-02 微梦创科网络科技(中国)有限公司 Short text garbage identification and modeling method and device
CN104462062A (en) * 2014-12-11 2015-03-25 珠海金山网络游戏科技有限公司 Text anti-spam method
CN107086952A (en) * 2017-04-19 2017-08-22 中国石油大学(华东) A kind of Bayesian SPAM Filtering method based on TF IDF Chinese word segmentations
CN107943941A (en) * 2017-11-23 2018-04-20 珠海金山网络游戏科技有限公司 It is a kind of can iteration renewal rubbish text recognition methods and system
CN110309297A (en) * 2018-03-16 2019-10-08 腾讯科技(深圳)有限公司 Rubbish text detection method, readable storage medium storing program for executing and computer equipment

Similar Documents

Publication Publication Date Title
CN107193959B (en) Pure text-oriented enterprise entity classification method
CN109446404B (en) Method and device for analyzing emotion polarity of network public sentiment
CN108446271B (en) Text emotion analysis method of convolutional neural network based on Chinese character component characteristics
US7689531B1 (en) Automatic charset detection using support vector machines with charset grouping
US8112484B1 (en) Apparatus and method for auxiliary classification for generating features for a spam filtering model
US7711673B1 (en) Automatic charset detection using SIM algorithm with charset grouping
CN110502626B (en) Aspect level emotion analysis method based on convolutional neural network
CN109726745B (en) Target-based emotion classification method integrating description knowledge
CN109960727B (en) Personal privacy information automatic detection method and system for unstructured text
US8560466B2 (en) Method and arrangement for automatic charset detection
CN110688479A (en) Evaluation method and sequencing network for generating abstract
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
CN111241824A (en) Method for identifying Chinese metaphor information
CN112949713A (en) Text emotion classification method based on ensemble learning of complex network
CN111177386B (en) Proposal classification method and system
CN114372475A (en) Network public opinion emotion analysis method and system based on RoBERTA model
CN115238040A (en) Steel material science knowledge graph construction method and system
CN115809887A (en) Method and device for determining main business range of enterprise based on invoice data
CN107480126B (en) Intelligent identification method for engineering material category
CN112016294B (en) Text-based news importance evaluation method and device and electronic equipment
CN111651598A (en) Spam text auditing device and method through center vector similarity matching
CN113095858A (en) Method for identifying fraud-related short text
Wibowo et al. Sentiments Analysis of Indonesian Tweet About Covid-19 Vaccine Using Support Vector Machine and Fasttext Embedding
CN110348497B (en) Text representation method constructed based on WT-GloVe word vector
Izzah et al. Modified TF-Assoc term weighting method for text classification on news dataset from twitter

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination