CN111209744A - Junk text recognition method - Google Patents

Publication number
CN111209744A
Authority
CN
China
Prior art keywords
text
keyword
recognized
spam
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010040291.2A
Other languages
Chinese (zh)
Inventor
刘星辰
陈晓峰
麻沁甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bochi Information Technology Co ltd
Original Assignee
Shanghai Bochi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bochi Information Technology Co ltd filed Critical Shanghai Bochi Information Technology Co ltd
Priority to CN202010040291.2A priority Critical patent/CN111209744A/en
Publication of CN111209744A publication Critical patent/CN111209744A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a junk text recognition method, which comprises the following steps: performing word segmentation pretreatment on a text to be recognized to obtain a feature word text; calculating the characteristic contribution ratio of each characteristic word of the text to be recognized, and recognizing the characteristic words by the junk text recognition system to obtain the characteristic contribution ratio of the text to be recognized; whether the text is a junk text can be determined through the feature contribution ratio of the text to be recognized and a preset standard, and certain preventive measures can be further taken for the determined junk text, so that adverse effects of the junk text on daily life of people can be avoided.

Description

Junk text recognition method
[ technical field ]
The invention relates to a junk text recognition method, in particular to the technical field of computer data processing.
[ background of the invention ]
With the development of Internet technology, text content has become richer and richer, and junk texts have multiplied along with it. Besides common commercial advertisements, these junk texts also contain reactionary, fraudulent, and similar information. The spread of such information not only affects people's daily lives but also endangers the safety and stability of society. It is therefore necessary to identify junk texts so that they can be filtered or deleted.
[ summary of the invention ]
The invention provides a junk text recognition method which can be applied to recognition of junk texts of mails, short messages and other Internet texts, provides help for restraining spreading of the junk texts by taking measures and meets the actual application requirements.
In order to achieve the above object, the present invention provides a method for recognizing spam texts, comprising the following steps:
step 1, performing interval sliding window word segmentation processing on a text to be recognized, and matching a word segmentation result with a word segmentation dictionary to obtain a keyword;
step 2, selecting the feature words of the text to be recognized according to the feature contribution ratio of each keyword;
step 3, comparing the feature contribution ratio of the feature words of the text to be recognized with a preset threshold value;
step 4, outputting the recognition result of the text to be recognized.
As an improvement of the above technical solution, the method for constructing the feature words in step 2 includes the following steps:
step 11, sliding on the text to be recognized through two sliding windows with the length of n, and filtering out abnormal characters inserted in the text to be recognized by means of a middle interval;
step 12, introducing a word segmentation dictionary on the basis of step 11, and obtaining a keyword result by matching against the dictionary.
As an improvement of the above technical solution, the samples are divided into junk texts and non-junk texts;
calculating the characteristic contribution ratio of all keywords of the text to be recognized, and selecting the keywords with the characteristic contribution ratio larger than a preset value as characteristic words for judging whether the text to be recognized is a junk text;
and calculating the weight of the contribution degree of the junk features of the feature words of the text to be recognized, and judging the text to be recognized as the junk text when the weight is greater than a threshold value.
As an improvement of the above technical solution, the spam text sample stores sensitive words and/or spam features and/or spam various features.
As an improvement of the above technical solution, the calculating a feature contribution ratio of each keyword of the text to be recognized specifically includes:
for each keyword of the text to be recognized, calculating the characteristic contribution ratio of the characteristic word according to formula 1:
Figure BDA0002367502100000021
where t is the keyword, R(t) is the feature contribution ratio of the keyword, C(t, C_spam) represents the contribution degree of the keyword t to the spam samples, and C(t, C_ham) represents the contribution degree of the keyword t to the non-spam samples.
As an improvement of the above technical solution, the calculating the feature contribution degree of each keyword of the text to be recognized specifically includes:
for each keyword of the text to be recognized, calculating the spam feature contribution degree of the keyword according to the following formula 2:
Figure BDA0002367502100000022
where t is the keyword, α(t, C_spam) is the word frequency factor of the keyword t, P(t | C_spam) represents the probability that the text containing the keyword belongs to the spam text category, and P(t) represents the probability that the keyword t appears in the entire sample set.
Calculating the non-spam feature contribution degree of the keyword according to the following formula 3:
Figure BDA0002367502100000031
where α(t, C_ham) is the word frequency factor of the keyword t, and P(t | C_ham) represents the probability that the text containing the keyword belongs to the non-spam text category.
As an improvement of the above technical solution, the calculating the word frequency factor of the keyword t specifically includes:
calculating the word frequency factor of the keyword according to the following formula 4:
Figure BDA0002367502100000032
In the formula, tf(t, C_i) represents the number of occurrences of the keyword t in category C_i, n represents the number of samples in category C_i, d_ij represents the j-th sample of category C_i, and tf(t, d_ij) represents the number of occurrences of the keyword t in the j-th sample of category C_i.
As an improvement of the above technical solution, the calculating the weight of the spam feature contribution degree of the text feature words to be recognized specifically includes:
calculating the spam characteristic weight of the text to be recognized according to the following formula 5:
Figure BDA0002367502100000033
in the formula, Wgt represents the spam feature weight of the text to be recognized, and m represents the number of feature words contained in the text to be recognized.
The invention has the beneficial effects that:
the invention provides a junk text recognition method, which is applied to computer text data, in particular to acquiring text data to be recognized; performing word segmentation processing on a text to be recognized; matching the word segmentation of the text to be recognized with a word segmentation dictionary to obtain a keyword; screening the keywords to obtain the characteristic words of the text to be identified; calculating according to the spam characteristic contribution degree and the non-spam characteristic contribution degree of the characteristic words to obtain a spam characteristic weight of the text to be recognized; the spam characteristic weight of the text to be recognized is compared with a preset threshold value, so that whether the text to be recognized is a spam text can be determined, a basis is provided for further processing of computer text data which is determined as the spam text, and negative effects of the spam text on daily life of people are prevented.
The features and advantages of the present invention will be described in detail by embodiments in conjunction with the accompanying drawings.
[ description of the drawings ]
FIG. 1 is a flowchart illustrating steps of an embodiment of a method for recognizing spam texts according to the present invention;
FIG. 2 is a flowchart illustrating a process of classifying spam sample texts and non-spam sample texts according to the present invention;
FIG. 3 is a flow chart of word segmentation of a text to be recognized according to the present invention;
FIG. 4 is a diagram of a segmentation model of a text to be recognized according to the present invention;
FIG. 5 is a block diagram of a structure of an embodiment of a spam text recognition method provided by the present invention.
[ detailed description ] embodiments
FIG. 1 is a flowchart of the steps of an embodiment of the spam text recognition method provided by the present invention. Although the present application presents the method steps as shown in the following embodiments or figures, the method may include more or fewer steps based on conventional or non-inventive work, and for steps with no necessary causal relationship the execution order is not limited to the order shown in the embodiments or figures of the present application.
Step 202, establishing a garbage sample text and a non-garbage sample text, as shown in fig. 2;
in the step, the junk sample text and the non-junk sample text are manually screened from the sample text by an editor and then manually marked.
After constructing the spam sample text and the non-spam sample text, spam recognition can be performed on the text to be recognized according to several constructed spam text recognition methods in a recognition stage, the specific flow is shown in fig. 3, and the specific steps comprise:
Step 302, performing interval sliding window word segmentation on the sample to be detected.
Specifically, unlike a traditional sliding window, the method links several sliding windows together to collect information from the text. The sliding window model is shown in FIG. 4.
As can be seen in FIG. 4, the model has two sliding windows. Call the first window sliding window A and the second window sliding window B. Both windows have length N, and there is a gap between them (in FIG. 4, the gap is 4 characters). When information is collected from the text data, windows A and B move rightward simultaneously, and the data in the two windows are spliced together as one feature collected by the interval sliding window model.
The interval sliding window word segmentation method slides two windows of a fixed length over the text to be recognized and uses the gap between them to filter out abnormal characters that the junk text maker has inserted into the junk text, which improves word segmentation efficiency.
In the process of performing interval sliding window word segmentation on a text to be recognized, since the text to be recognized may contain abnormal characters such as &% ¥ # and the abnormal characters have uncertainty, it is usually necessary to repeatedly slide the content of one text to be recognized to collect text information, and the flow of collecting information once is shown in fig. 3.
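The interval sliding window described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the `gap` parameter, and the idea of repeating the pass with several gap widths (one possible reading of "repeatedly slide") are assumptions introduced here.

```python
from typing import List, Set


def interval_window_features(text: str, n: int, gap: int) -> List[str]:
    """Slide two length-n windows, separated by `gap` characters, over text.

    At each offset the contents of window A and window B are spliced
    together, so the characters that fall inside the gap are skipped.
    """
    if n <= 0 or gap < 0:
        raise ValueError("n must be positive and gap must be non-negative")
    span = 2 * n + gap  # width covered by window A, the gap, and window B
    features = []
    for start in range(len(text) - span + 1):
        window_a = text[start:start + n]
        window_b = text[start + n + gap:start + span]
        features.append(window_a + window_b)
    return features


def collect_candidates(text: str, n: int, max_gap: int) -> Set[str]:
    """Repeat the pass for gap widths 0..max_gap (an assumed strategy)."""
    candidates: Set[str] = set()
    for gap in range(max_gap + 1):
        candidates.update(interval_window_features(text, n, gap))
    return candidates
```

For example, with `n=2` and `gap=1`, the obfuscated string `"fr$ee"` yields the spliced fragment `"free"`, because the `$` falls inside the gap.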
And step 402, matching the word segmentation result with a word segmentation dictionary to obtain a keyword.
Specifically, for each word segmentation result obtained after the word segmentation of the text to be recognized, the word segmentation result of the text to be recognized and the word segmentation dictionary can be matched by using a data structure such as a search tree or a character string matching algorithm in the prior art, so that all keywords in the text to be recognized are matched.
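The text names a search tree as one possible matching structure. The sketch below uses a simple character trie for that purpose; the class and function names are hypothetical, and the patent does not say which tree variant or matching policy it uses.

```python
from typing import Dict, Iterable, List


class TrieNode:
    """One node of a character trie."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


class DictionaryTrie:
    """Word segmentation dictionary stored as a trie (illustrative only)."""

    def __init__(self, words: Iterable[str]) -> None:
        self.root = TrieNode()
        for word in words:
            self.add(word)

    def add(self, word: str) -> None:
        if not word:
            return
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def find_all(self, fragment: str) -> List[str]:
        """Return every dictionary word occurring anywhere in fragment."""
        found = []
        for start in range(len(fragment)):
            node = self.root
            for end in range(start, len(fragment)):
                node = node.children.get(fragment[end])
                if node is None:
                    break
                if node.is_word:
                    found.append(fragment[start:end + 1])
        return found


def match_keywords(fragments: Iterable[str], trie: DictionaryTrie) -> List[str]:
    """Collect the distinct keywords found in the segmentation results."""
    keywords: List[str] = []
    seen = set()
    for fragment in fragments:
        for word in trie.find_all(fragment):
            if word not in seen:
                seen.add(word)
                keywords.append(word)
    return keywords
```

Returning distinct keywords in first-seen order is a design choice made for this sketch; a real system might keep per-keyword counts instead.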
Step 403, calculating keyword feature contribution ratios of the keywords according to the keywords, the constructed spam text samples and the constructed non-spam text samples, and screening out the keywords with the feature contribution ratios larger than a preset threshold value as feature words, wherein the specific steps include:
calculating the word frequency factor of the keyword according to the following formula:
Figure BDA0002367502100000061
In the formula, tf(t, C_i) represents the number of occurrences of the keyword t in category C_i, n represents the number of samples in category C_i, d_ij represents the j-th sample of category C_i, and tf(t, d_ij) represents the number of occurrences of the keyword t in the j-th sample of category C_i.
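The word frequency factor formula is an image that is not reproduced in this text, so the sketch below computes only the quantities it is defined over. It assumes that each sample is already a list of matched keywords and that tf(t, C_i) equals the sum of tf(t, d_ij) over the n samples of the category, which the variable definitions suggest but do not state. The function names are hypothetical.

```python
from typing import List, Sequence


def tf_in_sample(keyword: str, sample_keywords: Sequence[str]) -> int:
    """tf(t, d_ij): occurrences of the keyword in one sample."""
    return sum(1 for k in sample_keywords if k == keyword)


def tf_in_category(keyword: str, category_samples: List[Sequence[str]]) -> int:
    """tf(t, C_i): occurrences summed over all samples of the category.

    Assumption: the category count is the sum of the per-sample counts.
    """
    return sum(tf_in_sample(keyword, sample) for sample in category_samples)


spam_samples = [["free", "prize", "free"], ["free"]]
per_sample = [tf_in_sample("free", s) for s in spam_samples]
total = tf_in_category("free", spam_samples)
```

Here `per_sample` holds the counts for each of the n = 2 samples and `total` holds the category-level count that the word frequency factor would take as input.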
For each keyword of the text to be recognized, calculating the spam feature contribution degree of the keyword according to the following formula:
Figure BDA0002367502100000062
where t is the keyword, α(t, C_spam) is the word frequency factor of the keyword t, P(t | C_spam) represents the probability that the text containing the keyword belongs to the spam text category, and P(t) represents the probability that the keyword t appears in the entire sample set.
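The probabilities used in the contribution formulas have to be estimated from the labeled samples. The sketch below is one possible estimator, not the patent's: it assumes document-level frequencies, reading P(t) as the fraction of all samples that contain t and P(t | C) as the fraction of samples in category C that contain t. The patent's wording for P(t | C) is ambiguous, so this reading is an assumption, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def doc_probability(keyword: str, samples: List[Sequence[str]]) -> float:
    """Fraction of samples that contain the keyword at least once."""
    if not samples:
        return 0.0
    return sum(1 for sample in samples if keyword in sample) / len(samples)


def estimate_probabilities(
    keyword: str,
    spam_samples: List[Sequence[str]],
    ham_samples: List[Sequence[str]],
) -> Dict[str, float]:
    """Estimate P(t), P(t | C_spam) and P(t | C_ham) by document frequency."""
    return {
        "p_t": doc_probability(keyword, spam_samples + ham_samples),
        "p_t_given_spam": doc_probability(keyword, spam_samples),
        "p_t_given_ham": doc_probability(keyword, ham_samples),
    }


spam = [["free", "prize"], ["free"]]
ham = [["hello"], ["free", "hello"]]
probs = estimate_probabilities("free", spam, ham)
```

With these toy samples, "free" appears in 3 of 4 samples overall, in both spam samples, and in 1 of 2 non-spam samples.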
Step 404, calculating the non-spam feature contribution degree of the keyword according to the following formula:
Figure BDA0002367502100000063
where α(t, C_ham) is the word frequency factor of the keyword t, and P(t | C_ham) represents the probability that the text containing the keyword belongs to the non-spam text category.
For each keyword of the text to be recognized, calculating the characteristic contribution ratio of the characteristic word according to the following formula:
Figure BDA0002367502100000064
where t is the keyword, R(t) is the feature contribution ratio of the keyword, C(t, C_spam) represents the contribution degree of the keyword t to the spam samples, and C(t, C_ham) represents the contribution degree of the keyword t to the non-spam samples.
Step 405, if the characteristic contribution ratio R (t) of the keyword t is greater than a preset threshold value, taking the keyword t as a characteristic word of the text to be recognized; and if the characteristic contribution ratio R (t) of the keyword t is smaller than a preset threshold value, not taking the keyword t as a characteristic word of the text to be recognized.
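The selection rule in step 405 can be sketched as a simple filter. The sketch takes the ratios R(t) as already computed, because the ratio formula is not reproduced in this text. The function name and the descending sort are assumptions, and a keyword whose ratio exactly equals the threshold is excluded here because the source does not say how that case is handled.

```python
from typing import Dict, List


def select_feature_words(ratios: Dict[str, float], threshold: float) -> List[str]:
    """Keep keywords whose feature contribution ratio R(t) exceeds threshold.

    Results are ordered by descending ratio, then alphabetically, so the
    output is deterministic.
    """
    kept = [(t, r) for t, r in ratios.items() if r > threshold]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [t for t, _ in kept]


ratios = {"free": 3.2, "hello": 0.4, "prize": 5.0, "edge": 1.0}
feature_words = select_feature_words(ratios, threshold=1.0)
```

With the illustrative values above, "prize" and "free" are kept, while "hello" and the borderline "edge" are dropped.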
Step 406, calculating the spam feature weight of the text to be recognized according to the spam contribution degree and the non-spam contribution degree of the text feature words to be recognized, and judging whether the text to be recognized is a spam text according to a comparison result of the spam feature weight of the text to be recognized and a preset threshold value, wherein the method specifically comprises the following steps:
calculating the spam characteristic weight of the text to be recognized according to the following formula:
Figure BDA0002367502100000071
in the formula, Wgt represents the spam feature weight of the text to be recognized, and m represents the number of feature words contained in the text to be recognized.
If the spam characteristic weight Wgt of the text to be recognized is larger than a preset threshold value, judging that the text to be recognized belongs to a spam text; and if the spam characteristic weight Wgt of the text to be recognized is smaller than a preset threshold value, judging that the text to be recognized belongs to a non-spam text.
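The final decision in step 406 reduces to comparing Wgt with the preset threshold. Because the weight formula is an image that is not reproduced here, the sketch below takes the weight function as a parameter. `placeholder_weight` is a stand-in invented for illustration and is NOT the patent's formula 5; the function names and the handling of an empty feature word list are also assumptions.

```python
from typing import Callable, Sequence


def placeholder_weight(scores: Sequence[float]) -> float:
    """Stand-in for the weight formula: the mean of per-feature-word scores.

    This is NOT the patent's formula; it only makes the sketch runnable.
    """
    return sum(scores) / len(scores) if scores else 0.0


def classify_text(
    feature_scores: Sequence[float],
    threshold: float,
    weight_fn: Callable[[Sequence[float]], float] = placeholder_weight,
) -> str:
    """Return "spam" when the weight Wgt exceeds the preset threshold."""
    wgt = weight_fn(feature_scores)
    return "spam" if wgt > threshold else "non-spam"


label_high = classify_text([0.9, 0.7], threshold=0.5)
label_empty = classify_text([], threshold=0.5)
```

Passing the weight function in keeps the threshold comparison separate from the scoring formula, so the real formula can be substituted without changing the decision step.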
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (8)

1. A junk text recognition method is characterized by comprising the following steps:
step 1, performing interval sliding window word segmentation processing on a text to be recognized, and matching a word segmentation result with a word segmentation dictionary to obtain a keyword;
step 2, selecting the feature words of the text to be recognized according to the feature contribution ratio of each keyword;
step 3, comparing the feature contribution ratio of the feature words of the text to be recognized with a preset threshold value;
step 4, outputting the recognition result of the text to be recognized.
2. The spam text recognition method of claim 1, wherein: the construction method of the feature words in the step 2 comprises the following steps:
step 11, sliding on the text to be recognized through two sliding windows with the length of n, and filtering out abnormal characters inserted in the text to be recognized by means of a middle interval;
step 12, introducing a word segmentation dictionary on the basis of step 11, and obtaining a keyword result by matching against the dictionary.
3. The spam text recognition method of claim 1, wherein: dividing the sample into junk text and non-junk text;
calculating the characteristic contribution ratio of all keywords of the text to be recognized, and selecting the keywords with the characteristic contribution ratio larger than a preset value as characteristic words for judging whether the text to be recognized is a junk text;
and calculating the weight of the contribution degree of the junk features of the feature words of the text to be recognized, and judging the text to be recognized as the junk text when the weight is greater than a threshold value.
4. The spam text recognition method of claim 3, wherein: the junk text sample stores sensitive words and/or junk features and/or junk various features.
5. The spam text recognition method of claim 1, wherein: the calculating of the feature contribution ratio of each keyword of the text to be recognized specifically comprises:
for each keyword of the text to be recognized, calculating the characteristic contribution ratio of the characteristic word according to formula 1:
Figure FDA0002367502090000021
where t is the keyword, R(t) is the feature contribution ratio of the keyword, C(t, C_spam) represents the contribution degree of the keyword t to the spam samples, and C(t, C_ham) represents the contribution degree of the keyword t to the non-spam samples.
6. The spam text recognition method of claim 1, wherein: the calculating the feature contribution degree of each keyword of the text to be recognized specifically comprises:
for each keyword of the text to be recognized, calculating the spam feature contribution degree of the keyword according to the following formula 2:
Figure FDA0002367502090000022
where t is the keyword, α(t, C_spam) is the word frequency factor of the keyword t, P(t | C_spam) represents the probability that the text containing the keyword belongs to the spam text category, and P(t) represents the probability that the keyword t appears in the entire sample set.
Calculating the non-spam feature contribution degree of the keyword according to the following formula 3:
Figure FDA0002367502090000023
where α(t, C_ham) is the word frequency factor of the keyword t, and P(t | C_ham) represents the probability that the text containing the keyword belongs to the non-spam text category.
7. The spam text recognition method of claim 5, wherein: the calculating of the word frequency factor of the keyword t specifically includes:
calculating the word frequency factor of the keyword according to the following formula 4:
Figure FDA0002367502090000024
In the formula, tf(t, C_i) represents the number of occurrences of the keyword t in category C_i, n represents the number of samples in category C_i, d_ij represents the j-th sample of category C_i, and tf(t, d_ij) represents the number of occurrences of the keyword t in the j-th sample of category C_i.
8. The spam text recognition method of claim 3, wherein: the weight for calculating the spam feature contribution degree of the text feature words to be recognized specifically comprises the following steps:
calculating the spam characteristic weight of the text to be recognized according to the following formula 5:
Figure FDA0002367502090000031
in the formula, Wgt represents the spam feature weight of the text to be recognized, and m represents the number of feature words contained in the text to be recognized.
CN202010040291.2A 2020-03-25 2020-03-25 Junk text recognition method Pending CN111209744A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010040291.2A CN111209744A (en) 2020-03-25 2020-03-25 Junk text recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010040291.2A CN111209744A (en) 2020-03-25 2020-03-25 Junk text recognition method

Publications (1)

Publication Number Publication Date
CN111209744A true CN111209744A (en) 2020-05-29

Family

ID=70786080

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010040291.2A Pending CN111209744A (en) 2020-03-25 2020-03-25 Junk text recognition method

Country Status (1)

Country Link
CN (1) CN111209744A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324745A (en) * 2013-07-04 2013-09-25 微梦创科网络科技(中国)有限公司 Text garbage identifying method and system based on Bayesian model
CN106610949A (en) * 2016-09-29 2017-05-03 四川用联信息技术有限公司 Text feature extraction method based on semantic analysis
CN106708961A (en) * 2016-11-30 2017-05-24 北京粉笔蓝天科技有限公司 Junk text library establishing method and system and junk text filtering method
CN106909628A (en) * 2017-01-24 2017-06-30 南京大学 A kind of text similarity method based on interval
CN109684639A (en) * 2018-12-24 2019-04-26 北京奇虎科技有限公司 Short message recognition methods, device and electronic equipment
CN110334216A (en) * 2019-07-12 2019-10-15 福建省趋普物联科技有限公司 A kind of rubbish text recognition methods and system

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
余良琨;黄立勤;: "基于深度特征K-平均字典的场景识别", 微型机与应用, no. 13 *
李润川;昝红英;申圣亚;毕银龙;张中军;: "基于多特征融合的垃圾短信识别", 山东大学学报(理学版), vol. 52, no. 07 *
林海波;李扬;张毅;罗元;: "基于时序分析的人体运动模式的识别及应用", 计算机应用与软件, no. 12 *
王禾清: "《改进的互信息特征选择方法在垃圾邮件检测中的应用》", vol. 13, no. 14, pages 163 - 166 *
袁志坚;王乐;田李;贾焰;杨树强;: "数据流突发检测研究与进展", 计算机工程与应用, no. 21 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200529