CN111209744A - Junk text recognition method - Google Patents
- Publication number
- CN111209744A (application number CN202010040291.2A)
- Authority
- CN
- China
- Prior art keywords
- text
- keyword
- recognized
- spam
- calculating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a junk text recognition method comprising the following steps: performing word segmentation preprocessing on a text to be recognized to obtain its feature words; calculating the feature contribution ratio of each feature word; and deriving from these the feature contribution ratio of the text to be recognized. Whether the text is junk text is determined by comparing the feature contribution ratio of the text with a preset standard, and preventive measures can then be taken against text identified as junk, so that its adverse effects on people's daily life are avoided.
Description
[ technical field ]
The invention relates to a junk text recognition method and belongs to the technical field of computer data processing.
[ background of the invention ]
With the development of Internet technology, text content has become ever richer, and junk text has grown along with it. Besides common commercial advertisements, junk text also carries reactionary content, fraud, and similar information. The spread of such information not only disrupts people's daily life but also endangers social safety and stability. It is therefore necessary to identify junk text so that it can be filtered or deleted.
[ summary of the invention ]
The invention provides a junk text recognition method that can be applied to junk text in e-mail, short messages, and other Internet text. It supports measures for restraining the spread of junk text and meets practical application requirements.
In order to achieve the above object, the present invention provides a junk text recognition method comprising the following steps:
step 1, performing interval sliding window word segmentation on a text to be recognized, and matching the word segmentation result with a word segmentation dictionary to obtain keywords;
step 2, selecting the feature words of the text to be recognized according to the feature contribution ratio of each keyword;
step 3, comparing the feature contribution ratio of the feature words of the text to be recognized with a preset threshold value;
step 4, outputting the recognition result of the text to be recognized.
As an improvement of the above technical solution, the method for constructing the feature words in step 2 includes the following steps:
step 11, sliding two sliding windows of length n over the text to be recognized, and using the gap between the two windows to filter out abnormal characters inserted in the text to be recognized;
step 12, introducing a word segmentation dictionary on the basis of step 11, and obtaining the keyword results by matching against the dictionary.
As an improvement of the above technical solution, the method further comprises: dividing the samples into junk texts and non-junk texts;
calculating the characteristic contribution ratio of all keywords of the text to be recognized, and selecting the keywords with the characteristic contribution ratio larger than a preset value as characteristic words for judging whether the text to be recognized is a junk text;
and calculating the weight of the contribution degree of the junk features of the feature words of the text to be recognized, and judging the text to be recognized as the junk text when the weight is greater than a threshold value.
As an improvement of the above technical solution, the spam text samples store sensitive words and/or various spam features.
As an improvement of the above technical solution, the calculating a feature contribution ratio of each keyword of the text to be recognized specifically includes:
for each keyword of the text to be recognized, calculating the feature contribution ratio of the keyword according to formula 1:
where $t$ is the keyword, $R(t)$ is the feature contribution ratio of the keyword, $C(t, C_{spam})$ represents the contribution degree of the keyword $t$ to the spam samples, and $C(t, C_{ham})$ represents the contribution degree of the keyword $t$ to the non-spam samples.
As an improvement of the above technical solution, the calculating the feature contribution degree of each keyword of the text to be recognized specifically includes:
for each keyword of the text to be recognized, calculating the spam feature contribution degree of the keyword according to the following formula 2:
where $t$ is the keyword, $\alpha(t, C_{spam})$ is the word frequency factor of the keyword $t$, $P(t \mid C_{spam})$ represents the probability that a text containing the keyword belongs to the spam text category, and $P(t)$ represents the probability that the keyword $t$ appears in the entire sample set.
Calculating the non-spam feature contribution degree of the keyword according to the following formula 3:
where $\alpha(t, C_{ham})$ is the word frequency factor of the keyword $t$, and $P(t \mid C_{ham})$ represents the probability that a text containing the keyword belongs to the non-spam text category.
As an improvement of the above technical solution, the calculating the word frequency factor of the keyword t specifically includes:
calculating the word frequency factor of the keyword according to the following formula 4:
where $tf(t, C_i)$ represents the number of occurrences of the keyword $t$ in category $C_i$, $n$ represents the number of samples in category $C_i$, $d_{ij}$ represents the $j$-th sample of category $C_i$, and $tf(t, d_{ij})$ represents the number of occurrences of the keyword $t$ in the $j$-th sample of category $C_i$.
As an improvement of the above technical solution, calculating the weight of the spam feature contribution degree of the feature words of the text to be recognized specifically includes:
calculating the spam characteristic weight of the text to be recognized according to the following formula 5:
in the formula, Wgt represents the spam feature weight of the text to be recognized, and m represents the number of feature words contained in the text to be recognized.
The invention has the beneficial effects that:
The invention provides a junk text recognition method applied to computer text data. Specifically, the method acquires the text data to be recognized; performs word segmentation on the text to be recognized; matches the segmentation results against a word segmentation dictionary to obtain keywords; screens the keywords to obtain the feature words of the text to be recognized; and calculates the spam feature weight of the text to be recognized from the spam feature contribution degree and the non-spam feature contribution degree of the feature words. Comparing the spam feature weight with a preset threshold value determines whether the text to be recognized is spam text, which provides a basis for further processing of text data identified as spam and prevents spam text from negatively affecting people's daily life.
The features and advantages of the present invention will be described in detail by embodiments in conjunction with the accompanying drawings.
[ description of the drawings ]
FIG. 1 is a flowchart illustrating steps of an embodiment of a method for recognizing spam texts according to the present invention;
FIG. 2 is a flowchart illustrating a process of classifying spam sample texts and non-spam sample texts according to the present invention;
FIG. 3 is a flow chart of word segmentation of a text to be recognized according to the present invention;
FIG. 4 is a diagram of a segmentation model of a text to be recognized according to the present invention;
FIG. 5 is a block diagram of the structure of an embodiment of the spam text recognition method provided by the present invention.
[ detailed description ]
FIG. 1 is a flowchart of the steps of an embodiment of the spam text recognition method according to the present invention. Although the present application provides the method steps shown in the following embodiments or figures, the method may include more or fewer steps on the basis of conventional or non-inventive work, and steps that have no necessary causal relationship are not limited to the execution order shown in the embodiments or figures.
In this step, the spam sample texts and the non-spam sample texts are manually screened from the sample texts by an editor and then manually labeled.
After the spam and non-spam sample texts have been constructed, spam recognition can be performed on the text to be recognized in the recognition stage according to the constructed spam text recognition method. The specific flow is shown in FIG. 3 and comprises the following steps:
Specifically, unlike a traditional sliding window, the method collects information from the text using several linked sliding windows. The sliding window model is shown in FIG. 4.
As can be seen in FIG. 4, the model has two sliding windows. Call the front window sliding window A and the rear window sliding window B. Both windows have length N, and there is a gap between them (in FIG. 4 the gap is 4 characters). When information is collected from the text data, windows A and B move to the right simultaneously, and the data in the two windows are spliced together as one feature collected by the interval sliding window model.
The word segmentation method based on the interval sliding window slides two windows of a fixed length over the text to be recognized and uses the gap between them to filter out abnormal characters that the spam author has inserted into the text, which improves word segmentation efficiency.
During interval sliding window word segmentation, the text to be recognized may contain abnormal characters such as &% ¥ #, and the position and number of these characters are uncertain. It is therefore usually necessary to slide over the content of one text repeatedly to collect the text information; the flow of a single collection pass is shown in FIG. 3.
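The two-window model described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names `interval_window_features` and `candidates`, the `gap` parameter, and the idea of sweeping the gap width from 0 to `max_gap` on repeated passes are all assumptions introduced here to make the description concrete.

```python
from typing import Iterator, Set, Tuple


def interval_window_features(text: str, n: int, gap: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, feature) pairs from two linked windows of length n.

    Window A covers text[i : i+n]; window B covers text[i+n+gap : i+2n+gap].
    The `gap` characters between the windows are skipped, and the contents
    of A and B are spliced into one candidate feature.
    """
    if n <= 0 or gap < 0:
        raise ValueError("n must be positive and gap non-negative")
    span = 2 * n + gap  # total width covered by A, the gap, and B
    for i in range(len(text) - span + 1):
        window_a = text[i:i + n]
        window_b = text[i + n + gap:i + span]
        yield i, window_a + window_b


def candidates(text: str, n: int, max_gap: int) -> Set[str]:
    """Collect spliced features over repeated passes (hypothetical helper).

    Assumption: each pass uses a different gap width, because the number of
    inserted abnormal characters is not known in advance.
    """
    found: Set[str] = set()
    for gap in range(max_gap + 1):
        for _, feature in interval_window_features(text, n, gap):
            found.add(feature)
    return found
```

For example, with `n=2` and `gap=2`, the text `"fr&%ee"` yields the single spliced feature `"free"`, because the two abnormal characters fall entirely inside the gap.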
Step 402, matching the word segmentation results with a word segmentation dictionary to obtain keywords.
Specifically, each word segmentation result obtained from the text to be recognized can be matched against the word segmentation dictionary using an existing data structure such as a search tree, or an existing string matching algorithm, so that all keywords in the text to be recognized are found.
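As one way to picture the dictionary matching in step 402, the sketch below stores the word segmentation dictionary in a trie (a kind of search tree) and keeps only the candidates that are complete dictionary entries. The class `Trie`, the function `match_keywords`, and the sample dictionary words are hypothetical; the text does not prescribe a particular data structure.

```python
from typing import Dict, Iterable, List


class Trie:
    """Minimal prefix tree holding the word segmentation dictionary."""

    _END = "$end"  # multi-character sentinel, so it cannot collide with a single-character key

    def __init__(self, words: Iterable[str]) -> None:
        self.root: Dict = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[self._END] = True  # marks the end of a complete dictionary word

    def contains(self, candidate: str) -> bool:
        """Return True only if `candidate` is a complete dictionary word."""
        node = self.root
        for ch in candidate:
            if ch not in node:
                return False
            node = node[ch]
        return self._END in node


def match_keywords(candidates: Iterable[str], dictionary: Trie) -> List[str]:
    """Keep the segmentation results found in the dictionary, without repeats."""
    seen = set()
    keywords: List[str] = []
    for candidate in candidates:
        if candidate not in seen and dictionary.contains(candidate):
            seen.add(candidate)
            keywords.append(candidate)
    return keywords
```

A plain Python `set` would give the same membership results; a trie is shown only because the text mentions a search tree.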
Calculating the word frequency factor of the keyword according to the following formula:
where $tf(t, C_i)$ represents the number of occurrences of the keyword $t$ in category $C_i$, $n$ represents the number of samples in category $C_i$, $d_{ij}$ represents the $j$-th sample of category $C_i$, and $tf(t, d_{ij})$ represents the number of occurrences of the keyword $t$ in the $j$-th sample of category $C_i$.
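The counts named in these definitions can be computed as in the sketch below. By definition, the class-level count $tf(t, C_i)$ is the sum of the per-sample counts $tf(t, d_{ij})$ over the $n$ samples of the category. The normalisation in `frequency_factor` is an assumption made purely for illustration (the share of a keyword's occurrences that fall in one category); it is not taken from the text, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence

Sample = Sequence[str]  # one sample d_ij, already segmented into keywords


def class_term_frequency(samples: List[Sample]) -> Counter:
    """tf(t, C_i): sum of tf(t, d_ij) over the n samples of one category."""
    total: Counter = Counter()
    for sample in samples:
        total.update(Counter(sample))  # Counter(sample)[t] is tf(t, d_ij)
    return total


def frequency_factor(t: str, tf_by_class: Dict[str, Counter], cls: str) -> float:
    """ASSUMED word frequency factor: share of t's occurrences found in `cls`."""
    total = sum(tf[t] for tf in tf_by_class.values())
    return tf_by_class[cls][t] / total if total else 0.0
```

With spam samples `[["free", "prize", "free"], ["free"]]` and one non-spam sample `["meeting", "free"]`, the keyword `"free"` occurs 3 times in the spam category and once in the non-spam category, so the assumed factor for the spam category is 0.75.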
For each keyword of the text to be recognized, calculating the spam feature contribution degree of the keyword according to the following formula:
where $t$ is the keyword, $\alpha(t, C_{spam})$ is the word frequency factor of the keyword $t$, $P(t \mid C_{spam})$ represents the probability that a text containing the keyword belongs to the spam text category, and $P(t)$ represents the probability that the keyword $t$ appears in the entire sample set.
Step 404, calculating the non-spam feature contribution degree of the keyword according to the following formula:
where $\alpha(t, C_{ham})$ is the word frequency factor of the keyword $t$, and $P(t \mid C_{ham})$ represents the probability that a text containing the keyword belongs to the non-spam text category.
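A rough sketch of how a contribution degree could be assembled from the three quantities named above follows. The combination used here, $\alpha \cdot P(t \mid C) / P(t)$, is an assumption chosen for illustration, not the formula of the patent. Estimating both probabilities as the fraction of samples that contain the keyword is likewise an assumption, and `doc_probability` and `contribution` are hypothetical names.

```python
from typing import List, Sequence

Sample = Sequence[str]  # one sample, already segmented into keywords


def doc_probability(t: str, samples: List[Sample]) -> float:
    """Fraction of samples that contain keyword t (assumed estimator)."""
    if not samples:
        return 0.0
    return sum(1 for sample in samples if t in sample) / len(samples)


def contribution(t: str, alpha: float,
                 class_samples: List[Sample],
                 all_samples: List[Sample]) -> float:
    """ASSUMED contribution degree: alpha * P(t | C) / P(t).

    alpha         -- word frequency factor of t for this category
    class_samples -- samples of the category (spam or non-spam)
    all_samples   -- the entire sample set
    """
    p_t = doc_probability(t, all_samples)
    if p_t == 0.0:
        return 0.0  # keyword never seen in the sample set: no evidence either way
    return alpha * doc_probability(t, class_samples) / p_t
```

The same function serves both categories: pass the spam samples to obtain a spam contribution degree and the non-spam samples to obtain a non-spam contribution degree.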
For each keyword of the text to be recognized, calculating the feature contribution ratio of the keyword according to the following formula:
where $t$ is the keyword, $R(t)$ is the feature contribution ratio of the keyword, $C(t, C_{spam})$ represents the contribution degree of the keyword $t$ to the spam samples, and $C(t, C_{ham})$ represents the contribution degree of the keyword $t$ to the non-spam samples.
Calculating the spam feature weight of the text to be recognized according to the following formula:
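To illustrate how a contribution ratio can drive feature word selection, the sketch below assumes $R(t) = C(t, C_{spam}) / (C(t, C_{spam}) + C(t, C_{ham}))$ with non-negative contribution degrees. That form is an assumption for illustration only, and `contribution_ratio`, `select_feature_words`, and `min_ratio` are hypothetical names.

```python
from typing import Dict, List, Tuple


def contribution_ratio(c_spam: float, c_ham: float) -> float:
    """ASSUMED ratio R(t) = C_spam / (C_spam + C_ham), a value in [0, 1]."""
    total = c_spam + c_ham
    return c_spam / total if total > 0 else 0.0


def select_feature_words(contributions: Dict[str, Tuple[float, float]],
                         min_ratio: float) -> List[str]:
    """Keep keywords whose ratio exceeds the preset value `min_ratio`.

    `contributions` maps each keyword to (C(t, C_spam), C(t, C_ham)).
    """
    return [t for t, (c_spam, c_ham) in contributions.items()
            if contribution_ratio(c_spam, c_ham) > min_ratio]
```

Under this assumed form a ratio near 1 means the keyword contributes almost only to spam samples, so a preset value such as 0.6 keeps the keywords that lean clearly toward spam.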
in the formula, Wgt represents the spam feature weight of the text to be recognized, and m represents the number of feature words contained in the text to be recognized.
If the spam characteristic weight Wgt of the text to be recognized is larger than a preset threshold value, judging that the text to be recognized belongs to a spam text; and if the spam characteristic weight Wgt of the text to be recognized is smaller than a preset threshold value, judging that the text to be recognized belongs to a non-spam text.
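The threshold decision can be sketched as follows. Taking $Wgt$ as the mean feature contribution ratio over the $m$ feature words is an assumption for illustration, not the formula of the patent. The text also leaves the case where $Wgt$ equals the threshold undefined; this sketch treats it as non-spam, which is a further assumption. The names `spam_weight` and `classify` are hypothetical.

```python
from typing import Sequence


def spam_weight(ratios: Sequence[float]) -> float:
    """ASSUMED Wgt: mean feature contribution ratio over the m feature words."""
    m = len(ratios)
    return sum(ratios) / m if m else 0.0


def classify(ratios: Sequence[float], threshold: float) -> str:
    """Return "spam" when Wgt exceeds the threshold, otherwise "non-spam".

    Assumption: a weight exactly equal to the threshold counts as non-spam.
    """
    return "spam" if spam_weight(ratios) > threshold else "non-spam"
```

For instance, feature word ratios of 0.75 and 1.0 give a weight of 0.875, which exceeds a threshold of 0.8, so the text would be judged spam under these assumptions.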
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (8)
1. A junk text recognition method is characterized by comprising the following steps:
step 1, performing interval sliding window word segmentation processing on a text to be recognized, and matching a word segmentation result with a word segmentation dictionary to obtain a keyword;
step 2, selecting the feature words of the text to be recognized according to the feature contribution ratio of each keyword;
step 3, comparing the feature contribution ratio of the feature words of the text to be recognized with a preset threshold value;
step 4, outputting the recognition result of the text to be recognized.
2. The spam text recognition method of claim 1, wherein: the construction method of the feature words in the step 2 comprises the following steps:
step 11, sliding two sliding windows of length n over the text to be recognized, and using the gap between the two windows to filter out abnormal characters inserted in the text to be recognized;
step 12, introducing a word segmentation dictionary on the basis of step 11, and obtaining the keyword results by matching against the dictionary.
3. The spam text recognition method of claim 1, wherein: dividing the sample into junk text and non-junk text;
calculating the characteristic contribution ratio of all keywords of the text to be recognized, and selecting the keywords with the characteristic contribution ratio larger than a preset value as characteristic words for judging whether the text to be recognized is a junk text;
and calculating the weight of the contribution degree of the junk features of the feature words of the text to be recognized, and judging the text to be recognized as the junk text when the weight is greater than a threshold value.
4. The spam text recognition method of claim 3, wherein: the junk text samples store sensitive words and/or various junk features.
5. The spam text recognition method of claim 1, wherein: the calculating of the feature contribution ratio of each keyword of the text to be recognized specifically comprises:
for each keyword of the text to be recognized, calculating the feature contribution ratio of the keyword according to formula 1:
where $t$ is the keyword, $R(t)$ is the feature contribution ratio of the keyword, $C(t, C_{spam})$ represents the contribution degree of the keyword $t$ to the spam samples, and $C(t, C_{ham})$ represents the contribution degree of the keyword $t$ to the non-spam samples.
6. The spam text recognition method of claim 1, wherein: the calculating the feature contribution degree of each keyword of the text to be recognized specifically comprises:
for each keyword of the text to be recognized, calculating the spam feature contribution degree of the keyword according to the following formula 2:
where $t$ is the keyword, $\alpha(t, C_{spam})$ is the word frequency factor of the keyword $t$, $P(t \mid C_{spam})$ represents the probability that a text containing the keyword belongs to the spam text category, and $P(t)$ represents the probability that the keyword $t$ appears in the entire sample set.
Calculating the non-spam feature contribution degree of the keyword according to the following formula 3:
where $\alpha(t, C_{ham})$ is the word frequency factor of the keyword $t$, and $P(t \mid C_{ham})$ represents the probability that a text containing the keyword belongs to the non-spam text category.
7. The spam text recognition method of claim 5, wherein: the calculating of the word frequency factor of the keyword t specifically includes:
calculating the word frequency factor of the keyword according to the following formula 4:
where $tf(t, C_i)$ represents the number of occurrences of the keyword $t$ in category $C_i$, $n$ represents the number of samples in category $C_i$, $d_{ij}$ represents the $j$-th sample of category $C_i$, and $tf(t, d_{ij})$ represents the number of occurrences of the keyword $t$ in the $j$-th sample of category $C_i$.
8. The spam text recognition method of claim 3, wherein: calculating the weight of the spam feature contribution degree of the feature words of the text to be recognized specifically comprises:
calculating the spam characteristic weight of the text to be recognized according to the following formula 5:
in the formula, Wgt represents the spam feature weight of the text to be recognized, and m represents the number of feature words contained in the text to be recognized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010040291.2A CN111209744A (en) | 2020-03-25 | 2020-03-25 | Junk text recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010040291.2A CN111209744A (en) | 2020-03-25 | 2020-03-25 | Junk text recognition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111209744A (en) | 2020-05-29 |
Family
ID=70786080
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010040291.2A Pending CN111209744A (en) | 2020-03-25 | 2020-03-25 | Junk text recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111209744A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103324745A (en) * | 2013-07-04 | 2013-09-25 | 微梦创科网络科技(中国)有限公司 | Text garbage identifying method and system based on Bayesian model |
CN106610949A (en) * | 2016-09-29 | 2017-05-03 | 四川用联信息技术有限公司 | Text feature extraction method based on semantic analysis |
CN106708961A (en) * | 2016-11-30 | 2017-05-24 | 北京粉笔蓝天科技有限公司 | Junk text library establishing method and system and junk text filtering method |
CN106909628A (en) * | 2017-01-24 | 2017-06-30 | 南京大学 | A kind of text similarity method based on interval |
CN109684639A (en) * | 2018-12-24 | 2019-04-26 | 北京奇虎科技有限公司 | Short message recognition methods, device and electronic equipment |
CN110334216A (en) * | 2019-07-12 | 2019-10-15 | 福建省趋普物联科技有限公司 | A kind of rubbish text recognition methods and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200529 |