CN111209744A

CN111209744A - Junk text recognition method

Info

Publication number: CN111209744A
Application number: CN202010040291.2A
Authority: CN
Inventors: 刘星辰; 陈晓峰; 麻沁甜
Original assignee: Shanghai Bochi Information Technology Co ltd
Current assignee: Shanghai Bochi Information Technology Co ltd
Priority date: 2020-03-25
Filing date: 2020-03-25
Publication date: 2020-05-29

Abstract

The invention discloses a junk text recognition method comprising the following steps: performing word segmentation preprocessing on a text to be recognized to obtain its feature words; calculating the feature contribution ratio of each feature word of the text to be recognized, from which the junk text recognition system obtains the feature contribution ratio of the text to be recognized; and determining whether the text is a junk text from that feature contribution ratio and a preset standard. Preventive measures can then be taken against a text determined to be junk, so that adverse effects of junk text on people's daily life are avoided.

Description

Junk text recognition method

[ technical field ]

The invention belongs to the technical field of computer data processing, and in particular relates to a junk text recognition method.

[ background of the invention ]

With the development of Internet technology, text content has become richer and richer, and junk text has grown along with it. Besides common commercial advertisements, junk texts also contain reactionary, fraudulent and similar information. The spread of such information not only affects people's daily life but also endangers the safety and stability of society. Therefore, junk texts need to be identified so that they can be filtered or deleted.

[ summary of the invention ]

The invention provides a junk text recognition method which can be applied to the recognition of junk text in mails, short messages and other Internet texts, helps measures to be taken to restrain the spread of junk text, and meets practical application requirements.

In order to achieve the above object, the present invention provides a method for recognizing spam texts, comprising the following steps:

step 1, performing interval sliding window word segmentation processing on a text to be recognized, and matching a word segmentation result with a word segmentation dictionary to obtain a keyword;

step 2, selecting the feature words of the text to be recognized according to the feature contribution ratio of each keyword;

step 3, comparing the feature contribution ratio of the feature words of the text to be recognized with a preset threshold value;

step 4, outputting the recognition result of the text to be recognized.

As an improvement of the above technical solution, the method for constructing the feature words in step 2 includes the following steps:

step 11, sliding on the text to be recognized through two sliding windows with the length of n, and filtering out abnormal characters inserted in the text to be recognized by means of a middle interval;

step 12, introducing a word segmentation dictionary on the basis of step 11, and obtaining the keywords by matching the word segmentation result against the dictionary.

As an improvement of the above technical solution, the samples are divided into junk text samples and non-junk text samples;

calculating the characteristic contribution ratio of all keywords of the text to be recognized, and selecting the keywords with the characteristic contribution ratio larger than a preset value as characteristic words for judging whether the text to be recognized is a junk text;

and calculating the weight of the contribution degree of the junk features of the feature words of the text to be recognized, and judging the text to be recognized as the junk text when the weight is greater than a threshold value.

As an improvement of the above technical solution, the spam text sample stores sensitive words and/or spam features and/or spam various features.

As an improvement of the above technical solution, the calculating a feature contribution ratio of each keyword of the text to be recognized specifically includes:

for each keyword of the text to be recognized, calculating the characteristic contribution ratio of the characteristic word according to formula 1:

where t is the keyword, R(t) is the feature contribution ratio of the keyword, C(t, C_spam) represents the contribution degree of the keyword t to the spam samples, and C(t, C_ham) represents the contribution degree of the keyword t to the non-spam samples.

As an improvement of the above technical solution, the calculating the feature contribution degree of each keyword of the text to be recognized specifically includes:

for each keyword of the text to be recognized, calculating the spam feature contribution degree of the keyword according to the following formula 2:

where t is the keyword, α(t, C_spam) is the word frequency factor of the keyword t, P(t | C_spam) represents the probability that a text containing the keyword belongs to the spam text category, and P(t) represents the probability that the keyword t appears in the entire sample set.

Calculating the non-spam feature contribution degree of the keyword according to the following formula 3:

where α(t, C_ham) is the word frequency factor of the keyword t, and P(t | C_ham) represents the probability that a text containing the keyword belongs to the non-spam text category.

As an improvement of the above technical solution, the calculating the word frequency factor of the keyword t specifically includes:

calculating the word frequency factor of the keyword according to the following formula 4:

in the formula, tf(t, C_i) represents the number of occurrences of the keyword t in category C_i, n represents the number of samples in category C_i, d_ij represents the j-th sample of category C_i, and tf(t, d_ij) represents the number of occurrences of the keyword t in the j-th sample of category C_i.

As an improvement of the above technical solution, the calculating the weight of the spam feature contribution degree of the text feature words to be recognized specifically includes:

calculating the spam characteristic weight of the text to be recognized according to the following formula 5:

in the formula, Wgt represents the spam feature weight of the text to be recognized, and m represents the number of feature words contained in the text to be recognized.

The invention has the beneficial effects that:

the invention provides a junk text recognition method applied to computer text data, which specifically comprises: acquiring the text data to be recognized; performing word segmentation on the text to be recognized; matching the word segmentation result against a word segmentation dictionary to obtain keywords; screening the keywords to obtain the feature words of the text to be recognized; calculating the spam feature weight of the text to be recognized from the spam feature contribution degree and the non-spam feature contribution degree of the feature words; and comparing the spam feature weight with a preset threshold value to determine whether the text to be recognized is a spam text. This provides a basis for further processing of computer text data determined to be spam text, and prevents the negative effects of spam text on people's daily life.

The features and advantages of the present invention will be described in detail by embodiments in conjunction with the accompanying drawings.

[ description of the drawings ]

FIG. 1 is a flowchart illustrating steps of an embodiment of a method for recognizing spam texts according to the present invention;

FIG. 2 is a flowchart illustrating a process of classifying spam sample texts and non-spam sample texts according to the present invention;

FIG. 3 is a flow chart of word segmentation of a text to be recognized according to the present invention;

FIG. 4 is a diagram of a segmentation model of a text to be recognized according to the present invention;

FIG. 5 is a structural block diagram of an embodiment of the spam text recognition method provided by the present invention.

[ detailed description of the embodiments ]

Refer to FIG. 1, which is a flowchart of the steps of an embodiment of the spam text recognition method provided by the present invention. Although the present application provides the method steps shown in the following embodiments or figures, the method may include more steps, or fewer steps after some are combined, on the basis of conventional or non-inventive work. For steps that have no necessary causal relationship, the execution order is not limited to the order shown in the embodiments or figures of the present application.

Step 202, establishing spam sample texts and non-spam sample texts, as shown in FIG. 2;

in this step, an editor manually screens the spam sample texts and the non-spam sample texts out of the sample texts and then labels them manually.

After the spam sample texts and the non-spam sample texts have been constructed, spam recognition can be performed on the text to be recognized in the recognition stage according to the constructed spam text recognition method. The specific flow is shown in FIG. 3, and the specific steps comprise:

Step 302, performing interval sliding window word segmentation on the text to be recognized.

Specifically, unlike a traditional sliding window, the method links several sliding windows together to collect information from the text. The sliding window model is shown in FIG. 4.

As can be seen in FIG. 4, there are two sliding windows in this model. Call the first sliding window A and the second sliding window B. Both windows have length N, and there is a gap between them (in FIG. 4 the gap is 4 characters). When information is collected from the text data, sliding windows A and B move rightward simultaneously, and the data in the two windows is spliced together as one feature collected by the interval sliding window model.
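The two-window model described above can be sketched in Python as follows. This is only an illustrative sketch: the function name `interval_window_features` and the parameters `window_len` (the window length N) and `gap` are assumptions of this example and do not appear in the original text.

```python
def interval_window_features(text: str, window_len: int, gap: int) -> list[str]:
    """Slide two windows of length `window_len`, separated by `gap`
    characters, over `text` from left to right.

    At each position the contents of window A and window B are spliced
    together; the `gap` characters between the windows are skipped.
    """
    span = 2 * window_len + gap  # width covered by A, the gap, and B
    features = []
    for start in range(len(text) - span + 1):
        window_a = text[start:start + window_len]
        window_b = text[start + window_len + gap:start + span]
        features.append(window_a + window_b)
    return features


# "free" split by three inserted noise characters
print(interval_window_features("fr&%#ee", window_len=2, gap=3))  # ['free']
```

Because the number of inserted characters is not known in advance, a caller would typically repeat the pass with several `gap` values and pool the results.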

The word segmentation method based on the interval sliding window slides two sliding windows of a certain length over the text to be recognized and uses the gap between them to filter out the abnormal characters that a spam author has inserted into the spam text, which improves word segmentation efficiency.

During interval sliding window word segmentation, the text to be recognized may contain abnormal characters such as &% ￥ #, and these abnormal characters are unpredictable. It is therefore usually necessary to slide over the content of one text to be recognized repeatedly to collect text information; the flow of a single collection pass is shown in FIG. 3.

Step 402, matching the word segmentation result with a word segmentation dictionary to obtain keywords.

Specifically, each word segmentation result obtained from the text to be recognized can be matched against the word segmentation dictionary using an existing data structure such as a search tree, or an existing string matching algorithm, so that all keywords in the text to be recognized are found.
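As one possible illustration of the search tree mentioned above, the following Python sketch matches segmentation results against a dictionary with a simple trie. The class `Trie`, the function `match_keywords`, and the sample dictionary are hypothetical and chosen for this example only; the original text does not prescribe a particular structure.

```python
class Trie:
    """Minimal prefix tree over the words of a segmentation dictionary."""

    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[None] = True  # None marks the end of a dictionary word

    def find_all(self, text):
        """Return every dictionary word that occurs in `text`."""
        found = []
        for start in range(len(text)):
            node = self.root
            for end in range(start, len(text)):
                node = node.get(text[end])
                if node is None:
                    break  # no dictionary word continues with this character
                if None in node:
                    found.append(text[start:end + 1])
        return found


def match_keywords(segments, trie):
    """Collect the distinct dictionary words found in the segments."""
    keywords = []
    for segment in segments:
        for word in trie.find_all(segment):
            if word not in keywords:
                keywords.append(word)
    return keywords


trie = Trie(["free", "prize", "win"])
print(match_keywords(["free", "winp", "xxxx"], trie))  # ['free', 'win']
```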

Step 403, calculating keyword feature contribution ratios of the keywords according to the keywords, the constructed spam text samples and the constructed non-spam text samples, and screening out the keywords with the feature contribution ratios larger than a preset threshold value as feature words, wherein the specific steps include:

calculating the word frequency factor of the keyword according to the following formula:

in the formula, tf(t, C_i) represents the number of occurrences of the keyword t in category C_i, n represents the number of samples in category C_i, d_ij represents the j-th sample of category C_i, and tf(t, d_ij) represents the number of occurrences of the keyword t in the j-th sample of category C_i.
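The quantities named above can be computed as in the following Python sketch. It covers only the inputs tf(t, d_ij), tf(t, C_i) and n; the formula that combines them into the word frequency factor is not reproduced in this text, so the sketch does not attempt it. The function names and the sample texts are hypothetical, and occurrences are counted as non-overlapping substring matches, which is an assumption of this example.

```python
def tf_in_sample(keyword: str, sample: str) -> int:
    """tf(t, d_ij): occurrences of keyword t in one sample (non-overlapping)."""
    return sample.count(keyword)


def tf_in_class(keyword: str, samples: list[str]) -> int:
    """tf(t, C_i): occurrences of keyword t summed over all samples of C_i."""
    return sum(tf_in_sample(keyword, sample) for sample in samples)


# Hypothetical spam category C_spam with n = 3 labeled samples
spam_samples = ["win a free prize", "free free offer", "claim your prize"]
n = len(spam_samples)
per_sample = [tf_in_sample("free", sample) for sample in spam_samples]
print(n, per_sample, tf_in_class("free", spam_samples))  # 3 [1, 2, 0] 3
```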

For each keyword of the text to be recognized, calculating the spam feature contribution degree of the keyword according to the following formula:

Step 404, calculating the non-spam feature contribution degree of the keyword according to the following formula:

For each keyword of the text to be recognized, calculating the characteristic contribution ratio of the characteristic word according to the following formula:

Step 405, if the characteristic contribution ratio R (t) of the keyword t is greater than a preset threshold value, taking the keyword t as a characteristic word of the text to be recognized; and if the characteristic contribution ratio R (t) of the keyword t is smaller than a preset threshold value, not taking the keyword t as a characteristic word of the text to be recognized.
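The threshold screening of step 405 can be illustrated with the Python sketch below. The formula for R(t) is not reproduced in this text, so the sketch assumes, purely for illustration, that R(t) is the quotient C(t, C_spam) / C(t, C_ham). The function names, the contribution values and the threshold are likewise hypothetical.

```python
import math


def contribution_ratio(c_spam: float, c_ham: float) -> float:
    """R(t), ASSUMED here to be C(t, C_spam) / C(t, C_ham)."""
    if c_ham == 0:
        # A keyword never seen in non-spam samples: treat the ratio as unbounded
        return math.inf if c_spam > 0 else 0.0
    return c_spam / c_ham


def select_feature_words(contributions: dict, threshold: float) -> list:
    """Keep the keywords whose ratio R(t) exceeds the preset threshold."""
    return [
        keyword
        for keyword, (c_spam, c_ham) in contributions.items()
        if contribution_ratio(c_spam, c_ham) > threshold
    ]


# keyword -> (C(t, C_spam), C(t, C_ham)), made-up values
contributions = {"free": (0.6, 0.1), "meeting": (0.05, 0.5), "prize": (0.4, 0.0)}
print(select_feature_words(contributions, threshold=2.0))  # ['free', 'prize']
```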

Step 406, calculating the spam feature weight of the text to be recognized according to the spam contribution degree and the non-spam contribution degree of the text feature words to be recognized, and judging whether the text to be recognized is a spam text according to a comparison result of the spam feature weight of the text to be recognized and a preset threshold value, wherein the method specifically comprises the following steps:

calculating the spam characteristic weight of the text to be recognized according to the following formula:

If the spam characteristic weight Wgt of the text to be recognized is larger than a preset threshold value, judging that the text to be recognized belongs to a spam text; and if the spam characteristic weight Wgt of the text to be recognized is smaller than a preset threshold value, judging that the text to be recognized belongs to a non-spam text.
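The final decision can be sketched in Python as follows. The formula for Wgt is not reproduced in this text; as a stand-in, the sketch assumes that Wgt is the mean, over the m feature words, of each word's spam share C_spam / (C_spam + C_ham). That form, the function names, the default threshold and the sample values are all assumptions of this example rather than the method's actual formula.

```python
def spam_weight(feature_contribs: list) -> float:
    """Stand-in for Wgt: mean spam share of the m feature words (ASSUMED form).

    `feature_contribs` holds one (C_spam, C_ham) pair per feature word.
    """
    m = len(feature_contribs)
    if m == 0:
        return 0.0  # no feature words, so nothing points towards spam
    shares = []
    for c_spam, c_ham in feature_contribs:
        total = c_spam + c_ham
        shares.append(c_spam / total if total > 0 else 0.0)
    return sum(shares) / m


def is_spam(feature_contribs: list, threshold: float = 0.5) -> bool:
    """Judge the text as spam when Wgt is greater than the preset threshold.

    A weight exactly equal to the threshold is treated as non-spam here,
    since the text only specifies the greater-than and smaller-than cases.
    """
    return spam_weight(feature_contribs) > threshold


print(is_spam([(0.6, 0.2), (0.4, 0.0)]))  # True
print(is_spam([(0.1, 0.4)]))              # False
```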

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A junk text recognition method is characterized by comprising the following steps:

step 4, outputting the recognition result of the text to be recognized.

2. The spam text recognition method of claim 1, wherein: the construction method of the feature words in the step 2 comprises the following steps:

3. The spam text recognition method of claim 1, wherein: dividing the sample into junk text and non-junk text;

4. The spam text recognition method of claim 3, wherein: the junk text sample stores sensitive words and/or junk features and/or junk various features.

5. The spam text recognition method of claim 1, wherein: the calculating of the feature contribution ratio of each keyword of the text to be recognized specifically comprises:

where t is the keyword, R(t) is the feature contribution ratio of the keyword, C(t, C_spam) represents the contribution degree of the keyword t to the spam samples, and C(t, C_ham) represents the contribution degree of the keyword t to the non-spam samples.

6. The spam text recognition method of claim 1, wherein: the calculating the feature contribution degree of each keyword of the text to be recognized specifically comprises:

7. The spam text recognition method of claim 5, wherein: the calculating of the word frequency factor of the keyword t specifically includes:

in the formula, tf(t, C_i) represents the number of occurrences of the keyword t in category C_i, n represents the number of samples in category C_i, d_ij represents the j-th sample of category C_i, and tf(t, d_ij) represents the number of occurrences of the keyword t in the j-th sample of category C_i.

8. The spam text recognition method of claim 3, wherein: the calculating of the weight of the spam feature contribution degree of the feature words of the text to be recognized specifically comprises: