KR101508258B1 - Fax spam detection apparatus, method and system - Google Patents

Fax spam detection apparatus, method and system

Publication number
KR101508258B1
Authority
KR
South Korea
Prior art keywords
document
fax
spam
group
analysis
Prior art date
Application number
KR20130080263A
Other languages
Korean (ko)
Other versions
KR20150006930A (en)
Inventor
이지형
김재광
김형식
Original Assignee
성균관대학교산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 성균관대학교산학협력단
Priority to KR20130080263A
Publication of KR20150006930A
Application granted
Publication of KR101508258B1

Abstract

A method for blocking a fax spam document according to the present invention includes a classification algorithm generating step of generating a spam classification algorithm based on at least one of a group of analysis fax spam documents and a group of analysis general fax documents, a determination step of determining whether a received target fax document is a spam document, and an output step of determining whether to output the target fax document based on the determination result. Therefore, it is possible to reduce unnecessary resource consumption when implementing a fax spam blocking system and to increase the user's productivity by improving work efficiency.

Description

The present invention relates to a fax spam blocking apparatus, method and system, and more particularly, to a method for intelligently filtering received fax spam.

Conventional fax spam blocking relies on the user directly registering, deleting, and correcting the telephone numbers of senders who transmit spam faxes. That is, a sending telephone number is simply registered, and any document transmitted from a registered spam number is blocked. This approach makes registration and deletion inconvenient, and the number list must be updated periodically because every document transmitted from a registered number is blocked regardless of its content. When a sender transmits spam from several telephone numbers, each number must be registered separately, and at least one spam fax must be received from each number before it can be registered.

FIG. 1 is a flowchart schematically illustrating a conventional fax spam blocking method.

Referring to FIG. 1, the conventional fax spam blocking device receives a fax document (S110). It then determines whether the telephone number of the received fax document is a registered spam telephone number (S120). If the document was received from a registered spam number, the fax data is blocked (S130). Conversely, if the document was received from any other number, it is treated as a general document and the fax document is output (S140).
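The conventional flow of FIG. 1 can be sketched in a few lines of Python. This is only an illustration: the function name `conventional_filter` and the sample telephone numbers are hypothetical and do not come from the patent.

```python
def conventional_filter(sender_number: str, blocked_numbers: set[str]) -> str:
    """Return 'block' (S130) or 'output' (S140) based only on the sender number."""
    # S120: the decision looks only at the sender's number, never at the content.
    if sender_number in blocked_numbers:
        return "block"
    return "output"


blocked = {"02-555-0100"}  # hypothetical registered spam number
print(conventional_filter("02-555-0100", blocked))  # registered number
print(conventional_filter("02-555-0199", blocked))  # unregistered number
```

Because the check never inspects the document itself, a sender who switches to an unregistered number passes straight through, which is the weakness described above.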

Spam documents may also be blocked using blacklists and whitelists, but this approach is vulnerable to evasion techniques that sidestep lists known in advance. For example, if spam is detected using prohibited keywords, an attacker can easily bypass the detection system by substituting other keywords for the prohibited ones.

SUMMARY OF THE INVENTION The present invention has been made to solve the above-mentioned problems occurring in the prior art, and an object of the present invention is to provide a fax spam blocking apparatus, method, and system that effectively block fax spam by analyzing information in received faxes with an intelligent, automatic fax spam classification algorithm.

This prevents waste of resources due to unnecessary reception of fax spam and increases work efficiency.

According to one aspect of the present invention, there is provided a method for blocking a fax spam document, the method comprising: a classification algorithm generating step of generating a spam classification algorithm based on at least one of a group of analysis fax spam documents and a group of analysis general fax documents; a determination step of determining whether a received target fax document is a spam document; and an output step of determining whether to output the target fax document based on the determination result.

The analysis fax spam document may be a document determined as a spam document based on the contents of the fax document and the analysis fax general document may be a document determined to be suitable for reception based on the contents of the fax document.

The classification algorithm generating step may include individually scanning the group of fax spam documents and the group of general fax documents, calculating the appearance frequency of words included in the scanned fax spam document group and the scanned fax general document group, individually performing fax spam document modeling and fax general document modeling on the basis of the appearance frequency, and generating the classification algorithm based on at least one of the modeled fax spam document and the modeled fax general document.

The appearance frequency calculation step may include preprocessing the scanned document to remove stop words and extract only meaningful words, and calculating the appearance frequency based on the extracted words.

The modeling step may include selecting features based on the appearance frequency, and performing the fax spam document modeling and fax general document modeling using the selected features as a feature vector of a support vector machine (SVM).

The step of selecting the features may comprise extracting the top N words with the highest appearance frequency, where N is a natural number.

The spam classification algorithm may be generated using a Naive Bayesian Classifier.

In the output determining step, if the target fax document is determined to be a spam document, the target fax document may be automatically transmitted to a designated online point without being output.

The online point may be a user e-mail address or a user-designated web hard (online storage service).

The target fax document, once classified, may be added to either the analysis fax spam document group or the analysis fax general document group according to the determination result.

According to another aspect of the present invention, there is provided a fax spam document blocking apparatus including a classification algorithm generation unit for generating a spam classification algorithm based on at least one of a group of analysis fax spam documents and a group of analysis general fax documents, a determination unit for determining whether a received target fax document is a spam document, and an output determination unit for determining whether to output the target fax document based on the determination result.

The analysis fax spam document may be a document determined as a spam document based on the contents of the fax document and the analysis fax general document may be a document determined to be suitable for reception based on the contents of the fax document.

The classification algorithm generation unit may include a scan execution unit for individually scanning the group of fax spam documents and the group of general fax documents, an appearance frequency calculation unit for calculating the appearance frequency of words included in the scanned fax spam document group and the scanned fax general document group, a modeling unit for individually performing fax spam document modeling and fax general document modeling based on the appearance frequency, and an algorithm generation unit for generating the classification algorithm based on at least one of the modeled fax spam document and the modeled fax general document.

The appearance frequency calculating unit may include a word extracting unit for extracting only the word by removing the stop words by preprocessing the scanned document and a calculating unit for calculating the appearance frequency based on the extracted word.

The modeling unit may include an upper word extracting unit for extracting the top N words with the highest appearance frequency, where N is an arbitrary natural number, and a fax document modeling unit for performing fax spam document modeling and fax general document modeling using the extracted words as feature vectors of a support vector machine (SVM).

The spam classification algorithm may be generated using a Naive Bayesian Classifier.

If the target fax document is determined to be a spam document, the output determining unit may automatically transmit the target fax document to a designated online point without outputting it.

The online point may be a user e-mail address or a user-designated web hard (online storage service).

The target fax document, once classified, may be added to either the analysis fax spam document group or the analysis fax general document group according to the determination result.

According to another aspect of the present invention, there is provided a fax spam blocking system comprising: a transmission fax machine for transmitting a fax document; and a reception fax machine for generating a spam classification algorithm based on at least one of a group of analysis fax spam documents and a group of analysis general fax documents, determining whether a target fax document received from the transmission fax machine is a spam document using the spam classification algorithm, and determining whether to output the target fax document based on the determination result.

According to the fax spam blocking device, method and system of the present invention, it is possible to reduce unnecessary resource consumption when implementing a fax spam blocking system and to improve the user's work efficiency, thereby increasing productivity.

Also, according to the fax spam blocking device, method, and system of the present invention, generation and updating of a classifier maintains high accuracy of spam blocking and minimizes maintenance costs.

FIG. 1 is a flowchart schematically illustrating a conventional fax spam blocking method.
FIG. 2 is a schematic view of a system to which a fax spam blocking method according to an exemplary embodiment of the present invention can be applied.
FIG. 3 is a flowchart schematically illustrating a method of blocking fax spam according to an exemplary embodiment of the present invention.
FIG. 4 is a detailed flowchart illustrating a feature extracting step of the fax spam blocking method according to an exemplary embodiment of the present invention.
FIG. 5 is a detailed flowchart illustrating a step of determining whether to output in the fax spam blocking method according to an exemplary embodiment of the present invention.
FIG. 6 is a diagram for explaining the process performed when a document is determined to be a spam document by the fax spam blocking method according to an embodiment of the present invention.
FIG. 7 is a block diagram schematically showing a fax spam blocking device according to an embodiment of the present invention.
FIG. 8 is a detailed block diagram illustrating a classification algorithm generation unit of the fax spam blocking device according to an exemplary embodiment of the present invention.
FIG. 9 is a detailed block diagram showing an appearance frequency calculating unit of the fax spam blocking device according to an embodiment of the present invention.
FIG. 10 is a detailed block diagram illustrating a modeling unit of the fax spam blocking device according to an exemplary embodiment of the present invention.
FIG. 11 is a view showing a confusion matrix used for testing the performance of the fax spam blocking method according to an embodiment of the present invention.
FIG. 12A is a table showing ACC results of three classification methods of the fax spam blocking method according to an embodiment of the present invention.
FIG. 12B is a table showing Pre_spam results of three classification methods of the fax spam blocking method according to an embodiment of the present invention.
FIG. 12C is a table showing Rec_spam results of three classification methods of the fax spam blocking method according to an embodiment of the present invention.
FIG. 12D is a table showing Rec_norm results of three classification methods of the fax spam blocking method according to an embodiment of the present invention.
FIG. 13 is a graph comparing F-measures of three classification methods of the fax spam blocking method according to an exemplary embodiment of the present invention under an advanced spam attack.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail.

It should be understood, however, that the invention is not intended to be limited to the particular embodiments, but includes all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

The terms first, second, etc. may be used to describe various components, but the components should not be limited by the terms. The terms are used only for the purpose of distinguishing one component from another. For example, without departing from the scope of the present invention, the first component may be referred to as a second component, and similarly, the second component may also be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of a plurality of related listed items.

It is to be understood that when an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. On the other hand, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that there are no other elements in between.

The terminology used in this application is used only to describe a specific embodiment and is not intended to limit the invention. The singular expressions include plural expressions unless the context clearly dictates otherwise. In the present application, the terms "comprises" or "having" and the like are used to specify that there is a feature, a number, a step, an operation, an element, a component or a combination thereof described in the specification, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art and are not to be interpreted in an idealized or overly formal sense unless explicitly so defined in the present application.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In order to facilitate the understanding of the present invention, the same reference numerals are used for the same constituent elements in the drawings and redundant explanations for the same constituent elements are omitted.

Fax spam protection system

FIG. 2 is a schematic view of a system to which a fax spam blocking method according to an exemplary embodiment of the present invention can be applied. As shown in FIG. 2, the fax spam blocking system according to the embodiment of the present invention may include transmitting fax machines 10-1, 10-2, ..., 10-N and a receiving fax machine 20.

Referring to FIG. 2, the transmitting fax machines 10-1, 10-2, ..., 10-N transmit fax documents to the receiving fax machine 20 via a wireless or wired network. The transmitting fax machines 10-1, 10-2, ..., 10-N may transmit either a spam document containing unwanted advertisement information or a general document. Here, a general document is a document that the user desires to receive based on the contents of the fax document.

The receiving fax machine 20 receives fax documents via a wireless or wired network. The receiving fax machine 20 can generate a spam classification algorithm based on at least one of the fax spam document and the fax general document. Then, it is possible to determine whether the fax document received from the transmission fax apparatuses 10-1, 10-2, ..., 10-N is a spam document by using the generated spam classification algorithm. The receiving fax machine 20 can determine whether to output the received fax document based on the discrimination result. If the received fax document is a general document, the document is output; otherwise, the fax document can be transmitted to the online point designated by the user without outputting.

How to prevent fax spam

FIG. 3 is a flowchart schematically illustrating a method of blocking fax spam according to an exemplary embodiment of the present invention.

Referring to FIG. 3, the fax spam blocking device according to an embodiment of the present invention first receives a fax document (S310).

Then, feature extraction is performed (S320). Feature extraction creates new features through transformation and appropriate combination. Here, features are extracted from a fax spam document group and a fax general document group in order to generate a spam classification algorithm. The fax spam document group and the fax general document group are the analysis populations used to generate the spam classification algorithm. To this end, classifier learning should be performed using fax spam documents and fax general documents as defined by the user. A fax spam document is not simply a fax on a specific topic or a fax sent from a specific telephone number; it is a document that the user, judging by its content, does not want to receive and therefore designates for training. A fax general document, as described above, is a document that the user desires to receive based on the contents of the fax document.

Upon completion of the feature extraction, the spam document interception device determines whether the received fax document is a spam document (S330). It is determined whether the received fax document is a spam document using the classification algorithm generated in the feature extraction step S320.

If it is determined that the document is not a spam document, it is regarded as a general document that the user wants, and the fax document is output (S340). Conversely, if the document is a spam document, the fax data is blocked (S350).

FIG. 4 is a detailed flowchart illustrating a feature extracting step of the fax spam blocking method according to an exemplary embodiment of the present invention.

Referring to FIG. 4, the fax spam blocking device receives a target fax document (S410), generates a spam classification algorithm for determining whether the target fax document is spam (S430), and then determines whether the target fax document is spam. Here, the step of generating the spam classification algorithm (S430) is the key step.

According to an embodiment of the present invention, the spam document blocking device generates a spam classification algorithm through at least one of a fax spam document model and a fax general document model. Therefore, two models can be created separately. That is, depending on the user setting, it is possible to select whether to use only the fax spam document model, only the fax general document model, or both, and the modeling process can proceed accordingly.

To this end, the spam document blocking device performs an OCR (optical character reader) scan of the fax spam documents and/or the general fax documents (S431, S441). That is, each image is scanned and converted into a machine-readable format. Any of a number of currently known OCR scanning techniques may be used.

Next, the spam document blocking device preprocesses the scanned document to extract only words (S433, S443). Since scanned documents contain many stop words that carry no particular meaning, these are removed so that only meaningful words remain. The spam document blocking device can perform the preprocessing using a stop-word dictionary.
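As a rough illustration of the preprocessing step (S433, S443), the following Python sketch tokenizes OCR output and drops stop words. The `STOP_WORDS` set, the function name `extract_words`, and the English-only tokenization rule are assumptions made for this example; the patent does not specify the dictionary contents or the tokenizer.

```python
import re

# Hypothetical stop-word list; a real deployment would load a fuller dictionary.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "for", "is", "in", "on"}


def extract_words(ocr_text: str) -> list[str]:
    """Lowercase the OCR output, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", ocr_text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(extract_words("Apply for a LOW-RATE loan today!!"))
# ['apply', 'low', 'rate', 'loan', 'today']
```

Punctuation and OCR noise that do not form alphabetic tokens are discarded by the regular expression, so only candidate words reach the frequency calculation.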

Then, the occurrence frequency of each word is calculated (S435, S445). That is, the spam document blocking device can analyze the appearance frequency of words in a fax document for feature extraction.
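The frequency calculation (S435, S445) might look like the sketch below, which counts words separately for each document group. The sample word lists and the function name `appearance_frequency` are invented for illustration only.

```python
from collections import Counter


def appearance_frequency(documents: list[list[str]]) -> Counter:
    """Count how often each word appears across one document group."""
    counts: Counter = Counter()
    for words in documents:
        counts.update(words)
    return counts


# Hypothetical preprocessed documents (one word list per fax).
spam_group = [["loan", "rate", "loan"], ["loan", "discount"]]
general_group = [["invoice", "meeting"], ["invoice", "order"]]

spam_freq = appearance_frequency(spam_group)
general_freq = appearance_frequency(general_group)
print(spam_freq["loan"], general_freq["invoice"])  # 3 2
```

Keeping one counter per group matches the text's description of analyzing the spam and general groups individually.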

Next, the spam document blocking device selects features of the two classes, the fax spam document and/or the fax general document, in order to construct a support vector machine (SVM) or a Naive Bayesian classifier (S437, S447). Feature selection is a different concept from feature extraction: it means selecting the best subset of the input feature set. In the present invention, the appearance frequency of words in each class is used as the feature of the classifier. The appearance frequencies of all words could be used as features, but this can be inefficient. Therefore, the spam document blocking device selects the features with the greatest impact among the features of the documents, i.e., the words with a high appearance frequency. Feature selection based on appearance frequency is widely used in text mining.
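Selecting the N most frequent words (S437, S447) can be sketched as follows. The function name `select_features` and the sample counts are hypothetical; the patent does not fix a value for N.

```python
from collections import Counter


def select_features(frequency: Counter, n: int) -> list[str]:
    """Return the n words with the highest appearance frequency."""
    return [word for word, _ in frequency.most_common(n)]


# Hypothetical appearance frequencies for the spam group.
spam_freq = Counter({"loan": 5, "rate": 3, "discount": 2, "today": 1})
print(select_features(spam_freq, 2))  # ['loan', 'rate']
```

A small N keeps the classifier compact, at the cost of ignoring rarer words that may still be informative; the right trade-off would have to be found experimentally.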

According to another embodiment of the present invention, classification may be performed using the N words with the highest appearance frequency, by a method in which the probability of a document belonging to each group increases according to how often those words appear. Alternatively, the N words may be used as a feature vector of a support vector machine to generate a classification algorithm.

Finally, the spam document interception device performs fax spam document modeling and fax general document modeling through two representative methods (e.g., support vector machine or Naive Bayesian) (S439, S449). Here, when the support vector machine is used, as described above, modeling can be performed using N words having a high appearance frequency as feature vectors of the support vector machine. Generating the support vector machine model may be accomplished through training and testing steps.
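One plausible way to turn the selected words into an SVM feature vector is to count how often each selected word occurs in a document. This count-based encoding, the function name `to_feature_vector`, and the sample vocabulary are assumptions for illustration; the patent states only that the N words serve as the feature vector, and the SVM training itself is not shown here.

```python
def to_feature_vector(words: list[str], vocabulary: list[str]) -> list[int]:
    """Count occurrences of each selected word, in vocabulary order."""
    return [words.count(term) for term in vocabulary]


vocabulary = ["loan", "rate", "invoice"]  # hypothetical top-N words
print(to_feature_vector(["loan", "loan", "order", "invoice"], vocabulary))
# [2, 0, 1]
```

Vectors of this shape, one per training document together with its spam or general label, could then be handed to any SVM training routine.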

Alternatively, the Naïve Bayesian classification method may be used. This method generates a classification model based on the assumptions that the words appearing in each class are features representing that class and that each word appears in a document independently of the other words. Therefore, modeling can be performed based on the N words having a high appearance frequency. Modeling through the Naïve Bayesian classification method is simple to implement and has the advantage of fast document modeling. It works well with a bag-of-words model and is well suited for document modeling.
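A minimal multinomial Naive Bayes sketch over the selected words is shown below. The add-one (Laplace) smoothing, the function names `train` and `classify`, and the toy data are assumptions of this example and are not taken from the patent.

```python
import math
from collections import Counter


def train(groups: dict[str, list[list[str]]], vocabulary: list[str]):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    total_docs = sum(len(docs) for docs in groups.values())
    model = {}
    for label, docs in groups.items():
        counts = Counter(w for doc in docs for w in doc if w in vocabulary)
        denom = sum(counts.values()) + len(vocabulary)  # add-one smoothing
        model[label] = (
            math.log(len(docs) / total_docs),
            {w: math.log((counts[w] + 1) / denom) for w in vocabulary},
        )
    return model


def classify(model, words: list[str]) -> str:
    """Pick the class with the highest posterior log score."""
    def score(label):
        prior, likelihood = model[label]
        # Words outside the selected vocabulary are ignored.
        return prior + sum(likelihood[w] for w in words if w in likelihood)
    return max(model, key=score)


groups = {
    "spam": [["loan", "rate", "loan"], ["loan", "discount"]],
    "general": [["invoice", "meeting"], ["invoice", "order"]],
}
vocabulary = ["loan", "rate", "discount", "invoice", "meeting", "order"]
model = train(groups, vocabulary)
print(classify(model, ["loan", "discount", "today"]))  # spam
print(classify(model, ["invoice", "order"]))           # general
```

Summing per-word log likelihoods is exactly where the independence assumption enters: each word contributes to the score without regard to the other words in the document.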

Fax spam document modeling and fax general document modeling can be performed using either one of the two methods, and a classification algorithm can be generated using at least one of the two modeled documents.

FIG. 5 is a detailed flowchart illustrating a step of determining whether to output in the fax spam blocking method according to an embodiment of the present invention.

Referring to FIG. 5, the fax spam blocking device determines whether the received target fax document is a spam document using the generated spam classification algorithm (S510). As a result of the determination, if the document is a spam document, it is transmitted to the designated online point (S520). If the document is not a spam document, a fax document is output (S530). By doing so, it is possible to prevent unconditional output of spam documents, thereby reducing power and paper waste.
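The branch in FIG. 5 can be sketched as a small routing function. The callables `is_spam`, `print_fax`, and `send_to_online_point` are hypothetical stand-ins for the trained classifier, the printer, and the e-mail or web hard upload; none of them is defined by the patent.

```python
from typing import Callable


def decide_output(
    words: list[str],
    is_spam: Callable[[list[str]], bool],
    print_fax: Callable[[], None],
    send_to_online_point: Callable[[], None],
) -> str:
    """Route the document per S510 to S530 and report which branch ran."""
    if is_spam(words):          # S510
        send_to_online_point()  # S520: forward instead of printing
        return "forwarded"
    print_fax()                 # S530
    return "printed"


log: list[str] = []
result = decide_output(
    ["loan", "discount"],
    is_spam=lambda w: "loan" in w,  # stand-in for the trained classifier
    print_fax=lambda: log.append("print"),
    send_to_online_point=lambda: log.append("forward"),
)
print(result, log)  # forwarded ['forward']
```

Passing the side effects in as callables keeps the decision logic testable without a printer or network connection.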

FIG. 6 is a diagram for explaining the process performed when a document is determined to be a spam document by the fax spam blocking method according to an embodiment of the present invention.

Referring to FIG. 6, when the fax spam screening apparatus of the present invention determines that a received document is a spam document, the document is transmitted to an online point designated by the user, which prevents it from being discarded unconditionally. For example, through the user interface, the user can set spam documents to be temporarily stored at an e-mail address 620 used by the user or on a web hard 630 accessible through the Internet. According to the setting, the fax spam screening apparatus transmits the received fax document determined to be a spam document to the user e-mail address 620 or the web hard 630. The user can check the document at the e-mail address 620 or the web hard 630, and if it turns out not to be spam, the user can restore it and output it. This prevents a document from being discarded unconditionally merely because it was classified as spam.

After it is determined whether to output the received fax document, the received fax document can be added to the analysis fax spam document group or the analysis fax general document group according to whether it was classified as a general document or a spam document. The groups may thus be continuously updated, which in turn keeps the spam classification algorithm up to date.
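The feedback step might be sketched as below: each classified document is appended to the matching analysis group so that a later regeneration of the classifier sees it. The function name `update_groups` and the dictionary layout are hypothetical, and how often the classifier is regenerated is not specified in the text.

```python
def update_groups(
    groups: dict[str, list[list[str]]], words: list[str], is_spam: bool
) -> None:
    """Append the classified document to the matching analysis group."""
    label = "spam" if is_spam else "general"
    groups[label].append(words)


groups = {"spam": [["loan", "rate"]], "general": [["invoice"]]}
update_groups(groups, ["loan", "discount"], is_spam=True)
print(len(groups["spam"]), len(groups["general"]))  # 2 1
```

If the user later restores a document from the online point because it was not spam, moving it to the general group before regeneration would keep the training data consistent with the user's judgment.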

Fax spam blocker

FIG. 7 is a block diagram schematically illustrating a fax spam screening apparatus according to an exemplary embodiment of the present invention. Referring to FIG. 7, the fax blocking device according to an exemplary embodiment of the present invention may include a classification algorithm generation unit 710, a determination unit 720, and an output determination unit 730.

Referring to FIG. 7, the classification algorithm generation unit 710 performs feature extraction. The classification algorithm generation unit 710 may extract features from the fax spam document group and the fax general document group to generate a spam classification algorithm. Here, the fax spam document group and the fax general document group are the analysis populations used to generate the spam classification algorithm. To this end, classifier learning should be performed using fax spam documents and fax general documents as defined by the user. A fax spam document is not simply a fax on a specific topic or a fax sent from a specific telephone number; it is a document that the user, judging by its content, does not want to receive and therefore designates for training. A fax general document, as described above, is a document that the user desires to receive based on the contents of the fax document.

The determination unit 720 determines whether the received fax document is a spam document, using the classification algorithm generated by the classification algorithm generation unit 710.

The output determination unit 730 determines whether to output the received fax document according to the determination result of the determination unit 720. If the document is determined not to be a spam document, it is regarded as a general document that the user wants, and the fax document is output. Conversely, if it is a spam document, the fax data is blocked. A fax document that is determined to be spam and blocked is not immediately deleted but may be transmitted to a designated online point. This prevents spam documents from being output unconditionally, thereby reducing power and paper waste. Through the user interface, the user can set spam documents to be temporarily stored at an e-mail address used by the user or on a web hard accessible through the Internet. According to the setting, the output determination unit 730 transmits the received fax document determined to be a spam document to the user's e-mail address or web hard. The user can check the spam document at the set location, and if it turns out not to be spam, the user can restore it and output it.

Once the output of the received fax document is determined, the received fax document may be included in the analysis fax spam document group or the analysis fax general document group according to the discrimination as the general document or the spam document.

FIG. 8 is a detailed block diagram illustrating the classification algorithm generation unit 710 of the fax spam blocking device according to an embodiment of the present invention. As shown in FIG. 8, the classification algorithm generation unit 710 may include a scan performing unit 810, an appearance frequency calculating unit 820, a modeling unit 830, and an algorithm generation unit 840.

Referring to FIG. 8, the scan performing unit 810 performs an OCR (optical character reader) scan of the fax spam documents and/or the fax general documents so that a spam classification algorithm can be generated through at least one of a fax spam document model and a fax general document model. That is, each image is scanned and converted into a machine-readable format. Any of a number of currently known OCR scanning techniques may be used.

The appearance frequency calculating unit 820 can analyze the occurrence frequency of words in the fax spam document group or the fax general document group for feature extraction.

The modeling unit 830 may perform facsimile spam document modeling and fax general document modeling by selecting features based on the analyzed appearance frequency. At this time, a support vector machine or a Naive Bayes classification method may be used.

The algorithm generation unit 840 generates a spam classification algorithm through at least one of the fax spam document model and the fax general document model. The algorithm generation unit 840 can generate the two models individually. That is, depending on the user setting, it is possible to select whether to use only the fax spam document model, only the fax general document model, or both, and the modeling process can proceed accordingly.

FIG. 9 is a detailed block diagram illustrating the appearance frequency calculating unit 820 of the fax spam screening apparatus according to an exemplary embodiment of the present invention. As shown in FIG. 9, the appearance frequency calculating unit 820 according to an embodiment of the present invention may include a word extracting unit 910 and a calculating unit 920.

Referring to FIG. 9, the word extracting unit 910 preprocesses the scanned document to extract only words. Since scanned documents contain many stop words that carry no particular meaning, these are removed so that only meaningful words remain. The word extracting unit 910 can perform the preprocessing using a stop-word dictionary.

The calculation unit 920 calculates the appearance frequency. The calculating unit 920 can analyze the occurrence frequency of words in the fax document for feature extraction.

FIG. 10 is a detailed block diagram illustrating the modeling unit 830 of the fax spam blocking device according to an embodiment of the present invention. As shown in FIG. 10, the modeling unit 830 may include a feature selecting unit 1010 and a fax document modeling unit 1020.

The feature selecting unit 1010 selects features of the two classes, the fax spam document and/or the fax general document, in order to construct a support vector machine (SVM) or a Naive Bayesian classifier. Feature selection is a different concept from feature extraction: it means selecting the best subset of the input feature set. The feature selecting unit 1010 can use the appearance frequency of words in each class as the feature of the classifier. The appearance frequencies of all words could be used as features, but this can be inefficient. Therefore, the feature selecting unit 1010 selects the features with the greatest impact among the features of the documents, i.e., the words with a high appearance frequency. Feature selection based on appearance frequency is widely used in text mining.

According to another embodiment of the present invention, the feature selecting unit 1010 can classify N words having high occurrence frequency by a method of increasing the probability of being included in each group according to the degree of appearance of words. Alternatively, the feature selection unit 1010 may generate a classification algorithm using N words having a high appearance frequency as a feature vector of a support vector machine.
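
Selecting the N most frequent words and turning a document into a feature vector can be sketched as follows. The helper names `select_top_n` and `to_feature_vector`, the alphabetical tie-breaking rule, and the sample frequencies are assumptions made only for illustration.

```python
from collections import Counter

def select_top_n(freq, n):
    """Return the n words with the highest appearance frequency (ties broken alphabetically)."""
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]

def to_feature_vector(words, features):
    """Map a document's word list to the count of each selected feature word."""
    counts = Counter(words)
    return [counts[f] for f in features]

# Made-up frequency table for one document group.
spam_freq = Counter({"loan": 9, "offer": 7, "cheap": 4, "call": 2})
features = select_top_n(spam_freq, 3)
vector = to_feature_vector(["cheap", "loan", "loan", "meeting"], features)
```

A vector of this form could serve as the input to a support vector machine; words outside the selected set, such as "meeting" here, simply do not contribute.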

The fax document modeling unit 1020 performs fax spam document modeling and fax general document modeling through either of two representative methods (a support vector machine or a Naïve Bayesian classifier). When the support vector machine is used, as described above, modeling can be performed using the N words having the highest appearance frequency as the feature vector of the support vector machine. The support vector machine model may be generated through training and testing steps.

Alternatively, the fax document modeling unit 1020 may use the Naïve Bayesian classification method. This is a probabilistic classification model based on the assumptions that the words appearing in each class are features representing that class and that each word is unrelated to (independent of) the other words. The fax document modeling unit 1020 can therefore perform modeling based on the N words having the highest appearance frequency. Modeling through the Naïve Bayesian classification method is simple to implement and has the advantage of fast document modeling. It works well with a bag-of-words model and is well suited to document modeling.
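
A Naïve Bayesian model over the selected words can be sketched in plain Python as follows. This is a minimal multinomial variant with add-one (Laplace) smoothing, which is an assumption: the text does not specify the variant or the smoothing. The function names and the toy training groups are likewise hypothetical.

```python
import math
from collections import Counter

def train_naive_bayes(groups, features):
    """groups maps a class label to a list of documents, each a list of words."""
    total_docs = sum(len(docs) for docs in groups.values())
    model = {}
    for label, docs in groups.items():
        counts = Counter(w for doc in docs for w in doc if w in features)
        total = sum(counts.values())
        model[label] = {
            "log_prior": math.log(len(docs) / total_docs),
            # Add-one smoothing keeps unseen feature words from zeroing a class.
            "log_likelihood": {
                f: math.log((counts[f] + 1) / (total + len(features)))
                for f in features
            },
        }
    return model

def classify(model, words):
    """Return the label with the highest log posterior; non-feature words are ignored."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_likelihood"][w]
            for w in words
            if w in params["log_likelihood"]
        )
    return max(model, key=score)

features = ["loan", "offer", "invoice", "meeting"]
groups = {
    "spam": [["loan", "offer", "loan"], ["offer", "loan"]],
    "general": [["invoice", "meeting"], ["meeting", "invoice", "invoice"]],
}
model = train_naive_bayes(groups, features)
label = classify(model, ["loan", "offer", "cheap"])
```

Because each word contributes an independent term to the score, training reduces to counting, which is consistent with the fast modeling noted above.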

Simulation result

A simulation was performed to verify the performance of the fax spam blocking method of the present invention. First, fax spam documents and fax general documents were collected in order to generate a fax spam classification algorithm. The collected documents were classified by content according to the user's subjective judgment. Each collected fax document was converted into a group of words by OCR scanning, and preprocessing was performed. Then the appearance frequency of each word was calculated, the most frequently appearing words and their frequencies were identified, and the features of the spam documents and the general documents were selected. Finally, modeling was performed using the support vector machine and the Naïve Bayesian classification method.

FIG. 11 is a view showing a confusion matrix used for testing the performance of the fax spam blocking method according to an embodiment of the present invention.

Referring to FIG. 11, the accuracy (ACC), the precision of spam detection (Pre_spam), the recall of spam detection (Rec_spam), and the recall of general document detection (Rec_norm) are calculated from the matrix. These can be calculated as follows.

Figure 112013061655786-pat00001

Accuracy (ACC) means the overall performance of the fax spam system according to the present invention. This increases when the system identifies the actual spam as spam and the actual general document as a generic document.
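
The four measures (ACC, Pre_spam, Rec_spam, Rec_norm) can be computed from a two-class confusion matrix as sketched below. The formulas are the standard definitions of accuracy, precision, and recall and are assumed to match the formula figure, which is not reproduced in this text; the function name `confusion_metrics` and the sample counts are made up for illustration, and non-zero denominators are assumed.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute the four measures from confusion-matrix counts.

    tp: spam classified as spam      fn: spam classified as general
    fp: general classified as spam   tn: general classified as general
    """
    return {
        "ACC": (tp + tn) / (tp + fn + fp + tn),
        "Pre_spam": tp / (tp + fp),   # precision of spam detection
        "Rec_spam": tp / (tp + fn),   # recall of spam detection
        "Rec_norm": tn / (tn + fp),   # recall of general document detection
    }

# Illustrative counts only, not the simulation data.
metrics = confusion_metrics(tp=45, fn=5, fp=0, tn=50)
```

With no general document misclassified as spam (fp = 0), both Pre_spam and Rec_norm equal 1, while the five missed spam documents lower Rec_spam and ACC.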

FIG. 12A is a table showing ACC results of three classification methods of the fax spam blocking method according to an embodiment of the present invention.

Referring to FIG. 12A, RB denotes modeling by a rule-based filtering method, SVM denotes modeling by a classification method using a support vector machine, and NB denotes modeling by the Naïve Bayesian method. As shown in FIG. 12A, the rule-based filtering method has the lowest accuracy at 55.35%, whereas SVM and NB, which are used by the fax spam blocking system according to the embodiment of the present invention, show nearly identical high accuracy of 91%. The numbers 10, 20, ..., 100 at the top of the table represent the number of selected features.

Referring again to FIG. 11, Pre_spam indicates how well the system detects spam. That is, it represents the detection capability of the spam detection system.

FIG. 12B is a table showing Pre_spam results of three classification methods of the fax spam blocking method according to an embodiment of the present invention.

As shown in FIG. 12B, the Pre_spam of the RB classification method increases with the number of features used. On the other hand, the Pre_spam of SVM and NB is 100% for every number of features.

Referring back to FIG. 11, Rec_spam indicates whether the system detects as many spam documents as possible. That is, it represents the ability to detect how many spam fax documents are among the total spam documents.

FIG. 12C is a table showing Rec_spam results of three classification methods of the fax spam blocking method according to an embodiment of the present invention.

As shown in FIG. 12C, the Rec_spam values of the RB, SVM, and NB classification methods are almost the same. Although the ACC and Pre_spam of RB were lower than those of the other methods, the Rec_spam of RB was higher. High recall combined with low precision indicates that RB performs poorly overall.

Referring again to FIG. 11, Rec_norm reflects the false positive probability, that is, the probability that a general document is wrongly classified as spam. The higher the Rec_norm, the lower the probability of false positives.

FIG. 12D is a table showing Rec_norm results of three classification methods of the fax spam blocking method according to an embodiment of the present invention.

As shown in FIG. 12D, when only 10 features were used, NB achieved 100%, whereas RB achieved 28.99% and SVM achieved 78.77%.

Based on the above results, we can calculate the F-measure that represents the combination of precision and recall.
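
The F-measure can be sketched as the weighted harmonic mean of precision and recall. Which variant the simulation used is not stated; the balanced F1 score (`beta=1.0`) is assumed here, and the function name `f_measure` and the sample values are hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives the balanced F1 score."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing is detected correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative values only: perfect precision with 90% recall.
f1 = f_measure(precision=1.0, recall=0.9)
```

Because it is a harmonic mean, the F-measure stays low unless precision and recall are both high, which makes it a convenient single number for comparing classifiers.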

FIG. 13 is a graph comparing the F-measures of three classification methods of the fax spam blocking method according to an embodiment of the present invention under an advanced spam attack.

As shown in FIG. 13, the F-measures of RB and SVM are greatly influenced by changes in the number of features. On the other hand, the F-measure of NB is stable over the entire x-axis (number of features). Therefore, NB is recommended for fax spam detection.

It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention as defined by the following claims.

Claims (20)

  1. A method for blocking a fax spam document, comprising:
    generating a spam classification algorithm based on at least one of an analysis fax spam document group and an analysis fax general document group;
    determining whether a received target fax document is a spam document using the spam classification algorithm; and
    an output step of determining whether to output the target fax document based on the discrimination result,
    wherein the spam classification algorithm is automatically updated by including the discriminated document in either the analysis fax spam document group or the analysis fax general document group according to the discrimination result.
  2. The method according to claim 1,
    wherein the analysis fax spam document is a document determined to be a spam document based on the contents of the fax document, and the analysis fax general document is a document determined to be suitable for reception based on the contents of the fax document.
  3. The method of claim 2, wherein the generating of the spam classification algorithm comprises:
    scanning the fax spam document group and the fax general document group separately;
    separately calculating the appearance frequency of the words included in the scanned fax spam document group and in the scanned fax general document group;
    performing fax spam document modeling and fax general document modeling separately based on the appearance frequency; and
    generating the classification algorithm based on at least one of the modeled fax spam document and the modeled fax general document.
  4. The method of claim 3, wherein the appearance frequency calculating step comprises:
    preprocessing the scanned document to remove abbreviated words and extract only words; and
    calculating the appearance frequency based on the extracted words.
  5. The method of claim 3, wherein the modeling step comprises:
    selecting a feature based on the appearance frequency; and
    performing fax spam document modeling and fax general document modeling using the selected feature as a feature vector of a support vector machine (SVM).
  6. The method of claim 5, wherein selecting the feature comprises:
    extracting the top N words having the highest appearance frequency, where N is an arbitrary natural number.
  7. The method according to claim 1,
    wherein the spam classification algorithm is generated using a Naive Bayesian classifier.
  8. The method according to claim 1,
    wherein, if the target fax document is determined to be a spam document, the output step does not output the document but automatically transmits it to a designated online point.
  9. The method of claim 8,
    wherein the online point is a user's email address or a user-specified web hard (online storage).
  10. The method according to claim 1,
    wherein the discriminated fax document is included in the analysis fax spam document group or the analysis fax general document group according to the discrimination result.
  11. A fax spam document blocking apparatus, comprising:
    a classification algorithm generation unit for generating a spam classification algorithm based on at least one of an analysis fax spam document group and an analysis fax general document group;
    a determination unit for determining whether a received target fax document is a spam document using the spam classification algorithm; and
    an output determining unit for determining whether to output the target fax document based on the discrimination result,
    wherein the spam classification algorithm is automatically updated by including the discriminated document in either the analysis fax spam document group or the analysis fax general document group according to the discrimination result.
  12. The apparatus of claim 11,
    wherein the analysis fax spam document is a document determined to be a spam document based on the contents of the fax document, and the analysis fax general document is a document determined to be suitable for reception based on the contents of the fax document.
  13. The apparatus of claim 11, wherein the classification algorithm generation unit comprises:
    a scan performing unit that separately scans the fax spam document group and the fax general document group;
    an appearance frequency calculating unit that separately calculates the appearance frequency of the words included in the scanned fax spam document group and in the scanned fax general document group;
    a modeling unit that separately performs fax spam document modeling and fax general document modeling based on the appearance frequency; and
    an algorithm generation unit that generates the classification algorithm based on at least one of the modeled fax spam document and the modeled fax general document.
  14. The apparatus of claim 13, wherein the appearance frequency calculating unit comprises:
    a word extracting unit that preprocesses the scanned document to remove abbreviated words and extract only words; and
    a calculating unit that calculates the appearance frequency based on the extracted words.
  15. The apparatus of claim 13, wherein the modeling unit comprises:
    an upper word extracting unit that extracts the top N words having the highest appearance frequency, where N is an arbitrary natural number; and
    a fax document modeling unit that performs fax spam document modeling and fax general document modeling using the extracted words as a feature vector of a support vector machine (SVM).
  16. The apparatus of claim 11,
    wherein the spam classification algorithm is generated using a Naive Bayesian classifier.
  17. The apparatus of claim 11,
    wherein, when the target fax document is determined to be a spam document, the output determining unit does not output the document but automatically transmits it to a designated online point.
  18. The apparatus of claim 17,
    wherein the online point is a user's email address or a user-specified web hard (online storage).
  19. The apparatus of claim 11,
    wherein the discriminated fax document is included in the analysis fax spam document group or the analysis fax general document group according to the discrimination result.
  20. A fax spam blocking system, comprising:
    a transmitting fax machine that transmits a target fax document to a receiving fax machine; and
    the receiving fax machine, which generates a spam classification algorithm based on at least one of an analysis fax spam document group and an analysis fax general document group, determines whether the target fax document received from the transmitting fax machine is a spam document using the spam classification algorithm, and determines whether to output the target fax document based on the discrimination result,
    wherein the spam classification algorithm is automatically updated by including the discriminated document in either the analysis fax spam document group or the analysis fax general document group according to the discrimination result.

KR20130080263A 2013-07-09 2013-07-09 Fax spam detection apparatus, method and system KR101508258B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR20130080263A KR101508258B1 (en) 2013-07-09 2013-07-09 Fax spam detection apparatus, method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR20130080263A KR101508258B1 (en) 2013-07-09 2013-07-09 Fax spam detection apparatus, method and system

Publications (2)

Publication Number Publication Date
KR20150006930A KR20150006930A (en) 2015-01-20
KR101508258B1 true KR101508258B1 (en) 2015-04-08

Family

ID=52570046

Family Applications (1)

Application Number Title Priority Date Filing Date
KR20130080263A KR101508258B1 (en) 2013-07-09 2013-07-09 Fax spam detection apparatus, method and system

Country Status (1)

Country Link
KR (1) KR101508258B1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040011121A (en) * 2002-07-29 2004-02-05 삼성에스디에스 주식회사 Automatic Spam-mail Dividing Method
KR20060049165A (en) * 2004-05-21 2006-05-18 마이크로소프트 코포레이션 Search engine spam detection using external data
JP2008135926A (en) * 2006-11-28 2008-06-12 Yamaguchi Univ E-mail system with unwanted e-mail filtering function
KR20100051187A (en) * 2008-11-07 2010-05-17 정혁진 Image forming device and method for preventing spam advertising

Also Published As

Publication number Publication date
KR20150006930A (en) 2015-01-20

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right
FPAY Annual fee payment

Payment date: 20180316

Year of fee payment: 4

FPAY Annual fee payment

Payment date: 20190104

Year of fee payment: 5