CN113840246A

CN113840246A - Junk short message filtering method and system and computer readable storage medium

Info

Publication number: CN113840246A
Application number: CN202010583970.4A
Authority: CN
Inventors: 胡志攀; 宋延平
Original assignee: Shenzhen Ipi Network Technology Co ltd
Current assignee: Shenzhen Ipi Network Technology Co ltd
Priority date: 2020-06-23
Filing date: 2020-06-23
Publication date: 2021-12-24

Abstract

The invention discloses a spam short message filtering method, a system, and a computer-readable storage medium. The method comprises the following steps: acquiring a short message to be sent, and filtering it using basic keywords; calling the spam short messages in a spam database to perform a first round of matching on the short messages to be sent that were not filtered out, and screening out those that match successfully; and calling the trusted short messages in a trusted database to perform a second round of matching on the short messages to be sent that were not screened out, and sending out those that match successfully.

Description

Junk short message filtering method and system and computer readable storage medium

Technical Field

The present invention relates to the field of communications technologies, and in particular, to a spam filtering method and system, and a computer-readable storage medium.

Background

When enterprises send short messages in the course of daily business, various spam short messages are often inadvertently sent along with them. This troubles the recipients and seriously affects people's normal lives, so communication operators frequently receive large numbers of spam short message complaints. At present, spam short messages are mostly intercepted by keyword filtering. Because this approach relies on large keyword blacklists and similar one-size-fits-all measures, normal short messages are often intercepted as well, and the related business is lost.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a spam short message filtering method, a system, and a computer-readable storage medium that address the defect of the prior art, in which large keyword blacklists and similar one-size-fits-all measures often cause normal short messages to be intercepted and the related business to be lost.

The technical scheme adopted by the invention for solving the technical problems is as follows: a spam message filtering method is constructed, comprising the following steps:

acquiring a short message to be sent, and filtering the short message to be sent by using a basic keyword;

calling the spam short messages in a spam database to perform a first round of matching on the short messages to be sent that were not filtered out, and screening out those that match successfully;

and calling the trusted short messages in a trusted database to perform a second round of matching on the short messages to be sent that were not screened out, and sending out those that match successfully.

Preferably, the method further comprises: putting the short messages to be sent that fail to match in the second round of matching into a content library for manual review; sending out each short message in the content library that passes manual review, and at the same time putting it into the trusted database as a trusted short message; and putting each short message in the content library that fails manual review into the spam database as a spam short message.

Preferably, the method further comprises: in the first round of matching (or the second round of matching), for each short message to be sent, matching the spam short messages in the spam database (or the trusted short messages in the trusted database) group by group at the same time, and ending the matching of the current short message to be sent as soon as any group finds a successfully matched spam short message (or trusted short message).

Preferably, the method further comprises: sorting the spam short messages (or trusted short messages) of each group periodically or at irregular intervals;

in the first round of matching (or the second round of matching), each group calls its spam short messages (or trusted short messages) one by one, in the group's sorted order, to match against the short message to be sent.

Preferably, sorting the spam short messages (or trusted short messages) of each group includes: counting the number of successful matches of each spam short message (or trusted short message) in the same group, and sorting the messages in that group from the highest number of successful matches to the lowest.

Preferably, the sorting of each group is corrected in real time during the first round of matching (or the second round of matching), including: monitoring the number of successful matches of each spam short message (or trusted short message) in each group; if the number of successful matches of a given message within a preset time reaches a preset value, determining that the short messages matching it are in a concentrated sending state and moving that message to the head of its group; and, when the concentrated sending state ends, restoring that message to its position according to its number of successful matches.

Preferably, in the first round of matching (or the second round of matching), the content of the short message to be sent and the content of the called spam short message (or trusted short message) are each subjected to word segmentation, a similarity is calculated from the segmentation results, and the match is judged successful if the similarity meets a preset requirement.

In a further aspect of the invention, a spam filtering system is also provided, comprising a processor and a memory, said memory storing a computer program which, when executed by the processor, carries out the steps of the method as described above.

A further aspect of the invention also provides a computer-readable storage medium, in which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method as described above.

The spam short message filtering method, system, and computer-readable storage medium of the invention have the following beneficial effects: combined with only a small amount of basic keyword filtering, the invention checks short messages against two generated libraries, one containing spam short messages and one containing trusted short messages. This ensures that spam short messages are filtered out and trusted short messages are sent normally, reduces the disturbance to recipients, lowers the number of complaints, and avoids wasted work.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts:

FIG. 1 is a flow chart of a spam filtering method according to the present invention;

FIG. 2 is a schematic diagram of a spam filtering method;

FIG. 3 is a schematic diagram of the ordering within a group.

Detailed Description

To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Exemplary embodiments of the invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.

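To solve the prior-art problem that large keyword blacklists and similar one-size-fits-all measures often cause normal short messages to be intercepted and the related business to be lost, the general idea of the invention is as follows: first, acquire the short message to be sent and filter it using basic keywords; next, call the spam short messages in the spam database to perform a first round of matching on the short messages that were not filtered out, and screen out those that match successfully; and finally, call the trusted short messages in the trusted database to perform a second round of matching on the short messages that were not screened out, and send out those that match successfully.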

For a better understanding of the above technical solution, it is described in detail below with reference to the drawings and specific embodiments. It should be understood that the embodiments of the invention and the specific features in them are detailed explanations of the technical solution of the present application rather than limitations on it, and that the technical features of the embodiments may be combined with one another where no conflict arises.

The method of the invention starts after a short message sending task has been submitted and the basic pre-sending procedures, such as verification of the submitted short message content, have been completed. Referring to FIG. 1 and FIG. 2, the spam short message filtering method of the invention includes:

S101: Acquire the short message to be sent, and filter it using basic keywords.

Referring to FIG. 2, a short message that is filtered out is directly refused sending, and the subsequent steps are not performed on it.

Filtering the short message to be sent using basic keywords mainly means detecting whether any basic keyword appears in the content of the short message. Basic keywords are keywords that the industry explicitly prohibits, such as 'law turn work'. Unlike the keyword lists used in prior-art spam handling, the invention uses very few basic keywords.
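As a rough illustration of step S101, the check can be sketched as a simple substring test. The keyword values and the function name `passes_basic_keywords` below are hypothetical placeholders and do not come from the patent.

```python
# Hypothetical placeholder terms; a real deployment would load the small
# list of industry-prohibited keywords from configuration.
BASIC_KEYWORDS = frozenset({"forbidden-term-a", "forbidden-term-b"})


def passes_basic_keywords(content, keywords=BASIC_KEYWORDS):
    """Return True if the message content contains none of the basic keywords."""
    return not any(keyword in content for keyword in keywords)
```

A message for which this returns False would be refused sending and would not proceed to the matching rounds.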

S102: Call the spam short messages in the spam database to perform a first round of matching on the short messages to be sent that were not filtered out, and screen out those that match successfully.

Since some short messages that ought to be blocked may still remain after the filtering in step S101, the purpose of step S102 is to screen these out as well. Referring to FIG. 2, a short message that is screened out is directly refused sending, and the subsequent steps are not performed on it.

To improve matching efficiency, the short messages in the spam database can be copied into a cache, and the short message data is then read from the cache during matching.

Specifically, during matching, the content of the short message to be sent and the content of the called spam short message are each subjected to word segmentation, a similarity is calculated from the segmentation results, and the match is judged successful if the similarity meets the preset requirement. For example, the short message "I am Chinese" is segmented into "I am" and "Chinese". The similarity can be calculated as the proportion of matched segments. For example, if the content of the short message to be sent has K1 segments, the called spam short message has K2 segments, and the number of matched segments is K3, then the proportion of matched segments is K3/K2; if K3/K2 reaches or exceeds a set threshold, the similarity is considered to meet the preset requirement and the match succeeds. Two segments are generally considered matched when they are identical. It should be noted that the similarity may be calculated in other ways; the calculation is not specifically limited, as long as it reflects the word-segmentation similarity of the two short messages.
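The K3/K2 ratio described above can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace splitting stands in for a real word segmenter (the patent does not name one), and the function names and the 0.8 threshold are hypothetical.

```python
def segment(text):
    """Stand-in word segmentation: lowercase and split on whitespace (an assumption)."""
    return text.lower().split()


def similarity(outgoing, reference):
    """Return K3 / K2: the share of the reference message's segments
    that also occur (identically) in the outgoing message."""
    reference_segments = segment(reference)      # K2 segments
    if not reference_segments:
        return 0.0
    outgoing_segments = set(segment(outgoing))   # the K1 segments, as a set
    matched = sum(1 for s in reference_segments if s in outgoing_segments)  # K3
    return matched / len(reference_segments)


def is_match(outgoing, reference, threshold=0.8):
    """A match succeeds when the ratio reaches or exceeds the threshold."""
    return similarity(outgoing, reference) >= threshold
```

Because the denominator is K2, a long outgoing message that contains all the segments of a short stored message still scores 1.0; other similarity measures could be substituted, as the description allows.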

Preferably, the invention performs matching by grouping the messages and matching the groups simultaneously. For this purpose, all the spam short messages copied from the spam database into the cache need to be grouped before matching. For example, the grouping may be based on the number of CPU cores of the server, with the messages divided evenly among the groups. Suppose the number of spam short messages in the spam database is Len, for example 160, and the number of CPU cores is n, for example 16; the spam short messages are then divided into 16 groups of M messages each, where M = Len/n = 10. It can be understood that if Len/n is not an integer, M is the integer part of Len/n plus one.
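The group size rule (M = Len/n, rounded up when the division is not exact) can be sketched as below. The function name `split_into_groups` is hypothetical, and slicing the cached list in order is an assumption about how messages are assigned to groups.

```python
import math


def split_into_groups(messages, n_groups):
    """Split messages into consecutive groups of size M = ceil(Len / n_groups)."""
    if n_groups <= 0:
        raise ValueError("n_groups must be positive")
    if not messages:
        return []
    group_size = math.ceil(len(messages) / n_groups)  # M
    return [messages[i:i + group_size]
            for i in range(0, len(messages), group_size)]
```

With 160 messages and 16 cores this yields 16 groups of 10. With 161 messages the group size becomes 11, so the messages fit into 15 groups and the last group holds only 7.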

On the basis of this grouping, in the first round of matching, each short message to be sent is matched against the spam short messages in the spam database group by group at the same time, and the matching of the current short message ends as soon as any group finds a successfully matched spam short message. Again taking the 16 groups above as an example, one processing thread is allocated to each group, so that 16 processing threads in total match the same short message to be sent simultaneously. The 16 threads share a match flag: once any one thread matches successfully, all the other threads stop; otherwise, every thread must run to completion.
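One way to sketch the shared match flag is with a `threading.Event` that every worker checks before each comparison. This is an illustration only: the helper `matches`, the function `match_any_group`, and the 0.8 threshold are hypothetical, and whitespace splitting again stands in for word segmentation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def matches(outgoing, reference, threshold=0.8):
    """Segment-overlap test (whitespace split is an assumed stand-in)."""
    reference_segments = reference.lower().split()
    if not reference_segments:
        return False
    outgoing_segments = set(outgoing.lower().split())
    hits = sum(1 for s in reference_segments if s in outgoing_segments)
    return hits / len(reference_segments) >= threshold


def match_any_group(outgoing, groups):
    """Scan all groups concurrently; stop every worker once one finds a match."""
    found = threading.Event()  # the shared match flag

    def scan(group):
        for reference in group:
            if found.is_set():      # another worker already matched
                return
            if matches(outgoing, reference):
                found.set()
                return

    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
        list(pool.map(scan, groups))  # wait for all workers to finish or stop
    return found.is_set()
```

In CPython, threads do not run pure-Python comparisons on several cores at once, so a production system aiming at one group per CPU core might use processes or a language with native threads; the early-exit logic would stay the same.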

Further, the spam short messages within each group can also be sorted, so that the spam short messages most likely to match are placed at the front and matched first, further improving matching efficiency. The method of the invention therefore preferably further comprises: sorting the spam short messages of each group periodically or at irregular intervals. For example, the re-sorting can be performed during a period when short message sending is relatively idle.

Referring to FIG. 3, the sorting specifically includes: counting the number of successful matches of each spam short message in the same group, and sorting the spam short messages in that group from the highest number of successful matches to the lowest. For example, suppose a group contains spam short messages A, B, C …, G, H; sorted from the highest number of successful matches to the lowest, they appear as shown on the left side of FIG. 3.

On the basis of this sorting, in the first round of matching, each group calls its spam short messages one by one, in the group's sorted order, to match against the short message to be sent. That is, each thread calls the spam short messages in its corresponding group one by one in order and matches them against the short message to be sent. Referring to FIG. 3, short message A is called first; if the match succeeds, matching ends; otherwise short message B is called next, and so on.
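The in-group ordering and the one-by-one scan can be sketched as follows. The names `sort_group` and `scan_group`, the `match_counts` dictionary, and the pluggable `is_match` predicate are all hypothetical choices made for illustration.

```python
def sort_group(group, match_counts):
    """Order a group from the highest successful-match count to the lowest.
    Messages with equal counts keep their existing relative order."""
    return sorted(group, key=lambda message: match_counts.get(message, 0),
                  reverse=True)


def scan_group(outgoing, ordered_group, match_counts, is_match):
    """Try stored messages in order; on the first match, bump its count
    and return it. Return None if nothing in the group matches."""
    for reference in ordered_group:
        if is_match(outgoing, reference):
            match_counts[reference] = match_counts.get(reference, 0) + 1
            return reference
    return None
```

Updating the count on each hit means the next scheduled re-sort reflects recent traffic, while the order used during scanning stays fixed until that re-sort runs.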

Further preferably, the sorting of each group is corrected in real time during the first round of matching, which specifically includes: monitoring the number of successful matches of each spam short message in each group; if the number of successful matches of a given spam short message within a preset time reaches a preset value, determining that the short messages matching it are in a concentrated sending state and moving that spam short message to the head of its group; and, when the concentrated sending state ends, restoring that spam short message to its position according to its number of successful matches.

Referring to FIG. 3, suppose the number of successful matches of the last short message, H, reaches the preset value within the preset time. It is then determined that the short messages matching spam short message H are in a concentrated sending state, and short message H is moved to the head of the group, as shown by the first arrow in FIG. 3. In this way, when a particular short message is being sent in a concentrated burst, it can be matched quickly. After the concentrated sending ends, short message H is put back in order according to its number of successful matches, as shown by the second arrow in FIG. 3. It can be understood that short message H in FIG. 3 no longer comes after short message G: because the number of successful matches of H increased during the concentrated sending, restoring the order by number of successful matches does not necessarily return H to its original position, and its position may improve.
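The real-time correction can be sketched with a sliding time window per stored message. The class name `GroupOrder`, the 60-second window, and the threshold of 3 are hypothetical values chosen for illustration; timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from collections import deque


class GroupOrder:
    """Tracks match counts for one group and promotes messages that are
    matching in a concentrated burst to the head of the group."""

    def __init__(self, messages, window_seconds=60.0, burst_threshold=3):
        self.counts = {m: 0 for m in messages}        # total successful matches
        self.recent = {m: deque() for m in messages}  # recent match timestamps
        self.window = window_seconds
        self.threshold = burst_threshold

    def record_match(self, message, now):
        self.counts[message] += 1
        self.recent[message].append(now)

    def _in_burst(self, message, now):
        hits = self.recent[message]
        while hits and now - hits[0] > self.window:   # drop expired timestamps
            hits.popleft()
        return len(hits) >= self.threshold

    def order(self, now):
        """Burst messages first, then everything else by total match count."""
        by_count = sorted(self.counts, key=self.counts.get, reverse=True)
        burst = [m for m in by_count if self._in_burst(m, now)]
        rest = [m for m in by_count if m not in burst]
        return burst + rest
```

For example, with stored counts A=10, B=2, H=0, three matches on H within the window give the order H, A, B. Once the window has passed, the order falls back to A, H, B, so H ends up ahead of B because its count grew during the burst, which mirrors the second arrow in FIG. 3.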

In this way, dynamic sorting within each group is achieved, so that when a large number of short messages are submitted, similar content is matched and sent preferentially.

S103: Call the trusted short messages in the trusted database to perform a second round of matching on the short messages to be sent that were not screened out, and send out those that match successfully.

Since some short messages in a grey zone may remain after the screening in step S102, the purpose of step S103 is to screen these further. Referring to FIG. 2, only short messages that match a trusted short message can be sent.

It should be noted that in this embodiment the second round of matching follows the same principles as the first round, including word-segmentation comparison, group-wise matching, and correction of the order within each group; refer to the description of the first round, which is not repeated here.

Further preferably, referring to FIG. 2, the method further comprises: putting the short messages to be sent that fail to match in the second round of matching into a content library for manual review; sending out each short message in the content library that passes manual review, and at the same time putting it into the trusted database as a trusted short message; and putting each short message in the content library that fails manual review into the spam database as a spam short message. Through this processing, the reliability of short message filtering can be further improved.
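Taken together, steps S101 to S103 and the hand-off to manual review can be sketched as a single routing function. This is a simplified, single-threaded illustration: the names `Decision`, `token_ratio`, and `route`, the whitespace-based segmentation, and the 0.8 threshold are all assumptions, not details from the patent.

```python
from enum import Enum


class Decision(Enum):
    REJECT = "reject"   # refused sending (S101 or S102)
    SEND = "send"       # matched a trusted short message (S103)
    REVIEW = "review"   # goes to the content library for manual review


def token_ratio(outgoing, reference):
    """K3 / K2 with whitespace splitting as an assumed segmenter."""
    reference_segments = reference.lower().split()
    if not reference_segments:
        return 0.0
    outgoing_segments = set(outgoing.lower().split())
    hits = sum(1 for s in reference_segments if s in outgoing_segments)
    return hits / len(reference_segments)


def route(message, basic_keywords, spam_db, trusted_db, threshold=0.8):
    # S101: basic keyword filtering
    if any(keyword in message for keyword in basic_keywords):
        return Decision.REJECT
    # S102: first round of matching against spam short messages
    if any(token_ratio(message, spam) >= threshold for spam in spam_db):
        return Decision.REJECT
    # S103: second round of matching against trusted short messages
    if any(token_ratio(message, ok) >= threshold for ok in trusted_db):
        return Decision.SEND
    # Matched neither library: hand over to manual review
    return Decision.REVIEW
```

The order of the checks matters: a message similar to both a spam entry and a trusted entry is rejected, because the spam round runs first.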

As mentioned in step S102 above, the method preferably reads data from a cache. Therefore, whenever the data in the spam database or the trusted database changes as a result of manual review, the cache must be refreshed synchronously; that is, when a short message is put into the spam database or the trusted database, a copy of it is also put into the corresponding cache. In addition, because short messages have been added to the database, an update of the grouping and sorting is triggered at the same time.
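The review outcome and the synchronous cache refresh can be sketched as a small write-through store. The class `MessageStore`, its `version` counter (used here as a signal that grouping and sorting should be rebuilt), and the function `apply_review` are hypothetical; in-memory lists stand in for the real database and cache.

```python
class MessageStore:
    """A message library whose cache is updated in the same call as the database."""

    def __init__(self):
        self.database = []  # stand-in for the persistent library
        self.cache = []     # stand-in for the in-memory copy used for matching
        self.version = 0    # bumped on every change so groups can be rebuilt

    def add(self, message):
        self.database.append(message)
        self.cache.append(message)   # keep the cache in step with the database
        self.version += 1


def apply_review(message, approved, trusted, spam, outbox):
    """Route a manually reviewed message: approved messages are sent and
    become trusted; rejected messages become spam entries."""
    if approved:
        outbox.append(message)
        trusted.add(message)
    else:
        spam.add(message)
```

A matcher could compare the store's `version` with the one it last grouped against and rebuild its groups and ordering only when the two differ.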

In this way, the invention ensures that only short message content that cannot be matched enters the content library, which keeps the data complete and valid and prevents invalid data in the content library from increasing the review workload. It should also be noted that the content library may be maintained manually or through other business rules.

In summary, in this embodiment, combined with only a small amount of basic keyword filtering, short messages are checked against two generated libraries, one containing spam short messages and one containing trusted short messages. This ensures that spam short messages are filtered out and trusted short messages are sent normally, reduces the disturbance to recipients, lowers the number of complaints, and avoids wasted work. The embodiment of the invention reduces the number of preset keywords and thus the interception of normal short messages; the spam short message and trusted short message databases are generated and updated automatically and are cleaned and maintained automatically, which improves the accuracy of spam short message filtering and the success rate of short message sending.

Based on the same concept, the invention further discloses a spam message filtering system, which comprises a processor and a memory, wherein the memory stores a computer program, the steps of the method embodiment are realized when the computer program is executed by the processor, and the specific realization process refers to the part of the method embodiment and is not described herein again.

Based on the same concept, the present invention further discloses a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the steps of the foregoing method embodiments are implemented, and the specific implementation process refers to the foregoing method embodiment, and is not described herein again. The storage medium can be a magnetic disk, an optical disk, a read-only memory or a random access memory.

While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

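1. A spam short message filtering method, characterized by comprising the following steps: acquiring a short message to be sent, and filtering the short message to be sent using basic keywords; calling the spam short messages in a spam database to perform a first round of matching on the short messages to be sent that were not filtered out, and screening out those that match successfully; and calling the trusted short messages in a trusted database to perform a second round of matching on the short messages to be sent that were not screened out, and sending out those that match successfully.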

2. The spam short message filtering method of claim 1, wherein the method further comprises: putting the short messages to be sent that fail to match in the second round of matching into a content library for manual review; sending out each short message in the content library that passes manual review, and at the same time putting it into the trusted database as a trusted short message; and putting each short message in the content library that fails manual review into the spam database as a spam short message.

3. The spam short message filtering method of claim 1, wherein the method further comprises: in the first round of matching (or the second round of matching), for each short message to be sent, matching the spam short messages in the spam database (or the trusted short messages in the trusted database) group by group at the same time, and ending the matching of the current short message to be sent as soon as any group finds a successfully matched spam short message (or trusted short message).

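4. The spam short message filtering method of claim 3, wherein the method further comprises: sorting the spam short messages (or trusted short messages) of each group periodically or at irregular intervals; and, in the first round of matching (or the second round of matching), each group calling its spam short messages (or trusted short messages) one by one, in the group's sorted order, to match against the short message to be sent.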

5. The spam short message filtering method of claim 4, wherein sorting the spam short messages (or trusted short messages) of each group comprises: counting the number of successful matches of each spam short message (or trusted short message) in the same group, and sorting the messages in that group from the highest number of successful matches to the lowest.

6. The spam short message filtering method of claim 4, wherein the sorting of each group is corrected in real time during the first round of matching (or the second round of matching), comprising: monitoring the number of successful matches of each spam short message (or trusted short message) in each group; if the number of successful matches of a given message within a preset time reaches a preset value, determining that the short messages matching it are in a concentrated sending state and moving that message to the head of its group; and, when the concentrated sending state ends, restoring that message to its position according to its number of successful matches.

7. The spam short message filtering method of claim 1, wherein in the first round of matching (or the second round of matching), the content of the short message to be sent and the content of the called spam short message (or trusted short message) are each subjected to word segmentation, a similarity is calculated from the segmentation results, and the match is judged successful if the similarity meets a preset requirement.

8. A spam filtering system comprising a processor and a memory, said memory storing a computer program which, when executed by the processor, carries out the steps of the method according to any one of claims 1 to 7.

9. A computer-readable storage medium, characterized in that a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1-7.