CN117874209B - NLP-based fraud short message monitoring and alarming system - Google Patents

NLP-based fraud short message monitoring and alarming system

Info

Publication number
CN117874209B
Authority
CN
China
Prior art keywords
short message
fraud
detected
word
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410275467.0A
Other languages
Chinese (zh)
Other versions
CN117874209A (en
Inventor
赖红琼
王金龙
尹意萍
龚小惠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Chengliye Technology Development Co ltd
Original Assignee
Shenzhen Chengliye Technology Development Co ltd
Application filed by Shenzhen Chengliye Technology Development Co ltd
Priority to CN202410275467.0A
Publication of CN117874209A
Application granted
Publication of CN117874209B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of data processing, and provides an NLP-based fraud short message monitoring and alarming system, which comprises the following components: the data acquisition module acquires text data of the short message to be detected and the fraud short message and performs preprocessing; the fraud short message suspicious keyword dictionary construction module acquires fraud short message text data semantic word segmentation, calculates a semantic word segmentation keyword value and constructs a fraud short message suspicious keyword dictionary; the fraud short message text feature analysis module calculates a key matching distance, acquires a suspicious key text data set, constructs a deformed word-breaking binary tree, calculates a word-breaking traversal query value, acquires a dynamic granularity step length, divides text vector data and calculates a fraud early warning index; and the fraud short message monitoring and early warning module calculates a dynamic fraud early warning threshold of the short message to be detected according to the fraud short message early warning index, and monitors and warns the fraud short message by utilizing the dynamic fraud early warning threshold. The invention solves the problem of inaccurate detection of the content matching of the fraud short message in the fraud short message early warning process.

Description

NLP-based fraud short message monitoring and alarming system
Technical Field
The invention relates to the technical field of data processing, in particular to an NLP-based fraud short message monitoring and alarming system.
Background
The widespread adoption of mobile terminal devices, chiefly smartphones, has greatly changed people's daily lives and made everyday communication more convenient. Alongside the convenience brought by internet technology, new forms of virtual telecom fraud, with short message fraud as the main form, have become increasingly active. Because short message fraud is highly virtual, offenders can easily hide their true identity, so the fraud is well concealed and cheap to carry out. The mass sending of fraud short messages seriously disturbs people's normal lives and can even cause economic and property losses for people in particular age groups, endangering public security and order.
Disclosure of Invention
The invention provides an NLP-based fraud short message monitoring and alarming system, which aims to solve the problem that the fraud short message content matching detection is inaccurate due to the fact that the text similarity of short messages is measured only through Euclidean distance, and the adopted technical scheme is as follows:
The invention provides an NLP-based fraud short message monitoring and alarming system, which comprises the following modules:
The data acquisition module acquires text data of a short message to be detected and text data of fraud short messages and performs preprocessing;
The fraud short message suspicious keyword dictionary construction module acquires the semantic word segmentation of the fraud short message text data, initializes a fraud short message suspicious keyword dictionary, calculates semantic word segmentation keyword values for the different semantic word segments, and constructs the fraud short message suspicious keyword dictionary according to the keyword values of the different semantic word segments in the fraud short messages;
The fraud short message text feature analysis module calculates the key matching distance between the text data of the short message to be detected and the text data of the fraud short message suspicious keyword dictionary; acquires a suspicious key text data set according to the key matching distance; constructs a deformed word-breaking binary tree according to the suspicious key text data set; calculates word-breaking traversal query values according to the deformed word-breaking binary tree; acquires a dynamic granularity step according to the text data of the short message to be detected; divides the text data of the short message to be detected into text vector data of different granularities using the dynamic granularity step; and calculates a fraud early warning index according to the word-breaking traversal query values and the text vector data of different granularities;
The fraud short message monitoring and early warning module calculates a dynamic fraud early warning threshold of the short message to be detected according to the fraud early warning index, and monitors and warns of fraud short messages using the dynamic fraud early warning threshold.
Preferably, the mathematical formula for calculating the semantic word segmentation keyword values for the different semantic word segments is as follows:
In the above formula, N represents the total number of different fraud short message texts; the two count terms represent the numbers of semantic word segments in the t-th and the u-th fraud short messages, respectively; the similarity term represents the cosine similarity between two word segmentation vectors, namely the vector of the m-th semantic word segment of the t-th fraud short message and the vector of the semantic word segment of the u-th fraud short message with which it is compared; exp represents the exponential function with the natural constant as its base; the first pair of position terms represents the positions of these two semantic word segments within their respective fraud short messages; the second pair of position terms represents the positions of these two semantic word segments within their current sentences; and the result represents the keyword value of the m-th semantic word segment in the t-th fraud short message text data.
Preferably, the method for constructing the fraud short message suspicious keyword dictionary according to the keyword values of the different semantic word segments in the fraud short messages comprises the following steps:
The text vector data of every semantic word segment whose keyword value is greater than the mean of the keyword values of the semantic word segments of all fraud short message text data is merged into the fraud short message suspicious keyword dictionary.
Preferably, the method for calculating the key matching distance between the text data of the short message to be detected and the text data of the fraud short message suspicious keyword dictionary comprises the following steps:
The difference in length between each text vector data in the text data of the short message to be detected and each text vector data in the fraud short message suspicious keyword dictionary is recorded as a first difference; the sum of the first difference and 1 is recorded as a first sum; the ratio of the Euclidean distance between the two text vector data to the first sum is recorded as a first ratio; and the mean of the first ratios accumulated over the text vector data in the fraud short message suspicious keyword dictionary is recorded as the key matching distance.
Preferably, the method for acquiring the suspicious key text data set according to the key matching distance between the text data of the short message to be detected and the text data of the fraud short message suspicious keyword dictionary comprises the following steps:
In the text data of the short message to be detected, the text data whose key matching distance is greater than the mean of the key matching distances of all the text data are recorded as the suspicious key text data set.
Preferably, the method for constructing the deformed word-breaking binary tree according to the suspicious key text data set comprises the following steps:
Each word in the suspicious key text data set is taken as the root node of a deformed word-breaking binary tree, and the first result and the second result obtained by splitting the word are taken as the left node and the right node of that tree, respectively.
Preferably, the method for calculating the word-breaking traversal query value according to the deformed word-breaking binary tree comprises the following steps:
A level-order traversal is performed on the deformed word-breaking binary tree. If the corresponding splitting result appears in the short message to be detected, the word-breaking traversal query value of the corresponding word is set to 1; otherwise it is set to 0.
Preferably, the mathematical formula for acquiring the dynamic granularity step according to the text data of the short message to be detected is as follows:
In the above formula, the floor term represents the round-down function; the minimum-granularity term represents the minimum text granularity of the short message to be detected; exp represents the exponential function with the natural constant as its base; the two length terms represent the length of the s-th sentence of the short message to be detected and the longest length among all its sentences, respectively; and the result represents the dynamic granularity step of the s-th sentence of the short message to be detected.
Preferably, the formula for calculating the fraud early warning index according to the word-breaking traversal query values and the text vector data of different granularities is as follows:
In the above formula, the word-count term represents the number of words in the s-th sentence of the short message to be detected; the query term represents the word-breaking traversal query value of the w-th word; exp represents the exponential function with the natural constant as its base; the granularity-count term represents the total number of text granularities obtained by dividing the s-th sentence; the distance term represents the key matching distance between the k-th text vector data and the t-th text vector data in the fraud short message suspicious keyword dictionary; and the result represents the fraud early warning index of the s-th sentence in the short message to be detected.
Preferably, the mathematical formula for calculating the dynamic fraud early warning threshold of the short message to be detected according to the fraud early warning index is as follows:
In the above formula, the first entropy term represents the information entropy of the s-th sentence in the short message to be detected; the second entropy term represents the information entropy of all the different sentences of the short message to be detected; the mean term represents the average of the fraud early warning indexes of all the different sentences in the short message to be detected; and the result represents the dynamic fraud early warning threshold of the s-th sentence in the short message to be detected.
The beneficial effects of the invention are as follows. Semantic word segmentation keyword values are obtained by calculating and analyzing the content of different fraud short messages, and the suspicious keyword dictionary is constructed and expanded using these values, so that suspicious fraud expressions that appear repeatedly in fraud short message texts are analyzed and characterized. The key matching distance, obtained by also analyzing the lengths of the different text data of a short message, reflects the similarity between the text data of the short message to be detected and the suspicious keyword dictionary. The word-splitting phenomenon in fraud short messages is analyzed specifically by constructing deformed word-breaking binary trees and calculating the corresponding word-breaking traversal query values. Further, the text granularity of each sentence is divided by calculating a dynamic granularity step, so that the meaning of each sentence of the short message to be detected at different text granularities is fully considered. The fraud early warning index calculated from these quantities accurately reflects the similarity between the text data of the short message to be detected and fraud keyword information, and thus accurately represents the fraud risk of the short message to be detected.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow diagram of an NLP-based fraud short message monitoring and alarming system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a deformed word-breaking binary tree according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, a flowchart of an NLP-based fraud short message monitoring and alarming system according to an embodiment of the present invention is shown. The system includes a data acquisition module, a fraud short message suspicious keyword dictionary construction module, a fraud short message text feature analysis module and a fraud short message monitoring and early warning module.
And the data acquisition module acquires text data of the short message to be detected and text data of the fraud short message and performs preprocessing.
Specifically, when users fill in personal information such as a mobile phone number during website account registration or login, that information is easily leaked if the website application protects its data poorly, and the user then receives a large number of fraud short messages. When a short message is written, modal particles and stop words are usually added to keep the context coherent; these words carry no actual meaning but interfere considerably with text computation. Therefore, to eliminate such words, the text data of the short message to be detected is preprocessed with the jieba package and the stop words are removed. Meanwhile, in order to further calculate and extract what fraud short message texts have in common, all the different fraud short message text data are acquired and preprocessed in the same way.
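The stop-word removal step can be illustrated with a minimal sketch. It assumes the message has already been segmented into tokens (the source names the jieba package for this; a call such as `jieba.lcut(text)` would be one way to obtain them). The stop-word set and the sample tokens below are hypothetical and chosen only for illustration.

```python
# Hypothetical stop-word set; a real deployment would load a full list.
STOP_WORDS = {"的", "了", "吗", "呢", "啊", "吧"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are neither stop words nor empty/whitespace."""
    return [tok for tok in tokens if tok.strip() and tok not in stop_words]


# Tokens as a segmenter might produce them for "您的账户已冻结了" (assumed output).
tokens = ["您", "的", "账户", "已", "冻结", "了"]
cleaned = remove_stop_words(tokens)
```

The same function would be applied both to the short message to be detected and to the collected fraud short messages, so that later comparisons operate on equally preprocessed text.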
So far, the text data of the short message to be detected and the text data of the fraud short message are obtained.
The fraud short message suspicious keyword dictionary construction module acquires the semantic word segmentation of the fraud short message text data, initializes a fraud short message suspicious keyword dictionary, calculates semantic word segmentation keyword values for the different semantic word segments, and constructs the fraud short message suspicious keyword dictionary according to the keyword values of the different semantic word segments in the fraud short messages.
In general, short message fraud aims to obtain large sums of money, so fraud short messages mainly concern false prize winnings, transfers and loans, monetary transactions and similar content. To make a fraud short message more credible, fraudsters edit and disguise its content and form. Considering that fraud short messages commonly impersonate various organizations to appear credible, a suspicious keyword dictionary is constructed separately for each type of information, and the dictionary for a given type is initialized with the text vector data of the full Chinese names and abbreviations of the organizations commonly impersonated under that type. For example, for short messages impersonating a bank, the corresponding suspicious keyword dictionary is initialized with the text vector data of the full Chinese names and abbreviations of all banks.
It should be noted that the completeness of the suspicious keyword dictionary largely determines the reliability of the fraud short message detection result, so the dictionary needs to be expanded and adjusted according to the information shared by different fraud short messages. Assume there are currently N different fraud short messages. All of them are used as the input of a Conditional Random Field (CRF) algorithm to obtain the semantic word segments of each fraud short message text, and the number of semantic word segments of the k-th fraud short message text data is recorded. Each semantic word segment is encoded by One-hot encoding to obtain the corresponding word segmentation vector. The specific calculation processes of the conditional random field algorithm and One-hot encoding are known techniques and are not described herein.
In the above formula, N represents the total number of different fraud short message texts; the two count terms represent the numbers of semantic word segments in the t-th and the u-th fraud short messages, respectively; the similarity term represents the cosine similarity between two word segmentation vectors, namely the vector of the m-th semantic word segment of the t-th fraud short message and the vector of the semantic word segment of the u-th fraud short message with which it is compared; exp represents the exponential function with the natural constant as its base; the first pair of position terms represents the positions of these two semantic word segments within their respective fraud short messages; the second pair of position terms represents the positions of these two semantic word segments within their current sentences; and the result represents the keyword value of the m-th semantic word segment in the t-th fraud short message text data.
The keyword values of the different semantic word segments of all fraud short messages of each type can be calculated with the above formula. If certain words and sentences appear repeatedly in the fraud short messages of a given type, the cosine similarity between the corresponding semantic word segmentation vectors is relatively large. Likewise, if the positions at which the semantic word segments appear in the fraud short messages are similar, the segments are similar in sentence structure. In that case the calculated keyword value is relatively large, indicating that the current semantic word segment is a key expression that appears repeatedly in fraud short messages.
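The two building blocks named in this passage, One-hot encoding of semantic word segments and cosine similarity between the resulting vectors, can be sketched as follows. This is only an illustration of those two known techniques, not the keyword value formula itself. The function names and the sample segments are hypothetical.

```python
import math


def build_vocabulary(segmented_messages):
    """Assign each distinct segment an index, in first-seen order."""
    vocab = {}
    for segments in segmented_messages:
        for seg in segments:
            vocab.setdefault(seg, len(vocab))
    return vocab


def one_hot(segment, vocab):
    """One-hot vector with a single 1 at the segment's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab[segment]] = 1
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical segmented fraud messages.
messages = [["中奖", "领取", "奖金"], ["转账", "领取", "贷款"]]
vocab = build_vocabulary(messages)
same = cosine_similarity(one_hot("领取", vocab), one_hot("领取", vocab))
different = cosine_similarity(one_hot("中奖", vocab), one_hot("转账", vocab))
```

With One-hot vectors the cosine similarity is 1 for identical segments and 0 for different ones, so in this setting the similarity term effectively marks segments that recur across messages.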
A keyword value is calculated in this way for the semantic word segmentation vectors of all the different fraud short messages; the larger the value, the more frequently the corresponding keyword appears in fraud short messages. The text vector data of every semantic word segment whose keyword value is greater than the mean keyword value over all fraud short messages of the type is then merged into the suspicious keyword dictionary.
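The dictionary expansion rule can be sketched as a simple above-mean filter. For readability the sketch stores segment strings rather than text vector data, which is a simplifying assumption. The keyword values and dictionary entries are hypothetical numbers that stand in for the output of the keyword value formula.

```python
from statistics import mean


def expand_dictionary(dictionary, keyword_values):
    """Merge every segment whose keyword value exceeds the mean into the dictionary."""
    threshold = mean(keyword_values.values())
    selected = {seg for seg, value in keyword_values.items() if value > threshold}
    return set(dictionary) | selected


# Dictionary initialized with organization names (hypothetical seed entries).
seed = {"中国银行", "中行"}
# Hypothetical keyword values for segments found in fraud messages of this type.
values = {"转账": 0.9, "验证码": 0.7, "您好": 0.1, "今天": 0.3}
expanded = expand_dictionary(seed, values)
```

Here the mean keyword value is 0.5, so only the two segments above it are merged and the seed entries are kept unchanged.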
Thus, the fraud short message suspicious keyword dictionary is obtained.
The fraud short message text feature analysis module calculates the key matching distance between the text data of the short message to be detected and the text data of the fraud short message suspicious keyword dictionary; acquires a suspicious key text data set according to the key matching distance; constructs a deformed word-breaking binary tree according to the suspicious key text data set; calculates word-breaking traversal query values according to the deformed word-breaking binary tree; acquires a dynamic granularity step according to the text data of the short message to be detected; divides the text data of the short message to be detected into text vector data of different granularities using the dynamic granularity step; and calculates a fraud early warning index according to the word-breaking traversal query values and the text vector data of different granularities.
It should be noted that when words similar to those in the fraud short message suspicious keyword dictionary appear in the text data of the short message to be detected, the short message is relatively likely to be a fraud short message. However, when the similarity of two texts is calculated only through the Euclidean distance between their text vectors, the length characteristics of the text within the sentence structure are ignored and the similarity is characterized inaccurately. The structural characteristics of the text data of the short message to be detected are therefore taken into account in the further calculation.
In the above formula, TL represents the total number of different text vector data in the fraud short message suspicious keyword dictionary; the two vector terms represent the k-th text vector data in the short message to be detected and the t-th text vector data in the fraud short message suspicious keyword dictionary, respectively; the distance term represents the Euclidean distance between two text vector data; the two length terms represent the lengths of these two text vector data, respectively; and the result represents the key matching distance between the k-th text vector data of the short message to be detected and the t-th text vector data in the fraud short message suspicious keyword dictionary.
The key matching distance between every text vector data in the short message to be detected and the text vector data in the fraud short message suspicious keyword dictionary can be calculated with the above formula. If the short message to be detected contains words and sentences related to the dictionary, the Euclidean distance between the corresponding text vector data is relatively small. Meanwhile, the smaller the length difference between two text vector data, the closer their structures; and the smaller the key matching distance calculated between the two, the higher the probability that suspicious fraud keywords appear in the short message to be detected.
It should be noted that, in order to get fraud information through to potential victims, fraudsters may alter the text of a fraud short message, for example by splitting certain words into their component parts, such as splitting "转账" (transfer) in the original fraud short message into "车专贝长". The resulting text differs greatly in form from the original fraud short message and is not easily intercepted by traditional fraud short message detection, yet it can still deceive people in particular age groups. Such split words and sentences therefore need to be analyzed separately.
Specifically, the mean of the key matching distances of all the text data in the short message to be detected is calculated first, and all text data whose key matching distance is greater than this mean are recorded as the suspicious key text data set. To prevent word splitting from changing the form of the fraud short message text and degrading recognition and detection, a deformed word-breaking binary tree is constructed for each word (a single Chinese character) of all text data in the suspicious key text data set. As shown in fig. 2, the root node of the deformed word-breaking binary tree represents one word of the current text data. Considering that Chinese characters have certain structural characteristics, such as a left-right or top-bottom structure, the invention specifies that each word has at most two splitting results. The first result of splitting a word is used as the left node of the deformed word-breaking binary tree and the second result as the right node. An open-source Chinese character splitting library is used to split the words; the specific splitting process is a known technique and is not repeated here.
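The tree construction can be sketched with a small lookup table standing in for the open-source character splitting library, which the source does not name. The table `DECOMPOSITION`, the function names, and the plain `(root, left, right)` tuple representation are all assumptions made for illustration.

```python
# Hypothetical splitting table: character -> (first result, second result).
DECOMPOSITION = {"转": ("车", "专"), "账": ("贝", "长")}


def build_split_tree(char, table=DECOMPOSITION):
    """Return (root, left, right); children are None when no split is known."""
    left, right = table.get(char, (None, None))
    return (char, left, right)


def build_split_trees(text, table=DECOMPOSITION):
    """Build one word-breaking tree per character of the given text."""
    return [build_split_tree(ch, table) for ch in text]


trees = build_split_trees("转账")
```

Each tree has at most two children, matching the rule that a word has at most two splitting results.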
A level-order traversal is performed on the deformed word-breaking binary tree. If the corresponding splitting result appears in the short message to be detected, the word-breaking traversal query value of the corresponding word is set to 1; otherwise it is set to 0. The level-order traversal of a binary tree is a well-known technique and is not described herein.
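A minimal sketch of the traversal and query step follows. Trees are written as hypothetical `(root, left, right)` tuples. The sketch assumes that a splitting result "appears" in the message when the child components, read in level order, occur as a contiguous substring; the source does not spell out this matching rule.

```python
from collections import deque


def level_order(tree):
    """Breadth-first visit; nodes are tuples (value, left, right), strings, or None."""
    visited, queue = [], deque([tree])
    while queue:
        node = queue.popleft()
        if node is None:
            continue
        if isinstance(node, tuple):
            value, left, right = node
            visited.append(value)
            queue.extend([left, right])
        else:
            visited.append(node)  # leaf stored as a plain string
    return visited


def split_query_value(tree, message):
    """1 if the split components (root excluded) occur in the message, else 0."""
    components = level_order(tree)[1:]
    if not components:
        return 0
    return 1 if "".join(components) in message else 0


message = "请车专贝长到安全账户"  # hypothetical message with split characters
value_zhuan = split_query_value(("转", "车", "专"), message)
value_zhang = split_query_value(("账", "贝", "长"), message)
```

A message that contains the intact word rather than its components yields 0 for the same trees, which is what lets the query value flag deliberate splitting.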
It should be noted that the short message is an instant communication means whose content, form and length are flexible, and short message texts of different lengths and structures may express the same meaning. The characteristics of the short message text data therefore need to be analyzed and calculated at different granularities.
Specifically, each sentence is first divided to obtain different text granularities. Because a single Chinese character can hardly express an actual meaning, text of two Chinese characters is taken as the minimum text granularity of the short message (the implementer can adjust this value according to the specific situation), and the text length of the current sentence is taken as the maximum text granularity. Since sentences in different fraud short messages differ in length, the text granularity of each sentence needs to be adjusted dynamically in order to capture the semantic characteristics of the text accurately.
In the above formula, the floor term represents the round-down function; the minimum-granularity term represents the minimum text granularity of the short message to be detected; exp represents the exponential function with the natural constant as its base; the two length terms represent the length of the s-th sentence of the short message to be detected and the longest length among all its sentences, respectively; and the result represents the dynamic granularity step of the s-th sentence of the short message to be detected.
The dynamic granularity step of every sentence in the short message can be calculated with the above formula. For a longer sentence, the text may carry different meanings at different granularity lengths. To capture the semantic characteristics of the text at these different granularities, the dynamic granularity step of a longer sentence is relatively small, and dividing with a smaller step yields text data at more granularities.
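How a step value turns into a set of granularities can be sketched as follows. The step is passed in as a plain integer because the step formula is not reproduced in this text. Dividing a sentence into contiguous character windows is an assumption, as the source does not specify the exact division scheme. The sentence length is always included as the largest granularity, following the statement that it is the maximum text granularity.

```python
def granularity_sizes(sentence_length, step, min_granularity=2):
    """Granularity sizes from the minimum up to the sentence length, spaced by step."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    sizes = list(range(min_granularity, sentence_length + 1, step))
    if sizes and sizes[-1] != sentence_length:
        sizes.append(sentence_length)  # sentence length is the maximum granularity
    return sizes


def divide(sentence, size):
    """All contiguous windows of `size` characters (assumed division scheme)."""
    return [sentence[i:i + size] for i in range(len(sentence) - size + 1)]


sentence = "请转账到安全账户"  # hypothetical 8-character sentence
sizes = granularity_sizes(len(sentence), step=3)
pieces = {size: divide(sentence, size) for size in sizes}
```

A smaller step produces more sizes between the minimum and the sentence length, which is the effect the passage describes for longer sentences.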
Through the dynamic granularity step, every sentence can be divided into text data at several different granularities, and One-hot encoding is applied to the text data at each granularity to obtain the corresponding word vectors. Assume that the s-th sentence in the current short message to be detected is divided into a certain number of different granularities. Considering that the text vector data of a short message express different meanings at different granularities, the different granularities are combined for further analysis and calculation.
In the above formula, the normalization term represents a normalization function; the word-count term represents the number of words in the s-th sentence of the short message to be detected; the query term represents the word-breaking traversal query value of the w-th word; the granularity-count term represents the total number of text granularities obtained by dividing the s-th sentence; the distance term represents the key matching distance between the k-th text vector data and the t-th text vector data in the fraud short message suspicious keyword dictionary; and the result represents the fraud early warning index of the s-th sentence in the short message to be detected.
In this way the fraud early warning index of each sentence in the short message to be detected can be calculated. When abnormal word breaking occurs in the short message to be detected, the word-breaking traversal query values of the affected sentences are relatively large. At the same time, the smaller the key matching distances calculated for the text semantics at the various granularities of the current short message, the stronger the indication that the message both contains broken words and carries fraudulent meaning, and the larger the fraud early warning index calculated for the corresponding sentence.
The fraud early warning indexes of the different sentences of the short message to be detected are thus obtained.
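The word-breaking traversal query value that feeds this index can be sketched as below, following the tree construction and level-order traversal set out in the claims. The split table, the `Node` class and both function names are hypothetical. The sketch also assumes that a "word-breaking dividing result appears in the short message" when the two parts occur next to each other in the message, and that the split table contains no cycles.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Node:
    text: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(word: str, splits: Dict[str, Tuple[str, str]]) -> Node:
    """Root is the suspicious word; children are the two parts it splits into."""
    node = Node(word)
    if word in splits:  # hypothetical split table, assumed acyclic
        first, second = splits[word]
        node.left = build_tree(first, splits)
        node.right = build_tree(second, splits)
    return node


def traversal_query_value(root: Node, message: str) -> int:
    """Level-order traversal: 1 if any split result occurs in the message, else 0."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.left is not None and node.right is not None:
            # Assumption: the split parts appear adjacent in the message.
            if node.left.text + node.right.text in message:
                return 1
            queue.extend([node.left, node.right])
    return 0
```

For example, with a hypothetical table that splits "好" into "女" and "子", a message containing "女子" yields 1, while a message containing the intact "好" yields 0.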
The fraud short message monitoring and early warning module calculates a dynamic fraud early warning threshold for the short message to be detected according to the fraud early warning index, and uses the dynamic fraud early warning threshold to monitor and raise alarms for fraud short messages.
In the formula above, the quantities are, in order: the information entropy of the s-th sentence in the short message to be detected; the information entropy of all the different sentences of the short message to be detected; the average value of the fraud early warning indexes of all the different sentences in the short message to be detected; and the dynamic fraud early warning threshold of the s-th sentence in the short message to be detected.
Because the sentences in the short message to be detected differ in importance, the fraud early warning threshold of each sentence is adjusted dynamically using the information entropy of that sentence. A more important sentence carries richer semantic information, so its calculated information entropy is relatively large and the dynamic fraud early warning threshold of that sentence is correspondingly large. The calculation of the text information entropy of a sentence is a known technique and is not repeated here.
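Since the source treats the entropy calculation as known, one common choice is shown here purely as an illustration: character-level Shannon entropy in bits. The character-level unit, the base-2 logarithm and the function name `sentence_entropy` are assumptions, and the threshold formula itself is not reproduced.

```python
import math
from collections import Counter


def sentence_entropy(sentence: str) -> float:
    """Shannon entropy (bits) of the character distribution of a sentence."""
    if not sentence:
        return 0.0
    counts = Counter(sentence)
    total = len(sentence)
    # H = -sum(p * log2(p)) over the distinct characters of the sentence.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A sentence made of one repeated character has zero entropy, while a sentence whose characters are all distinct reaches the maximum for its length, which fits the remark that semantically richer sentences receive larger entropy values.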
When the fraud early warning indexes of n different sentences in the short message to be detected are greater than or equal to their dynamic fraud early warning thresholds, the short message is considered to carry a high fraud risk: it is marked as a suspected fraud short message, and early warning information is sent to remind the recipient to pay attention to the safety of their property. The empirical value of n is 3; in a specific application the implementer can set it according to the circumstances. Otherwise, if the fraud early warning indexes of the current short message are smaller than the fraud early warning thresholds, the short message is considered to carry a low fraud risk. Monitoring and alarming for short message fraud are thus completed using the fraud early warning indexes of the short message to be detected.
This completes the monitoring and alarming of fraud short messages.
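The decision rule above can be sketched as follows. The sketch assumes that "n different sentences" means at least n sentences, and the function name `is_suspected_fraud` is hypothetical. The per-sentence indexes and thresholds are taken as given inputs.

```python
from typing import Sequence


def is_suspected_fraud(indexes: Sequence[float],
                       thresholds: Sequence[float],
                       n: int = 3) -> bool:
    """Flag a message when at least n sentences reach their own threshold.

    `n` defaults to the empirical value 3 given in the description.
    """
    if len(indexes) != len(thresholds):
        raise ValueError("one threshold is required per sentence index")
    # Count sentences whose index is greater than or equal to its threshold.
    hits = sum(1 for index, limit in zip(indexes, thresholds) if index >= limit)
    return hits >= n
```

Keeping n as a parameter mirrors the remark that the implementer may adjust it for a specific application.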
The foregoing description of the preferred embodiments is not intended to limit the invention to the precise form disclosed; any modifications, equivalents, and alternatives falling within the spirit and principles of the invention are intended to be included within its scope.

Claims (7)

1. An NLP-based fraud short message monitoring and alarming system, characterized by comprising the following modules:
the data acquisition module acquires text data of a short message to be detected and text data of fraud short messages and performs preprocessing;
the fraud short message suspicious keyword dictionary construction module acquires the semantic word segmentations of the fraud short message text data, initializes a fraud short message suspicious keyword dictionary, calculates a semantic word segmentation keyword value for each different semantic word segmentation, and constructs the fraud short message suspicious keyword dictionary according to the keyword values of the different semantic word segmentations in the fraud short messages;
the fraud short message text feature analysis module calculates key matching distances between the text data of the short message to be detected and the text data of the fraud short message suspicious keyword dictionary according to the dictionary, acquires a suspicious key text data set according to those key matching distances, constructs a deformed word-breaking binary tree according to the suspicious key text data set, calculates word-breaking traversal query values according to the deformed word-breaking binary tree, acquires a dynamic granularity step according to the text data of the short message to be detected, divides that text data into text vector data of different granularities using the dynamic granularity step, and calculates fraud early warning indexes according to the word-breaking traversal query values and the text vector data of different granularities;
the fraud short message monitoring and early warning module calculates a dynamic fraud early warning threshold for the short message to be detected according to the fraud early warning index, and uses the dynamic fraud early warning threshold to monitor and raise alarms for fraud short messages;
The mathematical formula for calculating the semantic word segmentation keyword value for the different semantic word segmentations is as follows:
In the above formula, N represents the total number of different fraud short message text data; the remaining quantities are, in order: the numbers of semantic word segmentations in the t-th and the u-th fraud short messages; the cosine similarity between two different text word vectors; the word segmentation vectors of the m-th semantic word segmentation of the t-th fraud short message and of the m-th semantic word segmentation of the u-th fraud short message; the exponential function with the natural constant as its base; the positions within the fraud short message of the m-th semantic word segmentation of the t-th and of the u-th fraud short message; the positions within the current sentence of the m-th semantic word segmentation in the t-th and in the u-th fraud short message; and the keyword value of the m-th semantic word segmentation in the t-th fraud short message text data;
The mathematical formula for acquiring the dynamic granularity step according to the text data of the short message to be detected is as follows:
In the above formula, ⌊ ⌋ denotes the floor (round-down) function; the remaining quantities are, in order: the minimum text granularity of the short message to be detected; the exponential function with the natural constant as its base; the length of the s-th sentence of the short message to be detected; the longest length among all the different sentences of the short message to be detected; and the dynamic granularity step of the s-th sentence of the short message to be detected;
The method for calculating the fraud early warning index according to the word-breaking traversal query values and the text vector data of different granularities is as follows:
In the formula above, the quantities are, in order: the number of words in the s-th sentence of the short message to be detected; the word-breaking traversal query value of the w-th word; the exponential function with the natural constant as its base; the total number of text granularities obtained by dividing the s-th sentence; the key matching distance between the k-th text vector data and the t-th text vector data in the fraud short message suspicious keyword dictionary; and the fraud early warning index of the s-th sentence in the short message to be detected.
2. The NLP-based fraud short message monitoring and alarming system of claim 1, wherein the method for constructing the fraud short message suspicious keyword dictionary according to the keyword values of the different semantic word segmentations in the fraud short messages is as follows:
The text vector data of every semantic word segmentation whose keyword value is greater than the average of the keyword values of all the different semantic word segmentations in all the fraud short message text data is merged into the fraud short message suspicious keyword dictionary.
3. The NLP-based fraud short message monitoring and alarming system of claim 2, wherein the method for calculating the key matching distance between the text data of the short message to be detected and the text data of the fraud short message suspicious keyword dictionary according to the dictionary is as follows:
The difference in length between the text data of the short message to be detected and each text vector data in the fraud short message suspicious keyword dictionary is recorded as a first difference; the sum of the first difference and the number 1 is recorded as a first sum; the ratio of the Euclidean distance between the text data of the short message to be detected and each text vector data in the dictionary to the first sum is recorded as a first ratio; and the average of the accumulated first ratios over all the text vector data in the dictionary is recorded as the key matching distance.
4. The NLP-based fraud short message monitoring and alarming system of claim 3, wherein the method for acquiring the suspicious key text data set according to the key matching distances between the text data of the short message to be detected and the text data of the fraud short message suspicious keyword dictionary is as follows:
In the text data of the short message to be detected, the text data whose key matching distance is greater than the average key matching distance of all the different text data is recorded as the suspicious key text data set.
5. The NLP-based fraud short message monitoring and alarming system of claim 4, wherein the method for constructing the deformed word-breaking binary tree from the suspicious key text data set is as follows:
Each word in the suspicious key text data set is taken as the root node of its own deformed word-breaking binary tree, and the first result and the second result obtained by breaking that word apart are taken as the left and right child nodes of the tree respectively.
6. The NLP-based fraud short message monitoring and alarming system of claim 5, wherein the method for calculating the word-breaking traversal query value from the deformed word-breaking binary tree is as follows:
A level-order traversal is performed on the deformed word-breaking binary tree; if the corresponding word-breaking result appears in the short message to be detected, the word-breaking traversal query value of the corresponding word is set to 1; otherwise, the word-breaking traversal query value is set to 0.
7. The NLP-based fraud short message monitoring and alarming system of claim 1, wherein the mathematical formula for calculating the dynamic fraud early warning threshold of the short message to be detected according to the fraud early warning index is:
In the formula above, the quantities are, in order: the information entropy of the s-th sentence in the short message to be detected; the information entropy of all the different sentences of the short message to be detected; the average value of the fraud early warning indexes of all the different sentences in the short message to be detected; and the dynamic fraud early warning threshold of the s-th sentence in the short message to be detected.
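The key matching distance of claim 3 can be sketched as below. This is a reading under stated assumptions, not a confirmed implementation: the first ratio is taken to be the Euclidean distance divided by the first sum (length difference plus 1), and vectors of unequal length are zero-padded to a common length before the distance is computed. The function name `key_matching_distance` is hypothetical.

```python
import math
from typing import List, Sequence


def key_matching_distance(vector: Sequence[float],
                          dictionary: List[Sequence[float]]) -> float:
    """Mean over dictionary entries of euclidean / (|length difference| + 1)."""
    if not dictionary:
        raise ValueError("the keyword dictionary must not be empty")
    ratios = []
    for entry in dictionary:
        size = max(len(vector), len(entry))
        # Assumption: zero-pad the shorter vector so both have equal length.
        a = list(vector) + [0.0] * (size - len(vector))
        b = list(entry) + [0.0] * (size - len(entry))
        first_sum = abs(len(vector) - len(entry)) + 1
        ratios.append(math.dist(a, b) / first_sum)
    return sum(ratios) / len(ratios)
```

Adding 1 to the length difference keeps the denominator non-zero when the two texts have the same length.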
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410275467.0A CN117874209B (en) 2024-03-12 2024-03-12 NLP-based fraud short message monitoring and alarming system

Publications (2)

Publication Number Publication Date
CN117874209A CN117874209A (en) 2024-04-12
CN117874209B true CN117874209B (en) 2024-05-17

Family

ID=90595107

Country Status (1)

Country Link
CN (1) CN117874209B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant