CN111782986A - Method and device for monitoring access based on short link - Google Patents


Info

Publication number
CN111782986A
Authority
CN
China
Prior art keywords
keyword
word
keywords
word frequency
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910414745.5A
Other languages
Chinese (zh)
Inventor
曾文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201910414745.5A
Publication of CN111782986A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558 Details of hyperlinks; Management of linked annotations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/12 Messaging; Mailboxes; Announcements
    • H04W4/14 Short messaging services, e.g. short message services [SMS] or unstructured supplementary service data [USSD]

Abstract

The invention discloses a method and a device for monitoring short-link-based access, and relates to the technical field of computers. One embodiment of the method comprises: reading an original link, a first keyword and its word frequency from a short link record whose record state is access-allowed; capturing the webpage content currently corresponding to the original link, and determining a second keyword and its word frequency from the current content; determining the similarity between the first keyword and the second keyword based on the first keyword and its word frequency and the second keyword and its word frequency; and if the similarity does not exceed a set threshold, updating the record state of the short link record to access-prohibited. This embodiment solves the technical problem that the intermediate monitoring party audits the content of the original link only when the service side applies for pushing, and does not subsequently monitor the pushed short link after the audit is passed.

Description

Method and device for monitoring access based on short link
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for monitoring access based on short links.
Background
In internet services, users often access content pushed by a service side through a short link they receive. The content of the original link submitted by the service side must first be audited by an intermediate monitoring party, and only then can it be pushed to users in the form of a short link.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
1) In the prior art, the intermediate monitoring party audits the content of the original link only when the service side applies for pushing, and does not subsequently monitor the pushed short link after the audit is passed.
2) In the prior art, after the short link has been pushed to users, the service side may replace the page corresponding to the original link with another page, such as a phishing webpage, so that users access an illegal website and suffer losses.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for monitoring short-link-based access, which can solve the problem in the prior art that subsequent monitoring is not performed on a pushed short-link after an audit is passed.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method of monitoring short-link-based access, including: reading an original link, a first keyword and its word frequency from a short link record whose record state is access-allowed; capturing the webpage content currently corresponding to the original link, and determining a second keyword and its word frequency from the current content; determining the similarity between the first keyword and the second keyword based on the first keyword and its word frequency and the second keyword and its word frequency; and if the similarity between the first keyword and the second keyword does not exceed a set threshold, updating the record state of the short link record to access-prohibited; wherein the first keyword is determined from the webpage content corresponding to the original link at the time the original link passed the audit.
Optionally, if the similarity between the first keyword and the second keyword exceeds the set threshold, the method further includes: acquiring the sensitive words corresponding to the category identifier in the short link record, and judging whether the second keyword includes any of the sensitive words; and if the second keyword includes a sensitive word, updating the record state of the short link record to access-prohibited.
Optionally, determining similarity between the first keyword and the second keyword based on the first keyword and the word frequency thereof and the second keyword and the word frequency thereof, including: carrying out normalization processing on the first keyword and the second keyword to obtain a keyword subjected to normalization processing; forming a keyword array according to the keywords after the normalization processing; determining word vectors of the first keywords and word vectors of the second keywords based on the first keywords and word frequencies thereof, the second keywords and word frequencies thereof and the keyword array; and determining the similarity between the first keyword and the second keyword according to the word vector of the first keyword and the word vector of the second keyword.
Optionally, forming a keyword array according to the normalized keywords, including: taking the sum of the word frequency of the first keyword and the word frequency of the second keyword as the word frequency of the keyword after the normalization processing; sorting the keywords after the normalization processing according to the word frequency of the keywords after the normalization processing to obtain a keyword sequence; and forming a keyword array by the keywords after the normalization processing based on the keyword sequence.
Optionally, determining a word vector of the first keyword based on the first keyword, its word frequency, and the keyword array includes: traversing the keyword array; for each keyword in the array that is a first keyword, taking its word frequency among the first keywords as the corresponding value; for each keyword in the array that appears only among the second keywords, setting the corresponding value to 0; and determining the word vector of the first keyword from the corresponding values in the keyword array.
According to another aspect of the embodiments of the present invention, there is provided an apparatus for monitoring short link based access, including: a reading module to: reading an original link, a first keyword and a word frequency thereof in a short link record, wherein the record state of the short link record is allowed to be accessed; a real-time capture module for: capturing the current corresponding webpage content of the original link according to the original link, and determining a second keyword and word frequency thereof according to the current content; a similarity determination module to: determining the similarity of the first keyword and the second keyword based on the first keyword and the word frequency thereof and the second keyword and the word frequency thereof; a monitoring module to: if the similarity between the first keyword and the second keyword does not exceed a set threshold, updating the record state of the short link record to be access-prohibited; and the first keyword is determined by the corresponding webpage content when the original link is approved.
Optionally, the monitoring module is further configured to, if the similarity between the first keyword and the second keyword meets a set threshold: acquiring a sensitive word corresponding to the category identification according to the category identification in the short link record, and judging whether the second keyword comprises the sensitive word; and if the second keyword comprises the sensitive word, updating the record state of the short link record to be access prohibition.
Optionally, the similarity determining module is further configured to: carrying out normalization processing on the first keyword and the second keyword to obtain a keyword subjected to normalization processing; forming a keyword array according to the keywords after the normalization processing; determining word vectors of the first keywords and word vectors of the second keywords based on the first keywords and word frequencies thereof, the second keywords and word frequencies thereof and the keyword array; and determining the similarity between the first keyword and the second keyword according to the word vector of the first keyword and the word vector of the second keyword.
Optionally, the similarity determining module is further configured to: taking the sum of the word frequency of the first keyword and the word frequency of the second keyword as the word frequency of the keyword after the normalization processing; sorting the keywords after the normalization processing according to the word frequency of the keywords after the normalization processing to obtain a keyword sequence; and forming a keyword array by the keywords after the normalization processing based on the keyword sequence.
Optionally, the similarity determining module is further configured to: traversing the keyword array: taking the word frequency of the first keyword as a corresponding value of the first keyword in the keyword array; setting the corresponding value of the second keyword in the keyword array to be 0; determining a word vector for the first keyword based on corresponding values in the keyword array.
According to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of monitoring short link based access as provided in the preceding embodiments.
According to a further aspect of embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the method of monitoring short link based access as provided in the previous embodiments.
One embodiment of the above invention has the following advantages or benefits: by extracting keywords and their word frequencies and calculating the similarity between the keywords obtained at audit time and those obtained afterwards, it solves the technical problem that the intermediate monitoring party audits the content of the original link only when the service side applies for pushing and does not subsequently monitor the pushed short link after the audit is passed. Pushed short links are thereby monitored after the audit, substantial modification of the content corresponding to the original link after the audit is detected, users are prevented from accessing illegally modified pages, and users' interests are protected.
Further effects of the above optional implementations are described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of a basic flow of a method of monitoring short link based access according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a preferred flow of a method of monitoring short link based access in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of the main modules of a device monitoring short link based access according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a short link pushing system based on a monitoring short link method according to an embodiment of the present invention;
FIG. 5 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The following application scenarios can be taken as examples to describe the embodiments of the present invention in detail:
Enterprises (i.e. the service side in the invention) often need to send short messages (i.e. push messages) to users, and some short message contents need to include enterprise-related hyperlinks. The short message virtual operator (i.e. the intermediate monitoring party in the invention) generally processes the hyperlinks contained in the short message content according to the following procedure: the enterprise submits the hyperlink that it wishes to send (i.e. the original link); the short message virtual operator audits the content corresponding to the hyperlink and, once the content passes the audit, returns a short link to the enterprise; the enterprise attaches the short link to the short message; after the short message is sent to the user and the user clicks the short link, the short message virtual operator records the click for statistics and then redirects the user to the hyperlink that the enterprise wished to send.
Fig. 1 is a schematic diagram of a basic flow of a method of monitoring short link based access according to an embodiment of the present invention. As shown in fig. 1, an embodiment of the present invention provides a method for monitoring short-link-based access, including:
S101, reading an original link, a first keyword and its word frequency from a short link record whose record state is access-allowed;
S102, capturing the webpage content currently corresponding to the original link, and determining a second keyword and its word frequency from the current content;
S103, determining the similarity between the first keyword and the second keyword based on the first keyword and its word frequency and the second keyword and its word frequency;
S104, if the similarity between the first keyword and the second keyword does not exceed a set threshold, updating the record state of the short link record to access-prohibited;
wherein the first keyword is determined from the webpage content corresponding to the original link at the time the original link passed the audit. Specifically, Chinese word segmentation and word frequency calculation may be performed on the webpage content corresponding to the original link at audit time to determine the first keyword. Similarly, the second keyword may be determined by performing Chinese word segmentation and word frequency calculation on the captured webpage content currently corresponding to the original link.
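The keyword and word-frequency extraction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `extract_keywords`, the `top_n` parameter, and the regular-expression tokenizer are all assumptions, and the tokenizer is only a stand-in for the Chinese word segmentation step that the text calls for.

```python
import re
from collections import Counter


def extract_keywords(text: str, top_n: int = 4) -> dict[str, int]:
    """Return the top_n most frequent tokens of `text` with their counts."""
    # Stand-in tokenizer (assumption): lowercase runs of ASCII letters.
    # The text calls for Chinese word segmentation at this step instead.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Counter.most_common returns (token, count) pairs, most frequent first.
    return dict(Counter(tokens).most_common(top_n))


page = "Fashion sale: fashion for members, sports fashion, members only"
print(extract_keywords(page, top_n=2))  # {'fashion': 3, 'members': 2}
```

The same function would be applied twice: once to the page content at audit time (first keyword) and once to the currently captured content (second keyword).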
Specifically, a short link record table constructed by the intermediate monitoring party is shown in table 1.
TABLE 1
[Table 1: short link record table. The table was an image in the original publication and is not reproduced here.]
According to the embodiment of the invention, by extracting keywords and their word frequencies and calculating the similarity between the keywords obtained at audit time and those obtained afterwards, the technical problem that the intermediate monitoring party audits the content of the original link only when the service side applies for pushing and does not subsequently monitor the pushed short link after the audit is passed is solved. Pushed short links are thereby monitored after the audit, substantial modification of the content corresponding to the original link after the audit is detected, users are prevented from accessing illegal pages, and users' interests are protected.
In the embodiment of the present invention, if the similarity between the first keyword and the second keyword exceeds the set threshold, the method further includes: acquiring the sensitive words corresponding to the category identifier in the short link record, and judging whether the second keyword includes any of the sensitive words; and if the second keyword includes a sensitive word, updating the record state of the short link record to access-prohibited and, at the same time, sending an early warning to the user. By judging whether the page contains sensitive words and promptly blocking user access to illegal pages, the embodiment protects users from losses and thereby avoids serious consequences, such as the information pushing channel being shut down after a surge of complaints caused by user losses.
Specifically, the category identifier in the short link record may be Comp_type in Table 1 (the identifier of the industry to which the enterprise belongs), and the corresponding sensitive word rules (sensitive words) or other restriction rules may be looked up in the sensitive word rule table by this industry identifier. The sensitive word rule table is shown in Table 2.
TABLE 2
[Table 2: sensitive word rule table. The table was an image in the original publication and is not reproduced here.]
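The lookup of sensitive words by category identifier can be sketched as follows. Because Table 2 is not reproduced in this text, the rule table `SENSITIVE_WORDS`, its category identifiers, and its words are hypothetical placeholders, as is the function name.

```python
# Hypothetical rule table (assumption): category identifier -> sensitive words.
# The real contents of Table 2 are not available in this text.
SENSITIVE_WORDS = {
    "finance": {"lottery", "loan"},
    "retail": {"counterfeit"},
}


def contains_sensitive_word(category_id: str, second_keywords: dict[str, int]) -> bool:
    """Return True if any current keyword is a sensitive word for the category."""
    # An unknown category has no rules, so nothing can match.
    words = SENSITIVE_WORDS.get(category_id, set())
    # Iterating a dict yields its keys, i.e. the keywords themselves.
    return any(keyword in words for keyword in second_keywords)


print(contains_sensitive_word("retail", {"fashion": 15, "counterfeit": 2}))  # True
print(contains_sensitive_word("retail", {"fashion": 15, "members": 5}))      # False
```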
In step S103 of the embodiment of the present invention, determining the similarity between the first keyword and the second keyword based on the first keyword and the word frequency thereof and the second keyword and the word frequency thereof includes: carrying out normalization processing on the first keyword and the second keyword to obtain a keyword subjected to normalization processing; forming a keyword array according to the keywords after the normalization processing; determining word vectors of the first keywords and word vectors of the second keywords based on the first keywords and word frequencies thereof, the second keywords and word frequencies thereof and the keyword array; and determining the similarity between the first keyword and the second keyword according to the word vector of the first keyword and the word vector of the second keyword.
Normalizing the keywords before determining the word vectors makes the calculated similarity more accurate, which in turn supports subsequent monitoring of the pushed short links. Here, normalization means merging the two keyword sets and deleting repeated keywords. For example, if the first keywords obtained at audit time are A, B, C and the second keywords obtained from the current capture are B, C, D, the normalized keywords are A, B, C, D.
For example, the first keywords and their word frequencies are: fashion 14, yearly goods 6, members 5, sports 4;
the second keywords and their word frequencies are: fashion 15, yearly goods 5, members 5, clothing 3.
After normalization, the normalized keywords are: fashion, yearly goods, members, sports, clothing.
Based on the above embodiment, in the embodiment of the present invention, forming a keyword array according to the normalized keywords includes: taking the sum of the word frequency of the first keyword and the word frequency of the second keyword as the word frequency of the keyword after the normalization processing; sorting the keywords after the normalization processing according to the word frequency of the keywords after the normalization processing (preferably sorting according to the sequence of the word frequency from large to small, or sorting according to the sequence of the word frequency from small to large), so as to obtain a keyword sequence; and forming a keyword array by the keywords after the normalization processing based on the keyword sequence. The implementation of the invention determines the keyword array according to the word frequency of the keywords, so that the calculation similarity of the obtained word vectors can be more accurate, and the technical effect of subsequent monitoring on the pushed short links can be further achieved.
Specifically, after normalization, the normalized keywords are sorted by their summed word frequency from largest to smallest to obtain the keyword sequence: fashion 29, yearly goods 11, members 10, sports 4, clothing 3. From this, the keyword array (fashion, yearly goods, members, sports, clothing) is obtained.
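The normalization and sorting steps above can be sketched as follows, using the example figures from the text. The function name `build_keyword_array` is an assumption made for illustration.

```python
def build_keyword_array(first: dict[str, int], second: dict[str, int]) -> list[str]:
    """Merge two keyword sets and order them by summed word frequency."""
    merged: dict[str, int] = {}
    for freqs in (first, second):
        for word, count in freqs.items():
            # Normalization: a keyword present in both sets is kept once,
            # with the sum of its two word frequencies.
            merged[word] = merged.get(word, 0) + count
    # Largest summed frequency first; sorted() is stable, so ties keep
    # the order in which the keywords were first seen.
    return sorted(merged, key=lambda word: merged[word], reverse=True)


first = {"fashion": 14, "yearly goods": 6, "members": 5, "sports": 4}
second = {"fashion": 15, "yearly goods": 5, "members": 5, "clothing": 3}
print(build_keyword_array(first, second))
# ['fashion', 'yearly goods', 'members', 'sports', 'clothing']
```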
Based on the foregoing embodiment, in the embodiment of the present invention, determining a word vector of the first keyword based on the first keyword, the word frequency thereof, and the keyword array includes: traversing the keyword array: taking the word frequency of the first keyword as a corresponding value of the first keyword in the keyword array; setting the corresponding value of the second keyword in the keyword array to be 0; determining a word vector for the first keyword based on corresponding values in the keyword array. Meanwhile, when determining the word vector of the second keyword, the idea is the same as the above contents: traversing the keyword array: taking the word frequency of the second keyword as a corresponding value of the second keyword in the keyword array; setting the corresponding value of the first keyword in the keyword array to be 0; determining a word vector for the second keyword based on corresponding values in the keyword array. The word vectors of the keywords are determined according to the word frequencies of the keywords, so that the calculated similarity is more accurate, and the technical effect of accurately monitoring the short link state is achieved.
Based on the above example, the word vector of the first keyword is (14, 6, 5, 4, 0), and the word vector of the second keyword is (15, 5, 5, 0, 3).
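The traversal that produces these two vectors can be sketched as follows. The helper name `word_vector` is an assumption; the keyword array and frequencies are those of the example above.

```python
def word_vector(freqs: dict[str, int], keyword_array: list[str]) -> tuple[int, ...]:
    """Map a keyword/frequency set onto the shared keyword array."""
    # A keyword absent from this set (it appears only in the other set)
    # contributes 0 at its position in the array.
    return tuple(freqs.get(word, 0) for word in keyword_array)


keyword_array = ["fashion", "yearly goods", "members", "sports", "clothing"]
first = {"fashion": 14, "yearly goods": 6, "members": 5, "sports": 4}
second = {"fashion": 15, "yearly goods": 5, "members": 5, "clothing": 3}

print(word_vector(first, keyword_array))   # (14, 6, 5, 4, 0)
print(word_vector(second, keyword_array))  # (15, 5, 5, 0, 3)
```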
The similarity of the word vectors may be calculated using cosine similarity. Cosine similarity measures the similarity between two vectors by the cosine of the angle between them. The cosine of a 0-degree angle is 1, the cosine of any other angle is not greater than 1, and its minimum value is -1. The cosine of the angle between two vectors therefore indicates whether the two vectors point in approximately the same direction: the cosine similarity is 1 when the two vectors have the same direction, 0 when the angle between them is 90 degrees, and -1 when they point in completely opposite directions. The result depends only on the directions of the vectors, not on their lengths. Cosine similarity is commonly used in positive space, where it gives values between 0 and 1. The cosine of the angle between two vectors a and b can be obtained from the Euclidean dot product formula: $a \cdot b = \|a\| \, \|b\| \cos\theta$.
Given two attribute vectors A and B, the cosine similarity $\cos\theta$ is given by the dot product and the vector lengths, as follows:
$$\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$
where $A_i$ and $B_i$ are the components of the attribute vectors A and B, respectively, and n is the number of components.
Based on the above example, the cosine similarity of the word vector (14,6,5,4,0) of the first keyword and the word vector (15,5,5,0,3) of the second keyword is calculated to be 0.9517.
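The figure of 0.9517 can be checked with a short sketch of the cosine similarity calculation. The function name and the handling of an all-zero vector are assumptions; the text does not say what should happen when a page yields no keywords.

```python
import math


def cosine_similarity(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a vector with no keywords is treated as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


# Dot product 265, squared lengths 273 and 284.
print(round(cosine_similarity((14, 6, 5, 4, 0), (15, 5, 5, 0, 3)), 4))  # 0.9517
```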
Fig. 2 is a schematic diagram of a preferred flow of a method of monitoring short-link-based access according to an embodiment of the present invention. As shown in Fig. 2, the background scans all original links that have passed the audit at regular intervals (e.g., every 5 minutes). On each scan, the background starts a fixed number of threads that calculate concurrently, for example 50 threads. Each thread fetches a short link record by the ID of the record to be processed, obtains the corresponding original link from the record, captures the current content of the webpage corresponding to the original link, and performs Chinese word segmentation and word frequency calculation on that content to obtain the current keywords (second keyword) and their word frequencies. The thread then takes from the record the keywords (first keyword) and word frequencies stored when the original link was audited, and calculates the cosine similarity between them and the current keywords and word frequencies. It then judges whether the cosine similarity exceeds a set threshold (the threshold may be, but is not limited to, 0.3). If the cosine similarity is less than or equal to 0.3, the record state in the short link record is set to Status-2 (indicating that the cosine similarity between the current page content and the page content at audit time is too low). If the cosine similarity is greater than 0.3, the current webpage content is considered similar to the webpage content at audit time; the Comp_type of the record (the industry to which the enterprise belongs) is then read, and the sensitive word rules or sensitive words corresponding to that Comp_type are fetched from the sensitive word rule table.
The second keyword is then checked in turn against the sensitive word rules or sensitive words of the industry. If the second keyword contains a sensitive word, or fails a sensitive word rule, the record state in the short link record is set to Status-3 (indicating that the page contains sensitive words).
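The two-stage decision in this flow can be sketched as follows. The labels Status-2 and Status-3 are read here as integer codes 2 and 3, which is an assumption about how the record state is stored; the function name `next_status` and its parameters are likewise hypothetical. The sketch returns `None` when the record should stay access-allowed, since the text does not give the code for that state.

```python
from typing import Iterable, Optional

STATUS_LOW_SIMILARITY = 2  # "Status-2" in the text (integer encoding assumed)
STATUS_SENSITIVE_WORD = 3  # "Status-3" in the text (integer encoding assumed)


def next_status(
    similarity: float,
    second_keywords: Iterable[str],
    sensitive_words: set[str],
    threshold: float = 0.3,
) -> Optional[int]:
    """Return the new record state, or None if the record stays unchanged."""
    # Stage 1: the page has changed too much since it was audited.
    if similarity <= threshold:
        return STATUS_LOW_SIMILARITY
    # Stage 2: the page is still similar, so check the industry's sensitive words.
    if any(word in sensitive_words for word in second_keywords):
        return STATUS_SENSITIVE_WORD
    return None


print(next_status(0.25, ["fashion"], {"lottery"}))               # 2
print(next_status(0.9517, ["fashion", "lottery"], {"lottery"}))  # 3
print(next_status(0.9517, ["fashion"], {"lottery"}))             # None
```

Note that a similarity exactly equal to the threshold is treated as too low, matching the "less than or equal to 0.3" wording of the flow.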
In this flow, the background first looks up all records that have passed the audit, for example 200 in total. It then starts 50 threads and assigns each thread a processing number: the first thread is assigned 0, the second thread 1, and so on up to the 50th thread, which is assigned 49. Each thread then traverses all the records obtained by the main thread and processes a record when the record's ID modulo 50 equals the thread's processing number. For example, with 200 records, if the IDs of the 200 records are consecutive, each thread processes 4 records; if there are gaps in the IDs, some threads process fewer than 4 records and some process more than 4.
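The modulo-based assignment of records to threads can be sketched as follows. The use of `ThreadPoolExecutor`, the function names, and the `handle` callback are assumptions made for illustration; the text specifies only that each thread processes the records whose ID modulo 50 equals its processing number.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 50


def records_for_worker(record_ids, worker_no, num_workers=NUM_WORKERS):
    """Select the record IDs whose remainder matches this worker's number."""
    return [rid for rid in record_ids if rid % num_workers == worker_no]


def process_all(record_ids, handle, num_workers=NUM_WORKERS):
    """Run `handle` on every record, partitioned across worker threads."""
    def run(worker_no):
        # Each worker scans the full list and keeps only its own share.
        return [handle(rid) for rid in records_for_worker(record_ids, worker_no, num_workers)]

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves the order of worker numbers 0..num_workers-1.
        return list(pool.map(run, range(num_workers)))


ids = list(range(1, 201))  # 200 consecutive record IDs
counts = [len(batch) for batch in process_all(ids, handle=lambda rid: rid)]
print(set(counts))  # {4}
```

Because the partition depends only on the record ID, no two threads ever handle the same record, and no coordination between threads is needed.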
Fig. 3 is a schematic diagram of the main modules of a device for monitoring short link based access according to an embodiment of the present invention. As shown in fig. 3, an embodiment of the present invention provides an apparatus 300 for monitoring short link based access, including:
a reading module 301, configured to: reading an original link, a first keyword and a word frequency thereof in a short link record, wherein the record state of the short link record is allowed to be accessed;
a real-time capture module 302 to: capturing the current corresponding webpage content of the original link according to the original link, and determining a second keyword and word frequency thereof according to the current content;
a similarity determination module 303, configured to: determining the similarity of the first keyword and the second keyword based on the first keyword and the word frequency thereof and the second keyword and the word frequency thereof;
a monitoring module 304 to: if the similarity between the first keyword and the second keyword does not exceed a set threshold, updating the record state of the short link record to be access-prohibited; and the first keyword is determined by the corresponding webpage content when the original link is approved.
According to the embodiment of the invention, through the technical means of extracting the keywords and the word frequency thereof and calculating the similarity of the keywords before and after the verification is passed, the technical problems that the middle monitoring party only verifies the content in the original link when the server side applies for pushing and the pushed short link is not subsequently monitored after the verification is passed are solved, so that the technical effect of subsequently monitoring the pushed short link is achieved, the condition that the content corresponding to the original link is greatly modified after the verification is passed is effectively avoided, meanwhile, the user is prevented from accessing the illegal page, and the user benefit is maintained.
In this embodiment of the present invention, the monitoring module 304 is further configured to, if the similarity between the first keyword and the second keyword meets the set threshold: acquire the sensitive words corresponding to the category identifier in the short link record, and judge whether the second keyword includes any of the sensitive words; and, if the second keyword includes a sensitive word, update the record state of the short link record to access-prohibited. By judging whether the page contains sensitive words and promptly blocking users from accessing non-compliant pages, the embodiment of the present invention prevents user loss and thereby avoids serious consequences such as the information pushing channel being investigated and shut down because of a surge in complaints caused by that loss.
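A minimal sketch of the sensitive-word check is given below. The category identifiers, the word lists, and the function name `contains_sensitive_word` are all illustrative assumptions; the description only states that sensitive words are looked up by the category identifier stored in the short link record.

```python
from typing import Dict, Iterable, Set

# Hypothetical lookup table: category identifier -> sensitive words.
SENSITIVE_WORDS_BY_CATEGORY: Dict[str, Set[str]] = {
    "finance": {"guaranteed returns", "loan shark"},
    "retail": {"counterfeit"},
}


def contains_sensitive_word(category_id: str, second_keywords: Iterable[str]) -> bool:
    """Return True if any current keyword is a sensitive word for the category."""
    # Unknown categories are treated as having no sensitive words (assumption).
    sensitive = SENSITIVE_WORDS_BY_CATEGORY.get(category_id, set())
    return any(word in sensitive for word in second_keywords)
```

In the flow described above, a `True` result would lead to the record state being updated to access-prohibited even though the similarity check was passed.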
In this embodiment of the present invention, the similarity determining module 303 is further configured to: normalize the first keyword and the second keyword to obtain normalized keywords; form a keyword array from the normalized keywords; determine the word vector of the first keyword and the word vector of the second keyword based on the first keyword and its word frequency, the second keyword and its word frequency, and the keyword array; and determine the similarity between the first keyword and the second keyword from the two word vectors. Normalizing the keywords before determining the word vectors makes the similarity calculation more accurate, which in turn supports the follow-up monitoring of pushed short links.
In this embodiment of the present invention, the similarity determining module 303 is further configured to: take the sum of the word frequency of the first keyword and the word frequency of the second keyword as the word frequency of the normalized keyword; sort the normalized keywords by their word frequencies to obtain a keyword sequence; and form the keyword array from the normalized keywords based on the keyword sequence. Because the keyword array is determined from the word frequencies of the keywords, the similarity calculated from the resulting word vectors is more accurate, which achieves the technical effect of accurately monitoring short links.
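The construction of the keyword array can be sketched as below. The function name `build_keyword_array` is hypothetical, and two details are assumptions not stated in the text: the sort is taken to be descending by combined word frequency, and ties are broken alphabetically so that the result is deterministic.

```python
from typing import Dict, List


def build_keyword_array(first: Dict[str, int], second: Dict[str, int]) -> List[str]:
    """Merge both keyword sets and order them by combined word frequency."""
    merged: Dict[str, int] = {}
    for freqs in (first, second):
        for word, count in freqs.items():
            # A keyword present in both sets gets the sum of its two frequencies.
            merged[word] = merged.get(word, 0) + count
    # Descending frequency, then alphabetical order for ties (both assumptions).
    return sorted(merged, key=lambda w: (-merged[w], w))
```

For example, with first keywords `{"sale": 5, "phone": 2}` and second keywords `{"sale": 4, "casino": 6}`, the combined frequencies are 9, 6, and 2, so the array is `["sale", "casino", "phone"]`.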
In this embodiment of the present invention, the similarity determining module 303 is further configured to traverse the keyword array: for each position that holds a first keyword, take the word frequency of that first keyword as the corresponding value; for each position that holds a second keyword, set the corresponding value to 0; and determine the word vector of the first keyword from the corresponding values in the keyword array. Because the word vectors are determined from the word frequencies of the keywords, the calculated similarity is more accurate, which achieves the technical effect of accurately monitoring the state of short links.
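The traversal can be sketched as follows. The helper names are hypothetical, and the use of cosine similarity is an assumption: this passage says only that the similarity is determined from the two word vectors and does not name a measure.

```python
import math
from typing import Dict, List, Sequence


def word_vector(keyword_array: List[str], freqs: Dict[str, int]) -> List[int]:
    """Walk the keyword array; use the keyword's own frequency, or 0 if absent."""
    return [freqs.get(word, 0) for word in keyword_array]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # An all-zero vector has no direction; report 0.0 rather than divide by zero.
    return dot / norm if norm else 0.0
```

With the keyword array `["sale", "casino", "phone"]`, first keywords `{"sale": 5, "phone": 2}` give the vector `[5, 0, 2]` and second keywords `{"sale": 4, "casino": 6}` give `[4, 6, 0]`. The same function builds both vectors because the rule is symmetric: a position is 0 whenever the keyword belongs only to the other set.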
Fig. 4 is a schematic diagram of a short link pushing system based on the method for monitoring short links according to an embodiment of the present invention. As shown in Fig. 4, an enterprise submits an original link to be accessed, and the system records the original link in the Src_link field of the short link record table. The short message virtual operator audits the content of the webpage corresponding to the original link and, after confirming that the webpage has no problems, clicks to confirm the audit. The system then captures the audited content of the original link, performs Chinese word segmentation and word frequency calculation on the captured content, forms an array from the 30 words with the highest word frequency together with their word frequencies, packages the array as JSON, and records it in the Key_list field of the short link record. The system sets Status in the short link record to 1 (indicating that the audit is passed), generates the short link, notifies the enterprise that its original link has been approved, and pushes the short link to users. Once the short link record is complete, it is stored in the short link database. The system monitors the state of the short link in real time or at scheduled intervals based on the method for monitoring short links, and updates the record state in the short link record accordingly. When Status is 0, the link has not been audited (or the audit was not passed); when Status is 1, the record state is access-allowed (i.e., the user can access the pushed content via the short link); when Status is 2 or 3, the record state is access-prohibited (i.e., the user is redirected directly to a page indicating that access is not allowed). The JSON packaging may instead use serialized packaging or another encryption algorithm. The Key_list field is mainly designed to record something like a snapshot of the whole page.
If a sub-table is designed instead, for example a word frequency sub-table, the keywords can be split into multiple records whose parent id is the Src_link.
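How the Key_list field might be filled at approval time is sketched below. The sketch assumes the page text has already been segmented into a list of tokens (Chinese word segmentation is outside its scope), and the JSON layout with `word` and `freq` fields, as well as the function names, are illustrative assumptions rather than the format used by the described system.

```python
import json
from collections import Counter
from typing import Dict, List

TOP_N = 30  # the description keeps the 30 words with the highest word frequency


def pack_key_list(tokens: List[str], top_n: int = TOP_N) -> str:
    """Count tokens, keep the top_n most frequent, and package them as JSON."""
    top = Counter(tokens).most_common(top_n)
    # ensure_ascii=False keeps Chinese keywords readable in the stored string.
    return json.dumps([{"word": w, "freq": f} for w, f in top], ensure_ascii=False)


def unpack_key_list(packed: str) -> Dict[str, int]:
    """Restore the keyword -> word frequency mapping from the stored JSON."""
    return {item["word"]: item["freq"] for item in json.loads(packed)}
```

Storing the packed string in a single field keeps each short link record self-contained; the sub-table alternative mentioned above would instead store one row per keyword.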
Fig. 5 illustrates an exemplary system architecture 500 for a method of monitoring short link based access or a device for monitoring short link based access to which embodiments of the present invention may be applied.
As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505. The network 504 serves to provide a medium for communication links between the terminal devices 501, 502, 503 and the server 505. Network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 501, 502, 503 to interact with a server 505 over a network 504 to receive or send messages or the like. The terminal devices 501, 502, 503 may have various communication client applications installed thereon, such as a shopping application, a web browser application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 501, 502, 503 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 505 may be a server that provides various services, such as a background management server that supports shopping websites browsed by users using the terminal devices 501, 502, 503. The background management server can analyze and process the received data such as the product information inquiry request and feed back the processing result to the terminal equipment.
It should be noted that the method for monitoring access based on short link provided by the embodiment of the present invention is generally performed by the server 505, and accordingly, a device for monitoring access based on short link is generally disposed in the server 505.
It should be understood that the number of terminal devices, networks, and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The invention also provides an electronic device and a computer readable medium according to the embodiment of the invention.
The electronic device of the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for monitoring short link based access according to the embodiment of the present invention.
The computer readable medium of the present invention has stored thereon a computer program which, when executed by a processor, implements a method of monitoring short link based access as provided by an embodiment of the present invention.
Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use with a terminal device implementing an embodiment of the invention is shown. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM603, various programs and data necessary for the operation of the system 600 are also stored. The CPU601, ROM602, and RAM603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read from it can be installed in the storage section 608 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may, for example, be described as: a processor comprising a reading module, a real-time capture module, a similarity determination module, and a monitoring module. The names of these modules do not in some cases limit the modules themselves; for example, the reading module may also be described as a "module for reading short link records".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the device described in the above embodiments, or may exist separately without being incorporated into the device. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to: read an original link, a first keyword, and the word frequency of the first keyword from a short link record whose record state is access-allowed; capture the webpage content currently corresponding to the original link, and determine a second keyword and its word frequency from that current content; determine the similarity between the first keyword and the second keyword based on the first keyword and its word frequency and the second keyword and its word frequency; and, if the similarity between the first keyword and the second keyword does not exceed a set threshold, update the record state of the short link record to access-prohibited.
By extracting keywords and their word frequencies and calculating the similarity between the keywords obtained before and after the audit is passed, the method for monitoring short link based access solves the technical problem that the intermediate monitoring party audits the content of the original link only when the server side applies for pushing and does not monitor the pushed short link after the audit is passed. The pushed short link is thereby monitored on an ongoing basis, which effectively guards against the content corresponding to the original link being substantially modified after the audit is passed, prevents users from accessing non-compliant pages, and protects user interests.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method of monitoring short link based access, comprising:
reading an original link, a first keyword and a word frequency thereof in a short link record, wherein the record state of the short link record is allowed to be accessed;
capturing the current corresponding webpage content of the original link according to the original link, and determining a second keyword and word frequency thereof according to the current content;
determining the similarity of the first keyword and the second keyword based on the first keyword and the word frequency thereof and the second keyword and the word frequency thereof;
if the similarity between the first keyword and the second keyword does not exceed a set threshold, updating the record state of the short link record to be access-prohibited;
and the first keyword is determined by the corresponding webpage content when the original link is approved.
2. The method of claim 1, further comprising, if the similarity between the first keyword and the second keyword satisfies a predetermined threshold:
acquiring a sensitive word corresponding to the category identification according to the category identification in the short link record, and judging whether the second keyword comprises the sensitive word;
and if the second keyword comprises the sensitive word, updating the record state of the short link record to be access prohibition.
3. The method of claim 1, wherein determining the similarity between the first keyword and the second keyword based on the first keyword and the word frequency thereof and the second keyword and the word frequency thereof comprises:
carrying out normalization processing on the first keyword and the second keyword to obtain a keyword subjected to normalization processing;
forming a keyword array according to the keywords after the normalization processing;
determining word vectors of the first keywords and word vectors of the second keywords based on the first keywords and word frequencies thereof, the second keywords and word frequencies thereof and the keyword array;
and determining the similarity between the first keyword and the second keyword according to the word vector of the first keyword and the word vector of the second keyword.
4. The method of claim 3, wherein forming a keyword array from the normalized keywords comprises:
taking the sum of the word frequency of the first keyword and the word frequency of the second keyword as the word frequency of the keyword after the normalization processing;
sorting the keywords after the normalization processing according to the word frequency of the keywords after the normalization processing to obtain a keyword sequence;
and forming a keyword array by the keywords after the normalization processing based on the keyword sequence.
5. The method of claim 3, wherein determining a word vector for the first keyword based on the first keyword and its word frequency and the keyword array comprises:
traversing the keyword array: taking the word frequency of the first keyword as a corresponding value of the first keyword in the keyword array; setting the corresponding value of the second keyword in the keyword array to be 0;
determining a word vector for the first keyword based on corresponding values in the keyword array.
6. An apparatus for monitoring short link based access, comprising:
a reading module to: reading an original link, a first keyword and a word frequency thereof in a short link record, wherein the record state of the short link record is allowed to be accessed;
a real-time capture module for: capturing the current corresponding webpage content of the original link according to the original link, and determining a second keyword and word frequency thereof according to the current content;
a similarity determination module to: determining the similarity of the first keyword and the second keyword based on the first keyword and the word frequency thereof and the second keyword and the word frequency thereof;
a monitoring module to: if the similarity between the first keyword and the second keyword does not exceed a set threshold, updating the record state of the short link record to be access-prohibited;
and the first keyword is determined by the corresponding webpage content when the original link is approved.
7. The apparatus of claim 6, wherein the monitoring module is further configured to, if the similarity between the first keyword and the second keyword satisfies a predetermined threshold:
acquiring a sensitive word corresponding to the category identification according to the category identification in the short link record, and judging whether the second keyword comprises the sensitive word;
and if the second keyword comprises the sensitive word, updating the record state of the short link record to be access prohibition.
8. The apparatus of claim 6, wherein the similarity determination module is further configured to:
carrying out normalization processing on the first keyword and the second keyword to obtain a keyword subjected to normalization processing;
forming a keyword array according to the keywords after the normalization processing;
determining word vectors of the first keywords and word vectors of the second keywords based on the first keywords and word frequencies thereof, the second keywords and word frequencies thereof and the keyword array;
and determining the similarity between the first keyword and the second keyword according to the word vector of the first keyword and the word vector of the second keyword.
9. The apparatus of claim 8, wherein the similarity determination module is further configured to:
taking the sum of the word frequency of the first keyword and the word frequency of the second keyword as the word frequency of the keyword after the normalization processing;
sorting the keywords after the normalization processing according to the word frequency of the keywords after the normalization processing to obtain a keyword sequence;
and forming a keyword array by the keywords after the normalization processing based on the keyword sequence.
10. The apparatus of claim 8, wherein the similarity determination module is further configured to:
traversing the keyword array: taking the word frequency of the first keyword as a corresponding value of the first keyword in the keyword array; setting the corresponding value of the second keyword in the keyword array to be 0;
determining a word vector for the first keyword based on corresponding values in the keyword array.
11. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
12. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN201910414745.5A 2019-05-17 2019-05-17 Method and device for monitoring access based on short link Pending CN111782986A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910414745.5A CN111782986A (en) 2019-05-17 2019-05-17 Method and device for monitoring access based on short link

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910414745.5A CN111782986A (en) 2019-05-17 2019-05-17 Method and device for monitoring access based on short link

Publications (1)

Publication Number Publication Date
CN111782986A true CN111782986A (en) 2020-10-16

Family

ID=72755405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910414745.5A Pending CN111782986A (en) 2019-05-17 2019-05-17 Method and device for monitoring access based on short link

Country Status (1)

Country Link
CN (1) CN111782986A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678602A (en) * 2013-12-16 2014-03-26 昆明理工大学 Webpage filtration method with sensitivity calculation function
CN105959330A (en) * 2016-07-20 2016-09-21 广东世纪网通信设备股份有限公司 False link interception method, device and system
CN106027633A (en) * 2016-05-16 2016-10-12 百度在线网络技术(北京)有限公司 Application push method, application push system and terminal device
CN107092826A (en) * 2017-03-24 2017-08-25 北京国舜科技股份有限公司 Web page contents real-time safety monitoring method
CN107229939A (en) * 2016-03-24 2017-10-03 北大方正集团有限公司 The decision method and device of similar document
CN107423444A (en) * 2017-08-10 2017-12-01 世纪龙信息网络有限责任公司 Hot word phrase extracting method and system
CN107733972A (en) * 2017-08-28 2018-02-23 阿里巴巴集团控股有限公司 A kind of short linking analytic method, device and equipment
CN107943954A (en) * 2017-11-24 2018-04-20 杭州安恒信息技术有限公司 Detection method, device and the electronic equipment of webpage sensitive information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination