CN113962218A - Illegal application identification method, device and equipment and readable storage medium - Google Patents

Publication number
CN113962218A
Authority
CN
China
Prior art keywords
illegal
website
application
detection
violation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110967165.6A
Other languages
Chinese (zh)
Inventor
胡冰
范渊
杨勃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DBAPPSecurity Co Ltd
Original Assignee
DBAPPSecurity Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DBAPPSecurity Co Ltd filed Critical DBAPPSecurity Co Ltd
Priority to CN202110967165.6A priority Critical patent/CN113962218A/en
Publication of CN113962218A publication Critical patent/CN113962218A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558 Details of hyperlinks; Management of linked annotations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management

Abstract

The invention discloses an illegal application identification method that moves the identification of illegal applications from the application side to the download website. Websites are screened broadly for violation information in a target field; a website containing such information is determined to be an illegal website, the application download links on that website are treated as illegal download links, and the applications corresponding to those links are treated as illegal applications. Because the method traces violation information in the target field back to its source, it ensures detection accuracy and helps contain the propagation of illegal applications at the source. In addition, broad screening across the whole network allows more illegal APPs to be identified, and identified more accurately, so that comprehensive detection of illegal APPs is achieved. The invention also discloses an illegal application identification device, equipment, and a readable storage medium, which have corresponding technical effects.

Description

Illegal application identification method, device and equipment and readable storage medium
Technical Field
The invention relates to the technical field of security assurance, in particular to a method, a device, equipment and a readable storage medium for identifying illegal application.
Background
An illegal application is a mobile application that disseminates illegal or non-compliant information such as obscenity, pornography, and gore, or that supports illegal or non-compliant services such as fraud, gambling, recruitment, and the like. Operating illegal applications is cheap, highly profitable, and quick to pay off, and it causes severe economic losses to individuals, enterprises, and society, so identifying illegal applications is very important.
Current detection of illegal applications depends on the capability of the web crawler; the detection rate may be low, and the identification accuracy for illegal applications is also low.
In summary, how to improve the identification accuracy for illegal applications is a technical problem that those skilled in the art urgently need to solve.
Disclosure of Invention
The invention aims to provide a method, a device and equipment for identifying an illegal application and a readable storage medium, which can improve the identification accuracy of the illegal application. In order to solve the technical problems, the invention provides the following technical scheme:
a method of violation application identification, comprising:
acquiring a plurality of accessible websites as detection objects;
extracting information corresponding to a preset field in the detection object, and storing the extracted information into a website information table; wherein the preset field includes: an application download link and website content;
extracting the page content from the website information table, and carrying out illegal keyword detection on the page content to generate a keyword detection result;
determining a detection object with violation information of a target field according to the keyword detection result, and using the detection object as a violation website;
and extracting an application download link corresponding to the illegal website from the website information table to be used as the illegal download link, and using the application corresponding to the illegal download link as the illegal application.
Optionally, the extracting information corresponding to a preset field in the detection object, and storing the extracted information in a website information table includes:
extracting a website link of the detection object;
accessing the detection object and capturing page content of the detection object;
determining a domain name corresponding to the website link, and analyzing an IP address corresponding to the domain name;
acquiring IP attribution information corresponding to the IP address;
extracting an application download link in the html document corresponding to the detection object;
and taking the website link, the page content, the domain name, the IP address and the IP attribution information as the website content, and storing the website content and the application download link into a website information table.
Optionally, extracting the page content from the website information table, and performing illegal keyword detection on the page content, including:
extracting a page title and a tag in the page content;
carrying out illegal keyword identification on the page title and the label to generate a first identification result;
extracting hyperlinks in the page content;
carrying out illegal keyword recognition on the hyperlink to generate a second recognition result;
extracting text content in the page content; the text content is the page content with the tags removed;
carrying out illegal keyword recognition on the text content to generate a third recognition result;
and generating the keyword detection result according to the first recognition result, the second recognition result and the third recognition result.
Optionally, the identifying the violation keywords for the page title and the tag includes:
performing Chinese Unicode translation (escape-sequence) recognition on the title and the tag to generate a translation recognition result;
performing word segmentation processing on the title and the tag to obtain content word segments;
performing keyword filtering and identification on the content word segments to generate a content recognition result;
and generating the first recognition result according to the translation recognition result and the content recognition result.
Optionally, the performing illegal keyword recognition on the hyperlink includes:
extracting uniform resource locators and anchor texts in the hyperlinks;
performing word segmentation processing on the anchor text to obtain anchor text word segmentation;
performing keyword filtering recognition on the anchor text word segments to generate an anchor text recognition result;
determining an IP home according to the uniform resource locator;
judging whether the IP home location belongs to a violation high-risk area or not, and generating a home location identification result;
and generating the second recognition result according to the anchor text recognition result and the attribution recognition result.
Optionally, before the detecting the violating keywords of the page content, the method further includes:
searching whether a mobile phone page adaptation label exists in the page content, and generating an adaptation detection result;
correspondingly, the determining, according to the keyword detection result, a detection object with violation information of the target field includes:
and determining a detection object with violation information of the target field according to the keyword detection result and the adaptation detection result.
Optionally, before the extracting information corresponding to the preset field in the detection object, the method further includes:
screening out the detection object corresponding to the effective application download link in the website information table as a target object;
correspondingly, the extracting information corresponding to the preset field in the detection object includes: and extracting information corresponding to preset fields in the target object.
An illegal application identification device comprising:
an object acquisition unit configured to acquire a plurality of accessible websites as detection objects;
the information extraction unit is used for extracting information corresponding to a preset field in the detection object and storing the extracted information into a website information table; wherein the preset field includes: an application download link and website content;
the page detection unit is used for extracting the page content from the website information table, carrying out illegal keyword detection on the page content and generating a keyword detection result;
the illegal object determining unit is used for determining a detection object with illegal information of the target field according to the keyword detection result, and the detection object is used as an illegal website;
and the illegal application determining unit is used for extracting the application downloading link corresponding to the illegal website from the website information table to be used as the illegal downloading link, and using the application corresponding to the illegal downloading link as the illegal application.
A computer device, comprising:
a memory for storing a computer program;
and the processor is used for realizing the steps of the illegal application identification method when the computer program is executed.
A readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the above-mentioned illegal application identification method.
According to the method provided by the embodiment of the invention, the identification of illegal applications is moved from the application side to the download website. Websites are screened broadly for violation information in the target field; a website containing such information is determined to be an illegal website, the application download links on the illegal website are used as illegal download links, and the applications corresponding to the illegal download links are used as illegal applications. Meanwhile, broad screening across the whole network allows more illegal APPs to be identified, and identified more accurately, so that comprehensive detection of illegal APPs is achieved.
Correspondingly, the embodiment of the invention also provides a violation application identification device, equipment and a readable storage medium corresponding to the violation application identification method, which have the technical effects and are not described herein again.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the related art, the drawings used in the description of the embodiments or the related art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without any creative work.
FIG. 1 is a flowchart illustrating an exemplary method for identifying an illegal application according to an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating information extraction according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an illegal application identification device according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The core of the invention is to provide the illegal application identification method, which can improve the identification accuracy of the illegal application.
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the scope of protection of the present invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for identifying an illegal application according to an embodiment of the present invention, where the method includes the following steps:
S101, acquiring a plurality of accessible websites as detection objects;
A batch of accessible websites is obtained as detection objects, and the identification of illegal applications in this embodiment is carried out on these detection objects. The detection objects may be acquired at random, for example by randomly collecting a batch of accessible websites on the internet, or they may be designated, for example by specifying the websites to be detected. This is not limited in this embodiment and may be set according to actual use requirements.
Legitimate application download stores generally do not list illegal applications, whereas illegal applications are frequently found on illegal websites, so more illegal applications can be found through this embodiment.
This step only requires that a detection object be accessible; other conditions are not limited in this embodiment and may be adjusted according to actual identification needs.
To facilitate later retrieval of information about a detection object, the website link corresponding to the detection object can be used to generate a crawler seed, which is stored in a seed library; the corresponding information can then be called directly from the seed library during subsequent information extraction and identification of the detection object.
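As a rough illustration of the seed library idea, the following is a minimal in-memory sketch. The class name `SeedLibrary`, the status strings, and the URL normalization rule are assumptions made for this example and are not specified by the embodiment, which leaves the storage form open.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class SeedLibrary:
    """Hypothetical in-memory seed library: normalized URL -> crawl status."""
    seeds: dict = field(default_factory=dict)

    @staticmethod
    def normalize(url: str) -> str:
        # Assumed normalization: lower-case scheme and host, drop the fragment,
        # and treat an empty path as "/" so trivial variants de-duplicate.
        parts = urlsplit(url.strip())
        path = parts.path or "/"
        query = f"?{parts.query}" if parts.query else ""
        return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"

    def add(self, url: str) -> bool:
        """Store a new seed as 'not_crawled'; return False for duplicates."""
        key = self.normalize(url)
        if key in self.seeds:
            return False
        self.seeds[key] = "not_crawled"
        return True

    def next_uncrawled(self):
        """Return one seed that has not been crawled yet, or None."""
        for url, status in self.seeds.items():
            if status == "not_crawled":
                return url
        return None

    def mark_crawled(self, url: str) -> None:
        self.seeds[self.normalize(url)] = "crawled"
```

A crawler loop would repeatedly call `next_uncrawled()`, fetch the page, and then call `mark_crawled()`; a production system would more likely keep this state in a database table.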
S102, extracting information corresponding to a preset field in a detection object, and storing the extracted information into a website information table;
Information is extracted from each detection object. The extracted information specifically includes an application download link and website content. The application download link field is extracted in order to judge whether the detection object contains a link for downloading an application. The website content refers to the information displayed on and carried by the website; the specific information types are not limited and may include, for example, a title, a tag, a number, an IP address, a domain name, a home location, and the like.
The extracted website content field is used to judge whether the detection object points to illegal operations or information, for example whether it contains lottery or pornographic information or may be involved in an illegal lottery. In this method, illegal applications are identified through the website: it is judged whether a website is an illegal website, and if so, the applications behind the application download links on that website are judged to be illegal applications. Because this identification method moves the identification process to the download front end, it can significantly improve the identification accuracy for illegal applications and restrain their propagation at the source.
The extracted information may further extract other types of information besides the application download link and the website content, which is not limited in this embodiment and may be set according to the actual detection requirement.
The information extracted for the detection object is stored in the website information table in the embodiment, and the related information is called from the website information table for viewing.
To deepen understanding, the present embodiment introduces an information extraction method for a detection object, as shown in fig. 2, specifically including the following steps:
(1) extracting a website link of a detection object;
A plurality of website links are acquired as detection objects based on whole-network big data. To avoid repeated detection, a crawl status is set for each detection object and initialized to not crawled; the form of the crawl status is not limited here.
(2) Accessing the detection object, and capturing the page content of the detection object;
A web crawler is called to fetch, in real time, the detection objects that have not yet been crawled, access them, and capture their pages to obtain the page content.
(3) Determining a domain name corresponding to the website link, and analyzing an IP address corresponding to the domain name;
A domain field (domain name) is derived from the url field (no domain field is derived for IP-type links), and the IP of a domain-name website is resolved through domain name resolution.
(4) Acquiring IP attribution information corresponding to the IP address;
The attribution information of the IP (such as country and province) is resolved through an IP attribution library; for IP-type website assets the attribution is resolved directly from the IP. The resolved data are written to the ip and geo fields of the website asset table.
(5) Extracting an application download link in the html document corresponding to the detection object;
The application download links in the detection object are extracted. The specific extraction manner is not limited in this embodiment; for example, links in the html can be parsed using html tags and a regular expression (https:// [ a-zA-Z0-9.//. This embodiment describes only the above extraction manner as an example; other implementations can refer to the description of this embodiment and are not described again here.
(6) And taking the website link, the page content, the domain name, the IP address and the IP attribution information as website content, and storing the website content and the application download link into a website information table.
To increase the number of detection objects and achieve broad website coverage, page links may additionally be extracted before step (3) using the page's hyperlink tags and regular expressions. The extracted page links are used as seeds, de-duplicated, and stored in the database as new detection objects, which expands the set of detection objects.
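Steps (3) to (6) above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `.apk`/`.ipa` suffix rule for recognizing download links, the record field names, and the injected `resolve` and `locate` callables (standing in for DNS resolution and the IP attribution library) are all hypothetical and not taken from the embodiment.

```python
import ipaddress
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumption: a link is treated as an application download link when its path
# ends with a common mobile package suffix. The patent does not fix this rule.
APP_SUFFIXES = (".apk", ".ipa")


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in an html document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def split_host(url):
    """Return (domain, ip); IP-type links get no domain field."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return None, host
    except ValueError:
        return host, None


def build_record(url, html, resolve, locate):
    """Build one row of the website information table.

    resolve(domain) -> ip and locate(ip) -> attribution are injected so the
    sketch runs without network access or a real IP attribution library.
    """
    domain, ip = split_host(url)
    if ip is None and domain:
        ip = resolve(domain)
    collector = LinkCollector()
    collector.feed(html)
    downloads = [h for h in collector.hrefs
                 if urlsplit(h).path.lower().endswith(APP_SUFFIXES)]
    return {"url": url, "content": html, "domain": domain, "ip": ip,
            "geo": locate(ip) if ip else None, "download_links": downloads}
```

In practice `resolve` could wrap `socket.gethostbyname`, and `locate` would query whatever IP attribution database is available.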
S103, extracting page content from the website information table, and carrying out illegal keyword detection on the page content to generate a detection result;
In this embodiment, violation information in the target field is identified from the page content, focusing mainly on violation information in fields such as obscenity and pornography, fraud and gambling, recruitment, and the like. Specifically, illegal keyword detection and analysis is performed on the page content. A keyword library is established that contains illegal keywords of categories such as lottery, pornography, fraud, riot, and the like (the keywords can be found through internet search engines and then manually verified and collected, without limitation). Illegal keywords are matched against the keyword library, and violation detection of the page content is carried out according to results such as the number of matched illegal keywords. The specific keyword detection means may be set according to the application field and use requirements, which is not limited in this embodiment.
Because different keywords differ in how reliably they indicate violation information in the target field, a credibility level field may be attached to each illegal keyword in the keyword library to ensure detection accuracy. The credibility ranges from 1 to 5, with 5 being the highest. After an illegal keyword is matched, its credibility level is determined, and the detection result is generated according to the credibility level, the number of keywords, and the like, which is not limited in this embodiment.
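A keyword library with a credibility level per keyword might be represented as below. This is a minimal sketch: the placeholder keywords, the dictionary layout, and the use of plain substring counting in place of proper word segmentation are assumptions made only for illustration.

```python
# Hypothetical keyword library: keyword -> (category, credibility level 1-5).
# The entries are placeholders, not keywords from the patent.
KEYWORD_LIBRARY = {
    "example-lottery-term": ("lottery", 5),
    "example-fraud-term": ("fraud", 3),
}


def match_keywords(text, library=KEYWORD_LIBRARY):
    """Return the library keywords found in text with their hit counts.

    Matching is a case-insensitive substring count, a simplification of the
    segmentation-based matching described in the embodiment.
    """
    hits = {}
    lowered = text.lower()
    for keyword, (category, credibility) in library.items():
        count = lowered.count(keyword.lower())
        if count:
            hits[keyword] = {
                "category": category,
                "credibility": credibility,
                "hit_count": count,
            }
    return hits
```

The returned records carry both the number of hits and the credibility level, which are the two inputs the embodiment names for generating the detection result.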
S104, determining a detection object with violation information of the target field according to the detection result, and using the detection object as a violation website;
The detection objects containing violation information of the target field are determined according to the detection result. Since the form of the detection result is not limited in this embodiment, the manner of making this determination from the detection result is not limited either. The illegal websites are determined and screened out according to the detection result, so that illegal applications can subsequently be determined from the illegal websites.
And S105, extracting an application download link corresponding to the illegal website from the website information table to be used as the illegal download link, and using the application corresponding to the illegal download link as the illegal application.
If a website is a lottery or pornographic website, the applications behind the application download links on that website are very likely to be illegal applications. This method first identifies the illegal website; compared with identifying the application directly, identifying through the website makes more comprehensive and richer information available, and treating the applications behind the download links on an illegal website as illegal applications ensures identification accuracy.
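Steps S104 and S105 can be pictured as a simple filter over the website information table. The sketch below is illustrative only: the row layout, the numeric score per website, and the threshold value of 50 are assumptions, since the embodiment does not fix the form of the detection result.

```python
def illegal_download_links(website_table, detection_scores, threshold=50):
    """Map each illegal website to its application download links.

    website_table: list of rows with "url" and "download_links" keys.
    detection_scores: url -> numeric detection result (assumed form).
    threshold: hypothetical cut-off above which a website counts as illegal.
    """
    illegal_sites = {url for url, score in detection_scores.items()
                     if score >= threshold}
    return {row["url"]: row["download_links"]
            for row in website_table
            if row["url"] in illegal_sites and row["download_links"]}
```

Websites that are flagged but expose no download link are dropped, because there is no application to attribute to them.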
Based on the above, the technical scheme provided by the embodiment of the invention moves the identification of illegal applications from the application side to the download website. Websites are screened broadly for violation information in the target field; a website containing such information is determined to be an illegal website, the application download links on the illegal website are used as illegal download links, and the applications corresponding to the illegal download links are used as illegal applications. In this way the violation information in the target field is traced back to its source, detection accuracy is ensured, and the propagation of illegal applications can be restrained at the source. Meanwhile, broad screening across the whole network allows more illegal APPs to be identified, and identified more accurately, so that comprehensive detection of illegal APPs is achieved.
It should be noted that, based on the above embodiments, the embodiments of the present invention also provide corresponding improvements. The same steps as those in the above-mentioned embodiments or corresponding steps can be referred to each other in the preferred/modified embodiments, and the corresponding advantageous effects can be referred to each other, and are not described in detail in the preferred/modified embodiments herein.
The detection method for the page content illegal keyword is not limited in the above embodiments, a detection means is provided in this embodiment, and other applications of the detection means based on the above embodiments can refer to the description of this embodiment, and are not described herein again.
Illegal keyword detection on the page content can be carried out from three aspects: the page title and tags, the hyperlinks, and the text content. This embodiment takes only these three aspects as an example and, for ease of understanding, describes them in terms of scores; all other implementations that generate the final detection result from the detection results of the individual items can refer to the description of this embodiment.
Specifically, the process of detecting the violation keywords of the page content may include the following steps:
(1) extracting a page title and a tag in page content;
(2) carrying out illegal keyword identification on the page title and the label to generate a first identification result;
First, the title and meta tags (h1) of the html are checked to identify whether illegal keywords of the title and tag category are present. The types of illegal keywords in this category are not limited in this embodiment; optionally, they may include Chinese Unicode translation (escape sequences) and illegal content keywords. Accordingly, the process of identifying illegal keywords in the page title and tags may specifically include the following steps:
(2.1) performing Chinese Unicode translation recognition on the title and the tag to generate a translation recognition result;
If Chinese Unicode translation is present, the credibility of the page is lowered. When Chinese Unicode translation is detected, a corresponding score may be added to the result score (for example type_confidence, the confidence that an application is an illegal application); the score may be set according to the importance levels of different detection objects, which is not limited here. For example, when Chinese Unicode translation is detected in the title and the meta, type_confidence += 15 is taken as the translation recognition result.
(2.2) performing word segmentation processing on the title and the tag to obtain content word segments;
The content t1 in the title and meta is segmented into words; the segmentation process can refer to the description of the related art and is not limited here.
(2.3) performing keyword filtering and identification on the content word segments to generate a content recognition result;
The content word segments are filtered against the keywords, and when a keyword is hit, type_confidence += 15 is taken as the content recognition result.
(2.4) generating a first recognition result according to the translation recognition result and the content recognition result.
The accumulated value of type_confidence from the translation recognition result and the content recognition result is taken as the first recognition result.
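The title and tag check might look roughly like the sketch below. Several points are assumptions rather than the patent's method: the escape forms recognized (`\uXXXX`, `&#xXXXX;`, `&#NNNNN;`), the CJK code point range used to decide that an escape encodes Chinese text, treating a hit in either the title or the meta as sufficient, and the regex tokenizer that stands in for a real Chinese word segmenter.

```python
import re

# Assumed escape forms: \uXXXX, &#xXXXX; and &#NNNNN;
ESCAPE_RE = re.compile(r"\\u([0-9a-fA-F]{4})|&#x([0-9a-fA-F]{4,5});|&#(\d{4,6});")


def has_chinese_escape(text):
    """True if text contains an escape sequence for a CJK unified ideograph."""
    for match in ESCAPE_RE.finditer(text):
        hex_u, hex_entity, dec_entity = match.groups()
        if hex_u or hex_entity:
            code = int(hex_u or hex_entity, 16)
        else:
            code = int(dec_entity)
        if 0x4E00 <= code <= 0x9FFF:
            return True
    return False


def first_recognition(title, meta, keywords):
    """Accumulate type_confidence for the title/meta check (+15 per item)."""
    type_confidence = 0
    if has_chinese_escape(title) or has_chinese_escape(meta):
        type_confidence += 15  # translation recognition result
    # Stand-in tokenizer; a Chinese word segmenter would be used in practice.
    tokens = re.findall(r"\w+", f"{title} {meta}".lower())
    if any(token in keywords for token in tokens):
        type_confidence += 15  # content recognition result
    return type_confidence
```

The two increments of 15 follow the example values given in the text; the keyword set passed in is whatever illegal keyword library the deployment maintains.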
(3) Extracting hyperlinks in the page content;
All hyperlink tags in the page content are collected as L1. The hyperlinks may be extracted according to characteristics such as the hyperlink format, which is not limited in this embodiment.
(4) Carrying out illegal keyword identification on the hyperlink to generate a second identification result;
Illegal keyword identification on the hyperlinks includes the following steps:
(4.1) extracting uniform resource locators and anchor texts in the hyperlinks;
The tags in L1 are processed one by one to obtain the link url (uniform resource locator) and the anchor text t2 of each hyperlink.
(4.2) performing word segmentation processing on the anchor text to obtain anchor text word segments;
(4.3) filtering and identifying keywords of the anchor text word segments to generate an anchor text identification result;
The anchor text is segmented into words and then checked against the keywords; the hyperlink tags that hit a keyword form the set T1.
(4.4) determining an IP home according to the uniform resource locator;
The url in each hyperlink tag is analyzed. A url falls into one of two types: the domain name type and the IP type. For the domain name type, the domain name is first resolved to obtain an IP address, and the IP is then resolved to obtain the home location; for the IP type, the home location can be resolved directly from the IP.
(4.5) judging whether the IP home location belongs to the violation high-risk area or not, and generating a home location identification result;
The hyperlink tags whose home location belongs to a high-risk area for illegal information are screened out as the set T2.
And (4.6) generating a second recognition result according to the anchor text recognition result and the attribution recognition result.
The implementation manner of generating the second recognition result according to the anchor text recognition result and the attribution recognition result is not limited in this embodiment, and the judgment may be performed according to whether there is a hit object in each type of detection, or according to several lengths of hits of each type. For a better understanding, the latter is taken as an example here, such as obtaining the intersection T3 of T1 and T2, i.e. the hit keyword and belonging to the violation high-risk area, and adding the corresponding score to the result score (type _ confidence) according to the result of each set length, a score assignment criterion is as follows:
when T3.size > 5, T1.size >= 8, T2.size >= 10: type_confidence += 30;
when T3.size > 3, T1.size >= 5, T2.size >= 7: type_confidence += 15;
when T3.size > 3, T1.size >= 3, T2.size >= 5: type_confidence += 5.
It should be noted that this embodiment describes only the above scoring criterion as an example; other scoring criteria or non-scoring evaluation methods may refer to the description of this embodiment and are not repeated here.
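The example criterion for the hyperlink check can be sketched in Python as below. The sketch reads the three tiers as T3.size > 5 / T1.size >= 8 / T2.size >= 10 for 30 points, T3.size > 3 / T1.size >= 5 / T2.size >= 7 for 15 points, and T3.size > 3 / T1.size >= 3 / T2.size >= 5 for 5 points. It also assumes that only the first matching tier contributes, which the text does not state explicitly; `hyperlink_score` is an illustrative name.

```python
def hyperlink_score(t1, t2):
    """Score contribution of the hyperlink check.

    t1: set of links whose anchor text hit a keyword.
    t2: set of links whose IP home location is in a high-risk area.
    """
    t3 = t1 & t2  # links that satisfy both conditions
    # (T3 size must exceed, T1 size at least, T2 size at least, score)
    rules = [(5, 8, 10, 30), (3, 5, 7, 15), (3, 3, 5, 5)]
    for t3_gt, t1_ge, t2_ge, score in rules:
        if len(t3) > t3_gt and len(t1) >= t1_ge and len(t2) >= t2_ge:
            return score  # assumption: tiers are exclusive, first match wins
    return 0
```

The returned value would be added to type_confidence by the caller.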
(5) Extracting text content in the page content; all the text content that remains after every tag in the page content has been removed is obtained and named t3.
(6) Carrying out illegal keyword recognition on the text content to generate a third recognition result;
the text content is segmented and then filtered and matched against the keywords to obtain a hit keyword set K1; the third recognition result can be generated directly from whether any keyword was hit or from the number of hit keywords.
To determine the hit condition of each keyword accurately, a comprehensive evaluation can be performed over several properties of the hit keywords, for example the violation level of each hit keyword, the number of hits, and the reliability, to generate the third recognition result. For example, filtering and matching are performed with the keywords to obtain the hit keyword set K1, in which each hit keyword, its hit count hitCount, its keyword type keytype, and its reliability are recorded; K1 is sorted by reliability to obtain the highest reliability maxReliability (i.e., the maximum of the reliabilities of the hit keywords), and the total number of keyword hits, totalHit, is counted.
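A minimal Python sketch of building K1 and the two summary values follows. The `KeywordHit` record, the `keyword_table` mapping from keyword to (keytype, reliability), and the pre-segmented `tokens` input are assumptions made for illustration; the embodiment does not fix a data layout.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class KeywordHit:
    keyword: str
    hit_count: int     # hitCount in the description
    key_type: str      # keytype in the description
    reliability: int


def collect_hits(tokens, keyword_table):
    """Build K1 from already-segmented text.

    keyword_table maps keyword -> (key_type, reliability).
    Returns (K1 sorted by reliability descending, maxReliability, totalHit).
    """
    counts = Counter(t for t in tokens if t in keyword_table)
    k1 = [KeywordHit(k, n, *keyword_table[k]) for k, n in counts.items()]
    k1.sort(key=lambda h: h.reliability, reverse=True)
    max_reliability = k1[0].reliability if k1 else 0
    total_hit = sum(h.hit_count for h in k1)
    return k1, max_reliability, total_hit
```

Here K1.size corresponds to `len(k1)`, the number of distinct hit keywords, while totalHit counts repeated occurrences as well.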
When maxReliability = 5 and K1.size > 5, type_confidence += 40;
When maxReliability = 4 and K1.size > 10, type_confidence += 40;
When maxReliability = 4 and K1.size > 5, type_confidence += 30;
When maxReliability = 3 and K1.size > 10, type_confidence += 20;
When maxReliability = 3 and (K1.size > 5 or totalHit = 20), type_confidence += 20;
When maxReliability = 3 and (K1.size > 3 or totalHit = 10), type_confidence += 10.
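The text-content tiers above could be applied as in the following sketch. Two readings are assumptions rather than statements of the source: the totalHit comparisons are treated as "at least" (>=), and only the first matching tier contributes to the score. `text_score` is an illustrative name.

```python
def text_score(max_reliability, k1_size, total_hit):
    """Score contribution of the text-content check (first matching tier)."""
    if max_reliability == 5 and k1_size > 5:
        return 40
    if max_reliability == 4:
        if k1_size > 10:
            return 40
        if k1_size > 5:
            return 30
    if max_reliability == 3:
        if k1_size > 10:
            return 20
        # assumption: the totalHit conditions are read as ">="
        if k1_size > 5 or total_hit >= 20:
            return 20
        if k1_size > 3 or total_hit >= 10:
            return 10
    return 0
```

As with the hyperlink check, the caller would add the returned value to type_confidence.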
(7) Generating a keyword detection result according to the first recognition result, the second recognition result and the third recognition result.
The first recognition result is the violation keyword detection result for the page title and tags, the second is the violation keyword detection result for the hyperlinks, and the third is the violation keyword detection result for the text content; the overall keyword detection result is generated from the three together. The specific generation method is not limited in this embodiment: detection objects that hit violation keywords in all three checks may be taken as detection objects with violation information of the target field, or a subset of the hit detection objects may be screened out and the identification accuracy further evaluated from the value of type_confidence, so as to obtain a comprehensive detection result.
For example, detection objects that hit violation keywords in the text content and whose IP home location belongs to a violation high-risk area may be screened out, and the related information of each screened detection object, including its type_confidence score, is taken as the detection result, so that the user can obtain the related information of each detection object. Besides the score, the related information may include url (the address of the website hosting the application), the application download url (application download address), and type (application type), which is not limited here.
Still taking the type_confidence score as the combined result of the checks, extensive experiments show the following: when type_confidence > 80, the result is high confidence, with a recognition accuracy of about 90-99%; when type_confidence > 60, it is medium confidence, with a recognition accuracy of about 70-90%; when type_confidence is between 30 and 60, it is low confidence, with a recognition accuracy of about 40-70%; and when type_confidence < 30, the recognition accuracy is about 0-40%.
A detection object with type_confidence higher than 60 may be treated as suspected of containing violation information of the target field, and a detection object with type_confidence higher than 90 may be treated as confirmed to contain violation information of the target field. It should be noted that these threshold criteria are not limited in this embodiment.
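The bands and thresholds described above can be expressed as a small Python sketch. It assumes the low-confidence band covers scores from 30 up to 60 inclusive, and the label "very low" for scores under 30 is a name introduced here; the 60 and 90 cut-offs are the example thresholds from the text and are kept as parameters because the embodiment does not fix them.

```python
def confidence_band(type_confidence):
    """Map a type_confidence score to the confidence band described above."""
    if type_confidence > 80:
        return "high"       # reported accuracy about 90-99%
    if type_confidence > 60:
        return "medium"     # about 70-90%
    if type_confidence >= 30:
        return "low"        # about 40-70%; band edges are an assumption
    return "very low"       # about 0-40%; label is an assumption


def verdict(type_confidence, suspect_above=60, confirm_above=90):
    """Classify a detection object using the example thresholds."""
    if type_confidence > confirm_above:
        return "confirmed"
    if type_confidence > suspect_above:
        return "suspected"
    return "not flagged"
```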
It should be noted that the execution order of the three checks is not limited in this embodiment. The description above performs title and tag detection first and text content detection last only as an example; the three checks may be executed serially or in parallel as needed, and implementations using other execution orders may refer to the description of this embodiment and are not repeated here.
Based on the above embodiment, to improve detection accuracy, the following step may further be performed before violation keyword detection is performed on the page content: searching whether a mobile phone page adaptation tag exists in the page content, and generating an adaptation detection result. Correspondingly, determining a detection object with violation information of the target field according to the keyword detection result comprises: determining a detection object with violation information of the target field according to the keyword detection result and the adaptation detection result.
When analyzing the data, the page content is analyzed first to check whether a mobile phone page adaptation tag exists in the html (a page adapted for mobile phones makes it easier for users to download the application). For example, pages are filtered on three attributes of the meta tag (the mobile phone page adaptation tag), namely viewport, width, and initial-scale, to obtain the pages that support mobile phones; is_mobile of such data is set to true, otherwise it is set to false, and the website information table is updated.
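One way to sketch this check in Python is to look for a `<meta name="viewport">` tag whose content mentions both a width and an initial-scale setting. This interpretation of the three attributes, and the names `ViewportFinder` and `is_mobile_page`, are assumptions for illustration only.

```python
from html.parser import HTMLParser


class ViewportFinder(HTMLParser):
    """Sets is_mobile when a viewport meta tag with width and scale is seen."""

    def __init__(self):
        super().__init__()
        self.is_mobile = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k: (v or "") for k, v in attrs}
        if attr.get("name", "").lower() != "viewport":
            return
        content = attr.get("content", "").lower()
        if "width" in content and "initial-scale" in content:
            self.is_mobile = True


def is_mobile_page(html):
    """Return the is_mobile flag to store in the website information table."""
    finder = ViewportFinder()
    finder.feed(html)
    return finder.is_mobile
```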
Further, to avoid useless work, the following step may be performed before extracting the information corresponding to the preset field in the detection object: screening out the detection objects that correspond to a valid application download link in the website information table as target objects. Correspondingly, extracting the information corresponding to the preset field in the detection object comprises: extracting the information corresponding to the preset field in the target object.
A valid application download link is one that exists and can be downloaded. Specifically, a detection object whose application download link is not empty and has a length greater than 0 may be taken as a target object; such a website contains a mobile application download, so invalid links are prevented from interfering with detection.
To improve detection efficiency, a new table (application links) may be generated for the target objects. The table may include the following fields: url, domain, html, IP, geo, application Links, is_mobile (whether the html supports the mobile phone format), type (html content type), and type_confidence (confidence of the html content type). The corresponding information can then be read directly from the table during subsequent detection, which avoids the extra overhead of searching for it.
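The table row and the screening step can be sketched with a Python dataclass. The field names follow the list above with Python-style spelling; the default values and the `select_targets` helper are assumptions, and the sketch checks only that a link is non-empty, not that it can actually be downloaded.

```python
from dataclasses import dataclass, field


@dataclass
class SiteRecord:
    url: str
    domain: str
    html: str
    ip: str
    geo: str
    application_links: list = field(default_factory=list)
    is_mobile: bool = False    # whether the html supports the mobile format
    type: str = ""             # html content type
    type_confidence: int = 0   # confidence of the html content type


def select_targets(records):
    """Keep records with at least one non-empty application download link."""
    targets = []
    for record in records:
        links = [l for l in record.application_links if l and l.strip()]
        if len(links) > 0:
            targets.append(record)
    return targets
```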
Corresponding to the above method embodiment, the embodiment of the present invention further provides an illegal application identification device, and the illegal application identification device described below and the illegal application identification method described above may be referred to in correspondence with each other.
Referring to fig. 3, the apparatus includes the following modules:
the object obtaining unit 110 is mainly configured to obtain a plurality of accessible websites as detection objects;
the information extraction unit 120 is mainly configured to extract information corresponding to a preset field in the detection object, and store the extracted information in a website information table; wherein, the preset field includes: an application download link and website content;
the page detection unit 130 is mainly used for extracting page contents from the website information table, performing illegal keyword detection on the page contents, and generating a keyword detection result;
the violation object determination unit 140 is mainly configured to determine, according to the keyword detection result, a detection object with violation information in the target field, as a violation website;
the illegal application determining unit 150 is mainly configured to extract an application download link corresponding to the illegal website in the website information table as the illegal download link, and use an application corresponding to the illegal download link as the illegal application.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a computer device, and a computer device described below and an illegal application identification method described above may be referred to in correspondence.
The computer device includes:
a memory for storing a computer program;
a processor for implementing the steps of the illegal application identification method of the above-described method embodiment when executing the computer program.
Specifically, fig. 4 is a schematic diagram of a specific structure of the computer device provided in this embodiment. The computer device may differ considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 322 (e.g., one or more processors) and a memory 332, where the memory 332 stores one or more computer applications 342 or data 344. The memory 332 may be transient or persistent storage. The program stored in the memory 332 may include one or more modules (not shown), each of which may include a sequence of instruction operations for the data processing device. Still further, the central processing unit 322 may be configured to communicate with the memory 332 and to execute, on the computer device 301, the series of instruction operations in the memory 332.
The computer device 301 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input-output interfaces 358, and/or one or more operating systems 341.
The steps in the above-described illegal application identification method may be implemented by the structure of the computer device.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a readable storage medium, and a readable storage medium described below and an illegal application identification method described above may be referred to in correspondence with each other.
A readable storage medium stores a computer program which, when executed by a processor, carries out the steps of the illegal application identification method of the above method embodiment.
The readable storage medium may be a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other readable storage medium capable of storing program code.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

Claims (10)

1. An illegal application identification method, comprising:
acquiring a plurality of accessible websites as detection objects;
extracting information corresponding to a preset field in the detection object, and storing the extracted information into a website information table; wherein the preset field includes: an application download link and website content;
extracting the page content from the website information table, and carrying out illegal keyword detection on the page content to generate a keyword detection result;
determining a detection object with violation information of a target field according to the keyword detection result, and using the detection object as a violation website;
and extracting an application download link corresponding to the illegal website from the website information table to be used as the illegal download link, and using the application corresponding to the illegal download link as the illegal application.
2. The illegal application identification method according to claim 1, wherein the extracting information corresponding to a preset field in the detection object and storing the extracted information into a website information table includes:
extracting a website link of the detection object;
accessing the detection object and capturing page content of the detection object;
determining a domain name corresponding to the website link, and analyzing an IP address corresponding to the domain name;
acquiring IP attribution information corresponding to the IP address;
extracting an application download link in the html document corresponding to the detection object;
and taking the website link, the page content, the domain name, the IP address and the IP attribution information as the website content, and storing the website content and the application download link into a website information table.
3. The method for identifying the illegal application according to claim 1, wherein the steps of extracting the page content from the website information table and detecting the illegal keyword of the page content comprise:
extracting a page title and a tag in the page content;
carrying out illegal keyword identification on the page title and the label to generate a first identification result;
extracting hyperlinks in the page content;
carrying out illegal keyword recognition on the hyperlink to generate a second recognition result;
extracting text content in the page content; the text content is the page content with the tags removed;
carrying out illegal keyword recognition on the text content to generate a third recognition result;
and generating the keyword detection result according to the first recognition result, the second recognition result and the third recognition result.
4. The illegal application identification method according to claim 3, wherein the illegal keyword identification of the page title and the tag comprises:
performing Chinese uniform code translation recognition on the title and the label to generate a translation recognition result;
performing word segmentation processing on the title and the label to obtain content word segmentation;
performing keyword filtering and recognition on the content word segmentation to generate a content recognition result;
and generating the first recognition result according to the translation recognition result and the content recognition result.
5. The illegal application identification method according to claim 3, wherein the illegal keyword identification of the hyperlink comprises:
extracting uniform resource locators and anchor texts in the hyperlinks;
performing word segmentation processing on the anchor text to obtain anchor text word segmentation;
performing keyword filtering recognition on the anchor text participles to generate an anchor text recognition result;
determining an IP home according to the uniform resource locator;
judging whether the IP home location belongs to a violation high-risk area or not, and generating a home location identification result;
and generating the second recognition result according to the anchor text recognition result and the attribution recognition result.
6. The illegal application identification method according to claim 1, further comprising, before the detection of the illegal keyword on the page content:
searching whether a mobile phone page adaptation label exists in the page content, and generating an adaptation detection result;
correspondingly, the determining, according to the keyword detection result, a detection object with violation information of the target field includes:
and determining a detection object with violation information of the target field according to the keyword detection result and the adaptation detection result.
7. The illegal application identification method according to claim 1, wherein before the extracting information corresponding to the preset field in the detection object, the method further comprises:
screening out the detection object corresponding to the effective application download link in the website information table as a target object;
correspondingly, the extracting information corresponding to the preset field in the detection object includes: and extracting information corresponding to preset fields in the target object.
8. An illegal application identification device, comprising:
an object acquisition unit configured to acquire a plurality of accessible websites as detection objects;
the information extraction unit is used for extracting information corresponding to a preset field in the detection object and storing the extracted information into a website information table; wherein the preset field includes: an application download link and website content;
the page detection unit is used for extracting the page content from the website information table, carrying out illegal keyword detection on the page content and generating a keyword detection result;
the violation object determining unit is used for determining a detection object with violation information of the target field according to the keyword detection result, and the detection object is used as a violation website;
and the illegal application determining unit is used for extracting the application downloading link corresponding to the illegal website from the website information table to be used as the illegal downloading link, and using the application corresponding to the illegal downloading link as the illegal application.
9. A computer device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the illegal application identification method according to any one of claims 1-7 when executing the computer program.
10. A readable storage medium, characterized in that it has stored thereon a computer program which, when being executed by a processor, carries out the steps of the illegal application identification method according to any one of claims 1 to 7.
CN202110967165.6A 2021-08-23 2021-08-23 Illegal application identification method, device and equipment and readable storage medium Pending CN113962218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110967165.6A CN113962218A (en) 2021-08-23 2021-08-23 Illegal application identification method, device and equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110967165.6A CN113962218A (en) 2021-08-23 2021-08-23 Illegal application identification method, device and equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113962218A true CN113962218A (en) 2022-01-21

Family

ID=79460797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110967165.6A Pending CN113962218A (en) 2021-08-23 2021-08-23 Illegal application identification method, device and equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113962218A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116846668A (en) * 2023-07-28 2023-10-03 北京中睿天下信息技术有限公司 Harmful URL detection method, system, equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination