CN111652622A - Risk website identification method and device and electronic equipment - Google Patents

Risk website identification method and device and electronic equipment

Info

Publication number
CN111652622A
Authority
CN
China
Prior art keywords
website
text
representation
image
risk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010454581.1A
Other languages
Chinese (zh)
Other versions
CN111652622B (en)
Inventor
李超
汲小溪
蒋博赟
王维强
王澜
赵闻飙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202010454581.1A
Publication of CN111652622A
Application granted
Publication of CN111652622B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00 Payment architectures, schemes or protocols
    • G06Q20/38 Payment protocols; Details thereof
    • G06Q20/40 Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
    • G06Q20/401 Transaction verification
    • G06Q20/4016 Transaction verification involving fraud or risk level assessment in transaction processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/51 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems at application loading time, e.g. accepting, rejecting, starting or inhibiting executable software based on integrity or source reliability
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2119 Authenticating web pages, e.g. with suspicious links
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Security & Cryptography (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Accounting & Taxation (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • General Business, Economics & Management (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The specification discloses a method, a device and electronic equipment for identifying a risk website, wherein the method comprises the following steps: acquiring a target website to be identified; crawling text information and image information corresponding to the target website; obtaining a common representation between the text information and the image information through multi-modal representation learning; and classifying the common representation through a machine learning algorithm to determine whether the target website is a risk website. In the technical scheme, the common representation between the text information and the image information corresponding to the target website is obtained through multi-mode representation learning, the invalid information and the interference information are eliminated, the risk website is identified based on the common representation, and the accuracy of risk website identification is improved.

Description

Risk website identification method and device and electronic equipment
Technical Field
The present disclosure relates to the field of software technologies, and in particular, to a risk prevention and control method, an apparatus, and an electronic device.
Background
At present, many illegal activities, such as pornography, gambling, drugs, illegal investment and financing, and fraud, are carried out online through websites. The party committing crimes online through a website holds an account and a password. On the one hand, because the account is always in an abnormal transaction state, few abnormal transaction features can be derived from the account's historical transaction records, so it is difficult to perform risk prevention and control on transaction behavior by detecting abnormal transactions. On the other hand, such parties usually hold multiple accounts and switch between them continuously, which lets them bypass risk control rules such as those targeting high-frequency, large-amount transactions; even if some accounts are penalized or banned, the overall operation is not affected. How to deal with crimes committed online through websites has therefore become a problem to be solved urgently.
Disclosure of Invention
The embodiments of this specification provide a method and an apparatus for identifying a risk website, and an electronic device, which are used to identify websites through which crimes are committed online.
In a first aspect, an embodiment of the present specification provides a method for identifying a risk website, where the method includes:
acquiring a target website to be identified;
crawling text information and image information corresponding to the target website;
obtaining a common representation between the text information and the image information through multi-modal representation learning;
and classifying the common representation through a machine learning algorithm to determine whether the target website is a risk website.
Optionally, the obtaining of the common representation between the text information and the image information through multi-modal representation learning includes:
performing vector conversion on the text information to obtain a text characteristic vector, and performing vector conversion on the image information to obtain an image characteristic vector;
performing dimensionality reduction on the text feature vector and the image feature vector through an auto-encoder to obtain a text representation of the text feature vector and an image representation of the image feature vector, wherein the feature dimensions of the text representation and the image representation are the same;
obtaining a canonical correlation coefficient between the text representation and the image representation;
learning the common representation through multi-modal representation learning based on the text representation, the image representation, and the canonical correlation coefficient.
Optionally, crawling text information and image information corresponding to the target website includes:
crawling a webpage text in a target webpage corresponding to the target website and a webpage screenshot of the target webpage;
crawling sub-links in the target webpage and sub-link texts and sub-link images corresponding to the sub-links;
and taking the webpage text and the sub-link text as text information corresponding to the target website, and taking the webpage screenshot and the sub-link image as image information corresponding to the target website.
Optionally, the acquiring a target website to be identified includes:
obtaining a complaint website and/or an incoming website of the risk prevention and control platform from the risk prevention and control platform as the target website;
and acquiring the website meeting the preset risk rule from the Internet as the target website.
Optionally, the obtaining, from the internet, a website that meets a preset risk rule as the target website includes:
performing website retrieval according to the risk keywords to obtain the target website; and/or
monitoring risk complaint information of forum-type web pages, and extracting the target website based on the risk complaint information obtained by monitoring.
In a second aspect, an embodiment of the present specification provides an apparatus for identifying a risk website, where the apparatus includes:
the acquisition unit is used for acquiring a target website to be identified;
the crawling unit is used for crawling text information and image information corresponding to the target website;
the learning unit is used for obtaining a common representation between the text information and the image information through multi-modal representation learning;
and the classification unit is used for classifying the common representation through a machine learning algorithm and confirming whether the target website is a risk website.
Optionally, the learning unit is configured to:
performing vector conversion on the text information to obtain a text characteristic vector, and performing vector conversion on the image information to obtain an image characteristic vector;
performing dimensionality reduction on the text feature vector and the image feature vector through an auto-encoder to obtain a text representation of the text feature vector and an image representation of the image feature vector, wherein the feature dimensions of the text representation and the image representation are the same;
obtaining a canonical correlation coefficient between the text representation and the image representation;
learning the common representation through multi-modal representation learning based on the text representation, the image representation, and the canonical correlation coefficient.
Optionally, the crawling unit is configured to:
crawling a webpage text in a target webpage corresponding to the target website and a webpage screenshot of the target webpage;
crawling sub-links in the target webpage and sub-link texts and sub-link images corresponding to the sub-links;
and taking the webpage text and the sub-link text as text information corresponding to the target website, and taking the webpage screenshot and the sub-link image as image information corresponding to the target website.
Optionally, the obtaining unit is configured to:
obtaining a complaint website and/or an incoming website of the risk prevention and control platform from the risk prevention and control platform as the target website;
and acquiring the website meeting the preset risk rule from the Internet as the target website.
Optionally, the obtaining unit is further configured to:
performing website retrieval according to the risk keywords to obtain the target website; and/or
monitoring risk complaint information of forum-type web pages, and extracting the target website based on the risk complaint information obtained by monitoring.
In a third aspect, the present specification provides a computer readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the program implements the corresponding steps of the method according to the first aspect.
In a fourth aspect, an embodiment of the present specification provides an electronic device, including a memory, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors to execute operation instructions included in the one or more programs for performing the corresponding method according to the first aspect.
One or more technical solutions in the embodiments of the present specification have at least the following technical effects:
the embodiment of the specification provides a method for identifying a risk website, which acquires a target website to be identified; crawls text information and image information corresponding to the target website; obtains a common representation between the text information and the image information through multi-modal representation learning; and classifies the common representation through a machine learning algorithm to determine whether the target website is a risk website. Websites used to commit crimes online are thereby identified, so that risk websites on which online crimes may be committed can be actively prevented and controlled, reducing the occurrence of crimes on such websites. In addition, because the common representation between the text information and the image information corresponding to the website is obtained through multi-modal representation learning, invalid information and interference information are eliminated, and identifying risk websites based on the common representation improves the accuracy of risk website identification.
Drawings
In order to illustrate the technical solutions in the embodiments of this specification more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of this specification, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a method for identifying a risk website according to an embodiment of the present disclosure;
Fig. 2 is a schematic diagram of an identification apparatus for a risk website according to an embodiment of the present disclosure;
Fig. 3 is a schematic diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without any inventive step are within the scope of the present specification.
The embodiments of this specification provide a method for identifying a risk website, which identifies websites used to commit crimes online and improves the accuracy of risk website identification, so that online crimes can be actively prevented and controlled.
The main implementation principle, the specific implementation mode and the corresponding beneficial effects of the technical solutions of the embodiments of the present description are explained in detail below with reference to the accompanying drawings.
Examples
Referring to fig. 1, the present embodiment provides a method for identifying a risk website, including the following steps S10-S16:
S10: acquiring the target website to be identified.
The sources of target websites to be identified include an internal source and an external source. The internal source refers to the risk prevention and control platform, and internal target websites are the complaint websites and incoming websites of the risk prevention and control platform. An incoming website is the network address required when an application connects to an upstream application, such as the network address of an online business. For example, if a payment platform serves as the risk prevention and control platform, the upstream payment channel of the payment platform, that is, its incoming website, can be obtained as the target website to be identified. The external source refers to external websites that are acquired from the Internet and meet preset risk rules.
S12: crawling the text information and the image information corresponding to the target website.
The text information and the image information in the web page corresponding to the target website can be crawled by a content crawler. The crawled text information may be the HTML content and URL information of the target website. The crawled image information may be a screenshot of the home page of the target website.
S14: obtaining the common representation between the text information and the image information through multi-modal representation learning.
Multimodal representation learning is the representation learning (Representation) task within Multimodal Machine Learning (MML). Multimodal machine learning is an artificial intelligence learning approach that makes combined use of information from multiple modalities, and it covers representation learning (Representation), translation (Translation), alignment (Alignment), multimodal fusion (Fusion), and co-learning (Co-learning).
S16: classifying the common representation through a machine learning algorithm to determine whether the target website is a risk website.
The machine learning algorithm may be a gradient boosting decision tree (GBDT), a random forest (RF), linear regression (LR), a fully-connected neural network (MLP), a support vector machine (SVM), or the like. Classifying on the common representation of the text information and the image information effectively eliminates invalid information in the web page, which improves the accuracy and coverage of risk website identification and makes the method more robust against page tampering by black-market operators.
In a specific implementation, when obtaining target websites from the internal source, S10 may obtain complaint websites and/or incoming websites from within the risk prevention and control platform, where complaint websites include websites that users report to the risk prevention and control platform after investing, gambling, or being cheated. When obtaining target websites from the external source, S10 may use website retrieval and/or forum monitoring.
Website retrieval: websites are searched according to risk keywords to obtain the target website.
First, known risk keywords are acquired from known risk websites; then, risk keywords similar to the known risk keywords are obtained; finally, a search is performed on the known or similar risk keywords to obtain the target website. Specifically, word vectors can be trained on the website keywords corresponding to known risk websites, using a method such as BERT or word2vec; the word vector of a known risk keyword is then looked up in the trained vectors, similar word vectors are found, and the keywords corresponding to those similar word vectors are taken as similar risk keywords of the known risk keyword. For example, if "stock" is known to be a risk keyword of illegal investment and financing risk websites, "financing" is obtained as a similar risk keyword of "stock" based on word-vector similarity, and website retrieval is performed on "stock" and "financing" respectively to obtain target websites to be identified.
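The keyword expansion described above can be sketched as a nearest-neighbour lookup over word vectors. This is only a minimal illustration: the vectors below are hand-made toy values rather than trained BERT or word2vec embeddings, and the function names and the similarity threshold `min_sim` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_keywords(known, vectors, top_k=1, min_sim=0.8):
    """Return up to top_k words whose vectors lie closest to any known risk keyword."""
    scored = []
    for word, vec in vectors.items():
        if word in known:
            continue
        best = max((cosine(vec, vectors[k]) for k in known if k in vectors),
                   default=0.0)
        if best >= min_sim:
            scored.append((best, word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_k]]

# Toy, hand-made embeddings (assumed values, not trained vectors).
TOY_VECTORS = {
    "stock": [0.9, 0.1, 0.0],
    "financing": [0.8, 0.2, 0.1],
    "weather": [0.0, 0.1, 0.9],
}

expanded = similar_keywords({"stock"}, TOY_VECTORS)
print(expanded)
```

With these toy vectors, "financing" is close to "stock" and "weather" is not, which mirrors the stock/financing example in the text. The expanded keyword set would then be passed to whatever search interface is used for website retrieval.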
Forum monitoring: risk complaint information on forum-type web pages is monitored, and the target website is extracted from the risk complaint information obtained by monitoring.
Forum-type web pages, such as Tieba, the Tianya forum, and Zhihu, often contain a large amount of user complaint information, such as "I have been cheated, what should I do? The XXXX website is a trap, do not go there under any circumstances, it is all a trap". When monitoring risk complaint information, complaint keywords such as scam routines, traps, money cheating, and gambling are monitored, and the complaint keywords and their contexts in the web page are extracted to obtain the risk complaint information. The website address is then obtained from the monitored risk complaint information by entity extraction and used as a target website.
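A rough sketch of this monitoring step follows, assuming that the "context" of a complaint keyword is the sentence containing it and that entity extraction can be approximated by a regular expression for domain-like strings. The keyword list, the pattern, and the function names are illustrative assumptions; a production system would more likely use a trained entity extractor.

```python
import re

# Hypothetical English stand-ins for the complaint keywords.
COMPLAINT_KEYWORDS = ("cheated", "scam", "gambling")

# Simplified pattern for domain-like strings, with optional scheme and path.
URL_PATTERN = re.compile(
    r"(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/[^\s,;]*)?", re.IGNORECASE
)

def complaint_sentences(post):
    """Return the sentences of a post that contain a complaint keyword."""
    sentences = re.split(r"(?<=[.!?])\s+", post)
    return [s for s in sentences
            if any(kw in s.lower() for kw in COMPLAINT_KEYWORDS)]

def extract_target_urls(post):
    """Extract candidate target websites from the complaint context only."""
    urls = []
    for sentence in complaint_sentences(post):
        for url in URL_PATTERN.findall(sentence):
            if url not in urls:
                urls.append(url)
    return urls

post = ("Nice forum. I was cheated, the site www.example-bet.com is a scam. "
        "Do not go there.")
found = extract_target_urls(post)
print(found)
```

Restricting extraction to sentences that contain a complaint keyword keeps unrelated links in the same post from being treated as target websites.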
The coverage rate of the target websites is improved through the acquisition of the target websites of the internal source and the external source, and risk website identification is carried out on each target website, so that the risk websites are identified as much as possible, and the coverage rate of active risk prevention and control is improved.
After the target website to be identified is acquired, S12 is executed to crawl web page information for the target website. To further expand website coverage, the target website can be expanded by several degrees during crawling, so that the text, images, and URLs of the sub-links on the target web page are also acquired. Specifically, the web page text in the target web page corresponding to the target website and a screenshot of the target web page are crawled; the sub-links in the target web page, together with the sub-link texts and sub-link images corresponding to them, are crawled; and the web page text and the sub-link texts are taken as the text information corresponding to the target website, while the web page screenshot and the sub-link images are taken as the image information corresponding to the target website.
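The text side of this crawling step can be sketched with the Python standard library alone. The example parses an HTML string that is assumed to have been downloaded already, collects the visible text, and resolves sub-links against the page URL. Taking screenshots needs a browser automation tool and is not shown; the class name, function name, and sample page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PageParser(HTMLParser):
    """Collects visible text and sub-link URLs from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.sub_links = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.sub_links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())

def parse_page(base_url, html):
    """Return (page_text, sub_links) for already-downloaded HTML."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.sub_links

html = ('<html><body><h1>Win big</h1><script>var x=1;</script>'
        '<a href="/deposit">Deposit now</a></body></html>')
text, links = parse_page("http://example.test/", html)
print(text, links)
```

Each URL in `links` could then be fetched and parsed the same way to gather sub-link text, which corresponds to expanding the target website by one degree.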
After S12, S14 is executed to perform multi-modal representation learning. The common representation of the two modalities, text information and image information, is learned by the DCCAE (Deep CCA Autoencoder) method, which specifically includes the following steps:
Step 1: Information preprocessing. Vector conversion is performed on the text information corresponding to the target website to obtain a text feature vector x, and on the image information to obtain an image feature vector y.
Step 2: Vector dimensionality reduction. The text feature vector x and the image feature vector y often have different dimensions, so their correlation cannot be computed directly. An auto-encoder is therefore used to reduce the dimensionality of both vectors, yielding a text representation of the text feature vector and an image representation of the image feature vector with the same feature dimension. The text feature vector x is input into an auto-encoder, and the feature f(x) of the intermediate encoding layer learned by the auto-encoder is the text representation of the text feature vector. Similarly, the image feature vector y is input into an auto-encoder, and the feature g(y) of the intermediate encoding layer learned by the auto-encoder is the image representation of the image feature vector.
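To make the dimension matching in step 2 concrete, the sketch below builds two one-layer encoders that map inputs of different sizes to the same latent size. It only illustrates shapes: the weights are random and untrained, whereas a real auto-encoder would learn them together with a decoder by minimizing reconstruction error. The dimensions 6, 10, and 3 and the function name are arbitrary assumptions.

```python
import math
import random

def make_encoder(in_dim, out_dim, seed):
    """Build a one-layer tanh encoder with random (untrained) weights."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def encode(vec):
        if len(vec) != in_dim:
            raise ValueError("unexpected input dimension")
        # One output per weight row: tanh of the weighted sum of the inputs.
        return [math.tanh(sum(w * v for w, v in zip(row, vec)))
                for row in weights]

    return encode

f = make_encoder(in_dim=6, out_dim=3, seed=0)    # stands in for the text encoder f(x)
g = make_encoder(in_dim=10, out_dim=3, seed=1)   # stands in for the image encoder g(y)

x = [0.2, 0.0, 1.0, 0.5, 0.0, 0.3]   # toy text feature vector
y = [0.1] * 10                       # toy image feature vector
fx, gy = f(x), g(y)
print(len(fx), len(gy))
```

Because both encoders output vectors of the same length, f(x) and g(y) can be compared directly in the correlation analysis of the next step.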
Step 3: Obtaining the canonical correlation coefficient between the text representation and the image representation. Correlation analysis is performed on the text representation and the image representation, now reduced to the same dimension, by Canonical Correlation Analysis (CCA) to obtain the canonical correlation coefficient between f(x) and g(y), that is, the CCA coefficient.
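For reference, the standard textbook definition of the first canonical correlation can be written as below. The projection vectors u and v and the covariance symbols are notation introduced here for illustration; the specification itself does not give a formula.

```latex
\rho \;=\; \max_{u,\,v}\;
\frac{u^{\top}\,\Sigma_{fg}\,v}
     {\sqrt{u^{\top}\,\Sigma_{ff}\,u}\;\sqrt{v^{\top}\,\Sigma_{gg}\,v}}
```

Here \Sigma_{ff} and \Sigma_{gg} denote the covariance matrices of f(x) and g(y) over the training samples, and \Sigma_{fg} denotes their cross-covariance matrix. A value of \rho close to 1 means that some linear projection of the text representation is almost perfectly correlated with some linear projection of the image representation.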
Step 4: Multi-modal representation learning. Based on the text representation and the image representation obtained in step 2 and the CCA coefficient obtained in step 3, the common representation between the text information and the image information is obtained through multi-modal representation learning. Because the target web page corresponding to the target website may contain interference information, the text representation f(x) and the image representation g(y) derived from it may also contain interference information; if f(x) and g(y) were used directly for model learning, the accuracy of the resulting common representation would be low. In this embodiment, the CCA coefficient between the text representation f(x) and the image representation g(y) is added to the loss function of multi-modal representation learning, so that the model eliminates interference information and single-modality noise. The resulting common representation between f(x) and g(y) comprises a text feature f and an image feature g, and its accuracy is improved.
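The specification does not state the loss function explicitly. One commonly used form for correlation-regularized auto-encoders, given here only as an assumption, combines the reconstruction errors of both modalities with a reward for a high CCA coefficient. The decoders p and q, the trade-off weight \lambda > 0, and the sample count N are symbols introduced for this sketch.

```latex
\mathcal{L} \;=\;
\frac{1}{N}\sum_{i=1}^{N}
\Bigl( \lVert x_i - p(f(x_i)) \rVert^{2} + \lVert y_i - q(g(y_i)) \rVert^{2} \Bigr)
\;-\; \lambda\,\rho\bigl(f(X),\, g(Y)\bigr)
```

Here \rho(f(X), g(Y)) is the CCA coefficient between the text and image representations computed over the N samples. Minimizing \mathcal{L} keeps each representation faithful to its own modality through the reconstruction terms while pushing f and g toward information that both modalities share, which is how single-modality noise would be suppressed under this assumed form.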
After the common representation of the text information and the image information is obtained, S16 is executed to classify it. If the probability that the common representation belongs to the risk type is greater than a set threshold, the target website corresponding to the common representation is determined to be a risk website; if the probability is not greater than the set threshold, the target website is determined not to be a risk website. The common representation obtained by multi-modal representation learning comprises the text feature f and the image feature g of the target website. Because f and g are constrained by the CCA coefficient, they are similar, so either f or g can be used as the input feature and classified by a classifier such as GBDT, RF, LR, MLP, or SVM. Since f and g have already been filtered by multi-modal representation learning, the scheme reduces dependence on information injected by black-market operators and improves the accuracy of risk website identification.
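The threshold decision can be sketched as follows. A simple logistic scorer stands in for whichever trained classifier (GBDT, RF, and so on) is actually used; the weights, bias, and default threshold are made-up values for illustration only.

```python
import math

def risk_probability(features, weights, bias):
    """Stand-in scorer: logistic function of a weighted sum of the features."""
    z = sum(w * v for w, v in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def is_risk_website(probability, threshold=0.5):
    """Flag the website only when the probability is strictly above the threshold."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return probability > threshold

common_f = [0.8, -0.1, 0.4]   # toy common representation (text feature f)
prob = risk_probability(common_f, weights=[2.0, 0.5, 1.0], bias=-0.5)
print(is_risk_website(prob, threshold=0.7))
```

A probability exactly equal to the threshold is treated as not risky, which matches the "not greater than the set threshold" wording above.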
After a risk website is identified, this embodiment can further perform active inspection on it, so as to implement risk prevention and control based on the risk website. Active inspection is an important part of active risk control: because the codes used behind a risk website are rotated continuously, the accounts behind the website are identified either by modeling or through micro-task workers. In the micro-task approach, the identified risk website is issued to micro-task workers as a task; the workers obtain the accounts behind the website by registering and recharging, and return them to the risk prevention and control platform, whose back end then penalizes the accounts. Identified risk websites are inspected periodically through such micro-task assignments to prevent and control illegal transactions.
Based on the same inventive concept as the method for identifying a risk website provided in the foregoing embodiment, an embodiment of this specification further provides an apparatus for identifying a risk website. Referring to Fig. 2, the apparatus includes:
an obtaining unit 21, configured to obtain a target website to be identified;
the crawling unit 22 is used for crawling the text information and the image information corresponding to the target website;
a learning unit 23, configured to obtain a common representation between the text information and the image information through multi-modal representation learning;
and the classifying unit 24 is configured to classify the common representations through a machine learning algorithm, and determine whether the target website is a risk website.
As an optional implementation manner, when performing multi-modal representation learning, the learning unit 23 may perform vector conversion on the text information to obtain a text feature vector, and perform vector conversion on the image information to obtain an image feature vector; perform dimensionality reduction on the text feature vector and the image feature vector through an auto-encoder to obtain a text representation of the text feature vector and an image representation of the image feature vector, wherein the feature dimensions of the text representation and the image representation are the same; obtain a canonical correlation coefficient between the text representation and the image representation; and learn the common representation through multi-modal representation learning based on the text representation, the image representation, and the canonical correlation coefficient.
As an optional embodiment, the crawling unit 22 may be configured to: crawl a webpage text in a target webpage corresponding to the target website and a webpage screenshot of the target webpage; crawl sub-links in the target webpage, together with the sub-link texts and sub-link images corresponding to the sub-links; and take the webpage text and the sub-link texts as the text information corresponding to the target website, and the webpage screenshot and the sub-link images as the image information corresponding to the target website.
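The collection of text and image information from a page and its sub-links can be sketched with the Python standard library. Several points are assumptions made for illustration: `fetch_html` and `take_screenshot` are hypothetical callables supplied by the caller, a sub-link image is taken to be a screenshot of the sub-link page, and `max_sublinks` is an arbitrary cap that the specification does not mention.

```python
from html.parser import HTMLParser
from typing import Callable, List, Tuple
from urllib.parse import urljoin


class _PageParser(HTMLParser):
    """Collects visible text and href values from one HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts: List[str] = []
        self.links: List[str] = []
        self._skip_depth = 0  # greater than zero while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html: str) -> Tuple[str, List[str]]:
    """Return (visible text, list of href values) for an HTML string."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


def collect_site_info(
    url: str,
    fetch_html: Callable[[str], str],         # hypothetical page fetcher
    take_screenshot: Callable[[str], bytes],  # hypothetical screenshot helper
    max_sublinks: int = 5,                    # assumed cap, not from the source
) -> Tuple[List[str], List[bytes]]:
    """Gather text and image information for a target website."""
    page_text, hrefs = parse_page(fetch_html(url))
    texts, images = [page_text], [take_screenshot(url)]
    seen = {url}
    for href in hrefs:
        sub_url = urljoin(url, href)
        if sub_url in seen:
            continue
        if len(seen) > max_sublinks:
            break
        seen.add(sub_url)
        sub_text, _ = parse_page(fetch_html(sub_url))
        texts.append(sub_text)
        images.append(take_screenshot(sub_url))
    return texts, images
```

The first element of each returned list belongs to the target webpage and the remaining elements belong to its sub-links, which matches the grouping of text information and image information described above.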
As an optional embodiment, when acquiring the target website, the obtaining unit 21 may acquire, from the risk prevention and control platform, a complaint website and/or an incoming website of the risk prevention and control platform as the target website; and acquire, from the Internet, a website meeting a preset risk rule as the target website.
As an optional implementation manner, the obtaining unit 21 is further configured to: perform website retrieval according to risk keywords to obtain the target website; and/or monitor risk complaint information on forum web pages, and extract the target website from the risk complaint information obtained by monitoring.
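Extracting target websites from monitored forum complaints can be sketched as keyword filtering followed by URL extraction. This is an illustration under assumptions: the function name `extract_target_websites`, the simple substring keyword match, and the URL regular expression are all hypothetical choices, since the specification does not say how risk complaint information is recognized.

```python
import re
from typing import Iterable, List

# Assumed, deliberately simple URL pattern for illustration only.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_target_websites(posts: Iterable[str], risk_keywords: Iterable[str]) -> List[str]:
    """Return URLs found in posts that mention at least one risk keyword.

    The result preserves first-seen order and contains no duplicates.
    """
    keywords = [keyword.lower() for keyword in risk_keywords]
    targets: List[str] = []
    for post in posts:
        lowered = post.lower()
        if not any(keyword in lowered for keyword in keywords):
            continue
        for url in URL_PATTERN.findall(post):
            url = url.rstrip(".,;:!?)")  # drop trailing sentence punctuation
            if url not in targets:
                targets.append(url)
    return targets
```

The same keyword list could also drive the website retrieval branch, for example by passing each keyword to a search interface; that part is omitted because no such interface is described.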
With regard to the apparatus in the above-described embodiments, the specific manner in which the respective units perform operations has been described in detail in the embodiments related to the method and will not be elaborated upon here.
Referring to fig. 3, a block diagram of an electronic device 700 for implementing a method for identifying a risky website is shown according to an exemplary embodiment. For example, the electronic device 700 may be a computer, database console, tablet device, personal digital assistant, and the like.
Referring to fig. 3, electronic device 700 may include one or more of the following components: a processing component 702, a memory 704, a power component 706, a multimedia component 708, an input/output (I/O) interface 710, and a communication component 712.
The processing component 702 generally controls overall operation of the electronic device 700, such as operations associated with display, data communication, and recording operations. The processing component 702 may include one or more processors 720 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 702 may include one or more modules that facilitate interaction between the processing component 702 and other components.
The memory 704 is configured to store various types of data to support operation of the electronic device 700. Examples of such data include instructions for any application or method operating on the electronic device 700, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 704 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 706 provides power to the various components of the electronic device 700. The power component 706 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 700.
The I/O interface 710 provides an interface between the processing component 702 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The communication component 712 is configured to facilitate wired or wireless communication between the electronic device 700 and other devices. The electronic device 700 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 712 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 712 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 700 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 704 comprising instructions, executable by the processor 720 of the electronic device 700 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium stores instructions which, when executed by a processor of an electronic device, enable the electronic device to perform a method of identifying a risk website, the method comprising: acquiring a target website to be identified; crawling text information and image information corresponding to the target website; obtaining a common representation between the text information and the image information through multi-modal representation learning; and classifying the common representation through a machine learning algorithm to determine whether the target website is a risk website.
It will be understood that the present embodiments are not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present embodiments is limited only by the appended claims. The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present embodiment.

Claims (12)

1. A method for identifying a risky website, the method comprising:
acquiring a target website to be identified;
crawling text information and image information corresponding to the target website;
obtaining a common representation between the text information and the image information through multi-modal representation learning;
and classifying the common representation through a machine learning algorithm to determine whether the target website is a risk website.
2. The method of claim 1, wherein obtaining a common representation between the text information and the image information through multi-modal representation learning comprises:
performing vector conversion on the text information to obtain a text feature vector, and performing vector conversion on the image information to obtain an image feature vector;
performing dimensionality reduction on the text feature vector and the image feature vector through an auto-encoder to obtain a text representation of the text feature vector and an image representation of the image feature vector, wherein the feature dimensions of the text representation and the image representation are the same;
obtaining a canonical correlation coefficient between the text representation and the image representation;
and obtaining the common representation through multi-modal representation learning based on the text representation, the image representation, and the canonical correlation coefficient.
3. The method of claim 1, wherein crawling text information and image information corresponding to the target website comprises:
crawling a webpage text in a target webpage corresponding to the target website and a webpage screenshot of the target webpage;
crawling sub-links in the target webpage and sub-link texts and sub-link images corresponding to the sub-links;
and taking the webpage text and the sub-link text as text information corresponding to the target website, and taking the webpage screenshot and the sub-link image as image information corresponding to the target website.
4. The method of claim 1, wherein the obtaining of the target website to be identified comprises:
obtaining a complaint website from a risk prevention and control platform, obtaining an incoming website of the risk prevention and control platform, and obtaining an external website meeting a preset risk rule from the Internet;
and acquiring the target website to be identified based on the complaint website, the incoming website and the external website.
5. The method of claim 4, wherein obtaining the external website meeting the preset risk rule from the Internet comprises:
performing website retrieval according to risk keywords to obtain the target website; and/or
monitoring risk complaint information of forum web pages, and extracting the target website based on the risk complaint information obtained by monitoring.
6. An apparatus for identifying a risky website, the apparatus comprising:
an obtaining unit, configured to obtain a target website to be identified;
a crawling unit, configured to crawl text information and image information corresponding to the target website;
a learning unit, configured to obtain a common representation between the text information and the image information through multi-modal representation learning;
and a classifying unit, configured to classify the common representation through a machine learning algorithm and determine whether the target website is a risk website.
7. The apparatus of claim 6, wherein the learning unit is configured to:
perform vector conversion on the text information to obtain a text feature vector, and perform vector conversion on the image information to obtain an image feature vector;
perform dimensionality reduction on the text feature vector and the image feature vector through an auto-encoder to obtain a text representation of the text feature vector and an image representation of the image feature vector, wherein the feature dimensions of the text representation and the image representation are the same;
obtain a canonical correlation coefficient between the text representation and the image representation;
and obtain the common representation through multi-modal representation learning based on the text representation, the image representation, and the canonical correlation coefficient.
8. The apparatus of claim 6, wherein the crawling unit is configured to:
crawling a webpage text in a target webpage corresponding to the target website and a webpage screenshot of the target webpage;
crawling sub-links in the target webpage and sub-link texts and sub-link images corresponding to the sub-links;
and taking the webpage text and the sub-link text as text information corresponding to the target website, and taking the webpage screenshot and the sub-link image as image information corresponding to the target website.
9. The apparatus of claim 6, wherein the obtaining unit is configured to:
obtain, from a risk prevention and control platform, a complaint website and/or an incoming website of the risk prevention and control platform as the target website; and
acquire, from the Internet, a website meeting a preset risk rule as the target website.
10. The apparatus of claim 9, wherein the obtaining unit is further configured to:
perform website retrieval according to risk keywords to obtain the target website; and/or
monitor risk complaint information of forum web pages, and extract the target website based on the risk complaint information obtained by monitoring.
11. A computer readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out steps corresponding to the method according to any one of claims 1 to 5.
12. An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors to execute operating instructions included in the one or more programs for performing the corresponding method according to any one of claims 1 to 5.
CN202010454581.1A 2020-05-26 2020-05-26 Risk website identification method and device and electronic equipment Active CN111652622B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010454581.1A CN111652622B (en) 2020-05-26 2020-05-26 Risk website identification method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010454581.1A CN111652622B (en) 2020-05-26 2020-05-26 Risk website identification method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111652622A true CN111652622A (en) 2020-09-11
CN111652622B CN111652622B (en) 2023-08-01

Family

ID=72343086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010454581.1A Active CN111652622B (en) 2020-05-26 2020-05-26 Risk website identification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111652622B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149179A (en) * 2020-09-18 2020-12-29 支付宝(杭州)信息技术有限公司 Risk identification method and device based on privacy protection
CN112948897A (en) * 2021-03-15 2021-06-11 东北农业大学 Webpage tamper-proofing detection method based on combination of DRAE and SVM
CN113222022A (en) * 2021-05-13 2021-08-06 支付宝(杭州)信息技术有限公司 Webpage classification identification method and device
CN113221032A (en) * 2021-04-08 2021-08-06 北京智奇数美科技有限公司 Link risk detection method, device and storage medium
CN113849760A (en) * 2021-12-02 2021-12-28 云账户技术(天津)有限公司 Sensitive information risk assessment method, system and storage medium
CN115796145A (en) * 2022-11-16 2023-03-14 珠海横琴指数动力科技有限公司 Method, system, server and readable storage medium for acquiring webpage text
CN116049597A (en) * 2023-01-10 2023-05-02 北京百度网讯科技有限公司 Pre-training method and device for multi-task model of webpage and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101281521A (en) * 2007-04-05 2008-10-08 中国科学院自动化研究所 Method and system for filtering sensitive web page based on multiple classifier amalgamation
CN105426356A (en) * 2015-10-29 2016-03-23 杭州九言科技股份有限公司 Target information identification method and apparatus
CN107547555A (en) * 2017-09-11 2018-01-05 北京匠数科技有限公司 A kind of web portal security monitoring method and device
CN108763325A (en) * 2018-05-04 2018-11-06 北京达佳互联信息技术有限公司 A kind of network object processing method and processing device
CN110275958A (en) * 2019-06-26 2019-09-24 北京市博汇科技股份有限公司 Site information recognition methods, device and electronic equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
RYAN KIROS等: "Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models" *
SHIVANGI SINGHAL等: "SpotFake: A Multi-modal Framework for Fake News Detection", 《2019 IEEE FIFTH INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM)》 *
SU-FANG ZHANG 等: "Multimodal Representation Learning: Advances, Trends and Challenges" *
TADAS BALTRUSAITIS 等: "Multimodal Machine Learning: A Survey and Taxonomy" *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149179A (en) * 2020-09-18 2020-12-29 支付宝(杭州)信息技术有限公司 Risk identification method and device based on privacy protection
CN112149179B (en) * 2020-09-18 2022-09-02 支付宝(杭州)信息技术有限公司 Risk identification method and device based on privacy protection
CN112948897A (en) * 2021-03-15 2021-06-11 东北农业大学 Webpage tamper-proofing detection method based on combination of DRAE and SVM
CN112948897B (en) * 2021-03-15 2022-08-26 东北农业大学 Webpage tamper-proofing detection method based on combination of DRAE and SVM
CN113221032A (en) * 2021-04-08 2021-08-06 北京智奇数美科技有限公司 Link risk detection method, device and storage medium
CN113222022A (en) * 2021-05-13 2021-08-06 支付宝(杭州)信息技术有限公司 Webpage classification identification method and device
CN113849760A (en) * 2021-12-02 2021-12-28 云账户技术(天津)有限公司 Sensitive information risk assessment method, system and storage medium
CN115796145A (en) * 2022-11-16 2023-03-14 珠海横琴指数动力科技有限公司 Method, system, server and readable storage medium for acquiring webpage text
CN115796145B (en) * 2022-11-16 2023-09-08 珠海横琴指数动力科技有限公司 Webpage text acquisition method, system, server and readable storage medium
CN116049597A (en) * 2023-01-10 2023-05-02 北京百度网讯科技有限公司 Pre-training method and device for multi-task model of webpage and electronic equipment
CN116049597B (en) * 2023-01-10 2024-04-19 北京百度网讯科技有限公司 Pre-training method and device for multi-task model of webpage and electronic equipment

Also Published As

Publication number Publication date
CN111652622B (en) 2023-08-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant