CN111652622A - Risk website identification method and device and electronic equipment

Info

Publication number: CN111652622A
Application number: CN202010454581.1A
Authority: CN
Inventors: 李超; 汲小溪; 蒋博赟; 王维强; 王澜; 赵闻飙
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2020-05-26
Filing date: 2020-05-26
Publication date: 2020-09-11
Anticipated expiration: 2040-05-26
Also published as: CN111652622B

Abstract

The specification discloses a method, a device, and electronic equipment for identifying a risk website. The method comprises: acquiring a target website to be identified; crawling text information and image information corresponding to the target website; obtaining a common representation between the text information and the image information through multi-modal representation learning; and classifying the common representation through a machine learning algorithm to determine whether the target website is a risk website. In this technical scheme, the common representation between the text information and the image information corresponding to the target website is obtained through multi-modal representation learning, which eliminates invalid information and interference information; the risk website is then identified based on the common representation, which improves the accuracy of risk website identification.

Description

Risk website identification method and device and electronic equipment

Technical Field

The present disclosure relates to the field of software technologies, and in particular, to a risk prevention and control method, an apparatus, and an electronic device.

Background

At present, many illegal activities, such as pornography, gambling and drugs, illegal investment and financing, and fraud, are carried out online through websites. The parties committing such crimes through a website hold accounts and passwords. On the one hand, because these accounts are always in an abnormal transaction state, few abnormal transaction features can be derived from their historical transaction records, so it is difficult to perform risk prevention and control on their transactions based on abnormal transaction behavior. On the other hand, there are usually multiple accounts that can be switched continuously, which allows risk control rules such as high-frequency, large-amount rules to be bypassed; even if some accounts are penalized or banned, the overall operation is unaffected. How to deal with illegal activities carried out online through websites has therefore become a problem to be solved urgently.

Disclosure of Invention

The embodiments of the specification provide a method, a device, and electronic equipment for identifying a risk website, which are used to identify websites through which online crime is committed.

In a first aspect, an embodiment of the present specification provides a method for identifying a risk website, where the method includes:

acquiring a target website to be identified;

crawling text information and image information corresponding to the target website;

obtaining a common representation between the text information and the image information through multi-modal representation learning;

and classifying the common representation through a machine learning algorithm to determine whether the target website is a risk website.

Optionally, the obtaining of the common representation between the text information and the image information through multi-modal representation learning includes:

performing vector conversion on the text information to obtain a text characteristic vector, and performing vector conversion on the image information to obtain an image characteristic vector;

performing dimensionality reduction on the text feature vector and the image feature vector through an auto-encoder to obtain a text representation of the text feature vector and an image representation of the image feature vector, wherein the feature dimensions of the text representation and the image representation are the same;

obtaining a canonical correlation coefficient between the text representation and the image representation;

learning the common representation through multi-modal representation learning based on the text representation, the image representation, and the canonical correlation coefficient.

Optionally, crawling text information and image information corresponding to the target website includes:

crawling a webpage text in a target webpage corresponding to the target website and a webpage screenshot of the target webpage;

crawling sub-links in the target webpage and sub-link texts and sub-link images corresponding to the sub-links;

and taking the webpage text and the sub-link text as text information corresponding to the target website, and taking the webpage screenshot and the sub-link image as image information corresponding to the target website.

Optionally, the acquiring a target website to be identified includes:

obtaining, from a risk prevention and control platform, a complaint website and/or an incoming website of the risk prevention and control platform as the target website; and

acquiring, from the Internet, a website meeting a preset risk rule as the target website.

Optionally, the obtaining, from the internet, a website that meets a preset risk rule as the target website includes:

performing website retrieval according to risk keywords to obtain the target website; and/or

monitoring risk complaint information on forum-type web pages, and extracting the target website from the monitored risk complaint information.

In a second aspect, an embodiment of the present specification provides an apparatus for identifying a risk website, where the apparatus includes:

the acquisition unit is used for acquiring a target website to be identified;

the crawling unit is used for crawling text information and image information corresponding to the target website;

the learning unit is used for obtaining a common representation between the text information and the image information through multi-modal representation learning;

and the classification unit is used for classifying the common representation through a machine learning algorithm and confirming whether the target website is a risk website.

Optionally, the learning unit is configured to:

Optionally, the crawling unit is configured to:

Optionally, the obtaining unit is configured to:

Optionally, the obtaining unit is further configured to:

In a third aspect, the present specification provides a computer readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the program implements the corresponding steps of the method according to the first aspect.

In a fourth aspect, an embodiment of the present specification provides an electronic device, including a memory, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors to execute operation instructions included in the one or more programs for performing the corresponding method according to the first aspect.

One or more technical solutions in the embodiments of the present specification have at least the following technical effects:

An embodiment of the specification provides a method for identifying a risk website, which comprises: acquiring a target website to be identified; crawling text information and image information corresponding to the target website; obtaining a common representation between the text information and the image information through multi-modal representation learning; and classifying the common representation through a machine learning algorithm to determine whether the target website is a risk website. The method thereby identifies websites used for online crime, so that risk websites on which online crime may occur can be actively prevented and controlled, and the occurrence of crime on such websites is reduced. In addition, because the common representation between the text information and the image information corresponding to the website is obtained through multi-modal representation learning, invalid information and interference information are eliminated; identifying risk websites based on the common representation improves the accuracy of risk website identification.

Drawings

In order to illustrate the technical solutions in the embodiments of the present specification more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below are some embodiments of the present specification, and a person skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a flowchart of a method for identifying a risk website according to an embodiment of the present disclosure;

Fig. 2 is a schematic diagram of an identification apparatus for a risk website according to an embodiment of the present disclosure;

Fig. 3 is a schematic view of an electronic device provided in an embodiment of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without any inventive step are within the scope of the present specification.

An embodiment of the specification provides a method for identifying a risk website, which identifies websites used for online crime and improves the accuracy of risk website identification, so that online crime can be actively prevented and controlled.

The main implementation principle, the specific implementation mode and the corresponding beneficial effects of the technical solutions of the embodiments of the present description are explained in detail below with reference to the accompanying drawings.

Examples

Referring to fig. 1, the present embodiment provides a method for identifying a risk website, including the following steps S10-S16:

S10: acquiring the target website to be identified.

The target websites to be identified come from an internal source and an external source. The internal source is the risk prevention and control platform: internal target websites are the complaint websites and incoming websites of the platform. An incoming website is the web address needed when an application connects to an upstream application, such as the web address of an online business. For example, if a payment platform serves as the risk prevention and control platform for risk websites, the upstream payment channels of the payment platform, that is, its incoming websites, can be obtained as target websites to be identified. The external source refers to external websites that are acquired from the Internet and meet preset risk rules.

S12: crawling the text information and the image information corresponding to the target website.

The text information and the image information in the web page corresponding to the target website can be crawled by a content crawler. The crawled text information may be the HTML content and the URL information of the target website. The crawled image information may be a screenshot of the home page of the target website.

S14: obtaining the common representation between the text information and the image information through multi-modal representation learning.

Multimodal representation learning is the representation learning (Representation) task within multimodal machine learning (MML). Multimodal machine learning is an artificial intelligence learning approach that makes combined use of information from multiple modalities, and comprises representation learning (Representation), translation (Translation), alignment (Alignment), multimodal fusion (Fusion), and co-learning (Co-learning).

S16: classifying the common representation through a machine learning algorithm to determine whether the target website is a risk website.

The machine learning algorithm may be a gradient boosting decision tree (GBDT), a random forest (RF), logistic regression (LR), a fully connected neural network (MLP), a support vector machine (SVM), and the like. Classifying on the common representation of the text information and the image information effectively excludes invalid information in the web page, which improves the accuracy and coverage of risk website identification and makes the method more robust to page tampering by black-market operators.

In a specific implementation, when obtaining target websites from the internal source, S10 may obtain complaint websites and/or incoming websites from within the risk prevention and control platform. Complaint websites include websites that users report to the risk prevention and control platform after suffering losses in investment and financing, gambling, or fraud. When obtaining target websites from the external source, S10 may use website retrieval and/or forum monitoring.

Website retrieval: websites are searched according to risk keywords to obtain target websites.

First, known risk keywords are obtained from known risk websites; then, risk keywords similar to the known risk keywords are obtained; finally, the known and similar risk keywords are searched to obtain target websites. Specifically, word vectors can be trained on the website keywords of known risk websites, using methods such as BERT or word2vec. Based on the trained word vectors, the word vector of each known risk keyword is used to find similar word vectors, and the keywords corresponding to those similar word vectors are taken as similar risk keywords. For example, if "stock" is a known risk keyword for illegal investment and financing websites, "financing" may be obtained as a similar risk keyword based on word-vector similarity, and website retrieval is then performed for both "stock" and "financing" to obtain target websites to be identified.
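The keyword-expansion step can be sketched as follows. This is only an illustration under stated assumptions: the toy three-dimensional vectors, the function names, and the 0.9 similarity cut-off are all hypothetical, and in practice the vectors would come from a trained model such as word2vec or BERT.

```python
import math

# Toy word vectors (made-up values for illustration only); a real system
# would load vectors trained on keywords of known risk websites.
WORD_VECTORS = {
    "stock": [0.9, 0.1, 0.0],
    "financing": [0.8, 0.2, 0.1],
    "weather": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar_keywords(known, vectors, min_similarity=0.9):
    """Return keywords whose vectors are close to any known risk keyword."""
    found = set()
    for seed in known:
        for word, vec in vectors.items():
            if word not in known and cosine(vectors[seed], vec) >= min_similarity:
                found.add(word)
    return sorted(found)


# Search terms = known risk keywords plus their similar keywords.
search_terms = ["stock"] + similar_keywords(["stock"], WORD_VECTORS)
```

Each entry of `search_terms` would then be submitted to a web search to collect candidate target websites.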

Forum monitoring: risk complaint information on forum-type web pages is monitored, and target websites are extracted from the monitored risk complaint information.

Forum-type web pages, such as Baidu Tieba, Tianya forum, and Zhihu, often contain a large amount of user complaint information, for example: "I was cheated badly; the XXXX website is a scam, never go there, it is all a trap." When monitoring risk complaint information, complaint keywords such as "trick", "scam", "money cheating", and "gambling" are monitored, and the complaint keywords together with their context in the web page are extracted to obtain the risk complaint information. Website addresses are then obtained from the monitored risk complaint information by entity extraction and used as target websites.
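A minimal sketch of this monitoring step is given below. It assumes that complaint keywords are matched literally and that website addresses are pulled from the surrounding context with a regular expression; the keyword list, the pattern, the 80-character context window, and the function name are illustrative choices, and a production system would more likely use a trained entity extractor.

```python
import re

# Hypothetical complaint keywords; a real list would be curated by analysts.
COMPLAINT_KEYWORDS = ("scam", "cheated", "gambling")

# Simple pattern for URLs and bare domain names (an assumption standing in
# for the entity-extraction step described in the text).
URL_PATTERN = re.compile(
    r"(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/\S*)?", re.IGNORECASE
)


def extract_target_sites(posts, window=80):
    """Collect website addresses that appear near a complaint keyword."""
    targets = []
    for post in posts:
        for keyword in COMPLAINT_KEYWORDS:
            for match in re.finditer(re.escape(keyword), post, re.IGNORECASE):
                # Keep the keyword plus `window` characters on each side.
                start = max(0, match.start() - window)
                context = post[start:match.end() + window]
                for url in URL_PATTERN.findall(context):
                    if url not in targets:
                        targets.append(url)
    return targets
```

Posts that contain no complaint keyword contribute nothing, so ordinary forum chatter that happens to mention a website is ignored.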

The coverage rate of the target websites is improved through the acquisition of the target websites of the internal source and the external source, and risk website identification is carried out on each target website, so that the risk websites are identified as much as possible, and the coverage rate of active risk prevention and control is improved.
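Combining the internal and external sources could look like the sketch below. Deduplicating by lowercase host name is an assumption made here for illustration, and `normalize` and `merge_targets` are hypothetical helper names; the text does not specify how duplicates are handled.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL or bare domain to its lowercase host name."""
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to find the host
    return urlsplit(url).hostname or ""


def merge_targets(*sources):
    """Merge several lists of websites, keeping first-seen order without duplicates."""
    seen = set()
    merged = []
    for source in sources:
        for url in source:
            host = normalize(url)
            if host and host not in seen:
                seen.add(host)
                merged.append(host)
    return merged
```

For example, `merge_targets(complaint_sites, incoming_sites, external_sites)` would yield one list of hosts to pass on to the crawling step.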

After the target website to be identified is acquired, S12 is executed to crawl web page information from the target website. To further expand website coverage, the crawl can be extended by several degrees from the target website, so that the text, images, and URLs of the sub-links on the target web page are also acquired. Specifically, the web page text of the target web page corresponding to the target website and a screenshot of the target web page are crawled; the sub-links in the target web page, together with the sub-link text and sub-link images corresponding to the sub-links, are crawled; the web page text and the sub-link text are taken as the text information corresponding to the target website, and the web page screenshot and the sub-link images are taken as the image information corresponding to the target website.
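The text and sub-link part of this crawl can be sketched with the Python standard library. The example works on HTML that has already been downloaded; fetching pages and capturing screenshots (which would typically need a headless browser) are left out, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collect visible text and sub-link URLs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.sub_links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.sub_links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract_page(base_url, html):
    """Return (page_text, sub_links) for an already-downloaded page."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.sub_links
```

Applying `extract_page` again to each returned sub-link gives the sub-link text, which is one way to realize the multi-degree expansion described above.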

After S12, S14 is executed to perform multi-modal representation learning. The common representation of the two modalities, text information and image information, is learned by the DCCAE (Deep CCA Autoencoder) method, which comprises the following steps:

Step 1: information preprocessing. Vector conversion is performed on the text information corresponding to the target website to obtain a text feature vector x, and on the image information to obtain an image feature vector y.

Step 2: vector dimensionality reduction. The text feature vector x and the image feature vector y often have different dimensions, so their correlation cannot be computed directly. An autoencoder is therefore used to reduce the dimensionality of both vectors, yielding a text representation of the text feature vector and an image representation of the image feature vector with the same feature dimension. The text feature vector x is input into an autoencoder, and the feature f(x) of the intermediate encoder layer learned by the autoencoder is the text representation of the text feature vector. Similarly, the image feature vector y is input into an autoencoder, and the feature g(y) of the intermediate encoder layer learned by the autoencoder is the image representation of the image feature vector.

Step 3: obtaining the canonical correlation coefficient between the text representation and the image representation. Canonical correlation analysis (CCA) is performed on the text representation and the image representation, which have been reduced to the same dimension, to obtain the canonical correlation coefficient between f(x) and g(y), that is, the CCA coefficient.

Step 4: multi-modal representation learning. Based on the text representation and the image representation obtained in step 2 and the CCA coefficient obtained in step 3, the common representation between the text information and the image information is learned through multi-modal representation learning. Because the target web page corresponding to the target website may contain interference information, the text representation f(x) and the image representation g(y) derived from it may also contain interference information; if f(x) and g(y) were used directly for model learning, the resulting common representation would have low accuracy. In this embodiment, the CCA coefficient between f(x) and g(y) is therefore added to the loss function of the multi-modal representation learning, so that the model eliminates interference information and single-modality noise. The resulting common representation between f(x) and g(y) comprises a text feature f and an image feature g, and its accuracy is improved.
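The CCA coefficient of step 3 and the loss of step 4 are not written out in the text. As a sketch, they could be stated as follows. The projection vectors u and v, the covariance matrices Σ, the decoders p and q, the sample count N, and the weight λ are notation assumed here, following the commonly published DCCAE formulation; they are not symbols from this specification, and the exact loss used in the embodiment may differ.

```latex
% Canonical correlation between the text representation f(x)
% and the image representation g(y)
\rho = \max_{u,\,v}\;
  \frac{u^{\top} \Sigma_{fg}\, v}
       {\sqrt{u^{\top} \Sigma_{ff}\, u}\;\sqrt{v^{\top} \Sigma_{gg}\, v}}

% A DCCAE-style loss: reward correlation between the two views and
% penalize the reconstruction error of both autoencoders
\mathcal{L} = -\rho
  + \frac{\lambda}{N} \sum_{i=1}^{N}
    \left( \lVert x_i - p(f(x_i)) \rVert^{2}
         + \lVert y_i - q(g(y_i)) \rVert^{2} \right)
```

Minimizing the first term pushes f(x) and g(y) toward the information shared by both modalities, while the reconstruction terms keep each representation faithful to its own input.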

After the common representation of the text information and the image information is obtained, S16 is executed to classify it. If the probability that the common representation belongs to the risk type is greater than a set threshold, the target website corresponding to the common representation is determined to be a risk website; if the probability is not greater than the set threshold, the target website is determined not to be a risk website. The common representation obtained by multi-modal representation learning comprises the text feature f and the image feature g of the target website. Because f and g are constrained by the CCA coefficient, they are similar, so either f or g can be used as the input feature and classified by a classifier such as GBDT, RF, LR, MLP, or SVM. Since f and g have already been filtered by multi-modal representation learning, the scheme reduces dependence on information injected by black-market operators and improves the accuracy of risk website identification.
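The threshold decision in S16 can be illustrated with a toy scorer. The logistic model below merely stands in for whichever trained classifier is used; the weights, bias, default threshold of 0.5, and function names are assumptions made for this sketch.

```python
import math


def risk_probability(representation, weights, bias):
    """Toy logistic scorer standing in for a trained classifier."""
    score = sum(w * v for w, v in zip(weights, representation)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def is_risk_website(representation, weights, bias, threshold=0.5):
    """Risk website only if the probability is strictly greater than the threshold."""
    return risk_probability(representation, weights, bias) > threshold
```

Here `representation` would be either the text feature f or the image feature g of the target website. A probability exactly equal to the threshold is treated as not a risk website, matching the "not greater than" wording above.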

After a risk website is identified, this embodiment can further perform active inspection on it, so as to realize risk prevention and control based on the risk website. Active inspection is an important part of active risk control: because the codes behind a risk website are switched continuously, the accounts behind the website are identified either by modeling or by micro-task workers. In the micro-task approach, the identified risk website is issued to micro-task workers as a task; the workers obtain the accounts behind the website by registering and recharging and return them to the risk prevention and control platform, whose back end then penalizes the accounts. The identified risk websites are inspected periodically through such micro-task assignments to prevent and control illegal transactions.

Based on the same inventive concept, and corresponding to the method for identifying a risk website provided in the foregoing embodiment, an embodiment of the present specification further provides an apparatus for identifying a risk website. Referring to fig. 2, the apparatus includes:

an obtaining unit 21, configured to obtain a target website to be identified;

the crawling unit 22 is used for crawling the text information and the image information corresponding to the target website;

a learning unit 23, configured to obtain a common representation between the text information and the image information through multi-modal representation learning;

and the classifying unit 24 is configured to classify the common representations through a machine learning algorithm, and determine whether the target website is a risk website.

As an optional implementation manner, when performing multi-modal representation learning, the learning unit 23 may perform vector conversion on the text information to obtain a text feature vector, and perform vector conversion on the image information to obtain an image feature vector; perform dimensionality reduction on the text feature vector and the image feature vector through an autoencoder to obtain a text representation of the text feature vector and an image representation of the image feature vector, wherein the feature dimensions of the text representation and the image representation are the same; obtain a canonical correlation coefficient between the text representation and the image representation; and learn the common representation through multi-modal representation learning based on the text representation, the image representation, and the canonical correlation coefficient.

As an alternative embodiment, the crawling unit 22 may be configured to: crawling a webpage text in a target webpage corresponding to the target website and a webpage screenshot of the target webpage; crawling sub-links in the target webpage and sub-link texts and sub-link images corresponding to the sub-links; and taking the webpage text and the sub-link text as text information corresponding to the target website, and taking the webpage screenshot and the sub-link image as image information corresponding to the target website.

As an optional embodiment, when acquiring the target website, the obtaining unit 21 may obtain, from the risk prevention and control platform, a complaint website and/or an incoming website of the risk prevention and control platform as the target website; and acquire a website meeting a preset risk rule from the Internet as the target website.

As an optional implementation manner, the obtaining unit 21 is further configured to: perform website retrieval according to risk keywords to obtain the target website; and/or monitor risk complaint information on forum-type web pages, and extract the target website from the monitored risk complaint information.

With regard to the apparatus in the above-described embodiments, the specific manner in which the respective units perform operations has been described in detail in the embodiments related to the method and will not be elaborated upon here.

Referring to fig. 3, a block diagram of an electronic device 700 for implementing a method for identifying a risky website is shown according to an exemplary embodiment. For example, the electronic device 700 may be a computer, database console, tablet device, personal digital assistant, and the like.

Referring to fig. 3, electronic device 700 may include one or more of the following components: a processing component 702, a memory 704, a power component 706, a multimedia component 708, an input/output (I/O) interface 710, and a communication component 712.

The processing component 702 generally controls overall operation of the electronic device 700, such as operations associated with display, data communication, and recording operations. The processing component 702 may include one or more processors 720 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 702 may include one or more modules that facilitate interaction between the processing component 702 and other components.

The memory 704 is configured to store various types of data to support operation of the electronic device 700. Examples of such data include instructions for any application or method operating on the electronic device 700, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 704 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or magnetic or optical disks.

The power supply component 706 provides power to the various components of the electronic device 700. The power components 706 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 700.

The I/O interface 710 provides an interface between the processing component 702 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The communication component 712 is configured to facilitate wired or wireless communication between the electronic device 700 and other devices. The electronic device 700 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 712 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 712 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 700 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 704 comprising instructions, executable by the processor 720 of the electronic device 700 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

A non-transitory computer readable storage medium has instructions therein which, when executed by a processor of an electronic device, enable the electronic device to perform a method for identifying a risk website, the method comprising: acquiring a target website to be identified; crawling text information and image information corresponding to the target website; obtaining a common representation between the text information and the image information through multi-modal representation learning; and classifying the common representation through a machine learning algorithm to determine whether the target website is a risk website.

It will be understood that the present embodiments are not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present embodiments is limited only by the appended claims. The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present embodiment.

Claims

1. A method for identifying a risky website, the method comprising:

acquiring a target website to be identified;

2. The method of claim 1, wherein obtaining the common representation between the text information and the image information through multi-modal representation learning comprises:

3. The method of claim 1, crawling text information and image information corresponding to the target web site, comprising:

4. The method of claim 1, wherein the obtaining of the target website to be identified comprises:

obtaining a complaint website from a risk prevention and control platform, obtaining an incoming website of the risk prevention and control platform, and obtaining an external website meeting a preset risk rule from the Internet;

and acquiring the target website to be identified based on the complaint website, the incoming website, and the external website.

5. The method of claim 4, wherein obtaining, from the Internet, the external website meeting the preset risk rule comprises:

6. An apparatus for identifying a risky website, the apparatus comprising:

the acquisition unit is used for acquiring a target website to be identified;

7. The apparatus of claim 6, the learning unit to:

8. The apparatus of claim 6, the crawling unit to:

9. The apparatus of claim 6, the obtaining unit to:

10. The apparatus of claim 9, the obtaining unit further to:

11. A computer readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out steps corresponding to the method according to any one of claims 1 to 5.

12. An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors to execute operating instructions included in the one or more programs for performing the corresponding method according to any one of claims 1 to 5.