CN110275958B

CN110275958B - Website information identification method and device and electronic equipment

Info

Publication number: CN110275958B
Application number: CN201910565890.3A
Authority: CN
Inventors: 白冰; 栗阳力; 李国华
Original assignee: Beijing Bohui Technology Inc
Current assignee: Beijing Bohui Technology Inc
Priority date: 2019-06-26
Filing date: 2019-06-26
Publication date: 2021-07-27
Anticipated expiration: 2039-06-26
Also published as: CN110275958A

Abstract

The invention provides a website information identification method, a website information identification device and electronic equipment. The method comprises: acquiring the content of a target website according to the address of the target website, the content including text content, picture files and display effect screenshots; performing exact matching and/or natural language analysis processing on the text content according to a preset sensitive violation word bank to determine a text recognition result of the target website; and performing deep-learning-based image classification recognition on the picture files and the display effect screenshots respectively, according to preset sample pictures with different types of labels, to determine a picture recognition result of the target website. After the content of the target website is obtained, exact matching and/or natural language analysis is applied to the text content to obtain the text recognition result, and deep-learning-based image classification is applied to the picture files and the display effect screenshots to obtain the picture recognition result. In this way, whether the website has bad content can be judged effectively and the misjudgment rate is reduced.

Description

Website information identification method and device and electronic equipment

Technical Field

The invention relates to the technical field of website monitoring, in particular to a website information identification method and device and electronic equipment.

Background

In recent years, with the development of the internet, the amount of bad information on the network has gradually increased, and how to automatically and effectively judge and screen such information is a problem that remains to be solved. Existing solutions acquire content data with a crawler and match it against sensitive words after word segmentation, or crawl pictures and analyze them by picture recognition.

However, some websites return false data as an anti-crawling measure. As a result, existing methods for identifying bad information on the internet cannot correctly and effectively judge whether a website has bad content, and their misjudgment rate increases.

Disclosure of Invention

In view of the above, the present invention provides a website information identification method, apparatus and electronic device to effectively determine whether the website has bad content, reduce the misjudgment rate and increase the accuracy of information identification.

In a first aspect, an embodiment of the present invention provides a website information identification method, including: acquiring the content of the target website according to the address of the target website, the content including text content, picture files and display effect screenshots; performing exact matching and/or natural language analysis processing on the text content according to a preset sensitive violation word bank to determine a text recognition result of the target website; and performing deep-learning-based image classification recognition on the picture files and the display effect screenshots respectively, according to preset sample pictures with different types of labels, to determine a picture recognition result of the target website.

In a preferred embodiment of the present invention, the step of obtaining the content of the target website according to the address of the target website includes: acquiring the address of a target website; acquiring text content of a target website in a common request mode according to the address; and acquiring the picture file and the display effect screenshot of the target website through the headless browser according to the address.

In a preferred embodiment of the present invention, the step of determining the text recognition result of the target website by performing exact matching and/or natural language analysis processing on the text content according to a preset sensitive violation word bank includes: segmenting the text content into words; determining, according to a preset system configuration file, whether the text content is to be analyzed by exact text matching and/or by NLP (Natural Language Processing) learning model matching; if the text content is to be analyzed by exact text matching, matching the segmented text content against the sensitive violation word bank to determine the text recognition result of the target website; if the text content is to be analyzed by NLP learning model matching, inputting the segmented text content into a pre-trained NLP learning model and outputting the text recognition result of the target website; the NLP learning model is trained according to the sensitive violation word bank.

In a preferred embodiment of the present invention, the step of determining the picture recognition result of the target website by performing deep-learning-based image classification recognition on the picture files and the display effect screenshots respectively, according to preset sample pictures with different types of labels, includes: inputting the picture files and the display effect screenshots respectively into a pre-trained image auditing learning model, and outputting the picture recognition result of the target website; the image auditing learning model is trained according to the sample pictures.

In a preferred embodiment of the present invention, after the step of obtaining the content of the target website according to the address of the target website, the method further comprises: performing data cleansing on the content through a Kafka cluster.

In a preferred embodiment of the present invention, the method further includes: storing the picture file and the display effect screenshot into a preset storage area; and/or storing the text recognition result and the picture recognition result in a preset storage area.

In a preferred embodiment of the present invention, the method further includes: sending the text recognition result and the picture recognition result to a designated terminal.

In a second aspect, an embodiment of the present invention further provides a website information identification apparatus, including: a content acquisition module, used for acquiring the content of the target website according to the address of the target website, the content including text content, picture files and display effect screenshots; a text recognition module, used for performing exact matching and/or natural language analysis processing on the text content according to a preset sensitive violation word bank to determine a text recognition result of the target website; and a picture recognition module, used for performing deep-learning-based image classification recognition on the picture files and the display effect screenshots respectively, according to preset sample pictures with different types of labels, to determine a picture recognition result of the target website.

In a third aspect, an embodiment of the present invention provides an electronic device, which includes a processor and a memory, where the memory stores machine executable instructions that can be executed by the processor, and the processor executes the machine executable instructions to implement the steps of the website information identification method.

In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing machine-executable instructions, which when invoked and executed by a processor, cause the processor to implement the steps of the above website information identification method.

The embodiment of the invention has the following beneficial effects:

According to the website information identification method, device and electronic equipment provided by the embodiments of the invention, after the text content, the picture files and the display effect screenshots of the target website are obtained, exact matching and/or natural language analysis processing is performed on the text content according to the sensitive violation word bank to obtain a text recognition result, and deep-learning-based image classification recognition is performed on the picture files and the display effect screenshots according to the sample pictures to obtain a picture recognition result. Whether the website has bad content can thus be judged effectively, the misjudgment rate is reduced, and the accuracy of information identification is improved.

Additional features and advantages of the disclosure will be set forth in the description which follows, or in part may be learned by the practice of the above-described techniques of the disclosure, or may be learned by practice of the disclosure.

In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a flowchart of a website information identification method according to an embodiment of the present invention;

Fig. 2 is a flowchart of another website information identification method according to an embodiment of the present invention;

Fig. 3 is a flowchart of another website information identification method according to an embodiment of the present invention;

Fig. 4 is a flowchart of another website information identification method according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a website information identification system according to an embodiment of the present invention;

Fig. 6 is a schematic block diagram of a website information identification system according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a website information identification apparatus according to an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In view of the problem that the existing internet bad information identification method cannot correctly and effectively judge whether the website has bad content or not and has high misjudgment rate, embodiments of the present invention provide a website information identification method, an apparatus and an electronic device.

To facilitate understanding of the embodiment, first, a website information identification method disclosed in the embodiment of the present invention is described in detail, and as shown in fig. 1, the method includes the following steps:

Step S102, acquiring the content of the target website according to the address of the target website; the content includes: text content, picture files and display effect screenshots.

The target website is the website to be detected. The text content is the character content of the website. The picture files and the display effect screenshots are both pictures: a picture file is a picture resource file of the website, while a display effect screenshot is a screenshot of the website as actually displayed when opened on a user side. The content of the target website is crawled from the target website to be detected by a content capturer.

Step S104, performing exact matching and/or natural language analysis processing on the text content according to a preset sensitive violation word bank, and determining a text recognition result of the target website.

The sensitive violation word bank is preset by an administrator, and all words in it represent bad information. If the text content contains words from the sensitive violation word bank, the target website is likely to contain bad information. The text recognition result may include the words in the text content that match the sensitive violation word bank, their number and the nearby paragraphs; the text content may also be scored and sorted according to the scores, or labeled with different labels.

Step S106, performing deep-learning-based image classification recognition on the picture files and the display effect screenshots respectively, according to preset sample pictures with different types of labels, and determining a picture recognition result of the target website.

Sample pictures with different types of labels are preset by an administrator; the number of sample pictures is not fixed, and sample pictures can be deleted at any time. If the picture files and display effect screenshots are found to have a certain similarity to the sample pictures, the target website may contain bad information. The picture recognition result may include the number and proportion of picture files and display effect screenshots that are identical or similar to the sample pictures; the picture files and display effect screenshots may also be scored and sorted according to the scores, or labeled with different labels.
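A picture recognition result of this kind can be sketched as follows. This is only an illustration: the function name `summarize_picture_scores`, the `(label, score)` input shape and the 0.8 threshold are assumptions and do not come from the source.

```python
def summarize_picture_scores(scores, threshold=0.8):
    """Summarize per-picture results.

    scores: dict mapping picture name -> (label, similarity score in [0, 1]).
    The threshold is an assumed cut-off for "similar to a sample picture".
    """
    flagged = {name: (label, s) for name, (label, s) in scores.items() if s >= threshold}
    total = len(scores)
    return {
        "flagged_count": len(flagged),
        "flagged_ratio": len(flagged) / total if total else 0.0,
        # Flagged pictures sorted by score, highest first.
        "ranked": sorted(flagged, key=lambda n: flagged[n][1], reverse=True),
        "labels": {name: label for name, (label, _) in flagged.items()},
    }
```

The summary keeps the count, the proportion, a score-ordered ranking and the labels, which are the four kinds of information the paragraph above mentions.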

According to the website information identification method provided by the embodiment of the invention, after the text content, the picture file and the display effect screenshot of the target website are obtained, the text content is accurately matched and/or natural language analysis processing is carried out according to the sensitive violation word bank to obtain a text identification result; and carrying out image classification and identification based on deep learning on the obtained picture file and the display effect screenshot according to the sample picture to obtain a picture identification result. Whether the website has bad content or not can be effectively judged, the misjudgment rate is reduced, and the accuracy of information identification is improved.
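The overall flow of steps S102 to S106 might be organized as in the sketch below. The `SiteContent` container and the `identify` function are hypothetical names introduced for illustration, and the two recognizers are passed in as plain callables because the source does not fix their implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SiteContent:
    """Hypothetical container for the content acquired in step S102."""
    text: str
    picture_files: list = field(default_factory=list)  # raw image bytes
    screenshots: list = field(default_factory=list)    # raw image bytes


def identify(content, recognize_text, recognize_picture):
    """Apply S104 to the text and S106 to every picture file and screenshot."""
    text_result = recognize_text(content.text)
    picture_result = [
        recognize_picture(image)
        for image in content.picture_files + content.screenshots
    ]
    return text_result, picture_result
```

Picture files and screenshots go through the same recognizer here, which mirrors the statement that both are classified against the same labeled sample pictures.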

The embodiment of the invention also provides another website information identification method, which is implemented on the basis of the method of the above embodiment; this method mainly describes a specific implementation of acquiring the content of the target website.

As shown in fig. 2, the method comprises the steps of:

Step S202, acquiring the address of the target website.

Before the content of the target website is obtained, the address of the target website to be detected and the website level to be crawled need to be obtained. Both are obtained through a monitoring application interface. Generally, the website hierarchy is divided into a first layer, a second layer and a third layer, and only the first three layers of a website are crawled.
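Limiting a crawl to the first three layers can be sketched with a breadth-first walk. The function `crawl_levels` and the `get_links` callback are hypothetical; in a real system `get_links` would fetch a page and extract its links, while here it is left abstract so the sketch stays self-contained.

```python
from collections import deque


def crawl_levels(start, get_links, max_depth=3):
    """Breadth-first walk returning {url: level}, with the start page at level 1."""
    levels = {start: 1}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if levels[url] >= max_depth:
            continue  # pages on the deepest allowed level are not expanded
        for link in get_links(url):
            if link not in levels:
                levels[link] = levels[url] + 1
                queue.append(link)
    return levels
```

Breadth-first order assigns each page the shallowest level at which it is reachable, so a page linked from both the home page and a deep page is counted as level 2.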

Step S204, acquiring the text content of the target website in a common request mode according to the address.

An internal acquisition adaptation method determines the acquisition mode for the website address and selects the corresponding acquisition method. Data in the configuration file is read to judge whether the content can be accessed and obtained directly through a common request; if it can, the content data is obtained directly in the common request mode; if it cannot, the request is made through a headless browser. In the common request mode, a script sends an HTTP (HyperText Transfer Protocol) request to acquire the content data. In the headless browser mode, an HTTP request is sent and the page is loaded and rendered automatically, so this mode may involve more than a single HTTP request.

Specifically, a data-type request address such as http://xxxx.aa/text returns data in JSON format, for example {'desc': 'this is a magic website'}, and a text-type request address looks like http://xxx/aa/demo.txt. When, according to the configuration file, the request is judged to return data or to be of the text type, the text content of the target website is acquired in the common request mode.
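The choice between the two acquisition modes could look roughly like the sketch below. The layout of `CONFIG`, its keys and the function `choose_acquisition_mode` are assumptions, since the source does not describe the format of the configuration file.

```python
from urllib.parse import urlparse

# Hypothetical configuration; the source does not give the file's format.
CONFIG = {
    "plain_request_suffixes": (".txt", ".json"),  # text-type addresses
    "plain_request_paths": ("/text",),            # data-type (API) addresses
}


def choose_acquisition_mode(url, config=CONFIG):
    """Return "plain" for data-type or text-type addresses, else "headless"."""
    path = urlparse(url).path
    if path.endswith(config["plain_request_suffixes"]):
        return "plain"
    if path in config["plain_request_paths"]:
        return "plain"
    return "headless"
```

Anything that is not explicitly listed falls back to the headless browser, which matches the rule that ordinary websites are rendered before their content is captured.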

Step S206, acquiring the picture files and the display effect screenshot of the target website through the headless browser according to the address.

When the crawling target is an ordinary website, it needs to be crawled through a headless browser: the content capturer uses the headless browser to acquire the picture files in the web page and a display effect screenshot taken after lazily loaded pictures have been loaded.

Step S208, performing exact matching and/or natural language analysis processing on the text content according to a preset sensitive violation word bank, and determining a text recognition result of the target website.

Step S210, performing deep-learning-based image classification recognition on the picture files and the display effect screenshots respectively, according to preset sample pictures with different types of labels, and determining a picture recognition result of the target website.

In the above manner, the content of the target website is acquired by adopting a common request manner or a headless browser according to different website types, so that the acquisition efficiency of the content of the target website can be increased.

The embodiment of the invention also provides another website information identification method, which is implemented on the basis of the method of the above embodiment; this method mainly describes a specific implementation of the matching method for the text content.

As shown in fig. 3, the method comprises the steps of:

Step S302, acquiring the content of the target website according to the address of the target website; the content includes: text content, picture files and display effect screenshots.

Step S304, performing word segmentation on the text content.

The text content is generally a continuous sentence, and in order to ensure the matching accuracy, word segmentation processing is firstly required on the text content. Text content is divided into spaced words. The word segmentation process is typically implemented by a word segmenter.
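The source does not say which word segmenter is used. As one simple possibility, the sketch below uses forward maximum matching against a vocabulary; the function name `segment`, the `vocabulary` argument and the `max_len` limit are all assumptions made for illustration.

```python
def segment(text, vocabulary, max_len=4):
    """Forward maximum matching: take the longest vocabulary word at each position.

    Characters that start no vocabulary word are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens
```

For example, with the vocabulary {"网站", "信息", "识别"}, the continuous string "网站信息识别" is split into the three words it is composed of.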

Step S306, determining, according to a preset system configuration file, whether the text content is to be analyzed by exact text matching and/or by NLP learning model matching. If the text content is to be analyzed by exact text matching, step S308 is executed; if the text content is to be analyzed by NLP learning model matching, step S310 is executed.

The system configuration file can indicate which detection mode is adopted for the text content, and generally, two modes of text accurate matching and NLP learning model are available.

Step S308, matching the segmented text content against the sensitive violation word bank to determine a text recognition result of the target website.

Exact text matching means that each word of the segmented text content is compared against the words in the sensitive violation word bank to check whether the segmented text content includes any of them.
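A minimal sketch of this exact matching step, assuming the word bank is held as a Python set and the segmented text as a list of tokens (the function name `exact_match` is hypothetical):

```python
from collections import Counter


def exact_match(tokens, word_bank):
    """Return {word: occurrences} for the tokens that appear in the word bank."""
    counts = Counter(token for token in tokens if token in word_bank)
    return dict(counts)
```

Because a set lookup is constant time on average, the cost grows with the length of the text rather than with the size of the word bank.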

Step S310, inputting the text content after word segmentation into an NLP learning model which is learned in advance, and outputting a text recognition result of a target website; the NLP learning model is obtained by learning according to the sensitive violation word bank.

The NLP learning model can automatically learn the sensitive violation word bank in advance, analyze which types of bad information the text content contains, and give a matching score; the NLP learning model judges the violation type matched by the text content according to the specific situation. It should be noted that exact text matching and the NLP learning model can be used at the same time to increase the accuracy of bad information identification.

Step S312, performing deep-learning-based image classification recognition on the picture files and the display effect screenshots respectively, according to preset sample pictures with different types of labels, and determining a picture recognition result of the target website.
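The configuration-driven choice between the two modes, including running both together, might be sketched as follows. The configuration keys, the result fields and the rule for merging scores are assumptions, and `toy_model` is only a stand-in for a trained NLP model, not a real one.

```python
def recognize_text(tokens, word_bank, config, nlp_model=None):
    """Run the detection modes enabled in config and merge their outputs."""
    result = {"matched_words": [], "violation_type": None, "score": 0.0}
    if config.get("exact_match", False):
        result["matched_words"] = sorted({t for t in tokens if t in word_bank})
        if result["matched_words"]:
            result["score"] = 1.0  # assumed rule: any exact hit gives full score
    if config.get("nlp_model", False) and nlp_model is not None:
        violation_type, score = nlp_model(tokens)
        result["violation_type"] = violation_type
        result["score"] = max(result["score"], score)
    return result


def toy_model(tokens):
    # Stand-in for a trained NLP model; it only checks for one token.
    return ("gambling", 0.7) if "bet" in tokens else (None, 0.0)
```

Taking the maximum of the two scores is one simple way to combine the modes; a real deployment could weight them differently.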

In the above manner, the text content is analyzed by adopting text accurate matching and/or NLP learning model matching, so that the recognition efficiency and the recognition accuracy of the text content can be increased.

The embodiment of the invention also provides another website information identification method, which is implemented on the basis of the method of the above embodiment; this method mainly describes a specific implementation of the matching method for the picture files and the display effect screenshots.

As shown in fig. 4, the method includes the steps of:

Step S402, acquiring the content of the target website according to the address of the target website; the content includes: text content, picture files and display effect screenshots.

Step S404, performing exact matching and/or natural language analysis processing on the text content according to a preset sensitive violation word bank, and determining a text recognition result of the target website.

Step S406, inputting the picture file and the display effect screenshot into a pre-learned image auditing learning model respectively, and outputting a picture identification result of the target website; the image auditing learning model is obtained by learning according to the sample picture.

After the picture files and the display effect screenshots are obtained, the image auditing learning model evaluates them against the different classification types and gives scores. The image auditing learning model is trained in advance on sample pictures with different types of labels. It exposes an external interface that accepts the pictures to be analyzed when called, and passes them to the prediction model, which computes and returns the scores.
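The external interface around the prediction model could be shaped roughly like this. `ImageAuditService` and `dummy_predict` are hypothetical names; `dummy_predict` merely imitates a classifier so that the sketch runs without a trained deep learning model.

```python
class ImageAuditService:
    """Hypothetical wrapper exposing a prediction model through one method."""

    def __init__(self, predict):
        # predict: callable taking image bytes and returning {label: score}
        self._predict = predict

    def audit(self, image_bytes):
        """Pass the picture to the prediction model and report the top label."""
        scores = self._predict(image_bytes)
        best_label = max(scores, key=scores.get)
        return {"label": best_label, "score": scores[best_label], "scores": scores}


def dummy_predict(image_bytes):
    # Stand-in for a trained classifier; it is not a real model.
    if image_bytes.startswith(b"BAD"):
        return {"violation": 0.9, "normal": 0.1}
    return {"violation": 0.2, "normal": 0.8}
```

Keeping the model behind a callable means the same service can audit both picture files and display effect screenshots, and the model can be retrained or replaced without changing callers.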

After the content of the target website is acquired, the acquired text content, picture files and display effect screenshots may have some problems, such as repeated acquisition of text content and picture files, overlapping words in the text content, and disordered text content. Data cleansing is therefore needed: after the step of acquiring the content of the target website according to the address of the target website, the method further includes performing data cleansing on the content through a Kafka cluster. Data cleansing refers to a procedure for finding and correcting recognizable errors in data files, including checking data consistency and processing invalid and missing values. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action stream data of consumers in a website. Data cleansing yields more accurate content of the target website, reduces the workload of subsequently obtaining the text recognition result and the picture recognition result, saves time and increases the accuracy of bad information recognition.
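The cleansing rules themselves can be illustrated without a Kafka cluster. The sketch below works on an in-memory list; the record fields `url`, `type` and `payload` and the function `clean_records` are assumptions, and only two of the problems named above (repeated acquisition and invalid or missing values) are handled.

```python
def clean_records(records):
    """Drop records with missing fields and exact duplicates, keeping the first seen."""
    seen = set()
    cleaned = []
    for record in records:
        url = record.get("url")
        kind = record.get("type")
        payload = record.get("payload")
        if not url or kind not in ("text", "image") or not payload:
            continue  # invalid or missing value
        key = (url, kind, payload)
        if key in seen:
            continue  # repeated acquisition of the same content
        seen.add(key)
        cleaned.append(record)
    return cleaned
```

In the described system this logic would run on messages flowing through the Kafka cluster rather than on a list.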

The content and the recognition results of the target website also need to be stored to facilitate subsequent audit and inspection, so the method further includes: storing the picture files and the display effect screenshots in a preset storage area; and/or storing the text recognition result and the picture recognition result in a preset storage area. The preset storage area is the location where the data is saved, and may be a disk array. A disk array is a large-capacity disk group composed of a plurality of independent disks; the combined effect of the individual disks supplying data improves the performance of the whole disk system.

After the recognition results are obtained, they need to be sent to a designated terminal, where a worker using the terminal views and analyzes them, so the method further includes: sending the text recognition result and the picture recognition result to the designated terminal. The designated terminal may be a computer, a mobile phone, a tablet computer or another device that can connect to a network and has a display function. Through the terminal, the recognition results can be obtained, analyzed and counted.

In the website information identification system shown in fig. 5, the acquisition probe obtains the target website and the target website hierarchy through the monitoring application interface, and downloads the content of the target website in the common request mode or the headless browser mode; the content of the target website includes text content, picture files and display effect screenshots. The content of the target website is sent to the message transfer cleaning module. The cleaned text content is sent to the content analysis module to obtain the text recognition result, and the cleaned picture files and display effect screenshots are sent to the image analysis module to obtain the picture recognition result. The disk array stores the downloaded website content, the text recognition result and the picture recognition result. The business analysis module reads the text recognition results and picture recognition results from the disk array, counts and analyzes the number of websites containing bad information in the website data crawled this time, and stores the statistics on the disk array. The monitoring application module is used, under the control of a platform or a third-party application, to issue the websites to be acquired, the website levels to be acquired and the data analysis strategy. The data extraction interface is called by the platform or the third-party application and provides the analyzed data results to it for display.
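The counting done by the business analysis module might look like the sketch below. The input shape (one dict per analyzed page with `site`, `text_hit` and `picture_hit` keys) and the function `count_flagged_sites` are assumptions made for illustration.

```python
def count_flagged_sites(results):
    """Count websites for which any page had a text or picture hit.

    results: iterable of dicts with "site", "text_hit" and "picture_hit" keys.
    """
    sites = {}
    for r in results:
        flagged = bool(r["text_hit"] or r["picture_hit"])
        sites[r["site"]] = sites.get(r["site"], False) or flagged
    total = len(sites)
    bad = sum(sites.values())
    return {
        "total_sites": total,
        "flagged_sites": bad,
        "flagged_ratio": bad / total if total else 0.0,
    }
```

A site counts as containing bad information as soon as one of its pages is flagged by either recognizer.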

The data flow of the website information identification system is shown in fig. 6. The acquisition probe obtains the website data, stores the picture files on the disk array through the acquisition probe's transfer storage module, and publishes the data to the data transfer cleaning module (Kafka cluster). The real-time analysis module (content analysis and image analysis) subscribes to the data transfer cleaning module to obtain the data to be processed, calls the image auditing learning model interface or the text content matching interface according to the data type, and stores the analyzed content data on the disk array.

In the above manner, the image auditing learning model is used to recognize the picture files and the display effect screenshots, so that the recognition efficiency and accuracy for the picture files and the display effect screenshots can be improved.
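The publish-subscribe flow can be imitated in memory to show the routing by data type. `InMemoryBroker` is a toy stand-in for the Kafka cluster, the `stored` list stands in for the disk array, and the topic name and message fields are assumptions; both analysis interfaces are stubs.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy publish-subscribe broker standing in for the Kafka cluster."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self._subscribers[topic]:
            handler(message)


stored = []  # stands in for the disk array


def analyze(message):
    # Route by data type to the matching analysis interface (both are stubs).
    if message["type"] == "image":
        outcome = "image-audit:" + message["name"]
    else:
        outcome = "text-match:" + message["name"]
    stored.append(outcome)


broker = InMemoryBroker()
broker.subscribe("site-content", analyze)
broker.publish("site-content", {"type": "text", "name": "index.html"})
broker.publish("site-content", {"type": "image", "name": "banner.png"})
```

The acquisition side only publishes and the analysis side only subscribes, so the two can be scaled or replaced independently.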

It should be noted that the above method embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.

Corresponding to the above method embodiment, an embodiment of the present invention provides a website information identification apparatus, as shown in fig. 7, the apparatus includes:

a content obtaining module 71, configured to obtain content of the target website according to the address of the target website; the content includes: text content, picture files and display effect screenshots;

the text recognition module 72 is used for performing precise matching and/or natural language analysis processing on the text content according to a preset sensitive violation word bank to determine a text recognition result of the target website;

and the picture identification module 73 is used for respectively carrying out image classification identification based on deep learning on the picture file and the display effect screenshot according to preset sample pictures with different types of labels to determine a picture identification result of the target website.

According to the website information identification device provided by the embodiment of the invention, after the text content, the picture file and the display effect screenshot of the target website are obtained, the text content is accurately matched and/or natural language analysis is carried out according to the sensitive violation word bank so as to obtain a text identification result; and carrying out image classification and identification based on deep learning on the obtained picture file and the display effect screenshot according to the sample picture to obtain a picture identification result. Whether the website has bad content or not can be effectively judged, the misjudgment rate is reduced, and the accuracy of information identification is improved.

In some embodiments, the content acquisition module is to: acquiring the address of a target website; acquiring text content of a target website in a common request mode according to the address; and acquiring the picture file and the display effect screenshot of the target website through the headless browser according to the address.

In some embodiments, a text recognition module to: segmenting the text content; judging whether text content is accurately matched and/or matched and analyzed by an NLP learning model according to a preset system configuration file; if the text content is analyzed by adopting text accurate matching, matching the text content after word segmentation with a sensitive illegal word bank to determine a text recognition result of the target website; if the text content is matched and analyzed by adopting the NLP learning model, inputting the text content after word segmentation into the NLP learning model which is learned in advance, and outputting a text recognition result of the target website; the NLP learning model is obtained by learning according to the sensitive violation word bank.

In some embodiments, the picture identification module is configured to: input the picture file and the display effect screenshot separately into a pre-trained image auditing learning model, and output the picture identification result of the target website; the image auditing learning model is obtained by learning from the sample pictures.
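The separate classification of picture files and the screenshot might be organized as in the sketch below. The trained model itself is out of scope here and is represented by a callable that maps image bytes to per-label scores; the `Classifier` type, the `threshold` parameter and the `"uncertain"` fallback label are assumptions of this example and do not appear in the source.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface for the pre-trained image auditing model:
# image bytes in, a score per label out.
Classifier = Callable[[bytes], Dict[str, float]]


def recognize_pictures(
    picture_files: Sequence[bytes],
    screenshot: bytes,
    model: Classifier,
    threshold: float = 0.5,
) -> Dict[str, object]:
    """Classify each picture file and the screenshot independently."""

    def top_label(image: bytes) -> str:
        scores = model(image)
        label, score = max(scores.items(), key=lambda kv: kv[1])
        # Assumed policy: report low-confidence predictions explicitly.
        return label if score >= threshold else "uncertain"

    labels: List[str] = [top_label(p) for p in picture_files]
    return {"pictures": labels, "screenshot": top_label(screenshot)}
```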

In some embodiments, the above apparatus further comprises: a data cleaning module, configured to perform data cleaning on the content through a Kafka cluster.

In some embodiments, the above apparatus further comprises: a data storage module, configured to store the picture file and the display effect screenshot in a preset storage area, and/or to store the text recognition result and the picture recognition result in the preset storage area.

In some embodiments, the above apparatus further comprises: a data sending module, configured to send the text recognition result and the picture recognition result to a specified terminal.

The website information identification device provided by the embodiment of the invention has the same technical characteristics as the website information identification method provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.

The embodiment of the invention also provides electronic equipment for operating the website information identification method; referring to fig. 8, the electronic device includes a memory 100 and a processor 101, where the memory 100 is used to store one or more computer instructions, and the one or more computer instructions are executed by the processor 101 to implement the website information identification method.

Further, the electronic device shown in fig. 8 further includes a bus 102 and a communication interface 103, and the processor 101, the communication interface 103, and the memory 100 are connected through the bus 102.

The memory 100 may include a high-speed Random Access Memory (RAM) and may further include a non-volatile memory, such as at least one disk memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 103 (which may be wired or wireless), and may use the internet, a wide area network, a local network, a metropolitan area network, and the like. The bus 102 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 8, but this does not indicate that there is only one bus or one type of bus.

The processor 101 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 101 or by instructions in the form of software. The processor 101 may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Such a processor may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 100; the processor 101 reads the information in the memory 100 and completes the steps of the method of the foregoing embodiment in combination with its hardware.

The embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores computer-executable instructions, and when the computer-executable instructions are called and executed by a processor, the computer-executable instructions cause the processor to implement the above-mentioned website information identification method.

The computer program product of the website information identification method, device and electronic equipment provided by the embodiment of the invention comprises a computer-readable storage medium storing program code. The instructions included in the program code can be used to execute the method described in the foregoing method embodiments; for the specific implementation, refer to the method embodiments, which are not described herein again.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatus and/or the electronic device described above may refer to corresponding processes in the foregoing method embodiments, and are not described herein again.

Finally, it should be noted that the above embodiments are only specific embodiments of the present invention. They are used to illustrate the technical solutions of the present invention, not to limit them, and the protection scope of the present invention is not limited to them. Although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope of the present disclosure, anyone skilled in the art can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A website information identification method is characterized by comprising the following steps:

acquiring the content of a target website according to the address of the target website; the content comprises: text content, picture files and display effect screenshots;

performing accurate matching and/or natural language analysis processing on the text content according to a preset sensitive violation word bank to determine a text recognition result of the target website;

respectively carrying out image classification recognition based on deep learning on the picture file and the display effect screenshot according to preset sample pictures with different types of labels to determine a picture recognition result of the target website;

after the step of obtaining the content of the target website according to the address of the target website, the method further comprises:

performing data cleaning on the content through a Kafka cluster.

2. The method of claim 1, wherein the step of obtaining the content of the target website according to the address of the target website comprises:

acquiring the address of a target website;

acquiring the text content of the target website in a common request mode according to the address;

and acquiring the picture file and the display effect screenshot of the target website through a headless browser according to the address.

3. The method according to claim 1, wherein the step of determining the text recognition result of the target website by performing exact matching and/or natural language analysis processing on the text content according to a preset sensitive violation word bank comprises:

segmenting the text content;

judging whether to adopt text accurate matching and/or NLP learning model matching to analyze the text content according to a preset system configuration file;

if the text content is analyzed by adopting the text exact matching, matching the text content after word segmentation with the sensitive violation word bank to determine a text recognition result of the target website;

if the text content is matched and analyzed by adopting the NLP learning model, inputting the text content after word segmentation into the NLP learning model which is learned in advance, and outputting a text recognition result of the target website; the NLP learning model is obtained by learning according to the sensitive violation word bank.

4. The method according to claim 1, wherein the step of determining the picture recognition result of the target website by performing image classification recognition based on deep learning on the picture file and the display effect screenshot according to preset sample pictures with different types of labels comprises:

inputting the picture file and the display effect screenshot respectively into an image auditing learning model which is learned in advance, and outputting a picture identification result of the target website; and the image auditing learning model is obtained by learning according to the sample picture.

5. The method of claim 1, further comprising: saving the picture file and the display effect screenshot to a preset storage area; and/or storing the text recognition result and the picture recognition result to the preset storage area.

6. The method of claim 1, further comprising:

and sending the text recognition result and the picture recognition result to a specified terminal.

7. A website information identifying apparatus, comprising:

the content acquisition module is used for acquiring the content of the target website according to the address of the target website; the content comprises: text content, picture files and display effect screenshots;

the text recognition module is used for carrying out exact matching and/or natural language analysis processing on the text content according to a preset sensitive violation word bank to determine a text recognition result of the target website;

the picture identification module is used for respectively carrying out image classification identification based on deep learning on the picture file and the display effect screenshot according to preset sample pictures with different types of labels to determine a picture identification result of the target website;

and the data cleaning module is used for performing data cleaning on the content through a Kafka cluster.

8. An electronic device comprising a processor and a memory, the memory storing machine executable instructions executable by the processor, the processor executing the machine executable instructions to implement the steps of the website information identifying method of any one of claims 1 to 6.

9. A computer readable storage medium having stored thereon machine executable instructions which, when invoked and executed by a processor, cause the processor to carry out the steps of the website information identification method of any one of claims 1 to 6.