CN110275958B - Website information identification method and device and electronic equipment

Info

Publication number
CN110275958B
Authority
CN
China
Prior art keywords
content
target website
text
picture
website
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910565890.3A
Other languages
Chinese (zh)
Other versions
CN110275958A (en)
Inventor
白冰
栗阳力
李国华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Bohui Technology Inc
Original Assignee
Beijing Bohui Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Bohui Technology Inc
Priority to CN201910565890.3A
Publication of CN110275958A
Application granted
Publication of CN110275958B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a website information identification method, a website information identification device and electronic equipment. The method comprises: acquiring the content of a target website according to the address of the target website, the content including text content, picture files and display effect screenshots; performing exact matching and/or natural language analysis on the text content according to a preset sensitive violation word bank to determine a text recognition result of the target website; and performing deep-learning-based image classification and recognition on the picture files and the display effect screenshots, according to preset sample pictures carrying different types of labels, to determine a picture recognition result of the target website. After the content of the target website is obtained, exact matching and/or natural language analysis of the text content yields the text recognition result, and deep-learning-based classification of the picture files and display effect screenshots yields the picture recognition result. In this way, whether the website has bad content can be judged effectively and the misjudgment rate is reduced.

Description

Website information identification method and device and electronic equipment
Technical Field
The invention relates to the technical field of website monitoring, in particular to a website information identification method and device and electronic equipment.
Background
In recent years, with the development of the internet, the amount of bad information on the network has gradually increased, and how to automatically and effectively identify and screen such information is a problem that remains to be solved. Existing solutions either acquire content data with a crawler and match it against sensitive words after word segmentation, or crawl pictures and analyze them by picture recognition.
However, some websites serve false data as an anti-crawling measure. As a result, existing methods for identifying bad information on the internet cannot correctly and effectively judge whether a website has bad content, and their misjudgment rate rises.
Disclosure of Invention
In view of the above, the present invention provides a website information identification method, apparatus and electronic device to effectively determine whether the website has bad content, reduce the misjudgment rate and increase the accuracy of information identification.
In a first aspect, an embodiment of the present invention provides a website information identification method, including: acquiring the content of a target website according to the address of the target website, the content including text content, picture files and display effect screenshots; performing exact matching and/or natural language analysis on the text content according to a preset sensitive violation word bank to determine a text recognition result of the target website; and performing deep-learning-based image classification and recognition on the picture files and the display effect screenshots, according to preset sample pictures carrying different types of labels, to determine a picture recognition result of the target website.
In a preferred embodiment of the present invention, the step of acquiring the content of the target website according to the address of the target website includes: acquiring the address of the target website; acquiring the text content of the target website by an ordinary request according to the address; and acquiring the picture files and the display effect screenshot of the target website through a headless browser according to the address.
In a preferred embodiment of the present invention, the step of determining the text recognition result of the target website by performing exact matching and/or natural language analysis on the text content according to a preset sensitive violation word bank includes: segmenting the text content into words; determining, according to a preset system configuration file, whether the text content is to be analyzed by exact text matching and/or by an NLP (Natural Language Processing) learning model; if exact text matching is used, matching the segmented text content against the sensitive violation word bank to determine the text recognition result of the target website; if the NLP learning model is used, inputting the segmented text content into the NLP learning model trained in advance and outputting the text recognition result of the target website. The NLP learning model is obtained by learning from the sensitive violation word bank.
In a preferred embodiment of the present invention, the step of determining the picture recognition result of the target website by performing deep-learning-based image classification and recognition on the picture files and the display effect screenshots according to preset sample pictures carrying different types of labels includes: inputting the picture files and the display effect screenshots respectively into an image auditing learning model trained in advance, and outputting the picture recognition result of the target website. The image auditing learning model is obtained by learning from the sample pictures.
In a preferred embodiment of the present invention, after the step of acquiring the content of the target website according to the address of the target website, the method further comprises: performing data cleaning on the content through a Kafka cluster.
In a preferred embodiment of the present invention, the method further includes: storing the picture file and the display effect screenshot into a preset storage area; and/or storing the text recognition result and the picture recognition result in a preset storage area.
In a preferred embodiment of the present invention, the method further includes: and sending the text recognition result and the picture recognition result to a specified terminal.
In a second aspect, an embodiment of the present invention further provides a website information identification apparatus, including: a content acquisition module, used for acquiring the content of a target website according to the address of the target website, the content including text content, picture files and display effect screenshots; a text recognition module, used for performing exact matching and/or natural language analysis on the text content according to a preset sensitive violation word bank and determining a text recognition result of the target website; and a picture recognition module, used for performing deep-learning-based image classification and recognition on the picture files and the display effect screenshots, according to preset sample pictures carrying different types of labels, to determine a picture recognition result of the target website.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a processor and a memory, where the memory stores machine executable instructions that can be executed by the processor, and the processor executes the machine executable instructions to implement the steps of the website information identification method.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing machine-executable instructions, which when invoked and executed by a processor, cause the processor to implement the steps of the above website information identification method.
The embodiments of the invention have the following beneficial effects:
According to the website information identification method, the website information identification device and the electronic equipment, after the text content, the picture files and the display effect screenshots of the target website are obtained, exact matching and/or natural language analysis is performed on the text content according to the sensitive violation word bank to obtain a text recognition result, and deep-learning-based image classification and recognition is performed on the picture files and the display effect screenshots according to the sample pictures to obtain a picture recognition result. Whether the website has bad content can thus be judged effectively, the misjudgment rate is reduced, and the accuracy of information identification is improved.
Additional features and advantages of the disclosure will be set forth in the description which follows, or in part may be learned by the practice of the above-described techniques of the disclosure, or may be learned by practice of the disclosure.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a website information identification method according to an embodiment of the present invention;
Fig. 2 is a flowchart of another website information identification method according to an embodiment of the present invention;
Fig. 3 is a flowchart of another website information identification method according to an embodiment of the present invention;
Fig. 4 is a flowchart of another website information identification method according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a website information identification system according to an embodiment of the present invention;
Fig. 6 is a schematic block diagram of a website information identification system according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a website information identification apparatus according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In view of the problem that the existing internet bad information identification method cannot correctly and effectively judge whether the website has bad content or not and has high misjudgment rate, embodiments of the present invention provide a website information identification method, an apparatus and an electronic device.
To facilitate understanding of the embodiment, first, a website information identification method disclosed in the embodiment of the present invention is described in detail, and as shown in fig. 1, the method includes the following steps:
Step S102: acquiring the content of the target website according to the address of the target website; the content includes text content, picture files and display effect screenshots.
The target website is the website to be detected. The text content is the textual content of the website. The picture files and the display effect screenshots are both pictures: a picture file is a picture resource file of the website, while a display effect screenshot captures the website as it appears in actual use, that is, the page displayed when a user side opens the website. The content of the target website is crawled from the target website by the content capturer.
Step S104: performing exact matching and/or natural language analysis on the text content according to a preset sensitive violation word bank, and determining a text recognition result of the target website.
The sensitive violation word bank is preset by an administrator, and every word in it denotes bad information. If the text content contains words from the sensitive violation word bank, the target website is likely to contain bad information. The text recognition result may include the words in the text content that match the sensitive violation word bank, their number and the nearby paragraphs; alternatively, the text content may be scored and sorted by score, or given different labels.
Step S106: performing deep-learning-based image classification and recognition on the picture files and the display effect screenshots, according to preset sample pictures carrying different types of labels, and determining a picture recognition result of the target website.
Sample pictures carrying different types of labels are preset by an administrator; their number is not fixed, and sample pictures can be deleted at any time. If a picture file or display effect screenshot is found to have a certain similarity to a sample picture, the target website may contain bad information. The picture recognition result may include the number and proportion of picture files and display effect screenshots that are identical or similar to the sample pictures; alternatively, the picture files and display effect screenshots may be scored and sorted by score, or given different labels.
According to the website information identification method provided by the embodiment of the invention, after the text content, the picture file and the display effect screenshot of the target website are obtained, the text content is accurately matched and/or natural language analysis processing is carried out according to the sensitive violation word bank to obtain a text identification result; and carrying out image classification and identification based on deep learning on the obtained picture file and the display effect screenshot according to the sample picture to obtain a picture identification result. Whether the website has bad content or not can be effectively judged, the misjudgment rate is reduced, and the accuracy of information identification is improved.
An embodiment of the invention also provides another website information identification method, implemented on the basis of the method of the above embodiment; it mainly describes a specific implementation of acquiring the content of the target website.
As shown in fig. 2, the method comprises the steps of:
Step S202: acquiring the address of the target website.
Before the content of the target website is obtained, the address of the target website to be detected and the website hierarchy to be crawled must be obtained. Both are obtained through the monitoring application interface. Generally, the website hierarchy is divided into a first layer, a second layer and a third layer, and content is usually crawled only from these first three layers.
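The three-layer limit described above can be illustrated with a depth-limited breadth-first traversal. This is a minimal sketch only: the function name `crawl_levels`, the `get_links` callback and the in-memory `SITE` map are hypothetical stand-ins, since the patent does not specify how links are discovered.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def crawl_levels(start_url: str,
                 get_links: Callable[[str], Iterable[str]],
                 max_depth: int = 3) -> Dict[str, int]:
    """Return every URL reachable within max_depth layers, mapped to its layer (1-based)."""
    levels = {start_url: 1}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = levels[url]
        if depth >= max_depth:
            continue  # pages on the deepest permitted layer are not expanded
        for link in get_links(url):
            if link not in levels:  # skip pages already assigned a layer
                levels[link] = depth + 1
                queue.append(link)
    return levels


# Hypothetical in-memory site, used only for illustration.
SITE = {
    "http://example.test/": ["http://example.test/a", "http://example.test/b"],
    "http://example.test/a": ["http://example.test/a/1"],
    "http://example.test/a/1": ["http://example.test/a/1/x"],
}
levels = crawl_levels("http://example.test/", lambda u: SITE.get(u, []))
```

In a real collector, `get_links` would extract links from the fetched page; here the fourth-layer page is never visited because its parent sits on the third layer.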
Step S204: acquiring the text content of the target website by an ordinary request according to the address.
An internal acquisition adaptation method determines the acquisition mode for the address and selects the corresponding acquisition method. Data in the configuration file is read to judge whether the content can be accessed directly through an ordinary request: if it can, the content data is obtained directly by an ordinary request; if it cannot, the request is made through a headless browser. An ordinary request uses a script to send an HTTP (HyperText Transfer Protocol) request and obtain the content data. The headless browser sends the HTTP request and automatically loads and renders the page, so a single page load in headless browser mode may involve more than one HTTP request.
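A minimal sketch of such an acquisition adaptation step follows. The configuration keys, the suffix and prefix rules, and the function names are assumptions made for illustration; the patent only states that a configuration file decides between an ordinary request and a headless browser. The headless-browser branch is left as a returned mode name because it depends on an external browser driver.

```python
import urllib.request
from urllib.parse import urlparse

# Hypothetical configuration: which addresses an ordinary request can fetch directly.
CONFIG = {
    "plain_suffixes": (".txt", ".json"),
    "plain_path_prefixes": ("/api/",),
}


def choose_fetch_mode(url: str, config: dict = CONFIG) -> str:
    """Pick 'plain_request' or 'headless_browser' for a URL based on the configuration."""
    path = urlparse(url).path
    if path.endswith(config["plain_suffixes"]) or path.startswith(config["plain_path_prefixes"]):
        return "plain_request"
    return "headless_browser"


def fetch_plain(url: str, timeout: float = 10.0) -> str:
    """Ordinary request: a single scripted HTTP request returning the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


mode = choose_fetch_mode("http://example.test/aa/demo.txt")
```

Keeping the decision separate from the fetch makes it easy to route only rendered pages to the slower headless browser.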
Specifically, a data-type request address such as http://xxxx.aa/text returns data in JSON format, for example {'desc': 'this is a magic website'}. A text-type request address looks like http://xxx/aa/demo.txt. When, according to the configuration file, the address is judged to return data or to be a text-type request, the text content of the target website is acquired by an ordinary request.
Step S206: acquiring the picture files and the display effect screenshot of the target website through the headless browser according to the address.
When the address to be crawled is an ordinary website, it must be crawled through the headless browser: the content capturer uses the headless browser to obtain the picture files in the web page, including those that are lazily loaded, together with the display effect screenshot.
Step S208: performing exact matching and/or natural language analysis on the text content according to a preset sensitive violation word bank, and determining a text recognition result of the target website.
Step S210: performing deep-learning-based image classification and recognition on the picture files and the display effect screenshots, according to preset sample pictures carrying different types of labels, and determining a picture recognition result of the target website.
In the above manner, the content of the target website is acquired either by an ordinary request or through a headless browser, depending on the type of website, which improves the efficiency of acquiring the content of the target website.
An embodiment of the invention also provides another website information identification method, implemented on the basis of the method of the above embodiment; it mainly describes a specific implementation of matching the text content.
As shown in fig. 3, the method comprises the steps of:
Step S302: acquiring the content of the target website according to the address of the target website; the content includes text content, picture files and display effect screenshots.
Step S304: performing word segmentation on the text content.
The text content generally consists of continuous sentences, so to ensure matching accuracy it must first be segmented, that is, divided into separate words. Word segmentation is typically performed by a word segmenter.
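To illustrate what a word segmenter does with unspaced text, the sketch below uses forward maximum matching against a small vocabulary. This is one simple, well-known segmentation technique chosen for illustration; the patent does not name the segmenter it uses, and the vocabulary and `max_len` value here are assumptions.

```python
from typing import List, Set


def segment(text: str, vocabulary: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: at each position take the longest vocabulary word,
    falling back to a single character when nothing in the vocabulary matches."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1  # whitespace separates tokens but is not a token itself
            continue
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Unspaced Latin text stands in for a continuous sentence.
tokens = segment("websiteinfo", {"web", "site", "info"})
```

Production segmenters use much larger dictionaries and statistical models, but the output has the same shape: a list of words ready for matching.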
Step S306: determining, according to a preset system configuration file, whether the text content is to be analyzed by exact text matching and/or by NLP learning model matching. If exact text matching is used, step S308 is executed; if NLP learning model matching is used, step S310 is executed.
The system configuration file indicates which detection mode is applied to the text content. Two modes are generally available: exact text matching and the NLP learning model.
Step S308: matching the segmented text content against the sensitive violation word bank to determine a text recognition result of the target website.
Exact text matching compares the segmented words one by one with the words in the sensitive violation word bank and checks whether the segmented text content includes any word from the word bank.
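Exact matching of segmented text against the word bank can be sketched as a set lookup per token. The result layout (a flag, a hit count, and each hit with a few neighbouring tokens as context) is an assumption made for illustration, loosely following the earlier description of the text recognition result; the word `badword` is a placeholder entry.

```python
from typing import Dict, List, Set


def exact_match(tokens: List[str], lexicon: Set[str], window: int = 2) -> Dict:
    """Check each segmented word against the word bank and record hits with nearby context."""
    hits = []
    for i, token in enumerate(tokens):
        if token in lexicon:
            context = tokens[max(0, i - window): i + window + 1]
            hits.append({"word": token, "position": i, "context": context})
    return {"flagged": bool(hits), "hit_count": len(hits), "hits": hits}


# Placeholder word bank and token list.
result = exact_match(["buy", "badword", "now", "please", "thanks"], {"badword"})
```

Storing the word bank as a set keeps each lookup constant-time, so the cost grows only with the length of the text.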
Step S310: inputting the segmented text content into an NLP learning model trained in advance, and outputting a text recognition result of the target website; the NLP learning model is obtained by learning from the sensitive violation word bank.
The NLP learning model learns the sensitive violation word bank automatically in advance, analyzes which types of bad information the text content contains, and gives a matching score; it then judges, case by case, which violation type the text content matches. It should be noted that exact text matching and the NLP learning model can be used at the same time to increase the accuracy of bad information identification.
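The shape of the model's output (a score per violation type plus the best-matching type) can be shown with a deliberately simple stand-in. The sketch below is not a learned NLP model: it scores each category by the fraction of tokens that belong to it in a hypothetical labelled word bank, and the category names and words are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical labelled word bank: word -> violation category.
LABELLED_LEXICON = {"jackpot": "gambling", "casino": "gambling", "counterfeit": "fraud"}


def score_text(tokens: List[str], lexicon: Dict[str, str] = LABELLED_LEXICON) -> Dict:
    """Score each violation category by the share of tokens that fall into it."""
    counts = Counter(lexicon[t] for t in tokens if t in lexicon)
    total = len(tokens) or 1  # avoid division by zero on empty input
    scores = {category: n / total for category, n in counts.items()}
    label = max(scores, key=scores.get) if scores else None
    return {"label": label, "scores": scores}


result = score_text(["casino", "jackpot", "tonight", "counterfeit"])
```

A trained classifier would replace `score_text` while keeping the same interface, so the caller can combine its output with the exact-match result.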
Step S312: performing deep-learning-based image classification and recognition on the picture files and the display effect screenshots, according to preset sample pictures carrying different types of labels, and determining a picture recognition result of the target website.
In the above manner, the text content is analyzed by exact text matching and/or NLP learning model matching, which improves both the efficiency and the accuracy of text content recognition.
An embodiment of the invention also provides another website information identification method, implemented on the basis of the method of the above embodiment; it mainly describes a specific implementation of matching the picture files and the display effect screenshots.
As shown in fig. 4, the method includes the steps of:
Step S402: acquiring the content of the target website according to the address of the target website; the content includes text content, picture files and display effect screenshots.
Step S404: performing exact matching and/or natural language analysis on the text content according to a preset sensitive violation word bank, and determining a text recognition result of the target website.
Step S406: inputting the picture files and the display effect screenshots respectively into an image auditing learning model trained in advance, and outputting a picture recognition result of the target website; the image auditing learning model is obtained by learning from the sample pictures.
After the picture files and the display effect screenshots are obtained, the image auditing learning model evaluates them against the different classification types and gives scores. The image auditing learning model is trained in advance on sample pictures carrying different types of labels. It exposes an external interface: an external caller passes in the pictures to be analyzed, and the interface forwards them to the prediction model, which computes and returns the scores.
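The external interface described above can be sketched as a thin wrapper around a pluggable prediction function. Everything concrete here is an assumption for illustration: the `Predictor` type, the label names `normal` and `violation`, the 0.8 threshold, and `fake_predict`, which merely stands in for a trained deep-learning classifier.

```python
from typing import Callable, Dict, List, Mapping

# A predictor takes raw image bytes and returns a score per label.
Predictor = Callable[[bytes], Mapping[str, float]]


def audit_images(images: Dict[str, bytes], predict: Predictor,
                 threshold: float = 0.8) -> List[dict]:
    """Pass each picture to the prediction model and report its top label and score."""
    results = []
    for name, data in images.items():
        scores = predict(data)
        label, score = max(scores.items(), key=lambda item: item[1])
        results.append({
            "image": name,
            "label": label,
            "score": score,
            "flagged": label != "normal" and score >= threshold,
        })
    return results


def fake_predict(data: bytes) -> Mapping[str, float]:
    """Stand-in for a trained classifier; real scores would come from the model."""
    if data.startswith(b"BAD"):
        return {"normal": 0.1, "violation": 0.9}
    return {"normal": 0.95, "violation": 0.05}


results = audit_images({"logo.png": b"OK-bytes", "banner.png": b"BAD-bytes"}, fake_predict)
```

Because picture files and display effect screenshots are both just images, the same call can serve both inputs.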
After the content of the target website is acquired, the acquired text content, picture files and display effect screenshots may have problems such as repeated acquisition of text content and picture files, overlapping words in the text content, and disordered text content. Data cleaning is therefore required: after the step of acquiring the content of the target website according to the address of the target website, the method further comprises performing data cleaning on the content through a Kafka cluster. Data cleaning refers to the procedure of finding and correcting recognizable errors in data files, including checking data consistency and handling invalid and missing values. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the activity stream data of a consumer-scale website. Data cleaning yields more accurate content for the target website, reduces the workload of subsequently obtaining the text recognition result and the picture recognition result, saves time, and increases the accuracy of bad information identification.
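The kind of cleaning rules mentioned above (removing duplicates, handling invalid and missing values) can be sketched in plain Python. The record layout with `url` and `text` fields is an assumption, and the Kafka transport itself is omitted; in the described system such a function would run on messages consumed from the cluster.

```python
from typing import Dict, List, Optional


def clean_records(records: List[Dict[str, Optional[str]]]) -> List[Dict[str, str]]:
    """Drop records with missing or empty fields, normalize whitespace, and remove duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        url = record.get("url")
        text = record.get("text")
        if not url or text is None:
            continue  # missing value: nothing usable to analyze
        text = " ".join(text.split())  # collapse runs of whitespace
        if not text:
            continue  # invalid value: empty after normalization
        key = (url, text)
        if key in seen:
            continue  # repeated acquisition of the same content
        seen.add(key)
        cleaned.append({"url": url, "text": text})
    return cleaned


cleaned = clean_records([
    {"url": "u1", "text": "hello   world"},
    {"url": "u1", "text": "hello world"},
    {"url": "u2", "text": None},
    {"url": "", "text": "x"},
    {"url": "u3", "text": "   "},
])
```

Only the first record survives here; the second is a duplicate after whitespace normalization and the rest have missing or empty fields.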
The content and the recognition results of the target website also need to be stored to facilitate subsequent audit and inspection, so the method further comprises: storing the picture files and the display effect screenshots in a preset storage area; and/or storing the text recognition result and the picture recognition result in a preset storage area. The preset storage area is the location where the data is saved, and it can be a disk array. A disk array is a large-capacity disk group composed of a plurality of independent disks; the combined effect of the individual disks serving data improves the performance of the whole disk system.
After the recognition results are obtained, they need to be sent to a designated terminal, where a worker views and analyzes them, so the method further comprises: sending the text recognition result and the picture recognition result to a designated terminal. The designated terminal can be a computer, a mobile phone, a tablet computer or any other device that can be networked and has a script display function. The recognition results can be obtained, analyzed and counted through the terminal.
In the website information identification system shown in fig. 5, the acquisition probe obtains the target website and the target website hierarchy through the monitoring application interface, and downloads the content of the target website by an ordinary request or through a headless browser; the content includes text content, picture files and display effect screenshots. The content is sent to the message transfer cleaning module. The cleaned text content is sent to the content analysis module to obtain the text recognition result, and the cleaned picture files and display effect screenshots are sent to the image analysis module to obtain the picture recognition result. The disk array stores the downloaded website content, the text recognition results and the picture recognition results. The business analysis module reads the analyzed text recognition results and picture recognition results from the disk array, counts and analyzes the number of websites containing bad information in the website data crawled in the current run, and stores the statistics on the disk array. The monitoring application module is used, under the control of a platform or a third-party application, to issue the websites to be acquired, the website hierarchy to be acquired and the data analysis strategy. The data extraction interface is called by the platform or the third-party application and provides the analyzed data results to it for display.
The data flow of the website information identification system is shown in fig. 6. The acquisition probe obtains the website data; its transfer storage module stores the image files on the disk array, and the probe publishes the data to the data transfer cleaning module (Kafka cluster). The real-time analysis modules (content analysis and image analysis) subscribe to the data transfer cleaning module to obtain the data to be processed, call the image auditing learning model interface or the text content matching interface according to the data type, and store the analyzed content data on the disk array.
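The publish-subscribe flow between the probe and the analysis modules can be sketched with a tiny in-process broker. This is a stand-in for the Kafka cluster, not Kafka client code: the `Broker` class, the topic names `text` and `image`, and the list used as storage are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class Broker:
    """Minimal in-process publish-subscribe hub standing in for the message cluster."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(message)


storage = []  # stands in for the disk array
broker = Broker()

# Each analysis module subscribes to the data type it handles and stores its result.
broker.subscribe("text", lambda m: storage.append(("content_analysis", m["url"])))
broker.subscribe("image", lambda m: storage.append(("image_analysis", m["url"])))

# The acquisition probe publishes what it collected, routed by data type.
broker.publish("text", {"url": "http://example.test/", "text": "sample page text"})
broker.publish("image", {"url": "http://example.test/logo.png", "data": b"image-bytes"})
```

Routing by topic is what lets the content analysis and image analysis modules run independently of the probe and of each other.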
In this mode, the image audit learning model is used to identify the picture files and the display effect screenshots, which can improve both the efficiency and the accuracy of identifying them.
It should be noted that the above method embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
Corresponding to the above method embodiment, an embodiment of the present invention provides a website information identification apparatus, as shown in fig. 7, the apparatus includes:
a content obtaining module 71, configured to obtain the content of the target website according to the address of the target website, where the content includes: text content, picture files and display effect screenshots;
a text recognition module 72, configured to perform exact matching and/or natural language analysis processing on the text content according to a preset sensitive violation word bank to determine a text recognition result of the target website; and
a picture identification module 73, configured to perform deep-learning-based image classification and identification on the picture files and the display effect screenshots respectively, according to preset sample pictures with different types of labels, to determine a picture identification result of the target website.
With the website information identification device provided by the embodiment of the invention, after the text content, the picture files and the display effect screenshots of the target website are obtained, exact matching and/or natural language analysis is performed on the text content according to the sensitive violation word bank to obtain a text identification result, and deep-learning-based image classification and identification is performed on the picture files and the display effect screenshots according to the sample pictures to obtain a picture identification result. In this way, whether the website contains bad content can be judged effectively, the misjudgment rate is reduced, and the accuracy of information identification is improved.
In some embodiments, the content acquisition module is configured to: acquire the address of the target website; acquire the text content of the target website by an ordinary request according to the address; and acquire the picture files and the display effect screenshots of the target website through a headless browser according to the address.
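The ordinary-request part of this step can be sketched with the Python standard library alone. The sketch below fetches a page and extracts its visible text and picture addresses; the names `PageContentParser`, `parse_page` and `fetch_page` are illustrative assumptions. Capturing the display effect screenshot would require a headless browser driven by a browser automation tool, which is not shown here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class PageContentParser(HTMLParser):
    """Collects visible text and absolute image URLs from an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.image_urls = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url):
    """Return (text content, list of picture URLs) for an already downloaded page."""
    parser = PageContentParser(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.image_urls


def fetch_page(url, timeout=10):
    """Download a page with an ordinary HTTP request and decode it to text."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

An ordinary request only sees the HTML as served, so pictures inserted by scripts at render time would be missed; that is one plausible reason for using the headless browser for picture files and screenshots.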
In some embodiments, the text recognition module is configured to: segment the text content into words; determine, according to a preset system configuration file, whether to analyze the text content by exact text matching and/or by NLP learning model matching; if exact text matching is used, match the segmented text content against the sensitive violation word bank to determine the text recognition result of the target website; and if NLP learning model matching is used, input the segmented text content into a pre-trained NLP learning model and output the text recognition result of the target website, where the NLP learning model is obtained by learning from the sensitive violation word bank.
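The configuration-driven choice between the two analysis paths can be sketched as follows. Everything named here is an assumption made for illustration: the configuration keys `use_exact_match` and `use_nlp_model`, the `"violation"` label, and the regular-expression tokenizer, which stands in for a real word segmenter (Chinese text would need a dedicated segmentation tool). The NLP learning model is represented by any callable that maps a token list to a label.

```python
import re


def segment(text):
    """Stand-in word segmentation: lowercase alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def exact_match(tokens, word_bank):
    """Return the word-bank entries that occur among the tokens."""
    return sorted(set(tokens) & set(word_bank))


def recognize_text(text, word_bank, config, nlp_model=None):
    """Apply exact matching and/or a model, as selected by the configuration."""
    tokens = segment(text)
    result = {"exact_hits": [], "nlp_label": None}
    if config.get("use_exact_match", True):
        result["exact_hits"] = exact_match(tokens, word_bank)
    if config.get("use_nlp_model", False) and nlp_model is not None:
        result["nlp_label"] = nlp_model(tokens)
    result["flagged"] = bool(result["exact_hits"]) or result["nlp_label"] == "violation"
    return result
```

Both paths can be enabled at once, mirroring the "and/or" in the description; a site is then flagged if either path reports a problem.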
In some embodiments, the picture identification module is configured to: input the picture files and the display effect screenshots respectively into a pre-trained image audit learning model and output the picture identification result of the target website, where the image audit learning model is obtained by learning from the sample pictures.
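How the per-image outputs might be combined into one picture identification result for the website is sketched below. The image audit learning model itself is not reproduced: `classify` is a hypothetical callable returning a `(label, score)` pair, and the `"normal"` label, the score threshold and the function name `identify_pictures` are assumptions rather than details from the patent.

```python
def identify_pictures(picture_files, screenshots, classify, threshold=0.5):
    """Classify picture files and screenshots separately and collect non-normal findings.

    picture_files and screenshots are lists of (name, image_bytes) pairs.
    """
    findings = []
    for source, items in (("picture", picture_files), ("screenshot", screenshots)):
        for name, data in items:
            label, score = classify(data)
            if label != "normal" and score >= threshold:
                findings.append({"source": source, "name": name, "label": label, "score": score})
    return {"flagged": bool(findings), "findings": findings}
```

Keeping the source of each finding makes it possible to tell whether bad content was found in an embedded picture or only in the rendered page as a whole.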
In some embodiments, the above apparatus further comprises a data cleaning module, configured to perform data cleaning on the content through a Kafka cluster.
In some embodiments, the above apparatus further comprises a data storage module, configured to save the picture files and the display effect screenshots to a preset storage area, and/or to store the text recognition result and the picture recognition result in the preset storage area.
In some embodiments, the above apparatus further comprises a data sending module, configured to send the text recognition result and the picture recognition result to a specified terminal.
The website information identification device provided by the embodiment of the invention has the same technical characteristics as the website information identification method provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.
The embodiment of the invention also provides an electronic device for running the website information identification method. Referring to fig. 8, the electronic device includes a memory 100 and a processor 101, where the memory 100 is used to store one or more computer instructions, and the one or more computer instructions are executed by the processor 101 to implement the website information identification method.
Further, the electronic device shown in fig. 8 further includes a bus 102 and a communication interface 103, and the processor 101, the communication interface 103, and the memory 100 are connected through the bus 102.
The memory 100 may include a high-speed random access memory (RAM) and may further include a non-volatile memory, such as at least one disk memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 103 (which may be wired or wireless), and may use the internet, a wide area network, a local area network, a metropolitan area network, and the like. The bus 102 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in fig. 8, but this does not indicate that there is only one bus or one type of bus.
The processor 101 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 101 or by instructions in the form of software. The processor 101 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an erasable programmable read-only memory, or a register. The storage medium is located in the memory 100; the processor 101 reads the information in the memory 100 and completes the steps of the method of the foregoing embodiment in combination with its hardware.
The embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores computer-executable instructions, and when the computer-executable instructions are called and executed by a processor, the computer-executable instructions cause the processor to implement the above-mentioned website information identification method.
The computer program product of the website information identification method, the website information identification device and the electronic device provided by the embodiment of the invention includes a computer-readable storage medium storing program code. The instructions included in the program code can be used to execute the method in the foregoing method embodiment; for the specific implementation, refer to the method embodiment, which is not described here again.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatus and/or the electronic device described above may refer to corresponding processes in the foregoing method embodiments, and are not described herein again.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A website information identification method is characterized by comprising the following steps:
acquiring the content of a target website according to the address of the target website, wherein the content includes: text content, picture files and display effect screenshots;
performing exact matching and/or natural language analysis processing on the text content according to a preset sensitive violation word bank to determine a text recognition result of the target website;
respectively carrying out image classification recognition based on deep learning on the picture file and the display effect screenshot according to preset sample pictures with different types of labels to determine a picture recognition result of the target website;
after the step of obtaining the content of the target website according to the address of the target website, the method further comprises:
performing data cleaning on the content through a Kafka cluster.
2. The method of claim 1, wherein the step of obtaining the content of the target website according to the address of the target website comprises:
acquiring the address of a target website;
acquiring the text content of the target website by an ordinary request according to the address;
and acquiring the picture file and the display effect screenshot of the target website through a headless browser according to the address.
3. The method according to claim 1, wherein the step of determining the text recognition result of the target website by performing exact matching and/or natural language analysis processing on the text content according to a preset sensitive violation word bank comprises:
segmenting the text content;
judging whether to adopt text accurate matching and/or NLP learning model matching to analyze the text content according to a preset system configuration file;
if the text content is analyzed by adopting the text exact matching, matching the text content after word segmentation with the sensitive violation word bank to determine a text recognition result of the target website;
if the text content is matched and analyzed by adopting the NLP learning model, inputting the text content after word segmentation into the NLP learning model which is learned in advance, and outputting a text recognition result of the target website; the NLP learning model is obtained by learning according to the sensitive violation word bank.
4. The method according to claim 1, wherein the step of determining the picture recognition result of the target website by performing image classification recognition based on deep learning on the picture file and the display effect screenshot according to preset sample pictures with different types of labels comprises:
inputting the picture file and the display effect screenshot respectively into a pre-trained image audit learning model, and outputting a picture identification result of the target website; wherein the image audit learning model is obtained by learning according to the sample picture.
5. The method of claim 1, further comprising: saving the picture file and the display effect screenshot to a preset storage area; and/or storing the text recognition result and the picture recognition result to the preset storage area.
6. The method of claim 1, further comprising:
and sending the text recognition result and the picture recognition result to a specified terminal.
7. A website information identifying apparatus, comprising:
the content acquisition module is used for acquiring the content of the target website according to the address of the target website, wherein the content includes: text content, picture files and display effect screenshots;
the text recognition module is used for carrying out exact matching and/or natural language analysis processing on the text content according to a preset sensitive violation word bank to determine a text recognition result of the target website;
the picture identification module is used for respectively carrying out image classification identification based on deep learning on the picture file and the display effect screenshot according to preset sample pictures with different types of labels to determine a picture identification result of the target website;
after the step of obtaining the content of the target website according to the address of the target website, the method further comprises:
performing data cleaning on the content through a Kafka cluster.
8. An electronic device comprising a processor and a memory, the memory storing machine executable instructions executable by the processor, the processor executing the machine executable instructions to implement the steps of the website information identifying method of any one of claims 1 to 6.
9. A computer readable storage medium having stored thereon machine executable instructions which, when invoked and executed by a processor, cause the processor to carry out the steps of the website information identification method of any one of claims 1 to 6.
CN201910565890.3A 2019-06-26 2019-06-26 Website information identification method and device and electronic equipment Active CN110275958B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910565890.3A CN110275958B (en) 2019-06-26 2019-06-26 Website information identification method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910565890.3A CN110275958B (en) 2019-06-26 2019-06-26 Website information identification method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110275958A CN110275958A (en) 2019-09-24
CN110275958B true CN110275958B (en) 2021-07-27

Family

ID=67962420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910565890.3A Active CN110275958B (en) 2019-06-26 2019-06-26 Website information identification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110275958B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3783854B1 (en) * 2019-08-23 2021-12-01 Worldline Security server for dynamic verification of web content, end user's remote device, system comprising said end user's remote device and server, and method implemented by said system
CN110807197A (en) * 2019-10-31 2020-02-18 支付宝(杭州)信息技术有限公司 Training method and device for recognition model and risk website recognition method and device
CN111078979A (en) * 2019-11-29 2020-04-28 上海观安信息技术股份有限公司 Method and system for identifying network credit website based on OCR and text processing technology
CN111126373A (en) * 2019-12-23 2020-05-08 北京中科神探科技有限公司 Internet short video violation judgment device and method based on cross-modal identification technology
CN111311554B (en) * 2020-01-21 2023-09-01 腾讯科技(深圳)有限公司 Content quality determining method, device, equipment and storage medium for graphic content
CN111767918A (en) * 2020-02-21 2020-10-13 北京沃东天骏信息技术有限公司 Picture identification method and device
CN111652622B (en) * 2020-05-26 2023-08-01 支付宝(杭州)信息技术有限公司 Risk website identification method and device and electronic equipment
CN111767493A (en) * 2020-07-07 2020-10-13 杭州安恒信息技术股份有限公司 Method, device, equipment and storage medium for displaying content data of website
CN112101335B (en) * 2020-08-25 2022-04-15 深圳大学 APP violation monitoring method based on OCR and transfer learning
CN112347402A (en) * 2020-10-21 2021-02-09 上海淇玥信息技术有限公司 Illegal website/APP automatic identification method, system and electronic device
CN112199569A (en) * 2020-10-29 2021-01-08 重庆撼地大数据有限公司 Method and system for identifying prohibited website, computer equipment and storage medium
CN112508627B (en) * 2020-12-21 2022-11-04 苏州三六零智能安全科技有限公司 Advertisement address determining method, device, equipment and storage medium
CN112738567B (en) * 2020-12-22 2023-03-10 北京百度网讯科技有限公司 Platform content processing method and device, electronic equipment and storage medium
CN113177409B (en) * 2021-05-06 2024-05-31 上海慧洲信息技术有限公司 Intelligent sensitive word recognition system
CN113221890A (en) * 2021-05-25 2021-08-06 深圳市瑞驰信息技术有限公司 OCR-based cloud mobile phone text content supervision method, system and system
CN113688346A (en) * 2021-08-16 2021-11-23 杭州安恒信息技术股份有限公司 Illegal website identification method, device, equipment and storage medium
CN116939292B (en) * 2023-09-15 2023-11-24 天津市北海通信技术有限公司 Video text content monitoring method and system in rail transit environment

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101055621A (en) * 2006-04-10 2007-10-17 中国科学院自动化研究所 Content based sensitive web page identification method
CN101281521A (en) * 2007-04-05 2008-10-08 中国科学院自动化研究所 Method and system for filtering sensitive web page based on multiple classifier amalgamation
CN102521284A (en) * 2011-11-28 2012-06-27 优视科技有限公司 Page screenshot processing method and device based on mobile terminal browser
US8819022B1 (en) * 2011-08-08 2014-08-26 Aol Inc. Systems and methods for identifying and managing topical content for websites
CN105302884A (en) * 2015-10-19 2016-02-03 天津海量信息技术有限公司 Deep learning-based webpage mode recognition method and visual structure learning method
CN105975523A (en) * 2016-04-28 2016-09-28 浙江乾冠信息安全研究院有限公司 Hidden hyperlink detection method based on stack
CN106095903A (en) * 2016-06-08 2016-11-09 成都三零凯天通信实业有限公司 A kind of radio and television the analysis of public opinion method and system based on degree of depth learning art
CN106445998A (en) * 2016-05-26 2017-02-22 达而观信息科技(上海)有限公司 Text content auditing method and system based on sensitive word
CN106528769A (en) * 2016-11-04 2017-03-22 乐视控股(北京)有限公司 Data acquisition method and apparatus
CN107403200A (en) * 2017-08-10 2017-11-28 北京亚鸿世纪科技发展有限公司 Improve the multiple imperfect picture sorting technique of image segmentation algorithm combination deep learning
CN107818077A (en) * 2016-09-13 2018-03-20 北京金山云网络技术有限公司 A kind of sensitive content recognition methods and device
CN107862050A (en) * 2017-11-08 2018-03-30 国网四川省电力公司信息通信公司 A kind of web site contents safety detecting system and method
CN107957872A (en) * 2017-10-11 2018-04-24 中国互联网络信息中心 A kind of full web site source code acquisition methods and illegal website detection method, system
CN108052523A (en) * 2017-11-03 2018-05-18 中国互联网络信息中心 Gambling site recognition methods and system based on convolutional neural networks
CN108200191A (en) * 2018-01-29 2018-06-22 杭州电子科技大学 Utilize the client dynamic URL associated script character string detecting systems of perturbation method
CN108647309A (en) * 2018-05-09 2018-10-12 达而观信息科技(上海)有限公司 Chat content checking method based on sensitive word and system
CN109660552A (en) * 2019-01-03 2019-04-19 杭州电子科技大学 A kind of Web defence method combining address jump and WAF technology
CN109831751A (en) * 2019-01-04 2019-05-31 上海创蓝文化传播有限公司 A kind of short message content air control system and method based on natural language processing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10223742B2 (en) * 2015-08-26 2019-03-05 Google Llc Systems and methods for selecting third party content based on feedback

Also Published As

Publication number Publication date
CN110275958A (en) 2019-09-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant