CN114239689A - Multi-mode-based website type judgment method and device - Google Patents

Info

Publication number
CN114239689A
Authority
CN
China
Prior art keywords
website
text
webpage
web page
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111392189.XA
Other languages
Chinese (zh)
Inventor
林淑强
毕永辉
梁煜麓
王兵
鄢小征
朱聚江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meiya Pico Information Co Ltd
Original Assignee
Xiamen Meiya Pico Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meiya Pico Information Co Ltd
Priority to CN202111392189.XA
Publication of CN114239689A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256 Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a multi-modal website type judgment method and device. The method comprises: crawling a webpage html file and a webpage screenshot based on the URL of the website; identifying the webpage screenshot with a first neural network model to determine a picture classification label of the website; identifying the webpage html file with a second neural network model and a third neural network model to determine a content text semantic label and a title text semantic label of the website, and acquiring a filing information label from a supervision information platform based on the website URL; and determining the final type of the website based on the picture classification label, the content text semantic label, the title text semantic label and the filing information label. The invention uses multi-modal techniques to fuse the several classification results under a specifically designed fusion strategy, which greatly improves the recognition rate of website types.

Description

Multi-mode-based website type judgment method and device
Technical Field
The invention relates to the technical field of machine learning, in particular to a multi-mode-based website type judgment method and device.
Background
A website is a collection of web pages published on the Internet, built with tools such as HTML (HyperText Markup Language), for displaying specific content. A website is a communication tool on the Internet: through a web browser, people can visit websites to obtain a wide range of information.
With the development of the Internet, the number of websites is enormous and grows every day, and everyone visits a variety of websites in daily life. Judging the category of a website is therefore of practical importance. For example, a person's interests and hobbies can be mined from the types of websites they visit, which supports personalized recommendation and targeted marketing.
In the prior art, the category of a website is mainly judged in the following ways: 1. extracting category features from the URL of the website to judge its type; 2. extracting the content of the web page and judging its category with a keyword strategy, or with text semantic features and a machine learning method such as Bayesian classification; 3. taking a screenshot of the website home page, applying OCR to extract the text on the page, and then judging the category as in method 2. All of these methods rely on a single modality (URL or text). They perform poorly on some newer kinds of web pages, for example websites that belong to the investment and financing category but convey that category through the images on the page rather than through its text, so the accuracy of these methods is very low for such sites.
Disclosure of Invention
The present invention proposes the following technical solutions to address one or more technical defects in the prior art.
A multi-modal-based website type judgment method comprises the following steps:
a crawling step, wherein a webpage html file and a webpage screenshot are crawled based on the URL of the website;
a webpage screenshot identification step, wherein the webpage screenshot is identified using a first neural network model to determine a picture classification label of the website;
a webpage file identification step, wherein the webpage html file is identified using a second neural network model and a third neural network model to determine a content text semantic label and a title text semantic label of the website, and a filing information label is acquired from a supervision information platform based on the URL of the website;
and a fusion step, wherein the final type of the website is determined based on the picture classification label, the content text semantic label, the title text semantic label and the filing information label.
Further, before the webpage html file and the webpage screenshot are crawled based on the URL of the website, it is judged whether the URL is in a website knowledge base; if so, the webpage category information is output according to the corresponding entry in the website knowledge base; if not, the webpage html file and the webpage screenshot are crawled based on the URL of the website.
Furthermore, when the webpage screenshot is identified, the webpage screenshot picture File_img is input into the trained first neural network model, image features are extracted, and a classifier outputs the picture classification label [formula image: Figure BDA0003364526660000021], wherein [formula image: Figure BDA0003364526660000031] is the confidence value that the web page belongs to category i.
Furthermore, when the webpage file is identified, the webpage html file File_html is parsed to obtain the content text of the webpage, denoted text, and its title text, denoted text_Title; text is input into the trained second neural network model, text semantic features are extracted, and a text classifier outputs the content text semantic label [formula image: Figure BDA0003364526660000032], wherein [formula image: Figure BDA0003364526660000033] is the confidence value that the web page belongs to category i; text_Title is input into the trained third neural network model, text semantic features are extracted, and a text classifier outputs the title text semantic label [formula image: Figure BDA0003364526660000034], wherein [formula image: Figure BDA0003364526660000035] is the confidence value that the web page belongs to category i; the supervision platform is queried for whether the webpage has filing information, giving the filing information label P_record, where P_record is 1 if a filing record exists and 0 otherwise.
Further, the picture classification label [formula image: Figure BDA0003364526660000036], the content text semantic label [formula image: Figure BDA0003364526660000037], the title text semantic label [formula image: Figure BDA0003364526660000038] and the filing information label P_record are input into a predetermined multi-modal fusion strategy calculation formula to judge the webpage category, the calculation formula being:
[formula image: Figure BDA0003364526660000039]
wherein y_i = max(y_0, y_1, ..., y_n) indicates that the final category of the web page is i, W_x (x = 1, 2, 3, 4) denotes a weight value, and B is a constant.
The invention also provides a multi-mode-based website type judgment device, which comprises:
a crawling unit, which crawls a webpage html file and a webpage screenshot based on the URL of the website;
a webpage screenshot identification unit, which identifies the webpage screenshot using a first neural network model to determine a picture classification label of the website;
a webpage file identification unit, which identifies the webpage html file using a second neural network model and a third neural network model to determine a content text semantic label and a title text semantic label of the website, and acquires a filing information label from a supervision information platform based on the URL of the website;
and a fusion unit, which determines the final type of the website based on the picture classification label, the content text semantic label, the title text semantic label and the filing information label.
Further, before the webpage html file and the webpage screenshot are crawled based on the URL of the website, it is judged whether the URL is in a website knowledge base; if so, the webpage category information is output according to the corresponding entry in the website knowledge base; if not, the webpage html file and the webpage screenshot are crawled based on the URL of the website.
Furthermore, when the webpage screenshot is identified, the webpage screenshot picture File_img is input into the trained first neural network model, image features are extracted, and a classifier outputs the picture classification label [formula image: Figure BDA0003364526660000041], wherein [formula image: Figure BDA0003364526660000042] is the confidence value that the web page belongs to category i.
Furthermore, when the webpage file is identified, the webpage html file File_html is parsed to obtain the content text of the webpage, denoted text, and its title text, denoted text_Title; text is input into the trained second neural network model, text semantic features are extracted, and a text classifier outputs the content text semantic label [formula image: Figure BDA0003364526660000051], wherein [formula image: Figure BDA0003364526660000052] is the confidence value that the web page belongs to category i; text_Title is input into the trained third neural network model, text semantic features are extracted, and a text classifier outputs the title text semantic label [formula image: Figure BDA0003364526660000053], wherein [formula image: Figure BDA0003364526660000054] is the confidence value that the web page belongs to category i; the supervision platform is queried for whether the webpage has filing information, giving the filing information label P_record, where P_record is 1 if a filing record exists and 0 otherwise.
Further, the picture classification label [formula image: Figure BDA0003364526660000055], the content text semantic label [formula image: Figure BDA0003364526660000056], the title text semantic label [formula image: Figure BDA0003364526660000057] and the filing information label P_record are input into a predetermined multi-modal fusion strategy calculation formula to judge the webpage category, the calculation formula being:
[formula image: Figure BDA0003364526660000058]
wherein y_i = max(y_0, y_1, ..., y_n) indicates that the final category of the web page is i, W_x (x = 1, 2, 3, 4) denotes a weight value, and B is a constant.
The invention also proposes a computer-readable storage medium having stored thereon computer program code which, when executed by a computer, performs any of the methods described above.
The technical effects of the invention are as follows. The invention discloses a multi-modal website type judgment method and device, wherein the method comprises: a crawling step, wherein a webpage html file and a webpage screenshot are crawled based on the URL of the website; a webpage screenshot identification step, wherein the webpage screenshot is identified using a first neural network model to determine a picture classification label of the website; a webpage file identification step, wherein the webpage html file is identified using a second neural network model and a third neural network model to determine a content text semantic label and a title text semantic label of the website, and a filing information label is acquired from a supervision information platform based on the URL of the website; and a fusion step, wherein the final type of the website is determined based on the picture classification label, the content text semantic label, the title text semantic label and the filing information label.
In the invention, the screenshot of the web page is classified by a trained neural network picture classification model using image semantics, which makes full use of the meaning carried by the images on the page and addresses the problem described in the background, namely that websites whose category information is conveyed by page images, including pages that contain only pictures, are difficult to identify accurately; the html file of the web page is acquired, the page structure is parsed, the page text content is extracted, and the page is classified by a trained neural network text semantic classification model; the title of the website is extracted from its html file, and because title information is more concise and suffers less semantic interference, some websites can be judged more accurately from the title semantics; the filing information of the website, such as the website filing number, is extracted from the page content and verified on the supervision platform, and whether a valid filing exists is used as one dimension of information for judging the website type; finally, multi-modal techniques are used to fuse the several classification results under a specifically designed fusion strategy, which greatly improves the recognition rate of website types.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
Fig. 1 is a flowchart of a multi-modal based website type determination method according to an embodiment of the present invention.
Fig. 2 is a block diagram of a multi-modal based website type determination apparatus according to an embodiment of the present invention.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein merely illustrate the invention and do not restrict it. It should be noted that, for convenience of description, only the portions relevant to the invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows a multi-modal website type judgment method of the present invention, which comprises:
A crawling step S101, wherein a webpage html file and a webpage screenshot are crawled based on the URL of the website; in the invention, a crawler program is designed that crawls the html file and the screenshot of the website according to the URL of the webpage, where the URL is, for example, input by a user or acquired when the user clicks an advertisement link, although the invention is not limited to these sources.
A webpage screenshot identification step S102, wherein the webpage screenshot is identified using a first neural network model to determine a picture classification label of the website; the first neural network model, also called a neural network image classification model, must be trained before use.
A webpage file identification step S103, wherein the webpage html file is identified using a second neural network model and a third neural network model to determine a content text semantic label and a title text semantic label of the website, and a filing information label is acquired from a supervision information platform based on the URL of the website; the second and third neural network models, also called neural network text semantic classification models, must be trained before use.
A fusion step S104, wherein the final type of the website is determined based on the picture classification label, the content text semantic label, the title text semantic label and the filing information label.
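The data flow of steps S101 to S104 can be sketched as a small orchestration function. This is only an illustrative sketch: the function name `judge_website_type` and every callable passed into it (`crawl`, `classify_picture`, `classify_content`, `classify_title`, `query_filing`, `fuse`) are hypothetical placeholders for the crawler, the three trained models, the supervision platform query and the fusion formula, none of which are specified at code level in the text.

```python
from typing import Callable, Sequence, Tuple

Confidences = Sequence[float]  # one confidence value per website category


def judge_website_type(
    url: str,
    crawl: Callable[[str], Tuple[str, bytes]],
    classify_picture: Callable[[bytes], Confidences],
    classify_content: Callable[[str], Confidences],
    classify_title: Callable[[str], Confidences],
    query_filing: Callable[[str], int],
    fuse: Callable[[Confidences, Confidences, Confidences, int], int],
) -> int:
    """Run the four steps in order and return the index of the final category."""
    # S101: crawl the html file and the screenshot for the URL.
    html, screenshot = crawl(url)
    # S102: picture classification label from the screenshot (first model).
    picture_label = classify_picture(screenshot)
    # S103: content and title semantic labels (second and third models),
    # plus the filing information label (1 or 0) from the supervision platform.
    content_label = classify_content(html)
    title_label = classify_title(html)
    filing_label = query_filing(url)
    # S104: fuse the four labels into one final category.
    return fuse(picture_label, content_label, title_label, filing_label)
```

Passing the components in as callables keeps the sketch independent of any particular crawler or model library, so each stage can be replaced by a stub when testing.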
In the invention, a trained neural network picture classification model classifies websites using image semantics, which makes full use of the meaning carried by the images on the page and addresses the problem described in the background, namely that websites whose category information is conveyed by page images are difficult to identify accurately; the html file of the web page is acquired, the page structure is parsed, the text content and the title content of the page are extracted, the content and the title are classified by trained neural network text semantic classification models, and it is determined whether the website has been filed; the results are then fused using multi-modal techniques, which improves the accuracy of website category judgment. This is one of the important inventive points of the invention.
In one embodiment, before the webpage html file and the webpage screenshot are crawled based on the URL of the website, it is judged whether the URL is in a website knowledge base; if so, the webpage category information is output according to the corresponding entry in the website knowledge base; if not, the webpage html file and the webpage screenshot are crawled based on the URL of the website. The invention introduces a website classification knowledge base that is built from several sources, such as a knowledge base of filed domain names and entries collected and accumulated in day-to-day work. The knowledge base is checked manually to ensure its accuracy, which avoids the category misjudgments that arise when classification relies entirely on automated analysis. URLs are screened against the knowledge base first, and if a URL is already in the knowledge base the subsequent steps need not be executed, which improves both the accuracy and the speed of website classification. This is one of the important inventive points of the invention.
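The knowledge base check described above amounts to a lookup that short-circuits the rest of the pipeline. The following is a minimal sketch under stated assumptions: the knowledge base is modelled as an in-memory dictionary keyed by host name, and `normalize_host`, `lookup_or_classify` and the `classify` callback are hypothetical names. The text does not say how URLs are matched against the knowledge base, so matching on a lowercased host without a leading `www.` is an assumption.

```python
from urllib.parse import urlsplit


def normalize_host(url: str) -> str:
    """Reduce a URL to a lowercase host name without a leading 'www.' (assumed key format)."""
    # urlsplit only fills in the host when the string contains '//'.
    parts = urlsplit(url if "//" in url else "//" + url)
    host = (parts.hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def lookup_or_classify(url, knowledge_base, classify):
    """Return (category, source); fall back to the full pipeline only on a miss."""
    category = knowledge_base.get(normalize_host(url))
    if category is not None:
        # Hit: the manually checked entry is used and no crawling is needed.
        return category, "knowledge_base"
    # Miss: run the crawl, identification and fusion steps.
    return classify(url), "model"
```

For example, with `knowledge_base = {"example.com": "news"}`, the URL `http://www.example.com/a` is answered from the knowledge base, while an unknown URL is handed to `classify`.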
In one embodiment, when the webpage screenshot is identified, the webpage screenshot picture File_img is input into the trained first neural network model, image features are extracted, and a classifier outputs the picture classification label [formula image: Figure BDA0003364526660000091], wherein [formula image: Figure BDA0003364526660000092] is the confidence value that the web page belongs to category i, and n is an integer greater than or equal to 2. The first neural network model may be a deep neural network model.
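The text does not state how the classifier turns the extracted features into per-category confidence values. A softmax over raw class scores is a common choice and is assumed here purely for illustration; the function name `softmax` and the example scores are not from the source.

```python
import math


def softmax(scores):
    """Map raw class scores to confidence values that are positive and sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three categories (index i corresponds to category i).
confidences = softmax([1.0, 3.0, 2.0])
most_likely = confidences.index(max(confidences))
```

The resulting list plays the role of the picture classification label: entry i is the confidence that the page belongs to category i.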
In one embodiment, when the webpage file is identified, the webpage html file File_html is parsed to obtain the content text of the webpage, denoted text, and its title text, denoted text_Title; text is input into the trained second neural network model, text semantic features are extracted, and a text classifier outputs the content text semantic label [formula image: Figure BDA0003364526660000093], wherein [formula image: Figure BDA0003364526660000094] is the confidence value that the web page belongs to category i; text_Title is input into the trained third neural network model, text semantic features are extracted, and a text classifier outputs the title text semantic label [formula image: Figure BDA0003364526660000095], wherein [formula image: Figure BDA0003364526660000096] is the confidence value that the web page belongs to category i; the supervision platform is queried for whether the webpage has filing information, giving the filing information label P_record, where P_record is 1 if a filing record exists and 0 otherwise. The second and third neural network models may be bi-LSTM neural networks.
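Parsing the html file into a title text and a content text can be sketched with Python's standard `html.parser` module. This is a minimal sketch, not the parser used in the invention: the class name `TitleAndTextExtractor`, the helper `extract_title_and_text`, and the choice to drop `script` and `style` content are all assumptions.

```python
from html.parser import HTMLParser


class TitleAndTextExtractor(HTMLParser):
    """Collect the <title> text separately from the remaining visible text."""

    SKIPPED_TAGS = {"script", "style"}  # assumed to carry no category information

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        chunk = data.strip()
        if not chunk or self._skip_depth:
            return
        (self.title_parts if self._in_title else self.text_parts).append(chunk)


def extract_title_and_text(html):
    """Return (title text, content text) for one html document."""
    parser = TitleAndTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.text_parts)
```

The two returned strings correspond to text_Title and text, which would then be passed to the third and second models respectively.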
In one embodiment, the picture classification label [formula image: Figure BDA0003364526660000097], the content text semantic label [formula image: Figure BDA0003364526660000098], the title text semantic label [formula image: Figure BDA0003364526660000099] and the filing information label P_record are input into a predetermined multi-modal fusion strategy calculation formula to judge the webpage category, the calculation formula being:
[formula image: Figure BDA0003364526660000101]
wherein y_i = max(y_0, y_1, ..., y_n) indicates that the final category of the web page is i, W_x (x = 1, 2, 3, 4) denotes a weight value, and B is a constant. The specific values of W_x may be obtained, for example, by machine learning on historical data or by fitting to historical data.
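The fusion formula itself survives only as an image reference in this text, so its exact form is not known here. The sketch below assumes one plausible reading, a weighted linear sum per category followed by taking the maximum, which is consistent with four weights W_1 to W_4, a constant B, and y_i = max(y_0, ..., y_n). The function name `fuse`, the argument names and the example numbers are hypothetical.

```python
def fuse(p_img, p_text, p_title, p_record, weights, bias=0.0):
    """Assumed form: y_i = W1*p_img[i] + W2*p_text[i] + W3*p_title[i] + W4*p_record + B."""
    w1, w2, w3, w4 = weights
    if not (len(p_img) == len(p_text) == len(p_title)):
        raise ValueError("all label vectors must cover the same categories")
    scores = [
        w1 * a + w2 * b + w3 * c + w4 * p_record + bias
        for a, b, c in zip(p_img, p_text, p_title)
    ]
    # The final category is the index i whose score y_i is the largest.
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


best, scores = fuse(
    p_img=[0.1, 0.7, 0.2],
    p_text=[0.6, 0.3, 0.1],
    p_title=[0.2, 0.5, 0.3],
    p_record=1,
    weights=(0.5, 0.3, 0.2, 0.1),
)
```

Under this assumed linear form the filing term and the constant add the same amount to every category, so they shift the scores without changing which category wins; the actual formula may use the filing label differently.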
In the invention: the screenshot of the web page is classified by a trained neural network picture classification model using image semantics, which makes full use of the meaning carried by the images on the page and addresses the problem that websites whose pages contain only pictures are difficult to classify; the html file of the web page is acquired, the page structure is parsed, the page text content is extracted, and the page is classified by a trained neural network text semantic classification model; the title of the website is extracted from its html file, and because title information is more concise and suffers less semantic interference, some websites can be judged more accurately from the title semantics; the filing information of the website, such as the website filing number, is extracted from the page content and verified on the supervision platform, and whether a valid filing exists is used as one dimension of information for judging the website type; finally, multi-modal techniques are used to fuse the several classification results under a specifically designed fusion strategy, which greatly improves the recognition rate of website types. This is another important inventive point of the invention.
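Extracting a filing number from page content and turning the verification result into the 1/0 filing label can be sketched as follows. The regular expression is a simplified, assumed pattern for ICP-style filing numbers such as "京ICP备12345678号-1" and will not cover every real format; `extract_filing_number`, `filing_label` and the `is_valid_on_platform` callback (standing in for the supervision platform query) are hypothetical names.

```python
import re

# Assumed, simplified pattern: one Chinese character, "ICP", 备 or 证,
# at least six digits, 号, and an optional "-<site index>" suffix.
ICP_PATTERN = re.compile(r"[\u4e00-\u9fa5]ICP[备证]\d{6,}号(?:-\d+)?")


def extract_filing_number(page_text):
    """Return the first filing number found in the page text, or None."""
    match = ICP_PATTERN.search(page_text)
    return match.group(0) if match else None


def filing_label(page_text, is_valid_on_platform):
    """Return 1 if a filing number is found and the platform confirms it, else 0."""
    number = extract_filing_number(page_text)
    if number is None:
        return 0
    return 1 if is_valid_on_platform(number) else 0
```

Keeping the platform query behind a callback means the label logic can be exercised offline, for example with `lambda number: True` in place of a real lookup.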
Fig. 2 shows a multi-modal website type judgment device of the present invention, provided on an image acquisition apparatus, the device comprising:
A crawling unit 201, which crawls a webpage html file and a webpage screenshot based on the URL of the website; in the invention, a crawler program is designed that crawls the html file and the screenshot of the website according to the URL of the webpage, where the URL is, for example, input by a user or acquired when the user clicks an advertisement link, although the invention is not limited to these sources.
A webpage screenshot identification unit 202, which identifies the webpage screenshot using a first neural network model to determine a picture classification label of the website; the first neural network model, also called a neural network image classification model, must be trained before use.
A webpage file identification unit 203, which identifies the webpage html file using a second neural network model and a third neural network model to determine a content text semantic label and a title text semantic label of the website, and acquires a filing information label from a supervision information platform based on the URL of the website; the second and third neural network models, also called neural network text semantic classification models, must be trained before use.
A fusion unit 204, which determines the final type of the website based on the picture classification label, the content text semantic label, the title text semantic label and the filing information label.
In the invention, a trained neural network picture classification model classifies websites using image semantics, which makes full use of the meaning carried by the images on the page and addresses the problem described in the background, namely that websites whose category information is conveyed by page images are difficult to identify accurately; the html file of the web page is acquired, the page structure is parsed, the text content and the title content of the page are extracted, the content and the title are classified by trained neural network text semantic classification models, and it is determined whether the website has been filed; the results are then fused using multi-modal techniques, which improves the accuracy of website category judgment. This is one of the important inventive points of the invention.
In one embodiment, before the webpage html file and the webpage screenshot are crawled based on the URL of the website, it is judged whether the URL is in a website knowledge base; if so, the webpage category information is output according to the corresponding entry in the website knowledge base; if not, the webpage html file and the webpage screenshot are crawled based on the URL of the website. The invention introduces a website classification knowledge base that is built from several sources, such as a knowledge base of filed domain names and entries collected and accumulated in day-to-day work. The knowledge base is checked manually to ensure its accuracy, which avoids the category misjudgments that arise when classification relies entirely on automated analysis. URLs are screened against the knowledge base first, and if a URL is already in the knowledge base the subsequent steps need not be executed, which improves both the accuracy and the speed of website classification. This is one of the important inventive points of the invention.
In one embodiment, when the webpage screenshot is identified, the webpage screenshot picture File_img is input into the trained first neural network model, image features are extracted, and a classifier outputs the picture classification label [formula image: Figure BDA0003364526660000121], wherein [formula image: Figure BDA0003364526660000122] is the confidence value that the web page belongs to category i, and n is an integer greater than or equal to 2. The first neural network model may be a deep neural network model.
In one embodiment, when the webpage file is identified, the webpage html file File_html is parsed to obtain the content text of the webpage, denoted text, and its title text, denoted text_Title; text is input into the trained second neural network model, text semantic features are extracted, and a text classifier outputs the content text semantic label [formula image: Figure BDA0003364526660000123], wherein [formula image: Figure BDA0003364526660000124] is the confidence value that the web page belongs to category i; text_Title is input into the trained third neural network model, text semantic features are extracted, and a text classifier outputs the title text semantic label [formula image: Figure BDA0003364526660000125], wherein [formula image: Figure BDA0003364526660000131] is the confidence value that the web page belongs to category i; the supervision platform is queried for whether the webpage has filing information, giving the filing information label P_record, where P_record is 1 if a filing record exists and 0 otherwise. The second and third neural network models may be bi-LSTM neural networks.
In one embodiment, the picture classification label
Figure BDA0003364526660000132
the content text semantic label
Figure BDA0003364526660000133
the title text semantic label
Figure BDA0003364526660000134
and the filing information label $P_{record}$ are input into a predetermined multi-modal fusion strategy calculation formula to judge the webpage category. The calculation formula is:
Figure BDA0003364526660000135
wherein $y_i = \max(y_0, y_1, \ldots, y_n)$ indicates that the final category of the web page is i, $W_x$ ($x = 1, 2, 3, 4$) are weight values, and B is a constant. The specific values of $W_x$ may be obtained by machine learning from historical data, by fitting to historical data, and so on.
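The fusion formula itself appears only as an image in this text, so its exact form cannot be recovered here. The sketch below ASSUMES a simple linear combination, one weight per input plus the constant B, followed by taking the category with the largest score. The function `fuse`, the weight values, and the example confidence vectors are hypothetical; the actual formula may combine the four labels differently.

```python
def fuse(p_img, p_text, p_title, p_record, weights, bias=0.0):
    """ASSUMED linear fusion (not taken from the patent's formula image):
    y_i = W1*p_img[i] + W2*p_text[i] + W3*p_title[i] + W4*p_record + B
    Returns the index of the largest y_i together with all scores."""
    w1, w2, w3, w4 = weights
    if not (len(p_img) == len(p_text) == len(p_title)):
        raise ValueError("all label vectors must cover the same categories")
    scores = [
        w1 * a + w2 * b + w3 * c + w4 * p_record + bias
        for a, b, c in zip(p_img, p_text, p_title)
    ]
    final = max(range(len(scores)), key=scores.__getitem__)
    return final, scores

# Hypothetical confidence vectors for three categories and made-up weights.
final, scores = fuse(
    p_img=[0.2, 0.7, 0.1],
    p_text=[0.5, 0.3, 0.2],
    p_title=[0.6, 0.3, 0.1],
    p_record=1,
    weights=(0.2, 0.5, 0.2, 0.1),
)
```

In this assumed form the filing term adds the same amount to every category, so it cannot change which category wins. That suggests the real formula uses the filing label in a category-dependent way, which this sketch does not attempt to guess.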
In the invention, a screenshot of the web page is classified by a trained neural network picture classification model using image semantics. This makes full use of the meaning carried by the page's visual design and addresses the difficulty of classifying websites whose pages contain only pictures. The html file of the web page is acquired, the page structure is parsed, the text content is extracted, and the page is classified by a trained neural network text semantic classification model. The title of the website is also extracted from the html file; title information is more concise and subject to less semantic interference, so some websites can be judged more accurately from the title semantics. The filing information is checked on the supervision platform using the URL of the website, and whether the website has a legal filing record is used as one dimension of information for judging the website type. Finally, a multi-modal technique fuses the several classification results under a specifically designed fusion strategy, greatly improving the recognition rate of website types; this is another important inventive point of the invention.
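The overall flow summarized above, together with the knowledge-base lookup recited in claims 2 and 7, can be sketched as a single function whose collaborators are passed in as callables. Everything named here (`judge_website`, the `models` dictionary keys, the stand-in crawler, parser, and fusion function) is hypothetical scaffolding for illustration, not the patent's implementation.

```python
def judge_website(url, knowledge_base, crawl, parse, models, has_record, fuse):
    """Return a category index for url; all collaborators are injected."""
    if url in knowledge_base:            # known site: answer without crawling
        return knowledge_base[url]
    html, screenshot = crawl(url)        # crawling step
    text, title = parse(html)            # content text and title text
    p_img = models["image"](screenshot)  # picture classification label
    p_text = models["content"](text)     # content text semantic label
    p_title = models["title"](title)     # title text semantic label
    p_record = 1 if has_record(url) else 0  # filing information label
    return fuse(p_img, p_text, p_title, p_record)

# Stand-in collaborators so the sketch can run on its own.
crawled = []

def fake_crawl(url):
    crawled.append(url)
    return "<html></html>", b"screenshot-bytes"

def fake_parse(html):
    return "body text", "page title"

def sum_fuse(p_img, p_text, p_title, p_record):
    # Placeholder fusion: add the three vectors and take the largest entry.
    scores = [a + b + c for a, b, c in zip(p_img, p_text, p_title)]
    return max(range(len(scores)), key=scores.__getitem__)

models = {
    "image": lambda shot: [0.1, 0.8, 0.1],
    "content": lambda text: [0.2, 0.6, 0.2],
    "title": lambda title: [0.3, 0.5, 0.2],
}
kb = {"http://known.example": 2}

known = judge_website("http://known.example", kb, fake_crawl, fake_parse,
                      models, lambda u: True, sum_fuse)
fresh = judge_website("http://new.example", kb, fake_crawl, fake_parse,
                      models, lambda u: False, sum_fuse)
```

Injecting the crawler, parser, models, and fusion function keeps the control flow visible without committing to any particular library for crawling or inference.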
For convenience of description, the above devices are described as being divided into various units by function, and the units are described separately. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments, or in some portions of the embodiments, of the present application.
Finally, it should be noted that although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications and equivalent substitutions may be made without departing from the spirit and scope of the invention, and all such modifications and equivalents are intended to fall within the scope of the appended claims.

Claims (10)

1. A multi-modal-based website type judgment method is characterized by comprising the following steps:
a crawling step, wherein web page html files and web page screenshots are crawled based on the URL of the website;
identifying the webpage screenshot, namely identifying the webpage screenshot by using a first neural network model to determine a picture classification label of the website;
identifying a webpage file, namely identifying the webpage html file by using a second neural network model and a third neural network to determine a content text semantic label and a title text semantic label of the website, and acquiring a filing information label based on a URL (uniform resource locator) of the website through a supervision information platform;
and a fusion step, namely determining the final type of the website based on the picture classification label, the content text semantic label, the title text semantic label and the record information label.
2. The method according to claim 1, wherein before crawling the web page html file and the web page screenshot based on the URL of the website, judging whether the web page URL is in a website knowledge base, if so, outputting the web page category information according to the corresponding information in the website knowledge base; if not, crawling a webpage html file and a webpage screenshot based on the URL of the website.
3. The method of claim 2, wherein, when the webpage screenshot is identified, the screenshot picture $File_{img}$ is input into a trained first neural network model, image features are extracted, and a classifier outputs the picture classification label
Figure FDA0003364526650000011
wherein
Figure FDA0003364526650000012
represents a confidence value that the web page belongs to category i.
4. The method of claim 3, wherein, when the webpage file is identified, the webpage html file $File_{html}$ is parsed to obtain the text content $text$ and the title text $text_{Title}$ of the webpage; the text content $text$ is input into a trained second neural network model, text semantic features are extracted, and a text classifier outputs the content text semantic label
Figure FDA0003364526650000021
wherein
Figure FDA0003364526650000022
represents a confidence value that the web page belongs to category i; the title text $text_{Title}$ is input into a trained third neural network model, text semantic features are extracted, and a text classifier outputs the title text semantic label
Figure FDA0003364526650000023
wherein
Figure FDA0003364526650000024
represents a confidence value that the web page belongs to category i; and the supervision platform is queried to determine whether the webpage has a filing record, giving the filing information label $P_{record}$, wherein $P_{record}$ is 1 if a record exists and $P_{record}$ is 0 if no record exists.
5. The method of claim 4, wherein the picture classification label
Figure FDA0003364526650000025
the content text semantic label
Figure FDA0003364526650000026
the title text semantic label
Figure FDA0003364526650000027
and the filing information label $P_{record}$ are input into a predetermined multi-modal fusion strategy calculation formula to judge the webpage category, the calculation formula being:
Figure FDA0003364526650000028
wherein $y_i = \max(y_0, y_1, \ldots, y_n)$ indicates that the final category of the web page is i, $W_x$ represents a weight value with $x = 1, 2, 3, 4$, and B is a constant.
6. A multi-modal-based website type determination apparatus, comprising:
the crawling unit crawls a webpage html file and a webpage screenshot based on the URL of the website;
the webpage screenshot recognition unit is used for recognizing the webpage screenshot by using a first neural network model to determine a picture classification label of the website;
the web page file identification unit is used for identifying and determining a content text semantic label and a title text semantic label of the website by using a second neural network model and a third neural network, and acquiring a filing information label based on a URL (uniform resource locator) of the website through a supervision information platform;
and the fusion unit is used for determining the final type of the website based on the picture classification label, the content text semantic label, the title text semantic label and the record information label.
7. The apparatus of claim 6, wherein before crawling the web page html file and the web page screenshot based on the URL of the website, determining whether the web page URL is in a website knowledge base, if so, outputting the web page category information according to the corresponding information in the website knowledge base; if not, crawling a webpage html file and a webpage screenshot based on the URL of the website.
8. The apparatus of claim 7, wherein, when the webpage screenshot is identified, the screenshot picture $File_{img}$ is input into a trained first neural network model, image features are extracted, and a classifier outputs the picture classification label
Figure FDA0003364526650000031
wherein
Figure FDA0003364526650000032
represents a confidence value that the web page belongs to category i.
9. The apparatus of claim 8, wherein, when the webpage file is identified, the webpage html file $File_{html}$ is parsed to obtain the text content $text$ and the title text $text_{Title}$ of the webpage; the text content $text$ is input into a trained second neural network model, text semantic features are extracted, and a text classifier outputs the content text semantic label
Figure FDA0003364526650000041
wherein
Figure FDA0003364526650000042
represents a confidence value that the web page belongs to category i; the title text $text_{Title}$ is input into a trained third neural network model, text semantic features are extracted, and a text classifier outputs the title text semantic label
Figure FDA0003364526650000043
wherein
Figure FDA0003364526650000044
represents a confidence value that the web page belongs to category i; and the supervision platform is queried to determine whether the webpage has a filing record, giving the filing information label $P_{record}$, wherein $P_{record}$ is 1 if a record exists and $P_{record}$ is 0 if no record exists.
10. The apparatus of claim 9, wherein the picture classification label
Figure FDA0003364526650000045
the content text semantic label
Figure FDA0003364526650000046
the title text semantic label
Figure FDA0003364526650000047
and the filing information label $P_{record}$ are input into a predetermined multi-modal fusion strategy calculation formula to judge the webpage category, the calculation formula being:
Figure FDA0003364526650000048
wherein $y_i = \max(y_0, y_1, \ldots, y_n)$ indicates that the final category of the web page is i, $W_x$ represents a weight value with $x = 1, 2, 3, 4$, and B is a constant.
CN202111392189.XA 2021-11-19 2021-11-19 Multi-mode-based website type judgment method and device Pending CN114239689A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111392189.XA CN114239689A (en) 2021-11-19 2021-11-19 Multi-mode-based website type judgment method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111392189.XA CN114239689A (en) 2021-11-19 2021-11-19 Multi-mode-based website type judgment method and device

Publications (1)

Publication Number Publication Date
CN114239689A true CN114239689A (en) 2022-03-25

Family

ID=80750495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111392189.XA Pending CN114239689A (en) 2021-11-19 2021-11-19 Multi-mode-based website type judgment method and device

Country Status (1)

Country Link
CN (1) CN114239689A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114662033A (en) * 2022-04-06 2022-06-24 昆明信息港传媒有限责任公司 Multi-modal harmful link recognition based on text and image
CN114662033B (en) * 2022-04-06 2024-05-03 昆明信息港传媒有限责任公司 Multi-mode harmful link identification based on text and image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination