CN114239689A

CN114239689A - Multi-mode-based website type judgment method and device

Info

Publication number: CN114239689A
Application number: CN202111392189.XA
Authority: CN
Inventors: 林淑强; 毕永辉; 梁煜麓; 王兵; 鄢小征; 朱聚江
Original assignee: Xiamen Meiya Pico Information Co Ltd
Current assignee: Xiamen Meiya Pico Information Co Ltd
Priority date: 2021-11-19
Filing date: 2021-11-19
Publication date: 2022-03-25

Abstract

The invention provides a multi-modal website type judgment method and device. The method comprises: crawling a webpage html file and a webpage screenshot based on the URL of the website; identifying the webpage screenshot with a first neural network model to determine a picture classification label of the website; identifying the webpage html file with a second neural network model and a third neural network model to determine a content text semantic label and a title text semantic label of the website, and acquiring a filing information label based on the website URL through a supervision information platform; and determining the final type of the website based on the picture classification label, the content text semantic label, the title text semantic label and the filing information label. The invention fuses the several classification results using a multi-modal technique with a specifically designed fusion strategy, which greatly improves the recognition rate of website types.

Description

Multi-mode-based website type judgment method and device

Technical Field

The invention relates to the technical field of machine learning, in particular to a multi-mode-based website type judgment method and device.

Background

A website is a collection of web pages published on the internet, built with tools such as HTML (HyperText Markup Language), for displaying specific content. A website is a communication tool on the internet: through a web browser, people can visit websites to obtain a wide range of information.

With the development of the internet, the number of websites is already enormous and grows every day. In daily life, everyone visits a variety of websites. Judging the category of a website is therefore of practical importance; for example, a person's interests and hobbies can be inferred from the types of websites they visit, enabling personalized recommendation and targeted marketing.

In the prior art, website categories are mainly judged in the following ways: 1. extracting category features from the URL of the website and judging the website type from them; 2. extracting the content of the web page and judging its category with a keyword strategy, or with text semantic features and a machine learning method such as Bayesian classification; 3. taking a screenshot of the website homepage, applying OCR to extract the text on the homepage, and then judging the category as in method 2. All of these methods rely on a single modality (URL or text). For some newer kinds of web pages, for example websites of the investment and financing category, the category information is conveyed by the graphics on the page rather than by its text, so the accuracy of these methods is very low.

Disclosure of Invention

The present invention proposes the following technical solutions to address one or more technical defects in the prior art.

A multi-modal website type judgment method comprises the following steps:

a crawling step of crawling a webpage html file and a webpage screenshot based on the URL of the website;

a webpage screenshot identification step of identifying the webpage screenshot with a first neural network model to determine a picture classification label of the website;

a webpage file identification step of identifying the webpage html file with a second neural network model and a third neural network model to determine a content text semantic label and a title text semantic label of the website, and acquiring a filing information label based on the URL of the website through a supervision information platform;

and a fusion step of determining the final type of the website based on the picture classification label, the content text semantic label, the title text semantic label and the filing information label.

Further, before the webpage html file and the webpage screenshot are crawled based on the URL of the website, it is judged whether the webpage URL is in a website knowledge base; if so, the webpage category information is output according to the corresponding entry in the website knowledge base; if not, the webpage html file and the webpage screenshot are crawled based on the URL of the website.

Furthermore, when the webpage screenshot is identified, the webpage screenshot picture File_img is input into the trained first neural network model, image features are extracted, and a classifier outputs the picture classification label, which contains a confidence value for each category i indicating that the web page belongs to category i.

Furthermore, when the webpage file is identified, the webpage html file File_html is parsed to obtain the content text Text and the title text Text_Title of the webpage. Text is input into the trained second neural network model, text semantic features are extracted, and a text classifier outputs the content text semantic label, which contains a confidence value for each category i indicating that the web page belongs to category i. Text_Title is input into the trained third neural network model, text semantic features are extracted, and a text classifier outputs the title text semantic label, which likewise contains a confidence value for each category i. Whether the webpage has filing information is queried through the supervision platform to obtain the filing information label P_{Record keeping}: if a filing record exists, P_{Record keeping} is 1; if not, P_{Record keeping} is 0.

Further, the picture classification label, the content text semantic label, the title text semantic label and the filing information label P_{Record keeping} are input into a predetermined multi-modal fusion strategy calculation formula to judge the webpage category. In the calculation formula, yⁱ = max(y⁰, y¹, ..., yⁿ) indicates that the final category of the web page is i, W_x (x = 1, 2, 3, 4) denotes a weight value, and B is a constant.

The invention also provides a multi-mode-based website type judgment device, which comprises:

the crawling unit crawls a webpage html file and a webpage screenshot based on the URL of the website;

the webpage screenshot recognition unit is used for recognizing the webpage screenshot by using a first neural network model to determine a picture classification label of the website;

the webpage file identification unit is used for identifying the webpage html file with a second neural network model and a third neural network model to determine a content text semantic label and a title text semantic label of the website, and for acquiring a filing information label based on the URL of the website through a supervision information platform;

and the fusion unit is used for determining the final type of the website based on the picture classification label, the content text semantic label, the title text semantic label and the filing information label.

The invention also proposes a computer-readable storage medium having stored thereon computer program code which, when executed by a computer, performs any of the methods described above.

The technical effects of the invention are as follows. The invention discloses a multi-modal website type judgment method and device, wherein the method comprises: a crawling step of crawling a webpage html file and a webpage screenshot based on the URL of the website; a webpage screenshot identification step of identifying the webpage screenshot with a first neural network model to determine a picture classification label of the website; a webpage file identification step of identifying the webpage html file with a second neural network model and a third neural network model to determine a content text semantic label and a title text semantic label of the website, and acquiring a filing information label based on the URL of the website through a supervision information platform; and a fusion step of determining the final type of the website based on the picture classification label, the content text semantic label, the title text semantic label and the filing information label.

In the invention, the webpage screenshot is classified with a trained neural network picture classification model using image semantics. This makes full use of the meaning carried by the graphics of the web page and addresses the problem, noted in the background, that a website whose category information is conveyed by its graphics, or whose pages contain only pictures, is difficult to classify accurately. The html file of the web page is acquired, the page structure is parsed, the text content is extracted, and the page is classified with a trained neural network text semantic classification model. The title of the website is also extracted from the html file; title information is more concise and subject to less semantic interference, so some websites can be judged more accurately from title semantics. Filing information of the website, such as the website filing number, is extracted from the webpage content and verified on the supervision platform to check whether it is legitimate, and the result is used as one further dimension for judging the website type. Finally, the several classification results are fused using a multi-modal technique with a specifically designed fusion strategy, which greatly improves the recognition rate of website types.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.

Fig. 1 is a flowchart of a multi-modal based website type determination method according to an embodiment of the present invention.

Fig. 2 is a block diagram of a multi-modal based website type determination apparatus according to an embodiment of the present invention.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

FIG. 1 illustrates a multi-modal based website type determination method of the present invention, which comprises:

a crawling step S101, wherein web page html files and web page screenshots are crawled based on the URL of the website; in the invention, a crawler program is designed, and the html file and the screenshot of the website are crawled according to the url of the webpage, wherein the url of the webpage is input by a user or acquired when the user clicks an advertisement link, and the like, but the invention is not limited to this.

A web screenshot recognition step S102, namely recognizing the web screenshot by using a first neural network model to determine a picture classification label of the website; the first neural network model, also known as a neural network image classification model, requires training before use.

A webpage file identification step S103, identifying the webpage html file with a second neural network model and a third neural network model to determine a content text semantic label and a title text semantic label of the website, and acquiring a filing information label based on the URL of the website through a supervision information platform; the second and third neural network models are also called neural network text semantic classification models and need to be trained before use.

And a fusion step S104, determining the final type of the website based on the picture classification label, the content text semantic label, the title text semantic label and the filing information label.

In the invention, the website is classified from image semantics with a trained neural network picture classification model. This makes full use of the meaning carried by the graphics of the web page and addresses the problem described in the background, namely that a website is difficult to identify accurately when its category information is conveyed by the graphics of its pages. The html file of the web page is acquired, the page structure is parsed, the text content and the title content are extracted, the content and the title are each classified with a trained neural network text semantic classification model, and it is determined whether the website has been filed. The results are then fused using a multi-modal technique, which improves the accuracy of website category judgment. This is one of the important inventive points of the invention.
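The data flow of steps S101 to S104 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: every name (Classifiers, judge_website, and the callable fields) is hypothetical, and the models, crawler and filing lookup are passed in as plain callables so that the sketch stays independent of any particular library.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical alias: one confidence value per category.
Scores = Sequence[float]


@dataclass
class Classifiers:
    """Bundle of the components used by steps S101 to S104 (all assumed names)."""
    crawl: Callable[[str], Tuple[str, bytes]]            # URL -> (html, screenshot)
    classify_image: Callable[[bytes], Scores]            # first model, S102
    classify_content: Callable[[str], Scores]            # second model, S103
    classify_title: Callable[[str], Scores]              # third model, S103
    lookup_filing: Callable[[str], int]                  # 1 if filed, else 0
    fuse: Callable[[Scores, Scores, Scores, int], int]   # S104 -> category index


def judge_website(url: str, c: Classifiers) -> int:
    """Run the four steps in order and return the index of the final category."""
    html, screenshot = c.crawl(url)                       # S101: crawl html and screenshot
    p_img = c.classify_image(screenshot)                  # S102: picture classification label
    # S103: each text callable is assumed to parse the html itself.
    p_text = c.classify_content(html)                     # content text semantic label
    p_title = c.classify_title(html)                      # title text semantic label
    p_filing = c.lookup_filing(url)                       # filing information label
    return c.fuse(p_img, p_text, p_title, p_filing)       # S104: fusion
```

Passing the components as callables keeps each modality replaceable on its own, which matches the way the description treats the three models as separately trained.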

In one embodiment, before the webpage html file and the webpage screenshot are crawled based on the URL of the website, it is judged whether the webpage URL is in a website knowledge base; if so, the webpage category information is output according to the corresponding entry in the website knowledge base; if not, the webpage html file and the webpage screenshot are crawled based on the URL of the website. The invention introduces a website classification knowledge base built from several sources, such as a knowledge base of filed domain names and entries collected and accumulated in the course of work. The knowledge base is manually checked to ensure its accuracy, which mitigates the misclassification that can result from relying entirely on automated judgment. URLs are screened against the knowledge base first; if a URL is present, the subsequent steps need not be executed, which improves both the accuracy and the speed of website classification. This is one of the important inventive points of the invention.
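The knowledge-base screening described above amounts to a lookup that short-circuits the rest of the pipeline. The sketch below assumes the knowledge base is a dictionary keyed by lowercase host name; the source does not say how entries are keyed, so normalize, judge_with_knowledge_base and the fallback parameter are all hypothetical.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to its lowercase host (an assumed keying scheme)."""
    # urlsplit only recognizes the host when the URL contains "//".
    host = urlsplit(url if "//" in url else "//" + url).hostname
    return (host or "").lower()


def judge_with_knowledge_base(
    url: str,
    knowledge_base: Dict[str, str],
    fallback: Callable[[str], str],
) -> str:
    """Return the stored category if the URL is known, else run the full pipeline."""
    category = knowledge_base.get(normalize(url))
    if category is not None:
        return category          # known URL: crawling and models are skipped
    return fallback(url)         # unknown URL: crawl, classify and fuse
```

For example, with a knowledge base containing {"example.com": "news"}, a call for "https://Example.com/page" returns "news" without invoking the fallback.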

In one embodiment, when the webpage screenshot is identified, the webpage screenshot picture File_img is input into the trained first neural network model, image features are extracted, and a classifier outputs the picture classification label, which contains a confidence value for each category i (i = 0, 1, ..., n) indicating that the web page belongs to category i, where n is an integer greater than or equal to 2. The first neural network model may be a deep neural network model.
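The source does not state how the classifier turns its outputs into per-category confidence values. One common choice, shown here purely as an assumption, is a softmax over the raw scores (logits) of the final layer; the function name and the example logits are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Map raw classifier scores to confidence values that sum to 1."""
    m = max(logits)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three categories; the largest logit
# yields the largest confidence value.
confidences = softmax([2.0, 1.0, 0.1])
best_category = max(range(len(confidences)), key=confidences.__getitem__)
```

The same conversion would apply to the outputs of the two text classifiers if they are also assumed to end in a softmax layer.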

In one embodiment, when the webpage file is identified, the webpage html file File_html is parsed to obtain the content text Text and the title text Text_Title of the webpage. Text is input into the trained second neural network model, text semantic features are extracted, and a text classifier outputs the content text semantic label, which contains a confidence value for each category i indicating that the web page belongs to category i. Text_Title is input into the trained third neural network model, text semantic features are extracted, and a text classifier outputs the title text semantic label, which likewise contains a confidence value for each category i. Whether the webpage has filing information is queried through the supervision platform to obtain the filing information label P_{Record keeping}: if a filing record exists, P_{Record keeping} is 1; if not, P_{Record keeping} is 0. The second and third neural network models may be bi-LSTM neural networks.
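Parsing File_html into the content text and the title text can be sketched with the Python standard library alone. This is an illustrative assumption about the parsing step, not the parser used by the invention; the class and function names are hypothetical, and script and style contents are skipped on the assumption that they carry no category information.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class _TextAndTitle(HTMLParser):
    """Collect the <title> text separately from the visible page text."""

    def __init__(self) -> None:
        super().__init__()
        self.title_parts: List[str] = []
        self.text_parts: List[str] = []
        self._in_title = False
        self._skip_depth = 0            # greater than 0 inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        chunk = data.strip()
        if not chunk:
            return
        if self._in_title:
            self.title_parts.append(chunk)
        elif self._skip_depth == 0:
            self.text_parts.append(chunk)


def parse_html(file_html: str) -> Tuple[str, str]:
    """Return (content text, title text) extracted from an html string."""
    parser = _TextAndTitle()
    parser.feed(file_html)
    parser.close()
    return " ".join(parser.text_parts), " ".join(parser.title_parts)
```

The two returned strings correspond to Text and Text_Title, which would then be passed to the second and third models respectively.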

In one embodiment, the picture classification label, the content text semantic label, the title text semantic label and the filing information label P_{Record keeping} are input into a predetermined multi-modal fusion strategy calculation formula to judge the webpage category. In the calculation formula, yⁱ = max(y⁰, y¹, ..., yⁿ) indicates that the final category of the web page is i, W_x (x = 1, 2, 3, 4) denotes a weight value, and B is a constant. The specific values of W_x may be obtained by machine learning from historical data, by fitting to historical data, and so on.
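The fusion formula itself is not reproduced in the text, so the sketch below assumes a weighted linear form: for each category i, yⁱ is W_1 times the picture confidence plus W_2 times the content text confidence plus W_3 times the title text confidence plus W_4 times the filing label plus B, and the final category is the index with the largest yⁱ. The linear form, the function name and the example weights are all assumptions chosen to be consistent with the four weights and the constant named in the description.

```python
from typing import List, Sequence, Tuple


def fuse(
    p_img: Sequence[float],
    p_text: Sequence[float],
    p_title: Sequence[float],
    p_filing: int,
    weights: Tuple[float, float, float, float],
    bias: float = 0.0,
) -> Tuple[int, List[float]]:
    """Return (final category index, fused scores) under an assumed linear fusion."""
    if not (len(p_img) == len(p_text) == len(p_title)):
        raise ValueError("all confidence vectors must cover the same categories")
    w1, w2, w3, w4 = weights
    scores = [
        w1 * a + w2 * b + w3 * c + w4 * p_filing + bias
        for a, b, c in zip(p_img, p_text, p_title)
    ]
    # The final category is the index i for which y_i is the maximum.
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical confidences for three categories and hypothetical weights.
category, fused = fuse(
    p_img=[0.2, 0.7, 0.1],
    p_text=[0.6, 0.3, 0.1],
    p_title=[0.1, 0.8, 0.1],
    p_filing=1,
    weights=(0.5, 0.3, 0.2, 0.1),
)
```

Under this assumed form the filing term and the constant shift every category score by the same amount, so they affect the magnitude of the scores but not which category is largest; a formula in which the filing label interacts with individual categories would behave differently.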

In the invention, the webpage screenshot is classified with a trained neural network picture classification model using image semantics, which makes full use of the meaning carried by the graphics of the web page and addresses the problem that websites whose pages contain only pictures are difficult to classify. The html file of the web page is acquired, the page structure is parsed, the text content is extracted, and the page is classified with a trained neural network text semantic classification model. The title of the website is also extracted from the html file; title information is more concise and subject to less semantic interference, so some websites can be judged more accurately from title semantics. Filing information of the website, such as the website filing number, is extracted from the webpage content and verified on the supervision platform to check whether it is legitimate, and the result is used as one further dimension for judging the website type. Finally, the several classification results are fused using a multi-modal technique with a specifically designed fusion strategy, which greatly improves the recognition rate of website types. This is another important inventive point of the invention.

Fig. 2 shows a multi-modal based website type determination apparatus of the present invention, provided on an image acquisition apparatus, the apparatus comprising:

a crawling unit 201 that crawls a web html file and a web screenshot based on the URL of the website; in the invention, a crawler program is designed, and the html file and the screenshot of the website are crawled according to the url of the webpage, wherein the url of the webpage is input by a user or acquired when the user clicks an advertisement link, and the like, but the invention is not limited to this.

The webpage screenshot identifying unit 202 is used for identifying the webpage screenshot by using a first neural network model to determine a picture classification label of the website; the first neural network model, also known as a neural network image classification model, requires training before use.

The webpage file identification unit 203 identifies the webpage html file with a second neural network model and a third neural network model to determine a content text semantic label and a title text semantic label of the website, and acquires a filing information label based on the URL of the website through a supervision information platform; the second and third neural network models are also called neural network text semantic classification models and need to be trained before use.

And the fusion unit 204 is used for determining the final type of the website based on the picture classification label, the content text semantic label, the title text semantic label and the filing information label.

In the invention, the webpage screenshot is classified with a trained neural network picture classification model using image semantics, which makes full use of the meaning carried by the graphics of the web page and addresses the problem that websites whose pages contain only pictures are difficult to classify. The html file of the web page is acquired, the page structure is parsed, the text content is extracted, and the page is classified with a trained neural network text semantic classification model. The title of the website is also extracted from the html file; title information is more concise and subject to less semantic interference, so some websites can be judged more accurately from title semantics. The filing information is verified on the supervision platform through the URL of the website to check whether the website is legitimately filed, and the result is used as one further dimension for judging the website type. Finally, the several classification results are fused using a multi-modal technique with a specifically designed fusion strategy, which greatly improves the recognition rate of website types. This is another important inventive point of the invention.

For convenience of description, the above device is described as being divided into units by function, and each unit is described separately. Of course, when the present application is implemented, the functions of the units may be implemented in one or more pieces of software and/or hardware.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk or an optical disk, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in some portions of the embodiments, of the present application.

Finally, it should be noted that although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications and equivalent substitutions may be made without departing from the spirit and scope of the invention, and all such modifications are intended to be covered by the appended claims.

Claims

1. A multi-modal-based website type judgment method is characterized by comprising the following steps:

2. The method according to claim 1, wherein before crawling the web page html file and the web page screenshot based on the URL of the website, judging whether the web page URL is in a website knowledge base, if so, outputting the web page category information according to the corresponding information in the website knowledge base; if not, crawling a webpage html file and a webpage screenshot based on the URL of the website.

3. The method of claim 2, wherein, when the webpage screenshot is identified, the webpage screenshot picture File_img is input into the trained first neural network model, image features are extracted, and a classifier outputs the picture classification label, which contains a confidence value for each category i indicating that the web page belongs to category i.

4. The method of claim 3, wherein, when the webpage file is identified, the webpage html file File_html is parsed to obtain the content text Text and the title text Text_Title of the webpage; Text is input into the trained second neural network model, text semantic features are extracted, and a text classifier outputs the content text semantic label.

5. The method of claim 4, wherein the picture classification label, the content text semantic label and the title text semantic label are input into a predetermined multi-modal fusion strategy calculation formula, in which yⁱ = max(y⁰, y¹, ..., yⁿ) indicates that the final category of the web page is i, W_x (x = 1, 2, 3, 4) denotes a weight value, and B is a constant.

6. A multi-modal-based website type determination apparatus, comprising:

7. The apparatus of claim 6, wherein before crawling the web page html file and the web page screenshot based on the URL of the website, determining whether the web page URL is in a website knowledge base, if so, outputting the web page category information according to the corresponding information in the website knowledge base; if not, crawling a webpage html file and a webpage screenshot based on the URL of the website.

8. The apparatus of claim 7, wherein, when the webpage screenshot is identified, the webpage screenshot picture File_img is input into the trained first neural network model, image features are extracted, and a classifier outputs the picture classification label, which contains a confidence value for each category i indicating that the web page belongs to category i.

9. The apparatus of claim 8, wherein, when the webpage file is identified, the webpage html file File_html is parsed to obtain the content text Text and the title text Text_Title of the webpage; Text is input into the trained second neural network model, text semantic features are extracted, and a text classifier outputs the content text semantic label, which contains a confidence value for each category i indicating that the web page belongs to category i; and whether the webpage has filing information is queried through the supervision platform to obtain the filing information label P_{Record keeping}: if a filing record exists, P_{Record keeping} is 1; if no filing record exists, P_{Record keeping} is 0.

10. The apparatus of claim 9, wherein the picture classification label, the content text semantic label and the title text semantic label are combined according to a predetermined multi-modal fusion strategy calculation formula