CN107819781B - Audio-visual website library construction method, audio-visual website inspection method and system - Google Patents

Publication number
CN107819781B
Authority
CN
China
Prior art keywords
website
information
visual
monitoring
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711173735.4A
Other languages
Chinese (zh)
Other versions
CN107819781A (en)
Inventor
李国华
白冰
张兆磊
申强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Bohui Technology Inc
Original Assignee
Beijing Bohui Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Bohui Technology Inc filed Critical Beijing Bohui Technology Inc
Priority to CN201711173735.4A priority Critical patent/CN107819781B/en
Publication of CN107819781A publication Critical patent/CN107819781A/en
Application granted granted Critical
Publication of CN107819781B publication Critical patent/CN107819781B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/30 Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information
    • H04L63/306 Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information intercepting packet switched data communications, e.g. Web, Internet or IMS communications
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computing Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Technology Law (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiments of the present application disclose an audio-visual website library construction method, an audio-visual website inspection method, and an audio-visual website inspection system. An inspection queue of local audio-visual websites is created from the constructed local audio-visual website library, data monitoring is performed on the audio-visual websites in the library, the data monitoring results are stored in an inspection history information library, and the local audio-visual website library is updated. After one local audio-visual website has been monitored, the next website is monitored according to the inspection queue, and the local audio-visual websites in the queue are monitored cyclically. This reduces the consumption of network resources and hardware resources during website monitoring, addresses the low monitoring efficiency, untimely monitoring, and missed monitoring of the prior art, and enables comprehensive and timely monitoring of local internet audio-visual websites.

Description

Audio-visual website library construction method, audio-visual website inspection method and system
The application relates to the technical field of internet information processing, in particular to an audiovisual website library construction method, an audiovisual website inspection method and an audiovisual website inspection system.
Background
With the development of internet audio-visual technology, the number of internet audio-visual websites keeps growing, audio-visual content is becoming richer, user demand for obtaining audio-visual content through audio-visual websites is rising, and the influence of internet audio-visual programs is expanding rapidly. Supervision and guidance of internet audio-visual websites and internet audio-visual programs are therefore urgently needed to promote the healthy development of the internet audio-visual industry. Such supervision is achieved, first of all, by monitoring the audio-visual websites and the internet audio-visual programs.
In prior-art internet audio-visual website monitoring methods, new local websites are discovered mainly by releasing a crawler on the internet, and the type of a website is judged from the keywords in the "title" tag of the web page source code. This is a coarse-grained form of monitoring. Further, because the keyword information in the "title" tag is limited, the type of the website cannot be determined accurately. For example, the content of the "title" tag of a certain audio-visual website is "Youku - this world is very cool"; in this case, the keywords in the "title" tag do not reveal whether the website is an audio-visual website.
In addition, the prior art monitors internet audio-visual websites in a whole-site real-time mode, that is, the data of all known internet audio-visual websites is monitored in real time. The number of audio-visual websites within a monitoring area can be very large, and when the number of local audio-visual websites is large, whole-site real-time monitoring consumes a large amount of network resources and hardware resources. The existing network and hardware resources cannot sustain the demands of whole-site real-time monitoring, which leads to low monitoring efficiency, untimely monitoring, and similar problems.
Therefore, the prior-art internet audio-visual website monitoring methods cannot achieve comprehensive and timely monitoring of local internet audio-visual websites.
Disclosure of Invention
The embodiment of the application provides an audio-visual website library construction method, an audio-visual website inspection method and an audio-visual website inspection system, which aim to solve the problems in the prior art.
In a first aspect, an embodiment of the present application provides an audio-visual website library construction method, where the method includes: obtaining a local domain name library from the domain name scanning results of at least one scanning mode; obtaining classification information of local websites according to the local domain name library, wherein the classification information at least comprises web page keywords and web page description information; classifying the local websites according to the classification information to generate a local audio-visual pending-review website library; acquiring website license information from the web pages of the websites in the local audio-visual pending-review website library, wherein the website license information at least comprises ICP (Internet Content Provider) filing information; and constructing a local audio-visual website library according to the local audio-visual pending-review website library and the website license information.
In a second aspect, an embodiment of the present application provides an audio-visual website inspection method, where the method includes: creating an inspection queue of local audio-visual websites according to the local audio-visual website library; acquiring the website currently being inspected from the inspection queue, and acquiring website information of that website, wherein the website information comprises at least one of classification information, a web page screenshot, and all link text information; performing data monitoring on the website currently being inspected according to the website information, wherein the data monitoring comprises at least one of website type monitoring, website layout monitoring, website effectiveness monitoring, and website content monitoring; updating the inspection history information library of the website currently being inspected according to its data monitoring result, and creating the inspection history information library if it does not exist; and updating the local audio-visual website library according to the data monitoring result of the website currently being inspected.
In a third aspect, an embodiment of the present application provides an audio-visual website inspection system, where the system includes a memory and a processor. The memory is used for storing a local audio-visual website library, an inspection history information library, website information, and a program executable by the processor. The processor is configured to: create an inspection queue of local audio-visual websites according to the local audio-visual website library; acquire the website currently being inspected from the inspection queue, and acquire website information of that website, wherein the website information comprises at least one of classification information, a web page screenshot, and all link text information; perform data monitoring on the website currently being inspected according to the website information, wherein the data monitoring comprises at least one of website type monitoring, website layout monitoring, website effectiveness monitoring, and website content monitoring; update the inspection history information library of the website currently being inspected according to its data monitoring result, and create the inspection history information library if it does not exist; and update the local audio-visual website library according to the data monitoring result of the website currently being inspected.
According to the technical solution provided by the embodiments of the present application, in order to solve the problem that prior-art internet audio-visual website monitoring methods cannot monitor local internet audio-visual websites comprehensively and accurately, local audio-visual websites are obtained accurately through at least one scanning mode, newly appearing local audio-visual websites can be found promptly and accurately, the local audio-visual website library is constructed, and the inspection targets among local audio-visual websites are thereby determined. In the audio-visual website inspection method provided by the embodiments of the present application, an inspection queue of local audio-visual websites is created from the constructed local audio-visual website library; periodic data monitoring, including website type monitoring, website layout monitoring, website effectiveness monitoring, and website content monitoring, is performed on the audio-visual websites in the library; an inspection history information library is created from the data monitoring results and updated after each round of inspection; and the local audio-visual website library is updated according to the data monitoring results. After one local audio-visual website has been monitored, the next website is monitored according to the inspection queue, and the local audio-visual websites in the queue are monitored cyclically. This greatly reduces the consumption of network resources and hardware resources during website monitoring, addresses the low monitoring efficiency, untimely monitoring, and missed monitoring of the prior art, and achieves comprehensive and timely monitoring of local internet audio-visual websites.
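The queue-driven inspection round described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `run_inspection_round`, the `monitor` callback, and the result keys `is_audiovisual` and `is_valid` are hypothetical, and the rule of removing a site that fails type or effectiveness monitoring is one possible way to "update the library", not something the text specifies.

```python
from collections import deque
from typing import Callable, Dict, List


def run_inspection_round(
    library: List[str],
    monitor: Callable[[str], dict],
    history: Dict[str, List[dict]],
) -> None:
    """Inspect every website in the library once, one site at a time."""
    queue = deque(library)  # inspection queue created from the library
    while queue:
        site = queue.popleft()  # website currently being inspected
        result = monitor(site)  # hypothetical data-monitoring callback
        # Create the history record on first use, then append this result.
        history.setdefault(site, []).append(result)
        # Assumed update rule: drop sites that no longer qualify.
        still_av = result.get("is_audiovisual", True)
        still_valid = result.get("is_valid", True)
        if not (still_av and still_valid):
            library.remove(site)
```

Calling the function repeatedly gives the cyclic behaviour: each call rebuilds the queue from the current library, so only one site is processed at any moment rather than all sites at once.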
Drawings
In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without any creative effort.
Fig. 1 is a flowchart of a method for constructing an audiovisual website library according to an embodiment of the present application;
Fig. 2 is a flowchart of step S110 of a method for constructing an audiovisual website library according to an embodiment of the present application;
Fig. 3 is a flowchart of another audiovisual website library construction method provided in the embodiments of the present application;
Fig. 4 is a flowchart of an audiovisual website inspection method according to an embodiment of the present application;
Fig. 5 is a flowchart of website layout monitoring in an audiovisual website inspection method according to an embodiment of the present application;
Fig. 6 is a flowchart of website content monitoring in an audiovisual website inspection method according to an embodiment of the present application;
Fig. 7 is a flowchart illustrating a website layout monitoring method according to another exemplary embodiment of the present disclosure;
Fig. 8 is a block diagram of an audiovisual website inspection system according to an embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
With the development of internet audio-visual technology, the number of internet audio-visual websites keeps growing, and audio-visual content is becoming richer, more varied, and more rapidly updated. More and more internet users are accustomed to watching audio-visual programs online through audio-visual websites, user demand for obtaining audio-visual content through these websites is rising, and the influence of internet audio-visual programs is expanding rapidly. Besides traditional audio-visual websites, live streaming websites and short video sharing websites of many kinds have sprung up rapidly. These websites allow users to produce audio-visual content themselves and distribute it through the corresponding website platform, which makes the audio-visual content of internet audio-visual websites richer and faster to change. Audio-visual websites and internet audio-visual programs therefore urgently need supervision and guidance to promote the healthy development of the internet audio-visual industry. Such supervision is achieved, first of all, by monitoring the audio-visual websites and the internet audio-visual programs.
In prior-art internet audio-visual website monitoring methods, new local websites are discovered mainly by releasing a broad-crawl web crawler on the internet, and the type of a website is judged by identifying keywords in the "title" tag of the web page source code. This is a coarse-grained form of monitoring. Because the data acquisition mode is single, the set of newly discovered websites is not comprehensive, so many local websites are missed or acquisition is inefficient. Meanwhile, because the keyword information in the "title" tag is limited, the website type cannot be judged accurately from the "title" tag alone. For example, the content of the "title" tag of a certain audio-visual website is "Youku - this world is very cool"; in this case, the keywords in the "title" tag do not reveal whether the website is an audio-visual website. Moreover, the content of the "title" tag changes from time to time with shifts in internet hot topics and updates to the website's content, so the result of judging the website type from "title" keywords, as done in the prior art, is unstable.
In addition, the prior art monitors internet audio-visual websites in a whole-site real-time mode, that is, the data of all known internet audio-visual websites is monitored in real time. The number of audio-visual websites within a monitoring area can be very large, and when the number of local audio-visual websites is large, whole-site real-time monitoring consumes a large amount of network resources and hardware resources. The existing network and hardware resources cannot sustain the demands of whole-site real-time monitoring, which results in low monitoring efficiency, untimely monitoring, and similar problems.
In order to solve the problems in the prior art, an embodiment of the present application provides a method for constructing an audiovisual website library, and fig. 1 is a flowchart of the method for constructing an audiovisual website library provided in the embodiment of the present application, and as shown in fig. 1, the method includes the following steps:
Step S110, obtaining a local domain name library from the domain name scanning results of at least one scanning mode.
To make the domain name coverage of the local domain name library more comprehensive, the present application performs local domain name scanning in at least one scanning mode. This addresses the prior-art problems caused by relying on a single scanning method: the scanning results cover websites incompletely, some local websites are never acquired, and acquisition efficiency is low.
So that the domain name scanning results obtained by the individual scanning modes complement one another sufficiently, and a comprehensive local domain name library can be obtained by integrating them, three or more domain name scanning modes are preferably used in step S110.
Fig. 2 is a flowchart of step S110 of an audiovisual website library construction method provided in an embodiment of the present application, and as shown in fig. 2, in an alternative implementation, step S110 includes the following steps:
Step S111, obtaining the domain name scanning results by IP port scanning, domain name library directional scanning, and whole-network broad-crawl scanning.
In step S111, domain name scanning is performed using three scanning modes, namely IP port scanning, domain name library directional scanning, and whole-network broad-crawl scanning, in order to obtain a comprehensive local domain name library.
In step S111, the IP port scanning includes: releasing an IP address library crawler on the local internet, crawling local IP addresses to obtain a local IP address library, and then scanning IP ports according to the obtained local IP address library to obtain the local domain names corresponding to the IP addresses.
In step S111, the domain name library directional scanning includes: releasing a domain name library crawler on the local internet according to an existing domain name library, and performing directional domain name scanning of the local internet to obtain a local IP-domain name library, wherein the local IP-domain name library comprises domain names and the local IP addresses to which the domain names point.
In step S111, the whole-network broad-crawl scanning uses a broad-crawl crawler to perform a shallow, fast, and extensive search of the local internet and obtain the IP addresses and domain names of local websites, especially those of new websites.
In step S111, three groups of domain name scanning results are obtained through the three scanning modes. Compared with the single scanning mode of the prior art, the three groups of results can be integrated and complement one another. For example, the domain names obtained by IP port scanning can fill gaps caused by an incomplete domain name library in the directional scanning; the IP addresses obtained by domain name library directional scanning can fill gaps caused by an incomplete IP address library in the IP port scanning; and the whole-network broad-crawl scanning focuses on finding new local domain names. The local domain name library built from the three groups of results therefore covers local websites more comprehensively.
Step S112, merging the domain name scanning results, taking the intersection, cleaning the data, removing non-local domain names, and generating the local domain name library.
In step S112, the three groups of domain name scanning results obtained in step S111 are merged, and the overlapping results are identified and retained in consolidated form. For example, only one copy of a repeated result is kept, and for results from different scanning modes that relate to the same IP address or the same domain name, one complete result is kept. The domain name scanning results of the three scanning modes are then cleaned. For example, non-local domain names whose IP addresses do not belong to the local area are removed, and noise is removed, including incomplete results and results irrelevant to the needs of the present application.
In step S112, the local domain name library generated in this way covers local websites more comprehensively and includes the domain names of new local websites.
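One way to picture the merging and cleaning of step S112 is the short sketch below. It rests on assumptions: each scanning mode is taken to yield `(domain, ip)` pairs in which either field may be missing, "taking the intersection" is interpreted as consolidating overlapping records into one complete record per domain, and `is_local_ip` is a hypothetical caller-supplied test for whether an IP address belongs to the local area.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# (domain, ip); either field may be missing in a raw scan record.
Record = Tuple[Optional[str], Optional[str]]


def build_local_domain_library(
    scan_results: Iterable[List[Record]],
    is_local_ip: Callable[[str], bool],
) -> Dict[str, str]:
    """Merge several scan results into one domain -> IP mapping."""
    merged: Dict[str, Optional[str]] = {}
    for result in scan_results:
        for domain, ip in result:
            if not domain:
                continue  # noise: a record without a domain name
            domain = domain.strip().lower()
            # Keep one entry per domain; fill in the IP when a later
            # scan supplies what an earlier scan was missing.
            if merged.get(domain) is None:
                merged[domain] = ip
    # Cleaning: drop incomplete records and non-local domains.
    return {
        domain: ip
        for domain, ip in merged.items()
        if ip is not None and is_local_ip(ip)
    }
```

A dictionary keyed by the normalised domain name keeps deduplication simple; a real system would likely also need to handle one domain resolving to several IP addresses, which this sketch ignores.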
Step S120, acquiring classification information of local websites according to the local domain name library, wherein the classification information at least comprises web page keywords and web page description information.
In step S120, the local domain names in the obtained local domain name library are accessed, and the classification information is obtained from the local websites to which those domain names point. In an alternative embodiment, the classification information includes web page keywords and web page description information, both of which can be obtained by crawling the content of specified tags in the web page source code.
Taking the domain name www.iqiyi.com as an example, web page keywords are obtained from the "title" tag of the web page source code.
Illustratively, the content of the "title" tag in the source code of a certain web page is:
<title>iQIYI - the world's leading online video website - massive licensed high-definition video online viewing</title>
Keywords such as "iQIYI", "online video", "high-definition video", and "online viewing" can be obtained from the "title" tag.
In addition, keywords can be obtained from the tag containing the "keywords" field in the web page source code.
Illustratively, the source code of a certain web page contains the following field:
<meta name="keywords" lang="zh-CN" content="iQIYI video, video website, high-definition video, movies, TV dramas, animation, variety shows, music"/>
Keywords such as "video", "movies", "TV dramas", "variety shows", and "music" can be obtained from this tag.
In addition, the web page description information can be obtained from the tag containing the "description" field in the web page source code.
Illustratively, the source code of a certain web page contains the following field:
<meta name="description" lang="zh-CN" content="iQIYI (iqiyi.com) is a leading large-scale video website that provides massive, high-quality, high-definition online video services and is a preferred platform for online video playback. iQIYI's film and television content is rich and diverse, covering movies, TV dramas, animation, variety shows, life, music, comedy, finance, military affairs, sports, trailers, news, micro-movies, children, mother and baby, education, science and technology, fashion, original content, public welfare, games, travel, user-shot video, automobiles, documentaries, iQIYI self-produced dramas, and more. Video playback is clear and smooth, the user interface is simple and friendly, and users are given a viewing experience of truly enjoyable quality."/>
The value of the "content" attribute of this tag can be taken as the web page description information.
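The tag crawling described in step S120 could be sketched with Python's standard `html.parser` module, as below. The class and function names are hypothetical, and splitting the keyword list on both ASCII and full-width commas is an assumption about how such pages separate keywords.

```python
from html.parser import HTMLParser


class ClassificationInfoParser(HTMLParser):
    """Collects the <title> text and the keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.keywords = []
        self.description = ""

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attr_map.get("name") or "").lower()
            content = attr_map.get("content") or ""
            if name == "keywords":
                # Assumption: keywords are separated by "," or "，".
                parts = content.replace("，", ",").split(",")
                self.keywords = [p.strip() for p in parts if p.strip()]
            elif name == "description":
                self.description = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_classification_info(html: str) -> dict:
    """Return the title, keywords, and description found in a page."""
    parser = ClassificationInfoParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "keywords": parser.keywords,
        "description": parser.description,
    }
```

Fetching the page itself is deliberately left out; the function only takes source code that has already been downloaded.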
Step S130, classifying the local websites according to the classification information to generate a local audio-visual pending-review website library.
In step S130, the local websites to which the local domain names point are classified according to the classification information acquired in step S120, such as the web page keywords and web page description information. Given the present application's need to monitor audio-visual websites, the website types should include at least audio-visual websites and other websites. To make it easier to monitor other types of websites with the technical solution of the present application, the website types used during classification may also include news, finance, shopping, and the like in addition to audio-visual websites.
In the example of step S120, web page keywords such as "iQIYI", "online video", and "online viewing" are obtained from the "title" tag. From these keywords it can be determined that the website indicated by the domain name www.iqiyi.com is iQIYI and that its website type is audio-visual. The website type may be determined by means of a preset website type keyword library: the web page keywords are matched against the preset keywords in the library, and if a web page keyword matches a preset keyword, the website is assigned the website type to which that keyword library corresponds.
For some websites, the "title" tag does not include any keyword that can be used for website classification. Taking the domain name www.youku.com as an example, the content of the "title" tag in the web page source code is:
<title>Youku - this world is very cool</title>
To classify this website, the content of the web page description information must therefore be used. The web page source code of the website includes the following field:
<meta name="description" content="video service platform providing video playback, video publishing, video search, and video sharing"/>
From the web page description information contained in this field, it can be determined that the website is an audio-visual website.
The website types in the embodiments of the present application may also include other types of websites, such as finance.
Taking the domain name www.10jqka.com as an example, the content of the "title" tag in the web page source code is:
<title>Tonghuashun Finance__Making investment simpler</title>
The keywords "finance" and "investment" can be extracted from the "title" tag, so the website can be determined to be a finance website.
As can be seen from steps S120 and S130, the present application classifies the local websites to which the local domain name library points according to classification information that includes both web page keywords and web page description information. Compared with the prior-art method, which determines the website type only from keywords in the "title" tag of the web page source code, classification based on web page keywords and web page description information is more accurate and avoids misclassifying websites whose "title" tag contains no usable keywords.
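The keyword-library matching of step S130 might look roughly like the following sketch. The contents of `TYPE_KEYWORDS`, the substring-based matching, and the "most matches wins" scoring are all assumptions made for illustration; the text only says that web page keywords are matched against a preset website type keyword library.

```python
from typing import Dict, List, Set

# Hypothetical preset keyword libraries, one per website type.
TYPE_KEYWORDS: Dict[str, Set[str]] = {
    "audio-visual": {"video", "movie", "tv drama", "live streaming"},
    "finance": {"finance", "investment", "stock"},
}


def classify_website(
    title: str,
    keywords: List[str],
    description: str,
    type_keywords: Dict[str, Set[str]] = TYPE_KEYWORDS,
) -> str:
    """Return the best-matching website type, or "other" if none match."""
    # Combine title, keywords, and description so that a page with an
    # uninformative title can still be classified from its description.
    text = " ".join([title, " ".join(keywords), description]).lower()
    scores = {
        site_type: sum(1 for kw in kws if kw in text)
        for site_type, kws in type_keywords.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"
```

With only the title "Youku - this world is very cool" the sketch returns "other", while adding the description from the example above yields "audio-visual", which mirrors the argument for using description information alongside the title.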
Step S140, acquiring website license information from the web pages of the websites in the local audio-visual pending-review website library, wherein the website license information at least comprises ICP (Internet Content Provider) filing information.
In the present application, the website license information may include ICP filing information. For audio-visual websites, it may further include one or more of network culture business license information, license information for the dissemination of audio-visual programs over information networks, and publication business license information.
In the present application, the website license information can be obtained by crawling the content of the corresponding fields in the web page source code of the website.
Illustratively, the following fields are obtained from the web page source code of the iQIYI homepage:
title="Jing ICP License No. 110636"
title="Jing Wang Wen [2015] 0652-282-1"
From the above fields, the ICP filing information of the website is Jing ICP License No. 110636, and the network culture business license information is Jing Wang Wen [2015] 0652-282-1.
In the present application, the website license information can be used during inspection of audio-visual websites to judge, for example, whether the website to which a domain name points has changed and whether the website's ICP filing is still valid.
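As a rough illustration of step S140, the sketch below pulls license strings out of `title="..."` attributes in page source code. It is built on assumptions: the function name is hypothetical, ICP filing information is recognised simply by the substring "ICP", and other licenses are recognised by a bracketed four-digit year such as "[2015]", which is a heuristic drawn from the single example above rather than a documented format.

```python
import re
from typing import Dict, List, Optional, Union

# Matches title="..." attributes in raw page source.
TITLE_ATTR = re.compile(r'title\s*=\s*"([^"]*)"')
# Heuristic (assumption): other license numbers carry a bracketed year.
BRACKETED_YEAR = re.compile(r"\[\d{4}\]")

LicenseInfo = Dict[str, Union[Optional[str], List[str]]]


def extract_license_info(html: str) -> LicenseInfo:
    """Collect ICP filing info and other license strings from a page."""
    icp: Optional[str] = None
    other: List[str] = []
    for value in TITLE_ATTR.findall(html):
        value = value.strip()
        if "ICP" in value.upper():
            if icp is None:
                icp = value  # keep the first ICP string found
        elif BRACKETED_YEAR.search(value):
            other.append(value)
    return {"icp": icp, "other": other}
```

Because sites publish this information in many different layouts, a real crawler would probably need per-site rules or a broader set of patterns; the sketch only shows the general shape of the extraction.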
Step S150, constructing a local audio-visual website library according to the local audio-visual pending-review website library and the website license information.
In the present application, the local audiovisual website library may include items such as a domain name, a website name, and website permission information.
Illustratively, the local audiovisual website library is of the form:
Website name       Domain name        Website license information
iQIYI              www.iqiyi.com      Jing ICP License No. 110636
Baofeng Yingyin    www.baofeng.com    Jing ICP License No. 070364
Sohu Video         tv.sohu.com        Jing ICP License No. 030367
Le Video           tv.le.com          Jing ICP License No. 060072
……                 ……                 ……
Fig. 3 is a flowchart of another audiovisual website library construction method provided in this application, and as shown in fig. 3, in an alternative embodiment, before step S150, the method further includes:
Step S149, verifying the license information, and correcting the license information when the license information is incorrect.
Because the web page may not be updated in time, or the website license information may not be published on the web page, wrong or incomplete website license information may be acquired in step S140, for example ICP filing information that has expired or that could not be acquired. Therefore, in step S149, the website license information is verified. The verification method adopted in the embodiment of the application may be as follows: a query interface connected to the query platform of a relevant supervision department, such as the Ministry of Industry and Information Technology, is set up; the website license information recorded by the supervision department is queried and obtained through the query interface; and if the query result does not match the result acquired in step S140, the information acquired in step S140 is wrong, and the wrong license information is corrected.
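The verification in step S149 can be sketched as follows. This is a minimal illustration only: the function name verify_license, the stand-in fake_query, and the rule of keeping the scraped value when the supervision department returns no record are all assumptions, and no real query platform API is implied.

```python
from typing import Callable, Optional


def verify_license(
    domain: str,
    scraped: Optional[str],
    query_regulator: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the license information to store for `domain`.

    `scraped` is the value collected from the web page in step S140;
    `query_regulator` is a hypothetical wrapper around the query interface.
    """
    official = query_regulator(domain)
    if official is None:
        # Assumption: with no official record, keep the scraped value.
        return scraped
    # A mismatch (or a missing scraped value) is corrected
    # with the record held by the supervision department.
    return official


# Hypothetical stand-in for the query interface, for illustration only.
_FAKE_REGISTRY = {"www.example.com": "ICP-000001"}


def fake_query(domain: str) -> Optional[str]:
    return _FAKE_REGISTRY.get(domain)


corrected = verify_license("www.example.com", "ICP-999999", fake_query)
```

In this sketch the expired or wrong value scraped from the page is simply overwritten; a real system might also log the discrepancy for manual review.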
According to the above technical scheme, to solve the problem that prior-art internet audio-visual website monitoring methods cannot monitor local internet audio-visual websites accurately and comprehensively, the audio-visual website library construction method obtains a local domain name library from the domain name scanning results of at least one scanning mode. The results obtained by multiple scanning modes complement one another, so that the local domain name library covers the local websites more comprehensively. The method then obtains classification information of the local websites according to the local domain name library, classifies the local websites according to the classification information to generate the local audio-visual type to-be-examined website library, obtains website permission information from the web pages of the local audio-visual type to-be-examined websites, and constructs the local audio-visual website library. The local audio-visual website library built in this way defines the inspection and monitoring targets among the local audio-visual websites, so that the local audio-visual websites can be monitored purposefully, which solves the inaccurate and incomplete monitoring caused by the untargeted monitoring methods of the prior art.
An embodiment of the present application further provides an audiovisual website inspection method, and fig. 4 is a flowchart of the audiovisual website inspection method provided in the embodiment of the present application, and as shown in fig. 4, the method includes the following steps:
Step S210, creating an inspection queue of the local audio-visual website according to the local audio-visual website library.
To determine the inspection order of the local audio-visual websites, the application creates an inspection queue of the local audio-visual websites in step S210; during inspection, the websites are inspected one by one in the order in which they appear in the inspection queue.
For example, a database management system such as MySQL may be used to store the local audiovisual website library. When the local audiovisual websites need to be inspected, the domain name information stored in the local audiovisual website library is written in sequence into a key-value store such as Redis, thereby forming the inspection queue of the local audiovisual websites.
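A minimal sketch of forming such a queue is given below. To stay self-contained it substitutes an in-memory SQLite table for the website library and a collections.deque for the key-value store; the table name audiovisual_sites and the function build_inspection_queue are hypothetical.

```python
import sqlite3
from collections import deque


def build_inspection_queue(conn: sqlite3.Connection) -> deque:
    """Read domain names in stored order and enqueue them for inspection."""
    rows = conn.execute("SELECT domain FROM audiovisual_sites ORDER BY id")
    return deque(domain for (domain,) in rows)


# Stand-in for the local audio-visual website library (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audiovisual_sites (id INTEGER PRIMARY KEY, domain TEXT)"
)
conn.executemany(
    "INSERT INTO audiovisual_sites (domain) VALUES (?)",
    [("www.iqiyi.com",), ("www.baofeng.com",), ("tv.sohu.com",)],
)

queue = build_inspection_queue(conn)
current = queue.popleft()  # the current detected website
```

With a real key-value store, the same first-in, first-out behaviour would typically be obtained by pushing domains onto one end of a list and popping from the other.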
Step S220, acquiring the current detected website from the inspection queue, and acquiring website information of the current detected website, wherein the website information comprises at least one of classification information, webpage screen capturing and all link text information.
In step S220, the current detected website is determined according to the inspection queue of the local audiovisual websites, the domain name of the current detected website is obtained, and a crawler is then dispatched to collect the website information.
Illustratively, the classification information may include at least webpage keywords and webpage description information. The webpage keywords may be obtained from the "title" tag and from the tag containing the "keywords" field of the webpage source code of the detected website; the webpage description information may be obtained from the tag containing the "Description" field of the webpage source code. All the link text information may be acquired by searching the webpage source code for <a href=...> and </a> tags and extracting the link text from between them.
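The extraction described above can be sketched with the Python standard library HTML parser. The class name PageInfoParser and the helper extract_page_info are hypothetical, and the sketch assumes that the "keywords" and "Description" fields appear as meta tags with name and content attributes.

```python
from html.parser import HTMLParser


class PageInfoParser(HTMLParser):
    """Collect the title, meta keywords, meta description and link texts."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = ""
        self.description = ""
        self.link_texts = []
        self._in_title = False
        self._link_buf = None  # a list while inside <a href=...>

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attr.get("name") or "").lower()
            if name == "keywords":
                self.keywords = attr.get("content") or ""
            elif name == "description":
                self.description = attr.get("content") or ""
        elif tag == "a" and "href" in attr:
            self._link_buf = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._link_buf is not None:
            text = "".join(self._link_buf).strip()
            if text:
                self.link_texts.append(text)
            self._link_buf = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._link_buf is not None:
            self._link_buf.append(data)


def extract_page_info(html: str) -> PageInfoParser:
    parser = PageInfoParser()
    parser.feed(html)
    parser.close()
    return parser


sample = (
    "<html><head><title>Demo Video Site</title>"
    '<meta name="keywords" content="video,movies">'
    '<meta name="Description" content="Watch videos online">'
    '</head><body><a href="/v1.html">A journey</a>'
    '<a name="top">skip</a></body></html>'
)
info = extract_page_info(sample)
```

Anchors without an href attribute are ignored here, since only link text is of interest.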
Step S230, according to the website information, performing data monitoring on the currently detected website, wherein the data monitoring comprises at least one of website type monitoring, website layout monitoring, website effectiveness monitoring and website content monitoring.
In step S230, website type monitoring is used to monitor whether the website type changes, and in an alternative embodiment, the website type monitoring includes the following steps:
Step S2311, determining, according to the classification information, whether the current detected website is an audio-visual website.
In step S2311, the website type of the currently detected website may be determined according to the webpage keywords in the classification information, and if the webpage keywords are not obtained in step S220, the website type of the currently detected website is determined from the webpage description information. If the website type of the current detected website is not the audio-visual website, the website type is changed, and the website type of the current detected website after the change is recorded.
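The fallback from webpage keywords to webpage description information can be sketched as below. The term list AUDIOVISUAL_TERMS and the simple substring test are illustrative assumptions; the application does not specify how keywords are mapped to a website type.

```python
# Hypothetical terms that mark a website as audio-visual.
AUDIOVISUAL_TERMS = ("video", "movie", "film", "music")


def is_audiovisual(keywords: str, description: str) -> bool:
    """Judge the website type from keywords, else from the description."""
    # Use the description only when no keywords were collected.
    text = keywords.strip() or description
    lowered = text.lower()
    return any(term in lowered for term in AUDIOVISUAL_TERMS)


def type_changed(keywords: str, description: str) -> bool:
    """A library website that is no longer audio-visual has changed type."""
    return not is_audiovisual(keywords, description)
```

For instance, a page whose keywords are "shoes, fashion" is reported as changed even if its description mentions video, because the description is consulted only when the keywords are missing.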
In step S230, website layout monitoring is used to monitor whether a website layout changes, fig. 5 is a flowchart of website layout monitoring of an audiovisual website inspection method provided in an embodiment of the present application, and as shown in fig. 5, in an alternative implementation, the website layout monitoring includes the following steps:
Step S2321, obtaining the layout information of the current detected website from the home page screenshot of the current detected website.
In the method, the layout window in the webpage screenshot can be identified in an image identification mode, the webpage screenshot is cut into blocks according to the identification result, the position, the size and the relative relation of each block are recorded, and the layout information of the current detected website is generated. Or, the web page screenshot is normalized, for example, processed into a grayscale image, and the layout information of the current detected website is recorded in the form of the normalized image, which can be used for similarity matching between images.
Step S2322, comparing the layout information with the pre-stored layout information of the current detected website for consistency.
In the application, the pre-stored layout information can be obtained when the local audio-visual website library is created: a web crawler captures the web page screenshot, which is then cut into blocks or normalized, and the resulting layout information is stored in the local audio-visual website library under the corresponding website.
When performing the consistency comparison, one optional comparison method is: obtaining the pre-stored layout information of the current detected website represented in block form, comparing it block by block with the blocks cut from the web page screenshot in step S2321, and determining the proportion of identical blocks according to the position, size and relative relation of the blocks.
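A minimal sketch of the block-proportion comparison follows. It assumes that each block is described only by its position and size, that two blocks are "the same" when these values agree within a tolerance, and that the proportion is taken over the larger of the two block counts; the relative relation between blocks is not modelled. The names Block and same_block_ratio are hypothetical.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    x: int
    y: int
    width: int
    height: int


def same_block_ratio(
    stored: List[Block], current: List[Block], tol: int = 0
) -> float:
    """Proportion of blocks whose position and size agree within `tol`."""
    if not stored and not current:
        return 1.0
    remaining = list(current)
    matched = 0
    for ref in stored:
        for cand in remaining:
            if all(abs(a - b) <= tol for a, b in zip(ref, cand)):
                matched += 1
                remaining.remove(cand)  # each block matches at most once
                break
    return matched / max(len(stored), len(current))


stored_layout = [Block(0, 0, 100, 50), Block(0, 50, 100, 200),
                 Block(0, 250, 100, 30)]
current_layout = [Block(0, 0, 100, 50), Block(0, 50, 60, 200),
                  Block(60, 50, 40, 200), Block(0, 250, 100, 30)]
ratio = same_block_ratio(stored_layout, current_layout)
```

Here the middle block has been split in two, so only the header and footer blocks still match.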
Alternatively, another optional comparison method is to generate a multidimensional vector representing the pre-stored layout information according to the number of blocks in the pre-stored layout information, where each block corresponds to one dimension of the vector. A weight is determined for each block according to its position, size and the like, the value of each dimension of the vector is weighted accordingly, and the weighted multidimensional vector is used as the reference vector for the consistency comparison. When the web page screenshot of the current detected website needs to be compared for consistency, a feature vector of the website layout information is generated according to the content of the blocks of the web page screenshot and the position and size of each block, and the cosine similarity between the reference vector and the feature vector is then calculated.
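The cosine-similarity comparison can be sketched as follows. How the per-block values and weights are derived is not specified in the application, so the sketch takes them as given inputs and, as an assumption, applies the same weights to both the reference vector and the feature vector. The function name weighted_cosine is hypothetical.

```python
import math
from typing import Sequence


def weighted_cosine(
    reference: Sequence[float],
    feature: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Cosine similarity of two layout vectors after per-block weighting."""
    if not (len(reference) == len(feature) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    a = [w * r for w, r in zip(weights, reference)]
    b = [w * f for w, f in zip(weights, feature)]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty layout as completely dissimilar
    return dot / (norm_a * norm_b)
```

A value close to 1 indicates that the two layouts point in nearly the same direction in the weighted block space; the result is then compared against the threshold chosen for this method.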
Alternatively, a further optional comparison method is to obtain the pre-stored normalized image of the web page screenshot of the current detected website from the pre-stored layout information, and to calculate the primitive similarity between the normalized image obtained in step S2321 and the pre-stored normalized image.
Step S2323, if the consistency is lower than a preset threshold value, determining that the website layout has changed.
In the embodiment of the application, different thresholds can be set for different consistency comparison methods, and whether the layout of the website changes is judged in a manner that whether the consistency comparison result is lower than the threshold. If the consistency is lower than a preset threshold value, the website layout is changed, and if the consistency is not lower than the preset threshold value, the website layout is not changed.
For example, according to the consistency comparison method shown in step S2322, the threshold may be a ratio threshold of the same cut block, a cosine similarity threshold of the reference vector and the feature vector, a primitive similarity threshold, and the like, and in the consistency comparison method, if the comparison result is smaller than the threshold, the website layout changes.
In step S230, the website effectiveness monitoring is used to monitor whether the website can operate normally, and in an alternative embodiment, the website effectiveness monitoring includes the following steps:
Step S2331, analyzing whether failure information is returned when the website information is collected.
When information is collected from a website, returned failure information indicates that the website cannot be accessed normally, that is, the website has failed. The failure information includes HTTP status codes that indicate specific errors, for example 403 Forbidden, 404 Not Found and 502 Bad Gateway. Alternatively, the website may be accessible but have had its webpage content removed, leaving only information such as a website maintenance or closure notice; if such information is collected when the website information is collected, failure information is also considered to have been returned. When failure information is returned during collection of the website information, the result of the website effectiveness monitoring is that the current detected website is invalid.
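The two kinds of failure information can be checked as in the sketch below. Treating every status code of 400 or above as a failure and the phrases in MAINTENANCE_MARKERS are assumptions made for illustration; the function name has_failure_info is hypothetical.

```python
# Hypothetical phrases that indicate a maintenance or closure notice.
MAINTENANCE_MARKERS = ("under maintenance", "site closed")


def has_failure_info(status_code: int, body: str) -> bool:
    """Return True when the collected response counts as failure information."""
    # Case 1: an HTTP error status such as 403, 404 or 502.
    if status_code >= 400:
        return True
    # Case 2: the page loads but only a maintenance or closure notice remains.
    lowered = body.lower()
    return any(marker in lowered for marker in MAINTENANCE_MARKERS)
```

A production check would probably also handle connection timeouts and DNS failures, which never produce a status code at all.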
In step S230, website content monitoring is used to monitor whether harmful content exists in a website, fig. 6 is a flowchart of website content monitoring of an audiovisual website inspection method provided in an embodiment of the present application, as shown in fig. 6, in an alternative implementation, the website content monitoring includes the following steps:
Step S2341, obtaining sensitive text from the link text information according to a preset sensitive word information base.
In the embodiment of the present application, the preset sensitive word information base may include known sensitive words on the internet, such as pornographic, sexually suggestive, abusive and illicit online expressions, words related to violence, terror, reactionary content and the like, as well as their homophones and pinyin-and-text combinations. The content of the preset sensitive word information base may be updated in real time according to local public opinion trends, emergencies, public events and the like.
In the embodiment of the present application, the link text information is obtained by searching the webpage source code of the home page of the current detected website for <a href=...> and </a> tags. For example, the following source code exists in a certain website:
<a href="http://www.iqiyi.com/v_19rre3arf0.html">A journey that cannot be replaced by others</a>
The text between the start tag <a> and the end tag </a> is the link text, that is, the link text information obtained from the website source code is: A journey that cannot be replaced by others.
In the embodiment of the application, according to a preset sensitive word information base, sensitive word matching is performed on all the obtained link text information, so that a sensitive text is obtained from the link text information.
It should be noted that, in the embodiment of the present application, to facilitate classification of the harmful content that may exist in the current detected website, the preset sensitive words may be stored by category in the preset sensitive word information base, and a sensitivity level may be set for each sensitive word or for each category of sensitive words, so that whether harmful information exists in a webpage can be judged differently according to the different sensitivity levels.
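Sensitive word matching against a categorized, levelled word base can be sketched as follows. The contents of SENSITIVE_WORDS, the numeric levels and the case-insensitive substring matching are illustrative assumptions; the function name find_sensitive is hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical word base: word -> (category, sensitivity level).
SENSITIVE_WORDS: Dict[str, Tuple[str, int]] = {
    "terrorist attack": ("violence_terror", 3),
    "gambling": ("illicit", 2),
}


def find_sensitive(
    link_texts: List[str],
    lexicon: Dict[str, Tuple[str, int]] = SENSITIVE_WORDS,
) -> List[Tuple[str, str, str, int]]:
    """Return (link text, word, category, level) for every match."""
    hits = []
    for text in link_texts:
        lowered = text.lower()
        for word, (category, level) in lexicon.items():
            if word in lowered:
                hits.append((text, word, category, level))
    return hits


links = [
    "A journey that cannot be replaced by others",
    "Online gambling tonight",
]
hits = find_sensitive(links)
```

Each hit keeps the full link text so that a later semantic analysis can decide whether the content is actually harmful.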
Step S2342, if the sensitive text is obtained, whether harmful content exists in the current detected website or not is analyzed.
If the sensitive text is acquired in step S2341, which means that the content pointed by the link of the tag where the sensitive text is located may be harmful content, in step S2342, whether harmful content exists in the currently detected website is determined by analyzing the link text information with the sensitive text.
Content containing sensitive text is not necessarily harmful content. For example, the link text "XX police successfully thwart a terrorist attack" contains sensitive text, yet the content it expresses is not harmful. Therefore, to find harmful content in the link text information, as an alternative embodiment, semantic analysis is performed on the link text information containing sensitive text, and whether harmful content is contained is determined according to the result of the semantic analysis.
Step S2343, if the harmful content exists, obtaining the type of the harmful content, and generating a classified statistical result of the harmful content.
In this embodiment of the application, if it is determined in step S2342 that the current detected website contains harmful content, the type of the harmful content, for example pornographic content or violent and terrorist content, is determined according to the classification of the sensitive words in the preset sensitive word information base. The harmful content is then classified and counted according to its quantity and type in the current detected website to generate a classification statistical result, which can be displayed as a pie chart, a histogram or in other forms.
In addition, according to the information such as the quantity, the type and the sensitivity level of the harmful contents in the current detected website, the early warning condition of the harmful contents can be set, and when the quantity, the type or the sensitivity level of the harmful contents reach the early warning condition, the early warning is sent out.
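The classification statistics and the early warning condition can be sketched together as below. The representation of a harmful item as a (category, level) pair and the default thresholds max_total and max_level are assumptions chosen for illustration.

```python
from collections import Counter
from typing import List, Tuple

HarmfulItem = Tuple[str, int]  # (category, sensitivity level), assumed shape


def classify_counts(items: List[HarmfulItem]) -> Counter:
    """Count harmful items per category for charting."""
    return Counter(category for category, _level in items)


def should_warn(
    items: List[HarmfulItem], max_total: int = 5, max_level: int = 3
) -> bool:
    """Warn when there are too many items or any item is severe enough."""
    if len(items) >= max_total:
        return True
    return any(level >= max_level for _category, level in items)


found = [("pornographic", 2), ("violence_terror", 3), ("pornographic", 1)]
stats = classify_counts(found)
```

The resulting counter maps each category to its count and can be passed directly to a charting routine.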
Step S240, updating the inspection historical information base of the current detected website according to the data monitoring result of the current detected website, and if the inspection historical information base does not exist, creating the inspection historical information base.
The inspection history information base is used for recording the inspection history of each local audio-visual website in the local audio-visual website library. It can record the website information and data monitoring results acquired for a website during inspections over a past period of time, and can be used to compile statistics such as the changes in website information, the changes in data monitoring results, and the trends of the various monitoring results over that period.
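The create-or-update behaviour of step S240 can be sketched with an in-memory dictionary standing in for the inspection history information base. The record layout and the function name record_inspection are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def record_inspection(
    history: Dict[str, List[dict]],
    domain: str,
    result: dict,
    when: Optional[datetime] = None,
) -> Dict[str, List[dict]]:
    """Append a monitoring result, creating the site's history if absent."""
    when = when or datetime.now(timezone.utc)
    # setdefault creates the per-site history on first inspection.
    history.setdefault(domain, []).append(
        {"time": when.isoformat(), "result": result}
    )
    return history


history_base: Dict[str, List[dict]] = {}
record_inspection(history_base, "tv.sohu.com", {"layout_changed": False})
record_inspection(history_base, "tv.sohu.com", {"layout_changed": True})
```

Because every entry carries a timestamp, trends over a past period can be derived by filtering the list for each domain.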
In addition, in step S2311, the website permission information of the current detected website may also be obtained and compared with the website permission information stored in the local audio-visual website library to determine whether the ICP filing information or other permission information of the website has changed. A change indicates that the website pointed to by the currently monitored domain name may have changed. In that case, whether the current website is still an audio-visual website is determined according to the classification information; if it is, the corresponding entry in the local audio-visual website library is updated, and subsequent steps such as data monitoring continue to be performed.
Step S250, updating the local audio-visual website library according to the result of data monitoring on the current detected website.
Whether the information about the current detected website stored in the local audio-visual website library has changed is analyzed according to the data monitoring result; if so, the original content is replaced with the changed content, thereby updating the local audio-visual website library.
In an alternative implementation, the method provided in the embodiment of the present application may further include:
Step S2324, when the website layout changes, saving the webpage screenshot and adding a screenshot timestamp to the webpage screenshot.
In the embodiment of the application, when the website layout is not changed, the webpage screen capturing is deleted after the monitoring of the website layout is finished; and when the layout of the website changes, storing the webpage screenshot, and adding a screenshot time stamp to the webpage screenshot. The webpage screen capture with the timestamp can visually present the webpage screen capture time, and is convenient to search and compare. Meanwhile, when the layout of the website is not changed, the webpage screen capture is deleted, so that the storage space occupied by the webpage screen capture in the inspection process can be reduced, the quantity of webpage screen capture storage is reduced, and the retrieval efficiency is improved.
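The retention rule can be sketched as below. Encoding the timestamp in the file name is an assumption; the application only states that a timestamp is added to the screenshot, which could equally be a visible stamp on the image. The function name retain_screenshot is hypothetical.

```python
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def retain_screenshot(
    image: bytes,
    domain: str,
    changed: bool,
    out_dir: Path,
    taken_at: datetime,
) -> Optional[Path]:
    """Save the screenshot only when the layout changed; otherwise discard it."""
    if not changed:
        return None  # nothing is written, which saves storage space
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = taken_at.strftime("%Y%m%dT%H%M%S")
    path = out_dir / f"{domain}_{stamp}.png"
    path.write_bytes(image)
    return path


# Demonstration with placeholder bytes in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    shot_time = datetime(2017, 11, 22, 9, 30, 0)
    saved = retain_screenshot(
        b"fake-png-bytes", "tv.le.com", True, Path(tmp), shot_time
    )
    saved_name = saved.name
    skipped = retain_screenshot(
        b"fake-png-bytes", "tv.le.com", False, Path(tmp), shot_time
    )
```

Sorting the saved files by name then also sorts them by capture time for each domain, which keeps later searching and comparison simple.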
It should be noted that, in the technical solution provided in this embodiment of the present application, after monitoring of the current detected website in the inspection queue is completed once, that is, after step S250 is performed, the next website in the inspection queue is selected as the new current detected website, and the inspection method provided in this embodiment is performed on it. This is repeated until all the local audiovisual websites in the inspection queue have been monitored, which completes one round of inspection.
According to the above technical scheme, to solve the problems in the prior art, an inspection queue of the local audio-visual websites is created according to the local audio-visual website library, and each local audio-visual website is then monitored in turn in the order of the inspection queue. Monitoring comprises collecting website information of the current detected website; performing at least one of website type monitoring, website layout monitoring, website effectiveness monitoring and website content monitoring on the current detected website according to the website information; updating or creating the inspection history information base of the current detected website according to the data monitoring result; and updating the local audio-visual website library. The technical scheme provided by the embodiment of the application thus monitors the local audio-visual websites purposefully on the basis of the local audio-visual website library. In addition, because the local audio-visual websites are inspected in order according to the inspection queue, the consumption of network resources and hardware resources during website monitoring is greatly reduced, the problems of low monitoring efficiency, untimely monitoring and missed monitoring in the prior art are avoided, and comprehensive and timely monitoring of the local internet audio-visual websites is achieved.
An embodiment of the present application further provides an audiovisual website inspection system, and fig. 8 is a block diagram of the audiovisual website inspection system shown in the embodiment of the present application, and as shown in fig. 8, the system includes:
a memory 310 and a processor 320;
the memory 310 is used for storing a local audio-visual website library, an inspection historical information base, website information, and an executable program of the processor 320;
the processor 320 is configured to:
creating an inspection queue of the local audio-visual website according to the local audio-visual website library;
acquiring a current detected website from the inspection queue, and acquiring website information of the current detected website, wherein the website information comprises at least one of classification information, webpage screen capturing and all link text information;
according to the website information, performing data monitoring on the current detected website, wherein the data monitoring comprises at least one of website type monitoring, website layout monitoring, website effectiveness monitoring and website content monitoring;
updating the inspection historical information base of the current detected website according to the result of data monitoring on the current detected website, and if the inspection historical information base does not exist, creating the inspection historical information base;
and updating the local audio-visual website library according to the result of data monitoring on the current detected website.
According to the above technical scheme, to solve the problems in the prior art, the audio-visual website inspection system provided by the embodiment of the application creates an inspection queue of the local audio-visual websites according to the local audio-visual website library and monitors each local audio-visual website in turn in the order of the inspection queue. Monitoring comprises collecting website information of the current detected website; performing at least one of website type monitoring, website layout monitoring, website effectiveness monitoring and website content monitoring on the current detected website according to the website information; updating or creating the inspection history information base of the current detected website according to the data monitoring result; and updating the local audio-visual website library. The system thus monitors the local audio-visual websites purposefully on the basis of the local audio-visual website library. In addition, because the local audio-visual websites are inspected in order according to the inspection queue, the consumption of network resources and hardware resources during website monitoring is greatly reduced, the problems of low monitoring efficiency, untimely monitoring and missed monitoring in the prior art are avoided, and comprehensive and timely monitoring of the local internet audio-visual websites is achieved.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method for constructing a library of audiovisual websites, the method comprising:
obtaining a local domain name library from domain name scanning results of at least three scanning modes;
obtaining classification information of a local website according to the local domain name library, wherein the classification information at least comprises webpage keywords and webpage description information;
classifying the local websites according to the classification information to generate a local audio-visual type pending website library;
acquiring website permission information from the web pages of the local audio-visual type to-be-examined website according to the local audio-visual type to-be-examined website library, wherein the website permission information at least comprises ICP (Internet Content Provider) filing information;
and constructing a local audio-visual website library according to the local audio-visual website library to be examined and the website permission information.
2. The method according to claim 1, wherein the step of obtaining the local domain name library from the domain name scanning results of at least three scanning modes comprises:
obtaining the domain name scanning result according to IP port scanning, domain name library directional scanning and whole network spreading scanning modes;
and carrying out data combination on the domain name scanning results, taking intersection, cleaning data, removing non-local domain names and generating a local domain name library.
3. The method of claim 1, wherein before the step of constructing a local audiovisual website library from a local audiovisual-like pending website library and the website licensing information, further comprising:
and checking the license information, and correcting the license information when the license information has errors.
4. An audiovisual website inspection method, comprising:
according to the local audio-visual website library, creating an inspection queue of the local audio-visual website;
acquiring a current detected website from the inspection queue, and acquiring website information of the current detected website, wherein the website information comprises at least one of classification information, webpage screen capturing and all link text information;
according to the website information, performing data monitoring on the current detected website, wherein the data monitoring comprises at least one of website type monitoring, website layout monitoring, website effectiveness monitoring and website content monitoring;
updating an inspection historical information base of the current detected website according to a data monitoring result of the current detected website, and if the inspection historical information base does not exist, creating the inspection historical information base;
and updating the local audio-visual website library according to the result of data monitoring on the current detected website.
5. The method of claim 4, wherein the website type monitoring comprises:
determining whether the current detected website is an audio-visual website according to the classification information;
if the website is not an audio-visual website, the website type is changed.
6. The method of claim 4, wherein the website layout monitoring comprises:
acquiring the layout information of the current detected website from the home page screenshot of the current detected website;
comparing the layout information with the pre-stored layout information of the current detected website for consistency;
and if the consistency is lower than a preset threshold value, the website layout is changed.
7. The method of claim 4, wherein the website effectiveness monitoring comprises:
analyzing whether failure information is returned or not when the website information is collected;
and if failure information is returned, the current detected website is invalid.
8. The method of claim 4, wherein the website content monitoring comprises:
acquiring a sensitive text from the link text information according to a preset sensitive word information base;
if the sensitive text is obtained, analyzing whether harmful content exists in the current detected website or not;
and if the harmful content exists, acquiring the type of the harmful content, and generating a classified statistical result of the harmful content.
9. The method of claim 4, wherein the website layout monitoring further comprises:
and when the website layout changes, saving the webpage screenshot, and adding a screenshot time stamp to the webpage screenshot.
10. An audiovisual website inspection system, the system comprising:
a memory and a processor;
the memory is used for storing a local audio-visual website library, an inspection historical information base, website information and an executable program of the processor;
the processor is configured to:
creating an inspection queue of local audio-visual websites according to the local audio-visual website library;
acquiring a current detected website from the inspection queue, and acquiring website information of the current detected website, wherein the website information comprises at least one of classification information, a webpage screenshot and all link text information;
according to the website information, performing data monitoring on the current detected website, wherein the data monitoring comprises at least one of website type monitoring, website layout monitoring, website effectiveness monitoring and website content monitoring;
updating the inspection historical information base of the current detected website according to the result of data monitoring on the current detected website, and if the inspection historical information base does not exist, creating the inspection historical information base;
and updating the local audio-visual website library according to the result of data monitoring on the current detected website.
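The processor steps of claim 10 can be read as a simple inspection loop, sketched below under stated assumptions. The data model is hypothetical: the library is a mapping from URL to a `SiteRecord`, the history base is a mapping from URL to a list of results, each monitor is a function returning named boolean findings, and the library update is reduced to flipping two flags. None of these details come from the claims.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SiteRecord:
    url: str
    is_audio_visual: bool = True
    valid: bool = True


Monitor = Callable[[SiteRecord, dict], Dict[str, bool]]


def run_inspection(
    library: Dict[str, SiteRecord],
    collect: Callable[[str], dict],
    monitors: List[Monitor],
    history: Dict[str, List[dict]],
) -> None:
    """Inspect every site in the library once, updating history and library in place."""
    queue = deque(library)  # inspection queue built from the local library
    while queue:
        url = queue.popleft()  # current detected website
        site = library[url]
        info = collect(url)  # website information (classification, screenshot, link texts)
        result: Dict[str, bool] = {}
        for monitor in monitors:  # any subset of the four monitoring types
            result.update(monitor(site, info))
        # Create the history entry if it does not exist yet, then record the result.
        history.setdefault(url, []).append(result)
        # Assumed update policy: reflect type and validity findings in the library.
        if result.get("type_changed"):
            site.is_audio_visual = False
        if result.get("invalid"):
            site.valid = False
```

Passing the monitors as a list reflects the "at least one of" wording: a deployment could enable only the checks it needs without changing the loop.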
CN201711173735.4A 2017-11-22 2017-11-22 Audio-visual website library construction method, audio-visual website inspection method and system Active CN107819781B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711173735.4A CN107819781B (en) 2017-11-22 2017-11-22 Audio-visual website library construction method, audio-visual website inspection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711173735.4A CN107819781B (en) 2017-11-22 2017-11-22 Audio-visual website library construction method, audio-visual website inspection method and system

Publications (2)

Publication Number Publication Date
CN107819781A (en) 2018-03-20
CN107819781B (en) 2020-07-31

Family

ID=61610376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711173735.4A Active CN107819781B (en) 2017-11-22 2017-11-22 Audio-visual website library construction method, audio-visual website inspection method and system

Country Status (1)

Country Link
CN (1) CN107819781B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927400A (en) * 2014-05-07 2014-07-16 重庆邮电大学 Web site product detailed information classification crawling and product information base establishing method
CN104598561A (en) * 2015-01-07 2015-05-06 中国农业大学 Text-based intelligent agricultural video classification method and text-based intelligent agricultural video classification system
CN107181620A (en) * 2017-06-09 2017-09-19 安徽博约信息科技股份有限公司 A kind of possession website supervisory systems

Also Published As

Publication number Publication date
CN107819781A (en) 2018-03-20

Similar Documents

Publication Publication Date Title
CN109241461B (en) User portrait construction method and device
Boididou et al. Verifying multimedia use at mediaeval 2015
US8510795B1 (en) Video-based CAPTCHA
CN108319630B (en) Information processing method, information processing device, storage medium and computer equipment
CN106326391B (en) Multimedia resource recommendation method and device
US9098807B1 (en) Video content claiming classifier
KR102399787B1 (en) Recognition of behavioural changes of online services
Teyssou et al. The InVID plug-in: web video verification on the browser
CN111818198B (en) Domain name detection method, domain name detection device, equipment and medium
US11526586B2 (en) Copyright detection in videos based on channel context
Fontanini et al. Web video popularity prediction using sentiment and content visual features
Lago et al. Visual and textual analysis for image trustworthiness assessment within online news
CN104899306B (en) Information processing method, information display method and device
KR20190042984A (en) System for monitoring digital works distribution
Saravanou et al. Twitter floods when it rains: a case study of the UK floods in early 2014
CN106570020A (en) Method and apparatus used for providing recommended information
Dragoi et al. AnoShift: A distribution shift benchmark for unsupervised anomaly detection
CN114157568B (en) Browser secure access method, device, equipment and storage medium
CN107819781B (en) Audio-visual website library construction method, audio-visual website inspection method and system
CN112788356B (en) Live broadcast auditing method, device, server and storage medium
Shi et al. Be in the know: Connecting news articles to relevant twitter conversations
Zhang et al. An end-to-end scalable copyright detection system for online video sharing platforms
CN113515670B (en) Film and television resource state identification method, equipment and storage medium
US9208157B1 (en) Spam detection for user-generated multimedia items based on concept clustering
CN111221989A (en) Display control method, device and system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant