CN112231518B - Method, system, electronic device and storage medium for discovering network propagation behavior of works - Google Patents

Publication number
CN112231518B
Authority
CN
China
Prior art keywords
work
platform
link
information
acquisition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011435954.7A
Other languages
Chinese (zh)
Other versions
CN112231518A (en
Inventor
石晓涛
潘军
王哲
张国鑫
丁鹏
郭铮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Xunsiya Information Technology Co ltd
Original Assignee
Nanjing Xunsiya Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Xunsiya Information Technology Co ltd
Priority to CN202011435954.7A
Publication of CN112231518A
Application granted
Publication of CN112231518B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/907 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/908 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Abstract

The invention discloses a method, a system, an electronic device and a storage medium for discovering the network propagation behavior of works. By managing works, reviewing collected results and using an automatic acquisition method, the invention discovers whole-network propagation data for many works at once and greatly reduces data omission; by using a data cleaning method, it identifies irrelevant data and improves both data validity and cleaning speed. The invention is suitable for discovering the network propagation behavior of film and television works, literary works, image-text works, music works and the like.

Description

Method, system, electronic device and storage medium for discovering network propagation behavior of works
Technical Field
The invention relates to the technical field of internet information acquisition, in particular to a method and a system for discovering network propagation behaviors of works, electronic equipment and a storage medium.
Background
With the development of internet technology and emerging traffic-driven markets, many user platforms have appeared that illegally spread film and television works in the form of video clips. Although many existing data acquisition systems, video platforms and search engines provide search functions, they cannot be applied to discovering how film and television works propagate on the network. The first reason is that video platforms hold huge volumes of data, so full acquisition and discovery is not possible and a large amount of data is missed; the second reason is that the collected data contains a large amount of invalid data, and although screening can be done with video fingerprint comparison, the comparison time of that technique grows linearly with the number of works, so it cannot meet the requirement of fast large-scale video comparison; the third reason is that no suitable automated system exists for whole-network discovery of the propagation behavior of a large number of works. These problems need to be solved by related technologies or systems.
Disclosure of Invention
Technical purpose: to solve the above technical problems, the invention discloses a method, a system, an electronic device and a storage medium for discovering the network propagation behavior of works, which can discover the network propagation behavior of videos while, as far as possible, addressing data omission, the difficulty of screening a large amount of invalid data, and the need to discover the network propagation behavior of large batches of works quickly and automatically.
Technical scheme: to achieve the above technical purpose, the invention adopts the following technical scheme:
A method for discovering the network propagation behavior of a work, characterized by sequentially executing the following steps:
S1, configuring rules: adding to a local database the information of the works to be retrieved, the information of a plurality of network platforms to be searched, the information of individual uploaders on each network platform, and a plurality of filtering rules, wherein the filtering rules comprise work basic filtering rules, work fingerprint filtering rules, and work-platform relations together with their filtering rules;
S2, automatic data acquisition: comprising a task generation thread and a task acquisition thread that run simultaneously, wherein the task generation thread generates a task list for data acquisition, and the task list stores data acquisition links that are updated in real time; the task acquisition thread executes the task list, completes the acquisition of data on the network propagation behavior of the works, and produces, item by item, video attribute information with preset content and display format;
S3, data cleaning: processing the video attribute information collected in step S2, and parsing it into formatted links according to the different platforms; and outputting, in list form, the valid formatted links screened according to the filtering rules.
Preferably, in step S1, the work information includes the work name, the number of episodes, the director, the work number, the regular expression used for matching the work name (all of which can be added, deleted, modified and queried), the work fingerprint features, and the acquisition keywords of the work; the methods for extracting the work fingerprint features include the SIFT algorithm, the Baidu cloud video fingerprint algorithm or the Tencent cloud video fingerprint algorithm;
the network platform information comprises a platform name, a platform number, a platform website, platform acquisition entry links, a platform search link and a platform attribute; the platform attribute is any one of a search engine, a video platform or a post forum, screened from across the whole network; the individual uploader information includes the uploader home page link.
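The configuration records described in step S1 can be pictured as a few plain data types. The following is only a minimal sketch: the class names, field names and sample values are hypothetical and chosen for illustration, and the source does not prescribe any particular schema or language.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class PlatformAttribute(Enum):
    """The three platform attributes named in the text."""
    SEARCH_ENGINE = "search_engine"
    VIDEO_PLATFORM = "video_platform"
    POST_FORUM = "post_forum"


@dataclass
class Work:
    work_id: str                 # work number
    name: str
    episode_count: int
    director: str
    name_pattern: str            # regular expression used to match the work name
    keywords: List[str] = field(default_factory=list)       # acquisition keywords
    fingerprints: List[bytes] = field(default_factory=list)  # opaque feature blobs


@dataclass
class Platform:
    platform_id: str             # platform number
    name: str
    website: str
    entry_links: List[str]       # one or more acquisition entry links
    search_link: str
    attribute: PlatformAttribute


@dataclass
class Uploader:
    platform_id: str
    home_page: str               # uploader home page link


# Hypothetical sample records (example.com is a placeholder domain).
work = Work("W001", "Example Drama", 24, "Example Director",
            r"Example\s*Drama", keywords=["Example Drama"])
platform = Platform("P01", "ExampleTube", "https://video.example.com",
                    ["https://video.example.com/search?q=KEYWORD"],
                    "https://video.example.com/search?q=KEYWORD",
                    PlatformAttribute.VIDEO_PLATFORM)
uploader = Uploader("P01", "https://video.example.com/u/42")
```

Keeping fingerprints as opaque bytes leaves the choice of extraction algorithm open, as the text does.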
Preferably, step S2 specifically includes:
S2.1, initializing the acquisition program, and starting the task generation thread and the task acquisition thread at the same time;
S2.2, in the task generation thread, reading the network platform information and the corresponding platform acquisition entry links at fixed time intervals;
S2.3, setting a first acquisition time interval threshold in the task generation thread and continuously reading the acquisition keyword information of the works; if the current acquisition time minus the last acquisition time is greater than the first acquisition time interval threshold, combining the platform acquisition entry link from step S2.2 with the acquisition keyword read at the current acquisition time to obtain a new link;
judging whether the new link exists in the task list; if not, adding the new link to the task list; otherwise, leaving it unprocessed;
S2.4, setting a second acquisition time interval threshold in the task generation thread and continuously reading the uploader home page links; if the current acquisition time minus the last acquisition time is greater than the second acquisition time interval threshold, retaining the corresponding uploader home page link;
judging whether the retained uploader home page link exists in the task list; if not, adding it to the task list; otherwise, leaving it unprocessed;
S2.5, in the task acquisition thread, continuously acquiring the links in the task list to obtain the corresponding data, and producing, item by item, video attribute information comprising the work number, platform number, uploader home page link, title, duration and work fingerprint features.
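The two concurrent threads of step S2 and the "add only if not already in the task list" rule can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the names `TaskList`, `generate_tasks`, `acquire_tasks`, `offer` and `fake_fetch` are hypothetical, the task list is held in memory rather than in a database, and the fetch function returns a canned record instead of accessing any platform.

```python
import queue
import threading


class TaskList:
    """Thread-safe task list that ignores links it has already seen."""

    def __init__(self):
        self._seen = set()
        self._pending = queue.Queue()
        self._lock = threading.Lock()

    def add(self, link):
        # Returns True only when the link was new and has been queued.
        with self._lock:
            if link in self._seen:
                return False
            self._seen.add(link)
        self._pending.put(link)
        return True

    def get(self, timeout):
        return self._pending.get(timeout=timeout)


def generate_tasks(tasks, links_to_offer, stop, interval=0.01):
    """Task generation thread: offer links at fixed intervals until stopped."""
    while True:
        for link in links_to_offer():
            tasks.add(link)
        if stop.wait(interval):
            break


def acquire_tasks(tasks, fetch, results, stop):
    """Task acquisition thread: drain the task list, exit once stopped."""
    while True:
        try:
            link = tasks.get(timeout=0.05)
        except queue.Empty:
            if stop.is_set():
                break
            continue
        results.append(fetch(link))


# Hypothetical inputs; no network access takes place.
def offer():
    return ["https://video.example.com/search?q=a",
            "https://video.example.com/u/42"]


def fake_fetch(link):
    return {"link": link, "title": "placeholder"}


tasks, results = TaskList(), []
stop_generating, stop_acquiring = threading.Event(), threading.Event()
producer = threading.Thread(target=generate_tasks,
                            args=(tasks, offer, stop_generating))
consumer = threading.Thread(target=acquire_tasks,
                            args=(tasks, fake_fetch, results, stop_acquiring))
producer.start()
consumer.start()
stop_generating.set()   # the generator always completes at least one pass
producer.join()
stop_acquiring.set()    # the acquirer drains what is left, then exits
consumer.join()
```

However many passes the generator makes, each link is acquired once, because the task list rejects links it already holds.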
Preferably, in step S2.5, to cope with the anti-crawler mechanisms of video platforms, the links in the task list are acquired using a proxy pool and different acquisition frameworks for different platforms.
Preferably, the platform acquisition entry links include the default search link of each platform and links formed by combining a plurality of query conditions and sorting conditions, and on average the number of permutations and combinations of platform acquisition entry links for each work exceeds 100.
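To illustrate how combining query conditions with sorting conditions multiplies the number of entry links, the sketch below enumerates every combination with `itertools.product`. The parameter names (`duration`, `period`, `sort`) and the base URL are hypothetical; real platforms each define their own query parameters.

```python
from itertools import product
from urllib.parse import urlencode


def build_entry_links(base, query_options, sort_options):
    """Return the default search link plus one link per combination of
    query-condition values and sorting condition."""
    names = list(query_options)
    links = [base]  # the platform's default search link
    for values in product(*(query_options[n] for n in names)):
        for sort in sort_options:
            params = dict(zip(names, values))
            params["sort"] = sort
            links.append(base + "?" + urlencode(params))
    return links


links = build_entry_links(
    "https://video.example.com/search",
    {"duration": ["short", "long"], "period": ["day", "week", "month"]},
    ["newest", "most_viewed"],
)
```

Here two duration values, three period values and two sort orders yield 12 combined links plus the default link; a few more conditions per platform quickly pushes the count past 100.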
Preferably, in step S1, the work basic filtering rule sets the work number, the first minimum duration, the positive title regular expression and the negative title regular expression, which are used for data filtering;
the work fingerprint filtering rule sets the work number, the work fingerprint features and the specified score, which are used for data filtering;
the work-platform relation and its filtering rule set the relation between works and network platforms, the second minimum duration and the whitelist, which are used for formulating comprehensive acquisition rules and filtering data; the work-platform relation is many-to-many, and each relation comprises a work number, a platform number, a minimum duration and a whitelist range.
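The three kinds of filtering rules, and the many-to-many work-platform relation in particular, can be sketched as records keyed by work number and platform number. All class names, field names and sample values below are hypothetical and only illustrate one possible layout.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class BasicRule:
    work_id: str
    min_duration: int        # first minimum duration, in seconds
    positive_title: str      # title must match this regular expression
    negative_title: str      # title must not match this regular expression


@dataclass(frozen=True)
class FingerprintRule:
    work_id: str
    feature_file: str        # path to a stored fingerprint feature file
    min_score: float         # the specified score


@dataclass(frozen=True)
class WorkPlatformRelation:
    work_id: str
    platform_id: str
    min_duration: int        # second minimum duration, in seconds
    whitelist: FrozenSet[str] = frozenset()  # whitelisted uploader home pages


def index_relations(
    relations: Iterable[WorkPlatformRelation],
) -> Dict[Tuple[str, str], WorkPlatformRelation]:
    """Index relations by (work number, platform number) for direct lookup."""
    return {(r.work_id, r.platform_id): r for r in relations}


relations = index_relations([
    WorkPlatformRelation("W001", "P01", 60,
                         frozenset({"https://video.example.com/u/official"})),
    WorkPlatformRelation("W001", "P02", 30),
    WorkPlatformRelation("W002", "P01", 120),
])
```

One work appears on several platforms and one platform carries several works, while each pair keeps its own minimum duration and whitelist.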
Preferably, step S3 specifically includes:
S3.1, extracting the video attribute information, including the work number, platform number, uploader home page link, title, duration and work fingerprint features, and setting up in the database a deduplication link list that can be called during data cleaning;
during data cleaning, parsing the video attribute information according to the different platforms; if the platform attributes are parsed normally, extracting the video attribute information and producing a formatted link according to the platform;
if the formatted link does not already appear in the existing deduplication link list, adding it to the deduplication link list; otherwise, discarding the formatted link;
S3.2, when the work number in the video attribute information read in step S3.1 is not empty, further processing the video attribute information according to the work basic filtering rule:
if the duration in the video attribute information is greater than or equal to the first minimum duration, continuing to the next step; otherwise, discarding the formatted link;
if the title in the video attribute information matches the positive title regular expression, continuing to the next step; otherwise, discarding the formatted link;
if the title in the video attribute information matches the negative title regular expression, discarding the formatted link; otherwise, proceeding to step S3.4;
S3.3, when the work number in the video attribute information read in step S3.1 is empty, directly judging whether the title in the video attribute information matches the positive title regular expression of a work in the database; if so, taking the matched work number and proceeding to step S3.4; if not, proceeding to step S3.5;
S3.4, according to the work number and the platform number, processing the video attribute information according to the work-platform relation and its filtering rule:
if the duration in the video attribute information is greater than or equal to the second minimum duration, continuing to the next step; otherwise, discarding the formatted link;
if the uploader home page link in the video attribute information is within the whitelist range, discarding the formatted link; otherwise, continuing to the next step;
downloading the video, extracting the work fingerprint features, and processing according to the work fingerprint filtering rule:
if a search of the fingerprint list in the database using the work fingerprint features finds one or more matching fingerprints with a score greater than the specified score, determining that the formatted link is valid; otherwise, discarding it; proceeding to step S3.6;
S3.5, performing an image search directly in the database using the work fingerprint features in the video attribute information; if matching work fingerprints are found, that is, one or more fingerprints exist in the fingerprint list retrieved by the search and the score is greater than the specified score, determining that the formatted link is valid; otherwise, discarding it; proceeding to step S3.6;
S3.6, registering the valid formatted links for review, and outputting them in list form.
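The order of the checks in steps S3.2 and S3.4 can be sketched as a single cascade in which cheap text checks run before the expensive fingerprint comparison. The sketch covers only the branch where the work number is known; the names `Video`, `Rules`, `is_valid` and `fake_score` are hypothetical, and the fingerprint scorer is a stand-in for whichever fingerprint algorithm is actually used.

```python
import re
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass
class Video:
    link: str
    platform_id: str
    uploader_home: str
    title: str
    duration: int            # seconds


@dataclass
class Rules:
    first_min_duration: int  # from the work basic filtering rule
    positive_title: str
    negative_title: str
    second_min_duration: int  # from the work-platform relation
    whitelist: FrozenSet[str]
    min_score: float         # the specified fingerprint score


def is_valid(video: Video, rules: Rules,
             fingerprint_score: Callable[[Video], float]) -> bool:
    # Step S3.2: work basic filtering rule.
    if video.duration < rules.first_min_duration:
        return False
    if not re.search(rules.positive_title, video.title):
        return False
    if re.search(rules.negative_title, video.title):
        return False
    # Step S3.4: work-platform relation, then whitelist.
    if video.duration < rules.second_min_duration:
        return False
    if video.uploader_home in rules.whitelist:
        return False
    # Fingerprint comparison runs last because it needs a video download.
    return fingerprint_score(video) > rules.min_score


rules = Rules(30, r"Example\s*Drama", r"trailer|reaction", 60,
              frozenset({"https://video.example.com/u/official"}), 0.3)


def fake_score(video):
    # Stand-in scorer: pretends every candidate matches with score 0.8.
    return 0.8


clip = Video("https://video.example.com/v/1", "P01",
             "https://video.example.com/u/42", "Example Drama episode 3", 95)
trailer = Video("https://video.example.com/v/2", "P01",
                "https://video.example.com/u/42", "Example Drama trailer", 95)
official = Video("https://video.example.com/v/3", "P01",
                 "https://video.example.com/u/official", "Example Drama episode 3", 95)
```

With these sample rules, `clip` passes every check, `trailer` is rejected by the negative title pattern, and `official` is rejected by the whitelist before any fingerprint work is done.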
Preferably, in step S3.1, the methods for extracting the video attribute information include XPath or CSS selectors; the formatted links comprise the work number, platform number, uploader home page link, title, duration and work fingerprint features, and further comprise the platform timestamp, access device information and access user information;
in step S3.4, only the first 30 seconds of the video are downloaded, which saves resources and speeds up the subsequent comparison; the extraction and comparison methods include the SIFT algorithm, the Baidu cloud video fingerprint algorithm and the Tencent cloud video fingerprint algorithm, and at least 1 key frame is extracted every 5 seconds;
in steps S3.1 to S3.6, the video attribute information is read through an API, and the information read is stored in Redis.
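The detailed description notes that, in the general case, a link is brought into standard form by removing the content after the question mark, and that the result is then checked against a deduplication link list. A minimal sketch of that general case follows. The names `normalize_link` and `LinkDeduplicator` are hypothetical, an in-memory set stands in for the Redis-backed list mentioned in the text, and platforms that identify a video through a query parameter would need their own handling.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_link(link: str) -> str:
    """General case only: drop the query string and fragment, which carry
    per-visit data such as timestamps, device and user information."""
    parts = urlsplit(link)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


class LinkDeduplicator:
    """In-memory stand-in for the deduplication link list."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, link: str):
        # Returns the normalized link when it is new, otherwise None.
        key = normalize_link(link)
        if key in self._seen:
            return None
        self._seen.add(key)
        return key


dedup = LinkDeduplicator()
first = dedup.add_if_new("https://video.example.com/v/123?ts=1700000000&device=ios")
second = dedup.add_if_new("https://video.example.com/v/123?ts=1700009999&device=web")
```

The two sample links point at the same video but differ in their per-visit parameters, so only the first is kept.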
A system for discovering the network propagation behavior of film and television works, characterized in that the system comprises a work registration management module, a platform registration management module, a work acquisition keyword management module, a work basic filtering configuration management module, a work fingerprint filtering configuration management module, an uploader management module, a work-platform relation and filtering configuration management module, and an audit output module; wherein:
the work registration management module is used for registering work information, which comprises add, delete, modify and query records of the work name, the number of episodes, the director, the work number and the regular expression used for matching the work name;
the platform registration management module registers platform information in the form of a platform list, the platform information comprising add, delete, modify and query records of the platform number, platform name, platform website, one or more platform acquisition entry links, and platform attribute;
the work acquisition keyword management module is used for maintaining the acquisition information of works, comprising the work number, acquisition keyword, last acquisition time and acquisition time interval;
the work basic filtering configuration management module is used for recording the basic filtering configuration of works, comprising add, delete, modify and query records of the work number, minimum duration, positive title regular expression and negative title regular expression;
the work fingerprint filtering configuration management module is used for recording the fingerprint filtering configuration of works, comprising the work number and the corresponding video feature files;
the uploader management module is used for recording video uploader information, comprising add, delete, modify and query records of the platform number, platform name, uploader number, uploader home page link, last acquisition time and acquisition interval;
the work-platform relation and filtering configuration management module is used for maintaining the platforms on which works are collected and the comprehensive acquisition rules, wherein the work-platform relation is many-to-many and each relation comprises a work number, a platform number, a minimum duration and a whitelist range;
the audit management module is used for reviewing the results processed by the data cleaning program;
the propagation result query module is used for querying information including the work number, work name, platform name, uploader home page link, duration, title and video file path and displaying it as a list; while sorting by work number, it offers an either-or choice of additionally sorting by uploader or by title; the data range is the data already approved in the audit management module.
An electronic device, comprising: a memory for storing at least one program and a processor for loading the at least one program to perform the method.
A storage medium having stored therein instructions executable by a processor, wherein the processor-executable instructions, when executed by a processor, are used to perform the method.
Beneficial effects: by adopting the above technical scheme, the invention has the following technical effects:
1) the invention configures a plurality of keywords for each work to search the key video platforms across the whole network, and combines this with acquiring all videos published by the uploaders on each platform's uploader list, thereby solving the problem of comprehensively discovering the network propagation data of works;
2) the invention can remove 85% of irrelevant data through text screening, and can screen out the remaining 15% of irrelevant data through image fingerprints, thereby solving the problem of cleaning a large amount of invalid data;
3) through a visual and flexibly configurable interface combined with the automatic acquisition method, the invention solves the difficulty of discovering the network propagation behavior of a large batch of works simultaneously and quickly;
4) the invention is suitable for discovering the network propagation behavior of film and television works, and is also suitable for discovering the network propagation behavior of literary works, image-text works and music works.
Drawings
FIG. 1 is a flow chart of the overall implementation of the present invention;
FIG. 2 is a flow chart of the implementation of configuration rules in the present invention;
FIG. 3 is a flow chart of the automated acquisition steps of the present invention;
FIG. 4 is a flow chart of an implementation of the data cleansing step S3.1 in the present invention;
FIG. 5 is a flow chart of the implementation of the data cleansing step S3.2 in the present invention;
FIG. 6 is a flow chart of the implementation of the data cleansing steps S3.3-S3.6 in the present invention.
Detailed Description
The invention provides a method and a system for discovering the network propagation behavior of film and television works, comprising a management and audit module, an automatic acquisition method and a data cleaning method.
The management and audit module includes but is not limited to work registration management, platform registration management, work-platform relation and filtering configuration management, work acquisition keyword management, work basic filtering configuration management and work fingerprint filtering configuration management, wherein:
1. Work registration management includes but is not limited to adding, deleting, modifying and querying the work name, the number of episodes, the director, the work number (copyrightId) and the regular expression used for matching the work name; adding, deleting, modifying and querying mean adding, deleting, modifying and querying records in the work list.
2. Platform registration management includes but is not limited to adding, deleting, modifying and querying the platform number (platform id), platform name, platform website, platform acquisition entry links (one platform has at least 1 acquisition entry link, with no upper limit) and platform attribute (including but not limited to search engine, video platform and post forum);
3. Work acquisition keyword management includes but is not limited to maintaining the relation among the work number, acquisition keywords, last acquisition time and acquisition time interval, the relation being one-to-many;
4. Work basic filtering configuration management includes but is not limited to adding, deleting, modifying and querying the work number, minimum duration, positive title regular expression and negative title regular expression;
5. Work fingerprint filtering configuration management includes but is not limited to managing the work number and its corresponding video feature files, the relation being one-to-many; the fingerprint features are imported from video files (the extraction methods include but are not limited to the SIFT algorithm, the Baidu cloud video fingerprint algorithm and the Tencent cloud video fingerprint algorithm; extracting 1 key frame per second is recommended, and extracting by shot is not recommended);
6. Work-platform relation and filtering configuration management maintains the platforms on which works are collected and the comprehensive acquisition rules; the work-platform relation is many-to-many, and each relation (marked as multiplex Id) includes but is not limited to the work number, platform number, minimum duration and whitelist range (multiplex Id: whitelist range = 1:n); videos of the work published by people on the whitelist are not to be counted in the propagation statistics, which is the purpose of the whitelist; the whitelist is bound per platform, because the same whitelisted person may use a different name on each platform.
7. Uploader management includes but is not limited to adding, deleting, modifying and querying the platform number, platform name, uploader number, uploader home page link, last acquisition time and acquisition interval;
8. Audit management includes but is not limited to querying the work number, work name, platform, propagation link picture, uploader name, uploader link, duration, title and video file path and displaying them as a list; while sorting by work number, it offers an either-or choice of additionally sorting by uploader or by title, and provides review buttons for single and multiple selection.
9. The propagation result query module includes but is not limited to querying the work number, platform name, propagation link, propagation link picture, uploader name, uploader link, duration, title and video file path and displaying them as a list; while sorting by work number, it offers an either-or choice of additionally sorting by uploader or by title; the data range is the data already approved in audit management.
As shown in fig. 1 and fig. 2, the method for discovering the network propagation behavior of works according to the invention sequentially performs the following steps: configuring rules, automatically acquiring data, cleaning data, forming a list, reviewing the list and querying the list. Configuring rules means configuring the functions of the modules for work registration management, platform registration management, work-platform relation and filtering configuration management, work acquisition keyword management, work basic filtering configuration management and work fingerprint filtering configuration management. The list content may include the work number, work name, platform name, propagation link, propagation link picture, uploader name, uploader link, duration, title, video file path and the like.
As shown in fig. 3, the invention provides an automatic acquisition method, with the following specific steps:
1. the acquisition program starts the task generation thread and the task acquisition thread at the same time; implementations include but are not limited to crawler frameworks developed in Python or Java;
2. at fixed intervals, the task generation thread reads the platforms registered in the platform list of platform registration management and their corresponding platform acquisition entry links; an interval within 1 hour is suggested, so that all data newly published on each platform can be acquired within the interval; the platform acquisition entry links are each platform's default search link together with links formed by combining each of the platform's query conditions and sorting conditions, and on average the number of permutations and combinations of entry links for each video platform can exceed 100; reading through an API is suggested, so as to be loosely coupled with the management system;
3. the task generation thread continuously reads the acquisition keyword information of the works; reading through an API (application programming interface) is suggested, so as to be loosely coupled with the management system; if the current time minus the last acquisition time of an acquisition keyword is greater than the interval duration, the platform acquisition entry link from the previous step and the keyword from this step are combined into a new link, that is, the position of the keyword value in the link is located for each platform and the value there is replaced with the current keyword;
4. if the final link from the above steps is not in the task list, it is added to the task list; otherwise it is not processed; storing the task list in a highly concurrent NoSQL database such as a Redis cluster is suggested;
5. the task generation thread continuously reads the uploader links in uploader management together with their last acquisition time and acquisition interval; if the current time minus the last acquisition time is greater than the interval duration, the corresponding uploader link is retained;
6. if the link from the above step is not in the task list, it is added to the task list; otherwise it is not processed;
7. the task acquisition thread defines, one by one, the extraction attributes for the uploader home page link, title, duration and link, and continuously acquires the links in the task list; because the anti-crawler mechanisms of video platforms are robust, a proxy pool and different acquisition frameworks can be used for different platforms, with multi-node deployment, to achieve highly concurrent acquisition from apps, static pages and dynamic pages.
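Steps 3 and 4 above (checking whether a keyword is due, substituting it into the entry link, and adding the result to the task list only if it is new) can be sketched as follows. The function names, the `q` parameter and the record layout are hypothetical, and the sketch assumes the keyword sits in a query parameter; the moment at which the last acquisition time is updated is also an assumption, since the text does not state it.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def is_due(now, last_acquired, interval):
    """True when the time since the last acquisition exceeds the interval."""
    return now - last_acquired > interval


def substitute_keyword(entry_link, keyword_param, keyword):
    """Replace the value of the keyword parameter in an entry link."""
    parts = urlsplit(entry_link)
    query = [(k, keyword if k == keyword_param else v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), parts.fragment))


def new_tasks(now, keywords, entry_links, keyword_param, task_list):
    """Add links for every due keyword; return the links actually added."""
    added = []
    for kw in keywords:
        if not is_due(now, kw["last"], kw["interval"]):
            continue
        for entry in entry_links:
            link = substitute_keyword(entry, keyword_param, kw["word"])
            if link not in task_list:
                task_list.add(link)
                added.append(link)
        kw["last"] = now  # assumption: mark the keyword as acquired now
    return added


keywords = [
    {"word": "Example Drama", "last": 0, "interval": 3600},
    {"word": "Other Show", "last": 9000, "interval": 3600},
]
entry_links = ["https://video.example.com/search?q=KEYWORD&sort=newest"]
task_list = set()
added_first = new_tasks(10000, keywords, entry_links, "q", task_list)
added_again = new_tasks(10000, keywords, entry_links, "q", task_list)
```

At time 10000 only the first keyword is due, so one link is added; an immediate second run adds nothing because the keyword has just been acquired.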
As shown in fig. 4 to 6, the present invention provides a data cleaning method, which includes the following specific steps:
1. judging the extracted attributes in the acquisition process: the extraction method includes but is not limited to xpath and css, whether the information cannot be extracted by a certain 1 attribute exists, and if the information cannot be extracted by the certain 1 attribute, the acquisition is stopped and an alarm is given; the situation represents that the acquisition rule of the current platform changes, and manual intervention is needed to adjust the extraction attribute of the task acquisition thread; otherwise, continuing the next step;
2. taking out the serial number of the work, the serial number of the platform, the link of the home page of the uploading person, the title, the duration and the link of the uploading person from the information in the acquisition result;
3. analyzing the link into a standard format according to different platforms, wherein the standard format comprises information such as a time stamp of a video platform, attached information of access equipment, information of access users and the like, so that the link accessed by different people, different equipment and different time for playing the video is different; the general situation is that the content after the question mark in the link is removed, and a special platform needs special solution; searching with a heavy chain removing connection list, reading through an API, loosely coupling with a management system, storing information in a redis, adding the information into the heavy chain removing connection list when the link is not repeated, and otherwise, discarding;
4. when the work number from step 2 is not empty, taking the minimum duration, the forward regular expression for the title content and the reverse regular expression for the title content from the work basic filtering configuration management according to the work number; these can be read through an API (application programming interface), are loosely coupled with the management system, and the information is stored in a;
5. taking the duration (n1) from step 2 and comparing it with the minimum duration (n2) from step 4; continuing when n1 >= n2, otherwise discarding the link;
6. taking the title from step 2; continuing when the title matches the forward regular expression for the title content from step 4, otherwise discarding the link;
7. taking the title from step 2; continuing when the title does not match the reverse regular expression for the title content from step 4, otherwise discarding the link;
8. when the work number from step 2 is not empty, taking the minimum duration and the white list from the work and platform relation and filtering configuration management according to the work number and the platform number; this information can be read through an API, is loosely coupled with the management system, and is suggested to be stored in redis;
9. taking the duration (n1) from step 2 and comparing it with the minimum duration (n3) from step 8; continuing when n1 >= n3, otherwise discarding the link;
10. taking the uploader home page link from step 2; discarding the link when the uploader home page link falls within the white list range from step 8, otherwise continuing;
11. downloading the video and extracting the video fingerprint t1: it is suggested to download only 30 seconds, which saves resources and speeds up the subsequent comparison; the video fingerprint extraction and comparison methods include a sift algorithm, a Baidu cloud video fingerprint algorithm and a Tencent cloud video fingerprint algorithm, and are not described again below where fingerprint comparison is involved; it is suggested to extract 1 key frame per second, and at least 1 key frame every 5 seconds;
12. when the work number from step 2 is not empty, obtaining the fingerprint list of the corresponding work from the work fingerprint filtering configuration management according to the work number; when a search with t1 finds one or more fingerprints in that fingerprint list with a score greater than the specified score (for example greater than 0.3, the highest score being 1), the link is determined to be valid and is registered in the auditing module;
13. when the work number from step 2 is empty, matching the title against the matching-name regular expressions in the work registration; if a work is matched, the work number of the link is set to that of the matched entry in the work registration table, and processing returns to step 12; otherwise, searching all fingerprint lists; when a search with t1 finds one or more fingerprints with a score greater than the specified score (for example greater than 0.3, the highest score being 1), the link is determined to be valid, the work number of the fingerprint list with the highest score is recorded as the work number of the current link, and the link is registered in the auditing module.
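The general case of step 3 can be illustrated with a minimal Python sketch: the content after the question mark is dropped and the result is checked against a deduplicated link list. The names `normalize_link` and `DedupLinkList` are hypothetical, the host name is a placeholder, and an in-memory set stands in for the redis-backed list described in the text. Platforms that carry the video identifier in the query string would need the platform-specific handling that step 3 mentions, which this sketch does not attempt.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_link(url):
    """General case only: remove everything after the question mark.

    Lowercasing the host is an extra assumption, not stated in the text.
    """
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, "", ""))


class DedupLinkList:
    """In-memory stand-in (assumption) for the deduplicated link list in redis."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        """Record the normalized link and return True if unseen, else False."""
        key = normalize_link(url)
        if key in self._seen:
            return False  # duplicate: the caller discards the link
        self._seen.add(key)
        return True


dedup = DedupLinkList()
first = dedup.add_if_new("https://Video.example.com/v/123?ts=1607558400&device=ios")
second = dedup.add_if_new("https://video.example.com/v/123?ts=1607558999&user=42")
```

Here the two raw links differ only in their query strings, so the second one is treated as a duplicate of the first.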
According to the invention, by managing works, acquisition and auditing and by adopting an automatic acquisition method, propagation data for multiple works can be discovered across the whole network with greatly reduced data omission; by adopting a data cleaning method, irrelevant data can be identified, which improves data validity and cleaning speed.
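The rule-based checks of data cleaning steps 5 to 7, 9 and 10 could be sketched in Python as below. This is an illustration under stated assumptions, not the patented implementation: the class and function names are hypothetical, durations are assumed to be in seconds, a regular expression is assumed to "match" when it is found anywhere in the title, and the white list range is assumed to be an exact set of uploader home page links.

```python
import re
from dataclasses import dataclass, field


@dataclass
class BasicRule:
    """Work basic filtering configuration (steps 4 to 7)."""
    min_duration: int      # n2, assumed to be seconds
    title_forward: str     # forward regular expression for the title
    title_reverse: str     # reverse regular expression for the title


@dataclass
class PlatformRule:
    """Work and platform relation filtering configuration (steps 8 to 10)."""
    min_duration: int                            # n3, assumed to be seconds
    whitelist: set = field(default_factory=set)  # uploader home page links


@dataclass
class Video:
    title: str
    duration: int          # n1
    uploader_home: str


def passes_filters(video, basic, platform):
    """Return True when the link should continue to fingerprint comparison."""
    if video.duration < basic.min_duration:           # step 5: n1 >= n2
        return False
    if not re.search(basic.title_forward, video.title):   # step 6
        return False
    if re.search(basic.title_reverse, video.title):       # step 7
        return False
    if video.duration < platform.min_duration:        # step 9: n1 >= n3
        return False
    if video.uploader_home in platform.whitelist:     # step 10
        return False
    return True


basic = BasicRule(60, r"Example Show", r"trailer|reaction")
platform = PlatformRule(120, {"https://video.example.com/u/official"})
kept = passes_filters(
    Video("Example Show episode 1", 1500, "https://video.example.com/u/someone"),
    basic,
    platform,
)
```

Ordering the cheap duration and title checks before the video download in step 11 keeps the expensive fingerprint comparison for links that survive the configured rules.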
The English terms appearing in the text are briefly explained as follows:
API: application programming interface; redis: a key-value database, one of the common database types in the computer industry; nosql: a non-relational database; xpath: XML Path Language, a language for locating information in XML documents; css: Cascading Style Sheets, a style sheet language for enhancing control over web page styles and allowing style information to be separated from web page content.
The above is only a description of the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and these are intended to fall within the scope of the invention.

Claims (9)

1. A method for discovering the network propagation behavior of a work is characterized by sequentially executing the following steps:
S1, configuring rules: adding work information to be retrieved, information on a plurality of network platforms to be searched, individual uploader information for each network platform and a plurality of filtering rules to a local database, wherein the filtering rules comprise a work basic filtering rule, a work fingerprint filtering rule, and a work and platform relation and its filtering rule;
the work basic filtering rule sets a work number, a first minimum duration, forward regular title content and reverse regular title content for data filtering;
the work fingerprint filtering rule sets a work number, a work fingerprint feature and a specified score for data filtering;
the work and platform relation and its filtering rule set the relation between works and network platforms, a second minimum duration and a white list, and are used for formulating a comprehensive acquisition rule and filtering data; the work and platform relation is a many-to-many relation, and each relation comprises a work number, a platform number, a minimum duration and a white list range;
S2, automatic data acquisition: comprising a task generation thread and a task acquisition thread that run simultaneously, wherein the task generation thread generates a task list for data acquisition, and the task list stores data acquisition links that are updated in real time; the task acquisition thread executes the task list, completes the acquisition of data on the network propagation behavior of the works, and formulates, item by item, video attribute information with preset content and display format;
S3, data cleaning: processing the video attribute information acquired in step S2, and parsing and formulating formatted links of the video attribute information according to the platform; outputting, in list form, the valid formatted links screened according to the filtering rules; step S3 specifically comprises:
S3.1, extracting video attribute information comprising a work number, a platform number, an uploader home page link, a title, a duration and work fingerprint feature information, and setting up in the database a deduplicated link list that can be called during data cleaning;
during data cleaning, parsing the video attribute information according to the platform, extracting the video attribute information if the platform attributes are parsed normally, and formulating formatted links according to the platform;
if the formatted link does not already exist in the deduplicated link list, adding the formatted link to the deduplicated link list, otherwise discarding the formatted link;
S3.2, when the work number in the video attribute information read in step S3.1 is not empty, further processing the video attribute information according to the work basic filtering rule:
if the duration in the video attribute information is greater than or equal to the first minimum duration, continuing to the next step, otherwise discarding the formatted link;
if the title in the video attribute information matches the forward regular title content, continuing to the next step, otherwise discarding the formatted link;
if the title in the video attribute information matches the reverse regular title content, discarding the formatted link, otherwise continuing to the next step and entering step S3.4;
S3.3, when the work number in the video attribute information read in step S3.1 is empty, directly judging whether the title in the video attribute information matches the forward regular content of a work title in the database; if so, taking the matched work number and entering step S3.4; if not, entering step S3.5;
S3.4, processing the video attribute information according to the work number and the platform number and according to the work and platform relation and its filtering rule:
if the duration in the video attribute information is greater than or equal to the second minimum duration, continuing to the next step, otherwise discarding the formatted link;
if the uploader home page link in the video attribute information falls within the white list range, discarding the formatted link, otherwise continuing to the next step;
downloading the video, extracting the work fingerprint features, and processing them according to the work fingerprint filtering rule:
if the work fingerprint features match one or more fingerprints in the fingerprint list in the database with a score greater than the specified score, determining that the formatted link is valid, otherwise discarding it; entering step S3.6;
S3.5, searching the database directly by image using the work fingerprint features in the video attribute information; if one or more fingerprints in the fingerprint lists are matched with a score greater than the specified score, determining that the formatted link is valid; otherwise discarding it; entering step S3.6;
S3.6, registering and auditing the valid formatted links, and outputting them in list form.
2. The method for discovering the network propagation behavior of the works according to claim 1, wherein in step S1, the work information includes a work name, a number of episodes of the work, a work director, a work number, a regular expression of work matching names that can be added, deleted and modified, a work fingerprint feature, and work acquisition words; the extraction method of the work fingerprint feature comprises a sift algorithm, a Baidu cloud video fingerprint algorithm or a Tencent cloud video fingerprint algorithm;
the network platform information comprises a platform name, a platform number, a platform website, a platform acquisition entry link, a platform search link and a platform attribute; the platform attribute is any one of a search engine, a video platform or a forum selected from across the whole network; the individual uploader information includes an uploader home page link.
3. The method for discovering network propagation behavior of a work according to claim 2, wherein step S2 specifically includes:
S2.1, initializing an acquisition program, and starting a task generation thread and a task acquisition thread at the same time;
S2.2, in the task generation thread, reading the network platform information and the corresponding platform acquisition entry links at fixed time intervals;
S2.3, setting a first acquisition time interval threshold in the task generation thread, continuously reading the acquisition word information of the works, and if the current acquisition time minus the last acquisition time is greater than the first acquisition time interval threshold, combining the platform acquisition entry link from step S2.2 with the acquisition word read at the current acquisition time to obtain a new link;
judging whether the new link exists in the task list; if not, adding the new link to the task list, otherwise not processing the new link;
S2.4, setting a second acquisition time interval threshold in the task generation thread, continuously reading the uploader home page links, and if the current acquisition time minus the last acquisition time is greater than the second acquisition time interval threshold, retaining the corresponding uploader home page link;
judging whether the retained uploader home page link exists in the task list; if not, adding it to the task list, otherwise not processing it;
S2.5, in the task acquisition thread, continuously acquiring the links in the task list to obtain the corresponding acquisition data, and formulating, item by item, video attribute information comprising work numbers, platform numbers, uploader home page links, titles, durations and work fingerprint features.
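As a reading aid for steps S2.3 and S2.4, the interval check and task-list update could be sketched in Python as follows. Everything named here is hypothetical: the `{word}` placeholder in the entry link, the dictionary keys, and the choice to update the last acquisition time immediately after tasks are generated are assumptions that the claim does not specify.

```python
from urllib.parse import quote


def is_due(now, last_acquired, interval_threshold):
    """Interval test of S2.3/S2.4: current time minus last time exceeds the threshold."""
    return now - last_acquired > interval_threshold


def generate_tasks(now, entry_links, acquisition_words, task_list):
    """Combine entry links with due acquisition words and add unseen links.

    acquisition_words: list of dicts with keys 'word', 'last', 'interval'
    (a hypothetical layout). Returns the links newly added to task_list.
    """
    added = []
    for item in acquisition_words:
        if not is_due(now, item["last"], item["interval"]):
            continue
        for entry in entry_links:
            # Assumption: the entry link is a template with a {word} placeholder.
            link = entry.replace("{word}", quote(item["word"]))
            if link not in task_list:   # only links not yet in the task list
                task_list.append(link)
                added.append(link)
        item["last"] = now              # assumption: updated once tasks are queued
    return added


tasks = []
words = [{"word": "Example Show", "last": 0, "interval": 3600}]
entries = ["https://video.example.com/search?q={word}"]
first_run = generate_tasks(4000, entries, words, tasks)
second_run = generate_tasks(4100, entries, words, tasks)
```

The second call adds nothing because only 100 time units have passed since the last acquisition, which is below the threshold of 3600.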
4. The method of discovering network propagation behavior of a work according to claim 3, wherein: in step S2.5, to cope with the anti-crawler mechanisms of video platforms, the links in the task list are acquired using a proxy pool and different acquisition frameworks according to the platform.
5. The method of discovering network propagation behavior of a work according to claim 2, wherein: the platform acquisition entry links comprise the default search links provided by each platform and links formed by combining various query conditions and sorting conditions, and the platform acquisition entry links of each work have, on average, more than 100 permutations and combinations.
6. The method of discovering network propagation behavior of a work according to claim 1, wherein:
in step S3.1, the method for extracting the video attribute information comprises xpath or css; the formatted links comprise work numbers, platform numbers, uploader home page links, titles, durations and work fingerprint feature information, and further comprise platform timestamps, access device information and access user information;
in step S3.4, only the first 30 seconds of the video are downloaded, which saves resources and speeds up the subsequent comparison; the extraction and comparison methods comprise a sift algorithm, a Baidu cloud video fingerprint algorithm and a Tencent cloud video fingerprint algorithm, and at least 1 key frame is extracted every 5 seconds;
in steps S3.1 to S3.6, the video attribute information is read through an API, and the read information is stored in redis.
7. A system for discovering the network propagation behavior of a work, characterized in that: the system comprises a work registration management module, a platform registration management module, a work acquisition word management module, a work basic filtering configuration management module, a work fingerprint filtering configuration management module, an uploader management module, a work and platform relation and filtering configuration management module and an audit output module; wherein,
the work registration management module is used for registering work information, the work information comprising addition, deletion, modification and query records of work names, numbers of episodes, work directors, work numbers and regular expressions of work matching names;
the platform registration management module registers platform information in the form of a platform list, the platform information comprising addition, deletion and modification records of a platform number, a platform name, a platform website, more than one platform acquisition entry link and platform attributes;
the work acquisition word management module is used for maintaining the acquisition information of works, comprising work numbers, acquisition words, the last acquisition time and acquisition time intervals;
the work basic filtering configuration management module is used for recording basic filtering configuration information of the works, comprising work numbers, the minimum duration, and addition, deletion and modification records of the forward regular title content and of the reverse regular title content;
the work fingerprint filtering configuration management module is used for recording work fingerprint filtering configuration information, comprising work numbers and corresponding video feature files;
the uploader management module is used for recording information on video uploaders, comprising a platform number, a platform name, an uploader number, an uploader home page link, the last acquisition time and the acquisition interval of the work source, and for adding, deleting, modifying and querying the records;
the work and platform relation and filtering configuration management module is used for maintaining the platform information and comprehensive acquisition rules of the acquired works, wherein the work and platform relation is a many-to-many relation, and each relation comprises a work number, a platform number, a minimum duration and a white list range;
the audit management module is used for auditing the results processed by the data cleaning program;
and the propagation result query module is used for querying information including the work number, the work name, the platform name, the uploader home page link, the duration, the title and the video file path and displaying it as a list, sorted by work number with combined sorting by uploader and title offered as an alternative option, the data range being data that has already been audited in the audit management module.
8. An electronic device, comprising: a memory for storing at least one program and a processor for loading the at least one program to perform the method of any one of claims 1-6.
9. A storage medium having stored therein processor-executable instructions, characterized in that: the processor-executable instructions, when executed by a processor, perform the method of any one of claims 1-6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011435954.7A CN112231518B (en) 2020-12-10 2020-12-10 Method, system, electronic device and storage medium for discovering network propagation behavior of works

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011435954.7A CN112231518B (en) 2020-12-10 2020-12-10 Method, system, electronic device and storage medium for discovering network propagation behavior of works

Publications (2)

Publication Number Publication Date
CN112231518A CN112231518A (en) 2021-01-15
CN112231518B true CN112231518B (en) 2021-04-06

Family

ID=74124597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011435954.7A Active CN112231518B (en) 2020-12-10 2020-12-10 Method, system, electronic device and storage medium for discovering network propagation behavior of works

Country Status (1)

Country Link
CN (1) CN112231518B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114358643B (en) * 2022-01-13 2023-09-12 南京讯思雅信息科技有限公司 Multimedia content wind control management device and management method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9842113B1 (en) * 2013-08-27 2017-12-12 Google Inc. Context-based file selection
WO2018005569A1 (en) * 2016-06-30 2018-01-04 Microsoft Technology Licensing, Llc Videos associated with cells in spreadsheets
CN210721462U (en) * 2019-08-02 2020-06-09 上海碧虎网络科技有限公司 Dynamic data acquisition and analysis system based on cloud control
CN110598475A (en) * 2019-09-19 2019-12-20 腾讯科技(深圳)有限公司 Block chain-based work attribute information acquisition method and device and computer equipment

Also Published As

Publication number Publication date
CN112231518A (en) 2021-01-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant