CN116304457B - Marking method for webpage multiple information attribute - Google Patents

Info

Publication number: CN116304457B
Authority: CN (China)
Prior art keywords: data, information, attribute, marking, mark
Legal status: Active (as listed; an assumption, not a legal conclusion)
Application number: CN202310166545.9A
Other languages: Chinese (zh)
Other versions: CN116304457A (en)
Inventors: 吕修政, 刘兆民
Current Assignee: Yuncai Chain (Guangzhou) Information Technology Co., Ltd. (as listed; may be inaccurate)
Original Assignee: Shandong Qianshun Advertising Media Co Ltd
Application filed by: Shandong Qianshun Advertising Media Co Ltd
Priority: CN202310166545.9A
Publications: CN116304457A (application), CN116304457B (grant)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method for marking multiple information attributes of a webpage and relates to the technical field of attribute marking. It addresses a problem in the prior art: webpage information attributes are identified through a predefined webpage multiple information attribute identification model and marking process, so identification is highly limited, the model cannot be flexibly changed or updated, and adaptability is poor. The method comprises determining information attributes, integrating information marks, and matching associated information. Attribute codes are extracted from the start marks and processed numerically, the attribute threshold of the attribute codes is determined, and the information attribute of each HTML mark is determined and used as the basis for modification, so that webpage attribute information is marked, extracted and applied conveniently and efficiently. Marking efficiency is greatly improved by using the established association relation between the content data marking distribution diagram and the information attributes.

Description

Marking method for webpage multiple information attribute
Technical Field
The invention relates to the technical field of attribute marking, in particular to a marking method of multiple information attributes of a webpage.
Background
Web pages are the basic elements of a website and the platform that carries its applications. Website functions are planned according to the information (including products, services, ideas and culture) that an enterprise wishes to convey to viewers, and page design and beautification follow. As one of an enterprise's external promotional materials, refined webpage design is crucial to promoting the enterprise's internet brand image.
A web page is a plain text file containing HTML tags, and is composed of tags and attributes. Related patents exist on marking multiple information attributes of web pages. For example, Chinese patent publication CN104679804A discloses a method for marking multiple attributes of a web page and its implementation. By providing an attribute identification module, an attribute configuration module and an attribute calling module, it addresses, from a whole-system perspective, the identification and storage of multiple information attributes of a captured web page, marking in multiple modes, and the flexible, repeated calling of attribute marking results and processes. By defining a webpage multiple information attribute identification model and marking process, that patent provides a unified method for marking multiple information attributes of a web page. This increases the efficiency and accuracy of the marking process, lays a foundation for convenient and repeated calling of the marking results and process during service processing, and increases the efficiency with which the system handles such services.
Although the above patent marks, extracts and applies web page attribute information conveniently and efficiently, the following problems remain in actual use:
1. In the prior art, webpage information attributes are marked by defining a webpage multiple information attribute identification model and marking process, so identification is highly limited, the model cannot be flexibly changed or updated, and adaptability is poor;
2. In the prior art, associated information cannot be matched and queried according to existing information and information attributes, so information cannot be shared, the expansibility of the webpage is affected, and the display of the webpage and the validity and timeliness of its information are reduced.
Disclosure of Invention
The invention aims to provide a method for marking multiple information attributes of a webpage. The information attribute of each HTML mark is determined and used as the basis for modification, so that webpage attribute information is marked, extracted and applied conveniently and efficiently. The information attributes of data in the webpage are identified automatically according to the established association relation between the content data marking distribution diagram and the information attributes, which greatly improves marking efficiency and solves the problems described in the background.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a marking method of webpage multiple information attributes comprises the following steps:
determining information attributes: acquiring the HTML marks in a webpage and identifying and extracting them, wherein there is at least one HTML mark, the HTML marks are arranged in pairs, and the HTML marks comprise a start mark and an end mark;
determining the attribute code in the start mark based on the extraction result of the HTML mark, carrying out numerical processing on the attribute code, determining the attribute threshold of the attribute code, and determining the information attribute of the HTML mark;
information mark integration: acquiring information attribute of the HTML mark, and dividing content data referenced by the HTML mark according to the information attribute to obtain a content data distribution diagram;
extracting keywords from the content data distribution map, converting the extracted keywords into word vectors, carrying out coordinate marking based on the word vectors to obtain a data information marking distribution map, and integrating the data to generate a data set;
matching associated information: determining one or more ranking factors associated with the data set based on the keywords, determining the association relation between the content data marking distribution diagram and the information attributes, and obtaining the corresponding content data and mark ranking according to the association relation based on the ranking factors.
Further, the HTML marks are used to reference document parts such as text and pictures;
the attributes are options of a mark, are placed within the start mark, and modify the mark's color, alignment, height and width.
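The first step above (collecting paired HTML marks and reading the attributes held in each start mark) can be sketched with Python's standard `html.parser` module. This is only an illustration under assumptions: the class name `StartTagCollector`, the helper `extract_marks` and the sample page are hypothetical and do not come from the patent, which does not name a parser.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Records every start mark with its attributes, and every end mark."""

    def __init__(self):
        super().__init__()
        self.start_tags = []  # list of (tag name, {attribute: value})
        self.end_tags = []    # list of tag names, in closing order

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs taken from the start mark
        self.start_tags.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


def extract_marks(html):
    """Hypothetical helper: return (start marks with attributes, end marks)."""
    parser = StartTagCollector()
    parser.feed(html)
    parser.close()
    return parser.start_tags, parser.end_tags


# Sample page (made up for illustration)
page = ('<p align="center"><font color="red">Sale</font></p>'
        '<img src="a.png" width="120" height="80">')
starts, ends = extract_marks(page)
for tag, attributes in starts:
    print(tag, attributes)
```

The attribute dictionaries returned here are the raw material that the later steps would encode numerically.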
Further, carrying out numerical processing on the attribute code and determining the attribute threshold of the attribute code specifically comprises:
analyzing the attribute codes, determining target code data, and dividing the target code data into a plurality of code blocks according to code categories;
acquiring the modification type of each code block, performing numerical processing to obtain a type value of each code block, and determining the extraction mode of corresponding object data according to the type value;
and acquiring attribute characteristics of object data corresponding to each code block based on the extraction mode, and generating information attributes of the HTML mark based on modification types of the attribute characteristics in the HTML mark.
Further, the obtaining the attribute features of the object data corresponding to each code block specifically includes:
storing the attribute characteristics of the object data on a storage node in a cloud network, and arranging the storage node from high to low based on the use frequency of the storage node when storing the data to establish a queuing queue;
and obtaining a plurality of matching groups according to the queuing queue, determining the use intensity of each code block, and matching the attribute characteristics of the object data by the plurality of matching groups according to the use intensity of each code block.
Further, storing the attribute characteristics of the object data on storage nodes in a cloud network, arranging the storage nodes from high to low based on their use frequency when storing data, and establishing a queuing queue comprises:
extracting the attribute characteristics of the object data, and uploading the attribute characteristics of the object data to a storage node in a cloud network;
extracting the use frequency of the storage node when data are stored, and judging whether the use frequency has the same frequency value or not;
when the using frequency does not have the same frequency value, arranging according to the using frequency from high to low, and establishing a queuing queue;
when the using frequency has the same frequency value, extracting the number of storage nodes with the same frequency value;
when the number of the storage nodes does not exceed a preset number threshold, integrating the storage nodes into one storage node set, treating the storage node set as a single storage node, arranging according to the use frequency from high to low, and establishing a queuing queue;
when the number of the storage nodes exceeds the preset number threshold, grouping the storage nodes to obtain a plurality of storage node sets, treating the storage node sets as parallel storage nodes, arranging according to the use frequency from high to low, and establishing a queuing queue;
the number threshold is obtained through the following formula:
wherein $M$ represents the number threshold, rounded up; $N$ represents the total number of storage nodes; $C$ represents the total number of storage triggers of the storage nodes per unit time; $k$ represents the number of unit times; $C_i$ represents the total number of storage triggers of the storage nodes in the $i$-th unit time; $N_i$ represents the number of storage nodes triggered in the $i$-th unit time; $M_0$ represents the reference number value, with $M_0$ in the range $[0.15N, 0.24N]$.
Further, when the number of the storage nodes exceeds a preset number threshold, grouping the storage nodes to obtain a plurality of storage node sets, including:
when the number of the storage nodes exceeds the preset number threshold but does not exceed the upper limit of the number condition, two storage node sets are formed, either of N/2 nodes each or of 1+N/2 and N/2 nodes, the two storage node sets are treated as parallel storage nodes and arranged from high to low according to the use frequency, and a queuing queue is established;
when the number of the storage nodes exceeds the preset number threshold and also exceeds the upper limit of the number condition, the nodes are grouped into sets whose size equals the number threshold, and any storage nodes left over after grouping that do not fill a set of that size are placed in a separate set;
wherein the upper limit of the number condition is obtained by the following formula:
wherein $M_{max}$ represents the value corresponding to the upper limit of the number condition, rounded up; $N_{max}$ represents the maximum number of nodes with the same starting frequency in unit time.
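The queue-building and tie-grouping rules above can be sketched as follows. Because the formulas for the number threshold and the upper limit are not reproduced in this text, the sketch takes both as plain parameters (`m` and `m_max`) rather than computing them. It also assumes that "N/2" in the two-set rule refers to the count of tied nodes. The function name `build_queue` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_queue(freq_by_node, m, m_max):
    """Return a queue of (use frequency, node set) entries, highest frequency first.

    m     -- number threshold (assumed given; the source formula is not shown)
    m_max -- upper limit of the number condition (likewise assumed given)
    """
    if m < 1:
        raise ValueError("number threshold must be at least 1")

    # Collect nodes that share the same use frequency
    by_freq = defaultdict(list)
    for node, freq in freq_by_node.items():
        by_freq[freq].append(node)

    queue = []
    for freq in sorted(by_freq, reverse=True):
        nodes = sorted(by_freq[freq])
        n = len(nodes)
        if n <= m:
            # At or below the threshold: one set, treated as a single node
            groups = [nodes]
        elif n <= m_max:
            # Above the threshold, within the upper limit: two near-equal sets
            half = (n + 1) // 2  # the larger half gets the extra node when n is odd
            groups = [nodes[:half], nodes[half:]]
        else:
            # Above the upper limit: sets of size m, remainder kept as its own set
            groups = [nodes[i:i + m] for i in range(0, n, m)]
        queue.extend((freq, group) for group in groups)
    return queue


# Made-up use frequencies: five nodes tie at frequency 7
freqs = {"a": 9, "b": 7, "c": 7, "d": 7, "e": 7, "f": 7, "g": 3}
for entry in build_queue(freqs, m=2, m_max=4):
    print(entry)
```

With `m=2` and `m_max=4`, the five tied nodes exceed the upper limit and are split into sets of two with one leftover set; raising `m_max` to 6 instead yields two sets of three and two nodes.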
Further, obtaining the data information mark distribution diagram specifically comprises:
performing word segmentation on the content data referenced by the HTML mark to obtain a plurality of words in the content data, determining the word characteristics of each word and the similarity among the words, and de-duplicating similar words to obtain target words;
cleaning the target word according to part-of-speech statistics characteristics to obtain a keyword, converting the keyword into word vectors, calculating the distance between each word vector and a standard word vector, and carrying out coordinate marking according to the distance;
drawing a data information mark distribution diagram according to the coordinate mark, and inputting a keyword corresponding to the word vector into a corresponding area of the data information mark distribution diagram;
and establishing an association relation between the data information mark distribution diagram and the content data distribution diagram based on the keywords and the content data referenced by the HTML mark.
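The keyword steps above (segment, de-duplicate similar words, clean, vectorize, measure distance to a standard vector, assign a coordinate) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration: whitespace splitting stands in for word segmentation, `difflib.SequenceMatcher` stands in for the similarity measure, a stop-word list stands in for part-of-speech cleaning, and `TOY_VECTORS`, `STANDARD_VECTOR` and the region width are made-up values rather than anything specified in the patent.

```python
import math
from difflib import SequenceMatcher

# Stand-in for cleaning by part-of-speech statistics (assumption)
STOPWORDS = {"the", "a", "of", "and", "for"}

# Toy 2-D word vectors and standard vector (hypothetical values)
TOY_VECTORS = {
    "price": (1.0, 0.0),
    "discount": (0.8, 0.6),
    "delivery": (0.0, 1.0),
}
STANDARD_VECTOR = (1.0, 0.0)


def dedupe(words, threshold=0.85):
    """Keep a word only if it is not too similar to a word already kept."""
    kept = []
    for word in words:
        if all(SequenceMatcher(None, word, k).ratio() < threshold for k in kept):
            kept.append(word)
    return kept


def mark_keywords(text, region_width=0.5):
    """Map each keyword to (distance to the standard vector, region index)."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    targets = dedupe([w for w in words if w])
    keywords = [w for w in targets if w not in STOPWORDS and w in TOY_VECTORS]
    marks = {}
    for keyword in keywords:
        distance = math.dist(TOY_VECTORS[keyword], STANDARD_VECTOR)
        # The coordinate mark is the distance; the region is its bucket
        marks[keyword] = (distance, int(distance // region_width))
    return marks


marks = mark_keywords("The price and the discount for delivery, prices")
for keyword, (distance, region) in marks.items():
    print(f"{keyword}: distance={distance:.3f}, region={region}")
```

In this sketch "prices" is dropped as a near-duplicate of "price", and each surviving keyword lands in a region of the distribution diagram determined by its distance from the standard vector.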
Further, the integrating the data to generate a data set specifically includes:
based on the association relation between the data information mark distribution diagram and the content data distribution diagram, determining a plurality of different data integration rules for the data information mark distribution diagram according to a plurality of data integration requirements, and establishing a dynamic data integration instruction based on the different data integration rules;
and respectively carrying out dynamic integration on the content data referenced by the keywords and the HTML marks based on the dynamic data integration instruction to obtain a plurality of groups of integration data, and generating a data set based on the plurality of groups of integration data.
Further, the matching of the association information includes:
constructing a marking model of a content data marking distribution diagram, determining one or more sorting factors associated with a webpage, and determining weight values of the content data marking distribution diagram and information attributes;
analyzing the content data through the marking model to obtain marking labels and information attributes of the content data, and inputting the marking labels and the information attributes into one or more ranking factors associated with the web pages for comparison;
marking associated content data in one or more ranking factors associated with the web page based on the marking model, and corresponding marking labels and information attributes of the content data to the associated content data one by one;
and arranging the associated content data according to its use intensity and bandwidth occupation probability, calculating the balance parameter of the associated content data, comparing the balance parameter with a preset balance parameter, and establishing the association relation of the associated content data based on the comparison result.
Further, the establishing the association relationship of the associated content data based on the comparison result includes:
constructing a calculation model of balance parameters during data matching, and inputting the number of matched associated nodes and the load of the associated nodes into the calculation model for calculation during data matching;
when the balance parameter is determined to be smaller than a preset balance parameter, screening out an associated node with a load larger than a first preset load as a first associated node, and arranging from large to small based on the load of the first associated node to establish a first queuing queue;
when the balance parameter is determined to be smaller than a preset balance parameter, screening out the association node with the load smaller than a second preset load as a second association node, and arranging from small to large based on the load of the second association node to establish a second queuing queue;
establishing the association relation between the first association node and the second association node, and analyzing according to the association relation to obtain the mark label and the information attribute;
and transmitting the main content to the second association node for association, and associating the mark label on the first association node.
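The node-balancing steps above can be sketched as below. The patent does not give the calculation model for the balance parameter, so this sketch substitutes a simple, clearly hypothetical measure (mean load divided by peak load, computed from the number of matched nodes and their loads). The function names, thresholds and sample loads are all assumptions.

```python
def balance_parameter(loads):
    """Hypothetical balance measure: mean load / peak load, in (0, 1].

    A value of 1.0 means every associated node carries the same load.
    """
    values = list(loads.values())
    if not values:
        raise ValueError("at least one associated node is required")
    peak = max(values)
    if peak == 0:
        return 1.0
    return (sum(values) / len(values)) / peak


def build_queues(loads, preset_balance, first_preset_load, second_preset_load):
    """Return (first queue, second queue) when the balance parameter is too low.

    first queue  -- nodes with load above first_preset_load, largest load first
    second queue -- nodes with load below second_preset_load, smallest load first
    """
    if balance_parameter(loads) >= preset_balance:
        return [], []
    first = sorted(
        (node for node, load in loads.items() if load > first_preset_load),
        key=lambda node: loads[node],
        reverse=True,
    )
    second = sorted(
        (node for node, load in loads.items() if load < second_preset_load),
        key=lambda node: loads[node],
    )
    return first, second


# Made-up loads for five associated nodes
loads = {"n1": 90, "n2": 20, "n3": 75, "n4": 10, "n5": 50}
first_queue, second_queue = build_queues(loads, 0.8, 70, 30)
print(first_queue, second_queue)
```

Under these assumed values the heavily loaded nodes form the first queue and the lightly loaded nodes form the second, which is where the main content would be sent for association.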
Compared with the prior art, the invention has the beneficial effects that:
1. Attribute codes are extracted from the start marks and processed numerically, the attribute threshold of the attribute codes is determined, and the information attribute of each HTML mark is determined and used as the basis for modification. Webpage attribute information can therefore be marked, extracted and applied conveniently and efficiently, information attributes can be marked flexibly, and adaptability is improved. A content data distribution diagram and a data information mark distribution diagram are drawn separately and the data are integrated into a data set, so the system can more easily obtain the effective information in the diagrams. The information attributes of data in the webpage are identified automatically according to the established association relation between the content data marking distribution diagram and the information attributes, which greatly improves marking efficiency.
2. The extraction mode of the corresponding object data is determined according to the type value of each code block, and the object data are stored centrally. This facilitates data management and visual development as well as orderly feature extraction, improves extraction efficiency, and allows the modification type in the HTML mark to be determined accurately. Matching based on the use frequency of each code block improves matching efficiency, so the information attribute of the HTML mark is generated quickly and user experience improves; narrowing the extraction range also reduces the amount of calculation and improves calculation efficiency.
3. A marking model of the content data marking distribution diagram is constructed and one or more ranking factors associated with the webpage are determined, so that associated information is matched and queried according to existing information and information attributes. This enables information sharing, improves the expansibility of the webpage, and safeguards the display of the webpage and the validity and timeliness of its information. The associated nodes are balanced so that each is used reasonably, which reduces the load on each associated node and improves the overall operating efficiency of the webpage.
Drawings
FIG. 1 is a flowchart of a method for marking multiple information attributes of a web page according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the invention.
In order to solve the technical problems of low efficiency and inaccurate classification that arise when information is classified after webpage attributes are marked, particularly when web pages are classified into a plurality of categories, and referring to FIG. 1, the present embodiment provides the following technical scheme:
a marking method of webpage multiple information attributes comprises the following steps:
determining information attributes: acquiring the HTML marks in a webpage and identifying and extracting them, wherein there is at least one HTML mark, the HTML marks are arranged in pairs, and the HTML marks comprise a start mark and an end mark; the HTML marks are used to reference document parts such as text and pictures; the attributes are options of a mark, are placed within the start mark, and modify the mark's color, alignment, height and width; determining the attribute code in the start mark based on the extraction result of the HTML mark, carrying out numerical processing on the attribute code, determining the attribute threshold of the attribute code, and determining the information attribute of the HTML mark;
information mark integration: acquiring information attribute of the HTML mark, and dividing content data referenced by the HTML mark according to the information attribute to obtain a content data distribution diagram; extracting keywords from the content data distribution map, converting the extracted keywords into word vectors, carrying out coordinate marking based on the word vectors to obtain a data information marking distribution map, and integrating the data to generate a data set;
matching associated information: determining one or more ranking factors associated with the data set based on the keywords, determining the association relation between the content data marking distribution diagram and the information attributes, and obtaining the corresponding content data and mark ranking according to the association relation based on the ranking factors.
Specifically, attribute codes are extracted from the start marks and processed numerically, the attribute threshold of the attribute codes is determined, and the information attribute of each HTML mark is determined and used as the basis for modification, so that webpage attribute information is marked, extracted and applied conveniently and efficiently and adaptability is improved. A content data distribution diagram and a data information mark distribution diagram are drawn separately and the data are integrated into a data set. Because the diagrams share a uniform format, association relations are established more efficiently and the system can more easily obtain the effective information in the diagrams. The information attributes of data in the webpage are identified automatically according to the established association relation between the content data marking distribution diagram and the information attributes, which greatly improves marking efficiency.
In one embodiment of the invention, storing the attribute characteristics of the object data on storage nodes in a cloud network, arranging the storage nodes from high to low based on their use frequency when storing data, and establishing a queuing queue comprises:
extracting the attribute characteristics of the object data, and uploading the attribute characteristics of the object data to a storage node in a cloud network;
extracting the use frequency of the storage node when data are stored, and judging whether the use frequency has the same frequency value or not;
when the using frequency does not have the same frequency value, arranging according to the using frequency from high to low, and establishing a queuing queue;
when the using frequency has the same frequency value, extracting the number of storage nodes with the same frequency value;
when the number of the storage nodes does not exceed a preset number threshold, integrating the storage nodes into one storage node set, treating the storage node set as a single storage node, arranging according to the use frequency from high to low, and establishing a queuing queue;
when the number of the storage nodes exceeds the preset number threshold, grouping the storage nodes to obtain a plurality of storage node sets, treating the storage node sets as parallel storage nodes, arranging according to the use frequency from high to low, and establishing a queuing queue;
the number threshold is obtained through the following formula:
wherein $M$ represents the number threshold, rounded up; $N$ represents the total number of storage nodes; $C$ represents the total number of storage triggers of the storage nodes per unit time; $k$ represents the number of unit times; $C_i$ represents the total number of storage triggers of the storage nodes in the $i$-th unit time; $N_i$ represents the number of storage nodes triggered in the $i$-th unit time; $M_0$ represents the reference number value, with $M_0$ in the range $[0.15N, 0.24N]$.
Specifically, when the number of the storage nodes exceeds a preset number threshold, grouping the storage nodes to obtain a plurality of storage node sets, including:
when the number of the storage nodes exceeds the preset number threshold but does not exceed the upper limit of the number condition, two storage node sets are formed, either of N/2 nodes each or of 1+N/2 and N/2 nodes, the two storage node sets are treated as parallel storage nodes and arranged from high to low according to the use frequency, and a queuing queue is established;
when the number of the storage nodes exceeds the preset number threshold and also exceeds the upper limit of the number condition, the nodes are grouped into sets whose size equals the number threshold, and any storage nodes left over after grouping that do not fill a set of that size are placed in a separate set;
wherein the upper limit of the number condition is obtained by the following formula:
wherein $M_{max}$ represents the value corresponding to the upper limit of the number condition, rounded up; $N_{max}$ represents the maximum number of nodes with the same starting frequency in unit time.
This approach improves the accuracy with which storage nodes are sorted and prevents the data confusion that multiple storage nodes with the same use frequency would otherwise cause during sorting. The number threshold and grouping described above also improve the match between the number threshold and the actual number of storage nodes, which further improves the order and accuracy of subsequent storage node arrangement and sorting.
In order to solve the technical problems in the prior art that identification is highly limited, the model cannot be flexibly changed or updated, and adaptability is poor, and referring to FIG. 1, the present embodiment provides the following technical scheme:
carrying out numerical processing on the attribute code and determining the attribute threshold of the attribute code specifically comprises: analyzing the attribute codes, determining target code data, and dividing the target code data into a plurality of code blocks according to code categories; acquiring the modification type of each code block and performing numerical processing to obtain a type value for each code block, and determining the extraction mode of the corresponding object data according to the type value; acquiring the attribute characteristics of the object data corresponding to each code block based on the extraction mode, and generating the information attribute of the HTML mark based on the modification type of the attribute characteristics in the HTML mark;
the attribute characteristics of the object data corresponding to each code block are obtained, specifically: storing the attribute characteristics of the object data on a storage node in a cloud network, and arranging the storage node from high to low based on the use frequency of the storage node when storing the data to establish a queuing queue; and obtaining a plurality of matching groups according to the queuing queue, determining the use intensity of each code block, and matching the attribute characteristics of the object data by the plurality of matching groups according to the use intensity of each code block.
Specifically, the extraction mode of the corresponding object data is determined according to the type value of each code block, and the object data are stored centrally. Different code block types lead to different extraction modes for the corresponding object data. The extraction modes comprise a first depth extraction, a second depth extraction and a third depth extraction, where the first, second and third depths increase in that order. This facilitates data management and visual development as well as orderly feature extraction, improves extraction efficiency, and allows the modification type in the HTML mark to be determined accurately. Matching based on the use frequency of each code block improves matching efficiency, so the information attribute of the HTML mark is generated quickly and user experience improves; narrowing the extraction range also reduces the amount of calculation and improves calculation efficiency.
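The mapping from a code block's modification type to a numeric type value, and from there to one of the three extraction depths, can be sketched as below. The patent does not state the actual encoding, so the `TYPE_VALUES` table, the `Depth` enum and the choice to treat each attribute as its own code block are hypothetical.

```python
from enum import IntEnum


class Depth(IntEnum):
    """Three extraction depths, increasing from FIRST to THIRD."""
    FIRST = 1
    SECOND = 2
    THIRD = 3


# Hypothetical numeric encoding of modification types (not from the patent)
TYPE_VALUES = {"color": 1, "align": 2, "width": 3, "height": 3}


def extraction_modes(attributes):
    """Map each recognised attribute to (type value, extraction depth).

    Each attribute is treated as one code block; unrecognised attributes
    are skipped.
    """
    modes = {}
    for name in attributes:
        value = TYPE_VALUES.get(name)
        if value is not None:
            modes[name] = (value, Depth(value))
    return modes


modes = extraction_modes({"color": "red", "width": "120", "src": "a.png"})
for name, (value, depth) in modes.items():
    print(name, value, depth.name)
```

Keeping the encoding in a lookup table means new modification types can be added without touching the extraction logic, which fits the flexibility the description emphasizes.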
To solve the technical problems that prior-art algorithms are complex and time-consuming and consume substantial device performance and resources, and referring to fig. 1, this embodiment provides the following technical solution:
Obtaining the data information mark distribution diagram specifically comprises: performing word segmentation on the content data referenced by the HTML mark to obtain a plurality of words, determining the word characteristics of each word and the similarity between words, and de-duplicating similar words to obtain target words; cleaning the target words according to part-of-speech statistical characteristics to obtain keywords, converting the keywords into word vectors, calculating the distance between each word vector and a standard word vector, and carrying out coordinate marking according to the distance; drawing the data information mark distribution diagram from the coordinate marks, and entering the keyword corresponding to each word vector into the corresponding area of the diagram; and establishing an association relation between the data information mark distribution diagram and the content data distribution diagram based on the keywords and the content data referenced by the HTML mark;
integrating the data to generate a data set specifically comprises: based on the association relation between the data information mark distribution diagram and the content data distribution diagram, determining a plurality of different data integration rules for the data information mark distribution diagram according to a plurality of data integration requirements, and establishing dynamic data integration instructions based on those rules; and dynamically integrating the keywords and the content data referenced by the HTML mark according to the dynamic data integration instructions to obtain multiple groups of integrated data, and generating a data set from those groups.
Specifically, the data information is analyzed hierarchically: the words are de-duplicated to obtain target words, which are then cleaned, narrowing the data processing range and improving data processing efficiency. Drawing the data information mark distribution diagram and establishing its association relation with the content data distribution diagram takes the dependency relations between text words into account, which improves the accuracy and objectivity of keyword extraction. Using the established association relation, multiple groups of integrated data are obtained through several different integrations and a data set is finally generated. This ensures the diversity of the integrated data in the data set, and thereby the efficiency and accuracy of subsequent data analysis performed on the extracted data set.
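The keyword steps above (segmentation, similarity-based de-duplication, cleaning, vectorization, distance to a standard vector, coordinate marking) can be sketched as below. Several stand-ins are assumptions and not part of the described method: a regular expression replaces a real word segmenter, a stop-word list replaces part-of-speech cleaning, `toy_vector` is a hypothetical two-dimensional embedding rather than a trained word vector, and the region index is a simple distance bucket.

```python
import math
import re
from difflib import SequenceMatcher

# Stand-in for cleaning by part-of-speech statistics (assumption).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}


def segment(text):
    """Naive word segmentation; a real system would use a language-specific segmenter."""
    return re.findall(r"[a-z]+", text.lower())


def deduplicate(words, threshold=0.8):
    """Keep a word only if it is not too similar to a word already kept."""
    kept = []
    for word in words:
        if all(SequenceMatcher(None, word, k).ratio() < threshold for k in kept):
            kept.append(word)
    return kept


def toy_vector(word):
    """Hypothetical 2-D embedding: (length, vowel count). Not a trained word vector."""
    return (float(len(word)), float(sum(ch in "aeiou" for ch in word)))


def mark_keywords(text, standard=(0.0, 0.0), region_size=2.0):
    """Return {keyword: (distance to the standard vector, region index)}."""
    keywords = [w for w in deduplicate(segment(text)) if w not in STOP_WORDS]
    marks = {}
    for word in keywords:
        distance = math.dist(toy_vector(word), standard)
        marks[word] = (round(distance, 3), int(distance // region_size))
    return marks


print(mark_keywords("The price of the prices and a table"))
```

In this run "prices" is dropped as a near-duplicate of "price" before the stop words are removed, which mirrors the order of steps in the text: de-duplicate first, then clean.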
To solve the technical problems that, because associated information cannot be matched and queried from existing information and information attributes, information cannot be shared, the extensibility of the webpage suffers, and the display of the webpage and the validity and timeliness of its information are reduced, and referring to fig. 1, this embodiment provides the following technical solution:
The associated information matching comprises: constructing a marking model of the content data mark distribution diagram, determining one or more ranking factors associated with the webpage, and determining weight values for the content data mark distribution diagram and the information attributes; analyzing the content data through the marking model to obtain the mark labels and information attributes of the content data, and inputting the mark labels and information attributes into the one or more ranking factors associated with the webpage for comparison; marking the associated content data in the one or more ranking factors based on the marking model, and mapping the mark labels and information attributes of the content data one-to-one onto the associated content data; and arranging the associated content data according to its use intensity and broadband occupation probability, calculating the equalization parameter of the associated content data, comparing it with a preset equalization parameter, and establishing the association relation of the associated content data based on the comparison result;
establishing the association relation of the associated content data based on the comparison result comprises: constructing a calculation model of the equalization parameter for data matching, and, during data matching, inputting the number of matched associated nodes and the loads of the associated nodes into the calculation model; when the equalization parameter is determined to be smaller than the preset equalization parameter, screening out the associated nodes whose load is larger than a first preset load as first associated nodes, and arranging them from largest to smallest load to establish a first queuing queue; when the equalization parameter is determined to be smaller than the preset equalization parameter, screening out the associated nodes whose load is smaller than a second preset load as second associated nodes, and arranging them from smallest to largest load to establish a second queuing queue; analyzing the association relation between the first associated nodes and the second associated nodes to obtain the mark labels and information attributes; and transmitting the main content to the second associated nodes for association, and associating the mark labels on the first associated nodes.
Specifically, by constructing a marking model of the content data mark distribution diagram and determining one or more ranking factors associated with the webpage, associated information is matched and queried from the existing information and information attributes. This enables information sharing, improves the extensibility of the webpage, and safeguards the display of the webpage and the validity and timeliness of its information. By constructing a calculation model of the equalization parameter for data matching, the equalization parameter is calculated accurately, which makes the comparison between the equalization parameter and the preset equalization parameter more reliable. Each associated node can then be adjusted toward balance and used reasonably, which reduces the load on each associated node and improves the overall operating efficiency of the webpage.
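The queue-building step can be illustrated with a short sketch. The text does not define the calculation model for the equalization (balance) parameter, so the measure used here, mean load divided by maximum load, is purely an assumption, as are the node names and preset values.

```python
def balance_parameter(loads):
    """Hypothetical balance measure: mean load divided by max load (1.0 = even)."""
    if not loads or max(loads.values()) == 0:
        return 1.0
    return sum(loads.values()) / len(loads) / max(loads.values())


def build_queues(loads, preset_balance, first_preset_load, second_preset_load):
    """Return (first_queue, second_queue); both are empty when balance is acceptable."""
    if balance_parameter(loads) >= preset_balance:
        return [], []
    # First queue: heavily loaded nodes, largest load first.
    first = sorted(
        (node for node, load in loads.items() if load > first_preset_load),
        key=lambda node: loads[node],
        reverse=True,
    )
    # Second queue: lightly loaded nodes, smallest load first.
    second = sorted(
        (node for node, load in loads.items() if load < second_preset_load),
        key=lambda node: loads[node],
    )
    return first, second


loads = {"node-a": 90, "node-b": 20, "node-c": 75, "node-d": 10}
first_queue, second_queue = build_queues(loads, 0.8, 70, 30)
print(first_queue, second_queue)
```

With these example loads the parameter is about 0.54, below the preset 0.8, so both queues are built: the main content would go to the nodes at the head of the second queue, and the mark labels stay with the first.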
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification that a person skilled in the art makes to the technical solution and inventive concept, within the technical scope disclosed by the present invention, shall be covered by the scope of protection of the present invention.

Claims (10)

1. A method for marking multiple information attributes of a web page, characterized in that the method comprises the following steps:
determining information attributes: acquiring the HTML marks in a webpage and performing recognition and extraction, wherein there is at least one HTML mark, the HTML marks are arranged in pairs, and each pair comprises a start mark and an end mark;
determining an attribute code in the start mark based on the extraction result of the HTML mark, performing numerical processing on a target code, determining an attribute threshold of the target code, and determining the information attribute of the HTML mark;
information mark integration: acquiring the information attribute of the HTML mark, and dividing the content data referenced by the HTML mark according to the information attribute to obtain a content data distribution diagram;
extracting keywords from the content data distribution diagram, converting the extracted keywords into word vectors, carrying out coordinate marking based on the word vectors to obtain a data information mark distribution diagram, and integrating the data to generate a data set;
associated information matching: determining one or more ranking factors associated with the data set based on the keywords, determining the association relation between the content data mark distribution diagram and the information attributes, and obtaining the corresponding content data and mark ranking according to the association relation and the ranking factors.
2. The method for marking multiple information attributes of a web page according to claim 1, wherein: the HTML mark is used to reference document parts such as text and pictures;
the information attribute is an option of the mark, is placed in the start mark, and modifies the mark in terms of color, alignment, height and width.
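As an illustration of recognizing paired marks and reading the modifying attributes from each start mark, the sketch below uses Python's standard `html.parser` module. The class name `MarkCollector` and the `MODIFIERS` set are hypothetical names chosen for this example, and the pairing strategy (match an end mark with the most recent unmatched start mark of the same name) is an assumption.

```python
from html.parser import HTMLParser

# Attribute kinds named in the text: color, alignment, height and width.
MODIFIERS = {"color", "align", "height", "width"}


class MarkCollector(HTMLParser):
    """Collect paired start/end marks and the modifying attributes of each start mark."""

    def __init__(self):
        super().__init__()
        self.open_marks = []  # stack of (tag, attributes) still waiting for an end mark
        self.pairs = []       # completed (tag, modifying attributes) pairs

    def handle_starttag(self, tag, attrs):
        modifiers = {name: value for name, value in attrs if name in MODIFIERS}
        self.open_marks.append((tag, modifiers))

    def handle_endtag(self, tag):
        # Pair the end mark with the most recent unmatched start mark of the same name.
        for i in range(len(self.open_marks) - 1, -1, -1):
            if self.open_marks[i][0] == tag:
                self.pairs.append(self.open_marks.pop(i))
                break


collector = MarkCollector()
collector.feed('<table width="300" align="center"><tr><td height="20">x</td></tr></table>')
print(collector.pairs)
```

Pairs are recorded in the order their end marks appear, so inner marks come before the marks that enclose them.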
3. The method for marking multiple information attributes of a web page according to claim 2, wherein performing numerical processing on the target code and determining the attribute threshold of the target code specifically comprises:
analyzing the attribute codes, determining target code data, and dividing the target code data into a plurality of code blocks according to code categories;
acquiring the modification type of each code block, performing numerical processing to obtain a type value of each code block, and determining the extraction mode of corresponding object data according to the type value;
and acquiring attribute characteristics of object data corresponding to each code block based on the extraction mode, and generating information attributes of the HTML mark based on modification types of the attribute characteristics in the HTML mark.
4. The method for marking multiple information attributes of web page as claimed in claim 3, wherein: the attribute characteristics of the object data corresponding to each code block are obtained, specifically:
storing the attribute characteristics of the object data on a storage node in a cloud network, and arranging the storage node from high to low based on the use frequency of the storage node when storing the data to establish a queuing queue;
and obtaining a plurality of matching groups according to the queuing queue, determining the use intensity of each code block, and matching the attribute characteristics of the object data by the plurality of matching groups according to the use intensity of each code block.
5. The method for marking multiple information attributes of a web page according to claim 4, wherein storing the attribute characteristics of the object data on storage nodes in a cloud network, arranging the storage nodes from high to low by their use frequency when storing data, and establishing a queuing queue comprises:
extracting the attribute characteristics of the object data, and uploading the attribute characteristics of the object data to a storage node in the cloud network;
extracting the use frequency of the storage node when data are stored, and judging whether the use frequency has the same frequency value or not;
when the using frequency does not have the same frequency value, arranging according to the using frequency from high to low, and establishing a queuing queue;
when the using frequency has the same frequency value, extracting the number of storage nodes with the same frequency value;
when the number of the storage nodes does not exceed a preset number threshold, integrating the storage nodes into a storage node set, and arranging the storage node set as a storage node according to the use frequency from high to low to establish a queuing queue;
when the number of the storage nodes exceeds a preset number threshold, grouping the storage nodes to obtain a plurality of storage node sets, arranging the storage node sets as parallel storage nodes according to the use frequency from high to low, and establishing a queuing queue;
the number threshold is obtained through the following formula:
wherein M represents the number threshold, and M is rounded up; N represents the total number of storage nodes; C represents the total number of storage triggers of the storage nodes in a unit time; k represents the number of unit times; C_i represents the total number of storage triggers of the storage nodes in the i-th unit time; N_i represents the number of storage nodes triggered in the i-th unit time; M_0 represents the reference number value, and M_0 takes a value in the range [0.15N, 0.24N].
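The tie-handling rules of this claim can be sketched as follows. The number threshold M is passed in as a plain integer (`threshold`, assumed to be at least 1) because its formula is not reproduced in this text, and the plain chunking used for oversized ties is an assumption made only to keep the example short.

```python
from collections import defaultdict


def build_queue(frequencies, threshold):
    """Order storage nodes by use frequency, high to low.

    Each queue entry is a list of node sets that sit side by side (in parallel).
    `threshold` stands in for M; it must be a positive integer.
    """
    by_frequency = defaultdict(list)
    for node, freq in frequencies.items():
        by_frequency[freq].append(node)

    queue = []
    for freq in sorted(by_frequency, reverse=True):
        nodes = sorted(by_frequency[freq])
        if len(nodes) <= threshold:
            # Unique frequency, or a tie small enough to act as one storage node.
            queue.append([nodes])
        else:
            # Tie too large: split into parallel sets (plain chunking, an assumption).
            queue.append([nodes[i:i + threshold] for i in range(0, len(nodes), threshold)])
    return queue


freqs = {"s1": 9, "s2": 7, "s3": 7, "s4": 4, "s5": 4, "s6": 4, "s7": 4, "s8": 4}
print(build_queue(freqs, threshold=2))
```

Here `s2` and `s3` tie but fit within the threshold, so they form one set treated as a single node, while the five nodes at frequency 4 exceed it and are placed as parallel sets.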
6. The method for marking multiple information attributes of a web page according to claim 5, wherein: when the number of the storage nodes exceeds a preset number threshold, grouping the storage nodes to obtain a plurality of storage node sets, wherein the method comprises the following steps:
when the number of storage nodes exceeds the preset number threshold but does not exceed the upper limit of the number condition, forming two storage node sets of N/2 and N/2 nodes, or of 1+N/2 and N/2 nodes, and arranging the two storage node sets as parallel storage nodes from high to low by use frequency to establish a queuing queue;
when the number of storage nodes exceeds both the preset number threshold and the upper limit of the number condition, grouping the nodes into sets whose size equals the number threshold, and, when storage nodes remain after grouping that are too few to fill a set of that size, placing the remaining storage nodes in a separate set;
wherein the upper limit of the number condition is obtained by the following formula:
wherein M_max represents the number corresponding to the upper limit of the number condition, and M_max is rounded up; N_max represents the maximum number of nodes having the same starting frequency in a unit time.
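The two grouping cases of this claim can be sketched as a single function. Two assumptions are made: `threshold` (M) and `upper_limit` (M_max) are supplied as plain integers because their formulas are not reproduced in this text, and "N/2" is read as half of the nodes being grouped, with the extra node going to the first set when the count is odd.

```python
def group_tied_nodes(nodes, threshold, upper_limit):
    """Split storage nodes that share a use frequency into node sets."""
    count = len(nodes)
    if count <= threshold:
        return [list(nodes)]  # small tie: a single set
    if count <= upper_limit:
        half = count // 2
        first = count - half  # equals 1 + count // 2 when count is odd
        return [list(nodes[:first]), list(nodes[first:])]
    # Large tie: sets of `threshold` nodes; any remainder forms its own set.
    return [list(nodes[i:i + threshold]) for i in range(0, count, threshold)]


print(group_tied_nodes(["a", "b", "c", "d", "e"], threshold=2, upper_limit=6))
print(group_tied_nodes(list("abcdefgh"), threshold=3, upper_limit=6))
```

The first call falls between the threshold and the upper limit and yields two sets of 3 and 2 nodes; the second exceeds the upper limit and yields sets of 3, 3 and a remainder of 2.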
7. The method for marking multiple information attributes of a web page according to claim 5, wherein: the data information mark distribution diagram is obtained, specifically:
performing word segmentation operation on the content data cited by the HTML mark to obtain a plurality of words in the content data, determining word characteristics of each word and similarity among the words, and performing de-duplication processing on the words with the similarity to obtain target words;
cleaning the target word according to part-of-speech statistics characteristics to obtain a keyword, converting the keyword into word vectors, calculating the distance between each word vector and a standard word vector, and carrying out coordinate marking according to the distance;
drawing a data information mark distribution diagram according to the coordinate mark, and inputting a keyword corresponding to the word vector into a corresponding area of the data information mark distribution diagram;
and establishing an association relation between the data information mark distribution diagram and the content data distribution diagram based on the keywords and the content data referenced by the HTML mark.
8. The method for marking multiple information attributes of a web page according to claim 7, wherein: integrating the data to generate a data set, specifically:
based on the association relation between the data information mark distribution diagram and the content data distribution diagram, determining a plurality of different data integration rules for the data information mark distribution diagram according to a plurality of data integration requirements, and establishing a dynamic data integration instruction based on the different data integration rules;
and respectively carrying out dynamic integration on the content data referenced by the keywords and the HTML marks based on the dynamic data integration instruction to obtain a plurality of groups of integration data, and generating a data set based on the plurality of groups of integration data.
9. The method for marking multiple information attributes of a web page according to claim 1, wherein: the association information matching includes:
constructing a marking model of a content data marking distribution diagram, determining one or more sorting factors associated with a webpage, and determining weight values of the content data marking distribution diagram and information attributes;
analyzing the content data through the marking model to obtain marking labels and information attributes of the content data, and inputting the marking labels and the information attributes into one or more ranking factors associated with the web pages for comparison;
marking associated content data in one or more ranking factors associated with the web page based on the marking model, and corresponding marking labels and information attributes of the content data to the associated content data one by one;
and arranging according to the use intensity and the broadband occupation probability of the associated content data, calculating the equalization parameters of the associated content data, comparing the equalization parameters with preset equalization parameters, and establishing the association relation of the associated content data based on a comparison result.
10. The method for marking multiple information attributes of a web page according to claim 9, wherein: establishing the association relation of the associated content data based on the comparison result, wherein the establishment comprises the following steps:
constructing a calculation model of balance parameters during data matching, and inputting the number of matched associated nodes and the load of the associated nodes into the calculation model for calculation during data matching;
when the balance parameter is determined to be smaller than a preset balance parameter, screening out an associated node with a load larger than a first preset load as a first associated node, and arranging from large to small based on the load of the first associated node to establish a first queuing queue;
when the balance parameter is determined to be smaller than a preset balance parameter, screening out the association node with the load smaller than a second preset load as a second association node, and arranging from small to large based on the load of the second association node to establish a second queuing queue;
analyzing the association relation between the first association node and the second association node to obtain a mark label and an information attribute;
and transmitting the main content to the second association node for association, and associating the mark label on the first association node.
CN202310166545.9A 2023-02-27 2023-02-27 Marking method for webpage multiple information attribute Active CN116304457B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310166545.9A CN116304457B (en) 2023-02-27 2023-02-27 Marking method for webpage multiple information attribute

Publications (2)

Publication Number Publication Date
CN116304457A CN116304457A (en) 2023-06-23
CN116304457B true CN116304457B (en) 2024-03-29

Family

ID=86823323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310166545.9A Active CN116304457B (en) 2023-02-27 2023-02-27 Marking method for webpage multiple information attribute

Country Status (1)

Country Link
CN (1) CN116304457B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102768670A (en) * 2012-05-31 2012-11-07 哈尔滨工程大学 Webpage clustering method based on node property label propagation
CN104484451A (en) * 2014-12-25 2015-04-01 北京国双科技有限公司 Web page information extraction method and web page information extraction device
CN104679804A (en) * 2014-04-30 2015-06-03 宁波优策信息技术有限公司 Web page multiple attribute marking method and implementation thereof
CN107577783A (en) * 2017-09-15 2018-01-12 电子科技大学 The type of webpage automatic identifying method excavated based on Web architectural features
CN109241437A (en) * 2018-09-19 2019-01-18 麒麟合盛网络技术股份有限公司 A kind of generation method, advertisement recognition method and the system of advertisement identification model
CN110427541A (en) * 2019-08-05 2019-11-08 安徽大学 A kind of webpage content extracting method, system, electronic equipment and medium
CN112115269A (en) * 2020-08-07 2020-12-22 国家计算机网络与信息安全管理中心河南分中心 Webpage automatic classification method based on crawler
CN112784135A (en) * 2021-02-26 2021-05-11 张冶青 Webpage information identification system
CN114821610A (en) * 2022-05-16 2022-07-29 三峡高科信息技术有限责任公司 Method for generating webpage code from image based on tree-shaped neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090065568A1 (en) * 2007-09-07 2009-03-12 Elliott Grant Systems and Methods for Associating Production Attributes with Products
US20170212964A1 (en) * 2016-01-27 2017-07-27 Veeva Systems Inc. System and method for dynamic content rendering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Analysis of web crawler technology and strategies; Liu Xiaokui; Network Security Technology & Application; pp. 17-19 *

Also Published As

Publication number Publication date
CN116304457A (en) 2023-06-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240411

Address after: Room 2306, North of 23rd Floor, No. 472 Huanshi East Road, Yuexiu District, Guangzhou City, Guangdong Province, 510000

Patentee after: Yuncai Chain (Guangzhou) Information Technology Co.,Ltd.

Country or region after: China

Address before: Inside Yard 6, Baimashan South Road, Shizhong District, Jinan City, Shandong Province, 250220

Patentee before: Shandong Qianshun Advertising Media Co.,Ltd.

Country or region before: China