CN116304457B

CN116304457B - Marking method for webpage multiple information attribute

Info

Publication number: CN116304457B
Application number: CN202310166545.9A
Authority: CN
Inventors: 吕修政; 刘兆民
Original assignee: Shandong Qianshun Advertising Media Co ltd
Current assignee: Yuncai Chain (Guangzhou) Information Technology Co.,Ltd.
Priority date: 2023-02-27
Filing date: 2023-02-27
Publication date: 2024-03-29
Anticipated expiration: 2043-02-27
Also published as: CN116304457A

Abstract

The invention discloses a method for marking multiple information attributes of a webpage and relates to the technical field of attribute marking. It addresses a problem in the prior art: webpage information attributes are identified through a predefined multiple-information-attribute identification model and marking process, so identification is narrowly constrained, the model cannot be flexibly changed or updated, and adaptability is poor. The method comprises three stages: determining information attributes, integrating information marks, and matching associated information. Attribute codes are extracted from the start tags and converted to numerical values, an attribute threshold is determined for the attribute codes, and the information attribute of each HTML tag is determined and then modified on that basis. Webpage attribute information can thus be marked, extracted and applied conveniently and efficiently, and marking efficiency is greatly improved by using the established association between the content data marking distribution map and the information attributes.

Description

Marking method for webpage multiple information attribute

Technical Field

The invention relates to the technical field of attribute marking, in particular to a marking method of multiple information attributes of a webpage.

Background

Web pages are the basic elements of a website and the platform that carries its applications. A website's functions are planned according to the information (products, services, ideas and culture) that an enterprise wishes to convey to viewers, after which the pages are designed and polished. As part of an enterprise's outward-facing publicity material, well-crafted webpage design is crucial to promoting the enterprise's online brand image.

A web page is a plain text file containing HTML tags, and is composed of tags and attributes. Related patents exist on marking multiple information attributes of web pages. For example, Chinese patent publication CN104679804A discloses a method for marking multiple attributes of a web page and its implementation. By providing an attribute identification module, an attribute configuration module and an attribute calling module, it addresses, from a whole-system perspective, identifying and storing multiple information attributes of a captured web page, marking in multiple modes, and flexibly and repeatedly calling attribute marking results and processes. By defining a multiple-information-attribute identification model and marking process, that invention provides a unified method for marking multiple information attributes of a web page. The method can increase the efficiency and accuracy of the marking process, lays a foundation for conveniently and repeatedly calling marking results and processes during service processing, and increases the efficiency with which the system handles such services.

Although the above patent marks, extracts and applies webpage attribute information conveniently and efficiently, the following problems remain in actual use:

1. In the prior art, webpage information attributes are marked by defining a multiple-information-attribute identification model and marking process, so identification is narrowly constrained, the model cannot be flexibly changed or updated, and adaptability is poor;

2. In the prior art, associated information cannot be matched and queried on the basis of existing information and information attributes, so information cannot be shared, the extensibility of the webpage suffers, and both the display of the webpage and the validity and timeliness of its information are reduced.

Disclosure of Invention

The invention aims to provide a method for marking multiple information attributes of a webpage. The information attribute of each HTML tag is determined and then modified on that basis, so that webpage attribute information is marked, extracted and applied conveniently and efficiently. The information attributes of data in the webpage are identified automatically according to the established association between the content data marking distribution map and the information attributes, which greatly improves marking efficiency and solves the problems described in the Background.

In order to achieve the above purpose, the present invention provides the following technical solutions:

A method for marking multiple information attributes of a webpage comprises the following steps:

determining information attributes: acquiring the HTML tags in a webpage and identifying and extracting them, wherein there is at least one HTML tag, the HTML tags are arranged in pairs, and each pair comprises a start tag and an end tag;

determining the attribute code in the start tag based on the extraction result for the HTML tags, converting the attribute code to numerical values, determining an attribute threshold of the attribute code, and determining the information attribute of the HTML tag;

information mark integration: acquiring the information attribute of the HTML tag, and dividing the content data referenced by the HTML tag according to the information attribute to obtain a content data distribution map;

extracting keywords from the content data distribution map, converting the extracted keywords into word vectors, marking coordinates based on the word vectors to obtain a data information marking distribution map, and integrating the data to generate a data set;

matching associated information: determining, based on the keywords, one or more ranking factors associated with the data set, determining the association between the content data marking distribution map and the information attributes, and obtaining the corresponding content data and mark ranking according to the association and the ranking factors.

Further, the HTML tags are used to reference parts of the document such as text and pictures;

the attributes are options for a tag, are placed within the start tag, and modify the tag's color, alignment, height and width.
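
The extraction step described above (paired HTML marks, i.e. tags, with attributes carried in the start tag) can be illustrated with a minimal sketch using Python's standard `html.parser` module. The class name `TagCollector`, the function `extract_tag_attributes` and the sample markup are illustrative assumptions, not part of the patent; the sketch only shows one plausible way to collect start/end pairs and their attributes.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect paired start/end tags and the attributes of each start tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, attrs) still waiting for an end tag
        self.pairs = []      # completed (tag, attrs) pairs, in closing order

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples from the start tag.
        self.open_tags.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pair the end tag with the nearest unmatched start tag of the same name.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                self.pairs.append(self.open_tags.pop(i))
                break


def extract_tag_attributes(html):
    """Return the (tag, attributes) pairs that have both a start and an end tag."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


# Hypothetical sample page fragment; <img> has no end tag, so it is not paired.
sample = '<p align="center"><font color="red">Hello</font></p><img width="40" height="20">'
pairs = extract_tag_attributes(sample)
```

Tags without a matching end tag stay in `open_tags` and are excluded from `pairs`, which mirrors the requirement that the marks be arranged in pairs.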

Further, converting the attribute code to numerical values and determining the attribute threshold of the attribute code specifically comprises:

parsing the attribute code, determining target code data, and dividing the target code data into a plurality of code blocks according to code category;

acquiring the modification type of each code block and converting it to a numerical value to obtain a type value for each code block, and determining the extraction mode of the corresponding object data according to the type value;

acquiring, based on the extraction mode, the attribute features of the object data corresponding to each code block, and generating the information attribute of the HTML tag based on the modification type of the attribute features in the HTML tag.
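
As a rough illustration of the steps above, the sketch below splits a start tag's attributes into code blocks by category, assigns each block a numeric type value, and picks an extraction mode from that value. The category table, the numeric values, the cut-offs and the mode names are all assumptions made for the example; the patent does not specify them.

```python
from collections import defaultdict

# Assumed mapping from attribute name to modification type (code category).
MODIFICATION_TYPE = {
    "color": "color", "bgcolor": "color",
    "align": "alignment", "valign": "alignment",
    "width": "size", "height": "size",
}

# Assumed numeric type value for each modification type.
TYPE_VALUE = {"color": 1, "alignment": 2, "size": 3}


def extraction_mode(type_value):
    """Pick an extraction mode from the type value (assumed cut-offs)."""
    if type_value <= 1:
        return "first-depth"
    if type_value <= 2:
        return "second-depth"
    return "third-depth"


def split_into_code_blocks(attrs):
    """Group attribute name/value pairs into code blocks by modification type."""
    blocks = defaultdict(dict)
    for name, value in attrs.items():
        mod_type = MODIFICATION_TYPE.get(name)
        if mod_type is not None:  # attributes outside the table are ignored
            blocks[mod_type][name] = value
    return dict(blocks)


def information_attribute(attrs):
    """Build a per-block summary: type value, extraction mode, extracted features."""
    result = {}
    for mod_type, block in split_into_code_blocks(attrs).items():
        value = TYPE_VALUE[mod_type]
        result[mod_type] = {
            "type_value": value,
            "mode": extraction_mode(value),
            "features": block,
        }
    return result


info = information_attribute({"color": "red", "width": "40", "height": "20", "id": "x"})
```

Keeping the tables as plain dictionaries means categories and type values can be changed without touching the logic, which is in the spirit of the flexibility the method aims for.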

Further, acquiring the attribute features of the object data corresponding to each code block specifically comprises:

storing the attribute features of the object data on storage nodes in a cloud network, and ordering the storage nodes from high to low by their use frequency when storing data to establish a queue;

obtaining a plurality of matching groups from the queue, determining the use intensity of each code block, and having the matching groups match the attribute features of the object data according to the use intensity of each code block.

Further, storing the attribute features of the object data on storage nodes in a cloud network, and ordering the storage nodes from high to low by their use frequency when storing data to establish a queue, comprises:

extracting the attribute features of the object data, and uploading them to storage nodes in the cloud network;

extracting the use frequency of each storage node when storing data, and judging whether any use frequencies share the same value;

when no use frequencies share the same value, ordering the storage nodes from high to low by use frequency and establishing the queue;

when some use frequencies share the same value, counting the storage nodes that share that value;

when the number of such storage nodes does not exceed a preset number threshold, merging them into one storage node set, treating the set as a single storage node, ordering from high to low by use frequency, and establishing the queue;

when the number of such storage nodes exceeds the preset number threshold, grouping them to obtain a plurality of storage node sets, treating the sets as parallel storage nodes, ordering from high to low by use frequency, and establishing the queue;

the number threshold is obtained through the following formula:

wherein M denotes the number threshold, rounded up; N denotes the total number of storage nodes; C denotes the total number of storage triggers of the storage nodes per unit time; K denotes the number of unit times; C_i denotes the total number of storage triggers of the storage nodes in the i-th unit time; N_i denotes the number of storage nodes triggered in the i-th unit time; M_0 denotes the reference number value, and M_0 takes a value in the range [0.15N, 0.24N].
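
The queue-building rules above can be sketched as follows. The number threshold is passed in as a plain argument because its formula is not reproduced in this text; the function name `build_queue`, the data layout, and the simple chunking used when a tie exceeds the threshold are assumptions made for illustration.

```python
from itertools import groupby


def build_queue(frequencies, threshold):
    """Order storage nodes from high to low use frequency, merging ties.

    frequencies: dict mapping node id -> use frequency.
    threshold:   number threshold M (assumed to be computed elsewhere, >= 1).
    Returns a list of (frequency, node_sets) entries; node_sets holds one set
    for an untied node or a small tie, and several parallel sets for a tie
    larger than the threshold.
    """
    # Sort by descending frequency; the node id breaks ties deterministically.
    ordered = sorted(frequencies.items(), key=lambda kv: (-kv[1], kv[0]))
    queue = []
    for freq, group in groupby(ordered, key=lambda kv: kv[1]):
        nodes = [node for node, _ in group]
        if len(nodes) <= threshold:
            # No tie, or a tie small enough to act as one storage node set.
            queue.append((freq, [nodes]))
        else:
            # Tie larger than the threshold: split into parallel sets
            # (simple chunking, an assumed grouping for this sketch).
            chunks = [nodes[i:i + threshold] for i in range(0, len(nodes), threshold)]
            queue.append((freq, chunks))
    return queue


# Hypothetical frequencies: b and c tie at 5, d through g tie at 2.
queue = build_queue({"a": 9, "b": 5, "c": 5, "d": 2, "e": 2, "f": 2, "g": 2}, threshold=2)
```

Each queue entry keeps its frequency alongside the node sets, so a consumer can tell a merged set (one list) from parallel sets (several lists) at the same position.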

Further, when the number of storage nodes exceeds the preset number threshold, grouping the storage nodes to obtain a plurality of storage node sets comprises:

when the number of storage nodes exceeds the preset number threshold but does not exceed the upper limit of the number condition, forming two storage node sets, either of N/2 nodes each or of 1+N/2 and N/2 nodes, treating the two sets as parallel storage nodes, ordering them from high to low by use frequency, and establishing the queue;

when the number of storage nodes exceeds both the preset number threshold and the upper limit of the number condition, grouping the nodes into sets whose size equals the number threshold, and, if storage nodes remain after grouping that are too few to fill a set of that size, placing the remaining storage nodes in a separate set;

wherein the upper limit of the number condition is obtained by the following formula:

wherein M_max denotes the value of the upper limit of the number condition, rounded up; N_max denotes the maximum number of nodes with the same starting frequency in a unit time.
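
The two grouping cases above can be sketched as a single function. Both the number threshold and the upper limit are taken as inputs, since their formulas are not reproduced in this text. Reading "N/2 or 1+N/2 and N/2" as an even split with the extra node going to the first set when the count is odd is an interpretation, and the name `group_tied_nodes` is hypothetical.

```python
def group_tied_nodes(nodes, threshold, upper_limit):
    """Group storage nodes that share one use frequency.

    nodes:       list of node ids with the same use frequency.
    threshold:   number threshold M (assumed given, >= 1).
    upper_limit: upper limit M_max of the number condition (assumed given).
    """
    count = len(nodes)
    if count <= threshold:
        # Small tie: a single storage node set.
        return [list(nodes)]
    if count <= upper_limit:
        # Two sets: equal halves, or one extra node in the first set if odd.
        half = count - count // 2
        return [list(nodes[:half]), list(nodes[half:])]
    # Above the upper limit: sets of size `threshold`; any leftover nodes
    # that cannot fill a set form their own separate set.
    return [list(nodes[i:i + threshold]) for i in range(0, count, threshold)]


tied = ["a", "b", "c", "d", "e", "f", "g"]
two_sets = group_tied_nodes(tied, threshold=3, upper_limit=8)
many_sets = group_tied_nodes(tied, threshold=3, upper_limit=5)
```

With seven tied nodes, the first call falls in the two-set case and the second in the chunked case, where the final set holds the single leftover node.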

Further, obtaining the data information marking distribution map specifically comprises:

performing word segmentation on the content data referenced by the HTML tags to obtain a plurality of words, determining the features of each word and the similarity between words, and de-duplicating similar words to obtain target words;

cleaning the target words according to part-of-speech statistical features to obtain keywords, converting the keywords into word vectors, calculating the distance between each word vector and a standard word vector, and marking coordinates according to the distance;

drawing the data information marking distribution map from the coordinate marks, and entering the keyword corresponding to each word vector into the corresponding region of the map;

establishing an association between the data information marking distribution map and the content data distribution map based on the keywords and the content data referenced by the HTML tags.
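
The keyword pipeline above (segment, de-duplicate similar words, clean, vectorise, measure distance to a standard vector) can be sketched with the Python standard library. Several stand-ins are used and are assumptions, not the patent's method: a stop-word list replaces part-of-speech cleaning, `difflib.SequenceMatcher` supplies the word similarity, and `toy_vector` is a two-dimensional placeholder for a real word embedding.

```python
import math
import re
from difflib import SequenceMatcher

# Stand-in for cleaning by part-of-speech statistics (assumed stop-word list).
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}


def toy_vector(word):
    """Placeholder embedding: (word length, vowel count). Not a real word vector."""
    vowels = sum(ch in "aeiou" for ch in word)
    return (float(len(word)), float(vowels))


def dedupe_similar(words, min_ratio=0.85):
    """Keep a word only if it is not too similar to any word already kept."""
    kept = []
    for word in words:
        if all(SequenceMatcher(None, word, k).ratio() < min_ratio for k in kept):
            kept.append(word)
    return kept


def mark_coordinates(text, standard_word):
    """Return (keyword, distance to the standard word vector), nearest first."""
    words = re.findall(r"[a-z]+", text.lower())   # naive word segmentation
    targets = dedupe_similar(words)               # de-duplicate similar words
    keywords = [w for w in targets if w not in STOP_WORDS]
    origin = toy_vector(standard_word)
    marks = [(w, math.dist(toy_vector(w), origin)) for w in keywords]
    return sorted(marks, key=lambda m: (m[1], m[0]))


marks = mark_coordinates("The price of the product and the prices of products", "price")
```

In this sample, "prices" and "products" are dropped as near-duplicates of "price" and "product", and the stop words are removed before distances are computed. A real system would swap `toy_vector` for trained embeddings and a proper segmenter for the regular expression.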

Further, integrating the data to generate a data set specifically comprises:

based on the association between the data information marking distribution map and the content data distribution map, determining a plurality of different data integration rules for the data information marking distribution map according to a plurality of data integration requirements, and establishing dynamic data integration instructions based on the different rules;

dynamically integrating the keywords and the content data referenced by the HTML tags according to each dynamic data integration instruction to obtain a plurality of groups of integrated data, and generating a data set from the groups of integrated data.
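
One way to picture "several integration rules producing several groups of integrated data" is to model each rule as a grouping function applied to the same records. The record fields, the rule names and the function `integrate` below are hypothetical and serve only to illustrate the idea.

```python
def integrate(records, rules):
    """Apply every integration rule to the records and collect the results.

    records: list of dicts with 'keyword', 'attribute' and 'content' fields
             (an assumed record layout).
    rules:   dict mapping a rule name to a function that returns the group
             key for a record.
    Returns a data set: rule name -> {group key -> list of content items}.
    """
    dataset = {}
    for rule_name, key_fn in rules.items():
        groups = {}
        for record in records:
            groups.setdefault(key_fn(record), []).append(record["content"])
        dataset[rule_name] = groups
    return dataset


# Hypothetical records linking keywords, information attributes and content.
records = [
    {"keyword": "price", "attribute": "color", "content": "Price: 10"},
    {"keyword": "price", "attribute": "size", "content": "Price table"},
    {"keyword": "logo", "attribute": "size", "content": "logo.png"},
]

# Two assumed integration rules: group by keyword, and group by attribute.
rules = {
    "by_keyword": lambda r: r["keyword"],
    "by_attribute": lambda r: r["attribute"],
}

dataset = integrate(records, rules)
```

Because rules are passed in as data, new integration requirements can be added at run time without changing `integrate`, which loosely corresponds to the "dynamic" instructions described above.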

Further, matching the associated information comprises:

constructing a marking model of the content data marking distribution map, determining one or more ranking factors associated with the webpage, and determining weight values for the content data marking distribution map and the information attributes;

analyzing the content data with the marking model to obtain the marking labels and information attributes of the content data, and comparing them against the one or more ranking factors associated with the webpage;

marking, based on the marking model, the associated content data in the one or more ranking factors associated with the webpage, and mapping the marking labels and information attributes of the content data one-to-one onto the associated content data;

ordering the associated content data by use intensity and bandwidth occupancy probability, calculating the balance parameter of the associated content data, comparing it with a preset balance parameter, and establishing the association of the associated content data based on the comparison result.

Further, establishing the association of the associated content data based on the comparison result comprises:

constructing a calculation model for the balance parameter used during data matching, and, during data matching, inputting the number of matched associated nodes and the load of the associated nodes into the calculation model;

when the balance parameter is determined to be smaller than the preset balance parameter, selecting the associated nodes whose load is greater than a first preset load as first associated nodes, and ordering them from largest to smallest load to establish a first queue;

when the balance parameter is determined to be smaller than the preset balance parameter, selecting the associated nodes whose load is smaller than a second preset load as second associated nodes, and ordering them from smallest to largest load to establish a second queue;

analyzing the association between the first associated nodes and the second associated nodes to obtain the marking labels and information attributes;

transmitting the main content to the second associated nodes for association, and associating the marking labels on the first associated nodes.
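
The load-balancing step above can be sketched as follows. The patent does not give the balance-parameter formula, so the sketch assumes a simple one (mean load divided by peak load, where 1.0 means perfectly even); the function names and the sample loads are likewise assumptions.

```python
from statistics import mean


def balance_parameter(loads):
    """Assumed balance measure: mean load / peak load (1.0 = perfectly even)."""
    peak = max(loads.values())
    return mean(loads.values()) / peak if peak else 1.0


def build_rebalance_queues(loads, preset_balance, first_load, second_load):
    """Build the two queues used when the nodes are out of balance.

    loads:          dict mapping associated node id -> current load.
    preset_balance: preset balance parameter to compare against.
    first_load:     nodes loaded above this go to the first queue.
    second_load:    nodes loaded below this go to the second queue.
    """
    if balance_parameter(loads) >= preset_balance:
        return [], []  # balanced enough: nothing to move
    # First queue: heavily loaded nodes, largest load first.
    first = sorted((n for n, load in loads.items() if load > first_load),
                   key=lambda n: (-loads[n], n))
    # Second queue: lightly loaded nodes, smallest load first.
    second = sorted((n for n, load in loads.items() if load < second_load),
                    key=lambda n: (loads[n], n))
    return first, second


# Hypothetical loads for four associated nodes.
loads = {"n1": 90, "n2": 10, "n3": 75, "n4": 25}
first_queue, second_queue = build_rebalance_queues(loads, 0.8, 70, 30)
```

Under these assumptions the main content would then be sent to nodes drawn from `second_queue`, while nodes in `first_queue` keep only the lighter marking labels.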

Compared with the prior art, the invention has the following beneficial effects:

1. Attribute codes are extracted from the start tags and converted to numerical values, the attribute threshold of the attribute codes is determined, and the information attribute of each HTML tag is determined and then modified on that basis. Webpage attribute information can therefore be marked, extracted and applied conveniently and efficiently, information attributes can be marked flexibly, and adaptability is improved. A content data distribution map and a data information marking distribution map are each drawn and the data are integrated into a data set, so the system can obtain the useful information in the maps more easily. The information attributes of data in the webpage are identified automatically according to the established association between the content data marking distribution map and the information attributes, which greatly improves marking efficiency.

2. The extraction mode of the corresponding object data is determined from the type value of each code block and the object data is stored centrally, which facilitates data management and visual development as well as orderly feature extraction, improves extraction efficiency, and makes it easier to determine the modification type in the HTML tag accurately. Matching efficiency is improved based on the use frequency of each code block, so the information attribute of the HTML tag is generated quickly, which improves user experience, narrows the extraction range, reduces the amount of computation and improves computational efficiency.

3. A marking model of the content data marking distribution map is constructed and one or more ranking factors associated with the webpage are determined, so associated information can be matched and queried on the basis of existing information and information attributes. This enables information sharing, improves the extensibility of the webpage, and safeguards the display of the webpage and the validity and timeliness of its information. The associated nodes are also balanced so that each is used reasonably, which reduces the load on each associated node and improves the overall operating efficiency of the webpage.

Drawings

FIG. 1 is a flowchart of a method for marking multiple information attributes of a web page according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art from these embodiments without inventive effort fall within the scope of protection of the present invention.

To solve the technical problems of low efficiency and inaccurate classification when information is classified after existing webpage attribute marking, particularly when a webpage is classified into several categories, and with reference to FIG. 1, this embodiment provides the following technical solution:

determining information attributes: acquiring the HTML tags in a webpage and identifying and extracting them, wherein there is at least one HTML tag, the HTML tags are arranged in pairs, and each pair comprises a start tag and an end tag; the HTML tags are used to reference parts of the document such as text and pictures; the attributes are options for a tag, are placed within the start tag, and modify the tag's color, alignment, height and width; determining the attribute code in the start tag based on the extraction result for the HTML tags, converting the attribute code to numerical values, determining an attribute threshold of the attribute code, and determining the information attribute of the HTML tag;

information mark integration: acquiring the information attribute of the HTML tag, and dividing the content data referenced by the HTML tag according to the information attribute to obtain a content data distribution map; extracting keywords from the content data distribution map, converting the extracted keywords into word vectors, marking coordinates based on the word vectors to obtain a data information marking distribution map, and integrating the data to generate a data set;

Specifically, attribute codes are extracted from the start tags and converted to numerical values, the attribute threshold of the attribute codes is determined, and the information attribute of each HTML tag is determined and then modified on that basis, so that webpage attribute information is marked, extracted and applied conveniently and efficiently and adaptability is improved. A content data distribution map and a data information marking distribution map are each drawn and the data are integrated into a data set. Because the maps share a uniform format, associations can be established more efficiently and the system can obtain the useful information in the maps more easily. The information attributes of data in the webpage are identified automatically according to the established association between the content data marking distribution map and the information attributes, which greatly improves marking efficiency.

In one embodiment of the invention, the attribute features of the object data are stored on storage nodes in a cloud network, and the storage nodes are ordered from high to low by their use frequency when storing data to establish a queue, comprising

the number threshold is obtained through the following formula:

Specifically, when the number of storage nodes exceeds the preset number threshold, grouping the storage nodes to obtain a plurality of storage node sets comprises:

wherein M_max denotes the value of the upper limit of the number condition, rounded up; N_max denotes the maximum number of nodes with the same starting frequency in a unit time.

This approach improves the accuracy of storage node ordering and prevents several storage nodes with the same use frequency from causing data confusion during ordering. The number threshold and the grouping scheme set in this way also improve the match between the number threshold and the actual number of storage nodes, which in turn improves the orderliness and accuracy of the data when storage nodes are subsequently arranged and ordered.

To solve the technical problems in the prior art that identification is narrowly constrained, the model cannot be flexibly changed or updated, and adaptability is poor, and with reference to FIG. 1, this embodiment provides the following technical solution:

converting the attribute code to numerical values and determining the attribute threshold of the attribute code specifically comprises: parsing the attribute code, determining target code data, and dividing the target code data into a plurality of code blocks according to code category; acquiring the modification type of each code block and converting it to a numerical value to obtain a type value for each code block, and determining the extraction mode of the corresponding object data according to the type value; acquiring, based on the extraction mode, the attribute features of the object data corresponding to each code block, and generating the information attribute of the HTML tag based on the modification type of the attribute features in the HTML tag;

acquiring the attribute features of the object data corresponding to each code block specifically comprises: storing the attribute features of the object data on storage nodes in a cloud network, and ordering the storage nodes from high to low by their use frequency when storing data to establish a queue; obtaining a plurality of matching groups from the queue, determining the use intensity of each code block, and having the matching groups match the attribute features of the object data according to the use intensity of each code block.

Specifically, the extraction mode of the corresponding object data is determined from the type value of each code block, and the object data is stored centrally. Code blocks of different types lead to different extraction modes for the corresponding object data; the extraction modes comprise first-depth, second-depth and third-depth extraction, with the first, second and third depths increasing in that order. This facilitates data management and visual development as well as orderly feature extraction, improves extraction efficiency, and makes it easier to determine the modification type in the HTML tag accurately. Matching efficiency is improved based on the use frequency of each code block, so the information attribute of the HTML tag is generated quickly, which improves user experience, narrows the extraction range, reduces the amount of computation and improves computational efficiency.

To solve the technical problems in the prior art that the algorithm is complex and time-consuming and consumes substantial device performance and resources, and with reference to FIG. 1, this embodiment provides the following technical solution:

obtaining the data information marking distribution map specifically comprises: performing word segmentation on the content data referenced by the HTML tags to obtain a plurality of words, determining the features of each word and the similarity between words, and de-duplicating similar words to obtain target words; cleaning the target words according to part-of-speech statistical features to obtain keywords, converting the keywords into word vectors, calculating the distance between each word vector and a standard word vector, and marking coordinates according to the distance; drawing the data information marking distribution map from the coordinate marks, and entering the keyword corresponding to each word vector into the corresponding region of the map; establishing an association between the data information marking distribution map and the content data distribution map based on the keywords and the content data referenced by the HTML tags;

integrating the data to generate a data set specifically comprises: based on the association between the data information marking distribution map and the content data distribution map, determining a plurality of different data integration rules for the data information marking distribution map according to a plurality of data integration requirements, and establishing dynamic data integration instructions based on the different rules; dynamically integrating the keywords and the content data referenced by the HTML tags according to each dynamic data integration instruction to obtain a plurality of groups of integrated data, and generating a data set from the groups of integrated data.

Specifically, the data information is analyzed hierarchically, and the words are de-duplicated to obtain target words and then cleaned, which narrows the scope of data processing and improves its efficiency. A data information marking distribution map is drawn and its association with the content data distribution map is established; because the dependencies between words in the text are taken into account, keyword extraction becomes more accurate and objective. Using the established association, several different integrations are performed to obtain a plurality of groups of integrated data, from which the data set is finally generated. This ensures the diversity of the integrated data in the data set and therefore the efficiency and accuracy of later data analysis that draws on the data set.

In order to solve the technical problems that the information cannot be shared, the expansibility of the webpage is affected, and the display of the webpage and the effectiveness and timeliness of the information are reduced due to the fact that the related information cannot be matched and queried according to the existing information and information attributes, referring to fig. 1, the embodiment provides the following technical scheme:

the association information matching includes: constructing a marking model of a content data marking distribution diagram, determining one or more sorting factors associated with a webpage, and determining weight values of the content data marking distribution diagram and information attributes; analyzing the content data through the marking model to obtain marking labels and information attributes of the content data, and inputting the marking labels and the information attributes into one or more ranking factors associated with the web pages for comparison; marking associated content data in one or more ranking factors associated with the web page based on the marking model, and corresponding marking labels and information attributes of the content data to the associated content data one by one; according to the use intensity and broadband occupation probability of the associated content data, the equalization parameters of the associated content data are calculated, the equalization parameters are compared with preset equalization parameters, and the associated content data are established according to the comparison result;

establishing the association relation of the associated content data based on the comparison result, wherein the establishment comprises the following steps: constructing a calculation model of balance parameters during data matching, and inputting the number of matched associated nodes and the load of the associated nodes into the calculation model for calculation during data matching; when the balance parameter is determined to be smaller than a preset balance parameter, screening out an associated node with a load larger than a first preset load as a first associated node, and arranging from large to small based on the load of the first associated node to establish a first queuing queue; when the balance parameter is determined to be smaller than a preset balance parameter, screening out the association node with the load smaller than a second preset load as a second association node, and arranging from small to large based on the load of the second association node to establish a second queuing queue; the method comprises the steps that the association relation between a first association node and a second association node is analyzed according to the association relation, and a mark label and an information attribute are obtained; and transmitting the main content to the second association node for association, and associating the mark label on the first association node.

Specifically, by constructing the marking model of the content data marking distribution diagram and determining one or more ranking factors associated with the web page, associated information can be matched and queried according to the existing information and its information attributes. This enables information sharing, improves the extensibility of the web page, and ensures the display of the web page and the validity and timeliness of its information. By constructing the calculation model of the equalization parameter for data matching, the equalization parameter can be calculated accurately, which improves the accuracy of comparing the equalization parameter with the preset equalization parameter. Balanced adjustment and reasonable use of each associated node are thereby achieved, the load on each associated node is reduced, and the overall operating efficiency of the web page is improved.
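As an illustration only, the screening and queuing step for the associated nodes might be sketched as follows. The names `AssociatedNode` and `build_queues`, and all threshold values, are hypothetical and do not appear in the original text. The calculation model for the equalization parameter is not specified here, so the sketch takes the already computed parameter as an input rather than guessing a formula.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AssociatedNode:
    """Hypothetical representation of an associated node and its load."""
    name: str
    load: float


def build_queues(
    nodes: List[AssociatedNode],
    equalization: float,
    preset_equalization: float,
    first_preset_load: float,
    second_preset_load: float,
) -> Tuple[List[AssociatedNode], List[AssociatedNode]]:
    """Build the first and second queuing queues.

    Assumption: the queues are only built when the equalization parameter
    is smaller than the preset one; otherwise both queues are empty.
    """
    if equalization >= preset_equalization:
        return [], []

    # First queue: nodes with load above the first preset load, largest first.
    first_queue = sorted(
        (n for n in nodes if n.load > first_preset_load),
        key=lambda n: n.load,
        reverse=True,
    )
    # Second queue: nodes with load below the second preset load, smallest first.
    second_queue = sorted(
        (n for n in nodes if n.load < second_preset_load),
        key=lambda n: n.load,
    )
    return first_queue, second_queue


if __name__ == "__main__":
    demo_nodes = [
        AssociatedNode("a", 0.9),
        AssociatedNode("b", 0.2),
        AssociatedNode("c", 0.7),
        AssociatedNode("d", 0.1),
        AssociatedNode("e", 0.5),
    ]
    first, second = build_queues(demo_nodes, 0.3, 0.5, 0.6, 0.3)
    print([n.name for n in first], [n.name for n in second])
```

In this sketch, the heavily loaded nodes at the head of the first queue would carry only the marking label, while the lightly loaded nodes at the head of the second queue would receive the main content. How nodes from the two queues are paired is left open, since the text does not fix a pairing rule.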

The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A marking method of webpage multiple information attributes is characterized in that: the method comprises the following steps:

2. The method for marking multiple information attributes of a web page according to claim 1, wherein: the HTML mark is used for referring to document parts of characters and pictures;

the information attribute serves as an option of the mark and is placed in the start mark, and modifies the mark in terms of color, alignment, height and width.

3. The method for marking multiple information attributes of a web page according to claim 2, wherein: performing the numerical processing of the target code, and determining the attribute threshold value of the target code, wherein the attribute threshold value is specifically:

4. The method for marking multiple information attributes of a web page according to claim 3, wherein: the attribute characteristics of the object data corresponding to each code block are obtained, specifically:

5. The method for marking multiple information attributes of a web page according to claim 4, wherein: storing the attribute characteristics of the object data on a storage node in a cloud network, and based on the use frequency of the storage node when storing the data, arranging from high to low, and establishing a queuing queue, wherein the queuing queue comprises

the number threshold is obtained through the following formula:

6. The method for marking multiple information attributes of a web page according to claim 5, wherein: when the number of the storage nodes exceeds a preset number threshold, grouping the storage nodes to obtain a plurality of storage node sets, wherein the method comprises the following steps:

7. The method for marking multiple information attributes of a web page according to claim 5, wherein: the data information mark distribution diagram is obtained, specifically:

8. The method for marking multiple information attributes of a web page according to claim 7, wherein: integrating the data to generate a data set, specifically:

9. The method for marking multiple information attributes of a web page according to claim 1, wherein: the association information matching includes:

10. The method for marking multiple information attributes of a web page according to claim 9, wherein: establishing the association relation of the associated content data based on the comparison result, wherein the establishment comprises the following steps: