CN106611008B - Internet content label management method and device - Google Patents

Publication number
CN106611008B
Authority
CN
China
Prior art keywords
content
tree
tag
label
tags
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510703822.0A
Other languages
Chinese (zh)
Other versions
CN106611008A (en)
Inventor
赵耀红
高丹
熊龙
邓超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd
Priority to CN201510703822.0A
Publication of CN106611008A
Application granted
Publication of CN106611008B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

The invention discloses a method for managing Internet content tags. The method comprises creating a content tag library, and further comprises: acquiring first content tag trees corresponding to different websites, and determining, for each first content tag tree, the category tag of the category to which it belongs; and, based on the category tag, matching the content tags of the first content tag tree against the content tags in the content tag library according to a preset matching rule, and updating the content tag library according to the matching result. The invention also discloses a device for managing Internet content tags.

Description

Internet content label management method and device
Technical Field
The invention relates to the technical field of internet, in particular to a method and a device for managing internet content tags.
Background
With the rapid development of the Internet, more and more content providers are emerging in every field. Each content provider has its own content tag system; the tags differ from provider to provider, although some are identical or highly similar. Building user portraits from the Internet content that users access and from their behavior, and pushing personalized, accurately targeted content to clients, urgently requires a unified and complete Internet content tag system, which would make personalized content pushing based on user access or user behavior more convenient and flexible. At present, no solution exists for merging different content tag systems.
Disclosure of Invention
In view of this, embodiments of the present invention are expected to provide a method and an apparatus for managing Internet content tags, which can form a unified Internet content tag system and provide convenience and flexibility for personalized content push based on user access or user behavior.
In order to achieve the above purpose, the technical solution of the embodiment of the present invention is realized as follows:
the embodiment of the invention provides a method for managing an internet content label, which comprises the following steps: creating a library of content tags, the method further comprising:
acquiring first content tag trees corresponding to different websites, and respectively determining category tags of categories to which the first content tag trees belong;
and matching the content label of the first content label tree with the content label in the content label library according to a preset matching rule based on the category label, and updating the content label library according to a matching result.
In the foregoing scheme, the obtaining the first content tag trees corresponding to different websites includes:
acquiring website domain names corresponding to different websites and at least one Uniform Resource Locator (URL) under the website domain names, and determining first content tag trees corresponding to the different websites by using content tag rules of the websites based on the website domain names and the URLs.
In the foregoing solution, the determining the category label of the category to which each first content label tree belongs includes:
and respectively reading the website domain names of the websites corresponding to the first content label trees, and determining the category labels of the categories to which the first content label trees belong according to the website domain names and a preset website domain name classification library.
In the foregoing solution, the matching, based on the category label, the content label of the first content label tree and the content label in the content label library according to a preset matching rule includes:
and acquiring a second content label tree with the root node content labels in the content label library identical to the category labels, and matching the content labels of the first content label tree with the content labels of each level in the second content label tree step by step according to the sequence from left to right or from top to bottom by combining semantic analysis.
In the foregoing solution, the updating the content tag library according to the matching result includes:
if the content label identical or similar to the content label does not exist in the second content label tree, adding the content label on the corresponding level of the second content label tree;
if the content tag similar to the content tag exists in the second content tag tree, updating the name of the content tag similar to the content tag; the content tags similar to the content tags are content tags with different names and the same corresponding parent tags and child tags.
In the foregoing solution, after the content tag library is updated according to the matching result, the method further includes:
starting from a content label in the second content label tree which is the same as a content label of a root node of the first content label tree, matching the content label in the second content label tree with the content label of the corresponding level in the first content label tree according to the sequence from left to right or from top to bottom, and deleting the content label in the second content label tree if the content label in the second content label tree does not exist in the content label of the corresponding level in the first content label tree.
The embodiment of the invention also provides a device for managing Internet content tags, which comprises a creating module, an obtaining module and an updating module, wherein:
the creating module is used for creating a content label library;
the acquisition module is used for acquiring first content tag trees corresponding to different websites and respectively determining category tags of categories to which the first content tag trees belong;
and the updating module is used for matching the content tags of the first content tag tree with the content tags in the content tag library according to a preset matching rule based on the category tags and updating the content tag library according to a matching result.
In the above scheme, the obtaining module is specifically configured to obtain website domain names corresponding to different websites and at least one uniform resource locator URL under the website domain names, and determine, based on the website domain names and the URLs, first content tag trees corresponding to the different websites by using content tag rules of the websites.
In the above scheme, the obtaining module is specifically configured to read the website domain name of the website corresponding to each first content tag tree, and determine the category tag of the category to which each first content tag tree belongs according to the website domain name and a preset website domain name classification library.
In the foregoing scheme, the updating module is specifically configured to obtain a second content tag tree in which content tags of root nodes in the content tag library are the same as the category tags, and match content tags of the first content tag tree with content tags of each level in the second content tag tree step by step according to a left-to-right or top-to-bottom order in combination with semantic analysis.
In the foregoing solution, the updating module is specifically configured to determine that no content tag that is the same as or similar to the content tag exists in the second content tag tree, and add the content tag to a corresponding level of the second content tag tree;
if the content tag similar to the content tag exists in the second content tag tree, updating the name of the content tag similar to the content tag; the content tags similar to the content tags are content tags with different names and the same corresponding parent tags and child tags.
In the foregoing scheme, the updating module is further configured to match, starting from a content tag in the second content tag tree that is the same as a content tag of a root node of the first content tag tree, the content tag in the second content tag tree with the content tag of the corresponding level in the first content tag tree according to a sequence from left to right or from top to bottom, and delete the content tag in the second content tag tree if the content tag in the second content tag tree does not exist in the content tag of the corresponding level in the first content tag tree.
In the method and the device for managing Internet content tags provided by the embodiment of the invention, a content tag library is created; first content tag trees corresponding to different websites are acquired, and the category tag of the category to which each first content tag tree belongs is determined; based on the category tag, the content tags of the first content tag tree are matched against the content tags in the content tag library according to a preset matching rule, and the content tag library is updated according to the matching result. In this way, the different content tag systems used by different Internet content providers can be merged into a unified Internet content tag system. This facilitates later analysis of users' access history, the construction of user portraits, and personalized recommendation and real-time marketing services based on user preference tags; it improves the experience of users accessing Internet content, and provides convenience and flexibility for personalized content recommendation based on user access or user behavior.
Drawings
Fig. 1 is a schematic flow chart of a method for managing internet content tags according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a first content tag tree according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a second content tag tree in the content tag library according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a second content tag tree according to a second embodiment of the present invention;
FIG. 5 is a flowchart illustrating a second Internet content tag management method according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a management apparatus for internet content tags according to an embodiment of the present invention.
Detailed Description
In an embodiment of the present invention, a content tag library is created, and the method further includes: acquiring first content tag trees corresponding to different websites, and respectively determining category tags of categories to which the first content tag trees belong; and matching the content label of the first content label tree with the content label in the content label library according to a preset matching rule based on the category label, and updating the content label library according to a matching result.
Example one
Fig. 1 is a schematic flow chart illustrating a method for managing internet content tags according to an embodiment of the present invention; as shown in fig. 1, the method for managing internet content tags according to the embodiment of the present invention includes:
step 100: creating a content tag library;
here, it should be noted that this step is executed only the first time the method for managing Internet content tags is run; on subsequent runs the existing content tag library is used directly.
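The create-once behavior of the library can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `TagNode`, `ContentTagLibrary` and `get_or_create_tree` are hypothetical, and the library is modeled as an in-memory mapping from a root (category) tag to its tag tree.

```python
from dataclasses import dataclass, field


@dataclass
class TagNode:
    """One content tag; `children` holds the child tags one level down."""
    name: str
    children: list["TagNode"] = field(default_factory=list)


class ContentTagLibrary:
    """Hypothetical in-memory library: one tag tree per root (category) tag."""

    def __init__(self):
        self.trees = {}  # root tag name -> TagNode

    def get_or_create_tree(self, category):
        # Reuse the tree whose root equals the category tag, or start a new one.
        if category not in self.trees:
            self.trees[category] = TagNode(category)
        return self.trees[category]


library = ContentTagLibrary()                    # created once, on the first run
book_tree = library.get_or_create_tree("book")   # later runs reuse the same tree
```

A persistent deployment would presumably store the trees in a database rather than in memory; the mapping above only shows the shape of the data.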
Step 101: acquiring first content tag trees corresponding to different websites, and respectively determining category tags of categories to which the first content tag trees belong;
here, the acquisition may be periodic acquisition, and the period may be set according to actual needs, for example, the period is two weeks.
In an embodiment, the obtaining the first content tag tree corresponding to the different websites includes:
acquiring website domain names (HOST) corresponding to different websites and at least one Uniform Resource Locator (URL) under the website domain names, and determining first content label trees corresponding to the different websites by using content label rules of the websites based on the website domain names and the URLs;
here, a website domain name (HOST) corresponding to different websites and at least one URL under the website domain name may be acquired through a tag data interface existing in an internet crawler tool or other external systems;
each website of an internet content provider has its own content tag system, namely a content tag rule, and the website as a whole has a tree structure; for example, Baidu includes Baidu News (which in turn includes science and technology, entertainment, society, military and the like), Baidu Video, Baidu Map and the like; a first content tag tree corresponding to the website is determined by using the content tag rule of the website, according to the website domain name of the website and at least one URL under that domain name; the first content tag tree as a whole is a tree structure in which all content tags are arranged hierarchically and content tags of adjacent levels have a parent-child relationship: of two such tags, the one closer to the root node is called the parent tag and the one farther from the root node is called the child tag. Fig. 2 is a schematic diagram of the first content tag tree according to the embodiment of the present invention.
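The text does not specify what a website's content tag rule looks like. As a minimal sketch, assume a hypothetical rule in which each URL path segment is a tag nested under the segment before it; `build_first_tag_tree`, `TagNode` and the example domain are illustrative names, not taken from the patent.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class TagNode:
    name: str
    children: list["TagNode"] = field(default_factory=list)

    def child(self, name):
        """Return the child tag called `name`, or None if there is none."""
        return next((c for c in self.children if c.name == name), None)


def build_first_tag_tree(host, urls):
    """Build a tag tree rooted at `host` from URL path segments.

    Assumed rule: every path segment is a tag, nested under the previous one.
    """
    root = TagNode(host)
    for url in urls:
        parsed = urlparse(url)
        if parsed.netloc != host:
            continue  # skip URLs that belong to another website
        node = root
        for segment in filter(None, parsed.path.split("/")):
            existing = node.child(segment)
            if existing is None:
                existing = TagNode(segment)
                node.children.append(existing)
            node = existing
    return root


tree = build_first_tag_tree(
    "books.example.com",
    [
        "https://books.example.com/novel/horror",
        "https://books.example.com/novel/romance",
        "https://books.example.com/history",
    ],
)
```

Real websites would likely need a per-site rule (for example, a mapping from URL patterns to tag paths); the path-segment rule is only the simplest case that yields a parent/child hierarchy.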
In an embodiment, the determining the category label of the category to which each first content label tree belongs includes:
respectively reading website domain names of websites corresponding to the first content label trees, and determining category labels of categories to which the first content label trees belong according to the website domain names and a preset website domain name classification library;
here, the website domain name classification library may be pre-constructed or an existing one, and it records the categories of website content corresponding to different website domain names; for example, the category of the website content corresponding to the website domain name xxsy.com is "book"; therefore, if the website domain name corresponding to the first content tag tree is xxsy.com, the category tag of the first content tag tree is "book".
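The category lookup can be sketched as a dictionary keyed by domain name. The table contents other than xxsy.com, the function name `category_tag`, and the fallback to parent domains are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical website domain name classification library: domain -> category.
DOMAIN_CATEGORIES = {
    "xxsy.com": "book",
    "video.example.org": "video",
}


def category_tag(host: str) -> Optional[str]:
    """Return the category tag for `host`, trying parent domains as a fallback."""
    parts = host.lower().split(".")
    # For "www.xxsy.com" this tries "www.xxsy.com", then "xxsy.com", then "com".
    for i in range(len(parts)):
        candidate = ".".join(parts[i:])
        if candidate in DOMAIN_CATEGORIES:
            return DOMAIN_CATEGORIES[candidate]
    return None  # domain is not classified
```

Returning `None` for an unknown domain leaves the caller free to skip the site or queue it for manual classification; the text does not say which is intended.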
Step 102: matching the content label of the first content label tree with the content label in the content label library according to a preset matching rule based on the category label, and updating the content label library according to a matching result;
in an embodiment, the matching, according to a preset matching rule, the content tag of the first content tag tree and the content tag in the content tag library based on the category tag includes:
acquiring a second content label tree in which the content labels of the root nodes in the content label library are the same as the category labels, and matching the content labels of the first content label tree with the content labels of each level in the second content label tree step by step according to the sequence from left to right or from top to bottom by combining semantic analysis;
here, at an initial stage of creating the content tag library, there may be no second content tag tree in which a root node content tag is the same as the category tag in the content tag library, and in this case, a second content tag tree in which a root node content tag is the same as the category tag is created;
the semantic analysis is the semantic analysis of the names of the content tags, and if the names of the two content tags are the same and the corresponding parent tags and the corresponding child tags are the same, the two content tags are judged to be the same; if the names of the two content tags are different and the corresponding parent tag and the corresponding child tag are the same, judging that the two content tags are similar; if the names of the two content tags are different and the corresponding parent tags or the corresponding child tags are different, judging that the two content tags are different;
in an embodiment, if the category label is "book", the obtaining of the second content label tree in which the content label of the root node in the content label library is the same as the category label is as follows: and acquiring a second content label tree of which the root node in the content label library is a book.
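The three-way judgment described above (same, similar, different) can be written down directly. In this sketch `TagContext` and `compare_tags` are hypothetical names, and a tag is represented by its own name plus the names of its parent and child tags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagContext:
    """A tag name together with its parent tag name and child tag names."""
    name: str
    parent: str
    children: frozenset = frozenset()


def compare_tags(a: TagContext, b: TagContext) -> str:
    same_context = a.parent == b.parent and a.children == b.children
    if a.name == b.name and same_context:
        return "same"
    if a.name != b.name and same_context:
        return "similar"
    # Different names with a different parent or different children. Equal
    # names in different contexts also land here; that case is not covered
    # by the text, so treating it as "different" is an assumption.
    return "different"


result = compare_tags(TagContext("horror", "novel"), TagContext("thriller", "novel"))
```

Taken literally, this rule judges any two childless tags under the same parent to be similar, so a practical system would presumably also require the names themselves to be semantically close.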
The sequence from left to right or from top to bottom is the sequence from the parent node to the child node.
In an embodiment, the step-by-step matching, in combination with the semantic analysis, the content tags of the first content tag tree with the content tags of each level in the second content tag tree in the order from left to right or from top to bottom includes:
starting from the content tag of the root node of the first content tag tree and proceeding from left to right or from top to bottom, the root node content tag is first matched level by level against the content tags of the second content tag tree; if the second content tag tree contains a content tag identical to the root node content tag, each child tag of the root node content tag is then matched level by level against the content tags of the second content tag tree; if the second content tag tree contains a content tag identical to that child tag, each of its child tags is matched in turn in the same way; and so on, until all content tags in the first content tag tree have been matched.
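The parent-before-child matching order can be sketched as a breadth-first walk. This is a simplified illustration that compares names for exact equality only (the semantic "similar" case is left out); `find_by_name`, `match_top_down` and the example tags are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TagNode:
    name: str
    children: list["TagNode"] = field(default_factory=list)


def find_by_name(root, name):
    """Search a tree level by level, left to right; return the node or None."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.name == name:
            return node
        queue.extend(node.children)
    return None


def match_top_down(first_root, second_root):
    """Return (matched, unmatched) tag names of the first tree.

    The children of a tag are visited only after the tag itself was found
    in the second tree, mirroring the parent-to-child order in the text.
    """
    matched, unmatched = [], []
    queue = deque([first_root])
    while queue:
        tag = queue.popleft()
        if find_by_name(second_root, tag.name) is not None:
            matched.append(tag.name)
            queue.extend(tag.children)
        else:
            unmatched.append(tag.name)
    return matched, unmatched


first = TagNode("novel", [TagNode("horror"), TagNode("romance", [TagNode("campus")])])
second = TagNode("book", [TagNode("novel", [TagNode("romance")]), TagNode("history")])
matched, unmatched = match_top_down(first, second)
```

In this sketch the children of an unmatched tag are not examined separately, on the assumption that they would be added together with their parent.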
In an embodiment, the updating the content tag library according to the matching result includes:
if the content label identical or similar to the content label does not exist in the second content label tree, adding the content label on the corresponding level of the second content label tree;
if the content tag similar to the content tag exists in the second content tag tree, updating the name of the content tag similar to the content tag; the content tags similar to the content tags are content tags with different names and same corresponding father tags and son tags;
if a content tag identical to the content tag exists in the second content tag tree, that content tag in the second content tag tree is kept unchanged, and matching continues with the other content tags in the first content tag tree;
here, fig. 3 is a schematic diagram of a second content tag tree in the content tag library according to an embodiment of the present invention. As shown in fig. 2 and 3, the category tag of the first content tag tree is "book". Starting from the root node content tag of the first content tag tree, namely "agriculture and forestry", this tag is matched level by level against the content tags of the second content tag tree in the order from left to right or from top to bottom; that is, "agriculture and forestry" is first matched against the child tags of "book", namely "novel", "history" and "science and technology". Semantic analysis finds no content tag identical to "agriculture and forestry", so "agriculture and forestry" is added as a child tag of "book";
fig. 4 is a schematic diagram of a second content tag tree according to an embodiment of the present invention. As shown in fig. 3 and 4, the category tag of the first content tag tree is "book". Starting from the root node content tag of the first content tag tree, namely "novel", this tag is matched level by level against the content tags of the second content tag tree, and semantic analysis finds the identical tag "novel" in the second content tag tree. The child tag "horror" of "novel" is then matched level by level against the content tags of the second content tag tree; semantic analysis finds the content tag "thriller", which is similar to "horror", and that content tag in the second content tag tree is therefore renamed to "horror".
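The add-or-rename update can be sketched as a recursive merge. The sketch makes several assumptions: the first tree is modeled with the category tag as its root, semantic analysis is replaced by a hypothetical hand-written synonym table, and the tag names are illustrative English renderings of the genre tags in the example.

```python
from dataclasses import dataclass, field


@dataclass
class TagNode:
    name: str
    children: list["TagNode"] = field(default_factory=list)


# Hypothetical stand-in for semantic analysis: name pairs treated as similar.
SYNONYMS = {frozenset({"horror", "thriller"})}


def is_similar(a, b):
    return a != b and frozenset({a, b}) in SYNONYMS


def merge_children(first_parent, second_parent):
    """Merge the child tags of first_parent into second_parent, recursively."""
    for tag in first_parent.children:
        siblings = second_parent.children
        target = next((c for c in siblings if c.name == tag.name), None)
        if target is None:
            target = next((c for c in siblings if is_similar(c.name, tag.name)), None)
            if target is not None:
                target.name = tag.name        # similar tag: adopt the website's name
            else:
                target = TagNode(tag.name)    # no same or similar tag: add it
                siblings.append(target)
        merge_children(tag, target)           # continue one level down


first = TagNode("book", [
    TagNode("agriculture and forestry"),
    TagNode("novel", [TagNode("horror")]),
])
second = TagNode("book", [
    TagNode("novel", [TagNode("thriller")]),
    TagNode("history"),
    TagNode("science and technology"),
])
merge_children(first, second)
```

After the call, "agriculture and forestry" has been appended under "book" and "thriller" has been renamed to "horror", while "history" and "science and technology" are untouched.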
In an embodiment, after the updating the content tag library according to the matching result, the method further includes:
matching content labels in the second content label tree with content labels of corresponding levels in the first content label tree from left to right or from top to bottom from content labels in the second content label tree which are the same as content labels of a root node of the first content label tree, and deleting the content labels in the second content label tree if the content labels in the second content label tree do not exist in the content labels of the corresponding levels in the first content label tree;
here, as shown in fig. 3 and 4, matching starts from the content tag in the second content tag tree that is identical to the root node content tag of the first content tag tree, namely "novel"; the content tags in the second content tag tree are matched against the content tags of the corresponding level in the first content tag tree in the order from left to right or from top to bottom, and if a child tag of "novel" in the second content tag tree does not exist among the child tags of "novel" in the first content tag tree, that child tag is deleted from the second content tag tree.
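The deletion pass can be sketched as a recursive prune that starts at the pair of identical tags. `prune_stale` and the example tags are hypothetical, and names are compared for exact equality.

```python
from dataclasses import dataclass, field


@dataclass
class TagNode:
    name: str
    children: list["TagNode"] = field(default_factory=list)


def prune_stale(first_node, second_node):
    """Delete child tags of second_node that first_node no longer has.

    Both arguments are assumed to carry the same tag name; the comparison
    then proceeds level by level through the surviving children.
    """
    current = {c.name: c for c in first_node.children}
    second_node.children = [c for c in second_node.children if c.name in current]
    for child in second_node.children:
        prune_stale(current[child.name], child)


first_novel = TagNode("novel", [TagNode("horror")])
second_novel = TagNode("novel", [TagNode("horror"), TagNode("romance")])
prune_stale(first_novel, second_novel)
```

The sketch compares against a single first content tag tree, as the example in the text does; how tags contributed by other websites would be protected from deletion is not described.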
Example two
Fig. 5 is a schematic flow chart illustrating a method for managing internet content tags according to an embodiment of the present invention; as shown in fig. 5, the method for managing internet content tags according to the embodiment of the present invention includes:
step 501: acquiring first content tag trees corresponding to different websites, and determining a category tag of a category to which each first content tag tree belongs;
before this step, the method further comprises: creating a content tag library; it should be noted that the creation of the content tag library is only executed when the management method of the internet content tag of the present invention is executed for the first time, and the creation can be directly applied subsequently.
The acquisition may be periodic acquisition, and the period may be set according to actual needs, for example, the period is two weeks.
In an embodiment, the obtaining the first content tag tree corresponding to the different websites includes:
acquiring website domain names corresponding to different websites and at least one URL (uniform resource locator) under the website domain names through an existing label data interface of an internet crawler tool or other external systems, and determining a first content label tree corresponding to the different websites by using a content label rule of each website based on the website domain names and the URLs;
here, each website of an internet content provider has its own content tag system, and the website as a whole has a tree structure; for example, Baidu includes Baidu News, Baidu Knows, Baidu Video, Baidu Map and the like, and Baidu News in turn includes science and technology, entertainment, society, military and the like; a first content tag tree corresponding to the website is determined by using the content tag rule of the website, according to the website domain name of the website and at least one URL under that domain name.
In one embodiment, the determining the category label of the category to which each first content label tree belongs comprises:
respectively reading website domain names of websites corresponding to the first content label trees, and determining category labels of categories to which the first content label trees belong according to the website domain names and a preset website domain name classification library;
here, the website domain name classification library may be pre-constructed or an existing one, and it records the categories of website content corresponding to different website domain names; for example, the category of the website content corresponding to the website domain name youku.com is "video"; therefore, if the website domain name corresponding to the first content tag tree is youku.com, the category tag of the first content tag tree is "video".
Step 502: matching the content label of the first content label tree with the content label in the content label library according to a preset matching rule based on the category label, judging whether the content label library has the same content label as the content label in the first content label tree, and if so, executing step 503; if not, go to step 505;
in this embodiment of the present invention, the matching, based on the category label, the content label of the first content label tree with the content label in the content label library according to a preset matching rule includes:
acquiring a second content label tree in which the content labels of the root nodes in the content label library are the same as the category labels, and matching the content labels of the first content label tree with the content labels of each level in the second content label tree step by step according to the sequence from left to right or from top to bottom by combining semantic analysis;
here, the semantic analysis is semantic analysis of names of the content tags, and if the names of the two content tags are the same and the corresponding parent tags and child tags are the same, it is determined that the two content tags are the same; if the names of the two content tags are different and the corresponding parent tag and the corresponding child tag are the same, judging that the two content tags are similar; if the names of the two content tags are different and the corresponding parent tags or the corresponding child tags are different, judging that the two content tags are different;
in an embodiment, the step-by-step matching, in combination with the semantic analysis, the content tags of the first content tag tree with the content tags of each level in the second content tag tree in the order from left to right or from top to bottom includes:
starting from the content tag of the root node of the first content tag tree and proceeding from left to right or from top to bottom, the root node content tag is first matched level by level against the content tags of the second content tag tree; if the second content tag tree contains a content tag identical to the root node content tag, each child tag of the root node content tag is then matched level by level against the content tags of the second content tag tree; if the second content tag tree contains a content tag identical to that child tag, each of its child tags is matched in turn in the same way; and so on, until all content tags in the first content tag tree have been matched.
Step 503: matching the content labels in the second content label tree with the content labels of the corresponding levels in the first content label tree from left to right or from top to bottom from the content label in the second content label tree which is the same as the content label of the root node of the first content label tree, judging whether the content labels in the second content label tree exist in the first content label tree, and if not, executing a step 504; if so, go to step 508;
here, because the content tags corresponding to the content offered by a content provider change dynamically, some obsolete content tags may remain in the content tag library; to keep the content tags in the content tag library synchronized with those on the Internet, obsolete tags need to be deleted from the library periodically. As shown in fig. 3 and 4, matching starts from the content tag in the second content tag tree that is identical to the root node content tag of the first content tag tree, namely "novel"; the content tags in the second content tag tree are matched against the content tags of the corresponding level in the first content tag tree in the order from left to right or from top to bottom, and if a child tag of "novel" in the second content tag tree does not exist among the child tags of "novel" in the first content tag tree, that child tag is deleted from the second content tag tree.
Step 504: content tags in the second content tag tree that are not present in the first content tag tree are deleted and step 508 is performed.
Step 505: judging whether a content tag similar to the content tag in the first content tag tree exists in the content tag library, and if so, executing step 506; if not, go to step 507;
step 506: the names of the content tags similar to the content tag are updated to the names of the content tags in the first content tag tree, and step 508 is performed.
Step 507: adding the content tags at respective levels of the second content tag tree.
Step 508: and ending the processing flow.
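The branching among steps 501 to 508 can be summarized as a small decision function. `plan_steps` and its boolean inputs are hypothetical; the function only lists which steps would run for one content tag, and it assumes that step 507 is followed by step 508, which the text implies but does not state.

```python
def plan_steps(has_same: bool, has_similar: bool, has_stale: bool) -> list:
    """List the step numbers executed for one content tag of the first tree.

    has_same:    the library holds a tag identical to this one (step 502)
    has_similar: the library holds a tag similar to this one (step 505)
    has_stale:   the second tree holds tags missing from the first (step 503)
    """
    steps = [501, 502]
    if has_same:
        steps.append(503)
        if has_stale:
            steps.append(504)  # delete tags that are absent from the first tree
    else:
        steps.append(505)
        steps.append(506 if has_similar else 507)  # rename, or add a new tag
    steps.append(508)  # end of the processing flow
    return steps


flow = plan_steps(has_same=False, has_similar=True, has_stale=False)
```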
EXAMPLE III
FIG. 6 is a schematic structural diagram of a device for managing Internet content tags according to an embodiment of the present invention; as shown in fig. 6, the device comprises a creating module 61, an obtaining module 62 and an updating module 63, wherein:
the creating module 61 is configured to create a content tag library;
the obtaining module 62 is configured to obtain first content tag trees corresponding to different websites, and determine category tags of categories to which the first content tag trees belong respectively;
the updating module 63 is configured to match the content tag of the first content tag tree with the content tag in the content tag library according to a preset matching rule based on the category tag, and update the content tag library according to a matching result;
here, the acquisition may be periodic acquisition, and the period may be set according to actual needs, for example, the period is two weeks.
In an embodiment, the obtaining module 62 is specifically configured to obtain website domain names corresponding to different websites and at least one URL under the website domain names, and determine, based on the website domain names and the URLs, first content tag trees corresponding to the different websites by using content tag rules of the websites.
In an embodiment, the obtaining module 62 is specifically configured to read a website domain name of a website corresponding to each first content tag tree, and determine a category tag of a category to which each first content tag tree belongs according to the website domain name and a preset website domain name classification library;
here, the website domain name classification library may be pre-constructed or may be an existing one, and it records the category of website content corresponding to each website domain name; for example, the website content corresponding to the website with the website domain name xxsy.com is books, so the category tag of the first content tag tree obtained from xxsy.com is "book".
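The lookup in the website domain name classification library can be sketched as a simple mapping. The dictionary `DOMAIN_CATEGORIES`, the function `category_for`, and the fallback that strips leading host labels (so that a subdomain resolves to its registered domain) are assumptions made for illustration; the description only states that the library maps domain names to content categories.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical classification library: website domain name -> category tag.
DOMAIN_CATEGORIES = {"xxsy.com": "book"}


def category_for(url_or_domain: str,
                 library: dict[str, str] = DOMAIN_CATEGORIES) -> Optional[str]:
    """Return the category tag for a URL or bare domain, or None if unknown."""
    # urlsplit only recognises the host when a "//" precedes it.
    text = url_or_domain if "//" in url_or_domain else "//" + url_or_domain
    host = urlsplit(text).hostname or ""
    labels = host.split(".")
    # Assumption: try the full host first, then progressively shorter suffixes.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in library:
            return library[candidate]
    return None
```

With this sketch, both `category_for("xxsy.com")` and a full URL under that domain resolve to the category tag "book", while an unlisted domain yields `None`.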
In an embodiment, the updating module 63 is specifically configured to obtain a second content tag tree in which content tags of root nodes in the content tag library are the same as the category tags, and match content tags of the first content tag tree with content tags of each level in the second content tag tree step by step according to a left-to-right or top-to-bottom order by combining semantic analysis;
here, the semantic analysis is semantic analysis of names of the content tags, and if the names of the two content tags are the same and the corresponding parent tags and child tags are the same, it is determined that the two content tags are the same; if the names of the two content tags are different and the corresponding parent tag and the corresponding child tag are the same, judging that the two content tags are similar; and if the names of the two content tags are different and the corresponding parent tags or the corresponding child tags are different, judging that the two content tags are different.
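The three-way judgment above can be written as a small comparison function. This is a sketch under assumptions: `TagInfo`, `Match` and `compare` are hypothetical names, names are compared by exact string equality instead of real semantic analysis, and the case the text leaves open (same name but different parent or child tags) is treated here as "different".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Match(Enum):
    SAME = "same"
    SIMILAR = "similar"
    DIFFERENT = "different"


@dataclass(frozen=True)
class TagInfo:
    """Hypothetical view of one tag: its name, parent name and child names."""
    name: str
    parent: Optional[str]
    children: frozenset[str]


def compare(a: TagInfo, b: TagInfo) -> Match:
    """Classify two tags as same, similar or different."""
    same_context = a.parent == b.parent and a.children == b.children
    if not same_context:
        # Covers "different name, different context" from the text and,
        # by assumption, "same name, different context" as well.
        return Match.DIFFERENT
    return Match.SAME if a.name == b.name else Match.SIMILAR


# Hypothetical example tags.
fantasy = TagInfo("fantasy", "novel", frozenset({"eastern"}))
renamed = TagInfo("xuanhuan", "novel", frozenset({"eastern"}))
unrelated = TagInfo("xuanhuan", "novel", frozenset())
```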
In an embodiment, the updating module 63 is specifically configured to: starting from the root node content tag of the first content tag tree, match the root node content tag with the content tags of each level in the second content tag tree step by step, in order from left to right or from top to bottom; if a content tag the same as the root node content tag exists in the second content tag tree, continue to match each sub-tag of the root node content tag with the content tags of each level in the second content tag tree step by step; if a content tag the same as a sub-tag exists in the second content tag tree, continue to match each sub-tag of that sub-tag in the same way; and so on, until all content tags in the first content tag tree have been matched.
In an embodiment, the updating module 63 is specifically configured to: if it is determined that no content tag identical or similar to the content tag exists in the second content tag tree, add the content tag at the corresponding level of the second content tag tree;
if a content tag similar to the content tag exists in the second content tag tree, update the name of the similar content tag to the name of the content tag in the first content tag tree; a content tag similar to the content tag is a content tag whose name is different but whose corresponding parent tag and child tags are the same;
and if a content tag identical to the content tag exists in the second content tag tree, keep that content tag in the second content tag tree unchanged, and continue matching the other content tags in the first content tag tree.
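The add, rename and keep branches can be combined into one recursive pass over the two trees. The following is a minimal sketch rather than the patented method: `TagNode`, `child_names`, `merge` and the example tag names are hypothetical, names are compared by exact equality, and a library tag counts as "similar" only if it sits under the same matched parent, has the same set of child names, and is not already claimed by name.

```python
from dataclasses import dataclass, field


@dataclass
class TagNode:
    """Hypothetical tree node: a tag name plus its ordered sub-tags."""
    name: str
    children: list["TagNode"] = field(default_factory=list)


def child_names(node: TagNode) -> frozenset[str]:
    return frozenset(child.name for child in node.children)


def merge(site: TagNode, lib: TagNode, log: list[str]) -> None:
    """Merge the sub-tags of site (first tree) into lib (second tree).

    Both nodes are assumed to be already matched to each other.
    """
    by_name = {child.name: child for child in lib.children}
    site_names = child_names(site)
    for s in site.children:                  # left to right within one level
        target = by_name.get(s.name)         # identical tag: keep unchanged
        if target is None:
            # Similar tag: different name, same parent, same child tags.
            target = next(
                (c for c in lib.children
                 if c.name not in site_names and child_names(c) == child_names(s)),
                None,
            )
            if target is not None:
                log.append(f"rename {target.name} -> {s.name}")
                target.name = s.name
            else:
                target = TagNode(s.name)     # neither identical nor similar: add
                lib.children.append(target)
                log.append(f"add {s.name}")
        merge(s, target, log)                # descend to the next level


# Hypothetical example data; only "novel" appears in the description.
site = TagNode("novel", [TagNode("fantasy", [TagNode("eastern")]), TagNode("romance")])
library = TagNode("novel", [TagNode("xuanhuan", [TagNode("eastern")]),
                            TagNode("history", [TagNode("ancient")])])
changes: list[str] = []
merge(site, library, changes)
```

One limitation of applying the rule literally: all leaf tags under the same parent have identical (empty) child sets, so a new leaf would be treated as a rename of any unmatched leaf. A real system would presumably lean on the semantic analysis of tag names to separate these cases.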
In an embodiment, the updating module 63 is further configured to, starting from a content tag in the second content tag tree that is the same as the content tag of the root node of the first content tag tree, match the content tag in the second content tag tree with the content tag of the corresponding level in the first content tag tree from left to right or from top to bottom, and delete the content tag in the second content tag tree if the content tag in the second content tag tree does not exist in the content tag of the corresponding level in the first content tag tree.
Here, it should be noted that the above description of the apparatus is similar to the description of the method; the apparatus has the same beneficial effects as the method, which are not repeated here for brevity. For technical details not disclosed in the apparatus embodiments of the present invention, refer to the description of the method embodiments of the present invention.
In the embodiment of the present invention, the creating module 61, the obtaining module 62, and the updating module 63 may be implemented by a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or an Application-Specific Integrated Circuit (ASIC) in a terminal or a server.
The above description covers only preferred embodiments of the present invention and is not intended to limit the scope of protection of the present invention.

Claims (10)

1. A method for managing internet content tags, wherein a content tag library is created, the method further comprising:
acquiring first content tag trees corresponding to different websites, and respectively determining category tags of categories to which the first content tag trees belong;
matching the content label of the first content label tree with the content label in the content label library according to a preset matching rule based on the category label, and updating the content label library according to a matching result;
the matching the content tag of the first content tag tree with the content tag in the content tag library according to a preset matching rule based on the category tag comprises:
acquiring a second content label tree in which the content labels of the root nodes in the content label library are the same as the category labels, and matching the content labels of the first content label tree with the content labels of each level in the second content label tree step by step according to the sequence from left to right or from top to bottom by combining semantic analysis;
the updating the content tag library according to the matching result comprises:
if the second content label tree is determined to have content labels similar to the content labels of the first content label tree, updating the names of the content labels similar to the content labels of the first content label tree; the content tags similar to the content tags of the first content tag tree are content tags with names different from those of the content tags of the first content tag tree, but corresponding parent tags and corresponding child tags are the same.
2. The method of claim 1, wherein the obtaining the first content tag tree corresponding to different websites comprises:
acquiring website domain names corresponding to different websites and at least one Uniform Resource Locator (URL) under the website domain names, and determining first content tag trees corresponding to the different websites by using content tag rules of the websites based on the website domain names and the URLs.
3. The method according to claim 1 or 2, wherein the determining the category label of the category to which each first content label tree belongs comprises:
and respectively reading the website domain names of the websites corresponding to the first content label trees, and determining the category labels of the categories to which the first content label trees belong according to the website domain names and a preset website domain name classification library.
4. The method of claim 1, wherein the updating the content tag library according to the matching result further comprises:
and if the content label identical or similar to the content label does not exist in the second content label tree, adding the content label on the corresponding level of the second content label tree.
5. The method of claim 1, wherein after the updating the content tag library according to the matching result, the method further comprises:
starting from a content label in the second content label tree which is the same as a content label of a root node of the first content label tree, matching the content label in the second content label tree with the content label of the corresponding level in the first content label tree according to the sequence from left to right or from top to bottom, and deleting the content label in the second content label tree if the content label in the second content label tree does not exist in the content label of the corresponding level in the first content label tree.
6. An apparatus for managing internet content tags, the apparatus comprising: a creating module, an obtaining module and an updating module; wherein,
the creating module is used for creating a content label library;
the obtaining module is used for acquiring first content tag trees corresponding to different websites and respectively determining category tags of categories to which the first content tag trees belong;
the updating module is used for matching the content tags of the first content tag tree with the content tags in the content tag library according to a preset matching rule based on the category tags and updating the content tag library according to a matching result;
the updating module is specifically configured to acquire a second content tag tree in which content tags of root nodes in the content tag library are the same as the category tags, and match content tags of the first content tag tree with content tags of each level in the second content tag tree step by step according to a left-to-right or top-to-bottom order by combining semantic analysis;
the updating module is specifically configured to update the name of the content tag similar to the content tag of the first content tag tree if it is determined that the second content tag tree has the content tag similar to the content tag of the first content tag tree; the content tags similar to the content tags of the first content tag tree are content tags with names different from those of the content tags of the first content tag tree, but corresponding parent tags and corresponding child tags are the same.
7. The apparatus according to claim 6, wherein the obtaining module is specifically configured to obtain website domain names corresponding to different websites and at least one uniform resource locator URL under the website domain names, and determine, based on the website domain names and the URLs, first content tag trees corresponding to the different websites by using content tag rules of the websites.
8. The apparatus according to claim 6 or 7, wherein the obtaining module is specifically configured to read website domain names of websites corresponding to the first content tag trees, and determine category tags of categories to which the first content tag trees belong according to the website domain names and a preset website domain name classification library.
9. The apparatus of claim 6, wherein the updating module is further configured to add the content tag at a corresponding level of the second content tag tree if it is determined that no content tag identical or similar to the content tag exists in the second content tag tree.
10. The apparatus of claim 6, wherein the updating module is further configured to match content tags in the second content tag tree with content tags in corresponding levels of the first content tag tree in a left-to-right or top-to-bottom order starting from a content tag in the second content tag tree that is the same as a content tag at a root node of the first content tag tree, and delete a content tag in the second content tag tree if the content tag in the second content tag tree does not exist in the content tag in the corresponding level of the first content tag tree.
CN201510703822.0A 2015-10-26 2015-10-26 Internet content label management method and device Active CN106611008B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510703822.0A CN106611008B (en) 2015-10-26 2015-10-26 Internet content label management method and device

Publications (2)

Publication Number Publication Date
CN106611008A CN106611008A (en) 2017-05-03
CN106611008B true CN106611008B (en) 2020-06-12

Family

ID=58613998

Country Status (1)

Country Link
CN (1) CN106611008B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant