CN106611008B

CN106611008B - Internet content label management method and device

Info

Publication number: CN106611008B
Application number: CN201510703822.0A
Authority: CN
Inventors: 赵耀红; 高丹; 熊龙; 邓超
Original assignee: China Mobile Communications Group Co Ltd
Current assignee: China Mobile Communications Group Co Ltd
Priority date: 2015-10-26
Filing date: 2015-10-26
Publication date: 2020-06-12
Anticipated expiration: 2035-10-26
Also published as: CN106611008A

Abstract

The invention discloses a method for managing Internet content tags. The method comprises creating a content tag library, and further comprises: acquiring first content tag trees corresponding to different websites, and determining the category tag of the category to which each first content tag tree belongs; and, based on the category tag, matching the content tags of the first content tag tree with the content tags in the content tag library according to a preset matching rule, and updating the content tag library according to the matching result. The invention also discloses a device for managing Internet content tags.

Description

Internet content label management method and device

Technical Field

The invention relates to the technical field of internet, in particular to a method and a device for managing internet content tags.

Background

With the rapid development of the internet, more and more content providers are emerging in every field. Each content provider has its own content label system, and these systems differ from one another, although some labels are identical or highly similar across providers. Building user portraits from the internet content that users access and from their behavior, and accurately pushing personalized content on the client side, both urgently require a unified and complete internet content label system, which would make personalized content pushing based on user access or user behavior more convenient and flexible. At present, no solution exists for merging different content label systems.

Disclosure of Invention

In view of this, embodiments of the present invention are expected to provide a method and an apparatus for managing internet content tags, which can form a unified internet content tag system and provide convenience and flexibility for personalized content push based on user access or user behavior.

In order to achieve the above purpose, the technical solution of the embodiment of the present invention is realized as follows:

the embodiment of the invention provides a method for managing an internet content label, which comprises the following steps: creating a library of content tags, the method further comprising:

acquiring first content tag trees corresponding to different websites, and respectively determining category tags of categories to which the first content tag trees belong;

and matching the content label of the first content label tree with the content label in the content label library according to a preset matching rule based on the category label, and updating the content label library according to a matching result.

In the foregoing scheme, the obtaining the first content tag trees corresponding to different websites includes:

acquiring website domain names corresponding to different websites and at least one Uniform Resource Locator (URL) under the website domain names, and determining first content tag trees corresponding to the different websites by using content tag rules of the websites based on the website domain names and the URLs.

In the foregoing solution, the determining the category label of the category to which each first content label tree belongs includes:

and respectively reading the website domain names of the websites corresponding to the first content label trees, and determining the category labels of the categories to which the first content label trees belong according to the website domain names and a preset website domain name classification library.

In the foregoing solution, the matching, based on the category label, the content label of the first content label tree and the content label in the content label library according to a preset matching rule includes:

and acquiring a second content label tree with the root node content labels in the content label library identical to the category labels, and matching the content labels of the first content label tree with the content labels of each level in the second content label tree step by step according to the sequence from left to right or from top to bottom by combining semantic analysis.

In the foregoing solution, the updating the content tag library according to the matching result includes:

if the content label identical or similar to the content label does not exist in the second content label tree, adding the content label on the corresponding level of the second content label tree;

if the content tag similar to the content tag exists in the second content tag tree, updating the name of the content tag similar to the content tag; the content tags similar to the content tags are content tags with different names and the same corresponding parent tags and child tags.

In the foregoing solution, after the content tag library is updated according to the matching result, the method further includes:

starting from a content label in the second content label tree which is the same as a content label of a root node of the first content label tree, matching the content label in the second content label tree with the content label of the corresponding level in the first content label tree according to the sequence from left to right or from top to bottom, and deleting the content label in the second content label tree if the content label in the second content label tree does not exist in the content label of the corresponding level in the first content label tree.

An embodiment of the invention also provides a device for managing internet content labels, comprising a creating module, an obtaining module and an updating module, wherein:

the creating module is used for creating a content label library;

the acquisition module is used for acquiring first content tag trees corresponding to different websites and respectively determining category tags of categories to which the first content tag trees belong;

and the updating module is used for matching the content tags of the first content tag tree with the content tags in the content tag library according to a preset matching rule based on the category tags and updating the content tag library according to a matching result.

In the above scheme, the obtaining module is specifically configured to obtain website domain names corresponding to different websites and at least one uniform resource locator URL under the website domain names, and determine, based on the website domain names and the URLs, first content tag trees corresponding to the different websites by using content tag rules of the websites.

In the above scheme, the obtaining module is specifically configured to read the website domain name of the website corresponding to each first content tag tree, and determine the category tag of the category to which each first content tag tree belongs according to the website domain name and a preset website domain name classification library.

In the foregoing scheme, the updating module is specifically configured to obtain a second content tag tree in which content tags of root nodes in the content tag library are the same as the category tags, and match content tags of the first content tag tree with content tags of each level in the second content tag tree step by step according to a left-to-right or top-to-bottom order in combination with semantic analysis.

In the foregoing solution, the updating module is specifically configured to determine that no content tag that is the same as or similar to the content tag exists in the second content tag tree, and add the content tag to a corresponding level of the second content tag tree;

In the foregoing scheme, the updating module is further configured to match, starting from a content tag in the second content tag tree that is the same as a content tag of a root node of the first content tag tree, the content tag in the second content tag tree with the content tag of the corresponding level in the first content tag tree according to a sequence from left to right or from top to bottom, and delete the content tag in the second content tag tree if the content tag in the second content tag tree does not exist in the content tag of the corresponding level in the first content tag tree.

The method and the device for managing internet content tags provided by the embodiments of the invention create a content tag library, and further: acquire first content tag trees corresponding to different websites, and determine the category tag of the category to which each first content tag tree belongs; and, based on the category tag, match the content tags of the first content tag tree with the content tags in the content tag library according to a preset matching rule, and update the content tag library according to the matching result. In this way, the different content tag systems used by different internet content providers can be merged into a unified internet content tag system. This facilitates later analysis of users' access history, the construction of user portraits, and personalized recommendation and real-time marketing services based on user preference tags; it improves the user experience of accessing internet content, and provides convenience and flexibility for personalized content recommendation based on user access or user behavior.

Drawings

Fig. 1 is a schematic flow chart of a method for managing internet content tags according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating a first content tag tree according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating a second content tag tree in the content tag library according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating a second content tag tree according to a second embodiment of the present invention;

FIG. 5 is a flowchart illustrating a second Internet content tag management method according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a management apparatus for internet content tags according to an embodiment of the present invention.

Detailed Description

In an embodiment of the present invention, a content tag library is created, and the method further includes: acquiring first content tag trees corresponding to different websites, and respectively determining category tags of categories to which the first content tag trees belong; and matching the content label of the first content label tree with the content label in the content label library according to a preset matching rule based on the category label, and updating the content label library according to a matching result.

Example one

Fig. 1 is a schematic flow chart illustrating a method for managing internet content tags according to an embodiment of the present invention; as shown in fig. 1, the method for managing internet content tags according to the embodiment of the present invention includes:

Step 100: creating a content tag library;

here, it should be noted that, the step is only executed when the method for managing the internet content tags is executed for the first time, and the subsequent content tag library can be directly applied.

Step 101: acquiring first content tag trees corresponding to different websites, and respectively determining category tags of categories to which the first content tag trees belong;

here, the acquisition may be periodic acquisition, and the period may be set according to actual needs, for example, the period is two weeks.
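The periodic acquisition mentioned above could be checked as in the following minimal sketch. The names `ACQUISITION_PERIOD` and `is_acquisition_due` are hypothetical, and the two-week value is only the example period given in the text.

```python
from datetime import datetime, timedelta

# Example period from the text; in practice it is set according to actual needs.
ACQUISITION_PERIOD = timedelta(weeks=2)


def is_acquisition_due(last_run: datetime, now: datetime,
                       period: timedelta = ACQUISITION_PERIOD) -> bool:
    """Return True when at least one full period has elapsed since the last run."""
    return now - last_run >= period


last_run = datetime(2024, 1, 1)
due_after_one_week = is_acquisition_due(last_run, datetime(2024, 1, 8))
due_after_two_weeks = is_acquisition_due(last_run, datetime(2024, 1, 15))
```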

In an embodiment, the obtaining the first content tag tree corresponding to the different websites includes:

acquiring website domain names (HOST) corresponding to different websites and at least one Uniform Resource Locator (URL) under the website domain names, and determining first content label trees corresponding to the different websites by using content label rules of the websites based on the website domain names and the URLs;

here, a website domain name (HOST) corresponding to different websites and at least one URL under the website domain name may be acquired through a tag data interface existing in an internet crawler tool or other external systems;

each website of an internet content provider has its own content tag system, namely a content tag rule, and the website as a whole has a tree structure; for example, Baidu comprises Baidu News, Baidu Video, Baidu Map and the like, and Baidu News comprises science and technology, entertainment, society, military and the like; a first content tag tree corresponding to the website is determined by using the content tag rule of the website according to the website domain name of the website and at least one URL (uniform resource locator) under it; the first content tag tree as a whole has a tree structure in which the content tags are arranged in levels and the content tags of adjacent levels have a membership relationship: of two adjacent tags, the one closer to the root node is called the parent tag, and the one farther from the root node is called the child tag. FIG. 2 is a schematic diagram of the first content tag tree according to the embodiment of the present invention.
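The tree structure described above, with parent and child tags on adjacent levels, might be represented as in the following sketch. `TagNode`, `add_child` and `build_tree` are hypothetical names, and feeding the tree from lists of tag names is an assumption: the text does not specify how a website's content tag rule is expressed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class TagNode:
    """One content tag; `parent` is the tag one level closer to the root."""
    name: str
    parent: Optional["TagNode"] = field(default=None, repr=False)
    children: List["TagNode"] = field(default_factory=list)

    def add_child(self, name: str) -> "TagNode":
        """Return the child tag with this name, creating it if it is missing."""
        for child in self.children:
            if child.name == name:
                return child
        node = TagNode(name, parent=self)
        self.children.append(node)
        return node


def build_tree(root_name: str, paths: List[List[str]]) -> TagNode:
    """Build a tag tree from root-to-leaf paths of tag names (assumed input format)."""
    root = TagNode(root_name)
    for path in paths:
        node = root
        for name in path:
            node = node.add_child(name)
    return root


# The Baidu News channels named in the text, as children of one root tag.
news = build_tree("Baidu News", [["science and technology"], ["entertainment"],
                                 ["society"], ["military"]])
```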

In an embodiment, the determining the category label of the category to which each first content label tree belongs includes:

respectively reading website domain names of websites corresponding to the first content label trees, and determining category labels of categories to which the first content label trees belong according to the website domain names and a preset website domain name classification library;

here, the website domain name classification library may be pre-constructed or an existing one, and it records the category of website content corresponding to each website domain name; for example, the category of the website content corresponding to the website domain name xxsy.com is "book"; therefore, if the website domain name corresponding to a first content tag tree is xxsy.com, the category tag of that first content tag tree is "book".
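Determining the category tag from the domain name could look like the sketch below. It assumes the classification library is a plain mapping from domain name to category, and it also matches subdomains by trying progressively shorter suffixes; both choices, and the names `DOMAIN_CATEGORIES` and `category_tag_for`, are assumptions. The only entry shown is the xxsy.com example from the text.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit

# Assumed shape of the preset website domain name classification library.
DOMAIN_CATEGORIES: Dict[str, str] = {"xxsy.com": "book"}


def category_tag_for(url_or_host: str,
                     library: Dict[str, str] = DOMAIN_CATEGORIES) -> Optional[str]:
    """Look up the category tag for a full URL or a bare host name."""
    # urlsplit only yields a hostname when a scheme is present; otherwise
    # the input is treated as a bare host name.
    host = (urlsplit(url_or_host).hostname or url_or_host).lower()
    parts = host.split(".")
    # Try "www.xxsy.com", then "xxsy.com", so that subdomains also match.
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in library:
            return library[candidate]
    return None
```

A call such as `category_tag_for("http://www.xxsy.com/")` returns the category recorded for xxsy.com, and an unknown domain returns `None`.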

Step 102: matching the content label of the first content label tree with the content label in the content label library according to a preset matching rule based on the category label, and updating the content label library according to a matching result;

in an embodiment, the matching, according to a preset matching rule, the content tag of the first content tag tree and the content tag in the content tag library based on the category tag includes:

acquiring a second content label tree in which the content labels of the root nodes in the content label library are the same as the category labels, and matching the content labels of the first content label tree with the content labels of each level in the second content label tree step by step according to the sequence from left to right or from top to bottom by combining semantic analysis;

here, at an initial stage of creating the content tag library, there may be no second content tag tree in which a root node content tag is the same as the category tag in the content tag library, and in this case, a second content tag tree in which a root node content tag is the same as the category tag is created;

the semantic analysis is the semantic analysis of the names of the content tags, and if the names of the two content tags are the same and the corresponding parent tags and the corresponding child tags are the same, the two content tags are judged to be the same; if the names of the two content tags are different and the corresponding parent tag and the corresponding child tag are the same, judging that the two content tags are similar; if the names of the two content tags are different and the corresponding parent tags or the corresponding child tags are different, judging that the two content tags are different;

in an embodiment, if the category label is "book", the obtaining of the second content label tree in which the content label of the root node in the content label library is the same as the category label is as follows: and acquiring a second content label tree of which the root node in the content label library is a book.

The sequence from left to right or from top to bottom is the sequence from the parent node to the child node.
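The same / similar / different rule used in the semantic analysis can be written down as a small comparison function. This is a sketch under stated assumptions: "same parent tag and child tags" is read as equal parent name and equal set of child names, and the case the text leaves open (same name but different parent or children) is treated as different. `TagNode`, `Match`, `make` and `compare` are hypothetical names, as is the child tag "supernatural".

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Match(Enum):
    SAME = "same"
    SIMILAR = "similar"
    DIFFERENT = "different"


@dataclass(eq=False)
class TagNode:
    name: str
    parent: Optional["TagNode"] = field(default=None, repr=False)
    children: List["TagNode"] = field(default_factory=list)


def make(name: str, parent: Optional[TagNode] = None) -> TagNode:
    """Create a tag and attach it to its parent, if one is given."""
    node = TagNode(name, parent)
    if parent is not None:
        parent.children.append(node)
    return node


def context(node: TagNode):
    """Parent name and set of child names: the context the rule compares."""
    parent_name = node.parent.name if node.parent is not None else None
    return parent_name, frozenset(child.name for child in node.children)


def compare(a: TagNode, b: TagNode) -> Match:
    if context(a) != context(b):
        return Match.DIFFERENT  # assumption: also covers same name, different context
    return Match.SAME if a.name == b.name else Match.SIMILAR


site_novel = make("novel")
horror = make("horror", site_novel)
make("supernatural", horror)

library_novel = make("novel")
thriller = make("thriller", library_novel)
make("supernatural", thriller)
history = make("history", library_novel)
```

Read literally, the rule makes any two childless tags under the same parent "similar", so a practical implementation would probably need an additional criterion for leaf tags.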

In an embodiment, the step-by-step matching, in combination with the semantic analysis, the content tags of the first content tag tree with the content tags of each level in the second content tag tree in the order from left to right or from top to bottom includes:

from the content label of the root node of the first content label tree, according to the sequence from left to right or from top to bottom, the content label of the root node is firstly matched with the content label of each level in the second content label tree step by step, if the second content label tree has the content label same as the content label of the root node, continuously and gradually matching each sub-label of the root node content label with the content label of each level in the second content label tree, if the second content label tree has the content label same as the sub-label, continuing to match each sub-label of the sub-labels with the content label of each level in the second content label tree step by step, and repeating the steps until the matching of all the content labels in the first content label tree is completed.

In an embodiment, the updating the content tag library according to the matching result includes:

if a content tag similar to the content tag exists in the second content tag tree, the name of the similar content tag is updated; a content tag similar to the content tag is one with a different name but the same corresponding parent tag and child tags;

if a content tag identical to the content tag exists in the second content tag tree, the identical content tag in the second content tag tree is kept unchanged, and matching continues with the other content tags in the first content tag tree;

here, FIG. 3 is a schematic diagram of a second content tag tree in the content tag library according to an embodiment of the present invention. As shown in FIG. 2 and FIG. 3, the category tag of the first content tag tree is "book". Starting from the root node content tag of the first content tag tree, namely "agriculture and forestry", this tag is matched step by step against the content tags of each level in the second content tag tree, in order from left to right or from top to bottom; that is, "agriculture and forestry" is first matched against "novel", "history" and "science and technology", the child tags of "book". Semantic analysis finds no content tag identical to "agriculture and forestry", so "agriculture and forestry" is added as a child tag of "book";

FIG. 4 is a schematic diagram of a second content tag tree according to an embodiment of the present invention. As shown in FIG. 3 and FIG. 4, the category tag of the first content tag tree is "book". Starting from the root node content tag of the first content tag tree, namely "novel", this tag is matched step by step against the content tags of each level in the second content tag tree, and semantic analysis finds the same tag "novel" in the second content tag tree. The child tag "horror" of "novel" is then matched step by step against the content tags of each level in the second content tag tree; semantic analysis finds a content tag "thriller" that is similar to "horror", so the name of that content tag in the second content tag tree is updated to "horror".
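The update rules above (keep an identical tag, rename a similar tag, add a missing tag) could be combined into one recursive merge, sketched below. Several simplifications are assumptions rather than part of the text: tags with the same name under the same parent are treated as the same tag without checking their children; a tag counts as similar only when it has the same non-empty set of child names; and only the children of the already matched parent are searched instead of every level. `TagNode`, `tree` and `merge_into` are hypothetical names, and the tag names in the example are illustrative.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class TagNode:
    name: str
    children: List["TagNode"] = field(default_factory=list)


def tree(name: str, *children: TagNode) -> TagNode:
    """Small helper for writing example trees inline."""
    return TagNode(name, list(children))


def merge_into(src: TagNode, dst_parent: TagNode) -> None:
    """Merge tag `src` from a website tree under `dst_parent` in the library tree."""
    match = next((d for d in dst_parent.children if d.name == src.name), None)
    if match is None:
        src_kids = {c.name for c in src.children}
        similar = next((d for d in dst_parent.children
                        if src_kids and {c.name for c in d.children} == src_kids),
                       None)
        if similar is not None:
            similar.name = src.name          # similar tag: update its name
            match = similar
        else:
            match = TagNode(src.name)        # no same or similar tag: add it
            dst_parent.children.append(match)
    for child in src.children:               # continue from parent to child
        merge_into(child, match)


library_book = tree("book",
                    tree("novel", tree("thriller", tree("supernatural"))),
                    tree("history"),
                    tree("science and technology"))

merge_into(tree("agriculture and forestry"), library_book)
merge_into(tree("novel", tree("horror", tree("supernatural"))), library_book)
```

After the two calls, "agriculture and forestry" has been appended under "book", and the tag formerly named "thriller" under "novel" carries the name "horror".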

In an embodiment, after the updating the content tag library according to the matching result, the method further includes:

matching content labels in the second content label tree with content labels of corresponding levels in the first content label tree from left to right or from top to bottom from content labels in the second content label tree which are the same as content labels of a root node of the first content label tree, and deleting the content labels in the second content label tree if the content labels in the second content label tree do not exist in the content labels of the corresponding levels in the first content label tree;

here, as shown in FIG. 3 and FIG. 4, the matching starts from the content tag in the second content tag tree that is the same as the root node content tag of the first content tag tree, that is, from the content tag "novel" in the second content tag tree; the content tags in the second content tag tree are matched against the content tags of the corresponding level in the first content tag tree, in order from left to right or from top to bottom, and a child tag of "novel" in the second content tag tree is deleted if it does not exist among the child tags of "novel" in the first content tag tree.
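The deletion pass described above might be sketched as follows: starting from a pair of tags that already match, every child in the library tree that has no counterpart with the same name at the corresponding level of the website tree is dropped. `TagNode`, `tree` and `prune` are hypothetical names, the tag "romance" is an invented example, and matching by name alone is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class TagNode:
    name: str
    children: List["TagNode"] = field(default_factory=list)


def tree(name: str, *children: TagNode) -> TagNode:
    """Small helper for writing example trees inline."""
    return TagNode(name, list(children))


def prune(dst: TagNode, src: TagNode) -> None:
    """Delete children of `dst` (library tree) that `src` (website tree) lacks."""
    src_by_name = {child.name: child for child in src.children}
    kept = []
    for child in dst.children:
        counterpart = src_by_name.get(child.name)
        if counterpart is not None:
            prune(child, counterpart)   # descend from parent to child
            kept.append(child)
        # otherwise the tag no longer exists on the website and is dropped
    dst.children = kept


library_novel = tree("novel", tree("horror"), tree("romance"))
site_novel = tree("novel", tree("horror"))
prune(library_novel, site_novel)
```

If several websites feed the same library subtree, a pass like this would also remove tags that only another website provides; the text does not say how that case is handled.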

Example two

Fig. 5 is a schematic flow chart illustrating a method for managing internet content tags according to an embodiment of the present invention; as shown in fig. 5, the method for managing internet content tags according to the embodiment of the present invention includes:

step 501: acquiring first content tag trees corresponding to different websites, and determining a category tag of a category to which each first content tag tree belongs;

before this step, the method further comprises: creating a content tag library; it should be noted that the creation of the content tag library is only executed when the management method of the internet content tag of the present invention is executed for the first time, and the creation can be directly applied subsequently.

The acquisition may be periodic acquisition, and the period may be set according to actual needs, for example, the period is two weeks.

acquiring website domain names corresponding to different websites and at least one URL (uniform resource locator) under the website domain names through an existing label data interface of an internet crawler tool or other external systems, and determining a first content label tree corresponding to the different websites by using a content label rule of each website based on the website domain names and the URLs;

here, each website of an internet content provider has its own content tag system, and the website as a whole has a tree structure; for example, Baidu comprises Baidu News, Baidu Knows, Baidu Video, Baidu Map and the like, and Baidu News comprises science and technology, entertainment, society, military and the like; a first content tag tree corresponding to the website is determined by using the content tag rule of the website according to the website domain name of the website and at least one URL under it.

In one embodiment, the determining the category label of the category to which each first content label tree belongs comprises:

here, the website domain name classification library may be pre-constructed or an existing one, and it records the category of website content corresponding to each website domain name; for example, the category of the website content corresponding to the website domain name youku.com is "video"; therefore, if the website domain name corresponding to a first content tag tree is youku.com, the category tag of that first content tag tree is "video".

Step 502: matching the content label of the first content label tree with the content label in the content label library according to a preset matching rule based on the category label, judging whether the content label library has the same content label as the content label in the first content label tree, and if so, executing step 503; if not, go to step 505;

in this embodiment of the present invention, the matching, based on the category label, the content label of the first content label tree with the content label in the content label library according to a preset matching rule includes:

here, the semantic analysis is semantic analysis of names of the content tags, and if the names of the two content tags are the same and the corresponding parent tags and child tags are the same, it is determined that the two content tags are the same; if the names of the two content tags are different and the corresponding parent tag and the corresponding child tag are the same, judging that the two content tags are similar; if the names of the two content tags are different and the corresponding parent tags or the corresponding child tags are different, judging that the two content tags are different;

Step 503: matching the content labels in the second content label tree with the content labels of the corresponding levels in the first content label tree from left to right or from top to bottom from the content label in the second content label tree which is the same as the content label of the root node of the first content label tree, judging whether the content labels in the second content label tree exist in the first content label tree, and if not, executing a step 504; if so, go to step 508;

here, because the content tags corresponding to the content offered by a content provider change dynamically, some useless content tags may remain in the content tag library; to keep the content tags in the content tag library synchronized with the content tags on the internet, the useless tags in the content tag library need to be deleted periodically; as shown in FIG. 3 and FIG. 4, the matching starts from the content tag in the second content tag tree that is the same as the root node content tag of the first content tag tree, that is, from the content tag "novel" in the second content tag tree; the content tags in the second content tag tree are matched against the content tags of the corresponding level in the first content tag tree, in order from left to right or from top to bottom, and a child tag of "novel" in the second content tag tree is deleted if it does not exist among the child tags of "novel" in the first content tag tree.

Step 504: content tags in the second content tag tree that are not present in the first content tag tree are deleted and step 508 is performed.

Step 505: judging whether a content tag similar to the content tag in the first content tag tree exists in the content tag library, and if so, executing step 506; if not, go to step 507;

Step 506: the name of the content tag similar to the content tag is updated to the name of the content tag in the first content tag tree, and step 508 is performed.

Step 507: adding the content tags at respective levels of the second content tag tree.

Step 508: and ending the processing flow.
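The branching of steps 502 to 508 can be summarized as a small decision function. This is only a sketch of the control flow: the three boolean inputs stand in for the results of the matching described in the steps, and the function name `steps_taken` is hypothetical. It returns the step numbers visited up to and including the action step.

```python
from typing import List


def steps_taken(has_same: bool, has_similar: bool, has_stale: bool) -> List[int]:
    """Return the sequence of steps for one content tag of the first tree.

    has_same:    an identical tag exists in the content tag library (step 502)
    has_similar: a similar tag exists in the content tag library (step 505)
    has_stale:   the second tree holds tags missing from the first tree (step 503)
    """
    if has_same:
        # Step 503 decides between deleting stale tags (504) and ending (508).
        return [502, 503, 504] if has_stale else [502, 503, 508]
    if has_similar:
        return [502, 505, 506]   # rename the similar tag
    return [502, 505, 507]       # add the tag at the corresponding level
```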

EXAMPLE III

FIG. 6 is a schematic structural diagram of a management apparatus for internet content tags according to an embodiment of the present invention; as shown in FIG. 6, the apparatus for managing internet content tags according to the embodiment of the present invention includes: a creation module 61, an acquisition module 62 and an update module 63, wherein:

the creating module 61 is configured to create a content tag library;

the obtaining module 62 is configured to obtain first content tag trees corresponding to different websites, and determine category tags of categories to which the first content tag trees belong respectively;

the updating module 63 is configured to match the content tag of the first content tag tree with the content tag in the content tag library according to a preset matching rule based on the category tag, and update the content tag library according to a matching result;

In an embodiment, the obtaining module 62 is specifically configured to obtain website domain names corresponding to different websites and at least one URL under the website domain names, and determine, based on the website domain names and the URLs, first content tag trees corresponding to the different websites by using content tag rules of the websites.

In an embodiment, the obtaining module 62 is specifically configured to read a website domain name of a website corresponding to each first content tag tree, and determine a category tag of a category to which each first content tag tree belongs according to the website domain name and a preset website domain name classification library;

In an embodiment, the updating module 63 is specifically configured to obtain a second content tag tree in which content tags of root nodes in the content tag library are the same as the category tags, and match content tags of the first content tag tree with content tags of each level in the second content tag tree step by step according to a left-to-right or top-to-bottom order by combining semantic analysis;

here, the semantic analysis is semantic analysis of names of the content tags, and if the names of the two content tags are the same and the corresponding parent tags and child tags are the same, it is determined that the two content tags are the same; if the names of the two content tags are different and the corresponding parent tag and the corresponding child tag are the same, judging that the two content tags are similar; and if the names of the two content tags are different and the corresponding parent tags or the corresponding child tags are different, judging that the two content tags are different.

In an embodiment, the updating module 63 is specifically configured to start from a root node content tag of the first content tag tree, according to the sequence from left to right or from top to bottom, the content labels of the root nodes are respectively matched with the content labels of each hierarchy in the second content label tree step by step, if the second content label tree has the content label same as the content label of the root node, continuously and gradually matching each sub-label of the root node content label with the content label of each level in the second content label tree, if the second content label tree has the content label same as the sub-label, continuing to match each sub-label of the sub-labels with the content label of each level in the second content label tree step by step, and repeating the steps until the matching of all the content labels in the first content label tree is completed.

In an embodiment, the updating module 63 is specifically configured to determine that no content tag identical or similar to the content tag exists in the second content tag tree, and add the content tag to a corresponding level of the second content tag tree;

and if a content tag identical to the content tag exists in the second content tag tree, the identical content tag in the second content tag tree is kept unchanged, and matching continues with the other content tags in the first content tag tree.

In an embodiment, the updating module 63 is further configured to: starting from the content tag in the second content tag tree that is the same as the root node content tag of the first content tag tree, match the content tags in the second content tag tree, in left-to-right or top-to-bottom order, with the content tags of the corresponding level in the first content tag tree; and if a content tag in the second content tag tree is not present among the content tags of the corresponding level in the first content tag tree, delete that content tag from the second content tag tree.
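The reverse pass described above, in which tags of the second content tag tree that are absent from the first content tag tree are deleted, might look like the sketch below. The names `Tag` and `prune` are hypothetical, tags are compared by name only, and the caller is assumed to have already located the tag in the second tree that matches the root of the first tree. Deleting a tag also drops its sub-tags, which is an assumption of this sketch.

```python
class Tag:
    """A node in a content tag tree (hypothetical structure for illustration)."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])


def prune(first, second_anchor):
    """Delete tags under `second_anchor` that `first` lacks; return removed paths."""
    if first.name != second_anchor.name:
        raise ValueError("second_anchor must match the root tag of the first tree")
    removed = []

    def walk(src, dst, path):
        by_name = {c.name: c for c in src.children}
        kept = []
        for child in dst.children:  # left-to-right order
            child_path = path + [child.name]
            match = by_name.get(child.name)
            if match is None:
                # Absent from the corresponding level of the first tree: delete.
                removed.append(child_path)
            else:
                kept.append(child)
                walk(match, child, child_path)
        dst.children = kept

    walk(first, second_anchor, [first.name])
    return removed


library = Tag(
    "Sports",
    [Tag("Football", [Tag("World Cup"), Tag("Old Cup")]), Tag("Tennis")],
)
site = Tag("Sports", [Tag("Football", [Tag("World Cup")])])
for path in prune(site, library):
    print(" / ".join(path))
```

Returning the removed paths, rather than deleting silently, is a design choice of the sketch that lets a caller review or log what was dropped.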

It should be noted that the above description of the apparatus is similar to the above description of the method, and the description of the beneficial effects of the method is omitted for brevity. For technical details not disclosed in the apparatus embodiments of the present invention, refer to the description of the method embodiments of the present invention.

In the embodiment of the present invention, the creating module 61, the obtaining module 62, and the updating module 63 may each be implemented by a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or an Application Specific Integrated Circuit (ASIC) in a terminal or a server.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims

1. A method for managing internet content tags, wherein a content tag library is created, the method further comprising:

matching the content tags of the first content tag tree with the content tags in the content tag library according to a preset matching rule based on the category tags, and updating the content tag library according to a matching result;

the matching the content tag of the first content tag tree with the content tag in the content tag library according to a preset matching rule based on the category tag comprises:

the updating the content tag library according to the matching result comprises:

if it is determined that the second content tag tree has content tags similar to the content tags of the first content tag tree, updating the names of the content tags similar to the content tags of the first content tag tree; wherein the content tags similar to the content tags of the first content tag tree are content tags whose names are different from those of the content tags of the first content tag tree but whose corresponding parent tags and corresponding child tags are the same.

2. The method of claim 1, wherein the obtaining the first content tag tree corresponding to different websites comprises:

3. The method according to claim 1 or 2, wherein the determining the category label of the category to which each first content label tree belongs comprises:

4. The method of claim 1, wherein said updating said content tag library according to said matching further comprises:

and if no content tag identical or similar to the content tag exists in the second content tag tree, adding the content tag at the corresponding level of the second content tag tree.

5. The method of claim 1, wherein after the updating the content tag library according to the matching result, the method further comprises:

6. An apparatus for managing internet content tags, the apparatus comprising: a creating module, an obtaining module and an updating module; wherein,

the creating module is used for creating a content label library;

the updating module is used for matching the content tags of the first content tag tree with the content tags in the content tag library according to a preset matching rule based on the category tags and updating the content tag library according to a matching result;

the updating module is specifically configured to acquire a second content tag tree in which content tags of root nodes in the content tag library are the same as the category tags, and match content tags of the first content tag tree with content tags of each level in the second content tag tree step by step according to a left-to-right or top-to-bottom order by combining semantic analysis;

the updating module is specifically configured to update the name of the content tag similar to the content tag of the first content tag tree if it is determined that the second content tag tree has the content tag similar to the content tag of the first content tag tree; the content tags similar to the content tags of the first content tag tree are content tags with names different from those of the content tags of the first content tag tree, but corresponding parent tags and corresponding child tags are the same.

7. The apparatus according to claim 6, wherein the obtaining module is specifically configured to obtain website domain names corresponding to different websites and at least one uniform resource locator URL under the website domain names, and determine, based on the website domain names and the URLs, first content tag trees corresponding to the different websites by using content tag rules of the websites.

8. The apparatus according to claim 6 or 7, wherein the obtaining module is specifically configured to read website domain names of websites corresponding to the first content tag trees, and determine category tags of categories to which the first content tag trees belong according to the website domain names and a preset website domain name classification library.

9. The apparatus of claim 6, wherein the updating module is further configured to add the content tag at the corresponding level of the second content tag tree if it is determined that no content tag identical or similar to the content tag exists in the second content tag tree.

10. The apparatus of claim 6, wherein the updating module is further configured to match content tags in the second content tag tree with content tags in corresponding levels of the first content tag tree in a left-to-right or top-to-bottom order starting from a content tag in the second content tag tree that is the same as a content tag at a root node of the first content tag tree, and delete a content tag in the second content tag tree if the content tag in the second content tag tree does not exist in the content tag in the corresponding level of the first content tag tree.