CN111694951B - Interest topic generation method, device, equipment and storage medium - Google Patents

Info

Publication number
CN111694951B
Authority
CN
China
Prior art keywords
tag
cluster
word
interest
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910200002.8A
Other languages
Chinese (zh)
Other versions
CN111694951A (en)
Inventor
刘少杰
许金泉
周俊
戴明洋
王栋
石逸轩
潘剑飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910200002.8A
Publication of CN111694951A
Application granted
Publication of CN111694951B
Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention disclose a method, a device, equipment and a storage medium for generating an interest topic. The method comprises: clustering the tag words included in a target vertical class to obtain at least two tag clusters; and generating interest topics of the target vertical class according to the information of the tag words included in the at least two tag clusters. With this technical solution, the tag words contained in any vertical class can be clustered into different tag clusters, and the interest topics of that vertical class can then be generated automatically from the tag word information in each tag cluster. The generation process does not rely on manual operation, can produce a rich set of interest topics, and obtains the tag words supporting each interest topic in a timely manner, thereby improving the coverage and timeliness of interest topics within the vertical class.

Description

Interest topic generation method, device, equipment and storage medium
Technical Field
The embodiments of the invention relate to the technical field of the Internet, and in particular to a method, a device, equipment and a storage medium for generating an interest topic.
Background
In the process of recommending articles to a user, articles are screened mainly according to the degree of correlation between the tag words associated with the interest topics the user follows and the tag words of each article.
Currently, the interest topics under each broad category are summarized manually, and tag words are added to the interest topics through rules. Tags are then assigned to an interest topic according to their relevance to it, and feedback from product lines is used to supplement uncovered interest topics or to add uncovered tag words under an existing interest topic.
The related art has the following defects: the tag words of the interest topics under each vertical class have insufficient coverage; manual operation places high demands on professional expertise; and new words are attributed to interest topics slowly.
Disclosure of Invention
In view of the above problems, the embodiments of the present invention provide a method, an apparatus, a device, and a storage medium for generating an interest topic, so as to generate interest topics and their tag words without relying on manual operation.
In a first aspect, an embodiment of the present invention provides a method for generating an interest topic, including:
clustering each tag word included in a target vertical class to obtain at least two tag clusters;
and generating the interest topic of the target vertical class according to the tag word information included in the at least two tag clusters.
In a second aspect, an embodiment of the present invention further provides an interest topic generating device, including:
the clustering module is used for clustering each tag word included in the target vertical class to obtain at least two tag clusters;
and the generating module is used for generating the interest topic of the target vertical class according to the tag word information included in the at least two tag clusters.
In a third aspect, an embodiment of the present invention further provides an apparatus, including:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the interest topic generation method described in any embodiment of the invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the interest topic generation method described in any embodiment of the present invention.
According to the interest topic generation scheme provided by the embodiments of the invention, the tag words contained in any vertical class can be clustered into different tag clusters, and the interest topics of that vertical class can then be generated automatically from the tag word information in each tag cluster. The generation process does not rely on manual operation, can produce a rich set of interest topics, and obtains the tag words supporting each interest topic in a timely manner, thereby improving the coverage and timeliness of interest topics within the vertical class.
The foregoing is merely an overview of the technical solution of the present invention. Specific embodiments are described below so that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the invention become more apparent.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a flow chart of an interest topic generation method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another interest topic generation method provided in an embodiment of the invention;
FIG. 3 is a schematic structural diagram of an interest topic generation device according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently, or at the same time. Furthermore, the order of the operations may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like.
FIG. 1 is a flow chart of an interest topic generation method provided in an embodiment of the present invention. This embodiment is applicable to constructing interest topics and the tag words associated with them, and in particular to constructing an interest topic system together with the tag words that support each topic. The method may be performed by an interest topic generation device, which may be implemented in software and/or hardware and integrated on any device having network communication capabilities. The device may be a terminal device or a server, where the terminal device may include, but is not limited to, a mobile phone, a tablet computer, etc.
As shown in FIG. 1, the interest topic generation method provided in the embodiment of the present invention may include the following S101 to S102:
S101, clustering each tag word included in the target vertical class to obtain at least two tag clusters.
In this embodiment, the target vertical class may be the vertical class, selected from a plurality of vertical classes, for which interest topics are to be generated. A vertical class is an existing broad category under a given tag system; for example, the vertical classes may be the broad categories already existing under the Feed tag system, including entertainment, sports, automobiles, military, emotion, games, cartoon, mother and infant, fashion, history, financial accounting and the like.
In this embodiment, for each vertical class, an interest topic system under that vertical class may be constructed from the interest topics that users follow and the tag words supporting those interest topics. A tag word characterizes an interest topic, so that the interest topic can be understood through its tag words.
For example, the mother and infant vertical class may include the interest topic of pregnancy examination as well as other interest topics. Constructing the pregnancy examination interest topic requires supporting tag words that describe it concretely. The tag words supporting this interest topic may include: fetal movement, B-ultrasound, gene detection, chromosomal abnormalities, early pregnancy examination, and the like. Similarly, every other interest topic in the mother and infant vertical class needs its own series of supporting tag words, which are not described further here.
In this embodiment, the target vertical class may contain one or more articles, each article may include a plurality of tag words, and each tag word under the target vertical class may be used to support the construction of interest topics under that class. The tag words of different articles may partly coincide and partly differ; if the coinciding tag words are not distinguished, the tag words under the target vertical class easily become repetitive and confused, which greatly affects their subsequent use.
Therefore, after the tag words contained in the target vertical class are obtained, they can be clustered and thereby divided into at least two tag clusters. Clustering here means dividing the obtained tag words into different clusters according to a certain criterion, so that tag words in the same cluster are as similar as possible while tag words in different clusters are as different as possible.
In an alternative manner of this embodiment, clustering each tag word included in the target verticals may include the following S1011 to S1012:
S1011, determining a tag feature representation for each tag word included in the target vertical class;
S1012, clustering the tag feature representations.
In this embodiment, after each tag word included in the target vertical class is obtained, the tag feature representation corresponding to each tag word may be determined from a preset corpus database associated with the target vertical class. The tag feature representation takes the form of a word vector. The mapping between tag words and their word vectors may be stored in advance in the preset corpus database.
In this embodiment, each vertical class may be associated with its own corpus database, and the target vertical class is associated with a corresponding preset corpus database. For example, corpus databases may exist for the articles under each of the vertical classes such as entertainment, sports, automobiles, military, emotion, games, cartoon, mother and infant, fashion, history, and financial accounting.
In this embodiment, after each tag word included in the target vertical class is expressed as a word vector, that is, after the tag feature representation corresponding to each tag word is determined, the tag feature representations may be clustered so that each tag word is allocated to a cluster according to the clustering result. For example, the K-Means clustering algorithm may be applied to the tag feature representations, distributing the tag words into clusters to obtain at least two tag clusters. Together, the at least two tag clusters contain the clustered tag words of the target vertical class.
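As an illustration of S1011 and S1012, the following is a minimal sketch of K-Means over word vectors, written in plain Python. The two-dimensional vectors, the function name `kmeans`, and the parameters `iterations` and `seed` are assumptions made for this example only; the patent does not specify the vector dimension, the initialization, or the number of clusters.

```python
import math
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain K-Means: returns (cluster index per vector, cluster centers)."""
    rng = random.Random(seed)
    # Initialise the centers with k distinct input vectors (an assumption).
    centers = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest center.
        assignments = [
            min(range(k), key=lambda c: math.dist(v, centers[c]))
            for v in vectors
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centers

# Hypothetical word vectors for tag words of a mother and infant vertical class.
tag_vectors = {
    "fetal movement": [0.9, 0.1],
    "B-ultrasound": [0.8, 0.2],
    "early pregnancy examination": [0.85, 0.15],
    "milk powder": [0.1, 0.9],
    "complementary food": [0.2, 0.8],
}
words = list(tag_vectors)
assignments, centers = kmeans([tag_vectors[w] for w in words], k=2)

# Group the tag words into tag clusters according to the clustering result.
tag_clusters = {}
for word, cluster_id in zip(words, assignments):
    tag_clusters.setdefault(cluster_id, []).append(word)
```

In practice the word vectors would come from the preset corpus database described above, and a library implementation of K-Means could replace this hand-written loop.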
S102, generating an interest subject of the target vertical class according to the information of each tag word included in at least two tag clusters.
In this embodiment, interest topics under the target vertical class are at present generated through manual summary or product line feedback. Given the limitations of both, relatively few interest topics may be generated; that is, the interest topics under the target vertical class may not cover everything that users could follow, so their coverage is relatively low. An improved approach is therefore needed that actively mines the interest topics under the target vertical class.
In this embodiment, when generating the interest topics of the target vertical class, the tag word information in the tag clusters obtained by clustering may serve as the basis for generation, and the interest topics may be generated according to the tag word information included in the at least two tag clusters. In this way, the interest topics of the target vertical class can be mined automatically without relying on manual summary or product line feedback, so that the interest topics under the target vertical class are well supplemented and the class contains as many interest topics as possible.
According to the interest topic generation scheme provided by the embodiments of the invention, the tag words contained in any vertical class can be clustered into different tag clusters, and the interest topics of that vertical class can then be generated automatically from the tag word information in each tag cluster. The generation process does not rely on manual operation, can produce a rich set of interest topics, and obtains the tag words supporting each interest topic in a timely manner, thereby improving the coverage and timeliness of interest topics within the vertical class.
FIG. 2 is a flow chart of another interest topic generation method provided in an embodiment of the present invention. This embodiment is optimized on the basis of the above embodiments and may be combined with the alternatives in one or more of them. As shown in FIG. 2, the interest topic generation method provided in the embodiment of the present invention may include S201 to S204:
S201, clustering each tag word included in the target vertical class to obtain at least two tag clusters.
In an alternative manner of this embodiment, before clustering each tag word included in the target vertical class, the method may further include:
for each tag word included in the target vertical class, determining the number of target articles in which the tag word appears, the target articles being articles belonging to the target vertical class;
and if the number of target articles does not satisfy a preset article-count condition, filtering out the tag word.
In this embodiment, among the tag words obtained for the target vertical class, some may appear only occasionally in a few articles under that class. Such tag words are not representative of the target vertical class and are meaningless for it. For this reason, before the tag words included in the target vertical class are clustered, a statistical analysis may be performed for each tag word to determine in which articles under the target vertical class it appears and how many such articles there are.
In this embodiment, after the number of target articles in which the tag word appears is determined, it may be checked whether that number satisfies the preset article-count condition. If it does not, the tag word is filtered out; if it does, the tag word is retained. For example, if the number of target articles is smaller than a preset threshold (for example, 10), the tag word is judged to be meaningless and is filtered out; if the number of target articles is greater than or equal to the preset threshold, the tag word is retained.
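The article-count filter can be sketched as follows. The function name `filter_rare_tags` and the sample data are hypothetical; the default threshold of 10 is the example value given in the text.

```python
from collections import Counter

def filter_rare_tags(article_tags, min_articles=10):
    """Keep tag words that appear in at least `min_articles` distinct articles."""
    article_counts = Counter()
    for tags in article_tags:
        # A tag word is counted at most once per article.
        article_counts.update(set(tags))
    return {tag for tag, n in article_counts.items() if n >= min_articles}

# Hypothetical articles of the target vertical class, each given as a list of tag words.
articles = (
    [["fetal movement", "B-ultrasound"]] * 12
    + [["typo-tag", "fetal movement"]] * 2
)
kept = filter_rare_tags(articles)  # "typo-tag" appears in only 2 articles and is dropped
```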
In an alternative manner of this embodiment, before clustering each tag word included in the target vertical class, the method may further include:
determining, for each tag word included in the target vertical class, the probability that the tag word appears in each vertical class;
and calculating the information entropy corresponding to the tag word according to the probability that the tag word appears in each vertical class, and determining whether to filter out the tag word according to that information entropy.
In this embodiment, meaningless tag words generally appear under many vertical classes at the same time, while high-quality, meaningful tag words generally appear under only one or two vertical classes. To this end, for each tag word included in the target vertical class, the probability that the tag word appears in each vertical class may be determined, and the information entropy corresponding to the tag word may then be calculated from these probabilities. Optionally, the information entropy may be calculated using the formula $H(x) = -\sum_i P_i(x) \log P_i(x)$, where $P_i(x)$ represents the probability that the tag word $x$ falls within vertical class $i$.
In this embodiment, the more vertical classes a tag word is spread across, the larger its information entropy and the lower its usability; the fewer vertical classes it appears in, the smaller its information entropy and the higher its usability. Accordingly, if the information entropy corresponding to the tag word is greater than a preset information entropy threshold, the tag word is filtered out; if it is less than or equal to the threshold, the tag word is retained.
In this way, noise tag words can be filtered out of the tag words obtained for the target vertical class before clustering, which reduces the amount of data to be clustered, speeds up the clustering, and saves data processing resources.
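A minimal sketch of the entropy filter follows. Estimating $P_i(x)$ from per-class occurrence counts, the natural logarithm, and the threshold value 1.0 are assumptions for illustration; the text does not fix how the probabilities are estimated or what threshold is used.

```python
import math

def tag_entropy(class_counts):
    """H(x) = -sum_i P_i(x) * log P_i(x), with P_i estimated from counts."""
    total = sum(class_counts.values())
    probs = [n / total for n in class_counts.values() if n > 0]
    return -sum(p * math.log(p) for p in probs)

def keep_tag(class_counts, max_entropy=1.0):
    """Retain the tag word only if its entropy does not exceed the threshold."""
    return tag_entropy(class_counts) <= max_entropy

# Hypothetical occurrence counts of two tag words across vertical classes.
focused = {"mother and infant": 95, "emotion": 5}
generic = {"entertainment": 25, "sports": 25, "automobiles": 25, "games": 25}

focused_kept = keep_tag(focused)   # concentrated in one class: low entropy
generic_kept = keep_tag(generic)   # spread evenly over four classes: entropy log(4)
```

A tag word spread evenly over four vertical classes has entropy $\log 4 \approx 1.39$, while one concentrated in a single class has entropy close to zero, which matches the filtering rule above.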
S202, determining the weight of each label cluster according to the weight of each label word in each label cluster.
In this embodiment, different tag words included in the target vertical class may differ in importance: some are more important within the class and contribute more to generating interest topics, while others are less important and contribute less. Accordingly, different tag words may carry different weights, with more important tag words weighted higher and less important tag words weighted lower.
In this embodiment, after the tag words included in the target vertical class are clustered into at least two tag clusters, each tag cluster contains the tag words assigned to it, and each of those tag words carries a corresponding weight. The weight of a tag cluster may therefore be determined from the weights of the tag words it contains: the greater the weights of its tag words, the greater the weight of the tag cluster, and the smaller the weights of its tag words, the smaller the weight of the tag cluster. Optionally, for each tag cluster, the weights of the tag words it contains may be summed and the sum used as the weight of the tag cluster. In addition, tag words whose weight is below a preset weight threshold, chosen empirically, may be deleted from the tag clusters.
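The summation described above, together with the optional pruning of low-weight tag words, can be sketched as follows. The function name `prune_and_weigh`, the sample weights, and the threshold 0.2 are hypothetical, and pruning before summing is an assumption, since the text does not state the order of the two operations.

```python
def prune_and_weigh(tag_clusters, tag_weights, min_tag_weight=0.0):
    """Drop tag words below `min_tag_weight`, then sum the remaining weights per cluster."""
    pruned, cluster_weights = {}, {}
    for cluster_id, words in tag_clusters.items():
        kept = [w for w in words if tag_weights[w] >= min_tag_weight]
        pruned[cluster_id] = kept
        cluster_weights[cluster_id] = sum(tag_weights[w] for w in kept)
    return pruned, cluster_weights

# Hypothetical tag clusters and tag word weights.
tag_clusters = {
    0: ["fetal movement", "B-ultrasound", "gene detection"],
    1: ["milk powder"],
}
tag_weights = {
    "fetal movement": 0.5,
    "B-ultrasound": 0.25,
    "gene detection": 0.125,
    "milk powder": 0.125,
}
pruned, cluster_weights = prune_and_weigh(tag_clusters, tag_weights, min_tag_weight=0.2)
```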
In an alternative manner of this embodiment, before determining the weight of each tag cluster according to the weights of the tag words in the tag cluster, the method may further include:
determining the weight of each tag word according to the frequency with which the tag word appears in the articles of the target vertical class.
In this embodiment, the articles of the target vertical class are the articles belonging to that class; taking the mother and infant vertical class as an example, they are articles about mother and infant care. The tag words included in the target vertical class may appear in these articles with different frequencies, some more often and some less often.
In this embodiment, the more frequently a tag word appears in the articles of the target vertical class, the more important it is within that class; the less frequently it appears, the less important it is. For this reason, the weight of each tag word included in the target vertical class may be determined according to the frequency with which it appears in the articles of the class, yielding a weight for every tag word.
In an alternative manner of this embodiment, before determining the weight of each tag cluster according to the weights of the tag words in the tag cluster, the method may further include:
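One simple way to turn frequency into a weight is relative frequency, sketched below. The text only states that the weight is determined according to frequency, so the normalisation and the function name `tag_weights_from_frequency` are assumptions.

```python
from collections import Counter

def tag_weights_from_frequency(article_tags):
    """Weight each tag word by its share of all tag word occurrences."""
    counts = Counter(tag for tags in article_tags for tag in tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

# Hypothetical articles of the target vertical class.
articles = [
    ["fetal movement", "B-ultrasound"],
    ["fetal movement"],
    ["fetal movement", "gene detection"],
]
weights = tag_weights_from_frequency(articles)  # "fetal movement" gets the largest weight
```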
determining the similarity between each label word in each label cluster and the center of the label cluster;
and screening each label word in the label cluster according to the similarity.
In this embodiment, clustering the tag words included in the target vertical class yields not only at least two tag clusters but also the tag cluster center of each tag cluster. The tag cluster center may be represented as a vector, and each tag word in a tag cluster may be represented by its tag feature representation, for example a word vector.
In this embodiment, once the tag feature representation of each tag word in a tag cluster and the center of that tag cluster are determined, the similarity between each tag feature representation and the tag cluster center may be calculated. For example, the cosine distance between the tag feature representation of each tag word and the tag cluster center gives the similarity between the tag word and the center: the larger the cosine distance, the lower the similarity; the smaller the cosine distance, the higher the similarity.
In this embodiment, the tag words are sorted in descending order of their similarity to the tag cluster center; tag words with low similarity are deleted and tag words with high similarity are retained, thereby screening the tag words in the tag cluster. Deleting the tag words that are dissimilar to the cluster center reduces the amount of data involved in determining the tag cluster weight from the tag word weights, speeds up processing, and saves data processing resources.
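The screening step can be sketched with cosine similarity (one minus the cosine distance). The cut-off `min_similarity=0.9`, the function names, and the sample vectors are hypothetical; the text does not say whether a threshold or a fixed number of retained tag words is used.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def screen_cluster(tag_vectors, center, min_similarity=0.9):
    """Sort tag words by similarity to the center (descending) and drop the dissimilar ones."""
    similarity = {w: cosine_similarity(v, center) for w, v in tag_vectors.items()}
    ranked = sorted(similarity, key=similarity.get, reverse=True)
    return [w for w in ranked if similarity[w] >= min_similarity]

# Hypothetical tag cluster with one outlier.
cluster_vectors = {
    "B-ultrasound": [0.8, 0.3],
    "stock market": [0.1, 0.9],
    "fetal movement": [0.9, 0.1],
}
cluster_center = [1.0, 0.0]
screened = screen_cluster(cluster_vectors, cluster_center)
```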
S203, screening at least two tag clusters according to the weight of each tag cluster.
In this embodiment, clustering the tag words included in the target vertical class may yield a plurality of tag clusters, and because the tag words contained in the tag clusters differ in importance, the tag clusters differ in importance as well. The at least two tag clusters can therefore be screened according to their weights: tag clusters with smaller weights are deleted and tag clusters with larger weights are retained.
S204, generating the interest topics of the target vertical class according to the remaining tag clusters.
In this embodiment, after the at least two tag clusters are screened, the tag words included in each remaining tag cluster can fully describe an interest topic under the target vertical class. The interest topics of the target vertical class can therefore be generated from the remaining tag clusters.
In an alternative manner of this embodiment, generating the interest topics of the target vertical class according to the remaining tag clusters may include:
taking the cluster center tag word of each remaining tag cluster as the interest topic of that tag cluster.
In this embodiment, although the tag words in a remaining tag cluster can fully describe an interest topic under the target vertical class, not every tag word is suitable as the interest topic itself. The tag words in a cluster share something in common, and that commonality can serve as the interest topic. Therefore, when generating the interest topics of the target vertical class, the tag words in the remaining tag clusters can be analyzed, and the cluster center tag word of each remaining tag cluster is taken as the interest topic of that tag cluster.
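The cluster screening in S203 can be sketched as a simple ranking. Keeping a fixed number of top-weighted clusters is an assumption; the text only says that lighter clusters are deleted and heavier ones retained, so a weight threshold would fit equally well. The name `screen_clusters` and the sample weights are hypothetical.

```python
def screen_clusters(cluster_weights, keep_top=2):
    """Return the ids of the `keep_top` heaviest tag clusters, heaviest first."""
    ranked = sorted(cluster_weights, key=cluster_weights.get, reverse=True)
    return ranked[:keep_top]

# Hypothetical tag cluster weights, e.g. sums of tag word weights.
weights_by_cluster = {0: 0.75, 1: 0.05, 2: 0.4}
remaining = screen_clusters(weights_by_cluster)  # cluster 1 is screened out
```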
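One possible reading of "cluster center tag word" is the tag word whose vector lies closest to the cluster center. The sketch below follows that reading, which is an assumption, as are the function name `center_tag_word` and the sample vectors.

```python
import math

def center_tag_word(tag_vectors, center):
    """Pick the tag word whose vector is nearest (Euclidean) to the cluster center."""
    return min(tag_vectors, key=lambda w: math.dist(tag_vectors[w], center))

# Hypothetical remaining tag cluster.
cluster_vectors = {
    "fetal movement": [0.9, 0.1],
    "pregnancy examination": [0.85, 0.15],
    "B-ultrasound": [0.8, 0.2],
}
# The cluster center is taken as the mean of the member vectors.
center = [sum(col) / len(cluster_vectors) for col in zip(*cluster_vectors.values())]
interest_topic = center_tag_word(cluster_vectors, center)
```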
In an optional manner of this embodiment, after generating the interest topics of the target vertical class according to the remaining tag clusters, the method may further include:
taking the tag words in each remaining tag cluster as the tag words of the interest topic associated with that tag cluster.
According to the interest topic generation scheme provided by the embodiments of the invention, the tag words contained in any vertical class can be clustered into different tag clusters, and the interest topics of that vertical class can then be generated automatically from the tag word information in each tag cluster. The generation process requires no manual operation, so a rich set of interest topics can be obtained even in an unfamiliar domain, and the tag words supporting each interest topic are obtained in a timely manner, which improves the coverage and timeliness of interest topics within the vertical class. A new tag word can be attributed quickly to an interest topic by calculating the vector similarity between the word vector of the new tag word and the centers of the clustered tag clusters.
FIG. 3 is a schematic structural diagram of an interest topic generation device provided in an embodiment of the present invention. This embodiment is applicable to constructing interest topics and the tag words associated with them, and in particular to constructing an interest topic system together with the tag words that support each topic. The device may be implemented in software and/or hardware and integrated on any device having network communication capabilities. As shown in FIG. 3, the interest topic generation device in the embodiment of the present invention may include: a clustering module 301 and a generating module 302. Wherein:
the clustering module 301 is configured to cluster each tag word included in the target vertical class to obtain at least two tag clusters;
and the generating module 302 is configured to generate the interest topic of the target vertical class according to the tag word information included in the at least two tag clusters.
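The attribution of a new tag word to an existing interest topic can be sketched as a nearest-center lookup by cosine similarity. The topic names, the center vectors, and the function name `assign_new_tag` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def assign_new_tag(new_vector, topic_centers):
    """Return the interest topic whose tag cluster center is most similar to the new tag word."""
    return max(topic_centers, key=lambda t: cosine_similarity(new_vector, topic_centers[t]))

# Hypothetical tag cluster centers keyed by their interest topic.
topic_centers = {
    "pregnancy examination": [0.85, 0.15],
    "infant feeding": [0.15, 0.85],
}
new_tag_vector = [0.7, 0.3]  # word vector of a newly observed tag word
assigned_topic = assign_new_tag(new_tag_vector, topic_centers)
```

Because only one similarity per cluster center is computed, a new tag word can be attributed without re-running the clustering.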
Based on the foregoing embodiment, optionally, the generating module 302 may include:
the tag cluster weight determining unit is used for determining the weight of each tag cluster according to the weight of each tag word in the tag cluster;
the tag cluster screening unit is used for screening the at least two tag clusters according to the weight of each tag cluster;
and the interest topic generation unit is used for generating the interest topic of the target vertical class according to the remaining tag clusters.
On the basis of the technical solution of the foregoing embodiment, optionally, the interest topic generation unit is specifically configured to: take the cluster center tag word of each remaining tag cluster as the interest topic of that tag cluster.
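The three units above can be sketched as follows. This is only an illustration under stated assumptions: the cluster weight is taken as the sum of its tag word weights, screening keeps the `keep` heaviest clusters, and the cluster center tag word is chosen as the tag word whose vector lies closest to the mean vector of the cluster. None of these choices, nor the function names and example data, is fixed by the embodiment.

```python
def cluster_weight(cluster, word_weights):
    """Weight of a tag cluster, assumed here to be the sum of its tag word weights."""
    return sum(word_weights.get(word, 0.0) for word in cluster)

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def center_tag_word(cluster, word_vectors):
    """Tag word whose vector is closest (squared Euclidean) to the cluster centroid."""
    center = centroid([word_vectors[w] for w in cluster])

    def sq_dist(word):
        return sum((x - c) ** 2 for x, c in zip(word_vectors[word], center))

    return min(cluster, key=sq_dist)

def generate_topics(clusters, word_weights, word_vectors, keep=2):
    """Keep the `keep` heaviest clusters and map each topic to its tag words."""
    ranked = sorted(clusters, key=lambda c: cluster_weight(c, word_weights), reverse=True)
    remaining = ranked[:keep]  # hypothetical screening rule: top-k by weight
    return {center_tag_word(c, word_vectors): list(c) for c in remaining}

# Hypothetical tag clusters, weights and two-dimensional word vectors.
clusters = [["nba", "dunk", "rebound"], ["goal", "offside", "penalty"], ["misc"]]
word_weights = {"nba": 5, "dunk": 2, "rebound": 1,
                "goal": 4, "offside": 1, "penalty": 1, "misc": 0.5}
word_vectors = {"nba": [0.9, 0.1], "dunk": [1.0, 0.0], "rebound": [0.8, 0.2],
                "goal": [0.1, 0.9], "offside": [0.0, 1.0], "penalty": [0.2, 0.8],
                "misc": [0.5, 0.5]}
topics = generate_topics(clusters, word_weights, word_vectors, keep=2)
```

The returned mapping pairs each interest topic with the tag words of its cluster, so the same structure also provides the tag words that support each topic.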
On the basis of the technical solution of the foregoing embodiment, optionally, the apparatus may further include:
the tag word determining module 303 is configured to take the tag words in each remaining tag cluster as the tag words of the interest topic associated with that tag cluster.
Based on the foregoing embodiment, optionally, the generating module 302 may further include:
the tag word similarity determining unit is used for determining the similarity between each tag word in each tag cluster and the center of that tag cluster;
and the tag word screening unit is used for screening the tag words in the tag cluster according to the similarity.
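A minimal sketch of the two units above follows. It assumes that the cluster center is the mean of the word vectors in the cluster, that similarity is measured by cosine similarity, and that tag words below a fixed `threshold` are discarded; the function names, the threshold value, and the example vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def screen_tag_words(cluster, word_vectors, threshold=0.8):
    """Keep only the tag words whose similarity to the cluster center meets the threshold."""
    vectors = [word_vectors[w] for w in cluster]
    # Assumed cluster center: component-wise mean of all member vectors.
    center = [sum(col) / len(vectors) for col in zip(*vectors)]
    return [w for w in cluster
            if cosine_similarity(word_vectors[w], center) >= threshold]

# Hypothetical cluster containing one off-topic tag word.
word_vectors = {"nba": [1.0, 0.0], "dunk": [0.9, 0.1], "recipe": [0.0, 1.0]}
kept = screen_tag_words(["nba", "dunk", "recipe"], word_vectors, threshold=0.8)
```

In this sketch the center is computed once, before screening, so a strong outlier still shifts it slightly; recomputing the center after screening would be an equally plausible variant.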
Based on the foregoing embodiment, optionally, the generating module 302 may further include:
and the tag word weight determining unit is used for determining the weight of each tag word according to the frequency with which the tag word occurs in the articles of the target vertical class.
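The frequency-based weighting can be sketched as below. The sketch assumes that each article of the target vertical class is already available as a list of its tag words and that the weight is the relative frequency of the tag word over all articles; the embodiment only requires that the weight depend on the frequency, so the normalization and the name `tag_word_weights` are illustrative choices.

```python
from collections import Counter

def tag_word_weights(articles):
    """Map each tag word to its relative frequency across the given articles.

    articles is a list in which each element is the list of tag words
    of one article in the target vertical class.
    """
    counts = Counter(tag for tags in articles for tag in tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}

# Hypothetical tag words of three articles in a sports vertical class.
articles = [["nba", "dunk"], ["nba"], ["nba", "rebound"]]
weights = tag_word_weights(articles)
```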
Based on the foregoing embodiment, optionally, the clustering module 301 may include:
the tag feature determining unit is used for determining a tag feature representation for each tag word included in the target vertical class;
and the tag word clustering unit is used for clustering the tag feature representations.
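As an illustration of the clustering module, the following sketch clusters tag feature representations with a plain k-means loop. The choice of k-means, the deterministic initialization from the first `k` tag words, and the two-dimensional example vectors are assumptions; the embodiment here only requires that feature representations be determined and then clustered.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans_tag_words(word_vectors, k, iterations=20):
    """Cluster tag words by their feature vectors; returns (clusters, centers)."""
    words = list(word_vectors)
    if not 1 <= k <= len(words):
        raise ValueError("k must be between 1 and the number of tag words")
    # Deterministic initialization (an assumption): first k tag words as centers.
    centers = [list(word_vectors[w]) for w in words[:k]]
    assignment = {}
    for _ in range(iterations):
        new_assignment = {
            w: min(range(k), key=lambda i: squared_distance(word_vectors[w], centers[i]))
            for w in words
        }
        if new_assignment == assignment:
            break  # converged: no tag word changed cluster
        assignment = new_assignment
        for i in range(k):
            members = [word_vectors[w] for w in words if assignment[w] == i]
            if members:
                centers[i] = mean_vector(members)
    clusters = [[w for w in words if assignment[w] == i] for i in range(k)]
    return clusters, centers

# Hypothetical feature representations of four tag words.
word_vectors = {"nba": [1.0, 0.0], "goal": [0.0, 1.0],
                "dunk": [0.9, 0.1], "offside": [0.1, 0.9]}
clusters, centers = kmeans_tag_words(word_vectors, k=2)
```

The returned centers are the vectors against which later similarity calculations, such as tag word screening or attribution of new tag words, would be made.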
The interest topic generating device provided in the embodiment of the invention can execute the interest topic generating method provided in any embodiment of the invention, and has the corresponding functions and beneficial effects of executing the interest topic generating method.
Fig. 4 is a schematic structural view of an apparatus according to an embodiment of the present invention. Fig. 4 illustrates a block diagram of an exemplary device 412 suitable for use in implementing embodiments of the invention. The device 412 shown in fig. 4 is only an example and should not be construed as limiting the functionality and scope of use of embodiments of the invention.
As shown in fig. 4, device 412 is in the form of a general purpose computing device. The components of the device 412 may include, but are not limited to: one or more processors 416, a storage 428, and a bus 418 that connects the various system components (including the storage 428 and the processors 416).
Bus 418 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Device 412 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by device 412 and includes both volatile and nonvolatile media, removable and non-removable media.
The storage 428 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 430 and/or cache memory 432. Device 412 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 434 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, commonly referred to as a "hard disk drive"). Although not shown in fig. 4, a magnetic disk drive for reading from and writing to a removable nonvolatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable nonvolatile optical disk such as a compact disc read-only memory (CD-ROM), a digital versatile disc read-only memory (DVD-ROM), or other optical media, may be provided. In such cases, each drive may be coupled to bus 418 via one or more data medium interfaces. Storage 428 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 440 having a set (at least one) of program modules 442 may be stored, for example, in the storage 428, such program modules 442 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 442 generally perform the functions and/or methodologies in the described embodiments of the invention.
The device 412 may also communicate with one or more external devices 414 (e.g., a keyboard, a pointing device, a display 424, etc.), with one or more devices that enable a user to interact with the device 412, and/or with any devices (e.g., a network card, a modem, etc.) that enable the device 412 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 422. Also, device 412 may communicate with one or more networks, such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet, via network adapter 420. As shown in fig. 4, network adapter 420 communicates with other modules of device 412 over bus 418. It should be appreciated that, although not shown, other hardware and/or software modules may be used in connection with device 412, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, redundant arrays of independent disks (RAID) systems, tape drives, data backup storage systems, and the like.
The processor 416 executes various functional applications and data processing by running programs stored in the storage 428, for example, implementing the interest topic generation method provided in any embodiment of the invention, which may include:
clustering each tag word included in the target vertical class to obtain at least two tag clusters;
and generating the interest topic of the target vertical class according to the tag word information included in the at least two tag clusters.
Of course, those skilled in the art will appreciate that the processor may also implement the technical solution of the interest topic generation method provided in any embodiment of the present invention.
An embodiment of the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the interest topic generation method provided in any embodiment of the present invention. The method may include:
clustering each tag word included in the target vertical class to obtain at least two tag clusters;
and generating the interest topic of the target vertical class according to the tag word information included in the at least two tag clusters.
Of course, the computer program stored on the computer-readable storage medium provided in the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the interest topic generation method provided in any embodiment of the present invention.
The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or terminal. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (12)

1. A method for generating a topic of interest, comprising:
clustering each tag word included in a target vertical class to obtain at least two tag clusters;
generating an interest topic of the target vertical class according to the tag word information included in the at least two tag clusters, which comprises:
determining the weight of each tag cluster according to the weight of each tag word in the tag cluster;
screening the at least two tag clusters according to the weight of each tag cluster;
generating the interest topic of the target vertical class according to the remaining tag clusters, which comprises:
taking the cluster center tag word of each remaining tag cluster as the interest topic of that tag cluster;
wherein the cluster center tag word of each tag cluster is represented in vector form.
2. The method of claim 1, further comprising, after generating the interest topic of the target vertical class according to the remaining tag clusters:
taking the tag words in each remaining tag cluster as the tag words of the interest topic associated with that tag cluster.
3. The method of claim 1, further comprising, prior to determining the weight of each tag cluster according to the weight of each tag word in the tag cluster:
determining the similarity between each tag word in each tag cluster and the center of that tag cluster;
and screening the tag words in the tag cluster according to the similarity.
4. The method of claim 1, further comprising, prior to determining the weight of each tag cluster according to the weight of each tag word in the tag cluster:
determining the weight of each tag word according to the frequency with which the tag word occurs in the articles of the target vertical class.
5. The method of any of claims 1-4, wherein clustering each tag word included in the target vertical class comprises:
determining a tag feature representation for each tag word included in the target vertical class;
clustering the tag feature representations.
6. An interest topic generation device, comprising:
the clustering module is used for clustering each tag word included in the target vertical class to obtain at least two tag clusters;
the generating module is used for generating the interest subject of the target vertical class according to the tag word information included in the at least two tag clusters;
wherein, the generating module includes:
the tag cluster weight determining unit is used for determining the weight of each tag cluster according to the weight of each tag word in the tag cluster;
the tag cluster screening unit is used for screening the at least two tag clusters according to the weight of each tag cluster;
the interest topic generation unit is used for generating the interest topic of the target vertical class according to the remaining tag clusters;
the interest topic generation unit is specifically configured to: take the cluster center tag word of each remaining tag cluster as the interest topic of that tag cluster;
wherein the cluster center tag word of each tag cluster is represented in vector form.
7. The apparatus of claim 6, wherein the apparatus further comprises:
and the tag word determining module is used for taking the tag words in each remaining tag cluster as the tag words of the interest topic associated with that tag cluster.
8. The apparatus of claim 6, wherein the generating module further comprises:
the tag word similarity determining unit is used for determining the similarity between each tag word in each tag cluster and the center of that tag cluster;
and the tag word screening unit is used for screening the tag words in the tag cluster according to the similarity.
9. The apparatus of claim 6, wherein the generating module further comprises:
and the tag word weight determining unit is used for determining the weight of each tag word according to the frequency with which the tag word occurs in the articles of the target vertical class.
10. The apparatus of any one of claims 6-9, wherein the clustering module comprises:
the tag feature determining unit is used for determining a tag feature representation for each tag word included in the target vertical class;
and the tag word clustering unit is used for clustering the tag feature representations.
11. An apparatus, the apparatus comprising:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the interest topic generation method of any of claims 1-5.
12. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the interest topic generation method of any of claims 1-5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910200002.8A CN111694951B (en) 2019-03-15 2019-03-15 Interest topic generation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111694951A CN111694951A (en) 2020-09-22
CN111694951B true CN111694951B (en) 2023-08-01

Family

ID=72475569

Country Status (1)

Country Link
CN (1) CN111694951B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926234A (en) * 2022-05-10 2022-08-19 南京数睿数据科技有限公司 Article information pushing method and device, electronic equipment and computer readable medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101430708A (en) * 2008-11-21 2009-05-13 哈尔滨工业大学深圳研究生院 Blog hierarchy classification tree construction method based on label clustering
CN105825396A (en) * 2016-03-11 2016-08-03 合网络技术(北京)有限公司 Co-occurrence-based advertisement label clustering method and system
CN106126669A (en) * 2016-06-28 2016-11-16 北京邮电大学 User collaborative based on label filters content recommendation method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
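A clustering algorithm for search engine query results based on maximal frequent itemsets; Su Chong (苏冲) et al.; Journal of Chinese Information Processing (《中文信息学报》), No. 02; full text *
Application of tag clustering in the classification of information resources on government portal websites; Deng Yuan (邓媛) et al.; Information Studies: Theory & Application (《情报理论与实践》), No. 04; full text *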

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant