CN111667023B - Method and device for acquiring articles of target category - Google Patents

Publication number
CN111667023B
Authority
CN
China
Prior art keywords
articles
article
target
sample
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010612869.7A
Other languages
Chinese (zh)
Other versions
CN111667023A (en)
Inventor
康战辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010612869.7A priority Critical patent/CN111667023B/en
Publication of CN111667023A publication Critical patent/CN111667023A/en
Application granted granted Critical
Publication of CN111667023B publication Critical patent/CN111667023B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and a device for acquiring articles of a target category, and belongs to the technical field of article searching. The method comprises: obtaining m articles in a target time period; dividing the m articles into k candidate article sets according to the titles of the m articles, wherein the articles in the same candidate article set are related to the same news event; among the candidate article sets corresponding to the news events, determining each candidate article set whose number of articles is not less than a threshold as a target article set; screening out, according to the content of the articles in the target article set, the articles belonging to the target category as the articles to be published of the target article set; and publishing the articles to be published in a target application program. Because people tend to be interested in articles derived from popular news events, publishing such articles in the target application program can greatly improve the activity of the target application program.

Description

Method and device for acquiring articles of target category
Technical Field
The present disclosure relates to the field of article searching technologies, and in particular, to a method and an apparatus for obtaining articles of a target category.
Background
With the rapid development of Internet and terminal technology, a wide variety of applications can be installed on a terminal; medical applications, for example, are relatively common.
Many medical articles are published in a medical application to popularize medical knowledge among users. However, users usually think of the medical application only when they are ill or feel unwell, so the activity of the medical application is low.
Disclosure of Invention
The embodiment of the application provides a method and a device for acquiring articles of a target category, which can overcome the problem of low activity of medical application programs. The technical solution is as follows:
in one aspect, a method of obtaining articles of a target category is provided, the method comprising:
obtaining m articles in a target time period, wherein m is a positive integer;
dividing the m articles into k candidate article sets according to the titles of the m articles, wherein the articles in the same candidate article set are related to the same news event, and k is a positive integer and is smaller than or equal to m;
according to the number of articles in the candidate article sets corresponding to the news events, determining the candidate article set with the number of articles not less than a threshold value as a target article set in the candidate article sets corresponding to the news events;
Screening articles belonging to the target category according to the content of each article in the target article set and the classification model after training, and taking the articles as articles to be distributed in the target article set;
and publishing the articles to be published in the target article set in a target application program.
In another aspect, there is provided an apparatus for acquiring articles of a target category, the apparatus comprising:
the acquisition module is used for acquiring m articles in a target time period, wherein m is a positive integer;
the clustering module is used for dividing the m articles into k candidate article sets according to the titles of the m articles, wherein the articles in the same candidate article set are related to the same news event, and k is a positive integer and is smaller than or equal to m;
the first screening module is used for determining the candidate article sets with the article number not less than a threshold value as a target article set in the candidate article sets corresponding to the news events according to the article number in the candidate article sets corresponding to the news events;
the second screening module is used for screening out articles belonging to the target category according to the content of each article in the target article set and the classification model which is trained, and taking the articles as articles to be distributed in the target article set;
And the publishing module is used for publishing the articles to be published in the target article set in the target application program.
In another aspect, a computer device is provided, the computer device including a processor and a memory, the memory storing at least one instruction, the at least one instruction being loaded and executed by the processor to implement the method of obtaining articles of a target category described above.
In another aspect, a computer readable storage medium having stored therein at least one instruction loaded and executed by a processor to implement the method of obtaining articles of a target category described above is provided.
In the embodiment of the application, to improve the activity of the target application, articles derived from popular news events may be published in the target application in a timely manner; for example, medical articles derived from popular news events may be published promptly in a medical application. Because people often pay attention to popular news events, they are also likely to be interested in articles derived from those events. Publishing articles of the target category derived from popular news events in the target application can therefore greatly improve the activity of the target application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for obtaining articles of a target category according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for obtaining articles of a target category according to an embodiment of the present application;
FIG. 3 is a schematic view of a scenario of a method for obtaining articles of a target category according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an apparatus for acquiring articles of a target category according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for acquiring articles of a target category according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an apparatus for acquiring articles of a target category according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an apparatus for acquiring articles of a target category according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a device for acquiring articles of a target category according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The embodiment of the application provides a method for acquiring articles of a target category. The method may be executed by a server, for example a background server of a target application program associated with the target category, or by a terminal device on which the target application program is installed. The execution subject of the method is not limited in this embodiment; the following description takes execution by the server as an example.
Wherein the target application is associated with a target category, for example, articles of the target category may be published in the target application. For example, only articles of the target category may be published in the target application. For another example, the target application may issue articles of a target category or articles of a non-target category.
The target category may be a medical category, in which case the target application may be a medical application; the target category may be a health category, in which case the target application may be a health application; the target category may be a fitness category, in which case the target application may be a fitness application; and so on. The target category and the target application are not limited in this embodiment. The following description takes the medical category as the target category and a medical application as the target application by way of example; other cases are similar and are not described in detail.
As shown in fig. 1, a flowchart of a method for acquiring an article of a target category is shown:
in step 101, m articles within a target time period are acquired, where m is a positive integer.
The target time period may be the period from the start of the current day to the current time point. For example, if the current time point is 9:00 on April 1, 2020, the target time period may be the period from 00:00 to 9:00 on April 1, 2020.
Alternatively, the target time period may be a period from a time point when m articles were last acquired to a current time point. The target time period is not limited in this embodiment, and a technician can flexibly select according to actual situations.
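The first choice of target time period described above (start of the current day up to the current time point) can be sketched as follows. This is only an illustrative sketch: the function names `day_window` and `in_window` are hypothetical and do not appear in the application.

```python
from datetime import datetime, time
from typing import Tuple

Window = Tuple[datetime, datetime]


def day_window(now: datetime) -> Window:
    """Return the period from 00:00 of the current day up to `now`."""
    start = datetime.combine(now.date(), time.min)
    return start, now


def in_window(published_at: datetime, window: Window) -> bool:
    """Check whether an article's publication time falls inside the window."""
    start, end = window
    return start <= published_at <= end


# Example from the text: current time point is 9:00 on April 1, 2020.
window = day_window(datetime(2020, 4, 1, 9, 0))
```

An article published at 8:30 on April 1, 2020 would fall inside this window, while one published late on March 31 would not.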
In one example, the server may periodically obtain m articles within the target time period, e.g., each time a preset period is reached, obtain m articles published within the target time period. In this embodiment, the specific value of m is not limited, and the larger m is, the more articles of the target category are obtained later.
The m articles may be articles prompted by news events and may be obtained, for example, from news websites, news applications, social applications, and the like.
In step 102, m articles are divided into k candidate article sets according to the titles of the m articles, and the articles in the same candidate article set are related to the same news event, wherein k is a positive integer and is less than or equal to m.
The articles in the same candidate article set are related to the same news event; that is, candidate article sets correspond one-to-one to news events.
In one example, the server may classify m articles according to the headline of each article, and classify the articles related to the same news event into one class, and may obtain k candidate article sets, where the articles in the same candidate article set are related to the same news event.
For example, one candidate article set (which may be referred to as candidate article set 1) contains article 1, titled "sudden three death, talking about sudden death pathogenesis and how to prevent sudden death from the perspective of traditional Chinese medicine"; article 2, titled "sudden three death by rescue video exposure"; article 3, titled "sudden three death"; article 4, titled "sudden three death just after the onset of film and television cold winter"; and article 5, titled "sudden three death causes heart health continuous discussion, and the bias for treating heart diseases is rumble". These articles all relate to the same sudden death, which constitutes one news event, so they are placed in the same candidate article set.
Thus, after the server obtains m articles, the articles belonging to the same news event can be divided together according to the title of each article, and k candidate article sets can be obtained.
Where the k value may be related to m, e.g., the larger m, the larger k. Alternatively, the k value is a value set by a technician, and for example, k may be 20. In this embodiment, the value manner of k is not limited, where the larger the value of k is, the finer the candidate article set is divided, and the technician can flexibly set according to the actual situation.
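As an illustration of dividing articles into candidate sets by title, the sketch below groups titles greedily by word overlap. It is an assumption-laden stand-in, not the method of the application: this passage does not say which clustering technique is used, and the names `group_by_title` and `min_sim`, the Jaccard measure, and the whitespace tokenization are all hypothetical choices (Chinese titles would need a word segmenter). Here k is not fixed in advance; it emerges from the similarity threshold.

```python
from typing import List, Set


def tokens(title: str) -> Set[str]:
    """Split a title into a set of lowercase words (whitespace tokenization)."""
    return set(title.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_title(titles: List[str], min_sim: float = 0.3) -> List[List[int]]:
    """Greedily assign each title to the first group whose representative
    (the group's first title) is similar enough; otherwise open a new group.
    Returns groups of indices into `titles`."""
    groups: List[List[int]] = []
    representatives: List[Set[str]] = []
    for idx, title in enumerate(titles):
        tok = tokens(title)
        for g, rep in enumerate(representatives):
            if jaccard(tok, rep) >= min_sim:
                groups[g].append(idx)
                break
        else:
            groups.append([idx])
            representatives.append(tok)
    return groups


titles = [
    "actor dies suddenly on set",
    "rescue video of actor who dies suddenly on set",
    "city marathon route announced",
]
candidate_sets = group_by_title(titles)  # two groups: the first two titles, then the third
```

Raising `min_sim` splits the articles into more, finer-grained candidate sets, which parallels the remark above that a larger k gives a finer division.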
In step 103, according to the number of articles in the candidate article sets corresponding to each news event, the candidate article set with the number of articles not less than the threshold value is determined as the target article set in the candidate article sets corresponding to each news event.
For example, according to the number of articles in each candidate article set, n article sets whose article counts meet a preset requirement are screened out, where n is a positive integer less than or equal to k. Illustratively, the n candidate article sets whose article counts are not less than a number threshold are determined as n target article sets. Because a target article set contains many articles, the corresponding news event has been widely reported, so the news event corresponding to a target article set can be called a popular news event.
The n article sets are the target article sets, the n target article sets correspond to n popular news events, and the target article sets correspond to the popular news events one by one.
In one example, the server may select n candidate article sets with a relatively large number of articles from the k candidate article sets, as the article set used later, that is, the target article set. The n target article sets have a relatively large number of articles, which indicates that news events corresponding to the target article sets all belong to popular news events. Thus, the server can select articles corresponding to the hot news events.
The server may screen out the n target article sets according to the number of articles in the candidate article sets as follows: according to the number of articles in each candidate article set, the article sets whose article counts are not less than a number threshold are screened out; for example, the article sets containing not less than 10 articles are selected as the n target article sets.
The number threshold may be a value set by a technician, or may be a value related to n; for example, the larger the number threshold, the smaller n, and the smaller the number threshold, the larger n. The technician can set the number threshold flexibly.
Alternatively, after determining the number of articles in each candidate article set, the server may rank the candidate article sets by number of articles, so that the more articles a candidate article set contains, the higher it is ranked, and then select the first n candidate article sets as the article sets used subsequently, that is, the target article sets.
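The two screening principles for step 103 (a number threshold, or the top n sets by article count) can be sketched as below. The function names and the dictionary layout (news event mapped to a list of article identifiers) are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

ArticleSets = Dict[str, List[str]]  # hypothetical layout: news event -> article ids


def by_threshold(sets: ArticleSets, threshold: int) -> ArticleSets:
    """Keep candidate sets whose article count is not less than the threshold."""
    return {event: ids for event, ids in sets.items() if len(ids) >= threshold}


def top_n(sets: ArticleSets, n: int) -> ArticleSets:
    """Rank candidate sets by article count (largest first) and keep the first n."""
    ranked = sorted(sets.items(), key=lambda item: len(item[1]), reverse=True)
    return dict(ranked[:n])


candidate_sets = {
    "event_a": ["a1", "a2", "a3"],
    "event_b": ["b1"],
    "event_c": ["c1", "c2"],
}
target_by_threshold = by_threshold(candidate_sets, threshold=2)  # event_a, event_c
target_by_rank = top_n(candidate_sets, n=1)                      # event_a only
```

With the threshold variant, n is a consequence of the threshold; with the ranking variant, n is chosen directly.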
In step 104, articles belonging to the target category are screened out as articles to be distributed in the target article set according to the content of each article in the target article set and the trained classification model.
Where the content of an article is the entire content of the article, including the title and the body.
In one example, although the articles in each target article set are related to the same popular news event, not all articles in the same target article set belong to the target category, so the server can further screen out the articles belonging to the target category through the content of each article in the target article set.
For example, in the target article set corresponding to the sudden-death news event described above, some articles report the sudden death itself as it happened, some are medical articles derived from the sudden death, and some discuss the prospects of the industry in which the deceased worked, such as the film and television industry. If the target category is the medical category, the server may screen out the medical articles derived from the sudden death according to the content of each article. If the target category is the film and television category, the server may screen out the film and television articles derived from the sudden death according to the content of each article.
The server may use a completely trained classification model to screen articles belonging to the target category, and the training process of the classification model will be described below.
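The screening in step 104 can be pictured as applying a classifier to the full content (title plus body) of each article in a target article set. In the sketch below, `keyword_stub` is only a hypothetical stand-in so that the example runs; it is not a trained classification model, and the article layout and function names are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Article = Dict[str, str]  # hypothetical layout: {"title": ..., "body": ...}


def full_content(article: Article) -> str:
    """The content of an article is its title together with its body."""
    return article["title"] + "\n" + article["body"]


def filter_target(articles: List[Article],
                  predict: Callable[[str], int]) -> List[Article]:
    """Keep the articles that the classifier labels 1 (target category)."""
    return [a for a in articles if predict(full_content(a)) == 1]


def keyword_stub(text: str) -> int:
    """Stand-in for a trained model: 1 if the text mentions heart topics."""
    lowered = text.lower()
    return 1 if ("heart" in lowered or "cardiac" in lowered) else 0


target_set = [
    {"title": "Actor dies suddenly", "body": "Report from the scene."},
    {"title": "How to protect your heart",
     "body": "Doctors explain warning signs of cardiac arrest."},
    {"title": "Film industry outlook", "body": "Studios face a cold winter."},
]
to_publish = filter_target(target_set, keyword_stub)
```

In practice `predict` would be replaced by the trained classification model; only the filtering structure is meant to carry over.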
In step 105, articles to be published in the target article set are published in the target application.
In one example, for each target article set, after the server filters out articles belonging to the target category in the target article set, the articles belonging to the target category may be published in the application.
Based on the above, the server may first obtain a large number of articles, then classify the articles according to their titles, and divide the articles of the same news event into a group, so as to obtain a plurality of candidate article sets. And then, selecting a candidate article set with a large number of articles according to the number of articles in each candidate article set, and taking the candidate article set as a target article set corresponding to the hot news event in the target time period, wherein one target article set corresponds to one hot news event.
Since the target article sets are all aimed at the same popular news event, but there are articles reporting the news event itself and articles derived based on the news event, the articles may not belong to the same category. Then, the server screens out articles belonging to the target category from the target article sets corresponding to the hot news events, for example, the server may screen out articles belonging to the medical category from the target article sets corresponding to the hot news events. The server may then post these articles of the target category, derived from the trending news event, as articles to be posted in the target application. For example, the server may publish these articles of medical categories derived from trending news events in a medical related application.
People often pay attention to news events, especially popular ones, and tend to be interested in articles derived from them. Publishing articles of the target category derived from popular news events in the target application can therefore greatly improve the activity of the target application.
For example, articles of medical category derived from trending news events may be published in medical related applications, and then the liveness of medical related applications may be enhanced by the trending news events.
To improve both the activity and the professionalism of the target application, so that it becomes an information source people trust, highly professional, high-quality articles can be selected from the articles of the target category and published in the target application.
The professionalism of an article can be evaluated from its publishing account information and its reading information. The publishing account information may be the author of the article and the content account under which it is published, where the content account may be an account in a news application or an official (public) account in some application. The reading information may be the number of reads, the number of comments, the number of likes, and so on, which can reflect the value and professionalism of the article.
For example, the server may determine, according to the publishing account information and the reading information of the articles, a heat score of each article to be published in the target article set; then, in the target article set, determining the p articles to be distributed which are arranged in front according to the heat scores of the articles to be distributed, wherein p is a positive integer; and then, the articles to be distributed, which are ranked in the first p in the target article set, are distributed in the target application program.
The ranking may be performed among the articles in the target article set that belong to the target category, that is, within the target article set after articles not belonging to the target category have been removed.
The higher the heat score, the higher the heat of the article, and the higher the professionality and authority of the article in the target category can be reflected indirectly, so that the higher the use value of the article is. For example, the popularity score may be determined by a weighted sum between the posting account information and the reading information of the article. For another example, the heat score may also be determined by the following formula:
$$S(i) = \log(\text{number of reads}) \times \log(\text{number of likes}) \times \log(\text{number of comments}) \times \log(\text{number of followers})$$
where S(i) denotes the heat score of the i-th article belonging to the target category in the target article set, and the number of followers denotes the number of users following the content account that published the article, or the number of users following the news application in which the article appears; the number of followers may also be referred to as the number of fans.
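A minimal sketch of the product-of-logarithms heat score and the top-p selection follows. The formula does not state the logarithm base, so base 10 is assumed here; the field names and the functions `heat_score` and `top_p` are hypothetical. Note that any count equal to 1 makes its logarithm zero and therefore the whole product zero, and a count of 0 has no logarithm, so the sketch rejects non-positive counts.

```python
import math
from typing import Dict, List


def heat_score(reads: int, likes: int, comments: int, followers: int) -> float:
    """Product of the base-10 logarithms of the four counts (base is an assumption)."""
    score = 1.0
    for count in (reads, likes, comments, followers):
        if count <= 0:
            raise ValueError("counts must be positive for the logarithm to be defined")
        score *= math.log10(count)
    return score


def top_p(articles: List[Dict[str, int]], p: int) -> List[Dict[str, int]]:
    """Rank articles by heat score (highest first) and keep the first p."""
    ranked = sorted(
        articles,
        key=lambda a: heat_score(a["reads"], a["likes"], a["comments"], a["followers"]),
        reverse=True,
    )
    return ranked[:p]


articles = [
    {"id": 1, "reads": 10, "likes": 10, "comments": 10, "followers": 10},
    {"id": 2, "reads": 1000, "likes": 100, "comments": 10, "followers": 10000},
    {"id": 3, "reads": 100, "likes": 100, "comments": 100, "followers": 100},
]
best_two = top_p(articles, p=2)  # article 2 (3 * 2 * 1 * 4 = 24), then article 3 (2^4 = 16)
```

A weighted sum of the same quantities, the other option mentioned above, would avoid the zeroing effect of a single small count.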
Based on the above, after the server obtains a large number of articles in the target time period, the articles may be classified according to the titles of the articles, and the articles reporting the same news event are classified into one type, so as to obtain a plurality of candidate article sets, where each candidate article set corresponds to one news event. Because not all news events belong to the popular news events, the server can select some target article sets with a large number of articles according to the number of articles in each candidate article set, and the target article sets are in one-to-one correspondence with the popular news events.
Moreover, in each target article set that has been screened out, some articles report the news event itself and some are derived from the news event. The server may screen out the articles belonging to the target category from each target article set.
The server can screen out high-quality articles with strong specialization and high reliability from the articles which are derived by the popular news events and belong to the target category, and can issue the articles with high specialization and high reliability which are derived by the popular news events and belong to the target category in the target application program as articles to be issued.
It can be seen that the method for obtaining the articles of the target category for the target application program can improve the activity, the specialty and the credibility of the target application program, and once the specialty and the credibility of the target application program are established, the activity of the target application program can be further enhanced.
In one example, a classification model trained by machine learning or deep learning may be used to screen, in each target article set, the articles belonging to the target category. Correspondingly, the server may, according to the content of each article in the target article set and the trained classification model, screen out the articles belonging to the target category as the articles to be published of the target article set.
The classification model may be trained by deep learning or by machine learning, which is not limited in this embodiment; training by a conventional machine learning method is taken as an example. The training process may follow the flow shown in FIG. 2:
in step 201, a training sample is obtained, the training sample comprising a sample article and a sample tag.
The training samples may be all positive samples, or may be a set of positive samples and negative samples, which is not limited in this embodiment, and the training samples may be taken as a set of positive samples and negative samples for example.
In one example, the sample articles may include articles of a target category and articles of a non-target category, and accordingly, a sample tag corresponding to an article of a target category in the training sample is a tag of a target category, and a sample tag corresponding to an article of a non-target category in the training sample is a tag of a non-target category. Articles belonging to the target category derived from news events may be included in the articles of the target category. The label corresponding to the article of the target category may be marked as 1, and the label corresponding to the article of the non-target category may be marked as 0.
For example, if the target category is the medical category, the training samples may include articles of the medical category, each labeled 1, and articles of non-medical categories, each labeled 0. The articles of the medical category may also include medical articles derived from news events.
In step 202, important terms are extracted from the sample articles of the training sample, so as to obtain the important terms of the sample articles.
Wherein, the important words can be keywords, and also can comprise keywords and terms in target categories, for example, the target category is a medical category, and the terms in the medical category can be words related to diseases, treatments and the like.
In one example, the key words may be extracted by the TF-IDF algorithm. Correspondingly, the content of a sample article of the training sample may be segmented to obtain a plurality of words of the sample article; then, among the plurality of words, the q words with the largest products of TF and IDF in the sample article are determined as key words, where q is a positive integer.
TF is short for Term Frequency, that is, the frequency with which a word appears in the current article.
IDF is short for Inverse Document Frequency. The IDF of a word is obtained by dividing the total number of documents by the number of documents containing the word and taking the base-10 logarithm of the quotient. The IDF is calculated as
$$\mathrm{IDF}_i = \log_{10} \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $\mathrm{IDF}_i$ denotes the IDF value of the i-th word $t_i$, and $|D|$ denotes the total number of documents in the word library; the larger the word library, the more accurate the calculated IDF value.
For example, the word library may be a web-wide medical encyclopedia database, in which case the total number of documents in the word library is the total number of articles in that database. For another example, the word library may be an encyclopedia database covering various categories. $|\{j : t_i \in d_j\}|$ denotes the number of articles $d_j$ that contain the i-th word $t_i$.
For example, if the target category is the medical category, the web-wide medical encyclopedia database contains 10,000,000 articles in total, and the word $t_i$ appears in 1,000 of them, then the IDF of $t_i$ is
$$\mathrm{IDF}_i = \log_{10} \frac{10{,}000{,}000}{1{,}000} = 4$$
The principle of selecting the key words through the product of TF-IDF is that if a certain word appears in a target article with high frequency, but appears in other articles except the target article with low frequency, the word can be used to represent the target article. Conversely, if a term appears very frequently in a target article, but also in articles other than the target article, the term may be a commonly used term in the target category to which the target article belongs.
In this way, after the server calculates the products of TF and IDF of the words in the sample article, the products may be ranked, and the words ranked in the top q are selected as the key words of the sample article. The q may be a value preset by a technician, or may be a value determined according to the number of words in the sample article.
Or after calculating the products of TF and IDF of the words in the sample article, the server can select the words corresponding to the products larger than the preset threshold value as the key words of the sample article. The specific manner of selecting the key words is not limited in this embodiment, and the words representing the sample article can be selected.
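The two selection rules described above (top q, or a preset threshold) can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the function name `top_q_keywords`, the toy corpus and the threshold value are hypothetical, and TF is assumed to be the word count divided by the article length.

```python
import math
from collections import Counter

def top_q_keywords(article_words, corpus, q):
    """Return the q words of one article with the largest TF * IDF, plus all scores."""
    counts = Counter(article_words)
    total = len(article_words)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / total                              # assumed: count normalised by article length
        df = sum(1 for doc in corpus if word in doc)    # number of documents containing the word
        idf = math.log10(n_docs / df) if df else 0.0    # base-10 logarithm of the quotient
        scores[word] = tf * idf
    # Sort by descending score; break ties alphabetically so the result is deterministic.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:q], scores

# Hypothetical word library: each document is the set of words it contains.
corpus = [
    {"fever", "treatment", "patient"},
    {"patient", "hospital"},
    {"patient", "vaccine", "trial"},
    {"patient", "weather"},
]
article = ["vaccine", "vaccine", "patient", "trial"]

keywords, scores = top_q_keywords(article, corpus, q=2)
# Alternative rule: keep every word whose product exceeds a preset threshold.
above_threshold = [w for w in scores if scores[w] > 0.2]
```

In this toy data "patient" occurs in every document, so its IDF is 0 and it is never selected, which matches the principle that words common across the whole library do not characterize a single article.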
In one example, the important terms in the sample article include not only key terms, but also terms in the term library of the target category, where the terms are commonly used or custom terms that are recorded in the term library of the target category.
The server may obtain terms belonging to the target category in the sample article by comparing each term in the sample article with terms in the term library of the target category, finding out terms consistent with or close to the terms in the term library, and using the terms as terms belonging to the target category in the sample article.
After the server obtains the key words and the terms used to represent the sample article, some key words and terms may duplicate each other, in which case only one copy of each is retained.
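A minimal sketch of the term lookup and the merging step, under stated assumptions: "close to" a library term is interpreted here as a string-similarity match using the standard-library `difflib.get_close_matches`, and the cutoff of 0.85, the function names and the sample data are all hypothetical.

```python
from difflib import get_close_matches

def category_terms(article_words, term_library, cutoff=0.85):
    """Keep the article words that equal, or closely resemble, a term in the library."""
    found = []
    for word in article_words:
        if word in term_library or get_close_matches(word, term_library, n=1, cutoff=cutoff):
            found.append(word)
    return found

def important_words(keywords, terms):
    """Merge key words and terms, keeping one copy of each in first-seen order."""
    return list(dict.fromkeys(keywords + terms))

# Hypothetical term library of the medical category and segmented article words.
term_library = ["pneumonia", "vaccine", "chemotherapy"]
words = ["vaccine", "pneumonias", "weather", "trial"]

terms = category_terms(words, term_library)
merged = important_words(["vaccine", "trial"], terms)
```

Here "vaccine" is both a key word and a term, so it appears once in the merged list.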
In step 203, the important terms of the sample article are input to the classification model to be trained, and the test label of the sample article is obtained.
In one example, after the server sorts the important terms, the important terms may be input into a classification model to be trained, resulting in test tags for the sample articles. For example, the server may first search a word encoding table stored in advance for an encoding number corresponding to each important word, and then input the encoding numbers of the important words to a classification model to be trained, so as to obtain a test tag of the sample article.
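The lookup in the word encoding table can be sketched as a dictionary lookup. The table contents, the function name `encode` and the choice of 0 as the number for words missing from the table are assumptions made for illustration.

```python
def encode(important_words, word_table, unknown_id=0):
    """Map each important word to its encoding number; unknown words map to unknown_id."""
    return [word_table.get(word, unknown_id) for word in important_words]

# Hypothetical pre-stored word encoding table.
word_table = {"vaccine": 1, "trial": 2, "pneumonia": 3}

codes = encode(["vaccine", "pneumonia", "fever"], word_table)
```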
In step 204, training the classification model to be trained according to the test label and the sample label of the sample article to obtain the classification model.
After obtaining the test tag of the sample article, the server can obtain a loss value corresponding to the sample article according to the test tag and the sample tag of the sample article, the loss function and the like. And then, carrying out error back propagation training on the classification model to be trained by using the loss value to obtain the classification model.
Thus, after the training of the classification model, the server may first perform word segmentation on each article in the target article set when in use. And extracting important words from the article to obtain the important words of the article. Then, the coding numbers corresponding to the important words are searched from the pre-stored word coding table, then, the coding numbers of the important words are input into a classification model, the classification model can output the labels corresponding to the articles, for example, 0 or 1 can be output, if 1 is output, the articles belong to the target category, and if 0 is output, the articles do not belong to the target category. The classification model can screen articles belonging to the target category from the target article set.
The classification model may be a classifier that performs binary classification based on an SVM (Support Vector Machine), whose decision boundary is the maximum-margin hyperplane solved for the training samples; details can be found in the relevant literature and are not repeated here.
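A minimal sketch of training and using such a binary classifier, under several assumptions that are not stated in the text: each article is represented as a multi-hot vector over the encoding numbers, the loss is the L2-regularized hinge loss commonly associated with linear SVMs, training uses plain subgradient descent, and the bias term is omitted for brevity. All names and hyperparameters are hypothetical.

```python
def to_vector(codes, vocab_size):
    """Multi-hot vector: 1.0 at every encoding number present in the article."""
    vec = [0.0] * vocab_size
    for code in codes:
        vec[code] = 1.0
    return vec

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_linear_svm(samples, labels, vocab_size, epochs=50, lr=0.1, reg=0.01):
    """Subgradient descent on (reg / 2) * ||w||^2 + max(0, 1 - y * w.x)."""
    w = [0.0] * vocab_size
    for _ in range(epochs):
        for codes, label in zip(samples, labels):
            x = to_vector(codes, vocab_size)
            y = 1.0 if label == 1 else -1.0       # map sample labels 1/0 to +1/-1
            violated = y * score(w, x) < 1.0      # hinge loss is non-zero for this sample
            for i in range(vocab_size):
                grad = reg * w[i] - (y * x[i] if violated else 0.0)
                w[i] -= lr * grad
    return w

def predict(codes, w):
    """Output 1 (target category) or 0 (non-target category)."""
    return 1 if score(w, to_vector(codes, len(w))) > 0 else 0

# Hypothetical training data: encoding numbers 1-3 are medical words, 4-5 are not.
samples = [[1, 2], [2, 3], [1, 3], [4, 5], [4], [5]]
labels = [1, 1, 1, 0, 0, 0]
w = train_linear_svm(samples, labels, vocab_size=6)
```

At use time, `predict` plays the role of the trained classification model: an article whose encoded important words yield 1 is kept, and one that yields 0 is screened out.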
In one example, there are multiple ways for the server to divide the m articles into k candidate article sets and to determine the trending news events from the number of articles in each candidate article set. For example, a direct clustering algorithm may be employed, such as the k-means clustering algorithm, a hierarchical clustering algorithm, or DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which is a density-based clustering algorithm. An indirect approach may also be employed, for example one based on the search volume of certain terms. Illustratively, if the search volume for "Zhang Sano" surges, the top-ranked articles obtained when searching for "Zhang Sano" may be regarded as relating to a trending news event; for example, if "Zhang Sano" ranks first by search volume, "Zhang Sano" may be regarded as a trending news event of the current stage.
In this embodiment, the manner of determining the trending news event is not limited, and articles related to the trending news event may be obtained.
The specific algorithm for clustering m articles in the embodiment is not limited, and may be exemplified by a k-means clustering algorithm. For example, the server may partition m articles into k candidate article sets by employing a k-means clustering algorithm based on the title of each article. Wherein the k-means clustering algorithm is also called the k-means algorithm.
In one example, the server may first perform word segmentation on the titles of the m articles, so that each title is split into a plurality of words. Then, the words of each title are input into the trained feature extraction model to obtain a feature vector for each word, and the feature vectors of the words are added together to obtain the feature vector of the title. The server can then cluster the feature vectors of the m titles using the k-means clustering algorithm to obtain k candidate article sets.
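The summation of word vectors into a title vector can be sketched as follows. The dictionary `word_vectors` is a hypothetical stand-in for the trained feature extraction model, three dimensions are used instead of 100 to keep the example readable, and skipping words without a vector is an assumption.

```python
def title_vector(words, word_vectors, dim):
    """Add the feature vectors of a title's words element by element."""
    total = [0.0] * dim
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:
            continue  # assumed behaviour: ignore words the model has no vector for
        total = [t + v for t, v in zip(total, vec)]
    return total

# Hypothetical word vectors standing in for the output of the trained model.
word_vectors = {
    "flu": [1.0, 0.0, 2.0],
    "vaccine": [0.5, 1.0, 0.0],
    "approved": [0.0, 0.5, 0.5],
}

vec = title_vector(["flu", "vaccine", "approved", "today"], word_vectors, dim=3)
```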
The feature vector output by the feature extraction model may be a 100-dimensional vector, and of course, a technician may set the feature vector to be a 50-dimensional vector.
The feature extraction model used may be a Doc2Vec model, also called Paragraph Vector, which was proposed by Tomas Mikolov on the basis of the word2vec model.
In one example, the process of training and saving the Doc2Vec model may be implemented by the following code, which is only one example and not limiting of the present embodiment:
import gensim
# token_path: text file with one document per line, words separated by whitespace
sentences = gensim.models.doc2vec.TaggedLineDocument(token_path)
# vector_size was named size in gensim releases before 4.0
model = gensim.models.Doc2Vec(sentences, vector_size=100, window=2, min_count=3)
model.train(sentences, total_examples=model.corpus_count, epochs=1000)
model.save('../model/doc2vec.pkl')
In the above code, gensim is the name of the software package that performs the above process; vector_size is the dimension of the feature vector, for example, 100 dimensions; window is the window size, i.e., the maximum distance between the current word and a predicted word, for example, two words; min_count is the minimum word frequency, for example, words that appear fewer than 3 times are ignored; epochs is the number of passes over the training data, where one pass sends all the data through the network for one forward computation and one back propagation; for example, training ends after 1000 passes.
In one example, the server may use the k-means algorithm on the titles of m articles to obtain k candidate article sets as follows:
After obtaining m articles and performing word segmentation and feature vector extraction on titles of the m articles, the server obtains m feature vectors, and then clustering is started.
The first step: from the m feature vectors, randomly select k feature vectors as the cluster centroid points of the k-means algorithm, denoted $\mu_1, \mu_2, \ldots, \mu_k$.
The second step: for each feature vector $x^{(i)}$, where i takes any value from 1 to m, calculate the class to which it should belong; that is, calculate the proximity of $x^{(i)}$ to each cluster centroid point and select the class of the closest cluster centroid point as the class to which $x^{(i)}$ belongs.
The class of each $x^{(i)}$ can be calculated according to the following formula:
$c^{(i)} = \arg\min_{j} \lVert x^{(i)} - \mu_j \rVert^{2}$
where $c^{(i)}$ denotes the class, among classes 1 to k, that is closest to $x^{(i)}$, and takes a value from 1 to k.
The third step: for each class j, recalculate the centroid of the class to obtain a new cluster centroid point. The formula for calculating the cluster centroid point can be as follows:
$\mu_j = \dfrac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$
The second and third steps are repeated until convergence, i.e., until the recalculated cluster centroid points coincide with, or are close to, the previously calculated cluster centroid points.
In one example, the clustering process using the k-means algorithm may be implemented by the following code, which is only an example and not limiting on the present embodiment:
In the above code, n_clusters represents the number of cluster centroid points, for example, 20; i is the index of any one of the feature vectors 1 to m.
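The three steps of the k-means procedure can also be sketched in plain Python without a clustering library. This is an illustrative sketch only: the function names, the squared Euclidean distance as the proximity measure, the tolerance, the fixed random seed and the sample vectors are assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(vectors, k, max_iter=100, tol=1e-12, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly pick k of the feature vectors as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = []
    for _ in range(max_iter):
        # Step 2: assign every vector to the class of its closest centroid.
        assignment = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                      for v in vectors]
        # Step 3: recompute each centroid as the mean of the vectors assigned to it.
        new_centroids = []
        for j in range(k):
            members = [v for v, c in zip(vectors, assignment) if c == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty class
        moved = max(sq_dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved < tol:  # centroids coincide with the previous ones: converged
            break
    return assignment, centroids

# Hypothetical 2-dimensional title vectors forming two well-separated groups.
vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
           [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
assignment, centroids = k_means(vectors, k=2)
```

Each distinct value in `assignment` corresponds to one candidate article set; the class numbers themselves depend on the random initialization.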
Based on the above, as shown in the application scenario schematic diagram of fig. 3, after obtaining m articles, the server obtains a total set composed of m articles, and then clusters the total set to obtain k candidate article sets, where each candidate article set corresponds to a news event. And determining n target article sets with a large number of articles according to the number of articles contained in each candidate article set, wherein each target article set corresponds to one hot news event. As shown in fig. 3, the candidate article set 1 and the target article set 1 belong to the same set, but some sets are screened out in the process of converting the candidate article set into the target article set, the number of sets is reduced, and the number of articles in the sets is unchanged.
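The step from candidate article sets to target article sets can be sketched as grouping by cluster class and keeping the large groups. The sketch assumes a count threshold is used as the selection rule; the function name, the threshold value and the sample data are hypothetical.

```python
from collections import defaultdict

def target_sets(assignment, articles, threshold):
    """Group articles by cluster class and keep sets with at least `threshold` articles."""
    groups = defaultdict(list)
    for cluster_id, article in zip(assignment, articles):
        groups[cluster_id].append(article)
    # Whole sets are screened out; the articles inside a kept set are unchanged.
    return {cid: arts for cid, arts in groups.items() if len(arts) >= threshold}

# Hypothetical clustering output for six articles.
assignment = [0, 0, 1, 0, 2, 2]
articles = ["a", "b", "c", "d", "e", "f"]

selected = target_sets(assignment, articles, threshold=2)
```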
After n target article sets are obtained, articles which do not belong to the target category in each target article set are screened out through a classification model, and a target article set composed of articles of the target category is obtained. In this conversion, the number of collections is unchanged, while the number of articles in a collection is reduced.
And after each target article set is subjected to category screening, quality screening is performed according to the popularity scores of the rest articles, so that the articles with strong specialities are obtained, namely, the articles which are derived from popular news events, belong to the target categories and have higher popularity scores are obtained.
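The quality screening by popularity score can be sketched as a ranking step. The way the score is computed here is purely an assumption for illustration: a weighted sum of follower count, read count, endorsement count and comment count with equal weights. The field names and sample values are likewise hypothetical.

```python
def heat_score(article, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four signals; the weights are hypothetical."""
    w_follow, w_read, w_like, w_comment = weights
    return (w_follow * article["followers"] + w_read * article["reads"]
            + w_like * article["likes"] + w_comment * article["comments"])

def top_p(articles, p):
    """Return the p articles with the highest scores, best first."""
    return sorted(articles, key=heat_score, reverse=True)[:p]

# Hypothetical articles remaining in one target article set after category screening.
articles = [
    {"title": "A", "followers": 100, "reads": 500, "likes": 20, "comments": 5},
    {"title": "B", "followers": 1000, "reads": 300, "likes": 10, "comments": 2},
    {"title": "C", "followers": 50, "reads": 80, "likes": 3, "comments": 1},
]

best = [a["title"] for a in top_p(articles, p=2)]
```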
It can be seen that the method for obtaining the articles of the target category for the target application program can improve the activity, the specialty and the credibility of the target application program, and once the specialty and the credibility of the target application program are established, the activity of the target application program can be further enhanced.
In addition, compared with manual search query, the process of issuing the professional high-quality articles of the target category derived from the popular news event in the target application program can greatly shorten the acquisition time and achieve timely issuing.
In the embodiment of the application, in order to improve the activity of the target application, some articles derived from the popular news events may be published in the target application in time, for example, some articles of the medical category derived from the popular news events may be published in time in the application of the medical category. Since people often pay attention to the trending news events, the articles derived from the trending news events tend to be interested, and the articles of the target category derived from the trending news events are released on the target application program, so that the liveness of the target application program can be greatly improved.
Based on the same technical concept, the embodiment of the application further provides a device for acquiring articles of a target category, as shown in fig. 4, the device includes:
an obtaining module 410, configured to obtain m articles in a target time period, where m is a positive integer;
the clustering module 420 is configured to divide the m articles into k candidate article sets according to titles of the m articles, where k is a positive integer and is less than or equal to m, and the articles in the same candidate article set are related to the same news event;
the first filtering module 430 is configured to determine, according to the number of articles in the candidate article sets corresponding to each news event, a candidate article set with the number of articles not less than a number threshold value as a target article set in the candidate article sets corresponding to each news event;
the second screening module 440 is configured to screen out articles belonging to the target category according to the content of each article in the target article set and the trained classification model, and use the articles as articles to be distributed in the target article set;
and the publishing module 450 is used for publishing the articles to be published in the target article set in the target application program.
Optionally, as shown in fig. 5, the apparatus further includes:
the scoring module 441 is configured to determine a heat score of each article to be distributed in the target article set according to the number of users following the content account to which the article belongs, and the reading amount, the endorsement number and the comment number of the article;
a third screening module 442, configured to determine, in the target article set, p articles to be distributed that are ranked in front according to the heat scores of the articles to be distributed, where p is a positive integer;
the publishing module 450 is specifically configured to:
and publishing the top p articles to be distributed in the target article set in the target application program.
Optionally, as shown in fig. 6, the apparatus further includes:
a sample obtaining module 610, configured to obtain a training sample, where the training sample includes a sample article and a sample label of the sample article, the sample article includes an article of a target category and an article of a non-target category, and the sample label includes a label of a target category and a label of a non-target category;
the extraction module 620 is configured to extract important terms from a sample article of the training sample, so as to obtain important terms of the sample article;
A determining module 630, configured to input important terms of the sample article into a classification model to be trained, and obtain a test tag of the sample article;
and the training module 640 is configured to train the classification model to be trained according to the test label and the sample label of the sample article, so as to obtain the classification model.
Optionally, the extracting module 620 is specifically configured to:
word segmentation is carried out on the content of the sample article of the training sample, so that a plurality of words of the sample article are obtained;
among the plurality of words of the sample article, determining, according to the products of the term frequency TF and the inverse document frequency IDF, the words whose products of TF and IDF rank in the top q in the sample article as key words, wherein q is a positive integer;
determining terms belonging to the target category in the sample article according to terms in a term library of the target category in a plurality of terms of the sample article;
and using the key words and the terms in the sample article as important words of the sample article.
Optionally, the clustering module 420 is specifically configured to:
and dividing the m articles into k candidate article sets according to titles of the m articles and a k-means clustering algorithm.
Optionally, as shown in fig. 7, the apparatus further includes:
the word segmentation module 411 is configured to segment the titles of the m articles to obtain a plurality of words of each title;
the feature vector obtaining module 412 is configured to input a plurality of words of each title into a feature extraction model that completes training, to obtain a feature vector of each title;
the clustering module 420 is specifically configured to:
and clustering the feature vectors of the m titles by using a k-means clustering algorithm to obtain k candidate article sets.
In the embodiment of the present application, in order to improve the liveness of the target application, some articles derived from the popular news events may be published in the target application in time, for example, some articles of medical category derived from the popular news events may be published in time in the application of medical category. Since people often pay attention to the trending news events, articles derived from the trending news events are also interested, and the articles of the target category derived from the trending news events are released on the target application program, the activity of the target application program can be greatly improved.
It should be noted that: the device for acquiring the article of the target category provided in the above embodiment is only exemplified by the division of the above functional modules when acquiring the article of the target category, and in practical application, the above functional allocation may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. In addition, the device for acquiring the article of the target category provided in the above embodiment belongs to the same concept as the method embodiment for acquiring the article of the target category, and the detailed implementation process of the device is referred to the method embodiment and is not repeated here.
Fig. 8 is a schematic structural diagram of a computer device provided in an embodiment of the present application, where the computer device 800 may have a relatively large difference due to different configurations or performances, and may include one or more processors (Central Processing Units, CPU) 801 and one or more memories 802, where at least one instruction is stored in the memories 802, and the at least one instruction is loaded and executed by the processor 801 to implement the above method steps for obtaining the article of the target category.
The foregoing describes merely preferred embodiments and is in no way intended to limit the invention; all modifications, equivalents, improvements, etc. that fall within the spirit and scope of the invention are intended to be included within its scope of protection.

Claims (6)

1. A method of obtaining articles of a target category, the method comprising:
obtaining m articles published in a target time period from a starting time point of the day to a current time point in the news information website, a news application program and a social application program, wherein m is a positive integer, and the m articles are articles which are derived from news events and belong to different categories;
Dividing the m articles into k candidate article sets according to the titles of the m articles, wherein the articles in the same candidate article set are related to the same news event, and k is a positive integer and is smaller than or equal to m;
according to the number of articles in the candidate article sets corresponding to the news events, determining the candidate article set with the number of articles not less than a threshold value as a target article set in the candidate article sets corresponding to the news events;
screening articles belonging to a target category as articles to be distributed in the target article set according to the content of each article in the target article set and a classification model for completing training, wherein the target category is a medical category;
determining the heat score of each article to be distributed in the target article set according to the number of users of interest of the content number of the article or the number of users of interest of the news application program of the article, the reading quantity of the article, the endorsement number of the article and the comment number of the article;
in the target article set, determining p articles to be distributed which are arranged in front according to the heat scores of the articles to be distributed, wherein p is a positive integer;
The p articles to be distributed, which are arranged in the target article set, are distributed in a target application program, wherein the target application program is an application program related to medical categories;
the classification model is trained according to the following steps: acquiring a training sample, wherein the training sample comprises a sample article and sample labels of the sample article, the sample article comprises articles of target categories and articles of non-target categories, the sample labels comprise labels of target categories and labels of non-target categories, and the articles of the target categories included in the sample article are articles of medical categories derived from news events; extracting important words from the sample articles of the training sample to obtain important words of the sample articles, wherein the important words comprise keywords and terms in a word library of a target class; inputting important words of the sample articles into a classification model to be trained to obtain test labels of the sample articles; and training the classification model to be trained according to the test label and the sample label of the sample article to obtain the classification model.
2. The method of claim 1, wherein the extracting the important terms from the sample articles of the training sample to obtain the important terms of the sample articles comprises:
Word segmentation is carried out on the content of the sample article of the training sample, so that a plurality of words of the sample article are obtained;
among the words of the sample article, determining the words of which the products of the TF and the IDF are ranked in the front q names in the sample article as key words according to the products of the word frequency TF and the inverse text frequency index IDF, wherein q is a positive integer;
determining terms belonging to the target category in the sample article according to terms in a term library of the target category in a plurality of terms of the sample article;
and using the key words and the terms in the sample article as important words of the sample article.
3. An apparatus for capturing articles of a target category, the apparatus comprising:
the acquisition module is used for acquiring m articles in a target time period from a starting time point of the day to a current time point in the news information website, the news application program and the social application program, wherein m is a positive integer, and the m articles are articles which are derived from news events and belong to different categories;
the clustering module is used for dividing the m articles into k candidate article sets according to the titles of the m articles, wherein the articles in the same candidate article set are related to the same news event, and k is a positive integer and is smaller than or equal to m;
The first screening module is used for determining the candidate article sets with the article number not less than a threshold value as a target article set in the candidate article sets corresponding to the news events according to the article number in the candidate article sets corresponding to the news events;
the second screening module is used for screening articles belonging to a target category according to the content of each article in the target article set and the classification model which is trained, and the articles are used as articles to be distributed in the target article set, and the target category is a medical category;
the scoring module is used for determining the heat scores of the articles to be distributed in the target article set according to the number of users of interest of the content number of the articles or the number of users of interest of the news application program of the articles, the reading quantity of the articles, the endorsement number of the articles and the comment number of the articles;
the third screening module is used for determining p articles to be distributed which are arranged in front according to the heat scores of the articles to be distributed in the target article set, wherein p is a positive integer;
the publishing module is used for publishing the p previous articles to be published in the target article set in a target application program, wherein the target application program is an application program related to the medical category;
The apparatus further comprises:
the system comprises a sample acquisition module, a sample acquisition module and a processing module, wherein the sample acquisition module is used for acquiring a training sample, the training sample comprises a sample article and a sample label of the sample article, the sample article comprises an article of a target category and an article of a non-target category, the sample label comprises a label of the target category and a label of the non-target category, and the article of the target category included in the sample article is an article of a medical category derived from a news event;
the extraction module is used for extracting important words from the sample articles of the training samples to obtain important words of the sample articles, wherein the important words comprise keywords and terms in a word library of target categories;
the determining module is used for inputting the important words of the sample article into the classification model to be trained to obtain the test tag of the sample article;
and the training module is used for training the classification model to be trained according to the test label and the sample label of the sample article to obtain the classification model.
4. A device according to claim 3, characterized in that the extraction module is specifically configured to:
word segmentation is carried out on the content of the sample article of the training sample, so that a plurality of words of the sample article are obtained;
Among the words of the sample article, determining the words of which the products of the TF and the IDF are ranked in the front q names in the sample article as key words according to the products of the word frequency TF and the inverse text frequency index IDF, wherein q is a positive integer;
determining terms belonging to the target category in the sample article according to terms in a term library of the target category in a plurality of terms of the sample article;
and using the key words and the terms in the sample article as important words of the sample article.
5. A computer device comprising a processor and a memory having stored therein at least one instruction that is loaded and executed by the processor to implement the method of retrieving articles of a target category as claimed in any one of claims 1 to 2.
6. A computer readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the method of retrieving articles of a target category as claimed in any one of claims 1 to 2.
CN202010612869.7A 2020-06-30 2020-06-30 Method and device for acquiring articles of target category Active CN111667023B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010612869.7A CN111667023B (en) 2020-06-30 2020-06-30 Method and device for acquiring articles of target category

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010612869.7A CN111667023B (en) 2020-06-30 2020-06-30 Method and device for acquiring articles of target category

Publications (2)

Publication Number Publication Date
CN111667023A CN111667023A (en) 2020-09-15
CN111667023B true CN111667023B (en) 2024-04-05

Family

ID=72390571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010612869.7A Active CN111667023B (en) 2020-06-30 2020-06-30 Method and device for acquiring articles of target category

Country Status (1)

Country Link
CN (1) CN111667023B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613296A (en) * 2020-12-07 2021-04-06 深圳价值在线信息科技股份有限公司 News importance degree acquisition method and device, terminal equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019811A (en) * 2018-01-02 2019-07-16 腾讯科技(深圳)有限公司 Article recommended method, device and equipment
CN110162796A (en) * 2019-05-31 2019-08-23 阿里巴巴集团控股有限公司 Special Topics in Journalism creation method and device
CN110750212A (en) * 2019-09-06 2020-02-04 中国平安财产保险股份有限公司 Article publishing method and device, computer equipment and storage medium
CN111143655A (en) * 2019-12-30 2020-05-12 创新奇智(青岛)科技有限公司 Method for calculating news popularity

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郎志慧 等.智慧医学引领未来:医学事务优秀案例荟萃.科学技术文献出版社,2019,第229-230页. *

Also Published As

Publication number Publication date
CN111667023A (en) 2020-09-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant