CN111506727B - Text content category acquisition method, apparatus, computer device and storage medium - Google Patents

Text content category acquisition method, apparatus, computer device and storage medium

Info

Publication number
CN111506727B
Authority
CN
China
Prior art keywords
content category
content
word
text data
material selection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010301372.3A
Other languages
Chinese (zh)
Other versions
CN111506727A (en)
Inventor
王冬冬
胡炜
王彦陶
李诗茵
黄利贤
单掠风
吕虹
余可鸣
钱璟辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010301372.3A priority Critical patent/CN111506727B/en
Publication of CN111506727A publication Critical patent/CN111506727A/en
Application granted granted Critical
Publication of CN111506727B publication Critical patent/CN111506727B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a text content category acquisition method and apparatus, a computer device, and a storage medium. The method comprises: acquiring hot news text data from a news website; performing word segmentation on the hot news text data to obtain a word segmentation result; matching the word segmentation result against the target feature word set of each content category in a preset industry field, and obtaining the association degree between the hot news text data and each content category from the matching result; when two or more content categories share the same association degree, looking up those content categories to obtain a set of content categories to be compared; and acquiring the news material selection index of each content category in that set, and determining the content category of the text to be edited according to the association degrees and the news material selection indexes. The method can improve the accuracy with which the content category of the text to be edited is determined.

Description

Text content category acquisition method, apparatus, computer device and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for obtaining a text content category, a computer device, and a storage medium.
Background
With the development of computer technology, text pushing technology has emerged, in which the people responsible for text pushing edit and push texts related to the industry field they are responsible for. Before a text can be edited and pushed, the content category of the text to be edited must be determined.
In the conventional approach, the content category of the text to be edited is selected manually: the person responsible for text pushing reads the hot news on each news website as it appears, judges from experience how the hot news relates to each content category in the industry field, and determines the content category of the text to be edited from that relationship. For example, when hot news is associated with a certain illness, a person responsible for text pushing in the medical field may choose that illness as the content category of the text to be edited.
However, this approach requires a large amount of hot news to be read manually every day, and the association is judged from personal experience. Because the result depends too heavily on the experience of the person reading the news, the content category most relevant to the hot news may be missed, so the content category of the text to be edited cannot be determined accurately.
Disclosure of Invention
Based on this, it is necessary to provide an accurate text content category acquiring method, apparatus, computer device and storage medium in view of the above technical problems.
A text content category acquisition method, the method comprising:
acquiring hot news text data in a news website;
word segmentation is carried out on the hot news text data, and word segmentation results of the hot news text data are obtained;
matching word segmentation results of the hot news text data with target feature word sets of all content categories in a preset industry field, and obtaining association degrees of the hot news text data and all the content categories according to the matching results;
when the same association degree exists, searching content categories corresponding to the same association degree to obtain a content category set to be compared;
and acquiring news material selection indexes of each content category to be compared in the content category set to be compared, and determining the content category of the text to be edited according to the association degree and the news material selection indexes.
A text content category acquisition apparatus, the apparatus comprising:
the acquisition module is used for acquiring hot news text data in a news website;
the word segmentation module is used for segmenting the hot news text data to obtain word segmentation results of the hot news text data;
the matching module is used for matching word segmentation results of the hot news text data with target feature word sets of all content categories in a preset industry field, and obtaining the association degree of the hot news text data and all the content categories according to the matching results;
the searching module is used for searching the content category corresponding to the same association degree when the same association degree exists, so as to obtain a content category set to be compared;
the processing module is used for acquiring news material selection indexes of all the content categories to be compared in the content category set to be compared, and determining the content category of the text to be edited according to the association degree and the news material selection indexes.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring hot news text data in a news website;
word segmentation is carried out on the hot news text data, and word segmentation results of the hot news text data are obtained;
matching word segmentation results of the hot news text data with target feature word sets of all content categories in a preset industry field, and obtaining association degrees of the hot news text data and all the content categories according to the matching results;
when the same association degree exists, searching content categories corresponding to the same association degree to obtain a content category set to be compared;
and acquiring news material selection indexes of each content category to be compared in the content category set to be compared, and determining the content category of the text to be edited according to the association degree and the news material selection indexes.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring hot news text data in a news website;
word segmentation is carried out on the hot news text data, and word segmentation results of the hot news text data are obtained;
matching word segmentation results of the hot news text data with target feature word sets of all content categories in a preset industry field, and obtaining association degrees of the hot news text data and all the content categories according to the matching results;
when the same association degree exists, searching content categories corresponding to the same association degree to obtain a content category set to be compared;
and acquiring news material selection indexes of each content category to be compared in the content category set to be compared, and determining the content category of the text to be edited according to the association degree and the news material selection indexes.
With the above text content category acquisition method, apparatus, computer device and storage medium, hot news text data is acquired from a news website and segmented into words, the word segmentation result is matched against the target feature word set of each content category in a preset industry field, and the association degree between the hot news text data and each content category is obtained from the matching result, so that the text content category can be determined from the association degrees. When two or more content categories share the same association degree, those content categories are looked up to obtain a set of content categories to be compared, the news material selection index of each content category in the set is acquired, and the content category of the text to be edited is determined according to the association degrees and the news material selection indexes. Because the whole process analyzes the hot news text data at the word level, the association with each content category is determined from the words that actually appear in the hot news text data, and equal association degrees are resolved with the news material selection index, so the content category of the text to be edited can be determined accurately.
Drawings
FIG. 1 is a flow diagram of a text content category retrieval method in one embodiment;
FIG. 2 is a flow chart of a text content category obtaining method according to another embodiment;
FIG. 3 is an application environment diagram of a text content category retrieval method in one embodiment;
FIG. 4 is a schematic diagram of a text content category retrieval method in one embodiment;
FIG. 5 is a block diagram of a text content category retrieving device in one embodiment;
FIG. 6 is a block diagram of a text content category retrieving device in one embodiment;
FIG. 7 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in FIG. 1, a text content category acquisition method is provided. The method is described here as applied to a terminal; it is understood that the method may also be applied to a server, or to a system comprising a terminal and a server and implemented through interaction between the two. In this embodiment, the method comprises the following steps:
Step 102: acquire hot news text data from a news website.
The hot news text data refers to widely read news articles on a news website. For example, the hot news text data may be the articles placed in a prominent position on the news website, or the top-ranked articles in a news list that the website sorts by popularity.
Specifically, the terminal may periodically obtain hot news text data in the news website by using a crawler technology. For example, the terminal may obtain hot news text data from each news website daily using crawler technology.
Step 104: perform word segmentation on the hot news text data to obtain a word segmentation result of the hot news text data.
Word segmentation refers to splitting each sentence of the hot news text data into a sequence of words. The word segmentation result of the hot news text data is the set of words that remains after the hot news text data has been segmented and screened. Screening can use a preset stop-word list; stop words are characters or words that are filtered out automatically before or after natural language data (or text) is processed, in order to save storage space and improve search efficiency in information retrieval. For example, the stop words may be "we", "so", "go", and the like.
Specifically, the terminal invokes a word segmentation tool to segment the hot news text data, and then filters the resulting words against the preset stop-word list to obtain the word segmentation result. The word segmentation tool is configured with preset word segmentation parameters and segments the hot news text data according to them. The tool may use any word segmentation technique; for example, it may use jieba word segmentation. The word segmentation parameters may be the indivisible words of the industry field: presetting these words in the tool lets it split sentences in a way that suits the industry field. For example, when the industry field is the medical field, the indivisible words may be disease names and disease symptoms, such as depression, avoidance of crowds, aversion to cold, and weakness.
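The segmentation and stop-word screening of step 104 can be illustrated with a minimal sketch. It does not use jieba; as a stand-in it tokenizes English text with a regular expression that tries the indivisible words first. The function name `segment` and the sample word lists are assumptions made for illustration only.

```python
import re

# Hypothetical sample data; the patent does not list a full dictionary.
INDIVISIBLE_WORDS = ["aversion to cold", "avoidance of crowds"]
STOP_WORDS = {"we", "so", "go", "the", "and", "of", "a", "with"}


def segment(text, indivisible_words, stop_words):
    """Split text into lowercase words, keep indivisible words whole, drop stop words."""
    # Try longer indivisible words first so they win over shorter overlapping ones.
    phrases = sorted(indivisible_words, key=len, reverse=True)
    alternatives = [r"\b" + re.escape(p) + r"\b" for p in phrases]
    alternatives.append(r"[a-z0-9]+")  # fallback: a single plain word
    tokens = re.findall("|".join(alternatives), text.lower())
    # Screening step: remove stop words from the segmented tokens.
    return [t for t in tokens if t not in stop_words]


words = segment("We report fever and aversion to cold", INDIVISIBLE_WORDS, STOP_WORDS)
print(words)
```

Because indivisible words are matched before single words, "of" inside "avoidance of crowds" survives even though "of" is a stop word on its own.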
Step 106: match the word segmentation result of the hot news text data against the target feature word set of each content category in a preset industry field, and obtain the association degree between the hot news text data and each content category from the matching result.
The industry field refers to a sector of production or business activity, such as the medical field, the computer software field, or the financial field. The content categories of an industry field are the topics that a text to be edited can be about. For example, in the medical field the content categories may be the names of various diseases; in the computer software field they may be software functions; and in the financial field they may be investment types such as stocks and funds. The target feature word set is a set of words that characterize a content category. For example, when the industry field is the medical field, the target feature word set of a disease may be the set of words describing its symptoms. The association degree measures how closely the hot news text data is related to a content category: the higher the association degree, the more relevant the hot news text data is to that content category.
Specifically, the terminal matches the word segmentation result of the hot news text data with each target feature word in the target feature word set of each content category in the preset industry field, counts the target feature words appearing in the hot news text data, and obtains the association degree of the hot news text data and each content category according to the target feature words appearing in the hot news text data.
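The matching in step 106 can be sketched as a set intersection per content category. The feature word sets below are hypothetical, and using the raw hit count as the association degree is an assumption made only for this illustration; the text does not fix a formula at this point.

```python
# Hypothetical target feature word sets for two content categories.
FEATURE_SETS = {
    "common cold": {"fever", "cough", "runny nose"},
    "common pneumonia": {"fever", "cough", "lung shadow"},
}


def hit_feature_words(tokens, feature_sets):
    """Return, per content category, the target feature words found in the tokens."""
    token_set = set(tokens)
    return {
        category: sorted(words & token_set)
        for category, words in feature_sets.items()
    }


tokens = ["report", "fever", "cough", "lung shadow"]
hits = hit_feature_words(tokens, FEATURE_SETS)
# Assumption: the association degree here is simply the number of hits.
association = {category: len(found) for category, found in hits.items()}
print(hits)
print(association)
```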
Step 108: when two or more content categories have the same association degree, look up the content categories corresponding to that association degree to obtain a set of content categories to be compared.
A content category to be compared is a content category whose association degree equals that of at least one other content category. For example, when two content categories have the same association degree with the hot news text data, those two content categories are the content categories to be compared. For another example, when there are two groups of content categories, each group sharing one association degree and the two groups having different association degrees, every content category in both groups is a content category to be compared.
Specifically, when the same association degree exists, the terminal searches for the content category corresponding to the same association degree, and uses the content category corresponding to the same association degree as the content category to be compared to obtain a content category set to be compared.
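Finding the content categories to be compared amounts to grouping categories by association degree and keeping the groups with more than one member. The following is a minimal sketch; the function name, the sample values, and the rounding used to compare floating-point degrees are assumptions.

```python
from collections import defaultdict


def tied_category_groups(association, digits=6):
    """Group content categories that share the same association degree."""
    groups = defaultdict(list)
    for category, degree in association.items():
        # Round so that tiny floating-point differences do not hide a tie.
        groups[round(degree, digits)].append(category)
    # Keep only degrees shared by two or more categories, highest degree first.
    return [
        sorted(categories)
        for degree, categories in sorted(groups.items(), reverse=True)
        if len(categories) > 1
    ]


# Hypothetical association degrees for five content categories.
sample = {"a": 0.8, "b": 0.8, "c": 0.5, "d": 0.5, "e": 0.3}
print(tied_category_groups(sample))
```

With this sample there are two groups of tied categories, matching the second example in the text: each group shares one association degree, and the two groups differ from each other.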
Step 110: acquire the news material selection index of each content category to be compared in the set of content categories to be compared, and determine the content category of the text to be edited according to the association degrees and the news material selection indexes.
The news material selection index indicates how practical each content category to be compared is to write about, and can be obtained by combining the number of historical materials available for the content category with its audience parameters. Historical materials are published materials corresponding to the content category to be compared, such as published news or published papers. The audience parameter describes the audience range, for example ages 10 to 40, or ages 25 to 30.
Specifically, the terminal obtains the news material selection index of each content category to be compared from the number of materials and the audience parameters corresponding to that content category, and re-ranks the content categories to be compared according to the association degree and the news material selection index to determine their order. The terminal then combines the ranking by association degree, the order of the content categories to be compared, and a comprehensive relevance computed with an acquired heat index to obtain a ranking of the content categories for the hot news text data, and determines the content category of the text to be edited from that ranking. The heat index represents how much attention a piece of hot news receives and can be obtained from heat data published by, for example, a search platform. For example, a search platform may express the heat of each piece of hot news as a quantized value and rank the hot news by that value; the quantized value is the heat index. The comprehensive relevance is computed by combining the association degree, the association degree to be compared, the heat index, and a preset comprehensive weight, which can be set as needed.
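The re-ranking of tied content categories can be sketched as a sort on two keys. The text does not give a formula for the news material selection index, so the product of material count and audience age span used here is purely an assumption, as are the class and function names and the sample figures. The heat index and comprehensive relevance are left out of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Category:
    name: str
    association: float      # association degree with the hot news text data
    material_count: int     # number of published historical materials
    audience_min_age: int   # lower bound of the audience range
    audience_max_age: int   # upper bound of the audience range


def selection_index(cat):
    """Assumed formula: more materials and a wider audience give a higher index."""
    return cat.material_count * (cat.audience_max_age - cat.audience_min_age)


def rank_categories(categories):
    """Sort by association degree, breaking ties with the selection index."""
    return sorted(
        categories,
        key=lambda c: (-c.association, -selection_index(c), c.name),
    )


# Hypothetical data: two categories tie on association degree 0.8.
candidates = [
    Category("common pneumonia", 0.8, 300, 25, 30),
    Category("common cold", 0.8, 120, 10, 40),
    Category("influenza", 0.9, 10, 20, 30),
]
ranking = [c.name for c in rank_categories(candidates)]
print(ranking)
```

Sorting on a tuple keeps the association degree as the primary criterion, so the selection index only changes the order among categories whose association degrees are equal.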
With the above text content category acquisition method, hot news text data is acquired from a news website and segmented into words, the word segmentation result is matched against the target feature word set of each content category in a preset industry field, and the association degree between the hot news text data and each content category is obtained from the matching result, so that the text content category can be determined from the association degrees. When two or more content categories share the same association degree, those content categories are looked up to obtain a set of content categories to be compared, the news material selection index of each content category in the set is acquired, and the content category of the text to be edited is determined according to the association degrees and the news material selection indexes. Because the whole process analyzes the hot news text data at the word level, the association with each content category is determined from the words that actually appear in the hot news text data, and equal association degrees are resolved with the news material selection index, so the content category of the text to be edited can be determined accurately.
In one embodiment, matching the word segmentation result of the hot news text data with a target feature word set of each content category in a preset industry field, and obtaining the association degree of the hot news text data and each content category according to the matching result includes:
matching word segmentation results of the hot news text data with each target feature word in a target feature word set of each content category in a preset industry field to obtain hit feature words corresponding to each content category in the hot news text data;
and obtaining the association degree of the hot news text data and each content category according to hit feature words corresponding to each content category in the hot news text data and the feature weights of preset target feature words.
Here, a target feature word is a word that characterizes a content category. For example, when the industry field is the medical field, the target feature words may be words describing the symptoms of various diseases. Hit feature words are the target feature words of a content category that appear in the hot news text data. The feature weight represents how strongly a target feature word is associated with a content category: the larger the feature weight, the stronger the association. For example, when the industry field is the medical field, the feature weight may represent how strongly a symptom is associated with a disease. Different diseases may share symptoms; for example, both the common cold and common pneumonia may present with fever and cough. The target feature words "fever" and "cough" are therefore associated with both content categories, in a many-to-many relationship. However, fever and cough are the primary diagnostic symptoms of the common cold, whereas the primary diagnostic symptom of common pneumonia is a shadow in the lung, with fever and cough being secondary. The feature weights of "fever" and "cough" are therefore larger for the common cold than for common pneumonia, and for common pneumonia the feature weight of "lung shadow" is larger than the feature weights of "fever" and "cough".
Specifically, the terminal matches the word segmentation result of the hot news text data against each target feature word in the target feature word set of each content category in the preset industry field, and obtains the hit feature words of each content category from the hot news text data. It then looks up the feature weight of each hit feature word from the preset feature weights of the target feature words, and calculates the association degree between the hot news text data and each content category from the feature weights of that category's hit feature words. The calculation is as follows: if any hit feature word of a content category has a feature weight of 1, the association degree between the hot news text data and that content category is 100%; otherwise, the association degree is obtained by summing the feature weights of the hit feature words. For example, when the industry field is the medical field, if the target feature word "positive test strip" of a certain disease has a feature weight of 1 and is a hit feature word for that disease, the association degree between the hot news text data and the disease is 100%.
In this embodiment, by matching the word segmentation result of the hot news text data with each target feature word in the target feature word set of each content category in the preset industry field, hit feature words corresponding to each content category can be obtained, and then the association degree between the hot news text data and each content category can be obtained according to the feature weights of the hit feature words and each target feature word.
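The weighted calculation described in this embodiment can be sketched as follows. The weights are hypothetical values chosen to mirror the cold and pneumonia example, and capping the accumulated sum at 1.0 is an assumption that the text does not state.

```python
def association_degree(tokens, feature_weights):
    """Association degree between a segmented text and one content category."""
    # Hit feature words: target feature words that appear in the text.
    hits = {t for t in tokens if t in feature_weights}
    # A hit feature word with weight 1 gives an association degree of 100%.
    if any(feature_weights[w] == 1 for w in hits):
        return 1.0
    # Otherwise accumulate the weights (capped at 1.0, an assumption).
    return min(1.0, sum(feature_weights[w] for w in hits))


# Hypothetical feature weights per content category.
COLD = {"fever": 0.4, "cough": 0.4, "runny nose": 0.2}
PNEUMONIA = {"lung shadow": 0.6, "fever": 0.2, "cough": 0.2}
DISEASE_X = {"positive test strip": 1, "fever": 0.3}

tokens = ["report", "fever", "cough"]
print(association_degree(tokens, COLD))
print(association_degree(tokens, PNEUMONIA))
print(association_degree(["positive test strip"], DISEASE_X))
```

With these sample weights, a text mentioning only fever and cough scores higher for the common cold than for common pneumonia, which is the behavior the feature weights are meant to produce.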
In one embodiment, before matching the word segmentation result of the hot news text data with the target feature word set of each content category in the preset industry field, obtaining the association degree between the hot news text data and each content category according to the matching result, the method further includes:
acquiring historical data carrying category labels;
classifying the historical data according to the category labels to obtain a historical data set corresponding to each content category;
performing word segmentation on each historical data in the historical data set to obtain word segmentation results corresponding to each content category;
screening word segmentation results corresponding to each content category according to an initial feature word set of each content category, and obtaining target word segmentation results corresponding to each content category;
counting the occurrence times of each word in the target word segmentation result corresponding to each content category;
and determining a target feature word set of each content category and feature weights of each target feature word in the target feature word set according to the statistical result.
The historical data refers to data related to each content category in the industry field. For example, when the industry field is the medical field, the historical data may be historical diagnosis data. A category label is a label that identifies the category of a piece of historical data. For example, when the industry field is the medical field, the category labels may identify various diseases. The category label may also be numeric; for example, the category label of the common cold may be 1 and that of common pneumonia may be 2, and the specific form of the category label can be set as needed. The historical data set corresponding to each content category is the set of data related to that content category. For example, when the industry field is the medical field, the historical data set corresponding to each content category may be the historical diagnosis data associated with each disease. The preset initial feature word set of each content category is a set of words that characterize the content category, such that the initial feature words together describe the content category comprehensively. For example, the initial feature word set may be a manually labeled set of words characterizing the content category; when the industry field is the medical field, it may be the set of words describing all symptoms of each disease. The target word segmentation result corresponding to each content category is the set of words that appear both in the initial feature word set and in the word segmentation result of that content category, that is, the hit feature words obtained with respect to the initial feature words.
Specifically, the terminal acquires historical data carrying category labels, classifies the historical data by reading the category labels to obtain the historical data set corresponding to each content category, invokes a word segmentation tool to segment each piece of historical data in the historical data set, and then filters the resulting words against a preset stop-word list to obtain the word segmentation result corresponding to each content category. The word segmentation tool is configured with preset word segmentation parameters and segments the historical data according to them. The tool may use any word segmentation technique; for example, it may use jieba word segmentation. The word segmentation parameters may be the indivisible words of each content category in the industry field: presetting these words in the tool lets it split sentences in a way that suits each content category. For example, when the industry field is the medical field, the indivisible words of each content category may be words describing the symptoms of a disease, such as avoidance of crowds, aversion to cold, and weakness.
After obtaining the word segmentation result corresponding to each content category, the terminal matches the preset initial feature word set of each content category against that category's word segmentation result, and keeps the words that correspond to initial feature words; this gives the target word segmentation result of each content category. The terminal then counts how many times each word in the target word segmentation result occurs in the historical data set of the content category and sorts the words by that count. According to a preset number of feature words, it selects the target feature words of each content category from the sorted words to obtain the target feature word set, and obtains the feature weight of each target feature word from its coverage rate in the historical data set of the content category. The preset number of feature words can be set as needed.
For example, when the industry field is the medical field, the terminal may acquire historical diagnosis data carrying category labels, classify the historical diagnosis data according to the category labels to obtain the historical diagnosis data related to each disease, call a word segmentation tool to segment the historical diagnosis data related to each disease, and screen the segmented words according to a preset stop word bank to obtain the word segmentation result corresponding to each disease, wherein the indivisible words used to describe the symptoms of each disease are preset in the word segmentation tool. After obtaining the word segmentation result corresponding to each disease, the terminal screens it according to the preset word set characterizing all symptoms of that disease to obtain the target word segmentation result corresponding to each disease, counts the occurrence times of the words describing symptoms in the target word segmentation result corresponding to each disease, and determines the target feature word set of each disease and the feature weight of each target feature word in the set according to the statistical result.
In this embodiment, the historical data carrying category labels is classified to obtain historical data sets, each piece of historical data in the historical data sets is segmented to obtain word segmentation results, the word segmentation results are screened to obtain target word segmentation results, and the occurrence times of each word in the target word segmentation result corresponding to each content category are counted, so that the target feature word set of each content category and the feature weight of each target feature word in the target feature word set can be determined according to the statistical result.
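The classification, segmentation, screening and counting described in this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function name `build_target_counts`, the sample records and the whitespace tokenizer passed as `segment` are all hypothetical (a real system would plug in a word segmentation tool such as jieba with preset indivisible words).

```python
from collections import Counter, defaultdict

def build_target_counts(history, initial_features, stop_words, segment=str.split):
    """history: iterable of (category_label, text) pairs (hypothetical layout)."""
    # Classify the labelled historical data into one set per content category.
    grouped = defaultdict(list)
    for label, text in history:
        grouped[label].append(text)

    counts = {}
    for label, texts in grouped.items():
        allowed = initial_features.get(label, set())
        counter = Counter()
        for text in texts:
            for word in segment(text):
                # Drop stop words, then keep only words that also appear in
                # the preset initial feature word set of this category.
                if word not in stop_words and word in allowed:
                    counter[word] += 1
        counts[label] = counter
    return grouped, counts

# Toy data: numeric labels 1 and 2 stand for two diseases.
history = [
    (1, "patient has cough and fever"),
    (1, "cough with runny nose"),
    (2, "fever and chest pain"),
]
initial_features = {1: {"cough", "fever", "runny"}, 2: {"fever", "chest", "pain"}}
stop_words = {"and", "with", "has"}

grouped, counts = build_target_counts(history, initial_features, stop_words)
print(counts[1].most_common())
```

The per-category counters returned here are the statistics from which the target feature word set and the feature weights are later derived.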
In one embodiment, determining the set of target feature words for each content category and the feature weights for each target feature word in the set of target feature words based on the statistics includes:
sequencing the words according to the occurrence times of the words in the target word segmentation result, and obtaining a target feature word set of each content category according to the sequencing result and the preset feature word number;
counting the coverage rate of each target feature word in the target feature word set in the historical data set corresponding to the content category;
and calculating the feature weight of each target feature word according to the coverage rate.
The preset feature word number refers to the number of feature words to be selected for each content category, which is set in advance and can be adjusted as needed. The coverage rate of each target feature word in the historical data set corresponding to the content category refers to the probability that the target feature word appears in a piece of historical data in the historical data set. For example, when the total number of pieces of historical data in the historical data set is m and a certain target feature word appears in n of them, the coverage rate of the target feature word is n/m.
Specifically, the terminal performs ascending order or descending order sorting on each word according to the occurrence times of each word in the target word segmentation result, and selects N words with the largest occurrence times from the target word segmentation result according to the sorting result and the preset feature word number as a target feature word set, wherein N is the preset feature word number. After obtaining the target feature word set of each content category, the terminal determines whether each target feature word appears in each historical data in the historical data set by matching the target feature words, calculates the coverage rate of each target feature word in the historical data set corresponding to the content category by counting the occurrence times of each target feature word, and calculates the feature weight of each target feature word according to the coverage rate. Wherein, according to the coverage rate, calculating the feature weight of each target feature word comprises: if the coverage rate of any target feature word is 100%, determining that the feature weight of the target feature word is 1, and if the coverage rate of the target feature word is not 100%, calculating the feature weight of each target feature word according to the proportion of the coverage rate of each target feature word to the total coverage rate of each target feature word.
In this embodiment, the words are ranked according to the number of occurrences of the words in the target word segmentation result, a target feature word set of each content category is obtained according to the ranking result and the preset feature word number, the coverage rate of each target feature word in the target feature word set in the historical data set corresponding to the content category is counted, and the feature weight of each target feature word is calculated according to the coverage rate, so that the target feature word set of each content category and the feature weight of each target feature word in the set can be determined.
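The selection of the N most frequent words, the coverage rate and the weight rule can be sketched as below. The function names and sample data are hypothetical. One point is an assumption of this sketch: the text does not say whether the "total coverage rate" includes words whose coverage is 100%, and here it is taken as the sum over all target feature words of the category.

```python
from collections import Counter

def select_feature_words(counts, n):
    # Keep the N words with the most occurrences (N = preset feature word number).
    return [word for word, _ in counts.most_common(n)]

def coverage(word, segmented_records):
    # Share of historical records in which the word appears at least once (n/m).
    hits = sum(1 for record in segmented_records if word in record)
    return hits / len(segmented_records)

def feature_weights(words, segmented_records):
    cov = {w: coverage(w, segmented_records) for w in words}
    total = sum(cov.values())  # assumption: total taken over all target words
    weights = {}
    for w, c in cov.items():
        # A word covering every record gets weight 1; otherwise its weight is
        # its share of the total coverage.
        weights[w] = 1.0 if c == 1.0 else (c / total if total else 0.0)
    return cov, weights

# Each historical record is represented by the set of words it contains.
records = [{"cough", "fever"}, {"cough"}, {"cough", "runny"}, {"cough", "fever"}]
counts = Counter(word for record in records for word in record)

words = select_feature_words(counts, 3)
cov, weights = feature_weights(words, records)
print(cov, weights)
```

In this toy data "cough" appears in all four records, so its coverage is 4/4 and its weight is fixed at 1, while "fever" (2/4) and "runny" (1/4) share the remaining weights in proportion to their coverage.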
In one embodiment, after obtaining the association degree between the hot news text data and each content category according to the matching result, the method further comprises:
when the association degrees are all different, acquiring a heat index of the hot news text data, and sorting the content categories according to the heat index and the association degrees;
and determining the content category of the text to be edited according to the preset news material selection number and the sequencing result.
The preset news material selection number refers to the preset number of content categories to be selected, and can be set as needed.
Specifically, when the association degrees between the hot news text data and the content categories are all different, the terminal obtains the heat index of the hot news text data, sorts the content categories according to the heat index and the association degrees, and determines the content category of the text to be edited according to the preset news material selection number and the sorting result. For example, the terminal may calculate a comprehensive association degree from the heat index, the association degree and a preset comprehensive weight, sort the content categories in descending order of comprehensive association degree so that the content category with the highest comprehensive association degree comes first, and determine the content category of the text to be edited according to the preset news material selection number. When the comprehensive association degrees of any two content categories are the same, those content categories are sorted again by association degree, with a higher association degree taking priority over a lower one. For example, when the preset news material selection number is N, the terminal may select, according to the sorting result, the N top-ranked content categories as the content categories of the text to be edited. The preset comprehensive weight may be set as required.
In this embodiment, when the association degrees are all different, the heat index of the hot news text data is acquired, the content categories are sorted according to the heat index and the association degrees, and the content category of the text to be edited is determined according to the preset news material selection number and the sorting result, so that the content category of the text to be edited can be determined.
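A minimal sketch of this ranking follows. The text does not give the formula for the comprehensive association degree, so the weighted sum used here is an assumption, as are the function name `rank_categories`, the dictionary layout and the normalization of the heat index to [0, 1]. The sketch also assumes that candidates from several hot news items are ranked together, so each candidate carries its own heat index.

```python
def rank_categories(candidates, weight=0.5, top_n=2):
    """candidates: dicts with 'category', 'association' and 'heat' (hypothetical)."""
    def sort_key(c):
        # Assumed form: weighted sum of association degree and heat index.
        comprehensive = weight * c["association"] + (1 - weight) * c["heat"]
        # Ties on the comprehensive value are broken by the association degree.
        return (comprehensive, c["association"])

    ranked = sorted(candidates, key=sort_key, reverse=True)
    # Keep the top N categories (N = preset news material selection number).
    return [c["category"] for c in ranked[:top_n]]

candidates = [
    {"category": "A", "association": 0.75, "heat": 0.25},
    {"category": "B", "association": 0.5, "heat": 0.5},
    {"category": "C", "association": 0.25, "heat": 1.0},
]
selected = rank_categories(candidates)
print(selected)
```

Here A and B have the same comprehensive value (0.5), so A is placed ahead of B because its association degree is higher, and C ranks first with 0.625.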
In one embodiment, obtaining news material selection indexes of each content category to be compared in the content category set to be compared includes:
acquiring the number of material selection materials corresponding to each content category and audience parameters;
calculating the total number of material selection materials according to the number of material selection materials corresponding to each content category;
and obtaining news material selection indexes of the content categories to be compared according to the audience parameters, the total number of material selection materials and the number of material selection materials corresponding to the content categories to be compared.
The number of material selection materials refers to the number of referenceable materials corresponding to each content category. For example, it may be the number of published papers corresponding to each content category, the number of published news items corresponding to each content category, or the sum of the two. The total number of material selection materials is the sum of the numbers of material selection materials corresponding to all content categories. The news material selection index indicates the operability of each content category to be compared and can be obtained from the number of material selection materials corresponding to that content category. For example, the news material selection index may specifically be the ratio of the number of material selection materials to the total number of material selection materials. Audience parameters refer to audience ranges. For example, the audience parameter may specifically be 10 years old to 40 years old, or 25 years old to 30 years old. For example, when the industry field is the medical field, the audience parameter may specifically refer to the range of patient ages in the historical review data.
Specifically, the terminal may obtain the number of material selection materials corresponding to each content category by querying a preset database table, the corresponding relationship between each content category and the number of material selection materials is stored in the preset database table, the corresponding relationship between each content category and the number of material selection materials stored in the preset database table may be obtained by manual statistics, and the statistics personnel may count the number of material selection materials corresponding to each content category by referring to each paper website, each news website, and the like. The terminal may obtain audience parameters corresponding to each content category by querying a preset database table, and store the corresponding relation between each content category and the audience parameters in the preset database table, where the corresponding relation between each content category and the audience parameters stored in the preset database table may be obtained by manual statistics, and the statistics personnel may obtain the audience parameters by referring to historical data carrying category labels, and in this embodiment, the manner of obtaining the audience parameters is not limited. After obtaining the number of the material selection materials corresponding to each content category, the terminal calculates the total number of the material selection materials according to the number of the material selection materials corresponding to each content category, calculates the ratio of the number of the material selection materials corresponding to each content category to be compared to the total number of the material selection materials according to the total number of the material selection materials, and obtains news material selection indexes of each content category to be compared according to the ratio and audience parameters.
Specifically, the manner of obtaining news material selection indexes of each content category to be compared according to the ratio and the audience parameter may be: and calculating according to preset weight factors, ratio and audience parameters to obtain news material selection indexes of the content categories to be compared, wherein the preset weight factors can be set according to the needs.
In this embodiment, by acquiring the number of material selection materials corresponding to each content category and the audience parameter, calculating the total number of material selection materials according to the number of material selection materials corresponding to each content category, and acquiring the news material selection index of each content category to be compared according to the audience parameter, the total number of material selection materials and the number of material selection materials corresponding to each content category to be compared, the acquisition of the news material selection index of each content category to be compared can be realized.
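The computation can be sketched as below. Only the ratio of a category's material count to the total is stated in the text; how the audience parameter enters the index is not specified, so this sketch assumes, purely for illustration, that the age range is converted to a score (range width divided by a hypothetical `reference_span` of 100 years) and combined with the ratio through a preset weight factor. All names and figures are hypothetical.

```python
def selection_indexes(material_counts, audiences, to_compare,
                      weight_factor=0.5, reference_span=100):
    # Total number of material selection materials over all content categories.
    total = sum(material_counts.values())
    indexes = {}
    for category in to_compare:
        # Share of this category's materials in the total.
        ratio = material_counts[category] / total
        # Assumed audience score: width of the age range over a reference span.
        low, high = audiences[category]
        audience_score = (high - low) / reference_span
        # Assumed combination: weighted sum with the preset weight factor.
        indexes[category] = (weight_factor * ratio
                             + (1 - weight_factor) * audience_score)
    return total, indexes

material_counts = {"cold": 30, "pneumonia": 10, "flu": 60}
audiences = {"cold": (10, 40), "pneumonia": (25, 30), "flu": (5, 65)}

total, indexes = selection_indexes(material_counts, audiences, ["cold", "pneumonia"])
print(total, indexes)
```

Note that the total is taken over every content category, while the index is only computed for the categories in the set to be compared.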
In one embodiment, determining the content category of the text to be edited based on the association degree and the news material selection index includes:
determining the to-be-compared association degree of each to-be-compared content category according to the association degree, the news material selection index and a preset material selection weight factor, and acquiring a heat index of hot news text data;
sorting the content categories according to the association degree, the association degree to be compared and the heat index;
and determining the content category of the text to be edited according to the preset news material selection number and the sequencing result.
The preset material selection weight factor refers to a parameter used for representing the association degree and the weight of the news material selection index, and can be set automatically according to the requirement. The correlation degree to be compared refers to the correlation degree used for comparison and calculated by combining the correlation degree, the news material selection index and a preset material selection weight factor. The heat index is used for representing the attention degree of hot news, and can be obtained according to heat data published by a search platform and the like. For example, the search platform may represent the hotness of each hot news by a quantized value, and rank the hot news according to the hotness, where the quantized value is a hotness index.
Specifically, the terminal calculates the association degree to be compared of each content category to be compared according to the association degree, the news material selection index and the preset material selection weight factor, acquires the heat index corresponding to the hot news text data from heat data published by a search platform or the like, sorts the content categories according to the association degree, the association degree to be compared and the heat index, and determines the content category of the text to be edited according to the preset news material selection number and the sorting result. The content categories may be sorted according to the association degree, the association degree to be compared and the heat index as follows: first, the content categories are sorted according to the association degree to obtain a first sorting result; next, the content categories to be compared, which share the same association degree, are sorted among themselves according to the association degree to be compared, and this ordering is combined with the first sorting result to obtain a second sorting result; finally, the comprehensive association degree of each content category is obtained according to the heat index, the preset comprehensive weight and the second sorting result, and the content categories are sorted a third time according to the comprehensive association degree to obtain a third sorting result.
When the comprehensive association degrees of any two content categories are the same, the content categories with the same comprehensive association degrees are ranked again by referring to the association degrees and the association degrees to be compared, the priority of the association degrees is higher than that of the association degrees to be compared, the priority of the content category with high association degrees is higher than that of the content category with low association degrees, and the priority of the content category with high association degrees to be compared is higher than that of the content category with low association degrees to be compared.
In this embodiment, the relevance to be compared of each content category to be compared is determined according to the relevance, the news material selection index and the preset material selection weight factor, the content categories are ordered according to the relevance, the relevance to be compared and the heat index, and the content category of the text to be edited is determined according to the preset news material selection number and the ordering result, so that the determination of the content category of the text to be edited can be realized.
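The ordering with tie-breaking can be sketched with a single composite sort key. The formulas are assumptions: the text names the inputs (association degree, news material selection index, preset material selection weight factor, heat index, preset comprehensive weight) but not how they are combined, so weighted sums are used here. The function name, dictionary layout and sample values are hypothetical, and the three sorting passes are collapsed into one sort whose key applies the stated priorities.

```python
def rank_with_ties(candidates, heat, selection_factor=0.5,
                   overall_weight=0.5, top_n=2):
    """candidates: dicts with 'category', 'association' and, for categories in
    the set to be compared, 'selection_index' (hypothetical layout)."""
    def sort_key(c):
        assoc = c["association"]
        index = c.get("selection_index")
        # Assumed form of the association degree to be compared; categories
        # outside the tied set simply fall back to their association degree.
        to_compare = assoc if index is None else (
            (1 - selection_factor) * assoc + selection_factor * index)
        # Assumed form of the comprehensive association degree.
        overall = overall_weight * assoc + (1 - overall_weight) * heat
        # Priority: comprehensive value, then association degree, then the
        # association degree to be compared.
        return (overall, assoc, to_compare)

    ranked = sorted(candidates, key=sort_key, reverse=True)
    return [c["category"] for c in ranked[:top_n]]

candidates = [
    {"category": "A", "association": 0.5, "selection_index": 0.25},
    {"category": "B", "association": 0.5, "selection_index": 0.75},
    {"category": "C", "association": 0.25},
]
chosen = rank_with_ties(candidates, heat=0.5)
print(chosen)
```

A and B share the same association degree and therefore the same comprehensive value in this sketch; B is placed first because its news material selection index gives it the higher association degree to be compared.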
In one embodiment, as shown in fig. 2, the text content category acquisition method of the present application is described by way of a detailed embodiment, which includes the following steps:
step 202, hot news text data in a news website is obtained;
step 204, word segmentation is carried out on the hot news text data, and word segmentation results of the hot news text data are obtained;
step 206, obtaining historical data carrying class labels;
step 208, classifying the historical data according to the category labels to obtain a historical data set corresponding to each content category;
step 210, word segmentation is carried out on each historical data in the historical data set to obtain word segmentation results corresponding to each content category;
step 212, screening word segmentation results corresponding to each content category according to an initial feature word set of each content category, and obtaining target word segmentation results corresponding to each content category;
step 214, counting the occurrence times of each word in the target word segmentation result corresponding to each content category;
step 216, sorting the words according to the occurrence times of the words in the target word segmentation result, and obtaining a target feature word set of each content category according to the sorting result and the preset feature word number;
step 218, counting the coverage rate of each target feature word in the target feature word set in the historical data set corresponding to the content category;
step 220, calculating the feature weight of each target feature word according to the coverage rate;
step 222, matching the word segmentation result of the hot news text data with each target feature word in the target feature word set of each content category in the preset industry field to obtain hit feature words corresponding to each content category in the hot news text data;
step 224, obtaining the association degree between the hot news text data and each content category according to the hit feature words corresponding to each content category in the hot news text data and the feature weights of the preset target feature words;
step 226, judging whether the same association degree exists in the association degree of the hot news text data and each content category, if yes, jumping to step 232, and if no, jumping to step 228;
step 228, acquiring a heat index of the hot news text data, and sorting all content categories according to the association degree and the heat index;
step 230, determining the content category of the text to be edited according to the preset news material selection number and the sequencing result;
step 232, searching content categories corresponding to the same association degree to obtain a content category set to be compared;
step 234, obtaining the number of material selection materials and audience parameters corresponding to each content category;
step 236, calculating the total number of material selection materials according to the number of material selection materials corresponding to each content category;
step 238, obtaining news material selection indexes of each content category to be compared according to the audience parameters, the total number of material selection materials and the number of material selection materials corresponding to each content category to be compared in the content category set to be compared;
step 240, determining the degree of correlation to be compared of each content category to be compared according to the degree of correlation, the news material selection index and a preset material selection weight factor, and obtaining the heat index of the hot news text data;
step 242, sorting the content categories according to the association degree, the association degree to be compared and the heat index;
and step 244, determining the content category of the text to be edited according to the preset news material selection number and the sequencing result.
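The branch taken at step 226 and the search of step 232 can be sketched as follows. The function name and sample values are hypothetical, and comparing association degrees for exact equality is an assumption of this sketch (in practice the values might be rounded to a fixed precision first).

```python
from collections import defaultdict

def find_categories_to_compare(associations):
    """associations: dict mapping content category -> association degree."""
    # Group the content categories by their association degree.
    by_degree = defaultdict(list)
    for category, degree in associations.items():
        by_degree[degree].append(category)
    # A category belongs to the set to be compared when at least one other
    # category has the same association degree.
    return {c for group in by_degree.values() if len(group) > 1 for c in group}

associations = {"cold": 0.6, "flu": 0.6, "pneumonia": 0.3}
to_compare = find_categories_to_compare(associations)

# A non-empty set corresponds to the "yes" branch (step 232 onwards); an empty
# set corresponds to the "no" branch (step 228).
branch = "step 232" if to_compare else "step 228"
print(to_compare, branch)
```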
The application also provides an application scene, which applies the text content category acquisition method. Specifically, the application of the text content category obtaining method in the application scene is as follows:
When content operation is carried out, the content category of the operation text to be edited for the content operation can be determined through the text content category acquisition method. As shown in fig. 3, the terminal first classifies historical data according to category labels to obtain a historical data set corresponding to each operable content category, and performs word segmentation on each piece of historical data in the historical data set to obtain the word segmentation result corresponding to each operable content category. The terminal then screens the word segmentation result corresponding to each operable content category according to the initial feature word set preset for that category to obtain the corresponding target word segmentation result, counts the occurrence times of each word in the target word segmentation result, sorts the words according to their occurrence times, and obtains the target feature word set of each operable content category according to the sorting result and the preset feature word number. Finally, the terminal counts the coverage rate of each target feature word in the target feature word set in the historical data set corresponding to the content category, and calculates the feature weight of each target feature word according to the coverage rate, thereby constructing the operable content categories and the word library.
After the operable content category and the word stock are constructed, the terminal periodically crawls hot news by acquiring hot news text data in a news website, and word segmentation results of the hot news text data are obtained by word segmentation of the hot news text data, the word segmentation results of the hot news text data are matched with target feature words in target feature word sets of all the operable content categories in the industry field, hit feature words corresponding to all the operable content categories in the hot news text data are obtained, and the association degree of the hot news text data and all the operable content categories is obtained according to the hit feature words corresponding to all the operable content categories in the hot news text data and the feature weights of all the preset target feature words.
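The matching step can be sketched as below. How the feature weights of the hit feature words are aggregated into an association degree is not stated in this passage, so summing them is an assumption of the sketch; the function name, the sample tokens and the weights are hypothetical.

```python
def association_degrees(news_tokens, category_weights):
    """category_weights: category -> {target feature word: feature weight}."""
    tokens = set(news_tokens)
    hits, degrees = {}, {}
    for category, weights in category_weights.items():
        # Hit feature words: target feature words that occur in the
        # segmented hot news text.
        hit_words = sorted(tokens & set(weights))
        hits[category] = hit_words
        # Assumed aggregation: sum of the feature weights of the hit words.
        degrees[category] = sum(weights[w] for w in hit_words)
    return hits, degrees

# Segmented hot news text (the tokenizer itself is outside this sketch).
news_tokens = ["new", "flu", "wave", "brings", "cough", "and", "fever"]
category_weights = {
    "cold": {"cough": 1.0, "runny": 0.25},
    "flu": {"fever": 0.5, "cough": 0.5},
}

hits, degrees = association_degrees(news_tokens, category_weights)
print(hits, degrees)
```

In this toy example both categories end up with the same association degree, which is exactly the situation in which a content category set to be compared is built.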
After obtaining the association degree between the hot news text data and each operable content category, the terminal compares the association degrees and visually displays the association between each hot news item and the content categories. When the association degrees are all different, the terminal acquires the heat index of the hot news text data, sorts and displays the operable content categories according to the association degree and the heat index (as shown in fig. 4, wherein the association degree is represented by an association score and the basis of the association is the hit feature words), and determines the content category of the operation text to be edited for the content operation according to the preset news material selection number and the sorting result. When the same association degree exists, the terminal searches for the content categories corresponding to the same association degree to obtain a content category set to be compared, acquires the number of material selection materials and the audience parameters corresponding to each operable content category, calculates the total number of material selection materials according to the number of material selection materials corresponding to each operable content category, and obtains the news material selection index of each content category to be compared according to the audience parameters, the total number of material selection materials and the number of material selection materials corresponding to each content category to be compared in the set. The terminal then determines the association degree to be compared of each content category to be compared according to the association degree, the news material selection index and the preset material selection weight factor, acquires the heat index of the hot news text data, sorts and displays the operable content categories according to the association degree, the association degree to be compared and the heat index, and determines the content category of the operation text to be edited for the content operation according to the preset news material selection number and the sorting result.
After the content type of the operation text to be edited in the content operation is determined, a person responsible for text pushing can edit the operation text in combination with the responsible industry field and the content type to obtain an operation text to be pushed, and the operation text to be pushed is pushed to a user, so that accurate pushing of the operation text related to the hot news text data is realized.
It should be understood that, although the steps in the flowcharts of figs. 1-2 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated herein, the execution of the steps is not strictly limited to this order, and the steps may be executed in other orders. Moreover, at least some of the steps in figs. 1-2 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; nor are these sub-steps or stages necessarily performed in sequence, but they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a text content category obtaining apparatus, which may employ a software module or a hardware module, or a combination of both, as a part of a computer device, and specifically includes: an acquisition module 502, a word segmentation module 504, a matching module 506, a search module 508, and a processing module 510, wherein:
An acquisition module 502, configured to acquire hot news text data in a news website;
the word segmentation module 504 is configured to segment the hot news text data to obtain a word segmentation result of the hot news text data;
the matching module 506 is configured to match the word segmentation result of the hot news text data with a target feature word set of each content category in a preset industry field, and obtain a degree of association between the hot news text data and each content category according to the matching result;
the searching module 508 is configured to search for content categories corresponding to the same association degree when the same association degree exists, to obtain a content category set to be compared;
the processing module 510 is configured to obtain news material selection indexes of each content category to be compared in the content category set to be compared, and determine a content category of the text to be edited according to the association degree and the news material selection indexes.
The text content category acquisition device obtains hot news text data from a news website, segments the hot news text data to obtain its word segmentation result, matches the word segmentation result with the target feature word set of each content category in a preset industry field, and obtains the association degree between the hot news text data and each content category according to the matching result, so that the text content category can be acquired according to the association degree. When the same association degree exists, the device searches for the content categories corresponding to the same association degree to obtain a content category set to be compared, obtains the news material selection index of each content category to be compared in the set, and determines the content category of the text to be edited according to the association degree and the news material selection index. Throughout the process, the hot news text data is analyzed by word segmentation and by matching the word segmentation result against the target feature word sets to determine its association degree with each content category, so that the content category of the text to be edited can be accurately determined according to the association degree and the news material selection index.
In one embodiment, the matching module includes:
the matching unit is used for matching word segmentation results of the hot news text data with each target feature word in a target feature word set of each content category in a preset industry field to obtain hit feature words corresponding to each content category in the hot news text data;
and the association degree calculation unit is used for obtaining the association degree of the hot news text data and each content category according to hit feature words corresponding to each content category in the hot news text data and the preset feature weights of each target feature word.
In one embodiment, the text content category obtaining device further includes a word stock construction module, the word stock construction module is used for obtaining historical data carrying category labels, classifying the historical data according to the category labels to obtain historical data sets corresponding to each content category, segmenting each historical data in the historical data sets to obtain word segmentation results corresponding to each content category, screening the word segmentation results corresponding to each content category according to an initial feature word set preset for each content category to obtain target word segmentation results corresponding to each content category, counting the occurrence times of each word in the target word segmentation results corresponding to each content category, and determining a target feature word set of each content category and feature weights of each target feature word in the target feature word set according to the counted results.
In one embodiment, the word stock construction module further comprises:
the feature word selecting unit is used for sequencing the words according to the occurrence times of the words in the target word segmentation result, and obtaining a target feature word set of each content category according to the sequencing result and the preset feature word number;
the statistics unit is used for counting the coverage rate of each target feature word in the target feature word set in the historical data set corresponding to the content category;
and the weight calculation unit is used for calculating the feature weight of each target feature word according to the coverage rate.
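The three units above can be sketched for a single content category as follows. The description only states that the feature weight is calculated from the coverage rate, so normalising the coverage rates to sum to 1 is an assumption, as are the names `build_feature_weights`, `initial_words` and `feature_word_count`.

```python
from collections import Counter


def build_feature_weights(docs, initial_words, feature_word_count):
    """docs: segmented historical data of one content category (list of token lists)."""
    # Screen tokens with the preset initial feature word set, then count occurrences.
    counts = Counter(t for doc in docs for t in doc if t in initial_words)
    # Sort by occurrence count and keep the preset number of feature words.
    target_words = [word for word, _ in counts.most_common(feature_word_count)]
    doc_sets = [set(doc) for doc in docs]
    # Coverage rate: share of the category's historical data containing the word.
    coverage = {
        word: sum(word in s for s in doc_sets) / len(doc_sets)
        for word in target_words
    }
    # Assumption: feature weight = coverage rate normalised over the target set.
    total = sum(coverage.values())
    return {word: coverage[word] / total for word in target_words}


docs = [
    ["stock", "market"],
    ["stock", "bond"],
    ["stock", "market", "goal"],
    ["market"],
]
weights = build_feature_weights(docs, {"stock", "market", "bond"}, 2)
```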
In one embodiment, the text content category obtaining device further includes a comparison module, and the comparison module includes:
the first sorting unit is used for obtaining the heat index of the hot news text data when the association degrees are different, and sorting the content categories according to the heat index and the association degree;
the first material selecting unit is used for determining the content category of the text to be edited according to the preset news material selecting number and the sequencing result.
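A minimal sketch of the comparison module follows. How the heat index and the association degree are combined into one ordering is not stated here, so sorting by heat index first and association degree second (both descending) is an assumption; the tuple layout and the name `select_categories` are hypothetical.

```python
def select_categories(candidates, selection_count):
    """candidates: list of (category, heat_index, association_degree) tuples."""
    # Assumption: higher heat index ranks first, association degree breaks ties.
    ranked = sorted(candidates, key=lambda c: (c[1], c[2]), reverse=True)
    selected = []
    for category, _, _ in ranked:
        if category not in selected:  # keep each content category once
            selected.append(category)
    # Keep only the preset news material selection number of categories.
    return selected[:selection_count]


candidates = [("finance", 90, 0.5), ("sports", 90, 0.75), ("tech", 60, 1.0)]
chosen = select_categories(candidates, 2)
```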
In one embodiment, the processing module includes:
the data acquisition unit is used for acquiring the number of the material selection data and audience parameters corresponding to each content category;
the data total number calculation unit is used for calculating the total number of the selected materials according to the number of the selected materials corresponding to each content category;
and the news material selection index calculation unit is used for obtaining the news material selection index of each content category to be compared according to the audience parameter, the total number of material selection materials and the number of material selection materials corresponding to that content category.
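The exact formula for the news material selection index is not given in this passage. The sketch below assumes, purely for illustration, that the index is the audience parameter multiplied by the category's share of all material selection materials; the names `selection_indices`, `material_counts` and `audience_params` are hypothetical.

```python
def selection_indices(material_counts, audience_params, tied_categories):
    """Compute a news material selection index for each category to be compared.

    material_counts: {category: number of material selection materials}.
    audience_params: {category: audience parameter}.
    """
    # Total number of material selection materials over all content categories.
    total = sum(material_counts.values())
    if total == 0:
        return {category: 0.0 for category in tied_categories}
    # Assumed formula: audience parameter * (category count / total count).
    return {
        category: audience_params[category] * material_counts[category] / total
        for category in tied_categories
    }


material_counts = {"finance": 30, "sports": 10, "tech": 60}
audience_params = {"finance": 0.5, "sports": 1.0, "tech": 0.25}
indices = selection_indices(material_counts, audience_params, ["finance", "sports"])
```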
In one embodiment, the processing module includes:
the to-be-compared association degree calculation unit is used for determining the to-be-compared association degree of each content category to be compared according to the association degree, the news material selection index and the preset material selection weight factor, and obtaining the heat index of the hot news text data;
the second sorting unit is used for sorting the content categories according to the association degree, the to-be-compared association degree and the heat index;
and the second material selecting unit is used for determining the content category of the text to be edited according to the preset news material selecting number and the sequencing result.
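The tie-breaking path of the processing module can be sketched as below. The way the association degree, the news material selection index and the material selection weight factor are combined is assumed to be `association + factor * index`, and the sort order (association degree, then to-be-compared association degree, then heat index, all descending) is likewise an assumption. The name `rank_with_tiebreak` and the tuple layout are hypothetical.

```python
def rank_with_tiebreak(candidates, indices, weight_factor, selection_count):
    """candidates: list of (category, association_degree, heat_index) tuples.

    indices: {category: news material selection index} for tied categories only.
    """
    def sort_key(candidate):
        category, association, heat = candidate
        # Assumed combination; categories without an index keep their degree.
        to_compare = association + weight_factor * indices.get(category, 0.0)
        return (association, to_compare, heat)

    ranked = sorted(candidates, key=sort_key, reverse=True)
    # Keep the preset news material selection number of content categories.
    return [candidate[0] for candidate in ranked][:selection_count]


candidates = [("sports", 0.5, 80), ("finance", 0.5, 80), ("tech", 0.25, 95)]
indices = {"finance": 0.15, "sports": 0.1}
top = rank_with_tiebreak(candidates, indices, 0.5, 2)
```

Here the two tied categories are separated only by their selection index, while the category with a lower association degree stays last despite its higher heat index.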
In one embodiment, as shown in FIG. 6, a text content category obtaining apparatus is provided. The apparatus may be implemented as a software module, a hardware module, or a combination of the two, as part of a computer device, and specifically includes a word stock module, a news hot spot acquisition module, a word segmentation module, an indexing module and a display module, wherein:
the word stock module is used for obtaining historical data carrying category labels; classifying the historical data according to the category labels to obtain a historical data set corresponding to each content category; segmenting each piece of historical data in the historical data set to obtain a word segmentation result corresponding to each content category; screening the word segmentation result corresponding to each content category according to a preset initial feature word set of that content category to obtain a target word segmentation result corresponding to each content category; counting the number of occurrences of each word in the target word segmentation result corresponding to each content category; sorting the words according to their numbers of occurrences in the target word segmentation result, and obtaining a target feature word set of each content category according to the sorting result and the preset number of feature words; counting the coverage rate of each target feature word of the target feature word set in the historical data set corresponding to the content category; and calculating the feature weight of each target feature word according to the coverage rate;
the news hot spot acquisition module is used for acquiring hot news text data from a news website;
the word segmentation module is used for segmenting the hot news text data to obtain word segmentation results of the hot news text data;
the indexing module is used for matching the word segmentation result of the hot news text data with each target feature word in the target feature word set of each content category in the industry field to obtain the hit feature words corresponding to each content category in the hot news text data, and obtaining the association degree between the hot news text data and each content category according to the hit feature words and the preset feature weight of each target feature word; when the association degrees are different, obtaining the heat index of the hot news text data, sorting the content categories according to the heat index and the association degree, and determining the content category of the text to be edited according to the preset news material selection number and the sorting result; and when identical association degrees exist, searching for the content categories corresponding to the identical association degree to obtain a set of content categories to be compared, obtaining the number of material selection materials and the audience parameter corresponding to each content category, calculating the total number of material selection materials from the numbers corresponding to the content categories, obtaining the news material selection index of each content category to be compared according to the audience parameter, the total number of material selection materials and the number of material selection materials corresponding to that category, determining the to-be-compared association degree of each content category to be compared according to the association degree, the news material selection index and a preset material selection weight factor, obtaining the heat index of the hot news text data, sorting the content categories according to the association degree, the to-be-compared association degree and the heat index, and determining the content category of the text to be edited according to the preset news material selection number and the sorting result;
and the display module is used for displaying the result of sorting the content categories.
For specific limitations on the text content category obtaining device, reference may be made to the limitations on the text content category obtaining method above, which are not repeated here. Each module in the text content category obtaining device described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a text content category retrieval method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in FIG. 7 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that all or part of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored on a non-volatile computer-readable storage medium and, when executed, may include the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of them that involves no contradiction should be considered to fall within the scope of this description.
The above embodiments represent only a few implementations of the application. They are described in some detail, but they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of protection of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (12)

1. A text content category retrieval method, the method comprising:
acquiring hot news text data in a news website;
word segmentation is carried out on the hot news text data, and word segmentation results of the hot news text data are obtained;
matching word segmentation results of the hot news text data with target feature word sets of all content categories in a preset industry field, and obtaining association degrees of the hot news text data and all the content categories according to the matching results;
when the same association degree exists, searching content categories corresponding to the same association degree to obtain a content category set to be compared;
acquiring the number of material selection materials and audience parameters corresponding to each content category in the content category set to be compared, obtaining news material selection indexes of each content category to be compared in the content category set to be compared according to the number of material selection materials and the audience parameters, determining a to-be-compared association degree of each content category to be compared according to the association degree, the news material selection indexes and a preset material selection weight factor, acquiring the heat index of the hot news text data, sorting each content category according to the association degree, the to-be-compared association degree and the heat index, and determining the content category of the text to be edited according to the preset news material selection number and the sorting result; the preset news material selection number refers to the preset number of content categories to be selected;
and when the association degrees are different, sorting the content categories according to the heat index and the association degree, and determining the content category of the text to be edited according to the preset news material selection number and the sorting result.
2. The method of claim 1, wherein the matching the word segmentation result of the hot news text data with the target feature word set of each content category in the preset industry field, and obtaining the association degree between the hot news text data and each content category according to the matching result comprises:
matching word segmentation results of the hot news text data with each target feature word in a target feature word set of each content category in a preset industry field to obtain hit feature words corresponding to each content category in the hot news text data;
and obtaining the association degree of the hot news text data and each content category according to hit feature words corresponding to each content category in the hot news text data and the feature weights of preset target feature words.
3. The method according to claim 1, further comprising, before the matching the word segmentation result of the hot news text data with the target feature word set of each content category in the preset industry domain, obtaining the association degree between the hot news text data and each content category according to the matching result:
acquiring historical data carrying category labels;
classifying the historical data according to the category labels to obtain a historical data set corresponding to each content category;
performing word segmentation on each historical data in the historical data set to obtain word segmentation results corresponding to each content category;
screening word segmentation results corresponding to the content categories according to an initial feature word set of the preset content categories to obtain target word segmentation results corresponding to the content categories;
counting the occurrence times of each word in the target word segmentation result corresponding to each content category;
and determining the target feature word set of each content category and the feature weights of each target feature word in the target feature word set according to the statistical result.
4. The method of claim 3, wherein determining the set of target feature words for each content category and the feature weights for each target feature word in the set of target feature words based on the statistics comprises:
sequencing the words according to the occurrence times of the words in the target word segmentation result, and obtaining a target feature word set of each content category according to the sequencing result and the preset feature word number;
counting the coverage rate of each target feature word in the target feature word set in the historical data set corresponding to the content category;
and calculating the feature weight of each target feature word according to the coverage rate.
5. The method of claim 1, wherein obtaining news material selection indices for each of the content categories to be compared in the set of content categories to be compared according to the number of material selection materials and the audience parameter comprises:
calculating the total number of material selection materials according to the number of material selection materials;
and obtaining news material selection indexes of each content category to be compared according to the audience parameters, the total number of the material selection materials and the number of the material selection materials.
6. A text content category retrieval device, the device comprising:
the acquisition module is used for acquiring hot news text data in a news website;
the word segmentation module is used for segmenting the hot news text data to obtain word segmentation results of the hot news text data;
the matching module is used for matching the word segmentation result of the hot news text data with a target feature word set of each content category in a preset industry field, and obtaining the association degree of the hot news text data and each content category according to the matching result;
the searching module is used for searching the content category corresponding to the same association degree when the same association degree exists, so as to obtain a content category set to be compared;
the processing module is used for acquiring the number of material selection materials and audience parameters corresponding to each content category in the content category set to be compared, obtaining news material selection indexes of each content category to be compared in the content category set to be compared according to the number of material selection materials and the audience parameters, determining a to-be-compared association degree of each content category to be compared according to the association degree, the news material selection indexes and a preset material selection weight factor, acquiring the heat index of the hot news text data, sorting each content category according to the association degree, the to-be-compared association degree and the heat index, and determining the content category of the text to be edited according to the preset news material selection number and the sorting result; the preset news material selection number refers to the preset number of content categories to be selected;
and the comparison module is used for sorting the content categories according to the heat index and the association degree when the association degrees are different, and determining the content category of the text to be edited according to the preset news material selection number and the sorting result.
7. The apparatus of claim 6, wherein the matching module comprises:
the matching unit is used for matching the word segmentation result of the hot news text data with each target feature word in the target feature word set of each content category in the preset industry field to obtain hit feature words corresponding to each content category in the hot news text data;
and the association degree calculation unit is used for obtaining the association degree of the hot news text data and each content category according to hit feature words corresponding to each content category in the hot news text data and preset feature weights of each target feature word.
8. The apparatus of claim 6, further comprising a word stock construction module, wherein the word stock construction module is configured to obtain historical data carrying category labels, classify the historical data according to the category labels to obtain a set of historical data corresponding to each content category, segment each historical data in the set of historical data to obtain a segmented result corresponding to each content category, filter the segmented result corresponding to each content category according to an initial set of feature words preset for each content category to obtain a target segmented result corresponding to each content category, count the number of occurrences of each word in the target segmented result corresponding to each content category, and determine a set of target feature words of each content category and feature weights of each target feature word in the set of target feature words according to the counted result.
9. The apparatus of claim 8, wherein the thesaurus construction module further comprises:
the characteristic word selecting unit is used for sorting the words according to the occurrence times of the words in the target word segmentation result and obtaining target characteristic word sets of the content categories according to the sorting result and the preset characteristic word number;
the statistics unit is used for counting the coverage rate of each target feature word in the target feature word set in the historical data set corresponding to the content category;
and the weight calculation unit is used for calculating the feature weight of each target feature word according to the coverage rate.
10. The apparatus of claim 6, wherein the processing module comprises:
the data total number calculation unit is used for calculating the total number of material selection materials according to the number of material selection materials;
and the news material selection index calculation unit is used for obtaining news material selection indexes of various content categories to be compared according to audience parameters, the total number of material selection materials and the number of the material selection materials.
11. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when the computer program is executed.
12. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 5.
CN202010301372.3A 2020-04-16 2020-04-16 Text content category acquisition method, apparatus, computer device and storage medium Active CN111506727B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010301372.3A CN111506727B (en) 2020-04-16 2020-04-16 Text content category acquisition method, apparatus, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010301372.3A CN111506727B (en) 2020-04-16 2020-04-16 Text content category acquisition method, apparatus, computer device and storage medium

Publications (2)

Publication Number Publication Date
CN111506727A CN111506727A (en) 2020-08-07
CN111506727B true CN111506727B (en) 2023-10-03

Family

ID=71874352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010301372.3A Active CN111506727B (en) 2020-04-16 2020-04-16 Text content category acquisition method, apparatus, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN111506727B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015864A (en) * 2020-08-26 2020-12-01 深圳市金蝶天燕云计算股份有限公司 Information query method and related equipment
CN112395881B (en) * 2020-11-27 2022-12-13 北京筑龙信息技术有限责任公司 Material label construction method and device, readable storage medium and electronic equipment
CN113779969B (en) * 2021-09-16 2024-09-20 平安国际智慧城市科技股份有限公司 Case information processing method, device, equipment and medium based on artificial intelligence
CN115544250B (en) * 2022-09-01 2023-06-23 睿智合创(北京)科技有限公司 Data processing method and system
CN116701561B (en) * 2023-06-09 2024-04-26 读书郎教育科技有限公司 Learning resource collection method matched with dictionary pen and system thereof

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107844559A (en) * 2017-10-31 2018-03-27 国信优易数据有限公司 A kind of file classifying method, device and electronic equipment
CN108334610A (en) * 2018-02-06 2018-07-27 北京神州泰岳软件股份有限公司 A kind of newsletter archive sorting technique, device and server
CN109376237A (en) * 2018-09-04 2019-02-22 中国平安人寿保险股份有限公司 Prediction technique, device, computer equipment and the storage medium of client's stability
CN109635082A (en) * 2018-11-26 2019-04-16 平安科技(深圳)有限公司 Policy implication analysis method, device, computer equipment and storage medium
CN109657137A (en) * 2018-11-26 2019-04-19 平安科技(深圳)有限公司 Public sentiment news category model building method, device, computer equipment and storage medium
JP2019109662A (en) * 2017-12-18 2019-07-04 ヤフー株式会社 Classification device, data structure, classification method, and program
WO2019184217A1 (en) * 2018-03-26 2019-10-03 平安科技(深圳)有限公司 Hotspot event classification method and apparatus, and storage medium
CN110895586A (en) * 2018-08-22 2020-03-20 腾讯科技(深圳)有限公司 Method and device for generating news page, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10452727B2 (en) * 2011-09-26 2019-10-22 Oath Inc. Method and system for dynamically providing contextually relevant news based on an article displayed on a web page
US10489438B2 (en) * 2016-05-19 2019-11-26 Conduent Business Services, Llc Method and system for data processing for text classification of a target domain
US10210157B2 (en) * 2016-06-16 2019-02-19 Conduent Business Services, Llc Method and system for data processing for real-time text analysis
US11106716B2 (en) * 2017-11-13 2021-08-31 Accenture Global Solutions Limited Automatic hierarchical classification and metadata identification of document using machine learning and fuzzy matching

Also Published As

Publication number Publication date
CN111506727A (en) 2020-08-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant