CN109241297B - Content classification and aggregation method, electronic equipment, storage medium and engine - Google Patents


Info

Publication number
CN109241297B
Authority
CN
China
Prior art keywords
frequency
content
word
article
establishing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810744608.3A
Other languages
Chinese (zh)
Other versions
CN109241297A (en)
Inventor
李剑
陈星�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Pinwei Software Co Ltd
Original Assignee
Guangzhou Pinwei Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Pinwei Software Co Ltd filed Critical Guangzhou Pinwei Software Co Ltd
Priority to CN201810744608.3A priority Critical patent/CN109241297B/en
Publication of CN109241297A publication Critical patent/CN109241297A/en
Application granted granted Critical
Publication of CN109241297B publication Critical patent/CN109241297B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a content classification and aggregation method comprising the following steps: when the original article content and the article content to be tested are not comment articles, establishing attribute labels corresponding to the original article content according to its type, and establishing a mapping relation between the attribute labels and the original article content; deconstructing the different types of original article content with a word segmenter, extracting the high-frequency phrases corresponding to each original article content, and establishing a mapping relation between each high-frequency phrase and an attribute label; inputting the high-frequency phrases into a plurality of linear models to be trained, and obtaining a trained linear model corresponding to each attribute label; and screening the article content to be tested with the different trained linear models and matching it to the corresponding attribute label. The method reduces labor cost, allows the article content to be tested to be presented to the user grouped under its matched attribute labels, and greatly improves the user experience.

Description

Content classification and aggregation method, electronic equipment, storage medium and engine
Technical Field
The present invention relates to the field of natural language processing, and in particular, to a content classification and aggregation method, an electronic device, a storage medium, and an engine.
Background
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Natural language processing is not a general study of natural language but is directed to the development of computer systems, and particularly software systems therein, that can efficiently implement natural language communications.
At present, every platform has adopted the concept of content-based shopping guidance, and good content makes users more engaged. For example, a makeup album may appeal to female users, and a fitness and outdoor album may appeal to male users. Such albums can also be combined with the shelves and goods of the shopping platform, which increases user stickiness and supports content-based shopping guidance. As the number of articles written for various commodities and the number of crawled articles grow, managing and reusing these articles becomes a problem. At present the articles are labeled manually, which significantly increases labor cost, and once the number of articles grows too large, manual labeling can no longer keep up.
Disclosure of Invention
In order to overcome the defects of the prior art, a first object of the present invention is to provide a content classification and aggregation method that solves the problem that articles are currently labeled manually, which significantly increases labor cost and cannot keep up once the number of articles grows too large.
A second object of the present invention is to provide an electronic device that solves the same problem.
A third object of the present invention is to provide a computer storage medium that solves the same problem.
A fourth object of the present invention is to provide a content classification and aggregation engine that solves the same problem.
One of the purposes of the invention is realized by adopting the following technical scheme:
a method for content classification aggregation, comprising:
establishing article labels, acquiring different types of original article contents and article contents to be detected on an online platform, establishing attribute labels corresponding to the original article contents according to different types when the original article contents and the article contents to be detected are not comment articles, and establishing a mapping relation between the attribute labels and the original article contents;
inducing high-frequency words, deconstructing the original article contents of different types by adopting a word segmentation device, respectively extracting high-frequency word groups corresponding to the original article contents, and establishing a mapping relation between each high-frequency word group and the attribute label;
establishing a linear model, and inputting each high-frequency phrase into a plurality of linear models to be trained respectively for training to obtain a trained linear model corresponding to the attribute label;
and content classification, namely screening the contents of the article to be tested according to different trained linear models and matching the contents to obtain the corresponding attribute labels.
Further, when the original article content and the article content to be tested are both comment articles, the following steps are executed:
establishing a hot word bank, acquiring real comments of a plurality of online platforms, and establishing the hot word bank according to the real comments;
sorting a hot word bank, and carrying out attribute classification on a plurality of real comments in the hot word bank to obtain word number attributes and quality attributes;
enriching a hot word bank, deducing a near-sense word bank from the hot word bank by using word2vec, and gradually iterating the real comments with different word number attributes by using the near-sense word bank to obtain an enriched hot word bank;
and comment classification, namely inputting the hot word bank and the article content to be detected into a greedy matching model for classification, wherein the greedy matching model is used for matching the hot word bank to obtain the corresponding quality attribute.
Further, the sorting hot word library specifically classifies a plurality of real comments in the hot word library according to the number of words and quality, and the quality attributes are good comments, bad comments and medium comments.
Further, each high-frequency word group comprises a plurality of high-frequency words, high-frequency word standardization processing is further included before the linear model is established, the current occurrence frequency of each high-frequency word in the corresponding original article is counted, and the maximum occurrence frequency and the minimum occurrence frequency in the content of the original article are counted; and calculating the weight corresponding to the high-frequency words according to the current occurrence frequency, the maximum occurrence frequency and the minimum occurrence frequency, and performing weight sequencing on the high-frequency words in each high-frequency word group according to the weight.
Further, the content classification specifically includes: inputting the content of the article to be tested into each of the different trained linear models, each trained linear model outputting a corresponding phasor value; selecting the trained linear model with the largest phasor value; and selecting the attribute label corresponding to that trained linear model.
Further, the attribute tags can be women's dresses, gourmet, digital science and technology, movies, fresher and antique, and the original article contents are women's dress articles, gourmet articles, digital science and technology articles, movies articles, fresher articles and antique articles.
The second purpose of the invention is realized by adopting the following technical scheme:
an electronic device, comprising: a processor;
a memory; and a program, wherein the program is stored in the memory and configured to be executed by the processor, the program comprising instructions for performing a content classification aggregation method of the present invention.
The third purpose of the invention is realized by adopting the following technical scheme:
a computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program is executed by a processor to perform a content classification aggregation method of the present invention.
The fourth purpose of the invention is realized by adopting the following technical scheme:
a content classification aggregation engine, comprising:
an article label establishing module, used for acquiring different types of original article contents and article contents to be detected on an online platform, and, when the original article contents and the article contents to be detected are not comment articles, establishing attribute labels corresponding to the original article contents according to different types, and establishing a mapping relation between the attribute labels and the original article contents;
the high-frequency word induction module is used for deconstructing the original article contents of different types by adopting a word segmentation device, respectively extracting high-frequency word groups corresponding to the original article contents, and establishing a mapping relation between each high-frequency word group and the attribute label;
the linear model building module is used for inputting each high-frequency phrase into a plurality of linear models to be trained respectively for training and obtaining a trained linear model corresponding to the attribute label;
and the content classification module is used for screening the contents of the article to be tested according to different trained linear models and matching the corresponding attribute labels.
Further, when the original article content and the article content to be tested are both comment articles, the engine further comprises:
a hot word bank establishing module, used for acquiring real comments from a plurality of online platforms and establishing a hot word bank according to the real comments;
a hot word bank sorting module, used for carrying out attribute classification on a plurality of real comments in the hot word bank and obtaining word number attributes and quality attributes;
a hot word bank enriching module, used for deriving a near-synonym bank from the hot word bank by using word2vec, and gradually iterating over the real comments with different word number attributes by using the near-synonym bank to obtain an enriched hot word bank;
and a comment classification module, used for inputting the hot word bank and the content of the article to be tested into a greedy matching model for classification, the greedy matching model matching against the hot word bank to obtain the corresponding quality attribute.
Compared with the prior art, the invention has the following beneficial effects. The content classification and aggregation method classifies the original article contents and establishes corresponding attribute labels; deconstructs the different types of original article contents with a word segmenter and extracts the high-frequency phrases corresponding to each original article content; establishes a mapping relation between the high-frequency phrases and the attribute labels; and inputs the high-frequency phrases into linear models, thereby obtaining a trained linear model corresponding to each attribute label. The trained linear models are then used to screen the article content to be tested and match it to the corresponding attribute label, a correspondence between the article content to be tested and the attribute label is established, and classification and aggregation are performed according to that correspondence. This classification no longer requires manual intervention: the article content to be tested is classified automatically, which improves classification accuracy and reduces labor cost, and the article content to be tested can be presented to users grouped under its attribute labels, which greatly improves the user experience.
The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical solutions of the present invention more clearly understood and to implement them in accordance with the contents of the description, the following detailed description is given with reference to the preferred embodiments of the present invention and the accompanying drawings. The detailed description of the present invention is given in detail by the following examples and the accompanying drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of a content aggregation method of the present invention;
FIG. 2 is a block diagram of a content aggregation engine of the present invention;
FIG. 3 is a diagram illustrating operation of a content aggregation engine according to the present invention in an operating state;
FIG. 4 is a first schematic diagram of a display interface of a content aggregation engine in an operating state according to the present invention;
FIG. 5 is a second schematic diagram of a display interface of a content aggregation engine in an operating state according to the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and the detailed description, and it should be noted that any combination of the embodiments or technical features described below can be used to form a new embodiment without conflict.
As shown in fig. 1, a content classification and aggregation method of the present invention includes the following steps:
establishing article labels: acquiring different types of original article contents and article contents to be tested on an online platform; when the original article contents and the article contents to be tested are not comment articles, establishing attribute labels corresponding to the original article contents according to their types, and establishing a mapping relation between the attribute labels and the original article contents. In this embodiment, different types of original article contents are acquired from online platforms using a crawler tool. By category they can be divided into women's dresses, gourmet, digital science and technology, movies, fresher, antique and the like, and they can also be divided into comment articles and non-comment articles. When the original article content and the article content to be tested are not comment articles, and the original article content is produced by users or by platform professionals, attribute labels are first established according to category (women's dresses, gourmet, digital science and technology, movies, fresher, antique and the like), a mapping relation is established between each attribute label and each original article content, and all original article contents are classified according to the attribute labels. In this embodiment, each attribute label corresponds to at least one thousand original article contents.
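By way of a non-limiting illustration, the label-to-article mapping described above could be sketched as follows. The function names, the tuple layout of an article record, and the check for under-filled labels are assumptions made for this example only; the description itself specifies only the mapping and the minimum of one thousand articles per label.

```python
from collections import defaultdict

# The embodiment states at least one thousand original articles per attribute label.
MIN_ARTICLES_PER_LABEL = 1000


def build_label_index(articles):
    """Map each attribute label to the ids of its non-comment original articles.

    `articles` is an iterable of (article_id, label, is_comment) tuples;
    this record layout is hypothetical.
    """
    index = defaultdict(list)
    for article_id, label, is_comment in articles:
        if is_comment:
            # Comment articles follow the hot-word-bank path instead.
            continue
        index[label].append(article_id)
    return dict(index)


def underfilled_labels(index, minimum=MIN_ARTICLES_PER_LABEL):
    """Return the labels that have fewer articles than the stated minimum."""
    return sorted(label for label, ids in index.items() if len(ids) < minimum)
```

A caller might run `underfilled_labels` before training to see which categories still need more crawled articles.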
Inducing high-frequency words: deconstructing the different types of original article contents with a word segmenter, extracting the high-frequency phrases corresponding to each original article content, and establishing a mapping relation between each high-frequency phrase and the attribute label. In this embodiment, IKAnalyzer and a coding classifier are applied to the different types of original article contents to extract high-frequency words. Positive keywords and negative keywords are first extracted from each original article content (the positive keywords are usually taken from the top 50 high-frequency words of the article's own category, and the negative keywords from the top 3 or top 5 of the other categories); both the positive and the negative keywords are high-frequency words. Each high-frequency word is then standardized: the current number of occurrences of the high-frequency word in the corresponding original article is counted as a, the maximum number of occurrences in the original article content is maxHot, and the minimum number of occurrences is minHot. The weight of each high-frequency word is calculated from the current, maximum and minimum numbers of occurrences, and the high-frequency words in each high-frequency phrase are sorted by weight. With specific reference to equation (1):
$$\mathrm{weight} = \frac{a - \mathrm{minHot}}{\mathrm{maxHot} - \mathrm{minHot}} \tag{1}$$
Wherein a is the current occurrence number, maxHot is the maximum occurrence number, and minHot is the minimum occurrence number.
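By way of a non-limiting illustration, equation (1) could be applied as in the following sketch. It assumes the article has already been segmented into tokens (the embodiment uses IKAnalyzer for this), that maxHot and minHot are taken over the selected high-frequency words, and that a weight of 1.0 is used when all counts are equal; none of these choices is stated in the description, and the function name is hypothetical.

```python
from collections import Counter


def min_max_weights(tokens, top_k=50):
    """Return (word, weight) pairs for the top_k most frequent tokens, sorted by weight.

    weight = (a - minHot) / (maxHot - minHot), as in equation (1).
    """
    counts = Counter(tokens).most_common(top_k)
    if not counts:
        return []
    max_hot = counts[0][1]   # most_common() is sorted by descending count
    min_hot = counts[-1][1]
    span = max_hot - min_hot
    weighted = []
    for word, a in counts:
        # Assumption: when every count is equal the ratio is undefined, so use 1.0.
        weight = (a - min_hot) / span if span else 1.0
        weighted.append((word, weight))
    weighted.sort(key=lambda pair: pair[1], reverse=True)
    return weighted
```

For example, counts of 5, 3 and 1 give weights of 1.0, 0.5 and 0.0 respectively, so the weights always fall in the range from 0 to 1 regardless of article length.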
Establishing a linear model: inputting each high-frequency phrase into a plurality of linear models to be trained for training, and obtaining a trained linear model corresponding to each attribute label. In this embodiment, a plurality of trained linear models can be established according to the categories of the high-frequency words, and the weights of the trained models are made to converge using the sigmoid function.
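By way of a non-limiting illustration, one linear model per attribute label could be trained as in the following sketch. The description does not give the training procedure; this example assumes a logistic-regression style update (a linear score passed through the sigmoid function and fitted by stochastic gradient descent), with the word weights from the standardization step as features. All function names and hyperparameters are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_label_model(samples, vocabulary, epochs=200, lr=0.5):
    """Train one linear model for a single attribute label.

    `samples` is a list of (features, target) pairs, where `features` maps a
    high-frequency word to its weight and `target` is 1 for articles carrying
    this label and 0 otherwise. Returns (weights, bias).
    """
    weights = {word: 0.0 for word in vocabulary}
    bias = 0.0
    for _ in range(epochs):
        for features, target in samples:
            z = bias + sum(weights[w] * x for w, x in features.items() if w in weights)
            error = sigmoid(z) - target  # gradient of the log loss w.r.t. z
            bias -= lr * error
            for w, x in features.items():
                if w in weights:
                    weights[w] -= lr * error * x
    return weights, bias


def predict(model, features):
    """Return the sigmoid output of a trained model for one article."""
    weights, bias = model
    return sigmoid(bias + sum(weights.get(w, 0.0) * x for w, x in features.items()))
```

Repeating `train_label_model` once per attribute label, with that label's articles as positive samples and other labels' articles as negative samples, yields the plurality of trained models referred to above.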
Content classification: screening the content of the article to be tested with the different trained linear models and matching it to the corresponding attribute label. The content of the article to be tested is input into each of the different trained linear models, each trained linear model outputs a corresponding phasor value, the trained linear model with the largest phasor value is selected, and the corresponding attribute label is selected according to that model. In other words, the higher the phasor value output by a trained linear model, the closer the article content to be tested is to the attribute label of that model; the attribute of the article content to be tested is therefore decided by the highest phasor value, which achieves accurate and reasonable classification and aggregation of different article contents to be tested.
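By way of a non-limiting illustration, selecting the attribute label from the largest model output could be sketched as follows. The model representation (a weight dictionary plus a bias per label) and the function names are assumptions made for this example; the output of each model plays the role of the phasor value in the description.

```python
import math


def score(model, features):
    """Sigmoid output of one trained linear model for one article."""
    weights, bias = model
    z = bias + sum(weights.get(word, 0.0) * value for word, value in features.items())
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(models, features):
    """Score the article with every label's model and pick the largest output.

    `models` maps an attribute label to its (weights, bias) pair.
    Returns (best_label, scores).
    """
    scores = {label: score(model, features) for label, model in models.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores
```

Returning the full score dictionary alongside the winning label makes it possible to inspect how close the runner-up categories were.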
In this embodiment, when the original article content and the article content to be tested are both comment articles, the comments are classified and aggregated by performing the following steps:
establishing a hot word bank: acquiring real comments from a plurality of online platforms, and establishing the hot word bank from the real comments. In this embodiment, 900,000 real online comments are collected to form the hot word bank.
Sorting the hot word bank: performing attribute classification on the real comments in the hot word bank to obtain word number attributes and quality attributes. The real comments in the hot word bank are classified by number of words and by quality, the quality attributes being good, medium and bad. The comments are first divided by number of words into 1-word, 2-word, 3-word, 4-word and 5-word comments, and each of these groups is then divided by quality attribute into good comments, medium comments and bad comments. The comments in the hot word bank are thus sorted by quality and number of words.
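By way of a non-limiting illustration, the two-level grouping by word number and quality could be sketched as follows. The example assumes that each comment has already been segmented into tokens and already carries a quality attribute; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def organize_hot_words(comments, max_words=5):
    """Group comments by (word count, quality attribute).

    `comments` is an iterable of (tokens, quality) pairs, with quality one of
    "good", "medium" or "bad". Comments longer than `max_words` are skipped,
    since the embodiment only describes 1-word to 5-word comments.
    """
    bank = defaultdict(set)
    for tokens, quality in comments:
        n = len(tokens)
        if 1 <= n <= max_words:
            bank[(n, quality)].add(" ".join(tokens))
    return dict(bank)
```

Keying the bank on the pair (word count, quality) lets later steps process the 1-word group first and then move on to longer comments.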
Enriching the hot word bank: deriving a near-synonym bank from the hot word bank using word2vec, and iterating step by step over the real comments with different word number attributes using the near-synonym bank to obtain an enriched hot word bank. The 1-word, 2-word, 3-word, 4-word and 5-word real comments are iterated step by step using word2vec, thereby enriching the hot word bank.
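By way of a non-limiting illustration, the iterative enrichment could be sketched as follows. To keep the example self-contained it does not train word2vec; it assumes that word vectors are already available as plain lists of floats (as a trained word2vec model would supply) and treats words whose cosine similarity exceeds a threshold as near-synonyms. The threshold, the number of rounds and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def enrich(bank, vectors, threshold=0.9, rounds=2):
    """Iteratively add near-synonyms to each quality group of the hot word bank.

    `bank` maps a quality attribute to a set of words; `vectors` maps a word
    to its embedding. Each round adds every word whose vector is close to a
    word already present in the group.
    """
    enriched = {quality: set(words) for quality, words in bank.items()}
    for _ in range(rounds):
        for words in enriched.values():
            known = [w for w in words if w in vectors]
            additions = {
                candidate
                for candidate, vec in vectors.items()
                if candidate not in words
                and any(cosine(vec, vectors[w]) >= threshold for w in known)
            }
            words |= additions
    return enriched
```

Running more than one round lets a word added in an earlier round pull in its own near-synonyms, which is one reading of the step-by-step iteration described above.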
Comment classification: inputting the hot word bank and the article content to be tested into a greedy matching model for classification, the greedy matching model matching against the hot word bank to obtain the corresponding quality attribute. The enriched hot word bank is imported into the greedy matching model, which classifies and aggregates the article content to be tested according to a greedy matching strategy; the greedy matching strategy in this embodiment is strict and loose. Finally, all comments are displayed to the user by quality attribute, that is, the good comments are displayed to the user together on the same page.
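By way of a non-limiting illustration, a greedy matching step could be sketched as follows. The description does not define the greedy matching model in detail; this example assumes a left-to-right longest-match scan against the hot word bank followed by a majority vote over the matched quality attributes, and it does not attempt to model the strict and loose variants. The function names, the default of "medium" for unmatched comments and the data layout are hypothetical.

```python
def greedy_match(tokens, bank, max_words=5):
    """Scan left to right, always taking the longest phrase found in the bank.

    `bank` maps a tuple of tokens to a quality attribute.
    Returns a list of (phrase, quality) hits.
    """
    hits = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in bank:
                hits.append((phrase, bank[phrase]))
                i += n
                break
        else:
            i += 1  # no bank phrase starts at this position
    return hits


def classify_comment(tokens, bank, default="medium"):
    """Return the quality attribute matched most often, or `default` if none."""
    hits = greedy_match(tokens, bank)
    if not hits:
        return default
    tally = {}
    for _, quality in hits:
        tally[quality] = tally.get(quality, 0) + 1
    return max(tally, key=tally.get)
```

Preferring the longest phrase means that a 2-word entry such as "not good" takes precedence over the 1-word entry "good" contained in it.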
The present invention provides an electronic device including: a processor;
a memory; and a program, wherein the program is stored in the memory and configured to be executed by the processor, the program comprising instructions for performing a content classification aggregation method of the present invention.
The present invention provides a computer-readable storage medium having stored thereon a computer program characterized in that: the computer program is executed by a processor to perform a content classification aggregation method of the present invention.
As shown in fig. 2, the present invention provides a content classification and aggregation engine, comprising: an article label establishing module, used for acquiring different types of original article contents and article contents to be detected on an online platform, and, when the original article contents and the article contents to be detected are not comment articles, establishing attribute labels corresponding to the original article contents according to different types, and establishing a mapping relation between the attribute labels and the original article contents;
the high-frequency word induction module is used for deconstructing the original article contents of different types by adopting a word segmentation device, respectively extracting high-frequency word groups corresponding to the original article contents, and establishing a mapping relation between each high-frequency word group and the attribute label;
the linear model building module is used for inputting each high-frequency phrase into a plurality of linear models to be trained respectively for training and obtaining a trained linear model corresponding to the attribute label;
and the content classification module is used for screening the contents of the article to be tested according to different trained linear models and matching the corresponding attribute labels. After the articles to be tested have been classified and aggregated, the display interface is as shown in fig. 5: the article contents are divided under the attribute labels make-up, wearing, family, and mother and infant, and the similar article contents to be tested are displayed under each attribute label category.
Further, when the original article content and the article content to be tested are both comment articles, the engine further comprises:
a hot word bank establishing module, used for acquiring real comments from a plurality of online platforms and establishing a hot word bank according to the real comments;
a hot word bank sorting module, used for carrying out attribute classification on a plurality of real comments in the hot word bank and obtaining word number attributes and quality attributes;
a hot word bank enriching module, used for deriving a near-synonym bank from the hot word bank by using word2vec, and gradually iterating over the real comments with different word number attributes by using the near-synonym bank to obtain an enriched hot word bank;
and a comment classification module, used for inputting the hot word bank and the comment articles into a greedy matching model for classification, the greedy matching model matching against the hot word bank to obtain the corresponding quality attribute. Finally, as shown in fig. 4, all the comments are displayed to the user according to the quality attributes, that is, the good comments are displayed to the user together on the same page.
As shown in fig. 3, when the content classification and aggregation engine of this embodiment is applied, image-text information and the like in the shared data are first cached, and the engine then classifies and aggregates the shared data. A worker configures a service list on a content service platform through a content service management system; the content service platform places the classified and aggregated shared data into the configured service list and publishes it through a single external interface for display in windows such as discovery, arrival, live broadcast, activity and sub-channels.
In summary, the content classification and aggregation method of the invention classifies the original article contents and establishes corresponding attribute labels; deconstructs the different types of original article contents with a word segmenter and extracts the high-frequency phrases corresponding to each original article content; establishes a mapping relation between the high-frequency phrases and the attribute labels; and inputs the high-frequency phrases into linear models, thereby obtaining a trained linear model corresponding to each attribute label. The trained linear models are then used to screen the article content to be tested and match it to the corresponding attribute label, a correspondence between the article content to be tested and the attribute label is established, and classification and aggregation are performed according to that correspondence. This classification no longer requires manual intervention: the article content to be tested is classified automatically, which improves classification accuracy and reduces labor cost, and the article content to be tested can be presented to users grouped under its attribute labels, which greatly improves the user experience.
The foregoing is merely a preferred embodiment of the invention and is not intended to limit the invention in any manner; those skilled in the art can readily practice the invention as shown and described in the drawings and detailed description herein; however, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the scope of the invention as defined by the appended claims; meanwhile, any changes, modifications, and evolutions of the equivalent changes of the above embodiments according to the actual techniques of the present invention are still within the protection scope of the technical solution of the present invention.

Claims (8)

1. A method for content classification aggregation, comprising:
establishing article labels, acquiring different types of original article contents and article contents to be tested on an online platform, establishing attribute labels corresponding to the original article contents according to their types when the original article contents and the article contents to be tested are not comment articles, and establishing a mapping relation between the attribute labels and the original article contents;
summarizing high-frequency words, deconstructing the different types of original article contents with a word segmenter, respectively extracting the high-frequency phrases corresponding to the original article contents, and establishing a mapping relation between each high-frequency phrase and the attribute label;
establishing a linear model, and inputting each high-frequency phrase into a plurality of linear models to be trained respectively for training to obtain a trained linear model corresponding to the attribute label;
classifying the content, screening the content of the article to be tested according to different trained linear models, and matching corresponding attribute labels;
the content classification specifically comprises: respectively inputting the contents of articles to be tested into different trained linear models, outputting a corresponding phasor value by each trained linear model, screening out the trained linear model corresponding to the maximum phasor value, and screening out the corresponding attribute label according to the trained linear model;
each high-frequency phrase comprises a plurality of high-frequency words, and the method further comprises high-frequency word standardization processing before the linear model is established: counting the current occurrence frequency of each high-frequency word in the corresponding original article, and counting the maximum occurrence frequency and the minimum occurrence frequency in the content of the original article; and calculating the weight corresponding to each high-frequency word according to the current occurrence frequency, the maximum occurrence frequency and the minimum occurrence frequency, and sorting the high-frequency words in each high-frequency phrase by weight.
2. The content classification and aggregation method according to claim 1, wherein: when the original article content and the article content to be tested are both comment articles, executing the following steps:
establishing a hot word bank, acquiring real comments of a plurality of online platforms, and establishing the hot word bank according to the real comments;
organizing the hot word bank, and carrying out attribute classification on a plurality of real comments in the hot word bank to obtain word count attributes and quality attributes;
enriching the hot word bank, deriving a near-synonym bank from the hot word bank by using word2vec, and gradually iterating over the real comments with different word count attributes by using the near-synonym bank to obtain an enriched hot word bank;
and comment classification, namely inputting the hot word bank and the article content to be detected into a greedy matching model for classification, wherein the greedy matching model is used for matching the hot word bank to obtain the corresponding quality attribute.
3. The content classification and aggregation method according to claim 2, wherein: organizing the hot word bank specifically comprises classifying a plurality of real comments in the hot word bank according to word count and quality, wherein the quality attributes are good comments, bad comments and medium comments.
4. The content classification and aggregation method according to claim 1, wherein: the attribute labels can be women's clothes, gourmet, digital science and technology, movies, fresher and antique, and the original article contents are women's clothes articles, gourmet articles, digital science and technology articles, movies articles, fresher articles and antique articles.
5. An electronic device, characterized by comprising: a processor;
a memory; and a program, wherein the program is stored in the memory and configured to be executed by the processor, the program comprising instructions for carrying out the method of any one of claims 1-4.
6. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program is executed by a processor for performing the method according to any of claims 1-4.
7. A content classification aggregation engine, comprising:
an article label establishing module, wherein the article label establishing module is used for acquiring different types of original article contents and article contents to be tested on an online platform, and when the original article contents and the article contents to be tested are not comment articles, establishing attribute labels corresponding to the original article contents according to their types, and establishing a mapping relation between the attribute labels and the original article contents;
the high-frequency word induction module is used for deconstructing the original article contents of different types by adopting a word segmentation device, respectively extracting high-frequency word groups corresponding to the original article contents, and establishing a mapping relation between each high-frequency word group and the attribute label;
the linear model building module is used for inputting each high-frequency phrase into a plurality of linear models to be trained respectively for training and obtaining a trained linear model corresponding to the attribute label;
the content classification module is used for screening the contents of the article to be tested according to different trained linear models and matching the contents to obtain corresponding attribute labels;
each high-frequency phrase comprises a plurality of high-frequency words, and the method further comprises the following steps before the high-frequency phrases are respectively input into a plurality of linear models to be trained for training and the trained linear models corresponding to the attribute labels are obtained:
counting the current occurrence frequency of each high-frequency vocabulary in the corresponding original article, wherein the maximum occurrence frequency and the minimum occurrence frequency in the content of the original article are counted; and calculating the weight corresponding to the high-frequency words according to the current occurrence frequency, the maximum occurrence frequency and the minimum occurrence frequency, and performing weight sequencing on the high-frequency words in each high-frequency word group according to the weight.
8. The content classification aggregation engine of claim 7, wherein: when the original article content and the article content to be tested are both comment articles, the engine further comprises:
a hot word library establishing module, wherein the hot word library establishing module is used for acquiring real comments of a plurality of online platforms and establishing a hot word library according to the real comments;
the hot word database arrangement module is used for carrying out attribute classification on a plurality of real comments in the hot word database and obtaining word number attributes and quality attributes;
the hot word library enrichment module is used for deriving a near-synonym library from the hot word library by using word2vec, gradually iterating over the real comments with different word count attributes by using the near-synonym library and obtaining an enriched hot word library;
and the comment classification module is used for inputting the hot word library and the content of the article to be tested into a greedy matching model for classification, and the greedy matching model is used for matching against the hot word library to obtain the corresponding quality attribute.
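The comment branch recited in claims 2 and 8 can be sketched in Python as follows. The claims do not define the greedy matching model, so this sketch assumes a left-to-right, longest-match-first scan of the comment against the hot word bank, followed by a majority vote over the quality attributes of the matched hot words. The hot word entries, the function names and the fallback to "medium" when nothing matches are all hypothetical; the word2vec enrichment step is omitted and the bank is written by hand.

```python
from collections import Counter

# Hypothetical hot word bank: hot word -> quality attribute.
HOT_WORDS = {
    "very good": "good",
    "good": "good",
    "not good": "bad",
    "poor quality": "bad",
    "average": "medium",
}


def greedy_match(comment, hot_words):
    """Scan left to right, always consuming the longest hot word at the cursor."""
    by_length = sorted(hot_words, key=len, reverse=True)
    hits = []
    i = 0
    while i < len(comment):
        for phrase in by_length:
            if comment.startswith(phrase, i):
                hits.append(phrase)
                i += len(phrase)  # skip past the matched hot word
                break
        else:
            i += 1  # no hot word starts here; advance one character
    return hits


def classify_comment(comment, hot_words, default="medium"):
    """Return the most frequent quality attribute among the matched hot words."""
    votes = Counter(hot_words[p] for p in greedy_match(comment.lower(), hot_words))
    return votes.most_common(1)[0][0] if votes else default


print(classify_comment("Not good, poor quality", HOT_WORDS))  # prints: bad
```

Matching the longest hot word first keeps "not good" from being counted as the shorter hot word "good"; character-level scanning is used because segmented input is not assumed for comments.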
CN201810744608.3A 2018-07-09 2018-07-09 Content classification and aggregation method, electronic equipment, storage medium and engine Active CN109241297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810744608.3A CN109241297B (en) 2018-07-09 2018-07-09 Content classification and aggregation method, electronic equipment, storage medium and engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810744608.3A CN109241297B (en) 2018-07-09 2018-07-09 Content classification and aggregation method, electronic equipment, storage medium and engine

Publications (2)

Publication Number Publication Date
CN109241297A CN109241297A (en) 2019-01-18
CN109241297B true CN109241297B (en) 2022-04-19

Family

ID=65071818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810744608.3A Active CN109241297B (en) 2018-07-09 2018-07-09 Content classification and aggregation method, electronic equipment, storage medium and engine

Country Status (1)

Country Link
CN (1) CN109241297B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020729B (en) * 2019-03-05 2021-03-16 中国联合网络通信集团有限公司 Article review method and device based on artificial intelligence
CN110413759A (en) * 2019-07-31 2019-11-05 杭州凡闻科技有限公司 A kind of multi-platform user interaction data analysis method and system for from media
CN110955816B (en) * 2019-11-08 2022-11-08 广州坚和网络科技有限公司 Method for aggregating subject content based on content label
CN111177369A (en) * 2019-11-19 2020-05-19 厦门二五八网络科技集团股份有限公司 Method and device for automatically classifying labels of articles
CN111159347B (en) * 2019-12-30 2023-03-21 掌阅科技股份有限公司 Article content quality data calculation method, calculation device and storage medium
CN112131346B (en) * 2020-09-25 2024-04-30 北京达佳互联信息技术有限公司 Comment aggregation method and device, storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207913A (en) * 2013-04-15 2013-07-17 武汉理工大学 Method and system for acquiring commodity fine-grained semantic relation
CN105740389A (en) * 2016-01-27 2016-07-06 上海晶赞科技发展有限公司 Classification method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104331394A (en) * 2014-08-29 2015-02-04 南通大学 Text classification method based on viewpoint
US10354009B2 (en) * 2016-08-24 2019-07-16 Microsoft Technology Licensing, Llc Characteristic-pattern analysis of text

Also Published As

Publication number Publication date
CN109241297A (en) 2019-01-18

Similar Documents

Publication Publication Date Title
CN109241297B (en) Content classification and aggregation method, electronic equipment, storage medium and engine
CN108182279B (en) Object classification method, device and computer equipment based on text feature
CN110377804A (en) Method for pushing, device, system and the storage medium of training course data
CN107145485B (en) Method and apparatus for compressing topic models
CN111104526A (en) Financial label extraction method and system based on keyword semantics
CN109711925A (en) Cross-domain recommending data processing method, cross-domain recommender system with multiple auxiliary domains
CN108897784A (en) One emergency event dimensional analytic system based on social media
CN107944911A (en) A kind of recommendation method of the commending system based on text analyzing
CN112948575B (en) Text data processing method, apparatus and computer readable storage medium
CN110297888A (en) A kind of domain classification method based on prefix trees and Recognition with Recurrent Neural Network
CN110245228A (en) The method and apparatus for determining text categories
CN110807086A (en) Text data labeling method and device, storage medium and electronic equipment
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics
CN110110035A (en) Data processing method and device and computer readable storage medium
CN108932648A (en) A kind of method and apparatus for predicting its model of item property data and training
Mozafari et al. Emotion detection by using similarity techniques
CN113407644A (en) Enterprise industry secondary industry multi-label classifier based on deep learning algorithm
CN104778205B (en) A kind of mobile application sequence and clustering method based on Heterogeneous Information network
CN108268461A (en) A kind of document sorting apparatus based on hybrid classifer
CN113743079A (en) Text similarity calculation method and device based on co-occurrence entity interaction graph
CN113590809A (en) Method and device for automatically generating referee document abstract
CN115248890A (en) User interest portrait generation method and device, electronic equipment and storage medium
CN107368610A (en) Big text CRF and rule classification method and system based on full text
CN107506407A (en) A kind of document classification, the method and device called
CN109284376A (en) Cross-cutting news data sentiment analysis method based on domain-adaptive

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant