CN109241297B - Content classification and aggregation method, electronic equipment, storage medium and engine

Info

Publication number: CN109241297B
Application number: CN201810744608.3A
Authority: CN
Inventors: 李剑; 陈星�
Original assignee: Guangzhou Pinwei Software Co Ltd
Current assignee: Guangzhou Pinwei Software Co Ltd
Priority date: 2018-07-09
Filing date: 2018-07-09
Publication date: 2022-04-19
Anticipated expiration: 2038-07-09
Also published as: CN109241297A

Abstract

The invention provides a content classification and aggregation method comprising the following steps: when neither the original article content nor the article content to be tested is a comment article, establishing attribute labels for the original article content according to its type, and establishing a mapping relation between the attribute labels and the original article content; using a word segmenter to deconstruct the different types of original article content, extracting the high-frequency word group corresponding to each original article content, and establishing a mapping relation between each high-frequency word group and an attribute label; inputting each high-frequency word group into a respective linear model to be trained, and obtaining a trained linear model corresponding to each attribute label; and screening the article content to be tested with the different trained linear models and matching it to the corresponding attribute label. The method reduces labor cost, allows the article content to be tested to be presented to the user under different attribute labels according to the label it matched, and greatly improves the user experience.

Description

Content classification and aggregation method, electronic equipment, storage medium and engine

Technical Field

The present invention relates to the field of natural language processing, and in particular, to a content classification and aggregation method, an electronic device, a storage medium, and an engine.

Background

Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Natural language processing is not a general study of natural language but is directed to the development of computer systems, and particularly software systems therein, that can efficiently implement natural language communications.

At present, every platform has a concept of content-driven shopping guidance, and good content makes users more loyal. For example, a makeup album may appeal to female users, and a fitness or outdoor album may appeal to male users. Such albums can also be combined with the shelves and goods of the shopping platform, which increases user stickiness and supports content-driven shopping guidance. As the number of articles written for various commodities grows, along with the number of crawled articles, managing and reusing those articles becomes a problem. At present the articles are labeled manually, which significantly increases labor cost, and once the number of articles grows too large, manual labor can no longer cope.

Disclosure of Invention

In order to overcome the defects of the prior art, a first object of the present invention is to provide a content classification and aggregation method that solves the problem that articles are currently labeled manually, which significantly increases labor cost and can no longer cope once the number of articles grows too large.

A second object of the present invention is to provide an electronic device that solves the same problem.

A third object of the present invention is to provide a computer storage medium that solves the same problem.

A fourth object of the present invention is to provide a content classification and aggregation engine that solves the same problem.

The first purpose of the invention is realized by adopting the following technical scheme:

a method for content classification aggregation, comprising:

establishing article labels: acquiring different types of original article content and article content to be tested on an online platform; when neither the original article content nor the article content to be tested is a comment article, establishing attribute labels corresponding to the original article content according to its type, and establishing a mapping relation between the attribute labels and the original article content;

summarizing high-frequency words: deconstructing the different types of original article content with a word segmenter, extracting the high-frequency word group corresponding to each original article content, and establishing a mapping relation between each high-frequency word group and the attribute label;

establishing linear models: inputting each high-frequency word group into a respective one of a plurality of linear models to be trained, and training to obtain a trained linear model corresponding to each attribute label;

content classification: screening the article content to be tested with the different trained linear models and matching it to the corresponding attribute label.

Further, when the original article content and the article content to be tested are both comment articles, the following steps are executed:

establishing a hot word bank: acquiring real comments from a plurality of online platforms and building the hot word bank from those comments;

organizing the hot word bank: classifying the real comments in the hot word bank by attribute to obtain word-count attributes and quality attributes;

enriching the hot word bank: deriving a near-synonym bank from the hot word bank using word2vec, and iterating step by step over the real comments of each word-count attribute with the near-synonym bank to obtain an enriched hot word bank;

comment classification: inputting the hot word bank and the article content to be tested into a greedy matching model for classification, wherein the greedy matching model matches the content against the hot word bank to obtain the corresponding quality attribute.

Further, organizing the hot word bank specifically comprises classifying the real comments in the hot word bank by word count and by quality, the quality attributes being good comments, medium comments and bad comments.

Further, each high-frequency word group comprises a plurality of high-frequency words, and the method further comprises standardizing the high-frequency words before the linear models are established: counting the current number of occurrences of each high-frequency word in the corresponding original article, together with the maximum and minimum numbers of occurrences in the original article content; calculating the weight of each high-frequency word from the current, maximum and minimum numbers of occurrences; and ranking the high-frequency words in each high-frequency word group by weight.

Further, the content classification specifically comprises: inputting the article content to be tested into each of the different trained linear models, each trained linear model outputting a corresponding phasor value; selecting the trained linear model with the largest phasor value; and selecting the corresponding attribute label according to that trained linear model.

Further, the attribute labels may be women's clothing, gourmet food, digital technology, movies, fresh style and antiques, and the original article contents are correspondingly women's clothing articles, gourmet food articles, digital technology articles, movie articles, fresh-style articles and antique articles.

The second purpose of the invention is realized by adopting the following technical scheme:

an electronic device, comprising: a processor;

a memory; and a program, wherein the program is stored in the memory and configured to be executed by the processor, the program comprising instructions for performing a content classification aggregation method of the present invention.

The third purpose of the invention is realized by adopting the following technical scheme:

a computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program is executed by a processor to perform a content classification aggregation method of the present invention.

The fourth purpose of the invention is realized by adopting the following technical scheme:

a content classification aggregation engine, comprising:

an article label establishing module, configured to acquire different types of original article content and article content to be tested on an online platform and, when neither the original article content nor the article content to be tested is a comment article, to establish attribute labels corresponding to the original article content according to its type and to establish a mapping relation between the attribute labels and the original article content;

a high-frequency word summarizing module, configured to deconstruct the different types of original article content with a word segmenter, to extract the high-frequency word group corresponding to each original article content, and to establish a mapping relation between each high-frequency word group and the attribute label;

a linear model establishing module, configured to input each high-frequency word group into a respective one of a plurality of linear models to be trained and to obtain a trained linear model corresponding to each attribute label;

and a content classification module, configured to screen the article content to be tested with the different trained linear models and to match it to the corresponding attribute label.

Further, when the original article content and the article content to be tested are both comment articles, the engine comprises:

a hot word bank establishing module, configured to acquire real comments from a plurality of online platforms and to build a hot word bank from those comments;

a hot word bank organizing module, configured to classify the real comments in the hot word bank by attribute and to obtain word-count attributes and quality attributes;

a hot word bank enriching module, configured to derive a near-synonym bank from the hot word bank using word2vec, and to iterate step by step over the real comments of each word-count attribute with the near-synonym bank to obtain an enriched hot word bank;

and a comment classification module, configured to input the hot word bank and the article content to be tested into a greedy matching model for classification, wherein the greedy matching model matches the content against the hot word bank to obtain the corresponding quality attribute.

Compared with the prior art, the invention has the following beneficial effects. In the content classification and aggregation method of the invention, the original article content is classified and corresponding attribute labels are established. A word segmenter deconstructs the different types of original article content, the high-frequency word group corresponding to each original article content is extracted, and a mapping relation is established between the high-frequency word groups and the attribute labels. The high-frequency word groups are input into linear models, yielding a trained linear model corresponding to each attribute label. The trained linear models are then used to screen the article content to be tested and to match it to the corresponding attribute label, which establishes a correspondence between the article content to be tested and the attribute labels; classification and aggregation are performed according to that correspondence. This way of classifying no longer requires manual intervention: the article content to be tested is classified automatically, classification accuracy improves, and labor cost falls. The article content to be tested can also be presented to users under different attribute labels according to the label it matched, which greatly improves the user experience.

The foregoing is only an overview of the technical solutions of the present invention. So that the technical solutions can be understood more clearly and implemented in accordance with the description, preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow chart of a content aggregation method of the present invention;

FIG. 2 is a block diagram of a content aggregation engine of the present invention;

FIG. 3 is a diagram illustrating operation of a content aggregation engine according to the present invention in an operating state;

FIG. 4 is a first schematic diagram of a display interface of a content aggregation engine in an operating state according to the present invention;

FIG. 5 is a second schematic diagram of a display interface of a content aggregation engine of the present invention in an operating state.

Detailed Description

The present invention will be further described with reference to the accompanying drawings and the detailed description, and it should be noted that any combination of the embodiments or technical features described below can be used to form a new embodiment without conflict.

As shown in fig. 1, a content classification and aggregation method of the present invention includes the following steps:

Establishing article labels: different types of original article content and article content to be tested are acquired on an online platform; when neither is a comment article, attribute labels corresponding to the original article content are established according to its type, and a mapping relation is established between the attribute labels and the original article content. In this embodiment, the different types of original article content are acquired from online platforms with a crawler tool. By type, the content can be divided into women's clothing, gourmet food, digital technology, movies, fresh style, antiques and the like; it can also be divided into comment articles and non-comment articles. When neither the original article content nor the article content to be tested is a comment article, and the original article content was originally produced by a user or by a platform professional, attribute labels are first established by category (women's clothing, gourmet food, digital technology, movies, fresh style, antiques and the like), a mapping relation is established between each attribute label and each original article content, and all original article content is classified by attribute label. In this embodiment, each attribute label corresponds to at least one thousand original articles.
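By way of illustration only, the mapping between attribute labels and original article content described above might be sketched as follows. The tuple layout, the function name `build_label_mapping` and the sample article ids are assumptions introduced for this sketch and do not appear in the original disclosure.

```python
from collections import defaultdict

# A few of the attribute labels named in the embodiment (illustrative subset).
ATTRIBUTE_LABELS = ["women's clothing", "gourmet food", "movies", "antiques"]


def build_label_mapping(articles):
    """Map each attribute label to the ids of the articles carrying it.

    `articles` is an iterable of (article_id, label, is_comment) tuples.
    Comment articles are skipped, because the embodiment routes them to
    the separate hot-word-bank path.
    """
    mapping = defaultdict(list)
    for article_id, label, is_comment in articles:
        if is_comment:
            continue
        if label not in ATTRIBUTE_LABELS:
            raise ValueError(f"unknown attribute label: {label!r}")
        mapping[label].append(article_id)
    return dict(mapping)


# Hypothetical sample data.
sample_articles = [
    ("a1", "gourmet food", False),
    ("a2", "movies", False),
    ("a3", "gourmet food", False),
    ("c1", "movies", True),  # comment article, excluded from this path
]
label_to_articles = build_label_mapping(sample_articles)
```

In practice each label would hold at least one thousand articles, as the embodiment states; the four-item sample only shows the shape of the mapping.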

Summarizing high-frequency words: the different types of original article content are deconstructed with a word segmenter, the high-frequency word group corresponding to each original article content is extracted, and a mapping relation is established between each high-frequency word group and the attribute label. In this embodiment, IKAnalyzer and a coding classifier are used to extract high-frequency words across the different types of original article content. Positive keywords and negative keywords are first extracted from each original article content (positive keywords are usually selected from the top 50 high-frequency words of the article's own category, and negative keywords from the top 3 or top 5 of articles in other categories); both the positive and the negative keywords are high-frequency words. Each high-frequency word is then standardized: the current number of occurrences of the word in the corresponding original article is counted as a, the maximum number of occurrences in the original article content is maxHot, and the minimum number of occurrences is minHot. The weight of each high-frequency word is calculated from these three values, and the high-frequency words in each high-frequency word group are ranked by weight, as given by equation (1):

$$\mathrm{weight} = \frac{a - \mathrm{minHot}}{\mathrm{maxHot} - \mathrm{minHot}} \tag{1}$$

Wherein a is the current occurrence number, maxHot is the maximum occurrence number, and minHot is the minimum occurrence number.
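The standardization of equation (1) is a min-max normalization of occurrence counts. A minimal sketch follows, assuming the article has already been segmented into a list of tokens (the embodiment uses IKAnalyzer for that step, which is not reproduced here). The function name, the `top_k` parameter and the fallback weight of 1.0 when all counts are equal are assumptions of this sketch; equation (1) itself is undefined in that case.

```python
from collections import Counter


def high_frequency_weights(tokens, top_k=50):
    """Rank the top_k most frequent tokens by the weight of equation (1)."""
    counts = Counter(tokens)
    if not counts:
        return []
    max_hot = max(counts.values())  # maxHot: largest occurrence count
    min_hot = min(counts.values())  # minHot: smallest occurrence count
    span = max_hot - min_hot
    weighted = []
    for word, a in counts.most_common(top_k):
        # Assumption: when every word occurs equally often, use weight 1.0.
        weight = (a - min_hot) / span if span else 1.0
        weighted.append((word, weight))
    # Rank the high-frequency words by descending weight.
    return sorted(weighted, key=lambda item: item[1], reverse=True)


# Hypothetical segmented article: counts are 5, 3 and 1.
tokens = ["lipstick"] * 5 + ["foundation"] * 3 + ["brush"]
ranked = high_frequency_weights(tokens)
```

With counts of 5, 3 and 1, maxHot is 5 and minHot is 1, so the weights come out as 1.0, 0.5 and 0.0 respectively.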

Establishing linear models: each high-frequency word group is input into a respective one of a plurality of linear models to be trained, and training yields a trained linear model corresponding to each attribute label. In this embodiment, a plurality of trained linear models can be established according to the types of the high-frequency words, and the weights of the models are made to converge using the sigmoid function.
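The disclosure does not specify the training procedure beyond a linear model per attribute label and the use of the sigmoid function. One plausible reading, offered here only as a sketch, is one-versus-rest logistic regression trained by stochastic gradient descent. The learning rate, epoch count, feature dictionaries and all function names below are assumptions, not details from the original text.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_model(samples, vocabulary, epochs=200, lr=0.5):
    """Fit weights w and bias b by gradient descent on the log loss.

    `samples` is a list of (features, target) pairs, where `features`
    maps a high-frequency word to its weight and `target` is 1 or 0.
    """
    w = {word: 0.0 for word in vocabulary}
    b = 0.0
    for _ in range(epochs):
        for features, target in samples:
            z = b + sum(w[t] * v for t, v in features.items() if t in w)
            error = sigmoid(z) - target
            b -= lr * error
            for t, v in features.items():
                if t in w:
                    w[t] -= lr * error * v
    return w, b


def score(model, features):
    """Return the model's output in (0, 1) for one article."""
    w, b = model
    return sigmoid(b + sum(w.get(t, 0.0) * v for t, v in features.items()))


# Hypothetical weighted word groups, one list of articles per label.
training = {
    "gourmet food": [{"recipe": 1.0, "flavor": 0.6}, {"recipe": 0.8, "dessert": 1.0}],
    "movies": [{"director": 1.0, "plot": 0.7}, {"plot": 1.0, "actor": 0.5}],
}
vocabulary = sorted({t for docs in training.values() for d in docs for t in d})

# One model per attribute label: its own articles are positives, the rest negatives.
models = {}
for label in training:
    samples = [(doc, 1 if other == label else 0)
               for other, docs in training.items() for doc in docs]
    models[label] = train_linear_model(samples, vocabulary)
```

Treating the other labels' articles as negatives loosely mirrors the positive and negative keywords of the embodiment, although the exact correspondence is not stated in the source.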

Content classification: the article content to be tested is screened with the different trained linear models and matched to the corresponding attribute label. Specifically, the article content to be tested is input into each of the different trained linear models, each trained linear model outputs a corresponding phasor value, the trained linear model with the largest phasor value is selected, and the corresponding attribute label is selected according to that model. In other words, the higher the phasor value output by a trained linear model, the closer the article content to be tested is to the attribute label of that model. The attribute of the article content to be tested is therefore decided by the highest phasor value, which achieves accurate and reasonable classification and aggregation of the different article contents to be tested.
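Selecting the label whose model outputs the largest value can be sketched as below. The "phasor value" is interpreted here as a scalar score from a sigmoid over a linear combination, which is an assumption; the hand-written model weights are hypothetical stand-ins for trained models.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical trained models: label -> (word weights, bias).
TRAINED_MODELS = {
    "gourmet food": ({"recipe": 2.0, "flavor": 1.5, "plot": -1.0}, -0.5),
    "movies": ({"plot": 2.0, "director": 1.8, "recipe": -1.0}, -0.5),
    "antiques": ({"porcelain": 2.2, "dynasty": 1.6}, -0.5),
}


def classify(features, models=TRAINED_MODELS):
    """Score the article with every model and return the best label."""
    scores = {}
    for label, (weights, bias) in models.items():
        z = bias + sum(weights.get(t, 0.0) * v for t, v in features.items())
        scores[label] = sigmoid(z)
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical article to be tested, already reduced to weighted words.
best_label, all_scores = classify({"plot": 1.0, "director": 0.4})
```

For this input the movies model receives the largest linear response, so `best_label` is `"movies"`.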

In this embodiment, when the original article content and the article content to be tested are both comment articles, the comments are classified and aggregated by performing the following steps:

Establishing a hot word bank: real comments are acquired from a plurality of online platforms, and the hot word bank is built from them. In this embodiment, 900,000 real comments are collected to form the hot word bank.

Organizing the hot word bank: the real comments in the hot word bank are classified by attribute to obtain word-count attributes and quality attributes. The quality attributes are good comments, medium comments and bad comments. The comments are first divided by word count into 1-word, 2-word, 3-word, 4-word and 5-word comments, and the comments of each word count are then divided by quality into good, medium and bad comments. The comments in the hot word bank are thus organized by both word count and quality.
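The two-level organization by word count and quality can be sketched as a nested dictionary. Counting words by whitespace is an assumption made for English sample data (the original comments would be counted differently), and the function name and sample comments are hypothetical.

```python
from collections import defaultdict

QUALITIES = ("good", "medium", "bad")


def organize_hot_word_bank(comments, max_words=5):
    """Group (text, quality) pairs by word count, then by quality.

    Comments longer than `max_words` are left out, matching the
    1-word to 5-word buckets of the embodiment.
    """
    bank = defaultdict(lambda: defaultdict(list))
    for text, quality in comments:
        if quality not in QUALITIES:
            raise ValueError(f"unknown quality attribute: {quality!r}")
        word_count = len(text.split())
        if 1 <= word_count <= max_words:
            bank[word_count][quality].append(text)
    return {n: dict(by_quality) for n, by_quality in sorted(bank.items())}


# Hypothetical real comments with their quality attributes.
raw_comments = [
    ("great", "good"),
    ("too small", "bad"),
    ("fits well", "good"),
    ("it is okay", "medium"),
    ("this comment is far too long to keep", "bad"),
]
organized = organize_hot_word_bank(raw_comments)
```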

Enriching the hot word bank: a near-synonym bank is derived from the hot word bank using word2vec, and the real comments of each word-count attribute are iterated over step by step with the near-synonym bank to obtain an enriched hot word bank. That is, word2vec is applied step by step to the 1-word, 2-word, 3-word, 4-word and 5-word real comments, which enriches the hot word bank.
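The enrichment step is not detailed in the source. One reading, sketched here under stated assumptions, is to substitute near-synonyms into existing comments, working upward from the 1-word bucket. The `near_synonyms` table is a hand-written stand-in for neighbors obtained from a trained word2vec model (for instance via `most_similar` on gensim's `KeyedVectors`); the function name and data are hypothetical.

```python
def enrich_hot_word_bank(bank, near_synonyms):
    """Add synonym-substituted variants to each bucket of the bank.

    `bank` maps word count -> quality -> list of comments, and
    `near_synonyms` maps a word to a list of near-synonyms.
    """
    enriched = {}
    for word_count in sorted(bank):  # iterate from 1-word comments upward
        enriched[word_count] = {}
        for quality, comments in bank[word_count].items():
            entries = list(comments)
            for comment in comments:
                tokens = comment.split()
                for i, token in enumerate(tokens):
                    for synonym in near_synonyms.get(token, ()):
                        variant = " ".join(tokens[:i] + [synonym] + tokens[i + 1:])
                        if variant not in entries:
                            entries.append(variant)
            enriched[word_count][quality] = entries
    return enriched


# Hypothetical organized bank and near-synonym table.
small_bank = {1: {"good": ["great"]}, 2: {"bad": ["too small"]}}
synonym_table = {"great": ["excellent", "superb"], "small": ["tiny"]}
enriched_bank = enrich_hot_word_bank(small_bank, synonym_table)
```

A substituted variant inherits the quality attribute of the comment it was derived from, which holds only if the near-synonyms preserve sentiment; that would need checking on real data.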

Comment classification: the hot word bank and the article content to be tested are input into a greedy matching model for classification, and the greedy matching model matches the content against the hot word bank to obtain the corresponding quality attribute. The enriched hot word bank is loaded into the greedy matching model, which classifies and aggregates the article content to be tested according to a greedy matching strategy; the strategy in this embodiment combines strict and loose matching. Finally, all comments are displayed to the user according to their quality attributes, that is, the good comments are displayed to the user together on the same page.
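The source does not define the greedy matching strategy beyond calling it strict and loose. A minimal sketch under one interpretation follows: the strict pass requires the whole comment to equal a hot-word entry, and the loose pass greedily takes the longest entry found inside the comment. Both passes, the function name, and the sample bank are assumptions of this sketch.

```python
def greedy_match(text, hot_word_bank, max_words=5):
    """Return (quality, matched_phrase), or (None, None) if nothing matches.

    `hot_word_bank` maps a hot-word phrase to its quality attribute.
    """
    tokens = text.lower().split()
    normalized = " ".join(tokens)
    # Strict pass: the whole comment is itself a hot-word entry.
    if normalized in hot_word_bank:
        return hot_word_bank[normalized], normalized
    # Loose pass: try the longest candidate phrases first (greedy).
    for size in range(min(len(tokens), max_words), 0, -1):
        for start in range(len(tokens) - size + 1):
            phrase = " ".join(tokens[start:start + size])
            if phrase in hot_word_bank:
                return hot_word_bank[phrase], phrase
    return None, None


# Hypothetical flattened hot word bank.
flat_bank = {
    "great": "good",
    "not great": "bad",
    "too small": "bad",
    "it is okay": "medium",
}
strict_result = greedy_match("Great", flat_bank)
loose_result = greedy_match("The fabric is not great", flat_bank)
no_result = greedy_match("arrived yesterday", flat_bank)
```

Trying longer phrases first is what lets "not great" win over the shorter entry "great" in the second example. Punctuation handling is omitted for brevity.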

The present invention provides an electronic device including: a processor;

The present invention provides a computer-readable storage medium having stored thereon a computer program characterized in that: the computer program is executed by a processor to perform a content classification aggregation method of the present invention.

As shown in FIG. 2, the present invention provides a content classification and aggregation engine, comprising: an article label establishing module, configured to acquire different types of original article content and article content to be tested on an online platform and, when neither the original article content nor the article content to be tested is a comment article, to establish attribute labels corresponding to the original article content according to its type and to establish a mapping relation between the attribute labels and the original article content;

and a content classification module, configured to screen the article content to be tested with the different trained linear models and to match it to the corresponding attribute label. FIG. 5 shows the display interface after the article content to be tested has been classified and aggregated: the article content is divided under the attribute labels make-up, wearing, family, and mother and infant, and similar article content to be tested is displayed under each attribute label category.

and a comment classification module, configured to input the hot word bank and the comment articles into a greedy matching model for classification, wherein the greedy matching model matches the comment articles against the hot word bank to obtain the corresponding quality attributes. Finally, as shown in FIG. 4, all comments are displayed to the user according to their quality attributes, that is, the good comments are displayed to the user together on the same page.

As shown in FIG. 3, when the content classification and aggregation engine of this embodiment is applied, the image-text information and other items in the shared data are first cached, and the engine then classifies and aggregates the shared data. A worker configures a service list on the content service platform through the content service management system. The content service platform places the classified and aggregated shared data into the configured service list and publishes it through a single external interface for display in windows such as discovery, arrival, live broadcast, activity and sub-channels.

In the content classification and aggregation method of the invention, the original article content is classified and corresponding attribute labels are established. A word segmenter deconstructs the different types of original article content, the high-frequency word group corresponding to each original article content is extracted, and a mapping relation is established between the high-frequency word groups and the attribute labels. The high-frequency word groups are input into linear models, yielding a trained linear model corresponding to each attribute label. The trained linear models are then used to screen the article content to be tested and to match it to the corresponding attribute label, which establishes a correspondence between the article content to be tested and the attribute labels; classification and aggregation are performed according to that correspondence. This way of classifying no longer requires manual intervention: the article content to be tested is classified automatically, classification accuracy improves, and labor cost falls. The article content to be tested can also be presented to users under different attribute labels according to the label it matched, which greatly improves the user experience.

The foregoing is merely a preferred embodiment of the invention and is not intended to limit the invention in any manner; those skilled in the art can readily practice the invention as shown and described in the drawings and detailed description herein; however, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the scope of the invention as defined by the appended claims; meanwhile, any changes, modifications, and evolutions of the equivalent changes of the above embodiments according to the actual techniques of the present invention are still within the protection scope of the technical solution of the present invention.

Claims

1. A method for content classification aggregation, comprising:

classifying the content, screening the content of the article to be tested according to different trained linear models, and matching corresponding attribute labels;

the content classification specifically comprises: respectively inputting the contents of articles to be tested into different trained linear models, outputting a corresponding phasor value by each trained linear model, screening out the trained linear model corresponding to the maximum phasor value, and screening out the corresponding attribute label according to the trained linear model;

each high-frequency word group comprises a plurality of high-frequency words, high-frequency word standardization processing is further included before the linear model is established, the current occurrence frequency of each high-frequency word in the corresponding original article is counted, and the maximum occurrence frequency and the minimum occurrence frequency in the content of the original article are counted; and calculating the weight corresponding to the high-frequency words according to the current occurrence frequency, the maximum occurrence frequency and the minimum occurrence frequency, and performing weight sequencing on the high-frequency words in each high-frequency word group according to the weight.

2. The content classification and aggregation method according to claim 1, wherein: when the original article content and the article content to be detected are both comment articles, executing the following steps:

3. The content classification and aggregation method according to claim 2, wherein: organizing the hot word bank specifically comprises classifying a plurality of real comments in the hot word bank by word count and by quality, wherein the quality attributes are good comments, medium comments and bad comments.

4. The content classification and aggregation method according to claim 1, wherein: the attribute labels may be women's clothing, gourmet food, digital technology, movies, fresh style and antiques, and the original article contents are women's clothing articles, gourmet food articles, digital technology articles, movie articles, fresh-style articles and antique articles.

5. An electronic device, characterized by comprising: a processor;

a memory; and a program, wherein the program is stored in the memory and configured to be executed by the processor, the program comprising instructions for carrying out the method of any one of claims 1-4.

6. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program is executed by a processor for performing the method according to any of claims 1-4.

7. A content classification aggregation engine, comprising:

the content classification module is used for screening the contents of the article to be tested according to different trained linear models and matching the contents to obtain corresponding attribute labels;

each high-frequency phrase comprises a plurality of high-frequency words, and the method further comprises the following steps before the high-frequency phrases are respectively input into a plurality of linear models to be trained for training and the trained linear models corresponding to the attribute labels are obtained:

counting the current occurrence frequency of each high-frequency vocabulary in the corresponding original article, wherein the maximum occurrence frequency and the minimum occurrence frequency in the content of the original article are counted; and calculating the weight corresponding to the high-frequency words according to the current occurrence frequency, the maximum occurrence frequency and the minimum occurrence frequency, and performing weight sequencing on the high-frequency words in each high-frequency word group according to the weight.

8. The content classification aggregation engine of claim 7, wherein: when the original article content and the article content to be tested are both comment articles, the method comprises the following steps: