CN109241297A

CN109241297A - Content classification and aggregation method, electronic device, storage medium and engine

Info

Publication number: CN109241297A
Application number: CN201810744608.3A
Authority: CN
Inventors: 李剑; 陈星�
Original assignee: Guangzhou Pinwei Software Co Ltd
Current assignee: Guangzhou Pinwei Software Co Ltd
Priority date: 2018-07-09
Filing date: 2018-07-09
Publication date: 2019-01-18
Anticipated expiration: 2038-07-09
Also published as: CN109241297B

Abstract

The present invention provides a content classification and aggregation method, comprising: when the original article content and the article content to be tested are not comment-type articles, establishing attribute tags corresponding to the original article content according to its categories, and establishing a mapping between the attribute tags and the original article content; deconstructing the original article content of each category with a word segmenter, extracting the high-frequency phrase corresponding to each piece of original article content, and establishing a mapping between each high-frequency phrase and the attribute tags; inputting each high-frequency phrase into several linear models to be trained, and training them to obtain trained linear models corresponding to the attribute tags; and screening the article content to be tested with the different trained linear models to match the corresponding attribute tag. The method reduces labor costs and, using the attribute tags matched to the article content to be tested, can present that content to the user all at once, grouped by attribute tag, which greatly improves the user experience.

Description

Content classification and aggregation method, electronic device, storage medium and engine

Technical field

The present invention relates to the field of natural language processing, and in particular to a content classification and aggregation method, an electronic device, a storage medium, and an engine.

Background

Natural language processing (NLP) is an important direction in computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. NLP is a discipline that integrates linguistics, computer science, and mathematics. Its aim is not the general study of natural language but the development of computer systems, and in particular software systems, that can effectively communicate in natural language.

Every platform now has some notion of content-driven shopping guidance, and good content improves user stickiness. For example, makeup collections effectively attract female users, and fitness and outdoor collections effectively attract male users. These collections can also be combined with a shopping platform's promotion schedule and merchandise, which both increases user stickiness and guides purchases. As the number of original articles about all kinds of goods grows and the number of crawled articles surges, managing and reusing these articles has become a problem. At present these articles are labeled manually, which greatly increases labor costs; once the number of articles becomes too large, manual labeling can no longer cope.

Summary of the invention

To overcome the deficiencies of the prior art, the first object of the present invention is to provide a content classification and aggregation method that solves the problem that articles are currently labeled manually, which greatly increases labor costs and cannot cope once the number of articles becomes too large.

The second object of the present invention is to provide an electronic device that solves the same problem of manual labeling, increased labor costs, and the inability of manual work to cope with an excessive number of articles.

The third object of the present invention is to provide a computer storage medium that solves the same problem.

The fourth object of the present invention is to provide a content classification and aggregation engine that solves the same problem.

The first object of the present invention is achieved by the following technical solution:

A content classification and aggregation method, characterized by comprising:

Establishing article tags: obtaining original article content of different categories and article content to be tested from online platforms; when the original article content and the article content to be tested are not comment-type articles, establishing attribute tags corresponding to the original article content according to its categories, and establishing a mapping between the attribute tags and the original article content;

Summarizing high-frequency words: deconstructing the original article content of each category with a word segmenter, extracting the high-frequency phrase corresponding to each piece of original article content, and establishing a mapping between each high-frequency phrase and the attribute tags;

Establishing linear models: inputting each high-frequency phrase into several linear models to be trained, and training them to obtain trained linear models corresponding to the attribute tags;

Classifying content: screening the article content to be tested with the different trained linear models and matching the corresponding attribute tag.

Further, when the original article content and the article content to be tested are comment-type articles, the following steps are executed:

Establishing a hot word bank: obtaining a number of real comments from mainstream online platforms and establishing a hot word bank from them;

Organizing the hot word bank: classifying the real comments in the hot word bank by attribute to obtain a word-count attribute and a quality attribute;

Enriching the hot word bank: deriving a near-synonym bank from the hot word bank using word2vec, and using the near-synonym bank to iterate step by step over the real comments of each word-count attribute to obtain an enriched hot word bank;

Classifying comments: inputting the hot word bank and the article content to be tested into a greedy matching model for classification, the greedy matching model matching the corresponding quality attribute from the hot word bank.

Further, organizing the hot word bank specifically means classifying the real comments in the hot word bank first by word count and then by quality, the quality attribute being one of good comment, poor comment, and medium comment.

Further, each high-frequency phrase includes several high-frequency words, and the step of establishing linear models is preceded by high-frequency word standardization: counting the current occurrence count of each high-frequency word in the corresponding original article, together with the maximum occurrence count and the minimum occurrence count in the original article content; calculating the weight of each high-frequency word from the current, maximum, and minimum occurrence counts; and ranking the high-frequency words in each high-frequency phrase by weight.

Further, classifying content specifically comprises: inputting the article content to be tested into each of the different trained linear models, each trained linear model outputting a corresponding affinity value; selecting the trained linear model with the largest affinity value; and selecting the corresponding attribute tag according to that trained model.

Further, the attribute tags may be women's wear, food, digital technology, film, fresh style, and retro style, and the original article content is correspondingly women's wear articles, food articles, digital technology articles, film articles, fresh-style articles, and retro-style articles.

The second object of the present invention is achieved by the following technical solution:

An electronic device, comprising: a processor;

a memory; and a program, wherein the program is stored in the memory and configured to be executed by the processor, the program including instructions for executing the content classification and aggregation method of the present invention.

The third object of the present invention is achieved by the following technical solution:

A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, carries out the content classification and aggregation method of the present invention.

The fourth object of the present invention is achieved by the following technical solution:

A content classification and aggregation engine, characterized by comprising:

An article tag establishing module, configured to obtain original article content of different categories and article content to be tested from online platforms and, when the original article content and the article content to be tested are not comment-type articles, to establish attribute tags corresponding to the original article content according to its categories and to establish a mapping between the attribute tags and the original article content;

A high-frequency word summarizing module, configured to deconstruct the original article content of each category with a word segmenter, to extract the high-frequency phrase corresponding to each piece of original article content, and to establish a mapping between each high-frequency phrase and the attribute tags;

A linear model establishing module, configured to input each high-frequency phrase into several linear models to be trained and to train them to obtain trained linear models corresponding to the attribute tags;

A content classification module, configured to screen the article content to be tested with the different trained linear models and to match the corresponding attribute tag.

Further, for the case where the original article content and the article content to be tested are comment-type articles, the engine comprises:

A hot word bank establishing module, configured to obtain a number of real comments from mainstream online platforms and to establish a hot word bank from them;

A hot word bank organizing module, configured to classify the real comments in the hot word bank by attribute to obtain a word-count attribute and a quality attribute;

A hot word bank enriching module, configured to derive a near-synonym bank from the hot word bank using word2vec, and to use the near-synonym bank to iterate step by step over the real comments of each word-count attribute to obtain an enriched hot word bank;

A comment classification module, configured to input the hot word bank and the article content to be tested into a greedy matching model for classification, the greedy matching model matching the corresponding quality attribute from the hot word bank.

Compared with the prior art, the beneficial effects of the present invention are as follows. The content classification and aggregation method first classifies the original article content and establishes the corresponding attribute tags, deconstructs the original article content of each category with a word segmenter and extracts the high-frequency phrase corresponding to each piece of original article content, and establishes a mapping between the high-frequency phrases and the attribute tags. The high-frequency phrases are input into linear models to obtain trained linear models corresponding to the attribute tags, and the trained linear models are then used to screen the article content to be tested and match the corresponding attribute tag. In other words, a correspondence is established between the article content to be tested and the attribute tags, and classification and aggregation are carried out according to that correspondence. This way of classifying no longer requires manual intervention: the article content to be tested is classified automatically, which improves classification accuracy and reduces labor costs. Using the attribute tags matched to the article content to be tested, the content can be presented to the user all at once, grouped by attribute tag, which greatly improves the user experience.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.

Brief description of the drawings

The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a flow chart of the content aggregation method of the present invention;

Fig. 2 is a module block diagram of the content aggregation engine of the present invention;

Fig. 3 is an operation diagram of the content aggregation engine of the present invention in its working state;

Fig. 4 is display interface diagram one of the content aggregation engine of the present invention in its working state;

Fig. 5 is display interface diagram two of the content aggregation engine of the present invention in its working state.

Detailed description of the embodiments

The present invention is further described below with reference to the drawings and specific embodiments. Note that, provided they do not conflict, the embodiments and technical features described below may be combined in any way to form new embodiments.

As shown in Fig. 1, the content classification and aggregation method of the present invention comprises the following steps:

Establishing article tags: obtaining original article content of different categories and article content to be tested from online platforms; when the original article content and the article content to be tested are not comment-type articles, establishing attribute tags corresponding to the original article content according to its categories, and establishing a mapping between the attribute tags and the original article content. In this embodiment, a crawler tool obtains original article content of different categories from the various online platforms. By category, the content can be divided into women's wear, food, digital technology, film, fresh style, retro style, and so on; by type, it can be divided into comment-type and non-comment-type articles. When the original article content and the article content to be tested are not comment-type articles and the original article content is user-generated or professionally produced by the platform, attribute tags are first established by category (women's wear, food, digital technology, film, fresh style, retro style, and so on), and a mapping is established between each attribute tag and each piece of original article content; that is, all original article content is classified by attribute tag. In this embodiment, each attribute tag corresponds to at least 1,000 pieces of original article content.

Summarizing high-frequency words: deconstructing the original article content of each category with a word segmenter, extracting the high-frequency phrase corresponding to each piece of original article content, and establishing a mapping between each high-frequency phrase and the attribute tags. In this embodiment, the IKAnalyzer and paoding word segmenters are used together to extract high-frequency words from the original article content of each category. First, a ranking of positive keywords and negative keywords is extracted for each piece of original article content (the positive keywords are usually the top 50 high-frequency words of the article's own category; the negative keywords can be the top 3 or top 5 high-frequency words of articles in other categories). These positive and negative keywords are the high-frequency words. Each high-frequency word is then standardized: the current occurrence count of each high-frequency word in the corresponding original article is counted as a, the maximum occurrence count in the original article content as maxHot, and the minimum occurrence count as minHot. The weight of each high-frequency word is calculated from the current, maximum, and minimum occurrence counts, and the high-frequency words in each high-frequency phrase are ranked by weight. See formula (1):

Weight = (a - minHot) / (maxHot - minHot)    (1)

where a is the current occurrence count, maxHot is the maximum occurrence count, and minHot is the minimum occurrence count.
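The counting and weighting described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the use of pre-tokenized English words in place of segmenter output, and the handling of the case maxHot = minHot (which formula (1) leaves undefined) are assumptions, not part of the original disclosure.

```python
from collections import Counter


def top_keywords(tokens, k=50):
    """Return the k most frequent tokens with their occurrence counts."""
    return Counter(tokens).most_common(k)


def minmax_weights(counts):
    """Apply formula (1): (a - minHot) / (maxHot - minHot) to each count."""
    max_hot = max(counts.values())
    min_hot = min(counts.values())
    span = max_hot - min_hot
    if span == 0:
        # Assumption: formula (1) is undefined here, so every word gets 1.0.
        return {word: 1.0 for word in counts}
    return {word: (a - min_hot) / span for word, a in counts.items()}


def ranked_keywords(tokens, k=50):
    """Rank the top-k high-frequency words by their normalized weight."""
    counts = dict(top_keywords(tokens, k))
    weights = minmax_weights(counts)
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical tokens standing in for the output of a word segmenter.
tokens = ["dress"] * 5 + ["skirt"] * 3 + ["lace"]
ranked = ranked_keywords(tokens)
```

In this sketch maxHot and minHot are taken over the selected keywords of one article; the text does not say whether they should instead be taken over every word in the article.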

Establishing linear models: inputting each high-frequency phrase into several linear models to be trained, and training them to obtain trained linear models corresponding to the attribute tags. In this embodiment, multiple corresponding trained linear models can be established according to the categories of the high-frequency words, and a sigmoid function is used to make the weights of the trained models converge.
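One way to read this step is as one linear model per attribute tag, each trained with a sigmoid output to separate its own tag from all others. The following Python sketch takes that reading. The one-vs-rest setup, the stochastic gradient update, the hyperparameters, and all names are assumptions made for illustration; the text does not specify the training procedure.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_model(samples, targets, vocab, epochs=200, lr=0.5):
    """Fit weights and bias by stochastic gradient descent on log loss.

    samples: list of {word: weight} feature dicts (one per article).
    targets: 1 if the article carries this model's tag, else 0.
    """
    w = {word: 0.0 for word in vocab}
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, targets):
            z = b + sum(w[t] * v for t, v in x.items() if t in w)
            err = sigmoid(z) - y  # gradient of log loss with respect to z
            for t, v in x.items():
                if t in w:
                    w[t] -= lr * err * v
            b -= lr * err
    return w, b


def score(model, x):
    """Sigmoid output of one trained model for one feature dict."""
    w, b = model
    return sigmoid(b + sum(w.get(t, 0.0) * v for t, v in x.items()))


def train_one_vs_rest(samples, sample_tags):
    """Train one linear model per attribute tag."""
    vocab = {t for x in samples for t in x}
    models = {}
    for tag in set(sample_tags):
        targets = [1 if s == tag else 0 for s in sample_tags]
        models[tag] = train_linear_model(samples, targets, vocab)
    return models


# Hypothetical weighted high-frequency words for four labeled articles.
samples = [
    {"dress": 1.0, "lace": 0.5},
    {"skirt": 1.0},
    {"noodle": 1.0, "soup": 0.4},
    {"rice": 1.0},
]
tags = ["womenswear", "womenswear", "food", "food"]
models = train_one_vs_rest(samples, tags)
```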

Classifying content: screening the article content to be tested with the different trained linear models and matching the corresponding attribute tag. The article content to be tested is input into each of the different trained linear models, and each trained linear model outputs a corresponding affinity value. The trained linear model with the largest affinity value is selected, and the corresponding attribute tag is selected according to that model. The higher the affinity value output by a trained linear model, the closer the article content to be tested is to that model's attribute tag. The attribute of the article content to be tested is therefore assigned according to the highest affinity value, so that different article content to be tested is classified and aggregated accurately and reasonably.
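Selecting the tag with the largest affinity value amounts to an argmax over the per-tag model outputs. A minimal Python sketch follows; the model representation (a weight dict plus a bias), the hand-set example weights, and the function names are hypothetical and chosen only for illustration.

```python
import math


def affinity(model, features):
    """Sigmoid score of one linear model for one article's features."""
    weights, bias = model
    z = bias + sum(weights.get(t, 0.0) * v for t, v in features.items())
    z = max(-60.0, min(60.0, z))  # clamp to keep math.exp in range
    return 1.0 / (1.0 + math.exp(-z))


def classify(models, features):
    """Return the tag whose model gives the highest affinity, plus all scores."""
    scores = {tag: affinity(m, features) for tag, m in models.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical trained models: {tag: (weights, bias)}.
models = {
    "food": ({"noodle": 2.0, "soup": 1.5}, -1.0),
    "film": ({"director": 2.0, "plot": 1.5}, -1.0),
}
tag, scores = classify(models, {"noodle": 1.0, "plot": 0.2})
```

A practical system might also apply a minimum affinity threshold before assigning any tag; the text does not mention one, so none is used here.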

In this embodiment, when the original article content and the article content to be tested are comment-type articles, classification and aggregation are performed on the comments, and the following steps are executed:

Establishing a hot word bank: obtaining a number of real comments from mainstream online platforms and establishing a hot word bank from them. In this embodiment, 900,000 real online comments were collected to build the hot word bank.

Organizing the hot word bank: classifying the real comments in the hot word bank by attribute to obtain a word-count attribute and a quality attribute. The real comments in the hot word bank are classified first by word count and then by quality, the quality attribute being one of good comment, poor comment, and medium comment. By word count, the comments are first divided into 1-word, 2-word, 3-word, 4-word, and 5-word comments; this embodiment uses these five word-count classes. Each of these classes is then divided by quality attribute into good comments, poor comments, and medium comments. In this way the comments in the hot word bank are organized by quality and by word count.
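The two-level grouping can be pictured as a nested mapping from word count to quality to comments. The Python sketch below assumes that each comment already carries a quality label and that word count can be approximated by splitting on whitespace (for Chinese text a character count or a segmenter would be used instead). Both points, and all names, are illustrative assumptions rather than details taken from the text.

```python
from collections import defaultdict


def organize_hot_word_bank(comments, max_len=5):
    """Group comments by word count (1..max_len), then by quality label.

    comments: iterable of (text, quality) pairs, where quality is one of
    "good", "medium", or "poor" (labels assumed to be given).
    """
    bank = defaultdict(lambda: defaultdict(list))
    for text, quality in comments:
        n = len(text.split())
        if 1 <= n <= max_len:
            bank[n][quality].append(text)
    return bank


# Hypothetical labeled comments.
comments = [
    ("great", "good"),
    ("too small", "poor"),
    ("love it", "good"),
    ("it is fine", "medium"),
    ("this one is far too long to keep", "poor"),
]
bank = organize_hot_word_bank(comments)
```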

Enriching the hot word bank: deriving a near-synonym bank from the hot word bank using word2vec, and using the near-synonym bank to iterate step by step over the real comments of each word-count attribute to obtain an enriched hot word bank. Using word2vec, the real 1-word, 2-word, 3-word, 4-word, and 5-word comments are iterated over step by step, which enriches the hot word bank.
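The step-by-step iteration can be sketched as a frontier expansion: each round looks up near-synonyms of the entries added in the previous round and keeps those that are similar enough. In the Python sketch below, `similar` is a hypothetical stand-in for a word2vec nearest-neighbor lookup that returns (candidate, similarity) pairs; the similarity threshold and the number of rounds are assumptions, since the text gives neither.

```python
def enrich_hot_word_bank(seed_phrases, similar, rounds=2, threshold=0.7):
    """Iteratively add near-synonyms of hot-word-bank entries.

    similar(phrase) -> list of (candidate, similarity) pairs; assumed to be
    backed by a trained word2vec model in a real system.
    """
    bank = set(seed_phrases)
    frontier = set(seed_phrases)
    for _ in range(rounds):
        new = set()
        for phrase in frontier:
            for candidate, sim in similar(phrase):
                if sim >= threshold and candidate not in bank:
                    new.add(candidate)
        if not new:
            break  # nothing left to add, stop early
        bank |= new
        frontier = new
    return bank


# Hypothetical similarity table used in place of a real word2vec model.
table = {
    "good": [("great", 0.9), ("ok", 0.5)],
    "great": [("excellent", 0.8), ("good", 0.95)],
    "excellent": [("superb", 0.75)],
}


def lookup(phrase):
    return table.get(phrase, [])


two_rounds = enrich_hot_word_bank({"good"}, lookup, rounds=2)
three_rounds = enrich_hot_word_bank({"good"}, lookup, rounds=3)
```

Newly added entries would presumably inherit the quality attribute of the entry they were derived from, although the text does not state this.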

Classifying comments: inputting the hot word bank and the article content to be tested into a greedy matching model for classification, the greedy matching model matching the corresponding quality attribute from the hot word bank. The enriched hot word bank is imported into the greedy matching model, which classifies and aggregates the article content to be tested according to a greedy matching strategy. In this embodiment the greedy matching strategy has a strict mode and a loose mode. Finally, as shown in Fig. 5, all comments are displayed to the user by quality attribute; for example, good comments are displayed together on the same page.
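The text does not define the greedy matching model or its strict and loose modes, so the following Python sketch is one possible interpretation only. It scans a comment left to right, always taking the longest hot-word-bank entry that matches at the current position. In the assumed strict mode every word of the comment must be covered by matches, while in the assumed loose mode a single match is enough. The majority vote over matched quality labels and all names are likewise assumptions.

```python
from collections import Counter


def greedy_match(tokens, bank, max_len=5):
    """Greedy longest-match scan of tokens against hot-word-bank entries.

    bank: {tuple_of_words: quality}. Returns (matched qualities, words covered).
    """
    i, hits, covered = 0, [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in bank:
                hits.append(bank[phrase])
                covered += n
                i += n
                break
        else:
            i += 1  # no entry starts here, skip one word
    return hits, covered


def classify_comment(text, bank, mode="loose"):
    """Return the majority quality label of the matches, or None."""
    tokens = text.lower().split()
    hits, covered = greedy_match(tokens, bank)
    if not hits:
        return None
    if mode == "strict" and covered != len(tokens):
        return None  # assumed strict rule: the whole comment must be matched
    return Counter(hits).most_common(1)[0][0]


# Hypothetical hot word bank keyed by word tuples.
bank = {
    ("very", "good"): "good",
    ("good",): "good",
    ("too", "small"): "poor",
    ("ok",): "medium",
}
```

With this bank, `classify_comment("the size is too small", bank)` yields "poor" in loose mode and None in strict mode, because only two of the five words are covered.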

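The present invention also provides an electronic device, comprising: a processor; a memory; and a program, wherein the program is stored in the memory and configured to be executed by the processor, the program including instructions for executing the content classification and aggregation method of the present invention.

The present invention also provides a computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, carries out the content classification and aggregation method of the present invention.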

As shown in Fig. 2, the present invention provides a content classification and aggregation engine, comprising: an article tag establishing module, configured to obtain original article content of different categories and article content to be tested from online platforms and, when the original article content and the article content to be tested are not comment-type articles, to establish attribute tags corresponding to the original article content according to its categories and to establish a mapping between the attribute tags and the original article content;

A content classification module, configured to screen the article content to be tested with the different trained linear models and to match the corresponding attribute tag. After the articles to be tested have been classified and aggregated, the display interface is as shown in Fig. 5: the article content is divided into attribute tags such as beauty, outfits, home, and mother and baby, and similar article content to be tested is displayed under each attribute tag.

A comment classification module, configured to input the hot word bank and the comments into a greedy matching model for classification, the greedy matching model matching the corresponding quality attribute from the hot word bank. Finally, as shown in Fig. 4, all comments are displayed to the user by quality attribute; for example, good comments are displayed together on the same page.

The content classification and aggregation engine in this embodiment is a natural language processing engine. As shown in Fig. 3, when the engine is applied, the image and text information in the shared data is first cached, and the engine then classifies and aggregates the shared data. Staff configure the service list on the content service platform through the content management system. The content service platform places the classified and aggregated shared data into the configured service list and publishes it through a unified external interface for display in windows such as Discover, Smart, Live, Activity, and sub-channels.

The content classification and aggregation method of the present invention first classifies the original article content and establishes the corresponding attribute tags, deconstructs the original article content of each category with a word segmenter and extracts the high-frequency phrase corresponding to each piece of original article content, and establishes a mapping between the high-frequency phrases and the attribute tags. The high-frequency phrases are input into linear models to obtain trained linear models corresponding to the attribute tags, and the trained linear models are then used to screen the article content to be tested and match the corresponding attribute tag. In other words, a correspondence is established between the article content to be tested and the attribute tags, and classification and aggregation are carried out according to that correspondence. This way of classifying no longer requires manual intervention: the article content to be tested is classified automatically, which improves classification accuracy and reduces labor costs. Using the attribute tags matched to the article content to be tested, the content can be presented to the user all at once, grouped by attribute tag, which greatly improves the user experience.

The above are only preferred embodiments of the present invention and do not limit the invention in any form. Any person of ordinary skill in the art can readily implement the invention as shown in the drawings and described above. Equivalent changes, modifications, and developments made by those skilled in the art using the technical content disclosed above, without departing from the scope of the technical solution of the invention, are equivalent embodiments of the invention. Likewise, any equivalent changes, modifications, and developments made to the above embodiments according to the substantive technology of the invention still fall within the protection scope of the technical solution of the invention.

Claims

1. A content classification and aggregation method, characterized by comprising:

Establishing article tags: obtaining original article content of different categories and article content to be tested from online platforms; when the original article content and the article content to be tested are not comment-type articles, establishing attribute tags corresponding to the original article content according to its categories, and establishing a mapping between the attribute tags and the original article content;

Summarizing high-frequency words: deconstructing the original article content of each category with a word segmenter, extracting the high-frequency phrase corresponding to each piece of original article content, and establishing a mapping between each high-frequency phrase and the attribute tags;

Establishing linear models: inputting each high-frequency phrase into several linear models to be trained, and training them to obtain trained linear models corresponding to the attribute tags;

Classifying content: screening the article content to be tested with the different trained linear models and matching the corresponding attribute tag.

2. The content classification and aggregation method of claim 1, characterized in that when the original article content and the article content to be tested are comment-type articles, the following steps are executed:

Organizing the hot word bank: classifying the real comments in the hot word bank by attribute to obtain a word-count attribute and a quality attribute;

Enriching the hot word bank: deriving a near-synonym bank from the hot word bank using word2vec, and using the near-synonym bank to iterate step by step over the real comments of each word-count attribute to obtain an enriched hot word bank;

Classifying comments: inputting the hot word bank and the article content to be tested into a greedy matching model for classification, the greedy matching model matching the corresponding quality attribute from the hot word bank.

3. The content classification and aggregation method of claim 2, characterized in that organizing the hot word bank specifically means classifying the real comments in the hot word bank first by word count and then by quality, the quality attribute being one of good comment, poor comment, and medium comment.

4. a kind of classifying content polymerization as described in claim 1, it is characterised in that: if each high-frequency phrase includes Dry high frequency vocabulary, the linear model of establishing further includes before high frequency words standardization, counts each high frequency vocabulary and exists Current frequency of occurrence in the corresponding original article at most frequency of occurrence and minimum occurs in the original article content Number；According to the current frequency of occurrence, at most there is this number and the corresponding weight of the minimum frequency of occurrence calculating high frequency vocabulary, Weight sequencing is carried out to the high frequency vocabulary in each high-frequency phrase according to the weight.

5. The content classification and aggregation method according to claim 1, characterized in that the content classification specifically comprises: inputting the article content to be measured into each of the different trained linear models, each trained linear model outputting a corresponding affinity value; selecting the trained linear model corresponding to the largest affinity value; and selecting the corresponding attribute tag according to that trained model.
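The selection rule in claim 5 amounts to scoring the article with every trained linear model and keeping the tag whose model scores highest. A minimal sketch follows, assuming each model is a bag-of-words linear function (per-word weights plus a bias); that representation, the function names, and the example weights are illustrative assumptions, not taken from the patent.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumed model representation: (per-word weights, bias).
LinearModel = Tuple[Dict[str, float], float]


def linear_score(model: LinearModel, tokens: List[str]) -> float:
    """Score = bias + sum over model words of (weight * occurrences in the article)."""
    weights, bias = model
    counts = Counter(tokens)
    return bias + sum(weight * counts[word] for word, weight in weights.items())


def classify(tokens: List[str], models: Dict[str, LinearModel]) -> Tuple[str, Dict[str, float]]:
    """Return the attribute tag whose model outputs the largest value, plus all scores."""
    scores = {tag: linear_score(model, tokens) for tag, model in models.items()}
    best_tag = max(scores, key=scores.get)
    return best_tag, scores


# Hypothetical trained models, for illustration only.
models = {
    "food": ({"recipe": 1.0, "restaurant": 0.8}, 0.0),
    "film": ({"director": 1.0, "plot": 0.7}, 0.0),
}
tag, scores = classify(["the", "director", "kept", "the", "plot", "simple"], models)
```

If two models tie, `max` returns the first tag in dictionary order of insertion; a real system would need an explicit tie-breaking rule.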

6. The content classification and aggregation method according to claim 1, characterized in that: the attribute tags may be women's wear, food, digital technology, film, fresh-and-light style, and retro trend, and the original article content is correspondingly women's-wear articles, food articles, digital-technology articles, film articles, fresh-and-light-style articles, and retro-trend articles.

7. An electronic device, characterized by comprising: a processor;

a memory; and a program, wherein the program is stored in the memory and configured to be executed by the processor, the program including instructions for performing the method according to any one of claims 1-6.

8. A computer-readable storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a processor, performs the method according to any one of claims 1 to 6.

9. A content classification and aggregation engine, characterized by comprising:

an article tag establishing module, configured to obtain different kinds of original article content and article content to be measured from an online platform and, when the original article content and the article content to be measured are not comment-type articles, to establish attribute tags corresponding to the original article content according to the different kinds and to establish a mapping relation between the attribute tags and the original article content;

a high-frequency word induction module, configured to use a word segmenter to deconstruct the different kinds of original article content, extract the high-frequency phrase corresponding to each original article content, and establish a mapping relation between each high-frequency phrase and the attribute tags;

a linear model establishing module, configured to input each high-frequency phrase into a respective one of several linear models to be trained, so as to obtain trained linear models corresponding to the attribute tags;

a content classification module, configured to screen the article content to be measured according to the different trained linear models and match the corresponding attribute tag.
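The high-frequency word induction module of claim 9 can be sketched as a segmentation pass followed by a frequency count. The claim does not name a particular segmenter, so the sketch substitutes a simple regular-expression tokenizer for English text; Chinese content would need a real word segmenter. The stop-word list, the `top_k` cutoff, and all function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "and", "of", "is", "in", "to", "with"}


def segment(text: str) -> List[str]:
    """Stand-in segmenter: lowercase word tokens with stop words removed."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]


def high_frequency_phrase(text: str, top_k: int = 3) -> List[str]:
    """Return the top_k most frequent tokens of one original article."""
    return [word for word, _ in Counter(segment(text)).most_common(top_k)]


def build_mappings(articles: Dict[str, str], top_k: int = 3) -> Dict[str, List[str]]:
    """Map each attribute tag to the high-frequency phrase of its original article."""
    return {tag: high_frequency_phrase(text, top_k) for tag, text in articles.items()}
```

The resulting tag-to-phrase mapping is the kind of input the linear model establishing module would consume; how the models are actually trained is not specified in the claims and is left out of this sketch.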

10. The content classification and aggregation engine according to claim 9, characterized in that, when the original article content and the article content to be measured are comment-type articles, the engine comprises:

a hot word bank establishing module, configured to obtain several real comments from the online platform and to establish a hot word bank from the several real comments;

a hot word bank organizing module, configured to perform attribute classification on the several real comments in the hot word bank to obtain a word-count attribute and a quality attribute;

a hot word bank enriching module, configured to derive a near-synonym library from the hot word bank using word2vec and to use the near-synonym library to iterate step by step over the real comments having different word-count attributes to obtain an enriched hot word bank;

a comment classification module, configured to input the hot word bank and the article content to be measured into a greedy matching model for classification, the greedy matching model matching the corresponding quality attribute from the hot word bank.