CN107729401A - Artificial-intelligence-based method, device and storage medium for mining high-quality articles - Google Patents
- Publication number
- CN107729401A (application CN201710862013.3A)
- Authority
- CN
- China
- Prior art keywords
- article
- high quality
- filtering
- microblogging
- quality articles
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The invention discloses an artificial-intelligence-based method, device and storage medium for mining high-quality articles. The method includes: mining articles from the microblog posts obtained; filtering out the mined articles that do not meet predetermined requirements; dividing the remaining articles into positive samples and negative samples; training a high-quality article recognition model on the positive and negative samples; and using the model to assess the quality of articles mined from microblog posts, thereby obtaining the recognized high-quality articles. With this scheme, a large number of high-quality articles can be mined effectively and at low cost, and the mined articles are innovative.
Description
【Technical field】
The present invention relates to computer application technology, and in particular to an artificial-intelligence-based method, device and storage medium for mining high-quality articles.
【Background】
Artificial intelligence (AI) is a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. AI is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Research in the field includes robotics, language recognition, image recognition, natural language processing and expert systems.
In an era in which Internet users generate an explosion of information, mining reliable, high-quality articles from the data generated by microblog users has enormous commercial value.
However, a microblog is a casual platform for sharing and exchange that emphasizes timeliness. This casualness causes low-quality data to proliferate, including advertising data and everyday chat data, which makes mining high-quality articles very difficult.
The prior art offers no effective solution to this problem.
In the prior art, high-quality articles are usually obtained in the following ways:
1) relying on self-media or hired writers to write articles;
2) constructing articles semi-automatically with some manual intervention, for example generating sports reports from a fixed template.
However, each of these approaches has problems in practice. Approach 1) depends on self-media or writers, so the output volume is small and the cost is high. Approach 2) yields low-quality articles that lack innovation: the number of templates is limited and so is the range they cover, so template-generated articles tend to be stereotyped.
【Summary of the invention】
In view of this, the invention provides an artificial-intelligence-based method, device and storage medium for mining high-quality articles, which can mine a large number of high-quality articles effectively and at low cost, and the mined articles are innovative.
The specific technical scheme is as follows.
An artificial-intelligence-based method for mining high-quality articles, including:
mining articles from the microblog posts obtained;
filtering out the mined articles that do not meet predetermined requirements;
dividing the remaining articles into positive samples and negative samples;
training a high-quality article recognition model on the positive samples and the negative samples;
using the high-quality article recognition model to assess the quality of articles mined from microblog posts, and obtaining the recognized high-quality articles.
According to a preferred embodiment of the present invention, mining articles from the microblog posts obtained includes:
obtaining the short link of an article from a microblog post;
restoring the short link to a long link;
obtaining the article corresponding to the long link.
According to a preferred embodiment of the present invention, filtering out the mined articles that do not meet predetermined requirements includes:
for each article, performing advertisement filtering and pornography filtering on it separately and, if either filter is not passed, determining that the article does not meet the predetermined requirements.
According to a preferred embodiment of the present invention, performing advertisement filtering and pornography filtering on each article includes:
performing advertisement filtering on the article by rule-based filtering;
performing pornography filtering on the article by keyword filtering.
According to a preferred embodiment of the present invention, the method further includes:
obtaining the repost, comment and like counts of each microblog post;
for each article, determining its most influential microblog post, which is the post with the largest sum of repost, comment and like counts among the microblog posts containing the article;
and filtering out the mined articles that do not meet predetermined requirements further includes:
for each article, determining whether the follower count of the blogger who published the article's most influential microblog post exceeds a predetermined threshold and, if not, determining that the article does not meet the predetermined requirements.
According to a preferred embodiment of the present invention, training a high-quality article recognition model on the positive samples and the negative samples includes:
extracting features from the positive samples and the negative samples, the extracted features including a feature that reflects the popularity of an article;
training the high-quality article recognition model on the extracted features.
An artificial-intelligence-based device for mining high-quality articles, including a preprocessing unit and a mining unit;
the preprocessing unit is configured to mine articles from the microblog posts obtained; filter out the mined articles that do not meet predetermined requirements; divide the remaining articles into positive samples and negative samples; and train a high-quality article recognition model on the positive samples and the negative samples;
the mining unit is configured to use the high-quality article recognition model to assess the quality of articles mined from microblog posts and obtain the recognized high-quality articles.
According to a preferred embodiment of the present invention, the preprocessing unit includes an obtaining subunit, a filtering subunit and a training subunit;
the obtaining subunit is configured to obtain the short link of an article from a microblog post, restore the short link to a long link, and obtain the article corresponding to the long link;
the filtering subunit is configured to filter out articles that do not meet the predetermined requirements;
the training subunit is configured to divide the remaining articles into positive samples and negative samples and train a high-quality article recognition model on the positive samples and the negative samples.
According to a preferred embodiment of the present invention, the filtering subunit performs advertisement filtering and pornography filtering on each article separately and, if either filter is not passed, determines that the article does not meet the predetermined requirements.
According to a preferred embodiment of the present invention, the filtering subunit performs advertisement filtering on the article by rule-based filtering;
the filtering subunit performs pornography filtering on the article by keyword filtering.
According to a preferred embodiment of the present invention, the obtaining subunit is further configured to:
obtain the repost, comment and like counts of each microblog post;
for each article, determine its most influential microblog post, which is the post with the largest sum of repost, comment and like counts among the microblog posts containing the article;
and the filtering subunit is further configured to:
for each article, determine whether the follower count of the blogger who published the article's most influential microblog post exceeds a predetermined threshold and, if not, determine that the article does not meet the predetermined requirements.
According to a preferred embodiment of the present invention, the training subunit extracts features from the positive samples and the negative samples, the extracted features including a feature that reflects the popularity of an article, and trains the high-quality article recognition model on the extracted features.
A computer device, including a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described above when executing the program.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method described above.
As the above description shows, with the scheme of the present invention, articles can first be mined from the microblog posts obtained and the mined articles that do not meet predetermined requirements filtered out; the remaining articles can then be divided into positive and negative samples, on which a high-quality article recognition model is trained. The model can subsequently be used to assess the quality of articles mined from microblog posts and so obtain the recognized high-quality articles. Compared with the prior art, the scheme mines high-quality articles from microblog posts; because the volume of microblog posts is enormous, a large number of high-quality articles can be obtained, and because no writers need to be hired, the cost is low. In addition, a wide variety of high-quality articles can be mined from microblog posts without being constrained by templates, so the results are sufficiently innovative.
【Brief description of the drawings】
Fig. 1 is a flowchart of an embodiment of the artificial-intelligence-based high-quality article mining method of the present invention.
Fig. 2 is a schematic structural diagram of an embodiment of the artificial-intelligence-based high-quality article mining device of the present invention.
Fig. 3 is a block diagram of an exemplary computer system/server 12 suitable for implementing embodiments of the present invention.
【Detailed description】
To address the problems in the prior art, the present invention proposes an artificial-intelligence-based way of mining high-quality articles: microblog data is ingested from the microblog platform in real time, and the high-quality articles in it are mined in real time.
To make the technical scheme of the present invention clearer, the scheme is described further below with reference to the drawings and embodiments.
Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a flowchart of an embodiment of the artificial-intelligence-based high-quality article mining method of the present invention. As shown in Fig. 1, it includes the following specific implementation.
In 101, articles are mined from the microblog posts obtained.
In 102, the mined articles that do not meet predetermined requirements are filtered out.
In 103, the remaining articles are divided into positive samples and negative samples.
In 104, a high-quality article recognition model is trained on the positive and negative samples.
In 105, the high-quality article recognition model is used to assess the quality of articles mined from microblog posts, and the recognized high-quality articles are obtained.
As can be seen, in the above embodiment a high-quality article recognition model is first trained, after which the model can be used to mine high-quality articles from microblog posts in real time.
The process of training the high-quality article recognition model mainly includes the sub-processes of article mining, article filtering and model training, each described in detail below.
1) Article mining
Articles can be mined from the microblog posts obtained.
Specifically, the short link of an article can be obtained from a microblog post and restored to a long link, and the article corresponding to the long link can then be obtained.
A microblog post may quote or repost an article. The article's short link can be obtained from the post and restored to a long link using existing techniques; the long link is then used to obtain the article's full content, completing the article, which is stored in an article database. A crawling tool can be used to fetch the article content corresponding to the long link.
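The mining step above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the short-link pattern (a `t.cn` style URL), the function names, and the injected `resolve` and `fetch` callables are all assumptions made for the example. The callables stand in for whatever link-restoration technique and crawling tool are actually used.

```python
import re
from typing import Callable, Dict, List, Optional

# Assumption: short links look like http(s)://t.cn/<code>; adjust as needed.
SHORT_LINK_RE = re.compile(r"https?://t\.cn/[A-Za-z0-9]+")


def extract_short_links(post_text: str) -> List[str]:
    """Return every short link found in the text of one microblog post."""
    return SHORT_LINK_RE.findall(post_text)


def mine_articles(
    posts: List[str],
    resolve: Callable[[str], Optional[str]],
    fetch: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Build a small article database keyed by long link.

    `resolve` restores a short link to a long link and `fetch` returns the
    full article text for a long link. Both are hypothetical hooks; either
    may return None when the link cannot be handled.
    """
    article_db: Dict[str, str] = {}
    for text in posts:
        for short_link in extract_short_links(text):
            long_link = resolve(short_link)
            if long_link is None or long_link in article_db:
                continue  # unresolved link, or article already stored
            body = fetch(long_link)
            if body is not None:
                article_db[long_link] = body
    return article_db
```

Injecting `resolve` and `fetch` keeps the sketch independent of any particular HTTP client or crawler.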
In addition, the repost, comment and like counts of each microblog post can be obtained, and for each article its most influential microblog post can be determined: the post with the largest sum of repost, comment and like counts among the microblog posts containing the article.
For example, for a given article, first determine which microblog posts quote or repost it, then select from those posts the one with the largest sum of repost, comment and like counts as the article's most influential microblog post.
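Selecting the most influential post amounts to a per-article maximum over the sum of the three counts. The sketch below assumes a hypothetical `Post` record with the fields shown. How ties are broken is not specified in the text, so keeping the first post seen is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Post:
    """Hypothetical record for one microblog post that links to an article."""
    post_id: str
    article_url: str
    reposts: int
    comments: int
    likes: int

    @property
    def engagement(self) -> int:
        # Sum of repost, comment and like counts, as described in the text.
        return self.reposts + self.comments + self.likes


def most_influential_posts(posts: Iterable[Post]) -> Dict[str, Post]:
    """Map each article URL to the post with the largest engagement sum."""
    best: Dict[str, Post] = {}
    for post in posts:
        current = best.get(post.article_url)
        # Strict '>' keeps the first post seen when sums are equal (assumption).
        if current is None or post.engagement > current.engagement:
            best[post.article_url] = post
    return best
```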
2) Article filtering
Each article stored in the article database can be filtered further to remove articles that do not meet predetermined requirements, such as obviously low-quality articles.
For example, advertisement filtering and pornography filtering can each be applied to every article; if either filter is not passed, the article is determined not to meet the predetermined requirements.
Advertisement filtering can be rule-based, and pornography filtering can be keyword-based.
Preferably, the advertisement filtering flow is: article whitelist -> title blacklist -> content blacklist. For a given article, first obtain the content of its first N and last N paragraphs, where N is a positive integer whose value depends on actual needs. Then determine whether that content includes any content or form of expression specified in the article whitelist, such as a journalist's report. If so, the article passes advertisement filtering and is retained. Otherwise, determine whether the article's title includes content specified in the title blacklist; if so, the article is filtered out. Otherwise, determine whether the article's content includes content specified in the content blacklist, such as a statement of the price; if so, the article is filtered out; otherwise, the article passes advertisement filtering and is retained.
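The three-stage flow above can be sketched as a short chain of substring checks. The list contents, the function names and the use of plain substring matching are assumptions for illustration; the text does not specify how list entries are matched. The content blacklist is applied here to the full article text, which is one reading of "the content of the article".

```python
from typing import List, Sequence


def head_tail_paragraphs(paragraphs: Sequence[str], n: int) -> List[str]:
    """Return the first n and last n paragraphs without duplicating overlap."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if len(paragraphs) <= 2 * n:
        return list(paragraphs)
    return list(paragraphs[:n]) + list(paragraphs[-n:])


def passes_ad_filter(
    title: str,
    paragraphs: Sequence[str],
    whitelist: Sequence[str],
    title_blacklist: Sequence[str],
    content_blacklist: Sequence[str],
    n: int = 2,
) -> bool:
    """Apply article whitelist -> title blacklist -> content blacklist."""
    # Stage 1: a whitelist hit in the first/last n paragraphs passes at once.
    edge_text = "\n".join(head_tail_paragraphs(paragraphs, n))
    if any(phrase in edge_text for phrase in whitelist):
        return True
    # Stage 2: a title blacklist hit filters the article out.
    if any(phrase in title for phrase in title_blacklist):
        return False
    # Stage 3: a content blacklist hit filters the article out.
    full_text = "\n".join(paragraphs)
    if any(phrase in full_text for phrase in content_blacklist):
        return False
    return True
```

Because the whitelist is checked first, a whitelisted article is retained even if its title would otherwise hit the title blacklist, which matches the order given in the text.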
For pornography filtering, if the title or content of an article includes any of the configured pornography keywords, the article is filtered out; otherwise, the article passes pornography filtering and is retained.
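A keyword filter of this kind, together with the "fail any filter, drop the article" rule, might look like the sketch below. The article representation (a dict with `title` and `content` keys), the case-insensitive matching and the function names are assumptions for the example.

```python
from typing import Callable, Dict, Iterable, Sequence

Article = Dict[str, str]  # hypothetical shape: {"title": ..., "content": ...}


def passes_keyword_filter(article: Article, keywords: Iterable[str]) -> bool:
    """Return False if the title or content contains any configured keyword."""
    text = (article["title"] + "\n" + article["content"]).lower()
    return not any(keyword.lower() in text for keyword in keywords)


def meets_requirements(
    article: Article, filters: Sequence[Callable[[Article], bool]]
) -> bool:
    """An article is kept only if it passes every filter."""
    return all(check(article) for check in filters)
```

Representing each filter as a callable lets the advertisement filter, the keyword filter and any later checks be combined in one list.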
Besides advertisement filtering and pornography filtering, for each article it can also be determined whether the follower count of the blogger who published the article's most influential microblog post exceeds a predetermined threshold; if not, the article is determined not to meet the predetermined requirements.
The threshold can be set according to actual needs, for example 10000. A blogger whose follower count exceeds the threshold is usually a "big V" user (a verified, influential account) with relatively high authority, so articles whose most influential microblog post was published by a big V user can be retained.
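The follower check reduces to a strict comparison against the threshold. In this sketch the input mapping (article URL to the follower count of the blogger behind its most influential post) and the function name are assumptions; 10000 is the example value from the text.

```python
from typing import Dict, List

FOLLOWER_THRESHOLD = 10000  # example value given in the text


def retain_by_followers(
    followers_by_article: Dict[str, int], threshold: int = FOLLOWER_THRESHOLD
) -> List[str]:
    """Keep articles whose blogger has strictly more followers than threshold."""
    return [
        url
        for url, followers in followers_by_article.items()
        if followers > threshold
    ]
```

An article whose blogger has exactly 10000 followers is dropped, since the text requires the count to be more than the threshold.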
3) Model training
After the filtering above, the remaining articles can be divided into positive samples and negative samples, and the high-quality article recognition model trained on them.
In practice, each article can be labeled manually as a high-quality article or a low-quality article; an article labeled high-quality is taken as a positive sample, and any other article as a negative sample.
Features can be extracted from the resulting positive and negative samples, and the high-quality article recognition model trained on the extracted features.
The extracted features may include a feature that reflects the popularity of an article, that is, whether the article's content concerns a hot event. Other features, such as the number of paragraphs, may also be included; which features to include depends on actual needs.
The feature reflecting popularity can be a like-difference feature. For each article, the existing Baidu natural language processing (NLP) like-prediction model can be applied to the article's most influential microblog post to estimate the post's future like count, from which the like difference is calculated. The like difference reflects the article's popularity: for example, if the post currently has 100 likes and is predicted to have 1000 after one day, the popularity is high; if it currently has 100 likes and is predicted to have 110 after one day, the popularity is low. The larger the like difference, the greater the popularity.
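The like difference in the example above is simply the predicted future like count minus the current one (1000 - 100 = 900 versus 110 - 100 = 10). The sketch below takes the predicted count as an input, because the like-prediction model itself is external. The function names and the two-element feature layout are assumptions.

```python
from typing import List, Sequence


def like_difference(current_likes: int, predicted_likes: int) -> int:
    """Predicted future like count minus current like count."""
    return predicted_likes - current_likes


def extract_features(
    paragraphs: Sequence[str], current_likes: int, predicted_likes: int
) -> List[float]:
    """Hypothetical feature vector: [like difference, paragraph count]."""
    return [
        float(like_difference(current_likes, predicted_likes)),
        float(len(paragraphs)),
    ]
```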
The specific type of the high-quality article recognition model is not restricted; it can be, for example, a neural network model. Given the above description, how to train the high-quality article recognition model is prior art.
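Since the model type is left open, the following sketch uses plain logistic regression trained by stochastic gradient descent as a simple stand-in for the neural network mentioned in the text. It is not the patent's model. The feature scaling, learning rate, epoch count and function names are all assumptions, and features are assumed to be scaled to roughly the range [0, 1] beforehand.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    lr: float = 0.5,
    epochs: int = 500,
) -> Tuple[List[float], float]:
    """Fit weights and bias; label 1 = positive sample, 0 = negative sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def is_high_quality(
    x: Sequence[float], weights: Sequence[float], bias: float
) -> bool:
    """Classify one feature vector with the trained parameters."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5
```

A usage example under these assumptions: with features such as a scaled like difference and a scaled paragraph count, call `train_logistic(features, labels)` on the manually labeled samples, then `is_high_quality(new_features, weights, bias)` on each article that survives filtering.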
Once the high-quality article recognition model has been obtained, it can be used to mine high-quality articles from microblog posts in real time.
4) High-quality article mining
Microblog posts can be obtained from the microblog platform in real time. Because the acquisition speed may exceed the subsequent processing speed, the posts obtained can first be buffered.
For example, the posts obtained can first be serialized with protobuf into a binary stream, and the binary stream written to a Kafka message queue for convenient access; correspondingly, the data in the Kafka message queue can be read and parsed.
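The buffering idea (serialize each post to bytes, queue the bytes, then read and parse them at the consumer's own pace) can be illustrated without the real infrastructure. The sketch below deliberately uses stand-ins: JSON encoding in place of protobuf and an in-memory `queue.Queue` in place of a Kafka topic. It shows only the producer/consumer shape, not either library's API.

```python
import json
import queue
from typing import Dict, Iterable, List

Post = Dict[str, object]


def serialize(post: Post) -> bytes:
    """Stand-in for protobuf serialization: encode a post as UTF-8 JSON."""
    return json.dumps(post, ensure_ascii=False).encode("utf-8")


def deserialize(payload: bytes) -> Post:
    """Inverse of serialize: parse the bytes back into a post."""
    return json.loads(payload.decode("utf-8"))


def produce(buffer: "queue.Queue[bytes]", posts: Iterable[Post]) -> None:
    """Write serialized posts to the buffer (stand-in for a Kafka producer)."""
    for post in posts:
        buffer.put(serialize(post))


def consume(buffer: "queue.Queue[bytes]", max_items: int) -> List[Post]:
    """Read and parse up to max_items posts; stop early if the buffer is empty."""
    parsed: List[Post] = []
    while len(parsed) < max_items:
        try:
            payload = buffer.get_nowait()
        except queue.Empty:
            break
        parsed.append(deserialize(payload))
    return parsed
```

The consumer pulls in bounded batches, so a burst of incoming posts accumulates in the buffer instead of overwhelming the mining and filtering steps.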
Article mining and article filtering can then be performed on the microblog posts in the manner described in 1) and 2). For each article retained after filtering, the high-quality article recognition model determines whether it is a high-quality article; the specific implementation is not repeated here.
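The online stage can be sketched as a small pipeline that applies the filters and then the trained model to each mined article. The filter, feature-extraction and prediction callables are hypothetical hooks, and their signatures are assumptions for the example.

```python
from typing import Callable, Dict, Iterable, List, Sequence

Article = Dict[str, object]


def mine_high_quality(
    articles: Iterable[Article],
    filters: Sequence[Callable[[Article], bool]],
    extract: Callable[[Article], List[float]],
    predict: Callable[[List[float]], bool],
) -> List[Article]:
    """Return the articles that pass every filter and the model's check."""
    selected: List[Article] = []
    for article in articles:
        if not all(check(article) for check in filters):
            continue  # fails a predetermined requirement
        if predict(extract(article)):
            selected.append(article)
    return selected
```

Filtering runs before feature extraction, so articles rejected by the cheap rule-based checks never reach the model.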
In short, with the scheme of the above method embodiment, articles can first be mined from the microblog posts obtained and the mined articles that do not meet predetermined requirements filtered out; the remaining articles can then be divided into positive and negative samples, on which a high-quality article recognition model is trained. The model can subsequently be used to assess the quality of articles mined from microblog posts and so obtain the recognized high-quality articles. Compared with the prior art, the scheme mines high-quality articles from microblog posts; because the volume of microblog posts is enormous, a large number of high-quality articles can be obtained, and because no writers need to be hired, the cost is low. In addition, a wide variety of high-quality articles can be mined from microblog posts without being constrained by templates, so the results are sufficiently innovative.
The above describes the method embodiment; the scheme of the present invention is further explained below through a device embodiment.
Fig. 2 is a schematic structural diagram of an embodiment of the artificial-intelligence-based high-quality article mining device of the present invention. As shown in Fig. 2, the device includes a preprocessing unit 201 and a mining unit 202.
The preprocessing unit 201 is configured to mine articles from the microblog posts obtained; filter out the mined articles that do not meet predetermined requirements; divide the remaining articles into positive samples and negative samples; and train a high-quality article recognition model on the positive and negative samples.
The mining unit 202 is configured to use the high-quality article recognition model to assess the quality of articles mined from microblog posts and obtain the recognized high-quality articles.
As shown in Fig. 2, the preprocessing unit 201 may specifically include an obtaining subunit 2011, a filtering subunit 2012 and a training subunit 2013.
The obtaining subunit 2011 is configured to obtain the short link of an article from a microblog post, restore the short link to a long link, and obtain the article corresponding to the long link.
The filtering subunit 2012 is configured to filter out articles that do not meet the predetermined requirements.
The training subunit 2013 is configured to divide the remaining articles into positive samples and negative samples and train a high-quality article recognition model on them.
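The unit structure described above maps naturally onto composition. The sketch below is one hypothetical way to express it: the class and attribute names are invented for the example, and each subunit is reduced to a callable so that the structure, rather than any specific algorithm, is what is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Article = Dict[str, object]
Model = Callable[[Article], bool]  # returns True for a high-quality article


@dataclass
class PreprocessingUnit:
    """Counterpart of unit 201, composed of three subunits."""
    obtain: Callable[[List[str]], List[Article]]   # obtaining subunit 2011
    keep: Callable[[Article], bool]                # filtering subunit 2012
    train: Callable[[List[Article]], Model]        # training subunit 2013

    def candidates(self, posts: List[str]) -> List[Article]:
        """Mine articles from posts and drop those failing the filters."""
        return [a for a in self.obtain(posts) if self.keep(a)]

    def build_model(self, posts: List[str]) -> Model:
        """Train the recognition model on the retained articles."""
        return self.train(self.candidates(posts))


@dataclass
class MiningUnit:
    """Counterpart of unit 202: applies the trained model to articles."""
    model: Model

    def mine(self, articles: List[Article]) -> List[Article]:
        return [a for a in articles if self.model(a)]
```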
The short chain of article can be obtained from microblogging blog article by obtaining subelement 2011, and revert to short chain according to prior art
Long-chain, and then the article including full content is obtained using long-chain, article completion is realized, and be stored in article database.
In addition, forwarding, comment and like time of every microblogging blog article etc. can also be obtained respectively by obtaining subelement 2011, and
Every article can be directed to, determines most influential microblogging blog article corresponding to it respectively, most influential microblogging blog article is
Forward, comment in microblogging blog article comprising this article, the microblogging blog article that like time sum is maximum.
For each article being stored in article database, further it can be filtered, not met so as to filter out
The article of pre-provisioning request, such as obvious low quality article.
For example for every article, filtering subelement 2012 can carry out advertisement filter and yellow filtering to it respectively, if appointing
One filtering is by the way that then can determine that this article is the article for not meeting pre-provisioning request.
Wherein, advertisement filter can be carried out to every article by the way of rule-based filtering, the side of keyword filtration can be used
Formula, yellow filtering is carried out to every article.
It is preferred that the flow of advertisement filter can be:Article white list->Title blacklist->Content blacklist.
In addition to advertisement filtering and pornography filtering, the filtering subunit 2012 can also determine, for each article, whether the follower count of the blogger who published the article's most influential microblog post exceeds a predetermined threshold. If not, the article is determined to be an article that does not meet the predetermined requirements.
The specific value of the threshold can be set according to actual needs; for example, it can be 10000. Bloggers whose follower count exceeds the threshold are usually "Big V" users, whose authority is relatively high. Therefore, articles whose most influential microblog post was published by a Big V user can be retained.
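A minimal sketch of the follower-count check follows. The threshold of 10000 is the example value from the text; the function names and the `(article_id, follower_count)` input shape are hypothetical.

```python
FOLLOWER_THRESHOLD = 10000  # example value from the text; set per actual needs


def passes_follower_filter(author_follower_count, threshold=FOLLOWER_THRESHOLD):
    """Keep an article only if the author of its most influential post has
    strictly more followers than the threshold."""
    return author_follower_count > threshold


def filter_by_author_followers(articles, threshold=FOLLOWER_THRESHOLD):
    """articles: iterable of (article_id, follower_count) pairs (hypothetical shape)."""
    return [article_id for article_id, followers in articles
            if passes_follower_filter(followers, threshold)]
```

The comparison is strict because the text requires the follower count to be greater than the threshold, so an author with exactly 10000 followers is filtered out.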
After the above filtering, the training subunit 2013 can divide the remaining articles into positive samples and negative samples, and then train the high-quality article recognition model from the resulting positive and negative samples.
In practice, each article can be manually labeled as a high-quality article or a low-quality article. An article labeled as high quality is designated a positive sample; otherwise it is a negative sample.
For the resulting positive and negative samples, the training subunit 2013 can perform feature extraction on each of them, and then train the high-quality article recognition model from the extracted features.
The extracted features may include a feature that reflects article popularity, that is, whether the article's content concerns a trending event. They may further include other features, such as the number of paragraphs. Which features to include can be decided according to actual needs.
The feature that reflects article popularity can be a like-difference feature. For each article, the existing Baidu NLP like-prediction model can be applied to the article's most influential microblog post to estimate the post's future like count, from which the like difference can be calculated. The like difference reflects the article's popularity. For example, if the microblog post currently has 100 likes and is predicted to have 1000 after one day, the popularity is high; conversely, if it currently has 100 likes and is predicted to have 110 after one day, the popularity is low. The larger the like difference, the greater the popularity.
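The like-difference arithmetic can be illustrated with a short sketch. The prediction model itself is outside this example: `predicted_likes` is simply passed in as a number, and the function names and the two-element feature vector are hypothetical.

```python
def like_difference(current_likes, predicted_likes):
    """Predicted future like count minus the current like count."""
    return predicted_likes - current_likes


def extract_features(current_likes, predicted_likes, paragraph_count):
    """Build a feature vector from the two features the text names:
    the like difference and the number of paragraphs."""
    return [like_difference(current_likes, predicted_likes), paragraph_count]
```

With the numbers from the text, a post going from 100 to 1000 likes has a like difference of 900, while one going from 100 to 110 has a like difference of 10.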
After the high-quality article recognition model is obtained, the mining unit 202 can use it to mine high-quality articles from microblog posts in real time.
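The train-then-identify flow can be sketched as below. The text does not say what kind of model is used, so logistic regression fitted by batch gradient descent is swapped in here purely as an assumption; the function names are hypothetical, and the feature vectors are assumed to be already scaled to comparable ranges.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=2000, lr=0.1):
    """Fit a logistic regression model by batch gradient descent.

    samples: list of feature vectors (assumed scaled to comparable ranges)
    labels:  1 for a positive sample (high quality), 0 for a negative sample
    """
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        # Average the gradient over the batch before stepping.
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


def is_high_quality(model, features, threshold=0.5):
    """Apply the trained model to one article's feature vector."""
    weights, bias = model
    score = sum(w * xi for w, xi in zip(weights, features)) + bias
    return sigmoid(score) >= threshold
```

Once trained, the model is a pair of weights and a bias, so scoring a newly mined article is a single dot product, which suits the real-time use described above.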
For the specific workflow of the device embodiment shown in Fig. 2, refer to the corresponding description in the foregoing method embodiment; it is not repeated here.
In summary, with the scheme of the above device embodiment, articles can first be mined from the obtained microblog posts, and the mined articles that do not meet the predetermined requirements can be filtered out. The remaining articles can then be divided into positive and negative samples, from which a high-quality article recognition model is trained. Subsequently, the model can be used to assess the quality of articles mined from microblog posts, yielding the identified high-quality articles. Compared with the prior art, the scheme of the above device embodiment mines high-quality articles from microblog posts. Because the volume of microblog posts is very large, a large number of high-quality articles can be obtained without hiring writers, so the cost is low. In addition, high-quality articles of many kinds can be mined from microblog posts without being constrained by templates, so the results are sufficiently novel.
Fig. 3 shows a block diagram of an exemplary computer system/server 12 suitable for implementing embodiments of the present invention. The computer system/server 12 shown in Fig. 3 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 3, the computer system/server 12 takes the form of a general-purpose computing device. Its components may include, but are not limited to: one or more processors (processing units) 16, a memory 28, and a bus 18 that connects the different system components (including the memory 28 and the processors 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer system/server 12 typically includes a variety of computer-system-readable media. These media can be any available media that can be accessed by the computer system/server 12, including volatile and non-volatile media, and removable and non-removable media.
The memory 28 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 can be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 3, commonly called a "hard disk drive"). Although not shown in Fig. 3, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk"), and an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a CD-ROM, DVD-ROM, or other optical medium), can be provided. In these cases, each drive can be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present invention.
The computer system/server 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any device (such as a network card, a modem, etc.) that enables the computer system/server 12 to communicate with one or more other computing devices. Such communication can take place through the input/output (I/O) interface 22. The computer system/server 12 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 20. As shown in Fig. 3, the network adapter 20 communicates with the other modules of the computer system/server 12 through the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules can be used in conjunction with the computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
By running programs stored in the memory 28, the processor 16 performs various functional applications and data processing, for example implementing the method of the embodiment shown in Fig. 1: mining articles from the obtained microblog posts; filtering out the mined articles that do not meet the predetermined requirements; dividing the remaining articles into positive and negative samples; training a high-quality article recognition model from the positive and negative samples; and, according to the model, assessing the quality of articles mined from microblog posts to obtain the identified high-quality articles.
For specific implementations, refer to the related descriptions in the foregoing embodiments; they are not repeated here.
The present invention also discloses a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the method of the embodiment shown in Fig. 1.
Any combination of one or more computer-readable media can be used. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal can take many forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
The program code contained on a computer-readable medium can be transmitted over any suitable medium, including, but not limited to, wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
Computer program code for carrying out the operations of the present invention can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in an actual implementation.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they can be located in one place or distributed across multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment's scheme.
In addition, the functional units in the embodiments of the present invention can be integrated into one processing unit, each unit can exist physically on its own, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware, or in hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit can be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes a number of instructions that cause a computer device (which can be a personal computer, a server, a network device, etc.) or a processor to perform some of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (14)
- 1. A high-quality article mining method based on artificial intelligence, characterized by comprising: mining articles from obtained microblog posts; filtering out, from the mined articles, the articles that do not meet predetermined requirements; dividing the remaining articles into positive samples and negative samples; training a high-quality article recognition model from the positive samples and the negative samples; and, according to the high-quality article recognition model, assessing the quality of articles mined from microblog posts to obtain the identified high-quality articles.
- 2. The method according to claim 1, characterized in that mining articles from the obtained microblog posts comprises: obtaining the short link of an article from a microblog post; restoring the short link to a long link; and obtaining the article corresponding to the long link.
- 3. The method according to claim 1, characterized in that filtering out, from the mined articles, the articles that do not meet the predetermined requirements comprises: for each article, performing advertisement filtering and pornography filtering on it, and, if the article fails either filter, determining that the article is an article that does not meet the predetermined requirements.
- 4. The method according to claim 3, characterized in that performing advertisement filtering and pornography filtering on each article comprises: performing advertisement filtering on the article by rule-based filtering; and performing pornography filtering on the article by keyword filtering.
- 5. The method according to claim 3, characterized in that the method further comprises: obtaining the forward count, comment count, and like count of each microblog post; and, for each article, determining its corresponding most influential microblog post, the most influential microblog post being the one, among the microblog posts containing the article, whose forward, comment, and like counts have the largest sum; and in that filtering out, from the mined articles, the articles that do not meet the predetermined requirements further comprises: for each article, determining whether the follower count of the blogger of the article's most influential microblog post exceeds a predetermined threshold, and, if not, determining that the article is an article that does not meet the predetermined requirements.
- 6. The method according to claim 1, characterized in that training the high-quality article recognition model from the positive samples and the negative samples comprises: performing feature extraction on the positive samples and the negative samples, the extracted features including a feature that reflects article popularity; and training the high-quality article recognition model from the extracted features.
- 7. A high-quality article mining device based on artificial intelligence, characterized by comprising a preprocessing unit and a mining unit; the preprocessing unit being configured to mine articles from obtained microblog posts, filter out, from the mined articles, the articles that do not meet predetermined requirements, divide the remaining articles into positive samples and negative samples, and train a high-quality article recognition model from the positive samples and the negative samples; the mining unit being configured to assess, according to the high-quality article recognition model, the quality of articles mined from microblog posts to obtain the identified high-quality articles.
- 8. The device according to claim 7, characterized in that the preprocessing unit comprises an acquisition subunit, a filtering subunit, and a training subunit; the acquisition subunit being configured to obtain the short link of an article from a microblog post, restore the short link to a long link, and obtain the article corresponding to the long link; the filtering subunit being configured to filter out articles that do not meet the predetermined requirements; the training subunit being configured to divide the remaining articles into positive samples and negative samples and to train the high-quality article recognition model from the positive samples and the negative samples.
- 9. The device according to claim 8, characterized in that the filtering subunit performs, for each article, advertisement filtering and pornography filtering on it, and, if the article fails either filter, determines that the article is an article that does not meet the predetermined requirements.
- 10. The device according to claim 9, characterized in that the filtering subunit performs advertisement filtering on the article by rule-based filtering, and performs pornography filtering on the article by keyword filtering.
- 11. The device according to claim 9, characterized in that the acquisition subunit is further configured to obtain the forward count, comment count, and like count of each microblog post, and, for each article, to determine its corresponding most influential microblog post, the most influential microblog post being the one, among the microblog posts containing the article, whose forward, comment, and like counts have the largest sum; and in that the filtering subunit is further configured to determine, for each article, whether the follower count of the blogger of the article's most influential microblog post exceeds a predetermined threshold, and, if not, to determine that the article is an article that does not meet the predetermined requirements.
- 12. The device according to claim 8, characterized in that the training subunit performs feature extraction on the positive samples and the negative samples, the extracted features including a feature that reflects article popularity, and trains the high-quality article recognition model from the extracted features.
- 13. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the method according to any one of claims 1 to 6.
- 14. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710862013.3A CN107729401A (en) | 2017-09-21 | 2017-09-21 | High quality articles method for digging, device and storage medium based on artificial intelligence |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107729401A (en) | 2018-02-23 |
Family
ID=61206735
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107729401A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111292134A (en) * | 2020-02-25 | 2020-06-16 | 上海昌投网络科技有限公司 | Method and device for judging whether WeChat public number can be advertised |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100036784A1 (en) * | 2008-08-07 | 2010-02-11 | Yahoo! Inc. | Systems and methods for finding high quality content in social media |
CN103970801A (en) * | 2013-02-05 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Method and device for recognizing microblog advertisement blog articles |
CN104239539A (en) * | 2013-09-22 | 2014-12-24 | 中科嘉速(北京)并行软件有限公司 | Microblog information filtering method based on multi-information fusion |
CN104281653A (en) * | 2014-09-16 | 2015-01-14 | 南京弘数信息科技有限公司 | Viewpoint mining method for ten million microblog texts |
CN106202211A (en) * | 2016-06-27 | 2016-12-07 | 四川大学 | A kind of integrated microblogging rumour recognition methods based on microblogging type |
Non-Patent Citations (2)
Title |
---|
莫祖英: "微博信息内容质量评价及其对用户", 《中国博士学位论文全文数据库信息科技辑》 * |
薛国林: "《政府官员开微博的16个要诀》", 30 June 2013 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180223 |