CN107729401A - Artificial-intelligence-based method, apparatus, and storage medium for mining high-quality articles - Google Patents

Info

Publication number
CN107729401A
Authority
CN
China
Prior art keywords
article
high quality
filtering
microblog
high-quality articles
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710862013.3A
Other languages
Chinese (zh)
Inventor
黄俊衡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201710862013.3A
Publication of CN107729401A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • General Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses an artificial-intelligence-based method, apparatus, and storage medium for mining high-quality articles. The method includes: mining articles from acquired microblog posts; filtering out the mined articles that do not meet predetermined requirements; dividing the remaining articles into positive samples and negative samples; training a high-quality article recognition model on the positive and negative samples; and using the model to assess the quality of articles mined from microblog posts, thereby obtaining the identified high-quality articles. The scheme can effectively mine a large number of high-quality articles at low cost, and the mined articles are innovative.

Description

Artificial-intelligence-based method, apparatus, and storage medium for mining high-quality articles
【Technical Field】
The present invention relates to computer application technology, and in particular to an artificial-intelligence-based method, apparatus, and storage medium for mining high-quality articles.
【Background】
Artificial intelligence (AI) is a new technical science that studies and develops theories, methods, techniques, and application systems for simulating and extending human intelligence. It is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems.
In an era of information explosion driven by user-generated content, mining reliable, high-quality articles from the data generated by microblog users has great commercial value.
However, a microblog is a platform for casual sharing and exchange that emphasizes timeliness. This casualness produces a flood of low-quality data, such as advertisements and everyday chatter, which makes mining high-quality articles very difficult.
The prior art offers no effective solution to this problem.
In the prior art, high-quality articles are usually obtained in one of the following ways:
1) relying on self-media or hired writers to write articles;
2) constructing articles semi-automatically with some manual intervention, for example generating sports reports from a fixed template.
Each of these approaches has problems in practice. Approach 1) depends on self-media or writers, so the output is small and the cost is high. Approach 2) yields low-quality articles that lack innovation: the number of templates is limited and their coverage is fixed, so template-generated articles tend to be formulaic.
【Summary of the Invention】
In view of this, the invention provides an artificial-intelligence-based method, apparatus, and storage medium for mining high-quality articles, which can effectively mine a large number of high-quality articles at low cost, and the mined articles are innovative.
The specific technical solution is as follows:
An artificial-intelligence-based method for mining high-quality articles, including:
mining articles from the acquired microblog posts;
filtering out the mined articles that do not meet predetermined requirements;
dividing the remaining articles into positive samples and negative samples;
training a high-quality article recognition model on the positive samples and the negative samples;
assessing, with the high-quality article recognition model, the quality of articles mined from microblog posts, to obtain the identified high-quality articles.
According to a preferred embodiment of the present invention, mining articles from the acquired microblog posts includes:
obtaining the short link of an article from a microblog post;
restoring the short link to a long link;
obtaining the article corresponding to the long link.
According to a preferred embodiment of the present invention, filtering out the mined articles that do not meet predetermined requirements includes:
for each article, applying advertisement filtering and pornography filtering; if the article fails either filter, determining that the article does not meet the predetermined requirements.
According to a preferred embodiment of the present invention, applying advertisement filtering and pornography filtering to each article includes:
applying advertisement filtering to the article by means of rule-based filtering;
applying pornography filtering to the article by means of keyword filtering.
According to a preferred embodiment of the present invention, the method further includes:
obtaining the forward, comment, and like counts of each microblog post;
for each article, determining its most influential microblog post, namely the post with the largest sum of forward, comment, and like counts among the microblog posts that contain the article;
filtering out the mined articles that do not meet predetermined requirements further includes:
for each article, determining whether the follower count of the blogger who published the article's most influential microblog post exceeds a predetermined threshold; if not, determining that the article does not meet the predetermined requirements.
According to a preferred embodiment of the present invention, training the high-quality article recognition model on the positive samples and the negative samples includes:
extracting features from the positive samples and the negative samples, the extracted features including a feature that reflects article popularity;
training the high-quality article recognition model on the extracted features.
An artificial-intelligence-based apparatus for mining high-quality articles, including a preprocessing unit and a mining unit;
the preprocessing unit is configured to mine articles from the acquired microblog posts, filter out the mined articles that do not meet predetermined requirements, divide the remaining articles into positive samples and negative samples, and train a high-quality article recognition model on the positive samples and the negative samples;
the mining unit is configured to assess, with the high-quality article recognition model, the quality of articles mined from microblog posts, to obtain the identified high-quality articles.
According to a preferred embodiment of the present invention, the preprocessing unit includes an acquisition subunit, a filtering subunit, and a training subunit;
the acquisition subunit is configured to obtain the short link of an article from a microblog post, restore the short link to a long link, and obtain the article corresponding to the long link;
the filtering subunit is configured to filter out articles that do not meet predetermined requirements;
the training subunit is configured to divide the remaining articles into positive samples and negative samples, and to train a high-quality article recognition model on the positive samples and the negative samples.
According to a preferred embodiment of the present invention, the filtering subunit applies advertisement filtering and pornography filtering to each article; if the article fails either filter, the article is determined not to meet the predetermined requirements.
According to a preferred embodiment of the present invention, the filtering subunit applies advertisement filtering to the article by means of rule-based filtering;
the filtering subunit applies pornography filtering to the article by means of keyword filtering.
According to a preferred embodiment of the present invention, the acquisition subunit is further configured to:
obtain the forward, comment, and like counts of each microblog post;
for each article, determine its most influential microblog post, namely the post with the largest sum of forward, comment, and like counts among the microblog posts that contain the article;
the filtering subunit is further configured to:
for each article, determine whether the follower count of the blogger who published the article's most influential microblog post exceeds a predetermined threshold; if not, determine that the article does not meet the predetermined requirements.
According to a preferred embodiment of the present invention, the training subunit extracts features from the positive samples and the negative samples, the extracted features including a feature that reflects article popularity, and trains the high-quality article recognition model on the extracted features.
A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described above when executing the program.
A computer-readable storage medium storing a computer program that implements the method described above when executed by a processor.
As can be seen from the above, with the scheme of the present invention, articles are first mined from the acquired microblog posts, and the mined articles that do not meet predetermined requirements are filtered out. The remaining articles are then divided into positive samples and negative samples, on which a high-quality article recognition model is trained. The model is subsequently used to assess the quality of articles mined from microblog posts, yielding the identified high-quality articles. Compared with the prior art, the scheme mines high-quality articles from microblog posts; because the volume of microblog posts is very large, a large number of high-quality articles can be obtained without hiring writers, so the cost is low. In addition, a wide variety of high-quality articles can be mined from microblog posts without being limited by templates, so the results are sufficiently innovative.
【Brief Description of the Drawings】
Fig. 1 is a flowchart of an embodiment of the artificial-intelligence-based method for mining high-quality articles according to the present invention.
Fig. 2 is a schematic structural diagram of an embodiment of the artificial-intelligence-based apparatus for mining high-quality articles according to the present invention.
Fig. 3 shows a block diagram of an exemplary computer system/server 12 suitable for implementing embodiments of the present invention.
【Detailed Description】
To address the problems in the prior art, the present invention proposes an artificial-intelligence-based approach to mining high-quality articles: microblog data is ingested from the microblog platform in real time, and the high-quality articles in it are mined in real time.
To make the technical solution of the present invention clearer, the scheme is further described below with reference to the drawings and embodiments.
Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a flowchart of an embodiment of the artificial-intelligence-based method for mining high-quality articles according to the present invention. As shown in Fig. 1, the embodiment is implemented as follows.
In 101, articles are mined from the acquired microblog posts.
In 102, the mined articles that do not meet predetermined requirements are filtered out.
In 103, the remaining articles are divided into positive samples and negative samples.
In 104, a high-quality article recognition model is trained on the positive samples and the negative samples.
In 105, the high-quality article recognition model is used to assess the quality of articles mined from microblog posts, yielding the identified high-quality articles.
As can be seen, in the above embodiment a high-quality article recognition model is first trained, and the model is then used to mine high-quality articles from microblog posts in real time.
Training the high-quality article recognition model mainly involves three sub-processes: article mining, article filtering, and model training. Each is described in detail below.
1) Article mining
Articles can be mined from the acquired microblog posts.
Specifically, the short link of an article can be obtained from a microblog post and restored to a long link, and the article corresponding to the long link can then be obtained.
A microblog post may quote or forward an article. In that case, the article's short link can be obtained from the post and restored to a long link using existing techniques; the long link is then used to obtain the article with its full content, which completes the article, and the article is stored in an article database. A crawling tool can be used to fetch the article content corresponding to the long link.
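The mining step above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the short-link pattern, the function names, and the `expand` and `fetch` callables are all assumptions introduced here. In practice `expand` would follow the short link's redirect and `fetch` would be the crawling tool; both are passed in as plain functions so the sketch runs without network access.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional

# Assumed pattern for shortened links; the actual short-link format may differ.
SHORT_LINK_RE = re.compile(r"https?://t\.cn/[A-Za-z0-9]+")


def extract_short_links(post_text: str) -> List[str]:
    """Return every short link that appears in the text of one microblog post."""
    return SHORT_LINK_RE.findall(post_text)


def mine_articles(
    posts: Iterable[str],
    expand: Callable[[str], Optional[str]],  # hypothetical: short link -> long link
    fetch: Callable[[str], str],             # hypothetical: long link -> full article text
) -> Dict[str, str]:
    """Build an in-memory 'article database' keyed by long link."""
    database: Dict[str, str] = {}
    for post in posts:
        for short_link in extract_short_links(post):
            long_link = expand(short_link)
            # Skip links that cannot be restored and articles already stored.
            if long_link is None or long_link in database:
                continue
            database[long_link] = fetch(long_link)
    return database
```

Keying the database by long link means that an article shared in many posts is fetched and stored only once.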
In addition, the forward, comment, and like counts of each microblog post can be obtained, and for each article its most influential microblog post can be determined: the post with the largest sum of forward, comment, and like counts among the microblog posts that contain the article.
For example, for a given article, first determine which microblog posts quote or forward it, then select from those posts the one with the largest sum of forward, comment, and like counts, and take it as the article's most influential microblog post.
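Selecting the most influential post reduces to taking a maximum over the summed counts. The sketch below assumes a hypothetical `Post` record; the field names are illustrative and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Post:
    """Hypothetical record for one microblog post that contains the article."""
    post_id: str
    forwards: int
    comments: int
    likes: int


def engagement(post: Post) -> int:
    """Sum of forward, comment, and like counts."""
    return post.forwards + post.comments + post.likes


def most_influential(posts: Iterable[Post]) -> Post:
    """Return the post with the largest engagement sum (first one wins on ties)."""
    return max(posts, key=engagement)
```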
2) Article filtering
Each article stored in the article database can be further filtered to remove articles that do not meet predetermined requirements, such as obviously low-quality articles.
For example, advertisement filtering and pornography filtering can be applied to each article; if the article fails either filter, it is determined not to meet the predetermined requirements.
Advertisement filtering can use rule-based filtering, and pornography filtering can use keyword filtering.
Preferably, the advertisement filtering flow is: article whitelist -> title blacklist -> content blacklist. For example, for a given article, first obtain the content of its first N and last N paragraphs, where N is a positive integer whose value depends on actual needs. Then determine whether the obtained content includes any content or expression specified in the article whitelist, such as a reporter's byline. If so, the article passes advertisement filtering and is retained. Otherwise, determine whether the article's title includes content specified in the title blacklist; if so, filter the article out. Otherwise, determine whether the article's content includes content specified in the content blacklist, such as a price quotation; if so, filter the article out; otherwise the article passes advertisement filtering and is retained.
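The whitelist -> title blacklist -> content blacklist flow can be sketched as a short chain of checks. The function name and parameters are assumptions, as is the choice to match by plain substring and to test the content blacklist against the full article body; the text does not specify either.

```python
from typing import Sequence


def passes_ad_filter(
    title: str,
    paragraphs: Sequence[str],
    n: int,
    whitelist: Sequence[str],
    title_blacklist: Sequence[str],
    content_blacklist: Sequence[str],
) -> bool:
    """Return True if the article is retained, False if it is filtered out."""
    # First N and last N paragraphs, without counting any paragraph twice.
    tail_start = max(n, len(paragraphs) - n)
    edge_text = "\n".join(list(paragraphs[:n]) + list(paragraphs[tail_start:]))

    # Step 1: a whitelist hit in the edge paragraphs passes the article outright.
    if any(phrase in edge_text for phrase in whitelist):
        return True
    # Step 2: a title blacklist hit filters the article out.
    if any(term in title for term in title_blacklist):
        return False
    # Step 3: a content blacklist hit filters the article out
    # (assumption: the whole body is checked).
    body = "\n".join(paragraphs)
    if any(term in body for term in content_blacklist):
        return False
    return True
```

Because the whitelist is checked first, an article that carries a byline is retained even if it also mentions a price.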
For pornography filtering, if the title or content of an article includes any of the configured pornography keywords, the article is filtered out; otherwise the article passes pornography filtering and is retained.
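Keyword filtering amounts to a membership test over the title and the content. The sketch below is illustrative only: the function name is hypothetical, the keyword list is supplied by the caller, and plain substring matching is an assumption.

```python
from typing import Iterable


def passes_keyword_filter(title: str, content: str, keywords: Iterable[str]) -> bool:
    """Return False if any configured keyword appears in the title or the content."""
    for keyword in keywords:
        if keyword in title or keyword in content:
            return False
    return True
```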
Besides advertisement filtering and pornography filtering, a further check can be applied to each article: determine whether the follower count of the blogger who published the article's most influential microblog post exceeds a predetermined threshold; if not, the article is determined not to meet the predetermined requirements.
The threshold value depends on actual needs; for example, it may be 10000. Bloggers whose follower count exceeds the threshold are usually "big V" users (verified accounts with large followings), whose authority is relatively high. Articles whose most influential microblog post was published by a big V user can therefore be retained.
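The follower check is a strict comparison against the threshold. A minimal sketch, using the example value of 10000 from the text; the names are hypothetical.

```python
FOLLOWER_THRESHOLD = 10000  # example value from the text; tune to actual needs


def passes_follower_filter(follower_count: int, threshold: int = FOLLOWER_THRESHOLD) -> bool:
    """Retain the article only if the blogger's follower count exceeds the threshold."""
    return follower_count > threshold
```

Note that a blogger with exactly 10000 followers does not pass, since the text requires the count to be greater than the threshold.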
3) Model training
After the above filtering, the remaining articles can be divided into positive samples and negative samples, and the high-quality article recognition model is then trained on them.
In practice, each article can be manually labeled as a high-quality or low-quality article; an article labeled high-quality is a positive sample, and otherwise it is a negative sample.
Features can be extracted from the resulting positive and negative samples, and the high-quality article recognition model is trained on the extracted features.
The extracted features may include a feature that reflects article popularity, that is, whether the article's content concerns a trending event. Other features, such as the number of paragraphs, may also be included; which features to use depends on actual needs.
The feature that reflects article popularity can be a like-difference feature. For each article, the article's most influential microblog post is used together with the existing Baidu natural language processing (NLP) like-prediction model to estimate the post's future like count, from which the like difference is calculated. The like difference reflects the article's popularity. For example, if the post currently has 100 likes and is predicted to have 1000 a day later, the popularity is high; if the post currently has 100 likes and is predicted to have 110 a day later, the popularity is low. The larger the like difference, the higher the popularity.
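The like-difference feature itself is simple arithmetic once a predicted like count is available. In the sketch below the prediction is just a number passed in; the like-prediction model is not reproduced, and the function name is hypothetical.

```python
def like_difference(current_likes: int, predicted_likes: int) -> int:
    """Predicted future like count minus the current like count."""
    return predicted_likes - current_likes
```

With the figures from the text, a post going from 100 to a predicted 1000 likes has a like difference of 900, while one going from 100 to 110 has a like difference of 10.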
The specific type of the high-quality article recognition model is not restricted; it can be, for example, a neural network model. Given the above description, how to train the model is prior art.
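Since the model type is left open, the following sketch substitutes plain logistic regression trained by batch gradient descent for the neural network mentioned above, purely to show the shape of the training step. The feature values (a scaled like difference and a scaled paragraph count), the labels, and all names are invented for illustration.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    epochs: int = 2000,
    lr: float = 0.5,
) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    m = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> bool:
    """True means the article is classified as high quality."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5


# Invented toy data: [scaled like difference, scaled paragraph count].
SAMPLES = [[0.9, 0.8], [0.8, 0.6], [0.1, 0.2], [0.2, 0.1]]
LABELS = [1, 1, 0, 0]  # 1 = positive sample, 0 = negative sample
```

Features are assumed to be scaled to a comparable range before training; raw like differences in the hundreds would need normalization for this learning rate to behave.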
Once the high-quality article recognition model has been obtained, it can be used to mine high-quality articles from microblog posts in real time.
4) High-quality article mining
Microblog posts can be obtained from the microblog platform in real time. Because posts may be acquired faster than they can be processed downstream, the acquired posts can first be buffered.
For example, each acquired microblog post can first be serialized with protobuf into a binary stream, and the binary stream is then written to a Kafka message queue for convenient access; correspondingly, the data in the Kafka message queue can be read and parsed.
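The serialize, enqueue, dequeue, and parse pattern can be illustrated with standard-library stand-ins. To be clear about the substitutions: JSON encoded as UTF-8 bytes stands in for protobuf serialization, and an in-process `queue.Queue` stands in for the Kafka message queue. Neither the protobuf nor the Kafka API is used here, and all names are hypothetical.

```python
import json
import queue
from typing import Dict, List

# Stand-in for a Kafka topic: an in-process FIFO buffer.
post_buffer: "queue.Queue[bytes]" = queue.Queue()


def serialize(post: Dict[str, str]) -> bytes:
    """Stand-in for protobuf serialization: encode the post as bytes."""
    return json.dumps(post, ensure_ascii=False).encode("utf-8")


def deserialize(payload: bytes) -> Dict[str, str]:
    """Stand-in for protobuf parsing: decode bytes back into a post."""
    return json.loads(payload.decode("utf-8"))


def produce(post: Dict[str, str]) -> None:
    """Acquisition side: buffer a post as soon as it is acquired."""
    post_buffer.put(serialize(post))


def consume_all() -> List[Dict[str, str]]:
    """Processing side: drain whatever is currently buffered, in arrival order."""
    posts: List[Dict[str, str]] = []
    while True:
        try:
            payload = post_buffer.get_nowait()
        except queue.Empty:
            return posts
        posts.append(deserialize(payload))
```

The buffer decouples the two sides: acquisition can run ahead of processing without losing posts, which is the purpose of the message queue in the text.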
Following the approach described in 1) and 2), articles are mined from the microblog posts and filtered; for each article retained after filtering, the high-quality article recognition model determines whether it is a high-quality article. The specific implementation is not repeated here.
In short, with the scheme described in the above method embodiment, articles are first mined from the acquired microblog posts, and the mined articles that do not meet predetermined requirements are filtered out. The remaining articles are then divided into positive samples and negative samples, on which a high-quality article recognition model is trained. The model is subsequently used to assess the quality of articles mined from microblog posts, yielding the identified high-quality articles. Compared with the prior art, the scheme mines high-quality articles from microblog posts; because the volume of microblog posts is very large, a large number of high-quality articles can be obtained without hiring writers, so the cost is low. In addition, a wide variety of high-quality articles can be mined from microblog posts without being limited by templates, so the results are sufficiently innovative.
The above describes the method embodiment; the scheme of the present invention is further described below through an apparatus embodiment.
Fig. 2 is a schematic structural diagram of an embodiment of the artificial-intelligence-based apparatus for mining high-quality articles according to the present invention. As shown in Fig. 2, the apparatus includes a preprocessing unit 201 and a mining unit 202.
The preprocessing unit 201 is configured to mine articles from the acquired microblog posts, filter out the mined articles that do not meet predetermined requirements, divide the remaining articles into positive samples and negative samples, and train a high-quality article recognition model on the positive samples and the negative samples.
The mining unit 202 is configured to assess, with the high-quality article recognition model, the quality of articles mined from microblog posts, to obtain the identified high-quality articles.
As shown in Fig. 2, the preprocessing unit 201 may specifically include an acquisition subunit 2011, a filtering subunit 2012, and a training subunit 2013.
The acquisition subunit 2011 is configured to obtain the short link of an article from a microblog post, restore the short link to a long link, and obtain the article corresponding to the long link.
The filtering subunit 2012 is configured to filter out articles that do not meet predetermined requirements.
The training subunit 2013 is configured to divide the remaining articles into positive samples and negative samples, and to train a high-quality article recognition model on the positive samples and the negative samples.
The short chain of article can be obtained from microblogging blog article by obtaining subelement 2011, and revert to short chain according to prior art Long-chain, and then the article including full content is obtained using long-chain, article completion is realized, and be stored in article database.
In addition, forwarding, comment and like time of every microblogging blog article etc. can also be obtained respectively by obtaining subelement 2011, and Every article can be directed to, determines most influential microblogging blog article corresponding to it respectively, most influential microblogging blog article is Forward, comment in microblogging blog article comprising this article, the microblogging blog article that like time sum is maximum.
For each article being stored in article database, further it can be filtered, not met so as to filter out The article of pre-provisioning request, such as obvious low quality article.
For example for every article, filtering subelement 2012 can carry out advertisement filter and yellow filtering to it respectively, if appointing One filtering is by the way that then can determine that this article is the article for not meeting pre-provisioning request.
Wherein, advertisement filter can be carried out to every article by the way of rule-based filtering, the side of keyword filtration can be used Formula, yellow filtering is carried out to every article.
It is preferred that the flow of advertisement filter can be:Article white list->Title blacklist->Content blacklist.
Except being filtered above by progress advertisement filter and yellow pre-provisioning request is not met to filter out in the article excavated Article outside, filtering subelement 2012 can also be directed to every article, determine respectively most influential corresponding to this article Whether the bean vermicelli number of the bloger of microblogging blog article is more than predetermined threshold, if it is not, then can determine that this article is not meet pre-provisioning request Article.
The specific value of the threshold value can be decided according to the actual requirements, such as, can value be 10000, bean vermicelli number be more than should The bloger of threshold value is usually big V user, and its authority is of a relatively high, therefore, can retain corresponding most influential microblogging and win The bloger of text is the article of big V user.
After the above filtering, the training subunit 2013 may divide the remaining articles into positive samples and negative samples, and then train a high-quality article identification model on the positive and negative samples thus obtained.
In practice, each article may be manually labeled as a high-quality article or a low-quality article; an article labeled high-quality is taken as a positive sample, and any other article as a negative sample.
For the resulting positive and negative samples, the training subunit 2013 may extract features from each of them and then train the high-quality article identification model on the extracted features.
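The training step can be illustrated with a small binary classifier. The text does not say which model type is used, so the logistic regression below is purely a stand-in, and the two-dimensional feature vectors and their values are made-up, pre-scaled examples rather than data from the source.

```python
import math
from typing import List, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def sigmoid(z: float) -> float:
    # Numerically stable form that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(model: Model, x: List[float]) -> float:
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(samples: List[List[float]], labels: List[int],
          epochs: int = 2000, lr: float = 0.1) -> Model:
    """Fit logistic regression by stochastic gradient descent.
    labels: 1 for positive (high-quality) samples, 0 for negative samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = score((weights, bias), x) - y  # gradient of log loss w.r.t. the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(model: Model, x: List[float]) -> bool:
    """True if the article is classified as high quality."""
    return score(model, x) >= 0.5


# Hypothetical, pre-scaled feature vectors: [popularity feature, paragraph feature].
X = [[0.9, 0.5], [0.8, 0.7], [0.1, 0.4], [0.05, 0.6]]
y = [1, 1, 0, 0]  # manual labels: 1 = high quality, 0 = low quality
model = train(X, y)
```

Any classifier trained on the same labeled feature vectors could take the place of `train` and `predict`.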
The extracted features may include a feature that reflects article popularity, that is, whether the article's content concerns a trending event. Other features, such as the number of paragraphs, may also be included; which features are used may be determined according to actual needs.
The feature that reflects article popularity may be a like-difference feature. For each article, the most influential microblog post corresponding to the article may be passed to the existing Baidu NLP like prediction model to estimate that post's future like count and similar figures, from which the like difference can be computed. The like difference reflects the article's popularity. For example, if the post currently has 100 likes and is predicted to have 1000 after one day, its popularity is high; conversely, if it currently has 100 likes and is predicted to have 110 after one day, its popularity is low. The larger the like difference, the greater the popularity.
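The like-difference computation can be sketched as follows. The predicted like count is taken as an input here, because the interface of the like prediction model is not described in the text; the function names and the choice of paragraph count as the second feature are illustrative assumptions.

```python
from typing import List


def like_difference(current_likes: int, predicted_likes: int) -> int:
    """Predicted future like count minus current like count."""
    return predicted_likes - current_likes


def extract_features(current_likes: int, predicted_likes: int,
                     paragraph_count: int) -> List[float]:
    """Build a feature vector: [like difference, number of paragraphs]."""
    return [float(like_difference(current_likes, predicted_likes)),
            float(paragraph_count)]
```

With the figures from the example above, a post going from 100 to 1000 likes has a like difference of 900, while one going from 100 to 110 has a like difference of 10. In practice the raw difference would likely be scaled before training.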
Once the high-quality article identification model has been obtained, the mining unit 202 can use it to mine high-quality articles from microblog posts in real time.
For the specific workflow of the apparatus embodiment shown in Fig. 2, refer to the corresponding description in the foregoing method embodiment; it is not repeated here.
In summary, with the solution of the above apparatus embodiment, articles are first mined from the acquired microblog posts, and articles that do not meet the predetermined requirements are filtered out of the mined articles. The remaining articles are then divided into positive and negative samples, on which a high-quality article identification model is trained. Subsequently, the model can be used to perform quality identification on articles mined from microblog posts, yielding the identified high-quality articles. Compared with the prior art, the solution mines high-quality articles from microblog posts; because the volume of microblog posts is very large, a large number of high-quality articles can be obtained without hiring writers, so the cost is low. In addition, a wide variety of high-quality articles can be mined from microblog posts without being constrained by templates, so the results are sufficiently novel.
Fig. 3 shows a block diagram of an exemplary computer system/server 12 suitable for implementing embodiments of the present invention. The computer system/server 12 shown in Fig. 3 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 3, the computer system/server 12 takes the form of a general-purpose computing device. Its components may include, but are not limited to: one or more processors (processing units) 16, a memory 28, and a bus 18 connecting the different system components (including the memory 28 and the processor 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer system/server 12 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the computer system/server 12, including volatile and non-volatile media, and removable and non-removable media.
The memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, a storage system 34 may be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 3, commonly referred to as a "hard disk drive"). Although not shown in Fig. 3, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk"), and an optical disk drive for reading from and writing to a removable non-volatile optical disk (such as a CD-ROM, DVD-ROM or other optical medium), may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present invention.
The computer system/server 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any device (such as a network card, a modem, etc.) that enables the computer system/server 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. The computer system/server 12 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 20. As shown in Fig. 3, the network adapter 20 communicates with the other modules of the computer system/server 12 through the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
By running programs stored in the memory 28, the processor 16 executes various functional applications and data processing, for example implementing the method of the embodiment shown in Fig. 1: mining articles from the acquired microblog posts; filtering out, from the mined articles, those that do not meet the predetermined requirements; dividing the remaining articles into positive and negative samples; training a high-quality article identification model on the positive and negative samples; and performing quality identification, according to the model, on articles mined from microblog posts, to obtain the identified high-quality articles.
For the specific implementation, refer to the relevant descriptions in the foregoing embodiments; it is not repeated here.
The present invention also discloses a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the method of the embodiment shown in Fig. 1.
Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device.
The program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in an actual implementation.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform some of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (14)

  1. A high-quality article mining method based on artificial intelligence, characterized by comprising:
    mining articles from acquired microblog posts;
    filtering out, from the mined articles, articles that do not meet predetermined requirements;
    dividing the remaining articles into positive samples and negative samples;
    training a high-quality article identification model on the positive samples and the negative samples;
    performing quality identification, according to the high-quality article identification model, on articles mined from microblog posts, to obtain the identified high-quality articles.
  2. The method according to claim 1, characterized in that
    mining articles from the acquired microblog posts comprises:
    obtaining a short link to an article from a microblog post;
    restoring the short link to a long link;
    obtaining the article corresponding to the long link.
  3. The method according to claim 1, characterized in that
    filtering out, from the mined articles, articles that do not meet the predetermined requirements comprises:
    for each article, performing advertisement filtering and pornography filtering on it separately, and if either filter is not passed, determining that the article does not meet the predetermined requirements.
  4. The method according to claim 3, characterized in that
    performing advertisement filtering and pornography filtering on each article separately comprises:
    performing advertisement filtering on the article by rule-based filtering;
    performing pornography filtering on the article by keyword filtering.
  5. The method according to claim 3, characterized in that
    the method further comprises:
    obtaining the numbers of reposts, comments and likes of each microblog post;
    for each article, determining the most influential microblog post corresponding to it, the most influential microblog post being, among the microblog posts containing the article, the one with the largest sum of reposts, comments and likes;
    filtering out, from the mined articles, articles that do not meet the predetermined requirements further comprises:
    for each article, determining whether the follower count of the blogger of the most influential microblog post corresponding to the article is greater than a predetermined threshold, and if not, determining that the article does not meet the predetermined requirements.
  6. The method according to claim 1, characterized in that
    training the high-quality article identification model on the positive samples and the negative samples comprises:
    extracting features from the positive samples and the negative samples respectively, the extracted features including a feature that reflects article popularity;
    training the high-quality article identification model on the extracted features.
  7. A high-quality article mining apparatus based on artificial intelligence, characterized by comprising: a preprocessing unit and a mining unit;
    the preprocessing unit being configured to mine articles from acquired microblog posts; filter out, from the mined articles, articles that do not meet predetermined requirements; divide the remaining articles into positive samples and negative samples; and train a high-quality article identification model on the positive samples and the negative samples;
    the mining unit being configured to perform quality identification, according to the high-quality article identification model, on articles mined from microblog posts, to obtain the identified high-quality articles.
  8. The apparatus according to claim 7, characterized in that
    the preprocessing unit comprises: an acquisition subunit, a filtering subunit and a training subunit;
    the acquisition subunit being configured to obtain a short link to an article from a microblog post, restore the short link to a long link, and obtain the article corresponding to the long link;
    the filtering subunit being configured to filter out articles that do not meet the predetermined requirements;
    the training subunit being configured to divide the remaining articles into positive samples and negative samples, and to train the high-quality article identification model on the positive samples and the negative samples.
  9. The apparatus according to claim 8, characterized in that
    the filtering subunit, for each article, performs advertisement filtering and pornography filtering on it separately, and if either filter is not passed, determines that the article does not meet the predetermined requirements.
  10. The apparatus according to claim 9, characterized in that
    the filtering subunit performs advertisement filtering on the article by rule-based filtering;
    the filtering subunit performs pornography filtering on the article by keyword filtering.
  11. The apparatus according to claim 9, characterized in that
    the acquisition subunit is further configured to:
    obtain the numbers of reposts, comments and likes of each microblog post;
    for each article, determine the most influential microblog post corresponding to it, the most influential microblog post being, among the microblog posts containing the article, the one with the largest sum of reposts, comments and likes;
    the filtering subunit is further configured to:
    for each article, determine whether the follower count of the blogger of the most influential microblog post corresponding to the article is greater than a predetermined threshold, and if not, determine that the article does not meet the predetermined requirements.
  12. The apparatus according to claim 8, characterized in that
    the training subunit extracts features from the positive samples and the negative samples respectively, the extracted features including a feature that reflects article popularity, and trains the high-quality article identification model on the extracted features.
  13. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the method according to any one of claims 1 to 6.
  14. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN201710862013.3A 2017-09-21 2017-09-21 High quality articles method for digging, device and storage medium based on artificial intelligence Pending CN107729401A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710862013.3A CN107729401A (en) 2017-09-21 2017-09-21 High quality articles method for digging, device and storage medium based on artificial intelligence

Publications (1)

Publication Number Publication Date
CN107729401A true CN107729401A (en) 2018-02-23

Family

ID=61206735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710862013.3A Pending CN107729401A (en) 2017-09-21 2017-09-21 High quality articles method for digging, device and storage medium based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN107729401A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111292134A (en) * 2020-02-25 2020-06-16 上海昌投网络科技有限公司 Method and device for judging whether WeChat public number can be advertised

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100036784A1 (en) * 2008-08-07 2010-02-11 Yahoo! Inc. Systems and methods for finding high quality content in social media
CN103970801A (en) * 2013-02-05 2014-08-06 腾讯科技(深圳)有限公司 Method and device for recognizing microblog advertisement blog articles
CN104239539A (en) * 2013-09-22 2014-12-24 中科嘉速(北京)并行软件有限公司 Microblog information filtering method based on multi-information fusion
CN104281653A (en) * 2014-09-16 2015-01-14 南京弘数信息科技有限公司 Viewpoint mining method for ten million microblog texts
CN106202211A (en) * 2016-06-27 2016-12-07 四川大学 A kind of integrated microblogging rumour recognition methods based on microblogging type

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
莫祖英: "微博信息内容质量评价及其对用户", 《中国博士学位论文全文数据库信息科技辑》 *
薛国林: "《政府官员开微博的16个要诀》", 30 June 2013 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180223
