CN106682007A - Data acquisition method and device - Google Patents

Publication number
CN106682007A
Authority
CN
China
Prior art keywords
source
web page
news
information
page news
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510752393.6A
Other languages
Chinese (zh)
Inventor
刘嘉
钦滨杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201510752393.6A
Publication of CN106682007A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a data acquisition method and device, relates to the field of network technology, and addresses the low accuracy of existing methods for obtaining influence data of web page news. The method includes: acquiring source information of web page news; extracting, from a preset web page source library, source information corresponding to the source information of the web page news; and determining the weight value corresponding to the extracted source information as the influence data of the web page news. The preset web page source library stores a plurality of pieces of source information and a weight value corresponding to each piece.

Description

Data acquisition method and device
Technical field
The present invention relates to the field of network technology, and in particular to a data acquisition method and device.
Background
With the spread of the Internet and the surge in the number of Internet users, online news has rapidly emerged as a new and relatively independent mode of news dissemination and has become another important channel through which people obtain information. Here, online news refers to news information transmitted over the Internet. Research on the influence of online news is attracting growing attention, because calculating the influence of online news provides a basis for judging the validity of the news. Compared with other mass media, online media is more complex, and this complexity stems both from media technology and from the spatial characteristics of the network.
At present, the repost rate and reply rate of web page news are used as the indicators of news influence. However, these rates grow in proportion to the time elapsed since the news appeared and decline after a period of time, so this calculation method is inaccurate for evaluating the influence of real-time news, and the accuracy of the web page news influence obtained in this way is low.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a data acquisition method and device that overcome the above problems or at least partially solve them.
To achieve the above object, the present invention mainly provides the following technical solutions:
In one aspect, an embodiment of the present invention provides a data acquisition method, the method including:
acquiring source information of web page news;
extracting, from a preset web page source library, source information corresponding to the source information of the web page news, wherein the preset web page source library stores a plurality of pieces of source information and a weight value corresponding to each piece of source information;
determining the weight value corresponding to the extracted source information as the influence data of the web page news.
In another aspect, an embodiment of the present invention also provides a data acquisition device, the device including:
an acquiring unit, configured to acquire source information of web page news;
an extracting unit, configured to extract, from a preset web page source library, source information corresponding to the source information of the web page news, wherein the preset web page source library stores a plurality of pieces of source information and a weight value corresponding to each piece of source information;
a determining unit, configured to determine the weight value corresponding to the extracted source information as the influence data of the web page news.
With the above technical solutions, the embodiments of the present invention have at least the following advantages:
Embodiments of the present invention provide a data acquisition method and device. The source information of web page news is acquired first; source information corresponding to it is then extracted from a preset web page source library, which stores a plurality of pieces of source information and a weight value corresponding to each; finally, the weight value corresponding to the extracted source information is determined as the influence data of the web page news. Compared with the current practice of using the repost rate and reply rate of web page news as the indicators of its influence, the present invention converts the evaluation of online news influence into an evaluation of the web page from which the news originates. False news with a high repost rate and a high comment rate can therefore be identified, the influence of online news can be judged in real time, and the accuracy of the acquired web page news influence is improved.
Description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 is a flow chart of a data acquisition method provided by an embodiment of the present invention;
Fig. 2 is a flow chart of another data acquisition method provided by an embodiment of the present invention;
Fig. 3 is a block diagram of a data acquisition device provided by an embodiment of the present invention;
Fig. 4 is a block diagram of another data acquisition device provided by an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope conveyed fully to those skilled in the art.
To make the advantages of the technical solution of the present invention clearer, the invention is described in detail below with reference to the drawings and embodiments.
An embodiment of the present invention provides a data acquisition method. As shown in Fig. 1, the method includes:
S101, acquiring the source information of web page news.
The source information of online news indicates the website to which the news belongs. For example, if there is a news item on the network about "the Fifth Plenary Session of the 18th Central Committee will be held tomorrow", the source information acquired for that web page news is a central government website.
It should be noted that the source information of web page news is acquired as follows. The web page news is obtained first; a crawler then crawls data across the whole network to determine whether web pages identical to the web page news exist. If they exist, the website where the news originally appeared is found among these identical pages, and the source information is acquired from that original website. If they do not exist, the source information is acquired directly from the web page news.
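The acquisition procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patent's implementation: the `Page` record, its `first_seen` timestamp, the whitespace and case normalization used to decide that two pages are "identical", and the rule of taking the earliest page as the original are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Page:
    url: str
    site: str          # website the page belongs to (the "source information")
    body: str          # news text
    first_seen: float  # hypothetical publication timestamp


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivially different copies compare equal."""
    return " ".join(text.split()).lower()


def acquire_source(news: Page, crawled: Iterable[Page]) -> str:
    """Return the source information for `news` (assumed selection rule)."""
    key = normalize(news.body)
    identical = [p for p in crawled
                 if p.url != news.url and normalize(p.body) == key]
    if not identical:
        # No identical page exists: take the source from the news page itself.
        return news.site
    # Identical pages exist: assume the earliest one is the original source.
    origin = min(identical + [news], key=lambda p: p.first_seen)
    return origin.site
```

In practice the crawled pages would come from a crawler rather than an in-memory list, and near-duplicate detection would likely need something more tolerant than exact text equality.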
S102, extracting, from the preset web page source library, source information corresponding to the source information of the web page news.
The preset web page source library stores a plurality of pieces of source information and a weight value corresponding to each piece. The weight values may be assigned according to the actual credit rating of the website, according to the government authority level of the website, or according to a combination of the two; the embodiments of the present invention do not specifically limit this. The weight value represents the influence of the web page news: the larger the weight value, the higher the influence.
The government authority level mentioned here may refer to the officially published rank of a government department, or to the ranking between main sites and sub-sites published by each website. The present invention places no limitation on this.
For example, when the preset web page source library is divided by the government authority level of websites, the source information and the corresponding weight values may be as follows:
  Level 1, central government websites: 50%
  Level 2, local government websites: 30%
    Level 2.1, provincial government websites: 15%
    Level 2.2, municipal government websites: 10%
    Level 2.3, county government websites: 5%
  Level 3, news websites: 20%
    Level 3.1, provincial news sites: 10%
    Level 3.2, municipal news sites: 6%
    Level 3.3, county news sites: 4%
S103, determining the weight value corresponding to the extracted source information as the influence data of the web page news.
In this embodiment, the source information of web page news is acquired first; source information corresponding to it is then extracted from the preset web page source library; finally, the weight value corresponding to the extracted source information is determined as the influence data of the web page news. By converting the evaluation of online news influence into an evaluation of the web page from which the news originates, false news with a high repost rate and a high comment rate can be identified, the influence of online news can be judged in real time, and the accuracy of the acquired web page news influence is improved.
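Steps S102 and S103 amount to a keyed lookup. The sketch below is a minimal illustration that stores the example percentages from this embodiment in a dictionary; the category strings used as keys and the function name `influence_weight` are assumptions made for the example, since the patent does not specify how source information is keyed.

```python
from typing import Optional

# Preset web page source library: source category -> weight value.
# The weights follow the example in the text; the key names are hypothetical.
SOURCE_LIBRARY = {
    "central government": 0.50,
    "local government": 0.30,
    "provincial government": 0.15,
    "municipal government": 0.10,
    "county government": 0.05,
    "news": 0.20,
    "provincial news": 0.10,
    "municipal news": 0.06,
    "county news": 0.04,
}


def influence_weight(source_info: str) -> Optional[float]:
    """S102: find the matching library entry; S103: return its weight value.

    Returns None when the source is not in the library (an assumed behaviour,
    the text does not say what happens for unknown sources).
    """
    return SOURCE_LIBRARY.get(source_info)
```

For instance, `influence_weight("central government")` yields the level 1 weight, and a larger result indicates higher influence.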
In summary, this embodiment provides a data acquisition method in which the source information of web page news is acquired, the corresponding source information is extracted from a preset web page source library that stores a plurality of pieces of source information and a weight value corresponding to each, and the weight value of the extracted source information is determined as the influence data of the web page news. Compared with using the repost rate and reply rate of web page news as the indicators of its influence, the method evaluates the web page from which the news originates, so that false news with a high repost rate and a high comment rate can be identified, the influence of online news can be judged in real time, and the accuracy of the acquired web page news influence is improved.
An embodiment of the present invention provides another data acquisition method. As shown in Fig. 2, the method includes:
S201, acquiring the source information of web page news.
The source information of online news indicates the website to which the news belongs. For example, if there is a news item on the network about "the first batch of 42 enterprises settle in the entrepreneurship and innovation base of Dadong District, Shenyang", the source information acquired for that web page news is a Liaoning provincial government website.
In this embodiment, step S201 includes: obtaining the web page news; crawling data across the whole network with a crawler and determining whether web pages identical to the web page news exist; and, if they do not exist, acquiring the source information from the web page news.
In this embodiment, after determining whether web pages identical to the web page news exist, the method further includes: if they exist, extracting the web page news of the original source from the identical web pages, and acquiring the source information from the web page news of the original source. That is, when identical pages exist, the website where the news originally appeared is found among them and the source information is taken from that website; otherwise the source information is taken directly from the web page news.
It should be noted that the web page news of the original source may be extracted from the identical web pages using parameters such as the following: the PR value of the page (the higher the PR value, the more likely the page is regarded as the original version); the time the page was first indexed (among pages found to have identical content, the earlier a page was indexed by a search engine, the more likely it is regarded as the original source); the domain registration time (a page on an older domain is more likely to be treated as the original source); and the Technorati authority of the website. The embodiments of the present invention do not specifically limit this.
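One way to combine the parameters listed above is a simple lexicographic ranking, sketched below. The text names the signals but does not say how they are combined, so the ordering used here (earliest indexing time first, then oldest domain, then highest PR value) is purely an assumption, as are the `Candidate` fields and the `pick_origin` name. Technorati authority is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    url: str
    pr_value: float           # page PR value, higher suggests the original
    first_indexed: float      # time first indexed by a search engine
    domain_registered: float  # domain registration time


def pick_origin(candidates: Sequence[Candidate]) -> Candidate:
    """Pick the most likely original among pages with identical content.

    Assumed rule: prefer the earliest indexed page, break ties by the
    older domain, then by the higher PR value.
    """
    if not candidates:
        raise ValueError("no candidate pages")
    return min(
        candidates,
        key=lambda c: (c.first_indexed, c.domain_registered, -c.pr_value),
    )
```

A weighted score over the same signals would be an equally plausible design; the lexicographic form is used here only because it needs no tuning constants.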
S202, determining whether a historical weight value can be extracted from the web page news.
In this embodiment, checking whether a historical weight value can be extracted from the web page news improves the efficiency of obtaining the weight value of the web page news. If a historical weight value can be extracted, it is directly determined as the weight value of the web page news, so the weight value need not be obtained again by looking it up in the preset web page source library, which improves the efficiency of acquiring the web page news influence data.
S203a, if a historical weight value can be extracted from the web page news, determining the extracted historical weight value as the influence data of the web page news.
S203b, if a historical weight value cannot be extracted from the web page news, extracting, from the preset web page source library, source information corresponding to the source information of the web page news.
Step S203b is the alternative branch to step S203a. The preset web page source library stores a plurality of pieces of source information and a weight value corresponding to each piece. The weight values may be assigned according to the actual credit rating of the website, according to the government authority level of the website, or according to a combination of the two; the embodiments of the present invention do not specifically limit this. The weight value represents the influence of the web page news: the larger the weight value, the higher the influence.
In this embodiment, the source information in the preset web page source library is assigned a weight value according to the government authority level of the information source. For example, when the preset web page source library is divided by the government authority level of websites, the source information and the corresponding weight values may be as follows:
  Level 1, central government websites: 50%
  Level 2, local government websites: 30%
    Level 2.1, provincial government websites: 15%
    Level 2.2, municipal government websites: 10%
    Level 2.3, county government websites: 5%
  Level 3, news websites: 20%
    Level 3.1, provincial news sites: 10%
    Level 3.2, municipal news sites: 6%
    Level 3.3, county news sites: 4%
S204b, determining the weight value corresponding to the extracted source information as the influence data of the web page news.
In this embodiment, the source information of web page news is acquired first, and it is then determined whether a historical weight value can be extracted from the web page news. If so, the extracted historical weight value is determined as the influence data of the web page news. If not, source information corresponding to the source information of the web page news is extracted from the preset web page source library, and the weight value corresponding to the extracted source information is determined as the influence data of the web page news. By converting the evaluation of web page news influence into an evaluation of the web page from which the news originates, false news with a high repost rate and a high comment rate can be identified, the influence of online news can be judged in real time, and both the accuracy and the efficiency of acquiring web page news influence are improved.
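The branch between S203a and S203b/S204b can be sketched as follows. This is a hedged illustration only: the `NewsPage` record, its optional `history_weight` field, and the choice to return `None` for sources missing from the library are assumptions, since the text does not describe where a historical weight value is stored or how unknown sources are handled.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class NewsPage:
    source: str                             # source information from S201
    history_weight: Optional[float] = None  # hypothetical cached weight


def get_influence_data(news: NewsPage,
                       library: Mapping[str, float]) -> Optional[float]:
    """S202: check for a historical weight, then branch."""
    if news.history_weight is not None:
        # S203a: reuse the historical weight and skip the library lookup.
        return news.history_weight
    # S203b and S204b: fall back to the preset web page source library.
    return library.get(news.source)
```

The efficiency gain described in the text corresponds to the early return: pages that already carry a weight never touch the library.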
In summary, this embodiment provides another data acquisition method in which the source information of web page news is acquired, the corresponding source information is extracted from a preset web page source library that stores a plurality of pieces of source information and a weight value corresponding to each, and the weight value of the extracted source information is determined as the influence data of the web page news. Compared with using the repost rate and reply rate of web page news as the indicators of its influence, the method evaluates the web page from which the news originates, so that false news with a high repost rate and a high comment rate can be identified, the influence of online news can be judged in real time, and the accuracy of the acquired web page news influence is improved.
Further, an embodiment of the present invention provides a data acquisition device. As shown in Fig. 3, the device includes an acquiring unit 31, an extracting unit 32, and a determining unit 33.
The acquiring unit 31 is configured to acquire the source information of web page news.
The extracting unit 32 is configured to extract, from a preset web page source library, source information corresponding to the source information of the web page news, wherein the preset web page source library stores a plurality of pieces of source information and a weight value corresponding to each piece.
The determining unit 33 is configured to determine the weight value corresponding to the extracted source information as the influence data of the web page news.
It should be noted that, for other descriptions of the functional units of the data acquisition device provided by this embodiment, reference may be made to the corresponding description of the method shown in Fig. 1, which is not repeated here. It should be understood that the device in this embodiment can implement all of the content of the foregoing method embodiment.
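The three-unit structure of Fig. 3 maps naturally onto three small classes composed into one device object. The sketch below is illustrative only; the class and method names, the injected `resolve_source` callable standing in for the crawler-based acquisition, and the dictionary used as the source library are all assumptions rather than anything specified in the text.

```python
from typing import Callable, Mapping, Optional


class AcquiringUnit:
    """Unit 31: acquires the source information of web page news."""

    def __init__(self, resolve_source: Callable[[str], str]) -> None:
        self._resolve = resolve_source  # hypothetical crawler-backed resolver

    def acquire(self, news_url: str) -> str:
        return self._resolve(news_url)


class ExtractingUnit:
    """Unit 32: extracts the matching entry from the preset source library."""

    def __init__(self, library: Mapping[str, float]) -> None:
        self._library = library

    def extract(self, source_info: str) -> Optional[str]:
        return source_info if source_info in self._library else None


class DeterminingUnit:
    """Unit 33: determines the entry's weight value as the influence data."""

    def __init__(self, library: Mapping[str, float]) -> None:
        self._library = library

    def determine(self, entry: Optional[str]) -> Optional[float]:
        return None if entry is None else self._library[entry]


class DataAcquisitionDevice:
    def __init__(self, resolve_source: Callable[[str], str],
                 library: Mapping[str, float]) -> None:
        self.acquiring = AcquiringUnit(resolve_source)
        self.extracting = ExtractingUnit(library)
        self.determining = DeterminingUnit(library)

    def influence(self, news_url: str) -> Optional[float]:
        source = self.acquiring.acquire(news_url)
        entry = self.extracting.extract(source)
        return self.determining.determine(entry)
```

Injecting the resolver keeps the sketch testable without network access; a real device would plug in the whole-network crawl described for step S101.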
In summary, this embodiment provides a data acquisition device that acquires the source information of web page news, extracts the corresponding source information from a preset web page source library that stores a plurality of pieces of source information and a weight value corresponding to each, and determines the weight value of the extracted source information as the influence data of the web page news. Compared with using the repost rate and reply rate of web page news as the indicators of its influence, the device evaluates the web page from which the news originates, so that false news with a high repost rate and a high comment rate can be identified, the influence of online news can be judged in real time, and the accuracy of the acquired web page news influence is improved.
Further, an embodiment of the present invention provides another data acquisition device. As shown in Fig. 4, the device includes an acquiring unit 41, an extracting unit 42, and a determining unit 43.
The acquiring unit 41 is configured to acquire the source information of web page news.
The extracting unit 42 is configured to extract, from a preset web page source library, source information corresponding to the source information of the web page news, wherein the preset web page source library stores a plurality of pieces of source information and a weight value corresponding to each piece.
The determining unit 43 is configured to determine the weight value corresponding to the extracted source information as the influence data of the web page news.
Further, the acquiring unit 41 includes:
an acquisition module 411, configured to obtain the web page news;
a judging module 412, configured to crawl data across the whole network with a crawler and determine whether web pages identical to the web page news exist.
The acquisition module 411 is further configured to acquire the source information from the web page news if no web page identical to the web page news exists.
Further, the acquiring unit 41 also includes an extraction module 412.
The extraction module 412 is configured to extract the web page news of the original source from the identical web pages if web pages identical to the web page news exist.
The acquisition module 411 is specifically configured to acquire the source information from the web page news of the original source.
Further, the device also includes a judging unit 44.
The judging unit 44 is configured to determine whether a historical weight value can be extracted from the web page news.
The determining unit 43 is specifically configured to determine the extracted historical weight value as the influence data of the web page news if a historical weight value can be extracted from the web page news.
The extracting unit 42 is specifically configured to extract, from the preset web page source library, source information corresponding to the source information of the web page news if a historical weight value cannot be extracted from the web page news.
In this embodiment, the source information in the preset web page source library is assigned a weight value according to the government authority level of the information source.
It should be noted that, for other descriptions of the functional units of this data acquisition device, reference may be made to the corresponding description of the method shown in Fig. 2, which is not repeated here. It should be understood that the device in this embodiment can implement all of the content of the foregoing method embodiment.
In summary, this embodiment provides another data acquisition device that acquires the source information of web page news, extracts the corresponding source information from a preset web page source library that stores a plurality of pieces of source information and a weight value corresponding to each, and determines the weight value of the extracted source information as the influence data of the web page news. Compared with using the repost rate and reply rate of web page news as the indicators of its influence, the device evaluates the web page from which the news originates, so that false news with a high repost rate and a high comment rate can be identified, the influence of online news can be judged in real time, and the accuracy of the acquired web page news influence is improved.
The data acquisition device includes a processor and a memory. The acquiring unit, extracting unit, determining unit, judging unit, and so on described above are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the accuracy of the web page news influence data is improved by adjusting kernel parameters.
The memory may include non-persistent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: acquiring source information of web page news; extracting, from a preset web page source library, source information corresponding to the source information of the web page news, wherein the preset web page source library stores a plurality of pieces of source information and a weight value corresponding to each piece; and determining the weight value corresponding to the extracted source information as the influence data of the web page news.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, and the instruction means implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
The above are merely embodiments of the present application and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims (10)

1. A data acquisition method, characterized by comprising:
obtaining the source information of a web page news item;
extracting, from a preset web page source library, the source information corresponding to the source information of the web page news, wherein the preset web page source library stores multiple pieces of source information and the weight value corresponding to each piece of source information;
determining the weight value corresponding to the extracted source information as the data representing the influence of the web page news.
2. The method according to claim 1, characterized in that obtaining the source information of the web page news comprises:
obtaining the web page news;
crawling whole-network data with a crawler, and judging whether a web page identical to the web page news exists;
if no such web page exists, obtaining the source information from the web page news.
3. The method according to claim 2, characterized in that, after judging whether a web page identical to the web page news exists, the method further comprises:
if such a web page exists, extracting the web page news of the originating source from the identical web pages;
obtaining the source information from the web page news of the originating source.
4. The method according to claim 3, characterized in that, before extracting, from the preset web page source library, the source information corresponding to the source information of the web page news, the method further comprises:
judging whether a history weight value can be extracted from the web page news;
if a history weight value can be extracted from the web page news, determining the extracted history weight value as the data representing the influence of the web page news.
5. The method according to claim 4, characterized in that extracting, from the preset web page source library, the source information corresponding to the source information of the web page news comprises:
if a history weight value cannot be extracted from the web page news, extracting, from the preset web page source library, the source information corresponding to the source information of the web page news.
6. The method according to any one of claims 1-5, characterized in that each piece of source information in the preset web page source library is configured with a corresponding weight value according to the weight level indicated by that source information.
7. A data acquisition device, characterized by comprising:
an acquiring unit, configured to obtain the source information of a web page news item;
an extraction unit, configured to extract, from a preset web page source library, the source information corresponding to the source information of the web page news, wherein the preset web page source library stores multiple pieces of source information and the weight value corresponding to each piece of source information;
a determining unit, configured to determine the weight value corresponding to the extracted source information as the data representing the influence of the web page news.
8. The device according to claim 7, characterized in that the acquiring unit comprises:
an acquisition module, configured to obtain the web page news;
a judging module, configured to crawl whole-network data with a crawler and judge whether a web page identical to the web page news exists;
the acquisition module being further configured to obtain the source information from the web page news if no web page identical to the web page news exists.
9. The device according to claim 8, characterized in that the acquiring unit further comprises an extraction module;
the extraction module is configured to extract the web page news of the originating source from the identical web pages if a web page identical to the web page news exists;
the acquisition module is specifically configured to obtain the source information from the web page news of the originating source.
10. The device according to claim 9, characterized in that the device further comprises a judging unit;
the judging unit is configured to judge whether a history weight value can be extracted from the web page news;
the determining unit is specifically configured to, if a history weight value can be extracted from the web page news, determine the extracted history weight value as the data representing the influence of the web page news.
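The flow recited in claims 2 to 5 (and mirrored by the units in claims 7 to 10) can be sketched as follows. This is an illustration under stated assumptions, not the applicant's implementation: the `NewsPage` fields, the hash-based test for "identical" pages, and the choice of the earliest-published copy as the originating source are all hypothetical, since the claims do not say how identity or the originating source is determined.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class NewsPage:
    url: str
    body: str
    source: str
    published_at: int                        # assumed field, e.g. a Unix timestamp
    history_weight: Optional[float] = None   # previously determined weight, if any


def _fingerprint(page: NewsPage) -> str:
    # Assumption: two pages are "identical" when their whitespace-normalized,
    # lower-cased bodies hash to the same value.
    normalized = " ".join(page.body.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def acquire_source(news: NewsPage, crawled: List[NewsPage]) -> str:
    """Claims 2-3: use the news item's own source unless identical pages exist."""
    target = _fingerprint(news)
    identical = [p for p in crawled
                 if p.url != news.url and _fingerprint(p) == target]
    if not identical:
        return news.source
    # Assumption: the originating source is the earliest-published copy,
    # with the news item itself included among the candidates.
    origin = min(identical + [news], key=lambda p: p.published_at)
    return origin.source


def determine_influence(news: NewsPage, crawled: List[NewsPage],
                        library: Dict[str, float]) -> Optional[float]:
    """Claims 4-5: prefer a history weight, else look the source up in the library."""
    source = acquire_source(news, crawled)
    if news.history_weight is not None:
        return news.history_weight
    return library.get(source)
```

In this sketch, a repost whose body matches an earlier article would be scored with the weight of the earlier article's source, while a news item that already carries a history weight bypasses the library lookup entirely.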
CN201510752393.6A 2015-11-06 2015-11-06 Data acquisition method and device Pending CN106682007A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510752393.6A CN106682007A (en) 2015-11-06 2015-11-06 Data acquisition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510752393.6A CN106682007A (en) 2015-11-06 2015-11-06 Data acquisition method and device

Publications (1)

Publication Number Publication Date
CN106682007A true CN106682007A (en) 2017-05-17

Family

ID=58863906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510752393.6A Pending CN106682007A (en) 2015-11-06 2015-11-06 Data acquisition method and device

Country Status (1)

Country Link
CN (1) CN106682007A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101409634A (en) * 2007-10-10 2009-04-15 中国科学院自动化研究所 Quantitative analysis tools and method for internet news influence based on information retrieval
CN101625693A (en) * 2009-08-10 2010-01-13 北京精讯云顿数据软件有限公司 Method and system of online article statistics
JP2013015973A (en) * 2011-07-01 2013-01-24 Kddi Corp Method and program for extracting small group from social network, and naming and visualizing the same
CN104598477A (en) * 2013-10-31 2015-05-06 北大方正集团有限公司 News transmission effect determining method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20170517