CN109783619A - Data filtering and mining method - Google Patents

Publication number
CN109783619A
Authority
CN
China
Prior art keywords
data
information
model
text
filtering method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811532016.1A
Other languages
Chinese (zh)
Inventor
柴满
吴少丹
刘坤杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGDONG CREAWOR TECHNOLOGY DEVELOPMENT Co Ltd
Original Assignee
GUANGDONG CREAWOR TECHNOLOGY DEVELOPMENT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGDONG CREAWOR TECHNOLOGY DEVELOPMENT Co Ltd
Priority to CN201811532016.1A
Publication of CN109783619A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The technical solution of the invention is a data filtering and mining method that: develops a crawler engine to crawl massive Internet data; uses neural networks and machine learning algorithms to build a core value model that automatically filters the data in real time and mines its high-value information; performs big data processing and natural language processing on the collected data; establishes a data analysis model system comprising a text analysis model, an information value evaluation model, and a deep learning algorithm tuning model; and processes the target data and pushes it to the user. The beneficial effects of the invention are that it helps users accurately identify potential customers; that, compared with manual processing, the process is simpler and more efficient and the recognition rate is higher; and that high-value information and potential customers are identified automatically and accurately by machine learning algorithms, reducing human misjudgments and omissions.

Description

A data filtering and mining method
Technical field
The present invention relates to a data filtering and mining method and belongs to the field of computer technology.
Background
With the rapid development of computer and communication technology and the rapid growth of user demand, computer networks are being applied ever more widely and at ever larger scale. Big data technology is a revolutionary new generation of information technology with data at its core; in the course of mining the potential of data, it can drive innovation in ideas, models, technologies, and application practice. In the big data era, large numbers of enterprises are moving their business toward digitalization and informatization and face a large volume of redundant, disorganized Internet data, while each user, with limited time and energy, finds it difficult to identify valuable information efficiently and accurately.
Summary of the invention
To solve the above problems, the purpose of the present invention is to provide a data filtering and mining method that develops a crawler engine to crawl massive Internet data; uses neural networks and machine learning algorithms to build a core value model that automatically filters the data in real time and mines its high-value information; performs big data processing and natural language processing on the collected data; establishes a data analysis model system comprising a text analysis model, an information value evaluation model, and a deep learning algorithm tuning model; and processes the target data and pushes it to the user.
The technical solution adopted by the invention to solve this problem is a data filtering and mining method, characterized in that it comprises the following steps: S100, collecting data from target websites using Nutch distributed crawler technology; S200, performing big data processing and natural language processing on the collected data, wherein the big data processing includes data extraction, verification, loading, storage, and computation, and the natural language processing includes deduplicating, filtering, classifying, and summarizing the data; S300, establishing a data analysis model system comprising a text analysis model, an information value evaluation model, and a deep learning algorithm tuning model; S400, analyzing the processed data using the data analysis model system and optimizing according to user feedback.
Further, S100 includes:
S101, creating a uniform resource locator (URL) list file; S102, adding entry URL addresses according to the URL list file and generating a to-download list; S103, extracting the to-download list to generate download tasks; S104, downloading the specific page content from the World Wide Web according to the download tasks; S105, parsing and extracting the page content and judging whether it contains URLs; if so, returning to step S102; otherwise, building an index for retrieval from the page content, which completes data collection.
Further, in step S105, extracting the page content includes: automatic web page content extraction, which distinguishes items such as the title and body text in a page, automatically merges multiple pages whose content is continuous, and automatically extracts web forum information; web page body extraction, which automatically analyzes a disorderly page structure and extracts the body text; and multi-character-set encoding conversion, which recognizes and converts characters in encodings including but not limited to GB2312 and UTF-8 using Unicode.
Further, deduplicating the data includes deduplication based on URLs and deduplication based on web page content.
Further, the text classification includes: training on the data using a specified algorithm to obtain different trained models, which are stored in a selected container; and classifying text data according to the trained models.
Further, the specified algorithm includes a linear classification algorithm, a k-nearest-neighbor (KNN) classification algorithm, and the Brown word clustering algorithm.
Further, summarizing the data includes: statistics-based information extraction, which produces a digest by pattern matching against a cue-word dictionary, word frequencies, and the statistical regularities of words and sentences; and understanding-based information extraction, which uses syntactic and semantic knowledge to extract a digest on the basis of understanding the article's content.
Further, the text analysis model uses natural language processing to segment long text into sentences and words and obtains the keywords the text contains, their category matches, and word frequency statistics, preparing data for the information value evaluation model.
Further, the information value evaluation model takes the data produced by the text analysis model and, based on the sentence and word segmentation results from natural language processing, matches keywords at the word or phrase level to apply classification rules and compute an information value score for each article, thereby classifying and scoring different articles; it then screens the scored sentences for key sentences that meet the customer's business needs, forms an article digest, and sends it to the user.
Further, the deep learning algorithm tuning model uses the user feedback received to iteratively correct the keyword base score system and the value information decision threshold in the value evaluation model; a neural network trains iteratively on each batch of samples, and the multipliers and base scores are trained a specified number of times according to the proportion of positive and negative samples to complete the tuning and approach the expected value, where the number of training rounds is configurable.
The beneficial effects of the invention are as follows: the data filtering and mining method helps users accurately identify potential customers; compared with manual processing, the process is simpler and more efficient and the recognition rate is higher; and high-value information and potential customers are identified automatically and accurately by machine learning algorithms, reducing human misjudgments and omissions.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the method of a preferred embodiment of the invention;
Fig. 2 is a schematic flow chart of the method according to a preferred embodiment of the invention;
Fig. 3 is a flow diagram of a preferred embodiment of the invention;
Fig. 4 shows a first preferred embodiment of the invention.
Detailed description of the embodiments
The concept, specific structure, and technical effects of the invention are described clearly and completely below with reference to the embodiments and drawings, so that its purpose, solution, and effects can be fully understood.
Note that, unless otherwise specified, descriptions such as up, down, left, and right used in this disclosure refer only to the relative positions of the components of the disclosure in the drawings. The singular forms "a", "said", and "the" used in this disclosure are intended to include the plural forms as well, unless the context clearly indicates otherwise. In addition, unless otherwise defined, all technical and scientific terms used herein have the meanings commonly understood by those skilled in the art. The terms used in this description serve only to describe specific embodiments and are not intended to limit the invention.
Any and all examples or exemplary language ("for example", "such as") provided herein are intended merely to better illustrate the embodiments of the invention and, unless otherwise claimed, do not limit the scope of the invention.
Referring to Fig. 1, which shows the method flow of a preferred embodiment of the invention, the method includes the following steps.
Data is collected from target websites using Nutch distributed crawler technology.
Nutch is an open source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler. Although web search is a basic requirement for navigating the Internet, the number of web search engines is declining, and this may well evolve into one company monopolizing almost all web search for its own commercial gain, which is clearly not in the interest of Internet users. Compared with commercial search engines, Nutch, as an open source search engine, is more transparent and therefore more trustworthy. All major search engines currently use proprietary ranking algorithms and do not explain why a page is ranked at a particular position; in addition, some search engines rank sites by the fees they pay rather than by their own merit. Nutch, by contrast, has nothing to hide and no motive to distort search results; it does its best to provide users with the best search results.
The crawler is best understood from two angles: its workflow, and the format and meaning of the data files involved. There are three main kinds of data file: the web database, a series of segments, and the index. Their physical files are stored under the crawl results directory in the webdb subfolder of the db directory, the segments folder, and the index folder, respectively.
A single crawl produces many segments. Each segment stores the pages fetched by the crawler in one fetch cycle together with the index of those pages. When crawling, the crawler generates the fetchlist required for each fetch cycle from the link relationships in the WebDB according to a given crawl policy; the Fetcher then fetches and indexes the pages at the URLs in the fetchlist and stores them in the segment. Segments have a limited lifetime: once the crawler has re-fetched these pages, the segments produced by the earlier fetch are obsolete. In storage, segment folders are named by their creation time, which makes it easy to delete obsolete segments and save storage space.
The index covers all pages fetched by the crawler and is obtained by merging the indexes of all individual segments. Nutch indexes with Lucene, so Lucene's index interfaces work equally well on Nutch's index. Note, however, that a segment in Lucene differs from a segment in Nutch: a Lucene segment is a part of the index, whereas a Nutch segment holds the content and index of one portion of the pages in the WebDB, and the final index generated from these segments no longer has any connection to them.
The web database, or WebDB, stores the link structure among the pages the crawler has fetched. It is used only by the crawler and has nothing to do with the Searcher. The WebDB stores two kinds of entity: Page and Link. A Page entity represents an actual web page by describing its characteristics. Because there are many pages to describe, the WebDB indexes Page entities in two ways: by the page's URL and by the MD5 of the page's content. The characteristics a Page entity records mainly include the number of links in the page, fetch information such as the time the page was fetched, and an importance score for the page. Likewise, a Link entity describes the link relationship between two Page entities. The WebDB thus forms a link graph of the fetched pages, in which Page entities are the nodes and Link entities are the edges. The specific data collection flow is shown in Fig. 2 and Fig. 3.
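The Page and Link structure described above can be illustrated with a minimal in-memory sketch. This is not Nutch's actual storage format or API; the class and field names (`Page`, `WebDB`, `by_url`, `by_content_md5`) are hypothetical and chosen only to mirror the description: pages are indexed by URL and by the MD5 of their content, and links are the edges of the graph.

```python
import hashlib
from dataclasses import dataclass, field


def md5_hex(text: str) -> str:
    """Return the hexadecimal MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


@dataclass
class Page:
    # Node of the link graph; the fields follow the characteristics named in the text.
    url: str
    content_md5: str
    fetch_time: float = 0.0
    outlink_count: int = 0
    score: float = 1.0


@dataclass
class WebDB:
    # Hypothetical in-memory stand-in, not the on-disk Nutch WebDB.
    by_url: dict = field(default_factory=dict)
    by_content_md5: dict = field(default_factory=dict)
    links: set = field(default_factory=set)  # edges as (source URL, target URL)

    def add_page(self, url, content, fetch_time=0.0):
        page = Page(url=url, content_md5=md5_hex(content), fetch_time=fetch_time)
        self.by_url[url] = page
        self.by_content_md5[page.content_md5] = page
        return page

    def add_link(self, src_url, dst_url):
        if (src_url, dst_url) in self.links:
            return  # the edge is already recorded
        self.links.add((src_url, dst_url))
        if src_url in self.by_url:
            self.by_url[src_url].outlink_count += 1
```

Keeping two indexes lets a crawler look a page up either by address or by content, which is also what makes content-level duplicate detection cheap.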
Big data processing is performed on the collected data, including the steps of data extraction, verification, loading, storage, and computation.
This big data processing is also known as ETL. ETL extracts data from distributed, heterogeneous sources, such as relational databases and flat data files, into a temporary staging layer, where it is cleaned, transformed, and integrated, and finally loads it into a data warehouse or data mart, where it provides the basis for online analytical processing and data mining in support of decision making.
1. Data cleaning:
Filling gaps: fill in null and missing data, and flag values that cannot be handled.
Data replacement: replace invalid data.
Format normalization: convert the formats extracted from the source data into target formats that are convenient for the warehouse to process.
Primary and foreign key constraints: establish primary and foreign key constraints, and replace invalid data or export it to an error file for reprocessing.
2. Data transformation:
Data merging: implemented with multi-table joins; use lookup when joining a large table with a small table, and join when joining two large tables (index every join field to keep the join queries efficient).
Data splitting: split data according to defined rules.
Row/column transposition, sorting and renumbering, and removal of duplicate records.
Data validation: lookup, sum, count.
Implementation:
In the ETL engine (for operations SQL cannot express).
In the database (for operations SQL can express).
3. Data loading methods:
Timestamp: add a timestamp field to every business table; when the OLAP system updates business data, it updates the timestamp field at the same time.
Log table: add a log table to the OLAP system and maintain its contents whenever business data changes.
Full-table comparison: extract all source data and, before updating the target table, compare rows by primary key and fields; rows that have changed are updated or inserted.
Full-table delete and insert: delete the target table's data and insert all of the source data.
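As an illustration of the full-table comparison loading method, the following sketch compares source rows with a target table by primary key and then field by field. It assumes rows are plain dictionaries and the target table is a dictionary keyed by primary key; the function name `compare_and_load` and the `key` parameter are hypothetical, and a real warehouse load would issue SQL statements instead.

```python
def compare_and_load(source_rows, target, key="id"):
    """Insert new rows and update changed rows; return (inserted, updated)."""
    inserted = 0
    updated = 0
    for row in source_rows:
        primary_key = row[key]
        existing = target.get(primary_key)
        if existing is None:
            target[primary_key] = dict(row)   # primary key not present: insert
            inserted += 1
        elif existing != row:
            target[primary_key] = dict(row)   # some field differs: update
            updated += 1
        # identical rows are left untouched
    return inserted, updated
```

Running the same load twice is a no-op the second time, which is the property that makes this method safe to repeat after an interrupted run.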
Exception handling
Data exceptions are unavoidable during ETL. They are handled as follows:
1. Write the erroneous records out separately and continue the ETL run; load them separately once the bad data has been corrected. Alternatively, interrupt the ETL run and re-run it after the correction. The principle is to accept as much data as possible.
2. For exceptions caused by external factors such as network interruptions, set a maximum number of retries or a maximum retry time; once that limit is exceeded, hand over to manual intervention.
3. For exceptional conditions such as changes to the source data structure or to an interface, synchronize the change first and then load the data.
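The retry rule for externally caused failures (item 2 above) might look like the sketch below. The exception class `ManualInterventionRequired`, the function `run_with_retries`, and its parameters are hypothetical names used for illustration; only the behavior (a bounded number of attempts, then escalation to a person) comes from the text.

```python
import time


class ManualInterventionRequired(Exception):
    """Raised when the retry budget is exhausted and a person must step in."""


def run_with_retries(task, max_attempts=3, delay_seconds=0.0,
                     retry_on=(ConnectionError, TimeoutError)):
    """Call task() until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except retry_on as exc:
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # wait before the next attempt
    raise ManualInterventionRequired(
        f"gave up after {max_attempts} attempts"
    ) from last_error
```

Only the exception types listed in `retry_on` are retried, so data errors still surface immediately instead of being masked by the retry loop.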
Natural language processing is performed on the collected data, including deduplicating, filtering, classifying, and summarizing the data.
The NLP component used in this solution is a toolkit developed to support natural language processing across multiple languages. It also includes the machine learning algorithms and data sets needed for these tasks, and implements data processing steps such as automatic deduplication and filtering of natural language, text classification, and keyword extraction.
This solution deduplicates information in two ways:
URL-based deduplication: performed while Internet information is being fetched.
Content-based deduplication: duplicates are eliminated according to web page content.
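The two duplicate-removal passes, one keyed on the URL and one keyed on the page content, can be sketched as follows. The normalization rules (lowercasing the host, dropping the fragment and trailing slash, collapsing whitespace) and the names `normalize_url`, `content_fingerprint`, and `Deduplicator` are assumptions made for illustration; the source does not say how duplicates are detected.

```python
import hashlib
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))


def content_fingerprint(text):
    """MD5 of the text after collapsing whitespace and lowercasing."""
    collapsed = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(collapsed.encode("utf-8")).hexdigest()


class Deduplicator:
    def __init__(self):
        self.seen_urls = set()
        self.seen_content = set()

    def is_new_url(self, url):
        key = normalize_url(url)
        if key in self.seen_urls:
            return False
        self.seen_urls.add(key)
        return True

    def is_new_content(self, text):
        key = content_fingerprint(text)
        if key in self.seen_content:
            return False
        self.seen_content.add(key)
        return True
```

An exact hash only catches pages that are identical after normalization; near-duplicate detection would need a different technique, which this sketch does not attempt.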
Text classification has two parts. The first is training: different algorithms are trained on the training data to produce different models, and the resulting models are stored in a selected container. The second is classifying text data with the trained models.
Two methods are available for training: linear classification and KNN classification. For both, the text analysis program reads the files ending in .data from the selected container and trains on them (the file name corresponds to the category, for example Sport.data). After the container and training algorithm have been selected, the training task is submitted. When training finishes, a model.gz file is generated in the selected container for use in text classification tests. The text clustering function likewise requires selecting a container that holds the data to cluster and a clustering algorithm. For the Brown word clustering algorithm, the clustering program reads the txt files in the selected container (the format can be changed) and then clusters them.
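To make the KNN option concrete, here is a small self-contained sketch that classifies text by majority vote among the k most similar training samples, using bag-of-words vectors and cosine similarity. It is not the toolkit the document refers to: the class name `KnnTextClassifier`, whitespace tokenization, and in-memory training pairs (rather than `.data` files in a container) are all simplifying assumptions.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts; assumes whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class KnnTextClassifier:
    def __init__(self, k=3):
        self.k = k
        self.samples = []  # list of (term vector, label)

    def train(self, labelled_texts):
        """labelled_texts: iterable of (label, text) pairs."""
        for label, text in labelled_texts:
            self.samples.append((vectorize(text), label))

    def classify(self, text):
        query = vectorize(text)
        ranked = sorted(self.samples,
                        key=lambda sample: cosine(query, sample[0]),
                        reverse=True)
        votes = Counter(label for _, label in ranked[:self.k])
        return votes.most_common(1)[0][0]
```

For Chinese text, a word segmentation step would have to replace the whitespace split before this approach becomes usable.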
Automatic summarization is an important form of information extraction. There are two main approaches: statistics-based and understanding-based. Statistics-based summarization produces a digest by pattern matching against a cue-word dictionary, word frequencies, and the statistical regularities of words and sentences. Understanding-based summarization uses syntactic and semantic knowledge to extract a digest on the basis of understanding the article's content. This solution combines the two, drawing on both statistical and linguistic knowledge to process documents and provide a high-quality summarization system.
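The statistics-based approach can be illustrated with a minimal extractive summarizer that scores each sentence by the average frequency of its words and adds a bonus for cue words. The stopword list, the scoring formula, and the function name `summarize` are assumptions for this sketch; the source does not give the actual formula, and the tokenizer here only handles English.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "is", "are", "was", "in"}


def content_words(text):
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOPWORDS]


def summarize(text, max_sentences=1, cue_words=()):
    """Return the highest-scoring sentences in their original order."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text.strip())
                 if s.strip()]
    frequency = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        if not tokens:
            return 0.0
        average = sum(frequency[w] for w in tokens) / len(tokens)
        bonus = sum(1 for w in tokens if w in cue_words)
        return average + bonus

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]
```

Returning the chosen sentences in document order keeps the digest readable even when the top-ranked sentence appears late in the text.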
A data analysis model system is established, comprising a text analysis model, an information value evaluation model, and a deep learning algorithm tuning model.
After Internet content has been collected by the web crawler and processed by natural language analysis, the result is a series of words, phrases, and sentences, of which only a small number carry value information relevant to the business. Establishing a data analysis model system, comprising the text analysis model, the value evaluation model, a user feedback mechanism, and the deep learning algorithm tuning model, forms a closed loop for value evaluation.
The processed data is analyzed using the data analysis model system, and the system is optimized according to user feedback.
The text analysis model uses natural language processing to segment long text into sentences and words and obtains the keywords the text contains, their category matches, word frequency statistics, and similar information, preparing data for the information value evaluation model in the next step.
Based on the sentence and word segmentation results from natural language processing, keywords are matched at the word or phrase level to apply classification rules and compute the article's information value score, so that different articles are classified and scored for value.
The information value scoring process thus splits the article under analysis into sentences and assigns each sentence a quantitative information value score. The next step is to screen the scored sentences, select the key sentences that meet the user's business needs, form an article digest, and output it to the user as the final result.
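A rough sketch of keyword-based value scoring follows. It assumes each keyword carries a category and a base score, each category has a multiplier, and a sentence's score is the sum of base score times multiplier over its matched keywords; sentences at or above a threshold form the digest. The source names base scores, multipliers, and a decision threshold but does not give the formula, so this combination and the names `score_sentence` and `evaluate_article` are hypothetical.

```python
import re


def score_sentence(sentence, keywords, multipliers):
    """keywords maps word -> (category, base score); multipliers maps category -> factor."""
    total = 0.0
    for token in re.findall(r"\w+", sentence.lower()):
        if token in keywords:
            category, base_score = keywords[token]
            total += base_score * multipliers.get(category, 1.0)
    return total


def evaluate_article(text, keywords, multipliers, threshold):
    """Return (article score, digest sentences scoring at or above threshold)."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text.strip())
                 if s.strip()]
    scored = [(s, score_sentence(s, keywords, multipliers)) for s in sentences]
    article_score = sum(value for _, value in scored)
    digest = [s for s, value in scored if value >= threshold]
    return article_score, digest
```

Separating per-keyword base scores from per-category multipliers means feedback can adjust a whole category at once without touching individual keywords.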
Each time user feedback is received, the deep learning algorithm tuning model is started to iteratively correct the keyword base score system and the value information decision threshold in the value evaluation model. A neural network trains iteratively on each batch of samples, and the multipliers and base scores are trained N times according to the proportion of positive and negative samples to complete the tuning, steadily approaching the expected value.
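The source does not describe the network architecture or the update rule, so the sketch below substitutes a much simpler perceptron-style correction to show the general idea of feedback-driven tuning: whenever a sample labelled by the user is misclassified, the base scores of its keywords and the decision threshold are nudged in the direction that would fix the error. The function name `tune`, the learning rate, and the sample format are all hypothetical, and the sketch ignores the positive-to-negative sample ratio mentioned in the text.

```python
def tune(samples, base_scores, threshold, rounds=10, learning_rate=0.5):
    """samples: list of (keywords in the item, user says it is valuable).

    Returns the corrected (base_scores, threshold). A stand-in for the
    neural-network tuning described in the text, not a reproduction of it.
    """
    scores = dict(base_scores)
    for _ in range(rounds):                 # at most `rounds` training passes
        errors = 0
        for keywords, valuable in samples:
            total = sum(scores.get(k, 0.0) for k in keywords)
            predicted = total >= threshold
            if predicted == valuable:
                continue
            errors += 1
            direction = 1.0 if valuable else -1.0
            for k in keywords:
                scores[k] = scores.get(k, 0.0) + learning_rate * direction
            threshold -= learning_rate * direction
        if errors == 0:
            break                           # every sample matches the feedback
    return scores, threshold
```

With linearly separable feedback this loop stops once every sample is scored on the correct side of the threshold; otherwise it stops after the configured number of rounds.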
Referring to Fig. 2, which shows the method flow of a preferred embodiment of the invention: in this embodiment the data source is a set of target websites. Data is collected from the target websites with Internet crawler technology, and the collected data is processed using big data ETL data warehouse technology and natural language processing. The processed data is then analyzed: an analysis model and an information value evaluation model are first established, and the data analyzed by the models is pushed to the user. Based on user feedback, a second round of optimization is performed using deep learning algorithms, and the optimized information is pushed to the user again; the keyword base score system and the value information decision threshold in the value evaluation model are iteratively corrected. A neural network trains iteratively on each batch of samples, and the multipliers and base scores are trained N times according to the proportion of positive and negative samples to complete the tuning, steadily approaching the expected value.
Referring to Fig. 3, which shows the flow of a preferred embodiment of the invention, the Internet crawler proceeds as follows:
Create a URL list file.
Add entry URL addresses according to the URL list file and generate a to-download list.
Extract the to-download list to generate download tasks.
Download the specific page content from the World Wide Web according to the download tasks.
Parse and extract the page content and judge whether it contains URLs; if so, return to the second step; otherwise, build an index for retrieval from the page content, which completes data collection.
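The crawl loop in the steps above can be sketched in a few lines. This is a single-process illustration, not Nutch: the `fetch` callable is injected so the example runs without network access, links are found with a simple `href` regular expression, and the "index" is just a dictionary from URL to page content. All of these are assumptions made for the sketch.

```python
import re
from collections import deque

LINK_PATTERN = re.compile(r'href="([^"]+)"')


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    to_download = deque(seed_urls)        # to-download list built from the seeds
    seen = set(seed_urls)
    index = {}                            # URL -> page content
    while to_download and len(index) < max_pages:
        url = to_download.popleft()       # take the next download task
        html = fetch(url)                 # download the page content
        if html is None:
            continue                      # unreachable page: skip it
        index[url] = html
        for link in LINK_PATTERN.findall(html):
            if link not in seen:          # newly discovered URL: queue it
                seen.add(link)
                to_download.append(link)
    return index
```

The `seen` set is what keeps the loop from revisiting pages that link to each other, and `max_pages` bounds the crawl when the link graph is large.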
Referring to Fig. 4, which shows a first preferred embodiment of the invention, an application scenario for data collection:
(1) Implementation of the web crawler
This solution targets a large number of websites whose content structures are complex and varied, so a distributed crawler is needed to collect the data. It uses the open source Nutch distributed crawler, relies on Hadoop MapReduce to collect from websites at scale, and adds its own protocol handling and data processing through Nutch's plugin mechanism.
(2) Automatic web page content extraction
Web pages usually contain advertisements, copyright notices, scripts, and similar content. Intelligent web page content extraction efficiently extracts the useful information in a page, distinguishes items such as the title and body text, automatically merges multiple pages whose content is continuous, and automatically extracts web forum information.
(3) Web page body extraction
Automatic body extraction analyzes a disorderly page structure and extracts the body text from it.
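A very small version of title and body extraction can be built on the standard library HTML parser, as sketched below. It separates the title from the visible text and skips script and style content; it does not attempt the advertisement removal, page merging, or forum handling mentioned above. The class name `ContentExtractor` and the function `extract` are hypothetical.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self._skip_depth = 0      # greater than zero while inside a skipped tag
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        target = self.title_parts if self._in_title else self.body_parts
        target.append(text)


def extract(html):
    """Return (title, body text) for an HTML document."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.body_parts)
```

Real pages usually need further heuristics, such as text density per block, to separate the main body from navigation and boilerplate.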
(4) Multi-character-set encoding conversion
The character set of a website depends on factors such as the systems its staff use and the design decisions made when the site was built. Common character sets include GB2312 and UTF-8. Unicode is a standard for character display and encoding conversion that covers the languages of the Americas, Europe, the Middle East, Africa, India, and the Asia-Pacific region. Some countries' websites use Unicode encoding to support global access, and such encodings can be recognized and converted automatically.
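One simple way to normalize fetched bytes to Unicode is to try a declared encoding first and then fall back through a list of candidates, as in the sketch below. The candidate order and the function name `to_unicode` are assumptions; the source does not say how encodings are detected.

```python
def to_unicode(raw, declared=None, candidates=("utf-8", "gb2312", "gbk")):
    """Decode raw bytes to str, trying the declared encoding first."""
    encodings = ([declared] if declared else []) + list(candidates)
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding: try the next candidate
    # Last resort: keep what can be decoded and mark the rest.
    return raw.decode("utf-8", errors="replace")
```

Trial decoding is a heuristic: some byte sequences are valid in more than one encoding, so a charset declared in the HTTP header or in the page itself should be preferred when it is available.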
It should be understood that embodiments of the invention can be implemented by computer hardware, by a combination of hardware and software, or by computer instructions stored in non-transitory computer-readable memory. The method can be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and drawings described in the specific embodiments. Each program can be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. If desired, however, the program can be implemented in assembly or machine language. In any case, the language can be a compiled or interpreted language. In addition, the program can run on an application-specific integrated circuit programmed for this purpose.
In addition, the operations of the processes described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described herein (or variations and/or combinations thereof) can be performed under the control of one or more computer systems configured with executable instructions, and can be implemented as code (for example, executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or by a combination thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method can be implemented on any suitable type of computing platform operably connected to it, including but not limited to a personal computer, minicomputer, mainframe, workstation, networked or distributed computing environment, a standalone or integrated computer platform, or a platform in communication with a charged particle tool or other imaging device. Aspects of the invention can be implemented as machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into the computing platform, such as a hard disk, optically readable and/or writable storage medium, RAM, or ROM, so that it can be read by a programmable computer; when the storage medium or device is read by the computer, it can be used to configure and operate the computer to perform the processes described herein. In addition, the machine-readable code, or portions of it, can be transmitted over a wired or wireless network. The invention described herein includes these and other types of non-transitory computer-readable storage media when such media contain instructions or programs that, in conjunction with a microprocessor or other data processor, implement the steps described above. The invention also includes the computer itself when programmed according to the methods and techniques described herein.
A computer program can be applied to input data to perform the functions described herein, thereby transforming the input data to generate output data that is stored in non-volatile memory. The output information can also be applied to one or more output devices such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including a particular visual depiction of the physical and tangible objects produced on a display.
The above are only preferred embodiments of the invention. The invention is not limited to the embodiments above; any implementation that achieves the technical effect of the invention by the same means falls within its scope of protection. Within that scope, the technical solution and/or embodiments can be modified and varied in many ways.

Claims (10)

1. A data filtering and mining method, characterized by comprising the following steps:
S100, acquiring data from target websites using Nutch distributed crawler technology;
S200, performing big data processing and natural language processing on the acquired data, wherein the big data processing comprises data extraction, verification, loading, storage and computation, and the natural language processing comprises deduplicating, filtering, classifying and summarizing the data;
S300, establishing a data analysis model system, the data analysis model system comprising a text analysis model, an information value evaluation model and a deep learning algorithm tuning model;
S400, analyzing the processed data using the data analysis model system, and optimizing according to user feedback.
2. The data filtering and mining method according to claim 1, characterized in that S100 comprises:
S101, creating a uniform resource locator (URL) list file;
S102, adding entry URLs according to the URL list file to generate a to-download list;
S103, extracting entries from the to-download list to generate download tasks;
S104, downloading the specified page content from the World Wide Web according to the download tasks;
S105, parsing and extracting the page content and determining whether it contains URLs; if so, returning to step S102; otherwise, building a search index from the page content, completing data acquisition.
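As an illustrative, non-limiting sketch, the loop of steps S101 to S105 might look roughly as follows. This is not Nutch code: the in-memory `FAKE_WEB` dictionary, the `crawl` function and the simple inverted index are hypothetical stand-ins chosen so the example runs without network access. The index is built once no undiscovered URLs remain, which is one possible reading of the "otherwise" branch in S105.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> page content (stands in for real downloads).
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a> plain text on page A',
    "http://example.com/b": "plain text on page B, no links",
}

HREF_RE = re.compile(r'href="([^"]+)"')
TAG_RE = re.compile(r"<[^>]+>")


def crawl(seed_urls, fetch):
    """Follow S101-S105: seed list -> to-download list -> tasks -> download -> parse."""
    to_download = deque(seed_urls)   # S101/S102: URL list and entry URLs
    seen = set(seed_urls)
    pages = {}
    while to_download:
        url = to_download.popleft()  # S103: take the next download task
        content = fetch(url)         # S104: download the page content
        if content is None:
            continue
        pages[url] = content
        # S105: URLs found in the page go back into the to-download list (S102).
        for link in HREF_RE.findall(content):
            if link not in seen:
                seen.add(link)
                to_download.append(link)
    # S105, "otherwise" branch: no new URLs remain, so build the search index.
    index = {}
    for url, content in pages.items():
        for word in re.findall(r"\w+", TAG_RE.sub(" ", content).lower()):
            index.setdefault(word, set()).add(url)
    return pages, index


pages, index = crawl(["http://example.com/"], FAKE_WEB.get)
```

Passing the fetch function as a parameter keeps the loop independent of how pages are actually downloaded; a real deployment would substitute the crawler's own fetcher.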
3. The data filtering and mining method according to claim 2, characterized in that, in step S105, extracting the page content comprises:
automatic web page content extraction, which distinguishes the title and body text of a web page, automatically merges the content of multiple web pages whose content is continuous, and automatically extracts web forum information;
web page body text extraction, which automatically analyzes web pages with disordered structure and extracts the body text;
multi-character-set encoding conversion, which uses Unicode encoding to recognize and convert character sets including but not limited to GB2312 and UTF-8.
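A minimal sketch of the character-set conversion step, assuming the legacy encoding in question is GB2312 (Python codec name `gb2312`) and that trying candidate encodings in order is an acceptable detection strategy. The function name `to_unicode` and the candidate order are illustrative assumptions, not part of the claimed method.

```python
from typing import Sequence, Tuple


def to_unicode(raw: bytes, candidates: Sequence[str] = ("utf-8", "gb2312")) -> Tuple[str, str]:
    """Decode raw page bytes to a Unicode string, trying each candidate encoding in turn.

    Returns the decoded text and the name of the encoding that succeeded.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: decode lossily so downstream steps still receive text.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


text_a, enc_a = to_unicode("数据过滤".encode("utf-8"))
text_b, enc_b = to_unicode("数据过滤".encode("gb2312"))
```

UTF-8 is tried first because its byte patterns are strict, so GB2312 bytes rarely decode as valid UTF-8 by accident; the reverse order would be far more error-prone.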
4. The data filtering and mining method according to claim 1, characterized in that deduplicating the data comprises deduplication based on uniform resource locators and deduplication based on web page content.
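The two kinds of deduplication could be sketched as below. The URL normalization rules and the use of a SHA-256 hash over whitespace-normalized, lowercased text are assumptions made for illustration; the claim does not specify how either comparison is performed.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def content_fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing."""
    canonical = " ".join(text.split()).lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record for each distinct URL and each distinct content hash."""
    seen_urls, seen_content, kept = set(), set(), []
    for url, text in records:
        url_key = normalize_url(url)
        content_key = content_fingerprint(text)
        if url_key in seen_urls or content_key in seen_content:
            continue
        seen_urls.add(url_key)
        seen_content.add(content_key)
        kept.append((url, text))
    return kept


records = [
    ("http://Example.com/a/", "Hello  world"),
    ("http://example.com/a#top", "different text"),  # same URL after normalization
    ("http://example.com/b", "hello world"),         # same content after normalization
    ("http://example.com/c", "new text"),
]
unique = deduplicate(records)
```

An exact hash only catches identical text; near-duplicate detection would need a different fingerprint and is outside this sketch.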
5. The data filtering and mining method according to claim 1, characterized in that the text classification comprises:
training on the data using specified algorithms to obtain different trained models, which are placed in a selected container;
classifying the text data according to the trained models.
6. The data filtering and mining method according to claim 5, characterized in that the specified algorithms comprise a linear classification algorithm, a nearest-neighbor classification algorithm and a Brown word clustering algorithm.
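One way to picture "training several models and placing them in a container" is the sketch below. It uses only the Python standard library: a cosine nearest-neighbor classifier and a nearest-centroid classifier (standing in here for the linear classifier) over bag-of-words counts. Brown word clustering is omitted. The sample texts, labels and the `TRAINERS` and `models` dictionaries are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_nearest_neighbor(samples):
    """Return a predictor that copies the label of the most similar training text."""
    vectors = [(vectorize(text), label) for text, label in samples]

    def predict(text):
        query = vectorize(text)
        return max(vectors, key=lambda pair: cosine(query, pair[0]))[1]

    return predict


def train_centroid(samples):
    """Return a predictor that picks the label whose summed term vector is most similar."""
    centroids = {}
    for text, label in samples:
        centroids.setdefault(label, Counter()).update(vectorize(text))

    def predict(text):
        query = vectorize(text)
        return max(centroids, key=lambda label: cosine(query, centroids[label]))

    return predict


TRAINERS = {"nearest_neighbor": train_nearest_neighbor, "centroid": train_centroid}

samples = [
    ("company seeks supplier for bulk steel purchase", "lead"),
    ("tender notice for office equipment purchase", "lead"),
    ("local team wins football match", "other"),
    ("weather forecast rain tomorrow", "other"),
]
# The "container" of trained models, keyed by algorithm name.
models = {name: train(samples) for name, train in TRAINERS.items()}
```

Keeping the trained models in a dictionary keyed by algorithm name lets the caller choose, or vote across, classifiers at prediction time.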
7. The data filtering and mining method according to claim 1, characterized in that summarizing the data comprises:
statistics-based information extraction, in which the abstract is extracted by pattern matching according to a cue-word dictionary and the statistical regularities of word frequency, words and sentences;
understanding-based information extraction, in which the abstract is extracted on the basis of understanding the content of the article, using knowledge such as syntactic and semantic knowledge.
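The statistics-based branch could be approximated as follows: each sentence is scored by the average corpus frequency of its words, plus a bonus for each cue word it contains, and the top-scoring sentences form the abstract. The cue-word set, the scoring formula and the function name `summarize` are illustrative assumptions; the understanding-based branch is not sketched.

```python
import re
from collections import Counter

# Hypothetical cue-word dictionary.
CUE_WORDS = {"conclusion", "summary", "result"}


def summarize(text, max_sentences=1):
    """Pick the highest-scoring sentences and return them in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    freq = Counter(word for words in tokenized for word in words)

    def score(words):
        if not words:
            return 0.0
        average_freq = sum(freq[w] for w in words) / len(words)
        cue_bonus = sum(1 for w in words if w in CUE_WORDS)
        return average_freq + cue_bonus

    ranked = sorted(range(len(sentences)), key=lambda i: score(tokenized[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


article = "The crawler collects pages. In conclusion the crawler filters pages well. Cats sleep."
abstract = summarize(article)
```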
8. The data filtering and mining method according to claim 1, characterized in that the text analysis model, based on natural language processing techniques, segments long text content into sentences and words and obtains the keywords contained in the text, the category matches of those keywords, and word frequency statistics, thereby preparing data for the information value evaluation model.
9. The data filtering and mining method according to claim 1, characterized in that the information value evaluation model takes the data processed by the text analysis model and, based on the sentence and word segmentation results of natural language processing and using words and phrases as units, computes classification rules and article information value scores through keyword matching, classifying and value-scoring the different articles;
according to key sentences that meet the customer's business requirements, the scored sentences are screened and identified to obtain an article information summary, which is sent to the user.
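A rough sketch of keyword-based value scoring follows. The keyword table, its categories and base scores, and the sentence threshold are all hypothetical values chosen for illustration; the claim does not disclose the actual scoring rules.

```python
import re

# Hypothetical keyword table: keyword -> (category, base score).
KEYWORDS = {
    "tender": ("procurement", 3.0),
    "purchase": ("procurement", 2.0),
    "supplier": ("procurement", 2.0),
    "expansion": ("investment", 2.5),
}


def score_sentence(sentence):
    """Return the summed base score and the set of categories matched in one sentence."""
    words = re.findall(r"\w+", sentence.lower())
    hits = [KEYWORDS[w] for w in words if w in KEYWORDS]
    return sum(score for _, score in hits), {category for category, _ in hits}


def evaluate_article(text, sentence_threshold=3.0):
    """Score an article and collect the sentences that reach the threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    total, categories, key_sentences = 0.0, set(), []
    for sentence in sentences:
        score, matched = score_sentence(sentence)
        total += score
        categories |= matched
        if score >= sentence_threshold:
            key_sentences.append(sentence)
    return {"score": total, "categories": categories, "summary": " ".join(key_sentences)}


report = evaluate_article(
    "The city issued a tender to purchase buses. The weather was mild. A supplier visited."
)
```

Here the article summary is simply the concatenation of sentences whose score reaches the threshold, which is one plausible reading of "screening the scored sentences".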
10. The data filtering and mining method according to claim 1, characterized in that the deep learning algorithm tuning model is used to iteratively correct, according to received user feedback, the keyword base-score evaluation system and the valuable-information decision threshold in the value evaluation model;
a neural network iteratively trains on each batch of samples, training the multiplier and base score for a predetermined number of iterations according to the proportion of positive and negative samples to complete the tuning and approach the expected value, wherein the predetermined number of iterations is configurable.
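The feedback-driven correction can be pictured with a deliberately simple stand-in: a single-neuron, perceptron-style update in which keyword base scores act as weights and the decision threshold acts as a negated bias. This is an assumption made for illustration only; the claim refers to a neural network and a multiplier without specifying either, and the function name `tune`, the learning rate and the feedback format are hypothetical.

```python
def tune(base_scores, threshold, feedback, iterations=20, learning_rate=0.1):
    """Adjust keyword base scores and the decision threshold from user feedback.

    feedback: list of (keywords_in_article, user_marked_valuable) pairs.
    iterations: the predetermined, configurable number of training passes.
    """
    scores = dict(base_scores)  # work on a copy; the caller's table is untouched
    for _ in range(iterations):
        for keywords, is_valuable in feedback:
            total = sum(scores.get(k, 0.0) for k in keywords)
            predicted_valuable = total >= threshold
            if predicted_valuable == is_valuable:
                continue
            # Misclassified: push scores up (and threshold down) for missed
            # valuable articles, and the reverse for false alarms.
            direction = 1.0 if is_valuable else -1.0
            for k in keywords:
                if k in scores:
                    scores[k] += learning_rate * direction
            threshold -= learning_rate * direction
    return scores, threshold


initial_scores = {"tender": 1.0, "weather": 1.0}
feedback = [(["tender"], True), (["weather"], False)]
tuned_scores, tuned_threshold = tune(initial_scores, 1.0, feedback)
```

After tuning, articles the user marked as valuable score at or above the threshold and the others fall below it, which is the sense in which the iteration approaches the expected value in this toy setting.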
CN201811532016.1A 2018-12-14 2018-12-14 A kind of data filtering method for digging Pending CN109783619A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811532016.1A CN109783619A (en) 2018-12-14 2018-12-14 A kind of data filtering method for digging

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811532016.1A CN109783619A (en) 2018-12-14 2018-12-14 A kind of data filtering method for digging

Publications (1)

Publication Number Publication Date
CN109783619A true CN109783619A (en) 2019-05-21

Family

ID=66496954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811532016.1A Pending CN109783619A (en) 2018-12-14 2018-12-14 A kind of data filtering method for digging

Country Status (1)

Country Link
CN (1) CN109783619A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104951539A (en) * 2015-06-19 2015-09-30 成都艾尔普科技有限责任公司 Internet data center harmful information monitoring system
US20170228439A1 (en) * 2016-02-09 2017-08-10 Ca, Inc. Automatic natural language processing based data extraction
CN107766889A (en) * 2017-10-26 2018-03-06 济南浪潮高新科技投资发展有限公司 A kind of the deep learning computing system and method for the fusion of high in the clouds edge calculations
CN108683724A (en) * 2018-05-11 2018-10-19 江苏舜天全圣特科技有限公司 A kind of intelligence children's safety and gait health monitoring system
CN108921739A (en) * 2018-08-06 2018-11-30 四川工商学院 A kind of legislation intellectualized analysis platform based on big data

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765106A (en) * 2019-10-23 2020-02-07 深圳报业集团 Data information processing method and system based on visual features
CN111324797A (en) * 2020-02-20 2020-06-23 民生科技有限责任公司 Method and device for acquiring data accurately at high speed
CN111324797B (en) * 2020-02-20 2023-08-11 民生科技有限责任公司 Method and device for precisely acquiring data at high speed
CN111723082A (en) * 2020-05-25 2020-09-29 贵州华泰智远大数据服务有限公司 Data quality monitoring system based on traceability analysis technology
CN112035549A (en) * 2020-08-31 2020-12-04 中国平安人寿保险股份有限公司 Data mining method and device, computer equipment and storage medium
CN112035549B (en) * 2020-08-31 2023-12-08 中国平安人寿保险股份有限公司 Data mining method, device, computer equipment and storage medium
CN112148956A (en) * 2020-09-30 2020-12-29 上海交通大学 Hidden net threat information mining system and method based on machine learning
CN113556344A (en) * 2021-07-21 2021-10-26 广州科腾信息技术有限公司 General index monitoring billboard based on organizational performance scene

Similar Documents

Publication Publication Date Title
CN109783619A (en) A kind of data filtering method for digging
Demartini et al. Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking
CN112612902A (en) Knowledge graph construction method and device for power grid main device
CN105468744B (en) Big data platform for realizing tax public opinion analysis and full text retrieval
CN103902703B (en) Based on the content of text sorting technique of mobile Internet access
CN105740227B (en) A kind of genetic simulated annealing method of neologisms in solution Chinese word segmentation
CN105160038A (en) Data analysis method and system based on audit database
Zhang et al. Big data versus the crowd: Looking for relationships in all the right places
EP3671526A1 (en) Dependency graph based natural language processing
CN107291895B (en) Quick hierarchical document query method
CN113254630B (en) Domain knowledge map recommendation method for global comprehensive observation results
Huang et al. Learning human-written commit messages to document code changes
CN107194617A (en) A kind of app software engineers soft skill categorizing system and method
CN106227788A (en) Database query method based on Lucene
CN107330007A (en) A kind of Method for Ontology Learning based on multi-data source
CN112434024A (en) Relational database-oriented data dictionary generation method, device, equipment and medium
CN109471934B (en) Financial risk clue mining method based on Internet
CN105095400B (en) The lookup method of personal homepage
Alsarkhi et al. An analysis of the effect of stop words on the performance of the matrix comparator for entity resolution
CN108228787A (en) According to the method and apparatus of multistage classification processing information
Pakdeetrakulwong et al. An Ontology-based Knowledge Management for Organic Agriculture and Good Agricultural Practices: A Case Study of Nakhon Pathom Province, Thailand
Castro et al. Ontology applied in the judicial sentences
Basharat et al. Crowdlink: Crowdsourcing for large-scale linked data management
CN113127650A (en) Technical map construction method and system based on map database
Schuh et al. Identification of requirements for focused crawlers in technology intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190521