CN107103094A - Method and system for capturing inter-enterprise data association relationships based on massive data - Google Patents

Method and system for capturing inter-enterprise data association relationships based on massive data

Info

Publication number
CN107103094A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710353297.3A
Other languages
Chinese (zh)
Inventor
李小强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qianhai Sycamore (shenzhen) Data Co Ltd
Original Assignee
Qianhai Sycamore (shenzhen) Data Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to a method and system for capturing inter-enterprise data association relationships based on massive data. The method includes: obtaining massive enterprise-related data; accumulating the massive enterprise-related data to form basic data; processing the obtained massive enterprise-related data to form processed data; obtaining a training set database according to the processed data and the basic data; and processing new data using the training set database to obtain inter-enterprise data association relationships. The invention performs data processing with big data technology, which ensures secure storage and efficient distributed processing of massive data. Driven by big data technology and built on a distributed parallel computing framework, it solves the problems of storing and computing on massive data. Using the theory of machine learning and natural language processing, it lets machines intelligently process enterprise-related information, which improves capture accuracy and automatically associates and classifies, from massive data, the data that is useful to an enterprise, with high recognition efficiency.

Description

Method and system for capturing inter-enterprise data association relationships based on massive data
Technical field
The present invention relates to data processing, and more specifically to a method and system for capturing inter-enterprise data association relationships based on massive data.
Background art
With the development of science and technology, more and more enterprises publish data on the internet to promote themselves or to find investment targets. As a result, the volume of enterprise data on the internet keeps growing, and enterprise databases on the internet are becoming ever larger.
When promoting an enterprise or looking for investment targets, the data association relationships between enterprises must be found in the massive data on the internet and used as a locating condition to pinpoint the enterprise being sought. At present, however, these relationships can only be found through manual screening and analysis. This makes comprehensive analysis and holistic profiling of an enterprise difficult, and manual screening and analysis is both inefficient and inaccurate.
Chinese patent 201510810811.2 provides an algorithm, based on a relational database, for retrieving identical master-detail relationship data under big data. It is an algorithm for comparing data within massive data: following the principle of "turning large into small, surface first and then point", it uses group traversal, intermediate-table storage and similar techniques to narrow the comparison range step by step and efficiently retrieve identical records. That invention targets massive master-detail structured data in enterprise data. Its method of quickly retrieving identical records suits the various situations in enterprise management where identical master-detail data must be retrieved, strengthens an enterprise's management and control capability, builds a better market environment for the enterprise, and improves its competitiveness.
The above patent uses a method of quickly retrieving identical records. This approach can only find similar records, and its accuracy is not high.
It is therefore necessary to design a method for capturing inter-enterprise data association relationships based on massive data that improves capture accuracy and automatically associates and classifies, from massive data, the data that is useful to an enterprise, with high efficiency.
Summary of the invention
The object of the present invention is to overcome the defects of the prior art and to provide a method and system for capturing inter-enterprise data association relationships based on massive data.
To achieve the above object, the present invention adopts the following technical solution: a method for capturing inter-enterprise data association relationships based on massive data, the method comprising:
Obtaining massive enterprise-related data;
Accumulating the massive enterprise-related data to form basic data;
Processing the obtained massive enterprise-related data to form processed data;
Obtaining a training set database according to the processed data and the basic data;
Processing new data using the training set database to obtain inter-enterprise data association relationships.
In a further technical solution, the step of accumulating the massive enterprise-related data to form basic data includes the following specific steps:
Updating the massive enterprise-related data periodically;
Mining and classifying the massive enterprise-related data to establish a basic database;
Storing the massive enterprise-related data in the basic database;
Obtaining the basic data in the basic database.
In a further technical solution, the step of processing the obtained massive enterprise-related data to form processed data includes the following specific steps:
Cleaning and categorizing the obtained massive enterprise-related data, and extracting summaries and keywords from it;
Building an index on the summaries and keywords;
Classifying the information, summaries and keywords to obtain classification results;
Performing real-time matching and statistics on the classification results to form processed data.
In a further technical solution, the step of obtaining a training set database according to the processed data and the basic data includes the following specific steps:
Making a training set according to the processed data and the basic data;
Sampling, checking and adjusting the processed data;
Storing the adjusted processed data in the training set;
Training on the training set;
Improving the training by means of weights to form the training set database.
In a further technical solution, the step of processing new data using the training set database to obtain inter-enterprise data association relationships includes the following specific steps:
Training the training set database with the training set data to obtain a usage model;
Classifying and predicting new data with the usage model to obtain inter-enterprise data association relationships.
The present invention also provides a system for capturing inter-enterprise data association relationships based on massive data, including an acquiring unit, a basic data forming unit, a processed data forming unit, a database acquiring unit and a relationship acquiring unit;
The acquiring unit is configured to obtain massive enterprise-related data;
The basic data forming unit is configured to accumulate the massive enterprise-related data to form basic data;
The processed data forming unit is configured to process the obtained massive enterprise-related data to form processed data;
The database acquiring unit is configured to obtain a training set database according to the processed data and the basic data;
The relationship acquiring unit is configured to process new data using the training set database to obtain inter-enterprise data association relationships.
In a further technical solution, the basic data forming unit includes an update module, a database establishing module, a storage module and a basic data acquiring module;
The update module is configured to update the massive enterprise-related data periodically;
The database establishing module is configured to mine and classify the massive enterprise-related data to establish a basic database;
The storage module is configured to store the massive enterprise-related data in the basic database;
The basic data acquiring module is configured to obtain the basic data in the basic database.
In a further technical solution, the processed data forming unit includes a processing module, an index establishing module, a classification module and a matching and statistics module;
The processing module is configured to clean and categorize the obtained massive enterprise-related data and to extract summaries and keywords from it;
The index establishing module is configured to build an index on the summaries and keywords;
The classification module is configured to classify the information, summaries and keywords to obtain classification results;
The matching and statistics module is configured to perform real-time matching and statistics on the classification results to form processed data.
In a further technical solution, the database acquiring unit includes a training set forming module, an adjusting module, a processed data storage module, a training module and an improvement module;
The training set forming module is configured to make a training set according to the processed data and the basic data;
The adjusting module is configured to sample, check and adjust the processed data;
The processed data storage module is configured to store the adjusted processed data in the training set;
The training module is configured to train on the training set;
The improvement module is configured to improve the training by means of weights to form the training set database.
In a further technical solution, the relationship acquiring unit includes a model acquiring module and a classification and prediction module;
The model acquiring module is configured to train the training set database with the training set data to obtain a usage model;
The classification and prediction module is configured to classify and predict new data with the usage model to obtain inter-enterprise data association relationships.
Compared with the prior art, the present invention has the following advantages. The method for capturing inter-enterprise data association relationships based on massive data collects massive enterprise-related data at low acquisition cost and processes it with big data technology, which ensures secure storage and efficient distributed processing of massive data, with accuracy improving continuously as data accumulates. Driven by big data technology and built on a distributed parallel computing framework, it solves the problems of storing and computing on massive data. Using the theory of machine learning and natural language processing, it lets machines intelligently process enterprise-related information by summarizing, categorizing and extracting, which improves capture accuracy and automatically associates and classifies, from massive data, the data that is useful to an enterprise, with high recognition efficiency.
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Brief description of the drawings
Fig. 1 is a flowchart of the method for capturing inter-enterprise data association relationships based on massive data provided by a specific embodiment of the present invention;
Fig. 2 is a detailed flowchart of forming basic data provided by a specific embodiment of the present invention;
Fig. 3 is a detailed flowchart of forming processed data provided by a specific embodiment of the present invention;
Fig. 4 is a detailed flowchart of obtaining the training set database provided by a specific embodiment of the present invention;
Fig. 5 is a detailed flowchart of obtaining inter-enterprise data association relationships provided by a specific embodiment of the present invention;
Fig. 6 is a structural block diagram of the system for capturing inter-enterprise data association relationships based on massive data provided by a specific embodiment of the present invention;
Fig. 7 is a structural block diagram of the basic data forming unit provided by a specific embodiment of the present invention;
Fig. 8 is a structural block diagram of the processed data forming unit provided by a specific embodiment of the present invention;
Fig. 9 is a structural block diagram of the database acquiring unit provided by a specific embodiment of the present invention;
Fig. 10 is a structural block diagram of the relationship acquiring unit provided by a specific embodiment of the present invention.
Detailed description of the embodiments
For a fuller understanding of the technical content of the present invention, the technical solution is further introduced and explained below with reference to specific embodiments, but the invention is not limited thereto.
In the specific embodiment shown in Figs. 1 to 10, the method for capturing inter-enterprise data association relationships based on massive data provided by this embodiment can be used when promoting an enterprise or looking for investment targets. It improves capture accuracy and automatically associates and classifies, from massive data, the data that is useful to an enterprise, with high efficiency.
As shown in Fig. 1, the method for capturing inter-enterprise data association relationships based on massive data provided by this embodiment includes:
S1, obtaining massive enterprise-related data;
S2, accumulating the massive enterprise-related data to form basic data;
S3, processing the obtained massive enterprise-related data to form processed data;
S4, obtaining a training set database according to the processed data and the basic data;
S5, processing new data using the training set database to obtain inter-enterprise data association relationships.
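The five steps above can be read as a simple data pipeline. The following is only a minimal sketch under stated assumptions, not the patented implementation: every function name is hypothetical, keyword extraction is reduced to lowercase tokenization, and step S5 is simplified to a single nearest neighbour by Jaccard similarity instead of the KNN classifier the description names later.

```python
def acquire(sources):
    """S1: gather raw enterprise-related text records from every source."""
    return [record for source in sources for record in source]


def accumulate(base_data, records):
    """S2: append records not seen before, so the basic data grows over time."""
    seen = set(base_data)
    for record in records:
        if record not in seen:
            base_data.append(record)
            seen.add(record)
    return base_data


def process(records):
    """S3: clean each record and extract keywords (placeholder: lowercase tokens)."""
    processed = []
    for record in records:
        text = " ".join(record.split())  # collapse stray whitespace
        processed.append({"text": text, "keywords": set(text.lower().split())})
    return processed


def build_training_set(processed, labels):
    """S4: pair processed records with manually supplied labels."""
    return [(item["keywords"], labels[item["text"]])
            for item in processed if item["text"] in labels]


def classify(training_set, keywords):
    """S5 (simplified): return the label of the most similar training record."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    _, best_label = max(training_set, key=lambda pair: jaccard(pair[0], keywords))
    return best_label


# Hypothetical sample records; the second source repeats one record on purpose.
sources = [["Acme  Robotics builds warehouse robots", "Beta Foods sells organic snacks"],
           ["Acme  Robotics builds warehouse robots"]]
base = accumulate([], acquire(sources))
training = build_training_set(process(base), {
    "Acme Robotics builds warehouse robots": "robotics",
    "Beta Foods sells organic snacks": "food",
})
new_item = process(["Gamma Robots ships warehouse automation"])[0]
print(len(base), classify(training, new_item["keywords"]))
```

In this sketch the duplicate record is dropped during accumulation, and the new record is associated with the training record with which it shares the most keywords.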
In step S1, massive enterprise-related data is obtained specifically by using data crawling technology to collect and crawl enterprise-related data from the internet every day.
Further, step S2, accumulating the massive enterprise-related data to form basic data, includes the following specific steps:
S21, updating the massive enterprise-related data periodically;
S22, mining and classifying the massive enterprise-related data to establish a basic database;
S23, storing the massive enterprise-related data in the basic database;
S24, obtaining the basic data in the basic database.
In step S21, the massive enterprise-related data is updated periodically, which serves to accumulate data.
In step S22, machine learning techniques are used to mine and classify the massive enterprise-related data on the internet, and the basic database is established from the result.
In step S23, the massive enterprise-related data is stored in a distributed manner using the big data technology HDFS.
In step S24, the basic data in the basic database is the data obtained after the massive enterprise-related data has been accumulated and processed.
Further, step S3, processing the obtained massive enterprise-related data to form processed data, includes the following specific steps:
S31, cleaning and categorizing the obtained massive enterprise-related data, and extracting summaries and keywords from it;
S32, building an index on the summaries and keywords;
S33, classifying the information, summaries and keywords to obtain classification results;
S34, performing real-time matching and statistics on the classification results to form processed data.
In step S31, the collected massive enterprise-related data is cleaned and categorized, and summaries and keywords are extracted from it, based on the theory and techniques of natural language processing.
In step S32, an index is built on the summaries and keywords produced by the natural language processing.
In step S33, the above information, summaries and keywords are classified using the K-nearest-neighbor (k-Nearest Neighbor, KNN) classification algorithm to obtain classification results.
In step S34, real-time matching and statistics are performed on the classification results using the big data technology Spark, and the processed data is formed from the result.
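The index built in step S32 is not specified further in the description. One common choice is an inverted index that maps each keyword to the records containing it. The sketch below assumes that structure; the function names `build_index` and `search` and the sample keywords are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each keyword to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, keywords in documents.items():
        for keyword in keywords:
            index[keyword.lower()].add(doc_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}


def search(index, *keywords):
    """Return the ids of documents that contain all of the given keywords."""
    postings = [set(index.get(keyword.lower(), ())) for keyword in keywords]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical keyword lists extracted from three enterprise records.
docs = {
    1: ["logistics", "Shenzhen", "investment"],
    2: ["logistics", "robotics"],
    3: ["investment", "robotics"],
}
index = build_index(docs)
print(search(index, "logistics", "investment"))
```

Looking up several keywords at once intersects their posting lists, which is one way the indexed summaries and keywords could support the real-time matching of step S34.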
Steps S1 to S3 all rely on mature big data technology to process the massive enterprise-related data obtained from the internet. This ensures secure storage and efficient distributed processing of massive data, and accuracy improves continuously as data accumulates. Driven by big data technology and built on a distributed parallel computing framework, the method solves the problems of storing and computing on massive data. Using the theory of machine learning and natural language processing, it lets machines intelligently process enterprise-related information by summarizing, categorizing and extracting. Because collection and processing are based on publicly available internet information, no sensitive information is involved and the data acquisition cost is relatively low.
Further, step S4, obtaining a training set database according to the processed data and the basic data, includes the following specific steps:
S41, making a training set according to the processed data and the basic data;
S42, sampling, checking and adjusting the processed data;
S43, storing the adjusted processed data in the training set;
S44, training on the training set;
S45, improving the training by means of weights to form the training set database.
In step S41, the basic data in the basic database is integrated and matched with the processed data obtained after classification, matching and statistics, and the result serves as the training set. This makes the association relationships between the basic data and the processed data explicit and facilitates capturing the association relationships of new data.
Step S42 is mainly intended to improve the accuracy of the association relationships between the basic data and the processed data. The processed data is therefore sampled, checked and adjusted manually to ensure its accuracy, and thereby the accuracy of the association relationships.
Step S43 is mainly intended to correct the processed data in the training set: the adjusted processed data is taken as authoritative and integrated with the basic data to form data association relationships of higher accuracy.
In step S45, as data accumulates, the training is improved by means of weights. The weighting scheme mainly gives larger weights to neighbors that are a smaller distance from the sample. Specifically, setting the weights too small reduces classification accuracy; setting them too large, when the test sample belongs to a class with few examples in the training set, adds noise and degrades the classification result. The weights must therefore be set appropriately to improve the accuracy of capturing inter-enterprise data association relationships. Typically, the K value is set by cross-validation (taking K=1 as the baseline). As a rule of thumb, K is generally below the square root of the number of training samples.
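The weighting and K-selection ideas in step S45 can be illustrated as follows. This is a hedged sketch, not the patent's procedure: it assumes inverse-distance weights, leave-one-out cross-validation starting from K=1, and candidate K values up to the integer square root of the training set size. The function names and the two-cluster sample data are hypothetical.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train, query, k):
    """Classify `query` by an inverse-distance-weighted vote of its k nearest neighbours."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbours:
        # Closer neighbours receive larger weights; epsilon avoids division by zero.
        votes[label] += 1.0 / (math.dist(features, query) + 1e-9)
    return max(votes, key=votes.get)


def choose_k(train):
    """Pick K by leave-one-out cross-validation over K = 1 .. floor(sqrt(n))."""
    best_k, best_accuracy = 1, -1.0
    for k in range(1, math.isqrt(len(train)) + 1):
        hits = sum(
            weighted_knn_predict(train[:i] + train[i + 1:], features, k) == label
            for i, (features, label) in enumerate(train)
        )
        accuracy = hits / len(train)
        if accuracy > best_accuracy:  # on ties the smaller K is kept
            best_k, best_accuracy = k, accuracy
    return best_k, best_accuracy


# Hypothetical two-cluster training data: (features, label).
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((0.8, 1.1), "A"),
    ((1.1, 1.3), "A"), ((0.9, 0.8), "A"), ((1.3, 1.1), "A"),
    ((4.0, 4.0), "B"), ((4.2, 3.9), "B"), ((3.8, 4.1), "B"),
    ((4.1, 4.3), "B"), ((3.9, 3.8), "B"), ((4.3, 4.1), "B"),
]
k, accuracy = choose_k(train)
print(k, accuracy, weighted_knn_predict(train, (1.5, 1.4), k))
```

With scikit-learn, which the description's own example uses, a comparable effect is available through `KNeighborsClassifier(weights="distance")`; the hand-written version above only makes the vote and the selection loop explicit.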
Further, step S5, processing new data using the training set database to obtain inter-enterprise data association relationships, includes the following specific steps:
S51, training the training set database with the training set data to obtain a usage model;
S52, classifying and predicting new data with the usage model to obtain inter-enterprise data association relationships.
In step S51, training the training set database helps improve its validity, and thereby the accuracy of capturing inter-enterprise data association relationships.
In step S52, the trained training set database serves as the usage model, which is used to classify and predict new data and obtain inter-enterprise data association relationships. This achieves automatic classification, and the accuracy rises as the data volume accumulates.
Steps S51 to S52 may be implemented with reference to the following example:
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    # X_train, y_train, X_test and y_test are assumed to have been prepared beforehand
    # Fit the KNN model on the training set
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X_train, y_train)
    # Measure model accuracy on the test set
    clf.score(X_test, y_test)
    # Set up the new data
    new_data = np.array([[5000, 40000]])
    # Classify and predict the new data
    clf.predict(new_data)
For the above example, the data in the training set database is shown in the following table:
No.  Data 1  Data 2  Data 3  Class
1    1.0     2.0     3.0     1
2    1.0     2.1     3.1     1
3    0.9     2.2     2.9     1
4    3.4     6.7     8.9     2
5    3.0     7.0     8.7     2
6    3.3     6.9     8.8     2
7    2.5     3.3     10.0    3
8    2.4     2.9     8.0     3
The new data is shown in the following table:
No.  Data 1  Data 2  Data 3  Class
1    2.1     5.5     7.2     0
2    1.1     2.5     4.2     0
3    4.1     3.5     9.2     0
The classified new data is shown in the following table:
No.  Data 1  Data 2  Data 3  Class
1    1.1     2.5     4.2     1
2    2.1     5.5     7.2     2
3    4.1     3.5     9.2     3
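The classification in the tables above can be checked with a small K-nearest-neighbor sketch. It assumes K=3, Euclidean distance and an unweighted majority vote, and it uses only the Python standard library rather than the scikit-learn classifier named in the description; the function name `knn_predict` is hypothetical.

```python
import math
from collections import Counter

# Training rows from the first table: (features, class).
training = [
    ((1.0, 2.0, 3.0), 1), ((1.0, 2.1, 3.1), 1), ((0.9, 2.2, 2.9), 1),
    ((3.4, 6.7, 8.9), 2), ((3.0, 7.0, 8.7), 2), ((3.3, 6.9, 8.8), 2),
    ((2.5, 3.3, 10.0), 3), ((2.4, 2.9, 8.0), 3),
]


def knn_predict(train, query, k=3):
    """Return the majority class among the k training rows nearest to `query`."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]


# New rows in the order of the second table.
new_data = [(2.1, 5.5, 7.2), (1.1, 2.5, 4.2), (4.1, 3.5, 9.2)]
predictions = [knn_predict(training, row) for row in new_data]
print(predictions)
```

Under these assumptions the three new rows receive classes 2, 1 and 3 respectively, which agrees with the classified table.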
The above method for capturing inter-enterprise data association relationships based on massive data collects massive enterprise-related data at low acquisition cost and processes it with big data technology, which ensures secure storage and efficient distributed processing of massive data, with accuracy improving continuously as data accumulates. Driven by big data technology and built on a distributed parallel computing framework, it solves the problems of storing and computing on massive data. Using the theory of machine learning and natural language processing, it lets machines intelligently process enterprise-related information by summarizing, categorizing and extracting, which improves capture accuracy and automatically associates and classifies, from massive data, the data that is useful to an enterprise, with high recognition efficiency.
As shown in Fig. 6, the system for capturing inter-enterprise data association relationships based on massive data provided by this embodiment includes an acquiring unit 1, a basic data forming unit 2, a processed data forming unit 3, a database acquiring unit 4 and a relationship acquiring unit 5.
The acquiring unit 1 is configured to obtain massive enterprise-related data.
The basic data forming unit 2 is configured to accumulate the massive enterprise-related data to form basic data.
The processed data forming unit 3 is configured to process the obtained massive enterprise-related data to form processed data.
The database acquiring unit 4 is configured to obtain a training set database according to the processed data and the basic data.
The relationship acquiring unit 5 is configured to process new data using the training set database to obtain inter-enterprise data association relationships.
The acquiring unit 1 specifically uses data crawling technology to collect and crawl enterprise-related data from the internet every day.
Further, the basic data forming unit 2 includes an update module 21, a database establishing module 22, a storage module 23 and a basic data acquiring module 24.
The update module 21 is configured to update the massive enterprise-related data periodically.
The database establishing module 22 is configured to mine and classify the massive enterprise-related data to establish a basic database.
The storage module 23 is configured to store the massive enterprise-related data in the basic database.
The basic data acquiring module 24 is configured to obtain the basic data in the basic database.
The update module 21 updates the massive enterprise-related data periodically, which serves to accumulate data.
The database establishing module 22 specifically uses machine learning techniques to mine and classify the massive enterprise-related data on the internet, and establishes the basic database from the result.
The storage module 23 specifically stores the massive enterprise-related data in a distributed manner using the big data technology HDFS.
The basic data in the basic database is the data obtained after the massive enterprise-related data has been accumulated and processed.
Further, the processed data forming unit 3 includes a processing module 31, an index establishing module 32, a classification module 33 and a matching and statistics module 34.
The processing module 31 is configured to clean and categorize the obtained massive enterprise-related data and to extract summaries and keywords from it.
The index establishing module 32 is configured to build an index on the summaries and keywords.
The classification module 33 is configured to classify the information, summaries and keywords to obtain classification results.
The matching and statistics module 34 is configured to perform real-time matching and statistics on the classification results to form processed data.
The processing module 31 specifically cleans and categorizes the collected massive enterprise-related data, and extracts summaries and keywords from it, based on the theory and techniques of natural language processing.
The index establishing module 32 specifically builds an index on the summaries and keywords produced by the natural language processing.
The classification module 33 specifically classifies the above information, summaries and keywords using the K-nearest-neighbor (k-Nearest Neighbor, KNN) classification algorithm to obtain classification results.
The matching and statistics module 34 specifically performs real-time matching and statistics on the classification results using the big data technology Spark, and forms the processed data from the result.
The acquiring unit 1, the basic data forming unit 2 and the processed data forming unit 3 all rely on mature big data technology to process the massive enterprise-related data obtained from the internet. This ensures secure storage and efficient distributed processing of massive data, and accuracy improves continuously as data accumulates. Driven by big data technology and built on a distributed parallel computing framework, the system solves the problems of storing and computing on massive data. Using the theory of machine learning and natural language processing, it lets machines intelligently process enterprise-related information by summarizing, categorizing and extracting. Because collection and processing are based on publicly available internet information, no sensitive information is involved and the data acquisition cost is relatively low.
In addition, the database acquiring unit 4 includes a training set forming module 41, an adjusting module 42, a processed data storage module 43, a training module 44 and an improvement module 45.
The training set forming module 41 is configured to make a training set according to the processed data and the basic data.
The adjusting module 42 is configured to sample, check and adjust the processed data.
The processed data storage module 43 is configured to store the adjusted processed data in the training set.
The training module 44 is configured to train on the training set.
The improvement module 45 is configured to improve the training by means of weights to form the training set database.
The training set forming module 41 integrates and matches the basic data in the basic database with the processed data obtained after classification, matching and statistics, and uses the result as the training set. This makes the association relationships between the basic data and the processed data explicit and facilitates capturing the association relationships of new data.
The adjusting module 42 is mainly intended to improve the accuracy of the association relationships between the basic data and the processed data. The processed data is therefore sampled, checked and adjusted manually to ensure its accuracy, and thereby the accuracy of the association relationships.
The processed data storage module 43 is mainly intended to correct the processed data in the training set: the adjusted processed data is taken as authoritative and integrated with the basic data to form data association relationships of higher accuracy.
As data accumulates, the improvement module 45 improves the training by means of weights. The weighting scheme mainly gives larger weights to neighbors that are a smaller distance from the sample. Specifically, setting the weights too small reduces classification accuracy; setting them too large, when the test sample belongs to a class with few examples in the training set, adds noise and degrades the classification result. The weights must therefore be set appropriately to improve the accuracy of capturing inter-enterprise data association relationships. Typically, the K value is set by cross-validation (taking K=1 as the baseline). As a rule of thumb, K is generally below the square root of the number of training samples.
Further, the relation acquisition unit 5 includes a model acquisition module 51 and a classification prediction module 52.
Model acquisition module 51, for training the training set database with the training set data to obtain a usage model.
Classification prediction module 52, for classifying and predicting new data with the usage model to obtain the inter-enterprise data association relationship.
Training the training set database with the model acquisition module 51 helps improve the validity of the training set database, and thereby the accuracy with which inter-enterprise data association relationships are captured.
The classification prediction module 52 takes the trained training set database as the usage model and uses it to classify and predict new data, obtaining the inter-enterprise data association relationship. Classification is thus automated, and accuracy increases as the data volume accumulates.
For the working process of the model acquisition module 51 and the classification prediction module 52, refer to the following embodiment:
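# Imports required by the calls below (NumPy and scikit-learn)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
# X_train, y_train, X_test and y_test are assumed to have been prepared from the training set database
# Feed the training set into the KNN model
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
# Measure model accuracy on the test set
clf.score(X_test, y_test)
# Set up the new data
new_data = np.array([[5000, 40000]])
# Classify and predict the new data
clf.predict(new_data)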
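The embodiment above relies on scikit-learn and on training and test arrays that the text does not define. A dependency-free sketch of the same fit, score and predict sequence is given below for illustration. The class SimpleKNN, the sample values and the labels 0 and 1 are all assumptions made here; the sketch uses an unweighted majority vote and is not the library's implementation.

```python
import math
from collections import Counter

class SimpleKNN:
    """Minimal majority-vote KNN with fit/score/predict, for illustration only."""

    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # KNN is a lazy learner: "training" only stores the labelled samples.
        self.samples = list(zip(X, y))
        return self

    def predict(self, X):
        return [self._predict_one(x) for x in X]

    def score(self, X, y):
        # Accuracy: share of test samples whose predicted label matches the true one.
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)

    def _predict_one(self, x):
        def distance(sample):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(sample[0], x)))
        nearest = sorted(self.samples, key=distance)[: self.n_neighbors]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

# Made-up two-feature samples with hypothetical class labels 0 and 1.
X_train = [(1000, 8000), (1200, 9000), (900, 7500),
           (5200, 41000), (4800, 39000), (5100, 40500)]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [(1100, 8500), (5000, 39500)]
y_test = [0, 1]

clf = SimpleKNN(n_neighbors=3).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
prediction = clf.predict([(5000, 40000)])
```

Because fitting only stores the labelled samples, the stored training set itself plays the role of the model, which matches the description of the trained training set database serving as the usage model.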
The above inter-enterprise data association relationship capturing system based on massive data collects massive enterprise-related data at low acquisition cost and processes it with big data technology, which ensures safe storage and distributed processing of the massive data with high efficiency; accuracy rises continuously as data accumulates. Grounded in big data technology and built on a distributed parallel computing framework, the system solves the problems of storing and computing massive data. Using the theory of machine learning and natural language processing, it lets machines intelligently process enterprise-related information by summarizing, categorizing and extracting it, which improves capture accuracy. It also automatically associates and classifies the data relevant to an enterprise from the massive data, with high recognition efficiency.
The above merely further illustrates the technical content of the present invention by way of embodiments so that readers can understand it more easily; it does not mean that the embodiments of the present invention are limited thereto. Any technical extension or re-creation made according to the present invention is protected by the present invention. The scope of protection of the present invention is defined by the claims.

Claims (10)

1. An inter-enterprise data association relationship capturing method based on massive data, characterized in that the method comprises:
Obtaining massive enterprise-related data;
Accumulating the massive enterprise-related data to form basic data;
Processing the obtained massive enterprise-related data to form processed data;
Obtaining a training set database according to the processed data and the basic data;
Processing new data using the training set database to obtain the inter-enterprise data association relationship.
2. The inter-enterprise data association relationship capturing method based on massive data according to claim 1, characterized in that the step of accumulating the massive enterprise-related data to form basic data comprises the following specific steps:
Regularly updating the massive enterprise-related data;
Mining and classifying the massive enterprise-related data to establish a basic database;
Storing the massive enterprise-related data in the basic database;
Obtaining the basic data in the basic database.
3. The inter-enterprise data association relationship capturing method based on massive data according to claim 1 or 2, characterized in that the step of processing the obtained massive enterprise-related data to form processed data comprises the following specific steps:
Cleaning and categorizing the obtained massive enterprise-related data, and extracting summaries and keywords;
Building an index for the summaries and keywords;
Classifying the information, summaries and keywords to obtain classification results;
Performing real-time matching and statistics on the classification results to form the processed data.
4. The inter-enterprise data association relationship capturing method based on massive data according to claim 3, characterized in that the step of obtaining a training set database according to the processed data and the basic data comprises the following specific steps:
Making a training set according to the processed data and the basic data;
Performing a sampling inspection of the processed data and adjusting it;
Storing the adjusted processed data in the training set;
Training on the training set;
Improving the training by means of weights to form the training set database.
5. The inter-enterprise data association relationship capturing method based on massive data according to claim 4, characterized in that the step of processing new data using the training set database to obtain the inter-enterprise data association relationship comprises the following specific steps:
Training the training set database with the training set data to obtain a usage model;
Classifying and predicting the new data with the usage model to obtain the inter-enterprise data association relationship.
6. An inter-enterprise data association relationship capturing system based on massive data, characterized by comprising an acquiring unit, a basic data formation unit, a processed data formation unit, a database acquiring unit and a relation acquisition unit;
The acquiring unit, for obtaining massive enterprise-related data;
The basic data formation unit, for accumulating the massive enterprise-related data to form basic data;
The processed data formation unit, for processing the obtained massive enterprise-related data to form processed data;
The database acquiring unit, for obtaining a training set database according to the processed data and the basic data;
The relation acquisition unit, for processing new data using the training set database to obtain the inter-enterprise data association relationship.
7. The inter-enterprise data association relationship capturing system based on massive data according to claim 6, characterized in that the basic data formation unit includes an update module, a database establishing module, a storage module and a basic data acquisition module;
The update module, for regularly updating the massive enterprise-related data;
The database establishing module, for mining and classifying the massive enterprise-related data to establish a basic database;
The storage module, for storing the massive enterprise-related data in the basic database;
The basic data acquisition module, for obtaining the basic data in the basic database.
8. The inter-enterprise data association relationship capturing system based on massive data according to claim 7, characterized in that the processed data formation unit includes a processing module, an index establishing module, a classification module and a matching statistics module;
The processing module, for cleaning and categorizing the obtained massive enterprise-related data and extracting summaries and keywords;
The index establishing module, for building an index for the summaries and keywords;
The classification module, for classifying the information, summaries and keywords to obtain classification results;
The matching statistics module, for performing real-time matching and statistics on the classification results to form the processed data.
9. The inter-enterprise data association relationship capturing system based on massive data according to claim 8, characterized in that the database acquiring unit includes a training set formation module, an adjusting module, a processed data storage module, a training module and an improvement module;
The training set formation module, for making a training set according to the processed data and the basic data;
The adjusting module, for performing a sampling inspection of the processed data and adjusting it;
The processed data storage module, for storing the adjusted processed data in the training set;
The training module, for training on the training set;
The improvement module, for improving the training by means of weights to form the training set database.
10. The inter-enterprise data association relationship capturing system based on massive data according to claim 9, characterized in that the relation acquisition unit includes a model acquisition module and a classification prediction module;
The model acquisition module, for training the training set database with the training set data to obtain a usage model;
The classification prediction module, for classifying and predicting new data with the usage model to obtain the inter-enterprise data association relationship.
CN201710353297.3A 2017-05-18 2017-05-18 Data among enterprises incidence relation method for catching and its system based on mass data Pending CN107103094A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710353297.3A CN107103094A (en) 2017-05-18 2017-05-18 Data among enterprises incidence relation method for catching and its system based on mass data

Publications (1)

Publication Number Publication Date
CN107103094A (en) 2017-08-29

Family

ID=59669252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710353297.3A Pending CN107103094A (en) 2017-05-18 2017-05-18 Data among enterprises incidence relation method for catching and its system based on mass data

Country Status (1)

Country Link
CN (1) CN107103094A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170829