CN107103094A - Method and system for capturing inter-enterprise data association relationships based on massive data - Google Patents
- Publication number
- CN107103094A (CN201710353297.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- module
- training set
- massive
- processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention relates to a method and system for capturing inter-enterprise data association relationships based on massive data. The method includes: obtaining massive enterprise-related data; accumulating the massive enterprise-related data to form basic data; processing the obtained massive enterprise-related data to form processed data; obtaining a training set database from the processed data and the basic data; and processing new data with the training set database to obtain inter-enterprise data association relationships. The invention processes data with big data technology, which ensures safe storage and distributed processing of the massive data with high efficiency. Driven by big data technology and built on a distributed parallel computing framework, it solves the storage and computation problems of massive data. Using the theory of machine learning and natural language processing, it lets machines process enterprise-related information intelligently, which improves capture accuracy, automatically associates and classifies the data in the massive data that is useful to enterprises, and achieves high recognition efficiency.
Description
Technical field
The present invention relates to data processing, and more specifically to a method and system for capturing inter-enterprise data association relationships based on massive data.
Background art
With the development of science and technology, more and more enterprises publish data on the internet to promote themselves or to look for investees. As a result, the amount of enterprise data on the internet keeps growing, and enterprise databases on the internet are becoming ever larger.
When promoting an enterprise or looking for an investee, the data association relationships between enterprises have to be found in the massive data on the internet and used as search criteria to pinpoint the enterprise being sought. At present, however, these relationships can only be found by manual screening and analysis. This makes comprehensive analysis and holistic profiling of an enterprise difficult, and manual screening and analysis is both inefficient and inaccurate.
Chinese patent 201510810811.2 provides an algorithm, based on a relational database, for retrieving identical master-detail relationship data in a big data setting. It is an algorithm for comparing data within massive data: following the principle of "breaking large into small, and going from the broad area to the specific point", it uses grouped traversal, intermediate-table storage and similar techniques to narrow the comparison scope step by step and efficiently retrieve identical records. That invention provides a method for quickly retrieving identical records from massive master-detail structured data in enterprise data. It suits the various situations in enterprise management where identical master-detail structured data has to be retrieved, strengthens the enterprise's management and control capability, builds a better market environment for the enterprise, and improves its competitiveness.
The above patent retrieves identical records quickly, but this approach can only find similar records, and its accuracy is not high.
Therefore, it is necessary to design a method for capturing inter-enterprise data association relationships based on massive data that improves capture accuracy and that automatically associates and classifies, with high efficiency, the data in the massive data that is useful to enterprises.
Summary of the invention
The object of the present invention is to overcome the defects of the prior art and to provide a method and system for capturing inter-enterprise data association relationships based on massive data.
To achieve the above object, the present invention adopts the following technical solution: a method for capturing inter-enterprise data association relationships based on massive data, the method comprising:
obtaining massive enterprise-related data;
accumulating the massive enterprise-related data to form basic data;
processing the obtained massive enterprise-related data to form processed data;
obtaining a training set database from the processed data and the basic data;
processing new data with the training set database to obtain inter-enterprise data association relationships.
In a further technical solution, the step of accumulating the massive enterprise-related data to form basic data comprises the following specific steps:
periodically updating the massive enterprise-related data;
mining and classifying the massive enterprise-related data to build a basic database;
storing the massive enterprise-related data in the basic database;
obtaining the basic data from the basic database.
In a further technical solution, the step of processing the obtained massive enterprise-related data to form processed data comprises the following specific steps:
cleaning and categorizing the obtained massive enterprise-related data, and extracting summaries and keywords from it;
building an index on the summaries and keywords;
classifying the information, summaries and keywords to obtain classification results;
matching the classification results in real time and compiling statistics on them to form the processed data.
In a further technical solution, the step of obtaining a training set database from the processed data and the basic data comprises the following specific steps:
building a training set from the processed data and the basic data;
spot-checking the processed data by sampling and adjusting it;
storing the adjusted processed data in the training set;
training on the training set;
improving the training by means of weights to form the training set database.
In a further technical solution, the step of processing new data with the training set database to obtain inter-enterprise data association relationships comprises the following specific steps:
training the training set database with the training set data to obtain a working model;
classifying and predicting the new data with the working model to obtain inter-enterprise data association relationships.
The present invention also provides a system for capturing inter-enterprise data association relationships based on massive data, comprising an acquiring unit, a basic data forming unit, a processed data forming unit, a database acquiring unit and a relationship acquiring unit;
the acquiring unit is configured to obtain massive enterprise-related data;
the basic data forming unit is configured to accumulate the massive enterprise-related data to form basic data;
the processed data forming unit is configured to process the obtained massive enterprise-related data to form processed data;
the database acquiring unit is configured to obtain a training set database from the processed data and the basic data;
the relationship acquiring unit is configured to process new data with the training set database to obtain inter-enterprise data association relationships.
In a further technical solution, the basic data forming unit comprises an update module, a database building module, a storage module and a basic data acquisition module;
the update module is configured to periodically update the massive enterprise-related data;
the database building module is configured to mine and classify the massive enterprise-related data and build a basic database;
the storage module is configured to store the massive enterprise-related data in the basic database;
the basic data acquisition module is configured to obtain the basic data from the basic database.
In a further technical solution, the processed data forming unit comprises a processing module, an index building module, a classification module and a matching and statistics module;
the processing module is configured to clean and categorize the obtained massive enterprise-related data and to extract summaries and keywords from it;
the index building module is configured to build an index on the summaries and keywords;
the classification module is configured to classify the information, summaries and keywords to obtain classification results;
the matching and statistics module is configured to match the classification results in real time and compile statistics on them to form the processed data.
In a further technical solution, the database acquiring unit comprises a training set forming module, an adjusting module, a processed data storage module, a training module and an improvement module;
the training set forming module is configured to build a training set from the processed data and the basic data;
the adjusting module is configured to spot-check the processed data by sampling and adjust it;
the processed data storage module is configured to store the adjusted processed data in the training set;
the training module is configured to train on the training set;
the improvement module is configured to improve the training by means of weights to form the training set database.
In a further technical solution, the relationship acquiring unit comprises a model acquisition module and a classification and prediction module;
the model acquisition module is configured to train the training set database with the training set data to obtain a working model;
the classification and prediction module is configured to classify and predict new data with the working model to obtain inter-enterprise data association relationships.
Compared with the prior art, the present invention has the following advantages. The method for capturing inter-enterprise data association relationships based on massive data collects massive enterprise-related data at a low acquisition cost and processes it with big data technology, which ensures safe storage and distributed processing of the massive data with high efficiency, while accuracy keeps improving as data accumulates. Driven by big data technology and built on a distributed parallel computing framework, the method solves the storage and computation problems of massive data. Using the theory of machine learning and natural language processing, it lets machines process enterprise-related information intelligently by summarizing, categorizing and extracting it. This improves capture accuracy, automatically associates and classifies the data in the massive data that is useful to enterprises, and achieves high recognition efficiency.
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Brief description of the drawings
Fig. 1 is a flowchart of the method for capturing inter-enterprise data association relationships based on massive data provided by a specific embodiment of the present invention;
Fig. 2 is a detailed flowchart of forming the basic data provided by a specific embodiment of the present invention;
Fig. 3 is a detailed flowchart of forming the processed data provided by a specific embodiment of the present invention;
Fig. 4 is a detailed flowchart of obtaining the training set database provided by a specific embodiment of the present invention;
Fig. 5 is a detailed flowchart of obtaining the inter-enterprise data association relationships provided by a specific embodiment of the present invention;
Fig. 6 is a structural block diagram of the system for capturing inter-enterprise data association relationships based on massive data provided by a specific embodiment of the present invention;
Fig. 7 is a structural block diagram of the basic data forming unit provided by a specific embodiment of the present invention;
Fig. 8 is a structural block diagram of the processed data forming unit provided by a specific embodiment of the present invention;
Fig. 9 is a structural block diagram of the database acquiring unit provided by a specific embodiment of the present invention;
Fig. 10 is a structural block diagram of the relationship acquiring unit provided by a specific embodiment of the present invention.
Detailed description
For a fuller understanding of the technical content of the present invention, the technical solution is further introduced and explained below with reference to specific embodiments, but the invention is not limited thereto.
In the specific embodiment shown in Figs. 1 to 10, the method for capturing inter-enterprise data association relationships based on massive data provided by this embodiment can be used when promoting an enterprise or looking for an investee. It improves capture accuracy and automatically associates and classifies, with high efficiency, the data in the massive data that is useful to enterprises.
As shown in Fig. 1, this embodiment provides a method for capturing inter-enterprise data association relationships based on massive data, the method comprising:
S1, obtaining massive enterprise-related data;
S2, accumulating the massive enterprise-related data to form basic data;
S3, processing the obtained massive enterprise-related data to form processed data;
S4, obtaining a training set database from the processed data and the basic data;
S5, processing new data with the training set database to obtain inter-enterprise data association relationships.
In step S1, the massive enterprise-related data is obtained by using data crawling technology to collect and crawl enterprise-related data from the internet every day.
Further, step S2, accumulating the massive enterprise-related data to form basic data, comprises the following specific steps:
S21, periodically updating the massive enterprise-related data;
S22, mining and classifying the massive enterprise-related data to build a basic database;
S23, storing the massive enterprise-related data in the basic database;
S24, obtaining the basic data from the basic database.
In step S21, the massive enterprise-related data is updated periodically, which serves to accumulate data.
In step S22, machine learning techniques are used to mine and classify the massive enterprise-related data on the internet, and the basic database is built from the result.
In step S23, the massive enterprise-related data is stored in a distributed manner using the big data technology HDFS.
In step S24, the basic data in the basic database is the data obtained after the massive enterprise-related data has been accumulated and processed.
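The accumulation in steps S21 to S24 can be sketched as follows. This is a minimal illustration only: an in-memory dictionary stands in for the HDFS-backed basic database, and the `BaseDatabase` class and the record fields (`enterprise`, `url`, `fetched`, `text`) are assumptions made for the example, not names taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class BaseDatabase:
    """Hypothetical in-memory stand-in for the basic database of step S22."""
    records: dict = field(default_factory=dict)  # (enterprise, url) -> latest record

    def update(self, batch):
        """S21/S23: merge one periodic crawl batch; a newer record replaces an older one."""
        for rec in batch:
            key = (rec["enterprise"], rec["url"])
            old = self.records.get(key)
            if old is None or rec["fetched"] >= old["fetched"]:
                self.records[key] = rec
        return len(self.records)

    def basic_data(self, enterprise):
        """S24: return the accumulated records for one enterprise."""
        return [r for (name, _), r in self.records.items() if name == enterprise]


db = BaseDatabase()
day1 = [
    {"enterprise": "Acme", "url": "a/1", "fetched": 1, "text": "Acme opens plant"},
    {"enterprise": "Beta", "url": "b/1", "fetched": 1, "text": "Beta hires staff"},
]
day2 = [
    {"enterprise": "Acme", "url": "a/1", "fetched": 2, "text": "Acme opens second plant"},
    {"enterprise": "Acme", "url": "a/2", "fetched": 2, "text": "Acme invests in Beta"},
]
size_after_day1 = db.update(day1)
size_after_day2 = db.update(day2)
```

Keying records by enterprise and source URL is one possible way to let repeated crawls refresh existing entries instead of duplicating them; the patent does not specify how updates are merged.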
Further, step S3, processing the obtained massive enterprise-related data to form processed data, comprises the following specific steps:
S31, cleaning and categorizing the obtained massive enterprise-related data, and extracting summaries and keywords from it;
S32, building an index on the summaries and keywords;
S33, classifying the information, summaries and keywords to obtain classification results;
S34, matching the classification results in real time and compiling statistics on them to form the processed data.
In step S31, the collected massive enterprise-related data is cleaned and categorized, and summaries and keywords are extracted from it, based on the theory and techniques of natural language processing.
In step S32, an index is built on the summaries and keywords produced by the natural language processing.
In step S33, the K-nearest-neighbor (k-Nearest Neighbor, KNN) classification method is used to classify the above information, summaries and keywords and obtain classification results.
In step S34, the big data technology Spark is used to match the classification results in real time and compile statistics on them, which forms the processed data.
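Steps S31 and S32 can be illustrated with a small sketch. It assumes English text, a simple frequency count as the keyword extractor, and a hypothetical stop-word list; the patent names no specific natural language processing technique, so `clean`, `extract_keywords` and `build_index` are illustrative names only. The KNN classification of S33 and the Spark-based statistics of S34 are not shown here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; a real system would use a proper NLP toolkit.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "its"}


def clean(text):
    """S31 (cleaning): drop HTML tags, collapse whitespace, lower-case."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def extract_keywords(text, top_n=3):
    """S31 (keywords): the top_n most frequent words that are not stop words."""
    words = [w for w in re.findall(r"[a-z]+", text) if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(top_n)]


def build_index(docs):
    """S32: inverted index mapping each keyword to the documents it was extracted from."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for keyword in extract_keywords(clean(text)):
            index[keyword].add(doc_id)
    return index


docs = {
    "d1": "<p>Acme invests in Beta Robotics. Acme and Beta sign robotics deal.</p>",
    "d2": "Gamma supplies Acme. Gamma batteries power Acme robots.",
}
index = build_index(docs)
```

Looking up `index["acme"]` then returns every document in which "acme" was extracted as a keyword, which is the kind of lookup a later matching step could build on.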
Steps S1 to S3 all process the massive enterprise-related data obtained from the internet using mature big data technology, which ensures safe storage and distributed processing of the massive data with high efficiency, while accuracy keeps improving as data accumulates. Driven by big data technology and built on a distributed parallel computing framework, they solve the storage and computation problems of massive data. Using the theory of machine learning and natural language processing, they let machines process enterprise-related information intelligently by summarizing, categorizing and extracting it. Because collection and processing rely on publicly available internet information, no sensitive information is involved and the data acquisition cost is relatively low.
Further, step S4, obtaining a training set database from the processed data and the basic data, comprises the following specific steps:
S41, building a training set from the processed data and the basic data;
S42, spot-checking the processed data by sampling and adjusting it;
S43, storing the adjusted processed data in the training set;
S44, training on the training set;
S45, improving the training by means of weights to form the training set database.
In step S41, the basic data in the basic database is integrated and matched with the processed data obtained after classification, matching and statistics, and the result is used as the training set. This makes the association relationship between the basic data and the processed data explicit and makes it easier to capture the association relationships of new data.
Step S42 is mainly intended to improve the accuracy of the association relationship between the basic data and the processed data. The processed data therefore has to be spot-checked by sampling and adjusted manually, to ensure the accuracy of the processed data and hence of the association relationship.
Step S43 is mainly intended to correct the processed data in the training set: the adjusted processed data is taken as the standard and integrated with the basic data, forming data association relationships of higher accuracy.
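Steps S41 to S43 can be sketched as a join followed by manual corrections. The function names, the enterprise identifiers and the dictionary layout below are hypothetical, and the feature values are illustrative only; the patent does not describe the data format of the training set.

```python
def make_training_set(basic_data, processed_data):
    """S41: join basic and processed records on a shared enterprise id."""
    return {
        eid: {"features": basic_data[eid], "label": processed_data[eid]}
        for eid in basic_data.keys() & processed_data.keys()
    }


def apply_corrections(training_set, corrections):
    """S42/S43: overwrite labels that a manual spot check found to be wrong."""
    for eid, label in corrections.items():
        if eid in training_set:
            training_set[eid]["label"] = label
    return training_set


basic_data = {"e1": [1.0, 2.0, 3.0], "e2": [3.4, 6.7, 8.9], "e3": [2.5, 3.3, 10.0]}
processed_data = {"e1": 1, "e2": 3, "e3": 3}  # e2 carries a wrong label in this example
corrections = {"e2": 2}                       # outcome of the manual spot check
training_set = apply_corrections(
    make_training_set(basic_data, processed_data), corrections
)
```

Keeping the corrections separate from the automatically produced labels makes it easy to re-apply them after the processed data is regenerated, although that is a design choice of this sketch rather than something the patent prescribes.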
In step S45, as data accumulates, the training is improved by means of weights. The main idea of the weighting is that neighbours at a smaller distance from the sample receive a larger weight. Specifically, setting the weight too small reduces classification accuracy; setting it too large, when the test sample belongs to a class with few samples in the training set, adds noise and degrades the classification. The weight must therefore be set appropriately to improve the accuracy of capturing inter-enterprise data association relationships. In general, the value of K is set by cross-validation (taking K=1 as the baseline); as a rule of thumb, K is generally below the square root of the number of training samples.
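The weighting and the choice of K described in step S45 can be sketched as follows. This is an illustration under stated assumptions: inverse distance is used as the weight, leave-one-out is used as the cross-validation scheme, and candidate K values are capped at the square root of the number of training samples. The patent fixes none of these details, and the data and function names are hypothetical.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train, query, k):
    """Classify query; each of the k nearest neighbours votes with weight 1/distance."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbours:
        votes[label] += 1.0 / (math.dist(features, query) + 1e-9)  # avoid division by zero
    return max(votes, key=votes.get)


def leave_one_out_accuracy(train, k):
    """Cross-validation: predict every sample from all the other samples."""
    hits = 0
    for i, (features, label) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        hits += weighted_knn_predict(rest, features, k) == label
    return hits / len(train)


def choose_k(train):
    """Try K = 1 .. floor(sqrt(n)) and keep the K with the best accuracy."""
    candidates = range(1, int(math.sqrt(len(train))) + 1)
    return max(candidates, key=lambda k: leave_one_out_accuracy(train, k))


# Hypothetical two-class data.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((1, 1), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"), ((6, 6), "b")]
best_k = choose_k(train)

# One close "a" neighbour outweighs two distant "b" neighbours.
skewed = [((0.0, 0.0), "a"), ((3.0, 0.0), "b"), ((3.1, 0.0), "b")]
label = weighted_knn_predict(skewed, (0.5, 0.0), 3)
```

The `skewed` case shows why distance weighting matters for classes with few samples: an unweighted majority vote with K=3 would pick "b", whereas the weighted vote keeps the much closer "a" neighbour in charge.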
Further, step S5, processing new data with the training set database to obtain inter-enterprise data association relationships, comprises the following specific steps:
S51, training the training set database with the training set data to obtain a working model;
S52, classifying and predicting the new data with the working model to obtain inter-enterprise data association relationships.
In step S51, training the training set database helps improve its validity, which in turn improves the accuracy of capturing inter-enterprise data association relationships.
In step S52, the trained training set database serves as the working model, which is used to classify and predict new data and obtain inter-enterprise data association relationships. This achieves automatic classification, and accuracy keeps rising as the data volume accumulates.
Steps S51 to S52 may be implemented with reference to the following example:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
# Fit the KNN model to the training set (X_train and y_train are prepared beforehand)
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
# Measure the model accuracy on the test set
clf.score(X_test, y_test)
# Set the new data
new_data = np.array([[5000, 40000]])
# Classify and predict the new data
clf.predict(new_data)
For the example described above, the data in the training set database is shown in the following table:
No. | Data 1 | Data 2 | Data 3 | Classification |
1 | 1.0 | 2.0 | 3.0 | 1 |
2 | 1.0 | 2.1 | 3.1 | 1 |
3 | 0.9 | 2.2 | 2.9 | 1 |
4 | 3.4 | 6.7 | 8.9 | 2 |
5 | 3.0 | 7.0 | 8.7 | 2 |
6 | 3.3 | 6.9 | 8.8 | 2 |
7 | 2.5 | 3.3 | 10.0 | 3 |
8 | 2.4 | 2.9 | 8.0 | 3 |
The new data is shown in the following table:
No. | Data 1 | Data 2 | Data 3 | Classification |
1 | 2.1 | 5.5 | 7.2 | 0 |
2 | 1.1 | 2.5 | 4.2 | 0 |
3 | 4.1 | 3.5 | 9.2 | 0 |
The classified new data is shown in the following table:
No. | Data 1 | Data 2 | Data 3 | Classification |
1 | 1.1 | 2.5 | 4.2 | 1 |
2 | 2.1 | 5.5 | 7.2 | 2 |
3 | 4.1 | 3.5 | 9.2 | 3 |
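The tables above can be reproduced with a short worked example. The sketch assumes Euclidean distance and an unweighted majority vote with K=3 (matching `n_neighbors=3` in the example code), and it is written in plain Python rather than scikit-learn so that it runs without third-party packages; `knn_predict` is a name chosen for this illustration.

```python
import math
from collections import Counter

# Training set database rows from the first table: (features, class).
training = [
    ((1.0, 2.0, 3.0), 1), ((1.0, 2.1, 3.1), 1), ((0.9, 2.2, 2.9), 1),
    ((3.4, 6.7, 8.9), 2), ((3.0, 7.0, 8.7), 2), ((3.3, 6.9, 8.8), 2),
    ((2.5, 3.3, 10.0), 3), ((2.4, 2.9, 8.0), 3),
]
# New data rows from the second table (class not yet known).
new_data = [(2.1, 5.5, 7.2), (1.1, 2.5, 4.2), (4.1, 3.5, 9.2)]


def knn_predict(train, query, k=3):
    """Majority vote among the k training rows closest to query (Euclidean distance)."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


predictions = [knn_predict(training, row) for row in new_data]
print(predictions)  # [2, 1, 3]
```

The predicted classes agree with the third table: (2.1, 5.5, 7.2) falls in class 2, (1.1, 2.5, 4.2) in class 1 and (4.1, 3.5, 9.2) in class 3.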
The above method for capturing inter-enterprise data association relationships based on massive data collects massive enterprise-related data at a low acquisition cost and processes it with big data technology, which ensures safe storage and distributed processing of the massive data with high efficiency, while accuracy keeps improving as data accumulates. Driven by big data technology and built on a distributed parallel computing framework, the method solves the storage and computation problems of massive data. Using the theory of machine learning and natural language processing, it lets machines process enterprise-related information intelligently by summarizing, categorizing and extracting it. This improves capture accuracy, automatically associates and classifies the data in the massive data that is useful to enterprises, and achieves high recognition efficiency.
As shown in Fig. 6, this embodiment provides a system for capturing inter-enterprise data association relationships based on massive data, which comprises an acquiring unit 1, a basic data forming unit 2, a processed data forming unit 3, a database acquiring unit 4 and a relationship acquiring unit 5.
The acquiring unit 1 is configured to obtain massive enterprise-related data.
The basic data forming unit 2 is configured to accumulate the massive enterprise-related data to form basic data.
The processed data forming unit 3 is configured to process the obtained massive enterprise-related data to form processed data.
The database acquiring unit 4 is configured to obtain a training set database from the processed data and the basic data.
The relationship acquiring unit 5 is configured to process new data with the training set database to obtain inter-enterprise data association relationships.
Specifically, the acquiring unit 1 uses data crawling technology to collect and crawl enterprise-related data from the internet every day.
Further, the basic data forming unit 2 comprises an update module 21, a database building module 22, a storage module 23 and a basic data acquisition module 24.
The update module 21 is configured to periodically update the massive enterprise-related data.
The database building module 22 is configured to mine and classify the massive enterprise-related data and build a basic database.
The storage module 23 is configured to store the massive enterprise-related data in the basic database.
The basic data acquisition module 24 is configured to obtain the basic data from the basic database.
The update module 21 updates the massive enterprise-related data periodically, which serves to accumulate data.
The database building module 22 uses machine learning techniques to mine and classify the massive enterprise-related data on the internet, and builds the basic database from the result.
The storage module 23 stores the massive enterprise-related data in a distributed manner using the big data technology HDFS.
The basic data in the basic database is the data obtained after the massive enterprise-related data has been accumulated and processed.
Further, the processed data forming unit 3 comprises a processing module 31, an index building module 32, a classification module 33 and a matching and statistics module 34.
The processing module 31 is configured to clean and categorize the obtained massive enterprise-related data and to extract summaries and keywords from it.
The index building module 32 is configured to build an index on the summaries and keywords.
The classification module 33 is configured to classify the information, summaries and keywords to obtain classification results.
The matching and statistics module 34 is configured to match the classification results in real time and compile statistics on them to form the processed data.
The processing module 31 cleans and categorizes the collected massive enterprise-related data, and extracts summaries and keywords from it, based on the theory and techniques of natural language processing.
The index building module 32 builds an index on the summaries and keywords produced by the natural language processing.
The classification module 33 uses the K-nearest-neighbor (k-Nearest Neighbor, KNN) classification method to classify the above information, summaries and keywords and obtain classification results.
The matching and statistics module 34 uses the big data technology Spark to match the classification results in real time and compile statistics on them, which forms the processed data.
The acquiring unit 1, the basic data forming unit 2 and the processed data forming unit 3 all process the massive enterprise-related data obtained from the internet using mature big data technology, which ensures safe storage and distributed processing of the massive data with high efficiency, while accuracy keeps improving as data accumulates. Driven by big data technology and built on a distributed parallel computing framework, they solve the storage and computation problems of massive data. Using the theory of machine learning and natural language processing, they let machines process enterprise-related information intelligently by summarizing, categorizing and extracting it. Because collection and processing rely on publicly available internet information, no sensitive information is involved and the data acquisition cost is relatively low.
In addition, the database acquiring unit 4 comprises a training set forming module 41, an adjusting module 42, a processed data storage module 43, a training module 44 and an improvement module 45.
The training set forming module 41 is configured to build a training set from the processed data and the basic data.
The adjusting module 42 is configured to spot-check the processed data by sampling and adjust it.
The processed data storage module 43 is configured to store the adjusted processed data in the training set.
The training module 44 is configured to train on the training set.
The improvement module 45 is configured to improve the training by means of weights to form the training set database.
The training set forming module 41 integrates and matches the basic data in the basic database with the processed data obtained after classification, matching and statistics, and uses the result as the training set. This makes the association relationship between the basic data and the processed data explicit and makes it easier to capture the association relationships of new data.
The adjusting module 42 is mainly intended to improve the accuracy of the association relationship between the basic data and the processed data. The processed data therefore has to be spot-checked by sampling and adjusted manually, to ensure the accuracy of the processed data and hence of the association relationship.
The processed data storage module 43 is mainly intended to correct the processed data in the training set: the adjusted processed data is taken as the standard and integrated with the basic data, forming data association relationships of higher accuracy.
As data accumulates, the improvement module 45 improves the training by means of weights. The main idea of the weighting is that neighbours at a smaller distance from the sample receive a larger weight. Specifically, setting the weight too small reduces classification accuracy; setting it too large, when the test sample belongs to a class with few samples in the training set, adds noise and degrades the classification. The weight must therefore be set appropriately to improve the accuracy of capturing inter-enterprise data association relationships. In general, the value of K is set by cross-validation (taking K=1 as the baseline); as a rule of thumb, K is generally below the square root of the number of training samples.
Further, the relationship acquiring unit 5 comprises a model acquisition module 51 and a classification and prediction module 52.
The model acquisition module 51 is configured to train the training set database with the training set data to obtain a working model.
The classification and prediction module 52 is configured to classify and predict new data with the working model to obtain inter-enterprise data association relationships.
The model acquisition module 51 trains the training set database, which helps improve its validity and in turn improves the accuracy of capturing inter-enterprise data association relationships.
The classification and prediction module 52 takes the trained training set database as the working model and uses it to classify and predict new data and obtain inter-enterprise data association relationships. This achieves automatic classification, and accuracy keeps rising as the data volume accumulates.
The working process of the model acquisition module 51 and the classification prediction module 52 can be illustrated by the following embodiment:
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
# X_train, y_train, X_test and y_test are assumed to be feature and label arrays prepared from the training set database
# Fit the KNN model on the training set
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
# Measure model accuracy on the test set
clf.score(X_test, y_test)
# Define a new data sample
new_data = np.array([[5000, 40000]])
# Classify (predict) the new data
clf.predict(new_data)
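The embodiment above uses X_train, y_train, X_test and y_test without showing how they are produced. One possible preparation step is sketched below under stated assumptions: a reproducible hold-out split, followed by min-max scaling fitted on the training rows only. Scaling is not mentioned in the source; it is suggested here only because the two features of the example sample (5000 and 40000) differ in magnitude, and a distance-based classifier would otherwise be dominated by the larger one. The helper names `holdout_split`, `fit_min_max` and `scale` are hypothetical.

```python
import random

def holdout_split(rows, labels, test_ratio=0.25, seed=0):
    """Shuffle indices reproducibly and split rows/labels into train and test parts."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(rows) * test_ratio))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    X_train = [rows[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

def fit_min_max(rows):
    """Record the per-column minimum and maximum of the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def scale(rows, lows, highs):
    """Rescale each column with the training minimum and maximum
    (training rows land in [0, 1]; constant columns map to 0.0)."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]
```

The same `lows` and `highs` obtained from the training rows would then be applied to the test rows and to any new sample before prediction, so that all inputs share one scale.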
The above system for capturing inter-enterprise data association relationships based on massive data collects massive
enterprise-related data at low acquisition cost and processes the data with big data technology, which ensures safe
storage and distributed processing of the massive data with high efficiency; accuracy improves continuously as data
accumulate. Driven by big data technology and built on a distributed parallel computing framework, the system solves
the storage and computation problems of massive data. Using the theory of machine learning and natural language
processing, it lets machines intelligently process enterprise-related information by summarizing, categorizing, and
extracting, which improves capture accuracy. From the massive data, the data useful to the enterprise are
automatically associated and automatically classified, with high recognition efficiency.
The above merely further illustrates the technical content of the present invention by way of embodiments so that
readers can understand it more easily; it does not mean that the embodiments of the present invention are limited
thereto. Any technical extension or re-creation made according to the present invention falls within the protection
of the present invention. The protection scope of the present invention is defined by the claims.
Claims (10)
1. A method for capturing inter-enterprise data association relationships based on massive data, characterized in that the method comprises:
obtaining massive enterprise-related data;
accumulating the massive enterprise-related data to form basic data;
processing the obtained massive enterprise-related data to form processed data;
obtaining a training set database according to the processed data and the basic data;
processing new data using the training set database to obtain the inter-enterprise data association relationship.
2. The method for capturing inter-enterprise data association relationships based on massive data according to claim 1, characterized in that
the step of accumulating the massive enterprise-related data to form basic data comprises the following specific steps:
regularly updating the massive enterprise-related data;
mining and classifying the massive enterprise-related data to establish a basic database;
storing the massive enterprise-related data in the basic database;
obtaining the basic data in the basic database.
3. The method for capturing inter-enterprise data association relationships based on massive data according to claim 1 or 2, characterized in that
the step of processing the obtained massive enterprise-related data to form processed data comprises the following specific steps:
cleaning and categorizing the obtained massive enterprise-related data, extracting abstracts, and extracting keywords;
building an index on the abstracts and keywords;
classifying the information, abstracts, and keywords to obtain classification results;
performing real-time matching and statistics on the classification results to form the processed data.
4. The method for capturing inter-enterprise data association relationships based on massive data according to claim 3, characterized in that
the step of obtaining a training set database according to the processed data and the basic data comprises the following specific steps:
making a training set according to the processed data and the basic data;
sampling, checking, and adjusting the processed data;
storing the adjusted processed data into the training set;
training the training set;
improving the training using weights to form the training set database.
5. The method for capturing inter-enterprise data association relationships based on massive data according to claim 4, characterized in that
the step of processing new data using the training set database to obtain the inter-enterprise data association relationship comprises the following specific steps:
training the training set database using training set data to obtain a usage model;
classifying and predicting the new data using the usage model to obtain the inter-enterprise data association relationship.
6. A system for capturing inter-enterprise data association relationships based on massive data, characterized by comprising an acquiring unit, a basic data
forming unit, a processed data forming unit, a database acquiring unit, and a relation acquisition unit;
the acquiring unit is configured to obtain massive enterprise-related data;
the basic data forming unit is configured to accumulate the massive enterprise-related data to form basic data;
the processed data forming unit is configured to process the obtained massive enterprise-related data to form processed data;
the database acquiring unit is configured to obtain a training set database according to the processed data and the basic data;
the relation acquisition unit is configured to process new data using the training set database to obtain the inter-enterprise data association relationship.
7. The system for capturing inter-enterprise data association relationships based on massive data according to claim 6, characterized in that
the basic data forming unit comprises an update module, a database establishing module, a storage module, and a basic data acquisition module;
the update module is configured to regularly update the massive enterprise-related data;
the database establishing module is configured to mine and classify the massive enterprise-related data to establish a basic database;
the storage module is configured to store the massive enterprise-related data in the basic database;
the basic data acquisition module is configured to obtain the basic data in the basic database.
8. The system for capturing inter-enterprise data association relationships based on massive data according to claim 7, characterized in that
the processed data forming unit comprises a processing module, an index establishing module, a classification module, and a matching and statistics module;
the processing module is configured to clean and categorize the obtained massive enterprise-related data, extract abstracts, and extract keywords;
the index establishing module is configured to build an index on the abstracts and keywords;
the classification module is configured to classify the information, abstracts, and keywords to obtain classification results;
the matching and statistics module is configured to perform real-time matching and statistics on the classification results to form the processed data.
9. The system for capturing inter-enterprise data association relationships based on massive data according to claim 8, characterized in that
the database acquiring unit comprises a training set forming module, an adjusting module, a processed-data storage module, a training module, and an
improvement module;
the training set forming module is configured to make a training set according to the processed data and the basic data;
the adjusting module is configured to sample, check, and adjust the processed data;
the processed-data storage module is configured to store the adjusted processed data into the training set;
the training module is configured to train the training set;
the improvement module is configured to improve the training using weights to form the training set database.
10. The system for capturing inter-enterprise data association relationships based on massive data according to claim 9, characterized in that
the relation acquisition unit comprises a model acquisition module and a classification prediction module;
the model acquisition module is configured to train the training set database using training set data to obtain a usage model;
the classification prediction module is configured to classify and predict new data using the usage model to obtain the inter-enterprise data association relationship.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710353297.3A CN107103094A (en) | 2017-05-18 | 2017-05-18 | Data among enterprises incidence relation method for catching and its system based on mass data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710353297.3A CN107103094A (en) | 2017-05-18 | 2017-05-18 | Data among enterprises incidence relation method for catching and its system based on mass data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107103094A true CN107103094A (en) | 2017-08-29 |
Family
ID=59669252
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710353297.3A (Pending), CN107103094A (en) | 2017-05-18 | 2017-05-18 | Data among enterprises incidence relation method for catching and its system based on mass data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107103094A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101853277A (en) * | 2010-05-14 | 2010-10-06 | 南京信息工程大学 | A Vulnerability Data Mining Method Based on Classification and Association Analysis |
CN103136337A (en) * | 2013-02-01 | 2013-06-05 | 北京邮电大学 | Distributed knowledge data mining device and mining method used for complex network |
CN106570171A (en) * | 2016-11-03 | 2017-04-19 | 中国电子科技集团公司第二十八研究所 | Semantics-based sci-tech information processing method and system |
US20170109786A1 (en) * | 2015-10-20 | 2017-04-20 | Korea Electronics Technology Institute | System for producing promotional media content and method thereof |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107844912A (en) * | 2017-11-22 | 2018-03-27 | 前海梧桐(深圳)数据有限公司 | The computational methods and its system that enterprise's characteristic influences on enterprise development |
CN108647292A (en) * | 2018-05-07 | 2018-10-12 | 前海梧桐(深圳)数据有限公司 | Enterprise's property sort computational methods based on neural network algorithm and system |
US11397778B2 (en) | 2018-05-30 | 2022-07-26 | Beijing Baidu Netcom Service and Technology Co., Ltd. | Method and device for mining an enterprise relationship |
CN109829034A (en) * | 2018-08-24 | 2019-05-31 | 长威信息科技发展股份有限公司 | A kind of enterprise's tree spectrogram methods of exhibiting based on main market players's credit data |
CN109344199A (en) * | 2018-09-20 | 2019-02-15 | 普元信息技术股份有限公司 | The system and method for big data quantity distributed treatment is realized in cloud computing platform |
WO2020207252A1 (en) * | 2019-04-09 | 2020-10-15 | Oppo广东移动通信有限公司 | Data storage method and device, storage medium, and electronic apparatus |
CN111797175A (en) * | 2019-04-09 | 2020-10-20 | Oppo广东移动通信有限公司 | Data storage method and device, storage medium and electronic equipment |
CN111797175B (en) * | 2019-04-09 | 2023-12-19 | Oppo广东移动通信有限公司 | Data storage method and device, storage medium and electronic equipment |
CN111242760A (en) * | 2019-12-30 | 2020-06-05 | 航天信息股份有限公司企业服务分公司 | Method and system for carrying out accounting on capital service based on capital institution |
CN111242760B (en) * | 2019-12-30 | 2024-02-27 | 航天信息股份有限公司企业服务分公司 | Method and system for billing fund business based on fund institutions |
CN112364228A (en) * | 2020-11-26 | 2021-02-12 | 深圳前瞻资讯股份有限公司 | Construction method, system, application method, terminal device and storage medium of enterprise big data system based on physical position |
CN112364228B (en) * | 2020-11-26 | 2021-08-13 | 深圳前瞻资讯股份有限公司 | Construction method, system, application method, terminal device and storage medium of a physical location-based enterprise big data system |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170829 |