CN109344405A - Similarity processing method based on the TF-IDF idea and a neural network
- Publication number
- CN109344405A (application CN201811114655.6A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- sample
- idf
- processing method
- method based
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a similarity processing method based on the TF-IDF idea and a neural network, comprising the following steps: A. creating an element dictionary; B. converting all samples in the sample set into numerical values according to the TF-IDF idea; C. converting all samples in the sample set into matrix form; D. building a neural network; E. calculating the similarity between a given sample and all samples. When one thing is compared against N things, the method needs only a single calculation, completed in a very short time, to obtain the similarity between that thing and all N things, which greatly improves the efficiency of one-to-N similarity calculation.
Description
Technical field
The present invention relates to the field of artificial intelligence, and specifically to a similarity processing method based on the TF-IDF idea and a neural network.
Background art
At present, calculating the similarity between things by mathematical means generally requires first converting the relevant things into numerical values.
TF-IDF stands for Term Frequency-Inverse Document Frequency. Its theoretical basis lies in information theory, and it is currently used mainly to convert the terms (Term) of a document (Document) into numerical values. The same idea can be applied to many other kinds of things, chiefly those that are composed of sub-elements.
Further similar ideas can be derived from TF-IDF, such as PF-IPF (Part Frequency-Inverse Product Frequency) and FF-IPF (Feature Frequency-Inverse Part Frequency).
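The source does not spell out the formulas. As a point of reference, one common textbook formulation of TF-IDF, and the PF-IPF analogue obtained by substituting parts for terms and products for documents, can be written as follows. The symbols $n_{t,d}$, $D$, $n_{p,q}$ and $Q$ are notation introduced here, and the method actually used in the patent may weight or smooth the terms differently.

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t) = \log\frac{|D|}{\left|\{\, d \in D : n_{t,d} > 0 \,\}\right|}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)

\mathrm{pf}(p,q) = \frac{n_{p,q}}{\sum_{k} n_{k,q}}, \qquad
\mathrm{ipf}(p) = \log\frac{|Q|}{\left|\{\, q \in Q : n_{p,q} > 0 \,\}\right|}, \qquad
\text{pf-ipf}(p,q) = \mathrm{pf}(p,q)\cdot\mathrm{ipf}(p)
```

Here $n_{t,d}$ is the number of occurrences of term $t$ in document $d$ within the document set $D$, and $n_{p,q}$ is the number of occurrences of part $p$ in product $q$ within the product set $Q$.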
Once the relevant things have been converted into numerical values, the similarity between different things can be calculated from their feature vectors using standard measures such as Euclidean distance, cosine similarity, the Pearson correlation coefficient, or the Spearman rank correlation coefficient.
However, these measures only give the similarity between a pair of things. To obtain the similarity between one thing and N things, the similarity to each of the N things must be calculated separately. The amount of computation therefore becomes very large, which wastes resources and leads to long waiting times.
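To make the cost argument concrete, the following is a minimal Python sketch of three of the pairwise measures named above, together with the one-against-N loop the background describes. The function names and the `one_vs_all` helper are illustrative choices, not part of the patent. Spearman rank correlation is omitted for brevity.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors (smaller = more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pearson_correlation(a, b):
    """Pearson correlation, computed as the cosine similarity of mean-centred vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])

def one_vs_all(query, known, measure=cosine_similarity):
    """Compare one vector against N known vectors: one call to `measure` per known vector."""
    return [measure(query, v) for v in known]

# Hypothetical data: one query compared against three known feature vectors.
scores = one_vs_all([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

With N known things, `one_vs_all` performs N separate pairwise calculations per query, which is the overhead the invention aims to avoid.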
Summary of the invention
The purpose of the present invention is to provide a similarity processing method based on the TF-IDF idea and a neural network, so as to solve the problems described in the background art above.
To achieve the above purpose, the present invention provides the following technical solution:
A similarity processing method based on the TF-IDF idea and a neural network, comprising the following steps:
A. Create an element dictionary;
B. Convert all samples in the sample set into numerical values according to the TF-IDF idea;
C. Convert all samples in the sample set into matrix form;
D. Build a neural network;
E. Train the neural network;
F. Calculate the similarity between a given sample and all samples.
As a further technical solution of the present invention, step A is specifically: obtain the full sample set of the objects whose similarity is to be calculated, and aggregate the sub-elements of all samples in the set so as to remove duplicate elements. The aggregated elements are then stored in the element dictionary.
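Step A can be sketched in a few lines of Python. This is an illustration under assumptions: the sample set is represented as a mapping from a sample name to its list of sub-elements, and elements are numbered from 1 in order of first appearance, mirroring the part numbers in a part dictionary. The function name and the data are hypothetical.

```python
def build_element_dictionary(samples):
    """Aggregate the sub-elements of all samples, dropping duplicates.

    samples: dict mapping a sample name to a list of its sub-elements.
    Returns a dict mapping each distinct element to a 1-based number,
    assigned in order of first appearance.
    """
    dictionary = {}
    for elements in samples.values():
        for element in elements:
            if element not in dictionary:
                dictionary[element] = len(dictionary) + 1
    return dictionary

# Hypothetical sample set: two products described by their parts.
products = {
    "pump": ["bolt", "nut", "bolt", "turbine"],
    "bracket": ["nut", "stud"],
}
part_dictionary = build_element_dictionary(products)
```

The size of the resulting dictionary is the value N that later fixes the length of every sample vector.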
As a further technical solution of the present invention, step B is specifically: convert all samples in the sample set into numerical values according to the TF-IDF idea.
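A minimal sketch of step B for the product/part case follows. It assumes the plain textbook weighting (relative part frequency multiplied by the logarithm of the inverse product frequency). The patent defers the exact PF-IPF calculation to a separate referenced patent, so the formula, the function name and the data here are assumptions for illustration only.

```python
import math
from collections import Counter

def pf_ipf_vectors(products, dictionary):
    """Turn each product into a vector with one PF-IPF value per dictionary entry.

    products: dict mapping a product name to its list of parts (with repeats).
    dictionary: ordered list of all distinct parts.
    """
    n_products = len(products)
    # For each part, the number of products that contain it at least once.
    product_freq = Counter(p for parts in products.values() for p in set(parts))
    vectors = {}
    for name, parts in products.items():
        counts = Counter(parts)
        total = len(parts)
        vectors[name] = [
            (counts[p] / total) * math.log(n_products / product_freq[p])
            if counts[p] else 0.0
            for p in dictionary
        ]
    return vectors

# Hypothetical data: "bolt" and "nut" occur in every product, so their weight is 0.
products = {"pump": ["bolt", "bolt", "nut", "turbine"], "bracket": ["bolt", "nut"]}
vectors = pf_ipf_vectors(products, ["bolt", "nut", "turbine"])
```

Parts shared by every product receive a weight of zero, while parts that are specific to a few products receive larger weights. This matches the pattern in a typical PF-IPF table, where common fasteners score near zero and distinctive assemblies score high.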
As a further technical solution of the present invention, step C is specifically: convert the samples from step B into an N × M input matrix and a sparse M × M output matrix.
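The matrix conversion in step C might look like the sketch below. The patent does not say what the sparse M × M output matrix contains. The sketch assumes it is the identity matrix, so that each sample is paired with a one-hot target marking its own position in the sample set. The function name is hypothetical.

```python
def matrixize(sample_vectors):
    """Build the N x M input matrix and an M x M sparse output matrix.

    sample_vectors: list of M vectors, each of length N (one value per
    dictionary element).
    Assumption: the output matrix is the identity (one-hot row per sample).
    """
    m = len(sample_vectors)
    n = len(sample_vectors[0])
    # Column j of the input matrix is sample j, so the matrix has N rows and M columns.
    input_matrix = [[sample_vectors[j][i] for j in range(m)] for i in range(n)]
    output_matrix = [[1 if i == j else 0 for j in range(m)] for i in range(m)]
    return input_matrix, output_matrix

# Hypothetical data: M = 2 samples over a dictionary of N = 3 elements.
inputs, outputs = matrixize([[0.0, 0.1, 0.2], [0.3, 0.0, 0.4]])
```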
As a further technical solution of the present invention, step D is specifically: the number of elements in the element dictionary from step A determines the number N of input-layer neurons of the neural network; the number of samples in the sample set determines the number M of output-layer neurons; the number of hidden layers and the number of neurons in each hidden layer are determined by how well the network fits the training samples, and are adjusted step by step toward the optimal values.
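The sizing rule in step D can be expressed as a small helper that derives the weight-matrix shapes of a fully connected network. The fully connected layout, the function name and the example sizes are assumptions. The patent only fixes the input width N and the output width M, and leaves the hidden layers to be tuned against the training fit.

```python
def layer_shapes(n_elements, n_samples, hidden_sizes):
    """Return the (rows, cols) shape of each weight matrix in a dense network.

    n_elements: size N of the element dictionary (input-layer width).
    n_samples: number M of samples in the sample set (output-layer width).
    hidden_sizes: candidate hidden-layer widths, to be tuned by trial.
    """
    sizes = [n_elements, *hidden_sizes, n_samples]
    # Each layer maps sizes[i] inputs to sizes[i + 1] outputs.
    return [(sizes[i + 1], sizes[i]) for i in range(len(sizes) - 1)]

# Hypothetical sizes: 1000 dictionary elements, 50 samples, two hidden layers.
shapes = layer_shapes(1000, 50, [128, 64])
```

Trying several `hidden_sizes` candidates and keeping the smallest one that fits the training samples well is one way to read "adjusted step by step toward the optimal values".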
As a further technical solution of the present invention, step E is specifically: train the neural network built in step D on the matrix-form sample set from step C.
As a further technical solution of the present invention, step F is specifically: for a sample to be evaluated, vectorize it according to the TF-IDF idea and run inference with the neural network trained in step E; a single calculation quickly yields the similarity between the current sample and all known samples.
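Steps D to F can be illustrated end to end with a deliberately small pure-Python network. Everything beyond the N-input, M-output layout is an assumption made for this sketch: a single tanh hidden layer, a softmax output trained with cross-entropy against one-hot targets, plain stochastic gradient descent, and the toy data. The patent does not specify activations, loss function or optimizer.

```python
import math
import random

def softmax(z):
    """Numerically stable softmax."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

class TinyMLP:
    """N inputs -> H tanh hidden units -> M softmax outputs (illustrative only)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        def init(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
        self.w1, self.b1 = init(n_hidden, n_in), [0.0] * n_hidden
        self.w2, self.b2 = init(n_out, n_hidden), [0.0] * n_out

    def forward(self, x):
        h = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = softmax([sum(w * v for w, v in zip(row, h)) + b
                     for row, b in zip(self.w2, self.b2)])
        return h, y

    def train_step(self, x, target, lr):
        h, y = self.forward(x)
        # Gradient of cross-entropy w.r.t. the output pre-activations.
        dz2 = [yi - ti for yi, ti in zip(y, target)]
        # Back-propagate through tanh (derivative 1 - h^2) before touching w2.
        dz1 = [(1 - hj * hj) * sum(dz2[k] * self.w2[k][j] for k in range(len(dz2)))
               for j, hj in enumerate(h)]
        for k, g in enumerate(dz2):
            for j, hj in enumerate(h):
                self.w2[k][j] -= lr * g * hj
            self.b2[k] -= lr * g
        for j, g in enumerate(dz1):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * g * xi
            self.b1[j] -= lr * g

# Toy data: M = 3 known samples over N = 4 dictionary elements.
samples = [[1.0, 0.0, 0.0, 0.5], [0.0, 1.0, 0.0, 0.5], [0.0, 0.0, 1.0, 0.0]]
M, N = len(samples), len(samples[0])
targets = [[1.0 if i == j else 0.0 for j in range(M)] for i in range(M)]

net = TinyMLP(N, 8, M)          # step D: build
for _ in range(500):            # step E: train
    for x, t in zip(samples, targets):
        net.train_step(x, t, lr=0.3)

# Step F: one forward pass scores the query against all M known samples.
_, scores = net.forward([0.9, 0.1, 0.0, 0.5])
```

The M output values are read here as similarity scores of the query against each known sample. The query costs a single forward pass however many samples the network was trained on, although the network must be retrained whenever the sample set changes.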
Compared with the prior art, the beneficial effect of the present invention is as follows: when one thing is compared against N things, the similarity processing method based on the TF-IDF idea and a neural network needs only a single calculation, completed in a very short time, to obtain the similarity between that thing and all N things, which greatly improves the efficiency of one-to-N similarity calculation.
Brief description of the drawings
Fig. 1 is a schematic diagram of the matrix form of a product sample set;
Fig. 2 is a schematic diagram of training the neural network on the (product) sample set;
Fig. 3 is a schematic diagram of using the neural network to quickly calculate the similarity between a given sample (product) and all samples (products).
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below. The described embodiments are obviously only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the scope of protection of the present invention.
Referring to Figs. 1-3, a similarity processing method based on the TF-IDF idea and a neural network comprises the following steps:
A. Create an element dictionary. Obtain the full sample set of the objects whose similarity is to be calculated, and aggregate the sub-elements of all samples in the set (the sub-elements that make up an object; for a product, for example, its parts are its sub-elements) so as to remove duplicate elements. The aggregated elements are then stored in the element dictionary. (Taking a part dictionary as an example, see Table 1);
B. Convert all samples in the sample set into numerical values according to the TF-IDF idea. For each sample, calculate the TF-IDF value of every element in the element dictionary. (Taking the PF-IPF of a product as an example: the calculation follows the invention patent "Product structure numerical processing method based on the TF-IDF idea" and yields the PF-IPF value of each part in the product, as shown in Table 2);
C. Convert all samples in the sample set into matrix form. The samples from step B are converted into an N × M input matrix (as shown in Fig. 1) and a sparse M × M output matrix;
D. Build a neural network. The number of elements in the element dictionary from step A determines the number N of input-layer neurons; the number of samples in the sample set determines the number M of output-layer neurons; the number of hidden layers and the number of neurons in each hidden layer are determined by how well the network fits the training samples, and are adjusted step by step toward the optimal values;
E. Train the neural network. The neural network from step D is trained on the matrix-form sample set from step C (as shown in Fig. 2);
F. Calculate the similarity between a given sample and all samples. For a sample to be evaluated, vectorize it according to the TF-IDF idea and run inference with the neural network trained in step E; a single calculation quickly yields the similarity between the current sample and all known samples (as shown in Fig. 3).
Table 1 is the part dictionary table:
Part number | Part |
---|---|
1 | Hexagon head bolt M10 × 20 |
2 | Nut M10 |
3 | Fully threaded stud M10 × 25 |
… | … |
1000 | Four-cylinder engine frame |
… | … |
N | Turbine |
Table 2 is the PF-IPF value table of a given product:
Part number | PF-IPF value |
---|---|
1 | 0 |
2 | 0.001 |
3 | 0.00065 |
… | … |
1000 | 1.889 |
… | … |
N | 0 |
It is obvious to a person skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments are therefore to be regarded in every respect as illustrative and not restrictive. The scope of the present invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced by the present invention.
In addition, it should be understood that although this specification is organized by embodiment, not every embodiment contains only one independent technical solution. The specification is written this way merely for the sake of clarity. A person skilled in the art should treat the specification as a whole; the technical solutions of the various embodiments may also be suitably combined to form other embodiments that a person skilled in the art can understand.
Claims (7)
1. A similarity processing method based on the TF-IDF idea and a neural network, characterized by comprising the following steps:
A. Create an element dictionary;
B. Convert all samples in the sample set into numerical values according to the TF-IDF idea;
C. Convert all samples in the sample set into matrix form;
D. Build a neural network;
E. Train the neural network;
F. Calculate the similarity between a given sample and all samples.
2. The similarity processing method based on the TF-IDF idea and a neural network according to claim 1, characterized in that step A is specifically: obtain the full sample set of the objects whose similarity is to be calculated, aggregate the sub-elements of all samples in the set so as to remove duplicate elements, and store the aggregated elements in the element dictionary.
3. The similarity processing method based on the TF-IDF idea and a neural network according to claim 1, characterized in that step B is specifically: convert all samples in the sample set into numerical values according to the TF-IDF idea.
4. The similarity processing method based on the TF-IDF idea and a neural network according to claim 1, characterized in that step C is specifically: convert the samples from step B into an N × M input matrix and a sparse M × M output matrix.
5. The similarity processing method based on the TF-IDF idea and a neural network according to claim 1, characterized in that step D is specifically: the number of elements in the element dictionary from step A determines the number N of input-layer neurons of the neural network; the number of samples in the sample set determines the number M of output-layer neurons; the number of hidden layers and the number of neurons in each hidden layer are determined by how well the network fits the training samples, and are adjusted step by step toward the optimal values.
6. The similarity processing method based on the TF-IDF idea and a neural network according to claim 1, characterized in that step E is specifically: train the neural network from step D on the matrix-form sample set from step C.
7. The similarity processing method based on the TF-IDF idea and a neural network according to any one of claims 1-6, characterized in that step F is specifically: for a sample to be evaluated, vectorize it according to the TF-IDF idea and run inference with the neural network trained in step E; a single calculation quickly yields the similarity between the current sample and all known samples.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811114655.6A CN109344405B (en) | 2018-09-25 | 2018-09-25 | Similarity processing method based on TF-IDF thought and neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811114655.6A CN109344405B (en) | 2018-09-25 | 2018-09-25 | Similarity processing method based on TF-IDF thought and neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109344405A (en) | 2019-02-15 |
CN109344405B (en) | 2023-04-14 |
Family
ID=65306681
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811114655.6A Active CN109344405B (en) | 2018-09-25 | 2018-09-25 | Similarity processing method based on TF-IDF thought and neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109344405B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110688722A (en) * | 2019-10-17 | 2020-01-14 | 深制科技(苏州)有限公司 | Automatic generation method of part attribute matrix based on deep learning |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001011487A2 (en) * | 1999-08-04 | 2001-02-15 | Board Of Trustees Of The University Of Illinois | Apparatus, method and product for multi-attribute drug comparison |
CN104392247A (en) * | 2014-11-07 | 2015-03-04 | 上海交通大学 | Similarity network fast fusion method used for data clustering |
CN105373547A (en) * | 2014-08-25 | 2016-03-02 | 北大方正集团有限公司 | Knowledge point importance calculation method and apparatus |
CN105808689A (en) * | 2016-03-03 | 2016-07-27 | 中国地质大学(武汉) | Drainage system entity semantic similarity measurement method based on artificial neural network |
US20170357878A1 (en) * | 2014-08-05 | 2017-12-14 | Sri International | Multi-dimensional realization of visual content of an image collection |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110688722A (en) * | 2019-10-17 | 2020-01-14 | 深制科技(苏州)有限公司 | Automatic generation method of part attribute matrix based on deep learning |
CN110688722B (en) * | 2019-10-17 | 2023-08-08 | 深制科技(苏州)有限公司 | Automatic generation method of part attribute matrix based on deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN109344405B (en) | 2023-04-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||