CN109344405A - A kind of similarity processing method based on TF-IDF thought and neural network - Google Patents

A kind of similarity processing method based on TF-IDF thought and neural network

Info

Publication number
CN109344405A
Authority
CN
China
Prior art keywords
neural network
sample
idf
processing method
method based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811114655.6A
Other languages
Chinese (zh)
Other versions
CN109344405B (en
Inventor
马佳
支含绪
邓森洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mdt Infotech Ltd Jiaxing
Original Assignee
Mdt Infotech Ltd Jiaxing
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mdt Infotech Ltd Jiaxing filed Critical Mdt Infotech Ltd Jiaxing
Priority to CN201811114655.6A priority Critical patent/CN109344405B/en
Publication of CN109344405A publication Critical patent/CN109344405A/en
Application granted granted Critical
Publication of CN109344405B publication Critical patent/CN109344405B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a similarity processing method based on the TF-IDF concept and a neural network, comprising the steps of: A. creating an element dictionary; B. numericalizing all samples in the sample set according to the TF-IDF concept; C. matrixizing all samples in the sample set; D. building a neural network; E. calculating the similarity between a given sample and all samples. With this similarity processing method based on the TF-IDF concept and a neural network, when comparing one thing with N things, the similarities between that thing and all N things can be obtained with a single computation in a very short time, which greatly improves the efficiency of computing the similarity between one thing and N things.

Description

A kind of similarity processing method based on TF-IDF thought and neural network
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a similarity processing method based on the TF-IDF concept and a neural network.
Background technique
At present, when the similarity between things is calculated mathematically, the related objects generally need to be numericalized first.
TF-IDF stands for Term Frequency-Inverse Document Frequency. Its theoretical basis lies in information theory, and it is mainly used as a way to numericalize the content (terms) of an article (document). Based on the TF-IDF concept, similar processing can be applied to many kinds of things, chiefly things that are composed of sub-elements.
Following the TF-IDF concept, closely related ideas can be derived, such as PF-IPF (Part Frequency-Inverse Product Frequency) and FF-IPF (Feature Frequency-Inverse Part Frequency).
After the related objects have been numericalized, the similarity between different things can be computed from their feature vectors with related algorithms (such as Euclidean distance, cosine similarity, Pearson correlation, or the Spearman rank correlation coefficient).
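For illustration, a minimal Python sketch of such a pairwise computation is given below, using cosine similarity between two numericalized feature vectors; the vectors and their values are invented for the example and are not from the patent.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two feature vectors (1 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical PF-IPF vectors of two things; each entry corresponds to one element.
thing_a = np.array([0.0, 0.001, 0.00065, 1.889])
thing_b = np.array([0.0, 0.002, 0.0, 1.750])
print(cosine_similarity(thing_a, thing_b))
```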
However, in this way similarity can only be computed between pairs of things. If the similarity between one thing and N things is needed, its similarity to each of the N things must be computed separately, so the amount of computation becomes very large, wasting resources and causing long waiting times.
Summary of the invention
The purpose of the present invention is to provide a similarity processing method based on the TF-IDF concept and a neural network, to solve the problems raised in the background art above.
To achieve the above object, the invention provides the following technical scheme:
A similarity processing method based on the TF-IDF concept and a neural network, comprising the following steps:
A. Create an element dictionary;
B. Numericalize all samples in the sample set according to the TF-IDF concept;
C. Matrixize all samples in the sample set;
D. Build a neural network;
E. Train the neural network;
F. Calculate the similarity between a given sample and all samples.
As a further technical solution of the present invention, step A is specifically: obtain all sample sets of the objects whose similarity is currently to be calculated, aggregate the sub-elements contained in all samples of the sample set so as to remove duplicate elements, and file the aggregated elements into the element dictionary library.
As a further technical solution of the present invention, step B is specifically: numericalize all samples in the sample set according to the TF-IDF concept.
As a further technical solution of the present invention, step C is specifically: convert the samples from step B into an N × M input matrix and an M × M sparse output matrix.
As a further technical solution of the present invention, step D is specifically: the number of elements in the element dictionary from step A determines the number N of input-layer neurons of the neural network; the number of samples in the sample set determines the number M of output-layer neurons; the number of hidden layers and the number of neurons in each hidden layer are determined according to the fitting degree during sample training, gradually approaching the optimal number of hidden layers and neurons per hidden layer.
As a further technical solution of the present invention, step E is specifically: train the neural network using the network from step D and the matrixized sample set from step C.
As a further technical solution of the present invention, step F is specifically: for the sample to be evaluated, vectorize it based on the TF-IDF concept and run inference with the neural network trained in step E; with a single computation, the similarities between the current sample and all known samples are obtained quickly.
Compared with the prior art, the beneficial effect of the present invention is: when comparing one thing with N things, the similarity processing method based on the TF-IDF concept and a neural network needs only a single computation, taking an extremely short time, to obtain the similarities between that thing and all N things, which greatly improves the efficiency of such similarity calculations.
Brief description of the drawings
Fig. 1 is a schematic diagram of matrixizing the product sample set;
Fig. 2 is a schematic diagram of the training (product) sample set in the neural network;
Fig. 3 is a schematic diagram of quickly calculating the similarity between a given sample (product) and all samples (products) using the neural network.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative work fall within the scope protected by the present invention.
Referring to Figs. 1-3, a similarity processing method based on the TF-IDF concept and a neural network comprises the following steps:
A, element dictionary is created.All sample sets of similitude object need to currently be calculated by obtaining, will be in sample set in all the elements Subset of elements (form the subset element of the object, such as the product, part be its subset) carry out polymerization processing, to go Except duplicate element.Using the element after polymerization, it is classified to element dictionary library.(by taking part dictionary as an example, such as table 1);
B. Numericalize all samples in the sample set according to the TF-IDF concept. Following the TF-IDF concept, calculate for each sample the TF-IDF value of every element in the element dictionary library (taking the PF-IPF of a product as an example, the calculation method refers to the invention patent "Product structure numeralization processing method based on TF-IDF thought", computing the PF-IPF value of each part in the product; as shown in Table 2);
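A minimal sketch of this step under a generic TF-IDF-style weighting (element frequency within a sample times a smoothed inverse sample frequency). The exact PF-IPF formula is defined in the patent cited above and may differ from this assumed form.

```python
import math
from collections import Counter

def numericalize(sample_set, element_dict):
    """Return one weight vector per sample, indexed by the element dictionary."""
    num_samples = len(sample_set)
    sample_frequency = Counter()            # in how many samples each element occurs
    for sample in sample_set:
        for element in set(sample):
            sample_frequency[element] += 1

    vectors = []
    for sample in sample_set:
        counts = Counter(sample)
        total = sum(counts.values())
        vector = [0.0] * len(element_dict)
        for element, count in counts.items():
            tf = count / total                                  # frequency within the sample
            idf = math.log(num_samples / (1 + sample_frequency[element])) + 1
            vector[element_dict[element] - 1] = tf * idf
        vectors.append(vector)
    return vectors
```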
C. Matrixize all samples in the sample set. Convert the samples from step B into an N × M input matrix (as shown in Fig. 1) and an M × M sparse output matrix;
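A minimal sketch of this step, assuming the N × M input matrix stacks one column per sample and the M × M sparse output matrix is the identity, so that each sample's training target is its own one-hot label; this layout is an assumption consistent with the shapes stated above, not a detail given in the patent.

```python
import numpy as np

def matrixize(vectors):
    """Turn the per-sample weight vectors of step B into the training matrices."""
    X = np.array(vectors, dtype=np.float32).T   # N x M: one column per sample
    M = X.shape[1]
    Y = np.eye(M, dtype=np.float32)             # M x M sparse output: one-hot targets
    return X, Y
```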
D. Build a neural network. The number of elements in the element dictionary from step A determines the number N of input-layer neurons of the neural network; the number of samples in the sample set determines the number M of output-layer neurons; the number of hidden layers and the number of neurons in each hidden layer are determined according to the fitting degree during sample training, gradually approaching the optimal number of hidden layers and neurons per hidden layer;
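A minimal sketch of this step in PyTorch (an assumed framework choice; the patent does not prescribe one). The hidden-layer sizes are placeholders to be refined through the fitting procedure described above.

```python
import torch.nn as nn

def build_network(N: int, M: int, hidden_sizes=(128, 64)):
    """N input neurons (dictionary size), M output neurons (number of samples);
    hidden layers are chosen by trial according to the training fit."""
    layers, width = [], N
    for size in hidden_sizes:
        layers += [nn.Linear(width, size), nn.ReLU()]
        width = size
    layers.append(nn.Linear(width, M))      # one output neuron per known sample
    return nn.Sequential(*layers)
```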
E. Train the neural network. Train the neural network using the network from step D and the matrixized sample set from step C (as shown in Fig. 2);
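A minimal training sketch for this step, assuming a cross-entropy objective against the one-hot targets from step C (a conventional choice, not mandated by the patent).

```python
import torch

def train(model, X, Y, epochs=200, lr=1e-3):
    """X is the N x M input matrix, Y the M x M one-hot output matrix from step C."""
    inputs = torch.tensor(X.T)               # M rows, each an N-dimensional sample
    targets = torch.tensor(Y).argmax(dim=1)  # class index of each sample
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
    return model
```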
F. Calculate the similarity between a given sample and all samples. For the sample to be evaluated, vectorize it based on the TF-IDF concept and run inference with the neural network trained in step E; with a single computation, the similarities between the current sample and all known samples are obtained quickly (as shown in Fig. 3).
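A minimal inference sketch for this step: the query sample is vectorized exactly as in step B, a single forward pass is run, and one score per known sample is read off; the softmax normalization of the M outputs is an assumption.

```python
import torch

def similarity_to_all(model, query_vector):
    """Return a length-M vector: entry j is the similarity of the query to known sample j."""
    model.eval()
    with torch.no_grad():
        x = torch.tensor(query_vector, dtype=torch.float32).unsqueeze(0)  # 1 x N
        scores = torch.softmax(model(x), dim=1).squeeze(0)                # length M
    return scores
```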
Table 1. Part dictionary table:

Part number    Part
1              Outer-hexagonal bolt M10 × 20
2              Nut M10
3              Full thread stud M10 × 25
1000           Four-cylinder engine rack
N              Turbine
Table 2. PF-IPF value data table of a certain product:

Part number    PF-IPF value
1              0
2              0.001
3              0.00065
1000           1.889
N              0
It is obvious to a person skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit or essential attributes. Therefore, from any point of view, the embodiments should be regarded as exemplary and non-restrictive; the scope of the present invention is defined by the appended claims rather than by the above description, and it is intended that all changes falling within the meaning and range of equivalency of the claims be embraced within the present invention.
In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of description is adopted merely for clarity; those skilled in the art should take the specification as a whole, and the technical solutions of the various embodiments may be suitably combined to form other embodiments understandable to those skilled in the art.

Claims (7)

1. A similarity processing method based on the TF-IDF concept and a neural network, characterized in that it comprises the following steps:
A. Create an element dictionary;
B. Numericalize all samples in the sample set according to the TF-IDF concept;
C. Matrixize all samples in the sample set;
D. Build a neural network;
E. Train the neural network;
F. Calculate the similarity between a given sample and all samples.
2. The similarity processing method based on the TF-IDF concept and a neural network according to claim 1, characterized in that step A is specifically: obtain all sample sets of the objects whose similarity is currently to be calculated, aggregate the sub-elements contained in all samples of the sample set so as to remove duplicate elements, and file the aggregated elements into the element dictionary library.
3. The similarity processing method based on the TF-IDF concept and a neural network according to claim 1, characterized in that step B is specifically: numericalize all samples in the sample set according to the TF-IDF concept.
4. The similarity processing method based on the TF-IDF concept and a neural network according to claim 1, characterized in that step C is specifically: convert the samples from step B into an N × M input matrix and an M × M sparse output matrix.
5. The similarity processing method based on the TF-IDF concept and a neural network according to claim 1, characterized in that step D is specifically: the number of elements in the element dictionary from step A determines the number N of input-layer neurons of the neural network; the number of samples in the sample set determines the number M of output-layer neurons; the number of hidden layers and the number of neurons in each hidden layer are determined according to the fitting degree during sample training, gradually approaching the optimal number of hidden layers and neurons per hidden layer.
6. The similarity processing method based on the TF-IDF concept and a neural network according to claim 1, characterized in that step E is specifically: train the neural network using the network from step D and the matrixized sample set from step C.
7. The similarity processing method based on the TF-IDF concept and a neural network according to any one of claims 1-6, characterized in that step F is specifically: for the sample to be evaluated, vectorize it based on the TF-IDF concept and run inference with the neural network trained in step E; with a single computation, the similarities between the current sample and all known samples are obtained quickly.
CN201811114655.6A 2018-09-25 2018-09-25 Similarity processing method based on TF-IDF thought and neural network Active CN109344405B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811114655.6A CN109344405B (en) 2018-09-25 2018-09-25 Similarity processing method based on TF-IDF thought and neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811114655.6A CN109344405B (en) 2018-09-25 2018-09-25 Similarity processing method based on TF-IDF thought and neural network

Publications (2)

Publication Number Publication Date
CN109344405A true CN109344405A (en) 2019-02-15
CN109344405B CN109344405B (en) 2023-04-14

Family

ID=65306681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811114655.6A Active CN109344405B (en) 2018-09-25 2018-09-25 Similarity processing method based on TF-IDF thought and neural network

Country Status (1)

Country Link
CN (1) CN109344405B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688722A (en) * 2019-10-17 2020-01-14 深制科技(苏州)有限公司 Automatic generation method of part attribute matrix based on deep learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001011487A2 (en) * 1999-08-04 2001-02-15 Board Of Trustees Of The University Of Illinois Apparatus, method and product for multi-attribute drug comparison
CN104392247A (en) * 2014-11-07 2015-03-04 上海交通大学 Similarity network fast fusion method used for data clustering
CN105373547A (en) * 2014-08-25 2016-03-02 北大方正集团有限公司 Knowledge point importance calculation method and apparatus
CN105808689A (en) * 2016-03-03 2016-07-27 中国地质大学(武汉) Drainage system entity semantic similarity measurement method based on artificial neural network
US20170357878A1 (en) * 2014-08-05 2017-12-14 Sri International Multi-dimensional realization of visual content of an image collection

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001011487A2 (en) * 1999-08-04 2001-02-15 Board Of Trustees Of The University Of Illinois Apparatus, method and product for multi-attribute drug comparison
US20170357878A1 (en) * 2014-08-05 2017-12-14 Sri International Multi-dimensional realization of visual content of an image collection
CN105373547A (en) * 2014-08-25 2016-03-02 北大方正集团有限公司 Knowledge point importance calculation method and apparatus
CN104392247A (en) * 2014-11-07 2015-03-04 上海交通大学 Similarity network fast fusion method used for data clustering
CN105808689A (en) * 2016-03-03 2016-07-27 中国地质大学(武汉) Drainage system entity semantic similarity measurement method based on artificial neural network

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688722A (en) * 2019-10-17 2020-01-14 深制科技(苏州)有限公司 Automatic generation method of part attribute matrix based on deep learning
CN110688722B (en) * 2019-10-17 2023-08-08 深制科技(苏州)有限公司 Automatic generation method of part attribute matrix based on deep learning

Also Published As

Publication number Publication date
CN109344405B (en) 2023-04-14

Similar Documents

Publication Publication Date Title
CN104298715B (en) A kind of more indexed results ordering by merging methods based on TF IDF
CN104573046A (en) Comment analyzing method and system based on term vector
CN105512808A (en) Power system transient stability assessment method based on big data
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN112100999A (en) Resume text similarity matching method and system
CN106980650A (en) A kind of emotion enhancing word insertion learning method towards Twitter opinion classifications
CN104008187A (en) Semi-structured text matching method based on the minimum edit distance
CN113157919B (en) Sentence text aspect-level emotion classification method and sentence text aspect-level emotion classification system
CN112925904A (en) Lightweight text classification method based on Tucker decomposition
Li et al. Shape optimisation of blended-wing-body underwater gliders based on free-form deformation
CN117132132A (en) Photovoltaic power generation power prediction method based on meteorological data
Ye et al. Big data processing framework for manufacturing
Nabil et al. Cufe at semeval-2016 task 4: A gated recurrent model for sentiment classification
CN109344405A (en) A kind of similarity processing method based on TF-IDF thought and neural network
CN110611334A (en) Copula-garch model-based multi-wind-farm output correlation method
CN104778205B (en) A kind of mobile application sequence and clustering method based on Heterogeneous Information network
CN112231476B (en) Improved graphic neural network scientific literature big data classification method
CN110825852B (en) Long text-oriented semantic matching method and system
Shen et al. An improved parallel Bayesian text classification algorithm
Luo A new text classifier based on random forests
CN107577690B (en) Recommendation method and recommendation device for mass information data
CN113722951B (en) Scatterer three-dimensional finite element grid optimization method based on neural network
CN104462458A (en) Data mining method of big data system
CN109145518B (en) Method for constructing reliability decision graph model of large-scale complex equipment
CN104965869A (en) Mobile application sorting and clustering method based on heterogeneous information network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant