CN109344405A - A kind of similarity processing method based on TF-IDF thought and neural network - Google Patents

A kind of similarity processing method based on TF-IDF thought and neural network

Info

Publication number
CN109344405A
Authority
CN
China
Prior art keywords
neural network
sample
idf
processing method
method based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811114655.6A
Other languages
Chinese (zh)
Other versions
CN109344405B (en
Inventor
马佳
支含绪
邓森洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mdt Infotech Ltd Jiaxing
Original Assignee
Mdt Infotech Ltd Jiaxing
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mdt Infotech Ltd Jiaxing filed Critical Mdt Infotech Ltd Jiaxing
Priority to CN201811114655.6A priority Critical patent/CN109344405B/en
Publication of CN109344405A publication Critical patent/CN109344405A/en
Application granted granted Critical
Publication of CN109344405B publication Critical patent/CN109344405B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a similarity processing method based on the TF-IDF concept and a neural network, comprising the steps of: A. creating an element dictionary; B. numericalizing all samples in the sample set according to the TF-IDF concept; C. matrixizing all samples in the sample set; D. building a neural network; E. calculating the similarity between a given sample and all samples. With this similarity processing method based on the TF-IDF concept and a neural network, when comparing one thing with N things, the similarities between that thing and all N things can be obtained with a single computation in a very short time, which greatly improves the efficiency of computing the similarity between one thing and N things.

Description

A kind of similarity processing method based on TF-IDF thought and neural network
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a similarity processing method based on the TF-IDF concept and a neural network.
Background technique
At present, when the similarity between things is calculated mathematically, the related objects generally need to be numericalized first.
TF-IDF stands for Term Frequency-Inverse Document Frequency. Its theoretical basis lies in information theory, and it is mainly used as a way to numericalize the content (terms) of an article (document). Based on the TF-IDF concept, similar processing can be applied to many kinds of things, chiefly things that are composed of sub-elements.
Following the TF-IDF concept, closely related ideas can be derived, such as PF-IPF (Part Frequency-Inverse Product Frequency) and FF-IPF (Feature Frequency-Inverse Part Frequency).
After the related objects have been numericalized, the similarity between different things can be computed from their feature vectors with related algorithms (such as Euclidean distance, cosine similarity, Pearson correlation, or the Spearman rank correlation coefficient).
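For illustration, a minimal Python sketch of such a pairwise computation is given below, using cosine similarity between two numericalized feature vectors; the vectors and their values are invented for the example and are not from the patent.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two feature vectors (1 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical PF-IPF vectors of two things; each entry corresponds to one element.
thing_a = np.array([0.0, 0.001, 0.00065, 1.889])
thing_b = np.array([0.0, 0.002, 0.0, 1.750])
print(cosine_similarity(thing_a, thing_b))
```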
However, in this way similarity can only be computed between pairs of things. If the similarity between one thing and N things is needed, its similarity to each of the N things must be computed separately, so the amount of computation becomes very large, wasting resources and causing long waiting times.
Summary of the invention
The purpose of the present invention is to provide a similarity processing method based on the TF-IDF concept and a neural network, to solve the problems raised in the background art above.
To achieve the above object, the invention provides the following technical scheme:
A similarity processing method based on the TF-IDF concept and a neural network, comprising the following steps:
A. Create an element dictionary;
B. Numericalize all samples in the sample set according to the TF-IDF concept;
C. Matrixize all samples in the sample set;
D. Build a neural network;
E. Train the neural network;
F. Calculate the similarity between a given sample and all samples.
As a further technical solution of the present invention, step A is specifically: obtain all sample sets of the objects whose similarity is currently to be calculated, aggregate the sub-elements contained in all samples of the sample set so as to remove duplicate elements, and file the aggregated elements into the element dictionary library.
As a further technical solution of the present invention, step B is specifically: numericalize all samples in the sample set according to the TF-IDF concept.
As a further technical solution of the present invention, step C is specifically: convert the samples from step B into an N × M input matrix and an M × M sparse output matrix.
As a further technical solution of the present invention, step D is specifically: the number of elements in the element dictionary from step A determines the number N of input-layer neurons of the neural network; the number of samples in the sample set determines the number M of output-layer neurons; the number of hidden layers and the number of neurons in each hidden layer are determined according to the fitting degree during sample training, gradually approaching the optimal number of hidden layers and neurons per hidden layer.
As a further technical solution of the present invention, step E is specifically: train the neural network using the network from step D and the matrixized sample set from step C.
As a further technical solution of the present invention, step F is specifically: for the sample to be evaluated, vectorize it based on the TF-IDF concept and run inference with the neural network trained in step E; with a single computation, the similarities between the current sample and all known samples are obtained quickly.
Compared with the prior art, the beneficial effect of the present invention is: when comparing one thing with N things, the similarity processing method based on the TF-IDF concept and a neural network needs only a single computation, taking an extremely short time, to obtain the similarities between that thing and all N things, which greatly improves the efficiency of such similarity calculations.
Brief description of the drawings
Fig. 1 is a schematic diagram of matrixizing the product sample set;
Fig. 2 is a schematic diagram of the training (product) sample set in the neural network;
Fig. 3 is a schematic diagram of quickly calculating the similarity between a given sample (product) and all samples (products) using the neural network.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative work fall within the scope protected by the present invention.
Referring to Figs. 1-3, a similarity processing method based on the TF-IDF concept and a neural network comprises the following steps:
A, element dictionary is created.All sample sets of similitude object need to currently be calculated by obtaining, will be in sample set in all the elements Subset of elements (form the subset element of the object, such as the product, part be its subset) carry out polymerization processing, to go Except duplicate element.Using the element after polymerization, it is classified to element dictionary library.(by taking part dictionary as an example, such as table 1);
B. Numericalize all samples in the sample set according to the TF-IDF concept. Following the TF-IDF concept, calculate for each sample the TF-IDF value of every element in the element dictionary library (taking the PF-IPF of a product as an example, the calculation method refers to the invention patent "Product structure numeralization processing method based on TF-IDF thought", computing the PF-IPF value of each part in the product; as shown in Table 2);
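A minimal sketch of this step under a generic TF-IDF-style weighting (element frequency within a sample times a smoothed inverse sample frequency). The exact PF-IPF formula is defined in the patent cited above and may differ from this assumed form.

```python
import math
from collections import Counter

def numericalize(sample_set, element_dict):
    """Return one weight vector per sample, indexed by the element dictionary."""
    num_samples = len(sample_set)
    sample_frequency = Counter()            # in how many samples each element occurs
    for sample in sample_set:
        for element in set(sample):
            sample_frequency[element] += 1

    vectors = []
    for sample in sample_set:
        counts = Counter(sample)
        total = sum(counts.values())
        vector = [0.0] * len(element_dict)
        for element, count in counts.items():
            tf = count / total                                  # frequency within the sample
            idf = math.log(num_samples / (1 + sample_frequency[element])) + 1
            vector[element_dict[element] - 1] = tf * idf
        vectors.append(vector)
    return vectors
```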
C. Matrixize all samples in the sample set. Convert the samples from step B into an N × M input matrix (as shown in Fig. 1) and an M × M sparse output matrix;
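A minimal sketch of this step, assuming the N × M input matrix stacks one column per sample and the M × M sparse output matrix is the identity, so that each sample's training target is its own one-hot label; this layout is an assumption consistent with the shapes stated above, not a detail given in the patent.

```python
import numpy as np

def matrixize(vectors):
    """Turn the per-sample weight vectors of step B into the training matrices."""
    X = np.array(vectors, dtype=np.float32).T   # N x M: one column per sample
    M = X.shape[1]
    Y = np.eye(M, dtype=np.float32)             # M x M sparse output: one-hot targets
    return X, Y
```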
D. Build a neural network. The number of elements in the element dictionary from step A determines the number N of input-layer neurons of the neural network; the number of samples in the sample set determines the number M of output-layer neurons; the number of hidden layers and the number of neurons in each hidden layer are determined according to the fitting degree during sample training, gradually approaching the optimal number of hidden layers and neurons per hidden layer;
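A minimal sketch of this step in PyTorch (an assumed framework choice; the patent does not prescribe one). The hidden-layer sizes are placeholders to be refined through the fitting procedure described above.

```python
import torch.nn as nn

def build_network(N: int, M: int, hidden_sizes=(128, 64)):
    """N input neurons (dictionary size), M output neurons (number of samples);
    hidden layers are chosen by trial according to the training fit."""
    layers, width = [], N
    for size in hidden_sizes:
        layers += [nn.Linear(width, size), nn.ReLU()]
        width = size
    layers.append(nn.Linear(width, M))      # one output neuron per known sample
    return nn.Sequential(*layers)
```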
E. Train the neural network. Train the neural network using the network from step D and the matrixized sample set from step C (as shown in Fig. 2);
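A minimal training sketch for this step, assuming a cross-entropy objective against the one-hot targets from step C (a conventional choice, not mandated by the patent).

```python
import torch

def train(model, X, Y, epochs=200, lr=1e-3):
    """X is the N x M input matrix, Y the M x M one-hot output matrix from step C."""
    inputs = torch.tensor(X.T)               # M rows, each an N-dimensional sample
    targets = torch.tensor(Y).argmax(dim=1)  # class index of each sample
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
    return model
```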
F. Calculate the similarity between a given sample and all samples. For the sample to be evaluated, vectorize it based on the TF-IDF concept and run inference with the neural network trained in step E; with a single computation, the similarities between the current sample and all known samples are obtained quickly (as shown in Fig. 3).
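A minimal inference sketch for this step: the query sample is vectorized exactly as in step B, a single forward pass is run, and one score per known sample is read off; the softmax normalization of the M outputs is an assumption.

```python
import torch

def similarity_to_all(model, query_vector):
    """Return a length-M vector: entry j is the similarity of the query to known sample j."""
    model.eval()
    with torch.no_grad():
        x = torch.tensor(query_vector, dtype=torch.float32).unsqueeze(0)  # 1 x N
        scores = torch.softmax(model(x), dim=1).squeeze(0)                # length M
    return scores
```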
Table 1. Part dictionary table:

Part number    Part
1              Outer-hexagonal bolt M10 × 20
2              Nut M10
3              Full thread stud M10 × 25
1000           Four-cylinder engine rack
N              Turbine
Table 2. PF-IPF value data table of a certain product:

Part number    PF-IPF value
1              0
2              0.001
3              0.00065
1000           1.889
N              0
It is obvious to a person skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit or essential attributes. Therefore, from any point of view, the embodiments should be regarded as exemplary and non-restrictive; the scope of the present invention is defined by the appended claims rather than by the above description, and it is intended that all changes falling within the meaning and range of equivalency of the claims be embraced within the present invention.
In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of description is adopted merely for clarity; those skilled in the art should take the specification as a whole, and the technical solutions of the various embodiments may be suitably combined to form other embodiments understandable to those skilled in the art.

Claims (7)

1. A similarity processing method based on the TF-IDF concept and a neural network, characterized in that it comprises the following steps:
A. Create an element dictionary;
B. Numericalize all samples in the sample set according to the TF-IDF concept;
C. Matrixize all samples in the sample set;
D. Build a neural network;
E. Train the neural network;
F. Calculate the similarity between a given sample and all samples.
2. The similarity processing method based on the TF-IDF concept and a neural network according to claim 1, characterized in that step A is specifically: obtain all sample sets of the objects whose similarity is currently to be calculated, aggregate the sub-elements contained in all samples of the sample set so as to remove duplicate elements, and file the aggregated elements into the element dictionary library.
3. The similarity processing method based on the TF-IDF concept and a neural network according to claim 1, characterized in that step B is specifically: numericalize all samples in the sample set according to the TF-IDF concept.
4. The similarity processing method based on the TF-IDF concept and a neural network according to claim 1, characterized in that step C is specifically: convert the samples from step B into an N × M input matrix and an M × M sparse output matrix.
5. The similarity processing method based on the TF-IDF concept and a neural network according to claim 1, characterized in that step D is specifically: the number of elements in the element dictionary from step A determines the number N of input-layer neurons of the neural network; the number of samples in the sample set determines the number M of output-layer neurons; the number of hidden layers and the number of neurons in each hidden layer are determined according to the fitting degree during sample training, gradually approaching the optimal number of hidden layers and neurons per hidden layer.
6. The similarity processing method based on the TF-IDF concept and a neural network according to claim 1, characterized in that step E is specifically: train the neural network using the network from step D and the matrixized sample set from step C.
7. The similarity processing method based on the TF-IDF concept and a neural network according to any one of claims 1-6, characterized in that step F is specifically: for the sample to be evaluated, vectorize it based on the TF-IDF concept and run inference with the neural network trained in step E; with a single computation, the similarities between the current sample and all known samples are obtained quickly.
CN201811114655.6A 2018-09-25 2018-09-25 Similarity processing method based on TF-IDF thought and neural network Active CN109344405B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811114655.6A CN109344405B (en) 2018-09-25 2018-09-25 Similarity processing method based on TF-IDF thought and neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811114655.6A CN109344405B (en) 2018-09-25 2018-09-25 Similarity processing method based on TF-IDF thought and neural network

Publications (2)

Publication Number Publication Date
CN109344405A true CN109344405A (en) 2019-02-15
CN109344405B CN109344405B (en) 2023-04-14

Family

ID=65306681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811114655.6A Active CN109344405B (en) 2018-09-25 2018-09-25 Similarity processing method based on TF-IDF thought and neural network

Country Status (1)

Country Link
CN (1) CN109344405B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688722A (en) * 2019-10-17 2020-01-14 深制科技(苏州)有限公司 Automatic generation method of part attribute matrix based on deep learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001011487A2 (en) * 1999-08-04 2001-02-15 Board Of Trustees Of The University Of Illinois Apparatus, method and product for multi-attribute drug comparison
CN104392247A (en) * 2014-11-07 2015-03-04 上海交通大学 Similarity network fast fusion method used for data clustering
CN105373547A (en) * 2014-08-25 2016-03-02 北大方正集团有限公司 Knowledge point importance calculation method and apparatus
CN105808689A (en) * 2016-03-03 2016-07-27 中国地质大学(武汉) Drainage system entity semantic similarity measurement method based on artificial neural network
US20170357878A1 (en) * 2014-08-05 2017-12-14 Sri International Multi-dimensional realization of visual content of an image collection

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001011487A2 (en) * 1999-08-04 2001-02-15 Board Of Trustees Of The University Of Illinois Apparatus, method and product for multi-attribute drug comparison
US20170357878A1 (en) * 2014-08-05 2017-12-14 Sri International Multi-dimensional realization of visual content of an image collection
CN105373547A (en) * 2014-08-25 2016-03-02 北大方正集团有限公司 Knowledge point importance calculation method and apparatus
CN104392247A (en) * 2014-11-07 2015-03-04 上海交通大学 Similarity network fast fusion method used for data clustering
CN105808689A (en) * 2016-03-03 2016-07-27 中国地质大学(武汉) Drainage system entity semantic similarity measurement method based on artificial neural network

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688722A (en) * 2019-10-17 2020-01-14 深制科技(苏州)有限公司 Automatic generation method of part attribute matrix based on deep learning
CN110688722B (en) * 2019-10-17 2023-08-08 深制科技(苏州)有限公司 Automatic generation method of part attribute matrix based on deep learning

Also Published As

Publication number Publication date
CN109344405B (en) 2023-04-14

Similar Documents

Publication Publication Date Title
CN104298715B (en) A kind of more indexed results ordering by merging methods based on TF IDF
CN104573046A (en) Comment analyzing method and system based on term vector
CN105512808A (en) Power system transient stability assessment method based on big data
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN112100999A (en) Resume text similarity matching method and system
CN106980650A (en) A kind of emotion enhancing word insertion learning method towards Twitter opinion classifications
CN104008187A (en) Semi-structured text matching method based on the minimum edit distance
CN113157919B (en) Sentence text aspect-level emotion classification method and sentence text aspect-level emotion classification system
CN112925904A (en) Lightweight text classification method based on Tucker decomposition
Li et al. Shape optimisation of blended-wing-body underwater gliders based on free-form deformation
CN117132132A (en) Photovoltaic power generation power prediction method based on meteorological data
Ye et al. Big data processing framework for manufacturing
Nabil et al. Cufe at semeval-2016 task 4: A gated recurrent model for sentiment classification
CN109344405A (en) A kind of similarity processing method based on TF-IDF thought and neural network
CN110611334A (en) Copula-garch model-based multi-wind-farm output correlation method
CN104778205B (en) A kind of mobile application sequence and clustering method based on Heterogeneous Information network
CN112231476B (en) Improved graphic neural network scientific literature big data classification method
CN110825852B (en) Long text-oriented semantic matching method and system
Shen et al. An improved parallel Bayesian text classification algorithm
Luo A new text classifier based on random forests
CN107577690B (en) Recommendation method and recommendation device for mass information data
CN113722951B (en) Scatterer three-dimensional finite element grid optimization method based on neural network
CN104462458A (en) Data mining method of big data system
CN109145518B (en) Method for constructing reliability decision graph model of large-scale complex equipment
CN104965869A (en) Mobile application sorting and clustering method based on heterogeneous information network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant