CN110096634B - House property data vector alignment method based on particle swarm optimization - Google Patents

House property data vector alignment method based on particle swarm optimization

Info

Publication number
CN110096634B
Authority
CN
China
Prior art keywords
similarity
house source
particle
sim
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910354563.3A
Other languages
Chinese (zh)
Other versions
CN110096634A (en)
Inventor
蔡彪
谭富文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Technology
Original Assignee
Chengdu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Technology
Priority to CN201910354563.3A
Publication of CN110096634A
Application granted
Publication of CN110096634B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/16 Real estate

Abstract

The invention discloses a real estate data vector alignment method based on particle swarm optimization. Second-hand house data are crawled and preprocessed, the similarity of each second-hand house attribute is calculated separately, and a model fusing the multi-attribute similarities of an entity pair is constructed. The attribute weights of this model are then optimized, which yields the weight of each attribute similarity and a total similarity threshold, so that house sources can be matched by similarity and an alignment result with better performance is obtained.

Description

House property data vector alignment method based on particle swarm optimization
Technical Field
The invention relates to a data vector alignment method, in particular to a real estate data vector alignment method based on particle swarm optimization.
Background
The Property Law clearly stipulates that the state implements a unified registration system for real estate, and integrating real estate data is an important part of real estate registration. Manual integration based on Excel is slow, large in scale, difficult, error-prone and delayed in updating, so it cannot meet actual needs. Automatically constructing a new real estate database therefore has considerable research value and application prospects for the unified registration of real estate.
The existing real estate data fusion technology is based on a cloud architecture and integrates data through dynamic migration of cloud services within a large logical set. Information from multiple departments is integrated using GIS technology and current communication technology.
Disclosure of Invention
In view of the above defects in the prior art, the invention provides a real estate data vector alignment method based on particle swarm optimization, which solves the problem that real estate data from different real estate transaction service platforms are difficult to align.
To achieve this purpose, the invention adopts the following technical scheme: a real estate data vector alignment method based on particle swarm optimization, comprising the following steps:
S1, crawling second-hand house source data of a city from different second-hand house source webpages;
S2, preprocessing the second-hand house source data;
S3, calculating the similarity of each attribute of the preprocessed second-hand house source data, and constructing a fused multi-attribute structure model;
S4, taking the array formed by the weights of all attributes in the fused multi-attribute structure model as the position of one particle in the particle swarm; initializing the number of particles, the number of iterations, the cognitive factor and the social factor; initializing the position of each particle, namely the weights of the attribute similarities; initializing the velocity of each particle; calculating the initial value of each individual extremum; and setting the initial value of the global extremum equal to the initial value of the individual extremum;
S5, calculating the total similarity of all entity pairs according to the fused multi-attribute structure model, calculating the threshold of the total similarity, applying the threshold to the training set and comparing against the true classification results to obtain the F1 value of the training set;
S6, taking the F1 value of the training set as the fitness of each particle; when the fitness of a particle is greater than its individual extremum, updating the individual extremum to that fitness; calculating the maximum fitness of the current swarm; and when the maximum fitness is greater than the global extremum, updating the global extremum to the maximum fitness;
S7, dividing the particle swarm into subgroups of 3 levels according to particle fitness, and calculating the adaptive inertia weight for the objective function values of each level;
S8, updating the velocity of each particle according to the individual extremum, the global extremum, the inertia weight, the cognitive factor and the social factor, updating the position of each particle according to its velocity, and adding 1 to the iteration count;
S9, when the iteration count is smaller than the maximum number of iterations, returning to step S5; otherwise, outputting the position of the particle swarm, namely the attribute weights in the multi-attribute structure model;
S10, calculating the threshold of the total similarity on the test set, and predicting the entity pairs in the test set using the multi-attribute structure model and this threshold to realize the matching of second-hand house sources.
Further, the preprocessing in step S2 comprises completing incomplete house source data and normalizing the house source data.
Further, the attributes of the second-hand house source data in step S3 comprise a cell (residential community) name, a title, a house type diagram, a price, an area, an orientation and a floor;
the calculation formula of the cell name similarity sim_name(A, B) is as follows:
$$sim\_name(A,B)=\begin{cases}1, & nameA = nameB\\ 0, & nameA \neq nameB\end{cases}$$
in the above formula, nameA and nameB are respectively the cell names of house source A and house source B in the two house source webpages;
the title similarity sim_title(A, B) is calculated as follows:
inserting spaces between the words of the titles S1 and S2 of an entity pair from the two house source webpages, calculating the TF value and the IDF value of each word, calculating the TFIDF values from the TF and IDF values to obtain a word frequency-inverse text frequency matrix, and calculating the cosine similarity sim_title(A, B) of the word frequency-inverse text frequency matrices of the two house source titles;
the calculation formula of the TFIDF value is as follows:
$$TFIDF_{i,j} = TF_{i,j} \times IDF_{i,j}$$
in the above formula, TFIDF_{i,j} is the word frequency-inverse text frequency matrix, TF_{i,j} is the word frequency matrix, and IDF_{i,j} is the inverse text frequency matrix;
the similarity sim_img(A, B) of the house type diagram is calculated as follows:
scaling and graying the two pictures img1 and img2 of an entity pair from the two house source webpages;
establishing a SURF algorithm model, extracting the features des1 and des2 of the two pictures img1 and img2 through the SURF algorithm model, and matching feature points with the KNN algorithm according to the features des1 and des2;
counting the matched feature points whose distance ratio is larger than 0.9, and taking their proportion among all matched feature points as the picture similarity sim_img(A, B);
the calculation formula of the price similarity sim_price(A, B) is as follows:
Figure BDA0002044981940000032
in the above formula, Price(A, B) is the relative value of the price, calculated as:
Figure BDA0002044981940000033
in the above formula, P_A and P_B are respectively the prices of house source A and house source B in the two house source webpages, and Max(P_n) and Min(P_n) are respectively the maximum and minimum price differences over all entity pairs in the two house source webpages that are the same house source;
the calculation formula of the area similarity sim_size(A, B) is as follows:
Figure BDA0002044981940000041
in the above formula, Size(A, B) is the area difference between house source A and house source B in the two house source webpages, calculated as:
$$Size(A,B) = |S_A - S_B|$$
in the above formula, S_A and S_B are respectively the areas of house source A and house source B in the two house source webpages;
when the orientations of the house sources in the two house source webpages are the same, the orientation similarity sim_direction(A, B) is 1, otherwise it is 0;
when the floors of the house sources in the two house source webpages are the same, the floor similarity sim_floor(A, B) is 1, otherwise it is 0.
Further, the calculation formula of the fused multi-attribute structure model Sim(A, B) in step S3 is:
$$Sim(A,B)=\omega_1 \times sim\_name(A,B)+\omega_2 \times sim\_title(A,B)+\omega_3 \times sim\_img(A,B)+\omega_4 \times sim\_price(A,B)+\omega_5 \times sim\_size(A,B)+\omega_6 \times sim\_direction(A,B)+\omega_7 \times sim\_floor(A,B)$$
in the above formula, ω_1 is the weight of the cell name attribute similarity, ω_2 is the weight of the title attribute similarity, ω_3 is the weight of the house type diagram attribute similarity, ω_4 is the weight of the price attribute similarity, ω_5 is the weight of the area attribute similarity, ω_6 is the weight of the orientation attribute similarity, and ω_7 is the weight of the floor attribute similarity.
Further, the calculation formula of the threshold Sim of the total similarity in step S5 is:
$$Sim = A[i] - C[i]/2$$
in the above formula, A[i] is the i-th item of the set of total similarities of same house sources arranged in ascending order, C[i] = A[i] - B[i], B[i] is the i-th item of the set of total similarities of different house sources arranged in descending order, and i is the index of the first item in the set C that is greater than 0.
Further, the calculation formula of the adaptive inertia weight ω in step S7 is:
Figure BDA0002044981940000051
in the above formula, ω_max and ω_min are respectively the initially set maximum and minimum values of the adaptive inertia weight ω, f_i is the current objective function value of the i-th particle, i = 1, 2, …, f_m is the optimal particle fitness in the particle swarm, f_avg is the average fitness of the subgroup formed by particles whose fitness is greater than the population average fitness, and f'_avg is the average fitness of the subgroup formed by particles whose fitness is smaller than the population average fitness.
Further, in step S8 the velocity of the particle swarm is updated as follows:
$$V_{id} = \omega V'_{id} + C_1 \, random(0,1) \, (P_{id} - X'_{id}) + C_2 \, random(0,1) \, (P_{gd} - X'_{id})$$
in the above formula, V_id is the updated velocity of the i-th particle, V'_id is the velocity of the i-th particle at the current iteration, ω is the adaptive inertia weight, C_1 is the cognitive factor, C_2 is the social factor, C_1 = C_2 ∈ [0, 4], random(0, 1) is a random number in the interval [0, 1], P_id is the d-th dimension of the individual extremum of the i-th particle, P_gd is the d-th dimension of the global extremum, and X'_id is the position of the i-th particle at the current iteration.
Further, the position of the particle swarm in step S8 is updated as follows:
$$X_{id} = X'_{id} + V_{id}$$
in the above formula, X_id is the updated position of the i-th particle, X'_id is the position of the i-th particle at the current iteration, and V_id is the updated velocity of the i-th particle.
The beneficial effects of the invention are as follows: second-hand house data are crawled and preprocessed, the similarity of each second-hand house attribute is calculated separately, and a model fusing the multi-attribute similarities of an entity pair is constructed. The attribute weights of this model are then optimized, which yields the weight of each attribute similarity and a total similarity threshold, so that house sources can be matched by similarity and an alignment result with better performance is obtained.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram illustrating weight distribution of different attributes in the present invention;
FIG. 3 shows the maximum fitness value of the present invention and a standard PSO algorithm at different numbers of iterations;
FIG. 4 shows the average fitness value of the present invention and a standard PSO algorithm at different numbers of iterations;
FIG. 5 shows the F1 values of the present invention and a standard PSO algorithm at different numbers of iterations.
Detailed Description
The following describes embodiments of the invention to help those skilled in the art understand it. It should be understood that the invention is not limited to the scope of these embodiments: to those skilled in the art, various changes are apparent as long as they remain within the spirit and scope of the invention as defined in the appended claims, and all inventions that make use of the inventive concept are protected.
As shown in FIG. 1, a real estate data vector alignment method based on particle swarm optimization comprises the following steps:
S1, crawling second-hand house source data of a city from two second-hand house source webpages.
S2, preprocessing the second-hand house source data; the preprocessing comprises completing incomplete house source data and normalizing the house source data.
S3, calculating the similarity of each attribute of the preprocessed second-hand house source data, and constructing a fused multi-attribute structure model;
the attributes of the second-hand house source data comprise a cell (residential community) name, a title, a house type diagram, a price, an area, an orientation and a floor;
although the cell name is textual information, it cannot be computed with TF-IDF-based methods. The names are therefore compared directly as text after normalization: if the cell names are completely the same, the similarity is 1, otherwise it is 0. The calculation formula of the cell name similarity sim_name(A, B) is as follows:
$$sim\_name(A,B)=\begin{cases}1, & nameA = nameB\\ 0, & nameA \neq nameB\end{cases}$$
in the above formula, nameA and nameB are respectively the cell names of house source A and house source B in the two house source webpages; the two house source webpages are Lianjia and Anjuke.
Titles are generally short and contain few stop words, and most of their words are key information, so the title similarity is calculated with a TF-IDF-based method. The title similarity sim_title(A, B) is calculated as follows:
inserting spaces between the words of the titles S1 and S2 of an entity pair from the two house source webpages; calculating for each word the TF value (word frequency, the number of times the word occurs in the title) and the IDF value (inverse text frequency, which corrects the word feature value represented by the word frequency so as to better reflect the importance of the word in the text); calculating the TFIDF values from the TF and IDF values to obtain a word frequency-inverse text frequency matrix; and calculating the cosine similarity sim_title(A, B) of the word frequency-inverse text frequency matrices of the two house source titles;
the calculation formula of the TFIDF value is as follows:
$$TFIDF_{i,j} = TF_{i,j} \times IDF_{i,j}$$
in the above formula, TFIDF_{i,j} is the word frequency-inverse text frequency matrix, TF_{i,j} is the word frequency matrix, and IDF_{i,j} is the inverse text frequency matrix;
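The title similarity described above can be sketched in plain Python. This is only an illustrative sketch under stated assumptions: the titles are assumed to be already split into space-separated words, the corpus used for IDF is just the two titles being compared, and the smoothed IDF form is a common convention that the text does not specify. The function names `tfidf_vectors`, `cosine` and `sim_title` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per space-separated title."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many titles each word occurs.
    df = Counter(word for tokens in tokenised for word in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # TF * IDF; the smoothed IDF is an assumption, not from the text.
            word: (count / total) * (math.log((1 + n_docs) / (1 + df[word])) + 1)
            for word, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sim_title(title_a, title_b):
    """Cosine similarity of the TF-IDF vectors of two titles."""
    vec_a, vec_b = tfidf_vectors([title_a, title_b])
    return cosine(vec_a, vec_b)
```

Identical titles give a similarity of about 1, titles with no words in common give 0, and partially overlapping titles fall in between.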
The house type diagram is generally the floor plan originally issued by the developer at the time of sale and uploaded by house agency staff. The pictures are fairly regular, but because each person's uploaded picture comes from a different source, the pictures differ in resolution and may be rotated. In view of these characteristics, the invention selects the SURF algorithm, an improved version of the SIFT algorithm. It captures local features of the picture and is not affected by scaling, rotation or brightness changes. The similarity sim_img(A, B) of the house type diagram is calculated as follows:
scaling and graying the two pictures img1 and img2 of an entity pair from the two house source webpages;
establishing a SURF algorithm model, extracting the features des1 and des2 of the two pictures img1 and img2 through the SURF algorithm model, and matching feature points with the KNN algorithm according to the features des1 and des2;
counting the matched feature points whose distance ratio is larger than 0.9, and taking their proportion among all matched feature points as the picture similarity sim_img(A, B);
the price registered for the same house source at different house agencies may differ; the calculation formula of the price similarity sim_price(A, B) is as follows:
Figure BDA0002044981940000081
in the above formula, Price(A, B) is the relative value of the price, calculated as:
Figure BDA0002044981940000082
in the above formula, P_A and P_B are respectively the prices of house source A and house source B in the two house source webpages, and Max(P_n) and Min(P_n) are respectively the maximum and minimum price differences over all entity pairs in the two house source webpages that are the same house source;
although the area registered for the same house source at different agencies may differ, the difference is small, generally less than 1; the calculation formula of the area similarity sim_size(A, B) is as follows:
Figure BDA0002044981940000083
in the above formula, Size(A, B) is the area difference between house source A and house source B in the two house source webpages, calculated as:
$$Size(A,B) = |S_A - S_B|$$
in the above formula, S_A and S_B are respectively the areas of house source A and house source B in the two house source webpages;
when the orientations of the house sources in the two house source webpages are the same, the orientation similarity sim_direction(A, B) is 1, otherwise it is 0;
when the floors of the house sources in the two house source webpages are the same, the floor similarity sim_floor(A, B) is 1, otherwise it is 0.
The calculation formula of the fused multi-attribute structure model Sim(A, B) is as follows:
$$Sim(A,B)=\omega_1 \times sim\_name(A,B)+\omega_2 \times sim\_title(A,B)+\omega_3 \times sim\_img(A,B)+\omega_4 \times sim\_price(A,B)+\omega_5 \times sim\_size(A,B)+\omega_6 \times sim\_direction(A,B)+\omega_7 \times sim\_floor(A,B)$$
in the above formula, ω_1 is the weight of the cell name attribute similarity, ω_2 is the weight of the title attribute similarity, ω_3 is the weight of the house type diagram attribute similarity, ω_4 is the weight of the price attribute similarity, ω_5 is the weight of the area attribute similarity, ω_6 is the weight of the orientation attribute similarity, and ω_7 is the weight of the floor attribute similarity.
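The weighted sum above can be illustrated with a short Python sketch. The attribute keys and the function name `fused_similarity` are hypothetical, chosen only to mirror the seven sim_* terms; the per-attribute similarities are assumed to have been computed already.

```python
# Order of the seven attributes, matching the weights w1..w7 in the formula.
ATTRIBUTES = ("name", "title", "img", "price", "size", "direction", "floor")


def fused_similarity(weights, sims):
    """Sim(A, B): sum over the seven attributes of weight * attribute similarity."""
    if len(weights) != len(ATTRIBUTES):
        raise ValueError("expected one weight per attribute")
    return sum(w * sims[attr] for w, attr in zip(weights, ATTRIBUTES))


# Example entity pair: same cell name, price, area and floor; partly similar title.
example_sims = {"name": 1.0, "title": 0.5, "img": 0.0, "price": 1.0,
                "size": 1.0, "direction": 0.0, "floor": 1.0}
example_total = fused_similarity([1, 2, 1, 1, 1, 1, 1], example_sims)
```

In the method, the weight list is exactly the position of one particle, so each particle defines a different total similarity for the same entity pair.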
S4, taking the array [ω_1, ω_2, …, ω_7] formed by the attribute weights in the fused multi-attribute structure model as the position of one particle in the particle swarm; initializing the number of particles, the number of iterations, the cognitive factor and the social factor; initializing the position of each particle, namely the weights of the attribute similarities; initializing the velocity of each particle; calculating the initial value of each individual extremum; and setting the initial value of the global extremum equal to the initial value of the individual extremum. Random numbers in the range [0, 1] are taken as the initial position of each particle, and the initial velocity of each particle is likewise set to random numbers in the range [0, 1]. The initial particle population is thus generated.
S5, calculating the total similarity of all entity pairs according to the fused multi-attribute structure model, calculating the threshold of the total similarity, applying the threshold to the training set and comparing against the true classification results to obtain the F1 value of the training set. The initialized or updated weights are substituted into each entity pair instance to calculate the total similarity of each entity pair. The total similarities are then divided into two lists A and B according to whether the entity pair is actually the same house source (list A is the set of total similarities of pairs that are actually the same house source, and list B is the set of total similarities of pairs that are actually different house sources). Lists A and B are sorted in ascending and descending order respectively and subtracted item by item to obtain list C. The calculation formula of the threshold Sim of the total similarity is:
$$Sim = A[i] - C[i]/2$$
in the above formula, A[i] is the i-th item of the set of total similarities of same house sources arranged in ascending order, C[i] = A[i] - B[i], B[i] is the i-th item of the set of total similarities of different house sources arranged in descending order, and i is the index of the first item in the set C that is greater than 0.
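The threshold rule can be sketched as follows. Since C[i] = A[i] - B[i], the value A[i] - C[i]/2 is simply the midpoint of A[i] and B[i] at the first index where the ascending same-source list overtakes the descending different-source list. The function name `similarity_threshold` is hypothetical, and two behaviors are assumptions not covered by the text: lists of unequal length are compared only over their overlapping items, and `None` is returned when no item of C is greater than 0.

```python
def similarity_threshold(same_scores, diff_scores):
    """Sim = A[i] - C[i] / 2 at the first i where C[i] = A[i] - B[i] > 0."""
    a = sorted(same_scores)                  # list A, ascending
    b = sorted(diff_scores, reverse=True)    # list B, descending
    for a_i, b_i in zip(a, b):               # only the overlapping items
        c_i = a_i - b_i
        if c_i > 0:
            return a_i - c_i / 2             # midpoint of A[i] and B[i]
    return None  # assumption: no crossing found, the caller must handle this
```

For example, with same-source totals [0.9, 0.6, 0.8] and different-source totals [0.7, 0.2, 0.4], the sorted lists are A = [0.6, 0.8, 0.9] and B = [0.7, 0.4, 0.2]; the first positive difference is at the second item, giving a threshold of about 0.6.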
S6, taking the F1 value of the training set as the fitness of each particle; when the fitness of a particle is greater than its individual extremum, updating the individual extremum to that fitness; calculating the maximum fitness of the current swarm; and when the maximum fitness is greater than the global extremum, updating the global extremum to the maximum fitness;
S7, dividing the particle swarm into subgroups of 3 levels according to particle fitness, and calculating the adaptive inertia weight for the objective function values of each level. In the standard particle swarm optimization algorithm, the inertia weight is one of the important parameters: changing it changes how strongly the current particle velocity influences the updated velocity, and thereby controls the optimization capability and convergence speed of the whole algorithm. Particles with stronger exploration capability have higher velocity; particles with stronger exploitation capability have lower velocity. On this basis, optimizing the inertia weight of the particle swarm is of significant importance. The calculation formula of the adaptive inertia weight ω is as follows:
Figure BDA0002044981940000101
in the above formula, ω_max and ω_min are respectively the initially set maximum and minimum values of the adaptive inertia weight ω, f_i is the current objective function value of the i-th particle, i = 1, 2, …, f_m is the optimal particle fitness in the particle swarm, f_avg is the average fitness of the subgroup formed by particles whose fitness is greater than the population average fitness, and f'_avg is the average fitness of the subgroup formed by particles whose fitness is smaller than the population average fitness.
S8, updating the velocity of each particle according to the individual extremum, the global extremum, the inertia weight, the cognitive factor and the social factor, updating the position of each particle according to its velocity, and adding 1 to the iteration count;
the velocity of the particle swarm is updated as:
$$V_{id} = \omega V'_{id} + C_1 \, random(0,1) \, (P_{id} - X'_{id}) + C_2 \, random(0,1) \, (P_{gd} - X'_{id})$$
in the above formula, V_id is the updated velocity of the i-th particle, V'_id is the velocity of the i-th particle at the current iteration, ω is the adaptive inertia weight, C_1 is the cognitive factor, C_2 is the social factor, C_1 = C_2 ∈ [0, 4], random(0, 1) is a random number in the interval [0, 1], P_id is the d-th dimension of the individual extremum of the i-th particle, P_gd is the d-th dimension of the global extremum, and X'_id is the position of the i-th particle at the current iteration.
The position of the particle swarm is updated as follows:
$$X_{id} = X'_{id} + V_{id}$$
in the above formula, X_id is the updated position of the i-th particle, X'_id is the position of the i-th particle at the current iteration, and V_id is the updated velocity of the i-th particle.
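The two update formulas can be sketched for a single particle as below. This is a minimal sketch: the inertia weight `omega` is passed in as an argument because the adaptive formula of step S7 is not reproduced here, the default C1 = C2 = 2 follows the experimental settings in the text, and the function name `update_particle` and the `rng` parameter are hypothetical. No velocity or position clamping is applied, since the text does not mention any.

```python
import random


def update_particle(position, velocity, personal_best, global_best,
                    omega, c1=2.0, c2=2.0, rng=random):
    """Apply the velocity and position updates to one particle, per dimension."""
    new_position = []
    new_velocity = []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        v_new = (omega * v
                 + c1 * rng.random() * (p - x)    # cognitive term
                 + c2 * rng.random() * (g - x))   # social term
        new_velocity.append(v_new)
        new_position.append(x + v_new)            # X = X' + V
    return new_position, new_velocity
```

When a particle already sits at both its individual extremum and the global extremum, the cognitive and social terms vanish and the particle simply drifts with the damped velocity `omega * v`.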
S9, when the iteration count is smaller than the maximum number of iterations, returning to step S5; otherwise, outputting the position of the particle swarm, namely the weight of each attribute in the multi-attribute structure model.
S10, calculating the threshold of the total similarity on the test set, and predicting the entity pairs in the test set using the multi-attribute structure model and this threshold to realize the matching of second-hand house sources.
The experimental data of the invention are the cell names, titles, house type diagrams, prices, areas, orientations, floors and other information of second-hand house data crawled from the Lianjia website and the Anjuke website respectively. Different second-hand house service platforms may express the value of the same attribute differently; for example, two ways of writing "60 square meters" can be semantically identical yet differ in form. To make the attribute similarities more reliable, the expression form of the attribute values needs to be normalized. Table 1 gives examples of normalized attribute values.
Table 1 Attribute value normalization examples
Attribute | Attribute value in Lianjia | Attribute value in Anjuke | Normalized attribute value
Cell name | Red maple ridge three stages | Three stages of Zhongchang red maple Ling | Red maple ridge three stages
Area | 88.5 square meter | 88.5 square meters | 88.5 square meters
Floor | Middle layer (23 layers in total) | Middle floor/total 23 floors | Middle layer (23 layers in total)
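As an illustration of the kind of normalization shown in Table 1, the sketch below maps several spellings of an area value to one canonical form. The set of unit spellings in the pattern and the function name `normalize_area` are hypothetical; the actual formats used by the two platforms and the normalization rules of the invention are not given in the text.

```python
import re

# Hypothetical unit spellings; the real platform formats are not specified.
_AREA_PATTERN = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(?:㎡|m2|m²|平米|平方米|square meters?)\s*$",
    re.IGNORECASE,
)


def normalize_area(raw):
    """Map an area string to a canonical 'N square meters' form, or None."""
    match = _AREA_PATTERN.match(raw)
    if match is None:
        return None
    # The 'g' format drops a trailing '.0' so that '60' and '60.0' agree.
    return f"{float(match.group(1)):g} square meters"
```

After such a step, area values from both platforms can be compared numerically or as identical strings, which is what the attribute similarities rely on.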
The experiment of the invention mainly judges the similarity of second-hand houses across real estate service platforms, which can be regarded as a binary classification model. In line with the evaluation indexes of classification algorithms, the precision (P), recall (R) and F1 value commonly used for classification problems are selected as the criteria for evaluating the performance of the algorithm. Table 2 lists the parameters of the evaluation indexes.
Table 2 Evaluation index parameters
Figure BDA0002044981940000121
Precision is the proportion of the results predicted as positive samples that are correctly classified. The calculation formula is as follows:
Figure BDA0002044981940000122
Recall is the proportion of the actual positive samples that are correctly classified; it reflects the completeness of the classification results of the model. The calculation formula is as follows:
Figure BDA0002044981940000123
To evaluate the model proposed by the invention comprehensively, the F1 value, which is the harmonic mean of precision and recall, is used to evaluate the classification effect on the whole data set. The calculation formula is as follows:
Figure BDA0002044981940000124
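The three evaluation indexes can be computed as in the sketch below, which uses the standard definitions P = TP / (TP + FP), R = TP / (TP + FN) and F1 = 2PR / (P + R). The label convention (1 for "same house source", 0 otherwise), the function name `precision_recall_f1` and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = same house source)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

In the method, the predicted label of an entity pair is obtained by comparing its total similarity with the threshold, and the resulting F1 value on the training set serves as the particle fitness.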
Through analysis and tuning of the main parameters of the particle swarm algorithm, it was found that the classification effect is best and the particles converge well when the number of particles N is 50, the number of iterations Step is 100, the learning factors are C1 = C2 = 2, and the minimum and maximum values of the inertia weight are 0.4 and 0.9 respectively. This yields the optimal classification effect on the test set shown in Table 3 and the particle distribution diagram shown in FIG. 2.
TABLE 3 FAPSO optimal Classification Effect
P R F1
FAPSO 0.882 0.845 0.863
In the optimal classification effect, the similarity weights of the attributes are [1,0.236,0.72,1, 0.975,1] and the total similarity threshold is 5.249. The weight distribution shows that the similarities of the picture, area, orientation, cell name and floor have a large influence on the total similarity, while the title has a small influence. This is because the title contains a lot of information but its format is not uniform, whereas the other attributes are inherent to the house and therefore influence the total similarity strongly.
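To illustrate how attribute weights and a total similarity threshold combine into a match decision, here is a small Python sketch of the weighted sum defined in the claims. The attribute order, the example weights, the example threshold and the use of `>=` for the comparison are all assumptions chosen for illustration; they are not the optimized values reported above.

```python
from typing import Sequence

# Assumed attribute order: name, title, img, price, size, direction, floor.
ATTRIBUTES = ("name", "title", "img", "price", "size", "direction", "floor")


def total_similarity(sims: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of per-attribute similarities, Sim(A, B)."""
    if len(sims) != len(weights):
        raise ValueError("sims and weights must have the same length")
    return sum(w * s for w, s in zip(weights, sims))


def is_same_listing(sims: Sequence[float], weights: Sequence[float],
                    threshold: float) -> bool:
    """Classify a pair as the same house source when the total similarity
    reaches the threshold (the >= comparison is an assumption)."""
    return total_similarity(sims, weights) >= threshold


# Hypothetical weights and threshold, for illustration only.
example_weights = [1.0, 0.2, 0.7, 1.0, 1.0, 1.0, 1.0]
example_threshold = 5.0
close_pair = [1.0, 0.5, 0.8, 0.9, 1.0, 1.0, 1.0]
far_pair = [0.0, 0.5, 0.8, 0.9, 1.0, 0.0, 1.0]
```

With these hypothetical numbers, a low title weight means that a poorly matching title barely moves the total, while a mismatch in cell name or orientation removes a full unit from it.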
Compared with the standard particle swarm algorithm PSO, the particle swarm algorithm FAPSO based on adaptive inertia weight adapts the inertia weight to particle fitness, and reduces the particle dimensionality by calculating the total similarity threshold separately inside the fitness function. For both FAPSO and the standard PSO algorithm, the swarm size N and the maximum number of iterations Step are set to 50 and 100 respectively, and the learning factors are C1 = C2 = 2. The inertia weight of the standard PSO algorithm is 0.6, and the minimum and maximum inertia weights of FAPSO are the empirical values 0.4 and 0.9. The performance comparison as a function of the number of iterations is shown in figs. 3, 4 and 5.
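The update step shared by both algorithms can be sketched in Python as follows, based on the velocity and position formulas given in the claims. The function name `pso_step`, the per-dimension random draws and the absence of velocity or position clamping are assumptions of this sketch. The inertia weight is passed in as a plain number, so the same step serves a fixed weight such as 0.6 or an adaptively computed one.

```python
import random
from typing import List, Optional, Sequence, Tuple


def pso_step(position: Sequence[float], velocity: Sequence[float],
             personal_best: Sequence[float], global_best: Sequence[float],
             inertia: float, c1: float = 2.0, c2: float = 2.0,
             rng: Optional[random.Random] = None
             ) -> Tuple[List[float], List[float]]:
    """One velocity and position update for a single particle.

    V = w*V' + c1*rand*(P_i - X') + c2*rand*(P_g - X'),  X = X' + V
    """
    rng = rng or random.Random()
    new_position: List[float] = []
    new_velocity: List[float] = []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        # Fresh random numbers in [0, 1) for each dimension (an assumption).
        v_next = (inertia * v
                  + c1 * rng.random() * (p - x)
                  + c2 * rng.random() * (g - x))
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity


# A particle already sitting on both best positions keeps only its inertia term.
new_pos, new_vel = pso_step([0.5, 0.2], [0.1, -0.2], [0.5, 0.2], [0.5, 0.2],
                            inertia=0.6)
```

A larger inertia weight keeps more of the previous velocity and favors exploration, while a smaller one lets the pull toward the individual and global best positions dominate.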
Fig. 3 shows how the global optimal value of the fitness changes with the number of iterations for the FAPSO and PSO algorithms. The global optimal value of PSO increases significantly as the number of iterations increases, whereas that of the algorithm used in the present invention increases more stably.
As can be seen from fig. 4, as the number of iterations increases, the average fitness of the PSO population in the current iteration increases significantly, whereas that of the algorithm used in the present invention increases more smoothly. The stability of FAPSO is clearly superior to that of PSO.
Fig. 5 shows the test-set F1 obtained when the global optimal particle of the current iteration is used to classify the test set. As the number of iterations increases, FAPSO is significantly better than PSO.
In conclusion, the FAPSO algorithm is significantly better than the standard PSO algorithm in both convergence rate and optimization ability. Moreover, FAPSO obtains high-quality particles even when the number of iterations is small, and the overall performance of its particle population is markedly higher.

Claims (1)

1. A real estate data vector alignment method based on particle swarm optimization, characterized by comprising the following steps:
S1, crawling second-hand house source data of a certain city from different second-hand house source webpages;
S2, preprocessing the second-hand house source data;
S3, calculating the similarity of different attributes of the preprocessed second-hand house source data, and constructing a fusion multi-attribute structure model;
S4, taking an array formed by the weights of all attributes in the fusion multi-attribute structure model as the position of one particle in the particle swarm; initializing the swarm size, the number of iterations, the cognitive factor and the social factor; initializing the position of each particle, namely the weights of the similarities of all attributes; initializing the velocity of each particle; calculating the initial value of each individual extreme value; and making the initial value of the global extreme value equal to the initial value of the individual extreme value;
S5, calculating the total similarity of all entity pairs according to the fusion multi-attribute structure model, calculating a threshold of the total similarity, and applying the threshold to the training set and comparing with the real classification results to obtain the F1 value of the training set;
S6, taking the F1 value of the training set as the fitness of each particle; when the fitness of a particle is greater than its individual extreme value, updating the individual extreme value to that fitness; calculating the maximum fitness of the current swarm; and when the maximum fitness is greater than the global extreme value, updating the global extreme value to the maximum fitness;
S7, dividing the particle swarm into 3 grades according to particle fitness, and calculating the adaptive inertia weight for the objective function values of the different grades;
S8, updating the velocity of the particle swarm according to the individual extreme value, the global extreme value, the inertia weight, the cognitive factor and the social factor, updating the position of the particle swarm according to the velocity of the particle swarm, and adding 1 to the iteration count;
S9, when the iteration count is smaller than the maximum number of iterations, returning to step S5; otherwise, outputting the positions of the particle swarm, namely the attribute weights in the multi-attribute structure model;
S10, calculating a threshold of the total similarity on the test set, and predicting the entity pairs in the test set by using the multi-attribute structure model and the threshold of the test-set total similarity, so as to realize the matching of second-hand house sources;
the preprocessing in step S2 comprises completing incomplete house source data and normalizing the house source data;
the attributes of the second-hand house source data in step S3 comprise a cell name, a title, a floor plan, a price, an area, an orientation and a floor;
the calculation formula of the cell name similarity sim_name(A, B) is as follows:
Figure FDA0003811032720000021
in the above formula, nameA and nameB are respectively the cell names of house source A and house source B in the two house source webpages;
the title similarity sim_title(A, B) is calculated by the following method:
inserting spaces between the words of the titles S1 and S2 of an entity pair in the two house source webpages, calculating the TF value and the IDF value of every word, calculating the TFIDF values from the TF values and the IDF values to obtain the word frequency-inverse text frequency matrices, and calculating the cosine similarity of the two house source title word frequency-inverse text frequency matrices as sim_title(A, B);
the calculation formula of the TFIDF value is as follows:
$TFIDF_{i,j} = TF_{i,j} \times IDF_{i,j}$
in the above formula, $TFIDF_{i,j}$ is the word frequency-inverse text frequency matrix, $TF_{i,j}$ is the word frequency matrix, and $IDF_{i,j}$ is the inverse text frequency matrix;
the floor plan similarity sim_img(A, B) is calculated by the following method:
scaling and graying the two images img1 and img2 of the entity pair in the two house source webpages;
establishing a SURF algorithm model, extracting the features des1 and des2 of the two images img1 and img2 respectively through the SURF algorithm model, and matching feature points through the KNN algorithm according to the features des1 and des2;
counting the matched feature points whose distance ratio is larger than 0.9, and taking their proportion among all matched feature points as the picture similarity sim_img(A, B);
the calculation formula of the price similarity sim_price(A, B) is as follows:
Figure FDA0003811032720000031
in the above formula, Price(A, B) is the relative value of the price, and its calculation formula is:
Figure FDA0003811032720000032
in the above formula, $P_A$ and $P_B$ are respectively the prices of house source A and house source B in the two house source webpages, and $Max(P_n)$ and $Min(P_n)$ are respectively the maximum and minimum price differences between identical house sources over all entity pairs in the two house source webpages;
the calculation formula of the area similarity sim_size(A, B) is as follows:
Figure FDA0003811032720000033
in the above formula, Size(A, B) is the area difference between house source A and house source B in the two house source webpages, and its calculation formula is:
$Size(A,B) = |S_A - S_B|$
in the above formula, $S_A$ and $S_B$ are respectively the areas of house source A and house source B in the two house source webpages;
when the orientations of the house sources in the two house source webpages are the same, the orientation similarity sim_direction(A, B) is 1; otherwise, the orientation similarity sim_direction(A, B) is 0;
when the floors of the house sources in the two house source webpages are the same, the floor similarity sim_floor(A, B) is 1; otherwise, the floor similarity sim_floor(A, B) is 0;
the calculation formula of the fusion multi-attribute structure model Sim(A, B) in step S3 is:
$Sim(A,B) = \omega_1 \times \mathrm{sim\_name}(A,B) + \omega_2 \times \mathrm{sim\_title}(A,B) + \omega_3 \times \mathrm{sim\_img}(A,B) + \omega_4 \times \mathrm{sim\_price}(A,B) + \omega_5 \times \mathrm{sim\_size}(A,B) + \omega_6 \times \mathrm{sim\_direction}(A,B) + \omega_7 \times \mathrm{sim\_floor}(A,B)$
in the above formula, $\omega_1$ is the weight of the cell name attribute similarity, $\omega_2$ is the weight of the title attribute similarity, $\omega_3$ is the weight of the floor plan attribute similarity, $\omega_4$ is the weight of the price attribute similarity, $\omega_5$ is the weight of the area attribute similarity, $\omega_6$ is the weight of the orientation attribute similarity, and $\omega_7$ is the weight of the floor attribute similarity;
the calculation formula of the threshold Sim of the total similarity in step S5 is:
$Sim = A[i] - C[i]/2$
in the above formula, A[i] is the i-th item of the set of total similarities of identical house sources sorted in ascending order, B[i] is the i-th item of the set of total similarities of different house sources sorted in descending order, C[i] = A[i] - B[i], and i is the index of the first item in the set C that is greater than 0;
the calculation formula of the adaptive inertia weight ω in step S7 is:
Figure FDA0003811032720000041
in the above formula, $\omega_{max}$ and $\omega_{min}$ are respectively the set maximum and minimum values of the adaptive inertia weight ω; $f_i$ is the current objective function value of the i-th particle, i = 1, 2, …, m, where m is the population size; $f_m$ is the optimal particle fitness in the particle swarm; $f_{avg}$ is the average fitness of the subgroup formed by particles whose fitness is greater than the population average fitness; and $f'_{avg}$ is the average fitness of the subgroup formed by particles whose fitness is smaller than the population average fitness;
in step S8, the velocity of the particle swarm is updated as follows:
$V_{id} = \omega V'_{id} + C_1\,\mathrm{random}(0,1)\,(P_{id} - X'_{id}) + C_2\,\mathrm{random}(0,1)\,(P_{gd} - X'_{id})$
in the above formula, $V_{id}$ is the updated velocity of the particle swarm, $V'_{id}$ is the velocity of the i-th particle at the current iteration, ω is the adaptive inertia weight, $C_1$ is the cognitive factor, $C_2$ is the social factor, $C_1 = C_2 \in [0,4]$, random(0,1) is a random number in the interval [0,1], $P_{id}$ is the d-th dimension of the individual extreme value of the i-th variable, $P_{gd}$ is the d-th dimension of the global extreme value, and $X'_{id}$ is the position of the i-th particle at the current iteration;
in step S8, the position of the particle swarm is updated as follows:
$X_{id} = X'_{id} + V_{id}$
in the above formula, $X_{id}$ is the updated position of the i-th particle, $X'_{id}$ is the position of the i-th particle at the current iteration, and $V_{id}$ is the updated velocity of the i-th particle.
CN201910354563.3A 2019-04-29 2019-04-29 House property data vector alignment method based on particle swarm optimization Active CN110096634B (en)

Publications (2)

Publication Number Publication Date
CN110096634A CN110096634A (en) 2019-08-06
CN110096634B true CN110096634B (en) 2023-02-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant