CN113627864A - Urban functional area identification process based on time-space semantic mining - Google Patents
Urban functional area identification process based on time-space semantic mining
- Publication number
- CN113627864A (application CN202010373505.8A)
- Authority
- CN
- China
- Prior art keywords
- area
- function
- poi
- text
- functional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
- G06Q10/103—Workflow collaboration or project management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Business, Economics & Management (AREA)
- Strategic Management (AREA)
- Human Resources & Organizations (AREA)
- Evolutionary Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Entrepreneurship & Innovation (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Economics (AREA)
- Marketing (AREA)
- Operations Research (AREA)
- Quality & Reliability (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses an urban functional area identification process based on spatio-temporal semantic mining. The process involves documents, words, basic functional units, spatio-temporal data, a topic model, document-topic distributions and unit-function distributions. The latent functions of an area are discovered with a topic model, by analogy with topic mining in text: a basic functional unit corresponds to a document in a corpus, the spatio-temporal data within the unit correspond to the words of the document, and the unit-function distribution produced by the topic model corresponds to the document-topic distribution. The urban spatio-temporal data used are Sina Weibo location check-in data. Each check-in record contains user information, the spatial coordinates of the check-in location, the posting time, the posted text and the like, and can reflect people's dynamic activity patterns from different angles. POIs in the study area are obtained from Baidu Maps, and functional identification of the area is thereby achieved.
Description
Technical Field
The invention relates to the technical field of urban functional area identification, and in particular to an urban functional area identification process based on spatio-temporal semantic mining.
Background
Traditional research on urban functional zoning mostly relies on data obtained by satellite remote sensing, questionnaire surveys, field visits and the like, supplemented by an index system to identify urban functional zones [1-3]. These methods incur high labor costs, and the analysis is influenced by the subjective judgment of the investigators, so it is difficult to monitor urban functional zones accurately and dynamically over the long term.
Disclosure of Invention
The invention aims to provide an urban functional area identification process based on spatio-temporal semantic mining to solve the problems described in the background above.
To achieve the above purpose, the invention provides the following technical solution: an urban functional area identification process based on spatio-temporal semantic mining, which involves documents, words, basic functional units, spatio-temporal data, a topic model, document-topic distributions and unit-function distributions. The latent functions of an area are discovered with a topic model, by analogy with topic mining in text: a basic functional unit corresponds to a document in a corpus, the spatio-temporal data within the unit correspond to the words of the document, and the unit-function distribution produced by the topic model corresponds to the document-topic distribution. The urban spatio-temporal data used are Sina Weibo location check-in data. Each check-in record contains user information, the spatial coordinates of the check-in location, the posting time, the posted text and the like, and can reflect people's dynamic activity patterns from different angles. POIs in the study area are obtained from Baidu Maps. Each POI record contains the name of a physical entity, its spatial coordinates, address, type and the like, and the POI category density features of the different basic functional units are calculated according to type.
The urban functional area identification process based on spatio-temporal semantic mining as claimed in claim 1 is characterized by the following steps. First, the study area is divided into spatially independent basic functional units using buildings as the basis of division, and the discrete microblog check-in data are converted into check-in event sets according to their spatial coordinates and assigned to the units. Then, taking the basic functional unit as the object, the behavior pattern and text features of each check-in event set are extracted, the POI category density is calculated, and these are fed into the DMR topic model to obtain the function vector of each basic functional unit. Because the resulting function vectors have no explicit functional semantics, they are clustered to obtain clusters of units with similar functions. Finally, the functional attributes of each area cluster are labeled according to the POI structure within its functional units and given a semantic interpretation, which completes functional area identification.
Further, the process is as follows. To obtain the urban functional areas, the basic functional units are divided first: the outlines of the buildings in the study area are identified, and a connected-region labeling algorithm is then used to delimit the basic functional units. The check-in data in the study area are then spatialized and mapped to the basic units. The processes of basic unit feature extraction, latent function mining and functional area labeling are described in detail below.
Further, in the location check-in data used for the behavior pattern, each check-in by a user may be represented as C = {user, latitude, longitude, time, text}, where user is the user identifier; latitude is the latitude of the check-in location; longitude is the longitude of the check-in location; time is the check-in time; and text is the text posted at check-in. Together these elements constitute one movement behavior of the user, indicating that the user appeared at a certain place at a certain time. The user behavior pattern in a basic functional unit is defined as the average number of times users appear in that unit in each time period. Each day is divided into 12 time periods of 2 hours each, and working days and weekends are distinguished, giving 24 time intervals. The average number of user check-in behaviors C in each time interval of each area is counted to form the user behavior pattern matrix P. In the matrix, the horizontal axis represents the time intervals t1, …, t24, the vertical axis represents the areas R1, …, Rn, where n is the number of basic units, and each entry is the average number of check-in behaviors occurring in area Rj during time interval ti (for example, the shaded number 6 in the matrix diagram). A 24-dimensional behavior pattern vector is thus obtained for each area.
Furthermore, the text of location check-ins consists mostly of short texts, from which features are difficult to extract. To alleviate the feature sparsity problem, the text features within each area are expanded using a feature expansion method based on the Word2vec word vector model. Word2vec projects words into a vector space and produces distributed word representations [15, 16]. Building on the distributional hypothesis of word semantics, it provides a neural-network-based word vector training model that learns a low-dimensional vector for a target word from the relationship between the target word and its context. Training is efficient, and word vectors trained on a large-scale corpus capture strong syntactic and semantic correlations. Because the Word2vec model can discover semantic relationships between words, it is used to find words similar to the keywords in order to expand the features of the short texts; this also strengthens the topics to a certain extent and better reflects functionality. The specific steps of text feature expansion are as follows:
S1: Data preprocessing: segment the large collected microblog corpus into words and remove stop words and noise words;
S2: Word vector model training: configure the Word2vec model parameters and train the model on the preprocessed data;
S3: Keyword extraction: statistical analysis shows that the average length of the existing corpus texts is 17 words, so the TF-IDF value of each word in the text to be expanded is calculated and the top 10 words are selected as keywords;
S4: Text expansion: traverse the keywords and, for each, add the 5 nearest words according to the previously trained Word2vec model as expanded text features.
Further, a POI density feature vector is constructed for each basic functional unit. For each area r, let the number of POIs of the i-th category be $n_{r,i}$ and the number of all POIs in area r be $s_r$; the density $v_{r,i}$ of the i-th POI category in the area is then
$$v_{r,i} = \frac{n_{r,i}}{s_r}$$
The POI density feature vector of area r is $x_r = (v_{r,1}, v_{r,2}, \ldots, v_{r,F}, 1)$, where F is the number of POI categories and the final 1 is a default value that serves to account for the mean of each topic later.
Furthermore, the behavior pattern and text features obtained above are used as the "text" of each basic functional unit, and the "texts" of all basic units are collected into a "document set" that is input to the DMR topic model. Each basic functional unit also has its POI density features as prior data. The latent variables, namely the function vectors of the basic units, are inferred from the features in the observable "texts", so that each basic unit can finally be represented as a probability distribution over the functions. Specifically:
S1: Given a basic functional unit r in the study area, for each latent function k generate a vector $\lambda_k$ that follows a Gaussian distribution with hyperparameter $\sigma$ and has the same length as the POI density feature vector, and generate the function-feature distribution $\beta_k$ of the basic units from a Dirichlet distribution with prior parameter $\eta$;
S2: Let $\alpha_{r,k} = \exp(x_r^{T} \lambda_k)$, where $x_r$ is the POI density feature vector of basic unit r, and let the function distribution $\theta_r$ follow a Dirichlet distribution with prior parameter $\alpha_r$;
S3: For the n-th feature $f_{r,n}$ in basic unit r, draw its function assignment $z_{r,n}$ from the multinomial distribution with parameter $\theta_r$ obtained in step S2, which determines the function-feature distribution $\beta_{z_{r,n}}$ of this feature;
S4: Generate the feature $f_{r,n}$ from the multinomial distribution with parameter $\beta_{z_{r,n}}$;
S5: Traverse the N features in basic unit r, repeating steps S3-S4, to generate basic unit r;
S6: Traverse the R basic units in the study area, repeating steps S1-S5, to generate the whole study area.
Further, the function vectors have no explicit semantic meaning, so functional areas cannot be judged qualitatively from them directly. The function vectors are therefore first clustered with the k-means algorithm so that areas with similar functions are grouped together, with the number of cluster centers determined by the average silhouette coefficient. Each resulting area cluster is then given a functional label, namely science area, residential area, commercial area, working area, living facility area or mixed area, according to its internal POI distribution structure. The POI distribution structure comprises the frequency density (FD) of each POI category in the area cluster and the category proportion (CP) of each POI category in the area cluster. For an area cluster c, they are calculated as follows:
$$FD_{c,i} = \frac{n_{c,i}}{N_i}, \qquad CP_{c,i} = \frac{FD_{c,i}}{\sum_{j=1}^{M} FD_{c,j}}$$
where i is a POI category; M is the total number of POI categories; $n_{c,i}$ is the number of POIs of category i in c; $N_i$ is the total number of POIs of category i; $FD_{c,i}$ is the frequency density of the i-th POI category in area cluster c; and $CP_{c,i}$ is the ratio of the frequency density of the i-th POI category in area cluster c to the sum of the frequency densities of all POI categories in the cluster.
The category proportion represents the importance of one POI category in an area, while different POI categories are associated with functions to different degrees. Research indicates that, in a given area cluster, if the sum of the category proportions of the POIs belonging to one function exceeds 50%, the area cluster is dominated by that function and is a single-function urban area; if the sum of the category proportions of the POIs belonging to each function does not exceed 50%, the area cluster is a mixed area.
The beneficial effects of the invention are as follows. Topic models are widely used in natural language processing and perform well in mining the latent semantics of text: a document is mapped from term space to topic space and represented as a probability distribution over several topics. The same idea can be applied to discovering the functions of an area. An area is regarded as a document, the spatio-temporal activity data within the area are the words of the document, and the functions of the area are the topics of the document; the function distribution of each area can then be obtained with a topic model. The LDA (Latent Dirichlet Allocation) topic model [11] is a classic model for text semantic mining. It is a Bayesian generative model with latent variables that represents documents by distributions over topics and characterizes topics by distributions over words. In expressing topics, however, LDA considers only the words in the text and ignores the contribution of other data associated with the document, so many extended models based on LDA have appeared. The DMR (Dirichlet-multinomial regression) model [12] is a topic model derived from LDA. Unlike other topic models, the prior parameters of the Dirichlet topic distribution of a document in DMR take the influence of document-related features into account, so that more complex and effective auxiliary features can be introduced and the topic extraction performance of the model is enhanced.
Based on the above, an urban functional area identification method based on spatio-temporal semantic mining is provided using location check-in data and POI data. Buildings serve as the basic functional units of the study area. User behavior patterns and text features within each functional unit are extracted from the location check-in data and input, together with the POI density, into the DMR topic model to obtain the function vector of each functional unit. The vectors are then clustered so that areas with similar functions are aggregated, and the clusters are labeled and given a semantic interpretation to achieve functional identification of the areas.
Drawings
FIG. 1 is a diagram of the analogy between area-function and document-topic in the urban functional area identification process based on spatio-temporal semantic mining according to the present invention;
FIG. 2 is a basic flow chart of urban functional area identification in the urban functional area identification process based on spatio-temporal semantic mining according to the present invention;
FIG. 3 is a user behavior pattern matrix diagram of the urban functional area identification process based on spatio-temporal semantic mining according to the present invention;
FIG. 4 is a POI type diagram of the urban functional area identification process based on spatio-temporal semantic mining according to the present invention;
FIG. 5 is a diagram of the DMR model generation process of the urban functional area identification process based on spatio-temporal semantic mining according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts based on the embodiments of the present invention belong to the protection scope of the present invention.
Referring to fig. 1-5, the present invention provides a technical solution:
An urban functional area identification process based on spatio-temporal semantic mining, which involves documents, words, basic functional units, spatio-temporal data, a topic model, document-topic distributions and unit-function distributions. The latent functions of an area are discovered with a topic model, by analogy with topic mining in text: a basic functional unit corresponds to a document in a corpus, the spatio-temporal data within the unit correspond to the words of the document, and the unit-function distribution produced by the topic model corresponds to the document-topic distribution. The urban spatio-temporal data used are Sina Weibo location check-in data. Each check-in record contains user information, the spatial coordinates of the check-in location, the posting time, the posted text and the like, and can reflect people's dynamic activity patterns from different angles. POIs in the study area are obtained from Baidu Maps. Each POI record contains the name of a physical entity, its spatial coordinates, address, type and the like, and the POI category density features of the different basic functional units are calculated according to type.
To further improve the urban functional area identification process based on spatio-temporal semantic mining, the study area is first divided into spatially independent basic functional units using buildings as the basis of division, and the discrete microblog check-in data are converted into check-in event sets according to their spatial coordinates and assigned to the units. Then, taking the basic functional unit as the object, the behavior pattern and text features of each check-in event set are extracted, the POI category density is calculated, and these are fed into the DMR topic model to obtain the function vector of each basic functional unit. Because the resulting function vectors have no explicit functional semantics, they are clustered to obtain clusters of units with similar functions. Finally, the functional attributes of each area cluster are labeled according to the POI structure within its functional units and given a semantic interpretation, which completes functional area identification.
To further improve the urban functional area identification process based on spatio-temporal semantic mining, the process is as follows. To obtain the urban functional areas, the basic functional units are divided first: the outlines of the buildings in the study area are identified, and a connected-region labeling algorithm is then used to delimit the basic functional units. The check-in data in the study area are then spatialized and mapped to the basic units. The processes of basic unit feature extraction, latent function mining and functional area labeling are described in detail below.
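The connected-region labeling step can be illustrated with a minimal sketch. It assumes, purely for illustration, that the building outlines have already been rasterized into a binary grid in which 1 marks a building cell; the function name `label_units`, the grid encoding and the choice of 4-connectivity are hypothetical and are not specified by the text above.

```python
from collections import deque

def label_units(grid):
    """Label 4-connected regions of 1-cells with ids 1, 2, ...; 0 means no unit."""
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or labels[r][c]:
                continue  # background cell, or already assigned to a unit
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count
```

Once the grid is labeled, a check-in whose coordinates fall in cell (row, col) can be assigned to the unit `labels[row][col]`, which is one simple way to realize the spatialization step.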
To further improve the urban functional area identification process based on spatio-temporal semantic mining, in the location check-in data used for the behavior pattern, each check-in by a user may be represented as C = {user, latitude, longitude, time, text}, where user is the user identifier; latitude is the latitude of the check-in location; longitude is the longitude of the check-in location; time is the check-in time; and text is the text posted at check-in. Together these elements constitute one movement behavior of the user, indicating that the user appeared at a certain place at a certain time. The user behavior pattern in a basic functional unit is defined as the average number of times users appear in that unit in each time period. Each day is divided into 12 time periods of 2 hours each, and working days and weekends are distinguished, giving 24 time intervals. The average number of user check-in behaviors C in each time interval of each area is counted to form the user behavior pattern matrix P. In the matrix, the horizontal axis represents the time intervals t1, …, t24, the vertical axis represents the areas R1, …, Rn, where n is the number of basic units, and each entry is the average number of check-in behaviors occurring in area Rj during time interval ti (for example, the shaded number 6 in the matrix diagram). A 24-dimensional behavior pattern vector is thus obtained for each area.
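As an illustration of how the behavior pattern matrix P might be built, the sketch below maps each check-in to one of the 24 intervals (slots 0-11 for working days, 12-23 for weekends) and averages the counts. The averaging denominator is an assumption: counts are divided by the number of distinct working-day or weekend dates seen in the data, since the text does not state what the average is taken over. The names `slot_index` and `behavior_matrix` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def slot_index(ts):
    """Map a timestamp to one of 24 slots: 0-11 working days, 12-23 weekends."""
    weekend = ts.weekday() >= 5  # Saturday = 5, Sunday = 6
    return (12 if weekend else 0) + ts.hour // 2

def behavior_matrix(checkins):
    """checkins: iterable of (unit_id, datetime). Returns {unit_id: 24 averages}."""
    counts = defaultdict(lambda: [0.0] * 24)
    weekday_dates, weekend_dates = set(), set()
    for unit_id, ts in checkins:
        counts[unit_id][slot_index(ts)] += 1
        (weekend_dates if ts.weekday() >= 5 else weekday_dates).add(ts.date())
    # Assumed denominator: number of observed dates of each kind (at least 1).
    n_weekday = max(len(weekday_dates), 1)
    n_weekend = max(len(weekend_dates), 1)
    return {
        unit: [c / (n_weekday if i < 12 else n_weekend) for i, c in enumerate(row)]
        for unit, row in counts.items()
    }
```

Each value of the returned dictionary is one row of P, that is, the 24-dimensional behavior pattern vector of one basic unit.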
To further improve the urban functional area identification process based on spatio-temporal semantic mining, note that the text of location check-ins consists mostly of short texts, from which features are difficult to extract. To alleviate the feature sparsity problem, the text features within each area are expanded using a feature expansion method based on the Word2vec word vector model. Word2vec projects words into a vector space and produces distributed word representations [15, 16]. Building on the distributional hypothesis of word semantics, it provides a neural-network-based word vector training model that learns a low-dimensional vector for a target word from the relationship between the target word and its context. Training is efficient, and word vectors trained on a large-scale corpus capture strong syntactic and semantic correlations. Because the Word2vec model can discover semantic relationships between words, it is used to find words similar to the keywords in order to expand the features of the short texts; this also strengthens the topics to a certain extent and better reflects functionality. The specific steps of text feature expansion are as follows:
S1: Data preprocessing: segment the large collected microblog corpus into words and remove stop words and noise words;
S2: Word vector model training: configure the Word2vec model parameters and train the model on the preprocessed data;
S3: Keyword extraction: statistical analysis shows that the average length of the existing corpus texts is 17 words, so the TF-IDF value of each word in the text to be expanded is calculated and the top 10 words are selected as keywords;
S4: Text expansion: traverse the keywords and, for each, add the 5 nearest words according to the previously trained Word2vec model as expanded text features.
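Steps S3 and S4 can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: texts are assumed to be already segmented (S1), a plain dictionary of word vectors stands in for the trained Word2vec model (S2), the smoothed IDF formula is one common variant, and nearness is measured by cosine similarity. All function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_keywords(doc, corpus, top_k=10):
    """Return up to top_k words of doc ranked by TF-IDF against corpus (S3)."""
    tf = Counter(doc)
    n_docs = len(corpus)

    def idf(word):
        df = sum(1 for d in corpus if word in d)
        return math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF variant

    scores = {w: (c / len(doc)) * idf(w) for w, c in tf.items()}
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_words(word, vectors, top_n=5):
    """Return the top_n words closest to word; vectors stands in for Word2vec."""
    if word not in vectors:
        return []
    scored = [(cosine(vectors[word], vec), w)
              for w, vec in vectors.items() if w != word]
    return [w for _, w in sorted(scored, key=lambda t: (-t[0], t[1]))[:top_n]]

def expand_text(doc, corpus, vectors, top_k=10, top_n=5):
    """Append the nearest neighbours of each keyword to the text (S4)."""
    expanded = list(doc)
    for keyword in tfidf_keywords(doc, corpus, top_k):
        for w in nearest_words(keyword, vectors, top_n):
            if w not in expanded:
                expanded.append(w)
    return expanded
```

The defaults `top_k=10` and `top_n=5` mirror the values given in S3 and S4; with a real trained model, the dictionary lookup would be replaced by the model's own similarity query.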
To further improve the urban functional area identification process based on spatio-temporal semantic mining, a POI density feature vector is constructed for each basic functional unit. For each area r, let the number of POIs of the i-th category be $n_{r,i}$ and the number of all POIs in area r be $s_r$; the density $v_{r,i}$ of the i-th POI category in the area is then
$$v_{r,i} = \frac{n_{r,i}}{s_r}$$
The POI density feature vector of area r is $x_r = (v_{r,1}, v_{r,2}, \ldots, v_{r,F}, 1)$, where F is the number of POI categories and the final 1 is a default value that serves to account for the mean of each topic later.
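A short sketch of the POI density feature vector follows. It assumes the density of category i is the count of that category divided by the total POI count of the unit, which is what the definitions of the two counts suggest; the function name `poi_density_vector` and the handling of units with no POIs are illustrative choices.

```python
def poi_density_vector(category_counts):
    """Build x_r from the per-category POI counts of one basic unit r."""
    total = sum(category_counts)  # total number of POIs in the unit
    if total == 0:
        densities = [0.0] * len(category_counts)  # assumed convention for empty units
    else:
        densities = [n / total for n in category_counts]
    return densities + [1.0]  # trailing 1 is the default (intercept) entry
```

For example, a unit with counts `[2, 6, 0, 2]` over four categories yields `[0.2, 0.6, 0.0, 0.2, 1.0]`.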
To further improve the urban functional area identification process based on spatio-temporal semantic mining, the behavior pattern and text features obtained above are used as the "text" of each basic functional unit, and the "texts" of all basic units are collected into a "document set" that is input to the DMR topic model. Each basic functional unit also has its POI density features as prior data. The latent variables, namely the function vectors of the basic units, are inferred from the features in the observable "texts", so that each basic unit can finally be represented as a probability distribution over the functions. Specifically:
S1: Given a basic functional unit r in the study area, for each latent function k generate a vector $\lambda_k$ that follows a Gaussian distribution with hyperparameter $\sigma$ and has the same length as the POI density feature vector, and generate the function-feature distribution $\beta_k$ of the basic units from a Dirichlet distribution with prior parameter $\eta$;
S2: Let $\alpha_{r,k} = \exp(x_r^{T} \lambda_k)$, where $x_r$ is the POI density feature vector of basic unit r, and let the function distribution $\theta_r$ follow a Dirichlet distribution with prior parameter $\alpha_r$;
S3: For the n-th feature $f_{r,n}$ in basic unit r, draw its function assignment $z_{r,n}$ from the multinomial distribution with parameter $\theta_r$ obtained in step S2, which determines the function-feature distribution $\beta_{z_{r,n}}$ of this feature;
S4: Generate the feature $f_{r,n}$ from the multinomial distribution with parameter $\beta_{z_{r,n}}$;
S5: Traverse the N features in basic unit r, repeating steps S3-S4, to generate basic unit r;
S6: Traverse the R basic units in the study area, repeating steps S1-S5, to generate the whole study area.
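The generative process in steps S1-S6 can be simulated with the sketch below. It only samples synthetic units; it does not perform the inference that recovers function vectors from observed data. Two points are assumptions: the vectors lambda_k and the distributions beta_k are drawn once and shared by all units, as in the usual DMR formulation, and the Gaussian is taken to be zero-mean with standard deviation sigma. The function names and default values are hypothetical.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalised gamma variates."""
    # The tiny floor guards against floating-point underflow for small alpha.
    draws = [max(rng.gammavariate(a, 1.0), 1e-300) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_study_area(X, n_functions, n_feature_types, n_features,
                        sigma=1.0, eta=0.5, seed=0):
    """Sample (theta_r, features) for every unit; X holds the POI vectors x_r."""
    rng = random.Random(seed)
    dim = len(X[0])
    # S1: lambda_k ~ N(0, sigma^2 I) and beta_k ~ Dirichlet(eta), shared by all units.
    lam = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(n_functions)]
    beta = [sample_dirichlet([eta] * n_feature_types, rng) for _ in range(n_functions)]
    area = []
    for x in X:  # S6: traverse the basic units
        # S2: alpha_{r,k} = exp(x_r^T lambda_k), then theta_r ~ Dirichlet(alpha_r).
        alpha = [math.exp(sum(xi * li for xi, li in zip(x, lam_k))) for lam_k in lam]
        theta = sample_dirichlet(alpha, rng)
        features = []
        for _ in range(n_features):  # S5: traverse the features of the unit
            z = rng.choices(range(n_functions), weights=theta)[0]        # S3
            f = rng.choices(range(n_feature_types), weights=beta[z])[0]  # S4
            features.append(f)
        area.append((theta, features))
    return area
```

Each sampled `theta` plays the role of a unit's function vector, and the integer features stand for entries of the combined behavior-pattern and text vocabulary.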
To further improve the urban functional area identification process based on spatio-temporal semantic mining: because the function vectors have no clear semantic expression and cannot be used to judge functional areas qualitatively, the function vectors are first clustered with the k-means algorithm so that areas with similar functions are aggregated together, the number of cluster centers being determined by the average silhouette coefficient. Each aggregated area cluster is then functionally labeled according to its internal POI distribution structure, which comprises the frequency density (Frequency Density, FD) of each POI type in the area cluster and the category proportion (Category Proportion, CP) of each POI type in the area cluster. For an area cluster $c$, these are calculated as follows:

$$FD_{c,i} = \frac{n_{c,i}}{N_i}, \qquad CP_{c,i} = \frac{FD_{c,i}}{\sum_{j=1}^{m} FD_{c,j}}$$

where $i$ is a POI category; $m$ is the total number of POI categories; $n_{c,i}$ is the number of POIs of category $i$ in $c$; $N_i$ is the total number of POIs of category $i$; $FD_{c,i}$ is the frequency density of category-$i$ POIs in area cluster $c$; and $CP_{c,i}$ is the proportion of the frequency density of category-$i$ POIs in area cluster $c$ to the frequency density of the POIs of all categories in the cluster.
The category proportion represents the importance of one type of POI in a given area, while different types of POI are associated with functions to different degrees. The study determines that, in the current area cluster, if the sum of the category proportions of the POIs belonging to a certain function exceeds 50%, the area cluster is dominated by that function and is a single-function urban functional area; if no function's POI category proportions sum to more than 50%, the area cluster is a mixed area.
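The labeling rule can be illustrated with a short sketch. It assumes that the frequency density is the cluster count divided by the total count of the category, and that the category proportion is the frequency density normalized over all categories, as the variable definitions suggest. The function name `label_cluster`, the POI categories, their counts, and the category-to-function mapping are hypothetical.

```python
def label_cluster(cluster_counts, total_counts, function_of, threshold=0.5):
    """Label one area cluster from its POI structure.

    cluster_counts : {category: number of POIs of that category in the cluster}
    total_counts   : {category: number of POIs of that category overall}
    function_of    : {category: urban function it belongs to} (hypothetical mapping)
    """
    # Frequency density, assumed to be FD_{c,i} = n_{c,i} / N_i.
    fd = {i: cluster_counts.get(i, 0) / n for i, n in total_counts.items() if n > 0}
    fd_sum = sum(fd.values())
    if fd_sum == 0:
        return "unlabeled", {}
    # Category proportion, assumed to be CP_{c,i} = FD_{c,i} / sum_j FD_{c,j}.
    cp = {i: v / fd_sum for i, v in fd.items()}

    # Sum the category proportions of the POI types belonging to each function.
    share = {}
    for i, p in cp.items():
        share[function_of[i]] = share.get(function_of[i], 0.0) + p

    best = max(share, key=share.get)
    return (best if share[best] > threshold else "mixed"), cp


total_counts = {"restaurant": 100, "shop": 200, "office": 50, "school": 40}
function_of = {"restaurant": "commercial", "shop": "commercial",
               "office": "business", "school": "education"}

label_a, cp_a = label_cluster(
    {"restaurant": 30, "shop": 40, "office": 5, "school": 2}, total_counts, function_of)
label_b, cp_b = label_cluster(
    {"restaurant": 10, "shop": 20, "office": 10, "school": 8}, total_counts, function_of)
```

In the first cluster the commercial categories account for roughly 77% of the category proportion, so it is labeled as a single-function area. In the second cluster no function exceeds 50%, so it is labeled as mixed.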
Topic models are widely used in natural language processing and perform well in mining the latent semantics of text. A document is mapped from the term space to the topic space and represented as a probability distribution over several topics. The same idea can be applied to discovering the functions of a region: a region is regarded as a document, the spatio-temporal activity data in the region are the words of the document, and the function of the region is the topic of the document, so the function distribution of each region can be obtained with a topic model. The LDA (Latent Dirichlet Allocation) topic model [11] is a classic model for text semantic mining. It is a Bayesian generative model with latent variables that represents a document by a distribution over topics and characterizes each topic by a distribution over words. In expressing topics, however, it considers only the words in the text and ignores the contribution of other data associated with the document, so many extended models based on LDA have appeared. The DMR (Dirichlet multinomial regression) model [12] is a topic model derived from LDA. Unlike other topic models, the prior parameters of its Dirichlet document-topic distribution take the document's associated features into account, so that richer and more effective auxiliary features can be introduced and the model's topic extraction is improved.
Building on this research, a city functional area identification method based on spatio-temporal semantic mining is proposed that uses location check-in data and POI data. Buildings serve as the basic functional units of the research area. User behavior patterns and text features in each functional unit are extracted from the location check-in data and input, together with the POI density, into a DMR topic model to obtain the function vector of the unit. The vectors are then clustered so that areas with similar functions are aggregated, and the resulting clusters are labeled and semantically interpreted to identify the function of each area.
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. Those skilled in the art will understand that the present invention is not limited to the embodiments described above; the embodiments and the specification merely illustrate the principle of the invention, and various changes and modifications may be made without departing from its spirit and scope, all of which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.
Claims (8)
1. A city functional area identification process based on spatio-temporal semantic mining, characterized in that: it involves documents, words, extruded functional units, spatio-temporal data, a topic model, document topic distributions, and unit function distributions. An attempt is first made to discover the functions implied by the regions through the topic model, in a manner similar to text topic mining: the basic functional units are equivalent to the documents in a corpus, the spatio-temporal data in a functional unit are analogous to the words in a document, and the unit function distribution obtained from the topic model is equivalent to the document topic distribution. The urban spatio-temporal data used are representative Sina Weibo location check-in data; each check-in record comprises user information, the spatial coordinates of the check-in position, the posting time, the posted text, and so on, and can reflect people's dynamic activity patterns from different angles. In addition, the POIs in the research area are obtained from Baidu Maps; each record comprises the name, spatial coordinates, address, type, and so on of the physical entity, and the POI category density features of the different basic functional units are calculated according to type.
2. The city functional area identification process based on spatio-temporal semantic mining as claimed in claim 1, characterized by comprising the following steps: first, the research area is divided into spatially independent basic functional units on the basis of buildings, and the discrete microblog check-in data are converted into sets of check-in events according to their spatial coordinates and assigned to the units; then, for each basic functional unit, the behavior pattern and text features of its check-in event set are extracted, the POI category density is calculated, and these are input into the DMR topic model to obtain the function vector of the unit; because the resulting function vectors have no definite functional semantics, they are clustered to obtain clusters of units with similar functions; finally, the functional attributes of each area cluster are labeled according to the POI structure within its functional units and a semantic interpretation is given, completing the functional area identification.
3. The city functional area identification process based on spatio-temporal semantic mining as claimed in claim 1, characterized by comprising the following steps: to obtain the urban functional areas, the basic functional units are divided first, in that the outlines of the buildings in the research area are identified and a connected-region labeling algorithm is then used to delimit the basic functional units; the check-in data in the research area are then spatialized and mapped to the basic units, followed by the processes of basic unit feature extraction, latent function mining, and functional area labeling.
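As an illustration of connected-region labeling, the sketch below labels 4-connected regions of a rasterized binary mask with a breadth-first search. It assumes that the building outlines have already been converted to a grid in which 1 marks a cell inside a building; the function name `label_connected_regions` and the toy mask are hypothetical, and the labeling variant actually used is not specified here.

```python
from collections import deque


def label_connected_regions(mask):
    """Assign a positive integer label to each 4-connected region of 1-cells."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    current = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and labels[r][c] == 0:
                # Unvisited building cell: start a new region and flood it.
                current += 1
                labels[r][c] = current
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and labels[ny][nx] == 0):
                            labels[ny][nx] = current
                            queue.append((ny, nx))
    return labels, current


# Toy mask with three separate building footprints.
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0],
]
labels, n_units = label_connected_regions(mask)
```

Each distinct label then corresponds to one candidate basic functional unit to which check-in points can be assigned by their coordinates.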
4. The city functional area identification process based on spatio-temporal semantic mining as claimed in claim 1, wherein, for the behavior pattern, a location check-in record can be expressed as C = {user, latitude, longitude, time, text}, where user is the user ID; latitude is the latitude of the check-in position; longitude is the longitude of the check-in position; time is the check-in time; and text is the text posted at check-in. Together these fields constitute one movement behavior of the user, indicating that the user appeared at a certain place at a certain time. The user behavior pattern in a basic functional unit is defined as the average number of times users appear in that unit in a given time period. Each day is divided into 12 periods of 2 hours each, and working days and weekends are distinguished, giving 24 time intervals. The average number of user check-in behaviors C in each time interval of each area is counted to form the user behavior pattern matrix P. In this matrix the horizontal axis represents the time intervals t1, …, t24, the vertical axis represents the areas R1, …, Rn, where n is the number of basic units, and each entry (such as the shaded number 6) is the average number of times the behavior occurs in area Rj during time interval ti. A 24-dimensional behavior pattern vector is thus obtained for each area.
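The construction of the behavior pattern matrix can be sketched as follows. This is only an illustration under stated assumptions: each check-in is taken to be already assigned to a basic unit, and the average is taken over the number of distinct weekdays or weekend days observed in the data, which the text does not specify. The names `slot_index` and `behavior_pattern_matrix` and the sample check-ins are hypothetical.

```python
from datetime import datetime


def slot_index(ts):
    """Map a timestamp to one of 24 intervals: 0-11 weekdays, 12-23 weekends."""
    is_weekend = ts.weekday() >= 5          # Monday is 0, so 5 and 6 are the weekend
    return ts.hour // 2 + (12 if is_weekend else 0)


def behavior_pattern_matrix(checkins, unit_ids):
    """Return {unit: 24-dimensional vector of average check-in counts}.

    checkins : iterable of (unit_id, datetime) pairs, already mapped to units
    """
    counts = {u: [0] * 24 for u in unit_ids}
    days = {False: set(), True: set()}      # observed weekday / weekend dates
    for unit, ts in checkins:
        counts[unit][slot_index(ts)] += 1
        days[ts.weekday() >= 5].add(ts.date())
    # Assumption: average over the number of observed days of each kind.
    n_days = {kind: max(len(dates), 1) for kind, dates in days.items()}
    return {
        u: [c / n_days[j >= 12] for j, c in enumerate(row)]
        for u, row in counts.items()
    }


checkins = [
    ("R1", datetime(2020, 5, 4, 9, 15)),    # Monday
    ("R1", datetime(2020, 5, 5, 9, 40)),    # Tuesday
    ("R1", datetime(2020, 5, 9, 20, 5)),    # Saturday
    ("R2", datetime(2020, 5, 4, 13, 0)),    # Monday
]
P = behavior_pattern_matrix(checkins, ["R1", "R2"])
```

Each value of `P` is one row of the matrix, that is, the 24-dimensional behavior pattern vector of one unit.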
5. The city functional area identification process based on spatio-temporal semantic mining as claimed in claim 1, wherein, because the location check-in text data are mostly short texts from which features are difficult to extract, a feature expansion method based on the Word2vec word vector model is used to expand the text features within a region and alleviate the feature sparsity problem. Word2vec projects words into a vector space and produces distributed-representation word vectors [15, 16]. Based on the distributional hypothesis of word semantics, it is a neural-network-based word vector training model that obtains a low-dimensional vector for a target word from the relation between the target word and its context. Training is efficient, and word vectors trained on a large-scale corpus are strongly correlated in syntax and semantics. Because the Word2vec word vector model can capture semantic relations between words, it is used to find words similar to the keywords and thereby expand the features of the short texts, which also strengthens the topic to some extent and better reflects functionality. The specific steps of text feature expansion are as follows:
s1: data preprocessing: dividing a large number of collected microblog linguistic data into words and removing stop words and interference words;
s2: training a word vector model: configuring Word2vec model parameters, and substituting the parameters into data for training;
s3: extracting keywords: the average length of the existing corpus text is 17 words obtained through statistical analysis, so the TF-IDF values of the words in the text to be expanded are calculated, and the first 10 words are selected as key words;
s4 text expansion: and traversing the keywords, and expanding 5 nearest words according to the previously obtained Word2vec model to serve as expanded text characteristics.
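Steps S3 and S4 can be sketched in a few lines. The sketch is illustrative only: it uses one common smoothed TF-IDF variant, and a small hand-written dictionary of toy vectors stands in for a trained Word2vec model. The function names, the toy corpus, and the vectors are all hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(doc, corpus, top_k=10):
    """S3: rank the words of one tokenized text by TF-IDF and keep the top_k."""
    tf = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in corpus if word in d)
        idf = math.log((1 + n_docs) / (1 + df))      # smoothed IDF (one common variant)
        scores[word] = (count / len(doc)) * idf
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(keywords, vectors, top_n=5):
    """S4: for each keyword, add its top_n nearest words by cosine similarity."""
    expanded = []
    for kw in keywords:
        if kw not in vectors:
            continue
        ranked = sorted(
            ((cosine(vectors[kw], vec), w) for w, vec in vectors.items() if w != kw),
            reverse=True,
        )
        expanded.extend(w for _, w in ranked[:top_n])
    return expanded


corpus = [
    ["coffee", "latte", "shop", "coffee"],
    ["office", "meeting", "shop"],
    ["park", "walk", "shop"],
]
# Toy vectors standing in for a trained Word2vec model (assumption).
vectors = {
    "coffee": [1.0, 0.0], "latte": [0.9, 0.1], "espresso": [0.8, 0.2],
    "office": [0.0, 1.0], "meeting": [0.1, 0.9],
}
keywords = tfidf_keywords(corpus[0], corpus, top_k=2)
expanded = expand_keywords(["coffee"], vectors, top_n=2)
```

With a real model the nearest-word lookup would be delegated to the trained word vectors, and `top_k=10` and `top_n=5` would match the values given in steps S3 and S4.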
6. The city functional area identification process based on spatio-temporal semantic mining as claimed in claim 1, wherein a POI density feature vector is constructed for each basic functional unit: for each area $r$, the number of POIs of type $i$ is $n_{r,i}$ and the number of all POIs in the area is $s_r$, so the density $v_{r,i}$ of type-$i$ POIs in the area is

$$v_{r,i} = \frac{n_{r,i}}{s_r}$$

The POI density feature vector of area $r$ is $x_r = (v_{r,1}, v_{r,2}, \ldots, v_{r,F}, 1)$, where $F$ is the number of POI categories and the final 1 is a default value used later to describe the mean of each topic.
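A minimal sketch of the POI density feature vector follows. It assumes the density of a category is its count divided by the total number of POIs in the area, which is how the variables above are read here; the function name `poi_density_vector` and the sample counts are hypothetical.

```python
def poi_density_vector(counts):
    """Build x_r = (v_{r,1}, ..., v_{r,F}, 1) from per-category POI counts.

    counts : list with the number of POIs of each of the F categories in area r
    """
    total = sum(counts)                       # s_r, all POIs in the area
    # Assumed density: v_{r,i} = n_{r,i} / s_r (zero if the area has no POIs).
    densities = [c / total if total else 0.0 for c in counts]
    return densities + [1.0]                  # trailing default value


x_r = poi_density_vector([2, 1, 1])
x_empty = poi_density_vector([0, 0, 0])
```

The trailing 1 acts as an intercept term, so the vector has length F + 1 and matches the length required of each Gaussian vector in the topic model.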
7. The city functional area identification process based on spatio-temporal semantic mining as claimed in claim 1, wherein the obtained behavior patterns and text features serve as the "text" of each basic functional unit, and the "texts" of all basic units are collected into a "document set" that is input into the DMR topic model; each basic functional unit also has its POI density feature as prior data; the latent variable, i.e. the function vector of the basic unit, is inferred from the features in the observable "text", and each basic unit is finally represented as a probability distribution over the functions, specifically:
S1: given a basic functional unit $r$ in the research area, for each latent function $k$ a vector $\lambda_k$ obeying a Gaussian distribution is generated from the hyperparameter $\sigma$, the vector $\lambda_k$ having the same length as the POI density feature, and a function-feature distribution $\beta_k$ obeying a Dirichlet distribution is generated from the prior parameter $\eta$;
S2: let $\alpha_{r,k} = \exp(x_r^{T} \lambda_k)$, where $x_r$ is the POI density feature of basic unit $r$, and let $\theta_r$ be the function distribution of the unit, which obeys a Dirichlet distribution with prior parameter $\alpha_r$;
S3: for the $n$-th feature $f_{r,n}$ in basic unit $r$, its function assignment $z_{r,n}$ obeys the multinomial distribution of $\theta_r$ obtained in step S2, from which the function-feature distribution $\beta_{z_{r,n}}$ of this feature can be determined;
S4: the feature $f_{r,n}$ is generated according to the multinomial distribution of $\beta_{z_{r,n}}$;
S5: the $N$ features in basic unit $r$ are traversed and steps S3-S4 are repeated to generate basic unit $r$;
S6: the $R$ basic units in the research area are traversed and steps S1-S5 are repeated to generate the whole research area.
8. The city functional area identification process based on spatio-temporal semantic mining as claimed in claim 1, characterized in that, because the function vectors have no clear semantic expression and cannot be used to judge functional areas qualitatively, the function vectors are first clustered with the k-means algorithm so that areas with similar functions are aggregated together, the number of cluster centers being determined by the average silhouette coefficient; each aggregated area cluster is then functionally labeled according to its internal POI distribution structure, which comprises the frequency density (Frequency Density, FD) of each POI type in the area cluster and the category proportion (Category Proportion, CP) of each POI type in the area cluster; for an area cluster $c$, these are calculated as follows:

$$FD_{c,i} = \frac{n_{c,i}}{N_i}, \qquad CP_{c,i} = \frac{FD_{c,i}}{\sum_{j=1}^{m} FD_{c,j}}$$

where $i$ is a POI category; $m$ is the total number of POI categories; $n_{c,i}$ is the number of POIs of category $i$ in $c$; $N_i$ is the total number of POIs of category $i$; $FD_{c,i}$ is the frequency density of category-$i$ POIs in area cluster $c$; and $CP_{c,i}$ is the proportion of the frequency density of category-$i$ POIs in area cluster $c$ to the frequency density of the POIs of all categories in the cluster;
the category proportion represents the importance of one type of POI in a given area, while different types of POI are associated with functions to different degrees; the study determines that, in the current area cluster, if the sum of the category proportions of the POIs belonging to a certain function exceeds 50%, the area cluster is dominated by that function and is a single-function urban functional area; if no function's POI category proportions sum to more than 50%, the area cluster is a mixed area.
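The clustering step, k-means with the number of clusters chosen by the average silhouette coefficient, can be sketched with NumPy alone. This is an illustrative sketch rather than the claimed implementation: the deterministic farthest-point initialization is an assumption made for reproducibility, and the function names and the toy two-dimensional "function vectors" are hypothetical.

```python
import numpy as np


def farthest_point_init(X, k):
    """Deterministic initialization (assumption): start at X[0], then repeatedly
    add the point farthest from the centers chosen so far."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    return np.array(centers)


def kmeans(X, k, n_iter=100):
    """Plain Lloyd iterations; returns one cluster label per row of X."""
    centers = farthest_point_init(X, k)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def mean_silhouette(X, labels):
    """Average silhouette coefficient; singleton clusters contribute 0."""
    clusters = np.unique(labels)
    if len(clusters) < 2:
        return -1.0
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue
        a = d[i, same].sum() / (same.sum() - 1)          # mean intra-cluster distance
        b = min(d[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return float(scores.mean())


def choose_k(X, k_values):
    """Return the k with the highest average silhouette and its labels."""
    results = {k: kmeans(X, k) for k in k_values}
    best = max(k_values, key=lambda k: mean_silhouette(X, results[k]))
    return best, results[best]


# Toy function vectors forming three well-separated groups.
X = np.array([[0, 0], [0.1, 0], [0, 0.1],
              [5, 5], [5.1, 5], [5, 5.1],
              [10, 0], [10.1, 0], [10, 0.1]], dtype=float)
best_k, labels = choose_k(X, [2, 3, 4])
```

The resulting labels group units with similar function vectors; each cluster can then be labeled from its POI structure as described in this claim.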
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010373505.8A CN113627864A (en) | 2020-05-06 | 2020-05-06 | Urban functional area identification process based on time-space semantic mining |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010373505.8A CN113627864A (en) | 2020-05-06 | 2020-05-06 | Urban functional area identification process based on time-space semantic mining |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113627864A (en) | 2021-11-09 |
Family
ID=78376613
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010373505.8A Pending CN113627864A (en) | Urban functional area identification process based on time-space semantic mining | 2020-05-06 | 2020-05-06 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113627864A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115018683A (en) * | 2022-08-05 | 2022-09-06 | 深圳市朝阳辉电气设备有限公司 | Hierarchical access implementation method for smart city space-time cloud platform |
CN116307792A (en) * | 2022-10-12 | 2023-06-23 | 广州市阿尔法软件信息技术有限公司 | Urban physical examination subject scene-oriented evaluation method and device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105938481A (en) * | 2016-04-07 | 2016-09-14 | 北京航空航天大学 | Anomaly detection method of multi-mode text data in cities |
CN110442715A (en) * | 2019-07-31 | 2019-11-12 | 北京大学 | A kind of conurbation geographical semantics method for digging based on polynary big data |
Non-Patent Citations (1)
Title |
---|
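Yu Lu et al.: "Research on urban functional area identification based on spatio-temporal semantic mining", Journal of Sichuan University (Natural Science Edition), 26 April 2019 (2019-04-26), pages 246 - 252 * |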
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115018683A (en) * | 2022-08-05 | 2022-09-06 | 深圳市朝阳辉电气设备有限公司 | Hierarchical access implementation method for smart city space-time cloud platform |
CN115018683B (en) * | 2022-08-05 | 2022-11-18 | 深圳市朝阳辉电气设备有限公司 | Hierarchical access implementation method for smart city space-time cloud platform |
CN116307792A (en) * | 2022-10-12 | 2023-06-23 | 广州市阿尔法软件信息技术有限公司 | Urban physical examination subject scene-oriented evaluation method and device |
CN116307792B (en) * | 2022-10-12 | 2024-03-12 | 广州市阿尔法软件信息技术有限公司 | Urban physical examination subject scene-oriented evaluation method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Cao et al. | A density-based method for adaptive LDA model selection | |
CN106919689A (en) | Professional domain knowledge mapping dynamic fixing method based on definitions blocks of knowledge | |
Ding et al. | A multiway p-spectral clustering algorithm | |
CN113627864A (en) | Urban functional area identification process based on time-space semantic mining | |
Lei et al. | An incremental clustering algorithm based on grid | |
Zhang et al. | Computing the minimum-support for mining frequent patterns | |
Kanagal et al. | Indexing correlated probabilistic databases | |
CN116756347B (en) | Semantic information retrieval method based on big data | |
Umarani et al. | A study on effective mining of association rules from huge databases | |
Sabah et al. | Big data with decision tree induction | |
Liu et al. | Community detection based on topic distance in social tagging networks | |
Chen et al. | Feature selection based on BP neural network and adaptive particle swarm algorithm | |
Li et al. | High resolution radar data fusion based on clustering algorithm | |
CN108241669A (en) | A kind of construction method and system of adaptive text feature cluster | |
Zhu et al. | Burst topic detection in real time spatial–temporal data stream | |
Hou et al. | A clustering algorithm based on matrix over high dimensional data stream | |
CN111858946A (en) | Construction method of tobacco monopoly market supervision big data E-R model | |
Li | [Retracted] Multidimensional Discrete Big Data Clustering Algorithm Based on Dynamic Grid | |
Mangalampalli et al. | Fuzzy Logic-based Preprocessing for Fuzzy Association Rule Mining | |
Li et al. | A general feature abstraction method for clustering algorithm | |
Shao et al. | An Incremental Clustering Algorithm Based on CFS | |
Dehideniya et al. | Dynamic partitional clustering using multi-agent technology | |
Wan et al. | GMM-ClusterForest: a novel indexing approach for multi-features based similarity search in high-dimensional spaces | |
Hu et al. | An improved text clustering method based on hybrid model | |
Han et al. | Principles and Perspectives of Granular Computing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |