CN113627864A - Urban functional area identification process based on time-space semantic mining - Google Patents

Info

Publication number
CN113627864A
Authority
CN
China
Prior art keywords
area
function
poi
text
functional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010373505.8A
Other languages
Chinese (zh)
Inventor
孙勇
蔡绍硕
蔡同建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Zhongchengshi Big Data Co ltd
Original Assignee
Wuhan Zhongchengshi Big Data Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Zhongchengshi Big Data Co ltd
Priority to CN202010373505.8A
Publication of CN113627864A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an urban functional area identification process based on spatio-temporal semantic mining, involving documents, words, basic functional units, spatio-temporal data, a topic model, document-topic distributions and unit-function distributions. The latent functions of an area are discovered with a topic model, by analogy with topic mining in text: a basic functional unit corresponds to a document in a corpus, the spatio-temporal data within the unit correspond to the words of the document, and the unit-function distribution produced by the topic model corresponds to the document-topic distribution. The urban spatio-temporal data used are Sina Weibo location check-in records, a typical source of such data. Each check-in record contains user information, the spatial coordinates of the check-in location, the posting time, the posted text and the like, and reflects people's dynamic activity patterns from several angles. The POIs in the study area are obtained from Baidu Maps, and functional identification of the areas is thereby achieved.

Description

Urban functional area identification process based on time-space semantic mining
Technical Field
The invention relates to the technical field of urban functional area identification, in particular to an urban functional area identification process based on space-time semantic mining.
Background
Traditional research on urban functional zoning mostly relies on data obtained by satellite remote sensing, questionnaire surveys, field visits and the like, supplemented by an index system, to identify urban functional zones [1-3]. These methods, however, incur high labor costs, and the analysis is affected by the subjective judgment of the investigators, so accurate long-term dynamic monitoring of urban functional zones is difficult.
Disclosure of Invention
The invention aims to provide an urban functional area identification process based on spatio-temporal semantic mining, so as to solve the problems described in the background.
To achieve this purpose, the invention provides the following technical solution: an urban functional area identification process based on spatio-temporal semantic mining, involving documents, words, basic functional units, spatio-temporal data, a topic model, document-topic distributions and unit-function distributions. The latent functions of an area are discovered with a topic model, by analogy with topic mining in text: a basic functional unit corresponds to a document in a corpus, the spatio-temporal data within the unit correspond to the words of the document, and the unit-function distribution produced by the topic model corresponds to the document-topic distribution. The urban spatio-temporal data used are Sina Weibo location check-in records, a typical source of such data. Each check-in record contains user information, the spatial coordinates of the check-in location, the posting time, the posted text and the like, and reflects people's dynamic activity patterns from several angles. The POIs (points of interest) in the study area are obtained from Baidu Maps; each record contains the name, spatial coordinates, address, type and the like of a physical entity, and the POI category density features of the different basic functional units are computed from the types.
Further, the process comprises the following steps. First, the study area is divided into spatially independent basic functional units, using buildings as the basis of division, and the discrete microblog check-in data are converted into check-in event sets according to their spatial coordinates and assigned to the units. Then, for each basic functional unit, the behavior pattern and text features of its check-in event set are extracted, the POI category density is calculated, and these are fed into the DMR topic model to obtain the function vector of the unit. Because the resulting function vectors carry no explicit functional semantics, they are clustered to obtain clusters of units with similar functions. Finally, the functional attributes of each area cluster are labeled according to the POI structure within its functional units and given a semantic interpretation, which completes the functional area identification.
Further, the process is as follows. To obtain the urban functional areas, the basic functional units are divided first: the outlines of the buildings in the study area are identified, and a connected-component labeling algorithm is then used to delineate the basic functional units. The check-in data in the study area are then spatialized and mapped to the basic units. The subsequent steps, namely basic unit feature extraction, latent function mining and functional area labeling, are described in detail below.
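The unit division and check-in mapping described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the building outlines have already been rasterized into a binary grid (1 = building pixel) and that check-in coordinates have already been converted to grid cells; the 4-neighbor connectivity and the function names `label_components` and `assign_checkins` are hypothetical choices, not details taken from the patent.

```python
from collections import deque

def label_components(grid):
    """Label 4-connected groups of non-zero cells; returns (labels, count)."""
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not labels[r][c]:
                count += 1                      # start a new basic unit
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:                    # breadth-first flood fill
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count

def assign_checkins(labels, cells):
    """Map check-in indices to the unit label of the grid cell they fall in."""
    units = {}
    for idx, (r, c) in enumerate(cells):
        unit = labels[r][c]
        if unit:                                # check-ins outside all units are dropped
            units.setdefault(unit, []).append(idx)
    return units
```

In this sketch a check-in that falls outside every labeled unit is simply discarded; whether the patent handles such records differently is not stated.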
Further, regarding the behavior pattern: in the location check-in data, each check-in of a user can be represented as C = {user, latitude, longitude, time, text}, where user is the user identifier; latitude is the latitude of the check-in location; longitude is the longitude of the check-in location; time is the check-in time; and text is the text posted at check-in. Together these fields constitute one movement behavior of the user, indicating that the user appeared at a certain place at a certain time. The user behavior pattern of a basic functional unit is defined as the average number of user appearances in that unit in each time interval. A day is divided into 12 periods of 2 hours each, and working days are distinguished from weekends, giving 24 time intervals. The average number of check-in behaviors C in each interval is counted for each area to form the user behavior pattern matrix P. In this matrix the horizontal axis represents the time intervals t1, …, t24 and the vertical axis represents the areas R1, …, Rn, where n is the number of basic units; each entry is the average number of behaviors occurring in area Rj during interval ti (for example, the shaded entry 6). Each area thus has a 24-dimensional behavior pattern vector.
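As a minimal sketch of how such a matrix could be built, the following assumes each check-in has already been assigned to a unit index and that "average" means the count divided by the number of distinct weekdays (or weekend days) observed in the data. Both the averaging convention and the names `slot_index` and `behavior_pattern_matrix` are assumptions made for illustration.

```python
from datetime import datetime

def slot_index(ts):
    """Map a timestamp to one of 24 slots: 0-11 on working days, 12-23 on weekends."""
    two_hour_period = ts.hour // 2
    is_weekend = ts.weekday() >= 5
    return two_hour_period + (12 if is_weekend else 0)

def behavior_pattern_matrix(checkins, n_regions):
    """checkins: iterable of (region_index, datetime). Returns an n_regions x 24 matrix."""
    counts = [[0] * 24 for _ in range(n_regions)]
    weekdays, weekends = set(), set()
    for region, ts in checkins:
        counts[region][slot_index(ts)] += 1
        (weekends if ts.weekday() >= 5 else weekdays).add(ts.date())
    n_weekdays = max(len(weekdays), 1)          # avoid division by zero
    n_weekends = max(len(weekends), 1)
    return [
        [c / (n_weekdays if j < 12 else n_weekends) for j, c in enumerate(row)]
        for row in counts
    ]
```

Each row of the returned matrix is the 24-dimensional behavior pattern vector of one unit.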
Furthermore, the text of location check-ins consists mostly of short texts, from which features are difficult to extract. To alleviate feature sparsity, a feature expansion method based on the Word2vec word vector model is used to expand the text features within each area. Word2vec projects words into a vector space and produces distributed word representations [15, 16]. Building on the distributional hypothesis of word semantics, it is a neural-network-based word vector training model that obtains a low-dimensional vector for a target word from the relationship between that word and its context. Training is efficient, and word vectors trained on a large corpus show strong syntactic and semantic correlation. Because the Word2vec model can discover semantic relations between words, it is used to find words similar to the keywords and thereby expand the features of the short texts, which also strengthens the topics to some extent and better reflects functionality. The specific steps of text feature expansion are as follows:
S1: Data preprocessing: segment the collected microblog corpus into words and remove stop words and noise words;
S2: Word vector model training: configure the Word2vec model parameters and train the model on the data;
S3: Keyword extraction: statistical analysis shows that the average length of the corpus texts is 17 words, so the TF-IDF value of each word in the text to be expanded is calculated and the top 10 words are selected as keywords;
S4: Text expansion: traverse the keywords and, for each, add the 5 nearest words according to the trained Word2vec model as expanded text features.
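Steps S3 and S4 can be sketched as below. To stay self-contained, the sketch does not train Word2vec; it assumes a dictionary `vectors` mapping each word to an embedding already produced by such a model, uses cosine similarity as the notion of "nearest", and applies a smoothed IDF. These choices and all function names are illustrative assumptions rather than details given in the patent.

```python
import math
from collections import Counter

def tfidf_keywords(doc, corpus, top_k=10):
    """Rank the words of one tokenized text by TF-IDF and keep the top_k."""
    tf = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in corpus if word in d)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0   # smoothed IDF (assumed variant)
        scores[word] = (count / len(doc)) * idf
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_words(word, vectors, top_n=5):
    """Return the top_n words whose embeddings are most similar to `word`."""
    if word not in vectors:
        return []
    sims = [(cosine(vectors[word], vec), other)
            for other, vec in vectors.items() if other != word]
    sims.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in sims[:top_n]]

def expand_text(doc, corpus, vectors, top_k=10, top_n=5):
    """Append the nearest neighbors of each keyword to the original tokens."""
    expanded = list(doc)
    for keyword in tfidf_keywords(doc, corpus, top_k):
        for neighbor in nearest_words(keyword, vectors, top_n):
            if neighbor not in expanded:
                expanded.append(neighbor)
    return expanded
```

The defaults `top_k=10` and `top_n=5` mirror the values stated in steps S3 and S4.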
Further, a POI density feature vector is constructed for each basic functional unit. For each region r, let the number of POIs of the i-th category be $n_{r,i}$ and the number of all POIs in the region be $s_r$; the density $v_{r,i}$ of the i-th category of POI in the region is then
$$v_{r,i} = \frac{n_{r,i}}{s_r}$$
The POI density feature vector of region r is $x_r = (v_{r,1}, v_{r,2}, \ldots, v_{r,F}, 1)$, where F is the number of POI categories and the final 1 is a default feature, included so that the mean of each topic can be described later.
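A minimal sketch of this feature vector follows. It assumes the density of a category is its count divided by the total number of POIs in the unit, as the definitions above suggest, and that POI categories are encoded as integers from 0 to F-1; the function name is hypothetical.

```python
def poi_density_vector(poi_categories, n_categories):
    """Build (v_1, ..., v_F, 1) for one unit from the category index of each POI."""
    counts = [0] * n_categories
    for category in poi_categories:
        counts[category] += 1
    total = len(poi_categories)
    densities = [c / total if total else 0.0 for c in counts]
    return densities + [1.0]        # trailing default feature
```

For example, a unit with POI categories `[0, 0, 2, 1]` and three categories yields `[0.5, 0.25, 0.25, 1.0]`.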
Furthermore, the behavior pattern and text features obtained above are used as the "text" of a basic functional unit, and the "texts" of all basic units are collected into a "document set" that is input to the DMR topic model. Each basic functional unit also has its POI density features as prior data. The hidden variables, namely the function vectors of the basic units, are inferred from the features in the observable "texts", so that each basic unit can finally be represented as a probability distribution over the functions. Specifically:
S1: Given the r-th basic functional unit in the study area, for each latent function k a vector $\lambda_k$ is drawn from a Gaussian distribution governed by the hyperparameter $\sigma$; $\lambda_k$ has the same length as the POI density feature vector. The function-feature distribution $\beta_k$ is drawn from a Dirichlet distribution with prior parameter $\eta$;
S2: Let $\alpha_{r,k} = \exp(x_r^{T} \lambda_k)$, where $x_r$ is the POI density feature vector of basic unit r; the function distribution $\theta_r$ follows a Dirichlet distribution with prior parameter $\alpha_r$;
S3: For the n-th feature $f_{r,n}$ in basic unit r, its function assignment $z_{r,n}$ follows the multinomial distribution with parameter $\theta_r$ obtained in step S2, which determines the function-feature distribution $\beta_{z_{r,n}}$ of this feature;
S4: The feature $f_{r,n}$ is generated from the multinomial distribution with parameter $\beta_{z_{r,n}}$;
S5: Traverse the N features in basic unit r, repeating steps S3-S4, to generate basic unit r;
S6: Traverse the R basic units in the study area, repeating steps S1-S5, to generate the whole study area.
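The generative process in steps S1 to S6 can be simulated with the short sketch below. It covers only forward sampling, not inference of the function vectors. As an assumption following the usual formulation of DMR, the vectors λ_k and distributions β_k are drawn once and shared by all units; Dirichlet samples are obtained by normalizing Gamma draws; features are encoded as integer indices; and all function names and default values are illustrative.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    if total == 0.0:                              # guard against numerical underflow
        return [1.0 / len(alpha)] * len(alpha)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_units(x, n_functions, vocab_size, n_features,
                   sigma=1.0, eta=0.5, seed=0):
    """x: list of POI density vectors x_r. Returns [(theta_r, features_r), ...]."""
    rng = random.Random(seed)
    dim = len(x[0])
    # S1: lambda_k ~ N(0, sigma^2 I), beta_k ~ Dirichlet(eta)
    lam = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(n_functions)]
    beta = [sample_dirichlet([eta] * vocab_size, rng) for _ in range(n_functions)]
    units = []
    for x_r in x:
        # S2: alpha_{r,k} = exp(x_r^T lambda_k), theta_r ~ Dirichlet(alpha_r)
        alpha_r = [math.exp(sum(a * b for a, b in zip(x_r, lam_k))) for lam_k in lam]
        theta_r = sample_dirichlet(alpha_r, rng)
        features = []
        for _ in range(n_features):               # S5: loop over the N features
            z = sample_categorical(theta_r, rng)  # S3: function assignment z_{r,n}
            features.append(sample_categorical(beta[z], rng))  # S4: feature f_{r,n}
        units.append((theta_r, features))
    return units
```

In practice the function vectors θ_r are inferred from observed features rather than sampled; an existing DMR implementation would normally be used for that step.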
Further, the function vectors have no explicit semantic expression, so the functional areas cannot be judged qualitatively from them alone. The function vectors are therefore first clustered with the k-means algorithm so that areas with similar functions are aggregated, with the number of cluster centers determined by the average silhouette coefficient. Each aggregated area cluster is then given a functional label, namely science area, residential area, commercial area, working area, living facility area or mixed area, according to the POI distribution structure inside the cluster. The POI distribution structure comprises the Frequency Density (FD) of each POI category in the area cluster and the Category Proportion (CP) of each POI category in the area cluster. For an area cluster c they are calculated as follows:
$$FD_{c,i} = \frac{n_{c,i}}{N_i}, \quad i = 1, \ldots, M$$
$$CP_{c,i} = \frac{FD_{c,i}}{\sum_{j=1}^{M} FD_{c,j}}$$
where i is a POI category; M is the total number of POI categories; $n_{c,i}$ is the number of POIs of category i in c; $N_i$ is the total number of POIs of category i; $FD_{c,i}$ is the frequency density of category-i POIs in area cluster c; and $CP_{c,i}$ is the ratio of the frequency density of category-i POIs in area cluster c to the sum of the frequency densities of all POI categories in the cluster.
The category proportion expresses the importance of one category of POI in an area, and different POI categories are related to the functions to different degrees. The following rule is adopted: in an area cluster, if the sum of the category proportions of the POIs belonging to one function exceeds 50%, that function dominates the cluster and the cluster is a single-function urban area; if no function's sum of category proportions exceeds 50%, the area cluster is a mixed area.
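The labeling rule can be sketched as follows, under the assumption that FD is the cluster's count of a category divided by that category's total count, and CP is FD normalized over all categories, as the definitions above indicate. The mapping from POI category to function (`category_to_function`) is a hypothetical input; the patent does not list it here.

```python
def frequency_density(cluster_counts, total_counts):
    """FD_{c,i} = n_{c,i} / N_i for each POI category i."""
    return [n_ci / n_i if n_i else 0.0
            for n_ci, n_i in zip(cluster_counts, total_counts)]

def category_proportion(fd):
    """CP_{c,i} = FD_{c,i} / sum_j FD_{c,j}."""
    total = sum(fd)
    return [value / total if total else 0.0 for value in fd]

def label_cluster(cluster_counts, total_counts, category_to_function, threshold=0.5):
    """Return the dominant function if its summed CP exceeds the threshold, else 'mixed'."""
    cp = category_proportion(frequency_density(cluster_counts, total_counts))
    share = {}
    for i, proportion in enumerate(cp):
        function = category_to_function[i]
        share[function] = share.get(function, 0.0) + proportion
    best = max(share, key=share.get)
    return best if share[best] > threshold else "mixed"
```

With three categories mapped to distinct functions and equal frequency densities, no function exceeds the 50% threshold and the cluster is labeled as mixed.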
The beneficial effects of the invention are as follows. Topic models are widely used in natural language processing and perform well in mining the latent semantics of text: a document is mapped from the term space to the topic space and represented as a probability distribution over several topics. The same idea carries over to discovering the functions of a region: a region is regarded as a document, the spatio-temporal activity data in the region are the words of the document, and the functions of the region are the topics of the document, so the function distribution of each region can be obtained with a topic model. The LDA (Latent Dirichlet Allocation) topic model [11] is a classic model for text semantic mining. It is a Bayesian generative model with latent variables that represents a document by a distribution over topics and characterizes a topic by a distribution over words. In expressing topics, however, it considers only the words in the text and ignores the contribution of other data associated with the document, so a number of extended models based on LDA have appeared. The DMR (Dirichlet-multinomial regression) model [12] is a topic model derived from LDA. Compared with other topic models, the Dirichlet prior parameters of a document's topic distribution in DMR take into account the influence of features associated with the document, so that richer and more effective auxiliary features can be introduced and the topic extraction effect of the model is enhanced.
Based on the above, the invention provides an urban functional area identification method based on spatio-temporal semantic mining, using location check-in data and POI data. Buildings serve as the basic functional units of the study area; the user behavior patterns and text features within each unit are extracted from the location check-in data and input, together with the POI density, into the DMR topic model to obtain the function vector of each unit. The vectors are then clustered so that areas with similar functions are aggregated, and the resulting area clusters are labeled and semantically interpreted, achieving functional identification of the areas.
Drawings
FIG. 1 is a diagram of the analogy between region-function and document-topic in the urban functional area identification process based on spatio-temporal semantic mining according to the present invention;
FIG. 2 is a basic flow chart of the urban functional area identification process based on spatio-temporal semantic mining according to the present invention;
FIG. 3 is a diagram of the user behavior pattern matrix of the urban functional area identification process based on spatio-temporal semantic mining according to the present invention;
FIG. 4 is a diagram of the POI types of the urban functional area identification process based on spatio-temporal semantic mining according to the present invention;
FIG. 5 is a diagram of the DMR model generation process of the urban functional area identification process based on spatio-temporal semantic mining according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Referring to fig. 1-5, the present invention provides a technical solution:
An urban functional area identification process based on spatio-temporal semantic mining, involving documents, words, basic functional units, spatio-temporal data, a topic model, document-topic distributions and unit-function distributions. The latent functions of an area are discovered with a topic model, by analogy with topic mining in text: a basic functional unit corresponds to a document in a corpus, the spatio-temporal data within the unit correspond to the words of the document, and the unit-function distribution produced by the topic model corresponds to the document-topic distribution. The urban spatio-temporal data used are Sina Weibo location check-in records, a typical source of such data. Each check-in record contains user information, the spatial coordinates of the check-in location, the posting time, the posted text and the like, and reflects people's dynamic activity patterns from several angles. The POIs (points of interest) in the study area are obtained from Baidu Maps; each record contains the name, spatial coordinates, address, type and the like of a physical entity, and the POI category density features of the different basic functional units are computed from the types.
In a further embodiment, the study area is first divided into spatially independent basic functional units, using buildings as the basis of division, and the discrete microblog check-in data are converted into check-in event sets according to their spatial coordinates and assigned to the units. Then, for each basic functional unit, the behavior pattern and text features of its check-in event set are extracted, the POI category density is calculated, and these are fed into the DMR topic model to obtain the function vector of the unit. Because the resulting function vectors carry no explicit functional semantics, they are clustered to obtain clusters of units with similar functions. Finally, the functional attributes of each area cluster are labeled according to the POI structure within its functional units and given a semantic interpretation, which completes the functional area identification.
In a further embodiment, the process is as follows. To obtain the urban functional areas, the basic functional units are divided first: the outlines of the buildings in the study area are identified, and a connected-component labeling algorithm is then used to delineate the basic functional units. The check-in data in the study area are then spatialized and mapped to the basic units. The subsequent steps, namely basic unit feature extraction, latent function mining and functional area labeling, are described in detail below.
In a further embodiment, regarding the behavior pattern: in the location check-in data, each check-in of a user can be represented as C = {user, latitude, longitude, time, text}, where user is the user identifier; latitude is the latitude of the check-in location; longitude is the longitude of the check-in location; time is the check-in time; and text is the text posted at check-in. Together these fields constitute one movement behavior of the user, indicating that the user appeared at a certain place at a certain time. The user behavior pattern of a basic functional unit is defined as the average number of user appearances in that unit in each time interval. A day is divided into 12 periods of 2 hours each, and working days are distinguished from weekends, giving 24 time intervals. The average number of check-in behaviors C in each interval is counted for each area to form the user behavior pattern matrix P. In this matrix the horizontal axis represents the time intervals t1, …, t24 and the vertical axis represents the areas R1, …, Rn, where n is the number of basic units; each entry is the average number of behaviors occurring in area Rj during interval ti (for example, the shaded entry 6). Each area thus has a 24-dimensional behavior pattern vector.
In a further embodiment, the text of location check-ins consists mostly of short texts, from which features are difficult to extract. To alleviate feature sparsity, a feature expansion method based on the Word2vec word vector model is used to expand the text features within each area. Word2vec projects words into a vector space and produces distributed word representations [15, 16]. Building on the distributional hypothesis of word semantics, it is a neural-network-based word vector training model that obtains a low-dimensional vector for a target word from the relationship between that word and its context. Training is efficient, and word vectors trained on a large corpus show strong syntactic and semantic correlation. Because the Word2vec model can discover semantic relations between words, it is used to find words similar to the keywords and thereby expand the features of the short texts, which also strengthens the topics to some extent and better reflects functionality. The specific steps of text feature expansion are as follows:
S1: Data preprocessing: segment the collected microblog corpus into words and remove stop words and noise words;
S2: Word vector model training: configure the Word2vec model parameters and train the model on the data;
S3: Keyword extraction: statistical analysis shows that the average length of the corpus texts is 17 words, so the TF-IDF value of each word in the text to be expanded is calculated and the top 10 words are selected as keywords;
S4: Text expansion: traverse the keywords and, for each, add the 5 nearest words according to the trained Word2vec model as expanded text features.
In a further embodiment, a POI density feature vector is constructed for each basic functional unit. For each region r, let the number of POIs of the i-th category be $n_{r,i}$ and the number of all POIs in the region be $s_r$; the density $v_{r,i}$ of the i-th category of POI in the region is then
$$v_{r,i} = \frac{n_{r,i}}{s_r}$$
The POI density feature vector of region r is $x_r = (v_{r,1}, v_{r,2}, \ldots, v_{r,F}, 1)$, where F is the number of POI categories and the final 1 is a default feature, included so that the mean of each topic can be described later.
In a further embodiment, the behavior pattern and text features obtained above are used as the "text" of a basic functional unit, and the "texts" of all basic units are collected into a "document set" that is input to the DMR topic model. Each basic functional unit also has its POI density features as prior data. The hidden variables, namely the function vectors of the basic units, are inferred from the features in the observable "texts", so that each basic unit can finally be represented as a probability distribution over the functions. Specifically:
S1: Given the r-th basic functional unit in the study area, for each latent function k a vector $\lambda_k$ is drawn from a Gaussian distribution governed by the hyperparameter $\sigma$; $\lambda_k$ has the same length as the POI density feature vector. The function-feature distribution $\beta_k$ is drawn from a Dirichlet distribution with prior parameter $\eta$;
S2: Let $\alpha_{r,k} = \exp(x_r^{T} \lambda_k)$, where $x_r$ is the POI density feature vector of basic unit r; the function distribution $\theta_r$ follows a Dirichlet distribution with prior parameter $\alpha_r$;
S3: For the n-th feature $f_{r,n}$ in basic unit r, its function assignment $z_{r,n}$ follows the multinomial distribution with parameter $\theta_r$ obtained in step S2, which determines the function-feature distribution $\beta_{z_{r,n}}$ of this feature;
S4: The feature $f_{r,n}$ is generated from the multinomial distribution with parameter $\beta_{z_{r,n}}$;
S5: Traverse the N features in basic unit r, repeating steps S3-S4, to generate basic unit r;
S6: Traverse the R basic units in the study area, repeating steps S1-S5, to generate the whole study area.
In a further embodiment, the function vectors have no explicit semantic expression, so the functional areas cannot be judged qualitatively from them alone. The function vectors are therefore first clustered with the k-means algorithm so that areas with similar functions are aggregated, with the number of cluster centers determined by the average silhouette coefficient. Each aggregated area cluster is then given a functional label according to the POI distribution structure inside the cluster. The POI distribution structure comprises the Frequency Density (FD) of each POI category in the area cluster and the Category Proportion (CP) of each POI category in the area cluster. For an area cluster c they are calculated as follows:
$$FD_{c,i} = \frac{n_{c,i}}{N_i}, \quad i = 1, \ldots, M$$
$$CP_{c,i} = \frac{FD_{c,i}}{\sum_{j=1}^{M} FD_{c,j}}$$
where i is a POI category; M is the total number of POI categories; $n_{c,i}$ is the number of POIs of category i in c; $N_i$ is the total number of POIs of category i; $FD_{c,i}$ is the frequency density of category-i POIs in area cluster c; and $CP_{c,i}$ is the ratio of the frequency density of category-i POIs in area cluster c to the sum of the frequency densities of all POI categories in the cluster.
The category proportion expresses the importance of one category of POI in an area, and different POI categories are related to the functions to different degrees. The following rule is adopted: in an area cluster, if the sum of the category proportions of the POIs belonging to one function exceeds 50%, that function dominates the cluster and the cluster is a single-function urban area; if no function's sum of category proportions exceeds 50%, the area cluster is a mixed area.
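The clustering step of this embodiment, k-means with the number of clusters chosen by the average silhouette coefficient, can be sketched in plain Python as below. The deterministic farthest-point initialization, the Euclidean distance and the function names are assumptions made so that the example is reproducible; the patent does not specify them, and a library implementation would normally be preferred for real data.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def init_centers(points, k):
    """Farthest-point initialization (an assumed, deterministic choice)."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(list(farthest))
    return centers

def kmeans(points, k, n_iter=50):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = init_centers(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:                      # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def mean_silhouette(points, labels):
    """Average silhouette coefficient; singleton clusters score 0."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                  # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())    # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def choose_k(points, k_values):
    """Pick the k whose clustering has the highest average silhouette."""
    return max(k_values, key=lambda k: mean_silhouette(points, kmeans(points, k)))
```

Here `points` would be the function vectors of the basic units, and the labels returned by `kmeans` for the chosen k define the area clusters that are subsequently labeled.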
Topic models are widely applied in the field of natural language processing and perform well in mining the implicit semantics of text: a document is mapped from the term space to the topic space and represented as a probability distribution over several topics. The same idea can be applied to discovering the functions of a region. A region is regarded as a document, the spatio-temporal activity data in the region are the words in the document, and the function of the region is the topic of the document, so the function distribution of each region can be obtained with a topic model. The LDA (Latent Dirichlet Allocation) topic model [11] is a classic model for text semantic mining. It is a Bayesian generative model with latent variables that represents a document by a distribution over topics and characterizes each topic by a distribution over words. However, when expressing topics it considers only the words in the text and ignores the contribution of other data associated with the document, so a number of extended models based on LDA have appeared. The DMR (Dirichlet multinomial regression) model [12] is a topic model derived from LDA. Compared with other topic models, the prior parameters of the Dirichlet topic distribution of a document in this model take the influence of document-related features into account, which allows more complex and effective auxiliary features to be introduced and enhances the topic extraction performance of the model.
Based on the above research, an urban functional area identification method based on spatio-temporal semantic mining is proposed that uses location check-in data and POI data. Buildings are used as the basic functional units of the research area. The user behavior patterns and text features within each functional unit are extracted from the location check-in data and input, together with the POI density, into the DMR topic model to obtain the function vector of the functional unit. The vectors are then clustered so that areas with similar functions are aggregated, and the resulting area labels are given a semantic interpretation to identify the function of each area.
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (8)

1. An urban functional area identification process based on spatio-temporal semantic mining, characterized in that: the process involves documents, words, basic functional units, spatio-temporal data, a topic model, document topic distributions, and unit function distributions. The functions implied by the regions are discovered through the topic model in a manner similar to text topic mining: a basic functional unit is equivalent to a document in a corpus, the spatio-temporal data in the functional unit are analogous to the words in the document, and the unit function distribution obtained from the topic model is equivalent to the document topic distribution. The urban spatio-temporal data used are the representative Sina Weibo location check-in data; each check-in record comprises user information, the spatial coordinates of the check-in location, the posting time, the posted text, and so on, and can reflect people's dynamic activity patterns from different angles. In addition, the POIs in the research area are obtained from Baidu Maps; each record comprises the name, spatial coordinates, address, type, and so on of the physical entity, and the POI category density features of the different basic functional units are calculated according to the POI types.
2. The urban functional area identification process based on spatio-temporal semantic mining according to claim 1, characterized by comprising the following steps: first, the research area is divided into spatially independent basic functional units using buildings as the basis of division, and the discrete microblog check-in data are converted into check-in event sets according to their spatial coordinates and assigned to the units; then, taking each basic functional unit as the object, the behavior patterns and text features of its check-in event set are extracted, the POI category density is calculated, and these are fed into the DMR topic model to obtain the function vector of the basic functional unit; because the resulting function vectors have no explicit functional semantics, they are clustered to obtain clusters of units with similar functions; finally, the functional attributes of each area cluster are labeled according to the POI structure within its functional units, a semantic interpretation is given, and functional area identification is completed.
3. The urban functional area identification process based on spatio-temporal semantic mining according to claim 1, characterized in that: to obtain the urban functional areas, basic functional unit division is carried out first, in which the outlines of the buildings in the research area are identified and a connected-component labeling algorithm is then used to divide the basic functional units; the check-in data in the research area are then spatialized and mapped to each basic unit, followed by the processes of basic unit feature extraction, latent function mining, and functional area labeling.
4. The urban functional area identification process based on spatio-temporal semantic mining according to claim 1, wherein, for the behavior pattern, a location check-in record can be expressed as C = {user, latitude, longitude, time, text}, where user is the user ID; latitude is the latitude of the check-in location; longitude is the longitude of the check-in location; time is the check-in time; and text is the text posted at check-in. Together these constitute one movement behavior of the user, indicating that the user appeared at a certain place at a certain time. The user behavior pattern in a basic functional unit is defined as the average number of times users appear in that basic functional unit within a given time period. A day is divided into 12 time periods of 2 hours each, and working days and weekends are distinguished, giving 24 time intervals. The average number of user check-in behaviors C in each time interval of each area is counted to form the user behavior pattern matrix P. In this matrix, the horizontal axis represents the time intervals t1, …, t24, the vertical axis represents the areas R1, …, Rn, where n is the number of basic units, and each entry is the average number of behaviors occurring in area Rj during time interval ti (for example, the shaded value 6). A 24-dimensional behavior pattern vector is thus obtained for each area.
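As an illustration of the 24-interval behavior pattern, the following Python sketch builds the vector for one basic unit from its check-in timestamps. It rests on assumptions the text does not settle: the average is taken over the number of distinct observed weekday or weekend dates, weekends are Saturday and Sunday with no holiday calendar, and the function name is hypothetical.

```python
from datetime import datetime


def behavior_pattern_vector(checkin_times):
    """Return a 24-dimensional behavior pattern vector for one basic unit.

    Slots 0-11 are the twelve 2-hour periods on working days,
    slots 12-23 are the same periods on weekends.
    """
    counts = [0] * 24
    weekday_dates, weekend_dates = set(), set()
    for t in checkin_times:
        is_weekend = t.weekday() >= 5           # Saturday = 5, Sunday = 6
        slot = t.hour // 2 + (12 if is_weekend else 0)
        counts[slot] += 1
        (weekend_dates if is_weekend else weekday_dates).add(t.date())
    # Assumption: average per observed day of the corresponding day type.
    n_weekday = max(len(weekday_dates), 1)
    n_weekend = max(len(weekend_dates), 1)
    return [c / (n_weekend if i >= 12 else n_weekday)
            for i, c in enumerate(counts)]


# Hypothetical check-ins for one unit (4 and 5 May 2020 are working days,
# 9 May 2020 is a Saturday).
checkins = [
    datetime(2020, 5, 4, 9, 30),
    datetime(2020, 5, 5, 9, 10),
    datetime(2020, 5, 5, 8, 5),
    datetime(2020, 5, 9, 21, 0),
]
vector = behavior_pattern_vector(checkins)
print(vector[4], vector[22])  # 08:00-10:00 on working days, 20:00-22:00 on weekends
```

Stacking one such vector per basic unit gives the rows of the behavior pattern matrix P.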
5. The urban functional area identification process based on spatio-temporal semantic mining according to claim 1, wherein, because location check-in text is mostly short text from which features are difficult to extract, a feature expansion method based on the Word2vec word vector model is used to expand the text features within a region and alleviate the feature sparsity problem. Word2vec projects words into a vector space and belongs to the distributed representation family of word vectors [15, 16]. Based on the distributional hypothesis of word semantics, it proposes a neural-network-based word vector training model that obtains a low-dimensional vector for a target word from the relationship between the target word and its context. Training is efficient, and word vectors trained on a large-scale corpus show strong correlations in syntax and semantics. Because the Word2vec model can discover semantic relationships between words, it is used to find words similar to the keywords in order to expand the features of the short text, which also strengthens the topic to some extent and better reflects functionality. The specific steps of text feature expansion are as follows:
s1: data preprocessing: segment the large collected microblog corpus into words and remove stop words and noise words;
s2: training the word vector model: configure the Word2vec model parameters and train the model on the data;
s3: keyword extraction: statistical analysis shows that the average length of the texts in the corpus is 17 words, so the TF-IDF value of each word in the text to be expanded is calculated and the top 10 words are selected as keywords;
s4: text expansion: traverse the keywords and, for each one, take its 5 nearest words under the previously trained Word2vec model as expanded text features.
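Steps s3 and s4 can be sketched as follows. This is a simplified illustration, not the patented implementation: a plain dictionary of word vectors with cosine similarity stands in for a trained Word2vec model, the TF-IDF variant uses a smoothed IDF chosen here for convenience, and all function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=10):
    """Return the top_k tokens of doc_tokens ranked by TF-IDF (step s3)."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1   # smoothed IDF (assumption)
        scores[word] = count / len(doc_tokens) * idf
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_text(doc_tokens, corpus, vectors, top_k=10, n_neighbors=5):
    """Append the nearest words of each keyword to the text (step s4).

    vectors: {word: vector}; a stand-in for a trained Word2vec model.
    """
    expanded = list(doc_tokens)
    for keyword in tfidf_keywords(doc_tokens, corpus, top_k):
        if keyword not in vectors:
            continue
        neighbors = sorted(
            (w for w in vectors if w != keyword),
            key=lambda w: -cosine(vectors[keyword], vectors[w]),
        )[:n_neighbors]
        expanded.extend(neighbors)
    return expanded


# Toy corpus and vectors, invented purely for illustration.
corpus = [["coffee", "shop", "morning"], ["office", "meeting", "morning"],
          ["coffee", "latte"]]
vectors = {"coffee": [1.0, 0.0], "latte": [0.9, 0.1], "espresso": [0.8, 0.2],
           "office": [0.0, 1.0], "meeting": [0.1, 0.9]}
print(expand_text(["coffee", "morning"], corpus, vectors, top_k=1, n_neighbors=2))
```

With a real model, the neighbor lookup would be replaced by the model's own nearest-word query; the defaults of 10 keywords and 5 neighbors follow the values given in the steps.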
6. The urban functional area identification process based on spatio-temporal semantic mining according to claim 1, wherein a POI density feature vector is constructed for each basic functional unit: for each area r, the number of POIs of type i is $n_{ri}$ and the number of all POIs in area r is $s_r$, so the density $v_{ri}$ of type-i POIs in the area is
$$v_{ri} = \frac{n_{ri}}{s_r}$$
The POI density feature vector of area r is $x_r = (v_{r1}, v_{r2}, \ldots, v_{rF}, 1)$, where F is the number of POI categories and the final 1 is a default feature that is used later to describe the mean of each topic.
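A minimal Python sketch of this feature vector follows, taking the definitions in the claim at face value ($s_r$ as the count of all POIs in the area). The function name and the category names are hypothetical.

```python
def poi_density_vector(poi_types, categories):
    """Return x_r = (v_r1, ..., v_rF, 1) for one area.

    poi_types:  list with the type of every POI located in area r.
    categories: ordered list of the F POI categories.
    """
    total = len(poi_types)                       # s_r: all POIs in the area
    densities = []
    for category in categories:
        n_ri = sum(1 for t in poi_types if t == category)
        densities.append(n_ri / total if total else 0.0)
    return densities + [1.0]                     # trailing default feature


# Hypothetical area with four POIs and four categories.
x_r = poi_density_vector(["food", "food", "shop", "school"],
                         ["food", "shop", "school", "hospital"])
print(x_r)  # -> [0.5, 0.25, 0.25, 0.0, 1.0]
```

The trailing constant plays the role of an intercept when the vector is later multiplied with the per-function weights of the DMR model.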
7. The urban functional area identification process based on spatio-temporal semantic mining according to claim 1, wherein the obtained behavior patterns and text features are used as the "text" of a basic functional unit, and the "texts" of all basic units are collected into a "document set" that is input to the DMR topic model; meanwhile, each basic functional unit has its POI density features as prior data. The hidden variables, i.e. the function vectors of the basic units, are inferred from the features in the observable "texts", and each basic unit is finally represented as a probability distribution over the functions, specifically:
s1: given a basic functional unit r in the research area, for each latent function k, generate from the hyperparameter σ a Gaussian-distributed vector $\lambda_k$ with the same length as the POI density feature vector, and generate from the prior parameter η the function-feature Dirichlet distribution $\beta_k$;
s2: let $\alpha_{r,k} = \exp(x_r^{T} \lambda_k)$, where $x_r$ is the POI density feature vector of basic unit r, and let $\theta_r$ be the function distribution of the unit, drawn from a Dirichlet distribution with prior parameter $\alpha_r$;
s3: for the n-th feature $f_{r,n}$ in basic unit r, its function assignment $z_{r,n}$ follows the multinomial distribution with parameter $\theta_r$ obtained in step S2, which determines the function-feature distribution $\beta_{z_{r,n}}$ of this feature;
s4: generate the feature $f_{r,n}$ from the multinomial distribution with parameter $\beta_{z_{r,n}}$;
s5: traverse the N features in basic unit r, repeating steps S3-S4, to generate basic unit r;
s6: traverse the R basic units in the research area, repeating steps S1-S5, to generate the whole research area.
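The generative process in steps s1 to s6 can be simulated with the Python standard library. The sketch below makes several assumptions beyond the text: the Gaussian for $\lambda_k$ is zero-mean with standard deviation σ, and $\lambda_k$ and $\beta_k$ are drawn once and shared by all basic units, as in the usual DMR formulation. It only samples from the model and does not perform inference; all names and parameter values are hypothetical.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw one index according to the probability vector probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_unit(x_r, lambdas, betas, n_features, rng):
    """Generate the function distribution and features of one basic unit."""
    # s2: alpha_{r,k} = exp(x_r^T lambda_k), theta_r ~ Dirichlet(alpha_r)
    alpha = [math.exp(sum(x * l for x, l in zip(x_r, lam))) for lam in lambdas]
    theta = sample_dirichlet(alpha, rng)
    features = []
    for _ in range(n_features):                              # s5
        z = sample_categorical(theta, rng)                   # s3
        features.append(sample_categorical(betas[z], rng))   # s4
    return theta, features


def generate_research_area(xs, n_functions, vocab_size, n_features,
                           sigma=1.0, eta=0.5, seed=0):
    """Generate every basic unit in the research area (s1 and s6)."""
    rng = random.Random(seed)
    dim = len(xs[0])
    # s1: lambda_k ~ N(0, sigma^2) per dimension (zero mean is an assumption)
    lambdas = [[rng.gauss(0.0, sigma) for _ in range(dim)]
               for _ in range(n_functions)]
    # s1: beta_k ~ Dirichlet(eta), the function-feature distribution
    betas = [sample_dirichlet([eta] * vocab_size, rng)
             for _ in range(n_functions)]
    return [generate_unit(x, lambdas, betas, n_features, rng) for x in xs]


# Two hypothetical units, each with two POI densities plus the default feature.
units = generate_research_area([[0.5, 0.5, 1.0], [0.2, 0.8, 1.0]],
                               n_functions=3, vocab_size=6, n_features=20)
for theta, features in units:
    print([round(p, 3) for p in theta], features[:5])
```

In practice the direction is reversed: the features and POI densities are observed, and $\theta_r$ (the function vector) is inferred, for example with a sampler or variational method that is outside the scope of this sketch.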
8. The urban functional area identification process based on spatio-temporal semantic mining according to claim 1, characterized in that, because the function vectors have no explicit semantic meaning and functional areas cannot be judged qualitatively from them, the function vectors are first clustered with the k-means algorithm so that areas with similar functions are aggregated together, with the number of cluster centers determined by the average silhouette coefficient; each aggregated area cluster is then given a function label according to its internal POI distribution structure, which comprises the Frequency Density (FD) of each POI category in the area cluster and the Category Proportion (CP) of each POI category in the area cluster; for an area cluster c, these are calculated as follows:
$$FD_{c,i} = \frac{n_{c,i}}{N_i}$$
$$CP_{c,i} = \frac{FD_{c,i}}{\sum_{j=1}^{m} FD_{c,j}} \times 100\%$$
where i is a POI category; m is the total number of POI categories; $n_{c,i}$ is the number of POIs of category i in cluster c; $N_i$ is the total number of POIs of category i; $FD_{c,i}$ is the frequency density of category-i POIs in area cluster c; and $CP_{c,i}$ is the ratio of the frequency density of category-i POIs in area cluster c to the summed frequency density of all POI categories in that cluster.
The category proportion represents the importance of one type of POI within an area, while different POI types are associated with functions to different degrees. The research determines that, for a given area cluster, if the sum of the category proportions of the POIs belonging to one function exceeds 50%, the area cluster is dominated by that function and is a single-function urban area; if no function's summed category proportion exceeds 50%, the area cluster is a mixed area.
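The clustering step, k-means with the number of clusters chosen by the average silhouette coefficient, can be sketched in plain Python as below. The sketch assumes Euclidean distance and uses a deterministic farthest-point initialization to keep the example reproducible; the claim does not specify either choice, and the function names and sample points are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_point_init(points, k):
    """Pick k initial centers: the first point, then repeatedly the farthest one."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers


def kmeans(points, k, n_iter=100):
    """Lloyd's k-means; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new_centers.append([sum(col) / len(members) for col in zip(*members)]
                               if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def mean_silhouette(points, labels):
    """Average silhouette coefficient; singleton clusters score 0."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                  # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())    # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def choose_k(points, k_values):
    """Return the k in k_values with the highest average silhouette coefficient."""
    return max(k_values, key=lambda k: mean_silhouette(points, kmeans(points, k)))


# Three well-separated groups of hypothetical 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (0, 10), (0, 11), (1, 10), (1, 11)]
print(choose_k(points, [2, 3, 4]))  # -> 3
```

In the process described here, the points would be the function vectors produced by the DMR topic model, and the labels returned for the chosen k would define the area clusters that are then labeled by FD and CP.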
CN202010373505.8A 2020-05-06 2020-05-06 Urban functional area identification process based on time-space semantic mining Pending CN113627864A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010373505.8A CN113627864A (en) 2020-05-06 2020-05-06 Urban functional area identification process based on time-space semantic mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010373505.8A CN113627864A (en) 2020-05-06 2020-05-06 Urban functional area identification process based on time-space semantic mining

Publications (1)

Publication Number Publication Date
CN113627864A true CN113627864A (en) 2021-11-09

Family

ID=78376613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010373505.8A Pending CN113627864A (en) 2020-05-06 2020-05-06 Urban functional area identification process based on time-space semantic mining

Country Status (1)

Country Link
CN (1) CN113627864A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115018683A (en) * 2022-08-05 2022-09-06 深圳市朝阳辉电气设备有限公司 Hierarchical access implementation method for smart city space-time cloud platform
CN116307792A (en) * 2022-10-12 2023-06-23 广州市阿尔法软件信息技术有限公司 Urban physical examination subject scene-oriented evaluation method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105938481A (en) * 2016-04-07 2016-09-14 北京航空航天大学 Anomaly detection method of multi-mode text data in cities
CN110442715A (en) * 2019-07-31 2019-11-12 北京大学 A kind of conurbation geographical semantics method for digging based on polynary big data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
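Yu Lu et al.: "Research on urban functional area identification based on spatio-temporal semantic mining", Journal of Sichuan University (Natural Science Edition), 26 April 2019 (2019-04-26), pages 246 - 252 *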

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115018683A (en) * 2022-08-05 2022-09-06 深圳市朝阳辉电气设备有限公司 Hierarchical access implementation method for smart city space-time cloud platform
CN115018683B (en) * 2022-08-05 2022-11-18 深圳市朝阳辉电气设备有限公司 Hierarchical access implementation method for smart city space-time cloud platform
CN116307792A (en) * 2022-10-12 2023-06-23 广州市阿尔法软件信息技术有限公司 Urban physical examination subject scene-oriented evaluation method and device
CN116307792B (en) * 2022-10-12 2024-03-12 广州市阿尔法软件信息技术有限公司 Urban physical examination subject scene-oriented evaluation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination