CN113627864A

CN113627864A - Urban functional area identification process based on spatio-temporal semantic mining

Info

Publication number: CN113627864A
Application number: CN202010373505.8A
Authority: CN
Inventors: 孙勇; 蔡绍硕; 蔡同建
Original assignee: Wuhan Zhongchengshi Big Data Co ltd
Current assignee: Wuhan Zhongchengshi Big Data Co ltd
Priority date: 2020-05-06
Filing date: 2020-05-06
Publication date: 2021-11-09

Abstract

The invention discloses an urban functional area identification process based on spatio-temporal semantic mining. The process involves documents, words, basic functional units, spatio-temporal data, a topic model, document-topic distributions and unit-function distributions. The latent functions of an area are discovered with a topic model, by analogy with text topic mining: a basic functional unit corresponds to a document in a corpus, the spatio-temporal data within the unit correspond to the words of the document, and the unit-function distribution produced by the topic model corresponds to the document-topic distribution. The urban spatio-temporal data used are Sina Weibo location check-in data, a typical source. Each check-in record contains user information, the spatial coordinates of the check-in location, the posting time, the posted text and the like, and so reflects people's dynamic activity patterns from several angles. POIs in the study area are obtained from Baidu Map, and the function of each area is thereby identified.

Description

Urban functional area identification process based on spatio-temporal semantic mining

Technical Field

The invention relates to the technical field of urban functional area identification, and in particular to an urban functional area identification process based on spatio-temporal semantic mining.

Background

Traditional research on urban functional zoning mostly relies on data obtained by satellite remote sensing, questionnaire surveys, field visits and the like, supplemented by an index system, to identify urban functional zones [1-3]. These methods incur high labor costs, and the analysis is affected by the subjective judgment of the investigators, so it is difficult to monitor urban functional zones accurately and dynamically over a long period.

Disclosure of Invention

The invention aims to provide an urban functional area identification process based on spatio-temporal semantic mining, so as to solve the problems described in the Background.

To achieve this purpose, the invention provides the following technical solution: an urban functional area identification process based on spatio-temporal semantic mining, involving documents, words, basic functional units, spatio-temporal data, a topic model, document-topic distributions and unit-function distributions. The latent functions of an area are discovered with a topic model, by analogy with text topic mining: a basic functional unit corresponds to a document in a corpus, the spatio-temporal data within the unit correspond to the words of the document, and the unit-function distribution produced by the topic model corresponds to the document-topic distribution. The urban spatio-temporal data used are Sina Weibo location check-in data, a typical source. Each check-in record contains user information, the spatial coordinates of the check-in location, the posting time, the posted text and the like, and so reflects people's dynamic activity patterns from several angles. In addition, POIs in the study area are obtained from Baidu Map. Each POI record contains the name, spatial coordinates, address, type and the like of a physical entity, and the POI category density features of the different basic functional units are calculated from the types.

Further, the process comprises the following steps. First, the study area is divided into spatially independent basic functional units using buildings as the basis of division, and the discrete Weibo check-in data are converted into check-in event sets according to their spatial coordinates and assigned to the units. Then, taking each basic functional unit as the object, the behavior pattern and text features of its check-in event set are extracted, the POI category density is calculated, and these are fed into the DMR topic model to obtain the function vector of the basic functional unit. Because the function vector carries no explicit functional semantics, the function vectors are clustered to obtain clusters of units with similar functions. Finally, the functional attribute of each area cluster is labeled according to the POI structure within its functional units, a semantic interpretation is given, and the identification of functional areas is complete.

Further, the process is as follows: to obtain the urban functional areas, the basic functional units are divided first. The outlines of buildings in the study area are identified, and a connected-region labeling algorithm is then used to delimit the basic functional units. The check-in data in the study area are then spatialized and mapped to the basic units, after which basic unit feature extraction, latent function mining and functional area labeling are carried out.
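As an illustration of the unit-division step, the following is a minimal sketch of connected-region labeling on a rasterized building mask, followed by the mapping of check-ins to the labeled units. The grid representation, the 4-connectivity rule and the function names `label_connected_regions` and `assign_checkins` are assumptions made for this example; the patent does not specify how building outlines are rasterized or which connectivity is used.

```python
from collections import deque


def label_connected_regions(grid):
    """Label 4-connected regions of 1-cells.

    grid: list of rows, 1 = building footprint, 0 = background (assumed encoding).
    Returns (labels, count) where labels has the same shape and 0 means background.
    """
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or labels[r][c] != 0:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and labels[ny][nx] == 0):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


def assign_checkins(labels, checkins):
    """Group (row, col) check-in cells by unit label; background hits are dropped."""
    units = {}
    for row, col in checkins:
        label = labels[row][col]
        if label:
            units.setdefault(label, []).append((row, col))
    return units
```

In practice the check-in coordinates would first be projected from latitude and longitude onto the same grid; that projection is omitted here.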

Further, regarding the behavior pattern: in the location check-in data, each check-in by a user can be represented as C = {user, latitude, longitude, time, text}, where user is the user identifier; latitude is the latitude of the check-in location; longitude is the longitude of the check-in location; time is the time of the check-in; and text is the text posted at check-in. Together these fields constitute one movement behavior of the user, indicating that the user appeared at a certain place at a certain time. The user behavior pattern in a basic functional unit is defined as the average number of times users appear in that unit within a time period. Each day is divided into 12 periods of 2 hours, and working days are distinguished from weekends, giving 24 time slots. The average number of user check-in behaviors C in each time slot of each area is counted to form the user behavior pattern matrix P. In the matrix, the horizontal axis represents the time slots t1, …, t24, the vertical axis represents the areas R1, …, Rn, where n is the number of basic units, and each entry is the average number of check-in behaviors occurring in area Rj during time slot ti (for example, the shaded entry 6). A 24-dimensional behavior pattern vector is thereby obtained for each area.
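The construction of the behavior pattern matrix P can be sketched as follows. This is a minimal example under stated assumptions: check-ins have already been assigned to unit indices, slots 0 to 11 are the working-day periods and slots 12 to 23 the weekend periods, and the "average" is taken over the number of distinct working days or weekend days observed in the data. The source does not say what the average is taken over, so that denominator, like the names `time_slot` and `behavior_pattern_matrix`, is hypothetical.

```python
from datetime import datetime


def time_slot(ts):
    """Map a timestamp to one of 24 slots: 0-11 working days, 12-23 weekends."""
    offset = 12 if ts.weekday() >= 5 else 0  # Saturday = 5, Sunday = 6
    return offset + ts.hour // 2             # 2-hour bins


def behavior_pattern_matrix(checkins, n_units):
    """checkins: iterable of (unit_index, datetime).

    Returns an n_units x 24 matrix of average check-ins per observed day
    of the matching day type (assumed averaging rule).
    """
    counts = [[0] * 24 for _ in range(n_units)]
    days = {False: set(), True: set()}  # distinct working-day / weekend dates
    for unit, ts in checkins:
        counts[unit][time_slot(ts)] += 1
        days[ts.weekday() >= 5].add(ts.date())
    matrix = []
    for row in counts:
        averaged = []
        for slot, value in enumerate(row):
            n_days = len(days[slot >= 12]) or 1  # avoid division by zero
            averaged.append(value / n_days)
        matrix.append(averaged)
    return matrix
```

Each row of the returned matrix is the 24-dimensional behavior pattern vector of one basic unit.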

Furthermore, the text data of location check-ins are mostly short texts, which makes feature extraction difficult. A feature expansion method based on the Word2vec word vector model is therefore adopted to expand the text features within a region and alleviate the feature sparsity problem. Word2vec projects words into a vector space and produces distributed-representation word vectors [15, 16]. Building on the distributional hypothesis of word semantics, it provides a neural-network-based word vector training model that obtains a low-dimensional vector for a target word from the relation between the target word and its context. Training is efficient, and word vectors trained on a large-scale corpus show strong syntactic and semantic correlation. Because the Word2vec model can discover semantic relations between words, it is used to find words similar to the keywords, which expands the features of the short text, strengthens its topic to some degree, and better reflects functionality. The specific steps of text feature expansion are as follows:

S1: Data preprocessing: segment the large collected Weibo corpus into words and remove stop words and noise words;

S2: Word vector model training: configure the Word2vec model parameters and train the model on the data;

S3: Keyword extraction: statistical analysis shows that the average length of the texts in the corpus is 17 words, so the TF-IDF value of each word in the text to be expanded is calculated and the top 10 words are selected as keywords;

S4: Text expansion: traverse the keywords and, for each one, add its 5 nearest words according to the previously trained Word2vec model as expanded text features.
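Steps S3 and S4 can be sketched as below. This is a minimal, self-contained illustration, not the patented implementation: the `vectors` dictionary stands in for a trained Word2vec model (step S2 is not reproduced), the smoothed IDF formula is one common variant chosen for the example, and the function names are hypothetical. The defaults `top_k=10` and `top_n=5` follow the numbers given in the text.

```python
import math
from collections import Counter


def tfidf_keywords(doc, corpus, top_k=10):
    """Rank the words of `doc` by TF-IDF against `corpus` (a list of token lists)."""
    tf = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in corpus if word in d)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumed variant)
        scores[word] = (count / len(doc)) * idf
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_words(word, vectors, top_n=5):
    """Return the top_n words closest to `word` in the embedding table."""
    if word not in vectors:
        return []
    scored = [(cosine(vectors[word], vec), w) for w, vec in vectors.items() if w != word]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for _, w in scored[:top_n]]


def expand_text(doc, corpus, vectors, top_k=10, top_n=5):
    """S3 + S4: append the nearest neighbours of each keyword to the document."""
    expanded = list(doc)
    for keyword in tfidf_keywords(doc, corpus, top_k):
        expanded.extend(nearest_words(keyword, vectors, top_n))
    return expanded
```

With a real model, `nearest_words` would be replaced by the model's own nearest-neighbour query; the rest of the flow would stay the same.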

Further, a POI density feature vector is constructed for each basic functional unit. For each region $r$, let the number of POIs of the $i$-th type be $n_{ri}$ and the number of all POIs in the region be $s_r$; the density $v_{ri}$ of the $i$-th type of POI in the region is then

$$v_{ri} = \frac{n_{ri}}{s_r}$$

The POI density feature vector of region $r$ is $x_r = (v_{r1}, v_{r2}, \ldots, v_{rF}, 1)$, where $F$ is the number of POI categories and the trailing 1 is a default value used later to describe the mean of each topic.
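A minimal sketch of the POI density feature vector follows. It assumes that the density of a category is its share of all POIs in the unit, that is, the count of that type divided by the total POI count, which is the reading suggested by the definitions of nri and sr above. The function name `poi_density_vector` and the handling of units with no POIs (all densities set to 0) are assumptions for the example.

```python
from collections import Counter


def poi_density_vector(poi_types, categories):
    """Build x_r = (v_r1, ..., v_rF, 1) for one basic unit.

    poi_types:  list of the category labels of the POIs inside the unit.
    categories: ordered list of the F POI categories.
    """
    counts = Counter(poi_types)
    total = len(poi_types)
    densities = [counts[c] / total if total else 0.0 for c in categories]
    return densities + [1.0]  # trailing constant term, as in the text
```

For example, a unit holding two "food" POIs, one "school" and one "shop", evaluated over the categories food, school, shop and office, yields `[0.5, 0.25, 0.25, 0.0, 1.0]`.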

Furthermore, the behavior pattern and text features obtained above are treated as the "text" of a basic functional unit, and the "texts" of all basic units are collected into a "document set" that is input to the DMR topic model. Each basic functional unit also has its POI density features as prior data. The hidden variable, namely the function vector of the basic unit, is inferred from the features in the observable "text", so that each basic unit can finally be represented as a probability distribution over the functions. Specifically:

S1: Given a basic functional unit $r$ in the study area, for each latent function $k$ generate a vector $\lambda_k$ that obeys a Gaussian distribution with hyperparameter $\sigma$ and has the same length as the POI density feature vector, and generate the function-feature distribution $\beta_k$ from a Dirichlet distribution with prior parameter $\eta$;

S2: Let $\alpha_{r,k} = \exp(x_r^T \lambda_k)$, where $x_r$ is the POI density feature vector of basic unit $r$; the function distribution $\theta_r$ of the unit obeys a Dirichlet distribution with prior parameter $\alpha_r$;

S3: For the $n$-th feature $f_{r,n}$ in basic unit $r$, its function assignment $z_{r,n}$ obeys the multinomial distribution with parameter $\theta_r$ obtained in step S2, from which the function-feature distribution $\beta_{z_{r,n}}$ of this feature is determined;

S4: Generate the feature $f_{r,n}$, which obeys the multinomial distribution with parameter $\beta_{z_{r,n}}$;

S5: Traverse the $N$ features in basic unit $r$, repeating steps S3 and S4, to generate basic unit $r$;

S6: Traverse the $R$ basic units in the study area, repeating steps S1 to S5, to generate the whole study area.
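The generative process in steps S1 to S6 can be simulated with the short sketch below, which uses only the Python standard library. Several points are assumptions made for the example rather than statements from the patent: the vectors lambda_k and distributions beta_k are drawn once per function and shared by all units (the usual convention for Dirichlet-multinomial regression), the Gaussian has zero mean, features are encoded as integer indices into a vocabulary of size `vocab_size`, and all names and default values are hypothetical. The sketch shows only forward sampling; inferring the function vectors from observed data requires a separate inference procedure that is not shown.

```python
import math
import random


def sample_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow with tiny alphas
        return [1.0 / len(alphas)] * len(alphas)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_region(x_list, n_functions, vocab_size, n_features,
                    sigma=1.0, eta=0.1, seed=0):
    """Simulate the DMR generative process for every unit in x_list.

    x_list: POI density feature vectors x_r, one per basic unit.
    Returns a list of (theta_r, [(z, f), ...]) pairs.
    """
    rng = random.Random(seed)
    dim = len(x_list[0])
    # S1: lambda_k ~ N(0, sigma^2 I) and beta_k ~ Dirichlet(eta), per function k
    lambdas = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(n_functions)]
    betas = [sample_dirichlet([eta] * vocab_size, rng) for _ in range(n_functions)]
    units = []
    for x in x_list:  # S6: loop over basic units
        # S2: alpha_{r,k} = exp(x_r^T lambda_k), theta_r ~ Dirichlet(alpha_r)
        alpha = [math.exp(sum(xi * li for xi, li in zip(x, lam))) for lam in lambdas]
        theta = sample_dirichlet(alpha, rng)
        features = []
        for _ in range(n_features):  # S5: loop over features
            z = sample_categorical(theta, rng)     # S3: function assignment
            f = sample_categorical(betas[z], rng)  # S4: feature drawn from beta_z
            features.append((z, f))
        units.append((theta, features))
    return units
```

Because the prior parameter alpha depends on the POI density vector, two units with different POI mixes receive different prior function distributions even before any check-in features are observed, which is the point of using DMR rather than plain LDA here.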

Further, the function vectors have no explicit semantic expression and cannot be used directly to judge the functional area qualitatively. The function vectors are therefore first clustered with the k-means algorithm so that areas with similar functions are grouped together, with the number of cluster centers determined by the average silhouette coefficient. Each resulting area cluster is then labeled with a function according to its internal POI distribution structure, as a science and education area, residential area, commercial area, working area, living facilities area or mixed area. The POI distribution structure comprises the frequency density (FD) of the different types of POI in the area cluster and the category proportion (CP) of the different types of POI in the area cluster. For an area cluster $c$ they are calculated as follows:

$$FD_{c,i} = \frac{n_{c,i}}{N_i}, \qquad CP_{c,i} = \frac{FD_{c,i}}{\sum_{j=1}^{m} FD_{c,j}}$$

where $i$ is a POI category; $m$ is the total number of POI categories; $n_{c,i}$ is the number of POIs of category $i$ in $c$; $N_i$ is the total number of POIs of category $i$; $FD_{c,i}$ is the frequency density of the $i$-th type of POI in area cluster $c$; and $CP_{c,i}$ is the ratio of the frequency density of the $i$-th type of POI in area cluster $c$ to the frequency density of the POIs of all categories in the cluster.

The category proportion expresses how important one type of POI is in an area, and different types of POI are associated with functions to different degrees. Research has established that, in a given area cluster, if the sum of the category proportions of the POIs belonging to one function exceeds 50%, the cluster is dominated by that function and is a single-function urban area; if the sum of the category proportions for each function does not exceed 50%, the cluster is a mixed area.
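The labeling rule can be illustrated with the following minimal sketch. It assumes that the frequency density of a category is the cluster's count of that category divided by the total count of that category, and that the category proportion is the frequency density normalised over all categories, as suggested by the definitions above. The `function_of` mapping from POI categories to functions is hypothetical; the patent does not list which categories belong to which function.

```python
def frequency_density(cluster_counts, total_counts):
    """FD_{c,i}: cluster count of category i over the total count of category i."""
    return {
        cat: cluster_counts.get(cat, 0) / total
        for cat, total in total_counts.items()
        if total > 0
    }


def category_proportion(fd):
    """CP_{c,i}: each frequency density as a share of the cluster's summed FD."""
    total = sum(fd.values())
    return {cat: (value / total if total else 0.0) for cat, value in fd.items()}


def label_cluster(cp, function_of, threshold=0.5):
    """Return the dominant function if its CP share exceeds the threshold, else 'mixed'."""
    share = {}
    for category, value in cp.items():
        func = function_of.get(category)
        if func is not None:
            share[func] = share.get(func, 0.0) + value
    for func, value in share.items():
        if value > threshold:
            return func
    return "mixed"
```

Normalising by the city-wide total of each category keeps very common POI types, such as restaurants, from dominating every cluster simply because they are numerous everywhere.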

The invention has the following beneficial effects. Topic models are widely used in natural language processing and perform well in mining the latent semantics of text: a document is mapped from the term space to the topic space and represented as a probability distribution over several topics. The same idea can be applied to discovering the functions of an area. An area is regarded as a document, the spatio-temporal activity data within the area are the words of the document, and the functions of the area are the topics of the document, so the function distribution of each area can be obtained with a topic model. The LDA (Latent Dirichlet Allocation) topic model [11] is a classic model for text semantic mining. It is a Bayesian generative model with latent variables that represents a document by a distribution over topics and characterizes a topic by a distribution over words. For topic expression, however, it considers only the words in the text and ignores the contribution of other data associated with the document, so a number of extended models based on LDA have appeared. The DMR (Dirichlet multinomial regression) model [12] is a topic model derived from LDA. Compared with other topic models, the prior parameters of the Dirichlet topic distribution of a document in DMR take into account the influence of features associated with the document, which allows more complex and effective auxiliary features to be introduced and improves the topic extraction of the model.

On this basis, an urban functional area identification method based on spatio-temporal semantic mining is provided, using location check-in data and POI data. Buildings serve as the basic functional units in the study area. User behavior patterns and text features within each functional unit are extracted from the location check-in data and input, together with the POI density, to the DMR topic model to obtain the function vector of the unit. The vectors are then clustered so that areas with similar functions are grouped, and the clusters are labeled and given a semantic interpretation, thereby identifying the function of each area.

Drawings

FIG. 1 is a diagram of the analogy between area-function and document-topic in the urban functional area identification process based on spatio-temporal semantic mining according to the present invention;

FIG. 2 is a basic flow chart of the urban functional area identification process based on spatio-temporal semantic mining according to the present invention;

FIG. 3 is a diagram of the user behavior pattern matrix of the urban functional area identification process based on spatio-temporal semantic mining according to the present invention;

FIG. 4 is a diagram of the POI types of the urban functional area identification process based on spatio-temporal semantic mining according to the present invention;

FIG. 5 is a diagram of the DMR model generation process of the urban functional area identification process based on spatio-temporal semantic mining according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts based on the embodiments of the present invention belong to the protection scope of the present invention.

Referring to FIGS. 1-5, the present invention provides the following technical solution:

An urban functional area identification process based on spatio-temporal semantic mining, involving documents, words, basic functional units, spatio-temporal data, a topic model, document-topic distributions and unit-function distributions. The latent functions of an area are discovered with a topic model, by analogy with text topic mining: a basic functional unit corresponds to a document in a corpus, the spatio-temporal data within the unit correspond to the words of the document, and the unit-function distribution produced by the topic model corresponds to the document-topic distribution. The urban spatio-temporal data used are Sina Weibo location check-in data, a typical source. Each check-in record contains user information, the spatial coordinates of the check-in location, the posting time, the posted text and the like, and so reflects people's dynamic activity patterns from several angles. In addition, POIs in the study area are obtained from Baidu Map. Each POI record contains the name, spatial coordinates, address, type and the like of a physical entity, and the POI category density features of the different basic functional units are calculated from the types.

To further improve the practical function of the process, the study area is divided into spatially independent basic functional units using buildings as the basis of division, and the discrete Weibo check-in data are converted into check-in event sets according to their spatial coordinates and assigned to the units. Then, taking each basic functional unit as the object, the behavior pattern and text features of its check-in event set are extracted, the POI category density is calculated, and these are fed into the DMR topic model to obtain the function vector of the basic functional unit. Because the function vector carries no explicit functional semantics, the function vectors are clustered to obtain clusters of units with similar functions. Finally, the functional attribute of each area cluster is labeled according to the POI structure within its functional units, a semantic interpretation is given, and the identification of functional areas is complete.

To further improve the practical function of the process, the process is as follows: to obtain the urban functional areas, the basic functional units are divided first. The outlines of buildings in the study area are identified, and a connected-region labeling algorithm is then used to delimit the basic functional units. The check-in data in the study area are then spatialized and mapped to the basic units, after which basic unit feature extraction, latent function mining and functional area labeling are carried out.

To further improve the practical function of the process, regarding the behavior pattern: in the location check-in data, each check-in by a user can be represented as C = {user, latitude, longitude, time, text}, where user is the user identifier; latitude is the latitude of the check-in location; longitude is the longitude of the check-in location; time is the time of the check-in; and text is the text posted at check-in. Together these fields constitute one movement behavior of the user, indicating that the user appeared at a certain place at a certain time. The user behavior pattern in a basic functional unit is defined as the average number of times users appear in that unit within a time period. Each day is divided into 12 periods of 2 hours, and working days are distinguished from weekends, giving 24 time slots. The average number of user check-in behaviors C in each time slot of each area is counted to form the user behavior pattern matrix P. In the matrix, the horizontal axis represents the time slots t1, …, t24, the vertical axis represents the areas R1, …, Rn, where n is the number of basic units, and each entry is the average number of check-in behaviors occurring in area Rj during time slot ti (for example, the shaded entry 6). A 24-dimensional behavior pattern vector is thereby obtained for each area.

To further improve the practical function of the process: the text data of location check-ins are mostly short texts, which makes feature extraction difficult. A feature expansion method based on the Word2vec word vector model is therefore adopted to expand the text features within a region and alleviate the feature sparsity problem. Word2vec projects words into a vector space and produces distributed-representation word vectors [15, 16]. Building on the distributional hypothesis of word semantics, it provides a neural-network-based word vector training model that obtains a low-dimensional vector for a target word from the relation between the target word and its context. Training is efficient, and word vectors trained on a large-scale corpus show strong syntactic and semantic correlation. Because the Word2vec model can discover semantic relations between words, it is used to find words similar to the keywords, which expands the features of the short text, strengthens its topic to some degree, and better reflects functionality. The text features are expanded following steps S1 to S4 of text feature expansion set out above.

To further improve the practical function of the process, a POI density feature vector is constructed for each basic functional unit. For each region $r$, let the number of POIs of the $i$-th type be $n_{ri}$ and the number of all POIs in the region be $s_r$; the density of the $i$-th type of POI in the region is then $v_{ri} = n_{ri} / s_r$.

To further improve the practical function of the process, the behavior pattern and text features obtained above are treated as the "text" of a basic functional unit, and the "texts" of all basic units are collected into a "document set" that is input to the DMR topic model. Each basic functional unit also has its POI density features as prior data. The hidden variable, namely the function vector of the basic unit, is inferred from the features in the observable "text", so that each basic unit can finally be represented as a probability distribution over the functions, following steps S1 to S6 of the generation process set out above.

To further improve the practical function of the process: the function vectors have no explicit semantic expression and cannot be used directly to judge the functional area qualitatively. The function vectors are therefore first clustered with the k-means algorithm so that areas with similar functions are grouped together, with the number of cluster centers determined by the average silhouette coefficient. Each resulting area cluster is then labeled with a function according to its internal POI distribution structure, which comprises the frequency density (FD) of the different types of POI in the area cluster and the category proportion (CP) of the different types of POI in the area cluster. For an area cluster $c$, with $n_{c,i}$ the number of POIs of category $i$ in $c$, $N_i$ the total number of POIs of category $i$ and $m$ the total number of POI categories, they are calculated as $FD_{c,i} = n_{c,i} / N_i$ and $CP_{c,i} = FD_{c,i} / \sum_{j=1}^{m} FD_{c,j}$.

The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. An urban functional area identification process based on spatio-temporal semantic mining, characterized in that: it involves documents, words, basic functional units, spatio-temporal data, a topic model, document-topic distributions and unit-function distributions; the latent functions of an area are discovered with a topic model, by analogy with text topic mining, wherein a basic functional unit corresponds to a document in a corpus, the spatio-temporal data within the unit correspond to the words of the document, and the unit-function distribution produced by the topic model corresponds to the document-topic distribution; the urban spatio-temporal data used are Sina Weibo location check-in data, a representative source, each check-in record containing user information, the spatial coordinates of the check-in location, the posting time, the posted text and the like, so that people's dynamic activity patterns are reflected from several angles; and POIs in the study area are obtained from Baidu Map, each record containing the name, spatial coordinates, address, type and the like of a physical entity, the POI category density features of the different basic functional units being calculated from the types.

2. The urban functional area identification process based on spatio-temporal semantic mining as claimed in claim 1, characterized by comprising the following steps: first, the study area is divided into spatially independent basic functional units using buildings as the basis of division, and the discrete Weibo check-in data are converted into check-in event sets according to their spatial coordinates and assigned to the units; then, taking each basic functional unit as the object, the behavior pattern and text features of its check-in event set are extracted, the POI category density is calculated, and these are fed into the DMR topic model to obtain the function vector of the basic functional unit; because the function vector carries no explicit functional semantics, the function vectors are clustered to obtain clusters of units with similar functions; and finally, the functional attribute of each area cluster is labeled according to the POI structure within its functional units, a semantic interpretation is given, and the identification of functional areas is complete.

3. The urban functional area identification process based on spatio-temporal semantic mining as claimed in claim 1, characterized in that the process is as follows: to obtain the urban functional areas, the basic functional units are divided first; the outlines of buildings in the study area are identified, and a connected-region labeling algorithm is then used to delimit the basic functional units; the check-in data in the study area are then spatialized and mapped to the basic units, after which basic unit feature extraction, latent function mining and functional area labeling are carried out.

4. The urban functional area identification process based on spatio-temporal semantic mining as claimed in claim 1, characterized in that, regarding the behavior pattern, each check-in by a user in the location check-in data can be represented as C = {user, latitude, longitude, time, text}, where user is the user identifier; latitude is the latitude of the check-in location; longitude is the longitude of the check-in location; time is the time of the check-in; and text is the text posted at check-in; together these fields constitute one movement behavior of the user, indicating that the user appeared at a certain place at a certain time; the user behavior pattern in a basic functional unit is defined as the average number of times users appear in that unit within a time period; each day is divided into 12 periods of 2 hours, and working days are distinguished from weekends, giving 24 time slots; the average number of user check-in behaviors C in each time slot of each area is counted to form the user behavior pattern matrix P, in which the horizontal axis represents the time slots t1, …, t24, the vertical axis represents the areas R1, …, Rn, where n is the number of basic units, and each entry is the average number of check-in behaviors occurring in area Rj during time slot ti (for example, the shaded entry 6); a 24-dimensional behavior pattern vector is thereby obtained for each area.

5. The urban functional area identification process based on spatio-temporal semantic mining as claimed in claim 1, characterized in that the text data of location check-ins are mostly short texts, which makes feature extraction difficult, so a feature expansion method based on the Word2vec word vector model is adopted to expand the text features within a region and alleviate the feature sparsity problem; Word2vec projects words into a vector space and produces distributed-representation word vectors [15, 16]; building on the distributional hypothesis of word semantics, it provides a neural-network-based word vector training model that obtains a low-dimensional vector for a target word from the relation between the target word and its context; training is efficient, and word vectors trained on a large-scale corpus show strong syntactic and semantic correlation; because the Word2vec model can discover semantic relations between words, it is used to find words similar to the keywords, which expands the features of the short text, strengthens its topic to some degree, and better reflects functionality; the specific steps of text feature expansion are as follows:

6. The urban functional area identification process based on spatio-temporal semantic mining as claimed in claim 1, characterized in that a POI density feature vector is constructed for each basic functional unit; for each region $r$, the number of POIs of the $i$-th type is $n_{ri}$ and the number of all POIs in the region is $s_r$, and the density of the $i$-th type of POI in the region is then $v_{ri} = n_{ri} / s_r$.

7. The urban functional area identification process based on spatio-temporal semantic mining as claimed in claim 1, characterized in that the behavior pattern and text features obtained are treated as the "text" of a basic functional unit, and the "texts" of all basic units are collected into a "document set" that is input to the DMR topic model; each basic functional unit also has its POI density features as prior data; the hidden variable, namely the function vector of the basic unit, is inferred from the features in the observable "text", so that each basic unit is finally represented as a probability distribution over the functions, specifically:

8. The urban functional area identification process based on spatio-temporal semantic mining as claimed in claim 1, characterized in that the function vectors have no explicit semantic expression and cannot be used directly to judge the functional area qualitatively, so the function vectors are first clustered with the k-means algorithm so that areas with similar functions are grouped together, the number of cluster centers being determined by the average silhouette coefficient; each resulting area cluster is then labeled with a function according to its internal POI distribution structure, which comprises the frequency density (FD) of the different types of POI in the area cluster and the category proportion (CP) of the different types of POI in the area cluster; for an area cluster $c$, with $n_{c,i}$ the number of POIs of category $i$ in $c$, $N_i$ the total number of POIs of category $i$ and $m$ the total number of POI categories, they are calculated as $FD_{c,i} = n_{c,i} / N_i$ and $CP_{c,i} = FD_{c,i} / \sum_{j=1}^{m} FD_{c,j}$.