CN108763349B

CN108763349B - Urban land utilization mixedness measuring and calculating method and system based on social media data

Info

Publication number: CN108763349B
Application number: CN201810461482.9A
Authority: CN
Inventors: 邢汉发; 孟媛
Original assignee: Individual
Current assignee: Individual
Priority date: 2018-05-15
Filing date: 2018-05-15
Publication date: 2021-08-31
Anticipated expiration: 2038-05-15
Also published as: CN108763349A

Abstract

The invention discloses a method and a system for measuring and calculating the urban land utilization mixing degree based on social media data. First, the time information, position information and text information of the social media data are obtained by a web crawler. Then, an LDA topic model is introduced, and dynamic text topics are extracted by constructing text documents for different time intervals. Next, the time similarity and the semantic similarity of the dynamic text topics are calculated, and urban land utilization mixing degree measurement and calculation indexes are constructed by using a hierarchical clustering model. Finally, based on the obtained indexes, the urban land utilization mixing degree is measured and calculated by using an information entropy algorithm.

Description

Urban land utilization mixedness measuring and calculating method and system based on social media data

Technical Field

The invention relates to the technical field of urban land utilization mixing degree measurement and calculation, in particular to a method and a system for measuring and calculating urban land utilization mixing degree based on social media data.

Background

With the advance of urbanization in China, urban agglomeration keeps increasing and the mixing of urban functions gradually deepens, so that urban land shows rich diversity and mixing in actual use. The urban land utilization mixing degree is the degree to which land functions are mixed within an urban area, and its measurement, calculation and analysis are of great significance for dividing the urban spatial layout, analyzing the distribution of population and jobs, and planning residents' travel distances and the spatial distribution of travel. Existing measurement of the urban land utilization mixing degree is mainly completed through questionnaires, field surveys, remote sensing image interpretation and similar means. These methods are costly and slow, and have difficulty quantifying the urban land utilization mixing degree at a fine granularity, so a new measurement means is urgently needed.

In recent years, with the development of mobile internet technology, social media platforms such as Facebook, Twitter, and microblogs (Weibo) have appeared, and the social media data they generate can reflect land use information in cities to some extent. For example, in terms of posted content, people often post data related to "shopping", "fine food", etc. in commercial districts, and such information may correspond to commercial land; in terms of posting time, people often post data at their place of residence at night or on a weekday, and such information may correspond to residential land. In view of this, scholars have applied social media data to land use identification: "Frias-Martinez V, Frias-Martinez E. Spectral clustering for sensing urban land use using Twitter activity [J]. Engineering Applications of Artificial Intelligence, 2014, 35: 237-" applies a spectral clustering algorithm to Twitter activity; "Chen Y, Liu X, Li X, et al. Delineating urban functional areas with building-level social media data: A dynamic time warping (DTW) distance based k-medoids method [J]. Landscape and Urban Planning, 2017, 160: 48-60" combines the vacation sign-in data with a k-medoids algorithm based on time series analysis and proposes a classification method for urban land utilization at the building scale.

In fact, social media data have rich text information and real-time release times, so they can describe urban land utilization information more finely; applying social media data to the measurement and calculation of the urban land utilization mixing degree is therefore a feasible approach.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a method and a system for measuring and calculating the urban land utilization mixing degree based on social media data. By introducing the time variation characteristics and the text semantic characteristics of social media data, the invention provides a more effective method for measuring and calculating the urban land utilization mixing degree.

As a first aspect of the invention, a method for measuring and calculating urban land utilization mixedness based on social media data is provided;

the urban land utilization mixedness measuring and calculating method based on social media data comprises the following steps:

step (1): acquiring the publishing time, the position information and the text data of the social media data, and preprocessing the text data in the social media data;

step (2): dividing time into a plurality of time intervals, and extracting a text theme corresponding to each time interval and the distribution probability of each text theme in each time interval from the text data obtained by preprocessing;

step (3): constructing urban land utilization mixing degree measuring and calculating indexes according to the distribution probability of each text theme in each time interval obtained in the step (2);

step (4): measuring and calculating the urban land utilization mixing degree according to the constructed urban land utilization mixing degree measuring and calculating indexes;

the social media data set comprises: a Sina microblog dataset, a Twitter dataset, a Facebook dataset, etc.

As a further improvement of the present invention, the preprocessing comprises: deleting punctuation marks, uniformly converting English letters to lower case, removing stop words, and removing words whose occurrence frequency is lower than a threshold value;

the stop word, for example: so, Yi, u, We, so, do.

The calculation process of the word occurrence frequency is as follows: and taking each word as a statistical unit, and counting the occurrence times of the word in all the words.
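The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the stop-word set `STOP_WORDS`, the threshold `MIN_COUNT`, and the function names are hypothetical, and tokenization by whitespace is an assumption that suits English text only.

```python
import re
from collections import Counter

# Hypothetical stop-word list and frequency threshold, for illustration only.
STOP_WORDS = {"so", "we", "do", "u"}
MIN_COUNT = 2

def tokenize(text):
    """Lowercase, replace punctuation with spaces, split, and drop stop words."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return [w for w in cleaned.split() if w not in STOP_WORDS]

def preprocess(posts, min_count=MIN_COUNT):
    """Tokenize every post, then drop words whose count over all words is below min_count."""
    tokenized = [tokenize(p) for p in posts]
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in tokenized]

posts = [
    "So we LOVE this restaurant!",
    "Spicy food, great restaurant.",
    "Love spicy noodles",
]
print(preprocess(posts))
# → [['love', 'restaurant'], ['spicy', 'restaurant'], ['love', 'spicy']]
```

Counting each word over the whole corpus follows the statement that every word is a statistical unit whose occurrences are counted among all words.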

The text topics of different time intervals comprise two parts. One part is a set of words expressing a similar theme; for example, "delicious", "restaurant" and "spicy" all represent a theme related to diet, and this set of words is collectively called the text topic. The other part is the proportion (i.e. the distribution probability) of the text topic in different time periods; for example, a certain text topic at 12:00 accounts for 30% of the whole time period but only 3% at 20:00, which reflects the dynamic nature of the text topic.

As a further improvement of the invention, the step (2) comprises the following steps:

step (21): constructing 48 time intervals by taking one hour as a time interval based on the text data obtained by preprocessing in the step (1); wherein, the 1 st to 24 th hours are the time intervals of working days, and the 25 th to 48 th hours are the time intervals of rest days; combining the text information in each time interval to obtain 48 text documents, wherein each text document comprises text data;

step (22): calculating a text topic of each text document and a distribution probability of each text topic in each time interval through a Latent Dirichlet Allocation (LDA) topic model;

step (23): and performing normalization processing on the distribution probability of each text topic in 48 time intervals by using a Max-Min standardization method to obtain the normalized distribution probability of each text topic in each time interval.
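Steps (21) and (23) can be sketched as follows. The sketch assumes that rest days are Saturday and Sunday (public holidays are ignored), uses 0-based slot numbers (0 to 23 for working days, 24 to 47 for rest days) instead of the 1-based hours in the text, and all function names are hypothetical. The LDA step (22) itself is not shown.

```python
from datetime import datetime

def slot_index(ts):
    """Map a timestamp to one of 48 hourly slots: 0-23 working day, 24-47 rest day."""
    is_rest_day = ts.weekday() >= 5  # assumption: Saturday and Sunday are rest days
    return ts.hour + (24 if is_rest_day else 0)

def build_documents(posts):
    """Merge the tokens of all posts that fall into the same slot into one document."""
    docs = [[] for _ in range(48)]
    for ts, tokens in posts:
        docs[slot_index(ts)].extend(tokens)
    return docs

def min_max(values):
    """Max-Min normalization of one topic's distribution probabilities over the slots."""
    lo, hi = min(values), max(values)
    if hi == lo:  # a flat series carries no temporal signal
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

posts = [
    (datetime(2018, 5, 14, 12, 30), ["lunch", "restaurant"]),  # Monday noon
    (datetime(2018, 5, 19, 20, 0), ["movie"]),                  # Saturday evening
]
docs = build_documents(posts)
print(slot_index(posts[0][0]), slot_index(posts[1][0]))  # → 12 44
print(min_max([0.03, 0.30, 0.12]))
```

Normalizing each topic separately puts topics with very different overall probabilities on the same 0 to 1 scale before their temporal shapes are compared.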

As a further improvement of the invention, in the step (22) the distribution probability of each text topic in each time interval is calculated by the LDA formula, in which θ is the distribution probability of each text topic in each time interval, z is a text topic, w represents a word in the text document, n is the total number of words, and α and β are hyper-parameters of the LDA topic model, where α = 50/k and β = 0.1.
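The formula for step (22) is not reproduced in this text. For orientation only, the standard LDA joint distribution from Blei et al., written with the symbols defined above, has the form below. Whether the patent's formula is exactly this expression is an assumption, as is the reading of k as the number of topics and the use of i as the word index.

```latex
p(\theta, z, w \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{i=1}^{n} p(z_i \mid \theta)\, p(w_i \mid z_i, \beta)
```

Here the first factor draws the topic proportions θ of a document from a Dirichlet prior with parameter α, and each factor of the product draws a topic z_i for the i-th word and then the word w_i from that topic's word distribution, which is governed by β.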

As a further improvement of the invention, the step (3) comprises the following steps:

step (31): according to the normalized distribution probability of each text topic in each time interval obtained in the step (2), performing time similarity calculation by using a Time Warp Edit Distance (TWED) model to obtain the time similarity between any two text topics;

step (32): clustering the text topics by utilizing a hierarchical clustering model according to the time similarity between any two text topics; taking the clustering result as a land utilization mixing degree measuring and calculating index considering the time similarity;

step (33): on the basis of the obtained land utilization mixed degree measuring and calculating index considering the time similarity, performing semantic similarity calculation by using a TF-IDF model, and correcting a calculation result;

step (34): based on the corrected calculation result of the semantic similarity, clustering the text topics by using a hierarchical clustering model, and taking the clustering result as a land utilization mixing degree measurement and calculation index considering the semantic similarity; and the land utilization mixing degree measuring and calculating index considering the semantic similarity is the urban land utilization mixing degree measuring and calculating index.
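The clustering in steps (32) and (34) can be illustrated with a small agglomerative procedure that works directly on a pairwise distance matrix, such as one built from the time or semantic dissimilarity between topics. The text does not state the linkage criterion, so average linkage is an assumption here, and the function name is hypothetical.

```python
def agglomerative(dist, n_clusters):
    """Average-linkage agglomerative clustering on a symmetric distance matrix.

    dist[i][j] is the dissimilarity between items i and j; the result is a
    list of clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[x] for j in clusters[y]]
                avg = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or avg < best[0]:
                    best = (avg, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge the closest pair
        del clusters[y]
    return [sorted(c) for c in clusters]

# Toy matrix: items 0 and 1 are close, items 2 and 3 are close.
dist = [
    [0, 1, 9, 10],
    [1, 0, 8, 9],
    [9, 8, 0, 1],
    [10, 9, 1, 0],
]
print(agglomerative(dist, 2))  # → [[0, 1], [2, 3]]
```

Each resulting cluster of topics would then play the role of one measuring and calculating index.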

As a further improvement of the present invention, in the step (31), the calculation formula based on the TWED model gives the time similarity calculation result for text topics z₁ and z₂ from their respective distribution probabilities within each time interval. The terms of the formula are respectively calculated from the distribution probabilities of the dynamic text topics z₁ and z₂ in the i-th and j-th time intervals, and the parameters ν and λ are hyper-parameters of the TWED model.
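Because the TWED formulas are not reproduced in this text, the sketch below uses the generally published recursion of the Time Warp Edit Distance for two equally sampled series. It is an assumption that the patent applies exactly this form; the default values of `nu` and `lam` are placeholders, not the values used by the inventors.

```python
import math

def twed(a, b, nu=0.001, lam=1.0):
    """Time Warp Edit Distance between two series sampled at times 1, 2, ...

    nu penalizes differences between time stamps (stiffness), and lam is the
    constant cost of a delete operation.
    """
    # Pad both series with a zero sample at time 0, as in the usual formulation.
    a = [0.0] + list(a)
    b = [0.0] + list(b)
    n, m = len(a), len(b)
    d = [[math.inf] * m for _ in range(n)]
    d[0][0] = 0.0
    for i in range(1, n):
        for j in range(1, m):
            # Delete a sample in a (consecutive time stamps are 1 apart).
            delete_a = d[i - 1][j] + abs(a[i] - a[i - 1]) + nu + lam
            # Delete a sample in b.
            delete_b = d[i][j - 1] + abs(b[j] - b[j - 1]) + nu + lam
            # Match a[i] with b[j] and a[i-1] with b[j-1].
            match = (d[i - 1][j - 1]
                     + abs(a[i] - b[j]) + abs(a[i - 1] - b[j - 1])
                     + 2 * nu * abs(i - j))
            d[i][j] = min(delete_a, delete_b, match)
    return d[n - 1][m - 1]

noon_peak = [0.0, 0.2, 1.0, 0.3, 0.0]
evening_peak = [0.0, 0.0, 0.2, 1.0, 0.3]
print(twed(noon_peak, noon_peak))     # identical series give 0.0
print(twed(noon_peak, evening_peak))  # shifted peak gives a positive distance
```

A smaller distance means two topics rise and fall at similar hours, so the pairwise distances over all topics can serve as the dissimilarity input to clustering.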

As a further improvement of the present invention, in the step (32):

the number of clusters is determined by the Davies-Bouldin (DB) index; the lower the DB index, the better the clustering effect.

In the calculation formula of the DB index, N is the number of clusters, S_i is the standard deviation of the i-th dynamic text topic distribution probability, S_j is the standard deviation of the j-th dynamic text topic distribution probability, and c1 and c2 are two different clustering results.
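The DB index can be illustrated with its commonly used definition: for each cluster, take the worst ratio of summed within-cluster scatter to between-centroid distance, then average over clusters. The sketch assumes Euclidean distance on feature vectors and uses the root-mean-square distance to the centroid as the scatter; both choices, and the function names, are assumptions rather than details confirmed by the text. It requires at least two clusters with distinct centroids.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(dim)]

def scatter(points, center):
    """Root-mean-square Euclidean distance of cluster members to the centroid."""
    return math.sqrt(sum(math.dist(p, center) ** 2 for p in points) / len(points))

def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of equal-length vectors."""
    centers = [centroid(c) for c in clusters]
    spreads = [scatter(c, ctr) for c, ctr in zip(clusters, centers)]
    total = 0.0
    for i in range(len(clusters)):
        total += max(
            (spreads[i] + spreads[j]) / math.dist(centers[i], centers[j])
            for j in range(len(clusters)) if j != i
        )
    return total / len(clusters)

tight = [[[0, 0], [0, 1]], [[10, 0], [10, 1]]]  # compact, far apart
loose = [[[0, 0], [0, 4]], [[3, 0], [3, 4]]]    # spread out, close together
print(davies_bouldin(tight), davies_bouldin(loose))
```

Evaluating the index for several candidate cluster counts and keeping the count with the lowest value matches the rule that a lower DB index indicates better clustering.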

As a further improvement of the invention, in the step (33), the semantic similarity calculation formula based on the TF-IDF model gives the text similarity calculation result for dynamic text topics z₁ and z₂ from the probability with which z₁ and z₂ are respectively distributed over each word, where n is the number of words in each dynamic text topic.
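The semantic similarity formula is not reproduced in this text. A common way to compare two weighted word vectors of length n is cosine similarity, sketched below; using it here is an assumption, not a statement of the patent's formula, and the function name and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length word-weight vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Toy word-weight vectors over the same n = 4 vocabulary entries.
diet_topic = [0.5, 0.3, 0.2, 0.0]
food_topic = [0.4, 0.4, 0.1, 0.1]
work_topic = [0.0, 0.0, 0.1, 0.9]
print(cosine_similarity(diet_topic, food_topic))
print(cosine_similarity(diet_topic, work_topic))
```

Topics that put weight on the same words score close to 1, and topics with no shared words score 0.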

As a further improvement of the invention, in the step (33), the calculation result is corrected on the basis of the obtained semantic similarity; the correction formula yields the corrected text similarity calculation result for dynamic text topics z₁ and z₂.

As a further improvement of the invention, the step (4) comprises the following steps:

step (41): repartitioning of the text document: dividing a region to be researched into a plurality of units according to a road, wherein each unit is called as a land utilization unit; extracting time information and text data of the social media data in each land utilization unit according to the position information of the social media data, and combining all the text data in each land utilization unit to serve as a new text document;

step (42): calculating a text topic of each new text document and a distribution probability of each new text topic in each land utilization unit through a Latent Dirichlet Allocation (LDA) topic model;

calculating the distribution probability of each urban land utilization mixing degree measuring and calculating index in each land utilization unit according to the distribution probability of each new text theme in each land utilization unit and the urban land utilization mixing degree measuring and calculating index constructed in the step (3);

step (43): calculating the land utilization mixing degree by using an information entropy algorithm, based on the distribution probability of each urban land utilization mixing degree measuring and calculating index in each land utilization unit.

As a further improvement of the invention, in the step (42), the formula for the distribution probability of each land use mixing degree measurement index in each land use unit relates the distribution probability of the i-th land use mixing degree measurement index in land use unit p_i to the distribution probability of each dynamic text topic z_m in land use unit p_i, where M is the total number of dynamic text topics contained in the i-th land use mixing degree measurement index.
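One plausible reading of this step, given that M counts the topics contained in an index, is that the probability of an index in a land use unit is the sum of the probabilities of its member topics in that unit. The sketch below assumes this summation; the function name and the numbers are hypothetical.

```python
def index_probabilities(topic_probs, index_topics):
    """Aggregate topic probabilities of one land use unit into index probabilities.

    topic_probs:  {topic_id: distribution probability in the land use unit}
    index_topics: {index_id: [topic_id, ...]} mapping each index to its topics
    """
    return {
        idx: sum(topic_probs.get(t, 0.0) for t in topics)
        for idx, topics in index_topics.items()
    }

# Hypothetical unit in which only four topics occur.
topic_probs = {3: 0.2, 25: 0.1, 9: 0.3, 22: 0.4}
index_topics = {9: [3, 25], 10: [9, 22]}
print(index_probabilities(topic_probs, index_topics))
```

If the topic probabilities of a unit sum to 1 and every topic belongs to exactly one index, the index probabilities also sum to 1, which is what an entropy calculation expects.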

As a further improvement of the invention, the land use mixing degree in the step (43) is calculated by the information entropy algorithm, where H(p_i) is the mixing degree measurement result for land use unit p_i, obtained from the distribution probability of each land use mixing degree measurement index in p_i, and C is the total number of land use mixing degree measurement indexes.
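The entropy step can be sketched with the usual Shannon form, H = -Σ P·ln P over the C index probabilities of a unit. The natural logarithm and the skipping of zero probabilities are assumptions, since the text does not reproduce the formula or state the base; the function name is hypothetical.

```python
import math

def mixing_degree(index_probs):
    """Shannon entropy of the index probabilities of one land use unit."""
    return -sum(p * math.log(p) for p in index_probs if p > 0)

single_use = [1.0, 0.0, 0.0]       # one index dominates: no mixing
evenly_mixed = [1 / 3, 1 / 3, 1 / 3]
print(mixing_degree(single_use))    # 0 for a single-use unit
print(mixing_degree(evenly_mixed))  # ln(3), the maximum for three indexes
```

Under this form the value ranges from 0 for a unit dominated by one index to ln C for a unit in which all C indexes are equally likely.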

As a second aspect of the present invention, a system for measuring and calculating urban land use mixedness based on social media data is provided;

the social media data-based urban land utilization mixedness measuring and calculating system comprises: a memory, a processor, and computer instructions stored on the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of any of the above methods.

As a third aspect of the present invention, there is provided a computer-readable storage medium;

a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, perform the steps of any of the above methods.

Compared with the prior art, the invention has the beneficial effects that:

First, the time information, position information and text information of the social media data are obtained by a web crawler. Then, an LDA topic model is introduced, and dynamic text topics are extracted by constructing text documents for different time intervals. Next, the time similarity and the semantic similarity of the dynamic text topics are calculated, and land utilization mixing degree measurement and calculation indexes are constructed by using a hierarchical clustering model. Finally, based on the obtained indexes, the urban land utilization mixing degree is measured and calculated by using an information entropy algorithm.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic diagram of the time variation information of dynamic text topics 14, 20, 23 and 29;

FIGS. 3(a)-3(d) are schematic diagrams of the text semantic information of dynamic text topics 14, 20, 23 and 29;

FIG. 4 is a diagram illustrating a time similarity calculation result of a dynamic text topic;

FIG. 5 is a schematic diagram of an index for measuring and calculating the degree of mixing of urban land utilization in consideration of time similarity;

FIG. 6 is a diagram illustrating semantic similarity calculations for a dynamic text topic;

FIG. 7 is a schematic diagram of the urban land use mixture degree measurement index considering semantic similarity;

FIG. 8 is a schematic diagram of the land use mixing degree measurement and calculation results.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

In order to make the technical features, objects and effects of the present invention clearly understood, embodiments of the present invention are described below with reference to the accompanying drawings, taking a Twitter data set of the Toronto region as an example.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.

As a first embodiment of the invention, a method for measuring and calculating urban land utilization mixedness based on social media data is provided;

as shown in fig. 1, the method for measuring and calculating urban land utilization mixedness based on social media data includes:

step (1): acquiring the publishing time, the position information and the text data of the social media data, and preprocessing the text data in the social media data; the social media data comprises: a Sina microblog dataset, a Twitter dataset, a Facebook dataset, etc.

The preprocessing comprises: deleting punctuation marks, uniformly converting English letters to lower case, removing stop words, and removing words whose occurrence frequency is lower than a threshold value;

the stop word, for example: so, Yi, u, We, so, do.

In this embodiment, the processed Twitter data includes time information, geographical location information, and text information, as shown in table 1.

Table 1 Twitter data and its time information, geographical location information, text information.

The step (2) comprises the following steps:

The step (22) of calculating the distribution probability of each text topic in each time interval comprises the following steps:

For the specific calculation method of the text topics, refer to "Blei D M, Ng A Y, Jordan M I. Latent Dirichlet Allocation [J]. Journal of Machine Learning Research, 2003, 3(Jan): 993-1022.";

by applying the method, a total of 30 dynamic text topics are extracted. The time variation information and the text semantic information of dynamic text topics 14, 20, 23 and 29 are shown in FIG. 2 and FIGS. 3(a)-3(d), respectively.

The step (3) comprises the following steps:

in the step (31), the calculation formula based on the TWED model gives the time similarity calculation result for text topics z₁ and z₂, and its terms are respectively calculated by further formulas. The parameters ν and λ are hyper-parameters of the TWED model; they can be determined by referring to "Serrà J, Arcos J L. An empirical evaluation of similarity measures for time series classification [J]. Knowledge-Based Systems, 2014, 67: 305-314."

The number of clusters is determined by the calculation formula of the DB index. In the semantic similarity formula, n is the number of words in each dynamic text topic.

In this embodiment, based on the 30 dynamic text topics obtained in step (2), the time similarity of the topics is calculated by using the TWED model; the result is shown in FIG. 4. On this basis, a hierarchical clustering model is applied to extract the land utilization mixing degree measurement indexes considering time similarity; the number of clusters obtained experimentally is 6, and the resulting indexes are shown in FIG. 5. Then, based on the obtained indexes, the TF-IDF model is applied to calculate the semantic similarity of the topics; the result is shown in FIG. 6. A hierarchical clustering model is again applied to extract the land utilization mixing degree measurement indexes considering semantic similarity; the number of clusters obtained experimentally is 9, and the resulting indexes are shown in FIG. 7. Combining the indexes considering time similarity and semantic similarity yields 14 final land utilization mixing degree measurement indexes; the dynamic text topics contained in each index are shown in Table 2:

Table 2 Land utilization mixing degree measurement and calculation indexes

Measurement and calculation index	Dynamic text topics
1	Topic 11
2	Topic 13
3	Topic 16
4	Topic 23
5	Topic 27
6	Topics 7, 8, 15, 21
7	Topics 4, 20, 30
8	Topics 1, 17, 18, 19
9	Topics 3, 25
10	Topics 9, 22
11	Topics 10, 28
12	Topics 2, 5, 24, 29
13	Topics 6, 12
14	Topics 14, 26

In the step (42), the distribution probability of each land utilization mixing degree measurement index in each land utilization unit is calculated.

In the step (43), the land use mixedness calculation formula by using the information entropy algorithm is as follows:

In the embodiment, the 14 land utilization mixing degree measurement indexes obtained in the step (3) are applied, and the land utilization mixing degree measurement is performed by using an information entropy algorithm, and the measurement result is shown in fig. 8.

As a second embodiment of the present invention, a system for measuring and calculating urban land use mixedness based on social media data is provided;

As a third embodiment of the present invention, there is provided a computer-readable storage medium;

First, the time information, position information and text information of the social media data are obtained by a web crawler. Then, an LDA topic model is introduced, and dynamic text topics are extracted by constructing text documents for different time intervals. Next, the time similarity and the semantic similarity of the dynamic text topics are calculated, and urban land utilization mixing degree measurement and calculation indexes are constructed by using a hierarchical clustering model. Finally, based on the obtained indexes, the urban land utilization mixing degree is measured and calculated by using an information entropy algorithm.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. The urban land utilization mixedness measuring and calculating method based on social media data is characterized by comprising the following steps of:

the step (3) comprises the following steps:

step (31): according to the normalized distribution probability of each text theme in each time interval obtained in the step (2), performing time similarity calculation by using a TWED model to obtain the time similarity between any two text themes; the English name of TWED is Time Warp Edit Distance;

step (34): based on the corrected calculation result of the semantic similarity, clustering the text topics by using a hierarchical clustering model, and taking the clustering result as a land utilization mixing degree measurement and calculation index considering the semantic similarity; the land utilization mixing degree measuring and calculating index considering the semantic similarity is the urban land utilization mixing degree measuring and calculating index;

the step (4) comprises the following steps:

step (42): calculating a text topic of each new text document and the distribution probability of each new text topic in each land utilization unit through a Latent Dirichlet Allocation (LDA) topic model;

2. The method for urban land use mixedness estimation based on social media data according to claim 1, wherein,

the step (2) comprises the following steps:

step (22): calculating a text topic of each text document and the distribution probability of each text topic in each time interval through a Latent Dirichlet Allocation (LDA) topic model;

3. The method for urban land use mixedness estimation based on social media data according to claim 1, wherein,

the time similarity calculation result for text topics z₁ and z₂ is given by the TWED formula, the terms of which are respectively calculated by further formulas; the parameters ν and λ are hyper-parameters of the TWED model.

4. The method for urban land use mixedness estimation based on social media data according to claim 1, wherein,

in the step (33), n in the semantic similarity formula is the number of words in each dynamic text topic;

5. The method for urban land use mixedness estimation based on social media data according to claim 1, wherein,


6. The method for urban land use mixedness estimation based on social media data according to claim 1, wherein,

7. Social media data-based urban land utilization mixedness measuring and calculating system comprises: a memory, a processor, and computer instructions stored on the memory and executed on the processor, the computer instructions, when executed by the processor, performing the steps of any of the methods of claims 1-6.

8. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, perform the steps of any of the methods of claims 1-6.