CN108763349B - Urban land utilization mixedness measuring and calculating method and system based on social media data

Publication number: CN108763349B
Authority: CN (China)
Prior art keywords: text, land utilization, calculating, social media, time
Legal status: Active (as listed; the legal status is an assumption and is not a legal conclusion)
Application number: CN201810461482.9A
Other languages: Chinese (zh)
Other versions: CN108763349A (en)
Inventors: 邢汉发, 孟媛
Current Assignee: Individual (as listed; the listed assignees may be inaccurate)
Original Assignee: Individual
Application filed by: Individual
Priority: CN201810461482.9A
Earlier publication: CN108763349A
Granted publication: CN108763349B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/06: Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063: Operations research, analysis or management
    • G06Q10/0639: Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393: Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10: Services
    • G06Q50/26: Government or public services

Landscapes

  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Educational Administration (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Primary Health Care (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for measuring and calculating the urban land utilization mixing degree based on social media data. First, the time information, position information and text information of the social media data are obtained with a web crawler. Then, an LDA topic model is introduced, and dynamic text topics are extracted by constructing text documents for different time intervals. Next, the time similarity and the semantic similarity of the dynamic text topics are calculated, and urban land utilization mixing degree measurement indexes are constructed with a hierarchical clustering model. Finally, based on the obtained measurement indexes, the urban land utilization mixing degree is calculated with an information entropy algorithm.

Description

Urban land utilization mixedness measuring and calculating method and system based on social media data
Technical Field
The invention relates to the technical field of urban land utilization mixing degree measurement and calculation, in particular to a method and a system for measuring and calculating urban land utilization mixing degree based on social media data.
Background
With the advance of urbanization in China, urban agglomeration keeps increasing and the mixing of urban functions gradually deepens, so that urban land shows rich diversity and mixing in actual use. The urban land utilization mixing degree is the degree to which land functions are mixed within an urban area, and its measurement and analysis are of great significance for dividing the urban spatial layout, analyzing the distribution of population and jobs, and planning residents' travel distances and the spatial distribution of travel. Existing measurement of the urban land utilization mixing degree is mainly carried out through questionnaire surveys, field surveys, remote sensing image interpretation and similar means. These methods are costly and slow, and they have difficulty quantifying the urban land utilization mixing degree at a fine scale, so a new measurement approach is urgently needed.
In recent years, with the development of mobile internet technology, social media platforms such as Facebook, Twitter and Weibo (microblog) have appeared, and the social media data they generate can reflect land use information in cities to some extent. For example, in terms of content, people often post data related to "shopping", "food" and the like in commercial districts, and such information may correspond to commercial land; in terms of posting time, people often post data at their place of residence at night or on a weekday, and such information may correspond to residential land. In view of this, scholars have applied social media data to land use identification: "Frias-Martinez V, Frias-Martinez E. Spectral clustering for sensing urban land use using Twitter activity [J]. Engineering Applications of Artificial Intelligence, 2014, 35: 237-" applies a spectral clustering algorithm to Twitter activity; "Chen Y, Liu X, Li X, et al. Delineating urban functional areas with building-level social media data: A dynamic time warping (DTW) distance based k-medoids method [J]. Landscape and Urban Planning, 2017, 160: 48-60" combines vacation check-in data with a k-medoids algorithm based on time series analysis, and proposes a classification method for urban land use at the building scale.
In fact, social media data has rich text information and real-time posting times, so it can describe urban land use information at a finer level, and applying it to the measurement of the urban land utilization mixing degree is a feasible approach.
Disclosure of Invention
In order to overcome the shortcomings of the prior art, the invention provides a method and a system for measuring and calculating the urban land utilization mixing degree based on social media data. By introducing the temporal variation characteristics and the text semantic characteristics of social media data, the invention provides a more effective way to measure the urban land utilization mixing degree.
As a first aspect of the invention, a method for measuring and calculating the urban land utilization mixing degree based on social media data is provided.
The method comprises the following steps:
step (1): acquiring the publishing time, the position information and the text data of the social media data, and preprocessing the text data in the social media data;
step (2): dividing time into a plurality of time intervals, and extracting, from the preprocessed text data, the text topics corresponding to each time interval and the distribution probability of each text topic in each time interval;
step (3): constructing urban land utilization mixing degree measurement indexes according to the distribution probability of each text topic in each time interval obtained in step (2);
step (4): measuring and calculating the urban land utilization mixing degree according to the constructed urban land utilization mixing degree measurement indexes.
The social media data set comprises: a Sina microblog dataset, a Twitter dataset, a Facebook dataset, etc.
As a further improvement of the present invention, the preprocessing comprises: deleting punctuation marks, converting English letters to lower case, removing stop words, and removing words whose occurrence frequency is lower than a threshold.
Examples of stop words: so, Yi, u, We, so, do.
The occurrence frequency of a word is calculated as follows: each word is taken as a statistical unit, and the number of times it occurs among all the words is counted.
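The preprocessing steps listed above can be sketched as follows. This is a minimal illustration, not the implementation used by the invention: the function name `preprocess`, the stop-word set and the `min_count` threshold are assumptions, and `string.punctuation` only covers ASCII punctuation.

```python
import string
from collections import Counter

# Illustrative stop-word set (an assumption; the source only gives examples).
STOP_WORDS = {"so", "we", "do"}

def preprocess(texts, stop_words=STOP_WORDS, min_count=2):
    """Lowercase, strip ASCII punctuation, drop stop words and rare words."""
    table = str.maketrans("", "", string.punctuation)
    # Lowercase each post, delete punctuation, split on whitespace.
    tokenized = [t.lower().translate(table).split() for t in texts]
    # Remove stop words.
    tokenized = [[w for w in doc if w not in stop_words] for doc in tokenized]
    # Count each word over the whole corpus, then drop words below the threshold.
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in tokenized]
```

With the three posts "So good, PIZZA!", "We love pizza." and "pizza time", only the word "pizza" occurs at least twice, so every post reduces to that single token.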
A text topic over the different time intervals comprises two parts. One part is a set of words that express a similar theme; for example, "delicious", "restaurant" and "spicy" all represent a diet-related theme, and this set of words is collectively called the text topic. The other part is the proportion (i.e. the distribution probability) of the text topic in the different time intervals; for example, a certain text topic may account for 30% of the whole time period at 12:00 but only 3% at 20:00, which reflects the dynamic nature of the text topic.
As a further improvement of the invention, step (2) comprises the following steps:
step (21): based on the text data preprocessed in step (1), constructing 48 time intervals of one hour each, where hours 1 to 24 are the working-day time intervals and hours 25 to 48 are the rest-day time intervals; merging the text information in each time interval to obtain 48 text documents, each containing the text data of its interval;
step (22): calculating the text topics of each text document and the distribution probability of each text topic in each time interval with a Latent Dirichlet Allocation (LDA) topic model;
step (23): normalizing the distribution probability of each text topic over the 48 time intervals with the Max-Min standardization method to obtain the normalized distribution probability of each text topic in each time interval.
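Steps (21) and (23) can be sketched as below. The sketch assumes that rest days are Saturday and Sunday (public holidays are ignored) and uses 0-based slots 0 to 47 instead of the 1 to 48 numbering in the text; the function names are hypothetical.

```python
from datetime import datetime

def time_slot(ts):
    """Map a timestamp to one of 48 hourly slots (0-23 working day, 24-47 rest day)."""
    offset = 24 if ts.weekday() >= 5 else 0  # assumption: Sat/Sun are rest days
    return offset + ts.hour

def build_documents(posts):
    """Merge (timestamp, text) posts into 48 text documents, one per slot."""
    docs = [[] for _ in range(48)]
    for ts, text in posts:
        docs[time_slot(ts)].append(text)
    return [" ".join(d) for d in docs]

def min_max(values):
    """Max-Min normalization of one topic's probabilities over the time slots."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant series: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Each topic's 48 probabilities would be passed through `min_max` separately, so every topic's curve spans the range 0 to 1 before the time similarity is computed.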
As a further improvement of the invention, in step (22) the distribution probability of each text topic in each time interval is calculated as:
[Formula image BDA0001661015480000031]
where θ is the distribution probability of each text topic in each time interval, z is a text topic, w is a word in the text document, n is the total number of words, and α and β are the hyper-parameters of the LDA topic model, with α = 50/k and β = 0.1.
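For orientation, the joint distribution of the LDA model in its original publication (Blei, Ng and Jordan, 2003) has the form below. It is given here on the assumption that the formula image corresponds to this standard expression; N is the number of words in a document (n in the text above), and z_n and w_n are the topic and the word at position n.

```latex
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
```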
As a further improvement of the invention, step (3) comprises the following steps:
step (31): according to the normalized distribution probability of each text topic in each time interval obtained in step (2), calculating time similarity with the Time Warp Edit Distance (TWED) model to obtain the time similarity between any two text topics;
step (32): clustering the text topics with a hierarchical clustering model according to the time similarity between any two text topics, and taking the clustering result as the land utilization mixing degree measurement indexes that consider time similarity;
step (33): on the basis of the obtained land utilization mixing degree measurement indexes that consider time similarity, calculating semantic similarity with a TF-IDF model and correcting the calculation result;
step (34): based on the corrected semantic similarity, clustering the text topics with a hierarchical clustering model, and taking the clustering result as the land utilization mixing degree measurement indexes that consider semantic similarity; these indexes are the urban land utilization mixing degree measurement indexes.
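The clustering in steps (32) and (34) can be sketched as agglomerative clustering on a precomputed pairwise distance matrix. The source does not state which linkage is used, so average linkage is an assumption here, and the function name `agglomerative` is hypothetical.

```python
def agglomerative(dist, n_clusters):
    """Average-linkage agglomerative clustering on a symmetric distance matrix.

    dist[i][j] is the distance between items i and j; returns a list of
    clusters, each a list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # Merge the closest pair of clusters.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

In this workflow the matrix would hold the pairwise topic distances, and `n_clusters` would be varied while a cluster validity index is monitored.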
As a further improvement of the present invention, in step (31), the calculation formula based on the TWED model is:
[Formula image BDA0001661015480000032]
where [formula image BDA0001661015480000033] is the time similarity between text topics z_1 and z_2; [formula image BDA0001661015480000034] are the distribution probabilities of text topics z_1 and z_2 in each time interval, respectively; and [formula image BDA0001661015480000035] are calculated by the following formulas, respectively:
[Formula image BDA0001661015480000036]
[Formula image BDA0001661015480000037]
[Formula image BDA0001661015480000038]
where
[Formula image BDA0001661015480000039]
and where [formula image BDA00016610154800000310] are the distribution probabilities of dynamic text topics z_1 and z_2 in the i-th and j-th time intervals, respectively, and [formula image BDA00016610154800000311]. The parameters ν and λ are the hyper-parameters of the TWED model.
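Because the formula images are not reproduced, the following is only a sketch of the recurrence commonly published for the Time Warp Edit Distance (Marteau's formulation), and it may differ in detail from the expression used by the invention. It assumes equally spaced time stamps (one per interval), and the default values of `nu` and `lam` are illustrative.

```python
import math

def twed(a, b, nu=0.001, lam=1.0):
    """Time Warp Edit Distance between two series with unit-spaced time stamps."""
    # Pad both series with a leading 0 at time 0, as in the usual formulation.
    a = [0.0] + list(a)
    b = [0.0] + list(b)
    n, m = len(a), len(b)
    dp = [[math.inf] * m for _ in range(n)]
    dp[0][0] = 0.0
    for i in range(1, n):
        for j in range(1, m):
            # Delete a sample from a (time stamps are the indices, so the gap is 1).
            delete_a = dp[i - 1][j] + abs(a[i] - a[i - 1]) + nu * 1 + lam
            # Delete a sample from b.
            delete_b = dp[i][j - 1] + abs(b[j] - b[j - 1]) + nu * 1 + lam
            # Match the current and previous samples of both series.
            match = (dp[i - 1][j - 1]
                     + abs(a[i] - b[j]) + abs(a[i - 1] - b[j - 1])
                     + nu * (abs(i - j) + abs((i - 1) - (j - 1))))
            dp[i][j] = min(delete_a, delete_b, match)
    return dp[n - 1][m - 1]
```

Here `nu` penalizes matching samples that are far apart in time (stiffness) and `lam` is the constant cost of a delete operation; identical series have distance 0.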
As a further improvement of the present invention, in step (32), the number of clusters is determined by the Davies-Bouldin (DB) index; the lower the DB index, the better the clustering.
The DB index is calculated as:
[Formula image BDA0001661015480000041]
where N is the number of clusters, S_i is the standard deviation of the distribution probability of the i-th dynamic text topic, S_j is the standard deviation of the distribution probability of the j-th dynamic text topic, and c1 and c2 are two different clustering results.
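As an illustration of how such an index behaves, the sketch below computes the Davies-Bouldin index in its textbook form: the scatter of a cluster is the mean distance of its members to the centroid, and separation is the distance between centroids. This is an assumption; the symbol definitions above suggest that the invention may define scatter differently. The function names are hypothetical, and clusters with coinciding centroids are not handled.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def davies_bouldin(clusters):
    """Textbook Davies-Bouldin index; lower values mean better-separated clusters."""
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance of each cluster's members to its centroid.
    scatter = [sum(math.dist(p, c) for p in pts) / len(pts)
               for pts, c in zip(clusters, cents)]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in range(n) if j != i)
    return total / n
```

To choose the number of clusters, one would cluster with several candidate counts and keep the count with the lowest index value.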
As a further improvement of the invention, in step (33), the semantic similarity based on the TF-IDF model is calculated as:
[Formula image BDA0001661015480000042]
where [formula image BDA0001661015480000043] is the text similarity between dynamic text topics z_1 and z_2; [formula image BDA0001661015480000044] are the distribution probabilities of each word in dynamic text topics z_1 and z_2, respectively, and [formula image BDA0001661015480000045]; n is the number of words in each dynamic text topic.
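Since the formula image is not reproduced, the exact similarity expression is unknown. One common way to compare two topics described by per-word weights is cosine similarity, sketched below purely as an assumption; the function name and the dictionary representation (word to weight) are hypothetical.

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two word-weight dictionaries."""
    words = set(p) | set(q)
    dot = sum(p.get(w, 0.0) * q.get(w, 0.0) for w in words)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:  # an empty topic has no direction
        return 0.0
    return dot / (norm_p * norm_q)
```

Two topics that share no words score 0, and two topics with proportional word weights score 1.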
As a further improvement of the invention, in step (33), the calculation result is corrected on the basis of the obtained semantic similarity, with the correction formula:
[Formula image BDA0001661015480000046]
where [formula image BDA0001661015480000047] is the corrected text similarity between dynamic text topics z_1 and z_2.
As a further improvement of the invention, step (4) comprises the following steps:
step (41): repartitioning the text documents: dividing the study region into a plurality of units along roads, each unit being called a land utilization unit; extracting the time information and text data of the social media data in each land utilization unit according to the position information of the social media data, and merging all the text data in each land utilization unit into a new text document;
step (42): calculating the text topics of each new text document and the distribution probability of each new text topic in each land utilization unit with a Latent Dirichlet Allocation (LDA) topic model;
calculating the distribution probability of each urban land utilization mixing degree measurement index in each land utilization unit according to the distribution probability of each new text topic in each land utilization unit and the urban land utilization mixing degree measurement indexes constructed in step (3);
step (43): calculating the land utilization mixing degree of each land utilization unit with an information entropy algorithm, based on the distribution probability of each urban land utilization mixing degree measurement index in each land utilization unit.
As a further improvement of the invention, in step (42), the distribution probability of each land utilization mixing degree measurement index in each land utilization unit is:
[Formula image BDA0001661015480000051]
where [formula image BDA0001661015480000052] is the distribution probability of the i-th land utilization mixing degree measurement index in land utilization unit p_i, [formula image BDA0001661015480000053] is the distribution probability of dynamic text topic z_m in land utilization unit p_i, and M is the total number of dynamic text topics contained in the i-th land utilization mixing degree measurement index.
As a further improvement of the invention, in step (43), the land utilization mixing degree is calculated with the information entropy algorithm as:
[Formula image BDA0001661015480000054]
where H(p_i) is the mixing degree measurement result for land utilization unit p_i, [formula image BDA0001661015480000055] is the distribution probability of the i-th land utilization mixing degree measurement index in land utilization unit p_i, and C is the total number of land utilization mixing degree measurement indexes.
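Steps (42) and (43) can be sketched as follows, under two assumptions that the missing formula images cannot confirm: the probability of an index in a unit is the sum of the probabilities of the topics it contains, and the entropy is the Shannon entropy with the natural logarithm. The function names and the example mapping are hypothetical.

```python
import math

def index_probabilities(topic_probs, index_topics):
    """Sum the topic probabilities of one land use unit into index probabilities."""
    return {idx: sum(topic_probs.get(t, 0.0) for t in topics)
            for idx, topics in index_topics.items()}

def mixing_degree(index_probs):
    """Shannon entropy (natural log) of the index probabilities of one unit."""
    total = sum(index_probs.values())
    h = 0.0
    for p in index_probs.values():
        if p > 0:
            q = p / total  # renormalize so the probabilities sum to 1
            h -= q * math.log(q)
    return h
```

A unit dominated by a single index has entropy 0 (no mixing), while a unit whose C indexes are equally likely reaches the maximum, log C.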
As a second aspect of the present invention, a system for measuring and calculating the urban land utilization mixing degree based on social media data is provided.
The system comprises a memory, a processor, and computer instructions stored in the memory and run on the processor, wherein the computer instructions, when executed by the processor, perform the steps of any of the above methods.
As a third aspect of the present invention, there is provided a computer-readable storage medium;
a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, perform the steps of any of the above methods.
Compared with the prior art, the invention has the following beneficial effects:
First, the time information, position information and text information of the social media data are acquired with a web crawler. Then, an LDA topic model is introduced, and dynamic text topics are extracted by constructing text documents for different time intervals. Next, the time similarity and the semantic similarity of the dynamic text topics are calculated, and land utilization mixing degree measurement indexes are constructed with a hierarchical clustering model. Finally, based on the obtained measurement indexes, the urban land utilization mixing degree is calculated with an information entropy algorithm.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the time variation information of dynamic text topics 14, 20, 23 and 29;
FIGS. 3(a)-3(d) are schematic diagrams of the text semantic information of dynamic text topics 14, 20, 23 and 29;
FIG. 4 is a schematic diagram of the time similarity calculation result for the dynamic text topics;
FIG. 5 is a schematic diagram of the urban land utilization mixing degree measurement indexes that consider time similarity;
FIG. 6 is a schematic diagram of the semantic similarity calculation result for the dynamic text topics;
FIG. 7 is a schematic diagram of the urban land utilization mixing degree measurement indexes that consider semantic similarity;
FIG. 8 is a schematic diagram of the land utilization mixing degree measurement results.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for describing particular embodiments only and is not intended to limit the exemplary embodiments of the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; it should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
In order that the technical features, objects and effects of the present invention may be clearly understood, embodiments of the invention are described below with reference to the accompanying drawings, taking a Twitter data set of the Toronto region as an example.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
As a first embodiment of the invention, a method for measuring and calculating the urban land utilization mixing degree based on social media data is provided.
As shown in FIG. 1, the method comprises:
Step (1): acquiring the publishing time, the position information and the text data of the social media data, and preprocessing the text data in the social media data. The social media data comprises: a Sina microblog dataset, a Twitter dataset, a Facebook dataset, etc.
The preprocessing comprises: deleting punctuation marks, converting English letters to lower case, removing stop words, and removing words whose occurrence frequency is lower than a threshold.
Examples of stop words: so, Yi, u, We, so, do.
The occurrence frequency of a word is calculated as follows: each word is taken as a statistical unit, and the number of times it occurs among all the words is counted.
In this embodiment, the processed Twitter data includes time information, geographical location information, and text information, as shown in table 1.
Table 1 Twitter data and its time information, geographical location information, text information.
Figure BDA0001661015480000071
Step (2): dividing time into a plurality of time intervals, and extracting, from the preprocessed text data, the text topics corresponding to each time interval and the distribution probability of each text topic in each time interval.
A text topic over the different time intervals comprises two parts. One part is a set of words that express a similar theme; for example, "delicious", "restaurant" and "spicy" all represent a diet-related theme, and this set of words is collectively called the text topic. The other part is the proportion (i.e. the distribution probability) of the text topic in the different time intervals; for example, a certain text topic may account for 30% of the whole time period at 12:00 but only 3% at 20:00, which reflects the dynamic nature of the text topic.
Step (2) comprises the following steps:
step (21): based on the text data preprocessed in step (1), constructing 48 time intervals of one hour each, where hours 1 to 24 are the working-day time intervals and hours 25 to 48 are the rest-day time intervals; merging the text information in each time interval to obtain 48 text documents, each containing the text data of its interval;
step (22): calculating the text topics of each text document and the distribution probability of each text topic in each time interval with a Latent Dirichlet Allocation (LDA) topic model;
step (23): normalizing the distribution probability of each text topic over the 48 time intervals with the Max-Min standardization method to obtain the normalized distribution probability of each text topic in each time interval.
In step (22), the distribution probability of each text topic in each time interval is calculated as:
[Formula image BDA0001661015480000072]
where θ is the distribution probability of each text topic in each time interval, z is a text topic, w is a word in the text document, n is the total number of words, and α and β are the hyper-parameters of the LDA topic model, with α = 50/k and β = 0.1.
For the specific calculation method of the text topics, refer to "Blei D M, Ng A Y, Jordan M I. Latent Dirichlet Allocation [J]. Journal of Machine Learning Research, 2003, 3(Jan): 993-1022."
By applying this method, a total of 30 dynamic text topics are extracted. The time variation information and the text semantic information of dynamic text topics 14, 20, 23 and 29 are shown in FIG. 2 and FIGS. 3(a)-3(d), respectively.
Step (3): constructing urban land utilization mixing degree measurement indexes according to the distribution probability of each text topic in each time interval obtained in step (2).
Step (3) comprises the following steps:
step (31): according to the normalized distribution probability of each text topic in each time interval obtained in step (2), calculating time similarity with the Time Warp Edit Distance (TWED) model to obtain the time similarity between any two text topics;
In step (31), the calculation formula based on the TWED model is:
[Formula image BDA0001661015480000081]
where [formula image BDA0001661015480000082] is the time similarity between text topics z_1 and z_2; [formula image BDA0001661015480000083] are the distribution probabilities of text topics z_1 and z_2 in each time interval, respectively; and [formula image BDA0001661015480000084] are calculated by the following formulas, respectively:
[Formula image BDA0001661015480000085]
[Formula image BDA0001661015480000086]
[Formula image BDA0001661015480000087]
where
[Formula image BDA0001661015480000088]
and where [formula image BDA0001661015480000089] are the distribution probabilities of dynamic text topics z_1 and z_2 in the i-th and j-th time intervals, respectively, and [formula image BDA00016610154800000810]. The parameters ν and λ are the hyper-parameters of the TWED model. For the calculation of ν and λ, refer to "Serra J, Arcos J L. An empirical evaluation of similarity measures for time series classification [J]. Knowledge-Based Systems, 2014, 67: 305-314."
Step (32): clustering the text topics with a hierarchical clustering model according to the time similarity between any two text topics, and taking the clustering result as the land utilization mixing degree measurement indexes that consider time similarity.
The number of clusters is determined by the Davies-Bouldin (DB) index; the lower the DB index, the better the clustering.
The DB index is calculated as:
[Formula image BDA0001661015480000091]
where N is the number of clusters, S_i is the standard deviation of the distribution probability of the i-th dynamic text topic, S_j is the standard deviation of the distribution probability of the j-th dynamic text topic, and c1 and c2 are two different clustering results.
Step (33): on the basis of the obtained land utilization mixing degree measurement indexes that consider time similarity, calculating semantic similarity with a TF-IDF model and correcting the calculation result.
The semantic similarity based on the TF-IDF model is calculated as:
[Formula image BDA0001661015480000092]
where [formula image BDA0001661015480000093] is the text similarity between dynamic text topics z_1 and z_2; [formula image BDA0001661015480000094] are the distribution probabilities of each word in dynamic text topics z_1 and z_2, respectively, and [formula image BDA0001661015480000095]; n is the number of words in each dynamic text topic.
On the basis of the obtained semantic similarity, the calculation result is corrected with the correction formula:
[Formula image BDA0001661015480000096]
where [formula image BDA0001661015480000097] is the corrected text similarity between dynamic text topics z_1 and z_2.
Step (34): based on the corrected semantic similarity, clustering the text topics with a hierarchical clustering model, and taking the clustering result as the land utilization mixing degree measurement indexes that consider semantic similarity; these indexes are the urban land utilization mixing degree measurement indexes.
In this embodiment, based on the 30 dynamic text topics obtained in step (2), the time similarity of the topics is calculated with the TWED model; the result is shown in FIG. 4. On this basis, a hierarchical clustering model is applied to extract the land utilization mixing degree measurement indexes that consider time similarity; the number of clusters obtained by experimental calculation is 6, and the resulting indexes are shown in FIG. 5. Then, based on the obtained measurement indexes, the TF-IDF model is applied to calculate the semantic similarity of the topics; the result is shown in FIG. 6. A hierarchical clustering model is then applied to extract the land utilization mixing degree measurement indexes that consider semantic similarity; the number of clusters obtained by experimental calculation is 9, and the resulting indexes are shown in FIG. 7. Combining the measurement indexes that consider time similarity and semantic similarity yields 14 final land utilization mixing degree measurement indexes; the dynamic text topics contained in each index are shown in Table 2:
TABLE 2 Land utilization mixing degree measuring and calculating indexes
Measuring and calculating index    Dynamic text topics
1     Topic 11
2     Topic 13
3     Topic 16
4     Topic 23
5     Topic 27
6     Topics 7, 8, 15, 21
7     Topics 4, 20, 30
8     Topics 1, 17, 18, 19
9     Topics 3, 25
10    Topics 9, 22
11    Topics 10, 28
12    Topics 2, 5, 24, 29
13    Topics 6, 12
14    Topics 14, 26
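For later computation it can help to hold Table 2 as a lookup structure. The sketch below transcribes the table into a dictionary and inverts it, checking along the way that every topic belongs to exactly one index. The names `INDEX_TO_TOPICS` and `invert_index_table` are illustrative only.

```python
# Table 2 transcribed: measuring and calculating index -> dynamic text topics.
INDEX_TO_TOPICS = {
    1: [11], 2: [13], 3: [16], 4: [23], 5: [27],
    6: [7, 8, 15, 21], 7: [4, 20, 30], 8: [1, 17, 18, 19],
    9: [3, 25], 10: [9, 22], 11: [10, 28],
    12: [2, 5, 24, 29], 13: [6, 12], 14: [14, 26],
}


def invert_index_table(index_to_topics):
    """Return topic -> index, raising if a topic is assigned to two indexes."""
    topic_to_index = {}
    for index_id, topics in index_to_topics.items():
        for topic in topics:
            if topic in topic_to_index:
                raise ValueError(f"topic {topic} appears in more than one index")
            topic_to_index[topic] = index_id
    return topic_to_index


TOPIC_TO_INDEX = invert_index_table(INDEX_TO_TOPICS)
```

The 14 indexes together cover all 30 dynamic text topics, each exactly once.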
Step (4): measuring and calculating the urban land utilization mixing degree according to the constructed urban land utilization mixing degree measuring and calculating indexes.
As a further improvement of the invention, step (4) comprises the following steps:
step (41): repartitioning of the text document: dividing a region to be researched into a plurality of units according to a road, wherein each unit is called as a land utilization unit; extracting time information and text data of the social media data in each land utilization unit according to the position information of the social media data, and combining all the text data in each land utilization unit to serve as a new text document;
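A minimal sketch of the document-merging part of step (41), assuming each post has already been assigned to a road-bounded land utilization unit (the spatial join that does this is not shown). The tuple layout `(unit_id, timestamp, text)` and the function name `build_unit_documents` are hypothetical.

```python
from collections import defaultdict


def build_unit_documents(posts):
    """Merge the text of all posts that fall in the same land utilization unit.

    posts -- iterable of (unit_id, timestamp, text) tuples; unit_id is assumed
             to come from an earlier point-in-polygon assignment.
    Returns {unit_id: merged text document}.
    """
    texts_by_unit = defaultdict(list)
    for unit_id, _timestamp, text in posts:
        cleaned = text.strip()
        if cleaned:  # skip posts with no usable text
            texts_by_unit[unit_id].append(cleaned)
    return {unit_id: " ".join(texts) for unit_id, texts in texts_by_unit.items()}
```

Each merged document then plays the role of one "new text document" passed to the topic model in step (42).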
step (42): calculating a text topic of each new text document and a distribution probability of each new text topic in each land utilization unit through a Latent Dirichlet Allocation (LDA) topic model;
calculating the distribution probability of each urban land utilization mixing degree measuring and calculating index in each land utilization unit according to the distribution probability of each new text theme in each land utilization unit and the urban land utilization mixing degree measuring and calculating index constructed in the step (3);
In step (42), the distribution probability of each land utilization mixing degree measuring and calculating index in each land utilization unit is calculated as:
$$P_{p_i}(c) = \sum_{m=1}^{M} P_{p_i}(z_m)$$
where $P_{p_i}(c)$ is the distribution probability of the $c$-th land utilization mixing degree measuring and calculating index in land utilization unit $p_i$, $P_{p_i}(z_m)$ is the distribution probability of dynamic text topic $z_m$ in land utilization unit $p_i$, and $M$ is the total number of dynamic text topics contained in the $c$-th land utilization mixing degree measuring and calculating index.
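The aggregation in step (42) can be illustrated with a short sketch. It assumes that the probability of an index in a land utilization unit is the sum of the probabilities of the dynamic text topics grouped under that index, which is how the variable definitions above read; the function and argument names are hypothetical.

```python
def index_probabilities(topic_probs, index_to_topics):
    """Aggregate per-topic probabilities of one land utilization unit into per-index probabilities.

    topic_probs     -- {topic_id: probability of that topic in the unit}
    index_to_topics -- {index_id: list of topic_ids grouped under the index}
    Assumption: an index's probability is the sum over its member topics.
    """
    return {
        index_id: sum(topic_probs.get(topic, 0.0) for topic in topics)
        for index_id, topics in index_to_topics.items()
    }
```

If the topic probabilities of a unit sum to one and every topic belongs to exactly one index, the index probabilities also sum to one, which is what the entropy calculation in step (43) needs.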
Step (43): calculating the land utilization mixing degree with an information entropy algorithm, based on the distribution probability of each urban land utilization mixing degree measuring and calculating index in each land utilization unit.
In step (43), the land utilization mixing degree is calculated with the information entropy algorithm as:
$$H(p_i) = -\sum_{c=1}^{C} P_{p_i}(c) \log P_{p_i}(c)$$
where $H(p_i)$ is the mixing degree measuring and calculating result for land utilization unit $p_i$, $P_{p_i}(c)$ is the distribution probability of the $c$-th land utilization mixing degree measuring and calculating index in land utilization unit $p_i$, and $C$ is the total number of land utilization mixing degree measuring and calculating indexes.
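As an illustration of step (43), the sketch below computes the Shannon information entropy of one land utilization unit's index probabilities. The natural logarithm is an assumption (the text does not state the base), and the function name `land_use_mixedness` is hypothetical.

```python
import math


def land_use_mixedness(index_probs):
    """Shannon entropy of the index probability distribution of one land utilization unit.

    index_probs -- iterable of probabilities, one per measuring and calculating index.
    Zero probabilities contribute nothing (the limit of p*log(p) as p -> 0 is 0).
    Natural logarithm is assumed; another base only rescales the result.
    """
    return -sum(p * math.log(p) for p in index_probs if p > 0)
```

A unit dominated by a single index scores near zero, while a unit whose probability mass is spread evenly over all C indexes reaches the maximum value log C.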
In the embodiment, the 14 land utilization mixing degree measurement indexes obtained in the step (3) are applied, and the land utilization mixing degree measurement is performed by using an information entropy algorithm, and the measurement result is shown in fig. 8.
As a second embodiment of the present invention, a system for measuring and calculating urban land use mixedness based on social media data is provided;
The social media data-based urban land utilization mixedness measuring and calculating system comprises: a memory, a processor, and computer instructions stored in the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of any of the above methods.
As a third embodiment of the present invention, there is provided a computer-readable storage medium;
a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, perform the steps of any of the above methods.
In summary, the method first acquires the time information, position information and text information of social media data with a web crawler; it then introduces an LDA topic model and extracts dynamic text topics by constructing text documents for different time intervals; next, it calculates the time similarity and semantic similarity of the dynamic text topics and constructs the urban land utilization mixing degree measuring and calculating indexes with a hierarchical clustering model; finally, based on the obtained indexes, it measures and calculates the urban land utilization mixing degree with an information entropy algorithm.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (8)

1. The urban land utilization mixedness measuring and calculating method based on social media data is characterized by comprising the following steps of:
step (1): acquiring the publishing time, the position information and the text data of the social media data, and preprocessing the text data in the social media data;
step (2): dividing time into a plurality of time intervals, and extracting a text theme corresponding to each time interval and the distribution probability of each text theme in each time interval from the text data obtained by preprocessing;
step (3): constructing urban land utilization mixing degree measuring and calculating indexes according to the distribution probability of each text topic in each time interval obtained in the step (2);
the step (3) comprises the following steps:
step (31): performing time similarity calculation with a TWED model on the normalized distribution probability of each text topic in each time interval obtained in the step (2), to obtain the time similarity between any two text topics; wherein TWED stands for Time Warp Edit Distance;
step (32): clustering the text topics by utilizing a hierarchical clustering model according to the time similarity between any two text topics; taking the clustering result as a land utilization mixing degree measuring and calculating index considering the time similarity;
step (33): on the basis of the obtained land utilization mixed degree measuring and calculating index considering the time similarity, performing semantic similarity calculation by using a TF-IDF model, and correcting a calculation result;
step (34): based on the corrected calculation result of the semantic similarity, clustering the text topics by using a hierarchical clustering model, and taking the clustering result as a land utilization mixing degree measurement and calculation index considering the semantic similarity; the land utilization mixing degree measuring and calculating index considering the semantic similarity is the urban land utilization mixing degree measuring and calculating index;
step (4): measuring and calculating the urban land utilization mixing degree according to the constructed urban land utilization mixing degree measuring and calculating indexes;
the step (4) comprises the following steps:
step (41): repartitioning of the text document: dividing a region to be researched into a plurality of units according to a road, wherein each unit is called as a land utilization unit; extracting time information and text data of the social media data in each land utilization unit according to the position information of the social media data, and combining all the text data in each land utilization unit to serve as a new text document;
step (42): calculating a text topic of each new text document and the distribution probability of each new text topic in each land utilization unit through a Latent Dirichlet Allocation (LDA) topic model;
calculating the distribution probability of each urban land utilization mixing degree measuring and calculating index in each land utilization unit according to the distribution probability of each new text theme in each land utilization unit and the urban land utilization mixing degree measuring and calculating index constructed in the step (3);
step (43): calculating the land utilization mixing degree with an information entropy algorithm, based on the distribution probability of each urban land utilization mixing degree measuring and calculating index in each land utilization unit.
2. The method for urban land use mixedness estimation based on social media data according to claim 1, wherein,
the step (2) comprises the following steps:
step (21): constructing 48 time intervals by taking one hour as a time interval based on the text data obtained by preprocessing in the step (1); wherein, the 1 st to 24 th hours are the time intervals of working days, and the 25 th to 48 th hours are the time intervals of rest days; combining the text information in each time interval to obtain 48 text documents, wherein each text document comprises text data;
step (22): calculating a text topic of each text document and the distribution probability of each text topic in each time interval through a Latent Dirichlet Allocation (LDA) topic model;
step (23): and performing normalization processing on the distribution probability of each text topic in 48 time intervals by using a Max-Min standardization method to obtain the normalized distribution probability of each text topic in each time interval.
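Steps (21) and (23) can be sketched as below. Treating Saturday and Sunday as the rest days is an assumption (public holidays and shifted working days are ignored), and the function names are hypothetical. The normalization is the usual Max-Min rescaling to the range [0, 1].

```python
from datetime import datetime


def interval_index(timestamp, rest_weekdays=(5, 6)):
    """Map a post's publishing time to one of the 48 hourly intervals.

    Intervals 1-24 are the hours of a working day, 25-48 the hours of a rest day.
    rest_weekdays uses datetime.weekday() numbering (Monday = 0); Saturday and
    Sunday as rest days is an assumption.
    """
    hour_slot = timestamp.hour + 1  # hour 0 -> interval 1
    if timestamp.weekday() in rest_weekdays:
        return hour_slot + 24
    return hour_slot


def max_min_normalize(values):
    """Max-Min normalization of one topic's probabilities over the 48 intervals."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # flat series: no variation to rescale
    return [(v - lowest) / (highest - lowest) for v in values]
```

For example, a post published at 08:30 on a Tuesday falls in interval 9, and one published just after midnight on a Saturday falls in interval 25.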
3. The method for urban land use mixedness estimation based on social media data according to claim 1, wherein,
in the step (31), the calculation formula based on the TWED model is as follows:
Figure FDA0003134189450000021
where
Figure FDA0003134189450000022
is the time similarity calculation result for text topics $z_1$ and $z_2$, and
Figure FDA0003134189450000023
are the distribution probabilities of text topics $z_1$ and $z_2$, respectively, within each time interval;
Figure FDA0003134189450000024
are respectively calculated by the following formulas:
Figure FDA0003134189450000025
Figure FDA0003134189450000026
Figure FDA0003134189450000027
where
Figure FDA0003134189450000028
and where
Figure FDA0003134189450000029
are the distribution probabilities of dynamic text topics $z_1$ and $z_2$ in the $i$-th and $j$-th time intervals, respectively, and
Figure FDA00031341894500000210
the parameters $\nu$ and $\lambda$ are hyper-parameters of the TWED model.
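For orientation, the following is a sketch of the commonly published Time Warp Edit Distance recursion with stiffness parameter ν and deletion penalty λ. It is not transcribed from the claim's formula images, so the exact form used in the patent may differ; the default parameter values, the absolute-difference sample distance, and the use of interval numbers 1..n as timestamps are assumptions.

```python
def twed(series_a, series_b, nu=0.001, lam=1.0, times_a=None, times_b=None):
    """Time Warp Edit Distance between two series (standard dynamic-programming form).

    series_a, series_b -- sequences of floats, e.g. a topic's normalized probability
                          in each time interval
    nu, lam            -- stiffness and deletion-penalty hyper-parameters
                          (default values are placeholders)
    times_a, times_b   -- timestamps; default to 1..n (assumption)
    """
    ta = list(times_a) if times_a is not None else list(range(1, len(series_a) + 1))
    tb = list(times_b) if times_b is not None else list(range(1, len(series_b) + 1))
    # Pad both series with an initial zero sample at time zero.
    a = [0.0] + list(series_a)
    b = [0.0] + list(series_b)
    ta = [0.0] + ta
    tb = [0.0] + tb

    n, m = len(a), len(b)
    inf = float("inf")
    dp = [[inf] * m for _ in range(n)]
    dp[0][0] = 0.0

    for i in range(1, n):
        for j in range(1, m):
            # Delete a sample from series a.
            delete_a = dp[i - 1][j] + abs(a[i] - a[i - 1]) + nu * (ta[i] - ta[i - 1]) + lam
            # Delete a sample from series b.
            delete_b = dp[i][j - 1] + abs(b[j] - b[j - 1]) + nu * (tb[j] - tb[j - 1]) + lam
            # Match the current and previous samples of both series.
            match = (dp[i - 1][j - 1]
                     + abs(a[i] - b[j]) + abs(a[i - 1] - b[j - 1])
                     + nu * (abs(ta[i] - tb[j]) + abs(ta[i - 1] - tb[j - 1])))
            dp[i][j] = min(delete_a, delete_b, match)
    return dp[n - 1][m - 1]
```

Applied to two topics' 48-value probability series, a smaller distance indicates more similar daily rhythms; the pairwise distances can then feed the hierarchical clustering of step (32).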
4. The method for urban land use mixedness estimation based on social media data according to claim 1, wherein,
in the step (33), the semantic similarity calculation formula based on the TF-IDF model is as follows:
Figure FDA00031341894500000211
where
Figure FDA00031341894500000212
is the text similarity calculation result for dynamic text topics $z_1$ and $z_2$,
Figure FDA00031341894500000213
are the word distribution probabilities of dynamic text topics $z_1$ and $z_2$, respectively, and
Figure FDA0003134189450000031
n is the number of words in each dynamic text topic;
in the step (33), the calculation result is corrected on the basis of the obtained semantic similarity, with the following correction formula:
Figure FDA0003134189450000032
where
Figure FDA0003134189450000033
is the corrected text similarity calculation result for dynamic text topics $z_1$ and $z_2$.
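The claim's similarity and correction formulas are only available as images, so they are not reproduced here. As a loosely related illustration, the sketch below compares two topics by the cosine similarity of their word-weight vectors, a measure often paired with TF-IDF weights. Using cosine similarity is an assumption and may not match the patented formula; the correction step is not sketched, and the function name is hypothetical.

```python
import math


def cosine_similarity(weights_1, weights_2):
    """Cosine similarity between two topics' word-weight vectors.

    weights_1, weights_2 -- {word: weight}, e.g. TF-IDF weights or word
                            distribution probabilities of each topic.
    Assumption: cosine similarity stands in for the claim's (unreproduced) formula.
    """
    vocabulary = set(weights_1) | set(weights_2)
    dot = sum(weights_1.get(w, 0.0) * weights_2.get(w, 0.0) for w in vocabulary)
    norm_1 = math.sqrt(sum(v * v for v in weights_1.values()))
    norm_2 = math.sqrt(sum(v * v for v in weights_2.values()))
    if norm_1 == 0.0 or norm_2 == 0.0:
        return 0.0  # an empty topic is treated as dissimilar to everything
    return dot / (norm_1 * norm_2)
```

Topics that share heavily weighted words score close to 1, and topics with no words in common score 0.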
5. The method for urban land use mixedness estimation based on social media data according to claim 1, wherein,
in the step (42), the distribution probability of each land utilization mixing degree measuring and calculating index in each land utilization unit is calculated as:
$$P_{p_i}(c) = \sum_{m=1}^{M} P_{p_i}(z_m)$$
where $P_{p_i}(c)$ is the distribution probability of the $c$-th land utilization mixing degree measuring and calculating index in land utilization unit $p_i$, $P_{p_i}(z_m)$ is the distribution probability of dynamic text topic $z_m$ in land utilization unit $p_i$, and $M$ is the total number of dynamic text topics contained in the $c$-th land utilization mixing degree measuring and calculating index.
6. The method for urban land use mixedness estimation based on social media data according to claim 1, wherein,
in the step (43), the land utilization mixing degree is calculated with the information entropy algorithm as:
$$H(p_i) = -\sum_{c=1}^{C} P_{p_i}(c) \log P_{p_i}(c)$$
where $H(p_i)$ is the mixing degree measuring and calculating result for land utilization unit $p_i$, $P_{p_i}(c)$ is the distribution probability of the $c$-th land utilization mixing degree measuring and calculating index in land utilization unit $p_i$, and $C$ is the total number of land utilization mixing degree measuring and calculating indexes.
7. A social media data-based urban land utilization mixedness measuring and calculating system, comprising: a memory, a processor, and computer instructions stored in the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of any one of claims 1-6.
8. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1-6.
CN201810461482.9A 2018-05-15 2018-05-15 Urban land utilization mixedness measuring and calculating method and system based on social media data Active CN108763349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810461482.9A CN108763349B (en) 2018-05-15 2018-05-15 Urban land utilization mixedness measuring and calculating method and system based on social media data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810461482.9A CN108763349B (en) 2018-05-15 2018-05-15 Urban land utilization mixedness measuring and calculating method and system based on social media data

Publications (2)

Publication Number Publication Date
CN108763349A CN108763349A (en) 2018-11-06
CN108763349B true CN108763349B (en) 2021-08-31

Family

ID=64006944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810461482.9A Active CN108763349B (en) 2018-05-15 2018-05-15 Urban land utilization mixedness measuring and calculating method and system based on social media data

Country Status (1)

Country Link
CN (1) CN108763349B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111324831A (en) * 2018-12-17 2020-06-23 中国移动通信集团北京有限公司 Method and device for detecting fraudulent website
CN110633890A (en) * 2019-08-06 2019-12-31 广东晟腾地信科技有限公司 Land utilization efficiency judgment method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679942A (en) * 2015-01-29 2015-06-03 华南理工大学 Construction land bearing efficiency measuring method based on data mining
CN107885833A (en) * 2017-11-09 2018-04-06 山东师范大学 Method and system based on the change of Web newsletter archive quick detections ground mulching

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
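Employing Crowdsourced Geographic Information to Classify Land Cover with Spatial Clustering and Topic Model; Mengyuan et al.; Remote Sensing; 20170613; full text *
Spectral clustering for sensing urban land use using Twitter activity; Frias-Martinez V et al.; Engineering Applications of Artificial Intelligence; 20141031; full text *
Research on residents' commuting behavior and urban job-housing spatial characteristics based on multi-source trajectory data mining; Mao Feng (毛峰); China Doctoral Dissertations Full-text Database, Basic Sciences; 20151015; full text *
Research on the living space of Shenzhen residents based on microblog data; Chen Mingjiao (陈名娇); China Master's Theses Full-text Database, Engineering Science and Technology II; 20170715; full text *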

Also Published As

Publication number Publication date
CN108763349A (en) 2018-11-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant