CN115577190A - Tourist behavior data extraction method - Google Patents


Info

Publication number
CN115577190A
Authority
CN
China
Prior art keywords
time
travel
space
tourist
check
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211270201.4A
Other languages
Chinese (zh)
Other versions
CN115577190B (en)
Inventor
赵莹
杨羽菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202211270201.4A
Publication of CN115577190A
Application granted
Publication of CN115577190B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application belongs to the technical field of travel data processing and discloses a tourist behavior data extraction method. The method comprises the following steps: obtaining check-in data of tourist attractions and performing structured processing to obtain a check-in time-space database; acquiring a first travel note sample from a travel website, marking the time information and location information of each travel note text to obtain a marked travel space-time path, and forming a preliminary analysis module based on the marking method; acquiring a second travel note sample, running the preliminary analysis module to obtain an analyzed travel space-time path for the second travel note sample, and refining the preliminary analysis module based on the analyzed travel space-time path to obtain a final analysis module; applying the final analysis module to travel note samples within a preset time window and a preset destination range to obtain a travel note time-space database; and obtaining a visualized tourist space-time behavior path diagram based on the check-in time-space database and the travel note time-space database. The method provides structured data for subsequent patent analysis in the tourism field.

Description

Tourist behavior data extraction method
Technical Field
The application relates to the technical field of travel data processing, in particular to a tourist behavior data extraction method.
Background
In recent years, with the rapid development of the economy and of transportation, the willingness of domestic tourists to travel has increased markedly, the number of tourists has grown continuously, and the income of tourism-related industries has risen. Meanwhile, with the development of internet technology, tourists leave a large amount of tourism-related data on the internet while traveling. This data can be used to analyze and study tourism marketing plans, tourist volume prediction, route planning, scenic spot evaluation, and the like, so that better tourism services can be provided to tourists and better tourism products can be developed. However, although the prior art draws on a wide range of information sources, its data collection concentrates on extracting static tourist information and lacks structured processing of the dynamic information generated as tourists move between different scenic spots.
Disclosure of Invention
Therefore, the embodiment of the application provides a tourist behavior data extraction method, and structured extraction and visual processing of dynamic tourist information are achieved.
In a first aspect, the application provides a method for extracting tourist behavior data.
The application is realized by the following technical solution:
a tourist behavior data extraction method, the method comprising:
acquiring tourist attraction check-in data, and performing structured processing on the tourist attraction check-in data to obtain a check-in time-space database based on the tourist attraction check-in data;
acquiring a first travel note sample from a travel website, marking the time information and location information of each travel note text in the first travel note sample to obtain a marked travel space-time path, and forming a preliminary analysis module based on the marking method used for the marked travel space-time path;
acquiring a second travel note sample, running the preliminary analysis module to obtain analyzed travel space-time paths for all travel note texts of the second travel note sample, and refining the preliminary analysis module based on the analyzed travel space-time paths to obtain a final analysis module;
applying the final analysis module to travel note samples within a preset time window and a preset destination range to obtain a travel note time-space database based on travel notes;
and constructing a tourist flow behavior database based on the check-in time-space database and the travel note time-space database, and obtaining a visualized tourist space-time behavior path diagram based on the tourist flow behavior database.
In a preferred example of the present application, before the step of constructing the tourist flow behavior database based on the check-in time-space database and the travel note time-space database and obtaining the visualized tourist space-time behavior path diagram based on the tourist flow behavior database, the method further includes:
collecting comment data of tourist attractions, calculating the share of each single tourist attraction in the comment data of all tourist attractions in the local city, and obtaining a reference number of tourist visitors for each tourist attraction based on that share;
and obtaining a first number of tourist visitors based on the check-in time-space database, obtaining a second number of tourist visitors based on the travel note time-space database, and calculating the deviation ratio of each of the first and second numbers of tourist visitors from the reference number of tourist visitors.
In a preferred example of the present application, after the deviation ratios of the first and second numbers of tourist visitors from the reference number of tourist visitors are calculated, the method further includes:
comparing each deviation ratio with a preset deviation ratio, and if the deviation ratio exceeds the preset deviation ratio, further refining the check-in time-space database and the travel note time-space database;
and if the deviation ratio is within the preset deviation ratio, obtaining the tourist flow behavior database based on the check-in time-space database and the travel note time-space database.
In a preferred example of the present application, the step of acquiring tourist attraction check-in data and performing structured processing on it to obtain a check-in time-space database based on the tourist attraction check-in data may further include:
acquiring the scenic spot check-in ID of each scenic spot according to the names of the scenic spots in the target area, to obtain a list of scenic spot names, scenic spot check-in IDs, and scenic spot numbers;
acquiring the check-in data of the user IDs corresponding to the scenic spot check-in IDs of all scenic spots within a time window, to obtain an initial user database of the scenic spots, wherein the initial user database comprises user IDs, check-in times, check-in places, and check-in contents;
and acquiring the personal information of all user IDs in the initial user database, adding it to the initial user database as an attached table, and obtaining the check-in time-space database of the tourist attraction check-in data.
In a preferred example of the present application, the step of acquiring the scenic spot check-in ID of each scenic spot further includes:
establishing a master check-in ID and a plurality of slave check-in IDs for the scenic spots, aggregating the plurality of slave check-in IDs under the master check-in ID, and using the master check-in ID as the scenic spot check-in ID of each scenic spot.
In a preferred example of the present application, the step of acquiring the check-in data of the user IDs corresponding to the scenic spot check-in IDs of all scenic spots within the time window may further include:
if the same user ID corresponds to a plurality of different scenic spot check-in IDs, associating the user ID with the plurality of different scenic spot check-in IDs;
and if the same user ID corresponds to the same scenic spot check-in ID multiple times, performing deduplication on the repeated scenic spot check-in IDs.
In a preferred example of the present application, the step of acquiring a first travel note sample from a travel website, marking the time information and location information of each travel note text in the first travel note sample to obtain a marked travel space-time path, and forming a preliminary analysis module based on the marking method used for the marked travel space-time path includes:
marking the time keywords of each travel note text in the first travel note sample as precise date, precise time, fuzzy time, or relative time, placing the time keywords into a time lexicon, and dividing the travel note text into text segments with the precise dates and precise times as dividing points;
identifying the place keywords of each text segment as precise place, fuzzy place, or associated place, and placing the place keywords into a place lexicon;
and constructing the preliminary analysis module based on the extraction method for the time keywords and the place keywords.
In a preferred example of the present application, the step of acquiring a second travel note sample, running the preliminary analysis module to obtain analyzed travel space-time paths for all travel note texts of the second travel note sample, and refining the preliminary analysis module based on the analyzed travel space-time paths to obtain a final analysis module includes:
if the second travel note sample contains travel photos, extracting the time information and the longitude and latitude information from the travel photos, arranging the travel photos in chronological order, and generating an image space-time path of the travel photos.
In a preferred example of the present application, the step of refining the preliminary analysis module based on the analyzed travel space-time paths to obtain the final analysis module may further include:
extracting some of the travel note texts from the second travel note sample, marking them to obtain marked travel space-time paths, comparing the analyzed travel space-time paths obtained by the preliminary analysis module with the marked travel space-time paths, and refining the preliminary analysis module based on the comparison results to obtain a final time lexicon and a final analysis module.
In a preferred example of the present application, the step of obtaining a visualized tourist space-time behavior path diagram based on the check-in time-space database and the travel note time-space database may further include:
forming a path map of a single tourist or of a plurality of tourists with the ArcGIS software tool based on the check-in time-space database and the travel note time-space database, and adding a time axis and a base map to the path map to form the visualized tourist space-time behavior path diagram.
In summary, compared with the prior art, the beneficial effects of the technical solution provided by the embodiments of the present application include at least the following:
check-in data of tourist attractions is obtained and structurally processed to obtain a check-in time-space database; the time information and place information of travel note texts acquired from a travel website are marked, an analysis module is formed based on the marking method, and the analysis module is then used to analyze travel note samples to obtain a travel note time-space database; and a tourist flow behavior database is formed based on the check-in time-space database and the travel note time-space database, on which spatial analysis and visualization are performed. By fusing objective space-time geographic information with dynamic tourism information on individuals moving between different scenic spots, the method structures the extraction of subjective check-in data and travel note content, compensates for the lack of attention to individual subjective information in other data collection methods, and makes space-time paths more refined and concrete.
Drawings
Fig. 1 is a schematic flowchart of a tourist behavior data extraction method according to an exemplary embodiment of the present application;
Fig. 2 is a schematic flowchart of creating a check-in time-space database according to an exemplary embodiment of the present application;
Fig. 3 is a schematic flowchart of creating a travel note time-space database according to an exemplary embodiment of the present application.
Detailed Description
The present embodiment merely explains the present application and does not limit it. After reading this specification, those skilled in the art may modify the embodiment as needed without inventive contribution, and all such modifications are protected by patent law within the scope of the claims of the present application.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In addition, the term "and/or" in the present application merely describes an association relationship between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, in the present application, the character "/" indicates that the preceding and following associated objects are in an "or" relationship, unless otherwise specified.
The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution.
The embodiments of the present application will be described in further detail with reference to the drawings attached hereto.
In an embodiment of the present application, a tourist behavior data extraction method is provided. As shown in Fig. 1, the main steps are as follows:
S10: acquiring tourist attraction check-in data, and performing structured processing on the tourist attraction check-in data to obtain a check-in time-space database based on the tourist attraction check-in data.
Specifically, obtaining the tourist attraction check-in data from a microblog is taken as an example. Python is used to obtain the tourist attraction check-in data in the target area, and the data is then subjected to structured processing. As shown in Fig. 2, the specific steps are as follows:
searching the microblog check-in page for each scenic spot name in the target area one by one, acquiring the scenic spot check-in ID and scenic spot number corresponding to each scenic spot name, and creating a list in which each row associates a scenic spot name with its scenic spot check-in ID and scenic spot number;
and acquiring the user check-in data corresponding to the scenic spot check-in IDs of all scenic spots within the time window, the user check-in data comprising the user ID, check-in time, check-in place, and check-in content associated with each scenic spot check-in ID, to obtain an initial user database of the scenic spots. It should be noted that social media data such as microblog data can only be crawled in real time, so a time window for data acquisition must be set. Depending on the purpose of the data analysis, the time window can be one of three types: a year unit, covering a given year; a month unit, covering a given month of a given year; and a holiday unit, covering the 3 to 7 days of a statutory holiday. Personal information corresponding to all user IDs in the initial user database is then acquired, including the user's gender, place of origin, date of birth, and graduation institution, and is added to the initial user database as an attached table to obtain the check-in time-space database based on the tourist attraction check-in data.
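The three kinds of time window described above can be sketched as follows. This is a minimal illustration only; the function names (`year_window`, `month_window`, `holiday_window`, `in_window`) are hypothetical and do not come from the application.

```python
import calendar
from datetime import date, timedelta

def year_window(year):
    """Year unit: the whole of a given year (inclusive bounds)."""
    return date(year, 1, 1), date(year, 12, 31)

def month_window(year, month):
    """Month unit: a given month of a given year (inclusive bounds)."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)

def holiday_window(first_day, days):
    """Holiday unit: assumed to span 3 to 7 consecutive days."""
    if not 3 <= days <= 7:
        raise ValueError("holiday windows are assumed to span 3 to 7 days")
    return first_day, first_day + timedelta(days=days - 1)

def in_window(day, window):
    """True if a check-in date falls inside the window."""
    start, end = window
    return start <= day <= end
```

A crawler would then keep only the check-in records whose date satisfies `in_window` for the chosen window.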
Preferably, a master check-in ID and several slave check-in IDs are established for the scenic spots, the slave check-in IDs are aggregated under the master check-in ID, and the master check-in ID is used as the scenic spot check-in ID of each scenic spot.
Specifically, because the microblog check-in page may list several place names for the same place, or because a scenic spot may contain several smaller scenic spots, some scenic spots have more than one scenic spot check-in ID. The method establishes master check-in IDs and slave check-in IDs for the scenic spots; when several slave check-in IDs belong to the same master check-in ID, they are aggregated under that master check-in ID, which is then used as the scenic spot check-in ID of the scenic spot.
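One way to picture the master and slave check-in IDs is a lookup table that maps every slave ID to its master. The sketch below assumes such a table has already been compiled by hand; the IDs and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping: each slave check-in ID points to its master check-in ID.
SLAVE_TO_MASTER = {
    "poi_tower_east_gate": "poi_tower",
    "poi_tower_observation_deck": "poi_tower",
}

def to_master(checkin_id):
    """Return the master check-in ID; unmapped IDs are treated as their own master."""
    return SLAVE_TO_MASTER.get(checkin_id, checkin_id)

def group_by_master(checkin_ids):
    """Aggregate raw check-in IDs under their master check-in ID."""
    groups = defaultdict(list)
    for checkin_id in checkin_ids:
        groups[to_master(checkin_id)].append(checkin_id)
    return dict(groups)
```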
Preferably, if the same user ID corresponds to a plurality of different scenic spot check-in IDs, the user ID is associated with those scenic spot check-in IDs; if the same user ID corresponds to the same scenic spot check-in ID multiple times, the repeated scenic spot check-in IDs are deduplicated. Specifically, Python can be used to traverse the data and filter out repeated records, which prevents repeated data from being stored in the database and producing a large amount of redundancy.
Because the same user may check in on the microblog at several scenic spots in the target area, the initial user database contains user IDs that correspond to several different scenic spot check-in IDs, and these different scenic spot check-in IDs are associated with the user ID. Meanwhile, if the same user ID corresponds to the same scenic spot check-in ID multiple times, the repeated scenic spot check-in IDs are deduplicated. This yields a behavior track of each individual user's movement through the scenic spot space, expressed as a travel space-time data table. The travel space-time data table has one row per user ID, with the columns arranged in the order "time point 1, place 1, time point 2, place 2, ...", listing the travel space-time information corresponding to each user ID.
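The deduplication and the row layout "time point 1, place 1, time point 2, place 2, ..." could be sketched as below. This is an illustration under stated assumptions: records are `(user_id, datetime, spot_id)` tuples, and a repeat check-in at the same spot is dropped when it falls within a user-chosen `dedup_window` (one day here, a hypothetical default) of the last kept check-in at that spot.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def build_trajectories(records, dedup_window=timedelta(days=1)):
    """Group check-ins by user, order them by time, and drop repeat
    check-ins at the same spot that fall inside dedup_window."""
    last_kept = {}
    paths = defaultdict(list)
    for user_id, ts, spot_id in sorted(records, key=lambda r: (r[0], r[1])):
        key = (user_id, spot_id)
        if key in last_kept and ts - last_kept[key] < dedup_window:
            continue  # repeated check-in at the same spot: skip it
        last_kept[key] = ts
        paths[user_id].append((ts, spot_id))
    return dict(paths)

def to_row(user_id, path):
    """Flatten one user's path into [user_id, time 1, place 1, time 2, place 2, ...]."""
    row = [user_id]
    for ts, spot_id in path:
        row += [ts.isoformat(sep=" "), spot_id]
    return row
```

Widening or narrowing `dedup_window` changes whether a long stay at one resort is kept as several visits or collapsed into one.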
Table 1 shows part of the data in the check-in time-space database, including fields such as "user ID", "user place of origin", "time point 1", and "check-in place 1".
TABLE 1
[Table 1 is provided as images BDA0003894852010000051 and BDA0003894852010000061 in the original publication.]
Note: (1) a tourist's place of origin must not fully contain the check-in place; otherwise, the user should be excluded as a non-tourist; (2) the time limit used for deduplication must be set by the analyst, taking into account cases such as the tourist in the last row of the table, who stays in a vacation area for a long time.
As shown in Fig. 3, the travel note data comes from the text content and picture links of authoritative travel guide websites, including but not limited to the text content and picture links on Mafengwo, Ctrip, Qunar, and Qyer. The specific acquisition procedure is as follows:
S20: automatically extracting different types of travel notes using Python, randomly drawing a first travel note sample from them, marking the time information and location information of each travel note text to obtain the marked travel space-time path of the travel note text, summarizing the method used to mark the time information and location information in the marked travel space-time path, and forming a preliminary analysis module based on the marking method.
Specifically, for each user ID, the time keywords and place keywords in the time lexicon and place lexicon are linked in series according to chronological order and change of place. The time keywords in each travel note text are identified manually, marked as "precise date", "precise time", "fuzzy time", or "relative time", and placed into the corresponding category of the time lexicon; the travel note text is then divided, with each "precise date" and "precise time" as a dividing point, into text segments ordered chronologically. For example, a "precise date" may be expressed as September 6, 9/6, Day 2, and/or DAY2; a "precise time" may be expressed as half past ten, 14 o'clock, and/or four p.m.; a "fuzzy time" may be expressed as morning, afternoon, evening, early morning, breakfast, lunch, and/or night scenes; and a "relative time" may be expressed as 15 minutes later, walked for approximately 2 hours, and/or played for about 1.5 hours. The four categories are applied as follows: "precise date" and "precise time" serve as dividing points; a "fuzzy time" is estimated as a specific time according to whether it falls in the morning or the afternoon; and a "relative time" is converted into an absolute time by a median method. The time lexicon is used by matching and calculation, with the final aim of obtaining a specific time point.
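As a rough illustration of the four time categories, the sketch below tags time keywords with regular expressions, splits a text at precise dates and times, and resolves a relative time against a known anchor. It is an assumption-laden sketch: the patterns cover only English-style examples, every name is hypothetical, and the simple "anchor plus offset" rule stands in for the median method, whose details are not given in the text.

```python
import re
from datetime import datetime, timedelta

# Hypothetical patterns, one per category of the time lexicon.
TIME_PATTERNS = [
    ("precise_date", re.compile(r"\b(?:\d{1,2}/\d{1,2}|DAY\s?\d+)\b", re.IGNORECASE)),
    ("precise_time", re.compile(r"\b\d{1,2}:\d{2}\b")),
    ("relative_time", re.compile(r"\b\d+(?:\.\d+)?\s*(?:minutes?|hours?)\s+later\b", re.IGNORECASE)),
    ("fuzzy_time", re.compile(r"\b(?:early morning|morning|afternoon|evening|breakfast|lunch)\b", re.IGNORECASE)),
]
# Zero-width split points placed just before each precise date or precise time.
SPLIT_RE = re.compile(r"(?=\b(?:\d{1,2}/\d{1,2}|DAY\s?\d+|\d{1,2}:\d{2})\b)", re.IGNORECASE)

def tag_time_keywords(text):
    """Return (category, keyword) pairs in order of appearance."""
    found = []
    for label, pattern in TIME_PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, keyword) for _, label, keyword in sorted(found)]

def split_segments(text):
    """Divide the text into segments that each start at a precise date or time."""
    return [part.strip() for part in SPLIT_RE.split(text) if part.strip()]

def resolve_relative(anchor, keyword):
    """Turn e.g. '15 minutes later' into an absolute time, given the previous time."""
    amount, unit = re.match(r"(\d+(?:\.\d+)?)\s*(minute|hour)", keyword, re.IGNORECASE).groups()
    if unit.lower() == "minute":
        return anchor + timedelta(minutes=float(amount))
    return anchor + timedelta(hours=float(amount))
```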
The place keywords in the text segments of each travel note text are identified manually, marked as "precise place", "fuzzy place", or "associated place", and placed into the place lexicon. The place lexicon is used by association and comparison, with the final aim of unifying all places to the same level and matching them to the time keywords. For example, a "precise place" may be expressed as Hong Kong ("Gang"), Fujian (Fujian Province, "Min"), Guangzhou Tower ("Slim Waist"), and/or Huangguoshu Waterfall (Huangguoshu); a "fuzzy place" may be expressed as main, terminal, and/or mountain top; and an "associated place" may be expressed as arrived somewhere, went to somewhere, returned to somewhere, visited somewhere, climbed somewhere, went around to somewhere, stayed somewhere, or went from place A to place B. The three categories are applied by association and comparison: within the descriptive statement group for a given time, the position at that time is extracted according to the matching result to form a space-time path. A fuzzy place must be resolved to a specific position from context. If several positions appear, further analysis decides whether times need to be added or positions need to be deleted. If the previous text segment contains a precise place and the current text segment does not, the precise place information of the previous segment can be carried over.
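The "association and comparison" use of the place lexicon, including carrying a precise place over from the previous segment, might look like the following sketch. The lexicon entries and aliases are hypothetical examples, and whole-word matching is an assumption made for English text.

```python
import re

# Hypothetical place lexicon: canonical name -> aliases seen in travel notes.
PLACE_LEXICON = {
    "Huangguoshu Waterfall": ["Huangguoshu Waterfall", "Huangguoshu"],
    "Hong Kong": ["Hong Kong", "HK"],
}

def find_precise_place(segment):
    """Return the canonical name of the first lexicon place found, else None."""
    for canonical, aliases in PLACE_LEXICON.items():
        for alias in aliases:
            if re.search(r"\b" + re.escape(alias) + r"\b", segment, re.IGNORECASE):
                return canonical
    return None

def assign_places(segments):
    """Pair each segment with a place; a segment without a precise place
    inherits the precise place of the previous segment."""
    result, last_place = [], None
    for segment in segments:
        place = find_precise_place(segment) or last_place
        result.append((segment, place))
        last_place = place
    return result
```

Mapping every alias to one canonical name is what unifies the places "to the same level" before they are matched with time keywords.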
Based on the above extraction method for time keywords and place keywords, a preliminary analysis module for machine learning is written following the logic of chronological narration and change of place. The specific steps are as follows:
F1: setting a preview area, and screening the text segments that contain space-time information;
F2: setting a text query input box for the time lexicon, and automatically identifying the time keywords in the text segments that contain space-time information;
F3: linking the output of the time lexicon's text query input box to the matching query input interface in the task template, and mapping the identified time keywords to the corresponding category of the time lexicon;
F4: linking the output conversion interface to the corresponding conversion program, and outputting the time information corresponding to each input user ID to a preset table in date-then-time matching order, to form a descriptive statement group centered on time;
F5: setting a query input box for the place lexicon, and automatically identifying the place keywords in the text segments that contain space-time information;
F6: linking the output of the place lexicon's text query input box to the matching query input interface in the task template, and mapping the identified place keywords to the corresponding category of the place lexicon;
F7: linking the output conversion interface to the corresponding conversion program, and outputting the place information of each user ID at each time to the preset table, to form a space-time path.
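Steps F1 to F7 can be approximated end to end by a small script that screens segments, matches time and place keywords, and writes the result to a table. This is only a sketch: the query boxes, task template, and conversion interfaces of the original tool are replaced here by plain functions, and the regular expression, lexicon, and column names are hypothetical.

```python
import csv
import io
import re

TIME_RE = re.compile(r"\b(?:\d{1,2}/\d{1,2}|\d{1,2}:\d{2})\b")  # stand-in time lexicon
PLACE_LEXICON = ["Huangguoshu Waterfall", "Hong Kong"]            # stand-in place lexicon

def parse_note(user_id, segments):
    """Keep segments with both time and place information (F1), match the
    time keyword (F2 to F4) and the place keyword (F5 to F6)."""
    rows = []
    for segment in segments:
        times = TIME_RE.findall(segment)
        places = [p for p in PLACE_LEXICON if p.lower() in segment.lower()]
        if times and places:
            rows.append((user_id, times[0], places[0]))
    return rows

def write_table(rows):
    """Write the space-time path to a preset table (F7), here CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["user_id", "time", "place"])
    writer.writerows(rows)
    return buffer.getvalue()
```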
S30: acquiring a second travel note sample, running the preliminary analysis module to obtain analyzed travel space-time paths for all travel note texts of the second travel note sample, and refining the preliminary analysis module based on the analyzed travel space-time paths to obtain a final analysis module.
Specifically, the preliminary analysis module is run in a rolling loop over the sample, and the analyzed travel space-time paths it produces are obtained.
Preferably, if the travel note texts of the second travel note sample contain travel pictures, the time information and the latitude and longitude information in the travel pictures are read, the travel pictures are arranged in chronological order, and an image space-time path of the travel pictures is generated.
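Once the capture time and coordinates have been read from each picture, building the image space-time path is a matter of sorting. The sketch below assumes the metadata has already been extracted into dictionaries with the hypothetical keys `taken_at`, `lat`, and `lon`; reading the metadata from the image files themselves is not shown.

```python
from datetime import datetime

def image_path(photos):
    """Order photos by capture time and return (time, lat, lon) points.
    Photos missing the time or either coordinate are skipped."""
    usable = [
        p for p in photos
        if p.get("taken_at") is not None
        and p.get("lat") is not None
        and p.get("lon") is not None
    ]
    usable.sort(key=lambda p: p["taken_at"])
    return [(p["taken_at"], p["lat"], p["lon"]) for p in usable]
```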
Furthermore, some of the travel note texts are extracted from the second travel note sample, and their time keywords and place keywords are marked to obtain marked travel space-time paths. The marked travel space-time paths of these travel note texts are compared with the analyzed travel space-time paths produced by the preliminary analysis module, the consistency rate between the two is calculated from the differences found in the comparison, and the preliminary analysis module is refined accordingly.
Whether the consistency rate reaches a preset threshold is then judged. If the consistency rate does not reach the preset threshold, refinement of the preliminary analysis module continues; when the consistency rate reaches the preset threshold, the analysis module is fixed to obtain the final analysis module.
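The text does not define the consistency rate precisely. One simple reading, used in the sketch below as an assumption, is the share of manually marked (time, place) points that the module reproduces exactly; the threshold value of 0.9 is likewise hypothetical.

```python
def consistency_rate(marked_path, analyzed_path):
    """Share of manually marked (time, place) points also present in the
    module's analyzed path. An empty marked path counts as fully consistent."""
    if not marked_path:
        return 1.0
    analyzed = set(analyzed_path)
    matches = sum(1 for point in marked_path if point in analyzed)
    return matches / len(marked_path)

def is_final(marked_path, analyzed_path, threshold=0.9):
    """True when the module can be fixed as the final analysis module."""
    return consistency_rate(marked_path, analyzed_path) >= threshold
```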
S40: applying the final analysis module to the travel note texts of the travel note samples within the preset time window and the preset destination range to obtain analyzed travel space-time paths, and constructing a travel note time-space database based on the analyzed travel space-time paths.
Preferably, collecting comment data of the tourist attractions, calculating the comment data proportion of all the tourist attractions of the local city in a single tourist attraction, and acquiring the number of tourist visitors in the tourist attractions based on the comment data proportion;
and counting the number of the tourism visiting persons signed in the time-space database and the travel time-space database, and calculating the deviation ratio of the number of the tourism visiting persons based on the comment data.
Specifically, the 'statistical yearbook' or the tourism statistical data issued by the official website of the tourism data are searched, and the number of the tourism receptionists of the target city in the time window is obtained; collecting comment data of tourists on tourist attractions at an authoritative tourist website to obtain the comment data proportion of each tourist attraction in all tourist attractions in a local city; and multiplying the comment data proportion of each tourist attraction by the number of tourist receptions in the city to obtain the number of tourists in a single tourist attraction.
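The proportional allocation described above amounts to simple arithmetic, sketched here for illustration. The function and parameter names are hypothetical; the inputs are assumed to be the city's total tourist receptions in the time window and a per-attraction count of comments.

```python
def estimate_attraction_visitors(city_visitors, comment_counts):
    """Allocate a city's tourist receptions to attractions by comment share.

    city_visitors: total tourists received by the city in the time window.
    comment_counts: mapping of attraction name to number of comments.
    """
    total_comments = sum(comment_counts.values())
    if total_comments == 0:
        raise ValueError("no comment data available")
    return {
        name: city_visitors * count / total_comments
        for name, count in comment_counts.items()
    }
```

For example, a city with 1,000,000 receptions and attractions holding 300 and 100 comments would be assigned 750,000 and 250,000 visitors respectively.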
And summarizing the data in the sign-on time-space database and the travel time-space database in the same time window and the same region range to obtain the number of tourism visitors based on the sign-on time-space database and the travel time-space database, comparing the number of tourism visitors with the number of tourism visitors obtained through the comment data, and calculating the deviation ratio.
Let the number of tourist visitors obtained from the comment data be x_1 and the number of tourist visitors obtained from the check-in time-space database be x_2. The deviation ratio y_1 between the two is given by the formula in Figure BDA0003894852010000081 (formula image not reproduced in this text).
If the deviation ratio y_1 is less than or equal to 10%, the data source of the check-in time-space database is considered saturated. If y_1 is greater than 10%, the data source is not saturated, and more check-in data must be collected until y_1 is less than or equal to 10%.
Let the number of tourist visitors obtained from the travel note time-space database be x_3 and the number of tourist visitors obtained from the comment data be x_1. The deviation ratio y_2 between the two is given by the formula in Figure BDA0003894852010000082 (formula image not reproduced in this text).
If the deviation ratio y_2 is less than or equal to 10%, the data source of the travel note time-space database is considered saturated. If y_2 is greater than 10%, the data source is not saturated, and more travel note data must be collected until y_2 is less than or equal to 10%. Evaluating the saturation and accuracy of the databases before generating the visual analysis establishes the reliability of the individual-level data.
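The saturation test can be sketched in a few lines. Because the formula images are not reproduced in this text, the sketch assumes the deviation ratio is the relative difference |x_k - x_1| / x_1 with the comment-based count x_1 as the reference; that form is an assumption, not the patent's stated formula. Only the 10% limit is taken from the description.

```python
def deviation_ratio(reference, observed):
    """Relative deviation of an observed visitor count from the reference.

    Assumed form: |observed - reference| / reference. The patent's own
    formula is an image that is not available in this text.
    """
    if reference <= 0:
        raise ValueError("reference count must be positive")
    return abs(observed - reference) / reference


def is_saturated(reference, observed, limit=0.10):
    """True when the deviation ratio is within the 10% limit of the text."""
    return deviation_ratio(reference, observed) <= limit
```

The same check would be applied once with the check-in count (x_2) and once with the travel note count (x_3); a database that fails it needs more source data before the visualization step.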
In addition, the consistency between the annotated travel space-time path and the analyzed travel space-time path produced by the preliminary analysis module can be judged by comparing their time deviation and place deviation.
After the data saturation and accuracy evaluation, a structured database of tourists' space-time behavior within the designated time window and the designated region is obtained. The database includes fields such as the user ID, the visiting time, the location of the attraction visited and the visiting order.
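One possible shape for a record in such a structured database is sketched below. The class name, field names and helper function are hypothetical; only the listed kinds of fields (user ID, visiting time, attraction location, visiting order) come from the description.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class VisitRecord:
    user_id: str
    visit_time: datetime
    attraction: str
    lat: float
    lon: float
    visit_order: int


def assign_visit_order(raw_visits):
    """Build records with a per-user visiting order.

    raw_visits: iterable of (user_id, visit_time, attraction, lat, lon).
    Visits are ordered by time within each user, numbered from 1.
    """
    by_user = {}
    for visit in sorted(raw_visits, key=lambda v: (v[0], v[1])):
        by_user.setdefault(visit[0], []).append(visit)
    records = []
    for user_id, visits in by_user.items():
        for order, (_, when, name, lat, lon) in enumerate(visits, start=1):
            records.append(VisitRecord(user_id, when, name, lat, lon, order))
    return records
```

Deriving the visiting order from the timestamps, rather than storing it independently, keeps the two fields from contradicting each other.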
S50: Construct a tourist flow behavior database based on the check-in time-space database and the travel note time-space database, and obtain a visualized tourist space-time behavior path diagram based on the tourist flow behavior database.
Specifically, taking the check-in time-space database and the travel note time-space database as the basis, the Track Intervals To Line tool in the Tracking Analyst Tools toolbox of ArcGIS is used to connect the points into lines, forming a path diagram for a single tourist or for multiple tourists;
a time axis and a base map are then displayed together with the paths, finally forming a visualized tourist space-time behavior path diagram that facilitates subsequent professional analysis.
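The point-to-line step can be illustrated without ArcGIS. The sketch below is a plain-Python stand-in for the idea of connecting each tourist's time-ordered points into line segments; it does not call or reproduce the ArcGIS tool, and the tuple layout and dictionary keys are hypothetical.

```python
from collections import defaultdict


def build_path_segments(visits):
    """Connect each tourist's visits, in time order, into line segments.

    visits: iterable of (user_id, timestamp, lon, lat); timestamps only
    need to be mutually comparable (for example ISO 8601 strings).
    Returns a mapping of user_id to a list of segment dictionaries.
    """
    by_user = defaultdict(list)
    for user_id, timestamp, lon, lat in visits:
        by_user[user_id].append((timestamp, lon, lat))

    segments = {}
    for user_id, points in by_user.items():
        points.sort()  # chronological order within one tourist
        segments[user_id] = [
            {"start": a[0], "end": b[0], "from": (a[1], a[2]), "to": (b[1], b[2])}
            for a, b in zip(points, points[1:])
        ]
    return segments
```

Each segment carries its start and end time, so a time slider or time axis can filter which segments are drawn over the base map.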
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above may be implemented by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, a database or another medium in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, the above division into functional units and modules is given only as an example. In practical applications, the functions may be assigned to different functional units and modules as needed; that is, the internal structure of the system described in this application may be divided into different functional units or modules to perform all or part of the functions described above.

Claims (10)

1. A tourist behavior data extraction method, the method comprising:
acquiring tourist attraction check-in data, and performing structured processing on the tourist attraction check-in data to obtain a check-in time-space database based on the tourist attraction check-in data;
acquiring a first travel note sample from a travel website, marking time information and place information of each travel note text in the first travel note sample to obtain a marked travel space-time path, and forming a preliminary analysis module based on the marking method used for the marked travel space-time path;
acquiring a second travel note sample, running the preliminary analysis module to obtain analyzed travel space-time paths of all travel note texts of the second travel note sample, and refining the preliminary analysis module based on the analyzed travel space-time paths to obtain a final analysis module;
applying the final analysis module to travel note samples within a preset time window and a preset destination range to obtain a travel note time-space database based on the travel notes;
and constructing a tourist flow behavior database based on the check-in time-space database and the travel note time-space database, and obtaining a visualized tourist space-time behavior path diagram based on the tourist flow behavior database.
2. The tourist behavior data extraction method according to claim 1, wherein the step of constructing the tourist flow behavior database based on the check-in time-space database and the travel note time-space database, and obtaining the visualized tourist space-time behavior path diagram based on the tourist flow behavior database, further comprises:
collecting comment data on tourist attractions, calculating the proportion of a single tourist attraction's comment data among the comment data of all tourist attractions in its city, and obtaining a reference number of tourist visitors for the tourist attraction based on the comment data proportion;
and obtaining a first number of tourist visitors based on the check-in time-space database, obtaining a second number of tourist visitors based on the travel note time-space database, and calculating the deviation ratio of each of the first number and the second number of tourist visitors from the reference number of tourist visitors.
3. The tourist behavior data extraction method according to claim 2, wherein after calculating the deviation ratio of each of the first number and the second number of tourist visitors from the reference number of tourist visitors, the method further comprises:
comparing the deviation ratio with a preset deviation ratio, and if the deviation ratio exceeds the preset deviation ratio, further supplementing the check-in time-space database and the travel note time-space database;
and if the deviation ratio is within the preset deviation ratio, obtaining the tourist flow behavior database on the basis of the check-in time-space database and the travel note time-space database.
4. The tourist behavior data extraction method according to claim 1, wherein the step of acquiring the tourist attraction check-in data and performing structured processing on the tourist attraction check-in data to obtain the check-in time-space database based on the tourist attraction check-in data comprises:
acquiring a scenic spot check-in ID for each scenic spot name according to the scenic spot names in the target area, to obtain a list of scenic spot names, scenic spot check-in IDs and scenic spot numbers;
acquiring, within a time window, the check-in data of the user IDs corresponding to all the scenic spot check-in IDs, to obtain an initial user database of the scenic spots, wherein the initial user database comprises the user IDs, check-in times, check-in places and check-in contents;
and acquiring personal information for all user IDs in the initial user database, adding it to the initial user database as an attached table, and obtaining the check-in time-space database of the tourist attraction check-in data.
5. The tourist behavior data extraction method according to claim 4, wherein the step of acquiring the scenic spot check-in ID for each scenic spot name further comprises:
establishing a master check-in ID and a plurality of slave check-in IDs for a scenic spot, aggregating the plurality of slave check-in IDs under the master check-in ID, and using the master check-in ID as the scenic spot check-in ID of that scenic spot.
6. The tourist behavior data extraction method according to claim 4 or 5, wherein the step of acquiring, within the time window, the check-in data of the user IDs corresponding to all the scenic spot check-in IDs further comprises:
if the same user ID corresponds to a plurality of different scenic spot check-in IDs, associating the user ID with the plurality of different scenic spot check-in IDs;
and if the same user ID corresponds to the same scenic spot check-in ID multiple times, performing de-duplication on the repeated scenic spot check-in IDs.
7. The tourist behavior data extraction method according to claim 1, wherein the step of acquiring a first travel note sample from a travel website, marking time information and place information of each travel note text in the first travel note sample to obtain a marked travel space-time path, and forming a preliminary analysis module based on the marking method used for the marked travel space-time path comprises:
marking the time keywords of each travel note text in the first travel note sample according to precise date, precise time, fuzzy time and relative time, placing the time keywords into a time lexicon, and dividing the travel note text into text segments using the precise dates and precise times as dividing points;
identifying the place keywords of each text segment according to precise place, fuzzy place and associated place, and placing the place keywords into a place lexicon;
and constructing the preliminary analysis module based on the extraction methods for the time keywords and the place keywords.
8. The tourist behavior data extraction method according to claim 1, wherein the step of acquiring a second travel note sample, running the preliminary analysis module to obtain analyzed travel space-time paths of all travel note texts of the second travel note sample, and refining the preliminary analysis module based on the analyzed travel space-time paths to obtain a final analysis module comprises:
if the second travel note sample contains travel photos, extracting the time information and the latitude and longitude information from the travel photos, arranging the travel photos in chronological order, and generating an image space-time path of the travel photos.
9. The tourist behavior data extraction method according to claim 1, wherein the step of refining the preliminary analysis module based on the analyzed travel space-time paths to obtain the final analysis module comprises:
extracting a subset of travel note texts from the second travel note sample, marking these travel note texts to obtain marked travel space-time paths, comparing the analyzed travel space-time paths obtained by the preliminary analysis module with the marked travel space-time paths, and refining the preliminary analysis module based on the comparison result to obtain the final analysis module.
10. The tourist behavior data extraction method according to claim 1, wherein the step of obtaining a visualized tourist space-time behavior path diagram based on the tourist flow behavior database comprises:
forming a path diagram of a single tourist or multiple tourists with an ArcGIS software tool based on the tourist flow behavior database, and adding a time axis and a base map to the path diagram to form the visualized tourist space-time behavior path diagram.
Application CN202211270201.4A, priority date 2022-10-18, filing date 2022-10-18: Tourist behavior data extraction method. Status: Active. Granted as CN115577190B (en).


Publications (2)

Publication Number Publication Date
CN115577190A true CN115577190A (en) 2023-01-06
CN115577190B CN115577190B (en) 2023-05-30

Family

ID=84585619

Country Status (1)

Country Link
CN (1) CN115577190B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116821692A (en) * 2023-08-28 2023-09-29 北京化工大学 Method, device and storage medium for constructing descriptive text and space scene sample set

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007241903A (en) * 2006-03-10 2007-09-20 Nagasaki Prefecture Dynamic recording method for tourists
US20120084000A1 (en) * 2010-10-01 2012-04-05 Microsoft Corporation Travel Route Planning Using Geo-Tagged Photographs
CN105550951A (en) * 2015-12-30 2016-05-04 南京邮电大学 Decision assistant system and method of tour travel
WO2016132189A1 (en) * 2015-02-21 2016-08-25 Malekzadeh Mohammadsharif Method for tourism management and quality control
CN106021618A (en) * 2016-07-13 2016-10-12 桂林电子科技大学 System and method for inquiring and managing touring information of scenic spot
CN109086919A (en) * 2018-07-17 2018-12-25 新华三云计算技术有限公司 A kind of sight spot route planning method, device, system and electronic equipment
JP2019023851A (en) * 2017-07-21 2019-02-14 株式会社エヌ・ティ・ティ・アド Data analysis system and data analysis method
CN110544115A (en) * 2019-08-16 2019-12-06 北京慧辰资道资讯股份有限公司 Method and device for analyzing characteristics of tourists from scenic spot tourism big data
CN113609842A (en) * 2021-08-17 2021-11-05 四川轻化工大学 Method for obtaining scenic spot comment data and travel experience evaluation
CN113742481A (en) * 2021-07-14 2021-12-03 安徽师范大学 Social media big data-based travel stream emotion space-time change feature research method
CN115330221A (en) * 2022-08-18 2022-11-11 湖州师范学院 Rural tourism information data analysis feedback system and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
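Shao Jun; Chang Xuesong; Zhao Yamin: "Research on tourist behavior patterns in the Huashan scenic area based on travel note big data", Chinese Landscape Architecture (中国园林) *
Chen Ziwei; Yao Jiansheng: "Research on tourists' spatiotemporal behavior based on tourism digital footprints: a case study of Xuanwu District, Nanjing" *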

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant