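WO2019200752A1 - Point of interest query method, apparatus and computer device based on semantic understanding - Google Patents

Point of interest query method, apparatus and computer device based on semantic understanding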

Info

Publication number
WO2019200752A1
Authority
WO
WIPO (PCT)
Prior art keywords
interest
point
index
topic
query
Prior art date
Application number
PCT/CN2018/095502
Other languages
English (en)
French (fr)
Inventor
王健宗
吴天博
黄章成
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019200752A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/34 Route searching; Route guidance
    • G01C21/36 Input/output arrangements for on-board computers
    • G01C21/3679 Retrieval, searching and output of POI information, e.g. hotels, restaurants, shops, filling stations, parking facilities

Definitions

  • The present application relates to search query technology, and in particular to a method, an apparatus and a computer device for querying points of interest based on semantic understanding.
  • POI: Point of Interest
  • Existing spatial keyword query technology focuses mainly on the spatio-temporal characteristics of POIs and treats keywords mechanically as text characters, with no semantic connection. It fails to understand the specific semantics and connections of user behavior around POIs, so it cannot search accurately according to the user's intention, the recommended content matches the user's search intention poorly, and, lacking an understanding of the user's behavior and search patterns, it cannot go on to recommend information that satisfies the user.
  • the existing POI query technology has low precision and cannot be promoted and used in areas requiring multi-dimensional refinement of information properties, such as the financial field.
  • the main purpose of the present application is to provide a method for querying interest points based on semantic understanding, which aims to solve the technical problem that the existing POI query technology is not applicable to the financial field that requires multi-dimensional refinement of information.
  • the present application proposes a method for querying interest points based on semantic understanding, including:
  • each point of interest includes a description of the information and a geographic location
  • a topic distribution probability is matched to each of the points of interest;
  • the interest point information similar to the query body is filtered according to the index path.
  • the application also provides a point of interest query device based on semantic understanding, comprising:
  • An obtaining module configured to acquire a plurality of points of interest in a specified database in the financial field, where each point of interest includes an information description and a geographic location;
  • a matching module configured to match a topic distribution probability of each of the points of interest according to the information description in each interest point
  • a building module configured to build an index path according to the topic distribution probability and the geographic location
  • a screening module configured to filter, according to the index path, interest point information similar to the query body.
  • the application also provides a computer device comprising a memory and a processor, the memory storing computer readable instructions, the processor implementing the steps of the method described above when the computer readable instructions are executed.
  • the present application also provides a computer non-transitory readable storage medium having stored thereon computer readable instructions that, when executed by a processor, implement the steps of the methods described above.
  • The present application has the following beneficial technical effects. The POI search technology of the present application incorporates a semantic understanding of the user's search, so that the retrieved information is closer to the user's real intention and the match between the search content and the user's search intention improves. Similarity-matching queries on keyword semantics (that is, the topic distribution probability of the keywords) widen the information coverage of the search, which is no longer limited to the surface form of the text characters but extends to the meaning of the content, improving the accuracy of the retrieved information. Constraining the influencing factors of the POI search along multiple dimensions refines the precision of the retrieved information and promotes the application of POI search in the financial field, so as to better serve users in that field with financial information that is more realistic, more detailed and better matched to their needs.
  • FIG. 1 is a schematic flowchart of a method for querying a point of interest based on semantic understanding according to an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a point of interest query device based on semantic understanding according to an embodiment of the present application
  • FIG. 4 is a schematic structural diagram of a building module according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a building unit according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a screening module according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a determining unit according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a screening module according to another embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a screening module according to still another embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a screening module according to still another embodiment of the present application.
  • FIG. 11 is a schematic diagram showing the internal structure of a computer device according to an embodiment of the present application.
  • a point of interest query method based on semantic understanding includes:
  • S1: Acquire multiple points of interest in a specified database in the financial field, where each point of interest includes an information description and a geographic location.
  • In this embodiment, the POIs of the designated financial-field database form a set of time-stamped text descriptions, and each POI is represented by a (loc, words) pair, where loc is the geographic location and words is the POI information description.
  • This embodiment further refines and labels the financial-field database so that, with search engine support, information on specific financial service items can be queried. This overcomes the defect of existing technology that a search engine cannot match a suitable specific financial item.
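The (loc, words) pair described above can be pictured as a small record type. The following is a minimal sketch only: the class name `POI`, the `timestamp` field and the sample values are illustrative assumptions, not taken from the application.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass(frozen=True)
class POI:
    """One point of interest: a (loc, words) pair with a time stamp."""
    loc: Tuple[float, float]  # geographic location, assumed here to be (latitude, longitude)
    words: str                # POI information description
    timestamp: datetime       # hypothetical field for the time stamp mentioned in the text


# Made-up sample record for illustration.
outlet = POI(
    loc=(22.5431, 114.0579),
    words="medical insurance service outlet",
    timestamp=datetime(2018, 7, 12),
)
```

Keeping the record immutable (`frozen=True`) is a design choice for this sketch, so that a POI can be used safely as a key in later index structures.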
  • the POI point of this embodiment includes coordinate information of a geographical location and a POI information description. Since the coordinate information of the geographical location does not have the text description information and does not have the text classification function, the POI information description can be used to classify the POI points.
  • The POI information description is converted into a topic distribution probability; that is, the point-of-interest set of this embodiment is a set of topic distribution probabilities labeled with geographic locations. This captures the intrinsic meaning of the POI information description better, and a similarity measure function based on the topic distribution probability is used to characterize the semantic association between points of interest.
  • The composition of the POI information description in each POI is analyzed first and the central word is extracted; the topic distribution probability is then predicted from the topic of the central word.
  • The topic distribution probabilities of two central words correspond to two points in a high-dimensional space, and the correlation between them is represented by the spatial distance between the two points, where the spatial distance includes the geographic distance.
  • For example, the high-dimensional spatial parameters of two POIs containing the central words "coffee" and "Starbucks" respectively are substituted into the distance formula. If the computed result is below a preset threshold (for example, a threshold of 1), the two POIs are correlated.
  • The central words "coffee" and "Starbucks" show no correlation in their written form, but they are strongly correlated in terms of the semantically informed topic distribution probability.
  • Judging the correlation between the information descriptions of two POIs from the topic distribution probability is therefore more accurate than judging it from the written form alone.
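The "coffee" and "Starbucks" comparison can be illustrated with a small sketch. The distance formula itself does not appear in this text, so the Euclidean distance used below is an assumption, and the three-topic vectors are made-up values chosen only to show the threshold test.

```python
import math

# Hypothetical topic distribution probabilities over three topics
# (for example: food and drink, retail, finance). The values are illustrative.
TOPICS = {
    "coffee": [0.70, 0.20, 0.10],
    "Starbucks": [0.60, 0.30, 0.10],
    "bank": [0.05, 0.05, 0.90],
}


def topic_distance(a, b):
    """Distance between two topic distributions (Euclidean, as an assumption)."""
    return math.dist(a, b)


def is_correlated(word_a, word_b, threshold=1.0):
    """Two central words count as correlated when their distance is below the threshold."""
    return topic_distance(TOPICS[word_a], TOPICS[word_b]) < threshold
```

With these sample vectors, `is_correlated("coffee", "Starbucks")` holds although the two strings share no characters, while `is_correlated("coffee", "bank")` does not.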
  • The index path is constructed according to the above topic distribution probability and the geographic location.
  • When the weight of the topic distribution probability exceeds that of the geographic location, the index path is established as follows. Starting the index from point M, the point M1 whose topic distribution probability is close to that of M is selected, and the point M1* that is geographically closest to M is discarded. With M1 as the reference, the point M2 whose topic distribution probability is similar to that of M1 is searched for. If M2 does not exist (or the geographic similarity between M1 and M2* is far greater than the topic-distribution similarity between M1 and M2), the point M2* geographically nearest to M1 is selected instead, and the search continues for the point M3 whose topic distribution probability is similar to that of M2*. The search continues in this way until point N is found, forming an index path from M to N.
  • When the geographic location is the main consideration, the index path is constructed in a similar way: the geographically closest POI is preferred, and otherwise the next POI is connected through the topic distribution probability, forming an index path with the geographic location as the main consideration.
  • When the weights of the topic distribution probability and the geographic location are both 0.5, topic-distribution similarity and geographic proximity are considered equally; that is, the POIs closest in both topic distribution probability and geographic location are selected to form the index path, so as to better fit the user's search needs.
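The effect of the weight setting on the path from M to N can be sketched with a simple greedy walk. This is a simplified stand-in rather than the procedure of the application: it merges the two criteria into one weighted score, assumes both distances are already on comparable scales, and all names and sample points are hypothetical.

```python
import math


def build_index_path(points, start, goal, w_geo=0.5, w_topic=0.5):
    """Greedy walk from start to goal.

    points maps a name to (loc, topic), where loc is a coordinate pair and
    topic is a topic distribution. At each step the unvisited point with the
    smallest weighted sum of geographic and topic distance is chosen.
    """
    path = [start]
    unvisited = set(points) - {start}
    current = start
    while current != goal and unvisited:
        loc_c, topic_c = points[current]
        current = min(
            sorted(unvisited),  # sorted only to make ties deterministic
            key=lambda name: w_geo * math.dist(loc_c, points[name][0])
            + w_topic * math.dist(topic_c, points[name][1]),
        )
        unvisited.remove(current)
        path.append(current)
    return path


# Made-up sample: A is near M but semantically different, B is far but similar.
SAMPLE = {
    "M": ((0.0, 0.0), (0.90, 0.10)),
    "A": ((0.1, 0.0), (0.10, 0.90)),
    "B": ((1.0, 0.0), (0.85, 0.15)),
    "N": ((1.1, 0.0), (0.80, 0.20)),
}
```

With a topic-heavy weighting (`w_geo=0.1, w_topic=0.9`) the walk skips the nearby but unrelated point A, whereas a location-heavy weighting (`w_geo=0.9, w_topic=0.1`) visits A first.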
  • S4 Filter interest point information similar to the query body from the index path according to a specified rule.
  • The specified rule in this embodiment is chosen according to the user's needs at query time: either points of interest that are geographically close or points of interest with high semantic similarity are selected. Geographic distance is computed from the coordinate information of the geographic locations. For example, if geographically close points of interest are selected, the results lie nearer the geographic location of the query body, but their semantic relevance may not be high.
  • Using the intrinsic meaning of the text description as a reference brings the retrieved information closer to the user's intention. For example, the query description "coffee" and the POI description "Starbucks" are considered relevant in this embodiment because their topic distribution probabilities are similar.
  • the query body of this embodiment is the information to be searched by the user.
  • This application adopts a POI search strategy based on the NIQ-tree, which ensures effective pruning by accurately computing the upper and lower bounds of the solution space.
  • Mbr $D_S(q, p)$ and $\min D_T(q, N)$ represent the theoretical minimum distance from q to N
  • A user-specified parameter weights the geographic distance against the distance between the text descriptions (that is, between the topic distribution probabilities of the keywords)
  • P represents each POI.
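How a lower bound on the distance from q to a node N can drive pruning is sketched below. The combined formula is not present in this text, so the linear combination, the parameter name `lam` and the helper names are assumptions made for illustration.

```python
import math


def combined_distance(d_spatial, d_topic, lam):
    """Weighted sum of spatial and topic distance; lam in [0, 1] is user-specified.

    The linear form is an assumption, not a formula quoted from the application.
    """
    return lam * d_spatial + (1.0 - lam) * d_topic


def min_dist_to_mbr(q, mbr):
    """Smallest Euclidean distance from point q to rectangle (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = mbr
    dx = max(xmin - q[0], 0.0, q[0] - xmax)
    dy = max(ymin - q[1], 0.0, q[1] - ymax)
    return math.hypot(dx, dy)


def can_prune(q_loc, node_mbr, node_min_topic_dist, lam, best_so_far):
    """A node can be skipped when even its best possible score cannot beat best_so_far."""
    lower_bound = combined_distance(
        min_dist_to_mbr(q_loc, node_mbr), node_min_topic_dist, lam
    )
    return lower_bound >= best_so_far
```

Because no point inside the node can score below the lower bound, skipping the node never discards a better result than the current best.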
  • step S2 of the embodiment includes:
  • The first keyword set in this embodiment comprises all keywords of all topics in the financial-field database, including insurance services
  • the POI point is the POI point
  • The second keyword set comprises the keywords of the topic to which the POI corresponds.
  • For example, the text W is the information description 'medical insurance' of a POI
  • The POI described as 'medical insurance' has a different topic distribution probability with respect to each topic.
  • Let the 'medical insurance' POI be point N, and let the topic set include the fund topic $Z_1$.
  • the probability is greater than its probability relative to $Z_1$ or $Z_2$.
  • The larger the calculated value, the smaller the similarity of the topic distribution probabilities.
  • The topic distribution probability of words is $\{p_1, p_2, \ldots, p_n\}$, where P denotes the POIs. That is, the keywords of each POI have different topic distribution probabilities with respect to POIs of different topics, which is used to determine the next connected POI whose topic distribution probability is closest.
  • step S3 includes:
  • S30: Obtain the weight settings for indexing by geographic location and for indexing by topic distribution probability.
  • The weight setting in this step directly affects the search result and can be set freely according to the user's intention.
  • The weight values in this embodiment lie in [0, 1]. For example, if the user sets the geographic location weight to 0.7 and the topic distribution probability weight to 0.3, the final results will be POIs closer to the geographic location of the query body, but their text similarity may be low and they may not match the user's search intention. With the weights reversed, the opposite holds, which is not repeated here. With the geographic location weight at 0.5 and the topic distribution probability weight at 0.5, the two are balanced, and points of interest that are both geographically close and in line with the user's intention are retrieved.
  • In this step, different weight settings produce different index paths. For example, if the geographic location weight is large, the index nodes are accessed from the retrieval root node in order of geographic proximity.
  • step S31 includes:
  • S311 Organize all points of interest of the designated database in the above financial field at the geospatial level according to the geographical similarity.
  • the fast retrieval of the POI point in the embodiment of the present application depends on the effective data index.
  • The data index in this embodiment differs from traditional POI indexing methods: it is a hierarchical index structure that combines the geographic location with the probability distribution of the text semantics, so that search pruning can proceed along different dimensions.
  • The three-layer coordinated indexing mechanism over geographic location, topic distribution probability and text keywords, based on IDistance (a big data classification method), is defined as the NIQ-tree index structure, where NIQ combines the initials of N-Gram, IDistance and Quadtree.
  • The quadtree index of this embodiment is a tree structure that recursively divides the geospatial layer into levels. The space is divided into four equal subspaces, and so on recursively, until the tree reaches a certain depth or meets a stopping requirement.
  • The quadtree of this embodiment has a simple structure: geographic locations are all stored in the leaf nodes, while the intermediate nodes and the root node store none. When the data in the geospatial layer are distributed fairly uniformly, the quadtree offers relatively high efficiency for inserting and querying spatial data.
  • The upper right is the first quadrant, numbered 0
  • The upper left is the second quadrant, numbered 1
  • The lower left is the third quadrant, numbered 2
  • The lower right is the fourth quadrant, numbered 3.
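The quadrant numbering above can be expressed as a small helper that decides which child of a quadtree node a location falls into. This is a sketch: the function name is hypothetical, and assigning points that lie exactly on a dividing line to the right or upper side is an assumption.

```python
def quadrant(center, point):
    """Return the child quadrant number (0 to 3) of point relative to center.

    0 = upper right, 1 = upper left, 2 = lower left, 3 = lower right.
    Points on a dividing line go to the right or upper side (an assumption).
    """
    cx, cy = center
    x, y = point
    if x >= cx and y >= cy:
        return 0
    if x < cx and y >= cy:
        return 1
    if x < cx and y < cy:
        return 2
    return 3
```

A quadtree insert would call this helper at each level to choose the child node until a leaf is reached.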
  • Spatial data are approximated by their MBR (Minimum Bounding Rectangle)
  • The quadtree node is the main component of the quadtree structure. It mainly stores the geographic location identification numbers and the MBR, and it is also the main object on which the quadtree algorithm operates.
  • In the quadtree node structure, the MBR is the minimum bounding rectangle of the region corresponding to the node, and the MBR of a node in an upper layer contains the MBRs of the nodes in the layer below.
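The containment property of MBRs across layers can be checked with a few lines. The rectangle layout `(xmin, ymin, xmax, ymax)` and the helper names are assumptions chosen for this sketch.

```python
def mbr_of_points(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_union(rects):
    """Smallest rectangle that contains every rectangle in rects."""
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


def mbr_contains(outer, inner):
    """True when inner lies entirely inside outer."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


child_a = mbr_of_points([(0, 0), (1, 2)])
child_b = mbr_of_points([(3, 1), (4, 5)])
parent = mbr_union([child_a, child_b])
```

Here `parent` plays the role of the upper-layer node, and `mbr_contains(parent, child_a)` and `mbr_contains(parent, child_b)` both hold.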
  • The quadtree of this embodiment keeps the geographic location index consistent with the location data stored in files or in the database. It also guards against the following problem: when the geographic distribution is uneven, continued insertion of locations keeps deepening the quadtree hierarchy, producing a severely unbalanced quadtree, greatly increasing the depth of each query and sharply reducing query efficiency.
  • S312: Refine the points of interest in the topic layer according to the similarity of their topic distribution probabilities.
  • The NIQ-tree further subdivides the POIs in each MBR at the topic layer.
  • A polygon-oriented spatial clustering algorithm first obtains the minimum bounding rectangle of each polygon and then performs spatial clustering on those rectangles.
  • MBR stands for minimum bounding rectangle, also called the minimum containing rectangle or minimum enclosing rectangle.
  • The topic layer refines the POIs within each MBR to improve the accuracy of search matching.
  • S313 Establish a high-dimensional index path in the geospatial layer and the topic layer by IDistance according to each interest point refined by the topic layer.
  • IDistance is used to build a high-dimensional index structure for fast and efficient retrieval.
  • The IDistance of this embodiment classifies all POIs of the specified financial database, records the information of each class, and writes all class information to a file. According to the weights of the POIs of the specified financial database, a high-dimensional B+ tree (a multiway search tree, not a binary tree) is then built over the geospatial layer and the topic layer, and the necessary B+ tree information is stored. After the user inputs a reference point, its neighboring points are searched in the B+ tree, and the similarity between the search results and the reference point is analyzed by comparison with linear search results.
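The general iDistance idea is to assign every point to a class (partition) with a reference point and to index it under the one-dimensional key `i * c + dist(point, reference_i)`. The sketch below illustrates that mapping only; it is not the implementation of the application. A sorted Python list searched with `bisect` stands in for the B+ tree, and the class name, the constant `c` and the sample data are assumptions.

```python
import bisect
import math


class IDistanceSketch:
    """One-dimensional iDistance-style index over points of any dimension."""

    def __init__(self, reference_points, c):
        self.refs = reference_points
        self.c = c          # must exceed the largest point-to-reference distance
        self.keys = []      # sorted one-dimensional keys (stand-in for a B+ tree)
        self.points = []    # points stored in the same order as keys

    def insert(self, p):
        # Assign p to its nearest reference point, then map it to a scalar key.
        i = min(range(len(self.refs)), key=lambda j: math.dist(p, self.refs[j]))
        key = i * self.c + math.dist(p, self.refs[i])
        pos = bisect.bisect_left(self.keys, key)
        self.keys.insert(pos, key)
        self.points.insert(pos, p)

    def range_query(self, q, r):
        """Return all stored points within distance r of q."""
        found = []
        for i, ref in enumerate(self.refs):
            d = math.dist(q, ref)
            # By the triangle inequality, matches in partition i have keys in this band.
            lo = i * self.c + max(d - r, 0.0)
            hi = i * self.c + min(d + r, self.c)
            start = bisect.bisect_left(self.keys, lo)
            stop = bisect.bisect_right(self.keys, hi)
            for p in self.points[start:stop]:
                if math.dist(p, q) <= r and p not in found:
                    found.append(p)
        return sorted(found)


index = IDistanceSketch(reference_points=[(0.0, 0.0), (10.0, 10.0)], c=100.0)
for sample in [(1.0, 1.0), (2.0, 2.0), (9.0, 9.0), (10.0, 11.0), (5.0, 5.0)]:
    index.insert(sample)
```

Each range query scans only a narrow key band per partition and then verifies the candidates with the true distance, which is the same filter-and-refine pattern a B+ tree range scan would follow.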
  • step S312 of the embodiment the method includes:
  • S310: Build an N-Gram-based summary (thumbnail) of the topic layer in the text layer to refine the points of interest.
  • the text layer in this embodiment is also an important component of the NIQ-tree index structure in this embodiment.
  • the three-layer index structure is used for fast pruning.
  • The topic layer is further refined in the text layer: a summary of the topic layer is built in the text layer based on N-Grams. That is, topics with similar text are first grouped together and then classified by topic distribution probability, which amounts to dividing a large set of topic distribution probabilities into small subsets.
  • The summary layer is built in the text layer only to refine the topic layer further.
  • If the text layer is omitted and only the two-layer structure of the topic layer and the geospatial layer is retained, POI indexing can still be achieved.
  • The edit distance between two strings can be determined by the Needleman-Wunsch algorithm (a global sequence alignment algorithm) or the Smith-Waterman algorithm (a local sequence alignment algorithm). This embodiment defines the edit distance between two strings as an N-Gram distance.
  • The N-Grams of a string s are the segments obtained by splitting the original word into pieces of length N, that is, all substrings of length N in s.
  • The N-Gram distance between two strings can be defined from the number of substrings they share.
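An N-Gram distance based on shared substrings can be sketched as follows. The exact definition is not given in this text; the form used here, the number of distinct N-Grams of each string minus twice the number they share, is one common choice and is an assumption.

```python
def ngrams(s, n=2):
    """All substrings of length n in s, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_distance(s, t, n=2):
    """Distance from shared N-Grams: |G(s)| + |G(t)| - 2 * |G(s) & G(t)|.

    G(x) is the set of distinct N-Grams of x. Identical strings give 0, and
    strings with no N-Gram in common give the sum of their set sizes.
    """
    gs, gt = set(ngrams(s, n)), set(ngrams(t, n))
    return len(gs) + len(gt) - 2 * len(gs & gt)
```

For example, "bank" and "banks" differ by a single bigram, while "bank" and "fund" share none, so the second pair is much further apart under this measure.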
  • the index path includes an index node
  • the step S4 may specifically include:
  • the query body entered by the user includes a geographic location and a search text keyword.
  • The minimum matching distance in this step is expressed by the Euclidean distance $D_s$ between the query body q and the reference POI o, normalized to [0, 1].
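A normalized Euclidean distance between the query body q and a reference POI o can be sketched as below. The normalization formula is missing from this text, so dividing by a maximum distance `d_max` (for example, the diagonal of the data space) is an assumption, as are the names used.

```python
import math


def normalized_euclidean(q, o, d_max):
    """Euclidean distance between q and o, scaled into [0, 1] by d_max.

    d_max is assumed to be the largest distance possible in the data space;
    values beyond it are clipped to 1.0.
    """
    return min(math.dist(q, o) / d_max, 1.0)
```

Scaling into [0, 1] puts the spatial distance on the same range as a distance between topic distributions, so the two can be weighted against each other.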
  • Other embodiments of the present application may also express the semantic relevance of two texts by a cosine distance, a Mahalanobis distance, a Bhattacharyya distance, and the like.
  • S42: Determine whether the correlation between the index node and the query body is within a threshold condition.
  • the index path of this embodiment is formed by connecting a plurality of index nodes.
  • step S42 includes:
  • S420: Determine whether the index node is geographically close to the query body and/or whether the similarity between the topic distribution probabilities of the index node and the query body is within a preset range.
  • The similarity between the topic distribution probabilities of the index node and the query body in this step is expressed by a formula in which $TD_W$ represents the topic distribution probability corresponding to the keywords in the POI and $|TD_W|$ is the modulus of $TD_W$.
  • The preset range for geographic proximity in this embodiment is less than 500 m.
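The threshold check of step S420 can be sketched as follows. The similarity formula is not present in this text; cosine similarity is used here as an assumption suggested by the reference to the modulus of the topic vector. The haversine distance, the similarity threshold `min_sim` and the sample coordinates are also illustrative assumptions; only the 500 m limit comes from the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two topic distribution vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a sphere of radius 6,371 km."""
    radius = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def within_threshold(node, query, max_m=500.0, min_sim=0.8, require_both=False):
    """node and query are (lat, lon, topic_vector) triples.

    require_both=False models the 'or' reading of the and/or condition,
    require_both=True the 'and' reading.
    """
    geo_ok = haversine_m(node[0], node[1], query[0], query[1]) < max_m
    sim_ok = cosine_similarity(node[2], query[2]) >= min_sim
    return (geo_ok and sim_ok) if require_both else (geo_ok or sim_ok)
```

A node roughly 100 m from the query with an unrelated topic vector passes the "or" reading but fails the "and" reading.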
  • step S4 specifically includes:
  • S44 Receive a query body of a financial data class of a specified object input by a user.
  • This embodiment is a specific application scenario of semantics-based POI search technology in the financial field, aimed at obtaining more detailed and more valuable financial data.
  • The specified object of this embodiment includes all the companies and groups involved in the financial database. The query body of the financial data category includes database data related to the market and to operations, including information descriptions of the geographic location and of the financial data category, for example a query for a particular kind of financial service outlet nearby. By modeling financial institution profiles (name, service objects, main business, ...), a dedicated query and recommendation system for financial outlets is established, making big data search technology better suited to the financial services industry.
  • S45 Retrieve the financial data with similar semantics in the specified database according to the information description carried in the query body.
  • step S45 the method includes:
  • The risk estimation level is estimated by computing operational and market-related data of the designated object, such as market credit, debt ratio, marketing field evaluation and market prospect evaluation, which helps banks or investors reduce investment risk.
  • The investment risk estimation model of this embodiment is obtained by training a convolutional neural network on risk data samples.
  • step S46 the method includes:
  • the credit level assessment is formed.
  • Because the information retrieved by semantically informed POI search is more comprehensive, the risk estimation level and the industry analysis data are more reliable and the credit rating assessment has more reference value, which benefits financial institutions such as banks.
  • The POI search technology in the embodiments of the present application incorporates a semantic understanding of the user's search, so that the retrieved information is closer to the user's true intention and the match between the search content and the user's search intention improves. Similarity-matching queries on keyword semantics (that is, the topic distribution probability of the keywords) widen the information coverage of the search, which is no longer limited to the surface form of the text characters but extends to the meaning of the content, improving the accuracy of the retrieved information. Constraining the influencing factors of the POI search along multiple dimensions refines the precision of the retrieved information and promotes the application of POI search in the financial field, so as to better serve users in that field with financial information that better meets their needs.
  • a point of interest query device based on semantic understanding includes:
  • the obtaining module 1 is configured to acquire a plurality of points of interest in a specified database in the financial field, where the points of interest include information descriptions and geographic locations.
  • In this embodiment, the POIs of the designated financial-field database form a set of time-stamped text descriptions, and each POI is represented by a (loc, words) pair, where loc is the geographic location and words is the POI information description.
  • This embodiment further refines and labels the financial-field database so that, with search engine support, information on specific financial service items can be queried. This overcomes the defect of existing technology that a search engine cannot match a suitable specific financial item.
  • the matching module 2 is configured to match the topic distribution probability to each interest point in the specified database in the financial domain according to the information description in each interest point.
  • the POI point of this embodiment includes coordinate information of a geographical location and a POI information description. Since the coordinate information of the geographical location does not have the text description information and does not have the text classification function, the POI information description can be used to classify the POI points.
  • the POI information description is converted into a topic distribution probability, that is, the interest point set in this embodiment is a series of topic distribution probability sets with geographical location tags, so that the intrinsic meaning of the POI information description can be better understood.
  • the semantic association between points of interest is characterized by a similarity measure function based on the topic distribution probability.
  • The composition of the POI information description in each POI is analyzed first and the central word is extracted; the topic distribution probability is then predicted from the topic of the central word.
  • The topic distribution probabilities of two central words correspond to two points in a high-dimensional space, and the correlation between them is represented by the spatial distance between the two points, where the spatial distance includes the geographic distance.
  • For example, the high-dimensional spatial parameters of two POIs containing the central words "coffee" and "Starbucks" respectively are substituted into the distance formula. If the computed result is below a preset threshold (for example, a threshold of 1), the two POIs are correlated.
  • The central words "coffee" and "Starbucks" show no correlation in their written form, but they are strongly correlated in terms of the semantically informed topic distribution probability. Judging the correlation between the information descriptions of two POIs from the topic distribution probability is therefore more accurate than judging it from the written form alone.
  • The building module 3 is configured to construct an index path according to the above topic distribution probability and the geographic location.
  • The index path differs according to the user's weight setting. For example, when establishing an index path from point M to point N, if the weight of the topic distribution probability is greater than that of the geographic location, the index path takes the correlation of the topic distribution probabilities of two POIs as the main consideration; that is, the POI with the closest topic distribution probability is preferred. When no such POI exists, or when the geographic similarity of a candidate next POI is far greater than its topic-distribution similarity, the geographically closest POI is taken as the next POI in the index path.
  • The index path under this weight setting is established as follows. Starting the index from point M, the point M1 whose topic distribution probability is close to that of M is selected, and the point M1* that is geographically closest to M is discarded. With M1 as the reference, the point M2 whose topic distribution probability is similar to that of M1 is searched for. If M2 does not exist (or the geographic similarity between M1 and M2* is far greater than the topic-distribution similarity between M1 and M2), the point M2* geographically nearest to M1 is selected instead, and the search continues for the point M3 whose topic distribution probability is similar to that of M2*. The search continues in this way until point N is found, forming an index path from M to N.
  • When the geographic location is the main consideration, the index path is constructed in a similar way: the geographically closest POI is preferred, and otherwise the next POI is connected through the topic distribution probability, forming an index path with the geographic location as the main consideration.
  • When the weights of the topic distribution probability and the geographic location are both 0.5, topic-distribution similarity and geographic proximity are considered equally; that is, the POIs closest in both topic distribution probability and geographic location are selected to form the index path, so as to better fit the user's search needs.
  • the screening module 4 is configured to filter the interest point information similar to the query subject according to the index path.
  • The specified rule in this embodiment is chosen according to the user's needs at query time: either points of interest that are spatially close or points of interest with high text similarity are selected, and the geographic distance is calculated from the coordinate information of the geographic locations. For example, if geographically close points of interest are selected, the search results lie close to the query subject's location, but their text similarity may not be high.
  • This embodiment brings the retrieved information closer to the user's intention by using the intrinsic meaning of the text description as a reference quantity. For example, the query description "coffee" and the POI description "Starbucks" are considered relevant because of their similar topic distribution probabilities.
  • the query subject of this embodiment is the information to be searched that the user inputs.
  • This embodiment adopts an NIQ-tree based POI search strategy, which ensures effective pruning by accurately calculating the upper and lower bounds of the solution space. The search starts from the root node of the NIQ-tree and visits index nodes in ascending order of matching distance to the query (implemented with a priority queue).
  • The best match distance between an index node $N$ and a query $q$ is $D_{bm}(q,N) = \lambda \times \min_{p \in N.mbr} D_S(q,p) + (1-\lambda) \times \min D_T(q,N)$, where $\min_{p \in N.mbr} D_S(q,p)$ and $\min D_T(q,N)$ represent the theoretical minimum distances from $q$ to $N$.
  • $\lambda$ is a user-specified parameter representing the weight between geographic distance and the similarity of the textual information description (i.e. the topic distribution probability of the keywords), and $p$ denotes each POI point.
  • the foregoing matching module 2 includes:
  • the statistic unit 21 is configured to collect a first keyword set in the specified database and a second keyword set in each interest point topic.
  • the calculating unit 12 is configured to calculate a topic distribution probability of the second keyword set with respect to the first keyword set.
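  • For example, the text W is the information description 'Medical Insurance' of a POI point, V is the set of all keywords in the financial database that includes insurance business (the first keyword set), and $Z = \{z_1, z_2, \ldots, z_n\}$ is the topic set formed by the topics of that database.
  • The number of keywords differs per topic, so the POI point described as 'Medical Insurance' has a different topic distribution probability for each topic in the topic set.
  • Let the 'Medical Insurance' POI point be point N, and let the topic set include the fund topic $Z_1$, the stock topic $Z_2$, and so on.
  • Point N is based on the keywords '保' and '险' (so the second keyword set has 2 entries) and belongs to the insurance topic $Z_3$, so its computed topic distribution probability with respect to $Z_3$ is greater than that with respect to $Z_1$ or $Z_2$.
  • the larger the calculated value, the smaller the similarity of the topic distribution probabilities.
  • The calculation yields, for the keywords of every POI point in the dataset, the topic distribution probability $\beta_{words} = \{p_1, p_2, \ldots, p_n\}$, where $n = |Z|$ and P denotes the POI points. The keywords of each POI point thus have different topic distribution probabilities with respect to different topics, which is used to determine the next connected POI point whose topic distribution probability is closest.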
  • the foregoing building module 3 includes:
  • the obtaining unit 30 is configured to obtain the weight settings for indexing by geographic location and indexing by topic distribution probability.
  • the weight setting of this embodiment directly affects the search result and can be set freely according to the user's intention.
  • the weight values of this embodiment lie in [0, 1]. For example, if the user sets the geographic location weight to 0.7 and the topic distribution probability weight to 0.3, the final search results will be POI points geographically close to the query subject, whose text similarity may not be high and which may not match the user's search intention well.
  • Conversely, with the weights reversed the result is the opposite. If both weights are 0.5, both factors carry substantial weight, and the search retrieves points of interest that are both geographically close and in line with the user's intention.
  • the building unit 31 is configured to construct the index path according to the weight setting.
  • Different weight settings produce different index paths. For example, if the geographic location weight is large, the indexed points of interest are visited from the retrieval root node in order of geographic proximity.
  • the foregoing index path is a collaborative index path over geographic location and topic distribution probability.
  • the foregoing building unit 31 includes:
  • the organization sub-unit 311 is configured to organize all points of interest of the designated database in the financial field at the geospatial layer according to geographic similarity.
  • the fast retrieval of POI points in the embodiment of the present application depends on an effective data index.
  • the data index in this embodiment differs from traditional POI indexing: it is a hierarchical index structure that combines geographic location with the topic distribution probability of the text semantics, so that search pruning can be performed along different dimensions.
  • This embodiment defines the IDistance-based indexing mechanism, in which geographic location, topic distribution probability and text keywords cooperate across three layers, as the NIQ-tree index structure.
  • At the geospatial layer, all POI points are organized by a Quadtree according to geographic similarity. The geospatial layer sits at the top of the NIQ-tree index structure because its data is two-dimensional, so pruning there is much faster than in the high-dimensional topic layer.
  • the quadtree index of this embodiment is a tree structure that recursively divides the geospatial layer into levels: for example, the space is divided into four equal subspaces, and so on recursively, until the tree reaches a certain depth or some other stopping requirement is met.
  • the quadtree of this embodiment has a simple structure: geographic locations are stored only in leaf nodes, not in intermediate nodes or the root node, and when the geospatial layer data is fairly evenly distributed, insertion and query of spatial data are relatively efficient.
  • In another embodiment, index numbers are defined for the four sub-regions of a planar region: the upper right is the first quadrant 0
  • the upper left is the second quadrant 1
  • the lower left is the third quadrant 2
  • the lower right is the fourth quadrant 3.
  • The geographic location data structure uses the MBR (minimum bounding rectangle) to approximate the spatial data of a geographic location.
  • the quadtree node is the main component of the quadtree structure and the main object that the quadtree algorithm operates on; it mainly stores the identification number of the geographic location and the MBR.
  • In the quadtree node structure, the MBR is the minimum bounding rectangle of the corresponding region, and the minimum bounding rectangle of a node in the upper layer contains the minimum bounding rectangles of the next layer. The information of a geographic location is stored in the smallest rectangle node that completely contains it, not in that node's parent, so each geographic location is stored in the tree only once and storage space is not wasted.
  • a full quadtree is generated first, which avoids reallocating memory when geographic locations are inserted and speeds up insertion; finally the memory occupied by empty nodes is released.
  • the quadtree of this embodiment keeps the geographic location index consistent with the geographic location data stored in files or databases. It also avoids the situation in which, under an uneven geographic distribution, continued insertion keeps deepening the tree into a severely unbalanced quadtree, greatly increasing the depth of each query and sharply reducing query efficiency.
  • the refinement sub-unit 312 is configured to refine each interest point in the topic layer according to the similarity of the distribution probability of each interest point topic.
  • For each leaf node of the spatial-layer Quadtree, the NIQ-tree further subdivides the POI points within the MBR (Minimum Bounding Rectangle) at the topic layer.
  • the polygon-oriented spatial clustering algorithm in the spatial-layer Quadtree first obtains the minimum bounding rectangle of the polygon and then performs spatial clustering according to it.
  • MBR stands for minimum bounding rectangle, also called the minimum containing rectangle or minimum enclosing rectangle.
  • the topic layer is further refined by refining the POI points within the MBR, improving the accuracy of search matching.
  • the establishing sub-unit 313 is configured to establish a high-dimensional index path in the foregoing geospatial layer and the topic layer by IDistance according to each interest point refined by the topic layer.
  • Finally, IDistance is used to build a high-dimensional index structure for fast and efficient retrieval.
  • the IDistance of this embodiment can classify all POI points of the designated financial database, record the information of each class, and write all class information to a file. A high-dimensional B+tree is then constructed over the above geospatial layer and topic layer according to the weights of the POI points, and the necessary information of the B+tree is stored. After the user inputs a reference point, neighbouring points are searched in the B+tree, and the similarity between the search results and the reference point is analysed by linear comparison of the results.
  • building unit 31 further includes:
  • the construction sub-unit 310 is configured to build a sketch of the topic layer in the text layer based on N-Gram, so as to refine the points of interest.
  • the text layer is also an important component of the NIQ-tree index structure in this embodiment.
  • The geospatial layer, the topic layer and the text layer form an interwoven index structure, and the three-layer index structure is used for fast pruning, further improving retrieval efficiency.
  • the topic layer is further refined in the text layer: a sketch of the topic layer is built in the text layer based on N-Gram, that is, topics with similar text are first grouped together and then classified by topic distribution probability, which amounts to dividing the large set of topic distribution probabilities into small subsets.
  • Building the sketch in the text layer serves only to further refine the topic layer.
  • To simplify indexing, other embodiments may omit the text layer and keep only the two-layer structure of topic layer and geospatial layer; POI point indexing can still be achieved.
  • the edit distance between two strings can be determined by the Needleman-Wunsch algorithm or the Smith-Waterman algorithm.
  • This embodiment defines the edit distance between two strings as the N-Gram distance.
  • the N-Gram of a string s denotes the segments obtained by cutting the original word into pieces of length N, that is, all substrings of length N in s.
  • Given two strings, the N-Gram distance can be defined from the number of N-Grams they share. However, if the difference in length between the two strings is ignored, counting only the common substrings is clearly insufficient.
  • For example, the strings girl and girlfriend share as many common substrings as girl shares with itself, yet girl and girlfriend cannot be regarded as an exact match.
  • This embodiment therefore defines the N-Gram distance on the basis of non-repeating N-Gram segments, expressed as $|G_N(s)| + |G_N(t)| - 2 \times |G_N(s) \cap G_N(t)|$, where $G_N(s)$ is the N-Gram set of string $s$ and $N$ is 2 or 3.
  • the index path of the embodiment of the present application includes an index node
  • the screening module 4 includes:
  • the first receiving unit 40 is configured to receive a query body input by the user.
  • the query subject input by the user received by the first receiving unit 40 includes a geographic location and a search text keyword.
  • the query unit 41 is configured to sequentially access and query the index node having the smallest matching distance from the root node of the NIQ-tree.
  • the minimum matching distance in this embodiment is represented by the Euclidean distance $D_S$, normalized to $[0,1]$, where $q$ is the query subject and $o$ is the reference POI point.
  • For example, the smaller the Euclidean distance between the topic distribution probabilities of two text descriptions, the higher the semantic relevance of the two texts.
  • Other embodiments of the present application may also express the semantic relevance of two texts by the cosine distance, the Mahalanobis distance, the Bhattacharyya distance, or the like.
  • the index path of the embodiment is formed by connecting a plurality of index nodes, and the determining unit 42 determines whether the relevance between the geographic location and/or textual information description of the index node and that of the query subject is within the required threshold.
  • For example, the threshold is 85% or more.
  • The relevance is expressed as $D(q,o) = \lambda \times D_S(q,o) + (1-\lambda) \times D_T(q,o)$, where $\lambda \in [0,1]$ is a user-specified parameter representing the weight between geographic location and the similarity of the textual information description (i.e. topic distribution probability).
  • the calling unit 43 is configured to: if the correlation between the index node and the query body is within a threshold condition, call the information data of the index node as the interest point information similar to the query body.
  • the determining unit 42 includes:
  • In the similarity between the topic distribution probabilities of the index node and the query subject, $TD_W$ represents the topic distribution probability corresponding to the keywords in the POI point, and $\|TD_W\|$ is the modulus of $TD_W$.
  • For example, the preset range of geographic proximity in this embodiment is less than 500 m.
  • the determining sub-unit 421 is configured to determine that the correlation between the index node and the query subject is within the threshold condition if the value is within the preset range, and that it is not within the threshold condition otherwise.
  • the screening module 4 includes:
  • the second receiving unit 44 is configured to receive a query subject of the financial data class for a specified object, input by the user.
  • This embodiment is a specific application scenario of semantic-based POI search technology in the financial field, aimed at obtaining more detailed and more valuable financial data.
  • the specified objects of this embodiment include all companies and groups covered by the financial database. A query subject of the financial data class covers database data related to markets and operations, and includes a geographic location and an information description of the financial data class, for example a query for a particular kind of financial service point nearby. By modelling profiles of financial institutions (name, service objects, main business, ...), a dedicated financial site query and recommendation system is established, making big-data search technology better suited to the financial services industry.
  • the retrieving unit 45 is configured to retrieve the financial data with similar semantics in the specified database according to the information description carried in the query body.
  • the screening module 4 in another embodiment of the present application includes:
  • the input and output unit 46 is configured to input financial data into the investment risk estimation model to output a risk estimation level of the specified object.
  • the risk estimation level is estimated from the retrieved operational and market-related data of the specified object, such as market credit, debt ratio, marketing field evaluation and market prospect evaluation, which helps banks and investors reduce investment risk.
  • the investment risk estimation model of this embodiment is obtained by training a risk data sample into a convolutional neural network.
  • a screening module 4 in another embodiment of the present application includes:
  • the forming unit 47 is configured to form a credit fund rating evaluation database according to the risk estimation level and the industry analysis data.
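  • A credit rating assessment is formed according to the risk estimation level and the industry analysis data.
  • Because the information retrieved by semantically informed POI search is more comprehensive, the risk estimation level and industry analysis data are more reliable and the credit rating assessment has more reference value. This helps banks and other financial institutions build a more comprehensive data warehouse and form the credit fund rating evaluation database, so as to formulate more practical market strategies.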
  • the matching unit 48 is configured to match the credit resource according to the credit fund rating database.
  • a high level of credit funding matches high credit standards.
  • a customer with a high level of credit funds is classified as a superior customer for tracking.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 11.
  • the computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for running the operating system and the computer readable instructions stored in the non-volatile storage medium.
  • the database of the computer device is used to store data such as interest point queries based on semantic understanding.
  • An embodiment of the present application also provides a computer non-volatile readable storage medium having stored thereon computer readable instructions that, when executed, perform the processes of the embodiments of the methods described above.
  • the above description is only the preferred embodiment of the present application and is not intended to limit its patent scope. Any equivalent structure or equivalent process transformation made using the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Remote Sensing (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

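This application discloses a point-of-interest query method based on semantic understanding, comprising: acquiring a plurality of points of interest from a designated database in the financial field; matching a topic distribution probability to each point of interest according to the information description in that point of interest; constructing an index path according to the topic distribution probabilities and geographic locations; and filtering, according to the index path, the point-of-interest information similar to the query subject. The POI search technique of this application incorporates an understanding of search semantics, improving the match between search content and search intent.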

Description

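Point-of-interest query method, apparatus and computer device based on semantic understanding
This application claims priority to Chinese patent application No. 2018103452526, filed with the Chinese Patent Office on April 17, 2018 and entitled "Point-of-interest query method, apparatus and computer device based on semantic understanding", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to search and query technology, and in particular to a point-of-interest query method, apparatus and computer device based on semantic understanding.
Background
POI (Point of Interest) search technology is constrained by the development of spatial keyword query technology. Existing spatial keyword query techniques focus mainly on the spatio-temporal characteristics of POIs and carry no semantic connection: they mechanically treat keywords as text characters and fail to understand the specific semantics and relationships of user behaviour in POIs. As a result they cannot search accurately according to the user's intention, the recommended content matches the user's search intent poorly, the user's behaviour and search patterns are not understood, and no further satisfying information can be recommended. In addition, existing POI query techniques retrieve information with low precision and cannot be deployed in fields that require multi-dimensional, fine-grained information, such as the financial field.
Technical Problem
The main purpose of this application is to provide a point-of-interest query method based on semantic understanding, aiming to solve the technical problem that existing POI query techniques are unsuitable for the financial field, which requires multi-dimensional, fine-grained information.
Technical Solution
This application proposes a point-of-interest query method based on semantic understanding, comprising:
acquiring a plurality of points of interest from a designated database in the financial field, where each point of interest includes an information description and a geographic location;
matching a topic distribution probability to each point of interest according to the information description in that point of interest;
constructing an index path according to the topic distribution probabilities and the geographic locations;
filtering, according to the index path, the point-of-interest information similar to the query subject.
This application further provides a point-of-interest query apparatus based on semantic understanding, comprising:
an acquisition module, configured to acquire a plurality of points of interest from a designated database in the financial field, where each point of interest includes an information description and a geographic location;
a matching module, configured to match a topic distribution probability to each point of interest according to the information description in that point of interest;
a building module, configured to construct an index path according to the topic distribution probabilities and the geographic locations;
a screening module, configured to filter, according to the index path, the point-of-interest information similar to the query subject.
This application further provides a computer device comprising a memory and a processor, the memory storing computer-readable instructions, and the processor implementing the steps of the above method when executing the computer-readable instructions.
This application further provides a non-volatile computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the above method.
Beneficial Effects
Beneficial technical effects of this application: the POI search technique incorporates an understanding of the user's search semantics, so that the retrieved information is closer to the user's true intention and the match between search content and search intent is improved. By matching queries on keyword semantic similarity (i.e. the topic distribution probability of the keywords), the coverage of the retrieved information increases: it is no longer limited to the literal form of the text characters but extends to content related in meaning, improving retrieval precision. By constraining the influencing factors of POI search along multiple dimensions, the precision of the retrieved information is refined, advancing the application of POI search in the financial field so as to serve users better there and provide financial information that is more authentic, more detailed and better matched to user needs.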
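Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a point-of-interest query method based on semantic understanding according to an embodiment of this application;
FIG. 2 is a schematic structural diagram of a point-of-interest query apparatus based on semantic understanding according to an embodiment of this application;
FIG. 3 is a schematic structural diagram of the matching module according to an embodiment of this application;
FIG. 4 is a schematic structural diagram of the building module according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of the building unit according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of the screening module according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of the determining unit according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of the screening module according to another embodiment of this application;
FIG. 9 is a schematic structural diagram of the screening module according to a further embodiment of this application;
FIG. 10 is a schematic structural diagram of the screening module according to yet another embodiment of this application;
FIG. 11 is a schematic diagram of the internal structure of a computer device according to an embodiment of this application.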
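Best Mode for Carrying Out the Invention
Referring to FIG. 1, a point-of-interest query method based on semantic understanding according to an embodiment of this application comprises:
S1: Acquire a plurality of points of interest from a designated database in the financial field, where each point of interest includes an information description and a geographic location.
In this embodiment, the points of interest (POIs) of the designated financial database form a set of text descriptions with time tags. Each POI point is represented by a two-tuple (loc, words), where loc denotes the geographic location and words denotes the POI information description. For example, Company A (location, text = service items, service objects, main business, etc.); in the dataset, Company A 1 (Lianhua Branch Road, Futian District, Shenzhen; text = insurance business, legal and natural persons, auto insurance & travel insurance & home property insurance & accident insurance) and Company A 2 (Lujiazui, Shanghai; text = financial asset services, corporate legal persons, online financing). By further refining and annotating the financial database, this embodiment allows information on specific financial service items to be queried with the support of a search engine, overcoming the existing defect that search engines cannot match suitable specific financial items.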
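S2: Match a topic distribution probability to each point of interest according to the information description in that point of interest.
The POI point of this embodiment includes the coordinate information of the geographic location and the POI information description. Because coordinates carry no text description and cannot serve for text classification, POI points are classified into refined topics through their information descriptions. This embodiment converts each POI information description into a topic distribution probability, so that the set of points of interest becomes a set of topic distribution probabilities tagged with geographic locations. This captures the intrinsic meaning of the POI descriptions better, and the semantic association between points of interest is characterized by a similarity measure based on topic distribution probability. The embodiment first analyses the structure of the information description in each POI point, extracts the central word, and then predicts the topic distribution probability from the topic of the central word. For example, the similarity between the central words "coffee" and "Starbucks" is obtained by applying a designated quantitative measure to their topic distribution probabilities, such as $\beta_{words} = \{p_1, p_2, \ldots, p_n\}$, where $n = |Z|$ and P denotes the POI points, and then analysing the computed value: the larger the value, the lower the similarity. The topic distribution probabilities correspond to two points in a high-dimensional space, and the correlation of the two central words' topic distribution probabilities is represented by the distance between these points; the spatial distance here includes the distance arising from geographic location. For example, substituting the high-dimensional spatial parameters of the two POI points containing "coffee" and "Starbucks" into the above formula gives a result below the preset threshold (for example a threshold of 1). This shows that the two POI points have no correlation in their literal text but a strong correlation in their semantically informed topic distribution probabilities; judging the relevance of two POI descriptions by topic distribution probability is therefore more accurate than judging by literal text alone.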
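S3: Construct an index path according to the topic distribution probabilities and the geographic locations.
During POI retrieval, this embodiment considers two quantities together, the topic distribution probability and the geographic location, so as to find information closer to the user's search intention more quickly. The index path differs according to the user's weight setting. For example, when an index path from point M to point N is built with the weight of the topic distribution probability greater than the weight of the geographic location, the correlation of the topic distribution probabilities of two POI points is the main consideration: the POI point with the closest topic distribution probability is preferred, and when no such POI point exists, or when geographic closeness far exceeds topic closeness while searching for the next POI point, the next POI point in the index path is connected by geographic location. Under this weight setting the index path is built as follows. Starting from point M, find point M1, whose topic distribution probability is close to that of M, and discard point M1*, which is geographically closest to M. Taking M1 as the reference, look for point M2, whose topic distribution probability is close to that of M1. If M2 does not exist (or the geographic closeness between M1 and M2* far exceeds the topic closeness between M1 and M2), select point M2*, which is geographically closest to M1, and continue by looking for point M3, whose topic distribution probability is close to that of M2*. The search continues in this way until point N is found, forming an index path from M to N. Conversely, the index path is constructed with geographic location as the main consideration. The process is similar: the geographically closest POI point is preferred, and when no geographically closest POI point exists, or topic closeness far exceeds geographic closeness, the next POI point is connected by topic distribution probability. This embodiment preferably sets the weights of the topic distribution probability and the geographic location both to 0.5, so that both similarities are considered together: the POI points closest in both topic distribution probability and geographic location are selected to form the index path, better fitting the user's search needs.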
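S4: Filter, according to a specified rule, the point-of-interest information similar to the query subject from the index path.
The specified rule in this embodiment is chosen according to the user's needs at query time: either points of interest that are geographically close or points of interest with high text-semantic similarity are selected, and the geographic distance is calculated from the coordinate information of the geographic locations. For example, if geographically close points of interest are selected, the search results lie close to the query subject's location, but their text-semantic relevance may not be high. By using the intrinsic meaning of the text description as a reference quantity, this embodiment brings the retrieved information closer to the user's intention. For example, the query description "coffee" and the POI description "Starbucks" are considered relevant because of their similar topic distribution probabilities. The query subject of this embodiment is the information to be searched that the user inputs.
This application adopts an NIQ-tree based POI search strategy, which ensures effective pruning by accurately calculating the upper and lower bounds of the solution space. Specifically, the POI search starts from the root node of the NIQ-tree and visits in turn the index nodes with the minimum matching distance to the query (implemented with a priority queue). The best match distance between an index node $N$ and a query $q$ is $D_{bm}(q,N) = \lambda \times \min_{p \in N.mbr} D_S(q,p) + (1-\lambda) \times \min D_T(q,N)$, where $\min_{p \in N.mbr} D_S(q,p)$ and $\min D_T(q,N)$ represent the theoretical minimum distances from $q$ to $N$, $\lambda$ is a user-specified parameter representing the weight between geographic distance and the similarity of the textual information description (i.e. the topic distribution probability of the keywords), and $p$ denotes each POI point.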
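The best-match ordering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the node list is flat (a real traversal would push a node's children as it is popped), the function names are hypothetical, spatial distances are assumed to be already scaled to [0, 1], and `min_topic_dist` is an assumed precomputed lower bound on $D_T$ for each node.

```python
import heapq
import math


def min_dist_to_mbr(q, mbr):
    """Smallest Euclidean distance from point q=(x, y) to the rectangle
    mbr=(xmin, ymin, xmax, ymax); 0.0 when q lies inside the rectangle."""
    x, y = q
    xmin, ymin, xmax, ymax = mbr
    dx = max(xmin - x, 0.0, x - xmax)
    dy = max(ymin - y, 0.0, y - ymax)
    return math.hypot(dx, dy)


def best_match_distance(q, mbr, min_topic_dist, lam):
    """D_bm(q, N) = lam * min D_S + (1 - lam) * min D_T."""
    return lam * min_dist_to_mbr(q, mbr) + (1.0 - lam) * min_topic_dist


def visit_order(q, nodes, lam):
    """Return node names in ascending D_bm order, popped from a priority queue.

    nodes: iterable of (name, mbr, min_topic_dist); min_topic_dist is an
    assumed lower bound on D_T(q, p) for every POI p stored under the node.
    """
    heap = [(best_match_distance(q, mbr, dt, lam), name) for name, mbr, dt in nodes]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
    return order
```

With a small $\lambda$ the topic bound dominates the order; with a large $\lambda$ the node whose rectangle contains the query is visited first.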
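Further, step S2 of this embodiment includes:
S21: Collect the first keyword set in the designated database and the second keyword set in the topic of each point of interest.
In the topic-distribution-based POI representation of this embodiment, given a text W consisting of n keywords, let V be the set of all keywords in the financial POI dataset and $Z = \{z_1, z_2, \ldots, z_n\}$ the topic set. The topic probability distribution $TD_W[z_i]$ of W for each topic $z_i \in Z$ is calculated as follows:
Figure PCTCN2018095502-appb-000001
where
Figure PCTCN2018095502-appb-000002
denotes the set of all keywords in topic $z_i$, and
Figure PCTCN2018095502-appb-000003
denotes the number of keywords in W that belong to topic $z_i$; $\alpha$ denotes the symmetric bound, usually set to 0.1; $|W|$ denotes the number of keywords in W; and $|Z|$ denotes the total number of topics in Z. The first keyword set of this embodiment consists of all keywords of all topics in the financial database that includes insurance business. A point of interest is a POI point, and the second keyword set consists of the keywords of the topic corresponding to the POI point, where that topic is one of the topics in the above financial database.
S22: Calculate the topic distribution probability of the second keyword set with respect to the first keyword set.
For example, the text W is the information description 'Medical Insurance' of a POI point, V is the set of all keywords in the financial database that includes insurance business (the first keyword set), and $Z = \{z_1, z_2, \ldots, z_n\}$ is the topic set formed by the topics of that database. The number of keywords differs per topic, and the topic distribution probability of each POI point for each topic in the topic set can be obtained from the above formula. For instance, the POI point described as 'Medical Insurance' has a different topic distribution probability for each topic. Let this POI point be point N, and let the topic set include the fund topic $Z_1$, the stock topic $Z_2$, and so on. Point N is based on the keywords '保' and '险' (so the second keyword set has 2 entries) and belongs to the insurance topic $Z_3$, so its computed topic distribution probability with respect to $Z_3$ is greater than that with respect to $Z_1$ or $Z_2$. The larger the calculated value, the smaller the similarity of the topic distribution probabilities. The above formula yields, for the keywords of every POI point in the dataset, the topic distribution probability $\beta_{words} = \{p_1, p_2, \ldots, p_n\}$, where $n = |Z|$ and P denotes the POI points. The keywords of each POI point thus have different topic distribution probabilities with respect to different topics, which is used to determine the next connected POI point whose topic distribution probability is closest.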
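The formula for $TD_W[z_i]$ is available only as an image in the source, so the sketch below is an assumption: a smoothed ratio built from exactly the quantities the text defines (the count of keywords of W in topic $z_i$, $\alpha$, $|W|$ and $|Z|$). The function name and the sample keyword sets are hypothetical.

```python
def topic_distribution(words, topic_keywords, alpha=0.1):
    """Assumed form: TD_W[z] = (count of W's keywords in z + alpha) / (|W| + |Z| * alpha).

    words: list of keywords forming the text W.
    topic_keywords: dict mapping each topic name to its set of keywords.
    """
    denom = len(words) + len(topic_keywords) * alpha
    return {
        topic: (sum(1 for w in words if w in kws) + alpha) / denom
        for topic, kws in topic_keywords.items()
    }


# Hypothetical topic set: fund (Z1), stock (Z2), insurance (Z3).
topics = {
    "fund": {"fund", "nav"},
    "stock": {"stock", "share"},
    "insurance": {"medical", "insurance", "premium"},
}
td = topic_distribution(["medical", "insurance"], topics)
```

Under this assumed form, a 'Medical Insurance' description scores highest on the insurance topic, which agrees with the worked example in the text; the values sum to 1 only when every keyword of W belongs to exactly one topic.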
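Further, step S3 includes:
S30: Obtain the weight settings for indexing by geographic location and indexing by topic distribution probability.
The weight setting in this step directly affects the search result and can be set freely according to the user's intention; the weight values of this embodiment lie in [0, 1]. For example, if the user sets the geographic location weight to 0.7 and the topic distribution probability weight to 0.3, the final search results will be POI points geographically close to the query subject, whose text similarity may not be high and which may not match the user's search intention well. Conversely, with the weights reversed the result is the opposite. If both weights are 0.5, both factors carry substantial weight, and the search retrieves points of interest that are both geographically close and in line with the user's intention.
S31: Construct the index path according to the weight settings.
In this step, different weight settings produce different index paths. For example, if the geographic location weight is large, the indexed points of interest are visited from the retrieval root node in order of geographic proximity.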
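Further, the index path of this embodiment is a collaborative index path over geographic location and topic distribution probability, and step S31 includes:
S311: Organize all points of interest of the designated financial database at the geospatial layer according to geographic similarity.
The fast retrieval of POI points in this embodiment depends on an effective data index. Unlike traditional POI indexing, the data index here is a hierarchical index structure that combines geographic location with the topic distribution probability of the text semantics, so that search pruning can be performed along different dimensions. This embodiment defines the indexing mechanism based on IDistance (a big-data classification method), in which geographic location, topic distribution probability and text keywords cooperate across three layers, as the NIQ-tree index structure (NIQ combines the initials of N-Gram, IDistance and Quadtree). At the geospatial layer, all POI points are organized by a Quadtree according to geographic similarity. The geospatial layer sits at the top of the NIQ-tree index structure because its data is two-dimensional, so pruning there is much faster than in the high-dimensional topic layer. The quadtree index of this embodiment is a tree structure that recursively divides the geospatial layer into levels: for example, the space is divided into four equal subspaces, and so on recursively, until the tree reaches a certain depth or some other stopping requirement is met. The quadtree has a simple structure: geographic locations are stored only in leaf nodes, not in intermediate nodes or the root node, and when the geospatial layer data is fairly evenly distributed, insertion and query of spatial data are relatively efficient.
In another embodiment of this application, index numbers are defined for the four sub-regions of a planar region: the upper right is the first quadrant 0, the upper left is the second quadrant 1, the lower left is the third quadrant 2, and the lower right is the fourth quadrant 3. The geographic location data structure uses the MBR (Minimum Bounding Rectangle) to approximate the spatial data of a geographic location. The quadtree node is the main component of the quadtree structure and the main object that the quadtree algorithm operates on; it mainly stores the identification number of the geographic location and the MBR. In the quadtree node structure, the MBR is the minimum bounding rectangle of the corresponding region, and the minimum bounding rectangle of a node in the upper layer contains the minimum bounding rectangles of the next layer. The information of a geographic location is stored in the smallest rectangle node that completely contains it, not in that node's parent, so each geographic location is stored in the tree only once and storage space is not wasted. A full quadtree is generated first, which avoids reallocating memory when geographic locations are inserted and speeds up insertion; finally the memory occupied by empty nodes is released. The quadtree of this embodiment keeps the geographic location index consistent with the geographic location data stored in files or databases. It also avoids the situation in which, under an uneven geographic distribution, continued insertion keeps deepening the tree into a severely unbalanced quadtree, greatly increasing the depth of each query and sharply reducing query efficiency.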
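The quadrant numbering above can be sketched as follows. This is a minimal illustration, assuming that y increases upward, that points on a dividing line go to the higher-coordinate side, and that rectangles are given as `(xmin, ymin, xmax, ymax)`; the function names are hypothetical.

```python
def quadrant_index(point, mbr):
    """Return 0 (upper right), 1 (upper left), 2 (lower left) or 3 (lower right)
    for a point inside the rectangle mbr=(xmin, ymin, xmax, ymax)."""
    x, y = point
    xmin, ymin, xmax, ymax = mbr
    cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
    if y >= cy:
        return 0 if x >= cx else 1
    return 2 if x < cx else 3


def child_mbr(mbr, index):
    """Rectangle of the sub-region with the given quadrant index."""
    xmin, ymin, xmax, ymax = mbr
    cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
    return [
        (cx, cy, xmax, ymax),   # 0: upper right
        (xmin, cy, cx, ymax),   # 1: upper left
        (xmin, ymin, cx, cy),   # 2: lower left
        (cx, ymin, xmax, cy),   # 3: lower right
    ][index]


def quadrant_path(point, mbr, depth):
    """Quadrant indices visited while descending `depth` levels toward the point."""
    path = []
    for _ in range(depth):
        idx = quadrant_index(point, mbr)
        path.append(idx)
        mbr = child_mbr(mbr, idx)
    return path
```

The path returned by `quadrant_path` identifies the leaf cell that would hold the location, which matches the statement that locations are stored only in the smallest node containing them.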
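S312: Refine the points of interest at the topic layer according to the closeness of their topic distribution probabilities.
For each leaf node of the spatial-layer Quadtree, the NIQ-tree further subdivides the POI points within the MBR at the topic layer. The polygon-oriented spatial clustering algorithm in the spatial-layer Quadtree first obtains the minimum bounding rectangle of the polygon and then performs spatial clustering according to it. MBR stands for minimum bounding rectangle, also called the minimum containing rectangle or minimum enclosing rectangle. Refining the POI points within the MBR further refines the topic layer and improves the accuracy of search matching.
S313: According to the points of interest refined at the topic layer, build a high-dimensional index path over the geospatial layer and the topic layer by IDistance.
Finally, IDistance is used to build a high-dimensional index structure for fast and efficient retrieval. The IDistance of this embodiment can classify all POI points of the designated financial database, record the information of each class, and write all class information to a file. A high-dimensional B+tree (a multi-way search tree, not a binary one) is then constructed over the above geospatial layer and topic layer according to the weights of the POI points, and the necessary information of the B+tree is stored. After the user inputs a reference point, neighbouring points are searched in the B+tree, and the similarity between the search results and the reference point is analysed by linear comparison of the results.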
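The source names IDistance without giving its key formula. The sketch below follows the commonly published iDistance scheme as an assumption: each point is assigned to its nearest reference point $O_i$ and mapped to the one-dimensional key $i \times c + dist(p, O_i)$, which a B+tree can then order. Here a sorted Python list stands in for the B+tree, the constant `c` is assumed larger than any point-to-reference distance, and all names are hypothetical.

```python
import bisect
import math


def idistance_key(point, refs, c):
    """Map a point to i * c + dist(point, refs[i]) for its nearest reference i."""
    dists = [math.dist(point, r) for r in refs]
    i = min(range(len(refs)), key=dists.__getitem__)
    return i * c + dists[i]


def build_index(points, refs, c):
    """Sorted (key, point) list standing in for the B+tree leaf level."""
    return sorted((idistance_key(p, refs, c), p) for p in points)


def range_query(index, query, refs, c, radius):
    """Scan one key interval per partition, then refine by exact distance."""
    keys = [k for k, _ in index]
    hits = set()
    for i, ref in enumerate(refs):
        dq = math.dist(query, ref)
        lo = i * c + max(dq - radius, 0.0)
        hi = min(i * c + dq + radius, (i + 1) * c)
        start = bisect.bisect_left(keys, lo)
        end = bisect.bisect_right(keys, hi)
        for _, p in index[start:end]:
            if math.dist(query, p) <= radius:  # linear comparison of candidates
                hits.add(p)
    return sorted(hits)
```

The final exact-distance check mirrors the linear comparison step that the text describes after the B+tree search.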
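Further, after step S312 of this embodiment, the method includes:
S310: Build a sketch of the topic layer in the text layer based on N-Gram, so as to refine the points of interest.
The text layer is also an important component of the NIQ-tree index structure in this embodiment. The geospatial layer, the topic layer and the text layer form a three-dimensional interwoven index structure, and fast pruning through the three-layer index further improves retrieval efficiency. The topic layer is further refined in the text layer: a sketch of the topic layer is built in the text layer based on N-Gram, that is, topics with similar text are first grouped together and then classified by topic distribution probability, which amounts to dividing the large set of topic distribution probabilities into small subsets. Building the sketch serves only to further refine the topic layer. To simplify indexing, other embodiments of this application may omit the text layer and keep only the two-layer structure of topic layer and geospatial layer; POI point indexing can still be achieved.
When the sketch of the topic layer is built in the text layer based on N-Gram, the edit distance between two strings can be obtained with the Needleman-Wunsch algorithm (a global sequence alignment algorithm) or the Smith-Waterman algorithm (a local sequence alignment algorithm). This embodiment defines the edit distance between two strings as the N-Gram distance. The N-Gram of a string $s$ denotes the segments obtained by cutting the original word into pieces of length N, that is, all substrings of length N in $s$. Given two strings, their N-Grams (a Chinese language model) are computed separately, and the N-Gram distance between them can then be defined from the number of common substrings. However, if the difference in length between the two strings is ignored, counting only the common substrings is clearly insufficient. For example, the strings girl and girlfriend share as many common substrings as girl shares with itself, yet girl and girlfriend cannot be regarded as an exact match. This embodiment therefore defines the N-Gram distance on the basis of non-repeating N-Gram segments, expressed as $|G_N(s)| + |G_N(t)| - 2 \times |G_N(s) \cap G_N(t)|$, where $G_N(s)$ is the N-Gram set of string $s$ and $N$ is 2 or 3. Taking $N = 2$ as an example, segmenting the strings Gorbachev and Gorbechyov gives Go, or, rb, ba, ac, ch, he, ev and Go, or, rb, be, ec, ch, hy, yo, ov respectively; by the formula above, the distance between the two strings is $8 + 9 - 2 \times 4 = 9$. The smaller the distance between two strings, the closer they are; when two strings are identical, their distance is 0.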
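The N-Gram distance formula above translates directly into code. This is a minimal sketch; the function names are illustrative, and N-Grams are taken as character substrings collected into a set (non-repeating), as the text describes.

```python
def ngrams(s, n=2):
    """Set of all distinct substrings of length n in s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_distance(s, t, n=2):
    """|G_N(s)| + |G_N(t)| - 2 * |G_N(s) ∩ G_N(t)|."""
    gs, gt = ngrams(s, n), ngrams(t, n)
    return len(gs) + len(gt) - 2 * len(gs & gt)


print(ngram_distance("Gorbachev", "Gorbechyov"))  # 8 + 9 - 2 * 4 = 9
print(ngram_distance("girl", "girlfriend"))       # 3 + 9 - 2 * 3 = 6
print(ngram_distance("girl", "girl"))             # 0
```

Because the set sizes of both strings enter the formula, girl and girlfriend are no longer treated as an exact match even though every bigram of girl also occurs in girlfriend.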
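Further, in the implementation scenario of this embodiment, the index path includes index nodes, and step S4 may specifically include:
S40: Receive the query subject input by the user.
The query subject input by the user includes a geographic location and search text keywords.
S41: Starting from the root node of the NIQ-tree, visit and query in turn the index nodes with the minimum matching distance.
The minimum matching distance in this step is expressed by the Euclidean distance, calculated as follows:
Figure PCTCN2018095502-appb-000004
and normalized to $[0,1]$, where $q$ denotes the query subject, $o$ denotes the reference POI point, and $D_S$ denotes the Euclidean distance. For example, the smaller the Euclidean distance between the topic distribution probabilities of two text descriptions, the higher the semantic relevance of the two texts. Other embodiments of this application may also express the semantic relevance of two texts by the cosine distance, the Mahalanobis distance, the Bhattacharyya distance, or the like.
S42: Determine whether the relevance between the index node and the query subject is within the threshold condition.
The index path of this embodiment is formed by connecting a plurality of index nodes. It is determined whether the relevance between the geographic location and/or textual information description of the index node and that of the query subject is within the required threshold, for example 85% or more. The relevance in this step is expressed as $D(q,o) = \lambda \times D_S(q,o) + (1-\lambda) \times D_T(q,o)$, where $\lambda \in [0,1]$ is a user-specified parameter representing the weight between geographic location and the similarity of the textual information description (i.e. topic distribution probability).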
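The weighted combination $D(q,o)$ and the effect of $\lambda$ can be illustrated as follows. This is a sketch under assumptions: the normalising constant `d_max` (for example the diagonal of the dataset's bounding box) is hypothetical because the normalisation formula is only an image in the source, and the candidate values are made-up illustration data.

```python
import math


def spatial_distance(q_loc, o_loc, d_max):
    """Euclidean distance scaled into [0, 1]; d_max is an assumed constant."""
    return min(math.dist(q_loc, o_loc) / d_max, 1.0)


def combined_distance(d_s, d_t, lam):
    """D(q, o) = lam * D_S(q, o) + (1 - lam) * D_T(q, o), with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * d_s + (1.0 - lam) * d_t


def rank(candidates, lam):
    """candidates: list of (name, d_s, d_t); smaller combined distance first."""
    ordered = sorted(candidates, key=lambda c: combined_distance(c[1], c[2], lam))
    return [name for name, _, _ in ordered]


# Illustrative candidates: one close by but off topic, one farther but on topic.
pois = [("near_off_topic", 0.1, 0.9), ("far_on_topic", 0.6, 0.1)]
print(rank(pois, 0.7))  # geographic weight 0.7 favours the nearby POI
print(rank(pois, 0.5))  # equal weights favour the on-topic POI here
```

This reproduces the behaviour described for the weight settings: a geographic weight of 0.7 returns the nearby point first even though its text similarity is low.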
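S43: If so, retrieve the information data of the index node as the point-of-interest information similar to the query subject.
Further, step S42 includes:
S420: Determine whether the geographic proximity between the index node and the query subject and/or the similarity of their topic distribution probabilities is within a preset range.
The similarity between the topic distribution probabilities of the index node and the query subject in this step is expressed as
Figure PCTCN2018095502-appb-000005
where $TD_W$ denotes the topic distribution probability corresponding to the keywords in the POI point and $\|TD_W\|$ is the modulus of $TD_W$. For example, the preset range of geographic proximity in this embodiment is less than 500 m.
S421: If so, determine that the relevance between the index node and the query subject is within the threshold condition; if not, it is not within the threshold condition.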
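The threshold decision in steps S420 and S421 can be sketched as below. The similarity formula is only an image in the source; because the text refers to the modulus $\|TD_W\|$, a cosine-style similarity is assumed here. Coordinates are assumed to be planar and in metres, the 500 m and 85% values come from the examples in the text, and the `require_both` switch is a hypothetical way to model the "and/or" wording.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two topic distribution vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def within_threshold(q_loc, o_loc, q_td, o_td,
                     max_metres=500.0, min_similarity=0.85, require_both=True):
    """True when the index node counts as relevant to the query subject."""
    near = math.dist(q_loc, o_loc) < max_metres
    similar = cosine_similarity(q_td, o_td) >= min_similarity
    return (near and similar) if require_both else (near or similar)
```

A node that passes this check would have its information data returned as point-of-interest information similar to the query subject, as in step S43.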
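In another embodiment of this application, step S4 specifically includes:
S44: Receive a query subject of the financial data class for a specified object, input by the user.
This embodiment is a specific application scenario of semantic-based POI search technology in the financial field, aimed at obtaining more detailed and more valuable financial data. The specified objects of this embodiment include all companies and groups covered by the financial database. A query subject of the financial data class covers database data related to markets and operations, and includes a geographic location and an information description of the financial data class, for example a query for a particular kind of financial service point nearby. By modelling profiles of financial institutions (name, service objects, main business, ...), a dedicated financial site query and recommendation system is established, making big-data search technology better suited to the financial services industry.
S45: Retrieve semantically similar financial data from the designated database according to the information description carried in the query subject.
By retrieving semantically similar financial data and extracting trend data that benefits investment strategy, this embodiment helps users carry out precise market analysis, or carry out targeted business association analysis using market information data, promoting effective market development. For example, if the information description is "automobile", all automobile-related financial data is retrieved, such as automobile market prices, automobile service fees, automobile parts prices and the used-car market, so that users can select the data they need more conveniently.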
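In a further embodiment of this application, after step S45, the method includes:
S46: Input the financial data into an investment risk estimation model to output the risk estimation level of the specified object.
This embodiment estimates the risk estimation level from the retrieved operational and market-related data of the specified object, such as market credit, debt ratio, marketing field evaluation and market prospect evaluation, which helps banks and investors reduce investment risk. The investment risk estimation model of this embodiment is obtained by feeding risk data samples into a convolutional neural network for training.
In yet another embodiment of this application, after step S46, the method includes:
S47: Form a credit fund rating evaluation database according to the risk estimation level and industry analysis data.
A credit rating assessment is formed according to the risk estimation level and the industry analysis data. Because the information retrieved by semantically informed POI search is more comprehensive, the risk estimation level and industry analysis data are more reliable and the credit rating assessment has more reference value. This helps banks and other financial enterprises build a more comprehensive data warehouse and form the credit fund rating evaluation database, so as to formulate more practical market strategies.
S48: Match credit resources according to the credit fund rating evaluation database.
For example, a high credit fund rating is matched with a high credit limit standard, and customers with a high credit fund rating are classified as premium customers for tracking.
The POI search technique of the embodiments of this application incorporates an understanding of the user's search semantics, so that the retrieved information is closer to the user's true intention and the match between search content and search intent is improved. By matching queries on keyword semantic similarity (i.e. the topic distribution probability of the keywords), the coverage of the retrieved information increases: it is no longer limited to the literal form of the text characters but extends to content related in meaning, improving retrieval precision. By constraining the influencing factors of POI search along multiple dimensions, the precision of the retrieved information is refined, advancing the application of POI search in the financial field so as to serve users better there and provide financial information better matched to user needs.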
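Referring to FIG. 2, a point-of-interest query apparatus based on semantic understanding according to an embodiment of this application comprises:
The acquisition module 1, configured to acquire a plurality of points of interest from a designated database in the financial field, where each point of interest includes an information description and a geographic location.
In this embodiment, the points of interest (POIs) of the designated financial database form a set of text descriptions with time tags. Each POI point is represented by a two-tuple (loc, words), where loc denotes the geographic location and words denotes the POI information description. For example, Company A (location, text = service items, service objects, main business, etc.); in the dataset, Company A 1 (Lianhua Branch Road, Futian District, Shenzhen; text = insurance business, legal and natural persons, auto insurance & travel insurance & home property insurance & accident insurance) and Company A 2 (Lujiazui, Shanghai; text = financial asset services, corporate legal persons, online financing). By further refining and annotating the financial database, this embodiment allows information on specific financial service items to be queried with the support of a search engine, overcoming the existing defect that search engines cannot match suitable specific financial items.
The matching module 2, configured to match a topic distribution probability to each point of interest in the designated financial database according to the information description in that point of interest.
The POI point of this embodiment includes the coordinate information of the geographic location and the POI information description. Because coordinates carry no text description and cannot serve for text classification, POI points are classified into refined topics through their information descriptions. This embodiment converts each POI information description into a topic distribution probability, so that the set of points of interest becomes a series of topic distribution probabilities tagged with geographic locations. This captures the intrinsic meaning of the POI descriptions better, and the semantic association between points of interest is characterized by a similarity measure based on topic distribution probability. The embodiment first analyses the structure of the information description in each POI point, extracts the central word, and then predicts the topic distribution probability from the topic of the central word. For example, the similarity between the central words "coffee" and "Starbucks" is obtained by applying a designated quantitative measure to their topic distribution probabilities, such as $\beta_{words} = \{p_1, p_2, \ldots, p_n\}$, where $n = |Z|$ and P denotes the POI points, and then analysing the computed value: the larger the value, the lower the similarity. The topic distribution probabilities correspond to two points in a high-dimensional space, and the correlation of the two central words' topic distribution probabilities is represented by the distance between these points; the spatial distance here includes the distance arising from geographic location. For example, substituting the high-dimensional spatial parameters of the two POI points containing "coffee" and "Starbucks" into the above formula gives a result below the preset threshold (for example a threshold of 1). This shows that the two POI points have no correlation in their literal text but a strong correlation in their semantically informed topic distribution probabilities; judging the relevance of two POI descriptions by topic distribution probability is therefore more accurate than judging by literal text alone.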
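The building module 3, configured to construct an index path according to the topic distribution probabilities and the geographic locations.
During POI retrieval, this embodiment considers two quantities together, the topic distribution probability and the geographic location, so as to find information closer to the user's search intention more quickly. The index path differs according to the user's weight setting. For example, when an index path from point M to point N is built with the weight of the topic distribution probability greater than the weight of the geographic location, the correlation of the topic distribution probabilities of two POI points is the main consideration: the POI point with the closest topic distribution probability is preferred, and when no such POI point exists, or when geographic closeness far exceeds topic closeness while searching for the next POI point, the next POI point in the index path is connected by geographic location. Under this weight setting the index path is built as follows. Starting from point M, find point M1, whose topic distribution probability is close to that of M, and discard point M1*, which is geographically closest to M. Taking M1 as the reference, look for point M2, whose topic distribution probability is close to that of M1. If M2 does not exist (or the geographic closeness between M1 and M2* far exceeds the topic closeness between M1 and M2), select point M2*, which is geographically closest to M1, and continue by looking for point M3, whose topic distribution probability is close to that of M2*. The search continues in this way until point N is found, forming an index path from M to N. Conversely, the index path is constructed with geographic location as the main consideration. The process is similar: the geographically closest POI point is preferred, and when no geographically closest POI point exists, or topic closeness far exceeds geographic closeness, the next POI point is connected by topic distribution probability. This embodiment preferably sets the weights of the topic distribution probability and the geographic location both to 0.5, so that both similarities are considered together: the POI points closest in both topic distribution probability and geographic location are selected to form the index path, better fitting the user's search needs.
The screening module 4, configured to filter, according to the index path, the point-of-interest information similar to the query subject.
The specified rule in this embodiment is chosen according to the user's needs at query time: either points of interest that are spatially close or points of interest with high text similarity are selected, and the geographic distance is calculated from the coordinate information of the geographic locations. For example, if geographically close points of interest are selected, the search results lie close to the query subject's location, but their text similarity may not be high. By using the intrinsic meaning of the text description as a reference quantity, this embodiment brings the retrieved information closer to the user's intention. For example, the query description "coffee" and the POI description "Starbucks" are considered relevant because of their similar topic distribution probabilities. The query subject of this embodiment is the information to be searched that the user inputs.
This embodiment adopts an NIQ-tree based POI search strategy, which ensures effective pruning by accurately calculating the upper and lower bounds of the solution space. Specifically, the POI search starts from the root node of the NIQ-tree and visits in turn the index nodes with the minimum matching distance to the query (implemented with a priority queue). The best match distance between an index node $N$ and a query $q$ is $D_{bm}(q,N) = \lambda \times \min_{p \in N.mbr} D_S(q,p) + (1-\lambda) \times \min D_T(q,N)$, where $\min_{p \in N.mbr} D_S(q,p)$ and $\min D_T(q,N)$ represent the theoretical minimum distances from $q$ to $N$, $\lambda$ is a user-specified parameter representing the weight between geographic location and the similarity of the textual information description (i.e. the topic distribution probability of the keywords), and $p$ denotes each POI point.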
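Referring to FIG. 3, the matching module 2 includes:
The statistic unit 21, configured to collect the first keyword set in the designated database and the second keyword set in the topic of each point of interest.
In the topic-distribution-based POI representation of this embodiment, given a text W consisting of n keywords, let V be the set of all keywords in the financial POI dataset and $Z = \{z_1, z_2, \ldots, z_n\}$ the topic set. The topic probability distribution $TD_W[z_i]$ of W for each topic $z_i \in Z$ is calculated as follows:
Figure PCTCN2018095502-appb-000006
The formula uses the number of keywords in W that belong to topic $z_i$; $\alpha$ denotes the symmetric bound, usually set to 0.1; $|W|$ denotes the number of keywords in W; and $|Z|$ denotes the total number of topics in Z. The first keyword set of this embodiment consists of all keywords of all topics in the financial database that includes insurance business. A point of interest is a POI point, and the second keyword set consists of the keywords of the topic corresponding to the POI point, where that topic is one of the topics in the above financial database.
The calculating unit 12, configured to calculate the topic distribution probability of the second keyword set with respect to the first keyword set.
For example, the text W is the information description 'Medical Insurance' of a POI point, V is the set of all keywords in the financial database that includes insurance business (the first keyword set), and $Z = \{z_1, z_2, \ldots, z_n\}$ is the topic set formed by the topics of that database. The number of keywords differs per topic, and the topic distribution probability of each POI point for each topic in the topic set can be obtained from the above formula. For instance, the POI point described as 'Medical Insurance' has a different topic distribution probability for each topic. Let this POI point be point N, and let the topic set include the fund topic $Z_1$, the stock topic $Z_2$, and so on. Point N is based on the keywords '保' and '险' (so the second keyword set has 2 entries) and belongs to the insurance topic $Z_3$, so its computed topic distribution probability with respect to $Z_3$ is greater than that with respect to $Z_1$ or $Z_2$. The larger the calculated value, the smaller the similarity of the topic distribution probabilities. The above formula yields, for the keywords of every POI point in the dataset, the topic distribution probability $\beta_{words} = \{p_1, p_2, \ldots, p_n\}$, where $n = |Z|$ and P denotes the POI points. The keywords of each POI point thus have different topic distribution probabilities with respect to different topics, which is used to determine the next connected POI point whose topic distribution probability is closest.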
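Referring to FIG. 4, the construction module 3 includes:
Acquisition unit 30, configured to acquire the weight settings for indexing by geographic location and indexing by topic distribution probability.
The weight settings in this embodiment directly affect the search results and can be set by the user according to the intended use; each weight takes a value in [0, 1]. For example, if the user sets the geographic-location weight to 0.7 and the topic-distribution-probability weight to 0.3, the search results will be POI points that are geographically close to the query subject, but their text similarity may not be high and they may not match the user's search intent well. With the weights reversed the result is the opposite, which is not described further here. If the geographic-location weight is 0.5 and the topic-distribution-probability weight is 0.5, both carry substantial weight, and points of interest that are both geographically close and consistent with the user's intent are retrieved.
Construction unit 31, configured to construct the index path according to the weight settings.
In this embodiment, different weight settings produce different index paths. For example, if the geographic-location weight dominates, the indexed points of interest are visited from the retrieval root node in order of geographic proximity.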
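Referring to FIG. 5, the index path is a collaborative index path over geographic location and topic distribution probability, and the construction unit 31 includes:
Organization subunit 311, configured to organize all points of interest of the specified financial-domain database in a geospatial layer according to geographic similarity.
The fast retrieval of POI points in the embodiments of this application relies on effective data indexing. Unlike conventional POI indexing, the data index of this embodiment is a hierarchical index structure that fuses two kinds of information, geographic location and the topic distribution probability of the text semantics, so that search pruning can be performed along different dimensions. This embodiment defines the IDistance-based three-layer collaborative indexing mechanism over geographic location, topic distribution probability, and text keywords as the NIQ-tree index structure. In the geospatial layer, all POI points are organized by a Quadtree according to geographic similarity. The geospatial layer sits at the top of the NIQ-tree index structure because its data is two-dimensional and therefore prunes much faster than the high-dimensional topic layer. The quadtree index of this embodiment recursively partitions the geospatial layer into a tree structure of different levels: for example, the space is divided into four equal subspaces, and the division recurses until the tree reaches a certain depth or some stopping requirement is met. The quadtree structure of this embodiment is simple: geographic locations are stored only in leaf nodes, while intermediate nodes and the root node store none. When the data in the geospatial layer is fairly evenly distributed, insertion and querying of spatial data are efficient.
In another embodiment of this application, index numbers are defined for the four sub-regions of a planar region, for example 0 for the first quadrant (upper right), 1 for the second quadrant (upper left), 2 for the third quadrant (lower left), and 3 for the fourth quadrant (lower right). The geographic-location data structure approximates the spatial data of a geographic location with an MBR (minimum bounding rectangle). The quadtree node is the main component of the quadtree structure; it stores the identifier and the MBR of a geographic location and is the main object operated on by the quadtree algorithm. In the quadtree node type, the MBR is the minimum bounding rectangle of the corresponding region, and the minimum bounding rectangle of a node at one level contains the minimum bounding rectangles of the level below. The information of a geographic location is thus stored in the smallest rectangular node that completely contains it rather than in that node's parent, so each geographic location is stored only once in the tree and storage space is not wasted. In this embodiment a full quadtree is generated first, which avoids reallocating memory when geographic locations are inserted and speeds up insertion; the memory occupied by empty nodes is released at the end. The quadtree of this embodiment keeps the geographic-location index consistent with the geographic-location data stored in files or databases. It also avoids the situation in which, because geographic locations are unevenly distributed, the tree keeps deepening as locations are inserted and becomes severely unbalanced, so that the depth of each query increases greatly and query efficiency drops sharply.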
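A minimal sketch of such a quadtree, using the quadrant numbering given above (0 upper right, 1 upper left, 2 lower left, 3 lower right) and storing POIs only in leaves. The names `QuadNode`, `CAPACITY` and `MAX_DEPTH` are hypothetical, and for brevity the sketch splits leaves lazily on insertion rather than generating a full quadtree first and releasing empty nodes afterwards.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

CAPACITY = 2    # hypothetical: max POIs per leaf before it splits
MAX_DEPTH = 8   # hypothetical: stop splitting at this depth


@dataclass
class QuadNode:
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    depth: int = 0
    points: List[Tuple[str, float, float]] = field(default_factory=list)
    children: Optional[List["QuadNode"]] = None  # set once the node splits

    def quadrant(self, x, y):
        """Index of the sub-region containing (x, y): 0 UR, 1 UL, 2 LL, 3 LR."""
        cx = (self.xmin + self.xmax) / 2
        cy = (self.ymin + self.ymax) / 2
        if x >= cx:
            return 0 if y >= cy else 3
        return 1 if y >= cy else 2

    def insert(self, poi_id, x, y):
        if self.children is not None:
            # Interior nodes store nothing; pass the POI down.
            self.children[self.quadrant(x, y)].insert(poi_id, x, y)
            return
        self.points.append((poi_id, x, y))
        if len(self.points) > CAPACITY and self.depth < MAX_DEPTH:
            self._split()

    def _split(self):
        cx = (self.xmin + self.xmax) / 2
        cy = (self.ymin + self.ymax) / 2
        d = self.depth + 1
        self.children = [
            QuadNode(cx, cy, self.xmax, self.ymax, d),   # 0: upper right
            QuadNode(self.xmin, cy, cx, self.ymax, d),   # 1: upper left
            QuadNode(self.xmin, self.ymin, cx, cy, d),   # 2: lower left
            QuadNode(cx, self.ymin, self.xmax, cy, d),   # 3: lower right
        ]
        pending, self.points = self.points, []
        for poi_id, x, y in pending:
            self.children[self.quadrant(x, y)].insert(poi_id, x, y)

    def leaf_for(self, x, y):
        """Descend to the leaf whose rectangle contains (x, y)."""
        node = self
        while node.children is not None:
            node = node.children[node.quadrant(x, y)]
        return node
```

Each node's rectangle plays the role of the MBR of its region, and a parent's rectangle always contains those of its four children, so every POI is stored exactly once, in the smallest node that contains it.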
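Refinement subunit 312, configured to refine the points of interest in a topic layer according to the closeness of their topic distribution probabilities.
For each leaf node of the spatial-layer Quadtree, the NIQ-tree further subdivides, in the topic layer, the POI points in the MBR (Minimum Bounding Rectangle). The polygon-oriented spatial clustering algorithm in the spatial-layer Quadtree of this embodiment first obtains the minimum bounding rectangle of each polygon and then performs spatial clustering on those rectangles. The MBR is also called the minimum boundary rectangle, minimum containing rectangle, or minimum enclosing rectangle. Refining the POI points in the MBR further refines the topic layer and improves the precision of search matching.
Establishment subunit 313, configured to establish, from the points of interest refined in the topic layer, a high-dimensional index path over the geospatial layer and the topic layer by means of IDistance.
Finally, IDistance is used to build a high-dimensional index structure for efficient, fast retrieval. The IDistance of this embodiment classifies all POI points of the specified financial database, records the information of each class, and writes all class information to a file. A high-dimensional B+tree is then built over the geospatial layer and the topic layer according to the weights of the POI points of the specified financial database, and the necessary B+tree information is stored. After the user enters a reference point, neighboring points are searched in the B+tree, and the results are compared with those of a linear search to analyze how close the search results are to the reference point.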
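The general iDistance idea can be sketched as follows: each point is assigned to its nearest reference point (its class), mapped to the one-dimensional key `i * c + dist(point, ref_i)`, and the keys are kept in sorted order. This is an illustration under stated assumptions, not the embodiment's implementation: a sorted Python list stands in for the B+tree, and the class name `IDistanceIndex`, the constant `c` and the choice of reference points are hypothetical.

```python
import bisect
import math


class IDistanceIndex:
    def __init__(self, points, refs, c=1000.0):
        # Assumption: c is larger than the radius of every partition,
        # so the key ranges of different partitions never overlap.
        self.refs = refs
        self.c = c
        self.max_r = [0.0] * len(refs)  # radius of each partition
        entries = []
        for p in points:
            i = min(range(len(refs)), key=lambda j: math.dist(p, refs[j]))
            d = math.dist(p, refs[i])
            self.max_r[i] = max(self.max_r[i], d)
            entries.append((i * c + d, p))
        entries.sort()
        self.keys = [k for k, _ in entries]
        self.points = [p for _, p in entries]

    def range_query(self, q, r):
        """Return all indexed points within distance r of q."""
        hits = []
        for i, ref in enumerate(self.refs):
            dq = math.dist(q, ref)
            if dq - r > self.max_r[i]:
                continue  # query sphere cannot reach this partition
            lo = i * self.c + max(0.0, dq - r)
            hi = i * self.c + min(self.max_r[i], dq + r)
            start = bisect.bisect_left(self.keys, lo)
            end = bisect.bisect_right(self.keys, hi)
            # Candidates from the key range are verified by true distance.
            hits.extend(p for p in self.points[start:end]
                        if math.dist(p, q) <= r)
        return hits
```

Comparing the output of `range_query` with a linear scan over all points, as the paragraph above suggests, is a simple way to check that the index returns the same neighbors.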
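Further, the construction unit 31 also includes:
Construction subunit 310, configured to construct, based on N-Gram, a sketch of the topic layer in a text layer so as to refine the points of interest.
The text layer is also an important component of the NIQ-tree index structure of this embodiment. The geospatial layer, the topic layer, and the text layer form a three-dimensional interwoven mesh index structure, and pruning quickly across the three layers further improves retrieval efficiency. This embodiment further refines the topic layer in the text layer: a sketch of the topic layer is constructed in the text layer based on N-Gram, i.e., topics with similar text are grouped first and then classified by topic distribution probability, which amounts to dividing the large set of topic distribution probabilities into small subsets. Because the N-Gram-based sketch construction is only a further refinement of the topic layer, other embodiments of this application may omit the text layer to simplify indexing and keep only the topic layer and the geospatial layer, which still achieves POI indexing.
When this embodiment constructs the sketch of the topic layer in the text layer based on N-Gram, the edit distance between two strings may be computed with the Needleman-Wunsch algorithm or the Smith-Waterman algorithm; this embodiment defines the edit distance between two strings as the N-Gram distance. The N-Gram of a string s denotes the segments obtained by cutting the original word into pieces of length N, i.e., all substrings of s of length N. Given two strings, their N-Grams can be computed separately, and the N-Gram distance between them can be defined from the number of shared substrings. However, simply counting shared substrings ignores the difference in length between the two strings and is clearly inadequate. For example, the strings girl and girlfriend share the same number of common substrings as girl shares with itself, yet girl and girlfriend cannot therefore be regarded as equivalent matches. This embodiment therefore defines the N-Gram distance on the basis of non-repeating N-Gram segments as $|G_N(s)| + |G_N(t)| - 2 \times |G_N(s) \cap G_N(t)|$, where $G_N(s)$ is the N-Gram set of string s, $|G_N(s)|$ is the size of that set, and N is 2 or 3. Taking N = 2 as an example, segmenting the strings Gorbachev and Gorbechyov gives Go, or, rb, ba, ac, ch, he, ev and Go, or, rb, be, ec, ch, hy, yo, ov, respectively; by the formula above, the distance between the two strings is 8 + 9 - 2 × 4 = 9. The smaller the distance between two strings, the closer they are, and when two strings are identical their distance is 0.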
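The N-Gram distance defined above is short enough to check directly. The sketch below uses Python sets for the non-repeating N-Gram segments; the function names `ngrams` and `ngram_distance` are illustrative.

```python
def ngrams(s, n=2):
    """Set of distinct length-n substrings of s (the set G_N(s))."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_distance(s, t, n=2):
    """|G_N(s)| + |G_N(t)| - 2 * |G_N(s) ∩ G_N(t)|."""
    gs, gt = ngrams(s, n), ngrams(t, n)
    return len(gs) + len(gt) - 2 * len(gs & gt)


print(ngram_distance("Gorbachev", "Gorbechyov"))  # 9, as in the worked example
print(ngram_distance("girl", "girl"))             # 0
print(ngram_distance("girl", "girlfriend"))       # 6
```

The last two calls show why the length-aware definition helps: girl shares three bigrams both with itself and with girlfriend, but the distance is 0 in the first case and 6 in the second.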
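Referring to FIG. 6, in an embodiment of this application the index path includes index nodes, and the screening module 4 includes:
First receiving unit 40, configured to receive the query subject entered by the user.
The query subject received by the first receiving unit 40 includes a geographic location and query text keywords.
Query unit 41, configured to start from the root node of the NIQ-tree and to visit and query, in turn, the index node with the smallest matching distance.
The minimum matching distance in this embodiment is expressed as a Euclidean distance, computed as follows: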
Figure PCTCN2018095502-appb-000007
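The distance is normalized to [0, 1], where q denotes the query subject, o denotes the reference POI point, and $D_S$ denotes the Euclidean distance. For example, the smaller the Euclidean distance between the topic distribution probabilities of two text descriptions, the higher the semantic relevance of the two texts. Other embodiments of this application may instead express the semantic relevance of two texts with the cosine distance, the Mahalanobis distance, the Bhattacharyya distance, and so on.
Judging unit 42, configured to judge whether the relevance between the index node and the query subject is within a threshold condition.
The index path of this embodiment is formed by connecting multiple index nodes. The judging unit 42 judges whether the relevance between the geographic location and/or textual information description of an index node and the geographic location and/or textual information description of the query subject is within the required threshold, for example a threshold of 85% or more. The relevance in this embodiment is expressed as $D(q, o) = \lambda \times D_S(q, o) + (1 - \lambda) \times D_T(q, o)$, where $\lambda$ is a user-specified parameter in [0, 1] that weights the similarity of the geographic location against the similarity of the textual information description (i.e., the topic distribution probability).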
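A small sketch of the weighted combination $D(q, o) = \lambda \times D_S(q, o) + (1 - \lambda) \times D_T(q, o)$. The normalization scheme is an assumption, since the source only states that distances are normalized to [0, 1]: the spatial distance is divided by a caller-supplied maximum `max_dist`, and the topic distance by $\sqrt{2}$, the largest Euclidean distance between two probability vectors. All function names are hypothetical.

```python
import math


def spatial_distance(q_loc, o_loc, max_dist):
    """Euclidean distance scaled into [0, 1] by an assumed maximum distance."""
    return min(math.dist(q_loc, o_loc) / max_dist, 1.0)


def topic_distance(q_td, o_td):
    """Euclidean distance between topic distributions, scaled by sqrt(2)."""
    return min(math.dist(q_td, o_td) / math.sqrt(2.0), 1.0)


def combined_distance(q_loc, q_td, o_loc, o_td, lam, max_dist):
    """D(q, o) = lam * D_S(q, o) + (1 - lam) * D_T(q, o), with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (lam * spatial_distance(q_loc, o_loc, max_dist)
            + (1.0 - lam) * topic_distance(q_td, o_td))
```

With `lam = 1.0` only the geographic term matters and with `lam = 0.0` only the topic term matters, which mirrors the weight settings discussed for the acquisition unit 30.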
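Retrieval unit 43, configured to retrieve, if the relevance between the index node and the query subject is within the threshold condition, the information data of the index node as the point-of-interest information similar to the query subject.
Referring to FIG. 7, the judging unit 42 includes:
Judging subunit 420, configured to judge whether the geographic proximity between the index node and the query subject and/or the topic-distribution-probability similarity between the index node and the query subject is within a preset range.
The topic-distribution-probability similarity between the index node and the query subject in this embodiment is expressed as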
Figure PCTCN2018095502-appb-000008
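where $TD_W$ denotes the topic distribution probability corresponding to the keywords in a POI point and $\|TD_W\|$ is the norm of $TD_W$. For example, the preset range of geographic proximity in this embodiment is less than 500 m.
Determination subunit 421, configured to determine, if the value is within the preset range, that the relevance between the index node and the query subject is within the threshold condition, and otherwise that it is not within the threshold condition.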
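The decision made by the judging and determination subunits can be sketched as below. Because the similarity formula survives only as a figure reference, the sketch assumes a cosine similarity, which is consistent with the norm $\|TD_W\|$ mentioned in the text. The 500 m range comes from the example above; the 0.85 similarity cut-off reuses the 85% figure from the relevance example and is an assumption, as are the names `cosine_similarity`, `within_threshold` and `require_both`.

```python
import math


def cosine_similarity(a, b):
    """Assumed similarity: dot(a, b) / (||a|| * ||b||)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def within_threshold(geo_dist_m, td_query, td_node,
                     max_geo_m=500.0, min_topic_sim=0.85,
                     require_both=True):
    """Apply the geographic and/or topic-similarity preset ranges."""
    geo_ok = geo_dist_m < max_geo_m
    topic_ok = cosine_similarity(td_query, td_node) >= min_topic_sim
    # require_both selects the "and" reading of "and/or"; False selects "or".
    return (geo_ok and topic_ok) if require_both else (geo_ok or topic_ok)
```

A node 300 m away with a similar topic distribution passes under either reading, while a node 800 m away with the same distribution passes only when `require_both` is `False`.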
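Referring to FIG. 8, in another embodiment of this application, the screening module 4 includes:
Second receiving unit 44, configured to receive a financial-data query subject for a specified object entered by the user.
This embodiment is a concrete scenario of semantics-based POI search in the financial domain, aimed at obtaining more fine-grained financial data of greater reference value. The specified objects in this embodiment include all companies and groups covered by the financial database. The financial-data query subject covers database data related to markets and operations and includes a geographic location and an information description of the financial data type. One example is a query for a particular kind of financial service outlet nearby. By modeling profiles of financial institutions (name, service targets, main business, ...), a dedicated financial-outlet query and recommendation system is built, which makes big-data search technology better suited to the financial services industry.
Fetching unit 45, configured to fetch semantically similar financial data from the specified database according to the information description carried in the query subject.
By fetching semantically similar financial data, this embodiment obtains trend data that is useful for investment strategy, which helps users perform precise market analysis or carry out targeted business-association analysis on market information, promoting effective market development. For example, if the information description is "汽车" (automobile), all automobile-related financial data is fetched, such as automobile market prices, automobile service charges, auto parts prices, and used-car market information, so that users can conveniently select the data they need.
Referring to FIG. 9, in yet another embodiment of this application, the screening module 4 includes:
Input/output unit 46, configured to input the financial data into an investment risk estimation model so as to output a risk estimation level for the specified object.
This embodiment estimates the risk level from the retrieved operations- and market-related data of the specified object, such as market credit, debt ratio, marketing-field assessment, and marketing-prospect evaluation, which helps banks and investors reduce investment risk. The investment risk estimation model of this embodiment is obtained by training a convolutional neural network on risk data samples.
Referring to FIG. 10, in still another embodiment of this application, the screening module 4 includes:
Forming unit 47, configured to form a credit-fund rating database according to the risk estimation level and industry analysis data.
A credit rating is formed from the risk estimation level and the industry analysis data. Because the information found by semantics-aware POI search is more comprehensive, the risk estimation level and the industry analysis data are more reliable and the credit rating has greater reference value. This helps banks and other financial enterprises build their data warehouses more comprehensively and form a credit-fund rating database, so that more practical market strategies can be formulated.
Matching unit 48, configured to match credit resources according to the credit-fund rating database.
For example, a high credit-fund rating is matched with a high credit limit standard. As another example, customers with a high credit-fund rating are classified as premium customers and tracked.
Referring to FIG. 11, an embodiment of this application further provides a computer device, which may be a server and whose internal structure may be as shown in FIG. 11. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capability. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database of the computer device stores data such as that used for the point-of-interest query based on semantic understanding. The network interface of the computer device communicates with external terminals over a network connection. When executed, the computer-readable instructions carry out the processes of the method embodiments described above. Those skilled in the art will understand that the structure shown in FIG. 11 is only a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer devices to which the solution is applied.
An embodiment of this application further provides a non-volatile computer-readable storage medium storing computer-readable instructions that, when executed, carry out the processes of the method embodiments described above. The foregoing is only a preferred embodiment of this application and does not thereby limit its patent scope; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of this application.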

Claims (20)

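  1. A point-of-interest query method based on semantic understanding, characterized by comprising:
    acquiring a plurality of points of interest in a specified database in the financial domain, wherein each point of interest includes an information description and a geographic location;
    matching a topic distribution probability to each point of interest according to the information description in that point of interest;
    constructing an index path according to the topic distribution probabilities and the geographic locations;
    screening, according to the index path, point-of-interest information similar to a query subject.
  2. The point-of-interest query method based on semantic understanding according to claim 1, characterized in that the step of matching a topic distribution probability to each point of interest according to the information description in that point of interest comprises:
    counting a first keyword set in the specified database and a second keyword set in each point-of-interest topic;
    calculating the topic distribution probability of the second keyword set relative to the first keyword set.
  3. The point-of-interest query method based on semantic understanding according to claim 1, characterized in that the step of constructing an index path according to the topic distribution probabilities and the geographic locations comprises:
    acquiring weight settings for indexing by geographic location and indexing by topic distribution probability;
    constructing the index path according to the weight settings.
  4. The point-of-interest query method based on semantic understanding according to claim 3, characterized in that the index path is a collaborative index path over geographic location and topic distribution probability, and the step of constructing the index path according to the weight settings comprises:
    organizing all points of interest of the specified database in a geospatial layer according to geographic similarity;
    refining the points of interest in a topic layer according to the closeness of their topic distribution probabilities;
    establishing, from the points of interest refined in the topic layer, a high-dimensional index path over the geospatial layer and the topic layer by means of IDistance.
  5. The point-of-interest query method based on semantic understanding according to claim 4, characterized in that, after the step of refining the points of interest in a topic layer according to the closeness of their topic distribution probabilities, the method comprises:
    constructing, based on N-Gram, a sketch of the topic layer in a text layer so as to refine the points of interest.
  6. The point-of-interest query method based on semantic understanding according to claim 4, characterized in that the index path includes index nodes, and the step of screening, according to the index path, point-of-interest information similar to a query subject comprises:
    receiving a query subject entered by a user;
    starting from the root node of the NIQ-tree, visiting and querying, in turn, the index node with the smallest matching distance;
    judging whether the relevance between the index node and the query subject is within a threshold condition;
    if so, retrieving the information data of the index node as the point-of-interest information similar to the query subject.
  7. The point-of-interest query method based on semantic understanding according to claim 6, characterized in that the step of judging whether the relevance between the index node and the query subject is within a threshold condition comprises:
    judging whether the geographic proximity between the index node and the query subject and/or the topic-distribution-probability similarity between the index node and the query subject is within a preset range;
    if so, determining that the relevance between the index node and the query subject is within the threshold condition; if not, determining that it is not within the threshold condition.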
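  8. A point-of-interest query apparatus based on semantic understanding, characterized by comprising:
    an acquisition module, configured to acquire a plurality of points of interest in a specified database in the financial domain, wherein each point of interest includes an information description and a geographic location;
    a matching module, configured to match a topic distribution probability to each point of interest according to the information description in that point of interest;
    a construction module, configured to construct an index path according to the topic distribution probabilities and the geographic locations;
    a screening module, configured to screen, according to the index path, point-of-interest information similar to a query subject.
  9. The point-of-interest query apparatus based on semantic understanding according to claim 8, characterized in that the matching module comprises:
    a statistics unit, configured to count a first keyword set in the specified database and a second keyword set in each point-of-interest topic;
    a calculation unit, configured to calculate the topic distribution probability of the second keyword set relative to the first keyword set.
  10. The point-of-interest query apparatus based on semantic understanding according to claim 8, characterized in that the construction module comprises:
    an acquisition unit, configured to acquire weight settings for indexing by geographic location and indexing by topic distribution probability;
    a construction unit, configured to construct the index path according to the weight settings.
  11. The point-of-interest query apparatus based on semantic understanding according to claim 10, characterized in that the construction unit comprises:
    an organization subunit, configured to organize all points of interest of the specified database in a geospatial layer according to geographic similarity;
    a refinement subunit, configured to refine the points of interest in a topic layer according to the closeness of their topic distribution probabilities;
    an establishment subunit, configured to establish, from the points of interest refined in the topic layer, a high-dimensional index path over the geospatial layer and the topic layer by means of IDistance.
  12. The point-of-interest query apparatus based on semantic understanding according to claim 11, characterized in that the construction unit comprises:
    a construction subunit, configured to construct, based on N-Gram, a sketch of the topic layer in a text layer so as to refine the points of interest.
  13. The point-of-interest query apparatus based on semantic understanding according to claim 11, characterized in that the index path includes index nodes, and the screening module comprises:
    a first receiving unit, configured to receive a query subject entered by a user;
    a query unit, configured to start from the root node of the NIQ-tree and to visit and query, in turn, the index node with the smallest matching distance;
    a judging unit, configured to judge whether the relevance between the index node and the query subject is within a threshold condition;
    a retrieval unit, configured to retrieve, if the relevance between the index node and the query subject is within the threshold condition, the information data of the index node as the point-of-interest information similar to the query subject.
  14. The point-of-interest query apparatus based on semantic understanding according to claim 13, characterized in that the judging unit comprises:
    a judging subunit, configured to judge whether the geographic proximity between the index node and the query subject and/or the topic-distribution-probability similarity between the index node and the query subject is within a preset range;
    a determination subunit, configured to determine, if within the preset range, that the relevance between the index node and the query subject is within the threshold condition, and otherwise that it is not within the threshold condition.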
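  15. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, characterized in that the processor, when executing the computer-readable instructions, implements a point-of-interest query method based on semantic understanding, the method comprising:
    acquiring a plurality of points of interest in a specified database in the financial domain, wherein each point of interest includes an information description and a geographic location;
    matching a topic distribution probability to each point of interest according to the information description in that point of interest;
    constructing an index path according to the topic distribution probabilities and the geographic locations;
    screening, according to the index path, point-of-interest information similar to a query subject.
  16. The computer device according to claim 15, characterized in that the step of matching a topic distribution probability to each point of interest according to the information description in that point of interest comprises:
    counting a first keyword set in the specified database and a second keyword set in each point-of-interest topic;
    calculating the topic distribution probability of the second keyword set relative to the first keyword set.
  17. The computer device according to claim 15, characterized in that the step of constructing an index path according to the topic distribution probabilities and the geographic locations comprises:
    acquiring weight settings for indexing by geographic location and indexing by topic distribution probability;
    constructing the index path according to the weight settings.
  18. A non-volatile computer-readable storage medium storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by a processor, implement a point-of-interest query method based on semantic understanding, the method comprising:
    acquiring a plurality of points of interest in a specified database in the financial domain, wherein each point of interest includes an information description and a geographic location;
    matching a topic distribution probability to each point of interest according to the information description in that point of interest;
    constructing an index path according to the topic distribution probabilities and the geographic locations;
    screening, according to the index path, point-of-interest information similar to a query subject.
  19. The non-volatile computer-readable storage medium according to claim 18, characterized in that the step of matching a topic distribution probability to each point of interest according to the information description in that point of interest comprises:
    counting a first keyword set in the specified database and a second keyword set in each point-of-interest topic;
    calculating the topic distribution probability of the second keyword set relative to the first keyword set.
  20. The non-volatile computer-readable storage medium according to claim 18, characterized in that the step of constructing an index path according to the topic distribution probabilities and the geographic locations comprises:
    acquiring weight settings for indexing by geographic location and indexing by topic distribution probability;
    constructing the index path according to the weight settings.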
PCT/CN2018/095502 2018-04-17 2018-07-12 Point-of-interest query method, apparatus, and computer device based on semantic understanding WO2019200752A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810345252.6 2018-04-17
CN201810345252.6A CN108763293A (zh) 2018-04-17 2018-04-17 Point-of-interest query method, apparatus, and computer device based on semantic understanding

Publications (1)

Publication Number Publication Date
WO2019200752A1 true WO2019200752A1 (zh) 2019-10-24

Family

ID=64010803

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/095502 WO2019200752A1 (zh) 2018-04-17 2018-07-12 Point-of-interest query method, apparatus, and computer device based on semantic understanding

Country Status (2)

Country Link
CN (1) CN108763293A (zh)
WO (1) WO2019200752A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
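CN102722509A (zh) * 2010-10-06 2012-10-10 通用汽车环球科技运作有限责任公司 Neighborhood guide for a semantic search system and method to support local point-of-interest discovery
CN104679801A (zh) * 2013-12-03 2015-06-03 高德软件有限公司 Point-of-interest search method and apparatus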
US9817907B1 (en) * 2014-06-18 2017-11-14 Google Inc. Using place of accommodation as a signal for ranking reviews and point of interest search results

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU, HUIWEN: "Query Processing for Large-scale Semantic Trajectories", CHINESE MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE & TECHNOLOGY, 15 April 2018 (2018-04-15), ISSN: 1674-0246 *

Also Published As

Publication number Publication date
CN108763293A (zh) 2018-11-06

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 01.02.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18915005

Country of ref document: EP

Kind code of ref document: A1