CN110674367B - Single Chinese character retrieval method and device based on travel industry products - Google Patents

Publication number
CN110674367B
CN110674367B CN201910855488.9A CN201910855488A CN110674367B CN 110674367 B CN110674367 B CN 110674367B CN 201910855488 A CN201910855488 A CN 201910855488A CN 110674367 B CN110674367 B CN 110674367B
Authority
CN
China
Prior art keywords
matching point
compared
matching
coordinate data
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910855488.9A
Other languages
Chinese (zh)
Other versions
CN110674367A (en
Inventor
王星杰
洪晓
李少辉
朱少东
赵文志
王偕旭
李乐天
谢富成
刘骏杰
陈光站
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Guangzhilv International Travel Service Co ltd
Guangzhou Yiqixing Information Technology Co ltd
Original Assignee
Guangzhou Guangzhilv International Travel Service Co ltd
Guangzhou Yiqixing Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Guangzhilv International Travel Service Co ltd, Guangzhou Yiqixing Information Technology Co ltd
Priority to CN201910855488.9A
Publication of CN110674367A
Application granted
Publication of CN110674367B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a single Chinese character retrieval method and a single Chinese character retrieval device based on products in the tourism industry. The method constructs a coordinate system for each document to be compared from the characters in that document and the query characters input by a user; matches the characters of the document to be compared against the query characters in each coordinate system and obtains a plurality of matching points according to a preset screening strategy; connects the matching points to obtain a plurality of similarity judgment lines; calculates the score of each similarity judgment line according to the influence degree of the position, the position sequence degree of the characters, and the continuity degree; obtains the similarity score between the query characters and each document to be compared from the scores of the similarity judgment lines in that document; and sorts the documents to be compared by similarity score to obtain a query result. With this technical scheme, the problem that tourism products are difficult to score and rank by similarity because of word meaning loss under a word table method full-text retrieval system can be solved.

Description

Single Chinese character retrieval method and device based on travel industry products
Technical Field
The invention relates to a search engine technology, in particular to a single Chinese character retrieval method and a single Chinese character retrieval device based on products in the tourism industry.
Background
Currently, in the field of Online Travel Agents (OTA), accurately locating the travel product a user is searching for in a massive travel product library, and recommending the travel products with the highest relevance to the user, helps improve user retention and the order conversion rate. An efficient travel product search method is therefore an important precondition for rapid traffic conversion.
In the prior art, a word list method or word table method full-text retrieval system is generally adopted to retrieve tourism products. The word list method full-text retrieval system first segments the search content input by the user into words, then extracts keywords from the segmented content, and finally retrieves related travel products according to the keywords. However, the word list method full-text retrieval system depends on a preset dictionary base, which must be continuously maintained and updated. Moreover, in the OTA field, product information for group tours, hotels, entrance tickets, and the like contains many proper nouns (e.g., "high-speed rail tour", "japanese tour", "ulan", "weijing international"), so a search method based on the word list method full-text retrieval system may segment phrases incorrectly, resulting in low precision. In addition, new words emerge endlessly in the OTA field, and it is difficult for a dictionary to record all of them.
The word table method full-text retrieval system first splits the search content input by the user into single characters, and then retrieves related travel products according to those characters. It does not depend on a word bank or on word segmentation, and its recall ratio can reach 100%. However, because the existing word table method full-text retrieval system splits the search content into single characters, the number of related travel products retrieved is large; moreover, splitting the search content into single characters loses word meaning, which makes it difficult to score and rank the retrieved travel products by similarity. As a result, users must spend a large amount of time finding the travel products relevant to their needs, which in turn reduces user retention and the order conversion rate.
Disclosure of Invention
The embodiment of the invention provides a single Chinese character retrieval method and a single Chinese character retrieval device based on products in the tourism industry, which can solve the problem that the tourism products are difficult to be subjected to similarity scoring/sequencing due to word meaning deficiency under a word table method full-text retrieval system.
The embodiment of the invention provides a single Chinese character retrieval method based on products in the tourism industry, which comprises the following steps:
acquiring N query characters input by a user, wherein N is a positive integer greater than 0;
acquiring all documents to be compared according to the N query characters, wherein each document to be compared comprises a plurality of characters, and the document to be compared comprises the N query characters;
respectively constructing a plane rectangular coordinate system corresponding to each document to be compared according to the characters in each document to be compared and the N query characters; one document to be compared corresponds to one rectangular plane coordinate system;
matching, in each plane rectangular coordinate system, the characters of the document to be compared that are identical to the N query characters, to obtain coordinates of a plurality of matching points that form first matching point coordinate data; one rectangular plane coordinate system corresponds to one first matching point coordinate data;
screening the conflict matching points in the coordinate data of each first matching point according to a preset screening strategy to obtain coordinate data of a second matching point;
connecting the matching points in the second matching point coordinate data according to a preset connection rule to obtain a plurality of similarity judgment lines;
respectively calculating the similarity score of each similarity judgment line according to the influence degree of the position, the position sequence degree of the character and the continuity degree;
respectively obtaining similarity scores between the N query characters and the documents to be compared according to the similarity scores of the similarity judgment lines in the documents to be compared;
and sequencing the documents to be compared according to the similarity scores of the documents to be compared to obtain a query result.
As a preferred scheme, according to a preset screening strategy, screening the conflicting matching points in each first matching point coordinate data to obtain second matching point coordinate data, specifically:
judging whether the first matching point coordinate data has a conflict matching point; if the first matching point coordinate data does not have the conflict matching point, marking the first matching point data as second matching point coordinate data;
if the first matching point coordinate data contains the conflict matching points, extracting all the conflict matching points from the first matching point coordinate data;
calculating a cosine value between each straight line and the abscissa axis according to the straight line determined by each conflict matching point and a matching point adjacent to the right side of the conflict matching point, and sequentially obtaining a first cosine value corresponding to each conflict matching point;
respectively calculating the difference between the first cosine value and the optimal reference value, and acquiring a conflict matching point with the minimum difference;
and eliminating all the conflict matching points in the first matching point coordinate data except the conflict matching point with the minimum difference value to obtain second matching point coordinate data.
Preferably, the similarity score of each of the similarity judgment lines is calculated as follows:
[Formula image BDA0002196134290000031 not reproduced in this text]
wherein f(X_i) is the degree of influence of position, sim(L_i) is the degree of the positional order of the characters, and (X_{i+1} - X_i) is the degree of continuity.
Preferably, the influence degree of the position is calculated as follows:
[Formula image BDA0002196134290000033 not reproduced in this text]
wherein f(X_i) takes values in (0,1); the constant a is a hyperparameter that can be adjusted according to the actual situation; and X_i is the abscissa value of the i-th matching point.
Preferably, the position order degree of the characters is calculated as follows:
[Formula image BDA0002196134290000034 not reproduced in this text]
wherein L_i is the straight line passing through the points (X_i, Y_i) and (X_{i+1}, Y_{i+1}), and cos θ is the cosine of the angle between the abscissa axis and the line through the matching point M_i = (X_i, Y_i) and its right-adjacent matching point M_{i+1} = (X_{i+1}, Y_{i+1}).
Preferably, the continuity is obtained from a difference in abscissa values between two matching points passing through the similarity judgment line.
Correspondingly, this embodiment also provides a single Chinese character retrieval device based on products in the tourism industry, comprising:
an acquisition module, configured to acquire N query characters input by a user, wherein N is a positive integer greater than 0;
the query module is used for acquiring all documents to be compared according to the N query characters, wherein each document to be compared comprises a plurality of characters, and the document to be compared comprises the N query characters;
the rectangular coordinate system generating module is used for respectively constructing a planar rectangular coordinate system corresponding to each document to be compared according to the characters in each document to be compared and the N query characters; one document to be compared corresponds to one rectangular plane coordinate system;
the matching data acquisition module is used for matching, in each rectangular plane coordinate system, the characters of the document to be compared that are identical to the N query characters, to acquire coordinates of a plurality of matching points that form first matching point coordinate data; one rectangular plane coordinate system corresponds to one first matching point coordinate data;
the matching point data screening module is used for screening the conflict matching points in the first matching point coordinate data according to a preset screening strategy to obtain second matching point coordinate data;
a similarity judgment line obtaining module, configured to connect matching points in each of the second matching point coordinate data according to a preset connection rule to obtain multiple similarity judgment lines;
the first calculation module is used for respectively calculating the similarity score of each similarity judgment line according to the influence degree of the position, the position sequence degree of the character and the continuity degree;
a second calculation module, configured to obtain similarity scores between the N query characters and the documents to be compared, respectively, according to similarity scores of the similarity judgment lines in the documents to be compared;
and the ranking module is used for ranking the documents to be compared according to the similarity scores of the documents to be compared to obtain a query result.
As a preferred scheme, according to a preset screening strategy, screening the conflicting matching points in each first matching point coordinate data to obtain second matching point coordinate data, specifically:
judging whether the first matching point coordinate data has a conflict matching point; if the first matching point coordinate data does not have the conflict matching point, marking the first matching point data as second matching point coordinate data;
if the first matching point coordinate data contains the conflict matching points, extracting all the conflict matching points from the first matching point coordinate data;
calculating a cosine value between each straight line and the abscissa axis according to the straight line determined by each conflict matching point and a matching point adjacent to the right side of the conflict matching point, and sequentially obtaining a first cosine value corresponding to each conflict matching point;
respectively calculating the difference between the first cosine value and the optimal reference value, and acquiring a conflict matching point with the minimum difference;
and eliminating all the conflict matching points in the first matching point coordinate data except the conflict matching point with the minimum difference value to obtain second matching point coordinate data.
The embodiment of the invention has the following beneficial effects:
The embodiment of the invention provides a single Chinese character retrieval method based on products in the tourism industry. The method constructs a coordinate system for each document to be compared from the characters in that document and the query characters input by a user; matches the characters of the document to be compared against the query characters in each coordinate system and obtains a plurality of matching points according to a preset screening strategy; connects the matching points to obtain a plurality of similarity judgment lines; calculates the score of each similarity judgment line according to the influence degree of the position, the position sequence degree of the characters, and the continuity degree; obtains the similarity score between the query characters and each document to be compared from the scores of the similarity judgment lines in that document; and sorts the documents to be compared by similarity score to obtain a query result. Compared with the prior art, in which tourism products are retrieved by word table method full-text retrieval, the invention can solve the problem that tourism products are difficult to score and rank by similarity because of word meaning loss under a word table method full-text retrieval system.
Drawings
Fig. 1 is a schematic flow chart of a first embodiment of the single Chinese character retrieval method based on products in the travel industry according to the present invention;
Fig. 2 is a schematic structural diagram of a second embodiment of the single Chinese character retrieval device based on products in the travel industry according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a first embodiment of the single Chinese character retrieval method based on products in the travel industry according to the present invention. As shown in Fig. 1, the method includes steps 101 to 109, as follows:
Step 101: acquiring N query characters input by a user, wherein N is a positive integer greater than 0.
Step 102: acquiring all documents to be compared according to the N query characters, wherein each document to be compared comprises a plurality of characters and contains the N query characters.
In this embodiment, according to the N query characters, the word table method full-text retrieval system is used to obtain all the documents to be compared, which is beneficial to ensuring that the recall ratio of the obtained documents to be compared is 100%; meanwhile, the word list method full-text retrieval system does not need to depend on word banks and word segmentation, and has strong new word processing capacity.
In this embodiment, the documents to be compared refer to travel products. For example, the N query characters input by the user are "shanghai double-flying", the obtained document to be compared is "shanghai double-flying 5 tours", and the document to be compared of "shanghai double-flying 5 tours" contains the four query characters input by the user, i.e., "shanghai double-flying".
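The candidate retrieval in step 102 can be illustrated with a single-character inverted index. The following is a minimal sketch, assuming product titles are plain strings; the function names `build_char_index` and `candidates` and the sample titles are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def build_char_index(products):
    """Map each single character to the set of product ids whose title contains it."""
    index = defaultdict(set)
    for pid, title in enumerate(products):
        for ch in title:
            index[ch].add(pid)
    return index


def candidates(index, query):
    """Return ids of products that contain every query character (order ignored)."""
    posting_lists = [index.get(ch, set()) for ch in query]
    if not posting_lists:
        return set()
    return set.intersection(*posting_lists)


# Hypothetical product titles, used only for illustration.
products = ["上海双飞5日游", "上海迪士尼门票", "北京双飞4日游"]
index = build_char_index(products)
print(sorted(candidates(index, "上海双飞")))
```

Because no dictionary or word segmentation is involved, every product containing all query characters is returned; ordering those candidates is left to the later steps.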
Step 103: respectively constructing a plane rectangular coordinate system corresponding to each document to be compared according to the characters in each document to be compared and the N query characters; wherein, one document to be compared corresponds to one rectangular plane coordinate system.
For example, the N query characters are "广州宾馆" (Guangzhou Hotel) and the document to be compared is "广州宾馆(原广州酒店)" (Guangzhou Hotel (formerly Guangzhou Hotel)). Taking the N query characters as the ordinate and the document to be compared as the abscissa, with each character occupying one unit length, a plane rectangular coordinate system is established, wherein the coordinates of the query characters are: P_广(0,0), P_州(0,1), P_宾(0,2), P_馆(0,3); and the coordinates of the characters in the document to be compared are: P_广(0,0), P_州(1,0), P_宾(2,0), P_馆(3,0), P_((4,0), P_原(5,0), P_广(6,0), P_州(7,0), P_酒(8,0), P_店(9,0), P_)(10,0).
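The coordinate assignment in this example can be sketched as follows. This is an illustrative sketch only: the function name `axis_coordinates` is hypothetical, and the Chinese strings are an assumption reconstructed from the character-by-character translation of the example (query "广州宾馆", document "广州宾馆(原广州酒店)").

```python
def axis_coordinates(query, document):
    """Place query characters on the ordinate and document characters on the abscissa.

    Each character occupies one unit length, so the k-th query character sits at
    (0, k) and the k-th document character sits at (k, 0).
    """
    query_coords = [(ch, (0, y)) for y, ch in enumerate(query)]
    doc_coords = [(ch, (x, 0)) for x, ch in enumerate(document)]
    return query_coords, doc_coords


query_coords, doc_coords = axis_coordinates("广州宾馆", "广州宾馆(原广州酒店)")
print(query_coords)
print(doc_coords)
```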
As an example of this embodiment, the rectangular plane coordinate system may instead be established by taking the query characters as the abscissa and the document to be compared as the ordinate. In that case, the single Chinese character retrieval method can still be implemented by correspondingly adjusting the formulas in the subsequent steps.
Step 104: in each plane rectangular coordinate system, matching the characters of the document to be compared that are identical to the N query characters, to obtain the coordinates of a plurality of matching points that form first matching point coordinate data; one planar rectangular coordinate system corresponds to one first matching point coordinate data.
For example, the N query characters are "广州宾馆" (Guangzhou Hotel) and the document to be compared is "广州宾馆(原广州酒店)". In the rectangular plane coordinate system constructed from them, the characters of the document to be compared that are identical to the query characters, namely "广", "州", "宾", "馆", "广", "州", are matched, and the coordinates corresponding to these characters, P_广1(0,0), P_州1(1,1), P_宾(2,2), P_馆1(3,3), P_广2(6,0), P_州2(7,1), form the first matching point coordinate data.
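The matching in step 104 amounts to collecting every (document position, query position) pair whose characters are equal. A minimal sketch, assuming the document is on the abscissa and the query on the ordinate as in the example; `match_points` is a hypothetical name, and the Chinese strings are the presumed originals of the translated example.

```python
def match_points(query, document):
    """Return (x, y) for every document character at x equal to the query character at y."""
    points = []
    for x, doc_ch in enumerate(document):
        for y, query_ch in enumerate(query):
            if doc_ch == query_ch:
                points.append((x, y))
    return points  # ordered by abscissa, then by ordinate


print(match_points("广州宾馆", "广州宾馆(原广州酒店)"))
```

For the example pair this yields the six matching points listed in the text.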
Step 105: screening the conflict matching points in each first matching point coordinate data according to a preset screening strategy to obtain second matching point coordinate data.
In this embodiment, step 105 specifically includes: judging whether the first matching point coordinate data contains conflict matching points; if it does not, marking the first matching point coordinate data as the second matching point coordinate data; if it does, extracting all the conflict matching points from the first matching point coordinate data; calculating, for the straight line determined by each conflict matching point and the matching point adjacent to its right, the cosine value of the angle between that line and the abscissa axis, thereby obtaining in sequence a first cosine value for each conflict matching point; calculating the difference between each first cosine value and the optimal reference value, and taking the conflict matching point with the minimum difference; and eliminating all conflict matching points in the first matching point coordinate data except the one with the minimum difference, to obtain the second matching point coordinate data. Matching points are called conflict matching points when several of them share the same abscissa value.
For example, if the N query characters input by the user are "广州宾馆" (Guangzhou Hotel) and the document to be compared is "广州宾馆(原广州酒店)", then in the rectangular plane coordinate system constructed from them, the characters of the document to be compared that are identical to the query characters, namely "广", "州", "宾", "馆", "广", "州", are matched, and the coordinates corresponding to these characters, P_广1(0,0), P_州1(1,1), P_宾(2,2), P_馆1(3,3), P_广2(6,0), P_州2(7,1), form the first matching point coordinate data. This first matching point coordinate data contains no conflict matching points, so it is marked as the second matching point coordinate data.
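The conflict criterion (several matching points on the same abscissa value) can be checked by grouping points by their abscissa. A minimal sketch; the name `find_conflicts` is hypothetical.

```python
from collections import defaultdict


def find_conflicts(points):
    """Group matching points by abscissa and keep only groups with more than one point."""
    by_x = defaultdict(list)
    for x, y in points:
        by_x[x].append((x, y))
    return {x: sorted(group) for x, group in by_x.items() if len(group) > 1}


first_matching_points = [(0, 0), (1, 1), (2, 2), (3, 3), (6, 0), (7, 1)]
print(find_conflicts(first_matching_points))  # an empty dict means no conflicts
```

When the result is empty, the first matching point coordinate data can be used directly as the second matching point coordinate data.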
If the N query characters input by the user are "广东广州宾馆" (Guangdong Guangzhou Hotel), the document to be compared is "广东广州宾馆(原广州酒店)" (Guangdong Guangzhou Hotel (formerly Guangzhou Hotel)). In the plane rectangular coordinate system constructed from them, the coordinates of the query characters are: P_广(0,0), P_东(0,1), P_广(0,2), P_州(0,3), P_宾(0,4), P_馆(0,5); and the coordinates of the characters in the document to be compared are: P_广(0,0), P_东(1,0), P_广(2,0), P_州(3,0), P_宾(4,0), P_馆(5,0), P_((6,0), P_原(7,0), P_广(8,0), P_州(9,0), P_酒(10,0), P_店(11,0), P_)(12,0). The characters of the document to be compared that are identical to the query characters, namely "广", "东", "广", "州", "宾", "馆", "广", "州", are matched, and the coordinates corresponding to these characters, P_广1(0,0), P_广1(0,2), P_东(1,1), P_广2(2,0), P_广2(2,2), P_州1(3,3), P_宾(4,4), P_馆(5,5), P_广3(8,0), P_广3(8,2), P_州2(9,3), form the first matching point coordinate data. This first matching point coordinate data contains conflict matching points for the character "广", so the calculation proceeds as follows:
First, the conflict matching points are taken in ascending order of ordinate value and denoted M_ij, where the index i indicates the i-th matching point and the index j indicates the j-th conflict (the candidate points on the same abscissa share the same j): P_广1(0,0) is M_11, P_广1(0,2) is M_21, P_广2(2,0) is M_42, P_广2(2,2) is M_52, P_广3(8,0) is M_93, and P_广3(8,2) is M_10,3. Taking P_广2(2,0) and P_广2(2,2) as an example, the cosine value of the angle between the abscissa axis and the straight line through each conflict matching point (X_i, Y_i) and its right-adjacent matching point P_州1(3,3), written (X_{i+1}, Y_{i+1}), is calculated as follows:
$$\cos\theta = \frac{X_{i+1} - X_i}{\sqrt{(X_{i+1} - X_i)^2 + (Y_{i+1} - Y_i)^2}}$$
Substituting P_广2(2,0) and P_州1(3,3) into the formula gives
$$\cos\theta_1 = \frac{3 - 2}{\sqrt{(3 - 2)^2 + (3 - 0)^2}} = \frac{1}{\sqrt{10}} \approx 0.316$$
In the same way, for P_广2(2,2) and its adjacent matching point P_州1(3,3), the cosine value is:
$$\cos\theta_2 = \frac{3 - 2}{\sqrt{(3 - 2)^2 + (3 - 2)^2}} = \frac{1}{\sqrt{2}} \approx 0.707$$
Next, the differences between cos θ_1 and cos θ_2 and the optimal reference value, the cosine of 45 degrees (approximately 0.707), are calculated respectively, and the conflict matching point with the smaller difference is selected. Since cos θ_2 equals the reference value, the conflict matching point P_广2(2,2) is selected and the conflict matching point P_广2(2,0) is removed.
In this embodiment, if M_i is a conflict matching point and M_{i+1} is also a conflict matching point, the selection of M_{i+2} should be determined first, after which the calculation returns to M_{i+1}, and so on.
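The screening rule can be sketched as a right-to-left pass, so that the right-adjacent matching point is always settled before it is used. This is a sketch under stated assumptions, not the patent's implementation: the optimal reference value is taken to be cos 45°, ties are broken in favour of the smaller ordinate, and a conflict group with no matching point to its right keeps its smallest-ordinate point. The last two rules are assumptions the text does not specify, and `resolve_conflicts` and `cos_to_x_axis` are hypothetical names.

```python
import math
from collections import defaultdict

COS_45 = math.cos(math.radians(45))  # assumed optimal reference value


def cos_to_x_axis(p, q):
    """Cosine of the angle between the abscissa axis and the line from p to q (q is right of p)."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    return dx / math.hypot(dx, dy)


def resolve_conflicts(points):
    """Keep one matching point per abscissa, choosing the one closest to 45 degrees."""
    by_x = defaultdict(list)
    for x, y in points:
        by_x[x].append((x, y))

    kept = []
    right = None  # already-resolved matching point adjacent on the right
    for x in sorted(by_x, reverse=True):  # right to left
        group = sorted(by_x[x])  # ascending ordinate
        if len(group) == 1 or right is None:
            chosen = group[0]  # assumption: no right neighbour -> smallest ordinate
        else:
            # min() returns the first minimum, so ties go to the smaller ordinate (assumption)
            chosen = min(group, key=lambda p: abs(cos_to_x_axis(p, right) - COS_45))
        kept.append(chosen)
        right = chosen
    return sorted(kept)


first = [(0, 0), (0, 2), (1, 1), (2, 0), (2, 2), (3, 3),
         (4, 4), (5, 5), (8, 0), (8, 2), (9, 3)]
print(resolve_conflicts(first))
```

For the conflict at abscissa 2, the line from (2,2) to (3,3) has cosine 1/√2, which equals the reference value, whereas the line from (2,0) to (3,3) has cosine 1/√10, so (2,2) is the point kept. Note that the cosine alone cannot distinguish a line rising at 45° from one falling at 45°, which is why a tie-break is needed at abscissa 0.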
Step 106: connecting the matching points in each second matching point coordinate data according to a preset connection rule to obtain a plurality of similarity judgment lines.
In this embodiment, taking the plane rectangular coordinate system constructed from "广州宾馆" and "广州宾馆(原广州酒店)" as an example, the matching points in the second matching point coordinate data are P_广1(0,0), P_州1(1,1), P_宾(2,2), P_馆1(3,3), P_广2(6,0), P_州2(7,1). Connecting the matching points from left to right gives L_1, L_2, L_3, L_4, L_5, wherein L_1 is the straight line passing through P_广1(0,0) and P_州1(1,1), L_2 is the straight line passing through P_州1(1,1) and P_宾(2,2), L_3 is the straight line passing through P_宾(2,2) and P_馆1(3,3), L_4 is the straight line passing through P_馆1(3,3) and P_广2(6,0), and L_5 is the straight line passing through P_广2(6,0) and P_州2(7,1).
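The left-to-right connection rule used in this example pairs each matching point with the next one along the abscissa. A minimal sketch, assuming conflicts have already been screened so that each abscissa holds one point; `similarity_lines` is a hypothetical name.

```python
def similarity_lines(points):
    """Connect consecutive matching points, ordered by abscissa, into line segments."""
    ordered = sorted(points)
    return list(zip(ordered, ordered[1:]))


second_matching_points = [(0, 0), (1, 1), (2, 2), (3, 3), (6, 0), (7, 1)]
for i, line in enumerate(similarity_lines(second_matching_points), start=1):
    print(f"L{i}: {line[0]} -> {line[1]}")
```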
Step 107: calculating the similarity score of each similarity judgment line according to the influence degree of the position, the position sequence degree of the characters, and the continuity.
In this embodiment, the influence of the location refers to a relationship between the location of the N query characters input by the user and the starting location of the document to be compared, for example, if the N query characters are "guangzhou", the document to be compared S1 is "guangzhou hotel", and the document to be compared S2 is "guangdong guangzhou hotel", the correlation between "guangzhou" and "guangzhou hotel" is higher than the correlation between "guangzhou" and "guangdong guangzhou hotel".
In this embodiment, the position order degree of the characters refers to the relationship between the order of the characters in the N query characters input by the user and the order of the same characters in the document to be compared. For example, if the N query characters are "guangzhou", the document to be compared S1 is "guangzhou winston", and the document to be compared S2 is "winston hotel in guangxi congratu", then the correlation between "guangzhou" and S1 is higher than the correlation between "guangzhou" and S2.
In this embodiment, the continuity refers to the length of the run of characters in the document to be compared that continuously match the N query characters input by the user. For example, if the N query characters are "guangdong zhou", the document to be compared S1 is "guangdong guangzhou shengliang", and the document to be compared S2 is "guangdong guangzhou", the number of characters continuously matched between the query and S1 is greater than that between the query and S2, so S1 has a higher correlation with the query than S2.
In this embodiment, the similarity score of each similarity judgment line is calculated as follows:
[Formula image BDA0002196134290000101 not reproduced in this text]
wherein f(X_i) is the degree of influence of position, sim(L_i) is the degree of the positional order of the characters, (X_{i+1} - X_i) is the degree of continuity, and [symbol image BDA0002196134290000103] represents the similarity score of the i-th similarity judgment line.
In this embodiment, when calculating the influence degree of the position, the smaller the abscissa of the starting matching point of a line segment, the greater the influence degree. Therefore
[Formula image BDA0002196134290000104 not reproduced in this text]
is used as the formula for calculating the influence degree of the position, wherein f(X_i) takes values in (0,1); the constant a is a hyperparameter that can be adjusted according to the actual situation, with a value range of (-∞, +∞); and X_i is the abscissa value of the i-th matching point.
In this embodiment, when calculating the positional order degree of the characters, the inventors found through a large number of experiments that when the N query characters match in position and are consecutive, with θ ∈ [0°, 90°), the similarity judgment line connecting them has slope k = 1 and forms an angle θ of 45° with the abscissa axis. The score is therefore highest when θ is 45°. When θ ∈ [90°, 180°], the score is punitive, i.e. negative, and the penalty grows as θ increases. This gives the following piecewise formula for the similarity:
[Formula image: Figure BDA0002196134290000105]
The positional order degree of the characters is therefore calculated as:
[Formula image: Figure BDA0002196134290000106]
where L_i is the similarity judgment line passing through matching point M_i(X_i, Y_i) and matching point M_{i+1}(X_{i+1}, Y_{i+1}), and cos θ is the cosine of the angle between the abscissa axis and the line through matching point M_i = (X_i, Y_i) and its right-adjacent matching point M_{i+1} = (X_{i+1}, Y_{i+1}). sim(L_i) thus controls whether a line adds to or subtracts from the score, which effectively improves precision.
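The geometry behind cos θ can be sketched as follows. This is only an illustration: the piecewise formula itself is in an image that is not reproduced here, so the sketch computes just the angle and its cosine. It assumes that θ is the inclination angle of the line in [0°, 180°), which is one reading that makes both ranges mentioned above reachable; the function name `inclination_cosine` is hypothetical.

```python
import math

def inclination_cosine(m_i, m_next):
    """Return (theta_degrees, cos_theta) for the line through two matching points.

    Assumption: theta is the inclination angle of the line, folded into
    [0, 180) degrees, so a falling line gives an obtuse angle.
    """
    dx = m_next[0] - m_i[0]
    dy = m_next[1] - m_i[1]
    if dx == 0 and dy == 0:
        raise ValueError("matching points must be distinct")
    theta = math.degrees(math.atan2(dy, dx)) % 180.0
    return theta, math.cos(math.radians(theta))

# Two consecutive, in-order matches: slope 1, theta = 45 degrees.
theta_up, cos_up = inclination_cosine((3, 1), (4, 2))
# Out-of-order matches: slope -1, theta = 135 degrees, negative cosine.
theta_down, cos_down = inclination_cosine((1, 2), (2, 1))
```

Under this reading, a line with slope 1 has a cosine of about 0.707, and any line with θ in (90°, 180°) has a negative cosine, which agrees with the penalty behaviour described above.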
In this embodiment, when calculating continuity, the inventors found through a large number of experiments that the longer a similarity judgment line is, the greater its influence. The continuity of a similarity judgment line is therefore calculated as len(L_i) = (X_{i+1} - X_i), the difference between the abscissa values of the two matching points that the line passes through.
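A minimal sketch of the continuity term len(L_i) = X_{i+1} - X_i is given below. It assumes that each matching point is joined to its right-adjacent neighbour after sorting by abscissa; the patent's actual "preset connection rule" is not stated in this passage, and the function names are hypothetical.

```python
def continuity(m_i, m_next):
    """len(L_i): abscissa difference between the two endpoints of a line."""
    return m_next[0] - m_i[0]

def continuities(points):
    """Continuity of each line joining abscissa-sorted neighbouring points.

    Assumption: lines connect each matching point to the next one to its right.
    """
    ordered = sorted(points)
    return [continuity(a, b) for a, b in zip(ordered, ordered[1:])]

lengths = continuities([(7, 3), (3, 1), (4, 2)])  # [1, 3]
```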
Step 108: Obtain the similarity score between the N query characters and each document to be compared from the similarity scores of the similarity judgment lines in that document.
In this embodiment, the similarity scores of the similarity judgment lines in each document to be compared are summed, using the following formula:
[Formula image: Figure BDA0002196134290000111]
where L_i denotes the i-th similarity judgment line and the summed term (Figure BDA0002196134290000112) denotes the similarity score of the i-th similarity judgment line. This formula gives the similarity score of each document to be compared; the higher the score, the higher the similarity. In general, if none of a document's similarity judgment lines has slope k = 1, that document is eliminated.
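The summation and the elimination rule can be sketched as below. The per-line scores are taken as given inputs, because the formulas that produce them are in images not reproduced here. The function name `document_score` and the use of `None` to mark an eliminated document are assumptions made for illustration.

```python
import math

def document_score(lines):
    """Sum the scores of a document's similarity judgment lines.

    `lines` is an iterable of (slope, line_score) pairs. Returns None when no
    line has slope 1, which this sketch uses to mark an eliminated document.
    """
    lines = list(lines)
    if not any(math.isclose(slope, 1.0) for slope, _ in lines):
        return None
    return sum(score for _, score in lines)

kept = document_score([(1.0, 0.8), (0.0, 0.1)])   # has a slope-1 line
dropped = document_score([(0.5, 0.3)])            # no slope-1 line
```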
Step 109: Sort the documents to be compared by their similarity scores to obtain the query result.
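A minimal sketch of this sorting step follows. It assumes the document scores are already available in a dictionary and that an eliminated document is represented by `None`; both conventions, and the name `rank_documents`, are illustrative rather than taken from the patent.

```python
def rank_documents(scored_docs):
    """Return (document, score) pairs sorted from highest to lowest score.

    `scored_docs` maps each document to its similarity score, or to None if
    the document was eliminated (an assumed convention).
    """
    kept = [(doc, score) for doc, score in scored_docs.items() if score is not None]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

result = rank_documents({"A": 0.4, "B": 1.2, "C": None})
```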
In view of the above, the single Chinese character retrieval method based on travel industry products according to the embodiment of the present invention constructs a coordinate system for each document to be compared from the characters in that document and the query characters input by the user; matches the characters of the document to be compared against the query characters in each coordinate system and obtains a plurality of matching points according to a preset screening strategy; connects the matching points to obtain a plurality of similarity judgment lines; calculates the score of each similarity judgment line from the influence degree of position, the positional order degree of the characters, and the continuity; obtains the similarity score between the query characters and each document to be compared from the scores of the similarity judgment lines in that document; and sorts the documents to be compared by similarity score to obtain the query result. Compared with the prior art, in which travel products are searched by word-list full-text retrieval, the method solves the problem that travel products are difficult to score and rank by similarity because word meaning is lost under a word-list full-text retrieval system.
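The first stage of this pipeline, building the matching points for one document, might look like the sketch below. Several details are assumptions: document positions are placed on the abscissa and query positions on the ordinate, both counted from 1, and the sample strings are invented for illustration. The function name `matching_points` is hypothetical.

```python
def matching_points(document, query):
    """Return (x, y) for every position where a document character equals a
    query character.

    Assumption: x is the 1-based position in the document and y is the
    1-based position in the query.
    """
    return [
        (x, y)
        for x, doc_char in enumerate(document, start=1)
        for y, query_char in enumerate(query, start=1)
        if doc_char == query_char
    ]

# Illustrative strings only: the query character "广" matches twice.
points = matching_points("广东广州", "广州")  # [(1, 1), (3, 1), (4, 2)]
```

In this example the last two points lie on a line of slope 1, which is the pattern the scoring rewards.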
Second embodiment of the invention:
Fig. 2 is a schematic structural diagram of a single Chinese character retrieval device based on travel industry products according to a second embodiment of the present invention. The device includes: an acquisition module 201, a query module 202, a rectangular coordinate system generation module 203, a matching point data acquisition module 204, a matching point data screening module 205, a similarity judgment line acquisition module 206, a first calculation module 207, a second calculation module 208 and a sorting module 209.
The acquisition module 201 is configured to acquire N query characters input by a user, where N is a positive integer greater than 0;
the query module 202 is configured to acquire all documents to be compared according to the N query characters, where each document to be compared includes a plurality of characters, and the document to be compared includes the N query characters;
the rectangular coordinate system generation module 203 is configured to construct a planar rectangular coordinate system for each document to be compared according to the characters in that document and the N query characters; one document to be compared corresponds to one planar rectangular coordinate system;
the matching point data acquisition module 204 is configured to match, in each planar rectangular coordinate system, the characters of the document to be compared that are identical to the N query characters, to obtain coordinates of a plurality of matching points and form first matching point coordinate data; one planar rectangular coordinate system corresponds to one set of first matching point coordinate data;
the matching point data screening module 205 is configured to screen the conflicting matching points in each set of first matching point coordinate data according to a preset screening policy to obtain second matching point coordinate data;
the similarity judgment line acquisition module 206 is configured to connect the matching points in each set of second matching point coordinate data according to a preset connection rule to obtain a plurality of similarity judgment lines;
the first calculation module 207 is configured to calculate the similarity score of each similarity judgment line according to the influence degree of position, the positional order degree of the characters, and the continuity;
the second calculation module 208 is configured to obtain the similarity score between the N query characters and each document to be compared according to the similarity scores of the similarity judgment lines in that document;
and the sorting module 209 is configured to sort the documents to be compared by their similarity scores to obtain a query result.
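The screening policy applied by the matching point data screening module 205 is spelled out in claims 2 and 8: for each conflicting matching point, take the cosine of the line to its right-adjacent matching point, compare it with an optimal reference value, and keep only the conflicting point with the smallest difference. A rough sketch follows. It assumes the optimal reference value is cos 45°, that the right-adjacent point is the one with the smallest larger abscissa, that a point with no right neighbour is given an infinite difference, and that the set of conflicting points is supplied by the caller, since how conflicts are detected is not described in this passage. All names are hypothetical.

```python
import math

# Assumption: the optimal reference value is the cosine of 45 degrees.
OPTIMAL_REFERENCE = math.cos(math.radians(45.0))

def _cosine_to_right_neighbor(point, points):
    """Cosine of the inclination angle of the line to the right-adjacent point."""
    right = [p for p in points if p[0] > point[0]]
    if not right:
        return None  # no right neighbour exists
    nxt = min(right)  # smallest abscissa; ties broken by smallest ordinate
    dx, dy = nxt[0] - point[0], nxt[1] - point[1]
    theta = math.atan2(dy, dx) % math.pi  # inclination angle in [0, pi)
    return math.cos(theta)

def screen_conflicts(points, conflicts):
    """Keep only the conflicting point whose cosine is closest to the reference."""
    if not conflicts:
        return list(points)  # first data is used as second data unchanged

    def difference(p):
        cosine = _cosine_to_right_neighbor(p, points)
        return math.inf if cosine is None else abs(cosine - OPTIMAL_REFERENCE)

    keep = min(conflicts, key=difference)
    return [p for p in points if p not in conflicts or p == keep]

second = screen_conflicts([(1, 1), (3, 1), (4, 2)], conflicts=[(1, 1), (3, 1)])
```

In the example, the point (3, 1) is kept because the line from it to (4, 2) has slope 1, while the line from (1, 1) to its right neighbour is horizontal.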
For the more detailed working principle and process of this embodiment, reference may be made to, but is not limited to, the single Chinese character retrieval method based on travel industry products described in the first embodiment.
As can be seen from the above, the single Chinese character retrieval device based on travel industry products according to the embodiment of the present invention obtains the similarity score between each document to be compared and the query characters input by the user by calculating, for each similarity judgment line, the influence degree of position, the positional order degree of the characters, and the continuity, and then ranks the documents to be compared. This solves the problem that travel products are difficult to score and rank by similarity because word meaning is lost under a word-list full-text retrieval system.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (8)

1. A single Chinese character retrieval method based on products in the travel industry is characterized by comprising the following steps:
acquiring N query characters input by a user, wherein N is a positive integer greater than 0;
acquiring all documents to be compared according to the N query characters, wherein each document to be compared comprises a plurality of characters, and the document to be compared comprises the N query characters;
respectively constructing a plane rectangular coordinate system corresponding to each document to be compared according to the characters in each document to be compared and the N query characters; one document to be compared corresponds to one rectangular plane coordinate system;
matching, in each planar rectangular coordinate system, the characters of the document to be compared that are identical to the N query characters, to obtain coordinates of a plurality of matching points and form first matching point coordinate data; one planar rectangular coordinate system corresponds to one set of first matching point coordinate data;
screening the conflict matching points in the coordinate data of each first matching point according to a preset screening strategy to obtain coordinate data of a second matching point;
connecting the matching points in the second matching point coordinate data according to a preset connection rule to obtain a plurality of similarity judgment lines;
respectively calculating the similarity score of each similarity judgment line according to the influence degree of the position, the position sequence degree of the character and the continuity degree;
respectively obtaining similarity scores between the N query characters and the documents to be compared according to the similarity scores of the similarity judgment lines in the documents to be compared;
and sorting the documents to be compared according to the similarity scores of the documents to be compared to obtain a query result.
2. The single Chinese character retrieval method based on travel industry products as claimed in claim 1, wherein the screening of the conflicting matching points in each of the first matching point coordinate data according to a preset screening policy to obtain second matching point coordinate data specifically comprises:
judging whether the first matching point coordinate data has a conflict matching point; if the first matching point coordinate data does not have the conflict matching point, marking the first matching point data as second matching point coordinate data;
if the first matching point coordinate data contains the conflict matching points, extracting all the conflict matching points from the first matching point coordinate data;
calculating a cosine value between each straight line and the abscissa axis according to the straight line determined by each conflict matching point and a matching point adjacent to the right side of the conflict matching point, and sequentially obtaining a first cosine value corresponding to each conflict matching point;
respectively calculating the difference between the first cosine value and the optimal reference value, and acquiring a conflict matching point with the minimum difference;
and eliminating all the conflict matching points in the first matching point coordinate data except the conflict matching point with the minimum difference value to obtain second matching point coordinate data.
3. The single Chinese character retrieval method based on travel industry products as claimed in claim 1, wherein the similarity score of each similarity judgment line is calculated by the following formula:
[Formula image: Figure FDA0002196134280000021]
where f(X_i) is the influence degree of position, sim(L_i) is the positional order degree of the characters, and (X_{i+1} - X_i) is the degree of continuity.
4. The single Chinese character retrieval method based on travel industry products as claimed in claim 3, wherein the influence degree of position is calculated by the following formula:
[Formula image: Figure FDA0002196134280000023]
where f(X_i) is the influence degree of position, with values in (0, 1); the constant a is a hyperparameter that can be adjusted according to the actual situation; and X_i is the abscissa value of the i-th matching point.
5. The single Chinese character retrieval method based on travel industry products as claimed in claim 3, wherein the positional order degree of the characters is calculated by the following formula:
[Formula image: Figure FDA0002196134280000031]
where L_i is the line passing through point (X_i, Y_i) and point (X_{i+1}, Y_{i+1}), and cos θ is the cosine of the angle between the abscissa axis and the line through matching point M_i = (X_i, Y_i) and its right-adjacent matching point M_{i+1} = (X_{i+1}, Y_{i+1}).
6. The single Chinese character retrieval method based on travel industry products as set forth in claim 3, wherein the continuity is obtained from the difference between the abscissa values of the two matching points that the similarity judgment line passes through.
7. A single Chinese character retrieval device based on products in the travel industry is characterized by comprising:
an acquisition module, configured to acquire N query characters input by a user, wherein N is a positive integer greater than 0;
the query module is used for acquiring all documents to be compared according to the N query characters, wherein each document to be compared comprises a plurality of characters, and the document to be compared comprises the N query characters;
the rectangular coordinate system generating module is used for respectively constructing a planar rectangular coordinate system corresponding to each document to be compared according to the characters in each document to be compared and the N query characters; one document to be compared corresponds to one rectangular plane coordinate system;
the matching point data acquisition module is used for matching, in each planar rectangular coordinate system, the characters of the document to be compared that are identical to the N query characters, to obtain coordinates of a plurality of matching points and form first matching point coordinate data; one planar rectangular coordinate system corresponds to one set of first matching point coordinate data;
the matching point data screening module is used for screening the conflict matching points in the first matching point coordinate data according to a preset screening strategy to obtain second matching point coordinate data;
a similarity judgment line obtaining module, configured to connect matching points in each of the second matching point coordinate data according to a preset connection rule to obtain multiple similarity judgment lines;
the first calculation module is used for respectively calculating the similarity score of each similarity judgment line according to the influence degree of the position, the position sequence degree of the character and the continuity degree;
a second calculation module, configured to obtain similarity scores between the N query characters and the documents to be compared, respectively, according to similarity scores of the similarity judgment lines in the documents to be compared;
and the sorting module is used for sorting the documents to be compared according to the similarity scores of the documents to be compared to obtain a query result.
8. The single Chinese character retrieval device based on travel industry products as recited in claim 7, wherein the step of screening the conflicting matching points in each of the first matching point coordinate data according to a preset screening policy to obtain second matching point coordinate data specifically comprises:
judging whether the first matching point coordinate data has a conflict matching point; if the first matching point coordinate data does not have the conflict matching point, marking the first matching point data as second matching point coordinate data;
if the first matching point coordinate data contains the conflict matching points, extracting all the conflict matching points from the first matching point coordinate data;
calculating a cosine value between each straight line and the abscissa axis according to the straight line determined by each conflict matching point and a matching point adjacent to the right side of the conflict matching point, and sequentially obtaining a first cosine value corresponding to each conflict matching point;
respectively calculating the difference between the first cosine value and the optimal reference value, and acquiring a conflict matching point with the minimum difference;
and eliminating all the conflict matching points in the first matching point coordinate data except the conflict matching point with the minimum difference value to obtain second matching point coordinate data.
CN201910855488.9A 2019-09-09 2019-09-09 Single Chinese character retrieval method and device based on travel industry products Active CN110674367B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910855488.9A CN110674367B (en) 2019-09-09 2019-09-09 Single Chinese character retrieval method and device based on travel industry products

Publications (2)

Publication Number Publication Date
CN110674367A (en) 2020-01-10
CN110674367B (en) 2022-02-01

Family

ID=69077824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910855488.9A Active CN110674367B (en) 2019-09-09 2019-09-09 Single Chinese character retrieval method and device based on travel industry products

Country Status (1)

Country Link
CN (1) CN110674367B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101719128A (en) * 2009-12-31 2010-06-02 浙江工业大学 Fuzzy matching-based Chinese geo-code determination method
CN101882163A (en) * 2010-06-30 2010-11-10 中国科学院地理科学与资源研究所 Fuzzy Chinese address geographic evaluation method based on matching rule
CN104090989A (en) * 2014-07-30 2014-10-08 携程计算机技术(上海)有限公司 Website searching system and method based on mobile terminal
CN104965894A (en) * 2015-06-19 2015-10-07 成都国腾实业集团有限公司 Data analysis system for IDC hazardous information monitoring platform
CN106407418A (en) * 2016-09-23 2017-02-15 Tcl集团股份有限公司 A face identification-based personalized video recommendation method and recommendation system
CN106933947A (en) * 2017-01-20 2017-07-07 北京三快在线科技有限公司 A kind of searching method and device, electronic equipment
CN108694186A (en) * 2017-04-07 2018-10-23 阿里巴巴集团控股有限公司 Data transmission method for uplink and server application, computing device and computer-readable medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902740B (en) * 2014-04-22 2017-07-18 锤子科技(北京)有限公司 The staying method and device of short message identifying code
WO2017081562A1 (en) * 2015-11-09 2017-05-18 Imi: Intelligence & Management Of Information Inc. Method and system for processing and searching documents

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
AutoCAD下矢量汉字库的剖析与应用;杨宠;《电脑编程技巧与维护》;19970228;全文 *

Also Published As

Publication number Publication date
CN110674367A (en) 2020-01-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant