CN101702171A - Approximating matching method for numerous character strings
- Publication number
- CN101702171A
- Authority
- CN
- China
- Prior art keywords
- matching
- algorithm
- character strings
- parameter
- product
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an approximate matching method for large numbers of character strings, comprising the steps of: (1) selecting the main matching parameter of an object to be matched; (2) adjusting the weight values of the parameters; (3) constructing a many-to-many matching model using an asymmetric stable marriage algorithm; and (4) constructing a preference list for the matching items in the many-to-many model according to the edit distance algorithm or the longest common subsequence algorithm. The invention is built on an ontology-based multi-algorithm model and adds attribute analysis to select the algorithm automatically. Once the model has been constructed, the matching results are stable, the matching rate is high, and matching is fast and real-time. Depending on the application scenario, the preference list can be built with different approximate string matching algorithms. The method allows targeted product strategies to be made in a short time, strengthens product competitiveness, and improves website operating efficiency and system performance, and it has good application prospects.
Description
Technical field:
The present invention relates to a method for matching product data, and specifically to an approximate matching method for large numbers of character strings, applied in e-commerce.
Background technology:
With the rapid development of e-commerce, competition among B2C e-commerce websites is intensifying. The core of that competition lies in the differences in price, promotions and service between the products each site sells. It has therefore become imperative to track the product data of external websites in real time every day, so as to set the right sales strategy and stay competitive.
Two algorithms are currently used for approximate product matching in e-commerce:
1. Edit distance algorithm:
Edit distance measures the similarity between two character strings. It equals the minimum cost of converting one string into the other through basic operations, that is, the minimum number of character insertions, deletions and substitutions required, and it can be computed for strings of different lengths. It is used to decide how similar an item in an index file is to a given target item. For example, the edit distance between "three" and "tree" is 1, because deleting a single character makes the two strings identical. The concept is easy to understand, but approximate string matching algorithms are complex in their details and variants, and applying them to real scenarios requires further study. Widely used applications include:
1. Computational biology (DNA mutation analysis)
2. Speech recognition
3. Spell checking
4. Plagiarism detection
Shortcoming: edit distance only yields the similarity of two strings; it cannot tell whether one string is a substring or subsequence of the other, or how long that common part is. In e-commerce product matching, when the other party's product number is embedded in its product name, matching with this algorithm is therefore difficult.
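As an illustration of the edit distance described above, the following is a minimal Python sketch of the standard dynamic-programming (Levenshtein) formulation with unit costs. The function name `edit_distance` is chosen here for illustration and does not come from the patent. The second call shows the shortcoming just mentioned: when a product number is embedded in a longer product name, the distance is dominated by the length difference.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute, or keep a match
        prev = curr
    return prev[-1]


print(edit_distance("three", "tree"))  # 1: delete the "h"

code = "T096720"
name = "Epson Light Black Ink Cartridge T096720"
# The number is contained in the name, yet the distance is large:
# it equals len(name) - len(code).
print(edit_distance(code, name))
```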
2. Longest common subsequence (LCS) algorithm:
A sequence S is called the longest common subsequence of two or more known sequences if it is a subsequence of each of them and is the longest of all sequences satisfying this condition. It is computed using dynamic programming. Widely used applications include:
1. Information retrieval
2. Data cleaning
3. Plagiarism detection
4. DNA sequence comparison
Example: S1=ACCGGTCGAGTGCGCGGAAGCCGGCCGAA
S2=GTCGTTCGGAATGCCGTTGCTCTGTAAA
One goal of comparing two strings is to determine how similar they are. Similarity can be measured in many ways. For example, we may say that two strings are similar if one is a subsequence of the other; in the example above, neither S1 nor S2 is a subsequence of the other. Alternatively, we may say that two strings are similar if one can be turned into the other with only a few edits (substitutions, insertions or deletions). A third definition is as follows: given two strings S1 and S2, find a sequence S3 whose elements all appear in both S1 and S2, in the same order but not necessarily contiguously. The longer the longest such S3, the more similar S1 and S2 are. In the example above, the longest S3 that can be found is GTCGTCGGAAGCCGGCCGAA.
Shortcoming: the longest common subsequence that is computed is not contiguous, and backtracking is not supported. In e-commerce product matching there is therefore a certain probability of inaccurate matches.
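The longest common subsequence can be sketched in Python with the usual dynamic-programming table, kept here to two rows. The names `lcs_length` and `is_subsequence` are illustrative and not taken from the patent. The example reuses S1, S2 and S3 from the text, and also shows a product number embedded in a product name, the case that edit distance handles poorly.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous prefix of a
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)          # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # drop one character
        prev = curr
    return prev[-1]


def is_subsequence(s: str, t: str) -> bool:
    """True if s can be obtained from t by deleting characters."""
    it = iter(t)
    return all(ch in it for ch in s)  # each lookup consumes t left to right


S1 = "ACCGGTCGAGTGCGCGGAAGCCGGCCGAA"
S2 = "GTCGTTCGGAATGCCGTTGCTCTGTAAA"
S3 = "GTCGTCGGAAGCCGGCCGAA"
print(is_subsequence(S3, S1), is_subsequence(S3, S2))  # True True
print(lcs_length(S1, S2))

# A product number contained in a product name is fully recovered:
print(lcs_length("T096720", "Epson Light Black Ink Cartridge T096720"))  # 7
```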
At present there is no suitable model for building a best-match algorithm over the large numbers of strings involved in product matching. This document builds an asymmetric many-to-many model on the basis of the symmetric stable marriage algorithm. The symmetric stable marriage algorithm is as follows:
Suppose a club has n girls and n boys. Each girl ranks the boys according to her preferences, and each boy likewise ranks the girls according to his. The n girls and n boys are then paired into a complete matching, as shown in Fig. 1.
Shortcoming: each person's ranked preference list must be constructed in advance. For e-commerce product matching, an algorithm for constructing these preference lists is lacking.
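For reference, the classic symmetric case (n proposers, n receivers, complete preference lists given in advance) is commonly solved with the Gale-Shapley procedure. The sketch below is that textbook algorithm, not the patent's asymmetric variant, and the names `stable_marriage`, `girls` and `boys` are illustrative.

```python
def stable_marriage(proposer_prefs, receiver_prefs):
    """Gale-Shapley matching. Each argument maps a name to a complete
    preference list, most preferred first. Returns proposer -> receiver."""
    # rank[r][p] = position of proposer p in receiver r's list (lower is better)
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # index of next receiver to try
    engaged = {}                                   # receiver -> proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged.get(r)
        if current is None:
            engaged[r] = p                 # r was free and accepts
        elif rank[r][p] < rank[r][current]:
            engaged[r] = p                 # r prefers p and drops current
            free.append(current)
        else:
            free.append(p)                 # r rejects p, who tries again later
    return {p: r for r, p in engaged.items()}


girls = {"g1": ["b1", "b2"], "g2": ["b1", "b2"]}
boys = {"b1": ["g2", "g1"], "b2": ["g1", "g2"]}
print(stable_marriage(girls, boys))
```

Both girls prefer b1, but b1 prefers g2, so g1 ends up with b2. Note that the preference lists are plain inputs here, which is exactly the gap the text points out: something has to produce them.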
Consequently, every medium-sized and large e-commerce website currently uses web tracking or crawler software to capture external product data, but these tools have certain prerequisites and limitations:
1. Some products on external websites lack the key attributes or specifications that would serve as a direct basis for comparison when matching captured data. Take the product "Creative ZEN 4GB BLACK Mp3Mp4Video Player with Expandable SD Card Slot": this is the product's name on one B2C website, but each B2C website names products differently, so if the product number is missing from the listing, matching accuracy drops.
2. The attributes or specifications of similar products in the same series are very close, which lowers the accuracy of the captured data and requires manual disambiguation, reducing efficiency. For example, the products "Epson Light Black Ink Cartridge T096720" and "Epson Matte Black Ink Cartridge T096820" have the product numbers "T096720" and "T096820". The strings of their main attributes are highly similar, so if matching captures both products, the best match still has to be selected manually; search efficiency is low and real-time performance is poor.
As e-commerce keeps growing, the handling, processing and transformation of data in the later stages of e-commerce data mining will be a very promising field. The matching of product data therefore still needs further improvement.
Summary of the invention:
The object of the present invention is to overcome the deficiencies of the prior art by providing an approximate matching method for large numbers of character strings that is built on an ontology-based multi-algorithm model and adds attribute analysis to select the algorithm automatically, so that matching results are stable and the matching rate is high, and that can construct preference lists with different approximate string matching algorithms according to the application scenario.
To achieve this object, the technical solution of the present invention is as follows:
An approximate matching method for large numbers of character strings, characterized in that the method comprises the following steps:
(1) selecting the main matching parameter of an object to be matched;
(2) adjusting the parameter weight values, for which the following three kinds of parameters are mainly provided:
(a) the parameter value computed by the edit distance algorithm or the longest common subsequence algorithm for approximate matching of product numbers;
(b) the parameter value computed by the edit distance algorithm or the longest common subsequence algorithm for approximate matching of product names;
(c) the parameter value of the product price-range comparison;
(3) constructing a many-to-many matching model using the asymmetric stable marriage algorithm, which is an innovation built on the stable marriage algorithm;
(4) for the matching items in the many-to-many model, constructing a preference list according to the edit distance algorithm or the longest common subsequence algorithm.
Further, the main matching parameter in step (1) is an easily distinguished characteristic matching attribute, selected according to the product data source after the target data has been classified in some way (for example, by brand or category).
Further, the target data is collected by obtaining large amounts of product attribute data for goods on the Internet using search engine technology.
The present invention is based on the asymmetric stable marriage algorithm and constructs a many-to-many matching model for large numbers of character strings. The preference lists of the model's matching items are built with two approximate string matching algorithms, edit distance and longest common subsequence. Our own products are compared with the other party's products attribute by attribute: product number against product number, product name against product name, product number against product name, price against price, and sales type against sales type. These three algorithmic techniques are combined, and the best match is computed automatically under a set of parameter weights.
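The patent does not state how the weighted parameters are combined, so the following Python sketch is only one plausible reading: each of the three parameters (product number, product name, price) is normalized to a score between 0 and 1 and the scores are added with weights. The `Product` class, the weight values, the LCS-based normalization and the price formula are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Product:          # hypothetical record; field names are assumptions
    code: str
    name: str
    price: float        # assumed non-negative


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """LCS length divided by the longer length, ignoring case (assumed scoring)."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def price_similarity(p: float, q: float) -> float:
    """1.0 for equal prices, falling towards 0.0 as they diverge (assumed)."""
    hi = max(p, q)
    return 1.0 if hi == 0 else 1.0 - abs(p - q) / hi


def match_score(ours: Product, theirs: Product,
                w_code: float = 0.5, w_name: float = 0.3,
                w_price: float = 0.2) -> float:
    """Weighted sum of the three parameter scores; weights are illustrative."""
    return (w_code * string_similarity(ours.code, theirs.code)
            + w_name * string_similarity(ours.name, theirs.name)
            + w_price * price_similarity(ours.price, theirs.price))


ours = Product("T096720", "Epson Light Black Ink Cartridge T096720", 15.0)
same = Product("T096720", "Epson Light Black Ink Cartridge T096720", 15.0)
matte = Product("T096820", "Epson Matte Black Ink Cartridge T096820", 15.0)
print(match_score(ours, same), match_score(ours, matte))
```

With weights that sum to 1 the score stays in the range 0 to 1, which makes scores comparable across candidates when they are later sorted into a preference list.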
The invention adds attribute analysis to select the algorithm automatically for product identification and automatic matching. Taking into account the semantic similarity of both product component concepts and attribute component concepts, it proposes a semantic similarity matching algorithm based on the product ontology structure. This solves the problem of automatically matching the ontology's product information against external product information, with fairly satisfactory accuracy.
Compared with current matching methods for large numbers of strings, once the model has been built the invention offers stable matching results, a high matching rate and good real-time performance, and it can construct preference lists with different approximate string matching algorithms for different application scenarios. It addresses the complexity and low efficiency of many traditional manual operations, allows targeted product decisions to be made in the shortest time, strengthens product competitiveness, and improves website operating efficiency and system performance, and thus has good prospects for development and application.
Description of drawings:
The invention is further described below with reference to the drawings and specific embodiments.
Fig. 1 shows girls and boys paired into a complete matching by the existing symmetric stable marriage algorithm;
Fig. 2 shows girls and boys paired into a complete matching using the preference lists constructed by the present invention;
Fig. 3 is a flow chart of the present invention.
Embodiment:
To make the technical means, inventive features, objects and effects of the invention easy to understand, the invention is further described below with reference to the drawings.
The present invention is based on the asymmetric stable marriage algorithm and constructs a many-to-many matching model for large numbers of character strings. The preference lists of the model's matching items are built with two approximate string matching algorithms, edit distance and longest common subsequence. Our own products are compared with the other party's products attribute by attribute: product number against product number, product name against product name, product number against product name, price against price, and sales type against sales type. These three algorithmic techniques are combined, and the best match is computed automatically under a set of parameter weights.
Referring to Fig. 3, the specific matching procedure of the invention is as follows:
(1) obtain large amounts of product attribute data for goods on the Internet using search engine technology, and collect the target data;
(2) after classifying the target data in some way (for example, by brand or category), select an easily distinguished characteristic matching attribute, such as the product name, as the matching parameter according to the product data source;
(3) adjust the parameter weight values, for which the following three kinds of parameters are mainly provided:
(a) the parameter value computed by the edit distance algorithm or the longest common subsequence algorithm for approximate matching of product numbers;
(b) the parameter value computed by the edit distance algorithm or the longest common subsequence algorithm for approximate matching of product names;
(c) the parameter value of the product price-range comparison;
(4) select the source data to be matched and construct a many-to-many matching model according to the asymmetric stable marriage algorithm;
(5) for the matching items in the many-to-many model, construct a preference list for each matching item using the edit distance algorithm or the longest common subsequence algorithm; different approximate string matching algorithms can be chosen for different scenarios;
(6) obtain the final result after running the method; the computation can be narrowed down to smaller computing units according to the brand or category information of the products, which improves the efficiency of the algorithm.
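The final step, narrowing the computation to smaller units by brand or category, can be sketched as a simple grouping pass that only pairs products sharing a brand. The dictionary layout with a "brand" key and the function names are assumptions for illustration; the patent does not specify a data format.

```python
from collections import defaultdict


def group_by_brand(products):
    """Group product records by normalized brand name."""
    groups = defaultdict(list)
    for p in products:
        groups[p["brand"].strip().lower()].append(p)
    return groups


def candidate_pairs(ours, theirs):
    """Yield only (ours, theirs) pairs that share a brand, instead of
    comparing every product with every other product."""
    theirs_by_brand = group_by_brand(theirs)
    for p in ours:
        for q in theirs_by_brand.get(p["brand"].strip().lower(), []):
            yield p, q


ours = [
    {"brand": "Epson", "code": "T096720"},
    {"brand": "Epson", "code": "T096820"},
    {"brand": "Creative", "code": "ZEN-4GB"},
]
theirs = [
    {"brand": "epson", "code": "T096820"},
    {"brand": "EPSON ", "code": "T096720"},
    {"brand": "Creative", "code": "ZEN 4GB"},
    {"brand": "Sony", "code": "NWZ-1"},
]
pairs = list(candidate_pairs(ours, theirs))
print(len(pairs), "pairs instead of", len(ours) * len(theirs))  # 5 pairs instead of 12
```

Each brand group can then be scored and matched independently, so the expensive string comparisons grow with the size of the largest group rather than with the whole catalogue.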
It should be noted that the invention adds attribute analysis to select the algorithm automatically for product identification and automatic matching. Taking into account the semantic similarity of both product component concepts and attribute component concepts, it proposes a semantic similarity matching algorithm based on the product ontology structure. This solves the problem of automatically matching the ontology's product information against external product information, with fairly satisfactory accuracy.
For example, suppose a club has n girls and n boys. Each girl ranks the boys according to her preferences, and each boy likewise ranks the girls according to his. Matching parameters are then set up according to the girls' and boys' preferences, the source data to be matched is selected, and a many-to-many matching model is constructed according to the asymmetric stable marriage algorithm, in which the boys' and the girls' preferences are constructed in the same way. Preference lists are then constructed with the edit distance algorithm or the longest common subsequence algorithm (see Fig. 2), and the n girls and n boys are paired into a complete matching.
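One way to read this example is that string similarity replaces hand-written preference lists: each side ranks the other side's items by similarity, and a stable matching procedure is then run on those rankings. The Python sketch below follows that reading under several assumptions. It uses the standard library's `difflib.SequenceMatcher.ratio()` as a stand-in similarity rather than the edit distance or LCS score, it produces a one-to-one matching rather than the patent's many-to-many model, and it allows the two sides to differ in size, in which case some items remain unmatched. All function names are illustrative.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1] (not the patent's scoring)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def preference_lists(ours, theirs):
    """Rank the other side's items by decreasing similarity, for both sides."""
    ours_prefs = {a: sorted(theirs, key=lambda b: similarity(a, b), reverse=True)
                  for a in ours}
    theirs_prefs = {b: sorted(ours, key=lambda a: similarity(a, b), reverse=True)
                    for b in theirs}
    return ours_prefs, theirs_prefs


def match(ours_prefs, theirs_prefs):
    """Proposal-based stable matching that tolerates unequal side sizes."""
    rank = {b: {a: i for i, a in enumerate(prefs)}
            for b, prefs in theirs_prefs.items()}
    next_idx = {a: 0 for a in ours_prefs}
    holder = {}                       # theirs item -> ours item currently held
    free = list(ours_prefs)
    while free:
        a = free.pop()
        if next_idx[a] >= len(ours_prefs[a]):
            continue                  # a was rejected everywhere: stays unmatched
        b = ours_prefs[a][next_idx[a]]
        next_idx[a] += 1
        current = holder.get(b)
        if current is None:
            holder[b] = a
        elif rank[b][a] < rank[b][current]:
            holder[b] = a             # b prefers a; the old partner is freed
            free.append(current)
        else:
            free.append(a)
    return {a: b for b, a in holder.items()}


ours = ["T096720", "T096820"]
theirs = ["T096820", "T096720", "T096920"]
result = match(*preference_lists(ours, theirs))
print(result)
```

Because each product number's exact counterpart is its top choice on both sides, the two nearly identical cartridge numbers are paired with their exact counterparts and "T096920" is left unmatched.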
In view of the above, the invention is expected to become one of the most active branches of research, development and application in data management and information processing. It assists decision makers and addresses the complexity and low efficiency of many traditional manual operations. It helps enterprises understand consumer tendencies and market trends in real time, make targeted product decisions in the shortest time, strengthen product competitiveness, and improve website operating efficiency and system performance, and it has good prospects for development and application.
The basic principles, main features and advantages of the invention have been shown and described above. Those skilled in the art should understand that the invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all of them fall within the scope of the claimed invention. The scope of protection is defined by the appended claims and their equivalents.
Claims (3)
1. An approximate matching method for large numbers of character strings, characterized in that the method comprises the following steps:
(1) selecting the main matching parameter of an object to be matched;
(2) adjusting the parameter weight values, for which the following three kinds of parameters are mainly provided:
(a) the parameter value computed by the edit distance algorithm or the longest common subsequence algorithm for approximate matching of product numbers;
(b) the parameter value computed by the edit distance algorithm or the longest common subsequence algorithm for approximate matching of product names;
(c) the parameter value of the product price-range comparison;
(3) constructing a many-to-many matching model using the asymmetric stable marriage algorithm;
(4) for the matching items in the many-to-many model, constructing a preference list according to the edit distance algorithm or the longest common subsequence algorithm.
2. The approximate matching method for large numbers of character strings according to claim 1, characterized in that the main matching parameter in step (1) is an easily distinguished characteristic matching attribute, selected according to the product data source after the target data has been classified in some way (for example, by brand or category).
3. The approximate matching method for large numbers of character strings according to claim 2, characterized in that the target data is collected by obtaining large amounts of product attribute data for goods on the Internet using search engine technology.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910219048A CN101702171A (en) | 2009-11-19 | 2009-11-19 | Approximating matching method for numerous character strings |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910219048A CN101702171A (en) | 2009-11-19 | 2009-11-19 | Approximating matching method for numerous character strings |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101702171A true CN101702171A (en) | 2010-05-05 |
Family
ID=42157086
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200910219048A Pending CN101702171A (en) | 2009-11-19 | 2009-11-19 | Approximating matching method for numerous character strings |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101702171A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102541989A (en) * | 2010-10-28 | 2012-07-04 | 微软公司 | Robust auto-correction for data retrieval |
CN102541989B (en) * | 2010-10-28 | 2015-12-09 | 微软技术许可有限责任公司 | The sane automatic correction of data retrieval |
CN102184169B (en) * | 2011-04-20 | 2013-06-19 | 北京百度网讯科技有限公司 | Method, device and equipment used for determining similarity information among character string information |
CN102184169A (en) * | 2011-04-20 | 2011-09-14 | 北京百度网讯科技有限公司 | Method, device and equipment used for determining similarity information among character string information |
CN102682079A (en) * | 2012-03-30 | 2012-09-19 | 梁宗强 | Method and module for allocating weights to search non-pharmaceutical medical project names |
US10489461B2 (en) | 2014-08-20 | 2019-11-26 | Oracle International Corporation | Multidimensional spatial searching for identifying substantially similar data fields |
CN111324784B (en) * | 2015-03-09 | 2023-05-16 | 创新先进技术有限公司 | Character string processing method and device |
CN111324784A (en) * | 2015-03-09 | 2020-06-23 | 阿里巴巴集团控股有限公司 | Character string processing method and device |
CN106776493A (en) * | 2015-11-19 | 2017-05-31 | 腾讯科技(深圳)有限公司 | Information filtering method and information filtrating device |
CN106776493B (en) * | 2015-11-19 | 2020-03-03 | 腾讯科技(深圳)有限公司 | Information filtering method and information filtering device |
CN105824992B (en) * | 2016-03-10 | 2019-01-29 | 东南大学 | A kind of intelligent Matching method and system of relay protection device data model |
CN105824992A (en) * | 2016-03-10 | 2016-08-03 | 东南大学 | Intelligent matching method and system for data modules of relaying protection equipment |
CN110245167A (en) * | 2019-06-19 | 2019-09-17 | 河北工业大学 | The non-overlapping approximate pattern matching method integrally constrained with part- |
CN110232140A (en) * | 2019-06-19 | 2019-09-13 | 河北工业大学 | The disposable approximate pattern matching method integrally constrained with part- |
CN110245167B (en) * | 2019-06-19 | 2021-08-03 | 河北工业大学 | Non-overlapping approximate pattern matching method with local-overall constraint |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20100505 |