CN101702171A - Approximating matching method for numerous character strings - Google Patents

Info

Publication number
CN101702171A
Authority
CN
China
Prior art keywords
matching
algorithm
character strings
parameter
product
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200910219048A
Other languages
Chinese (zh)
Inventor
蒋以仁
宋卫卫
王皓伊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Newegg Information Technology (Xi'an) Co Ltd
Original Assignee
Newegg Information Technology (Xi'an) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2009-11-19
Filing date
2009-11-19
Publication date
2010-05-05
Application filed by Newegg Information Technology (Xi'an) Co Ltd
Priority to CN200910219048A
Publication of CN101702171A
Current legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an approximate matching method for large numbers of character strings, comprising the steps of: (1) selecting the main matching parameters of the object to be matched; (2) adjusting the parameter weights; (3) constructing a many-to-many matching model using the asymmetric stable marriage algorithm; (4) for the matching items in the many-to-many matching model, constructing a preference list according to the edit distance algorithm or the longest common subsequence algorithm. Building on an ontology-based multi-algorithm model, the invention adds attribute analysis to select the algorithm automatically. Once the model is built, the matching results are stable, the matching rate is high, and matching runs quickly and in real time. Depending on the application scenario, the preference list can be constructed with different approximate string matching algorithms. The method makes it possible to reach targeted product decisions in a short time, strengthening product competitiveness, improving website operating efficiency, and improving system performance.

Description

An approximate matching method for large numbers of character strings
Technical field:
The present invention relates to a method for matching product data, and specifically to an approximate matching method for large numbers of character strings applied in e-commerce.
Background:
With the rapid development of e-commerce, competition among B2C e-commerce websites keeps intensifying. The core of that competition lies in the differences in price, promotions, and service among the products each site sells. It has therefore become imperative to track the product data of external websites every day in real time, and on that basis to set the right sales strategy and strengthen competitiveness.
Two algorithms are currently applied to approximate product matching in e-commerce:
I. Edit distance algorithm:
Edit distance is used to judge the degree of similarity between character strings. It equals the minimum cost of converting one string into another through basic operations, and it can be computed between strings of different lengths. A distance algorithm of this kind is used to decide how similar an item in an index file is to a given target item. As a measure of similarity between two strings, edit distance counts the minimum number of character insertions, deletions, and replacements needed to transform one string into the other. For example, the edit distance between "three" and "tree" is 1, because deleting a single character makes the two strings identical. The concept is easy to understand, but the nature and variants of approximate string matching algorithms are very complex, and careful study is needed before applying them to real scenarios. Widely used application scenarios include:
1. DNA mutation analysis in computational biology
2. Speech recognition
3. Spell checking
4. Plagiarism detection
Shortcoming: the algorithm only measures the similarity of two character strings. It cannot tell whether one string is a substring of the other, nor give the length of a shared subsequence. In e-commerce product matching, the other party's product number is often contained in the product name, so matching with this algorithm alone is relatively difficult.
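The edit distance described above can be illustrated with a short Python sketch. This is the standard dynamic-programming formulation with unit costs for insertion, deletion, and replacement; the patent does not specify costs, so unit costs are an assumption, and the function names `edit_distance` and `similarity` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and replacements turning a into b."""
    # prev[j] is the distance between the already-processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca with cb, or keep a match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


print(edit_distance("three", "tree"))       # the example from the text
print(edit_distance("T096720", "T096820"))  # two near-identical product numbers
```

Dividing by the longer length is only one possible way to turn the distance into a score in the range 0 to 1; it is chosen here so that strings of different lengths can be compared on the same scale.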
II. Longest common subsequence algorithm:
If a sequence S is a subsequence of each of two or more known sequences, and is the longest of all sequences that satisfy this condition, then S is called the longest common subsequence of the known sequences. The algorithm is based on dynamic programming. Widely used application scenarios include:
1. Information retrieval
2. Data cleaning
3. Plagiarism detection
4. DNA sequence comparison
Example: S1=ACCGGTCGAGTGCGCGGAAGCCGGCCGAA
S2=GTCGTTCGGAATGCCGTTGCTCTGTAAA
One goal of comparing two character strings is to find out how similar they are. Similarity can be measured by many standards. For example, we can say that two strings are similar if one is a subsequence of the other; in the example above, neither S1 nor S2 is a subsequence of the other. Alternatively, we can say that two strings are similar if only a few correction operations (such as replacement, insertion, or deletion) turn one into the other. A third definition is this: for two strings S1 and S2, find a sequence S3 whose elements all appear in both S1 and S2, in the same order but not necessarily contiguously. The longer the longest such S3, the more similar S1 and S2 are. In the example above, the longest S3 that can be found is GTCGTCGGAAGCCGGCCGAA.
Shortcoming: the longest common subsequence that is computed is not contiguous, and backtracking is not supported. In e-commerce product matching there is therefore a certain probability of inaccurate matches.
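As a sketch of the longest common subsequence definition above, the following Python code fills the usual dynamic-programming table and then walks back through it to recover one longest S3. The function names are hypothetical, and when several subsequences share the maximum length the code returns only one of them.

```python
def longest_common_subsequence(a: str, b: str) -> str:
    """Return one longest sequence that appears, in order, in both a and b."""
    m, n = len(a), len(b)
    # table[i][j] is the LCS length of a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk from the bottom-right corner to the top-left, collecting matched characters.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def is_subsequence(s: str, t: str) -> bool:
    """True if the characters of s appear in t in the same order, gaps allowed."""
    remaining = iter(t)
    return all(ch in remaining for ch in s)


S1 = "ACCGGTCGAGTGCGCGGAAGCCGGCCGAA"
S2 = "GTCGTTCGGAATGCCGTTGCTCTGTAAA"
S3 = longest_common_subsequence(S1, S2)
print(S3, len(S3))
```

Because a product number that appears whole inside a product name is itself a common subsequence, `longest_common_subsequence("T096720", "Epson Light Black Ink Cartridge T096720")` returns the full number, which is the situation the edit distance alone handles poorly.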
At present there is no suitable model for constructing an optimal matching algorithm over the large numbers of character strings involved in product matching. This document takes the symmetric stable marriage algorithm as the basis for constructing an asymmetric many-to-many model.
The symmetric stable marriage algorithm:
A community has n girls and n boys. Each girl ranks the boys by her degree of preference, and each boy likewise ranks the girls by his own preference. The n girls and n boys are then paired into a complete set of marriages, as shown in Fig. 1.
Shortcoming: each person's ranked list of preferred partners must be constructed in advance. For e-commerce product matching, an algorithm for constructing these preference lists is missing.
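For reference, the classic symmetric case can be sketched with the Gale-Shapley deferred-acceptance procedure, which is the usual way to solve the stable marriage problem; the source does not name the procedure it has in mind, so treating it as Gale-Shapley is an assumption. The sketch takes the preference lists as given, which is exactly the input the shortcoming above says is missing for products. All names are hypothetical.

```python
def stable_marriage(proposer_prefs: dict, receiver_prefs: dict) -> dict:
    """Pair n proposers with n receivers so that no two people prefer each other
    to their assigned partners. Each dict maps a name to a complete list of the
    other side's names, most preferred first (assumed: equal sizes, full lists)."""
    # rank[r][p] is the position of proposer p in receiver r's list (lower is better).
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # index of the next receiver to try
    engaged_to = {}                               # receiver -> proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged_to.get(r)
        if current is None:
            engaged_to[r] = p          # receiver was unmatched and accepts
        elif rank[r][p] < rank[r][current]:
            engaged_to[r] = p          # receiver trades up; old partner is free again
            free.append(current)
        else:
            free.append(p)             # rejected; p will try the next name on the list
    return {p: r for r, p in engaged_to.items()}


girls = {"A": ["x", "y"], "B": ["x", "y"]}
boys = {"x": ["B", "A"], "y": ["A", "B"]}
print(stable_marriage(girls, boys))
```

In this small example both girls prefer x, and x prefers B, so A ends up with y. Neither A and x nor B and y would both rather be with each other, which is what makes the pairing stable.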
For this reason, large and medium-sized e-commerce websites currently use web information tracking tools or crawler software to capture external product data, but these tools have certain prerequisites and limitations:
1. Some products on external websites lack the key attributes or specifications that could serve as a direct basis for comparison when matching captured data. Take the product "Creative ZEN 4GB BLACK Mp3Mp4Video Player with Expandable SD Card Slot": this is its product name on one B2C website, but every B2C website names products differently, so if the product number is missing from the information list, the accuracy of match capture falls.
2. The attributes or specifications of similar products in the same series are very close, which lowers the accuracy of the captured data and requires manual disambiguation, reducing efficiency. For example, the products named "Epson Light Black Ink Cartridge T096720" and "Epson Matte Black Ink Cartridge T096820" have the product numbers "T096720" and "T096820" respectively. The strings of the primary attribute are highly similar; if matching captures both products at once, the best match must still be chosen manually, so search efficiency is low and real-time performance is poor.
As e-commerce continues to gain momentum, the processing and transformation of data in the later stages of e-commerce data mining will be a very promising field, and the matching of existing product data therefore still needs further improvement.
Summary of the invention:
The object of the present invention is to overcome the deficiencies of the prior art and to provide an approximate matching method for large numbers of character strings that builds on an ontology-based multi-algorithm model, adds attribute analysis to select the algorithm automatically, gives stable matching results and a high matching rate, and can construct preference lists with different approximate string matching algorithms according to the application scenario.
To achieve this object, the technical solution of the present invention is as follows:
An approximate matching method for large numbers of character strings, characterized in that its steps are as follows:
(1) select the main matching parameters of the object to be matched;
(2) adjust the parameter weights; three kinds of parameters are mainly set:
(a) the parameter value for approximate matching of product numbers, computed by the edit distance algorithm or the longest common subsequence algorithm;
(b) the parameter value for approximate matching of product names, computed by the edit distance algorithm or the longest common subsequence algorithm;
(c) the parameter value for comparison of product price ranges;
(3) construct a many-to-many matching model using the asymmetric stable marriage algorithm, which is an innovation built on the stable marriage algorithm;
(4) for the matching items in the many-to-many model, construct a preference list according to the edit distance algorithm or the longest common subsequence algorithm.
Further, the main matching parameter in step (1) is a characteristic matching attribute that is easy to distinguish, selected according to the product data source after the target data has been classified in some way (for example, by brand or category).
Further, the target data is collected by obtaining large amounts of product attribute data from the Internet using search engine technology.
The present invention is based on the asymmetric stable marriage algorithm and constructs a many-to-many matching model for large numbers of character strings. The preference lists of the model's matching items use the edit distance algorithm and the longest common subsequence algorithm from approximate string matching. Between one's own products and the other party's products, the following component attributes are compared: product number with product number, product name with product name, product number with product name, price with price, and sale type with sale type. By combining these three algorithmic techniques and setting certain parameter weights, the best match is computed automatically.
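To make the idea of weighted parameters concrete, here is a minimal Python sketch that combines a product-number similarity, a product-name similarity, and a price comparison into one score. The source gives no formula or weight values, so the linear combination, the weights 0.5 / 0.3 / 0.2, the relative-gap price measure, and every name in the code are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Product:
    number: str   # product number, e.g. "T096720"
    name: str     # product name as listed on the site
    price: float  # assumed non-negative


def edit_similarity(a: str, b: str) -> float:
    """1 - (edit distance / longer length), so identical strings score 1.0."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def price_similarity(p: float, q: float) -> float:
    """1.0 for equal prices, falling toward 0.0 as the relative gap grows."""
    top = max(p, q)
    return 1.0 if top == 0 else 1.0 - abs(p - q) / top


def match_score(own: Product, other: Product,
                w_number: float = 0.5, w_name: float = 0.3, w_price: float = 0.2) -> float:
    """Weighted sum of the three parameter values; the weights are hypothetical."""
    return (w_number * edit_similarity(own.number, other.number)
            + w_name * edit_similarity(own.name.lower(), other.name.lower())
            + w_price * price_similarity(own.price, other.price))


own = Product("T096720", "Epson Light Black Ink Cartridge T096720", 19.99)
near = Product("T096820", "Epson Matte Black Ink Cartridge T096820", 19.99)
far = Product("ZEN4GB", "Creative ZEN 4GB Black MP3 Player", 89.99)
for candidate in (own, near, far):
    print(candidate.number, round(match_score(own, candidate), 3))
```

The near-identical Epson cartridge scores high but below an exact match, which is the case where a preference ordering, rather than a yes/no threshold, helps pick the best candidate.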
The present invention adds attribute analysis to select the algorithm automatically for the identification and automatic matching of products. Taking into account the semantic similarity of product component concepts and attribute component concepts, it proposes a semantic similarity matching algorithm based on the product ontology structure. This solves the problem of automatically matching one's own product information against that of external sites, and the accuracy achieved is comparatively good.
Compared with current methods for matching large numbers of character strings, once the model is built the present invention gives stable matching results, a high matching rate, and good real-time performance, and can construct preference lists with different approximate string matching algorithms according to the application scenario. It addresses the complexity and low efficiency of much traditional manual work, and makes it possible to reach targeted product decisions in the shortest time, strengthening product competitiveness, improving website operating efficiency, and improving system performance. It has good prospects for development and application.
Description of drawings:
The present invention is further described below with reference to the drawings and specific embodiments.
Fig. 1 shows girls and boys paired into complete marriages by the existing symmetric stable marriage algorithm;
Fig. 2 shows girls and boys paired into complete marriages using the preference lists constructed by the present invention;
Fig. 3 is a flow chart of the present invention.
Embodiment:
To make the technical means, inventive features, objectives, and effects of the present invention easy to understand, the invention is further described below with reference to the drawings.
The present invention is based on the asymmetric stable marriage algorithm and constructs a many-to-many matching model for large numbers of character strings. The preference lists of the model's matching items use the edit distance algorithm and the longest common subsequence algorithm from approximate string matching. Between one's own products and the other party's products, the following component attributes are compared: product number with product number, product name with product name, product number with product name, price with price, and sale type with sale type. By combining these three algorithmic techniques and setting certain parameter weights, the best match is computed automatically.
Referring to Fig. 3, the concrete matching process of the present invention is as follows:
(1) collect the target data by obtaining large amounts of product attribute data from the Internet using search engine technology;
(2) after the target data has been classified in some way (for example, by brand or category), select a characteristic matching attribute that is easy to distinguish, according to the product data source, as the matching parameter, for example the product name;
(3) adjust the parameter weights; three kinds of parameters are mainly set:
(a) the parameter value for approximate matching of product numbers, computed by the edit distance algorithm or the longest common subsequence algorithm;
(b) the parameter value for approximate matching of product names, computed by the edit distance algorithm or the longest common subsequence algorithm;
(c) the parameter value for comparison of product price ranges;
(4) select the source data to be matched and construct a many-to-many matching model according to the asymmetric stable marriage algorithm;
(5) for each matching item in the many-to-many model, construct its preference list with the edit distance algorithm or the longest common subsequence algorithm; different scenarios can use different approximate string matching algorithms;
(6) obtain the final result after computing with this method; the brand or category information of the products can be used to narrow the computation down to a smaller unit, which improves the algorithm's efficiency.
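The brand-based narrowing and the preference-list construction described in the steps above might be sketched as follows. This is an illustrative assumption, not the patent's implementation: candidates are compared only within the same brand, and each of one's own products ranks the external candidates by how much of its product number can be recovered, in order, from the external product name (a longest common subsequence score). The tuple layouts and function names are hypothetical.

```python
from collections import defaultdict


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def number_in_name_score(number: str, name: str) -> float:
    """Fraction of the product number that appears, in order, inside the name."""
    return lcs_length(number.upper(), name.upper()) / len(number) if number else 0.0


def build_preference_lists(own: list, external: list) -> dict:
    """own: (id, brand, product number); external: (id, brand, product name).
    Returns {own id: [external ids, best candidate first]}."""
    # Group external products by brand so each comparison stays in a small unit.
    by_brand = defaultdict(list)
    for ext_id, brand, name in external:
        by_brand[brand.lower()].append((ext_id, name))
    prefs = {}
    for own_id, brand, number in own:
        candidates = by_brand.get(brand.lower(), [])
        ranked = sorted(candidates,
                        key=lambda c: number_in_name_score(number, c[1]),
                        reverse=True)
        prefs[own_id] = [ext_id for ext_id, _ in ranked]
    return prefs


own_products = [("o1", "Epson", "T096720"), ("o2", "Creative", "ZEN4GB")]
external_products = [
    ("e1", "Epson", "Epson Matte Black Ink Cartridge T096820"),
    ("e2", "Epson", "Epson Light Black Ink Cartridge T096720"),
    ("e3", "Creative", "Creative ZEN 4GB BLACK Mp3Mp4Video Player"),
]
print(build_preference_lists(own_products, external_products))
```

Lists built this way for both sides could then serve as the input to a stable-matching step, so the preference lists no longer have to be supplied by hand.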
It should be noted that the present invention adds attribute analysis to select the algorithm automatically for the identification and automatic matching of products. Taking into account the semantic similarity of product component concepts and attribute component concepts, it proposes a semantic similarity matching algorithm based on the product ontology structure. This solves the problem of automatically matching one's own product information against that of external sites, and the accuracy achieved is comparatively good.
For example, a community has n girls and n boys. Each girl ranks the boys by her degree of preference, and each boy likewise ranks the girls by his own preference. Matching parameters are then set up according to the girls' and boys' preferences, the source data to be matched is selected, and a many-to-many matching model is constructed according to the asymmetric stable marriage algorithm; that is, the boys and the girls construct their preferences in the same way. The preference lists are then constructed with the edit distance algorithm or the longest common subsequence algorithm (see Fig. 2), and the n girls and n boys are paired into complete marriages.
Based on the above, the present invention belongs to one of the most active branches of research, development, and application in data management and information processing. It helps decision makers by addressing the complexity and low efficiency of much traditional manual work. It can help an enterprise understand consumer tendencies and market trends in real time and reach targeted product decisions in the shortest time, strengthening product competitiveness, improving website operating efficiency, and improving system performance. It has good prospects for development and application.
The above shows and describes the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited by the above embodiments; the embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the claimed scope of the invention. The claimed scope of the invention is defined by the appended claims and their equivalents.

Claims (3)

1. An approximate matching method for large numbers of character strings, characterized in that its steps are as follows:
(1) select the main matching parameters of the object to be matched;
(2) adjust the parameter weights; three kinds of parameters are mainly set:
(a) the parameter value for approximate matching of product numbers, computed by the edit distance algorithm or the longest common subsequence algorithm;
(b) the parameter value for approximate matching of product names, computed by the edit distance algorithm or the longest common subsequence algorithm;
(c) the parameter value for comparison of product price ranges;
(3) construct a many-to-many matching model using the asymmetric stable marriage algorithm;
(4) for the matching items in the many-to-many model, construct a preference list according to the edit distance algorithm or the longest common subsequence algorithm.
2. The approximate matching method for large numbers of character strings according to claim 1, characterized in that the main matching parameter in step (1) is a characteristic matching attribute that is easy to distinguish, selected according to the product data source after the target data has been classified in some way (for example, by brand or category).
3. The approximate matching method for large numbers of character strings according to claim 2, characterized in that the target data is collected by obtaining large amounts of product attribute data from the Internet using search engine technology.
CN200910219048A 2009-11-19 2009-11-19 Approximating matching method for numerous character strings Pending CN101702171A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910219048A CN101702171A (en) 2009-11-19 2009-11-19 Approximating matching method for numerous character strings

Publications (1)

Publication Number Publication Date
CN101702171A true CN101702171A (en) 2010-05-05

Family

ID=42157086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910219048A Pending CN101702171A (en) 2009-11-19 2009-11-19 Approximating matching method for numerous character strings

Country Status (1)

Country Link
CN (1) CN101702171A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184169A (en) * 2011-04-20 2011-09-14 北京百度网讯科技有限公司 Method, device and equipment used for determining similarity information among character string information
CN102541989A (en) * 2010-10-28 2012-07-04 微软公司 Robust auto-correction for data retrieval
CN102682079A (en) * 2012-03-30 2012-09-19 梁宗强 Method and module for allocating weights to search non-pharmaceutical medical project names
CN105824992A (en) * 2016-03-10 2016-08-03 东南大学 Intelligent matching method and system for data modules of relaying protection equipment
CN106776493A (en) * 2015-11-19 2017-05-31 腾讯科技(深圳)有限公司 Information filtering method and information filtrating device
CN110232140A (en) * 2019-06-19 2019-09-13 河北工业大学 The disposable approximate pattern matching method integrally constrained with part-
CN110245167A (en) * 2019-06-19 2019-09-17 河北工业大学 The non-overlapping approximate pattern matching method integrally constrained with part-
US10489461B2 (en) 2014-08-20 2019-11-26 Oracle International Corporation Multidimensional spatial searching for identifying substantially similar data fields
CN111324784A (en) * 2015-03-09 2020-06-23 阿里巴巴集团控股有限公司 Character string processing method and device

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541989A (en) * 2010-10-28 2012-07-04 微软公司 Robust auto-correction for data retrieval
CN102541989B (en) * 2010-10-28 2015-12-09 微软技术许可有限责任公司 The sane automatic correction of data retrieval
CN102184169B (en) * 2011-04-20 2013-06-19 北京百度网讯科技有限公司 Method, device and equipment used for determining similarity information among character string information
CN102184169A (en) * 2011-04-20 2011-09-14 北京百度网讯科技有限公司 Method, device and equipment used for determining similarity information among character string information
CN102682079A (en) * 2012-03-30 2012-09-19 梁宗强 Method and module for allocating weights to search non-pharmaceutical medical project names
US10489461B2 (en) 2014-08-20 2019-11-26 Oracle International Corporation Multidimensional spatial searching for identifying substantially similar data fields
CN111324784B (en) * 2015-03-09 2023-05-16 创新先进技术有限公司 Character string processing method and device
CN111324784A (en) * 2015-03-09 2020-06-23 阿里巴巴集团控股有限公司 Character string processing method and device
CN106776493A (en) * 2015-11-19 2017-05-31 腾讯科技(深圳)有限公司 Information filtering method and information filtrating device
CN106776493B (en) * 2015-11-19 2020-03-03 腾讯科技(深圳)有限公司 Information filtering method and information filtering device
CN105824992B (en) * 2016-03-10 2019-01-29 东南大学 A kind of intelligent Matching method and system of relay protection device data model
CN105824992A (en) * 2016-03-10 2016-08-03 东南大学 Intelligent matching method and system for data modules of relaying protection equipment
CN110245167A (en) * 2019-06-19 2019-09-17 河北工业大学 The non-overlapping approximate pattern matching method integrally constrained with part-
CN110232140A (en) * 2019-06-19 2019-09-13 河北工业大学 The disposable approximate pattern matching method integrally constrained with part-
CN110245167B (en) * 2019-06-19 2021-08-03 河北工业大学 Non-overlapping approximate pattern matching method with local-overall constraint

Similar Documents

Publication Publication Date Title
CN101702171A (en) Approximating matching method for numerous character strings
CN112069415B (en) Interest point recommendation method based on heterogeneous attribute network characterization learning
CN102279851B (en) Intelligent navigation method, device and system
CN111612549B (en) Construction method of platform operation service system
CN110532379B (en) Electronic information recommendation method based on LSTM (least Square TM) user comment sentiment analysis
CN108845988B (en) Entity identification method, device, equipment and computer readable storage medium
CN108563690B (en) Collaborative filtering recommendation method based on object-oriented clustering
CN102043781A (en) Web page resource recommendation method and device
CN101980211A (en) Machine learning model and establishing method thereof
CN110990670B (en) Growth incentive book recommendation method and recommendation system
US20230306035A1 (en) Automatic recommendation of analysis for dataset
CN104408648A (en) Method and device for choosing items
CN113570413A (en) Method and device for generating advertisement keywords, storage medium and electronic equipment
CN111523055A (en) Collaborative recommendation method and system based on agricultural product characteristic attribute comment tendency
CN107256238A (en) Recommendation method for personalized information and information recommendation system under a kind of multi-constraint condition
CN111191099A (en) User activity type identification method based on social media
CN106980639B (en) Short text data aggregation system and method
Zeng et al. Pyramid hybrid pooling quantization for efficient fine-grained image retrieval
CN114840766A (en) User portrait construction method, system, equipment and storage medium
CN116308683B (en) Knowledge-graph-based clothing brand positioning recommendation method, equipment and storage medium
CN113821718A (en) Article information pushing method and device
CN111598645A (en) Random forest and collaborative filtering second-hand room fusion recommendation method
Kang et al. Recognising informative Web page blocks using visual segmentation for efficient information extraction.
CN114881722A (en) Hotspot-based travel product matching method, system, equipment and storage medium
CN115203532A (en) Project recommendation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20100505