CN101702171A

CN101702171A - Approximating matching method for numerous character strings

Info

Publication number: CN101702171A
Application number: CN200910219048A
Authority: CN
Inventors: 蒋以仁; 宋卫卫; 王皓伊
Original assignee: Newegg Information Technology (Xi'an) Co Ltd
Current assignee: Newegg Information Technology (Xi'an) Co Ltd
Priority date: 2009-11-19
Filing date: 2009-11-19
Publication date: 2010-05-05

Abstract

The invention discloses an approximate matching method for a large number of character strings, comprising the steps of: (1) selecting a main matching parameter of an object to be matched; (2) adjusting the weight values of the parameters; (3) constructing a many-to-many matching model using an asymmetric stable marriage algorithm; and (4) for the matching items in the many-to-many matching model, constructing a preference list according to the edit distance algorithm or the longest common subsequence algorithm. Built on an ontology-based multi-algorithm model, the invention adds attribute analysis to select the algorithm automatically. Once the model is constructed, the matching results are stable, the matching rate is high, and matching is fast and real-time. Depending on the application scenario, the preference list can be constructed with different approximate string matching algorithms. The method allows targeted product decisions to be made in a short time, which can strengthen product competitiveness, improve website operating efficiency and improve system performance.

Description

An approximate matching method for a large number of character strings

Technical field:

The present invention relates to a method for matching product data, and in particular to an approximate matching method for a large number of character strings, applied in e-commerce.

Background technology:

With the rapid development of e-commerce, competition among B2C e-commerce websites is intensifying. At its core are the differences in price, promotions and service between the products each site sells. It has therefore become imperative to track the product data of external websites daily and in real time, so as to set the right sales strategy and strengthen competitiveness.

Two algorithms are currently applied to approximate product matching in e-commerce:

One, the edit distance algorithm:

Edit distance is used to judge the degree of similarity between character strings. It equals the minimum cost of converting one string into the other through basic transformations, and it can be computed for strings of different lengths. The distance is used to decide how similar an item in an index file is to a given target item. As a similarity measure between two strings, the edit distance is the minimum number of character insertions, deletions and substitutions needed to transform one string into the other. For example, the edit distance between "three" and "tree" is 1, because deleting a single character makes the two strings identical. The concept is easy to understand, but approximate string matching algorithms are complex in their details and variants, and applying them to real scenarios requires further study. Widely used application scenarios include:

1, computational biology (DNA gene mutation)

2, speech recognition

3, spell checking

4, plagiarism detection

Shortcoming: the algorithm only computes the similarity of two whole strings. It cannot determine whether one string is a substring or subsequence of the other, nor the length of that substring or subsequence. In e-commerce product matching, where the other party's product number is often contained in the product name, matching with this algorithm is therefore difficult.
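The edit distance described above can be sketched with the usual dynamic programming recurrence. This is a minimal illustration with unit costs for insertion, deletion and substitution, not the patent's implementation; the function name `edit_distance` is an assumption, and the product strings are taken from examples elsewhere in this description.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]


print(edit_distance("three", "tree"))  # 1: delete the "h"

# The shortcoming in practice: a product number embedded in a product name
# still produces a large distance, although it is an exact substring.
number = "T096720"
name = "Epson Light Black Ink Cartridge T096720"
print(edit_distance(number, name), number in name)
```

The second call shows why a raw distance is a poor signal when a product number is compared against a product name: every character of the name outside the number counts as an insertion.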

Two, the longest common subsequence algorithm:

A sequence S is called the longest common subsequence of two or more known sequences if it is a subsequence of each of them and is the longest of all sequences meeting this condition. It is computed using the dynamic programming principle. Widely used application scenarios include:

1, information retrieval

2, data cleaning

3, plagiarism detection

4, DNA sequence comparison

Example: S1=ACCGGTCGAGTGCGCGGAAGCCGGCCGAA

S2=GTCGTTCGGAATGCCGTTGCTCTGTAAA

One goal of comparing two strings is to determine how similar they are. Similarity can be measured by many standards. For example, we can say that two strings are similar if one is a subsequence of the other; neither S1 nor S2 above is a subsequence of the other. Alternatively, we can say that two strings are similar if only a few corrective operations (such as substitution, insertion or deletion) turn one into the other. A further definition is as follows: for two strings S1 and S2, find a sequence S3 whose elements all appear in both S1 and S2 in the same order, though not necessarily contiguously. The longer the longest S3 that can be found, the more similar S1 and S2 are. In the example above, the longest S3 that can be found is GTCGTCGGAAGCCGGCCGAA.

Shortcoming: the longest common subsequence that is computed is not contiguous, and backtracking is not supported. In e-commerce product matching there is therefore some probability of inaccurate matches.
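The longest common subsequence measure can be sketched with a standard dynamic programming table. This is an illustration only; the helper names `lcs_length` and `is_subsequence` are assumptions. It checks that the S3 quoted above really is a common subsequence of S1 and S2 and computes the length of the longest one.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    # table[i][j] is the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, start=1):
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def is_subsequence(s: str, t: str) -> bool:
    """True if s can be obtained from t by deleting characters."""
    remaining = iter(t)
    return all(ch in remaining for ch in s)


S1 = "ACCGGTCGAGTGCGCGGAAGCCGGCCGAA"
S2 = "GTCGTTCGGAATGCCGTTGCTCTGTAAA"
S3 = "GTCGTCGGAAGCCGGCCGAA"  # the common subsequence quoted in the text

print(is_subsequence(S3, S1), is_subsequence(S3, S2))
print(len(S3), lcs_length(S1, S2))
```

Because the matched characters need not be adjacent, two unrelated product names that share many scattered letters can still score well, which is the inaccuracy noted in the shortcoming above.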

At present there is no suitable model for constructing an optimal matching algorithm over the large number of strings involved in product matching. This document selects an asymmetric many-to-many model built on the symmetric stable marriage algorithm. The symmetric stable marriage algorithm:

A community has n girls and n boys. Each girl ranks the boys in order of her preference, and each boy likewise ranks the girls in order of his own preference. The n girls and n boys are then paired into a complete marriage, as shown in Fig. 1.

Shortcoming: each participant's ranked preference list must be constructed in advance. For e-commerce product matching, an algorithm for constructing the preference lists is lacking.
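The classic symmetric stable marriage problem is usually solved with the Gale-Shapley deferred acceptance procedure. The source does not name the procedure it uses, so the sketch below is the textbook algorithm offered as an illustration; the function name and the three-by-three preference data are made up for the example.

```python
def stable_marriage(proposer_prefs, receiver_prefs):
    """Gale-Shapley deferred acceptance for n proposers and n receivers.

    Both arguments map each participant to a complete list of the other
    side, most preferred first. Returns a dict proposer -> receiver.
    """
    # rank[r][p] is the position of proposer p in r's list (lower is better).
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # index of the next receiver to try
    engaged_to = {}                               # receiver -> current proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged_to.get(r)
        if current is None:
            engaged_to[r] = p                     # r was unmatched: accept
        elif rank[r][p] < rank[r][current]:
            engaged_to[r] = p                     # r trades up, current is free again
            free.append(current)
        else:
            free.append(p)                        # r rejects p, p tries the next one
    return {p: r for r, p in engaged_to.items()}


# Hypothetical preference lists, written by hand.
girls = {"g1": ["b1", "b2", "b3"], "g2": ["b1", "b3", "b2"], "g3": ["b2", "b1", "b3"]}
boys = {"b1": ["g2", "g1", "g3"], "b2": ["g1", "g3", "g2"], "b3": ["g1", "g2", "g3"]}
print(stable_marriage(girls, boys))
```

Note that every list in `girls` and `boys` had to be written out before the algorithm could run, which is exactly the prerequisite named in the shortcoming above.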

Therefore, medium and large e-commerce websites currently use web information tracking or crawler tools to capture external product data, but these have certain prerequisites and limitations:

1, Some products on external websites lack the key attributes or specifications that serve as the direct basis for comparison when matching captured data. Take the product "Creative ZEN 4GB BLACK Mp3 Mp4 Video Player with Expandable SD Card Slot", which is the product's name on one B2C website. Each B2C website names its products differently, so if the product number is missing from the information list, the accuracy of matching the captured data drops.

2, The attributes or specifications of similar products in the same series are very much alike, which lowers the accuracy of the captured data and requires manual matching, reducing efficiency. For example, the product names "Epson Light Black Ink Cartridge T096720" and "Epson Matte Black Ink Cartridge T096820" have the product numbers "T096720" and "T096820" respectively. The string similarity of these main attributes is very high. If matching captures both products at once, the best match still has to be chosen manually, so search efficiency is low and real-time performance is poor.

As e-commerce development gathers momentum, the handling, processing and conversion of data in the later stages of e-commerce data mining will be a very promising field. The matching of existing product data therefore needs further improvement.

Summary of the invention:

The object of the present invention is to overcome the deficiencies of the prior art and to provide an approximate matching method for a large number of character strings that is built on an ontology-based multi-algorithm model and adds attribute analysis to select the algorithm automatically, so that the matching results are stable, the matching rate is high, and the preference lists can be constructed with different approximate string matching algorithms according to the application scenario.

To achieve this object, the technical solution of the present invention is as follows:

An approximate matching method for a large number of character strings, characterized in that the method steps are as follows:

(1) select a main matching parameter of an object to be matched;

(2) adjust the parameter weight values; the following three kinds of parameters are mainly provided:

(a) the parameter value computed by the edit distance algorithm or the longest common subsequence algorithm for approximate matching of product numbers;

(b) the parameter value computed by the edit distance algorithm or the longest common subsequence algorithm for approximate matching of product names;

(c) the parameter value for comparing the price ranges of products;

(3) construct a many-to-many matching model using the asymmetric stable marriage algorithm, which is an innovation built on the stable marriage algorithm;

(4) for the matching items in the many-to-many model, construct a preference list according to the edit distance algorithm or the longest common subsequence algorithm.

Further, the main matching parameter in step (1) is a characteristic matching attribute that is easy to distinguish, selected according to the product data source after the target data has been classified in some way (for example by brand or category).

Further, the target data is collected by using search engine technology to obtain the attribute data of a large number of products on the Internet.

The present invention builds on the asymmetric stable marriage algorithm to construct a many-to-many matching model for a large number of character strings. The preference lists of the model's matching items use two approximate string matching algorithms, the edit distance algorithm and the longest common subsequence algorithm. Our own products and the other party's products are compared on their component attributes: product number against product number, product name against product name, product number against product name, price against price, and sales type against sales type. Combining these three algorithmic techniques and setting certain parameter weights, the computation runs automatically and yields the best match.
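One way to picture the weighted combination of attribute comparisons is a single score per product pair. The source does not give weight values, a normalisation, or a price formula, so everything in this sketch is an assumption: the `Product` fields, the `WEIGHTS` values, the rescaling of edit distance to a 0..1 similarity, the relative price closeness, and the prices in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    number: str
    name: str
    price: float  # assumed non-negative


def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical, ignoring case."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def price_closeness(p: float, q: float) -> float:
    """1.0 for equal prices, falling towards 0.0 as the relative gap grows."""
    high = max(p, q)
    return 1.0 if high <= 0 else 1.0 - abs(p - q) / high


# Hypothetical weights; the source only says that parameter weights are set.
WEIGHTS = {"number": 0.5, "name": 0.3, "price": 0.2}


def match_score(ours: Product, theirs: Product, weights=WEIGHTS) -> float:
    number = similarity(ours.number, theirs.number)
    # Number against name: full credit if our number appears verbatim in their name.
    if ours.number and ours.number.lower() in theirs.name.lower():
        number = 1.0
    name = similarity(ours.name, theirs.name)
    price = price_closeness(ours.price, theirs.price)
    return weights["number"] * number + weights["name"] * name + weights["price"] * price


ours = Product("T096720", "Epson Light Black Ink Cartridge T096720", 19.99)
light = Product("T096720", "EPSON T096720 Light Black Ink", 18.99)
matte = Product("T096820", "EPSON T096820 Matte Black Ink", 18.99)
print(round(match_score(ours, light), 3), round(match_score(ours, matte), 3))
```

With these assumed weights the exact product number dominates, so the "Light Black" listing scores above the near-identical "Matte Black" one. Sales type is left out of the sketch because the source does not say how it is compared.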

The present invention adds attribute analysis to select the algorithm automatically for the identification and automatic matching of products. Taking into account the semantic similarity of both product component concepts and attribute component concepts, it proposes a semantic similarity matching algorithm based on the product ontology structure. This solves the problem of automatically matching the ontology's product information against external sources, and the accuracy achieved is fairly satisfactory.

Compared with current methods for matching a large number of character strings, once the model is established the present invention gives stable matching results, a high matching rate and good real-time performance. Preference lists can be constructed with different approximate string matching algorithms according to the application scenario. This addresses the complexity and low efficiency of many traditional manual operations, allows targeted product decisions to be made in the shortest time, strengthens product competitiveness, improves website operating efficiency and system performance, and has good prospects for development and application.

Description of drawings:

The present invention is further described below with reference to the drawings and specific embodiments.

Fig. 1 is a diagram of girls and boys paired into a complete marriage by the existing symmetric stable marriage algorithm;

Fig. 2 is a diagram of girls and boys paired into a complete marriage using the preference lists constructed by the present invention;

Fig. 3 is a process flow diagram of the present invention.

Embodiment:

To make the technical means, inventive features, objectives and effects of the present invention easy to understand, the invention is further set forth below with reference to the specific drawings.

Referring to Fig. 3, the specific matching method of the present invention is as follows:

(1) obtain the attribute data of a large number of products on the Internet using search engine technology, and collect the target data;

(2) after the target data has been classified in some way (for example by brand or category), select a characteristic matching attribute that is easy to distinguish according to the product data source as the matching parameter, for example the product name;

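(3) adjust the parameter weight values; the following three kinds of parameters are mainly provided:

(a) the parameter value computed by the edit distance algorithm or the longest common subsequence algorithm for approximate matching of product numbers;

(b) the parameter value computed by the edit distance algorithm or the longest common subsequence algorithm for approximate matching of product names;

(c) the parameter value for comparing the price ranges of products;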

(4) for the selected source data to be matched, construct a many-to-many matching model according to the asymmetric stable marriage algorithm;

(5) for the matching items in the many-to-many model, construct the preference list of each matching item with the edit distance algorithm or the longest common subsequence algorithm; different approximate string matching algorithms can be chosen for different scenarios;

(6) obtain the final result after running this method; the computation can be narrowed to smaller units according to the brand or category information of the products, improving the efficiency of the algorithm.
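Narrowing the computation by brand or category can be pictured as grouping the other party's products by a key and comparing only within a group. The sketch below is one possible reading, not the patent's procedure; the record layout, the `brand` field, the `candidate_pairs` name and all sample records are hypothetical.

```python
from collections import defaultdict


def candidate_pairs(ours, theirs, key="brand"):
    """Yield (our_item, their_item) pairs that share the same grouping key."""
    buckets = defaultdict(list)
    for item in theirs:
        buckets[item[key].strip().lower()].append(item)
    for item in ours:
        # Only products in the same bucket are ever compared.
        for other in buckets.get(item[key].strip().lower(), []):
            yield item, other


# Hypothetical records for illustration.
ours = [
    {"brand": "Epson", "name": "Epson Light Black Ink Cartridge T096720"},
    {"brand": "Epson", "name": "Epson Matte Black Ink Cartridge T096820"},
    {"brand": "Creative", "name": "Creative ZEN 4GB BLACK"},
]
theirs = [
    {"brand": "EPSON", "name": "EPSON T096720 Light Black Ink"},
    {"brand": "EPSON", "name": "EPSON T096820 Matte Black Ink"},
    {"brand": "Creative", "name": "Creative ZEN 4GB Black MP3 Player"},
    {"brand": "Canon", "name": "Canon Black Ink Tank"},
]

pairs = list(candidate_pairs(ours, theirs))
print(len(ours) * len(theirs), "pairs without grouping")
print(len(pairs), "pairs with brand grouping")
```

The string comparisons are the expensive part, so cutting the number of candidate pairs before any distance is computed is where the efficiency gain would come from.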

It should be noted that the present invention adds attribute analysis to select the algorithm automatically for the identification and automatic matching of products. Taking into account the semantic similarity of both product component concepts and attribute component concepts, it proposes a semantic similarity matching algorithm based on the product ontology structure. This solves the problem of automatically matching the ontology's product information against external sources, and the accuracy achieved is fairly satisfactory.

For example, a community has n girls and n boys. Each girl ranks the boys in order of her preference, and each boy likewise ranks the girls in order of his own preference. Matching parameters are then established from the girls' and boys' preferences, the source data to be matched is selected, and a many-to-many matching model is constructed according to the asymmetric stable marriage algorithm, that is, the boys' and girls' preferences are constructed in the same way. The preference lists are then constructed with the edit distance algorithm or the longest common subsequence algorithm (see Fig. 2), and the n girls and n boys are paired into a complete marriage.
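The step that replaces hand-written preferences with computed ones can be sketched as ranking the other side by a string similarity score. The example below uses a longest common subsequence score normalised by the longer string, which is an assumed normalisation; the function names and the competitor listings are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """LCS length divided by the longer length, ignoring case."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


def build_preferences(side, other_side, score=lcs_similarity):
    """For each item of side, rank every item of other_side, best first."""
    return {
        item: sorted(other_side, key=lambda cand: score(item, cand), reverse=True)
        for item in side
    }


ours = [
    "Epson Light Black Ink Cartridge T096720",
    "Epson Matte Black Ink Cartridge T096820",
]
theirs = [  # hypothetical competitor listings
    "EPSON T096820 Matte Black Ink",
    "EPSON T096720 Light Black Ink",
]

our_prefs = build_preferences(ours, theirs)
their_prefs = build_preferences(theirs, ours)
for item, ranked in our_prefs.items():
    print(item, "->", ranked[0])
```

Both sides get their lists from the same scoring function, which matches the statement that the two sides' preferences are constructed in the same way. The resulting dictionaries have the shape a stable matching routine expects as input, and passing a different `score` function swaps in another string algorithm for a different scenario.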

In view of the above, the present invention belongs to one of the most active branches of research, development and application in data management and information processing. It assists decision makers and addresses the complexity and low efficiency of many traditional manual operations. It helps enterprises understand consumption tendencies and market trends in real time, make targeted product decisions in the shortest time, strengthen product competitiveness, and improve website operating efficiency and system performance, and it has good prospects for development and application.

The basic principles, main features and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited by the above embodiments. The embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection claimed is defined by the appended claims and their equivalents.

Claims

1. An approximate matching method for a large number of character strings, characterized in that the method steps are as follows:

(1) select a main matching parameter of an object to be matched;

(c) the parameter value for comparing the price ranges of products;

(3) construct a many-to-many matching model using the asymmetric stable marriage algorithm;

2. The approximate matching method for a large number of character strings according to claim 1, characterized in that the main matching parameter in step (1) is a characteristic matching attribute that is easy to distinguish, selected according to the product data source after the target data has been classified in some way (for example by brand or category).

3. The approximate matching method for a large number of character strings according to claim 2, characterized in that the target data is collected by using search engine technology to obtain the attribute data of a large number of products on the Internet.