CN110517077A - Commodity similarity analysis method, apparatus and storage medium based on attributive distance - Google Patents

Commodity similarity analysis method, apparatus and storage medium based on attributive distance

Info

Publication number
CN110517077A
Authority
CN
China
Prior art keywords
commodity
distance
data
inherent nature
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910772621.4A
Other languages
Chinese (zh)
Inventor
葛忠林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Cargo Price Technology Co Ltd
Original Assignee
Tianjin Cargo Price Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Cargo Price Technology Co Ltd
Priority to CN201910772621.4A
Publication of CN110517077A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0283 Price estimation or determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Development Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Databases & Information Systems (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention discloses a commodity similarity analysis method, apparatus and storage medium based on attributive distance. The method includes: choosing any two pieces of commodity data as a commodity data pair to be analyzed; performing attribute extraction on the pair to obtain the commodities' latent attributes; based on the latent attributes, applying multi-layer distance algorithm processing to the pair to obtain multiple distance values; and inputting the distance values into a preset prediction model to obtain the similarity value of the pair. With this technical solution, similar commodities in disordered data can be identified quickly and accurately without practitioners having to identify them manually from prior knowledge, which improves both the accuracy of commodity similarity identification and working efficiency.

Description

Commodity similarity analysis method, apparatus and storage medium based on attributive distance
Technical field
The present invention relates to the technical field of statistical data analysis, and in particular to a commodity similarity analysis method, apparatus and storage medium based on attributive distance.
Background
Modern society offers a vast variety of commodities, and the prices of wholesale commodities on the market range from high to low under the influence of many factors. Merchandise sales practitioners very much want a way to recognize identical commodities in order to improve their business results. The existing problems are: it is difficult to determine whether two items are the same commodity; data analysis results on commodity similarity are inaccurate; staff need a great deal of prior knowledge to recognize commodities; and most analysis methods have defects.
Specifically, few identification schemes exist, and most of them compute the similarity of product names, which is not very accurate. The way practitioners analyze commodities is also often subjective in its information channels and points of focus: data analysts tend to subconsciously collect data that supports the judgment they expect, so the analysis results are often not accurate enough, and price fluctuations of one and the same commodity ultimately go undetected.
Competition between similar commodities therefore puts great pressure on practitioners. Accurately and quickly finding a pair of commodities that compete with each other can provide an important reference for expanding the market for products in the industry and for reducing costs.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a commodity similarity analysis method, apparatus and storage medium based on attributive distance, so as to improve the accuracy of commodity similarity identification and improve working efficiency.
To achieve the above purpose, in a first aspect, an embodiment of the invention provides a commodity similarity analysis method based on attributive distance, comprising:
Choosing any two pieces of commodity data as a commodity data pair to be analyzed;
Performing attribute extraction on the commodity data pair to be analyzed to obtain commodity latent attributes;
Based on the commodity latent attributes, applying multi-layer distance algorithm processing to the commodity data pair to be analyzed to obtain multiple distance values;
Inputting the multiple distance values into a preset prediction model to obtain the similarity value of the commodity data pair to be analyzed.
In a specific embodiment of the application, performing attribute extraction on the commodity data pair to be analyzed to obtain the commodity latent attributes specifically includes:
Performing word segmentation on the product names of the commodity data pair to be analyzed to extract the commodity latent attributes.
In a specific embodiment of the application, applying multi-layer distance algorithm processing to the commodity data pair to be analyzed based on the commodity latent attributes to obtain multiple distance values specifically includes:
Vectorizing the commodity latent attributes;
Performing multi-layer distance calculation on the vectorized commodity latent attributes using an edit distance algorithm, a cosine distance algorithm, or an optimized CRF (conditional random field) algorithm, to obtain multiple distance values.
The commodity latent attributes include product name, brand, single-item specification, sales specification, single-item unit, quantity per package, model, or flavor, and vectorizing the commodity latent attributes specifically includes:
Vectorizing the product name, brand, sales specification, model and flavor using the tf-idf or n-gram algorithm;
Vectorizing the single-item unit and single-item specification using 0/1 matching.
Further, in a preferred embodiment of the application, the commodity similarity analysis method further includes training the prediction model, which specifically includes:
Obtaining sample data, the sample data including commodity pairs in which the similarity value of the two commodities exceeds a threshold;
Performing attribute extraction on the sample data to obtain sample latent attributes;
Vectorizing the sample latent attributes using the tf-idf or n-gram algorithm or 0/1 matching;
Performing multi-layer distance calculation on the vectorized sample latent attributes using an edit distance algorithm, a cosine distance algorithm, or an optimized CRF (conditional random field) algorithm, to obtain multiple sample distance values;
Based on the multiple sample distance values, performing logistic regression training on the sample data using a machine learning method to obtain the prediction model.
In a specific embodiment of the application, performing logistic regression training on the sample data using a machine learning method based on the multiple sample distance values to obtain the prediction model specifically includes:
Establishing a distance matrix from the multiple sample distance values;
Performing logistic regression training using a machine learning method to solve for multiple attribute weight values;
Determining the prediction model from the multiple attribute weight values.
In a second aspect, an embodiment of the application further provides a commodity similarity analysis device based on attributive distance, comprising:
A choosing module, for choosing any two pieces of commodity data as a commodity data pair to be analyzed;
An extraction module, for performing attribute extraction on the commodity data pair to be analyzed to obtain commodity latent attributes;
A computing module, for applying multi-layer distance algorithm processing to the commodity data pair to be analyzed based on the commodity latent attributes, to obtain multiple distance values;
A prediction module, for inputting the multiple distance values into a preset prediction model to obtain the similarity value of the commodity data pair to be analyzed.
Further, the device also includes a training module for training the prediction model, which specifically includes:
Obtaining sample data, the sample data including commodity pairs in which the similarity value of the two commodities exceeds a threshold;
Performing attribute extraction on the sample data to obtain sample latent attributes;
Vectorizing the sample latent attributes using the tf-idf or n-gram algorithm or 0/1 matching;
Performing multi-layer distance calculation on the vectorized sample latent attributes using an edit distance algorithm, a cosine distance algorithm, or an optimized CRF (conditional random field) algorithm, to obtain multiple sample distance values;
Based on the multiple sample distance values, performing logistic regression training on the sample data using a machine learning method to obtain the prediction model.
In a third aspect, an embodiment of the invention further provides a commodity similarity analysis device based on attributive distance, including a processor, an input device, an output device and a memory that are connected to one another, wherein the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method of the first aspect.
In a fourth aspect, an embodiment of the invention further provides a computer-readable storage medium storing a computer program, the computer program including program instructions which, when executed by a processor, cause the processor to execute the method of the first aspect.
By implementing the embodiments of the present invention, the latent attributes of the commodity data pair to be analyzed are first extracted, multi-layer distance algorithm processing is performed based on those latent attributes to obtain multiple distance values, and the distance values are finally input into the preset prediction model to obtain the similarity value of the pair. With this technical solution, similar commodities in disordered data can be identified quickly and accurately without practitioners having to identify them manually from prior knowledge, which improves the accuracy of commodity similarity identification and also improves working efficiency.
Brief description of the drawings
In order to illustrate the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the specific embodiments or the prior art are briefly described below.
Fig. 1 is a schematic flowchart of the commodity similarity analysis method based on attributive distance provided by the first embodiment of the invention;
Fig. 2 is a sub-flowchart of step S101 in Fig. 1;
Fig. 3 is a structural schematic diagram of the commodity similarity analysis device based on attributive distance provided by one embodiment of the invention;
Fig. 4 is a structural schematic diagram of the commodity similarity analysis device based on attributive distance provided by another embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
It should be understood that, when used in this specification and the appended claims, the terms "includes" and "comprising" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
Fig. 1 shows the commodity similarity analysis method based on attributive distance provided by the first embodiment of the invention. As shown in the figure, the method may include the following steps:
S101, train the prediction model.
Specifically, as shown in Fig. 2, step S101 includes:
S1011, obtain sample data.
In this embodiment, from a large amount of commodity data with known similarity values, pairs of commodities whose similarity value exceeds a threshold (for example, more than 98%, i.e. a high similarity value) are chosen as sample data. Understandably, the sample data is in effect a training sample set containing multiple commodity pairs with high similarity values.
S1012, perform attribute extraction on the sample data to obtain sample latent attributes.
Specifically, word segmentation is performed on the product names in the sample data to extract the sample latent attributes.
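As a rough illustration of this extraction step, the sketch below pulls a few latent attributes out of a product name with regular expressions. It is only a minimal sketch: the brand lexicon, the unit list, the attribute keys and the function name `extract_attributes` are all hypothetical, and a real system would more likely rely on a trained word segmenter than on hand-written patterns.

```python
import re

# Hypothetical brand lexicon and unit patterns, assumed for illustration only.
KNOWN_BRANDS = {"acme", "sunny"}
SPEC_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(ml|l|g|kg)\b", re.IGNORECASE)
PACK_PATTERN = re.compile(r"[x*×]\s*(\d+)\b", re.IGNORECASE)


def extract_attributes(name: str) -> dict:
    """Pull a few latent attributes (brand, spec, unit, pack count) from a product name."""
    attrs = {"name": name, "brand": None, "unit_spec": None,
             "unit": None, "pack_count": None}
    spec = SPEC_PATTERN.search(name)
    if spec:
        attrs["unit_spec"] = float(spec.group(1))   # e.g. 500.0
        attrs["unit"] = spec.group(2).lower()       # e.g. "ml"
    pack = PACK_PATTERN.search(name)
    if pack:
        attrs["pack_count"] = int(pack.group(1))    # e.g. 12
    # Very crude "segmentation": alphabetic tokens checked against the lexicon.
    for token in re.findall(r"[A-Za-z]+", name):
        if token.lower() in KNOWN_BRANDS:
            attrs["brand"] = token.lower()
            break
    return attrs


example = extract_attributes("Acme Orange Juice 500ml x 12")
```

Attributes that cannot be found stay `None`, which keeps missing data distinguishable from a mismatch in later steps.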
S1013, vectorize the sample latent attributes using the tf-idf or n-gram algorithm or 0/1 matching.
The sample latent attributes include, but are not limited to, product name, brand, single-item specification, sales specification, single-item unit, quantity per package, model, and flavor. In this embodiment, the product name, brand, sales specification, model and flavor are vectorized using the tf-idf or n-gram algorithm, while the single-item unit and single-item specification are vectorized by exact 0/1 matching; missing data does not take part in the calculation. For example, if the single-item units of two commodities are exactly the same, the match value is 1; if they differ, the match value is 0.
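One way to combine the two text vectorization options named above is to weight character n-grams by tf-idf. The sketch below assumes character bigrams and a smoothed idf of the form ln((1 + N) / (1 + df)) + 1; neither choice is stated in the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> list:
    """Character n-grams of a lower-cased string with spaces removed."""
    text = text.lower().replace(" ", "")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(texts: list, n: int = 2) -> list:
    """Return one sparse tf-idf vector (dict: n-gram -> weight) per input text."""
    grams = [Counter(char_ngrams(t, n)) for t in texts]
    # Document frequency: in how many texts each n-gram occurs.
    doc_freq = Counter(g for counts in grams for g in counts)
    total_docs = len(texts)
    vectors = []
    for counts in grams:
        length = sum(counts.values()) or 1
        vectors.append({
            # Assumed smoothed idf: ln((1 + N) / (1 + df)) + 1
            g: (c / length) * (math.log((1 + total_docs) / (1 + doc_freq[g])) + 1.0)
            for g, c in counts.items()
        })
    return vectors


vectors = tfidf_vectors(["cola 330ml", "cola 500ml"])
```

N-grams shared by both names (such as "co") receive a lower weight than n-grams that occur in only one of them (such as "33"), so the distinguishing parts of a name dominate the vector.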
S1014, perform multi-layer distance calculation on the vectorized sample latent attributes using an edit distance algorithm, a cosine distance algorithm, or an optimized CRF (conditional random field) algorithm, to obtain multiple sample distance values.
The formulas for attribute recognition are described as follows.
Automatic commodity attribute recognition algorithm:
Given a sequence X and a random sequence Y, the conditional probability is $P(Y \mid X)$, with
$X = \{X_1, X_2, \dots, X_n\}$, $Y = \{Y_1, Y_2, \dots, Y_n\}$
Here $t_k(Y_{i-1}, Y_i, X, i)$ denotes the transfer function $t_k$, which gives the transition probability between the values of sequence Y at positions i-1 and i given sequence X, and $s_l(Y_i, X, i)$ denotes the state function $s_l$, which gives the probability of the value of sequence Y at position i given sequence X. Each of the two functions also carries its own weight ($\lambda_k$ for $t_k$, and correspondingly for $s_l$). For $t_k(Y_{i-1}, Y_i, X, i)$, k = 1, 2, 3, …, K, where K is the total number of local feature functions defined at the node and i is the position of the current node in the sequence; for $s_l(Y_i, X, i)$, l = 1, 2, 3, …, L, where L is the total number of node feature functions defined at the node and i is the position of the current node in the sequence.
Letting $s_l = t_k$ to unify the features, summing over all positions i and normalizing gives:
where $F(Y, X)$ is the name of the function.
The normalization factor:
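The expressions themselves are not reproduced in the text above. Assuming the description follows the standard linear-chain CRF formulation, which is consistent with the symbols defined here, the three expressions would read roughly as below. The weight symbol $\mu_l$ for the state function, the unified feature $f_k$ with weight $w_k$, and the indexed form $F_k(Y, X)$ are notational assumptions and are not taken from the source.

```latex
% Conditional probability with transfer functions t_k and state functions s_l
P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{i,k} \lambda_k\, t_k(Y_{i-1}, Y_i, X, i)
              + \sum_{i,l} \mu_l\, s_l(Y_i, X, i) \Big)

% Unified features f_k summed over all positions i, then normalized
F_k(Y, X) = \sum_{i=1}^{n} f_k(Y_{i-1}, Y_i, X, i), \qquad
P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{k} w_k\, F_k(Y, X) \Big)

% Normalization factor: sum over all candidate label sequences Y
Z(X) = \sum_{Y} \exp\Big( \sum_{k} w_k\, F_k(Y, X) \Big)
```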
In this embodiment, attributes expressed in natural language are vectorized using tf-idf and n-gram. The binary (0/1) distance formula:
Let the commodity attributes be indexed by k = 1, 2, 3, 4, …, K.
Custom attribute conditional distance:
The edit distance algorithm is expressed as follows:
where S1 and S2 are sentences, and i and j are positions in the sentences.
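A minimal sketch of an edit distance between two sentences S1 and S2 follows, assuming that "edit distance" here means the usual Levenshtein distance with unit costs for insertion, deletion and substitution. The length normalization in the second function is an added assumption, included only so that the value is comparable across names of different lengths.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance via the recurrence
    D[i][j] = min(D[i-1][j] + 1, D[i][j-1] + 1, D[i-1][j-1] + (s1[i] != s2[j])),
    keeping only the previous row of the table in memory."""
    prev = list(range(len(s2) + 1))          # row for the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]                           # deleting i characters of s1
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(s1: str, s2: str) -> float:
    """Edit distance divided by the longer length, so the result lies in [0, 1]."""
    longest = max(len(s1), len(s2))
    return edit_distance(s1, s2) / longest if longest else 0.0
```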
The cosine distance algorithm is expressed as follows:
where a and b are term vectors.
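For two term vectors a and b, the cosine distance is commonly taken as one minus the cosine of the angle between them. The sketch below assumes that definition and a sparse representation (dict from term to weight); returning 1.0 for an all-zero vector is a convention chosen here, not something the text specifies.

```python
import math


def cosine_distance(a: dict, b: dict) -> float:
    """1 - (a . b) / (|a| * |b|) for sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumed convention: an empty vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

Vectors pointing in the same direction give a distance of 0, and vectors with no terms in common give a distance of 1.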
The optimized CRF (conditional random field) distance algorithm is as follows:
Feature evaluation function: $\Phi_{i,j}(x_{i,j}, \lambda) = \exp\{x_{i,j}\lambda^{T}\}$
where X is the two-dimensional matrix of term vectors formed by a pair of short sentences, i and j are positions in the matrix, λ is the weight parameter, and T denotes matrix transposition.
For example, the distance between product names can be calculated using the edit distance algorithm, the cosine distance algorithm, or the optimized CRF algorithm, which provides the basic parameters for the analysis of commodity similarity.
S1015, establish a distance matrix from the multiple sample distance values.
S1016, perform logistic regression training using a machine learning method to solve for multiple attribute weight values.
S1017, determine the prediction model from the multiple attribute weight values.
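Steps S1015 to S1017 can be sketched as fitting a logistic regression on the rows of the distance matrix, one row per commodity pair and one column per attribute distance. The sketch uses plain batch gradient descent and toy numbers. It also assumes that the training set contains labelled dissimilar pairs (label 0) alongside the high-similarity pairs (label 1), since a logistic regression cannot be fitted from positive examples alone; the text only describes how the high-similarity pairs are chosen.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit one weight per attribute distance, plus a bias, by batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this pair under the current weights.
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


# Toy distance matrix (hypothetical values): columns are name distance, brand distance.
distance_matrix = [[0.05, 0.10], [0.10, 0.00], [0.80, 0.90], [0.70, 0.95]]
labels = [1, 1, 0, 0]  # 1 = same commodity, 0 = different (assumed labels)
weights, bias = train_logistic(distance_matrix, labels)
```

The learned weights come out negative on this toy data, reflecting that larger attribute distances lower the predicted similarity; their relative magnitudes play the role of the attribute weight values.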
S102, choose any two pieces of commodity data as a commodity data pair to be analyzed.
S103, perform attribute extraction on the commodity data pair to be analyzed to obtain commodity latent attributes.
Specifically, word segmentation is performed on the product names in the commodity data pair to be analyzed to extract the commodity latent attributes.
S104, based on the commodity latent attributes, apply multi-layer distance algorithm processing to the commodity data pair to be analyzed to obtain multiple distance values.
In this step, the commodity latent attributes are first vectorized, and multi-layer distance algorithm processing is then performed.
The commodity latent attributes include, but are not limited to, product name, brand, single-item specification, sales specification, single-item unit, quantity per package, model, and flavor. In this embodiment, the product name, brand, sales specification, model and flavor are vectorized using the tf-idf or n-gram algorithm, while the single-item unit and single-item specification are vectorized by exact 0/1 matching; missing data does not take part in the calculation. For example, if the single-item units of two commodities are exactly the same, the match value is 1; if they differ, the match value is 0.
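The 0/1 matching rule, including the handling of missing data, can be sketched as follows. The attribute keys and function names are hypothetical, and the trimming and lower-casing applied before comparison is an added assumption about what counts as "exactly the same".

```python
from typing import Optional


def exact_match(a: Optional[str], b: Optional[str]) -> Optional[int]:
    """1 if both values are present and equal, 0 if present and different,
    None if either value is missing (so it can be left out of the calculation)."""
    if a is None or b is None or a == "" or b == "":
        return None
    return 1 if a.strip().lower() == b.strip().lower() else 0


def match_features(item_a: dict, item_b: dict, keys=("unit", "unit_spec")) -> dict:
    """0/1 features for the given attribute keys, skipping missing data."""
    features = {}
    for key in keys:
        result = exact_match(item_a.get(key), item_b.get(key))
        if result is not None:
            features[key] = result
    return features


both_present = match_features({"unit": "ml", "unit_spec": "500"},
                              {"unit": "ML", "unit_spec": "330"})
one_missing = match_features({"unit": "ml"}, {"unit": None})
```

Returning `None` for a missing value, rather than 0, keeps "unknown" separate from "different", which is what allows missing data to stay out of the calculation.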
Further, multi-layer distance algorithm processing is performed on the vectorized commodity latent attributes using the edit distance algorithm, the cosine distance algorithm, or the optimized CRF (conditional random field) algorithm, to obtain multiple distance values. For the specific algorithms used in this part, refer to the sample data part above; they are not repeated here.
S105, input the multiple distance values into the preset prediction model to obtain the similarity value of the commodity data pair to be analyzed.
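With a logistic regression as the prediction model, step S105 reduces to a weighted sum of the distance values passed through a sigmoid. The sketch below uses made-up weights and a made-up bias purely for illustration; in practice these would be the attribute weight values obtained in training.

```python
import math


def predict_similarity(distances, weights, bias):
    """Similarity in (0, 1): sigmoid of the weighted sum of attribute distances."""
    z = bias + sum(w * d for w, d in zip(weights, distances))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical attribute weights and bias, not values from the source.
EXAMPLE_WEIGHTS = [-4.0, -4.0]
EXAMPLE_BIAS = 3.0

near = predict_similarity([0.0, 0.0], EXAMPLE_WEIGHTS, EXAMPLE_BIAS)
far = predict_similarity([1.0, 1.0], EXAMPLE_WEIGHTS, EXAMPLE_BIAS)
```

A pair whose attribute distances are all zero scores close to 1, and the score falls as the distances grow.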
By implementing the commodity similarity analysis method based on attributive distance provided by the embodiment of the present invention, a prediction model is trained using a machine learning algorithm; the latent attributes of the commodity data pair to be analyzed are first extracted, multi-layer distance algorithm processing is performed based on those latent attributes to obtain multiple distance values, and the distance values are finally input into the preset prediction model to obtain the similarity value of the pair. With this technical solution, similar commodities in disordered data can be identified quickly and accurately without practitioners having to identify them manually from prior knowledge, which improves the accuracy of commodity similarity identification and improves working efficiency.
Based on the same inventive concept, an embodiment of the present invention also provides a commodity similarity analysis device based on attributive distance. As shown in Fig. 3, the commodity similarity analysis device includes:
A training module 10, for training the prediction model;
A choosing module 11, for choosing any two pieces of commodity data as a commodity data pair to be analyzed;
An extraction module 12, for performing attribute extraction on the commodity data pair to be analyzed to obtain commodity latent attributes;
A computing module 13, for applying multi-layer distance algorithm processing to the commodity data pair to be analyzed based on the commodity latent attributes, to obtain multiple distance values;
A prediction module 14, for inputting the multiple distance values into the preset prediction model to obtain the similarity value of the commodity data pair to be analyzed.
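The module split listed above can be pictured as a small class that wires the modules together as interchangeable callables. This is only a structural sketch: the class name, field names and the stub implementations used to exercise it are hypothetical, and the training module is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class SimilarityDevice:
    extract: Callable[[str], Dict]                       # extraction module
    compute_distances: Callable[[Dict, Dict], List[float]]  # computing module
    predict: Callable[[Sequence[float]], float]          # prediction module

    def choose_pairs(self, names: Sequence[str]) -> List[Tuple[str, str]]:
        """Choosing module: every unordered pair of commodities."""
        return [(names[i], names[j])
                for i in range(len(names)) for j in range(i + 1, len(names))]

    def analyze(self, name_a: str, name_b: str) -> float:
        """Extract attributes, compute distances, then predict a similarity value."""
        attrs_a, attrs_b = self.extract(name_a), self.extract(name_b)
        return self.predict(self.compute_distances(attrs_a, attrs_b))


# Stub modules (assumed for illustration): token sets and a Jaccard-style distance.
device = SimilarityDevice(
    extract=lambda name: {"tokens": set(name.lower().split())},
    compute_distances=lambda a, b: [
        1.0 - len(a["tokens"] & b["tokens"]) / len(a["tokens"] | b["tokens"])
    ],
    predict=lambda distances: 1.0 - distances[0],
)
```

Keeping each module behind a plain callable makes it straightforward to swap a stub for a real segmenter, distance layer or trained model without touching the rest of the pipeline.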
Specifically, in this embodiment, the training module 10 is specifically used for:
Obtaining sample data, the sample data including commodity pairs in which the similarity value of the two commodities exceeds a threshold;
Performing attribute extraction on the sample data to obtain sample latent attributes;
Vectorizing the sample latent attributes using the tf-idf or n-gram algorithm or 0/1 matching;
Performing multi-layer distance calculation on the vectorized sample latent attributes using an edit distance algorithm, a cosine distance algorithm, or an optimized CRF (conditional random field) algorithm, to obtain multiple sample distance values;
Based on the multiple sample distance values, performing logistic regression training on the sample data using a machine learning method to obtain the prediction model.
Further, the commodity latent attributes include, but are not limited to, product name, brand, single-item specification, sales specification, single-item unit, quantity per package, model, and flavor, and the computing module 13 is specifically used for:
Vectorizing the product name, brand, sales specification, model and flavor using the tf-idf or n-gram algorithm;
Vectorizing the single-item unit and single-item specification using 0/1 matching;
Performing multi-layer distance calculation on the vectorized commodity latent attributes using an edit distance algorithm, a cosine distance algorithm, or an optimized CRF (conditional random field) algorithm, to obtain multiple distance values.
It should be noted that, for the specific workflow of this embodiment, refer to the method embodiment above; it is not repeated here.
Further, another embodiment of the present invention provides a commodity similarity analysis device based on attributive distance. As shown in Fig. 4, the device may include one or more processors 101, one or more input devices 102, one or more output devices 103 and a memory 104, which are connected to one another by a bus 105. The memory 104 is used to store a computer program, the computer program includes program instructions, and the processor 101 is configured to call the program instructions to execute the method of the method embodiment above.
It should be understood that, in the embodiments of the present invention, the processor 101 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The input device 102 may include a keyboard and the like, and the output device 103 may include a display (such as an LCD), a loudspeaker and the like.
The memory 104 may include read-only memory and random access memory, and provides instructions and data to the processor 101. A part of the memory 104 may also include non-volatile random access memory. For example, the memory 104 may also store information about the device type.
In a specific implementation, the processor 101, input device 102 and output device 103 described in the embodiment of the present invention can carry out the implementation described in the embodiment of the commodity similarity analysis method based on attributive distance provided by the embodiment of the present invention, which is not repeated here.
By implementing the commodity similarity analysis device based on attributive distance provided by the embodiment of the present invention, a prediction model is trained using a machine learning algorithm; the latent attributes of the commodity data pair to be analyzed are first extracted, multi-layer distance algorithm processing is performed based on those latent attributes to obtain multiple distance values, and the distance values are finally input into the preset prediction model to obtain the similarity value of the pair. With this technical solution, similar commodities in disordered data can be identified quickly and accurately without practitioners having to identify them manually from prior knowledge, which improves the accuracy of commodity similarity identification and improves working efficiency.
Correspondingly, an embodiment of the invention provides a computer-readable storage medium storing a computer program, the computer program including program instructions which, when executed by a processor, implement the commodity similarity analysis method based on attributive distance described above.
The computer-readable storage medium may be an internal storage unit of the system described in any of the foregoing embodiments, such as a hard disk or memory of the system. It may also be an external storage device of the system, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card fitted to the system. Further, the computer-readable storage medium may include both an internal storage unit of the system and an external storage device. The computer-readable storage medium is used to store the computer program and the other programs and data needed by the system, and may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art may be aware that list described in conjunction with the examples disclosed in the embodiments of the present disclosure Member and algorithm steps, can be realized with electronic hardware, computer software, or a combination of the two, in order to clearly demonstrate hardware With the interchangeability of software, each exemplary composition and step are generally described according to function in the above description.This A little functions are implemented in hardware or software actually, the specific application and design constraint depending on technical solution.Specially Industry technical staff can use different methods to achieve the described function each specific application, but this realization is not It is considered as beyond the scope of this invention.
In several embodiments provided herein, it should be understood that disclosed device and method can pass through it Its mode is realized.For example, the apparatus embodiments described above are merely exemplary, for example, the division of the unit, only Only a kind of logical function partition, there may be another division manner in actual implementation, such as multiple units or components can be tied Another system is closed or is desirably integrated into, or some features can be ignored or not executed.In addition, shown or discussed phase Mutually between coupling, direct-coupling or communication connection can be through some interfaces, the INDIRECT COUPLING or communication of device or unit Connection is also possible to electricity, mechanical or other form connections.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above is merely a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed by the present invention, and these modifications or substitutions shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A commodity similarity analysis method based on attributive distance, characterized by comprising:
selecting any two items of commodity data as a commodity data pair to be analyzed;
performing attribute extraction on the commodity data pair to be analyzed to obtain inherent commodity attributes;
performing multi-layer distance calculation on the commodity data pair to be analyzed, based on the inherent commodity attributes, to obtain a plurality of distance values;
inputting the plurality of distance values into a preset prediction model to obtain a similarity value of the commodity data pair to be analyzed.
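The four steps of claim 1 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the attribute names, `PRESET_WEIGHTS`, `PRESET_BIAS`, and all function names are hypothetical, attribute extraction is reduced to reading fields from a structured record, and `difflib.SequenceMatcher` is used as a simple stand-in for the distance algorithms named in claim 3.

```python
import math
from difflib import SequenceMatcher

# Hypothetical preset model: one weight per attribute distance plus a bias.
# Negative weights mean that a larger distance lowers the similarity value.
PRESET_WEIGHTS = {"name": -4.0, "brand": -3.0, "unit": -2.0}
PRESET_BIAS = 3.0


def extract_attributes(record):
    """Stand-in for attribute extraction: read fields from a structured record."""
    return {key: str(record.get(key, "")).strip().lower() for key in PRESET_WEIGHTS}


def distances(attrs_a, attrs_b):
    """One distance in [0, 1] per attribute (string dissimilarity as a stand-in)."""
    return {
        key: 1.0 - SequenceMatcher(None, attrs_a[key], attrs_b[key]).ratio()
        for key in PRESET_WEIGHTS
    }


def similarity(record_a, record_b):
    """Feed the per-attribute distances into the preset logistic model."""
    dist = distances(extract_attributes(record_a), extract_attributes(record_b))
    z = PRESET_BIAS + sum(PRESET_WEIGHTS[k] * dist[k] for k in PRESET_WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


cola = {"name": "Cola Zero 330ml", "brand": "Coca-Cola", "unit": "can"}
powder = {"name": "Laundry Powder 2kg", "brand": "Tide", "unit": "bag"}
print(similarity(cola, cola), similarity(cola, powder))
```

With these illustrative weights, an identical pair has all distances equal to zero and so scores `sigmoid(PRESET_BIAS)`, while any attribute mismatch pushes the score down.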
2. The commodity similarity analysis method according to claim 1, characterized in that performing attribute extraction on the commodity data pair to be analyzed to obtain inherent commodity attributes specifically comprises:
performing word segmentation on the commodity names of the commodity data pair to be analyzed to extract the inherent commodity attributes.
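Claim 2 does not name a particular segmenter. As one possible illustration, the sketch below uses forward maximum matching over a small dictionary; `LEXICON`, `segment`, and the example commodity name are hypothetical, and a production system would more likely use a large dictionary or a trained segmenter.

```python
# Hypothetical lexicon of known attribute words (brand, flavor, category,
# specification, packaging). Illustrative only.
LEXICON = {"可口可乐", "零度", "汽水", "330ml", "罐装"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(name: str) -> list[str]:
    """Forward maximum matching: at each position take the longest lexicon word.

    Characters that start no lexicon word are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(name):
        for size in range(min(MAX_WORD_LEN, len(name) - i), 0, -1):
            piece = name[i:i + size]
            if size == 1 or piece in LEXICON:
                tokens.append(piece)
                i += size
                break
    return tokens


print(segment("可口可乐零度汽水330ml罐装"))
```

Each resulting token can then be assigned to an attribute slot such as brand, flavor, or specification; how that assignment is made is not specified in the claim.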
3. The commodity similarity analysis method according to claim 1, characterized in that performing multi-layer distance calculation on the commodity data pair to be analyzed, based on the inherent commodity attributes, to obtain a plurality of distance values specifically comprises:
performing vectorization on the inherent commodity attributes;
performing multi-layer distance calculation on the vectorized inherent commodity attributes using an edit distance algorithm, a cosine distance algorithm, or an optimized CRF (conditional random field) algorithm, to obtain the plurality of distance values.
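Two of the distance measures named in claim 3 have standard textbook definitions, sketched below with hypothetical function names. Edit distance is shown as Levenshtein distance on raw strings and cosine distance as one minus cosine similarity on numeric vectors. The optimized CRF variant is not sketched, since the claim gives no detail from which it could be reconstructed.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cosine_distance(u, v) -> float:
    """1 - cosine similarity; defined here as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


print(edit_distance("kitten", "sitting"))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Computing one such value per attribute yields the plurality of distance values that the prediction model takes as input.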
4. The commodity similarity analysis method according to claim 3, characterized in that the inherent commodity attributes comprise commodity name, brand, single-item specification, selling specification, single-item unit, number of items per package, model, or flavor, and performing vectorization on the inherent commodity attributes specifically comprises:
vectorizing the commodity name, brand, selling specification, model, and flavor using a tf-idf or n-gram algorithm;
vectorizing the single-item unit and single-item specification using 0-1 matching.
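The vectorization choices in claim 4 can be illustrated as follows. This sketch makes several assumptions that the claim does not state: n-grams are taken at the character level, tf-idf uses the smoothed form idf = ln((1 + N) / (1 + df)) + 1, and 0-1 matching is read as an exact-equality indicator. All function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Character n-grams; a non-empty string no longer than n yields itself."""
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(texts, n=2):
    """Return (vocabulary, vectors) with tf = count / total and smoothed idf."""
    docs = [Counter(char_ngrams(t, n)) for t in texts]
    vocab = sorted(set().union(*docs))
    # Document frequency: iterating a Counter yields each distinct n-gram once.
    df = Counter(gram for doc in docs for gram in doc)
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        total = sum(doc.values()) or 1
        vectors.append([
            (doc[gram] / total) * (math.log((1 + n_docs) / (1 + df[gram])) + 1.0)
            for gram in vocab
        ])
    return vocab, vectors


def zero_one_match(a, b):
    """0-1 matching: 1 if the two attribute values are identical, otherwise 0."""
    return 1 if a == b else 0


vocab, vectors = tfidf_vectors(["abc", "abd"])
print(vocab, vectors)
print(zero_one_match("ml", "ml"), zero_one_match("ml", "g"))
```

Free-text attributes such as name or flavor tolerate partial overlap, which is why a graded representation suits them, whereas units and specifications are either the same or not, which is what the 0-1 indicator captures.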
5. The commodity similarity analysis method according to any one of claims 1-4, characterized in that the commodity similarity analysis method further comprises training the prediction model, which specifically comprises:
obtaining sample data, the sample data comprising commodity pairs, each consisting of two commodities whose similarity value exceeds a threshold;
performing attribute extraction on the sample data to obtain inherent sample attributes;
vectorizing the inherent sample attributes using a tf-idf or n-gram algorithm or 0-1 matching;
performing multi-layer distance calculation on the vectorized inherent sample attributes using an edit distance algorithm, a cosine distance algorithm, or an optimized CRF (conditional random field) algorithm, to obtain a plurality of sample distance values;
performing logistic regression training on the sample data using a machine learning method, according to the plurality of sample distance values, to obtain the prediction model.
6. The commodity similarity analysis method according to claim 5, characterized in that performing logistic regression training on the sample data using a machine learning method, according to the plurality of sample distance values, to obtain the prediction model specifically comprises:
establishing a distance matrix from the plurality of sample distance values;
performing logistic regression training using a machine learning method to solve for a plurality of attribute weight values;
determining the prediction model from the plurality of attribute weight values.
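The training steps of claim 6 can be sketched as a logistic regression fitted by batch gradient descent. The distance matrix, labels, learning rate, and function names below are hypothetical. The sketch also assumes that dissimilar pairs (label 0) are available alongside the similar pairs of claim 5, which the claims do not state but which a logistic regression needs in order to learn a decision boundary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(distance_matrix, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on log-loss; returns (attribute weights, bias)."""
    n_samples = len(distance_matrix)
    n_features = len(distance_matrix[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, y in zip(distance_matrix, labels):
            err = sigmoid(bias + sum(w * d for w, d in zip(weights, row))) - y
            for k, d in enumerate(row):
                grad_w[k] += err * d
            grad_b += err
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict(weights, bias, row):
    """Similarity value for one pair, given its per-attribute distances."""
    return sigmoid(bias + sum(w * d for w, d in zip(weights, row)))


# Hypothetical distance matrix: one row per sample pair, with columns
# [name distance, brand distance, unit mismatch].
X = [[0.1, 0.0, 0.0], [0.2, 0.0, 0.0], [0.8, 1.0, 1.0], [0.9, 1.0, 0.0]]
y = [1, 1, 0, 0]  # 1 = similar pair, 0 = dissimilar pair (assumed labels)

weights, bias = train_logistic(X, y)
print(weights, bias, predict(weights, bias, [0.15, 0.0, 0.0]))
```

The learned `weights` play the role of the attribute weight values: the magnitude of each weight indicates how strongly the distance on that attribute influences the predicted similarity.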
7. A commodity similarity analysis device based on attributive distance, characterized by comprising:
a selection module, configured to select any two items of commodity data as a commodity data pair to be analyzed;
an extraction module, configured to perform attribute extraction on the commodity data pair to be analyzed to obtain inherent commodity attributes;
a calculation module, configured to perform multi-layer distance calculation on the commodity data pair to be analyzed, based on the inherent commodity attributes, to obtain a plurality of distance values;
a prediction module, configured to input the plurality of distance values into a preset prediction model to obtain a similarity value of the commodity data pair to be analyzed.
8. The commodity similarity analysis device according to claim 7, characterized by further comprising a training module configured to train the prediction model, the training specifically comprising:
obtaining sample data, the sample data comprising commodity pairs, each consisting of two commodities whose similarity value exceeds a threshold;
performing attribute extraction on the sample data to obtain inherent sample attributes;
vectorizing the inherent sample attributes using a tf-idf or n-gram algorithm or 0-1 matching;
performing multi-layer distance calculation on the vectorized inherent sample attributes using an edit distance algorithm, a cosine distance algorithm, or an optimized CRF (conditional random field) algorithm, to obtain a plurality of sample distance values;
performing logistic regression training on the sample data using a machine learning method, according to the plurality of sample distance values, to obtain the prediction model.
9. A commodity similarity analysis device based on attributive distance, characterized by comprising a processor, an input device, an output device, and a memory, the processor, input device, output device, and memory being connected to one another, wherein the memory is configured to store a computer program, the computer program comprises program instructions, and the processor is configured to call the program instructions to execute the method according to claim 6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, cause the processor to execute the method according to claim 6.
CN201910772621.4A 2019-08-21 2019-08-21 Commodity similarity analysis method, apparatus and storage medium based on attributive distance Pending CN110517077A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910772621.4A CN110517077A (en) 2019-08-21 2019-08-21 Commodity similarity analysis method, apparatus and storage medium based on attributive distance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910772621.4A CN110517077A (en) 2019-08-21 2019-08-21 Commodity similarity analysis method, apparatus and storage medium based on attributive distance

Publications (1)

Publication Number Publication Date
CN110517077A true CN110517077A (en) 2019-11-29

Family

ID=68625952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910772621.4A Pending CN110517077A (en) 2019-08-21 2019-08-21 Commodity similarity analysis method, apparatus and storage medium based on attributive distance

Country Status (1)

Country Link
CN (1) CN110517077A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639970A (en) * 2020-05-28 2020-09-08 深圳壹账通智能科技有限公司 Method for determining price of article based on image recognition and related equipment
CN112330037A (en) * 2020-11-11 2021-02-05 天津汇商共达科技有限责任公司 Method and device for predicting inventory proportion of new product and server
CN112395501A (en) * 2020-11-17 2021-02-23 航天信息股份有限公司 Enterprise recommendation method and device, storage medium and electronic equipment
CN113298493A (en) * 2021-05-21 2021-08-24 陕西合友网络科技有限公司 Navigation system and method for administrative examination and approval intelligent navigation
CN113643100A (en) * 2021-08-30 2021-11-12 北京值得买科技股份有限公司 Commodity similarity judgment module contribution quantification method and system
CN116188091A (en) * 2023-05-04 2023-05-30 品茗科技股份有限公司 Method, device, equipment and medium for automatic matching unit price reference of cost list

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018166343A1 (en) * 2017-03-13 2018-09-20 腾讯科技(深圳)有限公司 Data fusion method and device, storage medium and electronic device
CN108932647A (en) * 2017-07-24 2018-12-04 上海宏原信息科技有限公司 A kind of method and apparatus for predicting its model of similar article and training
CN109670161A (en) * 2017-10-13 2019-04-23 北京京东尚科信息技术有限公司 Commodity similarity calculating method and device, storage medium, electronic equipment
CN109697641A (en) * 2017-10-20 2019-04-30 北京京东尚科信息技术有限公司 The method and apparatus for calculating commodity similarity

Similar Documents

Publication Publication Date Title
CN110517077A (en) Commodity similarity analysis method, apparatus and storage medium based on attributive distance
CN109657238B (en) Knowledge graph-based context identification completion method, system, terminal and medium
CN108509413A (en) Digest extraction method, device, computer equipment and storage medium
AU2020236989B2 (en) Handling categorical field values in machine learning applications
CN112100387B (en) Training method and device of neural network system for text classification
WO2019194986A1 (en) Automated extraction of product attributes from images
CN108170859A (en) Method, apparatus, storage medium and the terminal device of speech polling
CN110413319B (en) Code function taste detection method based on deep semantics
CN109992763A (en) Language marks processing method, system, electronic equipment and computer-readable medium
KR101782120B1 (en) Apparatus and method for recommending financial instruments based on consultation information and data clustering
CN109598517B (en) Commodity clearance processing, object processing and category prediction method and device thereof
CN111461164B (en) Sample data set capacity expansion method and model training method
CN108984500A (en) Extracting method, terminal device and the medium of amount information
CN113611405A (en) Physical examination item recommendation method, device, equipment and medium
CN110222330A (en) Method for recognizing semantics and device, storage medium, computer equipment
CN112069801A (en) Sentence backbone extraction method, equipment and readable storage medium based on dependency syntax
CN109684476A (en) A kind of file classification method, document sorting apparatus and terminal device
CN109902157A (en) A kind of training sample validation checking method and device
WO2024060684A1 (en) Model training method, image processing method, device, and storage medium
CN110377733A (en) A kind of text based Emotion identification method, terminal device and medium
CN109241529A (en) The determination method and apparatus of viewpoint label
CN110532456A (en) Case querying method, device, computer equipment and storage medium
CN110287495A (en) A kind of power marketing profession word recognition method and system
CN111428486A (en) Article information data processing method, apparatus, medium, and electronic device
CN109033078B (en) The recognition methods of sentence classification and device, storage medium, processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191129
