CN113724055B - Commodity attribute mining method and device - Google Patents

Publication number
CN113724055B
Authority
CN
China
Prior art keywords
commodity
attribute
commodity attribute
phrases
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111076600.2A
Other languages
Chinese (zh)
Other versions
CN113724055A (en)
Inventor
陈东东
章钦
郭雪茹
周源
易津锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd
Priority to CN202111076600.2A
Publication of CN113724055A
Application granted
Publication of CN113724055B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0623 Item investigation
    • G06Q30/0625 Directed, with specific intent or strategy
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a commodity attribute mining method and device. The method comprises the following steps: determining a mapping relationship between the commodity attribute number and the phrase describing the commodity attribute; inputting the phrases into a preset vectorization model to obtain sentence vectors which are output by the vectorization model and represent the semantics of the phrases; the vectorization model is obtained by training a phrase set of marked attribute data corresponding to the target commodity class on the basis of a pre-training network model; and determining commodity attribute characteristics corresponding to the commodity attribute numbers based on commodity public attribute characteristics corresponding to the clustering clusters of the sentence vectors and the mapping relation. The method provided by the disclosure can effectively analyze unstructured data, fully mine hidden attribute characteristics of the commodity, and improve the efficiency of commodity attribute mining.

Description

Commodity attribute mining method and device
Technical Field
The disclosure relates to the technical field of big data analysis, in particular to a commodity attribute mining method and device. In addition, the invention also relates to an electronic device and a processor readable storage medium.
Background
For each item sold on a network platform, in addition to the structured information in the database describing the nature of the item, a portion of unstructured information is available, the most important of which is the information in the item detail page pictures. In general, each commodity attribute number corresponds to a number of item detail page pictures, each item detail page picture contains a number of advertisement sentences or explanatory sentences, and each sentence describes a feature of a certain dimension of the item. How to use this correspondence to mine the commodity attribute feature expressed by each sentence, and from those features infer the hidden attribute features of each commodity attribute number so that they can be used for commodity recommendation, commodity customization, selling-point mining and the like, has become a problem to be solved urgently.
Disclosure of Invention
Therefore, the disclosure provides a commodity attribute mining method and device to overcome the defects of the manual mining schemes and computer-aided mining schemes in the prior art: they are highly limited, time-consuming and labor-intensive, prone to producing redundant attributes, dependent on considerable manual intervention, and poor in mining efficiency and stability.
The disclosure provides a commodity attribute mining method, comprising:
determining a mapping relationship between the commodity attribute number and the phrase describing the commodity attribute; wherein the phrase is in the commodity detail page picture;
inputting the phrases into a preset vectorization model to obtain sentence vectors which are output by the vectorization model and represent the semantics of the phrases;
the vectorization model is obtained by training a phrase set of marked attribute data corresponding to the target commodity class on the basis of a pre-training network model;
and determining commodity attribute characteristics corresponding to the commodity attribute numbers based on commodity public attribute characteristics corresponding to the clustering clusters of the sentence vectors and the mapping relation.
Further, determining the commodity attribute feature corresponding to the commodity attribute number based on the commodity public attribute feature corresponding to the cluster of the sentence vector and the mapping relation specifically includes:
clustering the sentence vectors to obtain corresponding clustering clusters;
extracting the commodity attribute features shared by the phrases in the cluster, and taking the shared commodity attribute features as the commodity public attribute features corresponding to the cluster;
and matching the commodity public attribute features corresponding to the cluster with the mapping relation to obtain commodity attribute features of the commodity attribute numbers.
Further, clustering the sentence vectors to obtain corresponding clusters specifically includes: clustering the sentence vectors with a K-means clustering model, based on the semantics of the phrases corresponding to the sentence vectors, to obtain clusters each comprising a plurality of sentence vectors; the semantics of the phrases corresponding to the sentence vectors within a cluster satisfy a preset semantic similarity condition.
Further, the determining the mapping relationship between the commodity attribute number and the phrase describing the commodity attribute specifically includes:
mapping the commodity detail page picture to the commodity attribute number to obtain an initial mapping relation between the commodity attribute number and the commodity detail page picture;
recognizing the text information of the commodity detail page picture by optical character recognition to obtain the phrases that correspond to the commodity detail page picture and describe commodity attributes;
and determining the mapping relation according to the initial mapping relation and the phrases that correspond to the commodity detail page picture and describe commodity attributes.
Further, the commodity attribute mining method further comprises the following steps:
based on a preset triplet loss function, gathering the sentence vectors corresponding to similar phrases close together in space and keeping the sentence vectors corresponding to dissimilar phrases far apart in space, and adding pseudo labels, according to spatial distance, to the sample sentences among the phrases that have no labeled attribute data, so as to determine a training sample set for the pre-training network model;
training the pre-training network model based on the training sample set.
Further, the pre-training network model is trained by transfer learning based on the phrase set of the labeled attribute data to obtain the vectorization model.
Further, the commodity attribute number is inventory management information of the commodity, and the inventory management information refers to a numerical code or an alphabetical code for uniquely identifying the commodity.
The present disclosure also provides a commodity attribute mining apparatus, comprising:
a mapping relation determining unit for determining a mapping relation between the commodity attribute number and the phrase describing the commodity attribute; wherein the phrase is in the commodity detail page picture;
the vectorization processing unit is used for inputting the phrases into a preset vectorization model to obtain sentence vectors which are output by the vectorization model and represent the semantics of the phrases;
the vectorization model is obtained by training a phrase set of marked attribute data corresponding to the target commodity class on the basis of a pre-training network model;
and the commodity attribute feature determining unit is used for determining commodity attribute features corresponding to the commodity attribute numbers based on commodity public attribute features corresponding to the clustering clusters of the sentence vectors and the mapping relation.
Further, the commodity attribute feature determining unit is specifically configured to:
clustering the sentence vectors to obtain corresponding clustering clusters;
extracting the commodity attribute features shared by the phrases in the cluster, and taking the shared commodity attribute features as the commodity public attribute features corresponding to the cluster;
and matching the commodity public attribute features corresponding to the cluster with the mapping relation to obtain commodity attribute features of the commodity attribute numbers.
Further, clustering the sentence vectors to obtain corresponding clusters specifically includes: clustering the sentence vectors with a K-means clustering model, based on the semantics of the phrases corresponding to the sentence vectors, to obtain clusters each comprising a plurality of sentence vectors; the semantics of the phrases corresponding to the sentence vectors within a cluster satisfy a preset semantic similarity condition.
Further, the mapping relation determining unit is specifically configured to:
mapping the commodity detail page picture to the commodity attribute number to obtain an initial mapping relation between the commodity attribute number and the commodity detail page picture;
recognizing the text information of the commodity detail page picture by optical character recognition to obtain the phrases that correspond to the commodity detail page picture and describe commodity attributes;
and determining the mapping relation according to the initial mapping relation and the phrases that correspond to the commodity detail page picture and describe commodity attributes.
Further, the commodity attribute mining device further includes:
the sample set determining unit is used for gathering, based on a preset triplet loss function, the sentence vectors corresponding to similar phrases close together in space and keeping the sentence vectors corresponding to dissimilar phrases far apart in space, and for adding pseudo labels, according to spatial distance, to the sample sentences among the phrases that have no labeled attribute data, so as to determine a training sample set for the pre-training network model;
and the model training unit is used for training the pre-training network model based on the training sample set.
Further, the pre-training network model is trained by transfer learning based on the phrase set of the labeled attribute data to obtain the vectorization model.
Further, the commodity attribute number is inventory management information of the commodity, and the inventory management information refers to a numerical code or an alphabetical code for uniquely identifying the commodity.
The present disclosure also provides an electronic device, including: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the commodity attribute mining method according to any one of the preceding claims when the program is executed by the processor.
The present disclosure also provides a processor-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the commodity attribute mining method according to any one of the preceding claims.
According to the commodity attribute mining method, the mapping relation between commodity attribute numbers and the phrases that describe commodity attributes in commodity detail page pictures is determined, and the phrases are input into a preset vectorization model to obtain sentence vectors representing the semantics of the phrases; the sentence vectors are clustered, and the commodity attribute features corresponding to the commodity attribute numbers are determined based on the commodity public attribute features corresponding to the clusters of the sentence vectors and the mapping relation; the vectorization model is obtained by training, on the basis of a pre-training network model, with a phrase set of labeled attribute data corresponding to the target commodity class. Unstructured data can thus be analyzed effectively, the hidden attribute features of commodities can be fully mined, and the efficiency of commodity attribute mining is improved.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, a brief description will be given below of the drawings required for the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without any inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a commodity attribute mining method provided in an embodiment of the present disclosure;
Fig. 2 is a complete flow diagram of a commodity attribute mining method according to an embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of a commodity attribute mining apparatus provided in an embodiment of the present disclosure;
Fig. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. Based on the embodiments in this disclosure, all other embodiments that a person of ordinary skill in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.
Embodiments of the commodity attribute mining method according to the present disclosure are described in detail below. Fig. 1 is a flow chart of a commodity attribute mining method provided by an embodiment of the present disclosure; a specific implementation process includes the following steps:
Step 101: determining the mapping relation between the commodity attribute number and the phrases that describe commodity attributes in the commodity detail page picture.
Specifically, the commodity attribute number (Stock Keeping Unit, SKU), i.e., inventory management information, is a numeric or alphabetic code assigned to a commodity to uniquely identify it, so that an enterprise can manage inventory more easily and effectively. A commodity attribute number is typically 8 to 12 characters long and is located on the price label of the commodity. The commodity detail page picture is a picture on the online sales platform that contains commodity detail information, and the commodity detail information is unstructured information. For each commodity, in addition to the structured information in the database describing the nature of the commodity, a portion of unstructured information is available, the most important of which is the information in the commodity detail page pictures. Each commodity attribute number corresponds to at least one commodity detail page picture, and each commodity detail page picture contains a plurality of advertisement sentences or explanatory sentences; each sentence either describes a commodity attribute feature of a certain dimension of the commodity or is a meaningless sentence. The phrases are these advertisement sentences or explanatory sentences, and each commodity detail page picture corresponds to at least one phrase describing commodity attributes.
In this step, the commodity detail page picture is first mapped to the commodity attribute number to obtain an initial mapping relation between the commodity attribute number and the commodity detail page picture. The text information of the commodity detail page picture is then recognized to obtain the phrases that correspond to the picture and describe commodity attributes. Finally, the mapping relation among the commodity attribute number, the commodity detail page picture and the phrases describing commodity attributes is determined from the initial mapping relation and those phrases.
It should be noted that, in the practical implementation of the present invention, the text information of the commodity detail page picture may be recognized by means including but not limited to OCR (Optical Character Recognition). Specifically, when optical character recognition is used, each commodity detail page picture can be mapped to its corresponding commodity attribute number through database operations, and the text information on the picture is read by optical character recognition. Generally, one commodity attribute number corresponds to a plurality of commodity detail page pictures, and one commodity detail page picture corresponds to a plurality of OCR phrases, i.e., phrases describing commodity attributes. After the text information on the commodity detail page picture is read, it can be cleaned: phrases whose fonts do not meet preset conditions are filtered out, leaving phrases describing commodity attributes that can be fed directly to the vectorization model. This finally yields the mapping relation among commodity attribute numbers, commodity detail page pictures and phrases.
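The construction of the mapping in Step 101 can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the names `sku_to_pictures`, `recognize_text`, `clean` and `build_mapping` are hypothetical, the OCR engine is replaced by a lookup table so that the sketch runs on its own, and the cleaning rule (dropping very short fragments and duplicates) is an assumed stand-in for the font-based filtering described above.

```python
from collections import defaultdict

# Hypothetical initial mapping: SKU -> detail-page picture ids (e.g. from a database join).
sku_to_pictures = {
    "sku1": ["pic_a.jpg", "pic_b.jpg"],
    "sku2": ["pic_c.jpg"],
}

# Stand-in for an OCR engine: picture id -> raw text lines read from the picture.
fake_ocr_output = {
    "pic_a.jpg": ["Battery life of 60 hours", "  ", "Battery life of 60 hours"],
    "pic_b.jpg": ["Available in pink and blue"],
    "pic_c.jpg": ["High-speed sonic vibration", "*"],
}


def recognize_text(picture_id):
    """Placeholder for a real OCR call; returns the raw lines for one picture."""
    return fake_ocr_output.get(picture_id, [])


def clean(lines, min_chars=3):
    """Strip whitespace, drop very short fragments, and de-duplicate in order."""
    seen, kept = set(), []
    for line in lines:
        text = line.strip()
        if len(text) >= min_chars and text not in seen:
            seen.add(text)
            kept.append(text)
    return kept


def build_mapping(sku_to_pictures):
    """Return SKU -> picture -> phrases, the mapping used by the later steps."""
    mapping = defaultdict(dict)
    for sku, pictures in sku_to_pictures.items():
        for picture_id in pictures:
            mapping[sku][picture_id] = clean(recognize_text(picture_id))
    return dict(mapping)


mapping = build_mapping(sku_to_pictures)
```

Keeping the picture level in the mapping preserves the one-to-many structure described above (one commodity attribute number to many pictures, one picture to many phrases), so a phrase can later be traced back to its commodity attribute number.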
Step 102: inputting the phrases into a preset vectorization model to obtain sentence vectors which are output by the vectorization model and represent the semantics of the phrases. The vectorization model is obtained by training, on the basis of a pre-training network model, with a phrase set of labeled attribute data corresponding to the target commodity class.
In the embodiment of the invention, before this step is executed, the BERT model needs to be fine-tuned in advance with manually labeled commodity attribute data to obtain a fine-tuned BERT model, and thereby train a vectorization model that can effectively distinguish the meanings of phrases. The sentence vectors output by the vectorization model are then clustered, and the clustering result yields a plurality of clusters of phrases describing different commodity attributes, so that hidden commodity attribute information can be mined from new, unlabeled data.
It should be noted that the BERT model is a pre-trained model. Suppose a training set A already exists: the initial network model is pre-trained on training set A, and the network parameters learned on that task are saved for later use. When a new task B arrives, the same network structure can be adopted and the parameters learned on training set A can be loaded at initialization; the network model is then trained on the training data of task B. If the loaded parameters are kept unchanged during this training, they are said to be "frozen"; if the loaded parameters continue to be adjusted as training on task B proceeds, this is called "fine-tuning", which makes the model better suited to the current task B. In the embodiment of the invention, training set A may be the phrase set of labeled attribute data corresponding to the target commodity class, and the training set B corresponding to task B may be an unlabeled sample sentence set.
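The difference between frozen and fine-tuned parameters can be illustrated with a deliberately tiny model. The sketch below is an assumption-laden toy, not BERT: the "pre-trained" part is a single scalar weight `pretrained_w`, the new task-B layer is a scalar `head_v`, and the function name `train` and the data are hypothetical.

```python
def train(pretrained_w, head_v, data, freeze_pretrained, lr=0.01, epochs=50):
    """Fit y ~= head_v * (pretrained_w * x) by gradient descent on squared error.

    pretrained_w plays the role of parameters loaded from task A;
    head_v is the new parameter introduced for task B.
    """
    w, v = pretrained_w, head_v
    for _ in range(epochs):
        for x, y in data:
            hidden = w * x
            error = v * hidden - y
            grad_v = 2 * error * hidden
            grad_w = 2 * error * v * x
            v -= lr * grad_v
            if not freeze_pretrained:
                w -= lr * grad_w  # fine-tuning: the loaded parameter keeps moving
    return w, v


# Toy task-B data generated by y = 3 * x.
task_b_data = [(1.0, 3.0), (2.0, 6.0)]

frozen_w, frozen_v = train(1.5, 1.0, task_b_data, freeze_pretrained=True)
tuned_w, tuned_v = train(1.5, 1.0, task_b_data, freeze_pretrained=False)
```

In the frozen run only the new layer adapts to task B, while in the fine-tuned run the loaded weight is also updated, which is the behaviour the paragraph above attributes to fine-tuning.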
As shown in Fig. 2, in the implementation process, the BERT model is a general-purpose NLP model: given a phrase as input, it outputs a corresponding sentence vector, in which the CLS token at the first position of the output can represent the category information of the whole phrase, and phrases with similar semantics lie closer together in the vector space. The invention can adopt the NLP-Bert-as-service model, and can also use the Chinese-BERT-wwm model, which was trained on a data set of billions of Chinese characters and can therefore capture the semantic information in phrases well.
In this step, after the phrases are input into the preset vectorization model to obtain the sentence vectors representing their semantics, the sentence vectors corresponding to similar phrases can be gathered close together in space, and the sentence vectors corresponding to dissimilar phrases kept far apart in space, based on a preset triplet loss function. Pseudo labels are added, according to spatial distance, to the sample sentences among the phrases that have no labeled attribute data, so as to determine a training sample set for the pre-training network model, and the pre-training network model is trained based on this training sample set.
In the practical implementation process, because the input phrases are text about a specific commodity class, domain-specific terminology often appears, which may affect the effect of the BERT model. Therefore, in the embodiment of the invention, transfer learning can be performed with the phrase set of labeled attribute data (namely, the phrase set of labeled attribute data corresponding to the target commodity class) on top of the CLS vector output by the BERT model, so that the final output vector meets the requirements of the field to which the target commodity class belongs.
Because the vectorization model outputs a sentence vector representing the meaning of a phrase, and category labels exist for the manually labeled phrase set, a triplet loss function (triplet loss) can be selected as the loss function in the implementation process. The system is intended not only to recover the manually labeled categories of commodity attribute information, but also to learn automatically new categories that were not manually labeled. The principle of triplet loss is as follows: each time, a triplet (a, b, c) is selected in which a and b belong to the same category and a and c belong to different categories; the loss is the distance between a and b, minus the distance between a and c, plus a margin value; through repeated back-propagation, the distance between a and b is driven to be smaller than the distance between a and c by at least the margin. As a result, the vectors of similar phrases gather together in space and the vectors of dissimilar phrases move apart. Manually labeled data is usually scarce, so for each unlabeled sample the nearest and farthest samples in the embedding space are computed and used to give it a pseudo label for triplet loss training. Specifically, the nearest sample is taken as the positive and the farthest sample as the negative, so that the training set for fine-tuning the BERT model grows gradually. In addition, in the transfer learning process, a two-layer fully connected network as the transfer learning layer is sufficient to achieve a good effect; its input is the CLS vector output by the BERT model and its output is the final embedding vector, namely the sentence vector.
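The triplet loss and the pseudo-labelling rule can be sketched as follows. This is a minimal sketch under stated assumptions: Euclidean distance is assumed as the distance measure, the loss is written in the commonly used hinge form max(0, d(a, b) - d(a, c) + margin), which the text does not spell out, and the names `distance`, `triplet_loss` and `pseudo_triplet` as well as the toy vectors and labels are hypothetical.

```python
import math


def distance(u, v):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-form triplet loss: zero once d(a, p) + margin <= d(a, n)."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)


def pseudo_triplet(unlabeled_vec, labeled):
    """Pick the nearest labeled sample as positive and the farthest as negative.

    labeled is a list of (vector, label) pairs. Returns the pseudo label
    (the label of the nearest sample), the positive vector and the negative vector.
    """
    nearest = min(labeled, key=lambda item: distance(unlabeled_vec, item[0]))
    farthest = max(labeled, key=lambda item: distance(unlabeled_vec, item[0]))
    return nearest[1], nearest[0], farthest[0]


# Toy labeled embeddings for two manually labeled categories.
labeled = [([0.0, 0.0], "battery"), ([5.0, 0.0], "vibration")]

pseudo_label, positive, negative = pseudo_triplet([0.5, 0.0], labeled)
loss_separated = triplet_loss([0.5, 0.0], positive, negative)
loss_violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

A triplet that already satisfies the margin contributes zero loss, so training effort concentrates on triplets in which the positive is not yet closer than the negative by the margin.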
Step 103: determining the commodity attribute features corresponding to the commodity attribute numbers based on the commodity public attribute features corresponding to the clusters of the sentence vectors and the mapping relation.
In the embodiment of the invention, the specific implementation process of this step comprises: clustering the sentence vectors to obtain corresponding clusters; extracting the commodity attribute features shared by the phrases in each cluster, and taking the shared features as the commodity public attribute features corresponding to the cluster; and matching the commodity public attribute features corresponding to the cluster against the mapping relation to obtain the commodity attribute features of the commodity attribute numbers. After vectorization, each phrase corresponds to its sentence vector, so the correspondence between the clusters formed by the sentence vectors and the phrases can be obtained, and matching this correspondence against the mapping relation determines the corresponding commodity attribute features. The commodity public attribute features are the commodity attribute features shared by a plurality of phrases in the cluster.
The specific implementation process of clustering the sentence vectors to obtain corresponding clusters comprises: clustering the sentence vectors based on the semantics of the phrases corresponding to the sentence vectors to obtain clusters each containing a plurality of sentence vectors; the semantics of the phrases corresponding to the sentence vectors within a cluster satisfy a preset semantic similarity condition.
It should be noted that, in the implementation process of the present invention, the sentence vectors may be clustered by means including, but not limited to, a K-means clustering model, that is, the K-means clustering algorithm, which is not limited herein.
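The clustering step can be sketched with a small self-contained K-means routine. This is an illustrative sketch only: the two-dimensional vectors stand in for real sentence vectors, the helper names are hypothetical, and the deterministic farthest-first seeding is an assumption made for repeatability rather than something the text specifies.

```python
import math


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def initial_centroids(vectors, n_clusters):
    """Deterministic farthest-first seeding (an assumption, chosen for repeatability)."""
    centroids = [list(vectors[0])]
    while len(centroids) < n_clusters:
        farthest = max(vectors, key=lambda v: min(distance(v, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors, n_clusters, iterations=20):
    """Return one cluster index per input vector."""
    centroids = initial_centroids(vectors, n_clusters)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        assignments = [
            min(range(n_clusters), key=lambda k: distance(vec, centroids[k]))
            for vec in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(n_clusters):
            members = [v for v, a in zip(vectors, assignments) if a == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = mean(members)
    return assignments


# Toy sentence vectors: two phrases about battery life, two about vibration.
sentence_vectors = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels = kmeans(sentence_vectors, n_clusters=2)
```

The number of clusters is a preset hyperparameter, so in practice it would be chosen per commodity class; a library implementation of K-means could replace this routine without changing the surrounding steps.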
As shown in Fig. 2, the embedding result obtained by transfer learning on the basis of the BERT model is clustered with a preset clustering algorithm to obtain a plurality of clusters, and the phrases in each cluster are semantically similar. Because of the transfer learning in Step 102, the vectorization model not only recognizes the general semantics of Chinese but is also more sensitive to the vocabulary of specific commodities. Therefore, the phrases can be clustered into corresponding clusters by choosing the preset hyperparameter, namely the number of clusters n. Finally, the shared commodity attribute features are extracted by analyzing the phrases in each cluster and are taken as the commodity public attribute features corresponding to the phrases in that cluster; the commodity attribute features corresponding to the commodity attribute numbers (namely, sku1 and sku2) are then determined based on the commodity public attribute features corresponding to the clusters of the sentence vectors and the mapping relation.
In a specific implementation, the commodity attribute mining method of the present disclosure can be applied to attribute mining tasks for commodities such as electric toothbrushes and electric shavers; the commodity type in practical applications is not specifically limited. Take an electric toothbrush whose commodity attribute number is MS12446103 as an example. The commodity detail page of the electric toothbrush is collected, and the phrases describing the attributes of the electric toothbrush in the commodity detail page pictures are recognized by optical character recognition, including but not limited to: "battery life of 60 hours", "battery life improved by 50%", "the color is pink and blue", "vibration frequency reaches 31000 times per minute through sonic technology", "high-speed sonic vibration of about 40000 times per minute", etc. After the mapping relation between the commodity attribute number and the phrases is determined, the phrases are input into a vectorization model that has been fine-tuned in advance on labeled electric toothbrush attribute data, and the corresponding sentence vectors are output. The sentence vectors are then clustered. For example, the sentence vectors of phrases such as "vibration frequency reaches 31000 times per minute through sonic technology", "high-speed sonic vibration of about 40000 times per minute", and "high-efficiency sonic vibration", whose semantics all characterize the vibration frequency of the electric toothbrush, are gathered into one cluster; the commodity public attribute feature of this cluster can be extracted as "high-frequency sound waves", and according to the constructed mapping relation and the commodity attribute number MS12446103, "high-frequency sound waves" is determined to be a commodity attribute feature corresponding to that number. Likewise, the sentence vectors of phrases such as "battery life of 60 hours" and "battery life improved by 50%", whose semantics characterize the battery life of the electric toothbrush, are gathered into another cluster; the commodity public attribute feature of this cluster can be extracted as "ultra-long battery life", and according to the constructed mapping relation and the commodity attribute number MS12446103, "ultra-long battery life" is determined to be a commodity attribute feature corresponding to that number. In this example, the number of clusters n is 2.
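The last step of the toothbrush example, combining the cluster-level features with the mapping relation, can be sketched as below. The dictionary layout, the function name `assign_features`, the cluster ids 0 and 1, and the English wording of the phrases and feature labels are all assumptions made for illustration; only the commodity attribute number comes from the example.

```python
from typing import Dict, List, Set


def assign_features(
    sku_to_phrases: Dict[str, List[str]],
    phrase_to_cluster: Dict[str, int],
    cluster_to_feature: Dict[int, str],
) -> Dict[str, Set[str]]:
    """Collect, for every commodity attribute number, the public features
    of the clusters that its phrases fall into."""
    result: Dict[str, Set[str]] = {}
    for sku, phrases in sku_to_phrases.items():
        features: Set[str] = set()
        for phrase in phrases:
            cluster = phrase_to_cluster.get(phrase)
            if cluster in cluster_to_feature:
                features.add(cluster_to_feature[cluster])
        result[sku] = features
    return result


# Mapping relation: commodity attribute number -> recognized phrases.
sku_to_phrases = {
    "MS12446103": [
        "vibration frequency reaches 31000 per minute",
        "high-speed sonic vibration of about 40000 per minute",
        "battery life of 60 hours",
        "battery life improved by 50%",
    ]
}

# Hypothetical clustering output with n = 2.
phrase_to_cluster = {
    "vibration frequency reaches 31000 per minute": 0,
    "high-speed sonic vibration of about 40000 per minute": 0,
    "battery life of 60 hours": 1,
    "battery life improved by 50%": 1,
}

# Public attribute feature extracted for each cluster.
cluster_to_feature = {
    0: "high-frequency sound waves",  # vibration-frequency cluster
    1: "ultra-long battery life",     # duration / battery-life cluster
}

sku_features = assign_features(sku_to_phrases, phrase_to_cluster, cluster_to_feature)
```

Using a set per commodity attribute number keeps a feature from being listed twice when several phrases of the same commodity land in the same cluster.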
With the commodity attribute mining method of the embodiments of the present disclosure, the mapping relation between commodity attribute numbers and the phrases describing commodity attributes in the commodity detail page pictures is determined, and the phrases are input into a preset vectorization model to obtain the sentence vectors that the model outputs to represent the semantics of the phrases; the sentence vectors are clustered, and the commodity attribute features corresponding to the commodity attribute numbers are determined based on the commodity public attribute features corresponding to the clusters of the sentence vectors and the mapping relation. The vectorization model is obtained by transfer learning on a pre-trained network model, using a phrase set with labeled attribute data for the target commodity category. Because the vectorization model is built by transfer learning, the labeled phrase information is fully used and the model is better suited to the current commodity category; unstructured data is analyzed effectively, the hidden attribute features of commodities are fully mined, and the efficiency of commodity attribute mining is improved.
Corresponding to the commodity attribute mining method provided by the above, the disclosure also provides a commodity attribute mining device. Since the embodiment of the apparatus is similar to the method embodiment described above, the description is relatively simple, and reference should be made to the description of the method embodiment section described above, and the embodiments of the commodity attribute mining apparatus described below are illustrative only. Fig. 3 is a schematic structural diagram of a commodity attribute mining apparatus according to an embodiment of the present disclosure.
The commodity attribute mining device specifically comprises the following parts:
the mapping relation determining unit 301 is configured to determine a mapping relation between the commodity attribute number and a phrase for describing the commodity attribute in the commodity detail page picture.
Specifically, the commodity attribute number, i.e., inventory management information, refers to a numerical code or an alphabetical code assigned to a commodity for uniquely identifying the commodity attribute, so that an enterprise can manage inventory more easily and effectively. The commodity detail page picture is a picture containing commodity detail information on an online sales platform, and the commodity detail information is unstructured information. For each commodity, in addition to the structured information in the database describing the nature of the commodity, some unstructured information is available, the most important of which is the information in the commodity detail page pictures. Each commodity attribute number corresponds to at least one commodity detail page picture, each commodity detail page picture contains a plurality of advertising sentences or explanatory sentences, and each sentence describes a commodity attribute feature of one dimension of the commodity. The phrases are these advertising sentences or explanatory sentences, and each commodity detail page picture corresponds to at least one phrase describing commodity attributes.
In a specific implementation process, the mapping relationship determining unit 301 first needs to map the commodity detail page picture to the commodity attribute number, so as to obtain an initial mapping relationship between the commodity attribute number and the commodity detail page picture. Further, the mapping relationship determining unit 301 identifies text information of the commodity detail page picture, obtains a phrase describing the commodity attribute corresponding to the commodity detail page picture, and determines, according to the initial mapping relationship and the phrase describing the commodity attribute corresponding to the commodity detail page picture, a mapping relationship among the commodity attribute number, the commodity detail page picture and the phrase describing the commodity attribute.
The vectorization processing unit 302 is configured to input the phrase into a preset vectorization model, and obtain a sentence vector that is output by the vectorization model and represents the meaning of the phrase.
The vectorization model is obtained by training a phrase set of marked attribute data corresponding to the target commodity class on the basis of a pre-training network model;
and a commodity attribute feature determining unit 303, configured to determine a commodity attribute feature corresponding to the commodity attribute number based on a commodity public attribute feature corresponding to the cluster of the sentence vector and the mapping relationship.
Further, the commodity attribute feature determining unit is specifically configured to:
clustering the sentence vectors to obtain corresponding clustering clusters;
extracting the commodity attribute feature common to the phrases in the cluster, and taking that common feature as the commodity public attribute feature corresponding to the cluster;
and matching the commodity public attribute features corresponding to the cluster with the mapping relation to obtain commodity attribute features of the commodity attribute numbers.
Further, clustering the sentence vectors to obtain the corresponding clusters specifically includes: based on the semantics of the phrases corresponding to the sentence vectors, clustering the sentence vectors with a K-means clustering model to obtain clusters each comprising a plurality of sentence vectors; the semantics of the phrases corresponding to the sentence vectors in a cluster satisfy a preset semantic similarity condition.
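The disclosure does not spell out the preset condition that the phrases within a cluster must satisfy. One possible reading, shown purely as an assumption, is that every sentence vector in a cluster must have a cosine similarity to the cluster centroid of at least some threshold. The function names and the threshold value 0.8 below are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_is_coherent(vectors: List[Vector], threshold: float = 0.8) -> bool:
    """Assumed condition: each vector is close enough to the centroid."""
    centroid = [sum(column) / len(vectors) for column in zip(*vectors)]
    return all(cosine_similarity(v, centroid) >= threshold for v in vectors)


tight_cluster = [(1.0, 0.0), (0.9, 0.1)]  # vectors pointing the same way
mixed_cluster = [(1.0, 0.0), (0.0, 1.0)]  # orthogonal vectors

tight_ok = cluster_is_coherent(tight_cluster)
mixed_ok = cluster_is_coherent(mixed_cluster)
```

A check of this kind could be used to flag clusters that mix unrelated phrases, for instance when the chosen cluster number n is too small.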
Further, the mapping relation determining unit is specifically configured to:
mapping the commodity detail page picture to the commodity attribute number to obtain an initial mapping relation between the commodity attribute number and the commodity detail page picture;
identifying the text information of the commodity detail page picture by utilizing an optical character identification mode to obtain a phrase which corresponds to the commodity detail page picture and describes commodity attributes;
and determining the mapping relation according to the initial mapping relation and the phrases that correspond to the commodity detail page picture and describe commodity attributes.
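The three steps above (number to pictures, pictures to phrases, then the combined mapping) can be sketched as follows. The optical character recognition step is represented by a stub, because the disclosure does not name a specific OCR engine; `build_mapping`, `fake_ocr`, the file names, and the sample phrases are all hypothetical.

```python
from typing import Callable, Dict, List

# Nested mapping: commodity attribute number -> picture -> phrases.
Mapping = Dict[str, Dict[str, List[str]]]


def build_mapping(
    sku_to_pictures: Dict[str, List[str]],
    recognize_text: Callable[[str], List[str]],
) -> Mapping:
    """Extend the initial number-to-picture mapping with the phrases
    recognized in each picture."""
    mapping: Mapping = {}
    for sku, pictures in sku_to_pictures.items():
        mapping[sku] = {picture: recognize_text(picture) for picture in pictures}
    return mapping


# Stub standing in for a real OCR call (assumption for illustration).
FAKE_OCR_OUTPUT = {
    "detail_1.png": ["battery life of 60 hours"],
    "detail_2.png": ["high-speed sonic vibration", "the color is pink and blue"],
}


def fake_ocr(picture: str) -> List[str]:
    """Return canned phrases for a picture, or an empty list if unknown."""
    return FAKE_OCR_OUTPUT.get(picture, [])


initial_mapping = {"MS12446103": ["detail_1.png", "detail_2.png"]}
mapping = build_mapping(initial_mapping, fake_ocr)

# Flattened view: commodity attribute number -> all of its phrases.
sku_to_phrases = {
    sku: [phrase for phrases in pictures.values() for phrase in phrases]
    for sku, pictures in mapping.items()
}
```

Keeping the picture level in the mapping makes it possible to trace a mined attribute back to the picture it came from.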
Further, the product attribute mining device further includes:
the sample set determining unit is used for, based on a preset triplet loss function, gathering the sentence vectors of similar phrases close together in space and keeping the sentence vectors of dissimilar phrases far apart in space, and for adding pseudo labels, according to spatial distance, to the sample phrases that have no labeled attribute data, so as to determine a training sample set for the pre-training network model;
and the model training unit is used for training the pre-training network model based on the training sample set.
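The two ideas handled by the sample set determining unit can be sketched as below. The triplet loss is written in its common margin form, max(0, d(a, p) - d(a, n) + margin); the pseudo-labelling rule (copy the label of the nearest labeled vector if it lies within a maximum distance) is an assumption, since the disclosure only says that the labels are added according to spatial distance. All names and numeric values are illustrative.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 margin: float = 1.0) -> float:
    """Zero once the negative is at least `margin` farther from the anchor
    than the positive is; positive otherwise."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def pseudo_label(vector: Vector, labeled: List[Tuple[str, Vector]],
                 max_distance: float) -> Optional[str]:
    """Assumed rule: take the label of the nearest labeled vector,
    but only if it is within max_distance."""
    label, nearest = min(labeled, key=lambda item: euclidean(vector, item[1]))
    return label if euclidean(vector, nearest) <= max_distance else None


# Well-separated triplet: the similar phrase is near, the dissimilar one far.
easy_loss = triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))
# Badly placed triplet: the similar phrase is farther than the dissimilar one.
hard_loss = triplet_loss((0.0, 0.0), (3.0, 4.0), (0.0, 1.0))

labeled_vectors = [("vibration", (1.0, 0.0)), ("battery", (0.0, 1.0))]
near_label = pseudo_label((0.9, 0.1), labeled_vectors, max_distance=0.5)
far_label = pseudo_label((0.5, 0.5), labeled_vectors, max_distance=0.5)
```

During training the loss would be minimized over many triplets with a deep learning framework; the pure-Python version here only shows the quantity being optimized.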
Further, the commodity attribute number corresponds to at least one commodity detail page picture, and the commodity detail page picture corresponds to at least one phrase describing commodity attributes.
Further, the commodity attribute number is inventory management information of the commodity, and the inventory management information refers to a numerical code or an alphabetical code for uniquely identifying the commodity.
With the commodity attribute mining device of the embodiments of the present disclosure, the mapping relation between commodity attribute numbers and the phrases describing commodity attributes in the commodity detail page pictures is determined, and the phrases are input into a preset vectorization model to obtain the sentence vectors that the model outputs to represent the semantics of the phrases; the sentence vectors are clustered, and the commodity attribute features corresponding to the commodity attribute numbers are determined based on the commodity public attribute features corresponding to the clusters of the sentence vectors and the mapping relation. The vectorization model is obtained by transfer learning on a pre-trained network model, using a phrase set with labeled attribute data for the target commodity category. Because the vectorization model is built by transfer learning, the labeled phrase information is fully used and the model is better suited to the current commodity category; unstructured data is analyzed effectively, the hidden attribute features of commodities are fully mined, and the efficiency of commodity attribute mining is improved.
Corresponding to the commodity attribute mining method provided above, the present disclosure also provides an electronic device. Since the embodiment of the electronic device is similar to the method embodiment described above, it is described relatively simply; for related details, refer to the description of the method embodiment. The electronic device described below is merely illustrative. Fig. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the disclosure. The electronic device may include: a processor 401, a memory 402, and a communication bus 403, wherein the processor 401 and the memory 402 communicate with each other through the communication bus 403 and communicate with the outside through a communication interface 404. The processor 401 may invoke logic instructions in the memory 402 to perform a commodity attribute mining method comprising: determining a mapping relationship between the commodity attribute number and the phrase describing the commodity attribute, wherein the phrase is in the commodity detail page picture; inputting the phrases into a preset vectorization model to obtain sentence vectors which are output by the vectorization model and represent the semantics of the phrases, wherein the vectorization model is obtained by training, on the basis of a pre-training network model, with a phrase set of labeled attribute data corresponding to the target commodity category; and determining the commodity attribute features corresponding to the commodity attribute numbers based on the commodity public attribute features corresponding to the clusters of the sentence vectors and the mapping relation.
Further, the logic instructions in the memory 402 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer readable storage medium. Based on this understanding, the technical solution of the present disclosure, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes: a memory chip, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In another aspect, the present disclosure also provides a computer program product including a computer program stored on a processor-readable storage medium, the computer program including program instructions which, when executed by a computer, are capable of performing the commodity attribute mining method provided by the above-described method embodiments. The method comprises the following steps: determining a mapping relationship between the commodity attribute number and the phrase describing the commodity attribute; wherein the phrase is in the commodity detail page picture; inputting the phrases into a preset vectorization model to obtain sentence vectors which are output by the vectorization model and represent the semantics of the phrases; the vectorization model is obtained by training a phrase set of marked attribute data corresponding to the target commodity class on the basis of a pre-training network model; and determining commodity attribute characteristics corresponding to the commodity attribute numbers based on commodity public attribute characteristics corresponding to the clustering clusters of the sentence vectors and the mapping relation.
In yet another aspect, the embodiments of the present disclosure further provide a processor-readable storage medium having stored thereon a computer program that, when executed by a processor, is implemented to perform the commodity attribute mining method provided by the above embodiments. The method comprises the following steps: determining a mapping relationship between the commodity attribute number and the phrase describing the commodity attribute; wherein the phrase is in the commodity detail page picture; inputting the phrases into a preset vectorization model to obtain sentence vectors which are output by the vectorization model and represent the semantics of the phrases; the vectorization model is obtained by training a phrase set of marked attribute data corresponding to the target commodity class on the basis of a pre-training network model; and determining commodity attribute characteristics corresponding to the commodity attribute numbers based on commodity public attribute characteristics corresponding to the clustering clusters of the sentence vectors and the mapping relation.
The processor-readable storage medium may be any available medium or data storage device that can be accessed by a processor, including, but not limited to, magnetic storage (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical storage (e.g., CD, DVD, BD, HVD, etc.), semiconductor storage (e.g., ROM, EPROM, EEPROM, nonvolatile storage (NAND FLASH), solid state disk (SSD)), and the like.
The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are merely for illustrating the technical solution of the present disclosure, and are not limiting thereof; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present disclosure.

Claims (10)

1. A commodity attribute mining method, comprising:
determining a mapping relationship between the commodity attribute number and the phrase describing the commodity attribute; wherein the phrase is in the commodity detail page picture;
inputting the phrases into a preset vectorization model to obtain sentence vectors which are output by the vectorization model and represent the semantics of the phrases;
the vectorization model is obtained by training a phrase set of marked attribute data corresponding to the target commodity class on the basis of a pre-training network model;
determining, based on the commodity public attribute features corresponding to the clusters of the sentence vectors and the mapping relation, the commodity attribute features corresponding to the commodity attribute numbers, which specifically comprises: clustering the sentence vectors to obtain corresponding clusters; and extracting the commodity attribute feature common to the phrases in the cluster, and taking that common feature as the commodity public attribute feature corresponding to the cluster.
2. The commodity attribute mining method according to claim 1, wherein determining commodity attribute features corresponding to the commodity attribute numbers based on commodity common attribute features corresponding to clusters of the sentence vectors and the mapping relationship, further comprises:
and matching the commodity public attribute features corresponding to the cluster with the mapping relation to obtain commodity attribute features of the commodity attribute numbers.
3. The commodity attribute mining method according to claim 2, wherein the clustering process is performed on the sentence vectors to obtain corresponding clusters, and specifically includes:
based on the semantics of the phrases corresponding to the sentence vectors, clustering the sentence vectors with a K-means clustering model to obtain clusters each comprising a plurality of sentence vectors; the semantics of the phrases corresponding to the sentence vectors in a cluster satisfy a preset semantic similarity condition.
4. The commodity attribute mining method according to claim 1, wherein the determining a mapping relationship between commodity attribute numbers and phrases describing commodity attributes specifically includes:
mapping the commodity detail page picture to the commodity attribute number to obtain an initial mapping relation between the commodity attribute number and the commodity detail page picture;
identifying the text information of the commodity detail page picture by utilizing an optical character identification mode to obtain a phrase which corresponds to the commodity detail page picture and describes commodity attributes;
and determining the mapping relation between the commodity attribute number and the phrases according to the initial mapping relation and the phrases that correspond to the commodity detail page picture and describe commodity attributes.
5. The commodity attribute mining method according to claim 1, further comprising: based on a preset triplet loss function, gathering the sentence vectors of similar phrases close together in space and keeping the sentence vectors of dissimilar phrases far apart in space, and adding pseudo labels, according to spatial distance, to the sample phrases that have no labeled attribute data, so as to determine a training sample set for the pre-training network model;
training the pre-training network model based on the training sample set.
6. The commodity attribute mining method according to claim 1, further comprising: training the pre-training network model by transfer learning based on the phrase set of labeled attribute data to obtain the vectorization model.
7. The commodity attribute mining method according to any one of claims 1 to 6, wherein the commodity attribute number is inventory management information of a commodity, the inventory management information being a numerical code or an alphabetical code for uniquely identifying the commodity.
8. A commodity attribute mining apparatus, comprising:
a mapping relation determining unit for determining a mapping relation between the commodity attribute number and the phrase describing the commodity attribute; wherein the phrase is in the commodity detail page picture;
the vectorization processing unit is used for inputting the phrases into a preset vectorization model to obtain sentence vectors which are output by the vectorization model and represent the semantics of the phrases;
the vectorization model is obtained by training a phrase set of marked attribute data corresponding to the target commodity class on the basis of a pre-training network model;
the commodity attribute feature determining unit is configured to determine the commodity attribute features corresponding to the commodity attribute numbers based on the commodity public attribute features corresponding to the clusters of the sentence vectors and the mapping relation, and is specifically configured to: cluster the sentence vectors to obtain corresponding clusters; and extract the commodity attribute feature common to the phrases in the cluster, taking that common feature as the commodity public attribute feature corresponding to the cluster.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the commodity attribute mining method according to any one of claims 1 to 7 when the program is executed by the processor.
10. A processor readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the commodity attribute mining method according to any one of claims 1 to 7.
CN202111076600.2A 2021-09-14 2021-09-14 Commodity attribute mining method and device Active CN113724055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111076600.2A CN113724055B (en) 2021-09-14 2021-09-14 Commodity attribute mining method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111076600.2A CN113724055B (en) 2021-09-14 2021-09-14 Commodity attribute mining method and device

Publications (2)

Publication Number Publication Date
CN113724055A CN113724055A (en) 2021-11-30
CN113724055B true CN113724055B (en) 2024-04-09

Family

ID=78683691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111076600.2A Active CN113724055B (en) 2021-09-14 2021-09-14 Commodity attribute mining method and device

Country Status (1)

Country Link
CN (1) CN113724055B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114169966B (en) * 2021-12-08 2022-08-05 海南港航控股有限公司 Method and system for extracting unit data of goods by tensor

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011227886A (en) * 2010-03-30 2011-11-10 Rakuten Inc Commodity information providing system, commodity information providing method, and program
CN106067132A (en) * 2016-05-27 2016-11-02 乐视控股(北京)有限公司 The method to set up of item property and device
CN106408321A (en) * 2015-07-31 2017-02-15 华为技术有限公司 Management method and device of commodity template, and method and device for calling database, and system
CN107203548A (en) * 2016-03-17 2017-09-26 阿里巴巴集团控股有限公司 Attribute acquisition methods and device
CN107679247A (en) * 2017-10-31 2018-02-09 南威软件股份有限公司 A kind of method that electric business website realizes self-defined maintenance items extension information
CN109670066A (en) * 2018-12-11 2019-04-23 江西师范大学 A kind of Freehandhand-drawing formula toggery image search method based on dual path Deep Semantics network
CN110347835A (en) * 2019-07-11 2019-10-18 招商局金融科技有限公司 Text Clustering Method, electronic device and storage medium
CN110490682A (en) * 2018-05-15 2019-11-22 北京京东尚科信息技术有限公司 The method and apparatus for analyzing item property
CN111401409A (en) * 2020-02-28 2020-07-10 创新奇智(青岛)科技有限公司 Commodity brand feature acquisition method, sales volume prediction method, device and electronic equipment
CN113065882A (en) * 2020-01-02 2021-07-02 阿里巴巴集团控股有限公司 Commodity processing method and device and electronic equipment

Also Published As

Publication number Publication date
CN113724055A (en) 2021-11-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant